Moving Sum


Christos 22 years ago

Hi to all,

The sum or average of a certain number of samples (for example, the last 100 values received) has to be checked constantly against a threshold.

I thought of implementing this by keeping a "Moving Sum", which works by adding the new value and subtracting the oldest. I think that can be implemented by adding the value just arriving to a register and subtracting the value coming out of a 100-word-deep shift register.
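A minimal Python sketch of the add-newest, subtract-oldest idea described above. The class name `MovingSum` and the list standing in for the shift register are illustrative assumptions, not part of the original design.

```python
class MovingSum:
    """Running sum of the last `window` samples (add newest, subtract oldest)."""

    def __init__(self, window):
        self.window = window
        self.buf = [0] * window   # stands in for the shift register, preset to zero
        self.pos = 0              # slot that currently holds the oldest sample
        self.total = 0

    def update(self, sample):
        oldest = self.buf[self.pos]
        self.buf[self.pos] = sample              # newest overwrites oldest
        self.pos = (self.pos + 1) % self.window
        self.total += sample - oldest            # one add, one subtract per sample
        return self.total


ms = MovingSum(window=4)
sums = [ms.update(x) for x in [1, 2, 3, 4, 5, 6]]
print(sums)  # [1, 3, 6, 10, 14, 18]
```

The work per sample is constant regardless of the window length; only the storage grows with the window.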

Now, if a longer sum has to be checked, there is a memory problem because a lot of values have to be stored. In addition, more than one "Moving Sum" is needed, so with the above implementation I would also have to store the same data more than once (for example, the 1000-word shift register will include the data of the 100-word shift register).

Any idea of how this could be implemented?

The final system will have to keep 10 moving sums, with the largest being 250,000 (8-bit) values, for each of the 16 independent input channels.

Help with the design problem will be appreciated and acknowledged.

Christos

__________________________________________________

Christos Zamantzas CERN, European Organization for Nuclear Research Div. AB/BDI/BL tel: +41 22 767 3409 CH-1211 Geneva 23 fax: +41 22 767 9560 Switzerland snipped-for-privacy@cern.ch __________________________________________________


Nicholas C. Weaver 22 years ago

Can you cheat? That is, instead of having a 250,000 deep moving sum, have it be 250,000 deep but only at intervals of every 1000 samples?

The other option: once you have to go off-chip for the FIFO memory, the size doesn't matter much, because you can easily just throw ~1GB of DRAM on the other side.

Nicholas C. Weaver nweaver@cs.berkeley.edu


Ray Andraka 22 years ago

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com


"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759


Josh Model 22 years ago

Another option, if you're using Xilinx parts, is to take advantage of the SRL16s. In V2P parts (and I think V2) there are 2 per slice. With 64 of these cascaded together, you've got a 1-bit-wide, 1024-deep delay line for a moving sum (shift the sum down to get the average). That's 32 slices, or 8 CLBs, per bit of width, depending on the level of abstraction you like to think about.

--Josh


Josh Model 22 years ago

Whoops, this probably won't work in your case. I didn't read that last paragraph. But the SRLs are still good for less gigantic moving sums.


Jack Stone 22 years ago

Do you require that each of the previously received values be weighted equally in the average calculation? If you can assume that recent samples are more important than those received a long time ago, then you can calculate an "average" with an exponentially weighted moving average filter. In other words, use a 1st-order LPF.

a_k = (1/(n+1))*s_k + (n/(n+1))*a_(k-1)

where: s_k = sample input at instant k, n = number of samples in the moving-average window, a_k = average at instant k, and a_(k-1) = average at instant k-1.

As you can see, this is quite easy to implement, requiring two multiplies, one addition, and one register for a_(k-1) storage. If you choose n+1 to be a power of two, then one of the multiplications (or I guess it is a divide) becomes a simple shift operation.
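As a rough illustration of the recurrence above, here is a floating-point Python sketch. The function name `ewma` and the `initial` parameter are assumptions made for this example; a hardware version would use fixed-point arithmetic.

```python
def ewma(samples, n, initial=0.0):
    """Exponentially weighted moving average:
    a_k = s_k / (n + 1) + a_(k-1) * n / (n + 1)
    """
    a = initial
    out = []
    for s in samples:
        a = s / (n + 1) + a * n / (n + 1)
        out.append(a)
    return out


# Step input: the output approaches the input level but weights recent
# samples more heavily than old ones, unlike a true moving average.
trace = ewma([8.0] * 200, n=7)
print(trace[0], round(trace[-1], 6))
```

Only the previous average is stored, which is why no sample memory is needed, at the cost of unequal weighting across the window.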

-Jack Stone


Gregory C. Read 22 years ago

Initializing the whole mess must also be considered. While the starting sum can be zero, all the values in memory must be preset to zero and the comparison to a threshold must be declared invalid until 'n' new values have been accumulated.
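One way to express this start-up rule, as a hedged Python sketch: the buffer is preset to zero and the threshold comparison is flagged invalid until `window` samples have arrived. The class name `GatedThreshold` and the returned `(valid, tripped)` pair are assumptions for illustration only.

```python
class GatedThreshold:
    """Moving-sum threshold check that is held invalid until the window fills."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self.buf = [0] * window   # memory preset to zero
        self.pos = 0
        self.total = 0            # starting sum is zero
        self.count = 0            # samples seen since reset, saturates at window

    def update(self, sample):
        self.total += sample - self.buf[self.pos]
        self.buf[self.pos] = sample
        self.pos = (self.pos + 1) % self.window
        self.count = min(self.count + 1, self.window)
        valid = self.count == self.window
        tripped = valid and self.total > self.threshold
        return valid, tripped


g = GatedThreshold(window=3, threshold=5)
results = [g.update(x) for x in [10, 0, 0, 0, 3, 3]]
print(results)
```

In this run the first two results are reported as invalid even though the partial sum already exceeds the threshold; whether that is the desired behaviour for a protection system is a design decision the sketch does not settle.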

-- Greg snipped-for-privacy@hotmail.com.invalid (Remove the '.invalid' twice to send Email)


Christos 22 years ago

I have also thought of this, but the idea was rejected because it increases the total system error. To form an interval you have to wait until all of its samples have been received before you add the interval to the sum. The sum is therefore updated more slowly, which increases the system error.

For a similar reason it is not possible to use a low-pass filter. I guess that to have an average of the 250,000 values you need as many taps.

A synchronous SRAM (probably 2Mx36b) will be available on the board. By my calculations it should be enough.


I have never heard of the CIC filter; I will investigate.


The FPGA will be a Stratix with external SRAM, and the sample rate is slow enough: 25 kHz (one acquisition every 40 us). The data are already received multiplexed, but I don't get why it is better to store with larger width than depth, probably because I have no idea how to implement the Stratix-SRAM communication. Is there any app note or book?

Thanks a lot to all, Christos


Hal Murray 22 years ago

[exponential weighted averaging]

The other multiply/divide turns into a shift and subtract.
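A sketch of that shift-and-subtract form in integer Python, assuming n + 1 = 2**m. Keeping the accumulator scaled by 2**m is one common way to retain some fractional precision; the function name `ewma_shift` and that scaling choice are assumptions for this example, not something stated in the thread.

```python
def ewma_shift(samples, m):
    """EWMA with n + 1 = 2**m using only add, subtract and shift.

    acc holds roughly 2**m times the average:
        acc_k = acc_(k-1) - (acc_(k-1) >> m) + s_k
    which is the recurrence a_k = s_k/(n+1) + a_(k-1)*n/(n+1) scaled by 2**m.
    """
    acc = 0
    out = []
    for s in samples:
        acc = acc - (acc >> m) + s   # n/(n+1) * acc becomes acc minus shifted acc
        out.append(acc >> m)         # scale back down to get the average
    return out


avg = ewma_shift([100] * 200, m=3)   # n + 1 = 8
print(avg[0], avg[-1])  # 12 100
```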

The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.


Christos 22 years ago

The exponentially weighted averaging cannot be used because all data in the window have to be treated equally, as they all have the same importance. If you use an LPF and n is the number of samples in the window, then to get an average of the last 100 values received your filter has to be 100 taps long. Correct? I am asking just in case I have not fully understood how digital filters are implemented. The truth is that I have never done it, just read about it.

Christos


Kolja Sulimma 22 years ago


This is easy to do, even without the CIC filters suggested by Ray Andraka. In external memory you keep a circular buffer of 16x250000 samples. You keep your 160 sums inside your FPGA. To update them you do the following:

For each channel:
    input the new sample X
    write the new sample to its external RAM location in a circular buffer
    for each moving sum of this channel:
        read the value Y that "falls off" the sum from external RAM
        add X-Y to the moving sum inside the FPGA

This requires 16 writes and 160 reads to external memory per sample period, for a resulting bandwidth of 4,400,000 memory accesses per second. If the values are stored in memory with the right alignment you can do 4 accesses in parallel, reducing the bandwidth to 1,100,000 accesses per second.
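The procedure and the bandwidth figures above can be sketched in Python for a single channel. The class name `MultiWindowSums` is hypothetical, the list stands in for the external RAM, and the sketch reads the departing sample before writing the new one so that the longest window can share the same slot.

```python
class MultiWindowSums:
    """Several moving sums of one channel over one shared circular buffer."""

    def __init__(self, windows):
        self.windows = list(windows)
        self.depth = max(self.windows)         # buffer sized for the longest window
        self.buf = [0] * self.depth            # stands in for the external RAM
        self.wr = 0                            # next write address
        self.sums = [0] * len(self.windows)    # kept "inside the FPGA"

    def update(self, x):
        for i, w in enumerate(self.windows):
            # One read per sum: the sample written w samples ago falls off.
            y = self.buf[(self.wr - w) % self.depth]
            self.sums[i] += x - y
        self.buf[self.wr] = x                  # one write per channel
        self.wr = (self.wr + 1) % self.depth
        return list(self.sums)


m = MultiWindowSums([2, 4])
out = [m.update(x) for x in [1, 2, 3, 4, 5]]
print(out[-1])  # [9, 14]

# Bandwidth check using the figures from the thread.
CHANNELS, SUMS_PER_CHANNEL, SAMPLE_RATE_HZ = 16, 10, 25_000
accesses_per_second = (CHANNELS + CHANNELS * SUMS_PER_CHANNEL) * SAMPLE_RATE_HZ
print(accesses_per_second, accesses_per_second // 4)  # 4400000 1100000
```

The data is stored once no matter how many window lengths are tracked; each extra window costs one more read per sample and one more accumulator.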

Maybe you should instantiate a processor in your FPGA and use that to implement this.

OT: What are you doing at LHC that has a sample rate of 25 kHz?

Kolja Sulimma Frankfurt


Tom Seim 22 years ago

The bottom line is if you want a moving window averager you are going to have to have a memory that will hold the entire window of points. If your FPGA doesn't have enough memory (and it won't for 250,000x10 data) then you will need external memory. Otherwise your question is like how do you put a gallon of water into a pint sized glass.


Jack Stone 22 years ago

Yes and no. The benefit of using the exponentially weighted averaging is that the filter is a _single_ tap, not n taps long. It is an IIR filter structure. If you wanted to average with an FIR filter structure then, yes, you would need n taps.


Christos 22 years ago

For the moment I have used a process similar to the one you describe. The circular buffer is a shift register, and the result of X-Y is fed to an accumulator using signed numbers. I think it works very well (the clock period has to be at least 20 ns, otherwise it goes unstable with setup violations, but this is not a problem as the real clock will be much slower). On the other hand, this is going to be used only for up to ~10 ms of data (250 values).

The processes that go up to 100 s will take an average of 8 values and store that value to the external memory. In that way the data are reduced by a factor of 8 and the system error is negligible. The SRAM that was found can be used in a 1M x 72-bit architecture, so 8 accesses in parallel, times two, in 40 us seems to be more than OK.

One problem now is how to implement this! My experience does not go that far! You said something about instantiating a processor (I guess something like NIOS); are you sure that this will not complicate things more? Is there something ready-made to implement a circular buffer in the external RAM?
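The reduce-by-8 scheme for the long windows might look like the following Python sketch. It stores the sum of each block rather than its average, which is an assumption made here to avoid rounding; the class name `DecimatedMovingSum` and its parameters are likewise illustrative.

```python
class DecimatedMovingSum:
    """Long moving sum kept over blocks of samples instead of single samples."""

    def __init__(self, block, blocks_in_window):
        self.block = block
        self.acc = 0                          # sum of the block being collected
        self.n = 0                            # samples collected in this block
        self.buf = [0] * blocks_in_window     # stands in for the external RAM
        self.pos = 0
        self.total = 0

    def update(self, sample):
        self.acc += sample
        self.n += 1
        if self.n == self.block:              # block complete: one read, one write
            self.total += self.acc - self.buf[self.pos]
            self.buf[self.pos] = self.acc
            self.pos = (self.pos + 1) % len(self.buf)
            self.acc = 0
            self.n = 0
        return self.total                     # changes only once per block


d = DecimatedMovingSum(block=2, blocks_in_window=2)
out = [d.update(x) for x in [1, 2, 3, 4, 5, 6]]
print(out)  # [0, 3, 3, 10, 10, 18]
```

The stored word count drops by the block length, and the long sum lags by at most one block, which is the update delay discussed earlier in the thread.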

The second problem, and the reason why I asked for help in this group, is that those SRAMs are quite expensive, and bearing in mind that 2000 of them will be needed, they increase the cost significantly. So they are pressing me to find some other way to implement it. (Usual stuff: we want the pie and the dog fed!) Ray has given me the idea of the CIC (he will be acknowledged for that in my thesis, as will everyone else who took the time to answer), but I still haven't figured out how it works! Soon, I hope, so that I can work out whether I can use it.

And to answer your question about LHC: this system is for machine protection and it is called the Beam Loss Monitor. The superconducting magnets have to be prevented from quenching due to the particle showers hitting them as some particles are lost from the trajectory. Inside the tunnel 3600 ionisation chambers are installed, and they give a current whose amplitude is proportional to the particle rate passing through them. This current is fed to a CFC (Current to Frequency Converter) and a counter measures the frequency. The counter data, as well as some status, CRC, etc. from 16 chambers, are sent through an optical link to the surface for processing.

And here is where I come in: I have to design the threshold comparator for this. Sampling every 40 us is enough, as it is ~half an LHC cycle, and maybe it will be increased to 89 us (~11 kHz), which is one cycle. Just imagine what my problems would be if 40 MHz were used and I had to go up to 100 s of data!!

Christos


Tom Seim 22 years ago

I don't understand why you need 2000 SRAMs.

25 kHz x 16 channels x 100 s = 40 MB (at one byte per sample)

or 8 SRAMs.


Christos 22 years ago


In the first paragraph I explained that by saving the average of every 8 samples the data are reduced by a factor of 8. So 5 MB have to be stored for this system (up to 100 s), which fits together with the data of the first system (up to 10 ms) in one SRAM. On the card there are 2 more to hold other data. And 650 of these cards are needed, which gives ~2000 SRAMs.
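The storage figures quoted in this exchange can be checked with a few lines of Python. The constant names are made up for the example, and it assumes each stored sample or averaged value occupies one byte, as implied by the 8-bit values mentioned at the start of the thread.

```python
SAMPLE_RATE_HZ = 25_000
CHANNELS = 16
LONGEST_WINDOW_S = 100
BYTES_PER_SAMPLE = 1      # assumption: 8-bit values, one byte each
AVERAGING_FACTOR = 8      # one stored value per 8 samples
CARDS = 650
SRAMS_PER_CARD = 3        # one for the moving sums plus 2 for other data

raw_bytes = SAMPLE_RATE_HZ * CHANNELS * LONGEST_WINDOW_S * BYTES_PER_SAMPLE
stored_bytes = raw_bytes // AVERAGING_FACTOR
total_srams = CARDS * SRAMS_PER_CARD

print(raw_bytes, stored_bytes, total_srams)  # 40000000 5000000 1950
```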

Sorry for not being that clear, but the mail was already too long and I didn't want to kill you with boredom completely!

Christos


Tom Seim 22 years ago

I guess that you have a lot more channels than I realized.

I recommend going to a mass storage device that can hold this amount of data, such as multiple disk drives. Your data rates are relatively slow, giving you the option of streaming interleaved data (don't store a single channel in one place on the disk). You can also look into large DRAMs. Another option is flash memory, but you might wear these devices out.

Tom


Theron Hicks 22 years ago

Christos, I have read all the previous responses and I have one suggestion. Could you use a DRAM-based memory? On your first pass, generate a running sum of your 250,000 samples. After that, add the new value and subtract the old value. That way you need not redo the whole sum for each sample. Use a simple micro to control the system. It could be as simple as a PicoBlaze. You could even control it with a simple state machine. You really do not need anything as fast as an SRAM. You can refresh the DRAM between the sample periods.

Theron Hicks


Ray Andraka 22 years ago

Regardless of which method is used, using a CIC limits the number of memory transactions per sample to just two: a read and a write. The memory is used as a delay queue, so the read pointer is N samples behind the write pointer. The memory required for all those channels is pretty big, so DRAM would be the way to go if you are using semiconductor memories. Since the addressing can easily be made linear, you can simplify it by using page-mode or burst accesses. This should make it fast enough to multiplex many channels into one memory.
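The earlier post that introduced the CIC is not preserved in this thread, so the following is only a textbook single-stage CIC decimator sketched in Python, not necessarily the exact structure that was proposed. The names `CicMovingSum`, `R` (decimation ratio), `M` (delay in decimated samples) and `bits` are assumptions for illustration.

```python
class CicMovingSum:
    """Single-stage CIC decimator: integrate, keep every R-th value, then comb.

    Each output is the sum of the last R * M inputs, provided that sum fits
    in `bits` bits. Only M integrator words are stored.
    """

    def __init__(self, R, M, bits=32):
        self.R = R
        self.mask = (1 << bits) - 1   # integrator wraps like a hardware register
        self.integ = 0
        self.phase = 0
        self.delay = [0] * M          # delay queue: read pointer M behind write
        self.ptr = 0
        self.out = 0

    def update(self, x):
        self.integ = (self.integ + x) & self.mask
        self.phase += 1
        if self.phase == self.R:
            self.phase = 0
            old = self.delay[self.ptr]          # one read ...
            self.delay[self.ptr] = self.integ   # ... and one write per output
            self.ptr = (self.ptr + 1) % len(self.delay)
            self.out = (self.integ - old) & self.mask   # wraparound cancels
        return self.out


c = CicMovingSum(R=2, M=2, bits=5)
out = [c.update(x) for x in range(1, 9)]
print(out)  # [0, 3, 3, 10, 10, 18, 18, 26]
```

In this run the 5-bit integrator overflows on the last sample, yet the final output is still 5 + 6 + 7 + 8 = 26, because the modular subtraction cancels the wrap. With R = 1 the structure reduces to the plain subtract-and-accumulate moving sum; any memory saving comes from R > 1, at the price of one update per R samples.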

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com


"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759


Christos 22 years ago

Hi Ray, forgive me that I still haven't found any time to read about the CIC filter, but from the way you describe its operation, it does not require fewer read and write operations than the simple subtract-and-accumulate implementation that I am testing at the moment. I still need the same read pointer and the same memory. My point is: is there an advantage to the CIC that I don't see?

Christos
