Moving Sum

Hi to all,

The sum or average of a certain number of samples (e.g. the last 100
values received) has to be checked constantly against a threshold.

I thought of implementing this by keeping a "Moving Sum", which works by
adding the new value and subtracting the oldest. I think that can be
implemented by adding the value just arriving to a register and subtracting
the value coming out of a 100-word-deep shift register.
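
A minimal software sketch of the add-newest, subtract-oldest idea described above. The class name `MovingSum` and its methods are made up for illustration; the deque stands in for the shift register and `total` for the accumulator register.

```python
from collections import deque


class MovingSum:
    """Running sum of the last `depth` samples (hypothetical helper)."""

    def __init__(self, depth):
        # Models the shift register, preset to zero.
        self.delay = deque([0] * depth, maxlen=depth)
        # Models the accumulator register.
        self.total = 0

    def update(self, sample):
        oldest = self.delay[0]       # value about to leave the shift register
        self.delay.append(sample)    # shifting in the new value drops the oldest
        self.total += sample - oldest
        return self.total


ms = MovingSum(3)
results = [ms.update(x) for x in [1, 2, 3, 4, 5]]
```

Each update costs one addition and one subtraction regardless of the window length; only the storage grows with the depth.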

Now, if a longer sum has to be checked, then there is a memory problem
because a lot of values have to be stored. In addition, more than one "Moving
Sum" is needed, so if I use the above implementation I will also have
to store the same data more than once (e.g. the 1000-word shift register
will include the 100-word shift register's data).

Any idea of how this could be implemented?

The final system will have to keep 10 moving sums with the largest being
250,000 (8-bit) values for each of the 16 independent input channels.

Help with the design problem will be appreciated and acknowledged.



Christos Zamantzas
CERN, European Organization for Nuclear Research
Div. AB/BDI/BL                         tel: +41 22 767 3409
CH-1211 Geneva 23                  fax: +41 22 767 9560

Re: Moving Sum

Can you cheat?  That is, instead of having a 250,000 deep moving sum,
have it be 250,000 deep but only at intervals of every 1000 samples?  

The other option: once you have to go off-chip for memory for the
FIFOs, the size doesn't matter much because you can easily just throw
~1GB of DRAM on the other side.
Nicholas C. Weaver                       

Re: Moving Sum
Another option, if you're using Xilinx parts, is to take advantage of those
SRL16s.  In V2P parts (and I think V2) there are 2 per slice.  With 64 of
these cascaded together, you've got a 1-bit-wide, 1024-deep delay line for the
moving sum (barrel shift down for the average).  That's 32 slices, or 8 CLBs,
per bit of width, depending on the level of abstraction you like to think about.



Re: Moving Sum
Whoops, probably won't work in your case-- I didn't read that last
paragraph.  But the SRL's are still good for less gigantic moving sums.


Re: Moving Sum


I have also thought of this, but the idea was rejected as it increases the
total system error.
In order to form an interval you have to wait to receive all of its samples
before you add the interval to the sum. Thus the sum is updated more slowly,
which increases the system error.

For a similar reason it is not possible to use a low-pass filter. I guess that
to have an average of the 250,000 values you need as many taps.


A synchronous SRAM (probably 2M x 36 bit) will be available on the board. I
have done the calculations and I think that it is enough.


Re: Moving Sum
Look for "CIC filter".  CIC is a Cascaded Integrator-Comb filter.  It is a
recursive implementation of a moving sum.   In your case, it sounds like you are
sampling the output once for every input sample, so you don't get the benefit of
decimation (if you could decimate, then the delay queue is shortened by a ratio
equal to the decimation ratio).  The CIC consists of an integrator, a subtractor
and a delay queue.  For a moving sum, you are stuck with the storage and the key
is to minimize the number of transactions you need to do with the storage per
sample.  In the case of the CIC, you need to do one read and one write per
sample.  For the depth you are looking at, you'll need to use off chip memory
for the storage (you might fit it into the bulk storage on an Altera Stratix).
You did not mention the sample rate.  If the data rate is sufficiently low, you
can time multiplex the data in/out of the external memory so that you can trade
memory width for depth, which might get you a lower parts count.
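
A rough software model of the integrator / delay queue / subtractor arrangement described above, without decimation. This is only one possible ordering (integrator first, comb second), and the function name and the `acc_bits` parameter are assumptions for illustration. It shows that the integrator may be allowed to wrap, provided the accumulator is wide enough to hold the largest possible window sum. Note that in this ordering the delay queue holds integrator values, which are wider than the raw samples; putting the subtractor first stores raw samples instead.

```python
def cic_moving_sum(samples, depth, acc_bits=32):
    """Moving sum via integrator + delay queue + subtractor (sketch)."""
    mask = (1 << acc_bits) - 1
    delay = [0] * depth          # delay queue, preset to zero
    ptr = 0
    integ = 0
    out = []
    for s in samples:
        integ = (integ + s) & mask            # integrator, free to wrap
        delayed = delay[ptr]                  # one read per sample
        delay[ptr] = integ                    # one write per sample
        ptr = (ptr + 1) % depth
        out.append((integ - delayed) & mask)  # comb (subtractor), modulo 2**acc_bits
    return out


samples = [255, 200, 7, 0, 255, 255, 255, 255, 1]
# 4 samples of at most 255 sum to at most 1020, which fits in 10 bits.
wrapped = cic_moving_sum(samples, depth=4, acc_bits=10)
direct = [sum(samples[max(0, i - 3):i + 1]) for i in range(len(samples))]
```

Here `wrapped` and `direct` agree even though the 10-bit integrator overflows several times during the run.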


--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Re: Moving Sum

Do you require that each of the previously received values is
considered equally in the average calculation? If you can assume that
the current samples are more important than those received a long time
ago then you can calculate an "average" via an exponentially weighted
moving average filter. Or in other words, use a 1st-order LPF.

 a_k = (1/(n+1))*s_k + (n/(n+1))*a_k-1

 s_k   = sample input at instant k,
 n     = number of samples in moving-average window,
 a_k   = average at instant k, and
 a_k-1 = average at instant k-1

As you can see this is quite easy to implement, requiring two
multiplies, one addition, and one register for a_k-1 storage. If you
choose n+1 to be a power of two then one of the multiplications (or I
guess it is a divide) becomes a simple shift operation.
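
A small integer sketch of the filter above for the case n+1 = 2**k. Rearranging the formula gives a_k = a_k-1 + (s_k - a_k-1)/(n+1), so both multiplies can be replaced by a shift and a subtract. The function name and the choice to keep the accumulator scaled by 2**k (so the fractional bits are not lost) are assumptions made here, not something stated in the post.

```python
def ewma_shift(samples, k):
    """Exponentially weighted average with n + 1 = 2**k, integer only."""
    acc = 0            # holds 2**k times the running average
    out = []
    for s in samples:
        acc += s - (acc >> k)   # shift and subtract, then add the new sample
        out.append(acc >> k)    # scale back down to get the average
    return out


step_response = ewma_shift([100] * 200, k=3)
```

With a constant input of 100 the output starts at 12 and settles at 100; a single register holds all of the filter state.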

-Jack Stone

Re: Moving Sum
[exponential weighted averaging]


The other multiply/divide turns into a shift and subtract.

Re: Moving Sum
The exponentially weighted averaging cannot be used because all data in the
window have to be treated equally, as all have the same importance.
If using an LPF, and n is the number of samples in the window, then if you want
to have an average of the last 100 values received your filter has to
be 100 taps long.
Correct? I am asking just in case I have not fully understood how digital
filters are implemented. The truth is that I have never done it, just read
about it.



Re: Moving Sum
The bottom line is if you want a moving window averager you are going
to have to have a memory that will hold the entire window of points.
If your FPGA doesn't have enough memory (and it won't for 250,000x10
data) then you will need external memory. Otherwise your question is
like how do you put a gallon of water into a pint sized glass.

Re: Moving Sum

Yes and no. The benefit of using the exponentially weighted averaging is
that the filter is a _single_ tap, not n taps long. It is an
IIR-filter structure. If you wanted to average with a FIR-filter
structure then, yes, you would need n taps.

Re: Moving Sum
Initializing the whole mess must also be considered. While the starting sum
can be zero, all the values in memory must be preset to zero and the
comparison to a threshold must be declared invalid until 'n' new values have
been accumulated.
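
The initialization rule above could be modeled roughly like this: memory and sum start at zero, and the comparison result is reported as invalid (`None` here) until the window has filled. The function name and the use of `None` as the "invalid" marker are illustrative assumptions.

```python
def threshold_flags(samples, depth, threshold):
    """Compare a moving sum to a threshold, gated until the window is full."""
    window = [0] * depth   # sample memory, preset to zero
    ptr = 0
    total = 0              # starting sum is zero
    seen = 0               # number of samples accumulated so far, capped at depth
    flags = []
    for s in samples:
        total += s - window[ptr]
        window[ptr] = s
        ptr = (ptr + 1) % depth
        seen = min(seen + 1, depth)
        # The comparison is only meaningful once `depth` samples are in.
        flags.append(total > threshold if seen == depth else None)
    return flags


flags = threshold_flags([5, 5, 5, 5, 0], depth=3, threshold=12)
```

In hardware the `seen` counter would be a saturating counter or a one-shot "window full" flag.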

Re: Moving Sum

This is easy to do, even without the CIC filters suggested by Ray.
In external memory you keep a circular buffer of 16x250000 samples.
You keep your 160 sums inside of your FPGA. To update them you do the
following:
For each channel
   input the new sample X
   write new sample to its external RAM location in a circular buffer
   for each moving sum of this channel
      read the value Y that "falls off" the sum from external RAM
      add X-Y to the moving sum inside the FPGA.

This requires 16 writes and 160 reads to external memory per sample period,
with a resulting bandwidth of 4,400,000 memory accesses per second.
If the values are stored in memory with the right alignment you can do
4 accesses in parallel, reducing the bandwidth to 1,100,000 accesses
per second.
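
A sketch of the update loop above for a single channel, with several window lengths sharing one circular buffer. The class name is made up, and the list `ram` merely stands in for that channel's region of external memory. One deliberate difference from the pseudocode: the reads are done before the write, so the buffer needs only as many entries as the longest window. The last lines redo the bandwidth arithmetic, taking the 25 kHz sample rate mentioned in the thread.

```python
class MultiWindowSums:
    """Several moving sums of one channel fed from one circular buffer."""

    def __init__(self, windows):
        self.windows = list(windows)
        self.size = max(self.windows)
        self.ram = [0] * self.size          # stand-in for external RAM, preset to zero
        self.wr = 0                         # write pointer
        self.sums = [0] * len(self.windows) # sums kept "inside the FPGA"
        self.reads = 0
        self.writes = 0

    def update(self, x):
        for i, w in enumerate(self.windows):
            # Sample that falls off a window of length w.
            y = self.ram[(self.wr - w) % self.size]
            self.reads += 1
            self.sums[i] += x - y
        self.ram[self.wr] = x               # overwrite the oldest entry
        self.writes += 1
        self.wr = (self.wr + 1) % self.size
        return list(self.sums)


mw = MultiWindowSums([2, 4])
history = [mw.update(x) for x in [1, 2, 3, 4, 5, 6]]

# 16 channels, 10 sums each: 16 writes + 160 reads per sample period.
ACCESSES_PER_SECOND = (16 + 16 * 10) * 25_000
```

Every extra window length costs one more read per sample but no extra storage, as long as it is no longer than the buffer.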

Maybe you should instantiate a processor in your FPGA and use that to
implement this.

What are you doing at LHC that has a sample rate of 25 kHz?

Kolja Sulimma

Re: Moving Sum
For the moment I have used a process similar to the one you describe.
The circular buffer is a SR and the result of the X-Y is fed to an
accumulator using signed numbers.
I think it works very well (the clk has to be min 20 ns, otherwise it goes
unstable/setup violations, but this is not a problem as the real clk will be
much slower).
On the other hand this is going to be used only up to ~10 ms of data (250
The processes that will go up to 100s will take an average of 8 values and
store that value to the external memory. In that way the data are minimised
by a factor of 8 and the system error is negligible.
The SRAM that was found can be used in a 1M x 72 bit architecture, so 8
accesses in parallel, times two, in 40 us seems to be more than OK.
One problem now is how to implement this! My experience does not go that far!
You've said something about instantiating a processor (I guess something
like NIOS); are you sure that this will not complicate things more?
Is there something ready-made to implement a circular buffer in the external RAM?
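
The reduce-by-8 scheme for the long windows described above might look roughly like this in software. Names are illustrative. One deliberate difference: the post stores the average of each 8 samples, while this sketch stores the block sum (three more bits per word), which is a choice made here so that the result stays exact at block boundaries.

```python
def decimated_moving_sum(samples, window, block=8):
    """Moving sum over `window` samples, stored as one word per `block` samples."""
    assert window % block == 0
    nblocks = window // block
    ram = [0] * nblocks     # external memory: 1/block as many words as samples
    ptr = 0
    total = 0
    partial = 0             # sum of the block currently being collected
    count = 0
    out = []
    for s in samples:
        partial += s
        count += 1
        if count == block:
            # One read and one write per block instead of per sample.
            total += partial - ram[ptr]
            ram[ptr] = partial
            ptr = (ptr + 1) % nblocks
            partial = 0
            count = 0
        out.append(total)   # only refreshed once per block
    return out


data = list(range(1, 33))
sums = decimated_moving_sum(data, window=16, block=8)
```

Between block boundaries the output is stale by up to 7 samples, which is the update delay being traded for the factor-of-8 memory saving.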

The second problem, and the reason why I asked for help in this group, is
that those SRAMs are quite expensive and, bearing in mind that 2000 of them
will be needed, they increase the cost significantly. So they are pressing me
to find some other way to implement it. (usual stuff: we want the pie and
the dog fed!)
Ray has given me the idea of the CIC (he will be acknowledged for that in my
thesis, as well as everyone else who took the time to answer) but I still
haven't figured out how it works! Soon I hope, so that I can figure out if
I can use it.

And answering your question about the LHC: this system is for machine protection
and it is called the Beam Loss Monitor. The superconducting magnets have to be
prevented from quenching due to the particle showers hitting them as some
particles are lost from the trajectory.
Inside the tunnel some 3600 ionisation chambers are installed, and they
give a current proportional to the particle rate passing through them.
This current is fed to a CFC (Current to Frequency Converter) and a counter
is measuring the frequency. The counter data, as well as some status, CRC
etc from 16 chambers are sent through an optical link to the surface for

And here is where I come, I have to design the threshold comparator for
Sampling at 40 us is enough as it is ~half an LHC cycle, and maybe it will be
increased to 89 us (~11 kHz), which is one cycle. Just imagine what my
problems would be if 40 MHz were used and I had to go up to 100 s of data!!



Re: Moving Sum

I don't understand why you need 2000 SRAMs.

25 kHz x 16 channels x 100 s x 1 byte = 40 MB

or 8 SRAMs.

Re: Moving Sum


In the first paragraph I explain that by saving the average values of 8 samples
the data are reduced by a factor of 8. So 5 MB have to be stored for this
system (up to 100 s), which fit together with the data of the first system (up
to 10 ms) in one SRAM. On the card there are 2 more to hold other data. And
650 of these cards are needed, which gives ~2000 SRAMs.

Sorry for not being that clear, but the mail was already too long and I didn't
want to kill you with boredom completely!
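
For anyone following the numbers, the storage arithmetic from the last two posts can be checked in a few lines. All figures are the ones quoted in the thread (25 kHz, 16 channels, 100 s, 8-bit samples, 650 cards with 3 SRAMs each); the variable names are just for illustration.

```python
SAMPLE_RATE_HZ = 25_000
CHANNELS = 16
SECONDS = 100
BYTES_PER_SAMPLE = 1      # 8-bit values

# Raw storage for the 100 s window on one card.
raw_bytes = SAMPLE_RATE_HZ * CHANNELS * SECONDS * BYTES_PER_SAMPLE

# Storing one value per 8 samples cuts this by a factor of 8.
reduced_bytes = raw_bytes // 8

# One SRAM for the sums plus two for other data, on each of 650 cards.
total_srams = 650 * (1 + 2)
```

That gives 40 MB raw, 5 MB after the reduction, and 1950 devices in total, which is where the ~2000 comes from.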


Re: Moving Sum

I guess that you have a lot more channels than I realized.

I recommend going to a mass storage device that can hold this amount
of data, such as multiple disk drives. Your data rates are relatively
slow, giving you the option of streaming interleaved data (don't store
a single channel in one place on the disk). You can also look into
large DRAMs. Another option is flash memory, but you might wear these
devices out.


Re: Moving Sum
    I have read all the previous responses and I have one suggestion.  Could
you use a DRAM based memory?  For your first pass generate a running sum of
your 250,000 samples.  Then add the new value and subtract the old value.
That way you need not do the whole sum for each sample.  Use a simple micro
to control the system.  It could be as simple as a PicoBlaze.  You could
even control it with a simple state machine.  You really do not need
anything as fast as a SRAM.  You can refresh the DRAM between the sample

Theron Hicks


Re: Moving Sum
Regardless of which method is used, using a CIC limits the number of memory
transactions per sample to just two: a read and a write.  The memory is used as
a delay queue, so the read pointer is N samples behind the write pointer.  The
memory required for all those channels is pretty big, so DRAM would be the way
to go if you are using semiconductor memories.  Since the addressing can easily
be made linear, you can simplify it by using page mode or burst accesses.   This
should make it fast enough to multiplex many channels into one memory.

--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Re: Moving Sum
Hi Ray,
Forgive me that I still haven't found any time to read about the CIC filter,
but from the way you describe its operation it does not require fewer read
and write operations than a simple implementation of a subtract and
accumulate, which I am testing at the moment. I still need the same read
pointer and the same memory.
My point is, is there an advantage with CIC that I don't see?

