Population Count circuit

A

acd 18 years ago

Out of curiosity: I am currently studying some problems which require many population- count operations ov various sizes.

I have used population count in a course I taught last year, as it can demonstrate various design styles (function, recursive function, nested GENERATE statements, etc.) in VHDL. I used a tree of full adders and the performance on a Spartan 3 as reported by the xilinx tools was moderate (using pipelining). * I think that one should get a much faster implementation even on an FPGA if one uses redundant representations very much like multipliers do (which do essentially several population circuits of the partial products).

So, FPGA-experts: is that true, from which pipeline depth/input size is it worth? How is the cost/performance compared to an full-adder implementation?

Theoreticians: I could not find a complexity expression for this operation. But it sounds like textbook knowledge. Maybe I have the wrong textbooks.

Andreas

Vote

J

John_H 18 years ago

Is your task looking at one input vector, determining which of many population counts need to be incremented and providing the many counts to the user after the full population represented by the input vectors have been processed?

Counters are fast and simple in FPGAs, at least when compared to the logic used to determine what counters need to be incremented in the first place.

A little more elaboration on the specific needs would be helpful; if I'm far off base it's because I don't yet understand what you're trying to accomplish and what basic limitations you're trying to design around.

- John_H

Vote

A

acd 18 years ago

Sorry, I thought the term "population count" was clear. I mean counting the number of ones in a given input vector. In this case, I want a pipelined circuit which can process a new input vector every clock cycle, giving the result a fixed number of cycles later at the output. As mentioned, a tree of adders with increasing width would do it. But I think using only a full adder at the last stage may be faster/ cheaper.

Andreas

Vote

J

John_H 18 years ago

Thanks for the clarification. What's the size of your N? A simple pipelined tree will give you speed limited by the largest register-to-register adder which is pretty good. Do you need a higher speed still?

When Altera and Xilinx had maximum speeds just over 100 MHz, I had a "ones counter" that digested 32 bits at a time differently in the two architectures where my maximum frequency was a high 90 MHz. The implementation was different for the two approaches to leverage some Altera logic paths that were faster than I could get with simple adders. For where FPGAs are now, I woulnd't expect the different flow to produce any significant gain, especially for larger N. My recollection is the alternate approach's delays went up linearly with N while the pipelined tree goes up as log(N). The approach wasn't something I found in a book but was home-grown.

There are other ways, but your absolute "best" implementation depends on your requirements for N, for speed, and for latency.

- John_H

Vote

J

John McCaskill 18 years ago

In addition to this, if you have block rams that you are not using for anything else, they can be used for a look up table. Four 16K by 1 bit block rams would give you the population count of a 14 bit vector. They can also be combined with the LUTs for a hybrid solution.

Regards,

John McCaskill

formatting link

Vote

G

glen herrmannsfeldt 18 years ago

The full adder tree is known as a carry save adder tree.

As far as I know, it is the best for large N, and assuming that a full adder is a reasonable logic unit.

For FPGAs with LUTs with other than three inputs, and for not so large N, it might be that there are other implementations, but I don't believe they are all that much better.

For not so large N, it is the boundary cases that become significant, similar to the problem of how many circles you can fit inside a given sized square without overlapping, as the size of the square increases.

With 4 input LUTs, two implemented as full adders will convert three inputs into a two bit (ones and twos) output. Three will convert four bits into three outputs (ones, twos, fours). When N isn't so large, in many cases the number on inputs needed at a given level is not a multiple of three, and the four input LUTs can be useful.

I don't believe the result will be much faster, (in number of logic stages), but might use fewer LUTs.

The last one I did only needed to output 0, 1, 2, 3 or more, which simplified the logic a little (but not a lot) from 36 inputs.

-- glen

Vote

G

glen herrmannsfeldt 18 years ago

(snip)

What do you mean by faster? Depending on your clock rate, you might do more than one stage of adders between pipeline latches. That will give you the result sooner. In most FPGAs there are more latches than needed, so there is no savings. It might be that some design could use slightly less logic (LUTs), but it won't be a lot less.

As someone else mentioned, using block RAMs, if you have them available, might help.

Otherwise, if your clock rate is slow you might be able to reuse some of the logic. That is, trade speed for logic using a faster clock on the pipeline registers than the data input.

-- glen

Vote

S

Symon 18 years ago

...furthermore, you can do 28 bits if you're prepared to add the outputs of the two ports of the dual port RAMs. HTH., Syms.

Vote

C

comp.arch.fpga 18 years ago

A tree aproach has the disadvantage of irregular routing which really hurts performance. If you want maximum speed and do not care about cost you could have a triangular setup. The vector is input at the top and in each clock cycle it is sent one DFF down and one to the left. On the left there is a column of carry save incrementers that add the bit coming from the right to the value from the top. At the bottom you add a log(N) stages to finalize the carry save result. This is a very expensive solution that probably can run at speeds a lot higher than the 550MHz the Virtex-5 is officially designed for.

There was a correlator design published that used a similar approach to achieve 250MHz in a XC3195A.

The BRAM result probably is more realistic and gets you 550 MHz.

Kolja Sulimma

Vote

Population Count circuit

Join the Discussion

Didn't find your answer?