Counting ones

A

Aart van Beuzekom 22 years ago

Hi,

Can anybody tell me a practical way (in VHDL) to count the number of ones in a bit pattern? Counting should be done with combinatorial logic, which means that a for-to loop cannot be used here (but being quite new to FPGA and VHDL, I'm not even sure). The number of bits to be counted is about 30, so a single look-up table is not the solution, using only a Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I count how many bits differ from the expected result.

Thanks,

Aart

Vote

A

Aart van Beuzekom 22 years ago

I forgot to mention that the for-to loop cannot be used because I do not want the system to use any clock cycles - the result just has to be there.

Vote

U

Uwe Bonnes 22 years ago

Aart van Beuzekom wrote: : Hi,

: Can anybody tell me a practical way (in VHDL) to count the number of : ones in a bit pattern? Counting should be done with combinatorial logic, : which means that a for-to loop cannot be used here (but being quite new : to FPGA and VHDL, I'm not even sure). The number of bits to be counted : is about 30, so a single look-up table is not the solution, using only a : Spartan II.

: Any suggestions? Purpose is amongst others pattern matching, where I : count how many bits differ from the expected result.

It has been discussed before. Try

formatting link

to search this group for old answers.

Bye

Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt --------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------

Vote

B

Bertram Geiger 22 years ago

Aart van Beuzekom schrieb:

Its not VHDL, but: if you have bits numberd x0, x1, x2, x3 ... x30 than you can simply add them: Y = x0 + x1 + x2 + ... x30 for counting ONE's

greetings, Bertram

Bertram Geiger, Graz - AUSTRIA Private Mail: remove the letters "b a d" from my reply address

Vote

A

Aart van Beuzekom 22 years ago

Hi Uwe,

Thanks for four response. Fortunately, I checked this newsgroup's archive before posting. The problem is that it isn't exactly pattern matching I want, but counting the number of errors (at the other end of a poor communication link) in a bitstream, compared to an expected pattern. Something that for example can be used to find a kind of preamble signal. Any more suggestions?

Aart

Vote

G

Glen Herrmannsfeldt 22 years ago

I don't remember it being discussed so recently. It might be too far back to find, or maybe another newsgroup. comp.arch might be a good choice.

Anyway, adding with carry save adders is the best combinatorial algorithm I know of. I don't know how well it works in an FPGA, though.

-- glen

Vote

U

Uwe Bonnes 22 years ago

:> Aart van Beuzekom wrote: ... :> It has been discussed before. Try

formatting link

to search this group for old :> answers. :> :> Bye

: Hi Uwe,

: Thanks for four response. Fortunately, I checked this newsgroup's : archive before posting. The problem is that it isn't exactly pattern : matching I want, but counting the number of errors (at the other end of : a poor communication link) in a bitstream, compared to an expected : pattern. Something that for example can be used to find a kind of

The second hit on deja in a search for "fpga counting ones" gives:

formatting link

The 6th hit is Re: Count 1's algorithm... ... To count 16 ones you would need 5 CLBs and have one ... you have to do it and how many bits you are counting. ... with a fanin and fanout of 2. The FPGA LUTs generally ... comp.arch.fpga - 4 Feb 2000 by Dragon - View Thread (11 articles)

Hope this helps

Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt --------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------

Vote

I

Ian Poole 22 years ago

Hi Aart

There is an excellent example of a combinational ones counter in the course notes on our Comprehensive VHDL course ;-) Check out my employers website for more details...

Our example does use a for loop - I'm not sure why you feel a for-loop cannot be used for combinational logic, unless you are putting wait statements inside the for loop, and in that case it probably won't synthesise (unless you have a behavoral compiler). To work out what a loop will do, simply replicate the contents of the loop once for each iteration of the loop, replacing the loop parameter with a constant. eg

process(b , c) begin for i in 0 to 3 loop a(i) = b(i) and c(3-i); end loop end process;

is equivalent to

process(b , c) begin a(0) = b(0) and c(3); a(1) = b(1) and c(2); a(2) = b(2) and c(1); a(3) = b(3) and c(0); end process;

Try a for-loop looping the entire 'range of your error vector. Inside the for loop, sum all the bits. So long as you get the types right and don't put any timing in the loop, it will synthesise to combinational logic.

HTH

Ian Poole

-- Ian Poole, Consultant and Trainer

DOULOS - Developing Design Know-how VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

The contents of this message may contain personal views which are not the views of Doulos Ltd., unless specifically stated.

Vote

J

John_H 22 years ago

In either the Xilinx or Altera architecture, it's probably most efficient to pre-add in groups of 4 bits then add thse results in an adder tree. For 32 bits (for instance) you can get 8 values with counts of 0-4 with simple LUTs. The next stages give 4 values of 0-8 by adding the previous values in pairs, the next level is 2 values of 0-16 and the last stage is a single adder with a result from 0 to 32. This can be extended to whatever quantity of bit errors you want to count per-cycle.

If you just try to add all 32 bits in one line - which is entirely valid - some synthesizers will give you poor results. The structured coding tesnds to give structured results. If you need higher speed, you can pipeline the adder tree to get extreme clock rates.

- John_H

Vote

M

Mike Treseler 22 years ago

Related thread:

formatting link

-- Mike Treseler

Vote

R

rickman 22 years ago

There is nothing inherent in a for loop that uses "clock cycles". As long as you don't use a wait statement on a clock edge, you will not have a FF (which is what you are referring to as using clock cycles).

Rather than to think of a VHDL solution from the problem statement, I prefer to think in terms of what hardware I would design to solve the problem and then use VHDL to describe the hardware. In your case you need a simple bank of adders to add in succession. You can optimize this for speed or logic which will depend on your means of implementing the design. This will be different in an FPGA with 4 input LUTs vs. a gate oriented array.

Once you have decided on an architecture to solve the addition problem, you can then code it in VHDL easily. Very possibly a for loop can be used without creating FFs.

Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Vote

G

Glen Herrmannsfeldt 22 years ago

course

If you think in terms of traditional programming languages, for loop implies iteration, though as you say that isn't required. The PL/I preprocessor, with the %DO statement, would be an example of where it didn't.

There are people who would like to use C as a source language for synthesis, which I still don't believe in.

Maybe asking for a combinatorial, instead of iterative design, would have been better.

-- glen

Vote

J

Jake Janovetz 22 years ago

Hi Aart-

In the Xilinx FPGAs, a LUT RAM block can conveniently be 32x1. You could use three of these, with five identical inputs to provide a look-up sum of five bits. Then, take the results of these sums (five inputs at a time) into an adder tree. For example, if A, B, C, D, E, F are the sets of 5-bits of a 30-bit input, then you'd sum A+B, C+D, E+F. Two more levels and you'd have a complete sum. I'm not sure if this is what the group decided on as the most efficient method, but it's the first one that comes to mind. You could use BRAMs if you'd like for a larger lookup (10-bits at a time to a 4-bit result if you use RAMB4_S4_S4, for example).

Jake

Vote

A

Aart van Beuzekom 22 years ago

formatting link

and how many

generally ...

Hi Uwe,

OK, I didn't even know that newsgroup existed, I'll check it.

Aart

Vote

A

Aart van Beuzekom 22 years ago

formatting link

Thanks to all for the reponse!

Aart

Vote

P

Peter Alfke 22 years ago

To inject a fresh idea into this long thread:

One dual-ported BlockRAM, configured as a 4K x 4 ROM can obviously take

12 inputs and generate 4 binary encoded outputs, but it can do this twice, once per port. (Both ports look at the same code converter simultaneously). So one BlockRAM takes 24 inputs and generates two sets of 4 binary outputs which can be combined in one CLB to generate the 5-bit result. Two BlockRAMs can similarily encode 48 inputs, and the final result is available in substantially less than 10 ns. Note that the BlockRAM requires an input clock edge to initiate the conversion.

S>

Vote

U

Ulf Samuelsson 22 years ago

course

website

loop

iteration

The for loop method generates n-1 adders for n bits. Probably quite slow as well. The synthesizers might of course be smarter than I think...

Best Regards, Ulf Samuelsson ulf@a-t-m-e-l.com This is a personal view which may or may not be share by my Employer Atmel Nordic AB

Vote

R

Ray Andraka 22 years ago

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com

formatting link

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

Vote

J

John 22 years ago

"John_H" wrote in message news:...

I love coming back to a subject and gaining a little more insight!

In the current discussion (and the thread from

3 years ago) I made a mistake (and so has everyone else contributing, a surprising event on usenet). There is a seeming paradox, in that it is more efficient to use the LUTs to do a three input sum at the leaves of the adder tree than to do a 4 input sum. I'll show below the most efficient (area wise) FPGA implementations that I know (for 16 bits, and OP's 30 bits), and challenge anyone to come back and show an even better way. After that, some simple words about the 'paradox' of why the three input/LUT solution is better than four input/LUT in this case.

Look first at a _LUT_only_ implementation (16 bits), which applies to any LUT based FPGA...

A full adder (FA3) can be implemented in 2 LUTs These are 3-LUTs!, not 4-LUTS, although in practice

4-LUTs are used because that is what is available. It is also called a (3,2) counter. See L.Dadda's papers, including "Pipelined Adders" in IEEE Transactions on Computers, Mar 1996, for a good discussion of adders, or Patterson et.al on "Optimal Carry Save Networks", available at Citeseer, for better/deeper discussions than I can give.

This is the full adder: ___

-0-| F |

-0-| A |-1-

-0-|_3_|-0-

(the numbers are powers of 2 at a position) Repeating: The FA3 sums 3 bits, using 2 3-LUTs. Next, 4 full adders are arranged to sum 7 bits, using 8x 3-LUTs: ___

0--| F | 0--| A |-1--------+ 0--|_3_|-0---+ | ___ | +-----1-| F | ___ +--|----------1-| A |---2 0--| F | | | ___ +1-|_3_|---1 0--| A |-1+ +0-| F | | 0--|_3_|-0----0-| A |-1+ 0-------------0-|___|-0-----------0

Call the above a 7 Adder (7Add): ___ | 7 | 7 | A |-2-

0-/-| d |-1- |_d_|-0-

(a more correct term might be '(7,3) counter', but '7Add' fits in the ascii drawing)

Two 7Adds can be used to sum 14 bits into two 3-bit numbers, using 16x 3-LUTs.

3x 4-LUTs are used to sum 4 bits to a 3-bit result in a 4Add: ___ | 4 | 4 | A |---2 0-/-| d |---1 |_d_|---0

Produce the final result from the output of the two 7Adds plus the remaining two bits, using two 4Adds and an FA: ___ | 7 | 7 | A |-2-------------------------------+ ___

0-/-| d |-1-------------------+ +-2-| 4 | |_d_|-0-----+ +-------|-------------2-| A |--4 | | +-|-------------2-| d |--3 | | | | ___ +-2-|_d_|--2 ___ +---|-----+ | +-1-| F | | | 7 | | +-|-----------|---1-| A |-2-+ 7 | A |-2-+ | | ___ | +-1-|___|--------------1 0-/-| d |-1---+ +-0-| 4 | | | |_d_|-0-------0-| A |-2-+ | 0-----------------0-| d |-1---+ 0-----------------0-|_d_|----------------------------0

Total LUT count is 24 (3 Virtex CLBs), using

6x 4-LUTs and 18x 3-LUTs. This is the absolute minimum using 3-LUT and 4-LUT based logic alone (that I know of).

Now look at how the VirtexII carry logic may improve things...

At the leaves of the tree (left hand side), two 3-LUTs are still used to build full adders. The next step is building a 7Add, which sums the output of 2 FAs plus one more bit. Here, either carry logic can be used, at a total cost of 8 LUTs, or a LUT only solution, also costing 8 LUTs. A 15Add sums two 7Adds and another bit. This costs two 7Adds (16 LUTs), plus a standard 3 bit carry logic based adder with carry-in and carry-out, another

5 LUTs. Adding in the last bit is a 4 bit increment with carry-out, costing 6 LUTs using carry logic. Total: 27 LUTs! Three more than the LUT only circuit. Frustrating, isn't it? For this size, a sum of 16 bits, trying to use carry logic does worse than a LUT only implementation. For a 15Add, carry logic allows a marginal improvement of one LUT, using only 21 LUTs instead of 22. For larger bit counts, the carry logic becomes increasingly important, as will be explained below.

Finally, a few simple-minded words about the 'paradox'...

Many operations can be viewed as an exercise in compression, reducing a number of inputs to a smaller number of outputs through some sort of multi-level tree structure. In this case, inputs are the bits to count, and outputs are the count. At any tree level, a simple, naive figure of merit qualifies a circuit: (InputBits/OutputBits) * (InputBits/LUTs) The larger this number, the better the circuit is for that level. The first ratio helps the tree converge faster, the second reduces the LUT count. For 4-LUT based logic, the highest possible figure of merit would be 16, indicating four input bits produce a single output bit, with a single LUT. Parity trees, "and" trees, "or" trees acheive 16. Implementing the leaf side initial level of bit counting with 4-LUTs gives (4/3)*(4/3)=1.778. Implementing the initial bit counting with 3-LUTs gives (3/2)*(3/2)=2.25. At any stage where bits of the same weight are aggregated and compressed, the 3-LUT implementation is better if they can be grouped by threes. At the end of the 16Add tree, some 4-LUTs are used to advantage because there are 4 signals to combine that cannot be cleanly split into groups of three. For every other location in the tree, the 3-LUTs work out better. When carry logic is used at a combining stage, uniting two addends of the same size plus an LSB, the metric becomes: ((2*Size+1)/(Size+1))*((2*Size+1)/(Size+2)) where Size is the number of bits in each addend. This gives: Size Merit 1 1.5 2 2.083 3 2.45 4 2.7 ... approaching 4

When 3-LUT based full adders are cascaded to add two numbers and a carry-in, the metric becomes: ((2*Size+1)/(Size+1))*((Size+1)/(2*Size)) Giving: Size Merit 1 2.25 2 2.083 3 2.042 4 2.025 ... approaching 1

Comparing the tables, for combining three equal weight bits, the LUT solution is better; for going from full adder to 7Add the two circuits have equal LUT count; for going from 7Add to 15Add carry logic should be used. The optimum (30,5) counter (that I know of) uses a mix of

3-LUTs and carry-logic, and no 4-LUTs.

Whew! Talk about being over-anal(ytical). Hope I haven't bored anyone, just wanted to get the 'best' circuits public (in hopes someone else has a better one), and show something not immediately obvious about the leaves of the adder tree. Peter's BlockRam implementation is also slick!

I'll finish with a question: Just what are the best/worst results from HDL synthesis tools that folks get with this function? I.e., barring forcing the mapping, and letting the tool optimize from something like: Count

Vote

P

Peter Alfke 22 years ago

John, just to be contrary: Remember my suggestion of using a 4K x 4 dual-ported BlockROM. So it can take 24 inputs and generate two 4-bit outputs that then have to be combined externally. On two BlockROMs you can have 48 inputs and then combine 4 sets of 4 inputs in one or two CLBs. Fast and simple, but needs a trigger clock, since BlockROMs are synchronous.

Peter Alfke, Xil>

Vote

Counting ones

Join the Discussion

Didn't find your answer?