Ones Count 64 bit on Xilinx in VHDL

- Brad Smallridge

Tue, Jul 19, 2005 10:52 PM

Hello Group,

What is the best way to count 64 incoming simultaneous bit signals to determine the number of 1s (in VHDL)? I have clock cycles to spare but the result must be pipelined so that each clock cycle produces a new count.

Brad Smallridge b r a d @ a i v i s i o n . c o m

- John_H

Tue, Jul 19, 2005 10:58 PM

Add them.

Add registers to your path and make your tool retime them.

This has been covered in the newsgroup in the past. How many levels of logic you can deal with depends on your device and your clock. Just adding the individual bits together will produce the desired results and you can pipeline to your heart's content allowing a new result every clock (after the initial latency) in the time it takes to run through one carry-chain adder.
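
A minimal VHDL sketch of the "add them, then add registers" idea above. The entity name `ones_count64`, the port names, and the `STAGES` constant are all assumptions made for illustration, not code from the thread. Whether the trailing registers actually get pushed back into the adder logic depends on the synthesis tool and on its retiming option being enabled.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and widths are illustrative assumptions.
entity ones_count64 is
  port (
    clk   : in  std_logic;
    din   : in  std_logic_vector(63 downto 0);
    count : out unsigned(6 downto 0)   -- 0 to 64 needs 7 bits
  );
end entity ones_count64;

architecture rtl of ones_count64 is
  constant STAGES : positive := 3;  -- assumed latency; tune to your clock
  type pipe_t is array (0 to STAGES - 1) of unsigned(6 downto 0);
  signal pipe : pipe_t := (others => (others => '0'));
begin
  process (clk)
    variable sum : unsigned(6 downto 0);
  begin
    if rising_edge(clk) then
      -- "Add them": accumulate all 64 one-bit values.
      sum := (others => '0');
      for i in din'range loop
        if din(i) = '1' then
          sum := sum + 1;
        end if;
      end loop;
      -- Trailing registers; a retiming pass may move them into the adders.
      pipe(0) <= sum;
      for s in 1 to STAGES - 1 loop
        pipe(s) <= pipe(s - 1);
      end loop;
    end if;
  end process;

  count <= pipe(STAGES - 1);
end architecture rtl;
```

A new count appears every clock after `STAGES` cycles of latency. Without retiming, all the adder logic sits in front of the first register, so the extra stages add latency but no speed.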

- Brad Smallridge

Tue, Jul 19, 2005 11:12 PM

Yeah, I understand this. But I can't wrap my head around how to code it.

Do you do like this: if( clk'event and clk='1') then partial_sum1_2bit

- Brad Smallridge

Tue, Jul 19, 2005 11:33 PM

I also don't understand what you mean by "having your tool retime them". I don't have Precision or any advanced tools here.

- Peter Alfke

Tue, Jul 19, 2005 11:37 PM

- JJ

Wed, Jul 20, 2005 12:05 AM

64 is 0+63+1
63 is 31+31+1
31 is 15+15+1
15 is 7+7+1
7 is 3+3+1

Simple recursion.

A few adder rows should be pretty quick and use far fewer resources than BlockRAM; it takes about 6 levels of small adders.

- Peter Alfke

Wed, Jul 20, 2005 12:05 AM

In case you are interested in price and performance:

3 BlockRAMs plus 6 CLBs, four levels of pipelining, running at 200 MHz+. Not too bad :-)

Peter Alfke

- Ray Andraka

Wed, Jul 20, 2005 3:09 AM

Brad,

Basically, you want to gather bits together in small adders. A Wallace tree does that using full adders to compress 3 single-bit inputs, all with the same weight, into two signals, a sum and a carry. The sum has the same weight as the inputs; the carry has weight 2x the input. Then you use another layer to sum all like-weighted bits, and repeat until you are left with two signals of each weight. You combine those with a conventional adder.

What Peter described is going to be more clock-cycle efficient because you use the BRAM in place of a Wallace tree. His description isn't really a Wallace tree because it doesn't have the same structure (no tree of carry-save adders, and the final outputs are complete sums of the bits for those BRAMs, not a carry vector and a sum vector like a Wallace tree). You could use Wallace trees to combine the results from the BRAMs, but it isn't efficient in an FPGA.
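
The 3-to-2 compression step described above can be sketched in VHDL as follows. This shows only the full-adder cell and one first layer over 63 of the inputs; the later layers and the final conventional adder are left out. The entity and port names (`full_adder_3to2`, `first_layer`, `sums`, `carrys`) are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical full-adder cell: compresses three equal-weight bits into two.
entity full_adder_3to2 is
  port (
    a, b, c : in  std_logic;  -- three inputs of equal weight
    sum     : out std_logic;  -- same weight as the inputs
    carry   : out std_logic   -- twice the weight of the inputs
  );
end entity full_adder_3to2;

architecture rtl of full_adder_3to2 is
begin
  sum   <= a xor b xor c;
  carry <= (a and b) or (a and c) or (b and c);
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical first layer: 63 weight-1 bits in, 21 weight-1 sums and
-- 21 weight-2 carries out. A 64th input bit would pass to the next layer.
entity first_layer is
  port (
    din    : in  std_logic_vector(62 downto 0);  -- 63 bits = 21 groups of 3
    sums   : out std_logic_vector(20 downto 0);  -- weight 1
    carrys : out std_logic_vector(20 downto 0)   -- weight 2
  );
end entity first_layer;

architecture rtl of first_layer is
begin
  gen : for i in 0 to 20 generate
    fa : entity work.full_adder_3to2
      port map (a => din(3*i), b => din(3*i + 1), c => din(3*i + 2),
                sum => sums(i), carry => carrys(i));
  end generate gen;
end architecture rtl;
```

The number of ones in `din` equals the number of ones in `sums` plus twice the number of ones in `carrys`, which is the invariant each further layer preserves.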

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

- John_H

Wed, Jul 20, 2005 4:54 PM

If I were to do it in Verilog, I might use always @(posedge Clk27M) TotalOnes

- Vladislav Muravin

Wed, Jul 20, 2005 5:30 PM

Brad,

There are many ways of doing this, depending on your FPGA family, the required timing, and the available resources. Besides the "natural resources" (simple LUTs, pipelines, even multipliers, etc.), you can also use memories. Personally, I like using memories for state machines, especially for channelized state machines, or as a LUT for pre-computed CRC calculation.

If we are talking about the Virtex family, we have 16384-bit RAMs, which can be used as a 4096x4 LUT: a 12-bit vector is applied as the address, and each entry holds the number of '1's in that address. It is clear how to extend this concept to a vector of any width, depending on the timing requirements and the available resources.

One way is to use 5 memory blocks like this, which covers 60 bits; then simply add the "data_out"s and the extra bits and pipeline them. There could be a more "balanced" or optimal usage of memories and FFs.
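
A minimal sketch of one such 4096x4 lookup table in VHDL, with the contents computed by a function at elaboration time. The names `ones_rom12`, `init_table`, and `ONES_TABLE` are hypothetical, and whether a synthesizer maps this onto a block RAM rather than LUTs depends on the tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: a 12-bit address in, its ones count out.
entity ones_rom12 is
  port (
    clk  : in  std_logic;
    addr : in  std_logic_vector(11 downto 0);
    ones : out unsigned(3 downto 0)   -- 0 to 12 fits in 4 bits
  );
end entity ones_rom12;

architecture rtl of ones_rom12 is
  type table_t is array (0 to 4095) of unsigned(3 downto 0);

  -- Fill each entry with the number of '1's in its own address.
  function init_table return table_t is
    variable t : table_t;
    variable a : unsigned(11 downto 0);
    variable n : natural range 0 to 12;
  begin
    for i in table_t'range loop
      a := to_unsigned(i, 12);
      n := 0;
      for b in a'range loop
        if a(b) = '1' then
          n := n + 1;
        end if;
      end loop;
      t(i) := to_unsigned(n, 4);
    end loop;
    return t;
  end function init_table;

  constant ONES_TABLE : table_t := init_table;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Synchronous read, as a block RAM would require.
      ones <= ONES_TABLE(to_integer(unsigned(addr)));
    end if;
  end process;
end architecture rtl;
```

Five instances would cover 60 of the 64 inputs; the five 4-bit outputs and the four remaining bits would then be summed in a following pipeline stage.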

I hope I did not make any math mistake here.

Vladislav

- Brad Smallridge

Wed, Jul 20, 2005 7:25 PM

I would like to switch to Verilog, but not on this project.

- Peter Alfke

Wed, Jul 20, 2005 9:19 PM

Vladislav, I agree. And the nicest thing is that you can fold two BlockRAMs into one by using the two ports independently. So one BlockRAM takes care of 24 inputs and generates two sets of 4 bits each. That means you need only 3 BlockRAMs for up to 72 inputs (plus a few CLBs to combine the outputs, unless you want to use two more BlockRAMs to do that). 5 BlockRAMs total gives a 2-clock latency. It all depends what you are after, speed or cost.

Peter Alfke

- Ben Twijnstra

Wed, Jul 20, 2005 9:37 PM

Hi Peter,

I personally feel that using BlockRAMs is a bit wasteful. I coded something up in VHDL that used 144 LEs in an Altera Cyclone 1, slowest speed grade, running at 115 MHz with two clocks of latency as well. No idea how big that would be in a Spartan; my guess is that it would be similar.

Then again, if there's no LUTs left, and there's some leftover BRAMs, then sure this is a great solution.

BTW: Peter, would you (plural) mind if I downloaded a WebPack so I can compare?

Best regards,

Ben

- Ray Andraka

Wed, Jul 20, 2005 10:12 PM

Ben, you are correct, IF you need the block RAMs elsewhere in your design, or if they are not located conveniently with respect to the logic this is related to. Using LUTs, it can be done in 5 layers of logic, which even without pipelining but with floorplanning will run pretty quickly. If you pipeline it on every layer, it might even outperform the BRAM, but only if you are very careful about the placement.


- John_H

Wed, Jul 20, 2005 11:00 PM

I wasn't suggesting you should switch to Verilog; the code that I showed is Verilog, but the concept should translate directly to VHDL. Add 64 1-bit values in a single VHDL line. If the synthesizer doesn't do a good job, have eight lines of eight values each, then add those eight 4-bit results in one line to get your 7-bit result.
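
One possible VHDL reading of the eight-by-eight suggestion above, with a register after each of the two steps. The entity name `ones_count_8x8`, the helper `bit4`, and the grouping of inputs into consecutive bytes are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and input grouping are illustrative assumptions.
entity ones_count_8x8 is
  port (
    clk   : in  std_logic;
    din   : in  std_logic_vector(63 downto 0);
    count : out unsigned(6 downto 0)   -- 0 to 64
  );
end entity ones_count_8x8;

architecture rtl of ones_count_8x8 is
  type partial_t is array (0 to 7) of unsigned(3 downto 0);  -- each 0 to 8
  signal partial : partial_t := (others => (others => '0'));

  -- Zero-extend one bit to 4 bits so numeric_std "+" can add it.
  function bit4(b : std_logic) return unsigned is
  begin
    if b = '1' then
      return to_unsigned(1, 4);
    else
      return to_unsigned(0, 4);
    end if;
  end function bit4;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: eight sums of eight 1-bit values each.
      for g in 0 to 7 loop
        partial(g) <= bit4(din(8*g))     + bit4(din(8*g + 1))
                    + bit4(din(8*g + 2)) + bit4(din(8*g + 3))
                    + bit4(din(8*g + 4)) + bit4(din(8*g + 5))
                    + bit4(din(8*g + 6)) + bit4(din(8*g + 7));
      end loop;
      -- Stage 2: add the eight 4-bit partial sums into the 7-bit result.
      count <= resize(partial(0), 7) + resize(partial(1), 7)
             + resize(partial(2), 7) + resize(partial(3), 7)
             + resize(partial(4), 7) + resize(partial(5), 7)
             + resize(partial(6), 7) + resize(partial(7), 7);
    end if;
  end process;
end architecture rtl;
```

This version has two clocks of latency and accepts a new 64-bit input every clock. The partial sums are 4 bits wide because a group of eight can contain eight ones.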

- JJ

Thu, Jul 21, 2005 4:40 AM

I did this with 63 inputs, all 32 bits wide, in a plain Virtex 800 many years ago.

If you are building a synchronizer for a 64-bit sync field and you can cut off 1 bit, either the first or the last, and use 63 bits, you can save the last row of adders. Since mine was 32 wide, it saved a lot more than 6 adders. The 1-bit loss probably wouldn't affect a synchronizer application.

I wouldn't want to replicate 3 BRAMs 32 times though.

What's the application?

- JustJohn

Thu, Jul 21, 2005 9:36 PM

In case you haven't found it yet, VHDL code for 30 bits (without pipelining, that should be easy to add as a bunch of registers at the end, which XST or Synplify may apply register re-timing to) was posted at:

formatting link

Should be straightforward to extend to 64 bits.

If it's not too much trouble, can I ask what is the application? I thought the circuit was neat, but wonder how folks use it.

John

- Peter Alfke

Thu, Jul 21, 2005 11:59 PM

JJ, I hope you realize that 3 BRAMs is all you need. Nobody would suggest replicating them. For what??

Peter Alfke

- Brad Smallridge

Fri, Jul 22, 2005 5:42 AM

Well, thanks for all your suggestions. As far as BRAMs, I would rather use them elsewhere. I ended up with the rather verbose code shown below. I don't know how well it synthesizes, probably not too well, because I think it is using several hundred LUTs. It's actually a 62 ones counter, and the bits can be turned off from the center out with the B signals.

Brad

signal B00 : std_logic; signal B01 : std_logic; signal B02 : std_logic; signal B03 : std_logic;
signal B04 : std_logic; signal B05 : std_logic; signal B06 : std_logic; signal B07 : std_logic;
signal B08 : std_logic; signal B09 : std_logic; signal B10 : std_logic; signal B11 : std_logic;
signal B12 : std_logic; signal B13 : std_logic; signal B14 : std_logic; signal B15 : std_logic; -- center
signal B16 : std_logic; signal B17 : std_logic; signal B18 : std_logic; signal B19 : std_logic;
signal B20 : std_logic; signal B21 : std_logic; signal B22 : std_logic; signal B23 : std_logic;
signal B24 : std_logic; signal B25 : std_logic; signal B26 : std_logic; signal B27 : std_logic;
signal B28 : std_logic; signal B29 : std_logic; signal B30 : std_logic;

signal EL00 : std_logic; signal EL01 : std_logic; signal EL02 : std_logic; signal EL03 : std_logic;
signal EL04 : std_logic; signal EL05 : std_logic; signal EL06 : std_logic; signal EL07 : std_logic;
signal EL08 : std_logic; signal EL09 : std_logic; signal EL10 : std_logic; signal EL11 : std_logic;
signal EL12 : std_logic; signal EL13 : std_logic; signal EL14 : std_logic; signal EL15 : std_logic;
signal EL16 : std_logic; signal EL17 : std_logic; signal EL18 : std_logic; signal EL19 : std_logic;
signal EL20 : std_logic; signal EL21 : std_logic; signal EL22 : std_logic; signal EL23 : std_logic;
signal EL24 : std_logic; signal EL25 : std_logic; signal EL26 : std_logic; signal EL27 : std_logic;
signal EL28 : std_logic; signal EL29 : std_logic; signal EL30 : std_logic;

signal ER00 : std_logic; signal ER01 : std_logic; signal ER02 : std_logic; signal ER03 : std_logic;
signal ER04 : std_logic; signal ER05 : std_logic; signal ER06 : std_logic; signal ER07 : std_logic;
signal ER08 : std_logic; signal ER09 : std_logic; signal ER10 : std_logic; signal ER11 : std_logic;
signal ER12 : std_logic; signal ER13 : std_logic; signal ER14 : std_logic; signal ER15 : std_logic;
signal ER16 : std_logic; signal ER17 : std_logic; signal ER18 : std_logic; signal ER19 : std_logic;
signal ER20 : std_logic; signal ER21 : std_logic; signal ER22 : std_logic; signal ER23 : std_logic;
signal ER24 : std_logic; signal ER25 : std_logic; signal ER26 : std_logic; signal ER27 : std_logic;
signal ER28 : std_logic; signal ER29 : std_logic; signal ER30 : std_logic;

signal sum_2_00 : std_logic_vector(1 downto 0); signal sum_2_01 : std_logic_vector(1 downto 0);
signal sum_2_02 : std_logic_vector(1 downto 0); signal sum_2_03 : std_logic_vector(1 downto 0);
signal sum_2_04 : std_logic_vector(1 downto 0); signal sum_2_05 : std_logic_vector(1 downto 0);
signal sum_2_06 : std_logic_vector(1 downto 0); signal sum_2_07 : std_logic_vector(1 downto 0);
signal sum_2_08 : std_logic_vector(1 downto 0); signal sum_2_09 : std_logic_vector(1 downto 0);
signal sum_2_10 : std_logic_vector(1 downto 0); signal sum_2_11 : std_logic_vector(1 downto 0);
signal sum_2_12 : std_logic_vector(1 downto 0); signal sum_2_13 : std_logic_vector(1 downto 0);
signal sum_2_14 : std_logic_vector(1 downto 0); signal sum_2_15 : std_logic_vector(1 downto 0);
signal sum_2_16 : std_logic_vector(1 downto 0); signal sum_2_17 : std_logic_vector(1 downto 0);
signal sum_2_18 : std_logic_vector(1 downto 0); signal sum_2_19 : std_logic_vector(1 downto 0);
signal sum_2_20 : std_logic_vector(1 downto 0); signal sum_2_21 : std_logic_vector(1 downto 0);
signal sum_2_22 : std_logic_vector(1 downto 0); signal sum_2_23 : std_logic_vector(1 downto 0);
signal sum_2_24 : std_logic_vector(1 downto 0); signal sum_2_25 : std_logic_vector(1 downto 0);
signal sum_2_26 : std_logic_vector(1 downto 0); signal sum_2_27 : std_logic_vector(1 downto 0);
signal sum_2_28 : std_logic_vector(1 downto 0); signal sum_2_29 : std_logic_vector(1 downto 0);
signal sum_2_30 : std_logic_vector(1 downto 0);

signal sum_3_0 : std_logic_vector(2 downto 0); signal sum_3_1 : std_logic_vector(2 downto 0);
signal sum_3_2 : std_logic_vector(2 downto 0); signal sum_3_3 : std_logic_vector(2 downto 0);
signal sum_3_4 : std_logic_vector(2 downto 0); signal sum_3_5 : std_logic_vector(2 downto 0);
signal sum_3_6 : std_logic_vector(2 downto 0); signal sum_3_7 : std_logic_vector(2 downto 0);
signal sum_3_8 : std_logic_vector(2 downto 0); signal sum_3_9 : std_logic_vector(2 downto 0);
signal sum_3_10 : std_logic_vector(2 downto 0); signal sum_3_11 : std_logic_vector(2 downto 0);
signal sum_3_12 : std_logic_vector(2 downto 0); signal sum_3_13 : std_logic_vector(2 downto 0);
signal sum_3_14 : std_logic_vector(2 downto 0); signal sum_3_15 : std_logic_vector(2 downto 0);

signal sum_4_0 : std_logic_vector(3 downto 0); signal sum_4_1 : std_logic_vector(3 downto 0);
signal sum_4_2 : std_logic_vector(3 downto 0); signal sum_4_3 : std_logic_vector(3 downto 0);
signal sum_4_4 : std_logic_vector(3 downto 0); signal sum_4_5 : std_logic_vector(3 downto 0);
signal sum_4_6 : std_logic_vector(3 downto 0); signal sum_4_7 : std_logic_vector(3 downto 0);

signal sum_5_0 : std_logic_vector(4 downto 0); signal sum_5_1 : std_logic_vector(4 downto 0);
signal sum_5_2 : std_logic_vector(4 downto 0); signal sum_5_3 : std_logic_vector(4 downto 0);

signal sum_6_0 : std_logic_vector(5 downto 0); signal sum_6_1 : std_logic_vector(5 downto 0);

signal sum_7_0 : std_logic_vector(6 downto 0);

begin

s15:process(clk) begin if(clk'event and clk='1') then sum_2_15

- JJ

Fri, Jul 22, 2005 7:47 AM

The application involved a 32x oversampled 28MHz PSK stream from a powerline. The logic ran at 28MHz behind a 32-tap analog DLL, so syncing was done by looking for a correlation at each of 32 phases in parallel. With even minor power line filtering, the bit edges are all over the place, making it tough to say where bits start or end. Anyway, it was derived from an ASIC design, and BRAMs were not plentiful in those early Virtex devices.

If I did it today, I'd probably use N x faster clock on digital logic with N x less HW and factor the N out of the oversampling front end logic.

I still wouldn't use BRAMs today; I'd use them for other functions. Using 63 bits rather than 64 bits takes precisely 63 adder cells, and each doubling adds 2 adder delays (ASIC, that is), and 64 takes an extra 6 on top. When did adders become expensive?

johnjakson at usa dot com