Use of hardware adders with long words to perform multiple additions in parallel

W

Wojciech M. Zabolotny 12 years ago

I was solving a problem, when I needed to calculate every clock a sum of multiple values encoded on a small number of bits (the latency of a few clocks is allowed).

A natural solution seemed to be a binary tree of adders, consisting of N levels, when on each level I calculate a sum of two values. E.g. assuming, that I have to calculate a sum of 8 values, I can calculate:

On the 1st level: Y0 = X0 + X1, Y1=X2+X3, Y2=X4+X5, Y3=X6+X7 (4 adders)

On the 2nd level: V0 = Y0+Y1, V1=Y2+Y3 (2 adders)

On the 3rd level: Z = V0+V1 (1 adder)

If each level is equipped with a pipeline register, I can get a short critical path, and the final latency is equal to 3 clocks, the new values may be entered every clock, and the result is availeble every clock. The whole design uses 7 adders.

However modern FPGAs are equipped with adders using long words. E.g. the Xilinx family 7 chips use adders with 25 bit input operands. If we assume, that the input values are encoded only at 5 bits, we can significantly reduce consumption of resources. Lets encode input words X0..X7 on bits of operands on the 1st level as follows: A(4 downto 0)=X0; A(5)='0'; A(10 downto 6)=X2; A(11)='0'; A(16 downto 12)=X4; A(17)='0'; A(22 downto 18)=X6; A(23)='0';

B(4 downto 0)=X1; B(5)='0'; B(10 downto 6)=X3; B(11)='0'; B(16 downto 12)=X5; B(17)='0'; B(22 downto 18)=X7; B(23)='0';

Then on the first layer we can perform all calculations using only single adder: C=A+B, and sub-sums are encoded as follows: C(5 downto 0)= X0+X1=Y0; C(11 downto 6)=X2+X3=Y1; C(17 downto 12)=X4+X5=Y2; C(23 downto 18)=X6+X7=Y3

On the 2nd level we work with 6-bit values (7-bit after addition of leading '0'), so we can perform up to 3 additions in a single adder (but we need only 2)

D(5 downto 0)=Y0; D(6)='0'; D(12 downto 7)=Y2; D(13)='0'; E(5 downto 0)=Y1; D(6)='0'; D(12 downto 7)=Y3; D(13)='0';

After addition: F=D+E we get: F(6 downto 0)=Y0+Y1=V0 ; F(13 downto 7)=Y2+Y3=V1;

The final addition may be performed in a standard way. Please note, that now we had to use only 3 adders!

The interesting problem is, how we should organize our adders' tree for different lengths of word in the adders, different lengths of the input values and different number of the input values. I've prepared a simple tool, written in Python, which automatically generates the appropriate VHDL code. The sources have been posted on the usenet alt.sources group: Subject: Generator of VHDL code for parallel adder (using long-word hardware adders to perform multiple additions in parallel) Google archive:

formatting link

However I'm interested if it can be done in pure VHDL? I hope that the above idea will be useful for someone...

Regards, Wojciech M. Zabolotny wzab@ise.pw.edu.pl My GPG/PGP keys: standard: B191 ACF0 7909 83FA 3F9B 450C 407E 3C4B 4569 D119 confidential: 2BF3 F90F 6EA8 7D35 59FD 5080 78ED 33DE 1312 D8F8

Vote

A

Allan Herriman 12 years ago

Not a direct answer to your questions... have you researched Wallace Trees? Like most solutions to computer arithmetic, it dates from about half a century ago. Wallace tried to find a way of making a faster hardware multiplier. Part of that involves adding a large number of binary numbers, and Wallace's solution allowed for a shorter critical path in the adder tree.

formatting link

BTW, yes, it can be done in pure VHDL. Expect to use a lot of if- generate and for-generate constructs, possibly with functions to work out the ranges of the for-generates.

A few years back a wrote a "popcount" (e.g. population count) module in VHDL. It had an input width that was configurable, and added up the number of '1' bits in that input word. E.g. a 512 bit input would result in a 10 bit output word. The whole thing worked in only a modest number of levels of logic and was moderately fast (180MHz after PAR in the logic of the day) because I used a Wallace Tree. The first stages (which were adding six 1 bit numbers together) were implemented directly in six-input LUTs. The second (and perhaps third) stages were implemented using similar LUTs. I only switched to using behavioural (synthesising to ripple carry) adders once the words became wider a few stages into the tree.

Regards, Allan

Vote

G

glen herrmannsfeldt 12 years ago

The traditional solution to this problem is the carry save adder.

It isn't always so obvious that the traditional solutions are still optimal in FPGAs with built-in carry logic, but I think it still works.

-- glen

Vote

W

wzab01 12 years ago

f

N

.

e

t

Thanks a lot for suggestions. I have not described it, but in my design I have yet another constraint. I need to squeeze as much of those summing systems in single FPGA, while ma intaining latency on the lowest possible level.

That's why I tried to fully utilize the hardware features of my FPGA, and r euse adders to perform multiple operations in parallel.

In fact my problems starts at the level where "the words become wider" (quo ting from your post).

Regards, Wojtek

Vote

Use of hardware adders with long words to perform multiple additions in parallel

Join the Discussion

Didn't find your answer?