about fast adder

- Giox

Thu, Jul 7, 2005 7:37 AM

I'm interested in implementing a fast adder for 32-bit data. The CLA is too expensive, so I'm searching for something different. Can you point me to some references? I think a Ling adder could be a good choice, but I don't know. Thanks a lot

- Sylvain Munaut

Thu, Jul 7, 2005 12:01 PM

On an FPGA, going faster than the dedicated fast carry ripple chain for only 32 bits of data might not be easy. What is your target speed and what is your current speed?

Sylvain

- Giox

Thu, Jul 7, 2005 12:15 PM

Hi, I'm using a Virtex 300E, and after the synthesis step (not place and route) the frequency is estimated at 82.129MHz. The performance is better than what I need, but the occupied area is considerable.

Gio

- des00

Thu, Jul 7, 2005 12:24 PM

Which software do you use to synthesize your project? Did you try pipelining? Can you show your source code?

des00

- Giox

Thu, Jul 7, 2005 1:15 PM

I'm using the Xilinx ISE pack, with its synthesis pack. I'm not able to show the code, but it is simply a 32-bit CLA built from 4 different 8-bit CLAs with group propagate and generate.

Gio

- John_H

Thu, Jul 7, 2005 1:32 PM

Have you tried a simple adder? Verilog:

    module myadd (
        input             clk,
        input      [31:0] a, b,
        output reg [31:0] y
    );
        // Registered output; nonblocking assignment is the usual idiom in a clocked block
        always @(posedge clk)
            y <= a + b;
    endmodule

The dedicated adder circuitry is very fast silicon. Trying to best the native performance of the adder is difficult. Most people have their performance hurt by having more than one (or two) levels of logic in the adder. If you go from registered inputs to registered outputs you should get significantly better performance than the CLA structure you're trying.
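A minimal sketch of the "registered inputs to registered outputs" arrangement described above. The module name `myadd_reg` and the register names `a_r` and `b_r` are illustrative assumptions, not taken from the thread.

```verilog
// Sketch: 32-bit adder with flip-flops on both sides, so the only logic
// between registers is the adder itself.
// Names (myadd_reg, a_r, b_r) are hypothetical.
module myadd_reg (
    input             clk,
    input      [31:0] a, b,
    output reg [31:0] y
);
    reg [31:0] a_r, b_r;

    always @(posedge clk) begin
        a_r <= a;          // input registers
        b_r <= b;
        y   <= a_r + b_r;  // plain "+" lets the tools map onto the carry chain
    end
endmodule
```

The result appears two clock edges after the operands are presented. With this structure the reported timing reflects the adder alone rather than whatever logic happens to feed it.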

Let us know how your performance changes with a simple 32-bit adder.

- des00

Thu, Jul 7, 2005 1:40 PM

Specially for you, I did a simple test. This code:

    library ieee;
    use ieee.std_logic_arith.all;
    use ieee.std_logic_unsigned.all;
    use ieee.std_logic_1164.all;

    entity adder is
      port (
        in_clock   : in  std_logic;
        in_reset_b : in  std_logic;
        in_dataA   : in  std_logic_vector(31 downto 0);
        in_dataB   : in  std_logic_vector(31 downto 0);
        in_strobe  : in  std_logic;
        out_data   : out std_logic_vector(32 downto 0);
        out_strobe : out std_logic
      );
    end entity adder;

    architecture adder of adder is
    begin
      process (in_clock, in_reset_b) is
      begin
        if (in_reset_b = '0') then
          out_data <= (others => '0');
        elsif (rising_edge(in_clock)) then
          if (in_strobe = '1') then
            -- The source is cut off after "out_data"; everything from
            -- here to the end is an assumed completion, not des00's text.
            out_data <= ('0' & in_dataA) + ('0' & in_dataB);
          end if;
          out_strobe <= in_strobe;
        end if;
      end process;
    end architecture adder;

- Giox

Thu, Jul 7, 2005 2:18 PM

Gulp, interesting. I tested your code with my tools; it is faster with Synplify than with my tools. However, it seems that the biggest trouble is the use of the CLA: the synthesis process gives better results than the CLA that I implemented by hand. I'm not as experienced as you, but is it possible that a standard (read: from a standard university textbook) implementation of a CLA prevents the tools from using specific features of the FPGA? It seems so, but I would like your advice. Thanks again

Giovanni

- Sylvain Munaut

Thu, Jul 7, 2005 2:28 PM

You mean you didn't try the simple + first?

All modern FPGAs have a dedicated carry ripple chain that allows very quick propagation of the carry from one LogicCell to the adjacent one. By using this, you only need n LogicCells for an n-bit adder, and the carry is handled by dedicated logic.

When you built your CLA, you used only generic logic, so you added extra delays. Using an architecture other than the simple + for addition only pays off for very big adders.

Sylvain

- Giox

Thu, Jul 7, 2005 2:37 PM

Sorry, but I have no experience in this field, so I thought that the simple approach could not be productive and I skipped it. Thanks for your help.

Giovanni

- JJ

Thu, Jul 7, 2005 4:43 PM

You have obviously been reading some ASIC/VLSI-oriented arithmetic texts, which don't really apply to FPGAs except for the general algorithms. I'd guess the ripple carry is disproportionately faster in an FPGA, relative to LUT logic, by about 3x compared with an ASIC, since it's given for free and highly optimized. Think about it: an adder might be placed in any set of LUTs lined up in a column, so only a ripple can be provided. All the clever logic schemes are highly irregular and must use LUT logic, which is about 3-5x slower than ASIC logic.

I also wanted a 32-bit add for a CPU. The ripple was too slow: IIRC about 170MHz for a simple a+b registered on inputs and output, using the fastest speed grade for V2Pro.

For a while I used a CSA (carry-select) array, i.e. seven 8-bit ripple add sections with a follow-on stage to combine the proper select carries. It did cycle faster, maybe near 300MHz IIRC, but it used up about 3x the area and needs the extra pipeline stage. Another downside was that in order to use CSA, the addition is done twice, with a carry-in of both 1 and 0, for bits 8-15, 16-23 and 24-31. This doubles the fanout on the registers driving the duplicate adders, which limits the speedup and complicates hand placement. CSA, CLA and Ling schemes are better used for ASICs and full custom. I believe some Altera devices have CSA adder logic built in.
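The carry-select arrangement described above can be sketched roughly as follows. This is an illustration under assumptions, not JJ's actual design: the module name `csel_add32`, all signal names, and the exact placement of the pipeline registers are hypothetical.

```verilog
// Sketch: 32-bit carry-select adder built from seven 8-bit ripple sections
// (one for bits 7:0, two each for the three upper bytes) plus a select stage.
// All names are hypothetical.
module csel_add32 (
    input             clk,
    input      [31:0] a, b,
    output reg [32:0] sum
);
    // Stage 1: each section is 9 bits wide so bit 8 holds its carry-out.
    reg [8:0] s0;             // bits 7:0, carry-in is known to be 0
    reg [8:0] s1_c0, s1_c1;   // bits 15:8  with carry-in 0 / 1
    reg [8:0] s2_c0, s2_c1;   // bits 23:16 with carry-in 0 / 1
    reg [8:0] s3_c0, s3_c1;   // bits 31:24 with carry-in 0 / 1

    always @(posedge clk) begin
        s0    <= a[7:0]   + b[7:0];
        s1_c0 <= a[15:8]  + b[15:8];
        s1_c1 <= a[15:8]  + b[15:8]  + 1'b1;
        s2_c0 <= a[23:16] + b[23:16];
        s2_c1 <= a[23:16] + b[23:16] + 1'b1;
        s3_c0 <= a[31:24] + b[31:24];
        s3_c1 <= a[31:24] + b[31:24] + 1'b1;
    end

    // Stage 2: the real carries pick between the precomputed results.
    wire c8  = s0[8];
    wire c16 = c8  ? s1_c1[8] : s1_c0[8];
    wire c24 = c16 ? s2_c1[8] : s2_c0[8];
    wire c32 = c24 ? s3_c1[8] : s3_c0[8];

    always @(posedge clk)
        sum <= { c32,
                 c24 ? s3_c1[7:0] : s3_c0[7:0],
                 c16 ? s2_c1[7:0] : s2_c0[7:0],
                 c8  ? s1_c1[7:0] : s1_c0[7:0],
                 s0[7:0] };
endmodule
```

The duplicated upper sections are where the extra area and the doubled fanout on `a` and `b` come from, and the second `always` block is the extra pipeline stage.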

In the end I flipped to an alternate approach: a 2-cycle design that is limited to the 16-bit critical path. This also happens to be very close to the blockram cycle time, so now in 2 clocks I get a 4-port RAM and a 32b add at near 150MHz (the actual clock is 2x that). It also uses far less HW and had other simplifying effects elsewhere. Then, by placing a pair of 2-cycle CPUs on opposing clocks, it gets the equivalent performance of one 300MHz CPU datapath. This idea can usually be copied for DSP engines pretty well.
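One possible reading of the 2-cycle idea, as a rough sketch only. The thread does not show the real datapath, so the module name `add32_2cycle`, the `phase` input, and the choice to share a single 16-bit adder are all assumptions.

```verilog
// Sketch: one 32-bit add spread over two fast clocks, so the carry chain
// on the critical path is only 16 bits long.
// Assumptions: phase alternates 0,1,0,1... and a, b are held stable
// for both phases of an add. All names are hypothetical.
module add32_2cycle (
    input             clk,     // fast clock (2x the add rate)
    input             phase,   // 0: low half, 1: high half
    input      [31:0] a, b,
    output reg [32:0] sum      // complete after the phase-1 clock edge
);
    reg carry;                 // carry out of the low half, kept between phases

    // Operands for the shared 16-bit adder, selected by phase.
    wire [15:0] x   = phase ? a[31:16] : a[15:0];
    wire [15:0] y   = phase ? b[31:16] : b[15:0];
    wire        cin = phase ? carry    : 1'b0;
    wire [16:0] s   = x + y + cin;   // 17 bits: bit 16 is the carry-out

    always @(posedge clk) begin
        if (!phase) begin
            sum[15:0] <= s[15:0];    // low half
            carry     <= s[16];
        end else begin
            sum[32:16] <= s;         // high half plus final carry-out
        end
    end
endmodule
```

Sharing one adder halves the adder hardware, but the operand muxes add a logic level in front of the carry chain; a design tuned for clock rate might register the selected operands first.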

johnjakson at usa dot com

- Marko

Fri, Jul 8, 2005 5:00 AM