faster Spartan III adder

Hi,

I need to add a pair of 8 bit (unsigned) integers to get a 9 bit (unsigned) result at 250 MHz, preferably in an XC3S50-4.

Using the Coregen adder/subtractor V7 with maximum pipelining (9) and RPM on, the best cycle time I can get is 4.55 ns. At each pipeline level the critical path is a LUT, a MUXCY, and another LUT.

Can anyone point me at some hints for a faster implementation (besides going to a faster part)?

TIA

Paul Smith Indiana University Physics

Paul Smith

"Paul Smith" wrote in message news:d84q4e$uga$ snipped-for-privacy@rainier.uits.indiana.edu...

One possible solution is simple:

Do the addition twice in parallel and multiplex the results back into one stream. Each adder then only has to run at 125 MHz, and the output mux should be a single LUT, so it works at 250 MHz without problems.
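A minimal sketch of the two-lane idea, assuming a single 250 MHz clock with a phase toggle rather than a separate 125 MHz clock. The entity name `add8_interleaved` and all signal names are made up for illustration. Each adder gets two clock periods to settle, but the timing analyzer only knows that if a multicycle constraint is added in the tool flow, which is not shown here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: two 8-bit adders fed alternately, results muxed
-- back into one full-rate stream. Latency is two clocks.
entity add8_interleaved is
  port (
    clk : in  std_logic;                      -- full-rate clock (250 MHz in the thread)
    a   : in  std_logic_vector(7 downto 0);
    b   : in  std_logic_vector(7 downto 0);
    sum : out std_logic_vector(8 downto 0)
  );
end entity add8_interleaved;

architecture rtl of add8_interleaved is
  signal phase  : std_logic := '0';
  signal a0, b0 : unsigned(7 downto 0) := (others => '0');  -- lane 0 operands
  signal a1, b1 : unsigned(7 downto 0) := (others => '0');  -- lane 1 operands
  signal s0, s1 : unsigned(8 downto 0);
begin
  -- Each lane's operands change only every second clock, so each
  -- combinational adder has two clock periods to settle.
  s0 <= resize(a0, 9) + resize(b0, 9);
  s1 <= resize(a1, 9) + resize(b1, 9);

  process (clk)
  begin
    if rising_edge(clk) then
      phase <= not phase;
      if phase = '0' then
        sum <= std_logic_vector(s0);  -- result of the operands loaded two clocks ago
        a0  <= unsigned(a);           -- reload lane 0
        b0  <= unsigned(b);
      else
        sum <= std_logic_vector(s1);
        a1  <= unsigned(a);           -- reload lane 1
        b1  <= unsigned(b);
      end if;
    end if;
  end process;
end architecture rtl;
```

The only single-cycle path left is the phase flip-flop through the output mux to the `sum` register, which is one LUT level.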

:) antti3cents

Antti Lukats

(1) Try using timing-driven packing and placement; it might help.
(2) If you can pipeline the result, i.e. accept one more sample of latency, this would solve the problem. But try using high-level code rather than Core Generator.
(3) Split the addition into two parallel ones at 1/2 frequency, and multiplex them at full frequency.

Also take a look at what percentage of the delay is contributed by routing :o(

Hope this helps

Vladislav

Vladislav Muravin

The trouble is probably your second LUT. The first LUT feeds the S input of the carry chain, yes? This would be the LUT attached directly to the carry chain. The second LUT means - for reasons unknown to us - the result is going through additional logic. It's this logic that needs to be tweaked a little.

Since it didn't pass through an XORCY I'm guessing this is the carry-out of the 8-bit adder? Look at what else feeds the LUT and try to determine why the synthesizer wants to add logic to the adder's OUTPUT rather than in the 4-input LUT.
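For readers unfamiliar with the structure being discussed, here is a behavioural sketch of one LUT per bit feeding a carry chain. It models the MUXCY and XORCY functions with plain VHDL expressions instead of instantiating the vendor primitives; the entity name `add8_carry_model` and the signal names are invented for this illustration, and whether a given synthesizer maps it onto the dedicated carry logic is tool dependent.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical behavioural model of an 8-bit adder built the way the
-- carry chain builds it: LUT -> MUXCY chain -> XORCY per bit.
entity add8_carry_model is
  port (
    a : in  std_logic_vector(7 downto 0);
    b : in  std_logic_vector(7 downto 0);
    q : out std_logic_vector(8 downto 0)
  );
end entity add8_carry_model;

architecture model of add8_carry_model is
  signal p : std_logic_vector(7 downto 0);  -- LUT outputs (propagate), drive the mux select
  signal c : std_logic_vector(8 downto 0);  -- carry chain
begin
  c(0) <= '0';  -- no carry input to the LSB

  bits : for i in 0 to 7 generate
    p(i)   <= a(i) xor b(i);                  -- the LUT in front of the carry chain
    c(i+1) <= c(i) when p(i) = '1' else a(i); -- MUXCY function: pass carry or generate it
    q(i)   <= p(i) xor c(i);                  -- XORCY function: sum bit
  end generate bits;

  q(8) <= c(8);  -- carry-out leaves the chain without passing an XORCY
end architecture model;
```

In this model the ninth bit is simply the last carry, so any extra LUT after it would come from additional logic around the adder, not from the adder itself.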

John_H

"Paul Smith" wrote in message news:d84q4e$uga$ snipped-for-privacy@rainier.uits.indiana.edu...

Hmm, strange. An 8 bit adder should fit into one level of logic. Make sure both inputs are registered and placed correctly (close to the carry chain). The output should be registered too, of course ;-)
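A plain description of such an adder might look like the sketch below. This is not the file from the test that follows (that listing is cut off in this archive); the entity name `add8_reg` and the signal names are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical registered 8 + 8 -> 9 bit unsigned adder:
-- input registers, one level of adder logic, output register.
entity add8_reg is
  port (
    clk : in  std_logic;
    a   : in  std_logic_vector(7 downto 0);
    b   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(8 downto 0)
  );
end entity add8_reg;

architecture rtl of add8_reg is
  signal a_r, b_r : unsigned(7 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Register the operands so the adder path starts at a flip-flop.
      a_r <= unsigned(a);
      b_r <= unsigned(b);
      -- Widen to 9 bits before adding so the carry-out is kept.
      q   <= std_logic_vector(resize(a_r, 9) + resize(b_r, 9));
    end if;
  end process;
end architecture rtl;
```

Where the input registers end up (IOB or fabric) is decided by the implementation options and constraints, not by this code.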

OK, I did a quick test using Webpack 7.1.

A plain description reaches 3.995 ns, uhhh tight timing ;-) Looking at the floorplanner (after P&R) I see the mess. The registers for my inputs are placed inside the IOBs. Not bad in general, but bad here, where we need every fraction of a ns. So I disable the option for placing the registers into the IOBs and run again. BINGO! 3.5 ns.

But the automatic P&R tools are lazy bastards. A look at the floorplanner reveals that the input registers are spread over the chip. OK, handmade is handmade. We add some LOCs into the UCF. New run.

3.49 ns. Hmm, not too much improvement, but since the placement is fixed this should be reliable. See the files below.

Njoy. Falk

-- VHDL

----------------------------------------------------------------------------

--
-- Company:
-- Engineer:
Falk Brunner

Paul, if you want to be fast, run lean. You want to add, so pick an adder, not an adder/subtractor. This design should only take 9 or 10 LUTs, and the carry chain should be just combinatorial. And you don't have an active carry input to the LSB. So eliminate that path from the speed analysis. Try to get the basic functionality (without the routing) as fast as possible. Then apply some floorplanning.

Peter Alfke, Xilinx.

Peter Alfke

Or just place pre- and post-registers, or place and route (PAR) your design as a macro.

Laurent

Laurent Gauch

Hi Falk,

What temperature/voltage did you get these results at? At 85 C and 1.14 volts I'm getting just over 4 ns in an XC3S50-4.

Paul

Paul Smith

You're getting the longer times because you're going through 2 levels of logic. The second LUT you mentioned will cause some grief but with proper constraints even that might be reasonable. Located right next to each other, a carry chain feeding a LUT might get the timing you need but you need to coerce the tools.

Falk's results are most definitely from a carry chain without the extra logic level.

John_H

"Paul Smith" wrote in message news:d89k4c$hpl$ snipped-for-privacy@rainier.uits.indiana.edu...

OK, my 3.5 ns was with a -4 device. With -5, 85 C and 1.14 V it's 3.98 ns. Hmm. Maybe it's better to have two adders running at half the speed and MUXing the results.

Regards Falk

Falk Brunner

Hi Peter,

I'm using the coregen adder/subtractor v7 (from ISE 7.1.02i); if I tell it to generate an adder it seems to generate the same logic as the synthesis tool does from VHDL.

For a latency of 1, I get 4.186 ns for 85 C and 1.14 volts.

The critical path seems to be from input A(5) to Q(8); this seems very strange; I would have expected the critical path to be from A or B (0) to Q(8).

I'm going to try various values for the latency; I would think internal pipelining would speed things up.

paul

Paul Smith

Paul, let's for a moment forget about routing. The 8-bit adder must fit into a single CLB (8 LUTs with carry) plus an extra LUT for the 9th bit. And the critical path has to be, as you say, from the LSB to the MSB, since the carry path ripples. If you get a different structure, change it! Coregen and other tools should help you, not prevent you, from getting max performance. Peter Alfke

Peter Alfke

Using the coregen to generate an RPM gives a "vertical" structure with the slices one above the other using 2 CLBs. I can certainly try not using the RPM and arranging the slices so they are all in the same CLB.

For an RPM coregen adder I get:

latency 1:

3.974 ns for an 8 bit output
4.186 ns for a 9 bit output

latency 2:

3.956 ns for a 9 bit output

latency 3:

PAR fails for an RPM, complaining that IOBs aren't supported in RPMs. I looked at the .edn file and don't see any IOBs?!?!

Turning off RPM I get 4.311 ns.

latency 4:

RPM works OK, but I get 4.044 ns. Worse than latency 2. Looking at the critical path, there are now 2 LUTs with a MUXCY between them.

CoreGen Adder/Subtractor v7 isn't as clever as I would have hoped; it does look like I can do better without it.

Interesting stuff, I'll keep at it.


Paul Smith

"Paul Smith" wrote in message news:d8a8u2$o4h$ snipped-for-privacy@rainier.uits.indiana.edu...

I guess you have to look at reality. 4 ns isn't an awful lot of time. Spartan-3 is fast, but not the fastest FPGA known to man, least of all in the slower speed grade at the highest temperature and lowest voltage. Do you REALLY need to push the limits that far? I would go for, let's say, 50 C max and intentionally tune the core voltage up to the maximum limit. This will probably give you enough margin to run the adder at 250 MHz.

Looking at the output of the timing analyzer, you can see that there is almost nothing more to gain from additional pipeline stages. There is no logic shorter than 1 LUT level. And even splitting up the adder into two 4 bit adders doesn't help, since the carry chain propagation is the smallest part of the timing chain (just 120 ps per two bits, so at most 3 x 120 ps from A_0 to C_8).

Regards Falk

Falk Brunner

ISE 7.1 defaults to 85 C and 1.14 volts; I thought the idea was that if a design meets timing under these limits you were safe under "normal" conditions.

Hopefully I won't REALLY be pushing the limits that far....

Paul Smith

Paul,

You are getting an extra level of logic, most likely because of the way the synthesizer handles an add/subtract. I am assuming you coded it something like:

if sub='1' then q

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Ray Andraka

Hi,

Another solution is to fully pipeline the adder, but it requires that the result can be pipelined and that you are not area limited.

This solution doesn't use the carry chain; instead it builds a normal adder with one LUT to calculate the carry bit and one LUT to calculate the result bit. Each LUT is directly connected to a DFF.

Making it 9 or 10 bits doesn't change the speed, just the size. So the critical path is DFF->LUT->DFF, which should meet your speed requirement.

The code below is a quick and dirty implementation of this. It can do 3 ns in a Spartan3-4. It can be improved both in area and speed. If you want it faster, then the LUTs and DFFs need to be floorplanned using RLOC.

Göran Bilski

library IEEE;
use IEEE.std_logic_1164.all;

entity adder is
  generic (
    Size : natural := 8);
  port (
    Clk : in  std_logic;
    A   : in  std_logic_vector(Size-1 downto 0);
    B   : in  std_logic_vector(Size-1 downto 0);
    Res : out std_logic_vector(Size-1 downto 0)
    );
end entity adder;

architecture IMP of adder is

  type array_type is array (natural range 0 to Size) of std_logic_vector(Size-1 downto 0);
  signal A_Temp   : array_type;
  signal B_Temp   : array_type;
  signal Res_Temp : array_type;
  signal Carry    : std_logic_vector(Size downto 0);

begin  -- architecture IMP

  Res_Temp(0) <= (others => '0');
  carry(0)
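The posted listing is cut off at this point in the archive. As a rough illustration of the scheme described in the post (one sum bit and one carry bit computed per pipeline stage, each going straight into a flip-flop), a self-contained sketch could look like the following. It is not Göran's original code: the entity name `pipelined_adder`, the extra result bit for the carry-out, and the stage logic are assumptions written for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical fully pipelined ripple adder without the carry chain.
-- Stage i resolves bit i; latency is Size clocks.
entity pipelined_adder is
  generic (
    Size : natural := 8);
  port (
    Clk : in  std_logic;
    A   : in  std_logic_vector(Size-1 downto 0);
    B   : in  std_logic_vector(Size-1 downto 0);
    Res : out std_logic_vector(Size downto 0));  -- assumption: carry-out kept as MSB
end entity pipelined_adder;

architecture sketch of pipelined_adder is
  type array_type is array (natural range 0 to Size) of std_logic_vector(Size-1 downto 0);
  signal A_Temp   : array_type := (others => (others => '0'));
  signal B_Temp   : array_type := (others => (others => '0'));
  signal Res_Temp : array_type := (others => (others => '0'));
  signal Carry    : std_logic_vector(Size downto 0) := (others => '0');
begin
  -- Stage 0 is just the unregistered inputs.
  A_Temp(0)   <= A;
  B_Temp(0)   <= B;
  Res_Temp(0) <= (others => '0');
  Carry(0)    <= '0';

  stages : for i in 0 to Size-1 generate
    process (Clk)
      variable r : std_logic_vector(Size-1 downto 0);
    begin
      if rising_edge(Clk) then
        -- Carry operands and partial result along with the sample.
        A_Temp(i+1) <= A_Temp(i);
        B_Temp(i+1) <= B_Temp(i);
        r    := Res_Temp(i);
        -- One LUT for the sum bit of this stage ...
        r(i) := A_Temp(i)(i) xor B_Temp(i)(i) xor Carry(i);
        Res_Temp(i+1) <= r;
        -- ... and one LUT for the carry into the next stage.
        Carry(i+1) <= (A_Temp(i)(i) and B_Temp(i)(i))
                      or (Carry(i) and (A_Temp(i)(i) xor B_Temp(i)(i)));
      end if;
    end process;
  end generate stages;

  Res <= Carry(Size) & Res_Temp(Size);
end architecture sketch;
```

Passing whole vectors from stage to stage keeps the code short; bits that are never read again can be trimmed by the synthesizer, but the register count still grows roughly with the square of the word width.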

Göran Bilski

No, if the design meets timing under these conditions, it is safe to run it at 85°C and 1.14V ...

Sylvain

Sylvain Munaut

How much "margin" is built in to the timing specs? If the Xilinx timing analyzer tells me the design meets a 3.999 ns cycle time should I worry about running it at 4 ns? If not, how much margin should I allow for (as a rule of thumb)?

Paul

Paul Smith

That'll work, but it occupies an awful lot of area for not a lot of speed gain when you consider that you need skew and deskew registers on the inputs and output. You'll use up less area by running two (or more) adders in parallel and distributing the inputs in a round-robin fashion. If you only count the input registers and adders, this doesn't sound like much of an area savings over the fully pipelined adder (you save the deskew register, but that's about it). However, you also have to consider that when you use the carry chain, you essentially get two LUTs for the price of one. Each bit of the adder requires a sum function and a carry function, which without the carry chain occupies two LUTs or a complete slice. By using the carry chain logic, you get the carry function for each bit for free, so it only occupies one LUT (half a slice) per bit. The long and short of it is that with the Xilinx CLB structure, you'll get a more efficient result by using parallel adders than by pipelining a single adder.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Ray Andraka
