V5 and carry lookahead

I was excited when I read "carry lookahead" with respect to V5. But looking at the diagrams in the user guide, it looks to me like ripple carry. I don't want to be picky, but carry lookahead to me means (poly)logarithmic growth of delay with respect to adder length. The timing model (as far as I understand it) suggests that the delay grows linearly with the adder length. So which is it, ripple carry or lookahead?
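
To illustrate the growth difference I mean, here is a rough sketch in Python with made-up per-stage delays (the numbers are not from any datasheet): ripple carry scales linearly in the width, a lookahead tree roughly with log2 of the width.

import math

# Toy delay model -- illustrative numbers only, not V5 (or any device) timing.
T_RIPPLE_PER_BIT = 20e-12   # assumed per-bit ripple carry delay, seconds
T_CLA_PER_LEVEL  = 150e-12  # assumed delay per lookahead tree level, seconds

def ripple_delay(n_bits):
    return n_bits * T_RIPPLE_PER_BIT                         # linear in width

def lookahead_delay(n_bits):
    return math.ceil(math.log2(n_bits)) * T_CLA_PER_LEVEL    # ~log2(width)

for n in (16, 32, 64, 128, 256):
    print(f"{n:4d} bits: ripple ~{ripple_delay(n)*1e12:6.0f} ps, "
          f"lookahead ~{lookahead_delay(n)*1e12:6.0f} ps")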

It is clear that FPGAs with a linear layout of adders ultimately approach a linear delay versus adder length, but if wire delay is already the dominant problem, then a more compact arrangement, e.g. along a Sierpinski curve, could be used.

There is a paper by Hosler, Hauck and Fry from '97 which discusses several adder designs with respect to FPGAs, but at 65 nm, wire delay, even with optimal buffering, would have to be considered.

Andreas

Reply to
acd

It is carry-lookahead over 4 bits, and ripple-carry between these 4-bit slices. The "effective ripple" delay is 21 ps per bit, and that's what counts. And it includes the wire delay. Yes, the carry delay grows linearly with the bit-length, but it is a very short delay per bit. Peter Alfke, Xilinx Applications
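
(Multiplying that per-bit figure out for a few widths -- a quick sketch using nothing but the 21 ps/bit quoted above, so it ignores the cost of getting on and off the carry chain:)

PS_PER_BIT = 21  # "effective ripple" figure quoted above

for width in (16, 32, 64, 128):
    print(f"{width:3d}-bit add: ~{width * PS_PER_BIT} ps of carry delay")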

Reply to
Peter Alfke

Hi Peter,

Did any of the Xilinx guys do a performance study of wider adders with this new carry architecture, in relation to carry-select or Brent-Kung FPGA implementations? Or could you perhaps offer revised versions of those for the V5, given the new tradeoffs with the 6-LUT and carry changes?

Have fun! John

Reply to
fpga_toys

Yes indeed. Obviously the new LUT6 architecture changes the playing field somewhat when it comes to arithmetic. There has been plenty of work done on identifying the optimal mappings for basic arithmetic functions so the tools can do a Good Job. (Nominally. :))

The improvements in the carry chain speed are substantial. Although there's still a noticeable hit when getting on and off the chain, the raw propagation speed is a real step up from previous generations. The fabric speed is really catching up to the embedded IP blocks now...

Cheers,

-Ben-

Reply to
Ben Jones

We've already been looking at technology-specific mapping for FpgaC, and one of the things we noticed was that LUT4s didn't pack well with arithmetic, so we were already looking at F5/F6 to improve that. Building to LUT6s is certainly a better fit for the netlists we generate, so my response is YIPPIE :)

The 64x1 LUT RAMs are also a blessing, as they make it far easier to support many applications with short arrays of that size ... where the 16- and 32-deep arrays are frequently not enough. Is there an expander function in the slice fabric to cascade these, like the 32x1 in the V2 and V2Pros? Dual-port fabric?

Any chance I can get some better docs and suggested arithmetic implementations so we can target these devices with the new technology mapper?

I'm interested in performance for 32-bit and 64-bit arithmetic as Long and Long Long variables. Will it be the case that the carry logic is slower than lookahead functions, as with the current carry chains?

Reply to
fpga_toys

Hi John,

I love the LUT6 architecture, particularly for muxes (4:1 in a single LUT, 16:1 in a single slice, with no wasted inputs).

I don't believe you get anything to cascade between slices, but a single SLICEM will give you 256x1-bit by using all four LUTs. You can also get a variety of dual-port configurations: up to 128x1 true dual-port, or up to 64x3-bit simple dual-port per slice. My personal favourite: 32x2 or 64x1 quad-port per slice (that's 1xRW and 3xRO ports).
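
(A quick capacity check of those configurations -- my own sketch, assuming a SLICEM holds 4 x 64-bit LUT6 RAMs = 256 raw bits, with the multi-port modes replicating data across LUTs, which is why their depth x width comes out below 256:)

RAW_BITS = 4 * 64  # assumed raw LUT-RAM bits per SLICEM

configs = {
    "256x1 single-port":     (256, 1),
    "128x1 true dual-port":  (128, 1),
    "64x3 simple dual-port": (64, 3),
    "64x1 quad-port":        (64, 1),
    "32x2 quad-port":        (32, 2),
}

for name, (depth, width) in configs.items():
    print(f"{name:22s}: {depth * width:3d} user bits of {RAW_BITS} raw LUT bits")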

I don't know how much of that information gets published - not so much a secrecy thing as an hours-in-the-day thing. Mostly it's seen as being of internal interest only. (Your [external] interest has been duly noted. :))

I don't have exact details to hand, but the carry-chain delay (CIN->COUT) in V5 is about the same as V4 - maybe slightly shorter - but for 4 CY stages per slice, not just 2. i.e. carry-chain dominated logic could potentially go around 2x faster. The difference between 32-bit add and 64-bit add is therefore around 600ps... so for the majority of applications, the carry chain takes some beating! However, straightforward ripple-carry arithmetic can be a bit wasteful of LUT input resources.
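
(Turning those figures into a rough comparison -- a sketch only, assuming 2 add bits per slice V4-style vs 4 per slice V5-style, and reusing the ~21 ps/bit number from earlier in the thread; none of these are official timing numbers:)

V5_PS_PER_BIT = 21  # effective ripple figure quoted earlier in the thread

def slices_needed(width_bits, bits_per_slice):
    return -(-width_bits // bits_per_slice)   # ceiling division

for width in (32, 64):
    v4 = slices_needed(width, 2)
    v5 = slices_needed(width, 4)
    print(f"{width}-bit add: {v4} slices V4-style vs {v5} slices V5-style, "
          f"~{width * V5_PS_PER_BIT} ps of V5 carry delay")

# 64-bit minus 32-bit on V5: (64 - 32) * 21 ps ~= 670 ps, the same ballpark
# as the ~600 ps difference mentioned above.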

It's also possible (with some degree of cunning) to create an efficient 3-input adder in the fabric, although there is some speed penalty to this.
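
(One common way to do that -- an assumption on my part, since the method isn't spelled out above -- is a carry-save stage: a row of 3:2 compressors reduces the three operands to two, and a single ordinary carry-propagate add finishes the job. A bit-level sketch:)

def add3_carry_save(a, b, c, width=32):
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                            # per-bit sums, no carries yet
    k = (((a & b) | (a & c) | (b & c)) << 1) & mask   # per-bit carries, shifted up
    return (s + k) & mask                             # one normal add on the carry chain

# Spot check against plain addition.
assert add3_carry_save(123, 456, 789) == 123 + 456 + 789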

Cheers,

-Ben-

Reply to
Ben Jones

Yep ... I was already looking at that for the RC5 cracker demo code I did last year, as it should have a much better fit and performance. It wouldn't take that many LX330 devices to have equivalent performance to all of dnet, assuming you can actually power the device and keep it cool when fully packed.

Yippie ... that is more than enough (for now) ... and the dual/quad-port configurations are exactly what I've found useful in FpgaC for typical loops: one, two or three references plus a writer. Being able to have both the array storage and most of the arithmetic LUTs packed into the same slice/CLB really cuts down on routing requirements/delays.

Hmm ... interesting ... space/time tradeoffs are another area we need to spend more time looking at for FpgaC. So far that balance has been static, and favors performance in most cases. Dense packing like that could certainly be useful.

One of the interesting side effects of doing bit-level optimization and packing in FpgaC is that applications like the RC5 cracker end up packing both the arithmetic and the barrel-shifter components into the same LUT, avoiding wasted inputs and logic levels (which offsets the poor/general technology mapping to some extent) ... and that just gets better with LUT6s. The downside is that it gets harder to extract from the truth table possible fits to specialized logic in the slice, as the truth table grows 2^n in size and the number of permutations to search does as well.
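
(Just to put numbers on that growth -- a tiny sketch: an n-input function has 2**n truth-table rows, and there are 2**(2**n) distinct n-input Boolean functions to consider when searching for fits:)

for n in (4, 5, 6):
    rows = 2 ** n
    print(f"LUT{n}: {rows} truth-table rows, {2 ** rows:,} possible functions")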

Reply to
fpga_toys
