Floating point reality check

I have recently been working on a floating point unit for a Virtex 4 SX 35. I have a floating point adder and a floating point multiplier. The adder has 6 pipeline stages and the multiplier has 3 stages.

The idea behind this project is to find out the kind of floating point performance that is possible in a modern FPGA.

Our floating point format uses up to 15 bits of mantissa (with an implicit one) and up to 10 bits of exponent. We have managed to get a complex butterfly running at up to 250 MHz. So my question now is whether these numbers are reasonable, or whether anyone knows of a reference to a faster FPU.
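To make the format concrete, here is a minimal Python sketch of how a value could be packed into such a format. Several details are assumptions that the post does not state: a separate sign bit, 15 stored fraction bits in addition to the implicit one, an IEEE-style exponent bias of 511, an all-zero exponent meaning zero, and round-to-nearest. The names `encode` and `decode` are illustrative only.

```python
import math

MANT_BITS = 15                      # stored fraction bits (implicit one not stored; assumed)
EXP_BITS = 10
BIAS = (1 << (EXP_BITS - 1)) - 1    # 511, assumed IEEE-style bias

def encode(x):
    """Pack a Python float into (sign, exponent, fraction) fields."""
    if x == 0.0:
        return (0, 0, 0)            # assumption: all-zero exponent encodes zero
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))       # abs(x) = m * 2**e with 0.5 <= m < 1
    # Rescale to 1.f form: abs(x) = (2*m) * 2**(e - 1)
    frac = round((2 * m - 1) * (1 << MANT_BITS))
    exp = e - 1 + BIAS
    if frac == (1 << MANT_BITS):    # rounding carried into the next power of two
        frac = 0
        exp += 1
    if not 0 < exp < (1 << EXP_BITS):
        raise OverflowError("exponent out of range for this format")
    return (sign, exp, frac)

def decode(sign, exp, frac):
    """Unpack (sign, exponent, fraction) fields back into a Python float."""
    if exp == 0:
        return 0.0
    value = (1 + frac / (1 << MANT_BITS)) * 2.0 ** (exp - BIAS)
    return -value if sign else value
```

With 15 fraction bits and round-to-nearest, the relative error of a round trip through `encode` and `decode` is at most 2**-16, which gives a feel for the precision being traded for clock rate.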

We have already used some tricks to improve performance, for example by manually instantiating LUTs so that we can build an adder with a 2 to 1 MUX on one of the operands using only one LUT per bit.
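A bit-level Python model may help show why the mux costs no extra LUTs. It assumes the usual Virtex-4 slice arrangement of a 4-input LUT feeding MUXCY and XORCY: the LUT sees one bit of each of the three operands plus the select line and produces the propagate signal directly. This is a sketch of the general technique, not necessarily the exact circuit used in this design, and the function name `lut_mux_add` is made up for the example.

```python
def lut_mux_add(a, b, c, sel, width=16):
    """Model of a + (c if sel else b) using one 4-input LUT per bit.

    Returns (sum truncated to `width` bits, carry out).
    """
    carry = 0
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        ci = (c >> i) & 1
        # 4-input LUT (ai, bi, ci, sel): the 2:1 mux is folded into the propagate term
        p = ai ^ (ci if sel else bi)
        # XORCY: sum bit is propagate XOR carry-in
        result |= (p ^ carry) << i
        # MUXCY: pass the carry along when propagating, otherwise take ai
        carry = carry if p else ai
    return result, carry
```

When `p` is 0 the two operand bits are equal, so using `ai` as the carry-chain data input gives the correct generate or kill behaviour without a second LUT.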

We have also tried to build up the design using RLOC:ed modules. This did not lead to improved performance as compared to a non RLOC:ed design. This could change once we start to fill up the device though. At the moment we are only utilizing about 20% of the FPGA.

/Andreas

Reply to
Andreas Ehliar

I've got a floating point 4/8/16 point FFT kernel for Virtex4 that meets timing at 400 MHz in the -10 speed grade part (limited by the speed of the DSP48 and BRAM blocks). It has 24 bit mantissas and 8 bit exponents (IEEE single precision floating point). Instances of that kernel are combined to obtain three parallel 400 MS/sec single precision 64 to 2048 point floating point FFTs for an aggregate continuous complex data stream of up to 1.2 GS/sec. The 3 parallel 64-2k pt FFTs fit into a single V4SX55-10 device, along with QDR-II RAM interfaces.

You need to use the adders in the DSP48s in order to reach 400 MHz clock rates; the LUT carry chains are too slow. Reaching the 400 MHz performance with the density needed requires considerable hand-optimization as well as a number of algorithmic tricks. You also won't get the density if you start with floating point math operations as your basic building blocks.

Reply to
Ray Andraka

That's pretty impressive. How did you implement the carry-kill chain, or whatever they call the circuit that finds the location of the leading '1'? This can be made with a carry chain, but I don't know if it would work with a 2.5 ns period. -Kevin
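For readers unfamiliar with the leading-one problem, here is a small Python sketch of one common alternative to a long carry chain: a binary search over the word, where each halving step could in principle become its own pipeline stage. This is only an illustration of the general idea, not the method used in the design discussed above; the function names and the 32-bit default width are assumptions.

```python
def leading_zeros(x, width=32):
    """Count leading zeros of a `width`-bit value (width must be a power of two)."""
    if x == 0:
        return width
    mask = (1 << width) - 1
    count = 0
    step = width // 2
    while step:
        # If the top `step` bits are all zero, skip over them
        if (x >> (width - step)) == 0:
            count += step
            x = (x << step) & mask
        step //= 2
    return count

def normalize(mant, exp, width=32):
    """Shift the leading one to the top bit and adjust the exponent to match."""
    if mant == 0:
        return 0, 0                 # assumption: zero maps to a zero exponent
    shift = leading_zeros(mant, width)
    return (mant << shift) & ((1 << width) - 1), exp - shift
```

For a 32-bit word this takes five compare-and-shift steps, so the logic depth per step stays small even at a short clock period, at the cost of extra pipeline latency.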

Reply to
Kevin Neilson

A clever use of DSP48 and BRAM blocks. The fabric carry chain definitely won't reach 400MHz, especially with 30 bits.

Reply to
Ray Andraka

I'm just curious about how many pipeline stages/clock cycles it takes for one sample to be processed, and also how many of these pipeline stages the normalization uses.

\Per

Reply to
Per Karlström

128 clock pipeline for the 4/8/16 point FFT kernel. It accepts a complex sample on each clock cycle. The normalization is an 11 clock pipeline.

Reply to
Ray Andraka
