Best FPGA for floating point performance

Hello,

Has anybody already made a comparison of the high-performance FPGAs (Stratix II, V4, ...) with respect to double-precision floating-point performance (add, mult, div, etc.)?

It's for an HPC application.

Thanks

Marc

Reply to
Marc Battyani

Marc,

IEEE floating point standard? You need to be more specific.

Does it need to integrate with a processor?

I believe the Xilinx IBM PowerPC 405, using the APU interface in Virtex-4 with the floating-point IP core, provides the best and fastest performance.

Especially since no other FPGA vendor has a hardened processor to compete with us.

If all you want is the floating point processing, without a microprocessor, then I think you will find similar performance between Xilinx and our competition, with us (of course) claiming the superior performance edge.

It would not surprise me at all to see them also post claiming they are superior.

For a specific floating-point core, with a given precision and given features, it would be pretty easy to benchmark, so there is very little wiggle room here for marketing nonsense.

I would be interested to hear from others (not competitors) about what floating point cores they use, and how well they perform (as you obviously are interested).

Austin

Marc Battyani wrote:

Reply to
Austin Lesea

Marc Battyani ( snipped-for-privacy@fractalconcept.com) wrote: : Hello,

: Does anybody already made a comparison of the high performance FPGA (Stratix : II, V4, ?) relative to double precision floating point performance (add, : mult, div, etc.) ?

: It's for an HPC aplication.

Hi Marc, I don't have a comparison of various cores, but a lot of info is out there in datasheets.

However, in an HPC application the performance of your maths cores may not be the bottleneck; rather, it is likely to be a question of how fast you can interface the host system to the FPGA, and how fast you can shunt data around between CPU, CPU RAM, FPGA, FPGA RAM, etc.

The heavyweight HPC/FPGA hybrid systems I have seen, such as the Cray-XD1 and SGI NUMAflex/Altix stuff use Xilinx FPGAs.

Although I wouldn't want to generalise for the whole field, other interested parties such as Nallatech and Starbridge Systems tend to go for Xilinx.

Certainly Xilinx seem to have a head start in the field (not thanks to their tools, from the word on the street :-) - possibly this has more to do with interfacing than FP core performance.

Not answering the original question, but there you go :-)

Cheers, Chris

(A strong believer in FPGA-type stuff for HPC, although perhaps the granularity is less than optimal and the tools not very well suited, but hey, it's early days.)

: Thanks

: Marc

Reply to
c d saunter

While an x86 or Cell cluster could whip an FPGA at IEEE FPU in raw clock speed (I am not sure about cost, though), you can flip the odds somewhat by defining your own numerics with a direct mapping to the plentiful 18-bit multipliers.

If I am not mistaken, IEEE is not the be-all and end-all of FPU, and it has a certain number of detractors, especially in some fields, regarding rounding, exceptions, etc. If you do define your own FP format, you can simulate it fairly easily right in your HPC app and see if it gives comparable results. For instance, 1, 2, or 4 multipliers running a 37-bit mantissa might be enough to avoid double-precision IEEE; only you can figure that out.

I think I would even go for a custom CPU design with a highly serial 18x18 datapath and try to pump it as fast as the fabric will allow. I notice that the soft-core FPUs out there don't run anywhere near the 300 MHz speeds being quoted for multiplier units. Perhaps the V4 500 MHz DSP block can be microcoded into a decent FPU, but as soon as you need the odd features...

Anyway, I think that's what I would do; if that doesn't work too well, then I'd look at QinetiQ and the other vendors - those links can be found on the Xilinx and Altera sites.

So what is your app and what hardware are you running on?

Reply to
JJ

"Austin Lesea" wrote

IEEE 754. It's for a computational accelerator. It will get values from a general purpose processor (Xeon, Itanium, etc.) and send the results back in the same format.

Though the internal computations could be done in another format.

The other stuff needed is pretty standard (PCI(-X or Express), DDR2, etc. )

No.


OK, that one is easy. ;-)

The idea is to hardwire some formulas, doing the maximum number of concurrent FLOPs. This is the only way to go faster than a very fast processor like an Itanium 2 or even a simple Xeon.

Sure! And this time it should be easy to get useful technical numbers.

Marc

Reply to
Marc Battyani

JJ,

Perhaps you should read:

formatting link

first?

At 429 MHz on a Virtex-4, a square root takes 56 clocks, or 130.5 ns, for the answer: 7.66 million floating-point square roots per second.

And, if you need more, you can implement more than one core, and get more than one answer per 56 clocks....

I am not aware of any x86 that can run quite that fast (even for one core). Their claims are that the floating-point hardware unit speeds up the software execution by at least a factor of 5. We are talking here about a speedup of 80 to 100 times over using fixed-point integer software to emulate a floating-point square root... not a factor of 5!

Austin

JJ wrote:

Reply to
Austin Lesea


Yes, memory bandwidth is one of the bottlenecks, especially for the general purpose processors.

Very interesting. In fact this is what we want to do (on a smaller scale, probably ;-). I find it somewhat depressing to see that Cray can't come up with something much better than a bunch of FPGAs, but at the same time it's very cool to have access to the same technology as Cray. Or even better, as they seem to use Virtex-II :)

OK.

Well, in fact I'm also interested in the whole HPC/FPGA question anyway.

Sure, much fun anyway.

Marc

Reply to
Marc Battyani

Using a grid is fine when the problem can be parallelized with a rather coarse granularity, but that's not always the case.

Yes, I thought about using a 36-bit mantissa to reduce the number of hard multipliers needed and the latency. The inputs/outputs need to be in IEEE 754, though.

The apps can be rather diverse. In fact, as Christopher pointed out, it looks like we are doing some kind of small Cray-XD1 ;-) As for the hardware, we are designing it.

Marc

Reply to
Marc Battyani

Hi Austin

Very interesting, but V4/S3E is still pretty darn new; I don't check on it every 5 minutes, but QinetiQ is definitely hot in this area (not surprising given their (sq) roots at RSRE).

At some point I will do a detailed study of FPGA FPU design v x86 FPU numbers for my transputer project.

Your application seems a bit clearer now.

Usually when I see HPC-FPGA, I might infer somebody working with Opteron + Virtex-II Pro systems like the Cray or SGI kits, but that doesn't look like the case here.

Regards

JJ

Reply to
JJ

As far as I know, the biggest problem with floating point in FPGAs is the barrel shifter needed to pre- and post-normalize for addition (and subtraction).

Floating point multiply and divide are a little harder than fixed point, but the post normalization only needs to shift one bit. (I think that is right, but maybe two.)

I would assume you can set the rounding mode at compile time, and any other applicable IEEE mode bits.
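glen's barrel-shifter point can be illustrated with a small C model (a software sketch of single-precision significand handling, not HDL; the names are my own): after an effective subtraction of two nearby values, the significand may carry many leading zeros, and the left-shift distance the hardware must support in one cycle is the leading-zero count.

```c
#include <stdint.h>

/* Count leading zeros of a 24-bit significand field (bit 23 is the
 * leading-bit position for IEEE single precision). */
int clz24(uint32_t sig)
{
    int n = 0;
    for (uint32_t bit = 1u << 23; bit != 0 && (sig & bit) == 0; bit >>= 1)
        n++;
    return n;                          /* returns 24 if sig == 0 */
}

/* Post-normalize: shift the significand left until bit 23 is set,
 * adjusting the exponent to keep the value unchanged.  In hardware,
 * this variable-distance left shift is the barrel shifter. */
void normalize24(uint32_t *sig, int *exp)
{
    if (*sig == 0)
        return;                        /* true zero: nothing to do */
    int n = clz24(*sig);
    *sig <<= n;
    *exp -= n;
}
```

A cancellation like 2.0 minus a value just below it can leave only a few significant bits, forcing a shift of up to 23 places; that wide shifter, rather than the adder itself, dominates the fabric cost of FP addition.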

-- glen

Reply to
glen herrmannsfeldt


In fact I'm not sure that full IEEE floating-point accuracy is needed. For sure single precision is not enough, but probably double precision is not really needed. The problem is that the people who write the algorithms do it in C(++) using double-precision floats, double-precision libraries, etc., so it's not obvious what precision is really needed. After all, in an FPGA we can use the exact number of bits needed. (In fact, it is even possible that a fixed-point format could work.)
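One quick way to probe that question inside an existing C/C++ double-precision codebase is to truncate intermediate results at the bit level. This is a sketch (the function name is my own, and it models truncation toward zero, not IEEE rounding):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: keep only the top 'keep' of the 52 explicit mantissa bits
 * of an IEEE-754 double by zeroing the rest.  Valid for
 * 0 <= keep <= 52; finite, normal inputs assumed. */
double keep_mantissa_bits(double x, int keep)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);                 /* bit-pattern access */
    uint64_t drop = (uint64_t)(52 - keep);
    if (drop > 0)
        u &= ~((UINT64_C(1) << drop) - 1);    /* clear low mantissa bits */
    memcpy(&x, &u, sizeof x);
    return x;
}
```

Sweeping `keep` downward from 52 over a representative run shows where the algorithm's results start to drift, which bounds the mantissa width the FPGA datapath actually has to provide.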

Marc

Reply to
Marc Battyani

(snip)

If fixed point will do it, even significantly wider than floating point, it is likely the best way. Floating point add is a lot more expensive than fixed point. The difference is much smaller for multiply and divide, not counting any overhead specific to full IEEE implementations.

-- glen

Reply to
glen herrmannsfeldt

Glen, for this application, I'd argue that floating point might be cheaper if he needs the dynamic range, especially if fixed point pushes him to wider than 35x35 multipliers. A floating-point multiplier has very little extra compared to fixed point, and you can get away with a considerably smaller multiply. He may find that he can get away with a 17-bit significand with floating point (in which case a single multiplier per node in the array is needed), or at worst 4 multipliers for single precision. On the other hand, if his dynamic range demands more than 35-bit multiplication when converted to fixed point, then he's got 9 embedded multipliers per multiply, plus adders to combine the partials.

Generally speaking, using floating point for multiplication and division is cheaper than using fixed point. The opposite is true of addition and subtraction. In this case, however, his addition essentially has to be done in fixed point, so he can do the conversion to fixed using denorms, do the row add, and then renormalize the sum. In any event, I don't see any problems getting this matrix multiply into a Spartan-3 as a floating-point implementation.
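Ray's multiplier-count arithmetic can be made concrete with a C model of how a wide multiply decomposes onto 18x18 embedded blocks. This is a sketch with inputs capped at 32 bits so the exact product fits in a `uint64_t`; each `pN` term stands for one hardware multiplier block.

```c
#include <stdint.h>

/* Map a wide multiply onto 18x18 embedded multiplier blocks by
 * splitting each operand at bit 18 and recombining the four partial
 * products with shifts and adds.  Each pN corresponds to one
 * hardware block; the decomposition pattern is the same one used
 * for wider operands. */
uint64_t wide_mul(uint32_t a, uint32_t b)
{
    uint64_t al = a & 0x3FFFFu, ah = a >> 18;  /* low 18 / high bits */
    uint64_t bl = b & 0x3FFFFu, bh = b >> 18;

    uint64_t p0 = al * bl;                     /* four partial products, */
    uint64_t p1 = al * bh;                     /* one embedded block each */
    uint64_t p2 = ah * bl;
    uint64_t p3 = ah * bh;

    /* a*b = (p3 << 36) + ((p1 + p2) << 18) + p0 */
    return (p3 << 36) + ((p1 + p2) << 18) + p0;
}
```

With three 18-bit chunks per operand the same pattern needs nine partial products, which is the 9-multiplier cost Ray quotes for wide fixed-point operands; shrinking the significand to a single 18-bit chunk drops the cost to one block per multiply.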
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Reply to
Ray Andraka
