How to choose an FPGA for a huge computation?

I posted a topic several days ago; here is the link.

formatting link

The total computation is described below:

  integer add:     2442 Giga operations
  float add:        814 Giga operations
  float subtract:  2424 Giga operations
  float multiply:  1610 Giga operations
  float divide:     809 Giga operations

I need these operations done in 1 to 3 minutes, so what kind of FPGA is needed? Should I use multiple FPGAs to finish the computation?

What if I need the work done in 10 minutes?

I prefer Xilinx FPGAs, and if possible a lower speed grade device is better, at least for budget reasons.
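For scale, a quick back-of-the-envelope in C using only the totals above (the sum is about 8.1 Tera-operations) shows the sustained throughput each deadline implies:

#include <stdio.h>

int main(void)
{
    /* Operation counts from the post above, in Giga-operations. */
    double total_gops = 2442.0 + 814.0 + 2424.0 + 1610.0 + 809.0;  /* 8099 */

    /* Sustained throughput required for each deadline. */
    printf("3 minutes:  %.1f Gops/s\n", total_gops / (3.0 * 60.0));   /* ~45.0 */
    printf("10 minutes: %.1f Gops/s\n", total_gops / (10.0 * 60.0));  /* ~13.5 */
    return 0;
}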

Reply to
hitsx

I think this will be very expensive with FPGAs. Maybe this is a better solution:

formatting link

I've read an article about it in a magazine, and they claim it is a factor of 25 faster than a typical desktop if the algorithm is highly parallelizable. Use multiple cards and you can reach any speed you want, if it is parallelizable.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss


Well, it depends on the amount of data to be processed, and whether it is possible to pipeline the computation.

Could you describe the algorithm, without going into too many details?

-Michael.

Reply to
Michael Jørgensen

I think a fast PC can do this easily: 2.5G operations in 3 minutes is 14M operations per second.
--
Reply to nico@nctdevpuntnl (punt=.)
You can find companies and shops at www.adresboekje.nl
Reply to
Nico Coesel


Regardless of the FPGA you choose, or the implementation of your main processing loop, don't even start until you've done a thorough IO / memory bandwidth analysis. Even in 2007 we're still seeing papers saying "we got 20X speed up in the core, then when we put it on the memory bus we got 1.5X".
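As a hedged sketch of that kind of budget check (the 40-bytes-per-result figure assumes a kernel streaming four doubles in and one double out, like the loop posted later in this thread; the DDR2 figure is a typical 2007 peak):

#include <stdio.h>

int main(void)
{
    /* Assumed kernel: read a, b, c, d and write c, all doubles. */
    double bytes_per_result = 5.0 * 8.0;      /* 40 bytes per result      */
    double results_per_sec  = 45e9 / 4.0;     /* 45 Gops/s at ~4 ops each */
    double needed_gbs = bytes_per_result * results_per_sec / 1e9;

    printf("required bandwidth: %.0f GB/s\n", needed_gbs);  /* ~450 GB/s */
    printf("one DDR2-800 channel peaks near 6.4 GB/s\n");
    return 0;
}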

A wise person once said to me "interfaces before implementation" - and he was right. Unless your algorithm gets its input from /dev/zero and sends its output to /dev/null (in which case why bother?), the most difficult part of your design will be the interfaces, not the computational kernel.

Exceptions might be the rare occasion that you have a huge amount of computational work to do on a small data set (crypto key cracking maybe?).

Also, see if you can trade the floats for fixed-point integer. The software people won't like it because they want the same 30 insignificant digits to compare with the software version, but it's your job to convince them otherwise.
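A minimal sketch of that trade, assuming Q16.16 gives enough range and precision for the data (that assumption is exactly the argument to have with the software people):

#include <stdint.h>

typedef int32_t q16_16;                       /* 16 integer bits, 16 fraction bits */

static inline q16_16 q_from_double(double x)  { return (q16_16)(x * 65536.0); }
static inline double q_to_double(q16_16 x)    { return x / 65536.0; }

static inline q16_16 q_mul(q16_16 a, q16_16 b)
{
    /* 64-bit intermediate keeps the full product before rescaling. */
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* c = a*b + c, entirely in integer hardware. */
static inline q16_16 q_mac(q16_16 a, q16_16 b, q16_16 c)
{
    return q_mul(a, b) + c;
}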

Good luck,

John

Reply to
John Williams

Memory bandwidth is quite a substantial problem. In my algorithm, most of the computation is something like this:

double a, b, c, d;
int i, j, m;
/* The loop bound and body were cut off in the original post; the kernel
   quoted later in the thread is (a*b+c)/d, and N is a placeholder bound. */
for (i = 1; i < N; i++)
    c = (a*b + c)/d;

Reply to
hitsx

snipped-for-privacy@hit.edu.cn wrote: ...

Did you try to write the equation containing the indexes?

--
Uwe Bonnes                bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
Reply to
Uwe Bonnes

Yes. I should rewrite it:

double a[ii][jj][kk];
double b[ii][jj][kk];
double c[ii][jj][kk];
double d[ii][jj][kk];
int i, j, m;
/* The loop nest was cut off in the original post; what follows is the
   indexed form of the kernel quoted later in the thread. */
for (i = 1; i < ii; i++)
    for (j = 1; j < jj; j++)
        for (m = 1; m < kk; m++)
            c[i][j][m] = (a[i][j][m]*b[i][j][m] + c[i][j][m]) / d[i][j][m];

Reply to
hitsx


More questions.

I assume you have to do this computation repeatedly? Which part of the computation changes between runs? All of a, b, c, d, or just one or two? Or just the offsets, maybe?

Do your 7 stages each use the value of c computed from the previous stages? Does each stage use the same a, b, d as the previous ones, or different?

Reply to
PFC

It is hard to see exactly what you are doing without seeing it...*grin*... but in that spirit, check out the paper:

"Application-Specific Memory Interleaving for FPGA-Based Grid Computations: A General Design Technique," by Tom VanCourt and Martin Herbordt

formatting link

alan

Reply to
ajjc

In many respects that is still putting the cart before the horse.

One needs to step back, divine the global architecture of the application, and determine whether this is even a practical problem for an FPGA, or whether it should be moved to a more optimal CPU/Cache/Memory architecture. For a large data set, memory bandwidth isn't going to be substantially different, unless the algorithm can be twisted to reduce memory bandwidth by a different processing ordering or local caching. This is one area where a CPU/Cache/Memory architecture may well simply smoke any possible FPGA design.
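A hypothetical illustration of "a different processing ordering": a tiled transpose that finishes each block while it is still cache-resident (the array size and tile size are made up):

#define N    4096
#define TILE 64    /* chosen so a TILE x TILE block of doubles fits in cache */

void transpose_tiled(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            /* Finish one TILE x TILE block before moving on, so the
               column-strided reads of src hit cache instead of DRAM. */
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}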

For very large data sets, processors like the Itanium 2 with 12MB and larger caches will frequently smoke a typical PC or FPGA implementation, simply by moving significant portions of the raw memory traffic into faster caches.

One side effect of this is that the existing source code for the application may have already been heavily optimized to squeeze every possible memory cycle out of the problem. The resulting code is probably larger, and possibly far too complex as a starting point for any FPGA design. One might well have to reverse engineer the original, simpler algorithms first, in the hope that there is indeed an embarrassingly parallel FPGA solution that avoids the raw memory bandwidth.

Other cases of interest are choices in data path sizes ... they might all be 64-bit simply because the code was optimized for a high-end HPC engine that was 64-bit native. After stepping back and looking at the problem very closely from an architecture and requirements perspective, sometimes insights emerge that the problem doesn't need all that significance and dynamic range everywhere ... allowing the FPGA implementation to be partitioned to match the real problem's needs, not what had formerly been done in the prior solution.

Once the architecture, data, and algorithm issues are well understood, and we have a fair idea what the processing kernel must do, then looking at matching interface designs is not only required but finally practical too. Designing interfaces before the architecture and processing kernels are understood is doing work toward an undefined requirement.

Performance by Design
John

Reply to
Totally_Lost

Depending on the accuracy and bit size, you can get up to 160 ns per calculation of 'c' using 16 bits. '(a*b+c)/d' is performed in one clock cycle. The division operation requires a higher-rate clock to complete within the allotted time; the worst case is 2.4 us for 64-bit. This varies depending on whether pipelining can be used.

Once the process is created and tested, further optimization may be possible to get more speed. It all depends on the application. Division is the slowdown: if 'd' is converted to its reciprocal, a faster multiplication can be used, resulting in 2 ns (500 MHz clock) per calculation of ((a * b + c) * (1/d)). That's 500 million complete calculations per second, or 1000 million multiplications and 500 million additions. As you can see, more complex equations could be performed within a single cycle, providing even more performance advantages over a DSP.

If you can provide more detail on the equations and the accuracy of the variables, I can give you much more accurate times and say whether a DSP is better for your application.
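A sketch of that reciprocal trick in C, assuming 'd' is shared across many elements so the one-time divide amortizes (the names are illustrative):

/* One up-front division, then a multiply per element instead of a divide. */
void mac_scale(double *c, const double *a, const double *b, int n, double d)
{
    const double recip_d = 1.0 / d;                 /* divide once          */
    for (int i = 0; i < n; i++)
        c[i] = (a[i] * b[i] + c[i]) * recip_d;      /* multiply per element */
}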

snipped-for-privacy@charter.net

Reply to
evansamuel

After exchanging side emails with the OP, this is roughly a giga-pixel-plus 3D medical imaging problem.

Unfortunately, the loop above is only one of a half dozen intermediate loops that must be calculated per dataset for the user's application. The OP's existing algorithm requires processing the data in one particular loop order, and then in different loop orders, to scale, smooth, and extract features from the raw 3D pixel data. Some of the OP's loops can be combined with minor changes in loop order that should be transparent; others clearly cannot, requiring the complete processing of one pass before the next can begin.

The different loop orders require different "relatively random" accesses to the memory array holding the data, forcing the application to be dependent on up to a half dozen memory read cycles per write cycle (depending on memory width and the ability to cache in the FPGA). This requirement comes from the 3D image algorithm processing each pixel based on each of the nearby pixels in the 3D space ... F(x,y,z) + F(x+1,y,z) + F(x,y+1,z) + F(x,y,z+1) + ... etc., some half dozen or so memory references per resulting pixel, per pass. The algorithm specification from their sponsor is in MatLab, with implied iteration and function references in the inner loop, so some of these operations are significantly more complex than the OP's 3-variable MAC loop above.
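A minimal sketch of that access pattern (the array names, bounds, and plain summation are illustrative; the point is the half dozen neighbour reads per voxel written):

#define NX 1024
#define NY 1024
#define NZ 1024   /* a giga-voxel volume, per the description above */

void neighbour_pass(double out[NZ][NY][NX], const double in[NZ][NY][NX])
{
    for (int z = 1; z < NZ - 1; z++)
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)
                /* Six neighbour reads plus the centre for every voxel
                   written: the memory traffic that dominates each pass. */
                out[z][y][x] = in[z][y][x]
                             + in[z-1][y][x] + in[z+1][y][x]
                             + in[z][y-1][x] + in[z][y+1][x]
                             + in[z][y][x-1] + in[z][y][x+1];
}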

So the limiting performance of this application is the sum of some significant memory latencies, plus FP multiply/square/divide/sqrt (plus a few trig) operation latencies, constrained by some hard serializations in the data sequencing. The problem is probably solvable in FPGAs with some significant risks, and is an excellent fit for SIMD/MIMD CPU architectures, such as a GPU/Cell application, with multiple independent memory arrays to obtain the needed memory bandwidth.

The OP would like to do this with an off-the-shelf FPGA board/system on a very small budget (about 3 months of a senior designer's burdened salary at western labor rates), including hardware. In a traditional commercial setting the available budget would cover neither the labor nor the hardware prototype, so the students involved are looking for a solution inside this tiny budget using student labor rates and off-the-shelf hardware. The project sponsor will get a super bargain if these kids can pull it off.

Not impossible, but certainly an interesting tough nut, and a dream real-life research project for the kids either way.

Reply to
Totally_Lost
