How to choose an FPGA for a huge computation?

I posted a topic several days ago; here is the link.

formatting link

The total computation is described below:

  integer add:     2442 Giga operations
  float add:        814 Giga operations
  float subtract:  2424 Giga operations
  float multiply:  1610 Giga operations
  float divide:     809 Giga operations

I need these operations done in 1 to 3 minutes, so what kind of FPGA is needed? Should I use multiple FPGAs to finish the computation?

What if I need the work done in 10 minutes?

I prefer Xilinx FPGAs, and if possible a lower speed grade device is better, at least for budget reasons.
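For scale, a quick back-of-the-envelope in C using only the totals above (the sum is about 8.1 Tera-operations) shows the sustained throughput each deadline implies:

#include <stdio.h>

int main(void)
{
    /* Operation counts from the post above, in Giga-operations. */
    double total_gops = 2442.0 + 814.0 + 2424.0 + 1610.0 + 809.0;  /* 8099 */

    /* Sustained throughput required for each deadline. */
    printf("3 minutes:  %.1f Gops/s\n", total_gops / (3.0 * 60.0));   /* ~45.0 */
    printf("10 minutes: %.1f Gops/s\n", total_gops / (10.0 * 60.0));  /* ~13.5 */
    return 0;
}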

Reply to
hitsx

I think this will be very expensive with FPGAs. Maybe this is a better solution:

formatting link

I've read an article about it in a magazine, and they claim it is a factor of 25 faster than a typical desktop if the algorithm is highly parallelizable. Use multiple cards and you can reach any speed you want, if it is parallelizable.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss


Well, it depends on the amount of data to be processed, and whether it is possible to pipeline the computation.

Could you describe the algorithm, without going into too many details?

-Michael.

Reply to
Michael Jørgensen

I think a fast PC can do this easily: 2.5G operations in 3 minutes is 14M operations per second.
--
Reply to nico@nctdevpuntnl (punt=.)
You can find companies and shops at www.adresboekje.nl
Reply to
Nico Coesel


Regardless of the FPGA you choose, or the implementation of your main processing loop, don't even start until you've done a thorough IO / memory bandwidth analysis. Even in 2007 we're still seeing papers saying "we got 20X speed up in the core, then when we put it on the memory bus we got 1.5X".
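As a hedged sketch of that kind of budget check (the 40-bytes-per-result figure assumes a kernel streaming four doubles in and one double out, like the loop posted later in this thread; the DDR2 figure is a typical 2007 peak):

#include <stdio.h>

int main(void)
{
    /* Assumed kernel: read a, b, c, d and write c, all doubles. */
    double bytes_per_result = 5.0 * 8.0;      /* 40 bytes per result      */
    double results_per_sec  = 45e9 / 4.0;     /* 45 Gops/s at ~4 ops each */
    double needed_gbs = bytes_per_result * results_per_sec / 1e9;

    printf("required bandwidth: %.0f GB/s\n", needed_gbs);  /* ~450 GB/s */
    printf("one DDR2-800 channel peaks near 6.4 GB/s\n");
    return 0;
}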

A wise person once said to me "interfaces before implementation" - and he was right. Unless your algorithm gets its input from /dev/zero and sends its output to /dev/null (in which case why bother?), the most difficult part of your design will be the interfaces, not the computational kernel.

Exceptions might be the rare occasion that you have a huge amount of computational work to do on a small data set (crypto key cracking maybe?).

Also, see if you can trade the floats for fixed-point integer. The software people won't like it because they want the same 30 insignificant digits to compare with the software version, but it's your job to convince them otherwise.
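A minimal sketch of that trade, assuming Q16.16 gives enough range and precision for the data (that assumption is exactly the argument to have with the software people):

#include <stdint.h>

typedef int32_t q16_16;                       /* 16 integer bits, 16 fraction bits */

static inline q16_16 q_from_double(double x)  { return (q16_16)(x * 65536.0); }
static inline double q_to_double(q16_16 x)    { return x / 65536.0; }

static inline q16_16 q_mul(q16_16 a, q16_16 b)
{
    /* 64-bit intermediate keeps the full product before rescaling. */
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* c = a*b + c, entirely in integer hardware. */
static inline q16_16 q_mac(q16_16 a, q16_16 b, q16_16 c)
{
    return q_mul(a, b) + c;
}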

Good luck,

John

Reply to
John Williams

Memory bandwidth is quite a substantial problem. In my algorithm, most of the computation is something like this:

double a, b, c, d;
int i, j, m;
/* The loop bound and body were cut off in the original post; the kernel
   quoted later in the thread is (a*b+c)/d, and N is a placeholder bound. */
for (i = 1; i < N; i++)
    c = (a*b + c)/d;

Reply to
hitsx

snipped-for-privacy@hit.edu.cn wrote: ...

Did you try to write the equation containing the indexes?

--
Uwe Bonnes                bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
Reply to
Uwe Bonnes

Yes. I should rewrite it:

double a[ii][jj][kk];
double b[ii][jj][kk];
double c[ii][jj][kk];
double d[ii][jj][kk];
int i, j, m;
/* The loop nest was cut off in the original post; what follows is the
   indexed form of the kernel quoted later in the thread. */
for (i = 1; i < ii; i++)
    for (j = 1; j < jj; j++)
        for (m = 1; m < kk; m++)
            c[i][j][m] = (a[i][j][m]*b[i][j][m] + c[i][j][m]) / d[i][j][m];

Reply to
hitsx


More questions.

I assume you have to do this computation repeatedly? Which part of the computation changes between runs? All of a, b, c, d, or just one or two? Or just the offsets, maybe?

Do your 7 stages each use the value of c computed from the previous stages? Does each stage use the same a, b, d as the previous ones, or different?

Reply to
PFC

It is hard to see exactly what you are doing without seeing it...*grin*... but in that spirit, check out the paper:

"Application-Specific Memory Interleaving for FPGA-Based Grid Computations: A General Design Technique," by Tom VanCourt and Martin Herbordt

formatting link

alan

Reply to
ajjc

In many respects that is still putting the cart before the horse.

One needs to step back, divine the global architecture of the application, and determine whether this is even a practical problem for an FPGA, or whether it should be moved to a more optimal CPU/Cache/Memory architecture. For a large data set, memory bandwidth isn't going to be substantially different, unless the algorithm can be twisted to reduce memory bandwidth by a different processing ordering or local caching. This is one area where a CPU/Cache/Memory architecture may well simply smoke any possible FPGA design.
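A hypothetical illustration of "a different processing ordering": a tiled transpose that finishes each block while it is still cache-resident (the array size and tile size are made up):

#define N    4096
#define TILE 64    /* chosen so a TILE x TILE block of doubles fits in cache */

void transpose_tiled(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            /* Finish one TILE x TILE block before moving on, so the
               column-strided reads of src hit cache instead of DRAM. */
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}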

For very large data sets, processors like the Itanium 2 with 12MB and larger caches will frequently smoke a typical PC or FPGA implementation, simply by moving significant portions of the raw memory traffic into faster caches.

One side effect of this is that the existing source code for the application may have already been heavily optimized to squeeze every possible memory cycle out of the problem. The resulting code is probably larger, and possibly far too complex as a starting point for any FPGA design. One might well have to reverse engineer the original, simpler algorithms first, in the hope that there is indeed an embarrassingly parallel FPGA solution that avoids the raw memory bandwidth.

Other cases of interest are choices in data path sizes ... they might all be 64-bit simply because the code was optimized for a high-end HPC engine that was 64-bit native. After stepping back and looking at the problem very closely from an architecture and requirements perspective, sometimes insights emerge that the problem doesn't need all that significance and dynamic range everywhere ... allowing the FPGA implementation to be partitioned to match the real problem's needs, not what had formerly been done in the prior solution.

Once the architecture, data, and algorithm issues are well understood, and we have a fair idea what the processing kernel must do, then looking at matching interface designs is not only required but finally practical too. Designing interfaces before the architecture and processing kernels are understood is doing work toward an undefined requirement.

Performance by Design
John

Reply to
Totally_Lost

Depending on the accuracy and bit size, you can get up to 160 ns per calculation of 'c' using 16 bits. '(a*b+c)/d' is performed in one clock cycle. The division operation requires a higher-rate clock to complete within the allotted time; the worst case is 2.4 us for 64-bit. This varies depending on whether pipelining can be used.

Once the process is created and tested, further optimization may be possible to get more speed. It all depends on the application. Division is the slowdown: if 'd' is converted to its reciprocal, a faster multiplication can be used, resulting in 2 ns (500 MHz clock) per calculation of ((a * b + c) * (1/d)). That's 500 million complete calculations per second, or 1000 million multiplications and 500 million additions. As you can see, more complex equations could be performed within a single cycle, providing even more performance advantages over a DSP.

If you can provide more detail on the equations and the accuracy of the variables, I can give you much more accurate times and say whether a DSP is better for your application.
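A sketch of that reciprocal trick in C, assuming 'd' is shared across many elements so the one-time divide amortizes (the names are illustrative):

/* One up-front division, then a multiply per element instead of a divide. */
void mac_scale(double *c, const double *a, const double *b, int n, double d)
{
    const double recip_d = 1.0 / d;                 /* divide once          */
    for (int i = 0; i < n; i++)
        c[i] = (a[i] * b[i] + c[i]) * recip_d;      /* multiply per element */
}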

snipped-for-privacy@charter.net

Reply to
evansamuel

After exchanging side emails with the OP, this is roughly a giga-pixel-plus 3D medical imaging problem.

Unfortunately, the loop above is only one of a half dozen intermediate loops that must be calculated per dataset for the user's application. The OP's existing algorithm requires processing the data in one particular loop order, and then in different loop orders, to scale, smooth, and extract features from the raw 3D pixel data. Some of the OP's loops can be combined with minor changes in loop order that should be transparent; others clearly cannot, requiring the complete processing of one pass before the next can begin.

The different loop orders require different "relatively random" accesses to the memory array holding the data, forcing the application to be dependent on up to a half dozen memory read cycles per write cycle (depending on memory width and the ability to cache in the FPGA). This requirement comes from the 3D image algorithm processing each pixel based on each of the nearby pixels in the 3D space ... F(x,y,z) + F(x+1,y,z) + F(x,y+1,z) + F(x,y,z+1) + ... etc., some half dozen or so memory references per resulting pixel, per pass. The algorithm specification from their sponsor is in MatLab, with implied iteration and function references in the inner loop, so some of these operations are significantly more complex than the OP's 3-variable MAC loop above.
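A minimal sketch of that access pattern (the array names, bounds, and plain summation are illustrative; the point is the half dozen neighbour reads per voxel written):

#define NX 1024
#define NY 1024
#define NZ 1024   /* a giga-voxel volume, per the description above */

void neighbour_pass(double out[NZ][NY][NX], const double in[NZ][NY][NX])
{
    for (int z = 1; z < NZ - 1; z++)
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)
                /* Six neighbour reads plus the centre for every voxel
                   written: the memory traffic that dominates each pass. */
                out[z][y][x] = in[z][y][x]
                             + in[z-1][y][x] + in[z+1][y][x]
                             + in[z][y-1][x] + in[z][y+1][x]
                             + in[z][y][x-1] + in[z][y][x+1];
}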

So the limiting performance of this application is the sum of some significant memory latencies, plus FP multiply/square/divide/sqrt (plus a few trig) operation latencies, constrained by some hard serializations in the data sequencing. The problem is probably solvable in FPGAs with some significant risks, and is an excellent fit for SIMD/MIMD CPU architectures, such as a GPU/Cell application, with multiple independent memory arrays to obtain the needed memory bandwidth.

The OP would like to do this with an off-the-shelf FPGA board/system on a very small budget (about 3 months of a senior designer's burdened salary at western labor rates), including hardware. In a traditional commercial setting the available budget would cover neither the labor nor the hardware prototype, so the students involved are looking for a solution inside this tiny budget using student labor rates and off-the-shelf hardware. The project sponsor will get a super bargain if these kids can pull it off.

Not impossible, but certainly an interesting tough nut, and a dream real-life research project for the kids either way.

Reply to
Totally_Lost
