Scientific Computing on FPGA

Some things deserve to be preserved for posterity. :-D

Surprised you found the blog though; I must be more careful with my robots.txt in future...

Cheers,

-Ben-

Reply to
Ben Jones

Reply to
scott moore

...

Reply to
yttrium

There are plenty of scientific apps out there that are being sped up by FPGAs. I, for instance, am just finishing up a 2D 32-2048 point floating-point FFT engine in Virtex-4 that can compute a 2D 2Kx2K FFT on complex data in under 13 msec, including raster-order input and output, for an imaging application. It resides in a single Virtex-4 SX55 with external QDR memory.
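For anyone who wants a software reference point: a 2D FFT is just 1D FFTs over the rows, a corner turn, then 1D FFTs over the columns, which is the structure a streaming engine like Ray's can pipeline. A quick numpy sketch of that decomposition (a reference model only, not Ray's design):

```python
# Reference model only (not Ray's design): a 2D FFT is row FFTs, a corner
# turn, then column FFTs -- the structure a streaming FPGA engine pipelines.
import numpy as np

N = 2048  # 2K x 2K complex frame, as in the post
frame = (np.random.randn(N, N) + 1j * np.random.randn(N, N)).astype(np.complex64)

rows = np.fft.fft(frame, axis=1)   # pass 1: 1D FFT of every row
full = np.fft.fft(rows, axis=0)    # pass 2: 1D FFT of every column

# Sanity check against the library's direct 2D transform.
assert np.allclose(full, np.fft.fft2(frame), rtol=1e-4, atol=1e-5)
```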

Reply to
Ray Andraka

Well, this power is precisely the problem with grid systems. I mean the real power, in watts.

Modern PCs also have a problem with very, very slow communication between them, and current processors have issues with slow memory access as well. The good news is that it's getting even worse with FB-DIMM ;-)

We are doing financial rather than scientific computing, so it's closely related. Though it's not always obvious what speed gain is achievable on FPGA systems, we have already obtained performance ratios of 50x to 140x on things like Monte-Carlo simulation, for instance.
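Marc doesn't say which model they run, but to see why Monte-Carlo maps so well: every path is independent, so a kernel like the toy European-call pricer below (parameters made up) can be replicated across many parallel pipelines. A rough numpy sketch:

```python
# Toy Monte-Carlo kernel (parameters made up, not Marc's model): price a
# European call by averaging independent simulated payoffs under GBM.
import numpy as np

S0, K, r, sigma, T = 100.0, 105.0, 0.05, 0.2, 1.0   # spot, strike, rate, vol, maturity
n_paths = 1_000_000

rng = np.random.default_rng(0)
z = rng.standard_normal(n_paths)                     # one normal draw per path
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
payoff = np.maximum(ST - K, 0.0)
price = np.exp(-r * T) * payoff.mean()
print(f"Monte-Carlo call price estimate: {price:.3f}")

# Each path is independent, which is why the same kernel can be laid out as
# many parallel pipelines in an FPGA (or farmed across a grid).
```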

Marc

Reply to
Marc Battyani

There is another simple way to speed up calculations. Did you know that the GPU on a modern graphics card can do matrix calculations many times faster than a CPU? There are several SDKs, and you won't need special hardware for that.

The Xilinx and Altera route takes a lot of time and carries a high risk, because if you really hit a problem the engineers there say: "This is not supported"... You only get help if you are buying in large quantities.

And in scientific computing you never buy in large quantities.

So as an alternative, use a GPU for fast calculations like FFT and sorting.

My favorite SDK is BrookGPU

formatting link
It's easy to use and comes with a lot of out-of-the-box running examples. It works with every modern 3D accelerator because it uses OpenGL and DirectX.

You will implement the stuff in a week or two. In VHDL you'd need half a year of simulation and timing work for the same result.
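To make the streaming idea concrete without pretending this is actual Brook syntax: you write a small per-element kernel and the runtime maps it over whole streams, and that mapping is what the GPU runs in parallel. A plain-Python model of the style:

```python
# Plain-Python model of the stream/kernel style (NOT actual BrookGPU syntax):
# a per-element kernel is applied across whole streams at once, which is the
# part the GPU parallelizes.
import numpy as np

def saxpy_kernel(a, x, y):
    """Elementwise kernel: one output element per pair of input elements."""
    return a * x + y

# "Streams" here are just big single-precision arrays, matching the
# single-precision limit of the GPUs of that era mentioned later in the thread.
x = np.random.rand(1 << 20).astype(np.float32)
y = np.random.rand(1 << 20).astype(np.float32)
out = saxpy_kernel(np.float32(2.0), x, y)
print(out[:4])
```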

Think about it!

Eric


Reply to
eric

"eric" wrote

GPUs have a lot of limitations: single-precision FP, memory bandwidth/latency and organization, the computation model, etc.

Well it's the old ASIC(fixed) vs FPGA topic ;-)

Marc

Reply to
Marc Battyani

Tommy,

There are a lot of papers around about software-to-FPGA migration projects. (The highest speed-up factor I found there is 6000.) A few of them are referenced in:

formatting link

Rainier

Reply to
reiner

That's a fairly agenda-driven paper. Its central theme is "Von Neumann is dead" (by the way, von Neumann didn't create what is properly called the stored-program computer; he just had his name on the front of the joint paper that detailed it).

Certainly a carefully written parallel program can show speed gains over a standard computer program, but there are many ways to achieve that parallelism, including FPGAs, multiprocessors, grids, etc. To be a viable replacement for general computing, FPGAs must have a high-level language facility associated with them, and existing efforts in that direction (C-to-hardware compilers) result in near parity with modern high-speed processors. The true speed advantage does not occur until the design is transferred to a full ASIC.

Scott Moore

Reply to
scott moore

Ray,

Does this include the transfer time to and from the source/destination, or only the calculation time?

Marc Reinig

Laboratory for Adaptive Optics

UCO/Lick

Reply to
Marc Reinig

Marc Battyani wrote:

But you get fast results in development and processing, and if that is your goal, rather than "I can do everything with my FPGA", it's the better choice.

In most cases 24-bit FP is enough, and you don't have to pay anything for the processing hardware. At the very least, you can split the processing task between CPU and GPU and get faster results.

FPGAs are cool if you're designing stand-alone devices, but for scientific calculation there are better and easier-to-implement solutions out there.

It depends on what you want to do. And of course, if you have enough time and money, use the "I can do everything" solution.

Eric

Reply to
eric

That's only true for some C-to-FPGA implementations, in particular those that are in some form cycle-accurate, like Celoxica's Handel-C product.

Others, which more or less synthesize like comparable Verilog/VHDL, will see as much parallelism as the native algorithm has. After learning how the code generation works, writing FPGA-specific code will give significantly faster execution. In particular, unrolling loops heavily and generating fine-grained pipelining is relatively easy for some algorithms, which can see two to three orders of magnitude of speed-up per mid-sized FPGA, and the parts can easily be cascaded in a dense mesh to build special-purpose petaflop supercomputers in the $3-50M range.
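As a software analogy for the unrolling (illustrative code only, not generated HDL): splitting one dependent accumulation into several independent ones exposes exactly the parallelism a synthesizer can turn into side-by-side hardware.

```python
# Software analogy for loop unrolling: the rolled loop is one long dependency
# chain; the unrolled version has four independent chains, which in an FPGA
# become four multiply-accumulate units working in parallel.

def dot_rolled(a, b):
    acc = 0.0
    for x, y in zip(a, b):            # one long dependency chain
        acc += x * y
    return acc

def dot_unrolled4(a, b):
    acc0 = acc1 = acc2 = acc3 = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):          # four independent chains -> four parallel MACs
        acc0 += a[i]     * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):        # remainder elements
        acc0 += a[i] * b[i]
    return acc0 + acc1 + acc2 + acc3

a = list(range(10))
b = list(range(10))
assert abs(dot_rolled(a, b) - dot_unrolled4(a, b)) < 1e-9
```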

The specific advantage of FPGAs comes from designing around LOTS of LUT-based memories, which completely avoids the memory serialization that fundamentally limits performance on most traditional architectures. In addition, the coupling between processing elements is made of direct wired connections rather than a bit/word-serial bus.

There are clearly algorithms which are serial, with little to no parallelism, where FPGAs simply cannot help, and which are in fact the same speed as, or slower than, a traditional CPU.

Reply to
fpga_toys

Take a certain class of NP backtracking problems, like n-queens: what you get with 10 clustered CPUs is a 9.9x performance increase, if you're lucky. A medium-sized FPGA can hold a few hundred n-queens engines running at 200 MHz, for an effective performance gain of about 100x.
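For reference, here is the kind of tiny backtracking engine that gets replicated a few hundred times in the fabric (a software sketch of the algorithm, not the hardware design), plus the rough arithmetic behind the ~100x figure:

```python
# A minimal n-queens solution counter -- the sort of small engine that gets
# replicated a few hundred times in the fabric (software sketch only).
def count_nqueens(n):
    full = (1 << n) - 1

    def place(row, cols, diag1, diag2):
        if row == n:
            return 1
        total = 0
        free = ~(cols | diag1 | diag2) & full      # columns still legal on this row
        while free:
            bit = free & -free                      # pick the lowest legal column
            free ^= bit
            total += place(row + 1, cols | bit,
                           (diag1 | bit) << 1, (diag2 | bit) >> 1)
        return total

    return place(0, 0, 0, 0)

print(count_nqueens(8))   # 92 solutions

# Rough arithmetic behind the ~100x claim (assumed figures): a few hundred
# engines at 200 MHz vs. ~10 CPUs each exploring one branch at a time.
```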

Other algorithms, like RC5 cracking engines, either fully unrolled or as a massive bit/digit-serial version, can match or exceed the performance of hundreds of clustered processors using mid- to large-size FPGAs. Pencil-and-paper estimates suggest a few dozen LX200 parts are equivalent to the entire DNet effort, at a fraction of the power and cost of the thousands of DNet machines. That is assuming you can actually get enough power into the chip and cool it at max clock rate with the part fully utilized. Performance of some of the high-end Altera parts, and of the Virtex-5 parts, is expected to be significantly better.

Bit/digit-serial floating point, using multiply-accumulate (MAC) architectures, easily implements dense 2D and 3D physics models using Gauss-Seidel relaxation type solvers. These MACs interconnect with little routing fuss, can be clocked at near the maximum clock rate, and produce an iteration every word-length number of clocks or so. A dense mesh of FPGAs can handle problems much bigger than even a fast, large clustered-processor machine can solve, with the communication cost between cell groups being simply FPGA pin speeds. Pin I/O can be a fraction of the internal clock rate and still give good convergence.
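For readers who haven't met Gauss-Seidel relaxation: each cell is repeatedly replaced by the average of its neighbours, using values already updated in the current sweep, which is why the arithmetic reduces to a stream of multiply-accumulates. A small pure-Python reference (not the bit-serial hardware formulation):

```python
# Pure-Python reference for Gauss-Seidel relaxation on a 2D Laplace problem:
# each interior cell becomes the average of its four neighbours, reusing
# values already updated within the current sweep.
import numpy as np

def gauss_seidel(grid, sweeps):
    n, m = grid.shape
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                grid[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] +
                                     grid[i, j - 1] + grid[i, j + 1])
    return grid

# Toy boundary-value problem: hot top edge, everything else cold.
g = np.zeros((32, 32))
g[0, :] = 1.0
gauss_seidel(g, sweeps=200)
print(g[16, 16])   # interior relaxes toward a value between the boundaries

# In hardware, each cell update is one MAC stream, and neighbouring cell
# groups talk over direct wires or FPGA pins rather than a shared bus.
```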

In short, a water-cooled cubic foot of large FPGAs has higher performance for these applications than the fastest supercomputers that exist today, at a fraction of the power and cost, and you don't need a basketball-court-sized machine room or the huge amount of air flow for cooling (which wastes energy just moving it all). The cubic foot of FPGAs does, however, require about 200-500 kW of power, which is an interesting power/cooling project on its own. The project is most easily done by running the FPGA stack directly off a 480 V three-phase power grid, with the mesh planes stacked to the rectified DC voltage and the planes AC-coupled (LVPECL) to avoid level conversion.

If space is a constraint, the above solution will run on a desktop for a large physics problem, while an equivalent PC blade cluster solution would require 10,000 sq ft of 42U racks (and a dedicated power plant).

Reply to
fpga_toys

It includes transfer in from the source, computation of the 2D FFT, and transfer out to the sink.

Reply to
Ray Andraka
