FPGA speed / logic density / MIPS/FLOPS as compared to general-purpose microprocessors

I was looking for some documentation on recent speed and logic density enhancements in modern FPGAs (I have searched Xilinx and IEEE with only mildly successful results). I am submitting a paper to a computer vision conference (CVPR) and am trying to promote the use of FPGAs for embedded computer vision and pattern recognition applications. I am basically looking for a document that compares FPGAs with GPPs or DSPs and forecasts future trends in the speed and logic density of FPGAs relative to general-purpose processors.

Points of comparison and interest: power, fixed-point calculations, floating-point calculations, on-chip memory/BRAM, clock rate, and DSP/scientific-computing operations (MACs, matrix multiplication, etc.).

thanks for your help,

geoffrey

Reply to
g.wall

Geoffrey, the most important aspect is the freedom in systems architecture offered by FPGAs. This allows you to use massive parallelism for higher performance, or serial processing for lower cost, etc. If you look just at clock rate, FPGAs suffer a penalty imposed by the interconnect flexibility. You have to compensate for that by taking advantage of the architectural flexibility offered in FPGAs. And do not forget (dynamic) reconfigurability...

Peter Alfke, Xilinx Applications

Reply to
Peter Alfke

I know that Keith Underwood of Sandia National Labs has compared the FP performance of FPGAs to GPPs and projected this into the future.

formatting link

Stephen

Reply to
Stephen Craven

I have several papers by Mr. Underwood and will probably cite him in my work. These are excellent sources for floating-point comparisons and very helpful to that end. I was hoping that someone might know of a similar resource that compares other aspects of FPGA-based designs to microprocessors.

thanks

Reply to
g.wall

For video processing, the biggest impact of this flexibility is on memory bandwidth. I have watched attempts to use DSP chips for video processing, I have studied using them myself, and I have been involved to varying degrees with a number of successful video processing designs using FPGAs.

In all cases the FPGA won the design primarily because you can make the memory as wide as you need, and can often play tricks like having multiple memory buses feeding the processing. When you have a processing algorithm that requires more than one frame buffer, or a frame buffer and coefficient memory operating simultaneously, a DSP with one memory interface just can't cut it. By the time you pile enough DSPs on a board to equal the performance of an FPGA solution you're burning holes in the carpet with the power dissipation.
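To put rough numbers on the bandwidth argument, here is a quick back-of-the-envelope sketch in Python. All the figures (resolution, sample size, port width, clock rate, number of streams) are hypothetical assumptions chosen only to illustrate the arithmetic, not taken from any real design:

# Back-of-the-envelope bandwidth estimate for a hypothetical video pipeline
# that reads two frame buffers and a coefficient table while writing one
# output frame. Every figure below is an illustrative assumption.

width, height, fps = 1920, 1080, 30        # frame geometry (assumed)
bytes_per_pixel = 2                        # e.g. 16-bit samples (assumed)
frame_bytes = width * height * bytes_per_pixel

streams = {
    "frame buffer read A": frame_bytes * fps,
    "frame buffer read B": frame_bytes * fps,  # previous frame, read simultaneously
    "coefficient reads":   frame_bytes * fps,  # assume one lookup per pixel
    "output writes":       frame_bytes * fps,
}

total_bw = sum(streams.values())
print(f"total required bandwidth : {total_bw / 1e6:7.0f} MB/s")

# A DSP with a single shared 32-bit external memory port at 133 MHz peaks at:
single_port_peak = 4 * 133e6
print(f"one 32-bit port @ 133 MHz: {single_port_peak / 1e6:7.0f} MB/s (peak, before arbitration)")

# An FPGA can instantiate one independent memory interface per stream, so
# only the widest single stream has to fit within one port's bandwidth.
print(f"widest single stream     : {max(streams.values()) / 1e6:7.0f} MB/s")

The sum of the streams already saturates the single shared port before arbitration and bus-turnaround overhead are counted, while with one dedicated interface per stream only the widest individual stream has to fit on any one port.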

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply to
Tim Wescott

A number of the published papers fail to search out the best space-time tradeoffs. They make mistakes like building 64-bit floating-point multipliers the hard way in an FPGA, or implementing an FFT/IFFT as a fully wide parallel structure, which isn't always the best space-time tradeoff.

There are MANY other architectures that can be developed to optimize the performance of a particular application on an FPGA, besides brute-force implementation of wide RISC/CISC processor core elements. Frequently a bit-serial design will yield a higher clock rate (as it doesn't need a long carry chain) and doesn't need extra logic for partial sums or carry lookahead, so it also delivers more functional units per part. The cost is latency, which can frequently be hidden by the faster clock rate and high functional density per part. It can also remove memory as a staging area for wide parallel functional units, and thus remove a serialization imposed by the solution's architecture.
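As a concrete illustration of the bit-serial idea, here is a behavioural sketch only (not any vendor primitive; the function name and word width are mine): the entire datapath of a bit-serial adder is one full adder plus one flip-flop of carry state, regardless of word width.

# Behavioural model of a bit-serial adder: operands are fed LSB-first, one
# bit per clock, through a single full adder with a one-bit carry register.
# Hardware cost is constant in the word width; latency is nbits clocks.

def bit_serial_add(a, b, nbits):
    """Add two nbits-wide unsigned integers, one bit per 'clock'."""
    carry = 0
    result = 0
    for i in range(nbits):                    # one iteration = one clock
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        result |= (abit ^ bbit ^ carry) << i  # sum output of the full adder
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return result

assert bit_serial_add(1234, 4321, 16) == 5555

Because there is no carry chain across the word, the clock can run faster, and an adder this small per operand stream means many more functional units fit in the same part; the nbits-clock latency is the price, which a fast clock and pipelining usually hide.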

Bit-serial operations using Xilinx LUT-based FIFOs can be expensive in both power and clock rate, but that is not the only way to use LUTs for bit-serial memory. Consider using Gray-code counters and using the LUTs simply as 16x1 RAMs instead ... faster and less dynamic power.
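A rough behavioural sketch of that alternative (the functions and test pattern below are purely illustrative): a 16-deep, 1-bit delay line built from one LUT used as a 16x1 RAM, with a Gray-code counter supplying the address so that only one address bit toggles per clock.

# Sketch: a 16-deep, 1-bit-wide delay line built from a single 16x1 RAM
# addressed by a Gray-code counter, instead of a shift-register FIFO.
# Only one address bit toggles per step, which is the dynamic-power argument.

def gray(n):                 # binary -> Gray code
    return n ^ (n >> 1)

ram = [0] * 16               # models one LUT used as a 16x1 RAM
binary = 0                   # the underlying binary counter state

def delay_line_step(bit_in):
    """Write the new bit, read back the bit written 16 steps ago."""
    global binary
    addr = gray(binary & 0xF)
    bit_out = ram[addr]      # oldest bit stored at this address
    ram[addr] = bit_in       # overwrite with the newest bit
    binary += 1
    return bit_out

# Feed in a pattern and watch it emerge 16 'clocks' later.
pattern = [1, 0, 1, 1, 0, 0, 1, 0]
out = [delay_line_step(b) for b in pattern + [0] * 16]
assert out[16:16 + len(pattern)] == pattern

Functionally it behaves like the shift-register FIFO, but per clock only one RAM cell is written and one address bit changes, instead of all sixteen bits shuffling along.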

There are lots of ways to get unexpected performance from FPGAs, but not by doing it the worst possible way.

Be creative. $30M US of FPGAs and memories can easily build a 1-10 petaflop supercomputer that would smoke existing RISC/CISC designs ... we just don't have good software tools and compilers to run applications on these machines, nor have we developed enough programming talent used to getting good/excellent performance from these devices.

There are a few dozen better ideas about how to turn FPGAs as we know them today into the processor chip of tomorrow, but that is another discussion.

Consider that distributed arithmetic is what made FPGAs popular for high-performance integer applications, and it's not even a basic construct available from any of the common compilers or HDLs. Consider the space-time performance of three-variable floating-point multiply-accumulate (MAC) algorithms using this approach for large matrix operations.

Consider this approach for doing high-end energy/force/weather simulations using a traditional red/black interleave, as you would for these applications under MPI. 3-, 6-, 9-, and 12-variable MACs are a piece of cake with distributed arithmetic, and highly space-time efficient. The core algorithms of many of these simulations are little more than MACs, frequently with constants, or near-constants that seldom need to be changed.
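A minimal, unsigned-only sketch of the distributed-arithmetic idea for a constant-coefficient three-variable MAC (the coefficient values and word width below are made up for illustration, and two's-complement handling is omitted):

# Distributed arithmetic sketch: y = c0*x0 + c1*x1 + c2*x2 with constant
# coefficients. Instead of three multipliers, precompute a 2**3-entry table
# of partial sums and shift-add one bit-slice of the inputs per clock.

COEFFS = [7, 12, 3]          # made-up constant coefficients
NBITS = 8                    # input word width (assumed)

# LUT[addr]: sum of the coefficients whose select bit in 'addr' is set.
LUT = [sum(c for k, c in enumerate(COEFFS) if (addr >> k) & 1)
       for addr in range(1 << len(COEFFS))]

def da_mac(xs):
    """Three-variable MAC evaluated one input bit-slice per 'clock'."""
    acc = 0
    for b in range(NBITS):                    # one iteration = one clock
        addr = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc += LUT[addr] << b                 # shift-add the partial sum
    return acc

xs = [200, 55, 131]
assert da_mac(xs) == sum(c * x for c, x in zip(COEFFS, xs))

The three multipliers have disappeared into an eight-entry table plus a shift-and-accumulate, which maps directly onto FPGA LUTs; 6-, 9-, or 12-variable MACs are typically partitioned into several small tables whose outputs are summed, so the cost grows gently with the number of variables.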

Consider that for many applications the dynamic range needed during most of the simulation is very limited, allowing systems to be built with floating point at both ends of the run and scaled integers in the middle, simplifying the hardware and improving the space-time fit even more.
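A toy sketch of that float-in, scaled-integer-middle, float-out pattern (the Q16 scale, the decay update, and all the constants are illustrative assumptions, not any particular simulation kernel):

# Toy example of converting to scaled integers at the start of a run, doing
# the inner loop entirely in fixed point, and converting back at the end.

SCALE_BITS = 16
SCALE = 1 << SCALE_BITS                       # Q16 fixed point (assumed)

def to_fixed(x):
    return round(x * SCALE)

def to_float(i):
    return i / SCALE

def inner_loop_fixed(state, coeff, steps):
    """Toy relaxation update run entirely in scaled integers."""
    c = to_fixed(coeff)
    half = 1 << (SCALE_BITS - 1)              # for round-to-nearest
    for _ in range(steps):
        state = (state * c + half) >> SCALE_BITS   # integer multiply + shift
    return state

x0 = to_fixed(0.75)                           # float -> fixed, once, going in
xN = inner_loop_fixed(x0, 0.999, 1000)
print(to_float(xN), 0.75 * 0.999 ** 1000)     # fixed -> float, once, coming out

The multiply-and-shift in the loop maps onto plain integer hardware, so floating-point units are only needed at the two conversion boundaries of the run.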

The big advantage to FPGAs is breaking the serialization that memory creates in RISC/CISC architectures. Memoryless computing using pipelined distributed arithmetic is the ultimate speedup for many applications, including a lot of computer vision and pattern recognition applications.

So read the papers carefully, and consider if there might not be a better architecture to solve the problem. If so, take the numbers and conclusions presented with a grain of salt.

Reply to
air_bits

It can't quite be 'memoryless', but I understand your point;

I'm waiting for stacked-die FPGAs that have fast/wide memory interfaces to Mbytes of fast xRAM...

There is quite a speed/Icc cost to driving all the pin buffers and PCB traces in more conventional memories.

Meanwhile, I see more 'opening' of the Cell processor, which could revise some of these FPGA/CPU benchmarks. The Cell might even make a half-decent FPGA simulation engine for development?

-jg

Reply to
Jim Granville

The Cell processor architecture does have some interesting uses and strong memory bandwidth, which delivers better than impressive performance for its target markets.

Architecturally, its strengths are also some of its worst weaknesses for building high-end machines that would scale well for applications which assume distributed memory.

The Cell processor is a next-generation CPU to continue Moore's Law. The FPGAs which follow, targeting the same high-performance computing market, will also come with application-specific cores and multiple memory interfaces to kick butt in the same markets. These FPGAs, at the same die size and production volumes, will have the same cost. The large FPGAs today which have similar die sizes are produced in lower volumes at a higher cost, which currently skews the cost-effectiveness equation toward traditional CPUs. Missing are good compiler tools and libraries to even the playing field. Cell will suffer some from that too.

Reply to
air_bits
