FPGA-based hardware accelerator for PC

If one wanted to develop an FPGA-based hardware accelerator that could attach to an average PC and process data stored in PC memory, what options are available?

Decision factors are:

  • ease of use (dev kit, user's guide, examples)
  • ability to move data with minimal load on the host PC
  • cost
  • scalability (i.e. ability to upsize RAM and FPGA gates)
  • ability to instantiate a 32 bit RISC (or equiv)

Someone recommended the TI & Altera Cyclone II PCIe dev board, which is said to be available soon. Any other recommendations?

Also, what is the best way to move data between PC memory and the FPGA? DMA? What transfer rates should one realistically expect?

Thanks, Jeremy

--
PDTi [ http://www.productive-eda.com ]
SpectaReg -- Spec-down code and doc generation for register maps
Reply to
Jeremy Ralph

Jeremy Ralph schrieb:

Nice idea, but beating a present-day CPU (Pentium 4, Athlon 64, etc.) or a present-day GPU (Nvidia GeForce, whatever is up to date) is hard to achieve even for the big players in the business. (Yes, special tasks can be optimized to run faster on FPGA-based hardware, but speeding up "normal" PC tasks is difficult.)

Sure.

PCI is 133 MByte/s max. AGP is 2 GByte/s max (AFAIK). PCI-Express is n x 250 MByte/s (with n up to 16).

MfG Falk

Reply to
Falk Brunner

What could it accelerate? Modern PCs are quite fast beasts... If you couldn't speed things up by a factor of, say, 300%, your device would be useless. Modest improvements of a few tens of percent can be neglected -- Moore's law constantly works for you. FPGAs are good for special-purpose tasks, but there are not many such tasks in the realm of PCs.

You already have a high-performance CPU on board, so why do you need another one? Use your FPGA to do something massively parallel and let the CPU perform the CPU-ish stuff. The higher-end Xilinx devices contain one or more PowerPCs for that purpose, and that solution seems to be the best possible.

DMA is good, there are PCI frontend IP cores available.

132 MByte/s when not overclocked.

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

Thanks Falk, for the numbers. Any reason why AGP couldn't be used for non-graphics streams?

--
PDTi [ formatting link ]
SpectaReg -- Spec-down code and doc generation for register maps

Reply to
Jeremy Ralph

Hi Piotr,

Thanks for your response. Please find my comments below:

So let's say one were able to demo a 50% performance improvement for some specialized task using an FPGA, custom RTL and a HAL. Let's say the design is scalable such that with 8X the FPGA gates I'd get an 8*50% performance improvement. Yes, FPGAs are costly and can't compare to ASICs... but 50% in an FPGA could mean 400% in an ASIC. Moore's law also holds true for ASICs and FPGAs.

If the 32-bit RISC were optimized for some specialized task, then it might make sense to have it alongside a high-performance CPU. For acceleration stuff I don't envision a RISC being too useful; I'm more interested in prototyping some RISC-centric soft-IP designs. Hoping to kill two birds with one stone and find a board that can be used for both applications.

Cheers, Jeremy

--
PDTi [ http://www.productive-eda.com ]
SpectaReg -- Spec-down code and doc generation for register maps
Reply to
Jeremy Ralph

FPGAs and standard CPUs are a bit like oil & water; they don't mix very well: very parallel versus very sequential.

What exactly does your PC workload include?

Most PCs that are fast enough to run Windows and web software like Flash are idle, what, 99% of the time; even under normal use they are still idle 90% of the time, and maybe 50% idle while playing DVDs.

Even if you have compute jobs like encoding video, it is now close enough to real time or a couple of PCs can be tied together to get it done.

Even if FPGAs were infinitely fast and cheap, they still don't have a way to get at the data unless you bring it to them directly; in PC-accelerator form they are bandwidth starved compared to the cache & memory bandwidth the PC cpu has.

There have been several DIMM based modules, one even funded by Xilinx VC a few years back, I suspect Xilinx probably scraped up the remains and any patents?

That PCI bus is way too slow to be of much use except for problems that do a lot of compute on relatively little data, but then you could use distributed computing instead. PCIe will be better, but then you have to deal with new PCIe interfaces, or use a bridge chip if you are building one.

And that leaves the potential of HT connections for multi socket (940 & other) Opteron systems as a promising route, lots of bandwidth to the caches, probably some patent walls already, but in reality, very few users have multi socket server boards.

It is best to limit the scope of use of FPGAs to what they are actually good at and therefore economical to use, that means bringing the problem right to the pins, real time continuous video, radar, imaging, audio, packet, signal processing, whatever with some logging to a PC.

If a processor can be in the FPGA, then you can have much more throughput to it, since it is in the fabric, than if you go through an external skinny pipe to a relatively infinitely faster serial cpu. Further, if your application is parallel, then you can possibly replicate blocks, each with a specialized processor, possibly with custom instructions or a coprocessor, till you run out of fabric or FPGAs. Eventually, though, input & output will become limiting factors again: do you have acquisition of live signals and/or results that need to be saved?

It really all depends on what you are processing and the rate it can be managed.

John Jakson transputer guy

Reply to
JJ

PCI-X 64-bits @ 133 MHz will give you around 1 GByte/s max in one direction. Most high-end server mother-boards have PCI-X rather than PCI.

Currently the maximum theoretical speed of PCI-Express is 2.5 Gbits/s per lane per direction as specified in the standard. That immediately drops to 2.0 Gbits/s per lane per direction due to 8b10b encoding. Then of course in practice the smallish nature of Transaction Layer Packet (TLP) sizes (i.e. the ratio of payload compared to header) cause further reduction in the useful data throughput. In reality you're looking at approximately 1.5 Gbits/s per lane per direction of real data throughput. The big advantage with PCI-Express is the seamless scalability and the point-to-point serial protocol. So a 16-lane PCI-Express end point should give you 24 Gbits/s in each direction of useful data throughput.

Regards,

Alif.

Reply to
Alif Wahid

What about PCIe IP cores? That may be a better option than bridge chips since it keeps everything FPGA oriented. However, it does mean that the FPGA must have gigabit speed serial transceivers built in and that limits one's options a little bit.

Regards

Alif

Reply to
Alif Wahid

I always hated that the PCI cores were so heavily priced compared to the FPGA they might go into. The pricing seemed to reflect the value they once added to ASICs some 10 or 15 years ago, and not the potential of really low-cost, low-volume applications. A $100 FPGA in small-volume applications doesn't support $20K IP for a few $ worth of fabric it uses. It might be a bargain compared to the cost of rolling your own, though, just as buying an FPGA is a real bargain compared to rolling my own FPGA/ASIC too.

FPGAs need lots of I/O to be useful, so why not put the damn IP in hard macro form and let everyone at it. And do the same again for PCIe (if that's possible?). We see how much more useful FPGAs are with all the other stuff that's been added over the years (BlockRAMs, multipliers, clocks, etc.), but it's really the I/O where the s*1t hits the fan first.

I would say that if we were to see PCIe on chip, even on a higher-$ part, we would quickly see a lot more coprocessor board activity, even just plain vanilla PC boards. I wonder, if there were multiple built-in narrow PCIe links, whether they could be used to build node-to-node links a la HT for FPGA arrays?

Not that I really know much about PCIe yet.

John Jakson transputer guy

Reply to
JJ

Jeremy Ralph schrieb:

None that I know of.

Regards Falk

Reply to
Falk Brunner

Jeremy,

Then one can match that performance by buying a faster CPU, a multi-CPU board, or a cluster of PCs. High-end FPGAs are extremely expensive and low-end chips are not competitive, so the final result of the analysis is "no", either because of an unacceptable price/performance factor or because of a negligible performance improvement.

AFAIR a nice 2 GHz Sempron chip costs about $70. No FPGA can beat its price/performance ratio if its tasks are CPU-like. An FPGA device can easily beat any general-purpose CPU in the domain of DSP, advanced encryption and decryption, cipher breaking, true real-time control etc., but these are not typical applications of a PC computer, so there is not much to accelerate. And don't forget about easier alternative routes, like computing on the GPU present on your video card:

formatting link

There are several impressive results.

They can; raw speed is not the only factor that matters. In my opinion the order of flexibility provided by FPGAs easily compensates for their lower performance.

But then you drop the most important feature, which is the ability to reconfigure the device on the fly.

No, because in this case you are trying to outperform an out-of-order, highly parallel processor core able to complete ~6 simple instructions per cycle and clocked at 2+ GHz. Reasonable soft CPU cores run at about 200 MHz and complete only one instruction per cycle. It means that a cheap CPU you can buy anywhere has about 60 times higher performance in sequential processing. Even if you could provide the same performance (not to mention outperforming it, which is the key idea, anyway), it would mean that you are at least Harry Potter. :-)

You can do this using existing development boards.

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

snipping

I have fantastic disbelief about that 6 ops/clock except in very specific circumstances, perhaps in a video codec using MMX/SSE etc., where those units really do the equivalent of many tiny integer ops per cycle on 4 or more parallel 8-bit DSP values. Now that's looking pretty much like what FPGA DSP can do pretty trivially, except for the clock ratio, 2 GHz v 150 MHz.

I look at my C code (compilers, GUI development, databases, simulators etc.) and some of the critical output assembler, and then time some parts in huge 1M-iteration timed loops, making sure no iteration benefits from caching the previous run. I always see a tiny fraction of that ~6 ops/cycle. Since my code is most definitely not vector code or a media codec, but a mix of graph or tree traversal over large uncacheable spans, I often see avg rates of exactly 1 op per clock on an Athlon TB at 1 GHz and also on an XP2400 at 2 GHz. My conclusion is that the claims for wicked performance are mostly super hype that most punters accept all too easily. The truth of the matter is that Athlon XPs rated at say 2400 are not 2.4x faster than a TB at 1 GHz in avg cases, maybe only on vector codecs. When I compare Windows apps on different cpus, I usually see the faster cpu performing closer to the square root of its claimed speedup.

A while back, Tom's Hardware did a comparison of 3 GHz P4s v the P100 1st Pentium and all the in-betweens, and the plot was basically linear, and that's on stupid benchmarks that don't reflect real-world code. One has to bear in mind the P4 not only used 30x the clock to get 30x the benchmark performance, it also used perhaps 100x the transistor count as well, and that is all due to the Memory Wall and the necessity to avoid at all costs accessing DRAM. Now if we did that on an FPGA benchmark we would be damned to hell; one should count the clock ratio and the gate or LUT ratio, but PCs have gotten away with using infinite transistor budgets to make claims.

This makes sense to me since the instruction rate is still bound by real memory accesses to DRAM some percent of the time for cache misses; I figure around 2% typical or even more. DRAM has improved miserably over 20 yrs in true random access, about 2x, from 120 ns to 60 ns RAS-to-Dout time. If you assume cache misses close to 0.1% then you get the hyped numbers, but code doesn't work like that, at least mine doesn't.

Try running a random number generator, say R250, which can generate a new random number every 3 ns on an XP2400 (9 ops IIRC). Now use that number to address a table >> 4MB. All of a sudden my 12 Gops Athlon is running at 3 MHz, i.e. every memory access takes 300 ns or so, since every part of the memory system is wrecked (deliberately in this case). Ironically, if that's all you wanted to do, an FPGA cpu without a complex MMU and TLBs could generate random numbers in 1 cycle and drive an SDRAM controller just as fast if not faster, since SDRAMs can cycle fully randomly closer to 60 ns. Now in packet switching and processing, where large tables are looked up with random-looking fields, they use RLDRAM to get SRAM-like performance.

So what does real code look like? Any old mixture of the 2 extremes: sometimes it's memory-crippled; sometimes, if everything is in L1 cache, it really does seem to do 2 ops/clock if array accesses are spread out, even with small forward branches. So all the complexity of these OoO machines is there to push the avg rate up and keep it just above 1 for typical integer codes, more for specially tuned codes. Each FP op used, though, is equivalent to a large number of ops on an integer-only cpu, but then I rarely use FP except for reporting averages.

So on an FPGA cpu, without OoO, with no branch prediction, and with tiny caches, I would expect to see only about 0.6 to 0.8 ops/cycle, and without caches, a fraction of that. So that leaves the real speed difference much closer, maybe 10-20 to 1 for integer codes, but orders more for FP codes. For an integer-only problem where some of the code can be turned into specialized instructions, as in your applications list, the FPGA cpu is more transparent and possibly a more even match if replicated enough, but still it is difficult even to get parity, and writing HDL is much harder than plain C.

I have no experience with the Opterons yet; I have heard they might be 10x faster than my old 1 GHz TB, but I remain skeptical based on past experience.

On the Harry Potter theme, I have suggested that an FPGA Transputer cpu that solves the Memory Wall by trading it for a Thread Wall, using a latency-hiding MTA cpu AND especially latency-hiding MTA RLDRAM, can be a more serious competitor to conventional OoO, BP, SS designs that continue to flog regular SDRAMs. In that sort of design a 10 PE + 1 MMU Transputer node setup with RLDRAM can match 1000 Mips, since each PE is only 100 Mips, but you have to deal with 40 threads with almost no Memory Wall effect, i.e. a Thread Wall. Since the PEs are quite cheap, the limit on FPGAs is really how many MMUs can be placed on an FPGA for max memory throughput, and that seems to be a pin & special-clocks limit rather than a core limit. Perhaps using spare BlockRAMs as an L1 RLDRAM intermediate, one could get many more busy cpus inside the FPGA sharing the RLDRAM bandwidth on L1 misses.

regards

John Jakson transputer guy

(paper at wotug.org)

Reply to
JJ

You might be interested in knowing that Lattice is doing just that in some of their LatticeSC parts. On the other hand, you are somewhat limited in the kinds of application you can accelerate, since the LatticeSC does not have embedded multipliers IIRC. (Lattice is targeting communication solutions such as line cards that rarely need high-performance multiplication in the LatticeSC.)

/Andreas

Reply to
Andreas Ehliar

One interesting application for most of the people on this newsgroup would be synthesis, place & route and HDL simulation. My guess would be that these applications could be heavily accelerated by FPGAs. My second guess is that it is far from trivial to actually do this :)

/Andreas

Reply to
Andreas Ehliar

CPUs are heavily optimized memory/cache/register/ALU engines which, for serial algorithms, will always outperform an FPGA unless the algorithm isn't strictly serial in nature.

In doing TMCC/FpgaC-based research for several years, it's surprising how many natively serial algorithms can be successfully rewritten with significant parallel gains in the FPGA domain. The dirty part of "fixing" traditionally optimized C code is actually removing all the performance-specific coding "enhancements" meant to fine-tune that C code for a particular ISA. In fact, some rather counterintuitive coding styles (for those with an ISA-centric experience set) are necessary to give an FPGA C compiler the room to properly optimize the code for performance.

Consider variable reuse. It's very common to declare just a couple of variables to save memory (cache/registers) and heavily reuse each variable. For an FPGA C compiler, this means constructing multiple multiplexors for each write instance of the variable, and frequently committing a LUT/FF pair to it. If the instances are separated out, then frequently the operations for the individual instances result in nothing more than a few wires and extra terms in the LUTs of other variables. So ISA-optimized C code can actually create more storage, logic, and clock cycles than a less optimized, direct coding style that isolates variables by actual function, where they may well be completely free and transparent.
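A contrived example of that variable-reuse point (function names and values are made up; both forms compute the same result in C):

```c
/* ISA-friendly style: one temporary reused for two unrelated values.
 * An FPGA C compiler must multiplex the two write instances of `t`
 * into a single register. */
int reuse_style(int a, int b, int c, int d)
{
    int t;
    t = a + b;        /* first write instance of t  */
    int x = t * 2;
    t = c - d;        /* second write instance of t */
    int y = t * 3;
    return x + y;
}

/* FPGA-friendly style: one name per value.  Each intermediate can
 * become plain wires folded into the LUTs of its consumer. */
int split_style(int a, int b, int c, int d)
{
    int sum  = a + b;
    int diff = c - d;
    int x = sum * 2;
    int y = diff * 3;
    return x + y;
}
```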

Generally a small 16-element array is nearly free, by using LUT-based RAMs for those small arrays. Thus it becomes relatively easy to design FPGA code sequences around a large number of independent memories, by several means, including aggressive loop unrolling. Because even LUT-based memories are inherently serial (due to addressing requirements), it's sometimes wise to rename array references to individual variables (i.e. V[0] becomes V0, V[1] becomes V1, etc.), which may easily unroll several consecutive loops into a single set of LUT terms in a single reasonable clock cycle time. The choice here depends heavily on whether the array/variables are feedback terms in the outer C loops (FSM) and need to be registered anyway.
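For instance, a sketch of that renaming idea in plain C (names made up; both functions are equivalent, but the second form needs no addressed memory at all in an FPGA C compiler):

```c
/* Small array summed through a loop: for an FPGA C compiler the
 * LUT-RAM addressing serializes the four reads. */
int sum4_array(const int v[4])
{
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += v[i];
    return s;
}

/* Elements renamed to individual variables (V[0] -> v0, etc.):
 * the whole sum can collapse into one flat adder tree. */
int sum4_scalar(int v0, int v1, int v2, int v3)
{
    return v0 + v1 + v2 + v3;
}
```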

As I've noted in other discussions about FpgaC coding styles, pipelining is another counterintuitive strategy that is easily exploited in FPGA code: by reversing statement order and providing additional retiming variables, a deep combinatorial code block is broken up into faster, smaller blocks, with the reversed order of updating creating pipeline FFs in the resulting FSM for the C code loops.
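A minimal sketch of that reversed-order retiming style in plain C. The stage structure and names are invented for illustration; the idea is that in the FpgaC model each loop iteration corresponds to one clock, so writing the stages in reverse order makes s1/s2 behave as pipeline registers carrying the previous cycle's values forward:

```c
#define N 8

static int in[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
static int out[N + 2];   /* results appear after 2 cycles of latency */

void run_pipeline(void)
{
    int s1 = 0, s2 = 0;              /* retiming (pipeline) registers */
    for (int t = 0; t < N + 2; t++) {
        out[t] = s2 * 2;             /* stage 3: uses s2 from cycle t-1 */
        s2 = s1 + 1;                 /* stage 2: uses s1 from cycle t-1 */
        s1 = (t < N) ? in[t] : 0;    /* stage 1: take new input */
    }
}
```

After the 2-cycle fill, out[t] equals (in[t-2] + 1) * 2; writing the stages in source order instead would collapse the whole computation into one deep combinatorial chain per iteration.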

VHDL/Verilog/C all have similar language and variable expression terms, and can all be compiled with nearly the same functional results. The difference is that coding FSMs is a natural part of loop construction in C, and frequently requires considerable care in VHDL/Verilog. When loops execute on a traditional ISA, all kinds of exceptions occur which cause pipeline flushes, wasted prefetch cycles, branch prediction stalls, exception processing and other events that prevent fast ISA machines from reaching even a few percent of best-case performance. These seemingly sequential loops that would intuitively seem highly suited for ISA execution can in fact turn into a very flat one-cycle FSM with some modest recoding, easily running at a few hundred MHz in an FPGA, versus dozens of cycles on an ISA at a much slower effective speed.

A large FPGA with roughly a thousand I/O pins can keep a dozen 32-bit quad-DDR memories at full-tilt bandwidth, a significantly higher memory bandwidth than you can typically get out of a traditional ISA CPU. Some applications can benefit from this, but it requires that the primary memory live on the FPGA accelerator card, not on the system bus. This means that the code which manipulates that memory must all be moved onto the FPGA card, to avoid the bandwidth bottleneck. Likewise, some applications may well need the FPGA to have a direct connection to a couple dozen disk drives and network ports, to serve wire-speed network-to-storage applications, and only use the host CPU for "exceptions" and "housekeeping".

Reply to
fpga_toys

Place:

formatting link
Route:
formatting link

- a

--
PGP/GPG: 5C9F F366 C9CF 2145 E770  B1B8 EFB1 462D A146 C380
Reply to
Adam Megacz

HyperTransport offers 41 GB/s of bandwidth. Maybe it is the best way to move data between the PC and the FPGA.

Wayne

Reply to
Wayne

Yes, 41 GB/s would be a nice rate for moving data around. Am I correct in assuming this would require a motherboard with two or more 939 AMD sockets? Any idea how much effort would be involved in programming the host to move data between the two? I expect there are some open libraries for this sort of thing. Also, how much work would it be to have the FPGA handshake the HyperTransport protocol? Hopefully the FPGA board vendor would have this covered.

Found this product, which looks interesting. Anyone know of other HT products of interest?

formatting link

Seems the HT route could get expensive (more costly FPGA board + new motherboard & processor).

Thanks all for the great discussion!

--
PDTi [ http://www.productive-eda.com ]
SpectaReg -- Spec-down code and doc generation for register maps
Reply to
Jeremy Ralph

John, of course it is about peak performance, reachable with great effort. But the existence of any accelerator is justified only when even that peak performance is not enough. Otherwise you could simply write better code at no additional hardware cost. I know that in most cases the CPU sleeps because of lack of load or stalls because of a cache miss, but that is a completely different song...

Yes, in my case a Cyclone @ 65 MHz (130 MHz internally + SDR interface, 260 MHz at the critical path with timesharing) is enough. But it is a specialized waveforming device, not a general-purpose computer. As a processor, it could reach 180 MHz and then stabilize -- not an impressive value today, not to mention that it contains no cache, as BRAMs are too precious a resource to be wasted that way.

Interesting. In fact I don't care about the P4, as its architecture is one big mistake, but a linear speedup would be a shame for a Pentium 3...

Northwood has 55 million, the old Pentium had 4.5 million.

Yes, that is true. 144 MiB of POWER5 cache does help. A 1.5 GHz POWER5 is as fast as a 3.2 GHz Pentium 4 (measured on a large memory-hungry application). But you can buy many P4s for the price of a single POWER5 MCM.

Man, what 4MiB... ;-) Our application's working set is 200--600MiB. That's the PITA! :-/

In a soft DSP processor it would be much less, as there is much vector processing, which omits (or at least should omit) the funny caches built of BRAMs.

I like the Cell approach -- no cache => no cache misses => tremendous performance. But there are only 256 KiB of local memory, so it is restricted to specialized tasks.

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

A car is not the best tool to make other cars. It's not a bees & butterflies story. :-) Same with FPGAs.

And who actually would need that?

Best regards Piotr Wyderski

Reply to
Piotr Wyderski
