Of course, I don't think we differ much in opinion on the matter. But I prefer to stick to the average throughput achievable with C code.
I think, in summary, any HW acceleration is justified when it is pretty much busy all the time, embedded, or at least can shrink very significantly the time spent waiting for work to complete. But I fear few of these opportunities are going to get done, since the software experts are far from having the know-how to do this in HW. For many apps where an FPGA might barely be considered, one might also look at GPUs or the PhysX chip, or maybe wait for ClearSpeed to get on board (esp. for flops), so the FPGA will be the least visible option.
The BRAMs are what define the opportunity: 500-odd BRAMs, all whacking data at say 300MHz and dual ported, is orders of magnitude more bandwidth than any commodity cpu will ever see, so if they can be used independently, FPGAs win hands down. I suspect a lot of poorly executed software-to-hardware conversions combine too many BRAMs into a single large and relatively very expensive SRAM, which gives all the points back to cpus. That is also the problem with soft core cpus: to be useful you want lots of cache, but merging BRAMs into useful-size caches throws all their individual bandwidth away. That's why I propose using RLDRAM, as it allows FPGA cpus to use 1 BRAM each and share the RLDRAM bandwidth over many threads, with full associativity of memory lines using a hashed MMU structure, something like an inverted page table (IPT).
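To put a rough number on "orders of magnitude more", here is a back-of-envelope sketch in C (the 4 bytes/port width and the 6.4GB/s commodity DDR figure are my assumptions for illustration, not datasheet numbers):

    /* Rough aggregate BRAM bandwidth vs one commodity memory bus. */
    #include <stdio.h>

    int main(void)
    {
        double brams = 500, ports = 2, bytes = 4, hz = 300e6;
        double fpga_bw = brams * ports * bytes * hz;  /* ~1.2e12 B/s */
        double cpu_bw  = 6.4e9;   /* generous commodity DDR figure */
        printf("BRAM aggregate: %.0f GB/s\n", fpga_bw / 1e9);
        printf("ratio vs cpu:   %.0fx\n", fpga_bw / cpu_bw);
        return 0;
    }

Of course the ratio only holds while the BRAMs run independently; merge them into one big SRAM and you are back to a single port's worth.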
Tom's (Hardware) IIRC didn't have AMD in the lineup; that must have been 1-2 yrs ago. The P4 end of the curve was still linear, but the tests are IMO bogus, since they push linear memory tests rather than the random test I use. I hate it when people talk of bandwidth for blasting GBs of contiguous data around and completely ignore pushing millions of tiny blocks around.
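For the curious, the random test I mean is essentially a dependent pointer chase over a randomly permuted array. A minimal sketch (the permutation and timing details here are my own, not the exact code I ran):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile size_t sink;      /* keeps the chase loop live */

    /* Walk a random single-cycle permutation so every load depends on
       the previous one; returns ns per dependent access. */
    static double chase_ns(size_t n)
    {
        size_t *next = malloc(n * sizeof *next);
        size_t i, j, steps = 10 * 1000 * 1000;
        if (!next) return -1.0;
        for (i = 0; i < n; i++) next[i] = i;
        for (i = n - 1; i > 0; i--) {     /* Sattolo: one big cycle */
            j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        clock_t t0 = clock();
        for (i = j = 0; i < steps; i++)
            j = next[j];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        sink = j;
        free(next);
        return secs * 1e9 / (double)steps;
    }

    int main(void)
    {
        size_t bytes;                 /* 32K doubling toward the RAM limit */
        for (bytes = 32u << 10; bytes <= (512u << 20); bytes <<= 1)
            printf("%7zu KB: %7.1f ns\n",
                   bytes >> 10, chase_ns(bytes / sizeof(size_t)));
        return 0;
    }

Each load depends on the previous one, so prefetchers and linear-bandwidth tricks can't help; that is the whole point.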
100x is overstating it a bit, I admit, but the turn to multi-core puts cpus back on the same path as FPGAs: Moore's Law for quantity rather than raw clock speed, which keeps the arguments for and against relatively constant.

Actually, I ran that test from 32K, doubling until I got to my RAM limit of 640MB (no swapping) on a 1GB system, and the speed reduction is a sort of staircase on a log scale. At 32K there is obviously no real slowdown; the step bumps indicate the memory system gradually failing: L1, then L2, then the TLB. After 16M the drop to 300ns can't get any worse, since the L2 and TLBs have long since failed, having so very little associativity. But then again it all depends on temporal locality: how much work gets done per cache line refill, and whether all the effort of the cache transfer is thrown away every time (trees) or only some of the time (code).

In the RLDRAM approach I use, the Virtex-II Pro would effectively see 3ns raw memory issue rates for full random accesses.
The true latency of 20ns is well hidden, and the issue rate is reduced, probably 2x, to allow for rehashing and bank collisions. Still, a 6ns issue rate vs 300ns for full random access is something to crow about. Of course the technology would work even better on a full custom cpu. The OS never really gets involved to fix up TLBs, since there aren't any; the MMU does the rehash work. The 2 big penalties are that tagging adds 20% to memory cost, 1 tag every 32 bytes, and that with hashing, the store should be left partly empty to keep collision rates down.
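Very roughly, the lookup amounts to something like this (a sketch under my own assumptions; the hash function, probe limit, and line format are illustrative, not the actual design):

    #include <stdint.h>

    #define LINE_BYTES  32            /* 1 tag per 32-byte line, ~20% overhead */
    #define STORE_LINES (1u << 16)    /* hypothetical RLDRAM store, in lines */

    typedef struct {
        uint32_t tag;                 /* virtual line number kept with the data */
        uint8_t  data[LINE_BYTES];
    } line_t;

    static line_t store[STORE_LINES];

    /* Rehash by folding the probe count into a multiplicative hash. */
    static uint32_t slot(uint32_t vline, uint32_t probe)
    {
        return ((vline + probe) * 0x9E3779B9u) & (STORE_LINES - 1);
    }

    /* Fully associative: any line can live anywhere, so there is no TLB
       to miss; a collision just costs another issue (the 2x derating).
       A real design marks empty lines with an invalid tag. */
    static line_t *lookup(uint32_t vline)
    {
        for (uint32_t probe = 0; probe < 4; probe++) {
            line_t *l = &store[slot(vline, probe)];
            if (l->tag == vline)
                return l;
        }
        return 0;                     /* true miss: fetch, evict, insert */
    }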
Best regards,
John Jakson
transputer guy