Of course, I don't think we differ much in opinion on the matter. But I prefer to stick to the average throughput achievable with C code.
I think, in summary, any HW acceleration is justified when it is pretty much busy all the time, embedded, or at least can shrink very significantly the time spent waiting for work to complete. But I fear few of these opportunities are going to get done, since the software experts are far from having the know-how to do this in HW. For many apps where an FPGA might barely be considered, one might also look at GPUs or the PhysX chip, or maybe wait for ClearSpeed to get on board (esp. for flops), so the FPGA will be the least visible option.
The BRAMs are what define the opportunity: 500-odd BRAMs, all whacking data at say 300MHz and dual ported, is orders of magnitude more bandwidth than any commodity cpu will ever see, so if they can be used independently, FPGAs win hands down. I suspect a lot of poorly executed software-to-hardware conversions combine too many BRAMs into a single large and relatively very expensive SRAM, which gives all the points back to cpus. That is also the problem with soft core cpus: to be useful you want lots of cache, but merging BRAMs into useful-size caches throws all their individual bandwidth away. That's why I propose using RLDRAM, as it allows FPGA cpus to use 1 BRAM each and share the RLDRAM bandwidth over many threads, with full associativity of memory lines using a hashed MMU structure, something like an inverted page table (IPT).
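To put a rough number on "orders of magnitude more", here is a back-of-envelope sketch in C (the 4 bytes/port width and the 6.4GB/s commodity DDR figure are my assumptions for illustration, not datasheet numbers):

    /* Rough aggregate BRAM bandwidth vs one commodity memory bus. */
    #include <stdio.h>

    int main(void)
    {
        double brams = 500, ports = 2, bytes = 4, hz = 300e6;
        double fpga_bw = brams * ports * bytes * hz;  /* ~1.2e12 B/s */
        double cpu_bw  = 6.4e9;   /* generous commodity DDR figure */
        printf("BRAM aggregate: %.0f GB/s\n", fpga_bw / 1e9);
        printf("ratio vs cpu:   %.0fx\n", fpga_bw / cpu_bw);
        return 0;
    }

Of course the ratio only holds while the BRAMs run independently; merge them into one big SRAM and you are back to a single port's worth.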
Tom's (Hardware) IIRC didn't have AMD in the lineup; that must have been 1-2 yrs ago. The P4 end of the curve was still linear, but the tests are IMO bogus, since they push linear memory tests rather than the random test I use. I hate it when people talk of bandwidth for blasting GBs of contiguous data around and completely ignore pushing millions of tiny blocks around.
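For the curious, the random test I mean is essentially a dependent pointer chase over a randomly permuted array. A minimal sketch (the permutation and timing details here are my own, not the exact code I ran):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile size_t sink;      /* keeps the chase loop live */

    /* Walk a random single-cycle permutation so every load depends on
       the previous one; returns ns per dependent access. */
    static double chase_ns(size_t n)
    {
        size_t *next = malloc(n * sizeof *next);
        size_t i, j, steps = 10 * 1000 * 1000;
        if (!next) return -1.0;
        for (i = 0; i < n; i++) next[i] = i;
        for (i = n - 1; i > 0; i--) {     /* Sattolo: one big cycle */
            j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        clock_t t0 = clock();
        for (i = j = 0; i < steps; i++)
            j = next[j];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        sink = j;
        free(next);
        return secs * 1e9 / (double)steps;
    }

    int main(void)
    {
        size_t bytes;                 /* 32K doubling toward the RAM limit */
        for (bytes = 32u << 10; bytes <= (512u << 20); bytes <<= 1)
            printf("%7zu KB: %7.1f ns\n",
                   bytes >> 10, chase_ns(bytes / sizeof(size_t)));
        return 0;
    }

Each load depends on the previous one, so prefetchers and linear-bandwidth tricks can't help; that is the whole point.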
100x is overstating it a bit, I admit, but the turn to multi-core puts cpus back on the same path as FPGAs: Moore's Law for quantity rather than raw clock speed, which keeps the arguments for and against relatively constant.

Actually, I ran that test from 32K, doubling until I got to my RAM limit of 640MB (no swapping) on a 1GB system, and the speed reduction is a sort of staircase on a log scale. At 32K there is obviously no real slowdown; the step bumps indicate the memory system gradually failing: L1, then L2, then the TLB. After 16M the drop to 300ns can't get any worse, since the L2 and TLBs have long since failed, having so very little associativity. But then again it all depends on temporal locality: how much work gets done per cache line refill, and whether all the effort of the cache transfer is thrown away every time (trees) or only some of the time (code).

In the RLDRAM approach I use, the Virtex-II Pro would effectively see 3ns raw memory issue rates for full random accesses.
The true latency of 20ns is well hidden, and the issue rate is reduced, probably 2x, to allow for rehashing and bank collisions. Still, a 6ns issue rate vs 300ns for full random access is something to crow about. Of course the technology would work even better on a full custom cpu. The OS never really gets involved to fix up TLBs, since there aren't any; the MMU does the rehash work. The 2 big penalties are that tagging adds 20% to memory cost, 1 tag every 32 bytes, and that with hashing, the store should be left partly empty to keep collision rates down.
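Very roughly, the lookup amounts to something like this (a sketch under my own assumptions; the hash function, probe limit, and line format are illustrative, not the actual design):

    #include <stdint.h>

    #define LINE_BYTES  32            /* 1 tag per 32-byte line, ~20% overhead */
    #define STORE_LINES (1u << 16)    /* hypothetical RLDRAM store, in lines */

    typedef struct {
        uint32_t tag;                 /* virtual line number kept with the data */
        uint8_t  data[LINE_BYTES];
    } line_t;

    static line_t store[STORE_LINES];

    /* Rehash by folding the probe count into a multiplicative hash. */
    static uint32_t slot(uint32_t vline, uint32_t probe)
    {
        return ((vline + probe) * 0x9E3779B9u) & (STORE_LINES - 1);
    }

    /* Fully associative: any line can live anywhere, so there is no TLB
       to miss; a collision just costs another issue (the 2x derating).
       A real design marks empty lines with an invalid tag. */
    static line_t *lookup(uint32_t vline)
    {
        for (uint32_t probe = 0; probe < 4; probe++) {
            line_t *l = &store[slot(vline, probe)];
            if (l->tag == vline)
                return l;
        }
        return 0;                     /* true miss: fetch, evict, insert */
    }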
Best regards,
John Jakson
transputer guy