Using FPGAs for synthesis?

I was reading about the multiprocessor support in the upcoming Quartus software, and I wonder whether another way to increase synthesis speed would be to use the FPGA itself to calculate routing, etc. Maybe some tasks could take advantage of the highly parallel capabilities of FPGAs.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

Hi Frank,

There's not much in the CAD tool that could be easily sped up on an FPGA. CAD tools have huge working sets (active memory) and spend a lot of their time jumping around through that memory. Many key algorithms manipulate graphs, and this sort of stuff isn't easy to parallelize. Splitting it into multiple threads for a CPU is tough enough -- going massively parallel on an FPGA (or heck, a GPU -- anyone read up on the GeForce 8800?) is much tougher.

Plus, think of the maintenance nightmare... each new algorithm or tweak becomes a hardware change!

- Paul Leventis

That's pretty much what I have been saying all along, though mostly into the wind.

Better to solve the Memory Wall problem first, before going to multiple cores and compounding it.

PCs are only blisteringly fast when given trivial problems such as video coding, where lots of grunt work gets done on small data blocks, mostly in some raster or linear order.

You can see what happens to the caches when you traverse a graph, a tree, or even a simple linked list by randomly walking an array whose size is stepped from small ranges to huge ranges.

If the problem fits in the L1 cache, memory access times are close enough to 1 ns that a tiny loop can probably scan memory in a few ns per entry and still do some useful work; all the caches work together to make the Memory Wall just vanish.

Increase the range from L1 to L2 and the loop time starts to increase noticeably when little work is done per access. I bet that for most graph traversals relatively little work is done per hop.

Further increases beyond L2 and it falls apart completely: 1 ns can become 100 ns or more for every access if all nodes are fully randomly allocated over a memory range of 100 MB or so. On an older Athlon XP it even reaches 400 ns, and a newer Pentium D 805 reaches 130 ns, which suggests the OS gets called to fix up the TLB on every access (if anybody with a Core Duo wants to run the test for me, let me know). I don't think the OS makes much difference; the same thing needs to happen to change the TLBs. Now if the working set were bigger than the DRAM, you could factor in VM page misses too, and 1 ns becomes a few ms.
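If anyone wants to reproduce the effect, here is a minimal sketch in C of the kind of random-walk test I mean. The sizes, step count, and the Sattolo shuffle are my own choices for illustration, nothing canonical:

/* Random-walk latency test: build one random cycle through an array
 * so every load depends on the previous one, then measure ns per
 * dependent access as the working set grows past L1 and L2 into DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    if (!next) return -1.0;
    for (size_t i = 0; i < n; i++) next[i] = i;
    /* Sattolo's algorithm: one single cycle through all n slots, so
     * the walk can't settle into a cache-friendly sub-loop. rand()
     * is crude but good enough for a demo. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    volatile size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

int main(void)
{
    /* From 32 KB (fits in L1) up to 128 MB (far beyond L2). */
    for (size_t kb = 32; kb <= 128 * 1024; kb *= 4) {
        size_t n = kb * 1024 / sizeof(size_t);
        printf("%8zu KB: %6.1f ns/access\n", kb, chase(n, 10000000));
    }
    return 0;
}

Plot ns/access against working-set size and you will see the L1, L2, and TLB/DRAM plateaus I'm describing.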

I have proposed a solution to the Memory Wall, but it requires taking a hit elsewhere (a Thread Wall, if you like): multithreading both the processor AND the MMU + memory system. That allows both processor and memory latency to be pretty well hidden, and it requires using RLDRAM, with its high issue rate and much-reduced latency, as a total cache replacement. All 8 of its banks can be used concurrently when driven by 40 or so threads. The MMU hashes object references and indexes over the address space and reorders bank accesses so that banks get used in the right order as they complete previous requests. The idea here is that processor elements are cheap and memory access is the real limit, and that RLDRAM has about 20x more throughput than regular DRAM. The hashing MMU also replaces TLBs, so 1.5 RLDRAM cycles give 1 useful random memory access, while a truly random DRAM access can involve the OS fixing up TLBs, with several hidden DRAM cycles per application access.
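To make the hashing MMU concrete, here is a hypothetical sketch in C of how an object reference plus index could hash straight to a bank and line, replacing a TLB walk. The mixing function and field widths are mine, purely for illustration:

/* Hash an (object reference, index) pair to one of 8 RLDRAM banks
 * plus a line within it. A good mixer scatters neighbouring
 * references evenly, so all banks carry an even load. */
#include <stdint.h>

#define NBANKS 8u                 /* RLDRAM's independent banks */
#define LINES_PER_BANK (1u << 20) /* made-up capacity per bank  */

typedef struct { uint32_t bank; uint32_t line; } rld_addr;

static uint64_t mix(uint64_t x)   /* Murmur-style finalizer */
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

static rld_addr translate(uint64_t obj_ref, uint64_t index)
{
    uint64_t h = mix(obj_ref + index * 0x9e3779b97f4a7c15ULL);
    rld_addr a = { (uint32_t)(h % NBANKS),
                   (uint32_t)((h / NBANKS) % LINES_PER_BANK) };
    return a;
}

With 40 threads issuing independent requests, each bank sees about five outstanding requests at any time, which is what keeps all 8 banks cycling despite the per-bank latency.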

In the real world, I think most apps hide the problem by mixing linear and random accesses, so the drop in performance is far less spectacular, but full-time random graph traversal over huge working sets reveals the nasty side of things.

So while you can't fix up your PC to effectively lower the Memory Wall, you could bypass it by moving parts of FPGA/EDA apps into highly threaded code on an FPGA CPU cluster designed around such an RLDRAM system. Around 10 processor elements (PEs) running at the same clock as the RLDRAM, sharing memory accesses through a common MMU, can reach up to 1500 integer register Mips, and probably sustain half that with the usual load, store, and branch rates, with even access across the whole RLDRAM array. In effect, random access is now even better than linear access, since it forces all 8 banks to carry even loads. The hit is that full utilization needs 4 threads per PE, so 40 threads sharing the MMU.
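Here is a toy model in C of the barrel threading that makes this work: one PE with 4 hardware threads issuing round robin, skipping any thread that has a load in flight. The latency and load mix are made-up numbers, not from a real design:

/* One barrel-threaded PE: 4 threads, round-robin issue; a thread
 * with an outstanding load is skipped until its data returns. */
#include <stdio.h>

#define THREADS 4
#define MEM_LATENCY 8      /* cycles until a load completes */

int main(void)
{
    int  busy_until[THREADS] = {0};
    long issued[THREADS] = {0};
    long total = 0, cycles = 1000000;
    int  next = 0;

    for (long c = 0; c < cycles; c++) {
        for (int k = 0; k < THREADS; k++) {
            int i = (next + k) % THREADS;
            if (busy_until[i] <= c) {
                issued[i]++;
                total++;
                if (issued[i] % 4 == 0)   /* every 4th op is a load */
                    busy_until[i] = c + MEM_LATENCY;
                next = i + 1;
                break;
            }
        }
    }
    printf("issue rate: %.2f instructions/cycle\n", (double)total / cycles);
    return 0;
}

With this mix a single thread would stall two thirds of the time, but four of them together keep the issue slot essentially full -- which is exactly the Thread Wall for Memory Wall trade.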

800 Mips may not seem impressive, but I bet a typical PC randomly walking memory performs far worse: at 100 ns per dependent access, that's only about 10 million useful accesses per second.

Now if this sort of processor design were pushed into a full-custom ASIC, 300 MHz FPGA clocks can become at least 5x faster, and the RLDRAM model can also run 5x faster by adding a smaller up-front SRAM-equivalent model. Here the SRAM also relies on n-way banking to cover slower cycle times, so 8 parallel 2 ns cycles look equivalent to 0.25 ns cycles, with an effective 0.33 ns average access time after hash collisions. Each useful access supports about 5-10 opcodes, so overall throughput scales with frequency again, since the interleaving allows many memories to run much slower than the processor/MMU clock. Conventional single-threaded CPU designs demand that all the parts run as fast as possible, clearly an impossible goal for large memory usage. Both Raza and Sun have effectively done this with 1.5 GHz threaded processors, but I haven't seen the memory side go the same way yet.
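The banking arithmetic is worth spelling out. A back-of-envelope check in C, where the collision factor is just my reading of the 0.25 to 0.33 ns jump:

/* n equal banks of cycle time T give an ideal T/n per access; a
 * collision factor covers requests that hash to a still-busy bank. */
#include <stdio.h>

int main(void)
{
    double T = 2.0;           /* ns per SRAM bank cycle     */
    int    n = 8;             /* interleaved banks          */
    double collide = 4.0 / 3; /* assumed hash-collision hit */

    double ideal = T / n;               /* 0.25 ns */
    double effective = ideal * collide; /* ~0.33 ns */
    printf("ideal %.2f ns, effective %.2f ns per access\n",
           ideal, effective);
    return 0;
}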

Ironically, all the talk about future 80-way Intel cores coming down the pike is going to make the current Memory Wall problem far worse, and it will give you 80 threads to keep busy too.

The nice thing about putting some EDA code onto FPGA CPUs designed for random-walk throughput is that the individual PEs can also have special opcodes added to help the EDA app. If you think of conventional MicroBlaze or Nios designs as being used for running FPGA code, then you are just replicating conventional Memory Wall designs at much slower clocks, although full cache misses on very fast or very slow CPUs against the same sort of primary DRAM will hit the same final limit.

I also suspect that if the Memory Wall is finally tackled as suggested, Moore's Law would then let the EDA host CPU better track the EDA problem: bigger chip designs can still be handled by processors that actually do scale with frequency, but the EDA tool must use at least 20 threads to level the Memory Wall down. Those of us from the ASIC world used to deal with floorplanning and P&R times in the week time frame, so minutes or hours is still better.

While an FPGA-hardwired version of EDA is probably not attractive, perhaps just building a database engine to store and search the entire working set, using an FPGA to drive RLDRAM banks to guarantee fast probes, might work -- perhaps over the HT bus with an Opteron.

regards

John Jakson

Anyway, the paper is still at WoTUG; google for "r16 fpga transputer".

