Are FPGAs headed toward a coarse granularity?

I was reading about the MathStar FPOA devices and started thinking about parallels between the architectural advances in CPUs and those in FPGAs. With the speed gains from process refinements dropping off in current technologies, CPU makers have taken advantage of the increased density to provide multiple CPUs rather than continually increasing clock rates.

FPGAs, on the other hand, continue to increase density by making larger versions of the same chips while trying to push speed as they go. The only exception to this is the way they have incorporated functions on the chip that are not the basic building blocks. I believe it started with memory blocks. Then multipliers were added. Now there are a number of different dedicated functional blocks available on the high-end FPGA devices.

So where is this headed now? My understanding is that the FPOA was a rather coarse-grained architecture. I also have the impression that the company is not succeeding because of software issues, not any inherent shortcoming in the devices or the architecture. In fact, from what I have read, a coarse-grained architecture can provide much faster processing and higher density than a fine-grained one can, making the silicon cost a lot lower.

With the cost of gates dropping to such low levels, does it really make sense to continue to provide fine-grained devices, which use so much of the die for routing? I don't remember who first told me, "We sell you the routing and throw in the logic for free!" A coarse-grained architecture should require much less of the programmable routing, giving you much more of the "free" logic.

It seems like the current FPGA devices are heading in the coarse-grained direction; they just haven't cut the umbilical cord yet, most likely because that cord is rooted in the old software. A coarse-grained architecture would need a doctor's slap on the behind for the birth of new design tools.

So, are coarse-grained architectures the way of FPGA... oops, FPxA devices in the near future? Will the lowly LUT and FF be pushed into the dark corners of the die in coming years? I think it is not a matter of if, just a matter of when, and I think the when is soon!

Reply to
rickman

The main drive seems to be MHz, as hard IP is always faster than soft logic.

Pretty much all the FPGAs now have 'DSP blocks', and those blocks get ever more complex. Some have GHz links hardwired.

The latest Altera device uses more of this 'Hard IP', but the soft-logic speeds have not increased much.

There will always be LUT/FF areas, as those handle the state machines etc., but perhaps the next iteration will be wide-path bus routing.

-jg

Reply to
Jim Granville

Special-purpose hardware is always faster than general-purpose hardware, except in the general case ;-)

Coarser granularity makes the implementation of what you are building a lot more efficient, but at the same time it is less likely to match what the designer wants.

Take the DSP block as an example (let's forget the multiplier for now, as it enjoys an additional advantage: the existence of very clever hardware structures for multipliers). The muxes and adders use far fewer configuration bits and low-level muxes, because 18 to 48 elements are always configured together to implement the same function, and the data lines always run in parallel and can't be permuted as they could be in the FPGA fabric. This is a huge gain for a 48+18 bit accumulator. But if you need 49 bits, you immediately lose a factor of two: the fabric implementation grows by only about 2%, the DSP48 implementation by 100%.
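
To spell that arithmetic out, a quick Python sketch (illustrative numbers of my own; I assume fabric cost scales linearly with operand width, while DSP blocks are allocated whole):

import math

# Illustrative arithmetic only: fabric cost grows roughly linearly with
# operand width, while DSP blocks must be allocated in whole units.
fabric_growth = 49 / 48 - 1            # ~0.02 -> about 2% more LUTs
dsp_growth = math.ceil(49 / 48) - 1    # 1 block becomes 2 -> +100%
print(f"fabric: +{fabric_growth:.0%}, DSP48: +{dsp_growth:.0%}")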

André DeHon analyzed this in a chapter of his PhD thesis many years ago:

formatting link
There are graphs showing the efficiency as a function of application word length and hardware granularity.
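
As a toy model of those curves (my own sketch, not DeHon's actual equations): assume each g-bit grain pays a shared configuration/routing overhead C plus g units of logic area, and a width-w datapath rounds up to whole grains.

import math

# Toy model: area efficiency vs. application width w and granularity g.
# C is an arbitrary overhead per grain (config bits plus switches),
# expressed in logic-unit equivalents.
C = 4.0

def efficiency(w: int, g: int) -> float:
    grains = math.ceil(w / g)          # datapath rounds up to whole grains
    return w / (grains * (C + g))      # useful logic area / total area

for g in (1, 2, 4, 48):
    print(f"g={g:2d}: " + "  ".join(f"w={w}: {efficiency(w, g):.2f}"
                                    for w in (1, 16, 48, 49)))

With these arbitrary numbers, granularity 1 is uniformly mediocre, while granularity 48 is excellent at w=48 and falls off a cliff at w=49; that is the same shape as the graphs, and as the DSP48 example above.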

It should be noted that in FPGAs both delay and area are dominated by the routing resources. Therefore it is mainly the granularity of the routing that should be optimized.

No design has millions of gates of random logic. Large designs are dominated by arithmetic function blocks. Therefore it is likely that an FPGA with a granularity of 2, for example, would be much more efficient than current FPGAs. For random control logic half of the LUTs would remain unused, but for datapaths the utilization would approach 100%, and the device could save as much as 75% of the switches and configuration bits.
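
Where might the 75% come from? A minimal sketch, assuming crossbar-like switch boxes whose crosspoint count scales with the square of the number of independently routed nets (my assumption, not measured data):

# Crossbar-like assumption: crosspoints grow quadratically with the
# number of independently routed nets.
def crosspoints(nets: int) -> int:
    return nets * nets

fine = crosspoints(100)          # granularity 1: 100 independent nets
coarse = crosspoints(100 // 2)   # granularity 2: 50 two-bit buses
print(f"switches saved: {1 - coarse / fine:.0%}")   # -> 75%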

This is old knowledge for FPGA architecture folks, but there are two strong arguments against it:

1) It is hard to quantify routing utilization, but the competitors' marketing will immediately target the lower LUT utilization as a disadvantage. (But hey, if a LUT costs 75% less, who cares if I can only use 80% of the LUTs? Especially if the clock frequency is better?)
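
Spelling out that parenthetical with illustrative numbers:

# Illustrative only: cheap LUTs win even at reduced utilization.
cost_per_lut = 0.25    # a LUT that costs 75% less than the baseline
utilization = 0.80     # but only 80% of the LUTs are usable
effective = cost_per_lut / utilization
print(f"effective cost per used LUT: {effective:.2f}x baseline")  # ~0.31x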

2) Granularity-1 FPGAs can reuse the huge body of knowledge about ASIC EDA algorithms. For higher granularities you need to redevelop most of the software tool flow from scratch.

There is a small FPGA vendor that has high-speed global routing with 10-bit granularity. Maybe this is a start. The area savings are marginal, as most of the switches are in the local routing, but the speed improvement for long connections is significant.

Kolja Sulimma

Reply to
Kolja Sulimma

Forgive my possible ignorance here (my fairly limited FPGA experience is only with smaller Cyclones and PLDs, not big devices), but isn't "granularity 2" pretty much what the Stratix II, III (and IV, when it's available) have in their "adaptive logic module"? And as far as I can see from the following recent white paper, this is exactly what Altera is saying: using the ALM, they get much more into a Stratix than into a Virtex with roughly the same number of logic elements / slices / LUTs / flip-flops. Obviously all such marketing information must be taken with a large handful of salt.

Reply to
David Brown

No. What I was saying is that with granularity two you get slightly less logic into the same number of LUTs, at greatly reduced cost.

Altera's (probably correct) claim is that because they are more flexible in how the inputs to a LUT pair can be routed, you can better utilize the LUTs. This added flexibility probably increases the area cost of the input routing significantly. Granularity 2 would mean that a pair of elements (most importantly routing switches) shares a configuration. Each output of a LUT could only reach half of the inputs that it could reach in a granularity-1 FPGA (or would have to take a detour). Useful logic per LUT would go down (because some LUTs couldn't be used), but useful logic per chip area would go up (because each LUT with its associated routing resources would get less expensive).

Altera is doing the opposite: paying extra area for added flexibility. It achieves two goals by this:

a) It sounds better for marketing, because chip area is kept secret anyway and sales prices are interpreted creatively. LUT count and utilization, OTOH, are easily measured.

b) The device is easier to use, because you can accurately estimate whether your design will fit into the device. This is valuable and might be worth the price.

I do not know much about Altera, but I know that starting with Virtex-4, Xilinx decided to spend a lot of extra area on routing to make the delays more predictable. This helps the XST software people and the users, but another design would have a better cost/performance ratio.

Kolja Sulimma

Reply to
Kolja Sulimma

Let me add my 2 cents' worth here, as a personal opinion (not official Xilinx position): In the distant past, each process generation gave us smaller and thus cheaper die, and higher speed, while leakage current was a non-issue. From now on, the next process generation will still give us smaller size, and eventually lower cost, but hardly any raw speed improvement. And leakage current is the big concern...

Speed improvement will predominantly come from architectural (granularity) changes. That's why Virtex-5 quadrupled the logic size of the LUTs (from 16 bits to 64 bits): to pack logic more tightly and to reduce routing. That's also why we added many hard-coded functions: multipliers, ALUs, FIFOs, SerDes in each I/O, PCI Express, Ethernet, and multi-gigabit transceivers in all Virtex-5 LXT/SXT/FXT devices. In the FXT subfamily we also include one or two hard-coded PPC microprocessors with attached crosspoint and DMA.

So we are increasing efficiency and speed and reducing power not only in the general-purpose fabric, but more importantly through larger hard-coded blocks. But we always make sure that our FPGAs remain general-purpose devices. The art of engineering is forever a compromise between conflicting demands... Peter Alfke

Reply to
Peter Alfke

Thanks - that makes it a bit clearer. It seems there are a couple of different ways to interpret the term "granularity" here, based on different aspects of the "grains".

I was thinking in terms of the amount of logic and/or registers packed into a single minimal unit: the ALM, with its 6 inputs, two registers, outputs and arithmetic logic, has a larger grain size than traditional 4-input elements (Peter Alfke mentions that the Virtex-5 uses 6-input LUTs for the same reason), because more can be done in a single elemental step.
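
To put rough numbers on that, here is a back-of-envelope sketch of my own (not vendor data): Shannon-decompose an arbitrary n-input function into k-input LUTs, spending one LUT as the 2:1 mux at each split, and count the worst case.

# Crude worst-case bound: two cofactors plus one LUT used as a 2:1 mux
# at every Shannon split (valid for any k >= 3).
def luts_needed(n: int, k: int) -> int:
    if n <= k:
        return 1
    return 2 * luts_needed(n - 1, k) + 1

for n, k in ((6, 4), (6, 6)):
    cnt = luts_needed(n, k)
    print(f"{n}-input function in LUT{k}s: {cnt} LUTs, {cnt * 2**k} config bits")

By this crude bound, a worst-case 6-input function costs 7 four-input LUTs (112 configuration bits) plus the routing between them, versus a single 6-input LUT (64 bits) with no routing at all. Real mappers do much better than the bound, but the direction is the same.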

You, I believe, are thinking more in terms of configuration and hard coding: a larger "grain" does more work for the same amount of configuration, and is thus smaller and faster than multiple small grains doing the same job. Examples include things like multipliers and multi-bit multiplexers (do any FPGAs have those as hard macros? It seems an obvious idea for efficient implementation of soft processors, amongst other things).

From Peter's post, it looks like FPGAs are moving towards larger granularity in both senses.

Best regards,

David

Reply to
David Brown

Well, the thread started with MathStar's FPOAs, which are granularity-16 devices, similar to MIT's MATRIX architecture.

formatting link
In these devices, groups of 16 wires are routed together and the logic elements operate on 16-bit words.

FPGAs by C-Switch contain 20-bit-wide busses that connect to RAMs and ALUs of the same width, in addition to the fine-grained FPGA fabric.

formatting link

(B.T.W.: They claim to have 1067Gbps per pin DRAM support)

Kolja Sulimma

Reply to
Kolja Sulimma

I don't know that adders are a good comparison, because they are pretty durn fast in an FPGA. The carry chain has been highly optimized and is still pretty good even at 48 bits. But if that is what people want from FPGAs, then I guess it shows that the dedicated-logic route does not always pay large dividends.

I have never thought of "granularity" as a property of routing, but I see your point. I seem to recall that the ORCA devices claimed to facilitate routing of the four-bit hunks their LUTs/FFs were arranged in. They didn't drop the 1x routing, but they had more of the 4x than other devices might have had. That was quite a while back, when it was still AT&T.

When you say "dominated", can you put that in definitive terms? The larger designs I have worked on were still very datapath-oriented, with a lot of muxes (still not very efficient in generic LUT architectures) and pipeline delays. This was all IP-type comms work, not RF/IF or mod/demod type stuff.

Yes, that is in line with the quote I gave above, "We sell you the routing and throw in the logic for free!", and that was in the days of XC4000 devices! It is pretty impressive, in fact, just how dominant the routing is in an FPGA. The image in the chip editors is a pretty close approximation to the real spatial relations, from what I have heard. If you zoom all the way out, you will see very little of the logic; the chip is overwhelmingly routing.

But like you point out, it is as much a marketing issue as anything, and even the technical people get wrapped up in the numbers game. More than once I have seen a post here about how "their" brand of chip is so much better because it has these features that let you do so much more in so little real estate... but in the end real estate is not on my list of FPGA criteria; I only care about speed, power, cost, etc. If feature X impacts those things, then I want to know that explicitly, not by inference. The trade-off between routing and logic is similar, in that no one wants to be known for being only 75% routable, even if I can still fit my design in a cheaper chip because of it.

Does it really require new tool flows from *scratch*? I would expect that the current tools could be adapted.

That is interesting, especially how it improves the speed rather than density.

Rick

Reply to
rickman

I think that FPGAs could achieve quite big advances in performance if the FPGA tools did a better job of placing the components well to help the routing. It still seems that the logic, and especially the coarse-grained units (for example memories), are placed randomly first, and then the tools try to fix the mess they made :) Better placement might also enable changes to the routing architectures.

That depends on the design. For telecom chips, random logic is usually the dominating factor, and arithmetic is not pushing the limits. DSP-style designs and pure packet-processing and switching designs are quite different beasts.

I think the biggest problem with this is how to estimate beforehand what fill level you can achieve, so that you can select the FPGA for the PCB design etc.

In the past it was quite a problem when designs were unroutable while you still had huge amounts of logic free. The routability differed from design to design, was not very predictable, and caused big problems.

Maybe this problem could get some help from the ASIC-side RTL tools that look for routing congestion already at the RTL level, so that the congested places can be recoded etc. But whether that is a good use of coding time is a question.

Even today's granularity is too much for the tools; there are many blocks that have existed in the fabrics for years that the tools just can't infer. The tools are getting better, but the pace is quite slow.

--Kim

Reply to
Kim Enkovaara
