Are FPGAs headed toward a coarse granularity?

I was reading about the MathStar FPOA devices and started thinking about parallels between the architectural advances in CPUs and those in FPGAs. With the speed gains from process refinements dropping off in current technologies, CPU makers have taken advantage of the increased density to provide multiple CPUs rather than continually increasing clock rates.

FPGAs, on the other hand, continue to increase density by making larger versions of the same chips while trying to push speed as they go. The only exception to this is the way they have incorporated functions on the chip that are not the basic building blocks. I believe it started with memory blocks. Then multipliers were added. Now there are a number of different dedicated functional blocks available on the high-end FPGA devices.

So where is this headed now? My understanding is that the FPOA was a rather coarse-grained architecture. I also have the impression that the company is not succeeding because of software issues, not any inherent shortcoming in the devices or the architecture. In fact, from what I have read, a coarse-grained architecture can provide much faster processing and higher density than a fine-grained one can, making the silicon cost a lot lower.

With the cost of gates dropping to such low levels, does it really make sense to continue to provide fine-grained devices, which use so much of the die for routing? I don't remember who first told me, "We sell you the routing and throw in the logic for free!" A coarse-grained architecture should require much less of the programmable routing, giving you much more of the "free" logic.

It seems like the current FPGA devices are heading in the coarse-grained direction; they just haven't cut the umbilical cord yet, most likely because that cord is rooted in the old software. A coarse-grained architecture would need a doctor's slap on the behind for the birth of new design tools.

So, are coarse-grained architectures the way of FPGA... oops, FPxA devices in the near future? Will the lowly LUT and FF be pushed into the dark corners of the die in coming years? I think it is not a matter of if, just a matter of when, and I think the when is soon!

Reply to
rickman

The main drive seems to be MHz, as hard IP is always faster than soft logic.

Pretty much all the FPGAs now have 'DSP blocks', and those blocks get ever more complex. Some have GHz links hardwired.

The latest Altera device uses more of this 'Hard IP', but the soft-logic speeds have not increased much.

There will always be LUT/FF areas, as those handle the state machines etc., but perhaps the next iteration will be wide-path bus routing.

-jg

Reply to
Jim Granville

Special-purpose hardware is always faster than general-purpose hardware, except in the general case ;-)

Coarser granularity makes the implementation of what you are building a lot more efficient, but at the same time it is less likely to match what the designer wants.

Take the DSP block as an example (let's forget the multiplier for now, as it enjoys an additional advantage: the existence of very clever hardware structures for multipliers). The muxes and adders use far fewer configuration bits and low-level muxes, because 18 to 48 elements are always configured together to implement the same function, and the data lines always run in parallel and can't be permuted as they could be in the FPGA fabric. This is a huge gain for a 48+18 bit accumulator. But if you need 49 bits, you immediately lose a factor of two: the fabric implementation grows by only about 2%, the DSP48 implementation by 100%.
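
To spell that arithmetic out, a quick Python sketch (illustrative numbers of my own; I assume fabric cost scales linearly with operand width, while DSP blocks are allocated whole):

import math

# Illustrative arithmetic only: fabric cost grows roughly linearly with
# operand width, while DSP blocks must be allocated in whole units.
fabric_growth = 49 / 48 - 1            # ~0.02 -> about 2% more LUTs
dsp_growth = math.ceil(49 / 48) - 1    # 1 block becomes 2 -> +100%
print(f"fabric: +{fabric_growth:.0%}, DSP48: +{dsp_growth:.0%}")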

André DeHon analyzed this in a chapter of his PhD thesis many years ago:

formatting link
There are graphs showing the efficiency as a function of application word length and hardware granularity.
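
As a toy model of those curves (my own sketch, not DeHon's actual equations): assume each g-bit grain pays a shared configuration/routing overhead C plus g units of logic area, and a width-w datapath rounds up to whole grains.

import math

# Toy model: area efficiency vs. application width w and granularity g.
# C is an arbitrary overhead per grain (config bits plus switches),
# expressed in logic-unit equivalents.
C = 4.0

def efficiency(w: int, g: int) -> float:
    grains = math.ceil(w / g)          # datapath rounds up to whole grains
    return w / (grains * (C + g))      # useful logic area / total area

for g in (1, 2, 4, 48):
    print(f"g={g:2d}: " + "  ".join(f"w={w}: {efficiency(w, g):.2f}"
                                    for w in (1, 16, 48, 49)))

With these arbitrary numbers, granularity 1 is uniformly mediocre, while granularity 48 is excellent at w=48 and falls off a cliff at w=49; that is the same shape as the graphs, and as the DSP48 example above.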

It should be noted that in FPGAs both delay and area are dominated by the routing resources. Therefore it is mainly the granularity of the routing that should be optimized.

No design has millions of gates of random logic. Large designs are dominated by arithmetic function blocks. Therefore it is likely that an FPGA with a granularity of 2, for example, would be much more efficient than current FPGAs. For random control logic half of the LUTs would remain unused, but for datapaths the utilization would approach 100%, and the device could save as much as 75% of the switches and configuration bits.
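
Where might the 75% come from? A minimal sketch, assuming crossbar-like switch boxes whose crosspoint count scales with the square of the number of independently routed nets (my assumption, not measured data):

# Crossbar-like assumption: crosspoints grow quadratically with the
# number of independently routed nets.
def crosspoints(nets: int) -> int:
    return nets * nets

fine = crosspoints(100)          # granularity 1: 100 independent nets
coarse = crosspoints(100 // 2)   # granularity 2: 50 two-bit buses
print(f"switches saved: {1 - coarse / fine:.0%}")   # -> 75%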

This is old knowledge for FPGA architecture folks, but there are two strong arguments against it:

1) It is hard to quantify routing utilization, but the competitors' marketing will immediately target the lower LUT utilization as a disadvantage. (But hey, if a LUT costs 75% less, who cares if I can only use 80% of the LUTs? Especially if the clock frequency is better?)
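
Spelling out that parenthetical with illustrative numbers:

# Illustrative only: cheap LUTs win even at reduced utilization.
cost_per_lut = 0.25    # a LUT that costs 75% less than the baseline
utilization = 0.80     # but only 80% of the LUTs are usable
effective = cost_per_lut / utilization
print(f"effective cost per used LUT: {effective:.2f}x baseline")  # ~0.31x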

2) Granularity-1 FPGAs can reuse the huge body of knowledge about ASIC EDA algorithms. For higher granularities you need to redevelop most of the software tool flow from scratch.

There is a small FPGA vendor that has high-speed global routing with 10-bit granularity. Maybe this is a start. The area savings are marginal, as most of the switches are in the local routing, but the speed improvement for long connections is significant.

Kolja Sulimma

Reply to
Kolja Sulimma

Forgive my possible ignorance here (my fairly limited FPGA experience is only with smaller Cyclones and PLDs, not big devices), but isn't "granularity 2" pretty much what the Stratix II, III (and IV, when it's available) have in their "adaptive logic module"? And as far as I can see from the following recent white paper, this is exactly what Altera is saying: using the ALM, they get much more into a Stratix than into a Virtex with roughly the same number of logic elements / slices / LUTs / flip-flops. Obviously all such marketing information must be taken with a large handful of salt.

Reply to
David Brown

No. What I was saying is that with granularity two you get slightly less logic into the same number of LUTs, at greatly reduced cost.

Altera's (probably correct) claim is that because they are more flexible in how the inputs to a LUT pair can be routed, you can better utilize the LUTs. This added flexibility probably increases the area cost of the input routing significantly. Granularity 2 would mean that a pair of elements (most importantly routing switches) shares a configuration. Each output of a LUT could only reach half of the inputs that it could reach in a granularity-1 FPGA (or would have to take a detour). Useful logic per LUT would go down (because some LUTs couldn't be used), but useful logic per chip area would go up (because each LUT with its associated routing resources would get less expensive).

Altera is doing the opposite: paying extra area for added flexibility. It achieves two goals by this:

a) It sounds better for marketing, because chip area is kept secret anyway and sales prices are interpreted creatively. LUT count and utilization, OTOH, are easily measured.

b) The device is easier to use, because you can accurately estimate whether your design will fit into the device. This is valuable and might be worth the price.

I do not know much about Altera, but I know that starting with Virtex-4, Xilinx decided to spend a lot of extra area on routing to make the delays more predictable. This helps the XST software people and the users, but another design would have a better cost/performance ratio.

Kolja Sulimma

Reply to
Kolja Sulimma

Let me add my 2 cents' worth here, as a personal opinion (not official Xilinx position): In the distant past, each process generation gave us smaller and thus cheaper die, and higher speed, while leakage current was a non-issue. From now on, the next process generation will still give us smaller size, and eventually lower cost, but hardly any raw speed improvement. And leakage current is the big concern...

Speed improvement will predominantly come from architectural (granularity) changes. That's why Virtex-5 quadrupled the logic size of the LUTs (from 16 bits to 64 bits): to pack logic more tightly and to reduce routing. That's also why we added many hard-coded functions: multipliers, ALUs, FIFOs, SerDes in each I/O, PCI Express, Ethernet, and multi-gigabit transceivers in all Virtex-5 LXT/SXT/FXT devices. In the FXT subfamily we also include one or two hard-coded PPC microprocessors with attached crosspoint and DMA.

So we are increasing efficiency and speed and reducing power not only in the general-purpose fabric, but more importantly through larger hard-coded blocks. But we always make sure that our FPGAs remain general-purpose devices. The art of engineering is forever a compromise between conflicting demands... Peter Alfke

Reply to
Peter Alfke

Thanks - that makes it a bit clearer. It seems there are a couple of different ways to interpret the term "granularity" here, based on different aspects of the "grains".

I was thinking in terms of the amount of logic and/or registers packed into a single minimal unit: the ALM, with its 6 inputs, two registers, outputs and arithmetic logic, has a larger grain size than traditional 4-input elements (Peter Alfke mentions that the Virtex-5 uses 6-input LUTs for the same reason), because more can be done in a single elemental step.
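
To put rough numbers on that, here is a back-of-envelope sketch of my own (not vendor data): Shannon-decompose an arbitrary n-input function into k-input LUTs, spending one LUT as the 2:1 mux at each split, and count the worst case.

# Crude worst-case bound: two cofactors plus one LUT used as a 2:1 mux
# at every Shannon split (valid for any k >= 3).
def luts_needed(n: int, k: int) -> int:
    if n <= k:
        return 1
    return 2 * luts_needed(n - 1, k) + 1

for n, k in ((6, 4), (6, 6)):
    cnt = luts_needed(n, k)
    print(f"{n}-input function in LUT{k}s: {cnt} LUTs, {cnt * 2**k} config bits")

By this crude bound, a worst-case 6-input function costs 7 four-input LUTs (112 configuration bits) plus the routing between them, versus a single 6-input LUT (64 bits) with no routing at all. Real mappers do much better than the bound, but the direction is the same.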

You, I believe, are thinking more in terms of configuration and hard coding: a larger "grain" does more work for the same amount of configuration, and is thus smaller and faster than multiple small grains doing the same job. Examples include things like multipliers and multi-bit multiplexers (do any FPGAs have those as hard macros? It seems an obvious idea for efficient implementation of soft processors, amongst other things).

From Peter's post, it looks like FPGAs are moving towards larger granularity in both senses.

Best regards,

David

Reply to
David Brown

Well, the thread started with MathStar's FPOAs, which are granularity-16 devices, similar to MIT's MATRIX architecture.

formatting link
In these devices, groups of 16 wires are routed together and the logic elements operate on 16-bit words.

FPGAs by C-Switch contain 20-bit-wide busses that connect to RAMs and ALUs of the same width, in addition to the fine-grained FPGA fabric.

formatting link

(B.T.W.: They claim to have 1067Gbps per pin DRAM support)

Kolja Sulimma

Reply to
Kolja Sulimma

I don't know that adders are a good comparison, because they are pretty durn fast in an FPGA. The carry chain has been highly optimized and is still pretty good even at 48 bits. But if that is what people want from FPGAs, then I guess it shows that the dedicated-logic route does not always pay large dividends.

I have never thought of "granularity" as a property of routing, but I see your point. I seem to recall that the ORCA devices claimed to facilitate routing of the four-bit hunks their LUTs/FFs were arranged in. They didn't drop the 1x routing, but they had more of the 4x than other devices might have had. That was quite a while back, when it was still AT&T.

When you say "dominated", can you put that in definitive terms? The larger designs I have worked on were still very datapath-oriented, with a lot of muxes (still not very efficient in generic LUT architectures) and pipeline delays. This was all IP-type comms work, not RF/IF or mod/demod type stuff.

Yes, that is in line with the quote I gave above, "We sell you the routing and throw in the logic for free!", and that was in the days of XC4000 devices! It is pretty impressive, in fact, just how dominant the routing is in an FPGA. The image in the chip editors is a pretty close approximation to the real spatial relations, from what I have heard. If you zoom all the way out, you will see very little of the logic; the chip is overwhelmingly routing.

But like you point out, it is as much a marketing issue as anything, and even the technical people get wrapped up in the numbers game. More than once I have seen a post here about how "their" brand of chip is so much better because it has these features that let you do so much more in so little real estate... but in the end real estate is not on my list of FPGA criteria; I only care about speed, power, cost, etc. If feature X impacts those things, then I want to know that explicitly, not by inference. The trade-off between routing and logic is similar, in that no one wants to be known for being only 75% routable, even if I can still fit my design in a cheaper chip because of it.

Does it really require new tool flows from *scratch*? I would expect that the current tools could be adapted.

That is interesting, especially how it improves the speed rather than density.

Rick

Reply to
rickman

I think that FPGAs could achieve quite big advances in performance if the FPGA tools did a better job of placing the components well to help the routing. It still seems that the logic, and especially the coarse-grained units (for example memories), are placed randomly first, and then the tools try to fix the mess they made :) Better placement might also enable changes to the routing architectures.

That depends on the design. For telecom chips, random logic is usually the dominating factor, and arithmetic is not pushing the limits. DSP-style designs and pure packet-processing and switching designs are quite different beasts.

I think the biggest problem with this is how to estimate beforehand what fill level you can achieve, so that you can select the FPGA for the PCB design etc.

In the past it was quite a problem when designs were unroutable while you still had huge amounts of logic free. The routability differed from design to design, was not very predictable, and caused big problems.

Maybe this problem could get some help from the ASIC-side RTL tools that look for routing congestion already at the RTL level, so that the congested places can be recoded etc. But whether that is a good use of coding time is a question.

Even today's granularity is too much for the tools; there are many blocks that have existed in the fabrics for years that the tools just can't infer. The tools are getting better, but the pace is quite slow.

--Kim

Reply to
Kim Enkovaara
