Stratix 2 ALUT architecture patented?

This is clearly not offering anyone anything of any import.

I hope my explanations have been clear and understood by the others out there.

Have a nice day.

Austin

Austin Lesea

The idea of an architecture comparison using "real" designs is of great interest, however the choice of comparison metric used in the white paper above is woolly. A much better metric for comparison would be the silicon area required (normalised to the same process technology). "Normalized Relative Logic Capacity" in terms of the "ALUT" has little meaning.

Irwin Kennedy

It depends. If your metric is throughput (parallel/pipeline/multitask), then it has to be normalized to area or cost. LEs are a funny metric; it has to be silicon area.

But if the metric is latency, then actually area is secondary altogether, and it's just the clock cycle.

--
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

Likewise, logic capacity per silicon area has no real meaning. I have never bought a chip because it had a given area. I care about the cost. But that brings in another coefficient/variable that would have to be measured. In the real world manufacturers don't charge according to their costs. They charge according to the market, making as much profit as they can squeeze out.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

I have to agree with rickman (in spite of his harsh wording). The issue is not square millimeters. The issues are capacity, performance, and price (and power, familiarity, and software support). The connection between price and silicon area is very tenuous: defect density, process maturity, manufacturing volume, package cost, and market conditions are equally important factors. Thank God we are not (yet) selling FPGAs by the square millimeter.

And BMW, Lexus and Cadillac are still not selling their cars by the pound, or even the cubic inch. And those products have more than a hundred-year evolution behind them...

Peter Alfke


Geez Peter, I don't know what I said that you thought was harsh... or are you responding to my statement about companies charging as much as the market will allow? That was not meant as an insult, just a simple statement of fact. If companies did not make a profit, they would not exist. That is the nature of our system: in order for companies to form, there has to be a profit motive. No insult intended, I just wanted to clarify that there is only an indirect relationship between a product's cost and its price. Likewise there is only an indirect relationship between the silicon area and the price.

Heck, I am in business to make money. I don't price my products by their cost, I price them by how useful they are to my customers.

--

Rick "rickman" Collins

No, you never answered the question. How many logic cells does the XC3S1000 have? The data sheet says 17,280. Is that correct?

My understanding is that this number is meaningless and that I have to figure it out for myself. If I know that there are 8 LCs per CLB, then I can multiply 1,920 * 8 = 15,360. Funny, that is not the same as 17,280, is it?
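For anyone who wants to check the arithmetic, here is a minimal sketch using only the two figures quoted above. The variable names are made up for illustration, and nothing in this post explains where the gap comes from, so the snippet only measures it.

```python
# Figures quoted in the post above for the XC3S1000.
clbs = 1920                 # CLB count used in the post
lcs_per_clb = 8             # logic cells per CLB, as assumed in the post
datasheet_logic_cells = 17280

# Count derived by hand from the CLB figure.
counted_logic_cells = clbs * lcs_per_clb

# How far the datasheet number sits above the hand count.
ratio = datasheet_logic_cells / counted_logic_cells
extra = datasheet_logic_cells - counted_logic_cells

print(counted_logic_cells, extra, ratio)
```

The datasheet number comes out exactly 12.5% above the hand count, which suggests a fixed scaling factor rather than a different way of counting CLBs.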

Can you explain? Perhaps I don't understand what a logic cell is...

google_guy

Well google_guy, maybe you should try Google :-) Google is your friend (if you talk to it nicely). Google search "counting logic cells" and you get a link (without much effort) to

formatting link

and also:

formatting link

But one of my favorites is this:

formatting link

The short story is that the marketing weenies got to gate counts and butchered it to the point of meaninglessness, so engineers turned to LUT/LC counting, and now it is being abused too. Although I have not looked recently, in the past I found that Actel's gate-count claims were fairly honest, as were Xilinx's back around the XC4000 family (See the table in the "marketing_gates" link).

I particularly hate the new meaningless term "System Gates".

When I do estimates, I use E-Gates, which are gates that an engineer may actually get to use.

Philip

=================== Philip Freidin snipped-for-privacy@fliptronics.com Host for

formatting link


Hi Ray,

There are two specific ways in which Stratix II improves on the arithmetic capabilities of our previous architectures. First, a single ALM can implement the sum of two 4-LUTs provided they share two inputs: f(a, b, c, d) + g(a, b, e, f). Second, it can implement a 3-input adder, allowing you to reduce the number of ALMs and logic levels required for adder trees.

Please see

formatting link
and Figures 2-11, 2-12, and 2-13 of the Stratix II databook for further details.


In architecting Stratix II, we took into account the increased challenges for synthesis and place & route. We worked with and continue to work with our 3rd party synthesis providers to improve the quality of synthesis for this architecture. Using today's synthesis tools, Stratix II achieves a 25% logic density advantage over Stratix. And it's not hard to imagine that this will only get better with time.

Regards,

Paul Leventis Altera Corp.



A quick correction -- I'm suffering from Vacation-Fried Brain Syndrome:

Each ALM implements _two_ bits of arithmetic:

sum0 = f0(a, b, c, e0) + g0(a, b, c, f0)
sum1 = f1(a, b, d, e1) + g1(a, b, d, f1)

Where f0, f1, g0, and g1 are each four-input functions. The simplest use of this is to set e[0..1] = A[0..1] and f[0..1] = B[0..1], make a, b, c, d all don't cares, and set f0, f1, g0, g1 to be "wire LUTs". This just gives you Sum[0..1] = A[0..1] + B[0..1]. There are other more powerful uses, such as user-controllable adder/subtractor, conditional operators, etc.
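The "wire LUT" case can be checked with a small software model. This is only a sketch: the names `lut4`, `WIRE`, `alm_arith` and `add2` are invented here, and the ripple carry between the two bits is an assumption about how the two sums are chained, not something taken from the databook.

```python
def lut4(mask, i0, i1, i2, i3):
    """Evaluate a 4-input LUT whose 16-bit truth table is `mask`."""
    index = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (mask >> index) & 1

# "Wire LUT": the output simply follows the fourth input (truth table 0xFF00).
WIRE = 0xFF00

def alm_arith(masks, a, b, c, d, e, f, carry_in=0):
    """Model sum0 = f0(a,b,c,e0) + g0(a,b,c,f0), sum1 = f1(a,b,d,e1) + g1(a,b,d,f1).

    masks = (f0, g0, f1, g1); e and f are 2-bit tuples, bit 0 first.
    Returns (sum0, sum1, carry_out). The ripple carry is an assumption.
    """
    f0, g0, f1, g1 = masks
    x0, y0 = lut4(f0, a, b, c, e[0]), lut4(g0, a, b, c, f[0])
    x1, y1 = lut4(f1, a, b, d, e[1]), lut4(g1, a, b, d, f[1])
    sum0 = x0 ^ y0 ^ carry_in
    carry1 = (x0 & y0) | (carry_in & (x0 ^ y0))
    sum1 = x1 ^ y1 ^ carry1
    carry_out = (x1 & y1) | (carry1 & (x1 ^ y1))
    return sum0, sum1, carry_out

def add2(A, B):
    """Sum[0..1] = A[0..1] + B[0..1] with four wire LUTs; a, b, c, d are don't-cares."""
    e = (A & 1, (A >> 1) & 1)
    f = (B & 1, (B >> 1) & 1)
    s0, s1, cout = alm_arith((WIRE,) * 4, 0, 0, 0, 0, e, f)
    return s0 | (s1 << 1) | (cout << 2)
```

Swapping `WIRE` for other 16-bit masks is how the more powerful uses mentioned above (add/subtract select, conditional operands) would be modelled, with a, b, c, d acting as the shared control inputs.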

An ALM can also implement _two bits_ of a 3-input adder.
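To get a feel for what a 3-input adder does to an adder tree, here is a rough sketch that counts tree levels and adder operators for a given fan-in. The function name `adder_tree` is hypothetical, it counts whole adders rather than ALMs, and it ignores bit-width growth, so treat the numbers as indicative only.

```python
def adder_tree(n_operands, fan_in):
    """Return (levels, adders) for summing n_operands with adders of the given fan-in."""
    levels = adders = 0
    remaining = n_operands
    while remaining > 1:
        full, leftover = divmod(remaining, fan_in)
        # A leftover group of two or more operands still needs an adder;
        # a single leftover operand just passes through to the next level.
        adders += full + (1 if leftover >= 2 else 0)
        remaining = full + (1 if leftover >= 1 else 0)
        levels += 1
    return levels, adders

for n in (4, 9, 16):
    print(n, adder_tree(n, 2), adder_tree(n, 3))
```

For nine operands this gives four levels and eight adders with 2-input adders, against two levels and four adders with 3-input adders.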

Regards,

Paul Leventis Altera Corp.


Hi Austin,

[To answer your technical/architectural questions]

I would highly recommend looking at Figure 2-6 of the Stratix II databook to gain a better understanding of exactly what hardware there is in the ALM.

The Stratix II ALM can implement all functions of 6 inputs, since it has 64 bits of LUT memory. It can also implement two independent 4-LUTs, a 5-LUT and 3-LUT, two 6-LUTs that share a LUT mask and 4 inputs, two 5-LUTs that share part of their LUT mask plus 2 inputs, a subset of 7-input LUTs, etc. Plus there are a variety of ways to combine this functionality with registers before, after, or independent of the logic, and some gunk for powerful arithmetic.

First the simple question: How does a 6-LUT differ from 4 4-LUTs + 3 2:1 muxes (ala 2 slices, 2 f5 muxes and an f6 mux)? It is not just where you draw the boxes. The silicon area per logic function (or logic efficiency) is much better with a 6-LUT, and this is largely due to area for user-programmable routing.

In a 4-LUT architecture, the LUTs are designed to be independent, thus there are 4 independently routable signals to each LUT. The fx muxes also require a control input, which for now we will assume is independently routed. Thus implementing a 6-input function using a 4-LUT architecture and fx muxes requires a total of 19 independently routed signals (4 LUTs x 4 inputs, plus 3 mux controls). This implies 19 routing multiplexers which burn area and power. With a 6-LUT, obviously only 6 routing inputs would be required. So the potential area savings of a 6-LUT come not from a reduction in LUT mask RAM bits (both require 64) but from a reduction in user-configurable routing multiplexers.
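The equivalence and the signal count can be sketched in a few lines. This is an illustrative model only: the function names are invented, and the mapping of mask bits to the four 4-LUTs is one arbitrary choice among several.

```python
def lut(mask, inputs):
    """Evaluate a LUT: `inputs` is a sequence of bits, input 0 is the mask index LSB."""
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return (mask >> index) & 1

def lut6_from_lut4s(mask64, x):
    """Build a 6-input function from four 4-LUTs, two F5 muxes and one F6 mux."""
    # Each 4-LUT holds one 16-bit quarter of the 64-bit mask.
    quarters = [(mask64 >> (16 * q)) & 0xFFFF for q in range(4)]
    outs = [lut(m, x[:4]) for m in quarters]
    f5_lo = outs[1] if x[4] else outs[0]   # first F5 mux, selected by x[4]
    f5_hi = outs[3] if x[4] else outs[2]   # second F5 mux, selected by x[4]
    return f5_hi if x[5] else f5_lo        # F6 mux, selected by x[5]

# Routed signals if every LUT input and mux select is routed independently.
routed_4lut = 4 * 4 + 2 + 1   # four 4-LUTs, two F5 selects, one F6 select
routed_6lut = 6

def equivalent(mask64):
    """Check the composed structure against a flat 6-LUT for all 64 input patterns."""
    for value in range(64):
        x = [(value >> i) & 1 for i in range(6)]
        if lut(mask64, x) != lut6_from_lut4s(mask64, x):
            return False
    return True
```

Both structures consume the same 64 mask bits; the difference the post is pointing at is the 19 versus 6 routed signals.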

Of course, you can't take this argument to extremes. Working against larger LUTs is your ability to map designs into these larger functions. If most of your design maps into 4-input functions and you have a 6-LUT architecture, you'll be wasting a lot of silicon and a 4-LUT based product will be more efficient. For these reasons, there is a bottom to the curve -- a 25-LUT architecture would not be more area efficient than a 4-LUT architecture! Where that bottom is... well, there's lots of academic studies and we've got our own data.

But the Stratix II ALM is more than a 6-LUT architecture. It targets the routing area efficiency gains of larger LUTs, while attempting to minimize the wastage that occurs when you need to implement small logic functions. It provides a few extra inputs (8 instead of 6) and one extra output (2 instead of 1), and is thus slightly less efficient than a true 6-LUT architecture for implementing 6-input functions. However, these inputs and outputs plus a few internal 2:1 muxes allow us to make use of the full ALM under a wide range of function sizes by allowing us to fracture the ALM into independent/semi-dependent functions. This allows us to greatly reduce the number of LUT mask bits that go unused, and allows us to highly utilize the available inputs and outputs of the ALM, resulting in little wasted silicon area for input/output routing.

Why 8 inputs, 2 outputs, and all the little 2:1 muxes? Because our experiments in the end showed that this resulted in the best combination of area and performance, and I can assure you we believed there to be a substantial benefit over the Stratix ALE in order to commit the resources required to support a completely new logic fabric.

On a performance front, larger input LUTs confer a benefit in terms of critical path delay by reducing the number of levels of logic and thus routing hops required to implement a given cone of logic. But is a 6-LUT based ALM faster than 4-LUT based slices + fx muxes? A paper analysis will not answer this, since both implement 6-input functions (albeit at different area efficiencies). I could start arguing that smaller area turns into/gives area to be spent on better speed, or start counting transistors/gates in the path, but then we'd be getting into a very fuzzy realm full of a gazillion assumptions!

And I must say it was enjoyable to have worked on a radical, innovative architecture such as Stratix II. And given its enhancements over the successful Stratix architecture, I expect it to be flesh-covered and alive for a long while.

Regards,

Paul Leventis Altera Corp.


The older academic data that I have seen suggests that the optimum is somewhere between 4-LUTs and 5-LUTs for non-arithmetic circuits. It makes sense that these numbers shift to larger LUTs when the number of routing levels increases and also when the transistors shrink faster than the routing density (which seems to be the case for modern technologies).

But silicon area is not everything. Because of the large area that is dedicated to routing in FPGAs, it generally makes sense (area-wise) to have an architecture that is a little short on routing. (Better to waste unused LUTs than to waste unused routing.) But it turned out that both the CAD tool developers and the customers did not like this, because the flow (in the tools and in the heads) is to place first and route second. This flow becomes a lot easier when the routing is guaranteed to succeed.

Therefore most commercial FPGAs we see today target 100% LUT utilization. This is expensive. But it really helps time-to-market-wise.

Kolja Sulimma


Hi Kolja,

Yes, the academic literature suggesting this is interesting (Andre DeHon had a paper at FPGA a few years back, I think). In our own experimentation, we see that different designs (obviously) have varying amounts of routing demand per logic element. If you try to build a chip that allows all designs to route, you'll have way too much routing for most designs.

But as you point out, customers aren't too happy if they have a 90% full device that fails to route. And it's mostly a problem when they've done most of their design (and it fits), only to find that late in the game when they add a few more LEs, suddenly they can't route anymore. And when customers run into problems, it costs us in support time, customer loyalty, lost business, etc. So there are cost pressures that push us in the direction of being slightly over-routed.

That said, our devices do make use of the less-than-100% observation, but in more local ways. For example, a LAB in Stratix has 30 general routing inputs (lab lines), and has 10 4-input LEs. Obviously, you could construct a LAB that would be (deterministically) unroutable. It would just not be efficient to build a LAB architecture with 40+ inputs, since Quartus can almost always find LEs that share input signals or feed back to one another in order to reduce input demands, and thus most LABs would have a lot of wasted input muxes. When it can't, it will automatically leave some LEs in a LAB unused in order to cap the number of inputs in use. This is like a localized version of the "don't hit 100%" approach. There is a large body of good academic research on the optimal # of cluster inputs for a given # of BLEs (Rose, Betz, E. Ahmed, Singh, Kouloheris, D. Hill, etc.). There is also research that shows that you should aim to never use the full # of LAB inputs, as this is more efficient than trying to make a fully utilized set of inputs routable (Guy Lemieux).
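The input cap can be illustrated with a toy packer. To be clear, this is not how Quartus clusters logic: the greedy loop, the function names, and the LE representation are all assumptions made for the sketch. Only the 30-input and 10-LE figures come from the post.

```python
LAB_INPUTS = 30     # general routing inputs (lab lines) per Stratix LAB
LES_PER_LAB = 10    # 4-input LEs per LAB

def external_inputs(les):
    """Distinct signals a LAB must bring in from general routing.

    Each LE is (output_name, input_names). Signals produced by an LE in the
    same LAB are treated as local feedback and do not use a lab line.
    """
    produced = {out for out, _ in les}
    used = set()
    for _, ins in les:
        used.update(ins)
    return used - produced

def pack_lab(candidates):
    """Greedily add LEs, skipping any that would push the LAB past its input cap."""
    packed = []
    for le in candidates:
        if len(packed) == LES_PER_LAB:
            break
        if len(external_inputs(packed + [le])) <= LAB_INPUTS:
            packed.append(le)
    return packed

# Worst case: ten LEs with four unique inputs each would need 40 lab lines.
unshared = [(f"o{i}", [f"i{i}_{k}" for k in range(4)]) for i in range(10)]
# Friendlier case: every LE shares two inputs with the others.
shared = [(f"o{i}", ["a", "b", f"i{i}_0", f"i{i}_1"]) for i in range(10)]

print(len(pack_lab(unshared)), len(pack_lab(shared)))
```

With no sharing, only seven of the ten LEs fit under the 30-input cap, while the shared-input case fills the LAB using 22 lab lines.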

Regards,

Paul Leventis Altera Corp.


Paul,

You answered my questions.

It will be an interesting springtime.

Thanks,

Austin

Austin Lesea

I think we are just seeing things from different points of view. I was just suggesting a better method for tool-flow+fabric comparison than that used in the referenced Altera white paper. I agree that many other variables will affect who gains more market share. (Perhaps this is because not all that much separates the two solutions on a tool-flow+fabric basis?)

For fun I would argue that the FPGA customer is more discerning from an engineering perspective than the car customer!

Irwin Kennedy
