July 20th Altera Net Seminar: Stratix II Logic Density

Philip, good point. I think you see what I am trying to say. Basically, that this "mine's bigger" pissing contest is at best a demonstration of how well the parts compare on an undisclosed set of benchmarks that are biased towards the favorite device. My point in my previous posts is that *any* design is going to be biased toward one of the devices. That bias can be toyed with by making changes in the benchmark designs (and I am talking about changes at the RTL level, not hand crafting). Even naive designs have this bias, but I suspect that the benchmarks were either designs done by the vendor's FAEs or by the vendor's customers, which would already introduce a distinct natural bias toward that vendor's devices. Even if he did use naive designs, marketing would have undoubtedly polished the numbers by either tweaking the designs or cherry picking the benchmarks to support his sales pitch. I'm not saying there is anything wrong with that, just trying to expose the nearly unavoidable bias that is naturally there.

Reply to
Ray Andraka

Hi Ray,

I think this is the paper you're talking about --

formatting link
It's the journal version of a paper from FPGA 2000, which matches the timeframe you're talking about, and it is by one of Jonathan's students (Elias Ahmed).

The conclusions are the opposite of what you're saying though: larger LUTs are faster (which was known before), but also that 5 & 6 LUTs are much more area-efficient with modern CAD tools than was previously believed (in Jonathan's 1990 JSSC paper for example). We came to a similar conclusion independently at Altera. However, to improve the area-efficiency of larger LUTs further, we found we had to allow the larger LUT to be fracturable into smaller LUTs when appropriate, and that's the genesis of the ALM.

Not many designs manage to keep all the logic to one level of LUT. The DSP designs you do are the most amenable type for deep pipelining, and I think you are a champion even among experts -- I don't see other DSP designs hitting that level of pipelining. Maybe they're so expert they never need our help though :).

The high-speed designs I see today (typically 250 - 300 MHz, sometimes up to 400 MHz) have multiple levels of logic between the registers. Typical would be ~3 to 5 levels of logic (LUT) on the critical path using the Stratix II ALM, vs. ~4 to 7 levels using Stratix's logic element (4-LUT based). That reduction of routing hops and logic levels by ~25% is a big help in getting performance in those clock ranges.
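As a back-of-envelope illustration of why fewer logic levels matter, here is a sketch of the arithmetic; the per-level delay is a made-up round number for illustration, not a Stratix II or Virtex spec:

```python
# Rough Fmax estimate from levels of logic on the critical path.
# The 0.65 ns per LUT + routing hop is an illustrative guess, not vendor data.
def fmax_mhz(levels, ns_per_level=0.65):
    return 1000.0 / (levels * ns_per_level)

for levels in (4, 5, 6, 7):
    print(f"{levels} levels -> ~{fmax_mhz(levels):.0f} MHz")

# Cutting the critical path from 7 LUT levels to 5 is a ~29% reduction,
# roughly in line with the "~25%" figure quoted above.
reduction = (7 - 5) / 7
```

Whatever the exact per-level delay is on a given part, the point stands: Fmax scales inversely with the number of LUT-plus-routing hops on the critical path.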

Most communications designs have more complex logic, and it's typical to have ~6 to 10 levels of logic on the critical path, putting them in the ~120 - 220 MHz range on the main system clock.

Using the larger LUTs in an ALM (5- or 6-LUT) still leaves both ALM registers usable, so there's no problem with register starvation even if you manage to keep all the logic one LUT deep.

Regards,

Vaughn Altera [v b e t z (at) altera.com]

Reply to
Vaughn Betz


Actually, what this does is benchmark your "experts". And the results at best are only valid if you then use that expert for your design :-)


Ray,

This is certainly a valid concern.

Reply to
Vaughn Betz

I'm sure Altera will trot out a bunch of data showing how much more logic-efficient the ALM is than the Slice. Otherwise they would not hold a net seminar. That said, I think the ALM has some nice features. I don't think the 6- and 7-input modes give you much advantage. How often can you share 4 inputs between two separate functions? Or share 6 inputs between two separate functions? My guess is that the 5 + 3 or 5 + 4 is where it helps more. I think the biggest advantage is reduced levels of logic, isn't it? One important point will be the Fmax difference between these designs. Aren't more packed designs slower?

Anyway, the truth is something that will not be talked about on here. What is the silicon area of an ALM? What is the silicon area of a SliceM and a SliceL? If the silicon area of a SliceM + SliceL is half the size of 2 ALMs, then Xilinx is way ahead. If the silicon area of 2 ALMs is half the size of SliceM + SliceL, then Altera is way ahead. An ALM has 64 SRAM bits, I believe: 2 x 4-LUT plus 4 x 3-LUT. A slice has 32 SRAM bits. Last time I checked, 32 SRAM bits takes up less area than 64 SRAM bits. But SRL16 and distributed RAM circuits blow up the size of a SliceM.
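The SRAM-bit arithmetic above checks out, since a k-input LUT needs 2^k configuration bits:

```python
# Configuration bits per k-input LUT: 2**k.
alm_bits = 2 * 2**4 + 4 * 2**3   # 2 x 4-LUT + 4 x 3-LUT = 64 bits
slice_bits = 2 * 2**4            # 2 x 4-LUT = 32 bits
print(alm_bits, slice_bits)      # 64 32
```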

The truth is that the 180 and the 200 are most likely both "reticle busters". The biggest chip you can make. One chip per 90 nm reticle shot. Since the LX200 has 200K and the S2 has 180K, my guess is that the SliceM + SliceL is a bit smaller than 2 x ALM. The laws of physics are the same for both Altera and Xilinx. Nothing is free. SRL16 and RAM cost area. ALM costs area. Routing costs area. Is the size of the silicon area defined by routing or by circuits?

If the ALM has bigger silicon area then it should have lower resource usage.

Vaughn Betz wrote:

Couldn't the benchmark designs favour Altera, either intentionally or unintentionally?

Our benchmark design set is considered a valuable engineering resource at Altera. That's because we believe the only way to make our CAD tools or architectures better is to be scientific and rigorously benchmark our ideas on real customer designs, and to choose what ideas go into the CAD tools and architectures based on those results. That means monkeying with the HDL is a big no-no -- it would destroy the intent of the customer design, and it would no longer be a valid test case. To allow a design that was targeted at one of Altera/Xilinx/Lattice to compile in the other vendor's tools, the megafunctions/coregen modules are replaced with equivalent modules for each manufacturer, but that's all the design modification that's allowed.

I don't believe our benchmark set has such a bias, but of course it's impossible to be 100% sure of that. If all the designs we had were designed for Stratix II and had had their HDL extensively tuned for Stratix II by the customer or an FAE, a bias would certainly be possible. But the device our benchmark set originally targeted is all over the map: Stratix, Virtex-II, Virtex-II Pro, Stratix II. A significant fraction of our design set was intentionally written by customers to let them target (with appropriate megafunctions/coregen modules included) both the Stratix and Virtex families, so they could do their own benchmark. Based on the result of that benchmark, they would eliminate one vendor if its device just couldn't hit the speed or density target, or, if both worked, they would be happy to discuss pricing with both vendors :). We also have some designs where we were in competition with an ASIC, so again the HDL was not written with a particular device in mind.

If anything, more of our benchmark designs were coded for Stratix and the Virtex family than were coded for Stratix II. That would tend to favour a 4-LUT architecture, if there were a bias. However, since almost no customers code their HDL thinking that carefully about how it will map to this LUT vs. that one, I don't think there's much bias against Stratix II either. Basically, customers count on CAD tools to implement their logic and routing intelligently these days, and they rarely intervene except to add deeper pipelining or perhaps retime part of the logic when it's clear they aren't going to be fast enough. Those are really architecture-level changes, and they don't directly favour one chip vs. another.

The bottom line is that if density is important to you, you should benchmark for density when choosing a device, rather than relying on the number attached to a device. Compile your design, or the opencores designs, and check the % utilization.

Reply to
seannstifler69

Thank you for the reply, Paul. Now that you point this out explicitly, I think I can see how Altera justifies their numbers (I have no idea if they are truly accurate, but at least I understand them).

As you said, the label that Altera has used is most unfortunate... "LUT" has a pretty specific meaning to a very large number of people, and Altera seems to be twisting that meaning to suit them - and adding much confusion in the process.

Presumably F8 muxes were turned on. Presumably the suite didn't have very many "128-number, 16-bit-per-number adder" trees. Presumably it didn't force Xilinx to use SRLs when there are almost certainly registers available for a simple four-bit shift register (the "0.5 slices" entry in table 11 of the white paper is just silly -- if the Stratix can use four registers, so can Xilinx. Altera shouldn't try to imply there is an automatic savings with the Stratix II).

I believe at least half the confusion I had on this topic was that Altera uses the term "equivalent four-input LUTs" in reference to Xilinx also (in multiple places, but especially table 4 of the white paper). If they'd just stuck with Slice comparisons, it would have helped greatly. Altera has defined "equivalent four-input LUTs" to mean ALM * 2.6, so how can that term apply to Xilinx? Not only that, but when you start using the term LUT, it makes it seem like Altera is ignoring the other features of the slice like the F-muxes, counter/carry logic, etc. -- hence the strong wording in my previous email. I see now that they aren't.

Thank you again for helping me to understand the madness,

Marc

Reply to
Marc Randolph

Hi Sean,

Thanks for a well thought-out posting with a lot of good questions and comments.

We found functions could share inputs often enough to justify the small amount of extra hardware required to allow these modes. But yes, it is primarily the ability to implement a 4 + 4, a 3 + 5, a 6, etc. that helps the most on density, since this allows us to efficiently pack most of the LUTs that come out of synthesis + tech mapping without wasting much of the ALM. The other modes are not as commonly used, but they are used often enough to justify their existence.
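The packing constraint here is just an input-budget check. The sketch below assumes the ALM has 8 general input ports (as I understand the Stratix II ALM), and the mode list is illustrative, based on the combinations discussed in this thread:

```python
# A pair of LUTs fits in one ALM when their combined distinct
# inputs stay within the ALM's input budget.
ALM_INPUTS = 8  # assumed input-port count for the Stratix II ALM

# (mode name, inputs of LUT a, inputs of LUT b, inputs shared by both)
modes = [
    ("6-LUT",           6, 0, 0),
    ("4 + 4",           4, 4, 0),
    ("3 + 5",           3, 5, 0),
    ("5 + 5, 2 shared", 5, 5, 2),
    ("6 + 6, 4 shared", 6, 6, 4),  # shared LUT-mask mode
]

for name, a, b, shared in modes:
    needed = a + b - shared
    verdict = "fits" if needed <= ALM_INPUTS else "does not fit"
    print(f"{name}: {needed} distinct inputs -> {verdict}")
```

This also shows why "two 6-LUTs sharing 4 inputs" is the interesting case: 6 + 6 - 4 = 8 lands exactly on the input budget, so two such functions can share one ALM only because of the sharing.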

Yes, the main advantage of the 6-LUT is speed. In a classic 6-LUT architecture, the added speed of reduced logic levels would come at a large increase in silicon cost (due to unused logic in 6-LUTs used to implement 5-, 4- and 3-input functions). The ALM is designed to get the speed benefit of reduced logic levels, without that area penalty.

Packing *can* slow things down, depending on how smart the tool is. If you take two unrelated 4-LUTs that want to be on opposite sides of the chip and pack them into one ALM (or one Slice), then you have probably hurt performance. On the other hand, taking LUTs that have some common fan-in and packing them together probably doesn't hurt performance much, since they likely come from parts of the design that wanted to be together anyway. Also, by packing more tightly, the average distance a routing connection must travel may be reduced, improving performance. For example, the shared-LUT-mask mode of the ALM that implements two 6-LUTs sharing 4 inputs in one ALM likely doesn't hurt performance at all -- if anything, it probably helps, since it doubles the density of those circuits, such as bused multiplexors, that use it.

This is a good observation and is the heart of the logic architect's design problem. In designing Stratix II, we experimented with numerous Logic Element designs; each was evaluated on the "silicon area per logic function" metric. We are claiming that you need 1.3 Slices to implement the same functionality as an ALM (on average). Provided the ALM is no more than 1.3X bigger than a Slice, then the "silicon area per logic function" is better off for the ALM.

Remember, the LX200 has 89,088 Slices, and the 2S180 has 71,760 ALMs. So *assuming* that the two dice have the same area, and *assuming* that all that area is ALMs/Slices (which it is not), then you get an area ratio of ALM:Slice = 1.24:1, which is lower than the logic density ratio of 1.3:1, indicating that the ALM has better density by a ratio of 1.30/1.24. Of course, there are enough assumptions here to completely invalidate this result; we both spend considerable area on other functions, we have overhead for redundancy technology (but get more yield), etc., and routing is a considerable part of the logic + routing area...
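For what it's worth, this arithmetic can be reproduced directly from the device counts quoted in the thread (subject, of course, to the same equal-die-area assumptions Paul flags):

```python
lx200_slices = 89_088
s2_180_alms = 71_760

# Equal-die-area assumption: relative silicon cost of one ALM vs. one Slice.
area_ratio = lx200_slices / s2_180_alms   # ~1.24 ALM:Slice

# Claimed functionality ratio: ~1.3 Slices per ALM.
density_ratio = 1.30

# Net density advantage implied for the ALM, under these assumptions.
advantage = density_ratio / area_ratio    # ~1.05

print(f"area ratio {area_ratio:.2f}, net advantage {advantage:.2f}")
```

A ~5% net difference is well inside the noise of the listed assumptions, which is presumably why neither side can settle the argument with public numbers.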

In the end, Altera chose to move to the ALM. This involved re-writing much of our LE-optimized IP, new Quartus synthesis, new 3rd-party synthesis, considerable changes to our CAD flow, and a lot of other work. The only way we would have committed the resources needed and taken this big risk is if we believed the ALM was a big win in silicon cost and speed.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

Peter,

I agree. This is a "shell and pea game" being conducted.

"Keep your eye on the LUTs -- they will grow right before your eyes!"

Amazingly, this keeps the eye off the domain-specific solutions (i.e., SX is the lowest cost for DSP-centric designs, LX is the best for logic, FX for embedded systems), signal integrity, power, speed, EMAC, PPC, MGT, FIFO BRAM, DCM, as well as the other hundred innovations in Virtex 4. For all of these, there is no defense. Nothing to say. So, let us just fool everyone into arguing about something else?

I applaud their marketing department's ability to put a ring through the customer's nose, and lead them around at will. Xilinx has always marveled at their ability to do this. However, we do believe our customers are smarter than that.

By focusing the conversation, and all the brain cells on what they want to have you focused on, they have succeeded in making a large number of engineers do exactly what they wanted.

Keeps them from looking at the other hundred innovations that make the Virtex 4 an undisputed winner.

For me, I will continue to point out the other 100 reasons to choose V4. I'll let the LUT shaped heads battle this one out, but be aware that you are being used.

Austin

Reply to
austin

Austin gave an interesting analogy:

I can visualize Altera Marketing trying to put a ring in designers' noses, and (mis)leading them so that they can see nothing else but LUTs, oblivious to all the more exciting aspects of FPGAs.

We could then call them LUTites, in memory of the Luddites in 1812 Nottingham, who also did not appreciate the technological progress coming at them, destined to really change their lives. Many suspected Luddites were convicted, imprisoned, or hanged. Those were the days...

But don't get us wrong: We love LUTs. Xilinx invented the concept of LUT-based FPGAs. LUTs are great and flexible, whether they have 4, 5, or 6 inputs. But LUTs are not everything anymore, and the really exciting progress in FPGA performance and density happens outside the LUTs.

Don't let anybody put a ring in your nose... Peter Alfke, Xilinx Applications (from home)

Reply to
Peter Alfke

Could you Xilinx people please grow up? This is not insightful, this is not even zealous company loyalty, this is just boys'-club bickering.

A reasoned discussion about advantages and disadvantages of FPGA architectures is interesting, but do you really think that this kind of posting is informative for /anyone/?

Reply to
Kees van Reeuwijk

Kees, you have a point, but how should we respond when our competitor runs a public web-seminar that obfuscates the issue with irrelevant hair-splitting? Some people might even believe them... Peter Alfke

Reply to
Peter Alfke

Perhaps you should give your customers more credit for being able to think for themselves.

-- Mike Treseler

Reply to
Mike Treseler

Mike,

We do, we do (give our customers credit for thinking for themselves; but we worry about those who are not our customers...).

We made our point. No need to go on about it.

Thanks for listening, and taking the time to reply.

Austin

Reply to
austin

Respond in kind, i.e. run your own public web-seminar and rebut their claims in a professional way, instead of acting like 6-year-olds saying "you're stupid; no, you're stupid".

that's called marketing and at times xilinx is pretty good at it too :-)

that's the cost of living in a democracy :-) people believe all kinds of things. some even believe that iraq was behind 9/11 and wmd were found there.

Reply to
m

Paul Leventis wrote:

That is an oversimplification. You can hardly beat the silicon area per logic function of NAND gates (for combinational logic). Motorola had an FPGA family based on that approach, but it failed. Most transistor area in an FPGA is used for routing switches, so how the logic block influences the routing requirements matters more than what the area of the logic block itself is. Therefore, if in doubt, one should opt for the larger LUTs. Academic results of the nineties suggest LUTs should have between 3 and 4 inputs, so most vendors went for 4-LUTs for the above reasons. Academic results also suggest that it is better to have starved routing, i.e. not to have all LUTs routable for most designs.

But that is the academic truth. Then there comes marketing. For example, Intel decided to build an architecture that optimizes MHz instead of performance because they sold CPUs with "true MHz". First Xilinx and later Altera decided to build FPGAs with less-than-optimal cost effectiveness that are 100% routable in most cases, because customers kept complaining that they could use only 80% of the devices. Now they instead have devices that cost 25% more but can be completely utilized. Those are easier to sell. No matter what the true benefit of the new architecture is, my guess is that "more LUTs" is easier to sell than "better LUTs", so Xilinx made the better marketing choice in this case.

Kolja Sulimma

Reply to
Kolja Sulimma

I think we should point out that Xilinx-marketing started this whole "mine is bigger"-competition:

formatting link

This is all somewhat similar to the PC CPU market: the Pentium 4 needs much higher clock rates to achieve the same performance as an Athlon 64. When advertising clock rates, this looked bad for AMD. So they introduced model numbers ("Pentium rating") to show the real speed relationship. This led to a lot of discussions ("real MHz") in the beginning, until it turned out that the Pentium rating was pretty fair.

Things are similar here: Altera has a logic cell that allows more logic to be implemented per ALUT than in a LUT of Xilinx. To make things comparable, Altera had to make a "Xilinx rating" (of +30%, which looks like a lot to me...). However, Xilinx marketing was more clever than Intel; they already had their own "Xilinx rating", which is +12.5%, for very questionable reasons. So there were already discussions that you should compare real numbers with real numbers, or marketing numbers with marketing numbers. In both cases, of course, Altera will look worse than they are.

Please note that in the above press release Xilinx talks of 200,448 logic elements (for me this is a LUT + FF), while the device has only 178,176. I think THIS is really misleading. Altera talks of "equivalent LEs" in similar releases
formatting link
which is much more correct.
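For what it's worth, the marketing numbers in this sub-thread are mutually consistent; the ~2.51 multiplier below is simply implied by the figures quoted in the thread, not an official Altera number:

```python
lx200_slices = 89_088
lx200_luts = lx200_slices * 2           # 178,176 real LUT + FF pairs
xilinx_marketing = lx200_luts * 1.125   # +12.5% -> the advertised 200,448

s2_180_alms = 71_760
# If the 2S180's "equivalent LE" rating is ~180K (as the part number
# suggests), the implied per-ALM multiplier is:
implied_multiplier = 180_000 / s2_180_alms  # ~2.51

print(int(xilinx_marketing), round(implied_multiplier, 2))
```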

BTW, I think LEs count in the first place in an FPGA. Memories, DSP blocks, etc. are also important, but what good is their 500 MHz when the logic array cannot (or can only with huge effort) match their performance?

Thomas

"Peter Alfke" wrote in message news: snipped-for-privacy@o13g2000cwo.googlegroups.com...

Reply to
Thomas Entner

Yes, you are correct. My definition of "silicon area per logic function" is for the full LAB -- this includes the LE itself, all the routing-related area, and the secondary signal (clock/enable/clear etc) goo that is shared between a group of LEs. This of course complicates the architectural experiments since you must optimize the LAB size, secondary signals, and routing fabric for each LE under evaluation. This is what we try to do.

Yes, but academics don't have to deal with angry customers who find out six months into their design that it won't route. In practice, we aim for something like 99% routability in a 99% full device. Even if this is (academically) wasteful, there is a cost associated with having an unroutable part. And part of the allure of FPGAs is low-risk and rapid development -- high routability is a necessary attribute to meeting these expectations.

I can assure you that 25% overstates the cost overhead for near-guaranteed routability :-)

Time will tell. Yes, more things you can point at are easier to sell... especially when you inflate the count by a further 12.5%. But the ALM also got us a big speed boost relative to Virtex-4, which we're finding is pretty easy to sell too :-)

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

"Paul Leventis" wrote in message news: snipped-for-privacy@g14g2000cwa.googlegroups.com...

Hi Paul,

Do you have performance data for S-II fabric speed?

Unfortunately I can't measure it myself, as I don't have S-IIs around, but I have measured actual in-fabric clock speeds of 950 MHz in the lowest speed grade of V4 -- what would be the case in the S-II lowest speed grade? Are 1 GHz+ internal signals possible in the fabric?

Antti

Reply to
Antti Lukats

Yes. What would you like to know? If you compile a design in the freely available Quartus II Web Edition, it will tell you a fair bit about fabric speed (the timing models are very accurate).

While you can get the FPGA fabric to run very fast (probably a few GHz for shift registers, counters, local single-level logic functions, etc.), getting a clock to that fabric is another story. We limit the Stratix II clock timing model to 550 MHz; this is for the global clock network. Some of the specialized I/O-related clocks run faster. The reason is that things start getting funky when you run clocks at high speeds, and frankly, there aren't any applications (that I know of) that require operation in excess of 550 MHz. Our DSP/RAM blocks max out at that speed, and

Reply to
Paul Leventis (at home)

FWIW, I completely agree with Kees.

-a

Reply to
Andy Peters
