July 20th Altera Net Seminar: Stratix II Logic Density

On Wednesday, July 20th @ 11 AM PST, two of my colleagues (Alex Grbic and Paul Ekas) will be giving a net seminar comparing Stratix II and Virtex-4 logic densities. They will describe the logic architectures of these two families, compare logic densities between these two families, discuss our benchmarking methodology and results, and provide software settings to maximize logic packing in Stratix II FPGAs. Details can be found at

formatting link

Stratix II utilizes an innovative logic element we've called an "Adaptive Logic Module" or ALM. This logic structure can efficiently implement two 4-LUTs, one 6-LUT, some 7-input functions, a 3-LUT + 5-LUT, plus other combinations sharing inputs and/or portions of the LUT mask. This capability translates into increased logic density, but complicates matters when it comes to comparing Stratix II results with those of traditional 4-LUT based architectures such as Stratix I and Virtex.

I should point out that as part of this talk Alex will be providing results gathered from publicly available benchmark designs, allowing others to replicate the results he will present.

We will answer questions provided during the Net Seminar and I look forward to a healthy newsgroup discussion afterwards!

Regards,

Paul Leventis Altera Corp

I am glad to see that Altera has joined the world in acknowledging Virtex-4 as The Gold Standard for FPGAs.

Peter Alfke, Xilinx Applications

Paul

You and the rest of your team at Altera should be absolutely embarrassed that you continue with this marketing analysis of the Stratix II logic superiority over the Virtex 4. And keep in mind that I have no opinion of one Company over the other.

Looking at the Stratix II 180 vs. the Virtex 4 200, as an example:

In your analysis you claim that the Altera Stratix II "180" with 186,576 "4-input LUTs" is bigger than the Xilinx Virtex 4 "200" with 178,176 "4-input LUTs".

The point is that the Altera 180 part has 143,520 "ALUTs" and 179,400 "Logic Elements" while the Virtex 200 part has 178,176 "LUTs" and 200,448 "Logic Cells".

The concluding point is that Altera uses its higher 179,400 number (actually increasing it to 186,576) and compares it with the lower Xilinx 178,176 number in stating its superiority.

You must be kidding that you are using the higher Altera count to compare with the smaller Xilinx count. You must honestly believe that you are dealing with idiots. In any case, you should really be comparing the ALUT number with the LUT number, because that is your closest architectural comparison. Look-up tables are look-up tables, regardless of what you call them.

I could elaborate more on the multiplier and memory comparisons, but I won't, and the conclusions are the same.

You guys actually had the guts to release a press release with this analysis:

formatting link

And would you quit marketing your variable input LUT architecture? Xilinx has had a variable input LUT architecture since the Virtex was introduced in 1998.

You really don't think that smart engineers buy this "analysis", do you?

Just trying to get to the truth.

Tim

"tim" wrote in news message news: snipped-for-privacy@g44g2000cwa.googlegroups.com...

[altera marketing skipped]

The S180 has more memory than the LX200, which is LOGIC optimized, not memory optimized. Besides that, the large memory blocks cannot be loaded from config memory, so it's not a fair comparison anyway. SX and FX devices way smaller than the S180 have way more memory. So Virtex beats the S180 on memory.

The S180 has more multipliers than the LX200, which again is LOGIC optimized, not DSP optimized. The SX is DSP optimized, and again a way smaller device has more DSP slices than the S180 has multipliers.

On total pin count, well, here Altera currently beats the Virtex-4 offering.

Overall I am 100% with Tim that the whole Altera comparison is totally unfair. Of course it's hard to have the total truth; the suitability depends on the application, the quality of the tools, and many other things. The S180 is possibly the largest single one-for-all device from Altera, whereas Xilinx has split the high-end family into 3 offerings for what is needed in different applications.

Antti

PS Paul, it would be much more interesting to see the Altera Stratix-2 GX announced, or is it really delayed so long? I would have expected it to be released by now. Is it still coming? It could be that Lattice high-end FPGAs come out before the S2GX and beat the S2GX similarly to how machXO beats MAX2. Not saying that MAX2 isn't nice, it is, but the things Altera forgot are all in machXO. And try to pronounce it: MAX2 and machXO even sound similar :). I bet some Lattice guy made the name deliberately to sound like MAX2. Nice move ;) and no trademarks violated!

Tim,

The only way I trust to compare the logic capacity of two different architectures is to benchmark them against each other. Modern FPGA architectures, like modern processor architectures, are too complex to say which is more area-efficient, and by how much, based on a hand analysis. Think of trying to guess if a P4, P3 or Athlon is faster based purely on the specifications of their pipelines, issue units and clock rates -- it is impossible, so you have to benchmark them. FPGAs have hit that level too.

The best thing to do is to do your own comparison using the circuits of interest to you in the devices you're considering. But that's a lot of work, especially if you want to test it for multiple circuits (as you really should to get statistically valid answers).

Next best is to get someone else's benchmark results. And that's one of the things that will be presented in the NetSeminar tomorrow.

In terms of what should be counted in Stratix II vs. Virtex4 -- this is very difficult to do by hand. Stratix II is fundamentally based on a larger LUT (5-LUT with extra circuitry, or a 6-LUT, depending on how you look at it) than Virtex4 (4-LUT plus extra circuitry) so counting LUTs doesn't work. Academic and industrial research long ago showed that bigger LUTs implement more logic, so you can't simply count the number of LUTs in an architecture and ignore their size. But how much more logic can a bigger LUT implement, for a typical circuit? Nobody can tell you accurately, except by running a bunch of benchmark circuits and showing the results.

Regards,

Vaughn Altera [v b e t z (at) altera.com]

Vaughn Betz wrote: [...]

^^^^ ^^^^^ ^^^^^^^^^ ^^^^^^^^ ^^^^ ^^^^^^^

Howdy Vaughn,

No one (except maybe Xilinx) will fault Altera for trying to show how the 2S180 can pack more logic into the device than the LX200. But engineers ARE likely to fault Altera if they do such a comparison with misleading figures.

Your response _*completely ignored*_ the hard facts that Tim presented. Here are Tim's numbers again, since you clipped them:

V4      Slices   Actual LUTs   Logic cells (Xilinx claim)
---------------------------------------------------------
LX200   89088    178176        200448

S2      ALMs     ALUTs         Equiv_four_input_LUTs (Altera claim)
-------------------------------------------------------------------
2S180   71760    143520        186576

Since the last column is the only one where the funny math comes in, that is the only place Altera has any hope of showing how the 2S180 can pack more logic. To do that, Altera needs to provide a convincing argument that Xilinx's 200k number for their logic isn't just a little overly optimistic, but is so to the tune of at least 7%. And while doing so, it'd probably be good to show why Altera's funny numbers for the S2 are NOT overly optimistic.
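For reference, the 7% threshold mentioned above can be reproduced from the two claimed figures in the tables. This is only a sketch of that arithmetic; the variable names are illustrative and not taken from either vendor's material.

```python
# Headline figures quoted in this thread.
lx200_logic_cells = 200_448   # Xilinx "logic cells" claim for the LX200
s2_180_equiv_luts = 186_576   # Altera "equivalent 4-input LUTs" claim for the 2S180

# How much larger the Xilinx claim is than the Altera claim.
gap = lx200_logic_cells / s2_180_equiv_luts - 1
print(f"LX200 claim exceeds 2S180 claim by {gap:.1%}")
```

In other words, the Xilinx figure would have to be overstated by at least this margin before the 2S180 comes out ahead on the claimed numbers.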

Until doing that, might I suggest that Figure 1 be fixed on

formatting link

which seems to show that it is valid to compare the 178k number against the 186k number. You admit in your response above (where I underlined) that using the 178k number "doesn't work". So why does Altera use it in their comparisons?

formatting link
also uses the 178k number (even going so far as to claim that it is Xilinx's "equivalent" number, when it is most obviously the *actual* number). This paper is also where the 30% better number is presented without any backup data. Readers might trust this number a bit more if design details (especially the number of designs in each size) were published with the white paper.

I have absolutely nothing against Altera, the S2, or the new ALM. But I detest being misled, especially after it has been pointed out a time or two (at which point it becomes obvious that the misleading is being done on purpose rather than it having happened by accident).

Marc

Hi Marc,

Comparing different logic architectures is a difficult exercise, and the only legitimate way to do so is by benchmarking. That's how we architect our new logic architectures -- we build prototype synthesis and place & route tools, and measure each candidate architecture on a large suite of designs. The problem is making people outside the company believe our benchmarking is correct and impartial. I'd suggest tuning in to the Net Seminar, and then discussing here all the flaws you find in it afterwards. I'm sure Vaughn, Alex and I will be happy to discuss any areas of contention!

Perhaps another way of looking at things is by comparing Stratix II to Stratix (let's remove the competition for a moment). When we introduced Stratix II, we said that an ALM is equivalent to approximately 2.5 Stratix LEs; this result was based on our own internal benchmarking with the tools and circuit set we had available at that time. We have also shown in previous white papers that the Stratix LE is more efficient than a Virtex half-slice by a margin of ~10% (I'd need to look up the exact number). This difference arises primarily from the increased (routable) register-packing capabilities of the Stratix LE architecture. So *if* you believe these two results, then the results we give for Stratix II vs. Virtex-4 are at least consistent with our previous claims.

Regards,

Paul Leventis Altera Corp.

Paul Leventis (at home) wrote: [...]

Howdy Paul,

I thought my post was pretty clear, but let me try one last time. I'm NOT disputing the very real possibility that the ALM allows for more efficient logic packing than the Slice. The *only* thing I'm disputing is the _fact_ that, as Tim pointed out originally and I tried to explain in different words, Altera is happy to boost the S2 actual LUT count by an additional 30% for the "equivalent" logic that can be implemented, yet increases Xilinx's actual LUT count by exactly 0% for the stuff surrounding their LUT. As I quoted, even Vaughn admitted that the extra stuff surrounding Xilinx's LUT can't be 0%. To head off Altera claiming that they "don't know how much to add to Xilinx's actual LUT count", either use the inflation figure that Xilinx does (12%), or make up your own and justify it. But it has to be greater than 0%, otherwise all the Altera marketing white papers and websites are comparing apples to oranges.

The below columns provide as-near-as-possible apples to apples comparisons:

V4      Slices   Actual LUTs   Logic cells (Xilinx claim)
---------------------------------------------------------
LX200   89088    178176        200448

S2      ALMs     ALUTs         Equiv_four_input_LUTs (Altera claim)
-------------------------------------------------------------------
2S180   71760    143520        186576

In short, please explain which of the above comparison columns is incorrect, and why.
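One way to read the columns above is to compute the inflation each vendor applies to its own actual LUT count, and then the 2S180-to-LX200 ratio column by column. The sketch below uses only the numbers in the tables; the dictionary keys are made up for illustration.

```python
# Figures from the two tables above.
lx200 = {"blocks": 89_088, "actual_luts": 178_176, "claimed": 200_448}
s2_180 = {"blocks": 71_760, "actual_luts": 143_520, "claimed": 186_576}

# Inflation from actual LUT count to each vendor's claimed "equivalent" figure.
xilinx_inflation = lx200["claimed"] / lx200["actual_luts"] - 1    # 12.5%
altera_inflation = s2_180["claimed"] / s2_180["actual_luts"] - 1  # 30%

# 2S180 relative to LX200, comparing like with like in each column.
like_for_like = {key: s2_180[key] / lx200[key] for key in lx200}

# The disputed comparison: Altera's claimed figure against Xilinx's actual LUTs.
mixed = s2_180["claimed"] / lx200["actual_luts"]

print(f"Xilinx inflation: {xilinx_inflation:.1%}, Altera inflation: {altera_inflation:.1%}")
for key, ratio in like_for_like.items():
    print(f"{key}: 2S180 / LX200 = {ratio:.3f}")
print(f"claimed 2S180 / actual LX200 LUTs = {mixed:.3f}")
```

Every like-for-like column puts the 2S180 below the LX200; only the mixed comparison puts it above, which is the objection being raised here.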

Regards,

Marc

Vaughn

This is exactly my point. In your first paragraph you switch between the two issues here as if they are the same, which is the point of this discussion (do they teach you that at Altera?). We are talking about two different issues here, speed and hardware architecture. Yes, speed is complex to measure, which is why you say you are faster than Xilinx and Xilinx says they are faster than you, because of the subjectivity of the sample universe.

But your entire point of this "marketing" campaign is not the speed but the hardware. An LUT is an LUT, no matter how you slice it (granted, the additional logic to enhance the LUT performance does increase the complexity, affecting speed, not LUTs). And your response of the "larger" LUT for Altera is the last point of my e-mail. Would you quit "marketing" your variable input LUT? (Xilinx has had a variable input LUT feature since 1998; in fact you are a little late with this feature.) If you even look at your own white paper comparing the schematics between the two, you have two individual 4-input "ALUT"s feeding into "combinational logic" that provides the flexibility for the number of inputs, so the foundation is still 2 4-input LUTs. Interestingly, Altera will not disclose how they combine the 2 4-input LUTs to provide the flexibility.

Vaughn, again, you are dealing with engineers, not impressionable consumers.

Tim

Paul

You will also see this same "logic" (sorry for the pun) in my response to Vaughn. One thing is certain in analyzing the Altera vs. Xilinx debate, and that is that both companies claim to have the performance advantage, which is speed. Obviously the outcome is subjective. My complaint here is the hardware analysis, where you blatantly use your higher number and compare it with the lower Xilinx number, which is an abuse of logic.

Let me give you the basis for this debate:

formatting link

I would welcome your response to the hardware comparison. See my comments to Vaughn on the LUT discussion.

Tim

Vaughn, even then it can be darned near impossible. Each of the FPGAs considered here has a unique set of extra features that can be exploited in a design, and if I design to those features it will nearly always make the design map worse to the other FPGA. These two devices are essentially equal in size, so the typical 20% or more area savings one can squeeze out of a design by designing to the architecture can easily tip the balance toward whichever device you want to "win". This is a big problem I have with benchmarks. If you really want to compare the devices, you need to have several experts all design it independently for particular targets and then compare the optimized designs. Then, of course, that result is only valid for that set of specifications. Sure, it can be extrapolated to other designs, but the less the benchmarked design is like the user's design, the less useful that benchmark is.

I've always espoused looking at the FPGA like a box of Legos. You build what you can out of the pieces that come in the box. An average user is going to turn out some average looking stuff, and the results might even tend to look somewhat alike. There will be a few guys in the room who figure out how to use some of the pieces in their box to do some really neat things, so that they end up with something that is cooler than all the others.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Call me biased, but I was thoroughly bored by that presentation. It's the tired old "my 6-input LUT does more than your two 4-input LUTs". Ten years ago, that would have been exciting. "XC3000 LUTs are better than the simple old XC2064 LUTs". They were, really!

Today's FPGAs are not just LUTs, but LUT-RAMs, SRL16s, ISERDES and programmable IDELAYs, FIFO controllers, PPC, Ethernet controllers, Multi-Gigabit Transceivers, cascadable MACs, clock management, and much more. It's like a car salesman bragging: "My trunk is bigger than your trunk, if you measure it my way, with my set of boxes".

But of course, I am biased. I felt pretty good after that hour. If that's all they can throw at us...

Peter Alfke

I do find it a little humorous that about 5-10 years ago someone published a paper at FPGA that claimed better speed and utilization using a 3-LUT rather than a 4-LUT, and that the presenter was someone fairly closely aligned with Altera. I think it might have come out of Jonathan Rose's students in Toronto. Anyway, the pendulum has obviously swung to "bigger LUTs are better" at Altera. Again, for designs that use FPGA fabric correctly (i.e., not levels upon levels of combinatorial logic), the LUT size is not as big a deal. Like I mentioned earlier, you work with what you have in the box of Legos. That said, I generally AVOID using the F5 and F6 LUT expanders in Xilinx because they slow a design down a bit and, more importantly, tend to be one of the things that, if placed, give the mapper fits. They also don't match the bit pitch of the arithmetic.

It is important that the underlying structure is done right, but let's not get wrapped around the axle here. Those bigger LUTs are going to tend to be underutilized. The bypass is of limited value too if you attempt to keep all your logic to one level of LUTs, as the LUTs will nearly always be associated with a flip-flop.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Actually, what this does is benchmark your "experts". And the results at best are only valid if you then use that expert for your design :-)

Each vendor should ship an expert to each customer with each SW release, and there should be service packs for these experts that ship at the same time as the service packs for the software. (What's the ASCII for a half-smiley?)

Philip

Stratix II ALM:

formatting link
Page 8

Virtex-4 slices:

formatting link
Pages 166 and 167

Henry Wong

Hi Marc,

I see your point now. The problem is the unfortunate use of the term "Equivalent Four-Input LUTs". This value does *not* represent a simple 4-LUT. What that column really represents is "Equivalent Benchmarked Logic Units". In other words, it is a measurement of capacity based on benchmark results. In this case, we have normalized the result so that our arbitrary unit of measurement matches one Xilinx half-slice (which is labeled "Actual LUTs" in your table).

These comparisons take into account all features of the respective logic elements, such as dedicated adders, little XOR and MUX widgets, etc. To obtain the Stratix II value of 186K, we benchmarked a large suite of designs by running Synplify + ISE vs. Quartus to figure out how many ALMs were needed vs. how many half-slices were needed. The result was that 2.6 half-slices were needed to implement the same functionality as 1 ALM (on average). Multiply by number of ALMs in 2S180 (71760) and you get 186576. So perhaps a more correct label would have been "Equivalent Virtex Half-Slices".
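The 186576 figure can be checked directly from the numbers given above. A minimal sketch, using exact fractions so that the 2.6 ratio does not pick up floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

alms_2s180 = 71_760                      # ALMs in the 2S180
half_slices_per_alm = Fraction(26, 10)   # benchmarked average quoted above (2.6)
lx200_half_slices = 178_176              # "Actual LUTs" in the LX200

equiv_half_slices = alms_2s180 * half_slices_per_alm
print(int(equiv_half_slices))            # 186576

# Resulting capacity ratio claimed for the 2S180 over the LX200.
print(float(equiv_half_slices / lx200_half_slices))
```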

We could have equally chosen to normalize the results to "Stratix Logic Elements". In this case, rather than 186K vs. 178K, the result would be ~179K vs. ~170K (ballpark guess). In that case, we would be scaling the LX200 capacity by our observed Virtex-4 vs. Stratix packing ratio, and we would be scaling the Stratix II ALM capacity by our benchmarked Stratix II vs. Stratix ratio of 2.5 LEs per ALM.
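The ballpark figures can be roughly reproduced as follows. The post does not state the Virtex-4 vs. Stratix packing ratio that would be used, so this sketch assumes it is the one implied by the two quoted ratios (2.6 half-slices per ALM and 2.5 LEs per ALM, i.e. 1.04 half-slices per LE). That derived ratio is an assumption, not a published number.

```python
from fractions import Fraction

alms_2s180 = 71_760
lx200_half_slices = 178_176

les_per_alm = Fraction(5, 2)             # 2.5 Stratix LEs per ALM (quoted above)
half_slices_per_alm = Fraction(13, 5)    # 2.6 half-slices per ALM (quoted above)

# Assumption: packing ratio implied by the two figures, not a published number.
half_slices_per_le = half_slices_per_alm / les_per_alm   # 1.04

s2_in_les = alms_2s180 * les_per_alm
lx200_in_les = lx200_half_slices / half_slices_per_le

print(int(s2_in_les))        # 179400
print(round(lx200_in_les))   # 171323
# The relative capacity is the same in either unit.
print(float(s2_in_les / lx200_in_les))
```

Under that assumption, changing the unit of normalization rescales both numbers but leaves the ratio between the two devices unchanged.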

Hope that clarifies things.

Paul Leventis Altera Corp.

Hi Tim,

I would highly recommend looking at Figure 2-6 of the Stratix II databook

formatting link
to gain a better understanding of exactly what hardware there is in the ALM. This is a very detailed diagram that shows exactly how an ALM is constructed. As you will see, it comprises 2 4-LUTs plus 4 3-LUTs plus a whole bunch of muxing. If you are going to simplify it, it is closest to a 6-LUT with multiple outputs and some replicated internal nodes, not 2 4-LUTs as you characterize it.

In a posting on this subject in the past (thread "Mine is bigger than yours..."), I've given a few links to Altera white papers and two published conference papers that better describe the ALM and the architectural choices involved in its development. See

formatting link

As for the Xilinx "Variable LUT", there are big differences between a "composable" LUT architecture (what Xilinx has) and a "fracturable" LUT architecture (the ALM). I've previously posted a very long (technical) post on the subject under the thread "Stratix 2 ALUT architecture patented ?". See

formatting link
The short version is that in order to make a 5- or 6-input LUT out of Xilinx slices, you use a lot more silicon area. As a case in point, if you were to try to fill up a LX200 with 6-input LUTs in this manner, you would only be able to fit 44,544, while you could fit 71,760 in the 2S180.
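The 44,544 figure is consistent with a 6-input LUT consuming four 4-LUTs in the Virtex-4 fabric. That four-to-one cost is an assumption inferred from the numbers in this post (two 4-LUTs combined into a 5-LUT, and two of those into a 6-LUT), not something stated here. A quick sketch:

```python
lx200_4luts = 178_176    # actual 4-LUTs in the LX200
luts4_per_6lut = 4       # assumption: four 4-LUTs composed into one 6-LUT

lx200_6luts = lx200_4luts // luts4_per_6lut
s2_180_6luts = 71_760    # one 6-LUT per ALM in the 2S180

print(lx200_6luts)                                   # 44544
print(f"{s2_180_6luts / lx200_6luts:.2f}x as many 6-LUTs in the 2S180")
```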

Regards,

Paul Leventis Altera Corp.

Hi Ray,

This benchmarking study looks at how the two architectures studied respond to HDL synthesis. This is the design method used by the vast majority of our user base, who don't have the time or knowledge to hand-synthesize large portions of their designs. I will concede that for an advanced designer such as yourself, this comparison method is insufficient to draw a conclusion on how well the architecture will work for you. And yes, if a design is ultra-pipelined, there are fewer opportunities to make use of larger LUTs, so it is conceivable that the advantage would be less in these cases.

As for your comment on academic research on LUTs, we still agree with the body of research you point to. Simple LUTs larger than 4-LUTs are less area efficient. The problem with simple 6-LUTs is that you sometimes can't use the whole thing, so you waste a lot of area. With smaller LUTs, you tend to have less wasted LUT area. Of course, taken to an extreme very small LUTs are not efficient either because there is overhead getting signals to and from LUTs. This is why the ALM is the complicated beast that it is. By allowing the 6-LUT to be fractured, we improve logic packing. We can efficiently combine LUTs of varying sizes into one ALM, greatly reducing the amount of wastage.

From a speed perspective, academic research has shown that larger LUTs are better (as far as I can recall at least). There is little speed penalty to using a larger LUT, since the fastest inputs are still the same speed. So to first order, the bigger the LUT, the fewer levels of logic, and those levels still have roughly the same delay (for the critical path).
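To first order, the depth argument can be illustrated with a toy model: reducing N signals with an associative function (a wide AND, say) using a tree of K-input LUTs. This is only a sketch under that simplifying assumption; real technology mapping is far more involved, and the function name is made up for illustration.

```python
def lut_levels(n_inputs: int, k: int) -> int:
    """Levels of k-input LUTs needed to reduce n_inputs signals to one
    with an associative function such as a wide AND."""
    if n_inputs < 1 or k < 2:
        raise ValueError("need at least one input and LUTs with k >= 2")
    levels = 0
    signals = n_inputs
    while signals > 1:
        signals = -(-signals // k)   # ceiling division: LUTs used at this level
        levels += 1
    return levels

for n in (8, 16, 32):
    print(n, "inputs:", lut_levels(n, 4), "levels of 4-LUTs,",
          lut_levels(n, 6), "levels of 6-LUTs")
```

For a 32-input reduction this model gives three levels of 4-LUTs against two levels of 6-LUTs, which is the kind of saving the paragraph above refers to.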

The ALM was designed to give us the best of both worlds -- the speed of a 6-LUT, without the area penalty.

Regards,

Paul Leventis Altera Corp.

Hi Peter,

Alex will be crushed to hear that!

But the majority of FPGA area is still consumed by logic. As you point out, this is merely one facet of the suitability of one architecture or another to a given design opportunity. I would say what we're claiming is that we can fit more or slightly larger passengers in our car. And like a good salesman, you are deflecting by steering people towards the seat heater you've got on the passenger side...

Regards,

Paul Leventis Altera Corp.

Paul,

I was referring mostly to a higher level of tailoring than 'hand synthesis'. I am referring to design decisions that any good designer makes that are influenced by the architecture of the FPGA. As a basic example, there are significant differences in the memory structures between Xilinx and Altera devices. Any decent designer is going to look at what's in the FPGA, at least at the macro level, and tailor his design to that. Sure, the differences are a little more subtle when you are looking at the differences in the fabric rather than at the added features, and I'll concede that there are a lot of designers out there who push the button and hope for the best.

The thing is, when pushing the button doesn't achieve acceptable results, those same designers look at the synthesis results and then go back and tweak the design in the areas that the synthesis did the worst with. I would argue that the design adjustments made there are in effect tailoring the design to the architecture, even though it is done somewhat blindly, and certainly not at the level of efficiency of one who writes his RTL in a way that gives the synthesis very strong hints on how to assemble the logic, based on the designer's knowledge of the architecture.

The point is, HDL synthesis is still only a translation. The tools can't think for the designer. Coding style can have a significant effect on the resulting design. Even at the RTL level, the design can and absolutely should be biased by the underlying structure. I would argue that most of your user base does at least consider the 10,000 mile bird's-eye view of the architecture when doing their designs, and it is this bias that I was referring to that can tip the scales toward any particular architecture.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759