See Peter's High-Wire Act next Tuesday

Why/how ? The rationale here has me lost...

There is plenty of diversity in the marketplace, and users are not so silly as to buy purely on one benchmark.

Look at

formatting link
- there are a lot of uC/IP listed, and they are not fearful that coming 2nd in some benchmark will be the kiss of death ?

Also look at the uC markets - some of the biggest $$ do not come from the fastest, or smartest, devices.

You have missed another important function of PUBLIC benchmarks, and that is code & design training. Users can see generic and optimal source files, and learn a _lot_ from that.

If the word benchmark provokes such a strong, adverse reaction, then think of them as portable application notes with numbers ? :)

-jg

Reply to
Jim Granville

C'mon, Jim. You cannot be serious. There is no dearth of learning and training material. Xilinx publishes many hundreds of app notes (now often with code examples), there are reference designs, evaluation boards, and the User Guides have thousands (!) of pages. And Altera does the same. There is plenty of learning material. But we do not need the artificial title "benchmark" that implies objectivity (without ever living up to it). Peter

Reply to
Peter Alfke

Might be worth it for a vendor to use a public core due to reduced support on the software. Gnu tools are pretty good. Once somebody gets them over the hump, the vendor might not need to do much support.

Reply to
Hal Murray

True, could be a solution for Actel or Lattice ?

It does not seem that Xilinx are PREPared to do public 'pushed case' examples so users can do their own measurements, and given the 'half-lives' of the speed files of FPGAs, that is looking important.

If we cannot trust the vendors' marketing depts to match like with like, it seems the info flow should come from the engineering depts ?

Maybe Altera could start ?

-jg

Reply to
Jim Granville

EEMBC isn't public; the sources cost ~ $30K. This makes it pretty much useless, because only the processor vendors buy in, and there's no obligation to publish results. The average user can't run a benchmark on two different systems, and it's in the user's system that performance really counts.

Another issue with benchmarks is that vendors simply target their processor/FPGA/whatever at the benchmark. It would be relatively easy for an FPGA vendor to increase performance on a known benchmark, either by targeting their software at it, or by introducing dedicated hardware in the next device. In the long run, everybody loses.

Besides, how many FPGA end-users actually buy on raw performance? Very few, I suspect, and they're probably the ones who are targeting ASICs anyway.

Evan

Reply to
Evan Lavelle

(snip)

Besides all the problems that benchmarks have, I don't believe everyone has the same design goals. Some systems need to be as fast as possible, others as dense as possible, and some in between.

What I would like to see is a series of devices in similar sizes, but varying in the special purpose blocks. Two popular additions are hardware multipliers and block RAMs. Some designs use more of one or the other, some less. A designer may know early in the design process which of those will be needed.

Consider a family of devices of similar size, but varying in the two dimensional space of number of hardware multipliers and block RAMs. That would probably depend on the ability at mask generation time to substitute a multiplier or RAM for a set of CLBs in an automated fashion, and for routing software to follow those changes. Those within a family should be otherwise pin compatible.

Given such a family of devices, it is unlikely that benchmarks would show all equally useful. Benchmarks for devices with CPU logic inside would be even worse.

-- glen

Reply to
glen herrmannsfeldt

"What I would like to see is a series of devices in similar sizes, but varying in the special purpose blocks. Two popular additions are hardware multipliers and block RAMs. Some designs use more of one or the other, some less. A designer may know early in the design process which of those will be needed. "

Sounds like a description of the ASMBL architecture of Virtex-4, where LX and SX differ in the percentage mix of functions, SX having relatively more multiplier/accumulators and BlockRAMs, otherwise the functions are identical. Seems like we are getting there. Just remember, any different chip is a multi-million dollar investment by the manufacturer... Peter Alfke

Reply to
Peter Alfke

What you mean is that it is not free. That is their business model; EEMBC are there to make money.

Why ? - If the benchmarks are application relevant, then surely the application speed improves, and everyone wins ?

Static Icc and package sizes are also design benchmarks.

Correct, but benchmarks are not all about speed; they are about a defined set of designs, so you can exercise a device and get mA/MHz, or LUTs, or MHz, or ns, or whatever parameter matters to you most. A vendor's claim of 39% is of little use to anyone. Another use is that they can show you how to get more of something, for more effort, in the optimised benchmark category. Of course, the optimised category is not a level playing field; that is the whole point.

-jg

Reply to
Jim Granville

I am way behind in following the newer chips.

How automated is the generation of the different family members?

I was trying to imagine a design where at the mask level one could insert a multiplier, RAM, or CLB array where all the signals would line up. That is, each block could be separately routed and then an array of such blocks assembled.

Still, I am sure that the costs are high, even for just cataloging and inventorying a new device without considering the mask costs. I believe, though, as devices get bigger more intermediate family members will be needed.

(I had wondered once how much it costs to create a new breakfast cereal, maybe just a different flavor of an existing one. There doesn't seem to be a limit to the combinations they can come up with.)

thanks,

-- glen

Reply to
glen herrmannsfeldt

Except here's the gotcha:

Assume the magic design fairy hands you a taped-out-and-complete design. Generating the masks ALONE costs $1M or so.

Thus a family approach (such as V4) is going to choose a few points and just produce those, and "close enough" ends up being cheaper than "just right" because of the costs of setup.

--
Nicholas C. Weaver.  to reply email to "nweaver" at the domain
icsi.berkeley.edu
Reply to
Nicholas Weaver

(snip regarding a variety of different FPGAs with different numbers of RAMs and multipliers)

Yes, a magic mask fairy would also be nice to have.

I could wonder about putting more than one type of chip on a single mask, though that complicates the testing and packaging. (Making sure that the right labels end up on each one.)

-- glen

Reply to
glen herrmannsfeldt

Virtex-4 has 17 family members in 3 sub-families (8 in LX, 3 in SX, and 6 in FX) covering a complexity range from about 12,000 to about 130,000 LUTs and flip-flops, which is a bit more than a 10 : 1 range. (You can use the 2- or 3-digit number in the part designator and multiply it by 1000 to get roughly the number of LUTs or Logic Cells. Let's not quibble about small inflation factors.) That converts to around 30 device-package combinations, times 3 speed grades, times 2 or 3 temperature ranges. This calls for smart planning, especially since the chip-manufacturing cycle is significantly longer than the ordering lead-time we get from our customers... Peter Alfke
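Peter's SKU arithmetic can be sketched in a few lines. Note the temperature-range count below is an assumption (he says 2 or 3; the sketch uses 2), and `approx_luts` simply encodes his stated rule of thumb:

```python
# Back-of-the-envelope sketch of the Virtex-4 SKU arithmetic above.
# The temperature-range count is an assumption (the post says 2 or 3).

def approx_luts(designator_number: int) -> int:
    """Peter's rule of thumb: multiply the part-number digits by 1000
    to get roughly the number of LUTs / Logic Cells."""
    return designator_number * 1000

# e.g. an "LX60"-style designator -> roughly 60,000 logic cells
print(approx_luts(60))

device_package_combos = 30   # stated in the post
speed_grades = 3
temp_ranges = 2              # low end of the "2 or 3" in the post

total_skus = device_package_combos * speed_grades * temp_ranges
print(total_skus)  # orderable combinations, at the low end
```

Even with the conservative assumptions, the family fans out into well over a hundred orderable combinations, which is why the planning problem Peter mentions is real.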
Reply to
Peter Alfke

Yeah, the 39% seems cooked to me, especially with no way for the interested public to check it. Where is that Altera guy hiding ?

-Che

Reply to
che_fong

I have posted all I need to say on the subject. Clearly, I believe the +39% is real. We have invested years of engineering time to gather benchmark designs, fairly convert them between architectures, and figured out how to get the best out of both tools, and to produce comparisons. We have disclosed how we run our tests, and the results we achieved. But short of releasing the actual designs, which we cannot do, we will never be able to convince those people (such as you) who believe we are cooking the numbers when we are not. I can't blame you for being in disbelief -- trust me, we double- and triple-checked our results because we were so surprised that Virtex-4 came out so poorly.

Do I believe 39% tells the whole story of comparing these two device families? No; it is just one (very important) parameter.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Paul, let me help you. There are three ingredients to this "surprise":

  1. Altera used its fastest (of three) speed grade against the middle of three Xilinx speed grades. (I have previously explained your reason for, and the much stronger reason against, doing that.)
  2. Altera did not exercise the Xilinx software as strongly as they pushed their own. The software tools are quite different, and require a different approach if absolute highest speed is the goal. Which it was.
  3. It is reasonable to assume that Altera's stored designs are more Stratix-friendly.

So, don't you guys play the surprised innocent onlookers. Nobody expected Altera to be fair. Hell, I think the whole business of competitive benchmarks being run and promoted by an interested party is a sham and a disgusting deception. That's why I refused to enter the mudbath... Peter Alfke

Reply to
Peter Alfke

Correct me if I am wrong, but didn't Altera use the most current speed file data that was available at the time? Or was the data available in the speed file and just the parts not available? Let's face it: even if the speed file data was available, data based on estimates is pretty pointless. We have seen significant changes in speed files even *after* a chip is in production. So the data is pretty meaningless *before* the parts are in production.

This is a point that no one can prove either way. Xilinx does not release their benchmark designs and Altera does not either. So the users are left not knowing if any of the info is correct.

That sounds like marketing-speak. Regardless, until we get a set of benchmarks that are open *and* useful, this is all just a tempest in a teapot.

But here you are... :)

--
Rick Collins

rick.collins@XYarius.com
Reply to
rickman

And there you have the best possible argument for Public (WEB) Source code. - I cannot believe the designers at Altera feel happy to have "invested years of engineering time", only to find themselves unable to publicly verify the numbers (and even to have them openly laughed at ?). To me, that is a total waste of time. You and your customers deserve better.

Simple solution: get some designs you CAN release !?

-jg

Reply to
Jim Granville

Hi Jim,

From a marketing perspective, yes, it makes life difficult. That is only a secondary goal of our benchmarking effort. The primary reasons we collect designs and measure our performance are to (a) improve our CAD tools and (b) experiment on new architectures. When developing new CAD algorithms and new architectures, we need to be able to compare the new vs. the old to see if the change is a useful one. For example, there is no way we could ever have made the radical change of moving from our old Stratix 4-LUT based LE to the Stratix II decomposable 6-LUT with shared LUT function capability. There is a lot of pain (synthesis effort, IP changes, customer impact, etc.) associated with changing the logic architecture of a family, and we need good, solid data to back it up. Similarly, when we make changes to our synthesis, placement and routing algorithms, every such change must be validated for functionality and quality.

Hopefully someone out there will put together some new public domain big benchmarks (like the old MCNC benchmarks, still quoted so often in academic literature). It would do the academic community some good to see what real designs look like these days.

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Peter,

All are reasonable comments/questions, which I will address below. But let's first step away from Altera's direct Stratix II to Virtex-4 result and, like good engineers, sanity-check the +39% number by trying to compare things another way.

Virtex-4 is around 5% faster than Virtex-II Pro (I don't have the exact result on hand). Whatever perceived bias there may be, this is the result we get when we run those two chips head-to-head, using the same designs, same software, and same methodology. And we've heard this from Xilinx users who have been surprised at the lack of performance increase when comparing the two chips. There have been postings on this subject in this very newsgroup. Yes, some IP blocks have got faster, and there have been changes to various aspects of the chip, but the basic logic + routing fabric really hasn't improved much. As an architect, I am not surprised at the lack of performance increase. Nothing has changed in V-4 on the logic or routing front vs. VIIpro that would lead to speed. The stripping of SRL16s from the M slices should lead only to some area reduction. And going to 90 nm from 130 nm doesn't automatically confer a speed advantage, since this depends on choices of exactly where you target the process, what gate lengths you use, what threshold voltages you use (and where), and even things like using slow thick-oxide transistors. As an example of this, moving from Cyclone (130 nm) to Cyclone II (90 nm), we're only seeing somewhere in the neighbourhood of a 10% performance advantage.

Contrast this to Stratix II. Stratix II is 45-50% faster than Stratix. Again, same designs, same methodology, same tools. Perhaps we cannot be trusted to run Xilinx tools fairly, but we had better know how to run our own tools and chips. Why are we seeing this much advantage? A small part comes from process. But most of it is as a result of the new logic architecture -- to first order, larger LUTs mean fewer logic levels with roughly equivalent delay per level, thus faster overall performance. And there are numerous other changes under-the-hood relating to the routing architecture and electrical design that lead to further performance improvements. If we had not innovated, we also would have been left with a product that was not much faster than its predecessor.
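Paul's "larger LUTs mean fewer logic levels" argument is easy to model to first order. The gate depth and the log2(k)-absorption model below are illustrative assumptions for the sketch, not measured Stratix or Virtex data:

```python
import math

# First-order model of "bigger LUT -> fewer logic levels -> faster".
# Assumption: a k-input LUT can absorb roughly log2(k) levels of a
# 2-input-gate network. The depth of 12 gates is an arbitrary example.

def lut_levels(gate_depth: int, k: int) -> int:
    """Levels of k-input LUTs needed to cover a 2-input-gate critical
    path of the given depth."""
    return math.ceil(gate_depth / math.log2(k))

depth = 12                      # assumed critical path, in 2-input gates
lev4 = lut_levels(depth, 4)     # 4-LUT fabric
lev6 = lut_levels(depth, 6)     # 6-LUT fabric

# If delay per LUT level is roughly equal (Paul's premise), the
# first-order speedup is just the ratio of level counts:
print(lev4, lev6, lev4 / lev6)  # 6 levels vs 5 levels -> 1.2x
```

This is only the first-order story; as the post says, routing architecture and electrical design contribute further gains that a level-count model cannot capture.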

So is a performance advantage in the 39% range that difficult to believe? Well, unless you think that Virtex II Pro was way faster than Stratix (numerous head-to-head battles in the field do not support this), a big advantage for Stratix II is reasonable to expect.

I guess you could say that's the marketing of our results. The science behind the +39% is valid -- we clearly state what we are comparing to and why we are comparing that way. And from a customer (today) perspective, that is what they would see too. There are no speed files for -12 and no entries in the data sheet. Assuming your fastest speed grade (whenever it finally comes out) is 10-15% faster than the middle speed grade, we'll still be talking about a ~25-30% advantage for Stratix II.

I disagree. We spent months trying the default settings, the best settings, and every combination of settings we could in order to maximize the performance of designs run on ISE. The performance of ISE is difficult to compare against, since it tends to do a poor job when over-constrained or under-constrained. This means we have to spend a lot of time finding just the right set of constraints to determine the best (per-domain) performance. To get a result for a Virtex-4 design, we run the tool ~15 times on average in order to find the best constraints for that design.
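The constraint-hunting loop Paul describes can be sketched as a sweep over target clock periods. `run_pnr` below is a hypothetical stand-in for a batch place-and-route invocation, with a toy delay model (its penalty factors are made up) so the sketch is self-contained and runnable:

```python
# Sketch of the "find the right constraint" loop described above.
# `run_pnr` is a hypothetical stand-in for a real P&R run; its delay
# model is invented purely to mimic the over/under-constraint behaviour
# the post describes.

def run_pnr(target_period_ns: float) -> float:
    """Toy model: achieved period degrades when the target is either
    too aggressive (over-constrained) or too lazy (under-constrained)."""
    best = 8.0  # pretend the design's true best period is 8 ns
    overconstraint = max(0.0, best - target_period_ns)
    slack_waste = max(0.0, (target_period_ns - best) * 0.5)
    return best + 0.4 * overconstraint + slack_waste

def best_achieved(lo: float, hi: float, steps: int = 15) -> float:
    """Sweep ~15 target periods (as in the post) and keep the best."""
    candidates = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(run_pnr(t) for t in candidates)

print(best_achieved(6.0, 12.0))
```

The sweep only gets close to the true optimum, which illustrates why per-design constraint tuning consumes so many runs: each candidate constraint is a full place-and-route pass.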

Please publicly post the best-effort methodology you would like us (and your customers) to employ, or be more specific about what about our approach leads to a tool bias. I would be happy to discuss the merits of various benchmarking approaches.

I'll agree that there *could* be some unintentional bias here. It's tricky -- a lot of our benchmark designs come from customers who engage with us because they are having troubles, sometimes with meeting performance. This may be on an older chip (say, APEX) or it may be that they were having tool issues. What this means is that part of our benchmark set comprises designs that *did not* do well on our chips. Some of our designs are targeted originally at Xilinx devices and we've been called in to try to hit performance in an Altera part because the customer was having availability issues. And yes, some of our designs are just plain old design-to-Stratix designs. So it's a bit of a mess to try to decipher whether or not we have a bias based on the benchmark set...

One point I would make is that Stratix II's logic architecture is radically different from Stratix's. So even if we had a bias towards Stratix designs, I'm not sure that would mean we should automatically see an advantage for Stratix II vs. Virtex-4 since in many ways Virtex-4 and Stratix are more similar than is Stratix II to either architecture.

I don't think I've ever expressed surprise at the response. I sometimes wonder whether we should have screwed being fair and instead just posted totally unfair results like +60% or +70%, or to take a page from Xilinx's books, "up to +230%" so that once people de-rate for assumed unfairness, they'd end up somewhere near the right result.

And Xilinx has never dared to make any performance claims in the past? At least our +39% result is a step forward in that we are using averages (real, geometric averages with a full data set) and not "up to" numbers, and we are posting details on our methodology, and are doing our best not to stack the deck.
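The difference between a geometric-mean claim and an "up to" claim is easy to see with a handful of per-design speedup ratios. The numbers below are made up for illustration; they are not Altera's benchmark data:

```python
import math

# Geometric-mean vs "up to" reporting, per the post above.
# The ratios are invented per-design speedups (new_fmax / old_fmax).
ratios = [1.10, 1.55, 1.25, 1.80, 1.05, 1.45]

# Geometric mean: the multiplicative average, the "typical design" claim.
geo_mean = math.prod(ratios) ** (1 / len(ratios))

# "Up to": just the single best case, which says nothing about the rest.
up_to = max(ratios)

print(f"geometric mean: +{(geo_mean - 1) * 100:.0f}%")
print(f'"up to":        +{(up_to - 1) * 100:.0f}%')
```

With this invented data set, the "up to" figure is more than double the geometric-mean figure, which is exactly why a full-data-set average is the more honest headline number.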

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Rick,

You are correct. No speed files are available for -12. No numbers are in the datasheet. So we compare to -11.

I think performance comparisons based on preliminary timing models are still valid. Regardless of how correct speed files are, that is the performance a customer will see, and is what customers are using to select devices and speed grades. Of course, performance comparisons need to be updated with each release of speed files (and cad software too -- algorithms are always improving).

I cannot speak for Xilinx and their speed files. But on the Altera side of things, I would not expect much change in core performance. For all families I've been involved with (Stratix, Cyclone, and beyond), our core performance predictions made in the preliminary timing models have been very close (within 5%) to final production numbers. Stratix II core (logic + routing) speed will not be changing more than a few % in the future. Our models have already been correlated to silicon and compare very well. The toggle rate limitations on DSP and memory blocks will likely increase (again) since we're still in the process of finishing off our characterization of these blocks, and we like to stick with conservative limits until that characterization is completed.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)
