See Peter's High-Wire Act next Tuesday

This is getting tedious, but Paul did write "we were so surprised that Virtex-4 came out so poorly." That's where I got the SURPRISE from. :-)

It looks like everybody agrees: the evolution from 130 nm to 90 nm does not automatically give a big performance boost (it lowers the cost, though). So Altera and Xilinx used additional means to improve Stratix-II and Virtex-4 performance, but the two companies did this in very different ways. Altera changed the LUT structure significantly, and I can believe that this makes certain applications faster, if they can tolerate shared inputs. But Altera made no systems-oriented functional changes, added no new functions or structures. Xilinx did the opposite, leaving the LUT structure pretty much as before, but adding and improving functionality in many ways (as I emphasized in the web seminar). What does that mean for the old benchmarks? Since all Altera benchmarks use established legacy designs, and only designs available to Altera, they will benefit from LUT-level improvements, but will of course have no clue about major structural and functional improvements, as introduced by Virtex-4.

I bet there are no dual-clock FIFOs in the Altera benchmarks, or 32-tap FIR filters, or Gigabit SerDes, or microprocessors, or even SRL16s or LUT-RAMs. Such applications do not exist in the old designs, or they are implemented in such different, less efficient ways that they do not migrate.

Altera evolved Stratix-II in a direction that old legacy benchmarks can easily take advantage of. Xilinx evolved Virtex-4 in a more systems-oriented direction. I am convinced the Virtex-4 innovations provide a bigger performance boost for new designs that can take advantage of the new features. Who cares about porting obsolete designs?

This also answers the quest for public benchmarks: There is no way that otherwise nice guys like Paul and Peter would ever agree on such benchmarks. I would insist on SRL16s, FIFOs etc, and Paul would load up with applications that are favored by the complex LUTs with their shared inputs. Both of us would have to be selfish, since there is so much at stake. There can be no common ground, since there is no "typical benchmark".

Benchmarks are dead, long live performance! Peter Alfke

Reply to
Peter Alfke

I think most of what Peter has said is very reasonable. But I think you *can* have common public benchmarks if you start at a higher level rather than try to share HDL code. Typically, a project will know which vendor is going to be used and can design its architecture to fit. So why not spec a design function and let each vendor write its own code to implement it?
--

Rick Collins

rick.collins@XYarius.com

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design       http://www.arius.com
4 King Ave.                               301-682-7772 Voice
Frederick, MD 21701-3110     GNU tools for the ARM http://www.gnuarm.com
Reply to
rickman

Funny, I was sure this was an Altera link (you have seen this?)

formatting link

To me that looks like rather more than a 5% nudge?

Does this mean the prelim models were not as good as you claim, or are you more carefully qualifying the best ones, to better compete with Xilinx's claims ?

In all this speed/benchmark hoopla, one thing I see that is never mentioned: price.

Will we see a 'bragging rights' bin, i.e. the three-sigma yield limit, and so very expensive, but hey, look how fast we are!!

-jg

Reply to
Jim Granville

Exactly, this is the 'optimised' branch I was talking about.

Users do not care what tricks you use; they want to know the peak MHz, MHz/$, or mW/MHz or whatever, for a particular application.

They EXPECT differences in fit and ideal usage, by end application, but right now, this type of information is sparse indeed.

They also are more interested in a design broadly in their field, than in how fast one node can spin.

FPGAs are getting more complex, with the dedicated blocks, so HDL designers need vendor-tuned examples of how to efficiently use those blocks.

-jg

Reply to
Jim Granville

There is a lot of negative baggage associated with the term Benchmark.

Yet there is a lot of performance related info that would help designers, both in choosing between chips and when working on a design.

I'm thinking of things like the speed of a 32-bit counter. It's a reasonably basic building block. It's not the whole story, but it's one very useful data point.

The next question is how many variations make sense? Do they have major impacts on the speed and/or take extra resources? Enable, load, overflow flag, ...
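To make that concrete, here is a minimal Verilog sketch of the kind of variant I mean (port names and priority choices are just my illustration, not anyone's reference design); each option adds a mux or uses the carry chain differently, which is exactly the kind of speed/resource impact in question:

  // Illustrative only: 32-bit counter with enable, synchronous load,
  // and a registered one-cycle overflow flag.
  module count32 (
    input             clk,
    input             en,    // count enable
    input             load,  // synchronous load (takes priority)
    input      [31:0] d,     // load value
    output reg [31:0] q,
    output reg        ovf    // overflow flag
  );
    always @(posedge clk) begin
      ovf <= 1'b0;
      if (load)
        q <= d;
      else if (en)
        {ovf, q} <= q + 33'd1;  // carry out of bit 31 sets the flag
    end
  endmodule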

Maybe the info I'm looking for should be printed in green if there is a good fit with the hardware (sweet spot) and flagged with red if the answer might surprise you (say because it takes an extra level of logic).

Other basic things I'd like to see... Data rate between two chips (same type/speed). Routing delays to go N steps horizontally or vertically.

Maybe that's all BS because most engineering time is spent working at a higher level. (But I always get involved with the details.)

Maybe a handful of medium-size tasks would be interesting. Maybe library elements. Maybe just good examples: pulse-width modulators (fixed and programmable parameters), FIR filters, state machines (several examples needed), a DRAM controller.

I was thinking about that too. One complication is that you can push some things off to the non-FPGA parts of the system or pull things in.

What sort of problem would be a good whole-system example?

Vendor neutral would be nice, but I'm also interested in good examples that take advantage of special features, and I'm willing to push things around and/or change the problem if that makes things fit better for the total system.

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

How about:

Frequency/pulse counter [24/32/40/48 bits]. This needs counters and capture. Counters can be either decimal or binary, with some bin-to-BCD post-conversion for display of the results.
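(The bin-to-BCD step is itself a nice little benchmark; a purely illustrative double-dabble sketch in Verilog, 8 bits to 3 BCD digits:

  // Combinational double-dabble: adjust any BCD digit >= 5 by +3,
  // then shift in the next binary bit, MSB first.
  module bin2bcd8 (
    input      [7:0]  bin,
    output reg [11:0] bcd   // {hundreds, tens, ones}
  );
    integer i;
    always @* begin
      bcd = 12'd0;
      for (i = 7; i >= 0; i = i - 1) begin
        if (bcd[3:0]  >= 5) bcd[3:0]  = bcd[3:0]  + 4'd3;
        if (bcd[7:4]  >= 5) bcd[7:4]  = bcd[7:4]  + 4'd3;
        if (bcd[11:8] >= 5) bcd[11:8] = bcd[11:8] + 4'd3;
        bcd = {bcd[10:0], bin[i]};
      end
    end
  endmodule

For the wider counters it would be pipelined, one bit per stage.)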

DDS - same widths; this needs wide adders. Expanding to waveform generation and pulse generation...

These are complex enough to push the FPGA, have easily verifiable output everyone can relate to, and would be genuinely useful for education. Some would be simple enough for the MAX II or smallest ProASIC3...

And more ambitious: a storage oscilloscope, which needs a timebase and wide datapath pumping [assume an external fast-enough ADC] to any choice of USB / FireWire / ATA / Serial ATA (etc.) for the results.

Of course..

-jg

Reply to
Jim Granville

Jim, I like your idea, but it is not all that straightforward. I mentioned in the seminar that there is a loadable synchronous counter inside every DSP Slice, and it runs at 500 MHz, no ifs, no buts. Now, I will build a 5 GHz counter using the MGT. Is that allowed? DDS obviously runs at 500 MHz in the DSP slice, but I will run a virtual 8 GHz DDS either by using 16 accumulators, or by doing some clever math. Is that kosher? If you saw my seminar, I mentioned those things, and there will be app notes, etc. Storage oscilloscope is dear to my heart, but it has many arbitrary parameters. Its performance and cost are determined by the A/D at the input, not the FPGA.
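To make the multi-accumulator trick concrete: N accumulators, with lane k starting at k*inc and each advancing by N*inc per clock, produce N phase samples per core clock. A rough Verilog sketch (illustrative only, not one of the app notes):

  // Illustrative: 16 accumulators at 500 MHz = a virtual 8 GS/s
  // phase stream (each lane would feed its own sine lookup).
  module dds_x16 #(parameter W = 32) (
    input              clk,       // 500 MHz core clock (assumed)
    input              rst,       // synchronous (re)start
    input  [W-1:0]     inc,       // phase step per virtual sample
    output [16*W-1:0]  phase_bus  // 16 phase samples, lane 0 oldest
  );
    reg [W-1:0] acc [0:15];
    integer k;
    always @(posedge clk)
      for (k = 0; k < 16; k = k + 1)
        if (rst)
          acc[k] <= k * inc;           // stagger each lane by k samples
        else
          acc[k] <= acc[k] + 16 * inc; // each lane jumps 16 samples/clock
    genvar g;
    generate
      for (g = 0; g < 16; g = g + 1) begin : lane
        assign phase_bus[g*W +: W] = acc[g];
      end
    endgenerate
  endmodule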

I suggest we keep generating creative app notes, reference designs, and evaluation boards, without calling them benchmarks.

BTW: Choosing between X and A isn't all that difficult; it's only between two suppliers. Selecting between umpteen brands of breakfast cereals, washing machines, cars, or colleges for your kids is a much tougher task. (Life in NZ may be simpler, but the choices in the US are mind-boggling. ;-) Peter Alfke

Reply to
Peter Alfke

Certainly. Yes, just publish what you are using, and how. Ideally, do a design that allows optional enable of that feature.

Again, certainly. I (and others) would be more interested in the ways to push the envelope than in a simple 500 MHz number. I would also suggest doing both, so users can see the performance jump, and the resources needed to get it (and any trade-offs?).

Understood - but the FPGA tests would be for a virtual A/D: you would design, and state that the system can burst-acquire [this size] at [this MHz] and stream continually at [that MHz]. If no such ADCs are available (yet), then the FPGA is not (presently) the weak link.

I expect, with 2 Gsps ADCs around, you would struggle to hit that?

- but you could go wider, e.g. 16-bit datapaths have little extra cost in an FPGA, but a 1 Gsps 16-bit ADC is quite different to design than a 1 Gsps 8-bit ADC...

I did see that one can now get 1 Gsps 16-bit DACs (!), so there is a reference point for a low-distortion signal generator design for your FPGA...

formatting link

Counter/pulse measure/generate designs you could ship as working demo-board examples [LCD display], but a full storage scope might need a rather special demo board, with a shorter design life... Talk with ADI or Atmel?

Call them anything you like :) - just update them real early in the product life cycles, so users can do the comparisons.

Call them 'priority updated App notes' if you like, and (re)post the sources, _and_ tool report files, for each speed-grade revision, so users can review the info, without needing to install version XYZ of the tools, or they can fully duplicate if they wish.

Actel and Lattice might beg to differ....?

-jg

Reply to
Jim Granville

Hi Jim,

I explicitly noted that toggle-rate limitations were an exception to my "5%" rule on core performance. We artificially restrict the performance of some blocks in the chip (relative to HSpice sims) pending silicon correlation. We like to test the chip under harsh conditions -- for example, with many blocks/registers all switching simultaneously. Once these tests show that a block can run at a particular speed, we may raise the "speed limit" for that block. Memory blocks are funny beasts that may not quite behave the way we expect based solely on simulations, and as such we make sure we spec their performance conservatively until we have silicon in our hands, tested at extreme conditions, to tell us it's OK.

Why do we do this? The Fmax of plain-old logic and the delay of any path that involves multiple block types and hops are not that sensitive to negative changes in spec. If we were (hypothetically speaking) to slow down the LUT by 3%, the user could probably cover the drop in speed by increasing effort levels in the CAD tools or re-optimizing their design. On the other hand, block speed (the maximum toggle rate of an M4K block, for example) is a system-architecture issue. If we say our RAMs run at 400 MHz and we're wrong, this could completely undermine a system design. No level of manual or CAD-tool optimization will ever cover a reduction in block performance.

We're always interested in showing off the best performance we can. In the case of our block speeds, we would like to expose faster speeds, but will only do so if we believe it is safe. In Quartus 4.1 we increased speeds, and it is likely we will do so again in a future release of the software. I'm not terribly worried about competing against Xilinx's block speeds, since (a) the "fast" device is not out yet and (b) it is rare for people to need to run their core blocks at such high speeds, since the logic + routing fabric (particularly in slower chips) can't keep up. Plus we'll see where the final spec for Stratix II ends up.

Some would argue (probably all from Altera :-)) that you do see that today with Xilinx. Not 3 sigma, but clearly they are pushing the process for the fast speed grade otherwise they'd have it out already. You'll notice that Altera has always released all three speed grades concurrently except for the very largest devices. You can draw your own conclusions from that.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Peter,

I congratulate you on your efforts to deflect attention from poor core performance to various IP block toggle rates. But I hate to break it to you, Peter -- while designs now incorporate more IP than they used to, they continue to have lots and lots of LUTs and routing. I don't think making a superior architecture for these "legacy" functions is a waste of energy.

I have never argued that core performance is the only aspect of chip performance, but it is a very important one. Our experience from Stratix was that not too many customers ever pushed blocks (RAMs, MACs) to run at their peak toggle rate. It is difficult to pump enough data through the rest of the fabric at a high enough rate to do so. There are some applications where the block toggle rate is a limiting factor, but they are few and far between.

Core performance can be an insurmountable roadblock to design success, and worse, it is hard to predict in advance. You know early in a design process that you need DDR at 200 MHz and a memory block operating at 350 MHz, and you can check the data sheet to see if the part will meet this need. You don't know what speed the logic of your design will hit until you've run P&R on the full design. With Stratix II, we minimize the chance that core performance will be an issue. Faster logic + routing performance translates into less effort for our users and our FAEs, since more designs meet performance goals at the push of a button.

What our +39% number is saying is that if both Stratix II and Virtex-4 meet the customer's needs in terms of major features (I/O standards, pins, etc) and critical block performance (say, necessary speed for a memory), then Stratix II will be the better choice -- it will be more likely to meet the overall performance goals of the product and will do so with greater ease.

Shared inputs do not need to be "tolerated" -- they are an available but not necessary feature. Ignoring all other aspects of the ALM, a 6-input LUT can do a lot more than a 4-LUT, reducing the depth of the critical path and hence increasing its speed. Where straight 6-LUTs lose out to 4-LUTs is in area (or silicon cost) -- that's where all the other innovations in the ALM come in, including shared inputs. For example, the ALM can split into two (fully independent) 4-LUTs. Or you can share some inputs and/or LUT-mask bits and create two larger functions. With the ALM, you get the speed when you need it and good area when you do not.
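A toy example of the depth argument (my illustration, nothing more): any 4-LUT mapping of the 6-input function

  assign y = (a & b & c & d) | (e & ~f);

needs two LUT levels (one LUT for a&b&c&d, a second to fold in e and f) plus the routing hop between them, while a single 6-LUT absorbs the whole function in one level.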

On the system/functional changes, a number of V4 features are merely playing catch-up to what was done in Stratix. You now have a MAC block, but you still can't do 36x36 or 9x9 multiplies efficiently. You now have a flexible clock network like Stratix. And some of the innovations are areas of past contention and are nothing new. Hard processors and SRL16 come to mind -- we've debated these before. On the high-speed I/O front, we will have Stratix II GX, and eventually you will release your devices with high-speed I/Os... it'll be Virtex-II Pro vs. Stratix GX all over again. That leaves us with FIFOs. They are interesting... but what value do they add over soft-logic implementations? Are they worth the silicon cost? And I should point out that Stratix II contains changes to the memory blocks that allow more efficient soft-logic FIFO implementations; we may not have built a full FIFO, but we made it easier to do so.
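For readers following along, here is roughly the hard part of a soft FIFO -- a sketch only (names are illustrative; resets, the mirrored read side, and full/empty flag generation are omitted for brevity). The pointers are kept in Gray code so only one bit changes per increment, which makes them safe to re-synchronize into the other clock domain; the storage itself maps onto a dual-port memory block:

  // Sketch: dual-clock FIFO write pointer, Gray-coded and
  // synchronized into the read clock domain.
  module wptr_cross #(parameter A = 4) (
    input            wclk, rclk,
    input            winc,            // push
    output reg [A:0] waddr_bin,       // binary pointer: RAM write address
    output reg [A:0] wptr_gray,       // Gray pointer, wclk domain
    output reg [A:0] wptr_gray_rclk   // Gray pointer as seen in rclk
  );
    reg  [A:0] sync1;                           // first synchronizer FF
    wire [A:0] bin_next = waddr_bin + winc;
    always @(posedge wclk) begin
      waddr_bin <= bin_next;
      wptr_gray <= (bin_next >> 1) ^ bin_next;  // binary -> Gray
    end
    always @(posedge rclk) begin                // 2-FF synchronizer
      sync1          <= wptr_gray;
      wptr_gray_rclk <= sync1;
    end
  endmodule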

There are in fact FIFOs, FIR, SRL16s and LUTRAMs in our benchmark set. But you do have a point -- no one will ever be able to make a generic benchmark set that will properly take advantage of system-level features of an architecture, and hence benchmarking will never tell the full story on system throughput.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

"Peter Alfke" schrieb im Newsbeitrag news: snipped-for-privacy@l41g2000cwc.googlegroups.com...

Hi Peter

I have already done that! It's relatively simple to use the MGT as a DDS with a virtual 10 GHz clock: for every user clock, 40 accumulator samples are calculated.

and I am also using the MGT as a 3 GS/s logic analyzer with ChipScope

I wish it would make sense for me to publish all that work

Antti

Reply to
Antti Lukats

Paul Leventis (at home) wrote: (non-constructive criticism snipped)

I do wonder if the optimal LUT size has changed over the years.

Is there work showing the optimal LUT size as a function of silicon resources needed to implement such LUTs?

-- glen

Reply to
glen herrmannsfeldt

Hi Antti -- just an idea -- how about Xilinx keeps you supplied with FPGA eval boards and tools, on 'long loan', and you supply Xilinx with source code...? Peter?

-jg

Reply to
Jim Granville

Glen,

 I do wonder">
> I do wonder if the optimal LUT size has changed over the years.

Good point. Paul has referred to their studies of replacing a 4-LUT with a 6-LUT, and then re-running synthesis to see just how much improvement one sees.

Assuming one can get enough >4 term,

Reply to
austin

and Intel also varies the Vth over 21 steps, to have CLK, Vdd, and Vth to tune for speed/power trade-offs.

formatting link ?articleId=59301578

A significant difference at the LUT spec level that I DID see (and I presume still applies?) is that Altera has differing LUT path delays (all LUT legs are not created equal), whilst Xilinx treated them all as equal. That means the Altera SW/HW can presumably choose the faster legs where that matters, and so shave hundreds of ps off the critical path? => Faster P&R on otherwise similar silicon?

-jg

Reply to
Jim Granville

Hi Glen,

Elias Ahmed & Jonathan Rose from the University of Toronto published "The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density". See

formatting link
Elias's M.A.Sc. thesis was on clustering and optimal LUT sizes. This paper contains many references to previous work in the area and is probably a good starting point. The paper's conclusion is that LUT sizes between 4 and 6 and cluster sizes between 3 and 10 LEs are best from a balanced area-delay perspective. If you want higher speed, larger LUTs are better. One suggested area of future research is finding a way to reduce logic levels without the area cost of large LUTs -- and this is what we have done in Stratix II with the ALM. Figure 12 is particularly interesting.

I think Guy Lemieux had some work in this area from his PhD -- not sure if it's published anywhere yet.

At the FPGA 2005 conference in two weeks, the Stratix II logic architecture and some experimental results will be presented in a paper by David Lewis et al.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Austin,

No offence, but I don't think you're going to get far with an attack on the logic architecture. I think you should read the paper at FPGA 2005 when you get a chance. It is very informative.

Irrelevant to the speed question. A simple 6-LUT in Stratix II does not share inputs. LUT input sharing is an area optimization that we employ intelligently to avoid any performance penalty while reducing the number of ALMs required.

I should also point out that all our experiments during architecture experimentation are full synthesis + place & route runs on 100+ designs. One thing anyone who works on FPGA logic & routing architecture knows is that intuition isn't worth too much -- you can argue about "will shared inputs hurt things?" until you are blue in the face, but in the end only an experiment will tell the truth.

Funny side story: One time during architecture experimentation, someone put up a graph of some parameter (I forget what). We sat there and rationalized why the answer would come out that way, and were all content. Then we realized the graph was backwards and the trend was actually the other way around. We could rationalize that answer too...

Bottom line: Logic & routing architecture development must involve large amount of experimentation with real cad tools, otherwise you just don't arrive at the right answer.

Wrong again. The ALM breaks into two independent 4-input LUTs with no delay penalty (ok, maybe a couple ps for a gate load or two) relative to a Stratix-like 4-LUT.

Your first claim is correct. And the ALM gives the speed of a 6-LUT, but the extra circuitry added to support fractured modes (2 4-LUTs, shared LUTs, etc.) is minimal and adds a tiny amount of delay to the guts of the LUT.

Good questions. The reality is that a real design mapped into 6-LUTs doesn't yield all 6-input functions -- it decomposes into a set of functions ranging from 1 to 6 inputs. That is why making a pure 6-LUT architecture is not so great for area -- for those functions that don't use 6 inputs, you are wasting a lot of routing & logic area. That is why we added the fracturing capabilities of the ALM. This makes the ALM more expensive than a straight 6-LUT for implementing 6-input functions, but overall, once the distribution of functions is taken into account, the ALM comes out on top.

One alternative could be to make an architecture with a hybrid set of LUT sizes. But then you have to wonder whether you've picked the right mix of the two; you have the pain of heterogeneous floorplanning in layout, potential issues with placement, more complicated CAD tools, etc.

No, the ALM can't do that. We've argued about that many times in this newsgroup -- SRLs and LUT RAM add cost to the Logic Element. M512 memories are more efficient for many circuits, while SRLs/LUT RAM are more efficient for others. Are SRL/LUT RAM a bad idea? No. Are they a slam dunk? No.
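For readers keeping score, the function under debate is roughly this (generic RTL sketch, my wording; Xilinx folds it into a single LUT, while in a Stratix-style device the same thing costs a register chain plus a mux, or a slice of an M512):

  // What an SRL16 provides: a 16-deep shift register with a
  // dynamically addressable tap.
  module srl16_equiv (
    input        clk, ce, d,
    input  [3:0] a,   // dynamic tap select
    output       q
  );
    reg [15:0] sr;
    always @(posedge clk)
      if (ce) sr <= {sr[14:0], d};
    assign q = sr[a];
  endmodule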

Yes, that was a risk. We took that risk because we believed the ALM was a large enough win on the performance front to be worth the investment in synthesis and the risk to product success. If the ALM had given us 5% performance, there's no way we would have gone for it. But sometimes you need to take a big jump in architecture to get out of a local minimum in the space of architecture possibilities.

We worked with our 3rd party synthesis vendors well in advance of the release of Stratix II, and our own integrated synthesis was used during architecture development and thus was already ready to go.

Actually, the area of everything (LUTs, routing, RAMs, etc) shrinks as technology does, so I don't think the 4-LUT vs. 6-LUT question changes too much with process. The only effect here is that perhaps as the amount of delay in routing vs. logic moves around with process, the precise answer as to what LUT and Cluster sizes are optimal will shift slightly, but this is probably a small effect. A bigger effect could be evolution in the quality of synthesis and CAD tools, the changing nature of user designs, and advances in routing architecture.

I haven't seen any worst-case power numbers from you guys Austin. Typical is a marketing number -- how do you define your "typical" silicon? Let's hold off the power conclusions until both companies release final specs.

Besides, total power is what matters, and you guys have curiously been shying away from dynamic power. Your own web page claims equivalent dynamic power for a Virtex-4 LUT + routing vs. a Stratix II LUT + routing -- but our LUTs implement 25% more logic, and hence there are fewer of them in a given design. Our pin capacitance is half that of V-4 -- you know what this does for I/O power? Imagine 200 I/Os toggling at 200 MHz with 6 pF instead of 12 pF loads @ 2.5 V (just as an example) -- if I've done my math right, that's 1.5 W right there. Now there's a strong chance I've done my math wrong (give me a break, it's late) but you get the point!
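(Checking it: P = N x C x V^2 x f per set of toggling pins, so the difference is 200 x 6 pF x (2.5 V)^2 x 200 MHz = 1.5 W -- the math holds, under the simplifying assumption that every pin charges its full load once per cycle.)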

Yes, there are tricks to employ. But they can cost speed. They can cost area. They can increase wafer costs. All involve trade-offs. In the end, we each picked the tricks we wanted to use. Gate oxide is only one variable to play with, and not the most effective one on the speed vs. leakage trade-off front.

True. Both companies have teams beating on this software for lots of performance. But if you're now saying that perhaps future software and a future speed grade will help you catch up on performance, I'm feeling pretty happy with our position.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Jim,

Our LUTs have (significantly) different delays on different inputs.

Yes, the software does take advantage of the variance in LUT delay to optimize the critical path. This is why using 6-LUTs to implement 4-input functions is no worse for speed than using a 4-LUT -- the four fastest inputs of a 6-LUT are basically the same speed as the four inputs of a 4-LUT.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Oops, I missed one point (thanks Carolyn!).

Yes, the ALM is a universal 6-LUT. It can do some functions of 7 inputs, and all functions of 6 inputs.

Please refer to

formatting link
Page 2-8 is the diagram you want to stare at closely.

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Jim,

Yes, I see everyone taking advantage of the fact that LUT delays vary with the input used. The last-stage selection is always faster, so there is one input that is always faster than the others. It makes the software slower at P&R, but there is an advantage which, if taken, can provide those few extra ps of improvement.

As for Vt's, there is low Vt and high Vt. As for body biasing, it has practically no benefit (the slope of body bias vs. power is terrible; it is easier to just raise or lower Vdd). So Intel has 100 adjustments, most of which are useless.

And, yes, we use low, and high Vt's, on core, mid-ox, and thick ox in the triple oxide process. Only way to go.

Austin

formatting link ?articleId=59301578

Reply to
Austin Lesea
