V4 vs. Stratix-II...

Joseph H Allen 21 years ago

I'm upgrading a design, and I'm in the early phases of choosing a vendor. I'm trying to compare parts based on experience I've had in the past, so I'm focusing on block RAM clock to out delay as a critical performance number:

Altera M4K vs. Xilinx Block RAM clock to out delay, non-registered outputs:

Stratix-II  -3   2.46 ns
Stratix-II  -4   2.828 ns
Stratix-II  -5   3.393 ns

Xilinx-V4   -11  1.83 ns
Xilinx-V4   -10  2.10 ns

Xilinx-V2   -4   2.65 ns (current part)

V4 appears to be 1.62 times faster than Stratix-II for the slowest speed grade parts (which I'm probably most interested in, though I should really compare equal priced parts), and Stratix-II appears slower even than the V2 part in the original design. Am I missing something? Several posts here suggest that Stratix-II interconnect is faster - is there any datasheet evidence to back this up? Let's say the RAM output is at least feeding a 2:1 MUX before being registered, and probably has to travel ~1/3 the width of the chip.
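For anyone checking the arithmetic, the 1.62 figure is the ratio of the slowest-grade clock-to-out numbers listed above. A minimal Python sketch, using only those quoted values (the dictionary and function names are made up for illustration):

```python
# Block RAM clock-to-out delays in ns (unregistered outputs), as quoted above.
TCO_NS = {
    ("Stratix-II", "-3"): 2.46,
    ("Stratix-II", "-4"): 2.828,
    ("Stratix-II", "-5"): 3.393,
    ("Xilinx-V4", "-11"): 1.83,
    ("Xilinx-V4", "-10"): 2.10,
    ("Xilinx-V2", "-4"): 2.65,
}


def delay_ratio(slow_part, fast_part):
    """Return how many times longer slow_part's delay is than fast_part's."""
    return TCO_NS[slow_part] / TCO_NS[fast_part]


if __name__ == "__main__":
    # Slowest Stratix-II grade against the slowest V4 grade.
    print(round(delay_ratio(("Stratix-II", "-5"), ("Xilinx-V4", "-10")), 2))
    # Slowest Stratix-II grade against the current V2 part.
    print(round(delay_ratio(("Stratix-II", "-5"), ("Xilinx-V2", "-4")), 2))
```

This only compares data sheet numbers for one parameter; it says nothing about routing or about equal-priced parts.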

Also, help me fill in my chart:

LUT delay:

Xilinx-V2   -4   439 ps
Xilinx-V4   -10  200 ps
Xilinx-V4   -11  170 ps
Stratix-II  ?    (can't find any data)

Carry delay:

Xilinx-V2   -4   106 ps
Xilinx-V4   -10  90 ps
Xilinx-V4   -11  80 ps
Stratix-II  ?    (can't find any data)

Routing delay:

I can do this with fpga_editor in Xilinx. How to do it for Stratix-II ?

/* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */ int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) +r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p158?-79:0,q?!a[p+q*2 ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

Ben Twijnstra 21 years ago

Hi Joseph,

I stopped reading data sheets since they're way too big and the information is never organized the way I need to have it. So I tend to simply write little test cases and let the tools tell me what I need to know.

I would personally just compile the design with your new constraints in both ISE and Quartus II (v5 has just been released) and see who comes out best.

I suggest you re-check Stratix-II timing with Quartus II 5.0 - Altera has been doing some re-characterization which seemingly hasn't made it to the handbook yet. In an M4K I am using in a Stratix II I'm getting 1.85ns for a -3 part and 2.4ns for a -5 part.

On LUT delay: well, it kind of varies between (off the cuff) 83ps and 400ps depending on the input that changes and the mode the ALM is in.

Easy to check in Quartus with, for example, an 8-input AND or so. I'm getting cell delays between 0.047 and 0.404ns depending on the mode and the input of the ALM (see below on how to do this).

Open the timing analyzer. Right-click a path and select "List Paths" from the menu. When expanding the messages in the status window you should get detailed info on both cell and routing delay of the path.

Best regards,

Ben

Peter Sommerfeld 21 years ago

Hi Joseph,

Remember that in Q II 5.0 the M4k performance has increased from 400 to 550 MHz. It looks like you're using the out-of-date numbers for tCO. The new ones should be ~ 1.88 ns (I'm guessing).

There's a few ways to find the routing delays in Q II. The most detailed way is to open the Timing Floorplanner (Assignments/Timing Closure Floorplan), right-click a used logic cell, and choose Locate>Chip Editor.

right-click, and choose "Generate Connections Between Nodes". You can show the actual routes used with View/Highlight Routing.

The easier way is to stay in the Timing Floorplanner, Ctrl-click the stuff you want to find delays for, make sure View/Routing/"Show Routing Delays" is selected, and choose View/Routing/"Show Paths Between Nodes".

Interesting ... the Stratix II handbook doesn't have LUT timing params. I was sure they were there for Stratix. Well it shouldn't be too difficult with Chip Editor ... maybe someone gets an answer before I do ...

-- Pete

Austin Lesea 21 years ago

Joseph,

I just saw a presentation that shows that V4 is faster on all interconnect paths (by as much as 500 ps for long paths) except the immediate neighbor paths, where we are just ever so slightly slower than S2 neighbor paths.

I also saw LUT comparisons, which took 8 slides, with animations, as comparing the 4LUTs to the ALM-LUT is not trivial: you have to look at each and every input to output delay. And then you have to make a guess as to how your logic will get synthesized. Yes, we are faster for 4 LUT (most inputs), and they are faster for wider functions (but not all inputs).

For example: S2 4LUT input delays to output (in order): 155ps, 382ps, 360ps, 275ps. V4 4LUT: 165ps, 165ps, 165ps, 165ps. (fastest speed grades, both companies).

Then there is the interconnect. V4 is 500 ps faster for full chip routes, 400 ps faster for 1/2 chip routes, 100-200 ps faster for a few CLBs, LABs, and 100-200ps for neighbor routes. Some very short routes are 30ps better in S2.

Below 32 bits, S2 is slightly better for an adder, and over 32 bits, V4 is better. Same for carry chain, where S2 is ~ 200 ps better at ~ 16 bits, and V4 is >500ps better at 48 bits, and longer carry chains (equal at 24 bits).
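To get a rough feel for where a given carry chain length falls between those three quoted points, one could interpolate between them. The sketch below assumes the difference varies linearly between the points, which is an assumption of this example and not something the post states; the numbers are approximate, and the 48-bit value is a lower bound (">500ps"). All names are made up for illustration.

```python
# (carry chain length in bits, V4 advantage over S2 in ps).
# Negative means S2 is faster. Values are the approximate ones quoted above.
POINTS = [(16, -200.0), (24, 0.0), (48, 500.0)]


def v4_advantage_ps(bits):
    """Linearly interpolate the quoted V4-vs-S2 carry chain difference.

    Linear behaviour between the quoted points is assumed, not measured.
    """
    if not POINTS[0][0] <= bits <= POINTS[-1][0]:
        raise ValueError("outside the quoted 16..48 bit range")
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if x0 <= bits <= x1:
            return y0 + (y1 - y0) * (bits - x0) / (x1 - x0)


if __name__ == "__main__":
    for width in (16, 20, 24, 32, 48):
        print(width, v4_advantage_ps(width))
```

Real numbers for a specific design should come from the timing analyzer in each tool, not from this kind of estimate.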

In our suite of test designs, we come out ~9% faster (on average) with a +/- 4% error margin. Of course some designs will be faster than that, and some slower, too. We generally favor wider arithmetic, and pipelining, where S2 favors empty designs, and small arithmetic functions. We tend to excel when the design gets full, and complex (like it does at the end of your project!).

BRAM functionality depends a lot on the use of registers, as using the fabric registers really slows things down (and takes more power) compared with using the registers built into the BRAM. Of course, anything you can direct into the DSP48s will just scream, and outperform anything S2 has.

I think that the newsgroup here will basically tell you to try a design in both architectures, and play with the constraints to see how well it does.

Or, what I prefer, is to contact the FAEs of the respective companies, and ask them to show you how your design will perform (let them drive the tools).

Or, do both.

Austin

Tommy Thorn 21 years ago

...(lots of numbers deleted)...

Without detailing what you're comparing (ie., which device at which speed grade) none of this is meaningful.

Tommy -- not affiliated with either fighting bulls.

austin 21 years ago

Tommy,

I thought I was clear, fastest speed grade, S2 and V4.

Austin

Jim Granville 21 years ago

Since this is side-by-side, I was wondering why Xilinx spec all paths the same.

Is that actually the worst path, and then the SW is free to use any path ? [but your physical speed margin might change, on a re-route]

Or is there really such a difference in the implementation that Xilinx's end up precisely identical, and Altera's vary over 2:1 ?

-jg

John M 21 years ago

Joseph,

I agree with Ben. With so many variables and so much marketing B.S., your best bet is to compile using both a V4 and SII. I've found that performance is highly dependent on implementation, synthesis tools, and how full the device is. These are all variables outside of your FPGA vendor selection. You also note that you're probably going with the slowest speed grade, so I assume cost is an issue. A true comparison cannot be made with cost included. In addition, you should also consider whether EasyPath for Xilinx or Hardcopy for Altera are alternatives to help lower your cost. Finally, I would like to make one point about interconnect. Who cares if V4 or SII is slightly faster? It's the routing software that is going to make the major difference. Whichever software requires me to do the least amount of floorplanning is the one that wins. Also, how well does the software perform as the chip gets full? Personally, I think the floorplanning tools of ISE are easier to use than Quartus. However, I think Quartus does a much better job at placement and routing as a design gets very full (>90% utilization).

John

Austin Lesea 21 years ago

Jim,

I have been corrected by many. No, they are not all the same (in the hardware, and as an IC designer, I already knew that). However, in the past they were treated as all equal (for efficiency, finding and using the faster path is not necessarily a big benefit).

I do not know if the paths are treated the same or not (on the 4LUT) in V4 p&r. I am sure someone will tell me (now).

I think the point I was trying to make is that the 4LUT is faster than the ALM for a class of functions (4 inputs or less), and slower for wider functions (on some pins). So, the quality of the synthesis, followed by the place and route (constraints) will make a huge difference in the performance.

I have been told that every design that is better in S2 can, after some work, be made even better in V4. I do not doubt that Altera can, and does, make the exact same claim.

I disagree that the ultimate (best) performance in S2 is better, as that is not what our research has shown. Again, Altera has their own suite of XX designs that they use to benchmark their device, and they also make exactly the same claim.

Given the state of the marketing wars (see the "mine is...." thread), I think I'll stay safely in the engineering camp, and say: if you are really adamant about comparing the two, go take your finished design, and run it through both design tools, and make your own decision. Our FAEs are available to help you with that chore.

And please take into account that we offer: DSP48, EMAC, PPC, FIFO-BRAM that can be used to even greater advantage.

Austin

Rudolf Usselmann 21 years ago

...

Austin,

to settle this argument once and for all, why not take a bunch of designs that are freely available on OpenCores, and present utilization and performance reports without doing any tweaking of the designs ? There are many VHDL and Verilog designs available on OpenCores from CPUs, to Crypto cores to communication cores.

Both companies could present their own results including with a script as to how to reproduce the results, in case somebody wanted to double check.

If you could agree to do this for Xilinx, and perhaps we get a volunteer from the Altera Camp, we can openly choose some designs ...

Best Regards,
rudi

Rudolf Usselmann, ASICS World Services
Your Partner for IP Cores, Design, Verification and Synthesis

Peter Alfke 21 years ago

Rudi, nice idea, but it won't work, with the two companies involved. Many years ago, there was PREP, with a very similar idea. It died because the FPGA manufacturers could not resist the temptation to tinker with the results ( I used the words "lied and cheated"). Our "friends" presented designs with "virtual" flip-flops, to improve the packing density. It became one big shouting match.

The stakes are just too high for either of the marketing departments to admit "defeat", and there are too many subtle aspects of designing with FPGAs, hardware and software. "Everybody is the winner" will be the unavoidable outcome.

It seems that the user community likes the intense competition and diversity. And we like the fact that FPGAs have not become a commodity where price is the only differentiator. There is still lots of room for creativity and innovation.

Peter Alfke

Antti Lukats 21 years ago

Rudi,

It would not work that way, and you would get nil support for the idea (officially at least) from any FPGA vendor. There is just too much at stake. But some companies are doing something similar by having test environments which are run against the latest tools from multiple FPGA vendors. Those are the companies that design FPGA/ASIC tools. And to my knowledge most of those companies are pissed at the FPGA companies, because their bread is getting scarcer as the FPGA vendor tools get better (or include new functionality), and I think there are some other problems also. Anyway, those companies run testbenches. For a slightly different reason, but I think they pretty much 'see' and 'know' the differences between the FPGA fabrics from different vendors. But all that benchmarking stays strictly inside those companies and there is no public info. Open 'FPGA' benchmarking has failed. It is virtually impossible to do without some kind of bias, and the results are not usable without a very strict explanation of the circumstances under which the comparison is valid. The HDL-to-fabric mapping (the whole process) is too complex, and there are too many small things that may or may not have an impact on the results.

Antti with his last 2 cents :)

Austin Lesea 21 years ago

Rudi,

The problem is that without any regard to device specific features, the results will vary by a tremendous amount.

Austin

Joseph H Allen 21 years ago

Thank you all. This has been very helpful.

-- /* snipped-for-privacy@world.std.com (192.74.137.5) */ /* Joseph H. Allen */ int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)

+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p158?-79:0,q?!a[p+q*2 ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

Simon Peacock 21 years ago

I think you are somewhat missing the point with the A & X question.. in that you ask the wrong question.

It's not who has the best architecture or which one is fastest.. it actually doesn't really matter... for 99% of the designs, as Austin's pointed out before... either is good enough... and if you're in the 1% where it matters, then nothing you do will give you a good enough idea until you try to fit the final FF or CLB, and even then your design will be so customised that an A design is almost impossible to translate to X and vice versa.

What really matters is what price X or A's FAE will sell you the parts at, what support they will give you, what evaluation boards are about that do some if not all your needs.

The decision at my work was which company gave us the best discount, That happened to be Xilinx. It also happened that they do bus LVDS which we are using so our design naturally forced A out anyway, we just didn't tell anyone :-)

If you are building a one off then it really doesn't matter anyway. Use a dartboard and a blindfold; it will be as accurate as a detailed study... for a one off.. just choose an eval board with a largish device, get it all working and see how big it is, then choose a device twice the size required (for the inevitable fixups)

my two cents

Simon

Paul Leventis (at home) 21 years ago

Hi Joseph,

First, I must stress that compar...

As some posters have already pointed out, RAM speeds have increased in Quartus 5.0. The latest comparison I've seen shows us with a Tco advantage vs. Virtex-4 when the RAM output registers are used, and a slight disadvantage when the RAM is unregistered -- in either case a few hundred ps difference.

As for LUT delays, here are the latest numbers I've got for a fastest speed grade 7-input LUT (the ALM can do some functions of 7 inputs, and all functions of 6 inputs), as well as for a 4-LUT (the ALM can do two independent 4-LUTs).

Input  7-LUT   4-LUT
A      378 ps  366 ps
B      357 ps  228 ps
C      240 ps  225 ps
D      240 ps  53 ps
E      144 ps
F      53 ps
G      234 ps

According to Austin's post, Virtex-4 (fastest speed grade -- I dare you to try to buy one ;-)) shows 165 ps across-the-board (seems bogus to me, but what do I know). So which LUT is faster based on this data? Well, it depends on how we lumped our delays into logic vs. routing (see above). It also depends on how often Quartus II will manage to route your critical signal on the fast LUT inputs -- usually it does a very good job of this.
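One way to see why "it depends" is to tabulate the per-input numbers against the flat 165 ps figure. The Python sketch below uses only the delays quoted in this thread; the names are made up for illustration, and it ignores how each vendor splits delay between logic and routing, which is exactly the caveat raised above.

```python
from statistics import mean

# Stratix II ALM delays in ps per input, fastest speed grade, as quoted above.
S2_7LUT_PS = {"A": 378, "B": 357, "C": 240, "D": 240, "E": 144, "F": 53, "G": 234}
S2_4LUT_PS = {"A": 366, "B": 228, "C": 225, "D": 53}

# Flat Virtex-4 4-LUT figure quoted from Austin's post.
V4_4LUT_PS = 165


def summarize(delays):
    """Return (fastest, average, slowest) input delay in ps."""
    values = list(delays.values())
    return min(values), mean(values), max(values)


def inputs_faster_than(delays, reference_ps):
    """Return the input names whose delay beats the reference figure."""
    return sorted(name for name, d in delays.items() if d < reference_ps)


if __name__ == "__main__":
    print("S2 4-LUT (min, mean, max):", summarize(S2_4LUT_PS))
    print("S2 4-LUT inputs faster than V4:", inputs_faster_than(S2_4LUT_PS, V4_4LUT_PS))
    print("S2 7-LUT inputs faster than V4:", inputs_faster_than(S2_7LUT_PS, V4_4LUT_PS))
```

Which side looks faster then hinges on how often the critical signal lands on the fast inputs, so a table like this is a starting point and not a verdict.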

The other critical component for logic fabric performance is the routing. Based on an analysis of routing delay between registers placed a varying distance apart in the X- and Y-directions, we've found that we have a ~20% delay advantage (fastest speed grade vs. fastest speed grade). Of course, even this type of study has its caveats -- how do you normalize distance to take into account differences in logic density?

Stratix II employs a low-k inter-metal dielectric (k = 2.9) vs. Virtex-4's "reduced-k" dielectric (k = 3.6), giving us a ~20% metal capacitance advantage. If you set aside architectural and circuit differences, to first order you'd expect this to translate into a performance advantage for Stratix II.
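The ~20% figure follows if metal capacitance is taken as proportional to the dielectric constant with identical wire geometry. That proportionality is a first-order assumption of this sketch, and the function name is made up for illustration:

```python
def capacitance_reduction(k_low, k_ref):
    """Fractional capacitance reduction when k_low replaces k_ref.

    Assumes capacitance scales linearly with k and that wire geometry
    is otherwise identical, which is only a first-order approximation.
    """
    return 1.0 - k_low / k_ref


if __name__ == "__main__":
    # k = 2.9 (low-k) against k = 3.6 ("reduced-k"), as quoted above.
    print(f"{capacitance_reduction(2.9, 3.6):.1%}")
```

How much of that shows up as speed in a real FPGA depends on the architecture and circuits, which is the part set aside in the paragraph above.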

Regards,

Paul Leventis Altera Corp.

Paul Leventis (at home) 21 years ago

I would guess that you did not normalize to take into account packing density. How do you define a "short" route? Do you multiply the # of CLBs and # of LABs by the right ratio of logic? I'd argue that 1 LAB = 8 ALMs = ~10-10.5 slices (based on our density analysis).

Anyway, the average distance of a hop in a critical path is roughly 3 LABs, so short connections are the most important. Our data shows a performance advantage in hops of this length.

That's interesting... did you miss the news that we've increased Stratix II DSP performance to 550 MHz in Quartus II 5.0? Not to mention that the S2 DSP can do 36-bit multiplies in hardware (vs. 18-bit for DSP48)... but I will not digress into a feature pissing contest.

On this, I agree with Austin. Kick the tires. Just be sure to set timing constraints before doing so, and also make sure not to use "toy" designs (neither tool is particularly well optimized for very small designs in very large chips). And beware numerical noise -- placement & routing is a heuristic. If you perturb any aspect of the input, the output can change due to random differences in algorithm outcome.
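One way to guard against that noise is to run each tool several times with different seeds and only trust a difference that is clearly larger than the run-to-run spread. The sketch below is a crude heuristic and not a proper statistical test; the function names and the two-sigma threshold are assumptions of this example, and the Fmax lists would have to come from your own runs.

```python
from statistics import mean, stdev


def summarize_runs(fmax_mhz):
    """Return (mean, sample standard deviation) of Fmax over several seeds.

    Needs at least two runs, otherwise the spread is undefined.
    """
    return mean(fmax_mhz), stdev(fmax_mhz)


def difference_exceeds_noise(runs_a, runs_b, sigmas=2.0):
    """True if the mean Fmax gap is larger than `sigmas` times the combined spread."""
    mean_a, sd_a = summarize_runs(runs_a)
    mean_b, sd_b = summarize_runs(runs_b)
    combined_spread = (sd_a ** 2 + sd_b ** 2) ** 0.5
    return abs(mean_a - mean_b) > sigmas * combined_spread
```

Call `difference_exceeds_noise` with the per-seed Fmax results from each tool; if it returns False, the two results are probably within placement and routing noise of each other.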

Regards,

Paul Leventis Altera Corp.

Austin Lesea 21 years ago

Paul,

Yes, you can get the fastest speed grade. Really a cheap shot, that one. I sense some real desperation.

And, stop with the low-K dielectric. All of the Toshiba parts are low K. Guess what? We do not speed grade or power grade them differently, because it just doesn't make that much of a difference!

Perhaps an ASIC can take proper advantage of low K, but the FPGAs just do not show much of an improvement at all.

And stop with the power "advantages of S2."

The Japanese engineer who touched the S2 and V4 chips on our demonstrator said it all: "S2 hot! V4 cool..."

Austin

Paul Leventis 21 years ago

So the long delay in getting the -12 speed grade out had nothing to do with this fab transition? It must be fun characterizing one product produced in two fabs with two different processes (one low-k, one not, and who knows what else is different).

I wish we had this "defy the laws of physics" technology you use on Virtex-4. First you claim your devices do not draw more current with increased voltage. Then you claim that increased metal capacitance has no impact on speed or power. I'm waiting for you to claim that I/O pin capacitance doesn't matter for performance, signal integrity or power...

A very scientific test! Let's do some quick math here... Even if you found some demo with a 1W VccInt difference, this should only translate to ~10 C difference in chip temperature (still air, no heat sink on 2S60 --> Theta-JA = 10.4 C/W), which would hardly be discernible to the touch. Why was this demo so much hotter to the touch then? My educated guess (based on the analysis of one of our customers) is that you had unequal I/O settings, causing lots more I/O dissipation in our chip. Really, that is rather low.
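The quick math is the usual steady-state relation: temperature rise equals dissipated power times junction-to-ambient thermal resistance. A minimal Python sketch with the two numbers quoted above (the function name is made up for illustration):

```python
def temperature_rise_c(power_w, theta_ja_c_per_w):
    """Steady-state temperature rise in degrees C for a given power difference."""
    return power_w * theta_ja_c_per_w


if __name__ == "__main__":
    # 1 W VccInt difference, Theta-JA = 10.4 C/W (still air, no heat sink).
    print(temperature_rise_c(1.0, 10.4))
```

The same function can be reused with a different Theta-JA if a heat sink or airflow is present.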

Regards,

Paul Leventis Altera Corp.

Austin Lesea 21 years ago

Paul,

I am sure the newsgroup is getting really bored with this. I certainly am. Short and sweet:

Two fabs: It is a challenge, but then having two qualified sources of supply is a definite advantage for our customers.

Low-K: Don't get me wrong, I like low K, I like low pin capacitance too. I also like fine wine, and a good meal. I had already asked you to fab the S2 without low-K and measure it. We did that for V2 and V2P, and again for V4 at Toshiba and UMC. We know. You guess.

Low power: What is low, is our power dissipation. The static leakage kills you folks as the part gets hot. And what FPGA in the high end isn't running hot? Yours just run even hotter due to the leakage (or require more expensive heatsink solutions). This one is so easy to prove it is silly for you to even try to compete on total power.

Austin
