Performance claims

- Austin Lesea

Tue, Dec 7, 2004 4:00 PM

All,

formatting link

For anyone interested in how V4 really stacks up.

Austin

- rickman

Tue, Dec 7, 2004 4:40 PM

Stacks up to what? FPGA-90 is no product that I am aware of. Why can't Xilinx use the name of the competition part? Otherwise this is a pretty pointless paper.

--

Rick "rickman" Collins

snipped-for-privacy@XYarius.com Ignore the reply address. To email me use the above address with the XY removed.

Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL

formatting link

4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

- Kolja Sulimma

Tue, Dec 7, 2004 4:40 PM

Recheck Table 2. The VHDL code is swapped.

Kolja Sulimma

- Symon

Tue, Dec 7, 2004 4:56 PM

- steven derrien

Tue, Dec 7, 2004 5:03 PM

Hi Austin,

I just had a quick look, and there seems to be a mistake in table 2, p.5 (Verilog descriptions should be swapped for one stage vs two stage pipeline).

Regards,

Steven


- Antti Lukats

Tue, Dec 7, 2004 5:30 PM

There was one good pointer in the above Xilinx white paper! It's on page 6

formatting link

! :)

And yes, it looks like Stratix just got a new name: "FPGA-90nm"! LOL, if "FPGA-90nm" is now a reference/alias to Altera Stratix, then it's a good ad for them! Or?

Antti

- Austin Lesea

Tue, Dec 7, 2004 6:30 PM

Symon,

Checking.....


- Austin Lesea

Tue, Dec 7, 2004 6:56 PM

Yes,

Code is swapped in the table.

Will be fixed shortly.

Thank you to all who caught it.

It is not supposed to be a test!


- Paul Leventis (at home)

Thu, Dec 9, 2004 1:21 AM

I would like to offer some clarification of points raised in this whitepaper, first in summary and then in some detail. I will occasionally refer to our web-based performance seminar

formatting link

for further details.

Constraints. The clock constraint methodology we employ matches that outlined in the whitepaper. It is good to see that both companies can agree on something!

High-Effort Compiles. We run the ISE software in the mode that yields the highest results across our benchmark set. We also run a seed sweep ("multi-pass") for ISE at the end of the process.

Retiming. ISE does not offer physical synthesis during place and route. Quartus II does. We do not use XST (and hence XST retiming) since we find this results in a far greater disadvantage for Xilinx than when we use a common synthesis tool (Synplicity in this case).

Block Performance. Maximum block toggle rates are pretty worthless if the fabric that stitches the blocks together can't keep up. Our design set includes a variety of types of resources including RAMs and DSPs, yet yields +39% performance advantage. Why? Our blocks have comparable propagation delays which it turns out matters more, and our logic & routing are substantially faster. Also, our Fmax limits have increased in Quartus II 4.2 and will continue to increase as we complete our detailed characterization process.

Design entry. Good advice that applies to any modern FPGA (Stratix II and Virtex-4).

Speed Grades. We compare to what's available in the software. If users know how much faster a -12 device will be (we do not), they can derate our 39% average performance advantage accordingly.

Clock Constraints
^^^^^^^^^^^^^^^^^

We appear to agree on how to constrain clocks.

For synthesis, we employ the flow suggested by Synplify to optimize multiple clock designs. This results in optimization of all clock domains. Are there other ways to do it? Probably -- but since Synplify Pro 7.7 is a common denominator in our comparisons, it is hard to see how changing this would affect the 39% average performance advantage that we see for Stratix II.

For ISE, as outlined in the web-seminar (slide #9) and other locations, we constrain each clock independently and iterate to find the best such (tight) constraints. As you suggest, we do not look at paths that cross clock domains (difficult to do in an apples-to-apples way). We do not over-constrain ISE, as we have found this degrades Xilinx performance. Slide #10 shows the results of the iterative constraint process for one design (with two clocks); I think it highlights the rigour and correctness of this process.

I should point out that for Quartus II, we don't need to jump through hoops, since applying a global 1 GHz constraint on the clocks will result in each clock being optimized as best as possible.
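The per-clock iteration described above for ISE can be pictured roughly as follows. This is only a minimal sketch, not the actual procedure from the seminar slides: `achieved_fmax` is a hypothetical stand-in for a full compile, the step size and starting point are assumptions, and the toy model simply caps each clock at an invented limit.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a full place-and-route run: given a requested
# clock constraint in MHz, return the Fmax the tool actually achieved.
CompileModel = Callable[[float], float]


def tightest_met_constraint(achieved_fmax: CompileModel,
                            start_mhz: float,
                            step_mhz: float = 5.0,
                            max_iters: int = 50) -> float:
    """Tighten one clock's constraint until the compile no longer meets it.

    Returns the tightest constraint that was still met (0.0 if none was),
    so the clock is never left over-constrained.
    """
    best = 0.0
    target = start_mhz
    for _ in range(max_iters):
        if achieved_fmax(target) < target:
            break  # constraint missed: stop tightening
        best = target
        target += step_mhz
    return best


def constrain_each_clock(models: Dict[str, CompileModel],
                         start_mhz: float = 100.0) -> Dict[str, float]:
    """Search each clock domain independently (no cross-domain paths)."""
    return {name: tightest_met_constraint(model, start_mhz)
            for name, model in models.items()}


def toy_compile(cap_mhz: float) -> CompileModel:
    """Invented model: the tool slightly beats the target up to a hard cap."""
    return lambda target: min(target + 3.0, cap_mhz)


if __name__ == "__main__":
    clocks = {"clk_a": toy_compile(212.0), "clk_b": toy_compile(148.0)}
    print(constrain_each_clock(clocks))
```

In a real flow each call to the model would be a complete compile, so the step size trades run time against how tight the final constraint is.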

Synthesis/P&R Effort

^^^^^^^^^^^^^^^^^^^^

On the P&R front, we use the ISE settings that yield the best performance results across our benchmark set. We also run a seed-sweep (or "multi-pass" compile) using ISE at the end of our iterative process.

For synthesis, we have no reason to believe that enabling a high-effort mode in Synplicity would change the conclusions of our comparison, since we are using the same synthesis tool for both Stratix II and Virtex-4.
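A seed sweep of the kind mentioned above amounts to compiling the same design with several placer seeds and keeping the best result. The sketch below only illustrates that idea: `toy_compile` is an invented stand-in for a place-and-route run, and the 200 MHz baseline and +/-5 MHz spread are made-up numbers, not measurements.

```python
import random
from typing import Callable, Iterable, Tuple


def seed_sweep(compile_fmax: Callable[[int], float],
               seeds: Iterable[int]) -> Tuple[int, float]:
    """Run one compile per seed and return (best_seed, best_fmax_mhz)."""
    results = [(seed, compile_fmax(seed)) for seed in seeds]
    return max(results, key=lambda result: result[1])


def toy_compile(seed: int) -> float:
    """Invented model: seed-dependent noise around a 200 MHz baseline."""
    rng = random.Random(seed)
    return 200.0 + rng.uniform(-5.0, 5.0)


if __name__ == "__main__":
    best_seed, best_fmax = seed_sweep(toy_compile, range(1, 6))
    print(f"best seed {best_seed}: {best_fmax:.1f} MHz")
```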

Register Retiming/Physical Synthesis

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Quartus II can perform physical synthesis optimizations during place-and-route. These algorithms have access to detailed placement and timing information, enabling further optimization that synthesis just can't know about. ISE does not provide any such optimizations. Note: We always include Tco and Tsu constraints, so our re-timer will not violate I/O timing to improve core speed.

We did not use Synplicity's retiming options during these comparisons, and are in the process of evaluating how the comparison changes when we use these options. While one might guess that these optimizations would reduce Quartus' physical synthesis upside, register retiming is only one of the many algorithms employed in Quartus physical synthesis and is responsible for a very small part of the +39% performance advantage we see.

I'm told that ISE also offers some sort of retiming option during synthesis with XST. We find that using XST yields much worse Xilinx results (which make us look much better), so we do not use XST, and hence do not use that retiming option.

Block Performance
^^^^^^^^^^^^^^^^^

Our benchmarking results address overall performance across real designs. These designs contain RAMs, DSP/MAC/Multipliers, adders, counters, and other such building blocks in a large variety of sizes and varying quantities. We do not claim that Stratix II is 39% faster on all building blocks, but rather that when you put it all together Stratix II is 39% faster.

Why is this? Fundamentally, the logic and routing of Stratix II are significantly faster -- and you need logic & routing to stitch together the blocks. Also, critical paths often start or end on a RAM/DSP, and are very rarely just a RAM/DSP toggling in isolation. The timing microparameters of the RAM/DSP are quite comparable between the two families. According to the Virtex-4 data sheet, the DSP microparameters are faster in the -12 device, and we will certainly rerun the analysis when Xilinx releases software that enables this fastest speed grade.

Our Fmax limit is not simply just 1/Tco. The block toggle rate limits imposed by Quartus II are selected based on characterization to guarantee operation of our devices in all environments, under all noise and switching conditions. When you clock a block very quickly, you start getting interesting effects that can affect operation. As we complete the characterization of hard IP blocks, we will raise these limits. The Quartus II 4.2 software introduces higher Fmax limits than stated in this table, and further increases are likely in future software releases.
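One way to read the point about toggle-rate limits: the clock rate reported for a design is bounded both by its slowest register-to-register path and by the characterized limit of every hard block on that clock. The sketch below assumes that simple "minimum of the two" model; the function name and all numbers are hypothetical and are not taken from Quartus II or any data sheet.

```python
from typing import Sequence


def reported_fmax_mhz(critical_path_ns: float,
                      block_limits_mhz: Sequence[float]) -> float:
    """Clock rate bounded by path delay and by block toggle-rate limits."""
    if critical_path_ns <= 0.0:
        raise ValueError("critical path delay must be positive")
    path_fmax_mhz = 1000.0 / critical_path_ns  # period in ns -> MHz
    return min([path_fmax_mhz, *block_limits_mhz])


if __name__ == "__main__":
    # Hypothetical design: 2.0 ns critical path, one block capped at 400 MHz.
    print(reported_fmax_mhz(2.0, [400.0]))
```

Under this model, raising a block's limit only helps designs whose paths were already faster than that limit, which is consistent with the argument that fabric speed usually matters more.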

Speed Grades
^^^^^^^^^^^^

I believe we have addressed this in numerous forums. We use the available speed grades in the software. We can't compare to something we can't get our hands on. Users can derate our +39% average performance result by the difference between our fastest and medium speed grade to get a flavour for how things will compare if & when a fast Virtex-4 speed grade is made available.
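The derating suggested above can be worked through as simple ratio arithmetic. This sketch assumes one reading of "derate": if a future speed grade makes the competing device faster by some fraction, the relative advantage shrinks by that ratio. The 15% speed-grade gain used in the example is purely hypothetical, since the actual figure is not known.

```python
def derated_advantage(advantage: float, competitor_speedup: float) -> float:
    """Relative advantage left after the competitor gains `competitor_speedup`.

    Both arguments are fractions: 0.39 means 39% faster.
    """
    if competitor_speedup <= -1.0:
        raise ValueError("competitor_speedup must be greater than -1.0")
    return (1.0 + advantage) / (1.0 + competitor_speedup) - 1.0


if __name__ == "__main__":
    # Hypothetical: the faster speed grade turns out to be 15% quicker.
    remaining = derated_advantage(0.39, 0.15)
    print(f"remaining advantage: {remaining:.1%}")
```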

Regards,

Paul Leventis

Altera Corp.

- Paul Leventis (at home)

Thu, Dec 9, 2004 1:32 AM

[In my attempt to text-format my first posting, I somehow double spaced it... weird. If I fail this time, my descent into management is complete.]

I would like to offer some clarification of points raised in this whitepaper, first in summary and then in some detail. I will occasionally refer to our web-based performance seminar

formatting link

for further details.

Constraints. The clock constraint methodology we employ matches that outlined in the whitepaper. It is good to see that both companies can agree on something!

High-Effort Compiles. We run the ISE software in the mode that yields the highest results across our benchmark set. We also run a seed sweep ("multi-pass") for ISE at the end of the process.

Retiming. ISE does not offer physical synthesis during place and route. Quartus II does. We do not use XST (and hence XST retiming) since we find this results in a far greater disadvantage for Xilinx than when we use a common synthesis tool (Synplicity in this case).

Block Performance. Maximum block toggle rates are pretty worthless if the fabric that stitches the blocks together can't keep up. Our design set includes a variety of types of resources including RAMs and DSPs, yet yields +39% performance advantage. Why? Our blocks have comparable propagation delays which it turns out matters more, and our logic & routing are substantially faster. Also, our Fmax limits have increased in Quartus II 4.2 and will continue to increase as we complete our detailed characterization process.

Design entry. Good advice that applies to any modern FPGA (Stratix II and Virtex-4).

Speed Grades. We compare to what's available in the software. If users know how much faster a -12 device will be (we do not), they can derate our 39% average performance advantage accordingly.

Clock Constraints
^^^^^^^^^^^^^^^^^

We appear to agree on how to constrain clocks.

For synthesis, we employ the flow suggested by Synplify to optimize multiple clock designs. This results in optimization of all clock domains. Are there other ways to do it? Probably -- but since Synplify Pro 7.7 is a common denominator in our comparisons, it is hard to see how changing this would affect the 39% average performance advantage that we see for Stratix II.

For ISE, as outlined in the web-seminar (slide #9) and other locations, we constrain each clock independently and iterate to find the best such (tight) constraints. As you suggest, we do not look at paths that cross clock domains (difficult to do in an apples-to-apples way). We do not over-constrain ISE, as we have found this degrades Xilinx performance. Slide #10 shows the results of the iterative constraint process for one design (with two clocks); I think it highlights the rigour and correctness of this process.

I should point out that for Quartus II, we don't need to jump through hoops, since applying a global 1 GHz constraint on the clocks will result in each clock being optimized as best as possible.

Synthesis/P&R Effort
^^^^^^^^^^^^^^^^^^^^

On the P&R front, we use the ISE settings that yield the best performance results across our benchmark set. We also run a seed-sweep (or "multi-pass" compile) using ISE at the end of our iterative process.

For synthesis, we have no reason to believe that enabling a high-effort mode in Synplicity would change the conclusions of our comparison, since we are using the same synthesis tool for both Stratix II and Virtex-4.

Register Retiming/Physical Synthesis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Quartus II can perform physical synthesis optimizations during place-and-route. These algorithms have access to detailed placement and timing information, enabling further optimization that synthesis just can't know about. ISE does not provide any such optimizations. Note: We always include Tco and Tsu constraints, so our re-timer will not violate I/O timing to improve core speed.

We did not use Synplicity's retiming options during these comparisons, and are in the process of evaluating how the comparison changes when we use these options. While one might guess that these optimizations would reduce Quartus' physical synthesis upside, register retiming is only one of the many algorithms employed in Quartus physical synthesis and is responsible for a very small part of the +39% performance advantage we see.

I'm told that ISE also offers some sort of retiming option during synthesis with XST. We find that using XST yields much worse Xilinx results (which make us look much better), so we do not use XST, and hence do not use that retiming option.

Block Performance
^^^^^^^^^^^^^^^^^

Our benchmarking results address overall performance across real designs. These designs contain RAMs, DSP/MAC/Multipliers, adders, counters, and other such building blocks in a large variety of sizes and varying quantities. We do not claim that Stratix II is 39% faster on all building blocks, but rather that when you put it all together Stratix II is 39% faster.

Why is this? Fundamentally, the logic and routing of Stratix II are significantly faster -- and you need logic & routing to stitch together the blocks. Also, critical paths often start or end on a RAM/DSP, and are very rarely just a RAM/DSP toggling in isolation. The timing microparameters of the RAM/DSP are quite comparable between the two families. According to the Virtex-4 data sheet, the DSP microparameters are faster in the -12 device, and we will certainly rerun the analysis when Xilinx releases software that enables this fastest speed grade.

Our Fmax limit is not simply just 1/Tco. The block toggle rate limits imposed by Quartus II are selected based on characterization to guarantee operation of our devices in all environments, under all noise and switching conditions. When you clock a block very quickly, you start getting interesting effects that can affect operation. As we complete the characterization of hard IP blocks, we will raise these limits. The Quartus II 4.2 software introduces higher Fmax limits than stated in this table, and further increases are likely in future software releases.

Speed Grades
^^^^^^^^^^^^

I believe we have addressed this in numerous forums. We use the available speed grades in the software. We can't compare to something we can't get our hands on. Users can derate our +39% average performance result by the difference between our fastest and medium speed grade to get a flavour for how things will compare if & when a fast Virtex-4 speed grade is made available.

Regards,

Paul Leventis

Altera Corp.