All,
For anyone interested in how V4 really stacks up.
Austin
Stacks up to what? FPGA-90 is no product that I am aware of. Why can't Xilinx use the name of the competition part? Otherwise this is a pretty pointless paper.
--
Rick "rickman" Collins
snipped-for-privacy@XYarius.com Ignore the reply address. To email me use the above address with the XY removed.
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design
Recheck Table 2. The VHDL code is swapped.
Kolja Sulimma
Hi Austin,
I just had a quick look, and there seems to be a mistake in table 2, p.5 (Verilog descriptions should be swapped for one stage vs two stage pipeline).
Regards,
Steven
There was one good pointer in the above Xilinx white paper! It's on page 6.
And yes, it looks like Stratix just got a new name: "FPGA-90nm"! LOL, if "FPGA-90nm" is now a reference/alias for Altera Stratix, then it's a good ad for them! Or?
Antti
Symon,
Checking.....
Yes,
Code is swapped in the table.
Will be fixed shortly.
Thank you to all who caught it.
It is not supposed to be a test!
I would like to offer some clarification of points raised in this whitepaper, first in summary and then in some detail. I will occasionally refer to our web-based performance seminar
that outlined in the whitepaper. It is good to see that both
companies can agree on something!
yields the highest results across our benchmark set. We also run a
seed sweep ("multi-pass") for ISE at the end of the process.
route. Quartus II does. We do not use XST (and hence XST
retiming) since we find this results in a far greater disadvantage
for Xilinx than when we use a common synthesis tool (Synplicity in
this case).
if the fabric that stitches the blocks together can't keep up. Our
design set includes a variety of types of resources including RAMs
and DSPs, yet yields +39% performance advantage. Why? Our blocks
have comparable propagation delays which it turns out matters more,
and our logic & routing are substantially faster. Also, our Fmax
limits have increased in Quartus II 4.2 and will continue to
increase as we complete our detailed characterization process.
(Stratix II and Virtex-4).
users know how much faster a -12 device will be (we do not), they
can derate our 39% average performance advantage accordingly.
Clock Constraints
^^^^^^^^^^^^^^^^^
We appear to agree on how to constrain clocks.
For synthesis, we employ the flow suggested by Synplify to optimize multiple clock designs. This results in optimization of all clock domains. Are there other ways to do it? Probably -- but since Synplify Pro 7.7 is a common denominator in our comparisons, it is hard to see how changing this would affect the 39% average performance advantage that we see for Stratix II.
For ISE, as outlined in the web-seminar (slide #9) and other locations, we constrain each clock independently and iterate to find the best such (tight) constraints. As you suggest, we do not look at paths that cross clock domains (difficult to do in an apples-to-apples way). We do not over-constrain ISE, as we have found this degrades Xilinx performance. Slide #10 shows the results of the iterative constraint process for one design (with two clocks); I think it highlights the rigour and correctness of this process.
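To make the per-clock iteration above concrete, here is a minimal Python sketch of one way such a loop could look. It is not the actual benchmarking script: `run_pnr`, `toy_pnr`, the clock names, and the stopping rule are all hypothetical stand-ins, and a real flow would invoke the vendor tools where the toy model is used.

```python
from typing import Callable, Dict

def tighten_clock_constraints(
    run_pnr: Callable[[Dict[str, float]], Dict[str, float]],
    initial_mhz: Dict[str, float],
    max_iters: int = 10,
    tol_mhz: float = 0.5,
) -> Dict[str, float]:
    """Iterate per-clock constraints toward what the tool actually achieves.

    run_pnr is a hypothetical callback: it takes {clock: constraint in MHz}
    and returns {clock: achieved Fmax in MHz} for one place-and-route run.
    """
    constraints = dict(initial_mhz)
    best = {clk: 0.0 for clk in constraints}
    for _ in range(max_iters):
        achieved = run_pnr(constraints)
        improved = False
        for clk, fmax in achieved.items():
            if fmax > best[clk] + tol_mhz:
                best[clk] = fmax
                improved = True
            # Re-target each clock at its own best result so far:
            # a tight constraint, but never beyond what was achieved.
            constraints[clk] = best[clk]
        if not improved:
            break
    return best

# Toy stand-in for a P&R run (assumption, for illustration only): each
# clock gains at most 20 MHz over its constraint, up to a fixed ceiling.
CEILING_MHZ = {"clk_a": 150.0, "clk_b": 230.0}

def toy_pnr(constraints: Dict[str, float]) -> Dict[str, float]:
    return {clk: min(CEILING_MHZ[clk], mhz + 20.0) for clk, mhz in constraints.items()}

if __name__ == "__main__":
    print(tighten_clock_constraints(toy_pnr, {"clk_a": 100.0, "clk_b": 100.0}))
```

Each clock is tracked separately, so a slow domain that has already converged does not hold back a faster one.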
I should point out that for Quartus II, we don't need to jump through hoops, since applying a global 1 GHz constraint on the clocks will result in each clock being optimized as best as possible.
Synthesis/P&R Effort
^^^^^^^^^^^^^^^^^^^^
On the P&R front, we use the ISE settings that yield the best performance results across our benchmark set. We also run a seed-sweep (or "multi-pass" compile) using ISE at the end of our iterative process.
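As a rough illustration of the seed sweep mentioned above, the following Python sketch runs a set of seeds and keeps the best result. The callback `run_pnr_with_seed` is a hypothetical placeholder for one place-and-route run at a given seed; it is not an ISE or Quartus II API.

```python
from typing import Callable, Iterable, Tuple

def seed_sweep(
    run_pnr_with_seed: Callable[[int], float],
    seeds: Iterable[int],
) -> Tuple[int, float]:
    """Return (seed, fmax_mhz) for the best-performing seed.

    run_pnr_with_seed is a hypothetical callback that performs one
    place-and-route run with the given seed and returns achieved Fmax.
    Raises ValueError if no seeds are supplied.
    """
    results = [(seed, run_pnr_with_seed(seed)) for seed in seeds]
    return max(results, key=lambda r: r[1])
```

The sweep only changes the starting point of placement, so the spread between seeds gives a feel for how much of a measured difference is noise.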
For synthesis, we have no reason to believe that enabling a high-effort mode in Synplicity would change the conclusions of our comparison, since we are using the same synthesis tool for both Stratix II and Virtex-4.
Register Retiming/Physical Synthesis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Quartus II can perform physical synthesis optimizations during place-and-route. These algorithms have access to detailed placement and timing information, enabling further optimization that synthesis just can't know about. ISE does not provide any such optimizations. Note: We always include Tco and Tsu constraints, so our re-timer will not violate I/O timing to improve core speed.
We did not use Synplicity's retiming options during these comparisons, and are in the process of evaluating how the comparison changes when we use these options. While one might guess that these optimizations would reduce Quartus' physical synthesis upside, register retiming is only one of the many algorithms employed in Quartus physical synthesis and is responsible for a very small part of the +39% performance advantage we see.
I'm told that ISE also offers some sort of retiming option during synthesis with XST. We find that using XST yields much worse Xilinx results (which make us look much better), so we do not use XST, and hence do not use that retiming option.
Block Performance
^^^^^^^^^^^^^^^^^
Our benchmarking results address overall performance across real designs. These designs contain RAMs, DSP/MAC/Multipliers, adders, counters, and other such building blocks in a large variety of sizes and varying quantities. We do not claim that Stratix II is 39% faster on all building blocks, but rather that when you put it all together Stratix II is 39% faster.
Why is this? Fundamentally, the logic and routing of Stratix II are significantly faster -- and you need logic & routing to stitch together the blocks. Also, critical paths often start or end on a RAM/DSP, and are very rarely just a RAM/DSP toggling in isolation. The timing microparameters of the RAM/DSP are quite comparable between the two families. According to the Virtex-4 data sheet, the DSP microparameters are faster in the -12 device, and we will certainly rerun the analysis when Xilinx releases software that enables this fastest speed grade.
Our Fmax limit is not simply 1/Tco. The block toggle rate limits imposed by Quartus II are selected based on characterization to guarantee operation of our devices in all environments, under all noise and switching conditions. When you clock a block very quickly, you start getting interesting effects that can affect operation. As we complete the characterization of hard IP blocks, we will raise these limits. The Quartus II 4.2 software introduces higher Fmax limits than stated in this table, and further increases are likely in future software releases.
Speed Grades
^^^^^^^^^^^^
I believe we have addressed this in numerous forums. We use the available speed grades in the software. We can't compare to something we can't get our hands on. Users can derate our +39% average performance result by the difference between our fastest and medium speed grade to get a flavour for how things will compare if & when a fast Virtex-4 speed grade is made available.
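For readers who want to try the derating themselves, here is a small Python sketch of one way to read it: treat the speed-grade gap as a multiplicative speed-up for the other device and divide it out. The 15% figure in the example is purely hypothetical, not a measured or published speed-grade difference.

```python
def derated_advantage(reported_advantage: float, speed_grade_gain: float) -> float:
    """Relative advantage left after the other device speeds up.

    reported_advantage: e.g. 0.39 for the +39% average quoted above.
    speed_grade_gain: assumed fractional speed-up from a faster speed
    grade (hypothetical input; substitute your own estimate).
    """
    return (1.0 + reported_advantage) / (1.0 + speed_grade_gain) - 1.0

if __name__ == "__main__":
    # Hypothetical 15% speed-grade gain applied to the +39% average.
    print(f"{derated_advantage(0.39, 0.15):.1%}")
```

Under that assumed 15% gain, a +39% average advantage becomes roughly +21%; simply subtracting the two percentages would give a slightly different figure.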
Regards,
Paul Leventis
Altera Corp.