Fastest ISE Compile PC?

Has anyone recently done any benchmarking of Windows PCs for Xilinx ISE compiles?

Is ISE multithreaded? Can it use multiple processors (or cores)? Do big CPU caches help?

Regards,
Marc
P4 3 GHz HT, 2 GB DDR2-533 RAM

marc_ely

I'm not sure whether Quartus II or ISE is multithreaded, but I wasn't impressed with the first-generation dual-core systems. My company provided me with a dual-core system for development work.

When my system wasn't living up to my expectations, I did a little research. The MS Windows performance meter and some third-party tools showed little activity most of the time. When there was activity, it was about 85% or better, occasionally pegged. I did some other poking around on my system and discovered that they had cheaped out on the graphics card and hard drives.

My advice, which is part experience and part gut feeling: compare the prices of the Extreme and dual-core chips, and if the difference is negligible, pick the one with the faster front-side bus (FSB). The second item I'd look into is a caching SATA controller that supports mirroring, plus some really fast hard drives. Avoid striping the drives, as there is a performance hit, but try mirroring. From my observations, the development tools I use are mostly memory and hard drive bound. When you compile and PAR your design a fast CPU is beneficial, but the tools are also working with a lot of supporting files and storing/retrieving information from memory.

In the past, most users have reported the biggest benefits from more and faster memory.

Derek

Derek Simmons

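> Has anyone recently done any benchmarking of Windows PC's for Xilinx ISE Compiles?

No, but I have for Quartus, which is very similar.

> Is ISE multithreaded?

No, and not for a while to come.

> Can it use multiple processors (or cores)?

Nope.

> Do big CPU caches help?

Oh yeah, but once you have that, core frequency is all that matters.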

I recently went from an Athlon 64 2.0 GHz/1 MiB L2$ to an E6600 Core 2 Duo 2.4 GHz/4 MiB L2$. For my benchmark, the time for Synth/P&R went from 12m34/33m40 to ~6m/~15m, thus more than double the P&R performance. When overclocked to 3.3 GHz the result scaled to 5m54/11m12, thus 3X the P&R performance. Other experiments confirm that it scales linearly with frequency (assuming memory scales equally).
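As a quick check of the arithmetic behind those P&R figures, here is a small sketch that converts the quoted times to seconds and compares the measured scaling with the clock ratio. The helper name `to_seconds` is made up for illustration, and the ~15m figure is approximate, so the results are only indicative.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds

# P&R times quoted in the post; the ~15m figure is approximate.
par_athlon = to_seconds(33, 40)   # Athlon 64 at 2.0 GHz
par_c2d_stock = to_seconds(15)    # E6600 at 2.4 GHz (approximate)
par_c2d_oc = to_seconds(11, 12)   # E6600 overclocked to 3.3 GHz

speedup_stock = par_athlon / par_c2d_stock
speedup_oc = par_athlon / par_c2d_oc

# If P&R scaled perfectly with clock, this ratio would equal 3.3 / 2.4.
clock_ratio = 3.3 / 2.4
measured_ratio = par_c2d_stock / par_c2d_oc

print(f"stock speedup over the Athlon 64: {speedup_stock:.2f}x")
print(f"overclocked speedup over the Athlon 64: {speedup_oc:.2f}x")
print(f"clock ratio {clock_ratio:.3f} vs measured P&R ratio {measured_ratio:.3f}")
```

The measured P&R ratio comes out slightly below the clock ratio, which is consistent with "roughly linear" given how rough the 15-minute figure is.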

I have expensive memory, but from my experiments the benchmark results showed very little sensitivity to memory bandwidth and latency.

The 4 MiB Core 2 Duo is a very fast chip for FPGA work, probably the fastest x86 available, but it's still not fast enough to reduce the compilation times to an acceptable level.

Tommy

Tommy Thorn

Hi Tommy

Thanks for the info. Yes, I found your posts about 2 minutes after I sent mine out (after searching and finding nothing current). That's the problem with info on the web... it's often out of date, and finding the right stuff can be like finding a needle in a haystack.

I think I will go for a Core 2 Duo with 4 MB of cache.

Marc

marc_ely

Could you give us some info on what the disk subsystems look like for each machine? (IDE or SATA, what speed, any RAID, etc.)

What do you think explains the lack of change in synthesis time when the C2D went from 2.4 GHz to 3.3 GHz?

mk

I could, but it would be misleading, as it's completely irrelevant to the posted numbers. The benchmark is operating almost exclusively out of the buffer cache, and even then it's not reading that much data.

That said, for everything else disk latency matters a lot, so I used a single 150 GB SATA Raptor (10,000 RPM) in the new box. The old box had a quiet, average-speed Samsung PATA drive (7,200 RPM).

My measurements were too informal. There is a change in synthesis time, just not as substantial. I'd need to study this more closely to understand what's going on.

Tommy

Tommy Thorn

My system has arrived and I did a quick benchmark:

my lab-system: P4, 2.6 GHz, 2 GBytes RAM
another system: P4, 3 GHz, 2 GBytes RAM
my new machine: Core 2 Duo E6700, 2 GBytes RAM, with Asus P5LD2 Deluxe

a full run with ISE 6.3 (from synthesize to bitgen) with a recent design takes:

my lab-system: 30 minutes
another system: 28 minutes
my new machine: 14 minutes

I would say it is worth the money and I guess we'll buy some more of those machines ...

bye, Michael

Michael Schöberl

I took the plunge and built up a 2nd PC using a Core 2 Duo.

Here are the specs:
Old PC: P4 3 GHz HT, 2 GB DDR2-533 RAM, Gigabyte GA81915 mobo, stock cooler
New PC: Core 2 Duo E6600, 2 GB DDR2-800 RAM, ASUS P5B mobo, ArcticFreezer7 cooler

Using a Spartan3 design running clean from scratch in ISE 8.2.3i:
Old PC: 82 mins
New PC: 35 mins
New PC (overclocked to 3.2 GHz): 25 mins

I'm really pleased with the Core 2 Duo and would recommend it.

Marc

marc_ely

While the Core 2 Duo looks like the thing to get right now, on the disk side I'd be interested to know whether the new IDE Flash drives that go up to 32 GB are any use as a replacement for high-RPM drives.

The only reviews I have seen (Tom's, IIRC) show that they obviously have much lower latency but not yet much throughput, around 30 MBytes/sec, though at least the ms delays should now be µs delays. At this stage I wouldn't be concerned about wear-out, as I expect these things to get replaced sooner or later. Flash prices seem to be falling much faster than DRAM prices now, and the throughput is bound to get closer to PATA maximum rates.

Just a thought,
John Jakson

JJ

So is the conclusion that dual cores (multiprocessing) benefit Xilinx ISE substantially?

pbdelete

No, cache size matters.... As far as I know, neither ISE nor Quartus uses the second core, but both benefit from the huge cache.

Thomas

Thomas Entner

Not just the regular L2 cache: I suspect the TLB (the address translation cache) matters even more, though it is harder to characterize and explain. When the data set is still beyond even the bigger combined cache of a dual core, the increased associativity of the bigger TLB kicks in to reduce the incidence of TLB refills from the page tables, which can blow ns cache hits into accesses of several 100 ns on a full cache miss.

I ran a test on an older 2 GHz Athlon XP 2400 and a 2.6 GHz D805, using a loop that just randomly accesses ints from a 512 MB array. A mask controls the range of addresses, from 256 ints up to the 128M maximum, and for each range the loop runs 1M times.

I believe this represents the worst possible behaviour of any CAD application that must traverse huge graphs or trees that can not fit cache but easily fit DRAM.
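The test loop described above can be sketched roughly as follows. This is not the original test program: the function name `masked_random_walk`, the linear congruential generator and the smaller default array are assumptions made for illustration. In Python the interpreter overhead per iteration hides most of the cache and TLB effect, so a compiled version would be needed to reproduce ns-level figures; the sketch only shows the access pattern.

```python
import time
from array import array

def masked_random_walk(data, mask, iterations=1_000_000, seed=12345):
    """Read `iterations` pseudo-random elements of `data`, restricted to
    indices 0..mask, and return (checksum, nanoseconds per iteration)."""
    state = seed
    total = 0
    start = time.perf_counter()
    for _ in range(iterations):
        # 32-bit linear congruential generator (an assumed choice of RNG).
        state = (state * 1664525 + 1013904223) & 0xFFFFFFFF
        # Drop the weakest low bits, then limit the address range with the mask.
        total += data[(state >> 5) & mask]
    elapsed = time.perf_counter() - start
    return total, elapsed / iterations * 1e9

if __name__ == "__main__":
    n = 1 << 22                   # 4M ints here; the post used 128M ints (512 MB)
    data = array("i", [1]) * n    # array length must be a power of two
    mask = 255                    # start with a range of 256 ints
    while mask < n:
        _, ns = masked_random_walk(data, mask, iterations=200_000)
        print(f"range {mask + 1:>9} ints: {ns:7.1f} ns/iteration")
        mask = (mask << 2) | 3    # widen the range by 2 address bits
```

The checksum is returned only so that the reads cannot be optimized away in a compiled port of the same loop.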

The D805 generally runs 30% faster, as the clock suggests, while the tests are entirely cache bound. The Athlon has 256K of L2 with 256 ways in the TLB, while the D805 has 1 MB of L2 in each core, and I expect its TLB has 1k ways of associativity. Only one core is used. I expect the Core 2 Duo or 64-bit Athlons to perform somewhat better.

For the in-cache cases the loop iterates in 7 ns on the D805 versus 10 ns on the XP 2400. As the range of addresses increases past 64K, the Athlon staircases to 60 ns, then out around 2M degrades to 80-150 ns, and at the 128M range settles at 400 ns per iteration, against the original 10 ns, or 40 times slower to crawl memory.

The D805 fares somewhat better: it tolerates another 2 bits of address, but degrades to 60 ns at the 256K level and then reaches 130 ns at the 128M level. In other words, when the L2 cache always misses, the D805 spends far less time patching up the TLB and MMU page tables.

The D805 runs Windows 2000 with 1 GB of DDR400 and the Athlon runs BeOS on 1 GB of DDR266, but that's not really important.

The conclusion is that paying for bigger TLBs is probably far better than paying for more CPUs, since it keeps the uniprocessor closer to its ideal performance for code with poor locality of reference. Adding more cores probably makes things worse, as the quad core shows, unless the code is really multithreaded.

John Jakson
transputer guy

JJ

I'm sure the second core will make a difference - while the one long task is occupying one core, other minor tasks will run on the other core. While these other tasks might only take a tiny proportion of the processor time, you avoid the penalties of task switching (like losing your cache) on the working processor.

David Brown

That assumes you set the thread affinity for the long task. If you watch top on Linux or Task Manager on Windows XP or Vista, you will see that the task consuming 99.9% of the CPU is migrated from CPU to CPU quite frequently. I am not sure why the scheduler of either OS does this.
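For illustration, pinning a process to one CPU can be sketched as below. This uses Python's `os.sched_setaffinity`, which exists on Linux but not on Windows (there, Task Manager's "Set Affinity" menu item does the equivalent by hand). The helper name `pin_to_cpu` is made up for this example, and whether pinning actually helps a long compile is not something this thread measured.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Restrict process `pid` (0 = the calling process) to one logical CPU.

    Returns the resulting affinity set, or None on platforms that lack
    os.sched_setaffinity (Windows, for example).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
        print("allowed CPUs before:", allowed)
        # Pin to the first CPU this process is already allowed to use.
        print("allowed CPUs after: ", sorted(pin_to_cpu(allowed[0])))
    else:
        print("os.sched_setaffinity is not available on this platform")
```

Child processes inherit the affinity mask, so a tool launched from a pinned shell or script stays on the chosen CPU.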

mk

Relevant to several recent threads, Altera just announced their Stratix III and with it Quartus 6.1 of which the first bullet item is:

"Multiprocessor support: Allowing parallel processing during compilation for computers with multiple processors results in a reduction in compile times. Quartus II software offers the first multiprocessor support from an FPGA vendor to take advantage of the new multiple-core processors."

The actual software is available *now* (according to the press release). Trying to get it reveals that *now* is really December 4th :-)

I look forward to seeing how it scales with multiple cores.

Tommy

Tommy Thorn

Hi Tommy,

On two cores we've seen between 1.6X and 1.9X the performance (depending on the algorithm) for the parallelized sections of code, yielding up to a 20% compile time reduction. Adding more cores gives you big speed-ups on those portions of code -- but Amdahl's Law kicks in pretty fast. The remaining single-threaded algorithms become a larger portion of the run-time as you add processors, diminishing the overall returns.
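To make the Amdahl's Law point concrete, here is a small sketch that works backwards from the figures above. It assumes the "up to 20%" reduction goes with the 1.9X section speedup, which the post does not state explicitly, and the larger section speedups in the loop are hypothetical, not measurements. The function names are made up for illustration.

```python
def amdahl_time_fraction(parallel_fraction, section_speedup):
    """Run time relative to the single-threaded run (Amdahl's Law)."""
    return (1.0 - parallel_fraction) + parallel_fraction / section_speedup

def implied_parallel_fraction(time_reduction, section_speedup):
    """Parallel fraction p that yields the given overall time reduction.

    Solves 1 - ((1 - p) + p / s) = time_reduction for p.
    """
    return time_reduction / (1.0 - 1.0 / section_speedup)

# Assumed pairing: 1.9X on the parallel sections gives the 20% reduction.
p = implied_parallel_fraction(0.20, 1.9)
print(f"implied parallel fraction of run time: {p:.2f}")

for speedup in (1.9, 4.0, 8.0, float("inf")):
    # Speedups beyond 1.9X are hypothetical, to show the diminishing returns.
    reduction = 1.0 - amdahl_time_fraction(p, speedup)
    print(f"section speedup {speedup:>4}: {reduction:.0%} less compile time")
```

Under that assumption roughly 42% of the run time is parallelized, so even an unlimited number of cores could not cut the compile time by more than that share until more of the tool is parallelized.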

FPGAs are getting bigger faster than CPUs are getting faster; this has been true for a long time. Without innovation in the software, compile times would grow with each generation. Thankfully, we've been able to close this gap, and even improve our run-time (and memory consumption) over time. Multi-cores is just the next step in this evolution. Modern CAD systems such as Quartus II contain numerous algorithms, all of which contribute significantly to the run-time of the system. Each algorithm presents its own challenges for parallelization (if that's a word). Over time, as we parallelize more and more of the tool, the benefits and scalability will increase.

Memory consumption is also a challenge as FPGAs continue to scale in size. Keeping memory use in check yields many benefits -- cheaper machines, sticking with 32-bit OSes, and better cache locality (and hence run-time). You'll find QII 6.1 (even for Stratix III) performs well on this metric too.

Customers can get the software today via their local Altera sales representative or distributor sales office. General/full availability is December 4th, as you've indicated.

Regards,

Paul Leventis
Altera Corp.

Paul Leventis

I came across the posting for the Stratix III the other day on their website. Short of putting engineering samples in everybody's hands, you'd think they would want to coordinate the release of the new version of Quartus II with the announcement of the new devices, so that engineers can see how their designs fare in the new software and devices.

I only had a few minutes to look at the website, but it looks like they have made the new devices more granular and doubled their frequency.

I am a Quartus II user and my sales rep, Linda, has always done a good job of getting me a copy of the software. So, I have one more thing to look forward to in December.

Derek

Derek Simmons

I started using a Mac Pro a few weeks ago - dual Core 2 Duo Xeons, 2 GB RAM, running XP SP2. Although ISE isn't multi-threaded, I found a use for the 2nd processor yesterday - I ran a second instance of ISE. I'm working on a multi-chip design, and I synthesized one project while routing a second project. I set the affinity so that they executed on different processors (at least I think they were on different processors). I didn't benchmark the execution speed, but the time didn't seem out of line.
--
Joe Samson
Pixel Velocity
Joseph Samson
