Best CPU platform(s) for FPGA synthesis

(snip)

It might be because more registers are available, and IA32 code is register starved.

-- glen

Reply to
glen herrmannsfeldt

Hi JJ,

Here is a rather long but detailed reply to your questions courtesy of Adrian, one of our parallel compile experts.

You were correct in guessing that quartus_fit includes floating-point operations, but as other posters here have noted, memory accesses are easily as important in terms of runtime, if not more so. By contrast, quartus_sta is dominated by integer operations and memory accesses. Incidentally, this is why quartus_fit will produce a different fit on different OS's while quartus_sta will not: integer operations are exact across all platforms, but the compilers optimize floating-point operations differently between Windows and Linux, which results in a different fit.

Quartus II's new NUM_PARALLEL_PROCESSORS setting is required to enable any kind of parallel compilation. We do not offer any support for HyperThreaded processors and actually recommend our users disable HyperThreading in the BIOS, as it can decrease memory-system performance even for a normal, non-parallel compilation. By contrast, multi-core machines yield good results. If you have an Intel Core 2 Duo, for example, you'd set NUM_PARALLEL_PROCESSORS to 2. If you have two dual-core Opterons, you'd set it to 4, and so on.
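For reference, the assignment lives in the project's settings file; a minimal sketch, assuming a project whose settings file is named myproject.qsf (the file name is hypothetical):

```tcl
# In myproject.qsf -- allow Quartus II to use two processors/cores.
# Use 2 for a Core 2 Duo, 4 for two dual-core Opterons, and so on.
set_global_assignment -name NUM_PARALLEL_PROCESSORS 2
```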

Currently, some parts of quartus_fit, quartus_tan and quartus_sta can take advantage of parallel compilation, though the best improvement is usually in quartus_fit. Small designs and those with easy timing and routability constraints will typically not see much improvement, but larger and harder-to-fit circuits (the designs that need it the most!) can see substantial reductions. While the speedups are currently modest and nowhere near linear with the number of processors used, they have improved with every release since Quartus 6.1 and we plan to continue this in future releases.

We do not currently support additional parallel features during incremental compilation; i.e., different partitions will not be mapped and fit completely in parallel. The fitter will get as much benefit from parallel compilation as it would without any partitions.

One gotcha with parallel compilation is related to my first point about Quartus having lots of memory accesses. On some current systems, the memory system can become a significant bottleneck. For example, an Intel Core 2 Quad chip has two shared L2 caches, which enables very fast communication within the core pairs (1,2) and (3,4), but relatively slow communication across pairs, such as (1,3) or (2,4), since those memory requests must all share the front-side bus. In this case, setting NUM_PARALLEL_PROCESSORS to 4 may even give a worse result than setting it to 2 by forcing half the communication to take place over this slower FSB. Even with only two processors in use, the OS may sometimes schedule the processes on cores (1,3) or (2,4) unless you specify otherwise. Solutions to this problem can be found at

formatting link
Not all platforms are affected; you'll have to try it and see.
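One common workaround on Linux is to pin the compile to a specific pair of cores. A minimal sketch, assuming util-linux's taskset is available and that logical CPUs 0 and 1 share an L2 cache (verify the actual numbering with lscpu or /proc/cpuinfo first, since it varies by system); the echo below is a stand-in for the real quartus_fit invocation:

```shell
# Run a command restricted to logical CPUs 0 and 1, which are assumed here
# to be a Core 2 Quad pair sharing one L2 cache.
taskset -c 0,1 echo "fitter pinned to an L2-sharing core pair"
```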

At present, Quartus II supports a maximum of four processors (or cores), so half of your dual-Quad configuration will go unused. However, your intuition about leaving a processor free is correct; if you have a four-core system and set NUM_PARALLEL_PROCESSORS to 3, you will never see Quartus take more than 75% of your computer's CPU.

As for different OS's, the 32-bit Windows version of Quartus is a little faster than the Linux version; the differences are largely due to the quality and settings of the optimizing C compilers we use on these two platforms, and vary somewhat between the various Quartus executables. 64-bit versions of Quartus are slightly slower than 32-bit versions due to the increase in working-set size (memory) from 64-bit pointers; this in turn reduces cache hits and thus slows down the program. This behaviour is true of most 64-bit applications.

Note: You can run 32-bit Quartus on 64-bit Windows/Linux with no such performance penalty, and gain access to 4 GB of addressable memory. This should meet user needs for all but the largest and most complicated of Stratix III designs. See information on memory requirements of Quartus at

formatting link
Also, I've posted on this topic previously
formatting link


Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

As mentioned in the Anandtech article, there are stability issues with running in 3GB mode. We have seen these stability issues with Quartus on WinXP with 3GB mode. If you need more than 2 GB of memory for your Quartus executable, your best bet is to run 32-bit Quartus on (a) 32-bit or 64-bit Linux or (b) 64-bit Windows.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

Yes, turning on multiple CPU support (NUM_PARALLEL_PROCESSORS setting) will help :-)

It will also depend on whether this is a slow or fast compile. A toy design will see no speed-up, since its run time will be dominated by aspects of the compiler that are normally a small portion of run time -- reading databases from disk, setting up data structures, etc. It is only the key time-consuming algorithms that have been parallelized (and only some of them at that). Gains will be largest on large designs with complicated timing assignments.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

I'd recommend running them on MicroBlaze.. good opportunities for h/w acceleration ;-)

Jon

Reply to
Jon Beniston

Yo, Adrian! ;) and Paul and everyone else, that's some great info and is very much appreciated.

Since quartus_fit is dominating my runtime (EP2S180 and HC230), and quartus_fit gains the most from extra CPUs, it makes sense for me to go at least to 4 CPUs (I currently only have dual-processor boxes, thus the need to go shopping). Do you know if the HardCopyII fitter also makes use of multiple processors?

When Quartus does spawn jobs off to up to 4 processors, can each one of those spawned jobs use up to 4GB?

In the case of Quartus supporting a max of 4 processors, at the very least an 8-processor box would allow me to run two copies of Quartus at the same time (e.g., different designs, or different flavors of the same design). 8 processors on 64-bit Linux w/ 16GB of RAM with 32-bit Quartus would seem to be a well-balanced setup if most Quartus jobs remain under 2GB, correct?

Since memory access is such a big part of the overall runtime, obviously the faster memory buses on newer machines will help. (Good thing, because the clock speed difference alone from an Opteron 250 to a newer Opteron 2220 isn't much of an increase: 2.4 GHz to 2.8 GHz.)

Since the databases for big chips get so large (and memory accesses apparently so random), does a larger data cache buy you much? The L1 I&D caches are relatively small on both AMD and Intel, although Opteron's (64K instruction, 64K data) are 2x larger than Intel's.

For the L2 cache, Intel's is 2x larger than AMD's on a per-core basis. Since Intel shares two caches between neighboring cores (as you say, 1&2 or 3&4 can share quickly, but 1&3 or 2&4 are slow), whereas Opterons have a dedicated cache per core, would Opterons see a speedup from less contention for the cache, or a slowdown from having to go outside the local caches in order to share data? (I guess it's a function of how often the quartus_fit algorithms need to share data, right?)

If I were trying to run two Quartus jobs simultaneously on one 8-CPU machine (with NUM_PARALLEL_PROCESSORS = 4 for each run), I would expect competition for external memory to be huge, and thus statistically some benefit from Intel's larger cache. And with more "stuff" cached, the higher clock speeds on current Intel CPUs might give the runtime advantage to Intel. On the other hand, AMD has the Direct Connect Architecture and HyperTransport, so...

I know you vendor guys are reluctant to publish benchmark info, but from the currently-available, mainstream, small-server perspective with 8 processors, I'm kind of pushed toward the following CPU choices:

4 dual-core Opteron 2218's (2.6 GHz, 90nm process, 2MB L2 cache as 1MB dedicated per core)
4 dual-core Opteron 2220's (2.8 GHz, 90nm process, 2MB L2 cache as 1MB dedicated per core)
4 dual-core Intel 5160's (3.0 GHz, 65nm process, 1333 MHz FSB, 4MB shared L2 cache)
2 quad-core Intel X5355's (2.66 GHz, 65nm process, 1333 MHz FSB, 8MB L2 cache, shared 4MB per core pair)

Of those, is there an obvious bang for the buck advantage (weighted more toward bang than buck) for any one of those in particular?

------- P.S. Those QX6850's are hard to come by; Dell's overclocked XPS720's look sweet, but my company won't spring for overclocked boxes...

Thanks again, very very much!

Reply to
jjohnson


Is there such a setting for Xilinx ISE as well?

thx, -wei

Reply to
Wei Wang

Why only 3GB max of the 4GB?

Thanks, -Wei

Reply to
Wei Wang


Found similar memory recommendations for Xilinx's largest XC5VLX330 FPGA:

formatting link
Only Linux-64 machines are supported; memory recommendation: typical 7.2GB, peak 10.6GB.

Reply to
Wei Wang

This web page needs to be updated: NT64 is also supported, but runtime will be faster on Linux64, so that's what we recommend.

Steve

Reply to
<steve.lass

Hi Steve,

Could you give us (Xilinx users) some more detailed recommendations on what would be the best platform to run ISE/EDK tools when working on midsize to big designs? Tell us what you are using @ Xilinx? :)

Thanks, /Mikhail

Reply to
MM

I can give you some general recommendations. For the best place-and-route runtimes, use a 64-bit Linux system. If your design is small enough to fit into 4G of memory (LX110 or smaller), and you are not programming devices (the 32-bit cable drivers don't work on a 64-bit system), you can use the 32-bit executables to save memory. Otherwise, go ahead and use the 64-bit executables; they use more memory, and the runtime is similar.

As mentioned earlier, synthesis, map, place and route do not use multithreading, so you will not get an advantage using multiple processors for a single design. However, ProjNav is multithreaded so if you are doing different tasks, other processors will be used. In addition, upcoming software releases will use those processors.

Steve

Reply to
<steve.lass

Note that it works just fine to install 32-bit ISE on a 64-bit Linux system, and to install the 64-bit cable drivers.

In my experience, the open source user-space-only cable interface works far better than the Xilinx-supplied cable drivers anyhow:

formatting link

Reply to
Eric Smith

The short answer is that the upper 1GB is reserved for the kernel. If you want a bit more detail, you can look at, for example, the following article:

formatting link

/Andreas

Reply to
Andreas Ehliar


Is there a 64-bit version of EDK ? If not, can I mix 64 bit ISE with 32 bit EDK?

Thanks, /Mikhail

Reply to
MM

What I found was very interesting: it used to take me 12 hours to run the MAP process, but yesterday it only took me ~3 hours to run MAP, and PAR only took ~40 mins as well.

I was trying to figure out the reasons, then found in the *.map and *.mrp files that there was always one MAP phase which took a very long time (~10+ hours), and that phase was always very memory hungry. I was using Linux64 with 2GB real memory and 4GB swap, and the real 2GB of memory was much smaller than the required peak memory of 10.6GB. Yesterday, I ran ISE 9.1i for the XC5VLX330 on another Linux64 machine with 11G real memory and 8G swap, and there wasn't any MAP phase which took a ridiculous ~10+ hours.

Can Xilinx guys shed some more light on the runtime of MAP and PAR, wrt different memory sizes and CPU cores?

Reply to
Wei Wang

Polywell has some desktop computers with the QX6850 available. But since you're looking at an 8-way workstation (!), the QX6850 is probably not an option. Polywell has AMD or Intel workstations with the CPUs you're looking at as well.

For one socket, Intel clearly has the edge over AMD, I think. For multi-socket workstations/servers, however, I'm not so sure. Benchmarks are harder to find. I would suspect that the HyperTransport bus helps AMD close the gap with Intel a little. Their integrated memory controller probably helps as well in a multi-socket machine.

I searched for benchmarks for the newest 90-nm Opteron but couldn't find any unfortunately...

Patrick

Reply to
Patrick Dubois

Yes, that indeed would be great!

With my current design I found that timing-driven MAP either crashes or takes a very long time to complete (relative to PAR). Even more interesting, I get much better timing and much faster run times by disabling timing-driven mapping and the use of RLOC constraints in MAP...

/Mikhail

Reply to
MM

Even though our memory requirement table lists devices, memory is more dependent on the design and the timing constraints. Since we can't predict what is in your design, we just give you the typical and max numbers from our collected test cases.

One example of a constraint style that reduces memory: instead of creating a bunch of individual FROM-TO timespecs, you can create timegroups containing the endpoints, then put a single timespec on each group.
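As a rough sketch of that style in UCF syntax (the instance names and the 10 ns requirement here are hypothetical, not from the original post):

```
# Group the source and destination endpoints once via TNM...
INST "core/src_reg*" TNM = "src_grp";
INST "core/dst_reg*" TNM = "dst_grp";
# ...then one TIMESPEC covers every path between the groups, instead of
# one FROM:TO timespec per individual register pair.
TIMESPEC "TS_src2dst" = FROM "src_grp" TO "dst_grp" 10 ns;
```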

Also, ISE 9.2i is getting an average of 27% improvement in memory utilization.

I don't have any data regarding runtime of different CPU cores.

Steve

Reply to
<steve.lass
