Best CPU platform(s) for FPGA synthesis

OK, the questions apply primarily to FPGA synthesis (Altera Quartus
fitter for StratixII and HardCopyII), but I'm interested in feedback
regarding all EDA tools in general.


Context: I'm suffering some long Quartus runtimes on their biggest
StratixII and second-biggest HardCopyII device. Boss has given me
permission to order a new desktop/workstation/server. Immediate goal
is to speed up Quartus, but other long-term value considerations will
be taken into account.


True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Place and Route (quartus_fit) is mostly double-precision
floating-point?
Static Timing Analysis (TimeQuest) is mostly double-precision floating-
point?
RTL simulation is mostly integer operations?
SDF / gate-level simulation is mostly double-precision floating-point?


AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?
Assume cost is no object, and each uses its highest-performing memory
interface, but disk access is (a necessary evil) over a networked
drive. (Small % of total runtime anyway.)


Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
Windows? >2GB of RAM?
----------------------------------------------------------------------
Is Quartus (and the others) more efficient in any one particular
environment? I prefer Linux, but the OS is now secondary to pure
runtime performance (unless it is a major contributor). Can any of
them make use of more than 2GB of RAM? More than 4GB? Useful limit on
the number of processors/cores?


Any specific box recommendations?



Thanks a gig,

jj


Re: Best CPU platform(s) for FPGA synthesis
> Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Yes.

> Place and Route (quartus_fit) is mostly double-precision
> floating-point?
I don't know why they would use floating point if they don't have to.

> Static Timing Analysis (TimeQuest) is mostly double-precision
> floating-point?
I seriously doubt it.  I don't see a need for floating point there
when delays can use scaled integers.

> RTL simulation is mostly integer operations?
Yes.

> SDF / gate-level simulation is mostly double-precision floating-point?
No, or at least not in any implementation I am familiar with.  All the
delays are scaled up so that integers can be used for them.
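
To make the scaled-integer idea concrete, here is a toy sketch in C (an
illustration only, not any vendor's actual code; the 1 ps tick and all
names are made up):

    #include <stdint.h>
    #include <stdio.h>

    /* Delays kept as 64-bit integer picosecond counts: adds and
       compares are exact, with no rounding, unlike doubles. */
    typedef int64_t delay_ps;

    int main(void)
    {
        delay_ps clk_to_q = 312;    /* 0.312 ns, expressed in ps */
        delay_ps routing  = 1047;   /* 1.047 ns */
        delay_ps setup    = 189;    /* 0.189 ns */

        delay_ps path = clk_to_q + routing + setup;
        printf("path delay = %lld ps (%.3f ns)\n",
               (long long)path, path / 1000.0);
        return 0;
    }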

In simulation (assuming something with state-of-the-art performance),
the CPU operations themselves are not very important anyway.  It is
not compute-bound, it is memory-access-bound.  What you need is big
caches and fast access to memory for when the cache isn't big enough.


> Can any of them make use of more than 2GB of RAM? More than 4GB?

64-bit Linux can make use of more than 4GB of RAM.  But don't use 64-
bit executables unless your design is too big for 32-bit tools,
because they will run slower on the same machine.

> Useful limit on the number of processors/cores?

Most of these tools are not multi-threaded, so the only way you will
get a speedup is if you have multiple jobs at the same time.  Event-
driven simulation in particular is not amenable to multi-threading,
despite much wishful thinking for the last few decades.
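
The tight coupling is easy to see in even a toy version of the kernel
loop.  In the sketch below (all names invented, nothing from a real
simulator), every iteration pops the globally earliest event and may
schedule new ones, so iteration N+1 depends on iteration N:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy event kernel (illustration only).  A sorted array stands in
       for the real time wheel.  Each event handler may schedule new
       events, so every loop iteration depends on the one before it --
       that is the serial bottleneck. */
    #define MAX_EVENTS 64

    typedef struct { int64_t time_ps; int signal; int value; } event;

    static event queue[MAX_EVENTS];
    static int n_events = 0;

    static void schedule(int64_t t, int sig, int val)
    {
        int i = n_events++;                 /* toy: no overflow check */
        while (i > 0 && queue[i - 1].time_ps > t) {
            queue[i] = queue[i - 1];        /* insertion sort by time */
            i--;
        }
        queue[i] = (event){ t, sig, val };
    }

    int main(void)
    {
        schedule(0, 0, 1);              /* kick off a 2-inverter ring */
        while (n_events > 0) {
            event e = queue[0];         /* pop globally earliest event */
            for (int i = 1; i < n_events; i++)
                queue[i - 1] = queue[i];
            n_events--;

            printf("t=%lld ps: sig%d <= %d\n",
                   (long long)e.time_ps, e.signal, e.value);

            /* evaluate fanout: each inverter adds a 500 ps delay */
            if (e.time_ps < 5000)
                schedule(e.time_ps + 500, !e.signal, !e.value);
        }
        return 0;
    }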


Re: Best CPU platform(s) for FPGA synthesis

> I don't see a need for floating point there
> when delays can use scaled integers.

Dynamic range?

Cheers,
Jon



Re: Best CPU platform(s) for FPGA synthesis

> Dynamic range?

Not a likely problem. Even a 32bit int would be big enough for holding
up to a ridiculous 4.3 seconds, assuming 1psec resolution.

As far as I know, everything in the simulate, synth, P&R, and STA
chain can be performed with adequate resolution using integers.

Crosstalk and inductive effects might require floating point help, but
I would not be surprised if even that could be approximated well with
fixed-point arithmetic.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Re: Best CPU platform(s) for FPGA synthesis

Thanks everyone, this is real interesting, but please don't stop
posting if you have more insights to share!

FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on
Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on
my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus). Another
test run of a double-precision DSP simulation (compiled C) ran
substantially slower on the Opteron, which I thought was supposed to
have better floating-point performance than Xeons of that era. Maybe
it was just a case of the gcc -O5 optimization switches being totally
tuned to Intel instead of AMD, or maybe my Quartus P&R step is
primarily dominated by integer calculations.

I originally suspected P&R might have a lot of floating-point
calculations (even prior to signal-integrity considerations) if they
were doing any kind of physical synthesis (e.g., delay calculation
based on distance and fanout); ditto for STA, because that's usually
an integral part of the P&R loops. I also suspected that if floating-
point operations (at least multiplies, add/subtract, and MACs) could
be done in a single cycle, there would be no advantage to using
integer arithmetic instead (especially if manual or otherwise explicit
integer scaling is required).

On the other hand, in something like a router, integers give you exact
location info for things like grid coordinates, which floating-point
cannot guarantee. As far as dynamic range is concerned, I seem to
recall that SystemC standardized on 64-bit time to run longer
simulations, but SystemC is a different animal in that regard anyway.
Nonetheless, I also seem to recall that its implementation of time was
64-bit integers (scaled), because FPU operations are really only exact
over the 53-bit mantissa of a double. Assuming they want a linear
representation of time ticks, I can see the appeal of using 64-bit
integers in simulation.
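
That 53-bit limit is easy to demonstrate with a standalone snippet (my
own illustration, not SystemC code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 2^53 + 1 fits in a 64-bit tick count but not in a double:
           at a 1 ps tick, 2^53 ps is only ~2.5 hours of simulated
           time, while 2^63 ps covers ~106 days. */
        int64_t ticks = ((int64_t)1 << 53) + 1;
        double  d     = (double)ticks;      /* silently rounds */

        printf("int64 : %lld\n", (long long)ticks);
        printf("double: %.0f\n", d);        /* off by one tick */
        return 0;
    }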

As far as event-driven simulations are concerned, I totally understand
how hard it is to make good use of multithreading or multiprocessing,
because everything is so tightly coupled in that evaluate/update/
reschedule loop. If you were working at a much higher level
(behavioral/transaction), where the number of low-level events is
lower and the computation behind "complex" events took up a much
larger portion of the evaluate/update/reschedule loop, then multicore/
multiprocessing solutions might be more effective for simulation.
(Agree/disagree?) It seems that as you get more coarse-grained with
the simulation, even distributed processing (multiple machines on a
network) becomes more feasible. Obviously the scheduler has one
"core" and has to reside in one CPU/memory space, but if it has less
work to do, then it can handle less frequent communication with the
event-processing CPUs in another space.

Back to Quartus in particular and Windows in general... Quartus
supports the new "number_of_cpus" or some similar variable, but only
seems to use it in small sections of quartus_fit (I think Altera is
just taking their first baby steps in this area).

That appears to be related to the number of processors inside one box.
If a single CPU is just hyperthreaded, the processor takes care of
instruction distribution unrelated to a variable like number_of_cpus,
right? And if there are two single-core processors in a box, obviously
it will utilize "number_of_cpus=2" as expected. Does anyone know how
that works with multi-core CPUs? I.e., if I have two quad-core CPUs in
one box, will setting "number_of_cpus=7" make optimal use of 7 cores
while leaving me one to work in a shell or window?

Does anyone know if Quartus makes better use of multiple processors in
a partitioned bottom-up flow compared to a single top-down compile
flow?

In 32-bit Windows, is that 3GB limit for everything running at one
time? i.e., is 4GB a waste on a Windows machine? Can it run multiple
2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit,
and 2GB an absolute process limit in Windows?

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

In going to 64-bit apps and O/S versions, should the tools run equally
fast as long as the processor is truly 64-bit?


Thanks again for all the insights and interesting discussion.


jj




Re: Best CPU platform(s) for FPGA synthesis
snipped-for-privacy@cs.ucf.edu writes:

[snip]

> In 32-bit Linux, can it run 4GB per process and as many simultaneous
> processes of that size as the virtual memory will support?

As I recall, 32-bit Linux has a limit around 3.0-3.5GB per process.
On 64-bit Linux, I have used 8+GB for a single process doing
gate-level simulations.
gatelevel simulations.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Re: Best CPU platform(s) for FPGA synthesis

> As I recall, 32-bit Linux has a limit around 3.0-3.5GB per process.

Below is what I have read about it in "Self-Service Linux®"
http://www.phptr.com/content/images/013147751X/downloads/013147751X_book.pdf
I have no experience with it.

<quote>
3.2.2.1.6 The Kernel Segment

The only remaining segment in a process' address space to discuss is the
kernel segment. The kernel segment starts at 0xc0000000 and is
inaccessible by user processes. Every process contains this segment,
which makes transferring data between the kernel and the process'
virtual memory quick and easy. The details of this segment’s contents,
however, are beyond the scope of this book.

Note:

You may have realized that this segment accounts for one quarter of the
entire address space for a process. This is called 3/1 split address
space. Losing 1GB out of 4GB isn't a big deal for the average user, but
for high-end applications such as database managers or Web servers,
this can become an issue. The real solution is to move to a 64-bit
platform where the address space is not limited to 4GB, but due to the
large amount of existing 32-bit x86 hardware, it is advantageous to
address this issue. There is a patch known as the 4G/4G patch, which
can be found at ftp.kernel.org/pub/linux/kernel/people/akpm/patches/ or
http://people.redhat.com/mingo/4g-patches . This patch moves the 1GB
kernel segment out of each process’ address space, thus providing the
entire 4GB address space to applications.
<end quote>
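
A crude way to see that per-process ceiling for yourself is to just
allocate until it fails.  This is only a sketch, assuming a 32-bit
binary (e.g., built with gcc -m32) on a 3/1-split kernel:

    #include <stdio.h>
    #include <stdlib.h>

    /* Grab 64MB chunks until the address space runs out.  On a
       3/1-split 32-bit kernel this tops out somewhere below 3GB.
       (The memory is deliberately leaked; the process exits anyway.) */
    int main(void)
    {
        const size_t chunk = 64u * 1024 * 1024;
        size_t total = 0;

        while (malloc(chunk) != NULL)
            total += chunk;

        printf("allocated ~%zu MB before malloc failed\n", total >> 20);
        return 0;
    }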

--
Paul Uiterlinden
www.aimvalley.nl
Re: Best CPU platform(s) for FPGA synthesis
> If a single CPU is just hyperthreaded, the processor takes care of
> instruction distribution unrelated to a variable like number_of_cpus,
> right?

No. Hyperthreading means that the hardware is only virtually doubled.
The CPU maintains the state and the register set of two independent
threads and tries to utilize all its function units. If one thread has
to wait for data from the memory some instructions of the other thread
can be issued to the function units. Likewise, if one thread spends its
time in the FPU, the other thread can use the remaining function units.
If both threads execute the same type of instructions a hyperthreaded
CPU rarely has an advantage.

Running on a hyperthreaded CPU, the operating system sees two cores and
has to schedule its workload as if there were two physical cores to
gain any benefit. If your software only has one thread, hyperthreading,
like multiple cores, won't speed it up.
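
For what it's worth, you can ask the OS how many logical CPUs it thinks
it has (hyperthread siblings included); a Linux/glibc sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Counts logical CPUs online: a hyperthreaded single core and
           a true dual-core both report 2 -- from software they look
           the same. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online logical CPUs: %ld\n", n);
        return 0;
    }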

> Does anyone know how that works with multi-core CPUs?

I don't know how Quartus makes use of the available CPUs but basically
as seen from software there is no difference between two single cores
and one dual-core.

> In 32-bit Windows, is that 3GB limit for everything running at one
> time? i.e., is 4GB a waste on a Windows machine?

3 GB is a practical limit because the PCI bus and other memory-mapped
devices typically occupy several hundred megabytes of address space, so
you can't use that part of the address space to access RAM. There are
techniques to map memory to other address regions beyond the 4 GB
border, but you need special chipsets and proper operating system
support.

Andreas

Re: Best CPU platform(s) for FPGA synthesis

> ...the PCI bus and other memory-mapped devices typically occupy
> several hundred megabytes of address space.
These are usually not mapped into the address space of a user process.

Kolja Sulimma


Re: Best CPU platform(s) for FPGA synthesis

> These are usually not mapped into the address space of a user process.


Nope, but the (32-bit) kernel needs to see the mmap'ed peripherals plus
the userspace RAM if stuff like file reading is to be implemented
efficiently (without juggling with pages)...

Re: Best CPU platform(s) for FPGA synthesis
> Or is 3GB an absolute O/S limit, and 2GB an absolute process limit in
> Windows?

Anandtech ran an article which does quite a good job of explaining the
2 and 3 GB barriers.
http://www.anandtech.com/gadgets/showdoc.aspx?i30%34

Re: Best CPU platform(s) for FPGA synthesis
> Anandtech ran an article which does quite a good job of explaining
> the 2 and 3 GB barriers.

As mentioned in the Anandtech article, there are stability issues with
running in 3GB mode.  We have seen these stability issues with Quartus
on WinXP w/3GB mode.  If you need more than 2GB of memory for your
Quartus executable, your best bet is to run 32-bit Quartus on (a)
32-bit or 64-bit Linux or (b) 64-bit Windows.

Regards,

Paul Leventis
Altera Corp.


Re: Best CPU platform(s) for FPGA synthesis
> Even a 32bit int would be big enough for holding
> up to a ridiculous 4.3 seconds, assuming 1psec resolution.

I think you're a factor of 1000 out.

For an ASIC STA, gate delays must be specified at a much finer
resolution than 1ps.

Cheers,
Jon



Re: Best CPU platform(s) for FPGA synthesis

> I think you're a factor of 1000 out.

Duh, brain fart indeed!

> For an ASIC STA, gate delays must be specified at a much finer
> resolution than 1ps.

I don't recall sub-psec resolution in the 130nm libraries I have seen,
but that doesn't imply that it cannot be so.

But I stand by my argument: the actual resolution should not matter
much, as the total clock delays and cycle times should scale pretty
much with the library resolution.  Otherwise, there wouldn't be a point
in choosing such a fast technology (who in their right mind would use
a 45nm process for implementing a 32kHz RTC, unless they had to?)
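
For the record, the corrected arithmetic works out like this (a
standalone check, nothing tool-specific):

    #include <stdio.h>

    int main(void)
    {
        /* Maximum simulated time representable at a 1 ps tick. */
        double max32 = 4294967295.0;           /* 2^32 - 1 ticks */
        double max64 = 9223372036854775807.0;  /* 2^63 - 1 ticks */

        printf("32-bit: ~%.1f ms\n", max32 * 1e-12 * 1e3);  /* ~4.3 ms */
        printf("64-bit: ~%.0f days\n",
               max64 * 1e-12 / 86400.0);                    /* ~107 days */
        return 0;
    }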


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Re: Best CPU platform(s) for FPGA synthesis
snipped-for-privacy@cadence.com writes:
> But don't use 64-bit executables unless your design is too big for
> 32-bit tools, because they will run slower on the same machine.

Although that might be true for some specific cases, in general on Linux
native 64-bit executables tend to run faster than 32-bit executables.
But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.

Re: Best CPU platform(s) for FPGA synthesis

> Although that might be true for some specific cases, in general on
> Linux native 64-bit executables tend to run faster than 32-bit
> executables.

I think that should be qualified to say that 64-bit x86_64 Linux
binaries run faster than the same binaries compiled for 32-bit x86
Linux.

For other CPU architectures (MIPS, SPARC, PowerPC, etc.), the opposite
is generally true.



Re: Best CPU platform(s) for FPGA synthesis

> For other CPU architectures (MIPS, SPARC, PowerPC, etc.), the
> opposite is generally true.

Interesting -- on an AMD Athlon X2/5200+ running RHEL Linux 4 update 4
x86_64, just about all Synopsys Design Compiler jobs run FASTER in
64-bit mode than in 32-bit mode, between 5-10% faster.  The penalty is
a slightly larger RAM footprint, just as you noted.  The X2/5200+ is
spec'd the same as an Opteron 1218 (2.6GHz, 2x1MB L2 cache).

This trend was pretty much consistent across all our Linux EDA-tools.

On Solaris SPARC, 64-bit mode was definitely slower than 32-bit mode,
by about 10-20%.  For the life of me, I can't understand why the AMD
would run 64-bit mode faster than its 32-bit mode -- but for every
other machine architecture, 64-bit mode is almost always slower.

I forgot to re-run my 32-bit vs 64-bit benchmark on Intel Core2 Duo
machines.  For 64-bit, the Intel E6850 (4MB L2 cache, 3.0GHz) ran
anywhere from 50-60% faster than the AMD X2/5200+.  Don't worry, no
production machines were overclocked (for obvious official sign-off
reasons.)  It was just an admin's corner-cubicle experiment.

> Most of these tools are not multi-threaded, so the only way you will
> get a speedup is if you have multiple jobs at the same time.

When I ran two separate (unrelated) jobs simultaneously on the AMD and
Intel machines, the AMD machine handled dual-tasking much better.  AMD
only dropped 5-7% for each job.  The E6600 fared a lot worse --
anywhere from a 10-30% performance drop.  (Though not as bad as the
Pentium 3 and Pentium 4 based Xeons.)

I'm wondering if the E6600's unified 4MB L2 cache thrashes badly in
dual-tasking.  Or maybe the better way to look at it: in
single-tasking, the 4MB L2 cache is 4X more than the AMD Opteron's 1MB
cache per CPU core.



Re: Best CPU platform(s) for FPGA synthesis

(snip)

> On Solaris SPARC, 64-bit mode was definitely slower than
> 32-bit mode, by about 10-20%.
> For the life of me, I can't understand why the AMD would run 64-bit
> mode faster than its 32-bit mode

It might be because more registers are available, and IA32
code is register starved.

-- glen


Re: Best CPU platform(s) for FPGA synthesis
I think that memory performance is the limiting factor for
FPGA synthesis and P&R.

This machine had a single-core AMD64 processor, which I recently
replaced with a slightly faster dual-core processor.

I ran a fairly quick FPGA build through Quartus to get a time for a
before and after comparison before I did the swap.

The before and after times were exactly the same :-(

I think the amount and speed of memory is crucial; it's probably
worth paying as much attention to that as to the processor.


Nial.



Re: Best CPU platform(s) for FPGA synthesis

> The before and after times were exactly the same :-(

Did you change the setting "use up to x number of CPUs" (don't remember
the exact name) somewhere in the project settings?

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de , http://www.it4-systems.de
