Best CPU platform(s) for FPGA synthesis

OK, the questions apply primarily to FPGA synthesis (Altera Quartus
fitter for StratixII and HardCopyII), but I'm interested in feedback
regarding all EDA tools in general.


Context: I'm suffering some long Quartus runtimes on their biggest
StratixII and second-biggest HardCopyII device. Boss has given me
permission to order a new desktop/workstation/server. Immediate goal
is to speed up Quartus, but other long-term value considerations will
be taken into account.


True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Place and Route (quartus_fit) is mostly double-precision
floating-point?
Static Timing Analysis (TimeQuest) is mostly double-precision floating-
point?
RTL simulation is mostly integer operations?
SDF / gate-level simulation is mostly double-precision floating-point?


AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?
Assume cost is no object, and each uses its highest-performing memory
interface, but disk access is (a necessary evil) over a networked
drive. (Small % of total runtime anyway.)


Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
Windows? >2GB of RAM?
----------------------------------------------------------------------
Is Quartus (and the others) more efficient in any one particular
environment? I prefer Linux, but the OS is now secondary to pure
runtime performance (unless it is a major contributor). Can any of
them make use of more than 2GB of RAM? More than 4GB? Useful limit on
the number of processors/cores?


Any specific box recommendations?



Thanks a gig,

jj


Re: Best CPU platform(s) for FPGA synthesis
> Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Yes.

> Place and Route (quartus_fit) is mostly double-precision
> floating-point?
I don't know why they would use floating point if they don't have to.

> Static Timing Analysis (TimeQuest) is mostly double-precision
> floating-point?
I seriously doubt it.  I don't see a need for floating point there
when delays can use scaled integers.

> RTL simulation is mostly integer operations?
Yes.

> SDF / gate-level simulation is mostly double-precision floating-point?
No, or at least not in any implementation I am familiar with.  All the
delays are scaled up so that integers can be used for them.
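
To make the scaled-integer idea concrete, here is a toy sketch in C (an
illustration only, not any vendor's actual code; the 1 ps tick and all
names are made up):

    #include <stdint.h>
    #include <stdio.h>

    /* Delays kept as 64-bit integer picosecond counts: adds and
       compares are exact, with no rounding, unlike doubles. */
    typedef int64_t delay_ps;

    int main(void)
    {
        delay_ps clk_to_q = 312;    /* 0.312 ns, expressed in ps */
        delay_ps routing  = 1047;   /* 1.047 ns */
        delay_ps setup    = 189;    /* 0.189 ns */

        delay_ps path = clk_to_q + routing + setup;
        printf("path delay = %lld ps (%.3f ns)\n",
               (long long)path, path / 1000.0);
        return 0;
    }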

In simulation (assuming something with state-of-the-art performance),
the CPU operations themselves are not very important anyway.  It is
not compute-bound, it is memory-access-bound.  What you need is big
caches and fast access to memory for when the cache isn't big enough.


> Can any of them make use of more than 2GB of RAM? More than 4GB?

64-bit Linux can make use of more than 4GB of RAM.  But don't use 64-
bit executables unless your design is too big for 32-bit tools,
because they will run slower on the same machine.

> Useful limit on the number of processors/cores?

Most of these tools are not multi-threaded, so the only way you will
get a speedup is if you have multiple jobs at the same time.  Event-
driven simulation in particular is not amenable to multi-threading,
despite much wishful thinking for the last few decades.
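
The tight coupling is easy to see in even a toy version of the kernel
loop.  In the sketch below (all names invented, nothing from a real
simulator), every iteration pops the globally earliest event and may
schedule new ones, so iteration N+1 depends on iteration N:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy event kernel (illustration only).  A sorted array stands in
       for the real time wheel.  Each event handler may schedule new
       events, so every loop iteration depends on the one before it --
       that is the serial bottleneck. */
    #define MAX_EVENTS 64

    typedef struct { int64_t time_ps; int signal; int value; } event;

    static event queue[MAX_EVENTS];
    static int n_events = 0;

    static void schedule(int64_t t, int sig, int val)
    {
        int i = n_events++;                 /* toy: no overflow check */
        while (i > 0 && queue[i - 1].time_ps > t) {
            queue[i] = queue[i - 1];        /* insertion sort by time */
            i--;
        }
        queue[i] = (event){ t, sig, val };
    }

    int main(void)
    {
        schedule(0, 0, 1);              /* kick off a 2-inverter ring */
        while (n_events > 0) {
            event e = queue[0];         /* pop globally earliest event */
            for (int i = 1; i < n_events; i++)
                queue[i - 1] = queue[i];
            n_events--;

            printf("t=%lld ps: sig%d <= %d\n",
                   (long long)e.time_ps, e.signal, e.value);

            /* evaluate fanout: each inverter adds a 500 ps delay */
            if (e.time_ps < 5000)
                schedule(e.time_ps + 500, !e.signal, !e.value);
        }
        return 0;
    }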


Re: Best CPU platform(s) for FPGA synthesis

> I don't see a need for floating point there
> when delays can use scaled integers.

Dynamic range?

Cheers,
Jon



Re: Best CPU platform(s) for FPGA synthesis

> Dynamic range?

Not a likely problem. Even a 32bit int would be big enough for holding
up to a ridiculous 4.3 seconds, assuming 1psec resolution.

As far as I know, everything in the simulate, synth, P&R, and STA
chain can be performed with adequate resolution using integers.

Crosstalk and inductive effects might require floating point help, but
I would not be surprised if even that could be approximated well with
fixed-point arithmetic.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Re: Best CPU platform(s) for FPGA synthesis

Thanks everyone, this is real interesting, but please don't stop
posting if you have more insights to share!

FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on
Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on
my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus). Another
test run of a double-precision DSP simulation (compiled C) ran
substantially slower on the Opteron, which I thought was supposed to
have better floating-point performance than Xeons of that era. Maybe
it was just a case of the gcc -O5 optimization switches being totally
tuned to Intel instead of AMD, or maybe my Quartus P&R step is
primarily dominated by integer calculations.

I originally suspected P&R might have a lot of floating-point
calculations (even prior to signal-integrity considerations) if they
were doing any kind of physical synthesis (e.g., delay calculation
based on distance and fanout); ditto for STA, because that's usually
an integral part of the P&R loops. I also suspected that if floating-
point operations (at least multiplies, add/subtract, and MACs) could
be done in a single cycle, there would be no advantage to using
integer arithmetic instead (especially if manual or otherwise explicit
integer scaling is required).

On the other hand, in something like a router, integers give you exact
location info for things like grid coordinates, which floating-point
cannot guarantee. As far as dynamic range is concerned, I seem to
recall that SystemC standardized on 64-bit time to run longer
simulations, but SystemC is a different animal in that regard anyway.
Nonetheless, I also seem to recall that its implementation of time was
64-bit integers (scaled), because FPU operations are really only exact
over the 53-bit mantissa of a double. Assuming they want a linear
representation of time ticks, I can see the appeal of using 64-bit
integers in simulation.
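
That 53-bit limit is easy to demonstrate with a standalone snippet (my
own illustration, not SystemC code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 2^53 + 1 fits in a 64-bit tick count but not in a double:
           at a 1 ps tick, 2^53 ps is only ~2.5 hours of simulated
           time, while 2^63 ps covers ~106 days. */
        int64_t ticks = ((int64_t)1 << 53) + 1;
        double  d     = (double)ticks;      /* silently rounds */

        printf("int64 : %lld\n", (long long)ticks);
        printf("double: %.0f\n", d);        /* off by one tick */
        return 0;
    }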

As far as event-driven simulations are concerned, I totally understand
how hard it is to make good use of multithreading or multiprocessing,
because everything is so tightly coupled in that evaluate/update/
reschedule loop. If you were working at a much higher level
(behavioral/transaction), where the number of low-level events is
lower and the computation behind "complex" events took up a much
larger portion of the evaluate/update/reschedule loop, then multicore/
multiprocessing solutions might be more effective for simulation.
(Agree/disagree?) It seems that as you get more coarse-grained with
the simulation, even distributed processing (multiple machines on a
network) becomes more feasible. Obviously the scheduler has one
"core" and has to reside in one CPU/memory space, but if it has less
work to do, then it can handle less frequent communication with the
event-processing CPUs in another space.

Back to Quartus in particular and Windows in general... Quartus
supports the new "number_of_cpus" or some similar variable, but only
seems to use it in small sections of quartus_fit (I think Altera is
just taking their first baby steps in this area).

That appears to be related to the number of processors inside one box.
If a single CPU is just hyperthreaded, the processor takes care of
instruction distribution unrelated to a variable like number_of_cpus,
right? And if there are two single-core processors in a box, obviously
it will utilize "number_of_cpus=2" as expected. Does anyone know how
that works with multi-core CPUs? I.e., if I have two quad-core CPUs in
one box, will setting "number_of_cpus=7" make optimal use of 7 cores
while leaving me one to work in a shell or window?

Does anyone know if Quartus makes better use of multiple processors in
a partitioned bottom-up flow compared to a single top-down compile
flow?

In 32-bit Windows, is that 3GB limit for everything running at one
time? i.e., is 4GB a waste on a Windows machine? Can it run multiple
2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit,
and 2GB an absolute process limit in Windows?

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

In going to 64-bit apps and O/S versions, should the tools run equally
fast as long as the processor is truly 64-bit?


Thanks again for all the insights and interesting discussion.


jj




Re: Best CPU platform(s) for FPGA synthesis
snipped-for-privacy@cs.ucf.edu writes:

[snip]

> In 32-bit Linux, can it run 4GB per process and as many simultaneous
> processes of that size as the virtual memory will support?

As I recall, 32-bit Linux has a limit around 3.0-3.5GB per process.
On 64-bit Linux, I have used 8+GB for a single process doing
gate-level simulations.
gatelevel simulations.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Re: Best CPU platform(s) for FPGA synthesis

> As I recall, 32-bit Linux has a limit around 3.0-3.5GB per process.

Below is what I have read about it in "Self-Service Linux®"
http://www.phptr.com/content/images/013147751X/downloads/013147751X_book.pdf
I have no experience with it.

<quote>
3.2.2.1.6 The Kernel Segment

The only remaining segment in a process' address space to discuss is the
kernel segment. The kernel segment starts at 0xc0000000 and is
inaccessible by user processes. Every process contains this segment,
which makes transferring data between the kernel and the process'
virtual memory quick and easy. The details of this segment’s contents,
however, are beyond the scope of this book.

Note:

You may have realized that this segment accounts for one quarter of the
entire address space for a process. This is called 3/1 split address
space. Losing 1GB out of 4GB isn't a big deal for the average user, but
for high-end applications such as database managers or Web servers,
this can become an issue. The real solution is to move to a 64-bit
platform where the address space is not limited to 4GB, but due to the
large amount of existing 32-bit x86 hardware, it is advantageous to
address this issue. There is a patch known as the 4G/4G patch, which
can be found at ftp.kernel.org/pub/linux/kernel/people/akpm/patches/ or
http://people.redhat.com/mingo/4g-patches . This patch moves the 1GB
kernel segment out of each process’ address space, thus providing the
entire 4GB address space to applications.
<end quote>
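
A crude way to see that per-process ceiling for yourself is to just
allocate until it fails.  This is only a sketch, assuming a 32-bit
binary (e.g., built with gcc -m32) on a 3/1-split kernel:

    #include <stdio.h>
    #include <stdlib.h>

    /* Grab 64MB chunks until the address space runs out.  On a
       3/1-split 32-bit kernel this tops out somewhere below 3GB.
       (The memory is deliberately leaked; the process exits anyway.) */
    int main(void)
    {
        const size_t chunk = 64u * 1024 * 1024;
        size_t total = 0;

        while (malloc(chunk) != NULL)
            total += chunk;

        printf("allocated ~%zu MB before malloc failed\n", total >> 20);
        return 0;
    }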

--
Paul Uiterlinden
www.aimvalley.nl
Re: Best CPU platform(s) for FPGA synthesis
> If a single CPU is just hyperthreaded, the processor takes care of
> instruction distribution unrelated to a variable like number_of_cpus,
> right?

No. Hyperthreading means that the hardware is only virtually doubled.
The CPU maintains the state and the register set of two independent
threads and tries to utilize all its function units. If one thread has
to wait for data from the memory some instructions of the other thread
can be issued to the function units. Likewise, if one thread spends its
time in the FPU, the other thread can use the remaining function units.
If both threads execute the same type of instructions a hyperthreaded
CPU rarely has an advantage.

Running on a hyperthreaded CPU, the operating system sees two cores and
has to schedule its workload as if there were two physical cores to
gain any benefit. If your software only has one thread, hyperthreading,
like multiple cores, won't speed it up.
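
For what it's worth, you can ask the OS how many logical CPUs it thinks
it has (hyperthread siblings included); a Linux/glibc sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Counts logical CPUs online: a hyperthreaded single core and
           a true dual-core both report 2 -- from software they look
           the same. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online logical CPUs: %ld\n", n);
        return 0;
    }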

> Does anyone know how that works with multi-core CPUs?

I don't know how Quartus makes use of the available CPUs but basically
as seen from software there is no difference between two single cores
and one dual-core.

> In 32-bit Windows, is that 3GB limit for everything running at one
> time? i.e., is 4GB a waste on a Windows machine?

3 GB is a practical limit because the PCI bus and other memory-mapped
devices typically occupy several hundred megabytes of address space, so
you can't use that part of the address space to access RAM. There are
techniques to map memory to other address regions beyond the 4 GB
border, but you need special chipsets and proper operating system
support.

Andreas

Re: Best CPU platform(s) for FPGA synthesis

> ...the PCI bus and other memory-mapped devices typically occupy
> several hundred megabytes of address space.
These are usually not mapped into the address space of a user process.

Kolja Sulimma


Re: Best CPU platform(s) for FPGA synthesis

> These are usually not mapped into the address space of a user process.


Nope, but the (32-bit) kernel needs to see the mmap'ed peripherals plus
the userspace RAM if stuff like file reading is to be implemented
efficiently (without juggling with pages)...

Re: Best CPU platform(s) for FPGA synthesis
> Or is 3GB an absolute O/S limit, and 2GB an absolute process limit in
> Windows?

Anandtech ran an article which does quite a good job of explaining the
2 and 3 GB barriers.
http://www.anandtech.com/gadgets/showdoc.aspx?i30%34

Re: Best CPU platform(s) for FPGA synthesis
> Anandtech ran an article which does quite a good job of explaining
> the 2 and 3 GB barriers.

As mentioned in the Anandtech article, there are stability issues with
running in 3GB mode.  We have seen these stability issues with Quartus
on WinXP w/3GB mode.  If you need more than 2GB of memory for your
Quartus executable, your best bet is to run 32-bit Quartus on (a)
32-bit or 64-bit Linux or (b) 64-bit Windows.

Regards,

Paul Leventis
Altera Corp.


Re: Best CPU platform(s) for FPGA synthesis
> Even a 32bit int would be big enough for holding
> up to a ridiculous 4.3 seconds, assuming 1psec resolution.

I think you're a factor of 1000 out.

For an ASIC STA, gate delays must be specified at a much finer
resolution than 1ps.

Cheers,
Jon



Re: Best CPU platform(s) for FPGA synthesis

> I think you're a factor of 1000 out.

Duh, brain fart indeed!

> For an ASIC STA, gate delays must be specified at a much finer
> resolution than 1ps.

I don't recall sub-psec resolution in the 130nm libraries I have seen,
but that doesn't imply that it cannot be so.

But I stand by my argument: the actual resolution should not matter
much, as the total clock delays and cycle times should scale pretty
much with the library resolution.  Otherwise, there wouldn't be a point
in choosing such a fast technology (who in their right mind would use
a 45nm process for implementing a 32kHz RTC, unless they had to?)
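
For the record, the corrected arithmetic works out like this (a
standalone check, nothing tool-specific):

    #include <stdio.h>

    int main(void)
    {
        /* Maximum simulated time representable at a 1 ps tick. */
        double max32 = 4294967295.0;           /* 2^32 - 1 ticks */
        double max64 = 9223372036854775807.0;  /* 2^63 - 1 ticks */

        printf("32-bit: ~%.1f ms\n", max32 * 1e-12 * 1e3);  /* ~4.3 ms */
        printf("64-bit: ~%.0f days\n",
               max64 * 1e-12 / 86400.0);                    /* ~107 days */
        return 0;
    }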


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Re: Best CPU platform(s) for FPGA synthesis
snipped-for-privacy@cadence.com writes:
> But don't use 64-bit executables unless your design is too big for
> 32-bit tools, because they will run slower on the same machine.

Although that might be true for some specific cases, in general on Linux
native 64-bit executables tend to run faster than 32-bit executables.
But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.

Re: Best CPU platform(s) for FPGA synthesis

> Although that might be true for some specific cases, in general on
> Linux native 64-bit executables tend to run faster than 32-bit
> executables.

I think that should be qualified to say that 64-bit x86_64 Linux
binaries run faster than the same binaries compiled for 32-bit x86
Linux.

For other CPU architectures (MIPS, SPARC, PowerPC, etc.), the opposite
is generally true.



Re: Best CPU platform(s) for FPGA synthesis

> For other CPU architectures (MIPS, SPARC, PowerPC, etc.), the
> opposite is generally true.

Interesting -- on an AMD Athlon X2/5200+ running RHEL Linux 4 update 4
x86_64, just about all Synopsys Design Compiler jobs run FASTER in
64-bit mode than in 32-bit mode, between 5-10% faster.  The penalty is
a slightly larger RAM footprint, just as you noted.  The X2/5200+ is
spec'd the same as an Opteron 1218 (2.6GHz, 2x1MB L2 cache).

This trend was pretty much consistent across all our Linux EDA-tools.

On Solaris SPARC, 64-bit mode was definitely slower than 32-bit mode,
by about 10-20%.  For the life of me, I can't understand why the AMD
would run 64-bit mode faster than its 32-bit mode -- but for every
other machine architecture, 64-bit mode is almost always slower.

I forgot to re-run my 32-bit vs 64-bit benchmark on Intel Core2 Duo
machines.  For 64-bit, the Intel E6850 (4MB L2 cache, 3.0GHz) ran
anywhere from 50-60% faster than the AMD X2/5200+.  Don't worry, no
production machines were overclocked (for obvious official sign-off
reasons.)  It was just an admin's corner-cubicle experiment.

> Most of these tools are not multi-threaded, so the only way you will
> get a speedup is if you have multiple jobs at the same time.

When I ran two separate (unrelated) jobs simultaneously on the AMD and
Intel machines, the AMD machine handled dual-tasking much better.  AMD
only dropped 5-7% for each job.  The E6600 fared a lot worse --
anywhere from a 10-30% performance drop.  (Though not as bad as the
Pentium 3 and Pentium 4 based Xeons.)

I'm wondering if the E6600's unified 4MB L2 cache thrashes badly in
dual-tasking.  Or maybe the better way to look at it: in
single-tasking, the 4MB L2 cache is 4X more than the AMD Opteron's 1MB
cache per CPU core.



Re: Best CPU platform(s) for FPGA synthesis

(snip)

> On Solaris SPARC, 64-bit mode was definitely slower than
> 32-bit mode, by about 10-20%.
> For the life of me, I can't understand why the AMD would run 64-bit
> mode faster than its 32-bit mode

It might be because more registers are available, and IA32
code is register starved.

-- glen


Re: Best CPU platform(s) for FPGA synthesis
I think that memory performance is the limiting factor for
FPGA synthesis and P&R.

This machine had a single-core AMD64 processor, which I recently
replaced with a slightly faster dual-core processor.

I ran a fairly quick FPGA build through Quartus to get a time for a
before and after comparison before I did the swap.

The before and after times were exactly the same :-(

I think the amount and speed of memory is crucial; it's probably
worth paying as much attention to that as to the processor.


Nial.



Re: Best CPU platform(s) for FPGA synthesis

> The before and after times were exactly the same :-(

Did you change the setting "use up to x number of CPUs" (don't remember
the exact name) somewhere in the project settings?

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de , http://www.it4-systems.de
