Opteron performance tuning (for Quartus / Linux)?

Thanks in advance for the help!

----------------------------------------

My employer just acquired a SunFire server with 16 dual-core Opterons (model 8220, 2.8 GHz) and 128GB of RAM.

Despite the faster clock and memory interface, it's running my Quartus jobs slower than a 2-year old dual-core Opteron (2.4 GHz Model 250).

I suspect my I.T. dept just did a generic configure, and missed out on some major performance tuning opportunities. Maybe they've left a power-saving mode (like PowerNow) in place?

Can anyone suggest the biggest-bang-for-the-buck things to look at?

I don't have root privilege, but I'm trying to help the I.T. folks along.

uname -a: Linux monster 2.6.9-55.ELlargesmp #1 SMP Fri Apr 20 16:46:56 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

top:

top - 16:58:22 up 1 day, 3:07, 3 users, load average: 1.98, 1.97, 2.17
Tasks: 174 total, 2 running, 172 sleeping, 0 stopped, 0 zombie
Cpu(s): 9.0% us, 0.1% sy, 0.0% ni, 90.8% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 131385408k total, 5031132k used, 126354276k free, 135848k buffers
Swap: 41943032k total, 0k used, 41943032k free, 3417328k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
13903 jjohnson 22  0 1111m 1.0g  49m R  100  0.8  3:22.81 quartus_eda
14059 jjohnson 15  0  137m  48m  39m S   46  0.0  0:03.08 quartus_cdb

cat /proc/cpuinfo: (same repeated for all 16 processors)

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 8220
stepping : 3
cpu MHz : 2800.274
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni cx16
bogomips : 5603.62
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp [4] [5]
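One quick check that needs no root access is to summarize /proc/cpuinfo instead of reading it by eye. The sketch below is only an illustration (the function name `summarize_cpuinfo` is made up here): it counts logical CPUs and distinct `physical id` values (sockets), and reports the lowest and highest `cpu MHz`. It assumes that, on kernels with frequency scaling active, `cpu MHz` reflects the current clock, so a minimum well below 2800 while a job is running would hint that a power-saving mode such as PowerNow is throttling the cores.

```python
import os

def summarize_cpuinfo(text):
    """Summarize /proc/cpuinfo text: logical CPUs, sockets, and clock range."""
    logical = 0
    sockets = set()
    mhz = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            sockets.add(value)
        elif key == "cpu MHz":
            mhz.append(float(value))
    return {
        "logical_cpus": logical,
        "sockets": len(sockets),
        "min_mhz": min(mhz) if mhz else None,
        "max_mhz": max(mhz) if mhz else None,
    }

if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(summarize_cpuinfo(f.read()))
```

Running it once on an idle machine and once while a Quartus job is active gives two snapshots to compare; the socket count also settles how many physical packages the kernel actually sees.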

Reply to
jjohnson

Quartus is single-threaded; in my experience the only things that matter are how much memory and how much cache you have on the one core that actually runs your job.

In other words, your SunFire should be just a smidgen faster than your old Opteron, and a much worse value for the money.

It shouldn't be *slower*, though, since it presumably has a faster memory system, so something is still wrong.

-hpa

Reply to
H. Peter Anvin

I'm guessing that a 2.4G processor running 32-bit Linux is faster than a 2.8G processor running 64-bit Linux.

Just a guess, G.

Reply to
ghelbig

I'm afraid this is roughly what I'd expect; with eight physical processors each attached to its own memory pool, many memory accesses have to be preceded by asking all seven other processors whether they have any opinions on it, and this takes quite some time.

Look in the documentation for any terms like 'NUMA' and 'process pinning'; Quartus is single-threaded, so will be helped out if the OS can be convinced to allocate the memory that it uses out of the pool of memory physically attached to the processor it's running on, and if the OS is told not to move the quartus process between processors if it can be avoided.
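As a rough illustration of process pinning, here is a minimal sketch, not a tested recipe for this machine. `run_pinned` restricts a child process to one logical CPU using Python's `os.sched_setaffinity` (Linux only); it relies on the assumption that the kernel's default policy allocates memory on the node of the CPU that first touches it. `numactl_command` only builds a command line for the `numactl` utility, which can bind memory explicitly; the `--cpunodebind` and `--membind` spellings are those of recent numactl releases and may differ on older installs. Both helper names are hypothetical.

```python
import os
import subprocess
import sys

def run_pinned(cmd, cpu, **kwargs):
    """Run cmd with its CPU affinity restricted to one logical CPU (Linux only)."""
    def pin():
        # Executed in the child between fork and exec; pid 0 means "this process".
        os.sched_setaffinity(0, {cpu})
    return subprocess.run(cmd, preexec_fn=pin, **kwargs)

def numactl_command(cmd, node):
    """Build a numactl command that binds both CPU and memory to one NUMA node.

    Assumption: option names follow recent numactl releases; check
    `numactl --help` on the target system before relying on them.
    """
    return ["numactl", "--cpunodebind=%d" % node, "--membind=%d" % node] + list(cmd)
```

For example, `run_pinned(["quartus_map", "my_project"], cpu=2)` would start the mapper on logical CPU 2, where `my_project` is a placeholder project name. Neither approach needs root, since a user may restrict the affinity of their own processes.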

Tom

Reply to
Thomas Womack

I haven't gotten far with solving the problem yet, other than collecting more benchmarks.

FWIW, Quartus has been multi-threaded since v7.0, although only certain tasks (quartus_fit and Timequest/quartus_sta) appear to do much with it.

You can set the variable NUM_PARALLEL_PROCESSORS to a value between 1 and 4 (some of the docs imply up to 16, but my runs errored out when I tried 5-8).

Among the things I noticed: with NUM_PARALLEL_PROCESSORS=1 and NO other jobs running on the machine, quartus_map was pinging back and forth between two CPUs; I suspect pinning that process to one CPU would not hurt the cause. After a while, it appeared to stick to one CPU, but I didn't watch long enough to call it a scientific observation.
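To turn that kind of observation into numbers, one option is to poll /proc/<pid>/stat, whose field 39 ("processor") records the CPU the task last ran on. The following is a sketch under that assumption; the function names and the sampling defaults are made up for illustration.

```python
import time

def last_cpu_from_stat(stat_line):
    """Return the CPU a task last ran on, given one line of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 (processor) is rest[36].
    return int(rest[36])

def count_migrations(pid, samples=60, interval=1.0):
    """Sample the task's last CPU and count how often it changes."""
    seen = []
    for _ in range(samples):
        with open("/proc/%d/stat" % pid) as f:
            seen.append(last_cpu_from_stat(f.read()))
        time.sleep(interval)
    migrations = sum(1 for a, b in zip(seen, seen[1:]) if a != b)
    return migrations, sorted(set(seen))
```

Calling `count_migrations` with the PID that `top` shows for quartus_map returns the number of observed CPU changes and the set of CPUs visited. Sampling once per second can miss short hops, so the count is a lower bound.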

Also, with no other jobs running, quartus_map did not finish much faster (about 4% faster) than when five other jobs (up to 13 more threads) were running simultaneously. I guess the memory arbitration has to take place regardless of whether or not the other CPUs are accessing it.

I would hope/expect that a company like Sun has some hardware workarounds to prevent the memory interface from being such a bottleneck; (e.g., with most processes fitting into under 2GB of memory, can/do they devote a few GB to each CPU for full-speed access w/o arbitration, and then only access a shared pool when you go over that limit?)

Hardware being fixed as it is, I guess I'll have to dig deeper thru the AMD and RedHat docs on NUMA, etc. One article I'm wading thru is this one from Novell: "Optimizing Linux for Dual-core Opteron Processors". It's a good start, but maybe somewhat Suse and single-chip (one dual core) specific.

If anyone has any links or suggestions more specific to Red Hat and more processors, I'm all ears.

Thanks again!

Reply to
jjohnson

This may be out of the question (due to school/corporate politics), but you'll save a lot of time and money if you just get your own desktop-PC Intel Core 2 Duo E6850 (3.0GHz).

Although I haven't tried Quartus II, for Xilinx Webpack 9.2i.03 and Synopsys Design Compiler, the E6850 is roughly 50-60% faster than our old AMD SocketAM2 X2/5200+ (2.6GHz, 2x1MB cache).

In other words a 100 minute job on the X2/5200+ took only 65 minutes on the Intel E6850.
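The two figures describe the same result from different angles, which a short calculation makes explicit. This is only arithmetic on the numbers quoted above; the function names are illustrative.

```python
def speedup(old_minutes, new_minutes):
    """Throughput ratio: how many times faster the new machine finishes."""
    return old_minutes / new_minutes

def time_saved_fraction(old_minutes, new_minutes):
    """Fraction of wall-clock time saved by the new machine."""
    return 1 - new_minutes / old_minutes

print(round(speedup(100, 65), 2))              # 1.54, i.e. about 54% faster
print(round(time_saved_fraction(100, 65), 2))  # 0.35, i.e. 35% less wall time
```

So "50-60% faster" refers to throughput, while the run itself is about a third shorter.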

Both machines were configured the same: 8GB DDR2/667 ECC unbuffered RAM, Centos 4.5 x86_64, same hard drive (moved it from one PC to the other).

(^^^Centos 4.5 is an open-source clone of Redhat Enterprise Linux 4 Update 5.)

Heh, looks like you're running Redhat Enterprise Linux 4 Update 5.

Reply to
Systemv User
