Place-and-Route : Intel vs AMD

- Louis
  
Thu, Jan 10, 2008 10:13 PM

Has anybody seen a recent benchmark comparing high-end Intel and AMD processors on Place-and-Route (PAR) performance? All I can find is a 2005 post here stating that AMD outperformed Intel, probably because Intel's very deep pipeline is poorly suited to non-homogeneous processing such as PAR:

formatting link

With Intel apparently taking the lead in general-purpose processing with its 45nm technology, is that statement still true? I'm basically looking for the best workstation honest money can buy to run Xilinx's PAR tool. Any suggestions? Thanks.

- H. Peter Anvin
  
Thu, Jan 10, 2008 11:22 PM

I think you'll find that Intel's Core 2 generation does a lot better than the previous one, because it has better memory latency and a shorter pipeline than the P4, which just plain sucked.

I don't have a recent benchmark, but I can tell you what I've seen in the past (early 2006 timeframe):

- multicore doesn't matter (unless you try to do other things while the tools are running). Most current FPGA tools are still single-threaded.

- cache size matters more than anything. Going from a 512K to a 1024K cache cut the synthesis time by two-thirds. Intel probably has an advantage here, because they have shared caches; remember to only count the cache available to a single core.

- memory size and memory latency matter too. Get lots of fast RAM.

- the OS will manage memory better if it's a 64-bit OS. The tools seemed to run about 20-25% faster on 64-bit Linux than on 32-bit WinXP.

-hpa

- Gary Pace
  
Fri, Jan 11, 2008 3:49 AM

I can only give one example:

Altera Quartus II 7.2 SP1, EP2C50 design, about 70% full

Last year's rig: AMD Athlon 64 X2 4800+ system, total compile time 13 minutes

This year's rig: Intel Core 2 Quad Q6600 system, total compile time 10 minutes

Both systems have:

- 4GB DDR2 RAM at max speed (CPU max speed for AMD, P35 northbridge max speed for Intel)
- WD Raptor drives
- the /3GB switch in boot.ini

The difference is almost entirely in the placement. Quartus has a multi-processor option, and reports an average of 1.5 processors out of a maximum of 2 for the AMD, and 1.7 out of a maximum of 4 for the Intel.
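To put those numbers in perspective, here is a rough sketch in Python. The speedup is just the ratio of the two compile times. The second function is more speculative: it assumes the reported average processor count behaves like an ideal Amdahl's-law speedup, which Quartus does not claim, so treat the resulting parallel fraction as an illustration only. The function names are made up for this example.

```python
def wall_clock_speedup(old_minutes: float, new_minutes: float) -> float:
    """Ratio of old to new compile time; above 1 means the new rig is faster."""
    return old_minutes / new_minutes


def implied_parallel_fraction(avg_processors: float, max_processors: int) -> float:
    """Solve Amdahl's law, avg = 1 / ((1 - p) + p / n), for p.

    Assumption: the reported average processor count behaves like an
    ideal Amdahl speedup on n processors. This is a simplification.
    """
    n = max_processors
    return (1.0 - 1.0 / avg_processors) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    # Figures taken from the post: 13 min (AMD) vs 10 min (Intel).
    print(f"speedup: {wall_clock_speedup(13, 10):.2f}x")
    # 1.5 of 2 processors (AMD), 1.7 of 4 processors (Intel).
    print(f"AMD   implied parallel fraction: {implied_parallel_fraction(1.5, 2):.2f}")
    print(f"Intel implied parallel fraction: {implied_parallel_fraction(1.7, 4):.2f}")
```

Under that assumption, going from 2 to 4 cores barely raises the average because the serial part of the run dominates, which fits the observation that extra cores help little.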

I wonder how the AMD Phenom quad-core doo-dah would perform? I am assuming it accesses main memory via a dedicated 128-bit port like the dual-core one. I think the Intel goes via the northbridge and uses "interleaved dual channel" (what that means, I don't know). Sounds like a better channel to main memory.

- Ben Jackson
  
Fri, Jan 11, 2008 12:27 PM

The Q6600 is multiplier locked, but sits at the bottom of the available front-side bus speeds. As a result, it's easy to overclock, and people have gotten 3GHz fairly easily. It'd be interesting to see how much effect that has on the build. If you want to go "legit" at those clock speeds there are Q6800 procs...

--
Ben Jackson AD7GD

http://www.ben.com/

- H. Peter Anvin
  
Fri, Jan 11, 2008 10:50 PM

Anyone who uses a 128-bit path (dual channel) uses interleaving (running both channels in parallel), as long as there is the same amount of memory on each port.

-hpa

- Eric Smith
  
Fri, Jan 11, 2008 11:35 PM

With the possible exception of the Socket 1207 parts, for which documentation is not available, the AMD processors don't have dual memory channels, despite widespread claims. They have a single channel that can operate in either 64-bit or 128-bit width (plus optional ECC). Using 128-bit width has obvious benefits, but interleave is not one of them.

- Tommy Thorn
  
Sat, Jan 12, 2008 2:17 AM

I mostly agree.

I ran my own scaling experiments before settling on my current setup (using Windows Quartus II 7.X and my mix of designs). I don't have the numbers handy, but for me:

- number of cores made very little difference,

- memory bandwidth, latency, and capacity made *no* measurable difference as long as I had enough (2 GiB+),

- L2 cache size was significant, and finally

- core frequency mattered most.

Basically, given a large enough L2 (4+ MiB), performance scaled linearly with clock frequency (Core 2 Duo). The only AMD part I had to compare with was pretty old (XP 3200+ / 2.0 GHz) and it didn't perform well (d'oh). I went for a conservatively overclocked (~3.1 GHz) Core 2 Duo with 4 MiB of L2.
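If performance really does scale linearly with clock frequency, compile time scales with its inverse. A minimal sketch of that estimate follows; the 10-minute, 2.4 GHz baseline is a hypothetical example, not a measurement from this post, and the function name is invented for illustration.

```python
def scaled_compile_time(base_minutes: float, base_ghz: float, new_ghz: float) -> float:
    """Estimate compile time at a new clock.

    Assumption: run time is proportional to 1 / frequency, which only
    holds while the working set still fits in the L2 cache.
    """
    return base_minutes * base_ghz / new_ghz


if __name__ == "__main__":
    # Hypothetical baseline: a 10-minute compile at 2.4 GHz.
    for ghz in (2.4, 2.66, 3.1):
        minutes = scaled_compile_time(10.0, 2.4, ghz)
        print(f"{ghz:.2f} GHz -> {minutes:.1f} min")
```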

P&R is one of the few problems left where we still don't have enough single-thread performance and where it is fully justifiable to spend more than half your budget on the CPU alone. (Music to Intel's ears. They just need to get the world hooked on FPGAs :-).

Tommy

- H. Peter Anvin
  
Sat, Jan 12, 2008 9:33 PM

Well, fixed 128-bit width is pretty much the same thing as dual 64-bit with a fixed interleaving ratio. Specifically, interleaving at 8-byte boundaries.

-hpa

- Eric Smith
  
Sun, Jan 13, 2008 7:10 AM

Historically, having two banks of interleaved memory meant that if bank 0 was busy reading or writing an even word address, bank 1 could start an access to ANY odd word address, not just n+1. It is my understanding that that was the reason for inventing interleave rather than simply making the memory word longer.
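A toy model of that classic interleave can make the point concrete. The sketch below is not any real memory controller: it assumes low-order interleaving (bank = word address mod number of banks), in-order issue of at most one access per cycle, and a made-up bank cycle time of 4 cycles.

```python
def finish_time(addresses, banks, bank_cycle=4):
    """Return the cycle at which the last access completes.

    Assumptions (illustrative only): word addresses map to banks by
    addr % banks, accesses issue in order, at most one per cycle, and a
    bank is busy for bank_cycle cycles per access.
    """
    free_at = [0] * banks   # cycle at which each bank becomes idle
    now = 0                 # earliest cycle the next access may issue
    done = 0
    for addr in addresses:
        bank = addr % banks             # even -> bank 0, odd -> bank 1 when banks == 2
        start = max(now, free_at[bank]) # wait if the target bank is still busy
        free_at[bank] = start + bank_cycle
        done = max(done, free_at[bank])
        now = start + 1
    return done


if __name__ == "__main__":
    mixed = [0, 7, 2, 13]   # alternating even/odd, not consecutive addresses
    evens = [0, 2, 4, 6]    # every access lands on bank 0
    print(finish_time(mixed, banks=2))  # accesses overlap across the two banks
    print(finish_time(mixed, banks=1))  # single bank, fully serialized
    print(finish_time(evens, banks=2))  # bank conflicts, no benefit
```

The mixed sequence overlaps accesses even though the odd addresses are unrelated to the even ones, which a simple double-width word could not do.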

HPA suggested to me in private email that that sort of interleave didn't seem useful when the cache is transferring whole cache lines, to which I replied:

- glen herrmannsfeldt
  
Mon, Jan 14, 2008 7:24 AM

(snip)

I first learned about interleaving reading about the IBM 360/91, which has 16-way interleaved 750ns memory and a 60ns processor cycle time. I believe it is 64 bits wide. The design goal was one instruction per clock cycle, which tended to require one 64-bit doubleword per cycle (for 64-bit floating point operations).

That was before cache, which first appeared on the 360/85.
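The 360/91 figures above show why so many banks were needed. A quick back-of-the-envelope sketch, using only the 750ns and 60ns numbers quoted here and assuming perfectly spread, conflict-free accesses (the helper names are invented for this example):

```python
import math


def banks_needed(memory_cycle_ns: float, cpu_cycle_ns: float) -> int:
    """Minimum number of banks to start one access every CPU cycle.

    Assumption: accesses are spread evenly over the banks with no conflicts.
    """
    return math.ceil(memory_cycle_ns / cpu_cycle_ns)


def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n, so bank selection is a plain bit field."""
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    minimum = banks_needed(750, 60)   # 750 / 60 = 12.5, rounded up
    print(minimum, next_power_of_two(minimum))
```

Thirteen banks would be the bare minimum, and rounding up to a power of two gives the 16-way interleave.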

-- glen