CPU benchmark for Xilinx PAR

Here's a benchmark for PAR (high effort level) running on two different CPUs. The design utilized about 40% of an XC2V4000-5 and had some difficult-to-meet timing constraints. PAR's peak memory usage was ~500 MB.

Intel Pentium D 830 (3.0 GHz), 2 GB RAM: Total CPU time to PAR completion: 2 hours 32 mins

AMD Athlon 64 4000+ (2.4 GHz), 2 GB RAM: Total CPU time to PAR completion: 1 hour 2 mins

I was blown away by the result. I was expecting a modest speed increase with the AMD (maybe 1.3x, if you go by the model number) but certainly not 2.5x. Based on this benchmark, the AMD CPU should actually be called a 7500+. :)

The Pentium is a dual-core and the AMD is a single-core, but the Xilinx software utilizes only one core, so this is a fair comparison of raw processor speed.

The Pentium probably gets killed by its deep pipelines. I'd guess that PAR, like most real-world apps, consists mainly of spaghetti code rather than regular loops processing masses of similar data. So the Pentium spends a lot of its time flushing pipelines because of mispredicted branches and such. It probably suffers from its higher memory access latency as well.

It sure would be nice if Xilinx could make their software multithreaded... then an Athlon X2 4800+ would really scream. As it is, I'd guess that an Athlon FX-57 (2.8 GHz) will give the fastest PAR performance currently possible.

-Paul

Reply to
Paul Gentieu

Generally we use Athlon64-based machines and are very impressed. We have not done a comparison recently, but what we found in previous benchmarking is that some parts of the process are better done on different processors. So if you have a really big design you may want to split the work between 2 machines, one Intel-based and the other AMD. I'm sure that some of you "geek" script writers could figure out a script to automate this.

It would be interesting to try a single core Pentium Extreme against the FX-57.

John Adair Enterpoint Ltd. - Home of Broaddown2. The Ultimate Spartan3 Development Board.


Reply to
John Adair

Very interesting

I really doubt it's the branch behaviour, even though the Athlon series has always been good on twisty office-type apps. For branchy code segments that fit in the I-cache, these days the branches almost come for free and guess right more often than not.
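That said, anyone who wants to see what a mispredict costs when prediction does fail can run the classic test: conditionally sum the same bytes in random order and then sorted, so the branch goes from unpredictable to trivial. Here's my own illustrative C sketch (compile at low optimization, or the compiler may turn the branch into a cmov and hide the effect):

    /* Same work, different branch predictability: conditionally sum
       random bytes, then sort them and sum again. The 50/50 branch is
       hostile to the predictor in the first pass, trivial in the second. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int cmp(const void *a, const void *b)
    {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    static double timed_sum(const unsigned char *v, size_t n, long *sum)
    {
        struct timespec t0, t1;
        long s = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int pass = 0; pass < 100; pass++)
            for (size_t i = 0; i < n; i++)
                if (v[i] >= 128)          /* ~50/50 branch on random data */
                    s += v[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        *sum = s;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static unsigned char v[N];
        long s1, s2;
        for (size_t i = 0; i < N; i++)
            v[i] = (unsigned char)rand();

        double random_time = timed_sum(v, N, &s1);
        qsort(v, N, 1, cmp);              /* branch now predictable */
        double sorted_time = timed_sum(v, N, &s2);

        printf("random order: %.3f s   sorted: %.3f s   (sums %ld %ld)\n",
               random_time, sorted_time, s1, s2);
        return 0;
    }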

I'd hazard a guess it has more to do with the data set being very large and missing the L1, L2, and TLBs way too often ("poor locality of reference"); even 1% misses, maybe less, may be enough to wreak havoc.

It's not difficult to create a simple data structure that holds millions of items in a hash table and see even an Athlon XP2400 give up 300 ns average accesses to each entry if all accesses appear random, rather than the naive 1 ns its L1 cache can actually do.

You can plot a graph of random address width from 6 bits to 24 bits and watch execution time for x[i] go from 1 ns to 4 ns and then step roughly through 30 ns, 100 ns, and 300 ns, where i comes from any old random number generator masked by the width field. Measured on an XP2400.
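In C the test is only a few lines. This is a from-memory sketch of the idea, not the exact code I measured with, so take the constants as assumptions:

    /* Sweep the working set from 2^6 to 2^24 ints and time random
       accesses; the ns/access figure steps up as the set falls out of
       L1, then L2, then the TLB coverage. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        static int x[1 << 24];            /* 64 MB, zero-initialized */
        const long accesses = 10 * 1000 * 1000;
        volatile long sink = 0;           /* keep the loads alive */

        for (int width = 6; width <= 24; width++) {
            unsigned mask = (1u << width) - 1;
            unsigned seed = 12345;        /* cheap inline LCG */
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long n = 0; n < accesses; n++) {
                seed = seed * 1103515245u + 12345u;
                sink += x[(seed >> 4) & mask];
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                         (t1.tv_nsec - t0.tv_nsec)) / accesses;
            printf("width %2d: %9u entries, %6.1f ns/access\n",
                   width, mask + 1u, ns);
        }
        return 0;
    }

The LCG adds a nanosecond or two of overhead at the small widths, but the staircase at the L1, L2, and TLB boundaries is unmistakable.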

If this simple test were run on various CPUs, we could see how the caching really behaves as the locality gets progressively worse, and choose accordingly.

Now EDA software doesn't deliberately do this, but it might get some of the same effect unintended simply by having to walk immense graphs and trees. Think about it: draw a graph with millions of nodes and try to label it in such a way that it can be traversed with mostly low address-bit changes (high locality) when the nodes in the graph are allocated in completely random fashion. Then think how many operations actually get performed on each linked-list traversal; a lot of the time it might be just passing through looking for something, the worst possible situation, all fetch and no work.
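The pure "passing through" case is easy to isolate too. A sketch (illustrative, names my own) that chases a pointer through nodes linked in a random permutation, so every hop is a likely miss:

    /* Chase a pointer through N nodes linked in random order: the CPU
       does nothing but wait on memory, the "all fetch, no work" case. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node { struct node *next; };

    int main(void)
    {
        const size_t N = 1 << 22;              /* ~4M nodes */
        struct node *pool = malloc(N * sizeof *pool);
        size_t *perm = malloc(N * sizeof *perm);
        if (!pool || !perm)
            return 1;

        /* Fisher-Yates shuffle (64-bit LCG) for a random visit order. */
        for (size_t i = 0; i < N; i++)
            perm[i] = i;
        unsigned long long s = 88172645463325252ull;
        for (size_t i = N - 1; i > 0; i--) {
            s = s * 6364136223846793005ull + 1442695040888963407ull;
            size_t j = (size_t)((s >> 16) % (i + 1));
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i + 1 < N; i++)
            pool[perm[i]].next = &pool[perm[i + 1]];
        pool[perm[N - 1]].next = &pool[perm[0]];   /* close the cycle */

        struct node *p = &pool[perm[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t n = 0; n < N; n++)
            p = p->next;                           /* all fetch, no work */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / N;
        printf("%.1f ns per hop (end %p)\n", ns, (void *)p);
        free(pool);
        free(perm);
        return 0;
    }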

I don't imagine there is much EDA code that looks like beautiful DSP media-codec stuff with super-straight-line, high-locality, SSE-tuned code.

I could be all wrong, but I think it's the Memory Wall effect, and the Opteron maybe does a better job of recovering. That also means a CPU that concentrates on that aspect doesn't even need a clock advantage, as long as it tolerates poor locality better.

I wonder if it's possible to get stats from the CPU performance hardware that show what the CPU is really doing in memory; a bit out of my league.

I wonder if the EDA guys just crank out code, or do they ever measure algorithms on different x86 hardware at the cache level? Curious.

I also wonder how much the FPU is actually used, and how.

On a threaded CPU designed to work with threaded memory, where there is little memory wall (latency tolerance all around), it doesn't take much hardware to design a processor element in an FPGA that can match an Athlon XP300, and 10 or so ganged together can then match an XP3000, but you get 40-odd threads to fill instead of waiting on cache misses. Me, I'd rather fill the threads (occam style) than wait, but most are not of that opinion (yet).

Now if EDA ever becomes highly concurrent (some have done this in VLSI EDA, from simulation to P&R), it makes possible some real speedups when real threading becomes pervasive in CPUs (not this 2-4 thread nonsense).

johnjakson at usa dot ... transputer2 at yahoo dot ...

Reply to
JJ


That's consistent with what I've seen. Note the 4000+ has a 1 MB cache, which is critical for the performance of EDA codes. For NCVerilog I've found that when recordvars is off there is a 2-to-1 difference between an A64 with a 1 MB cache and one with a 1/2 MB cache. I now have a 4400+ in addition to the 3400+ and the 3800+ shown on my benchmark page.


I haven't updated my benchmark page with the 4400+ results, but they are consistent with the other results. The 4400+ is about 10% faster than the 3400+ on single-threaded jobs like NC or Xilinx place and route, which is exactly what you would expect given that each core in the 4400+ runs at the same clock speed and has the same cache size (1 MB) as the 3400+, but it has dual memory channels vs. a single channel on the 3400+.

Reply to
B. Joshua Rosen

Paul,

You are not the first to be amazed by this result. I can only add that I was not able to persuade my management to give me a dual AMD64 machine, due to some unfixable bug (in the management), so I have only a P4 :(

I am sure that Xilinx software is always being developed && improved to match any future stuff.

Vladislav

Reply to
Vladislav Muravin


PAR is multithreaded, use the -m switch.

Reply to
B. Joshua Rosen

The -m switch does not work on Windows, according to the documentation. That's silly, because they should be using cross-platform code anyway. A decent Windows pthread library utilizing termination drivers is not that expensive.

I fully agree that they should be using SSE, SSE2, SSE3, 3DNow!, etc., along with Intel's and AMD's math/DSP libraries. Even if they have to ship different EXEs for each processor, it would totally be worth it.
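Shipping one EXE that dispatches at startup would do it too. A minimal sketch of the idea in C with GCC inline asm (the routine split is hypothetical, and on 32-bit -fPIC builds the EBX clobber needs extra care):

    /* Pick a code path at runtime from CPUID: SSE2 is bit 26 of EDX
       for CPUID leaf 1. Sketch only; real code would check the max
       supported leaf first. */
    #include <stdio.h>

    static int has_sse2(void)
    {
        unsigned eax, ebx, ecx, edx;
        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(1));
        return (edx >> 26) & 1;
    }

    int main(void)
    {
        if (has_sse2())
            puts("SSE2 present: dispatch to the vector-tuned routines");
        else
            puts("no SSE2: fall back to the plain x86 build");
        return 0;
    }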

Reply to
Brannon


When I used the -m switch a while back on our Unix system, I was able to specify a node list of different hosts to run the multi-pass place & route on more than one machine, but I couldn't utilize multiple cores in one host. I also can't use more than one host (or core) for one long place & route job; the -m switch is specifically for multi-pass place & route (which, by the way, doesn't have the option to use multiple mapper seeds!).

Reply to
John_H

This PAR feature is called the Turns Engine, and it was never designed to support multiple jobs on a single machine. You can get around this by using a hostname alias in the node list file, or by tricking PAR with variations of mixed case in the node list file. For example, with a four-processor machine named "speedy", the following node list file would allow four concurrent jobs to run:

speedy
Speedy
SPeedy
SPEedy

Xilinx Answer Record 10511 covers this.
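With that saved as, say, nodelist.txt (one entry per line), the run would be kicked off with something along these lines; this is from memory, so check the PAR documentation for your ISE version, as the exact option letters and ordering may differ:

    par -m nodelist.txt -n 4 -ol high design.ncd results.dir design.pcf

Here -n gives the number of place & route passes to farm out across the entries in the node list.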

Regards, Bret

Reply to
Bret Wade

Interesting! Your experience seems consistent with relatively small RTL (behavioral) designs. For SOC/ASIC designs of any reasonable size, I've found the difference between 0.5 MB and 1 MB cache to be nonexistent (because the working data set already exceeds the larger cache.)

I think one problem is the difficulty in producing publishable benchmarks. In the case of NC-Verilog and even Verilog-XL, I've found benchmarks almost useless. It's easy to find a test case which runs 30-40% faster on a puny Pentium3/S 1.26 GHz (512K L2 cache) than on an UltraSPARC III 750 MHz (Linux 32-bit vs. Solaris 32-bit). And likewise, it's just as easy to find a second RTL test case where the USIII literally crushes the Pentium3 (2x as fast). For SDF back-annotated gate-level simulations, the results even out, with the US3 marginally faster than the Pentium3. And the US3 has an 8 MB CPU cache (don't remember whether it's L2 or L3), so it looks like the pace of design-database 'bloat' already outpaces CPU cache-size improvements.

In the case of Design Compiler and PrimeTime, there seems to be less variation in runtimes (for a given design compared on two different platforms). For almost all of the customer Verilog RTL designs I've crunched through DC, PrimeTime, and TetraMAX, the wimpy Pentium3 1.26 GHz outperforms the UltraSPARC III 750 MHz. I'd suspect Xilinx's PAR shares a performance profile similar to Design Compiler.

Incidentally, we've found the Athlon64 gets a 20-30% performance boost from 64-bit Linux versions of EDA tools (vs. 32-bit Linux). This was our conclusion after re-running quite a few Verilog simulation and synthesis jobs. It's curious to see the Intel EM64T CPUs take a small performance hit in the same 64-bit Linux apps! Aside from the small increase in RAM footprint, the 64-bit EDA tools on the Athlon64 always come out ahead.

Looks like I'll need to rethink my upgrade plans. Originally I planned on 'cheapskating' on a system upgrade (AMD dual-core X2 3800+, 2.0 GHz, 512K), but it seems FPGA tools benefit quite a bit from the extra 512K of cache.
Reply to
concerned_altera
