FPGA tool benchmarks on Linux systems

I've put together a webpage on the performance of NCSim and Xilinx tools on various systems, specifically a dual PIII, dual Xeon, Athlon 64 3400+ and an Athlon 64 3800+.

formatting link

Reply to
B. Joshua Rosen

A nice testbench but...

  1. What's the amount of memory used on each system?
  2. I think you misused the term "cpu-bound." Rather, these are all memory-bound computations. Had they been purely cpu-bound, the Xeon machines might have won. The point you really want to make is that the AMD cpus have an on-chip memory controller and thus much lower memory latency, which makes memory-intensive applications run much faster than on Intel cpus.
  3. You are comparing state-of-the-art AMD workstations with mediocre Intel servers. It's like comparing oranges with apples.

Lastly, might I ask, are you affiliated with AMD?

-jz

Reply to
Jason Zheng

I have some measurements with more current Xeon processors. Unfortunately, I had only one not-so-state-of-the-art Opteron to measure.

I published these results at a local Mentor Graphics conference (these are only a small part of the numbers). The simulations were done with ModelSim for a ~8 Mgate chip (plus all memories). The numbers are simulation times in seconds.

RTL, one CPU active:
  Sun V880 UIII/900       3531
  P4 Xeon 2.2/512k        2224
  P4 Xeon 2.4/512k        2087
  P4 Xeon 2.8/512k        1928
  P4 Xeon 3.06/512k       1634
  P4 Xeon 3.4EMT (32b)    1239
  AMD Opteron 848 (32b)   1584

RTL, both CPUs active:
  Sun V880 UIII/900       3520
  P4 Xeon 2.2             2540
  P4 Xeon 2.4             2680
  P4 Xeon 2.8             2650
  P4 Xeon 3.06            2120
  P4 Xeon 3.4EMT (32b)    1450
  AMD Opteron 848 (32b)   1587

One thing that amazes me is that on the Xeons, even with RTL simulation, the performance degrades very quickly. I guess with four processors the Xeons would degrade very badly. On the Opterons there was no degradation to be seen.

For the gate-level simulations the results are almost identical, although the dataset is 15-20x larger and the simulation times for the same case are longer. Also, when 64-bit mode was used the Opteron became faster and the Xeon EMT became a little slower (very small differences compared to 32-bit mode, though).

--Kim

Reply to
Kim Enkovaara

Thanks for all the benchmarks! Very interesting information!

If I interpret the data correctly, two CPUs result in the same simulation time, so they are of no benefit? That's a pity!

BR, Chris

Kim Enkovaara wrote:

Reply to
Christian Schneider

I was told that the benefit of two CPUs is that you can run another application while simulating and not have your computer slow down, because the other application will run on the other CPU.


Reply to
Dave Colson

ACK. But it would have been nice to see multithreaded simulations, which would benefit from more CPUs, especially since simulations are parallelizable and some vendors support clusters. So I think this is just a small step ... which has not been taken yet. And the dual-core CPUs are ante portas ...

BR, Chris

Reply to
Christian Schneider

That depends on the process scheduler in the kernel. The most significant benefit is that a multi-threaded application such as an HDL simulator can run multiple threads at the same time. For example, the following Verilog structure lends itself to multi-threading:

fork
  begin
    ...
  end
  begin
    ...
  end
join

Now, whether that piece of code is actually run as two threads is up to the HDL compiler's design. It might just run as a single thread. Even if it were run as two threads, the kernel might decide that another process is more important and give only one cpu to the HDL simulator.

The real advantage of the AMD cpu is that each cpu has its own memory interface and a high-bandwidth link to the others. Intel cpus have a different architecture, where all cpus share the same memory interface. AMD's design is more scalable: adding cpus to the mainboard only slightly affects the memory bandwidth each cpu receives, whereas Intel cpus get much less memory bandwidth as the number of cpus goes up. Although the memory bandwidth can be improved with a higher FSB frequency (1066 MHz now) and larger L2 (2 MB now) and L3 caches (8-16 MB?), the Intel approach does not scale. This is why the AMD Opteron can easily be found in 8-way configurations, while you rarely even see a quad Xeon.

-jz

Reply to
Jason Zheng

Parallel simulators are apparently a much harder problem than you might suspect. A number of years ago I was discussing this issue with the CTO of IKOS (since bought by Mentor). To me it seemed that simulation should be a highly parallel problem, but he claimed that there had been a number of attempts at parallel software simulators (as opposed to hardware acceleration engines) and that no one had succeeded. With the advent of multi-core processors this year I suspect that the issue will be revisited. In the meantime, a dual-processor machine is useful for running multiple simultaneous simulations, like regression suites, assuming that you have more than one license.

Reply to
B. Joshua Rosen

I think we should all encourage the FPGA and EDA tool vendors to adapt their software to parallel algorithms (especially place and route), as the dual cores are really coming soon and most of us will buy the fastest machine we can get for reasonable money. In fact, a parallel algorithm would already help a little bit today on P4s with hyper-threading.

Regards,

Thomas

formatting link

"B. Joshua Rosen" schrieb im Newsbeitrag news: snipped-for-privacy@polybus.com...

Reply to
Thomas Entner

Place-and-route can already be done in a pseudo-parallel fashion with Xilinx's modular design flow. You can simply run two processes that each par a part of the design, and maybe even write a script to automate that process. A dual-core P4 or AMD64 should have no problem doing that sort of thing.
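A minimal Python sketch of such a script might look like the following. The module directories and the par arguments are only placeholders; substitute whatever your modular design flow actually produces:

#!/usr/bin/env python
# Sketch: launch one par process per module directory and run them in parallel.
# "module_a"/"module_b" and the NCD/PCF file names are hypothetical placeholders.
import subprocess

modules = ["module_a", "module_b"]
jobs = []
for module in modules:
    cmd = ["par", "-w", "mapped.ncd", "routed.ncd", "design.pcf"]
    jobs.append(subprocess.Popen(cmd, cwd=module))   # both runs proceed concurrently

# Wait for every run to finish and report its exit status.
for module, job in zip(modules, jobs):
    print("%s: par exited with status %d" % (module, job.wait()))

On a dual-core or dual-cpu box the two par processes simply land on different cpus; no support from the tool itself is needed.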

Of course, you can't do that for all EDA tools as you will soon run into licensing problems.

-jz

Thomas Entner wrote:

Reply to
Jason Zheng

Savant tries to do parallel VHDL simulation:

formatting link

"The SAVANT project has been integrated with UC's WARPED parallel simulation research project and provides an end-to-end VHDL-to-batch simulation capability. WARPED provides a general purpose discrete event simulation API that can be executed in parallel or sequentially. Built on top of WARPED is a VHDL simulation kernel called TyVIS that links with the C++ code generated from SAVANT for batch sequential or parallel simulation."

But as it is a research project, I don't know how well it succeeds.

Reply to
Tuukka Toivonen

The 2 CPU result is 2 copies of the same simulation running at the same time. There are no multithreaded RTL simulators available commercially.

That measurement shows how the memory bus and machine architecture scale.

--Kim

Reply to
Kim Enkovaara

I also had a similar discussion with a Mentor Graphics Chief Technologist a few years ago, and the story was identical. He said that there have been many different startups (and R&D projects inside bigger companies) that tried to build a parallel simulator, but none succeeded well enough.

My impression was that the hard problem is how to partition the design so as to minimise the events that need to be communicated between the threads. That communication latency was the killer for performance.
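As a toy illustration of that point (my own made-up example, not anything the vendors actually tried), the following Python snippet counts how many events would have to cross a thread boundary under two different partitionings of a hypothetical netlist. A bad cut turns almost every event into inter-thread communication:

# Toy example: each event is (driving block, receiving block).
# The block names and the event list are invented for illustration only.
events = [
    ("cpu", "cache"), ("cpu", "cache"), ("cache", "cpu"),
    ("cpu", "alu"),   ("alu", "cpu"),   ("uart", "io_pad"),
    ("cpu", "uart"),
]

def cross_thread(events, partition):
    # Count events whose driver and receiver are mapped to different threads.
    return sum(1 for src, dst in events if partition[src] != partition[dst])

# Cut A splits the tightly coupled cpu/cache/alu cluster across threads.
cut_a = {"cpu": 0, "cache": 1, "alu": 1, "uart": 0, "io_pad": 1}
# Cut B keeps that cluster together and moves only the slow peripherals.
cut_b = {"cpu": 0, "cache": 0, "alu": 0, "uart": 1, "io_pad": 1}

print(cross_thread(events, cut_a))   # 6 of 7 events cross threads
print(cross_thread(events, cut_b))   # only 1 of 7 events crosses threads

Every crossing event costs synchronization between the threads, so with a cut like A the two cpus spend their time waiting on each other instead of simulating.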

--Kim

Reply to
Kim Enkovaara

I would rather have two single-CPU systems. In some cases it's cheaper.

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Reply to
Petter Gustad

I find this quite surprising as well. It has been discussed a few times in comp.lang.verilog, for example in

formatting link

I think many of the EDA vendors are expecting linear speedup so they can apply a linear pricing policy. If the license cost were flat, i.e. if they viewed a cluster as a single fast machine, I would be happy to accept less-than-linear speedup and just throw sub-$1k PCs at my simulation to increase its performance.

I'm also surprised that there aren't many parallel synthesis and place & route tools out there either (Xilinx par supports a very coarse-grained parallelism). It must be a great opportunity for MPI programmers with EDA knowledge...

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Reply to
Petter Gustad

Actually, the Xilinx par tool has supported coarse-grained parallelism for many years (using the -m option, on Solaris that is). I remember having my par job running on half a dozen or so dual and quad SPARC systems.

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Reply to
Petter Gustad

Part of the problem might be that past attempts at parallelism were trying to use large numbers of processors. Speedups of 2x, which is the absolute maximum that you can get on a dual-processor system, aren't all that interesting; you certainly wouldn't have bought a special machine just to get a 2x speedup. You might have bought a 32-processor server if you could get a 20x speedup on it, but trying to break up a simulation over that many processors isn't very doable. With the introduction of dual cores the equation changes. Soon every midlevel and higher PC is going to come with two processors, and the workstation-class machines will all have a pair of dual-core processors. Getting a simulator to take advantage of two to four very tightly coupled processors should be a lot easier than getting it to scale to 32 or 64 loosely coupled cpus. Also, the potential market is larger because everyone will have at least two processors in their systems.

Reply to
B. Joshua Rosen

Well, at my previous workplace, there was a dual Pentium3/S (1.26GHz, 512K cache) server. Running two *independent* memory-intensive jobs simultaneously basically incurred a 60% performance hit. Another way to put it: if job A takes 1 hour by itself, and job B takes 1 hour by itself, launching both A+B simultaneously causes the completion time to increase to 1.6 hours (for both jobs). ACKK!!! (This was for NC-Verilog 4.0.)


That's very interesting to know. At my current workplace, we have an 'unofficial' (i.e., unsanctioned by management -- we're a Solaris department!) Athlon/64 3200+. From my firsthand experience, Cadence's WarpRoute and Buildgates/PKS5 benefit tremendously from 64-bit x86_64 vs 32-bit IA32 mode, something like a +30% boost in throughput. (The job's RAM footprint increases a bit, as expected and noted in the product's documentation.) The main problem is that a lot of older CAD tools just "don't work right" under the 64-bit Linux O/S. Ironically, our ancient "signalscan" waveform viewer still runs, while our tool guy can't figure out why Tetramax U-2003.06-SP1 refuses to work...

It's really funny when a manager comes along and asks why the engineers like the Athlon/64 so much, and the engineers tell him "because it runs a synthesis job up to 3x faster than our fastest Solaris boxes."

Reply to
nonoe

Another research project in this arena is DVS.

formatting link

and

formatting link

And as mentioned, the communication bottleneck is an issue.

Maybe the IBM Cell Processor is what we've been waiting for? :-)

formatting link

/Ed

Reply to
EdA
