Very interesting
I really doubt it's the branch behaviour, even though the Athlon series has always been good on office-type twisty apps. For branchy code segments that fit in the I-cache, these days the branches come almost for free and the predictor guesses right more often than not.
I'd hazard a guess it has more to do with the data set being very large and missing the L1, L2 and TLBs far too often, i.e. "poor locality of reference". Even a 1% miss rate, maybe less, can be enough to wreak havoc.
It's not difficult to create a simple data structure that holds millions of items in a hash table and see even an Athlon XP2400 give up 300ns average access time per entry when the accesses appear random, rather than the naive 1ns its L1 cache can actually do.
You can plot a graph of random address width from 6 bits to 24 bits and watch the access time for x[i] go from 1ns to 4ns and then step roughly through 30ns, 100ns, 300ns, where i comes from any old random number generator masked down to the width field. Measured on an XP2400.
If this simple test were run on various cpus (a rough sketch of it is below), we could see how the caching really behaves as the locality disaster graduates, and choose hardware accordingly.
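To make that concrete, here is a minimal sketch of such a test in C. It is my own reconstruction, not the code actually measured above, so the iteration count, the cheap LCG used for the random index, and the clock_gettime timing are all assumptions on my part; the shape of the curve is what matters, not the exact numbers.

/* Sketch: random accesses into a table of 2^width ints, width swept 6..24.
 * Average ns per access should jump as the table outgrows L1, L2 and
 * finally TLB coverage. Assumptions: Linux/POSIX timing, gcc -O2. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    const long iters = 10 * 1000 * 1000;     /* accesses per width */

    for (int width = 6; width <= 24; width++) {
        size_t n = (size_t)1 << width;       /* table of 2^width ints */
        uint32_t mask = (uint32_t)(n - 1);
        int *x = malloc(n * sizeof *x);
        if (!x) { perror("malloc"); return 1; }
        for (size_t j = 0; j < n; j++)
            x[j] = 1;                        /* touch every entry once */

        struct timespec t0, t1;
        volatile long sum = 0;               /* keep the loads alive */
        uint32_t r = 12345;                  /* cheap LCG as the "any old" RNG */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long k = 0; k < iters; k++) {
            r = r * 1664525u + 1013904223u;
            sum += x[r & mask];              /* random index masked by width */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / iters;
        printf("width %2d  (%8zu entries, %6zu KB): %6.1f ns/access\n",
               width, n, n * sizeof *x / 1024, ns);
        free(x);
    }
    return 0;
}

Compile with something like gcc -O2 and watch the ns/access figure climb in rough steps as the working set blows past each level of the hierarchy.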
Now EDA software doesn't deliberately do this, but it might get some of the same effect unintentionally simply by having to walk immense graphs and trees. Think about it: draw a graph with millions of nodes and try to label it so that it can be traversed with mostly low address-bit changes (high locality) when the nodes are allocated in completely random fashion. Then ask how many operations actually get performed on each linked-list traversal; a lot of the time it might be just passing through looking for something, the worst possible situation: all fetch, no work.
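For what it's worth, here is a toy sketch (my own, not from any real EDA codebase) of that pointer-chasing pattern: a linked list whose nodes land at effectively random addresses, walked while doing nothing per hop but one compare.

/* Illustration of "all fetch, no work": chase a linked list whose nodes
 * sit in a randomly permuted order, so nearly every hop is likely a
 * cache/TLB miss and the only work per hop is a single compare. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int key;                 /* real EDA nodes carry much fatter payloads */
};

/* The traversal is pure pointer chasing until the key is found. */
static struct node *find(struct node *head, int key)
{
    for (struct node *p = head; p; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}

int main(void)
{
    size_t n = (size_t)1 << 22;              /* ~4M nodes */
    struct node *pool = malloc(n * sizeof *pool);
    size_t *perm = malloc(n * sizeof *perm);
    if (!pool || !perm) return 1;

    /* Random permutation of node positions, roughly what you end up with
     * after heavy allocator churn while the graph is built. */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1); /* crude, but enough to scramble */
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++) {
        pool[perm[i]].key  = (int)i;
        pool[perm[i]].next = &pool[perm[i + 1]];
    }
    pool[perm[n - 1]].key  = (int)(n - 1);
    pool[perm[n - 1]].next = NULL;

    /* Look for the last key, i.e. walk the whole list doing no real work. */
    struct node *hit = find(&pool[perm[0]], (int)(n - 1));
    printf("found key %d\n", hit ? hit->key : -1);

    free(perm);
    free(pool);
    return 0;
}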
I don't imagine there is much EDA code that looks like beautiful DSP media-codec stuff with super straight-line, high-locality, SSE-tuned code.
I could be all wrong, but I think it's the Memory Wall effect, and the Opteron maybe does a better job of recovering. That also means a cpu that concentrates on that aspect doesn't even need a clock advantage, as long as it tolerates poor locality better.
I wonder if it's possible to get stats from the cpu performance counters that show what the cpu is really doing in memory; a bit out of my league.
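It is possible, at least on a modern Linux box; below is a rough sketch using the perf_event_open syscall to count cache references and misses around a workload. This is entirely my assumption about how one would do it today, nothing to do with Athlon-era tooling, and it uses the generic portable event names rather than any AMD-specific counters; it may also need relaxed perf_event_paranoid settings or root.

/* Sketch: count cache references and misses around a workload using
 * Linux perf_event_open. Generic hardware events, current process only. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = type;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid=0: this process, cpu=-1: any cpu, no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_miss = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    int fd_ref  = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_REFERENCES);
    if (fd_miss < 0 || fd_ref < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_miss, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_ref,  PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_ref,  PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the workload of interest here, e.g. the random-access test ... */

    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_ref,  PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0, refs = 0;
    if (read(fd_miss, &misses, sizeof misses) != sizeof misses ||
        read(fd_ref,  &refs,   sizeof refs)  != sizeof refs) {
        perror("read");
        return 1;
    }
    printf("cache refs: %llu  misses: %llu\n",
           (unsigned long long)refs, (unsigned long long)misses);
    return 0;
}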
I wonder if the EDA guys just crank out code, or do they ever measure their algorithms on different x86 hardware at the cache level? Curious.
I also wonder how much FPU is actually used, and how?
On a threaded cpu designed to work with threaded memory, where there is little memory wall (latency tolerance all around), it doesn't take much hardware to design a processor element in an FPGA that can match an Athlon XP300, and 10 or so ganged together can then match an XP3000, but you get 40-odd threads to fill instead of waiting on cache misses. Me, I'd rather fill the threads (occam style) than wait, but most are not of that opinion (yet).
Now if EDA ever becomes highly concurrent (some have done this in VLSI EDA, from simulation to P&R), it does make possible some real speed-ups when real threading becomes pervasive in cpus (not this 2-or-4-thread nonsense).
johnjakson at usa dot ... transputer2 at yahoo dot ...