Thanks everyone, this is really interesting, but please don't stop posting if you have more insights to share!
FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus). Another test run of a double-precision DSP simulation (compiled C) ran substantially slower on the Opteron, which I thought was supposed to have better floating-point performance than Xeons of that era. Maybe the gcc -O5 optimization switches are just tuned for Intel rather than AMD, or maybe my Quartus P&R step is dominated by integer calculations.
I originally suspected P&R might involve a lot of floating-point calculations (even prior to signal-integrity considerations) if they were doing any kind of physical synthesis (e.g., delay calculation based on distance and fanout); ditto for STA, because that's usually an integral part of the P&R loops. I also suspected that if floating-point operations (at least multiplies, add/subtract, and MACs) could be done in a single cycle, there would be no advantage to using integer arithmetic instead (especially if manual, or somewhat explicit, integer scaling is required).
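To make the "explicit integer scaling" point concrete, here's a hedged C++ sketch (names and the 1/1024-ps resolution are made up for illustration, not from any real tool): delays are kept as scaled integers, so every multiply needs a manual rescale that floating point would handle implicitly.

```cpp
#include <cstdint>

// Hypothetical fixed-point delay arithmetic: delays stored as integers
// in units of 1/1024 of a picosecond (an arbitrary resolution chosen
// here for illustration).
constexpr std::int64_t kScale = 1024;  // fractional steps per ps

std::int64_t to_fx(double ps) {        // convert a real delay to ticks
    return static_cast<std::int64_t>(ps * kScale);
}

std::int64_t fx_mul(std::int64_t a, std::int64_t b) {
    return (a * b) / kScale;           // rescale manually after the multiply
}
```

With single-cycle floating-point multiplies, this bookkeeping buys you nothing; the rescale step is exactly the overhead the paragraph above is talking about.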
On the other hand, in something like a router, you can represent locations such as grid coordinates exactly with integers in a way you can't with floating point. As far as dynamic range is concerned, I seem to recall that SystemC standardized on 64-bit time to run longer simulations, but SystemC is a different animal in that regard anyway. Nonetheless, I also seem to recall that its implementation of time was scaled 64-bit integers, because FPU operations only represent integers exactly over the 53-bit mantissa. Assuming they want a linear representation of time ticks, I can see the appeal of using 64-bit integers in simulation.
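The 53-bit mantissa point is easy to demonstrate (this is just an illustrative check, not SystemC code): a double holds integers exactly only up to 2^53, so consecutive simulation ticks above that collapse together, while a 64-bit integer keeps counting.

```cpp
#include <cstdint>

// True once adding one tick to a double rounds away: the simulation
// time can no longer advance in unit steps.
bool double_loses_next_tick(double t) {
    return t == t + 1.0;
}

// Never true below the 64-bit limit: every tick stays distinct.
bool uint64_loses_next_tick(std::uint64_t t) {
    return t == t + 1ULL;
}
```

At 2^53 (about 9.0e15 ticks) the double version already fails, which at a 1 ps resolution is only a few hours of simulated time; the 64-bit integer keeps going for another eleven bits' worth.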
As far as event-driven simulations are concerned, I totally understand how hard it is to make good use of multithreading or multiprocessing, because everything is so tightly coupled in that evaluate/update/reschedule loop. If you were working at a much higher level (behavioral/transaction), where the number of low-level events is lower and the computation behind "complex" events takes up a much larger portion of the evaluate/update/reschedule loop, then multicore/multiprocessing solutions might be more effective for simulation. (Agree/disagree?) It seems that as you get more coarse-grained with the simulation, even distributed processing (multiple machines on a network) becomes more feasible. Obviously the scheduler has one "core" and has to reside in one CPU/memory space, but if it has less work to do, then it can handle less frequent communication with the event-processing CPUs in another space.
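For anyone who hasn't stared at one of these loops, here's a minimal sketch (all names are illustrative, not from any simulator) of why it's so serial: a single scheduler pops the earliest event, runs it, and any handler may reschedule new events, so the event queue is a shared bottleneck touched on every iteration.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// One pending event: a 64-bit integer timestamp plus the work to do.
struct Event {
    std::uint64_t time;
    std::function<void()> action;
};

// Comparator that turns std::priority_queue into a min-heap on time.
struct Later {
    bool operator()(const Event& a, const Event& b) const {
        return a.time > b.time;
    }
};

class Scheduler {
public:
    void schedule(std::uint64_t t, std::function<void()> f) {
        q_.push({t, std::move(f)});
    }
    // The evaluate/update/reschedule loop: every iteration goes through
    // the one shared queue, which is what makes it hard to parallelize.
    void run() {
        while (!q_.empty()) {
            Event e = q_.top();
            q_.pop();
            now_ = e.time;   // advance simulated time
            e.action();      // evaluate/update; may call schedule() again
        }
    }
    std::uint64_t now() const { return now_; }

private:
    std::uint64_t now_ = 0;
    std::priority_queue<Event, std::vector<Event>, Later> q_;
};
```

The coarse-grained argument falls out of this: if each action() is expensive relative to the pop/push bookkeeping, farming the actions out to other cores (or machines) amortizes the serial scheduler.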
Back to Quartus in particular and Windows in general... Quartus supports the new "number_of_cpus" or some similar variable, but only seems to use it in small sections of quartus_fit (I think Altera is just taking its baby steps in this area).
That appears to be related to the number of processors inside one box. If a single CPU is just hyperthreaded, the processor takes care of instruction distribution regardless of a variable like number_of_cpus, right? And if there are two single-core processors in a box, obviously it will utilize "number_of_cpus=2" as expected. Does anyone know how that works with multi-core CPUs? I.e., if I have two quad-core CPUs in one box, will setting "number_of_cpus=7" make optimal use of 7 cores while leaving me one to work in a shell or window?
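One wrinkle worth noting: what a portable application can actually see is just the count of logical processors, which includes hyperthread siblings. A modern C++ sketch (this is the standard library's view, not how Quartus itself does it):

```cpp
#include <thread>

// Reports logical processors as the OS exposes them: hyperthread
// siblings count the same as real cores, so whether two of these
// "CPUs" share one physical core is invisible at this level.
unsigned logical_cpus() {
    // Note: the standard allows a return of 0 if the count is
    // not computable on the platform.
    return std::thread::hardware_concurrency();
}
```

So on a box with two quad-core CPUs this would report 8, and setting a tool's worker count to 7 leaves one logical CPU free in principle; whether the OS scheduler actually keeps your shell on that spare one is another matter.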
Does anyone know if Quartus makes better use of multiple processors in a partitioned bottom-up flow compared to a single top-down compile flow?
In 32-bit Windows, is that 3GB limit for everything running at one time? i.e., is 4GB a waste on a Windows machine? Can it run multiple 2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit, and 2GB an absolute process limit in Windows?
In 32-bit Linux, can it run 4GB per process and as many simultaneous processes of that size as the virtual memory will support?
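One way to answer the per-process half of these questions empirically is a crude probe like the following (a rough sketch; the function name and chunk sizes are made up here). On 32-bit Linux it typically stops around 3GB per process because the kernel reserves the top of each 4GB address space, with the exact ceiling depending on the kernel's user/kernel split and overcommit policy.

```cpp
#include <cstddef>
#include <cstdlib>

// Rough probe of per-process virtual address space: grab chunk-sized
// allocations until malloc fails or the cap is reached. We measure
// reservable address space, not touched RAM, so the pages are never
// written.
std::size_t probe_address_space(std::size_t chunk, std::size_t cap) {
    std::size_t total = 0;
    while (total < cap) {
        void* p = std::malloc(chunk);
        if (p == nullptr) break;  // address space (or commit) exhausted
        total += chunk;           // leak on purpose; OS reclaims at exit
    }
    return total;
}
```

Running it once per process, in several processes at the same time, would also answer the "can multiple 2GB processes together exceed 4GB" question directly, since each process gets its own virtual address space.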
In going to 64-bit apps and O/S versions, should the tools run equally fast as long as the processor is truly 64-bit?
Thanks again for all the insights and interesting discussion.
jj