Need to speed up Stratix compiles.

Not if the four machines are sitting around all night running screen savers.

--
Rick "rickman" Collins

rick.collins@XYarius.com
rickman

Would seem a very good idea.

On this topic, I see Intel released a new Xeon with 3GHz and 4MB (!) cache, and they claim it is 25% faster. Of course, you pay: $3692 (quantity not given) :) The PR claims this is the last release before Intel adds 64-bit extensions...

Jim Granville

I disagree. Synthesis as well as P&R involve exploring many alternatives, sorted and explored by some underestimate of expense/delay (typically using an A* search algorithm or similar). This can be done in parallel. The datasets can be copied to each node and there will be very little information which has to be exchanged over the interconnect. Of course there is not much to gain if your P&R takes 1 minute, but for larger designs and/or more accurate wire delay models (e.g. non-linear delay modeling and physical synthesis) the benefit will be larger.
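To make the "underestimate of expense/delay" idea concrete, here is a minimal sketch of an A* search on a toy routing grid. The grid, the cell costs and the function name are all made up for illustration; this is not how any particular synthesis or P&R tool is implemented.

```python
import heapq

def astar(grid, start, goal):
    """Return the cost of the cheapest 4-connected path, or None if unreachable.

    grid[r][c] is the cost of entering cell (r, c); None marks a blocked cell.
    """
    rows, cols = len(grid), len(grid[0])
    min_step = min(c for row in grid for c in row if c is not None)

    def underestimate(cell):
        # Manhattan distance times the cheapest step never overestimates
        # the remaining cost, which is what A* needs to stay optimal.
        return (abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])) * min_step

    best = {start: 0}
    frontier = [(underestimate(start), 0, start)]
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best.get(cell, float("inf")):
            continue  # stale queue entry, a cheaper route was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] is not None:
                new_cost = cost + grid[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    priority = new_cost + underestimate((nr, nc))
                    heapq.heappush(frontier, (priority, new_cost, (nr, nc)))
    return None

# Toy example: one blocked cell and one expensive (congested) cell.
grid = [
    [1, 1,    1],
    [1, None, 5],
    [1, 1,    1],
]
print(astar(grid, (0, 0), (2, 2)))
```

Independent searches like this (different nets, different candidate placements) share only the read-only grid, which is why the dataset can be copied to each node with little traffic afterwards.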

This has been implemented in some ASIC tools already. Actually, Xilinx has been doing some very simple parallel processing in ISE (on Solaris and now Linux) for a long time. Multiple iterations of "par" can run in parallel on multiple hosts, then you pick the best result. This is, of course, extremely coarse-grained compared to what I indicated above.
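The coarse-grained scheme (several independent runs, keep the best) can be sketched as follows. The function `run_place_and_route` is a hypothetical stand-in that just draws a repeatable pseudo-random slack per seed; a real version would launch the vendor tool and parse its timing report, and no actual tool options are assumed here.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_place_and_route(seed):
    """Stand-in for one P&R run; returns (seed, worst_slack_ns).

    Hypothetical: replace the body with a call to the real tool.
    """
    rng = random.Random(seed)          # same seed gives the same "result"
    return seed, round(rng.uniform(-1.0, 1.0), 3)

def best_of(seeds, workers=4):
    """Run all seeds concurrently and return (best_result, all_results)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_place_and_route, seeds))
    # Larger worst-case slack means timing was met with more margin.
    return max(results, key=lambda r: r[1]), results

best, results = best_of(range(1, 9))
print("best seed:", best[0], "worst slack (ns):", best[1])
```

Threads are enough for this pattern when each run is an external process, since the Python side only waits for the runs to finish.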

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Petter Gustad

My experience is the opposite. I've heard from users in the high performance computing industry that the most cost efficient systems are clusters of dual CPU nodes (assuming your application will run efficiently on a cluster).

A 4 CPU Xeon system like a Dell PowerEdge 6650 with 4x Xeon, 3.0GHz and 4GB RAM costs $28,070. A single PowerEdge 750 (1U server) with 3.4GHz P4 (higher clock frequency, but smaller cache) with 1GB RAM costs $3,165. 8 CPU Xeon SMP's (Profusion architecture) are very expensive. A Proliant 8500 costs $100,000+ if memory serves me right.
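As a rough check on those two quotes, the arithmetic can be written out as below. It only uses the prices given above and ignores that the CPUs, cache sizes and memory per box differ, so treat it as a back-of-the-envelope comparison.

```python
# Prices as quoted above.
smp_price, smp_cpus = 28_070, 4    # PowerEdge 6650, 4x Xeon 3.0GHz, 4GB RAM
node_price, node_cpus = 3_165, 1   # PowerEdge 750, 3.4GHz P4, 1GB RAM

smp_per_cpu = smp_price / smp_cpus
node_per_cpu = node_price / node_cpus
nodes_for_smp_price = smp_price // node_price   # whole boxes that fit the budget

print(f"4-way SMP:  ${smp_per_cpu:,.2f} per CPU")
print(f"1U node:    ${node_per_cpu:,.2f} per CPU")
print(f"1U nodes for the price of one 4-way box: {nodes_for_smp_price}")
```

That works out to $7,017.50 per CPU for the 4-way box against $3,165 for the 1U node, or eight 1U nodes for the price of one 4-way system.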

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Petter Gustad

The hyperthreaded Xeons run as two processors, so a quad Xeon board appears to a HT-aware OS as an 8-CPU system.

Why pay for all the extra high-end hardware in a top-end server if you don't need it? When I was last looking at building systems like this, about 18 months or so ago, a quad-Xeon mobo from Supermicro was

Max

Would you then call a system with a single P4 with HyperThreading a dual-processor system as well? This would be a little "unfair" when comparing to a full dual-core CPU like the rumored UltraSparc-IV.

My point was that you usually get lots of extra high-end hardware when you buy large SMP systems, especially when you need to go beyond 4-way. Also, it's usually cheaper to get 4x4GB RAM rather than 16GB RAM for a single MB (unless you have a large enough number of DIMM slots).
Petter Gustad

I was just trying to be helpful by sharing my experience. We're only interested in speeding up Quartus builds in this thread and some have been suggesting more memory (32 GB in some instances) and faster drives. I've done both in two different machines and the biggest improvement came from tweaking the memory subsystem, not adding more memory above 512MB or a faster drive.

The 7200 RPM drive is very much faster, as can be seen from the much faster boot times. It didn't mean much for Quartus builds, though. It seems Quartus needs (for my Nios system) a fast CPU with at least several hundred MB of tweaked memory.

I write image processing algorithms that are not slow, and I use as many wires as the system can provide. If it's an 8-bit CPU then I use 8-bit optimizations; if it's 32-bit then 32-bit optimizations. Haven't tried 64-bit yet, but I plan to. Can't imagine any developer worth their salt who wouldn't.

Ken

Kenneth Land

Last I checked, multiple PAR runs didn't gain much if you had a well floor-planned system. That was a long time ago. Has anything changed?

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
Hal Murray

Hi Rick,

I agree. That's why my original posting makes reference to some SPEC results showing that 64-bit code on Athlon64 is ~5% slower than the same programs compiled in 32-bit code. One specific SPEC sub-component is a tool called VPR, which is an academic place & route tool for FPGAs. It shows an 8% slow-down. While by no means comprehensive, I think this gives an idea of how much speed to expect out of 64-bit vs. 32-bit code, at least for now.

I've forwarded your comments on how nice it would be to see some results for different system configurations on to the relevant groups in Altera. My personal experience (going from PII to PIII to P4) has been that SPEC2000 is a pretty good proxy for Quartus performance, especially for place & route limited designs.

Regards,

Paul

Paul Leventis (at home)

Hi Max,

It's hard to use fine-grained parallelism on place-and-route tools like Quartus. This doesn't mean that people (academia, industry) haven't tried and aren't still trying, but I wouldn't hold my breath. See my previous posting on this topic:

formatting link

Of course, coarse-grained parallelism (running multiple place-and-route runs on multiple machines) is much easier. Quartus II ships with a pretty cool tool called Design Space Explorer. This tool tries out a whole bunch of Quartus settings and random seeds on your design in order to find the settings that optimize performance. This requires multiple runs of Quartus. DSE is capable of farming these runs off to multiple CPUs/computers through LSF or a built-in distributed computing engine.

To find out more about DSE and coarse-grained parallelism, please see the section entitled "DSE Advanced Information" in

formatting link

Regards,

Paul Leventis Altera Corp.

Paul Leventis (at home)

That is very interesting information. I was not aware that AMD 64-bit code was running slower than 32-bit code. I am sure that you won't see much of that on the AMD web site. I may check in the PC building newsgroups to see what results they are finding. They seem to be a bunch that gets to the skinny of things like this.

--
Rick "rickman" Collins

rick.collins@XYarius.com
rickman

True. If you don't have a highly congested design with a high degree of utilization, you will probably not gain that much.

My point was that this was an example of a *very simple* parallelism done by Xilinx. It would be more optimal (and much more difficult) to make a parallel version of a single iteration of "par".

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Petter Gustad

Windows XP sees my dual-Xeon workstation as a quad CPU machine, so it can schedule four separate simultaneous threads. If it's behaving as a quad-processor, then I'm not sure what else I should call it.

I agree, it's difficult to buy a ready-made high-end system that doesn't have redundant PSUs, hot-swap RAID etc. This is why I haven't bought off the shelf for over 5 years now, but buy the components I actually want and assemble it myself.

Bare high-end mobos are cheaper than most people think. At the time, I paid around $850 for a Supermicro P4DC6+ dual Xeon board, but I haven't looked at current prices. There is a hike when you want more than two physical processors, though - presumably due to low demand and less competition. The P4 Xeons are hugely cheaper than the PIII versions for some reason.

If you're an AMD fan, then Tyan make nice multi-CPU boards at sensible prices.

--
  Max
Max

I hardly ever use Windows so I haven't had a chance to observe this. I don't have a HT system at hand now, but what does

grep ^processor /proc/cpuinfo

return on a Linux based HT system?

We have a small cluster of Quad Opterons at work. They give superb performance when I run Synopsys Design Compiler and similar tools. Unfortunately I can't run Quartus II (3.0) on these as I have mentioned earlier.

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Petter Gustad

Sorry, don't know from personal experience. I don't use Linux much, and when I do, I run it under VMware, which emulates a uniprocessor.

I've heard from other users that Linux understands HT, but I don't know any more than that, really. I daresay the folks in some of the hardware groups could say more - try alt.comp.periphs.mainboard.supermicro

--
  Max
Max

I got the answer from a local Linux group. It appears as 4 processors:

$ grep ^processor /proc/cpuinfo
processor : 0
processor : 1
processor : 2
processor : 3

So it will be difficult to distinguish between two physical CPU packages, dual-core and HT...

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Petter Gustad

I think you can look at the "flags":

$ grep ^flags /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

This is on a uni-processor Xeon box. The "ht" flag might hint at hyperthreading, but I'm not sure...
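One way to tell packages from logical processors, on kernels that export them, is to look at the "physical id" and "core id" lines of /proc/cpuinfo rather than the flags. Below is a minimal sketch that counts them. The sample text is invented to resemble a dual-Xeon box with HT enabled, and older kernels may not print all of these fields, so this is an assumption-laden illustration rather than a definitive test.

```python
def summarize_cpuinfo(text):
    """Count logical processors, physical packages and cores in cpuinfo text."""
    logical = 0
    packages = set()
    cores = set()
    for block in text.strip().split("\n\n"):      # one block per logical CPU
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        pkg = fields.get("physical id")
        if pkg is not None:
            packages.add(pkg)
            # Assumes one core per package when "core id" is absent.
            cores.add((pkg, fields.get("core id", "0")))
    return {"logical": logical, "packages": len(packages), "cores": len(cores)}

# Hypothetical sample: two packages, one core each, two HT siblings per core.
SAMPLE = """\
processor : 0
physical id : 0
siblings : 2
core id : 0

processor : 1
physical id : 0
siblings : 2
core id : 0

processor : 2
physical id : 1
siblings : 2
core id : 0

processor : 3
physical id : 1
siblings : 2
core id : 0
"""

info = summarize_cpuinfo(SAMPLE)
print(info, "HT likely active:", info["logical"] > info["cores"] > 0)
```

On a real machine the text would come from `open("/proc/cpuinfo").read()`. More logical processors than distinct cores suggests hyperthreading is active, whereas the "ht" flag alone only reports that the CPU advertises the feature.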

Marius Vollmer
