ISE 10.0 finally with multi-threading and SV support?

Will the 10th edition of ISE, to be released next week, finally support multi-threading/SMP machines to reduce synthesis and P&R time?

Will we finally get support for synthesis of System Verilog constructs?

What other major features do you still miss? Discuss!

Reply to
ratztafaz

Full Verilog 2001 support!

Jon

Reply to
Jon Beniston

* Proper exit codes upon failure.
* Use libusb (drop Jungo).
* Published API for Impact.
* GUI that doesn't crash.
* Ability to exploit server farms so that synthesis/P&R can be shared between physically different machines.
* Installer that allows installation on non-RH machines.
Reply to
sky465nm

There's a much better approach available: XILINX JTAG tools on Linux without proprietary kernel modules

windrvr sucks!

look:

formatting link

Reply to
ratztafaz

You both mean the same...

--
Uwe Bonnes                bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
Reply to
Uwe Bonnes

Yep.

(I thought no-one would misunderstand the original message)

Reply to
sky465nm

That is easy to do for high-level synthesis, next to impossible for placement, and very difficult for routing.

VHDL-2006 support.

Kolja Sulimma

Reply to
Kolja Sulimma

Hello,

I did not dig into it, but I always felt it's exactly the opposite. Some time ago I read that most P&R is based upon simulated annealing. Is that still true? While SA might not be the most parallelizable algorithm on earth, it should give you some speedup, at least on SMP...

Do you have any read-worthy documents on that topic?

Reply to
jb

With SA you perform a small change on your design and evaluate the fitness of the new design. Based on that you decide how to continue. In cases where the fitness update is inexpensive, SA is inherently serial and hard to parallelize. Of course you could evaluate multiple changes before making the next decision, like genetic algorithms do. But anyway, I doubt that Xilinx still uses SA for their placer.
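(For reference, the serial dependency described here comes from the acceptance step of classic simulated annealing, i.e. the standard Metropolis rule rather than anything vendor-specific: each proposed move is accepted with probability

$$
P(\text{accept}) =
\begin{cases}
1 & \text{if } \Delta C \le 0 \\
e^{-\Delta C / T} & \text{if } \Delta C > 0
\end{cases}
$$

where $\Delta C$ is the cost change of the move and $T$ the current temperature. Because each decision depends on the state left behind by the previously accepted move, the moves form a serial chain, which is exactly what makes naive parallelization hard.)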

Please note that the algorithms running in the EDA world are extremely complicated compared to stuff like computer graphics or similar. Performance in many cases depends on a multitude of parameters that are hand-tuned. These parameters can be affected by parallelization: you need to do all the parameter tuning again, or convergence might actually be worse when running on two cores instead of one.

Anything that can be done on partitions of the design is easy. For example, you can start synthesis in parallel on multiple cores for different source files. But later in the flow it gets really nasty. Another easy thing to do is to run the timing analyzer in parallel with bitgen. You do not need to wait for Xilinx to do that; just write your own makefile.

Before complaining about Xilinx you should be aware that ASIC designers often face runtimes of many hours for their tools (nightly builds...). Parallel EDA software therefore is an active research topic, but with no major successes yet. Don't expect the FPGA vendors to support multiple cores before the big EDA companies like Synopsys, Magma DA or Cadence do.

Kolja Sulimma

Reply to
Kolja Sulimma

Hi all,

Parallel EDA is indeed difficult, but we've had some success with it here at Altera: our first parallel algorithms were shipped in 2006 and our first parallel placer was shipped in 2007. The main placement algorithm is getting a speedup of 1.6x on two processors and 2.2x on four, works without partitioning, and always gives exactly the same answer as the serial version. We described the techniques we used at a recent conference; if you're lucky enough to have an ACM web account, you can read the full paper here:

formatting link

As of the latest release of Quartus II, you'll get an average speedup of about 15% on two processors and 20% on four. We're also actively improving our existing parallel algorithms and working on a bunch of new ones, so we expect these numbers to improve significantly in the future. You can check out

formatting link
for a few more details, or find out what the numbers are for the latest release.

Cheers, Adrian Ludwin Altera

Reply to
aludwin

ISE 10.1 does not yet support multi-threading, but it does have a 2X Map&P&R runtime improvement.

No SV yet. You will have to get that from our partners (I guess that would be Synopsys if you're looking for SV synthesis).

Steve

Reply to
<steve.lass

I find that a tad disappointing. I'm sure X has done its market research, but from my point of view I see so many potential benefits to designers from adopting SV... and the FPGA community has traditionally been much more willing to adopt interesting language features for design, whereas the ASIC community tends to be rather conservative because it sees most of its problems as being downstream of the design phase.

The part I find frustrating is that the real gains from SV come only when an implementation (of the design subset) is reasonably complete. Chipping away at the edges (as some vendors initially did), by implementing always_ff and a few easy data types, doesn't get us anywhere in terms of real design capability. My hit-list:

* full support for enums, including traversal methods
* packed structs and unions
* interfaces, ** including modport expressions **
* unique/priority
* always_comb

When completed, this gives you a design language with significantly better expressive power than VHDL in some areas.
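To make the list concrete, here is a minimal sketch of what that subset looks like in use; the names (types_pkg, mem_if, ctrl) are invented for illustration and not taken from any particular tool's examples:

// Minimal sketch of the SV design subset listed above (invented example).
package types_pkg;
  typedef enum logic [1:0] {IDLE, RUN, DONE} state_t;                    // enum
  typedef struct packed { logic [7:0] tag; logic [23:0] addr; } req_t;   // packed struct
endpackage

interface mem_if;
  import types_pkg::*;
  req_t req;
  logic valid, ready;
  modport master (output req, valid, input  ready);
  modport slave  (input  req, valid, output ready);
endinterface

module ctrl (input logic clk, rst, mem_if.master bus);
  import types_pkg::*;
  state_t state, state_nxt;

  always_ff @(posedge clk or posedge rst)
    if (rst) state <= IDLE;
    else     state <= state_nxt;

  always_comb begin
    state_nxt = state;
    unique case (state)                               // unique: completeness checked by the tool
      IDLE: if (bus.ready) state_nxt = state.next();  // enum traversal method
      RUN :                state_nxt = DONE;
      DONE:                state_nxt = IDLE;
    endcase
  end

  assign bus.valid = (state == RUN);
  assign bus.req   = '{tag: 8'h5A, addr: 24'h0};      // assignment pattern to a packed struct
endmodule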

~~~~~~~~~~~~~~~~

Finally, a minor correction for the record:

SV synthesis is competently supported not only by Synopsys DC but by at least one of the major third-party FPGA synthesis tools. The other obvious third-party FPGA synth tool makes a reasonable attempt at SV, **as does the free tool of your obvious competitor**.

~~~~~~~~~~~~~~~~~~~~~~~

Big whinge time: AFAIK NO synthesis tool yet supports modport expressions, the SV feature that has most impact in providing new expressive power for re-usable, parameterised design.
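For anyone who hasn't met them: a modport expression lets a modport export a computed view of the interface's internals under its own port name, so parameterised client modules can be written against the view rather than the raw storage. A minimal sketch (invented names, not from any shipping tool's examples):

// Sketch of modport expressions: .port_name(expression) inside a modport.
interface bus_if #(parameter int W = 32);
  logic [W-1:0] data;
  logic         we;
  // Each modport exposes a different byte of 'data' under the same name 'byte_lane'.
  modport lane0 (output .byte_lane(data[7:0]),  output we);
  modport lane1 (output .byte_lane(data[15:8]), input  we);
endinterface

module byte_writer (bus_if.lane0 bus, input logic [7:0] d, input logic en);
  assign bus.byte_lane = d;   // the client sees only its own slice
  assign bus.we        = en;
endmodule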

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

I guess a theoretical computer scientist would call that "not parallelizable".

20% on four processors is hardly impressive and should be possible by rewriting the makefile alone.

If you spent the same human resources on optimizing the serial algorithms, don't you think a 15% speedup on a single processor would have been possible? Also, a 15% speedup is the same as purchasing the CPU three months later.

A note on the algorithms: if you used a quadratic placer, you could parallelize the matrix operations while running the main algorithm on only one processor.

Kolja Sulimma

Reply to
Kolja Sulimma

Hi Kolja,

All parallel results (the 15-20% Adrian quotes) are *in addition* to the normal improvements we make to Quartus compile times. In fact, a lot of our effort continues to go into serial algorithm speed (and memory footprint), since these gains usually stack with parallelism gains. We get serial improvements by replacing algorithms with better ones, tuning various parameters/trade-offs, and exiting various algorithms early when they reach their design targets. We also play with new compilers and compiler optimization flags, look at memory layout/locality, etc. We leave no stone unturned, and no programmer unflogged :-)

If you look at a breakdown of Quartus compile time, it is spread across a large number of algorithms/code fragments, few of which contribute more than a few percent each. If there were a big run-time peak, we would have squashed it by now -- hence the relatively even breakdown of run time. So it isn't that the problem isn't parallelizable -- it's that it takes a lot of time & effort to rewrite a large code base to take advantage of multiple processors. Many of the algorithms we parallelize achieve >1.5X speedup (on two CPUs), which is pretty good for a memory-intensive algorithm. With each release of Quartus since Quartus II v6.1 we've introduced more and more parallelism across the flow.

Which makefile are you referring to?

But if we speed things up by 15% at our end, and you get a 15% faster CPU, then you've got yourself a 32% speed-up. If we didn't give you that 15%, you'd only have your 15% speed-up.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

Of course. I spent years in EDA algorithm research.

Whatever script it is that controls the tool flow. I am not familiar with Quartus, but generally for an EDA flow it is possible to start synthesis on multiple source files in parallel without changing the algorithms. You can also run bitstream generation in parallel with post layout timing analysis. The cleanup/report phase of any step could be overlapped with the previous step, and so on.

Yes. But all these are minor cleanups when you consider the class of algorithms involved. Runtimes between different placers for the same quality of results vary by orders of magnitude. Try to run your 1999 placer on a design from today.

Kolja

Reply to
Kolja Sulimma

Hi Kolja,

Sorry to use a cliché, but you have to walk before you can run! Our first parallel release (in 2006) improved runtimes by 5% or so, and we've since pushed that up to 20% and have many more improvements on the way. The big problem, of course, is that we're held back by Amdahl's law: if you perfectly parallelize 25% of your algorithm (e.g. get a 4x speedup on four processors), this will only make the overall algorithm about 20% faster. However, if you parallelize the second 25%, you get a 60% speedup, and the third 25% will get you to 130% (i.e. a 2.3x speedup). We're clearly still in that first-25%-ish range, but we don't intend to stay there.
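(For anyone who wants the arithmetic behind those figures, this is just Amdahl's law with a fraction $p$ of the runtime sped up by a factor $s = 4$:

$$
S(p) = \frac{1}{(1-p) + p/s}, \qquad
S(0.25) \approx 1.23, \quad S(0.50) = 1.60, \quad S(0.75) \approx 2.29,
$$

which matches the roughly 20%, 60% and 130% overall gains quoted above.)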

As Paul mentioned, we've dramatically improved our serial runtimes as well, which gives a greater improvement than our parallel results to date. But it's unwise to rely solely on our ability to repeat this feat again and again. In addition, while it was recently true that processors were doubling in speed every 18 months or so, chip vendors are now choosing lower clock speeds and increasing the number of cores available instead. For example, when I compare the SPEC CINT2006 results for the second half of 2006 with the second half of 2007 (the most recent numbers I could find on spec.org), the best single-threaded score increased by only 9% and the average score by only 14%. Since our devices are now growing much faster than this, waiting for a new processor to reduce your runtimes isn't a good long-term solution anymore.

In short, these are early days for parallel EDA and we're not at 4x speedups yet, but we're busy laying the groundwork for the many-core future to make sure we don't get left behind.

Cheers, Adrian Ludwin Altera

Reply to
aludwin

Don't get me wrong: I believe it's a good thing to work on that. I am just surprised that you are doing this development in your production toolchain. This seems to be a lot of risk and effort for a minor improvement. I would probably wait until I could present a bigger improvement. But that is your decision.

As a side note: at least in designs with a utilization below 70% or so, it should be relatively easy to partition the design and then run the whole toolflow on the partitions. That is also something that could probably be done at the makefile level, using the toolflows for dynamic reconfiguration. Might be a nice master's thesis.

Kolja Sulimma

Reply to
Kolja Sulimma

Hi Kolja,

Let's say (hypothetically) that there are 100 places that we have to parallelize to get a 2X speed-up. I think it is far less risky to dribble out the improvements, 10 algorithms per release, climbing from 5% to 10% to 15% to 20% improvement, than it is to develop all 100, test them locally, and then unleash all the changes on the world at once. Plus this way users get that 20% benefit sooner rather than later.

If the user is willing to partition their design, then they can already achieve some speed-up with Quartus by using the Quartus Incremental Compile capabilities in a bottom-up fashion. Each partition can be compiled with a separate execution of the Quartus flow (which can be parallelized), and the resulting partitions can then be combined. Another benefit is that only those partitions that are impacted by design modifications need to be recompiled.

As for automatic partitioning: Our users tend to push design performance. If we pre-decide how to break up a design into completely separable partitions and then separately synthesize and place-and-route those partitions, yes, you can get a compile time speed-up. But how much performance is lost? Could you have got an equal speed-up by just dialing down the effort level in the flat, serial algorithm? Also, what do you do at the partition boundaries? Do you need to do some routing and timing analysis at the top level after synthesis/placement/routing on the sub-sections? If the partitions do not line up perfectly with clock domains or with keepers, then you have a slack dependency on signals that cross partition boundaries -- how do you solve that?

I think that this is a good area of research. But I would not say it is "easy".

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis

I have found verilog-mode for emacs to be quite nice with regard to the above. It doesn't have all the features I want, but it is quite nice nevertheless if you use its /*AUTO*/ features.
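For readers who haven't tried it, the /*AUTO*/ comments are expanded in place by emacs (M-x verilog-auto). A minimal sketch of a top level before expansion might look like this (the sub-module name is invented):

// Before expansion: verilog-mode fills these in from the sub-modules it finds.
module top (/*AUTOARG*/);       // expands to the module's declared ports
   input  clk;
   input  rst_n;
   output done;

   /*AUTOWIRE*/                 // wire declarations for sub-module outputs not otherwise declared

   sub u_sub (/*AUTOINST*/);    // ports connected by name to same-named signals
endmodule

// Local Variables:
// verilog-library-directories:(".")
// End: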

/Andreas

Reply to
Andreas Ehliar

Partitioning would be nice, if it worked. At least with Xilinx ISE it does not. It could dramatically reduce my compile times, but I got so many random fatal errors with it that I had to abandon it. Maybe in some future version, when they fix the issues. Another thing with partitions: ISE gets so ridiculously slow, it's not even funny. Not that I would use the crappy ISE editor, but it is still annoying.

BTW, a nice Verilog editor would be a great thing too! Automatic code completion, automatic wiring through hierarchies, complex refactoring capabilities, a real-time syntax compiler, back-tracing of instantiation calls, linking between instantiations and module definitions... (I am dreaming). Still, it's pathetic that you still have to invest tons of time to create top-level modules and wire modules together.

Reply to
ratztafaz
