Auto pipeline logic??

Hi all,

Pipelining manually in HDL is hard work. I found that some tools, like Synplify, have pipelining features, but all they do is insert registers between RAM and logic.

My question is: is there a tool that can automatically pipeline the logic? For example, I want to pipeline a block of logic by inserting N registers. And if such a tool exists, what does it modify: the HDL or the netlist?

Best regards, Davy

Davy

I think it's Xilinx, or maybe Mentor Graphics, that has a tool called Precision, which is supposed to "push or pull" registers in order to meet timing. I don't know what it costs.

b r a d @ a i v i s i o n . c o m

Brad Smallridge

Hi Davy,

Both Synplify Pro and XST (and probably other synthesis tools) can do this to some extent. It's also often called Register re-timing. It works like this: you can write your big block of combinatorial logic in HDL, then add N registers after it (also in HDL), then get the tools to push the registers around to optimize the timing of the circuit.

That's the theory! In practice you'd be somewhat foolish to rely on this today (Try it!). However, it's likely to become more and more important in future.

An area in which XST does this fairly well is multipliers. If you use the right settings and/or attributes, it's possible to write a combinatorial multiply followed by a register-based delay line, and have the registers pushed back into the adder tree automatically. The original HDL source is untouched; it's the resulting netlist which is optimized.
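A minimal sketch of that coding style, assuming VHDL with ieee.numeric_std. The entity name mult_retime, the generics WIDTH and STAGES, and the signal names are hypothetical, and the tool-specific retiming settings or attributes are deliberately left out because they differ between tools and versions:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_retime is
  generic (
    WIDTH  : positive := 18;  -- operand width (hypothetical default)
    STAGES : positive := 3    -- registers the tool is free to move backwards
  );
  port (
    clk : in  std_logic;
    a   : in  signed(WIDTH-1 downto 0);
    b   : in  signed(WIDTH-1 downto 0);
    p   : out signed(2*WIDTH-1 downto 0)
  );
end entity mult_retime;

architecture rtl of mult_retime is
  type delay_line_t is array (0 to STAGES-1) of signed(2*WIDTH-1 downto 0);
  signal dly : delay_line_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 0 captures the purely combinatorial product.
      dly(0) <= a * b;
      -- The remaining stages form a delay line with no logic in between.
      for i in 1 to STAGES-1 loop
        dly(i) <= dly(i-1);
      end loop;
    end if;
  end process;

  p <= dly(STAGES-1);
end architecture rtl;
```

As written, the latency is STAGES clocks regardless of where the synthesis tool ends up placing the registers; only the achievable clock rate depends on whether retiming is enabled and succeeds.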

Cheers,

-Ben-

Ben Jones

Hi,

Yes, I know about re-timing. It just pushes or pulls the existing registers (it relies on the original netlist); it does not insert registers.

Is there any tool to insert registers?

Thanks! Davy

Davy

Well, inserting registers changes your design in a fundamental way. Most circuits I can think of would just stop working if you added registers to them at random. Only you, the designer, know exactly how much pipelining it is legal to apply to a given part of your circuit. So I don't believe such a tool exists - certainly not in the general case.

Cheers,

-Ben-

Ben Jones

I would think that would be a very bad idea to try to do automatically - it would completely change your timing. It's one thing to automatically do re-timing to improve your margins or your maximum clock rate, but adding registers will change the function of your logic. You might just as well ask for a tool to insert extra logic to improve your design.

David Brown

You have to insert your own registers to give the pipeline the desired latency. You can then let the tool move the logic across those boundaries.

How would you specify to the tool what you want pipelined, what you don't, and what the expected final latency in clocks is? You insert registers in the paths you want piped.

The tool I use to insert registers: vi.

John_H

Ben Jones (snipped-for-privacy@xilinx.com) wrote:
: > Yes I know re-timing, it just push pull the register(rely on the
: > original netlist), but not insert register.
: > Is there any tool to insert registers?

: Well, inserting registers changes your design in a fundamental way.
: Most circuits I can think of would just stop working if you added
: registers to them at random. Only you, the designer, know exactly
: how much pipelining it is legal to apply to a given part of your
: circuit. So I don't believe such a tool exists - certainly not in the
: general case.

This is very true, but there's no reason a designer couldn't specify a bunch of signals (e.g. the data signal from a combinatorial multiply and its associated control signals) and have some tool add arbitrary (up to a user-specified limit) stages of pipelining to all of those signals to meet timing, with logic/register shuffling. This would only work if the control and data flows can be arbitrarily pipelined, but many operations can be described this way.

A halfway house to achieve this is to use current register shuffling on the data signals, experimentally add registers until timing is met, and then pipeline the associated control signals. If done using a VHDL generate etc., it's a two-second text editor job to do the latter.

If someone writes the tools, then the whole operation could be scripted, with the logic to be messed with and its associated signals isolated in a source file.

All in all, constructive use of a text editor on the source is much less hassle :-)

--
cds
c d saunter

True, although I don't see much merit in doing it that way. In FPGAs, the pipeline registers are essentially free (because they're there after every LUT, even if you don't use them). So you don't get much advantage from "just" meeting timing - if you have four clock cycles to do something in, then you might as well take all four - who cares? You'll get better results out of the tools that way, too.

Of course, if you *do* care about the latency of your operations and you want to minimize it, then you're already thinking in enough depth about your design that an automated tool would be unnecessary.

Cheers,

-Ben-

P.S. Whenever someone says "automated tool" I immediately envisage a smug paperclip: "I see you're trying to close timing - would you like some help with that?" This of course means that all my subsequent utterances on the subject can be safely disregarded. :-)

Ben Jones

Hi Davy,

No tool that I know of, but you can write code in such a way that the pipelining is configurable. I've written a few blocks where I could adjust the pipelining of the block by changing a "pipelining schedule", which was just an array variable marking the pipeline-able points in the design. By changing this variable, I could change the number of pipeline stages, and therefore the number of registers used and the fmax of the design. This has worked pretty well for me for designs like binary trees of arbitrary depth. At the top of the code I would have a variable like:

pipeline_schedule(TREE_DEPTH-1 downto 0) := ( 0, 1, 1, 0, 1, 1 );

and further in the code where I would have, say a tree I would do something like (this is not the actual code, just the idea here):

for i in 1 to TREE_DEPTH-1 generate
  for j in 0 to LEAVES-1 generate
    if (pipeline_schedule(i) = 0) generate
      -- just a level of logic
      a(i)(j*2)
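Since the snippet above is cut off, here is a separate, self-contained sketch of the same idea rather than a reconstruction of the original code. It assumes an adder tree with 2**TREE_DEPTH inputs packed into one vector; the package, entity, type, and signal names (pipe_tree_pkg, pipe_adder_tree, schedule_t, node, and so on) are all hypothetical, and the schedule is indexed per tree level with 1 meaning "register after this level":

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package pipe_tree_pkg is
  -- One entry per tree level: 1 = register after the level, 0 = logic only.
  type schedule_t is array (natural range <>) of natural range 0 to 1;
end package pipe_tree_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.pipe_tree_pkg.all;

entity pipe_adder_tree is
  generic (
    TREE_DEPTH        : positive   := 3;
    WIDTH             : positive   := 8;
    PIPELINE_SCHEDULE : schedule_t := (0, 1, 1)  -- needs TREE_DEPTH entries
  );
  port (
    clk  : in  std_logic;
    din  : in  unsigned(2**TREE_DEPTH*WIDTH-1 downto 0);
    dout : out unsigned(WIDTH+TREE_DEPTH-1 downto 0)
  );
end entity pipe_adder_tree;

architecture rtl of pipe_adder_tree is
  constant LEAVES : positive := 2**TREE_DEPTH;
  constant SUM_W  : positive := WIDTH + TREE_DEPTH;
  -- Re-index the schedule to 0 .. TREE_DEPTH-1; a length mismatch
  -- is reported as an error when the design is elaborated.
  constant SCHED : schedule_t(0 to TREE_DEPTH-1) := PIPELINE_SCHEDULE;
  type row_t  is array (0 to LEAVES-1) of unsigned(SUM_W-1 downto 0);
  type tree_t is array (0 to TREE_DEPTH) of row_t;
  signal node : tree_t;
begin
  -- Level 0: zero-extend each leaf to the full sum width.
  leaves_g : for j in 0 to LEAVES-1 generate
    node(0)(j) <= resize(din((j+1)*WIDTH-1 downto j*WIDTH), SUM_W);
  end generate leaves_g;

  levels_g : for i in 1 to TREE_DEPTH generate
    nodes_g : for j in 0 to LEAVES/2**i - 1 generate
      -- Schedule entry 0: just a level of logic.
      comb_g : if SCHED(i-1) = 0 generate
        node(i)(j) <= node(i-1)(2*j) + node(i-1)(2*j+1);
      end generate comb_g;
      -- Schedule entry 1: the same logic followed by a register.
      reg_g : if SCHED(i-1) = 1 generate
        process (clk)
        begin
          if rising_edge(clk) then
            node(i)(j) <= node(i-1)(2*j) + node(i-1)(2*j+1);
          end if;
        end process;
      end generate reg_g;
    end generate nodes_g;
  end generate levels_g;

  dout <= node(TREE_DEPTH)(0);
end architecture rtl;
```

In this sketch the latency in clocks equals the number of 1 entries in the schedule, so any control signals travelling alongside the data would need the same number of register stages.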

Peter Sommerfeld

I've tried to write my code this way for some time now. This is a design style which is pretty hard. You start with an algorithm or computation, and apply it to an arbitrary sized chunk of data, and build the circuit so that registers are inserted at "appropriate" points during elaboration of the structure. It would be easier if the VHDL language standard was modified to support "return generic values" that are computed during component elaboration and returned as a compile time constant to the upper level code that instantiates the component. The reason for this, of course, is that the lower level component should contain code to calculate the appropriate latency, and then return this value to the upper level of the hierarchy so that the other signal paths can have their latency balanced.

Since VHDL doesn't let you do this, the only other solution I can think of is to write functions in a package that perform the latency calculation at the top level, and then pass the latency as a parameter to the lower level components. This works OK, but requires laborious latency calculations at each level of the hierarchy, making it difficult to cleanly separate the functionality of the sub-components. But you *can* do it. There are some types of circuits where it's really tough to calculate the latency based on the circuit parameters. CRCs are a good example of this. The number of levels of logic generated by a parity matrix depends on the input data width, as well as the polynomial being used. Actually, now that I think about it, the number of levels of logic is proportional to the log2 of the maximum Hamming weight of the columns. OK, bad example :-)

Still, if you write a component with a generic like so:

generic( delay : natural )

and then actually implement that latency in the component, it's pretty easy after that to get the component to run at the silicon limit for your target FPGA technology (given enough latency, of course). It's a pain to have to write the functions that calculate exactly where to place the registers, since you have to take the latency as a given and put it where it will do the most good. I usually put the first register at the output, the 2nd register at the input, and 3rd and higher registers in the middle.

In some cases you have to depend on the retiming capability of the synthesis tool to put the registers where they are needed. This is usually because the actual net where it needs to go doesn't exist until the component is elaborated. Oh yeah, now I remember... I have this problem with a CRC_ECC component that does single error correction on input data protected by a CRC. The error syndromes turn out to be a big static table with unpredictable values. The circuitry for recognizing the syndromes has *very* unpredictable levels of logic. This can't be pipelined in the source, because the nets don't exist until elaboration, and even then they don't have names you can access. Retiming is the only solution to getting decent speed with a circuit like this. Even with retiming, I still have to guess at an appropriate value for latency. In the future, we might solve this by iterative synthesis that sets a latency or logic level attribute on the component label, so that during the 2nd iteration of synthesis, we can detect situations like this and add appropriate latency. But this will require a language change, as usual :-(

Here's a tip on writing components with variable latency:

First, create a set of delay elements (registers) that take a delay parameter for various types. I use:

std_logic, std_logic_vector, unsigned, signed.

I use the step function extensively to distribute delays in a component.

function step_function( i, j: integer ) return natural is
begin
  if i >= j then
    return 1;
  else
    return 0;
  end if;
end step_function;

then given a generic delay value, I can compute the delays for various points like so:

constant bottom_delay : natural := step_function( delay, 1);
constant top_delay    : natural := step_function( delay - bottom_delay, 1 );
constant middle_delay : natural := delay - top_delay - bottom_delay;

This puts the first available register at the bottom, the 2nd goes to the top, and the remaining in the middle. The delays can never be negative.
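To show how these three constants might be wired up, here is a rough sketch under several assumptions: delay_slv is a hypothetical std_logic_vector delay element, piped_op is a hypothetical component, and the two logic steps inside it are placeholders rather than anything from a real design. Only step_function and the three delay constants come from the text above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical delay element: 'delay' registers in series (0 = plain wire).
entity delay_slv is
  generic (WIDTH : positive := 8; delay : natural := 0);
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(WIDTH-1 downto 0);
    q   : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity delay_slv;

architecture rtl of delay_slv is
  type pipe_t is array (0 to delay) of std_logic_vector(WIDTH-1 downto 0);
  signal pipe : pipe_t;
begin
  pipe(0) <= d;
  regs_g : for i in 1 to delay generate
    process (clk)
    begin
      if rising_edge(clk) then
        pipe(i) <= pipe(i-1);
      end if;
    end process;
  end generate regs_g;
  q <= pipe(delay);
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical component with a total latency of 'delay' clocks.
entity piped_op is
  generic (WIDTH : positive := 8; delay : natural := 3);
  port (
    clk : in  std_logic;
    x   : in  std_logic_vector(WIDTH-1 downto 0);
    y   : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity piped_op;

architecture rtl of piped_op is
  function step_function(i, j : integer) return natural is
  begin
    if i >= j then return 1; else return 0; end if;
  end function step_function;

  constant bottom_delay : natural := step_function(delay, 1);
  constant top_delay    : natural := step_function(delay - bottom_delay, 1);
  constant middle_delay : natural := delay - top_delay - bottom_delay;

  signal x_t, s1, s1_d, s2 : std_logic_vector(WIDTH-1 downto 0);
begin
  -- "Top": at most one register on the input.
  top_i : entity work.delay_slv
    generic map (WIDTH => WIDTH, delay => top_delay)
    port map (clk => clk, d => x, q => x_t);

  -- Placeholder logic, first half (xor with a rotated copy).
  s1 <= x_t xor (x_t(0) & x_t(WIDTH-1 downto 1));

  -- "Middle": every register beyond the first two.
  mid_i : entity work.delay_slv
    generic map (WIDTH => WIDTH, delay => middle_delay)
    port map (clk => clk, d => s1, q => s1_d);

  -- Placeholder logic, second half.
  s2 <= not s1_d;

  -- "Bottom": at most one register on the output.
  bot_i : entity work.delay_slv
    generic map (WIDTH => WIDTH, delay => bottom_delay)
    port map (clk => clk, d => s2, q => y);
end architecture rtl;
```

With delay = 0 all three constants are 0 and the block is purely combinatorial; delay = 1 gives only the output register, delay = 2 adds the input register, and delay = 5 yields 1, 1, and 3 for bottom, top, and middle. When middle_delay is greater than 1, this sketch places the middle registers back to back, so spreading them through the logic would still be left to retiming or to passing middle_delay down into a subcomponent.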

It's not uncommon to use one of these quantities for passing latency to subcomponents. The middle delay might be used as a delay parameter, for example. In some rare cases, you may have different latencies on different inputs or outputs. This depends on the demands of the higher level hierarchy.

Sometimes the latency is just obvious, and all you have to do is write a function that returns the value based on some input parameters. Recursive trees fall into this category. This can be a little tricky sometimes, and having a way to return an elaborated value from a subcomponent would be really handy here. Of course, having this capability would open the door to dependency loops in elaboration, so iteration limits might have to be set.
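For the "obvious" case, a latency function kept in a package might look like the following sketch. The package name latency_pkg, the helper clog2, and the rule "one register after every levels_per_reg levels of a binary tree" are all assumptions made for illustration, not something stated in this thread:

```vhdl
package latency_pkg is
  -- Smallest r such that 2**r >= n.
  function clog2(n : positive) return natural;
  -- Latency in clocks of a binary tree with n_inputs leaves, assuming
  -- one register after every levels_per_reg levels of logic.
  function tree_latency(n_inputs : positive;
                        levels_per_reg : positive) return natural;
end package latency_pkg;

package body latency_pkg is

  function clog2(n : positive) return natural is
    variable r : natural := 0;
    variable v : natural := n - 1;
  begin
    while v > 0 loop
      v := v / 2;
      r := r + 1;
    end loop;
    return r;
  end function clog2;

  function tree_latency(n_inputs : positive;
                        levels_per_reg : positive) return natural is
    constant levels : natural := clog2(n_inputs);
  begin
    -- Integer ceiling of levels / levels_per_reg.
    return (levels + levels_per_reg - 1) / levels_per_reg;
  end function tree_latency;

end package body latency_pkg;
```

The upper level could then declare something like `constant LAT : natural := tree_latency(16, 2);` (2 clocks for these example values), pass LAT down as the delay generic, and use the same constant to delay the signal paths that run in parallel with the tree.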

Having said all this, I'll close with the assertion that I can see the day coming when it won't be just the logic that needs pipelining... Someday we'll have to pipeline the routing too. Probably in the 45 nm node.

John McCluskey

P.S. I'm a Xilinx FAE, but writing this in my "off hours".

John McCluskey

Hi John,

Yup, that gets said about once a week in this office. :-)

Well, the other possibility is to use flow control handshakes at each pipeline stage and put up with non-deterministic latency. That has its own problems, of course - but I've found both approaches useful in certain contexts.
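As a rough illustration of the handshake approach, here is one possible pipeline stage using a valid/ready convention. The convention itself, the entity name hs_stage, and all port names are assumptions for this sketch; the post above does not specify a protocol.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity hs_stage is
  generic (WIDTH : positive := 8);
  port (
    clk       : in  std_logic;
    rst       : in  std_logic;  -- synchronous, active high
    in_valid  : in  std_logic;
    in_ready  : out std_logic;
    in_data   : in  std_logic_vector(WIDTH-1 downto 0);
    out_valid : out std_logic;
    out_ready : in  std_logic;
    out_data  : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity hs_stage;

architecture rtl of hs_stage is
  signal full    : std_logic := '0';  -- '1' while the stage holds a word
  signal ready_i : std_logic;
begin
  -- Accept a new word when empty, or when the current word is being taken.
  ready_i   <= (not full) or out_ready;
  in_ready  <= ready_i;
  out_valid <= full;

  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        full <= '0';
      elsif ready_i = '1' then
        full <= in_valid;
        if in_valid = '1' then
          out_data <= in_data;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

Note that in_ready depends combinatorially on out_ready in this sketch, so a long chain of such stages builds up a long ready path; breaking that path needs extra buffering per stage, which is one of the "own problems" of this style.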

As another poster indicated, if you have a bunch of sites where a register could be placed, then it's a piece of cake to have an array of booleans as a generic parameter to control the register placement. The problem of finding optimal register placement can usually be solved by brute force: write a script to implement the component using every possible vector with N bits set, for every value of N you're interested in. In fact, you don't even need to do every possible vector (which is a binomial-type thing) because you can easily prove that certain vectors will always be worse than other vectors. This narrows the search space quite a bit (which matters if you're talking about a 10-deep pipeline, but not so much if it's 3 or 4!). Of course, you have to do this every time your design changes significantly; but if your project is headed for a library which will be used many times by many people, it shouldn't change (and the effort should be justified).

It's happening already. When you're designing for 300MHz plus, those previously-innocuous net delays of 900ps are a massive chunk out of your cycle budget. More than once I've ended up with a critical path having no levels of logic...

Cheers,

-Ben-

P.S. I'm a Xilinx design engineer, and I don't get "off hours". :)

Ben Jones

Shoot, nothing new here. I've been pipelining wires when necessary since my XC3000 days. It is even there as a bullet in some of the performance tips I published in various places. If you are pushing the envelope on FPGA performance, you'll need to do things like pipelining wires. As true now as it ever was.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759
Ray Andraka
