Automatic latency balancing in VHDL-implemented complex pipelined systems

Hi, I have recently spent a lot of time developing quite complex high-speed data processing systems in FPGAs. They all had a pipelined architecture, and data were processed in parallel in multiple pipelines with different latencies.

The worst thing was that those latencies kept changing during development. For example, some operations were performed by blocks with a tree structure, so the number of levels depended on the number of inputs handled by each node. The number of inputs in each node was varied to find an acceptable balance between the number of levels and the maximum clock speed. I also had to add some pipeline registers to improve timing.

The designs were written entirely in pure VHDL, so I had to adjust the latencies manually to ensure that data coming from different paths arrive at the next block in the same clock cycle. It was really a nightmare, so I dreamed of an automated way to ensure proper equalization of latencies.

After some work I have developed a solution which I'd like to share with the community. It is available under the BSD license on the OpenCores website. The paper with a detailed description is available on arXiv.org.

I'll appreciate any comments. I hope that the proposed method will be useful for others.

With best regards, Wojtek

wzab01

I have heard that some synthesis software now knows how to move around pipeline registers to optimize timing. I haven't tried using the feature yet, though.

I think it can move registers, but maybe not add them. You might need enough registers in place for it to move them around.

I used to work on systolic arrays, which are really just very long (hundreds or thousands of stages) pipelines. It is pretty hard to hand optimize them that long.

-- glen

glen herrmannsfeldt

Yes, of course the pipeline registers may be moved (e.g. using the "retiming" feature). I usually keep this option switched on for implementation. My method only ensures that the number of pipeline stages is the same in all parallel paths. Keeping track of that was really a huge problem in bigger designs.
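
To make the goal concrete, here is a minimal Python sketch of the bookkeeping involved. It is only an illustration and not necessarily how the published VHDL code works; the function name and the path names and latencies are hypothetical. Given the latency of each parallel path in clock cycles, it computes how many extra register stages each path needs so that all of them match the slowest one.

```python
def balancing_delays(path_latencies):
    """Return the number of extra register stages each path needs.

    path_latencies maps a (hypothetical) path name to its latency in clocks.
    Every path is padded up to the latency of the slowest path.
    """
    target = max(path_latencies.values())
    return {name: target - lat for name, lat in path_latencies.items()}


# Hypothetical example: three parallel paths feeding the same block.
paths = {"max_finder": 3, "bypass": 0, "filter": 5}
print(balancing_delays(paths))
```

Whenever a parameter change alters one of the latencies, the padding of every other path has to be recomputed, which is the part that is easy to get wrong by hand.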

--
Wojtek
wzab01

Not sure why you expect the tool to do what you should do yourself, and to do so for the simulation tool as well. How can you simulate a design in which synthesis will insert the registers for you?

Kaz

--------------------------------------- Posted through

formatting link

kaz

On Tuesday, 29 September 2015 at 11:50:53 UTC+1, kaz wrote:

The tool is supposed to ensure that the appropriate number of registers is added. With a high-level parametrized description it is really difficult to avoid mistakes, so an automated tool is preferred. The registers are inserted not only for synthesis, but also for simulation. I hope that my preprint explains both the motivation and the implementation more clearly.

Regards, Wojtek

wzab01

I knew about this sort of thing ten years ago, although I've never used it (for FPGA I'm mostly an armchair coach).

At the time that my FPGA friends were rhapsodizing about it, the designer still needed to specify the total delay, but the tools took the responsibility for distributing it.

It makes sense to do it that way, because you're the one that has to decide how much delay is right, and who has to make sure that the timing for section A matches the timing for section B -- for the moment at least that's really beyond the tool's ability to cope.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Tim Wescott

I'm not picturing the model you are describing. If all sections have the same clock, they all have the same timing constraint, no? As to the tools distributing the delays, again, each stage has the same timing constraint so unless there are complications such as inputs with separately specified delays, the tool just has to move logic across register boundaries to make each section meet the timing spec or better to balance all the delays in case you wish to have the fastest possible clock rate.

Maybe by timing you mean the clock cycles the OP is talking about?

--

Rick
rickman

The way I've seen it, rather than carefully hand-designing a pipeline, you just design a system that's basically

           .---------------------.     .-------.
data in -->| combinatorial logic |---->| delay |----> data out
           '---------------------'     '-------'

where the "delay" block just delays all the outputs from the combinatorial block by some number of clocks.

Then you tell the tool "move delays as you see fit", and it magically distributes the delay in a hopefully-optimal way within the combinatorial logic, making it pipelined.

As I said, I've never done it -- I couldn't even tell you what search terms to use to find out what the tool vendors call the process.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Tim Wescott

(snip, I wrote)

Some time ago, and before I knew about this, I was working on designs for some very long pipelines, thousands of steps. Each step is fairly simple, and all are alike (except for data values).

I figured that in an FPGA, the pipeline would go across the array, then down and across backwards, until it got to the end.

I then figured that the delay at the end, where it turned around to go back, would be longer than other delays, but didn't know how to modify my code.

As with many pipelines, I can add registers to all the signals without affecting the results, though they will come out a little later. But where to add the registers?

It turned out to be too expensive, so never got built, or even close. Sometime later, I learned about this feature, but never went back to try it.

One could put in sets of optional registers, such that either all or none of a set get implemented. That might not be so hard, but you do need a way to say it.

-- glen

glen herrmannsfeldt

The problem I'm dealing with is just about the number of clock cycles by which the data in each data path are delayed.

The equal distribution of delay between the stages of a pipeline is so technology-specific that it probably must be handled by the vendor-provided tools, and in fact it usually is. In old Xilinx tools it was called "register balancing"; in Altera tools and in new Xilinx tools it is "register retiming".

So my problem is not so complex. And yes, it was solved in GUI-based tools many years ago. In the old Xilinx System Generator there was a special "sync" block which did that. Just see Fig. 4 in my old paper from 2003.

The importance of the problem is still emphasized by the vendors of block-based tools.

However, I've never seen a tool like this available for designs written in pure HDL rather than composed from blocks in a GUI-based tool...

I have found that for designs whose pipelines have lengths depending on different parameters and are interconnected in a complex way, there is really a need for a tool for automatic verification, or even better for automatic adjustment, of those lengths. Without that you can easily get an incorrect design which processes data misaligned in time.
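
The verification half of that idea can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the method from the paper: the block names and latencies are made up, each block is assumed to have a single fixed latency, and the design is assumed to be a feed-forward graph. The sketch propagates the accumulated latency through the graph and reports every block whose inputs arrive in different clock cycles.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def check_alignment(latency, inputs):
    """Report blocks whose inputs arrive in different clock cycles.

    latency: block name -> latency of that block in clocks
    inputs:  block name -> list of blocks driving it (feed-forward only)
    Returns (out_time, mismatches): out_time maps every block to the cycle
    in which its output is valid; mismatches maps each misaligned block to
    the arrival times of its inputs.
    """
    out_time = {}
    mismatches = {}
    # static_order() yields every block after all of its drivers.
    for block in TopologicalSorter(inputs).static_order():
        arrivals = {src: out_time[src] for src in inputs.get(block, [])}
        if len(set(arrivals.values())) > 1:
            mismatches[block] = arrivals
        out_time[block] = max(arrivals.values(), default=0) + latency[block]
    return out_time, mismatches


# Hypothetical design: two parallel paths from "src" meet again in "combine".
latency = {"src": 0, "tree": 3, "scale": 1, "combine": 1}
inputs = {"tree": ["src"], "scale": ["src"], "combine": ["tree", "scale"]}
times, bad = check_alignment(latency, inputs)
print(bad)
```

In this example the "scale" path would need two extra register stages before "combine". Automatic adjustment would insert them and rerun the check.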

So that was the motivation. Sorry if my original post was somehow misleading.

Regards, Wojtek

wzab01

Yes, but you talked about the tool not being able to "cope" with matching the delays in section A and B. I'm not following that.

--

Rick
rickman

Yes, I understand the problem you are addressing. I have never done a design where this was much of a problem, but I'm sure some designs are much larger and more complex than the ones I have done.

Yes, it is important to have a tool to do this when the design is large or your timing margins are tight. It can save a lot of work.

Not to me. :)

--

Rick
rickman

Basically I meant that you need to be responsible for lining up the delays in all the sections -- you can't make one section delay by five more clocks without identifying all the other pertinent sections that depend on that and making them delay by five more clocks, too.

If the tool could do everything we'd all be wiring houses for a living.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Tim Wescott

Ok, but that is not the tool CAD vendors provide. That is the tool the OP is talking about.

--

Rick
rickman

No VHDL compiler can be useful unless it respects the registers entered by the user, though it may fit an equivalent arrangement, as in register retiming, for timing purposes.

Register delay stages are obviously what we are talking about, rather than combinatorial/routing delays, which are a concern for the timing of each register and which the tool decides together with any constraints from the user.

It is up to the user to decide the register delay stages. That cannot be technology-sensitive unless you are doing some high-level coding that does not specify registers. I don't know what that level would be, though.

How can a user build a design without getting the register delays right? How do you add streams, or multiply, or switch, etc., and ask the tool to do the job?

Kaz

kaz

As mentioned before, just search for register retiming. It works exactly as you described, although it is not perfect. It can move combinational logic between register pairs to balance the slack. Register retiming is a relatively old technology and has been available in most independent tools (like Mentor's Precision and Synopsys's Synplify) and vendor synthesis tools for many years. From what I understand, vendor tools can only move logic in one direction due to a patent owned by Mentor Graphics.

# Info: [7004]: Starting retiming program ...
# Info: [7012]: Phase 1
# Info: [7012]: Phase 2
# Info: [7012]: Phase 3
# Info: [7012]: Phase 4
# Info: [7012]: Total number of DSPs processed : 0
# Info: [7012]: Total number of registers added : 138
# Info: [7012]: Total number of registers removed : 66
# Info: [7012]: Total number of logic elements added : 0

Register retiming is something you want to enable by default unless you are planning to use an equivalence checker.

Hans

HT-Lab

Register retiming is a technique to help timing of setup/hold of a given path.

It does not and should not change latency of path in terms of clock periods.

The OP is referring to latency of a path in terms of clock periods rather than delay issues within a given path.
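
The distinction can be illustrated with a small cycle-level model in Python. It is a toy sketch, not a description of what any synthesis tool does internally; the functions g and f and the helper simulate are hypothetical. The same two registers are placed either between the two pieces of logic or both after them, and once the pipeline has filled the output sequence is identical, with a latency of two clocks in both cases.

```python
def simulate(stages, samples, reset=0):
    """Clock a chain of stages, each one combinational logic plus a register.

    Returns the value seen at the last register in every clock cycle.
    """
    regs = [reset] * len(stages)
    seen = []
    for x in samples:
        seen.append(regs[-1])        # output visible before the clock edge
        sources = [x] + regs[:-1]    # what each stage sees in this cycle
        regs = [f(src) for f, src in zip(stages, sources)]
    return seen


def g(x):
    return x + 3


def f(x):
    return x * 2


# Registers placed as written by the designer: g -> reg -> f -> reg
original = simulate([g, f], range(6))
# After "retiming": all logic in front of the first register, then two registers
retimed = simulate([lambda x: f(g(x)), lambda x: x], range(6))
print(original[2:], retimed[2:])
```

Only the amount of logic between registers changes, which affects setup/hold margins but not the number of clock cycles from input to output. The first cycles after reset may differ in general, which is one reason retiming can confuse equivalence checkers.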

Kaz

kaz

On Wednesday, 30 September 2015 at 09:41:15 UTC+1, kaz wrote:

In the systems which I have to build there are some parametrized components whose latency depends on their parameters. Unfortunately I cannot publish the original designs, but a simplified version of one of those systems is provided on OpenCores as a demonstration of the method.

For example, I have a block for finding the maximum value among a certain number of inputs. It is a tree built from elementary comparators. When looking for the optimal implementation (in terms of resource usage and maximum clock frequency) I have to select the number of values compared simultaneously in such a basic comparator. My implementation automatically adjusts the number of stages to the number of inputs in an elementary comparator and in the whole system. Of course the number of stages affects the latency (delay in number of clocks).

There are many such blocks which may be adjusted independently. Trying to keep the design adjusted properly (in the sense that all latencies in parallel pipelines are equal) is really difficult and error-prone. So that's why I needed a tool which does it for me. Of course I have to analyze the results, and sometimes introduce manual corrections...

Does it answer the question above?
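
A small behavioural model in Python shows how the latency of such a maximum finder moves with the comparator size. It is a sketch under the assumption that every tree level is followed by exactly one pipeline register; the function name tree_max is hypothetical and the real VHDL block may be organized differently.

```python
def tree_max(values, fan_in):
    """Find the maximum with a tree of fan_in-input comparators.

    Returns (maximum, levels). With one register per level (an assumption
    of this sketch), levels is also the latency in clock cycles.
    """
    if fan_in < 2:
        raise ValueError("a comparator needs at least two inputs")
    level = list(values)
    if not level:
        raise ValueError("at least one input value is required")
    levels = 0
    while len(level) > 1:
        # Each elementary comparator reduces up to fan_in values to one.
        level = [max(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


# 64 inputs: the latency changes from 6 to 3 to 2 clocks with the fan-in.
for fan_in in (2, 4, 8):
    print(fan_in, tree_max(range(64), fan_in))
```

Every change of the fan-in while searching for the best clock frequency therefore shifts the latency of this block relative to all paths running in parallel with it.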

Regards, Wojtek

wzab01

So in short, you regenerate some components with a new latency different from the intended and tested one. I would just balance the latency manually and run the test. I don't see much practical scope for automating such a change.

Kaz

kaz