Buffering the critical path.

V

vssumesh 19 years ago

Hello all, In my design i am using a 32 bit adder and some combinational logic after that. The full path i want to constrain to double the clock period (20ns) and it is not constraing. When analysed the critical path observed that there is big carry chain for the adder and a big routing delay between the combinational logic (which i never expected). Is the big carry chain is causing the trouble in the router. I am thinking of buffering the output of the adder with a -ve edge (constrain that path to 5ns). And then constrain the other path that is after the buffer to next stage FF to 16ns. Will this buffering ease the routing effort. Please advice. Thanks and regards Sumesh V S

Vote

P

Peter Alfke 19 years ago

What kind of device are you using?

20 ns for a 32-bit adder (using dedicated carry) would be ridiculously slow... Dedicated carry, available in all Xilinx FPGA devices, uses less than 50 ps per bit (plus some basic delay). Peter Alfke

=========== vssumesh wrote:

Vote

V

vssumesh 19 years ago

No 20ns for the adder and the remaining combinational logic. The adder delay is as you said is very much less.

Vote

P

Peter Alfke 19 years ago

Vote

D

David Ashley 19 years ago

Peter,

He probably got thrown off by the top-posting format of your question. Forgive him. :)

formatting link

-Dave

David Ashley http://www.xdr.com/dash Embedded linux, device drivers, system architecture

Vote

M

Marlboro 19 years ago

Vote

R

Ray Andraka 19 years ago

SOunds like there are several layers of combinatorial logic. Pipeline the design. Also, I think he is using the term "buffer" to indicate adding a register stage.

The adder bits are like Peter said, about 50ps per bit, but the time to get on and off the carry chain adds more than 2ns, still nowhere near the 20ns.

It isn't the carry chain causing the problem. The problem comes about from using many levels of logic (ie the signal goes through lots of LUTs) between the flip-flops plus the propagation delay associated with the carry chain. You need to look at the ratio of logic delay to routing delay. If the routing delay on the critical path is more than the logic delay, you can likely fix the problem with some manual placement. The placer does a very poor job placing the additional layers of LUTs in multi-layer combinatorial logic. The LUT connected to the flip-flop places well, but the LUTs leading up to that one get scattered to the far reaches of the chip. You could try a higher effort level on the placement, but that may not provide enough improvement. You'll get better results floorplanning the locations of the additional layers of LUTs to be laid out logically and close to the rest of the LUTs in the path. Trouble is, the LUT names are subject to change on subsequent synthesizer runs, so you have to be really careful. The best solution, if your design can support it, is to pipeline the logic deeper.

Vote

V

vssumesh 19 years ago

Sorry Peter, but i did not mean that. Sorry for the confusion. I am using v4LX60 for my design. And there is a requirement of adding two 37 bit no and doing some combinational logic based on that. The total time is 20ns. The adder is taking very little time, the full logic itself is taking around 4ns delay. But the main problem is with routing delay. I forgot to tell you that it is a block RAM based design. And it uses 128 BRAM frm v4lx60(implemented a 16 port RAM). Also it uses the block RAMS in a scattered manner. So now i have placed this block in the central region. So the last routing to the block RAMis taking lot of delays. In the previous version there was no combinational logic after the adder and i got the timig correctly. But not now. What i was asking is to add registers to latch the output of adder.I thought like it would be good for the PAR to see two paths insted of 1 path from a source FF to destination FF. Also Ray there is 32*16 such signals. Is it possible to manually route all those signals. I think the pipeling is not possible since this is part of a pipeline stage of a processor. Which expects the result in the same cycle. So pipelining is not an option.

Ray i was asking that if we brake the above long line into separate parts using the +ve and -ve edge of the clocks is it possible to help tool for a better PAR. Thanks and regards Sumesh V S

Vote

P

Peter Alfke 19 years ago

Vote

T

Tommy Thorn 19 years ago

Well, we can't have that, so ... congratulations! I look forward to having a chance to play with them, but it will not happen until the XC5VLX50 is supported by the ISE WebPACK and the ML501 becomes available.

LUT6s are nice and will help combinational-heavy designs. The logic functions in the DSP may come in handy if they are flexible enough (haven't checked yet). The biggest surprise in the Virtex-5 though was the new routing network. I can only hope that the more regular structure translates into shorter P&R times.

Oh, and kudos for the ML501. The specs looks outstanding and I'm really happy to see a DDR2 SODIMM slot and just just some soldered down part.

Regards, Tommy

Vote

G

Göran Bilski 19 years ago

What are the logic you do after the addition?

If the logic is any related to the carry-out of the addition, I tend to extend the carry-chain with more logic instead of doing it after the addition. It's easy to do AND/OR boolean expression using the carry-chain.

I would also look at the synthesis schematics to see if the design generated by the synthesis tool is correct. Sometimes just rewrite the RTL can improve both area and speed. Also since routing is a big issue, maybe trying to make the design smaller might improve routing delays.

But without specific details on the design, I can only give you very rough tips.

You might also try to run the xplorer tool and see if that will give you some more speed. The result from different runs within xplorer can vary up to 20%.

Göran

Vote

V

vssumesh 19 years ago

Hi Goran, The combinational logic after the addition includes a saturation logic which checks the 37 bit and based on that output constant values or if it is not saturation condition output the accumalated data. The bit checking operation is AND and OR operation like and 31 to 16 bits and if it is one based on some other parametrs output the saturated value etc. So it is a AND,OR array then a decision logic after that a mux. After all these operations there is a shifting operation which formats the data to write to RAM locations (There is a special format to write to the RAM). The saturation operation is based on the addition result. Is it possible to optimize such an operation as you said if yes please show me an example how to implement that in HDL.

Peter Alfke wrote:

Peter i am not that type. I am comming to you people for advice so i wont behave like that. Though i am working for a company so i wont be able to reveal the full system but to the extent to which you people can understand the problem i think i am able to explain. Also my english is week so may not be the correct sentance that i am trying to express. Please forgive me if that happend.

I know that the routing delay for a carry chain is very small but i thought that the long steps in it is causing confusion to the tool. Thats why i have mentioned it.

Ya i am trying with that modifying the area placements etc. One doubt in that ; what should be the margin i gave to tool for the time constrains. Eg: if the working frqency is 80MHZ (12.5 ns) is it enough that i gave 12ns second constrain to the critical path. Or 24ns second for a double cycle path.

No i have to complete the design in V4LX60.

Also please advice me on the braking the logic path. will that be good for timing ???

Vote

K

KJ 19 years ago

underanalyzing the problem. By 'overanalyzing' I mean that you originally homed in on the carry chain as being a possible contributor to your problem and started wondering about ways to improve it; by 'underanalyzing' I mean that you didn't just look in the final fitted report for the worst case timing paths to either confirm or deny this to be the case.

The process for improving timing is both relatively straightforward but is still somewhat of an art. The straightforward process part is simply...

Identify all of the timing paths that are failing. This should come out of the timing report not out of any guessing.
Understand what the root cause of the timing problem is for each failing path (if there aren't too many) or the top problems (if there are a lot).
Decide what to do to fix the problem (more on that later).
Impement the fix, run through the tools and go back to step 1. Repeat until there are no failing paths.

Step #3 "Decide what to do..." is where the art comes in and probably you are the only one who knows what can be traded off in your particular application. It's also somewhat of an art in somehow knowing ahead of time which tricks will be best in a particular situation. Another word for 'art' though would be simply 'experience'. Even without this you can still do it, it just may take you longer.

Here are some of the tricks of the trade that you'll need to decide for yourself on how to apply. Some may be applicable to your design, some may not. There are other tricks that I'm sure other people can suggest as general improvements but without more detail on your code I don't think anyone will necessarily be able to give specific guidance on your design.

Like I said, the list below are just some samples of things, some or none of which may apply to your design...but I'm guessing that some will.

Better algorithms. This tends to result in either be a big improvement or will not be applicable at all. An example here could be to use a linear feedback shift register instead of a counter. The improvement here is that there is no carry chain to worry about, performance is basically independent of the size of the counter. Also logic resource scales less rapidly than with a binary counter. Whatever your 'and/or' logic is that you have after your adder might be worth inspecting to see if there is a better overall algorithm.
Increase latency. Find some point in the logic where you can register signals without 'too much' problem and that also breaks up the failing timing path (as determined from #1 at the start of the post) roughly into equal timing lengths. Generally when you delay things by a clock cycle there is something else that is affected by this so you need to change whatever that is as well. How many and how difficult those 'other' things are is how you decide if it's 'too much of a problem' to break the chain at a certain point. In some cases, there might be no choice, you have to break the chain at some point and there are several things that get affected by this and you just have to code to it. This general technique is also called 'pipelining'.
Relax clock cycle requirements. Do things really need to be done in one clock cycle or can they be done in two (or more)? The default static timing analysis works on the assumption that the combinatorial logic between flops must be completed within one clock cycle. If your input isn't changing that fast and you can take more than clock cycle in your application then you simply need to specify to the synthesis tool which paths those are and how many clock cycles the computation can take.
Transform operations that depend on two operands into one that depends on a single operand and a constant. This is somewhat of a more specific example of a 'better algorithm'. An example of this would be let's say you have a counter that needs to count up to a programmable number of counts and then reset. One way to do this would be to provide a writable register that contains the 'stop count'. Reset the counter to 0 and count up until you get to 'stop count'. If you do that you'll find the critical path to be the statement "if Count = Stop_Count then....". If 'Count' and 'Stop_Count' are 10 bit numbers then the '=' function will take as input 20 signals. A different approach would be to preload the counter with 'stop count' and then count down until you get to 0. Now you will be comparing count to a hardcoded number (i.e. 0) which means the '=' function will only take as input 10 signals. Most likely fewer logic resources and better timing.
See if operations can be moved from one clock cycle to another. This one is a bit tricky to explain and falls under what synthesis tools would call 'register retiming' (which may or may not be happening already on your design) but the idea here is that let's say there is already a three clock cycle pipeline where stuff on clock cycle 2 is where the critical path is but the flops that are doing the clock cycle

1 stuff get done really quickly. It might be possible to move some of the computation from clock cycle 2 into clock cycle 1 slowing down the timing paths in clock cycle 1 (but it was 'fast' to begin with) and improving your critical timing path in clock cycle 2. The 'simplest' way to approach this would be to read up on 'register retiming' (if your tool has this) and try it out and hope that it can spot ways to do this movement without further involvement on your part. If it can help you, it is usually just a matter of adding some extra flops to your source code and let the tool do the hard work. If it doesn't then you'll need to sit down and figure out for yourself how to move things from one clock cycle to another to spread out the load a bit. Unfortunately, I can't give you a good specific example here it is very application specific.

The examples I've given for each type of trick is not because I have any reason to think it will or won't apply to your design but just specific examples of the general thing that you're trying to do for each point. You need to analyze your failing paths and figure out which basic idea might be applicable and then develop the code specific to your design and go through the 4 step process.

KJ

Vote

G

Göran Bilski 19 years ago

Hi,

The idea is to use MUXCY as a building block. The drawback is that you need to express the adders using MUXCY as well since you need to extend that carry-chain with some more logic.

ex.

If you do a comparison between two values and want to do some function depending on the result of comparison. Since comparison is basically a subtraction, the carry output from the subtraction will tell the result of the comparison.

Let say you want to AND the carry-output signal with some other signal. You can do this by a MUXCY like this.

MUXCY_I : MUXCY_L port map ( DI => '0', CI => Subtract_carry_out, S => signal_to_and_with, LO => anding_result);

As you can see the signal "anding_result" will only be high if "subtract_carry_out" and "signal_to_and_with" both is high. ie. an AND

The beauty of this is that instead of routing out the carry-signal to normal routing and using a LUT for the ANDing, this use hard-route wires and an extremely fast logic.

The "normal" way can take around 1 ns depending on family including routing and using the carry-logic around 0.01 ns

Doing an "OR" is done like this

MUXCY_I : MUXCY_L port map ( DI => '1', CI => Subtract_carry_out, S => Inverted signal_to_or_with, LO => oring_result);

Here the oring result is high if either the subtract_carry_out is high or the inverted signal_to_or_with is high.

You can now create a long chain of these AND/OR logic and have extremely fast operations compare to the normal way.

MicroBlaze is using this a LOT in order to maximize the clock frequency.

Göran

Vote

T

Tommy Thorn 19 years ago

Great list Kevin!

There's an even better way (cf. Peter Alfke): widen the counter with one more bit (if necessary), and start the count one below the original value. When the original counter reached 0, the new counter will roll around to all ones, thus you only have to check the top bit. Since I saw this brilliant trick, I've used it everywhere. (It is of course yet another trade-of, but it's saving many LUTs at the expense of one cheap FF and one fast carry chain).

In general the "Decide what to do..." boils down to fully understanding: the target resources (their strengths and limitations), what you are trying to achieve, and what potential trade-ofs can be made. Generally there's no "quick fix", only hard work.

Tommy

Vote

D

David Ashley 19 years ago

Why is the inverter needed, can't you just switch the order of inputs:

MUXCY_I : MUXCY_L port map ( DI => Subtract_carry_out, CI => '1', S => signal_to_or_with, LO => oring_result);

2nd point, wouldn't the synthesizer be expected to make this optimization automatically? On the face of it this sounds like vendor lock-in. If you take advantage of this you're tied to xilinx parts...

-Dave

David Ashley http://www.xdr.com/dash Embedded linux, device drivers, system architecture

Vote

K

Kolja Sulimma 19 years ago

Exactly when will a saturation take place? And how many different saturated values are selected? Maybe you can reveal the RTL of that part? I doubt that it is that much of a trade secret....

I doubt that you gain much in FPGAs, but there is patent by infineon for their tricore-2 ALU. They have a logic path parallel to the ALU that checks whether a saturation will occur via a carry look ahead like structure.

Kolja Sulimma

Vote

Buffering the critical path.

Join the Discussion

Didn't find your answer?