HDL tricks for better timing closure in FPGAs

Hi,

I am working with FPGAs and trying to take advantage of the parallelism available in them. My design is in Verilog HDL, and it uses a lot of the FPGA's resources: roughly 80% of the available LABs (or LEs/ALMs), ~50% of the interconnect, and ~80% of the available on-chip memory. I am also required to run this design at a very high clock rate, almost equal to the on-chip memory clock rate. With this high utilization, it's becoming very difficult to meet these tight timing requirements!

So, my question is about Verilog HDL coding: is there a recommended coding style that improves timing closure (in other words, makes it easier for the tools to meet timing)?

If yes, please point me to the right location.

Thanks in advance. JeDi

Reply to
JeDi

No, there are no 'coding styles' in any language that will do anything to improve clock cycle performance.

To improve timing you change your design to pipeline the processing. To pipeline you break up a 'big' computation (i.e. one that takes a long time and therefore becomes a critical timing path) into smaller ones that take multiple clock cycles.
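As a minimal Verilog sketch of that idea (all module names, signal names, and widths here are invented for illustration, not taken from the poster's design): a multiply-accumulate computed in one clock forces the multiplier and both adders into a single critical path, while splitting it into two registered stages adds a cycle of latency but lets each clock period do less work.

```verilog
// Hypothetical sketch; names and widths are illustrative only.
// One-stage version: multiplies and adds all land in one clock period.
module mac_flat (
    input             clk,
    input      [7:0]  a, b, c, d,
    input      [15:0] e,
    output reg [17:0] acc
);
    always @(posedge clk)
        acc <= (a * b) + (c * d) + e;   // long path: multiplier + two adders
endmodule

// Pipelined version: stage 1 forms the products, stage 2 only adds.
// One extra cycle of latency, but the critical path is much shorter.
module mac_pipe (
    input             clk,
    input      [7:0]  a, b, c, d,
    input      [15:0] e,
    output reg [17:0] acc
);
    reg [15:0] p0, p1, e_d;   // e_d delays e to stay aligned with the products
    always @(posedge clk) begin
        p0  <= a * b;         // stage 1: products only
        p1  <= c * d;
        e_d <= e;
        acc <= p0 + p1 + e_d; // stage 2: additions only
    end
endmodule
```

The price, as discussed later in the thread, is the extra pipeline registers and the bookkeeping to keep parallel data paths aligned in latency.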

Kevin Jennings

Reply to
KJ

I disagree; however, I would include 'pipelining' as part of the coding style/tricks. You can also try to code such that the critical path(s) will have small enough blocks of logic between flip-flops to enable timing closure. You may add attributes to signals to try to coerce the synthesizer into doing 'the right thing'; if that doesn't come automatically, you might do some low-level coding, synthesize that, and then use the resultant EDIF file as a black box at the next level up.

Before troubling with all that, though, do some bottom-up evaluations, particularly of things you feel will have trouble meeting timing. The tools will often produce sub-optimal solutions when trying to solve many simultaneous, conflicting requirements (resource location, timing, ...), and thus have trouble with the total design. Generally, the timing performance achieved at the chip level is less optimal than at the block level, so make sure your blocks will meet timing. If they do, and the whole doesn't, you might try incremental design techniques, where you solve one problem, build on it for the next, etc. If they don't, you can try re-coding and/or re-architecting to get the block(s) to meet timing, and then try the whole. Solve the relatively simple problems first... and sometimes the big problems become simple.

Other (non-coding) tricks: location constraints, multi-cycle constraints, ...

JTW

Reply to
jtw

Hi jtw !!

"I disagree; however, I would include 'pipelining' as part of the coding style/trick. You can also try to code such that the critical path(s) will have small enough blocks of logic between flip-flops to enable timing closure."

I tried pipelining, mainly to break large combinational blocks into smaller ones. But the problem I run into is that logic utilization increases by another 15-20% on top of the already high utilization! This creates a situation where the tool is not able to place everything close enough to meet timing, because the FPGA's interconnect delay now becomes the bottleneck.

"You may add attributes to signals to try to coerce the synthesizer into doing 'the right thing'; if it doesn't come automatically, you might do some low-level coding, synthesize that, and then use the resultant edif file as a black box to the next level up."

This is what I am trying now, to see what impact it has. Fingers crossed! Thanks for the suggestion, though.

"The tools will often produce sub-optimal solutions when trying to solve many simultaneous, conflicting requirements (resource location, timing, ...), and thus have trouble for the total design. Generally, the timing performance achieved at the chip level is less optimal than at the block level, so make sure your blocks will meet timing. Other (non-coding) tricks: location constraints, multi-cycle constraints, ..."

So true! I have seen exactly this: at the block level everything is fine, but at the chip level performance starts deteriorating. I have also tried resource-allocation and timing constraints. Though these improve the frequency a little, I think the biggest constraint comes from the very high logic utilization, so the tool is not able to concentrate its efforts efficiently. However, I am working on a few things; let's see if the efforts pay dividends!

Thanks for the suggestions. If you come across anything else, please point me to it!

JeDi

Reply to
JeDi

What's the difference between the target and the resulting frequency?

Reply to
Aiken

What yer want v. what yer get

Reply to
Icky Thwacket

Then you have the following options (in no particular order):

  1. Choose a faster speed grade part.
  2. Choose a larger part and keep pipelining.
  3. Design better algorithms that can be implemented with better performance.
  4. Slow the clock down

Unless you're just missing by a little bit, don't expect to attribute your way to happiness; you'll most likely be sadly disappointed... after spending a (possibly) considerable amount of time trying to get it to work. Uncross your fingers, let the blood flow through, and go back to one of the four suggestions given previously... or give it the old college try and hope for the best.

Saying that each block has good performance but tying them all together is a problem caused by 'sub-optimal' placement is without any basis. While it could be a contributor, the most likely cause is not the synthesis tool but the algorithm you're trying to implement.

Look at your worst case timing path. If you see it going through a whole bunch of levels of logic, then it's not the synthesis tool's poor placement, it's your logic. If you see only one or two levels of logic and unreasonably long delays then it is either the synthesis tool (as you suggest) or you have an unrealistic expectation of what kind of clock speed you can expect to run at.

Good luck

Kevin Jennings

Reply to
KJ

Saying that each block has good performance but tying them all together is a problem caused by 'sub-optimal' placement is without any basis. While it could be a contributor, the most likely cause is not the synthesis tool but the algorithm you're trying to implement.

Look at your worst case timing path. If you see it going through a whole bunch of levels of logic, then it's not the synthesis tool's poor placement, it's your logic. If you see only one or two levels of logic and unreasonably long delays then it is either the synthesis tool (as you suggest) or you have an unrealistic expectation of what kind of clock speed you can expect to run at.

Good luck

Kevin Jennings

-----

I have seen specific cases where the synthesizer gets carried away optimizing away redundant logic, e.g., several instances of the same logic. When the individual block is sent through synthesis and then place & route, everything is fine; when the chip, containing multiple copies of the logic, is sent through, the synthesizer perceives (correctly) redundant logic. Unfortunately, in my case it made place & route work much harder, because of the increase in fanout (more spacing/distribution than number of loads). When I synthesized the block independently, and then did the top level with the several instances appearing as black boxes, the chip-level place & route improved significantly, achieving timing closure.

I often run low-level blocks through preliminary synthesis & par, even when I don't do black-box instantiation, to give me that realistic expectation. I try to find out the limits of performance here, not just that it meets my 'top-level' timing; if it just barely meets at the low-level, I expect trouble later.... It also gives me the opportunity to experiment with different optimization schemes (coding style, re-architecting, synthesis directives, etc.) with a quick turnaround. In many of my designs, some blocks may be used 4 or more times; the more times, the more relevant to optimize for utilization, particularly when pushing the limits (space/routing/speed.)

JTW

Reply to
jtw

And there it is... Stop making large combinatorial blocks. As others have pointed out, make 1 or 2 levels of logic, then a flip-flop.

Here's an example that I see often in image processing: you're processing a raster-scan image, so you need to know when you're at the end of an image line. You make a counter to count out the pixels on a line, then you pepper the control equations with a comparison of the row counter to the limit value. It might meet timing when synthesized and placed in unit testing, but as the chip grows, the placement of these equations has to compete with the placement of all other logic, and it no longer meets timing.

The worst-case timing path is the counter increment signal (hopefully a register output), through the counter and through the carry chain; the counter output goes through the comparator, the comparator output combines with the rest of the control logic, then mercifully shows up at the D of a flip-flop.

Now, if you had just registered the comparator output instead, the increment logic, counter logic, and comparator delays would no longer contribute to your timing problem, because they all happened in the previous clock cycle. Yes, this means that you have to compensate for the flip-flop delay of the comparator, but you just work that out in the design before you code (or you'll work it out anyway when the design doesn't meet timing).
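A small Verilog sketch of that fix (module name, signal names, and the parameter are invented for illustration): the comparison is registered, and because it now happens one cycle early, it compares against LINE_LEN-2 so the flag still asserts on the last pixel of the line.

```verilog
// Hypothetical sketch of registering the end-of-line comparator.
// Names/widths are illustrative, not from the original post.
module line_end #(
    parameter LINE_LEN = 640
) (
    input      clk,
    input      pixel_en,     // advances the pixel counter
    output reg end_of_line   // registered: downstream control logic sees
                             // a clean FF output, not counter+compare delay
);
    reg [9:0] pix_cnt = 0;   // wide enough for LINE_LEN up to 1024

    always @(posedge clk) begin
        if (pixel_en) begin
            if (pix_cnt == LINE_LEN - 1)
                pix_cnt <= 0;
            else
                pix_cnt <= pix_cnt + 1;
            // Register the comparison: the counter, carry chain, and
            // comparator delays all land in this cycle, one cycle early,
            // so compare against LINE_LEN-2 to assert on the last pixel.
            end_of_line <= (pix_cnt == LINE_LEN - 2);
        end
    end
endmodule
```

The control equations downstream now start from a flip-flop output, which is the "1 or 2 levels of logic, then a flip-flop" discipline described above.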

--
Joe Samson
Pixel Velocity
Reply to
Joseph Samson

Joseph Samson wrote: The worst case timing path is the counter increment signal

Oops! Those counter outputs are registered! In my defense, it was 7AM. The principle remains, though. Think of your logic register to register, not as large combinatorial chunks.

--
Joe Samson
Pixel Velocity
Reply to
Joseph Samson

This will work, but the time involved to get the design finished will increase exponentially. A simpler way to do the same is using more than one clock. I usually have 3 clocks: slow (several MHz), main processes (tens of MHz) and high speed.

--
Programmeren in Almere?
E-mail naar nico@nctdevpuntnl (punt=.)
Reply to
Nico Coesel

Yes. A well-packed FPGA design will utilize about the same number of LUTs and flops.

Let's say that I instead describe a large combinational chunk using 100 LUTs and maybe a multicycle constraint of 10 ticks.

This will work fine in isolation as long as the flops associated with those 100 LUTs are not needed.

However, as the FPGA fills up, eventually place+route will need those spare flops for other processes, and the synchronous Fmax will start to suffer from the long routes needed to wire them up.

In other words, adding a large combinatorial chunk can "injure" an unrelated high-speed synchronous block, that was working fine before.

-- Mike Treseler

Reply to
Mike Treseler

Somehow, several of the designs I recently finished had: a 320 MHz+ data clock; DDR at 160 MHz+, coherent with the data clock; a 160 MHz+ system clock; an 80 MHz+ system clock, coherent with the 160 MHz system clock.

I still used multi-cycle constraints, TIG (Timing Ignore) constraints, location constraints, .... and had to 'play tricks' to force the tool not to remove redundant FFs, and insert FFs to break up combinational logic, and ...
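The 'tricks' to keep the tool from removing deliberately redundant FFs are usually synthesis attributes. A sketch, with the caveat that attribute spellings are tool-specific assumptions (KEEP and MAX_FANOUT are the Xilinx names; Synplify uses syn_keep/syn_maxfan, Quartus uses preserve/maxfan), and all names here are invented:

```verilog
// Hypothetical sketch of attribute-based FF preservation / duplication.
module keep_tricks (
    input  clk,
    input  rst_in,
    input  d,
    output q_a, q_b
);
    // Deliberately duplicated registers: KEEP stops the synthesizer from
    // merging them back into one FF (duplication cuts fanout and route length).
    (* KEEP = "TRUE" *) reg en_copy_a;
    (* KEEP = "TRUE" *) reg en_copy_b;

    // Alternatively, cap fanout on a hot net and let the tool duplicate:
    (* MAX_FANOUT = 32 *) reg sync_reset;

    always @(posedge clk) begin
        en_copy_a  <= d;
        en_copy_b  <= d;
        sync_reset <= rst_in;
    end
    assign q_a = en_copy_a;
    assign q_b = en_copy_b;
endmodule
```

Check your own tool's documentation for the exact attribute names before relying on these.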

Adding clocks will not particularly make meeting timing easier; it is comparable to using multi-cycle constraints, which imply a virtual clock. (It may add value in power saving on the clock trees.) Adding clocks adds complexity, because now you must manage the clock domain crossings. But, sometimes, that is just part of the job...
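The clock-domain-crossing management mentioned above, for the simplest case of a single-bit level signal, is typically a two-flop synchronizer; a minimal sketch (names invented; the ASYNC_REG attribute is the Xilinx spelling and is an assumption to verify for other tools):

```verilog
// Hypothetical two-flop synchronizer for a single-bit level signal.
// Multi-bit buses need a handshake or an async FIFO instead; this
// only bounds metastability for one bit.
module sync2 (
    input  clk_dst,   // destination-domain clock
    input  d_async,   // level signal from another clock domain
    output q_sync
);
    (* ASYNC_REG = "TRUE" *) reg s0, s1;

    always @(posedge clk_dst) begin
        s0 <= d_async;   // may go metastable here...
        s1 <= s0;        // ...resolved by the time it reaches s1
    end
    assign q_sync = s1;
endmodule
```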

JTW

Reply to
jtw

(jtw wrote, though I am redoing the indenting)

The advantage of pipelining is that each part of the pipeline runs in parallel. If you have more data to process, it enters the pipeline on subsequent cycles. There are many books on the design of pipelined processors (from the 1960's and 1970's) that will explain that part.

If you don't have more data to process, then you might do an iterative design that reuses the same logic on consecutive clock cycles, along with a state machine to keep track of what is being done and when.

Consider multiplying two N digit numbers (in any base).

You have to generate N partial products, and then add them together. As combinatorial logic, it may be N+1 levels deep. All the partial products are generated, and then N levels of adder to add them up. It takes O(N**2) logic units. Much of the logic isn't doing anything most of the time.

Pipeline it as an N stage pipeline. Each stage generates a new partial product and adds it to the cumulative result. It still takes the same amount of logic, plus the registers to generate the pipeline stages, but new data can go in on each cycle. The results come out N (or N+1) cycles later. M products can be completed in M+N+1 clock cycles, where the cycle is long enough to do one partial product and one sum. (Even better, pipeline the sum separately.)
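A hedged Verilog sketch of that N-stage pipeline for base 2 (one partial product per bit of the multiplier); the module and signal names are invented, and widths are kept minimal:

```verilog
// Hypothetical N-stage pipelined multiplier: stage i adds partial
// product i (a * b[i] * 2^i) to the running sum. New operands can
// enter every clock; the product emerges N cycles later.
module pipe_mult #(parameter N = 4) (
    input            clk,
    input  [N-1:0]   a, b,
    output [2*N-1:0] p
);
    reg [2*N-1:0] a_sh [0:N-1];  // 'a', shifted one more place per stage
    reg [N-1:0]   b_r  [0:N-1];  // remaining multiplier bits
    reg [2*N-1:0] sum  [0:N-1];  // running sum of partial products

    integer i;
    always @(posedge clk) begin
        // stage 0: first partial product (a * b[0])
        a_sh[0] <= a;
        b_r[0]  <= b >> 1;
        sum[0]  <= b[0] ? a : {2*N{1'b0}};
        // stages 1..N-1: add one more partial product each cycle
        for (i = 1; i < N; i = i + 1) begin
            a_sh[i] <= a_sh[i-1] << 1;
            b_r[i]  <= b_r[i-1] >> 1;
            sum[i]  <= sum[i-1] +
                       (b_r[i-1][0] ? (a_sh[i-1] << 1) : {2*N{1'b0}});
        end
    end
    assign p = sum[N-1];  // valid N cycles after the operands entered
endmodule
```

The per-clock work is one shift, one mux, and one add, which is exactly the short register-to-register path the rest of the thread recommends.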

If you don't have enough data to keep an N stage pipeline full, an iterative design works. Only O(N) logic units, though maybe an equivalent amount to keep the thing running. Results come out in N+1 cycles, new data goes in every N+1 cycles.
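The iterative alternative can be sketched as a shift-and-add multiplier that reuses one adder for N cycles; again the names are invented, and the tiny down-counter stands in for the state machine:

```verilog
// Hypothetical iterative multiplier: O(N) logic, one result per N cycles.
module iter_mult #(parameter N = 4) (
    input                clk,
    input                start,   // pulse to load new operands
    input  [N-1:0]       a, b,
    output reg [2*N-1:0] p,
    output               done
);
    reg [2*N-1:0] a_sh;          // 'a', shifted left each iteration
    reg [N-1:0]   b_sh;          // 'b', shifted right each iteration
    reg [N:0]     cnt;           // partial products left to process

    assign done = (cnt == 0);    // p holds the product when done

    always @(posedge clk) begin
        if (start) begin
            a_sh <= a;
            b_sh <= b;
            p    <= 0;
            cnt  <= N;
        end else if (cnt != 0) begin
            if (b_sh[0])
                p <= p + a_sh;   // add the current partial product
            a_sh <= a_sh << 1;
            b_sh <= b_sh >> 1;
            cnt  <= cnt - 1;
        end
    end
endmodule
```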

Those are the tradeoffs between logic and throughput.

You can also do something in between, with more logic and fewer pipeline stages than the latter design.

It is all a tradeoff between logic and throughput.

-- glen

Reply to
glen herrmannsfeldt
