Area Optimization

Hi all, I have a design (written in VHDL) targeting the Spartan 6 series, and it's oversubscribed for LUTs. Can anyone recommend good resources to read? I've already spent a little time looking around the design in ISE's schematic viewer, but with tens of thousands of LUTs it's not exactly a fast process going from that angle, and if possible I'd rather avoid getting into a lot of explicit instantiation of primitives.

I've already read the ISE documentation on how to write expressions that the synthesizer can recognize as particular patterns, but unfortunately most of my design is just brute-force combinational logic (a lot of basic boolean operations and additions on fairly wide values) arranged into a pipeline, so the special patterns don't really apply (I don't have counters, or RAMs, or shift registers, or what have you).

This is with ISE 13.1, in case it matters, the most recent AFAIK.

I do have the option of moving to a larger chip if necessary, but would strongly prefer not to as the one I'm using is the largest supported by WebPack. I've looked at chips in other families, and WebPack seems to top out at similar LUT counts in all the families.

Thanks! Chris

Christopher Head

One of the tricks, which I don't believe the tools will do automatically, is to use the BRAMs in place of logic. That is, use a BRAM as a big look-up table. Since BRAMs are synchronous, you have to fit them in with your pipeline logic, but that shouldn't be so hard to do.
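A minimal sketch of the idea, assuming a 10-bit input and an 8-bit result: the table is written as a constant array with a registered read, which is the style synthesizers can usually infer as a ROM. The entity name `lut_rom`, the widths, and the table contents are all made up for illustration; whether XST actually places the table in block RAM rather than in LUTs should be confirmed in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: a combinational function replaced by a table.
entity lut_rom is
  port (
    clk  : in  std_logic;
    addr : in  std_logic_vector(9 downto 0);
    data : out std_logic_vector(7 downto 0)
  );
end entity lut_rom;

architecture rtl of lut_rom is
  type rom_t is array (0 to 1023) of std_logic_vector(7 downto 0);

  -- Fills the table at elaboration time. The body is a placeholder:
  -- substitute the function the table is meant to stand in for.
  function init_rom return rom_t is
    variable r : rom_t;
  begin
    for i in rom_t'range loop
      -- Example contents only: (3*i + 1) truncated to 8 bits.
      r(i) := std_logic_vector(to_unsigned((3 * i + 1) mod 256, 8));
    end loop;
    return r;
  end function init_rom;

  constant ROM : rom_t := init_rom;
begin
  -- The registered read adds one cycle of latency, so the lookup has
  -- to take the place of a pipeline stage.
  process (clk)
  begin
    if rising_edge(clk) then
      data <= ROM(to_integer(unsigned(addr)));
    end if;
  end process;
end architecture rtl;
```

The table grows as 2^(input bits), so this pays off mainly for functions of roughly ten or so inputs that would otherwise take many LUT levels.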

-- glen

glen herrmannsfeldt

Chris

There are a pile of techniques that can reduce size, and a lot depends on the original HDL design and coding style. We do this as one of our services, and I have seen designs reduced to 40% of the original size in some extreme cases. A 20% reduction, to 80% of the original, is more typical.

As with any engineering problem, the first thing to do is to identify where your problem might be. I would typically use Floorplanner to identify which modules in your design are the largest. The largest probably has the best chance of giving you the biggest saving.

At the simple level, try both speed-driven and area-driven synthesis. Area mode does not always give the smallest result. You can also try different synthesisers, if you have them available, to get different results. Typically you might get 5-10% out of these techniques, but I have seen some extreme synthesiser results giving a 3x variation on some logic.

One other thing on synthesis that can make a reasonable difference is the setting for your state machine encoding. Try playing with different settings. If the XST switch isn't broken again, try anything but One Hot encoding. XST programmers have a fixation on One Hot encoding, and it only gives the best results in less than 25% of designs.

Moving to the next level, and much more extreme, is to look at your HDL. Here you can look for shift registers that can go to SRL16/32 technology in Xilinx parts. That can save a lot. Old techniques like using illegal states in a state machine to reduce the logic decoded can also be beneficial. Other techniques like using RAM for multiple related registers may also get you a reduction.

John Adair

John Adair

Have you turned on the area optimization control? Most synthesizers have a trade-off between speed and area, and most of the time they seem to default to optimizing for speed. Switching to area can easily get you 10% in most designs.

As to techniques, first you need to find out where your LUTs are being used. Rather than using tools for that, compile your code one module at a time or in smaller groups of modules. I usually code from the bottom up and test every module in the simulator. So it is not hard to also do a compile and see how large each one is. Then you might be able to see which ones are larger than you expect and can look at how to improve them.

Rick

rickman

Xilinx white paper WP231 is a good read. It is mainly for speed but shows why doing things like using an asynchronous reset is a really bad plan for both speed and area.

If you really don't care about speed then have you considered converting your parallel data paths into serial? Serial adders are really really small.
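To illustrate how small a serial adder is, here is a minimal sketch, assuming the operands arrive one bit per clock, LSB first. The entity name `serial_adder` and the `start` strobe are hypothetical choices for this example. The whole adder is one full adder plus a carry flip-flop, regardless of word width; the cost is that an N-bit addition takes N clocks.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical bit-serial adder: operands are fed LSB first.
entity serial_adder is
  port (
    clk   : in  std_logic;
    start : in  std_logic;  -- high while the LSBs are presented
    a     : in  std_logic;
    b     : in  std_logic;
    sum   : out std_logic   -- sum bit, one clock after its inputs
  );
end entity serial_adder;

architecture rtl of serial_adder is
  signal carry : std_logic := '0';
  signal cin   : std_logic;
begin
  -- On the first bit of a word the stored carry is ignored.
  cin <= '0' when start = '1' else carry;

  process (clk)
  begin
    if rising_edge(clk) then
      sum   <= a xor b xor cin;
      carry <= (a and b) or (a and cin) or (b and cin);
    end if;
  end process;
end architecture rtl;
```

The shift registers that serialize the operands and collect the result are not shown and have to be counted against the saving.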

John Eaton

jt_eaton

Have you first established which parts of your design are responsible for the most LUT usage?

If not, I wrote FPGAOptim when I was in a similar situation to help with just that:

formatting link

Drop me an email via that webpage and I'll get a download link to you.

Alternatively, these days PlanAhead can provide a view on LUT usage, and the logfiles also have some information.

Once you know which blocks to optimise, you've had good answers from others already. In my most recent case (a video processing application) there are sections of code which only have to update once per video line; they are prime targets for resource sharing.

As John Adair said, reducing by 20% is usually easily doable. With deep knowledge of what's going on and the tradeoffs that are acceptable, I've achieved 40-50% in the past.

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.co.uk/capabilities/39-electronic-hardware
Martin Thompson


Hi John,

Thanks for that pointer. I have always been a believer in using the async reset and now I see that this may not always be the best way to reset a design. But the devil is in the details. I wonder if this still applies to non-Xilinx designs?

Rick

rickman

It applies to all designs. Designers who started their careers with asynchronous logic carried it with them when Design for Synthesis and synchronous design became a requirement, but it has never been the best choice. Many designers make the mistake of thinking that because they need an asynchronous reset system, they must design it using asynchronous logic. That is simply not true. We design synchronous systems that are black box equivalent to asynchronous systems all the time. The main thing that you need to realize about reset system design is that the purpose of the reset system is not to reset the system when a trigger event occurs. Its purpose is to NOT reset the system when a trigger event is NOT occurring.

The same is true for airbag controllers. The job of an airbag controller is not to deploy the bag when the car is in an accident; its job is to not deploy the bag when the car is not having an accident. Any system where the expected number of uses is small and the effect of each use is large will follow this rule.

Remember the first Star Wars movie? They built the Death Star with an emergency exhaust port that provided a direct path from the reactor core to the surface. It was ray shielded but could not be particle shielded. Bad plan.

An asynchronous reset has a direct path from a pad into every flip-flop in the entire chip. It is analog shielded but not digitally shielded. Bad plan.

Resets in a real product (not a simulation) are really rare events. If a reset is delayed by 20 microseconds then nobody will notice. If a product that you are using suddenly resets itself then you will likely notice. Spend a few hundred cycles on a digital filter before you do something drastic.
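One possible shape for such a filter, as a sketch only: the pad signal is brought into the clock domain through two flip-flops, and the internal reset is asserted only after the pad has been sampled high for `FILTER_LEN` consecutive clocks. The entity name `reset_filter`, the active-high polarity, and the default length of 256 are assumptions made for this example. The filter itself has no reset and relies on the flip-flop values loaded at configuration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical reset filter: glitches shorter than FILTER_LEN clocks
-- on the pad never reach the rest of the design.
entity reset_filter is
  generic (
    FILTER_LEN : positive := 256
  );
  port (
    clk       : in  std_logic;
    reset_pad : in  std_logic;  -- raw, asynchronous, active high
    reset_out : out std_logic   -- filtered, synchronous, active high
  );
end entity reset_filter;

architecture rtl of reset_filter is
  signal sync    : std_logic_vector(1 downto 0) := "00";
  signal count   : natural range 0 to FILTER_LEN := 0;
  signal reset_i : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Two-stage synchronizer for the asynchronous pad input.
      sync <= sync(0) & reset_pad;

      if sync(1) = '0' then
        -- Any low sample restarts the count and releases the reset.
        count   <= 0;
        reset_i <= '0';
      elsif count = FILTER_LEN then
        reset_i <= '1';
      else
        count <= count + 1;
      end if;
    end if;
  end process;

  reset_out <= reset_i;
end architecture rtl;
```

The downstream logic can then treat `reset_out` as an ordinary synchronous signal.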

John Eaton

jt_eaton

John, the best is to design to never reset!

ARSDMTHE

You can create a design that will work with no resets at all. The problem is that the verification suite will take a few eons to finish.

John

jt_eaton

John, so include it in the design and go for eons!

ARSDMTHE

jt_eaton wrote: (snip)

Most FPGAs do an asynchronous reset on all FFs at the end of configuration. I don't believe that is optional.

-- glen

glen herrmannsfeldt


Interesting philosophy.

Rick

rickman

I believe that is optional for any given FF. The GSR has to be enabled on each FF, and that is the point of the white paper. In Xilinx devices, using the GSR uses one of the set/reset inputs on a FF as an async input, which also configures the other input as async, IIRC. The tools are capable of using the Set and Reset inputs as synchronous inputs, which reduces LUT usage and improves the speed of a design... in some cases.

As to the philosophical avoidance of async resets, I can't say I share that belief. As you point out, there is one async reset on the chip that you can't eliminate, the PROGRAM pin. Even if it doesn't reset the FFs, it will stop the design from working and reload all the LUTs and memory.

It has been a long time since I used a Xilinx part, so I may not remember them 100% correctly.

Rick

rickman

Lots of interesting advice here! In particular I read the Xilinx whitepaper with interest. Unfortunately, a lot of the advice seemed to be inapplicable to my problem. I can't look for the individual submodule that's taking up most of the area, because my application is a single long pipeline with a large number of very similar stages: the area isn't taken up by any one stage, but more by the number of stages. And because the design is a pipeline with general logic (mostly bitwise, plus a small bit of basic arithmetic) between registers, I don't really see any opportunities for special primitives like SRLs, DSPs, or the like that would reduce area. I can probably solve my problem by building a smaller pipeline and reusing it; I preferred not to do that as it will decrease system performance but it looks like I don't have much choice now.

Thanks anyway! Chris

Christopher Head

Something to remember about Xilinx FPGAs, at least when designing in VHDL and synthesizing with XST, is that you can specify the initial value of registered signals (when declaring the signal in the declarative part of the architecture). This is sometimes considered bad practice (bad coding style) in other contexts, and may not be supported by other tool flows.
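As a small sketch of what that looks like, assuming a free-running 8-bit counter that should start from 16 after configuration (the entity name `init_demo` and the start value are invented for the example). There is no reset port at all; the power-up state comes from the initial value in the signal declaration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example of a register with a declared initial value.
entity init_demo is
  port (
    clk   : in  std_logic;
    count : out std_logic_vector(7 downto 0)
  );
end entity init_demo;

architecture rtl of init_demo is
  -- The initial value is applied when the FPGA is configured,
  -- so no reset logic is needed to reach a known starting state.
  signal count_r : unsigned(7 downto 0) := x"10";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      count_r <= count_r + 1;
    end if;
  end process;

  count <= std_logic_vector(count_r);
end architecture rtl;
```

The initial value only takes effect at configuration, so a design that must return to this state while running still needs some form of reset.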

RCIngham

I would start by saying that the biggest opportunities for savings are almost always by starting at the algorithm level. You'll only get so far by playing with implementation.

One suggestion might be to look for places where you could do 'double clocking', i.e. generate a 2x clock with the DCM and run a particular piece of logic twice per cycle, muxing the inputs and distributing the outputs. We have some designs that were multiplier limited, so we used this trick, as our main pipeline was slow enough to use one multiplier to do double duty per pipeline stage.

Some other tricks: use multipliers as shifters if you have them spare. See if you can rejigger your pipeline stages. Some of the older parts (Virtex-II or so) have dedicated BUFT primitives that you can use to reduce the number of logic elements in multiplexers.

Look at and understand the logic usage reports from the synthesizer. If a module gets generated with more f/fs than you think it should, it's good to dig in and figure out what got generated. For XST there is a tool or option that will show a schematic of the synthesized logic; this can be handy.
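A rough sketch of the double-clocking structure, assuming two independent 32-bit additions per pipeline cycle and a `clk2x` that runs at twice the pipeline clock. The entity name `shared_adder`, the port names, and the use of an adder rather than a multiplier are all choices made for this example. It also assumes the operands are held stable for both fast clock cycles and that `phase` is aligned to the slow clock by other means, which is not shown.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: one adder doing the work of two.
entity shared_adder is
  port (
    clk2x  : in  std_logic;              -- twice the pipeline clock
    a0, b0 : in  unsigned(31 downto 0);  -- first operand pair
    a1, b1 : in  unsigned(31 downto 0);  -- second operand pair
    s0, s1 : out unsigned(31 downto 0)   -- registered results
  );
end entity shared_adder;

architecture rtl of shared_adder is
  signal phase : std_logic := '0';
  signal x, y  : unsigned(31 downto 0);
  signal sum   : unsigned(31 downto 0);
begin
  -- Input muxes select which operand pair uses the adder.
  x <= a0 when phase = '0' else a1;
  y <= b0 when phase = '0' else b1;

  -- The single shared operator.
  sum <= x + y;

  -- The result is steered to the matching output register.
  process (clk2x)
  begin
    if rising_edge(clk2x) then
      if phase = '0' then
        s0 <= sum;
      else
        s1 <= sum;
      end if;
      phase <= not phase;
    end if;
  end process;
end architecture rtl;
```

The input muxes cost logic of their own, so the saving is more likely to be worthwhile for expensive operators such as the multipliers mentioned above than for a plain adder.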

amdyer

(snip)

For systolic arrays, which I will guess that the OP is working on, that often doesn't help. You could speed up the whole thing by a factor of two, though.

-- glen

glen herrmannsfeldt

I really like the fact that you can initialize RAMs as well. You no longer need to think in terms of RAMs or ROMs; you have a universal read/writable ROM for everything.

Need a screen buffer for your display? Create a startup screen image file and have that loaded as well. Need some boot/test code? Load it in at startup and then reuse that memory later.

This stuff is great!!

John

jt_eaton

"General" logic is always ripe for optimization, or maybe I should say, de-unoptimization. If I were you, I would code each stage as a separate module and measure the size to compare to what you think it should be.

I have seen many times where the tools took what I thought was pretty straightforward code and blew it up into something ugly. Obviously it was doing what I told it to, but I would have been able to do better than the machine because I understood the logic better. So I had to change my code to indicate how it could be simplified.

Don't worry about the special features of a chip. First figure out if the tools did an ok job...

Rick

rickman
