tricks to make large PLAs fast?

I have a design with a large PLA, and I'm trying to make it run fast in a Spartan-3. It's too big for block RAM, since it has 25 inputs, slightly fewer than 512 product terms, and 32 outputs.

My first attempt was to just translate the PLA equations to VHDL and synthesize it with the default settings. This uses 34% of a 3S500E and has a minimum cycle time of slightly under 12 ns, with over 20 levels of logic. If I turned on timing-driven mapping, or increased the effort levels, I suppose it might get slightly better. But my own analysis suggests that it should be possible to implement the PLA with no more than 11 levels of logic worst case: seven levels for the product terms and four for the sums.

Anyhow, as the subject line suggests, I'm interested in any clever tricks to get more efficient FPGA implementation of the PLA. For instance, would use of the carry chain speed up wide gates? (Or do the tools already infer that?) Should I put some constraint on the product terms, to keep the tools from merging the product term and sum logic?

I'll experiment with this myself, but perhaps someone here has already done this, in which case suggestions would be quite welcome.

Thanks, Eric

Reply to
Eric Smith

Eric

First comment is that you have a lot of levels of logic. That might not be a problem if it is only carry logic.

Is the 12 ns the result of a build with a timing constraint set, or one run without?

Often you can assist speed by separating logic manually into logical blocks and not allowing the synthesis process to share logic. Sometimes you can enforce a function crudely by using the negative edge to produce a sub-term, but that does not always work well.

If you set the frequency constraint to what you want, the tools will usually report a failure if there is no chance of making timing. The thing then is to examine where the design fails and see if you can pipeline or in some other way make it faster. One thing to remember in SRAM-based FPGAs is that the width, or number of inputs, of a function has a direct effect on the levels of logic and hence the speed. So, as an example, if you have a state machine driving this logic, consider using Gray or binary encoding rather than the usual default of one-hot, if the state is one of the PLA's inputs.

John Adair Enterpoint Ltd. - Home of Hollybush1. The PC104+ FPGA Development Board.

Reply to
John Adair

Have you tried simply comparing "optimize for area" versus "optimize for delay" settings in the synthesis tool? That can make a big difference, at least with Leonardo Spectrum; you don't say which synth tool you are using.

- Brian

Reply to
Brian Drummond

If your terms are wide, carry chains are often a more compact way of doing product terms *and* sums. While short carry chains tend to take about as long as three logic levels, you should be able to implement the entire 25-input, 512-product-term, 32-output array in 512+32 carry chains or fewer (think of the NAND/NAND array form of a sum of products). The number of inputs in each product term determines the number of elements in its carry chain. I don't believe many tools do a good job of pushing logic into carry chains.

If you choose to parameterize a module to implement the product terms, you could use a threshold of 7 inputs to decide between plain combinatorial logic and a manual carry-chain implementation, to keep a balance of delays and resources. The drawback to this technique is that common product-term "chunks" which could be shared between different terms end up in separate carry chains.
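
To make the tradeoff concrete, here is a rough Python sketch of the per-term bookkeeping such a parameterized module might do. The threshold value, the reading of it as an input count, and the assumption that each 4-LUT gates four literals into one carry-chain element are mine, not verified against any particular mapper.

```python
import math

def tree_levels(k, lut_inputs=4):
    """Depth of a balanced LUT tree ANDing k literals."""
    return max(1, math.ceil(math.log(k, lut_inputs)))

def chain_elements(k, lut_inputs=4):
    """Carry-chain elements, assuming each 4-LUT ANDs four
    literals and gates one MUXCY (an assumption about mapping)."""
    return math.ceil(k / lut_inputs)

def choose_impl(k, threshold=7):
    """Pick an implementation style per product term of k inputs;
    the threshold of 7 follows the suggestion above."""
    return "lut_tree" if k < threshold else "carry_chain"
```

For example, a full 25-literal product term comes out as a 3-level LUT tree or a 7-element carry chain under these assumptions.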

Reply to
John_H

I put registers on the inputs and outputs, and set a timing constraint for the cycle time at 8 ns.

Last night I tried synthesizing the same source for a Virtex 4 LX in the -12 speed grade, with timing-based mapping turned on and the map and par efforts set to high, and a 5 ns constraint. It got it down to 10 levels of logic, but still took a hair under 10 ns. I'll try this again on the Spartan 3E.

It takes a very long time to run that, so most of the time during development I'll use the default settings and live with a lower clock speed.

Eric

Reply to
Eric Smith

I think I was using optimize for delay, but I didn't save the report files so I'll have to run it again to be sure.

I'm just using the ISE synthesis.

Thanks, Eric

Reply to
Eric Smith

That's a large array - does it really cover 2^25 combinations, or can you compress the inputs, so that the remainder can fit into Block Ram(s) ?

-jg

Reply to
Jim Granville

Not really. It was a design originally implemented in custom CMOS in the early 1980s, and I don't really want to redesign it any more than necessary. There are lots of don't cares scattered throughout the AND matrix of the PLA, so it won't fit in any reasonable-sized ROM or RAM. Also, the 25-bit input words don't uniquely map to outputs; a given input word may (and often does) match multiple product terms.

I've now tried putting a "keep" attribute on the product terms, and that made the timing worse. I thought it would result in better (separate) optimization of the product terms and OR terms, rather than mashing them together and trying to optimize the result.

By default, ISE *is* using the carry chain, and in fact it seems to be using it for some 95-input gates, which end up being much slower than an equivalent tree would be. I'm doing a run now with the "USE_CARRY_CHAIN" attribute set to "no" for all the sum terms.

I might try hacking my Python tool that translates the PLA so that it directly instantiates trees of four-input gates for all of the product terms and sum terms, with a "keep" attribute on each gate output, and see what kind of timing that gets me. I think that should result in the fewest possible levels of logic, which would be around eight.

For the first attempt, I'll just brute-force it, so that none of the terms have any shared sub-terms. If the results look reasonable, I'll try to optimize it for use of common sub-terms to reduce the total number of LUTs, while keeping the levels constant.
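
As a sketch of what that brute-force pass might emit, here is a toy Python generator for a balanced tree of four-input AND gates; the signal and prefix names are hypothetical, and this greedy grouping can use one LUT more than the optimal eight for a 25-literal term, though still in three levels.

```python
def emit_and_tree(signals, prefix, lines):
    """Recursively emit VHDL concurrent assignments for a balanced
    tree of up-to-4-input AND gates over `signals`, appending each
    assignment to `lines` and returning the root signal name."""
    if len(signals) == 1:
        return signals[0]
    outs = []
    for i in range(0, len(signals), 4):
        group = signals[i:i + 4]
        if len(group) == 1:          # odd one out passes through
            outs.append(group[0])
            continue
        name = f"{prefix}_{len(lines)}"
        lines.append(f"{name} <= {' and '.join(group)};")
        outs.append(name)
    return emit_and_tree(outs, prefix, lines)

# Demo: one 25-literal product term, hypothetical input names i0..i24
lines = []
root = emit_and_tree([f"i{n}" for n in range(25)], "p0", lines)
# Produces 9 gates in 3 levels; `root` is the product-term output
```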

Eric

Reply to
Eric Smith

... another idea : you could target a CPLD, like XC2C512, and see what that tool flow makes of it. It may give some ideas...

Their PLA has a fanin of 40, a depth of 49 per 16 MC block, so it _should_ be very efficient - but it may be too big for the tools to optimize properly, in which case some middle nodes might help?

-jg

Reply to
Jim Granville

Interesting idea, but the rest of the design isn't going to fit in a CPLD, and splitting it across two chips is going to be even slower. I'm just trying tricks to improve performance in the FPGA.

Eric

Reply to
Eric Smith

I was not suggesting you move the whole design to the CPLD, just the PLA portion, to see what the tools do. The FIT reports might give you some ideas that can be used in the FPGA, and you will be able to judge the FPGA logic levels. The 512-macrocell device would be able to fit one output per block, and in that case the FanIn and OR widths will show you the resources needed per PLA output. Think of it as a PLA Analyzer :)

-jg

Reply to
Jim Granville

Make sure to translate the don't-cares into your VHDL and to use a synthesis tool that makes use of them. AFAIK ISE just sets all specified don't-cares to '0'.

I can also imagine that most tools are not well tuned to synthesize large SOP descriptions. You could try feeding the design through a university synthesis tool first; see what SIS makes of it. Or try a BDD package. A BDD package will generate at most 25 levels of logic for that design, probably a lot fewer. Generate VHDL from the BDD and feed that to ISE; maybe it does a better job optimizing that than the SOP description. If there are any outputs that depend on fewer than 14 inputs, push those into block RAMs.
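
To illustrate the 25-level bound: Shannon-expanding one variable per level turns any function into a chain of 2:1 muxes no deeper than the number of inputs it actually depends on. A toy Python sketch of that depth calculation (the cube representation and names are my own, not from any BDD package):

```python
def mux_depth(terms, n_vars, var=0):
    """Depth of a Shannon (mux-tree) decomposition of a sum of
    products.  Each cube is an n_vars-tuple of 0, 1, or None
    (None = the cube doesn't test that input).  At most one 2:1
    mux level is spent per input the function depends on."""
    if not terms or var == n_vars:
        return 0                     # constant 0 (or all inputs consumed)
    if any(all(v is None for v in t) for t in terms):
        return 0                     # some cube covers everything: constant 1
    if all(t[var] is None for t in terms):
        return mux_depth(terms, n_vars, var + 1)   # independent of this input
    lo = {t[:var] + (None,) + t[var + 1:] for t in terms if t[var] in (0, None)}
    hi = {t[:var] + (None,) + t[var + 1:] for t in terms if t[var] in (1, None)}
    return 1 + max(mux_depth(lo, n_vars, var + 1),
                   mux_depth(hi, n_vars, var + 1))
```

For example, a·b + a'·c needs two mux levels, and even a full 25-input AND never exceeds 25.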

Kolja Sulimma

Reply to
Kolja Sulimma

My analysis suggests 3+5=8 as the minimum number of levels of 4-input LUTs:

4*4*4 = 64 > 25 and 4*4*4*4*4 = 1024 > 512. A product could require up to 8 LUTs, and a sum up to (2**9+1)/3 = 171, for a total of 512*8 + 32*171 = 9568 LUTs. What am I missing?

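
The arithmetic checks out; a few lines of Python verify it (the helper names are mine):

```python
import math

LUT_K = 4  # Spartan-3 has 4-input LUTs

def levels(n):
    """Depth of a balanced 4-LUT tree over an n-input gate."""
    return math.ceil(math.log(n, LUT_K))

def luts(n):
    """LUTs in the reduction tree: each 4-LUT retires 3 signals."""
    return math.ceil((n - 1) / (LUT_K - 1))

# 3 levels for a 25-input product, 5 for a 512-input sum: 3 + 5 = 8
total = 512 * luts(25) + 32 * luts(512)  # 512*8 + 32*171 = 9568 LUTs
```
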
Reply to
Michael Hennebry

It was convenient to translate it to VHDL concurrent assignments of the form:

p126

Reply to
Eric Smith

Michael Hennebry wrote:

I was tired and confused the count of levels with the number of LUTs needed for one term. The input terms need up to seven LUTs, but only in three levels as you state. For the sum terms, none have more than 256 inputs, so four levels is OK. Thus my PLA shouldn't need more than seven levels of logic.

I'm working on a new version of my Python PLA-to-VHDL converter that will explicitly instantiate LUT primitives. It treats the AND and OR arrays separately. For each array, it finds the most common four-input subexpression, generates a LUT for it, substitutes that LUT output back into the original equations, and iterates. I haven't quite got it working yet.
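
A minimal Python sketch of that iteration, assuming product terms are held as sets of literal names (the representation and signal names are hypothetical); it stops once no four-input subexpression occurs in at least two terms, leaving the rest for plain trees:

```python
from collections import Counter
from itertools import combinations

def extract_common(terms, k=4):
    """Greedy factoring: pull out the most common k-literal
    subexpression as a new LUT signal, substitute it back into
    every term that contains it, and repeat.  `terms` is a list
    of frozensets of literal names."""
    lut_defs = {}
    while True:
        counts = Counter()
        for t in terms:
            for sub in combinations(sorted(t), k):
                counts[frozenset(sub)] += 1
        if not counts:
            break
        best, hits = counts.most_common(1)[0]
        if hits < 2:                          # no sharing left to exploit
            break
        name = f"lut{len(lut_defs)}"          # hypothetical signal name
        lut_defs[name] = best
        terms = [(t - best) | {name} if best <= t else t for t in terms]
    return terms, lut_defs
```

For instance, two terms abcde and abcdf share the chunk abcd, which becomes one LUT feeding both.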

Eric

Reply to
Eric Smith

Sounds interesting. Can you keep notes on the FPGA tools' handling of the various assistance levels, and report back when you have it working?

- and if you are doing AND and OR arrays separately, then the CPLD XC2 flow I suggested would make another good comparison. - ie compare your PLA optimizer with theirs ...

-jg

Reply to
Jim Granville

Eric Smith schrieb:

Didn't you say something about 25 inputs?

Kolja Sulimma

Reply to
Kolja Sulimma

That would not work. There are IEEE comparison functions to do that (using '-' not 'X') but our version is just as good. I thought you had don't cares in the function: p126

Reply to
Kolja Sulimma

Sorry, I misread your statement. I thought you were talking about the OR matrix.

If I use BRAMs for part of the AND matrix, it may slow things down, because I already need to register the output of the PLA, and this would make it take two cycles, or else require me to register all the product terms and move the OR matrix into the beginning of the next cycle, adding to the combinatorial delay after the PLA.

Or adding an out-of-phase clock for the BRAMs.

I haven't yet studied the BDD stuff you suggested (CUDD), but it looks interesting. I also found a Java BDD package.

I've already started hacking my Python PLA translator code to instantiate LUT primitives directly. It will be interesting to see whether P&R can do a good job on the result, or whether I'll have to come up with my own placement algorithm and use RLOCs.

If my approach is successful, I'll publish the Python code in case it might be of use to anyone else with a similar problem.

Eric

Reply to
Eric Smith

I should have kept better notes on my attempts so far, which mostly amounted to fiddling with combinations of "keep" attributes on the product terms, disabling USE_CARRY_CHAIN on the product terms and sums, timing-driven mapping, and the mapping and P&R effort levels. But so far no combination I tried gave better results than simply not using any attributes and running the tools with the default settings. If I throw in the attributes, the process takes up to ten hours to run, versus less than an hour without, so I haven't been able to try all permutations. But I've pretty much convinced myself that trying to make Foundation do a better job on the PLA through the use of attributes is a lost cause.

I suppose to some extent I'm impressed that Foundation did as good a job as it did at the default settings, since I'm obviously doing something that the tools weren't really optimized for.

I'll give it a try.

Reply to
Eric Smith
