FSM State Minimization on FPGAs

Hi everybody,

after a long discussion about State Minimisation with ISE

formatting link

I was looking for other tools supporting this feature. The result was disappointing. None of the FPGA synthesis tools seams to support this feature, while every tool was capable of extracting FSMs and creating safe-FSMs if desired. But none seems to support State Minimization.

Anyone knows a good reason why? Or have I missed a tool? (exept design compiler, which is for ASICS)

Best regards Eilert

Reply to
backhus
Loading thread data ...

How about: other than in academia, there's not a whole lot of use for the capability, and identifying redundant states (or chains of states) is not an easy task (read: extends run-time).

I'd rather they work on something that benefits more users...

Andy

backhus wrote:

formatting link

Reply to
Andy

I'm glad someone else said that before me :-)

After the long and intriguing thread on state minimisation, I had been thinking about various state machines - trivial and not so trivial - that I've designed over the past few years; and I can't think of even one where automatic state merging would have been helpful. The kind of "sequence recogniser" state machine that was used as an example isn't very realistic, is it? Have I missed something important?

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

About the only times I've had a state machine that could been minimized I would have been looking for the switch to turn off state minimization, if that was implemented.

Suppose I had a statemachine that needed to run at close to full FPGA speed. This implies that the design needs to have a single level of logic between registers. With N-input LUTs, then a given state can have only N (input states + conditions). By splitting states, then a faster clock speed and functional identical state machine can be produced.

As an example, suppose I was designing for 4-input LUTs, and I had two input states, and two condition bits for a state. One-hot is assumed. The input logic to state3 would be:

state3 :: (state1, state2, (state3, c1, c2))

This is five inputs, which would require two levels of logic with

4-LUTs. I could improve speed by splitting state 3 into two states, state3a and state3b:

state3a :: (state1, (state3a, c1, c2)) state3b :: (state2, (state3b, c1, c2))

Any output equations and next state equations would have state3 replaced with (state3a or state3b). A "state minimization feature" would collapse these states back into state3, making the FPGA run slower.

There are some similar state duplication tactics that can be done with binary coded statemachines. Adding states can be useful with slower and more complex machines as well, by replacing a state that is too complex to work at the target clock speed with two states that are less complex, and (hopefully) will work at the target clock speed.

-- Phil Hays (Xilinx, but speaking for himself)

Reply to
Phil Hays

  1. Such minimization is risky, rarely applicable and can consume lots of time.
  2. State machines consume a very small percentage of fpga resources in most designs.
  3. Making Fmax is often a more compelling problem given the capacity trends in FPGAs.

-- Mike Treseler

I don't think so. Is this reduction on by default in design compiler?

-- Mike Treseler

Reply to
Mike Treseler

formatting link

You need to define what/where you expect Minimize ?. Johnathon makes a good point, that state merge minimize (which seems to be your main focus?) is not a common problem. I can see risks, and we already suffer from tools that are 'too clever by half' - if the tools missed that the merge states were being used somewhere, you would have a nasty bug.

What about a time-keeper state : one that appears empty/redundant to a logic parser, but is actually needed for timing - should we let the tools loose on those ?

There are also many different State Minimize yardsticks : This week, I have a state engine that uses a Johnson counter, and found (by the eyeball inspection optimize method) that extending (rather than merge) the state decode, reduced the decode logic overhead => higher packing density. In this case, that mattered - more often, it is a don't care. Johnson was also better overall packing, than Gray or Binary, but I'm not sure I'd expect the tools to be able to make those calls.

I suppose a tool that took a state engine, and spawned a huge number of possible solutions and then reported the six best, could be usefull ?

-jg

Reply to
Jim Granville

Unless you're doing guided P/R, synthesis run-time is typically a smaller fraction ( < 20% ? ) of total build time for the bitstream. And to my way of thinking, if the synth tool implements state minimization properly, it would be a user selectable option for each FSM, just as XST currently allows selection of encoding algorithm for each FSM. So the effort need be expended only on those FSMs that benefit from it.

I thought it was very plausible as a _simplified_ example. Sequence recognizers are important in all sorts of coding schemes that are used today (MPEG/JPEG etc.), which contain specific markers delineating starts of blocks. I took a good hard look at Eilert's example, and could not come up with a circuit that beats the 3 LUT + 3 FF solution (there's a challenge). And the technique is certainly not limited to sequence recognizers, but to any sort of FSM which may have equivalent states.

If state collapsing provides the smallest/fastest circuit in some cases, I think it would be very nice to have it as a synthesis option. Would anyone _not_ want it as an option?

Further ISE results: I stood Eilert's LSB-first circuit on its head, and tried the MSB-first version.

Before state reduction: 15 states: Auto and Sequential (produced same netlist): FFs / LUTs / est Fmax: 4 / 9 / 400 MHz (2 LUTs in front of each FF, and 1 LUT decoding 2 FFs) Compact: FFs / LUTs / est Fmax: 4 / 5 / 536 MHz (1 LUT in front of each FF, and 1 LUT decoding all 4 FFs)

After state reduction: 8 states: Auto: FFs / LUTs / est Fmax: 3 / 3 / 553 MHz ( 1 LUT in front of 2 FFs, direct input to 3rd FF, and Compact / Sequential / Gray: FFs / LUTs / est Fmax: 3 / 4 (!) / 553 MHz (1 LUT in front of each FF, and 1 LUT decoding all 3 FFs)

The reason posted these ( and I hope they haven't spoiled any of Eilert's lab exercises) was because I was surprised to see the "Auto" encoding produce better results than the "Compact" encoding. Know your tools!

Best Regards, John

Reply to
JustJohn

The option would be fine. Paying $10,000 for it wouldn't be.

-- Mike Treseler

Reply to
Mike Treseler

That would be usefull, but it should also have the ability to include some defined state-decode in the 'optimise bucket', so you get smallest size/best speed of the machine, and critical decode paths.

In my most recent FSM, I coded a johnson ctr ( not an option in XST ?), and there were decode savings on the optional mapping of the unused/unmapped states.

What could be usefull, is a tool that simply reports wasted/merge-able states.

In one case, but not the other ?

-jg

Reply to
Jim Granville

Yes - for the LSB first "10-15 detector" FSM w/ pre-encoding state reduction, Compact encoding gave 3 LUTs Auto encoding gave 4 LUTs With the MSB first, Compact encoding gave 4 LUTs Auto encoding gave 3 LUTs

I don't mind that the results are not exactly as advertised, as long as I know to "expect the unexpected". John

Reply to
JustJohn

Hi everybody, thanks to everyone for their contributions.

Some of you mentioned that state minimization rarely improves most FSMs, except the mentioned decision-tree example type. I made the same experiences and agree to that.

The surprising point for me was that state expansion also is a method to get faster and/or smaller results.

Some mentioned a large increase of synthesis runtime due to a state minimization algorithm. I don't think so. My experience with dc is that it's done almost instantly (on a 400 MHz Sparc) and most of the time is spent for writing messages to the screen. In fact, after extracting the FSM into a state table the algorithm just needs to sort that table, look for equivalent outputs and next states to be eliminated and rearrange the next-state entries to the remaining collapsed state. I think synthesis tools already have much more complex algorithms working than this one. State expansion could be much more complicated, since it needs a area/timing estimation to calculate the benefits of creating aditional states.

Jim feared that during state minimization time keeper or wait states could be eliminated. Definitely not. The next-state of a wait state is never equal to any other state, so it can't be eliminated.

About how and when such an algorithm should be applied... Well, like most special features of an EDA tool it should have a control flag, either by command line or GUI. In dc, to answer Mikes question, every step is performed by a separate command. Like this:

--
compile -map_effort medium
Reply to
backhus

It's often useful to think of large state machines as software. They are just easy cases of microcode, usually implemented in ROMs.

If you are just over a power-of-two boundary, saving a few states can get you into the next smallest size ROM. If you are working with an existing design, sometimes it's the difference between fit and won't-fit.

I suspect that going from 17 to 15 states might simplify things enough to be very interesting in some cases.

My vote would be for a utility to process a state machine and tell me about the redundant states so I could fixup the source code by hand.

Years ago, I wrote a hack to scan some microcode and check for redundant states. I forget the context. Maybe a friend was out of space and desperate for a few more words. It might have just been a hack. I think it saved a dozen out of 4K.

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

I'm not sure I understand how your simplified state table optimization is going to find identical sequences of states, since the individual states do not have identical from's and to's compared to any other individual states, but the sequence as a whole may be identical to another sequence as a whole. It seems to me that this is where the most redundant states are going to be found. In software, they usually write these identical sequences as subroutines, perhaps we should focus on the same, and use sub-state-machines.

Finding identical sequences is a little harder. OTOH, someone smarter than me has probably figured it all out...

I've designed a lot of FPGAs, and never had need for more than 20 states, with the vast majority less than 8. With that number of states, the Mark II eyeball optimizer works pretty well.

Andy

backhus wrote:

Reply to
Andy

Reply to
Mike Treseler

That has been my experience as well. Most controllers are pretty simple.

The larger examples I have seen are cluttered attempts to fit all the process variables into a single enumeration.

Separate variables for counters and shifters make a much cleaner description.

-- Mike Treseler

Reply to
Mike Treseler

This is why it is a multi-pass algorithm: If one sequence of states is identical to another sequence of states, then in particular the last state of each sequence is identical, and may be easily recognized and collapsed. Repeating on the new state machine, which has one less state than the original, uncovers that the next to last states (of the original) are also equivalent, and may be collapsed...each pass uncovers another equivalent state until the entire equivalent sequence is reduced.

Reply to
JustJohn

Hi Andy, as already said by John, it's not done in a single pass.

A simple explanation can be found here (regardless of the typo on p.4: read "SZ and SY"):

formatting link

A more scientific approach can be found here:

formatting link
**http%3a//web.cecs.pdx.edu/%7ealanmi/510FM2/510OCE-11.ppt

While the first paper can be understood and applied even by undergraduates, the second one is very theoretical.

Funny coincidence: both explain the rule for equivalent states on p.4 :-)

Best regards

Eilert

Andy schrieb:

Reply to
backhus

Since One-Hot is assumed, there probably isn't a significant time penalty for carefully packing a slice using an F5 mux to build a

5-input LUT, or using the carry mux and Cin for some cases, both of which avoid the routing/switch box delays, with a small mux delay instead. If only the tools did that automatically for performance, or allowed it easily by hand.
Reply to
fpga_toys

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.