After a long discussion about state minimisation with ISE, I was looking for other tools that support this feature. The result was disappointing. None of the FPGA synthesis tools seems to support it, although every tool was capable of extracting FSMs and creating safe FSMs if desired. But none seems to support state minimization.
Does anyone know a good reason why? Or have I missed a tool? (Except Design Compiler, which is for ASICs.)
After the long and intriguing thread on state minimisation, I had been thinking about various state machines - trivial and not so trivial - that I've designed over the past few years; and I can't think of even one where automatic state merging would have been helpful. The kind of "sequence recogniser" state machine that was used as an example isn't very realistic, is it? Have I missed something important?
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
About the only times I've had a state machine that could have been minimized, I would have been looking for the switch to turn off state minimization, if it had been implemented.
Suppose I had a state machine that needed to run at close to full FPGA speed. This implies that the design needs a single level of logic between registers. With N-input LUTs, a given state can have only N (input states + conditions). By splitting states, a faster but functionally identical state machine can be produced.
As an example, suppose I was designing for 4-input LUTs, and I had two input states, and two condition bits for a state. One-hot is assumed. The input logic to state3 would be:
state3 :: (state1, state2, (state3, c1, c2))
This is five inputs, which would require two levels of logic with 4-LUTs. I could improve speed by splitting state3 into two states, state3a and state3b, one per entry state:
state3a :: (state1, (state3a, c1, c2))
state3b :: (state2, (state3b, c1, c2))
Any output equations and next state equations would have state3 replaced with (state3a or state3b). A "state minimization feature" would collapse these states back into state3, making the FPGA run slower.
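For what it's worth, the splitting trick can be checked exhaustively in a few lines of Python. This is a toy sketch (function names are mine, and the split equations in the comments are my reading of the one-hot trick, not anyone's actual netlist):

```python
from itertools import product

# Original one-hot next-state equation (5 inputs, needs two 4-LUTs):
#   state3' = state1 | state2 | (state3 & c1 & c2)
def step_orig(s1, s2, s3, c1, c2):
    return s1 | s2 | (s3 & c1 & c2)

# Split version (each equation has 4 inputs, fits a single 4-LUT):
#   state3a' = state1 | (state3a & c1 & c2)
#   state3b' = state2 | (state3b & c1 & c2)
def step_split(s1, s2, s3a, s3b, c1, c2):
    return s1 | (s3a & c1 & c2), s2 | (s3b & c1 & c2)

# Exhaustively check that with s3 = s3a | s3b, the union of the split
# next states always equals the original next state.
for s1, s2, s3a, s3b, c1, c2 in product((0, 1), repeat=6):
    s3 = s3a | s3b
    n3a, n3b = step_split(s1, s2, s3a, s3b, c1, c2)
    assert step_orig(s1, s2, s3, c1, c2) == (n3a | n3b)
print("split machine matches original")
```

The check passes because OR distributes over the condition term, which is exactly why a state-minimization pass would see state3a and state3b as mergeable.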
There are some similar state-duplication tactics that can be used with binary-coded state machines. Adding states can be useful with slower and more complex machines as well: replace a state that is too complex to work at the target clock speed with two states that are less complex and (hopefully) will work at the target clock speed.
You need to define what/where you expect Minimize. Jonathan makes a good point that state-merge minimization (which seems to be your main focus?) is not a common problem. I can see risks, and we already suffer from tools that are 'too clever by half' - if the tools missed that the merged states were being used somewhere, you would have a nasty bug.
What about a time-keeper state: one that appears empty/redundant to a logic parser, but is actually needed for timing - should we let the tools loose on those?
There are also many different state-minimize yardsticks. This week I have a state engine that uses a Johnson counter, and found (by the eyeball-inspection optimize method) that extending (rather than merging) the state decode reduced the decode logic overhead => higher packing density. In this case that mattered - more often it is a don't-care. Johnson also gave better overall packing than Gray or binary, but I'm not sure I'd expect the tools to be able to make those calls.
I suppose a tool that took a state engine, spawned a huge number of possible solutions, and then reported the six best could be useful?
Unless you're doing guided P/R, synthesis run-time is typically a smaller fraction ( < 20% ? ) of total build time for the bitstream. And to my way of thinking, if the synth tool implements state minimization properly, it would be a user selectable option for each FSM, just as XST currently allows selection of encoding algorithm for each FSM. So the effort need be expended only on those FSMs that benefit from it.
I thought it was very plausible as a _simplified_ example. Sequence recognizers are important in all sorts of coding schemes in use today (MPEG/JPEG etc.), which contain specific markers delineating starts of blocks. I took a good hard look at Eilert's example, and could not come up with a circuit that beats the 3 LUT + 3 FF solution (there's a challenge). And the technique is certainly not limited to sequence recognizers; it applies to any sort of FSM that may have equivalent states.
If state collapsing provides the smallest/fastest circuit in some cases, I think it would be very nice to have it as a synthesis option. Would anyone _not_ want it as an option?
Further ISE results: I stood Eilert's LSB-first circuit on its head, and tried the MSB-first version.
Before state reduction (15 states):
Auto and Sequential (produced the same netlist): FFs / LUTs / est Fmax: 4 / 9 / 400 MHz (2 LUTs in front of each FF, and 1 LUT decoding 2 FFs)
Compact: FFs / LUTs / est Fmax: 4 / 5 / 536 MHz (1 LUT in front of each FF, and 1 LUT decoding all 4 FFs)
After state reduction (8 states):
Auto: FFs / LUTs / est Fmax: 3 / 3 / 553 MHz (1 LUT in front of 2 FFs, direct input to 3rd FF, and
Compact / Sequential / Gray: FFs / LUTs / est Fmax: 3 / 4 (!) / 553 MHz (1 LUT in front of each FF, and 1 LUT decoding all 3 FFs)
The reason I posted these (and I hope they haven't spoiled any of Eilert's lab exercises) was that I was surprised to see the "Auto" encoding produce better results than the "Compact" encoding. Know your tools!
Yes - for the LSB-first "10-15 detector" FSM with pre-encoding state reduction:
Compact encoding gave 3 LUTs
Auto encoding gave 4 LUTs
With the MSB-first version:
Compact encoding gave 4 LUTs
Auto encoding gave 3 LUTs
I don't mind that the results are not exactly as advertised, as long as I know to "expect the unexpected". John
Hi everybody, thanks to everyone for their contributions.
Some of you mentioned that state minimization rarely improves most FSMs, apart from the decision-tree example type mentioned. I have had the same experience and agree.
The surprising point for me was that state expansion is also a method for getting faster and/or smaller results.
Some mentioned a large increase in synthesis runtime due to a state minimization algorithm. I don't think so. My experience with dc is that it's done almost instantly (on a 400 MHz Sparc), and most of the time is spent writing messages to the screen. In fact, after extracting the FSM into a state table, the algorithm just needs to sort that table, look for states with equivalent outputs and next-states to be eliminated, and redirect the next-state entries to the remaining collapsed state. I think synthesis tools already run much more complex algorithms than this one. State expansion could be much more complicated, since it needs an area/timing estimation to calculate the benefit of creating additional states.
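That collapse pass is simple enough to sketch in Python. This is a toy model with a made-up table layout, not dc's actual implementation:

```python
def minimize(table):
    """table: {state: (output, {input: next_state})}.
    Repeatedly merge states whose output and next-state row are identical;
    iterating to a fixpoint also collapses equivalent *chains* of states,
    one state per pass."""
    while True:
        # Group states by their (output, next-state row) signature.
        groups = {}
        for s, (out, nxt) in table.items():
            key = (out, tuple(sorted(nxt.items())))
            groups.setdefault(key, []).append(s)
        # Map every state in a group onto the group's first member.
        remap = {s: grp[0] for grp in groups.values() for s in grp}
        if all(s == r for s, r in remap.items()):
            return table  # fixpoint: no equivalent states left
        table = {remap[s]: (out, {i: remap[n] for i, n in nxt.items()})
                 for s, (out, nxt) in table.items() if remap[s] == s}

# Example: a 3-state machine where B and C are equivalent.
fsm = {
    "A": (0, {0: "B", 1: "C"}),
    "B": (1, {0: "A", 1: "A"}),
    "C": (1, {0: "A", 1: "A"}),
}
print(sorted(minimize(fsm)))  # → ['A', 'B']
```

Sorting the table rows (here, grouping by signature via a dict) is what keeps each pass cheap, which matches the observation that dc finishes almost instantly.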
Jim feared that state minimization could eliminate time-keeper or wait states. Definitely not: a wait state's next-state entry never matches that of any other state, so it can't be merged away.
About how and when such an algorithm should be applied... Well, like most special features of an EDA tool, it should have a control flag, either on the command line or in the GUI. In dc, to answer Mike's question, every step is performed by a separate command. Like this:
It's often useful to think of large state machines as software. They are just easy cases of microcode, usually implemented in ROMs.
If you are just over a power-of-two boundary, saving a few states can get you into the next smallest size ROM. If you are working with an existing design, sometimes it's the difference between fit and won't-fit.
I suspect that going from 17 to 15 states might simplify things enough to be very interesting in some cases.
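The arithmetic here is just the power-of-two address boundary; a quick sketch (the helper name is mine):

```python
from math import ceil, log2

def rom_words(n_states):
    # A state ROM needs ceil(log2(n)) address bits, so its depth is the
    # next power of two at or above the state count.
    return 1 << ceil(log2(n_states))

print(rom_words(17))  # 32
print(rom_words(15))  # 16: dropping two states halves the ROM here
```

So going from 17 to 15 states crosses the 16-word boundary and halves the ROM, which is exactly the fit/won't-fit difference described above.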
My vote would be for a utility that processes a state machine and tells me about the redundant states, so I could fix up the source code by hand.
Years ago, I wrote a hack to scan some microcode and check for redundant states. I forget the context. Maybe a friend was out of space and desperate for a few more words. It might have just been a hack. I think it saved a dozen out of 4K.
I'm not sure I understand how your simplified state-table optimization is going to find identical sequences of states: the individual states don't have identical from's and to's compared with any other individual state, but a sequence as a whole may be identical to another sequence as a whole. It seems to me that this is where most of the redundant states are going to be found. In software these identical sequences are usually written as subroutines; perhaps we should focus on the same, and use sub-state-machines.
Finding identical sequences is a little harder. OTOH, someone smarter than me has probably figured it all out...
I've designed a lot of FPGAs, and never had need for more than 20 states, with the vast majority less than 8. With that number of states, the Mark II eyeball optimizer works pretty well.
This is why it is a multi-pass algorithm: If one sequence of states is identical to another sequence of states, then in particular the last state of each sequence is identical, and may be easily recognized and collapsed. Repeating on the new state machine, which has one less state than the original, uncovers that the next to last states (of the original) are also equivalent, and may be collapsed...each pass uncovers another equivalent state until the entire equivalent sequence is reduced.
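A toy trace of that argument, using a hypothetical machine with two identical three-state chains sharing a done state (outputs reduced to a single bit for brevity):

```python
fsm = {  # state: (output, next_state)
    "S1": (0, "S2"), "S2": (0, "S3"), "S3": (0, "Done"),
    "T1": (0, "T2"), "T2": (0, "T3"), "T3": (0, "Done"),
    "Done": (1, "Done"),
}

def one_pass(table):
    # Merge states whose (output, next-state) signatures already match.
    first = {}                      # signature -> representative state
    for s, sig in table.items():
        first.setdefault(sig, s)
    rep = {s: first[sig] for s, sig in table.items()}
    return {rep[s]: (out, rep[nxt]) for s, (out, nxt) in table.items()}

# Each pass can merge only states whose successors were merged on an
# earlier pass, so the duplicated chain collapses from the tail inward.
sizes = [len(fsm)]
while True:
    fsm = one_pass(fsm)
    if len(fsm) == sizes[-1]:
        break
    sizes.append(len(fsm))
print(sizes)  # → [7, 6, 5, 4]: one state of the duplicated chain per pass
```

The state count drops by exactly one per pass until the fixpoint, matching the description above: the last states merge first, then the merge propagates backwards through the sequence.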
Since One-Hot is assumed, there probably isn't a significant time penalty for carefully packing a slice using an F5 mux to build a
5-input LUT, or using the carry mux and Cin for some cases, both of which avoid the routing/switch box delays, with a small mux delay instead. If only the tools did that automatically for performance, or allowed it easily by hand.