Hi all! I'd like to once again bring up the subject of state machines running into illegal states (illegal in the sense that the state vector does not correspond to any of the states defined in the VHDL code), because despite having spent half a day googling and reading related threads, I'm still left with a couple of questions:
Most discussions cover how to recover from illegal states, but few cover how it actually happens. What are the (I presume) electrical reasons that a state machine runs into an illegal state in the first place? Is there anything one can do to reduce the risk? Assume all FSM inputs connected to I/O pins are synchronized with one FF each, and the whole design is synchronous. Does anyone know of a good tutorial on this issue? I could add that in my case, the transition into an illegal state almost always happens immediately upon startup of the system, if it happens at all.
How can I force Xilinx XST (6.2 SP3) to produce a safe FSM that recovers from an illegal state? A "when others => state
Internal noise coupling in the chip (crosstalk), power drops, alpha particles, not properly double-sync'ing an async signal before using it in two different places (BTDT, seen it in a real chip), ... the list goes on!
I have never had a design with a state machine which got into illegal states. The only two reasons that I can think of for this happening are
1) electrical noise which would also cause upset of *other* FFs in the system causing other symptoms and
2) timing issues with the FSM. This can be either from async inputs (metastability) or from failing to meet setup time on a reg input. If you have done your static timing analysis correctly, then it must be a metastability issue. The fact that it happens on startup says to me it is a timing issue. If you can chase the problem away by slowing your clock, then it is a static timing issue. If it persists, then you most likely have a metastability issue.
Figure out what is wrong and deal with the cause of the problem.
Hi Phil! I use no reset signal at all; instead I specify initial values for all signals by the declarations, which is supposed to work fine with XST. But your point is still interesting in case I would need to introduce an asynchronous reset some day. Does that mean one should avoid them if illegal states are a concern?
OK, probably I would need the complete list with full descriptions! Do you know of any books or tutorials on this subject? I'm not really an electrical engineer but I have to deal with this, so any pointers would be appreciated.
You do have an asynchronous reset, you just didn't know that you did. When a Xilinx FPGA finishes the program download, it has all initial values held until an internal signal is released. This release is asynchronous to your clock. To avoid problems with this add a counter that is reset to all zeros. Until that counter counts to 15, keep the state machine in the initial state.
(Note: Startup is a messy subject. This is a simplified version.)
There is another common issue with DCMs or DLLs that you might also be having a problem with. Are you using a DCM or a DLL?
Yes. Suppose the initial state is "100" and the desired next state is "010". This would be a three state one-hot machine. If the first bit is held until just after the first edge of the clock and the second bit is held until just before the first edge of the clock, then the next state will be illegal, "110". If the first bit is held until just before the first edge of clock and the second bit is held until just after the first edge of the clock, then the next state will be illegal, "000".
Does that make it clear?
-- Phil Hays Phil-hays at posting domain should work for email
Wow, isn't software clever... and it probably does not tell you it did this either... but never mind, the other states have no logical pathways, so everything will be OK.
Back to the real engineering world: Do LOOK at the resultant output of your tools, and HOW it actually built the FSM. It can use .D or .T registers, with .D the most common.
Implicit in most .D coding is that state 00000 is the go-to state from any illegal ones: thus for many reasons (hopefully very rare) you MIGHT go to an illegal state, but the one after that will be 00000.
This should be a cornerstone state of your legal state list, either the POR state, or the safe-idle state.
Choosing gray code related states can reduce the pathways to illegal states, but in complex FSM's, this is not always possible.
You should not rely on this recovery pathway in regular system operation, it should be a safety-net. During tests, you could INC a counter when passing through 00000,
.T register state engines can be smaller, but they also can literally stick at an illegal state.
I don't think I can rule that out in this specific application. But where, that is on what physical signal, do you mean the electrical noise would occur, and how could it affect the FPGA's internal state?
I doubt that it's about static timing in my case since my clock is 20 MHz, and XST's post-layout static timing analysis doesn't complain. Metastability could be an issue, but it's strange that it happens so often. On one particular design, it happens about once every ten times I start up the system. All inputs are synchronized with one FF each, but I'll try adding a second one to see if it helps.
I agree totally, that's why I pointed out that most previous threads dealt with recovery from but not with the cause of illegal states.
Well, the FSM optimizer detects unreachable states and removes the related logic, and I guess that's what's happening here too. Indeed, if I simulate the RTL code as it is, I can't put the design in an illegal state - there is simply no signal that I can force to an illegal value. But if I add a dummy state in the enumerated state type definition without adding a "when" clause for it, then I can enforce the dummy state and the "when others" clause is applied. If I simulated the post-layout code, I would be very surprised if the illegal state detection worked, since it was taken away during synthesis, but I haven't tried it.
If you are using ISE 6.1, also check the "Clock Information:" section of your SYR (Synthesis) file. In some cases, it (erroneously) generates additional clocks from combinatorial logic. From my own experience, I've noticed that this can sometimes be cured by making sure that all possible states are defined in your combinatorial logic.
Example - Extra clocks will be generated if the two commented lines below are left commented, but will not be generated if they are uncommented.
if (currentState = IDLE) then
  if (someTrigger = '1') then
    nextState
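Since the fragment above is cut off, here is a minimal sketch of what "all possible states defined in your combinatorial logic" could look like. It is not the original poster's code: the entity name trigger_fsm, the RUN state and the busy output are made up for illustration, and only currentState, nextState, someTrigger and IDLE come from the fragment. The idea, as far as I understand it, is that giving nextState a default value at the top of the process means it is assigned on every path, so no latch is inferred and there is no latch enable for the tool to report as a clock.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity trigger_fsm is  -- hypothetical name, for illustration only
  port (
    clk         : in  std_logic;
    someTrigger : in  std_logic;
    busy        : out std_logic
  );
end entity trigger_fsm;

architecture rtl of trigger_fsm is
  type state_t is (IDLE, RUN);  -- RUN is an assumed second state
  signal currentState : state_t := IDLE;
  signal nextState    : state_t;
begin
  -- Combinatorial next-state logic.
  next_state_logic : process (currentState, someTrigger)
  begin
    -- Default assignment: nextState is driven on every path through
    -- the process, so no latch is inferred for it.
    nextState <= currentState;
    case currentState is
      when IDLE =>
        if someTrigger = '1' then
          nextState <= RUN;
        end if;
      when RUN =>
        if someTrigger = '0' then
          nextState <= IDLE;
        end if;
    end case;
  end process next_state_logic;

  -- State register.
  state_register : process (clk)
  begin
    if rising_edge(clk) then
      currentState <= nextState;
    end if;
  end process state_register;

  busy <= '1' when currentState = RUN else '0';
end architecture rtl;
```

Writing an explicit "else" branch for every "if" has the same effect as the default assignment; the default is just harder to forget.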
Well, the output netlist isn't exactly human-readable, although I guess I could write a simple FSM, synthesize it and study it. But actually I already know how XST has encoded my machine. My options are essentially One-Hot, Compact (binary), Sequential, Gray, and Johnson, all presumably on D flip-flops. I get One-Hot encoded machines unless I ask for something else.
However, correct me if I'm wrong, the state encoding itself doesn't change anything in the machine's ability to recover from illegal states: it takes some logic that detects these illegal states and forces the state vector back to normal, and that logic obviously isn't there. Many synthesis tools provide an extra option "safe FSMs" which will add such logic, but XST doesn't. So my question is XST-specific - how do I add illegal state recovery logic with XST?
No, not quite. If you consider .D registered FSMs, then if you have enumerated 00000 as a legal state, that naturally maps what you call recovery logic. If the tools choose one-hot, and take 00000 as illegal/not possible (since it is not a one-hot state), then you lose the natural recovery path. I.e. the state encoding itself CAN affect the recovery: if you avoid the .D recovery path of all low, your FSM will not perform the same as one that includes all lows as a specified state.
Try Gray or Johnson, and make sure the 00000 is an enumerated/specified state, and see what happens ?
But that only solves the problem for one specific illegal state... or do I misunderstand something? Say I have a three-state machine, where XST by default would encode the states "100", "010", and "001". So there are now five illegal states. If I follow your suggestion, I could encode the states like "00", "10", and "01". But there is still one illegal state, namely "11". I agree this is a lot better and it significantly reduces the risk of falling into the illegal state, but the risk is still there. And if the machine somehow falls into this illegal state, I want there to be some recovery logic to take the machine out of it.
Simply put - no matter what encoding I use, there will always be illegal states (except when the number of states is an exact power of two, so that I can assign each state vector configuration to a legal state). And if there are illegal states, the machine can fall into them. And if the machine can fall into an illegal state, I want it to get out of there automatically.
I have tried binary encoding, and indeed the machine wouldn't hang anymore. But considering what I wrote above, I don't feel confident that it really solved the problem - I believe I just reduced the probability for it to happen. But I would happily be proved wrong!
With .D registers, you can consider FSMs as low-level coded as: Q0.D = SeriesOfTerms0; Q1.D = SeriesOfTerms1;
Each valid state will have a number of hold-until-next-move-true terms, but states not covered will have NO .D terms, and so their NEXT state is Q0.D = 0; Q1.D = 0; or to the 00 state.
If you code IF State=11 THEN immediate_next = 00, then no more logic is generated, as that is implicit.
So it will get out of there automatically. With one-hot, the actual problem is that where it GOES NEXT is also not on the state map, whereas with other schemes, esp if you implicitly include 0000, then there is a recovery path.
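A minimal sketch of that recovery path, assuming a three-state machine with hand-picked binary codes where "00" is the idle state and "11" is the one unused code. The entity name safe_fsm, the state names and the go/done/busy ports are invented for illustration. Whether a given tool keeps the "when others" branch is not guaranteed: a synthesizer that recognizes this as an FSM may re-encode it, so the synthesis report should be checked.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity safe_fsm is  -- hypothetical name, for illustration only
  port (
    clk  : in  std_logic;
    go   : in  std_logic;  -- assumed already synchronous to clk
    done : in  std_logic;  -- assumed already synchronous to clk
    busy : out std_logic
  );
end entity safe_fsm;

architecture rtl of safe_fsm is
  -- Explicit state codes; "11" is deliberately left unused.
  constant S_IDLE : std_logic_vector(1 downto 0) := "00";
  constant S_RUN  : std_logic_vector(1 downto 0) := "01";
  constant S_WAIT : std_logic_vector(1 downto 0) := "10";
  signal state : std_logic_vector(1 downto 0) := S_IDLE;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when S_IDLE =>
          if go = '1' then
            state <= S_RUN;
          end if;
        when S_RUN =>
          state <= S_WAIT;
        when S_WAIT =>
          if done = '1' then
            state <= S_IDLE;
          end if;
        when others =>
          -- Covers the unused code "11" (and simulation metavalues):
          -- fall back to the all-zeros idle state.
          state <= S_IDLE;
      end case;
    end if;
  end process;

  busy <= '0' when state = S_IDLE else '1';
end architecture rtl;
```

With one-hot codes the same "when others" branch would have to catch many more unused codes, including all zeros, which is the point made above about where the machine GOES NEXT.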
From reading your various postings, I believe the summary is:
You have a state machine ...
In summary from other postings:
This might be metastables. This might be a timing problem. There is an async reset, which occurs when your chip goes active. (FPGAs and CPLDs do this differently, but the effect is similar.) There are various noise sources that could cause this:
kai: "Internal noise coupling in the chip (crosstalk), power drops, alpha particles, not properly double-sync'ing an async signal before using it in two different places ... the list goes on!"
Phil Hays wrote: "You do have an asynchronous reset, you just didn't know that you did. When a Xilinx FPGA finishes the program download, it has all initial values held until an internal signal is released. This release is asynchronous to your clock. To avoid problems with this add a counter that is reset to all zeros. Until that counter counts to 15, keep the state machine in the initial state."
Rickman wrote: "Figure out what is wrong and deal with the cause of the problem."
You wrote: "I doubt that it's about static timing in my case since my clock is 20 MHz, and XST's post-layout static timing analysis doesn't complain. Metastability could be an issue, but it's strange that it happens so often. On one particular design, it happens about once every ten times I start up the system. All inputs are synchronized with one FF each, but I'll try adding a second one to see if it helps."
Here is my analysis:
Trying to change your design to get out of illegal states is nearly pointless, since
A) it is hard to do,
B) the tools work against you,
C) you may not catch all possible cases,
D) by the time you detect it, damage has already been done, and
E) if the cause is gross signal integrity problems such as unreliable power, then your FSM is the least of your problems.
(There are exceptions to this, such as remote systems (no one to push the reset button) and ultra-high-reliability systems (which must tolerate rare alpha particle upsets).) Rickman's quote above is spot on.
Since this happens 10% of the time in a system at 20MHz, this is not metastability.
If you want to learn more about metastability, this is my favorite URL:
Even though your problem is not metastability, once your current problem is fixed, the much rarer problem of metastability may cause problems. A double synchronizer on all your async inputs is cheap insurance.
You wrote: "I doubt that it's about static timing in my case since my clock is 20 MHz, and XST's post-layout static timing analysis doesn't complain."
Your assertion that static timing analysis indicates that there are no problems is insufficient. I have seen far too many designs by engineers who proudly show the static timing report showing that there are no errors, but who have not generated the "unconstrained paths" report. The static timing analyzer tells you that of the paths you have constrained, these all meet timing, but the delay on the unconstrained paths is unbounded. You need to identify all unconstrained paths and either be able to explain why they don't need a timing constraint (such as a push button input), or add constraints so that the paths are covered.
Phil Hays' quote above is almost certainly identifying your problem, and gives a fine solution. Let me expand on it. The problem is that when the chip goes active, you have logic signals that go into the state machine that cause it to transition to a next state immediately. Since the going active is asynchronous to the 20MHz clock, you may have anywhere from 50 to 0 ns to do this. This represents a race condition (not a metastability), and in 10% of your startups, you lose the race. As others have described, not all parts of the state machine have enough time (when the available time is less than 50 ns) to transition to the next valid state. Phil's (and Philip's) solution is to hold off the first transitions of the state machine until a few cycles after the chip goes active. Phil's solution suggests 15 cycles, probably anything over 4 would be rock solid. As an example, I usually use a 4 bit shift register to do this. Either way, it works like this:
The hold-off circuit is initialized to 0000 (counter or shifter). The release of reset (chip going active) allows either to start changing. Phil's counter counts up; in my case the D input to the shifter is tied high, so I start to shift in '1's (0000->1000->1100->1110->1111->1111 ...). For Phil's counter, you probably would want to make it dead-end at 1111, and not wrap back to 0000.
Neither the counter nor the shifter can get to its terminal state other than through multiple cycles of the clock.
In your FSM, the initial state is set in your VHDL. Depending on what your FSM does, the transition out of this state may be to one or more states. For ALL of these exit conditions, you need to add an additional signal, the detection of the terminal state of the hold-off circuit. The result is that the FSM can't leave the initial state until several clocks after the chip goes active, because the same logic that initializes the FSM is also holding the hold-off circuit in its initial state. By the time the FSM is allowed to make its first transition, it will have stable input signals (through the double synchronizers) and it will have a full cycle to do its transition.
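A minimal sketch of the shift-register version of the hold-off, assuming initial values in the declarations are honoured at configuration (as the original poster says XST does). The entity name holdoff_fsm, the two-state FSM and the start/busy ports are placeholders for illustration, not anyone's actual design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity holdoff_fsm is  -- hypothetical name, for illustration only
  port (
    clk   : in  std_logic;
    start : in  std_logic;  -- assumed already double-synchronized to clk
    busy  : out std_logic
  );
end entity holdoff_fsm;

architecture rtl of holdoff_fsm is
  type state_t is (IDLE, RUN);
  signal state   : state_t := IDLE;
  -- Hold-off shifter, initialized to 0000 together with the FSM.
  signal holdoff : std_logic_vector(3 downto 0) := "0000";
  signal ready   : std_logic;
begin
  -- Terminal state detection: only 1111 releases the FSM.
  ready <= '1' when holdoff = "1111" else '0';

  holdoff_reg : process (clk)
  begin
    if rising_edge(clk) then
      -- D input tied high: 0000 -> 1000 -> 1100 -> 1110 -> 1111 -> 1111 ...
      holdoff <= '1' & holdoff(3 downto 1);
    end if;
  end process holdoff_reg;

  fsm : process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when IDLE =>
          -- Every exit from the initial state is gated with ready.
          if ready = '1' and start = '1' then
            state <= RUN;
          end if;
        when RUN =>
          if start = '0' then
            state <= IDLE;
          end if;
      end case;
    end if;
  end process fsm;

  busy <= '1' when state = RUN else '0';
end architecture rtl;
```

If the initial state has several exits, each of them needs the same ready term. A saturating counter in place of the shifter would work the same way.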
Additional answers to some of your other questions:
"But your point is still interesting in case I would need to introduce an asynchronous reset some day. Does that mean one should avoid them if illegal states are a concern?"
Yes, you should avoid async signals and resets, regardless of whether you are concerned about illegal states. If you must have broadly used resets, then the common recommendation is async assertion, and sync de-assertion.
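One common way to get async assertion with sync de-assertion is a small reset synchronizer. This is a sketch only; the entity name reset_sync and the port names arst_n and rst_n_out are made up, and the active-low polarity is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reset_sync is  -- hypothetical name, for illustration only
  port (
    clk       : in  std_logic;
    arst_n    : in  std_logic;  -- asynchronous reset input, active low
    rst_n_out : out std_logic   -- asserts at once, releases on a clock edge
  );
end entity reset_sync;

architecture rtl of reset_sync is
  signal sync : std_logic_vector(1 downto 0) := "00";
begin
  process (clk, arst_n)
  begin
    if arst_n = '0' then
      -- Assertion is asynchronous: both flip-flops clear immediately.
      sync <= "00";
    elsif rising_edge(clk) then
      -- De-assertion is synchronous: a '1' takes two clock edges
      -- to travel through to the output.
      sync <= sync(0) & '1';
    end if;
  end process;

  rst_n_out <= sync(1);
end architecture rtl;
```

All flip-flops that use rst_n_out then see the reset released relative to a clock edge, instead of at an arbitrary moment as in the startup race described above.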
You wrote: "Metastability could be an issue, but it's strange that it happens so often. On one particular design, it happens about once every ten times I start up the system. All inputs are synchronized with one FF each, but I'll try adding a second one to see if it helps."
Right. The 1 in 10 occurrence rate is far too high to be metastables in a system running at 20MHz.
Your async inputs to the state machine should have at least a double synchronizer. (Read the above URL.) A double synchronizer is just good design practice.
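For reference, a double synchronizer is just two flip-flops in series on the same clock. A minimal sketch, with the entity name sync2 and the signal names chosen only for illustration:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sync2 is  -- hypothetical name, for illustration only
  port (
    clk      : in  std_logic;
    async_in : in  std_logic;  -- signal from outside the clock domain
    sync_out : out std_logic   -- safe to use inside the clock domain
  );
end entity sync2;

architecture rtl of sync2 is
  signal meta   : std_logic := '0';  -- first stage, may go metastable
  signal stable : std_logic := '0';  -- second stage, given a cycle to settle
begin
  process (clk)
  begin
    if rising_edge(clk) then
      meta   <= async_in;
      stable <= meta;
    end if;
  end process;

  sync_out <= stable;
end architecture rtl;
```

Only sync_out should feed the state machine. Using async_in (or meta) in two different places brings back the problem kai mentioned earlier in the thread.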
In summary: Add the hold-off circuit, and check the unconstrained paths report in the static timing analyzer.
All right, NOW I see your point! But it seems to me you're assuming that the synthesis tool always generates the transition logic such that illegal states always transition to the state vector of all zeroes, and I'm not yet convinced that this is the case. I would have believed that the transition function would map all illegal states to "Don't care", allowing the tool to minimize the transition logic. As a consequence, the illegal states could transition to anything, including the same illegal state, which means it's stuck. But I will try to investigate how XST generates the transition logic. If you're right, you definitely answered my question no 2.
If this were true, then you would never need to specify that you are using one-hot encoding. The states that are not used would be detected as not needing to be decoded and the logic would automatically minimize to just using one bit to represent each state. But I have seen guides and HDL books explicitly tell you to either use an attribute to inform the tool that you are using one-hot encoding or to do the encoding yourself by not using a case statement.
Where have you seen that a tool will optimize away the others clause? I would like to read further on this. Are you talking about an FSM using an enumerated data type? That would certainly not use the others clause if all the listed values are covered.
Rick "rickman" Collins