Real examples of metastability causing bugs

Eli Bendersky · 2008-01-08T14:20:43+00:00

Hello,

Suppose that I'm sampling an asynchronous signal with a FF, without using any synchronizers before it. This FF will become metastable from time to time with a MTBF depending on the device's parameters, the clock rate and the input signal change rate.

Can you please suggest *real life* examples of how this can make me fail in a real design, that is, where the time of recovery for the metastable event is indeed 0. Here are two off the top of my head:

1) The output of this FF can be used directly as the output of the device, causing an intermediate value on the output for some time, which can harm other devices.

2) If such an input is sampled by two different FFs for different purposes, they may end up with different results.

Thanks in advance,
Eli

Allan Herriman 18 years ago

Quite the contrary. Please make the distinction between actual metastability and other clock domain crossing issues, such as sampling the same async signal with two different flip flops.

In the rest of your post (snipped) you didn't even seem to acknowledge that there are other types of clock domain crossing issues that aren't related to metastability.

The first point of my post was that metastability is easy to deal with: just add another flip flop and leave enough timing slack to get whatever MTBF you want. (I know from some of my designs that unless you have clock rates of more than a few hundred MHz, it isn't hard to get failure rates of < 1 metastability-related failure in the lifetime of the product.)

The other point of my post was that because everyone has heard of metastability and knows that it's usually easy to deal with (just add flip flops and some timing slack), it gets taken into account in designs and doesn't actually cause a lot of bugs. Instead, actual bugs related to clock domain crossings are mostly caused by things not related to metastability. (I'm sure you could list a few. I'll see if I can dig up a copy of the training material I wrote which describes the most common errors.)

Also, please bear in mind that I was quoting the results of actual research on fielded product designs.

Regards, Allan

Mike Treseler 18 years ago

Maybe Allan's point is that lack of proper synchronization is a much more common problem than metastable events are in properly synchronized designs.

-- Mike Treseler

KJ 18 years ago

Yes, but sampling an async signal with two different flip flops would not be an example of properly adding retiming flip flops.

Actually I do agree with most of what you said (although re-reading my post I can see it probably didn't come across that way...sorry).

I certainly agree that darn near every timing problem that I've investigated as well has to do with moving a signal either from a completely asynchronous domain or some other clock domain. Whether or not there was actual metastability was irrelevant, since the solution to the design error was to properly move the signal into the sampling clock domain.

Fundamentally the design error was violation of setup/hold requirements and/or sampling that signal in more than one place. Whether or not that caused actual metastability I didn't investigate.

My only disagreement was when you said "...so they put retiming flip flops everywhere. Consequently, metastability related problems don't occur often.". But if they properly put in the retiming flops, then they wouldn't have any timing issues, let alone a low probability (but not non-existent) one such as metastability. But my point probably hinges on the word 'properly' as well.

I agree that as long as that signal from an async or other domain goes only into exactly one flip flop, then it has been properly taken care of. I'll also add that even when apparently coded properly, I've found that synthesis tools can sometimes defeat this by replicating a flop to improve its result, which means you have to add vendor specific attributes to try to guard against this.

I think you're overestimating new designers' ability to properly add these flops, based on postings in this and other newsgroups, even when the poster seems to have knowledge of metastability.

Violating setup and/or hold time requirements covers darn near everything. Sampling with only one flip flop into the new domain covers most other cases. Never combinatorially generate a signal that will be used to sample another signal covers the only other things that I can recall at the moment.

And posting those results is appreciated...like I said, I think we're solidly in agreement about the solution although I might not have come across as such in my earlier posting.

Kevin Jennings

John_H 18 years ago

A real life situation for a missing synchronizer would be the "duh" moment I had in one of my first processor interfaced designs. I wrote a value to the FPGA but the register write wasn't related to the system clock. Occasionally some of my logic got one part of the word on one cycle before all the logic got the full register value on the second cycle. This schism where values were *supposed* to change simultaneously but didn't caused me problems.

Just adding a synchronizing flop DOES NOT get rid of metastability troubles. Just adding two consecutive synchronizing flops DOES NOT get rid of metastability. Luckily most of the time the path between the two flops ends up being short but too often the designer DOES NOT properly constrain the path between those two flops.

The effect of hitting the metastability window is that the logic takes a moment longer to decide if it's high or low. While the static timing analyzer will guarantee your results if you meet setup and hold, the synchronizing flop specifically violates the setup and hold in order to rarely hit that metastability window. In this case, the constraints MUST be changed to guarantee the metastability related errors will be in the 1k or 1M year kind of range.

The timing constraint from a synchronizing flop to the next flop in the sequence MUST be constrained to a time that's shorter than the prevailing system clock period. If the signal takes up to 2 ns longer to decide what signal level it is, the following logic (or second synchronizing flop) must have that additional headroom built-in through timing constraints.
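A rough sketch of why the tightened constraint matters, assuming the commonly used exponential model MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data). The values of tau, T0, the clock and data rates, the setup time and the path delays below are made up for illustration; they do not come from any datasheet or from this thread.

```python
import math

def resolve_time(clk_period_s, path_delay_s, setup_s):
    """Time left for a metastable output to settle before the next flop samples it."""
    return clk_period_s - path_delay_s - setup_s

def synchronizer_mtbf(t_resolve_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """Exponential model: MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data)."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# Hypothetical device and design parameters (assumptions, not measured values).
TAU_S = 50e-12       # resolution time constant of the flip flop
T0_S = 100e-12       # metastability window scaling constant
F_CLK_HZ = 200e6     # sampling clock rate
F_DATA_HZ = 10e6     # rate of asynchronous input transitions
SETUP_S = 0.2e-9     # setup time of the second flop
PERIOD_S = 1.0 / F_CLK_HZ

# Compare a loosely placed flop-to-flop path with a tightly constrained one.
mtbf_by_delay = {}
for path_delay_s in (4.0e-9, 2.0e-9):
    t_res = resolve_time(PERIOD_S, path_delay_s, SETUP_S)
    mtbf_by_delay[path_delay_s] = synchronizer_mtbf(
        t_res, TAU_S, T0_S, F_CLK_HZ, F_DATA_HZ
    )
    print(
        f"path delay {path_delay_s * 1e9:.1f} ns -> "
        f"resolve time {t_res * 1e9:.1f} ns, "
        f"MTBF {mtbf_by_delay[path_delay_s]:.3g} s"
    )
```

In this model every extra tau of settling time multiplies the MTBF by e, so with the assumed tau of 50 ps, freeing up 2 ns on the flop-to-flop path improves the MTBF by a factor of e^40. That is the headroom a tighter constraint on that path is meant to guarantee.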

- John_H

Allan Herriman 18 years ago

Yes! Thanks Mike.

I couldn't find the actual figures I wanted, but from memory it went something like this (most commonly encountered clock domain crossing or async logic bug listed first):

- (race) Passing vectors (i.e. multiple signals) from clock domain A to clock domain B and expecting all the bits to arrive on the same B clock.

- (race) As above, but adding multiple banks of retiming flip flops in the B clock domain, which fixed the (non-existent) metastability issue but did nothing about the race.

- (race) Passing a signal in clock domain A to multiple flip flops in clock domain B, and expecting the B flip flops to get the same value on the same clock.

- (race) As above, but created when the tools replicate the B logic to manage fanout.

- (glitch) Multiple signals in clock domain A hit some combinatorial logic producing a single signal which is sampled by a flip flop in clock domain B. Sometimes there may be a glitch which gets sampled by the B flip flop. It can be difficult to design combinatorial logic with good glitch coverage (and if you do, the tools will often remove it). (See XAPP 024, btw.)

- (glitch) Clock multiplexers made out of combinatorial logic with inadequate glitch coverage (or adequate glitch coverage removed by the tools).

I think the significant thing about that list is that even if flip flops were infinitely fast and had 0 chance of ever entering a metastable state, all of those bugs would still exist.
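The first race in the list can be illustrated with a toy model, assuming ideal flip flops that never go metastable. The function name `sample_bus` and the per-bit arrival times are hypothetical and only stand in for unequal routing delays between the two domains.

```python
def sample_bus(old, new, arrival_ps, sample_ps):
    """Value captured by ideal flip flops in domain B.

    Each bit shows its new value only if that bit's transition has
    already arrived when the B clock edge occurs at sample_ps.
    """
    value = 0
    for bit, t_arrive in enumerate(arrival_ps):
        source = new if sample_ps >= t_arrive else old
        value |= source & (1 << bit)
    return value

# A binary counter in domain A steps from 7 to 8, so all four bits change.
OLD, NEW = 0b0111, 0b1000
# Hypothetical per-bit arrival times in picoseconds (bit 0 first).
ARRIVAL_PS = [100, 250, 400, 550]

for t in (0, 300, 600):
    captured = sample_bus(OLD, NEW, ARRIVAL_PS, t)
    print(f"B clock edge at {t} ps captures {captured:04b}")
```

A B clock edge that lands inside the skew window captures a word that is neither the old value nor the new one, and adding more retiming stages in domain B only passes that wrong word along.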

Regards, Allan

Allan Herriman 18 years ago

A few more. These are still async logic, but not related to clock domain crossings.

- Using async reset or set inputs on flip flops to implement a logic function (rather than just using them for initialisation). I can remember a case where a design would fail even when we could prove mathematically that it couldn't fail. Rewriting it to avoid the use of async resets fixed the problem.

- Gating clocks to create a logic function. I know this sort of thing is done in ASICs to save power, but it just doesn't seem to work too well in FPGAs sometimes.

Symon 18 years ago

Hi Allan, Can you remember what that was? I'm fairly sure synthesis tools use this type of trick, and I've never seen a problem with it, providing timing is met.

Right, you get runts or glitches on the clock which might clock some FFs but not others. Very bad! Cheers, Syms.

Peter Alfke 18 years ago

I do not see how metastability could ever be a "useful" feature. My app note is an analysis tool that measures the statistical probability of the metastable delay by random testing. As such it is useful, and I think it shows the only practical way to really get quantitative data. Peter Alfke, Xilinx Applications

John McCaskill 18 years ago

I know that at least XST will use these inputs to implement logic functions. They have comparatively long setup times, and I see them frequently when looking at static timing analysis reports.

Regards,

John McCaskill

John_H 18 years ago

While XST may use the inputs in implementing the logic, it uses the synchronous set/reset rather than the asynchronous clear/preset equivalent. Any synchronous implementation should be covered by the timing analysis.

Symon 18 years ago

OK, of course. Thanks, John_H!

glen herrmannsfeldt 18 years ago

John_H wrote: (snip)

Again, I wouldn't call this metastability. It can be made worse by metastability, but the problem is that the propagation delay to the different parts of the register is (always) slightly different, and if you get close enough to the clock edge some will get one value, and some the other. That will still be true even for perfect FFs.

Synchronizing FFs don't get rid of it, but if the probability is low enough that is good enough. Two synchronizing FFs will square the probability of metastability on each clock cycle (assuming statistical independence).

First the logic must be designed to avoid the multiple register clocking problem. In the case of FIFOs this is done by using gray code such that only one bit changes on any cycle. You get one or the other, where both are valid.
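A small sketch of the Gray code property described above, assuming a 4-bit pointer. The helper names (`to_gray`, `from_gray`, `possible_captures`) are illustrative and not taken from any particular FIFO implementation.

```python
def to_gray(n):
    """Convert a binary count to reflected Gray code."""
    return n ^ (n >> 1)

def from_gray(g):
    """Convert reflected Gray code back to a binary count."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def possible_captures(old, new, width):
    """Every word ideal flops could capture if each bit independently
    shows either its old or its new value at the sampling edge."""
    full = (1 << width) - 1
    return {(new & mask) | (old & ~mask & full) for mask in range(1 << width)}

WIDTH = 4

# Consecutive Gray codes differ in exactly one bit, including the wrap-around.
for i in range(1 << WIDTH):
    nxt = (i + 1) % (1 << WIDTH)
    assert bin(to_gray(i) ^ to_gray(nxt)).count("1") == 1

binary_outcomes = possible_captures(7, 8, WIDTH)
gray_outcomes = possible_captures(to_gray(7), to_gray(8), WIDTH)
print("binary 7 -> 8 can be captured as", sorted(binary_outcomes))
print("Gray   7 -> 8 can be captured as", sorted(from_gray(g) for g in gray_outcomes))
```

With a plain binary count, a badly timed sample can produce any 4-bit value on the 7 to 8 step. With Gray code only one bit is in flight, so the receiving domain sees either the old count or the new one, and both are valid.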

The metastability problem comes not while crossing clock domains, but after crossing clock domains. It is normal for the output of a register to go through other logic before the next register. The delay of that logic, in addition to the possible metastability delay, causes the metastability problem. Efficient logic design maximizes the logic between register stages, and so gets closer to failure due to metastability caused delay. A synchronizing register allows the maximum time for metastability to be resolved before entering the next FF.

Yes.

-- glen

glen herrmannsfeldt 18 years ago

(snip)

The assumption is that it is exponential. I don't know if you can prove that or not. Also, the measurements are not easy and the result will be very sensitive to the exact timing.

I had the idea once in a discussion here of building a metastability locked loop. That is, a PLL with a FF in the feedback loop such that the phase adjustment goes toward the metastability point. That would maximize the number of metastability events. Then you need to find a way to measure the resolving time and graph it...

-- glen

Peter Alfke 18 years ago

Glen, I have been thinking about this for decades, but I now consider it hopeless. If you believe the results of my statistical measurements, then you realize that the capture window for a metastable delay of more than 2 ns is not picoseconds, but a fraction of a femtosecond. There is no way to keep the circuitry stable within such a narrow timing window. And if you could, how do you derive any quantitative data from it? I remain convinced that the randomly asynchronous testing approach is the only one that gives us reliable results. Peter Alfke

John_H 18 years ago

A single synchronizing register DOES NOT allow the maximum time for metastability. Two synchronizing registers are closer to "correct."

With only one synchronizing flop for one control signal - ignoring vectors for the moment - the only way to guarantee logic will work properly with this single flop on "this side" of the time domain is to tighten the timing constraint such that any metastability delay is acceptable in the system. If the tightened constraint is too difficult, a second single flop is needed to distribute the synchronized signal.

- John_H

glen herrmannsfeldt 18 years ago

In response to my post:

Much longer than I have thought about it.

It did occur to me while writing that, that temperature variations would shift the metastable point. The goal of the MLL is to maximize the rate of observations of metastability...

The only one I have thought of so far is to add a variable voltage sine to the MLL feedback loop. Increasing the voltage should decrease the probability of 2ns metastability events.

Not having the design for the feedback loop of the MLL, it seems that there might be a gate with a signal that is high for the amount of time of the metastability, plus or minus some propagation delays. Those delays would have to be measured, and then the average metastability time could be measured by the average voltage on that line. The change in the average time with the sine voltage disturbing the feedback loop, and the time variation due to the feedback voltage, would allow one to determine the probability vs. metastability time curve.

It would seem to require simpler measurements than the MLL.

-- glen

Peter Alfke 18 years ago

The asynchronous test is a real beauty. By adjusting one frequency, you can have the detected metastable events come in at kilohertz speed, or at the much lower rate of one or two during the lunch break or even overnight. And it confirms the logarithmic relationship. Real fun! (If the phenomenon itself weren't so ugly...) Peter Alfke

Symon 18 years ago

Hi John, Just to point out that when using two FFs to mitigate metastability, the constraints file should include something (e.g. MAXDELAY) to make sure the signal delay between the two FFs is somewhat less than the default, which is the period of the FFs' clock. The P&R tools may not do this otherwise. Cheers, Syms.

Allan Herriman 18 years ago

[snip]

I wasn't talking about new designers or Usenet posters, I was talking about a large group of experienced designers at Agilent. They wouldn't have been employed there if they didn't have a basic grasp of fundamentals such as designing for metastability.

I apologise for not making it clear I wasn't talking about noobs.

[snip]

Allan

Mike Treseler 18 years ago

Yes. The synthesis tools assume input synchronization. When synchronization fails all bets are off. This thread points out the importance of being able to distinguish structures intended to *be* synchronizers from structures that assume such synchronization. I'm beginning to think I should code synchronizers as separate entities/modules. This would simplify constraints and make it easier to check for register duplication.

Funny. I'm wiping the coffee off my monitor ;) These inventions always move the problem around but never solve it.

-- Mike Treseler
