Real examples of metastability causing bugs

- Eli Bendersky

Tue, Jan 8, 2008 2:20 PM

Hello,

Suppose that I'm sampling an asynchronous signal with a FF, without using any synchronizers before it. This FF will become metastable from time to time with a MTBF depending on the device's parameters, the clock rate and the input signal change rate.

Can you please suggest *real life* examples of how this can make me fail in a real design, that is, where the time of recovery for the metastable event is indeed 0. Here are two off the top of my head:

1) The output of this FF can be used directly as the output of the device, causing an intermediate value on the output for some time, which can harm other devices.

2) If such an input is sampled by two different FFs for different purposes, they may end up with different results.

Thanks in advance, Eli
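
For readers following along, the MTBF mentioned in the question is usually estimated with a first-order model along the following lines. The symbols are assumptions introduced here, not taken from the thread: $t_r$ is the time the FF is given to resolve, $\tau$ and $T_0$ are device-dependent constants, $f_{clk}$ is the clock rate and $f_{data}$ is the rate at which the input changes.

```latex
\mathrm{MTBF} = \frac{e^{t_r / \tau}}{T_0 \, f_{clk} \, f_{data}}
```

The exponential dependence on $t_r$ is why even a small amount of extra settling time changes the failure rate by orders of magnitude.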

- Peter Alfke

Tue, Jan 8, 2008 4:38 PM

Your item #2 describes the most common problem, exacerbated by excessive routing delay differences.

Peter Alfke

- Eli Bendersky

Tue, Jan 8, 2008 5:57 PM

Hi Peter, thanks for answering. Could you provide a piece of VHDL/Verilog code that is realistic and has this problem?

- Symon

Tue, Jan 8, 2008 6:24 PM

Hi Eli,

process(clock) begin if rising_edge(clock) then if bad_input = '1' then count
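
The snippet above is cut off in this archive. As a rough sketch of the kind of circuit it appears to describe (not Symon's original code), the following assumes a 4-bit counter enabled directly by an unsynchronized `bad_input`. The entity name, the `count_out` port and the counter width are all made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration only: entity name, count_out and the
-- 4-bit width are assumptions, not part of the original post.
entity async_enable_counter is
  port (
    clock     : in  std_logic;
    bad_input : in  std_logic;  -- asynchronous to clock, NOT synchronized
    count_out : out std_logic_vector(3 downto 0)
  );
end entity async_enable_counter;

architecture rtl of async_enable_counter is
  signal count : unsigned(3 downto 0) := (others => '0');
begin
  process (clock)
  begin
    if rising_edge(clock) then
      -- bad_input fans out to the next-state logic of all four count bits.
      -- If it changes near the clock edge, routing skew lets some bits see
      -- '1' and others '0', so "0111" may become "0000" or "1111" instead
      -- of staying "0111" or advancing to "1000".
      if bad_input = '1' then
        count <= count + 1;
      end if;
    end if;
  end process;

  count_out <= std_logic_vector(count);
end architecture rtl;
```

An RTL simulation of this code will look perfectly correct, because the simulator gives every bit the same view of `bad_input`. The failure only shows up in hardware, where each bit samples the input through its own routing delay.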

- MikeShepherd564

Tue, Jan 8, 2008 7:10 PM

Aren't those two reasons enough for avoiding it?

Or are we just doing your homework?

Mike

- Peter Alfke

Tue, Jan 8, 2008 7:41 PM

Eli, look at XAPP094 (you can easily google it). It shows the circuit I have used to quantify metastable delay. The delay is short, so you have to be quick to catch it...

Peter Alfke

- glen herrmannsfeldt

Tue, Jan 8, 2008 9:49 PM

Symon wrote: (snip)

I would have called this an ordinary setup/hold violation.

If the problem is due to the timing of bad_input, propagated through the MUX that I presume it generates, then it should be a setup/hold violation.

Metastability should occur due to clock rate issues, through the appropriate propagation delay, but independent of bad_input, and only if bad_input does satisfy setup/hold.

I would say that the usual cause of option 2 in the previous post is also setup/hold violation.

Note that this system can fail even with perfect FFs due to different propagation delays.

-- glen

- Andy

Tue, Jan 8, 2008 10:36 PM

I agree, #2 is independent of metastability; it is a parallel synchronizer, which is a bad thing. If the propagation skew is more than setup+hold to all of the destination registers, it could meet setup and hold on all of them (avoiding metastability), while still failing functionally (incrementing by 3).

Andy

- Andy

Tue, Jan 8, 2008 10:48 PM

Example #1

process (event, out2) is begin if out2= '1' then out1

- Mike Treseler

Tue, Jan 8, 2008 10:56 PM

This FF might be used as an input synchronizer intended to eliminate logic races. Setup and hold violations are to be expected for a synchronizer and in almost all cases synchronization succeeds anyway.

But maybe once a year, the bowling ball stops on the speed bump and synchronization fails and the synchronizer causes a logic race.

The race may or may not cause a bad state transition.

A bad transition may or may not cause an observable error.

I might be able to improve my odds to say, one synchronization failure in 100 years by using a two stage synchronizer, but I can't eliminate the possibility.

This is the case of the *missing* synchronizer. This is often confused with metastability, but it is really a design error. I don't have to wait nearly as long to observe an error in this case.

-- Mike Treseler
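
A two-stage synchronizer of the kind mentioned above might look like the following sketch. The entity and signal names are invented for illustration. The only point is that `async_in` feeds exactly one FF, and the rest of the design uses `sync_out`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration only: all names are assumptions.
entity sync_2ff is
  port (
    clk      : in  std_logic;
    async_in : in  std_logic;  -- asynchronous to clk
    sync_out : out std_logic   -- intended for fan-out inside the clk domain
  );
end entity sync_2ff;

architecture rtl of sync_2ff is
  signal meta_ff : std_logic := '0';  -- first stage, may go metastable
  signal sync_ff : std_logic := '0';  -- second stage, samples one clock later
begin
  process (clk)
  begin
    if rising_edge(clk) then
      meta_ff <= async_in;  -- the only FF that ever sees the raw input
      sync_ff <= meta_ff;   -- gives meta_ff roughly one clock period to resolve
    end if;
  end process;

  sync_out <= sync_ff;
end architecture rtl;
```

In practice the two FFs usually also need vendor-specific attributes or constraints so that the tools keep them as plain registers placed close together. Those are tool-dependent and deliberately left out of this sketch.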

- John_H

Wed, Jan 9, 2008 12:49 AM

But... if the input goes to a synchronizer FF and the output of the synchronizer FF feeds more than one FF and the associated combinatorial paths between them are close to the clock period, a metastability delay can force different values for the shorter and longer paths.

Without the synchronizing flop, #2 is just a design error. With the synchronizing flop, #2 can see a metastability error because the added (rare) metastability delay screws up the static timing. That's one main reason to specify a more restrictive timing constraint for synchronized paths.

I usually tighten my synchronizer output timing paths by 2 ns to account for any rare metastability delays, so the same value WILL get to all the destination flops under more circumstances, pushing the expected failure interval into the 1000-year-plus range.

- John_H
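
The margin described above can be written as a simple timing budget. The symbols are assumptions for illustration: $T_{clk}$ is the clock period, $t_{co}$ the synchronizer FF's clock-to-output delay, $t_{path}$ the longest logic-plus-routing delay to any destination FF, $t_{su}$ the destination setup time, and $t_{met}$ the extra settling time reserved for a rare metastable event (2 ns in the post above).

```latex
t_{co} + t_{met} + t_{path} + t_{su} \le T_{clk}
\quad\Longleftrightarrow\quad
t_{co} + t_{path} + t_{su} \le T_{clk} - t_{met}
```

In other words, constraining the synchronizer output paths to $T_{clk} - t_{met}$ means every destination FF still sees the same settled value as long as the metastable delay stays below $t_{met}$.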

- Eli Bendersky

Wed, Jan 9, 2008 6:19 AM

Hi Mike,

It's been a long time since I've had to do homework, thankfully :-) I'm preparing a set of slides on metastability and synchronization techniques at work. There's a lot of material online on how metastability happens, MTBF, synchronizers, etc. But I realized that it's difficult for me to imagine a real case in which metastability in a FF without a synchronizer causes trouble. This is probably so because the designs have lots of FFs and most of the paths are between FFs, so the first FF to sample the asynchronous event is "kind of" a synchronizer anyway.

Eli

- Eli Bendersky

Wed, Jan 9, 2008 6:21 AM

Yes, and my question was about the "missing synchronizer", which is a design error as you said.

I just wanted an example of real code from real life doing something useful that is susceptible to this design error.

Eli

- Eli Bendersky

Wed, Jan 9, 2008 6:22 AM

Hi Peter, I downloaded this application note a couple of weeks ago and went through it. Would you say that your metastability-catching circuit could be useful for some real application?

- Thomas Stanka

Wed, Jan 9, 2008 6:49 AM

In my opinion, this problem is not directly visible from the VHDL code. Metastability in real devices is a question of timing and of the real HW after synthesis.

if rising_edge(clk) then internal

- -jg

Wed, Jan 9, 2008 9:34 AM

Case 2 is an aperture effect, and as stated, you can create circuits that will catch this, and cause real problems. It is not so much a VHDL construct as a HW detail error.

Pure metastability extends the settling time, but on a very narrow actual trigger window and so is harder to catch causing a problem, but I have often thought as an academic exercise it would be interesting to try to plot the actual statistical tail of the settling time window.

Most discussions use a simple model or a log-log plot, on a couple of data points - fine as a 'there be dragons' type warning, but not what I'd call true engineering.

-jg

- Symon

Wed, Jan 9, 2008 9:44 AM

Hi Andy, Glen, If bad_input is metastable or asynchronous, you get the same bad effect. HTH., Syms.

- Symon

Wed, Jan 9, 2008 9:55 AM

Hi Mike, I thought from the OP's post that he means that this 'input' has already been sampled in the system clock domain, although it's not clear. If that is the case, assuming setup and hold are met, this is a metastability problem. If metastable signals only ever go to one place it's not a problem. That's how the input resampler works.

Of course, I quite agree that this type of fault most often appears with an asynchronous input going to two destinations. Cheers., Syms.

p.s. FYI

formatting link

this FF is immune. (Actually, of course it isn't, but it's interesting to see why it doesn't work.)

- Allan Herriman

Wed, Jan 9, 2008 12:31 PM

Here's something you can include in your training material. It's from a c.a.f. post I made in 2003:

"When I was at Agilent I analysed the causes of failures in some FPGA developments.

About half of all FPGA design related bugs (weighted by the time spent finding them) were associated with asynchronous logic and clock domain crossings. I guess that's not too surprising.

What you may find surprising is that 0% of the clock domain crossing bugs had anything to do with metastability. Glitches and races were the cause.

My interpretation: I think that most designers have heard of metastability, so they put retiming flip flops everywhere. Consequently, metastability related problems don't occur often.

YMMV."

Regards, Allan

- KJ

Wed, Jan 9, 2008 12:49 PM

Your last paragraph directly contradicts the previous one. If designers actually were properly putting in retiming flip flops then they wouldn't be having clock domain crossing issues. Even if the signal under consideration is an input pin of the device, it still either exists in some clock domain (to which the device may have access) or it is completely asynchronous to any clock. In either case, sampling of some signal with a clock implies a clock domain crossing into the new clock domain.

A metastable output is always caused by violating setup or hold timing requirements. This is a necessary condition. However violating setup or hold time requirements does not necessarily cause an output to be metastable. This implies that violating setup and hold is not a sufficient condition to cause metastability.

Since the only knob you have to turn to try to create a metastable output is the timing of one signal relative to another this means that there is no way you can create an experiment that will always cause a flip flop to go metastable. So the only thing you can do is repeat an experiment over and over in the hope that eventually it will go metastable. Having done this though you will still only be able to conclude that violating timing can cause metastability (note: 'can cause' not 'will cause').

KJ