Mitigating metastability.

- Symon

Fri, Aug 29, 2003 11:23 PM

Hi, Before I start, metastability is like death and taxes, unavoidable! That said, I've read the latest metastability thread. I thought these points were interesting.

Firstly, a quote from Peter, who has carried out a most thorough experimental investigation: "I have never seen strange levels or oscillations (well, 25 years ago we had TTL oscillations). Metastability just affects the delay on the Q output."

Secondly, from Philip's excellent FAQ: "Metastable outputs can be

1) Oscillations from Voh to Vol, that eventually stop.
2) Oscillations that occur (and may not even cross) Voh and Vol
3) A stable signal between Voh and Vol, that eventually resolves.
4) A signal that transitions to the opposite state of the pre clock state, and then some time later (without a clock edge) transitions back to the original state.
5) A signal that transitions to the opposite state later than the specified clock-to-output delay.
6) Probably some more that I haven't remembered."

So, this got me thinking on the best way to mitigate the effects of metastability. If Peter is correct in his analysis of his experimental data, and I've no reason to doubt this, then Philip's option 5) is the form of metastability appearing in Peter's Xilinx FPGA experiments.

So, bearing this in mind, a thought experiment. We have an async input, moving to a synchronising clock domain at (say) 1000MHz. Say we have a budget of 5ns of latency to mitigate metastability. The sample is captured after the metastability mitigation circuit (MMC) with a FF called the output FF. My first question is, which of these choices of MMC is least likely to produce metastability at the output FF?

1) The MMC is a 4 FF long shift register clocked at 1000MHz. MMC1 : process(clock) begin if rising_edge(clock) then FF1

- rickman

Sat, Aug 30, 2003 1:27 PM

All you need to answer these questions is the equation that describes your metastability. That is contained in most of the references that have been given. The settling time is found in an exponent, so using two FFs with half the time of a single FF will make the problem worse, not better.

The best (and only) way to resolve metastability is to provide more time. The probability never goes to zero, but you can get arbitrarily close.
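For reference, the equation usually quoted in the metastability literature has the form MTBF = exp(t_r / tau) / (T0 * f_clk * f_data), where t_r is the settling time. A minimal Python sketch of it follows. The function name and the values of tau, T0 and the data rate are illustrative assumptions, not measured figures for any device.

```python
import math

def mtbf_seconds(t_resolve_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """Mean time between synchronizer failures, in seconds."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# Hypothetical device constants, chosen only for illustration.
TAU = 50e-12     # resolution time constant (assumed)
T0 = 100e-12     # metastability window constant (assumed)
F_CLK = 1e9      # 1000 MHz sampling clock, as in the thought experiment
F_DATA = 10e6    # async input toggle rate (assumed)

base = mtbf_seconds(1.0e-9, TAU, T0, F_CLK, F_DATA)
more = mtbf_seconds(1.1e-9, TAU, T0, F_CLK, F_DATA)

# 100 ps of extra settling time multiplies the MTBF by exp(100 ps / tau).
gain = more / base
print(f"MTBF at 1.0 ns settling: {base:.3e} s")
print(f"MTBF gain from 100 ps more: {gain:.2f}x")
```

Because the settling time is in the exponent, the gain depends only on the added time and tau, not on T0 or the clock and data rates.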


--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

- Hal Murray

Sat, Aug 30, 2003 11:53 PM

You should also consider 1 FF clocked as late as you can wait.

Each FF has a setup time and a clock-output delay. I'm talking about the actual measured time, not the data book worst case times.

If you chain FFs together, that time gets subtracted from the settling time. The settling time is in an exponent. Waiting a little bit longer helps a lot.
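To put rough numbers on this, here is a small Python sketch comparing one FF given the whole time budget against a chain that splits the same budget. It assumes the usual exponential model, in which only the total settling time matters and the MTBF scales as exp(total settling time / tau). The budget, overhead and tau values are made up for illustration.

```python
import math

def total_settling_ps(budget_ps, stages, overhead_ps):
    """Settling time left after each stage pays setup + clock-to-output."""
    return budget_ps - stages * overhead_ps

TAU_PS = 50.0        # resolution time constant (assumed)
OVERHEAD_PS = 300.0  # measured setup + clock-to-output per FF (assumed)
BUDGET_PS = 4000.0   # total time allowed before the result is used (assumed)

one_ff = total_settling_ps(BUDGET_PS, 1, OVERHEAD_PS)
four_ff = total_settling_ps(BUDGET_PS, 4, OVERHEAD_PS)

# Settling time sits in the exponent, so the time lost to the extra stages
# costs a factor of exp(lost_time / tau) in MTBF.
penalty = math.exp((one_ff - four_ff) / TAU_PS)
print(f"one FF: {one_ff:.0f} ps, four FFs: {four_ff:.0f} ps")
print(f"MTBF penalty for chaining: {penalty:.3e}x")
```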

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.

- Philip Freidin

Mon, Sep 1, 2003 9:09 AM

Thanks.

Peter's experimental data revolves around detecting metastables, and counting them to create the data we use for our calculations. Very good stuff!

I too have created metastability test systems, which not only count the metastables, but also display them on an oscilloscope.

I would like to make a very strong distinction about the typically presented scope pictures of metastability, and the data that I have taken.

What you normally see published (in terms of scope photos, as opposed to drawn diagrams) is a screen of dots representing samples of the Q output. These scopes are high bandwidth sampling scopes that typically take 1 sample per sweep, and rely on the signal being repetitive to build up a picture of what is going on. Examples are the Tek 11801 and 11803, as well as the newer TDS7000 and TDS8000. The CSA11803 and CSA8000 are basically the same scopes with some extra software.

The picture at the top of page 3 of this document is typical:

formatting link

The scope is triggered by the same clock as the clock to the device under test (DUT), and the scope takes a random sample (or maybe a few samples) over the duration of the sweep. Most sweeps are of the flip flop not going metastable, and so the dots accumulate and show the trajectory of the flip flop. Occasionally the flip flop goes metastable, and sometimes the random sample occurs during the metastable time. These show up as the dots that are to the right of the solid rising edge on the left. Every dot that is not on that left edge represents times when the flip flop had a longer than normal transition time, after you take into consideration clock jitter, data output jitter, scope trigger jitter, and scope sweep jitter. All of these can be characterized by first doing a test run that does not violate the setup and hold times of the DUT.

The problem with these test systems is that when you do record a metastable event, you only get 1 sample point on the trajectory and you can say very little about the trajectory, other than it passed through that point. Even when these scopes take multiple samples per sweep, they are often microseconds apart, and of little interest in the domain we are talking about here.

Although the collected data is predominantly of non metastable transitions, these all pile up on top of each other as the left edge of the trace, and do not significantly detract from seeing the more interesting dots to the right.

The test systems that I have designed are quite different. These test systems only collect trajectory data when the flip flop goes metastable, and they sample the DUT output at 1GSamples per second, thus taking a sample every nanosecond. The result is that the scope pictures I have show the actual trajectory of the metastable.

For your viewing pleasure, I have put them up on the web:

formatting link

These are far from just delayed outputs! The end result though is still the same, systems that fail. But seeing these scope pictures of the actual Q output might make you think about how you measure metastability.

For example, on meta_pic_1.jpg lower trace, the vertical scale is 1V per division, and the 0V level is 1/2 a division above the bottom of the screen. The horizontal scale is 4ns per division. Now what if your test system took a sample at 10ns, and used a threshold of 1.5 volts (2 div up from the bottom of the picture)? You would say that the signal is always high at this point. If you sampled again at 20ns (middle of the screen), you would say that it has resolved for all the traces shown, and you would count all the transitions that returned to ground (because they were high at 10ns). All those traces that ended up high would not be counted. This would be bad if in the real system the device listening to the DUT happened to have a threshold of 2.1 volts (right in the middle of that cute little hump).
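The blind spot of such a two-sample detector can be sketched in a few lines of Python. The traces below are synthetic, invented only to mimic a hump near 2.1V. They are not data from the scope pictures, and the function names are hypothetical.

```python
def level_at(trace, t_ns):
    """Return the voltage of the last sample at or before t_ns."""
    volts = trace[0][1]
    for t, v in trace:
        if t <= t_ns:
            volts = v
    return volts

def counted_as_metastable(trace, threshold_v=1.5):
    """Two-sample detector: high at 10 ns but low at 20 ns."""
    return level_at(trace, 10) > threshold_v and level_at(trace, 20) <= threshold_v

# Synthetic (time_ns, volts) traces; both linger at about 2.1 V.
falls_back = [(0, 0.0), (4, 2.0), (8, 2.1), (12, 2.1), (16, 0.4), (20, 0.0)]
ends_high = [(0, 0.0), (4, 2.0), (8, 2.1), (12, 2.1), (16, 3.0), (20, 3.3)]

for name, trace in (("falls_back", falls_back), ("ends_high", ends_high)):
    print(name, counted_as_metastable(trace))
```

Both traces sit on the hump for the same length of time, but only the one that returns to ground is counted.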

This also shows why using a signal like this as a clock could be a real disaster.

Knowing what the trajectory of the DUT output looks like can make you think a lot harder about how you test it.

Basically the improvement in MTBF is a function of the slack time you give it to resolve. This is the sum of the slack time between FF1 and FF2, FF2 and FF3, FF3 and FF4, and FF4 and output. Let's throw some numbers at it. Setup time is 75ps, clock to Q is 200ps, routing delay between any pair of Q to D paths is 100ps. Clock distribution skew is 25ps (in the unfortunate direction). So we have 4 paths, each with 1000ps - (75+200+100+25)ps = 600ps of slack, for a total of 4 * 600ps = 2.4ns.
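The same arithmetic as a short Python sketch, using the delay numbers given above. The function and constant names are made up for illustration.

```python
SETUP_PS = 75
CLK_TO_Q_PS = 200
ROUTING_PS = 100
SKEW_PS = 25

def path_slack_ps(period_ps, extra_delay_ps=0):
    """Resolving slack on one FF-to-FF path, in picoseconds."""
    return period_ps - (SETUP_PS + CLK_TO_Q_PS + ROUTING_PS + SKEW_PS + extra_delay_ps)

# MMC 1: four FF-to-FF hops, each one 1000 ps clock period long.
mmc1_slack_ps = 4 * path_slack_ps(1000)
print(mmc1_slack_ps)  # 2400
```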

Ok, so this is weird, and it adds a mux :-)

Transit time through muxes is 200ps (assume that getting toggle to it is a non issue), and its output connects to the D of the output FF. No extra routing delay.

Path 1: slack from FF1 to FF3 plus slack from FF3 to output:

[2000-(75+200+100+25)] + [2000-(75+200+100+25+200)] = 1600ps + 1400ps = 3.0ns

Path 2: same slacks, just different FFs: 3.0ns

So: weird but better :-)
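A quick check of the MMC 2 numbers in Python. This sketch assumes each path is two hops of 2000ps, with the mux delay landing only on the hop into the output FF, as the calculation above implies. The variable names are illustrative.

```python
OVERHEAD_PS = 75 + 200 + 100 + 25   # setup + clock-to-Q + routing + skew
MUX_PS = 200                         # mux sits in front of the output FF
MMC1_SLACK_PS = 2400                 # the 4 x 600 ps figure for MMC 1

first_hop_ps = 2000 - OVERHEAD_PS               # FF1 -> FF3
second_hop_ps = 2000 - (OVERHEAD_PS + MUX_PS)   # FF3 -> mux -> output FF
mmc2_slack_ps = first_hop_ps + second_hop_ps
advantage_ps = mmc2_slack_ps - MMC1_SLACK_PS

print(first_hop_ps, second_hop_ps, mmc2_slack_ps)  # 1600 1400 3000
print(advantage_ps)  # 600
```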

I could of course screw up the results by changing the delay numbers, but they are pretty realistic for current technology.

Yep.

Actually MMC 2 is favored regardless of oscillations or not because of the 600ps of additional slack time.

My thinking on this has always been that the only thing that matters is the resolving time (slack) and the thought experiments about later stages sampling at just the right time to catch the previous FF resolving only cloud the issue. I am not as confident on this issue as I am on others though.

What I am confident on though, is that there is a better MMC than your two, and it follows on from MMC #2.

Just use 2 FFs, and clock them every 4 ns:

(that is, enable them every 4th clock cycle. This would mean, though, that unlike MMC #2, which runs 2 parallel paths and avoids some latency, this could have up to 4 ns of extra latency if you just miss the input change.)

slack from FF1 to output:

4000-(75+200+100+25) = 3.6ns

If the latency is really a problem, you could build on the MMC #2 design and have 4 paths each out of phase by 1 clock cycle. Since the path is now only 2 FFs long you would have to have 4 output FFs, and the selector mux would be after these 4 FFs. On the bright side, the mux delay does not eat into the resolving slack time, but it would eat some of the available cycle time in the logic that follows the output FF.
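To see what the extra slack buys, the three totals can be compared in a Python sketch. It assumes the exponential model, in which MTBF scales as exp(slack / tau). The tau value is a made-up illustration, not a figure for any real device, so only the trend is meaningful.

```python
import math

OVERHEAD_PS = 75 + 200 + 100 + 25   # setup + clock-to-Q + routing + skew
TAU_PS = 50.0                        # resolution time constant (assumed)

slack_ps = {
    "MMC 1 (4 FFs, 1 ns each)": 2400,
    "MMC 2 (two paths + mux)": 3000,
    "2 FFs enabled every 4 ns": 4000 - OVERHEAD_PS,
}

baseline = slack_ps["MMC 1 (4 FFs, 1 ns each)"]
ratios = {}
for name, slack in slack_ps.items():
    # Relative MTBF under the exponential model: exp(extra slack / tau).
    ratios[name] = math.exp((slack - baseline) / TAU_PS)
    print(f"{name}: {slack} ps slack, {ratios[name]:.3g}x MTBF vs MMC 1")
```

The price of the third option is latency or hardware: either up to 4 ns of extra delay, or four phased paths with a mux after the output FFs.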

Thanks for an interesting question. Comments above :-)

Philip Freidin Fliptronics

- Symon

Tue, Sep 2, 2003 8:20 PM

Hi Philip,

Thanks for your post, those pictures were certainly very interesting! What parts did you use? I notice they seem to disagree with Peter's quote that "Metastability just affects the delay on the Q output." I wonder, Peter, if Xilinx FFs behave differently from the ones in Philip's photos? (Other than speed, of course.) I must admit, one reason I posted was that I found it hard to believe that any FF wouldn't show runt pulses, or funny output levels, albeit for brief periods of time, during metastable events.

It's also interesting that the straight shift register isn't necessarily the best way to reduce metastability effects. That was what I suspected and was another reason behind my post. I agree that the 'four paths out of phase' solution is better, and preserves the sampling resolution. Often the sampling resolution needs to be preserved, which was why I didn't present an 'enabled every third or fourth go' type circuit.

Anyway, thanks to all for their thoughts, it's an interesting topic!

cheers, Syms.

- Austin Lesea

Tue, Sep 2, 2003 8:43 PM

Symon,

I think the sampling o-scope shots agree perfectly with what Peter said. Runt pulses and funny levels are the easiest metastable results to catch, as they are just before the long unknown settling time behavior that is so vexing to designers.

It also makes a difference where you look: a master-slave FF reduces the duration of the unknown transition over a simple FF without a slave to help "sharpen up" the transitions.

Austin

- Symon

Wed, Sep 3, 2003 1:15 AM

Hi Austin,

Maybe I got the wrong end of the stick, but when Peter said: "I have never seen strange levels or oscillations (well, 25 years ago we had TTL oscillations). Metastability just affects the delay on the Q output." I thought he meant that he'd only seen metastability where the output from the FF was always either on or off, just that sometimes the transition was delayed. Philip's pictures clearly show 'strange levels'. This is important, I believe, when deciding what the effects of metastable FFs are on following circuitry. I guess we'll have to wait until he returns from his Portuguese jaunt before we find out what he meant!!

Of course, I agree the Master/Slave thing helps. A master FF on its own is what I'd call a latch, the clock controlling whether it's transparent or not. The slave is the sameish circuit again, fed from the output of this, but its clock is inverted. So, I guess that you're saying because the master and slave are fabricated right next to each other, the input to the slave can be expected to transition faster than the input to the master which travels from further away? Less capacitive interconnect to drive. (BTW, I assumed throughout the metastability stuff we were talking about D-type FF, rather than latches.)

thanks, Syms.

- rickman

Wed, Sep 3, 2003 7:30 AM

I think the exact behavior is largely irrelevant since a simple delay is just as disastrous as anything else you would encounter. Since you don't know *when* the transition would happen, it could happen at the moment the next FF is latching the intermediate value. That is enough for the next FF and all following logic to behave badly as well.


- Austin Lesea

Wed, Sep 3, 2003 2:32 PM

Symon,

Agreed. D FF is what I was assuming here, but folks do have different ways of building them, and not all master-slave implementations are the same (in fact I have seen perhaps a dozen different versions).

Peter will have some explaining to do when he gets back .... as the quote does sound odd. There are most definitely voltage levels that remain in the undecided region for various lengths of time until the circuit resolves its state.

Austin

- Austin Lesea

Wed, Sep 3, 2003 2:36 PM

Rick,

I agree that the effect is the same: you just don't know when it will resolve, and thus the value that you "see" at the next level is basically unknown.

It could be that Peter's point is that the next circuit in line does make a decision, and it most certainly makes the "unknown" into a '1', or into a '0'. It is unlikely that the next circuit in line will propagate the same intermediate behavior, and if it can (gain is too low), then things just get more fuzzy until someone upstream ends up resolving the level back to a '1' or a '0'.

Austin

- Symon

Wed, Sep 3, 2003 4:36 PM

Hi Rick,

Good point, the unknown delay is bad enough. Thinking about it, I can't think of a 'sensible' digital metastability reduction circuit and scenario which would differentiate between a simple delay, and (say) a runt pulse. (Perhaps if the second FF was clocking faster than the 'sample' FF? Doesn't make much sense!)

OTOH, my concern is this. If the following FF also goes, or is very likely to go, metastable if its D input is at a 'funny' level, then the 'funny' level metastability is more likely to propagate than a simple delay. This is because the simple delay has to hit a tiny time window to propagate the metastability, whereas Philip's photos show the funny levels lasting a while.

thanks, Syms.

- Austin Lesea

Wed, Sep 3, 2003 5:08 PM

Symon,

A long time ago, I designed a timing system for telecom that used a rubidium clock. The reason why this is interesting will become apparent in a moment.

One function of the system was to measure up to five external sync references (presumably from traffic bearing lines from other offices).

The circuit to do this was basically a counter, that was sampled by a rubidium derived clock. All clocks were syntonous (same frequency, arbitrary phase -- aka the SONET/SDH telephone network). As the phase wandered back and forth, the metastable regions would get exercised, and the measurement board would report that an input signal had arbitrarily "slipped" by some random number of bits (due to a metastable transition of a bit in the counter/latch).

This was so frustrating, because in real life, the inputs could slip due to failures, glitches in the network, etc. So how do you know a real bad slip, from a metastable one?

In the hope (vain and useless) of reducing the occurrence of the metastable sample, we had three levels of FFs to try to re-synchronize the sample count, along with an elaborate set of clock enables. We got it to the point where the false slip occurred about every two months, in a typical network. Since real outages were far more common, it was not a big deal.

In spite of this, we wrote software to distinguish a real slip from a false slip. Basically, if five successive samples were not equal (as the phase can't change that fast) we threw out that set of measurements, and took another five. This dropped the occurrence of a false slip to below the threshold that we could measure (but it could still happen and probably did - still does out there somewhere).
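The five-sample rule might be sketched in Python as follows. This is only an illustration of the idea, not the original software. The function name, the read_sample callback and the max_attempts limit are assumptions added here.

```python
def accept_measurement(read_sample, max_attempts=10):
    """Return a phase count only when five successive samples agree.

    read_sample is a caller-supplied function returning one counter sample.
    Returns None if no consistent set of five is seen within max_attempts.
    """
    for _ in range(max_attempts):
        samples = [read_sample() for _ in range(5)]
        if len(set(samples)) == 1:
            return samples[0]
        # Disagreement: discard the whole set and take another five.
    return None

# Usage with a canned sample stream: one corrupted reading in the first set.
stream = iter([100, 100, 357, 100, 100,   # rejected, contains a false slip
               100, 100, 100, 100, 100])  # accepted
result = accept_measurement(lambda: next(stream))
print(result)  # 100
```

A filter like this cannot tell a false slip from a real one within a single set. It relies on the phase not being able to change that fast.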

So, metastability can sometimes be beaten into submission, but it never goes away....

The ultimate frustration is that one of these references gets used to track the rubidium (steer it in a phase locked loop), so a fake glitch can cause quite a hit, which then causes slips throughout the network as the rubidium runs off to the "wrong" frequency/phase. A real glitch can also cause the same behavior, so the locked loop is quite loosely coupled, and before each update, one checks absolutely everything to be sure that what you are trying to track is real....

I had heard there were some cases with equipment designed by others where the maintenance folks would just disconnect the inputs, as the bare rubidium ran so clean, that it was better to not track at all (fewer slips) than to bother with trying to track the references and falsely running off into the weeds due to metastability and poor reference checking.

All of this became obsolete when GPS became available, as now precise time (frequency) was broadcast for free.

Austin

- rickman

Wed, Sep 3, 2003 5:52 PM

If that were the case, then metastability would not be an issue at all. Having an indeterminate voltage is what creates metastability. The signal does not need to remain in the indeterminate value for any length of time, so a transition at the wrong time is as bad as any other indeterminate value. When FFs see an indeterminate input at the sampling time (very small time and voltage window), they will create an indeterminate value for an arbitrary period (or at an arbitrary delay) on the output.


- Jim Granville

Thu, Sep 4, 2003 8:55 AM


Very interesting account - worth adding to the FAQ under metastability?

-jg

- Mike Treseler

Thu, Sep 4, 2003 9:43 PM

Yes. Great story.

My old story involves *missing* a synchronizer rather than having failures *in* a synchronizer.

At least I didn't have to wait two months for symptoms to occur in that case :)

-- Mike Treseler