Mitigating metastability.

Hi,
    Before I start, metastability is like death and taxes,
unavoidable! That said, I've read the latest metastability thread. I
thought these points were interesting.

Firstly, a quote from Peter, who has carried out a most thorough
experimental investigation :-
"I have never seen strange levels or oscillations ( well, 25 years ago
we had TTL oscillations). Metastability just affects the delay on the
Q output."

Secondly, from Philip's excellent FAQ :-
"Metastable outputs can be

1)      Oscillations from Voh to Vol, that eventually stop.
2)      Oscillations that occur (and may not even cross) Voh and Vol
3)      A stable signal between Voh and Vol, that eventually resolves.
4)      A signal that transitions to the opposite state of the pre clock
        state, and then some time later (without a clock edge) transitions
        back to the original state.
5)      A signal that transitions to the opposite state later than the
        specified clock-to-output delay.
6)      Probably some more that I haven't remembered. "

    So, this got me thinking on the best way to mitigate the effects of
metastability. If Peter is correct in his analysis of his experimental
data, and I've no reason to doubt this, then Philip's option 5) is the
form of metastability appearing in Peter's Xilinx FPGA experiments.

    So, bearing this in mind, a thought experiment. We have an async
input, moving to a synchronising clock domain at (say) 1000MHz. Say we
have a budget of 5ns of latency to mitigate metastability. The sample
is captured after the metastability mitigation circuit (MMC) with a FF
called the output FF.
    My first question is, which of these choices of MMC is least likely
to produce metastability at the output FF?
1) The MMC is a 4 FF long shift register clocked at 1000MHz.
MMC1 : process(clock)
begin
  if rising_edge(clock) then
    FF1 <= input;
    FF2 <= FF1;
    FF3 <= FF2;
    FF4 <= FF3;
    output <= FF4;
  end if;
end process;

2) The MMC is 4 FFs, each clock enabled every second clock.
MMC2 : process(clock)
begin
  if rising_edge(clock) then
    toggle <= not toggle;
    if toggle = '1' then
      FF1 <= input;
      FF3 <= FF1;
      output <= FF3;
    else
      FF2 <= input;
      FF4 <= FF2;
      output <= FF4;
    end if;
  end if;
end process;

    Option 1) offers extra stages of synchronization between the input
and output, but the 1ns gap between FFs means that metastability is
more likely to propagate. Option 2) waits 2ns for the sample FFs to
make up their mind, vastly decreasing the metastability probability.
    My second question is, does the type of metastability, i.e. the
things in Philip's list, affect which is the better choice? For
instance, if the first FF in the MMC exhibits oscillations in
metastability, then the second FF in the MMC would have several
chances, as its input oscillates, to sample at the 'wrong' time. This
might favour MMC option 2). If, however, the first FF in the MMC goes
into option 5) metastability, then there's only one chance for the
second FF to sample at the 'wrong' time. This might confer an
advantage on MMC option 1).

    Anyway, I'm still thinking about this. I think the clock frequency
may decide which is better for a given FF type. Any comments?
                       Cheers, Syms.

Re: Mitigating metastability.
All you need to answer these questions is the equation that describes
your metastability.  That is contained in most of the references that
have been given.  The settling time is found in an exponent, so using
two FFs with half the time of a single FF will make the problem worse,
not better.  

The best (and only) way to resolve metastability is to provide more
time.  The probability never goes to zero, but you can get arbitrarily
close.  
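
The equation referred to here is usually written as MTBF = exp(t_r / tau) / (T0 * f_clk * f_data). A minimal Python sketch of it follows. The values chosen for tau, T0 and the data rate are placeholders for illustration, not figures for any real device; only the 1000MHz clock comes from the original post.

```python
import math

def mtbf_seconds(t_resolve_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """Common first-order synchronizer model:
    MTBF = exp(t_r / tau) / (T0 * f_clk * f_data)."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# Placeholder device constants (assumptions for illustration only).
TAU = 50e-12      # resolution time constant, seconds
T0 = 100e-12      # metastability window constant, seconds
F_CLK = 1e9       # 1000 MHz sampling clock, as in the original post
F_DATA = 10e6     # assumed average toggle rate of the async input

short_wait = mtbf_seconds(0.6e-9, TAU, T0, F_CLK, F_DATA)
long_wait = mtbf_seconds(1.6e-9, TAU, T0, F_CLK, F_DATA)

# One extra nanosecond of resolving time multiplies MTBF by exp(1 ns / tau).
improvement = long_wait / short_wait
print(f"MTBF ratio for +1 ns of settling time: {improvement:.3g}")
```

Because the settling time sits in the exponent, the ratio depends only on the extra time and tau, which is why a little more waiting helps so much.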



--

Rick "rickman" Collins

Re: Mitigating metastability.

You should also consider 1 FF clocked as late as you can wait.

Each FF has a setup time and a clock-output delay.  I'm talking
about the actual measured time, not the data book worst case times.

If you chain FFs together, that time gets subtracted from the
settling time.  The settling time is in an exponent.  Waiting a
little bit longer helps a lot.

Re: Mitigating metastability.

Thanks.


Peter's experimental data revolves around detecting metastables, and
counting them to create the data we use for our calculations.
Very good stuff!

I too have created metastability test systems, which not only count
the metastables, but also display them on an oscilloscope.

I would like to make a very strong distinction about the typically
presented scope pictures of metastability, and the data that I have
taken.

What you normally see published (in terms of scope photos, as opposed
to drawn diagrams) is a screen of dots representing samples of the Q
output. These scopes are high bandwidth sampling scopes that typically
take 1 sample per sweep, and rely on the signal being repetitive to
build up a picture of what is going on. Examples are the Tek 11801 and
11803, as well as the newer TDS7000 and TDS8000. The CSA11803 and
CSA8000 are basically the same scopes with some extra software.

The picture at the top of page 3 of this document is typical:

    http://www.onsemi.com/pub/Collateral/AN1504-D.PDF

The scope is triggered by the same clock as the clock to the device
under test (DUT), and the scope takes a random sample (or maybe a
few samples) over the duration of the sweep. Most sweeps are of
the flip flop not going metastable, and so the dots accumulate and
show the trajectory of the flip flop. Occasionally the flip flop
goes metastable, and sometimes the random sample occurs during the
metastable time. These show up as the dots that are to the right
of the solid rising edge on the left. Every dot that is not on
that left edge represents times when the flip flop had a longer
than normal transition time, after you take into consideration
clock jitter, data output jitter, scope trigger jitter, and scope
sweep jitter. All of these can be characterized by first doing a
test run that does not violate the setup and hold times of the DUT.

The problem with these test systems is that when you do record
a metastable event, you only get 1 sample point on the trajectory
and you can say very little about the trajectory, other than it
passed through that point. Even when these scopes take multiple
samples per sweep, they are often microseconds apart, and of
little interest in the domain we are talking about here.

Although the collected data is predominantly of non metastable
transitions, these all pile up on top of each other as the left
edge of the trace, and do not significantly detract from seeing
the more interesting dots to the right.

The test systems that I have designed are quite different. These
test systems only collect trajectory data when the flip flop
goes metastable, and they sample the DUT output at 1GSamples per
second, thus taking a sample every nanosecond. The result is
that the scope pictures I have show the actual trajectory of
the metastable.

For your viewing pleasure, I have put them up on the web:

   www.fpga-faq.com/Images/meta_pic_1.jpg
   www.fpga-faq.com/Images/meta_pic_2.jpg
   www.fpga-faq.com/Images/meta_pic_3.jpg

These are far from just delayed outputs! The end result though
is still the same, systems that fail. But seeing these scope
pictures of the actual Q output might make you think about how
you measure metastability.

For example, in the lower trace of meta_pic_1.jpg, the vertical scale
is 1V per division, and the 0V level is 1/2 a division above
the bottom of the screen. The horizontal scale is 4ns per
division. Now what if your test system took a sample at 10ns
and used a threshold of 1.5 volts (2 div up from the bottom of
the picture)? You would say that the signal is always high
at this point. If you sampled again at 20ns (the middle of the
screen), you would say that it has resolved for all the traces
shown, and you would count all the transitions that returned
to ground (because they were high at 10ns). All those traces
that ended up high would not be counted. This would be bad
if, in the real system, the device listening to the DUT
happened to have a threshold of 2.1 volts (right in the
middle of that cute little hump).
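
The counting problem described above can be sketched in a few lines of Python. This is an illustration only: the two-sample detector function and the example voltages (including the 3.3V high level) are assumptions chosen to mimic the description, not readings taken from the scope pictures.

```python
def detector_counts_event(v_at_10ns, v_at_20ns, threshold_v=1.5):
    """Hypothetical two-sample detector: counts an event only when the output
    reads high at 10 ns and has returned low by 20 ns."""
    high_early = v_at_10ns > threshold_v
    low_late = v_at_20ns <= threshold_v
    return high_early and low_late

# Made-up (10 ns, 20 ns) voltages for two traces that both linger on the hump.
returns_to_ground = (2.1, 0.0)
resolves_high = (2.1, 3.3)

print(detector_counts_event(*returns_to_ground))  # True: counted
print(detector_counts_event(*resolves_high))      # False: missed
```

Both traces were metastable, yet a detector built this way only counts the one that ends up low.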

This also shows why using a signal like this as a clock
could be a real disaster.

Knowing what the trajectory of the DUT output looks like
can make you think a lot harder about how you test it.


Basically the improvement in MTBF is a function of the slack time
you give it to resolve. This is the sum of the slack time between
FF1 and FF2, FF2 and FF3, FF3 and FF4, and FF4 and output.
Let's throw some numbers at it. Setup time is 75ps, clock to Q is
200ps, routing delay between any pair of Q to D paths is 100ps.
Clock distribution skew is 25ps (in the unfortunate direction).
So each of the 4 paths has 1000ps - (75+200+100+25) = 600ps of slack,
and 4 * 600ps = 2.4ns in total.
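
The same arithmetic as a small Python sketch, using the example figures from this post. The function and constant names are illustrative only.

```python
# Example figures from the post, all in picoseconds.
SETUP_PS = 75
CLK_TO_Q_PS = 200
ROUTING_PS = 100
SKEW_PS = 25

def stage_slack_ps(period_ps, extra_delay_ps=0):
    """Resolving slack for one FF-to-FF hop: period minus fixed overheads."""
    overhead = SETUP_PS + CLK_TO_Q_PS + ROUTING_PS + SKEW_PS + extra_delay_ps
    return period_ps - overhead

# MMC1: four hops (FF1->FF2, FF2->FF3, FF3->FF4, FF4->output),
# each lasting one 1000 ps clock period.
mmc1_total_ps = 4 * stage_slack_ps(1000)
print(mmc1_total_ps)  # 2400
```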


Ok, so this is weird, and it adds a mux :-)

Transit time through muxes is 200ps (assume that
getting toggle to it is a non issue), and its
output connects to the D of the output FF.
No extra routing delay.

Path 1:
slack from FF1 to FF3 plus slack from FF3 to output

(2000-(75+200+100+25)) + (2000-(75+200+100+25+200)) = 1600ps + 1400ps = 3.0ns

Path 2:
same slacks, just different FFs   3.0ns

So: weird but better :-)
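
Laid out step by step in Python with the same figures, so the effect of the mux on the second hop is visible. Variable names are illustrative only.

```python
OVERHEAD_PS = 75 + 200 + 100 + 25   # setup + clock-to-Q + routing + skew
MUX_PS = 200                        # mux in front of the output FF
ENABLE_PERIOD_PS = 2000             # each FF is enabled every second 1000 ps clock

# First hop: FF1 -> FF3 (or FF2 -> FF4 on the other path).
first_hop_ps = ENABLE_PERIOD_PS - OVERHEAD_PS
# Second hop: FF3 (or FF4) -> mux -> output FF, so the mux delay is lost too.
second_hop_ps = ENABLE_PERIOD_PS - (OVERHEAD_PS + MUX_PS)

mmc2_total_ps = first_hop_ps + second_hop_ps
mmc1_total_ps = 4 * (1000 - OVERHEAD_PS)

print(mmc2_total_ps)                  # 3000
print(mmc2_total_ps - mmc1_total_ps)  # 600
```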

I could of course screw up the results by changing
the delay numbers, but they are pretty realistic for
current technology.


Yep.


Actually MMC 2 is favored regardless of oscillations or not
because of the 600ps of additional slack time.


My thinking on this has always been that the only thing
that matters is the resolving time (slack) and the thought
experiments about later stages sampling at just the right time
to catch the previous FF resolving only cloud the issue. I am
not as confident on this issue as I am on others though.

What I am confident on though, is that there is a better MMC
than your two, and it follows on from MMC #2.

Just use 2 FFs, and clock them every 4 ns:

(That is, enable them every 4th clock cycle. Unlike MMC #2,
which runs 2 parallel paths and avoids some latency, this
could add up to 4 ns of extra latency if you just miss
the input change.)

slack from FF1 to output:

4000-(75+200+100+25) = 3.6ns
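
A rough Python sketch of the trade-off for a two-FF scheme enabled every n-th clock, using the same example delays. The function name is hypothetical, and the "worst-case wait" is simply modelled as one full enable interval.

```python
CLOCK_PS = 1000
OVERHEAD_PS = 75 + 200 + 100 + 25   # setup + clock-to-Q + routing + skew

def two_ff_scheme(enable_every_n_clocks):
    """Return (resolving slack, worst-case wait for the next enabled edge),
    both in ps, for a sampling FF and output FF enabled every n-th clock."""
    interval_ps = enable_every_n_clocks * CLOCK_PS
    slack_ps = interval_ps - OVERHEAD_PS
    # An input change that just misses an enabled edge waits up to one interval.
    worst_wait_ps = interval_ps
    return slack_ps, worst_wait_ps

for n in (1, 2, 4):
    slack, wait = two_ff_scheme(n)
    print(f"enable every {n} clocks: slack {slack} ps, wait up to {wait} ps")
```

With n = 4 this reproduces the 3.6ns of slack, at the cost of the input waiting up to 4ns to be sampled.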

If the latency is really a problem, you could build
on the MMC #2 design and have 4 paths each out of phase
by 1 clock cycle. Since the path is now only 2 FFs long
you would have to have 4 output FFs, and the selector
mux would be after these 4 FFs. On the bright side, the
mux delay does not eat into the resolving slack time,
but it would eat some of the available cycle time in
the logic that follows the output FF.



Thanks for an interesting question. Comments above :-)

Philip Freidin
Fliptronics

Re: Mitigating metastability.
Hi Philip,
      Thanks for your post, those pictures were certainly very
interesting! What parts did you use? I notice they seem to disagree
with Peter's quote that "Metastability just affects the delay on the Q
output.". I wonder, Peter, if Xilinx FFs behave differently from the
ones in Philip's photos? (Other than speed, of course.) I must admit,
one reason I posted was that I found it hard to believe that any FF
wouldn't show runt pulses, or funny output levels, albeit for brief
periods of time, during metastable events.
      It's also interesting that the straight shift register isn't
necessarily the best way to reduce metastability effects. That was
what I suspected and was another reason behind my post. I agree that
the 'four paths out of phase' solution is better, and preserves the
sampling resolution. Often the sampling resolution needs to be
preserved, which was why I didn't present an 'enabled every third or
fourth go' type circuit.
      Anyway, thanks to all for their thoughts, it's an interesting
topic!
                   cheers, Syms.

Re: Measuring metastability.
Symon,

I think the sampling o-scope shots agree perfectly with what Peter said.
Runt pulses and funny levels are the easiest metastable results to catch,
as they occur just before the long unknown settling time behavior that is so
vexing to designers.

It also makes a difference where you look:  a master-slave FF reduces the
duration of the unknown transition compared with a simple FF without a slave to
help "sharpen up" the transitions.

Austin



Re: Measuring metastability.
Hi Austin,
       Maybe I got the wrong end of the stick, but when Peter said:-
"I have never seen strange levels or oscillations ( well, 25 years ago
we had TTL oscillations). Metastability just affects the delay on the
Q output." I thought he meant that he'd only seen metastability where
the output from the FF was always either on or off, just that
sometimes the transition was delayed. Philip's pictures clearly show
'strange levels'. This is important, I believe, when deciding what the
effects of metastable FFs are on following circuitry. I guess we'll
have to wait until he returns from his Portuguese jaunt before we find
out what he meant!!
        Of course, I agree the Master/Slave thing helps. A master FF
on its own is what I'd call a latch, the clock controlling whether
it's transparent or not. The slave is the sameish circuit again, fed
from the output of this, but its clock is inverted. So, I guess that
you're saying because the master and slave are fabricated right next
to each other, the input to the slave can be expected to transition
faster than the input to the master which travels from further away?
Less capacitive interconnect to drive. (BTW, I assumed throughout the
metastability stuff we were talking about D-type FF, rather than
latches.)
                  thanks, Syms.


Re: Measuring metastability.

I think the exact behavior is largely irrelevant since a simple delay is
just as disastrous as anything else you would encounter.  Since you
don't know *when* the transition would happen, it could happen at the
moment the next FF is latching the intermediate value.  That is enough
for the next FF and all following logic to behave badly as well.  

--

Rick "rickman" Collins

Re: Measuring metastability.
Rick,

I agree that the effect is the same:  you just don't know when it will
resolve, and thus the value that you "see" at the next level is basically
unknown.

It could be that Peter's point is that the next circuit in line does make a
decision, and it most certainly makes the "unknown" into a '1', or into a
'0'.  It is unlikely that the next circuit in line will propagate the same
intermediate behavior, and if it can (gain is too low), then things just get
more fuzzy until someone upstream ends up resolving the level back to a '1'
or a '0'.

Austin



Re: Measuring metastability.

If that were the case, then metastability would not be an issue at all.
Having an indeterminate voltage is what creates metastability.  The
signal does not need to remain at the indeterminate value for any length
of time, so a transition at the wrong time is as bad as any other
indeterminate value.  When FFs see an indeterminate input at the sampling
time (very small time and voltage window), they will create an
indeterminate value for an arbitrary period (or at an arbitrary delay) on
the output.

--

Rick "rickman" Collins

Re: Measuring metastability.
Hi Rick,
       Good point, the unknown delay is bad enough. Thinking about it,
I can't think of a 'sensible' digital metastability reduction circuit
and scenario which would differentiate between a simple delay, and
(say) a runt pulse. (Perhaps if the second FF was clocking faster than
the 'sample' FF? Doesn't make much sense!)
       OTOH, my concern is this. If the following FF also goes, or is
very likely to go, metastable if its D input is at a 'funny' level,
then the 'funny' level metastability is more likely to propagate than
a simple delay. This is because the simple delay has to hit a tiny
time window to propagate the metastability, whereas Philip's photos
show the funny levels lasting a while.
          thanks, Syms.



Re: Measuring metastability.
Symon,

A long time ago, I designed a timing system for telecom that used a rubidium
clock.  The reason why this is interesting will become apparent in a moment.

One function of the system was to measure up to five external sync references
(presumably from traffic bearing lines from other offices).

The circuit to do this was basically a counter that was sampled by a rubidium
derived clock.  All clocks were syntonous (same frequency, arbitrary phase -- aka
the SONET/SDH telephone network).  As the phase wandered back and forth, the
metastable regions would get exercised, and the measurement board would report
that an input signal had arbitrarily "slipped" by some random number of bits
(due to a metastable transition of a bit in the counter/latch).

This was so frustrating, because in real life, the inputs could slip due to
failures, glitches in the network, etc.  So how do you tell a real bad slip from
a metastable one?

In the hope (vain and useless) of reducing the occurrence of the metastable
sample, we had three levels of FFs to try to re-synchronize the sample count,
along with an elaborate set of clock enables.  We got it to the point where the
false slip occurred about every two months in a typical network.  Since real
outages were far more common, it was not a big deal.

In spite of this, we wrote software to distinguish a real slip from a false slip.
Basically, if five successive samples were not equal (as the phase can't change
that fast) we threw out that set of measurements, and took another five.  This
dropped the occurrence of a false slip to below the threshold that we could
measure (but it could still happen and probably did - still does out there
somewhere).
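
A minimal Python sketch of the filtering rule described above: keep a set of five samples only if they all agree, otherwise discard the set and take another five. The `read_sample` callable and the `max_attempts` limit are hypothetical additions for the example, not details from the original system.

```python
def accept_measurement(samples):
    """Return the agreed value if all samples match, else None (discard the set)."""
    if len(samples) == 0:
        return None
    first = samples[0]
    if all(s == first for s in samples):
        return first
    return None

def read_phase_count(read_sample, set_size=5, max_attempts=10):
    """Take sets of `set_size` successive samples until one set is unanimous.

    `read_sample` is a hypothetical callable returning one sampled counter value;
    `max_attempts` is an added safety limit, not part of the original account.
    """
    for _ in range(max_attempts):
        batch = [read_sample() for _ in range(set_size)]
        value = accept_measurement(batch)
        if value is not None:
            return value
    return None

# Made-up readings: the first set contains one corrupted sample and is discarded.
readings = iter([100, 100, 164, 100, 100,
                 100, 100, 100, 100, 100])
print(read_phase_count(lambda: next(readings)))  # 100
```

As the account notes, this only lowers the rate of false slips; a set of five identical but wrong samples would still get through.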

So, metastability can sometimes be beaten into submission, but it never goes
away....

The ultimate frustration is that one of these references gets used to track the
rubidium (steer it in a phase locked loop), so a fake glitch can cause quite a
hit, which then causes slips throughout the network as the rubidium runs off to
the "wrong" frequency/phase.  A real glitch can also cause the same behavior, so
the locked loop is quite loosely coupled, and before each update, one checks
absolutely everything to be sure that what you are trying to track is real....

I had heard there were some cases with equipment designed by others where the
maintenance folks would just disconnect the inputs, as the bare rubidium ran so
clean that it was better to not track at all (fewer slips) than to bother with
trying to track the references and falsely running off into the weeds due to
metastability and poor reference checking.

All of this became obsolete when GPS became available, as now precise time
(frequency) was broadcast for free.

Austin



Re: Measuring metastability.

Very interesting account - worth adding to the FAQ under metastability?
-jg

Re: Measuring metastability.

Yes. Great story.

My old story involves *missing* a synchronizer
rather than having failures *in* a synchronizer.

At least I didn't have to wait two months
for symptoms to occur in that case :)


        -- Mike Treseler



Re: Measuring metastability.
Symon,

Agreed.  D FF is what I was assuming here, but folks do have different ways of
building them, and not all master-slave implementations are the same (in fact I
have seen perhaps a dozen different versions).

Peter will have some explaining to do when he gets back .... as the quote does
sound odd.  There are most definitely voltage levels that remain in the
undecided region for various lengths of time until the circuit resolves its
state.

Austin
