Metastability resolution

- C
- comp.arch.fpga.posting.account
  
posted
17 years ago

Tue, Sep 19, 2006 10:13 PM

I am designing a crossdomain synchroniser and wanted to check that I understand the formula for the mean time between metastable failures correctly. Sorry if the answer can be easily found on the web; I tried to find it and failed.

The usual formula is MTBF=1/(T0 f1 f2 e^{-t/tau}), where f1 is the clock frequency of a flip-flop's clock, f2 is the edge frequency at which its input transitions, T0 is the metastable window aperture size (it *is* called that?), and tau is the metastability time constant. A failure happens whenever the flip-flop becomes metastable and remains so for at least time t.
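As a rough sketch, the formula above can be evaluated directly. The function name and the sample numbers below are placeholders chosen for illustration, not measured Xilinx data.

```python
import math

def mtbf_seconds(t0, f_clk, f_data, t, tau):
    """MTBF = 1 / (T0 * f1 * f2 * exp(-t / tau)), with every quantity in SI units."""
    return math.exp(t / tau) / (t0 * f_clk * f_data)

# Placeholder inputs (assumptions for illustration only):
# T0 = 0.2 ns, 100 MHz clock, 10 MHz data edges, t = 2 ns, tau = 50 ps.
example_mtbf = mtbf_seconds(t0=0.2e-9, f_clk=100e6, f_data=10e6, t=2e-9, tau=50e-12)
print(f"MTBF = {example_mtbf:.3g} s")
```

Because t appears in the exponent, small changes in the available settling time dominate everything else in the expression.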

The value of T0 seems impossible to find for Xilinx FPGAs, presumably because it varies exponentially with tau, and that is difficult enough to measure accurately (?). I think that T0 can be at most t_{setup} + t_{hold}, so that might be one way of obtaining a value (?) [though Xilinx say that negative hold times are not guaranteed, so I should probably stick with just t_{setup} whenever the hold time is negative].

The above formula only works when the two clocks are independent. If they are not then I think a good upper bound is MTBF >= 1/( ( max f1 f2 ) e^{-t/tau} ) (?) The rationale is that the flip-flop cannot go metastable any more often than either f1 or f2 (remembering f2 is the edge frequency, though since the potentially metastable flip-flop is fed from another flip-flop clocked with frequency f2 that ends up being the same thing). It might be that even when the two clocks are produced by the same DCM there will be sufficient jitter to allow the upper bound to be improved considerably, but probably not if the best known bound on T0 is of a similar magnitude to the jitter (?)

Could someone please confirm the above is correct? Many thanks in advance!

- C
- comp.arch.fpga.posting.account
  
posted
17 years ago

Tue, Sep 19, 2006 10:22 PM

I am designing a crossdomain synchroniser and wanted to check that I understand the formula for the mean time between metastable failures correctly. Sorry if the answer can be easily found on the web; I tried to find it and failed.

The usual formula is MTBF=1/(T0 f1 f2 e^{-t/tau}), where f1 is the clock frequency of a flip-flop's clock, f2 is the edge frequency at which its input transitions, T0 is the metastable window aperture size (it *is* called that?), and tau is the metastability time constant. A failure happens whenever the flip-flop becomes metastable and remains so for at least time t.

The value of T0 seems impossible to find for Xilinx FPGAs, presumably because it varies exponentially with tau, and that is difficult enough to measure accurately (?). I think that T0 can be at most t_{setup} + t_{hold}, so that might be one way of obtaining a value (?) [though Xilinx say that negative hold times are not guaranteed, so I should probably stick with just t_{setup} whenever the hold time is negative].

The above formula only works when the two clocks are independent. If they are not then I think a good upper bound is MTBF >= 1/( ( min f1 f2 ) e^{-t/tau} ) (?) The rationale is that the flip-flop cannot go metastable any more often than either f1 or f2 (remembering f2 is the edge frequency, though since the potentially metastable flip-flop is fed from another flip-flop clocked with frequency f2 that ends up being the same thing). It might be that even when the two clocks are produced by the same DCM there will be sufficient jitter to allow the upper bound to be improved considerably, but probably not if the best known bound on T0 is of a similar magnitude to the jitter (?)
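The conjectured bound for dependent clocks can be sketched the same way. Nothing in the thread confirms it, so the function below only restates the expression as posted; its name is a hypothetical one chosen for this example.

```python
import math

def mtbf_bound_dependent_clocks(f_clk, f_data, t, tau):
    """Conjectured bound: MTBF >= 1 / (min(f1, f2) * exp(-t / tau)).

    Assumes the worst case in which every opportunity for a sampling event
    (at most min(f1, f2) per second) lands inside the metastable window.
    """
    worst_case_event_rate = min(f_clk, f_data)
    return math.exp(t / tau) / worst_case_event_rate

# Placeholder inputs (assumptions for illustration only).
bound = mtbf_bound_dependent_clocks(f_clk=100e6, f_data=10e6, t=2e-9, tau=50e-12)
print(f"MTBF >= {bound:.3g} s")
```

Compared with the independent-clock formula, this bound drops the T0 term and one of the two frequencies, so it is smaller by a factor of T0 * max(f1, f2).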

Could someone please confirm the above is correct? Many thanks in advance!

- A
- Austin Lesea
  
posted
17 years ago

Tue, Sep 19, 2006 10:48 PM

Have you read:

formatting link

and

formatting link

?

Austin

- C
- comp.arch.fpga.posting.account
  
posted
17 years ago

Tue, Sep 19, 2006 10:53 PM

Sorry to follow myself up. I made a mistake in the post (used max when intending to use min, though neither is incorrect), so I cancelled it and posted the corrected version moments later. Google groups seems to have honoured the cancel request, but another Usenet server I use has not. Sorry if you see two copies.

- C
- comp.arch.fpga.posting.account
  
posted
17 years ago

Tue, Sep 19, 2006 11:11 PM

Many thanks for your reply.

Of course. It seems to be about the only source of the value of tau for Xilinx FPGAs I could find. Have I missed other TechXclusives that give the same data for Virtex4/Spartan etc?

formatting link

That is the same thing, right?

And that is yet another substantially identical copy, except in PDF?

I am very sorry if I missed it, but the article you refer to does not seem to give a value for T0 and does not address the case when the two clocks are not independent. What am I missing?

- P
- Peter Alfke
  
posted
17 years ago

Tue, Sep 19, 2006 11:21 PM

formatting link

- C
- comp.arch.fpga.posting.account
  
posted
17 years ago

Tue, Sep 19, 2006 11:43 PM

Many thanks for your reply.

Certainly not as much as tau, though it can make the difference between being comfortable using a synchroniser with just one flip-flop and needing two. Even the "trivial" upper bound for V2Pro is something like 0.2ns, so with a 10ns clock it increases MTBF by a factor of 50. I would hope that T0 is much smaller than 0.2ns but have no way of knowing for certain. Would I be right in thinking that T0 cannot be measured with any accuracy?
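One reading of the factor of 50 is the ratio between the independent-clock formula with T0 = 0.2 ns and the bound that has no T0 term, assuming the 10 ns clock is the faster of the two frequencies. Under that assumption the ratio is simply the clock period divided by T0:

```python
# Values quoted in the post; the interpretation of the ratio is an assumption.
t0_upper_bound = 0.2e-9   # "trivial" upper bound on T0, in seconds
clock_period = 10e-9      # 10 ns clock, in seconds

# 1 / (T0 * f) with f = 1 / clock_period, i.e. clock_period / T0.
improvement_factor = clock_period / t0_upper_bound
print(f"MTBF improves by a factor of about {improvement_factor:.0f}")
```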

So are you saying that the upper bound I came up with is correct? I would certainly be pleased if you were.

I will not be the final user of the synchroniser for which I need to know the MTBF. I need to allow for the possibility that the two clocks will be produced by the same DCM and might therefore be synchronous.

- P
- Peter Alfke
  
posted
17 years ago

Wed, Sep 20, 2006 12:13 AM

Let's look at the basics: A flip-flop has an undefined output delay when the D input changes within a very tiny portion of the set-up time window, and the delay is the longer the closer that change is to the center of the tiny window. For a 3 ns extra delay I measured (indirectly) this tiny window as a small fraction of a femtosecond. Expressed this way, MTBF and data and clock frequencies fall out of the equation, and the behavior looks as if it were deterministic. So I consider this a basic figure of merit of the flip-flop.
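A minimal sketch of how such a window could be derived indirectly, assuming it comes from rearranging the usual formula so that T0 * e^{-t/tau} = 1/(MTBF * f1 * f2). The post does not spell out the method, so both the rearrangement and the function name are assumptions.

```python
def effective_window_seconds(mtbf, f_clk, f_data):
    """Width of the window that yields at least a given extra delay.

    From MTBF = 1 / (window * f1 * f2), so window = 1 / (MTBF * f1 * f2).
    The clock and data frequencies cancel out of the result's meaning:
    the window is a property of the flip-flop and the chosen delay only.
    """
    return 1.0 / (mtbf * f_clk * f_data)
```

Each choice of extra delay gives its own window width, which is why a single figure such as "a small fraction of a femtosecond" is always tied to a specific delay.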

If your two frequencies are correlated, you may have a very hard time calculating the probability that the two edges ever get that close. They may always be very close, or they may never be close at all. Especially if you are exposed to, or if you rely on jitter...

Peter Alfke

- C
- comp.arch.fpga.posting.account
  
posted
17 years ago

Thu, Sep 21, 2006 1:20 AM

Many thanks for your help so far. I am afraid this post is a bit long but with a bit of luck I am not the only one who cares to know the answers, even if the thread is surprisingly quiet for what can often be a rather popular topic. Perhaps it was something I said early on :-)

Yes, so it seems a shame that no-one will tell me what T0 is and hence I have to use the size of the entire setup window instead.

Yes, so that way is not directly useful if I have the frequencies and need to get the MTBF (?)

Sorry to nitpick, but did you mean to say "for a 500ps (not 3ns) extra delay" the window size is 0.03 fs?

Does that value (or anything else in the techxclusive) allow me to determine the MTBF of my synchroniser? I cannot see how I could do any better than get the MTBF of a circuit with exactly the same routing delay and other overheads as the circuit used in writing the techxclusive (unless I throw in a value of T0). I guess you might say that my settling time will be off by no more than maybe half a nanosecond if I just extrapolate from the MTBFs in the techxclusive. Sadly, that difference translates to a MTBF different by orders of magnitude and might result in a need for a second flip-flop if the user is sufficiently paranoid. Maybe an example of what I want to find would help:

I have the usual synchroniser composed of two flip-flops. The output of the first is directly connected to the input of the second. ISE has been told to ensure that the output delay + routing delay + setup time, adjusted for the difference of clock skews, can be at most x ns. The input of the first flip-flop is connected to a source that may transition f2 times per second. Both flip-flops are clocked at frequency f1. We can assume the clocks are asynchronous for now. I can look up the tau of the first flip-flop. Now, what is the mean time between failures, where a failure occurs whenever the second flip-flop samples a metastable signal?
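The question can be sketched as follows, assuming the time available for metastability to resolve is the clock period minus the constrained overhead x, and assuming a value of T0 is available. Both assumptions go beyond what the post states, and the function names are hypothetical.

```python
import math

def resolution_time(f_clk, x):
    """Assumed settling time: one clock period minus the overhead budget x (seconds)."""
    t = 1.0 / f_clk - x
    if t <= 0:
        raise ValueError("overhead x leaves no time for metastability to resolve")
    return t

def synchroniser_mtbf_seconds(t0, f_clk, f_data, x, tau):
    """Usual asynchronous-clock formula evaluated at t = 1/f1 - x."""
    t = resolution_time(f_clk, x)
    return math.exp(t / tau) / (t0 * f_clk * f_data)
```

The open point in the post is exactly the `t0` argument: without a published value, only an upper bound such as the setup window can be substituted.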

[ There are two solutions I tried, both based on obtaining T0 and using the usual formula.

One was to calculate T0 based on the values in the table in the techXclusive. This resulted in a ridiculously large time because, at a rough guess, half the time in the half-periods is spent on overheads. There is a note that says so in the techxclusive, it is just that it was not obvious exactly how huge the implied value of T0 would be :-)

The other was to get T0 from the graph. Taking the point where several of the lines intersect I came up with 0.03fs. This seems much too small because it matches the 0.03fs value mentioned earlier in the techxclusive that I interpret to be T_{500ps} (?). Obviously, T0 must be several orders of magnitude larger than T_{500ps}.

Is there a third way, or can either one above be made to work? ]

Yes, so I attempted to come up with a formula for a lower bound on the MTBF when the clocks are synchronous, and was hoping someone here could confirm that it is correct. To save anyone interested looking up the original article, it was MTBF >= 1/(( min f1 f2 ) e^{-t/tau}). I would be pleased to explain why I think the formula is correct if it is not obvious and anyone cares.

It would be equally good if someone could point me to a webpage/book/paper that discusses the MTBF when the clocks are synchronous. I am sure it exists, but could not find it myself.

Being exposed to jitter can only help in my circumstances. Relying on jitter is obviously much less safe, especially if the best known value of T0 is roughly the same order of magnitude :-(

- P
- Peter Alfke
  
posted
17 years ago

Thu, Sep 21, 2006 3:29 AM

Let's for a moment forget about the missing value for T0.

What I published were experimental results, extrapolated with the knowledge that MTBF increases exponentially. So I had to measure only two or three points on the (exponential) straight line.
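Because ln(MTBF) is linear in t, two measured points are enough to recover both the slope and the intercept. The sketch below assumes the usual formula holds and that each t is the true resolution time with routing and other overheads already subtracted; the function name is hypothetical.

```python
import math

def fit_tau_and_t0(t1, mtbf1, t2, mtbf2, f_clk, f_data):
    """Fit tau and T0 from two (resolution time, MTBF) points.

    ln(MTBF) = t / tau - ln(T0 * f1 * f2), so the slope gives 1 / tau
    and either point then gives T0.
    """
    tau = (t2 - t1) / (math.log(mtbf2) - math.log(mtbf1))
    t0 = math.exp(t1 / tau) / (mtbf1 * f_clk * f_data)
    return tau, t0
```

If the times fed in still include unknown overheads, tau is unaffected (only differences in t matter) but the fitted T0 is inflated by e^{overhead/tau}, which matches the "ridiculously large" value described earlier in the thread.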

We all know that the location of the tiny timing window moves about with temperature and voltage and is also affected by processing parameters. That's why the manufacturer gives you the wide spread for the set-up time. Pinning down the location of the tiny window inside the set-up time is meaningless and impossible, but we can measure its effective width. For uncorrelated clocks, we can measure and plot the MTBF and thus derive the probability of any additional (beyond normal prop-delay) non-deterministic, statistical metastable delay.

I plotted this as a graph, and mentioned that the vertical axis must be scaled by the inverse of the product of the two frequencies. So for 100 MHz clock and 10 MHz data, the y-axis values must be multiplied by 3 x 5 = 15. (Because the likelihood of any specific metastable delay is 3 times less due to the slower clocking, and another 5 times less due to the more sparse data edges).
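The scaling rule can be written out as a small helper. The reference frequencies of 300 MHz clock and 50 MHz data used below are inferred from the quoted factors of 3 and 5; they are an assumption, not figures stated in the post.

```python
def mtbf_scale_factor(f_clk_ref, f_data_ref, f_clk, f_data):
    """Factor by which graph MTBF values are multiplied for new frequencies.

    MTBF is inversely proportional to f1 * f2, so the factor is the ratio
    of the reference product to the new product.
    """
    return (f_clk_ref / f_clk) * (f_data_ref / f_data)

# Assumed reference conditions: 300 MHz clock, 50 MHz data.
factor = mtbf_scale_factor(300e6, 50e6, 100e6, 10e6)
print(factor)  # 3 * 5 = 15
```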

I am of the opinion that this is a really useful tool to quantify metastable behavior. The graph is based on real measurements, not on simulation, and not on an attempt to match an equation. That's why I do not get excited about the "missing T0"

Moreover, it shows that the traditional double-synchronizer is very effective. In most cases, the metastable delay never reaches the set-up time window of the second flip-flop. (Even if it did, the level change would have to fall into the tiny window on the second synchronizer flip-flop, which is a very remote probability). But the traditional rule still holds: minimize the delay between the two flip-flops. When every half nanosecond of available slack affects MTBF by a factor of a million, it would be foolish not to shorten that path.
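The "factor of a million per half nanosecond" figure implies a time constant, since MTBF scales as e^{t/tau}. A quick check, treating the quoted factor as exact even though it is clearly a rounded number:

```python
import math

slack_step = 0.5e-9    # half a nanosecond of extra slack, in seconds
mtbf_ratio = 1e6       # quoted improvement in MTBF for that step

# e^(slack_step / tau) = mtbf_ratio, so tau = slack_step / ln(mtbf_ratio).
tau_estimate = slack_step / math.log(mtbf_ratio)
print(f"implied tau is about {tau_estimate * 1e12:.0f} ps")
```

This works out to a tau of a few tens of picoseconds, which is why shortening the path between the two flip-flops pays off so heavily.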

I did these measurements first in 1988 (after years of frustrating experiments at my previous job, fresh at Xilinx, and tempted by the ability to put the "instrument" inside the device under test). Then I repeated the tests two more times, first with XC4000 and lastly with Virtex-2 Pro. There has been little pressure to repeat them again, but anybody who has a spare couple of days is welcome, and is assured of my assistance. It takes only an eval board, a crystal oscillator plus a clock generator, a good frequency counter and a stopwatch. Nothing exotic.

Anybody who knows how to c> Peter Alfke wrote: