Real examples of metastability causing bugs

What I don't understand is how these patents get written. Doesn't the person who comes up with this invention look at the timing waveform at the bottom of fig.1 and say "what happens if this input goes to the input of the xor [on fig.3]?"

Reply to
mk

Many organizations like to collect patents as bargaining chips in legal disputes. An engineer might come up with the original idea, but lawyers do the technical writing and interpretation. Some end up obeying the laws of physics. Some don't.

Apparently not in this case.

-- Mike Treseler

Reply to
Mike Treseler

Let me give some quantitative information: At a 300 MHz clock rate and roughly 50 MHz data rate, the Virtex-2Pro flip-flop exhibits an extra delay of 1.5 ns once every second.

What is the metastable capture window? The data changes once every 10 ns (50 MHz has a 20 ns period, with two changes per period). Within a second, the 300 MHz oscillator puts 300 million clock edges into this half period, and only one of them causes a metastable delay of 1.5 ns.

10 ns divided by 300 million = 0.03 femtoseconds = 33 × 10^-18 seconds. In other words, the capture window is extremely small... Peter Alfke
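A minimal sketch of that arithmetic (the figures are only the ones quoted above; the window falls out of the measured event rate):

```python
# Back-of-envelope check of the capture-window estimate above.
f_clk = 300e6                    # sampling clock, Hz
data_changes_per_s = 2 * 50e6    # 50 MHz data: two transitions per 20 ns period
events_per_s = 1.0               # measured: one extra 1.5 ns delay per second

# Each clock edge has probability (window * data_changes_per_s) of landing
# within the window of a data transition, so:
#   events_per_s = f_clk * window * data_changes_per_s
window = events_per_s / (f_clk * data_changes_per_s)
print(window)                    # ~3.3e-17 s: ~33 attoseconds, i.e. 0.03 fs
```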
Reply to
Peter Alfke

Maybe an interesting link on this topic:

WD

Reply to
Walter Dvorak

Well, we all know and respect Howard Johnson. His circuit will demonstrate metastable events, but it does not give quantitative data. How often does the output go metastable, and for how long, when it tries to synchronize two frequencies of that distribution? Those are the questions I have answered in XAPP094, with reasonable numeric precision. Peter Alfke

Reply to
Peter Alfke

Yes, it is small, but the analysis above seems a little suspect.

Of the 300 MHz edges, only one in every three has _any_ chance of hitting a metastable window; the others are effectively blanks. Say one 300 MHz edge is within 500 ps and so has some probability; the next two will be well clear of any window, and so have zero chance of triggering an event. That would indicate 0.1 fs, or 100 attoseconds (still small).

You could verify that by changing the rate and seeing whether a drop to 200 MHz moved the 'fire rate'.

-jg

Reply to
-jg


Interesting argument, but it is wrong. The easiest proof would be to imagine twice the clock rate, 600 MHz. I know intuitively that the metastability rate would be much (very much) higher, but by your reasoning, it would stay the same.

So here is a more rigorous attempt:

If data changes once per 10 ns, and we have one clock edge per second, the largest possible distance between data change and clock edge is of course 10 ns. At a 1 MHz clock rate, the maximum possible distance between data change and clock edge is a million times shorter, and this relationship continues: at 300 MHz, the maximum distance during any second is 10 ns divided by 300 million. (That's the 0.03 femtoseconds I mentioned.) Increasing the clock frequency even more will further reduce the maximum distance between clock edge and data change.

That's why the MTBF is known to be inversely proportional to the product of the two frequencies (clock and data), irrespective of which one is the larger value. The statistics of random events can be a tricky subject... Peter Alfke
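For reference, the first-order model usually quoted in the literature (XAPP094 writes it with empirical constants K1 and K2 standing in for T_0 and 1/tau) makes this symmetry explicit:

$$\mathrm{MTBF} \;=\; \frac{e^{\,t_{\mathrm{MET}}/\tau}}{T_0 \cdot f_{\mathrm{clk}} \cdot f_{\mathrm{data}}}$$

where t_MET is the settling time allowed beyond the normal clock-to-out delay, and tau and T_0 are device-dependent constants. The product f_clk * f_data in the denominator is symmetric in the two frequencies, so doubling either one halves the MTBF.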

Reply to
Peter Alfke


Very much? By your numbers it would only double. I think you are viewing this in terms of your test circuit, where the clock does more than one thing - it also checks for the failure - and in that case, yes, a faster clock would shrink the window and so give (very much) higher failure rates.

My argument is that a system only has a finite number of data edges, and so one cannot sensibly claim to 'roll the metastable dice more often' than that. Well, one can, but the calculated result is unrealistically short :)

It should be possible to verify this on a test bench, but it would need a test circuit that did NOT change the metastable window snapshot as the clock changes.

So, perhaps a clock enable on the first register that allowed 300 MHz, 150 MHz, 100 MHz etc. samples of the 51 MHz data (but the other registers keep their window sampling)?

It is somewhat academic, I'll admit, as designers don't care if an 'apparent window' is 33 or 100 attoseconds wide - both values are well under any system jitter.

Another interesting bench test would be to see how many adjacent samples could be made to exceed a metastable window. With a slow phase velocity and low system jitter, a very small trigger window would make this very rare. A wider trigger window would increase the probability of two consecutive edges both failing. If the jitter is 1000x as large as the trigger window, then the chance of two consecutive failures is 1 in a million (assuming nominal phase lock, with only jitter errors).
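Reading that last figure as plain independence arithmetic (my gloss, with the window-to-jitter ratio standing in for the per-edge hit probability):

$$P(\text{two consecutive failures}) \approx \left(\frac{w}{J}\right)^{2} = \left(\frac{1}{1000}\right)^{2} = 10^{-6}$$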

-jg

Reply to
-jg

Your argument was that two out of every three 300 MHz clock edges don't even have a CHANCE of creating the metastable event. Consider the "controlled" asynchronous version of this test, where the 300 MHz clock is phase locked to the 50 MHz clock being sampled and the 300 MHz clock is phase shifted across the full 10 ns of a single bit period. This 10 ns phase shift produces three capture window crossings. If you use a 200 MHz clock and shift those 10 ns, you will have two window crossings. If your sample clock frequency could approach infinity, you would *always* have a metastable event captured.

Peter's looked at this problem (measuring metastability) for a very, very long time with full experimental setups to verify his numbers. I don't believe the equations were created by him but have existed for ages. My first introduction to the theory and supporting equations was about 2 decades ago. Perhaps Peter Alfke authored that text.

- John_H

Reply to
John_H

Correct. You can roll the dice once, but not three times.

I'm not sure I follow. Peter's test circuit uses the clock for two different things: once to sample the data stream, and again to set the settling window that decides if metastable events occurred. Yes, this is nice and simple, but it can give the illusion that the faster clock gives more transition-settling-event samplings.

Yes, a good limit case (assuming zero jitter): this case would have these events every 10 ns (not every clock) - you cannot have more transition-settling events than transitions :)

With a difference of 3:1 between CLK and edge rates, the distinction I am trying to make is not large on the scale of metastable values, but I would derive a different window size than Peter from the same data.

It should be easy enough to verify, and also to get better test vehicles for more accurate window sizes. If I were making chips, I'd like to know that number as precisely as possible (even though it is way below any jitter, and some would say 'who cares?') because it could indicate whether a new process was actually better than an older one.

-jg

Reply to
-jg

If you used a fixed delay, rather than the sampling clock, to measure the metastability delay, you would remove that second clock from the situation.

If you think of statistics and imagine the capture window *is* 33 attoseconds, then the chance of *any* random edge hitting that capture window in the 10 ns data period (provided by the 50 MHz clock being sampled) is 33 as / 10 ns. If you could produce 300 million sampling edges (at any frequency, it doesn't matter) whose phase offsets relative to the 10 ns period increment by 33 as each step (assuming zero jitter *here*), then one and only one clock edge will cause a metastable event. When 300 million edges are presented asynchronously, those 300 million edges have a flat statistical distribution across the 10 ns period.
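A Monte Carlo sketch of that flat-distribution argument (illustrative numbers only, not a lab measurement; the window is exaggerated so hits show up in a million samples):

```python
import random

def events_per_second(f_clk, t_data=10e-9, window=33e-12, trials=10**6):
    """Estimate window hits per second for edges landing uniformly in t_data.

    Samples 'trials' edge positions out of the f_clk edges in one second,
    counts capture-window hits, and scales the count back up to a rate.
    """
    hits = sum(random.random() * t_data < window for _ in range(trials))
    return hits / trials * f_clk

# The real ~33 as window would need ~10**9 samples per hit, hence the
# exaggerated 33 ps window; the point is the scaling, not the magnitude:
for f in (100e6, 200e6, 300e6, 600e6):
    print(int(f / 1e6), "MHz:", events_per_second(f))
# The rate tracks f_clk: doubling the sampling clock doubles the events,
# "blank" edges notwithstanding.
```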

An infinite clock limit means metastability with or without jitter. Your assumption that jitter would be important in this limit case helps reinforce my view of your mathematical reasoning.

So do the darned verifications! You appear not to believe the many verifications that Peter HAS DONE. Your nay-saying an expert on this issue is silly and annoying. I wouldn't bother trying to underscore the validity after these first attempts if this was just a private discussion. I'm concerned that people who don't know much about metastability would see this conversation as an indication that the issues aren't clear.

Are there any other well defined issues you'd like to cast doubt upon? I mean... while we're at it.

- John_H

Reply to
John_H

I agree. This same discussion is dug up about once a year. This is where electronics meets quantum mechanics. Many smart people have been fooled on this subject.

-- Mike Treseler

Reply to
Mike Treseler

You do not seem to be reading what I am writing.

I have no issue with Peter's measurements or his test circuits, but I do have a small issue with the derived window size that he then calculates from those measurements.

As I have already stated, it is somewhat academic; does an end user care whether it is 33 attoseconds or 100 attoseconds?

Both are very small numbers.

-jg

Reply to
-jg

And the discussion on why those numbers are valid appears to escape you. Just because two of the three clock edges "never have a chance" of causing an event doesn't let you limit the even statistical distribution of edge position within the period to the "single edge closest to the window."

If the two clocks were different by a factor of 100 rather than a factor of 3, your suggestion on calculating the window size would be much less than academic.

I know you have an issue with the derived window size. I have read your posts. I see why your view of the approach is wrong. I've tried to give you different directions to look at the problem to see why your observations are less than complete.

I like to see people understand. I don't see it here.

- John_H

Reply to
John_H

There are also two error types: the average one and the peak one. Sometimes in engineering we like to think about the worst case as well as averages.

Someone else mentioned a locked metastable generation system, ie one that deliberately tries to be metastable.

Suppose I have a 1 MHz data rate and choose a 1 MHz (+ErrN) clock, or a 100 MHz (+ErrM) clock, and assume a 'nominally' real system with nice round numbers of a 0.1 fs window and 0.1 ps jitter.

Q1: Can these widely variant clocks ever give the same peak error rates?
Q2: Can the error rate ever go above one per microsecond?

-jg

Reply to
-jg

This discussion is not about politics or religion: This is science, and there is only ONE correct answer. But the newsgroup is a poor vehicle to convince someone who does not want to "get it". This debate should not go on forever...

On a related subject: It amazes me that there is so much talk and fear about metastability, but nobody gets his fingers dirty and performs real measurements. I published "my" original circuit 18 years ago (!), and to my knowledge no university has picked this up as a simple challenge. Any competent student can grasp the concept in less than a week, and anybody with the skill to configure FPGAs or CPLDs can duplicate these experiments in a short time with simple equipment (an eval board, a variable clock source, and a stopwatch), and every experiment usually runs for less than an hour. Why does nobody try to PROVE me (and Xilinx) right or wrong? I have publicly (in this newsgroup) offered my assistance, but nobody responded.

IC manufacturers (including Xilinx) do not seem to see metastability as a very important subject. There are always more burning design issues, like raw speed, functionality, size and power consumption, that take precedence. Further "metastability-hardening" might compromise some of the other important aspects. FPGAs have up to hundreds of thousands of flip-flops, and only a few of them will ever be challenged with metastability.

We all know that we can never avoid metastability, but I wanted to have quantitative proof that we can live with it, and have ways to design around it. Is a one-man effort sufficient for that? Thanks for your trust, but it feels a bit lonely... Peter Alfke

Reply to
Peter Alfke

There's a minute possibility that the error can happen on several consecutive clock cycles, so yes - the instantaneous rate can far exceed what a statistically even distribution of edges would predict, because the edges have probabilities of hitting the window, not certainties. Over a long enough period of time (a high enough population) the error rate will approach the statistical expectations.

If your example is the MLL - the Metastability Locked Loop - then jitter has a strong impact, but to determine the actual error rate the jitter distribution has to be part of the equation. There are too many ways to describe jitter. If the MLL is ideally centered, then the percentage of the population falling within the 0.1 fs window has to be determined relative to the "0.1 ps" jitter distribution. The 0.1 ps could be RMS or peak-to-peak, with any of several multiplying factors from RMS.

But IT DOESN'T MATTER. For determining the sampling window size, an even statistical distribution across a fixed period will give you the best statistical results.

Even in the totally asynchronous case, the possibility of hitting the same window multiple times in a row exists; it's just a very small probability.

If you want to think about "peak error rates", you could envision a system similar to the originally proposed 300 MHz/50 MHz system (which produces an average of 1 event per second) where the relative frequency offset is so small that it takes a minute for the sampling window to be visited by the "next" consistently located edge in this nearly synchronous design. In that situation, the events will "burst" about 60 errors in a short period of time. In an asynchronous system, the probability that the two signals stay that close in frequency for such a long period of time is extremely small.
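A toy version of that burst arithmetic (all numbers assumed for illustration; the offset in particular is arbitrary):

```python
# Near-synchronous case: a clock almost exactly 6x the data rate, so the
# sampled edge phase creeps slowly past a fixed capture window and the
# events cluster instead of arriving uniformly.
T_data = 10e-9        # spacing of data transitions (50 MHz, two per period)
window = 3.3e-17      # ~33 as capture window from the earlier estimate
f_clk  = 300e6
delta  = 1e-3         # assumed: 1 mHz away from a perfect 6:1 lock

T_clk = 1 / f_clk
drift = delta / f_clk                # phase creep, seconds per second
burst_interval = T_clk / drift       # window revisited once phase creeps T_clk
dwell = window / drift               # time the creeping phase spends in-window
hits_per_burst = dwell / T_data      # the near-transition edge recurs every T_data

print(burst_interval)                # ~1000 s of silence between bursts
print(hits_per_burst)                # ~1000 events per burst -> still ~1/s average
```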

If you want to continue serious discussion on this topic, please mail me directly. I figure others in this group are starting to glaze over when they see this thread belabored more and more. I may need to dig up some resources to start talking specific population ratios for jitter distributions but I could deliver the math.

- John_H

Reply to
John_H

Maybe you did a good enough job that the area isn't interesting any more?

I'd like to see data for different temperature, voltage, and rise times. I'm a bit surprised a university hasn't jumped on that one.

Rise times are hard to measure and control inside an FPGA. Maybe just long routing vs. short routing would be interesting.

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

Nice assumption...

Hal, XAPP094 shows the (not-surprising) effect of different voltages. I am sure that the external rise time would be totally swamped out by the gain in the clock buffering. Routing delays just shift the timing laterally and have no effect on the measurements. Don't expect newer families to be any better; the lowering of supply voltages has a bad impact... (But no need to cry: metastable delays are short enough for almost all cases, and smart users know the remedies for the remaining extreme cases.) The obviously better total system performance of the newer families comes from architecture and system improvements, not from a faster gain-bandwidth product in the flip-flops. IMHO, somewhat of a guess... Peter Alfke

Reply to
Peter Alfke

(snip)

I had originally thought of it based on a PLL, which would have jitter. That would be part of the measurement.

There is a web site in another post that uses one clock with a variable delay. That may help as far as jitter.

That is still a lot faster than without a lock system.

-- glen

Reply to
glen herrmannsfeldt
