Bit error rate

Is there any way to estimate the bit error rate of a data bus that passes through a Xilinx FPGA? I have input gates, the block RAM, and output gates involved in the system, and I would like to predict the error rate of data passing through.

Kevin

Reply to
Kevin Kilzer

I'm missing something. What kind of errors are you interested in?

If your design is clean, the error rate from everything short of cosmic rays should be zero, or at least low enough that it is very, very hard to measure.

Note that "clean" includes the logic, the power supply, and signal integrity (SI) on the input and output sides.

Reply to
Hal Murray

Then why do DRAM memory systems include a CRC or parity bit? Surely there is some non-zero probability that a latch will miss or a gate will experience a random noise spike?

If what you say is true, then the BER of a disk drive would be entirely the fault of a noisy head, and not of the deserializer, cache, or bus drivers?

Kevin

Reply to
Kevin Kilzer

Hi Kevin,

DRAMs include CRC to protect against bit flips in their RAM cells. The things they worry about are alpha-particle and neutron strikes. When a strike occurs, it can create a momentary current that flips the state of the RAM cell. The overall upset rate is a function of the number of RAM cells and the resilience of each cell: the bigger the capacitance of the cell, the harder it is to flip its value, and the more cells there are, the higher the chance that a strike will affect a given chip.
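
(A back-of-the-envelope way to see the scaling Paul describes: soft-error rates are usually quoted in FIT -- failures per 1e9 device-hours -- and the total rate grows roughly linearly with the number of cells. A minimal Python sketch; the per-megabit FIT value is an invented placeholder, not a measured number.)

    FIT_PER_MBIT = 1000.0     # hypothetical soft-error rate per Mbit (1 FIT = 1 failure per 1e9 hours)
    HOURS_PER_YEAR = 8760.0

    def upsets_per_year(mbits, fit_per_mbit=FIT_PER_MBIT):
        """Expected upsets per year: linear in the number of cells."""
        total_fit = mbits * fit_per_mbit          # FIT for the whole memory
        return total_fit * HOURS_PER_YEAR / 1e9

    for size in (1, 256, 4096):                   # memory size in Mbit
        print("%5d Mbit -> ~%.3f upsets/year" % (size, upsets_per_year(size)))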

There are some people who are starting to worry about particle-induced glitches in logic/routing, but the consensus is that this isn't a problem yet. For those who are very paranoid, there are techniques such as triple modular redundancy, or TMR (think of it as a circuit in triplicate that takes a best two-out-of-three result), that reduce the chance of logic faults to essentially zero.
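
(For the curious, a minimal behavioral sketch of the two-out-of-three voting at the heart of TMR. Real TMR lives in the FPGA fabric as triplicated registers and logic; this Python fragment only illustrates the voting function itself.)

    def tmr_vote(a, b, c):
        """Bitwise two-out-of-three majority vote."""
        return (a & b) | (a & c) | (b & c)

    # A flipped bit in any single replica is outvoted by the other two:
    assert tmr_vote(0b1010, 0b1010, 0b1110) == 0b1010
    assert tmr_vote(0b0010, 0b1010, 0b1010) == 0b1010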

A far more common cause of logic faults is poor design -- if your design is sensitive to momentary glitches (i.e., asynchronous), you are much more likely to have a problem when some event causes a glitch. Most often, this is due to cross-talk or other such down-to-earth electrical issues. We design the routing in our FPGAs so that it will not glitch even with worst-case aggressors.

BTW, if I recall correctly, the biggest cause of BER in a hard disk is the tininess of the bits it reads and writes -- they aren't clean 1s and 0s at that point, more like best guesses :-) You also get bit errors in the communication medium (cheap ribbon cable) connecting the hard disk to the disk controller.

Regards,

Paul Leventis, Altera Corp.

Reply to
Paul Leventis

The issue with DRAMs is that the cells occupy such a large portion of the die and are optimized for size, holding the minimum charge necessary to keep a bit until the next refresh cycle. This makes them particularly vulnerable to various forms of radiation. You'll notice there are very few designs where the random logic is protected against radiation-induced errors.

In disk drives, the main problem is the uncertainty in the bit lengths on the media and in the clock/data recovery after the data is captured by the head, in addition to the noise added by the head itself. Again, radiation-induced errors are not a major concern in the data path, or even in the cache, since the cache is most probably static memory, which is more resistant to such errors.

With well-tested logic (external scan, BIST, etc.), the probability of radiation-induced errors is completely negligible in almost all designs, unless they involve large quantities of DRAM or are used in space or life-critical applications.

Muzaffer Kal

ASIC/FPGA design/verification consulting specializing in DSP algorithm implementations

Reply to
Muzaffer Kal

Considering that you specifically mentioned "input gates, block RAM, and output gates" in your original posting, Hal's response was correct.

Now, if you'd actually mentioned DRAM and disk drives, I'm sure Hal's response would have been different.

Marc

Reply to
Marc Randolph

Kevin,

If the design has the proper amount of timing slack, and the clock for the design has well behaved jitter, then the error rate is 0.

If there is inadequate slack in the timing, and the clock jitter is unbounded, then the error rate is non-zero.

At some point, the error rate becomes so small that other things are likely to occur before an error is noticed/logged/reported. Like a power loss. Or a circuit failure somewhere (not in the FPGA).

Jitter is often modeled with Gaussian distributions, but actual oscillators do not have infinite energy, so they don't actually have unbounded "tails" where the jitter value keeps increasing indefinitely as the probability decreases (true random jitter).
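
(Under the pure-Gaussian model, you can estimate how often jitter eats your timing slack with the Gaussian tail function Q(x). A sketch with hypothetical numbers, chosen only to show how fast the probability collapses once there is real margin:)

    import math

    def q_func(x):
        """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    slack_ps = 100.0    # hypothetical timing slack
    sigma_ps = 5.0      # hypothetical RMS random jitter
    p = q_func(slack_ps / sigma_ps)   # chance a clock edge lands beyond the slack
    print("P(edge misses slack) ~ %.3g per cycle" % p)   # ~3e-89: effectively never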

Bit errors also almost never occur at a steady rate; rather, they occur in clumps or bursts, and are therefore not random at all. A channel with dribbling bit errors is broken, and should get fixed or have error correction added on top of it.

Check out the articles on the TechXclusives pages on jitter, timing, and slack.

Soft errors from cosmic rays are well understood (at least by us), so you can also take these into account (if an error every ~1000 years is important in your application - which it is for many today).

Check out the article on this on the Xilinx website: "1000 Years Between Single Event Upsets" on the TechXclusives pages.

By the way, we recently put the 90 nm Spartan-3 in the neutron beam, and we are gratified (and delighted) that it has ~30% smaller cross section than the 150 nm technology (i.e., it will be upset less frequently!).

(Presented at MAPLD this last month. For a copy of the presentation, contact your FAE.)
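
(To put upset rates like these in perspective, here is the arithmetic for turning a FIT rate into a mean time between upsets, and for applying a ~30% cross-section reduction. The 150 nm FIT value is a placeholder for illustration, not Xilinx's published figure.)

    HOURS_PER_YEAR = 8760.0

    def mtbf_years(fit):
        """1 FIT = 1 failure per 1e9 device-hours."""
        return 1e9 / (fit * HOURS_PER_YEAR)

    fit_150nm = 100.0              # placeholder upset rate for the 150 nm part
    fit_90nm = 0.7 * fit_150nm     # ~30% smaller cross section -> ~30% fewer upsets
    print("150 nm: ~%.0f years between upsets" % mtbf_years(fit_150nm))
    print(" 90 nm: ~%.0f years between upsets" % mtbf_years(fit_90nm))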

Austin

Reply to
Austin Lesea

Here is how I look at this area...

There are two types of electronics: logic and communications.

Logic includes computers. Everybody expects them to work correctly.

Communications includes things like Ethernet, fibers, and satellite links. People (including engineers) expect a few errors.

The error rate you actually get on a communications link is determined by the signal-to-noise ratio. That assumes classic Gaussian noise. See any good communications textbook. The key here is "Gaussian", which is a pretty good assumption for fibers and satellite links.

On most communication links, there is an economic tradeoff. How many miles can you go on a fiber before you get too many errors? How many bits/second can you get on a satellite link before you get too many errors?

Disks are similar to communications links. How many bits per square inch can I get vs what is the error rate when reading them back?

On the other hand, people expect logic (and/or computers) to get the right answer - zero errors. All that means is that they are running with a signal/noise ratio that is very very very good relative to communications links.

If you look at the classic errors-vs-signal/noise chart, you will see that it's exponential. Make the signal stronger and the errors go way down. Make it still stronger and you can't measure them. Classic logic (gates and FFs) operates in a signal/noise range that is so far off the scale that it doesn't make enough errors to worry about. You need to worry about other sources of errors instead - things like meteorites landing in your lab and smashing your FPGA but missing your error-testing gear.
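
(To see how steep that curve is: for BPSK on an additive-white-Gaussian-noise channel, the textbook result is BER = 0.5 * erfc(sqrt(Eb/N0)). A few extra dB of signal buys orders of magnitude in error rate. A quick sketch:)

    import math

    def ber_bpsk(ebn0_db):
        """Theoretical BPSK bit error rate on an AWGN channel."""
        ebn0 = 10.0 ** (ebn0_db / 10.0)
        return 0.5 * math.erfc(math.sqrt(ebn0))

    for db in (0, 4, 8, 12):
        print("Eb/N0 = %2d dB -> BER ~ %.2e" % (db, ber_bpsk(db)))
    # roughly 8e-2, 1e-2, 2e-4, 9e-9 -- each step down the curve is steeper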

If you look at errors on logic/computers, you can lump them into several buckets:

- Design/software errors. For example, Intel's divide bug.
- Fabrication errors from the factory: broken chips, inadequate testing, assembly errors, ...
- Systematic errors that are similar to noise. These are things we can localize and analyze, but often overlook: clock jitter (see Austin's msg), crosstalk, noise on power rails, alpha particles, cosmic rays, signal integrity, reflections (see the recent Spartan-3 discussions), ... These are the hardware versions of the software bugs above.
- Thermal noise: this is what's left after you correct for all of the above.

If something strange happens often enough, somebody will figure out what's going on and put a name on it, and it will get added to the list above.

DRAMs are interesting. They are on the border between communications and logic. We want them to work all the time, but we also want them to be cheap. Cheap means small, which means they are more likely to drop bits if an alpha particle hits the right place. (When people were first getting interested in alpha particles, the errors were black magic or "thermal" noise. As soon as people understood what was going on, they could measure and avoid the problem.)

If you want cheap DRAMs, you will get occasional errors. (You can't buy any other kind, so get used to it.) With ECC and good software (scrubbing) you can get close to error-free DRAMs. Similarly, with good FEC (Forward Error Correction) you can get close to no errors on satellite links.
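
(As a toy illustration of the ECC idea: a Hamming(7,4) code stores 4 data bits in 7, and the syndrome pinpoints any single flipped bit so that scrubbing can repair it. Real memory ECC uses wider SECDED codes, e.g. (72,64), but the mechanics are the same.)

    def encode(d):
        """4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
        d1, d2, d3, d4 = d
        return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]

    def decode(c):
        """Correct up to one flipped bit, then return the 4 data bits."""
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4,5,6,7
        pos = s1 + 2 * s2 + 4 * s3       # 1-indexed position of the bad bit (0 = none)
        if pos:
            c[pos - 1] ^= 1              # flip it back
        return [c[2], c[4], c[5], c[6]]

    word = [1, 0, 1, 1]
    cw = encode(word)
    cw[5] ^= 1                           # simulate a single-event upset
    assert decode(cw) == word            # the scrubber recovers the data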

Back to FPGAs. Roughly, they don't make any errors. What I mean by that is that the gates and FFs work as expected. I expect there is some thermal noise, but I doubt you can measure it. There are too many other things causing more interesting problems.

If your question was really how many errors to expect on a simple input-FPGA/BRAM-output type design, my answer would be "it depends". How solid is your design? Is any metastability involved? What sort of external noise/EMI is present? What are your input/output lines connected to? Is the clock solid? When you are done with all those questions, then you get to ask about alpha particles and cosmic rays.
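
(On the metastability question: the standard synchronizer MTBF estimate is MTBF = exp(t_meta / tau) / (Tw * f_clk * f_data). The flip-flop constants below are invented for illustration -- real values come from the device's characterization data -- but they show how fast the exponential grows once you allow adequate settling time.)

    import math

    tau = 50e-12      # hypothetical metastability resolution time constant (s)
    Tw = 100e-12      # hypothetical metastability capture window (s)
    f_clk = 100e6     # sampling clock (Hz)
    f_data = 10e6     # asynchronous data toggle rate (Hz)
    t_meta = 5e-9     # settling time allowed before the next stage samples (s)

    mtbf_s = math.exp(t_meta / tau) / (Tw * f_clk * f_data)
    print("MTBF ~ %.3g years" % (mtbf_s / (3600.0 * 24.0 * 365.0)))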

It's really, really hard to prove that your design is solid. The software/systems guys have a neat phrase: testing can't prove the absence of bugs; it can only demonstrate their existence.

The software guys have a set of tricks that make looking for bugs more productive. Similarly, with hardware, it helps to look in the places that are likely to cause errors. Put a scope on your clocks. Check your power. Look at the places where signals cross clock domains. ...

Reply to
Hal Murray

Thanks to all for the clear explanations. I will further assume that, since SRAM is similar to logic (unlike DRAM), the SRAM is practically immune to upset as well. I'll stop worrying about random events in the logic and concentrate my testing on the other factors that were mentioned.

Kevin

Reply to
Kevin Kilzer


SRAM is more immune than DRAM, but SRAM is frequently built using process-optimized cells, and it does have a (small) probability of being affected by intrinsic radiation.

This is why most processor vendors have started using ECC on their caches, especially the larger (L2+) caches.

That being said, I don't expect to see these in an FPGA, and it's highly unlikely to be the source of any problems you might see.

-hpa

Reply to
H. Peter Anvin
