Electronic News Article on 90 nm soft error FUD

Hello from the SEU Desk:

Peter defended us rather well, but how can one seriously question real data vs. babble and drivel?

Well, after 919 equivalent device years of experiment at sea level, Albuquerque (~5,100 feet), and the White Mountain Research Center (12,500 feet), the Rosetta Experiment* on the three groups of 100 2V6000s has logged a grand total of 45 single soft error events, for a grand total of 20.4 years MTBF per device (or 5335 FITs; FITs and MTBF are related by a simple formula, mean time between failures vs. failures per billion device-hours).
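The simple formula is worth writing down: FITs count failures per billion device-hours, so converting to and from MTBF is a single division. A quick sanity check of the numbers in this thread (Python, purely illustrative):

```python
HOURS_PER_YEAR = 24 * 365.25

def mtbf_years_to_fits(mtbf_years):
    """FITs = failures per 1e9 device-hours."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)

def fits_to_mtbf_years(fits):
    """Inverse conversion: FITs back to MTBF in years."""
    return 1e9 / fits / HOURS_PER_YEAR

# 919 device-years of testing with 45 events:
print(round(919 / 45, 1))              # 20.4 years MTBF per device

# 535 FITs of design-affecting upsets works out to
print(round(fits_to_mtbf_years(535)))  # ~213 years between hits
```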

In actual tests done by third parties, it takes from 6 to 80 soft errors (flips), about 10 flips on average, to affect a standard non-redundant design in our FPGA. This is just common sense: for years ASIC vendors trashed FPGAs because they "use 30 transistors to do the job of just one!" Guess what? What was our "downfall" is now a strength!

True. So that means that a 2V6000 at sea level gets a logic disturbing hit once every 200 years.

535 FITs (soft errors affecting customer design) for a 6 million gate FPGA.

The biggest part A**** makes is 6 times smaller, so for our 2V1000, we get about 90 FITs. For a 3S1000, it is 30% better (see below), or 63 FITs. OK A****, tell us what your actual, as-measured FIT rate is for your largest device? Go ahead, I'd like to know. How many device years do you have to back it up? 1000 actual years? Nope. Didn't think so.

You know, if you want to use FITs, we'll use FITs. But I am afraid it will give those spreading nonsense fits (pun intended).

Now, if you use triple-redundant logic, checksums, and ECC codes, you can design so you NEVER HAVE AN UPSET.
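The triple-redundancy idea is just majority voting over three copies of each bit. A minimal sketch in Python for clarity (in an FPGA this would of course be a majority voter in fabric logic, not software):

```python
def majority(a, b, c):
    """Majority vote over three redundant copies of a bit.
    A single upset copy is outvoted by the other two."""
    return (a & b) | (a & c) | (b & c)

# A flip in any one copy leaves the voted output unchanged:
assert majority(1, 1, 0) == 1   # one copy flipped low, still reads 1
assert majority(0, 0, 1) == 0   # one copy flipped high, still reads 0
```

Combine voting with repairing the flipped copy (scrubbing) and a single upset never propagates to the output.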

As has been published, Xilinx FPGAs are on the Mars landers (on their way there now), so someone is not concerned about upsets. Even periodic reconfiguring (scrubbing) eliminates a major portion of the probability of logic-affecting upsets. Virtex-II and Virtex-II Pro have ways to actually check, detect, and correct the flipped bits using the ICAP feature. For details, contact your FAE. If 535 FITs is completely unacceptable for that critical application you have, this makes it 0 FITs from soft errors.

Some of our customers have now qualified Virtex-II Pro as the ONLY solution to the soft error problem, as ASICs can't solve it (easily, like we have), and other FPGAs do not have the facts to back up their claims. That is quite new: the Xilinx FPGA is the only safe design choice to make? Maybe it is right now, as it is the only choice where all of the variables are measured, understood, and techniques exist to reduce the risk to near zero, or whatever level is acceptable.

Oh, and yes, the 90 nm technology is now 30% better than the 150 nm technology (15% better than the 130 nm technology), as proven by our tests (as presented at the MAPLD conference last month).

So, you can run around blathering on about data taken by grad students (no offense, I was one at one time), or you can look at our real time results from three locations on 300 devices being tested 24 by 7, or talk to us about our beam tests in protons and neutrons, or ways to design to get the desired level of reliability for your system.

And, you may want to consider going with the vendor who has been actively working on soft error mitigation for more than five years now. And has real results to show for it.

Let Moore's Law Rule!

Austin

*Rosetta Stone was the key that unlocked ancient Egyptian wisdom to the world. The stone had an inscription in three languages, which allowed archeologists to decipher ancient Egyptian writings. The Rosetta FPGA Experiment is designed to translate beam testing (proton or neutron) into actual atmospheric, or high altitude results, without having to actually build huge arrays of FPGAs and send them to mountain tops around the world to get real results. It was also designed to answer the basic questions of altitude effects, position effects, and how smaller device geometries behave in the real world.

But keep in mind that SEUs are random events, unlike other failure mechanisms that depend on cumulative damage, so if one device has an MTBF of 20 years then a system with 20 devices has an MTBF of one year. Most professionals in the radiation effects field don't use MTBF as a measure of SEU immunity, they use errors/bit-day or a similar metric.

And if you have 200,000 in the field at sea level, then 2 or 3 are getting a logic-disturbing hit EVERY DAY. Or if you have a critical mission that lasts five years, then your chance of getting a logic-disturbing upset is one in forty. OK for a PC running Windows, perhaps, but if you are building warheads....
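Those back-of-the-envelope numbers check out against the 535 FITs quoted upthread. A sketch, assuming upsets arrive as a Poisson process:

```python
import math

rate_per_hour = 535e-9          # 535 FITs = upsets per device-hour

# 200,000 fielded devices: expected logic-disturbing hits per day
print(round(200_000 * rate_per_hour * 24, 1))   # ~2.6 per day

# One device on a five-year mission: P(at least one upset)
hours = 5 * 365.25 * 24
p = 1 - math.exp(-rate_per_hour * hours)
print(f"about 1 in {round(1 / p)}")             # roughly 1 in 43
```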

Again, FITs is not a good metric. These aren't "failures in time"; they are random events. An SEU can happen in the first millisecond of operation or after 200 years of operation.

Now that's just misinformation. We've put a number of ASICs in space, and in worse environments than the surface of Mars. How did Galileo survive the sulfur ions around Jupiter for ten years without your products?

Can you tell us what the penalty in area and speed would be in going to TMR? And exactly which of your products have sufficient resistance to total ionizing dose to be considered for space applications...do your current state-of-the-art products fit in this category?

I've been in this business for twenty years, on both the military and civilian side. I've designed full custom, ASIC and FPGA products for a variety of space applications.

Methinks the lady doth protest too much...

Joe

--
K. Joseph Hass
Center for Advanced Microelectronics & Biomolecular Research
721 Lochsa St., Suite 8               Post Falls, ID   83854

Let me just address the relatively simple subject of FITs vs MTBF.

100 FITs means an MTBF of 10 million hours, i.e. more than a thousand years.

But nobody I know would be silly enough to interpret this to mean that each circuit lives that long and then suddenly dies. We all assume a statistically even distribution (with different parameters describing infant mortality). That's why we laughed when Actel (in the original press quote) made such a big issue about the difference:

"Actel, currently the only anti-fuse FPGA maker, refuted this suggestion, pointing out that Xilinx's use of mean time between failures (MTBF) is the wrong metric to measure error rates: "MTBF is the wrong statistic, because a neutron event is random," said Brian Cronquist, senior director of technology at Actel."

I sent him an e-mail suggesting that we disagree on more relevant things. No answer. Seems like they don't have a more meaningful rebuttal. Enough said.

Obviously the Xilinx large-scale "Rosetta" test results have given the antifuse community fits (pun intended). They should.

That is not to say that we are perfect, or that we have the only viable solution. But antifuses have lost their (high-priced, small-size) monopoly. And fresh blood and competition is always healthy, even in aerospace!

Peter Alfke


Joe,

Thanks for giving me the opportunity to reply.

I thought no one cared to comment.

See below.


So, the device has 20 million bits. Do the math. I have stated all the arguments. You like cross section? Bit errors per unit time? Just poke the buttons on your calculator. It is all statistics (even MTBF or FITs). Soft errors are no different from any other failure: they are random!
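If errors/bit-day is the preferred metric, the conversion from the device-level rate really is just calculator work. A sketch, using the ~5335 raw-upset FITs and 20 million bits quoted in this thread:

```python
BITS = 20_000_000      # configuration bits in the device, per the post
DEVICE_FITS = 5335     # raw soft-error rate measured for the whole device

# FITs are failures per 1e9 device-hours; rescale to per bit, per day:
errors_per_bit_day = DEVICE_FITS / BITS * 24 / 1e9
print(f"{errors_per_bit_day:.1e}")   # ~6.4e-12 errors/bit-day
```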

Yes! And if I had 200 million of them, I would be getting an error every millisecond! Oh my! Help! Oh s**t! Give me a break. This is standard 5 o'clock news hype: just make it sound as bad as possible. Fact: each unit will still fail only once every 200 years. If you are fortunate enough to have sold a million units, then you should also be smart enough to use the necessary design techniques to mitigate being put out of business by the more dominant failure rate of the hardware in the system itself. Soft errors are a small part of the overall system reliability calculation you must perform. That is my point here.

Oh yes, and it happened right now! Oh my! Stop it. Give it up. You can only scare people who are ignorant of real world effects.

I don't know? Did it use 90nm technology? Nope.

None in speed. It uses up 3X+ the logic, though.

Yes. We have rad-hard FPGAs for total ionizing dose. Look it up in the Q-Pro line on the web. The devices are immune to SEL, too. ASICs and standard parts are having problems with SEL now. Didn't you know that? Haven't been reading your LANSCE test updates, huh?

Good, then you should welcome all the work we are doing, and the progress we are making. And you should recognize the FUD that is being spread about by others who are not only ignorant of what is going on, but have no other intent than to save their own skins by spreading as much false information as possible.

All the world's a stage.....

Austin

Hi Austin,

Out of interest, how many of the 300 parts in your experiment broke permanently? Any at all? If there were any 'hard' failures, did altitude affect this statistic, or were these failures due to other mechanisms?

Syms.


Symon,

None. We have no possibility of Single Event Latch-up (we have already tested every family in the neutron beam). No hard failures whatsoever.

The device FIT rate is probably somewhere around 20 FITs (estimated from high-temperature operating life testing).


Austin
