Single Event Functional Interrupts (SEFI) in Virtex

- Praveen

Wed, Apr 6, 2005 5:36 PM

Hello all,

I am doing a literature survey on SEFIs in Xilinx FPGAs. Unfortunately, there are not many papers on the subject, either because it is not a major issue or because little work has been done on it.

Have you encountered SEFIs in your designs? If so, what mitigation methods did you use? I would appreciate any feedback.

Thank you.

- Peter Alfke

Wed, Apr 6, 2005 5:58 PM

Go to the TechXclusives paper at

formatting link

Peter Alfke, Xilinx Applications

- Austin Lesea

Wed, Apr 6, 2005 6:53 PM

Praveen,

To what are you referring?

A single event error in the logic fabric (CLB SEE)?

A single event transient in the fabric (CLB SET)?

A single event upset of a memory cell (SEU config)?

A single event upset of a BRAM memory cell (SEU BRAM)?

A single event transient or single event error which affects the entire device (as in fooling the chip into thinking PROG was pulled low)?

We usually refer to the last event (loss of configuration, global reset, global tristate) as a SEFI (single event functional interrupt).

Not everyone uses this terminology, but we use it because it is descriptive of what happens (on very very very rare occasions!).

Our mil/aero/automotive customers also think this way, and we have statistics for the probability of any of the above happening, all the way from 1 million years for a SEE or SET in the fabric for the largest device, to the article that Peter pointed you to for config upsets (more than 1000 years).

By the way, Virtex 4 has now improved upon the upset rates thanks to our design techniques, and has shown a reduction of the configuration memory FIT rate to 60% of the Virtex II rate. This winds the clock back to the days when people didn't even think about SEUs. Watch for the TechXclusive on this subject (appearing soon).

If you want all the details, contact our mil/aero/automotive group FAEs, who are trained in this and have all the field tests, studies, etc. at their fingertips. For us, there are no unknowns in this regard. After all, we are used in airplanes, spacecraft (and automobiles), so these folks want to know exactly what the probabilities of failures are, and how to mitigate them (deal with them when they occur, or mask them so you never see a failure in the system).

Austin

- Austin Lesea

Wed, Apr 6, 2005 6:58 PM

I just realized,

So as not to confuse anyone, if VII is 1000 years MTBE (mean time between config upsets -- which is actually better than that based on more recent data, but we will just go with it for now), V4 is better, so it is ~ 1,667 years for the same number of config bits. (60% of 1667 years is 1000 years).
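The arithmetic above can be checked with a short sketch. It assumes the usual definition of FIT as failures per 10^9 device-hours and a 365.25-day year; the function names are illustrative only and do not come from any Xilinx tool.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: 365.25-day year


def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert a mean time between events (years) to FIT (events per 1e9 device-hours)."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit: float) -> float:
    """Convert FIT back to a mean time between events in years."""
    return 1e9 / (fit * HOURS_PER_YEAR)


v2_mtbf_years = 1000.0                    # Virtex II figure quoted in the post
v2_fit = mtbf_years_to_fit(v2_mtbf_years)
v4_fit = 0.60 * v2_fit                    # Virtex 4 at 60% of the Virtex II FIT rate
v4_mtbf_years = fit_to_mtbf_years(v4_fit)

print(f"Virtex II: {v2_fit:.1f} FIT, {v2_mtbf_years:.0f} years between config upsets")
print(f"Virtex 4:  {v4_fit:.1f} FIT, {v4_mtbf_years:.0f} years between config upsets")
```

Because the rate and the mean time between events are reciprocals, cutting the rate to 60% stretches the interval by 1/0.6, which is where the ~1,667 years comes from.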

If you do nothing about upsets, 90nm is worse than 130nm, is worse than 180nm, etc.

So, you have to do something to make it better.

We did.

(You're welcome.)

Austin

- Praveen

Wed, Apr 6, 2005 7:56 PM

Hi Austin,

Thanks for the reply. First of all, I am using Virtex-II. I am concerned about SEUs occurring in the control logic of the device (leading to SEFI-like behavior). I have obtained a document from the SEE consortium that discusses the different SEFIs (such as POR, SMAP and JTAG) and the ways to mitigate them. I also found other presentations on the Xilinx website that discuss the same thing.

I wanted to know if there are any other SEFI issues in Virtex-II and the mitigating methods that can be used (and have been used).

Thank you,

-Praveen

- Austin Lesea

Wed, Apr 6, 2005 9:14 PM

Praveen,

You have the paper, and the modes discovered, and the work-arounds.

There is a lot of work on-going by the mil/aero community on radiation testing. Perhaps your company would consider joining the radiation effects consortium that we sponsor (if you have need for this)?

Austin

- Praveen

Thu, Apr 7, 2005 5:14 PM

Austin,

The methods discussed in the document mitigate the SEFI but result in the loss of data (because of reconfiguration). Could you suggest a way (if there is one) of mitigating the SEFI without losing the information? We would like to try it out even if it is complicated.

Thanks for your input.

-Praveen

- Austin Lesea

Thu, Apr 7, 2005 6:39 PM

Praveen,

Well, the problem is that a SEFI might hit the line which controls the "clean-out" (zeroization of all config and BRAM).

If that happens, then basically, the upset has caused the device to re-initialize (or just go stupid). There is no way to prevent this from happening. We are researching how to design so this can not occur, but this is a very tough problem. An event can strike any transistor.

uP, ASSP, and ASICs also have SEFI behavior. And yes, theirs is incredibly rare as well. FPGAs are similar in SEFI behavior to all the other devices. Maybe better. I haven't seen the SEFI x-section for a Pentium chip.

Yes, these SEFI cross sections are incredibly small, and the probability is also small that this happens, but presuming it does happen, there is really nothing at all that you can do (except detect that it happened, and reconfigure the device from scratch).

Systems that have to be hardened against SEFI will use a CPLD, or other device, to detect that a SEFI occurred, and reconfigure the FPGA. In the time it takes to recognize the SEFI and reconfigure, all data is lost (unless the FPGA is part of a redundant system, which is commonly used for critical applications).
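As a rough illustration of the supervisor idea, here is a minimal behavioral sketch in Python (a real supervisor would be logic in the CPLD, not software). It assumes the FPGA design emits a periodic heartbeat and that the supervisor can also sample a DONE-style status pin; the class name, the timeout value, and the `reconfigure` callback are all hypothetical.

```python
from typing import Callable


class SefiWatchdog:
    """Behavioral model of an external supervisor (e.g. a CPLD) watching an FPGA.

    Hypothetical protocol: the FPGA calls heartbeat() at least once every
    `timeout_s` seconds, and the supervisor calls poll() periodically.
    """

    def __init__(self, timeout_s: float, reconfigure: Callable[[], None]):
        self.timeout_s = timeout_s
        self.reconfigure = reconfigure
        self.last_beat = 0.0
        self.reconfig_count = 0

    def heartbeat(self, now: float) -> None:
        """Record that the FPGA design is still alive at time `now`."""
        self.last_beat = now

    def poll(self, now: float, done_pin_high: bool = True) -> bool:
        """Return True if a functional interrupt was declared at time `now`."""
        stalled = (now - self.last_beat) > self.timeout_s
        if stalled or not done_pin_high:
            self.reconfigure()        # reload the FPGA from scratch
            self.reconfig_count += 1
            self.last_beat = now      # restart the timeout after reconfiguring
            return True
        return False


events = []
wd = SefiWatchdog(timeout_s=0.010, reconfigure=lambda: events.append("reconfig"))
wd.heartbeat(now=0.000)
healthy = wd.poll(now=0.005)   # heartbeat is 5 ms old: within the timeout
tripped = wd.poll(now=0.020)   # heartbeat is 20 ms old: declare a SEFI
```

Any state held inside the FPGA is gone after `reconfigure()` runs, which is the data loss described above; the supervisor only bounds how long the outage lasts.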

In the order of increasing robustness:

- no measures taken (susceptible to SEU, and SEFI): the vast majority of all FPGA applications fall into this category, as do ASIC, ASSP, and uP.

- use TMR on the user pattern to remove the effects of SEUs completely, still susceptible to SEFI: a step that gets rid of SEU effects. Also used in some special ASICs for mil/aero.

- scrub the config memory (continually reload the config while operating): used by many space probes, still susceptible to SEU and SEFI, but recovers very quickly, and SEUs do not accumulate, leading to an overall availability improvement. This is what the Mars Lander Pyro controller did. The landers themselves just reconfigure once a day (enough to mitigate the effects they anticipated).

- scrub and use TMR: now we only have SEFI to worry about. The best choice for getting to the level of reliability where the only thing that can be of any trouble is a SEFI. Good enough for just about anything except where human life is concerned.

- read back the config and fix the bits that flipped (use of V4 FRAME_ECC): similar to scrubbing, but faster and with less hardware. Same as the non-TMR scrubbing case above.

- read back and fix config for a TMR design: only SEFIs to worry about. Good for just about anything except a human life.

- monitor the device with another device (e.g. a CPLD) for SEFI, reconfigure if a SEFI occurs: used in critical space and avionics. May also be doing TMR, scrubbing, etc. as well. This is still not good enough for a human life situation unless the time to recover is fast enough not to matter.

- provide one other FPGA, dual redundant: dual redundancy allows transfer away from a fault, and is used for even higher availability (each individual unit may be scrubbing, using TMR, etc. There may also be a "voter" to switch between FPGAs in case of SEFI). Almost the highest level of availability, in that we still don't trust even this arrangement for human life situations. It may get used in military systems where the probability of death is much, much higher than the probability of a system failure, so added system availability is not needed (a really tough decision, one I gladly don't have to make).

- fully duplicated, dual redundant: used by things like commercial airliners and airports. Two redundant systems that can be selected manually by the air traffic controllers or pilots in the unlikely event that one of the redundant systems fails. Within each redundant system, various levels of protection may or may not be necessary, since the entire system is duplicated. A system with no scrubbing of the FPGAs, but with many self-checks that are done independently of the FPGA, is in fact used in all US and Canadian airports for all communications between the ground and air, and ground and ground. I designed it. If one redundant unit either detects a failure in itself, or the redundant unit it is paired with detects that its partner has gone stupid, it switches itself in, and the other out, in less than 50 ms. If the air traffic controllers can't talk to the airplanes for some reason, they have a manual switch they push to transfer everything over to another set of com links, radios, antennas, etc.
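Two of the building blocks in the list above, TMR voting and scrubbing from a known-good image, can be sketched behaviorally in a few lines of Python. This is only a toy model on 32-bit words, not FPGA configuration frames or any Xilinx API; the function names and the "golden image" representation are assumptions made for illustration.

```python
import random
from typing import List


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote, as a TMR voter would compute it."""
    return (a & b) | (a & c) | (b & c)


def flip_random_bit(words: List[int], rng: random.Random, width: int = 32) -> None:
    """Model a single event upset: flip one bit in one randomly chosen word."""
    i = rng.randrange(len(words))
    words[i] ^= 1 << rng.randrange(width)


def scrub_from_golden(copy: List[int], golden: List[int]) -> int:
    """Restore `copy` from the golden image and return how many words changed.

    A blind scrubber would simply rewrite everything; the comparison here
    exists only so the sketch can report what was repaired.
    """
    repaired = 0
    for i, good in enumerate(golden):
        if copy[i] != good:
            copy[i] = good
            repaired += 1
    return repaired


rng = random.Random(0)
golden = [rng.getrandbits(32) for _ in range(8)]
replicas = [list(golden) for _ in range(3)]

flip_random_bit(replicas[1], rng)                   # one upset in one replica
voted = [majority(*words) for words in zip(*replicas)]
masked = voted == golden                            # TMR hides the single upset

repaired = scrub_from_golden(replicas[1], golden)   # scrubbing removes it
```

Voting masks the upset immediately, while scrubbing keeps upsets from accumulating until two replicas disagree in the same bit. Neither helps if the event takes out the whole device, which is why the SEFI cases in the list need an external monitor or a redundant unit.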

Austin

- Kris Vorwerk

Thu, Apr 7, 2005 7:15 PM

You might find the following interesting/relevant ... (It's just something I came across while Googling) ...

formatting link

Kris

- Austin Lesea

Thu, Apr 7, 2005 8:49 PM

Kris,

Nice ppt presentation. They quote 65 years in orbit around the earth mean time between SEFI.

We have published results of heavy ion testing that state we are at (least) 1.5E-6 SEFI/day in earth orbit, which is about 1,800 years between SEFIs. Not sure where the discrepancy comes from.

It may be that work was done on commercial parts, instead of using the Qpro series (which has EPI wafers). If you are going to go into space, you are better off using the Qpro devices.

formatting link

Page 3, Table 3.

Not sure that EPI alone would have more than a factor of 2 improvement in upset rate (SEU or SEFI).

Could be that the authors of the ppt also divided by 24 (thinking our specification was in hours). That yields a number closer (76 years--but wrong nonetheless).
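The conversion from the published rate to the intervals quoted above is easy to reproduce. The sketch below uses only the 1.5E-6 SEFI/day figure from this post and assumes a 365.25-day year; the function name is illustrative.

```python
DAYS_PER_YEAR = 365.25  # assumption: 365.25-day year


def mean_years_between_events(rate_per_day: float) -> float:
    """Mean interval between events, in years, for a constant event rate per day."""
    return 1.0 / rate_per_day / DAYS_PER_YEAR


sefi_rate_per_day = 1.5e-6   # heavy-ion figure quoted for earth orbit
orbit_years = mean_years_between_events(sefi_rate_per_day)

# If the same number were misread as a per-hour rate, the interval shrinks by 24x.
misread_years = orbit_years / 24.0

print(f"per-day reading:  {orbit_years:.0f} years between SEFIs")
print(f"per-hour reading: {misread_years:.0f} years between SEFIs")
```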

Sea level sees ~ 40 times fewer upsets, so a SEFI at sea level is ~ 7,300 years.

We have some customers with more than 250,000 Virtex IIs in the field (monitored), and that would mean they would have ~ 35 SEFIs a year. Since they have had far fewer (in fact: none reported), one has to take even this projection as overly conservative for us earthlings on the ground.

Also, space-based projections of failures use heavy ions, and earth-based projections of failures use protons and neutrons. There is a factor of (at least) 1e5 to 1e6 there in terms of the size of the "bullet!"
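The fleet estimate can be reproduced as below. The sketch takes the 7,300-year sea-level figure and the 250,000-device fleet as given, and adds one assumption that is not in the post: that SEFIs arrive as independent Poisson events, so the chance of seeing none in a year is exp(-expected). Note that scaling ~1,800 years by a full factor of 40 would give about 73,000 years and roughly 3.4 events a year, so the sketch does not try to derive the 7,300-year figure.

```python
import math

fleet_size = 250_000     # monitored devices, as quoted in the post
mtbf_years = 7_300.0     # sea-level SEFI interval, as quoted in the post

# Expected fleet-wide events per year if every device had this interval.
expected_per_year = fleet_size / mtbf_years

# Assumption: independent Poisson arrivals, so P(no events) = exp(-expected).
p_zero_events = math.exp(-expected_per_year)

print(f"expected SEFIs per year: {expected_per_year:.1f}")
print(f"probability of observing none: {p_zero_events:.1e}")
```

Under that assumption a year with no reported SEFIs would be vanishingly unlikely if the projection were accurate, which is consistent with calling the projection overly conservative.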

For example, the cross section for a Virtex II memory cell is ~2.283E-14 for neutrons, and is ~8E-8 for a heavy ion. These are from recent tests with neutrons and with heavy ions (Xilinx Radiation Effects Consortium).

Sort of like a grain of sand vs. a locomotive engine.

This is a good analogy: if you are hit by a train, what do you do? If you are hit by a grain of sand, what do you do?
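For scale, the ratio of the two cross sections quoted above can be worked out directly. The post gives no units, so the sketch only assumes both numbers are in the same units; the variable names are illustrative.

```python
import math

neutron_xsec = 2.283e-14   # Virtex II memory cell, neutrons (units as in the post)
heavy_ion_xsec = 8e-8      # Virtex II memory cell, heavy ion (same units assumed)

ratio = heavy_ion_xsec / neutron_xsec
orders_of_magnitude = math.log10(ratio)

print(f"heavy ion / neutron cross section: {ratio:.2e}")
print(f"about {orders_of_magnitude:.1f} orders of magnitude")
```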

Compare our Xilinx Virtex II FPGA with a popular uP: (for SEFI)

formatting link

with up to a few SEFI per day (worst case), to one SEFI a year (best case).

So the next time you see the "blue screen of death" on your laptop computer, was that a SEFI, or was it Microsquat?

Austin