Praveen,
Well, the problem is that a SEFI might hit the line which controls the "clean-out" (zeroization of all config and BRAM).
If that happens, then basically, the upset has caused the device to re-initialize (or just go stupid). There is no way to prevent this from happening. We are researching how to design so this cannot occur, but this is a very tough problem. An event can strike any transistor.
uPs, ASSPs, and ASICs also have SEFI behavior. And yes, theirs is incredibly rare as well. FPGAs are similar in SEFI behavior to all the other devices. Maybe better. I haven't seen the SEFI x-section for a Pentium chip.
Yes, these SEFI cross sections are incredibly small, and the probability is also small that this happens, but presuming it does happen, there is really nothing at all that you can do (except detect that it happened, and reconfigure the device from scratch).
Systems that have to be hardened against SEFI will use a CPLD, or other device, to detect that a SEFI occurred, and reconfigure the FPGA. In the time it takes to recognize the SEFI, and reconfigure, all data is lost (unless the FPGA is part of a redundant system, which is commonly used for critical applications).
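To make the monitor idea concrete, here is a minimal sketch in Python of the kind of logic such an external device might implement. It is a toy model, not any particular design: the heartbeat signal, the `SefiSupervisor` class, the `timeout_ticks` parameter, and the `reconfigure` callback are all hypothetical names picked for illustration. The assumption is that the FPGA toggles a heartbeat while healthy, and the monitor forces a full reconfiguration after a set number of silent ticks.

```python
class SefiSupervisor:
    """Toy model of an external monitor (e.g. a CPLD) watching an FPGA heartbeat."""

    def __init__(self, timeout_ticks, reconfigure):
        self.timeout_ticks = timeout_ticks  # hypothetical: silent ticks tolerated
        self.reconfigure = reconfigure      # hypothetical: reloads the bitstream
        self.silent_ticks = 0
        self.reconfig_count = 0

    def tick(self, heartbeat_seen):
        """Call once per monitor tick; return True if a reconfiguration was forced."""
        if heartbeat_seen:
            self.silent_ticks = 0
            return False
        self.silent_ticks += 1
        if self.silent_ticks >= self.timeout_ticks:
            # The FPGA has been silent too long: assume a SEFI and start over.
            self.reconfigure()
            self.reconfig_count += 1
            self.silent_ticks = 0
            return True
        return False


# Usage: the heartbeat disappears for three ticks, then returns after reconfiguration.
events = []
sup = SefiSupervisor(timeout_ticks=3, reconfigure=lambda: events.append("reconfigure"))
trace = [True, True, False, False, False, True]
fired = [sup.tick(h) for h in trace]
print(fired)   # [False, False, False, False, True, False]
print(events)  # ['reconfigure']
```

The silent ticks before the timeout, plus the reconfiguration time itself, are the window in which data is lost.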
In order of increasing robustness:
- no measures taken (susceptible to SEU, and SEFI): the vast majority of all FPGA applications fall into this category, as do ASIC, ASSP, and uP.
- use TMR on the user pattern to remove the effects of SEUs completely, still susceptible to SEFI: a step that gets rid of SEU effects. It is also used in some special ASICs for mil/aero.
- scrub the config memory (continually reload the config while operating): used by many space probes, still susceptible to SEU and SEFI, but recovers very quickly, and SEUs do not accumulate, leading to an overall availability improvement. This is what the Mars Lander Pyro controller did. The landers themselves just reconfigure once a day (enough to mitigate the effects they anticipated).
- scrub and use TMR: now we only have SEFI to worry about. The best choice for getting to the level of reliability where the only thing that can be of any trouble is a SEFI. Good enough for just about anything except where human life is concerned.
- read back the config and fix the bits that flipped (use of V4 FRAME_ECC): similar to scrubbing, but faster and with less hardware. Same exposure as the non-TMR scrubbing case above.
- read back and fix config for a TMR design: only SEFIs to worry about. Good for just about anything except where human life is concerned.
- monitor the device with another device (e.g. a CPLD) for SEFI, reconfigure if a SEFI occurs: used in critical space and avionics. May also be doing TMR, scrubbing, etc. as well. This is still not good enough for a human life situation unless the time to recover is fast enough not to matter.
- provide one other FPGA, dual redundant: dual redundancy allows for transfer away from a fault, and is used for even higher availability (each individual unit may be scrubbing, using TMR, etc., and there may also be a "voter" to switch between FPGAs in case of SEFI). Almost the highest level of availability, in that we still don't trust even this arrangement for human life situations. It may get used in military systems where the probability of death is much, much higher than the probability of a system failure, so added system availability is not needed (a real tough decision, one I gladly don't have to make).
- fully duplicated, dual redundant: used by things like commercial airliners, and airports. Two redundant systems that can be selected manually by the air traffic controllers or pilots in the unlikely event that one of the redundant systems fails. Within each redundant system, various levels of protection may, or may not, be necessary, since the entire system is duplicated. A system with no scrubbing of the FPGAs, but with many self-checks that are done independently of the FPGA, is in fact used in all US and Canadian airports for all communications between the ground and air, and ground and ground. I designed it. If one redundant unit detects a failure in itself, or the unit it is paired with detects that its partner has gone stupid, the healthy unit switches itself in, and the faulty one out, in less than 50 ms. If the air traffic controllers can't talk to the airplanes for some reason, they have a manual switch they push to transfer everything over to another set of com links, radios, antennas, etc.
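For the TMR and scrubbing entries in the list above, a small Python sketch may help show why the two are usually combined. This is only a toy model under stated assumptions: the three "copies" are plain lists of words standing in for triplicated logic, and `majority` and `scrub` are hypothetical helper names, not anything from a vendor tool. In a real device the voters live in the fabric and the upsets land in configuration bits, but the principle of 2-of-3 voting, and of repairing upsets before a second one lands in the same place, is the same.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def scrub(config, golden):
    """Rewrite config from the golden copy; return the number of words repaired."""
    repaired = sum(1 for w, g in zip(config, golden) if w != g)
    config[:] = golden
    return repaired


golden = [0b1010, 0b1100, 0b0110]
copies = [list(golden) for _ in range(3)]  # three redundant copies (TMR)

# One upset in one copy: the voter masks it.
copies[1][0] ^= 0b0100
voted = [majority(*words) for words in zip(*copies)]
print(voted == golden)  # True

# A second upset on the same bit in another copy, before any scrub: the vote is now wrong.
copies[2][0] ^= 0b0100
voted_bad = [majority(*words) for words in zip(*copies)]
print(voted_bad[0] == golden[0])  # False

# Scrubbing restores every copy, so upsets do not accumulate.
repaired = [scrub(c, golden) for c in copies]
print(repaired)  # [0, 1, 1]
```

Note that none of this helps with a SEFI: if the whole device re-initializes, all three copies go with it, which is why the monitoring and redundancy entries sit further down the list.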
Austin