Fault Tolerant FPGA design


Prad 21 years ago

Hi all, I have been searching the web for information on fault-tolerant FPGA design for a real-time embedded system. As I understand it, transient errors are mostly handled by TMR, which adds a lot of overhead, or by other methods such as "direct-load", which doesn't handle multiple faults.

Can anybody please help me with information regarding what is currently being done about this?

Also, regarding permanent faults, my guess is that we would need a redundant FPGA for enhanced reliability, because other methods, such as partial reconfiguration that reroutes around the faulty elements, assume that plenty of free elements are available.

Any information would be a great help to me!

Thanks, Venkata Paruchuri


sam 21 years ago

For transient faults, yes, TMR is the most reliable method. If you are using a Xilinx FPGA, there is an entire application note dedicated to the different ways of implementing TMR. Redundancy is required to make a system fault tolerant, and the reliability of the system depends on how much redundancy you are willing to introduce. For example, if you want to tolerate double faults, use NMR (N


Austin Lesea 21 years ago

Sam,

Go to the IRoC website.

They describe redundancy in space (like TMR), redundancy in time (for example, process the problem three times, each after a scrub or reload of the configuration, and vote on the results), or parallel state and data path error checking (i.e., calculate a parity or CRC for all state machines and data paths, pass it along, and check it at each stage).
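As a rough software model of the first two techniques (plain Python, not HDL, and not taken from the IRoC material), the sketch below votes bitwise across three copies of a word. The names `majority_vote` and `run_three_times` are hypothetical, and the `scrub` callback is only a placeholder for whatever scrub or reload step a real system performs between passes.

```python
from typing import Callable


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def run_three_times(compute: Callable[[int], int], x: int,
                    scrub: Callable[[], None] = lambda: None) -> int:
    """Redundancy in time: scrub, compute, repeat three times, then vote."""
    results = []
    for _ in range(3):
        scrub()                      # placeholder for a scrub or configuration reload
        results.append(compute(x))
    return majority_vote(*results)


# Redundancy in space: three copies of the same 8-bit value, one with a flipped bit.
good = 0b1011_0010
upset = good ^ 0b0000_1000           # single-bit upset in one copy
voted = majority_vote(good, upset, good)   # the upset bit is outvoted
```

If the same bit is wrong in two of the three copies, the vote returns the wrong bit, so tolerating that case needs more than three copies.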

These are three common techniques used for highly reliable designs (like space control systems, aircraft, robots, etc.)

We see these techniques (or a mixture of them) now being used in FPGAs due to the concerns with soft errors, as well as due to errors caused by other means (signal integrity, jitter, etc.) having nothing to do with FPGAs.

For example, a packet processing system has an entering CRC and an exiting CRC. They should be the same if the packet did not change. If they are different, throw the packet out (do not acknowledge receipt); it will be resent to you.
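A minimal sketch of that entering/exiting check follows. It assumes CRC-32 as provided by Python's `zlib.crc32`; a real link would use whatever polynomial its protocol specifies. The function name `accept_packet` and the sample payload are made up for illustration.

```python
import zlib


def accept_packet(payload: bytes, entering_crc: int) -> bool:
    """Recompute the CRC on exit; True means the payload is unchanged."""
    exiting_crc = zlib.crc32(payload)
    return exiting_crc == entering_crc


payload = b"telemetry frame 0001"        # hypothetical packet contents
entering_crc = zlib.crc32(payload)       # computed where the packet enters

# Flip one bit to model an upset somewhere in the data path.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

ok_clean = accept_packet(payload, entering_crc)       # True: acknowledge
ok_corrupt = accept_packet(corrupted, entering_crc)   # False: drop, sender retries
```

The check only detects the error; recovery comes from the retransmission, which is why this approach is cheap compared with triplicating the data path.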

The Mars Rovers use scrubbing (reprogramming at an interval) to assure that the FPGA has no upset bits (to an acceptable rate). They did not have to use TMR to meet their goals. Given their goals are pretty tough to meet, why do you think you need TMR?

Generally speaking, not all of any design is critical, so not all of the design needs to be triplicated. Only the critical parts.

Most designs have startup logic, test logic, and performance monitoring logic that has nothing to do with the critical function.

We have seen ratios of 2:1 up to 8:1 for logic that is not critical compared to logic that is critical.
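To put rough numbers on those ratios, the sketch below estimates the relative logic area when only the critical part is triplicated. It assumes the ratio is noncritical:critical = r:1 and ignores the voters and any routing overhead, so treat it as a back-of-the-envelope figure; `selective_tmr_area` is a hypothetical helper.

```python
from fractions import Fraction


def selective_tmr_area(noncritical_per_critical: int) -> Fraction:
    """Relative logic area when only the critical logic is triplicated.

    Assumes noncritical:critical = r:1 and ignores voter overhead.
    """
    critical = Fraction(1, 1 + noncritical_per_critical)
    noncritical = 1 - critical
    return noncritical + 3 * critical


area_low = selective_tmr_area(2)     # 2:1 ratio -> 5/3 of the original area
area_high = selective_tmr_area(8)    # 8:1 ratio -> 11/9 of the original area
```

Under these assumptions the design grows to roughly 1.67x and 1.22x its original size, compared with 3x for triplicating everything.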

A careful study of your application may show that you need to triplicate much less than the entire design (or duplicate it, if all you need to know is that there was an error; fixing the error requires three copies).

Austin


sam 21 years ago

Austin,

I too agree that not all of the design is susceptible to SEUs (I have an entire paper on that).

But we also found that there is a loss of SEU immunity with decreasing redundancy.

Time redundancy is good, but it can't be used for real-time data, as detecting and correcting takes a lot of time. The inputs have to be stored in memory on the FPGA, which is risky unless that memory is rad-hard.

Scrubbing is a great feature, but it can be a problem when there is a SEFI. And again, scrubbing at regular intervals needs an uncorrupted source of data, which has to be stored separately in a rad-hard memory chip.
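The dependence on an uncorrupted source can be seen in a toy model of readback-and-compare scrubbing. This is only a software sketch: the lists stand in for configuration frames, `scrub` is a hypothetical function, and it does not model SEFIs or any vendor's configuration interface.

```python
from typing import List


def scrub(config: List[int], golden: List[int]) -> int:
    """Rewrite every frame that differs from the golden copy; return the count fixed."""
    if len(config) != len(golden):
        raise ValueError("configuration and golden image must be the same length")
    corrected = 0
    for i, reference in enumerate(golden):
        if config[i] != reference:
            config[i] = reference    # overwrite the upset frame
            corrected += 1
    return corrected


golden = [0xA5A5, 0x0F0F, 0x3C3C, 0xFFFF]   # hypothetical frames held in protected memory
config = list(golden)
config[2] ^= 0x0010                          # model a single upset bit in frame 2

fixed = scrub(config, golden)                # repairs the one upset frame
```

If `golden` itself were corrupted, the same loop would write the bad frame into the device, which is why the reference image needs protected storage.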

So, my view is that for 100% immunity (against single-bit errors), TMR is the best way.


Prad 21 years ago

Thanks all for your help and guidance.

I do not have any particular application in mind as of now; I am just trying to test fault tolerance in general and the extent of reliability that can be achieved. I want to study this and give some quantitative details after testing in practice (of course, in conjunction with a simulator to inject faults).

After reading your comments, I am under the impression that TMR is enough, at least for now; I mean, NMR wouldn't actually be needed (especially if the Mars Rover doesn't need it).

I am working towards fault tolerance of the overall system, and am beginning to think that immunity to transient faults must be provided at the component level, while immunity to permanent faults would be handled at the system level?

I am not quite sure how time redundancy works. Does it do all this with such low overhead that it can be neglected? My concern is that, as you are all aware, real-time embedded systems may have stringent timing requirements. In that case, can time redundancy keep pace with hardware redundancy? Or is it that, since both are done within the chip, there is not going to be a huge difference, considering that communication is what consumes most of the time?

Any views on permanent fault tolerance techniques?

Thanks once again, and would greatly appreciate any more resources like

Special thanks to Mr. Sam for that paper, and to Mr. Austin for the in-depth explanation. Things are clearer now.


Thomas Stanka 21 years ago

You could gain immunity to transient faults by changing the design and/or by choosing a proper component (e.g., Actel has Hi-Rel FPGAs with built-in TMR). Further, there are some transient faults (e.g., latch-up) that seem to me manageable only by choosing a proper device.

You could get immunity to some permanent errors by choosing a proper device; other permanent errors can only be overcome by changing the system. BTW, the first point of fault tolerance (maybe the least investigated) is a robust design that overcomes _any_ faulty input.

bye Thomas

Please send email replies to thomas[at]obige_domain (the domain above). Usenet_10 is reserved for viruses and spam.


sam 21 years ago


Just a caution: these FPGAs are not reprogrammable.
