Exact time-to-failure data for FPGA devices

Hi All,

I was wondering if somebody knows where to find exact time-to-failure data for FPGA devices. I checked the reliability data available in both Xilinx and Actel Reliability reports and they both report failure rates in FITs, from which one can calculate the MTBF.
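(The FIT-to-MTBF conversion mentioned here is a one-liner; a minimal Python sketch, where the function name and the 25 FIT figure are purely illustrative and not from any vendor report:)

```python
def mtbf_hours(fit: float) -> float:
    """FIT is failures per 1e9 device-hours, so the MTBF in hours
    is just the reciprocal of the FIT rate scaled by 1e9."""
    return 1e9 / fit

# Illustrative only: a hypothetical 25 FIT device
print(mtbf_hours(25.0))  # 40000000.0 hours, roughly 4,500 years
```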

What I'm looking for is accelerated life testing data, and specifically High Temperature Operating Life (HTOL) test data expressed as exact failure times, not in FITs as given in the reliability reports mentioned above. Well, I'm not sure this kind of data exists in the first place, but no harm in asking :))

Thanks,

Amr

Reply to
Amr Ahmadain

Amr, there is no exact time, only a statistical probability. ICs do not suffer from a clearly defined wear-out mechanism that would allow us to predict the end of life.

There is a wide variety of factors that affect reliability, and there is the Arrhenius model that describes the dependence on temperature, but all predictions are statistical. And at "normal" temperatures, ICs live a very long life, 20 to 100 years or more. Most failures are the result of overstress, mostly in the I/O.

Peter Alfke, Xilinx
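(The Arrhenius model mentioned here can be made concrete. A hedged sketch follows; the 0.7 eV activation energy and the temperatures are illustrative assumptions only, since the real values depend on the failure mechanism being accelerated:)

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def thermal_af(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Arrhenius acceleration factor between a use temperature and a
    stress temperature (both given in Celsius, converted to Kelvin)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1 / t_use_k - 1 / t_stress_k))

# Illustrative: Ea = 0.7 eV, 55 C use vs. 125 C stress gives an
# acceleration factor in the high tens
af = thermal_af(0.7, 55.0, 125.0)
```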

Reply to
Peter Alfke

Amr,

As Peter says, there is no "wear out" mechanism at work, so we are talking about a latent defect that finally results in a junction failure, or a gate rupture, or a mechanical failure of the connections (solder bump cracking, package via opening, printed circuit board wiring breaking).

The HTOL data is available from our Reliability group through our FAEs to users who wish to examine it (under NDA).

Basically, here you will see the equations, the tests at elevated temperatures, and any failures that have occurred, and what and where they were (and what changes were made to improve the product as a result).

For example, I have ~6,000 devices running 24x7 for the Rosetta experiments, and I have not had a single failure in ~300 gigabit-years (take the number of bits in a device, multiply by years, and divide by 1E9 to get Gb-yrs). So normal operation will never yield a FIT rate (it just takes too long for parts to break).
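(The gigabit-years recipe in the parenthesis above can be written out directly. The 25 Mb per-device figure below is a purely hypothetical number chosen for illustration, not the actual bit count of these devices:)

```python
def gigabit_years(bits_per_device: float, devices: int, years: float) -> float:
    """Cumulative gigabit-years: bits per device times device count
    times years of operation, divided by 1e9."""
    return bits_per_device * devices * years / 1e9

# Purely illustrative: 25 Mb per device, 6,000 devices, 2 years
print(gigabit_years(25e6, 6000, 2.0))  # 300.0
```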

Another way of looking at this is that those 6,000 devices have run for an average of two years each. That makes 12,000 device-years. If one failed right now, that would be 9.5 FIT. So, under normal conditions, the failure rate is much less than 9.5 FIT.
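(Checking that arithmetic: one failure over 12,000 device-years does come out at about 9.5 FIT. A minimal sketch, with a function name of my own choosing:)

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days

def fit_from_field(failures: int, device_years: float) -> float:
    """FIT = failures per 1e9 cumulative device-hours of operation."""
    return failures * 1e9 / (device_years * HOURS_PER_YEAR)

# One hypothetical failure over 12,000 device-years
print(round(fit_from_field(1, 12_000), 1))  # 9.5
```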

To get a statistically significant number, the HTOL conditions place the device under stresses it should never see in service. This totally arbitrary test method is the industry standard, and it doesn't predict the real failure rate well at all. It will indicate if a part is in trouble, however: a part with a weakness will fail the HTOL qualification miserably.

Austin

Reply to
austin

Hi Austin,

If I understand you correctly, the FIT numbers given in reliability reports are not actually representative of anything useful; they are just provided to comply with the industry standard.

One more thing: according to the Arrhenius relationship and the way the thermal acceleration factor is calculated, there is only one stress junction temperature for testing the device. In other words, the devices are exposed to constant stress loading, as opposed to step or random stress. Is that right?

One last issue: is the voltage acceleration factor used in calculating the FIT numbers given in the Xilinx reliability report, or is it just the thermal acceleration factor? I'm not sure you are allowed to answer this last question :)
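(For context: qualification math often multiplies a thermal factor by a voltage factor, and one common form for the latter is exponential in the overvoltage. A sketch under assumed parameters; gamma, the voltages, and Ea below are all illustrative, and this is not claimed to be Xilinx's actual model:)

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def thermal_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius thermal acceleration factor (temperatures in Celsius)."""
    return math.exp((ea_ev / BOLTZMANN_EV)
                    * (1 / (t_use_c + 273.15) - 1 / (t_stress_c + 273.15)))

def voltage_af(gamma_per_v, v_use, v_stress):
    """Exponential voltage-acceleration model: AF = exp(gamma * dV).
    The gamma parameter is empirical and process-dependent."""
    return math.exp(gamma_per_v * (v_stress - v_use))

def combined_af(ea_ev, t_use_c, t_stress_c, gamma_per_v, v_use, v_stress):
    """Overall acceleration if both stresses are applied simultaneously."""
    return (thermal_af(ea_ev, t_use_c, t_stress_c)
            * voltage_af(gamma_per_v, v_use, v_stress))
```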

Thanks,

Amr

Reply to
Amr Ahmadain

Amr, what do you really want to know, and why? Your original question was very simplistic (I think you have used the word naïve before), but now you get into a fair amount of detail. What is really on your mind? Maybe we can answer a direct question more easily.

Peter Alfke, Xilinx

Reply to
Peter Alfke

Hi Peter,

I swear there is nothing on my mind. It's just that I'm a little bit confused and I'm trying to clear things up.

There is a lot of literature, books, publications and reports on testing techniques and mechanisms and I'm trying to figure out how the entire testing process is done from an industrial perspective.

I'm a 4th-year PhD student at the University of Cincinnati, and FPGA reliability is just one part of my PhD dissertation. So yes, there is a rough idea in my mind, but just for academic purposes. I'm still working on it, and it all has to do with predicting reliability for FPGA devices. You could call it a new prediction method. But I'm not yet sure of its feasibility, so I'm just trying to clear things up. That's all.

I'm not trying by any means to pull information out of anybody by asking indirect questions. Maybe my questions gave that impression, but there is no hidden agenda behind them, and the confusion is my fault :)

Thanks a lot.

Amr

Reply to
Amr Ahmadain

Amr, thanks for the clarification.

High temperature, high voltage and excessive currents are the predominant reasons for semiconductors to fail. Nuclear radiation may be an exotic source of failure. Inside an FPGA, there is little chance of ever encountering these phenomena, unless you run the whole chip at very high temperature or high voltage.

On the other hand, when circuits are this good and long-lasting, it gets increasingly difficult, time-consuming, and expensive to prove how good they are. Austin mentioned some of these difficulties. Commercial users are really not so much interested in whether a particular FPGA lasts 50 or 100 or 150 years. They know that the equipment will be scrapped as obsolete long before that. They are interested in the "midlife mortality": how many out of 100,000 ICs fail during the nth year of life of the equipment, where n is between 2 and 20. That's why they are looking for, and are satisfied with, statistical reliability data.

Peter Alfke, from home.

Reply to
Peter Alfke
