Virtex4 burn-in failure

Hello,

I've observed a failure of one of our Virtex4-based products that is a bit puzzling. The device had been in operation for several weeks before failing, and is among a run of a couple hundred. I've dissected the specimen, and the Virtex4 appears to draw a large amount of current and gets hot immediately upon power-up. It does not configure or show any signs of life.

Aside from the possibility of this being induced by a manufacturing defect that took its time to show up, can anyone think of any design-related causes of a failure like this?

Thanks in advance,

Mike.

Reply to
msn444

Do any of the power supplies EVER exceed the maximum allowed? You should carefully check power-on, operating, and power-off conditions with a high speed scope. This is the only thing that I've ever done that truly damaged an FPGA.
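If the scope can export the capture as time/voltage samples, the check can be scripted instead of eyeballed. This is only a rough sketch: the function name, the sample trace, and the 1.30 V limit are placeholders made up for illustration, so substitute the absolute maximum rating from the datasheet for the rail you are probing.

```python
from typing import Optional, Sequence, Tuple


def first_overvoltage(
    times_s: Sequence[float],
    volts: Sequence[float],
    v_abs_max: float,
) -> Optional[Tuple[float, float]]:
    """Return (time, voltage) of the first sample above v_abs_max, or None."""
    for t, v in zip(times_s, volts):
        if v > v_abs_max:
            return (t, v)
    return None


# Hypothetical power-on capture with a brief overshoot (values are made up).
trace_t = [0.0, 1e-6, 2e-6, 3e-6, 4e-6]
trace_v = [0.0, 0.60, 1.35, 1.22, 1.20]

# 1.30 V is a placeholder limit, not a datasheet figure.
print(first_overvoltage(trace_t, trace_v, v_abs_max=1.30))
```

Run it separately on the power-on, steady-state, and power-off captures, since a short spike can hide in any of the three.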

Another (less likely) possibility is overcurrent on inputs due to driving them above and/or below the supplies.

What about over-temperature? However, it would have to get effing hot (>125C ?).

I can't think of anything else that would bother the thing.

Bob

Reply to
BobW

I've had issues in the past with this very thing happening on a V2PRO part. It was due to the power supplies not coming up in the right sequence, or not being monotonic in their ramp-up, which caused the FPGA to get hot and draw enough current to drop my low-voltage supply.
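For what it's worth, a non-monotonic ramp is easy to flag from an exported scope capture. The sketch below is only an illustration: the function name, the tolerance, and the sample values are all made up, and it simply reports samples that fall more than a noise margin below the highest voltage seen so far.

```python
from typing import List, Sequence


def ramp_dips(volts: Sequence[float], tolerance_v: float = 0.0) -> List[int]:
    """Return indices of samples that dip below the running peak by more than tolerance_v."""
    dips: List[int] = []
    peak = float("-inf")
    for i, v in enumerate(volts):
        if v < peak - tolerance_v:
            dips.append(i)  # the rail sagged during ramp-up
        peak = max(peak, v)
    return dips


# Hypothetical ramp that sags at sample 3 before recovering (values are made up).
ramp = [0.0, 0.4, 0.8, 0.7, 1.0, 1.2]
print(ramp_dips(ramp, tolerance_v=0.05))
```

Set tolerance_v to roughly the noise floor of the probe setup, otherwise ordinary ripple will show up as dips.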

One way to test if this is your problem would be to hold off configuration (I usually design in a push-button on the PROG_B line) until all the supplies are up and stable. If the problem goes away, you can be fairly confident that it is a power supply ramp-up/sequencing issue.

Reply to
Rob

Probably not related to your failure, but there was a curious erratum for V4 where >unused< MGTs would fail after a few hundred hours of device uptime. The workaround was a dummy MGT driver. I always wondered what the failure mechanism was there!

--
Ben Jackson AD7GD

http://www.ben.com/
Reply to
Ben Jackson

Ben,

No mystery: NBTI. Unused MGT front-end PMOS devices in the differential amplifier circuits could see a significant Vt shift if they were not transitioning. With one input high and one low, NBTI occurs in the PMOS devices, and it is made even worse if the temperature is also high (e.g. 70 to 85C or hotter). The DCM delay line was also susceptible to NBTI shift, hence the "auto-cal" block being added by the software (to keep the delay lines busy switching at a low frequency).

Later devices perform these functions by hardware, or design techniques to mitigate the shift are used (no longer an issue after V4).

Although the NBTI shift may be demonstrated in a lab, there has never been a case of a field failure for either the MGT or the DCM due to NBTI. It seems the condition is created by such a specific sequence of temperatures and static voltages that, unless you are unlucky enough to duplicate it, all the PMOS devices shift together and everything is just fine.

NBTI starts out quick, then slows down. A bake without power restores the levels a lot. Just turning things on and off can mitigate any issues. Very tricky stuff, but once understood, it can be dealt with easily.

NBTI has been known for over thirty years, and has been understood and dealt with by the IO designers for a long time. What was a surprise is that the MGT front-end design (and the DCM delay line) used thinner-oxide devices in V4, and the designers didn't expect to see the shift. Foundry practices also helped tune down the effects.

This particular "melt-down" scenario is unrelated to NBTI.

Common causes: shorts in the package/pcb/solder balls, over-temp of the die (caused by inadequate heatsinking), large over voltage (on core, io, or aux -- causes junction or gate breakdown, this may be power supply, or ESD).

Xilinx will issue an RMA (return merchandise authorization) and try to find the cause of failure. However, this is not taken lightly: we request that the customer remove the device using very specific methods, so that we can establish what caused the failure (often customers remove the device, destroying it in the process).

An RMA is also something that takes time, and just one failure is not considered a reason to go to all the trouble.

Any device returned without authorization is not accepted.

Austin

Reply to
austin

Austin, just out of curiosity, what is "NBTI"?

-Dave Pollum

Reply to
Dave Pollum

I haven't seen any sign of the power supply voltages going over the maximum, but I'll take a closer look. Pretty sure the device isn't going over 60C in normal operation, let alone 125C, so I don't think that's it...

Thanks for the suggestions... Mike.

Reply to
msn444

google is your friend:

formatting link
top 4 items.

Reply to
mk

Hi Austin,

Do Xilinx publish overall figures for RMA devices, e.g. what fraction are caused by ESD, no fault found, overvoltage, etc?

Thanks, Allan.

Reply to
Allan Herriman

Dave,

formatting link

Austin

Reply to
austin

Allan,

There is the quarterly quality report which lists the hard fail rates for each family. This is all causes.

formatting link
and
formatting link

There are separate sections for process, and ESD and Latch-Up.

Austin

Reply to
austin
