Ben,
No mystery: NBTI. The unused MGT front end pmos devices in the differential amplifier circuits could see a significant Vt shift if they were not transitioning. With one input held high and the other low, NBTI occurs in the pmos devices, and it is made even worse if the temperature is also high (e.g. 70 to 85C or hotter). The DCM delay line was also susceptible to NBTI shift, hence the "auto-cal" block being added by the software (to keep the delay lines busy switching at a low frequency).
Later devices perform these functions in hardware, or use design techniques that mitigate the shift (it is no longer an issue after V4).
Although the NBTI shift may be demonstrated in a lab, there has never been a case of a field failure for either the MGT or the DCM due to NBTI. It seems the condition is created by such a specific sequence of temperatures and static voltages that, unless you are unlucky enough to duplicate it, all the pmos devices shift together, and everything is just fine.
NBTI starts out quick, then slows down. A bake without power restores the levels a lot. Just turning things on and off can mitigate any issues. Very tricky stuff, but once understood, it can be dealt with easily.
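To put a rough shape on "starts out quick, then slows down": NBTI shift under constant stress is commonly modeled as a power law in time, delta_Vt = a * t**n with n well below 1. The sketch below only illustrates that shape. The function name and the values of a_mv and n are made up for the example; they are not measured Xilinx data, and the model ignores temperature and recovery.

```python
def nbti_shift_mv(t_hours, a_mv=10.0, n=0.2):
    """Illustrative power-law NBTI model: delta_Vt = a * t**n.

    a_mv and n are placeholder values chosen for the example,
    not characterization data for any real device.
    """
    return a_mv * t_hours ** n


if __name__ == "__main__":
    # Print the modeled shift at a few stress times (hours).
    for t in (1, 10, 100, 1000, 10000):
        print(f"{t:>6} h: {nbti_shift_mv(t):6.1f} mV")

    # The shift accumulated in the first 100 hours is much larger
    # than what is added during the next 100 hours.
    first = nbti_shift_mv(100) - nbti_shift_mv(0)
    second = nbti_shift_mv(200) - nbti_shift_mv(100)
    print(f"first 100 h: {first:.1f} mV, next 100 h: {second:.1f} mV")
```

With any exponent below 1 the rate of shift keeps dropping, which is why most of the movement shows up early in the stress period.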
NBTI has been known for over thirty years, and has been understood and dealt with by the IO designers for a long time. What was a surprise is that the MGT front end design (and the DCM delay line) used thinner oxide devices in V4, and the designers didn't expect to see the shift. Foundry practices also helped tune down the effects.
This particular "melt-down" scenario is unrelated to NBTI.
Common causes: shorts in the package/pcb/solder balls, over-temp of the die (caused by inadequate heatsinking), or a large over-voltage (on core, io, or aux) that causes junction or gate breakdown; the source may be the power supply or ESD.
Xilinx will issue an RMA (return merchandise authorization) and try to find the cause of failure. However, this is not taken lightly: we request that the customer remove the device using very specific methods, so that we can establish what caused the failure (often customers remove the device, destroying it in the process).
An RMA also takes time, and just one failure is not considered a reason to go to all the trouble.
Any device returned without authorization is not accepted.
Austin