XC3000 non-recoverable lockup problem

lecroy7200 · 2005-03-11T15:57:20+00:00

I am looking at a design that was done several years back which uses several XC3000 devices. All devices are programmed with the same core from a computer on power up. What I am seeing is that once or twice a year, one of the devices will enter a non-recoverable mode. In this mode, the device appears to be in power down or reset. Once in this mode I am no longer able to program the device. Pulling the XC3000's reset low for 10us has no effect. The problem appears random. The only way to reprogram the part is to power down the IC. I have tried running tests where I just reprogram the device, overheat the device, change the supply voltages, etc. and can't reproduce the problem. When the Xilinx device is in this mode, it draws little power. It can be held in this mode for what appears an infinite amount of time and causes no damage to the device.

Was there some kind of an undocumented test mode built into the XC3000 that I may be seeing? Does anyone else ever remember seeing a problem like this?

lecroy7200 21 years ago

Yes, it's real negative ECL. 100H603s are used for the translation.

This is interesting. During the power supply design, a lot of pain went into the sequencing. A few of the parts have problems when the power is not brought up correctly. The problem so far seems to have happened not during power cycles of the instruments, but while the instruments are in their normal run mode.

It's a good point.

Brian Davis 21 years ago

In attempting to reproduce the problem, are you testing in an actual system with all boards and I/O connected in the usual configuration, or in the lab with just the problematic board on a bench supply?

Does that result in any symptoms like your FPGA config problem, or is everything so locked up that you can't tell ?

Without the power-up sequencing, do you get destructive failures?

If not a destructive failure mode, can you briefly disconnect (not clamp) the +5.0 V supply to the board, with -5.2V present, and see what level the +5.0V plane on the board goes to?

Or, intentionally disable/break the supply sequencer and power cycle.

Do you have an AC disturbance/brownout generator to see how the supply sequencer for the +5.0 and -5.2 rails behaves during a brief AC dropout ?

Is one supply rail much more heavily loaded than the other, so it would dip faster on an AC brownout?

Hmmm, that also rings some faint old warning bells. I seem to recall having problems once after I redesigned a card to replace some obsolete translators with either the '602 or '603. Symptoms were field returns with either a failed bit or two on the '60x, or occasionally a part that looked like it had undergone some sort of latchup/runaway and self destructed.

Brian

lecroy7200 21 years ago

Yesterday was another loss.

All of the testing is being done with the real hardware.

No, it could result in device failure. But not on this board.

Based on textbook data, yes. Have I seen a problem, no. Again, a lot of care was taken to ensure this could not happen.

I did several tests with the supplies yesterday. No damage to the board it would appear, but I also was not able to reproduce the problem.

I tried drop out testing on all supplies in every combination. The ECL seemed to have little problem with the testing I did.

Yes we do, and again we have done a lot of testing like this and have not seen any problems like the one I am describing.

I am VERY surprised that I have not seen a destructive failure with all the testing I have done.

If I were to make a guess, it's like the Xilinx device goes into sleep mode somehow and won't wake back up. But again, while testing the power down pin I was not able to replicate the problem.

No word from Xilinx yet.

Jim Granville 21 years ago

Have you done aggressive field (impulse & RF burst) testing yet ? This would be on as near a real field-install as practical.

Do you have a stats map, of the failure count/installed unit/site/time ? [ ie these are truly random failures ?]

When all else fails, you can always blame alpha particles? (See this highly selective test; the link was not preserved.)

-jg

lecroy7200 21 years ago

Again, yes.

unit/site/time ?

Again, yes they are random.

No, this is not an option.

After running an automated test for the last 24 hours, finally one part on my test unit failed. Because the software would try and program the devices after a fixed amount of time in order to detect the fault, I am not sure if the pins were in this state at the time of the failure. However, looking at the control pins, HDC is high, LDC is low, done/pgm' is low and init is low.

I increased the reset time to over a second with no luck. I tried reloading the device over 50 times with no luck. This is what we are seeing. I am not sure why it entered this mode and if it had anything to do with my testing, or was just its time.

I have left the device in this state in case anyone has ideas for further tests.

lecroy7200 21 years ago

After waiting an hour for Xilinx to return my phone call, I decided to attempt further tests. From previous failures it seemed that the supply had to be turned off for a long time in order to recover from this failure. Using the dual FET test setup I started pulsing the power to the FPGAs. Using a scope which was connected to the device, I started out pulsing the power off for 10us. I then tried to program the device 5 times. The device was in the same state, so I increased the off time and repeated the process. At 1ms (0.001 seconds) the part would still not recover. Thinking that the device may have actually been damaged, I decided to try a much longer power cycle of about 5 seconds and sure enough, the part came back to life.

Of course, none of this helps me. I have started another test where I do not attempt to reprogram the devices. This will allow me to determine the state of the control pins prior to attempting the reload.

I am starting to suspect that the device's internal oscillator has a problem. There is very little information about it. There appears to be no way to detect its failure. Has anyone seen any detailed information about how the oscillator was designed? Was the same design used for all Xilinx devices? I am thinking of trying to detect the internal oscillator with a spectrum analyzer.

Jim Granville 21 years ago

You could measure Icc, to try and gauge how much of the Clock tree is operational, and compare that with a device that is held in config-ready state ? Also see if some of the simplest pin-pin paths are still 'alive'. [ but, of course, be careful not to remove the power :) ]

Your description thus far sounds like it is flipping back into part-way through a config cycle, but in such a way re-config cannot shake it loose.

-jg

Jim Granville 21 years ago

It does confirm that there is an internal RC style Reset block, that has a recovery somewhere between 1ms and 5 sec.

It could be useful to narrow that Trec (and also Vrec) down more. For example, I have seen devices that state "Vcc must reduce to less than 0.2V for POR to operate correctly".
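One way to narrow Trec is an ascending, log-spaced sweep of off-times between the two values already reported (1 ms did not recover, about 5 s did). The sketch below is only an illustration: `power_off` and `try_program` are hypothetical callbacks standing in for whatever the test fixture provides, and the step density is an arbitrary choice. The sweep only goes upward because a part that has recovered is no longer in the locked state, so a bisection that steps back down is not possible on a single failure.

```python
import math

def off_time_steps(t_min=1e-3, t_max=5.0, steps_per_decade=4):
    """Return log-spaced off-times (seconds) from t_min to t_max inclusive."""
    n = math.ceil(math.log10(t_max / t_min) * steps_per_decade)
    return [t_min * (t_max / t_min) ** (i / n) for i in range(n + 1)]

def bracket_recovery(power_off, try_program, times):
    """Walk the off-times upward until the part programs again.

    power_off(t)  -- hypothetical: remove Vcc for t seconds, then restore it
    try_program() -- hypothetical: attempt a configuration, return True on success

    Returns (longest off-time that failed, first off-time that worked).
    Either element is None if no such step was seen.
    """
    last_fail = None
    for t in times:
        power_off(t)
        if try_program():
            return last_fail, t
        last_fail = t
    return last_fail, None
```

The returned pair brackets Trec for that one failure event. The same loop could be repeated with a fixed off-time and a stepped residual Vcc to estimate Vrec, if the fixture can hold the rail at a set level.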

I have started another test where I

good idea.

-jg

Philip Freidin 21 years ago

Great.

As I have read further, you no longer have it in this state. Assuming you get it into this failed state again, please try the following 2 things:

1) supply a continuous CCLK, with DIN alternating 0 and 1. Please tell us what you see on the DOUT pin.

2) Supply a continuous CCLK for at least 2**24 cycles (roughly 16.8 million cycles) plus a few extra 100000 cycles. At 1 MHz, this will take about 17 seconds. Set DIN to 1 for all of this. Please report any changes you observe on Done/Prog, HDC, LDC, and DOUT. There is a "fault" mode where, if it gets some unexpected clock cycles at the beginning, or the header is corrupted, it can take about 16000000 clocks for it to get back to the beginning of the state sequence.
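For planning test 2, the run time follows directly from the cycle count and the CCLK rate. A minimal sketch, assuming 300000 as a stand-in for "a few extra 100000 cycles" (that margin is my assumption, not a documented figure):

```python
FULL_COUNT = 2 ** 24        # 16,777,216 cycles for the counter to wrap
EXTRA_CYCLES = 300_000      # assumed margin for "a few extra 100000 cycles"

def seconds_needed(cclk_hz, cycles=FULL_COUNT + EXTRA_CYCLES):
    """Time in seconds to clock `cycles` CCLK periods at `cclk_hz`."""
    return cycles / cclk_hz

if __name__ == "__main__":
    for f_hz in (0.5e6, 1.0e6, 2.0e6):
        print(f"{f_hz / 1e6:.1f} MHz -> {seconds_needed(f_hz):.1f} s")
```

At 1 MHz this comes to a little over 17 seconds, so a test that only clocks the part for "a few seconds" would not cover the full count.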

I am suspect>I think there is a combined ~prog and done pin. It's pulled low

Keep working on it, Philip

Philip Freidin Fliptronics

Brian Davis 21 years ago

If you ever are able to reproduce the problem at will, probably the easiest way to check the health of the internal CCLK is to place the part in master serial mode before initial powerup, so you can observe the behavior of the CCLK output under the lockup-generating conditions.

Unfortunately, my earlier suggesti>

good luck, Brian

lecroy7200 21 years ago

I don't know if the problem would show up in master mode. I don't want to introduce any other variables into the system. I know it has this problem with the current configuration.

I had tried to send out a few seconds of normal clock cycles once the part was locked and was not able to get it to recover. But, again if the device's internal clock was dead, then the device would not be able to sample the Reset state.

After playing with various near field probe designs, I now have one that appears to pick up the 1MHz internal clock. The probe hooks directly to a 20dB amplifier and off to the spectrum analyzer. I am not working in a screen room right now but the signal does appear to be from the FPGA and not a local radio station. It is very close (within 50KHz) to the 1MHz that is called out.

I have run three more days of tests and was not able to get the part into the strange mode.

I am curious if Xilinx has had troubles with their internal oscillators in the past. The newer parts are programmable where this part is not (fixed clock rate). So some changes were made to the design.

I will post again once I can see if it is the clock or not.

lecroy7200 21 years ago

A second failure took place. I reset all of the ICs, disabled the cards master clock and left all of the FPGAs in the unprogrammed state. Looking around I was not able to tell if the 1MHz signal was present or not. It is so far down in the noise floor that it is virtually undetectable.

I decided to start looking at wider BWs. It appears that the internal clock is not 1MHz, but much higher. Doing a sweep from 500KHz to 50MHz and comparing the peaks, the IC that is in the strange state is missing a peak at around 16-17MHz.
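The peak comparison can be done by hand, but a small helper makes it repeatable across several sweeps. This is only a sketch: it assumes the peak frequencies have already been read off the analyzer into two lists, and the tolerance is a guess that has to be wide enough to absorb analyzer resolution but narrow enough not to match unrelated signals.

```python
def missing_peaks(reference_hz, suspect_hz, tol_hz=1e6):
    """Return peaks seen on the reference part that have no counterpart
    within tol_hz on the suspect part."""
    return [
        f for f in reference_hz
        if not any(abs(f - g) <= tol_hz for g in suspect_hz)
    ]

if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements from this board.
    good_part = [10.0e6, 16.5e6, 40.0e6]
    locked_part = [10.1e6, 39.8e6]
    print(missing_peaks(good_part, locked_part, tol_hz=0.5e6))
```

Because the oscillator frequency differs from part to part, the reference sweep is best taken from the same device before it locks up, rather than from a neighbouring device.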

This signal changes from part to part, which I would expect for a sloppy oscillator. Again, the data sheets do not mention this. I will try and call Xilinx today and see if they can confirm that this is the internal clock.

lecroy7200 21 years ago

To further verify that the 16MHz is the internal clock I tried to change the temperature of the device to see how it affects the frequency, and indeed it does. Just what you would expect from an RC design. I am very confident that the oscillator is the problem.

I did some searching and came across an app note from 1997 that talks about the 1MHz clock on the 3000.

"The nominal frequency of this oscillator is 1 MHz with a max deviation of +25% to -10%. The clock frequency, therefore, is between 1.25 MHz and 0.5 MHz. In the XC4000 family, the 1-MHz clock is derived from an internal 8-MHz clock that also can be used as CCLK source."

I have provided Xilinx with the lot codes on these parts and I am guessing that at some point the oscillator was changed to 16MHz on the 3000.

I am trying more tests now to try and get other oscillators to fail.

Austin Lesea 21 years ago

lecroy,

The oscillator itself is at a much higher frequency, and is divided down to the number listed in the data sheet. At least, we still do it that way, even today.

The accuracy of this oscillator would be from 1/2 to 2X the nominal (it just isn't critical).
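As a sanity check on the 16-17 MHz peak reported earlier in the thread, one can ask which divide ratios would bring it into the 1/2 to 2X window around the nominal 1 MHz. The sketch below assumes a power-of-two divider, which is a guess on my part and not something the data sheet or this thread confirms.

```python
def plausible_dividers(f_osc_hz, f_nominal_hz=1e6, low=0.5, high=2.0, max_log2=8):
    """Return power-of-two divide ratios that put f_osc_hz inside
    [low * f_nominal_hz, high * f_nominal_hz]."""
    ratios = []
    for k in range(max_log2 + 1):
        divider = 2 ** k
        f_out = f_osc_hz / divider
        if low * f_nominal_hz <= f_out <= high * f_nominal_hz:
            ratios.append(divider)
    return ratios

if __name__ == "__main__":
    print(plausible_dividers(16.5e6))   # ratios consistent with a 16.5 MHz peak
```

For a 16.5 MHz peak both divide-by-16 and divide-by-32 land inside the window, so the measurement is consistent with a divided-down oscillator but does not by itself pin down the ratio.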

Since this part still had paper schematics (REALLY) it is far too old for us to go look at its design.

Phil is on the right track.

This part did have a brownout issue (if the voltage dropped just right, for just the right amount of time, and came back up) that would place it in a locked state that could not be recovered until the power was cycled.

I solved this problem 15 years ago by using a Dallas Semi Power on Reset part to reset the power supply if it detected a glitch.

The product was an optical multiplexer for then AT&T (and then Lucent).

We had sold more than 100K units in three years. I think you can still buy them even today.

They are used in some applications that are actually critical, so they went through an amazing battery of tests (for the audio radio channels at all US and Canadian Airports, for example).

Austin

Jim Granville 21 years ago

That freq makes more sense than 1MHz for the buried osc, as 1MHz is relatively slow, so needs more specialised die area - in the old process of the 3000, a ring osc will give 16-17MHz region. Dividers are simple.

If you need additional confirmation it is inside the FPGA, you could give the chip a squirt of freeze - ring osc's are temp dependent.

They are likely to gate the loader osc, to save power, so this may only confirm you have exited the first power-up load state, but are unable to get back into load state.

-jg

Jim Granville 21 years ago

Do you recall how low the Vcc had to cycle, in order to correctly recover ?

Sounds just like my power removal wdog.... :) How did you 'detect a glitch' - was that simply via Vcc lowering, or did that get an "I'm OK" signal from the FPGA ?

I have wondered why more regulator chips do not offer this type of 'wide hysteresis' in their operation.

-jg

Austin Lesea 21 years ago

Jim,

See below,

As I recall, it had to go below 150 mV to 300 mV to recover.

The POR IC had a settable threshold with an external resistive divider. It responded very quickly. I set it to the voltage range I knew I never wanted to be in. I think that was anything below 2.5V. For a 5V supply, I figured many bad things would happen if I went below 2.5V.
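The behaviour described here (trip on a dip below about 2.5 V, then keep the supply off until Vcc has fallen far enough for the part to recover) can be modelled as a small state machine. This is a sketch of the idea only, not the actual circuit: the class and state names are hypothetical, the 2.5 V trip and 0.15 V release values are taken from the recollections in this thread, and a real design would also need a start-up timeout.

```python
from enum import Enum

class State(Enum):
    STARTING = "starting"    # supply enabled, waiting for Vcc to rise past trip_v
    RUN = "run"              # supply enabled, watching for a dip
    FORCE_OFF = "force_off"  # supply disabled until Vcc decays below release_v

class WideHysteresisSupervisor:
    """Model of a power-cycling supervisor with wide hysteresis."""

    def __init__(self, trip_v=2.5, release_v=0.15):
        self.trip_v = trip_v
        self.release_v = release_v
        self.state = State.STARTING

    def update(self, vcc):
        """Feed one Vcc sample; return True if the supply should be enabled."""
        if self.state is State.STARTING:
            if vcc >= self.trip_v:
                self.state = State.RUN
        elif self.state is State.RUN:
            if vcc < self.trip_v:
                self.state = State.FORCE_OFF
        elif self.state is State.FORCE_OFF:
            if vcc < self.release_v:
                self.state = State.STARTING
        return self.state is not State.FORCE_OFF
```

Feeding it a dip such as 5.0, 2.0, 1.0, 0.3, 0.1 volts keeps the supply disabled from the 2.0 V sample until the rail reaches 0.1 V, which is the point of the wide hysteresis: a partial brownout is turned into a full power cycle.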

The problem is how do you tell? A band gap reference takes a lot of area, and is hard to make accurate in the really deep sub micron technologies. So if you can't measure more accurately than +/-5%, why bother?

Jim Granville 21 years ago

I did say regulator chip, not FPGA :). In the analog realm of regulators this is a no-brainer, all the support silicon is already there, it just needs a difference in the enable/disable details.

Regulators/reset generators on FPGA is another topic entirely... The best indicator of what is possible, are the MOSFET charge based Vref chips from Xicor (now Intersil), and the bigger embedded controllers, esp towards the Automotive area, where on chip regulators are more and more common.

Today's FPGAs are such power hogs that this is less practical, but on the 'zero power' CPLDs it makes sense to engineer it better than the present numbers.

-jg

lecroy7200 21 years ago

This is not what the data sheet states. The 4000 data sheet makes a distinction that it runs at 8MHz and divides down to the 1MHz where the 3000 is at 1MHz. I am not disagreeing with you. I believe that the 3000 was changed over time and the clock was part of these changes and now runs at around 16MHz. The documents were never updated to reflect this change because it was "transparent" to the end user. Of course this is all a guess on my part.

Agree, it just needs to work. Too bad it seems to have problems.

Funny, we can still pull up our paper documents if needed. I agree, it's not fun but sometimes you just have to roll up your sleeves and dig in.

Again, I read Xilinx's app. note on the brown out problem and it makes it clear that the part can be reset without removing power. I don't disagree that the internal logic could get into a locked state and that there was not a problem with brown out. I also think it is very possible that the current devices being sold could have a second problem with the internal oscillator. There is no mention anywhere about the oscillators failing to start or locking up in the brown out app. note. I am sure if Xilinx would have known this, it would have been documented and the power cycle requirements would have been called out, which they are not.

Again, power cycling the device, no matter how it could be done, is not an option for this system.

It sounds like Xilinx is not willing to dig into the root problem of the oscillator. I can understand this to some degree. After all, the software has not supported the device in several years. So my next question is if you are able to tell me if the oscillator design used in the currently sold 3000s is being used in other Xilinx devices?

lecroy7200 21 years ago

recover ?

After testing the second failure, I tried the power cycle test again. The second part behaved the same as the first. Removing power from the device and shorting the supply (much less than 150mV) for over 1mS would not cause the oscillator to restart (observing it with the spectrum analyzer).
