XC3000 non-recoverable lockup problem

I am looking at a design that was done several years back which uses several XC3000 devices. All devices are programmed with the same core from a computer on power up. What I am seeing is that once or twice a year, one of the devices will enter a non-recoverable mode. In this mode, the device appears to be in power down or reset. Once in this mode I am no longer able to program the device. Pulling the XC3000's reset low for 10us has no effect. The problem appears random. The only way to reprogram the part is to power down the IC. I have tried running tests where I just reprogram the device, overheat the device, change the supply voltages, etc. and can't reproduce the problem. When the Xilinx device is in this mode, it draws little power. It can be held in this mode for what appears to be an indefinite amount of time without causing any damage to the device.

Was there some kind of an undocumented test mode built into the XC3000 that I may be seeing? Does anyone else ever remember seeing a problem like this?

Reply to
lecroy7200

I haven't used the XC3000, only 95xx and Spartan. But, I'm assuming it has jtag or something to download the config. Is the download controlled by an onboard CPU, or is it a pin header that you plug a programming cable into? If it is a header, do you jumper the pins to prevent stray fields from producing clocks on the download pins?

Well, presumably, it isn't a classic CMOS latchup event, or it would either fry the device or overload the power supply on the whole board.

Jon

Reply to
Jon Elson

It's been a long long time...

I can't find the data sheet on their web page and my memory is (more than) a bit rusty...

I forget the details of how configuration works. I think reset is just a global reset of the FFs. It doesn't have anything (much?) to do with configuration.

I think there is a combined ~prog and done pin. It's pulled low (open drain) by the 3000 until it gets configured. Power up starts needing configuration. A high-to-low transition on ~prog asks for another configuration cycle. If your attempted configuration gets confused, there is no way to start over until you finish configuration since ~done is held low so you can't make it go high-to-low.

Configuration starts with a 24 bit bit-count value. After that many configuration clocks, all the devices in the chain release their done pulldown. If one of the devices in the chain gets (somehow) a low value in that counter you have to cycle through 2**24 cycles to wrap around and finish the current cycle.
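To put a rough number on that wrap-around, here is a small Python sketch. It assumes the length counter behaves like a plain free-running 24-bit counter that finishes when it matches the loaded value; that model, the function names and the 1 MHz clock in the example are illustrative assumptions, not something taken from the data book.

```python
COUNTER_BITS = 24
MODULUS = 1 << COUNTER_BITS  # 16,777,216 counter states


def clocks_until_match(clocks_seen: int, length_count: int) -> int:
    """Clocks still needed before a wrapping 24-bit counter equals length_count.

    Assumed model: the counter increments once per configuration clock and
    wraps at 2**24.
    """
    return (length_count - clocks_seen) % MODULUS


def seconds_to_flush(clocks: int, cclk_hz: float) -> float:
    """Time needed to supply that many clocks at a given CCLK frequency."""
    return clocks / cclk_hz


if __name__ == "__main__":
    # Hypothetical case: a device latched a bogus count of 100 but has
    # already been clocked 5000 times, so it must go nearly all the way round.
    remaining = clocks_until_match(5000, 100)
    print(remaining, "clocks,", seconds_to_flush(remaining, 1e6), "s at 1 MHz")
```

Under these assumptions a full wrap is at most about 16.8 million clocks, so a loader that sends only one bitstream's worth of clocks would not necessarily flush a confused counter.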

How clean is your configuration clock? Try cold rather than hot on your lab setup.

The description of configuration in the data book is pretty good, but maybe only after you know the answer because you have read it many times.

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
Reply to
Hal Murray

This sounds like something that used to be called the "brownout" problem. I did a xilinx search, but found nothing. Must have been purged.

I seem to remember that the problem occurs if there is a low going transient that drops below some level (like 3.0 V) which causes the chip to do house cleaning (wiping the config memory), but doesn't go low enough to trigger the re-configuration logic.

Maybe Peter Alfke can remember better than me? This problem was fixed around 1989 I think.

The solution may involve changes to your power supply, such that if the voltage ever dips below say 4.5V, you make sure it goes all the way down to 0V, for maybe 100 ms, before it comes back up.
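As a way to picture that supervisor behaviour, here is a minimal Python simulation. The 4.5 V trip level and 100 ms hold come from the paragraph above; the sampled-rail model, the function name and the sample period are assumptions made purely for illustration, and a real fix would be done in hardware.

```python
def supervise(samples, dt_s, trip_v=4.5, hold_s=0.100):
    """Return the rail as a supervisor would present it to the FPGA.

    samples: raw supply voltage, one value every dt_s seconds.
    Whenever the raw rail dips below trip_v, the output is forced to 0 V
    for hold_s before the raw rail is passed through again.
    """
    hold_samples = max(1, round(hold_s / dt_s))
    out = []
    remaining = 0  # samples left in the current forced-off period
    for v in samples:
        if remaining == 0 and v < trip_v:
            remaining = hold_samples  # start (or restart) a full power-down
        if remaining > 0:
            out.append(0.0)
            remaining -= 1
        else:
            out.append(v)
    return out


if __name__ == "__main__":
    # One brief dip to 4.0 V in an otherwise steady 5 V rail, 10 ms samples.
    raw = [5.0, 5.0, 4.0] + [5.0] * 12
    print(supervise(raw, dt_s=0.010))
```

The point of the sketch is only that a short, shallow dip gets converted into a clean, full power-down, so the part never sits in the in-between region.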

A google search found this:

formatting link

which sort of confirms the fault mode.

Philip Freidin

Philip Freidin Fliptronics

Reply to
Philip Freidin

Thanks for all your ideas on this matter.

There is no JTAG support on this device. All devices are programmed from an external computer using the Done/Prg', Reset, Data and Clock pins using the slave serial mode. The traces are daisy chained to each device and then terminated at the end of the bus. All devices are loaded with the same core using slave serial mode. Even if the loading state machine were somehow stuck, needing more clock cycles to flush it, the programming does this upon each load sequence. Also, if you have the data sheet, on page 7-19, you will notice that during configuration, if the Reset pin is active, the configuration will abort and the initialization sequence will start over.

The following is from the data sheet for the 3000:

"To initiate a re-programming cycle, the dual-function pin DONE/PROG must be given a High-to-Low transition. To reduce sensitivity to noise, the input signal is filtered for two cycles of the FPGA internal timing generator."

All of these pins are hard wired together. And once in the "locked" state, the device remains with the pin released. So, I am still able to pull the pin low to start a new download. Once the device is in this mode, it behaves almost as if it were no longer in the circuit. I am able to program all other devices in the chain.

The following is from the data sheet for the 3000:

"The FPGA tests for the absence of an external active Low RESET before it makes a final sample of the mode lines and enters the Configuration state. An external wired-AND of one or more INIT pins can be used to control configuration by the assertion of the active-Low RESET of a master mode device or to signal a processor that the FPGAs are not yet initialized. If a configuration has begun, a re-assertion of RESET for a minimum of three internal timer cycles will be recognized and the FPGA will initiate an abort, returning to the Clear state to clear the partially loaded configuration memory words. The FPGA will then resample RESET and the mode lines before re-entering the Configuration state. During configuration, the XC3000A, XC3000L, XC3100A, and XC3100L devices check the bit-stream format for stop bits in the appropriate positions. Any error terminates the configuration and pulls INIT Low."

Reply to
lecroy7200

I also have vague recollections of this problem (needing to completely shut off power to 0V to recover).

Some googling turned up this copy of an old Xilinx answer record #134 (watch line breaks on the link):

formatting link

Brian

Reply to
Brian Davis

Thanks. I read the note and agree that the problem could be related to some kind of transient. If the Done/Program pin were stuck in the low state it appears that the device will still reset by monitoring the state of the Reset pin.

"A re-program is initiated when a configured XC3000 series device senses a High-to-Low transition and subsequent >6 us Low level on the DONE/PROG package pin, or, if this pin is externally held permanently Low, a High-to-Low transition and subsequent >6 us Low time on the RESET package pin."

The note to your link suggests that setting Reset high for > 6us then setting it and the Prog/Done pin low for > 6us will bring the device back to the clear configuration state. Looking at the loader code, this is pretty much what is being done on every load. The Reset normally idles high and it along with the Program pin are pulled low for 7.5us. I verified this as well. Doing this does not make the device exit this strange mode. So far, the only thing that seems to clear it from this state is a hard power down.
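For anyone wanting to check a captured loader trace against that timing, here is a rough Python sketch. The 6 us figure is from the quoted note; the segment format, the function name and the idea of feeding it scope measurements are my own assumptions. It only checks pulse widths and says nothing about whether a stuck part will actually respond.

```python
MIN_US = 6.0  # ">6 us" from the quoted answer record


def meets_reprogram_timing(segments, min_us=MIN_US):
    """Check a pin trace for: RESET high > min_us, then RESET and DONE/PROG
    both low > min_us.

    segments: list of (reset_level, prog_level, duration_us) in time order,
    with levels given as 0 or 1. Adjacent segments with the same levels are
    treated as one continuous stretch.
    """
    high_us = 0.0  # length of the most recent RESET-high stretch
    low_us = 0.0   # length of the current both-low stretch
    for reset, prog, dur in segments:
        if reset == 1:
            if low_us > 0.0:
                high_us = 0.0  # a new high stretch begins after a low one
            high_us += dur
            low_us = 0.0
        elif prog == 0:
            low_us += dur
            if high_us > min_us and low_us > min_us:
                return True
        else:
            # RESET low while DONE/PROG is high: not the sequence in the note
            high_us = 0.0
            low_us = 0.0
    return False


if __name__ == "__main__":
    loader = [(1, 1, 100.0), (0, 0, 7.5), (1, 1, 10.0)]  # 7.5 us low pulse
    print(meets_reprogram_timing(loader))
```

With a 7.5 us low pulse the margin over 6 us is only 1.5 us, which is one reason a much longer pulse is worth trying on a stuck part.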

As a test, I forced the Prog pin low on one device in the chain. The pin latches low as expected. I then forced a few clock cycles to get the device into some mid data stream mode. I then pulled the reset low and started a normal configuration. The part did recover, releasing the Prog. pin at the end of the programming cycle. So, at least this all seems to work.

My next step is to conduct noise onto the supply to see if I can replicate the problem. Because this happens so infrequently, it is next to impossible to find any other clues.

Reply to
lecroy7200

I tried a few different tests. I first reduced the supply voltage on the Xilinx devices by 500mV and ran the system as normal, but saw no problems. I then reduced the supply voltage until I started to see problems with the function of the devices (this was around 3.5 volts), but as soon as the supply was returned to normal the parts would function normally as well. Using a bias T I then injected a sinewave onto the supply line. I ran the supply at 4.5 volts and injected a 500mV signal. I did multiple sweeps from 100 kHz up to a bit over a GHz and saw no problems. I then ran the same test with the supply at 5 volts and again saw no problems.

So far, it would appear the problem is not related to the supply voltage or operating temperature.

Reply to
lecroy7200

Err, maybe. Keep in mind that on many devices, the RESET does NOT reset everything, and is more aptly labeled reset request. It is not uncommon to see devices enter an illegal (but 'safe') state, that can only be exited by a power cycle. This is because chips often use internal POR cells, using simple RC elements. Such states are normally either external energy transient or runt-pulse related.

That is why in some industries, the WDOG systems work by doing a Power-Cycle, rather than the less effective 'reset'.

If you can sense this state, your best remedy could be to trigger a power re-cycle ?

-jg

Reply to
Jim Granville

I am just going by what their data sheet says. From my testing, it does appear that the part functions this way as well. Not to say that it has not reached some "safe state" that reset won't shake it loose, because that appears to be what I am seeing.

This design has the power down pin tied to VCC, but I broke it out to test this mode as well. I tried doing some basic DC tests as well as sweeping RF into the pin, but again was not able to replicate the problem.

Yes, I can detect when the problem happens. Power cycling the system is not an option. Are the Xilinx guys still browsing this group? If so, any ideas from the masters? I am running out of ideas to try.

Reply to
lecroy7200

Have you opened a case with our hotline?

It appears you are getting into some strange state between on, and off.

The devil is in the details, and working directly with the hotline is your best bet to resolve it.

Austin

Reply to
Austin Lesea

The key word in my previous article is "transient", as in maybe only for a microsecond or two.

This corrupts an internal state machine, and the illegal state is something like "house cleaning complete", "configured" (which is no longer true), and "done" pin low (caused by "house cleaning complete").

As the done pin is an open drain signal, if the FPGA is pulling it low, then it can't tell if you are pulling it low outside as well. It needs the FPGA to release this signal, and an external resistor to pull it up, before it can recognize an external pull down of this signal.

Yep.

You haven't tried something like I have suggested. Try and have the power sitting at 5V, and pulse it low to 3V or 2.5V for 1 to 2 microseconds, then back high for much longer time (seconds).

The problem I seem to remember is that the narrow pulse is recognized by the voltage level detector in the FPGA, which starts the house cleaning, that wipes the config memory, but the 6 us filter stops the power glitch from being detected by the reconfiguration logic.

The problem with the sinewave is that when it goes low it does so for too long to trigger the brownout problem.
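One way to compare a sine-wave test against a 1 to 2 microsecond pulse is to work out how long each cycle of the sine spends below a given level. The Python sketch below does that for an ideal sine riding on a DC rail; treating the 500 mV figure as a peak amplitude at the device pins is an assumption, and the function name is just for illustration.

```python
import math


def time_below_per_cycle(v_dc, v_peak, freq_hz, v_threshold):
    """Seconds per cycle that v_dc + v_peak*sin(2*pi*f*t) is below v_threshold."""
    if v_peak <= 0 or freq_hz <= 0:
        raise ValueError("v_peak and freq_hz must be positive")
    x = (v_dc - v_threshold) / v_peak
    if x >= 1.0:
        return 0.0            # the trough never reaches the threshold
    if x <= -1.0:
        return 1.0 / freq_hz  # the whole cycle is below the threshold
    fraction = 0.5 - math.asin(x) / math.pi
    return fraction / freq_hz


if __name__ == "__main__":
    # Assumed test conditions: 4.5 V rail, 0.5 V peak sine, 100 kHz.
    for level in (4.25, 3.0):
        t = time_below_per_cycle(4.5, 0.5, 100e3, level)
        print(f"below {level} V for {t * 1e6:.2f} us per cycle")
```

Note that with those assumed numbers the trough only reaches 4.0 V, so the sweep would never have visited the 2.5 V to 3 V region at all.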

I am still fairly sure this is a power supply transient problem.

Philip

Philip Freidin Fliptronics

Reply to
Philip Freidin

I don't believe the supply tests you've described so far would have certainly induced the configuration problem.

Was that 500 mV measured at the device VCC pins, or on the front panel of the 50 ohm signal source driving the bias-T?

Either way, I second Philip's suggestion that you should try to re-create the problem with a brief, not-quite-to-zero glitch on the supply.

My own encounter with this problem (long ago) occurred under rapid AC power switch cycling as described in his original link.

When you next induce this problem, here are some random thoughts on additional sticks with which to poke at the stuck FPGA:

- longer ( >>6us ) reset and program pulses

- send more config data than needed and look for data to appear on DOUT

- if possible, stop driving CCLK externally, switch the mode pins on the stuck device to master serial, then reset and look for CCLK coming out of the part

Brian

Reply to
Brian Davis

Thanks again. I will try the pulsed supply as suggested today and post my findings.

To give you an idea of the MTBF, of about 90 ICs being run, I have seen or heard of the problem 5 times over about a six month period. There is not a good way to probe the parts after the fact to see what is going on. Reproducing the problem seems to be the only good way to solve it at this point.
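For a sense of scale, those numbers can be turned into device-hours per failure. The Python sketch below assumes all 90 parts were powered continuously for the six months and that failures arrive at a constant random rate; both are assumptions for illustration, as are the function names and the four-device bench example.

```python
import math

HOURS_PER_MONTH = 730.0  # 8760 hours per year / 12


def device_hours_per_failure(devices, months, failures):
    """Observed device-hours per failure, assuming continuous operation."""
    return devices * months * HOURS_PER_MONTH / failures


def prob_at_least_one(devices, hours, mtbf_hours):
    """Chance of at least one spontaneous event, assuming a constant rate."""
    return 1.0 - math.exp(-devices * hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = device_hours_per_failure(devices=90, months=6, failures=5)
    print(f"about {mtbf:.0f} device-hours per failure")
    # Hypothetical bench run: 4 devices watched for 4 hours with no added stress.
    print(f"chance of a spontaneous failure: {prob_at_least_one(4, 4.0, mtbf):.5f}")
```

Under these assumptions a few hours on the bench has almost no chance of catching the failure by luck, so a test only means something if it deliberately applies the suspected stress.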

Reply to
lecroy7200

Besides power supply transients, you should also look at IO pin transients, or even RF field bursts. The consensus is the logic is being disturbed, but the source is a mystery. Large IO transients can cause lateral currents in the die. Do the failures have any site/user-clusters ?

-jg

Reply to
Jim Granville

Not sure what you mean by this.

I did look at all of the signals to/from the devices and there is not an excessive amount of over/undershoot.

I tried setting up a test to pulse the power to the board. The problem in doing this is the amount of bulk and high frequency capacitance on the board prevents me from getting a good solid pulse. I thought about doing a push-pull but have not tried this yet. Just interrupting the power, I set the pulse time to where the devices would just start to become affected. I would then reprogram them and repulse the supply. I ran this test for about four hours today and was still not able to replicate the problem.

I also opened a case file with Xilinx. I would drop the old 3000A parts if I knew that would fix the problem, but until I am able to reproduce it I don't see this being an option. I have never seen an FPGA get into a mode like this that required a power cycle to clear it.

Reply to
lecroy7200

Are the failures random across all your installations, or do they cluster in a few sites or users. Often these external impact effects are local. Any relays or contactors involved ?

It is not a 'normal signal' effect, but an external impulse effect.

A good 'chip cracker' test is a self commutating mains relay, and a short 'wand' cable. The arcing contact + bounce effects mean you can generate wave fronts of over 1KV/ns. Wave the wand around your cables and PCB, and watch for your effect.

We have used this type of severe test setup, to find that a number of chips have the (Reset < Power Cycle) effects.

Reply to
Jim Granville

This is a random failure from what I am able to tell and not tied to a particular location or setup. The environment is clean (from a noise perspective).

We sell products all over the world and require CE certification. We perform a full set of conducted and radiated tests on every product. During these tests I have not seen this problem. This is a much worse environment than where the units are being run.

The main clock on this card is 500MHz and we sample on both edges. There are several layers of ECL before getting to the Xilinx parts that are running at a modest 60MHz. So termination and correct layout are a big factor in making the designs work. Almost every trace is controlled.

I went ahead and installed a second FET to crowbar the supply to the card once power was cut in order to force the bulk and high frequency caps. to discharge at a faster rate. This did help me achieve a faster edge with my power supply testing, but again I was not able to reproduce the problem.

Reply to
lecroy7200

other startup tales (that may not fit your one-of-N AWOL symptoms):

If that's real negative ECL, I have had similar bizarre powerup problems with both an ancient Altera CPLD and an Atmel AVR, when residing in a mixed TTL/ECL design:

A temporary failure, or delayed startup, of the +5.0V supply would allow the -5.2V supply to slightly reverse bias the +5.0V rail.

Under these conditions, when the +5.0V supply did come up, the CPLD or uC would not initialize properly, until both supplies were removed and then restarted at the same time.

Also, TTL/ECL translators can have strange clamp bias paths into a pin when one supply fails (IIRC, more of a severe problem in PECL across boards with a failed supply on one card).

Brian

Reply to
Brian Davis
