XC3000 non-recoverable lockup problem

- lecroy7200

Tue, Mar 29, 2005 2:55 PM

I once again may need to retract this. I have not been able to reproduce the power cycle test results. I am beginning to wonder if there was something flawed in my first attempt.

I have been testing multiple boards with multiple devices per board.

Again, temperature does not appear to be a factor. I have done numerous temperature tests and have never seen any correlation. I am seeing a failure in four days on average. The rapid failure appears to have been a fluke of nature. Just one more random data point.

Again, this is no loading. Just looking at the internal oscillator and watching how long power must be removed before it recovers. Nothing to do with reprogramming the device.

It would be great if there were a way to probe the oscillator directly to verify what I am seeing with the analyzer.

- Austin Lesea

Tue, Mar 29, 2005 3:33 PM

I thought we were going to take this offline, but since you are still posting here (fine with me, by the way):

Yes. We found the schematic. We found the hand written note in the margin.

Basically what Rob sent you from the hotline.

If that doesn't work, then I am afraid we are at the end of our resources to provide help.

Changes were later made to the XC4000 so that it did not have this issue.

It is caused by a power supply glitch (and made worse if you use the power down mode as well). Remove the glitch, and the problem goes away. Perhaps you just need to add a 1,000 uF capacitor to the power supply? (or remove one, to prevent the glitch)

Time spent on the KNOWN CAUSE (the glitch) would be beneficial (in my opinion). You are unlikely (in fact: never going) to fix the chip. The issue was addressed in later families, and never in the XC3000.

If anyone else out there can help, please do.

Austin

(and the rest of us back here at Xilinx that actually remember the XC3000)

- Philip Freidin

Tue, Mar 29, 2005 4:07 PM

Well, I have read all of your posts, and everyone elses too. The problem is one of clarity of communications.

Ok, this seems pretty clear,

But in another article you write "I would drop the old 3000A" and in another article you write:

****** XC3190A PQ160AKJ9901 A2025068A Assumed date: 99 Fundimental frequency: 16MHz @ -60dB ******

Xilinx produces an XC3000 family, an XC3000A family, an XC3100 family and an XC3100A family (and many others too). My point about clarity is that your original article says XC3000, another article says XC3000A, and finally with actual part numbers it turns out to be XC3100A.

Are all the devices on all the boards XC3100A? It matters, as the various families had slightly different config logic.

"pretty much" is not clear.

You are dealing with a tough problem. It is rare, difficult to reproduce, and in an area (configuration) in which almost every designer has at least at one time had problems, sometimes intermittent, sometimes easy to repeat. The experience has been that except in extremely rare situations the problem has been traced back to something outside the FPGA.

I understand your frustration, you've been at this for over 2 weeks, and no magic bullet yet.

See, this is surprising. This is not the way configuration is supposed to be started.

In normal configuration, the reset is high, Init and Done/Prog both have pullup resistors.

The software in your configuration processor should test that INIT is high, indicating that housecleaning is complete, then it should test D/P; it should be high too. To start the program process, you pull D/P low, and wait until INIT goes high, indicating it is ready for configuration data. The clock and data should start more than 10 µs after INIT goes high. Starting sooner than this can cause the header to not be read correctly.
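The start-up order described above can be sketched in a few lines of Python. This is only an illustration: `FakeFpgaPins`, its method names, and the timeout value are hypothetical stand-ins for whatever pin access the configuration processor really has, and the fake device simply behaves like a healthy part.

```python
import time


class FakeFpgaPins:
    """Hypothetical pin interface that simulates a healthy device."""

    def __init__(self):
        self.init = True        # INIT pulled high: housecleaning complete
        self.done_prog = True   # D/P pulled high by its resistor
        self.log = []

    def read_init(self):
        return self.init

    def read_done_prog(self):
        return self.done_prog

    def pulse_done_prog_low(self):
        # A healthy part clears its memory and then releases INIT again.
        self.log.append("D/P pulsed low")
        self.init = True

    def send_bitstream(self, data):
        self.log.append(f"sent {len(data)} bytes")


def start_configuration(pins, bitstream, settle_s=10e-6, timeout_s=0.1):
    """Check INIT and D/P, request a reprogram, wait, then send data."""
    if not pins.read_init():
        raise RuntimeError("INIT is low: housecleaning not complete")
    if not pins.read_done_prog():
        raise RuntimeError("D/P is low before the reprogram request")

    pins.pulse_done_prog_low()

    # Wait for INIT to go high again (timeout value is an assumption).
    deadline = time.monotonic() + timeout_s
    while not pins.read_init():
        if time.monotonic() > deadline:
            raise TimeoutError("INIT never went high")

    # Hold off clock and data for at least 10 us after INIT goes high.
    time.sleep(settle_s)
    pins.send_bitstream(bitstream)


pins = FakeFpgaPins()
start_configuration(pins, b"\x00\x01\x02\x03")
print(pins.log)
```

A part stuck in the fault mode would fail the second check, since its D/P never returns high.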

In the fault mode you have described, the D/P is permanently low. For this situation, assuming that the device is in slave serial mode, I believe you would supply a clock (at 1 MHz or slower, to CCLK), and try taking INIT high for > 10 µs, then low for 10 µs, then stop driving it and let the pullup resistor try to pull it high. I would expect INIT to stay low for the housecleaning, and then eventually, the FPGA would stop asserting it low, and the pullup resistor would then pull it high. At this point, stop driving the CCLK signal. It would now be ready for configuration. The D/P signal (which you should not be driving) should also go high because of the pullup resistor. If you get this far, then things are back to normal, and a low going pulse on D/P should start the config process as described in the previous paragraph.

I know you think you have done the above, and the problem is the internal oscillator, but I am unconvinced. I would suggest the following laborious process.

Describe in excruciating detail the signal sequence and timing you observe on ALL the following signals, including timing relationships, and whether the signal has a pullup resistor or not, and when the processor is driving it or not. These are the signals:

DONE/PROG CCLK DIN DOUT INIT RESET PWRDWN LDC HDC M0 M1 M2 RDY/BSY

Somewhere in all of this there is an answer.

Still trying to help,
Philip Freidin

- Brian Davis

Tue, Mar 29, 2005 4:09 PM

Proper internal oscillator startup would normally be guaranteed by the monotonic VCC rise requirements for the part in question; oscillator failure would be consistent with the earlier speculation of a hypothetical transient of some sort taking out the FPGA.

BTW, on a failed part, have you observed DOUT for activity under the test conditions described in Philip's earlier posts?

Also, what value pullup/pulldown resistors are you using for the mode and powerdown pins? I have another vague recollection that the internal pullups were "stiffer" in later 3xxx series parts, and needed lower values for the external resistors.

At the risk of sounding repetitive, the method you seek is called "master serial mode", which lets you directly observe CCLK ( or a divided down version thereof ).

Yes, this requires changing another variable in your test setup, which might affect your chances of observing something.

However, it provides the benefit that you would now have a signal that can be directly probed, and used to catch whatever transient event is perturbing the FPGA: e.g., trigger a deep memory scope on "loss of CCLK" while probing any likely suspects (VCC, configuration pins, VEE, translator output pins, etc.) at a high sample rate with plenty of pretrigger storage.
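For captures that are post-processed rather than triggered in hardware, the "loss of CCLK" condition can be approximated in software. The sketch below is a rough illustration only: the function names, the digitized 0/1 sample format, and the gap threshold are all assumptions, not a feature of any particular scope.

```python
def find_clock_loss(samples, max_gap):
    """Return the index of the last edge before the clock goes quiet.

    `samples` is a sequence of 0/1 values for the digitized CCLK.
    The clock counts as lost once more than `max_gap` samples pass
    without a transition. Returns None if the clock never stops.
    """
    last_edge = 0
    for i in range(1, len(samples)):
        if samples[i] != samples[i - 1]:
            last_edge = i
        elif i - last_edge > max_gap:
            return last_edge
    return None


def pretrigger_window(samples, index, depth):
    """Return up to `depth` samples leading up to and including `index`."""
    start = max(0, index - depth)
    return samples[start:index + 1]


# A clock that toggles for a while and then sticks high.
cclk = [0, 1, 0, 1, 0, 1] + [1] * 10
lost_at = find_clock_loss(cclk, max_gap=3)
print(lost_at, pretrigger_window(cclk, lost_at, depth=4))
```

The same index can be used to slice the other probed channels (VCC, configuration pins, and so on) so that the samples leading up to the event are examined together.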

Brian

- lecroy7200

Tue, Mar 29, 2005 4:30 PM

"Seeing that you have decided to continue to post to the public thread rather than contact me directly, I will assume that this is how you wish to handle this issue. "

You had my direct contact information. I expected that you and Peter would have used it rather than continue to post.

I believe this is a different problem than what was originally noted. I only state this as it seems that there was never any mention of a non-recoverable state like I am seeing and there is never any mention of the internal oscillator failing. Maybe this was the original problem.

Your call. My guess is that had the device been used in some of the DOD designs, help would be coming out of the woodwork.

Again, the problem I am seeing could very well be caused by a transient of some kind. That is why I am running so many different transients to try and reproduce the problem. If I am unable to find a way to reproduce the problem, it will be near impossible to know if it can be fixed or if any changes I make have an impact on the problem. It's nice to be able to throw out a recommendation of a 1000uF bulk cap, but without proof that it did anything to improve or hurt the design, there is little value. That is why testing at this stage is so important.

I agree that fixing the device is not an option. I never expected this. Again, to make it very clear, I need to make sure that we do not run into this with whatever device we replace the 3000 with. I had hoped that Xilinx would have been more proactive in helping to identify the problem, and that if it is an oscillator design issue, you would be able to tell me that the problem was found and that corrections were made to newer devices to prevent it.

It would seem that getting anything from Xilinx is impossible. So the next step will be to qualify a new device based on the tests I am currently running.

On the upside, it seems that the D/P pin going low is a side effect of the problem. So at least I think we can limit our customers' exposure to the problem.

- Austin Lesea

Tue, Mar 29, 2005 4:49 PM

lecroy,

We have been in contact with you directly (through Rob).

I am cc'd on all of the emails, and since I escalated this to the fire department, I was responsible for all communications.

I am sorry you are frustrated.

We found the schematics.

We (and you) know this is caused by a glitch, yet you will do nothing to change the setup, so nothing changes!

A famous line by the owner and CEO of California Microwave, Dave Leason, is as follows (said to a technician staring at a broken PCB):

"Well, what have you tried?"

"I don't know what is wrong, so I don't know what to do."

"If you do nothing, nothing will be the result."

Basically, by refusing to add a capacitor to the supply (or in your best judgement do anything to the supply that would modify its behavior) you are in exactly the same state as the technician: doing nothing will result in no change.

Sometimes you have to do something to get something. In fact, I would state that stronger: you must do something to get any information at all.

Playing with a spectrum analyzer is like looking for your keys under the streetlamp: because to look anywhere else is tough (it is dark there!?).

To imply that your application is not important enough to warrant a response from Xilinx is an insult to the good folks on the hotline, and to me personally.

I am now taking time out of my day to reply to you (again). I could be working with the NSA, JPL, NASA, or the US AF, or any one of the government folks that I am responsible for working with on the many government programs that we work on everyday.

But, no, I am working on trying to help you.

Abuse is not going to make me likely to post further. As of this moment, the case is closed. We have done what we can with what you are willing to do (look under the streetlamp). I hope you take the other advice here on the newsgroup, and do some of the things they suggest, if you do not like the suggestions we have provided.

Sorry that you are upset, we are upset as well now.

Austin

- John_H

Tue, Mar 29, 2005 5:02 PM

Personally I would consider it unreasonable to expect Ford Motor Company to figure out why a '73 Pinto station wagon is experiencing occasional vapor-lock AND base my decision whether to buy a 2005 Thunderbird on what their level of support was... whether they fixed the problem or not.

I applaud your efforts to exhaustively address a problem you're experiencing with ancient parts. Those parts aren't old, they're ancient in the progress of FPGAs.

Be happy for the support you HAVE received - the Xilinx and non-Xilinx folks that continue to add their insights are good people. Don't look to hold the FPGA manufacturer accountable when they HAVE addressed the issue you're encountering but it was put to bed a decade ago.

Often the true cause of something can't be determined without excessive investment of time, money, or newsgroup postings. I wish you luck in finding your happy place with respect to the error you encountered.

Respectfully, - John_H

- lecroy7200

Tue, Mar 29, 2005 8:53 PM

off-line, I assumed you would be in contact. "So, your support for this issue is now Peter Alfke and Austin Lesea."

What I did get from Rob was the following: " Have you been in contact with Austin or Peter on this issue yet, aside from the postings on comp.arch.fpga? If so, can you please CC me on those e-mails to keep me in the loop on this case? " Again, leading me to think you would be in touch.

It's great that you're putting words in my mouth. I am not sure of the cause of the problem. Sure it could be what you refer to as a "glitch". I really do not know, nor can I seem to find any correlation with any tests I have run.

Nice. I am sorry you feel this way about my efforts.

I have taken the opposite direction of trying to cause the failure, and from this you feel I am doing nothing.

It's just another tool to me that provides another way to look at the problem.

That's fine, but it's the truth.

I am sorry you felt that all your hard work on this problem has taken away from your other customers.

Abuse, LOL!!! I needed that bit of humor.

Sorry you are upset. I am just trying to find the root problem.

- lecroy7200

Wed, Mar 30, 2005 2:29 PM

Good that you know that everyone read them all. I for sure could not make that statement.

Very good point!! The part in question is an XC3190A. But, I have also tried some tests with the non A devices as well. When I first opened the call with the hotline I provided them with all of the details but did not even think about it in my original posts.

Yes, all of the parts are the same on this board. All the XC3190A.

I measured 7.5uS. Also note where I ran some tests at 10uS. All greater than minimum. While I don't think I posted it, I even tried a test where I held reset for well over a second.

Good enough.

LOL. Not looking for any magic.

Which they all are.

- Philip Freidin

Wed, Mar 30, 2005 8:39 PM

Well, this is still a clarity issue. I wrote "elses" but should have probably written "else's", as in "I have read the articles by everybody else".

You read "elses" and assumed I meant "else has" which is a contraction I have never heard of :-)

Philip

Philip Freidin
Fliptronics

- lecroy7200

Thu, Mar 31, 2005 1:27 PM

Thanks for the post Brian.

Based on this I tried several tests yesterday using different power supply ramp rates. I went into the seconds. Watching the oscillator with the spectrum analyzer I can see it sweep as it begins to start and finally locks to the normal frequency. I tried manually adjusting the supply by watching the oscillator to see if I could trick it that way into not starting. From this I never saw any of the internal oscillators fail to start after a day of tests.

It's almost like there was some undocumented test mode that the part gets into. I doubt it has anything like this, but from all my tests the part seems very robust.

An interesting thing I did note was that when the device is powered down, the oscillators continue to run. Who would have guessed. They must not have wanted to deal with the time to lock. The data sheet talks about the 3100A drawing 5mA in power down.

M2 uses a 1K. M0,M1 and power down are tied directly to VCC.

Agree, and don't think I had not thought of this. The spectrum analyzer and near field probe work fine. Not sure why Xilinx did not agree with the technique.

This is a very good idea. Had I been able to replicate the problem, using this as a positive trigger would have been a good idea.

- Peter Alfke

Thu, Mar 31, 2005 5:43 PM

Helpful hint: XC3000-type circuits are sensitive to slightly negative Vcc followed by a fast ramp-up. This might upset the deliberate imbalance in the configuration latches, and might (perhaps) cause a lock-up, although we have no record of this. It is just an enlightened speculation on our part.

15 years ago, the environment was generally slower, but now, especially with ECL circuits in the vicinity, there might be a chance for such malfunction.

Just a guess and a helpful hint. No response is needed, and, please, no insult. Peter Alfke

- Philip Freidin

Thu, Mar 31, 2005 7:07 PM

One of the things that has been bothering me through all of this is your detecting of a 16 MHz clock. The configuration clock for these devices is nominally 1 MHz, but actually it can range from 500 kHz to 2 MHz. I am certain that it does not use a faster clock and then divide it down. So I have been beating my head about what this 16 MHz is. Well, I just figured it out.

The baseline family is XC3000

The XC3100 family was a higher performance family

The XC3000A family had some routing enhancements which made use of some previously dummy bits in the config bitstream.

The XC3100A has both the higher performance of the XC3100 family, and the routing enhancements of the XC3000A

The XC3000L is a low power derivative of the XC3000A

(FYI, all bitstreams are the same length. Designs compiled for the XC3000 or XC3100, can be used with any of the 4 families. Designs compiled for the XC3000A or XC3100A, may make use of the dummy bit, and so these bitstreams can not be used in the XC3000 or XC3100 devices).
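The compatibility rule in the parenthetical above can be written down as a small lookup. This is only a sketch of that one rule: the table and function names are made up for illustration, it covers just the four families named, and it says nothing about whether a design meets timing on a slower part.

```python
# True if bitstreams compiled for this family may set the former dummy bits.
USES_EXTRA_ROUTING_BITS = {
    "XC3000": False,
    "XC3100": False,
    "XC3000A": True,
    "XC3100A": True,
}


def bitstream_loadable(compiled_for, target):
    """Return True if a bitstream built for `compiled_for` can go in `target`.

    Bitstreams for the non-A families work in any of the four families.
    Bitstreams for the A families may use the extra routing bits, so they
    are only accepted here for A-family targets.
    """
    if not USES_EXTRA_ROUTING_BITS[compiled_for]:
        return True
    return USES_EXTRA_ROUTING_BITS[target]


for source in USES_EXTRA_ROUTING_BITS:
    for target in USES_EXTRA_ROUTING_BITS:
        print(source, "->", target, bitstream_loadable(source, target))
```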

The performance enhancement in the XC3100/3100A family is achieved by the use of on chip charge pumps. These create higher voltages that are used on selected circuits in the FPGA. These charge pumps use free running oscillators that are separate from the config oscillator, and are almost certainly the 16 MHz that you are seeing. There is no way to actually measure these oscillators, other than what you are doing with the spectrum analyzer.

Since you are seeing that the 16 MHz is not present in devices that are not operational, this means that the charge pumps are not all running. Under this situation, I would expect that the chips would be basically non operational, and no amount of banging on reset, D/P, or other control pins is going to help. This is what you have reported.

I don't remember if the problem you are seeing is that devices that are operational stop operating without the system being turned off, or that when a system is turned on, sometimes it does not start up correctly. (Have you ever said this?)

At this point it would seem that the cause is either the profile of the powerup voltages, transients on the power lines while operational, or maybe negative transients on data lines.

As an example, I have seen DRAMs fail due to excessive undershoot on a data line that violates the device's max negative spec. This is a failure mode different from the well known problem of latchup, i.e. the device fails, but does not exhibit the high power consumption associated with latchup.

Philip Freidin

- lecroy7200

Mon, Apr 4, 2005 2:34 PM

Philip,

Thanks for all your insight!! I was a bit surprised that the "smartest and most helpful engineers" at Xilinx did not pick up on the charge pump right away.

As per our off-line talks, I have gone ahead and rebuilt the design using slew limited outputs for the two pins in question. I have begun running my transient tests but it will be a few weeks before I am convinced this was the problem.
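A rough way to size "a few weeks" is to ask how long a failure-free run must be before it is unlikely to be luck. The sketch below assumes failures arrive at random with a constant rate (an exponential model), which nothing in this thread confirms, and it takes the four-day average reported earlier as the mean time to failure of the whole setup. The function names are made up for illustration.

```python
import math


def prob_no_failure(days, mean_days_to_failure, units=1):
    """Chance of seeing no failure in `days` if nothing was fixed.

    Assumes independent units, each failing at a constant random rate
    with the given mean time to failure (exponential model).
    """
    return math.exp(-units * days / mean_days_to_failure)


def days_needed(confidence, mean_days_to_failure, units=1):
    """Failure-free days needed before luck alone is this unlikely."""
    return -mean_days_to_failure * math.log(1.0 - confidence) / units


# With a 4-day mean, how long must one setup run clean for 99% confidence?
print(round(days_needed(0.99, 4.0), 1))
# And how likely is a clean week if the problem is still there?
print(round(prob_no_failure(7, 4.0), 3))
```

Under those assumptions a clean run of roughly 18 days on one setup corresponds to 99% confidence, and running several independent setups in parallel shortens that in proportion.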

The following link is to my post about the reflected energy causing possible problems:

formatting link

The following was taken from a Xilinx app. note.

"For all FPGA families, ringing signals are not a cause for reliability concerns. To cause such a problem, the Absolution Maximum DC conditions need to be violated for a considerable amount of time (seconds). "

I am including parts of our off-line talks that may be a benefit to others reading this thread.

ground plane around the device for a reference. Ground plane is attached to the device's ground in multiple places. The scope is a LeCroy 7300: 3 GHz BW with a sample rate of 20 GS/s. Using a 3.5 GHz active probe with a loop of about 0.5". All measurements are taken at the FPGA's pins. Using no filtering, etc. If there is a glitch, I will find it.

is some undershoot from the reflection. This undershoot can be more than 0.5 Volts below the rail. On their newer parts I had seen where they started to specify the SWR of the next stage, but I was not able to obtain this document. You may recall me posting this lengthy post last summer. I have never seen a problem where, say, all of the energy was reflected back to the device's output and caused a problem. Maybe the 3100A was prone to having problems with this.

Well anything that goes more than .5V below ground would concern me, even very short duration. I don't think the 3100A was particularly prone to this.

While normally you worry about undershoot and overshoot at a receiver, in the case of FPGAs, all pins are both. So even if you are using a pin only as an output, it still has an input structure including the protection diodes. The undershoot can cause the diode to conduct, and this can in turn upset the local ground reference inside the FPGA. This may be your fault mode. Note that this type of thing can have data pattern sensitivity, i.e. a bunch of outputs all switching low at the same time, maybe on pins that are further away from the ground pins rather than nearer, with reflections arriving at about the same time, etc.

Two suggestions: can you force the data outputs to bang between patterns that are predominantly all '1's and then '0's? Other idea, set up a low impedance pulse generator to generate say a 1 µs pulse of -1V, and apply it to some pins (1 at a time) and see if this induces the problem.
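The first suggestion amounts to driving the output bus with words that flip every bit on each step. A minimal sketch of such a pattern source follows; the bus width and function names are assumptions for illustration, and how the words reach the real outputs depends entirely on the design.

```python
def alternating_patterns(width, count):
    """Return `count` words that alternate between all ones and all zeros."""
    all_ones = (1 << width) - 1
    return [all_ones if i % 2 == 0 else 0 for i in range(count)]


def bits_switching(previous, new):
    """Count how many outputs change state between two consecutive words."""
    return bin(previous ^ new).count("1")


words = alternating_patterns(width=8, count=4)
print(words)
print([bits_switching(a, b) for a, b in zip(words, words[1:])])
```

Every step here toggles all outputs at once, which is the worst case for simultaneous switching; mixing in a few fixed bits would give the "predominantly" all-ones and all-zeros variant.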

- Jim Granville

Mon, Apr 4, 2005 8:07 PM

formatting link

That's from a Pin-failure viewpoint. - ie energy damage. They also spec a MAX peak current.

There IS another failure mode, which is the lateral currents that result from the clamp diodes ( which are actually side-ways transistors ). It is not easy to KNOW what peak currents you get, especially on cable or external runs.

At the highest levels, these injection currents cause latch-up, but there can be lower levels, where operation is compromised, but the device does not latch up.

Latchup tests are purely "did the SCR trigger?" ones, they do NOT (AFAIK) ever check to see if the part logically misfired in any way.

-jg

- lecroy7200

Wed, Apr 6, 2005 7:59 PM

Yes, I think that's what I had stated.

Peter's original app note on the subject.

formatting link

So far no problems with my testing. If this solves the problem it would be interesting to know if there was some reason that the 3100A's internal doublers were prone to failure because of this.

- lecroy7200

Tue, Apr 19, 2005 3:09 PM

I have come to the conclusion that it is possible that the actual core design can affect the internal charge pump circuits. After weeks of testing a core that was auto generated, I have been unable to reproduce the problem. Setting the outputs to FAST or slow appears to have no effect on the failure. Talking with Philip, it does not appear that the device had any capabilities to turn off the charge pumps.

I did go back to the original core and made sure I could reproduce the failure once more.

I also came across this old note from Xilinx:

"Note that XC3100L and XC5200L use a continuously running internal oscillator to generate an elevated voltage for driving the pass-transistor gates , This is called "pumped gates" and gives better speed, but results in significantly elevated idle ( quiescent ) current consumption, bad for battery-operated systems. XC3100 devices have always used this technique, while the original XC5200 devices did not, but the coming releases will."

It appears some of the newer parts also used internal charge pumps. Would be interesting to know if they would be prone to the same problem.

- Austin Lesea

Tue, Apr 19, 2005 5:07 PM

The more recent FPGAs use the Vccaux supply through a regulator to supply Vgg, or the pass gate voltage supply.

There are no charge pumps in FPGAs now since Virtex (roughly 7.5 years).

Austin

- lecroy7200

Wed, Apr 20, 2005 12:31 PM

Just doing a quick search I find the Coolrunner is using a charge pump for the programming voltage. Just search the data sheet for "charge" and you will find it.

- Austin Lesea

Wed, Apr 20, 2005 3:42 PM

Coolrunner is a CPLD.

Aust>>There are no charge pumps in FPGAs now since Virtex (roughly 7.5