XC3000 non-recoverable lockup problem

- Hal Murray

Wed, Mar 23, 2005 9:10 AM

My memory is very fuzzy. I think the Reset pin would break out of that mess. The catch was that the system we were working on didn't have a way for the CPU that was supplying the bits to flap Reset.

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.

- Austin Lesea

Wed, Mar 23, 2005 3:48 PM

lecroy,

Regardless of what any piece of paper claims, it is the memory of many here that the only way to recover is by powering down.

As a 15 year old problem, it is one that we only have our (failing) memories to rely upon.

There was no answer database in those days.

There was no hotline.

Austin

>> The oscillator itself is at a much higher frequency, and is divided down

- Austin Lesea

Wed, Mar 23, 2005 3:51 PM

lecroy,

If repowering the device will not return the part to a state where it can be programmed, then I have to say it is a bad device.

Trying to infer the operation by sniffing for an oscillator that is the wrong frequency sounds suspicious.

Austin

>>> Do you recall how low the Vcc had to cycle, in order to correctly

- lecroy7200

Wed, Mar 23, 2005 4:40 PM

Austin, not to make fun of the situation, but I really have no idea what you are trying to tell me in this statement.

I am sorry, but again I must be missing something. Are you stating that you don't believe my test results and that looking at the device's internal oscillator with a spectrum analyzer is not a valid test?

- lecroy7200

Wed, Mar 23, 2005 4:53 PM

Well, it's not just any piece of paper, it's the Xilinx published papers. I agree, it does appear that the documents for this device were not kept up.

I don't know if the problem is 15 years old or not. Is it tied to when the oscillator's frequency was changed to 16MHz? It is possible. I am guessing that this happened after 1997, when the 4000 data sheet I have was published. Again, it's all a guess on my part. Xilinx would need to answer this.

The problem now is how to prevent it with the next design. The first step is to determine if the design used for the currently sold 3000 series oscillators was used in other devices. If so, I plan to stay clear of these.

- Austin Lesea

Wed, Mar 23, 2005 5:35 PM

lecroy,

No one changed the frequency of any oscillator.

The layout has been shrinking so as to be able to be fabricated, that is all.

Did the oscillator go from 8 MHz to 16 MHz over this period of time?

Maybe.

But, for you to infer something from a 16 MHz signal is suspect: does failure to configure 100% correlate with this signal?

If you power it down, and back on, can it be reprogrammed? If not, it is a bad part. If it can be reprogrammed, then it is a good part (as far as configuration is concerned).

If you have a case open with the hotline, what have they said, and what are they doing?

If you do not have a case open with the hotline, then you should have one open.

Austin

- lecroy7200

Wed, Mar 23, 2005 8:07 PM

I will see if I can locate one of the pre-97 devices to verify this. We may have some in stock in our area. I will let you know what I find.

Well, there is certainly nothing that prevents Xilinx from running their own tests to validate what I am seeing. I certainly cannot force them to do so. I am 99.9% confident that the 16MHz is from the 3000's internal oscillator and that this is the fundamental frequency. I see this signal on every working device and nowhere else. It is a very loose frequency, it tracks with the individual device's temperature, and it has the most energy in my sweeps.

So far there is 100% correlation with the failure, but we are only talking about one data point. I can also tell you that once I cycled the power, the 16MHz signal for that device was present again (it was not while the part was in the failed state) and the part began to function normally.

The oscillator locking fits with what I am seeing, not being able to reprogram the devices.

Again, this depends on what you mean. Looking at the power on the device, I can bring it below what I can detect for over 1ms, turn it back on, and the part will not allow me to reprogram it. Nor will the oscillator start running. I have to remove power for a much longer time for the oscillator to start and allow me to reprogram. This appears to be the case with all six failures I have seen: they need the power removed for several seconds in order to recover.

Agree. Again, out of the six times I have now seen this, the failure has not appeared to cause any damage to the devices.

I am not so sure this is the place to discuss this. If you want the name of the person I have been in contact with, or the case number, feel free to send me a direct e-mail. During the first contact I was asked if I was the person posting in this forum, to which I responded yes. I explained in detail what I knew at the time, including providing exact part numbers and lot codes for the parts I was testing. I was told by the person I spoke with that this was outside what the hotline could handle and that it would be elevated to a higher group and that they would get back with me. I continued to work on the problem and made the comment in one of my postings about not yet hearing back from Xilinx. That same night I received an e-mail from the hotline as follows:

"It's been a few days since we last communicated, and I wanted to check in. Since this device isn't officially supported by the hotline any longer, I'm having to do a fair amount of work to find any information on it here. I'll keep investigating this, and I'll let you know when I come across anything that hasn't been tried already according to the suggestions on comp.arch.fpga."

Later, after reproducing the failure, I tried to contact the support group and left a voice message stating what I had found and asking them to return my call. After I did not hear back from them for several hours I continued my testing. Once I discovered the oscillator problem I again tried to contact support and left a second voice mail. Again, there was no return call. On my third attempt I finally spoke with my original contact and was told that you were the expert at Xilinx and that your posting about the part being designed on paper was correct and that there was nothing they could do. Google groups was down that day and I was not able to read your posting, so they forwarded me your post. I still have the case number open, and if I learn any more about the problem I will try to contact them again.

So, now that we have established you as being the expert at Xilinx the question becomes if you can help. I am setting up one more test to determine the state of the program/done pin prior to attempting to reprogram the device after the failure. I will publish these findings once I have them. Also, if I am able to locate an older device and test it, I will publish what I find on the internal oscillator.

- Jim Granville

Wed, Mar 23, 2005 10:32 PM

I would expect (generally speaking) a Config Ring osc to gate itself off, after config is completed. What does a normally operating device show - does this osc appear to gate in normal usage ?

It may pay to get a closer number on that - < 5 seconds and > 1ms is quite wide... Most vanilla buried POR cells are RC in nature (tho the R may be a FET), and they will have a TIME as well as a voltage requirement for reset. Austin has given the ~150mV-350mV region as the voltage, but no one has a time value yet.
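One way to tighten that window is a bisection over the power-off time. The following is a minimal sketch under stated assumptions, not a Xilinx procedure: `recovers_after` is a hypothetical callback that gets a part into the failed state, removes power for the given time, and reports whether the part then recovers. The 1ms and 5 second starting bounds are the ones quoted in this thread, and the search assumes that recovery is monotonic in off-time.

```python
from typing import Callable, Tuple

def bracket_recovery_time(recovers_after: Callable[[float], bool],
                          lo_s: float = 0.001,
                          hi_s: float = 5.0,
                          resolution_s: float = 0.005) -> Tuple[float, float]:
    """Narrow the off-time window [lo_s, hi_s] by bisection.

    Assumes lo_s is known NOT to recover the part, hi_s is known to
    recover it, and longer off-times never do worse than shorter ones.
    """
    while hi_s - lo_s > resolution_s:
        mid_s = (lo_s + hi_s) / 2.0
        if recovers_after(mid_s):
            hi_s = mid_s   # recovered: the threshold is at or below mid_s
        else:
            lo_s = mid_s   # still stuck: the threshold is above mid_s
    return lo_s, hi_s

if __name__ == "__main__":
    # Simulated part with a made-up 230ms recovery threshold.
    print(bracket_recovery_time(lambda t: t >= 0.23))
```

Each probe costs one reproduced failure, so on real hardware the number of bisection steps (about ten for 5ms resolution) matters more than the code itself.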

You could ask Xilinx explicitly if newer devices have any buried POR cells, that are not also replicated by a RESET ?

I would expect this type of oops to be eliminated :)

-jg

- Austin Lesea

Wed, Mar 23, 2005 11:19 PM

lecroy,

I have found the case, and the CAE assigned, and I am working with Peter to resolve this.

So, your support for this issue is now Peter Alfke and Austin Lesea. All I can say is that if we can't help you, then no one can, so you can not complain about not getting the best resources assigned to the job!

Basically, as an officially unsupported part (end of life, last time buy status, etc. ...), we will do what we can.

Peter and I are the only ones here who remember anything about the XC3000.

The hotline was unable to follow through on this due to the age of the part, and the total lack of information on it. In future, we will provide the hotline with a way to deal with this other than just getting frustrated.

I apologize for that on the part of Xilinx. It is a situation that we had not anticipated (support of an obsolete product).

I suggest we move this off of this forum.

Austin

- lecroy7200

Thu, Mar 24, 2005 12:16 AM

From what I see with the spectrum analyzer, all of the devices except the failed part will keep their clocks running at all times.

Yes, I agree, except I am not sure what value this information has. Next time I reproduce the problem, I will measure it.

I agree. But I am also a bit surprised that, with as much work as was being done with the DOD back then, a known problem like this would not have been documented.

I did receive a third email from the hotline:

"It seems possible that from the outside an 8MHz oscillator would look like 16. That is, you see both transitions as a pulse.

Note, for example, that the "60Hz" sound that we are used to hearing from things like transformers is actually 120Hz."

It is an interesting point. I know nothing of the internal design. We do not know the symmetry, or whether there are resonances that could fake out the measurements. However, if this were the case I would not expect a 1MHz signal to have the majority of its power at 16MHz. I agree that 8MHz is possible, but I took note that there was nothing at 8MHz, or if there was, it was buried in the noise floor of the analyzer.
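The "both transitions as a pulse" argument, and why the symmetry matters, can be checked with a few lines of arithmetic. This is only a sketch of the idea, not a model of the XC3000: it assumes that each clock edge, rising or falling, shows up at the probe as a pulse of the same polarity, and the function names and sample counts are made up for illustration. Harmonic 1 corresponds to the clock frequency (8MHz in the hotline's example) and harmonic 2 to twice that (16MHz).

```python
import cmath
from typing import List

def edge_pulse_period(n_samples: int, duty: float) -> List[float]:
    """One period of a pulse train with a unit pulse at each clock edge.

    Assumption: rising and falling edges produce pulses of the same polarity.
    """
    x = [0.0] * n_samples
    x[0] = 1.0                                    # rising edge at the start of the period
    x[round(duty * n_samples) % n_samples] = 1.0  # falling edge, placed by the duty cycle
    return x

def harmonic_magnitude(x: List[float], k: int) -> float:
    """Magnitude of the k-th harmonic of one period (direct DFT, divided by length)."""
    n = len(x)
    acc = sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))
    return abs(acc) / n

if __name__ == "__main__":
    for duty in (0.5, 0.4):
        x = edge_pulse_period(200, duty)
        print(duty, harmonic_magnitude(x, 1), harmonic_magnitude(x, 2))
```

With a 50% duty cycle the two pulses per period cancel exactly at the clock frequency and add at twice the clock frequency, so nothing would be visible at 8MHz. With an asymmetric duty cycle some energy reappears at the clock frequency, which is why the unknown symmetry matters when reading the analyzer.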

I spent some time trying to locate some older parts to test. I cannot find any documents that state how the date code was marked, so I am supplying all of the markings as they are shown on each device. Note that these parts are in different packaging, different sizes, etc., so I am not even sure how valid any of this data is. Also, the amplitude is a relative number. I have nothing to gauge it against other than its being relative from one device to the next. Also, the probe was moved to detect the peak reading.

XC3190A PQ160AKJ9901 A2025068A; assumed date: 99; fundamental frequency: 16MHz @ -60dB

XC3164A PC84CKG9649 A71686A 4C; assumed date: 96; fundamental frequency: 20MHz @ -40dB

XC3164A-5 PC84C X24961M AIG9406; assumed date: 94; fundamental frequency: 20MHz @ -40dB

XC3120-5 PC84C XG2936M AJG9537; assumed date: 95; fundamental frequency: 20MHz @ -50dB *

* It was very difficult to lay the probe flat onto this device with the PCB it was located on. I suspect that the reading would otherwise have been much higher.

From this I would agree that the basic frequency was the same from at least as far back as

It is interesting that there seems to have been a shift and that the amplitude changed so much, but I don't know if this is any kind of an indicator. After all it is a different package.

- Jim Granville

Thu, Mar 24, 2005 12:49 AM

You may have to document this, for those units in the field, plus if you decide to retro-fit a power removal WDOG, this will be an important number for that design.

-jg

- lecroy7200

Thu, Mar 24, 2005 2:16 AM

Thanks!! You should now have my contact information. Feel free to use this.

I am hopeful it will get resolved one way or another.

I look forward to working with you and Peter on this.

- Peter Alfke

Thu, Mar 24, 2005 4:56 PM

Wow, I was gone for 2 weeks, and here is a 52-thread mushroom. I joined Xilinx in early 1988, when the XC3020 was being announced, and I was responsible for applications, technical support and all device documentation. I also started a quarterly magazine called Xcell (still alive) and wrote a 1.3-page article "The Effect of Marginal Supply Voltage" (Xcell#6, 4Q90, and reprinted in edited form in every databook up to 1994). Don't google it, you get only one valid hit, and it is in Russian. The last paragraph may be relevant here:

"...The XC3000 stays configured for small dips, and is smart enough to reconfigure itself (if a master) or to ask for reconfiguration by pulling INIT and D/P Low (if a slave). XC3000 will not lock up; the user can initiate reconfiguration at any time by pulling D/P Low, or, if D/P is Low already, by forcing a High-to-Low transition on RESET..." NOTE: IT SAYS: HIGH-TO-LOW TRANSITION ON RESET.

Then follows a description of brown-out (as Phil Freidin mentioned) but that only relates to early XC2000, which lack the circuit that responds to the High-to-Low transition on RESET. Once Vcc has dropped below 2 V, these chips need a very low voltage (

- lecroy7200

Thu, Mar 24, 2005 6:59 PM

Seeing that you have decided to continue to post to the public thread rather than contact me directly, I will assume that this is how you wish to handle this issue.

I have read this document and I agree that this is what it states. If you had taken the time to read all of the posts, you would notice that this was one of the first things I had verified.

I am only posting what I am seeing. You can choose to agree or disagree or even roll up your sleeves and check for yourself.

It would be interesting to see how many of the 3000 series devices are sold. Regardless, I really don't care about the company's track record at this point. I am having a specific problem with a specific device and the question is how best to handle it. If Xilinx does not want to be part of the problem solving, this is fine.

Now this is an arrogant statement! You are entitled to your opinions of your customers and yourselves.

You're way behind on what has been done to determine the root cause of this problem. It is a shame that you have decided to take what I would consider the unprofessional route of pointing the finger at your customer rather than trying to understand what has been done to identify the problem. If you do decide that you would like to try to work with us to solve this issue, I am more than willing to share with you everything I have seen to date.

In the meantime I plan to continue with my testing to determine if there is a reliable way to detect the fault.

- Jim Granville

Thu, Mar 24, 2005 7:52 PM

I think Peter posted here, because there was a large thread, and he was back from a 2 week absence.

If you want Xilinx to assist on support for an EOL device, you could try to not annoy them ?

IIRC you have seen this ~6 times. Do you have a 'feel' yet for the correlating cause, over those 6 times - any common event or stimulus ?

Even tho this is an EOL device, the reason to try and nail this is to check that no new devices have the same issue.

-jg

- lecroy7200

Thu, Mar 24, 2005 8:55 PM

I agree that it is a lot to take in. I will not comment on why Peter felt he should post to the group. That was his choice to make. I was under the impression from Austin's last post that we would handle this outside of a public group.

Sorry you feel this way, Jim. The goal here was certainly not to do so. I am only presenting my findings. If my findings do not match up with what Xilinx has published or stated in the group, I will point it out. How they react to that is up to them. If you're referring to my posting on how the hotline was handled, remember that it was Xilinx who asked for this information in a public group. Up till then I only posted that I had been having little to no response from them.

Yes, you are correct in that I have seen the failure six times now. At this time everything still appears to be random, but again I have only duplicated the problem twice since I started posting. It's not a lot of data to draw a conclusion from. I am trying everything I can to cause the parts to get into this mode, but nothing I do seems to affect it. The only thing I have a "feel" about is the internal oscillator dropping out. If I reproduce the problem and the oscillator drops out as before, I will have a lot more confidence that this is why the part cannot recover.

There are two items that need to be addressed. The first and most important is to minimize the effect this will have on our customers. That is why my focus is on finding a way to detect the failure (with no changes to the hardware). The second reason is to minimize our risk in future designs.

- lecroy7200

Mon, Mar 28, 2005 2:02 PM

Several more tests were conducted using the same test configuration. During this test I monitored the state of the done/program pins of all of the devices prior to the failure. The test would read and store the status of the D/P pins, then attempt to program the devices. If it failed to program all eight after five attempts, it would report the original status of the D/P pins, and then report the status of the D/P pins after the programming attempts.
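The test flow described above can be sketched roughly as follows. This is an illustration, not the actual test software: `read_dp` and `program_all` are hypothetical callbacks standing in for the real pin readback and loader routines, and the five-attempt limit is the one stated in the description.

```python
from typing import Callable, List, Optional, Tuple

def run_program_test(read_dp: Callable[[], List[bool]],
                     program_all: Callable[[], bool],
                     max_attempts: int = 5
                     ) -> Optional[Tuple[List[bool], List[bool]]]:
    """Return None if programming succeeds, else (D/P before, D/P after)."""
    dp_before = read_dp()              # snapshot before touching the parts
    for _ in range(max_attempts):
        if program_all():              # True means every device configured
            return None
    dp_after = read_dp()               # state left behind by the failed attempts
    return dp_before, dp_after

if __name__ == "__main__":
    # Simulated failure: programming never succeeds and D/P ends up low.
    readings = iter([[True] * 8, [False] * 8])
    print(run_program_test(lambda: next(readings), lambda: False))
```

Returning the raw pin levels, rather than already interpreted states, makes it easier to check the polarity of the readback independently of the display code.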

I wanted to also collect enough data in an attempt to determine if the failure of the internal oscillator could be duplicated.

I was able to replicate the failure three more times, and it would appear that when the device fails, the initial state of the D/P pin is high. After an attempt was made to program the devices, the D/P pin latched in the low state. It also appears that with every failure something happens to the 16MHz oscillator, in that I no longer see anything in that area. What is interesting is that if the oscillator were dead, I would not expect the D/P pin to latch low. Maybe it is not a sampled input but is truly edge triggered.

Also note that once the power was cycled, in all three cases the oscillator returned to normal and the devices were able to be programmed.

- lecroy7200

Mon, Mar 28, 2005 8:02 PM

I need to retract the above statement. As it turns out, the software that was being used to monitor the status of these pins inverted them prior to displaying them. So, the devices appear to go into the program state.

In this last test I wanted to try to decouple the CORE that was being loaded into the device. For this test, all that was done was to cycle the supply. I used a 5ms off time and cycled at 100Hz. Using the spectrum analyzer, I monitored the 16MHz clock. After about 10 minutes of testing, the oscillator had failed to start. I probed the remaining devices and found that three others had also failed to start. I then started to increase the off time using the one-shot mode. I noted that at about 200ms - 250ms, two of the devices' oscillators restarted. The third device took more than a second of off time before starting.

During the above tests, I noted that the D/P pin was low for all devices, regardless of the state of the oscillator. Also, during the tests, no attempt was made to reprogram the devices. Only the spectrum analyzer with the near-field probe was used to determine if the part had failed.

- Jim Granville

Tue, Mar 29, 2005 8:16 AM

This is multiple devices on one board, or multiple boards being cycled ?

Sounds like you now have a reasonably rapid means of entering the suspect state, and some numbers on Trec. ( which probably also varies with temperature... )

Is 5ms enough time to exit pgm load mode, or is this test removing Vcc before the Load state engine has finished ?

This does sound like a 'sticky trigger' test, in that any of the ~60,000 power cycles that causes an upset, will not clear on the next cycle, as that Toff is < Trec.
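The arithmetic behind the ~60,000 figure, and why a sticky fault makes even a very rare upset visible, can be written out as a short sketch. The cycle rate, off time and test duration are the ones reported for the test; the per-cycle upset probability is purely a made-up illustrative number, since nothing in this thread measures it, and the function names are hypothetical.

```python
def total_cycles(rate_hz: float, minutes: float) -> int:
    """Number of power cycles in a test of the given length."""
    return round(rate_hz * minutes * 60)

def on_time_s(rate_hz: float, off_time_s: float) -> float:
    """Time Vcc is applied in each cycle: the period minus the off time."""
    return 1.0 / rate_hz - off_time_s

def p_at_least_one_upset(p_per_cycle: float, cycles: int) -> float:
    """Chance of one or more upsets, assuming independent power-ups."""
    return 1.0 - (1.0 - p_per_cycle) ** cycles

if __name__ == "__main__":
    n = total_cycles(100, 10)                 # 100Hz for 10 minutes
    print(n, on_time_s(100, 0.005))
    # Hypothetical upset rate of one in 60,000 power-ups:
    print(p_at_least_one_upset(1 / 60000, n))
```

Because the 5ms off time is far shorter than the 200ms or more needed for recovery, an upset on any one cycle persists through all the following cycles, so the test accumulates rare events instead of clearing them.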

-jg

- lecroy7200

Tue, Mar 29, 2005 1:33 PM

Not that I do not appreciate everyone's help in this matter, but I have received several PMs, including from Xilinx tech support, asking if I have tried the following:

- Bring the DONE/PROGB pin low

- Hold RESETB low for at least 6 us

- Start the re-configuration

I am not sure if some people are not able to read the entire thread and that is the cause. The following are from my first and fourth posts:

"Pulling the XC3000's reset low for 10us has no effect. ... The only way to reprogram the part is to power down the IC. "

"The note to your link suggests that setting Reset high for > 6us then setting it and the Prog/Done pin low for > 6us will bring the device back to the clear configuration state. Looking at the loader code, this is pretty much what is being done on every load. The Reset normally idles high and it along with the Program pin are pulled low for 7.5us. I verified this as well. Doing this does not make the device exit this strange mode. So far, the only thing that seems to clear it from this state is a hard power down."
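For reference, the pin sequence described in that second quote can be sketched as below. This is an illustration only: `set_reset`, `set_dp` and `delay_us` are hypothetical callbacks standing in for whatever the loader CPU uses to drive the pins and wait, the 7.5us hold is the value quoted from the loader, and the final release of both pins is an assumption. As reported above, this sequence did not bring the failed parts out of the lockup.

```python
from typing import Callable

def clear_configuration(set_reset: Callable[[bool], None],
                        set_dp: Callable[[bool], None],
                        delay_us: Callable[[float], None],
                        hold_us: float = 7.5) -> None:
    """Drive Reset high, then Reset and Prog/Done low, each for hold_us."""
    if hold_us <= 6.0:
        raise ValueError("the quoted note asks for more than 6us")
    set_reset(True)        # Reset idles high
    delay_us(hold_us)
    set_reset(False)       # pull Reset and Prog/Done low together
    set_dp(False)
    delay_us(hold_us)
    set_reset(True)        # assumption: release both pins before loading
    set_dp(True)

if __name__ == "__main__":
    # Log the pin activity instead of touching hardware.
    log = []
    clear_configuration(lambda v: log.append(("reset", v)),
                        lambda v: log.append(("dp", v)),
                        lambda us: log.append(("wait_us", us)))
    print(log)
```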