Weird JTAG lockup issue, where is the BUG?

- Antti

Sun, Jul 9, 2006 7:50 AM

Hi

I have several Spartan3 boards that have a very weird issue. When they are configured with one specific VHDL design using Impact with verify off, the first programming attempt fails (status: fail with CRC check!), the JTAG chain is then reported broken before the FPGA, and further configuration or even JTAG IDCODE reading is not possible until the FPGA is completely powered off. With the Impact verify option on, however, the same bitstream can be used to configure the boards multiple times and the JTAG lockup doesn't happen. It is not related to a bad bitstream, because the VHDL design (a LEON3 system) shows the same behaviour when compiled for a different FPGA (S3-1500 or S3-4000). The boards in question (2 different PCBs) seem to work with all other designs I have tested.

To my understanding the JTAG TAP controller should be a function block completely separate from the FPGA fabric, so no matter what is loaded as FPGA config, it should not make the JTAG TAP unscannable. So the issue could only be related to power supply behaviour, maybe some voltage spike at FPGA startup?

Any ideas what to test or where to look? I would really like to get to the bottom of the problem and understand how a LEON3 design can make the JTAG chain die (this is what it looks like at the moment).

The FPGAs on the boards where I see this behaviour have the date codes mentioned in

formatting link

but I don't think this could be the issue?

Antti

- Nico Coesel

Sun, Jul 9, 2006 3:36 PM

I've had this problem with Spartan2 FPGAs. I even cooked a few! As far as I could trace the problem, it had to do with power supply current capability and bypassing. Sometimes the FPGA will draw a huge amount of current during configuration. If the power supply system (including the bypass capacitors) can't supply this current, you'll get latch-up in the FPGA.

--
Reply to nico@nctdevpuntnl (punt=.)
Businesses and shops can be found at www.adresboekje.nl

- Antti

Sun, Jul 9, 2006 4:30 PM

Nico Coesel wrote:

Hi, thanks for the answer,

and yes, that is what I also think the problem could be.

But I assumed the Spartan 3 has no special requirement for huge startup currents.

Both the 1.2V and 2.5V power supplies are 6A step-downs from LT and look like they were really designed by the book. Gosh, I really hate it if I need to troubleshoot them.

I still wonder why the latch-up never happens when I select "verify on" in Impact!?

I guess I need to set up a DSO trigger on DONE=1 and monitor all the supplies at the transition time.

Antti

- Rob

Sun, Jul 9, 2006 6:24 PM

I'm not that familiar with Xilinx's FPGAs, but I did have an issue with an Altera FPGA that turned out to be power supply related. The problem was that the power-up configuration was unstable: sometimes it would work and other times it wouldn't. But if I powered up and then initiated a configuration (from an on-board push-button), it always worked. This led me to look at the power rails. In my case, I had a power supply that was generating a non-monotonic rise on VCCint. Once I fixed the rise so that it was smooth, the problem went away.

Can you initiate, or re-initiate, the configuration cycle after you are powered up and the voltage rails are stable? If so, try it and see what happens. It may give you another clue.
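If you capture the rail with a scope and export the samples, a quick check for a non-monotonic rise could look like the sketch below. This is only an illustration: the function name `first_dip`, the tolerance parameter and the sample values are made up for the example and are not measurements from any board in this thread.

```python
def first_dip(samples_v, tolerance_v=0.0):
    """Return the index of the first sample that drops below the running
    peak by more than tolerance_v, or None if the ramp never dips."""
    peak = float("-inf")
    for index, voltage in enumerate(samples_v):
        if voltage < peak - tolerance_v:
            return index
        peak = max(peak, voltage)
    return None


# Hypothetical captures of a 1.2 V rail during power-up (illustrative only).
smooth_ramp = [0.0, 0.3, 0.6, 0.9, 1.2, 1.2]
dipping_ramp = [0.0, 0.3, 0.6, 0.5, 0.9, 1.2]

print(first_dip(smooth_ramp))   # no dip found
print(first_dip(dipping_ramp))  # index of the sample where the rail sags
```

A non-zero `tolerance_v` would be the way to ignore scope noise, with the threshold chosen from the supply's ripple spec rather than guessed.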

Take care, Rob

- Antti

Sun, Jul 9, 2006 6:58 PM

Rob wrote:

Hi Rob,

1) I can configure and reconfigure the board with many, many different designs and never see an issue at all.

2) When using one specific design/bitstream, I can configure and reconfigure any number of times when Xilinx Impact is set to perform configure and verify. Impact even reports programming and verify success!!

3) Using the same bitstream and Impact with configure but no verify, the first configuration attempt reports a configure error (CRC error), and after that the JTAG chain is reported as broken before the FPGA. The power supplies are still at the proper voltage and stable, and the FPGA does not get hot. But it needs to be power cycled for the JTAG TAP to come alive again.

I understand that the power supply is the most likely issue, but why does the issue never happen when the JTAG operation is set to configure_and_verify, while the JTAG TAP locks up 100% of the time when attempting to configure without verify?

I bet this remains a "Xilinx mystery" forever.

I bet this remains "Xilinx mystery" forever.

Antti

- Austin Lesea

Sun, Jul 9, 2006 7:01 PM

Antti,

All devices after Virtex E (Spartan 2E) require no extra current beyond the minimum power-on current specified in the data sheet.

Is it possible that the configuration you are loading requires more power than you have available?

I have seen DONE go high, only for the power supply to crash and fold back, and the part to start reconfiguring again.

As for the JTAG state machine, it is definitely possible for it to enter a "bad" state from which it may never recover. It is only with Virtex 4, and now Virtex 5, that we have worked carefully on the state machines to harden them against soft errors, which might place them in an unrecoverable state. Irradiation with neutrons can quickly find those hidden bad states!

Austin

- Antti

Sun, Jul 9, 2006 7:41 PM

Austin Lesea wrote:

Hi Austin,

I also thought there is no extra power surge at configuration on S3.

I do not think the design takes more power than is available.

I was just porting the LEON3 design onto some new boards to have more designs for the board test. To my great surprise the LEON3 design never started up correctly. I made the design smaller by disabling the MMU and caches, and the problem persisted. The design uses 13% of an S3-4000 and is set to run at 25MHz.

All power supplies are rated for 6A. The same board successfully runs a Microblaze design with two separate SDRAM controllers, Ethernet and TFT display cores at 72MHz, and I would bet that design definitely burns more dynamic current than the plain vanilla LEON3 design. OK, I can't measure the LEON3 design's power as it never comes up live. Wrong, I can: I have one Memec board with an S3-1500, so I can load the design that fails on my board onto the Memec board and measure its current, and then measure the current on my boards with some designs that do work.

That should tell whether the failing boards work at the current that the LEON3 design requires.

As for the JTAG dead states - the fact that Xilinx has only ironed them out for V4 and V5 really surprises me. A JTAG TAP isn't rocket science.

I haven't been able to test with non-JTAG configuration methods yet, so maybe the whole issue is only with the Impact software. The JTAG chain contains an Atmel AT91SAM7S ARM with JTAGSEL=0, i.e. the ARM ICE JTAG chain is selected. It is remotely possible that the ARM JTAG is getting messed up somehow. This could even explain why there is a difference between configuring with verify on and off. Would that mean I have stumbled onto some very nasty Impact bug?

I know that the ARM ICE JTAG is not 100% proper JTAG, but as long as it... hmm, maybe I solved the issue at this very moment. The ARM JTAG has a bug that disturbs some JTAG operations when the JTAG clock is faster than the system clock, and since the Atmel ARM powers up with an internal 128kHz clock, it is remotely possible that it gets upset somehow. As I had not seen a problem so far, I assumed the ARM BYPASS also works at higher speeds (but all assumptions are wrong).

If I think about it, it sounds like this must be the problem. It is just weird that every other design has worked so far and one design doesn't.

Antti

- Rob

Mon, Jul 10, 2006 1:52 AM

I didn't think that was the problem, but I thought I would throw it out there. Bizarre problem indeed. Please post when you find the answer.

- Antti

Mon, Jul 10, 2006 6:59 AM

Rob wrote:

[snip]

Mystery solved!

The issue is a bug in the ARM core netlist that Atmel licenses for the AT91SAM7S!

The problem was in no way related to the Xilinx FPGA or the power supplies, despite the weird 'effect' that the issue was only visible with one specific FPGA design, only with Impact, and only when the configuration attempt was done with the verify OFF setting.

When the AT91SAM7S has its PLL enabled, the issue with the same 'bad' bitstream doesn't occur anymore.

Antti

- David R Brooks

Mon, Jul 10, 2006 9:34 AM

I am surprised there: I thought the JTAG standard had defined that state machine so that a limited number (5, from memory) of clocks with TMS=1 would force it out of any state.
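For reference, the "5 clocks with TMS=1" property can be checked with a small model of the 16-state TAP controller from IEEE 1149.1. This is a sketch of the standard's state diagram only: the Python names are made up for the example, and it models an ideal TAP, not the behaviour of any particular device in this thread.

```python
# Next-state table of the IEEE 1149.1 TAP controller.
# Each entry maps state -> (next state if TMS=0, next state if TMS=1).
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE": ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN": ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR": ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR": ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR": ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR": ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR": ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR": ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN": ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR": ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR": ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR": ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR": ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR": ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR": ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def clock_tap(state, tms_bits):
    """Advance the TAP by one TCK per entry in tms_bits and return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


def clocks_to_reset(state):
    """Count the TCK cycles with TMS=1 needed to reach Test-Logic-Reset."""
    count = 0
    while state != "TEST_LOGIC_RESET":
        state = TAP_NEXT[state][1]
        count += 1
    return count


worst_case = max(clocks_to_reset(state) for state in TAP_NEXT)
print(worst_case)
```

The model only says what a TAP that samples TMS correctly on every TCK edge will do. A device that misses edges because TCK is too fast for it is outside what this table can describe.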

- Antti

Mon, Jul 10, 2006 10:06 AM

David R Brooks wrote:

Hi David,

Yes, you are correct - 5 TCK cycles with TMS=1 *MUST* bring the TAP to the TLR state. But the issue was the max TCK frequency that another JTAG device in the chain was able to handle. It happened to be 100kHz, which should have made all JTAG comms fail, but unfortunately did not. I.e. the 'bad JTAG' device operated "good enough" not to disturb the other devices, unless some sequence that depended on the bitstream loaded into the FPGA made it really upset.

As I did not see any problems with several designs, I wrongly assumed the 'slow max TCK' device was working properly at TCK=200kHz (Impact Cable III) when the only command sent to it was BYPASS. This wasn't the case. A bitter experience - it cost me several hours wasted on unnecessary troubleshooting.

So beware - if the JTAG chain includes devices with an ARM core, make sure the ARM clock is running at the desired frequency in order for the JTAG to work properly. I think some newer ARM cores have that bug fixed, but there are several ARM licensees still using the buggy ARM IP.
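A pre-flight check along these lines would have flagged the chain before programming. It is a minimal sketch under stated assumptions: the helper name, the `max_ratio` parameter and the 48MHz PLL figure are hypothetical, the 128kHz power-up clock and 200kHz cable TCK are the values quoted earlier in this thread, and the real TCK-to-core-clock limit for a given ARM core has to come from its datasheet.

```python
def devices_too_slow(chain, tck_hz, max_ratio=1.0):
    """Return the names of chain devices whose core clock is too slow for tck_hz.

    chain maps device name -> core clock in Hz, or None when the device's TAP
    does not depend on a core clock. max_ratio is the assumed highest allowed
    TCK / core-clock ratio (1.0 means TCK must not exceed the core clock).
    """
    return [
        name
        for name, core_clk_hz in chain.items()
        if core_clk_hz is not None and tck_hz > core_clk_hz * max_ratio
    ]


CABLE_TCK_HZ = 200_000  # Impact Cable III TCK as quoted in this thread

# ARM still on its power-up clock (128 kHz as quoted in this thread).
chain_at_power_up = {"AT91SAM7S": 128_000, "Spartan-3": None}
# ARM with PLL enabled; 48 MHz is a hypothetical value for illustration.
chain_with_pll = {"AT91SAM7S": 48_000_000, "Spartan-3": None}

print(devices_too_slow(chain_at_power_up, CABLE_TCK_HZ))
print(devices_too_slow(chain_with_pll, CABLE_TCK_HZ))
```

Setting `max_ratio` below 1.0 would model cores that need TCK to be a fraction of the core clock.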

Antti