Issues with a batch of Virtex-II chips


IgI 21 years ago

Hi!

I'm using a Virtex-II (XC2V1000-FF896-4C) in one of our products, which we have been selling for over 3 years. Recently we got a "new" batch of Virtex-II chips and problems started to arise. So far I have isolated PCBs with three different batches of Virtex-II chips:

Batch A: XC2V1000 FF896AFT0301 F1247582A 4C Philippines

Batch B: XC2V1000 FF896AGT0409 D2169507A 4C Taiwan

Batch C: XC2V1000 FF896AFT0205 F1205613A 4C Philippines

All the chips in batch A have the suffix AFT0301, all the chips in batch B have the suffix AGT0409, and so on. PCBs with chips from batches B and C are working fine; on the other hand, none of the 42 PCBs where chips from batch A are used are working. The PCBs are the same (same revision) for all the products, and all other components (ZBTRAMs, DDR SDRAMs, passive components, ...) are the same. All voltages are within the safe margins and all input clocks are clean. All the affected boards pass the JTAG test; in other words, we didn't find any soldering errors, short circuits, vias without metallization, wrong resistors or capacitors, incorrectly oriented diodes or capacitors, or any other error we could think of. We got all the chips in a sealed package. The PCBs were tested at different temperatures (from 8 to 46 degrees Celsius). Only the PCBs with chips from batch A don't work. Let me explain precisely what is not working.

I'm using 6 DCMs to generate clocks for ZBTRAM, DDR, System, ConfigBus, ... and two DCMs don't set the locked signal after I release them sequentially from reset. I don't know whether other parts of the design (the parts which don't use the ZBTRAM clock) fail as well, because the missing clock is a fatal error and I didn't have the time to investigate further in that direction. The ZBTRAM runs at 120MHz, the DDR at 166MHz, System at 100MHz, ConfigBus at 10MHz, ...

We are currently using ISE 5.2 SP3 for this design. I have verified the bitstream by reading it back from the chip and it's OK. Two coworkers, the guys from production and I have been working on this problem for the last two days and we are almost out of ideas for what else we could try, except replacing the problematic chips with non-problematic ones. I can't use ISE 6.1 or newer because either the routing is not successful or ISE simply doesn't meet the timing constraints (the chip is 99% full).

Have you experienced anything similar in the past? How did you solve the problem? Do you have any ideas or suggestions for what else I could try? I couldn't find any document on the Xilinx web site explaining the chip markings in detail. I would like to know what AFT0301 stands for. Is this the production date, production line, factory code...? I would also like to know when the chips were manufactured (how old are they?).

I guess we'll have a competition in the company next week. The goal will be: who can throw a Virtex-II the farthest... OK, I'm just joking, but I needed to vent... argh...

Igor Bizjak


Vladislav Muravin 21 years ago

Igor,

Wow, this is very similar to what I experienced a couple of days ago. I did not, however, try to look for different batches as you did, because I think all of our FPGAs are XC2V2000/3000 676AGT0405.

So in my design, I am cascading 4 DCMs. But I am not using their LOCK indication, because I am only interested in some large fractional M/N frequency synthesis, and the input frequency is less than 24 MHz (I assume that you are not using any DCM outputs for an input clock of less than 24 MHz).

The problem was that the last DCM was not generating the desired frequency, but exactly either 1/8 or 1/16 of it. It looked as if the reset had been applied for too short a period of time (but it was applied for more than 3-4 input clocks!).

So this problem was resolved by defining a reset register for each PLL, and asserting/deasserting the resets in chain, either by software or after some delay implemented in the FPGA by a large counter.
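A minimal Python sketch of that chained bring-up, written against a toy model rather than real hardware. `SimulatedDcm`, `min_reset_cycles`, `lock_delay_cycles` and `bring_up_chain` are all hypothetical names invented for illustration; on a real board the `assert_reset`/`release_reset` calls would be writes to the per-DCM reset registers, and the cycle counts would come from the data sheet, not from this sketch.

```python
from dataclasses import dataclass


@dataclass
class SimulatedDcm:
    """Toy stand-in for one DCM stage; not a model of real silicon."""
    name: str
    min_reset_cycles: int   # assumption: shorter reset pulses never lead to lock
    lock_delay_cycles: int  # assumption: cycles from reset release until lock
    in_reset: bool = True
    input_valid: bool = False
    reset_held: int = 0
    since_release: int = 0

    def assert_reset(self) -> None:
        self.in_reset = True
        self.reset_held = 0

    def release_reset(self) -> None:
        self.in_reset = False
        self.since_release = 0

    def tick(self) -> None:
        """Advance the toy model by one input clock cycle."""
        if self.in_reset:
            self.reset_held += 1
        else:
            self.since_release += 1

    @property
    def locked(self) -> bool:
        return (not self.in_reset
                and self.input_valid
                and self.reset_held >= self.min_reset_cycles
                and self.since_release >= self.lock_delay_cycles)


def bring_up_chain(dcms, reset_cycles, lock_timeout_cycles):
    """Reset and release each stage in order, waiting for the previous one.

    Returns the stage names in the order they locked and raises
    TimeoutError if a stage does not lock within the timeout.
    """
    for dcm in dcms:
        dcm.assert_reset()              # hold the whole chain in reset first
    locked_order = []
    for dcm in dcms:
        dcm.input_valid = True          # upstream clock is stable by now
        dcm.assert_reset()              # restart the reset timer on a valid input
        for _ in range(reset_cycles):   # the "large counter" delay
            dcm.tick()
        dcm.release_reset()
        for _ in range(lock_timeout_cycles):
            dcm.tick()
            if dcm.locked:
                break
        else:
            raise TimeoutError(f"{dcm.name} did not lock")
        locked_order.append(dcm.name)
    return locked_order


chain = [SimulatedDcm(f"dcm{i}", min_reset_cycles=3, lock_delay_cycles=10)
         for i in range(4)]
print(bring_up_chain(chain, reset_cycles=8, lock_timeout_cycles=100))
```

The point of the structure is only that a downstream stage is never released before the stage feeding it is stable, and that the reset width is a parameter that can be lengthened without touching the rest of the sequence.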

I assume this is different from what you've experienced, but I hope this helps.

Sincerely,

Vladislav Muravin

Senior FPGA Design Engineer

Advantech AMT (Advanced Microwave Technologies)

657 Orly Avenue

Dorval H9P 1G1

Quebec, Canada

Tel: (514) 420-0045 ext. 240

Fax: (514) 420-0073


Finally, i noted that


Austin Lesea 21 years ago

Igor,

Any reason why you haven't opened a web-case? Or called the hotline?

With your "lines down" situation, you should be moved to the head of the line, and be given the highest level of service.

Let me know,

Austin


IgI 21 years ago


10MHz clock is generated externally.


Each DCM has its own reset line connected to a reset register and is asserted/deasserted by software.

Thanks for sharing your experience, Igor Bizjak


IgI 21 years ago

No, there isn't any reason, I simply forgot. We had to solve the problem as fast as we could, because the customers are waiting for the products. And the fastest way was to exchange the chips with the "good" ones. I still have 5 PCBs with problematic chips. We will analyze them further on Monday. If we don't come up with some reasonable explanation, I will open a web-case.

Igor Bizjak


austin 21 years ago

Let me know,

I'd like to think we can look these things up very quickly to at least let you know which fab, which assembly (packaging house), etc.

Austin

Brian Drummond 21 years ago

Have you re-run timing analysis on the 5.2 design, but using the latest timing analyser and latest speed files?

Sometimes the speed files are changed to reflect new information about the devices ... usually in the "right" direction. But if the old (formerly successful) design fails with new speed files, that might point you towards a solution.

With 6.1, have you tried MPPR (multi-pass placement and routing)? Sometimes modifying the placement (in FPGA editor) of failing paths and re-running "re-entrant routing" can fix problems, if there are only a small number of failing paths.

- Brian


IgI 21 years ago

No, because I don't think there's any timing issue here. The logic is trivial and runs at low speed. We are using the same "clock generation" module in several other designs, without any issues. We have products running 24/7 for two years now without a single issue. As I stated before, the problem appeared only with the selected chips.

But, I will test our new Virtex-II designs with the latest timing analyzer and latest speed files as you suggested. It's a good idea for new designs.

Yes, I have. I tried 6.1, 6.2 and 6.3. It's always the same story. Placer/Router does a lousy job. Either the constraints can't be met or the router can't connect all the nets. ISE 5.2 SP3 completes without any errors and reports 7 logic levels for the constraint. On the other hand ISE 6.x reports 16 logic levels for the same constraint.

In my experience (for the Virtex-II family), if the design takes less than ~90% of chip resources then the results of ISE 6.x are similar to those of ISE 5.2 SP3, sometimes even better, but as soon as the design takes more than 95% of all chip resources ISE 6.x gives up. Similarly, I still use ISE 3.3 for SpartanXL and Spartan2 designs, because ISE 4.2 or newer doesn't produce the desired results. I know a lot depends on the synthesis tool (I'm using Synplicity)...

Thanks for your suggestions, Igor Bizjak


Brian Drummond 21 years ago

Maybe not, unless there are hold time or skew issues, not properly covered by older speed files. I would try the newer speed files on the "suspect" design just to check.

That can be illusory, if 9+ of those 16 levels are carry logic. It may reflect relatively small differences in placement or routing getting on/off the carry chain.

But I have found (a) a LUT connected to a long carry chain but placed on the other side of the chip ... and (with a heavily floorplanned design, where the placer can't do that) (b) a signal taking 3ns to get from one CLB to its immediate neighbour.

The former (if an isolated incident) can be fixed in FPGA editor, the latter either reflects severe congestion (whatever happened to "view/congestion map" in the floorplanner?) or a very lazy router.

Interesting. I didn't know that about the 5.x-6.x problems, but Ray Andraka has commented on the relative performance of 3.3 vs later in the past. (Google may help a little) I'm still using 3.3 in "production"!

My experience so far with 6.x (Webpack) is that it will never meet reasonable constraints, but radically overconstraining it will improve results.

For example, if I want 10 ns, and ask for it, I get 10.5 ns. But if I ask for 9 ns I get 9.8, (or 10.1) and if I ask for 8 ns I get 9.5 ns... I just made up those numbers but they represent the trend I've seen. If the resulting design passes timing analysis at 10.0 ns, I can't see any reason not to use it...
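That trend suggests a simple selection rule, sketched here in Python: sweep the requested constraint, record what the tools actually achieved, and keep the loosest request whose result still passes timing analysis at the real target. The function name `loosest_passing_request` is hypothetical, and the data are just the made-up illustrative numbers from the paragraph above, not measurements.

```python
def loosest_passing_request(results, target_ns):
    """Pick the least aggressive constraint whose achieved period meets the target.

    results maps requested period (ns) -> achieved period (ns) from a sweep
    of place-and-route runs. Returns None if no run met the target.
    """
    passing = [requested for requested, achieved in results.items()
               if achieved <= target_ns]
    return max(passing) if passing else None


# Illustrative numbers only, taken from the made-up example in the text.
trend = {10.0: 10.5, 9.0: 9.8, 8.0: 9.5}

print(loosest_passing_request(trend, target_ns=10.0))
```

With these numbers the rule picks the 9 ns request, since asking for 10 ns misses the target and asking for 8 ns over-constrains more than is needed.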

Thanks. I've not pushed such high resource usages, so it's interesting to hear tales from people pushing the chips hard in other respects.

- Brian


Austin Lesea 21 years ago

All,

Igor has his case now submitted, and it was escalated due to its nature (basically saying "lines down" makes that happen).

As of 8:30 AM PST 2/22/2005 here in San Jose we have folks on it.

Thanks to all who posted. For those interested, I will probably post the results here (if Igor agrees) as there seems to be some interest in lot related failures.

Generally speaking, lot related failures are almost always design related: either the lot silicon is a little faster, or a little slower (but within spec) than the previous lot, and an unconstrained timing path doesn't work. Sometimes IOs are a little stronger, or a little weaker, and that too is within spec but makes a difference in a design.

Fabric speed was the cause in the case of the customer who designed their own FIFO (and didn't understand synchronization circuits) and then saw "lot" related problems.

In this case, we have just started, so Igor will learn fairly quickly what the differences are between the lots, and we will help resolve what the cause of the problem is, and provide solutions.

Thanks again to all who have interest in this sort of posting, as it gives us a chance to educate folks on the services we offer (the hotline), the escalation procedures for hot cases (lines down), and the nature of this particular kind of problem, and the types of likely resolutions we often find.

In no way am I implying that Igor has a funny path in his design: I am only suggesting that this is often our experience. Rarely (VERY RARELY) we have lot quality problems, test escapes, etc. that all manufacturers occasionally have when something doesn't go right in the test group. Of course, each time that happens, it is cause for reviews of quality and procedures so we never make that mistake again!

So to all of you who think you might have a lot quality problem, again, that is so rare that I only mention it here to be accurate and honest.

Often mentioning something in the news group is like describing a new rare illness to a hypochondriac: suddenly everyone thinks they are sick with the new rare disease!

(In which case it isn't rare anymore....)

Austin
