40% less SEU's! in V4: another good reason to choose Xilinx

- Austin Lesea

Fri, May 6, 2005 9:12 PM

All,

Latest update on atmospheric upsets:

formatting link

Virtex 4 memory cells are almost twice as hard to upset as Virtex II.

We promised to reduce our susceptibility to atmospheric upsets, and we are fulfilling that promise.

Not all semi companies have made this choice: it is hard to do, and increases area.

I know of work being done at Intel, and Cypress to improve, but nowhere else.

It is highly unlikely that competing 90nm FPGA companies have done anything at all (except get a lot worse).

The ASIC vendors (ASSP, hardened solutions, etc.) also have not made this choice (as it would really blow up their area a lot). Thus, 90nm ASIC technology has a typical SRAM FIT rate of 5,000 FIT/Mb (from neutron error rate specifications for a typical 90nm ASIC SRAM cell), as compared to our less than 250 FIT/Mb.
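
To put those FIT/Mb figures in everyday terms: one FIT is one failure per 10^9 device-hours, and the rate scales linearly with the amount of memory. Here is a minimal sketch in Python. The 10 Mb memory size is a hypothetical chosen for illustration, not a figure from this thread, and 250 FIT/Mb is used as the quoted upper bound.

```python
HOURS_PER_YEAR = 24 * 365
FIT_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def upsets_per_year(fit_per_mb: float, megabits: float) -> float:
    """Expected soft errors per year for a memory of the given size."""
    return fit_per_mb * megabits * HOURS_PER_YEAR / FIT_HOURS


def years_between_upsets(fit_per_mb: float, megabits: float) -> float:
    """Mean time between upsets, in years (reciprocal of the yearly rate)."""
    return 1.0 / upsets_per_year(fit_per_mb, megabits)


MEGABITS = 10.0  # hypothetical memory size, for illustration only

for label, fit in (("5,000 FIT/Mb (quoted ASIC SRAM)", 5000.0),
                   ("250 FIT/Mb (quoted V4 upper bound)", 250.0)):
    print(f"{label}: {upsets_per_year(fit, MEGABITS):.4f} upsets/year, "
          f"one every {years_between_upsets(fit, MEGABITS):.1f} years")
```

Because the relationship is linear, the 20x ratio between the two quoted rates carries straight through to the mean time between upsets, whatever memory size is assumed.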

The ASIC DFFs, logic, etc. are also fantastic neutron detectors: the resulting hardness of the Virtex 4 is on par with, or better than, a full custom 90nm ASIC doing the same task!

Unfortunately, no data is available on ASICs, as the vendors just don't know. To test, one would have to place the part in a neutron beam, while running, which is rather hard to do with a complete system ...

Caveat Emptor!

Virtex 4, on the other hand, combines this with built-in ECC for the BRAM and built-in FRAME_ECC for the configuration, which allows you to select whatever level of system hardness to soft errors is desired.

Austin

- Ben Twijnstra

Fri, May 6, 2005 10:47 PM

Hi Austin,

I'm really happy for you.

Are there any V4s without the money-eating ECC stuff for us terrestrials?

Ben

- Peter Alfke

Fri, May 6, 2005 11:41 PM

Nice try! ECC at the 64-bit parallel level eats only 8 extra bits, and our BlockRAMs have had those traditional parity bits all along. No extra storage cost. Just some clever partitioning...

"The best things in life are (almost) free"

Peter Alfke

- Thomas Rudloff

Sat, May 7, 2005 12:08 AM

Hi Peter,

I learned about SEUs that you can design with redundancy (three times the logic, if you can convince your compiler not to remove the redundant logic). This will keep the user logic safe. But is there a way to keep the configuration safe, since an upset there changes logic and routing?

Regards, Thomas

- austin

Sat, May 7, 2005 12:32 AM

And,

The frame_ecc is 12 bits per 1312, or less than 1% overhead.
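
The overhead figures quoted in this thread can be checked with a few lines of Python. The sketch below assumes a generic Hamming single-error-correct, double-error-detect (SECDED) construction. Whether the hard ECC logic uses exactly this construction is an assumption, and so is the reading that the 1312-bit figure includes the 12 check bits (the count comes out the same either way).

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SECDED code protecting `data_bits` bits.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1 for
    single-error correction; one extra overall parity bit adds double-error
    detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


# BRAM ECC: 64 data bits per word
print(secded_check_bits(64))         # 8
print(f"{8 / 64:.1%}")               # 12.5%

# Configuration frame: 12 check bits per 1312 bits
print(secded_check_bits(1312 - 12))  # 12
print(secded_check_bits(1312))       # 12
print(f"{12 / 1312:.2%}")            # 0.91%
```

The wider the protected word, the smaller the relative cost, which is why the per-frame overhead lands under 1% while the 64-bit BRAM word costs 12.5% (already covered by the existing parity bits).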

Austin

- austin

Sat, May 7, 2005 12:46 AM

Thomas,

Yes. The Xilinx TMR (XTMR) tool automatically converts the design, as designed and placed, into a full TMR design, taking advantage of our structure so that no single config bit can upset the function.
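
For readers new to TMR, the voting idea itself is small. The following Python sketch shows a generic bitwise majority voter over three copies of a result. It only illustrates the principle; it is not the output of the XTMR tool, and the names are hypothetical.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit takes the value held by at
    least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011_0010

# One of the three copies suffers a single flipped bit.
copies = [golden, golden ^ 0b0000_1000, golden]
assert majority(*copies) == golden

# Flips in two different copies are still masked, provided they hit
# different bit positions.
copies = [golden ^ 0b0000_0001, golden ^ 0b0100_0000, golden]
assert majority(*copies) == golden

# Two copies wrong at the SAME bit position defeats the vote.
copies = [golden ^ 0b0000_0001, golden ^ 0b0000_0001, golden]
assert majority(*copies) != golden

print("voter behaves as expected")
```

The last case is why TMR is normally paired with some way of repairing the upset copy before a second hit accumulates.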

FRAME_ECC allows a design to do redundancy in time (RIT).

Calculate what you need, then check whether an error has occurred. If not, go on. If an error has occurred, fix the error, step back, and recalculate.

Repeat.
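
The calculate / check / fix / step back / recalculate loop above can be sketched in software terms as follows. This is only an illustration in Python: `FakeEccMonitor` is a hypothetical stand-in for an error flag such as the one FRAME_ECC provides, not a real Xilinx API, and the upsets are scripted so the run is repeatable.

```python
class FakeEccMonitor:
    """Hypothetical error flag: reports an upset on the listed check numbers."""

    def __init__(self, upset_on_checks):
        self._upsets = set(upset_on_checks)
        self._checks = 0
        self.repairs = 0

    def error_detected(self) -> bool:
        self._checks += 1
        return self._checks in self._upsets

    def repair(self) -> None:
        # Stands in for flipping the upset bit back.
        self.repairs += 1


def run_with_rit(steps, state, monitor, max_retries=3):
    """Run each step; commit its result only if no error was flagged.
    On an error: repair, step back to the last good state, recalculate."""
    for step in steps:
        for _attempt in range(max_retries + 1):
            candidate = step(state)       # calculate from the last good state
            if not monitor.error_detected():
                state = candidate         # no error: commit and go on
                break
            monitor.repair()              # error: fix it, then retry the step
        else:
            raise RuntimeError("error persisted after retries")
    return state


steps = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]
monitor = FakeEccMonitor(upset_on_checks={2})  # upset during the second step
result = run_with_rit(steps, 5, monitor)
print(result, monitor.repairs)  # 9 1
```

The cost is time rather than area: each flagged step is simply run again, which is the trade against triplicating logic.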

Between XTMR, which allows you to choose only those critical areas that need triplication for redundancy in space (RIS), and FRAME_ECC, which enables redundancy in time, an arbitrarily safe system can be implemented.

For example:

Simplest - do nothing. With an effective system FIT rate of 20 FIT/Mb of config memory, this may be so far down in the noise, it isn't an issue.

Next step - when the FRAME_ECC indicates an error, reconfigure the chip. This creates some unavailability, but is able to keep any errors from propagating any further. Or back up, and recalculate the result after flipping the bit back (RIT).

A little better - when an error is detected, correct it. Since from 1 in 10 to 1 in 80 flips actually hits something that matters (real data from real customers), there is a roughly 1% to 10% chance that a flip could ever cause an error, and since you fix it in less than 200 ms (for the largest part), the probability that in that 200 ms something critical changes, and it mattered, is even tinier (like maybe one in a thousand chance). And, if you add RIT to this, it is even more bulletproof.

Even better - since this is a system that requires a hot spare (at this point, we are talking about 99.9995% available systems where the hard fail rate kills you first) you detect a soft error, and switch to the redundant unit immediately while you fix the bit, and do a system recheck.

Best - triplicate critical elements AND have a hot standby that can be switched to in case of soft error detect.

All of the above are enabled in V4 -- it is up to you to set your FIT rate goals, and then fulfill them. Can't do that with the competition -- they just don't have all the options we do. For example, a complete reconfig takes them down, but we can reconfig while still operating, and flip the upset bit back.

Austin

- Piotr Wyderski

Sat, May 7, 2005 1:00 AM

BTW, is it possible to order a special, rad-hard version of a modern medium-complexity FPGA chip, say, comparable with Cyclone 1C3? Would it mean a complete redesign of the chip internals or is it relatively simple?

Best regards Piotr Wyderski

- austin

Sat, May 7, 2005 2:17 AM

Piotr,

Very observant question.

For atmospheric upsets, it is a relatively easy process to change all memory cells to SERT or DICE single-upset-hardened cells, with an increase in area as you go from 6T cells to 12T and 16T cells. In the ASMBL columnar architecture this is actually trivial to do. But who will pay for this?

Without the ASMBL architecture, it requires a complete relayout.

If there are ways to design that result in the desired system FIT rate, one must compare the costs of the extra logic with the costs of hardening the design (hard IP vs. soft IP).

I believe the answer is a judicious combination of both: make the basic FIT rate better, and also provide some degree of hardening without incurring too much cost.

Austin

- Ben Twijnstra

Sat, May 7, 2005 3:46 PM

Hi Peter Alfke,

There are additional bit lanes in Altera devices too.

So what does this add then? Did you add optional hard ECC generation/detection blocks to these 9th/18th bits? Or does the user have to code this him/herself?

If it's an optional hard macro we're looking at 2 configurable muxes and an ECC generator on the input side, and 2 configurable muxes and an ECC checker on the output side for every set of 9 bits.

Also, do the V4s run continuous config sanity checks like Altera's devices?

Best regards,

Ben

- austin

Sat, May 7, 2005 4:06 PM

Ben,

See below,

Austin

To do what?

We have hard ECC, a 72/64 code, that can be instantiated to provide single bit error correction and double bit error detection with no soft IP required.
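
To make "72/64 with single bit correction and double bit detection" concrete, here is a sketch of a generic extended Hamming (72,64) code in Python. The bit layout (check bits at power-of-two positions, overall parity at position 0) is a textbook choice and an assumption on my part; the actual layout of the hard ECC block is not described in this thread.

```python
N = 72                                    # total codeword bits
CHECK_POS = [1 << i for i in range(7)]    # Hamming check bits at 1, 2, 4, ..., 64
DATA_POS = [p for p in range(1, N) if p not in CHECK_POS]  # 64 data positions


def _syndrome(word: int) -> int:
    """XOR of the positions (1..71) of all set bits; zero for a valid word."""
    s = 0
    for pos in range(1, N):
        if (word >> pos) & 1:
            s ^= pos
    return s


def _odd_parity(word: int) -> bool:
    return bin(word).count("1") % 2 == 1


def encode(data: int) -> int:
    """Spread 64 data bits over the data positions and add 8 check bits."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = _syndrome(word)
    for cp in CHECK_POS:                  # set check bits so the syndrome is zero
        if s & cp:
            word |= 1 << cp
    if _odd_parity(word):                 # position 0 makes overall parity even
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); data is None when the word cannot be trusted."""
    s = _syndrome(word)
    if _odd_parity(word):                 # odd parity: assume a single flipped bit
        if s >= N:
            return None, "uncorrectable"  # syndrome points outside the word
        word ^= 1 << s                    # s == 0 means the parity bit itself
        status = "corrected"
    elif s != 0:                          # even parity, nonzero syndrome
        return None, "double-error"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= ((word >> pos) & 1) << i
    return data, status


data = 0x0123456789ABCDEF
cw = encode(data)
print(decode(cw))                           # clean word
print(decode(cw ^ (1 << 17)))               # one bit flipped
print(decode(cw ^ (1 << 17) ^ (1 << 40)))   # two bits flipped
```

The three calls at the end correspond to the three choices a user has: nothing happened, a single flip that can be corrected on the fly, or a double flip that can only be reported.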

We allow the customer to decide what they want to do: they can do just a check, or a check and correct, or nothing at all. They pay the least possible because we only harden what we need to enable this feature, not the whole thing. What A offers is an "oh no!" bit: if it is set, you have no recourse but to reconfigure and start over. That is all A allows the customer to know, nothing more.

The same IP also allows the customer to flip bits so that they can see what effect NSEUs would have without having to go to a neutron beam (which is very expensive and time consuming).

- Ben Twijnstra

Sat, May 7, 2005 5:57 PM

Hi austin,

Oh, for 9-bit video data, or parity checking, or ECC, whatever you like.

That's exactly what I wanted to know. So, to summarize:

If activated, a 64-bit write to a BRAM will use 8 additional bits for error-checking and recovery. The read and write ports have optional dedicated hard logic that, when enabled, generate and check ECC data.

By the way, does this ECC stuff work on narrower RAM widths?

In A, the config error pin will allow you to take any external action. Rebooting the device is the most common application, but more elaborate schemes are possible. The internal logic is also able to respond to a config error. Then again, since the configuration cannot be trusted anymore, it would be best to bring the circuit offline as quickly as possible.

The 'reloading-while-running' feature in X is cool, but if I were an FPGA and I knew I couldn't be trusted anymore, Asimov's first law would kick in and I'd disable myself ASAP (i.e. after sticking a Post-It to my forehead indicating that a service technician should look me over because I went crazy).

Very nice idea indeed. After getting the first documentation about A's sanity checking we actually had to go to a nuclear lab to test the feature (the lab was also quite interested in the feature). We didn't do any quantitative testing (how could we, as humble end users), we just stuck the PCB in a high-intensity neutron beam and waited. And waited. And waited. But, in the end we found out that it did work ;-)

Best regards,

Ben

- austin

Sat, May 7, 2005 6:43 PM

Ben,

See below,

Austin

Yes, we have an extra bit for every 8 bits as well. Most folks just use it for parity.

Yup.

Nope. The customer has to instantiate whatever external muxes they would like in order to use the ECC with other widths. We felt that this extra muxing was trivial for the customer, whereas if we had to do it, it would make the block less useful and bigger for all the customers who don't want or need ECC. Given that the FIT/Mb rate of the BRAM is already 6 to 8 times better than commercial SRAM, many customers evaluate the risk, and decide to use simple parity rather than ECC.

I'm sure A is so smart (sarcasm), and knows exactly what to do for their customers. We, on the other hand, do not presume to tell the customer what they must do. Since only 1 in 10 to 1 in 100 bit flips actually does anything at all, there is a 90% to 99% chance that the FPGA is still able to decide what to do. In fact, if you triplicate a "sanity check" monitor, and allow it to make the decisions, you do not have to tear down the whole chip for every hit. That takes very little extra logic.

You see, A's "oh no" bit will trip 10 to 100 times more often than an actual functional failure: why take the system down 100 times more often than you really need to? Not very bright. Running around saying "I've been hit, I've been hit ...." Instead, we offer that you can decide if you should flip just that one bit, and just continue on from there.

If it is a video, voice, or packet application, what risk was taken? A bad pixel? A pop or click? One bad packet? Those things happen all the time for other reasons than SEU. No interruption. A's solution can not do that.

"Help me, Help me! I've been hit, and I don't know where! I might be dying, (but I am probably OK, but you can't trust me anymore."

I much prefer a more elegant solution: "Bit XYZ has flipped, do you want to flip it back?"

Yes, I know, it is all A has to sell, so make sure there is lots of FUD associated with the X solution (since it can't be matched by A).

I think it quite nice that their "solution" to SEUs is their hardcopy: less competition for FPGA vendors! Gartner-Dataquest removes all hardcopy revenue from A's balance sheet when comparing them with other FPGA vendors now. Their sales may be increasing, but their FPGA market share is decreasing. Too bad they just don't seem to be interested in playing with us anymore. No MGTs in S2, No processor. S2: 2 many upsets, 2 hot, 2 slow, 2 noisy; 2 little, 2 late, 2 bad.

Does it? How do you really know? They could count ten errors, and then say "I've been hit" and you would never know the difference.

How do you know that the checker wasn't hit? Do they provide a heartbeat so you are sure the checker is checking? We do.

I say, have them prove that every single bit can be tracked.

Upset rates are different for LUT, DFF, RAM, config. Do you know what is checked? On V4, it is very clear what is being monitored. And you know what is happening all the time.

If you are really as paranoid as you claim (WCGW, WGW, AATWPM - what can go wrong, will go wrong, and at the worst possible moment), I would think not even knowing what is checked, and what flipped would drive you crazy.

- Paul Leventis (at home)

Tue, May 17, 2005 7:49 AM

Hi Austin,

No offence, but this sounds like bull to me. So you are claiming that since you have columns of blocks (er, ASMBL architecture), you can suddenly tolerate changing the fundamental layout of your configuration RAM cells without touching anything else? This would imply that not only are your various blocks floorplanned as columns, but that the memory cells sprinkled throughout those blocks also line up perfectly and that no other circuitry would need to be adjusted.

Regards,

Paul Leventis Altera Corp.

- Austin Lesea

Tue, May 17, 2005 2:50 PM

Paul,

I understand how frustrated you are.

We are 40% better in SEUs than V2 Pro (or V2).

You folks must be really scrambling since you did absolutely nothing to reduce your SEU FIT rate (by using 90nm 6T cells for config).

Enjoy your ??? FIT/Mb 90nm 6T config memory.

Compared to S2, V4 is probably at least twice as good, perhaps even three times better. Actel will probably hire IRoC again to test us both. It will be fun to see that report!

Unfortunately, since you do not support customer readback, we can't test your part in the neutron beam, as we would not be able to count all the upsets, or see where they actually occur. Not knowing must really be a pain for you guys. No way to really know if ICDES has accomplished anything at all.

Separate FIT rates for config and BRAM are a requirement for our customers, as well as having a number of techniques that can be used to mitigate the SEU issue, and a design flow to achieve any desired system FIT rate.

I'd like to see your numbers for config and BRAM, as we are very satisfied with our improvements.

It will be fun to watch as this sinks in the minds of the customers out there ....

Sorry you can not say "we are just like Xilinx" anymore. I was glad to do all the work, but I am afraid that we will derive all the benefits now that we thought through all of the issues.

Austin