Altera HardCopy and SEUs

- Roger

Thu, Jan 20, 2005 10:56 AM

Compared with a normal Stratix, is a HardCopy version immune to SEUs?

I'd have thought the possibility of a configuration error is reduced to zero as the configuration SRAM cells have gone but an event in the logic is still possible. Does anyone have any ideas (or figures maybe)?

TIA

Rog.

- Austin Lesea

Thu, Jan 20, 2005 4:50 PM

Roger,

The simple answer is: "they claim to be better".

That used to be the one big advantage an ASIC had over an FPGA (note the past tense): the fact that there are "so many fewer memory cells" means that "the upset rate is much less", all else being equal.

The facts are less clear, at the latest technology nodes.

Since we and the SRAM vendors do SEU testing (e.g. Cypress has a project similar to our Rosetta, where they do simulation, beam testing, and atmospheric testing), we are the only two vendors that "know" what is happening (see our MAPLD presentations from the last few years).

But look carefully at what Cypress is saying: SRAM at 90 nm is ~3000 FIT/Mb for atmospheric SEUs at sea level (which Cypress has improved to better than 1500 by design, doing things similar to what we are doing).

Our configuration latches, in comparison, are ~180 FIT/Mb at 90nm (latest extrapolation of all testing, with a margin of error of 1/2 to 2X; we just need more time, or more errors). Good position to be in. Our promise is to get better with each technology, not worse. And we are proving that to be true.

Well, a D FF in an ASIC (or hardened solution) is a pretty small animal at 90nm. Our studies show its upset rate to be significantly higher than one would first suppose. In fact, the 90nm standard cell D FF is the most sensitive element! They are, by our calculations, 10X our CMOS configuration latch in terms of FIT/Mb.

If the FIT/Mb for the 90 nm CMOS configuration latch is ~180, and it takes on average 10 tries to hit something 'critical' (a LUT, a piece of interconnect that matters by opening or shorting something), then the effective failure rate is 18 FIT/Mb of config memory in our FPGAs. Gee, 10X better. If the ASIC has even 1/10 the D FF and memory that the FPGA has, then they are equal in time to fail by atmospheric upsets....
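
A minimal sketch of the derating arithmetic above, in Python. The 180 FIT/Mb and the factor of 10 are the figures quoted in this post; the function and constant names are made up for illustration and are not from any Xilinx tool.

```python
def derated_fit_per_mb(raw_fit_per_mb: float, upsets_per_failure: float) -> float:
    """Return the functional-failure rate per Mb, given the raw upset rate.

    upsets_per_failure is the average number of upsets needed before one
    lands on a bit that actually matters to the design.
    """
    if upsets_per_failure <= 0:
        raise ValueError("upsets_per_failure must be positive")
    return raw_fit_per_mb / upsets_per_failure


# Figures quoted in the post (90 nm configuration latch).
RAW_CONFIG_LATCH_FIT_PER_MB = 180.0
UPSETS_PER_CRITICAL_HIT = 10.0  # "on average 10 tries"

effective = derated_fit_per_mb(RAW_CONFIG_LATCH_FIT_PER_MB, UPSETS_PER_CRITICAL_HIT)
print(f"effective rate: {effective:.0f} FIT/Mb")
```

The derating factor depends on how much of the configuration memory a given design really uses, so treat 10 as a rule of thumb rather than a constant.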

By the way, the D FF in the CLB is 1 FIT/Mb, so it is so unlikely to be flipped, it is not even on the radar screen (can be ignored).

Now let us say that I use a XC4VLX60 and, once I am done, move to a 2M gate ASIC (an actual customer story: one million gates + 1 million D FFs). If we make a few reasonable calculations, I get ~500 years between functional failures in the FPGA (assuming you do not use our correction features, which would make this much, much longer), and ~500 years between functional failures for the 90nm ASIC. So, given that on paper we are no better and no worse, and that we can be better by using FRAME_ECC and other easy built-in error correction techniques, we win the category of longest SEU MTBF (time to a failure).
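
For anyone who wants to redo this kind of estimate, here is a rough Python sketch of the conversion between a device-level FIT rate and years between failures. It relies only on the standard definition of FIT (failures per 10^9 device-hours). The helper names are hypothetical, and the 8766 hours per year (365.25 days) is an assumption of this sketch, not a figure from the post.

```python
FIT_HOURS = 1e9          # FIT = failures per 10^9 device-hours
HOURS_PER_YEAR = 8766.0  # assumption: 365.25 days * 24 h


def mtbf_years(total_fit: float) -> float:
    """Mean time between failures, in years, for a device-level FIT rate."""
    if total_fit <= 0:
        raise ValueError("total_fit must be positive")
    return FIT_HOURS / total_fit / HOURS_PER_YEAR


def fit_for_mtbf_years(years: float) -> float:
    """Device-level FIT rate that corresponds to a given MTBF in years."""
    if years <= 0:
        raise ValueError("years must be positive")
    return FIT_HOURS / (years * HOURS_PER_YEAR)


# What total functional-failure rate does "~500 years" imply?
print(f"{fit_for_mtbf_years(500.0):.0f} FIT for the whole device")
```

Multiplying a derated FIT/Mb figure by the number of megabits in the device gives the total FIT to feed into `mtbf_years`; the memory size of any particular part is deliberately left out here.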

Now, we know what we do, and we tell you.

For fun, go ask your vendors what their atmospheric MTBF is for soft errors (SEU, SEE, SER, SEL) for their hardened solution.

Many large companies are doing this (have done this).

They don't know. Ask more closely: do they know how to do Qcrit studies? Can they follow JESD89?

Ask for the results. We will talk about ours, just contact your FAE.

Are they hiding just how bad they really are?

Or, are they just afraid that they are really bad, and have no way to know (predict)?

Do they do beam testing? (we do)

Do they test for latch-up (which may destroy the chip)? (we do, and have to latch-up)

Do they do atmospheric experiments? (we do)

Do they promise to only get better from one generation to the next? (we and Cypress both do)

So, we can not say they are "worse" but we can certainly say that they are not "better."

The truth is somewhere in between.

But I would choose the technology that knows what is going on, and has a means to predict the MTBF, and extend the MTBF hours by using simple built in features, if it is something that is needed.

Make a reliability spreadsheet for your system. Set goals (hard and soft). Then work with our FAEs to meet those goals. It is called reliability engineering, and it may surprise you that the FPGA will win.

After all, being in satellites, autonomous aircraft, jet fighters, automobiles, and on Mars has taught us a lot about reliability engineering, and SEUs.

Go Mars Rovers! One year and counting!

Austin

- Ben Twijnstra

Thu, Jan 20, 2005 7:27 PM

Hi Austin,

So, in your opinion, would the FIT/Mb for a V2Pro (on 130nm) be higher or lower than a V4 (on 90)? The area of a V2Pro DFF is larger, but the amount of energy required to create an actual fault is higher too.

PS: feeling much better after (1) fixing the just eentsily oxidized soldering connection that was causing the boot failure and (2) getting my motorcycle permit.

Ben

- Austin Lesea

Thu, Jan 20, 2005 11:28 PM

Ben,

Well, based on the numbers I gave, V2P has a higher FIT/Mb rate than V4, but generally speaking the V2P chip folks use is smaller (fewer memory cells, less logic) than the V4 chips.

So, at one time the sweet spot part for Virtex was a XCV300.

Then for the Virtex E it went up to a XCV600E.

With Virtex II, it was 2V1000.

In Virtex 2 Pro, it is the 2VP30.

In Virtex 4, we suspect it will be the XC4VLX60. At least that is what a lot of folks are getting shipped as samples right now, although the LX25 is pretty popular as well. Not to mention there is a huge volume application for FX12s that might outstrip all of the others.... Predicting the high runner a priori is just gambling.

So, if customers would just stop using more logic, we would be getting 'better' in each device (of the same size) since Virtex II.

But, customers will pack more stuff in (for less money), so the FIT/family gets slightly worse, even though the FIT/Mb is better, just because larger parts are being compared to smaller parts. Hence the reason why we have to work so hard to get better, so we don't get worse overall.

That is the reason why we added the FRAME_ECC to correct errors in V4: parts are growing faster than we are decreasing the FIT/Mb rate (although we are doing that, too). So on a system basis, we still must get better.
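
To make the point concrete, here is a small Python sketch of how the per-device rate can go up while the per-Mb rate goes down. Only the 180 FIT/Mb figure comes from this thread; the older-generation rate and both memory sizes are hypothetical values chosen purely for illustration, not data for any real part.

```python
def device_fit(fit_per_mb: float, config_mb: float) -> float:
    """Total upset rate of a device: per-Mb rate times megabits of memory."""
    return fit_per_mb * config_mb


# Hypothetical older, smaller part with a worse per-Mb rate.
older = device_fit(fit_per_mb=400.0, config_mb=5.0)

# Newer part: 180 FIT/Mb as quoted in the thread, hypothetical 15 Mb size.
newer = device_fit(fit_per_mb=180.0, config_mb=15.0)

print(f"older device: {older:.0f} FIT, newer device: {newer:.0f} FIT")
if newer > older:
    print("per-Mb rate improved, but the larger part is worse overall")
```

With these made-up numbers the per-Mb rate improves by more than 2X, yet the device triples in size, so the total rate still rises; that gap is what correction features are meant to close.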

Austin

- Paul Leventis (at home)

Fri, Jan 21, 2005 1:25 AM

Hi Roger,

The short answer to your question is that yes, a HardCopy device will have no susceptibility to SEUs affecting the "programming" of the device. Configuration memory cells are replaced with hard-wired connections, thus eliminating any possibility that the programming of the device is upset by a particle collision. Memory storage cells are still present in the block memories, so I imagine that some SEU is possible in the *state* of the chip, as it is with any ASIC.

Regards,

Paul Leventis Altera Corp.

- Roger

Sat, Jan 22, 2005 12:48 PM

Thanks Paul,

Is the die shrink that's mentioned with HardCopy devices by virtue of the configuration cells being removed? I presume the feature size of the logic remains the same though so the characteristics of the logic are unchanged? Is the speed up by virtue of the logic now being hardwired?

Regards, Rog.

- Roger

Sat, Jan 22, 2005 1:05 PM

One other question:

Am I right in thinking that the NRE for a HardCopy design is $200K? Is this fixed regardless of what the device or logic complexity of the design is?

TIA,

Rog.

- Paul Leventis (at home)

Sun, Jan 23, 2005 6:32 PM

Hi Roger,

Programmable routing (comprising multiplexors, buffers, and configuration SRAMs) is replaced by traditional ASIC routing (buffers + wires only). This is the primary source of the area reduction. Speed comes from the routing being hardwired (vs. switched via multiplexors) as well as from the reduction in area and some other changes. The main logic of the device and especially items such as the RAMs, IOs, and PLLs are the same as the FPGA, so you get the same behaviour.

As for your second question (separate post), I don't know what the NRE is for HardCopy. I don't see $$$ values in my job with Altera, except on my paycheque :-). You should contact an Altera sales rep for pricing information.

- Paul