Dear All, Austin in particular, I saw this and thought of you! Cheers, Syms.
14 years ago
Well, Cypress, Xilinx, IBM, and many others have made it no secret that neutrons at sea level are causing upsets, and we have done something about it (and presented the papers, and shown our results).
Intel has also been working very quietly on this, with much less press.
I suggest that if you are not thinking about single event effects, you should be, and demanding your vendor show you the proof of their design efforts in this regard.
Virtex 5 is (as of today), 144 FIT/Mb for the config bits, 95% confidence interval from 100 to 200 FIT/Mb. This is from our 400 devices located on mountain tops in France (31.029 Giga-bit-years of test time, 35 events).
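For anyone wanting to see how a FIT/Mb figure relates to an event count and an exposure, here is a minimal sketch. It assumes 1 FIT is one failure per 10^9 hours, 1 Mb is 10^6 bits and a year is 8760 hours, and it uses a crude normal approximation for the interval. The function names are made up for this illustration. The post does not say what corrections go into the quoted 144 FIT/Mb, so the raw ratio below is not expected to reproduce it.

```python
HOURS_PER_YEAR = 8760   # assumption: 365-day year
FIT_HOURS = 1e9         # 1 FIT = 1 failure per 10^9 hours

def raw_fit_per_mb(events, gigabit_years):
    """Raw upsets per 10^9 hours per megabit, with no site normalisation."""
    megabit_hours = gigabit_years * 1000 * HOURS_PER_YEAR
    return events / megabit_hours * FIT_HOURS

def rough_interval(events, gigabit_years, z=1.96):
    """Crude interval: treat the count as events +/- z*sqrt(events)."""
    half_width = z * events ** 0.5
    low = raw_fit_per_mb(events - half_width, gigabit_years)
    high = raw_fit_per_mb(events + half_width, gigabit_years)
    return low, high

# Event count and exposure as quoted in the post.
rate = raw_fit_per_mb(35, 31.029)
low, high = rough_interval(35, 31.029)
print(f"raw rate: {rate:.1f} FIT/Mb, rough interval: {low:.0f} to {high:.0f}")
```

With only 35 events the interval is wide, which is why the post's advice to ask a vendor for the confidence level matters as much as the headline number.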
Compare this to a 65nm ASSP or ASIC, which is at least 1000 FIT/Mb or 1000 FIT/million gates(!). Do nothing, and it gets worse. Do something, and it gets back to where it should be. These numbers are from the SELSE II conference a few years back: the industry numbers are really a lot worse, but no one will admit it.
There is a reason why Xilinx FPGA devices are finding their way into many high availability and high reliability applications: we are the only choice -- there is no competition whatsoever.
Hi Austin, I wondered what were your thoughts on their patent where "The cosmic ray detector [built into the device] is therefore designed to spot when rays have caused interference and then tell the chip to repeat the command." ? I guess in an FPGA it could trigger a readback to ensure the device was still correctly configured and/or issue a user logic reset. Cheers, Syms.
Well, that employee should be fired: that is the stupidest thing I have ever read.
It isn't even science -- detecting neutrons! Pure BS! A neutron is an uncharged particle, that goes through 10 meters of concrete before it gets stopped. Detecting one is just......stupid.....idiotic.....
(breathe in, breathe out.)
Their PR folks are probably going nuts on this one!
Was that an April 1 dateline?
Anyway, Intel is pretty savvy, and they are not standing still. If you use their parts, you need to request their Soft Error Effects roadshow.
It is only given under NDA, so although I know it exists, and I suspect I know what is in it, I have never seen it.
I have seen IBM's "show" and they certainly have their act together. As do we. IBM's "show" is under NDA, however, so I can't say anything about its contents.
Our roadshow is available by request from your local friendly FAE, and no NDA is required (why would we hide that we are the best?).
Remember: per the JEDEC JESD89A standard, there are three ways to characterize soft error effects. Be sure to ask which ones were used, and their degree of confidence.
If they won't share this with you (under NDA), then they are hiding something, something very very bad.
First of all, there is no such thing as a single particle detector.
Secondly, detecting the current spike (from a strike) requires a very complex circuit, itself subject to spikes (I know, we designed them for the USAF...)
Thirdly, Intel has done far more than this, and deserves better PR.
Perhaps they should fire the PR firm?
> Intel has also been working very quietly on this, with much less press.
Yes, in S3A, S3AN, S3D, V4, V5 we are able to either reconfigure on detection of an upset, notify the user (and they decide what to do), or in V4 and V5, correct the flipped bit without having to reconfigure (or even go to the config flash/prom).
Basically, in our road show, it is detailed how the user needs to decide what to do, and at what levels, in order to meet their availability and reliability numbers.
Mitigation is part hardware, part system architecture, and part software. Depending on what you are doing, and how long you can tolerate being "off-line" there are different solutions.
-just reconfigure, start fresh
-just fix the bit flip, continue on (as a flip does nothing 90% of the time, and seldom causes anything to 'crash')
-fix the bit flip and reset or go back to a check point/known states
-use dual redundancy, and check for agreement (if a fault is not tolerated, like in banking, accounting); repeat if no agreement
-use full triple modular redundancy (when it must be correct, and 100% available), also scrub to fix bits that may flip so flips are not allowed to accumulate
All methods are used by our customers, and they all work. We have reference designs and support for these models. And they can be tested by reconfiguring to flip bits while operating. One heck of a lot cheaper than using a proton beam, or neutron beam .... and more complete (we have folks who flip each bit, one by one, and prove their system meets its requirements).
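The "flip each bit, one by one" style of test can be sketched in a few lines. This is a toy model only, not the vendor's reference design or tool flow: the "configuration" is a hypothetical 8-entry lookup table, the redundancy is three copies plus one ideal majority voter, and the voter itself is assumed to be fault free. Note that in this toy every stored bit is essential, whereas the post says that in a real design most flips do nothing.

```python
from itertools import product

GOLDEN_LUT = [0, 1, 1, 0, 1, 0, 0, 1]   # hypothetical 3-input XOR truth table

def lut_read(lut, a, b, c):
    return lut[(a << 2) | (b << 1) | c]

def majority(x, y, z):
    return (x & y) | (x & z) | (y & z)

def design_output(copies, a, b, c):
    """One copy: read it directly. Three copies: vote on the three reads."""
    reads = [lut_read(lut, a, b, c) for lut in copies]
    return reads[0] if len(reads) == 1 else majority(*reads)

def campaign(n_copies):
    """Flip every stored bit, one at a time; return the flips that change an output."""
    failures = []
    for copy_idx, bit_idx in product(range(n_copies), range(len(GOLDEN_LUT))):
        copies = [list(GOLDEN_LUT) for _ in range(n_copies)]
        copies[copy_idx][bit_idx] ^= 1   # inject the upset
        for a, b, c in product((0, 1), repeat=3):
            if design_output(copies, a, b, c) != lut_read(GOLDEN_LUT, a, b, c):
                failures.append((copy_idx, bit_idx))
                break
    return failures

unprotected = campaign(1)
protected = campaign(3)
print(len(unprotected), "of 8 single flips break the single copy")
print(len(protected), "of 24 single flips break the triplicated copy")
```

The point of the sketch is the shape of the campaign: the injection loop visits every bit exactly once, so the coverage is known by construction.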
Austin, Are you talking about the link I posted? I didn't see any reference to neutrons, am I missing something? Also, if what you say is true, that neutrons whizz through 10 meters of concrete, aren't you gonna be incredibly unlucky to get a direct neutron hit on a 45nm transistor? (BTW, a cursory web search would suggest some kind of boron based detector, which kinda makes sense as boron is used to absorb thermal neutrons in nuclear reactors.)
At sea level, 93% of particles from the cosmic ray shower are neutrons, and 7% are protons (see JEDEC JESD89A).
There are 12.9 neutrons per square cm, every hour, passing through everything (for New York City, up to 25X more on mountain tops, 300X at 40K feet, less at the equator, 10X at the poles...).
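To put the flux figures into numbers, here is a small sketch that scales the New York City sea-level figure by the multipliers quoted above. The dictionary keys and the function name are hypothetical labels for this illustration, the "up to 25X" mountain figure is used as if it were exact, and the result is particles passing through an area, not upsets.

```python
NYC_SEA_LEVEL_FLUX = 12.9   # per cm^2 per hour, figure quoted in the post

# Multipliers relative to New York City sea level, as quoted in the post.
SITE_MULTIPLIER = {
    "nyc_sea_level": 1,
    "mountain_top": 25,     # the post says "up to 25X"
    "40k_feet": 300,
}

def particles_per_hour(area_cm2, site="nyc_sea_level"):
    """Particles crossing `area_cm2` each hour at the given site."""
    return NYC_SEA_LEVEL_FLUX * SITE_MULTIPLIER[site] * area_cm2

for site in SITE_MULTIPLIER:
    print(site, particles_per_hour(1.0, site))
```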
There are also electrons, muons, pions, and a host of more exotic stuff, but those either don't matter (do not affect anything), or they are absorbed quickly, or decay (even a lone neutron decays in 11 minutes!).
So, like I said, that is the dumbest PR I have read. It gets the first prize for ignorance about soft error effects.
Some Real Science:
Aha, thanks! Now I think I get most of it. It would seem that the cosmic rays, which are charged particles, hurtle into the earth from all directions. They are made of protons mainly, with some alpha and beta particles. The earth's magnetic field means that there are more at the poles than at the equator. The cosmic rays are charged and so interact with the atmosphere a lot, and so very few reach the earth's surface. However, these energetic collisions in the atmosphere produce showers of neutrons. These uncharged particles don't interact with the atmosphere nearly as much as the cosmic rays, so can reach the surface more easily.
Ok, here's another question. As the uncharged neutrons don't interact with much, indeed you say they can go through 10 metres of concrete, I can't see why the highly interactive remaining protons aren't the real danger, even though they only comprise 7% of the total, not the 93% neutrons? Maybe none of the original protons reach the surface, but the 7% protons are produced by secondary neutron collisions?
Sorry to bombard you with questions!
Expecting quality in a PR document seems to be the triumph of hope over experience?
These things start in the depths of a company, we assume largely accurate. Then, that company's media liaison/managers work on it.
Then the PR firm 'works' on it and finally the publishing media's editors have a go.
Like Chinese whispers, any semblance to the original is pure coincidence! ;)
Boy, I saw that text, too, and really wondered about how reliable such a procedure would be. If the state of flip-flops or dynamic memories is altered, repeating the previous instruction operation would be worthless. There is SO much more area in high-end CPUs devoted to memory and much less to logic functions, I would expect memory corruption to be the most probable fault.
Right, having worked with a neutron detector array, detecting them is REALLY hard, and not something easily done on a chip. However, most neutrons pass through chips easily with no interaction, and so can be ignored. What you have to detect is if the neutron was CAPTURED, and deposited energy in (or very near) the active circuitry. That will release some energy (could be charged particles, could be Gamma rays) that could affect the active circuitry. The gammas could be detected from a distance, but they can be quite directional and local, so detecting them could be tough, too.
Really! Just detecting a neutron or Alpha hit could be difficult, although detecting a cosmic ray shower is a lot easier, as the shower of charged particles greatly increases your probability of detection on a small detector device (probably just a diode). But, then, the REAL problem is how to CORRECT any malfunction that may have occurred. Reducing the probability of corruption, as Austin describes Xilinx has done, seems the most reliable and provable scheme. Proving you can correct corruption from a hit anywhere on a chip, while running ANY program, at any time, seems like fantasy.
The protons interact VERY strongly, due to the charge. As most electronics is housed in something, the housing usually stops the protons, although there will be Gamma radiation when they hit, and that can penetrate the housing. If you put a bare photodiode outside on a dark night and reverse-biased it, you could pick up these interactions easily with an oscilloscope. With a little digging into the physics, you could discriminate alpha hits from protons, etc. Of course, cosmic ray showers deliver so much "stuff" that you'd just see big pulses without being able to pick out the fundamental particles.
Oh, one other aspect is "stopping distance". Very energetic charged particles zing through stuff with minimal energy deposited into the material, until enough energy has been shed, then they interact and stop suddenly. So, the very high energy primary particles are not much trouble, it is when they either lose energy by travelling through something or create secondary particles that the energy is low enough to create ions.
So, the protons are not likely to ever make it into the silicon directly. Secondary Alphas and lots of Gammas will be bouncing around, and those could deliver energy to the chip.
The cosmic rays are ions: iron, gold, xenon, carbon, basically anything and everything. Yes, there are lots of protons, but they do not have enough energy to cause problems. More light ions (like carbon), fewer heavy ions (like gold).
But, iron, with too few electrons, traveling at 90% the speed of light, now there is a particle!
When one of these "heavy ions" strikes the upper atmosphere, say a nitrogen molecule, all hell breaks loose and you get all sorts of products (Even CERN has nothing on a cosmic ray--high energy physics used mountain top sites before the cyclotron!). Since neutrons have no charge, and go right through most things (as most mass is empty space), the neutrons predominate at the earth's surface.
Beam neutrons at a block of iron, or aluminum or copper, and you will get radioactive iron, aluminum, or copper (excess neutrons will eventually be released if they have created an unstable isotope). This is why lead on the surface is more radioactive than lead at the bottom of the sea.
The ions got directed by the earth's magnetic field, but once the ion strikes, the neutrons are unaffected by the fields.
The direction is predominately "up" as the flux falls off away from "up" (towards the sky) as the neutrons are absorbed by the atmosphere at oblique angles.
No neutrons come from "down" unless you are standing on lead, uranium, or in the basement in Minnesota (Radon).
The neutron hits the silicon lattice.
The silicon "spallates" (splits the atom) and releases an alpha particle (a helium atom, minus the electrons: two protons, two neutrons).
The alpha particle has charge, and it upsets the source drain region (due to deposited charge, actually leaves a trail of 'holes' and electrons which quickly recombine, in less than 30 ps).
The neutron may also just "ping" the silicon lattice, and cause the silicon dioxide molecule to be dislocated from the lattice, or just vibrate. In either case, charge is also released.
A good history lesson (and some physics):
If you can stomach the physics....
I read the patent(s).
I am amazed at how useless this is: use of a MEMS to detect the charge cloud....
OK, so there went a neutron.
Did it upset one of my 512K bits in my cache?
Did it upset a register?
So, they have a number of patents for this, and it is interesting (as a physics experiment), but then, so is:
Interesting science, bad PR!
A much better patent to crow about:
Jon Elson posted:
"[..] Proving you can correct corruption from a hit anywhere on a chip, while running ANY program, at any time, seems like fantasy."
Correct. Xilinx did (and probably still does) have an admission on its website that a level of risk must always be accepted, no matter what is done to combat single-event effects.
Regards, Colin Paul Gloster, unemployed and hungry
Austin posted:
"[..] they can be tested by reconfiguring to flip bits while operating. One heck of a lot cheaper than using a proton beam, or neutron beam .... and more complete (we have folks who flip each bit, one by one, and prove their system meets its requirements)."
Logical testing will not match checking whether real radiation respects your model of the system. One transient can defeat the outcome of clocked triply modularly redundant voters.
Sincerely, Colin Paul Gloster, unemployed and cold
It is a question of completeness.
Logically going through every bit is 100% functionally complete.
Sitting in a proton beam is "waiting for Godot" -- how long must you wait to check enough bits to achieve the required coverage?
It becomes a matter of "too many dollars to keep the lights on." (Beam testing is horribly power hungry, and very expensive, e.g. TSL is $250K for a session, not including the airplane tickets, hotel rooms, people, rental cars...).
Additional system testing in a beam is highly desired, but the goals are not for functional completeness, but to cover whatever might have been missed by flipping 100%, one by one, every configuration bit.
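The completeness argument can be made a bit more concrete with a back-of-envelope sketch. It assumes, purely for illustration, that each beam-induced upset lands on one configuration bit chosen uniformly at random and independently of earlier hits; real beams and real devices are not that tidy. The function name is hypothetical.

```python
import math

def hits_for_coverage(n_bits, coverage):
    """Number of random single-bit hits after which the expected fraction of
    bits struck at least once reaches `coverage` (uniform, independent hits)."""
    if not 0 < coverage < 1:
        raise ValueError("coverage must be strictly between 0 and 1")
    # After k hits a given bit is still untouched with probability (1 - 1/n)^k,
    # so solve (1 - 1/n)^k <= 1 - coverage for k.
    return math.ceil(math.log(1 - coverage) / math.log1p(-1 / n_bits))

n_bits = 1_000_000
print("exhaustive injection:", n_bits, "flips for 100% of bits")
print("random hits for 99% expected coverage:", hits_for_coverage(n_bits, 0.99))
```

Under that assumption the random approach needs several times more hits than there are bits just to reach 99% expected coverage, and it never guarantees 100%, which is the "waiting for Godot" problem.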
XTMR Tool(tm) software can not be broken by a single radiative event, nor by a single bit flip (as verified by NASA, JPL, CERN, etc....).
Our flow triplicates the voters, so that every feedback path gets a full TMR. A failure in a voter is "voted" out by the other two voters.
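A toy model of why triplicating the voters helps in a feedback path is sketched below. This is not the XTMR tool's implementation, just an assumed 1-bit toggle register held in three domains, each with its own voter, where one voter can be forced to give the wrong answer on every clock.

```python
def majority(x, y, z):
    return (x & y) | (x & z) | (y & z)

def step(regs, faulty_voter=None):
    """One clock of a 1-bit toggle register with one voter per domain.

    Each domain votes on all three registers, then toggles the voted value.
    `faulty_voter` names a domain whose voter output is inverted.
    """
    nxt = []
    for domain in range(3):
        voted = majority(*regs)
        if domain == faulty_voter:
            voted ^= 1                  # this voter gives the wrong answer
        nxt.append(voted ^ 1)           # toggle logic in the feedback path
    return nxt

def run(cycles, faulty_voter=None):
    regs = [0, 0, 0]
    outputs = []
    for _ in range(cycles):
        regs = step(regs, faulty_voter)
        outputs.append(majority(*regs))  # what downstream logic would see
    return outputs

golden = run(8)
with_bad_voter = run(8, faulty_voter=1)
print(golden)
print(with_bad_voter)
```

The bad voter keeps corrupting its own domain's register, but the other two domains vote it out on every clock, so the voted output matches the fault-free run.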
That is why we have so many designers using this flow:
it just works.
True. The lawyers require that we never accept any risk of this type. In this case, for something that is "soft" and leaves no record, we can not accept any liability for something that can never be proven absolutely. We can only do our best, and ask our customers to do their best, and show them results from all the testing, and other missions and working systems.
If a part 'fails', all we can do is issue an RMA, test it, and replace it if found bad. That is the complete and total extent of our liability, unless otherwise agreed upon with our legal department.
If you recall the alphas in solder bumps fiasco ($10M loss for Xilinx), it was out of goodwill and honesty that we negotiated the return and replacement of every lot of the affected parts once we were aware of the problem. In that very real sense, we are the ONLY company to have corrected the situation up front, and in public. Many other alpha contamination situations are buried, and only become legends, or appear in papers 10 years later as "stories" from previous generations of products.
Austin posted:
"[..] neutrons at sea level are causing upsets, [..]
[..]
I suggest that if you are not thinking about single event effects, you should be, and demanding your vendor show you the proof of their design efforts in this regard.
[..]"
"There is a reason why Xilinx FPGA devices are finding their way into many high availability and high reliability applications: we are the only choice -- there is no competition whatsoever."
Are you sure you have not been brainwashed by your own P.R. department? I commend Xilinx's acknowledgement of the existence of single-event transients (my attempt at a Ph.D. in electronic engineering was ruined by a supposed tutor who used to work for the European Space Agency for four years who only ever spoke about single-event upsets even though I had applied to the university with an explicit emphasis on single-event transients in my application essay).
How can you say that people can only buy from Xilinx instead of from, for example, Aeroflex?
Regards, Colin Paul Gloster, unemployed and hungry