RAM Failure modes [long -- whiners don't read]

[crossposted; feel free to elide either group in your reply]

I have typically tested "RAM" (writeable memory) in POST for gross errors. Usually, a couple of passes writing, then reading back, the output of an LFSR-derived PRNG with a long and "relatively prime" period. (One goal of POST being 'short and sweet')
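
Something along these lines -- a minimal sketch only, with the word width, region and LFSR taps as placeholders rather than my actual test:

#include <stdint.h>
#include <stddef.h>

/* 32-bit Galois LFSR; taps 32,22,2,1 give a maximal-length sequence.
   Seed must be nonzero. */
static uint32_t lfsr_next(uint32_t s)
{
    uint32_t lsb = s & 1u;

    s >>= 1;
    if (lsb)
        s ^= 0x80200003u;
    return s;
}

/* Fill the region from the LFSR, then replay the same sequence and verify.
   Returns 0 on pass, -1 on the first mismatch (stuck-at/decode fault). */
int post_ram_test(volatile uint32_t *base, size_t words, uint32_t seed)
{
    uint32_t s = seed;

    for (size_t i = 0; i < words; i++) {     /* write pass */
        base[i] = s;
        s = lfsr_next(s);
    }

    s = seed;                                /* replay */
    for (size_t i = 0; i < words; i++) {     /* read-back pass */
        if (base[i] != s)
            return -1;
        s = lfsr_next(s);
    }
    return 0;
}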

This catches "stuck at" errors, decode errors, etc. It does very little by way of catching soft errors (unless it gets lucky).

For modest amounts of memory and relatively short power-on times (hours/days), this has been satisfactory.

But, for larger memory configurations and much longer up-times (weeks/months/years), I suspect not so much.

My question(s) concern internal (MCU) memory (typically static or pseudostatic) and external (DRAM) memory.

Additionally, memory that is used to store code (r/o) as well as data.

"Code" is protected from accidental overwrites by hardware (so, only "suspect" if that hardware fails *or* software deliberately disables it -- bug, no need to worry about those).

"Data" is, well, data; hard to really know WHAT it should be at any time (unless I wrap everything in monitors, etc.).

All the memory is "soldered down" so no issues with flakey connectors, vibration, etc. ECC is not (easily) available; I'd have to create and verify syndromes with external logic and would be unable to do anything more than complain (crash) when an error was detected (no ability to rerun bus cycles).

Assume operating conditions are "within published specifications". Separately, I'll ask the value of putting in hardware to VERIFY that is true, ongoing.

BEYOND POST...

I can verify the contents of "code" memory by simply running ongoing checksums (hashes) periodically. I.e., compute the hash when the code is loaded; then verify it remains unchanged during execution.
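
For example, a rough sketch of that -- CRC-32 chosen purely for illustration, and __text_start/__text_end assumed to be linker-provided symbols (not anything from my actual build):

#include <stdint.h>
#include <stddef.h>

extern const uint8_t __text_start[], __text_end[];   /* assumed linker symbols */

/* Plain bitwise CRC-32 (reflected, poly 0xEDB88320) -- slow but tiny. */
static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

static uint32_t code_crc_at_load;

/* Compute the reference hash once, right after the code is loaded. */
void code_check_init(void)
{
    code_crc_at_load = crc32(__text_start, (size_t)(__text_end - __text_start));
}

/* Re-check from a low-priority/idle task; nonzero means the image changed. */
int code_check_run(void)
{
    return crc32(__text_start, (size_t)(__text_end - __text_start))
           != code_crc_at_load;
}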

I can regularly check pages of memory as they are released from use ("free") as well as regularly swap out in use pages (code or data) for analysis.

I can coordinate groups of such pages -- at some difficulty and cost (in terms of idled resources) -- to check for decode errors.
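
A stripped-down sketch of the per-page piece of that (page size and patterns are placeholders; the coordinated, cross-page decode checks would have to be layered on top):

#include <stdint.h>
#include <stddef.h>

#define PAGE_BYTES  4096u                              /* assumed page size */
#define PAGE_WORDS  (PAGE_BYTES / sizeof(uint32_t))

/* Run a quick pattern test over a page that has just been freed.
   Returns 0 if the page looks healthy, -1 if it should be retired. */
int scrub_free_page(volatile uint32_t *page)
{
    static const uint32_t pat[2] = { 0x55555555u, 0xAAAAAAAAu };

    for (int p = 0; p < 2; p++) {
        for (size_t i = 0; i < PAGE_WORDS; i++)
            page[i] = pat[p];
        for (size_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != pat[p])
                return -1;       /* don't return this page to the free pool */
    }
    return 0;
}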

And, of course, rely on various watchdogs/daemons to HOPEFULLY spot behaviors that manifest as the result of corrupted data or code.

There are, of course, run time costs for all of this (which I can bear -- IF they are fruitful).

[I'm operating in the 1Mb internal SRAM, 2Gb DRAM (DDR2/LPDDR) arena.]

So, the questions I have are:

- what are the typical POST-DELIVERY failure modes for DRAM technologies? internal SRAM/PSRAM?

- how do these correlate with product age, temperature, etc.? (is it more effective to MORE tightly control certain aspects of the environment?)

- is it worthwhile to monitor the environment and signal operating conditions that are suggestive of memory failures (instead of hunting for broken bits)?

- is it better to just "refresh" memory contents inherently in the design than to rely on them remaining static and unchanged?

Reply to
Don Y

"row hammer" springs to mind.

Reply to
Jasen Betts

I would first ask why you are concerned about this. I assume you have already thought about at least some of the points below (I know you are not doing memory testing merely for fun!), but perhaps you have not thought about them all, and perhaps answers to them can help you or others to find answers to your specific questions.

First, have you ever found memory problems with the systems you have? My experience with memory is that it very rarely fails, and when it does it is mostly a system problem (like poorly terminated buses, bad connections, running beyond maximum speed, etc.) rather than an issue with the memory itself. Almost all memory errors will then be caught in a brief check of address lines and data lines during production testing -- power-up or online testing is then unnecessary.

Secondly, what would you do if you found memory problems? If you cannot rely on your memory, it is difficult to rely on /anything/ in the system. On many systems with ECC memory, detection of an uncorrectable error leads to immediate shutdown because it is better to stop /now/ than to risk causing more problems. I have no idea what sort of systems you are designing, but if you can't make such an immediate shutdown, and you feel memory corruption is a realistic issue, then perhaps you have no choice but to use some sort of ECC memory or other redundancy rather than trying to spot a problem after it has happened.

Reply to
David Brown

Obviously, because I am concerned with reliability and availability.

Historically? Yes. But, memory technology has improved greatly in the decades that have passed. I'd never even consider a gigabit of memory built from 4kx1 devices!

That's not what the literature indicates. Also, doesn't explain why ECC memory is used.

That's not true. You *expect* some number of errors in any memory subsystem. A more interesting question is "how many ECC *corrected* errors before you start worrying about the ability of your ECC to *detect* errors -- even uncorrectable ones?"

Memory errors do not imply a system has erred.

If I'm examining a page of memory that isn't currently executing (because the program's control is currently somewhere else in the text segment) and I find an error (hard or soft) and I correct it or replace that page BEFORE the program has a chance to execute any of the commands affected by that error, then the program hasn't been compromised.

If I find an error that causes a bit to assume the value that it *should* assume (e.g., the lsb of a location is stuck at one but the location is intended to hold the value '0x9') then, likewise, no problem.

If I find an error that causes a bit to assume a BAD value -- but conditions in the program effectively make that irrelevant (e.g., the value specifies a timeout for an operation -- but, the operation still manages to successfully complete before the timeout expires), again, no problem.

Etc.

You can have a system apparently running successfully in spite of ongoing errors.

Or not.

The difference is, whether you KNOW about the errors or wait to find out about them by the system misbehaving (e.g., a watchdog kicking in or some other VERY INDIRECT measurement of reliability).

If the only time you have to test memory is POST (or, an explicit BIST invoked by the user), then you have to rely on interrupting the normal services of your device in order to perform that test and (re)gain that confidence.

"We'll be making a stop in East Bumph*ck, Iowa, while we run a regular test on the memory in our avionics systems. We're sorry for the delay and promise to have you back on your way as soon as possible!"

The point of my questions is to inquire as to how people see and expect to see memory failures -- in (external) DRAM as well as (internal) SRAM.

As I suspect most folks only test at POST, how would they react to a situation where the user just happened NOT to shut down their product/system for 10 years? Would they feel confident that it was still intact? Executing (out of RAM) the same code that they loaded, there, 10 years earlier? (bugs in their software can't corrupt the RAM's contents -- but the RAM can degrade!)

Reply to
Don Y

That's typically a result of a specific usage pattern. So, you can adopt the attitude of NOT letting those types of behaviors into your code *or* resign yourself to their inevitability and count on some increased number of SOFT errors, as a result.

Reply to
Don Y

Hi Don,

I think nowadays David's attitude is both the obvious and the correct one. Leave memory testing to the silicon and board manufacturers, they have better means to test it than the CPU which uses it. If you need the feeling of some extra reliability, use a part with ECC (and populate the chips for it...).

I have never noticed a memory failure in the last 30 years which could not be tracked down to something external to the memory, e.g. bad board connection, missing/bad bypass caps etc.

It is just that the failure probability of everything else dwarfs that of the memory silicon itself; all these tests won't buy you much, if anything.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

It's not like there's a lack of literature on that subject...

But if you assume independence in bit errors, the calculation for uncorrectable faults is simple. If you want to consider some probability of more difficult failures (an entire chip, an entire DIMM), those generate hard failures immediately, unless you add heroics like DRAM sparing (basically RAID for RAM -- IBM likes the term "RAIM"), as is done on high-end servers.

If you're just doing monitoring, and then preventive maintenance, based on an accumulated soft error rate, there again has been a fair bit of literature, but they all come to approximately the same conclusion -- soft errors are pretty rare for most devices, and on a handful they tend to be much more common. So the exact threshold is actually not that important. A DIMM getting a soft error every few months is ignorable, several per day is not, and there's little in the real world between those.

That's called scrubbing, and it is fundamental to any redundant storage scheme, RAM or disk. Even if you have nothing but ordinary single-bit (for RAM) errors, the odds of one turning into an uncorrectable error are just a function of how long the condition persists and the odds of another bit in the same protected block getting hit. So you must scrub.
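
A toy back-of-envelope to show the shape of that calculation -- every number below is an assumed placeholder, not a measurement:

#include <stdio.h>

int main(void)
{
    double errors_per_hour = 0.1;    /* assumed single-bit soft errors/hour, whole array */
    double array_bits      = 2e9;    /* ~2 Gb of DRAM */
    double block_bits      = 64.0;   /* one ECC-protected word */

    /* Rate at which any one particular protected word takes a hit: */
    double word_hits_per_hour = errors_per_hour * block_bits / array_bits;

    /* The longer a single-bit error sits unscrubbed, the better the odds a
       second, independent hit lands in the same word and makes it
       uncorrectable (small-probability approximation): */
    for (double persist = 1.0; persist <= 1000.0; persist *= 10.0)
        printf("unscrubbed %5.0f h -> P(second hit in same word) ~ %.2e\n",
               persist, word_hits_per_hour * persist);

    return 0;
}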

And proper machine check architectures do distinguish between immediate errors and deferrable ones. If you get an uncorrectable memory error during an instruction fetch, that thread is going to be dead. But if scrubbing turns up an uncorrectable error, it can be reported to the OS, which can deal with it at its leisure. That could be reloading the page (if possible), or killing every process that has that page mapped (if it cannot be reconstructed).
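
Schematically, the split looks something like this -- all the types and helper names below are invented for illustration, not any particular OS's API:

#include <stdbool.h>

/* Hypothetical report handed up by the machine check / scrub machinery. */
enum mce_source { MCE_IFETCH, MCE_SCRUB };

struct mce_report {
    enum mce_source src;
    unsigned long   page;        /* page containing the bad location */
    bool            reloadable;  /* true if clean copy exists in backing store */
};

/* Invented helpers, standing in for whatever the OS actually provides. */
extern void kill_current_thread(void);
extern void reload_page_from_backing_store(unsigned long page);
extern void kill_mappers_of(unsigned long page);

void handle_uncorrectable(const struct mce_report *r)
{
    if (r->src == MCE_IFETCH) {
        kill_current_thread();               /* hit live execution: no recovery */
    } else {                                 /* found by the scrubber: defer */
        if (r->reloadable)
            reload_page_from_backing_store(r->page);
        else
            kill_mappers_of(r->page);        /* contents can't be reconstructed */
    }
}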

Reply to
Robert Wessel

You put that very well.

When considering the reliability of anything, and its failure modes and their consequences, you have to consider the balance of probabilities. There is no point in testing memory just because you can test it -- the testing is not free, and at some point it becomes a net negative (for example, it takes so much of the processor time that you need a faster processor with lower reliability). You do the tests that make sense, based on the likelihood of there being a problem, the consequences of the problem, and what you can do if you spot a problem.

You see the same sort of situation in security. It makes sense to have a good lock on your front door - but once it is so good that it is easier for burglars to break a window, then any expense on more locks is wasted.

And in the memory test situation, who cares if your card's memory is perfect after 10 years if the electrolytic capacitors have decayed after 5 years? You need to make sure the effort is put in the right places.

Now, I am not saying that Don should /not/ test his memory - I am just asking if he has clear thoughts (and preferably, real numbers) justifying it.

Reply to
David Brown

But the literature tends to be concerned with large memory arrays. And, large memory arrays tend to be constructed differently. And, operated in different environments, attended by "professionals", etc.

The literature suggests failures repeat. So, it's not a uniform distribution across an entire device/array. I.e., stumbling onto a soft error suggests you're more likely to find another in that same spot than elsewhere. (This, in turn, suggests you treat the soft error as if it is -- or will become -- a hard error.)

Some studies show hard errors are more prevalent than soft; others show the exact opposite. A Google study (big data farms) claimed ~50,000 FIT/Mb. Further, it appeared to correlate error rates with device age -- as if cells were "wearing out" from use.

And, it's not "a soft error every few months" but, rather, several thousand per year per GB (so figure I'm at 1/4 of that -- per device node!)
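
Back-of-envelope, using that ~50,000 FIT/Mb figure (FIT = failures per 1e9 device-hours):

#include <stdio.h>

int main(void)
{
    double fit_per_mbit   = 50000.0;         /* the figure quoted above */
    double mbits          = 2.0 * 1024.0;    /* my 2 Gb of DRAM */
    double hours_per_year = 24.0 * 365.25;

    double errors_per_year = fit_per_mbit * mbits * hours_per_year / 1e9;

    printf("expected soft errors/year: %.0f\n", errors_per_year);
    /* ~900/year for 2 Gb; scale to 8192 Mb and it's ~3,600/year for a
       full GB -- "several thousand per year, per GB". */
    return 0;
}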

But you can do more than that. You can retire the affected memory so it no longer presents a (potential) problem.

That isn't true, either. What if the opcode should have been a "jump if zero" and, instead, gets decoded (bad fetch) as "jump unconditionally"? This only causes a problem if the value was NOT zero!

And, even if it *wasn't* zero, the consequences may not be fatal. Maybe a light blinks a little faster. Or, a newline gets inserted in the middle of a line of text. Etc.

This can be happening all the time and no one could be the wiser!

OTOH, if something told you that it was likely happening -- and could quantify that frequency -- you might be more predisposed to taking remedial action before the fit hits the shan!

If the only assessment you make of the integrity of your memory happens at POST, a cautious user would be resetting the device often just to ensure POST runs often!

Reply to
Don Y

On a per-DIMM basis, the Google paper has 8.2% of DIMMs experiencing one or more correctable errors per year, and of those 8.2%, the median number of errors is 64 per year (with the overall average being 3751!). They go on to mention that for the DIMMs with errors, 20% of those account for 94% of the errors.
Reply to
Robert Wessel
