I have typically tested "RAM" (writeable memory) in POST for gross errors. Usually, a couple of passes writing, then reading back, the output of an LFSR-derived PRNG with a long period that is "relatively prime" to the size of the memory under test. (One goal of POST being "short and sweet".)
This catches "stuck at" errors, decode errors, etc. It does very little by way of catching soft errors (unless it gets lucky).
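A minimal sketch of that kind of POST pass, for illustration only. The function names (lfsr_next, ram_fill, ram_verify) and the tap mask are assumptions, not anything taken from a real product; fill and verify are split so that a delay, or a second pass with a different seed, can be placed between them.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a 32-bit Galois LFSR. The tap mask 0x80200003
 * (x^32 + x^22 + x^2 + x + 1) is a commonly tabulated maximal-length
 * choice; it is an assumption here, not something from the post. */
uint32_t lfsr_next(uint32_t s)
{
    uint32_t lsb = s & 1u;
    s >>= 1;
    if (lsb)
        s ^= 0x80200003u;
    return s;
}

/* Fill n words starting at base with the LFSR sequence from seed.
 * A seed of 0 would lock the LFSR at 0, so it is replaced with 1. */
void ram_fill(volatile uint32_t *base, size_t n, uint32_t seed)
{
    uint32_t s = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++) {
        s = lfsr_next(s);
        base[i] = s;
    }
}

/* Regenerate the same sequence and compare it with memory.
 * Returns the index of the first mismatching word, or n if all match. */
size_t ram_verify(const volatile uint32_t *base, size_t n, uint32_t seed)
{
    uint32_t s = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++) {
        s = lfsr_next(s);
        if (base[i] != s)
            return i;
    }
    return n;
}
```

If the mask really is maximal-length, the period is 2^32 - 1, which is odd and so shares no factor with a power-of-two memory size; the pattern then never repeats in step with any address line, which is what lets a pass like this expose decode faults as well as stuck bits.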
For modest amounts of memory and relatively short power-on times (hours/days), this has been satisfactory.
But, for larger memory configurations and much longer up-times (weeks/months/years), I suspect not so much.
My question(s) concern internal (MCU) memory (typ static or pseudostatic) and external (DRAM) memory.
Additionally, memory that is used to store code (r/o) as well as data.
"Code" is protected from accidental overwrites by hardware (so, only "suspect" if that hardware fails *or* software deliberately disables it -- a bug; no need to worry about those).
"Data" is, well, data; hard to really know WHAT it should be at any time (unless I wrap everything in monitors, etc.).
All the memory is "soldered down" so no issues with flaky connectors, vibration, etc. ECC is not (easily) available; I'd have to create and verify syndromes with external logic and would be unable to do anything more than complain (crash) when an error was detected (no ability to rerun bus cycles).
Assume operating conditions are "within published specifications". Separately, I'll ask about the value of adding hardware to VERIFY that this remains true on an ongoing basis.
BEYOND POST...
I can verify the contents of "code" memory by simply running ongoing checksums (hashes) periodically. I.e., compute the hash when the code is loaded; then verify it remains unchanged during execution.
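As a sketch of that load-time hash and periodic re-check: the version below uses the standard reflected CRC-32 (polynomial 0xEDB88320) purely as a stand-in for whatever hash is actually chosen. The struct and function names (code_region, region_seal, region_check) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). Start with crc = 0;
 * feeding the previous result back in continues the same CRC, so a large
 * region can be checked a slice at a time from a background task. */
uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical descriptor for one block of loaded code. */
struct code_region {
    const uint8_t *base;      /* start of the code image in memory */
    size_t         len;       /* length in bytes */
    uint32_t       expected;  /* CRC recorded when the code was loaded */
};

/* Call once, right after the code has been loaded. */
void region_seal(struct code_region *r)
{
    r->expected = crc32_update(0u, r->base, r->len);
}

/* Call periodically. Returns 1 if the image is unchanged, 0 otherwise. */
int region_check(const struct code_region *r)
{
    return crc32_update(0u, r->base, r->len) == r->expected;
}
```

A bitwise CRC is slow but needs no table in (possibly suspect) RAM; a table-driven or hardware CRC would be the obvious substitute where run-time cost matters.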
I can regularly check pages of memory as they are released from use ("free") as well as regularly swap out in-use pages (code or data) for analysis.
I can coordinate groups of such pages -- at some difficulty and cost (in terms of idled resources) -- to check for decode errors.
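One way such a coordinated group check might look, as a rough sketch: every location in every idled page gets a value derived from its own page number and offset, ALL pages are written before ANY are read back, and a page that ends up holding another page's values points at an address decode (aliasing) fault. The page size, the names (tag, group_fill, group_verify) and the tagging scheme are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024u   /* assumed: 4 KiB pages of 32-bit words */

/* Value unique to (page number, word offset) across the group, provided
 * page_no * PAGE_WORDS + w fits in 32 bits. The XOR constant just avoids
 * writing long runs of near-zero values. */
uint32_t tag(uint32_t page_no, uint32_t w)
{
    return (page_no * PAGE_WORDS + w) ^ 0xA5A5A5A5u;
}

/* Write every page in the group before reading any of them back. */
void group_fill(volatile uint32_t *const pages[], const uint32_t page_no[],
                size_t npages)
{
    for (size_t p = 0; p < npages; p++)
        for (uint32_t w = 0; w < PAGE_WORDS; w++)
            pages[p][w] = tag(page_no[p], w);
}

/* Returns the index of the first page whose contents are not its own
 * tags (stuck bits, or another page's writes landing here), or npages
 * if every page checks out. */
size_t group_verify(volatile uint32_t *const pages[],
                    const uint32_t page_no[], size_t npages)
{
    for (size_t p = 0; p < npages; p++)
        for (uint32_t w = 0; w < PAGE_WORDS; w++)
            if (pages[p][w] != tag(page_no[p], w))
                return p;
    return npages;
}
```

The larger the group that can be idled at once, the more address lines the check exercises, which is where the cost in idled resources comes from.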
And, of course, rely on various watchdogs/daemons to HOPEFULLY spot behaviors that manifest as the result of corrupted data or code.
There are, of course, run time costs for all of this (which I can bear -- IF they are fruitful).
[I'm operating in the 1Mb internal SRAM, 2Gb DRAM (DDR2/LPDDR) arena.]
So, the questions I have are:
- what are the typical post-delivery (i.e., in the field, after the product has shipped) failure modes for DRAM technologies? internal SRAM/PSRAM?
- how do these correlate with product age, temperature, etc.? (is it more effective to MORE tightly control certain aspects of the environment?)
- is it worthwhile to monitor the environment and signal operating conditions that are suggestive of memory failures (instead of hunting for broken bits)?
- is it better to just "refresh" memory contents inherently in the design than to rely on them remaining static and unchanged?