For moderately large to very large memory (DRAM) subsystems, what sorts of policies are folks using to test RAM in POST? And in BIST (presumably more involved than POST)?
The days of device-specific test patterns seem long gone. So, cruder tests seem like they are just as effective and considerably faster.
E.g., I typically use three passes of a "carpet" pattern (seed an LFSR -- or any other PRNG -- write the byte to the current address, kick the RNG, rinse, lather, repeat; reseed the LFSR, reset the address, read the byte at the current address, compare to RNG state, kick the RNG, rinse, lather, repeat) expecting any problems to manifest as gross failures (rather than checking for disturb patterns, etc.).
[Of course, the period of the PRNG is chosen to be long and relatively prime wrt any of the addressing patterns]
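A minimal sketch of the "carpet" pass described above, in Python purely for illustration (real POST code would be C or assembly running against physical addresses). The 16-bit Galois LFSR with taps 0xB400, the seed values, and the names `lfsr_bytes`, `carpet_pass` and `carpet_test` are assumptions of this sketch, not the poster's actual code; `mem` stands in for the RAM under test. That particular LFSR has period 65535, which is odd and therefore relatively prime to any power-of-two addressing stride.

```python
def lfsr_bytes(seed, count):
    """Yield `count` bytes from a 16-bit Galois LFSR (taps 0xB400, period 65535)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    for _ in range(count):
        yield state & 0xFF
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400


def carpet_pass(mem, seed):
    """Fill `mem` from the LFSR, then reseed and verify.

    Returns the list of addresses whose contents did not read back as written.
    """
    # Write phase: one PRNG byte per address.
    for addr, value in enumerate(lfsr_bytes(seed, len(mem))):
        mem[addr] = value
    # Verify phase: restart the PRNG from the same seed and compare.
    return [addr for addr, value in enumerate(lfsr_bytes(seed, len(mem)))
            if mem[addr] != value]


def carpet_test(mem, seeds=(0xACE1, 0x1D2C, 0x7B3F)):
    """Run one carpet pass per seed; return the sorted union of failing addresses."""
    failures = set()
    for seed in seeds:
        failures.update(carpet_pass(mem, seed))
    return sorted(failures)
```

Calling `carpet_test(bytearray(4096))` on a healthy buffer returns an empty list. Using a different seed for each pass means a cell that happens to hold the "right" value in one pass is unlikely to do so in the next.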
BIST just changes the number of iterations with protections on certain key parts of the address space.
The tougher issue is testing "live" memory in systems that are "up" 24/7/365...
Systems running 24/7/365 are much more likely to have disc failure before memory failure.
But for embedded systems there may not be any disc storage. But I think the lifetime of the RAM is likely about the same as the CPU's. So it is not worth the effort, since the device may be replaced before failure starts.
Even a pacemaker has a finite lifetime.
Given that, you must have some control over the system. You're working at the BIOS level, right? So can you tell what RAM is unused? That's the easy case.
Can you force the higher level software to yield control briefly for random checks?
Otherwise, the only way I can see is to somehow watch the higher level execution and check its reads and writes. But short of single stepping, I don't know how to do it.
Custom memory hardware? A dual ported memory management unit where you can swap RAM pages at will without disturbing the application execution?
If your system needs that level of reliability, then it may be worth the money and effort.
A memory subsystem can "fail" (i.e., not be reliable in maintaining the data it is charged with preserving) without being "worn out". E.g., problems with the power supply can manifest in memory errors long before the system itself "fails".
Consider the different cases:
- POST: You essentially have the entire system at your disposal for some amount of time (ideally, you want to keep this period short so your bootstrap doesn't become a noticeable event -- perhaps have different methods of bringing up different levels of POST so you can exercise more comprehensive tests when you feel you may have more time available)
- BIST: You probably have "a good portion" of the system at your disposal. And, probably for a considerably longer period of time. I.e., you are *deliberately* engaged in testing, not "operating"
- Run-time: You probably have a severely restricted portion of the system at your disposal and probably for very short periods of time (lest your efforts start to interfere with concurrent operations)
During POST, I think the time constraints mean you can really only perform gross tests of functionality. Comprehensive/exhaustive testing would just take way too long. Hence my use of a simple test ("carpet") that hopefully catches *some* gross errors if any exist.
In BIST, I think you can be far more methodical in applying patterns to the subsystem to try to draw marginal portions of the array out.
At run-time, I think the constraints are so severe that you can really only look for gross errors in very localized portions of the array (as become available for testing)
I have many nodes in the system. If I don't need the I/O's on a particular node (i.e., if I am just using it as a compute server), then I can migrate the executing tasks to another node (possibly bringing a new "cold" node on-line just for that purpose) while the majority of the memory is tested on the "original" node. If the (or some) I/O's are *required*, then I have to leave some services running on the node to make that hardware available -- even if I migrate the tasks that are interfacing to those I/O's off to another node (as above).
Of course, the I/O's on certain nodes will tend to be "needed" more often than the I/O's on other nodes. But, I can always schedule "bulk testing" to take advantage of even brief periods where the I/O's are expected to be idle.
E.g., if the HVAC node has *just* brought the house up/down to the desired setpoint temperature, it is likely that the furnace/ACbrrr will not be needed for "a few minutes". So, I could move everything off of that node (even the "drivers" for the I/O's) for a short time while the node is placed in "test mode". The assumption being that the testing will take less time than the house's thermal time constant necessitating a reactivation of the furnace/ACbrrr.
Other nodes may be less predictable. E.g., the security system needs to remain on-line even if the house is "occupied" and the "alarm" doesn't need to be armed (e.g., consider the role of "panic switches", fire/smoke detectors, etc.)
I have a demand-paged virtual memory system. As a matter of security, every swapped out page is scrubbed before being placed back into use (as this is a potential bridging of a protection domain). As part of that scrubbing, I could briefly test the physical memory represented by that page.
But, this is HIGHLY localized. E.g., a decode failure would never (incredibly rarely!) be detectable as you deliberately can't see the entire memory subsystem; no way of knowing if your actions "here" are manifesting "there".
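For illustration, the scrub-time check on a freed page might look something like the sketch below. It is Python standing in for kernel code; the function name `scrub_and_test`, the 0x55/0xAA fill patterns and the byte-wise loop are assumptions of this sketch, and `page` stands in for the physical page frame being recycled.

```python
def scrub_and_test(page):
    """Overwrite a freed page with 0x55, then 0xAA, then zeros, verifying each fill.

    Returns True if every byte read back as written. The page is left
    zero-filled, so the previous owner's data is gone either way.
    """
    ok = True
    for fill in (0x55, 0xAA, 0x00):
        for i in range(len(page)):
            page[i] = fill
        # Read the whole page back before moving on to the next pattern.
        ok &= all(b == fill for b in page)
    return ok
```

The two alternating patterns exercise every bit of the page in both states, and the final zero fill doubles as the security scrub. The check only ever touches the one page, so it says nothing about aliasing into other pages.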
I can "silently" arrange for in use pages to be replicated and then swapped without affecting the application (esp for TEXT pages). But, it still leaves me with just a tiny, local region of memory that I can examine and play with -- hard to imagine a failure showing up there that hasn't already caused a failure elsewhere!
Note that there is a subtle difference between *ensuring* the system is reliable and detecting when it is prone to failure. Especially in my architecture, a faulty node can just be kept out of service -- making all or some of the I/O's that it handles unavailable (e.g., maybe you can't irrigate) -- without compromising everything (or, increasing the cost to ensure the "lawn can always be watered")
Well, what kind of errors do you want to detect? I'd guess the most common failure pattern for a factory test / power on self test would be individual lines shorted against ground / Vcc / each other, or disconnected, due to bad soldering, dirt, corrosion etc. Therefore, I'd just try every line and its inverse. This doesn't need to be particularly fast, and it only needs some strategically placed memory pages. But it would need all bit lines, so borrowing unused bits of an array of 32-bit words would not work.
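One way to "try every line and its inverse" is sketched below: a walking 1 and walking 0 on the data bus, plus power-of-two offsets to exercise each address line. This is an illustration, not anyone's production code; the names `check_data_lines` and `check_address_lines`, the 32-bit width, and the `write(addr, value)` / `read(addr)` callbacks are assumptions standing in for real bus accesses. Both checks overwrite the locations they touch, so live contents would have to be saved and restored.

```python
def check_data_lines(write, read, addr, width=32):
    """Walk a 1 and then a 0 across the data bus at a single address.

    Returns a bitmask of data lines that read back wrong (0 means all good).
    """
    mask = (1 << width) - 1
    bad = 0
    for bit in range(width):
        for pattern in (1 << bit, ~(1 << bit) & mask):
            write(addr, pattern)
            bad |= read(addr) ^ pattern
    return bad


def check_address_lines(write, read, addr_bits):
    """Look for stuck or shorted address lines using power-of-two offsets.

    Returns the set of offsets whose contents changed when a *different*
    offset was written (empty set means no aliasing was seen).
    """
    background, marker = 0xAAAAAAAA, 0x55555555
    offsets = [0] + [1 << b for b in range(addr_bits)]
    for off in offsets:
        write(off, background)
    bad = set()
    for target in offsets:
        write(target, marker)
        for other in offsets:
            if other != target and read(other) != background:
                bad.add(other)
        write(target, background)  # restore before testing the next line
    return bad
```

With address line 3 stuck low, for example, offsets 0 and 8 land on the same cell and both show up in the returned set.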
I only have displays on 3 nodes. And, those are "optional".
But, regardless, you'd want to test all bit positions (without having to "wait" for the right data to *happen* to be "displayed")
These are SoC's (augmented with external memory) so ECC isn't usually supported.
I don't think I need to pay for "live" error detection. I expect to catch most failures either because a node misbehaves in the course of its normal operation (emits some faulty data, "goes offline", trips a deadline handler for one of its tasks, etc.) *or* a node that is being brought on-line fails its POST (i.e., "died in its sleep").
BIST is a necessary evil to troubleshoot any system: the node is misbehaving, why? (removing and replacing a node can be expensive -- mainly labor). Being able to put a node into a diagnostic mode for an indeterminate amount of time means you can have considerable control over what it is doing during that time (letting it run some "random" collection of apps is less predictable).
Run-time testing is an attempt to bridge the two -- catching failures before they manifest. E.g., knowing that an irrigation solenoid is shorted/opened *before* you need to energize it; thus, allowing you to notify the needed repair before it has consequences.
I'm not concerned with factory test -- that can be as comprehensive as needed because the costs are external to the device(s) in question. And, because there are far more things that need to be tested than can (affordably) be accommodated with recurring dollars.
I suspect most "memory failures" won't be "hard" failures. Nor will they be directly related to the memory subsystem itself. Rather, I expect things like excessive power supply ripple (because a filter is failing over time/temperature) or other issues on which the memory relies for its proper operation (ventilation/cooling/etc.).
I have nodes installed in a wide range of environments so it's not reasonable to expect them all to be operating at a comfortable ambient, etc. And, some "less knowledgeable" user might fail to realize the consequences of his choice of siting ("Um, sure it's only 115F outside; but the sun shining directly on that nice black ABS casing in which you've mounted that node probably has the internal temperature up 50F higher!" E.g., car interiors, here, easily and OFTEN attain temperatures in excess of 140F. With outside temps above 100F for ~70-100 days each year, that's not an "exception" but, rather, a *rule*!)
Of course, POST is run only once, at the first (and hopefully the only) power-up in a few decades.
For such systems, ECC memory is typically used. In such systems you can perform "flushing" (more commonly called scrubbing), i.e. read-writeback sequences over all memory locations at regular intervals, perhaps every few minutes if strong radiation is present. If a memory word contains a bit error, the ECC will correct it and the writeback will write clean data+ECC into that memory word.
Of course, you should log the location and frequency of each ECC correction. If the need for correction is high at some location, you should declare that memory page dead and use some bad block replacement scheme, which is easy to implement on any virtual memory operating system.
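A rough sketch of that bookkeeping, assuming the scrubber (or the ECC hardware's interrupt handler) can report the address of each corrected word. The class name `ScrubLog`, the 4 KiB page size and the simple count threshold are all assumptions of this sketch; a real system would probably also age the counts so that rare, isolated upsets do not accumulate forever.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size, in bytes


class ScrubLog:
    """Count ECC corrections per page and retire pages that need too many."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.corrections = Counter()  # page number -> corrections seen so far
        self.retired = set()          # page numbers declared dead

    def record_correction(self, address):
        """Log one corrected error; return True if the page is now retired."""
        page = address // PAGE_SIZE
        self.corrections[page] += 1
        if self.corrections[page] >= self.threshold:
            self.retired.add(page)
        return page in self.retired
```

The virtual memory system would then stop handing out any page number found in `retired`.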
If it takes too long to test each individual memory cell, on a DRAM at least test every row driver and every column sense amplifier. For a single memory page, test every memory location. This will test the column sense amplifiers as well as the input/output multiplexor. In addition to this, test one memory word from each memory page, which will examine the row decoder and row driver lines.
For RAS/CAS DRAMs this will also examine all external address as well as data lines for shorts and Vcc/Gnd issues, since all lines are examined anyway.
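The sampling scheme above (every location in one page, plus one word from every other page) can be expressed as an address generator. This is only a sketch: the name `sampled_addresses`, the word-addressed layout, and always picking the first word of each page are assumptions; "page" here means a DRAM row.

```python
def sampled_addresses(total_words, page_words, full_page=0):
    """Yield every word address in one page, then one word from each other page.

    The full page exercises all column sense amplifiers and the I/O mux;
    the one-word-per-page samples exercise each row decoder and row driver.
    """
    base = full_page * page_words
    yield from range(base, base + page_words)
    for page in range(total_words // page_words):
        if page != full_page:
            yield page * page_words  # first word of the page
```

For an array of N pages of W words each, this touches W + N - 1 locations instead of N * W, which is where the time saving comes from.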
As stated elsewhere, individual nodes are powered up and down routinely within the normal operation of the system. So, it is possible for POST _on_a_specific_node_ to be run often (i.e., as often as power is cycled to that particular node).
SoC implementation so ECC is not in the cards. Even if I added the syndrome management in an external ASIC, there's no way to fault the CPU to rerun a bus cycle. So, WYSIWYG as far as DDR memory is concerned.
This is actually an amusing concept. Ask folks when they consider their ECC memory system to be "compromised" and you'll never get a firm answer. E.g., how many bus errors do you consider as sufficient to leave you wondering if the ECC is actually *detecting* all errors (let alone *correcting* "some")? How do you know that (detected) errors are completely localized and have no other consequences?
In my case, I treat errors as indicative of a failure. Most probably something in the power conditioning and not a "wear" error in a device. Leaving it unchecked will almost certainly result in more errors popping up -- some of which I will likely NOT be able to detect.
E.g., a POST error in DRAM causes me to fall back to recovery routines that operate out of (internal) SRAM. A failure in SRAM similarly causes DRAM to be used to the exclusion of SRAM. A failure in both means SoL!
Regardless, in these degraded modes, the goal is only to *report* errors and support some limited remote diagnostics -- not to attempt to *operate* in the presence of a known problem.
I use that approach for (nonvolatile) configuration memory -- primarily as a safeguard against power collapsing unexpectedly in the middle of an (atomic) update of one or more configuration parameters.
But, I haven't logged such an event in many years leading me to think it is overkill.
[OTOH, as it is only used for moving parameters from the nonvolatile store into the *working* (configuration) store, it's a one-time hit that doesn't add much to code size or execution time -- it runs once during IPL and once again at shutdown... both times where it isn't really noticeable]
I need something that will attest to the integrity of the memory subsystem as a whole, not just the nonvolatile portion of it.
Wait, WHAT? The topic began with a 24/7/365 requirement and no mention of distributed load sharing systems.
If you can run POST "often" then what new policy are you really looking for?
The security system must be 100% available, so there you use redundant systems (maybe 3 with a voting protocol). Power cycling one of the three on a schedule doesn't seem too bad. If you want security with high availability, you can't get by on the cheap.
So I don't think I see the problem.
.. and our sponsor, Duct Tape would like to remind you: in the long run, ALL solutions are temporary.
So mostly not High availability. Seeking something beyond POST may just be overkill (except for that security system).
The *system* runs 24/7/365: "... issue is testing "live" memory in systems that are 'up' 24/7/365..."
I can't run POST on any particular (i.e., randomly/periodically chosen) portion of the system at any given time. I can run POST (or BIST) on *certain* parts (nodes) of the system at *selected* times.
E.g., if I am not presently "using water" (irrigation, domestic water, etc.), then the node that is responsible for monitoring and controlling water use can be commanded to run a POST (or BIST) -- after ensuring the I/O's are "locked" in some appropriate state(s).
[For example, make sure the main water supply valve is "open", irrigation valves are "closed", etc.]
Likewise, if the security cameras covering the back yard are not needed during daylight hours, those nodes can be powered up, tested, then powered down until they *will* be needed.
Looking at different (smaller) time intervals, I can probably cheat and arrange for the HVAC node to be tested IMMEDIATELY AFTER the house has reached its heat/cool temperature -- on the assumption that the furnace/ACbrrr will *not* be needed in the N seconds/minutes that it takes for that test to complete (again, after first locking down the I/O's to some sort of "safe" state).
OTOH, the database server is the sole repository of persistent data in the system. Taking *it* offline means the system AS A WHOLE needs to be essentially quiescent -- there's no way for a node to inquire of settings, make changes to settings, respond to changes in the environment, etc. if the DB server is not responsive.
[And, the DB server has gobs of resources so testing there tends to be far more time consuming]
I don't have any explicit redundancy in the system. E.g., if the door camera bites the shed, it's gone. No way to recover that lost functionality (without the user replacing the node). Likewise, if the node that handles water usage/metering craps out, those functions are gone until the hardware is replaced (e.g., perhaps the water supply turns *off* when you'd like it to remain on; or, perhaps it is locked on even in the event of a detectable plumbing failure, etc.).
As there are no "backups" for individual I/O's, I rely on runtime testing to identify problems *before* they interfere with operation -- to give the user a "heads-up" before the system encounters a failure IN ITS INTENDED USE OF THAT FEATURE.
E.g., turn the security cameras on during the day, verify that the images returned by each are "nominal" (i.e., the tree that used to be in the center of the scene is still visible there). If not, you can alert the user before the system *requires* those cameras to be operational (to perform their security monitoring functions).
Consider the irrigation system: it may be days or even weeks before certain irrigation valves are "needed". Yet, a wire could get cut or shorted -- or a valve become mechanically inoperative -- at any time while the valve is "dormant". Waiting to detect that problem until the system decides that the valve *must* be energized means you're already too late: why didn't you fix it two days earlier when it *failed* (but, wasn't yet NEEDED)?
Permanent Temporary Fixes.
Availability is a relative concept.
If you wanted to flush a toilet and the water happened to be "off" because that node was busy doing a self-test, it's not "the end of the world"... but it would surely be annoying -- and NOTICEABLE.
Likewise, if someone came to the front door and the doorbell didn't "ring" because *that* node happened to be running a memory test...
Or, missing an incoming telephone call while the phone system was running diagnostics.
Or, someone opening (and then closing) a door to gain entry to the premises while the node charged with watching those events was "preoccupied" with testing.
How many of these are "inconveniences" is debatable: if you went to make a call with your cell phone and found it was "busy, testing", SHIRLEY you *could* wait a bit while that testing finishes, right? Are *all* your phone calls so terribly urgent that they can't wait??
OTOH, even having to wait a second more than normal while it *aborts* the test (and reloads the application) would probably be noticeable to you ("Damn phone is ALWAYS 'testing'!")
My goal is to highly integrate this system with day to day living (or "business", etc.). As such, if "some" component is always (i.e., "often") claiming to be 'busy, testing', it can be counterproductive. (using the "splash screen" diversion to hide your activities wears thin, quickly)
So, being able to "hide" these sorts of activities in ways of which the user is unaware becomes a significant design goal...
But the "system" consists of multiple nodes. The nodes are where the memory that you want tested exists. You can restart a node (triggering a POST), so FOR THAT NODE (yes, the nodes are not totally interchangeable) you do not need an elaborate run-time memory test process.
I never suggested the testing had to be random. Periodic is fine, just like your scheduled check up on your car. (Hi Joe, I brought in the car for its 60,000 mile service check)
A lot of break-ins happen in daylight hours. But you are muddying the waters, I think. Are you writing POST for DRAM in the camera? Or in the node that reads the video from the camera?
That's doable. So another node (or set of nodes?) that doesn't need fancy run-time memory testing.
Then you designed a distributed system, but still have a single point of failure for the system.
Why does the HVAC need to query the DB continuously? Even if the settings change periodically (different temps for different times of day, different days, and even different seasons), that does not mean the settings change minute by minute. So what if the temp setting changes at 5:05PM instead of 5:00PM?
How do you do DB maintenance?
Agreed, but have you measured how long?
The topic was memory testing, not the I/O. Don't muddy your own topic.
Start a new thread for I/O, predictive maintenance.
Well, actually you never provided your availability requirement other than the vague 24/7/365 quip in the first post.
But because there is water in the tank, it would work once.
Then I think you have bigger system design problems. (Over engineering)
That is another one, like Security, where you cannot get by on the cheap (single node).
100% availability requires some redundancy.
The debatable point is this: exactly what is the availability requirement for each subsystem?
For example, the system I am working on has a requirement that it is unavailable for clinical use less than a small number of hours per year (not counting scheduled maintenance).
You got that right!
And needs to be addressed, but it is a different topic.
The node and all the I/O's that it services are unavailable during POST. As I said, previously, POST wants to achieve a balance between thoroughness and expediency -- any time spent *in* POST increases the time before the node can be brought on-line for its normal operation. BIST takes the attitude that testing is the operational mode of the node -- so, like POST, the node's normal functions are not provided to the system.
Run-time testing (of all components in a node) attempts to juggle both criteria -- testing *and* operation.
But you (I) can't even guarantee any particular periodicity -- that was the point of my "randomly/periodically" comment. "Testing" is just another workload that has to be scheduled based on its needs and impositions on (portions of) the system.
Our back yard is protected and "supervised" -- threats would come from the front of the building. Some other homeowner (business owner) may have the exact opposite set of circumstances. As such, the "testing" workload has to adapt to the other uses that each particular node is called on to perform as defined in *that* particular system (not something that is known at compile-time)
I verify that the camera's functionality will be available to the system. This means:
- the PTZ mount will respond to motion commands
- the camera will deliver a "video signal"
- the video signal will represent the image of the "scene" before the camera (i.e., if there was a tree in the scene the last time the camera was verified as operational, that tree should still be there!)
- the memory in which that image will be analyzed (motion detection, etc.) is functional
The system degrades. The functionality that user A considers important may not be the same that user B desires. If a user wants the DBMS to be redundantly implemented, he adds another instance (or several) to the system.
I've put a lot of effort into eking out every last bit of *system* reliability from the components as it degrades. E.g., if external DRAM dies, a node can degrade to a mode whereby its virtualized I/O's are serviced by code running on some other node -- possibly one that was powered up in response to the detected memory failure on that node! OTOH, if the *user* considers that functionality to be "disposable", then no other node need "sacrifice" resources to address that failure; wait for the user to install a replacement!
The DB server is the ONLY source of persistent store in the system.
As such, *everything* gets its marching orders (indirectly) from tables in the DBMS.
And, as the HVAC *observes* conditions, the only place where those observations can be *stored* is in the DBMS. I.e., I don't say "set the temperature to X degrees at time T" but, rather, "at time T, the user wants the temperature to be X" -- the system sorts out what it has to do (and when) in order to achieve that goal. It does this by learning how the building reacts (e.g., to outdoor conditions) and how the plant compensates (when commanded).
Additionally, if the HVAC node invokes a service (possibly on another node) and that service requires something of the DBMS, then you also have an indirect dependency relationship. E.g., if the HVAC needs to load the "evaporative cooling module" (a "module" being a piece of code), that is fetched from the only PERSISTENT STORAGE in the system: the DBMS. If a node is "brought up", the code that runs *in* that node is similarly supplied by the DBMS ("ROMS" just contain bootstraps).
The DB isn't visible to the user. Each application that needs access to some particular set of tables/relations accesses and maintains those.
How do you maintain the data/tables you have in your product's *RAM*? (Ans: the producers and consumers of those data do the maintenance!)
There's 16G of DRAM in the DBMS server along with gobs of spinning media. How long does it take to do a *comprehensive* test of your PC and its components?
Each node is implemented as printed circuit boards. On those boards are components. Some of those components switch coils that gate the flow of water through pipes. Some of those components drive motors that position cameras. Some components sense temperature, humidity, etc. And, SOME STORE DATA (i.e., DRAM). Any component can fail!
Testing I/O's is not a "special case" any more than testing *memory* is a "special case". The goal is to ensure the hardware can perform the tasks it will be asked to perform when called upon to perform them.
It's the exact same issue! Components are components. Does a user care if the DRAM in his phone system died vs. a protection network from a lightning strike on the PSTN interface? As far as he is concerned, "My phone is broke!" Letting him know he's got a potential problem brewing BEFORE he is victimized by it makes for a friendlier device. Even if the remedial action he takes is to UNPLUG the phone interface and connect a WE station set to the lines, directly!
Because that is something that the user defines.
I drive very little. I can tolerate a vehicle being "down" for a week at a time without noticeably impacting my lifestyle. My neighbor drives a *lot*! He can't tolerate "several hours" without a vehicle (and gets a loaner any time his car is in for *any* service -- even an oil change!)
We have lots of citrus trees. A failure in the irrigation system means we'd have to drag out a garden hose and manually irrigate if we couldn't get the system repaired in a few days. My (other) neighbor lets his fruit rot on the trees... if HIS irrigation system failed, he wouldn't even notice!
"If you wanted to wash your hands after going to the bathroom..." "If you wanted to take a shower..." "If you wanted to do laundry..." "If you wanted a glass of drinking water..." "If ..."
So, a doorbell should have DEDICATED wires, transformer and annunciator? And, if the residents are *deaf*, they should install visual annunciators in every room of the house (lest they not be able to see the lamp flashing in the living room while they are located in one of the bedrooms -- or *asleep*?). And, if they happen to be out in the back yard, gardening?
If a semi-trailer shows up at a loading dock and "rings the bell" to gain entry -- but there isn't a *dedicated* attendant just sitting there all day waiting for deliveries -- should there be bells located throughout the facility in every place the attendant might happen to be (bathroom, front office, stock room, etc.)?
OTOH, if a "system" can notice that "doorbell ring" and notify the responsible party WHEREVER HE MAY BE, then there is no need to bother everyone else in the facility with these events (like paging systems in days of old)
But you *can* -- if you can run diagnostics while the node is still providing its core functionality! If you require the node to be power cycled to enter POST -- or, commanded to enter BIST -- then you leave the system without that functionality even though there is not a real *failure* present (e.g., if that node *had* a genuine failure, then you're SoL; but, if it doesn't have a catastrophic failure yet is "busy, testing", you don't want the system to behave as if that node was "broken/unavailable").
Do you *own* anything that guarantees 100% availability? (Cell)Phone? Thermostat? Vehicle? PC? Lightbulb? etc. You tolerate some potential risk to greatly offset added cost AND COMPLEXITY!
How many folks *don't* do regular backups on their PC's -- despite the value of that content?
How many hours without power before the perishables in your refrigerator (or freezer) are "suspect"?
How many folks pull the failing batteries out of their smoke detectors (potentially putting their lives at risk) just to silence the annoying "dying battery" chirp? Why not keep spare batteries on hand??
People make their own decisions as to where to spend their dollars and risk. We have a "wired" station set as a backup to the cordless phones and a cell phone as a backup to the land-line. Yet, we can still find ourselves without phone service depending on what sort of "problem" manifests upstream from us.
That's up to the user. It's impractical to offer a system of this scale with every possible set of priorities to address every possible set of constraints that any *potential* user might envision.
Look around your house. What "backup" do you have for your garage door opener (imagine if it fails while you are *outside*)? Doorbell? Thermostat? Irrigation system? Furnace/ACbrrr? Hot water heater? TV? "HiFi"? Phone? Alarm system?
All of these things *do* fail. Yet, how many folks have a "hot spare" on hand? Or, even a *cold* spare?
The difference is "failures" are things that users can address -- even if not desired: time to buy a new one. Artificially induced "unavailability" ('busy, testing') has the potential to be far more frequent than the once-in-a-product's-lifetime "sorry, this is broken"!
I have no scheduled maintenance. Nodes are added by connecting them to a switch. Software is updated by adding entries to tables in the DBMS. As nodes can and do come on-line and off-line regularly, changes and enhancements seamlessly merge with the existing components.
Reliability is addressed by keeping spares -- for whatever YOU consider to be important.
But, the system needs to be able to tell you when those spares are (or may be) needed! You don't want to watch a tree start dropping fruit before you discover that the irrigation valve that services that tree (or, perhaps the entire irrigation controller!) is malfunctioning.
I disagree. That's the point of run-time (memory, in this case) testing!