Memory testing

Hi,

For moderately large to very large memory (DRAM) subsystems,
what sorts of policies are folks using to test RAM in POST?
And in BIST (presumably more involved than POST)?

The days of device-specific test patterns seem long gone.
So, cruder tests seem like they are just as effective and
considerably faster.

E.g., I typically use three passes of a "carpet" pattern
(seed a LFSR -- or any other PRNG -- write the byte to
the current address, kick the RNG, rinse, lather, repeat;
reseed the LFSR, reset the address, read the byte at the
current address, compare to RNG state, kick the RNG, rinse,
lather, repeat) expecting any problems to manifest as gross
failures (rather than checking for disturb patterns, etc.)

[Of course, the period of the PRNG is chosen to be long and
relatively prime wrt any of the addressing patterns]
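For concreteness, a minimal sketch of one such pass in C (the 32-bit Galois
LFSR taps, byte-wide accesses, and the name carpet_pass are purely
illustrative, not anything canonical):

#include <stdint.h>
#include <stddef.h>

/* One step of a 32-bit Galois LFSR; any long-period PRNG will do. */
static uint32_t lfsr_step(uint32_t s)
{
    return (s >> 1) ^ (-(s & 1u) & 0xEDB88320u);
}

/*
 * One "carpet" pass: fill the region with a pseudo-random byte stream,
 * then reseed, re-walk the region and compare.  Returns the address of
 * the first mismatch, or NULL if the pass was clean.
 */
volatile uint8_t *carpet_pass(volatile uint8_t *base, size_t len, uint32_t seed)
{
    uint32_t s = seed;
    size_t i;

    for (i = 0; i < len; i++) {         /* write phase */
        base[i] = (uint8_t)s;
        s = lfsr_step(s);
    }

    s = seed;                           /* reseed, then verify phase */
    for (i = 0; i < len; i++) {
        if (base[i] != (uint8_t)s)
            return &base[i];            /* gross failure */
        s = lfsr_step(s);
    }
    return NULL;
}

Three passes, per the above, are just three calls with different seeds.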

BIST just changes the number of iterations with protections
on certain key parts of the address space.

The tougher issue is testing "live" memory in systems that
are "up" 24/7/365...

Memory testing
Systems running 24/7/365 are much more likely to have disc failure before memory failure.  

But for embedded systems there may not be any disc storage. But I think the lifetime of the RAM is likely about the same as that of the CPU. So it is not worth the effort, since the device may be replaced before failure starts.

Even a pacemaker has a finite lifetime.  

Given that, you must have some control over the system. You're working at the BIOS level, right? So can you tell what RAM is unused? That's the easy case.

Can you force the higher level software to yield control briefly for random checks?  

Otherwise, the only way I can see is to somehow watch the higher level execution and check its reads and writes. But short of single stepping, I don't know how to do it.

Custom memory hardware? A dual ported memory management unit where you can swap RAM pages at will without disturbing the application execution?  

If your system needs that level of reliability, then it may be worth the money and effort.  

ed

Re: Memory testing
Hi Ed,

On 6/4/2015 4:11 AM, Ed Prochak wrote:
Quoted text here. Click to load it

No rotating media.

Quoted text here. Click to load it

A memory subsystem can "fail" (i.e., not be reliable in maintaining the
data it is charged with preserving) without being "worn out".  E.g.,
problems with the power supply can manifest in memory errors long before
the system itself "fails".

Quoted text here. Click to load it

Consider the different cases:
- POST
   You essentially have the entire system at your disposal for some amount
   of time (ideally, you want to keep this period short so your bootstrap
   doesn't become a noticeable event -- perhaps have different methods of
   bringing up different levels of POST so you can exercise more comprehensive
   tests when you feel you may have more time available)
- BIST
   You probably have "a good portion" of the system at your disposal.  And,
   probably for a considerably longer period of time.  I.e., you are
   *deliberately* engaged in testing, not "operating"
- Run-time
   You probably have a severely restricted portion of the system at your
   disposal and probably for very short periods of time (lest your efforts
   start to interfere with concurrent operations)

During POST, I think the time constraints mean you can really only perform
gross tests of functionality.  Comprehensive/exhaustive testing would just
take way too long.  Hence my use of a simple test ("carpet") that hopefully
catches *some* gross errors if any exist.

In BIST, I think you can be far more methodical in applying patterns to
the subsystem to try to draw marginal portions of the array out.

At run-time, I think the constraints are so severe that you can really only
look for gross errors in very localized portions of the array (as become
available for testing)

Quoted text here. Click to load it

I have many nodes in the system.  If I don't need the I/O's on a particular
node (i.e., if I am just using it as a compute server), then I can migrate
the executing tasks to another node (possibly bringing a new "cold" node
on-line just for that purpose) while the majority of the memory is tested
on the "original" node.  If the (or some) I/O's are *required*, then I
have to leave some services running on the node to make that hardware
available -- even if I migrate the tasks that are interfacing to those
I/O's off to another node (as above).

Of course, the I/O's on certain nodes will tend to be "needed" more often
than the I/O's on other nodes.  But, I can always schedule "bulk testing"
to take advantage of even brief periods where the I/O's are expected to
be idle.

E.g., if the HVAC node has *just* brought the house up/down to the desired
setpoint temperature, it is likely that the furnace/ACbrrr will not be needed
for "a few minutes".  So, I could move everything off of that node (even
the "drivers" for the I/O's) for a short time while the node is placed in
"test mode".  The assumption being that the testing will take less time
than the house's thermal time constant necessitating a reactivation of
the furnace/ACbrrr.

Other nodes may be less predictable.  E.g., the security system needs to
remain on-line even if the house is "occupied" and the "alarm" doesn't
need to be armed (e.g., consider the role of "panic switches", fire/smoke
detectors, etc.)

Quoted text here. Click to load it

I have a demand-paged virtual memory system.  As a matter of security,
every swapped out page is scrubbed before being placed back into use
(as this is a potential bridging of a protection domain).  As part of that
scrubbing, I could briefly test the physical memory represented by that
page.
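A rough sketch of what that combined scrub-and-test could look like (the
page size, names, and the simple inversion pattern are illustrative, not a
description of the actual implementation):

#include <stdint.h>
#include <stddef.h>

#define PAGE_BYTES 4096u            /* illustrative page size */

/*
 * Scrub a page just released from a protection domain and, while we
 * own it exclusively anyway, spot-test it.  Returns 0 if the page
 * looks healthy, -1 otherwise.
 */
int scrub_and_spot_test(volatile uint32_t *page)
{
    const size_t n = PAGE_BYTES / sizeof(uint32_t);
    size_t i;

    /* quick inversion test: each cell must hold a pattern and its complement */
    for (i = 0; i < n; i++) {
        page[i] = 0xAAAAAAAAu;
        if (page[i] != 0xAAAAAAAAu)
            return -1;
        page[i] = 0x55555555u;
        if (page[i] != 0x55555555u)
            return -1;
    }

    /* scrub: leave no trace of the previous owner's data */
    for (i = 0; i < n; i++)
        page[i] = 0;

    return 0;
}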

But, this is HIGHLY localized.  E.g., a decode failure would never (or only
incredibly rarely!) be detectable, as you can't see the entire memory
subsystem at once; there's no way of knowing if your actions "here" are
manifesting "there".

I can "silently" arrange for in use pages to be replicated and then swapped
without affecting the application (esp for TEXT pages).  But, it still
leaves me with just a tiny, local region of memory that I can examine and
play with -- hard to imagine a failure showing up there that hasn't already
caused a failure elsewhere!

Quoted text here. Click to load it

Note that there is a subtle difference between *ensuring* the system is
reliable and detecting when it is prone to failure.  Especially in my
architecture, a faulty node can just be kept out of service -- making
all or some of the I/O's that it handles unavailable (e.g., maybe
you can't irrigate) -- without compromising everything (or, increasing the
cost to ensure the "lawn can always be watered")

Re: Memory testing
Quoted text here. Click to load it

If the memory has a fixed block of 24/32bpp video memory, then you can  
borrow 2-3 bits of each band without much visible disturbance.

But indeed even if you can find a block of free memory, it is easy to  
saturate the bus and cause deadlines to be missed.

High-reliability systems often employ Hamming codes (for booleans and  
enums) and inverted shadow copies for other values (which are checked on  
each access).
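E.g., a minimal sketch of the inverted-shadow idea for scalar values (the
type and function names are only illustrative):

#include <stdint.h>
#include <stdbool.h>

/* A value stored together with its bitwise complement ("inverted shadow"). */
typedef struct {
    uint32_t value;
    uint32_t shadow;       /* always ~value while the pair is intact */
} guarded_u32;

static void guarded_write(guarded_u32 *g, uint32_t v)
{
    g->value  = v;
    g->shadow = ~v;
}

/* Returns false if the stored pair no longer agrees (memory corruption). */
static bool guarded_read(const guarded_u32 *g, uint32_t *out)
{
    if ((g->value ^ g->shadow) != 0xFFFFFFFFu)
        return false;      /* value and shadow disagree: flag the fault */
    *out = g->value;
    return true;
}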


--  
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/

Re: Memory testing
Boudewijn Dijkstra wrote:
Quoted text here. Click to load it

Well, what kind of errors do you want to detect? I'd guess the most
common failure pattern for a factory test / power on self test would be
individual lines shorted against ground / Vcc / each other, or
disconnected, due to bad soldering, dirt, corrosion, etc. Therefore, I'd
just try every line and its inverse. This doesn't need to be
particularly fast, and it only needs some strategically placed memory
pages. But it would need all bit lines, so borrowing unused bits of an
array of 32-bit words would not work.
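Something along these lines, perhaps (a sketch assuming a 32-bit data bus:
the classic walking-ones pass over the data lines plus power-of-two offsets
over the address lines):

#include <stdint.h>
#include <stddef.h>

/* Walk a single 1 (and its inverse) across the data bus at one location.
 * Catches data lines that are stuck, shorted to Vcc/GND, or shorted to
 * each other.  Returns 0 on success, -1 on the first mismatch. */
int data_line_test(volatile uint32_t *addr)
{
    uint32_t walk;

    for (walk = 1; walk != 0; walk <<= 1) {
        *addr = walk;
        if (*addr != walk)
            return -1;
        *addr = ~walk;
        if (*addr != ~walk)
            return -1;
    }
    return 0;
}

/* Write a distinct value at each power-of-two word offset, then disturb
 * offset 0 and verify nothing aliased.  Catches stuck/open/shorted
 * address lines. */
int addr_line_test(volatile uint32_t *base, size_t len_words)
{
    size_t offset;

    for (offset = 1; offset < len_words; offset <<= 1)
        base[offset] = (uint32_t)offset;
    base[0] = 0x55555555u;                      /* the "disturber" */

    for (offset = 1; offset < len_words; offset <<= 1)
        if (base[offset] != (uint32_t)offset)
            return -1;                          /* aliasing detected */
    return 0;
}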


  Stefan


Re: Memory testing
Hi Stefan,

On 6/4/2015 9:06 AM, Stefan Reuther wrote:
Quoted text here. Click to load it

I'm not concerned with factory test -- that can be as comprehensive as needed
because the costs are external to the device(s) in question.  And, because
there are far more things that need to be tested than can (affordably)
be accommodated with recurring dollars.

Quoted text here. Click to load it

I suspect most "memory failures" won't be "hard" failures.  Nor will they
be directly related to the memory subsystem itself.  Rather, I expect
things like excessive power supply ripple (because a filter is failing
over time/temperature) or other issues on which the memory relies for
its proper operation (ventilation/cooling/etc.).

I have nodes installed in a wide range of environments, so it's not reasonable
to expect them all to be operating at a comfortable ambient, etc.  And, some
"less knowledgeable" user might fail to realize the consequences of his
choice of siting ("Um, sure it's only 115F outside; but the sun shining
directly on that nice black ABS casing in which you've mounted that node
probably has the internal temperature up 50F higher!"  E.g., car interiors,
here, easily and OFTEN attain temperatures in excess of 140F.  With
outside temps above 100F for ~70-100 days each year, that's not an
"exception" but, rather, a *rule*!)

Re: Memory testing
On Thu, 04 Jun 2015 18:06:47 +0200, Stefan Reuther

Quoted text here. Click to load it

If it takes too long to test each individual memory cell, on a DRAM at
least test every row driver and every column sense amplifier. For a
single memory page, test every memory location. This will test the
column sense amplifiers as well as the input/output multiplexor. In
addition to this, test one memory word from each memory page, which
will examine the row decoder and row driver lines.

For RAS/CAS DRAMs this will also examine all external address as well
as data lines for shorts and Vcc/Gnd issues, since all lines are
examined anyway.
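A sketch of that reduced coverage (the page/row size is device dependent;
1024 words here is purely illustrative):

#include <stdint.h>
#include <stddef.h>

#define PAGE_WORDS  1024u   /* words per DRAM page/row -- device dependent */

/* Test every word of one page: exercises the column decoders, the
 * sense amplifiers and the data path. */
int test_one_full_page(volatile uint32_t *page)
{
    size_t i;

    for (i = 0; i < PAGE_WORDS; i++)
        page[i] = (uint32_t)(i * 0x9E3779B9u);   /* distinct-ish values */
    for (i = 0; i < PAGE_WORDS; i++)
        if (page[i] != (uint32_t)(i * 0x9E3779B9u))
            return -1;
    return 0;
}

/* Test one word per page across the array: exercises the row decoder
 * and row driver lines without touching every cell. */
int test_one_word_per_page(volatile uint32_t *base, size_t n_pages)
{
    size_t p;

    for (p = 0; p < n_pages; p++)
        base[p * PAGE_WORDS] = (uint32_t)p ^ 0xA5A5A5A5u;
    for (p = 0; p < n_pages; p++)
        if (base[p * PAGE_WORDS] != ((uint32_t)p ^ 0xA5A5A5A5u))
            return -1;
    return 0;
}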


Re: Memory testing
Hi Boudewijn,

On 6/4/2015 6:08 AM, Boudewijn Dijkstra wrote:
Quoted text here. Click to load it

I only have displays on 3 nodes.  And, those are "optional".

But, regardless, you'd want to test all bit positions (without having to
"wait" for the right data to *happen* to be "displayed")

Quoted text here. Click to load it

These are SoC's (augmented with external memory) so ECC isn't usually
supported.

I don't think I need to pay for "live" error detection.  I expect to catch
most failures either because a node misbehaves in the course of its normal
operation (emits some faulty data, "goes offline", trips a deadline handler
for one of its tasks, etc.) *or* a node that is being brought on-line fails
its POST (i.e., "died in its sleep").

BIST is a necessary evil to troubleshoot any system:  the node is misbehaving,
why?  (removing and replacing a node can be expensive -- mainly labor).
Being able to put a node into a diagnostic mode for an indeterminate amount of
time means you can have considerable control over what it is doing during that
time (letting it run some "random" collection of apps is less predictable).

Run-time testing is an attempt to bridge the two -- catching failures
before they manifest.  E.g., knowing that an irrigation solenoid is
shorted/opened *before* you need to energize it; thus, allowing you
to report the needed repair before it has consequences.


Re: Memory testing
Quoted text here. Click to load it

I wasn't talking about ECC.  I meant in software.  Which is overkill for  
most applications.


--  
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/

Re: Memory testing
On 6/8/2015 6:45 AM, Boudewijn Dijkstra wrote:
Quoted text here. Click to load it

I use that approach for (nonvolatile) configuration memory -- primarily as a
safeguard against power collapsing unexpectedly in the middle of an (atomic)
update of one or more configuration parameters.

But, I haven't logged such an event in many years leading me to think it is
overkill.

[OTOH, as it is only used for moving parameters from the nonvolatile store
into the *working* (configuration) store, it's a one-time hit that doesn't
add much to code size or execution time -- it runs once during IPL and
once again at shutdown... both times where it isn't really noticeable]

I need something that will attest to the integrity of the memory subsystem
as a whole, not just the nonvolatile portion of it.

Re: Memory testing


Quoted text here. Click to load it

Of course, POST is done only once, at the first (and hopefully the only)
power-up in a few decades.

For such systems, ECC memory is typically used. In such systems you can
perform "flushing", i.e., read-writeback sequences over all memory
locations at regular intervals, perhaps every few minutes if strong
radiation is present. If a memory word contains a bit error, the ECC
will correct it and the writeback will write clean data+ECC into that
memory word.

Of course, you should log the location and frequency of ECC corrections;
when the need for correction is high at some location, you should declare
that memory page dead and use some bad block replacement system, which is
easy to implement on any virtual memory operating system.
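In outline, one scrub pass might look like this (the two extern hooks stand
in for whatever correction counter and logging facility the platform
actually provides; they are assumptions, not a real API):

#include <stdint.h>
#include <stddef.h>

/* Platform-specific hooks (assumed, not a real API): a counter of
 * corrected single-bit errors and a logger for their locations. */
extern unsigned ecc_correction_count(void);
extern void     log_ecc_correction(volatile uint32_t *addr);

/*
 * One scrub ("flushing") pass: read every word -- the ECC hardware
 * corrects single-bit errors on the fly -- and write the corrected
 * value back so the stored data+ECC are clean again.
 */
void scrub_region(volatile uint32_t *base, size_t len_words)
{
    size_t i;

    for (i = 0; i < len_words; i++) {
        unsigned before = ecc_correction_count();

        uint32_t v = base[i];   /* read: correction happens here, if needed */
        base[i] = v;            /* writeback: re-encode clean data+ECC */

        if (ecc_correction_count() != before)
            log_ecc_correction(&base[i]);   /* feed the bad-page bookkeeping */
    }
}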


Re: Memory testing
On 6/4/2015 11:42 PM, snipped-for-privacy@downunder.com wrote:
Quoted text here. Click to load it

As stated elsewhere, individual nodes are powered up and down routinely
within the normal operation of the system.  So, it is possible for POST
_on_a_specific_node_ to be run often (i.e., as often as power is cycled to
that particular node).

Quoted text here. Click to load it

SoC implementation so ECC is not in the cards.  Even if I added the syndrome
management in an external ASIC, there's no way to fault the CPU to rerun
a bus cycle.  So, WYSIWYG as far as DDR memory is concerned.

Quoted text here. Click to load it

This is actually an amusing concept.  Ask folks when they consider
their ECC memory system to be "compromised" and you'll never get a
firm answer.  E.g., how many bus errors do you consider as sufficient
to leave you wondering if the ECC is actually *detecting* all errors
(let alone *correcting* "some")?  How do you know that (detected) errors
are completely localized and have no other consequences?

<shrug>

In my case, I treat errors as indicative of a failure.  Most probably
something in the power conditioning and not a "wear" error in a device.
Leaving it unchecked will almost certainly result in more errors popping
up -- some of which I will likely NOT be able to detect.

E.g., a POST error in DRAM causes me to fall back to recovery routines
that operate out of (internal) SRAM.  A failure in SRAM similarly
causes DRAM to be used to the exclusion of SRAM.  A failure in both
means SoL!

Regardless, in these degraded modes, the goal is only to *report*
errors and support some limited remote diagnostics -- not to attempt
to *operate* in the presence of a known problem.


Re: Memory testing

Hi Don,

On Friday, June 5, 2015 at 4:46:19 AM UTC-4, Don Y wrote:
Quoted text here. Click to load it

Wait, WHAT?
The topic began with a 24/7/365 requirement and no mention of distributed load-sharing systems.

If you can run POST "often" then what new policy are you really looking for?

The Security system must be 100% available, so there you use redundant systems (maybe 3 with a voting protocol). Power cycling one of the three on a schedule doesn't seem too bad. If you want security with high availability, you can't get by on the cheap.

So I don't think I see the problem.  

[]
Quoted text here. Click to load it
.. and our sponsor, Duct Tape would like to remind you:
               in the long run, ALL solutions are temporary.

Quoted text here. Click to load it

So, mostly not high availability. Seeking something beyond POST may just be overkill (except for that security system).

ed

Re: Memory testing
Hi Ed,

On 6/9/2015 2:32 PM, Ed Prochak wrote:
Quoted text here. Click to load it

The *system* runs 24/7/365:
    "... issue is testing "live" memory in systems that are 'up' 24/7/365..."
------------------------------------------^^^^^^^

Quoted text here. Click to load it

I can't run POST on any particular (i.e., randomly/periodically chosen)
portion of the system at any given time.  I can run POST (or BIST) on
*certain* parts (nodes) of the system at *selected* times.

E.g., if I am not presently "using water" (irrigation, domestic water, etc.),
then the node that is responsible for monitoring and controlling water use
can be commanded to run a POST (or BIST) -- after ensuring the I/O's are
"locked" in some appropriate state(s).

[For example, make sure the main water supply valve is "open", irrigation
valves are "closed", etc.]

Likewise, if the security cameras covering the back yard are not needed
during daylight hours, those nodes can be powered up, tested, then powered
down until they *will* be needed.

Looking at different (smaller) time intervals, I can probably cheat and
arrange for the HVAC node to be tested IMMEDIATELY AFTER the house has
reached its heat/cool temperature -- on the assumption that the furnace/ACbrrr
will *not* be needed in the N seconds/minutes that it takes for that test
to complete (again, after first locking down the I/O's to some sort of "safe"
state).

OTOH, the database server is the sole repository of persistent data
in the system.  Taking *it* offline means the system AS A WHOLE needs
to be essentially quiescent -- there's no way for a node to inquire of
settings, make changes to settings, respond to changes in the environment,
etc. if the DB server is not responsive.

[And, the DB server has gobs of resources so testing there tends to
be far more time consuming]

Quoted text here. Click to load it

I don't have any explicit redundancy in the system.  E.g., if the door
camera bites the shed, it's gone.  No way to recover that lost functionality
(without the user replacing the node).  Likewise, if the node that handles
water usage/metering craps out, those functions are gone until the hardware
is replaced (e.g., perhaps the water supply turns *off* when you'd like it to
remain on; or, perhaps it is locked on even in the event of a detectable
plumbing failure, etc.).

As there are no "backups" for individual I/O's, I rely on runtime testing
to identify problems *before* they interfere with operation -- to give
the user a "head's up" before the system encounters a failure IN IT'S
INTENDED USE OF THAT FEATURE.

E.g., turn the security cameras on during the day, verify that the
images returned by each are "nominal" (i.e., the tree that used to
be in the center of the scene is still visible there).  If not, you
can alert the user before the system *requires* those cameras to be
operational (to perform their security monitoring functions).

Consider the irrigation system:  it may be days or even weeks for
certain irrigation valves to be "needed".  Yet, a wire could get
cut or shorted -- or a valve mechanically inoperative -- at any time
while the valve is "dormant".  Waiting to detect that problem until
the system decides that the valve *must* be energized means you're
already too late:  why didn't you fix it two days earlier when it
*failed* (but, wasn't yet NEEDED)?

Quoted text here. Click to load it

Permanent Temporary Fixes.

Quoted text here. Click to load it

Availability is a relative concept.

If you wanted to flush a toilet and the water happened to be "off"
because that node was busy doing a self-test, it's not "the end of
the world"... but it would surely be annoying -- and NOTICEABLE.

Likewise, if someone came to the front door and the doorbell didn't
"ring" because *that* node happened to be running a memory test...

Or, missing an incoming telephone call while the phone system was
running diagnostics.

Or, someone opening (and then closing) a door to gain entry to the
premises while the node charged with watching those events was
"preoccupied" with testing.

Etc.

How many of these are "inconveniences" is debatable:  if you went to
make a call with your cell phone and found it was "busy, testing",
SHIRLEY you *could* wait a bit while that testing finishes, right?  Are
*all* your phone calls so terribly urgent that they can't wait??

OTOH, even having to wait a second more than normal while it *aborts*
the test (and reloads the application) would probably be noticeable to
you ("Damn phone is ALWAYS 'testing'!")

My goal is to highly integrate this system with day to day living
(or "business", etc.).  As such, if "some" component is always
(i.e., "often") claiming to be 'busy, testing', it can be
counterproductive.  (using the "splash screen" diversion to hide
your activities wears thin, quickly)

So, being able to "hide" these sorts of activities in ways of which the
user is unaware becomes a significant design goal...

Re: Memory testing
On Wednesday, June 10, 2015 at 2:01:53 AM UTC-4, Don Y wrote:
Quoted text here. Click to load it
Quoted text here. Click to load it


But the "system" consists of multiple nodes. The nodes are where the memory
 that you want tested exists. You can restart a node (triggering a POST), s
o FOR THAT NODE (yes the nodes are not totally interchangeable) you do not  
need an elaborate run-time memory test process.
Quoted text here. Click to load it
Quoted text here. Click to load it

I never suggested the testing had to be random. Periodic is fine, just like
your scheduled check up on your car. (Hi Joe, I brought in the car for its
60,000 mile service check)
Quoted text here. Click to load it

A lot of break-ins happen in daylight hours. But you are muddying the waters,
I think. Are you writing POST for DRAM in the camera? or in the node that
reads the video from the camera?


Quoted text here. Click to load it
Quoted text here. Click to load it

That's doable. So another node (or set of nodes?) that doesn't need fancy
run-time memory testing.
Quoted text here. Click to load it

Then you designed a distributed system, but still have a single point of
failure for the system.

Why does the HVAC need to query the DB continuously? Even if the settings
change periodically (different temps for different times of day, different
days, and even different seasons), that does not mean the settings change
minute by minute. So what if the temp setting changes at 5:05PM instead of
5:00PM.

How do you do DB maintenance?

Quoted text here. Click to load it

Agreed, but have you measured how long?

Quoted text here. Click to load it
Quoted text here. Click to load it

The topic was memory testing, not the I/O. Don't muddy your own topic.

Quoted text here. Click to load it

The topic was memory testing, not the I/O. Don't muddy your own topic.

Start a new thread for I/O, predictive maintenance.


Quoted text here. Click to load it
Quoted text here. Click to load it
Quoted text here. Click to load it
Quoted text here. Click to load it

Well, actually you never provided your availability requirement other than  
the vague 24/7/365 quip in the first post.
Quoted text here. Click to load it

But because there is water in the tank, it would work once.

Quoted text here. Click to load it

Then I think you have bigger system design problems (over-engineering).

Quoted text here. Click to load it

That is another one like Security where you cannot get by on the cheap
(single node).
Quoted text here. Click to load it

100% availability requires some redundancy.
Quoted text here. Click to load it

The debatable point is this: exactly what is the availability requirement
for each subsystem?

For example the system I am working on has a requirement that it is
unavailable for clinical use less than a small number of hours per year
(not counting scheduled maintenance).


Quoted text here. Click to load it
You got that right!

Quoted text here. Click to load it

And needs to be addressed, but it is a different topic.

have a great day.

Re: Memory testing
On 6/11/2015 6:17 AM, Ed Prochak wrote:

[attrs elided]

Quoted text here. Click to load it

The node and all the I/O's that it services are unavailable during POST.
As I said, previously, POST wants to achieve a balance between thoroughness
and expediency -- any time spent *in* POST increases the time before the node
can be brought on-line for its normal operation.  BIST takes the attitude that
testing is the operational mode of the node -- so, like POST, the node's normal
functions are not provided to the system.

Run-time testing (of all components in a node) attempts to juggle both
criteria -- testing *and* operation.

Quoted text here. Click to load it

But you (I) can't even guarantee any particular periodicity -- that was the
point of my "randomly/periodically" comment.  "Testing" is just another
workload that has to be scheduled based on its needs and impositions on
(portions of) the system.

Quoted text here. Click to load it

Our back yard is protected and "supervised" -- threats would come from the
front of the building.  Some other homeowner (business owner) may have the
exact opposite set of circumstances.  As such, the "testing" workload has
to adapt to the other uses that each particular node is called on to
perform as defined in *that* particular system (not something that is known
at compile-time)

Quoted text here. Click to load it

I verify that the camera's functionality will be available to the system.
This means:
- the PTZ mount will respond to motion commands
- the camera will deliver a "video signal"
- the video signal will represent the image of the "scene" before the camera
   (i.e., if there was a tree in the scene the last time the camera was
   verified as operational, that tree should still be there!)
- the memory into which that image will be captured for analysis (motion detection) is working
etc.

Quoted text here. Click to load it

The system degrades.  The functionality that user A considers important
may not be the same that user B desires.  If a user wants the DBMS to
be redundantly implemented, he adds another (or several) other instances
to the system.

I've put a lot of effort into eking out every last bit of *system*
reliability from the components as it degrades.  E.g., if external DRAM dies,
a node can degrade to a mode whereby its virtualized I/O's are serviced by
code running on some other node -- possibly one that was powered up in
response to the detected memory failure on that node!  OTOH, if the
*user* considers that functionality to be "disposable", then no other
node need "sacrifice" resources to address that failure; wait for the
user to install a replacement!

Quoted text here. Click to load it

The DB server is the ONLY source of persistent store in the system.

As such, *everything* gets its marching orders (indirectly) from tables in
the DBMS.

And, as the HVAC *observes* conditions, the only place where those observations
can be *stored* is in the DBMS.  I.e., I don't say "set the temperature to X
degrees at time T" but, rather, "at time T, the user wants the temperature to
be X" -- the system sorts out what it has to do (and when) in order to
achieve that goal.  It does this by learning how the building reacts
(e.g., to outdoor conditions) and how the plant compensates (when commanded).

Additionally, if the HVAC node invokes a service (possibly on another node)
and that service requires something of the DBMS, then you also have an
indirect dependency relationship.  E.g., if the HVAC needs to load the
"evaporative cooling module" (a "module" being a piece of code), that is
fetched from the only PERSISTENT STORAGE in the system:  the DBMS.  If a
node is "brought up", the code that runs *in* that node is similarly
supplied by the DBMS ("ROMS" just contain bootstraps).

Quoted text here. Click to load it

The DB isn't visible to the user.  Each application that needs access to
some particular set of tables/relations accesses and maintains those.

How do you maintain the data/tables you have in your product's *RAM*?
(Ans: the producers and consumers of those data do the maintenance!)

Quoted text here. Click to load it

There's 16G of DRAM in the DBMS server along with gobs of spinning
media.  How long does it take to do a *comprehensive* test of your PC
and its components?

Quoted text here. Click to load it

Each node is implemented as printed circuit boards.  On those boards
are components.  Some of those components switch coils that gate the
flow of water through pipes.  Some of those components drive motors that
position cameras.  Some components sense temperature, humidity, etc.
And, SOME STORE DATA (i.e., DRAM).  Any component can fail!

Testing I/O's is not a "special case" any more than testing *memory* is a
"special case".  The goal is to ensure the hardware can perform the tasks
it will be asked to perform when called upon to perform them.

Quoted text here. Click to load it

It's the exact same issue!  Components are components.  Does a user care if
the DRAM in his phone system died vs. a protection network from a lightning
strike on the PSTN interface?  As far as he is concerned, "My phone is broke!"
Letting him know he's got a potential problem brewing BEFORE he is
victimized by it makes for a friendlier device.  Even if the remedial
action he takes is to UNPLUG the phone interface and connect a WE station
set to the lines, directly!

Quoted text here. Click to load it

Because that is something that the user defines.

I drive very little.  I can tolerate a vehicle being "down" for a week at
a time without noticeably impacting my lifestyle.  My neighbor drives a
*lot*!  He can't tolerate "several hours" without a vehicle (and gets a
loaner any time his car is in for *any* service -- even an oil change!)

We have lots of citrus trees.  A failure in the irrigation system means
we'd have to drag out a garden hose and manually irrigate if we couldn't
get the system repaired in a few days.  My (other) neighbor lets his fruit
rot on the trees... if HIS irrigation system failed, he wouldn't even
notice!

Quoted text here. Click to load it

"If you wanted to wash your hands after going to the bathroom..."
"If you wanted to take a shower..."
"If you wanted to do laundry..."
"If you wanted a glass of drinking water..."
"If ..."

:>

Quoted text here. Click to load it

So, a doorbell should have DEDICATED wires, transformer and annunciator?
And, if the residents are *deaf*, they should install visual annunciators
in every room of the house (lest they not be able to see the lamp flashing
in the living room while they are located in one of the bedrooms -- or
*asleep*?).  And, if they happen to be out in the back yard, gardening?

If a semi-trailer shows up at a loading dock and "rings the bell" to
gain entry -- but there isn't a *dedicated* attendant just sitting there
all day waiting for deliveries -- should there be bells located throughout
the facility in every place the attendant might happen to be (bathroom,
front office, stock room, etc.)?

OTOH, if a "system" can notice that "doorbell ring" and notify the
responsible party WHEREVER HE MAY BE, then there is no need to bother
everyone else in the facility with these events (like paging systems
in days of old)

Quoted text here. Click to load it

But you *can* -- if you can run diagnostics while the node is still providing
its core functionality!  If you require the node to be power cycled to
enter POST -- or, commanded to enter BIST -- then you leave the system
without that functionality even though there is not a real *failure*
present (e.g., if that node *had* a genuine failure, then you're SoL;
but, if it doesn't have a catastrophic failure yet is "busy, testing",
you don't want the system to behave as if that node was "broken/unavailable").

Quoted text here. Click to load it

Do you *own* anything that guarantees 100% availability?  (Cell)Phone?
Thermostat?  Vehicle?  PC?  Lightbulb?  etc.  You tolerate some potential
risk to greatly offset added cost AND COMPLEXITY!

How many folks *don't* do regular backups on their PC's -- despite the
value of that content?

How many hours without power before the perishables in your refrigerator
(or freezer) are "suspect"?

How many folks pull the failing batteries out of their smoke detectors
(potentially putting their lives at risk) just to silence the annoying
"dying battery" chirp?  Why not keep spare batteries on hand??

People make their own decisions as to where to spend their dollars
and risk.  We have a "wired" station set as a backup to the cordless
phones and a cell phone as a backup to the land-line.  Yet, we can still
find ourselves without phone service depending on what sort of "problem"
manifests upstream from us.

Quoted text here. Click to load it

That's up to the user.  It's impractical to offer a system of this scale
with every possible set of priorities to address every possible set of
constraints that any *potential* user might envision.

Look around your house.  What "backup" do you have for your garage door
opener (imagine if it fails while you are *outside*)?  Doorbell?
Thermostat?  Irrigation system?  Furnace/ACbrrr?  Hot water heater?
TV?  "HiFi"?  Phone?  Alarm system?

All of these things *do* fail.  Yet, how many folks have a "hot spare"
on hand?  Or, even a *cold* spare?

The difference is "failures" are things that users can address -- even if
not desired:  time to buy a new <whatever>.  Artificially induced
"unavailability" ('busy, testing') has the potential to be far more
frequent than the once-in-a-product's-lifetime "sorry, this is broken"!

Quoted text here. Click to load it

I have no scheduled maintenance.  Nodes are added by connecting them
to a switch.  Software is updated by adding entries to tables in the
DBMS.  As nodes can and do come on-line and off-line regularly, changes
and enhancements seamlessly merge with the existing components.

Reliability is addressed by keeping spares -- for whatever YOU consider
to be important.

But, the system needs to be able to tell you when those spares are
(or may be) needed!  You don't want to watch a tree start dropping
fruit before you discover that the irrigation valve that services
that tree (or, perhaps the entire irrigation controller!) is
malfunctioning.

Quoted text here. Click to load it

I disagree.  That's the point of run-time (memory, in this case) testing!

Quoted text here. Click to load it

Time for bed.
