I'm designing a system using lots of COTS hardware. IME, few of these things are ever designed *thinking* about their roles in a *system*. Instead, someone throws together a set of features, wraps some sort of syntax around them (in/out) and throws it out into the market. :<
As such, it's often difficult to know, with certainty, that all of the devices you are attached to are actually working, working *properly* or even POWERED UP!
Essentially, I have a processor, touch panel (EIA232), printer (probably parallel), display, barcode scanner (EIA232), electronic scale (EIA232) and *possibly* a keyboard (probably *never* see a mouse!).
Each of these things has smarts. And, each was designed without concern for any of the others *or* the processor that talks to all of them.
So, it is possible for the touch panel to get "hung" (i.e., you can't count on getting valid input from it!). Or, the printer. Or the barcode scanner. Or the scale. Or...
Most of the user's interaction is designed around an incredibly lightweight user interface. I.e., the user seldom even *looks* at the display. Also, the individual components may not be closely colocated (so, you can't count on the user to *see* that the printer isn't working, etc.)
The software runs a daemon for each device to (hopefully) detect communication problems, devices that are powered down or misconfigured, etc. But, many devices haven't been designed with keep-alive protocols in mind. And, most don't formally specify how they behave when you talk to them "regularly" (i.e., when you try to exploit configuration commands as a makeshift keep-alive: commands that IN THEORY shouldn't affect normal operation -- but end up doing so! :< )
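To make the per-device daemon idea concrete, here is a minimal sketch of the bookkeeping such a daemon might do. Everything in it is an assumption for illustration: the `DeviceWatchdog` name, the `probe` callable (whatever harmless query a given device tolerates), and the interval and miss-count values. Normal traffic from the device counts as a keep-alive, so the probe is only sent when the device has been quiet.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeviceWatchdog:
    """Tracks whether one device still answers a harmless status query."""
    name: str
    probe: Callable[[], bool]      # hypothetical: returns True if the device answered
    interval_s: float = 5.0        # assumed quiet time before a probe is sent
    max_misses: int = 3            # assumed consecutive failures before "suspect"
    clock: Callable[[], float] = time.monotonic
    misses: int = 0
    last_heard: float = 0.0

    def __post_init__(self):
        self.last_heard = self.clock()

    def heard_from_device(self):
        """Call whenever normal traffic arrives; real input counts as a keep-alive."""
        self.misses = 0
        self.last_heard = self.clock()

    def poll(self) -> str:
        """Run once per loop iteration; returns 'ok' or 'suspect'."""
        if self.clock() - self.last_heard >= self.interval_s:
            try:
                answered = self.probe()
            except OSError:
                answered = False       # e.g. the serial port itself failed
            if answered:
                self.heard_from_device()
            else:
                self.misses += 1
                self.last_heard = self.clock()   # wait a full interval before retrying
        return "suspect" if self.misses >= self.max_misses else "ok"
```

Requiring several consecutive misses keeps one dropped reply from raising an alarm, at the cost of a slower reaction to a device that really has died.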
Each "node" is configured independently of the others. E.g., one might have a barcode scanner but no printer; another might have a printer but no touch screen; another might have a scale but no *display*! Managing the configuration isn't a problem. *But*, the variations mean that you can't rely on any particular device being present at *each* node (except the processor). I.e., you can't just flash messages on a screen; or tell someone to type "REBOOT", etc.
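Since no output device is guaranteed at a node, any "tell the user" step has to fall back through whatever happens to be configured and working. A rough sketch of that selection follows. The names `NodeConfig` and `pick_alert_channel` are made up for illustration, the preference order is an assumption, and the "beeper" entry is purely hypothetical (it is not in the hardware list above).

```python
from dataclasses import dataclass

# Assumed preference order for reaching the user; "beeper" is hypothetical.
ALERT_CHANNELS = ["display", "printer", "beeper"]


@dataclass(frozen=True)
class NodeConfig:
    name: str
    devices: frozenset   # e.g. {"scanner", "scale"}; the processor is implied


def pick_alert_channel(node, healthy):
    """Return the first preferred channel that is both configured and healthy."""
    for channel in ALERT_CHANNELS:
        if channel in node.devices and channel in healthy:
            return channel
    return None   # nothing at this node can reach the user


node = NodeConfig("node-7", frozenset({"scanner", "printer"}))
print(pick_alert_channel(node, healthy={"scanner", "printer"}))
print(pick_alert_channel(node, healthy={"scanner"}))
```

A `None` result is the awkward case: the node knows something is wrong but has no local way to say so.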
There are only about 30 of these at each location. But, there won't be any MTS around to support them. So, if something doesn't *seem* to be working correctly, I need a simple protocol for (nontechnical) users to get things back to a known/running condition.
It *seems* like the only realistic AND INTUITIVE protocol for "recovery" is to cycle power (off, then on) to the devices in question. Ideally, to *every* device at a node -- though remembering to do so may be a problem (so I need to deal with the possibility that some devices get reset while others don't).
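One way to cope with a partial reset is to track each device separately and resend its full setup whenever it reappears after having gone quiet, rather than assuming the whole node restarted together. The sketch below assumes a hypothetical `configure` callable per device and a setup sequence that is safe to repeat; `DeviceSession` and the state names are invented for illustration.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"    # never configured by us, settings not trusted
    READY = "ready"        # configured by us and answering
    MISSING = "missing"    # stopped answering; may be powered down


class DeviceSession:
    def __init__(self, name, configure):
        self.name = name
        self.configure = configure   # hypothetical: sends the device's full setup
        self.state = State.UNKNOWN

    def on_silence(self):
        """Call when the device stops answering."""
        self.state = State.MISSING

    def on_response(self):
        """Call when the device answers."""
        # A device that returns after going missing may have been power-cycled,
        # so assume its settings are gone and send them again.
        if self.state is not State.READY:
            self.configure()
            self.state = State.READY
```

This only catches a reset that lasts long enough for the silence to be noticed. A device that is cycled quickly between polls would need some other tell, such as a power-up banner or a settings read-back, if the device offers one.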
And, of course, the software has to take measures to protect pending transactions as this sort of "problem" can come up at any time.
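For protecting pending transactions, one common approach (not necessarily what this system does) is an append-only journal: record the intent, force it to disk, and only treat the transaction as done once a matching commit record is also on disk. A minimal sketch, with `TransactionJournal`, the record layout and the file path all being assumptions:

```python
import json
import os


class TransactionJournal:
    """Append-only log; a transaction is done only once its commit is on disk."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # ask the OS to push the record to storage

    def begin(self, txn_id, payload):
        self._append({"op": "begin", "id": txn_id, "payload": payload})

    def commit(self, txn_id):
        self._append({"op": "commit", "id": txn_id})

    def pending(self):
        """Return transactions that were begun but never committed."""
        open_txns = {}
        if not os.path.exists(self.path):
            return open_txns
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue   # skip a line that cannot be parsed (e.g. torn write)
                if rec["op"] == "begin":
                    open_txns[rec["id"]] = rec["payload"]
                elif rec["op"] == "commit":
                    open_txns.pop(rec["id"], None)
        return open_txns
```

On restart, whatever `pending()` returns is the work that was interrupted and has to be either replayed or explicitly abandoned. The sketch does not repair a torn final line before appending to it, which a real implementation would need to handle.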
What problems am I failing to foresee? Are there any other (practical) ways of doing this?