Yellow Book: "System Recovery for Dummies"

D Yuniskis 16 years ago

Hi,

I'm designing a system using lots of COTS hardware. IME, few of these things are ever designed *thinking* about their roles in a *system*. Instead, someone throws together a set of features, wraps some sort of syntax around them (in/out) and throws it out into the market. :<

As such, it's often difficult to know, with certainty, that all of the devices you are attached to are actually working, working *properly* or even POWERED UP!

Essentially, I have a processor, touch panel (EIA232), printer (probably parallel), display, barcode scanner (EIA232), electronic scale (EIA232) and *possibly* a keyboard (probably *never* see a mouse!).

Each of these things has smarts. And, each was designed without concern for any of the others *or* the processor that talks to all of them.

So, it is possible for the touch panel to get "hung" (i.e., you can't count on getting valid input from it!). Or, the printer. Or the barcode scanner. Or the scale. Or...

Much of the user's interaction is designed to have an incredibly lightweight user interface. I.e., seldom even *looking* at the display. Also, the individual components may not be closely colocated (so, you can't count on the user to *see* that the printer isn't working, etc.)

The software is set up with a daemon for each device to (hopefully) detect communication problems, devices that are powered down or misconfigured, etc. But, many devices haven't been designed with keep-alive protocols in mind. And, most don't formally specify how they behave when you try talking to them "regularly" (i.e., trying to exploit configuration commands that IN THEORY shouldn't affect normal operation -- but end up doing so! :< )

Each "node" is configured independently of the others. E.g., one might have a barcode scanner but no printer; another might have a printer but no touch screen; another might have a scale but no *display*! Managing the configuration isn't a problem. *But*, the variations mean that you can't rely on any particular device being present at *each* node (except the processor). I.e., you can't just flash messages on a screen; or tell someone to type "REBOOT", etc.

There are only about 30 of these at each location. But, there won't be any MTS around to support them. So, if something doesn't *seem* to be working correctly, I need a simple protocol for (nontechnical) users to get things back to a known/running condition.

It *seems* like the only realistic AND INTUITIVE protocol for "recovery" is to sequence power to the devices in question. Ideally, to *every* device at a node -- though remembering to do so may be a problem (so I need to deal with the possibility that some devices might get reset while others aren't).

And, of course, the software has to take measures to protect pending transactions as this sort of "problem" can come up at any time.

What problems am I failing to foresee? Are there any other (practical) ways of doing this?

Thx,

--don

Phil 16 years ago

Interesting problem and you've clearly given it some thought.

I'd worry about both the order of power-on and whether you need everything to be powered off at the same instant. Someone running between rooms is going to cycle each switch in turn, not turn them all off and then all back on. That might be implicit in your description, but it's something to consider.

At the risk of stating the obvious, power strips will help for units that happen to be close to each other. You might consider remote power controls for distant units and cycle them under processor control. X10 power modules aren't perfect, but they are cheap, available, and might well get the job done for you.

Cheers,

-- Phil

Phil Koopman -- snipped-for-privacy@cmu.edu --

Author of: Better Embedded System Software, Drumnadrochit Press 2010

D Yuniskis 16 years ago

Unfortunately, it doesn't seem like there is a "real" solution that I can "a priori" *know* will work. :< (e.g., if power is applied in {A, B, C} sequence, things might recover perfectly; yet {B, A, C} may *hang* -- if the instructions to the user are simply "power everything off and then everything back on", they can comply and still be SOL :-/ )

Exactly. I think I have to say "turn *everything* off *before* turning ANYTHING back on". Even then, it's a crap shoot (as I said, if folks had designed all these boxes *cognisant* of their roles in SYSTEMS, they would consider these issues in the designs of their interface protocols instead of leaving things to chance :< )

Right. And, even the order that they hit each device might vary from one person to another.

I think I'll have to force some artificial order to their actions -- and then verify that this order "works" for each configuration. For example:

1) turn off everything.
2) power up the processor
3) if a display is present WITH A SEPARATE POWER SWITCH, power up the display
4) if a barcode scanner is present WITH A SEPARATE POWER SWITCH, power up the scanner
5) if a ....

(the "POWER SWITCH" conditions are there to acknowledge the fact that some devices will not have separate power supplies/switches -- e.g., some of the barcode scanners are powered from the computer; some of the touch panels are powered from the display; etc.)
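The per-node procedure above could be generated from the node's configuration rather than written by hand. What follows is only a rough sketch: `struct device`, its fields, and `power_up_steps` are hypothetical names, and it assumes the devices are already listed in the order that has been verified to "work".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one device at a node. The array order is
 * assumed to be the power-up order already verified for this site. */
struct device {
    const char *name;
    bool present;          /* is the device installed at this node? */
    bool own_power_switch; /* false if it is powered from another device */
};

/* Fill steps[] with the names of the devices the user must switch on,
 * in order, after everything has been turned off. Devices that are
 * absent, or that draw power from another device, are skipped.
 * Returns the number of steps written (never more than max_steps). */
size_t power_up_steps(const struct device *devs, size_t ndevs,
                      const char **steps, size_t max_steps)
{
    size_t n = 0;
    for (size_t i = 0; i < ndevs && n < max_steps; i++) {
        if (devs[i].present && devs[i].own_power_switch)
            steps[n++] = devs[i].name;
    }
    return n;
}
```

A node with a processor, a display, a scanner powered from the computer, and a printer would then yield three steps: processor, display, printer.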

I think a better strategy would be to try to keep things local. I.e., if a peripheral is far enough away that it would merit this sort of thing, then give it its own processor and divide the problem that way.

My experience with X10 has been that they make more problems than they solve. In an industrial environment, I think they would probably give up the ghost too readily :-/

Hopefully, run-time daemons can catch hung peripherals before they become noticeable to users. I think it's better to *tell* someone they have a problem (assuming the software can't recover by itself) than to wait for them to discover the problem later (i.e., when they *need* the device and any recovery activities effectively add delays to their work)

I suspect the printer(s) will be the toughest devices to monitor. Seems like they don't really want to talk *with* you but, rather, just *listen* (and complain when they are in a fault condition). Nothing proactive in their design :-/

Philip Koopman 16 years ago

It may or may not be practical for your installations, but a human-centric approach to this would be to put large number signs on all the relevant power switches in a group (1, 2, 3, ...) and tell them to turn them off in order, then back on in order. A little easier to know if you missed one that way. But this isn't a miracle cure to be sure, and depends on someone doing the install who can do that for you. I don't know the context of your installs.

You're right; depends upon the operating environment.

I'd say your idea of having a processor present at each local equipment site is probably going to be the way to go. As to the rest ... I don't see any easy answers.

Cheers,

-- Phil

Phil Koopman -- snipped-for-privacy@cmu.edu --

Author of: Better Embedded System Software, Drumnadrochit Press 2010

D Yuniskis 16 years ago

The problem there is the order will vary with the "node". E.g., a site without a display will have:

#1 CPU
#2 Barcode scanner
#3 ...

While a site *with* a display would have:

#1 CPU
#2 Display
#3 Barcode Scanner
#4 ...

And a site without a barcode scanner:

#1 CPU
#2 Display
#3 ...

*If* you can force people to "read the numbers" (regardless of what they *think* the numbers *were* -- at some other node!), then you just have to ensure the numbers get updated each time you add or remove equipment from a node.

Yeah, I think the solution sucks -- though anything else I come up with seems to suck *more*! :<

Things would be a lot easier if folks designed devices more intelligently -- with an eye towards integration and the issues that it poses -- instead of just designing in a vacuum...

Frnak McKenney 16 years ago

If y'all don't mind a comment or three from someone who missed the start of this thread...

It appears that the OP's concern is powering devices up and down in a particular order, with minimal or no required delay between when devices receive power as long as it occurs in the proper order.

Are these all 115VAC-powered devices (I'm including wall-warts and power bricks here)? If so, would it work to run all the power cords (or extensions for the 115VAC end of the wall-warts) into a common switch box like the one on my desk with 7(?) switches labelled MAIN, CPU, MONITOR, PRINTER, etc.? Sounds like the OP might want larger and different labels, but at least the order would be left->right or right->left, and even _I_ might get that right most of the time.

But perhaps the OP has seen me attempting to work at 0300 after a long day and has concerns about my always getting the power-on order correct. (Emergencies are seldom properly scheduled. )

In that case, it wouldn't be too hard to rewire a power-distribution box like mine so each switch provided power to one outlet AND to the next switch. Throw "1" then "3" and neither "2" nor "3" gets power. The down side is that when the person throwing switches (pushing buttons) realizes it and hits "2" with "3" still in the ON position, "2" and "3" get power simultaneously. Or worse: "Hm... 1, 2, 3, 4, 5, 6, ... oops! I didn't press 2 hard enough... 2!" (add possible spitzensparken sound effects here)
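The behaviour of such a daisy-chained box is easy to model, which makes the failure case above visible. This is just an illustrative sketch; `chained_outlets` is a made-up name, and it assumes mains is always present at the first switch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of the daisy-chained box: switch i feeds outlet i and the input
 * of switch i+1, so an outlet is live only if its own switch and every
 * switch before it in the chain are ON. */
void chained_outlets(const bool *switch_on, bool *outlet_live, size_t n)
{
    bool upstream = true; /* assumption: mains reaches the first switch */
    for (size_t i = 0; i < n; i++) {
        upstream = upstream && switch_on[i];
        outlet_live[i] = upstream;
    }
}
```

With switches {ON, OFF, ON} only outlet 1 is live. Closing switch 2 afterwards makes outlets 2 and 3 go live in the same step, which is exactly the simultaneous power-up described above.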

If you have to deal with devices with multiple power sources and voltages (say 5V at 300A, plus 115VAC, plus 230VAC) then you'll have to come up with some method of providing common signalling and control, a whole 'nother barrel of monkeys.

Hm. I know there are delayed-on and delayed-off relays (and one assumes solid-state equivalents exist these days). If you have the resources, ONE box with ONE switch, multiple outlets, and a bunch of such relays might do the job:

ON provides power to device 1 and the delayed-ON relay for device 2. When device 2 gets power, that also turns on the delayed-ON relay for device 3...

For power OFF, the OFF position doesn't really switch anything off -- it just starts a cascade of delayed-OFFs which occur in the proper order. You still have to ensure that the devices are plugged into the box's outlets in the correct order.
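To get a feel for how long the ON cascade takes end to end, the relay delays can simply be accumulated. A minimal sketch, assuming each relay delay is known in milliseconds; `cascade_on_times` and the array layout are invented for illustration, and the OFF cascade is not modelled here.

```c
#include <stddef.h>

/* Compute when each outlet receives power after the single switch is
 * thrown ON. Outlet 0 is powered directly; outlet i (i > 0) is powered
 * delay_ms[i] milliseconds after outlet i-1. delay_ms[0] is ignored.
 * Results are written to on_at_ms[], in milliseconds from switch-on. */
void cascade_on_times(const unsigned *delay_ms, unsigned *on_at_ms, size_t n)
{
    unsigned t = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            t += delay_ms[i];
        on_at_ms[i] = t;
    }
}
```

For delays of {0, 500, 2000} ms the outlets come up at 0, 500 and 2500 ms. Comparing the last value against any maximum a host tolerates (the SCSI case mentioned below) is then a one-line check.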

Oh, and if you want to get really fancy, add in power-sensing circuits as well so "3" can't get power if "2" isn't powered on already. That'll help catch cases where someone switched off a device using its own power switch or the device is unplugged (or has blown a fuse).

There's probably already a high-priced (and reliable) commercial device that does most or all of this, with some fancy name like "sequenced power distribution box", but I haven't looked.

Did I miss the discussion of what happens if the devices come on in a "wrong" order, and how much delay (usec? msec? seconds? minutes?) is required between each device being powered on? (There's probably a maximum as well -- external SCSI boxes that power up 15 minutes after the host scans the bus looking for devices usually don't work all that well. )

Gack. X-10 is a noise-prone multiple-command-send environment with most switches/devices providing no feedback or status information (the kind of situation where you use "shadow registers" and hope for the best).

[...]

If 100% of all potential users submitted detailed descriptions of each and every situation where a given device would ever be used, the manufacturers might be able to do this, but anyone attempting this should have decades of experience in Advanced Cat Herding before taking this one on.

Good luck...

Frank McKenney

Applying computer technology is simply finding the right wrench to pound in the correct screw. -- Frank McKenney, McKenney Associates Richmond, Virginia / (804) 320-4887 Munged E-mail: frank uscore mckenney ayut mined spring dawt cahm (y'all)

D Yuniskis 16 years ago

Not at all. If that were the case, you couldn't buy *any* devices that "talked to" other devices. E.g., how is it that you can use an SRAM in a variety of applications without the manufacturer having to know "detailed descriptions of each and every situation where [that] given device would ever be used"?

You *can* use an SRAM because the interface is well defined and encompasses (nearly?) everything that an "other" device would need to be able to ascertain presence, functionality, size, etc. of that device.

Dealing with the "serial" devices in my example, all that is required of the designer is:

- strictly define the syntax and content of valid messages so you (it) can determine if it has received a valid message (i.e., if it misses the start of a message, it doesn't treat the *tail* of the message as some other EQUALLY VALID message). For example: HANDSHAKING and NO HANDSHAKING would be bad choices for configuration commands as the second can be "accidentally" recognized as the first *if* the device happened to miss the initial portion of the (second) message. Note that adding a checksum also gives you this protection but increases the work that the "other device" has to do in order to form valid messages. I.e., you couldn't simply do: fprintf(serialout, "BAUDRATE %d.\n", baudrate); as the checksum would need to be injected -- and *computed* as a function of "baudrate".

- provide acknowledgements of "commands" -- so the "other device" has some assurance that you *did* receive the command and it was the command that was intended. I.e., you need a FDX link (or some creative hardware signaling)

- don't act on anything that STRICTLY SPEAKING fails any syntax/semantic check (e.g., "BAUDRATE 19199")

- after a long break, begin an autobaud/autoconfig sequence abandoning any previous link configuration. Expect *any* valid message to immediately follow (perhaps prefaced with a fixed character sequence to allow simpler devices to more easily lock onto the correct baudrate -- e.g., " ").
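As a rough illustration of the checksum and strict-checking points in the list above, here is what the sending side and a semantic check might look like. Everything specific is an assumption made for the sketch: the `*HH` trailer, the 8-bit additive checksum, the table of accepted rates, and the function names are all hypothetical and not taken from any real device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* 8-bit additive checksum over the message body (an assumed scheme). */
static unsigned char checksum8(const char *s, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (unsigned char)s[i];
    return (unsigned char)(sum & 0xFFu);
}

/* Build "BAUDRATE <n>*<HH>\n", where HH is the checksum of the body in
 * hex. This is the extra work a plain fprintf() avoids: the checksum
 * has to be computed from the formatted body before it can be sent.
 * Returns the frame length, or -1 if a buffer is too small. */
int build_baudrate_frame(char *out, size_t outsz, long baudrate)
{
    char body[32];
    int blen = snprintf(body, sizeof body, "BAUDRATE %ld", baudrate);
    if (blen < 0 || (size_t)blen >= sizeof body)
        return -1;
    int n = snprintf(out, outsz, "%s*%02X\n", body,
                     (unsigned)checksum8(body, (size_t)blen));
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n;
}

/* Strict semantic check on the receiving side: only rates in the table
 * are accepted, so "BAUDRATE 19199" is rejected rather than rounded. */
bool baudrate_is_valid(long baudrate)
{
    static const long allowed[] = {1200, 2400, 4800, 9600, 19200, 38400};
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if (allowed[i] == baudrate)
            return true;
    return false;
}
```

Calling `build_baudrate_frame(buf, sizeof buf, 19200)` fills `buf` with the body, a `*`, two hex digits and a newline. A receiver would recompute the checksum over the body and then apply `baudrate_is_valid` before acting on anything.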

I suspect I may have left a hole unplugged someplace. But, just these few things (all of which are trivial to implement) would allow the "other device" to converse with the device in question without fear of: something being "missed" *or* something being "misinterpreted".
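The HANDSHAKING / NO HANDSHAKING trap from the first point can be caught mechanically when a command set is being designed. A small sketch, with hypothetical function names, that flags any command which is the tail of another:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `tail` is a proper (shorter, non-empty) suffix of `full`. */
static bool is_proper_suffix(const char *full, const char *tail)
{
    size_t fl = strlen(full), tl = strlen(tail);
    return tl > 0 && tl < fl && strcmp(full + (fl - tl), tail) == 0;
}

/* Returns true if no command in the set is the tail of another one,
 * i.e. a receiver that misses the start of a message cannot mistake
 * what is left for a different, equally valid command. */
bool command_set_is_tail_safe(const char *const *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (i != j && is_proper_suffix(cmds[i], cmds[j]))
                return false;
    return true;
}
```

The pair {"HANDSHAKING", "NO HANDSHAKING"} fails this check, while something like {"HANDSHAKE ON", "HANDSHAKE OFF"} passes. It only covers whole-command tails, so it is a design-time sanity check and not a substitute for acknowledgements.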

D Yuniskis 16 years ago

You also *effectively* need a NO-OP message that can be used as a keep-alive/link test. With many devices, you can often find some "benign" command that you can repeatedly invoke -- solely for the *acknowledgement*. E.g., "BAUDRATE 19200" "ACK-BAUD3" "BAUDRATE 19200" "ACK-BAUD3" "BAUDRATE 19200" ...
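On the processor side, the per-device daemon only needs to track whether each of those benign commands was acknowledged in time. A minimal sketch of that bookkeeping follows; the names, the three states, and the threshold of three missed ACKs are all assumptions chosen for illustration.

```c
#include <stdbool.h>

/* Assumption: the link is declared down after this many consecutive
 * polls without a valid acknowledgement. */
enum { MAX_MISSED_ACKS = 3 };

enum link_state { LINK_UP, LINK_SUSPECT, LINK_DOWN };

struct link_monitor {
    unsigned missed;       /* consecutive polls without a valid ACK */
    enum link_state state;
};

/* Start in SUSPECT: nothing has been confirmed yet. */
void link_monitor_init(struct link_monitor *m)
{
    m->missed = 0;
    m->state = LINK_SUSPECT;
}

/* Call once per keep-alive poll, passing whether the expected ACK
 * (e.g. "ACK-BAUD3") arrived before the timeout. Returns the new state. */
enum link_state link_monitor_poll(struct link_monitor *m, bool ack_received)
{
    if (ack_received) {
        m->missed = 0;
        m->state = LINK_UP;
    } else {
        if (m->missed < MAX_MISSED_ACKS)
            m->missed++;
        m->state = (m->missed >= MAX_MISSED_ACKS) ? LINK_DOWN : LINK_SUSPECT;
    }
    return m->state;
}
```

The daemon would send the benign command, wait for the reply or a timeout, and feed the result to `link_monitor_poll`. A transition to `LINK_DOWN` is the point at which to tell the user to start the power-cycle procedure.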

And, of course, you have to make sure the interface activity doesn't *unreasonably* interfere with the normal operation of the device. I.e., if changing the baudrate caused a barcode scanner to misread a label that was being scanned concurrently, you'd have a hard time justifying (to me) why that should be the case! :-/

(OTOH, I can understand that a UPC label that is being decoded may or may *not* be recognized if a configuration command enabling/disabling recognition of UPC labels was acted upon concurrently)
