Kicking the dog -- how do you use watchdog timers?

If you need an absolutely reliable product (medical safety, NASA, or whatever), you have to use ultra high assurance design processes that are not economically competitive in more typical application areas. If you don't use those processes, you aren't designing "without a care", but you're designing with an amount of care chosen through an engineering and business decision, based on how much product failure you're willing to tolerate. If falling back to a WDT is a cheap way to reach your acceptable failure rate, it seems like an ok option.

I worked on a thing a while back whose hardware randomly locked up every few thousand hours of operation. We never figured out why, and decided not to spend excessive resources studying it, given that it was coming due for a total redesign anyway.

We had a few hundred of these things in the field, which meant that, on average, we logged maybe one WDT reset per day across the whole fleet. The application area was not even slightly safety critical, and most of the resets were in the middle of the night when the device wasn't in use anyway. There was a slim possibility that a reset at the wrong time could actually inconvenience a customer and we'd get a support call. But AFAIK that never happened. Nobody ever noticed the resets.
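As a rough sanity check on the fleet arithmetic above, here is a minimal sketch in C. It assumes every unit runs 24 hours a day and locks up independently at a constant average rate; the function name and the example figures (300 units, 7200 hours between lockups) are illustrative assumptions chosen to match "a few hundred" and "every few thousand hours", not numbers from the actual product.

```c
#include <assert.h>

/* Expected watchdog resets per day across a fleet.
 * Assumptions (not from the original post): each unit runs 24 h/day and
 * locks up on average once per mean_hours_between_lockups hours. */
double expected_resets_per_day(unsigned fleet_size,
                               double mean_hours_between_lockups)
{
    assert(mean_hours_between_lockups > 0.0);
    return (double)fleet_size * 24.0 / mean_hours_between_lockups;
}
```

With the assumed figures, expected_resets_per_day(300, 7200.0) comes out at about one reset per day, which is in line with what we logged.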

I think the above is a typical story. I wasn't involved in the management decision to ship the thing despite the lockups (relying on the WDT), but I can't say that they made a wrong choice. In mathematics we prove things and then expect to be absolutely sure of them, but engineering is different. Most engineering is about making stuff that meets cost constraints and empirically works well enough for the application, and that's what they did.

Paul Rubin

I meant my comment more as an encouragement to look at schematics or ask the hardware designers what reset does.

I see your point about possibly letting the micro reset the rest of the hardware -- either way, one should not assume things.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com 

I'm looking for work -- see my website!

In general that is a logical fallacy.

Consider a situation where one section of the code, let's say one thread, hangs because of broken hardware, but other threads are still doing useful work, e.g., transmitting status information up to the cloud.
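Whether a single hung thread should take the whole system down is a policy decision, and the watchdog code can make that policy explicit. The following is only a sketch of one common pattern, per-task check-in flags with a supervisor that kicks the dog. All names are hypothetical, hw_watchdog_kick() is a stand-in that just counts calls instead of touching a real WDT register, and a real multi-threaded build would need atomic access to the flag word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u

static uint32_t checkin_mask;   /* bit n set = task n reported alive */
static unsigned hw_kick_count;  /* stand-in for the real WDT reload write */

/* Hypothetical hardware hook: here it only counts calls. */
static void hw_watchdog_kick(void) { hw_kick_count++; }

/* Called by each task from its own loop.
 * NOTE: not thread-safe as written; use atomics on a real RTOS. */
void task_checkin(unsigned task_id)
{
    assert(task_id < NUM_TASKS);
    checkin_mask |= (1u << task_id);
}

/* Called periodically by a supervisor. Kicks only if every task named in
 * required_mask has checked in since the last kick. Tasks left out of
 * required_mask may hang without causing a reset. Returns true if kicked. */
bool supervisor_poll(uint32_t required_mask)
{
    if ((checkin_mask & required_mask) != required_mask) {
        return false;  /* a required task is silent: let the WDT run out */
    }
    checkin_mask = 0;  /* everyone must check in again before the next kick */
    hw_watchdog_kick();
    return true;
}

unsigned kicks_so_far(void) { return hw_kick_count; }
```

In the scenario above, leaving the hardware-dependent thread out of required_mask keeps the status-reporting threads running when it hangs; putting it in forces a reset. Neither choice is right in general.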

--
Randy Yates, DSP/Embedded Firmware Developer 
Digital Signal Labs 
http://www.digitalsignallabs.com

Or the case where a reset reinitializes the screen, so the glimpse you got of something vital to understanding the problem was too short.

Generalizations are almost always wrong, but in general having a dog (and being able to turn it off for situations like the above) is a good thing :) (well, not if it is an organic dog in your backyard barking all the time, just the silicon sort of dog...). On certain systems it may even be smoke-saving, hitting reset early enough. On larger systems which have the complexity of a PC and operate remotely, it can at times eliminate the need for someone to go to the device and reset it. The latter is particularly useful during in situ fine-tuning which involves significant programming (it has been for me).

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

------------------------------------------------------

On 13.05.2016 at 05:09, Randy Yates wrote:

So a _broken_ system will be fixed by not doing a reset; really?

Then that won't be fixed by not doing a reset. Just like I said.

Hans-Bernhard Bröker

The fallacy is the implication "reset won't fix something broken" => "reset is not worth attempting", which comes from the erroneous concept that something broken is unusable. In fact lots of brokenness takes the form of the device freezing up once in a while, when it's supposed to keep working. Resetting won't un-break the device: it's still a broken device that will freeze up again eventually. But if resetting clears the immediate symptom (the freeze-up) so you can keep using it, that might be good enough for your purposes.

Anyone who deals with technology products in the real world is used to this. My DSL modem freezes up every few months and I have to reset it manually since there's no WDT. This is a known problem with these modems. Resetting is a minor nuisance if I'm at home, but potentially a big headache if I want my home computer to stay online so I can connect to it while travelling. Consider these two solutions:

1) Buy a new modem (still with no WDT) guaranteed not to freeze for 3 years: the vendor replaces it under warranty if it freezes in that period.
2) Add a WDT to the existing broken modem, i.e. it will still freeze now and then, but it self-resets in the event of a freeze.

I think #1 is an actual "fix", but #2 is more robust in practice. If I really cared about remote access, I'd want a WDT even with a new modem. If I had to pick between the two, I'd choose the WDT over improving the modem's underlying reliability. There's only so far you can go in trying to make hardware failure-proof. Even NASA gives up at a certain point. They make stuff as reliable as they can, but then they deal with residual unreliability by adding backup hardware in case the primary still fails.

Paul Rubin

On 13.05.2016 at 21:10, Paul Rubin wrote:

I did not make that implication, so I have to object to this criticism being applied to my posts.

It's not, for the use case you described, because you won't be home to let them in, so they won't be able to exchange it. And if you were home, there's no way they'll be there with the exchange device faster than you can reach the existing device's reset button (or power plug).

Hans-Bernhard Bröker

Well, I don't understand what you were getting at then.

Of course it's a fix. It changes a deployment of broken equipment into one of non-broken equipment. How can that be anything other than a fix? What else can it mean to fix something? The issue is that it's hardware, not mathematics. Just because it's not broken today doesn't mean it will never break. Therefore being able to mitigate potential failure is still important, maybe even more important than being able to fix existing actual failure.

Paul Rubin

As you said, use them when they are needed, and that's what I do. Except with me their use is the rule and not the exception. Most of my systems have to run unattended for years on end, and there is little chance that a person will be able to cycle the power or press a reset button. That being said, I tend to use rather long timeout periods, so I don't get bit on the butt by a WDT that is always on the verge of triggering, and if the WDT does expire, something has really gone awry. Also like you, I have a couple of motor control systems that are a bit more safety critical, where I use shorter timeout periods, mainly to make sure that if they fail, the system can attempt to place itself in as safe a condition as possible.
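To make the "timeout plus safe state" idea concrete, here is a minimal sketch of a tick-driven software watchdog that calls a safe-state hook when it expires. It is a simulation, not code from any of my systems: soft_wdt, motor_safe_state() and the tick counts are all hypothetical names and values, and on real hardware the hook would typically run from a pre-timeout interrupt before the hardware WDT forces the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*safe_state_fn)(void);

typedef struct {
    unsigned timeout_ticks;   /* ticks allowed between kicks */
    unsigned elapsed_ticks;   /* ticks since the last kick */
    bool expired;             /* latched once the timeout is reached */
    safe_state_fn on_expire;  /* optional hook, may be NULL */
} soft_wdt;

/* Hypothetical safety action: just clears a flag in this sketch. */
static bool motor_enabled = true;
void motor_safe_state(void) { motor_enabled = false; }
bool motor_is_enabled(void) { return motor_enabled; }

void soft_wdt_init(soft_wdt *w, unsigned timeout_ticks, safe_state_fn on_expire)
{
    assert(w != NULL && timeout_ticks > 0);
    w->timeout_ticks = timeout_ticks;
    w->elapsed_ticks = 0;
    w->expired = false;
    w->on_expire = on_expire;
}

/* Kicking after expiry has no effect: the expiry is latched. */
void soft_wdt_kick(soft_wdt *w)
{
    if (!w->expired) {
        w->elapsed_ticks = 0;
    }
}

/* Call once per timer tick. Returns true once the watchdog has expired. */
bool soft_wdt_tick(soft_wdt *w)
{
    if (w->expired) {
        return true;
    }
    if (++w->elapsed_ticks >= w->timeout_ticks) {
        w->expired = true;
        if (w->on_expire != NULL) {
            w->on_expire();  /* put outputs in a safe condition first */
        }
    }
    return w->expired;
}
```

A long-running unattended system would pass a large timeout_ticks; the motor controllers would pass a small one so the safe-state hook runs soon after the control loop stops kicking.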

I think a watchdog makes sense in any system that is far from home, like your rover example, or one where incorrect operation may be more dangerous than the system running off into la-la land.

WangoTango

Hans,

The scenario I constructed illustrates an instance where not doing a reset results in less-broken behavior than resetting would. So yes, it's still broken if you do a reset, but a reset breaks it "harder."

Your statement implies (I think) that doing a reset won't do any more harm than not doing a reset, and may do some good. That implication is a fallacy in that it is not true in certain (albeit pathological, but possible) cases.

So yes, your assertion as stated above is true. I should have been more precise in my refutation and explained that it is the implication under the assumption of degrees of brokenness that is a fallacy.

--
Randy Yates, DSP/Embedded Firmware Developer 
Digital Signal Labs 
http://www.digitalsignallabs.com