Randy Yates recently started a thread on programming flash that had an interesting tangent into watchdog timers. I thought it was interesting enough that I'm starting a thread here.

I had stated in Randy's thread that I avoid watchdogs, because they mostly seem to be a source of erroneous behavior to me. However, on reflection I realized that I lied: I _do_ use watchdog timers, but not automatically. To date I've only used them when the processor is spinning a motor that might crash into something or otherwise engage in damaging behavior if the processor goes nuts.

In general, my rule on watchdogs, as with any other feature, is "use it if using it is better", which means that I think about the consequences of the thing popping off when I don't want it to (as during a code update or during development when I hit a breakpoint) vs. the consequences of not having the thing when the processor goes haywire.

Furthermore, if I use a watchdog I don't just treat updating the thing as a requirement check-box -- so you won't find a timer ISR in my code that unconditionally kicks the dog. Instead, I'll usually have just one task (the motor control one, on most of my stuff) kick the dog when it feels it's operating correctly. If I've got more than one critical task (i.e., if I'm running more than one motor out of one processor) I'll have a low-priority built-in-test task that kicks the dog, but only if it's getting periodic assurances of health from the (multiple) critical tasks.

Generally, in my systems, the result of the watchdog timer popping off is that the system will no longer work quite correctly, but it will operate safely.

So -- what do you do with watchdogs, and how, and why? Always use 'em? Never use 'em? Use 'em because the boss says so, but twiddle them in a "last part to break" bit of code? Would you use a watchdog in a fly-by-wire system? A pacemaker? Why? Why not? Could you justify _not_ using a watchdog in the top-level ...
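For readers who want to see the "kick only on assurances of health" idea in code, here is a minimal C sketch. It is not the poster's actual code: the task names, `report_healthy()`, `bit_task_poll()` and the `wdt_kick()` stand-in are all hypothetical, and on real hardware `wdt_kick()` would be the part-specific watchdog refresh, with the flag read-and-clear protected against preemption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical critical tasks, one bit each (e.g. two motor loops). */
#define TASK_MOTOR_A       (1u << 0)
#define TASK_MOTOR_B       (1u << 1)
#define ALL_CRITICAL_TASKS (TASK_MOTOR_A | TASK_MOTOR_B)

static volatile uint32_t health_flags; /* set by the critical tasks */
static unsigned kick_count;            /* stand-in for the hardware */

/* Stand-in for the real, part-specific watchdog refresh. */
static void wdt_kick(void) { kick_count++; }

/* A critical task calls this only when it believes it is operating
 * correctly, not unconditionally from a timer ISR. */
void report_healthy(uint32_t task_bit)
{
    assert((task_bit & ~(uint32_t)ALL_CRITICAL_TASKS) == 0u);
    health_flags |= task_bit;
}

/* Body of the low-priority built-in-test task, run once per check
 * period. Kicks the dog only if every critical task has checked in
 * since the previous call. (Assumption: on a real RTOS this
 * read-and-clear would be done with interrupts masked or atomically.) */
bool bit_task_poll(void)
{
    bool all_ok = (health_flags & ALL_CRITICAL_TASKS) == ALL_CRITICAL_TASKS;
    health_flags = 0u;   /* every task must report again next period */
    if (all_ok) {
        wdt_kick();
    }
    return all_ok;
}
```

The point of the design is that a hung motor task simply stops reporting, the built-in-test task stops kicking, and the watchdog fires even though the scheduler and the low-priority task are still alive.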

Watchdog timers are not often used in FPGAs. I guess that's because processes in HDL seldom get stuck or lost in the weeds. ;) When I did design a software project we had multiple tasks each kicking another task which would track what was going on and "pet" the watch dog to keep it from barking. The various tasks had periods of "interest" different from the watch dog timeout, so this process dealt with the appropriate time period of each of the tasks being watched. Only this task needed to actually deal with the watch dog period. -- Rick C
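The arrangement Rick describes, where each task has its own period of interest and only one monitor deals with the watchdog period, might look roughly like the following C sketch. All names (`watched_task_t`, `task_checkin()`, `monitor_poll()`, `wdt_pet()`) and the three example periods are assumptions made for illustration, not anything from the original project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TASKS 3u

typedef struct {
    uint32_t max_silence_ms;  /* this task's own "period of interest" */
    uint32_t last_checkin_ms; /* tick time of its latest check-in     */
} watched_task_t;

/* Hypothetical periods: a fast loop, a medium one, a slow one. */
static watched_task_t tasks[NUM_TASKS] = {
    { .max_silence_ms = 20u },
    { .max_silence_ms = 500u },
    { .max_silence_ms = 5000u },
};

static unsigned pet_count;            /* stand-in for the hardware */
static void wdt_pet(void) { pet_count++; }

/* Each watched task calls this whenever it completes a healthy pass. */
void task_checkin(size_t id, uint32_t now_ms)
{
    assert(id < NUM_TASKS);
    tasks[id].last_checkin_ms = now_ms;
}

/* The monitor task runs more often than the watchdog timeout. It pets
 * the dog only if no task has been silent longer than its own limit,
 * so the watched tasks never need to know the watchdog period. */
bool monitor_poll(uint32_t now_ms)
{
    for (size_t i = 0; i < NUM_TASKS; i++) {
        /* Unsigned subtraction stays correct across tick wraparound. */
        uint32_t silent_ms = now_ms - tasks[i].last_checkin_ms;
        if (silent_ms > tasks[i].max_silence_ms) {
            return false;   /* let the watchdog bark */
        }
    }
    wdt_pet();
    return true;
}
```

Keeping the per-task limits in one table means a slow task can go seconds between check-ins without forcing a long hardware timeout on the fast ones.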

Kicking the dog -- how do you use watchdog timers?


Paul Rubin 10 years ago

If you need an absolutely reliable product (medical safety, NASA, or whatever), you have to use ultra high assurance design processes that are not economically competitive in more typical application areas. If you don't use those processes, you aren't designing "without a care", but you're designing with an amount of care chosen through an engineering and business decision, based on how much product failure you're willing to tolerate. If falling back to a WDT is a cheap way to reach your acceptable failure rate, it seems like an ok option.

I worked on a thing a while back whose hardware randomly locked up every few thousand hours of operation. We never figured out why, and decided not to spend excessive resources studying it, given that it was coming due for a total redesign anyway.

We had a few hundred of these things in the field which meant that on average, we logged maybe one WDT reset per day across the whole fleet. The application area was not even slightly safety critical and most of the resets were in the middle of the night when the device wasn't in use anyway. There was a slim possibility that a reset at the wrong time could actually inconvenience a customer and we'd get a support call. But AFAIK that never happened. Nobody ever noticed the resets.

I think the above is a typical story. I wasn't involved in the management decision to ship the thing despite the lockups (relying on the WDT), but I can't say that they made a wrong choice. In mathematics we prove things and then expect to be absolutely sure of them, but engineering is different. Most engineering is about making stuff that meets cost constraints and empirically works well enough for the application, and that's what they did.


Tim Wescott 10 years ago

I meant my comment more as an encouragement to look at schematics or ask the hardware designers what reset does.

I see your point about possibly letting the micro reset the rest of hardware -- either way, one should not assume things.

Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!


Randy Yates 10 years ago

In general that is a logical fallacy.

Consider a situation where one section of the code, let's say one thread, hangs because of broken hardware, but other threads are still doing useful work, e.g., transmitting status information up to the cloud.

Randy Yates, DSP/Embedded Firmware Developer Digital Signal Labs http://www.digitalsignallabs.com


Dimiter_Popoff 10 years ago

Or the case where the reset causes a screen reinitialization, and the glimpse you got of something vital to understanding the problem was too short.

Generalizations are almost always wrong, but in general having a dog (and being able to turn it off for situations like the above) is a good thing :) (well not if it is an organic dog in your backyard to yell all the time, just a silicon sort of dog....). On certain systems it may even be smoke-saving, hitting reset early enough. On larger systems which are of the complexity of a PC and remotely operating it can at times eliminate the need for someone to have to go to the device and reset it... The latter is particularly useful during in situ fine-tuning which involves significant programming (has been for me).

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI


Hans-Bernhard Bröker 10 years ago

On 13.05.2016 at 05:09, Randy Yates wrote:

So a _broken_ system will be fixed by not doing a reset; really?

Then that won't be fixed by not doing a reset. Just like I said.


Paul Rubin 10 years ago

The fallacy is the implication "reset won't fix something broken" => "reset is not worth attempting", which comes from the erroneous concept that something broken is unusable. In fact lots of brokenness takes the form of the device freezing up once in a while, when it's supposed to keep working. Resetting won't un-break the device: it's still a broken device that will freeze up again eventually. But if resetting clears the immediate symptom (the freeze-up) so you can keep using it, that might be good enough for your purposes.

Anyone who deals with technology products in the real world is used to this. My DSL modem freezes up every few months and I have to reset it manually since there's no WDT. This is a known problem with these modems. Resetting is a minor nuisance if I'm at home, but potentially a big headache if I want my home computer to stay online so I can connect to it while travelling. Consider these two solutions:

1) Buy a new modem (still with no WDT) guaranteed not to freeze for 3 years: the vendor replaces it under warranty if it freezes in that period.
2) Add a WDT to the existing broken modem, i.e. it will still freeze now and then, but it self-resets in the event of a freeze.

I think #1 is an actual "fix", but #2 is more robust in practice. If I really cared about remote access, I'd want a WDT even with a new modem. If I had to pick between the two, I'd choose the WDT over improving the modem's underlying reliability. There's only so far you can go in trying to make hardware failure-proof. Even NASA gives up at a certain point. They make stuff as reliable as they can, but then they deal with residual unreliability by adding backup hardware in case the primary still fails.


Hans-Bernhard Bröker 10 years ago

On 13.05.2016 at 21:10, Paul Rubin wrote:

I did not make that implication, so have to object to this criticism being applied to my posts.

It's not, for the use case you described, because you won't be home to let them in, so they won't be able to exchange it. And if you were home, there's no way they'll be there with the exchange device faster than you can reach the existing device's reset button (or power plug).


Paul Rubin 10 years ago

Well I don't understand what you were getting at then.

Of course it's a fix. It changes a deployment of broken equipment into one of non-broken equipment. How can that be anything other than a fix? What else can it mean to fix something? The issue is that it's hardware, not mathematics. Just because it's not broken today doesn't mean it will never break. Therefore being able to mitigate potential failure is still important, maybe even more important than being able to fix existing actual failure.


WangoTango 10 years ago

As you said, use them when they are needed, and that's what I do. Except with me their use is the rule and not the exception. Most of my systems have to run unattended for years on end and there is little chance that a person will be able to cycle the power or press a reset button. That being said, I tend to use rather long timeout periods, so I don't get bit on the butt by a WDT that is always on the verge of triggering, and if the WDT does expire, something has really gone awry. Also like you, I do have a couple of motor control systems that are a bit more safety critical, where I use faster timeout periods, mainly to make sure that if they fail, the system can attempt to place itself in as safe a condition as possible.

I think a watchdog makes sense in any system that is far from home, like your rover example, or one where incorrect operation may be more dangerous than the system running off into la-la land.


Randy Yates 10 years ago

Hans,

The scenario I constructed illustrates an instance where not doing a reset results in less-broken behavior than resetting would. So yes, it's still broken if you do a reset, but a reset breaks it "harder."

Your statement implies (I think) that doing a reset won't do any more harm than not doing a reset, and may do some good. That implication is a fallacy in that it is not true in certain (albeit pathological, but possible) cases.

So yes, your assertion as stated above is true. I should have been more precise in my refutation and explained that it is the implication under the assumption of degrees of brokenness that is a fallacy.

Randy Yates, DSP/Embedded Firmware Developer Digital Signal Labs http://www.digitalsignallabs.com
