Randy Yates recently started a thread on programming flash that had an interesting tangent into watchdog timers. I thought it was interesting enough that I'm starting a thread here.

I had stated in Randy's thread that I avoid watchdogs, because they mostly seem to be a source of erroneous behavior to me. However, on reflection I realized that I lied: I _do_ use watchdog timers, but not automatically. To date I've only used them when the processor is spinning a motor that might crash into something or otherwise engage in damaging behavior if the processor goes nuts.

In general, my rule on watchdogs, as with any other feature, is "use it if using it is better", which means that I think about the consequences of the thing popping off when I don't want it to (as during a code update or during development when I hit a breakpoint) vs. the consequences of not having the thing when the processor goes haywire.

Furthermore, if I use a watchdog I don't just treat updating the thing as a requirement check-box -- so you won't find a timer ISR in my code that unconditionally kicks the dog. Instead, I'll usually have just one task (the motor control one, on most of my stuff) kick the dog when it feels it's operating correctly. If I've got more than one critical task (i.e., if I'm running more than one motor out of one processor) I'll have a low-priority built-in-test task that kicks the dog, but only if it's getting periodic assurances of health from the (multiple) critical tasks.

Generally, in my systems, the result of the watchdog timer popping off is that the system will no longer work quite correctly, but it will operate safely.

So -- what do you do with watchdogs, and how, and why? Always use 'em? Never use 'em? Use 'em because the boss says so, but twiddle them in a "last part to break" bit of code?

Would you use a watchdog in a fly-by-wire system? A pacemaker? Why? Why not? Could you justify _not_ using a watchdog in the top-level ...
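The built-in-test arrangement described above can be modeled on a host machine. This is only a minimal sketch, not the poster's code: the class name `HealthGate`, the task names, and the `kick` callable (standing in for whatever services the hardware watchdog) are all assumptions made for illustration.

```python
class HealthGate:
    """Kick the watchdog only when every critical task has reported in."""

    def __init__(self, task_names, kick):
        # One flag per critical task; all start out "not yet reported".
        self._reported = {name: False for name in task_names}
        # Hypothetical callable that services the hardware watchdog.
        self._kick = kick

    def report_healthy(self, name):
        # Called by a critical task only after it has checked its own operation.
        self._reported[name] = True

    def bit_task_tick(self):
        # Called periodically from the low-priority built-in-test task.
        kicked = all(self._reported.values())
        if kicked:
            self._kick()
        # Clear every flag so each task must report again before the next kick.
        for name in self._reported:
            self._reported[name] = False
        return kicked
```

Because the flags are cleared on every tick, a single stalled task is enough to stop the kicks, and a timer ISR that keeps running cannot keep the dog quiet by itself.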

Watchdog timers are not often used in FPGAs. I guess that's because processes in HDL seldom get stuck or lost in the weeds. ;) When I did design a software project we had multiple tasks each kicking another task which would track what was going on and "pet" the watch dog to keep it from barking. The various tasks had periods of "interest" different from the watch dog timeout, so this process dealt with the appropriate time period of each of the tasks being watched. Only this task needed to actually deal with the watch dog period. -- Rick C
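A rough sketch of that kind of monitor task, assuming each watched task has its own allowed silence interval. The names `TaskMonitor`, `check_in`, and `may_pet` are hypothetical, and time is passed in explicitly so the example does not depend on any particular clock or RTOS.

```python
class TaskMonitor:
    """Track check-ins from several tasks, each with its own period."""

    def __init__(self, max_silence):
        # max_silence maps task name -> longest allowed gap between check-ins (seconds).
        self._max_silence = dict(max_silence)
        self._last_seen = {name: 0.0 for name in self._max_silence}

    def check_in(self, name, now):
        # A watched task "kicks" the monitor, not the watchdog itself.
        self._last_seen[name] = now

    def may_pet(self, now):
        # The watchdog may be petted only if no task has been silent too long.
        return all(
            now - self._last_seen[name] <= limit
            for name, limit in self._max_silence.items()
        )
```

Only the monitor's own loop needs to run faster than the watchdog timeout; the watched tasks can check in at whatever rate suits them.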

Kicking the dog -- how do you use watchdog timers?

rickman 10 years ago

Coding bug??? Not sure where you are getting this. I specifically excluded SEU because that is one situation that can cause problems with *any* design in unpredictable ways. Otherwise this is a system design issue. If your system is subject to "glitches" then a means should be designed into the handshake to resolve timeouts. Resetting the entire FPGA or board shouldn't be necessary.

You are talking about a specified timeout on a communications protocol, not a watchdog.
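To make the distinction concrete, here is a small behavioral model of a handshake that resolves its own timeout instead of relying on a global reset. It is a sketch in Python, not HDL, and the state names, the `timeout_cycles` parameter, and the `clock` method are assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    WAIT_ACK = 1


class Handshake:
    """Request/acknowledge handshake with a timeout built into the protocol."""

    def __init__(self, timeout_cycles):
        self.state = State.IDLE
        self.timeout_cycles = timeout_cycles
        self.waited = 0
        self.timeouts = 0  # visible count, so a timeout is reported, not hidden

    def clock(self, start=False, ack=False):
        """Advance one clock cycle and return the state afterwards."""
        if self.state is State.IDLE:
            if start:
                self.state = State.WAIT_ACK
                self.waited = 0
        elif ack:
            self.state = State.IDLE
        else:
            self.waited += 1
            if self.waited >= self.timeout_cycles:
                # Give up on this transfer only; nothing else is reset.
                self.timeouts += 1
                self.state = State.IDLE
        return self.state
```

The recovery is local to the one interface that stalled, and the `timeouts` counter leaves evidence that something went wrong.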

If you can design sequential control logic that doesn't have FSMs, then you are a better man than I am... or you are good at renaming circuits. Everything sequential is an FSM other than simple data registers. A counter is a FSM.

Yes, defining transitions for every possible state is a good tool, if needed. But by default adding a watchdog timer is overkill, especially when it simply masks a bug rather than exposing it.
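As an illustration of "defining transitions for every possible state", the sketch below models a two-bit state register where only three encodings are used by the design. The state names and inputs are made up for the example; the point is only that the fourth encoding has a defined way out.

```python
IDLE, RUN, DONE = 0b00, 0b01, 0b10  # 0b11 is not used by the design


def next_state(state, go, finished):
    """Next-state function that is defined for all four encodings."""
    if state == IDLE:
        return RUN if go else IDLE
    if state == RUN:
        return DONE if finished else RUN
    if state == DONE:
        return IDLE
    # Unused encoding, however it was reached: recover to a known state.
    return IDLE
```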

The Transputer had math instructions that would halt the CPU when an overflow occurred. It sounded crazy at the time, but that is actually preferable to letting an erroneous system continue running. Watchdogs are often like that, they let a system continue running in a corrupt way rather than pointing to the bug.
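The contrast between carrying on with a wrong value and stopping at the fault can be shown with 32-bit addition. This is not Transputer code, just a Python illustration with hypothetical function names, where raising an exception stands in for halting the CPU.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def add_wrapping(a, b):
    """Two's-complement 32-bit add that silently wraps around on overflow."""
    return (a + b + 2**31) % 2**32 - 2**31


def add_checked(a, b):
    """32-bit add that stops at the point of failure instead of wrapping."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"32-bit overflow adding {a} and {b}")
    return result
```

With `add_wrapping`, adding 1 to `INT32_MAX` yields `INT32_MIN` and the program keeps running on a corrupt value; with `add_checked` the same call fails immediately at the line that caused it.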

Again, I don't call that a watchdog since it is actually a part of your protocol. A watchdog is used to catch problems you know nothing about but you want the system to continue to run. In CPUs they reset the system so user intervention isn't required. But it is still a disruption to the user if they are using it at the time.

In FPGAs the logic can be designed to not hang. It may require work to do the proper analysis, but it is not just possible, but saves money in the long run when you don't need to fix difficult to find bugs. Bottom line is adding a watchdog to an FPGA to catch unknown problems shows that something is missing from the design process.

Rick C

rickman 10 years ago

My first FPGA design was to provide data on the PCI bus through a bus interface chip. Turns out the PCI bus will hang the entire CPU if a handshake is not completed. My very first iteration of the design had a bug in the FSM that locked up the PC. lol It got fixed very quickly.

Rick C

Paul Rubin 10 years ago

As FPGA's get bigger and the circuits in them get more complicated, don't they face the same combinatorial explosion that big software systems do? There are tons of historical examples of Intel and similar CPU's (hard silicon, not even FPGA's) locking up due to bugs. Any big CPU or comparable chip will have an errata list. At some point you may have to accept that bugs are inevitable, and that a reliable system (besides preventing as many bugs as it can) also has to mitigate any remaining ones. Watchdogs are a time tested approach for that purpose.

rickman 10 years ago

How many of those large ASICs with hang bugs had watchdog timers? The bug was a system level design problem, not a logic bug in a FSM. They could be dealt with by a software change, no? Even if they couldn't be dealt with in software, what would a watchdog do? Reset your entire computer/phone/flight nav?

There is always the possibility of bugs in FPGAs. But bugs that require the use of a watchdog are a class of bugs that should be shaken out in debug unless the designers are not very good. If they can't find them in debug, they have to be pretty durn infrequent.

Do you have any links to descriptions of such bugs? I'm curious.

Rick C

Randy Yates 10 years ago

A WDT is also not a cure-all. Consider this scenario: a piece of hardware fails, changing the inputs to a piece of code in an unexpected way and causing the code to go into the weeds and the WDT to fire.

But after restart, the hardware is still failed and providing unexpected inputs, the same bug occurs again, the WDT fires again, and the processor restarts again. Ad infinitum.

So what did this fix? :)

Randy Yates, DSP/Embedded Firmware Developer Digital Signal Labs http://www.digitalsignallabs.com

Paul Rubin 10 years ago

I'd expect the WDT to be in the box that the ASIC is deployed in, not in the ASIC itself. That way the box resets if the ASIC locks up.

formatting link

rickman 10 years ago

I looked at this list and of the first three only one was a lockup of any sort. The LCD Controller in the MPC823 can hang the CPU when the LCD is disabled while in aggressive mode (LAM). This has a very simple fix in software, before disabling the LCD, turn off the LAM.

Where is the need for a watchdog of any sort? If an ASIC locks up, won't the CPU be able to figure it out and reset whatever is appropriate?

Rick C

Paul Rubin 10 years ago

What if the CPU is part of the ASIC?

rickman 10 years ago

Shouldn't a processor reset also reset the hardware?

Rick C

rickman 10 years ago

Ok, what if?

Rick C

Tim Wescott 10 years ago

Randy's point, I think, is that if something is _broken_, a reset isn't going to un-break it.

A processor reset should also reset the hardware, in much the same way that cops should always be honest -- "should" in this case indicates a moral requirement, but not, in all companies, a reasonable expectation.

Tim Wescott Control systems, embedded software and circuit design I'm looking for work! See my website if you're interested http://www.wescottdesign.com

rickman 10 years ago

I'm not sure we are on the same conversation. We were discussing how to design systems, not what systems get designed.

Rick C

Allan Herriman 10 years ago

You seem to be making the assumption that having two flip flops in series will stop the first one from being replicated. I've seen it happen (albeit with a huge fanout on the second FF).

The only way to ensure that the first FF has not been replicated is to check, or to apply attributes that will tell the tools not to replicate it. Even then, the tools may have bugs (they certainly have in the past) and you still need to check to be sure.

The good news is that the check can be automated.
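One way such an automated check might look, as a sketch only. It assumes you can export a flat list of register cell names from your tools and that replicated copies are recognizable by a name suffix; the `_rep` pattern used here is a hypothetical convention, so substitute whatever your tool actually emits.

```python
import re


def find_replicated(cell_names, protected, suffix_pattern=r"_rep(?:__\d+)?$"):
    """Return the protected registers that appear more than once in the netlist.

    cell_names: register names taken from a post-synthesis netlist dump.
    protected: names of first-stage synchronizer FFs that must not be copied.
    suffix_pattern: hypothetical suffix a tool appends to replicated copies.
    """
    counts = {name: 0 for name in protected}
    for cell in cell_names:
        # Strip the replication suffix to recover the original register name.
        base = re.sub(suffix_pattern, "", cell)
        if base in counts:
            counts[base] += 1
    return sorted(name for name, n in counts.items() if n > 1)
```

A non-empty result would fail the build, which turns the manual inspection into a regression check.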

Do you have a citation for "known bad practice"? I wrote that list in (I think) 2001, and I haven't seen anything containing /all/ of those points published prior to that date.

Allan

rickman 10 years ago

No, I'm not really a history buff. I don't know of any resource that lists problems to be avoided in digital logic design. Do you have any? I believe all of these issues are common knowledge.

You list 8 things you look for and the first 6 are clock domain crossing issues. So that is really one issue, good clock domain crossing design and you have listed six ways that designers screw up.

Using async inputs on FFs has always been discouraged by the FPGA companies, in no small part because it makes the design hard to verify and I believe they have said it makes it hard to port to ASICs (maybe because of being hard to verify).

I have heard forever that it is hard to gate clocks properly. It requires knowledge of gate delays and detailed timing which is typically avoided in FPGA designs in favor of unit-delay simulation with static timing analysis.

Rick C

Rob Gaddi 10 years ago

Wait, you're proposing that an error on the data bus should raise some sort of Data Abort exception (perhaps at vector address 0x10) rather than render the system catatonic? Poppycock! Who would ever design an ARM-based CPU in such a way?

Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix.

rickman 10 years ago

Since we have been discussing purely hardware issues and this is primarily a software group, I have started a post in comp.arch.fpga. If you think it is appropriate to continue this discussion here, maybe add comp.arch.fpga to the list of groups.

Rick C

lasselangwadtchristensen 10 years ago

yes I know, crazy talk ;)

-Lasse

Randy Yates 10 years ago

Exactly. Discriminate "broken hardware" from "hardware that's gotten into a bad state." The former won't benefit from a reset, the latter will.

Randy Yates, DSP/Embedded Firmware Developer Digital Signal Labs http://www.digitalsignallabs.com

rickman 10 years ago

Even broken hardware will benefit if the reset prevents actions that cause damage or disrupt a larger part of the system. It is frequently the case that a power up self test is performed before a system controls dangerous devices or tries to communicate with the larger system.
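A minimal sketch of that power-up sequence, assuming the self-tests are supplied as callables that return True on success. The function and parameter names are hypothetical; on a real target the callbacks would drive output enables and fault indicators.

```python
def power_up(self_tests, enable_outputs, enter_safe_state):
    """Run every self-test before any dangerous output is enabled.

    self_tests: mapping of test name -> callable returning True on pass.
    enable_outputs: called only if every test passes.
    enter_safe_state: called with the list of failed test names otherwise.
    """
    failures = [name for name, test in self_tests.items() if not test()]
    if failures:
        # Broken hardware: stay quiet and safe rather than drive anything.
        enter_safe_state(failures)
        return False
    enable_outputs()
    return True
```

After a watchdog reset, broken hardware then ends up parked in the safe state instead of driving a motor or talking to the larger system.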

Rick C

Hans-Bernhard Bröker 10 years ago

On 11.05.2016 at 08:17, Tim Wescott wrote:

Non-reset is not going to, either.

It's worth trying to distinguish between a run-off-into-the-wild system and a permanently broken one. So trigger a global reset, and see if that makes it work again. If it does, things are better than before. If it doesn't, they're no worse. As problem-handling approaches go, that's a pretty impressive result.

That's what a watchdog ultimately is good for: to distinguish between a SEU and a FUBAR situation.
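One common way to act on that distinction, and to avoid the endless reset loop mentioned earlier in the thread, is to count consecutive watchdog resets and stop retrying after a few. The sketch below is an assumption-laden illustration: it presumes the reset cause is readable at boot and that a counter survives the reset (for example in non-volatile or non-initialized RAM), and the function name and `limit` value are made up.

```python
def classify_boot(reset_was_watchdog, consecutive_wdt_resets, limit=3):
    """Decide how to start up, given the reset cause and the stored count.

    Returns (new_count, mode), where mode is "normal" or "safe".
    The caller is expected to store new_count where it survives a reset,
    and to clear it to zero once the system has run cleanly for a while.
    """
    if not reset_was_watchdog:
        # Power-on or deliberate reset: start from a clean slate.
        return 0, "normal"
    count = consecutive_wdt_resets + 1
    if count >= limit:
        # The reset did not help: treat the fault as persistent.
        return count, "safe"
    # Possibly a one-off upset: try normal operation again.
    return count, "normal"
```

A single watchdog reset followed by clean running looks like a transient; hitting the limit suggests something is actually broken and the system should stay in its safe mode.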

There's really nothing terminally wrong with having a watchdog. The main risk I see is that it's easy to fall into the trap of thinking of the Dog not as (almost) the last line of defense, but as the first, or even the only one you need. I.e. it's tempting to think: "Nice, now I've got a watchdog, so the rest of the system can be designed without a care."

OTOH, just because not all cops are honest, that doesn't make a world without any cops a better place.

It can hardly be considered a concept's fault if some people implement it incorrectly. And it may not even be incorrect to leave resetting the rest of the circuit to the micro. There might be some information for the micro to be had from inspecting the state of other parts of the hardware, as left behind by the hosed system state. It all depends.
