Randy Yates recently started a thread on programming flash that had an interesting tangent into watchdog timers. I thought it was interesting enough that I'm starting a thread here.
I had stated in Randy's thread that I avoid watchdogs, because they mostly seem to be a source of erroneous behavior to me.
However, on reflection I realized that I lied: I _do_ use watchdog timers, but not automatically. To date I've only used them when the processor is spinning a motor that might crash into something or otherwise engage in damaging behavior if the processor goes nuts.
In general, my rule on watchdogs, as with any other feature, is "use it if using it is better", which means that I think about the consequences of the thing popping off when I don't want it to (as during a code update or during development when I hit a breakpoint) vs. the consequences of not having the thing when the processor goes haywire.
Furthermore, if I use a watchdog I don't just treat updating the thing as a requirement check-box -- so you won't find a timer ISR in my code that unconditionally kicks the dog. Instead, I'll usually have just one task (the motor control one, on most of my stuff) kick the dog when it feels it's operating correctly. If I've got more than one critical task (i.e., if I'm running more than one motor out of one processor) I'll have a low-priority built-in-test task that kicks the dog, but only if it's getting periodic assurances of health from the (multiple) critical tasks.
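A minimal sketch of that pattern, in C. All the names (report_healthy, bit_task_tick) and the timeout value are illustrative, not from any real RTOS or watchdog API; the puts() stands in for the actual watchdog register write:

```c
/* Low-priority built-in-test task kicks the hardware watchdog only
 * when every critical task has recently reported in. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CRITICAL_TASKS  2
#define HEALTH_TIMEOUT_TICKS 10  /* hypothetical max ticks between reports */

static unsigned last_report[NUM_CRITICAL_TASKS];

/* Called by each critical task when it believes it is operating correctly. */
void report_healthy(int task_id, unsigned now)
{
    last_report[task_id] = now;
}

/* True only if every critical task has reported recently. */
bool all_tasks_healthy(unsigned now)
{
    for (int i = 0; i < NUM_CRITICAL_TASKS; i++)
        if (now - last_report[i] > HEALTH_TIMEOUT_TICKS)
            return false;
    return true;
}

/* The BIT task body: kick the dog only if everyone has checked in;
 * otherwise stay silent and let the hardware watchdog fire. */
void bit_task_tick(unsigned now)
{
    if (all_tasks_healthy(now))
        puts("kick");  /* stand-in for the real watchdog register write */
}
```

The point of the structure: no single piece of code can keep the dog alive on its own; the kick is a conjunction of independent health reports.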
Generally, in my systems, the result of the watchdog timer popping off is that the system will no longer work quite correctly, but it will operate safely.
So -- what do you do with watchdogs, and how, and why? Always use 'em? Never use 'em? Use 'em because the boss says so, but twiddle them in a "last part to break" bit of code?
Would you use a watchdog in a fly-by-wire system? A pacemaker? Why? Why not? Could you justify _not_ using a watchdog in the top-level processor of a Mars rover or a satellite?
Watchdog timers are not often used in FPGAs. I guess that's because processes in HDL seldom get stuck or lost in the weeds. ;)
When I designed a software project, we had multiple tasks, each kicking another task that tracked what was going on and "petted" the watchdog to keep it from barking. The various tasks had periods of interest different from the watchdog timeout, so this process dealt with the appropriate time period for each of the tasks being watched. Only this one task needed to actually deal with the watchdog period.
I'd say the FPGA equivalent to a watchdog is integrity checking hardware, like ECC RAM, state machines with explicit invalid state checking, all the way up to triple-modular redundancy. I've never needed any of that nonsense because everything I work on remains pleasantly surrounded by atmosphere, but it's definitely out there.
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
The problem is that you don't often know at design time which (if any) failures will require this sort of protection. Even "bug free" code can reside in a system that experiences hardware faults (power supply fluctuations, input latchup, etc.).
So, do you try to bolt this capability on, after the fact? Or, design around it from the start (hoping not to need it)?
There's no hard and fast rule for how you should implement a watchdog. It's a component in your system, just like any other component.
Putting the stroking of the watchdog in the idle task can leave your system vulnerable to any sort of momentary overload; or, necessitate an unduly long timeout (to accommodate short overloads).
Putting it in an ISR is almost always silly -- for obvious reasons.
OTOH, I currently use the software equivalent of that mechanism by having my "watchdog monitor" run as a very HIGH priority task! But, one that spends most of its life blocking awaiting "sanity messages" from the various tasks that are trying to stroke this *virtual* watchdog.
Putting all of the watchdog (hardware) interface in one task allows a more consistent -- and discerning -- implementation.
First, it ensures any such activities will get logged! If you've got lots of independent/autonomous tasks stroking the watchdog, you never know which one FORGOT to do so. As a result, you can't recover (post mortem) when the device comes out of reset.
Second, it allows the "stroking" to be smarter and more demonstrable of sentience on the part of the individual "strokers". I.e., instead of just twiddling a bit, you can engage the other party in a dialog and place further constraints on it to verify its sanity. ("Why are you sending me these keep alive messages at such an alarming rate? I was only EXPECTING to receive them from you at a more modest rate. Perhaps something has gone wrong in your implementation or process state??")
Third, it allows for tasks to *request* a watchdog intervention! ("OhMiGsh! The motor is ignoring my commands to turn off! Somebody pull the plug -- NOW!!!!") And, this can be logged for post mortem.
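The three points above can be sketched in a few lines of C. Everything here is illustrative -- the task count, the periods, and the monitor_* names are assumptions, not any real API; a real monitor would block on an IPC queue rather than poll arrays:

```c
/* "Virtual watchdog" monitor: tasks send sanity messages to one
 * high-priority monitor, which alone touches the hardware.  It can
 * log WHICH task went quiet, complain about messages arriving
 * suspiciously fast, and honor an explicit intervention request. */
#include <stdbool.h>

#define NTASKS     3
#define MAX_PERIOD 20  /* hypothetical: slowest acceptable message period */
#define MIN_PERIOD 2   /* hypothetical: alarmingly fast message period */

static unsigned last_msg[NTASKS];
static bool intervention_requested;

typedef enum { MSG_OK, MSG_TOO_FAST } msg_status;

/* A task strokes the virtual watchdog; the monitor sanity-checks the rate. */
msg_status monitor_message(int task, unsigned now)
{
    msg_status s = (now - last_msg[task] < MIN_PERIOD) ? MSG_TOO_FAST : MSG_OK;
    last_msg[task] = now;
    return s;
}

/* A task can ask for the plug to be pulled ("the motor won't turn off!"). */
void monitor_request_intervention(void) { intervention_requested = true; }

/* Periodic check: returns the id of the first delinquent task (so the
 * failure can be logged before reset), -1 if all is well, or -2 if an
 * intervention was explicitly requested. */
int monitor_check(unsigned now)
{
    if (intervention_requested) return -2;
    for (int t = 0; t < NTASKS; t++)
        if (now - last_msg[t] > MAX_PERIOD) return t;
    return -1;
}
```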
(sigh) I have a lengthy paper/tutorial I wrote many years ago on the subject as I'd had the "argument" with clients many times over the years. People seem to have a naive concept of what watchdogs (sentries) can and can't do -- as well as when they are indicated vs. contraindicated.
[One of these days, I'll set up a web site and push all these documents out there. But, far more interesting things to do with the few hours present in each day :-/ ]
Watchdogs take many forms -- hardware and software. A process that deliberately KILLs processes that it suspects of being corrupt is just as much a watchdog as a piece of hardware that tugs on /RESET.
Communication happens both in-band and out-of-band. The former, of course, tends to rely on "some (software)" remaining operational. The latter works around it.
A watchdog plays a LAGGING role in a system (it "happens" AFTER something has already gone wrong) as well as a LEADING role (it informs the user/environment of a potential "more significant" failure that hasn't percolated through the "system" yet!)
This role should not be glossed over. INFORMATION IS CONVEYED by these mechanisms. Simply ignoring that information (i.e., letting the device reset itself) is usually not a very good idea.
[Consider what happens when you have a device that is eager to start up quickly. If the device has incurred a watchdog upset, everything appears to shut down, unceremoniously. Then, as the device starts up again, it rushes to get everything running -- just in time for it to be (possibly) shut down by the same, persistent failure retriggering the watchdog. SOMETHING wants to be able to detect when a watchdog event has occurred and adjust the RESTART procedure (different from the START procedure) accordingly.]
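A sketch of that start-vs-restart distinction. Real parts expose a reset-cause register at boot; here it's modelled as a plain argument, and the backoff policy and names are purely illustrative assumptions:

```c
/* Distinguish a RESTART (after a watchdog event) from a cold START,
 * and back off harder as repeated watchdog resets accumulate. */
#include <stdbool.h>

typedef enum { RESET_POWERON, RESET_WATCHDOG, RESET_EXTERNAL } reset_cause;

/* Decide whether to rush everything back up at full speed. */
bool should_fast_start(reset_cause cause)
{
    /* After a watchdog upset, don't rush: the same persistent failure
     * may just retrigger the dog. */
    return cause != RESET_WATCHDOG;
}

/* Hypothetical exponential backoff before bringing subsystems up,
 * based on a count of watchdog resets kept in nonvolatile storage. */
unsigned startup_delay_ms(reset_cause cause, unsigned wd_resets_logged)
{
    if (cause != RESET_WATCHDOG)
        return 0;
    return 100u << (wd_resets_logged > 6 ? 6 : wd_resets_logged); /* capped */
}
```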
I'm currently working on ways to signal remote devices when a watchdog event has been triggered in some OTHER remote device; without relying on in-band signalling (if the device is misbehaving, how do I know it will be ABLE to inform others that it has just been watchdogged?). The point being so those other devices can adjust to this INFORMATION -- instead of wondering why some service/capability (in which the failed node played a part) isn't working properly AFTER SOME ARTIFICIAL DELAY.
What's the reliability of each system and PROTECTION system? I'd surely not want a watchdog on a Mars rover that resets more frequently than the round trip radio delay to its earth station!
(some hand-waving, there, but the point should be obvious)
Ask them why their FSMs got stuck. In development they may make mistakes, but you don't use watchdogs for debugging. In fact they get in the way.
I've never had an FSM failure in the field, but I suppose there is a first time. I did say "seldom", not never. An FSM in an FPGA is a separate entity. No other process in the FPGA can step on its memory or cause it to miss a deadline. CPUs are shared, which hugely complicates multi-process designs in all aspects. You just don't have that in an FPGA. By comparison FPGAs are simple. But maybe I've just not worked on an FPGA design that was complicated enough to compare to what the software guys do...
Quoting Tim Williams' book "The most cost-effective way to ensure the reliability of a microprocessor-based product is to accept that the program (or data or both, my addition) *will* occasionally be corrupted, and to provide a means whereby the program flow can be automatically recovered, preferably transparently to the user. This is the function of the microprocessor watchdog."
So, the whole thing is what to do "when" (not "if") shit (the unexpected) happens.
That's an interesting approach: just give up on making the system reliable and instead make it recover from failure. You do realize that just because Tim Williams said this, it doesn't make it gospel. It *is* possible to make programs that work, and in some cases a program can be *proven* to work. But those are rare.
Sure, it's great if you can make your system recover from a catastrophic failure. But there are many systems where that is not remotely a solution. Virtually any real-time control needs to work and the only other solution is to shut it down, preferably safely. Even that is not always possible.
For any system where there is potential for harm to people or even equipment (depending on the cost) the best approach is an independent monitor that disconnects the errant controller. In other words, when safety is important, a processor watchdog timer may not be adequate.
I just recalled that when designing FSMs in HDL, there is typically a synthesis option to recognize all unused states and design so they return to the reset condition. This is a good way to deal with SEU issues. It is very hard to prevent a hiccup from SEU, but recovery can be built in.
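The same defensive idiom, modelled in C rather than HDL (the states and transitions are made up for illustration): a state-transition function whose default arm recovers instead of assuming the current encoding is legal, which is exactly what the "safe FSM" synthesis option generates in hardware.

```c
/* Every unused state encoding maps back to reset, so an SEU that
 * flips the state register into an illegal value self-recovers. */
typedef enum { S_RESET = 0, S_IDLE = 1, S_RUN = 2 } fsm_state;

int fsm_next(int state, int go)
{
    switch (state) {
    case S_RESET: return S_IDLE;
    case S_IDLE:  return go ? S_RUN : S_IDLE;
    case S_RUN:   return go ? S_RUN : S_IDLE;
    default:      return S_RESET;  /* unused encoding: recover via reset */
    }
}
```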
How would you implement a watchdog for an FPGA which likely has many independent FSMs? What would you monitor?
I had to do some googling to find what he actually said.
I still maintain that watchdog timers are only required in high-radiation environments, in which humans would start to get radiation sickness, or at least cancer in the long run. Old electronics systems have been working for a decade or two without reboot. I have maintained some computer systems that were designed to do some thermal cycling every year. If I forgot the thermal cycling one year and the system didn't restart until the next year or the year after that, it was no big problem.
Here's something from a comp.arch.fpga post I made in 2003:
"When I was at Agilent I analysed the causes of failures in some FPGA developments.
About half of all FPGA design related bugs (weighted by the time spent finding them) were associated with asynchronous logic and clock domain crossings. [snip] 0% of the clock domain crossing bugs had anything to do with metastability. Glitches and races were the cause."
Geeze, that just shouldn't happen. I'm not sure what they mean by "asynchronous logic" as real asynchronous logic is almost never used in FPGAs. Clock domain crossing is well understood so there is no reason to not get it right. It's the kind of thing that normally gets a big, red flag at design time and so is done correctly.
Please don't expect that illegal state coverage will make your FSM reliable. That will only help with illegal states, but illegal states aren't the only causes of lockups.
Consider FSMs in two systems (perhaps on the same chip) talking to each other with some handshaking. There's a state that waits for a handshake signal from the other system. If both FSMs get in that state (from any cause: glitch, SEU, coding bug), the system will lock up.
You should be able to see how a watchdog would help with that. The watchdog could be built into the FSM, or it could sit to the side and reset the whole FSM.
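A toy model of that lockup and the side-sitting watchdog, in C (entirely illustrative: the state names, the cycle limit, and the single-FSM simplification are all assumptions):

```c
/* An FSM waiting on a handshake that never arrives, plus a
 * cycle-counting watchdog that forces it back to idle. */
#include <stdbool.h>

enum { W_IDLE, W_WAIT_ACK };
#define WD_LIMIT 8  /* hypothetical: max cycles allowed waiting for ack */

typedef struct { int state; unsigned wd_count; unsigned resets; } fsm_t;

void fsm_clock(fsm_t *f, bool ack)
{
    if (f->state == W_WAIT_ACK && !ack) {
        if (++f->wd_count > WD_LIMIT) {  /* watchdog fires */
            f->state = W_IDLE;
            f->wd_count = 0;
            f->resets++;
        }
        return;
    }
    f->wd_count = 0;
    if (f->state == W_WAIT_ACK && ack)
        f->state = W_IDLE;       /* normal handshake completion */
    else
        f->state = W_WAIT_ACK;   /* issue the next request */
}
```

If the partner never acks (the mutual-wait deadlock described above), the watchdog counter saturates and the FSM is yanked back to idle instead of hanging forever.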
Firstly, I create an architecture that doesn't have many interlocking FSMs. Significant parts of my design (particularly in the datapath) will not have any FSMs at all, and hence, no chance of FSM lockups.
Then I consider each FSM independently. If possible, I make it inherently crashproof. If not, I may add a watchdog timer. Sometimes I will add a circuit that looks for bad signatures (e.g. unusual FIFO depths) instead.
A recent example from a system I was designing for a client:
The Xilinx transceivers need to be reset in a particular sequence to work properly (particularly at the higher data rates, e.g. > 10Gb/s). These transceivers don't have a lock output that works reliably (thanks Xilinx!). Instead, one must go to the next highest protocol layer (e.g. (Ethernet) PCS level) to monitor that protocol's sync to determine whether the transceiver is working.
I coded a watchdog timer that would reset the transceiver if it hadn't seen PCS sync for a certain time. I can't get it to fail now.
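The core of such a watchdog is just a saturating counter; here is a sketch in C of the same structure (the timeout value and interface are illustrative assumptions, not Xilinx's actual signals):

```c
/* Clocked once per cycle with the PCS sync status; requests a
 * transceiver reset when sync has been absent for too long. */
#include <stdbool.h>

#define SYNC_TIMEOUT 1000  /* hypothetical: cycles without sync before reset */

static unsigned no_sync_cycles;

bool pcs_watchdog(bool pcs_synced)
{
    if (pcs_synced) {
        no_sync_cycles = 0;
        return false;
    }
    if (++no_sync_cycles >= SYNC_TIMEOUT) {
        no_sync_cycles = 0;  /* restart the count after asserting reset */
        return true;         /* pulse the transceiver reset */
    }
    return false;
}
```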
A watchdog on the AXI bus would be nice. It is easy to reconfigure the programmable logic in a Zynq, but if you have stuff on the bus you have to make absolutely sure no software is accessing it, because an access will halt the whole system and only a reset will recover from that.
I would not say that clock domain crossings are /well/ understood by beginners, or even moderately experienced designers.
BTW, I weighted the results with the time taken to find the bugs. There weren't that many bugs, it's just that they took a long time to find compared with straightforward functional bugs.
Many of the bugs were caused by integrating IP (written elsewhere) and it wasn't always obvious to the designers that signals were crossing clock domains.
Some of the bugs were created by the tools, e.g. when they replicated logic. That makes the bugs hard to find during source code review. It's actually better to review the post-synth netlist than the source code. (Better still to use an automated tool to do it.)
[From a 2008 c.a.f post of mine] here's a list of the sort of things that could go wrong. Please bear in mind that this list is historical (i.e. it was based on experience with older FPGA families and older tools, in a job I left over a decade ago).
- (race) Passing vectors (i.e. multiple signals) from clock domain A to clock domain B and expecting all the bits to arrive on the same B clock.
- (race) As above, but adding multiple banks of retiming flip flops in the B clock domain, which fixed the (non-existent) metastability issue but did nothing about the race.
- (race) Passing a signal in clock domain A to multiple flip flops in clock domain B, and expecting the B flip flops to get the same value on the same clock.
- (race) As above, but created when the tools replicate the B logic to manage fanout.
- (glitch) Multiple signals in clock domain A hit some combinatorial logic producing a single signal which is sampled by a flip flop in clock domain B. Sometimes there may be a glitch which gets sampled by the B flip flop. It can be difficult to design combinatorial logic with good glitch coverage (and if you do, the tools will often remove it). (See XAPP
- (glitch) Clock multiplexers made out of combinatorial logic with inadequate glitch coverage (or adequate glitch coverage removed by the tools).
- Using async reset or set inputs on flip flops to implement a logic function (rather than just using them for initialisation). I can remember a case where a design would fail even when we could prove mathematically that it couldn't fail. Rewriting it to avoid the use of async resets fixed the problem.
- Gating clocks to create a logic function. I know this sort of thing is done in ASICs to save power, but it just doesn't seem to work too well in FPGAs sometimes.
We leave the ARM watchdog permanently enabled in our Zynq systems for that very reason. If code (via e.g. a wrong pointer) accesses an address without an AXI address decode, it will hang the AXI, and hence the whole box.
With a watchdog, it will reboot (perhaps to fail again, perhaps not).
Why would a beginner be designing a system without supervision? As I said, this is the sort of issue that gets a red flag and lots of attention in a design review.
That's why they get lots of attention up front rather than after they are a problem.
That's exactly what happened to me. A simple UART needed an FF at the data-in port. I had designed the UART and didn't document that detail. I used it later in a test fixture and forgot to include the I/O FF. It bit me hard, as I was writing the software that talked to this port and kept thinking the flaw was software.
Not sure how that happens. Are you saying a design with 1 FF and many destinations had an async input? That alone is a no-no. If there had been a FF in front to remove metastability it would have provided protection from async inputs (race) when the second FF was replicated.
All of these issues are known bad practice. Messing with the clocks or using async inputs on FFs is an especially bad practice. I thought that ended in the 90s.
I did my first FPGA design in '95. I've always wondered how good that code was. I took some training on the Orcad schematic tools for FPGA design and learned about VHDL in one day. lol That made me the resident expert! I had to deal with changing compilers twice in the project, so I learned something about making your code portable very early. But I knew little about clock domain crossings and metastability. I guess I knew about race conditions though. That was not uncommon in discrete logic design which I had done.
Regardless, I don't see any reason to use a watchdog timer with an FPGA design unless you have SEU issues. A proper code review with experienced designers will catch all of the above problems. Adding a bandaid is not the solution when there is no reason to not have a good, clean system in an FPGA.
The reason why watchdogs are used with software is because software has so many more interactions and opportunities for something to screw up. Using an inherently serial processor to do multitasking is prone to problems with complex interactions. Clock domain crossing in FPGAs is similarly complex, but nearly always much more limited in scope, so much easier to focus on to resolve all the details and get right.
I just don't buy the need for a watchdog with an FPGA. Have you ever seen one used that didn't involve SEU?