Kicking the dog -- how do you use watchdog timers?

Randy Yates recently started a thread on programming flash that had an  
interesting tangent into watchdog timers.  I thought it was interesting  
enough that I'm starting a thread here.

I had stated in Randy's thread that I avoid watchdogs, because they  
mostly seem to be a source of erroneous behavior to me.

However, on reflection I realized that I lied: I _do_ use watchdog  
timers, but not automatically.  To date I've only used them when the  
processor is spinning a motor that might crash into something or  
otherwise engage in damaging behavior if the processor goes nuts.  

In general, my rule on watchdogs, as with any other feature, is "use it  
if using it is better", which means that I think about the consequences  
of the thing popping off when I don't want it to (as during a code update  
or during development when I hit a breakpoint) vs. the consequences of  
not having the thing when the processor goes haywire.

Furthermore, if I use a watchdog I don't just treat updating the thing as  
a requirement check-box -- so you won't find a timer ISR in my code that  
unconditionally kicks the dog.  Instead, I'll usually have just one task  
(the motor control one, on most of my stuff) kick the dog when it feels  
it's operating correctly.  If I've got more than one critical task (i.e.,  
if I'm running more than one motor out of one processor) I'll have a low-
priority built-in-test task that kicks the dog, but only if it's getting  
periodic assurances of health from the (multiple) critical tasks.
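
A minimal C sketch of the multi-task arrangement described above, with hypothetical names throughout: each critical task sets its own bit only when it believes it is healthy, and the low-priority built-in-test task kicks the dog only when every bit is set. `hw_watchdog_kick()` is a stand-in for whatever the real part requires, and on a real RTOS the flag update would need to be atomic or done inside a critical section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task IDs; one bit per critical task. */
enum { TASK_MOTOR_A = 1u << 0, TASK_MOTOR_B = 1u << 1 };
#define ALL_CRITICAL_TASKS (TASK_MOTOR_A | TASK_MOTOR_B)

static volatile uint32_t alive_flags;  /* set by tasks, cleared by supervisor */
static unsigned kick_count;            /* stands in for the hardware kick */

/* Placeholder for the device-specific watchdog register write. */
static void hw_watchdog_kick(void) { kick_count++; }

/* Called by a critical task only when it judges itself healthy.
   Assumption: the caller makes this update atomic on a real system. */
void task_report_healthy(uint32_t task_bit) { alive_flags |= task_bit; }

/* Called periodically from the low-priority built-in-test task.
   Returns true if the watchdog was kicked on this call. */
bool supervisor_poll(void)
{
    if ((alive_flags & ALL_CRITICAL_TASKS) != ALL_CRITICAL_TASKS)
        return false;      /* someone is silent: let the watchdog expire */
    alive_flags = 0;       /* demand fresh reports before the next kick */
    hw_watchdog_kick();
    return true;
}
```

Clearing the flags after each kick is the point of the design: a task that reported once and then died cannot keep the dog fed.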

Generally, in my systems, the result of the watchdog timer popping off is  
that the system will no longer work quite correctly, but it will operate  
safely.

So -- what do you do with watchdogs, and how, and why?  Always use 'em?  
Never use 'em?  Use 'em because the boss says so, but twiddle them in a  
"last part to break" bit of code?

Would you use a watchdog in a fly-by-wire system?  A pacemaker?  Why?  
Why not?  Could you justify _not_ using a watchdog in the top-level  
processor of a Mars rover or a satellite?

--  

Tim Wescott
Wescott Design Services
Re: Kicking the dog -- how do you use watchdog timers?
On 5/9/2016 1:06 PM, Tim Wescott wrote:

Watchdog timers are not often used in FPGAs.  I guess that's because  
processes in HDL seldom get stuck or lost in the weeds.  ;)

When I did design a software project we had multiple tasks each kicking  
another task which would track what was going on and "pet" the watch dog  
to keep it from barking.  The various tasks had periods of "interest"  
different from the watch dog timeout, so this process dealt with the  
appropriate time period of each of the tasks being watched.  Only this  
task needed to actually deal with the watch dog period.
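
A rough C sketch of that kind of monitor, assuming a free-running tick counter and made-up per-task limits: each watched task checks in at its own natural rate, and only the monitor knows about the hardware watchdog period. All names and numbers here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3

/* Hypothetical per-task limits: the longest silence tolerated, in ticks. */
static const uint32_t max_silence[NUM_TASKS] = { 2, 10, 50 };
static uint32_t last_seen[NUM_TASKS];  /* tick of each task's latest check-in */

/* A watched task calls this at its own rate. */
void task_check_in(unsigned id, uint32_t now) { last_seen[id] = now; }

/* The monitor runs more often than the hardware watchdog timeout and
   returns true only if the watchdog should be petted on this tick. */
bool monitor_should_pet(uint32_t now)
{
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        /* Unsigned subtraction stays correct across counter wraparound. */
        if ((uint32_t)(now - last_seen[i]) > max_silence[i])
            return false;
    }
    return true;
}
```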

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
rickman wrote:


I'd say the FPGA equivalent to a watchdog is integrity checking
hardware, like ECC RAM, state machines with explicit invalid state
checking, all the way up to triple-modular redundancy.  I've never
needed any of that nonsense because everything I work on remains
pleasantly surrounded by atmosphere, but it's definitely out there.

--  
Rob Gaddi, Highland Technology -- www.highlandtechnology.com  
Email address domain is currently out of order.  See above to fix.

Re: Kicking the dog -- how do you use watchdog timers?
On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:


I've spent lab time next to unhappily cursing FPGA guys (good ones)  
trying to determine why their state machines have wedged.

So I'm not sure that's an entirely accurate statement.


That's more or less what I do if I need to keep watch on multiple tasks.

--  

Tim Wescott
Wescott Design Services
Re: Kicking the dog -- how do you use watchdog timers?
On 5/9/2016 5:13 PM, Tim Wescott wrote:

Ask them why their FSMs got stuck.  In development they may make  
mistakes, but you don't use watchdogs for debugging.  In fact they get  
in the way.

I've never had an FSM failure in the field, but I suppose there is always a
first time.  I did say "seldom", not never.  An FSM in an FPGA is a
separate entity.  No other process in the FPGA can step on its memory or
cause it to miss a deadline.  A CPU is shared, which hugely complicates
multi-process designs in all aspects.  You just don't have that in an
FPGA.  By comparison FPGAs are simple.  But maybe I've just not worked  
on an FPGA design that was complicated enough to compare to what the  
software guys do...




--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
rickman wrote:


Oh, that's easy.  Because of either:

An error in the synchronous logic, leaving it in a defined state with no
way out (20% chance).

An unsynchronized async input causing a race condition that static
timing couldn't catch (80% chance)

Or a single event upset (0.0001% chance)


--  
Rob Gaddi, Highland Technology -- www.highlandtechnology.com  
Email address domain is currently out of order.  See above to fix.

Re: Kicking the dog -- how do you use watchdog timers?
On 5/10/2016 12:48 PM, Rob Gaddi wrote:

That's a system debug thing and actually shouldn't happen at all as  
there are tools to analyze for it.



Newbie mistake... that even... uh, experienced designers do once in a  
while... uh, sometimes...  still, it wouldn't make it to a fielded  
system and so does not create a need for a watchdog.



SEU is a possibility and in fact is a reason why watchdogs are used on  
FPGAs in space craft.  Here on the ground the probability is more like,  
0.0000000000001 in a year.  I didn't actually count the zeros, but it is  
a *lot*.  You will never see it in your lifetime.

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:



Here's something from a comp.arch.fpga post I made in 2003:

"When I was at Agilent I analysed the causes of failures in some FPGA
developments.

About half of all FPGA design related bugs (weighted by the time spent
finding them) were associated with asynchronous logic and clock domain
crossings.  [snip]  0% of the clock domain crossing bugs had anything to  
do with metastability.  Glitches and races were the cause."


Regards,
Allan

Re: Kicking the dog -- how do you use watchdog timers?
On 5/10/2016 7:11 PM, Allan Herriman wrote:

Geeze, that just shouldn't happen.  I'm not sure what they mean by  
"asynchronous logic" as real asynchronous logic is almost never used in  
FPGAs.  Clock domain crossing is well understood so there is no reason  
to not get it right.  It's the kind of thing that normally gets a big,  
red flag at design time and so is done correctly.

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:



I would not say that clock domain crossings are /well/ understood by  
beginners, or even moderately experienced designers.

BTW, I weighted the results with the time taken to find the bugs.
There weren't that many bugs, it's just that they took a long time to
find compared with straightforward functional bugs.

Many of the bugs were caused by integrating IP (written elsewhere) and it  
wasn't always obvious to the designers that signals were crossing clock  
domains.

Some of the bugs were created by the tools, e.g. when they replicated  
logic.  That makes the bugs hard to find during source code review.  It's  
actually better to review the post-synth netlist than the source code.  
(Better still to use an automated tool to do it.)


[From a 2008 c.a.f post of mine] here's a list of the sort of things that  
could go wrong.  Please bear in mind that this list is historical (i.e.  
it was based on experience with older FPGA families and older tools, in a  
job I left over a decade ago).


- (race) Passing vectors (i.e. multiple signals) from clock domain A
to clock domain B and expecting all the bits to arrive on the same B
clock.

- (race) As above, but adding multiple banks of retiming flip flops in
the B clock domain, which fixed the (non-existent) metastability issue
but did nothing about the race.

- (race) Passing a signal in clock domain A to multiple flip flops in
clock domain B, and expecting the B flip flops to get the same value
on the same clock.

- (race) As above, but created when the tools replicate the B logic to
manage fanout.

- (glitch) Multiple signals in clock domain A hit some combinatorial
logic producing a single signal which is sampled by a flip flop in
clock domain B.  Sometimes there may be a glitch which gets sampled by
the B flip flop.
It can be difficult to design combinatorial logic with good glitch
coverage (and if you do, the tools will often remove it).  (See XAPP
024, btw.)

- (glitch) Clock multiplexers made out of combinatorial logic with
inadequate glitch coverage (or adequate glitch coverage removed by the
tools).

- Using async reset or set inputs on flip flops to implement a logic
function (rather than just using them for initialisation).  I can
remember a case where a design would fail even when we could prove
mathematically that it couldn't fail.  Rewriting it to avoid the use
of async resets fixed the problem.

- Gating clocks to create a logic function.  I know this sort of thing
is done in ASICs to save power, but it just doesn't seem to work too
well in FPGAs sometimes.
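
The first race in the list (a multi-bit value whose bits arrive on different B clocks) can be illustrated in C. The list does not say how it was fixed. One widely used remedy for counters and FIFO pointers, offered here as an assumption and not as something taken from the post, is to send the value in Gray code so that only one bit changes per increment. The receiver then samples either the old value or the new one, never a mixture. The function names are hypothetical.

```c
#include <stdint.h>

/* Binary to reflected Gray code. */
uint32_t bin_to_gray(uint32_t b) { return b ^ (b >> 1); }

/* Gray code back to binary: each bit is the XOR of all higher Gray bits. */
uint32_t gray_to_bin(uint32_t g)
{
    g ^= g >> 16;
    g ^= g >> 8;
    g ^= g >> 4;
    g ^= g >> 2;
    g ^= g >> 1;
    return g;
}

/* Number of bit positions in which two words differ. */
unsigned bits_changed(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

Going from 7 to 8 in plain binary flips four bits, so a receiver that catches them on different clocks can see values that were never sent. The Gray-coded versions of 7 and 8 differ in exactly one bit. This only helps for values that step by one; an arbitrary vector still needs a handshake or a FIFO.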


Regards,
Allan

Re: Kicking the dog -- how do you use watchdog timers?
On 5/10/2016 8:50 PM, Allan Herriman wrote:

Why would a beginner be designing a system without supervision?  As I  
said, this is the sort of issue that gets a red flag and lots of  
attention in a design review.



That's why they get lots of attention up front rather than after they  
are a problem.



That's exactly what happened to me.  A simple UART needed a FF at the  
data in port.  I had designed the UART and didn't document that detail.
I used it later in a test fixture and forgot to include the I/O FF.
It bit me hard as I was writing the software that talked to this port and
kept thinking the flaw was software.



Not sure how that happens.  Are you saying a design with 1 FF and many  
destinations had an async input?  That alone is a no-no.  If there had  
been a FF in front to remove metastability it would have provided  
protection from async inputs (race) when the second FF was replicated.



All of these issues are known bad practice.  Messing with the clocks or  
using async inputs on FFs is an especially bad practice.  I thought that  
ended in the 90s.

I did my first FPGA design in '95.  I've always wondered how good that  
code was.  I took some training on the Orcad schematic tools for FPGA  
design and learned about VHDL in one day.  lol  That made me the  
resident expert!  I had to deal with changing compilers twice in the  
project, so I learned something about making your code portable very  
early.  But I knew little about clock domain crossings and  
metastability.  I guess I knew about race conditions though.  That was  
not uncommon in discrete logic design which I had done.

Regardless, I don't see any reason to use a watchdog timer with an FPGA  
design unless you have SEU issues.  A proper code review with  
experienced designers will catch all of the above problems.  Adding a  
bandaid is not the solution when there is no reason to not have a good,  
clean system in an FPGA.

The reason why watchdogs are used with software is because software has  
so many more interactions and opportunities for something to screw up.  
Using an inherently serial processor to do multitasking is prone to  
problems with complex interactions.  Clock domain crossing in FPGAs is  
similarly complex, but nearly always much more limited in scope, so much  
easier to focus on to resolve all the details and get right.

I just don't buy the need for a watchdog with an FPGA.  Have you ever  
seen one used that didn't involve SEU?

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
On Tue, 10 May 2016 21:45:32 -0400, rickman wrote:


You seem to be making the assumption that having two flip flops in series
will stop the first one from being replicated.  I've seen it happen  
(albeit with a huge fanout on the second FF).

The only way to ensure that the first FF has not been replicated is to  
check, or to apply attributes that will tell the tools not to replicate  
it.  Even then, the tools may have bugs (they certainly have in the past)  
and you still need to check to be sure.

The good news is that the check can be automated.



Do you have a citation for "known bad practice"?  I wrote that list in (I
think) 2001, and I haven't seen anything containing /all/ of those points  
published prior to that date.

Allan

Re: Kicking the dog -- how do you use watchdog timers?
On 5/11/2016 7:16 AM, Allan Herriman wrote:

No, I'm not really a history buff.  I don't know of any resource that  
lists problems to be avoided in digital logic design.  Do you have any?
I believe all of these issues are common knowledge.

You list 8 things you look for and the first 6 are clock domain crossing
issues.  So that is really one issue, good clock domain crossing design,
and you have listed six ways that designers screw it up.

Using async inputs on FFs has always been discouraged by the FPGA  
companies, in no small part because it makes the design hard to verify  
and I believe they have said it makes it hard to port to ASICs (maybe  
because of being hard to verify).

I have heard forever that it is hard to gate clocks properly.  It  
requires knowledge of gate delays and detailed timing which is typically  
avoided in FPGA designs in favor of unit-delay simulation with static
timing analysis.

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
On 5/11/2016 12:23 PM, rickman wrote:

Since we have been discussing purely hardware issues and this is  
primarily a software group, I have started a post in comp.arch.fpga.  If  
you think it is appropriate to continue this discussion here, maybe add  
comp.arch.fpga to the list of groups.

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
On 5/10/2016 12:48 PM, Rob Gaddi wrote:

I just recalled that when designing FSMs in HDL, there is typically a  
synthesis option to recognize all unused states and design so they  
return to the reset condition.  This is a good way to deal with SEU  
issues.  It is very hard to prevent a hiccup from SEU, but recovery can  
be built in.
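
As a behavioral illustration only (C standing in for HDL, with a hypothetical three-state machine in a 2-bit register), the recovery idea looks like this: any encoding that is not a defined state sends the machine back to its reset state on the next step.

```c
/* Defined states of a hypothetical FSM held in 2 bits; encoding 3 is unused. */
enum { ST_IDLE = 0, ST_LOAD = 1, ST_RUN = 2 };

/* One clock step. `state_bits` is the raw register contents, which an
   upset could leave at the unused encoding. Returns the next state. */
unsigned fsm_step(unsigned state_bits, int start)
{
    switch (state_bits & 0x3u) {
    case ST_IDLE: return start ? ST_LOAD : ST_IDLE;
    case ST_LOAD: return ST_RUN;
    case ST_RUN:  return ST_IDLE;
    default:      return ST_IDLE;  /* unused encoding: recover to reset state */
    }
}
```

In an HDL the equivalent default branch may be optimized away, because synthesis tools can treat unreachable states as don't-cares. That is presumably why it is exposed as a synthesis option, as mentioned above, instead of being left to the source code alone.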

How would you implement a watchdog for an FPGA which likely has many  
independent FSMs?  What would you monitor?

--  

Rick C

Re: Kicking the dog -- how do you use watchdog timers?
On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:


Please don't expect that illegal state coverage will make your FSM  
reliable.  That will only help with illegal states, but illegal states  
aren't the only causes of lockups.

Consider FSMs in two systems (perhaps on the same chip) talking to each  
other with some handshaking.  There's a state that waits for a handshake  
signal from the other system.  If both FSMs get in that state (from any  
cause: glitch, SEU, coding bug), the system will lock up.

You should be able to see how a watchdog would help with that.  The  
watchdog could be built into the FSM, or it could sit to the side and  
reset the whole FSM.
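
A small C model of the "built into the FSM" variant, under assumed names and an assumed limit of four ticks: the wait-for-handshake state counts how long it has been waiting and gives up when the peer never answers, so two machines that both end up waiting cannot stay wedged forever.

```c
#include <stdbool.h>

#define WAIT_LIMIT 4   /* hypothetical timeout, in clock ticks */

typedef struct {
    bool     waiting;     /* true while in the wait-for-handshake state */
    unsigned wait_ticks;  /* ticks spent in that state so far */
    unsigned timeouts;    /* how many times the built-in watchdog fired */
} hs_fsm;

/* One clock tick. `request` starts a transaction, `ack` is the peer's
   handshake. Returns true if the watchdog forced a return to idle. */
bool hs_fsm_tick(hs_fsm *f, bool request, bool ack)
{
    if (!f->waiting) {
        if (request) {
            f->waiting = true;
            f->wait_ticks = 0;
        }
        return false;
    }
    if (ack) {                              /* normal completion */
        f->waiting = false;
        return false;
    }
    if (++f->wait_ticks >= WAIT_LIMIT) {    /* peer never answered */
        f->waiting = false;
        f->timeouts++;
        return true;
    }
    return false;
}
```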


Firstly, I create an architecture that doesn't have many interlocking  
FSMs.  Significant parts of my design (particularly in the datapath) will  
not have any FSMs at all, and hence, no chance of FSM lockups.

Then I consider each FSM independently.  If possible, I make it  
inherently crashproof.  If not, I may add a watchdog timer.  Sometimes I  
will add a circuit that looks for bad signatures (e.g. unusual FIFO  
depths) instead.


A recent example from a system I was designing for a client:

The Xilinx transceivers need to be reset in a particular sequence to work  
properly (particularly at the higher data rates, e.g. > 10Gb/s).
These transceivers don't have a lock output that works reliably (thanks  
Xilinx!).  Instead, one must go to the next highest protocol layer (e.g.
the Ethernet PCS level) to monitor that protocol's sync to determine
whether the transceiver is working.

I coded a watchdog timer that would reset the transceiver if it hadn't  
seen PCS sync for a certain time.  I can't get it to fail now.
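
The behavior of such a timer can be modeled in a few lines of C. This is a sketch under assumptions, not the actual design: the timeout value, the names, and the choice to restart the count after each reset are all invented for illustration, and the real reset sequencing is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYNC_TIMEOUT_TICKS 1000u   /* hypothetical; the post gives no figure */

typedef struct {
    uint32_t no_sync_ticks;   /* consecutive ticks without PCS sync */
} sync_wdt;

/* Call once per tick with the current PCS sync status.
   Returns true on the tick where the transceiver should be reset. */
bool sync_wdt_tick(sync_wdt *w, bool pcs_sync)
{
    if (pcs_sync) {
        w->no_sync_ticks = 0;     /* link is up: keep the timer cleared */
        return false;
    }
    if (++w->no_sync_ticks < SYNC_TIMEOUT_TICKS)
        return false;             /* still within the allowed window */
    w->no_sync_ticks = 0;         /* give the link a full window after reset */
    return true;
}
```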

Regards,
Allan

Re: Kicking the dog -- how do you use watchdog timers?
On Wednesday, May 11, 2016 at 01:59:02 UTC+2, Allan Herriman wrote:

A watchdog on the AXI bus would be nice.  It is easy to reconfigure the
programmable logic in a Zynq, but if you have stuff on the bus you have to
make absolutely sure no software is accessing it, because that will
halt the whole system and only a reset will recover from it.

-Lasse


Re: Kicking the dog -- how do you use watchdog timers?
On Tue, 10 May 2016 17:22:43 -0700, lasselangwadtchristensen wrote:


We leave the ARM watchdog permanently enabled in our Zynq systems for  
that very reason.
If code (via e.g. a wrong pointer) accesses an address without an AXI  
address decode, it will hang the AXI, and hence the whole box.  

With a watchdog, it will reboot (perhaps to fail again, perhaps not).

Regards,
Allan

Re: Kicking the dog -- how do you use watchdog timers?
On Wednesday, May 11, 2016 at 02:58:08 UTC+2, Allan Herriman wrote:

yep, but it would be nice if there was an option to get something similar  
to an access denied and handle it from there instead of resetting the whole  
system

-Lasse

Re: Kicking the dog -- how do you use watchdog timers?
snipped-for-privacy@gmail.com wrote:


Wait, you're proposing that an error on the data bus should raise some
sort of Data Abort exception (perhaps at vector address 0x10) rather
than render the system catatonic?  Poppycock!  Who would ever design an
ARM-based CPU in such a way?

--  
Rob Gaddi, Highland Technology -- www.highlandtechnology.com  
Email address domain is currently out of order.  See above to fix.
