Kicking the dog -- how do you use watchdog timers?

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 2:00 AM

Coding bug??? Not sure where you are getting this. I specifically excluded SEU because that is one situation that can cause problems with *any* design in unpredictable ways. Otherwise this is a system design issue. If your system is subject to "glitches" then a means should be designed into the handshake to resolve timeouts. Resetting the entire FPGA or board shouldn't be necessary.

You are talking about a specified timeout on a communications protocol, not a watchdog.

If you can design sequential control logic that doesn't have FSMs, then you are a better man than I am... or you are good at renaming circuits. Everything sequential is an FSM other than simple data registers. A counter is a FSM.

Yes, defining transitions for every possible state is a good tool, if needed. But by default adding a watchdog timer is overkill, especially when it simply masks a bug rather than exposing it.
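As a rough illustration of the two ideas above (a timeout built into the handshake, and a defined transition out of every state), here is a minimal sketch. It models the FSM in C rather than an HDL, and the state names, the `hs_step` function and the 8-tick timeout are all made up for the example, not taken from anyone's actual design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handshake states; names are illustrative only. */
typedef enum { HS_IDLE, HS_WAIT_ACK, HS_DONE } hs_state_t;

#define HS_TIMEOUT_TICKS 8u   /* assumed timeout, in clock ticks */

typedef struct {
    hs_state_t state;
    uint32_t   ticks_waiting;
    uint32_t   timeouts;      /* recovered timeouts, kept for diagnostics */
} handshake_t;

/* Advance the FSM by one tick. Every state, including an illegal
 * encoding, has a defined next state, so the machine cannot hang. */
void hs_step(handshake_t *h, bool request, bool ack)
{
    assert(h != NULL);
    switch (h->state) {
    case HS_IDLE:
        if (request) {
            h->state = HS_WAIT_ACK;
            h->ticks_waiting = 0;
        }
        break;
    case HS_WAIT_ACK:
        if (ack) {
            h->state = HS_DONE;
        } else if (++h->ticks_waiting >= HS_TIMEOUT_TICKS) {
            h->timeouts++;        /* record the event instead of hiding it */
            h->state = HS_IDLE;   /* abandon this transfer only */
        }
        break;
    case HS_DONE:
        h->state = HS_IDLE;
        break;
    default:                      /* illegal state: recover to idle */
        h->state = HS_IDLE;
        break;
    }
}
```

The point of the `timeouts` counter is that the recovery stays visible, so a recurring timeout still points at a bug instead of being silently masked.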

The Transputer had math instructions that would halt the CPU when an overflow occurred. It sounded crazy at the time, but that is actually preferable to letting an erroneous system continue running. Watchdogs are often like that, they let a system continue running in a corrupt way rather than pointing to the bug.

Again, I don't call that a watchdog since it is actually a part of your protocol. A watchdog is used to catch problems you know nothing about when you want the system to continue to run. In CPUs they reset the system so user intervention isn't required. But it is still a disruption to the user if they are using it at the time.

In FPGAs the logic can be designed to not hang. It may require work to do the proper analysis, but it is not only possible, it saves money in the long run because you don't need to chase difficult-to-find bugs. Bottom line is adding a watchdog to an FPGA to catch unknown problems shows that something is missing from the design process.

--

Rick C

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 2:04 AM

My first FPGA design was to provide data on the PCI bus through a bus interface chip. Turns out the PCI bus will hang the entire CPU if a handshake is not completed. My very first iteration of the design had a bug in the FSM that locked up the PC. lol It got fixed very quickly.

--

Rick C

- P
- Paul Rubin
  
posted
7 years ago

Wed, May 11, 2016 2:38 AM

As FPGA's get bigger and the circuits in them get more complicated, don't they face the same combinatorial explosion that big software systems do? There are tons of historical examples of Intel and similar CPU's (hard silicon, not even FPGA's) locking up due to bugs. Any big CPU or comparable chip will have an errata list. At some point you may have to accept that bugs are inevitable, and that a reliable system (besides preventing as many bugs as it can) also has to mitigate any remaining ones. Watchdogs are a time tested approach for that purpose.

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 3:01 AM

How many of those large ASICs with hang bugs had watchdog timers? The bug was a system level design problem, not a logic bug in a FSM. They could be dealt with by a software change, no? Even if they couldn't be dealt with in software, what would a watchdog do? Reset your entire computer/phone/flight nav?

There is always the possibility of bugs in FPGAs. But bugs that require the use of a watchdog are a class of bugs that should be shaken out in debug unless the designers are not very good. If they can't find them in debug, they have to be pretty durn infrequent.

Do you have any links to descriptions of such bugs? I'm curious.

--

Rick C

- R
- Randy Yates
  
posted
7 years ago

Wed, May 11, 2016 3:36 AM

A WDT is also not a cure-all. Consider this scenario: a piece of hardware fails, changing the inputs to a piece of code in an unexpected way and causing the code to go into the weeds and the WDT to fire.

But after restart, the hardware is still failed and providing unexpected inputs, the same bug occurs again, the WDT fires again, and the processor restarts again. Ad infinitum.

So what did this fix? :)
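One common way to break the reset loop described above is to count consecutive watchdog resets and stop retrying after a few. The sketch below assumes the counter lives in memory that survives a watchdog reset and that the reset cause can be read at startup; both are device specific. `boot_decide`, `boot_mark_healthy` and the threshold of 3 are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WDT_RESETS 3u   /* assumed threshold before giving up */

typedef enum { BOOT_NORMAL, BOOT_SAFE_MODE } boot_mode_t;

/* Assumed to live in memory that survives a watchdog reset
 * (for example a no-init RAM section or a backup register). */
typedef struct {
    uint32_t consecutive_wdt_resets;
} reset_record_t;

/* Call once at startup. reset_was_watchdog would come from the
 * reset-cause register of the part, which is device specific. */
boot_mode_t boot_decide(reset_record_t *rec, bool reset_was_watchdog)
{
    assert(rec != NULL);
    if (!reset_was_watchdog) {
        rec->consecutive_wdt_resets = 0;   /* power-on or manual reset */
        return BOOT_NORMAL;
    }
    if (rec->consecutive_wdt_resets < MAX_WDT_RESETS) {
        rec->consecutive_wdt_resets++;
    }
    return (rec->consecutive_wdt_resets >= MAX_WDT_RESETS)
               ? BOOT_SAFE_MODE : BOOT_NORMAL;
}

/* Call after the application has run without trouble for a while. */
void boot_mark_healthy(reset_record_t *rec)
{
    assert(rec != NULL);
    rec->consecutive_wdt_resets = 0;
}
```

What "safe mode" means (outputs disabled, fault reported, waiting for service) is a system decision this sketch leaves open.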

--
Randy Yates, DSP/Embedded Firmware Developer 
Digital Signal Labs 
http://www.digitalsignallabs.com

- P
- Paul Rubin
  
posted
7 years ago

Wed, May 11, 2016 3:48 AM

I'd expect the WDT to be in the box that the ASIC is deployed in, not in the ASIC itself. That way the box resets if the ASIC locks up.

formatting link

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 4:06 AM

I looked at this list and of the first three only one was a lockup of any sort. The LCD Controller in the MPC823 can hang the CPU when the LCD is disabled while in aggressive mode (LAM). This has a very simple fix in software: before disabling the LCD, turn off the LAM.

Where is the need for a watchdog of any sort? If an ASIC locks up, won't the CPU be able to figure it out and reset whatever is appropriate?

--

Rick C

- P
- Paul Rubin
  
posted
7 years ago

Wed, May 11, 2016 4:10 AM

What if the CPU is part of the ASIC?

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 4:10 AM

Shouldn't a processor reset also reset the hardware?

--

Rick C

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 4:16 AM

Ok, what if?

--

Rick C

- T
- Tim Wescott
  
posted
7 years ago

Wed, May 11, 2016 6:17 AM

Randy's point, I think, is that if something is _broken_, a reset isn't going to un-break it.

A processor reset should also reset the hardware, in much the same way that cops should always be honest -- "should" in this case indicates a moral requirement, but not, in all companies, a reasonable expectation.

--
Tim Wescott 
Control systems, embedded software and circuit design 
I'm looking for work!  See my website if you're interested 
http://www.wescottdesign.com

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 6:21 AM

I'm not sure we are on the same conversation. We were discussing how to design systems, not what systems get designed.

--

Rick C

- A
- Allan Herriman
  
posted
7 years ago

Wed, May 11, 2016 11:16 AM

You seem to be making the assumption that having two flip flops in series will stop the first one from being replicated. I've seen it happen (albeit with a huge fanout on the second FF).

The only way to ensure that the first FF has not been replicated is to check, or to apply attributes that will tell the tools not to replicate it. Even then, the tools may have bugs (they certainly have in the past) and you still need to check to be sure.

The good news is that the check can be automated.

Do you have a citation for "known bad practice"? I wrote that list in (I think) 2001, and I haven't seen anything containing /all/ of those points published prior to that date.

Allan

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 4:23 PM

No, I'm not really a history buff. I don't know of any resource that lists problems to be avoided in digital logic design. Do you have any? I believe all of these issues are common knowledge.

You list 8 things you look for and the first 6 are clock domain crossing issues. So that is really one issue, good clock domain crossing design, and you have listed six ways that designers screw it up.

Using async inputs on FFs has always been discouraged by the FPGA companies, in no small part because it makes the design hard to verify and I believe they have said it makes it hard to port to ASICs (maybe because of being hard to verify).

I have heard forever that it is hard to gate clocks properly. It requires knowledge of gate delays and detailed timing, which is typically avoided in FPGA designs in favor of unit-delay simulation with static timing analysis.

--

Rick C

- R
- Rob Gaddi
  
posted
7 years ago

Wed, May 11, 2016 4:30 PM

Wait, you're proposing that an error on the data bus should raise some sort of Data Abort exception (perhaps at vector address 0x10) rather than render the system catatonic? Poppycock! Who would ever design an ARM-based CPU in such a way?

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
 
Email address domain is currently out of order.  See above to fix.

- R
- rickman
  
posted
7 years ago

Wed, May 11, 2016 4:31 PM

Since we have been discussing purely hardware issues and this is primarily a software group, I have started a post in comp.arch.fpga. If you think it is appropriate to continue this discussion here, maybe add comp.arch.fpga to the list of groups.

--

Rick C

- L
- lasselangwadtchristensen
  
posted
7 years ago

Wed, May 11, 2016 5:44 PM

yes I know, crazy talk ;)

-Lasse

- R
- Randy Yates
  
posted
7 years ago

Thu, May 12, 2016 4:44 PM

Exactly. Discriminate "broken hardware" from "hardware that's gotten into a bad state." The former won't benefit from a reset, the latter will.

--
Randy Yates, DSP/Embedded Firmware Developer 
Digital Signal Labs 
http://www.digitalsignallabs.com

- R
- rickman
  
posted
7 years ago

Thu, May 12, 2016 5:27 PM

Even broken hardware will benefit if the reset prevents actions that cause damage or disrupt a larger part of the system. It is frequently the case that a power up self test is performed before a system controls dangerous devices or tries to communicate with the larger system.
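A power-up self test of the kind described above could be sketched roughly as follows: outputs stay disabled unless every test passes. The `run_post` function, the result struct and the two stub tests are hypothetical stand-ins for illustration; real tests (RAM, sensors, actuator feedback) depend entirely on the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*self_test_fn)(void);

typedef struct {
    bool     outputs_enabled;  /* true only if every test passed */
    uint32_t failed_mask;      /* bit i set when test i failed */
} post_result_t;

/* Stand-ins for real hardware tests; purely illustrative. */
bool stub_pass(void) { return true; }
bool stub_fail(void) { return false; }

/* Run every test (up to 32) and enable outputs only on a clean pass.
 * A missing test pointer is treated as a failure. */
post_result_t run_post(const self_test_fn *tests, size_t count)
{
    assert(tests != NULL && count <= 32);
    post_result_t r = { false, 0 };
    for (size_t i = 0; i < count; i++) {
        if (tests[i] == NULL || !tests[i]()) {
            r.failed_mask |= (uint32_t)1 << i;
        }
    }
    r.outputs_enabled = (r.failed_mask == 0);
    return r;
}
```

All tests run even after a failure, so the mask reports everything that is wrong rather than only the first fault.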

--

Rick C

- H
- Hans-Bernhard Bröker
  
posted
7 years ago

Thu, May 12, 2016 6:34 PM

On 11.05.2016 at 08:17, Tim Wescott wrote:

Non-reset is not going to, either.

It's worth trying to distinguish between a run-off-into-the-wild system and a permanently broken one. So trigger a global reset, and see if that makes it work again. If it does, things are better than before. If it doesn't, they're no worse. As problem-handling approaches go, that's a pretty impressive result.

That's what a watchdog ultimately is good for: to distinguish between a SEU and a FUBAR situation.

There's really nothing terminally wrong with having a watchdog. The main risk I see is that it's easy to fall into the trap of thinking of the Dog not as (almost) the last line of defense, but as the first, or even the only one you need. I.e. it's tempting to think: "Nice, now I've got a watchdog, so the rest of the system can be designed without a care."
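One way to keep the Dog as a last line of defense rather than a formality is to kick it only when every monitored task has reported in since the last kick. The sketch below assumes three tasks and replaces the real hardware kick with a counter; `wdt_task_checkin` and `wdt_service` are hypothetical names, not any vendor's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TASKS      3u                        /* assumed task count */
#define ALL_TASKS_MASK ((1u << NUM_TASKS) - 1u)

typedef struct {
    uint32_t checked_in;  /* bit per task that has reported in */
    uint32_t kicks;       /* stand-in for writing the hardware kick register */
} wdt_monitor_t;

/* Each task calls this from its main loop when it is making progress. */
void wdt_task_checkin(wdt_monitor_t *m, unsigned task_id)
{
    assert(m != NULL && task_id < NUM_TASKS);
    m->checked_in |= 1u << task_id;
}

/* Called periodically by a supervisor. Returns true if the dog was kicked. */
bool wdt_service(wdt_monitor_t *m)
{
    assert(m != NULL);
    if (m->checked_in != ALL_TASKS_MASK) {
        return false;     /* some task is silent: let the timer expire */
    }
    m->checked_in = 0;    /* start a fresh check-in window */
    m->kicks++;
    return true;
}
```

Kicking from a timer interrupt regardless of task health would defeat the purpose, which is why the kick here is gated on all check-ins.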

OTOH, just because not all cops are honest, that doesn't make a world without any cops a better place.

It can hardly be considered a concept's fault if some people implement it incorrectly. And it may not even be incorrect to leave resetting the rest of the circuit to the micro. There might be some information for the micro to be had from inspecting the state of other parts of the hardware, as left behind by the hosed system state. It all depends.