"Am I still working okay?" asked the micro controller... - Page 4

Re: "Am I still working okay?" asked the micro controller...

: If it appears that the hardware is falling apart, how could you trust
: that it makes any sensible decisions ? Of course, if each output

You've changed the situation -- 'the hardware is falling apart' is hardly the
same as a single hardware failure.

Generally, an MCU on reset sets the outputs to a known value -- all 0 or all
1.  If you design fail-safe, then a hardware reset, in the face of some
failing hardware, will at least make sure everything is off.
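A minimal sketch of that idea, modeled on a host rather than a real part. The port variable, the "0 means off" wiring, and all function names are assumptions for illustration, not anything from a specific MCU:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical output port, modeled as a plain variable so the sketch
   runs on a host. On a real MCU this would be a memory-mapped register. */
static uint8_t output_port;

/* Assumption: the board is wired so that a 0 bit means "actuator off",
   which matches a reset state of all zeros. */
#define OUTPUTS_SAFE 0x00u

/* Models what the hardware does on any reset, watchdog reset included. */
void mcu_reset(void)
{
    output_port = OUTPUTS_SAFE;
}

/* Application code switches a single output bit on or off. */
void set_output(uint8_t bit, bool on)
{
    assert(bit < 8u);
    if (on)
        output_port |= (uint8_t)(1u << bit);
    else
        output_port &= (uint8_t)~(1u << bit);
}

uint8_t read_outputs(void)
{
    return output_port;
}
```

The point is in the wiring choice rather than the code: if "off" were the 1 state on a part that resets to all 0, the same reset would switch everything on.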

: In any really safety critical system, you should use double or triple
: (voting) redundant system, not watchdogs.

There is a WHOLE class of problems for which that is completely overkill.
Take an arcade game, or vending machine, or any machine that is going to take
physical punishment and need regular maintenance.

People are going to beat on a soda machine. Do you want to put
triple-redundancy memory on that, or just design it such that when it breaks
it just sits there resetting itself, so no one can get free soda?

Arcade games use watchdogs because there is a very small window where they
will make money. (Or used to, when it was dedicated hardware; now it's largely
PC-level hardware, but I digress.) Competition means getting the thing out
the door relatively quickly, and cheap enough to sell.

You want to get every bug, but if you wait too long, you'll be into the next
generation. The watchdog means that if there IS a bug, the machine will just
reset and keep earning money, instead of not earning money until an op gets
to it.

Fail-safe means that WHEN the thing fails, you try your best to make sure
it's in a 'safe' condition.

Chris Candreva  -- snipped-for-privacy@westnet.com -- (914) 967-7816
Re: "Am I still working okay?" asked the micro controller...

Which brings up Robin's original point about "dodgy code". Like it or
not, code defects will occasionally make their way into any non-trivial
project produced in the real world. In the face of difficult deadlines,
compromises will occasionally get made, people may screw up, QA may fall
down on the job.

Anyone who claims NEVER, EVER to have unwittingly released "dodgy code",
or to have been part of a team that did so, either:

1) is lying
2) has never had to code under pressure (time and cost constraints)
3) is lying -- to themselves
4) has not been coding for very long, or never on a project with much complexity

As another poster put it, watchdogs are one facet of an entire process
of due diligence, which should also encompass code reviews, sane coding
and design techniques, thorough QA, etc. In general, not implementing
watchdogs where it might make sense to do so is, frankly, foolish.
(Replies: cleanse my address of the Mark of the Beast!)

Teleoperate a roving mobile robot from the web:
Re: "Am I still working okay?" asked the micro controller...

Amen to #4.  I remember reading a story
about a company that, when hiring salesmen, would always ask the
prospective salesman about the major accounts that he had *lost*.  If he
had never lost a customer, he didn't get hired, because that meant that he
had never "played in the major leagues."

Part of being a geek is having a tendency to grossly overestimate the role
that personal ability plays in the success of one's work.  The reality is
that the highest levels of intelligence (or its correlates) that have been
observed in human beings are *far, far* away from the levels that would
guarantee perfection.  Any business process that relies on humans being
omniscient is, by definition, a failure.  There is *no* way to guarantee
that Mr. Murphy will never pay you a visit.  There are practices that will
make him feel distinctly unwelcome (and there are practices that amount to
buying him a first-class plane ticket and putting him up in the penthouse
suite of the most expensive hotel in town), but none of them will offer you
absolute certainty.

Re: "Am I still working okay?" asked the micro controller...

Besides: redundancy still isn't a good reason not to use watchdogs.

You may have 4 redundant devices, but what if they all fail at the same
time (which could happen under extreme, unplanned conditions)?
What if only one of them fails, but there is another unexpected failure
that prevents redundancy from functioning as expected (that is, you have
3 working devices, but the whole system fails to notice there is
something wrong with the 4th)? Well, you get the idea.

If fighter planes were perfect, pilots were perfect and conditions
were perfect, guaranteed 100% of the time, we wouldn't need to design
ejection seats. But we still design them, and once in a while, they
are actually useful and save a life. That's exactly the same thing.
Who cares whose fault it is when an unexpected event occurs? It's
useful to be able to retrieve detailed info of failures, but right
when it happens, nobody cares at this point: the system has to
recover in the quickest way possible. Period.

As a basic rule of thumb, I'd just say that watchdogs are good for
dealing with transient, temporary, unexpected failures. Redundancy
is used more with a long-term (or complete) failure of one or several
devices in mind. Of course, if designed in a sensible manner, they
can complement one another and even interact with one another. That's
when things get interesting.

Re: "Am I still working okay?" asked the micro controller...
On Fri, 28 May 2004 20:57:21 GMT, "Christopher X. Candreva"

But how does the WDT tell the difference between a transient failure
and the hardware falling apart ?

The self test routines after reset may detect some permanent failure
or it might not. The self test routine itself could go crazy due to
permanent hardware problems and the WDT kicks in again.

Now we have another interesting situation, which has not been
discussed so far. If there is a permanent hardware/software error and
the WDT triggers over and over again, this can also cause a lot of
damage (e.g. due to repeated large startup currents in some big
loads). Thus, the WDT should be allowed to kick in only a
predefined number of times and then disable the whole system until
manual intervention.
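
One way to sketch the "limited number of WDT resets, then lock out" idea in C. Everything here is an assumption for illustration: the limit of 3, the struct standing in for storage that survives a reset (EEPROM or battery-backed RAM on real hardware), the reset-cause enum, and all function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed limit; the post only says "a predefined number of times". */
#define MAX_WDT_RESETS 3u

/* Hypothetical stand-in for storage that survives a reset. */
typedef struct {
    uint8_t wdt_resets;  /* consecutive watchdog resets seen so far */
    bool    locked_out;  /* true: stay off until manual intervention */
} persistent_state_t;

typedef enum { RESET_POWER_ON, RESET_WATCHDOG } reset_cause_t;

/* Call once at boot. Returns true if the application may start its loads. */
bool boot_may_start(persistent_state_t *ps, reset_cause_t cause)
{
    assert(ps != NULL);

    if (ps->locked_out)
        return false;               /* only manual_clear() re-enables */

    if (cause == RESET_WATCHDOG) {
        if (ps->wdt_resets < UINT8_MAX)
            ps->wdt_resets++;
        if (ps->wdt_resets >= MAX_WDT_RESETS) {
            ps->locked_out = true;  /* too many in a row: give up */
            return false;
        }
    } else {
        ps->wdt_resets = 0;         /* a clean power-on starts a new count */
    }
    return true;
}

/* Call after a period of normal running, so that isolated transients
   spread over days do not add up to a lockout. */
void mark_healthy(persistent_state_t *ps)
{
    assert(ps != NULL);
    if (!ps->locked_out)
        ps->wdt_resets = 0;
}

/* Models the manual intervention (service switch, operator command). */
void manual_clear(persistent_state_t *ps)
{
    assert(ps != NULL);
    ps->wdt_resets = 0;
    ps->locked_out = false;
}
```

In this sketch a plain power cycle does not clear the lockout; whether it should is a design decision that depends on what counts as "manual intervention" for the machine in question.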


Re: "Am I still working okay?" asked the micro controller...

  I have also noticed a trend for some newer WDOG devices to have quite
long timeout options (mins to even hours). This can have merit, as
examples given in another thread show the problems with designing too
close to a WDOG's poorly defined timebase.
  Other WDOGs I've seen have a longer FIRST trigger window, to allow
more elasticity on POST/Boot modes, until the operational SW proper
starts working.

  It would be a good idea to check for annoyance/damage modes in the
case where a WDOG fires continually.
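
A small host-side model of a watchdog with a longer first window, as a sketch only. It does not describe any particular WDOG part; the struct, the function names and the millisecond timebase are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t first_timeout_ms; /* generous window covering POST/boot */
    uint32_t run_timeout_ms;   /* tighter window once the application runs */
    uint32_t elapsed_ms;       /* time since the last kick (or since reset) */
    bool     first_kick_seen;  /* false until the application kicks once */
} wdog_model_t;

void wdog_init(wdog_model_t *w, uint32_t first_ms, uint32_t run_ms)
{
    assert(run_ms > 0u && first_ms >= run_ms);
    w->first_timeout_ms = first_ms;
    w->run_timeout_ms = run_ms;
    w->elapsed_ms = 0u;
    w->first_kick_seen = false;
}

/* The first kick ends the long boot window; later kicks restart the short one. */
void wdog_kick(wdog_model_t *w)
{
    w->elapsed_ms = 0u;
    w->first_kick_seen = true;
}

/* Advance model time by dt_ms. Returns true if the watchdog would fire. */
bool wdog_tick(wdog_model_t *w, uint32_t dt_ms)
{
    uint32_t limit = w->first_kick_seen ? w->run_timeout_ms
                                        : w->first_timeout_ms;

    /* Saturate instead of wrapping around on very long intervals. */
    if (dt_ms > UINT32_MAX - w->elapsed_ms)
        w->elapsed_ms = UINT32_MAX;
    else
        w->elapsed_ms += dt_ms;

    return w->elapsed_ms >= limit;
}
```

With, say, a 5000 ms first window and a 100 ms running window, a 3 s self test passes without a reset, while a 110 ms stall after the first kick does not.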


Re: "Am I still working okay?" asked the micro controller...

Double or triple redundancy is not always the answer for Safety Critical
Systems. Sometimes just a different logical processor (or even a relay
based interlocking scheme) will provide the protection. Sometimes you
have to even consider fully mechanical interlocking as part of the
system. Whatever mitigation scheme you need to use should be based on
the risk assessment arising from a fully discovered HAZOP study.

Having watched over a lot of the responses, I am in the camp that is
aimed at getting the code as correct as you possibly can before you
begin to worry about turning the watchdog on. However, I also use a
separate Pulse Maintained Relay circuit that has to be kept energised
by a correctly responding system. This relay automatically signals
unhealthy if it de-energises due to a system failing to kick it
properly or by a failure in its own circuitry (see my Reading and
Writing the World articles on my website).
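
The software side of such a pulse-maintained relay can be modeled roughly as below. This is a sketch under stated assumptions, not the circuit from the articles mentioned above: it assumes the relay drive only responds to edges (so a pin stuck high or stuck low lets the relay drop out), that the relay holds for an assumed `hold_ms` after the last edge, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     pin_level;     /* level currently driven towards the relay circuit */
    uint32_t since_edge_ms; /* time since the last transition on the pin */
    uint32_t hold_ms;       /* assumed hold time of the relay without an edge */
} pulse_relay_t;

/* The relay starts de-energised: no edge has been produced yet. */
void relay_init(pulse_relay_t *r, uint32_t hold_ms)
{
    assert(hold_ms > 0u);
    r->pin_level = false;
    r->since_edge_ms = hold_ms;
    r->hold_ms = hold_ms;
}

/* Call once per control cycle of length dt_ms. The pin is toggled only
   when the caller's own health checks have passed for this cycle. */
void relay_service(pulse_relay_t *r, bool system_healthy, uint32_t dt_ms)
{
    if (system_healthy) {
        r->pin_level = !r->pin_level;  /* produce an edge */
        r->since_edge_ms = 0u;
    } else if (dt_ms >= r->hold_ms - r->since_edge_ms) {
        r->since_edge_ms = r->hold_ms; /* saturate: relay has dropped out */
    } else {
        r->since_edge_ms += dt_ms;
    }
}

/* True while the modeled relay would still be held in. */
bool relay_energised(const pulse_relay_t *r)
{
    return r->since_edge_ms < r->hold_ms;
}
```

The design choice worth noting is that a hung processor cannot keep the relay in by accident, whatever state the pin froze in, because only continued toggling counts.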

Re: "Am I still working okay?" asked the micro controller...
On Sat, 29 May 2004 00:40:22 +0100, "Paul E. Bennett"

The main purpose of redundant systems is to let the system operate
normally even if some controllers fail, not safety. I fully agree that
the last ditch security system should not rely on computer logic and
preferably not even on electricity.

