"Am I still working okay?" asked the microcontroller...

- SelfTest

Wed, May 19, 2004 1:05 PM

Say we have a microcontroller with limited memory, and say it performs some real-time control of something.

How do I write software for a microcontroller that, in addition to its normal operation (controlling something), checks from time to time whether it is still working okay? How can a program test itself? Can someone suggest an intelligent method (other than a watchdog)?

- Uddo Graaf

Wed, May 19, 2004 1:12 PM

That's called a 'watchdog' timer and is standard in most microcontrollers. It's basically a countdown timer which the program running on the microcontroller must reload x times per second to keep it from reaching zero. When it reaches zero, the microcontroller is reset. So when a program 'hangs', it stops reloading the watchdog countdown timer and the microcontroller is reset.
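A minimal host-side model of the countdown described above may make the mechanism concrete. It is only a sketch: a real watchdog is a hardware peripheral configured through vendor-specific registers, and the names `watchdog_t`, `wdt_kick()` and `wdt_tick()` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a countdown watchdog. Real MCUs do
 * this in hardware; the reload value sets the timeout. */
typedef struct {
    uint16_t reload;   /* value written back on every kick */
    uint16_t count;    /* current countdown value */
} watchdog_t;

void wdt_init(watchdog_t *w, uint16_t reload) {
    w->reload = reload;
    w->count = reload;
}

/* Called by the application while it is healthy ("kicking" the dog). */
void wdt_kick(watchdog_t *w) {
    w->count = w->reload;
}

/* Called on every timer tick; returns true once the counter has
 * reached zero, i.e. when the real hardware would reset the MCU. */
bool wdt_tick(watchdog_t *w) {
    if (w->count > 0) {
        w->count--;
    }
    return w->count == 0;
}
```

A program stuck in a loop that never reaches `wdt_kick()` lets the count run down to zero, which is exactly the 'hang' case described above.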

- Hans-Bernhard Broeker

Wed, May 19, 2004 1:43 PM

Ultimately, you can't. A CPU can no more meaningfully ask itself "Am I still working OK?" than you can ask yourself meaningfully "Have I fallen asleep yet?"

You can use watchdogs or internal consistency checking to some extent to determine general health of the software. Assertions can be inserted into the code, i.e. conditions that you know must come out true at all times, because otherwise something's fatally wrong.
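As a sketch of the assertion idea, assuming a bare-metal target where the standard `assert()` (which prints a message and calls `abort()`) is of little use: the hypothetical `CHECK` macro below stays active in release builds and routes failures to a safe-state hook. `enter_safe_state()`, `apply_duty()` and the 100 % limit are illustrative assumptions; on real hardware the hook would normally drive the outputs safe and not return.

```c
#include <stdint.h>

/* Records the line of the last failed check so it can be inspected. */
volatile uint32_t g_assert_fail_line = 0;

/* Hypothetical hook: a real controller would force outputs to a safe
 * state here and then wait for the watchdog or request a reset. */
static void enter_safe_state(uint32_t line) {
    g_assert_fail_line = line;
}

/* Unlike assert(), not compiled out by NDEBUG and needs no stdio. */
#define CHECK(cond) \
    do { if (!(cond)) enter_safe_state((uint32_t)__LINE__); } while (0)

#define MAX_DUTY_PERCENT 100u

/* Example: a duty-cycle setpoint must never exceed the actuator limit. */
uint8_t apply_duty(uint8_t duty_percent) {
    CHECK(duty_percent <= MAX_DUTY_PERCENT);
    /* Fall back to 0 % (assumed to be the safe value) on a violation. */
    return (duty_percent <= MAX_DUTY_PERCENT) ? duty_percent : (uint8_t)0;
}
```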

But there's often little or no point trying to detect hardware faults: if the hardware does break, you're quite probably toast anyway. You usually can't fix such a problem from the software side, and by The Usual Kind of Luck, the faults that do occur will be exactly those you can't test for, or at least didn't. And that's before you consider that such tests mean more code in total, and thus more opportunities for bugs.

Moral: if you don't know what to do with the answer, don't ask the question.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

- moocowmoo

Wed, May 19, 2004 1:46 PM

One way to check hardware is to run a second, identical processor and check that both behave the same. If you have three or more, you can perform voting so that the most popular answer is the one that gets used.
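Comparing two channels only tells you that they disagree; with three, a simple bitwise 2-out-of-3 vote can mask one faulty channel. A minimal sketch, assuming each processor reports its outputs as an 8-bit word (`vote3()` and `vote3_dissenter()` are hypothetical names). Note that a software voter like this is itself a single point of failure, which is exactly the objection raised later in the thread.

```c
#include <stdint.h>

/* Bitwise 2-out-of-3 majority: each output bit takes the value that
 * at least two of the three channels agree on. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c) {
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Returns the channel (0, 1 or 2) that differs from the majority, or -1
 * if all three agree. If several channels differ, the first is reported. */
int vote3_dissenter(uint8_t a, uint8_t b, uint8_t c) {
    uint8_t m = vote3(a, b, c);
    if (a != m) return 0;
    if (b != m) return 1;
    if (c != m) return 2;
    return -1;
}
```

Logging the dissenting channel lets the system flag a failing processor for repair while the vote keeps the output correct.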

Peter

--
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.659 / Virus Database: 423 - Release Date: 15/04/04

- <jiang>

Wed, May 19, 2004 1:48 PM


That's a cool idea!

- Unbeliever

Wed, May 19, 2004 1:58 PM

You are correct in identifying watchdog timers as one form of COP (computer operating properly test). Other things I've often used are:

1) Background checksum on code and constant/initializer areas of memory.
2) Flags and timers which indicate that critical routines and interrupts are running at about the right rate, usually checked in the watchdog timer interrupt.
3) Guardwords between stacks and other memory, and regular checks that these have not been compromised (again, often in the watchdog timer interrupt).
4) Feedback of critical output signals to ensure the hardware is working correctly (the hardware is much more likely to suffer random failures than the software).
5) A decent watchdog timer with an algorithmic stimulus and response (e.g. the watchdog processor supplies a pseudo-random number and the main processor replies with the next pseudo-random number in a sequence). This is much better than the primitive "kick within a certain time" style of watchdog, which can fail to detect runaway software that happens to loop through a kick.
6) One I haven't used, but have seen used on a critical PLC-style system, is an odd number of redundant processors (3 in this case) which vote on the state of an output (the output follows the state of the two agreeing inputs).
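Item 5 can be sketched with a 16-bit Galois LFSR as the shared pseudo-random sequence. The taps (0xB400, a maximal-length choice), the seed and all names are assumptions for illustration; a real monitor would also enforce a time window for the reply and drive the reset line itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (taps 0xB400, cycles through all
 * 65535 non-zero states). Both processors must use the same function. */
uint16_t lfsr_next(uint16_t x) {
    uint16_t lsb = x & 1u;
    x >>= 1;
    if (lsb) {
        x ^= 0xB400u;
    }
    return x;
}

/* Watchdog processor side (hypothetical). */
typedef struct {
    uint16_t state;        /* own LFSR state used to pick challenges; non-zero */
    uint16_t outstanding;  /* challenge currently awaiting a reply */
} wd_monitor_t;

uint16_t wd_issue_challenge(wd_monitor_t *m) {
    m->state = lfsr_next(m->state);
    m->outstanding = m->state;
    return m->outstanding;
}

/* True if the reply is the next value in the sequence; false means the
 * monitor should reset the main processor. */
bool wd_check_reply(const wd_monitor_t *m, uint16_t reply) {
    return reply == lfsr_next(m->outstanding);
}

/* Main processor side: must compute the answer, not just toggle a pin. */
uint16_t main_reply(uint16_t challenge) {
    return lfsr_next(challenge);
}
```

A runaway loop that merely toggles a kick line, or echoes back a stale value, produces the wrong reply and is caught.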

Of course, the next question you should ask is "What do I do when I detect a failure?" If it is a safety-critical system (e.g. the something you're controlling is a train, nuclear reactor or gas furnace rather than a Lego windmill), there's a whole other set of questions you should ask even before asking the first one.

hth, Alf

- moocowmoo

Wed, May 19, 2004 2:13 PM


It's not my idea; NASA uses a set of five computers for the Space Shuttle flight software.

Peter


- CBFalconer

Wed, May 19, 2004 2:54 PM

And not so simple. What takes the vote? What if it fails?

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!

- martin griffith

Wed, May 19, 2004 2:55 PM

have a look at

formatting link

There seems to be a lot involved in making even a simple little watchdog bulletproof.

martin

Three things are certain: Death, taxes and lost data. Guess which has occurred.

- Grant Edwards

Wed, May 19, 2004 3:11 PM

Without special hardware support, you can't.

It can't.

Redundant hardware running independently developed software, with majority voting of outputs.

--
Grant Edwards                   grante             Yow!  I HAVE a towel.
                                  at               
                               visi.com

- Mike Harrison

Wed, May 19, 2004 4:21 PM

You also need to consider the likelihood of a problem occurring in the first place - time spent designing the hardware to be reliable (e.g. EM/ESD immunity) is time much better spent than trying to second-guess what might go wrong and then hope you can do something useful about it.

For example, in the old days when systems typically comprised separate MCU/RAM/ROM chips, it made sense to test SRAM and checksum ROM, as these involved many interconnections and sockets which could fail. It makes much less sense to do it on a single-chip MCU, where the sort of failures that are plausible on a separate-chip system just don't happen.
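Where a ROM checksum is still wanted, the 'background' variant mentioned earlier in the thread can be done a few bytes at a time from the idle loop so it never threatens real-time deadlines. This sketch uses Fletcher-16 purely as an example algorithm (a CRC is a common alternative); the struct and function names are hypothetical, and the reference value to compare against would have to be computed at build time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State for an incremental Fletcher-16 over one memory region. */
typedef struct {
    const uint8_t *base;   /* start of region (e.g. code/const area) */
    size_t len;            /* region length in bytes */
    size_t pos;            /* next byte to process */
    uint16_t sum1, sum2;   /* running Fletcher sums, each mod 255 */
} bg_checksum_t;

void bg_checksum_start(bg_checksum_t *c, const uint8_t *base, size_t len) {
    c->base = base;
    c->len = len;
    c->pos = 0;
    c->sum1 = 0;
    c->sum2 = 0;
}

/* Processes up to 'chunk' bytes. Returns true once the whole region has
 * been covered; *result then holds the checksum (sum2 << 8 | sum1). */
bool bg_checksum_step(bg_checksum_t *c, size_t chunk, uint16_t *result) {
    for (size_t i = 0; i < chunk && c->pos < c->len; i++) {
        c->sum1 = (uint16_t)((c->sum1 + c->base[c->pos]) % 255u);
        c->sum2 = (uint16_t)((c->sum2 + c->sum1) % 255u);
        c->pos++;
    }
    if (c->pos < c->len) {
        return false;   /* more bytes left for a later call */
    }
    *result = (uint16_t)((c->sum2 << 8) | c->sum1);
    return true;
}
```

When a pass completes, the caller compares the result with the stored reference and calls `bg_checksum_start()` again to begin the next pass.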

- Joe Pfeiffer

Wed, May 19, 2004 4:28 PM

Most of the microcontrollers I've seen that are intended for applications like this have a built-in watchdog timer (I'm assuming when you say "other than watch dog" you mean "other than external watchdog"). In the case of the processor I know best, the HC11, it's called the COP (Computer Operating Properly) timer. The idea here is your software has to reset it occasionally; if the timer ever goes off, it's because your control program has gotten itself wedged.

--
Joseph J. Pfeiffer, Jr., Ph.D.       Phone -- (505) 646-1605
Department of Computer Science       FAX   -- (505) 646-1002
New Mexico State University          http://www.cs.nmsu.edu/~pfeiffer
Southwestern NM Regional Science and Engr Fair:  http://www.nmsu.edu/~scifair

- Grant Edwards

Wed, May 19, 2004 4:36 PM

And the probability that your program will still be able to run and do predictable things when there is a failure in the MCU is also small.

Multiply the probability of MCU failure by the probability your program will run with such a failure, and you get a number sufficiently close to zero yadda, yadda, ...

--
Grant Edwards                   grante             Yow!  Spreading peanut
                                  at               butter reminds me of
                               visi.com            opera!! I wonder why?

- Jim Hewitt

Wed, May 19, 2004 5:18 PM

Hans,

In this case, the very next question should be... Moral: if you don't know how the answer (i.e. the sensor/hardware) could fool you, don't ask the question.

- Paul E. Bennett

Wed, May 19, 2004 5:22 PM

...and adding to that list: an external pulse-maintained relay. This device has to be fed a change of polarity of its input signal at a regular rate in order to keep its relay energised. If any single component fails, the power supply goes off, or the input does not change, the relay simply de-energises and opens its contacts. The pulse drive for such a circuit should be driven from the internal sanity checks your software is performing (all checks OK, so change the state of the output). This device can elevate a single processor from SIL0 to SIL1 with very little effort.
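The software half of the pulse-maintained relay is small: toggle the drive line only in cycles where every sanity check passed. The relay model below is a crude assumption for trying the logic on a host (in a real circuit the coil simply loses its drive energy); `pulse_drive_update()`, `relay_sample()` and the `max_gap` threshold are hypothetical.

```c
#include <stdbool.h>

/* Software side: level of the (hypothetical) output pin feeding the relay. */
typedef struct {
    bool drive_level;
} pulse_drive_t;

/* Call once per control cycle. If any check fails, or the code stops
 * running, the line stops changing and the relay drops out. */
void pulse_drive_update(pulse_drive_t *p, bool all_checks_ok) {
    if (all_checks_ok) {
        p->drive_level = !p->drive_level;
    }
}

/* Rough model of the external circuit: the relay stays energised only
 * while the input changes at least every 'max_gap' samples. */
typedef struct {
    bool last_level;
    unsigned gap;        /* samples since the last level change */
    unsigned max_gap;
    bool energised;
} relay_model_t;

void relay_sample(relay_model_t *r, bool level) {
    r->gap = (level != r->last_level) ? 0u : r->gap + 1u;
    r->last_level = level;
    /* This model re-energises when pulses resume; a real safety design
     * may latch off and require a manual reset instead. */
    r->energised = (r->gap <= r->max_gap);
}
```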

Further, your microcontroller may be communicating with other systems in order to perform its control. Doing sanity checks on the communication link and checking its integrity in operation will give a good idea of sub-system health. You will need checksums and/or CRCs on all messages between systems.
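For the message checks, one widely used choice is CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, no reflection), whose standard check value over the ASCII string "123456789" is 0x29B1. The frame layout below (payload followed by the CRC, high byte first) and the function names are assumptions for illustration, not a protocol from the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: slow but tiny, which suits small MCUs. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len) {
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Appends the CRC after the payload; buf must have 2 spare bytes.
 * Returns the total frame length. */
size_t frame_append_crc(uint8_t *buf, size_t payload_len) {
    uint16_t crc = crc16_ccitt(buf, payload_len);
    buf[payload_len] = (uint8_t)(crc >> 8);
    buf[payload_len + 1] = (uint8_t)(crc & 0xFFu);
    return payload_len + 2;
}

/* Receiver side: recompute over the payload and compare. */
bool frame_crc_ok(const uint8_t *buf, size_t frame_len) {
    if (frame_len < 2) return false;
    uint16_t rx = (uint16_t)((buf[frame_len - 2] << 8) | buf[frame_len - 1]);
    return rx == crc16_ccitt(buf, frame_len - 2);
}
```

Counting CRC failures per link over time gives the kind of sub-system health indicator described above.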

Integral step-wise walking memory test and other walking sanity checks. This can detect potential failure points quite early on.
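One reading of a step-wise walking memory test is a non-destructive walking-ones/walking-zeros check of one byte per call, so the whole region is covered gradually during normal operation. This is a simplified sketch under that assumption: it catches stuck bits within a cell but not addressing or coupling faults between cells, which full march tests target. Names are hypothetical, and on a target each step must run with interrupts (and DMA) blocked and must not cover its own stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint8_t *base;  /* start of RAM region under test */
    size_t len;              /* region length in bytes */
    size_t next;             /* index of the byte to test next */
} ram_walk_t;

/* Tests one byte and advances; returns false on a pattern mismatch. */
bool ram_walk_step(ram_walk_t *t) {
    volatile uint8_t *cell = &t->base[t->next];
    uint8_t saved = *cell;               /* preserve application data */
    bool ok = true;

    for (uint8_t bit = 0; bit < 8; bit++) {
        uint8_t pattern = (uint8_t)(1u << bit);
        *cell = pattern;                 /* walking one */
        if (*cell != pattern) ok = false;
        *cell = (uint8_t)~pattern;       /* walking zero */
        if (*cell != (uint8_t)~pattern) ok = false;
    }
    *cell = saved;                       /* restore before moving on */

    t->next = (t->next + 1 < t->len) ? t->next + 1 : 0;
    return ok;
}
```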

There are a number of others.

You should do an evaluation of what the system safe state is going to be (off, bypassed or gracefully degrading). Then your design efforts should always lean the system toward achieving those safe states unless it is continuing to work properly.

--
********************************************************************
Paul E. Bennett ....................
Forth based HIDECS Consultancy .....
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

- Spehro Pefhany

Wed, May 19, 2004 5:40 PM

If you have access to a decent library, check out one of these standards before you choose which hardware to use:

ANSI/AAMI SW68, Medical Device Software - Software Life-Cycle Processes

ANSI UL1998, the Standard for Safety of Software in Programmable Systems

EN/IEC 60601-1-4, the Collateral Standard for Programmable Electrical Medical Systems

Best regards, Spehro Pefhany

--
"it's the network..."                          "The Journey is the reward"
speff@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com

- Ulf Samuelsson

Wed, May 19, 2004 6:08 PM

There are plenty of simple things you can consider if something is failing.

1) Turn yourself off; there is no need to draw power if you are battery operated.
2) Turn off any external devices which should not operate when the program is not active.
3) Reset yourself. If the fault is due to a temporary problem, this works quite well.
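A sketch of a fault handler that maps fault classes onto the three responses above. The classification (transient, unsafe outputs, persistent) and the `hw_*` hooks are assumptions; here the hooks only set flags so the logic can be exercised on a host.

```c
#include <stdbool.h>

typedef enum {
    FAULT_TRANSIENT,      /* e.g. a one-off glitch: a reset may clear it */
    FAULT_OUTPUT_UNSAFE,  /* program cannot vouch for external devices */
    FAULT_PERSISTENT      /* repeated failure: stop drawing power */
} fault_class_t;

/* Flags recording what the hypothetical hardware hooks were asked to do. */
bool g_outputs_off = false, g_reset_requested = false, g_powered_down = false;

static void hw_disable_external_outputs(void) { g_outputs_off = true; }
static void hw_request_reset(void)            { g_reset_requested = true; }
static void hw_power_down(void)               { g_powered_down = true; }

void handle_fault(fault_class_t f) {
    /* In every case, first switch off external devices so nothing keeps
     * running without the program supervising it. */
    hw_disable_external_outputs();

    switch (f) {
    case FAULT_TRANSIENT:
        hw_request_reset();
        break;
    case FAULT_OUTPUT_UNSAFE:
        break;            /* outputs already off; keep running checks */
    case FAULT_PERSISTENT:
        hw_power_down();
        break;
    }
}
```

Switching outputs off before anything else is a design choice here: it is the one response that is safe whichever class the fault turns out to be.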

--
Best Regards,
Ulf Samuelsson   ulf@a-t-m-e-l.com
This is a personal view which may or may not be
shared by my employer, Atmel Nordic AB

- Paul Keinanen

Wed, May 19, 2004 6:12 PM

Use mechanical or pneumatic voting, not electric.

For instance, if you want to control a bidirectional relay, use a core with three separate coils, each driven by a separate processor. If the currents in two coils flow in opposite directions, their magnetic fields cancel, and the third coil alone determines the resultant force.
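The cancellation can be written down directly. Assuming three identical coils of $N$ turns, each carrying a current of magnitude $I$ in a direction $s_i \in \{-1, +1\}$ chosen by processor $i$, the net magnetomotive force on the core is

$$
\mathcal{F} = N I \,(s_1 + s_2 + s_3), \qquad s_1 + s_2 + s_3 \in \{-3,\,-1,\,+1,\,+3\}.
$$

The sum is never zero with three coils, and its sign always equals the majority of the $s_i$: if one processor disagrees, two terms cancel and the remaining one sets the direction, at one third of the full magnetomotive force.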

Paul

- Spehro Pefhany

Wed, May 19, 2004 6:58 PM

On one machine I'm very familiar with there are three safety interlocks (one electrical (not electronic), one hydraulic, and one mechanical). Only when all 3 agree it is safe is the electronics allowed to do what it wants.

Best regards, Spehro Pefhany


- Guy Macon

Wed, May 19, 2004 7:14 PM

What do you plan to have the microcontroller do if the answer is "no"?

-- Guy Macon, Electronics Engineer & Project Manager.