"Am I still working okay?" asked the micro controller...

I worked on a project substantially larger than a single microcontroller, but the idea we applied might be appropriate. We took a very hard line on this: the charter of the group was that no bugs were going to be delivered to the customers. In some of the functions that we wrote it was feasible to write one, or a small number of, "sanity checks": small tests that would evaluate whether the arguments being passed and/or the state variables had values that were appropriate at that moment.

If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn was the program counter at the point where the check failed, and then we halted the processor.
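A minimal sketch of such a check, in Python for brevity rather than firmware C. The original reported the program counter; here a hand-assigned check ID stands in for it, and `set_motor_speed`, its valid range, and the ID 1042 are made-up examples, not anything from the real product.

```python
class FatalError(Exception):
    """Raised when a sanity check fails; carries the ID of the failing check."""

    def __init__(self, check_id):
        super().__init__(f"Fatal Error {check_id:05d}")
        self.check_id = check_id


def sanity_check(condition, check_id):
    """Stop (here: raise) with a unique ID if the condition does not hold."""
    if not condition:
        raise FatalError(check_id)


def set_motor_speed(percent):
    # Hypothetical function: the argument must be a percentage in 0..100.
    sanity_check(0 <= percent <= 100, 1042)
    return percent * 255 // 100  # scale to an 8-bit duty cycle
```

On real hardware the failure path would display the number and halt the processor instead of raising an exception.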

This had a number of interesting and sometimes unexpected consequences. First, it quickly became the case that nobody wanted to be the one responsible for passing bad data to someone else's sanity check, and people became much more careful not to pass bad data. Secondly, it became very popular for people to carefully craft these checks to keep themselves from being blamed for a failure. Thirdly, in an embedded environment where everyone is in a panic to get all the work done, when the box just locks up and you know it is going to take hours to figure out what happened, it seems much more reasonable to hit the reset button and get on with your own work. But when "Fatal Error nnnnn" pops up and in seconds you can look at the build file and tell exactly where the error happened and which sanity check failed, you are much more likely to yell "FATAL ERROR NNNNN!" over the wall. Everybody on the team would cringe, hoping it wasn't them who had just called that function with bad data. And the person who had just observed this, plus the person who had inserted that sanity check, were both "the good guys." This soon led to us adding sanity checks whenever we found the box had crashed in some strange way and it took hours to realize we hadn't caught some bad case.

But this then led us to being able to test in a novel way. We wrote some code on a test harness that would hammer the box with random input. It would poke buttons and send in commands and present data, pretty much completely randomly, but at 100 commands/second! Within seconds of trying this a check blew up and we had another Fatal Error nnnnn. But that let us find and fix an oversight quickly. After a number of iterations we were to the point where this would run all weekend with zero failures.
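A rough sketch of that kind of random hammering, assuming a toy device model. The `Box` class, its buttons, the volume range, and the error number are all invented for illustration; the real harness drove actual hardware.

```python
import random


class FatalError(Exception):
    pass


class Box:
    """Hypothetical device model: a volume setting that must stay in 0..10."""

    def __init__(self, clamp=False):
        self.volume = 5
        self.clamp = clamp

    def press(self, button):
        if button == "up":
            self.volume += 1  # the bug: no upper limit unless clamp is set
            if self.clamp:
                self.volume = min(10, self.volume)
        elif button == "down":
            self.volume = max(0, self.volume - 1)
        # Any other button (e.g. "mute") leaves the volume alone.
        # Sanity check on the state after every command.
        if not 0 <= self.volume <= 10:
            raise FatalError("Fatal Error 00007")


def hammer(box, commands, seed):
    """Send random button presses; return the index of the first failure, or None."""
    rng = random.Random(seed)  # a fixed seed makes any failure reproducible
    for i in range(commands):
        try:
            box.press(rng.choice(["up", "down", "mute"]))
        except FatalError:
            return i
    return None
```

Calling `hammer(Box(), 10000, seed=1)` trips the check, while `hammer(Box(clamp=True), 10000, seed=1)` runs to completion and returns `None`.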

Then the decision was made: we were going to leave all of these checks in the code, live, when we shipped it. Another team working across the wall on a similar product was horrified: "You don't want your customers to know you have BUGS, DO YOU?!?!" Our reply was that they were going to know one way or the other. We shipped. And we waited. And we waited. All the checks had apparently made us find almost all the bugs before it went out the door.

One afternoon I did get a call from the marketing rep. He had a message from the marketing secretary. She had a message from the receptionist. She had a call from Hughes. They had been using this and it had popped up "Fatal Error nnnnn" and just locked up. They were so astonished that they went over to another building, got a camera, brought it back and took a picture. Then they called. And I got nnnnn from 1500 miles away. In 30 seconds I knew which check had failed, knew that it was a single variable, knew it must have been out of range and I could now hammer the box until I could figure out a way to find and fix that. I did.

After 18 months, with 2000 units of the product in the field being used by people pretty much full time, we had 3 Fatal Errors reported. I thought that was pretty much all of them that were ever seen, because the manual told customers that if they ever saw this they should call this phone number and tell us the number so we could fix it for them. I found and fixed those 3, and a number of others that I knew about but no customer would likely ever see.

The guys across the wall had ten times the support team and didn't even bother about bugs that didn't crash the box; if a bug did, they just cycled the power and went on. I even tried to get marketing to offer a campaign: I'd PAY customers for the first Fatal Error found. They squashed that; it would have made the other team look bad.

One other item that helped with the sanity checks: we filled all memory with 0xAAAA initially, and even when some memory was released. That oddball value was unlikely to be a reasonable value for most state variables, so uninitialized or stale data was more likely to trip a sanity check.
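A small sketch of why the fill pattern helps, modelling memory as a Python `bytearray`. The `mode` variable and its valid range of 0..3 are assumptions made up for the example.

```python
FILL = 0xAA  # byte form of the 0xAAAA fill pattern


def fresh_memory(size):
    """Return a block of 'memory' pre-filled with the pattern."""
    return bytearray([FILL]) * size


def read_u16(mem, offset):
    """Read a 16-bit little-endian value from the block."""
    return int.from_bytes(mem[offset:offset + 2], "little")


def check_mode(mode):
    # Hypothetical state variable: valid modes are 0..3.
    if not 0 <= mode <= 3:
        raise ValueError(f"mode out of range: 0x{mode:04X}")
```

Reading a variable that was never written yields 0xAAAA, which fails the range check immediately; zero-filled memory would have passed as a valid mode 0.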

Reply to
Don Taylor

I worked on an aerospace actuator that did it like this:

Three hydraulic actuators have three electronic control systems.

Each actuator monitors the other two and has two outputs that are at +5V if it thinks that actuator is good, -25V if it thinks that actuator is bad. The actual monitoring consists of challenges/responses through six dual-redundant actuator-to-actuator digital communication links and looking at extra pressure transducers on the monitored actuator that are read by the monitoring actuator. This identifies wrong behavior.

Each actuator has an input that connects to the outputs of the other actuators through two resistors that form a summing junction. If the sum is > -5V, it operates normally. If the sum is < -5V, it goes into "freewheeling mode", where it exerts no force and is easy to move. If one or both of the other actuators asserts -25V it freewheels.

Each of the two resistors mentioned above is actually a pair of resistors in series. The summing junction also has a pair of high-value resistors in series to local common to hold the input at 0V in the case of two open input signals.
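As a worked check of the voting arithmetic, here is a sketch that treats the junction as a simple resistor network. It takes the +5V / -25V output levels and the -5V threshold from the description above; the resistor values are assumptions picked only for illustration. With equal input resistors the node sits near the average of the driven inputs rather than their literal sum, which is still enough for a single -25V vote to pull it below -5V.

```python
GOOD_V = 5.0        # output level meaning "I think you are good"
BAD_V = -25.0       # output level meaning "I think you are bad"
THRESHOLD_V = -5.0  # below this the actuator freewheels

R_IN = 10_000.0           # assumed resistance of each input path, ohms
R_PULLDOWN = 1_000_000.0  # assumed high-value path to local common, ohms


def junction_voltage(votes):
    """Node voltage at the summing junction; None models an open input."""
    driven = [v for v in votes if v is not None]
    current = sum(v / R_IN for v in driven)
    conductance = len(driven) / R_IN + 1.0 / R_PULLDOWN
    return current / conductance


def freewheels(votes):
    return junction_voltage(votes) < THRESHOLD_V
```

With these values two good votes give about +4.98V, one bad vote gives about -9.95V, and two open inputs give 0V, so an actuator with both inputs disconnected keeps operating.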

One actuator can drag along two freewheeling actuators and control the aircraft.

Two actuators working together can drag along a third actuator that is trying as hard as it can to go the other way and control the aircraft.

Result: no single point of failure in the actuator electronics or voting system can result in loss of control of the aircraft.

--
Guy Macon, Electronics Engineer & Project Manager for hire. 
Remember Doc Brown from the _Back to the Future_ movies? Do you 
have an "impossible" engineering project that only someone like 
Doc Brown can solve?  My resume is at http://www.guymacon.com/
Reply to
Guy Macon

There are some applications where instead of having a watchdog reset the system when it goes astray you can simply reset the system again and again with a periodic reset. This can be the output of an oscillator or even the push of a button (a common way of designing toys).

Reply to
Guy Macon

[snip]

Don, may I have permission to put your story up on my web page?

Here is another technique which I use:

Start with "finished" and "debugged" code.

Have one programmer insert N bugs in another programmer's code, keeping careful records of what and where. The idea is to put in errors typical of the errors that the person writing the code normally makes.

Have the author of the code debug and fix all the bugs that he can find, stopping when he can't find any more. Keep a record of all bugs fixed. Don't tell him which are his or how many were inserted.

Let's say that we inserted 20 bugs, he found 10 of them, and he found 20 of his own bugs. That tells us that there are around 20 of his own bugs still undiscovered.
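The arithmetic behind that estimate, as a short sketch. It assumes the inserted bugs are about as hard to find as the programmer's own, so the fraction of inserted bugs found stands in for the fraction of all bugs found; the function name is made up for the example.

```python
def estimate_remaining_bugs(seeded, seeded_found, own_found):
    """Estimate how many of the author's own bugs are still undiscovered."""
    if seeded_found == 0:
        raise ValueError("no seeded bugs found; cannot estimate a detection rate")
    detection_rate = seeded_found / seeded           # 10 / 20 = 0.5
    estimated_total_own = own_found / detection_rate  # 20 / 0.5 = 40
    return estimated_total_own - own_found            # 40 - 20 = 20
```

For the numbers above, `estimate_remaining_bugs(20, 10, 20)` returns `20.0`.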

The psychology is interesting. The programmers write code with far fewer bugs and do a far better job of testing before saying that they are done. The programmer who finds all of the inserted bugs and no new bugs is a hero. (I reinforce that with bonuses and with specific mention in writing of this accomplishment during performance reviews.)

Reply to
Guy Macon

As SelfTest hasn't come back yet to give any more info or comments, I am looking at his "(other than watch dog)" and wondering if the question is really "Is my micro still running and going about its normal business?"

Usually the first thing any programmer learns is how to flash an LED. By adding an LED and resistor to an output pin, you can call "turn LED on" and "turn LED off" in sequence: say, 4 flashes on power-up to mean everything is OK.

Extending this further, you can test for certain I/O operations taking place correctly with a set number of flashes.
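A sketch of how such flash codes might be generated. It builds a schedule of on/off steps rather than driving a real pin, and the timings are assumptions; only the "4 flashes means power-up OK" code comes from the post.

```python
POWER_UP_OK = 4  # number of flashes meaning "power-up self test passed"


def flash_steps(count, on_ms=200, off_ms=200, pause_ms=1000):
    """Return (led_on, duration_ms) steps: `count` flashes, then a long pause."""
    if count < 1:
        raise ValueError("count must be at least 1")
    steps = []
    for _ in range(count):
        steps.append((True, on_ms))
        steps.append((False, off_ms))
    # Stretch the final off period so separate codes can be told apart.
    steps[-1] = (False, pause_ms)
    return steps
```

On a target board each step would become a pin write followed by a delay of the given duration.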

Many companies use 7 segment LEDs on their products, and such things as "system alive" can mean the 7 segment LED running around in a figure 8.

Power up, self test, and real time diagnostics can be performed from a simple single LED, right up to multiple computer systems to monitor the operations.

I believe that anybody that designs a useful lump of hardware should have at least one LED that can be pulsed under program control for this purpose.

Cheers Don...

--
Don McKenzie
E-Mail Contact Page:      http://www.e-dotcom.com/ecp.php?un=Dontronics

USB to RS232 Converter that works http://www.dontronics.com/usb_232.html
Don's Free Guide To Spam Reduction  http://www.e-dotcom.com/spam_exp.php
Reply to
Don McKenzie

On the Amiga computer one of the testing packages used 0xDEADBEEF to fill unused memory. ;-)

It also added guard band areas around allocated memory and then checked those after the free to be sure you didn't write outside of your allocated area.

That second idea would work best if you had an OS or at least memory management code.
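A minimal sketch of the guard-band idea, modelling an allocation as a `bytearray` with a marker on each side. The `GuardedBlock` class is invented for illustration and is not the Amiga tool's interface; reusing 0xDEADBEEF as the guard value is also just a choice made here.

```python
GUARD = b"\xDE\xAD\xBE\xEF"


class GuardedBlock:
    """An allocation with a guard band before and after the usable area."""

    def __init__(self, size):
        self.size = size
        self.raw = bytearray(GUARD + bytes(size) + GUARD)

    def write(self, offset, data):
        # Deliberately not bounds-checked against self.size, like a raw
        # pointer write in C. Offsets are relative to the usable area.
        start = len(GUARD) + offset
        for i, byte in enumerate(data):
            self.raw[start + i] = byte

    def free(self):
        # Check both guard bands at release time.
        if self.raw[:len(GUARD)] != GUARD or self.raw[-len(GUARD):] != GUARD:
            raise MemoryError("guard band overwritten")
```

Writing 4 bytes at offset 6 of an 8-byte block runs 2 bytes into the trailing guard, and the next `free()` reports it.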

--
    Gerald Bonnstetter
    Bonnsoft
    bonnsoft@antispamextrastuffnetins.net
Reply to
Gerald Bonnstetter

Feel free. I might even be able to do a better job describing this.

I've read about that and given that considerable thought. But I've never quite been able to convince myself just what would be appropriate to put into the code and where. If you have really found a successful way of doing that I'd be interested.

It is the culture that is put in place; I really believe that. Figure out what THE charter for the project is going to be and get everyone to buy into that. Put that up on the wall as a big sign. That charter, if it is done right, seems like it will answer a fair share of the questions raised during the development. Maybe that charter is "We do not care if this is all crap, we will deliver a dead opossum, as long as it gets done on time." But strongly resist the urge to put meaningless crap up for the charter: if you don't really, really, really mean "We will accept NO bugs" then don't claim that is your standard. If everyone knows what the charter is and that it WILL stand, no matter how bad the firestorm gets, it will do them good. But if even some of them know this is meaningless raving it will do no good at all.

I used to be an engineer for hire. Oh well.

Reply to
Don Taylor

I let the other engineers make that decision after seeing the programmer's past errors. And when I am wearing my manager hat I insist that any result other than perfect performance be kept confidential, even from me. This is a tool for reducing errors, not a tool for beating programmers over the head.

Reply to
Guy Macon

Let me guess, it was too heavy to fly? ;-)

--
Ben Jackson

http://www.ben.com/
Reply to
Ben Jackson

It's quite good as is, but if you want to rewrite it so much the better. Just post the improved version if you decide to improve it.

Reply to
Guy Macon

Judge for yourself:

formatting link

:)

Reply to
Guy Macon

I am sure that this can be an effective tool. But it seems less than optimal to introduce bugs in order to get the programmers to debug existing bugs. Maybe that is just me...

I have read that it can be useful to track the number of bugs found over time. This typically follows a curve of exponential decay and can help you predict the number of bugs left in a product. Certainly this is less intrusive and has less overhead.
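A sketch of that prediction, assuming the number of bugs found per week follows r0 * exp(-k * t). It fits a straight line to the logarithm of the weekly counts and sums the fitted curve over all future weeks. The function names are made up, and real data will be much noisier than a clean exponential.

```python
import math


def fit_decay(counts):
    """Least-squares fit of ln(count) = ln(r0) - k * t for t = 0, 1, 2, ..."""
    ts = list(range(len(counts)))
    ys = [math.log(c) for c in counts]  # requires every count to be > 0
    n = len(counts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    r0 = math.exp(y_mean - slope * t_mean)
    return r0, -slope


def predict_remaining(counts):
    """Sum the fitted weekly rate over every week after the last observed one."""
    r0, k = fit_decay(counts)
    if k <= 0:
        raise ValueError("discovery rate is not decaying")
    next_week = r0 * math.exp(-k * len(counts))
    return next_week / (1.0 - math.exp(-k))  # geometric series
```

For example, weekly counts of 64, 32, 16 and 8 halve each week, so the fit predicts 4 + 2 + 1 + ... which is about 8 bugs still to be found.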

One thing I don't support is the idea of engineers beating each other up over mistakes. I worked at one place where a mistake that was checked back into version control would result in the author receiving the "Arrow of Shame". I did not agree that the tip of version control is what you work with or ship and I certainly did not agree with whacking people over the head when they made a mistake. I stopped this tradition on my project.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX
Reply to
rickman

"SelfTest" wrote in news:40ab5b93$0$3034$ snipped-for-privacy@news.optusnet.com.au:

Going to the ridiculous extreme, we adapted the production test vectors for the ARM7 core and turned them into a modular program which could be fired off at intervals, perform a few instructions that exercised part of the core and affected some of the registers, then wrote those registers out into a hardware register that accumulated a CRC value. We actually set this up for a dual-processor system that was used in an Anti-lock Braking System. The nice feature of that braking system is that it could fall back to a "dumb" mode if either of the processors noticed that the other wasn't getting the same results.
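A toy sketch of the signature idea: run small deterministic test steps, fold each result into a running CRC, and compare the signatures from the two processors. The "test vector" arithmetic, the `fault_mask` fault model, and the function names are all invented here; the real package exercised the ARM7 core with adapted production test vectors and a hardware CRC register.

```python
import zlib


def run_test_step(step, crc, fault_mask=0):
    """Run one small computation and fold its result into the running CRC."""
    a = (step * 0x9E3779B1) & 0xFFFFFFFF
    b = ((a << 3) | (a >> 29)) & 0xFFFFFFFF  # 32-bit rotate left by 3
    result = (a ^ b) | fault_mask            # fault_mask models stuck-at-one bits
    return zlib.crc32(result.to_bytes(4, "little"), crc)


def signature(steps, fault_mask=0):
    """Accumulated CRC over a run of test steps."""
    crc = 0
    for s in range(steps):
        crc = run_test_step(s, crc, fault_mask)
    return crc


def braking_mode(sig_a, sig_b):
    """Fall back to 'dumb' mode if the two processors disagree."""
    return "normal" if sig_a == sig_b else "dumb"
```

Two healthy cores produce identical signatures; a core with a stuck bit produces a different one, and the comparison drops the system into the fallback mode.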

The test sets were fine-tuned by running them through a simulation of the core that allowed us to simulate every possible stuck at one, stuck at zero fault. The best we could come up with in the time and codespace allowed was something like a 92% fault detection rate (which equated to 96% of all 'discoverable' faults).

I believe this is now a licensable package available from ARM.

Peter.

Reply to
CodeSprite

This is a form of a technique known as "process pairs". The OP should do some searching using those keywords.

Reply to
Clifford Heath

Anyone who enables the Watchdog timer is advertising:-

1) My code is dodgy.
2) My hardware is EMC prone.
3) I have a new source of error; the watchdog itself.

Cheers Robin

Reply to
robin.pain

You will forgive me if I prefer that you stay out of aerospace...

Reply to
Guy Macon

For any non-trivial application, all three are true.

Reply to
Dave VanHorn

Robin should stick to Lego and not electronics:

Reply to
Captain Bly

What a pile of bullshit. There are more reasons for an embedded system to fail than you can even begin to imagine. Not using watchdogs (in a sensible way, of course) is totally irresponsible in my opinion.

Reply to
Guillaume

Jack Ganssle wrote a GREAT article on why you should use watchdogs, and why they are so tricky to use properly.

formatting link

--
- Alan Kilian  
Director of Bioinformatics, TimeLogic Corporation 763-449-7622
Reply to
Alan Kilian
