Continuous EEPROM checksum microcontroller

This is the kind of discussion that I expect from an embedded systems programmer. Instead of blindly assuming that the sum-checker is more reliable than the EEPROM being checked, Spehro is giving reasons why the assumption might be true.

This also is the kind of discussion that I expect from an embedded systems programmer. Spehro is addressing the fundamental issues of "what should be done if there is an error", and he doesn't fall into the common error of not analysing the "do nothing" option.

--
Guy Macon, Electronics Engineer & Project Manager for hire. 
Remember Doc Brown from the _Back to the Future_ movies? Do you 
have an "impossible" engineering project that only someone like 
Doc Brown can solve?  My resume is at http://www.guymacon.com/
Guy Macon

Guy Macon wrote: ...

Very true, but still subject to further refinement: consider which parts (ALU, RAM) are *on the same chip*, because failures due to threshold changes are more liable to occur from one chip to another than between components on the same silicon. - RM

Rick Merrill

&*$!@*! spellchecker! JUST the place...
Guy Macon

"Guy Macon" wrote in message news: snipped-for-privacy@corp.supernews.com...

I am a programmer. 'Requirement flaws' are treated differently from 'Requirements' because they are flagged as flaws.

Rubbish. Hard numbers don't have any value in this context, and reliability/cost of the sum checker is the least interesting bit. If I had hired you as a manager you would have a problem, wasting time on instructing other staff to waste even more time. What matters is whether a system failure is something you can afford or not. Assuming, for the sake of this discussion, the software has to work with occasionally corrupted EEPROM data, you have to decide what you can do to avoid that, and at what cost. Piles of analysis tend to be highly unreliable, cost calculations in such areas never make sense, better to trust a bit of common sense. BTW, in whatever system, I would not be worried by the EEPROM itself; I would worry more about software making accidental writes and, most important, a healthy hardware design with nice power up/down behaviour. Implementing a continuous check may cure that just enough to let the systems pass the testers, but if that is desirable... For the same reasons I don't like the widespread practice of restoring important hardware registers on a regular basis. Or watchdogs. I use both, but I don't like it a bit.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)
Frank Bemelman
[%X]

As the definition of system is quite wide I am just asking to clarify matters (although I think I know what you mean).

When you say that some systems have "no safe state" I am taking it that you are speaking of individual sub-system modules that are one of a redundant set so that failure of an individual sub-system module does not have an impact on the overall safety of the whole system.

I have not come across many of this type of system but then I have never worked in any of the aerospace industries (where I expect such considerations to exist in plenitude).

--
********************************************************************
Paul E. Bennett ....................
Forth based HIDECS Consultancy .....
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
Paul E. Bennett

Assuming that you have demonstrated a need to be certain of the validity of data in the EEPROM (or any other area of fixed memory) then you should also have a figure that indicates the maximum time between full checking reports (remember, integrity is a time and probability of failure measure).

Also assuming that the system you are developing has, as mentioned in another post, no safe state then you may need to know how much of the time the individual parts of the system are available to you. Not only would you run the checksum but you would also run other hardware integrity checking on a continuous (piecemeal) basis, leaving markers as to the success or otherwise such that a reporting programme can report the results of the error analysis. Note that we are now in the realm of MUST NOT FAIL systems.

The question of what you do when a part of your system fails must be answered fairly early on in the design phase. Every engineer should ask himself that question as a matter of routine deliberation for new designs.

Fortunately for me, I need not care too much about losing one module of a system so long as it indicates that it has failed (and why). I have several techniques that I use to check that the system is really behaving itself and ensure that outputs are disabled (a safe state in 99% of my systems).

As I often state, let the risk assessments guide you to what you need to check and then work out the scheme that gives you the best chance of meeting the integrity targets (not all parts of the system need to work to the same level).

Paul E. Bennett

Which is it? Do you flag them as flaws or ignore them without informing the person that wrote them? The former I like. The latter I consider to be grounds for termination on the third offense.

They do if you do them right.

Reliability is the most *important* bit, whether you find it to be interesting or not.

I hope that you are referring to the analysis of whether the continuous EEPROM checksum makes the system more or less reliable. I would not instruct anyone to do that analysis - I would simply refuse to add a continuous EEPROM checksum to the requirements without it. If I allowed requirements to be added without any apparent benefit, *that* would be wasting time.

Once again you are pretending that you know that the hardware that does the continuous EEPROM checksum is more reliable than the EEPROM. If it happens to be a lot less reliable, you are making a system failure more likely.

Not if you do them right.

They make sense if you do them right.

And you think that doing a continuous EEPROM checksum when you don't know (because you don't like analysis) whether the EEPROM is orders of magnitude more likely or less likely to have an error than the system that does the checksumming makes common sense? I will stick with the "piles of analysis" as being more reliable than "common sense."

We agree here.

Again you assume that continuous EEPROM checksum makes the system more reliable rather than less reliable. How do you know this? What method did you use to arrive at this conclusion?

Guy Macon

Does that mean you deliver projects that are not to the client's spec?

The early part of my projects usually involves rewriting the specification to make it fully coherent. It takes quite a bit of negotiation but then can end up costing the client less (once you rid the spec of the useless dross). Remember that you have to engineer the customer as well as the system.

Paul E. Bennett

Way to go Guy!

Paul E. Bennett
[...]

Interesting discussion. Reminds me of my first job out of college, part of a team modifying the software of the fuel gauge for a commercial airliner. I thought I'd posted on this before, but Google isn't finding it for me...

We were working on a project known as "dispatch enhancement," which was a complete misnomer. We were actually tightening up some diagnostics, adding some others, and adding the ability to send diagnostic messages to the aircraft's Engine Indicating and Crew Alerting System (EICAS). In summary, we were adding the ability to detect more problems and providing better error messages. (Prior to this enhancement, the only "error message" we provided to the crew was blank displays). Nothing we were doing would "enhance" the "dispatch" of aircraft on their flights.

While we were working on this, an aircraft using the existing fuel gauge ran out of fuel in mid-air. Look up the "Gimli Glider" if you want more information.

We suddenly came under much greater pressure to complete our modifications ahead of schedule. Which made no sense whatsoever:

1) The fuel gauge on the subject aircraft was blank, indicating internal diagnostics had found a problem. We were not going to prevent that from happening -- indeed, after our modifications, it would potentially occur more often, because we could find additional problems.

2) The FAA regulations said that when this aircraft's fuel gauges are blank, the aircraft doesn't fly. This aircraft was flying because it wasn't subject to the FAA (i.e., not an American flight).

3) The flight regs to which the aircraft was subject allowed flight when the fuel in the tanks was measured manually. This was done more than once, correctly each time. The ground crew reported to the pilots the number of pounds of fuel in the tanks. The pilots thought the reported value was in kg.

Back to the subject: for some reason someone got it in their head that our changes would make the fuel gauge "more reliable," and therefore we _had_ to complete our changes ASAP. Probably because of the bogus project name. In one sense we were: our changes would make it less likely the fuel gauge would cause the airplane to malfunction. But by their definition (aircraft flies more often), we would probably make the fuel gauge *less* reliable.

And the question of what to do in a failure. We would still blank the displays. We would also notify the crew of the nature of the problem through EICAS. No change there. The only change that could have prevented this incident was external to our group (and was made IIRC: the subject flight regs were changed to prevent the aircraft from flying with blank fuel gauges).

Regards,

-=Dave

--
Change is inevitable, progress is not.
Dave Hansen

These circuits are exercised each time any program executes, not just when the checker routine is executed (unless, of course, the processor contains some dedicated hardware for CRC calculation :-). A fault in the normal program execution hardware would most likely even prevent the checker routine from being executed. However, a normal program does not necessarily access _all_ the memory locations, so the likelihood of the checker detecting an error is much larger than the likelihood of the checker itself causing any new errors.

I have to admit that due to previous experience with low reliability DRAM and EPROM, I still assume that memory is the weakest point, but I would like to be proven wrong (hopefully not due to unreliable CPU hardware :-).

It should also be noted that in systems that may run for years without reboot, a check executed at startup is not very useful.

Then there is the question of what to do if the internal consistency check fails. At least in a redundant system, the active unit detecting an internal consistency error can perform a quick smooth handover to the other unit and not wait until the active unit fails completely and the other unit has to take control, usually not so smoothly.

With voting systems (with 3 or 5 identical units), especially when driven actively in both directions, a unit detecting an internal error can disable itself, rather than wait for the tug of war that follows, when the failed unit would actively try to control in the wrong direction.
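
As a rough illustration of the self-disable idea in a voting arrangement, here is a minimal sketch in C. It assumes each unit reports the command it wants to drive plus a flag saying whether its own consistency check still passes. The `unit_report_t` type and the `vote` function are hypothetical names made up for this example, not taken from any real system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical report from one redundant unit. */
typedef struct {
    int32_t command;  /* output the unit wants to drive */
    bool    healthy;  /* false once the unit's own self-check has failed */
} unit_report_t;

/*
 * Vote among n units, ignoring any unit that has disabled itself.
 * Returns true and stores the agreed command in *out when a strict
 * majority of the remaining healthy units agree; returns false when
 * there is no such majority (including when no unit is healthy).
 */
bool vote(const unit_report_t *units, size_t n, int32_t *out)
{
    size_t healthy = 0;
    for (size_t i = 0; i < n; i++) {
        if (units[i].healthy) {
            healthy++;
        }
    }

    for (size_t i = 0; i < n; i++) {
        if (!units[i].healthy) {
            continue;  /* a self-disabled unit gets no say */
        }
        size_t agree = 0;
        for (size_t j = 0; j < n; j++) {
            if (units[j].healthy && units[j].command == units[i].command) {
                agree++;
            }
        }
        if (2 * agree > healthy) {
            *out = units[i].command;
            return true;
        }
    }
    return false;
}
```

With three units where one has flagged itself unhealthy, the two remaining units decide the output between them, and the failed unit's (possibly opposite) command never enters the vote. Whether a single surviving unit should be allowed to drive the output on its own, as this sketch permits, is a design decision that depends on the application.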

Paul

Paul Keinanen

Yep, I never have a problem with it either. I add or increase specs as well.

I have no complaints. It's important to know what they want/expect. That's not often found in the raw specifications. Exceptions are strict performance specifications, which should be honoured or negotiated if need be.

Frank Bemelman

"Guy Macon" wrote in message news: snipped-for-privacy@corp.supernews.com...

[snip]

If the checking system were less reliable, it would make the entire system useless for the more obvious tasks it has to do. In that case, I couldn't care less about EEPROMs flipping a bit. Checking the EEPROM isn't the main goal of the system. So a good system is first priority, no matter what (flawed) specs lull me into believing.

Continuous checking (with auto correcting) is sweeping the dust under the carpet, out of sight. Something you should add very late in the development, at the time you are wondering why you are bothering.

Frank Bemelman

Assuming that a continuous check is done in the null task, which would otherwise just burn idle CPU cycles, do you have examples in which adding the continuous checking would have decreased the total _system_ reliability?

The only mechanism I can think of is that the checker routine instructions consume more power than idle instructions, so the CPU temperature will slightly increase and thus slightly decrease the MTBF of some components. In battery powered systems, the battery will fail slightly earlier.

On the other hand, a "continuous" specification does not have to mean that you burn 100 % of the (idle) cycles for the check routine; a scan could take a long time if you sleep for a millisecond after each kilobyte checked :-).

Put this kilobyte checker into a task just above the idle task priority and each time the system has nothing else to do, it first drops to the kilobyte checker to check the next memory segment and then falls down to the idle task. Thus, only 1-10 % of the idle cycles would be consumed and the temperature increase would be insignificant.

Of course you would have to consider the most likely EEPROM failure rate when deciding how long the scan can take.
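
A minimal sketch of such a kilobyte-at-a-time checker, in C, might look like the following. It assumes a simple 16-bit additive checksum and a memory-mapped image. The names `scan_state_t`, `scan_init`, `scan_step` and the `SLICE_BYTES` value are assumptions made for this illustration. The low-priority task would call `scan_step` once each time it gets to run and then yield.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLICE_BYTES 1024u  /* assumed slice size: one kilobyte per call */

/* Scan state kept between calls of the low-priority task. */
typedef struct {
    const uint8_t *base;     /* start of the memory image to check */
    size_t         length;   /* total number of bytes to check */
    uint16_t       expected; /* stored reference checksum */
    size_t         offset;   /* next byte to process */
    uint16_t       running;  /* checksum accumulated so far in this pass */
    bool           failed;   /* marker left for a reporting routine */
    uint32_t       passes;   /* number of completed full scans */
} scan_state_t;

void scan_init(scan_state_t *s, const uint8_t *base, size_t length,
               uint16_t expected)
{
    s->base = base;
    s->length = length;
    s->expected = expected;
    s->offset = 0;
    s->running = 0;
    s->failed = false;
    s->passes = 0;
}

/*
 * Process at most one slice, then return so the scheduler can drop
 * down to the idle task. Returns true when this call completed a
 * full pass over the memory.
 */
bool scan_step(scan_state_t *s)
{
    size_t end = s->offset + SLICE_BYTES;
    if (end > s->length) {
        end = s->length;
    }

    for (size_t i = s->offset; i < end; i++) {
        s->running = (uint16_t)(s->running + s->base[i]);
    }
    s->offset = end;

    if (s->offset < s->length) {
        return false;  /* more slices to go in this pass */
    }

    /* Full pass done: compare, leave a marker, start over. */
    if (s->running != s->expected) {
        s->failed = true;
    }
    s->passes++;
    s->offset = 0;
    s->running = 0;
    return true;
}
```

With N bytes of memory and one slice every P milliseconds, a full pass takes roughly (N / 1024) * P milliseconds, rounded up to a whole number of slices. That figure is what has to be compared against the longest acceptable time between full checks. The additive sum is only there to keep the sketch short; a CRC would catch more error patterns at the cost of more work per byte.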

Paul

Paul Keinanen

I would expect nothing less from a professional embedded systems engineer.

Guy Macon

Certainly.

Assume that the application is rarely run (making the null task the one that is orders of magnitude most likely to be running). Let's assume the main task runs once a second and the null task runs a million times a second while doing nothing and 100,000 times a second while checking the EEPROM.

Further assume that there is a register, ALU, or other part of the uC that the main task uses once, that the EEPROM check uses 10 times, and that the do nothing task never uses.

Assume that this register gives a wrong answer one time out of a million, and that the EEPROM is far less likely than this to have an error.

With continuous EEPROM checksum: one error per second on average.

Without continuous EEPROM checksum: one error per million seconds on average.

(Paul goes on to discuss running the sum checker less often, which would, of course, reduce the million to one ratio above. The million to one ratio was just a made-up example, of course; it could be 1:1 or 10:1 or 1:10 or any of a number of different ratios. In real life you could wait years for the first failure of the EEPROM or of the EEPROM checker.)
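
The arithmetic behind those two figures can be written out as a small C helper. This is only a sketch of the made-up example above. It assumes errors are independent and simply multiplies the number of times the suspect circuit is used per second by the per-use error probability. The `task_load_t` type and the `errors_per_second` function are hypothetical names.

```c
#include <stddef.h>

/* How often a task runs and how many times each run uses the suspect circuit. */
typedef struct {
    double runs_per_second;
    double uses_per_run;
} task_load_t;

/*
 * Expected number of wrong answers per second from one suspect
 * circuit, summed over all tasks that exercise it.
 */
double errors_per_second(const task_load_t *tasks, size_t n,
                         double p_error_per_use)
{
    double uses_per_second = 0.0;
    for (size_t i = 0; i < n; i++) {
        uses_per_second += tasks[i].runs_per_second * tasks[i].uses_per_run;
    }
    return uses_per_second * p_error_per_use;
}
```

With the numbers above, the main task contributes 1 use per second and the checking null task contributes 100,000 * 10 = 1,000,000 uses per second, so at one error per million uses the result is about one error per second. Without the checker, only the main task's single use per second remains, giving about one error per million seconds.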

Guy Macon

You think that an EEPROM checksum task exercises the same circuits (registers, RAM, instruction decoders, EEPROM reading amplifiers...) that a do nothing task exercises?

Guy Macon

"Guy Macon" wrote in message news: snipped-for-privacy@corp.supernews.com...

You should expect more, if you want to see more than early stages alone.

Frank Bemelman

As for EPROMs/EEPROMs: with a leaking oxide that loses charge over time one would expect not a flipped bit but a noisy bit. The read amplifier/transistor has no hysteresis to prevent that, because in normal operation hysteresis is not needed. The state would depend on supply voltage and temperature too. Therefore memory could test OK on startup (cold chip). And it could test OK again after a running checksum has detected an error. "Repairing" a noisy EEPROM would be possible if one has segmented it into small blocks, each with a checksum. After an error in a block has been detected, the block would have to be reread several times until one has established the true data, because the pattern is stable and consistent with the checksum. After that one would rewrite the data to the EEPROM. Obviously a Hamming code would be a more direct/faster approach for repair.
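
The reread step could be sketched in C roughly as follows. This is an illustration under assumptions, not a tested repair routine. It takes several successive reads of the same small block (already collected into one flat buffer) and looks for a pattern that both matches the stored checksum and shows up in enough of the reads. The block size, the 8-bit additive checksum, and the names `block_sum` and `find_stable_read` are all made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 16u  /* assumed small block size */

/* 8-bit additive checksum over one block (assumed checksum scheme). */
static uint8_t block_sum(const uint8_t *data)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < BLOCK_BYTES; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}

/*
 * reads holds n_reads successive reads of the same block, stored back
 * to back (n_reads * BLOCK_BYTES bytes). Returns the index of the first
 * read whose checksum matches stored_sum and whose exact pattern occurs
 * in at least min_agree of the reads, or -1 if there is none.
 */
int find_stable_read(const uint8_t *reads, size_t n_reads,
                     uint8_t stored_sum, size_t min_agree)
{
    for (size_t i = 0; i < n_reads; i++) {
        const uint8_t *candidate = &reads[i * BLOCK_BYTES];
        if (block_sum(candidate) != stored_sum) {
            continue;  /* this read was disturbed by the noisy bit */
        }
        size_t agree = 0;
        for (size_t j = 0; j < n_reads; j++) {
            if (memcmp(candidate, &reads[j * BLOCK_BYTES], BLOCK_BYTES) == 0) {
                agree++;
            }
        }
        if (agree >= min_agree) {
            return (int)i;
        }
    }
    return -1;
}
```

If a stable read is found, its data is what would be written back to the EEPROM. If none is found, the block cannot be trusted and the failure has to be handled some other way.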

A good book is: Sharma, "Semiconductor Memories. Technology, Testing, Reliability", IEEE Press, 1997. But it has no simple answers either.

Regards, JRD

Rafael Deliano

I think the OP wouldn't agree on this definition (at least, I do not). One example of a system with a safe state is a railway interlocking, where the safe state is "all signals red, all points motionless": if a catastrophic error is diagnosed by a properly designed interlocking, you can always go to that state, where minimum harm is guaranteed for trains and passengers.

On the contrary, an avionic system has no evident safe state. Just imagine stopping the jets in case of panic...

-- Ignacio G.T.

Ignacio G.T.
