Continuous EEPROM checksum microcontroller

- Ken Lee

Tue, Jul 6, 2004 11:09 PM

I was thinking the same thing. I presume some analysis was performed & the eeprom checksum is a mitigation of some fault or hazard. Otherwise what do you do when a fault is detected? Initialise to default values? Log an error & continue? Halt the device? Reset the device?

Also is a "checksum" adequate or should a CRC be calculated?
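To make the checksum-versus-CRC question concrete, here is a minimal sketch, not taken from the thread. It compares an 8-bit additive checksum with a bitwise CRC-16 (the CCITT-FALSE variant, polynomial 0x1021, initial value 0xFFFF, chosen here purely for illustration). The function names `sum8` and `crc16_ccitt` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Additive checksum: sum of all bytes, modulo 256. */
uint8_t sum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}

/* CRC-16/CCITT-FALSE, computed one bit at a time (slow but table-free). */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

An additive sum cannot see two bytes swapping places, because addition is order-independent; the CRC does catch that case. Whether the extra cycles are worth it depends on the failure modes the hazard analysis identifies.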

Performing continuous EEPROM checks could chew up considerable MIPS, so I wouldn't do it unless I had good reason. Some good design practices employ minimal resource usage -- I wouldn't put this one into that category.

Ken.

+====================================+ I hate junk email. Please direct any genuine email to: kenlee at hotpop.com

- Spehro Pefhany

Tue, Jul 6, 2004 11:34 PM

Attempt to salvage correct value, attempt to repair at an appropriate time in my case.

It could not, too.

Yes.

Depends on the other specifications. The minimum resource usage to meet ALL the specifications, right?

Best regards, Spehro Pefhany

--
"it's the network..."                          "The Journey is the reward"
speff@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com

- Ken Lee

Tue, Jul 6, 2004 11:36 PM

If it is at all possible to corrupt micro registers via external means (that is, not by a software-related fault) then I wouldn't be attempting to mitigate the fault, as God knows what other aspects of the micro would be questionable. Instead I would mitigate against the resultant hazard. For instance, if the micro is controlling the transmitted output power of the device (as for a CAT device) then I would be looking at putting a power limiter in the hardware.

Also if stored values need to be validated by some means then an appropriate software architecture should be adopted. Possibly you could adopt a scheme to CRC or checksum critically stored parameters before use, rather than performing continuous refreshes -- just a thought.
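The "validate before use" idea can be sketched as follows. This is an illustration under assumptions, not anything posted in the thread: the record layout (`param_block_t`), its field names, and the helper names are all hypothetical, and the record is assumed to have already been copied from EEPROM into RAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter record: payload fields followed by their CRC. */
typedef struct {
    uint16_t max_power;   /* illustrative field */
    uint16_t gain;        /* illustrative field */
    uint16_t crc;         /* CRC-16 over the fields above */
} param_block_t;

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Recompute and store the CRC after the payload has been written. */
void param_seal(param_block_t *p)
{
    assert(p != NULL);
    p->crc = crc16((const uint8_t *)p, offsetof(param_block_t, crc));
}

/* Check the record at the moment of use, not on a background schedule. */
bool param_is_valid(const param_block_t *p)
{
    assert(p != NULL);
    return p->crc == crc16((const uint8_t *)p, offsetof(param_block_t, crc));
}

/* Return max_power if the record verifies, otherwise the caller's safe value. */
uint16_t param_get_max_power(const param_block_t *p, uint16_t safe_default, bool *ok)
{
    bool valid = param_is_valid(p);
    if (ok != NULL) {
        *ok = valid;
    }
    return valid ? p->max_power : safe_default;
}
```

The cost is paid only when a parameter is actually consumed; what the caller does with a failed check (default, halt, log) is still a system-level decision.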

The fact that you are even discussing such issues would seem to me that a proper hazard analysis needs to be performed on the system to determine where you stand and to get a handle on the type of mitigations you need to put in place.

Ken.

- Ben Bradley

Tue, Jul 6, 2004 11:51 PM

It really depends on how often it needs to be performed. If you have a routine called in a main loop or tick interrupt that accumulates a location, increments a pointer (and if at end, does the compare and resets the pointer) and returns, it will use very few resources and be done in seconds or minutes. From reading followup posts, the OP seems to want to do this for the more general purpose of system integrity. I'm not sure this is the best (and certainly not the only) way to validate system integrity. As any recent c.a.e reader knows, we've had a few heated threads recently regarding devices intended to increase system integrity. Some things are a lot better than others in increasing, validating or ensuring system integrity.
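The one-location-per-call routine described at the start of this post might look roughly like the sketch below. It is an illustration only: the `scan_t` state structure and function names are hypothetical, a plain 16-bit sum stands in for whatever check is actually required, and the EEPROM is assumed to be readable through an ordinary pointer (on many parts it is not memory-mapped and needs a driver call instead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State carried between calls so each call handles exactly one byte. */
typedef struct {
    const uint8_t *base;     /* start of the region being checked */
    size_t         length;   /* number of bytes in the region */
    size_t         index;    /* next byte to accumulate */
    uint16_t       running;  /* sum accumulated so far in this pass */
    uint16_t       expected; /* reference sum stored at programming time */
    bool           fault;    /* latched when a pass ends with a mismatch */
    uint32_t       passes;   /* completed passes, for diagnostics */
} scan_t;

void scan_init(scan_t *s, const uint8_t *base, size_t length, uint16_t expected)
{
    assert(s != NULL && base != NULL && length > 0);
    s->base = base;
    s->length = length;
    s->index = 0;
    s->running = 0;
    s->expected = expected;
    s->fault = false;
    s->passes = 0;
}

/* Call once per main-loop iteration or timer tick. */
void scan_step(scan_t *s)
{
    assert(s != NULL);
    s->running = (uint16_t)(s->running + s->base[s->index]);
    s->index++;
    if (s->index == s->length) {      /* end of region: compare and restart */
        if (s->running != s->expected) {
            s->fault = true;
        }
        s->passes++;
        s->index = 0;
        s->running = 0;
    }
}
```

Each call costs a handful of instructions, so the load is spread thinly; the price is that a corruption may go unnoticed for up to two full passes.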

Quoting an earlier response in the thread:

It seems to me your question is "I am using a hc12 micro. What can I do to make it as reliable as reasonably possible?"

As for the "register refresh" he may be referring to output registers and data direction registers. Electrical spikes can cause the bits in these registers to change states, so it makes sense to refresh them regularly. But why does your customer ask this? It seems that as the designer, you should be making these decisions. Is your customer micromanaging you? Or is your customer a governmental agency and these are "required specs"?

But if a spike can change an I/O register, then it can change any other read/write bit on the silicon, such as a CPU register or RAM location. These can't be "refreshed" because you don't know what values they should be. The solution to this, in addition to the above CRCs and refreshing, is to reduce the effect of a spike so it's much less likely to affect the controller: change layout, add bypass caps, add diodes and resistors on I/Os and such.

So what happens if an I/O port is the wrong value? Could it do something dangerous? Could it lose valuable data? What should the thing do? Reset? Save the fault state in an EEPROM, light a special "ERROR" LED and stop? (I rather like that, as the fault state is "valuable data" to the designer.) All this is dictated by the application.

- Paul E. Bennett

Wed, Jul 7, 2004 7:18 PM

As I stated previously, only the risk assessment process for the system in its intended environments can guide the mitigating measures required in the system. It can take as long or longer to perform the risk analysis compared to the time required to do the system design.

The difference between general purpose controller systems and high integrity controller systems should, nowadays, be less reflected in the hardware manifestations. The difference is more likely to be manifested in the software techniques used and the overall system integration arrangements. Whatever errors you find in the system you still must have some plan of action for the controller to follow even if it is just raise an alarm and turn itself off. Naturally, when errors are encountered they should be logged somewhere so that the engineers/operations staff can identify what happened (in sequence hopefully).

Most likely that is what he is asking, but he is also trying to follow a document he has been given by his client (who may just have cherry-picked some supposedly useful phrases that have been applied to other systems without understanding the implications of what they are asking). The OP needs to explore this with his client from the basis of a good grounding in defensive programming techniques and their overall value to system integrity. I would require a significant amount of information from the OP to be able to assist him to that level from where he seems to be.

The general rule for registers that are relied upon for output is that their state should be refreshed at the successful completion of each control loop cycle based on the evidence from the real system state as represented by the inputs (INPUT-->PROCESS-->OUTPUT). If the path to the end of the control loop is not completed successfully then you may need to set a default output pattern that has been determined to be safe (I try and make mine all outputs off if I can - not always possible).
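A rough sketch of that INPUT-->PROCESS-->OUTPUT rule follows. Everything named here is an assumption for illustration: `io_regs_t` is a plain struct standing in for real port and data-direction registers, `SAFE_OUTPUT` and `DDR_CONFIG` are made-up values, and `example_process` is a toy control law with an invented plausibility limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for memory-mapped I/O registers (hypothetical layout). */
typedef struct {
    uint8_t port;  /* output latch */
    uint8_t ddr;   /* data direction register */
} io_regs_t;

#define SAFE_OUTPUT 0x00u  /* assumed safe pattern: all outputs off */
#define DDR_CONFIG  0xFFu  /* assumed configuration: all pins are outputs */

/* Toy control law: rejects implausible input, otherwise drives bit 0. */
bool example_process(uint8_t in, uint8_t *out)
{
    if (in > 200u) {
        return false;               /* input outside the plausible range */
    }
    *out = (in > 100u) ? 0x01u : 0x00u;
    return true;
}

/* One control cycle: the output registers are rewritten every time, from
 * freshly computed data on success or from the safe pattern on failure. */
bool control_cycle(io_regs_t *io, uint8_t inputs,
                   bool (*process)(uint8_t in, uint8_t *out))
{
    uint8_t out = SAFE_OUTPUT;
    bool ok;
    assert(io != NULL && process != NULL);
    ok = process(inputs, &out);
    io->ddr = DDR_CONFIG;               /* refresh direction bits too */
    io->port = ok ? out : SAFE_OUTPUT;  /* never leave a stale value */
    return ok;
}
```

Because both registers are written on every pass, a bit flipped by a spike survives for at most one cycle, and a failed cycle falls back to the predetermined pattern rather than to whatever happened to be latched.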

Systems that are easily susceptible to the effects of spikes from PSU, ESD or RFI need the hardware design looking at. Decent layout, adequate decoupling, filtering, shielding and sensible arrangement of ground circuits will all have beneficial effects on the system.

It just takes a little up-front thinking to eliminate many of the problems that can arise for a system design. No-one needs to rush to dash out a system design on receipt of the requirements.

--
********************************************************************
Paul E. Bennett ....................
Forth based HIDECS Consultancy .....
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

- Ken Lee

Wed, Jul 7, 2004 11:10 PM

What does "salvage" really mean? If a checksum is performed on a block of memory & it's wrong then one could take this to mean that 1 or any number of the contents are incorrect. If a checksum is performed on a single item and it's wrong, then all you can deduce from this is that the value is incorrect. Unless you keep a mirror image of the data you cannot "salvage" the correct value. Possibly the only things one could do are to fall back to some "safe" or default value, reset the device or place the device in an error state.
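For illustration, the mirror-image case mentioned above could be sketched like this. The record layout, the use of a bitwise complement as the per-item check, and all names are assumptions made for the example, not anything specified in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One stored item plus its own check word (here simply the complement). */
typedef struct {
    uint16_t value;
    uint16_t check;
} record_t;

typedef enum { REC_OK, REC_REPAIRED, REC_LOST } rec_status_t;

record_t record_make(uint16_t value)
{
    record_t r = { value, (uint16_t)~value };
    return r;
}

bool record_ok(const record_t *r)
{
    assert(r != NULL);
    return r->check == (uint16_t)~r->value;
}

/* Read through the primary copy, falling back to the mirror.
 * If both copies verify but disagree, the primary wins (an arbitrary choice). */
rec_status_t record_read(record_t *primary, record_t *mirror, uint16_t *out)
{
    bool p_ok = record_ok(primary);
    bool m_ok = record_ok(mirror);
    assert(out != NULL);

    if (p_ok) {
        *out = primary->value;
        if (!m_ok || mirror->value != primary->value) {
            *mirror = *primary;         /* repair the mirror */
            return REC_REPAIRED;
        }
        return REC_OK;
    }
    if (m_ok) {
        *out = mirror->value;
        *primary = *mirror;             /* salvage from the mirror */
        return REC_REPAIRED;
    }
    return REC_LOST;                    /* nothing trustworthy; *out untouched */
}
```

Only when both copies fail does the caller have to fall back to a default, a reset or an error state, which is the situation the paragraph above describes for unmirrored data.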

Sorry, but this requirement just has that particular pattern of a MIPS-chewer -- "repetitive calculation on a block of memory". Why can't this requirement be met on demand -- that is, the checksum is calculated when the data is read and used?

I've no argument on resource budgeting for input requirements, but this particular requirement looks like a mitigation for some fault or hazard. I've no problem with checksumming or CRCing stored data, but doing it on a continual basis seems to me to be ill-formulated. Admittedly I don't know what the application is, but I'm in the medical electronics game and have worked in the automotive industry, and am familiar with over-burdened mitigations.

Ken.

- Paul Keinanen

Thu, Jul 8, 2004 6:46 AM

If you are using some sort of multitasking kernel, which usually contains a null task (running an idle loop), which executes when no other task is runnable, simply put the memory check into this null task.

Paul

- Ken Lee

Sun, Jul 11, 2004 10:55 PM

I'm sure that there are a multitude of ways to implement this but that wasn't my point. I was questioning why this requirement had to be met continuously, as opposed to when it's needed, which is when the value is actually read.

Ken

- Jim McGinnis

Mon, Jul 12, 2004 2:12 AM

Suppose the device is the navigation system for an airplane, and you haven't taken off yet. Wouldn't you like to know whether you could rely on it once you're in the air?

--
Jim McGinnis

- Mike Fields

Tue, Jul 13, 2004 2:28 AM

In fact, that is one of the tasks that does run as part of the normal frame in our embedded avionics software for some of the boxes we build for airplanes !! Yes, we do want to know if things are corrupted.

-- Mike "mikey" Fields

outgoing email scanned by Norton Antivirus ... is that good ?

Linux users brag on how long their system stays up, Window users assume it's a temporary condition ...

- Ken Lee

Thu, Jul 15, 2004 2:36 AM

So you're implying that they don't do a system check of the navigation system on the ground before take-off. I'm assuming that the navigation system is turned on prior to the plane getting into the air.

Let me be perfectly clear -- I'm NOT saying that the check shouldn't be done. I'm objecting to the fact that it is done "continuously" rather than on-demand.

Ken.

- Frank Bemelman

Thu, Jul 15, 2004 7:24 AM

"Jim McGinnis" wrote in message news: snipped-for-privacy@4ax.com...

Oh, not again. Why not a controller for a nuclear power plant, an on-board engine controller for a satellite in orbit, a laser for eye corrections, a heart-lung machine in an OR, a launcher for H-bombs...

The bottom line is that folks that implement it, probably need it. Either because their hardware is crap or they are paranoid or it is a requirement from some retard, leaving 0.01% of applications where it really serves a purpose.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)

- Guy Macon

Thu, Jul 15, 2004 8:06 AM

I don't think so.

But why would they imagine that the EEPROM hardware is crap while having faith that the "do the checksum" hardware and the "store the checksum to compare with the EEPROM" hardware will work just fine?

But why would they be paranoid about the EEPROM hardware and not the EEPROM checking hardware?

Possibly, but if you develop embedded systems, it's your *job* to identify flaws in the requirements. If it isn't needed and the customer insists on having it you have to make the personal choice of whether to do it or find work elsewhere.

- Paul Keinanen

Thu, Jul 15, 2004 8:25 AM

With avionics, it should be noted that at 10 km in the polar cap areas, the radiation level is higher than elsewhere, so it is a good idea to do continuous checks if your device might move in those areas. I don't know if the South Atlantic Anomaly will increase the radiation levels at 10 km significantly, but at least in low orbit satellites, there is a significant increase in the radiation levels.

Paul

- Guy Macon

Thu, Jul 15, 2004 8:45 AM

Again, I have seen no evidence that the sum-checker is more reliable than the EEPROM being checked. Everyone seems to be accepting that it is based on nothing more than blind faith.

- Paul Keinanen

Thu, Jul 15, 2004 10:32 AM

Even if the checksum algorithm is executed directly out of the EEPROM (which is not always the case), the surface area occupied by the checker is very small compared to the total area of the EEPROM in most cases. If there is a single (hard or soft) error in the EEPROM, the likelihood is much greater that the error is in the other part of the EEPROM than in the checker code itself, due to the area ratio.

The worst case is that there are error(s) in the EEPROM, but a bit flip in the actual checker code will modify the program so that it will return EEPROM OK, but the likelihood is still smaller.

Then there is a different question: is it enough to be able to detect only a single bit error, or is detection of multiple errors needed? If the errors appear randomly, it might be sufficient to be able to detect only one or two errors if the checker is executed often enough. After detection of the first error, the device should be taken out of service.

However, if there is a great likelihood of multiple errors appearing at once, e.g. when a highly energetic particle hits the box and creates a shower of secondary particles hitting all over the EEPROM, you need an algorithm that is able to detect multiple errors at once.

Paul

- Frank Bemelman

Thu, Jul 15, 2004 12:54 PM

"Guy Macon" wrote in message news: snipped-for-privacy@corp.supernews.com...

They need it, but only from their point of view.

I have no idea.

Paranoid behaviour implies lack of understanding.

Oh, if someone insists on it, even after pointing out it isn't very useful, why not. But most requirement flaws I ignore without informing the person that wrote them (or simply copied them from another project).

- Spehro Pefhany

Thu, Jul 15, 2004 1:38 PM

It means I incorporate a lot (more than just a mirror) of redundancy on important information, because a non-recoverable failure is very expensive. Data integrity is more important than saving a few cents on memory. The other options you mention are open if they are acceptable in the application, of course. Some systems have no "safe" state (few I work on), or there is an unpleasant choice such as a) test limit controls, b) cause $10,000 damage (100% certain).
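The post does not say what form this redundancy takes. One common scheme that goes beyond a simple mirror is to keep three copies and take a bitwise majority vote, sketched below as an assumption rather than as the poster's actual method; `vote3` and `vote3_block` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise 2-of-3 majority: each output bit takes the value that at least
 * two of the three inputs agree on. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Vote across three copies of a block, rewrite all three with the result,
 * and return how many byte positions had any disagreement. */
size_t vote3_block(uint8_t *a, uint8_t *b, uint8_t *c, size_t len)
{
    size_t repaired = 0;
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t v = vote3(a[i], b[i], c[i]);
        if (a[i] != v || b[i] != v || c[i] != v) {
            repaired++;
        }
        a[i] = v;
        b[i] = v;
        c[i] = v;
    }
    return repaired;
}
```

Because the vote is per bit, it recovers the original even when different bits of the same byte are damaged in different copies; it fails only if the same bit is wrong in two copies. On a real EEPROM the unconditional rewrite would be replaced by writing only the copies that differ, to limit wear.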

Yes, although checking the entire memory every time a few very frequently accessed locations are used might be quite unnecessarily costly in bandwidth. But that's just implementation, and most anyone here can figure ways around that.

EEPROM is fundamentally different from RAM etc. because any errors that arise (because of issues beyond the control of the engineer) will persist indefinitely. They also wear out, and are fundamentally less reliable than RAM due to the high dielectric stresses involved in Fowler-Nordheim tunneling etc. (especially from re-writing).

On frequency: as you say above, I don't think it's necessary to do it more often than the information is accessed. ;-) More seriously, the upper time limit is typically set by how long it takes the system to get into trouble, worst-case. If it's a slow thermal system, then a minute or ten minutes with worst-case outputs may be no big deal.
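That time limit translates directly into a scan rate for a background checker. As a small worked sketch (the function name and the millisecond units are assumptions for the example), the number of bytes to examine per tick so that one full pass finishes within the deadline is:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must be checked on every tick so that a complete pass over
 * region_bytes finishes within deadline_ms, given one call every tick_ms. */
uint32_t bytes_per_tick(uint32_t region_bytes, uint32_t deadline_ms, uint32_t tick_ms)
{
    uint32_t ticks;
    assert(tick_ms > 0);
    ticks = deadline_ms / tick_ms;          /* whole ticks available */
    if (ticks == 0) {
        ticks = 1;                          /* deadline shorter than one tick */
    }
    return (region_bytes + ticks - 1) / ticks;   /* round up */
}
```

For example, 4096 bytes with a 1-second deadline and a 10 ms tick gives 100 ticks and therefore 41 bytes per tick, while a 1 KB region with a one-minute deadline needs only a single byte per tick.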

Best regards, Spehro Pefhany

- Guy Macon

Thu, Jul 15, 2004 3:32 PM

But the sum-checker is far more than just the place where the sum-checking code is stored. It is also the electronics that reads the code, the ALU that executes the code, the registers and RAM that the code uses, and so forth. One would have to estimate the error rate of all of those parts of the uC and compare them to the error rate of the EEPROM. Unless you do that, you have no idea whether your continuous sum-checker increases or decreases system reliability compared to an on-demand sum-checker or no sum-checker at all.

--
Guy Macon, Electronics Engineer & Project Manager for hire. 
Remember Doc Brown from the _Back to the Future_ movies? Do you 
have an "impossible" engineering project that only someone like 
Doc Brown can solve?  My resume is at http://www.guymacon.com/

- Guy Macon

Thu, Jul 15, 2004 3:45 PM

You would have a problem if I was your project manager. You would be instructed to evaluate the requirements and to agree or disagree with each requirement, and the definition of your code being "done" would include the independent testers verifying that your code complies with all requirements. On my projects requirement errors are serious, and they are to be corrected, not ignored.

Then again, I wouldn't be handing you requirements that are male bovine excrement. Before you got a requirement to implement a continuous checksum, you would have hard numbers for EEPROM errors, sense-amp errors, ALU errors, register errors, etc., both under normal conditions and under conditions of radiation, ESD, etc., and an analysis of the reliability impact, cost, etc. of the sum-checker.
