Continous eeprom checksum microcontroller - Page 2

Re: Continuous eeprom checksum microcontroller
"Guy Macon" <http://www.guymacon.com wrote in message

I am a programmer. 'Requirement flaws' are treated differently from
'Requirements' because they are flagged as flaws.

Rubbish. Hard numbers don't have any value in this context, and the
reliability/cost of the sum checker is the least interesting bit. If I had
hired you as a manager you would have a problem, wasting time on
instructing other staff to waste even more time. What matters is whether a
system failure is something you can afford or not. Assuming, for the sake
of this discussion, the software has to work with occasionally corrupted
eeprom data, you have to decide what you can do to avoid that, and at what
cost. Piles of analysis tend to be highly unreliable, and cost calculations
in such areas never make sense; better to trust a bit of common sense. BTW,
in whatever system, I would not be worried by the eeprom itself. I would
worry more about software making accidental writes and, most important, a
healthy hardware design with nice power up/down behaviour. Implementing a
continuous check may cure that just enough to let the systems pass the
testers, but if that is desirable... For the same reasons I don't like the
widespread practice of restoring important hardware registers on a regular
basis. Or watchdogs. I use both, but I don't like it a bit.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)

Re: Continuous eeprom checksum microcontroller

Which is it?  Do you flag them as flaws or ignore them without
informing the person that wrote them?  The former I like.  The
latter I consider to be grounds for termination on the third
offense.  

They do if you do them right.

Reliability is the most *important* bit, whether you find it to
be interesting or not.

I hope that you are referring to the analysis of whether the continuous
eeprom checksum makes the system more or less reliable.  I would not
instruct anyone to do that analysis - I would simply refuse to add
a continuous eeprom checksum to the requirements without it.  If I allowed
requirements to be added without any apparent benefit, *that* would be
wasting time.  
 
Once again you are pretending that you know that the hardware that does
the continuous EEPROM checksum is more reliable than the EEPROM. If it
happens to be a lot less reliable, you are making a system failure more
likely.

Not if you do them right.

They make sense if you do them right.

And you think that doing a continuous EEPROM checksum when you don't
know (because you don't like analysis) whether the EEPROM is orders
of magnitude more likely or less likely to have an error than the
system that does the checksumming makes common sense?  I will stick
with the "piles of analysis" as being more reliable than "common
sense."
 
We agree here.

Again you assume that continuous EEPROM checksum makes the system
more reliable rather than less reliable.  How do you know this?
What method did you use to arrive at this conclusion?



Re: Continuous eeprom checksum microcontroller
On Thu, 15 Jul 2004 11:14:55 -0700, Guy Macon
<http://www.guymacon.com wrote:

[...]

Interesting discussion.  Reminds me of my first job out of college,
part of a team modifying the software of the fuel gauge for a
commercial airliner. I thought I'd posted on this before, but google
isn't finding it for me...

We were working on a project known as "dispatch enhancement," which
was a complete misnomer.  We were actually tightening up some
diagnostics, adding some others, and adding the ability to send
diagnostic messages to the aircraft's Engine Indicating and Crew
Alerting System (EICAS).  In summary, we were adding the ability to
detect more problems and providing better error messages.  (Prior to
this enhancement, the only "error message" we provided to the crew was
blank displays).  Nothing we were doing would "enhance" the "dispatch"
of aircraft on their flights.

While we were working on this, an aircraft using the existing fuel
gauge ran out of fuel in mid-air.  Look up the "Gimli Glider" if you
want more information.

We suddenly came under much greater pressure to complete our
modifications ahead of schedule.  Which made no sense whatsoever:

   1) The fuel gauge on the subject aircraft was blank, indicating
      internal diagnostics had found a problem.  We were not going
      to prevent that from happening -- indeed, after our
      modifications, it would potentially occur more often, because
      we could find additional problems.

   2) The FAA regulations said that when this aircraft's fuel gauges
      are blank, the aircraft doesn't fly.  This aircraft was flying
      because it wasn't subject to the FAA (i.e., not an American
      flight).

   3) The flight regs to which the aircraft was subject allowed flight
      when the fuel in the tanks was measured manually.  This was
      done more than once, correctly each time.  The ground crew
      reported to the pilots the number of pounds of fuel in the
      tanks.  The pilots thought the reported value was in kg.

Back to the subject: for some reason someone got it in their head that
our changes would make the fuel gauge "more reliable," and therefore
we _had_ to complete our changes ASAP.  Probably because of the bogus
project name.  In one sense we were: our changes would make it less
likely the fuel gauge would cause the airplane to malfunction.  But by
their definition (aircraft flies more often), we would probably make
the fuel gauge *less* reliable.

And the question of what to do in a failure.  We would still blank the
displays.  We would also notify the crew of the nature of the problem
through EICAS.  No change there.  The only change that could have
prevented this incident was external to our group (and was made IIRC:
the subject flight regs were changed to prevent the aircraft from
flying with blank fuel gauges).

Regards,

                               -=Dave
--
Change is inevitable, progress is not.

Re: Continuous eeprom checksum microcontroller
"Guy Macon" <http://www.guymacon.com wrote in message

[snip]

If the checking system were less reliable, it would make the
entire system useless for the more obvious tasks it has to do.
In that case, I couldn't care less about eeproms flipping a bit.
Checking the eeprom isn't the main goal of the system. So a good
system is first priority, no matter what (flawed) specs lull me
into believing.

Continuous checking (with auto correcting) is sweeping the dust under
the carpet, out of sight. Something you should add very late in the
development, at the time you are wondering why you are bothering.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)







Re: Continuous eeprom checksum microcontroller

Way to go Guy!

Re: Continous eeprom checksum microcontroller

Does that mean you deliver projects that are not to the client's spec?

The early part of my projects usually involves rewriting the specification
to make it fully coherent. It takes quite a bit of negotiation but then can
end up costing the client less (once you rid the spec of the useless
dross). Remember that you have to engineer the customer as well as the
system.

Re: Continous eeprom checksum microcontroller

Yep, and I never have a problem with it either. I add or increase specs
as well.

I have no complaints. It's important to know what they want/expect. That's
not often found in the raw specifications. Exceptions are strict performance
specifications, which should be honoured or negotiated if need be.


--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)




Re: Continous eeprom checksum microcontroller

I would expect nothing less from a professional embedded systems engineer.

--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
Re: Continous eeprom checksum microcontroller
"Guy Macon" <http://www.guymacon.com wrote in message

You should expect more, if you want to see more than early stages alone.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)




Re: Continous eeprom checksum microcontroller
On Sun, 11 Jul 2004 22:12:58 -0400, Jim McGinnis


With avionics, it should be noted that at 10 km in the polar cap
areas, the radiation level is higher than elsewhere, so it is a good
idea to do continuous checks if your device might move in those areas.
I don't know if the South Atlantic Anomaly will increase the radiation
levels at 10 km significantly, but at least in low orbit satellites,
there is a significant increase in the radiation levels.

Paul


Re: Continous eeprom checksum microcontroller


Again, I have seen no evidence that the sum-checker is more reliable
than the EEPROM being checked. Everyone seems to be accepting that
it is based on nothing more than blind faith.



Re: Continous eeprom checksum microcontroller
On Thu, 15 Jul 2004 01:45:42 -0700, Guy Macon
<http://www.guymacon.com wrote:

Even if the checksum algorithm is executed directly out of the EEPROM
(which is not always the case), the surface area occupied by the
checker is very small compared to the total area of the EEPROM in most
cases. If there is a single (hard or soft) error in the EEPROM, the
likelihood is much greater that the error is in the other part of
the EEPROM than in the checker code itself, due to the area ratio.

The worst case is that there are errors in the EEPROM and a bit
flip in the actual checker code modifies the program so that it
returns EEPROM OK, but the likelihood of that is smaller still.

Then there is the different question, is it enough to be able to
detect only a single bit error or is detection of multiple errors
needed. If the errors appear randomly, it might be sufficient to be
able to detect only one or two errors if the checker is executed often
enough. After detection of the first error, the device should be taken
out of service.

However, if there is a great likelihood of multiple errors appearing
at once, e.g. when a highly energetic particle hits the box and creates a
shower of secondary particles hitting all over the EEPROM, you need an
algorithm that is able to detect multiple errors at once.
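
As an illustration of a check that catches more than a single flipped
bit, here is a minimal sketch of a 16-bit CRC over an EEPROM image. It is
not taken from any poster's system: the function names, the choice of the
CCITT polynomial 0x1021 with initial value 0xFFFF, and the idea that the
image is visible as a plain byte array are all assumptions made for the
example. A 16-bit CRC of this kind detects every single-bit error and
every error burst up to 16 bits long; whether that is enough for a
particle shower is exactly the question raised above.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16 with polynomial 0x1021, initial value 0xFFFF, no bit reflection
 * (often called CRC-16/CCITT-FALSE). Bitwise version: slow but tiny. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical helper: returns 1 if the image still matches the CRC that
 * was stored when the data was written, 0 otherwise. */
int eeprom_image_ok(const uint8_t *image, size_t len, uint16_t stored_crc)
{
    return crc16_ccitt(image, len) == stored_crc;
}
```

On a real part the image would be read through the EEPROM driver rather
than through a pointer, and the stored CRC would live in the EEPROM too.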

Paul


Re: Continuous eeprom checksum microcontroller

But the sum-checker is far more than lust the place where the
sum-checking code is stored.  It is also the electronics that
reads the code, the ALU that executes the code, the registers
and RAM that the code uses, and so forth.  One would have to
estimate the error rate of all of those parts of the uC and
compare them to the error rate of the EEPROM.  Unless you do
that, you have no idea whether your continuous sum-checker
increases or decreases system reliability compared to an
on-demand sum-checker or no sum-checker at all.

--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
Re: Continuous eeprom checksum microcontroller
...

Very true, but still subject to further refinement: consider
which parts (ALU, RAM) are * on the same chip * because failures
due to threshold changes are more liable to occur from one chip to
another than between components on the same silicon. - RM


Re: Continuous eeprom checksum microcontroller

Guy Macon <http://www.guymacon.com says...

&*$!@*! spellchecker! JUST the place...


Re: Continuous eeprom checksum microcontroller

Assuming that you have demonstrated a need to be certain of the validity of
data in the EEPROM (or any other area of fixed memory) then you should also
have a figure that indicates the maximum time between full checking reports
(remember, integrity is a time and probability of failure measure).

Also assuming that the system you are developing has, as mentioned in
another post, no safe state then you may need to know how much of the time
the individual parts of the system are available to you. Not only would
you run the checksum but you would also run other hardware integrity
checking on a continuous (piecemeal) basis, leaving markers as to the
success or otherwise such that a reporting programme can report the results
of the error analysis. Note that we are now in the realm of MUST NOT FAIL
systems.
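
A piecemeal check that leaves markers for a reporting routine could be
sketched roughly as below. This is only an assumed shape, not a
description of any real system: the struct, the function names and the
plain 16-bit additive sum are invented for the example (a CRC would be
the stronger choice). Each call handles one small chunk, so the maximum
time between full checks is roughly the number of chunks times the
calling period.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { CHECK_IN_PROGRESS, CHECK_PASSED, CHECK_FAILED } check_result_t;

/* State of one background check; all names here are hypothetical. */
typedef struct {
    const uint8_t *image;        /* memory region under test              */
    size_t length;               /* size of the region in bytes           */
    size_t chunk;                /* bytes summed per call                 */
    uint16_t expected;           /* sum stored when the data was written  */
    size_t pos;                  /* next byte to process                  */
    uint16_t running;            /* sum accumulated in the current pass   */
    check_result_t last_result;  /* marker for the reporting code         */
    uint32_t passes_completed;   /* marker: number of full passes done    */
} piecemeal_check_t;

void piecemeal_init(piecemeal_check_t *c, const uint8_t *image,
                    size_t length, size_t chunk, uint16_t expected)
{
    c->image = image;
    c->length = length;
    c->chunk = (chunk == 0) ? 1 : chunk;  /* guarantee forward progress */
    c->expected = expected;
    c->pos = 0;
    c->running = 0;
    c->last_result = CHECK_IN_PROGRESS;
    c->passes_completed = 0;
}

/* Process one chunk. Returns CHECK_IN_PROGRESS until a full pass over the
 * region has finished, then the verdict for that pass. */
check_result_t piecemeal_step(piecemeal_check_t *c)
{
    size_t end = c->pos + c->chunk;
    if (end > c->length)
        end = c->length;

    while (c->pos < end) {
        c->running = (uint16_t)(c->running + c->image[c->pos]);
        c->pos++;
    }

    if (c->pos < c->length)
        return CHECK_IN_PROGRESS;

    /* Pass complete: record the markers and restart from the beginning. */
    c->last_result = (c->running == c->expected) ? CHECK_PASSED : CHECK_FAILED;
    c->passes_completed++;
    c->pos = 0;
    c->running = 0;
    return c->last_result;
}
```

The reporting code would only ever read last_result and
passes_completed; a pass counter that stops advancing is itself a sign
that the checker is no longer being run.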

The question of what you do when a part of your system fails must be
answered fairly early on in the design phase. Every engineer should ask
himself that question as a matter of routine deliberation for new designs.

Fortunately for me, I need not care too much about losing one module of a
system so long as it indicates that it has failed (and why). I have several
techniques that I use to check that the system is really behaving itself
and ensure that outputs are disabled (a safe state in 99% of my systems).


As I often state, let the risk assessments guide you to what you need to
check and then work out the scheme that gives you the best chance of
meeting the integrity targets (not all parts of the system need to work to
the same level).

Re: Continuous eeprom checksum microcontroller
On Thu, 15 Jul 2004 08:32:55 -0700, Guy Macon
<http://www.guymacon.com wrote:

These circuits are exercised each time any program executes, not just
when the checker routine is executed (unless, of course, the
processor contains some dedicated hardware for CRC calculation :-). A
fault in the normal program execution hardware would most likely even
prevent the checker routine from being executed. However, a normal
program does not necessarily access _all_ the memory locations, so the
likelihood of the checker detecting an error is much larger than the
likelihood of it causing any new errors itself.

I have to admit that due to previous experience with low reliability
DRAM and EPROM, I still assume that memory is the weakest point,
but would like to be proven wrong (hopefully not due to unreliable CPU
hardware :-).

It should also be noted that in systems that may run for years without
reboot, a check executed only at startup is not very useful.

Then there is the question of what to do if the internal consistency
check fails. At least in a redundant system, the active unit detecting
an internal consistency error can perform a quick smooth handover to
the other unit and not wait until the active unit fails completely and
the other unit has to take control, usually not so smoothly.

With voting systems (with 3 or 5 identical units), especially when
driven actively in both directions, a unit detecting an internal error
can disable itself, rather than wait for the tug of war that follows,
when the failed unit would actively try to control in the wrong
direction.
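
To make the self-disabling idea concrete, here is a small sketch of a
voter that ignores units which have flagged themselves unhealthy. It is
an assumed software model only: the type and function names are invented,
outputs are compared for exact equality, and a real voter for actively
driven outputs would normally be done in hardware or with tolerance
bands.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One unit's contribution to the vote (hypothetical layout). */
typedef struct {
    int32_t command;  /* value the unit wants to drive                  */
    bool healthy;     /* false once the unit's self-check has failed    */
} unit_output_t;

/* Majority vote among the units that still report themselves healthy.
 * Returns true and writes *result when more than half of the healthy
 * units agree on one value; returns false when there is no majority. */
bool vote(const unit_output_t *units, size_t n, int32_t *result)
{
    size_t healthy = 0;

    for (size_t i = 0; i < n; i++)
        if (units[i].healthy)
            healthy++;

    for (size_t i = 0; i < n; i++) {
        if (!units[i].healthy)
            continue;

        size_t agree = 0;
        for (size_t j = 0; j < n; j++)
            if (units[j].healthy && units[j].command == units[i].command)
                agree++;

        if (agree * 2 > healthy) {
            *result = units[i].command;
            return true;
        }
    }
    return false;
}
```

A unit that has withdrawn itself no longer counts towards the total, so
it cannot pull the result in the wrong direction. If the remaining
healthy units disagree with each other the function reports no majority
and the caller has to decide what a safe reaction is.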

Paul
  

Re: Continuous eeprom checksum microcontroller


You think that an EEPROM checksum task exercises the same circuits
(registers, RAM, instruction decoders, EEPROM reading amplifiers...)
that a do nothing task exercises?



Re: Continuous eeprom checksum microcontroller
On Fri, 16 Jul 2004 00:54:44 -0700, Guy Macon
<http://www.guymacon.com wrote:


Any system using interrupts will use quite a lot of the CPU resources.
In an RTOS you may have to run the scheduler after each interrupt to
see if any high priority task became runnable due to the interrupt.

Thus, the job done by interrupts and scheduler is similar to that of
the EEPROM checker, even if the high priority tasks do nothing for a
long time.

I agree that the null task could be as trivial as a single
WaitForInterrupt instruction or a single branch to itself instruction,
which will exercise only a small part of the CPU, but this is not the
point.

Paul
    


Re: Continuous eeprom checksum microcontroller
  As for EPROMs/EEPROMs: with a leaking oxide that loses charge over
time one would expect not a flipped bit but a noisy bit. The
read-amplifier/transistor has no hysteresis to prevent that, because
in normal operation hysteresis is not needed. State would depend on
supply voltage and temperature too.
  Therefore memory could test ok on startup ( cold chip ).
And it could test ok again after a running checksum has detected an
error.
  "Repairing" a noisy EEPROM would be possible if one has segmented
it in small blocks each with a checksum. After an error in a block
has been detected it would have to be reread several times till one
has established the true data because the pattern is stable and
consistent with the checksum. After that one would rewrite the data to
the EEPROM. Obviously a Hamming-Code would be a more direct/faster
approach for repair.
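
The reread-until-stable idea could look roughly like the sketch below.
Everything in it is an assumption for illustration: the block size, the
8-bit additive checksum, the read callback and the simulated EEPROM that
stands in for a real driver. A block is accepted only after it has been
read identically, with a matching checksum, a given number of times in a
row; the caller would then rewrite it to the EEPROM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16u

/* Hypothetical driver hook: fills buf with len bytes of one block. */
typedef void (*block_read_fn)(void *ctx, uint8_t *buf, size_t len);

/* Simple 8-bit additive checksum over one block. */
static uint8_t block_sum(const uint8_t *buf, size_t len)
{
    uint8_t s = 0;
    for (size_t i = 0; i < len; i++)
        s = (uint8_t)(s + buf[i]);
    return s;
}

/* Reread a block until the same checksum-consistent pattern has been seen
 * stable_needed times in a row, or max_reads is exhausted. On success the
 * recovered data is left in out and true is returned. */
bool recover_block(block_read_fn read, void *ctx, uint8_t stored_sum,
                   unsigned stable_needed, unsigned max_reads,
                   uint8_t out[BLOCK_SIZE])
{
    uint8_t cur[BLOCK_SIZE];
    unsigned stable = 0;

    for (unsigned n = 0; n < max_reads; n++) {
        read(ctx, cur, BLOCK_SIZE);

        if (block_sum(cur, BLOCK_SIZE) != stored_sum) {
            stable = 0;                     /* noisy read, start over */
            continue;
        }
        if (stable > 0 && memcmp(cur, out, BLOCK_SIZE) == 0) {
            stable++;                       /* same good pattern again */
        } else {
            memcpy(out, cur, BLOCK_SIZE);   /* new candidate pattern */
            stable = 1;
        }
        if (stable >= stable_needed)
            return true;
    }
    return false;
}

/* Stand-in for a real EEPROM: returns the true data but flips one bit
 * during the first bad_reads reads. len must not exceed BLOCK_SIZE. */
typedef struct {
    uint8_t data[BLOCK_SIZE];
    unsigned bad_reads;
    unsigned reads_done;
} sim_eeprom_t;

void sim_read(void *ctx, uint8_t *buf, size_t len)
{
    sim_eeprom_t *e = (sim_eeprom_t *)ctx;
    memcpy(buf, e->data, len);
    if (e->reads_done < e->bad_reads)
        buf[0] ^= 0x04u;
    e->reads_done++;
}
```

An additive checksum is weak, so two compensating bit errors would slip
through; that is one more reason the Hamming code mentioned above is the
more direct approach.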

A good book is:
Sharma "Semiconductor Memories. Technology, Testing, Reliability"
IEEE Press 1997  
But it has no simple answers either.

Regards,  JRD