micro self-check of checksum

Hello,
I am programming a PIC in assembly and am trying to think of a way to
self-verify the integrity of the code shortly after power-up.  I would like
to store the checksum as a literal in flash, or store it in the EEPROM.  What
would a subroutine look like that can calculate the checksum of its own hex
code, including the subroutine itself that is calculating the checksum of
its own hex code... whew... almost got myself into a paradox there!

Hope you know what I mean. Is there a way for a PIC to read what execution
code values are being held in its own flash?

Thomas



Re: micro self-check of checksum


It depends on the model.  Newer ones have a way to read code space.

Thad


Re: micro self-check of checksum
On Fri, 23 Sep 2005 21:26:04 GMT, "Thomas Magma"


I am not familiar with PICs, and somebody already pointed out that at
least some PICs can't read their code space as data.  But the general
idea of performing a checksum on a binary image is a pretty simple
one, especially if the image is in a single contiguous chunk of
memory.

The easiest case of all is if the last few bytes or words of that
memory are not "special", that is they don't need to hold the power up
start address or an interrupt vector.

First you pick a checksum algorithm, which could be a simple 8 or 16
bit sum, a CRC of some size, or something like a Fletcher checksum.

For simplicity, let's assume you are going to do a simple 8-bit
checksum, ignoring the overflow out of the 8 bits.  Here is a simple C
function that would perform the sum:

#include <stddef.h>  /* for size_t */

unsigned char checksum(const void *start, size_t count)
{
   const unsigned char *uc = start;
   unsigned char sum = 0;

   while (count--)
   {
      sum += *uc++;
   }
   return sum;
}

What you do is calculate the sum of all but the last byte of the image
before you program the flash.  Then you put the 2's complement of that
value into the last byte of the image, and program the flash from the
image.

At run time you call the function with the start address of the flash
and the size of the flash, including the last byte.  If the flash is
good, the value returned will be 0.

If the validation function you choose is not easily forced to 0 by
appending one value, instead calculate the checksum of the image minus
the last byte or word (however big the checksum value is), then store
the value itself into the last byte/word.  Then at run time you call
the function with the start of the flash and the size of the flash
minus the last byte/word holding the sum.  Compare the value returned
to the contents of that last byte/word, and the flash image is good if
they match.

--
Jack Klein
Home: http://JK-Technology.Com
Re: micro self-check of checksum

Now, assuming you find an error... what do you do?  You have just proven the
code is not trustworthy, so you cannot rely on the code to make the system
safe in any way, or in fact to do anything predictably.  So is the test
worthwhile?  (Playing devil's advocate.)

Regards,
Richard.



http://www.FreeRTOS.org




Re: micro self-check of checksum


The principle is to put all devices in a safe state.  This can be
accomplished by forcing a reset, and then not executing the normal code
but instead disabling all hardware and busy-looping.
This is based on the assumption that there exists a safe state for
the system upon failure of both hardware and software, for instance
in the case of a power failure.


Re: micro self-check of checksum


So you are relying on code you know is corrupt to put all devices into a
safe state?  In fact you don't even know the code is corrupt, since if it is
corrupt you don't know anything for sure.  You cannot even rely on it
to force a reset - maybe it is the decision to reset that has the
corruption.

As Spehro Pefhany says, it's a statistics game.  You can only improve the
probability of safe behaviour.

Regards,
Richard.


http://www.FreeRTOS.org




Re: micro self-check of checksum


A probability of failure of 10E-9 is required in the most severe
cases. This is not a zero probability of failure.

You certainly can't leave the code memory unchecked.  You can, however,
imagine measures that improve the probability of correct
behaviour in the case of code corruption.  For instance you can have
multiple sections of code, each with its own checksum.

You can also duplicate the critical portions of code, and you could
use a different memory device for each copy.

In some systems you must output a square-wave watchdog signal.  You can
have one portion of code that writes a 1 and another that writes
a 0.  If, following the detection of an error, you busy-loop, the
correct waveform will not be output.  This will trigger a reset by
an external watchdog.  Of course you would then hope that the reset
code is correct.

Every safety device has to perform memory code checking. For high
levels of safety several processors are used, each with its own
copy of code. Even in these cases each processor performs a code
check and stops if an error is found. The others detect this state
and put the device in a safe state.

Regards


Un-realistic Integrity Level (was Re: micro self-check of checksum)


Is that really a realistic expectation of system integrity?  In the
increasingly evidence-based high-integrity development world there is a
movement towards being able to quote the confidence with which the
integrity level figure is quoted (the ACARP** principle - see
"Dependability evaluation: a question of confidence" by Bev Littlewood,
Safety Systems newsletter published by the Safety-Critical Systems Club).

In Bev's article he expresses the opinion that a claim of 10E-4 failures
per demand is difficult to support with a high degree of confidence.
Therefore, the confidence level for 10E-9 failures per demand must be quite
low.  This is part of the reason why inherent safety must first be built in
and utilised as a first resort.


There is also the question of, having checked programme memory integrity at
the power-up stage, what efforts you are going to make to continue checking
the integrity of the operational code.  Reducing the re-test period will
help improve the apparent system integrity, especially if you have a
definite plan of action, not relying on the errant code, for putting the
system in a safe state should you detect a problem.




--
********************************************************************
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)


This is the figure required for civil avionics : 10E-9 failure per
hour in working conditions.


In the systems we made, the code memory was not tested periodically but
continuously, by the lowest-priority task.  This does not affect the
performance of the system, since this task is active only when no
other task is running.


Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)


Bev Littlewood and I are both aware of the figure as a requirement in
avionics.  The question below that, though, was what level of confidence
you have that you actually achieve that level of integrity.
 

This is still periodic.  You are, I expect, only performing part of the
test each time the idle task runs.  Therefore the full test is run over
the course of a period of time, and begins again immediately following its
completion.  The test interval, then, is the time taken to complete the full
test scenario.

--
********************************************************************
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
snipped-for-privacy@amleth.demon.co.uk says...

This sort of leads to the question

How often have you (anyone using code checksums) seen these catch field
failures?

And as a supplement how many of these field failures that have been
caught have been the result of (failed) field code updates?

Robert

Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)


That would require observing a field failure.  I know of only one instance
of field failure with any of the systems I have designed over the past 36
years.  That field failure, though, was a hardware component failure way
back in the 70's.  Most of the failures I observed were on the prototype
test bench.
 

As to catching errant code, I have never seen the occurrence, despite the
environments that some of my equipment runs in.  There is still time, though.

--
********************************************************************
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)



R Adsett wrote:


I measured exactly one in a run of 200,000 systems.  These were 6800
uPs with UV-erase EPROMs.

Keep in mind that the above is only the number that passed the tests
in burn-in and production test and then started failing later.
The POST caught a lot of bad units in production test, but I don't
have a breakdown of how many were ROM checksum failures.


None.  This was before field code updates were common. I am very
careful about the right voltages and algorithms for burning EPROMS;
someone doing a poor erase or a marginal burn might have had a lot
more trouble.


Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
On Sun, 25 Sep 2005 20:43:04 +0000, the renowned Guy Macon
<http://www.guymacon.com/> wrote:


Yes, probably a lot of systems currently are in-circuit programmed
without verification at Vdd limits.


Best regards,
Spehro Pefhany
--
"it's the network..."                          "The Journey is the reward"
snipped-for-privacy@interlog.com             Info for manufacturers: http://www.trexon.com
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
[snip]

I've had quite a few.  Equipment installed in cellular towers may get
its fair share of lightning surges.  Even with pretty hefty protection
schemes, some voltage spikes will get through, something that can cause
partial Flash PROM erasure, sometimes only a single-bit error.
 

In the cases mentioned above, none.  Remote upgrade of software is a
different can of worms.  It takes pretty careful design to avoid all
possible pitfalls.  Ending up with a dead lump you have to change
on site will in many cases cause enormous costs, especially if it is
outdoor equipment in areas that, due to the climate, are more or less
impossible to reach during several months of the year.

/Henrik

--

Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
snipped-for-privacy@emw.ericsson.se says...

OK, so we have a vote each for 0, 1 and many :)  So apparently you can
get field failures that are detectable, and you can still react to them.

Even small amounts of real data trump speculation.

Robert

Re: micro self-check of checksum
On Sat, 24 Sep 2005 16:57:53 GMT, the renowned "Richard"


The code required to do a checksum and conditionally shut things down
is probably quite compact. If you can detect and deal with 99% of
single bit or single byte errors, then you will have improved the
situation by 100:1.


Best regards,
Spehro Pefhany
--
"it's the network..."                          "The Journey is the reward"
snipped-for-privacy@interlog.com             Info for manufacturers: http://www.trexon.com
Re: micro self-check of checksum

My approach is to halt, and rely on the external hardware watchdog (there
*is* one, right? ;)) to reset the system. The process will repeat, and
essentially keep the system in hardware reset - assuming the code integrity
check occurs before anything critical is done with hardware.

YMMV.

Steve
http://www.fivetrees.com



Re: micro self-check of checksum
in comp.arch.embedded:


It depends on the nature and requirements of the system.

Some of the embedded systems I work on are safety critical, high
reliability systems.

In cases like these, normally there is a small boot program (sometimes
called a 'BIOS') that runs at reset.  It sets all outputs to a safe
state (usually OFF).  Then it runs an integrity test on its own image,
and tests any other RAM and on-board resources that do not involve any
off-board functions, including the watchdog timer hardware.

If any of the critical on-board hardware fails, this is considered a
system integrity error and the code puts itself in a tight loop with
some sort of error indication on its LEDs or 7-segment displays.

If all of the above tests pass, then the boot code runs some sort of
integrity validation on the flash that holds the main application.  If
that fails (new board or power loss while reprogramming the
application, for example), the boot code retains control and enables
its host communication interface (RS-485, Ethernet, CAN, USB 2.0,
depending on the product) and is ready to accept a new application
download attempt.

If the application image in flash validates, it is launched in place
or, on larger systems, copied to DRAM and started, with a watchdog
timer left running in case the application does not start correctly.

On boards with larger, 32-bit processors, there's a little more backup
than this.  Typically on a board like this, large amounts of copper
and many parts must be working correctly before the processor can do
anything at all.  Its clock, data and address busses, (S)DRAM
controller, (S)DRAM, and external flash must all be working correctly,
or nearly so, for the processor to even run its self-tests.

On a board like this, we always have a secondary single-chip
microcontroller that runs from internal flash (it used to be EPROM) and
internal RAM.  All it needs to run are its clock and the power supply;
no external components.  The micro completes its power-on self tests
rather quickly, and then lets the big processor come out of reset.  If
the main processor does not reach the point in its boot code where it
communicates with the micro within a certain time, the micro puts the
main processor back into reset and keeps it there, and displays an
error indication.

Is this sort of system absolutely foolproof?  No, nothing designed by
human beings can guarantee that.  Is it robust enough that, in
thousands of safety-critical systems in use for a decade, it has
never let a processor or other digital or power supply fault cause an
injury?  Most assuredly.

--
Jack Klein
Home: http://JK-Technology.Com
Re: micro self-check of checksum


This has to do with a self-test feature.  The device is often baked in
the sun year after year and/or placed next to high-wattage transmitters.  If
the self-test fails... time for repair.

Thomas


