micro self-check of checksum

Hello,
I am programming a PIC in assembly and am trying to think of a way to
self-verify the integrity of the code shortly after power-up.  I would like
to store the checksum as a literal in flash, or store it in the EEPROM.  What
would a subroutine look like that can calculate the checksum of its own hex
code, including the subroutine itself that is calculating the checksum of
its own hex code... whew... almost got myself into a paradox there!

Hope you know what I mean. Is there a way for a PIC to read what execution
code values are being held in its own flash?

Thomas



Re: micro self-check of checksum


It depends on the model.  Newer ones have a way to read code space.

Thad


Re: micro self-check of checksum
On Fri, 23 Sep 2005 21:26:04 GMT, "Thomas Magma"


I am not familiar with PICs, and somebody already pointed out that at
least some PICs can't read their code space as data.  But the general
idea of performing a checksum on a binary image is a pretty simple
one, especially if the image is in a single contiguous chunk of
memory.

The easiest case of all is if the last few bytes or words of that
memory are not "special", that is they don't need to hold the power up
start address or an interrupt vector.

First you pick a checksum algorithm, which could be a simple 8 or 16
bit sum, a CRC of some size, or something like a Fletcher checksum.

For simplicity, let's assume you are going to do a simple 8-bit
checksum, ignoring the overflow out of the 8 bits.  Here is a simple C
function that would perform the sum:

#include <stddef.h>  /* for size_t */

unsigned char checksum(const void *start, size_t count)
{
   const unsigned char *uc = start;
   unsigned char sum = 0;

   while (count--)
   {
      sum += *uc++;
   }
   return sum;
}

What you do is calculate the sum of all but the last byte of the image
before you program the flash.  Then you put the 2's complement of that
value into the last byte of the image, and program the flash from the
image.

At run time you call the function with the start address of the flash
and the size of the flash, including the last byte.  If the flash is
good, the value returned will be 0.

If the validation function you choose is not easily forced to 0 by
appending one value, instead calculate the checksum of the image minus
the last byte or word (however big the checksum value is), then store
the value itself into the last byte/word.  Then at run time you call
the function with the start of the flash and the size of the flash
minus the last byte/word holding the sum.  Compare the value returned
to the contents of that last byte/word, and the flash image is good if
they match.

--
Jack Klein
Home: http://JK-Technology.Com
Re: micro self-check of checksum

Now, assuming you find an error... what do you do?  You have just proven the
code is not trustworthy, so you cannot rely on the code to make the system
safe in any way, or in fact to do anything predictably.  So is the test
worthwhile?  (Playing devil's advocate.)

Regards,
Richard.



http://www.FreeRTOS.org




Re: micro self-check of checksum


The principle is to put all devices in a safe state.  This can be
accomplished by forcing a reset, and then not executing the normal code
but instead disabling all hardware and busy-looping.
This is based on the assumption that there exists a safe state for
the system upon failure of both hardware and software, for instance
in the case of a power failure.


Re: micro self-check of checksum


So you are relying on code you know is corrupt to put all devices into a
safe state?  In fact you don't even know the code is corrupt, since if it is
corrupt you don't know anything for sure.  You cannot even rely on it
to force a reset - maybe it is the decision to reset that has the
corruption.

As Spehro Pefhany says, it's a statistics game.  You can only improve the
probability of safe behaviour.

Regards,
Richard.


http://www.FreeRTOS.org




Re: micro self-check of checksum


A probability of failure of 10E-9 is required in the most severe
cases. This is not a zero probability of failure.

You certainly can't leave the code memory unchecked.  You can, however,
imagine measures that improve the probability of correct
behaviour in the case of code corruption.  For instance you can have
multiple sections of code, each with its own checksum.

You can also duplicate the critical portions of code, and you could
use a different memory device for each copy.

In some systems you must output a square-wave watchdog signal.  You can
have one portion of code that writes a 1 and another that writes
a 0.  If, following the detection of an error, you busy-loop, the
correct waveform will not be output.  This will trigger a reset by
an external watchdog.  Of course you would then hope that the reset
code is correct.

Every safety device has to perform memory code checking. For high
levels of safety several processors are used, each with its own
copy of code. Even in these cases each processor performs a code
check and stops if an error is found. The others detect this state
and put the device in a safe state.

Regards


Un-realistic Integrity Level (was Re: micro self-check of checksum)


Is that really a realistic expectation of system integrity?  In the
increasingly evidence-based high-integrity development world there is a
movement towards being able to quote the confidence with which the
integrity level figure is quoted (the ACARP** principle - see
"Dependability evaluation: a question of confidence" by Bev Littlewood,
Safety Systems newsletter published by the Safety-Critical Systems Club).

In Bev's article he expresses the opinion that a claim of 10E-4 failures
per demand is difficult to support with a high degree of confidence.
Therefore, the confidence level for 10E-9 failures per demand must be quite
low.  This is part of the reason why inherent safety must first be built in
and utilised as a first resort.


There is also the question of, having checked programme memory integrity at
the power-up stage, what efforts you are going to make to continue checking
the integrity of the operational code.  Reducing the re-test period will
help improve the apparent system integrity, especially if you have a
definite plan of action, not relying on the errant code, for putting the
system in a safe state should you detect a problem.




--
********************************************************************
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)


This is the figure required for civil avionics : 10E-9 failure per
hour in working conditions.


In the systems we made, the code memory was not tested periodically but
continuously, by the lowest-priority task.  This does not affect the
performance of the system, since this task is active only when no
other task is running.


Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)


Bev Littlewood and I are both aware of the figure as a requirement in
avionics.  The question below that, though, was what level of confidence
you have that you actually achieve that level of integrity.
 

This is still periodic.  You are, I expect, only performing part of the
test each time the idle task runs.  Therefore the full test is run over
the course of a period of time, and begins again immediately following its
completion.  The test interval, then, is the time taken to complete the full
test scenario.

--
********************************************************************
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
snipped-for-privacy@amleth.demon.co.uk says...

This sort of leads to the question

How often have you (anyone using code checksums) seen these catch field
failures?

And as a supplement how many of these field failures that have been
caught have been the result of (failed) field code updates?

Robert

Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)


That would require observing a field failure.  I know of only one instance
of field failure with any of the systems I have designed over the past 36
years.  That field failure, though, was a hardware component failure way
back in the 70's.  Most of the failures I observed were on the prototype
test bench.
 

As to catching errant code, I have never seen the occurrence, despite the
environments that some of my equipment runs in.  There is still time, though.

--
********************************************************************
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)



R Adsett wrote:


I measured exactly one in a run of 200,000 systems.  These were 6800
uPs with UV-erase EPROMs.

Keep in mind that the above is only the number that passed the tests
in burn-in and production test and then started failing later.
The POST caught a lot of bad units in production test, but I don't
have a breakdown of how many were ROM checksum failures.


None.  This was before field code updates were common. I am very
careful about the right voltages and algorithms for burning EPROMS;
someone doing a poor erase or a marginal burn might have had a lot
more trouble.


Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
On Sun, 25 Sep 2005 20:43:04 +0000, the renowned Guy Macon
<http://www.guymacon.com/> wrote:


Yes, probably a lot of systems currently are in-circuit programmed
without verification at Vdd limits.


Best regards,
Spehro Pefhany
--
"it's the network..."                          "The Journey is the reward"
snipped-for-privacy@interlog.com             Info for manufacturers: http://www.trexon.com
Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
[snip]

I've had quite a few.  Equipment installed in cellular towers may get
its fair share of lightning surges.  Even with pretty hefty protection
schemes, some voltage spikes will get through, something that can cause
partial Flash PROM erasure, sometimes only a single-bit error.
 

In the cases mentioned above, none.  Remote upgrade of software is a
different can of worms.  It takes pretty careful design to avoid all
possible pitfalls.  Ending up with a dead lump you have to change
on site will in many cases cause enormous costs, especially if it is
outdoor equipment in areas that, due to the climate, are more or less
impossible to reach during several months of the year.

/Henrik

--

Re: Un-realistic Integrity Level (was Re: micro self-check of checksum)
snipped-for-privacy@emw.ericsson.se says...

OK, so we have a vote each for 0, 1 and many :)  So apparently you can
get field failures that are detectable, and you can still react to them.

Even small amounts of real data trump speculation.

Robert

Re: micro self-check of checksum
On Sat, 24 Sep 2005 16:57:53 GMT, the renowned "Richard"


The code required to do a checksum and conditionally shut things down
is probably quite compact. If you can detect and deal with 99% of
single bit or single byte errors, then you will have improved the
situation by 100:1.


Best regards,
Spehro Pefhany
--
"it's the network..."                          "The Journey is the reward"
snipped-for-privacy@interlog.com             Info for manufacturers: http://www.trexon.com
Re: micro self-check of checksum

My approach is to halt, and rely on the external hardware watchdog (there
*is* one, right? ;)) to reset the system. The process will repeat, and
essentially keep the system in hardware reset - assuming the code integrity
check occurs before anything critical is done with hardware.

YMMV.

Steve
http://www.fivetrees.com



Re: micro self-check of checksum
in comp.arch.embedded:


It depends on the nature and requirements of the system.

Some of the embedded systems I work on are safety critical, high
reliability systems.

In cases like these, normally there is a small boot program (sometimes
called a 'BIOS') that runs at reset.  It sets all outputs to a safe
state (usually OFF).  Then it runs an integrity test on its own image,
and tests any other RAM and on-board resources that do not involve any
off-board functions, including the watchdog timer hardware.

If any of the critical on-board hardware fails, this is considered a
system integrity error and the code puts itself in a tight loop with
some sort of error indication on its LEDs or 7-segment displays.

If all of the above tests pass, then the boot code runs some sort of
integrity validation on the flash that holds the main application.  If
that fails (new board or power loss while reprogramming the
application, for example), the boot code retains control and enables
its host communication interface (RS-485, Ethernet, CAN, USB 2.0,
depending on the product) and is ready to accept a new application
download attempt.

If the application image in flash validates, it is launched in place
or, on larger systems, copied to DRAM and started, with a watchdog
timer left running in case the application does not start correctly.

On boards with larger, 32-bit processors, there's a little more backup
than this.  Typically on a board like this, large amounts of copper
and many parts must be working correctly before the processor can do
anything at all.  Its clock, data and address busses, (S)DRAM
controller, (S)DRAM, and external flash must all be working correctly,
or nearly so, for the processor to even run its self-tests.

On a board like this, we always have a secondary single-chip
microcontroller that runs from internal flash (it used to be EPROM) and
internal RAM.  All it needs to run are its clock and the power supply;
no external components.  The micro completes its power-on self tests
rather quickly, and then lets the big processor come out of reset.  If
the main processor does not reach the point in its boot code where it
communicates with the micro within a certain time, the micro puts the
main processor back into reset and keeps it there, and displays an
error indication.

Is this sort of system absolutely foolproof?  No, nothing designed by
human beings can guarantee that.  Is it robust enough that, in
thousands of safety-critical systems in use for a decade, it has
never let a processor or other digital or power supply fault cause an
injury?  Most assuredly.

--
Jack Klein
Home: http://JK-Technology.Com
Re: micro self-check of checksum


This has to do with a self-test feature.  The device is often baked in
the sun year after year and/or placed next to high-wattage transmitters.  If
the self-test fails... time for repair.

Thomas


