Opinions on 16-bit checksums.

Hello all,

Since I see that endianness is a current topic, I'd like to ask a question concerning it and the calculation of 16-bit checksums. I've never seen a standard anywhere that describes exactly what a checksum is.

However, I have mainly seen them calculated in one of three ways.

1) Simply calculate a 32-bit value which is the sum of each byte added to a running sum. At the end, cast this 32-bit value to a 16-bit value, and call it the checksum.

2) Calculate the same way as above, but pull two bytes of the file out at a time such that you are adding them as 16-bit values in one of the following ways.

a) Little-endian b) Big-endian.

For example, let's say I have a 6-byte file consisting of:

0x00 0x01 0x02 0x03 0x04 0x05

Algorithm 1 would yield:

0x00 + 0x01 + 0x02 + 0x03 + 0x04 + 0x05 = 0x000f

Algorithm 2-a would yield:

0x0100 + 0x0302 + 0x0504 = 0x0906

Algorithm 2-b would yield:

0x0001 + 0x0203 + 0x0405 = 0x0609
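
In C, what I mean is roughly this (just a sketch; the data is assumed to already be in a buffer, and an even length is assumed for the 16-bit variants):

#include <stddef.h>

/* Algorithm 1: 32-bit (or wider) running sum of bytes, cast down to 16 bits. */
unsigned short sum_bytes(const unsigned char *p, size_t n)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        sum += p[i];
    return (unsigned short)sum;
}

/* Algorithm 2-a: add the data up as little-endian 16-bit words. */
unsigned short sum_words_le(const unsigned char *p, size_t n)
{
    unsigned short sum = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2)
        sum += (unsigned short)(p[i] | (p[i + 1] << 8));    /* low byte first */
    return sum;
}

/* Algorithm 2-b: add the data up as big-endian 16-bit words. */
unsigned short sum_words_be(const unsigned char *p, size_t n)
{
    unsigned short sum = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2)
        sum += (unsigned short)((p[i] << 8) | p[i + 1]);    /* high byte first */
    return sum;
}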

I've always preferred to calculate it 16 bits at a time, because that way you can detect byte-reversal issues. And, apparently, I've always calculated them assuming a little-endian format. However, I recently ran into a problem with a distributor trying to program a big-endian file into a flash part. They swapped the bytes, then ignored it, which is their fault, but my problem. But since it was the first time I'd ever worked with a big-endian binary file, it threw me, and I wanted to double-check my thinking.

How does everyone else on the list calculate a 16-bit checksum?

--kermit

--
It wasn't easy being Greazy ....but it was interesting.
Reply to
Kermit

formatting link

Reply to
jamie

On Tue, 24 Aug 2004 22:17:49 GMT, "Kermit" wrote in comp.arch.embedded:

I never use a simple checksum anymore. Whatever size you use - 8, 16, or 32 bits - it is completely oblivious to the same bit flipped in two of the words.

Another poster already provided a link to a CRC web site. Another, lower-overhead possibility is a Fletcher checksum which, unlike a plain sum, is sensitive to the order of the data and will catch most cases of an even number of bit flips in the same bit position. Each byte of the data is read only once, and there are no shifts, masks, or look-up tables.

Sample C code for an 8-bit Fletcher checksum (assuming, as the C standard does not guarantee, that an unsigned char contains exactly 8 bits):

#include <stddef.h>     /* for size_t */

struct Fletcher_8 { unsigned char sum; unsigned char total; };

struct Fletcher_8 Fletch_8(unsigned char *source, size_t how_many)
{
    struct Fletcher_8 result = { 0 };
    size_t size;
    for (size = 0; size < how_many; ++size) {
        result.sum += *source++;       /* first sum: plain running sum of the bytes */
        result.total += result.sum;    /* second sum: running sum of the running sums */
    }
    return result;
}

You can stick the two bytes into a single 16-bit value. And the algorithm extends easily to 16-bit, 32-bit, etc., sums.
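
For example, one possible way to pack the two bytes (assuming the Fletch_8 code above is in scope):

/* Fold the two bytes into a single 16-bit value. */
unsigned short Fletch_8_as_16(unsigned char *source, size_t how_many)
{
    struct Fletcher_8 r = Fletch_8(source, how_many);
    return (unsigned short)((r.total << 8) | r.sum);
}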

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
Reply to
Jack Klein

Amen, brother. Checksums are (or at least should be) a thing of the past. There are far better approaches for image integrity checking, even on small micros. I've had quite a few cases where a 16-bit checksum still passed on a corrupted image, causing much wasted time.

-->Neil

Reply to
Neil Bradley

What made you believe that the community of programmers would ever have agreed on a single "standard" way of doing it in the first place?

I strongly hope nobody actually ever does it that way, on anything smaller than a 32-bit platform, since that would be quite silly. Calculating in 32 bits is wasteful if you're only going to be using the lower 16 bits of the result --- just let a 16-bit sum roll over and be done with it.

Anyway, this checksumming method is highly dubious, to put it mildly. There are just too many types of rather likely errors that it won't ever catch. You don't want a checksum's core operation to be commutative.
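
For instance, with the 6-byte example from the original post, swapping the first and third bytes (0x02 0x01 0x00 0x03 0x04 0x05) still sums to 0x000f under algorithm 1, because addition doesn't care about order; likewise, swapping whole 16-bit words is invisible to 2-a and 2-b.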

Endianness doesn't really enter the picture as an option you would have to think about. Just interpret pairs of bytes as 16-bit values in whatever endianness your platform prefers, and add them.

E.g. note that both 2-a and 2-b would write the same checksum, seen as bytes, provided the result is stored back in the same endianness used to compute it:

0x06 0x09

But that only makes a difference if reversal of pairs of bytes an odd distance apart (1, 3, or 1023 bytes) is more important to detect, or more likely to happen, than reversal of pairs an even distance apart (2, 4, or 1024). That assumption seems unwarranted.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Reply to
Hans-Bernhard Broeker

While I agree that CRCs have demonstrably good characteristics for use in serial communications links where burst errors are common, I've never seen any evidence that CRCs are well suited to the failure modes common to ROM or EPROM memory. Perhaps such evidence exists, but I've never seen it presented in the articles I've read on checksum vs. CRC use in embedded systems.

Any 16 bit sum (CRC or otherwise) maps every possible state of the ROM into one of 65536 resulting "checksum" numbers. An error is only detected when that mapping differs between the correct state and the error state.

The mapping ("summing") algorithm used should be chosen so that the most likely failures in the ROM will cause the state to be mapped into a different "checksum" value. Some heuristic rules for choosing a mapping are fairly obvious: Choose one that maps the states evenly (all checksum results are equally likely given a random input state). A mapping algorithm that only returns one of two values 0x0000 or 0xffff is obviously a bad choice. One that returns all values from 0x0000 to 0xffff with equal likelyhood provided random input is obviously better.

The failure modes for serial messages are rather well known, and CRCs were chosen to maximize the likelihood that those failures will cause the message to be mapped into a different checksum value.

However, I've never seen any information on failure modes and statistics of ROMs. Without that information, how can one claim that a CRC is better than, say, the IP checksum algorithm?
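
For reference, by "the IP checksum algorithm" I mean the RFC 1071 one's-complement sum, which looks roughly like this (a sketch; big-endian word order is picked arbitrarily for the illustration):

#include <stddef.h>

/* One's-complement sum of 16-bit words, with the carries folded back in. */
unsigned short ip_checksum(const unsigned char *data, size_t len)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((unsigned long)data[i] << 8) | data[i + 1];
    if (len & 1)                       /* odd trailing byte, zero-padded */
        sum += (unsigned long)data[len - 1] << 8;

    while (sum >> 16)                  /* end-around carry */
        sum = (sum & 0xffff) + (sum >> 16);

    return (unsigned short)~sum;
}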

--
Grant Edwards                   grante             Yow!  The entire CHINESE
                                  at               WOMEN'S VOLLEYBALL TEAM all
                               visi.com            share ONE personality --
                                                   and have since BIRTH!!
Reply to
Grant Edwards

It does map the values, but it does not take into account position or anything else. For example, your traditional 16-bit checksum (even/odd/additive approach) would not catch values that cancel themselves out, but a CRC would.

Only if the entire EPROM is 0x00 or 0xff. ;-)

In 20 years of embedded systems development, I've witnessed no failures or oddities when 16-bit (or bigger) CRCs are used. During that period I've had about a dozen failures with checksums. Simple experience bears it out.

Plus, I'm not a math whiz, but IIRC, a CRC is significantly better than a checksum because it is also sensitive to the position and sequence of the data, and a checksum isn't.

-->Neil

Reply to
Neil Bradley

That depends on the algorithm you choose. It's true for some and not true for others. There are tons of hashing algorithms out there besides CRC and "add up the bytes".

There are plenty of changes that "cancel themselves out" for a CRC as well. For example: any time the current value of the sum is 0, you can insert a block of zeros of any length into the data without changing the sum. There are other checksum algorithms that would detect such an error in the data stream.
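
To see it concretely, here's a quick sketch (a plain bitwise CRC-16, polynomial 0x1021, zero initial value, no final XOR - any zero-init CRC behaves the same way); the two leading zero bytes are invisible because they're processed while the register is 0:

#include <stdio.h>
#include <stddef.h>

/* Plain bitwise CRC-16: polynomial 0x1021, initial value 0, no final XOR. */
static unsigned short crc16(const unsigned char *p, size_t n)
{
    unsigned short crc = 0;
    size_t i;
    int bit;

    for (i = 0; i < n; ++i) {
        crc ^= (unsigned short)(p[i] << 8);
        for (bit = 0; bit < 8; ++bit)
            crc = (crc & 0x8000) ? (unsigned short)((crc << 1) ^ 0x1021)
                                 : (unsigned short)(crc << 1);
    }
    return crc;
}

int main(void)
{
    unsigned char a[] = { 0x12, 0x34, 0x56 };
    unsigned char b[] = { 0x00, 0x00, 0x12, 0x34, 0x56 };  /* zeros inserted while the register is 0 */

    printf("%04x\n", crc16(a, sizeof a));   /* prints the same value ... */
    printf("%04x\n", crc16(b, sizeof b));   /* ... as this one */
    return 0;
}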

I've been doing embedded systems development for 20 years as well, and have never seen either one fail. Therefore "simple experience dictates they're equal". ;)

But, a 16-bit CRC can still only generate 65536 different output states. Therefore, there are billions of different changes to a ROM that "cancel out" and generate the same CRC -- just as there are billions of changes that can occur and generate the same IP checksum.

Unless you know what the characteristics of those changes are, and can show that they're less likely for the CRC than the types of changes that "cancel out" for other types of checksums, you've got no real argument that CRCs are better than others.

I'm not saying that for verifying a ROM, CRCs _aren't_ better than the IP checksum, but I've never seen anybody who was able to present any evidence that they are, either.

There _is_ plenty of evidence that the characteristics of a CRC are well suited to the types of errors that occur in serial data communications. I don't think it can be assumed that ROMs have those same error characteristics. Perhaps I'm wrong and ROMs do have the same error characteristics as a serial data link. If so, just a few cites is all I ask.

--
Grant Edwards                   grante             Yow!  Are you still an
                                  at               ALCOHOLIC?
                               visi.com
Reply to
Grant Edwards

... snip ...

I can't prove it, but I contend that no single-bit error can possibly result in an unchanged CRC checksum. I suspect this holds up to some maximum number of bit errors. There are other methods optimized to detect burst errors.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

Using a properly chosen generator polynomial, a CRC checksum can detect:

  * any two bits in error
  * any odd number of bits in error
  * an error burst as long as the checksum

[This is common knowledge and I'm not disputing it.] There are two big reasons CRCs are used in serial communications:

1) They can detect burst errors better than many other algorithms.
2) They are very cheap to implement on-the-fly in hardware, using a shift register and a handful of exclusive-or gates.

Because of this, they are particularly well suited to serial processing of a bit stream of arbitrary length when errors tend to come in bursts (e.g. due to Rayleigh fading).

I don't see that either of these reasons is applicable when verifying ROMs in software:

1) Nobody has ever shown me that ROM failures have "bursty" characteristics similar to serial communications in a noisy environment.
2) CRCs aren't particularly cheap to implement in software. [I suspect that's why IP headers don't use CRCs.]
--
Grant Edwards                   grante             Yow!  Now KEN and BARBIE
                                  at               are PERMANENTLY ADDICTED to
                               visi.com            MIND-ALTERING DRUGS...
Reply to
Grant Edwards

Agreed - MD2/MD5 come to mind, as they're also well suited for the job, but significantly more expensive to implement. Our last project had a 40 MHz cacheless ARM fetching code from an external flash using MD2, and it significantly slowed things down.

However, I'd contend if you are doing more than "Adding up the bytes", then it's no longer a checksum, but a check code. Note the operative word "sum" in checksum...

But not as many as a checksum. You can reverse two bytes in a file and a traditional checksum won't catch it, but a CRC most likely will. If you don't believe me, try it yourself.

Consider yourself lucky, then. You yourself have already proven that checksums aren't the best things. Not to say that CRCs are the ultimate, but I'd bank $$ on it that CRCs are better than traditional checksums.

Same deal with a 16 bit checksum. ;-)

The question becomes one of the nature of "naturally occurring" failures and

I think you're really talking about "check codes" rather than traditional checksums. Surely you don't think that a traditional 16 bit checksum is as good as a CRC...

A CRC is good at catching occasional byte errors, or small bursts. It's not so much the type of data errors that occur on a serial link, but rather the nature of when they occur.

I own an arcade, and over the course of the last 1.3 years several video games (Galaga '88, Marble Madness, and Battlezone, to be specific) - all of which use checksums - reported no failures in their ROM sets, yet when the ROMs were read out, their 16-bit CRCs were different. They were using 8-bit or 16-bit checksums (Galaga '88 and Marble Madness were traditional 16-bit, and Battlezone is 8-bit).

Compared to a traditional 16 bit checksum, a CRC 16 is better. Compared to other "checksums", well, obviously their merits will have to stand on their own, and I'm sure even better algorithms could be made.

-->Neil


Reply to
Neil Bradley

I would agree with this. The sort of errors that could turn up in flash can, I think, be divided into a number of groups:

a) Serious muck-ups, such as half-finished program updates. The number of errors is so great that any 16-bit checksum would have only a 1/65536 chance of missing it. All 1's or all 0's should fail.

b) Bad erasing (leading to too many 0's) or under-programming (too many 1's). There is no point in having an algorithm guaranteed to spot any two-bit error when there is either no error, or a whole bunch of errors scattered about the sector.

c) Hardware errors - bad address lines or bad data lines. For data lines, the effect will be similar to b), but for address lines you may get repeated data, or data in the wrong order. Here there will be a slight edge for algorithms that consider the order of the data as well as the value - a plain sum will not be as good as a CRC.

I think if you want some sort of guarantee of the errors your check can deal with, you have to use a much more serious algorithm like MD5 that is made for big files. A 16-bit CRC might be fine for a 160-bit serial message - but if you are using it to check a 200k program, your scale is out by a factor of 10000. You can deal with the problem by making CRC checks of small sections and then checking them, but is it worth it?

If you are using NAND flash, which is expected to have a realistic chance of failure, then the answer is yes. For normal NOR flash, the chances are tiny, other than through user (programmer :-) error, and so almost any check will be fine. A checksum where the sum is rotated (*not* shifted) left one bit before adding would be about as good as anything else, as far as I can see.
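
Something like this, roughly (a 16-bit sum assumed):

#include <stddef.h>

/* Rough sketch of the "rotate left one bit, then add" checksum mentioned
   above.  The rotation makes the sum depend on byte position, unlike a
   plain add. */
unsigned short rotadd_checksum(const unsigned char *data, size_t len)
{
    unsigned short sum = 0;
    size_t i;

    for (i = 0; i < len; ++i) {
        sum = (unsigned short)((sum << 1) | (sum >> 15));   /* rotate left by 1 */
        sum = (unsigned short)(sum + data[i]);
    }
    return sum;
}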


Reply to
David Brown

... snip ...

In the money world many fields include a simple check digit, which serves the same function. It primarily protects against user mis-entry of things such as account numbers. The best of the simple mechanisms alternates between a weight of 1 and a weight of 2 in the sum. This protects against the common error of interchanging two adjacent digits. It is surprising how many programmers in the field don't realize this, as shown by the 'improved' sums I have seen implemented.
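
One familiar example of that alternating 1/2-weight scheme is the Luhn check used on credit card numbers, roughly like this (one common variant; a string of ASCII digits with the check digit last is assumed):

#include <string.h>

/* Every second digit from the right is doubled (minus 9 if the double
   exceeds 9); the total must come out divisible by 10. */
int luhn_ok(const char *digits)
{
    int sum = 0, dbl = 0;
    size_t i = strlen(digits);

    while (i-- > 0) {
        int d = digits[i] - '0';
        if (dbl) {
            d *= 2;
            if (d > 9)
                d -= 9;
        }
        sum += d;
        dbl = !dbl;
    }
    return sum % 10 == 0;
}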

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

People keep _saying_ things like that, but nobody ever has any real evidence that it's true.

Let's say you've got a 1024-byte block you're summing. There are 2^8192 possible combinations of bits. For _any_ well-behaved[1] 16-bit checksum (CRC or otherwise), there are 2^8176 - 1 bit patterns that are wrong but generate the correct sum.

[1] By "well behaved" I mean that all output values are equally likely for a random input stream.
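
(The arithmetic: 2^8192 possible block contents divided by 2^16 possible checksum values gives 2^8176 contents per checksum value, exactly one of which is the correct image.)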

Possibly, but what makes you think there aren't just as many other types of errors that a CRC will miss and a different algorithm will catch?

I'm sure if it was settled by majority vote CRC would win, but in math, things aren't settled by majority vote.

That's my point. CRCs are well suited to the errors that naturally occur in serial communications. Nobody seems to care that ROMs probably do not have the same naturally occurring failure modes.

OK, whatever. Such "check codes" are traditionally called checksums. CRC results are often called checksums, even though they're remainders of polynomial division or something like that. What happens when you run the md5sum program? Hint: it doesn't just add up all the byte values.

It probably isn't, but I'd like some real, statistical evidence rather than a bunch of hand-waving and anecdotes.

Huh? I don't get the difference between the "type of error" and "the nature of when they occur". In binary data the "nature of when they occur" is the only thing that matters unless you want to differentiate between errors based on the actual value of the original bit.

Burst errors are one type of error, lots of randomly distributed single-bit errors are another type.

A lot of people believe that (myself included), but that belief does not constitute evidence. We're supposed to be engineers here...

Maybe not. It could be that CRCs are the best you can get. I just want somebody to point me to a source with a real argument.

--
Grant Edwards                   grante             Yow!  You should all JUMP
                                  at               UP AND DOWN for TWO HOURS
                               visi.com            while I decide on a NEW
                                                   CAREER!!
Reply to
Grant Edwards

Bravo! This is exactly the type of thing that we need to do to figure out what checksum algorithm to use. We need to look at the types of errors that occur and how frequently they occur. Just because CRCs are a good choice for checking frames on a serial communications channel, it doesn't mean CRCs are a good choice for verifying a 200KB ROM image. A hammer is definitely a good choice for driving nails into a wall, but that doesn't mean it should be used for applying paint to a wall.

--
Grant Edwards                   grante             Yow!  Is it NOUVELLE
                                  at               CUISINE when 3 olives are
                               visi.com            struggling with a scallop
                                                   in a plate of SAUCE MORNAY?
Reply to
Grant Edwards

... snip ...

There are, broadly speaking, two classes of chips: those that erase to all 1 bits (0xff) and those that erase to all 0 bits (0x00). The initialization of the CRC should be chosen to match the type used. For erase-to-0 parts, initialize to 0xffff (for CCITT-16), which will detect the wrong length of leading 0s; for erase-to-0xff parts, initialize to 0, to detect the wrong length of initial 0xffs.

Of course, what they erase to depends on how you use them. 0xff at the chip might well be 0 to your system if you pass the values through a set of inverting buffers. This is what I did in the past, and AFAIK I never failed to detect an error with CCITT-16. I wrote each module followed by its own CRC, so that checking through the CRC resulted in a zero value. I could then burn in multiple modules in various areas, and one master routine could check each, using a table of block begin/end addresses. This executed on initialization, and was called with a pointer to the table.
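
Roughly like this (a sketch: an MSB-first CRC-16, polynomial 0x1021 and no final XOR are assumed here, with the CRC stored high byte first right after the module's data; the 0xFFFF initial value is assumed as suggested above):

#include <stddef.h>

/* Running the same CRC over the data plus the appended CRC comes out to
   zero when the block is intact. */
static unsigned short crc_ccitt(const unsigned char *p, size_t n)
{
    unsigned short crc = 0xFFFF;
    size_t i;
    int b;

    for (i = 0; i < n; ++i) {
        crc ^= (unsigned short)(p[i] << 8);
        for (b = 0; b < 8; ++b)
            crc = (crc & 0x8000) ? (unsigned short)((crc << 1) ^ 0x1021)
                                 : (unsigned short)(crc << 1);
    }
    return crc;
}

/* Nonzero if the module (data immediately followed by its 2-byte,
   high-byte-first CRC) checks out. */
static int module_ok(const unsigned char *module, size_t len_with_crc)
{
    return crc_ccitt(module, len_with_crc) == 0;
}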

That was in the days of EPROMs, some of which have a tendency to self-erase, especially if you don't cover the UV window. If you do cover the UV window they tend to not erase, until you learn to scrub the window with alcohol and remove the glue. :-)

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

I have only limited experience with programming failures of flash and EEPROM devices, but from what I've seen, such errors do tend to be "bursty" in nature. This is, I believe, due to the fact that the program/erase voltage is usually applied to a row of bytes (e.g., 16, 32, or 64 bytes) at a time, and not to individual bits. Thus, if you have a program/erase problem, it is likely to affect the entire row. Of course, depending on the data being programmed, the failure may look like a series of short "bursts" in the data.

- Travis

Reply to
Travis Breitkreutz
