Embedding a Checksum in an Image File

On 20.04.2023 at 22:44, George Neuner wrote:

It works for any CRC that starts with zero and does not invert.

CRC is based on polynomial division remainders. Basically, the CRC is the division remainder of the input interpreted as a polynomial, and if you append that remainder back to the input, the result is zero.

[link]
result is 0xBA3C

[link]
result is 0x0000

(Need to be careful with byte orders; for some CRCs on that page, you need to swap the bytes before appending.)
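
A minimal sketch of that property in C (assuming the CRC-16/XMODEM parameters and a made-up input string, so the printed values differ from the ones above):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0, no
   reflection, no final inversion - parameters that satisfy the
   "starts with zero and does not invert" condition. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0;
    while (n--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t buf[64] = "example image contents";
    size_t n = strlen((char *)buf);

    uint16_t crc = crc16(buf, n);
    printf("CRC of data:       0x%04X\n", crc);

    /* Append the CRC most significant byte first (the byte-order caveat). */
    buf[n]     = (uint8_t)(crc >> 8);
    buf[n + 1] = (uint8_t)(crc & 0xFF);

    printf("CRC of data + CRC: 0x%04X\n", crc16(buf, n + 2)); /* prints 0x0000 */
    return 0;
}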

Stefan

Reply to
Stefan Reuther

It is a bit of work, but even a 32-bit CRC is solvable to find the reverse equation. You can do the work once, generically, and get a formula that computes the value you need to put into the final bytes to bring the CRC of the file back to the CRC it was before adding the CRC and the extra bytes. It wouldn't surprise me if the formula is published somewhere for the common CRCs.
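
For illustration, here is a sketch of that computation for the common zlib/Ethernet CRC-32 parameters (an assumption - the thread doesn't fix a polynomial). It uses the trick described in Stigge et al., "Reversing CRC - Theory and Practice": for this polynomial the top bytes of the 256 table entries are a permutation of 0..255, so the table can be run backwards to find the four bytes that drive the CRC register to a chosen value.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standard zlib/Ethernet CRC-32: reflected, polynomial 0xEDB88320,
   initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF. */
static uint32_t table[256];
static uint8_t  rev[256];   /* top byte of a table entry -> its index */

static void init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        table[i] = c;
        rev[c >> 24] = (uint8_t)i;   /* the top bytes form a permutation */
    }
}

static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t s = 0xFFFFFFFFu;
    while (n--)
        s = (s >> 8) ^ table[(s ^ *p++) & 0xFFu];
    return s ^ 0xFFFFFFFFu;
}

/* Pick 4 bytes that drive the CRC register from 'state' to 'target'
   (both are raw register values, i.e. before the final XOR). */
static void patch4(uint32_t state, uint32_t target, uint8_t out[4])
{
    uint8_t idx[4];
    idx[3] = rev[target >> 24];          target ^= table[idx[3]];
    idx[2] = rev[(target >> 16) & 0xFF]; target ^= table[idx[2]] >> 8;
    idx[1] = rev[(target >> 8) & 0xFF];  target ^= table[idx[1]] >> 16;
    idx[0] = rev[target & 0xFF];
    for (int k = 0; k < 4; k++) {
        out[k] = (uint8_t)((state ^ idx[k]) & 0xFFu);
        state  = (state >> 8) ^ table[idx[k]];
    }
}

int main(void)
{
    init_tables();

    uint8_t img[32] = "firmware image";   /* 4 spare bytes follow the data */
    size_t  n = strlen((char *)img);

    uint32_t before = crc32(img, n);
    /* Register state after the data = final CRC with the final XOR undone.
       Choose the patch so the extended file keeps the original CRC. */
    patch4(before ^ 0xFFFFFFFFu, before ^ 0xFFFFFFFFu, &img[n]);

    printf("before: %08X  after: %08X\n",
           (unsigned)before, (unsigned)crc32(img, n + 4));
    return 0;
}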

Reply to
Richard Damon

And just to be sure I understand what you wrote in a somewhat convoluted way: when you have two binary executables that report the same version number, you want to be able to distinguish them with a 'checksum', right?

Reply to
Brian Cockburn

David, a hash and a CRC are not the same thing. They both produce a reasonably unique result though. Any change would show in either (unless as a result of intentional tampering).

Reply to
Brian Cockburn

No one is trying to detect changes in the image. I'm trying to label the image in a way that can be read in operation. I'm using the checksum simply because it is easy to generate. I've had problems with version numbering in the past. Version numbers will still be used, but I want them supplemented with a number that will change every time the design changes, at least with high probability - say, a 1-in-64k chance of missing a change.

Reply to
Rick C

Yes, I want the checksum to be readable while operating. Calculation code??? Not going to happen. That's why I want to embed the checksum.

Yes, two compiled files which ended up with the same version number by error. We are using an 8-bit version number, so two hex digits. Negative numbers are lab versions, positive numbers are releases, so 64 of each. We don't do a lot of actual work on the hardware. This code is usually 99.9% working by the time it is tested on hardware, so there's no need for lots of rev numbers. But sometimes, in the lab, the rev number is not bumped when it should be. The checksum will tell us if we are working with different revisions in that case.

So far, it looks like a simple checksum is the way to go. Include the checksum and the 2's complement of the checksum (in locations that were zeros), and the checksum will not change.
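
A toy sketch of that scheme in C (assuming the checksum is a 16-bit sum of 16-bit words, so that a value and its two's complement cancel exactly):

#include <stdint.h>
#include <stdio.h>

/* 16-bit additive checksum over 16-bit words. */
static uint16_t sum16(const uint16_t *w, size_t nwords)
{
    uint16_t s = 0;
    while (nwords--)
        s += *w++;
    return s;
}

int main(void)
{
    /* Toy "image" with two reserved zero words at the end. */
    uint16_t img[8] = { 0x1234, 0x5678, 0x9ABC, 0xDEF0, 0x0011, 0x2233, 0, 0 };

    uint16_t ck = sum16(img, 8);            /* the zeros don't affect the sum */
    img[6] = ck;                            /* embed the checksum ...         */
    img[7] = (uint16_t)(0u - ck);           /* ... and its two's complement   */

    /* ck + (-ck) == 0 mod 2^16, so the overall checksum is unchanged: */
    printf("before: 0x%04X  after: 0x%04X\n", ck, sum16(img, 8));
    return 0;
}

Note that the cancellation relies on summing words of the same width as the checksum; with a byte-wise sum, the carries between the high and low bytes of the two stored values don't cancel in general.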

Reply to
Rick C

Rick,

Can you expand on what you mean or expect by 'readable while operating' please? Are you planning to use some sort of tool to inspect the executing binary to 'read' this thing, or provoke output to the console in some way like:

$ run my-binary-thing --checksum
10FD
$

This would be as distinct from:

$ run my-binary-thing --version
-52
$

Signed 8-bit numbers range from -128 to +127 (0x80 to 0x7F) so probably a few more than 64.

This may be an indicator that better procedures are needed for code review-for-release, and that an independent pair of eyes should be doing the review against an agreed checklist.

How will the checksum 'not change'? It will be different for every build, won't it?

Cheers, Brian.

Reply to
Brian Cockburn

He means the checksum of the file for a given build after the modification will be the same as the checksum of the file before the modification.

Reply to
Richard Damon

A CRC is a type of hash - but hash is a more generic term.

Exactly. Thus a CRC is a hash.

It is not a cryptographically secure hash, and is not suitable for protecting against intentional tampering. But it /is/ a hash.

Reply to
David Brown

Again - use a CRC. It will give you what you want.

You might want to go for a 32-bit CRC rather than a 16-bit CRC, depending on the kind of program, how often you build it, and what consequences a hash collision could have. With a 16-bit CRC, you have about a 5% chance of a collision after 82 builds - by the birthday approximation, p = 1 - e^(-n(n-1)/(2*65536)), which gives 0.049 for n = 82. If collisions only matter for releases, and you only release a couple of updates, fine - but if they matter during development builds, you are running a more significant risk. Since a 32-bit CRC is quick and easy, it's worth using.
Reply to
David Brown

More like:

$ run my-binary-thing
Hello, master. Would you like to achieve world domination today?

That would be X0FE38

See? This is why I need the checksum. I make mistakes.

Or that I need a checksum. This is a lab compile, not a release. Let's try to stay on task.

It won't be changed by including the checksum and the complement because they add up to zero.

Reply to
Rick C

Again - as will a simple addition checksum.

Or, I might want to go with a simple checksum.

Thanks for your comments.

Reply to
Rick C

A simple addition checksum might be okay much of the time, but it doesn't have the resolving power of a CRC. If the source code changes "a = 1; b = 2;" to "a = 2; b = 1;", the addition checksum is likely to be exactly the same despite the change in the source. In general, you will have a much higher chance of collisions, though I think it would be very hard to quantify that.
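
To make that concrete, a small test (CRC-16/XMODEM chosen arbitrarily for the comparison):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Byte-wise additive checksum, truncated to 16 bits. */
static uint16_t sum16(const uint8_t *p, size_t n)
{
    uint16_t s = 0;
    while (n--)
        s += *p++;
    return s;
}

/* CRC-16/XMODEM: polynomial 0x1021, init 0, no reflection or final XOR. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0;
    while (n--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    const char *x = "a = 1; b = 2;";
    const char *y = "a = 2; b = 1;";   /* same bytes, different order */

    printf("sum: %04X vs %04X\n",      /* the additive sums are identical */
           sum16((const uint8_t *)x, strlen(x)),
           sum16((const uint8_t *)y, strlen(y)));
    printf("crc: %04X vs %04X\n",      /* the CRCs differ */
           crc16((const uint8_t *)x, strlen(x)),
           crc16((const uint8_t *)y, strlen(y)));
    return 0;
}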

Maybe it will be good enough for you. Simple checksums were popular once, and can still make sense if you are very short on program space. But there are good reasons why they fell out of favour in many uses.

It's your choice (obviously). I only point out the weaknesses in case anyone else is listening in to the thread.

If you like, I can post code for a 32-bit CRC. It's a table, and a few lines of C code.
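
For reference, the kind of code meant here is a short table-driven routine. A sketch with the common zlib/IEEE 802.3 parameters (reflected, polynomial 0xEDB88320, initial value and final XOR 0xFFFFFFFF), generating the table at startup instead of pasting it in:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc_table[256];

static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[i] = c;
    }
}

static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--)
        crc = (crc >> 8) ^ crc_table[(crc ^ *p++) & 0xFFu];
    return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
    crc32_init();
    const char *s = "123456789";
    /* The standard check value for this CRC is 0xCBF43926. */
    printf("%08X\n", (unsigned)crc32((const uint8_t *)s, strlen(s)));
    return 0;
}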

Reply to
David Brown

I remember a long discussion about this a few decades ago. An N-bit additive checksum maps the source data into the same hash space as an N-bit CRC.

Therefore, for two randomly chosen sets of input bits, they both have a 1 in 2^N chance of a collision. I think that means that for random changes to an input set of unspecified properties, they would both have the same chance that the hash is unchanged.

However... IIRC, somebody (probably at somewhere like Bell labs) noticed that errors in data transmitted over media like phone lines and microwave links are _not_ random. Errors tend to be "bursty" and can be statistically characterized. And it was shown that for the common error modes for _those_ media, CRCs were better at detecting real-world failures than additive checksum. And (this is also important) a CRC is far, far simpler to implement in hardware than an additive checksum. For the same reasons, CRCs tend to get used for things like Ethernet frames, disc sectors, etc.
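
(For a sense of why the hardware is so cheap: a bit-serial CRC is just a shift register with XOR gates at the polynomial taps, with no carry chain at all. A C model of one register step, assuming the CCITT polynomial 0x1021:)

#include <stdint.h>
#include <stdio.h>

/* One bit-serial CRC step - essentially the LFSR that hardware
   implements: shift left, XOR the taps when the feedback bit is set. */
static uint16_t crc16_step(uint16_t reg, unsigned bit_in)
{
    unsigned feedback = ((reg >> 15) & 1u) ^ (bit_in & 1u);
    reg = (uint16_t)(reg << 1);
    if (feedback)
        reg ^= 0x1021;
    return reg;
}

int main(void)
{
    /* Shift one byte (0xA5) through the register, MSB first. */
    uint16_t reg = 0;
    for (int b = 7; b >= 0; b--)
        reg = crc16_step(reg, (0xA5u >> b) & 1u);
    printf("%04X\n", reg);
    return 0;
}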

Later people seem to have adopted CRCs for detecting failures in other very dissimilar media (e.g. EPROMs) where implementing a CRC is _more_ work than an additive checksum. If the failure modes for EPROM are similar to those studied at <wherever> when CRCs were chosen, then CRCs are probably also a good choice for EPROMs despite the additional overhead. If the failure modes for EPROMs are significantly different, then CRCs might be both sub-optimal and unnecessarily expensive.

I have no hard data either way, but it was never obvious to me that the arguments people use in favor of CRCs (better at detecting burst errors on transmission media) necessarily applied to EPROMs.

That said, I do use CRCs rather than additive checksums for things like EPROM and flash.

Reply to
Grant Edwards

Totally agree! I stopped using simple checksums years ago. Many processors these days also have a CRC peripheral that makes it easy to use. And I can simply chop that down to 16 bits if I don't want to transmit all 32 bits. Or even 24 bits.

boB

Reply to
boB

That's a lot of good points. You are absolutely correct that CRC's are better for the types of errors that are often seen in transmission systems. The person at Bell Labs that you are thinking of is probably Claude Shannon, famous for his quantitative definition of information and work on the information capacity of communication channels with noise.

Another thing you can look at is the distribution of checksum outputs, for random inputs. For an additive checksum, you can consider your input as N independent 0-255 random values, added together. The sum will follow an approximately normal distribution. If you have, say, a 100 byte data block and a 16-bit checksum, it's clear that you will never get a checksum value greater than 25500, and that you are much more likely to get a value close to 12750. This kind of clustering means that the 16-bit checksum contains a lot less than 16 bits of information. Real data - program images, data telegrams, etc. - are not fully random, and the result is even more clustering and less information in the checksum.

Taking the additive checksum over a larger range, then "folding" the distribution back by wrapping the checksum to 8-bit or 16-bit will greatly reduce the clustering. That will help a lot if you have a program image and use a 16-bit additive checksum, but if you need more than "1 in 65536" integrity, it's hard to get.
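
One plausible reading of "folding" (an assumption - the post doesn't pin down the exact method) is the end-around-carry trick used by the Internet checksum: accumulate in a wide register, then add the overflow back into the low bits:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Additive checksum accumulated in 32 bits, folded back to 16 bits.
   For small inputs the fold is a no-op; it starts to mix once the
   sum overflows 16 bits, as it would for a real program image. */
static uint16_t folded_sum16(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    while (n--)
        sum += *p++;
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);   /* fold carries back in */
    return (uint16_t)sum;
}

int main(void)
{
    const char *s = "a large program image would go here";
    printf("%04X\n", folded_sum16((const uint8_t *)s, strlen(s)));
    return 0;
}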

A particular weakness of purely additive checksums is that they only consider the values of the bytes, not their order - re-arranging the order of the same data gives the same additive checksum.

CRC's are not as good as more advanced hashes like SHA or MD5. But their distributions are vastly better than additive checksums, and they provide integrity checks for a wider variety of possible errors.

Of course, for some uses, an additive checksum might be considered good enough. There's no need to be more complicated than you need to be. But since CRC's are usually very simple and efficient to calculate, they give an option that is a lot better than an additive checksum for little extra cost, while going beyond them to MD5 or SHA involves significantly more effort. (SHA is your first choice if you are protecting against malicious changes.)

Reply to
David Brown

You know nothing of the project I am working on or those that I typically work on. But thanks for the advice.

Reply to
Rick C

It never occurred to me that for an N-bit checksum, you would sum something other than N-bit "words" of the input data.

Reply to
Grant Edwards

Usually - in my experience - you sum bytes, using an unsigned 8-bit or 16-bit integer. Simple additive checksums are often used on small 8-bit microcontrollers where CRC's are seen (rightly or wrongly) as too demanding. Perhaps other people have different experiences.

You could certainly sum 16-bit words to get your 16-bit additive checksum, and that would give a different kind of clustering - maybe better, maybe not.

Reply to
David Brown

You haven't given much to go on. It is still not really clear (to me, at least) if you are asking about checksums or how to manipulate binary images as part of a build process, or what you are really asking.

When someone wants a checksum on an image file, the appropriate choice in most cases is a CRC. If security is an issue, then a secure hash is needed. For a very limited system, additive checksums might be the only realistic choice.

But more often, the reason people pick additive checksums rather than CRCs is because they don't realise that CRCs are actually very simple and efficient to implement. People unfamiliar with them might have read a little, and think they need to do calculations for each bit (which is possible but /slow/), or that they would have to understand the theory of binary polynomial division rings (they don't). They think CRC's are complicated and advanced, and shy away from them.

There are a number of people who read this group - maybe some of them have learned a little from this thread.

Reply to
David Brown
