SHA1/MD5 of S-Record File

- D
- David T. Ashley
  
  Contact options for registered users
posted
16 years ago

Thu, Feb 21, 2008 9:14 PM

I'd like to define a message digest of an S-record file to be something that is independent of the specific formatting of the file. For my purposes, two S-record files are identical if they specify precisely the same memory locations with the same contents.

For those not familiar with an S-record file:

formatting link

The issue with S-record files is that it is possible to specify an identical memory load in more than one way. For example, just editing the S-record file and swapping two lines will result in a file that is logically identical. Or, one might have records of varying lengths but the end result of what they specify might be the same.

My idea is that I should:

a)Parse the S-record file to form a set of (address, data) 2-tuples, i.e. one 2-tuple for each specified byte.

b)Sort the 2-tuples by address.

c)Concatenate the 2-tuples in the sorted order and take the SHA1 or MD5 of that concatenation.

The SHA1 or MD5 should be the same for S-record files that specify the same memory load but are perhaps formatted differently.

Questions:

a)Is there any advantage or disadvantage to using the raw binary data in the

2-tuples?

b)Anything I'm not thinking about?

Thanks, Dave.

--
David T. Ashley              (dta@e3ft.com)
http://www.e3ft.com          (Consulting Home Page)
http://www.dtashley.com      (Personal Home Page)
http://gpl.e3ft.com          (GPL Publications and Projects)

- D
- David Wagner
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Feb 21, 2008 9:44 PM

The generic principle is to canonicalize (so that any two files that you want to consider equivalent, will have the same canonicalized form), then hash. I don't know enough about S-records to know what is the right or best way to canonicalize, but that's the idea.

I wouldn't use MD5, if this is for security.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Feb 21, 2008 11:04 PM

What's your alternative to using "raw binary data"? It does not matter if your tuples are stored in binary, or ascii hex, or anything else, as long as they are consistent. Use whatever format you think is most convenient for your implementation.

Yes, there is a big factor missing - default memory values. S-record files are mostly used for program downloads for embedded processors (mainly Freescale processors - AFAIK, the format was developed by Motorola). Typically a program update consists of erasing flash blocks and then downloading the S-Record file to the flash. Memory addresses may be assumed to take a particular value (normally 0xff, but possible

0x00) if they are not specified in the S-record file. So two files may represent the same program, but be different according to your comparison because one explicitly fills the blank areas with 0xff, while the other does not.

Another thing to consider is that S-Record files can contain other fields as well as memory values. In particular, they often contain a start address for the program, and IIRC there are a few other possible standard records. Additionally, some tool vendors have extended the files with things like debugging information. You have to decide whether that sort of thing is relevant in your comparisons (and therefore in your hashes).

mvh.,

David

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 5:43 AM

The sorting of the (address, data) 2-tuples provides the canonical form. I understood your post. Thank you very much.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 5:51 AM

Your guess was correct: I was thinking about ascii hex.

I understand your comment about consistency.

The issue is that I'm so unfamiliar with cryptographic hash functions that I don't know if the binary data would provide enough "difference" or if there is anything more I should do (such as repeat the data to give a longer total string to be cryptographically hashed).

For our application, the default memory values aren't relevant. Also, the startup vector isn't relevant, as this vector is stuffed into a specific spot in the microcontroller memory map and that forces the startup (i.e. this information is data rather than a special S-record data). However, thanks for pointing that out, and I'll think about it further.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 9:32 AM

I am no expert in cryptography, but I can't imagine that there is any difference in the security or uniqueness of the hashes depending on whether you use ascii hex or binary for the data. They have the same information content, so with a decent hash function and enough data, it will make no difference (somebody in sci.crypt please correct me if I'm talking rubbish!). When you have very small amounts of data to encrypt (such as passwords), it's common to add some "salt" to make better use of the hash - but it's not necessary when you have more data.

- E
- Ertugrul SÃ¶ylemez
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 10:50 AM

This is right. Assuming a perfect hash function, the entropy of the hash value is the same as the entropy of the original message, with a limit of the size of the hash value.

Not "to make better use of the hash", but to reduce the effectiveness of rainbow tables. It makes generating them harder by a factor of 2^N, where N is the bit-size of the salt's entropy.

Regards, Ertugrul.

--
http://ertes.de/

- P
- Paul Rubin
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 11:00 AM

If you mean "perfect hash" the way that phrased is used in compilers, that's not anything like a cryptographic hash. Idealized cryptographic hashes have collisions as predicted by the birthday paradox. Therefore, the hash output has slightly less entropy than the input. There have been a few threads here about how the entropy decreases as you iterate the hash, and I think Prof. Wagner has a webpage with the calculation.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 12:08 PM

Thanks for correcting me here - I knew there was a specific reason for salting passwords, but I'd forgotten what it was.

- E
- Ertugrul SÃ¶ylemez
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 12:09 PM

A "perfect hash function" as in a "true pseudo-random function", in which case you're right. The output has slightly less entropy than the input. Sorry for being inaccurate.

But this is a good point. What happens if the message contains far more entropy than the hash value can contain? Let N be the bit-length of the hash value, let M be the message space, an infinite, countable set, and let Y be an arbitrary hash value.

Select an arbitrary finite subset S from M. As the size of S increases, the number of messages, which get hashed to Y, approaches 1/(2^N). Am I wrong here?

Regards, Ertugrul.

--
http://ertes.de/

- V
- Vend
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 12:35 PM

Assuming that the hash function works properly, you don't need to replicate the data or do any particular manipulation.

You'll need to consider whether length extension attacks are relevant to your application.

formatting link

If it's the case, you just need to apply the hash function twice.

- P
- Paul Rubin
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 6:18 PM

It's approximately |S|/2^n. The approximation is not exact. If you take a 160 bit random input and run it through SHA1, the number of outputs will actually be about (2^160 - 2^80) because of birthday collisions. That is why the output has less entropy than the input.

- E
- Ertugrul SÃ¶ylemez
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 6:51 PM

Uhm, sorry, expressed it incorrectly in the rush. I mean, in average, the probability that a message from S gets mapped to Y is 1/2^N, isn't that right? I'd be pretty surprised, if that's wrong.

Regards, Ertugrul.

--
http://ertes.de/

- K
- Kristian Gjøsteen
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 7:15 PM

As long as S doesn't depend on the random function, then yes.

--
Kristian Gjøsteen

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Feb 22, 2008 9:14 PM

I think there are two different notions of "salt".

One is as you find in htpasswd files and so on where the salt is part of the hash output and is visible to a potential attacker. I think that is the type of salt described above.

For example, in this page:

formatting link

you'll notice that the same inputs don't always give the same output hash, because of the included (and fully visible) salt.

The second notion is where you conceal the "salt". For example, when I'm developing web applications and when I'm issuing session identifiers, I might choose the identifier as something like:

concat(database_key, sha1(concat( salt, database_key, salt)))

so that even if a person can guess at a database key (keying to session data on the server), they can't guess at the second part of the session identifier, because they don't have the "salt".

Also, passwords are most often not stored plain. One might store instead the hash sha1(concat(salt, password, salt)), and then just do a comparison when a password is entered.

If the password database is compromised, it is just nonsense without the salt.

If the password database and the salt are compromised, the only attack possible is a dictionary attack.

In this case, the "salt" has to be concealed, otherwise attacks are possible.

I'm not sure if this second usage (where it has to be concealed) is true "salt".

Dave.

- E
- Ertugrul SÃ¶ylemez
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Mon, Feb 25, 2008 10:35 AM

No, this is more like a 'tweak' (as used in the LRW mode). Salts, nonces and initialization vectors, which are closely related to each other, generally don't need to be hidden, as opposed to a tweak, which _must_ be hidden.

Regards, Ertugrul.

--
http://ertes.de/