file CRC

- John Larkin

Fri, May 20, 2016 5:58 PM

We have a customer who sends us all sorts of files, binaries and word docs and text and whatever. We sometimes get the "same" file on the same day from two different people, and the file dates are identical but the lengths differ. Or someone emails me a file and someone else says to download it from their portal.

Is there a commonly used CRC program that we could run to get file CRCs? We could ask them to give us the CRC for an "official" final file, and compare that to what we have and think is the correct file.

They are a huge company with bizarre rules for everything, but their documentation discipline is horrendous. We get docs that have no identified author, no dates, no statement of what the file *is* or what it applies to. We get binaries that might be development versions, might be final releases, no way to tell.

A standard CRC utility would help a lot.

--

John Larkin         Highland Technology, Inc 
picosecond timing   precision measurement  

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com

- Tauno Voipio

Fri, May 20, 2016 6:06 PM

It seems that you are looking for a file hash. Google for MD5sum or just MD5. It is better than plain CRC.

--

-TV

- John Larkin

Fri, May 20, 2016 6:22 PM

What I was really looking for is an industry-accepted program (namely, something their IT people might let them install) that computes a CRC of a file. It needn't be super secure or anything. I'd write such a program myself, but they couldn't install/run it.

- Joe Chisolm

Fri, May 20, 2016 6:28 PM

For a binary file, md5sum will work well enough for what you want. There are versions for Linux, Windows and other OSes. For text documents (or Word docs) coming from multiple people, md5sum can be a pain, as the hashes of file1 and file2 will differ for even a single added space. If you can get the document files exported to basic text, you can use diff, ignoring extra spaces and newlines. There are several side-by-side diff viewers; I use meld. If the md5sums on the doc files differ, get them into text and diff them with meld.
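As a rough illustration of the "diff ignoring extra spaces and new lines" idea, here is a minimal Python sketch using the standard `difflib` module. The function names `normalize` and `whitespace_insensitive_diff` are made up for this example, and it assumes the documents have already been exported to plain text; it is not a substitute for meld.

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace inside each line and drop blank lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def whitespace_insensitive_diff(text_a, text_b):
    """Return unified-diff lines, ignoring spacing and blank-line changes."""
    return list(
        difflib.unified_diff(
            normalize(text_a),
            normalize(text_b),
            fromfile="file1",
            tofile="file2",
            lineterm="",
        )
    )


if __name__ == "__main__":
    a = "Torque to 5 Nm.\n\nShip in box A.\n"
    b = "Torque  to 5 Nm. \nShip in box B.\n"
    # Only the genuinely changed line should show up in the diff.
    for line in whitespace_insensitive_diff(a, b):
        print(line)
```

An empty result means the two texts differ only in whitespace, which is exactly the case where the md5sums disagree but the content is effectively the same.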

--
Chisolm 
Republic of Texas

- Tauno Voipio

Fri, May 20, 2016 6:30 PM

MD5 (Message Digest 5) is the widely used method to ensure that a file is not altered in transit. There are plenty of ready-made programs to create and check the hash. It is quite likely that the IT people already have it.

If you're running Windows, start here: .

--

-TV

- rickman

Fri, May 20, 2016 6:32 PM

A google search turned up a number of options. Here is a list of a few.

formatting link

--

Rick C

- mrdarrett

Fri, May 20, 2016 6:35 PM

Yep, md5sum or sha1sum should work fine

formatting link

I vaguely recall it's harder to find collisions for sha1sum than md5sum, but that was long ago and I really don't remember :p

Michael

- Don Y

Fri, May 20, 2016 6:40 PM

Then, by definition, they are NOT "identical"!

An MD5 hash is common. But, if the files are different lengths "but identical", they will have different MD5 hashes! For example:

"This is the contents of a file."

"This is the contents of a file. "

can appear "identical" but will yield different hashes. You probably want to know more than just "the hashes are different". I.e., "hashes different, size the same" suggests a different problem than "hashes different, sizes different".
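To make the "hashes different, size the same" versus "hashes different, sizes different" distinction concrete, here is a small Python sketch. The helper names `md5_of` and `compare_files` are hypothetical, and the three result strings are just one way to label the cases.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_files(path_a, path_b):
    """Classify two files by size first, then by hash."""
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    if size_a != size_b:
        # Different lengths: the hashes cannot usefully match.
        return f"sizes differ ({size_a} vs {size_b} bytes)"
    if md5_of(path_a) != md5_of(path_b):
        # Same length, different content: an edit in place or corruption.
        return "same size, hashes differ"
    return "same size, same hash"
```

Checking the size first is cheap and already catches the trailing-space example above without hashing anything.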

MD5 would be that "standard". If you like GUI interfaces: The algorithm can be implemented in a number of different utilities for different OS's. But, the MD5 of a particular set of bytes will not differ, regardless of where those bytes are encountered.

You might also want to consider something that can be run from a command line or in "batch" mode so it can be scripted. This would let you compute the hashes of a SET of files and store them in a document (that you could then archive -- as a way of reassuring yourself that the files haven't changed, over time) and exchange with the originator, for verification.
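A scripted version of that idea might look like the sketch below. It assumes Python's standard `hashlib` is acceptable on the machine in question; `file_md5` and `build_manifest` are names invented for the example. Each line is written as the digest, two spaces, then the relative path, which is intended to match the layout that GNU `md5sum` prints.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Return sorted 'digest  relative/path' lines for every file under root."""
    root = Path(root)
    lines = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            rel = p.relative_to(root).as_posix()
            lines.append(f"{file_md5(p)}  {rel}")
    return lines


if __name__ == "__main__":
    # Print a manifest for the current directory; redirect it to a file
    # to archive it or send it to the originator for comparison.
    for line in build_manifest("."):
        print(line)
```

The resulting text file is small enough to email, so both sides can compare digests without exchanging the files again.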

- Clifford Heath

Fri, May 20, 2016 6:56 PM

Google "birthday paradox". You need approximately 2^64 different files before you approach 50:50 probability of any two of them having the same 128-bit hash. That is a *LOT* of files.

If you aren't operating in an adversarial environment (finance, etc.) where there is a lot to gain, then MD5 is just fine. SHA-1 is 160 bits (2^80 files for 50:50 probability) and doesn't have a known attack vector as MD5 does. But these days, I use SHA256. You need 2^128 files to get to 50:50 probability, and that's not going to happen before the heat death of the universe.
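The 2^64, 2^80 and 2^128 figures follow from the usual birthday approximation, which can be checked with a few lines of Python. This is a sketch that assumes hash outputs behave like uniform random values (accidental collisions only, no attacker); the function names are made up for the example.

```python
import math


def files_for_even_odds(hash_bits):
    """Approximate number of random files giving a 50% chance of any collision.

    Uses n ~= sqrt(2 * ln 2 * 2**bits), i.e. about 1.18 * 2**(bits / 2).
    """
    return math.sqrt(2 * math.log(2)) * 2.0 ** (hash_bits / 2)


def collision_probability(num_files, hash_bits):
    """Birthday approximation: p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1))."""
    exponent = num_files * (num_files - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for name, bits in (("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)):
        n = files_for_even_odds(bits)
        print(f"{name}: about 2^{math.log2(n):.2f} files for 50:50 odds")
    # Odds of an accidental MD5 collision among a million files.
    print(collision_probability(1e6, 128))
```

The exact 50:50 point is a constant factor of roughly 1.18 above 2^(bits/2), which is why the round powers of two quoted above are good enough.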

There are plenty of OpenSSL builds for Windows that provide these tools for free.

Clifford Heath.

- John Larkin

Fri, May 20, 2016 7:50 PM

Exactly.

98% of the files that they send us are blather. I only want to verify the few that matter.

- mrdarrett

Fri, May 20, 2016 7:51 PM

Oh ouch

formatting link

- John Larkin

Fri, May 20, 2016 7:53 PM

That looks exactly like what we might be able to persuade them to use. Thanks.

Too confusing! I'd like to be able to verify that a few critical files are in fact the final, released versions. Things like packaging-for-shipping docs don't matter; FPGA images do.

- Lasse Langwadt Christensen

Fri, May 20, 2016 8:05 PM

On Friday, May 20, 2016 at 21:53:28 UTC+2, John Larkin wrote:

for fpga bit files just look at the header, it contains design name, part number and timestamp

-Lasse

- Don Y

Fri, May 20, 2016 8:10 PM

I catalog the MD5's of every file (hundreds of thousands!) in my archive. This lets me reassure myself (automagically) that nothing has changed in a file -- even if "I" haven't looked at the file in years (i.e., a program can reexamine every file in the archive at its leisure and verify that the hashes are unchanged from their previously stored values).

I wouldn't want to drag out a file that's been sitting on a disk for years -- only to discover that it suffers from bitrot! I also want to notice when *any* file starts to "rot" as it can be a leading indicator for a pending drive failure (which might leave me with just *one* copy of some/all files on that drive, mirrored elsewhere -- meaning there will be a period of time while I rebuild that "backup" in which I have NO backup!)
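The re-verification pass described here could be sketched roughly as follows. This is not the poster's actual tooling; `verify_catalog` and the dictionary-based catalog (relative path mapped to the stored MD5 hex digest) are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_catalog(root, catalog):
    """Re-hash every cataloged file and report what no longer matches.

    catalog maps a path relative to root onto its stored hex digest.
    Returns (changed, missing) as two lists of relative paths.
    """
    root = Path(root)
    changed, missing = [], []
    for rel_path, expected in sorted(catalog.items()):
        p = root / rel_path
        if not p.is_file():
            missing.append(rel_path)
        elif file_md5(p) != expected:
            changed.append(rel_path)
    return changed, missing
```

Any entry in `changed` for a file nobody has touched is the "bitrot" warning sign mentioned above.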

- John Larkin

Fri, May 20, 2016 8:14 PM

But that doesn't tell me that it's the right one!

And some of the files include an FPGA image and a boot loader and some ARM code.

- Don Y

Fri, May 20, 2016 8:20 PM

This is not usually significant if you are operating in a non-hostile environment (i.e., no one is trying to *fool* you into thinking that "apple" is really "orange").

MD5 is two to four times "less computationally complex" than, e.g., SHA256. Probably twice again that for SHA512.

If you have a policy (corporation!) that tracks hashes to ensure the integrity of their store, then the difference can be significant.

E.g., on my archive/NAS machines, I can compute MD5's at about 300MB/s. SHA256 drops down to closer to 75MB/s. If you have several terabytes to hash (over many, many files), the difference can add up quickly. (remember, a processor is also doing other things so any effort that is expended cuts down on how quickly it can do those other things)
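Throughput figures like these depend heavily on the CPU and the hash implementation, so it is worth measuring on your own hardware. Below is a minimal in-memory benchmark sketch using Python's `hashlib`; the function name `throughput_mb_per_s` and the buffer sizes are arbitrary choices for the example, and disk speed is deliberately left out of the measurement.

```python
import hashlib
import time


def throughput_mb_per_s(algorithm, total_mb=32, chunk_mb=1):
    """Hash total_mb of zero bytes in memory and return MB/s for one algorithm."""
    chunk = bytes(chunk_mb * 1024 * 1024)
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        h.update(chunk)
    h.digest()
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    for name in ("md5", "sha256", "sha512"):
        print(f"{name}: {throughput_mb_per_s(name):.0f} MB/s")
```

On a real archive the drives may be the bottleneck rather than the hash, in which case the choice of algorithm matters less than these numbers suggest.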

[In my case, I don't let my archive spin for any longer than necessary. So, the hashes are computed opportunistically -- *while* I am accessing the contents of other files in the archive. Ideally, I would like lots of files to get checked while I am poking around looking for something. This ensures the entire archive gets verified more frequently]

- John Larkin

Fri, May 20, 2016 8:20 PM

Given files that have more bits than the hash length, there must be different files that make the same hash. And apparently sometimes two different 128-byte files can have the same 128-bit hash; not surprising. That doesn't affect my problem: I just want to reduce the probability that we were sent the wrong file; a confidence factor of 1e9, or even 1e3, would be plenty.

- Lasse Langwadt Christensen

Fri, May 20, 2016 8:28 PM

On Friday, May 20, 2016 at 22:14:48 UTC+2, John Larkin wrote:

you either have to make a crc and know what crc is the right one, or know what the correct timestamp is and it is all handled by the tools

make sure to build the timestamp and revision system version number into the code and there will be no doubt, and the system usually needs that info anyway so it can report what versions it is running

-Lasse

- Joe Chisolm

Fri, May 20, 2016 9:17 PM

There are well documented cases where people have been able to hand craft messages that would match another message hash. For a bit stream file your chances of having 2 files with matching hash values will be a non issue. If you are really concerned about that use sha256. There are shasum programs similar to md5sum (windows available). If I was getting bit streams from a customer I would document on the build information who sent me the file, when they sent it, file name and my own md5sum of the file. If you are trying to track your generated bit streams then that is what your version control system is for.

--
Chisolm 
Republic of Texas

- Jeff Liebermann

Fri, May 20, 2016 9:40 PM

Yep. Welcome to Microsoft Office documents. Each document is "branded" with the Windoze serial number, user login name, and a bunch of other metadata. When macro viruses were the fashionable malware to write, the perpetrator's machine was identified by decoding this information. There is also some hidden stuff in the document as described in: Each version of Office seems to collect more and more metadata: When an MS Office document is created on one machine, and later opened on a different machine, some of this stuff is left intact, while other stuff is replaced by the corresponding data from the last machine. I don't know the specifics.

I don't know if there is a "standard" CRC program, but there are plenty of ports to choose from:

If they're sending you MS Office documents, you can extract quite a bit about the author, version, machine, hours wasted preparing the document, uptime, etc from the metadata using Document Inspector. See the various MS Office URL's above.

Standards are a good thing. Every company should have one.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558