Checking large numbers of files

I currently "verify" my data archive with an algorithm similar to:

ONCE:    con 0;
Success: con 0;
Removed, Inaccessible, NoRecord, BadSize, BadHash: con Success+1+iota;

verify(files: list of string): list of (string, reason)
{
    failures := list of (string, reason);

    while (files != nil) {
        theFile := hd files;
        (size, signature, result) := examine_file(theFile);
        (Size, Signature, Result) := query_DBMS(theFile);

        do {
            if (NotFound == result)     { Result = Removed;      break; }
            if (BadRead == result)      { Result = Inaccessible; break; }

            // theFile exists!
            if (NotFound == Result)     { Result = NoRecord;     break; }
            if (Size != size)           { Result = BadSize;      break; }
            if (Signature != signature) { Result = BadHash;      break; }
        } while (ONCE);

        if (Success != Result) {
            failures = (theFile, Result) :: failures;
        }

        files = tl files;
    }

    return( failures );
}

[apologies for any typos; too early in the day to be working!]

The problem this has is that examine_file() is expensive -- for large numbers of large files!

I'd like to short-circuit this to provide an early indication of possible problems. E.g., stat(2) each "theFile" to identify any that are "Removed" without incurring the cost of examining the file in its entirety.

[Of course, all files MUST be eventually examined as you can't verify accessibility (BadRead), actual size (BadSize) or correct signature (BadHash) without looking at every byte! But, you can get the simple checks out of the way expeditiously to alert you to "obvious" problems!]

I can get an approximate indication of "actual size" just by stat(2)-ing the file, as well -- relying on examine_file() to later confirm that figure.
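A minimal sketch of that cheap first pass, in Python for concreteness (`expected_sizes` is a hypothetical stand-in for whatever query_DBMS() would return; the real implementation would consult the database):

```python
import os

def quick_check(files, expected_sizes):
    """Cheap first pass: stat(2) each file to catch Removed and
    (apparent) BadSize failures without reading a single byte of
    content.  Sizes reported here are only provisional --
    examine_file() must still confirm them later."""
    failures = []
    for path in files:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            failures.append((path, "Removed"))
            continue
        expected = expected_sizes.get(path)
        if expected is not None and st.st_size != expected:
            failures.append((path, "BadSize"))
    return failures
```

Note this pass only ever *adds* suspects; a clean result from it proves nothing until the full examination confirms it.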

The hash, of course, can't be guesstimated without trudging through the file in its entirety.
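For completeness, that full-content pass might look like the following Python sketch (SHA-256 is just an example; the actual "signature" algorithm is whatever the database records). A read error raised here doubles as the Inaccessible/BadRead check, so one pass over the bytes serves both tests:

```python
import hashlib

def file_signature(path, chunk_size=1 << 20):
    """Compute the file's signature by reading it in fixed-size
    chunks, so the whole file never needs to fit in memory.
    Any I/O error during the read is the BadRead case."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```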

The problem with *this* approach is I'd have to lock the entire object to ensure the integrity of the "early results" agrees with that of the "later results" (the original approach only requires taking a lock on the file being examined *while* it is being examined).

Are there any "tricks" I can use, here? E.g., the equivalent of CoW would ensure the integrity of my data (for the *original* instance) in the presence of such a global lock. But, I think only the Bullet Server would give me any behavior approaching this...?

Also, how wary would one be relying on the files' timestamps as a (crude) indication of "alteration"? (It seems far too easy to alter a timestamp without altering content/size so I've been leery of relying on that as having any "worth".)
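That wariness seems well founded: on most systems the timestamp can be set to anything without touching a byte of content, so a "plausible" mtime proves nothing. A small Python demonstration (forge_mtime is an illustrative helper, not a real API):

```python
import os

def forge_mtime(path, mtime):
    """Set a file's modification time without touching its contents --
    exactly why mtime alone is a weak integrity signal.  The reverse
    is also possible: alter content, then restore the old mtime."""
    st = os.stat(path)
    os.utime(path, (st.st_atime, mtime))
```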

Reply to
Don Y


Git is very fast at verifying files. You might find that "git annex" is just right for maintaining your archives. It's built for just this, and more.

Reply to
Clifford Heath

For a very crude check you can check that the file size hasn't changed, in addition to the file timestamp. Then, you could possibly compute a fast hash for each file and use it as the first line of verification instead of using more expensive hash algorithms. If you just need to check whether the file has been altered, a simple 64-bit CRC might be usable. You should of course evaluate the implications of using hash algorithms that can produce collisions and that are not considered safe and robust. Some file systems provide hooks that can monitor file alterations. By creating an observer which monitors file changes, you will be notified if any file is altered.
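A sketch of such a cheap first-line check in Python, using the 32-bit CRC from the standard library (a 64-bit CRC, as suggested, would need a third-party module):

```python
import zlib

def fast_checksum(path, chunk_size=1 << 20):
    """First-line check: a CRC over the file contents.  CRCs detect
    accidental corruption well but are trivial to forge, so a
    cryptographic hash must still back them up if tampering (rather
    than bitrot) is a concern."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc
```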

Br, Kalvin

Reply to

I would suggest compartmentalizing this function. There are unnecessary conditional operators stacked to achieve a "something" common. How about breaking apart the file handler and DB results? That should provide two clear paths to work with. // late to the party!

Reply to

It's just pseudo-code -- intended to show what has to happen, overall.

I.e., if a particular file in the list presented is NOT found on the media "as expected", then it should be considered as having been REMOVED (deleted) from the medium -- naming it in the list of files presupposes that it SHOULD be there. If it *is* there but can't be read/examined, then it is INACCESSIBLE -- naming it in the list of files presupposes that it WILL be accessible (i.e., no read/seek errors involved in examining its entire content).

Having verified the file is present and accessible, a record for it should be available in the database -- yet another presupposition. When a record IS available, the size recorded should coincide with the size of the actual file encountered on the medium. And, having the correct size should also lead to the contents having hashed to the correct "signature" recorded.

Failing any of these conditions is an indication that the file presented has not been encountered on the media in the form expected.

You can promote the "not found in database" test to precede the examination of the file's contents. But, that test will rarely fail; the list of files to be checked comes FROM the database! The test just ensures the database hasn't changed in that regard wrt THIS particular file in the interval between when the list of files was prepared and when this particular file happened to be examined (checking a file can take MINUTES or more; plenty of time for the database to encounter changes -- unless I take a lock on it for the duration of the testing process. Or, interleave the file list creation with the actual verification.)

You can't compare the "actual" OBSERVATIONS of the file to the stored EXPECTATIONS until you have examined the file.

The question is how much "examine_file()" can be decomposed and the consequences of that decomposition.

E.g., if implemented, instead, as (I'm being lazy in my rewrite and trying to maintain a resemblance to the original code so this is a quick cut-n-paste):

verify(files: list of string): list of (string, reason)
{
    failures := list of (string, reason);

    while (files != nil) {
        theFile := hd files;

        do {
            (result) := check_file_exists(theFile);
            if (NotFound == result) { Result = Removed; break; }

            (Result, Size, Signature) := query_DBMS(theFile);
            if (NotFound == Result) { Result = NoRecord; break; }

            // assume file must remain in the database, now that
            // it has been spotted there; likewise, assume file
            // must persist on the medium as it has been seen there
            // previously!  Otherwise, rely on a lock, transaction
            // or some explicit "recovery" algorithm
            (size) := check_file_size(theFile);
            if (Size != size) { Result = BadSize; break; }

            // same assumptions as above!
            (result) = verify_file_accessible(theFile);
            // of course, this would actually be folded into the
            // operation below; you'd not need to make two passes
            // over the file to verify it can be read in its
            // entirety and THEN compute the hash in a separate pass
            if (BadRead == result) { Result = Inaccessible; break; }

            // same assumptions as above!
            (signature) := check_file_signature(theFile);
            if (Signature != signature) { Result = BadHash; break; }
        } while (ONCE);

        if (Success != Result) {
            failures = (theFile, Result) :: failures;
        }

        files = tl files;
    }

    return( failures );
}

The problem with this approach is it gives you several windows where the state of the file, the filesystem (the metadata accessed by check_file_size and check_file_exists) itself and the DBMS can get out of sync. You have to take a lock on the database and the filesystem!

[A "transactional filesystem" has been discussed in this context; so, like the DBMS, you can rollback operations on the filesystem to ensure it always remains in sync with the data you've stored regarding its state (you have two copies of that data and no other way to ensure they REMAIN in sync while you perform these lengthy tests).]

The bigger problem lies in structuring the loop to process each file in the list in its entirety -- before advancing to the next file in the list.

Imagine, instead:

verify(files: list of string): list of (string, reason)
{
    failures := list of (string, reason);

    non_existent := verify_existence(files);
    failures = non_existent :: failures;        // unsupported syntax

    // FAIL: should have removed non_existent from files!
    not_recorded := verify_record(files);
    failures = not_recorded :: failures;        // unsupported syntax

    // FAIL: should have removed not_recorded from files!
    wrong_sized := verify_sizes(files);
    failures = wrong_sized :: failures;         // unsupported syntax

    // FAIL: should have removed wrong_sized from files!
    inaccessible := verify_read(files);
    failures = inaccessible :: failures;        // unsupported syntax

    // FAIL: should have removed inaccessible from files!
    counterfeit := verify_content(files);
    failures = counterfeit :: failures;         // unsupported syntax

    return( failures );
}

verify_existence(files: list of string): list of (string, reason)
{
    failures := list of (string, reason);

    while (files != nil) {
        theFile := hd files;

        (result) := check_file_exists(theFile);
        if (NotFound == result)
            failures = (theFile, Removed) :: failures;

        files = tl files;
    }

    return( failures );
}

verify_record(files: list of string): list of (string, reason)
{
    failures := list of (string, reason);

    while (files != nil) {
        theFile := hd files;

        (Result, Size, Signature) := query_DBMS(theFile);
        if (NotFound == Result)
            failures = (theFile, NoRecord) :: failures;

        files = tl files;
    }

    return( failures );
}

verify_sizes(files: list of string): list of (string, reason)
{
    failures := list of (string, reason);

    while (files != nil) {
        theFile := hd files;

        (Result, Size, Signature) := query_DBMS(theFile);
        (size) := check_file_size(theFile);
        if (Size != size)
            failures = (theFile, BadSize) :: failures;

        files = tl files;
    }

    return( failures );
}

etc. (hopefully you can see the pattern emerging, here...)
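The "FAIL: should have removed ... from files!" notes amount to winnowing the list between stages. One way that repair might look, as a Python sketch (`db`, `winnow`, and the stat-level checks are illustrative stand-ins, not the actual helpers):

```python
import os

def winnow(files, check):
    """Run one cheap check over the whole list; files that fail drop
    out before the next (more expensive) stage ever sees them."""
    survivors, failures = [], []
    for f in files:
        reason = check(f)
        if reason is None:
            survivors.append(f)
        else:
            failures.append((f, reason))
    return survivors, failures

def verify(files, db):
    """Staged verification: each pass over the list is cheaper than
    the next, and each failure is reported (and removed) before any
    byte of file content is read.  `db` maps path -> (size, signature)
    and stands in for query_DBMS()."""
    failures = []
    stages = [
        lambda f: None if os.path.exists(f) else "Removed",
        lambda f: None if f in db else "NoRecord",
        lambda f: None if os.stat(f).st_size == db[f][0] else "BadSize",
        # ...the full-content read/signature stage would go last,
        # being the slowest by far
    ]
    for check in stages:
        files, failed = winnow(files, check)
        failures.extend(failed)
    return failures
```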

There's a subtle but important difference. The END RESULT is POTENTIALLY the same list of failures. There can be subtle differences in the list as different aspects of each file (and the database entries) will be processed at different relative wall-clock times. E.g., the signature of the first file won't be checked at the same time relative to the invocation of the main function. Indeed, it may not be checked AT ALL -- based on which preceding tests might fail! (compared to my initial implementation which examined ALL aspects of a file before advancing to the next)

Another subtlety is that at each point where "failures" is updated, the presence of ANY "theFile" in that list is a positive indication that the file in question is *NOT* intact. And, by extension, that the list of files is not completely intact!

The caller only has to wait for the *final* assessment to be "assured" that the ENTIRE list is INTACT (in so far as these tests can vouch). Other information can be leaked, sooner (as shown)

So, if you have a medium whose integrity is suspect, you get an early indication that it *is*, in fact, corrupted! You aren't forced to wait for a THOROUGH examination of all files preceding the KNOWN OFFENDER (i.e., the file that has a size of 3 bytes instead of 283922; or the file that has "gone missing"; or the dirent that can't be accessed; etc.) before that determination can be made! "Crap! This DVD+R is showing signs of bitrot!" "Crap! I must have accidentally deleted part of this medium!"

The remedial actions you take when armed with this information sooner (vs. later) can be very different.

E.g., suspicious of a failing medium, I might not want to wait hours for its entire contents to be ROUTINELY VERIFIED before taking action to preserve those objects that might ONLY exist on that medium. Or, the objects that are of "true value". ESPECIALLY if the device is throwing read errors and incurring lengthy timeout/retries (e.g., an optical medium).

If, as is probably the MOST COMMON CASE, a set of objects had been accidentally DELETED, then waiting until they were encountered in the "list of files" could potentially be after many (thousands? millions?) of files have been THOROUGHLY examined -- and found to be intact (i.e., The Norm for any reliable medium!) Checking for the existence of ALL files before checking the content of EACH file would make that information known sooner rather than later. This affords you the opportunity to restore those files as the removed copies may have been your sole "backup"; the originals are at risk until you can recreate those backup copies!

Of course, the "hazards" and races that this approach presents differ significantly from the previous -- and original! -- approaches. So, while the results may be the same if done during "regularly scheduled system maintenance" (like a commercial IT department would do), they won't be the same if the database and the filesystems are being actively used when the tests are performed.

[Indeed, when the filesystems are NOT being used, the media are typically POWERED DOWN in my case! No dedicated period for "routine system maintenance"; that has to happen alongside my normal usage -- and ADAPT to those usage patterns!]
Reply to
Don Y
