Multiple files with same filename on FAT

Vicne · 2007-10-29T10:45:26+00:00

Hi,I'm planning to put the hard disk of my PVR (a Video DVD recorder withhard-disk) on an external tray to ease video file moving to my PC.Ideally, I would launch a small program to display the HD contents,select the sequences I'm interested, and it would then copy them to myPC's hard drive for advanced editing and burning. Then I would put thedrive back into the PVR.For testing, I took the drive out, cloned it, and I'm currentlyanalyzing its contents.It really looks promising : The drive is FAT based (Ghost said FAT 16,XP says FAT 32), video files are standard MPG without encryption (theyplay ok in VLC) and there's a "reclist.dat" file listing all recordingmetadata (title, start time, duration, compression mode) that Ireverse-engineered without too much hassle.However, the filesystem obviously is not 100% FAT compliant (forexample, the last modification timestamp is empty in XP and"01/01/1601 01:00" in dos prompt's dir) and the real problem is thefollowing :When a video file reaches 4 Gb, a second one is created with the*same* name (!). For example, "dir" shows :01/01/1601 01:00 4.290.772.992 01/01/1601 01:00 201.195.520 or :01/01/1601 01:00 4.290.772.992 01/01/1601 01:00 1.055.064.064 Due to this inconsistence, I cannot play or copy the file : all I'mgetting is an error message.Does one of you know a simple way (utility, libary, OS) of accessingthe files directly by their FAT entry ? I'm pretty sure entries arecontiguous (and even if they're not, finding the correct sequencemanually would not be a problem).As a fallback position, any reliable way to rename the duplicate fileswould be welcome too, although it would probably prevent me fromputting back the drive in the PVR afterwards unless I undo therenaming.The program could be in any language and even any OS as I intend toplace it on a bootable CD that would only be used to make thetransfer.Any hint or help is welcome. VicnePS : All recordings (single or multiple files) are accompanied...

V

Vicne 18 years ago

Unfortunately, I admit I never accessed a disk in a raw manner (by sector, not by file), so finding the "glue" for DOS still puzzles me, but I think with such source code, I'm getting very close.

Thanks a million.

Vicne

Vote

A

Alexei A. Frounze 18 years ago

I don't know if this helps or not, I wrote a FAT module and a tiny standard C library using that and the BIOS (that is, runs with DOS and without as long as the BIOS is out there) to provide standard file/dir I/O. It probably won't support something truly weird as that FAT flavor of yours so you might need to turn off some checks in the code. There's no rename() function, but you could implement it -- just look how delete() finds and changes a directory entry. You can also hack the code to find/open all files (even with duplicate names) and simply copy them over, although that may not be quick as I didn't optimize the code for speed. The fastest may be to rename everything using a custom tool, then copy over using windows. The code's at

formatting link

. I don't know the true reasons behind the weird directory structure (well, naming), I'd just probably use first 4 chars of the file name for the video # and last 4 chars for the part #. If 10000 of each isn't enough, letters and digits can be used together to encode numbers in base-36. :)

Alex

Vote

S

Stefan Reuther 18 years ago

Maybe I should've been a little more explicit.

My filesystem has, like any filesystem, a block cache of a bounded size. Every block read from the disc is stored in there. When someone now wants to open a file, I first look at the blocks I have in the cache whether I find the relevant directory entry there. Only if I don't find it there, I read the directory regularily from top to bottom. This works nicely for ISO9660, because directory entries always start at sector boundaries. It will also work for FAT. I implemented that to avoid thrashing on *huge* directories (e.g. WinXP's i386 folder).

If you now have two files with identical names, it depends on the buffer cache content (and thus, processes running in parallel, for example) which instance you'll get.

Even if your buffer cache doesn't work this way, I gather that some other operating systems maintain name->disk position mapping caches, which have essentially the same problem: if the first entry has already expired but the second one has not, you'll access the second one.

Stefan

Vote

A

Antonio Pasini 18 years ago

I would try the following:

1) Create a raw dump of your whole disk in a file; you can use DD on linux, or WinHex (need registration) or similar utilities; you could also try rawcopy.zip from

formatting link

which is free. I didn't try that.

2) From the link above, use ImDisk to mount your image file as a a virtual drive. I used the control panel, but you can do by command line also.

3) You can access the virtual drive using low level functions (sector by serctor) from WinXP using CreateFileA(), ReadFile and WriteFile().

Have a look to:

formatting link

the code is similar to:

// drive: 0=A char unitName[] = "A:"; char _devicename[] = "\\\\.\\A:"; assert('A'

Vote

V

Vicne 18 years ago

formatting link

That's great. I had a look at it and it could definitely be a starting point. I'm a bit puzzled by the note in the readme : "It doesn't do any direct disk I/O because the I/O functions are callback functions specified by the user." Funnily, I also had a look at larwe's FAT code and it has a rather similar sentence in his readme saying "You must provide functions to read sectors into memory and write them back to the target media.").

So how does one actually read raw sectors from a 160Gb drive under DOS (FreeDOS for example) ?

What is needed for your code to actually perform some I/O ?

Yeah, there are dozens of solutions to avoid conflicting filenames, but well, none is implemented so let's live with that :-)

Thanks very much for your help and advice

Vicne

Vote

V

Vicne 18 years ago

This snippet is very interesting. Indeed, if I go the XP way, that's exactly what I need.

However, I always fear XP will mess with the drive, forcing me to do the imaging step as you describe, which is rather time consuming (1h30 for the 160 Gb drive). So I'm more tempted to go the minimal OS route, and I'm currently thinking of using FreeDOS to perform the copy operation, directly reading from the PVR drive...

Oh yes, that's interesting too. I'll have a deeper look into that lib for sure.

Yes, image is definitely the way to go during the trial phase.

Thanks a lot for your advice and pointers.

Vicne

Vote

V

Vicne 18 years ago

at

formatting link

Looks promising indeed. I'm more and more tempted to do all processing from MS-DOS or FreeDOS maybe, so it looks well suited. Thanks very much !

Yes, I also think one could find dozens of ways to generate distinct filenames, but well, let's live with that...

Thanks again

Vicne

Vote

L

larwe 18 years ago

Absolutely not, because the rename operation either invalidates or updates the cache. Same answer as I gave earlier: if this functionality was broken, then the OS as a whole would be broken.

Vote

S

Stefan Reuther 18 years ago

Your algorithm was: 3. copy CLIP2.MPG to another location 4. delete or rename CLIP2.MPG - the FAT code will rename the first My point is simply: you cannot be sure that the "copy" operation accesses the same instance as the "delete or rename" operation, because there is no guarantee that you access the first instance both times. Some cache *might* have remembered the second directory entry, so you access that, not the first one.

You can hardly call this a broken OS, because you're giving it an invalid filesystem image. For those, *my* requirements are not "behave consistently and correctly", but "don't crash and don't make matters worse" -- simply because defining what is "correct" can be rather hard if input is a random byte soup.

Stefan

Vote

D

Didi 18 years ago

Hmm, sounds like you two mean different cache data types.. :-). Larwe means disk cache as I know it (and have done it for DPS), it holds disk logical blocks. It may happen these blocks hold a directory, and it is much likelier for a directory read to get cached etc., but any higher level sees the same data, so there is no way to present different images to the copy/delete/rename. What Stefan seems to to mean is cacheing directory entries at a higher level than disk I/O; this could perhaps cause the mess he refers to. At least this is how I get it so far :-). But the entire thread is centered about how to fix the problem without seeing a hex dump of the offending - duplicate - directory entries in question... is it worth the effort? It should not be so hard to do. It may turn out the duplicate directory entries point to the same FAT chain in the first place... :-).

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

Vote

L

larwe 18 years ago

YES YOU CAN. Sheesh. Read my original post. I said you mount the image via the loopback device. This runs the standard filesystem (FAT in this case) code, and over/underlying layers, using an underlying filesystem layer to do the sector-level I/O within the image. It is indistinguishable from a normal FAT volume.

If the OS is caching pointers, disk positions, filenames, etc, it is doing it the same way on this virtual drive that it would on a "real" drive. There is no backdoor being used here that might cause some cache coherency issue; both the original copy and the del/rename are going through the same front door route through the overlying FAT code (and volume/filename management code if separate).

I.e. the only way my scenario could fail would require that the OS itself be intrinsically unable to handle the idea that a file might be renamed.

Vote

S

Stefan Reuther 18 years ago

I did. No reason to shout.

Exactly. Now imagine this scenario:

You are listing the directory, by typing "ls" or something similar. The operating system reads in the disk blocks making up the directory, and places them in the cache. Let's say, block #17 contains the directory entry for the first instance of CLIP2.MPG, block #18 contains the entry for the second instance.

Now you perform your step 3. Most likely your OS will find the directory entry from block #17 and copy that file.

In the meantime, another process (or maybe just the "copy" process) also uses the cache, and happens to eject #17 from the cache.

Now you want to perform step 4. Your OS looks at its cache, and sees that it still has #18 in it, so it checks that entry first before starting expensive disk accesses. Luckily, that #18 contains a directory entry for the file you want. Thus it deletes/renames a different file than what you copied in the first place.

This optimisation is valid because it works on valid file systems. And don't say "nobody does that", because my implementation does it (and I think it's a quite obvious optimisation).

The lesson should be: if there is an atomic operation to perform the thing you want - separate two identically named files -, use that, and don't try to do it with multiple operations.

Stefan

Vote

D

Didi 18 years ago

Well this scenario obviously works. I was about to question how practical it is to do so - but at a second thought I can see it can be practical. Not sure if it is worth the extra interaction between device dependent and device independent OS layers, but I guess this "depends". What I have done in DPS is as follows: the device independent layer writes to one or maximum two (consecutive) device blocks when updating a directory entry. Further, it allows the user level code to resume searches at the point after the last search end because of a match. What the cacheing thing does is see if a directory is being accessed, and if so - and if the directory is contiguously allocated and not longer than the max. cacheable size (64k currently, but settable), just read and cache the entire directory (this is the typical scenario, will hit in 99+% of the cases). When writing to the directory, since it is entirely cached as a single cache entry, it cannot be split into cached and uncached blocks. The cache pushes entries on a LRU (Least Recently Used) base, so during directory intensive r/w (copying multiple files etc.) the directory will most likely stay cached; when pushed, the entire cache entry - up to 64k - is written to disk, if dirty. Given that you have your method implemented already I guess I may choose to follow suit, though :-). What I think I will add to my above described mechanism is a bitmap of dirty disk blocks per cache entry, and see how to write them back to disk when it comes to that. Sounds like a good idea anyway - I guess I'll add it next time I dig around those things.

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

On Oct 31, 7:57 pm, Stefan Reuther wrote:

Vote

A

Alexei A. Frounze 18 years ago

at

formatting link

If you're unfamiliar with the concept of a callback function... What can I do? Only ask you to find the definition and look around the code. :) And also read the readme's. Since there's a working application in the archive (FAT Commander), one would expect there some code for actual disk I/O, at least something. One hint: look at FAT_Init(). If you don't see it, look at the place where it's called from (only two places).

Alex

Vote

M

Maxim S. Shatskih 18 years ago

I have major doubts that any mainstream OS does this "optimization" which optimizes nothing. All FAT implementations I know do the linear directory scan.

Why optimizes nothing? because, if this page (cache of modern OSes is coupled with memory-mapped files and so works in terms of CPU pages and not blocks) is already present in memory AND the entry is found there, then there is no disk access, with optimization or without it.

And, if the page is not in memory, then the expensive disk access is needed anyway.

Also note that the sequence of FAT LFN entries can cross the page boundary.

Maxim Shatskih, Windows DDK MVP StorageCraft Corporation maxim@storagecraft.com http://www.storagecraft.com

Vote

S

Stefan Reuther 18 years ago

As far as I know, Linux caches name->inode mappings (dentry cache) which is essentially "vulnerable" to the same problem.

There will be *many* disk accesses if the directory you're looking at is larger than your buffer cache. The linear directory scan will start at the first sector (which is no longer in the cache, hence needs a disk access), and ultimately proceed till the sector containing the directory entry (which was in the cache when the operation started, but was pushed out to make room for the other sectors).

Without that, my implementation took several minutes to read WinXP's "i386" folder using a readdir/stat loop. Now, it's down to a few seconds without even needing to seek.

An alternative would have been to implement a dentry cache. Doing it this way way much simpler, all it needed was an enumerate-cached-blocks function for the buffer cache.

If you want to do LFNs, this can be trivially taken care of.

Stefan

Vote

V

Vicne 18 years ago

:-) No, I know what a callback function is. I was just wondering what to put inside :-)

Yeah, sorry, I had not browsed the code yet. A functional application obviously does actual I/O. The int13h.c filename is rather self- speaking regarding this :-).

it's called

Yes, I can see it indeed.

I must say I'm impressed by how exhaustive your implementation is...

Thanks a lot for your help.

Vicne

Vote

L

larwe 18 years ago

I am still unconvinced. I maintain that any FS layer that cannot deal with renaming a file is broken. However even if this was an issue, the OP could force a cache commit, e.g. by unmounting and remounting the volume - or will you now assert that the OS will keep a cache valid even across media ejects?

Vote

S

Stefan Reuther 18 years ago

The problem ist that multiple lookups of a file may yield different results if the FS assumes file names are unique, but on disk they are not. I wouldn't call that a big surprise. Renames are unproblematic.

Stefan

Vote

M

msg 18 years ago

FWIW, it would be nice if the O.P. would post hexdumps of the FAT and dirctory tables; could there be some sort of versioning ala VMS?

Regards,

Michael

Vote

Multiple files with same filename on FAT

Join the Discussion

Didn't find your answer?