Filesystem syntax constraints under Windows

Don Y · 2014-10-10T07:31:29+00:00

Hi, Does anyone know which filesystem naming constraints are imposed in Windows itself vs. the file system layer? Said another way, which constraints are *invariants* regardless of the filesystem? E.g., can a non-native filesystem redefine '+' to replace "../"? Or, allow support for ':' in names? Or, replace '\' with '/'? Or, ... Thx, --don


Stefan Reuther 11 years ago

You have carefully ignored examples given, and stressed that things "have never worked and still don't work", without outlining what you would consider "working" and why. That's closer to religion than to technical debate for me. "Earth was never round, and still isn't."

Stefan


Dimiter_Popoff 11 years ago

Your best argument so far is "file names are not for humans but for programs".

It does not get a lot more laughable than that, you may want to stop trying to prove the Earth is flat indeed.


Don Y 11 years ago

C:\SfU> mkdir XXX

C:\SfU> cd XXX

C:\SfU\XXX> PATH=..\bin

C:\SfU\XXX> touch AAA

C:\SfU\XXX> touch aaa

C:\SfU\XXX> touch A:a

C:\SfU\XXX> touch A'a

C:\SfU\XXX> touch A`a

C:\SfU\XXX> touch B?b

C:\SfU\XXX> touch C*c

C:\SfU\XXX> ls
A'a A:a AAA A`a B?b C*c aaa

It doesn't seem possible to embed redirection operators in filenames regardless of quoting (e.g., A>a, A<a).

[...] to access them.

On the contrary, even Windows Explorer can SEE and ACCESS them!

C:\SfU\XXX> ls > foo

C:\SfU\XXX> cat foo
A'a A:a AAA A`a B?b C*c aaa foo

C:\SfU\XXX> cp foo B?b

C:\SfU\XXX> rm foo

C:\SfU\XXX> ls
A'a A:a AAA A`a B?b C*c aaa

C:\SfU\XXX> cat B?b
A'a A:a AAA A`a B?b C*c aaa foo

C:\SfU\XXX> mv B?b B?b.txt

C:\SfU\XXX> ls
A'a A:a AAA A`a B?b.txt C*c aaa

Now, double-click on "B[box]b.txt" in Windows Explorer and see the contents of foo. (B?b.txt)

Unfortunately, the rules Windows (and Interix) seems to follow aren't terribly obvious (on casual inspection).

I should try the same exercise from NFS (client and server) to see how yet another vendor's code behaves under Windows.


Stefan Reuther 11 years ago

This seems to me like it is using some Unicode character which looks like ":" or "?" when displayed on the console, but is actually something else.

Doing something like 'ls | od -vtx1', or 'ls > list.txt' and examining 'list.txt' with a hex editor might enlighten us.
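For anyone who wants to run that check without od or a hex editor, here is a minimal Python sketch. The helper names `codepoints` and `dump_directory` are made up for illustration, and it assumes the directory can be listed from Python at all. A real ':' shows up as U+003A; a look-alike character would show up as some other code point.

```python
import os

def codepoints(name):
    """Return the Unicode code points of a name as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in name]

def dump_directory(path="."):
    """Print each directory entry followed by its code points."""
    for name in sorted(os.listdir(path)):
        print(name, " ".join(codepoints(name)))
```

Calling `dump_directory()` in the test folder prints one line per entry, which makes it easy to compare what each tool claims the name is against what is actually stored.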

That would be the result of Explorer using the SHFileOperation function, which internally uses a FindFirstFile/FindNextFile loop to support wildcards. This loop will interpret "aaa" as a pattern which matches "aaa" and "AAA".

Which is precisely my argument against all this character set fiddling in a kernel :)

Stefan


Dimiter_Popoff 11 years ago

How do you manage to put files with the same name (case being ignored) into a directory? I don't know what and how windows does about this but in dps you can do this only if you hex dump into the directory file. After that obviously if you search for aaa* you will find as many hits as there are in the directory. If you search for just "aaa" the first match will be considered last (i.e. a directory with duplicate file names in it is considered broken) [actually searching non-ambiguous names goes through a different routine which compares on a 32-bit basis, as fast as the CPU would allow this to be done]. I can't see how one can do this through the user interface of windows, either. If you want to copy files with duplicate names (i.e. coming from a unix filesystem) the only correct way is to rename the file(s), e.g. by appending some unique sequential number or sort of.

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI


Don Y 11 years ago

No. All 7b ASCII codepoints! ":" really *is* ':'...

That was the point of:

C:\SfU\XXX> ls > foo

C:\SfU\XXX> cat foo
A'a A:a AAA A`a B?b C*c aaa foo

It gets weirder...

C:\SfU\XXX> dir /b
A'a AAA aaa A`a A?a B?b C?c

Note that "DOS" refuses to deal with the ':' and '*' characters and transforms them into '?' (which one would assume it would ALSO refuse to deal with!)

Note, also, the different sort orders (which each differ from Windows Explorer's wacky rules).

Yes, what was unexpected was the fact that a *single* file entry had been "selected" prior to invoking delete. I.e., their codebase assumes a "selection" can be non-unique.

Think of how much spaghetti code they must have in each of these "programs" ("commands")! I.e., if DIR sees a character in a name that it doesn't like, it maps it to '?'. If Explorer sees a character that it doesn't like (expect), it maps it to [box] (unless that character is '*' which it maps to 'nil')

Sheesh! Talk about Principle of Least Surprise... :-/

Note, also, that "names" are processed differently depending on where they are encountered in the command line. E.g., ls > File:List.txt


Don Y 11 years ago

I was taking a shortcut to avoid having to remotely mount the Windows disk as an NFS export (in which case, I hypothesized that I could create arbitrary file names from the NFS client).

Instead, I used MS's Interix subsystem (essentially, UN*X tools that run under Windows -- hence "ls" instead of "DIR", "cat" instead of "TYPE"? in my examples).

Windows is case preserving but case ignoring. So, IN THE ABSENCE OF ANY FILENAME CONFLICTS, I can create "AaA" and it will appear as "AaA" everywhere -- in Windows Explorer, in a DOS box looking at the folder's contents via DIR, etc.

However, thereafter, any references to "AAA", "aaa", "aAa", etc. will all resolve to this initial "AaA".

Under Interix, you bypass Windows' rules for names and write directly to the file system (disk media). So, I can create a file of an arbitrary name (well, not really... there are still some restrictions like I can't seem to embed '>' in a filename) even when a "case conflicting" filename exists.

So, "touch AaA" creates a file called "AaA" while "touch aaa" creates ANOTHER file -- called "aaa". Windows Explorer is smart enough (dumb enough?) to display these as separate files. And, will know which one to "open" if I select it with mouse.

This is exactly the problem I encounter when trying to "manage" large file collections that originate in UN*X *under* Windows. E.g., Makefile and makefile collide in Windows' namespace. So, I end up with one or the other (depends on which order they are REcreated).

Likewise, locore.S and locore.s, etc.
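A quick way to find such collisions before copying a tree over is to group names by their case-folded form. This is only a sketch: `case_collisions` is a hypothetical helper, and Python's `casefold()` is an approximation of whatever case mapping Windows actually applies.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that differ only by case.

    Returns a list of groups (each a list of the original names), in the
    order the first member of each group was seen.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Feeding it the contents of one directory at a time (for example from `os.listdir`) reports exactly the pairs that would overwrite each other under a case-insensitive lookup.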

Also, Windows has a trivial file/path name limitation that is regularly exceeded (while working IN windows as well as importing pieces of a file hierarchy from UN*X).

See, also, the other caveats that I posted in my recent reply to Stefan (e.g., DIR silently transforms filenames)

Bottom line, Windows is an annoyance.

"If Microsoft is The Answer, you're asking the wrong Question!"


glen herrmannsfeldt 11 years ago

(snip, someone wrote)

You might create files that Windows can't read, or doesn't like.

(snip)

Well, there are more than one file systems used with Windows, and the rules might be different.

FAT traditionally had an 8.3 (eight character name, three character extension) format. When they added longer names, the short names were still there, and might be considered the real name for the file. Some older utilities required them.

For NTFS, I believe the longer names are really part of the file system and directory, not quite the same as for FAT. It might be that NTFS can still supply a short name for programs that require them.

Even more fun in DOS days were files with names like COM3. I once had one on a disk (from a system with only two COM ports) and then brought the disk to a system with more COM ports. You can't get to the file! Even names like COM3.TXT still don't work.
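A minimal sketch of a check for these legacy device names, assuming the commonly documented reserved set (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) and that an extension does not rescue the name. The function name is hypothetical, and the exact rules differ between DOS and Windows versions.

```python
# Assumed reserved set; individual Windows versions may treat some of these differently.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def is_reserved_device_name(filename):
    """True if the part before the first dot is a legacy device name (any case)."""
    base = filename.split(".", 1)[0].rstrip(" ")
    return base.upper() in RESERVED
```

A tool that imports files from another system could call this on every name and rename the offenders before they reach a Windows volume.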

It might open them some way other than by name.

Also, the filename parser allowed either / or \ as separators, while command line DOS commands required \. I believe you don't want either / or \ in the file name. (That is, not a separator.)

-- glen


Dimiter_Popoff 11 years ago

Well, writing to a directory entry not through a system call would easily break the directory, sure. I don't have to tell you this is not the way you want to go in an end product (instability, unpredictability issues - how will the next OS version treat these invalid entries etc. etc.).

That's not surprising, once you trick the filesystem with an invalid name entry it will not try to do much on it. When it lists a directory it will just go through all the entries and list.

That is more surprising to me. It means they go through some sideways to locate the clicked file, not by searching for it by name. In DPS, this could be done by using DEN (directory entry number, well it is not a number but pool_no:cluster really), i.e. you list the names on the menu, then for each menu entry you store the DEN and access the file subsequently based on that (possible but impractical). Or you can just keep all the files on the menu open and access not by name but by "registration" (i.e. "handle").

Well I figured as much in the meantime :D . That was the underlying reason for your initial post I suppose.

It probably is much worse than an annoyance to program under but in the example above I would point the finger at the person who has been shortsighted enough to create duplicate file names in the unix environment first, then on the way the unix filesystem is made to allow duplicate file names being created by users.

I don't see how you can handle this situation without inserting a complete name handling layer between.

For example, this is what I did in a similar situation - when one wants to copy * from a longnamed directory into a shortnamed (old, 8.4) one. Files get copied by just using the first up to 8 characters and up to 4 past the last "." character; if such a name has been used already creation will fail (duplicate file name) and the copy code before retrying will modify the destination name by replacing the last 4 name characters (I think) by the text hex. representation of a counter which gets incremented every time it is used. No other way around it, you either have to maintain the file name data case dependent (human readable) or case independent (in that case 8 bytes per file would be plenty). Some bridging between these two fundamentally different cases will always be necessary if they have to coexist.
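A rough Python sketch of the copy-with-rename scheme described above, not the actual DPS code. Assumptions: names are compared case-insensitively, the counter is shared across the whole copy, and on a collision the base keeps its first 4 characters and gets a 4-digit hex counter appended (which replaces the last 4 characters when the base is a full 8 long). `to_short_name` and its parameters are invented for illustration.

```python
import itertools

def to_short_name(long_name, taken, counter):
    """Map long_name to an 8.4 name that is not yet in `taken`.

    taken   -- set of upper-cased names already used in the destination
    counter -- shared iterator of integers, advanced once per retry
    """
    stem, dot, ext = long_name.rpartition(".")
    if not dot:                       # no "." in the name: everything is the stem
        stem, ext = long_name, ""
    base, ext = stem[:8], ext[:4]     # up to 8 characters, up to 4 past the last "."
    candidate = base + ("." + ext if ext else "")
    while candidate.upper() in taken:
        # Duplicate: rebuild the base with a hex counter and try again.
        base = base[:4] + f"{next(counter):04X}"
        candidate = base + ("." + ext if ext else "")
    taken.add(candidate.upper())
    return candidate
```

Typical use is one `taken` set per destination directory and one `itertools.count()` for the whole copy, calling `to_short_name` for each source file in turn.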

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI


Don Y 11 years ago

That was the point of opening B?b.txt from within Windows Explorer (double click) to verify that the "correct" file would, in fact, be handed to Notepad.exe and that Notepad would be able to open it!

Yes. If you look at the raw directory contents, the encoding is fairly obvious -- as well as how it was "backward compatible" with an older DOS system (so you could access files created with LFN's on a machine that just runs DOS 4, etc.).

NTFS apparently also stores the "file name character translation tables" in the medium.

MS OS's must be *littered* with "special cases".


Don Y 11 years ago

But there is no evidence that this is the case! I.e., I suspect the Interix subsystem doesn't do "raw I/O". Rather, uses a lower level interface to the medium than the "GUI" OS does.

Recall, original DOS treated all filenames as singlecase. So, "AaA" was not accessible.

But, that's NOT the case in Windows! :>

E.g., Windows Explorer did not list the folder/directory contents the same way that the Interix subsystem did. And, from a DOS box, "DIR" performed silent translations on the file names!

E.g., "C*c" appears AS "C*c" on the actual volume. Interix displays its name as "C*c". The DIR command displays it as "C?c". And, Windows Explorer displays it as "Cc".

I.e., they are implementing special case processing even when the file name already exists (instead of just treating it as is!)

I assume the GUI code that displays a folder is given a list of names to display. When you click on a name (via mouse), it looks at cursor's (x,y) and maps that to a particular "line of text" in the display. Then, passes a pointer to "list entry number X" as the result of the "selection".

Only indirectly. I wanted to know what I could "get away with" in mapping my "names" to other contexts (e.g., Windows). I, for example, don't reserve '/' (or '\') as path separators. Or '>' as a reserved shell redirection operator (artificially prohibiting it in a filename). Or, ':' to indicate legacy "devices" (COM1:).

So, I could have names like:

Class::Member
I/O
READ/write
--->

But it is not "shortsighted" -- especially as UNIX predated ALL of MS's offerings!

Again, recall that I'm using these as "object names", not FILE names. So, they are created by a developer and hide *inside* sources. If, for example, process A creates an object and calls it "Fred", it is perfectly reasonable to expect consumers of that object to refer to it as "Fred" -- and not "fred", "FrEd", "fRED" or "Bob"!

E.g., a service can elect to name its CURRENT clients using a template like "client##". If that service later tried to resolve "ClIeNt23", it seems reasonable (nay, desirable!) that this name should NOT resolve (Gee, can't you remember what you called this client a few microseconds ago??)

I don't tolerate any deviation from that which was originally specified (*within* my system). "Say what you mean and mean what you say".

Dealing with "foreign systems" (e.g., Windows) is the only issue because it/they are inflexible in their naming conventions. :<

But, I can accommodate them by just planning on creating an exported namespace *intended* for their use.

E.g., if I want to make the object named "I/O" accessible, I can create a namespace in which "io_device" is mapped to the same object as *my* "I/O". If the foreign system can't handle lowercase characters, I can create a different namespace wherein "IO_DEVICE" maps to my "I/O". Or, "IO$DEVICE" for VAX fans...

If Windows (its GUI) wants to treat "Io_DeViCe" as an alternate name for the "io_device" that I export, then so be it. I just have to have a "method" that fabricates viable names for exported objects and have that method vary with the foreign system involved.
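One way such a per-system "method" might look, as a sketch only. The two manglers and `export_namespace` are hypothetical; the character set replaced by `windows_style` is the usual list of characters Win32 rejects in names, and the numeric suffix used to break ties is an arbitrary choice.

```python
import re

def windows_style(name):
    """Hypothetical mangler: replace characters Win32 rejects, then fold case."""
    return re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", name).lower()

def vms_style(name):
    """Hypothetical mangler: upper case, '$' for anything not alphanumeric."""
    return re.sub(r"[^A-Za-z0-9]", "$", name).upper()

def export_namespace(native_names, mangle):
    """Return {exported_name: native_name} for one foreign system.

    If two native names mangle to the same exported name, later ones get a
    numeric suffix so every exported name stays unique.
    """
    exported = {}
    for native in native_names:
        candidate = mangle(native)
        n = 1
        while candidate in exported:
            n += 1
            candidate = f"{mangle(native)}_{n}"
        exported[candidate] = native
    return exported
```

The native names are never altered; each foreign system just gets its own lookup table, so "I/O" can be reached as `i_o` from one side and `I$O` from another.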


Don Y 11 years ago

C:\SfU\XXX> touch A*a

C:\SfU\XXX> dir /b
A'a AAA aaa A`a A?a


Dimiter_Popoff 11 years ago

Ouch. Well, this is as huge an ouch as they likely make them. The only practical way out of this I see is to restrict the file names you let windows handle to the subset they handle consistently, the rest of the effort will be a (potentially huge) waste of time & effort. Even if you somehow manage to cover for all cases your solution will work only until their next version or even revision.

Uhm, relying on a character case you type in to have a different file name is nothing I would call otherwise :-). I can easily see how a programmer can be tempted to do so in a quick hack but it is similar to patching object code in hex without changing the source code "just this once to see what happens". Things like that are bound to bite back, who of us has not been bitten. (Actually I still do patch code sometimes like that but rarely and I must have become better at being cautious enough not to get bitten... or used to the bites and not even noticing them :D ).

I get that, so you are absolutely fine with say 64 bits per object ID (strictly it is not a "name", names are written in text and text at its basic level is case independent). But you just cannot feed binary data into something expecting text as an input and hope things would work, you will have to put a translation layer between the two. I just don't see any way around it. Say a file in your directory mapping all your 64 bit entries into text names or something.

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI


Don Y 11 years ago

I treat names as arbitrary length, 0x00-terminated "byte arrays/strings". The whole point is to leave things up to the developer/object-implementer to decide what constitutes a "good name".

E.g., one might choose {0x01,0x00}, {0x02,0x00}, {0x03, 0x00}... while someone else might choose I_Like_Insanely_Long_Names_1, I_Like_Insanely_Long_Names_2, etc.

In this way, someone could choose to encode information in a particular name "template" and select from among the available objects in his namespace based on some criteria that maps easily to the template he has chosen:

Client_Local_1_Priority_B
Client_Local_5_Priority_A
Client_Local_7_Priority_B
Client_Remote_2_Priority_B
Client_Remote_8_Priority_C
etc.

So, he can choose to select from among the "Client_Local_*" objects. Or, the "*_Priority_A" objects, etc.
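The wildcard selection described here can be sketched in a few lines of Python. This is an illustration, not how the system above implements it: it borrows shell-style patterns from the standard `fnmatch` module and uses `fnmatchcase` so that case is never folded, in keeping with the "say what you mean" rule. `NAMES` and `select` are made-up names.

```python
from fnmatch import fnmatchcase

NAMES = [
    "Client_Local_1_Priority_B",
    "Client_Local_5_Priority_A",
    "Client_Local_7_Priority_B",
    "Client_Remote_2_Priority_B",
    "Client_Remote_8_Priority_C",
]

def select(names, pattern):
    """Return the names matching a shell-style pattern, compared case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]
```

Note that `fnmatchcase` also gives `?` and `[...]` special meaning, so a namespace that allows those characters in names would need its own matcher.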

The entity that binds the names to the objects decides what is important to it (or, to its offspring as the initial bindings come from the parent spawning the process). *It* decides what sort of overhead it wants to support/incur.

Also, namespaces *tend* to be small. With how many "objects" does one of your processes typically interact? The whole point is to *ONLY* expose the objects that a process NEEDS to access *to* that process. Hide everything that it should NOT be mucking with by simply not providing a *name* for those objects that shouldn't be accessible!

(If you can't provide a name to the "System" by which it can locate the object for you -- based on the contents of the namespace that *it* maintains on your behalf -- then there is no way for you to access or operate on that object!)


George Neuner 11 years ago

In NTFS the colon is the stream designator. A:a is the name of a secondary stream 'a' inside file 'A'.

Not sure how you created it in the first place (unless by your *nix-like shell magic). Creating a stream requires additionally specifying type metadata in the name.

? and * are filename wildcards in DOS and Windows both ... and DOS doesn't know about NTFS streams.

COMMAND.com and CMD.exe show files in directory entry order unless they are deliberately sorted. Explorer *always* sorts - the default is "by name, grouping folders".

Which argues that whatever software you used to create those files is abusing the long name while still maintaining legal short names to differentiate them. Windows file system is case insensitive so "AAA" and "aaa" are the same file unless tricks are being played behind the scenes.


Try doing a DIR/X on the directory using CMD.exe and see what it says.

George


David Brown 11 years ago

He is not hacking the disk in some way - he is using Interix, which is written by Microsoft as a sort of mostly Posix compatibility layer for Windows. This layer does not use the Win32 API, but it uses the NT kernel services and system calls in the same way that the Win32 API does. Thus it uses posix-compatible API's to ask the Windows VFS system to create, read or write files, and the VFS system passes this on to the NTFS system. When using explorer, the code uses the Win32 API to talk to the VFS and then on to the NTFS.

What we are seeing here is that the NTFS filesystem is perfectly capable of holding filenames with almost arbitrary characters, and does not do any case-dependent handling (thus "a" and "A" are different characters, and can be different filenames). This is not surprising, since NTFS was designed to be usable in a Posix environment, and also since it uses a restricted UTF-16 (no multi-point characters) for filenames and does not attempt to include the vast set of rules needed to handle case dependencies.

We also see that through Interix, filenames are not mangled, except perhaps to handle "/" as a directory separator. Through the Win32 API, filenames are mangled in a variety of ways both going into the VFS, and coming out of it - and can be mangled in different ways depending on the particular calls being used. They are then further mangled by the application ("explorer.exe", "cmd.exe", etc.). This is no surprise either, given the history of the system which attempts to remain somewhat compatible with a range of different limitations in kernels and filesystems DOS, Win9x, NT, FAT, FAT32, etc.

And since the mangling and translations are done at different stages - some in the APIs, some in the applications, some in the libraries - there is repetition and inconsistencies. This is also no surprise, based on the development environment at MS - different groups handle different parts, but act competitively rather than cooperatively, with an appalling lack of documentation or references.


Don Y 11 years ago

No. The name of the file is "A:a" as reported by ls(1); "A[box]a" as displayed in Windows Explorer (I'd have to change to a full Unicode font to figure out what [box] really is); and "A?a" as reported by DIR.

I'm using MS's posix tools (Interix). I imagine I could do the same by exporting a folder via NFS and massaging it from a remote machine. Or, by mounting an NFS exported directory from a remote machine and creating these "legal" file names, there.

[I should do that and see what "A>a" looks like!]

Yes, but DOS sees "A*a" and "A:a" in the folder and maps BOTH of them to "A?a" in the DIR listing!

*Windows* is case preserving, case insensitive. But, NTFS is case *sensitive*. The Interix tools are creating "valid" filenames on the medium. Windows (and "DOS") are just having fits dealing with them!

E.g., "A*a" appears as "Aa" in Windows Explorer; "A?a" in a DIR listing and "A*a" when enumerated via ls(1).

Exactly (wrt names) as the "DIR /b" results cited previously!


Dimiter_Popoff 11 years ago

David, I am not sure what you are trying to explain but I do not think there are many people here who need to be told that a set of characters stored as bytes can be compared in a case dependent or independent manner. Key to the point is:

The filesystem stores all name related data it is given (i.e. without loss of information),
The user is not exposed to the bitstreams the OS stores but to text, which consists of characters that are part of an alphabet; for example, the Latin alphabet as used in English has 26 characters.

Don's problem is that he just cannot copy a unix directory if it contains duplicate file names to an NTFS one such that it is usable. Of course he can hack his way into doing it, whether through some MS written hack which you say is not a hack or otherwise.

The problem remains and will remain, as unix does not output names but file identifiers (names consist of text, remember the alphabet and the character count). The fact that these identifiers have been misused as text for decades does not mean much beyond the expectations of hardcore unix users that the English alphabet will suddenly begin to have 52 characters.

What exactly are you trying to prove, why do you keep on flailing. Why is it so hard for you to accept that you have overlooked a few simple, obvious facts and just move on.

Dimiter


David Brown 11 years ago

I am trying to explain that Don is not doing something odd or outside of the windows system here, as you seemed to think:

If he had used a disk editor to directly change the filenames, then I could understand your comment. But he has not done anything like that - he has used programs written by Microsoft to run on Windows, and used them to create filenames that other parts of Windows can't deal with properly.

Yes...

In other words, Windows mangles the names it is given.

That is one of his problems, yes.

The Win32 API allows files to be created or opened using "posix semantics" for filenames, including case-sensitive files, characters such as ":" and "*" in filenames, and multiple files differing only in the case of their names. Even if you want to call the MS-supplied posix compatibility layer a "hack", I don't think the standard Win32 API is a hack.

This goes back to your unique idea that files have a sort of colloquial human-friendly nick-name that is a different concept from their "filename" that everyone else uses.

If we were to accept that idea, then /all/ systems have that "problem" - because no system will be happy with a file system that uses approximate names instead of concrete identifiers. By that I mean that "index.html", "Index.html", "Index.html", "Index", "The index file", and "The first page" are perfectly good human-friendly names for the first page of a website - but no OS or filesystem would accept them as alternatives for a file identified as "index.html".

I am just trying to correct your (apparent) misunderstanding about what Don was doing, and how Windows and NTFS treat filenames.


Dimiter_Popoff 11 years ago

No. It reproduces the names exactly as the user has entered them.

You may think whatever you want but using a low enough level call to create invalid directory entries is a hack, whoever may have written the code within the system call. Non-hack application code does not go that low in order to defeat the system-wide rules or compromise the system in other ways, there are always plenty of opportunities to kill a system.

Blimey, so it is my unique idea that file names are meant also for human consumption/processing.

Are you sure you are in good health?

And you go further down the path into demonstrating that you are just flailing madly being unable to accept the simple fact that you said something stupid (can happen to everyone) and then defend that for days and days (does not happen to everyone).

Yeah, you always know better than everyone, I know. Never mind you have no clue what we are talking about really.

Dimiter
