Filesystem syntax constraints under Windows

Don Y 11 years ago

Hi,

Does anyone know which filesystem naming constraints are imposed in Windows itself vs. the file system layer? Said another way, which constraints are *invariants* regardless of the filesystem?

E.g., can a non-native filesystem redefine '+' to replace "../"? Or, allow support for ':' in names? Or, replace '\' with '/'? Or, ...

Thx,

--don

glen herrmannsfeldt 11 years ago

Interesting question.

It is well known that the actual system calls accept either / or \ as a separator. The command option processor uses / for options, and so requires \ for the separator.

The more interesting cases come when you use NFS.

I was just reading yesterday that it is possible to have an NFS server on a Windows system that allows for case significant names. (Unlike most that are case preserving.)

The NFS protocols are independent of the actual separator.

The : in drive selection has to be processed pretty early. I am not so sure what to say about : later in the name.

As far as I know, CMD has to keep track of the current subdirectory. That would complicate any other treatment for \ and :.

-- glen

David Brown 11 years ago

I believe you can store case-sensitive named files on NTFS as well - it's part of the posix compliance of the filesystem (harking back to the days when MS were still pretending to cooperate with other OS's).

Don, why are you asking about this? Are you trying to implement a non-native filesystem and want to support as much as Windows allows? If so, then the answer might depend on how that filesystem interacts with Windows. (I am not sure that I can give you more information no matter what you answer - but it might be of some help.)

Even if it is possible to redefine things like directory separators, it might cause a lot more confusion and therefore not be worth implementing (just like using files whose names differ only by letter case).

Also consider path lengths in this - the path length limitations are different within Windows itself and in NTFS.

Don Y 11 years ago

The problem is, these are just observations from *outside* the system. You (I) don't know if the "system" imposes these conventions... OR, if it relies on the filesystem implementation to impose conventions that are appropriate for that specific file system!

E.g., imagine the "system" invokes a file system specific *method* to "parse pathname". In that case, all the system does is parse enough of a pathname to get to a particular mount point, *notice* which sort of file system is mounted *at* that mount point, then pass the balance of the pathname off to filesystem->parse_pathname().
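The delegation imagined above can be sketched roughly as follows. This is only an illustration of the idea, not a description of how Windows actually works; the `System`, `SlashFS`, and `ColonFS` classes and the "X:" and "Y:" mount points are hypothetical names invented for the example.

```python
class SlashFS:
    """Hypothetical filesystem that treats '/' as its separator."""
    def parse_pathname(self, rest):
        return [part for part in rest.split("/") if part]


class ColonFS:
    """Hypothetical filesystem that treats ':' as its separator and allows '/' in names."""
    def parse_pathname(self, rest):
        return [part for part in rest.split(":") if part]


class System:
    def __init__(self):
        self.mounts = {}  # mount point prefix -> filesystem object

    def mount(self, prefix, fs):
        self.mounts[prefix] = fs

    def resolve(self, pathname):
        # The system parses only far enough to find the longest matching mount point.
        matches = [p for p in self.mounts if pathname.startswith(p)]
        if not matches:
            raise LookupError(f"no filesystem mounted for {pathname!r}")
        prefix = max(matches, key=len)
        # The balance of the pathname is handed to the filesystem untouched.
        return prefix, self.mounts[prefix].parse_pathname(pathname[len(prefix):])


system = System()
system.mount("X:", SlashFS())
system.mount("Y:", ColonFS())
print(system.resolve("X:/a/b"))
print(system.resolve("Y:Garage:Door/Left:open"))
```

If the host worked this way, the second filesystem would be free to give ':' and '/' entirely different meanings from the first.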

Ancient versions of MS C had library routines to parse pathnames that hard-coded such separators. But, that still doesn't indicate if this was mimicking a service performed within the filesystems of that era *or* was the sole mechanism for handling pathnames.

That's an idea! I have NFS client and server running under Windows. I can mount an external filesystem and see if the Windows client recognizes "FILENAME" and "filename" as two coexisting files. And, if it "does the right thing" when I refer to one or the other.

Similarly, export a portion of the Windows filesystem ("Foo") and verify that it can ONLY be accessed as "Foo" (and not "fOo" or "FOO").

Again, it depends on whether the *system* implements this as a "rule". You can think of drive letters as objects at the "root" of the filesystem. Rules for objects at that level may require ':' (among other top level name conventions).

Or, it could be a hardwired prohibition elsewhere in the filesystem hierarchy.

That's only an issue if *it* hardcodes an algorithm for extracting current directory (instead of calling a filesystem specific *method* for doing so).

Don Y 11 years ago

Preserving (and even *enforcing*) case doesn't guarantee that identifiers differing *solely* in case can coexist in the same container! E.g., ReadMe, READme, ReAdMe, etc.

I don't have a filesystem. Rather, the typical filesystem concept is used to manage a universe of (possibly parallel) nested namespaces. Each object defines the rules for the namespace(s) that it exports. I.e., the valid syntax for identifying portions *of* that object.

So, a "directory/folder" (to use a familiar concept) might support objects named:

- ReadMe.txt

- README.TXT

- Read/\/\e

- Garage:Door:Actuator

- OutsideTemperature

The Garage:Door:Actuator object might support objects (methods) named "open" and "close" and "current_state".

I want to be able to make any of these named objects accessible under various other environments. If the host OS('s) enforce their own concept of "what constitutes a name", then I either have to adopt names FOR EXPORTED OBJECTS that are compatible with those OS('s) -- i.e., some GCD thereof -- or provide a translation interface (create a parallel namespace for exported objects such that the exported names comply with the rules of the host OS).

OTOH, if name parsing is left to the filesystem implementation (i.e., as it is in my implementation), then all I have to do is port my implementation to each of those host OS's.

I'm not keen on letting the tail wag the dog. If Windows has a limitation, that's Windows' problem. I'd be comfortable having Windows users bear the inconvenience of Windows' limitations (just like I wouldn't restrict myself to 8.3 names just to make life easy for DOS users).

Yes. But I think much of that issue inherently "goes away" when addressed as an exported namespace. I.e., ONLY what the (Windows) user needs to see has to be made available to him/her. And, at some convenient "mount point" (I guess that's still "drive letter" in the MS world).

So, the exported namespace can represent:

- some/very/long/traditional/pathname/to/a/file as "file"

- some/other/file as "file2"

- some.particular\method as "verb"

- That(&*^@$%(Fool as "A_Fine_Gentleman"

[note I've tried to show how to accommodate the host's naming rules as well]
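A minimal sketch of such an exported namespace, assuming it is nothing more than a flat table from host-friendly names to native names. The four pairs are the ones listed in this post; the `EXPORTS` table and the two helper functions are hypothetical.

```python
# Keys are the exported (host-friendly) names; values are the native names.
EXPORTS = {
    "file": "some/very/long/traditional/pathname/to/a/file",
    "file2": "some/other/file",
    "verb": "some.particular\\method",
    "A_Fine_Gentleman": "That(&*^@$%(Fool",
}


def list_exported():
    """What the host sees: one flat container of names."""
    return sorted(EXPORTS)


def lookup(exported_name):
    """Translate an exported name back to the native name it stands for."""
    return EXPORTS[exported_name]


print(list_exported())
print(lookup("verb"))
```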

Don Y 11 years ago

In case it wasn't obvious: to the "external system", this LOOKS like a single "directory" with four names in it. It is *flat*! (despite the fact that the objects all existed at different "places"/levels in the system exporting them)

David Brown 11 years ago

In theory (I haven't tried this), you can create multiple files in an NTFS directory that differ only by case, because it is required for posix compatibility. But even on *nix systems, where this works perfectly well, it is not recommended practice because it can easily confuse people.

If I understand you correctly (I'm not sure I do fully - but that's okay for now), then I suspect your best idea is to restrict your names to plain English alphabet letters, assuming case-preservation but not case-sensitivity, with a few specific punctuation symbols. Treat "/" as a directory separator - it should work fine on almost any reasonable system.

Punctuation that usually works without trouble is ".", "_", "-" and "+". Most other symbols will work in some circumstances, but can cause issues in particular cases (such as needed escapes or inverted commas when used from the command line).

So if you want a namespace separator, without causing directory changes or other complications, pick one of "._-+".

Dimiter_Popoff 11 years ago

Hi Don,

as David already suggested your best chance is to limit what you accept (and thus have to process) as much as practical. The problems are by far not just related to how this or that character is treated; e.g. at the moment I am struggling with how I process names in _my_ dps scripts using _my_ longnamed directories to be compatible with _my_ older 8.4 directories. The worst issue I have is with names containing spaces; treating those in a script makes word count variable, preserving space count between words must be addressed, forwarding a name which was in quotation marks which get "eaten" during the former processing etc. etc. If you can afford to disallow spaces in names things will be much simpler. Which is not really practical as other systems do allow spaces, so we have to handle this as well.

As a side note, "the right way" to treat file names is to preserve the case information and to ignore it during file search (i.e. aaa and AAA locate the same file). Obviously unix machines will have the improper case treatment problem forever but it is their problem after all, they have saved a few minutes of thinking back when they created it. My solution to that was to do the right thing and allow also the wrong thing to be done; i.e. if you compare the textual part of the name (which is stored as upper (or was it lower) case only) you locate the same file, if you want to distinguish also by case you have to compare another few bytes which carry the case bits for the textual part. But I am not sure I even made a call doing the case dependent search, if I do I have never used it so far anyway.
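One way to picture the scheme described above is the following sketch. It assumes ASCII names, stores the text part upper-cased, and keeps one case bit per character; the `Directory` class, its method names, and the choice to reject a second name that differs only in case are assumptions for illustration, not details of dps.

```python
def encode(name):
    """Split a name into upper-cased text plus one case bit per character."""
    case_bits = 0
    for i, ch in enumerate(name):
        if ch.islower():
            case_bits |= 1 << i
    return name.upper(), case_bits


def decode(folded, case_bits):
    """Rebuild the name as it was originally spelt."""
    return "".join(
        ch.lower() if (case_bits >> i) & 1 else ch for i, ch in enumerate(folded)
    )


class Directory:
    def __init__(self):
        self.entries = {}  # folded text -> (case bits, payload)

    def create(self, name, payload):
        folded, bits = encode(name)
        if folded in self.entries:
            raise FileExistsError(name)
        self.entries[folded] = (bits, payload)

    def find(self, name):
        """Case-insensitive search: compare only the folded text."""
        return self.entries[encode(name)[0]][1]

    def find_exact(self, name):
        """Case-dependent search: also compare the case bits."""
        folded, bits = encode(name)
        stored_bits, payload = self.entries[folded]
        if stored_bits != bits:
            raise FileNotFoundError(name)
        return payload

    def stored_name(self, name):
        folded, _ = encode(name)
        return decode(folded, self.entries[folded][0])
```

With this layout the common search touches only the folded text, and the case bits are consulted only when a caller explicitly asks for an exact match.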

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

Stefan Reuther 11 years ago

Here is what Microsoft has to say on the topic:

TL;DR, the reserved characters are

- "\", "/" for the path separator

- ":" for the drive letter separator, and to separate file names and alternate data streams

- "?" and "*" because they are used in wildcards

- ">", "

David Brown 11 years ago

This is getting a bit off topic from Don's question, but it is very clearly a matter of opinion as to what is "the right way" to handle file name cases.

I would say that the unix way is the only sensible way, because it is transparent and consistent - the system does not care about the characters used, and does not make any artificial and language-specific distinctions. When people started using UTF-8 filenames, the unix way needed no changes, and everything continues as normal.

But if you start trying to say that some characters are equivalent to particular other characters, you have an endless task as soon as you start looking at anything other than plain English. How is the OS, or

letter but different capitalisation? And why artificially decide that small and large letters in the English alphabet are equivalent - what about other combinations that are considered the same letter(s) in other

should be treated the same. In Norwegian, in some cases you would

are (AFAIUI) 5 different characters for each letter - should these be considered equivalent in file names?

I have no doubt that in a well-designed OS and filesystem, you either have to treat different cases as completely distinct, or you limit the whole system to 7-bit ASCII.

Don Y 11 years ago

The user interface to the (NTFS) filesystem doesn't provide a mechanism for doing so. E.g., creating "ReadMe" when "README" already exists doesn't add another file (it simply overwrites the first one -- I think preserving the original name, as well -- NOT the "new" filename!)

I will have to try it in an exported portion of an NTFS filesystem accessed via NFS. Or, an NTFS filesystem under some other OS...

That's the tail wagging dog approach. What I am trying to understand is how Windows (and other OS's) parse pathnames -- where that functionality resides. I.e., is it filesystem specific or embedded in the OS's notion of "what comprises a pathname".

E.g., Windows recognizes // prefix for network shares. You can conceivably argue that "ftp://" is just another part of the notion of a filesystem.

(In my case, I can create an object named "ftp://" at some particular point in some particular namespace -- e.g., /my/personal/directory/ftp:// -- and bind an "FTP resolver" to that "ftp://" object. Then, when any application walks that path, entering the "ftp://" object causes *it* to parse the balance of the pathname: /my/personal/directory/ftp://google.com/foobar)
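A rough sketch of that walk, assuming each object exposes a `walk` method that consumes as much of the remaining pathname as it likes and returns the next object plus whatever is left. `SlashDir`, `FtpResolver`, `Leaf`, and `resolve` are hypothetical names invented for the example, and the resolver only records the host and remote path rather than contacting anything.

```python
class Leaf:
    def __init__(self, label):
        self.label = label


class SlashDir:
    """Container that consumes one bound name per walk step."""
    def __init__(self):
        self.children = {}

    def bind(self, name, obj):
        self.children[name] = obj
        return obj

    def walk(self, rest):
        rest = rest.lstrip("/")
        if not rest:
            return self, ""
        # Longest names first, so a bound name such as "ftp://" may itself contain '/'.
        for name in sorted(self.children, key=len, reverse=True):
            tail = rest[len(name):]
            if rest.startswith(name) and (
                name.endswith("/") or tail == "" or tail.startswith("/")
            ):
                return self.children[name], tail
        raise LookupError(rest)


class FtpResolver:
    """Takes over parsing of the whole balance of the pathname."""
    def walk(self, rest):
        host, _, remote_path = rest.partition("/")
        return Leaf(("ftp", host, "/" + remote_path)), ""


def resolve(root, pathname):
    obj, rest = root, pathname
    while rest:
        obj, rest = obj.walk(rest)
    return obj


root = SlashDir()
directory = root.bind("my", SlashDir()).bind("personal", SlashDir()).bind("directory", SlashDir())
directory.bind("ftp://", FtpResolver())
print(resolve(root, "/my/personal/directory/ftp://google.com/foobar").label)
```

Each level costs one `walk` call on a different object, which is the expense referred to in the next paragraph.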

I.e., parsing a pathname is expensive, for me -- because each "level" requires invoking a "walk" method on a different object. But, it isn't done often -- if you want to make walks quicker, create a second namespace and bind the final entities (foobar in the above example) to nice short names at the TOP of that namespace! The new namespace acts like a cache of sorts.

*If* Windows passes the pathname for a given mount point to the "handler" for that mounted filesystem and lets that handler resolve it, then I should be free to map any characters into any functionality I want.

So, I can have:

- traffic/10.0.1.23/inbound

- traffic/fe80::1%lo0/outbound

etc. WITHOUT having to create some artificial naming scheme that maps the "ideal" names into something that "Windows" can handle: traffic/fe80__1&lo0

The whole point is to let each "object" define the rules for the names of its components (or objects beneath it) in a manner that makes sense in the context of the object.

E.g., if you treat "http://" as a top level object (an HTTP resolver), then *it* applies whatever rules it deems appropriate to the objects beneath it -- "host names". I.e., is case insensitive.

A object, in turn, defines the rules for objects beneath it -- "host directories". I.e., is case SENSITIVE.

In the "traffic" example, the next level happens to allow names to be bound to (things that look like) "IP addresses". So, the syntax allows characters that you wouldn't encounter in Windows, etc.

I don't have a universal notion of a "separator". E.g., the above example could just as easily have been: traffic10.0.1.23inbound trafficfe80::1%lo0outbound

[This can make for some interesting dilemmas if not used carefully! :> But, I think it worth the cost to not reserve a delimiter at each level in the hierarchy. So, you can have "foo" as a file coexisting alongside "foo" as a folder -- in windows (i.e., the type of object effectively adds another dimension to the identifier)]

Dimiter_Popoff 11 years ago

Not at all. A "name" is composed of text; text is written using some alphabet and the alphabets I know of (Latin and its variations, Cyrillic with its ones) are used by languages in upper and lower case universally. I don't know how this is with hieroglyph based languages so I leave these out of my comment - these would be subject to completely different treatment anyway. A name - e.g. DAVID or David, refers to the same name bearer in any language I can think of. File names are names and are intended also for human consumption.

How we deal with that at the bit level is a separate matter; the unix way in making DAVID differ from David is clearly wrong and shortsighted.

No, this is an exercise for some language processor. You store a name the way it is spelt, case info and language-specific characters choice upon entering it included.

The basic rule is that we do what has to be done and we don't do what has to be done elsewhere; when storing a name we store a name which is a text string (not just a sequence of bytes) and nothing more than that (i.e. we do not apply any grammar rules to it, we leave that for the name reader/writer).

Dimiter

Don Y 11 years ago

Hi Dimiter,

[Sort out your display/tablet issue, yet?]

That's exactly what I *don't* want to do!

I want an object to be able to define the rules for what constitutes a legitimate identifier IN ITS SCOPE in whatever manner it wants. For example, an IPv6 address should be allowed in the namespace (of an object that expects to deal with IPv6 addresses!). You shouldn't have to map some character with arbitrarily imposed "special meaning" (e.g., ':') to some OTHER character -- just because the filesystem's naming rules didn't anticipate using it *in* an identifier.

[Ditto '/' or '\']

If, OTOH, you had a more permissive implementation *before*, then this problem wouldn't exist!

E.g., filenames of the form ".cshrc" annoy Windows (though you can *trick* it to accept them... you just can't always *create* files having such names).

Likewise, I often encounter cases where I am importing a set of files created under UN*X and find name collisions where names differ only by issues of case. It seems counterintuitive to come up with a new scheme that has *less* capability than an existing one. :-/

Exactly. Building on what you say below, should we just ignore embedded whitespace (just like ignoring alphabetic case?). So, "Avery Littleman" and "A Very Little Man" and "AVERYLITTLEMAN" are all the same? :-/

I disagree. And, I think your comments reflect a particular manner in which you see filenames used. I.e., exposed to the user to name objects (files) that he has created.

I use names to identify objects that a piece of software (process) can access. Each process is created with its own, "personal" namespace which has been populated by its parent/creator (and, in doing so, *bound* to specific existing system objects known to its parent).

So, a TOP LEVEL namespace might be:

- My PID

- stdin

- stdout

- stderr

- My kmem

- My Files/

- garage door

The process that that namespace serves is responsible for knowing which names are important to it. It's not like exposing an entire filesystem to a user so he can indicate which *file* (collection of bytes on a medium) he wants to access.

Instead, they are conventions (for the most part) that the parent knows the child will observe. So, the parent can bind "stderr" to the "stdin" of some other process (with a completely disjoint namespace) and know that the child's error messages will be processed by that "other" process.

Note that, in my scheme, you are free to implement a "user visible container for files" (i.e., "a folder") and impose an entirely different set of rules, therein. When someone/thing tries to "walk" (parse_pathname) through that folder, the method associated with that type of object ("user visible container for files") can opt to ignore case. Or, treat numbers as Roman numeral equivalents. Or...

OTOH, if you impose some restriction on the *basic* name scheme, you can never "enhance" it (without a lot of compatibility grief).

Dimiter_Popoff 11 years ago

Hi Don,

Well you will always need to define word (or argument) separators etc. So you will always have to map this to that at times, I don't think you can avoid that if you expect a human to be the end user of the system speaking English or sort of alphabet based language. File names are meant for human consumption so they do bear the legacy of human languages, there is no way around that.

Well detection of duplicate names is _more_ capabilities in my book :-). Just as failure to find a name because someone typed it in a different case is _less_ capabilities. The proper handling of a duplicate name coming from a unix directory would be to read the name, detect the duplication and rename using some convention, possibly just append something past the duplicate part (so it can easily be found by partial compare).
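The import step described above might look something like this sketch. The "~N" suffix is a hypothetical renaming convention chosen for the example, and `str.casefold` stands in for whatever case-insensitive comparison the target system really uses.

```python
def import_names(names):
    """Map case-sensitive source names to names that stay unique when case is ignored."""
    taken = set()
    mapping = {}
    for name in names:
        candidate = name
        counter = 1
        # Append a suffix past the duplicate part until the folded name is free.
        while candidate.casefold() in taken:
            candidate = f"{name}~{counter}"
            counter += 1
        taken.add(candidate.casefold())
        mapping[name] = candidate
    return mapping


print(import_names(["foo.S", "foo.s", "README", "ReadMe"]))
```

The renamed entry still begins with the original name, so a partial compare finds it.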

No, space is a symbol. It just has no upper case, similar to digits, punctuation etc.

Yes, if you build on the assumption that a "name" is something not meant for human consumption the better way would be to use bytes, not text. But humans are constantly using files which are named, so a "file name" is basically meant for human consumption, this is much of what filesystems are about really.

Dimiter

Don Y 11 years ago

+1

*Which* character set? ARE NAMES EVEN PART OF *ANY* "CHARACTER SET"? I.e., why can't a name be a series of "8 bit slices" out of some N bit quantity? 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 ... (imagine how you would structure a "container" whose contents always followed this sort of rule! 1,000,000+ "names" and any *one* could be accessed in constant time!!)
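As a sketch of what such a container could look like, assume each name is a fixed-width, big-endian byte string that is simply an index into a preallocated table. The `SliceContainer` class and its method names are hypothetical.

```python
class SliceContainer:
    """Container whose names are fixed-width big-endian byte strings."""
    def __init__(self, size, width=8):
        self.width = width
        self.slots = [None] * size

    def _index(self, name):
        if len(name) != self.width:
            raise KeyError(name)
        index = int.from_bytes(name, "big")
        if index >= len(self.slots):
            raise KeyError(name)
        return index

    def bind(self, name, obj):
        self.slots[self._index(name)] = obj

    def lookup(self, name):
        # One integer conversion and one list index: no string comparison at all.
        obj = self.slots[self._index(name)]
        if obj is None:
            raise KeyError(name)
        return obj


container = SliceContainer(1_000_000)
name = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01])
container.bind(name, "second object")
print(container.lookup(name))
```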

They're *identifiers*. All they have to be is uniquely resolvable. They only have to make sense to the entities that reference them!

The cognitive stumbling block is getting away from them as identifiers that the user (has to) deal with. If you want to expose meaningful names to a user, then map the "cumbersome" names to something more friendly (in a parallel namespace).

E.g., think of the alphabet soup that Solaris uses to reference devices:

../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q

and the "aliases" (i.e., names in a different namespace) that it then applies to make life easy for mere mortals:

/dev/dsk/c1t0d0p0

and a user-created symlink that makes life even easier:

/bootpartition

glen herrmannsfeldt 11 years ago

(snip)

Hans-Bernhard Bröker 11 years ago

On 10.10.2014 at 19:53, Don Y wrote:

And right there, the logic breaks down already. There IS no such thing as "the object's own scope" when it comes to establishing a large-scale hierarchy of named things. As much as that obviously rubs you the wrong way, one just HAS to make some allowances to keep the system working sensibly on a global scale, even if that limits the local subsystem's artistic freedom.

Otherwise how is anyone (or, equally important, any thing) to know whether, e.g.,

foo\bar_baz

is a "foo\bar" with a "baz" in it, rather than a "foo" with a "bar_baz" in it, or even a "fo" containing an "o", which in turn contains a "bar_b" holding an "ar", without some global rule having set up that distinction aforehand?

In a nutshell: you can't have a hierarchy without clearly defined separators between levels.

Trying to avoid making such decisions on a global scale is not good design. It's actually the opposite: it's failure to design anything at all.

The tricky question being what you count as a "capability". There's about equally good grounds for calling it

a) a capability of Unix that it can actually store distinct files "Readme" and "README" in the same directory, and

b) a capability of Windows to find and open a file called "Readme" although your program actually misspelled that "README" in the open() call.

IMHO the Windows approach is just plain dumb silly. It just makes no sense to store the case of file names if you're not going to do anything else with it, like, say, actually respect it as an identifying aspect of that filename.

And that's before you consider internationalizations. How is, say, a German edition of Windows to know how Cyrillic or Greek letters map upper-case to lower-case letters? And even if the system knows, why should its user be expected to?

Dimiter_Popoff 11 years ago

Indeed, but storing a name the way a user prefers to use capitals in it or not means you don't lose that information - which the user may want (usually does) to have preserved. Capitalization in names serves indeed no other purpose than appearance. This is part of practically all languages so we have to be able to reproduce it. OTOH, being lazy and just comparing bytes rather than thinking a little beforehand how to lay out things so there is no significant overhead involved with the case processing serves no good purpose at all.

Dimiter

Don Y 11 years ago

But that's the difference: other OS's have "file systems". I don't. I have "object namespaces". There are no "files" in my system. No "Windows Explorer" that a user plays with to locate "objects".

I could just as easily assign each object a 64b global identifier that represents its "address" in some "memory space".

Once you have opened a file, network connection, etc. does your code ever think about its *name*, again? There's just an identifier somewhere in your code that connects it to the object in question. E.g., a network connection is just a tuple -- never has a "name".

I only need a bridge to other OS's (Windows in this example) to get things into or out of the system. E.g., if a user wants to print a diagnostic log from process 345345, he needs a "handle" to access that (or, a mechanism that effectively/implicitly provides that handle).

The other identifiers are used by the objects *in* the system to relate to *other* objects in the system. I.e., there is no notion of a user browsing my "namespace universe" from some "root node". Because most of it is meaningless to the user -- and often ephemeral.

"Gee, where did that mutex go? It was here a moment ago!"

Then things break. Because you can't know where all references to that "duplicate" name exist. It *wasn't* a duplicate in the original (UNIX) context. E.g., a makefile that refers to foo.S and NOT foo.s may now end up referencing foo.s, instead (if foo.S was detected as the "duplicate")

My point was spaces cause issues -- case sensitivity is an "issue" in your book. Why not "fix" the space issue in a manner similar to the case one? Just ignore them! Allow "A Very Little Man" to be treated as "AVERYLITTLEMAN" -- that way the user doesn't have to worry about remembering how *many* spaces or *if* there were spaces!

The "cost" (to the user) is just a shrinking of the namespace. In much the same way that case insensitivity shrinks the namespace.

Why force the representation to be "difficult to read" (by the developer)? There's a large, available domain of names, why keep the developer from picking names that are "significant" to him?

Why not label all variables in a form: V####?

Again, I'm not dealing with a filesystem but, rather, a namespace. Does the user care if the task that runs the air conditioner is called "AirConditioner" vs. "10034566"? If the user never has to interact with it, then the choice of identifier is not one of his concerns.

OTOH, why should the developer have to pick a unique N-bit number to refer to "the task that runs the air conditioner"? And, another one to refer to the diagnostic log generated by that task? etc.

How many "objects" exist inside your cell phone for which NO "user visible name" exists? Or, your "smart TV"/media tank?

The things that the user must be able to access/reference are the only ones that need to have names (and access mechanisms) that are "user friendly".

But, EVERY object has a name -- including those that are intended to be (or potentially) visible to the user. The problem that I am trying to address is how to make those names visible in a particular environment (e.g., Windows) *without* impacting the choice that the developer has to make in creating their "native names".

It looks like the only way to accomplish this -- given Windows' limitations/constraints -- is to create another "exported" namespace that maps the developer's names to names that Windows will tolerate.

That's not a big deal -- namespaces are an abundant commodity (and, things that the user is likely to need to see will be largely static in their names). But, it means planning on these other "exportable" namespaces when crafting those parts of the code. I.e., I can't just tag existing names and toss them, as is, into a container created for export.

This should let me get around any other "restrictions" of a foreign OS. E.g., I could handle 8.3 names just as easily -- though that may annoy a user having to sort out what they *mean* in such a highly abbreviated form! (But, that's consistent with the tail/dog comment I made before -- he's stuck in 8.3 land so that's *his* problem... not something that I need to impose on folks working in richer environments!)

Don Y 11 years ago

This suggests I just come up with some klunky OBVIOUS algorithm to translate my names to names compatible with Windows. And, push the problem of sorting that mess out onto the Windows user -- in much the same way that the 8.3 user has to guess at the shortened forms of LFN's.
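One possible shape for such an algorithm is a reversible percent-escape, sketched below. The reserved set is the one Microsoft documents for Win32 file names (< > : " / \ | ? * plus control characters 0-31, and no trailing space or period); the escape convention and the function names are assumptions for illustration, and reserved device names such as CON or NUL are not handled.

```python
# '%' is escaped too, purely so that the translation can be reversed.
RESERVED = set('<>:"/\\|?*%')


def to_windows(name):
    """Escape characters that Win32 file names do not allow."""
    out = []
    for i, ch in enumerate(name):
        trailing = i == len(name) - 1 and ch in ". "
        if ch in RESERVED or ord(ch) < 32 or trailing:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def from_windows(name):
    """Undo to_windows()."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == "%":
            out.append(chr(int(name[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


for native in ["fe80::1%lo0", "Garage:Door:Actuator", "Read/\\/\\e"]:
    print(native, "->", to_windows(native))
```

The result is ugly but mechanical, so a Windows user can at least reverse it by eye.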
