Filesystem syntax constraints under Windows

Hi Don,

OK, I get that - in DPS this is the concept of runtime "objects", which have a 16 byte "name" but it is not for user consumption. I use text in it but this is just a matter of convenience when I code. When the system searches for an object it just compares bytes (well longwords in reality, it does that with filenames too actually :-) ).

But we were talking about a "file system", and in a file system one does have files which have names, some of which are created by humans and are supposed to be read and memorized by humans. At some level we do need the text of the name stored and searchable. If we just store the name text as bytes, we end up needing twice the search overhead to make the search case independent, which is why I think the unix makers back then left it case dependent: they did not want to be bothered. It takes only a little - a bit per character - to do it the way it is done in DPS, and you have the best of both worlds: case-free name information followed by the respective case bitstream. Somewhat more demanding to code but completely doable; it was for me, anyway.
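To make the "case-free name plus case bitstream" idea concrete, here is a minimal sketch. It is not the actual DPS on-disk format (which is not described in this thread); the function names are made up for illustration, and the sketch is restricted to ASCII names so that every character has a simple one-to-one case mapping.

```python
from typing import Tuple


def encode_name(name: str) -> Tuple[str, int]:
    """Split a name into case-free text plus one case bit per character.

    Hypothetical layout: bit N of the integer is set when character N
    was upper case in the name as originally supplied.
    """
    if not name.isascii():
        raise ValueError("this sketch only handles ASCII names")
    folded = name.lower()
    case_bits = 0
    for position, char in enumerate(name):
        if char.isupper():
            case_bits |= 1 << position
    return folded, case_bits


def decode_name(folded: str, case_bits: int) -> str:
    """Rebuild the original spelling from the folded text and the case bits."""
    return "".join(
        char.upper() if (case_bits >> position) & 1 else char
        for position, char in enumerate(folded)
    )


def same_name(a: str, b: str) -> bool:
    """Case-independent match: only the case-free parts are compared."""
    return encode_name(a)[0] == encode_name(b)[0]
```

A search compares only the folded text, so it costs the same as a plain byte comparison, while `decode_name` still returns the name exactly as the user typed it.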

Well if you do not need to reproduce the names you get back you can simply hash the incoming names into what, 64 or maybe just 32 bits and you are done. If you want to reproduce them, forget it; just storing them as you got them is the only sensible way to go. Which does not preclude you from "hashing" (in fact you can just use the addresses of the stored names or sort of) for your internal purposes, of course.

Oh come on Don, we all know the alphabet here, let's not go over it again.

Yes, for the DPS objects I wrote about earlier I do that sort of thing by making them do "listname" (or whatever the action was called). The plainest of objects just paste as text ("paste" at an address in memory, that is... :-) ) their 16 byte ID; more sophisticated ones which must be shown to the user have something better to paste (which may be static or not). But generally there is not much else you can do: if you need two different names for a thing, you need two separate names, what can you do. OTOH common file systems pose very few restrictions which will be in your way when you invent names during programming, so I am not sure at all you have a real issue here.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Dimiter_Popoff

Exactly. In my case, names are arbitrary length -- *like* file names in a modern OS. The cost of this is insignificant as a process typically has a *small* namespace. I.e., it knows nothing of objects that it is NOT SUPPOSED TO ACCESS!

When the process tries to access ("open") an object, initially, the process's namespace is the ONLY place that is examined for the object name provided. If a match is not found, there, then the object does not exist (in the context of that process).

In other words, if you don't want a process to be able to access an object, don't give the process any way of *referencing* the object in the first place.

And, since the process's creator (parent) can only reference objects that exist in *its* namespace, once you remove any references to an object from one process's namespace, it is inaccessible by any of that process's offspring!

There's no concept of a unified "global" namespace. So, you're never walking "long" paths from some "system root node".
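The rules above (lookups never leave the process's own namespace, and a child can only be handed names its parent can itself resolve) can be sketched roughly as follows. The class and method names are hypothetical and are not taken from the system being described.

```python
class Namespace:
    """A private name -> object table with no global root to fall back on."""

    def __init__(self, bindings=None):
        self._bindings = dict(bindings or {})

    def bind(self, name, obj):
        self._bindings[name] = obj

    def unbind(self, name):
        self._bindings.pop(name, None)

    def resolve(self, name):
        # Only this namespace is searched; a miss means the object
        # does not exist as far as this process is concerned.
        if name not in self._bindings:
            raise LookupError(f"no such object in this namespace: {name!r}")
        return self._bindings[name]

    def spawn(self, names):
        """Create a child namespace from names the parent can resolve."""
        return Namespace({name: self.resolve(name) for name in names})


parent = Namespace({"stdout": object(), "error_log": object()})
parent.unbind("error_log")          # the parent gives up its reference
child = parent.spawn(["stdout"])    # the child can never be given "error_log"
```

Because `spawn` goes through `resolve`, a parent that has dropped a name has no way to pass it on to its offspring.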

Exactly. "stdout" is far more meaningful than "1299". And, exactly *what* that "stdout" is bound to (in some "global" sense) is immaterial. A process never knows.

The filesystem analogy is the only thing that "others" could relate to. My namespaces are disjoint. I.e., how does something running on one of your netmca's reference an object on *another* netmca? There's no sense of "global naming" that each is aware of.

The "user" isn't creating objects (directly) in the same sense that a user creates "files" in a "filesystem". The user's *actions* result in objects being created. But, the user typically doesn't know -- or care -- what these are called.

There are very few things that the user "injects" and may later want to "remove". I.e., few cases where the user needs to *pick* a name -- and, later, remember it!

OTOH, there are places where the user (esp a developer-type) may want to inquire as to what's happening at some place in the system. Having to remember that "1299" is the error log for a particular process is tedious. Easier to give the process a unique name WHEN YOU DESIGNED IT and the error log that *it* creates an equally recognizable name WITHIN THAT CONTEXT so you can find it later without having to examine the equivalent of a "link map".

As namespaces tend to be small (consider how many objects one of *your* processes encounters in its LIMITED scope of operation), you can adopt simple schemes for maintaining "handles" on those objects.

E.g., one of my OS structures carries a name that is identical to its location in memory! Sure makes it easy to *find* it! :>

Yes. In my case, a particular "physical" object (bad choice of words) can have a bunch of names -- each different (or the same!) but in different namespaces. There can even be multiple references from within a single namespace! E.g., stdout and stderr can resolve to the same "physical" object -- which can only be accessed by *this* process through these two names!

Modern file systems impose naming constraints that are essentially arbitrary. Why can't I use '>' in a name? Oh, because some APPLICATION considers it a special symbol! Why can't I use ' ' in a name? Oh, because whitespace has historically delimited tokens. Why does the file name have to be "short"? Oh, because there is an arbitrarily low limit on total pathname length and you never know if some application (shell) will be called on to try to copy that file to a point in the hierarchy that has a long path prefix. (Recall, each process in my world has its own "root" node to its private namespace.)

Etc.

Don Y

I had completely forgotten the NTFS alternate data streams, since at least in early NT versions, there were several issues using these alternate streams.

There are similar issues with file systems supporting multiple versions of a file, such as VMS, which keeps multiple versions with the same name in the same directory. Version control software also saves multiple versions of a file. These can be problematic when trying to map them to a foreign system.

Sorting file names for human consumption is a very culture-specific issue, even with ISO/IEC 8859-x, not to mention Unicode. The sorting for display needs to be done on the user's machine, according to the language preference of the currently logged-in user. For internal data processing, a strict binary sorting order could be used, but for user interaction the cultural aspect should be taken into account.

One aspect that I haven't seen discussed in this thread is that a "file" does not necessarily have a single "name". While there may be physical allocations of blocks on a disk, and there might be some kind of index file entry for those, there can be multiple directory entries with different file names, or multiple entries in multiple directories, pointing to the same physical file. Various links and directory entries are used, e.g., to create user-, protection- or language-specific views of a file.

upsidedown

Recall that I am only using the concept of "files" to relate this to a conventional OS mechanism. In my case, each *name* (in a namespace) is backed by a particular server. Names that would be the equivalent of "files" (persistent data on some sort of medium) would be backed by a "file server".

But, other names may be bound to things like dynamic kernel or process structures, system variables, hardware devices, etc. E.g., "time_of_day" may provide the current time of day (in some particular format) when read. "GarageDoor" may cause the garage door to open when the string "open" is written to it; close when "close" is supplied.

The names in a namespace can be bound to a variety of different types of objects. As well as multiple instances of the *same* object. E.g., "time of day" and "now" can both be bound to the same object. Or, to different accessors on a single object.
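As a rough illustration of names bound to different kinds of objects, and of two names bound to the same object, consider the sketch below. The classes and their `read`/`write` methods are assumptions made for the example; the real servers behind each name are not shown in this thread.

```python
import time


class TimeOfDay:
    """Stands in for a server that produces the current time when read."""

    def read(self):
        return time.strftime("%H:%M:%S")


class GarageDoor:
    """Stands in for a device server driven by short command strings."""

    def __init__(self):
        self.state = "closed"

    def write(self, command):
        if command == "open":
            self.state = "open"
        elif command == "close":
            self.state = "closed"
        else:
            raise ValueError(f"unsupported command: {command!r}")


clock = TimeOfDay()
namespace = {
    "time_of_day": clock,   # two names ...
    "now": clock,           # ... bound to one and the same object
    "GarageDoor": GarageDoor(),
}

namespace["GarageDoor"].write("open")
current = namespace["now"].read()
```

The caller only ever sees names; whether a name is backed by a file server, a device, or a kernel structure is hidden behind the object it resolves to.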

I think this is the solution to my problem: I can freely create ANOTHER namespace that I populate with "Windows compatible names" and export *this* to the Windows host. So, I could bind the name "ReadMe" in the exported namespace to the same object that the local system has named "Read/\/\e" (which would upset Windows' notion of a "proper" name to reside in the file system interface). At the same time, that same object can be referenced in YET ANOTHER namespace as "README" for export to a system that expects 8.3 names.
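A minimal sketch of such an export namespace follows. The helper names are hypothetical; the character and device-name rules are the commonly documented Windows file-naming restrictions, and the check ignores further issues such as case-insensitive collisions between exported names and path length limits.

```python
import re

# Characters Windows does not allow in a file name, plus control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Reserved DOS device names (also reserved when followed by an extension).
_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def is_windows_safe(name: str) -> bool:
    """Return True if the name is acceptable as a Windows file name."""
    if not name or _FORBIDDEN.search(name):
        return False
    if name[-1] in " .":                 # trailing space or dot is rejected
        return False
    return name.split(".")[0].upper() not in _RESERVED


def export_namespace(local: dict, aliases: dict) -> dict:
    """Build a namespace of Windows-safe names bound to existing local objects.

    `aliases` maps each exported name to the local name it stands for.
    """
    exported = {}
    for exported_name, local_name in aliases.items():
        if not is_windows_safe(exported_name):
            raise ValueError(f"not a Windows-safe name: {exported_name!r}")
        exported[exported_name] = local[local_name]
    return exported


local = {"Read/\\/\\e": "the object itself"}
windows_view = export_namespace(local, {"ReadMe": "Read/\\/\\e"})
```

The local name is never rewritten; the exported namespace simply holds a second, Windows-friendly binding to the same object.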
Don Y

This will not work.

Do "i" and "I" name the same file? (In Turkish, the upper-case version of "i" is "İ"; the lower-case version of "I" is "ı".)
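The Turkish case can be demonstrated in a few lines. Python's built-in `str.upper()` applies the default, locale-independent Unicode case mapping; `turkish_upper` is a hypothetical helper written only for this illustration and covers just the two Turkish letter pairs (dotted i / İ and dotless ı / I).

```python
DOTLESS_LOWER = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I
DOTTED_UPPER = "\u0130"    # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_upper(name: str) -> str:
    """Upper-case a name the way Turkish orthography expects."""
    return (
        name.replace("i", DOTTED_UPPER)     # i -> İ
            .replace(DOTLESS_LOWER, "I")    # ı -> I
            .upper()                        # everything else: default rules
    )


default_result = "index".upper()          # default Unicode mapping
turkish_result = turkish_upper("index")   # Turkish mapping

# Under the default mapping two different lower-case letters collide:
collision = "i".upper() == DOTLESS_LOWER.upper()
```

With the default rules "index" becomes "INDEX", with Turkish rules it becomes "İNDEX", so whether two names "differ only in case" depends on which rule set the file system picks.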

allowed when it would otherwise be ambiguous; this being the case in the

recently introduced, but is not in wide use.)

The "I" ambiguity leads to a number of interesting bugs, such as this one:

I wouldn't want complex, environment-dependent code like that in any system I have to depend on, such as a kernel or file system. We already have it in mission-critical systems like the Domain Name System, and I'm not entirely happy with that.

Stefan

Stefan Reuther

Well it has worked for quite some time already.

Yes, I and i name the same file. No, I with two dots above it and I with one dot do not, these are different characters.

No, they do not.

Naming is not language specific, it is alphabet specific only. Various languages may have various alphabets. I would of course prefer if we all just used the Latin alphabet plain, as it is used in English, but there is no problem at all with the capitalization for its variations. So there is no problem storing the file case information the right way. Then if you want to store some caseless hieroglyphs you can do it by just leaving the case information blank (e.g. in DPS you have up to 255 bytes for text and as many bits for the corresponding case data).

Language processing is something else and has nothing to do with names. Similar to the way we deal with a person's name: we do not translate it, but we do spell it correctly whenever the alphabet we use allows it.

There is no ambiguity at all in using the Latin alphabet. If one chooses to introduce one he has to live with it. Your example is not related to how we store and process case information.

Like it or not file systems deal with files and files are named and the names are for human consumption to a great part. So the way the names are stored and searched for belongs there, together with the alphabet rules which apply to reading/writing these names.

Dimiter

Dimiter_Popoff

On 11.10.2014 at 11:06, snipped-for-privacy@downunder.com wrote:

In DPS I treat this as an error. It can happen, one could even reproduce it from a command line (e.g. by copying a directory to another file and by leaving the destination copy file type being that of a directory; will take just a little more typing than normal copy). But the "repair" function will capture that and will report an error, the only way around which would be to delete one of the directory entries pointing to the same (or overlapping) disk areas. [Repair walks all the directories and builds a new CAT (cluster allocation table)]. Sometimes this can occur inadvertently, say the system gets reset before the latest CAT has been updated and some newly allocated file stays "unallocated". Then upon boot some other file gets allocated over it (quite a mess really), the fix is to delete one of the two files.

But copying the directory file can be useful, I have used it during some rescue missions. Then there is no problem to copy the directory file as a non-directory type as a sort of backup, repair will not analyze it then.

Dimiter

Dimiter_Popoff

It has "worked" in the sense that people live with it despite the inadequacies, inconsistencies and complications such as massive amounts of locale-dependent code.

Did you fail to read what Stefan wrote? In Turkish, I and i are not the same letter. I with two dots is a different case altogether - in some

from "i" or "?", while in other languages it might be considered an accented form of a normal "i".

and you want capitalised versions to refer to the same file? To a German speaker, it is exactly the same as "Readme" and "README" being the same.

That is completely and utterly incorrect, and is perhaps the basis for your misunderstandings here.

Different languages can use the same alphabet in different ways, and it is not uncommon for them to have variations (such as accents or additional letters) that are treated in wildly different ways from others who use the same accents or letters.

There are many languages and alphabets where the glyphs used for particular letters vary according to their position in a word or sentence - the appearance is different and yet they are the same "letter".

There are only two possible ways to handle naming consistently and rationally in an operating system. You can restrict everything to the plain ANSI character set, in which case you can choose to make names case independent if you want. Or you can make it completely transparent and provide no interpretation beyond a minimum number of "special" characters such as "/". That way you leave it up to applications or libraries to decide how to deal with capitalisation, sorting, etc. - it is not part of the basic OS or filesystem.

Naming is highly language-dependent.

People translate their names all the time. Usually they are translated into something roughly similar but which can be pronounced and written in the other language - occasionally people choose to translate more significantly. Different languages handle names in different ways - in some languages names are declined according to how they are used, leading to even more variety.

So people who choose to be born in Turkey, and choose to be given names containing an "i" or an "ı", have only themselves to blame - and have to live with the consequences?

For someone with a clearly Russian/Eastern European name and background, yet with a perfect grasp of English, you are remarkably provincial and demonstrate a serious lack of knowledge and understanding about language, alphabets, and names in an international context.

David Brown

To my knowledge, the only successful use of alternate data streams in an NTFS file was a way to hide viruses without changing the apparent size of a file.

In *nix systems, it is normal for there to be a layer of indirection - directory entries contain names and point to inodes, and inodes contain metadata (ownership, access dates, security flags, etc.) and point to lists of datablocks. It is therefore perfectly normal to have multiple directory entries pointing to the same data, and each directory entry has equal "status" as the "name" of the file.
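The directory-entry/inode indirection can be observed directly from user space. The following sketch creates a second directory entry for an existing file with `os.link` and compares what `os.stat` reports for both names; the file names are arbitrary, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def hard_link_demo():
    """Create two directory entries for one file and compare their inodes."""
    with tempfile.TemporaryDirectory() as directory:
        first = os.path.join(directory, "report.txt")
        second = os.path.join(directory, "same-data.txt")

        with open(first, "w") as handle:
            handle.write("hello")

        os.link(first, second)            # second name, same inode

        first_info = os.stat(first)
        second_info = os.stat(second)
        with open(second) as handle:
            content = handle.read()

        same_inode = first_info.st_ino == second_info.st_ino
        return same_inode, first_info.st_nlink, content


same_inode, link_count, content = hard_link_demo()
```

Neither name is "the" name of the file: both entries refer to the same inode, and the link count records how many entries currently do so.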

Systems can also have additional methods of connecting names to files, such as symbolic links.

And on some file systems (such as btrfs), there is another layer of indirection beneath inodes so that different files that coincidentally share the same data can share the same data blocks, with copy-on-write mechanisms used to keep them logically independent.

David Brown

It has worked for a few millennia, whether you like it or not. Just because a few programmers do not want to be bothered with (or are incapable of) handling the naming conventions we have is no good reason to ask for a change.

The above applies to the rest of your post, I really have no time explaining the alphabet. People learn it in primary school, I am sure you have been taught that. Just recall it.

Dimiter

Dimiter_Popoff

In previous millennia, people did not try to build systems that work internationally. Even at the beginning of this century, systems that work in just one region were common. Of course, if you use just Codepage-437 or ISO-8859-1, which do not have the Turkish "ı" letter, you can agree on a unique case mapping. But then your system won't be usable in Turkey.

Your misconception is that you assume there is a thing such as "the alphabet". There is not "the alphabet". There are hundreds of alphabets, many of which contain common characters, and some of which interpret characters differently than others. And then we have not even started talking about languages that don't use alphabets at all, such as Chinese. Neither have we started to talk about things like sorting, which isn't even uniquely defined for a language (German has "phonebook" and "dictionary" order), and totally nontrivial for multiple languages (does a Cyrillic "А" go before or after a Greek "Α", and where do they go in relation to a Latin "A"?).

Stefan

Stefan Reuther

They are also used to store extended attributes such as a marker "this .exe file was downloaded from the internet, display a scary message when the user tries to run it".

Stefan

Stefan Reuther

Yeah. And because they do now all the whining unix followers would have the millennia old grammar reinvented just to suit the fact they have been led by their leader into the wrong corner. The fact is they got what they deserved (as does anybody following any leader). Tons of defunct software because of a fundamentally broken filesystem.

I am fluent in only 4 languages, English and German among them (OK, fluent might be overstated for my Russian), what do I know about alphabets. And I have written only one OS with only two filesystems, what do I know about these things.

Unix or whatever followers are bound to know better.

Really.

Dimiter

Dimiter_Popoff

To be clear, what are you considering "broken" about the filesystem? (and, *which* -- FFSv1/UFS, FFSv2, CODA, AFS, PORTAL, UNION, ZFS, Reiser, NFS, etc.)

Are your objections to the features offered? Or, the implementation details? Performance? Or, solely to the "naming conventions" of its content?

Or, to the interfaces made available to it? Or, conventions imposed on those interfaces (e.g., I am annoyed that Windows doesn't adhere to L-R alpha sorts. "Gee, let's alphabetize the keys on the keyboard to make it easier for folks to find the key in which they are interested?")

[Recall, I have *no* filesystem in my design as the entities managed are rarely "files" in the traditional sense. Rather, just a "namespace". The only "persistent store" resides on a "smart", composite block device (I'm still sorting out the implementation details, there)] [Feel free to reply offline, if preferred]
Don Y

The fact that in order to have say "index.htm" in a way usable for humans you need to have also INDEX.HTM, Index.htm, and another few to cover the common cases (to cover all cases you need 256 entries).
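The figure of 256 follows from "index.htm" having eight letters, each of which can independently be upper or lower case (2^8 = 256). A short sketch that enumerates the spellings, assuming plain ASCII names (the function name is made up for the example):

```python
from itertools import product


def case_variants(name: str) -> list:
    """List every upper/lower-case spelling of an ASCII name."""
    choices = [
        (char.lower(), char.upper()) if char.isalpha() else (char,)
        for char in name
    ]
    return ["".join(combination) for combination in product(*choices)]


variants = case_variants("index.htm")
```

`len(variants)` is 256, and the list contains "index.htm", "Index.htm" and "INDEX.HTM" among the rest.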

Dimiter

Dimiter_Popoff

But that's something that you can address in the user interface. It doesn't really impact the filesystem, per se.

[My question hoped to elicit some comments re: the implementation...]

(e.g., add a layer that maps all incoming filenames to uppercase in CREATION, SEARCH and DELETION and the filesystem can still be designed to preserve and recognize case -- it just happens that all entries in the filesystem have this uncanny consistency of always being in uppercase *in* the filesystem's name tables)
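A minimal sketch of that layering, with hypothetical class names: the store underneath stays fully case sensitive, and the layer on top upper-cases every incoming name on creation, search and deletion. It uses Python's default `str.upper()`, so the locale questions raised elsewhere in this thread are deliberately left out.

```python
class CaseSensitiveStore:
    """Stands in for a filesystem that preserves and distinguishes case."""

    def __init__(self):
        self._entries = {}

    def create(self, name, data):
        self._entries[name] = data

    def search(self, name):
        return self._entries.get(name)

    def delete(self, name):
        self._entries.pop(name, None)

    def names(self):
        return sorted(self._entries)


class UppercasingLayer:
    """Maps every incoming name to upper case before touching the store."""

    def __init__(self, store):
        self._store = store

    def create(self, name, data):
        self._store.create(name.upper(), data)

    def search(self, name):
        return self._store.search(name.upper())

    def delete(self, name):
        self._store.delete(name.upper())


store = CaseSensitiveStore()
layer = UppercasingLayer(store)
layer.create("Index.htm", b"<html></html>")
found = layer.search("index.HTM")
```

The price of this approach is that the spelling the user originally typed is not kept anywhere; the store only ever sees "INDEX.HTM".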

I'm sure MS effectively implements a locale-specific "strncmpi()" when it goes hunting for a match.

[Actually, knowing MS's history with buffer overrun issues, I suspect they would use a strcmpi() instead! :-/ ]

I still have to test how the NFS client/server here handle these...

In my case (namespaces), as the names for most objects have been created by the developers -- or, code that they crafted -- it seems an invitation to sloppiness (i.e., bugginess) to allow the developer to refer to "foo" as "Foo", elsewhere in his codebase AND EXPECT THEM TO REFERENCE THE SAME OBJECT.

Don Y

For example, all of the rules in my speech synthesizers are expressed in uppercase. Yet, obviously, text fed *to* the synthesizer can be in ANY case -- including mixed.

So, my pattern matching algorithms ignore the case of the input text (but KNOW that the case of the templates will be strictly UPPERcase).

[This allows me to use lowercase in the templates for "special purposes" without fear of "accidentally" matching something in the input text]
Don Y

Of course you can. Every problem has its solution. The problem in the above case is the fundamental design of the filesystem. Either you store bytes and do not expose the user to them - but to some text representing these - or you store text and allow the user to consume it.

In the unix filesystem they store bytes and feed them for user consumption which has been, is and will be a problem as long as they do not bite the bullet and fix it.

Dimiter

Dimiter_Popoff

Space, what space ? What do you do for U+00A0 and sisters ?

Tonton Th

That's interesting to know. (That particular message is more irritating than scary - /of course/ I want to run the file, that's why I downloaded it in the first place!)

David Brown
