Sunday, 29 March 2015

File systems: Case insensitivity is cultural insensitivity

There has long been a divide between case sensitive file systems and case insensitive ones. The former retain all names as provided, and define uniqueness according to very simple rules; the latter either force names to monocase (upper or lower), or retain them in the first form provided, and match without regard to case. Today, the divide roughly parallels the distinction between POSIX (Linux, Mac OS, BSD, etc) and Windows, so it tends to be tied in with the religious war that all that entails, but the concepts don't depend on the OS at all. For the purposes of discussion, I will be using Unix-like path names, because they are valid on all platforms.

If all file names are required to be ASCII and to represent English text, there is no problem. You can define case insensitivity easily: simply treat these 26 byte values as identical to those 26 byte values. But in today's world, it's not good enough to support only English. You need to support all the world's languages, and that usually means Unicode. Since Unicode incorporates ASCII, the file system needs to be able to cope with ASCII file names. Well and good. So, are file names "/ALDIRMA" and "/aldirma" identical? They look like ASCII, and if those letters represented something in English, they'd be the same, so they ought to be the same, right? Nope. That first word is Turkish, and the lower-case form is "/aldırma", with a dotless i. (The second isn't actually a word in any language, so far as I know, but it's a perfectly valid file name.) Should all four (those three plus "/ALDİRMA" with a dotted majuscule I) be considered to represent the same file? And what about the German eszett ("ß") - should that be identical to "ss", because they both upper-case to "SS"? Should the Greek letter sigma, which has distinct final ("ς") and medial ("σ") forms, be considered a single letter? This might conflate file names consisting of words strung together (in which case the final sigma is an indication that there is a word break there).

These distinctions make it virtually impossible to case-fold arbitrary text reliably. Whatever you do, you'll surprise someone - and maybe create a security hole, depending on how various checks are done. To be trustworthy, a file system must behave predictably. There's plenty of room for sloppy comparisons in text searching (for instance, a Google search for "aldirma" does return results containing "aldırma"), but the primary identifier for a file should be simple and faithful. So there are basically two options: Either the case folding rules are kept simple and generic, or there are no case folding rules at all. Will you impose arbitrary English rules on everyone else? And why English - why not, say, Hebrew? Let's treat "/letter" and "/litter" as the same file - after all, there are no vowels in Hebrew, so they should be ignored when comparing file names, that's fair right?

Or we could take the reliable, safe, and international approach, and maintain file names as strict text. Transform them only in ways which have no linguistic meaning (eg NFC normalization), and display them strictly as the user entered them. If you misspell a word, the file system won't fix it; if you get majuscule and minuscule wrong, the file system shouldn't fix that either.

One change I would make in the current POSIX model, though, and that's to require that file names be Unicode text, not arbitrary bytes. There's a broad convention that Linux systems use file names that consist of UTF-8 streams, but it's not enforced anywhere, and that means arbitrary bytes sometimes have to be smuggled around the place. That's necessary for legacy support, I suppose, but it'd be nice to eventually be able to drop all that compatibility code and just use Unicode file names everywhere. But even without that, I still prefer the Linux "it's all bytes, probably UTF-8" model to the Windows "it's all UTF-16 code units, case insensitively" model. Case insensitivity works only if one culture imposes its definition of case on everyone on the planet. Let's not do that.

No comments: