Saturday, 13 December 2014

Unicode makes life easy!

Every now and then, the complaint is raised: "Unicode is hard!" My usual first response is that it's not Unicode that's hard, but human language; for instance, it's not Unicode's fault that:
* English is written left-to-right, Hebrew is written right-to-left, and you sometimes need to put both on the one line, like when you explain the name of PHP's scope resolution operator (Paamayim Nekudotayim, פעמיים נקודתיים‎, "double dot doubled");
* Some languages use some diacritical marks, others use different ones, so there are literally dozens of different marks that could be applied to letters; and each mark could be applied to any of quite a few letters;
* Not all letters in all languages have upper-case and lower-case forms, and in some cases, a single upper-case letter becomes multiple lower-case letters;
* And plenty more besides.
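A few of these points can be checked directly against the Unicode character database. The following is a minimal sketch in Python (not the language used later in this post), relying only on the standard `unicodedata` module and the built-in string methods; the choice of sample characters is mine.

```python
import unicodedata

pe = "\u05E4"  # Hebrew letter pe, the first letter of the Hebrew name above

# Directionality is a property of the letters themselves.
print(unicodedata.bidirectional("P"))   # L (left-to-right)
print(unicodedata.bidirectional(pe))    # R (right-to-left)

# A diacritical mark can be its own code point, applied to the letter before it.
acute = "\u0301"  # combining acute accent
print(unicodedata.combining(acute) != 0)  # True: it attaches to the previous character

# Case mapping is not always one-to-one, and some letters have no case at all.
print("ß".upper())             # SS
print(len("\u0130".lower()))   # 2: "i" followed by a combining dot above
print(pe.upper() == pe)        # True: Hebrew has no upper or lower case
```

None of these behaviours were invented by Unicode; the database just records what the writing systems already do.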

What Unicode does, of course, is bring a lot of these issues to the surface. A programmer who thought that "plain text" could fit inside ASCII, and now has to deal with all of this, will often tend to blame Unicode. But the truth is that Unicode actually makes so much of this easy - partly because you can push most of the work down to a lower-level library, and partly because of some excellent design decisions.

Here's a little challenge for you: Make a way to transcribe text in any Latin script, with any diacriticals, using a simple US English keyboard layout. You'll need to devise a system for transforming an easy-to-type notation into all the different adorned characters you'd need to support. Your options: Use Unicode, or use eight-bit character sets from ISO-8859.

Here's how I did it with Unicode - specifically, in Pike, using GTK2 for my UI. First, translate a number of two-character sequences into the codepoints for combining characters; then perform a Unicode NFC normalization. The second step is optional, done only because certain editors have buggy handling of combining characters (SciTE allows the cursor to get between them!). So really, the entire translation code consists of a handful of straightforward translations. In my case, that's seven to cover all the combining marks that I need, plus four more for the inverted question mark and exclamation mark and the two vowels not found on a US English keyboard (æ and ø), for a total of eleven transformations.
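The two steps described above can be sketched as follows. This is an illustration in Python rather than the original Pike code, and the notation is hypothetical: the backslash-plus-character sequences, and the particular seven combining marks, are assumptions made for the example and not the ones the real program uses.

```python
import unicodedata

# Hypothetical notation: type the letter, then a backslash and a marker.
# Seven combining marks (an assumed selection, not the author's actual list).
COMBINING = {
    r"\`": "\u0300",  # combining grave accent
    r"\'": "\u0301",  # combining acute accent
    r"\^": "\u0302",  # combining circumflex accent
    r"\~": "\u0303",  # combining tilde
    r"\:": "\u0308",  # combining diaeresis
    r"\o": "\u030A",  # combining ring above
    r"\,": "\u0327",  # combining cedilla
}

# Four standalone characters missing from a US English keyboard.
STANDALONE = {
    r"\?": "\u00BF",  # inverted question mark
    r"\!": "\u00A1",  # inverted exclamation mark
    r"\a": "\u00E6",  # ae ligature
    r"\/": "\u00F8",  # o with stroke
}

TRANSFORMS = {**COMBINING, **STANDALONE}  # eleven entries in total


def transcribe(text: str) -> str:
    """Expand the two-character notation, then compose with NFC."""
    # Step 1: plain substring replacement, one pass per transformation.
    for notation, replacement in TRANSFORMS.items():
        text = text.replace(notation, replacement)
    # Step 2 (optional): fold letter + combining mark into one code point
    # wherever Unicode defines a precomposed form.
    return unicodedata.normalize("NFC", text)


print(transcribe(r"\?Que\' tal?"))   # ¿Qué tal?
print(transcribe(r"A\ongstro\:m"))   # Ångström
print(transcribe(r"sm\/rrebr\/d"))   # smørrebrød
```

Note that nothing in the table mentions individual letters. Any letter can take any mark, and a combination with no precomposed form (such as `q\~`) simply stays as a base letter followed by a combining character.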

To do the same thing with ISO-8859 character sets, you would first have to figure out which character set to use, and enable only the subset of transformations it can represent. Then you would need full translation tables listing every supported letter-diacritical combination, and you'd have to make sure every entry is correct. There'll be hundreds of transformation entries, and you'd need to redo them for every single character set. With Unicode, supporting a new language is simply a matter of seeing what's missing, and adding that.
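To get a feel for what those per-character-set tables would contain, here is a rough Python sketch. It cheats by deriving each table from Unicode's own decomposition data, which is exactly the help you would not have in an ISO-8859-only world, where every entry would be typed out by hand. The function name `precomposed_table` is made up for this example; `iso8859_1` and `iso8859_2` are Python's standard codec names.

```python
import unicodedata


def precomposed_table(charset: str) -> dict:
    """Map (base letter, combining marks) to the byte `charset` uses for it."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # some ISO-8859 parts leave a few byte values unassigned
        decomposed = unicodedata.normalize("NFD", char)
        if len(decomposed) > 1:
            # First code point is the base letter, the rest are its marks.
            table[(decomposed[0], decomposed[1:])] = byte
    return table


latin1 = precomposed_table("iso8859_1")
latin2 = precomposed_table("iso8859_2")

TILDE = "\u0303"         # combining tilde
DOUBLE_ACUTE = "\u030B"  # combining double acute accent

print(("a", TILDE) in latin1, ("a", TILDE) in latin2)                # True False
print(("o", DOUBLE_ACUTE) in latin1, ("o", DOUBLE_ACUTE) in latin2)  # False True
print(len(latin1), len(latin2))  # entries needed for just these two sets
```

Each character set supports a different subset of combinations at different byte values, so neither table can be reused for the other.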

Unicode made my code smaller AND more functional than ISO-8859 ever could. Unicode is awesome.