Monday, 29 August 2016

Rapunzel's front-row seat

"Best day of your life? I figured you should have a decent seat." - Flynn Rider, 'Tangled'

How did Rapunzel and her guide manage to find themselves in such a perfect position to look at the lanterns? Why were there no boats anywhere nearby, despite there being quite a few elsewhere? Wouldn't someone else want to get that kind of view? The cry was "To the boats!", but we see no boats other than the one our heroes are on.

I think this image largely explains it. The tiny rowboat is in a cloud of lanterns that have floated this direction, and are grazing the water's surface before moving on upward into the sky. Imagine if there were lots of boats, sailboats included, on this patch of water - it would be extremely risky for the lanterns (easy for them to crash and splash), and possibly also risky for the boats themselves (flammables near flames? No thanks). Flynn has taken his date to the quietest place around by the simple method of violating the kingdom's safety rule: No boats downwind of the island!

It makes sense. The weather report would tell them which way the wind's most likely to be blowing that evening. Some of the lanterns will rise straight into the air; others will climb for a bit, then settle down, and finally make their rise toward the sky, once the air inside them warms a bit more. (Some will be duds, of course, and won't get into the air. They'll catch the water and sink.) Anyone upwind or crosswind of the island will get a great view of the lanterns flying off into the distance, without any risks; downwind, all you need is a small exclusion zone, and everyone's safe. When the royal lantern comes down almost to the water, Rapunzel helps it on its way, but if she (and the boat) hadn't been there, it would have had enough room to catch the air on its own.

And isn't it perfectly appropriate for the thief ("lovable rogue", I mean, of course) to break the rules and get himself into the perfect, if risky, spot?

Monday, 18 April 2016

Falsehoods Programmers Believe About PEP 8

We're computer programmers. We spend our days warping reality to our purposes, and then leaving behind a textual representation of the exact type of warping so the next person can use our modified reality. Over the years, many MANY Python programmers have looked at a document called PEP 8 and misunderstood it, just as many programmers misunderstand time, or people's names. The same PEP 8 misconceptions crop up over and over again, and after some discussion on python-list, I've collected these commonly-held fallacies.

Remember, all of these assumptions are wrong.

  1. All Python code should follow PEP 8.
  2. If you use a tool named pep8, your code will be PEP 8 compliant.
  3. If your code is PEP 8 compliant, a tool named pep8 will accept it.
  4. The Python Standard Library is PEP 8 compliant.
  5. Okay, at least the new parts of the standard library are PEP 8 compliant.
  6. PEP 8 compliant code is inherently better than non-compliant code.
  7. PEP8-ing existing code will improve it.
  8. Once code is PEP 8 compliant, it can easily be kept that way through subsequent edits.
  9. PEP 8 never changes.
  10. Well, it never materially changes.
  11. I mean, new advice, sure, but it'll never actually go back on a rule.
  12. The line length limit is obsolete in an age of high-resolution displays.
  13. Okay, but if you disregard side-by-side windows, lines of code can be arbitrarily long without hurting readability.
  14. Well, maybe not several hundred characters, but surely 120 characters of code on a line is easy enough to read.
  15. The only valid white space is line breaks and U+0020 SPACE.
  16. Okay, U+0009 TAB when lining up columns, but no other white space.
  17. Oh, come on, no-one would use U+000C FORM FEED in source code.
  18. Everyone uses the same sort of tools (visual text editors) to read and write code.
  19. Ignoring the few weirdos who can cope with their own bizarre choices, every NORMAL person uses the same sort of tools.
  20. Alright, everyone at my organization will use the same tools. I can mandate that, so it must be true.
  21. Readability is an inherent quality of code. It doesn't matter who reads it, good code is good code.
  22. Avoiding the "Names to Avoid" is a sure and simple way to make sure your identifiers aren't confusable.
  23. Unicode is good for identifiers.
  24. Unicode is bad for identifiers.
  25. Unicode is optional for identifiers.
  26. You know what I mean. I'm talking about *non-ASCII* characters. And you shouldn't use them.
  27. PEP 8 is a tool for denying patches/pull requests that you should reject.

As with the articles I'm riffing off, every one of these is false, and I can give examples. And this is far from an exhaustive list. If you want to avoid the worst of the errors, start by reading the actual document (not some tool that borrows its name), particularly the section entitled "A Foolish Consistency is the Hobgoblin of Little Minds".
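As one concrete counterexample for fallacy 22, here is a minimal sketch (the variable names and the exec-based setup are mine, purely for illustration). It shows two identifiers that render almost identically yet are distinct names to Python, and neither one appears in PEP 8's "Names to Avoid".

```python
import unicodedata

# Two visually near-identical single-letter names (illustrative choice).
latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A

# Both are legal identifiers, and they are different strings.
both_valid = latin_a.isidentifier() and cyrillic_a.isidentifier()
same_name = latin_a == cyrillic_a

# Assign to each name in a scratch namespace to show they do not collide.
namespace = {}
exec(f"{latin_a} = 1\n{cyrillic_a} = 2", namespace)

print(both_valid, same_name)
print(unicodedata.name(latin_a), "->", namespace[latin_a])
print(unicodedata.name(cyrillic_a), "->", namespace[cyrillic_a])
```

Dodging "l", "O" and "I" helps, but it doesn't make identifiers unconfusable.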

With thanks to Ben Finney and Dan Sommers for contributions to the above list.

Tuesday, 25 August 2015

Stop kissing Crystal and find Grandpabbie!

Surround Sound allows some neat effects, like hearing that creepily dangerous sound from behind you instead of from the screen... but sometimes there can be additional words hidden in the other channels that you otherwise mightn't be able to hear. When Kristoff brings Anna to meet his friends (well, they're more like family), he greets them all, in several cases by name. But since the audience's attention is on Olaf and Anna ("He's cray zee!"), the actual lines are easily lost. Here's what Kristoff is actually saying...

Hey, guys.
You are a sight for sore eyes.
Hey look, Magma's back from vacation!
Rocko's looking sharp, as usual.
Clay, whoa... I don't even recognize you. You lost so much weight!
Didn't realize how much I missed you guys.
Guys, I've got so much to tell you.
Stop kissing Crystal and find Grandpabbie!

So now you know. And knowing, as they say...

Sunday, 3 May 2015

Upgrading Ubuntu Karmic to Debian Jessie

My server had been running an ancient release of Ubuntu for far too long, and I was getting weary of manually patching things and hoping that I could stay on top of everything. So, with Debian Jessie freshly stable, I figured it's high time to upgrade. My options were to wipe the computer and start over, or attempt an upgrade; being certifiably insane, I chose the latter. Herein are notes from what took place this weekend... as a cautionary tale, perhaps!

First and foremost, try it out on a lesser system. (I wasn't quite insane enough - or maybe stupid enough - to just dive in and start fiddling with a live server.) Upgrading Ubuntu Maverick (10.10) to Debian Jessie (8.0) worked out fairly well, with just a few messinesses and complications, all of which also happened with the full upgrade. But there were rather more problems on the live system.

  1. Replace /etc/apt/sources.list with Jessie content. Easy enough. Don't forget to check /etc/apt/sources.list.d/ for any that are now redundant.
  2. Grab the new GPG keys from so apt can check signatures.
  3. apt-get update, find out about a few more keys needing to be imported. Grab them with gpg --recv-keys 64481591B98321F9; gpg --armor --export 64481591B98321F9|sudo apt-key add - (after checking their validity in whatever way satisfies your level of paranoia).
  4. Due to some major bootstrapping problems, I couldn't simply apt-get dist-upgrade to do the upgrade. For everything to work, I actually had to do several steps: first, grab the Squeeze (Debian 6) repos, and install a somewhat newer kernel; then reboot into that, and finish the upgrade in single-user mode with broken mounts.
  5. STRONG recommendation: Use apt-get -d dist-upgrade to download all packages into the cache. This operation will complete quite happily, and is not bothered by package conflicts. After that, even if the network connection is broken, package upgrading can continue. At very worst, this just lets you leave the download chugging for a while, and then come back when it's done - saves quite a bit of time when you aren't doing this on a high-bandwidth server, like my first test.

  6. In order to complete the upgrade, I had to first upgrade udev, and then only afterward install a new Linux kernel... which udev requires. This meant a big fat warning about how this was very dangerous, but that I could touch a particular file to override the warning and do the installation - with the proviso that it might trash the system if I rebooted into the running kernel. Fortunately, such did not happen, as I was able to subsequently install a recent kernel, but it was cause to pause.

  7. Above all, this change MUST be done by someone prepared to take responsibility. This can't be managed by a script, it might cause downtime, and you need to have a fail-over ready in case something breaks badly. But hey. What's life without a little adventure... Some people go mountain climbing, I go upgrade climbing.
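The "check /etc/apt/sources.list.d/ for any that are now redundant" part of step 1 is easy to forget, so here is a rough Python sketch of one way to spot leftovers. It is not what I actually ran; the function name and the idea of matching on the old release codename are assumptions for illustration, and it only understands the classic one-line "deb URI suite component..." format.

```python
from pathlib import Path


def stale_source_entries(directory, old_codenames):
    """Return (file name, line) pairs for active apt entries naming an old release."""
    hits = []
    for path in sorted(Path(directory).glob("*.list")):
        for line in path.read_text().splitlines():
            entry = line.strip()
            # Skip blanks and commented-out entries.
            if not entry or entry.startswith("#"):
                continue
            fields = entry.split()
            if fields[0] not in ("deb", "deb-src"):
                continue
            # The suite field looks like "karmic" or "karmic-updates", so
            # match any later field that starts with an old codename.
            if any(f.startswith(name) for f in fields[1:] for name in old_codenames):
                hits.append((path.name, entry))
    return hits
```

Calling stale_source_entries("/etc/apt/sources.list.d", ["karmic", "maverick"]) lists the entries that still point at the old release. Whether each one gets deleted or rewritten is still a decision for a human.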

The part I was most impressed by was how much could be done on a running system. Upgrading a 2010 release of one distro to a 2015 release of a different distro, with no outage until the reboot at the end? Rather not bad, I think. Apt is a great tool.

Sunday, 29 March 2015

File systems: Case insensitivity is cultural insensitivity

There has long been a divide between case sensitive file systems and case insensitive ones. The former retain all names as provided, and define uniqueness according to very simple rules; the latter either force names to monocase (upper or lower), or retain them in the first form provided, and match without regard to case. Today, the divide roughly parallels the distinction between POSIX (Linux, Mac OS, BSD, etc) and Windows, so it tends to be tied in with the religious war that all that entails, but the concepts don't depend on the OS at all. For the purposes of discussion, I will be using Unix-like path names, because they are valid on all platforms.

If all file names are required to be ASCII and to represent English text, there is no problem. You can define case insensitivity easily: simply treat these 26 byte values as identical to those 26 byte values. But in today's world, it's not good enough to support only English. You need to support all the world's languages, and that usually means Unicode. Since Unicode incorporates ASCII, the file system needs to be able to cope with ASCII file names. Well and good. So, are file names "/ALDIRMA" and "/aldirma" identical? They look like ASCII, and if those letters represented something in English, they'd be the same, so they ought to be the same, right? Nope. That first word is Turkish, and the lower-case form is "/aldırma", with a dotless i. (The second isn't actually a word in any language, so far as I know, but it's a perfectly valid file name.) Should all four (those three plus "/ALDİRMA" with a dotted majuscule I) be considered to represent the same file? And what about the German eszett ("ß") - should that be identical to "ss", because they both upper-case to "SS"? Should the Greek letter sigma, which has distinct final ("ς") and medial ("σ") forms, be considered a single letter? This might conflate file names consisting of words strung together (in which case the final sigma is an indication that there is a word break there).
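These collisions are easy to see for yourself. The following is a small Python sketch using only the built-in string methods; Python's case mappings ignore locale, which is exactly the "one rule for everyone" behaviour under discussion.

```python
dotless = "ald\u0131rma"   # "aldırma", with U+0131 LATIN SMALL LETTER DOTLESS I
dotted = "aldirma"

# Upper-casing erases the dotted/dotless distinction that Turkish relies on.
turkish_collision = dotless.upper() == dotted.upper()

# Locale-independent lower-casing of the Turkish word picks the dotted i.
wrong_lower = "ALDIRMA".lower() == dotted

# Dotted capital I (U+0130) lower-cases to two code points: "i" + U+0307.
dotted_capital_lower = [hex(ord(c)) for c in "\u0130".lower()]

# Eszett and "ss" collide under upper-casing and under case folding.
eszett_upper = "\u00df".upper()
eszett_fold = "\u00df".casefold()

# Final and medial sigma fold to the same letter.
sigma_collision = "\u03c2".casefold() == "\u03c3".casefold()

print(turkish_collision, wrong_lower, dotted_capital_lower)
print(eszett_upper, eszett_fold, sigma_collision)
```

A case-insensitive file system built on rules like these would treat each colliding pair as the same file.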

These distinctions make it virtually impossible to case-fold arbitrary text reliably. Whatever you do, you'll surprise someone - and maybe create a security hole, depending on how various checks are done. To be trustworthy, a file system must behave predictably. There's plenty of room for sloppy comparisons in text searching (for instance, a Google search for "aldirma" does return results containing "aldırma"), but the primary identifier for a file should be simple and faithful. So there are basically two options: Either the case folding rules are kept simple and generic, or there are no case folding rules at all. Will you impose arbitrary English rules on everyone else? And why English - why not, say, Hebrew? Let's treat "/letter" and "/litter" as the same file - after all, there are no vowels in Hebrew, so they should be ignored when comparing file names, that's fair, right?

Or we could take the reliable, safe, and international approach, and maintain file names as strict text. Transform them only in ways which have no linguistic meaning (eg NFC normalization), and display them strictly as the user entered them. If you misspell a word, the file system won't fix it; if you get majuscule and minuscule wrong, the file system shouldn't fix that either.
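To show what "no linguistic meaning" looks like in practice, here is a short Python sketch of NFC normalization using the standard unicodedata module (the sample word is just an illustration). Two different code point sequences for the same text become identical, while case is left exactly as entered.

```python
import unicodedata

precomposed = "caf\u00e9"   # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ even though they display identically.
raw_equal = precomposed == decomposed

# After NFC normalization they are the same sequence of code points.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# Normalization does not touch case: "Café" and "café" stay distinct.
case_kept = (
    unicodedata.normalize("NFC", "Caf\u00e9")
    != unicodedata.normalize("NFC", "caf\u00e9")
)

print(raw_equal, nfc_equal, case_kept)
```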

There's one change I would make to the current POSIX model, though: require that file names be Unicode text, not arbitrary bytes. There's a broad convention that Linux systems use file names that consist of UTF-8 streams, but it's not enforced anywhere, and that means arbitrary bytes sometimes have to be smuggled around the place. That's necessary for legacy support, I suppose, but it'd be nice to eventually be able to drop all that compatibility code and just use Unicode file names everywhere. But even without that, I still prefer the Linux "it's all bytes, probably UTF-8" model to the Windows "it's all UTF-16 code units, case insensitively" model. Case insensitivity works only if one culture imposes its definition of case on everyone on the planet. Let's not do that.
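For a taste of what that byte-smuggling looks like, here is a Python sketch using the "surrogateescape" error handler, which is one real-world mechanism for it (on POSIX systems, os.fsdecode and os.fsencode use this handler). The file name is made up for the example.

```python
raw = b"report-\xff.txt"   # hypothetical name; 0xFF never appears in valid UTF-8

# A strict decode refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape hides each undecodable byte in a lone surrogate code point...
smuggled = raw.decode("utf-8", "surrogateescape")

# ...so that encoding with the same handler restores the original bytes.
round_trip = smuggled.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(smuggled), round_trip == raw)
```

The resulting string is not valid Unicode text, which is precisely the compatibility baggage that enforced Unicode file names would let us drop.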

Saturday, 13 December 2014

Unicode makes life easy!

Every now and then, the complaint is raised: "Unicode is hard!". My usual first response is that it's not Unicode that's hard, but human language; for instance, it's not Unicode's fault that:
* English is written left-to-right, Hebrew is written right-to-left, and you sometimes need to put both on the one line, like when you explain the name of PHP's scope resolution operator (Paamayim Nekudotayim, פעמיים נקודתיים‎, "double dot doubled");
* Some languages use some diacritical marks, others use different ones, so there are literally dozens of different marks that could be applied to letters; and each mark could be applied to any of quite a few letters;
* Not all letters in all languages have upper-case and lower-case forms, and in some cases, a single upper-case letter becomes multiple lower-case letters;
* And plenty more besides.

What Unicode does, of course, is bring a lot of these issues to the surface. A programmer who thought that "plain text" could fit inside ASCII, and now has to deal with all of this, will often tend to blame Unicode. But the truth is that Unicode actually makes so much of this easy - partly because you can push most of the work down to a lower-level library, and partly because of some excellent design decisions. Here's a little challenge for you: Make a way to transcribe text in any Latin script, with any diacriticals, using a simple US English keyboard layout. You'll need to devise a system for transforming an easy-to-type notation into all the different adorned characters you'd need to support. Your options: Use Unicode, or use eight-bit character sets from ISO-8859.

Here's how I did it with Unicode - specifically, in Pike, using GTK2 for my UI. First, translate a number of two-character sequences into the codepoints for combining characters; then perform a Unicode NFC normalization. And the second step is optional, done only because certain editors have buggy handling of combining characters (SciTE allows the cursor to get between them!), so really, the entire translation code consists of a handful of straight-forward translations - in my case, seven of them to cover all the combining marks that I need, plus four more for inverted question mark and exclamation mark, and the vowels not found on a US English keyboard (æ and ø); so there are a total of eleven transformations done.
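My real code is in Pike, but the same two steps can be sketched in Python. The notation below (a backslash plus one punctuation character, typed after the letter it modifies) is made up for this example and is not the set of sequences my program uses; it also covers only some of the eleven transformations.

```python
import unicodedata

# Hypothetical two-character sequences and what each one turns into.
SEQUENCES = {
    "\\`": "\u0300",  # combining grave accent
    "\\'": "\u0301",  # combining acute accent
    "\\^": "\u0302",  # combining circumflex accent
    "\\~": "\u0303",  # combining tilde
    "\\:": "\u0308",  # combining diaeresis
    "\\,": "\u0327",  # combining cedilla
    "\\?": "\u00bf",  # inverted question mark (not a combining mark)
    "\\!": "\u00a1",  # inverted exclamation mark (not a combining mark)
}


def transcribe(text):
    """Replace each typed sequence, then compose the result with NFC."""
    for typed, replacement in SEQUENCES.items():
        text = text.replace(typed, replacement)
    return unicodedata.normalize("NFC", text)


print(transcribe("\\?Que\\' tal, sen\\~or?"))
```

Note that the table never mentions which letters a mark may attach to. Any base letter followed by a combining mark works, and NFC picks the precomposed form where one exists.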

To do the same thing with ISO-8859 character sets would require: First, figure out which character set to use, and then enable only a subset of transformations. Then, have full translation tables including every letter-diacritical combination supported, and make sure you have them all correct. There'll be hundreds of transformation entries, and you'd need to redo them for every single character set; with Unicode, supporting a new language is simply a matter of seeing what's missing, and adding that.
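The "figure out which character set to use" problem can be illustrated with a few lines of Python and its standard codecs. The helper name and the sample letter are my own choices for this sketch.

```python
letter = "\u0151"   # "ő", LATIN SMALL LETTER O WITH DOUBLE ACUTE (Hungarian)


def encodable(text, charset):
    """Report whether every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# The letter exists in ISO-8859-2 but not in ISO-8859-1.
in_latin1 = encodable(letter, "iso-8859-1")
in_latin2 = encodable(letter, "iso-8859-2")

# And one and the same byte means a different character in each set.
byte = b"\xe6"
meanings = {cs: byte.decode(cs) for cs in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}

print(in_latin1, in_latin2, meanings)
```

Every transformation table would have to be written with that in mind, once per character set.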

Unicode made my code smaller AND more functional than ISO-8859 ever could. Unicode is awesome.