Supporting `legacy' systems

I recently had occasion to exchange email with someone about an IRC server which had recently been switched to a new codebase. The new codebase banned non-UTF-8 messages, even private messages between mutually consenting parties.

I wrote to its admin and we ended up exchanging mail about it. One thing he said was that I should publish essays on supporting legacy systems.

I'm not sure I like the terminology, because I see `legacy' as carrying (mildly) pejorative connotations, as if it implied "in need of replacement". I do not consider old computers to be in need of replacement just because they're old; indeed, I consider supporting them a good thing. If nothing else, it keeps code authors from getting sloppy, from slipping into the sort of mindset which (with respect to memory use) I might summarize as "who cares about a ten-meg table; ten megs of RAM costs what, half a cent?" (to which of course the reply is "please tell me where I can get RAM for my SPARCstation-20 (or Mac SE, or MicroVAX-II, or whatever) at that price!").

Occasionally that sort of thinking is justified and reasonable. But more often (well, more often in my experience) it's just sloppiness, and supporting `small' machines is one of the more effective ways to avoid slipping into it, or, more precisely, to catch such slippage. (A machine need not be `legacy' to be small. But the correlation is at least moderately strong.)

The IRC server example is (mostly) not about RAM consumption, but it is about supporting non-bleeding-edge systems.

So, I'm taking what I wrote to him and filing off the serial numbers (ie, removing anything that identifies who it was or which the IRC server in question is, reformatting to work better in this medium, paraphrasing, etc) and posting it here, as what may or may not turn out to be the first of a series.

 

I opened the exchange by remarking that the new IRC server did not accept non-UTF-8 text messages. The admin's reply was, basically, "I had to choose, maybe it's the wrong choice", along with a question about what the impact on me was. The rest of this post is a reformatting and paraphrase of my response.

It's a wrong choice for my preferences and use cases, certainly.

IRC has, in my experience, until fairly recently[1] been charset-blind[2], just transporting octet strings, leaving the mapping between them and character strings, if any, up to the user(s) and/or software on the client side.

I think it should stay that way; that paradigm is fully compatible with UTF-8 while still equally supporting other encodings (at least other ASCII-superset encodings, or almost any encoding with the help of client-side translation for things such as commands to the server like JOIN and PART).
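
To sketch what I mean by client-side translation: assuming nothing beyond POSIX iconv(), something like this could recode a user's 8859-1 octets to UTF-8 on the way out (and the mirror image on the way in). The function name and scaffolding here are mine, purely illustrative; this is not anything any particular client actually does.

    #include <iconv.h>
    #include <stdlib.h>
    #include <string.h>

    /* Recode an 8859-1 octet string to UTF-8.  Returns a malloc()ed
       string the caller must free, or 0 on failure.  Error handling
       is minimal; this is a sketch, not production code. */
    static char *latin1_to_utf8(const char *in)
    {
        iconv_t cd;
        char *out;
        char *inp;
        char *outp;
        size_t inleft;
        size_t outleft;

        cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) return 0;
        inleft = strlen(in);
        outleft = (inleft * 2) + 1;  /* an 8859-1 char is at most 2 octets in UTF-8 */
        out = malloc(outleft);
        if (! out) { iconv_close(cd); return 0; }
        inp = (char *) in;  /* iconv() wants char **, hence the cast */
        outp = out;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            free(out);
            out = 0;
        } else {
            *outp = '\0';
        }
        iconv_close(cd);
        return out;
    }

A client whose user works in 8859-1 could run outgoing message text through this before writing it to the server, and run the reverse conversion (iconv_open("ISO-8859-1", "UTF-8")) on incoming text.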

The immediate pragmatic effect, of course, is that it becomes impossible to use non-UTF-8 text, such as (almost) any 8859-*, KOI-8, Shift-JIS, etc, text, even between mutually agreeing parties. In the case of that particular IRC server, this was relatively mild, because I didn't use non-ASCII there very much, though I did occasionally use French, a little of which included non-ASCII characters, always using 8859-1. (I would like to have more occasion to use 8859-14; I'm trying to get past my current level of competence in Gaelic (mostly Irish variants) and prefer the old orthography, which 8859-14 appears to be designed for.)
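
To make the byte-level problem concrete: in 8859-1, é is the single octet 0xE9, while in UTF-8 it is the two octets 0xC3 0xA9, so a server that validates incoming messages as UTF-8 will reject 8859-1 French text outright. A deliberately simplistic validity check of my own devising (not the server's actual code) shows the effect:

    #include <stdio.h>

    /* Return nonzero iff the len octets at s are structurally valid
       UTF-8.  Deliberately simplistic: it checks lead/continuation
       byte patterns only; a real validator would also reject overlong
       forms, surrogates, and values past 0x10FFFF. */
    static int valid_utf8(const unsigned char *s, int len)
    {
        int i;
        int j;
        int n;

        for (i = 0; i < len; i += n) {
            if (s[i] < 0x80) n = 1;                    /* plain ASCII */
            else if ((s[i] & 0xe0) == 0xc0) n = 2;
            else if ((s[i] & 0xf0) == 0xe0) n = 3;
            else if ((s[i] & 0xf8) == 0xf0) n = 4;
            else return 0;                             /* bad lead byte */
            if (i + n > len) return 0;                 /* truncated sequence */
            for (j = 1; j < n; j++)
                if ((s[i+j] & 0xc0) != 0x80) return 0; /* bad continuation */
        }
        return 1;
    }

    int main(void)
    {
        static const unsigned char latin1[] = { 'c', 'a', 'f', 0xe9 };      /* café, 8859-1 */
        static const unsigned char utf8[] = { 'c', 'a', 'f', 0xc3, 0xa9 };  /* café, UTF-8 */

        printf("8859-1 text: %s\n", valid_utf8(latin1, 4) ? "accepted" : "rejected");
        printf("UTF-8 text:  %s\n", valid_utf8(utf8, 5) ? "accepted" : "rejected");
        return 0;
    }

The 8859-1 octet 0xE9 has a three-byte-lead bit pattern, and the octets after it aren't continuation bytes, so the whole message fails.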

It bothered me much more as a matter of principle.

In the abstract, I have nothing against the original idea of Unicode. A character set which covers all known writing systems would be a Good Thing. I dislike a few aspects of Unicode as it is currently designed, such as the presence of multiple ways to represent things such as accented vowels (either as codepoints representing the resulting character or as the unaccented vowel codepoint combined with a dead-accent codepoint), which compels the disaster that is normalization. Actually, disasters, plural, because there are multiple variants of normalization, a multiplicity which is itself another disaster.
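
To make the multiple-representation point concrete: é can be the single codepoint U+00E9 or the sequence U+0065 U+0301 (e followed by combining acute); in UTF-8 octets those are C3 A9 versus 65 CC 81. They look identical but compare unequal, which is exactly the gap normalization exists to paper over. A minimal demonstration (mine, purely illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* é as the precomposed codepoint U+00E9, in UTF-8 */
        static const unsigned char composed[] = { 0xc3, 0xa9 };
        /* é as U+0065 U+0301 (e plus combining acute), in UTF-8 */
        static const unsigned char decomposed[] = { 0x65, 0xcc, 0x81 };

        /* The same character to a human; different octet strings to
           memcmp(), or to anything else that doesn't normalize first. */
        printf("lengths %d and %d; ", (int)sizeof(composed), (int)sizeof(decomposed));
        printf("equal as octets? %s\n",
               ((sizeof(composed) == sizeof(decomposed)) &&
                (memcmp(composed, decomposed, sizeof(composed)) == 0)) ? "yes" : "no");
        return 0;
    }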

UTF-8, though, I quite dislike. I especially dislike the current stampede to ram UTF-8 down everyone's collective throat. I don't react positively to having any choice made for me; trying to tell me that you know better than I do what's right for my use case is a quick way to get pushback from me. Quite aside from the "I know better than you do what's right for your use case" arrogance and, often, incorrectness, it demands that things be character strings rather than octet strings at all levels, in particular demanding that many things which have historically been encoding-blind octet strings must either be UTF-8 themselves or be tagged with some kind of encoding marker, so they can be recoded to/from UTF-8[3].

This is a problem in that it means I cannot, as things stand on my NetBSD systems, write, for example, a UTF-8 IRC client capable of using non-ASCII, because the correct encoding, the mapping between octets and conceptual characters, depends on the font in use by the terminal emulator window in use, which is at least two layers away from the IRC client (behind the pty driver and the terminal emulator) and can even change dynamically. I would have to either

(a) rewrite (at a bare minimum) the pty driver and the terminal emulator to carry encoding tags with all octets (and what to do with octets that don't represent characters?),

(b) switch everything in the whole system over to UTF-8, including the insanity that is normalization rules and the resulting bloat in everything which uses fonts and most things that work with characters at all, or

(c) push the whole mess off to the human layer, making the human specify what encoding is in use, including changing that when changing fonts, and still needing uncomfortably large mapping tables.

I don't like any of those. If there were some compelling reason to ban non-UTF-8 text, it might be worth picking one of those options and doing it. But I don't see any downside to using encoding-blind octet strings. They support UTF-8 text just fine, between clients that happen to want to use it, after all. As far as I can see, the only upside to banning non-UTF-8 text like this is that, in the fantasies of the implementors, it pushes people to switch to UTF-8. In practice, certainly in my case and I suspect in others' too, what it actually does is push people who haven't already drunk the UTF-8 koolaid (the ones who have are covered just fine by both options anyway) down to ASCII, leaving them even less internationalized...or pushes them to abandon the system that bans non-UTF-8 text, as I did in one case (see footnote 1).

 

[1] This was actually the second time I've run into an IRC server banning non-UTF-8 text. In the first case, the admin proved intransigent and I simply stopped connecting to that server. A pity, too, because it had some people I had come to consider friends.

[2] Except for the Scandinavian legacy of considering [\]^ as the uppercase versions of {|}~ when doing case-insensitive matching of things such as nicks and channel names (that case-insensitivity is IMO itself a bug, or at best a misfeature).
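
For the curious, that mapping amounts to extending ASCII's add-0x20 lowercasing ([ is 0x5b, { is 0x7b, and so on) through those four characters. A sketch of such a comparison function, mine rather than any particular server's:

    /* Lowercase one octet under the convention described above:
       A-Z map to a-z, and [\]^ map to {|}~, all by adding 0x20. */
    static unsigned char irc_lower(unsigned char c)
    {
        if ((c >= 'A' && c <= 'Z') || (c >= '[' && c <= '^')) return c + 0x20;
        return c;
    }

    /* Case-insensitive comparison of nicks or channel names under
       that convention; returns nonzero on a match. */
    static int irc_nameeq(const char *a, const char *b)
    {
        while (*a && *b) {
            if (irc_lower((unsigned char)*a) != irc_lower((unsigned char)*b))
                return 0;
            a++;
            b++;
        }
        return (*a == '\0') && (*b == '\0');
    }

Under this convention, the nicks Mouse[A] and mouse{a} compare equal.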

[3] This is why ssh, as currently specified, is close to unimplementable on many Unix variants. The spec says that various things, such as usernames, must be encoded in UTF-8. But, when I look at a password database entry (typically, a line in /etc/passwd), I don't get character strings; I get octet strings. I can't recode them to/from UTF-8 even in principle without either redesigning the system to tag them with encodings or imposing a single encoding across all usernames. I'm sure the UTF-8 proponents would say I should just switch them all to UTF-8. Why should I have to re-encode all my usernames (and impose that encoding on my users) just to have the privilege of using ssh? My ssh implementation treats those things as opaque octet strings; if some other implementation gets upset, as far as I'm concerned that's on it.
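
To illustrate that last point: here's a sketch, assuming only POSIX getpwent() and a structural UTF-8 check like the one sketched earlier, which walks the password database and reports names that are not well-formed UTF-8. Nothing in struct passwd carries any encoding tag; pw_name is just octets.

    #include <pwd.h>
    #include <stdio.h>
    #include <string.h>

    /* Same deliberately simplistic structural check as sketched earlier. */
    static int valid_utf8(const unsigned char *s, size_t len)
    {
        size_t i;
        int j;
        int n;

        for (i = 0; i < len; i += n) {
            if (s[i] < 0x80) n = 1;
            else if ((s[i] & 0xe0) == 0xc0) n = 2;
            else if ((s[i] & 0xf0) == 0xe0) n = 3;
            else if ((s[i] & 0xf8) == 0xf0) n = 4;
            else return 0;
            if (i + n > len) return 0;
            for (j = 1; j < n; j++)
                if ((s[i+j] & 0xc0) != 0x80) return 0;
        }
        return 1;
    }

    int main(void)
    {
        struct passwd *pw;

        /* pw_name is a char *: octets, with no encoding tag anywhere. */
        setpwent();
        while ((pw = getpwent()) != 0)
            if (! valid_utf8((const unsigned char *)pw->pw_name, strlen(pw->pw_name)))
                printf("not well-formed UTF-8: %s\n", pw->pw_name);
        endpwent();
        return 0;
    }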
