simont | (no subject)

I spent yesterday discovering a truly horrid mess.

There is a defined mechanism for X applications to transfer internationalised text among themselves when cutting and pasting. It's called Compound Text, and it's basically a knockoff of ISO 2022: a collection of base character sets, plus a mechanism of escape sequences allowing switching between those character sets. A bit unwieldy given that we have UTF-8, but then this whole edifice predates UTF-8 so we have to forgive it that. (There is also a newer and simpler mechanism for X applications to offer a UTF-8 string as well as compound text, so things are improving.)
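As a rough illustration of the "base character sets plus escape sequences" idea, here is a minimal Python sketch of a decoder for a tiny subset of compound text. It assumes the left half (GL) is always ASCII and that only three right-half (GR) sets can be designated, using the sequences ESC - A, ESC - F and ESC - L for ISO 8859-1, 8859-7 and 8859-5 as I understand them from the compound text specification. The function name and the table are made up for this example, and anything outside the subset is simply rejected.

```python
# Hypothetical subset decoder: GL is fixed to ASCII, GR is switchable.
# Final bytes for the GR designations are taken from the compound text
# spec as recalled here; treat the table as an assumption.
GR_SETS = {b"A": "iso8859-1", b"F": "iso8859-7", b"L": "iso8859-5"}


def decode_simple_ctext(data: bytes) -> str:
    gr = "iso8859-1"          # assumed initial GR set
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x1B:
            # ESC - <final>: designate a 96-character set into GR
            if data[i + 1:i + 2] == b"-" and data[i + 2:i + 3] in GR_SETS:
                gr = GR_SETS[data[i + 2:i + 3]]
                i += 3
                continue
            # ESC ( B: designate ASCII into GL (already the case here)
            if data[i + 1:i + 3] == b"(B":
                i += 3
                continue
            raise ValueError("unsupported escape sequence at offset %d" % i)
        if b < 0x80:
            out.append(chr(b))
        else:
            # Interpret the byte in whichever set GR currently holds
            out.append(bytes([b]).decode(gr))
        i += 1
    return "".join(out)
```

The only state carried between bytes is the current GR set, which is what makes this style of encoding convertible one character at a time.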

In a Unicode-based application such as PuTTY/pterm, supporting this would have been a severe headache, and it was therefore with great relief that I discovered some time ago that Xlib supplies functions called Xutf8TextListToTextProperty and Xutf8TextPropertyToTextList which will convert between compound text and UTF-8. This seemed to basically work; pasting to and from Emacs didn't always work quite right, but I assumed that was just a version mismatch or something. Phew, I thought; compound text may be a headache, but it isn't my headache.

How wrong I was.

Recently several people have had trouble compiling Unix PuTTY on non-Linux systems; it turns out that these Xutf8* functions are an XFree86 extension and not supported by all Xlibs. So it looks rather as if I'm going to need to implement compound text myself after all.

Well, that doesn't sound too ghastly to begin with. The compound text specification is available as part of the Debian xbooks package; I have a character set translation library already; it supports all the base character sets defined in the specification; and it also already has general code for supporting character encodings which are subsets of ISO 2022. It ought to be the work of an hour or so to put all the pieces together and build my own self-contained compound-text-to-Unicode translator.

So I did that, and it looked sensible enough to me when tested against itself. Then I started testing it against other applications … and that's where the real trouble started.

I wrote a test program which attempted to cross-test between my compound text implementation and Xutf8. This turned up several additional character sets supported by Xutf8 which weren't in the specification I had. Fair enough, I thought, the spec has obviously been updated; so I added support for the one character set I could usefully identify, and then tried to figure out what the others were. No luck as yet: ISO 2022 identifies character sets by codes consisting of one or two arbitrary characters, rather than by their actual names, and I have not yet managed to find out which character sets are referred to by the sequences in question. ISO 2022 itself is available for free download under its alias of ECMA-35, but the register of character sets seems harder to find.

Worse than that, though, Xutf8 also supports an ISO 2022 extension syntax for character sets that don't yet have an allocated identifier in the register. In this syntax you specify an escape sequence and a two-byte length; then you supply precisely that many bytes, of which the first few give the full textual name of a character set and then some data in that character set follows. This is outstandingly annoying since my entire character set conversion library is based on the fundamental assumption that you can receive one character at a time and convert as you go along with a bounded amount of state; the very last thing I want to be required to do is to know in advance how many of the following characters need to be represented in the same weird character set. Very annoying.
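To make the shape of that syntax concrete, here is a hedged Python sketch of parsing one such length-prefixed segment. The byte-level details are assumptions based on my reading of the compound text specification rather than anything stated above: the header is taken to be ESC % / F M L, with F in the range '0' to '4', each length byte carrying its high bit, the length computed as (M - 128) * 128 + (L - 128), and the character set name separated from the data by an STX byte. The function name is hypothetical.

```python
STX = b"\x02"   # assumed terminator between the charset name and the data


def parse_extended_segment(buf: bytes, pos: int = 0):
    """Return (charset_name, payload, next_pos) for a segment at buf[pos:]."""
    header = buf[pos:pos + 6]
    if len(header) < 6 or header[:3] != b"\x1b%/":
        raise ValueError("not an extended segment")
    f, hi_byte, lo_byte = header[3], header[4], header[5]
    if not 0x30 <= f <= 0x34:
        raise ValueError("unexpected octets-per-character byte")
    if hi_byte < 0x80 or lo_byte < 0x80:
        raise ValueError("length bytes must have the high bit set")
    length = (hi_byte - 0x80) * 128 + (lo_byte - 0x80)
    body = buf[pos + 6:pos + 6 + length]
    if len(body) < length:
        raise ValueError("truncated segment")
    # The declared length covers the name, the STX and the data together.
    name, sep, payload = body.partition(STX)
    if not sep:
        raise ValueError("missing STX after charset name")
    return name.decode("latin-1"), payload, pos + 6 + length
```

Note that the parser needs the whole segment in hand before it can return anything, and a writer has to know the payload length before emitting the first byte. That is exactly the property that clashes with a converter built around bounded per-character state.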

Still worse than that, Xutf8 also supports a second ISO 2022 extension, which is ESC % G to switch into UTF-8 mode and ESC % @ to switch back; so in fact Xutf8 can round-trip-encode any Unicode character by doing this. Except that, as it turns out, it isn't permitted! The compound text specification I have explicitly disallows the use of that class of ISO 2022 extension. Arrgh.
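For a decoder that wants to tolerate this on input, the bracketing is at least simple to recognise. The following Python sketch splits a byte string into runs that lie inside or outside an ESC % G ... ESC % @ bracket. It is only an illustration: the function name is invented, and it naively searches for the marker bytes, so it would be confused by an ESC % G that happened to occur inside the payload of a length-prefixed segment.

```python
UTF8_ON = b"\x1b%G"    # switch into UTF-8
UTF8_OFF = b"\x1b%@"   # switch back to ISO 2022


def split_utf8_runs(data: bytes):
    """Return a list of (is_utf8, chunk) pairs in input order."""
    runs = []
    in_utf8 = False
    pos = 0
    while pos < len(data):
        # Look for whichever marker would end the current mode
        marker = UTF8_OFF if in_utf8 else UTF8_ON
        end = data.find(marker, pos)
        if end < 0:
            end = len(data)
        if end > pos:
            runs.append((in_utf8, data[pos:end]))
        pos = end + len(marker)
        in_utf8 = not in_utf8
    return runs
```

Each run flagged as UTF-8 can then be handed to an ordinary UTF-8 decoder, and the rest to the ISO 2022 machinery.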

And as if that lot wasn't bad enough … I also attempted to test interoperability between my compound text implementation and Emacs. Emacs generally doesn't believe in Unicode, and works internally in a much more ISO 2022-like mechanism, in which every character in a buffer is tagged with the character set it came from. Therefore, it speaks compound text very fluently, and barely speaks UTF-8 at all; given that and its popularity, it was an obvious application that I'd want to interoperate with well.

So I fired up my local Emacs and ran M-x mule-diag. This gave me a buffer full of all sorts of international characters, so I started cutting and pasting them into my test programs to see what would happen. Some of them worked; others didn't. Eventually I figured out a way to persuade Emacs to save the entire buffer to a file in the compound text format, and examined it in detail to see what was missing. As it turned out, the full extent of the problem is simply that Emacs supports a whole load of character set identifiers that I don't know about. That didn't sound too bad; it ought to be just a matter of finding out what character sets those identifiers represent, and implementing them. Right?

Well, as I said earlier, actually getting my hands on the ISO 2022 character set register appears tricky. Instead I resorted to reading the Emacs sources, to see what it thinks all those weird codes represent. And somewhere in the Emacs source, I noticed a comment that said that since the ‘Big5’ encoding of Chinese is too big to fit in the 96x96 space available to a single base character set, it has been split into two sub-character-sets for Emacs's internal use. At this point I started to get suspicious; I prepared a test file encoded in Big5 containing a sample of every defined character, loaded it into Emacs, and saved it back out as compound text (actually figuring out how to persuade Emacs to do that was enough of a challenge, but). Sure enough, the first half of the characters are encoded using the ISO 2022 sequence ESC $ ( 0, and the second half are preceded by ESC $ ( 1. In other words, as closely as I can determine without seeing the real ISO 2022 register, Emacs has unilaterally allocated ISO 2022 identifiers for all its internal character set IDs! Its sole concession to interoperability appears to have been that it has used the existing ISO 2022 identifier for any character set which it knew to possess one.
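Examining a saved file in this way is easy to mechanise. The Python sketch below lists the distinct escape sequences found in a dump, relying on the general ISO 2022 escape sequence shape as described in ECMA-35: ESC, then zero or more intermediate bytes in the range 0x20 to 0x2F, then one final byte in the range 0x30 to 0x7E. The function name is invented for the example, and the scan is deliberately naive: it does not track state, so bytes inside a length-prefixed segment could produce false matches.

```python
import re

# ESC, any number of intermediate bytes, then a single final byte
ESCAPE_RE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")


def list_escape_sequences(data: bytes):
    """Return the distinct escape sequences in data, in first-seen order."""
    seen = []
    for match in ESCAPE_RE.finditer(data):
        seq = match.group(0)
        if seq not in seen:
            seen.append(seq)
    return seen
```

Run over a Big5 test file saved as described, a scan like this would be expected to report ESC $ ( 0 and ESC $ ( 1 among its results.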

No wonder pasting to and from Emacs didn't work terribly well even with PuTTY's current Xutf8-based support for compound text. Emacs doesn't in fact speak compound text: it speaks what I suppose I'd have to describe as a fork of ISO 2022, which maintains sideways compatibility with the original for any character set they have in common, but beyond that the two are mutually incomprehensible. If I ever meet the person responsible for that, one of us will regret it.

The only question left is, now what do I do? I really don't want to have to support UTF-8 embedded in ISO 2022, and I'm backed up in this by the compound text spec. I definitely don't want to start supporting Emacs's internal character sets in its fake ISO 2022 abomination, and I strongly suspect I'd be backed up in that by half the standards bodies in the world. But the more things I decline to support because they're evil or stupid or both, the more PuTTY will fail to interoperate with existing applications and people will send me bug reports along the lines of ‘Well, it works fine pasting into another Emacs, it must be a bug in your code’.

*sigh*


Well, it works fine pasting into another Emacs, it must be a bug in your code.

Put it in the FAQ. Hell, link this post in the FAQ. Then when people bitch at you, just reply with "See FAQ".

Do any other applications besides Emacs do Emacs's style of broken ISO 2022?

I'd strongly avoid implementing this garbage. If you make PuTTY only support the newer UTF-8 copy/paste you referred to, will a modern Emacs (perhaps in the future) negotiate with the other app to agree on a compatible copy/paste protocol? I'd put the onus on Emacs to conform to standards, even if they weren't out yet when it went astray.

It's not quite as bad as I'd previously feared; it turns out that Emacs's bizarre character set definitions are within the ISO 2022 private use section (which I hadn't realised there was one of). So they probably aren't technically doing anything wrong, although it's still not very pleasant.
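The private-use rule can be checked mechanically. As I read ECMA-35, an escape sequence whose final byte lies in the range 0x30 to 0x3F (the characters '0' to '?') is reserved for private use, while final bytes from 0x40 upwards are allocated through the register. The Python sketch below applies that test to a single sequence; the function name is made up for the example.

```python
def is_private_use_escape(seq: bytes) -> bool:
    """True if seq is an ISO 2022 escape sequence with a private-use final byte."""
    if len(seq) < 2 or seq[0] != 0x1B:
        raise ValueError("not an escape sequence")
    # Everything between ESC and the final byte must be an intermediate byte
    if any(not 0x20 <= b <= 0x2F for b in seq[1:-1]):
        raise ValueError("bad intermediate byte")
    final = seq[-1]
    if not 0x30 <= final <= 0x7E:
        raise ValueError("bad final byte")
    return final <= 0x3F
```

By this test the ESC $ ( 0 and ESC $ ( 1 sequences seen from Emacs both count as private use, whereas a registered designation such as ESC $ ( B does not.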

Some more investigation reveals that the Xutf8 functions do actually understand (even if they don't generate) some of Emacs's private character sets. I have a feeling I'm probably going to need to take a conservative/liberal approach: be ready to understand any bizarre crud fed to me by Emacs or XFree86 or anyone else, but only generate standards-compliant niceness myself...
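The "generate only standards-compliant niceness" half of that approach might look something like the following Python sketch. It is an illustration under assumptions, not PuTTY's code: it knows only ASCII plus three ISO 8859 right halves (with designation sequences as I understand them from the compound text specification), it assumes ISO 8859-1 is the initial right-half set, and it refuses anything it cannot express rather than falling back to UTF-8 brackets or private character sets. All names are invented for the example.

```python
# (codec, designation escape) pairs for the GR sets this sketch will emit
GR_DESIGNATIONS = [
    ("iso8859-1", b"\x1b-A"),
    ("iso8859-7", b"\x1b-F"),
    ("iso8859-5", b"\x1b-L"),
]


def _gr_byte(ch: str, codec: str):
    """Return the single GR byte for ch in codec, or None if unavailable."""
    try:
        encoded = ch.encode(codec)
    except UnicodeEncodeError:
        return None
    # Only 0xA0..0xFF is graphic; 0x80..0x9F would be C1 controls
    return encoded if encoded[0] >= 0xA0 else None


def encode_strict_ctext(text: str) -> bytes:
    out = bytearray()
    current = "iso8859-1"                     # assumed initial GR set
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ord(ch))
            continue
        # Prefer the currently designated set to avoid needless switches
        order = [current] + [c for c, _ in GR_DESIGNATIONS if c != current]
        for codec in order:
            encoded = _gr_byte(ch, codec)
            if encoded is not None:
                if codec != current:
                    out += dict(GR_DESIGNATIONS)[codec]
                    current = codec
                out += encoded
                break
        else:
            raise ValueError("cannot represent %r in this subset" % ch)
    return bytes(out)
```

A real implementation would cover the full list of character sets in the specification; the point here is only that the writer stays inside that list while the reader can afford to be more forgiving.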

ISO 2022 itself is available for free download under its alias of ECMA-35, but the register of character sets seems harder to find.

all-escapes (http://bjh21.me.uk/all-escapes/all-escapes.txt) lists at least some of them, and it cites its sources, of which the relevant one is <http://www.itscj.ipsj.or.jp/ISO-IR/>.