simont ([personal profile] simont) wrote 2006-11-21 12:09 pm

WTF-8

The term ‘WTF-8’ circulates in geek circles from time to time, usually as a derogatory nickname for UTF-8 and occasionally for other purposes. I keep thinking, for example, that it ought to be the character encoding used in all standards documents published by the OMG.

A week or two ago I realised what it really ought to mean.

It seems depressingly common for Windows software to encode text in Windows-1252 but to claim (in MIME headers and the like) that it's actually ISO 8859-1. The effect is that while most characters display correctly, the characters the author intended as single or double quotes turn into strange control characters, which are ignored completely, have weird effects on the terminal, or are simply displayed as error blobs of one kind or another.

A particularly annoying thing that can happen to text which is mislabelled in this way is that it can be converted into UTF-8 by an application which believes the stated encoding. When this happens, the things which were originally intended to be quote characters are translated into the UTF-8 encodings of the ISO 8859-1 control characters which occupy the same code positions as the original Windows-1252 quote characters. In other words, you end up with UTF-8 sequences such as C2 93 and C2 94 (representing the control characters U+0093 and U+0094) where you should see E2 80 9C and E2 80 9D (representing the Unicode double quote characters U+201C and U+201D).
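
For the terminally curious, here's a minimal sketch of the failure in Python, whose 'cp1252' codec is Windows-1252 and whose 'latin-1' codec is ISO 8859-1 (C1 control range included):

    # Curly quotes as Windows-1252 bytes.
    original = b'\x93quoted\x94'

    # Converted honestly, 0x93 and 0x94 become U+201C and U+201D:
    utf8 = original.decode('cp1252').encode('utf-8')
    print(utf8.hex(' '))   # e2 80 9c 71 75 6f 74 65 64 e2 80 9d

    # Converted under the ISO 8859-1 mislabelling, they become the
    # C1 control characters U+0093 and U+0094 instead:
    wtf8 = original.decode('latin-1').encode('utf-8')
    print(wtf8.hex(' '))   # c2 93 71 75 6f 74 65 64 c2 94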

This, I feel, should surely be described as a Windows Transformation Format, and is additionally exactly the kind of snafu you'd expect to see near the letters WTF, so I think that on two separate counts it has an excellent claim to the name WTF-8. Perhaps someone ought to publish an Internet-Draft specifying it.


[personal profile] gerald_duck 2006-11-21 12:32 pm (UTC)
I still maintain that WTF-8 should be a Gödelised encoding, and should be used for representing INTERCAL programs.

[identity profile] uisgebeatha.livejournal.com 2006-11-21 12:44 pm (UTC)
Don't forget that if they stuff up their code they will have to destroy it on an OMGWTFBBQ. ;)

There was a page in my coursework website that I saved in UTF-8 in an attempt to display ancient Greek text. Bizarrely, IE displayed it fine but Firefox displayed ??? until I manually changed the encoding in the View menu. How very confusing. o.O

[identity profile] womble2.livejournal.com 2006-11-22 01:33 am (UTC)
Regrettably, the HTTP spec says that a character encoding specified in the HTTP header overrides an encoding specified in the body, e.g. through <meta http-equiv="Content-Type" ...> in HTML. This allows proxies to recode for limited browsers (such as on phones) that only support a few encodings, without needing to parse the body. However, it also relies on web servers parsing the HTML (or other document format) to determine the encoding that should appear in the HTTP header, which they generally don't do. Worse, Apache is commonly configured to specify a single character encoding for all documents, because allowing browsers to guess is apparently a security problem. Firefox follows the spec and believes the HTTP header; IE, as usual, ignores the spec and just guesses.
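
A sketch of the precedence rule in Python, with an invented header value and byte string purely for illustration:

    # The HTTP header and the document body disagree about the charset:
    header = 'text/html; charset=ISO-8859-1'
    body = (b'<meta http-equiv="Content-Type" '
            b'content="text/html; charset=windows-1252">\x93Hi\x94')

    # Per the HTTP spec the header wins, so a conforming browser decodes
    # 0x93 and 0x94 as C1 control characters rather than curly quotes:
    charset = header.split('charset=')[1]
    text = body.decode(charset)
    assert '\u0093' in text and '\u201c' not in text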

[identity profile] kaet.livejournal.com 2006-11-21 12:52 pm (UTC)
What would you call the multiply-encoded format you get when someone's taken what they thought was UCS-2, or similar, and UTF-8'd it multiple times, so you get lots of C2, C3, 82 and 83 in a row? It seems to have neat mathematical properties, but I'm usually not in the mood when I see it. I see that so often that it must have a name. I found a file the other day (a MySQL dump) which had had this applied six times. So you get

93 -> C2 93 -> C3 82 C2 93 -> C3 83 C2 82 C3 82 C2 93 -> C3 83 C2 83 C3 82 C2 82 C3 83 C2 82 C3 82 C2 93, etc.
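
(A few lines of Python reproduce the sequence, assuming each round reads the bytes as ISO 8859-1 (Python's 'latin-1' codec) and writes them back out as UTF-8:

    data = b'\x93'
    for i in range(4):
        print(i, data.hex(' '))
        data = data.decode('latin-1').encode('utf-8')

    # Output:
    # 0 93
    # 1 c2 93
    # 2 c3 82 c2 93
    # 3 c3 83 c2 82 c3 82 c2 93

Each round is invertible, via decode('utf-8').encode('latin-1'), so the damage can be undone as long as you can work out how many rounds were applied.)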

[identity profile] oneplusme.livejournal.com 2006-11-21 06:35 pm (UTC)
Yuck. I've never seen that. Perhaps it comes up more often in database-y sorts of applications (which I generally avoid anyway).

No, what comes up in database-y applications is typically a set of tables in a cross-border customer system being accessed and updated by non-Unicode-aware clients, each of which may be using a different local character set...

...and then someone else comes along and installs a Unicode-aware client on some subset of the customer terminals which expects all this data to be in UTF-8. (For bonus marks, these may then also be used to edit some records.)

Afterwards, another person will come and ask me how to make sure all the records display correctly everywhere. Laughter typically ensues - at least until I realise that they're really not joking.

[identity profile] womble2.livejournal.com 2006-11-22 01:36 am (UTC)
Real DBMSes know about character encodings. What are you using?

[identity profile] oneplusme.livejournal.com 2006-11-22 07:38 am (UTC)
You don't want to know...

(A proprietary ISAM-based beastie. The system in question dates from rather a while before Unicode even existed, so it's just about excusable. Its as-yet-unreleased replacement mercifully does do the Right Things and manages to use a Real Database to boot.)

[identity profile] womble2.livejournal.com 2006-11-22 10:13 am (UTC)
I think I can probably manage to avoid that then!

[identity profile] oneplusme.livejournal.com 2006-11-22 10:07 pm (UTC)
Unless you accidentally get yourself employed by my workplace, you probably can, yes.

Then again, this (http://thedailywtf.com/forums/thread/97990.aspx) story is about one of the two largest companies in our market (we were recently bought by the other one), so as horrifying as the thought is, it's pretty safe to say that It Could Be Worse. Much worse.

[personal profile] fanf 2006-11-21 02:02 pm (UTC)
This kind of thing is why practical software should treat text with a bit of DWIMmery. Fortunately it's almost always possible to correctly glark the encoding without extra metadata.

(In fact the JSON spec requires decoders to implement this kind of trick; there's no guesswork involved, however, because JSON files are Unicode and always start off with two ASCII characters, so the decoder can work out the transformation format and endianness from the position of the nulls in the first four octets.)
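
A sketch of that detection in Python, following the table in RFC 4627 (the function name is mine):

    def sniff_json_encoding(octets: bytes) -> str:
        # The first two characters are ASCII, so the pattern of zero
        # bytes in the first four octets pins down the encoding.
        nulls = tuple(b == 0 for b in octets[:4])
        return {
            (True, True, True, False):  'utf-32-be',   # 00 00 00 xx
            (False, True, True, True):  'utf-32-le',   # xx 00 00 00
            (True, False, True, False): 'utf-16-be',   # 00 xx 00 xx
            (False, True, False, True): 'utf-16-le',   # xx 00 xx 00
        }.get(nulls, 'utf-8')                          # xx xx xx xx

    assert sniff_json_encoding('{"a":1}'.encode('utf-16-le')) == 'utf-16-le'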

[identity profile] kaet.livejournal.com 2006-11-21 02:08 pm (UTC)
Ooh, yes, I'd not noticed that property of JSON, but you're right! Very useful. There's a similarly tedious description of that technique, keyed on the "<?xm", in the XML spec. Amazingly, at least a year ago, nobody seemed to have implemented the relevant DWIM filter [from InputStream (octet sequence) to Reader (character sequence)] for XML in Java, and I've not been able to convince people that implementing it would be a productive use of my time. So we all arse around with character set objects and get it wrong, :(.
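
The XML version (Appendix F of the XML spec) is the same idea keyed on the first four octets of "<?xml"; a Python sketch of the BOM-less cases:

    # First four octets of '<?xml' in each encoding family.
    SIGNATURES = {
        b'\x00\x00\x00<': 'utf-32-be',
        b'<\x00\x00\x00': 'utf-32-le',
        b'\x00<\x00?':    'utf-16-be',
        b'<\x00?\x00':    'utf-16-le',
        b'<?xm':          'utf-8',  # any ASCII-compatible encoding; the
                                    # encoding declaration narrows it down
    }

    def sniff_xml_encoding(octets: bytes) -> str:
        return SIGNATURES.get(octets[:4], 'utf-8')

    assert sniff_xml_encoding(b'<?xml version="1.0"?>') == 'utf-8'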

Internet Draft

[personal profile] fanf 2006-11-21 03:13 pm (UTC)
This is a rip-off of RFC 3629, but I got bored in section 4.

http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-wtf8.html