The term ‘WTF-8’ occasionally circulates in geek circles, often as a derogatory nickname for UTF-8, and occasionally for other purposes. I keep thinking, for example, that it ought to be the character encoding used in all standards documents published by the OMG.
A week or two ago I realised what it really ought to mean.
It seems depressingly common for Windows software to encode text in Windows-1252 but to claim (in MIME headers and the like) that it's actually ISO 8859-1. The effect of this is that while most characters are displayed correctly, characters which the author thought were single or double quotes either get ignored completely, have weird effects on the terminal, or are simply displayed as error blobs of one kind or another, because they turn into strange control characters.
A particularly annoying thing that can happen to text which is mislabelled in this way is that it can be converted into UTF-8 by an application which believes the stated encoding. When this happens, the things which were originally intended to be quote characters are translated into the UTF-8 encodings of the ISO 8859-1 control characters which occupy the same code positions as the original Windows-1252 quote characters. In other words, you end up with UTF-8 sequences such as C2 93 and C2 94 (representing the control characters U+0093 and U+0094) where you should see E2 80 9C and E2 80 9D (representing the Unicode double quote characters U+201C and U+201D).
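If you want to watch the mangling happen, here's a minimal Python sketch (my own illustration, not anything the offending software actually runs) of a document containing curly quotes being encoded as Windows-1252, decoded by an application that trusts the bogus ISO 8859-1 label, and then "helpfully" converted to UTF-8:

```python
# What the author typed: “quoted”
original = '\u201cquoted\u201d'

# What Windows software writes out: 0x93 and 0x94 are the Windows-1252 curly quotes.
cp1252_bytes = original.encode('windows-1252')          # b'\x93quoted\x94'

# An application that believes the (wrong) ISO 8859-1 label decodes those bytes
# as the C1 control characters U+0093 and U+0094...
misread = cp1252_bytes.decode('iso-8859-1')

# ...and dutifully converts them to UTF-8, giving C2 93 ... C2 94.
wtf8 = misread.encode('utf-8')                          # b'\xc2\x93quoted\xc2\x94'

# What a correctly labelled conversion would have produced: E2 80 9C ... E2 80 9D.
utf8 = cp1252_bytes.decode('windows-1252').encode('utf-8')

print(wtf8.hex(' '))   # c2 93 71 75 6f 74 65 64 c2 94
print(utf8.hex(' '))   # e2 80 9c 71 75 6f 74 65 64 e2 80 9d
```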
This, I feel, should surely be described as a Windows Transformation Format, and is additionally exactly the kind of snafu you'd expect to see near the letters WTF, so I think that on two separate counts it has an excellent claim to the name WTF-8. Perhaps someone ought to publish an Internet-Draft specifying it.