The term ‘WTF-8’ circulates in geek circles from time to time, often as a derogatory nickname for UTF-8 and occasionally for other purposes. I keep thinking, for example, that it ought to be the character encoding used in all standards documents published by the OMG.
A week or two ago I realised what it really ought to mean.
It seems depressingly common for Windows software to encode text in Windows-1252 but to claim (in MIME headers and the like) that it's actually ISO 8859-1. The effect is that while most characters are displayed correctly, the characters the author thought were single or double quotes in fact turn into strange control characters, which are ignored completely, have weird effects on the terminal, or are simply displayed as error blobs of one kind or another.
A particularly annoying thing that can happen to text which is mislabelled in this way is that it can be converted into UTF-8 by an application which believes the stated encoding. When this happens, the things which were originally intended to be quote characters are translated into the UTF-8 encodings of the ISO 8859-1 control characters which occupy the same code positions as the original Windows-1252 quote characters. In other words, you end up with UTF-8 sequences such as C2 93 and C2 94 (representing the control characters U+0093 and U+0094) where you should see E2 80 9C and E2 80 9D (representing the Unicode double quote characters U+201C and U+201D).
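The whole failure mode is easy to reproduce; here is a minimal sketch in Python, using the byte values from the paragraph above:

    # A smart-quoted string as Windows software actually emits it:
    # 0x93 and 0x94 are the curly double quotes in Windows-1252.
    original = b"\x93hello\x94"

    # What a converter which believes the bogus ISO 8859-1 label produces:
    wtf8 = original.decode("iso-8859-1").encode("utf-8")
    print(wtf8.hex(" "))   # c2 93 ... c2 94 (the C1 controls U+0093 and U+0094)

    # What a truthfully-labelled conversion would have produced:
    good = original.decode("cp1252").encode("utf-8")
    print(good.hex(" "))   # e2 80 9c ... e2 80 9d (U+201C and U+201D)

    # The damage is reversible as long as nothing has eaten the bytes:
    repaired = wtf8.decode("utf-8").encode("iso-8859-1").decode("cp1252")
    print(repaired)        # “hello”

(The repair step works because ISO 8859-1 maps every code point below U+0100 straight back to a single byte, which Windows-1252 can then decode correctly.)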
This, I feel, should surely be described as a Windows Transformation Format, and is additionally exactly the kind of snafu you'd expect to see near the letters WTF, so I think that on two separate counts it has an excellent claim to the name WTF-8. Perhaps someone ought to publish an Internet-Draft specifying it.
There was a page on my coursework website that I saved in UTF-8 in an attempt to display ancient Greek text. Bizarrely, IE displayed it fine, but Firefox displayed ??? until I manually changed the encoding in the View menu. How very confusing. o.O
93 -> C2 93 -> C3 82 C2 93 -> C3 83 C2 82 C3 82 C2 93 -> C3 83 C2 83 C3 82 C2 82 C3 83 C2 82 C3 82 C2 93, etc.
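(Each arrow in that chain is one round of misreading the bytes as Latin-1 and re-encoding them as UTF-8; a throwaway Python loop reproduces it:)

    data = b"\x93"
    for _ in range(4):
        # One round of mojibake: misread as Latin-1, re-encode as UTF-8.
        data = data.decode("iso-8859-1").encode("utf-8")
        print(data.hex(" ").upper())
    # C2 93
    # C3 82 C2 93
    # C3 83 C2 82 C3 82 C2 93
    # C3 83 C2 83 C3 82 C2 82 C3 83 C2 82 C3 82 C2 93

Note the growth: every iteration roughly doubles the length, since each non-ASCII byte becomes two.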
(That said, I have recently been receiving Japanese spam which is actually encoded in Shift-JIS but claims to be ISO-2022-JP, which is a particularly disgusting combination. If it were really in EUC-JP that would at least be vaguely internally consistent, in the sense that one could construct a subset of full ISO-2022 containing both. I can only assume such spam is directed at people whose MUA is both configured in Shift-JIS and ignores character set headers, and it seems astonishing to me that there are enough such people to make spam like that worthwhile...)
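(The fraud is at least mechanically detectable, since genuine ISO-2022-JP is a pure 7-bit encoding which switches character sets with escape sequences, whereas Shift-JIS uses 8-bit lead bytes. A rough sniffer, sketched in Python with a hypothetical function name:)

    def sniff_japanese(data: bytes) -> str:
        # ESC $ B and ESC $ @ designate JIS X 0208 in ISO-2022-JP.
        if b"\x1b$B" in data or b"\x1b$@" in data:
            return "iso-2022-jp"
        # Any 8-bit byte rules out ISO-2022-JP entirely.
        if any(b >= 0x80 for b in data):
            return "shift_jis or euc-jp"
        return "ascii"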
No, what comes up in database-y applications is typically a set of tables in a cross-border customer system being accessed and updated by non-Unicode-aware clients, each of which may be using a different local character set...
...and then someone else comes along and installs a Unicode-aware client on some subset of the customer terminals which expects all this data to be in UTF-8. (For bonus marks, these may then also be used to edit some records.)
Afterwards, another person will come and ask me how to make sure all the records display correctly everywhere. Laughter typically ensues - at least until I realise that they're really not joking.
(A proprietary ISAM-based beastie. The system in question dates from rather a while before Unicode even existed, so it's just about excusable. Its as-yet-unreleased replacement mercifully does do the Right Things and manages to use a Real Database to boot.)
Then again, this (http://thedailywtf.com/forums/thread/97990.aspx) story is about one of the two largest companies in our market (we were recently bought by the other one), so as horrifying as the thought is, it's pretty safe to say that It Could Be Worse. Much worse.
(In fact the JSON spec requires decoders to implement this kind of trick; however, because JSON files are Unicode and always start off with two ASCII characters, the decoder can work out the transformation format and endianness from the position of the nulls in the first four octets.)
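(That detection logic is short enough to sketch; the following assumes at least four octets of input and follows the null-byte patterns described in RFC 4627:)

    def detect_json_encoding(data: bytes) -> str:
        # The first two characters of a JSON text are ASCII, so the
        # pattern of NUL octets identifies the encoding:
        #   00 00 00 xx -> UTF-32BE    xx 00 00 00 -> UTF-32LE
        #   00 xx 00 xx -> UTF-16BE    xx 00 xx 00 -> UTF-16LE
        #   no NULs     -> UTF-8
        if data[0] == 0 and data[1] == 0:
            return "utf-32-be"
        if data[0] == 0:
            return "utf-16-be"
        if data[1] == 0:
            return "utf-32-le" if data[2] == 0 else "utf-16-le"
        return "utf-8"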
Internet Draft
http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-wtf8.html