What would you call the multiply encoded formats you get when someone's taken what they thought was UCS-2, or similar, and UTF-8'd it multiple times, so you end up with lots of C2s and C3s followed by 82s and 83s in a row? It seems to have neat mathematical properties, but I'm usually not in the mood when I see it. I see it so often that it must have a name. I found a file the other day (a MySQL dump) which had had this applied six times. So you get:

93 -> C2 93 -> C3 82 C2 93 -> C3 83 C2 82 C3 82 C2 93 -> C3 83 C2 83 C3 82 C2 82 C3 83 C2 82 C3 82 C2 93, etc.
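This expansion is easy to reproduce: each round misreads the bytes as Latin-1 (so every high byte becomes a character of its own) and then re-encodes the result as UTF-8. A minimal sketch in Python, assuming the original byte was 0x93 and the misinterpretation was Latin-1:

```python
# Reproduce the mojibake chain: each round decodes the bytes as
# Latin-1 and re-encodes the resulting characters as UTF-8.
data = b"\x93"
for round_no in range(1, 4):
    data = data.decode("latin-1").encode("utf-8")
    print(round_no, data.hex(" "))
# 1 c2 93
# 2 c3 82 c2 93
# 3 c3 83 c2 82 c3 82 c2 93
```

Since every byte in the intermediate results is above 0x7F, each round exactly doubles the length, which is where the "neat mathematical properties" come from.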
Yuck. I've never seen that. Perhaps it comes up more often in databasey sorts of applications (which I generally avoid anyway). In the MIME world, there's scope to mislabel an encoding as a similar encoding, but generally people are pretty clear about whether something's already in UTF-8 or not.
(That said, I have recently been receiving Japanese spam which is actually encoded in Shift-JIS but claims to be ISO-2022-JP, which is a particularly disgusting combination. If it were really in EUC-JP that would at least be vaguely internally consistent, in the sense that one could construct a subset of full ISO-2022 containing both. I can only assume such spam is directed at people whose MUA is both configured in Shift-JIS and ignores character set headers, and it seems astonishing to me that there are enough such people to make spam like that worthwhile...)
No, what comes up in database-y applications is typically a set of tables in a cross-border customer system being accessed and updated by non-Unicode-aware clients, each of which may be using a different local character set...
...and then someone else comes along and installs a Unicode-aware client on some subset of the customer terminals which expects all this data to be in UTF-8. (For bonus marks, these may then also be used to edit some records.)
Afterwards, another person will come and ask me how to make sure all the records display correctly everywhere. Laughter typically ensues - at least until I realise that they're really not joking.
(A proprietary ISAM-based beastie. The system in question dates from rather a while before Unicode even existed, so it's just about excusable. Its as-yet-unreleased replacement mercifully does do the Right Things and manages to use a Real Database to boot.)
Unless you accidentally get yourself employed by my workplace, you probably can, yes.
Then again, this (http://thedailywtf.com/forums/thread/97990.aspx) story is about one of the two largest companies in our market (we were recently bought by the other one), so as horrifying as the thought is, it's pretty safe to say that It Could Be Worse. Much worse.
This kind of thing is why practical software should treat text with a bit of DWIMmery. Fortunately it's almost always possible to correctly glark the encoding without extra metadata.
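As a bit of DWIMmery for the multiply-encoded case: repeatedly decode as UTF-8 and re-encode as Latin-1 for as long as that keeps working, and stop when nothing changes. This is only a heuristic sketch (the function name is mine, it assumes the innermost misinterpretation was Latin-1, and genuinely-intended UTF-8 text whose characters all fit in Latin-1 is inherently ambiguous):

```python
def undo_utf8_rounds(data: bytes) -> bytes:
    """Peel off repeated accidental UTF-8 encodings, one round at a time.

    Heuristic: if the bytes decode as UTF-8 and the resulting characters
    all fit in Latin-1, assume one spurious encoding round and undo it.
    """
    while True:
        try:
            candidate = data.decode("utf-8").encode("latin-1")
        except (UnicodeDecodeError, UnicodeEncodeError):
            return data  # no further round to peel off
        if candidate == data:
            return data  # pure ASCII: nothing left to undo
        data = candidate
```

For example, feeding it the three-rounds form of the chain above, `undo_utf8_rounds(b"\xc3\x83\xc2\x82\xc3\x82\xc2\x93")`, gets you back to the single byte 0x93.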
(In fact the JSON spec requires decoders to implement this kind of trick; however because JSON files are Unicode and always start off with two ASCII characters, the decoder can work out the transformation format and endianness from the position of the nulls in the first four octets.)
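The trick from RFC 4627 (the original JSON spec): since the first two characters are guaranteed to be ASCII, the pattern of NUL octets in the first four bytes pins down the transformation format and endianness. A sketch of that table (the function name is mine):

```python
def sniff_json_encoding(octets: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from the pattern of
    nulls in its first four octets, per RFC 4627 section 3."""
    b = octets[:4]
    if len(b) < 4:
        return "utf-8"  # too short to tell; UTF-8 is the sane default
    if b[0] == 0 and b[1] == 0 and b[2] == 0:
        return "utf-32-be"  # 00 00 00 xx
    if b[1] == 0 and b[2] == 0 and b[3] == 0:
        return "utf-32-le"  # xx 00 00 00
    if b[0] == 0 and b[2] == 0:
        return "utf-16-be"  # 00 xx 00 xx
    if b[1] == 0 and b[3] == 0:
        return "utf-16-le"  # xx 00 xx 00
    return "utf-8"          # xx xx xx xx
```

(Later JSON RFCs loosened the grammar so a bare string or number is a valid JSON text, which weakens the two-ASCII-characters guarantee, but the trick works for the object/array case described here.)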
Ooh, yes, I'd not noticed that property of JSON, but you're right! Very useful. There's a similarly tedious description of that technique, keyed on the "<?xm" prefix, in the XML spec. Amazingly, at least as of a year ago, nobody seemed to have implemented the relevant DWIM filter [from InputStream (octet sequence) to Reader (character sequence)] for XML in Java, and I've not been able to convince people that implementing it would be a productive use of my time. So we all arse around with character set objects and get it wrong. :(
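The XML version of the table is in Appendix F of the XML 1.0 spec: check for a BOM first, otherwise look at how the four octets of "<?xm" come out in each encoding family, then read the real name out of the encoding declaration. A rough sketch of the BOM-less rows (names are mine; this ignores the EBCDIC row and the follow-up parse of the encoding declaration):

```python
# Map the first four octets of "<?xm" to an encoding family, following
# the BOM-less rows of XML 1.0 Appendix F.
XM_PATTERNS = {
    b"\x00\x00\x00\x3c": "utf-32-be",
    b"\x3c\x00\x00\x00": "utf-32-le",
    b"\x00\x3c\x00\x3f": "utf-16-be",
    b"\x3c\x00\x3f\x00": "utf-16-le",
    # "<?xm" itself: some ASCII-compatible encoding; the encoding
    # declaration, once parsed, gives the real name.
    b"\x3c\x3f\x78\x6d": "utf-8",
}

def sniff_xml_encoding(octets: bytes) -> str:
    return XM_PATTERNS.get(octets[:4], "utf-8")
```

Wrapping this up as the missing InputStream-to-Reader filter is then mostly plumbing: sniff the family, decode far enough to read the encoding declaration, and restart the stream with the declared charset.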