simont: A picture of me in 2016 (Default)
simont ([personal profile] simont) wrote2006-06-27 12:14 pm

My Evil Hack of the Week

It's early in the week, but I doubt I'll beat this in the next few days: yesterday evening I implemented a string search function (equivalent to Perl's rindex) recursively.

It's reasonably well known among LiveJournal users that putting tables in an LJ entry has a tendency to break the formatting of people's friends pages. In fact, it's not all tables; it's only tables which don't explicitly close <tr> and <td> tags.

This is because LJ's HTML cleaning function, which is applied separately to the HTML generated by each individual journal entry, doesn't know that those two elements can be implicitly closed in HTML 4; so if it sees either of those tags without a corresponding explicit close tag, it remembers, and at the very end of the string it's cleaning it splurges out a huge string of pointless </tr> and </td> tags, long after the </table> before which they might have at least managed to be relatively harmless if silly. When such an LJ entry is output within a page style that uses tables for layout, these unexpected closing tags disrupt the style's external tables and things go badly wrong.

Once upon a time I attempted to prepare a patch against the HTML cleaner which fixed this problem, but unfortunately failed because the code was more hideous than I was prepared to deal with when not being paid large amounts of money. So instead I worked around it in my own journal style (which, thanks to ick-proxy, I use for absolutely all my LJ reading): an S2 journal style is a program in a Perl-like language which is given fragments of HTML and other such stuff as input and produces a complete web page as output, and for about a year mine has been checking for spurious </tr> and </td> tags at the end of those HTML fragments and stripping them off.

Unfortunately, the recent introduction of ‘tagging’ of LJ entries broke my defensive code: the HTML fragment which shows the tags on an entry seems to be appended to what comes out of the HTML cleaner before it gets passed to my style code, meaning that those spurious tags now aren't guaranteed to appear at the very end of the entry, but can be somewhere in the middle. Yesterday I encountered the first LJ entry I'd seen with both a bug-tickling table and tags, and realised that my defence needed a bit more work.

After a bit of thought, the obvious answer seemed to be that any </tr> or </td> tag which appears after the last </table> can be identified as spurious and removed. That sounded easy enough. So I went to the S2 documentation and looked for their equivalent of rindex … and they don't have one. Their string class provides a boolean contains() method to tell you if one string occurs in another, but sadly not one which will tell you where it occurs.

Well, that's not a crippling problem; string functions aren't that hard, so I can just write it myself. The problem is that S2 doesn't provide any looping constructs except foreach: no while, and no general for. It looks rather as if they were trying to ensure the language was unable to express an indefinite loop, in the manner of Douglas Hofstadter's BlooP, presumably to prevent badly-written styles from hanging their web server.

Fortunately, they did leave in recursion. (In fact I was already using recursion for my earlier table-bug workaround: my fix_tables function, if given a string ending in a spurious close tag, stripped the tag off and then recursed in case there was another.) So now I had to implement a string search function using recursion as my only method of looping. Given the existence of the contains() primitive, binary search seemed appropriate: divide the haystack string in half, see if the right-hand half contains the needle, if so recurse on to that, otherwise recurse on to the left-hand half. (Yes, I did consider the possibility that cutting the string in half might itself destroy the only occurrence of the needle. Details left as an exercise for the reader.)

So now my S2 style is running a recursive fix_tables function, which in turn is calling a recursive rindex function several times. Total recursion depth is (number of spurious close tags) + log(size of LJ entry HTML), which fortunately doesn't seem to have run into the recursion limit yet. And it all seems to work: the entry that triggered the problem yesterday has been fixed, but I reproduced the problem in a private entry of my own and am confident my defence now works again.

It's thoroughly silly that I had to do it in the first place, of course. If LJ were to feel like supplying index and rindex primitives, or better still fixing the HTML-cleaner bug in the first place, I could rip the entire edifice out of my style and save their S2 processor a whole load of wasted CPU cycles. But until then, it's staying…

(I described this hack and the reasons for it at post-pizza last night and it got a spontaneous round of applause, which was unexpected and fun :-)

emperor: (Default)

[personal profile] emperor 2006-06-27 11:27 am (UTC)(link)
Have you suggested them fixing this problem?

[identity profile] j4.livejournal.com 2006-06-27 11:29 am (UTC)(link)
I feel slightly less guilty about producing borken HTML now I know that it inspired such a l33t hack. 8-)

[identity profile] j4.livejournal.com 2006-06-27 11:56 am (UTC)(link)
I still maintain that it was morally broken, if not technically broken. (We need web principles as well as web standards!)



[identity profile] j4.livejournal.com 2006-06-27 12:01 pm (UTC)(link)
*falls about giggling*

If that isn't part of a full-on filk, it damn well should be. (If so, bang goes my productivity for the afternoon.)

[identity profile] cartesiandaemon.livejournal.com 2006-06-28 10:57 am (UTC)(link)
Didn't you do a meme where you combined all those one lines and it was really good?

A meme? Moi?

[identity profile] cartesiandaemon.livejournal.com 2006-06-28 11:17 am (UTC)(link)
ROFL. Sorry. But you knew what I meant, that was what I was thinking of.

Don't worry -- I didn't mean to imply that you'd copied it from anywhere, merely that (a) it was an automated summary of your character in an extremely oblique way and (b) people might like to copy it from you :)

once you've one outstandingly line ... coming up with enough filler isn't really motivating

[identity profile] cartesiandaemon.livejournal.com 2006-06-28 11:21 am (UTC)(link)
Yeah. Like books where the title is wonderful and says it all, and reading it isn't really gaining anything. Perhaps the correct artform is Filk Line :)

It took me *aaaages* to work out what filk *was*. I had associations because I knew what people described as it, but it didn't seem to make sense. Then I went to wikipedia and discovered my understanding was right :) Their description is beautifully worked out, something like, songs characterised by being about, or by the sort of people who write filk about, scifi/fantasy songs... It really is "know it when you hear it" but wikipedia was good at capturing that essence like a butterfly of etymology[1].

[1] Entomology pun not intended :)

[identity profile] cartesiandaemon.livejournal.com 2006-06-28 10:58 am (UTC)(link)
Does "Grabbing j4's attention" sound ever so dodgy to everyone else?

[identity profile] cartesiandaemon.livejournal.com 2006-06-28 11:03 am (UTC)(link)
Do I infer you didn't close some t[x] tags? Why is that morally broken?

* Because closing tags is the one true way? I disagree. I see the good side of all tags being open-content-close, but I can also see good reasons for ones that are open-content-separator-content-separator-content-separator-close.

* Because you shouldn't break LJ? That's not moral, that's dirty pragmatism :)

[identity profile] cartesiandaemon.livejournal.com 2006-06-27 12:13 pm (UTC)(link)
Suggestion 1: <Ptc>Write a perl-to-S2 compiler</ptc> :)
Suggestion 2: foreach [1...n] {strip} where n is greater than the likely available stack space :)

[identity profile] cartesiandaemon.livejournal.com 2006-06-28 11:01 am (UTC)(link)
1. Yeah. You'd only have to do it once, but then if you're doing that you should probably just make your own blogging wbesite for people you trust with perl :)

2. Oh. Sorry, I assumed "loop n times" would be in *any* nonstupid language, I hadn't actually looked in to S2. Foreach a string? Can you make a very long loop and break out after n decrements to zero? Sorry, I don't know.

[identity profile] timotab.livejournal.com 2006-06-27 04:27 pm (UTC)(link)
if (whole_thing_contains(needle) and !left_half_contains(needle) and !right_half_contains(needle))
then
move_where_I_make_the_split