simont | My Evil Hack of the Week

It's early in the week, but I doubt I'll beat this in the next few days: yesterday evening I implemented a string search function (equivalent to Perl's rindex) recursively.

It's reasonably well known among LiveJournal users that putting tables in an LJ entry has a tendency to break the formatting of people's friends pages. In fact, it's not all tables; it's only tables which don't explicitly close <tr> and <td> tags.

This is because LJ's HTML cleaning function, which is applied separately to the HTML generated by each individual journal entry, doesn't know that those two elements can be implicitly closed in HTML 4. So if it sees either of those tags without a corresponding explicit close tag, it remembers, and at the very end of the string it's cleaning it splurges out a huge string of pointless </tr> and </td> tags. That is long after the </table>, before which they might at least have managed to be relatively harmless, if silly. When such an LJ entry is output within a page style that uses tables for layout, these unexpected closing tags disrupt the style's own enclosing tables and things go badly wrong.

Once upon a time I attempted to prepare a patch against the HTML cleaner which fixed this problem, but unfortunately failed because the code was more hideous than I was prepared to deal with when not being paid large amounts of money. So instead I worked around it in my own journal style (which, thanks to ick-proxy, I use for absolutely all my LJ reading). An S2 journal style is a program in a Perl-like language which is given fragments of HTML and other such stuff as input and produces a complete web page as output. For about a year mine has been checking for spurious </tr> and </td> tags at the end of those HTML fragments and stripping them off.
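For illustration, here is a minimal sketch of that end-of-fragment check. It is written in Python rather than S2, and it is recursive because the post later describes fix_tables as stripping one tag and then recursing. The body is a guess at the shape of the idea, not the actual style code, and it assumes no whitespace sits between the trailing tags.

```python
# Tags the HTML cleaner is described as appending spuriously.
SPURIOUS_TAGS = ("</tr>", "</td>")


def fix_tables(html: str) -> str:
    """Strip spurious </tr> and </td> tags from the end of a fragment.

    Illustrative only: strips one trailing tag, then recurses in case
    another one is now at the end.
    """
    for tag in SPURIOUS_TAGS:
        if html.endswith(tag):
            return fix_tables(html[:-len(tag)])
    # Nothing spurious left at the end.
    return html
```

The recursion depth here is one level per stripped tag, plus one for the final call that finds nothing to strip.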

Unfortunately, the recent introduction of ‘tagging’ of LJ entries broke my defensive code: the HTML fragment which shows the tags on an entry seems to be appended to what comes out of the HTML cleaner before it gets passed to my style code, meaning that those spurious tags now aren't guaranteed to appear at the very end of the entry, but can be somewhere in the middle. Yesterday I encountered the first LJ entry I'd seen with both a bug-tickling table and tags, and realised that my defence needed a bit more work.

After a bit of thought, the obvious answer seemed to be that any </tr> or </td> tag which appears after the last </table> can be identified as spurious and removed. That sounded easy enough. So I went to the S2 documentation and looked for their equivalent of rindex … and they don't have one. Their string class provides a boolean contains() method to tell you if one string occurs in another, but sadly not one which will tell you where it occurs.
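To make the rule concrete, here is a rough sketch in Python, which (unlike S2) has a built-in reverse search, str.rfind. The function name strip_after_last_table is invented for this example, and leaving a fragment untouched when it contains no </table> at all is an assumption rather than something the rule above specifies.

```python
def strip_after_last_table(html: str) -> str:
    """Remove every </tr> and </td> that appears after the last </table>."""
    close = "</table>"
    pos = html.rfind(close)  # Python's counterpart of Perl's rindex
    if pos == -1:
        # Assumption: with no </table> present, leave the fragment alone.
        return html
    cut = pos + len(close)
    head, tail = html[:cut], html[cut:]
    # Anything after the final </table> cannot legitimately close a row or cell.
    for tag in ("</tr>", "</td>"):
        tail = tail.replace(tag, "")
    return head + tail
```

Tags inside the table, before the last </table>, are left alone, so a properly closed table passes through unchanged.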

Well, that's not a crippling problem; string functions aren't that hard, so I can just write it myself. The problem is that S2 doesn't provide any looping constructs except foreach: no while, and no general for. It looks rather as if they were trying to ensure the language was unable to express an indefinite loop, in the manner of Douglas Hofstadter's BlooP, presumably to prevent badly-written styles from hanging their web server.

Fortunately, they did leave in recursion. (In fact I was already using recursion for my earlier table-bug workaround: my fix_tables function, if given a string ending in a spurious close tag, stripped the tag off and then recursed in case there was another.) So now I had to implement a string search function using recursion as my only method of looping. Given the existence of the contains() primitive, binary search seemed appropriate: divide the haystack string in half, see if the right-hand half contains the needle, if so recurse on to that, otherwise recurse on to the left-hand half. (Yes, I did consider the possibility that cutting the string in half might itself destroy the only occurrence of the needle. Details left as an exercise for the reader.)
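One possible way to realise that idea, sketched in Python with recursion as the only loop and a contains() test as the only search primitive. This is not the S2 code from the style, and the handling of a needle straddling the cut is just one solution to the exercise: the split is made over candidate start positions, and the left-hand piece keeps len(needle) - 1 extra characters so that no occurrence is lost.

```python
def contains(haystack: str, needle: str) -> bool:
    """Stand-in for S2's boolean contains() method."""
    return needle in haystack


def rindex(haystack: str, needle: str, offset: int = 0) -> int:
    """Return the start of the last occurrence of needle, or -1 if absent.

    offset is how far this haystack has been shifted relative to the
    original string, so the result refers to the original string.
    """
    if not contains(haystack, needle):
        return -1
    # Number of positions at which the needle could possibly start.
    starts = len(haystack) - len(needle) + 1
    if starts == 1:
        # One candidate left, and contains() has confirmed it matches.
        return offset
    mid = starts // 2
    # Every occurrence starting at or after mid lies wholly in this piece.
    right = haystack[mid:]
    if contains(right, needle):
        return rindex(right, needle, offset + mid)
    # Otherwise the last occurrence starts before mid; keep enough extra
    # characters that an occurrence straddling the cut survives intact.
    left = haystack[:mid + len(needle) - 1]
    return rindex(left, needle, offset)
```

Each call at least halves the number of candidate start positions, so the recursion depth grows with the logarithm of the haystack length, in line with the depth estimate in the post.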

So now my S2 style is running a recursive fix_tables function, which in turn is calling a recursive rindex function several times. Total recursion depth is (number of spurious close tags) + log(size of LJ entry HTML), which fortunately doesn't seem to have run into the recursion limit yet. And it all seems to work: the entry that triggered the problem yesterday has since been fixed, but I reproduced the problem in a private entry of my own and am confident my defence now works again.

It's thoroughly silly that I had to do it in the first place, of course. If LJ were to feel like supplying index and rindex primitives, or better still fixing the HTML-cleaner bug in the first place, I could rip the entire edifice out of my style and save their S2 processor a whole load of wasted CPU cycles. But until then, it's staying…

(I described this hack and the reasons for it at post-pizza last night and it got a spontaneous round of applause, which was unexpected and fun :-)
