simont | My Evil Hack of the Week

It's early in the week, but I doubt I'll beat this in the next few days: yesterday evening I implemented a string search function (equivalent to Perl's rindex) recursively.

It's reasonably well known among LiveJournal users that putting tables in an LJ entry has a tendency to break the formatting of people's friends pages. In fact, it's not all tables; it's only tables which don't explicitly close <tr> and <td> tags.

This is because LJ's HTML cleaning function, which is applied separately to the HTML generated by each individual journal entry, doesn't know that those two elements can be implicitly closed in HTML 4; so if it sees either of those tags without a corresponding explicit close tag, it remembers, and at the very end of the string it's cleaning it splurges out a huge string of pointless </tr> and </td> tags, long after the </table> before which they might have at least managed to be relatively harmless if silly. When such an LJ entry is output within a page style that uses tables for layout, these unexpected closing tags disrupt the style's external tables and things go badly wrong.

Once upon a time I attempted to prepare a patch against the HTML cleaner which fixed this problem, but unfortunately failed because the code was more hideous than I was prepared to deal with when not being paid large amounts of money. So instead I worked around it in my own journal style (which, thanks to ick-proxy, I use for absolutely all my LJ reading): an S2 journal style is a program in a Perl-like language which is given fragments of HTML and other such stuff as input and produces a complete web page as output, and for about a year mine has been checking for spurious </tr> and </td> tags at the end of those HTML fragments and stripping them off.

Unfortunately, the recent introduction of ‘tagging’ of LJ entries broke my defensive code: the HTML fragment which shows the tags on an entry seems to be appended to what comes out of the HTML cleaner before it gets passed to my style code, meaning that those spurious tags now aren't guaranteed to appear at the very end of the entry, but can be somewhere in the middle. Yesterday I encountered the first LJ entry I'd seen with both a bug-tickling table and tags, and realised that my defence needed a bit more work.

After a bit of thought, the obvious answer seemed to be that any </tr> or </td> tag which appears after the last </table> can be identified as spurious and removed. That sounded easy enough. So I went to the S2 documentation and looked for their equivalent of rindex … and they don't have one. Their string class provides a boolean contains() method to tell you if one string occurs in another, but sadly not one which will tell you where it occurs.

Well, that's not a crippling problem; string functions aren't that hard, so I can just write it myself. The problem is that S2 doesn't provide any looping constructs except foreach: no while, and no general for. It looks rather as if they were trying to ensure the language was unable to express an indefinite loop, in the manner of Douglas Hofstadter's BlooP, presumably to prevent badly-written styles from hanging their web server.

Fortunately, they did leave in recursion. (In fact I was already using recursion for my earlier table-bug workaround: my fix_tables function, if given a string ending in a spurious close tag, stripped the tag off and then recursed in case there was another.) So now I had to implement a string search function using recursion as my only method of looping. Given the existence of the contains() primitive, binary search seemed appropriate: divide the haystack string in half, see if the right-hand half contains the needle, if so recurse on to that, otherwise recurse on to the left-hand half. (Yes, I did consider the possibility that cutting the string in half might itself destroy the only occurrence of the needle. Details left as an exercise for the reader.)

So now my S2 style is running a recursive fix_tables function, which in turn is calling a recursive rindex function several times. Total recursion depth is (number of spurious close tags) + log(size of LJ entry HTML), which fortunately doesn't seem to have run into the recursion limit yet. And it all seems to work: the entry that triggered the problem yesterday has been fixed, but I reproduced the problem in a private entry of my own and am confident my defence now works again.

It's thoroughly silly that I had to do it in the first place, of course. If LJ were to feel like supplying index and rindex primitives, or better still fixing the HTML-cleaner bug in the first place, I could rip the entire edifice out of my style and save their S2 processor a whole load of wasted CPU cycles. But until then, it's staying…

(I described this hack and the reasons for it at post-pizza last night and it got a spontaneous round of applause, which was unexpected and fun :-)

Flat | Top-Level Comments Only

Have you suggested them fixing this problem?

Apparently they've known about it for some time.

I feel slightly less guilty about producing borken HTML now I know that it inspired such a l33t hack. 8-)

I still think it wasn't technically broken. And as

imc pointed out, even if it were, it wouldn't excuse the HTML cleaner behaving badly, since its entire purpose is to behave well in the face of whatever a user might supply to it including deliberate malice.

But yeah, this was fun :-)

I still maintain that it was morally broken, if not technically broken. (We need web principles as well as web standards!)

Unfortunately, many webmasters don't seem too concerned with moral issues; they spend their cash on writing Flash and grabbing your attention. Um.

*falls about giggling*

If that isn't part of a full-on filk, it damn well should be. (If so, bang goes my productivity for the afternoon.)

It isn't as far as I know; in keeping with my usual attitude to filk, I came up with the one good line and then stopped on the grounds that the rest would never measure up to it. There's only one filk I ever managed to write more than about two lines of, and even that's stayed 3/4 finished for years.

Didn't you do a meme where you combined all those one lines and it was really good?

A meme? Moi?

You might be thinking of the time I searched my diary archives for sentences in iambic pentameter and managed to find just enough rhyming ones to turn into a rather disjointed not-quite-sonnet, but that certainly wasn't my collection of filk-seeds.

My filk-seed collection includes, off the top of my head:

In short, in matters vegetable, mineral and animal
He is the very model of a modern David damerell
I'm the dandy webmaster who you're too scared to mention
I spend my cash on writing Flash and grabbing your attention
Quick reflexes are a must for the habitual player of what is known as (Half-Life)
He works with a mouse, a very big mouse, he's a compsci

For the last of those I have about three quarters of the rest of the song, if I can still remember it, including a triple rhyme that would have made Tom Lehrer blush (maybe). Oh, and there was one where it occurred to me to start off "Eleanor Rigby" with "Eleanor Ripley" and turn it into an Aliens filk, but I never came up with any actual lines for it, although

drswirly provided "Wearing a face that she found in an egg on the floor", and clearly "only way to be sure" should appear in there at some point.

Generally, though, my attitude to filk is that once you've come up with the one outstandingly well crafted line that sparks the whole thing off, coming up with enough filler to pad out the rest of the song is not really motivating and you might just as well leave it at that.

ROFL. Sorry. But you knew what I meant, that was what I was thinking of.

Don't worry -- I didn't mean to imply that you'd copied it from anywhere, merely that (a) it was an automated summary of your character in an extremely oblique way and (b) people might like to copy it from you :)

Yeah. Like books where the title is wonderful and says it all, and reading it isn't really gaining anything. Perhaps the correct artform is Filk Line :)

It took me *aaaages* to work out what filk *was*. I had associations because I knew what people described as it, but it didn't seem to make sense. Then I went to wikipedia and discovered my understanding was right :) Their description is beautifully worked out, something like, songs characterised by being about, or by the sort of people who write filk about, scifi/fantasy songs... It really is "know it when you hear it" but wikipedia was good at capturing that essence like a butterfly of etymology[1].

[1] Entomology pun not intended :)

Does "Grabbing j4's attention" sound ever so dodgy to everyone else?

Do I infer you didn't close some t[x] tags? Why is that morally broken?

* Because closing tags is the one true way? I disagree. I see the good side of all tags being open-content-close, but I can also see good reasons for ones that are open-content-separator-content-separator-content-separator-close.

* Because you shouldn't break LJ? That's not moral, that's dirty pragmatism :)

Suggestion 1: <Ptc>Write a perl-to-S2 compiler</ptc> :)
Suggestion 2: foreach [1...n] {strip} where n is greater than the likely available stack space :)

Suggestion 1 doesn't remove the need to implement rindex in S2; the compiler writer will still have to either do the job or write code to do it for him (which is bound to be at least as hard).

Suggestion 2: if I knew of an S2 language construction that would generate [1..n] for a given n then I'd be a lot happier...

1. Yeah. You'd only have to do it once, but then if you're doing that you should probably just make your own blogging wbesite for people you trust with perl :)

2. Oh. Sorry, I assumed "loop n times" would be in *any* nonstupid language, I hadn't actually looked in to S2. Foreach a string? Can you make a very long loop and break out after n decrements to zero? Sorry, I don't know.

if (whole_thing_contains(needle) and !left_half_contains(needle) and !right_half_contains(needle))
then
move_where_I_make_the_split

I suppose that'd work, although it seems to me to have a few too many special cases (and to potentially include further loops) for my taste. I preferred my approach: instead of dividing the entire string in half, divide in half the portion of the string which is capable of containing the start point of the needle. If the part of the string after that half-way point contains the needle, then the start point occurs after the half way point. Otherwise, the start point must occur before the half way point, meaning that the first "half" of the string plus (needlelen-1) further characters must contain the needle.

In fact, no, hang on, your method doesn't even work. I'm not just after an occurrence of the needle; I'm after the rightmost. So if there are two occurrences, one in the left half and one straddling the halfway point, then the condition in your "if" will not be fulfilled and we'll recurse blindly into the left half, completely missing the rightmost real occurrence in favour of the rightmost one that didn't hit a special case.

My Evil Hack of the Week

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

A meme? Moi?

once you've one outstandingly line ... coming up with enough filler isn't really motivating

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject