crazyscot's recent LJ post about a large-factor speedup reminded me of a story of my own.
After my first year at university, I got a summer job working for a small software development team who maintained a suite of utilities which included an interpreter for a small custom scripting language. Since I'd finished my original task for them ahead of schedule, they decided to give me the job of understanding their script interpreter's expression evaluator and rewriting it using lex and yacc.
It rapidly became clear that the long-overdue rewrite was badly needed.
The evaluator (and parser, such as it was) worked by textually rewriting the expression string. Suppose the expression was "hello" + ", " + "world": the parser would look for the first operator, remove it and the two string literals on either side of it, evaluate the operator, construct a string literal representing the output string, and reinsert it into the expression. So the program would physically construct the partially evaluated expression "hello, " + "world" as a string, then apply the same procedure again to reduce it similarly to "hello, world", at which point it would strip off the quote marks and assign the result into the target variable.
(And yes, it actually did bodily shift the entire tail of the input expression back and forth twice in each iteration of this process.)
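The mechanism, as I understood it, can be sketched in a few lines of Python (my reconstruction for illustration, not the original code, which handled more operators than string concatenation):

```python
import re

# Match the first occurrence of:  "literal" + "literal"
PATTERN = re.compile(r'"([^"]*)"\s*\+\s*"([^"]*)"')

def evaluate(expr):
    """Repeatedly rewrite the expression string in place until only
    one string literal remains, then strip its quote marks."""
    while True:
        m = PATTERN.search(expr)
        if m is None:
            break
        # Evaluate the operator and splice the result back into the
        # expression as a new literal, bodily shifting the whole tail.
        expr = expr[:m.start()] + '"' + m.group(1) + m.group(2) + '"' + expr[m.end():]
    return expr.strip('"')

print(evaluate('"hello" + ", " + "world"'))  # hello, world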
The author had evidently gone to some effort to make this process as simple as possible for himself, at the expense of almost any amount of unpleasantness in the actual language syntax. In particular, to avoid having too many cases to switch between, every leaf node of every expression was required to be a string literal: given a string variable s, you couldn't write "prefix" + s + "suffix". Instead, s could appear only as an interpolation directive in the middle of a string literal, so "prefix" + "<s>" + "suffix" was the order of the day. (Or "prefix<s>suffix", in this case, but I illustrate the long form to show what had to be done with any other operator.)
Another side effect of this, of course, was that there were no types permitted other than strings. All the arithmetic operators took strings as input, interpreted them as decimal numbers, and output a string similarly encoding the result. Of course Perl has made that practically respectable these days, but Perl doesn't physically format every intermediate result as a decimal string and then reparse it for the next operation!
Finally, all the operators had the same precedence level. Not that you couldn't have implemented variable precedence in an architecture like this, but the author clearly couldn't be bothered to work out how. (There were parentheses; I wouldn't have been too surprised if he'd omitted those too, but fortunately not.)
So I took one horrified look at this and promptly agreed with my boss that a rewrite might not be a bad idea. While I was at it, I wrote a test suite as well, which turned up a few other exciting issues.
The language contained an operator ITEM n OF s, which would return the nth word of the string s. (Yes, that was a basic operator in the expression language; a function call syntax or a separate standard library would certainly have been well beyond the programmer's mental horizon.) But you could also alter the implicit use of whitespace as the ‘word’ separator, by writing ITEM n OF s SEPARATED BY d; this would allow you to, say, fetch the third field of a colon-separated record. Unfortunately, nobody had thought about how an expression such as ITEM a OF ITEM b OF c SEPARATED BY d would bracket, since it's exactly isomorphic to the well-known if–else ambiguity.
After noticing this and investigating briefly, I left my cubicle and went to the lab where the rest of the group was wrestling with some recalcitrant hardware. I wrote the above expression on a piece of paper and asked them if they had a syntax specification which might define how it should be considered to be bracketed. They had no such thing, naturally, and in fact it took me a while to convince them that there were two defensible interpretations of the expression.
When I finally convinced them there was a question to be answered, they exasperatedly asked me the obvious question: ‘why don't you just test what the existing implementation does, and do that?’. At this point it took all the conscience I had to stand there and say ‘because the existing implementation crashes when given this code’. I was very close to going meekly back to my desk and coding a deliberate segfault into my new parser, ‘as requested’.
(Fortunately, at this point, they did the sensible thing and gave me discretion to decide this previously unspecified point of the syntax in any way I deemed appropriate. There is of course a well-established conventional resolution of the if–else ambiguity, in which the else binds to the nearest unmatched if, so I adopted the analogous rule and went away and did that.)
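For the curious, here is a toy recursive-descent sketch of that resolution (entirely hypothetical; the real grammar was written in yacc): the SEPARATED BY clause binds to the innermost ITEM, just as an else binds to the nearest unmatched if.

```python
def parse(tokens):
    tree, rest = parse_expr(tokens)
    assert not rest, "trailing tokens"
    return tree

def parse_expr(tokens):
    """Parse  ITEM n OF expr [SEPARATED BY d]  or a bare variable.
    Because the recursive call for the OF operand runs first, the
    innermost ITEM gets first refusal on any SEPARATED BY clause."""
    if tokens and tokens[0] == "ITEM":
        n = tokens[1]
        assert tokens[2] == "OF"
        source, rest = parse_expr(tokens[3:])
        if rest[:2] == ["SEPARATED", "BY"]:
            return ("item", n, source, rest[2]), rest[3:]
        return ("item", n, source, None), rest
    return ("var", tokens[0]), tokens[1:]

tree = parse("ITEM a OF ITEM b OF c SEPARATED BY d".split())
# The separator ends up attached to ITEM b, not ITEM a:
# ('item', 'a', ('item', 'b', ('var', 'c'), 'd'), None)
```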
The one remaining crawling horror was particularly crawling. In my example expressions above I used angle brackets as the delimiters for the variable-interpolation directives; in fact, those delimiter characters were reconfigurable by the user.
This reconfiguration was implemented by a preprocessing pass on the input expression string, which globally replaced the current delimiter characters with codes deemed unlikely to be used for anything else: something along the lines of the ASCII control characters, which the rest of the evaluator then treated as the real delimiters.
Unfortunately that preprocessing pass had not considered the possibility that the chosen interpolation delimiters might appear outside a string literal, so it replaced them unconditionally without checking the lexical context. I found this out when I reconfigured my delimiters to the ordinary ASCII angle brackets (which seemed like an obvious choice). Everything worked fine until I tried to write an expression using the less-than operator, which the preprocessor promptly rewrote into an interpolation delimiter.
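A few lines of Python show why the context-free replacement was doomed (again my reconstruction; the particular stand-in control codes here are invented):

```python
# Stand-ins for whatever control codes the original actually used.
OPEN, CLOSE = "\x01", "\x02"

def preprocess(expr, open_delim="<", close_delim=">"):
    # A global, context-free substitution, as the original pass did:
    # it never checks whether it is inside a string literal.
    return expr.replace(open_delim, OPEN).replace(close_delim, CLOSE)

# Intended effect: the interpolation directive inside a literal is marked.
marked = preprocess('"prefix" + "<s>" + "suffix"')

# Unintended effect: a less-than operator outside any literal is
# swallowed too, so later stages never see a '<' token at all.
broken = preprocess('"a" < "b"')
assert "<" not in broken
```

A correct pass would have had to tokenise far enough to know when it was inside a string literal, at which point you might as well write a real lexer.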
So I threw all of this in the bin with great force, and rewrote the entire evaluator using fairly standard lex and yacc. While I was at it, I arranged for variables to be stored internally as either strings or integers (according to which one they were last used as) and lazily converted as necessary, which eliminated all the string processing in the middle of complex arithmetic expressions.
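The lazy conversion idea, roughly sketched in Python (the original was C, and all the names here are mine):

```python
class Value:
    """Hold a value as a string, a number, or both; convert on demand
    and cache the result, so arithmetic never round-trips through a
    decimal string between operations."""
    def __init__(self, s=None, n=None):
        self._str, self._num = s, n   # at least one is always set

    def as_num(self):
        if self._num is None:
            self._num = int(self._str)   # convert once, then cache
        return self._num

    def as_str(self):
        if self._str is None:
            self._str = str(self._num)
        return self._str

def add(a, b):
    # Intermediate results stay numeric; nothing is formatted back to
    # decimal unless something actually asks for a string.
    return Value(n=a.as_num() + b.as_num())

total = add(add(Value(s="2"), Value(s="3")), Value(s="4"))
print(total.as_str())  # 9
```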
When I'd finished rewriting the evaluator and also written a set of regression tests to check to the limits of my ability that it behaved indistinguishably from the old one in any case where the old one didn't crash, we ran a crude speed test, and found that the interpreter was running a full factor of twenty faster.
That isn't really a boast about my ability. I didn't do anything especially clever; the lazy type conversion was the cleverest thing I did, and although I'd had to think it up myself due to never having seen it before, I guessed (rightly, of course) that it was probably pretty standard among people doing this sort of thing who weren't utter muppets. I attribute the entire factor of 20 to the breathtaking idiocy of the original code. In particular, the language interpreter contained quite a lot of stuff other than the expression parser, so for the whole thing to have sped up by a factor of 20 must have meant that over 95% of the CPU time was previously being spent in the old parser!
At the time this speed test was running, the boss was out of the office; only the group's full-time engineers were around to see the result at first.
The boss took one look at the speed test, and shook his head. ‘We can't ship that,’ he said, ‘it's far too embarrassing. We'll have to deliberately slow it back down again, and ship lots of incremental speed upgrades.’
I laughed.
He turned out not to be joking.
Fortunately, I was allowed to ship that because the previous effort had transcended far-too-embarrassing and moved into mistaken-for-a-crash. (-8