Geek Story Hour: Parser of Death
crazyscot's recent LJ post about a dramatic speedup in some code reminded me of this story.
After my first year at university, I got a summer job working for a small software development team who maintained a suite of utilities which included an interpreter for a small custom scripting language. Since I'd finished my original task for them ahead of schedule, they decided to give me the job of understanding their script interpreter's expression evaluator and rewriting it using lex and yacc.
It rapidly became clear that the long-departed original author had gone about the job in a very strange way indeed.
The evaluator (and parser – the two were not separate passes) worked by repeated textual rewriting of the expression string. Suppose the expression was "hello" + ", " + "world"; the parser would look for the first operator, remove it and the two string literals on either side of it, evaluate the operator, construct a string literal representing the output string, and reinsert it into the expression. So the program would physically construct the partially evaluated expression "hello, " + "world" as a string, then apply the same procedure again to reduce it similarly to "hello, world", at which point it would strip off the quote marks and assign the result into the target variable.
(And yes, it actually did bodily shift the entire tail of the input expression back and forth twice in each iteration of this process.)
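For concreteness, here's a minimal Python sketch of that rewriting scheme as I've described it. The function name and the regular expression are my own inventions; the original was nothing like this tidy, but the shape of the algorithm is the same:

```python
import re

def rewrite_eval(expr: str) -> str:
    """Evaluate '+' over string literals by repeated textual rewriting:
    find the first operator, splice its result back into the expression
    as a new string literal, and repeat until no operator remains."""
    # One reduction step: "a" + "b"  ->  "ab"
    pat = re.compile(r'"([^"]*)"\s*\+\s*"([^"]*)"')
    while True:
        m = pat.search(expr)
        if m is None:
            break
        result = m.group(1) + m.group(2)
        # Rebuild the whole expression string each time, shifting the
        # entire tail back and forth -- exactly the inefficiency described.
        expr = expr[:m.start()] + '"' + result + '"' + expr[m.end():]
    return expr.strip('"')
```

So `rewrite_eval('"hello" + ", " + "world"')` passes through the intermediate string `'"hello, " + "world"'` before yielding `hello, world`, copying the tail of the expression on every step.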
The author had evidently gone to some effort to make this process as simple as possible for himself, at the expense of almost any amount of unpleasantness in the actual language syntax. In particular, to avoid having too many cases to switch between, every leaf node of every expression was required to be a string literal: given a string variable s, you couldn't write "prefix" + s + "suffix" – instead, s had to appear as an interpolation directive in the middle of a string literal, so "prefix" + "<s>" + "suffix" was the order of the day. (Or "prefix<s>suffix", in this case, but I illustrate the long form to show what had to be done with any other operator.)
Another side effect of this, of course, was that there were no types permitted other than strings. All the arithmetic operators took strings as input, interpreted them as decimal numbers, and output a string similarly encoding the result. Of course Perl has made that practically respectable these days, but Perl doesn't physically format every intermediate result as a decimal string and then reparse it for the next operation!
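In today's terms, every arithmetic operator behaved roughly like this sketch (mine, not the original code's):

```python
def str_add(a: str, b: str) -> str:
    # Parse both operands from decimal text, compute numerically, then
    # format the result back into decimal text -- which the *next*
    # operator must immediately reparse.
    return str(int(a) + int(b))

# "1" + "2" + "3": every intermediate result takes a full round trip
# through decimal formatting and reparsing.
result = str_add(str_add("1", "2"), "3")
```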
Finally, all the operators had the same precedence level. Not that you couldn't have implemented variable precedence in an architecture like this, but the author clearly couldn't be bothered to work out how. (There were parentheses; I wouldn't have been too surprised if he'd omitted those too, but fortunately not.)
So I took one horrified look at this and promptly agreed with my boss that a rewrite might not be a bad idea. While I was at it, I wrote a test suite as well, which turned up a few other exciting issues.
The language contained an operator ITEM n OF s, which would return the nth word of the string s. (Yes, that was a basic operator in the expression language; a function call syntax or a separate standard library would certainly have been well beyond the programmer's mental horizon.) But you could also alter the implicit use of whitespace as the ‘word’ separator, by writing ITEM n OF s SEPARATED BY d; this would allow you to, say, fetch the third field of a colon-separated record. The trouble is that it's then unclear how ITEM a OF ITEM b OF c SEPARATED BY d should bracket, since it's exactly isomorphic to the well-known if–else ambiguity.
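To see that the two bracketings genuinely disagree, here's a Python stand-in for the ITEM operator (my own modelling of it, 1-based like the original):

```python
def item(n: int, s: str, sep: str = " ") -> str:
    """Stand-in for  ITEM n OF s [SEPARATED BY sep]  (1-based index)."""
    return s.split(sep)[n - 1]

c = "x:y p:q r:s"

# Bracketing 1: ITEM 1 OF (ITEM 2 OF c SEPARATED BY ":")
#   SEPARATED BY binds to the *inner* ITEM.
parse1 = item(1, item(2, c, ":"))   # inner gives "y p", outer gives "y"

# Bracketing 2: (ITEM 1 OF ITEM 2 OF c) SEPARATED BY ":"
#   SEPARATED BY binds to the *outer* ITEM.
parse2 = item(1, item(2, c), ":")   # inner gives "p:q", outer gives "p"
```

With this input the two readings produce "y" and "p" respectively, so a parser has to commit to one of them.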
After noticing this and investigating briefly, I left my cubicle and went to the lab where the rest of the group was wrestling with some recalcitrant hardware. I wrote the above expression on a piece of paper and asked them if they had a syntax specification which might define how it should be considered to be bracketed. They had no such thing, naturally, and in fact it took me a while to convince them that there were two defensible interpretations of the expression.
When I finally convinced them there was a question to be answered, they exasperatedly asked me the obvious question: ‘why don't you just test what the existing implementation does, and do that?’. At this point it took all the conscience I had to stand there and say ‘because the existing implementation crashes when given this code’. I was very close to going meekly back to my desk and coding a deliberate segfault into my new parser, ‘as requested’.
(Fortunately, at this point, they did the sensible thing and gave me discretion to decide this previously unspecified point of the syntax in any way I deemed appropriate. There is of course a well-known right way to resolve the if–else ambiguity, so I went away and did that.)
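The standard resolution is to bind each trailing clause to the nearest construct that can accept it. A toy recursive-descent parser (my own illustration, for an if/else grammar rather than ITEM itself) shows the idea:

```python
def parse(toks):
    """Parse  stmt := 'if' COND stmt ['else' stmt] | WORD  over a token
    list, attaching each 'else' to the nearest preceding unmatched 'if'
    -- the standard dangling-else resolution."""
    pos = 0
    def stmt():
        nonlocal pos
        if toks[pos] == "if":
            pos += 1
            cond = toks[pos]; pos += 1
            then = stmt()          # innermost if is parsed first...
            els = None
            if pos < len(toks) and toks[pos] == "else":
                pos += 1           # ...so it grabs the else greedily
                els = stmt()
            return ("if", cond, then, els)
        word = toks[pos]; pos += 1
        return word
    return stmt()

tree = parse(["if", "A", "if", "B", "s1", "else", "s2"])
# The else attaches to the inner if: ("if", "A", ("if", "B", "s1", "s2"), None)
```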
The one remaining crawling horror was particularly crawling. In my example expressions above I used angle brackets as the delimiters for the variable-interpolation directives; in fact, those delimiter characters were reconfigurable by the script.
This reconfiguration was implemented by a preprocessing pass on the input expression string, which globally replaced the current delimiter characters with codes deemed unlikely to be used for anything else: something along the lines of a pair of rarely used ASCII control characters.
Unfortunately that preprocessing pass had not considered the possibility that the chosen interpolation delimiters might appear outside a string literal, so it replaced them unconditionally without checking the lexical context. I found this out when I reconfigured my delimiters to the ordinary ASCII angle brackets (which seemed like an obvious choice). Everything worked fine until I tried to write an expression using the less-than operator, whereupon the preprocessor cheerfully rewrote my comparison operator into an interpolation delimiter and the whole thing fell over.
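A two-line Python sketch reproduces the bug (the control-character codes are my stand-ins for whatever the original actually used):

```python
DELIM_OPEN, DELIM_CLOSE = "\x01", "\x02"   # stand-ins for the internal codes

def preprocess(expr: str, open_ch: str, close_ch: str) -> str:
    # Global, context-free replacement: it never checks whether it is
    # inside a string literal, which is exactly the bug described.
    return expr.replace(open_ch, DELIM_OPEN).replace(close_ch, DELIM_CLOSE)

# With '<' and '>' configured as interpolation delimiters, the
# less-than *operator* gets clobbered along with the directives:
mangled = preprocess('"<count>" < "10"', "<", ">")
# The comparison operator between the literals is now an
# interpolation-open code, and the downstream parser sees garbage.
```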
So I threw all of this in the bin with great force, and rewrote the entire evaluator using fairly standard lex and yacc. While I was at it, I arranged for variables to be stored internally as either strings or integers (according to which one they were last used as) and lazily converted as necessary, which eliminated all the string processing in the middle of complex arithmetic expressions.
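The lazy-conversion idea, in a minimal Python sketch (the class and method names are mine; I no longer have the original code):

```python
class Value:
    """A variable that remembers whichever representations it has and
    converts to the other one only on demand."""
    def __init__(self, s=None, n=None):
        self._s, self._n = s, n     # cached string / integer forms

    def as_str(self) -> str:
        if self._s is None:
            self._s = str(self._n)  # format only when actually needed
        return self._s

    def as_int(self) -> int:
        if self._n is None:
            self._n = int(self._s)  # parse only when actually needed
        return self._n

def add(a: "Value", b: "Value") -> "Value":
    # Arithmetic stays numeric end to end; no intermediate result is
    # ever formatted as a decimal string unless somebody asks for it.
    return Value(n=a.as_int() + b.as_int())

total = add(Value(s="2"), Value(s="3"))
```

A long chain of arithmetic thus parses each operand string at most once and formats the final answer at most once, instead of doing both on every operation.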
When I'd finished rewriting the evaluator and also written a set of regression tests to check to the limits of my ability that it behaved indistinguishably from the old one in any case where the old one didn't crash, we ran a crude speed test, and found that the interpreter was running a full factor of twenty faster.
That isn't really a boast about my ability. I didn't do anything especially clever; the lazy type conversion was the cleverest thing I did, and although I'd had to think it up myself due to never having seen it before, I guessed (rightly, of course) that it was probably pretty standard among people doing this sort of thing who weren't utter muppets. I attribute the entire factor of 20 to the breathtaking idiocy of the original code. In particular, the language interpreter contained quite a lot of stuff other than the expression parser, so for the whole thing to have sped up by a factor of 20 must have meant that over 95% of the CPU time was previously being spent in the old parser!
At the time this speed test was running, the boss was out of the office; only the group's full-time staff were around to see the result at first.
The boss took one look at the speed test, and shook his head. ‘We can't ship that,’ he said, ‘it's far too embarrassing. We'll have to deliberately slow it back down again, and ship lots of incremental speed upgrades.’
I laughed.
He turned out not to be joking.
no subject
‘We can't ship that,’ he said, ‘it's far too embarrassing. We'll have to deliberately slow it back down again, and ship lots of incremental speed upgrades.’
This is far too plausible.
no subject
Fortunately, I was allowed to ship that because the previous effort had transcended far-too-embarrassing and moved into mistaken-for-a-crash. (-8
no subject
no subject
(Anonymous) 2009-11-07 05:17 pm (UTC)
code
(Anonymous) 2009-11-07 10:06 pm (UTC)
Only problem was, when I left, someone else had to take over that project, and they didn't understand what I had done either. The original way was the easiest to understand; my way was very messy but fast. I recommended that they look at the original code.
no subject
There, the change was from "hadn't gotten around to implementing a parallel version of this particular function that uses the hardware accelerators" to "had". (In this case, that would be the SPEs on the Cell/B.E. processor. But any hardware with SIMD and lots of multicore would likely get at least order-of-magnitude speedup of similar functions from the simplistic C code version to a version that used all the hardware.)
no subject
Two-orders-of-magnitude speedups don't necessarily require the original coder to have been particularly foolish, and are surprisingly easy to come by.
Keep doing it to myself ...
An example would be a toy event-driven simulator which I wrote the first time using a linked list as the priority queue. For small N, the priority queue barely registered. As the size and complexity of the networks I'm interested in increased, that rose to 90% of the CPU time! Replacing that with a heapq took very little effort and dropped the queue overhead down to 10% or so. If N keeps growing, I'll probably have to come up with something cleverer eventually ... but to do it now would be a waste of time; it might never happen.
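That replacement is essentially this (a sketch of the general technique, not the commenter's actual simulator): sorted-linked-list insertion costs O(n) per event, while Python's heapq gives O(log n) pushes and pops.

```python
import heapq

# A binary-heap event queue: push events in any order, pop them in
# time order, each operation in O(log n) instead of O(n).
events = []
for t, name in [(5.0, "c"), (1.0, "a"), (3.0, "b")]:
    heapq.heappush(events, (t, name))   # keyed on the event time

fired = [heapq.heappop(events)[1] for _ in range(len(events))]
# events fire in time order: a, b, c
```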
Or, to put it in terms of Wall's Virtues, I like to wait until Impatience outweighs Laziness. After all, to do it _before_ it becomes annoying would be Premature Optimization :-)
-----sharks (http://nick.zoic.org/)
no subject
Oracle seems to be a lot more prone to that than SQL Server. One big area is that the optimizer forgets everything it should know if there's a user-defined function call in a Where clause. Oracle will try to run that before the rest of the conditions, and it will not notice if none of the parameters to the function are coming from the query. I got a ~200x speedup in PL/SQL code written by a friend of mine by putting the results of a function call into a variable before running the query.
no subject
(Anonymous) 2009-11-07 09:18 pm (UTC)
I'm pleased to say that none of this happens in my current codebase, although I've heard that there was some in the olden days.
no subject
‘There is of course a well-known right way to resolve the if–else ambiguity, so I went away and did that.’

Which is that well-known right way?
Attach the else to the most recent if which doesn't already have one?
no subject
Yes: when you reach the else, you don't want to have to wait until the second else (or the absence thereof) in order to decide which if to attach it to, so you write the syntax so that it attaches to the same if whether there's a second else or not.

Hmmmm...
(Anonymous) 2009-11-08 01:20 am (UTC)
Wow, you're the PuTTY guy!
(Anonymous) 2009-11-08 10:38 am (UTC)
Generative grammar?
http://en.wikipedia.org/wiki/Transformational_grammar
no subject
In fact I wouldn't blame the original author, in the absence of knowledge that there was a body of theory giving other efficient approaches, for implementing a parser in that basically transformational style. I'd done so myself a few years earlier, when I thought I'd try implementing an expression evaluator with no prior reading just to see whether I could.
It's all the details that were wrong. If he'd only started by lexing the expression into a list of atoms and operators – in fact, preferably a linked list, for ease of changing length – then he wouldn't have had to keep shifting entire strings back and forth all the time...
wtf
(Anonymous) 2009-11-08 12:19 pm (UTC)
Bad code can be beaten in any language, and the sad truth is that it's easier to do it right. And it was the first time, too.
How did you slow it down?
(Anonymous) 2009-11-09 10:51 pm (UTC)
Speed is everything
(Anonymous) 2009-12-12 02:59 pm (UTC)
Speed isn't everything, folks. If you write some pile of high-speed twisted code no mortal can understand, then you're just wasting time. Remember to code like the next guy is a whacked-out axe murderer with your address and a desire to kill anyone who makes his life more difficult.