Geek Story Hour: Parser of Death
crazyscot's recent LJ post about a dramatic speedup in some code reminded me of this story.
After my first year at university, I got a summer job working for a small software development team who maintained a suite of utilities which included an interpreter for a small custom scripting language. Since I'd finished my original task for them ahead of schedule, they decided to give me the job of understanding their script interpreter's expression evaluator and rewriting it using lex and yacc.
It rapidly became clear that the long-departed original author had gone about the job in a very strange way indeed.
The evaluator (and parser – the two were not separate passes) worked by repeated textual rewriting of the expression string. Suppose the expression was "hello" + ", " + "world"; the parser would look for the first operator, remove it and the two string literals on either side of it, evaluate the operator, construct a string literal representing the output string, and reinsert it into the expression. So the program would physically construct the partially evaluated expression "hello, " + "world" as a string, then apply the same procedure again to reduce it similarly to "hello, world", at which point it would strip off the quote marks and assign the result into the target variable.
(And yes, it actually did bodily shift the entire tail of the input expression back and forth twice in each iteration of this process.)
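For concreteness, here's a minimal Python sketch of that rewriting scheme as I've described it. The function name and the regular expression are my own inventions; the original was nothing like this tidy, but the shape of the algorithm is the same:

```python
import re

def rewrite_eval(expr: str) -> str:
    """Evaluate '+' over string literals by repeated textual rewriting:
    find the first operator, splice its result back into the expression
    as a new string literal, and repeat until no operator remains."""
    # One reduction step: "a" + "b"  ->  "ab"
    pat = re.compile(r'"([^"]*)"\s*\+\s*"([^"]*)"')
    while True:
        m = pat.search(expr)
        if m is None:
            break
        result = m.group(1) + m.group(2)
        # Rebuild the whole expression string each time, shifting the
        # entire tail back and forth -- exactly the inefficiency described.
        expr = expr[:m.start()] + '"' + result + '"' + expr[m.end():]
    return expr.strip('"')
```

So `rewrite_eval('"hello" + ", " + "world"')` passes through the intermediate string `'"hello, " + "world"'` before yielding `hello, world`, copying the tail of the expression on every step.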
The author had evidently gone to some effort to make this process as simple as possible for himself, at the expense of almost any amount of unpleasantness in the actual language syntax. In particular, to avoid having too many cases to switch between, every leaf node of every expression was required to be a string literal: given a string variable s, you couldn't write "prefix" + s + "suffix" – instead, s had to appear as an interpolation directive in the middle of a string literal, so "prefix" + "<s>" + "suffix" was the order of the day. (Or "prefix<s>suffix", in this case, but I illustrate the long form to show what had to be done with any other operator.)
Another side effect of this, of course, was that there were no types permitted other than strings. All the arithmetic operators took strings as input, interpreted them as decimal numbers, and output a string similarly encoding the result. Of course Perl has made that practically respectable these days, but Perl doesn't physically format every intermediate result as a decimal string and then reparse it for the next operation!
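In today's terms, every arithmetic operator behaved roughly like this sketch (mine, not the original code's):

```python
def str_add(a: str, b: str) -> str:
    # Parse both operands from decimal text, compute numerically, then
    # format the result back into decimal text -- which the *next*
    # operator must immediately reparse.
    return str(int(a) + int(b))

# "1" + "2" + "3": every intermediate result takes a full round trip
# through decimal formatting and reparsing.
result = str_add(str_add("1", "2"), "3")
```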
Finally, all the operators had the same precedence level. Not that you couldn't have implemented variable precedence in an architecture like this, but the author clearly couldn't be bothered to work out how. (There were parentheses; I wouldn't have been too surprised if he'd omitted those too, but fortunately not.)
So I took one horrified look at this and promptly agreed with my boss that a rewrite might not be a bad idea. While I was at it, I wrote a test suite as well, which turned up a few other exciting issues.
The language contained an operator ITEM n OF s, which would return the nth word of the string s. (Yes, that was a basic operator in the expression language; a function call syntax or a separate standard library would certainly have been well beyond the programmer's mental horizon.) But you could also alter the implicit use of whitespace as the ‘word’ separator, by writing ITEM n OF s SEPARATED BY d; this would allow you to, say, fetch the third field of a colon-separated record. The trouble is that it's then unclear how ITEM a OF ITEM b OF c SEPARATED BY d should bracket, since it's exactly isomorphic to the well-known if–else ambiguity.
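To see that the two bracketings genuinely disagree, here's a Python stand-in for the ITEM operator (my own modelling of it, 1-based like the original):

```python
def item(n: int, s: str, sep: str = " ") -> str:
    """Stand-in for  ITEM n OF s [SEPARATED BY sep]  (1-based index)."""
    return s.split(sep)[n - 1]

c = "x:y p:q r:s"

# Bracketing 1: ITEM 1 OF (ITEM 2 OF c SEPARATED BY ":")
#   SEPARATED BY binds to the *inner* ITEM.
parse1 = item(1, item(2, c, ":"))   # inner gives "y p", outer gives "y"

# Bracketing 2: (ITEM 1 OF ITEM 2 OF c) SEPARATED BY ":"
#   SEPARATED BY binds to the *outer* ITEM.
parse2 = item(1, item(2, c), ":")   # inner gives "p:q", outer gives "p"
```

With this input the two readings produce "y" and "p" respectively, so a parser has to commit to one of them.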
After noticing this and investigating briefly, I left my cubicle and went to the lab where the rest of the group was wrestling with some recalcitrant hardware. I wrote the above expression on a piece of paper and asked them if they had a syntax specification which might define how it should be considered to be bracketed. They had no such thing, naturally, and in fact it took me a while to convince them that there were two defensible interpretations of the expression.
When I finally convinced them there was a question to be answered, they exasperatedly asked me the obvious question: ‘why don't you just test what the existing implementation does, and do that?’. At this point it took all the conscience I had to stand there and say ‘because the existing implementation crashes when given this code’. I was very close to going meekly back to my desk and coding a deliberate segfault into my new parser, ‘as requested’.
(Fortunately, at this point, they did the sensible thing and gave me discretion to decide this previously unspecified point of the syntax in any way I deemed appropriate. There is of course a well-known right way to resolve the if–else ambiguity, so I went away and did that.)
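The standard resolution is to bind each trailing clause to the nearest construct that can accept it. A toy recursive-descent parser (my own illustration, for an if/else grammar rather than ITEM itself) shows the idea:

```python
def parse(toks):
    """Parse  stmt := 'if' COND stmt ['else' stmt] | WORD  over a token
    list, attaching each 'else' to the nearest preceding unmatched 'if'
    -- the standard dangling-else resolution."""
    pos = 0
    def stmt():
        nonlocal pos
        if toks[pos] == "if":
            pos += 1
            cond = toks[pos]; pos += 1
            then = stmt()          # innermost if is parsed first...
            els = None
            if pos < len(toks) and toks[pos] == "else":
                pos += 1           # ...so it grabs the else greedily
                els = stmt()
            return ("if", cond, then, els)
        word = toks[pos]; pos += 1
        return word
    return stmt()

tree = parse(["if", "A", "if", "B", "s1", "else", "s2"])
# The else attaches to the inner if: ("if", "A", ("if", "B", "s1", "s2"), None)
```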
The one remaining crawling horror was particularly crawling. In my example expressions above I used angle brackets as the delimiters for the variable-interpolation directives; in fact, those delimiter characters were reconfigurable by the script.
This reconfiguration was implemented by a preprocessing pass on the input expression string, which globally replaced the current delimiter characters with codes deemed unlikely to be used for anything else: something along the lines of a pair of rarely used ASCII control characters.
Unfortunately that preprocessing pass had not considered the possibility that the chosen interpolation delimiters might appear outside a string literal, so it replaced them unconditionally without checking the lexical context. I found this out when I reconfigured my delimiters to the ordinary ASCII angle brackets (which seemed like an obvious choice). Everything worked fine until I tried to write an expression using the less-than operator, whereupon the preprocessor cheerfully rewrote my comparison operator into an interpolation delimiter and the whole thing fell over.
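A two-line Python sketch reproduces the bug (the control-character codes are my stand-ins for whatever the original actually used):

```python
DELIM_OPEN, DELIM_CLOSE = "\x01", "\x02"   # stand-ins for the internal codes

def preprocess(expr: str, open_ch: str, close_ch: str) -> str:
    # Global, context-free replacement: it never checks whether it is
    # inside a string literal, which is exactly the bug described.
    return expr.replace(open_ch, DELIM_OPEN).replace(close_ch, DELIM_CLOSE)

# With '<' and '>' configured as interpolation delimiters, the
# less-than *operator* gets clobbered along with the directives:
mangled = preprocess('"<count>" < "10"', "<", ">")
# The comparison operator between the literals is now an
# interpolation-open code, and the downstream parser sees garbage.
```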
So I threw all of this in the bin with great force, and rewrote the entire evaluator using fairly standard lex and yacc. While I was at it, I arranged for variables to be stored internally as either strings or integers (according to which one they were last used as) and lazily converted as necessary, which eliminated all the string processing in the middle of complex arithmetic expressions.
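The lazy-conversion idea, in a minimal Python sketch (the class and method names are mine; I no longer have the original code):

```python
class Value:
    """A variable that remembers whichever representations it has and
    converts to the other one only on demand."""
    def __init__(self, s=None, n=None):
        self._s, self._n = s, n     # cached string / integer forms

    def as_str(self) -> str:
        if self._s is None:
            self._s = str(self._n)  # format only when actually needed
        return self._s

    def as_int(self) -> int:
        if self._n is None:
            self._n = int(self._s)  # parse only when actually needed
        return self._n

def add(a: "Value", b: "Value") -> "Value":
    # Arithmetic stays numeric end to end; no intermediate result is
    # ever formatted as a decimal string unless somebody asks for it.
    return Value(n=a.as_int() + b.as_int())

total = add(Value(s="2"), Value(s="3"))
```

A long chain of arithmetic thus parses each operand string at most once and formats the final answer at most once, instead of doing both on every operation.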
When I'd finished rewriting the evaluator and also written a set of regression tests to check to the limits of my ability that it behaved indistinguishably from the old one in any case where the old one didn't crash, we ran a crude speed test, and found that the interpreter was running a full factor of twenty faster.
That isn't really a boast about my ability. I didn't do anything especially clever; the lazy type conversion was the cleverest thing I did, and although I'd had to think it up myself due to never having seen it before, I guessed (rightly, of course) that it was probably pretty standard among people doing this sort of thing who weren't utter muppets. I attribute the entire factor of 20 to the breathtaking idiocy of the original code. In particular, the language interpreter contained quite a lot of stuff other than the expression parser, so for the whole thing to have sped up by a factor of 20 must have meant that over 95% of the CPU time was previously being spent in the old parser!
At the time this speed test was running, the boss was out of the office; only the group's full-time staff were around to see the result at first.
The boss took one look at the speed test, and shook his head. ‘We can't ship that,’ he said, ‘it's far too embarrassing. We'll have to deliberately slow it back down again, and ship lots of incremental speed upgrades.’
I laughed.
He turned out not to be joking.
no subject
‘We can't ship that,’ he said, ‘it's far too embarrassing. We'll have to deliberately slow it back down again, and ship lots of incremental speed upgrades.’
This is far too plausible.
no subject
Fortunately, I was allowed to ship that because the previous effort had transcended far-too-embarrassing and moved into mistaken-for-a-crash. (-8
no subject
no subject
(Anonymous) 2009-11-07 05:17 pm (UTC)
code
(Anonymous) 2009-11-07 10:06 pm (UTC)
Only problem was, when I left, someone else had to take over that project, and they didn't understand what I had done either. The original way was the easiest to understand; my way was very messy but fast. I recommended that they look at the original code.
no subject
There, the change was from "hadn't gotten around to implementing a parallel version of this particular function that uses the hardware accelerators" to "had". (In this case, that would be the SPEs on the Cell/B.E. processor. But any hardware with SIMD and lots of multicore would likely get at least order-of-magnitude speedup of similar functions from the simplistic C code version to a version that used all the hardware.)
no subject
Two-orders-of-magnitude speedups don't necessarily require the original coder to have been particularly foolish, and are surprisingly easy to come by.
Keep doing it to myself ...
An example would be a toy event-driven simulator which I wrote the first time using a linked list as the priority queue. For small N, the priority queue barely registered. As the size and complexity of the networks I'm interested in increased, that rose to 90% of the CPU time! Replacing that with a heapq took very little effort and dropped the queue overhead down to 10% or so. If N keeps growing, I'll probably have to come up with something cleverer eventually ... but to do it now would be a waste of time; it might never happen.
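That replacement is essentially this (a sketch of the general technique, not the commenter's actual simulator): sorted-linked-list insertion costs O(n) per event, while Python's heapq gives O(log n) pushes and pops.

```python
import heapq

# A binary-heap event queue: push events in any order, pop them in
# time order, each operation in O(log n) instead of O(n).
events = []
for t, name in [(5.0, "c"), (1.0, "a"), (3.0, "b")]:
    heapq.heappush(events, (t, name))   # keyed on the event time

fired = [heapq.heappop(events)[1] for _ in range(len(events))]
# events fire in time order: a, b, c
```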
Or, to put it in terms of Wall's Virtues, I like to wait until Impatience outweighs Laziness. After all, to do it _before_ it becomes annoying would be Premature Optimization :-)
-----sharks (http://nick.zoic.org/)
no subject
Oracle seems to be a lot more prone to that than SQL Server. One big area is that the optimizer forgets everything it should know if there's a user-defined function call in a Where clause. Oracle will try to run that before the rest of the conditions, and it will not notice if none of the parameters to the function are coming from the query. I got a ~200x speedup in PL/SQL code written by a friend of mine by putting the results of a function call into a variable before running the query.
no subject
(Anonymous) 2009-11-07 09:18 pm (UTC)
I'm pleased to say that none of this happens in my current codebase, although I've heard that there was some in the olden days.
no subject
‘There is of course a well-known right way to resolve the if–else ambiguity, so I went away and did that.’

Which is that well-known right way?
Attach the else to the most recent if which doesn't already have one?
no subject
Yes: when you reach the else, you don't want to have to wait until the second else (or the absence thereof) in order to decide which if to attach it to, so you write the syntax so that it attaches to the same if whether there's a second else or not.

Hmmmm...
(Anonymous) 2009-11-08 01:20 am (UTC)
Wow, you're the PuTTY guy!
(Anonymous) 2009-11-08 10:38 am (UTC)
Generative grammar?
http://en.wikipedia.org/wiki/Transformational_grammar
no subject
In fact I wouldn't blame the original author, in the absence of knowledge that there was a body of theory giving other efficient approaches, for implementing a parser in that basically transformational style. I'd done so myself a few years earlier, when I thought I'd try implementing an expression evaluator with no prior reading just to see whether I could.
It's all the details that were wrong. If he'd only started by lexing the expression into a list of atoms and operators – in fact, preferably a linked list, for ease of changing length – then he wouldn't have had to keep shifting entire strings back and forth all the time...
wtf
(Anonymous) 2009-11-08 12:19 pm (UTC)
Bad code can be beaten in any language, and the sad truth is that it's easier to do it right. And it was the first time, too.
How did you slow it down?
(Anonymous) 2009-11-09 10:51 pm (UTC)
Speed is everything
(Anonymous) 2009-12-12 02:59 pm (UTC)
Speed isn't everything, folks. If you write some pile of high-speed twisted code no mortal can understand, then you're just wasting time. Remember to code like the next guy is a whacked-out axe murderer with your address and a desire to kill anyone who makes his life more difficult.