simont () wrote2017-02-20 08:09 pm

Over the weekend, I realised, extremely belatedly, that the expression language I designed for my free-software project `spigot` has a grammar bug. Specifically, it's a context-dependency analogous to the C `typedef` bug: the same expression can parse differently depending on whether a given identifier is currently defined to be a function or a variable, which means that you need to do your scope analysis in sync with the parse, so that you can know what set of definitions is currently in scope for the subexpression you're currently looking at.

The problem arises because `spigot` has the unusual grammar feature of implicit multiplication by juxtaposition, i.e. you can just put two expressions next to each other without an explicit `*` or × sign in between:

``\$ spigot 'let x=100 in x(x+1)'10100``

And `spigot` also supports user-defined functions:

``\$ spigot 'let f(x)=x+100 in f(20)'120``

These two grammar features collide, of course. If you write an expression of the form `a(stuff)`, then `spigot` will have to parse it differently depending on whether `a` is a function or a variable.

It would be nice to think that you could just defer the decision until after parsing was finished, by turning such expressions into some kind of ambiguous AST node, with the semantics ‘This is either a function call or a multiplication, we'll sort out which later on’. But the precedences of the other operators mean that you really do have to know it at parse time, because expressions like `a^f(b)` or `f(a)^b` will apply the `^` operator to different amounts of the expression depending on whether `f` is a function. A nice way to illustrate this is to try defining `f` in two ways that come out the same if you just write `f(x)` on its own:

``\$ spigot 'let f=10 in f(2)'20\$ spigot 'let f(x)=x*10 in f(2)'20``

and then to observe that those two definitions still manage to make a difference if you put other operators next to them:

``\$ spigot 'let a=3,b=4,f=10      in a^f(b)'    #  3^10 * 4236196\$ spigot 'let a=3,b=4,f(x)=x*10 in a^f(b)'    #  3^4012157665459056928801\$ spigot 'let a=3,b=4,f=10      in f(a)^b'    #  10 * 3^4810\$ spigot 'let a=3,b=4,f(x)=x*10 in f(a)^b'    #  30^4810000``

So the first question is: how on earth did I manage to make that mistake? To anyone who's studied grammars and parsing and language design (like me), and especially to anyone who's got involved with the grammar of C in particular (also like me), this is the kind of mistake that you would surely have been annoyed by before – which I have been – and be determined never to introduce the same bug into any grammar you designed. Which I was.

The answer is because I wrote all this code in the wrong order. When I was originally putting `spigot` together, I didn't know how general I was going to be able to make its expression evaluation in the first place, because I wasn't sure of exactly where my techniques for exact real computation were going to run out of steam. I had no real expectation of being able to support enough generality to even need user-defined functions, and they certainly weren't in my plans when I wrote the original parser. Originally, my ‘functions’ consisted of only the built-in ones like `sin` or `cos` or `exp` or `log`, and the parser just enshrined the complete list of them as effectively language keywords which it knew to treat specially. In other words, right from the start the parser had to know what identifiers were and were not functions – but at the time, that was OK, because the answer was always the same. Then, later on, I added user-defined functions in the obvious sort of way, and somehow I didn't even notice that I had introduced a scope-dependency to the lexer. It was only when I idly tried constructing a yacc grammar for the spigot expression language this weekend – just to see whether it would come out looking coherent – that I ran head-on into the problem and realised what a hole I had dug myself into.

Well, what's done is done. `spigot` doesn't have very many users yet (at least that I'm aware of), so I suppose I could just change the syntax and live with changing the meaning of some existing expressions. But I don't really think I want to, because I do like the fact that `spigot` manages to have an expression language that's closer to ordinary maths notation than the typical programming language manages. I don't want to have to lose multiplication by juxtaposition; I like it!

(`spigot`'s other unusual feature improving mathematical clarity is that it has no need for the typical language's bewildering variety of integer and floating-point types, each essentially representing some subset of the real numbers, but with different tradeoffs between speed and expressiveness so that you constantly have to remember which one you're using and occasionally go round the houses to compensate for it being the wrong one. For example, the way `gnuplot` makes you write `1.0/3` in place of `1/3`, or the constant need to compensate for overflowing intermediate values.)

So, if I'm not going to get rid of it, I suppose I'd better come up with an argument for how it's not that bad really!

Pondering that question, I found myself considering how I would best augment a `yacc`-style parser generator to cope with grammar ambiguities of this kind.

One option, and as far as I'm aware the usual option for people parsing C, is to have the parser feed back information to the lexer, so that the lexer can return different token types for the things that need to be distinguished. In C, this means that `typedef_name` becomes a distinct token type from `identifier`; in `spigot`, the token types would probably be called `var_id` and `func_id` (and perhaps also `unknown_id`).

This technique works in practice, but it's pretty annoying. Among its shortcomings are:

• To implement it in Bison, you have to do something fiddly with mid-rule actions, to ensure that the current set of identifier types is kept up to date even if it changes in the middle of parsing some expression. (In `spigot`, expressions of the form `let a=b,c=d` put the first definition in scope during the second, so you have to keep yourself up to date quite promptly.)
• Involving the lexer means that you have to cross two layer boundaries in your front end: between parser and type/scope analysis, and between lexer and parser. Clearly the parser needs to collude with the scope analysis, but it would be nice if you could get away with only that one layer violation.
• This technique needs you to be very confident of exactly how the LR parser automaton operates and how much lookahead it needs, because you need to be sure that it will never need to lex any identifier token in order to get enough lookahead to decide to reduce by the rule that defines that same identifier. And no pre-provided analysis will help you find out, for a given grammar, whether that's safe.
• Also, once you've defined more than one sub-type of ‘identifier’ to disambiguate one particular part of the language design, the rest of your grammar needs to know about all the sub-types and handle them all, even in totally different grammatical contexts where it doesn't matter which is which anyway. For example, in `spigot`, the grammar rule that parses new definitions in my `let` statement would have to be prepared to receive any of the subtypes of ‘identifier’ in the slot where you name the new thing being defined. It would be nicer to be able to just write ‘identifier’ everywhere you could, and only specify the disambiguation in the one case where it's important.

In spite of all that, the lexer feedback technique does at least work. But I think there's a nicer way, which you can do without involving the lexer at all.

If you just write the ambiguous grammar anyway, and feed it into `yacc` or Bison or whatever, then it will report a parsing action conflict (in this case, shift/reduce), indicating that at some point in the parsing state machine, two different actions would both be legal, and would potentially give rise to different parses. For other ambiguities of this kind, there's a long-standing tradition of not necessarily solving them by rewriting the actual CFG to make it unambiguous, but instead simply providing auxiliary information alongside the grammar itself to allow the parser generator to choose the right one of the possible actions. Usually this takes the form of directives like `%prec`.

In this case, it's not as simple as `%prec`, because the right answer to ‘shift or reduce?’ has to be determined based on which kind of identifier we're looking at. But the key point is that we can wait to resolve the question until we actually have to decide on that conflicted parsing action – and at that point, we have to do some disambiguation, i.e. check what's in scope.

Concretely, suppose we augment the grammar specification in two ways. One is that certain nonterminals on the parse stack include a specification of scope. In `spigot`'s case, the relevant piece of grammar is the part that sets up the definitions section of the `let` statement:

``let_expr : LET defs IN expr ;``
``defs     : def         | defs ',' def         ;``
``def      : ID '=' expr         | ID '(' args ')' '=' expr         ;``

You would augment this by saying that the nonterminals `def` and `defs` each came with an extra piece of information, stating the set of identifiers whose definitions are contained inside each one, and whether each of those identifiers is defined as a function or a variable. Something like:

``defs     : def            [ defs_in = defs_in(\$1) ]         | defs ',' def   [ defs_in = defs_in(\$1) union defs_in(\$3) ]         ;``
``def      : ID '=' expr               [ defs_in = { \$1 : VARIABLE } ]         | ID '(' args ')' '=' expr  [ defs_in = { \$1 : FUNCTION } ]         ;``

And the other part of the mechanism is that you provide a function that can return some kind of sub-type information for a given token. You'd define an enumeration of subtypes of the token `ID` (here, the same values `VARIABLE` and `FUNCTION` shown above), and then you could augment the two problematic productions in the main expression grammar

``   expr : ID        | ID '(' args ')'        | /* other options */ ;``

so that each one required the `ID` token to have a particular subtype:

``   expr : ID[subtype=VARIABLE]        | ID[subtype=FUNCTION] '(' args ')'        | /* other options */ ;``

and now the only remaining thing you would have to do is to provide an actual piece of code to determine the `subtype` property for a given `ID` token, and it would do it by looking back along the entire LR parser stack until it found an object with an attached `defs_in` property including the identifier in question.

So now the LR parser constructor can use all of this information to know how to resolve its shift/reduce (or whatever) conflict. Better still, it can prove at parser-generation time that the conflict is resolved, because it knows that the `subtype()` property can only take one of the two possible values, and so it can be sure that the two rules that conflict with each other can never be valid at once.

The nice thing about this is that it allows the parser to make its decision a great deal later than it would with the lexer-feedback implementation. You don't have to know, when you lex the identifier, what kind of thing the current scope might or might not believe it to be. Not only that, but the parser doesn't even have to know what kind of identifier it is when it comes to the point of shifting it! It can defer the choice until it is actually facing the shift/reduce conflict that the unaugmented version of the grammar would have, i.e. the very latest possible moment where it really needs to commit to one parsing action or the other. And at that point, it can look at the conflicting action options, rule out one or both by means of checking whatever token subtype(s) would make them valid, and then decide.

So this makes the technique a lot more robust against eager lexing. You could imagine situations in which you had some kind of extra-powerful parser generator, which was prepared to escalate to LR(2) or LR(3) tables if that would help it with a particular grammar; and you could imagine a grammar containing one feature that required extra lookahead, and a feature somewhere completely different requiring this kind of property – and with this technique, there's no need to trade those off against each other, because the ambiguity resolution system doesn't impose any requirement to have lexing delayed until the scope analysis is ready.

(Indeed, this implementation strategy also allows a much more mundane kind of implementation flexibility, namely that an implementation might for some reason decide it wanted to do its entire lexing pass before the parse began.)

At the same time, this technique also makes the system more independent of exactly what's going on on the parse stack. In the lexer-feedback approach, you probably arranged to push and pop things from the lexer's active scope stack by putting the pushes and pops in mid-rule actions, which means you were using your implicit knowledge of exactly when Bison will enact those actions, i.e. when it will reduce. So your specification of how to resolve ambiguities is inextricably tied to the fundamental LR-nature of the parsing system you're implementing it on.

But in my system, you place far fewer requirements on the parsing system in use. Basically your only requirement is that it provides something resembling an LR parser stack, i.e. a partially reduced summary of everything that's been seen so far in the input. So as long as you know how to reliably check that for whether a name is currently in scope as a function, there's no need to worry about the details. In particular, if I needed the name of a variable to be in scope for its own definition (which `spigot` doesn't, but other languages with similar constructions do), it would be easy to recognise the left-hand side of an unfinished `def` in the subtype determiner, without having to worry about how to insert a mid-rule action that would manage to reduce something in time.

So I think I'm now more or less convinced that there exists a means of handling `typedef`-style grammar quirks that is robust, more or less principled, and not dependent on implementation details in the way that the usual lexer feedback hack is.

Of course `spigot` itself has a handwritten recursive descent parser rather than any of this interesting augmented-autogenerated-LR business. (Indeed, that's how I got into this mess in the first place.) So I don't need to actually do any of this – though if I ever succumb to the temptation to write my own parser generator, I will surely add this feature to it (if only so I can use C as a show-off test case).

The main effect of having thought through all this is that I now believe that you could do this approximately-principled LR thing if you needed to. And as a result, I think I've achieved my original goal of rationalising to myself why the language design bug I accidentally introduced in `spigot` is not quite as much of a sin as I previously held it to be :-)

#### no subject

Huzzah for rationalisation!

Question for you - if you came upon a blackboard which had written on it "a(stuff)", how would you know if the original writer had meant to multiply the two or perform a on stuff? Or is this something which is imprecise because it's imprecise when mathematicians write in by hand?

#### no subject

And then there are FORTRAN FORMAT statements, measured against which you're positively angelic...