My language design is bad and I should feel bad
Over the weekend, I realised, extremely belatedly, that the expression language I designed for my free-spigot
has a grammar bug. Specifically, it's a context-typedef
bug: the same expression can parse differently depending on whether a given identifier is currently defined to be a function or a variable, which means that you need to do your scope analysis in sync with the parse, so that you can know what set of definitions is currently in scope for the subexpression you're currently looking at.
The problem arises because spigot
has the unusual grammar feature of implicit multiplication by juxtaposition, i.e. you can just put two expressions next to each other without an explicit *
or × sign in between:
$ spigot 'let x=100 in x(x+1)'
10100
And spigot
also supports user-
$ spigot 'let f(x)=x+100 in f(20)'
120
These two grammar features collide, of course. If you write an expression of the form a(stuff)
, then spigot
will have to parse it differently depending on whether a
is a function or a variable.
It would be nice to think that you could just defer the decision until after parsing was finished, by turning such expressions into some kind of ambiguous AST node, with the semantics ‘This is either a function call or a multiplication, we'll sort out which later on’. But the precedences of the other operators mean that you really do have to know it at parse time, because expressions like a^f(b)
or f(a)^b
will apply the ^
operator to different amounts of the expression depending on whether f
is a function. A nice way to illustrate this is to try defining f
in two ways that come out the same if you just write f(x)
on its own:
$ spigot 'let f=10 in f(2)'
20
$ spigot 'let f(x)=x*10 in f(2)'
20
and then to observe that those two definitions still manage to make a difference if you put other operators next to them:
$ spigot 'let a=3,b=4,f=10 in a^f(b)' # 3^10 * 4
236196
$ spigot 'let a=3,b=4,f(x)=x*10 in a^f(b)' # 3^40
12157665459056928801
$ spigot 'let a=3,b=4,f=10 in f(a)^b' # 10 * 3^4
810
$ spigot 'let a=3,b=4,f(x)=x*10 in f(a)^b' # 30^4
810000
So the first question is: how on earth did I manage to make that mistake? To anyone who's studied grammars and parsing and language design (like me), and especially to anyone who's got involved with the grammar of C in particular (also like me), this is the kind of mistake that you would surely have been annoyed by before –
The answer is because I wrote all this code in the wrong order. When I was originally putting spigot
together, I didn't know how general I was going to be able to make its expression evaluation in the first place, because I wasn't sure of exactly where my techniques for exact real computation were going to run out of steam. I had no real expectation of being able to support enough generality to even need user-sin
or cos
or exp
or log
, and the parser just enshrined the complete list of them as effectively language keywords which it knew to treat specially. In other words, right from the start the parser had to know what identifiers were and were not functions –
Well, what's done is done. spigot
doesn't have very many users yet (at least that I'm aware of), so I suppose I could just change the syntax and live with changing the meaning of some existing expressions. But I don't really think I want to, because I do like the fact that spigot
manages to have an expression language that's closer to ordinary maths notation than the typical programming language manages. I don't want to have to lose multiplication by juxtaposition; I like it!
(spigot
's other unusual feature improving mathematical clarity is that it has no need for the typical language's bewildering variety of integer and floating-gnuplot
makes you write 1.0/3
in place of 1/3
, or the constant need to compensate for overflowing intermediate values.)
So, if I'm not going to get rid of it, I suppose I'd better come up with an argument for how it's not that bad really!
Pondering that question, I found myself considering how I would best augment a yacc
-style parser generator to cope with grammar ambiguities of this kind.
One option, and as far as I'm aware the usual option for people parsing C, is to have the parser feed back information to the lexer, so that the lexer can return different token types for the things that need to be distinguished. In C, this means that typedef_name
becomes a distinct token type from identifier
; in spigot
, the token types would probably be called var_id
and func_id
(and perhaps also unknown_id
).
This technique works in practice, but it's pretty annoying. Among its shortcomings are:
- To implement it in Bison, you have to do something fiddly with mid-
rule actions, to ensure that the current set of identifier types is kept up to date even if it changes in the middle of parsing some expression. (In spigot
, expressions of the formlet a=b,c=d
put the first definition in scope during the second, so you have to keep yourself up to date quite promptly.) - Involving the lexer means that you have to cross two layer boundaries in your front end: between parser and type/scope analysis, and between lexer and parser. Clearly the parser needs to collude with the scope analysis, but it would be nice if you could get away with only that one layer violation.
- This technique needs you to be very confident of exactly how the LR parser automaton operates and how much lookahead it needs, because you need to be sure that it will never need to lex any identifier token in order to get enough lookahead to decide to reduce by the rule that defines that same identifier. And no pre-
provided analysis will help you find out, for a given grammar, whether that's safe. - Also, once you've defined more than one sub-
type of ‘identifier’ to disambiguate one particular part of the language design, the rest of your grammar needs to know about all the sub- types and handle them all, even in totally different grammatical contexts where it doesn't matter which is which anyway. For example, in spigot
, the grammar rule that parses new definitions in mylet
statement would have to be prepared to receive any of the subtypes of ‘identifier’ in the slot where you name the new thing being defined. It would be nicer to be able to just write ‘identifier’ everywhere you could, and only specify the disambiguation in the one case where it's important.
In spite of all that, the lexer feedback technique does at least work. But I think there's a nicer way, which you can do without involving the lexer at all.
If you just write the ambiguous grammar anyway, and feed it into yacc
or Bison or whatever, then it will report a parsing action conflict (in this case, shift/reduce), indicating that at some point in the parsing state machine, two different actions would both be legal, and would potentially give rise to different parses. For other ambiguities of this kind, there's a long-%prec
.
In this case, it's not as simple as %prec
, because the right answer to ‘shift or reduce?’ has to be determined based on which kind of identifier we're looking at. But the key point is that we can wait to resolve the question until we actually have to decide on that conflicted parsing action –
Concretely, suppose we augment the grammar specification in two ways. One is that certain nonterminals on the parse stack include a specification of scope. In spigot
's case, the relevant piece of grammar is the part that sets up the definitions section of the let
statement:
let_expr : LET defs IN expr ;
defs : def
| defs ',' def
;def : ID '=' expr
| ID '(' args ')' '=' expr
;
You would augment this by saying that the nonterminals def
and defs
each came with an extra piece of information, stating the set of identifiers whose definitions are contained inside each one, and whether each of those identifiers is defined as a function or a variable. Something like:
defs : def [ defs_in = defs_in($1) ]
| defs ',' def [ defs_in = defs_in($1) union defs_in($3) ]
;def : ID '=' expr [ defs_in = { $1 : VARIABLE } ]
| ID '(' args ')' '=' expr [ defs_in = { $1 : FUNCTION } ]
;
And the other part of the mechanism is that you provide a function that can return some kind of sub-ID
(here, the same values VARIABLE
and FUNCTION
shown above), and then you could augment the two problematic productions in the main expression grammar
expr : ID
| ID '(' args ')'
| /* other options */ ;
so that each one required the ID
token to have a particular subtype:
expr : ID[subtype=VARIABLE]
| ID[subtype=FUNCTION] '(' args ')'
| /* other options */ ;
and now the only remaining thing you would have to do is to provide an actual piece of code to determine the subtype
property for a given ID
token, and it would do it by looking back along the entire LR parser stack until it found an object with an attached defs_in
property including the identifier in question.
So now the LR parser constructor can use all of this information to know how to resolve its shift/reduce (or whatever) conflict. Better still, it can prove at parser-subtype()
property can only take one of the two possible values, and so it can be sure that the two rules that conflict with each other can never be valid at once.
The nice thing about this is that it allows the parser to make its decision a great deal later than it would with the lexer-
So this makes the technique a lot more robust against eager lexing. You could imagine situations in which you had some kind of extra-
(Indeed, this implementation strategy also allows a much more mundane kind of implementation flexibility, namely that an implementation might for some reason decide it wanted to do its entire lexing pass before the parse began.)
At the same time, this technique also makes the system more independent of exactly what's going on on the parse stack. In the lexer-
But in my system, you place far fewer requirements on the parsing system in use. Basically your only requirement is that it provides something resembling an LR parser stack, i.e. a partially reduced summary of everything that's been seen so far in the input. So as long as you know how to reliably check that for whether a name is currently in scope as a function, there's no need to worry about the details. In particular, if I needed the name of a variable to be in scope for its own definition (which spigot
doesn't, but other languages with similar constructions do), it would be easy to recognise the left-def
in the subtype determiner, without having to worry about how to insert a mid-
So I think I'm now more or less convinced that there exists a means of handling typedef
-style grammar quirks that is robust, more or less principled, and not dependent on implementation details in the way that the usual lexer feedback hack is.
Of course spigot
itself has a handwritten recursive descent parser rather than any of this interesting augmented-
The main effect of having thought through all this is that I now believe that you could do this approximately-spigot
is not quite as much of a sin as I previously held it to be :-)