
Mon 2017-02-20 20:09
My language design is bad and I should feel bad

Over the weekend, I realised, extremely belatedly, that the expression language I designed for my free-software project spigot has a grammar bug. Specifically, it's a context-dependency analogous to the C typedef bug: the same expression can parse differently depending on whether a given identifier is currently defined to be a function or a variable, which means that you need to do your scope analysis in sync with the parse, so that you can know what set of definitions is currently in scope for the subexpression you're currently looking at.

The problem arises because spigot has the unusual grammar feature of implicit multiplication by juxtaposition, i.e. you can just put two expressions next to each other without an explicit * or × sign in between:

$ spigot 'let x=100 in x(x+1)'

And spigot also supports user-defined functions:

$ spigot 'let f(x)=x+100 in f(20)'

These two grammar features collide, of course. If you write an expression of the form a(stuff), then spigot will have to parse it differently depending on whether a is a function or a variable.

It would be nice to think that you could just defer the decision until after parsing was finished, by turning such expressions into some kind of ambiguous AST node, with the semantics ‘This is either a function call or a multiplication, we'll sort out which later on’. But the precedences of the other operators mean that you really do have to know it at parse time, because expressions like a^f(b) or f(a)^b will apply the ^ operator to different amounts of the expression depending on whether f is a function. A nice way to illustrate this is to try defining f in two ways that come out the same if you just write f(x) on its own:

$ spigot 'let f=10 in f(2)'
$ spigot 'let f(x)=x*10 in f(2)'

and then to observe that those two definitions still manage to make a difference if you put other operators next to them:

$ spigot 'let a=3,b=4,f=10      in a^f(b)'    # 3^10 * 4
$ spigot 'let a=3,b=4,f(x)=x*10 in a^f(b)'    # 3^40
$ spigot 'let a=3,b=4,f=10      in f(a)^b'    # 10 * 3^4
$ spigot 'let a=3,b=4,f(x)=x*10 in f(a)^b'    # 30^4
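
To make the difference concrete, here's a quick Python sketch (Python's ** standing in for spigot's ^) that evaluates both readings of the a^f(b) example by hand:

```python
# Both readings of  a^f(b)  with a=3, b=4, evaluated by hand.
a, b = 3, 4

# f defined as the variable f = 10: f(b) is juxtaposition, i.e. f * b,
# and ^ binds tighter than juxtaposition, so a^f(b) means (a^f) * b.
f_var = 10
var_reading = a**f_var * b           # 3^10 * 4 = 236196

# f defined as the function f(x) = x*10: f(b) is a call, so a^f(b)
# means a^(f(b)) = 3^40.
def f(x):
    return x * 10
func_reading = a**f(b)

assert var_reading == 236196
assert func_reading == 3**40
```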

So the first question is: how on earth did I manage to make that mistake? To anyone who's studied grammars and parsing and language design (like me), and especially to anyone who's got involved with the grammar of C in particular (also like me), this is the kind of mistake that you would surely have been annoyed by before – which I have been – and be determined never to introduce the same bug into any grammar you designed. Which I was.

The answer is because I wrote all this code in the wrong order. When I was originally putting spigot together, I didn't know how general I was going to be able to make its expression evaluation in the first place, because I wasn't sure of exactly where my techniques for exact real computation were going to run out of steam. I had no real expectation of being able to support enough generality to even need user-defined functions, and they certainly weren't in my plans when I wrote the original parser. Originally, my ‘functions’ consisted of only the built-in ones like sin or cos or exp or log, and the parser just enshrined the complete list of them as effectively language keywords which it knew to treat specially. In other words, right from the start the parser had to know what identifiers were and were not functions – but at the time, that was OK, because the answer was always the same. Then, later on, I added user-defined functions in the obvious sort of way, and somehow I didn't even notice that I had introduced a scope-dependency to the lexer. It was only when I idly tried constructing a yacc grammar for the spigot expression language this weekend – just to see whether it would come out looking coherent – that I ran head-on into the problem and realised what a hole I had dug myself into.

Well, what's done is done. spigot doesn't have very many users yet (at least that I'm aware of), so I suppose I could just change the syntax and live with changing the meaning of some existing expressions. But I don't really think I want to, because I do like the fact that spigot manages to have an expression language that's closer to ordinary maths notation than the typical programming language manages. I don't want to have to lose multiplication by juxtaposition; I like it!

(spigot's other unusual feature improving mathematical clarity is that it has no need for the typical language's bewildering variety of integer and floating-point types, each essentially representing some subset of the real numbers, but with different tradeoffs between speed and expressiveness so that you constantly have to remember which one you're using and occasionally go round the houses to compensate for it being the wrong one. For example, the way gnuplot makes you write 1.0/3 in place of 1/3, or the constant need to compensate for overflowing intermediate values.)

So, if I'm not going to get rid of it, I suppose I'd better come up with an argument for how it's not that bad really!

Pondering that question, I found myself considering how I would best augment a yacc-style parser generator to cope with grammar ambiguities of this kind.

One option, and as far as I'm aware the usual option for people parsing C, is to have the parser feed back information to the lexer, so that the lexer can return different token types for the things that need to be distinguished. In C, this means that typedef_name becomes a distinct token type from identifier; in spigot, the token types would probably be called var_id and func_id (and perhaps also unknown_id).

This technique works in practice, but it's pretty annoying. Among its shortcomings are:

  • To implement it in Bison, you have to do something fiddly with mid-rule actions, to ensure that the current set of identifier types is kept up to date even if it changes in the middle of parsing some expression. (In spigot, expressions of the form let a=b,c=d put the first definition in scope during the second, so you have to keep yourself up to date quite promptly.)
  • Involving the lexer means that you have to cross two layer boundaries in your front end: between parser and type/scope analysis, and between lexer and parser. Clearly the parser needs to collude with the scope analysis, but it would be nice if you could get away with only that one layer violation.
  • This technique needs you to be very confident of exactly how the LR parser automaton operates and how much lookahead it needs, because you need to be sure that it will never need to lex any identifier token in order to get enough lookahead to decide to reduce by the rule that defines that same identifier. And no pre-provided analysis will help you find out, for a given grammar, whether that's safe.
  • Also, once you've defined more than one sub-type of ‘identifier’ to disambiguate one particular part of the language design, the rest of your grammar needs to know about all the sub-types and handle them all, even in totally different grammatical contexts where it doesn't matter which is which anyway. For example, in spigot, the grammar rule that parses new definitions in my let statement would have to be prepared to receive any of the subtypes of ‘identifier’ in the slot where you name the new thing being defined. It would be nicer to be able to just write ‘identifier’ everywhere you could, and only specify the disambiguation in the one case where it's important.
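
For what it's worth, the feedback loop itself is simple enough to sketch. Here's a minimal, hypothetical version in Python (token names and the scope table are my invention, not spigot's): the lexer consults a scope table that the parser's actions would be responsible for keeping up to date, and returns a different token type accordingly.

```python
import re

# identifier -> "FUNC" or "VAR"; in a real front end the parser's
# actions would update this as definitions come into scope.
scope = {}

def lex(text):
    """Yield (token_type, text) pairs, classifying identifiers via scope."""
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|\S", text):
        tok = m.group()
        if tok[0].isalpha() or tok[0] == "_":
            kind = scope.get(tok)
            if kind == "FUNC":
                yield ("FUNC_ID", tok)
            elif kind == "VAR":
                yield ("VAR_ID", tok)
            else:
                yield ("UNKNOWN_ID", tok)
        elif tok.isdigit():
            yield ("NUMBER", tok)
        else:
            yield (tok, tok)   # punctuation stands for itself

scope["f"] = "FUNC"
assert list(lex("f(2)")) == [
    ("FUNC_ID", "f"), ("(", "("), ("NUMBER", "2"), (")", ")")]
```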

In spite of all that, the lexer feedback technique does at least work. But I think there's a nicer way, which you can do without involving the lexer at all.

If you just write the ambiguous grammar anyway, and feed it into yacc or Bison or whatever, then it will report a parsing action conflict (in this case, shift/reduce), indicating that at some point in the parsing state machine, two different actions would both be legal, and would potentially give rise to different parses. For other ambiguities of this kind, there's a long-standing tradition of not necessarily solving them by rewriting the actual CFG to make it unambiguous, but instead simply providing auxiliary information alongside the grammar itself to allow the parser generator to choose the right one of the possible actions. Usually this takes the form of directives like %prec.

In this case, it's not as simple as %prec, because the right answer to ‘shift or reduce?’ has to be determined based on which kind of identifier we're looking at. But the key point is that we can wait to resolve the question until we actually have to decide on that conflicted parsing action – and at that point, we have to do some disambiguation, i.e. check what's in scope.

Concretely, suppose we augment the grammar specification in two ways. One is that certain nonterminals on the parse stack include a specification of scope. In spigot's case, the relevant piece of grammar is the part that sets up the definitions section of the let statement:

let_expr : LET defs IN expr ;
defs     : def
         | defs ',' def ;
def      : ID '=' expr
         | ID '(' args ')' '=' expr ;

You would augment this by saying that the nonterminals def and defs each came with an extra piece of information, stating the set of identifiers whose definitions are contained inside each one, and whether each of those identifiers is defined as a function or a variable. Something like:

defs     : def              [ defs_in = defs_in($1) ]
         | defs ',' def     [ defs_in = defs_in($1) union defs_in($3) ]
def      : ID '=' expr                [ defs_in = { $1 : VARIABLE } ]
         | ID '(' args ')' '=' expr   [ defs_in = { $1 : FUNCTION } ]

And the other part of the mechanism is that you provide a function that can return some kind of sub-type information for a given token. You'd define an enumeration of subtypes of the token ID (here, the same values VARIABLE and FUNCTION shown above), and then you could augment the two problematic productions in the main expression grammar

   expr : ID
        | ID '(' args ')'
        | /* other options */ ;

so that each one required the ID token to have a particular subtype:

   expr : ID[subtype=VARIABLE]
        | ID[subtype=FUNCTION] '(' args ')'
        | /* other options */ ;

and now the only remaining thing you would have to do is to provide an actual piece of code to determine the subtype property for a given ID token, and it would do it by looking back along the entire LR parser stack until it found an object with an attached defs_in property including the identifier in question.

So now the LR parser constructor can use all of this information to know how to resolve its shift/reduce (or whatever) conflict. Better still, it can prove at parser-generation time that the conflict is resolved, because it knows that the subtype() property can only take one of the two possible values, and so it can be sure that the two rules that conflict with each other can never be valid at once.
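
The stack-scanning subtype determiner is easy to sketch. Here's the idea above in executable form, with all names hypothetical (spigot contains nothing like this): walk back along the parse stack until a nonterminal carrying a defs_in annotation mentions the identifier in question.

```python
VARIABLE, FUNCTION = "VARIABLE", "FUNCTION"

def subtype(parse_stack, ident):
    """Classify ident by scanning back along the LR parse stack for the
    innermost enclosing entry whose defs_in annotation mentions it."""
    for entry in reversed(parse_stack):
        if ident in entry.get("defs_in", {}):
            return entry["defs_in"][ident]
    return None   # not in scope at all

# Mid-parse of  let f(x)=x*10, a=3 in a^f(b)  the stack might hold the
# already-reduced 'defs' nonterminal with its attached annotation:
stack = [
    {"symbol": "LET"},
    {"symbol": "defs", "defs_in": {"f": FUNCTION, "a": VARIABLE}},
    {"symbol": "IN"},
]
assert subtype(stack, "f") == FUNCTION
assert subtype(stack, "a") == VARIABLE
assert subtype(stack, "b") is None
```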

The nice thing about this is that it allows the parser to make its decision a great deal later than it would with the lexer-feedback implementation. You don't have to know, when you lex the identifier, what kind of thing the current scope might or might not believe it to be. Not only that, but the parser doesn't even have to know what kind of identifier it is when it comes to the point of shifting it! It can defer the choice until it is actually facing the shift/reduce conflict that the unaugmented version of the grammar would have, i.e. the very latest possible moment where it really needs to commit to one parsing action or the other. And at that point, it can look at the conflicting action options, rule out one or both by means of checking whatever token subtype(s) would make them valid, and then decide.

So this makes the technique a lot more robust against eager lexing. You could imagine situations in which you had some kind of extra-powerful parser generator, which was prepared to escalate to LR(2) or LR(3) tables if that would help it with a particular grammar; and you could imagine a grammar containing one feature that required extra lookahead, and a feature somewhere completely different requiring this kind of property – and with this technique, there's no need to trade those off against each other, because the ambiguity resolution system doesn't impose any requirement to have lexing delayed until the scope analysis is ready.

(Indeed, this implementation strategy also allows a much more mundane kind of implementation flexibility, namely that an implementation might for some reason decide it wanted to do its entire lexing pass before the parse began.)

At the same time, this technique also makes the system more independent of exactly what's going on on the parse stack. In the lexer-feedback approach, you probably arranged to push and pop things from the lexer's active scope stack by putting the pushes and pops in mid-rule actions, which means you were using your implicit knowledge of exactly when Bison will enact those actions, i.e. when it will reduce. So your specification of how to resolve ambiguities is inextricably tied to the fundamental LR-nature of the parsing system you're implementing it on.

But in my system, you place far fewer requirements on the parsing system in use. Basically your only requirement is that it provides something resembling an LR parser stack, i.e. a partially reduced summary of everything that's been seen so far in the input. So as long as you know how to reliably check that for whether a name is currently in scope as a function, there's no need to worry about the details. In particular, if I needed the name of a variable to be in scope for its own definition (which spigot doesn't, but other languages with similar constructions do), it would be easy to recognise the left-hand side of an unfinished def in the subtype determiner, without having to worry about how to insert a mid-rule action that would manage to reduce something in time.

So I think I'm now more or less convinced that there exists a means of handling typedef-style grammar quirks that is robust, more or less principled, and not dependent on implementation details in the way that the usual lexer feedback hack is.

Of course spigot itself has a handwritten recursive descent parser rather than any of this interesting augmented-autogenerated-LR business. (Indeed, that's how I got into this mess in the first place.) So I don't need to actually do any of this – though if I ever succumb to the temptation to write my own parser generator, I will surely add this feature to it (if only so I can use C as a show-off test case).

The main effect of having thought through all this is that I now believe that you could do this approximately-principled LR thing if you needed to. And as a result, I think I've achieved my original goal of rationalising to myself why the language design bug I accidentally introduced in spigot is not quite as much of a sin as I previously held it to be :-)


[personal profile] andrewducker | Tue 2017-02-21 09:04

Huzzah for rationalisation!

Question for you - if you came upon a blackboard which had written on it "a(stuff)", how would you know if the original writer had meant to multiply the two or perform a on stuff? Or is this something which is imprecise because it's imprecise when mathematicians write it by hand?

[personal profile] simont | Tue 2017-02-21 09:13

Or is this something which is imprecise because it's imprecise when mathematicians write it by hand?

It's exactly that. In fact I remember Feynman complaining about it in Surely You're Joking – he invented his own idiosyncratic notation for a lot of maths which didn't have all those ambiguities in which f(x) looks like f times x and dy/dx tempts you to cancel the ds. (And naturally he got used to using this in his own jottings, and then used it in front of someone else, who got completely confused, at which point he realised that maths notation is actually for communicating with other people and so he'd have to go back to the substandard normal notation after all.)

Some of the alternatives he described looked quite sensible, but others made me think he hadn't thought enough about futureproofing. For example, he had replacement notations for trig functions which looked like square root-type symbols, with a bar extending over the entire argument of the function. That's fine as far as it goes, but one's first question is 'OK, now how do you plan to extend this to an open-ended set of further mathematical functions that people will end up needing to use?' It's all very well if sin, cos and tan are the only things you'll ever need to do this to, but in practice you've now replaced the problem of picking a different word for each function with the much hairier one of picking a recognisably different symbol. (And then getting them all into Unicode / LaTeX / whatever...)

(Aww – I'd never looked before, but naturally someone has tried to implement Feynman's notation in TeX :-)

[personal profile] simont | Tue 2017-02-21 09:35

Aside: it now occurs to me that if you have a parser generator that supports this kind of subtype mechanism, then you could also use it as an alternative to the traditional %prec mechanism for resolving operator precedence without all that tedious mucking about with 11 subtly different nonterminals for 'expression'.

Instead, you have just one nonterm for 'expression', in the same way you would with %prec. You endow each expression with metadata indicating what its outermost operator is, and then each grammar production that combines two expressions using a new operator will have a constraint that says how the new operator has to compare with the outermost ones in the subexpressions. For example,

expr: expr[op_prec >= prec('+')] '+' expr[op_prec > prec('+')]

where the careful use of >= on the left and > on the right also lets you control whether the operator is left- or right-associative.

In fact, using this technique, you could probably get away with having just one grammar production covering all binary operators – and now you've achieved a massive separation of concerns, because your language grammar becomes independent of the set of supported operators and their precedence levels, so you could add a new precedence level without having to regenerate your parser at all.
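
Expressed procedurally rather than declaratively, that pair of constraints is exactly what a precedence-climbing parser checks. A sketch (hypothetical names; single-character atoms for brevity): the minimum-precedence threshold passed into each recursive call enforces op_prec >= prec(op) on the left and op_prec > prec(op) on the right, giving left-associativity.

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    """Precedence-climbing parse over single-character atoms and binary ops."""
    def parse_expr(pos, min_prec):
        left, pos = tokens[pos], pos + 1          # an atom
        while (pos < len(tokens) and tokens[pos] in PREC
               and PREC[tokens[pos]] >= min_prec):
            op = tokens[pos]
            # right operand must satisfy op_prec > prec(op): pass PREC[op] + 1
            right, pos = parse_expr(pos + 1, PREC[op] + 1)
            # left operand only needed op_prec >= prec(op), so a+b+c
            # comes out left-associative
            left = (op, left, right)
        return left, pos
    tree, _ = parse_expr(0, 0)
    return tree

assert parse(list("a+b*c")) == ("+", "a", ("*", "b", "c"))
assert parse(list("a-b-c")) == ("-", ("-", "a", "b"), "c")   # left-assoc
```

Adding a new operator or precedence level here is just an entry in PREC, with no change to the parsing code at all, which is the separation of concerns described above.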

Of course, this use of symbol subtyping doesn't do any scope analysis – the op_prec property I illustrate above can be both generated and checked with no need to look all the way back up the parse stack. But if you were really ambitious, you could combine the two techniques, and do scope analysis to determine the precedence of your binary operators – so that your language could support syntax in which you locally introduce a new binary operator token, or change the precedence of an existing one, and the changes are precisely scoped to only the small region of code in which the programmer actually thought it would make some specialist thing clearer...

[personal profile] deborah_c | Tue 2017-02-21 09:01

And then there are FORTRAN FORMAT statements, measured against which you're positively angelic...

[identity profile] cartesiandaemon.livejournal.com | Tue 2017-02-21 23:23

I did not follow all the parsing details. I want to read that through in more detail at some point.

But it also occurred to me, it should *look* obvious, I don't know if there are any heuristics you use in naming variables vs functions that could be promoted to a language requirement.

OTOH, I guess, function-call operator for ints could be defined to be multiply :)

Or, programmers could suck it up and figure out a way to use different sorts of bracket for "invoke function" from "collect expressions together" :)

[personal profile] simont | Wed 2017-02-22 06:18

I don't know if there are any heuristics you use in naming variables vs functions that could be promoted to a language requirement.

And you manage to make it look like coincidence that you say this just below [personal profile] deborah_c's comment about Fortran! :-)

Or, programmers could suck it up and figure out a way to use different sorts of bracket for "invoke function" from "collect expressions together" :)

Oddly, I did have the weird thought some time last year that – at least for functions in the true mathematical sense, i.e. no side effects and always the same output for the same input – square brackets might be quite appropriate in a computing context.

Because, mathematically speaking, a function is a static collection of (input,output) pairs, i.e. it's basically the same as an array or dict/map/hash – better yet, an immutable one. So it should have the same syntax as an array lookup!

The fact that in programming the thing we call 'function' computes each result on demand whereas 'map' stores them all in advance is a mere implementation distinction, not affecting the semantics of the operation, so it's OK to relegate it to not being flagged up by a mandatory syntactic distinction. (Indeed, the difference between the two kinds of thing is already blurred in languages like Python, where on the one hand you can subclass dict to make it automatically fill in values for missing keys on the first access attempt, and on the other hand, decorate functions to make them automatically memoised.)
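
Both halves of that Python blurring are a few lines each. A sketch (SquareTable is a made-up example): a dict subclass that computes missing values on first access via the __missing__ hook, next to a function memoised with functools.lru_cache, so that the 'stored table' and the 'computed on demand' versions behave identically apart from syntax.

```python
from functools import lru_cache

class SquareTable(dict):
    """A dict that fills in missing values on first access."""
    def __missing__(self, key):
        value = key * key
        self[key] = value
        return value

@lru_cache(maxsize=None)
def square(x):
    """A function that remembers its results like a table."""
    return x * x

table = SquareTable()
assert table[7] == 49    # indexing syntax, computed on demand
assert square(7) == 49   # call syntax, result cached
```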

Of course, the other meaning of 'function' in programming – the mechanistic one of 'save all my current state, go off and run arbitrary other code, come back with or without a return value, who knows what happened in between' – doesn't give rise to the same argument...

[identity profile] fivemack.livejournal.com | Thu 2017-02-23 10:02

Mathematica decided to use [] for invoke-function and () only for grouping (with {} for list construction and [[]] for index-into-list), so there's a half-reasonable-but-weird precedent for your last suggestion.
