simont | Layer-free shell syntax

Here's a question I've been pondering for a while, to which I don't have any good answer.

I often find myself answering questions from Unix users who are having a little trouble working out how to get some complicated piece of shell syntax to work. The impression I'm left with is that a major source of confusion is the fact that the POSIX shell syntax as a whole – defined as the entire sequence of steps that get from a single input string (typed on the command line or read from a script) to the decisions of individual programs about what to do – is composed of a great many layers, and it doesn't take all that unusual a situation before it becomes important to understand what happens in which layer in order to debug your problem.

Examples of gotchas arising from this, many of which you need to develop quite a deep understanding to sort out, include:

- searching for a string starting with a minus sign using a command along the lines of ‘grep \-10 file.txt’. (‘The backslash causes other characters to be treated literally rather than specially, so why not this one?’)
- expecting ‘~’ and ‘~user’ to still be expanded after a colon (‘PATH=/here:~/there:$PATH’).
- getting subtly wrong results when adding a prefix to an existing command line, if the latter had redirections. (E.g. prefixing ‘sudo’ or ‘gdb --args’, and finding that in the former case the file is not opened with root privilege and in the latter case all of gdb's interactive output gets lost.)
- almost anything involving multiple layers of escaping. (The typical example being regular expressions, where you have to first escape special characters to stop them being special to the shell, and then escape them a second time to stop them being special to the application. And don't forget to shell-escape the extra characters you used for the second job, of course.)
- when a one-character option takes an argument either with or without a separating space, and you want to pass an empty string as argument, omitting the space before the quoted empty string. (‘If ‘-s foo’ and ‘-sfoo’ and ‘-s ""’ all do what I expect, why on earth would ‘-s""’ not?’)
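Two of the gotchas above (the backslashed minus sign and the ‘-s""’ case) can be reproduced without a shell at all. The following is only a rough sketch: it uses Python's shlex.split as a stand-in for the shell's word splitting and quote removal (it performs no expansions), and the helper name argv_seen_by_program is made up for illustration.

```python
import shlex


def argv_seen_by_program(command_line):
    """Approximate the argument vector a program receives.

    shlex.split (POSIX mode) only imitates word splitting and quote
    removal; it does no tilde, variable or glob expansion.
    """
    return shlex.split(command_line)


# The backslash is consumed by the shell layer, so grep still sees "-10"
# and treats it as an option.
print(argv_seen_by_program(r"grep \-10 file.txt"))

# With a space, the empty string survives as its own argument...
print(argv_seen_by_program('cmd -s ""'))

# ...but without one, quote removal leaves just "-s", and the program
# cannot tell that an empty argument was intended.
print(argv_seen_by_program('cmd -s""'))
```

In both failing cases the information the user typed is gone before the program's own option parser ever runs, which is why the fix has to be expressed in the program's layer rather than the shell's.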

Some of these even catch me out on occasion, and I know perfectly well why they don't work. Typically I realise my mistake as soon as I see the error message or odd behaviour – but it's telling that I nonetheless still get it wrong some of the time, because the real rules of how this stuff works are just too cumbersome to think through from first principles for every command I type, and so I use a simpler intuitive model most of the time and only resort to applying my full understanding when that can't cope.

And that's the point, really. The system is just too complicated to manage sensibly. So: if we were designing a shell syntax from scratch, would it be possible to arrange it in such a way that you don't ever have to reason about layers? The ground rule would be: the system is permitted to divide labour between multiple processes or pieces of code, as long as that division is its problem rather than the user's, and (barring the diagnosis of actual bugs) a user never has to keep in mind the divisions between those components just to figure out what a given command will do or what command they need to do a given task.

But the catch is that you aren't allowed to solve the problem by reducing the expressive power of the syntax. If you do, application authors will take up the slack by implementing all the things you didn't, and they'll all do it differently from each other, or not do it at all, and nobody will be better off. (See, for example, the haphazard state of wildcard support in Windows command-line programs.)

Here's a simple example of the sort of thing I'm thinking of. In a summer job when I was a student, I had occasion to write a tool that presented an internal command line, and I got to make up the syntax used on that command line. Basically on a whim, I set it up as follows: spaces were special (they separated words), double quotes were special (they stopped spaces from separating words), and the only other special things were braces, which were used to enclose special tag names. Initially, when starting to scan a command line and looking for the actual command word, the only thing you were allowed to do with braces was to use them to escape the existing special characters, so that {{} and {}} and {"} translated into the literal characters {, } and ". But the main command processor only parsed the input line as far as was necessary to find the command name – and parsing the rest of the line was deferred until that command had had a chance to add extra sequences to the list of braced escapes. So, for instance, if a command had decided to treat the minus sign specially, that command could also have added support for an escaped version {-} which was not treated specially; so a user would not have had to draw the mental distinction between characters treated specially by the ‘shell’ and by one particular subcommand, and would have been able to work on the much simpler principle that all special characters were defusable by bracing them. (Or, perhaps better still, the subcommands could all have adopted the principle that everything not braced was treated literally, and invented braced tags to do everything special they might need.)
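A minimal Python sketch of a scheme along these lines follows. It is a reconstruction from the description above, not the original tool: the names BASE_ESCAPES, read_word, run and the example command ‘show’ are all invented for illustration, and details such as error handling are guesses.

```python
# Escapes known before the command word has been identified.
BASE_ESCAPES = {"{": "{", "}": "}", '"': '"'}


def read_word(line, pos, escapes):
    """Read one word from line starting at pos.

    Returns (word, new_pos), where word is a list of (char, was_braced)
    pairs, or (None, pos) if only spaces remain.
    """
    while pos < len(line) and line[pos] == " ":
        pos += 1
    if pos >= len(line):
        return None, pos
    word, in_quotes = [], False
    while pos < len(line):
        ch = line[pos]
        if ch == "{":
            # A tag name is at least one character, so {}} names "}".
            end = line.find("}", pos + 2)
            tag = line[pos + 1:end] if end != -1 else None
            if tag not in escapes:
                raise ValueError("bad brace tag at column %d" % pos)
            word.extend((c, True) for c in escapes[tag])
            pos = end + 1
        elif ch == '"':
            in_quotes = not in_quotes
            pos += 1
        elif ch == " " and not in_quotes:
            break
        else:
            word.append((ch, False))
            pos += 1
    if in_quotes:
        raise ValueError("unterminated double quote")
    return word, pos


def text(word):
    """Drop the was_braced flags and return the plain string."""
    return "".join(c for c, _ in word)


def show(args):
    """Hypothetical command: an unbraced leading '-' marks an option."""
    options, operands = [], []
    for word in args:
        if word and word[0] == ("-", False):
            options.append(text(word[1:]))
        else:
            operands.append(text(word))
    return options, operands


# Each command registers the extra braced escapes it wants, plus a handler.
COMMANDS = {"show": ({"-": "-"}, show)}


def run(line, commands=COMMANDS):
    # Parse only as far as the command word, using the base escapes.
    name, pos = read_word(line, 0, BASE_ESCAPES)
    if name is None:
        return None
    extra, handler = commands[text(name)]
    # The rest of the line is parsed with the command's escapes added.
    escapes = {**BASE_ESCAPES, **extra}
    args = []
    while True:
        word, pos = read_word(line, pos, escapes)
        if word is None:
            return handler(args)
        args.append(word)


print(run('show -v {-}10 "a b" {{}x{}}'))
```

The point of carrying the was_braced flag alongside each character is that the command never has to re-parse a string: it can see directly that the ‘-’ in ‘{-}10’ was defused. Note that in this sketch ‘-s""’ and ‘-s’ still produce identical words.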

Of course it was easy for me to do something like that in a small language supporting only a few commands, particularly when there was no division into subprocesses (each of my commands was just a procedure within the same source file). And even so, my design wasn't perfect (e.g. it would still have suffered from the ‘-s""’ flaw in my above list). And it had very simple functionality by comparison to the full scope of real shell syntax. And, of course, I wouldn't even dare to imagine that any replacement shell syntax design would be realistically adoptable at this stage, what with a huge established source base.

But it illustrates the sort of idea I'm thinking of. So, just as an exercise for curiosity's sake … could the entire edifice of shell design be reworked, in principle, by means such as the above or by totally different means, so as to avoid the need to have users keep track of the multiple layers of code involved?

Yes. Tcl's original quoting characters are { and }. These days it has " as well.

Though they're not interchangeable, of course, due to the different interpolation behaviour.

What I found amusing, though, was when I learned that except for the interpolation, they are identical - in particular, you can use quotation marks for "blocks" (arguments to things like if or proc or the like), if there are no interpolatables in the block (or you escape them yourself). Even multi-line ones.

That was simply something I wasn't used to, coming from C-like languages: blocks had braces and strings had quotes, and never the twain shall meet - but in Tcl, essentially everything's a string. (Or everything's a list of strings? At any rate, "blocks" aren't special.)

And in fact Tcl's syntax is remarkably shallow, especially compared with the shell.

A lot of the problem with quotation syntax is that unquoting the string implies a rewrite, which leads to problems with repeated doubling of \\\\ and suchlike. Perhaps quotations should be passed down to the next layer verbatim, so that they are only unquoted at the last possible moment.
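To make the backslash doubling concrete, here is a small Python sketch of a generic quoting layer that wraps a string in double quotes and backslash-escapes any quotes and backslashes inside it. The scheme and the names quote_layer and unquote_layer are assumptions chosen for illustration, not any particular shell's rules.

```python
def quote_layer(s):
    """Add one layer of quoting: escape backslashes, then quotes, then wrap."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'


def unquote_layer(s):
    """Remove one layer of quoting added by quote_layer."""
    if len(s) < 2 or s[0] != '"' or s[-1] != '"':
        raise ValueError("not a quoted string")
    out = []
    chars = iter(s[1:-1])
    for ch in chars:
        # A backslash makes the following character literal.
        out.append(next(chars) if ch == "\\" else ch)
    return "".join(out)


# A single backslash, quoted for one, two and three layers.
s = "\\"
for depth in range(1, 4):
    s = quote_layer(s)
    print(depth, s.count("\\"), s)
```

Because every layer rewrites the string on the way in and again on the way out, the number of backslashes grows with each level of nesting, and the user has to know how many layers lie ahead in order to type the right number.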

One way of looking at the problem is that it arises from being "stringly typed". Perhaps a command line should be parsed once, at which point more specific types are inferred for the various parts of the command, and subsequent processing of the command happens in a type-safe manner. So for example, when expanding a glob the resulting list is a list of filenames, not a list of undistinguished command line arguments, so there can be no confusion if one of the files is named -rf.
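A rough Python sketch of what "parse once, then stay typed" might look like is below. The classes Option, Filename and Glob, the parsing rules in parse_word, and the explicit directory listing passed to expand are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Option:
    name: str


@dataclass(frozen=True)
class Filename:
    path: str


@dataclass(frozen=True)
class Glob:
    pattern: str


def parse_word(word):
    """Infer a type for one word, exactly once, when the user typed it."""
    if word.startswith("-"):
        return Option(word[1:])
    if any(c in word for c in "*?["):
        return Glob(word)
    return Filename(word)


def expand(args, listing):
    """Expand globs against a directory listing; results stay Filenames."""
    out = []
    for arg in args:
        if isinstance(arg, Glob):
            matches = sorted(n for n in listing if fnmatchcase(n, arg.pattern))
            out.extend(Filename(n) for n in matches)
        else:
            out.append(arg)
    return out


# A file called "-rf" produced by a glob is a Filename, never an Option.
print(expand([parse_word("-v"), parse_word("*")], ["notes.txt", "-rf"]))
```

The command receiving this list can dispatch on type rather than on the first character of each string, so the hostile filename has no way to masquerade as an option.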

A lot of the problem with quotation syntax is that unquoting the string implies a rewrite, which leads to problems with repeated doubling of \\\\ and suchlike. Perhaps quotations should be passed down to the next layer verbatim, so that they are only unquoted at the last possible moment.

But then how can you tell at which level a \} (or \\\} or \\\\\}, etc.) is to be interpreted? (That is, where the "last possible moment" for unquoting that character sequence is: the bottom-most layer, or somewhere in between?)

If it isn't the bottom-most it isn't the last possible moment.

The problem, of course, is that this leads to an un-unixy design where each program has to do unquoting of its arguments if necessary, rather than relying on the shell to handle all metacharacters, and this in turn inevitably leads to incompatibilities.

It's not really Unixy, but POSIX already has a kind of shared dequoting system in the form of getopt(), which handles the quoting of operands using --. Of course, this leads to precisely the kind of incompatibilities you refer to.
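The effect of -- can be seen in a few lines using Python's getopt module, which follows the same conventions as the C function. This is just a sketch; the option letters and the helper name split_options are made up.

```python
import getopt


def split_options(argv):
    """Split argv into option names and operands, accepting -r, -f and -v."""
    opts, operands = getopt.getopt(argv, "rfv")
    return [name for name, _ in opts], operands


# Without "--", a file named "-rf" is parsed as two options.
print(split_options(["-v", "-rf"]))

# With "--", everything after it is an operand, whatever it looks like.
print(split_options(["-v", "--", "-rf"]))
```

This only helps with programs that actually use getopt-style parsing, which is where the incompatibilities come from.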

I'd been pondering the "only unquote once, as late as possible" rule, since that's effectively what URIs do -- there's one quoting scheme, and effectively each layer only treats non-quoted characters as special and passes quoted ones on to the next layer down. This only works because the layers don't conflict over their special characters, but I wonder if you could combine this with Tclish nestable quotes to get something useful.
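As a small illustration of the URI approach, the Python sketch below splits a query string on its unquoted delimiters first and only then unquotes each piece, using urllib.parse.quote and unquote. The helper names build_query and parse_query are invented, and real query-string handling has more cases than this.

```python
from urllib.parse import quote, unquote


def build_query(pairs):
    """Percent-encode each key and value, then join with bare & and =."""
    return "&".join(
        quote(key, safe="") + "=" + quote(value, safe="")
        for key, value in pairs
    )


def parse_query(query):
    """Split on unquoted delimiters first; unquote each piece last."""
    pairs = []
    for field in query.split("&"):
        key, _, value = field.partition("=")
        pairs.append((unquote(key), unquote(value)))
    return pairs


query = build_query([("a&b", "c=d"), ("x", "y")])
print(query)
print(parse_query(query))
```

The delimiters inside the data are never mistaken for structure because the splitting layer looks only at unquoted characters and the unquoting happens exactly once, at the bottom.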

It also occurs to me that you mostly don't pass shell commands to programs as a single quoted string, but as the tail of another command, which is kind of reminiscent of the way the URI syntax is defined to allow an entire URI to be used with no additional quoting as a query string.
