Layer-free shell syntax
Here's a question I've been pondering for a while, to which I don't have any good answer.
I often find myself answering questions from Unix users who are having a little trouble working out how to get some complicated piece of shell syntax to work. The impression I'm left with is that a major source of confusion is the fact that the POSIX shell syntax as a whole –
Examples of gotchas arising from this, many of which you need to develop quite a deep understanding to sort out, include:
- searching for a string starting with a minus sign using a command along the lines of ‘grep \-10 file.txt’. (‘The backslash causes other characters to be treated literally rather than specially, so why not this one?’)
- expecting ‘~’ and ‘~user’ to still be expanded after a colon (‘PATH=/here:~/there:$PATH’).
- getting subtly wrong results when adding a prefix to an existing command line, if the latter had redirections. (E.g. prefixing ‘sudo’ or ‘gdb --args’, and finding that in the former case the file is not opened with root privilege and in the latter case all of gdb's interactive output gets lost.)
- almost anything involving multiple layers of escaping. (The typical example being regular expressions, where you have to first escape special characters to stop them being special to the shell, and then escape them a second time to stop them being special to the application. And don't forget to shell-escape the extra characters you used for the second job, of course.)
- when a one-character option takes an argument either with or without a separating space, and you want to pass an empty string as argument, omitting the space before the quoted empty string. (‘If ‘-s foo’ and ‘-sfoo’ and ‘-s ""’ all do what I expect, why on earth would ‘-s""’ not?’)
Some of these even catch me out on occasion, and I know perfectly well why they don't work. Typically I realise my mistake as soon as I see the error message or odd behaviour –
And that's the point, really. The system is just too complicated to manage sensibly. So: if we were designing a shell syntax from scratch, would it be possible to arrange it in such a way that you don't ever have to reason about layers? The ground rule would be: the system is permitted to divide labour between multiple processes or pieces of code, as long as that division is its problem rather than the user's, and (barring the diagnosis of actual bugs) a user never has to keep in mind the divisions between those components just to figure out what a given command will do or what command they need to do a given task.
But the catch is that you aren't allowed to solve the problem by reducing the expressive power of the syntax. If you do, application authors will take up the slack by implementing all the things you didn't, and they'll all do it differently from each other, or not do it at all, and nobody will be better off. (See, for example, the haphazard state of wildcard support in Windows command-
Here's a simple example of the sort of thing I'm thinking of. In a summer job when I was a student, I had occasion to write a tool that presented an internal command line, and I got to make up the syntax used on that command line. Basically on a whim, I set it up as follows: spaces were special (they separated words), double quotes were special (they stopped spaces from separating words), and the only other special things were braces, which were used to enclose special tag names. Initially, when starting to scan a command line and looking for the actual command word, the only thing you were allowed to do with braces was to use them to escape the existing special characters, so that ‘{{}’ and ‘{}}’ and ‘{"}’ translated into the literal characters ‘{’, ‘}’ and ‘"’. But the main command processor only parsed the input line as far as was necessary to find the command name –{-} which was not treated specially; so a user would not have had to draw the mental distinction between characters treated specially by the ‘shell’ and by one particular subcommand, and would have been able to work on the much simpler principle that all special characters were defusable by bracing them. (Or, perhaps better still, the subcommands could all have adopted the principle that everything not braced was treated literally, and invented braced tags to do everything special they might need.)
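A rough sketch of a word splitter following those rules might look like the code below. It is a reconstruction from the description, not the original tool: the function name tokenize is invented, it splits the whole line in one go rather than stopping at the command name, and it assumes that braces stay special inside double quotes and that any tag other than the three escapes is an error.

```python
# The only three brace tags the top-level parser understands.
ESCAPES = {"{{}": "{", "{}}": "}", '{"}': '"'}

def tokenize(line):
    """Split a line into words: spaces separate, quotes group, braces escape."""
    words = []
    current = []
    in_quotes = False
    started = False  # True once the current word has content or quotes
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "{":
            tag = line[i:i + 3]
            if tag not in ESCAPES:
                raise ValueError(f"unknown or unterminated tag at column {i}")
            current.append(ESCAPES[tag])
            started = True
            i += 3
            continue
        if ch == "}":
            raise ValueError(f"stray closing brace at column {i}")
        if ch == '"':
            in_quotes = not in_quotes
            started = True  # so that "" yields an empty word
        elif ch == " " and not in_quotes:
            if started:
                words.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if in_quotes:
        raise ValueError("unterminated double quote")
    if started:
        words.append("".join(current))
    return words

print(tokenize('echo "a b" {{}x{}} {"}'))
```

The point of the design is visible in ESCAPES: there is exactly one way to defuse a special character, and a subcommand wanting more special behaviour would add further braced tags rather than a second escaping scheme.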
Of course it was easy for me to do something like that in a small language supporting only a few commands, particularly when there was no division into subprocesses (each of my commands was just a procedure within the same source file). And even so, my design wasn't perfect (e.g. it would still have suffered from the ‘-s""’ flaw in my above list). And it had very simple functionality by comparison to the full scope of real shell syntax. And, of course, I wouldn't even dare to imagine that any replacement shell syntax design would be realistically adoptable at this stage, what with a huge established source base.
But it illustrates the sort of idea I'm thinking of. So, just as an exercise for curiosity's sake … could the entire edifice of shell design be reworked, in principle, by means such as the above or by totally different means, so as to avoid the need to have users keep track of the multiple layers of code involved?
no subject
What I found amusing, though, was when I learned that except for the interpolation, they are identical - in particular, you can use quotation marks for "blocks" (arguments to things like ‘if’ or ‘proc’ or the like), if there are no interpolatables in the block (or you escape them yourself). Even multi-line ones.
That was simply something I wasn't used to, coming from C-like languages: blocks had braces and strings had quotes, and never the twain shall meet - but in Tcl, essentially everything's a string. (Or everything's a list of strings? At any rate, "blocks" aren't special.)
no subject
A lot of the problem with quotation syntax is that unquoting the string implies a rewrite, which leads to problems with repeated doubling of \\\\ and suchlike. Perhaps quotations should be passed down to the next layer verbatim, so that they are only unquoted at the last possible moment.
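As a toy illustration of the doubling, assume each layer does nothing except turn every backslash into two when escaping (the helper names here are made up, and real layers escape more than just backslashes):

```python
def escape_once(text):
    """One layer of escaping: every backslash becomes two."""
    return text.replace("\\", "\\\\")

def escape_layers(text, layers):
    """Apply escape_once repeatedly, once per layer the text must survive."""
    for _ in range(layers):
        text = escape_once(text)
    return text

# How many backslashes must be typed for one literal backslash to arrive
# intact after passing through n unquoting layers.
for n in range(5):
    print(n, len(escape_layers("\\", n)))
```

The count doubles with every layer, which is why even three layers are already unpleasant to write by hand.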
One way of looking at the problem is that it arises from being "stringly typed". Perhaps a command line should be parsed once, at which point more specific types are inferred for the various parts of the command, and subsequent processing of the command happens in a type-safe manner. So for example, when expanding a glob the resulting list is a list of filenames, not a list of undistinguished command line arguments, so there can be no confusion if one of the files is named -rf.
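A minimal sketch of that idea, with everything in it hypothetical: the Option and Filename types, the parse_once function, the rule that a word containing ‘*’ is a glob, and a fake glob expander standing in for the filesystem.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    text: str

@dataclass(frozen=True)
class Filename:
    text: str

def parse_once(words, expand_glob):
    """Assign a type to each typed word exactly once, at parse time."""
    typed = []
    for word in words:
        if "*" in word:
            # Glob results are filenames by construction, whatever they look like.
            typed.extend(Filename(name) for name in expand_glob(word))
        elif word.startswith("-"):
            typed.append(Option(word))
        else:
            typed.append(Filename(word))
    return typed

def fake_glob(pattern):
    """Stand-in for the filesystem: a directory containing a file named -rf."""
    return ["-rf", "notes.txt"]

args = parse_once(["-i", "*"], fake_glob)
options = [a.text for a in args if isinstance(a, Option)]
files = [a.text for a in args if isinstance(a, Filename)]
print(options)
print(files)
```

Only the ‘-i’ the user typed ends up as an option; the file called ‘-rf’ stays a filename because later stages look at the type and never re-inspect the string.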
no subject
But then how can you tell at which level a \} (or \\\} or \\\\\}, etc.) is to be interpreted? (That is, where the "last possible moment" for unquoting that character sequence is: the bottom-most layer, or somewhere in between?)
no subject
The problem of course is this leads to an un-unixy design where each program has to do unquoting of its arguments if necessary, rather than relying on the shell to handle all metacharacters, and this in turn inevitably leads to incompatibilities.
no subject
It also occurs to me that you mostly don't pass shell commands to programs as a single quoted string, but as the tail of another command, which is kind of reminiscent of the way the URI syntax is defined to allow an entire URI to be used with no additional quoting as a query string.
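To sketch the difference between the two styles, with made-up helper names: wrapping a command as the tail of another one is just list concatenation, whereas wrapping it as a single string (the ‘sh -c’ style) needs a whole extra layer of quoting, done here with Python's shlex.join.

```python
import shlex

def wrap_as_tail(prefix, argv):
    """Prefix style, as with sudo: the inner words pass through untouched."""
    return [prefix] + argv

def wrap_as_string(argv):
    """Single-string style: the inner command must be re-quoted for sh."""
    return ["sh", "-c", shlex.join(argv)]

inner = ["grep", "-e", "a b", "it's.txt"]

print(wrap_as_tail("sudo", inner))
print(wrap_as_string(inner))
```

In the tail form the awkward words ‘a b’ and ‘it's.txt’ need no attention at all; in the string form they pick up quoting that a further wrapper would have to quote again.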