simont

Wed 2011-03-16 11:36
Layer-free shell syntax

Here's a question I've been pondering for a while, to which I don't have any good answer.

I often find myself answering questions from Unix users who are having a little trouble working out how to get some complicated piece of shell syntax to work. The impression I'm left with is that a major source of confusion is the fact that the POSIX shell syntax as a whole – defined as the entire sequence of steps that get from a single input string (typed on the command line or read from a script) to the decisions of individual programs about what to do – is composed of a great many layers, and it doesn't take all that unusual a situation before it becomes important to understand what happens in which layer in order to debug your problem.

Examples of gotchas arising from this, many of which you need to develop quite a deep understanding to sort out, include the following (a few are made concrete in the sketch just after the list):

  • searching for a string starting with a minus sign using a command along the lines of ‘grep \-10 file.txt’. (‘The backslash causes other characters to be treated literally rather than specially, so why not this one?’)
  • expecting ‘~’ and ‘~user’ to still be expanded after a colon (‘PATH=/here:~/there:$PATH’).
  • getting subtly wrong results when adding a prefix to an existing command line, if the latter had redirections. (E.g. prefixing ‘sudo’ or ‘gdb --args’, and finding that in the former case the file is not opened with root privilege and in the latter case all of gdb's interactive output gets lost.)
  • almost anything involving multiple layers of escaping. (The typical example being regular expressions, where you have to first escape special characters to stop them being special to the shell, and then escape them a second time to stop them being special to the application. And don't forget to shell-escape the extra characters you used for the second job, of course.)
  • when a one-character option takes an argument either with or without a separating space, and you want to pass an empty string as argument, omitting the space before the quoted empty string. (‘If ‘-s foo’ and ‘-sfoo’ and ‘-s ""’ all do what I expect, why on earth would ‘-s""’ not?’)
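
To make the first few of those concrete, here is roughly what is going on and the usual work-around in each case (a sketch only: the file and command names are invented, and exactly how the broken versions misbehave varies between implementations):

  # 1. The shell consumes the backslash, so grep still receives ‘-10’ and parses
  #    it as an option; you have to tell grep itself where the options stop:
  grep -e -10 file.txt        # or: grep -- -10 file.txt

  # 2. Tilde expansion after a colon is not something to rely on; spelling out
  #    $HOME says what you mean unambiguously:
  PATH="/here:$HOME/there:$PATH"

  # 3. A redirection is performed by the shell that parses it, not by sudo's
  #    child process, so push the whole command down into a shell run as root:
  sudo sh -c 'somecommand >>/etc/somefile'

  # 4. Two layers of escaping: the backslashes are for grep's regexp engine, and
  #    the single quotes get those backslashes past the shell intact:
  grep '\.\*' file.txt        # searches for the literal two characters ‘.*’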

Some of these even catch me out on occasion, and I know perfectly well why they don't work. Typically I realise my mistake as soon as I see the error message or odd behaviour – but it's telling that I nonetheless still get it wrong some of the time, because the real rules of how this stuff works are just too cumbersome to think through from first principles for every command I type, and so I use a simpler intuitive model most of the time and only resort to applying my full understanding when that can't cope.

And that's the point, really. The system is just too complicated to manage sensibly. So: if we were designing a shell syntax from scratch, would it be possible to arrange it in such a way that you don't ever have to reason about layers? The ground rule would be: the system is permitted to divide labour between multiple processes or pieces of code, as long as that division is its problem rather than the user's, and (barring the diagnosis of actual bugs) a user never has to keep in mind the divisions between those components just to figure out what a given command will do or what command they need to do a given task.

But the catch is that you aren't allowed to solve the problem by reducing the expressive power of the syntax. If you do, application authors will take up the slack by implementing all the things you didn't, and they'll all do it differently from each other, or not do it at all, and nobody will be better off. (See, for example, the haphazard state of wildcard support in Windows command-line programs.)

Here's a simple example of the sort of thing I'm thinking of. In a summer job when I was a student, I had occasion to write a tool that presented an internal command line, and I got to make up the syntax used on that command line. Basically on a whim, I set it up as follows: spaces were special (they separated words), double quotes were special (they stopped spaces from separating words), and the only other special things were braces, which were used to enclose special tag names. Initially, when starting to scan a command line and looking for the actual command word, the only thing you were allowed to do with braces was to use them to escape the existing special characters, so that {{} and {}} and {"} translated into the literal characters {, } and ". But the main command processor only parsed the input line as far as was necessary to find the command name – and parsing the rest of the line was deferred until that command had had a chance to add extra sequences to the list of braced escapes. So, for instance, if a command had decided to treat the minus sign specially, that command could also have added support for an escaped version {-} which was not treated specially; so a user would not have had to draw the mental distinction between characters treated specially by the ‘shell’ and by one particular subcommand, and would have been able to work on the much simpler principle that all special characters were defusable by bracing them. (Or, perhaps better still, the subcommands could all have adopted the principle that everything not braced was treated literally, and invented braced tags to do everything special they might need.)

Of course it was easy for me to do something like that in a small language supporting only a few commands, particularly when there was no division into subprocesses (each of my commands was just a procedure within the same source file). And even so, my design wasn't perfect (e.g. it would still have suffered from the ‘-s""’ flaw in my above list). And it had very simple functionality by comparison to the full scope of real shell syntax. And, of course, I wouldn't even dare to imagine that any replacement shell syntax design would be realistically adoptable at this stage, what with a huge established source base.

But it illustrates the sort of idea I'm thinking of. So, just as an exercise for curiosity's sake … could the entire edifice of shell design be reworked, in principle, by means such as the above or by totally different means, so as to avoid the need to have users keep track of the multiple layers of code involved?

[xpost |http://simont.livejournal.com/231126.html]

[personal profile] pseudomonas | Wed 2011-03-16 11:44
Having more prohibited characters in file names would help with the escaping thing. Obviously, that's filesystem specs more than shell specs, but I still have no clue how a filename with a newline in it would behave in something like xargs.

[personal profile] simont | Wed 2011-03-16 11:47
Wrongly, of course :-) For this purpose the GNU version of xargs has the -0 switch which makes it expect to read filenames from standard input separated by NUL, and GNU find has a corresponding option -print0 which outputs in that format. But of course you can only use those if you know you'll never have to port to anything non-GNU, and nothing other than find will output in that format, and in any case the whole system ought ideally to have been set up so that it worked reliably without a user having to remember to add the special and obscure "work reliably" flag.
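
For what it's worth, the usual incantation looks something like this (the paths are invented):

  # NUL-separated names survive the spaces and newlines that would otherwise be
  # taken as argument separators:
  find /var/log -name '*.old' -print0 | xargs -0 rm --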

[personal profile] j4 | Wed 2011-03-16 12:06
"nothing other than find will output in that format"

grep -Z?

I suspect I'm missing the point though... :-} not entirely sure I understand about layers (though it's fascinating & I will re-read to see if it's clearer the second time!).

[personal profile] simont | Wed 2011-03-16 12:19
Ooh, I didn't know about grep -Z! Thank you.

(But apart from find -print0, grep -Z, the aqueducts and public sanitation, what have GNU ever done for us? :-)

[personal profile] ewx | Wed 2011-03-16 22:22
The BSD find/xargs have -print0/-0 as well, and perl has a -0 option (though of course you could set $/ manually). I think there are other things that this format has spread to also…

[identity profile] pjc50.livejournal.com | Wed 2011-03-16 13:15
Once you have:
- the ability to wrap some commands up in a larger context
- an escaping system

Then you have the layering problem, of working out how to escape the inner stuff in a way that won't be misinterpreted. Tcl has this problem. Most programming languages don't as they have a sharp distinction between string literals and everything else, and a high reliance on brackets to explicitly limit the scope you need to read to understand an action unit.

[personal profile] simont | Wed 2011-03-16 13:21
The other nice feature of brackets, of course, is that they nest. Half the escaping problem arises because the 'treat this literally' syntax can't be nested: if shells used the PostScript approach of enclosing literal strings in parens instead of quotes, then you could trivially quote any piece of shell syntax you liked (by induction, that piece would contain matched parens if it quoted anything in turn) without having to worry about escaping the escapes.
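
(The familiar concrete case of escaping the escapes, with an invented host name; every extra layer of shell doubles the backslashes:)

  # Run locally, the pattern grep sees is \. (a literal dot):
  grep '\.' log.txt

  # Run via ssh, the remote shell strips one more layer, so the backslash has to
  # be doubled to survive the extra trip (or the whole command quoted again):
  ssh somehost grep '\\.' log.txt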

In fact, doesn't Tcl do that? And also GNU m4, IIRC.

[identity profile] bjh21.me.uk | Wed 2011-03-16 13:45
Yes. Tcl's original quoting characters are { and }. These days it has " as well.

[personal profile] pne | Wed 2011-03-16 14:00
Though they're not interchangeable, of course, due to the different interpolation behaviour.

What I found amusing, though, was when I learned that except for the interpolation, they are identical - in particular, you can use quotation marks for "blocks" (arguments to things like if or proc or the like), if there are no interpolatables in the block (or you escape them yourself). Even multi-line ones.

That was simply something I wasn't used to, coming from C-like languages: blocks had braces and strings had quotes, and never the twain shall meet - but in Tcl, essentially everything's a string. (Or everything's a list of strings? At any rate, "blocks" aren't special.)

[personal profile] fanf | Wed 2011-03-16 14:08
And in fact Tcl's syntax is remarkably shallow, especially compared with the shell.

A lot of the problem with quotation syntax is that unquoting the string implies a rewrite, which leads to problems with repeated doubling of \\\\ and suchlike. Perhaps quotations should be passed down to the next layer verbatim, so that they are only unquoted at the last possible moment.

One way of looking at the problem is that it arises from being "stringly typed". Perhaps a command line should be parsed once, at which point more specific types are inferred for the various parts of the command, and subsequent processing of the command happens in a type-safe manner. So for example, when expanding a glob the resulting list is a list of filenames, not a list of undistinguished command line arguments, so there can be no confusion if one of the files is named -rf.
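
(The standard demonstration of that confusion today, with invented file names; not something to run in a directory you care about:)

  # In a directory containing a file named ‘-rf’, the glob expansion reaches rm
  # as undistinguished words, and ‘-rf’ gets parsed as options:
  touch ./-rf precious
  rm *              # ‘-rf’ turns into options and ‘precious’ is deleted without complaint
  rm ./*            # the usual manual work-around: every expanded name looks like a path, so none is mistaken for an option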

[personal profile] pne | Wed 2011-03-16 14:20
"A lot of the problem with quotation syntax is that unquoting the string implies a rewrite, which leads to problems with repeated doubling of \\\\ and suchlike. Perhaps quotations should be passed down to the next layer verbatim, so that they are only unquoted at the last possible moment."

But then how can you tell at which level a \} (or \\\} or \\\\\}, etc.) is to be interpreted? (That is, where the "last possible moment" for unquoting that character sequence is: the bottom-most layer, or somewhere in between?)

[personal profile] fanf | Wed 2011-03-16 14:30
If it isn't the bottom-most it isn't the last possible moment.

The problem of course is this leads to an un-unixy design where each program has to do unquoting of its arguments if necessary, rather than relying on the shell to handle all metacharacters, and this in turn inevitably leads to incompatibilities.

[identity profile] bjh21.me.uk | Wed 2011-03-16 15:57
It's not really Unixy, but POSIX already has a kind of shared dequoting system in the form of getopt(), which handles the quoting of operands using --. Of course, this leads to precisely the kind of incompatibilities you refer to.
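
For example (the file name is made up):

  # getopt()-style parsers stop option processing at ‘--’:
  rm -- -rf                   # removes a file literally named ‘-rf’
  # but nothing obliges a program to use getopt(); echo, for one, just echoes it:
  echo -- -n                  # prints ‘-- -n’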

[identity profile] bjh21.me.uk | Wed 2011-03-16 14:22
I'd been pondering the "only unquote once, as late as possible" rule, since that's effectively what URIs do -- there's one quoting scheme, and effectively each layer only treats non-quoted characters as special and passes quoted ones on to the next layer down. This only works because the layers don't conflict over their special characters, but I wonder if you could combine this with Tclish nestable quotes to get something useful.

It also occurs to me that you mostly don't pass shell commands to programs as a single quoted string, but as the tail of another command, which is kind of reminiscent of the way the URI syntax is defined to allow an entire URI to be used with no additional quoting as a query string.

[personal profile] jack | Wed 2011-03-16 16:09
I'm embarrassed I don't use a shell often enough to really say, but off the top of my head:

- Avoid confusion between layers and bypass escaping entirely. The paradigm for "being a metacharacter" is not "being one of this subset" or "following a certain character" but is "entered from the keyboard in a different way, e.g. by holding a modifier key" and "is displayed in a different colour, different width, or other non-text way".

- Alternatively, a step down would be to have ONE metacharacter that's rarely used by programs (perhaps even a special unicode character?) that always precedes a metacharacter. This makes simple commands longer, but long commands saner and at this point may be a worthwhile tradeoff. (Like using $var in perl -- it's more verbose, but clearer, which perl REALLY needs).

- The first suggestion could even be implemented in terms of the second suggestion, if the shell does everything in #1 but stores it internally like #2. That way, you CAN edit it raw if you have to.

- Programs accept arguments from the shell not in one undifferentiated sequential glob of text, but through different routes which implicitly determine what's a filename, what's an option, and what's general text. You could escape options similarly to metacharacters using the techniques in 1 and 2, and the shell could provide keyboard commands to specify which was which, but perhaps normally what you type would just go into the appropriate sort. (Perhaps syntax-highlighted so you can see when something's wrong.) You might go even further than the shell does and specify a fixed relationship for when program options have parameters, and the eventual program receives input in some simple nested arrays or something, notionally like (opt 1 (filename params) (other params)) (opt 2 (filename params) (other params)) (opt 3) (flag 1) (general filename input). [Although some shells sort of do this already?]

[personal profile] pseudomonas | Wed 2011-03-16 18:01
Quotewise, I like the perlish way that you can use any character to delimit q{foo} and qq!bar! and so on.

[personal profile] jack | Fri 2011-03-18 00:50
What I was trying to say in the pub, but perhaps needed to be written down instead, is that many of our ideas are quite far-reaching, but perhaps the idea of having _one_ unusual escape character with the shell, with or without special keys/graphics to indicate it, would be a comparatively superficial change to an existing shell, and might be worth trying as an experiment -- I've never implemented a shell in any way, and probably won't try this one, but someone more confident might well be able to do so without too extensive a modification of an existing one...?

[identity profile] ewx.livejournal.com | Wed 2011-03-16 19:15
This is the kind of thing that ends up in s-expressions, isn’t it?

[identity profile] atheorist.livejournal.com | Wed 2011-03-16 22:29
Tcl has a standard, simple way of handling layers, similar to your internal command line.

MetaML and other staged computation languages can even type-check multi-leveled structures, and say that they're guaranteed to evaluate without syntax errors.