simont ([personal profile] simont) wrote 2013-06-19 02:02 pm

Line comments versus line splicing

Here's a thing I was pondering the other day about lexing.

Suppose you have a language containing two moderately common lexical features: a comment mechanism in which comments are newline-terminated (C++/C99 style //, shell #), and a line-splicing mechanism in which a backslash at the end of a line causes the next line to be glued on to it as if they were one long line. How should these interact?

C++ and sh have different answers. In C++, a backslash at the end of a // comment still line-splices, with the effect that the next line of the source file is considered to be part of the comment. (I've seen this trip up a user doing ASCII art in // comments.) In shell, it's the other way round: a comment renders everything to end-of-line irrelevant, including a trailing backslash. Either of these is defensible in terms of layering – the language specification has multiple layers of syntactic processing, with comments in one and line-splicing in another, and these behaviours drop out as a consequence of which order the layers appear in. (But it's kind of annoying that they do it differently.)
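
To make the layering point concrete, here is a minimal Python sketch of the two orderings. It is only a toy: the helper names (`strip_comment`, `splice_first`, `comments_first`) are made up for illustration, it ignores quoting and string literals entirely, and it uses the same comment marker for both orderings so that the only difference is which layer runs first.

```python
def strip_comment(line: str, marker: str) -> str:
    """Drop everything from the first comment marker to end of line."""
    return line.split(marker, 1)[0]


def splice_first(text: str, marker: str = "//") -> list[str]:
    """C++-like order: glue backslash-newline pairs, then strip comments."""
    joined = text.replace("\\\n", "")
    return [strip_comment(line, marker) for line in joined.split("\n")]


def comments_first(text: str, marker: str = "//") -> list[str]:
    """Shell-like order: strip comments, then glue backslash-newline pairs."""
    stripped = "\n".join(strip_comment(line, marker) for line in text.split("\n"))
    return stripped.replace("\\\n", "").split("\n")


# A comment that happens to end in a backslash, followed by two real lines.
SOURCE = "a = 1 // ASCII art ends here \\\nb = 2\nc = 3"

print(splice_first(SOURCE))    # the comment swallows "b = 2"
print(comments_first(SOURCE))  # "b = 2" survives as its own line
```

Swapping the two passes is the whole difference; neither function contains any special-case logic for the interaction.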

Anyway. Leaving aside how existing languages handle this interaction, how would we like to handle it, if we were designing a new syntax with both of these features? For the sake of illustration, I'll assume that the language also has syntactically significant newlines, so that line-splicing is a desirable thing to be attempting in the first place. (I can't really see how it isn't just an obvious misfeature in C/C++.)

A thing that's often annoyed me in make is that if you have a sprawling variable definition with lots of line-splices, you can't comment one particular line:

OBJECTS = one.o \
two.o \ # putting a comment here doesn't work
three.o \
four.o # putting it here doesn't either, arrgh \
five.o

So, for a start, it would be nice to arrange that at least one of the above syntaxes works. Either would be OK, but I prefer the first, because of the scenario I mentioned above about people doing ASCII art in comments – you might plausibly end a line comment with a backslash when line-splicing was the furthest thing from your mind.

Another credible use case is if you want to comment one of the line-spliced lines out:

OBJECTS = one.o \
two.o \
# three.o \
four.o

(Also, you'd probably want to include a comment saying why it was commented out, which adds extra confusion.)

So I think that leads me to the following rules:

  • A backslash causes line splicing if it is the last lexical token on a physical line, i.e. it still line-splices even if a comment-to-eol appears after it. (While I'm here, it also shouldn't matter if whitespace appears after it. Syntactically significant whitespace is often unhelpful, but syntactically significant trailing whitespace is especially egregious.)
  • A backslash appearing at the end of a comment-to-eol is just part of the comment and has no line-splicing effect.
  • If a line with a line-splicing backslash is followed by one or more lines containing nothing but comments-to-eol (with optional initial whitespace), then those comment-only lines are completely ignored and the subsequent one will be spliced on to the original line instead.

So that permits all of the following usages:

OBJECTS = one.o \
two.o \ # This object is notable for some reason
three.o \
# the next two objects must not be reordered
four_a.o \
four_b.o \
# end of reordering constraint
five.o \
# six.o \ # removed until bug 1234 is fixed
seven.o \
eight.o
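
As a sanity check, the three rules can be sketched in a few lines of Python. This is a rough illustration rather than a real lexer: the function name `logical_lines` is invented here, `#` is assumed to be the only comment marker, quoting is not handled, and spliced pieces are joined with a single space (an arbitrary choice for the sketch). A blank line inside a run of splices is treated as ending the logical line, which the rules above do not actually specify.

```python
def logical_lines(text):
    """Join physical lines into logical lines using the three proposed rules."""
    result = []
    pieces = []        # parts of the logical line currently being built
    splicing = False   # True if the previous code line ended in a splice
    for physical in text.splitlines():
        # Rule 2: anything after '#', including a trailing backslash,
        # is comment text and is simply discarded here.
        code = physical.split("#", 1)[0].strip()
        # Rule 3: comment-only lines inside a splice run are ignored.
        if splicing and not code and "#" in physical:
            continue
        # Rule 1: a backslash splices if it is the last token once the
        # comment and trailing whitespace have been removed.
        splicing = code.endswith("\\")
        if splicing:
            code = code[:-1].rstrip()
        if code:
            pieces.append(code)
        if not splicing:
            result.append(" ".join(pieces))
            pieces = []
    if pieces:  # input ended in the middle of a splice run
        result.append(" ".join(pieces))
    return result


EXAMPLE = "\n".join([
    "OBJECTS = one.o \\",
    "two.o \\ # This object is notable for some reason",
    "three.o \\",
    "# the next two objects must not be reordered",
    "four_a.o \\",
    "four_b.o \\",
    "# end of reordering constraint",
    "five.o \\",
    "# six.o \\ # removed until bug 1234 is fixed",
    "seven.o \\",
    "eight.o",
])

print(logical_lines(EXAMPLE))
```

Under these assumptions the whole example collapses to one logical line, with six.o left out because its line is a comment.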

So. What use case have I missed with that analysis, and/or why is my modest proposal going to lead to crawling horrors in some totally different scenario which existing languages don't mess up so badly?

[personal profile] gerald_duck 2013-06-19 02:42 pm (UTC)
Hmm.

For clarity, I'd prefer to think of three or more features interacting, not two:
  1. Newline to terminate a comment
  2. Newline for at least one other purpose
  3. A token that can precede a newline to 'escape' it, token+nl turning into nothing

There are then three issues:
  • Precedence
  • Whether parsing transforms any of these syntactic structures into something that affects lower-precedence ones
  • Whether or not we want a single newline to be able to serve multiple syntactic purposes

Having the escaping mechanism at lowest precedence is obviously pointless. So should escaping have higher precedence than all possible uses of a newline, or be lower than some but higher than others? To my way of thinking, the least mind-bending option is to say that escaping has unequivocally higher precedence than any syntax that uses newlines: either you see the escape token at the end of a line, or the newline isn't escaped. End of story.

End of that story, at any rate. There is still the question of how syntaxes that rely on newlines should compose when an unescaped newline occurs. C/C++ say that comment-to-nl has higher precedence than preprocessor directives, and that comment-to-nl is equivalent to newline, so '#define MOO baa//quack' defines MOO to baa. That, I think, is fair enough. Accepting, of course, your point that the syntax of preprocessor directives in C/C++ is an obvious misfeature in the first place. (If they could be freely interpolated, life would be much better in many ways, and the issue of whether comment-to-nl was deemed equivalent to whitespace or newline wouldn't arise. Though that makes comment-to-nl itself feel like a mistake in the language; hohum.)

So far as it goes, I'm fairly happy with C/C++'s rules. They let me write what I need to without tripping up, and when I do see something grotesque (which, naturally, was written by someone else) I can work out what it does. There are worse syntactic problems to worry about.

Looking at your proposed rules, by my understanding you're saying comment-to-nl should be treated as a newline and should have higher precedence than the escape token, and that newline as a statement delimiter should have lower precedence than the escape token. Then your rule 3 throws a curveball.

Picking things apart more carefully, I think the totality of what you're suggesting is:
  • '//' to end of line is tokenised as comment
  • '\' nl? comment+ is elided
…though you don't make explicit what you want this to do:
ten.o    \ # A comment
           # another comment
eleven.o
…I'm guessing you want it to treat ten.o and eleven.o as being on the same line?

While that has some heuristic benefits, it looks pretty clunky when formalised. Maybe a better solution is for shell to have a commenting style that sits within the line, so the problem goes away!

If designing a syntax from scratch, a far neater solution is to set aside a dedicated character, not used for anything else, and say that character to end of line is whitespace. Now your example becomes:

OBJECTS = one.o ♦
          two.o ♦ This object is notable for some reason
          three.o ♦
♦ the next two objects must not be reordered
          four_a.o ♦
          four_b.o ♦
♦ end of reordering constraint
          five.o ♦
♦         six.o ♦ removed until bug 1234 is fixed
          seven.o ♦
          eight.o

…which I find far more tolerable.
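
For comparison, a rough Python sketch of this scheme fits in a single substitution. The function names (`elide_to_eol`, `diamond_lines`) are made up for illustration, ♦ is used only because the example above uses it, and collapsing runs of whitespace afterwards is an extra assumption added to make the output easy to read.

```python
import re


def elide_to_eol(text: str, mark: str = "♦") -> str:
    """Replace the mark, the rest of its line, and the newline with one space."""
    pattern = re.escape(mark) + r"[^\n]*(?:\n|$)"
    return re.sub(pattern, " ", text)


def diamond_lines(text: str, mark: str = "♦") -> list[str]:
    """Return logical lines with internal whitespace collapsed."""
    return [" ".join(line.split()) for line in elide_to_eol(text, mark).split("\n")]


EXAMPLE = "\n".join([
    "OBJECTS = one.o ♦",
    "          two.o ♦ This object is notable for some reason",
    "          three.o ♦",
    "♦ the next two objects must not be reordered",
    "          four_a.o ♦",
    "          four_b.o ♦",
    "♦ end of reordering constraint",
    "          five.o ♦",
    "♦         six.o ♦ removed until bug 1234 is fixed",
    "          seven.o ♦",
    "          eight.o",
])

print(diamond_lines(EXAMPLE))
```

There is no rule about comment-only lines and no notion of a "last token": one character does both jobs, which is the attraction of the scheme.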

Even better, people should heed the moral, clear in hindsight, that newline is a syntactic element like any other and, like any other, you should avoid confusing nesting and interaction of syntactic constructs like the plague!

[personal profile] fanf 2013-06-19 06:52 pm (UTC)
Another option is to put the continuation marker at the start of the next line, and allow it to take effect across multiple preceding comment-only or whitespace-only lines. Eg:
OBJECTS = one.o
\\          two.o # This object is notable for some reason
\\          three.o
# the next two objects must not be reordered
\\          four_a.o
\\          four_b.o
# end of reordering constraint
\\          five.o
#\\         six.o # removed until bug 1234 is fixed
\\          seven.o
\\          eight.o
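
A minimal Python sketch of this leading-marker scheme, under some assumptions: the marker is taken to be the literal two-character `\\` shown in the example, `#` is the only comment marker, quoting is ignored, and the function name `join_leading` is invented for illustration. A marker on the very first code line has nothing to attach to, so the sketch just keeps that line as it is.

```python
def join_leading(text, marker="\\\\"):
    """Join lines that start with the continuation marker onto the previous code line."""
    logical = []
    for physical in text.splitlines():
        code = physical.split("#", 1)[0].strip()
        if not code:
            # Comment-only and whitespace-only lines never end a logical line.
            continue
        if code.startswith(marker) and logical:
            logical[-1] += " " + code[len(marker):].strip()
        else:
            logical.append(code)
    return logical


EXAMPLE = "\n".join([
    "OBJECTS = one.o",
    "\\\\          two.o # This object is notable for some reason",
    "\\\\          three.o",
    "# the next two objects must not be reordered",
    "\\\\          four_a.o",
    "\\\\          four_b.o",
    "# end of reordering constraint",
    "\\\\          five.o",
    "#\\\\         six.o # removed until bug 1234 is fixed",
    "\\\\          seven.o",
    "\\\\          eight.o",
])

print(join_leading(EXAMPLE))
```

Commenting a line out needs no special rule here, because the marker ends up inside the comment along with everything else on that line.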

[identity profile] deliberateblank.livejournal.com 2013-06-20 03:22 am (UTC)
Ah, I see someone has been reading TDWTF. (YIK pretty much everyone has read that article.)

The article in question is perhaps more locally exemplified by:

#include <stdio.h>

#define FOO() \
    printf("A\n"); \
    // printf("B\n"); \
    printf("C\n"); \
    printf("D\n")

int main(int argc, char *argv[])
{
    printf("1\n");
    // foo \
    printf("2\n");
    printf("3\n");

    printf("\n");

    FOO();

    return 0;
}

Which does two things which one might not expect, one more so than the other.

I think your rules fix both of them, but do enable constructs where the start of the line splice is divorced by many comment lines from what it is spliced to, which might not be ideal. (If that makes a semantic difference, then perhaps one could consider that another argument against Python...) Perhaps a requirement that all ignored comment lines *must* also be splicing lines? Dunno.
Edited 2013-06-20 03:23 (UTC)
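
One possible reading of that last suggestion, sketched in Python as a lint-style check rather than a change to the lexer: a comment-only line that sits inside a run of spliced lines is flagged unless it also ends in a backslash. The function name `unmarked_comment_lines` is invented here, `#` is assumed to be the comment marker, and reporting line numbers instead of rejecting the input is just one way the requirement could be enforced.

```python
def unmarked_comment_lines(text):
    """Return 1-based numbers of comment-only lines inside a splice run
    that do not themselves end in a backslash."""
    offenders = []
    in_run = False  # True while the previous code line ended in a splice
    for number, physical in enumerate(text.splitlines(), start=1):
        code = physical.split("#", 1)[0].strip()
        comment_only = not code and "#" in physical
        if in_run and comment_only:
            if not physical.rstrip().endswith("\\"):
                offenders.append(number)
            continue  # a comment-only line does not end the run
        in_run = code.endswith("\\")
    return offenders


print(unmarked_comment_lines("a \\\n# note\nb"))     # comment line lacks a backslash
print(unmarked_comment_lines("a \\\n# note \\\nb"))  # comment line carries one
```

Note that this gives a trailing backslash inside a comment a meaning again, which is in tension with the rule that such a backslash is just part of the comment.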