simont | Line comments versus line splicing

Here's a thing I was pondering the other day about lexing.

Suppose you have a language containing two moderately common lexical features: a comment mechanism in which comments are newline-terminated (C++/C99 style //, shell #), and a line-splicing mechanism in which a backslash at the end of a line causes the next line to be glued on to it as if they were one long line. How should these interact?

C++ and sh have different answers. In C++, a backslash at the end of a // comment still line-splices, with the effect that the next line of the source file is considered to be part of the comment. (I've seen this trip up a user doing ASCII art in // comments.) In shell, it's the other way round: a comment renders everything to end-of-line irrelevant, including a trailing backslash. Either of these is defensible in terms of layering – the language specification has multiple layers of syntactic processing, with comments in one and line-splicing in another, and these behaviours drop out as a consequence of which order the layers appear in. (But it's kind of annoying that they do it differently.)
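To make the layering point concrete, here is a toy model in Python. It is only a sketch: the function names are made up for illustration, the comment marker is passed in as a parameter, and it ignores quoting and string literals entirely, so it is not how either a C++ compiler or a real shell works internally. It just runs the same two passes in the two possible orders.

```python
def strip_comment(line, marker):
    """Drop everything from the first comment marker to end of line."""
    return line.split(marker, 1)[0]


def splice_then_strip(text, marker):
    """C++-like order: join backslash-newline pairs first, then drop comments."""
    joined = text.replace("\\\n", "")
    return [strip_comment(line, marker) for line in joined.split("\n")]


def strip_then_splice(text, marker):
    """Shell-like order: drop comments first, then join backslash-newline pairs."""
    stripped = "\n".join(strip_comment(line, marker) for line in text.split("\n"))
    return stripped.replace("\\\n", "").split("\n")


if __name__ == "__main__":
    # A comment that happens to end in a backslash, followed by a second line.
    source = "a = 1 # art: /\\\nb = 2"
    print(splice_then_strip(source, "#"))
    print(strip_then_splice(source, "#"))
```

On that input the first ordering loses the `b = 2` line into the comment, while the second keeps both lines, which is the difference described above.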

Anyway. Leaving aside how existing languages handle this interaction, how would we like to handle it, if we were designing a new syntax with both of these features? For the sake of illustration, I'll assume that the language also has syntactically significant newlines, so that line-splicing is a desirable thing to be attempting in the first place. (I can't really see how it isn't just an obvious misfeature in C/C++.)

A thing that's often annoyed me in make is that if you have a sprawling variable definition with lots of line-splices, you can't comment one particular line:

OBJECTS = one.o \
          two.o \ # putting a comment here doesn't work
          three.o \
          four.o # putting it here doesn't either, arrgh \
          five.o

So, for a start, it would be nice to arrange that at least one of the above syntaxes works. Either would be OK, but I prefer the first, because of the scenario I mentioned above about people doing ASCII art in comments – you might plausibly end a line comment with a backslash when line-splicing was the furthest thing from your mind.

Another credible use case is if you want to comment one of the line-spliced lines out:

OBJECTS = one.o \
          two.o \
#         three.o \
          four.o

(Also, you'd probably want to include a comment saying why it was commented out, which adds extra confusion.)

So I think that leads me to the following rules:

A backslash causes line splicing if it is the last lexical token on a physical line, i.e. it still line-splices even if a comment-to-eol appears after it. (While I'm here, it also shouldn't matter if whitespace appears after it. Syntactically significant whitespace is often unhelpful, but syntactically significant trailing whitespace is especially egregious.)
A backslash appearing at the end of a comment-to-eol is just part of the comment and has no line-splicing effect.
If a line with a line-splicing backslash is followed by one or more lines containing nothing but comments-to-eol (with optional initial whitespace), then those comment-only lines are completely ignored and the subsequent one will be spliced on to the original line instead.

So that permits all of the following usages:

OBJECTS = one.o \
          two.o \ # This object is notable for some reason
          three.o \
# the next two objects must not be reordered
          four_a.o \
          four_b.o \
# end of reordering constraint
          five.o \
#         six.o \ # removed until bug 1234 is fixed
          seven.o \
          eight.o
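For what it's worth, the three rules above can be sketched in a few lines of Python. This is a rough illustration under some loud assumptions: `splice_lines` is a hypothetical name, `#` is the only comment marker, there are no string literals or quoting to worry about, and spliced pieces are simply joined with a single space (so indentation is thrown away).

```python
def splice_lines(text, marker="#"):
    """Turn physical lines into logical lines using the three proposed rules."""
    logical = []       # finished logical lines
    parts = []         # pieces of the logical line currently being built
    in_splice = False  # True if the previous code line ended with a backslash

    for raw in text.split("\n"):
        # Rule 2: everything after the comment marker is discarded,
        # including any backslash at the end of the comment.
        code, found, _comment = raw.partition(marker)
        # Rule 1: whitespace after the backslash does not matter.
        code = code.strip()

        # Rule 3: a comment-only line in the middle of a splice is ignored.
        if in_splice and found and not code:
            continue

        # Rule 1: a backslash that is the last token before the comment
        # (or end of line) requests a splice.
        in_splice = code.endswith("\\")
        if in_splice:
            code = code[:-1].rstrip()

        if code:
            parts.append(code)
        if not in_splice:
            logical.append(" ".join(parts))
            parts = []

    if parts:  # input ended in the middle of a splice
        logical.append(" ".join(parts))
    return logical
```

Fed the `OBJECTS` example above, this sketch produces one logical line listing `one.o` through `eight.o` with `six.o` left out, and a comment that merely ends in a backslash does not swallow the line after it.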

So. What use case have I missed with that analysis, and/or why is my modest proposal going to lead to crawling horrors in some totally different scenario which existing languages don't mess up so badly?


The perversion of defining a macro in the expansion of another macro is orthogonal to the issue of whether or not preprocessor directives can be freely interpolated. The C/C++ preprocessor could have allowed:

  #define MOO(define) #define x QUACK

…but they've prohibited it from working the way you might expect (and hijacked # inside macros for another purpose, which is why I was sick and used "define" as the macro parameter's name (-8 ).

I'm guessing they decided that macros defining macros was a can of worms best left unopened.

At this point I'm remembering Modula-3's pragma syntax: <* … *> to go with the Wirth (* … *) for comments. (How come everyone's happy using [ … ] for arrays in Pascal instead of (. … .), but still uses (* … *) instead of { … } for comments?) Had C used something similar, we could have avoided all this __attribute__ and __declspec nonsense for starters!

If I ever design a language in the C problem space, knowing what I know now, the pragma syntax will be something like:

void f() {
  #pragma #switch(COMPILER_VERSION) {
      RECENT_GCC:
          diagnostic push;
          diagnostic ignored "-Wuninitialized";
          break;
      MSVC:
          warning(push);
          warning(disable: C4700);
          break;
  }

  // Do some stuff...

  #pragma #switch(COMPILER_VERSION) {
      RECENT_GCC: diagnostic pop; break;
      MSVC: warning(pop); break;
  }
}
