I said N^N because that's the total number of different ways to choose one of N ending states for each of N starting states. The number of bits you'd need to specify one of those ways is the base-2 log of that many, which is more like N*log(N); I intended this to be clear but probably failed :-)
The multiple-byte stuff is a concern, yes; in particular, the number of necessary states in the C state machine gets pretty large as soon as you start considering backslashes escaping closing quotes, and the state when you've just seen a / and may or may not be about to see a * (or vice versa), and that sort of thing. I'm working on a shortcut to avoid having to store all those states explicitly.
Nested comments would be a problem, I agree; in particular, they would make the number of states in the lexical analyser infinite. And one thing I've been so far trying to avoid is having variable-size annotations in each node, because that way lies garbage collection and all sorts of yuck (to say nothing of the fact that if you can't put a constant upper bound on the time required to combine two node annotations, you lose the log-time guarantee on the B-tree operations...).
Still, glad you liked it; I'd hate to have wasted that much typing :-)
Nested comments would be a problem, I agree; in particular, they would make the number of states in the lexical analyser infinite. And one thing I've been so far trying to avoid is having variable-size annotations in each node, because that way lies garbage collection and all sorts of yuck.
Hmm. What exactly is your bracket counter, in that case? Was that the reason you were special casing it?
My bracket counter records, for a given chunk of text, how many levels of brackets/braces/whatever are closed within that text (and hence must presumably have been open at the start of it), and how many new levels are opened before the end of it. I'm not sure of the best way to deal with mismatches (an open paren matching a closing brace), but I suspect the answer is to match any open-thing with any close-thing and leave the top-level code to notice when they don't really match and complain to the user somehow.
The important thing about this is that it operates at the syntactic level, not the lexical level, which means that nothing else is conditional on it. Nothing in the syntax annotation for a chunk of text ever has to say "well, if there are three levels of brackets open at the start of this text then this happens, but if there are four then it's all totally different". The bracket counts themselves are conditional on the starting lexical state, but nothing in turn is conditional on them.
Hence, arbitrary nesting in the lexical analyser is a much worse problem than arbitrary nesting in the syntax, because lots of stuff is conditional on the opening lexer state, so allowing infinitely many distinguishable such states is asking for trouble.
Come to think of it, PostScript is going to have a problem with this, since it supports arbitrarily nested parentheses in strings...
The multiple-byte stuff is a concern, yes; in particular, the number of necessary states in the C state machine gets pretty large as soon as you start considering backslashes escaping closing quotes, and the state when you've just seen a / and may or may not be about to see a * (or vice versa), and that sort of thing. I'm working on a shortcut to avoid having to store all those states explicitly.
Nested comments would be a problem, I agree; in particular, they would make the number of states in the lexical analyser infinite. And one thing I've been so far trying to avoid is having variable-size annotations in each node, because that way lies garbage collection and all sorts of yuck (to say nothing of the fact that if you can't put a constant upper bound on the time required to combine two node annotations, you lose the log-time guarantee on the B-tree operations...).
Still, glad you liked it; I'd hate to have wasted that much typing :-)
Hmm. What exactly is your bracket counter, in that case? Was that the reason you were special casing it?
The important thing about this is that it operates at the syntactic level, not the lexical level, which means that nothing else is conditional on it. Nothing in the syntax annotation for a chunk of text ever has to say "well, if there are three levels of brackets open at the start of this text then this happens, but if there are four then it's all totally different". The bracket counts themselves are conditional on the starting lexical state, but nothing in turn is conditional on them.
Hence, arbitrary nesting in the lexical analyser is a much worse problem than arbitrary nesting in the syntax, because lots of stuff is conditional on the opening lexer state, so allowing infinitely many distinguishable such states is asking for trouble.
Come to think of it, PostScript is going to have a problem with this, since it supports arbitrarily nested parentheses in strings...