Ooh. I like that. Yes. That's very nice. I'm glad I asked you to tell us.
Couple of ideas. I think that rather than storing ten four-bit numbers per node and so on, you probably want to use matrices, and look at methods of storing sparce matrices (I presume you meant N*N rather than N^N - at least I can't see why the latter is necessary). That said, functionally, I suspect it'll come out looking similar. In any case, the storage difficulties can be sidestepped by having a sufficiently big bottom branch size, or not annotating the bottom nodes.
Secondly, I don't think that the autogeneration thing would be that hard. However, a thing that that might cause problems is multiple byte stuff, such as #ifdef 0 and #endif or /* and */. You'd have to check for them on deletion, etc.
Hang on. Thinking about that, I'm not sure that a matrix is sufficient. Consider if you considered both nested brackets and nested comments. Now you need two dimensions values which interact in nasty ways (multipally commented out brackets, etc). And potentially more, for more dimensions.
Don't know. But it's the end of my lunch break. So I'll stop thinking about it.
I said N^N because that's the total number of different ways to choose one of N ending states for each of N starting states. The number of bits you'd need to specify one of those ways is the base-2 log of that many, which is more like N*log(N); I intended this to be clear but probably failed :-)
The multiple-byte stuff is a concern, yes; in particular, the number of necessary states in the C state machine gets pretty large as soon as you start considering backslashes escaping closing quotes, and the state when you've just seen a / and may or may not be about to see a * (or vice versa), and that sort of thing. I'm working on a shortcut to avoid having to store all those states explicitly.
Nested comments would be a problem, I agree; in particular, they would make the number of states in the lexical analyser infinite. And one thing I've been so far trying to avoid is having variable-size annotations in each node, because that way lies garbage collection and all sorts of yuck (to say nothing of the fact that if you can't put a constant upper bound on the time required to combine two node annotations, you lose the log-time guarantee on the B-tree operations...).
Still, glad you liked it; I'd hate to have wasted that much typing :-)
Nested comments would be a problem, I agree; in particular, they would make the number of states in the lexical analyser infinite. And one thing I've been so far trying to avoid is having variable-size annotations in each node, because that way lies garbage collection and all sorts of yuck.
Hmm. What exactly is your bracket counter, in that case? Was that the reason you were special casing it?
My bracket counter records, for a given chunk of text, how many levels of brackets/braces/whatever are closed within that text (and hence must presumably have been open at the start of it), and how many new levels are opened before the end of it. I'm not sure of the best way to deal with mismatches (an open paren matching a closing brace), but I suspect the answer is to match any open-thing with any close-thing and leave the top-level code to notice when they don't really match and complain to the user somehow.
The important thing about this is that it operates at the syntactic level, not the lexical level, which means that nothing else is conditional on it. Nothing in the syntax annotation for a chunk of text ever has to say "well, if there are three levels of brackets open at the start of this text then this happens, but if there are four then it's all totally different". The bracket counts themselves are conditional on the starting lexical state, but nothing in turn is conditional on them.
Hence, arbitrary nesting in the lexical analyser is a much worse problem than arbitrary nesting in the syntax, because lots of stuff is conditional on the opening lexer state, so allowing infinitely many distinguishable such states is asking for trouble.
Come to think of it, PostScript is going to have a problem with this, since it supports arbitrarily nested parentheses in strings...
Couple of ideas. I think that rather than storing ten four-bit numbers per node and so on, you probably want to use matrices, and look at methods of storing sparce matrices (I presume you meant N*N rather than N^N - at least I can't see why the latter is necessary). That said, functionally, I suspect it'll come out looking similar. In any case, the storage difficulties can be sidestepped by having a sufficiently big bottom branch size, or not annotating the bottom nodes.
Secondly, I don't think that the autogeneration thing would be that hard. However, a thing that that might cause problems is multiple byte stuff, such as #ifdef 0 and #endif or /* and */. You'd have to check for them on deletion, etc.
Hang on. Thinking about that, I'm not sure that a matrix is sufficient. Consider if you considered both nested brackets and nested comments. Now you need two dimensions values which interact in nasty ways (multipally commented out brackets, etc). And potentially more, for more dimensions.
Don't know. But it's the end of my lunch break. So I'll stop thinking about it.
The multiple-byte stuff is a concern, yes; in particular, the number of necessary states in the C state machine gets pretty large as soon as you start considering backslashes escaping closing quotes, and the state when you've just seen a / and may or may not be about to see a * (or vice versa), and that sort of thing. I'm working on a shortcut to avoid having to store all those states explicitly.
Nested comments would be a problem, I agree; in particular, they would make the number of states in the lexical analyser infinite. And one thing I've been so far trying to avoid is having variable-size annotations in each node, because that way lies garbage collection and all sorts of yuck (to say nothing of the fact that if you can't put a constant upper bound on the time required to combine two node annotations, you lose the log-time guarantee on the B-tree operations...).
Still, glad you liked it; I'd hate to have wasted that much typing :-)
Hmm. What exactly is your bracket counter, in that case? Was that the reason you were special casing it?
The important thing about this is that it operates at the syntactic level, not the lexical level, which means that nothing else is conditional on it. Nothing in the syntax annotation for a chunk of text ever has to say "well, if there are three levels of brackets open at the start of this text then this happens, but if there are four then it's all totally different". The bracket counts themselves are conditional on the starting lexical state, but nothing in turn is conditional on them.
Hence, arbitrary nesting in the lexical analyser is a much worse problem than arbitrary nesting in the syntax, because lots of stuff is conditional on the opening lexer state, so allowing infinitely many distinguishable such states is asking for trouble.
Come to think of it, PostScript is going to have a problem with this, since it supports arbitrarily nested parentheses in strings...