So, over the weekend I mentioned having had a cool idea about B-trees (again). For angoel, and anyone else who's interested, here are the details...
In order to explain this, I'll need to set the scene for a bit first.
As most geeks who know me will know, I have a strong interest in the uses of B-trees as in-memory data structures; I think they're good for a lot more interesting things than being on-disk storage for databases, and they can certainly be made to do some much more interesting things than simply storing sorted lists.
Many of my enhancements revolve around the idea of annotating each node in the tree so that it describes a property of all the elements in or below that node. For example, you can annotate each node with the number of elements in or below it, and that allows you to find an element of the list by its numeric position in log time (and once you've done this you can use the tree for storing unsorted lists, sort of like an array but more flexible). Or if your tree elements are measurable in two dimensions, such as words which have an alphabetical order and a length, then you can sort the tree in alphabetical order but annotate each node with the maximum length of all the words in or below it, and this allows you to restrict tree searches to only words longer than a given length. (A good use for this is first-fit memory allocation; you store your free blocks in a red-black tree sorted by address, and annotate with maximum block sizes, and this allows you to efficiently find the lowest-address block of at least a given size.)
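To make the counted-tree trick concrete, here's a minimal sketch (in Python, since no language is specified here; a plain binary tree stands in for a B-tree, and all the names are invented for illustration):

```python
# Minimal sketch of a counted tree: each node carries the size of its
# subtree, so the k-th element can be found without scanning the list.
# (Plain binary tree for brevity; a real B-tree would keep itself balanced,
# making the search genuinely logarithmic.)

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        # annotation: number of elements in or below this node
        self.size = 1 + (left.size if left else 0) \
                      + (right.size if right else 0)

def kth(node, k):
    """Return the k-th element (0-based) in left-to-right order."""
    left_size = node.left.size if node.left else 0
    if k < left_size:
        return kth(node.left, k)
    if k == left_size:
        return node.value
    return kth(node.right, k - left_size - 1)

# the list [10, 20, 30, 40, 50] stored as a tree (in-order traversal)
tree = Node(30, Node(10, None, Node(20)), Node(40, None, Node(50)))
```

Since insertion and deletion only touch the nodes on one root-to-leaf path, the size annotations can be kept up to date at no extra asymptotic cost.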
So I've known about this sort of trickery for a couple of years now, and counted trees in particular are a technology I use in practically every reasonably large program I write for myself. Last year, being a good little well-trained mathematician, I tried to generalise this by finding a single form of node annotation which would include all the tricks I knew about as special cases. What I came up with was the idea of describing each tree element by means of a function (in the mathematical sense), and annotating each node with the composition of the functions for all the elements under it (in order, of course). Hence, in a counted tree, the function for each element is "add one", and the function for a subtree full of elements is the composition of N of these, which is of course "add N". Annotating with maximum size, the function for each element is "take the maximum of X and the input", and when you compose a lot of those you get a function of the same form with a different value of X.
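As a rough sketch of how both tricks fall out of the one generalisation (illustrative Python; the representations are the obvious ones):

```python
# Each element contributes a function, and a node's annotation stands for
# the composition of all the functions below it.  The real requirement is
# that the family of functions is closed under composition, with an O(1)
# representation of each member.
from functools import reduce

# Counted tree: each element is "add 1"; N of them compose to "add N",
# so the stored representation is just the integer N.
def compose_counts(a, b):
    return a + b

# Max-annotated tree: each element is "take max of X and the input";
# compositions stay in the family with X replaced by the largest X seen,
# so again a single number suffices.
def compose_maxes(x, y):
    return max(x, y)

# A node's annotation is the fold of its children's annotations.
counts = [1, 1, 1, 1, 1]             # five elements somewhere below a node
lengths = [3, 7, 2, 9, 4]            # e.g. word lengths below that node
subtree_count = reduce(compose_counts, counts)   # represents "add 5"
subtree_max = reduce(compose_maxes, lengths)     # represents "max(9, input)"
```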
So this was all very well, but like many strong generalisations in maths it leaves you wondering what the use was. It felt like a really powerful idea, but I couldn't actually think of a good reason why you might want to keep a long list of functions in a B-tree, constantly insert or delete functions from the middle of the list, and always know what the composition of the whole lot would come to. I wondered whether it would be useful to store a whole list of matrices in a tree, for example, and always know the product of the whole lot even when bits in the middle were changed and fiddled with; but no application sprang to mind. I thought a particularly cute demonstration example would be storing a list of Rubik's Cube twists in a tree, being able to change stuff in the middle and always knowing the total effect of the whole list on a solved cube; but again, I can't think of any real need to be able to do that. So it was a neat generalisation, but not a terribly helpful one.
The best use of this I'd previously been able to think of was one which makes this sort of B-tree genuinely usable as a data structure for an editor buffer. Editors have always been the most obvious application for my B-trees, since they support insertion, deletion, cut, copy and paste all in log time and never lose the ability to go straight to a given byte position; I have yet to get round to actually doing it, but I want to rewrite my personal hex editor so that it uses these things as its underlying data structure. Now the only thing missing to use this in a text editor is that you'd need to be able to get to a particular line number, not a particular byte number; so I worked out a way to annotate each node with the function describing how the characters below that node turn a (line,column) position into a later (line,column) position. It turns out that there are only three ways this can work: (a) add N to the column position; (b) add N to the column position and round up to the next number congruent to K mod 8; and (c) add N to the line number and set the column position to K. These are the only possible ways in which a fixed sequence of bytes can affect an arbitrary (line,column) pair, and this set of functions is closed under composition. So annotate each node like this, and suddenly you can jump to any (line,column) position in log time. Hence, your editor buffer is simply a counted B-tree storing a list of individual bytes, and no matter what you do to that tree it automatically maintains an efficient means of finding the stuff that should be displayed in the user's rectangular window at any given moment.
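Here's a sketch of those three function shapes and a hand-derived composition law for them (Python; the tab convention assumed here, 0-based columns with strictly advancing stops every 8, is my own choice for illustration):

```python
# The three shapes of (line, column) mapping, as tagged tuples:
#   ('add', n)      -> col += n
#   ('tab', n, k)   -> col = smallest value > col + n with col % 8 == k
#   ('line', n, k)  -> line += n; col = k
# Assumed convention: 0-based columns, tab stops every 8, strictly advancing.

def roundup(x, k):
    """Smallest y > x with y % 8 == k."""
    return x + ((k - x - 1) % 8) + 1

def apply_pos(f, line, col):
    if f[0] == 'add':
        return line, col + f[1]
    if f[0] == 'tab':
        return line, roundup(col + f[1], f[2])
    return line + f[1], f[2]                    # 'line'

def compose_pos(f, g):
    """The mapping for f's text followed by g's text (f applied first)."""
    if g[0] == 'line':                          # g discards the column,
        if f[0] == 'line':                      # but line increments add up
            return ('line', f[1] + g[1], g[2])
        return g
    if f[0] == 'line':                          # column after f is constant,
        _, col = apply_pos(g, 0, f[2])          # so the composite's is too
        return ('line', f[1], col)
    if f[0] == 'add' and g[0] == 'add':
        return ('add', f[1] + g[1])
    if f[0] == 'add':                           # plain chars, then a tab
        return ('tab', f[1] + g[1], g[2])
    if g[0] == 'add':                           # a tab, then plain chars
        return ('tab', f[1] + g[1], (f[2] + g[1]) % 8)
    # two tab-shaped maps: the second roundup lands a fixed distance past
    # the first, because the first's output is fixed mod 8
    d = ((g[2] - f[2] - g[1] - 1) % 8) + 1
    return ('tab', f[1] + g[1] + d, g[2])
```

The closure property is the whole point: composing any two of these yields another of the same three shapes, so a node can summarise arbitrarily much text in constant space.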
Right. Now I've finished setting the scene, and can progress to the idea I had on Saturday.
What suddenly occurred to me, out of the blue, was a means of taking the above sort of B-tree-as-text-editor-buffer, and using further node annotations to perform syntax analysis for a language-sensitive editor. The idea is that there would be some syntax-related node annotations, automatically maintained throughout ordinary editing processes, which were sufficient to allow a small number of log-time searches to answer any pertinent syntax-related question - such as "to what column position should this line be indented?", or "should I highlight this bit of text as being a comment or a string or neither?".
The obvious "naive" way to maintain syntax data throughout an editor buffer is to have some sort of state machine scanning through the buffer doing lexical analysis in much the same way a compiler would, and to store the state of the machine at regular intervals (say, at the start of each line). Then, when the user makes a change half way through the buffer, you restart the lexical analysis from the last stored state before the change, and re-analyse the changed section of the buffer; and you keep going until you reach a stored state which matches the state you've worked out for the new text, at which point you can trust that everything beyond that will already be OK. Of course this is slow, and might potentially go all the way to the end of the buffer - if you inserted a double quote character, then it's entirely possible that everything which was previously unquoted becomes quoted and vice versa, all the way to the end of the buffer. So a single-character change to the buffer can cause syntax analysis changes which take O(N) time to propagate. Of course you can ameliorate this in practice by doing the updating lazily, but there are still bad cases (suppose you've got two views on the same large buffer, one at the start and one at the end, and you make a change in the former and then switch straight to the latter).
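A sketch of that naive scheme, with a deliberately tiny "lexer" whose entire state is one bit, inside a double-quoted string or not (all names invented for illustration):

```python
# Naive scheme: store the lexer state at the start of each line; after an
# edit, rescan forward from the edited line's checkpoint, stopping as soon
# as a recomputed state agrees with the stored one.
# Toy lexer state: True iff we're inside a double-quoted string.

def step(state, ch):
    return not state if ch == '"' else state

def checkpoints(lines):
    """Lexer state at the start of each line, scanning the whole buffer."""
    states, state = [], False
    for line in lines:
        states.append(state)
        for ch in line:
            state = step(state, ch)
    return states

def rescan(lines, states, edited_line):
    """Repair the checkpoints from edited_line onwards, in place."""
    state = states[edited_line]        # state at start of line is unchanged
    for i in range(edited_line, len(lines)):
        states[i] = state
        for ch in lines[i]:
            state = step(state, ch)
        if i + 1 < len(states) and states[i + 1] == state:
            break                      # everything beyond is already right
    return states
```

The worst case is exactly the one described above: insert a lone quote character and the recomputed states never re-converge with the stored ones, so the loop runs to the end of the buffer.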
My idea was to go one better. To allow arbitrary cut-and-pasting of bits of the buffer, you need to store syntax data for any sequence of characters in such a way that it already tells you everything you need to know no matter what the state was just before it. So, if your lexical analyser is a state machine, the data you need to store is the function which maps each state of the machine to the state it would end up in after seeing the given sequence of characters. So these functions for each individual character will simply be columns out of the state machine's transition table; but the annotations on high-level nodes of the tree will be practically arbitrary sets of mappings.
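A sketch with a toy two-state machine (0 = ordinary code, 1 = inside a string; representation and names are mine):

```python
# Each character contributes a "state -> state" mapping (one column of the
# transition table); a node's annotation is the composed mapping for all
# the text beneath it, stored as a tuple indexed by starting state.

IDENTITY = (0, 1)
MAPPINGS = {'"': (1, 0)}                # a quote swaps the two states

def char_mapping(ch):
    return MAPPINGS.get(ch, IDENTITY)

def compose_maps(f, g):
    """Mapping for f's text followed by g's text."""
    return tuple(g[s] for s in f)

def text_mapping(text):
    m = IDENTITY
    for ch in text:
        m = compose_maps(m, char_mapping(ch))
    return m
```

Crucially, compose_maps needs only the two child annotations, never the text itself, so a parent node's annotation is computable from its children's in time proportional to the number of states.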
Once you've done this, the lexical-analysis state at any given point in the buffer is available to you in a single log-time search; you can determine the state mapping for the whole sequence of characters from the start of the buffer to the point you're interested in, and so you can instantly determine what state a machine would be in if it had started from state zero at the beginning of the file. And there's no O(N) update required if you change something at the beginning of the file and immediately want to know what the effect was at the end; all the syntax data for the text in between is stored in such a way that the tree can tell you the answer to that immediately.
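A sketch of that log-time lookup, reusing the same toy two-state machine; as before, a plain binary tree of single characters stands in for the B-tree:

```python
# Each node stores its subtree size plus the state mapping for its whole
# subtree; state_at() walks one root-to-leaf path, folding in only the
# mappings of complete subtrees that lie left of the target position.

IDENTITY, QUOTE = (0, 1), (1, 0)        # toy machine: 0 = code, 1 = string

def compose_maps(f, g):
    return tuple(g[s] for s in f)

class Node:
    def __init__(self, ch, left=None, right=None):
        self.left, self.right = left, right
        self.size = 1 + sum(c.size for c in (left, right) if c)
        self.own = QUOTE if ch == '"' else IDENTITY
        m = left.map if left else IDENTITY
        m = compose_maps(m, self.own)
        if right:
            m = compose_maps(m, right.map)
        self.map = m

def state_at(node, pos, state=0):
    """Lexer state just before the pos-th character (0-based)."""
    while True:
        lsize = node.left.size if node.left else 0
        if pos < lsize:
            node = node.left            # target is in the left subtree
        elif pos == lsize:              # target is this node's own character
            return node.left.map[state] if node.left else state
        else:                           # fold left subtree + own char, go right
            if node.left:
                state = node.left.map[state]
            state = node.own[state]
            pos -= lsize + 1
            node = node.right

# the buffer 'a"b' as a tree: in-order traversal spells out the text
buffer = Node('"', Node('a'), Node('b'))
```

The walk visits one node per level and does a constant amount of work at each, so with a balanced tree the whole query is logarithmic in the buffer size, regardless of any edits made anywhere else.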
That's the lexical level dealt with. After that, you move up to the syntactic level: for each node, you state how many levels of brackets and braces are closed within the text under that node, and how many new ones are opened. (Again, you have to store this separately for each potential starting state of the lexer.) This allows a log-time search to find the nearest open brace above a given point (good for working out the correct indentation), or to find the matching closed brace for a given open brace.
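One way to make that concrete, ignoring the per-lexer-state dimension for brevity: summarise any stretch of text by the pair (levels it closes, levels it opens), which composes in constant time (a sketch with invented names, braces only):

```python
# Summarise a character sequence as (closed, opened): how many enclosing
# brace levels it closes, and how many new ones it leaves open.  These
# pairs compose in O(1), so each tree node can carry one for the text
# beneath it.

def summarise(text):
    closed = opened = 0
    for ch in text:
        if ch == '{':
            opened += 1
        elif ch == '}':
            if opened:
                opened -= 1            # matches a brace opened in this text
            else:
                closed += 1            # closes a level opened outside it
    return closed, opened

def compose_braces(a, b):
    """Summary of a's text followed by b's text."""
    matched = min(a[1], b[0])          # b's closers cancel a's openers
    return a[0] + b[0] - matched, a[1] + b[1] - matched
```

A search for the nearest unmatched open brace above a point then just walks up and left through the tree, using the summaries to skip any subtree whose braces are all internally matched.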
Sounds too good to be true, doesn't it? Well, perhaps it is. After thinking about it for a while, I see a couple of possible practical problems; but I have yet to work out whether they're really practical problems, and whether this idea would actually work.
The first problem is the size of the node annotations, and the time taken to compute them. If the state machine contains ten states, then storing ten four-bit numbers per node isn't too much to ask; but if it contains 64, then suddenly the annotation in each node is 48 bytes, for the lexer alone. Not so good. It's possible that the range of possible mapping functions is not actually N^N, that there are invariants that allow you to store a more compact form of the data; but even thinking about analysing this sort of thing leads on to the second practical problem, which is that if a restartable annotation for syntax analysis can't be automatically derived from simple state-machine and regex specifications of the syntax, then writing a new editor mode for an editor using this scheme would be really really hard, and nobody would do it.
I haven't yet worked out how big these problems are; and as a result I'm a lot less excited about the idea than I was when I thought of it early on Saturday morning. But it still might work; these problems might be solvable, and it might yet be possible to design an editor back end which magically maintains enough syntax information that all the front-end functions such as indenting and highlighting have no hard work left to do. We shall see.