It's an odd kind of feeling to actually achieve one of your personal Holy Grails.
This week, in among being ill, working and so on, I managed to find time to write a piece of code that solves a problem I've been wanting to see solved since 1994.
In my first year at university, I decided to write a hex editor. There were plenty around already, and indeed I'd already written one a few years earlier, but I wanted to write another one because I didn't know of any with an insert mode.
Many common applications for a hex editor (executable files, for example) never need an insert mode, since practically the whole file has lots of fixed cross-references to byte offsets within it, and changing the distance between any two things is liable to blow the whole thing up. So if you're thinking of this sort of application, the obvious data structure for a hex editor is simply a big array holding the entire file, which makes it easy to jump to a given byte position.
To implement an insert mode, of course, I could see that I'd have to do something entirely different, but wasn't sure what. I knew that I'd have to be able to efficiently find a given byte position, efficiently insert or delete characters, and ideally (for cut and paste purposes) efficiently link and unlink arbitrarily big chunks from the middle of the data. Some of the required operations are supported efficiently by arrays; others are supported efficiently by linked lists. It didn't seem obvious how to arrange for all of them to be properly efficient at once.
My stopgap measure, at the time, was to implement a pragmatic structure I called an NCA (for ‘Non-Contiguous Array’) which was basically a linked list of array chunks with a fixed maximum size; although the operation of finding a given byte position was therefore still O(N) in principle, it had a very small constant of proportionality so it didn't become a problem in practice. I wrote my hex editor on top of this structure, and decided that until I could do better, it would just have to do.
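
To make the idea concrete, here's a rough sketch of that kind of structure in Python. The names and the chunk size are invented for illustration; the original implementation needn't have looked like this.

```python
# Sketch of a 'non-contiguous array': a linked list of bounded-size chunks.
# Finding a byte is O(N) in the number of chunks, but with a small constant;
# an insertion only ever rewrites one small chunk.

CHUNK_MAX = 4096   # arbitrary fixed maximum chunk size

class Chunk:
    def __init__(self, data):
        self.data = bytearray(data)
        self.next = None

class NCA:
    def __init__(self, data=b""):
        self.head = Chunk(data)

    def byte_at(self, pos):
        # Walk the chunk list until we reach the chunk containing 'pos'.
        node = self.head
        while node is not None:
            if pos < len(node.data):
                return node.data[pos]
            pos -= len(node.data)
            node = node.next
        raise IndexError(pos)

    def insert(self, pos, data):
        # Find the containing chunk, splice the new bytes into it, and
        # split the chunk up if it has grown past CHUNK_MAX.
        node = self.head
        while pos > len(node.data):
            pos -= len(node.data)
            node = node.next
        node.data[pos:pos] = data
        while len(node.data) > CHUNK_MAX:
            tail = Chunk(node.data[CHUNK_MAX:])
            tail.next, node.next = node.next, tail
            del node.data[CHUNK_MAX:]
            node = tail
```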
Balanced tree structures (B-trees, red-black trees, AVL trees) are an obvious area of interest for problems like this; they have the general ability to do lots of things in guaranteed O(log N) time. And it occurred to me a few years back that if you were to annotate every node in such a tree with the total number of elements in or below that node, then you could correctly maintain those counts without losing the log-time nature of any of the standard tree operations, and this would give you the ability to do a new kind of tree search which found the Nth tree element in log time, given any N.
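
Here's a small sketch of that counted-tree search, using a plain binary tree rather than a B-tree to keep it short. The node and function names are made up for the example, and a real version would also keep the counts up to date through insertions, deletions and rebalancing.

```python
def size(node):
    # Number of elements in the subtree rooted at 'node'.
    return node.count if node is not None else 0

class Node:
    def __init__(self, element, left=None, right=None):
        self.element = element
        self.left = left
        self.right = right
        # Annotation: total number of elements in or below this node.
        self.count = 1 + size(left) + size(right)

def nth(node, n):
    """Find the n-th element (0-based) by descending one path: O(height)."""
    left = size(node.left)
    if n < left:
        return nth(node.left, n)
    if n == left:
        return node.element
    return nth(node.right, n - left - 1)

# The in-order sequence of this tree is a, b, c, d, e; element 3 is 'd'.
tree = Node("c", Node("b", Node("a")), Node("e", Node("d")))
assert nth(tree, 3) == "d"
```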
In a conventional sorted tree this is of only limited use – it allows you to efficiently track the median or other order statistics of a constantly changing data set, but there aren't that many applications I can think of for this ability – but then I realised that once you had these counts in place, you could abandon the idea of the tree having a sorting order. Suddenly you have a data structure which can store an unordered list of items, in such a way that you can insert or delete single items from the list in log time and you can find the item at a particular numeric position in log time. This looked like precisely the thing I'd been after.
It wasn't there yet, though. Cut and paste was the remaining headache; linked lists, and my original ‘NCA’ implementation in particular, support the ability not merely to insert a single item in the middle of a list, but to insert an enormous chunk just as easily. Can you do this with balanced trees?
It turns out that you can: it's possible to find a log-time algorithm which will split a balanced tree down the middle at a point of your choice and give you two valid trees as output, and it's also possible to find a log-time algorithm which will join two balanced trees together such that everything in one tree appears before everything in the other. I figured these algorithms out myself for B-trees; I later discovered that Knuth volume 3 shows a sketch of similar algorithms for AVL trees (and he knew about the counted-tree trick as well). Given the split and join algorithms, it now becomes easy to link in a large chunk (split your original tree at the insertion point, then join the two halves on to each side of the new chunk) or to extract a large chunk (split the tree in two places, take out the middle bit, join the outer two back together) given only the numeric offsets of the positions at which you want to do this.
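
The sketch below (reusing `Node` and `size` from the previous example) shows the shape of these operations. It does no rebalancing, so it only illustrates the structure; the real B-tree versions do the same splits and joins while keeping the tree balanced, which is what makes them genuinely log-time.

```python
def split(node, k):
    """Split a tree into two trees: the first k elements and the rest."""
    if node is None:
        return None, None
    if k <= size(node.left):
        l, r = split(node.left, k)
        return l, Node(node.element, r, node.right)
    l, r = split(node.right, k - size(node.left) - 1)
    return Node(node.element, node.left, l), r

def join(a, b):
    """Concatenate: every element of 'a' comes before every element of 'b'."""
    if a is None:
        return b
    if b is None:
        return a
    return Node(a.element, a.left, join(a.right, b))

def link_chunk(tree, pos, chunk):
    """Insert a whole chunk (itself a tree) at element position 'pos'."""
    left, right = split(tree, pos)
    return join(join(left, chunk), right)

def unlink_chunk(tree, start, end):
    """Remove elements [start, end); return (remaining tree, removed chunk)."""
    left, rest = split(tree, start)
    middle, right = split(rest, end - start)
    return join(left, right), middle
```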
This solution was now uniformly superior to my previous one: a data structure of this type would have been a better and more scalable foundation than my original ‘NCA’ code. However, I wasn't finished yet; I was now starting to see some of the real possibilities of annotating nodes in tree structures, so I reckoned I could do even better.
One useful idea was this: instead of storing individual bytes in my B-tree, I could store blocks of bytes (not unlike the blocks in the linked list in the original NCA). The advantage of this is that I could replace some of those blocks with a new type of element: one which is simply a placeholder indicating ‘now copy N bytes from position P in the original file F’. This would save me the hassle of loading the entire file into memory when my hex editor started up: instead, it would simply read from the disk as necessary to display the parts of the file you were paging through. As you gradually modify the file you're editing, the initial single placeholder block gets split into smaller ones and pieces of it are replaced by literal-data blocks; if you cut and paste around the file, then you can end up with a lot of placeholder blocks indicating a patchwork of pieces of the original file. Eventually you save the modified file, and in order to write it out the program has to seek back and forth reading the original file and interleaving bits of it with bits of literal data.
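
As a sketch of what those two kinds of element might look like (the class names and fields here are invented for illustration): opening a file then amounts to building a tree containing a single placeholder block that covers the whole file.

```python
# Two hypothetical element types for the tree: literal bytes held in
# memory, and a placeholder meaning "length bytes starting at file_offset
# in the original file", which touches the disk only when actually read.

class LiteralBlock:
    def __init__(self, data):
        self.data = bytes(data)
        self.length = len(self.data)

    def read(self, offset, n):
        return self.data[offset:offset + n]

class FileBlock:
    def __init__(self, f, file_offset, length):
        self.f = f                      # open file object for the original file
        self.file_offset = file_offset  # where this block starts in that file
        self.length = length            # how many bytes it covers

    def read(self, offset, n):
        # Lazy: seek to the relevant part of the original file and read
        # only the bytes being asked for.
        self.f.seek(self.file_offset + offset)
        return self.f.read(min(n, self.length - offset))
```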
In order to make this work, I had to do something slightly more subtle than storing a count of elements in each tree node: I had to store a count of bytes, or in other words the sum of the length field of each (literal or placeholder) block in the tree. This would allow me to do a single log-time search down the tree starting with an exact byte position, and have the search return a data block and a sub-offset within that block.
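
Concretely, the earlier counted-node sketch adapts like this (again with invented names, and with blocks assumed to carry a `.length` field as above): each node records the total byte length of its subtree, and a single descent turns a byte position into a block plus an offset within that block.

```python
def total(node):
    # Total number of bytes stored in the subtree rooted at 'node'.
    return node.nbytes if node is not None else 0

class ByteNode:
    def __init__(self, block, left=None, right=None):
        self.block = block
        self.left = left
        self.right = right
        # Annotation: sum of the lengths of every block in or below this node.
        self.nbytes = block.length + total(left) + total(right)

def locate(node, pos):
    """Return (block, offset_within_block) for byte position 'pos'."""
    left_bytes = total(node.left)
    if pos < left_bytes:
        return locate(node.left, pos)
    pos -= left_bytes
    if pos < node.block.length:
        return node.block, pos
    return locate(node.right, pos - node.block.length)
```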
Also, I still wasn't happy with the cut and paste capabilities. It's all very well to be able to unlink a large chunk of data from your tree and link it in again somewhere else, but cut and paste just isn't that restricted. Often you want to paste in more than one copy; and even when you don't, the program doesn't know, the first time you hit Paste, that you aren't going to want to hit it again in a moment. So it has to spend a lot of time copying large chunks of tree, and that's an O(N) job no matter how clever your data structure is.
Or is it?
Out of all the stuff I've described, this is probably the bit I'm most proud of. What you can do is to put a reference count in each tree node, counting the number of other tree nodes (or tree header structures) which link to that node. Normally reference counts are used for garbage collection, but in this case that's not what's on my mind: I'm going to use them to implement copy on write.
Throughout the entire B-tree utility module, in absolutely every tree operation, you always check the reference count on a node before you attempt to alter that node; and if that reference count is greater than one, you don't alter the node itself. Instead you create a new node that's a clone of the old one; you use this node in place of the old one (and therefore decrement the reference count in the old node); and you increment the reference count in each child of the node (because there's now an extra node pointing at them).
The effect of this is that it's a trivial – not merely log-time but actually constant-time – operation to clone an arbitrarily large section of tree. You just construct a new tree header structure, point its root pointer at the same node that's the root of the input tree, and increment that node's reference count. Then you have two trees containing exactly the same data; but as soon as you try to modify one of them, the reference counting mechanism causes a load of nodes to be duplicated, and now you have two trees which mostly share their nodes, but in which there's one path down the tree in which they differ. So the semantics from the user point of view are precisely as if you'd made two independent copies (since operations on one tree never affect the other), but you've avoided an O(N) operation to make the copy, and you still haven't destroyed the log-time performance of any of the existing tree operations.
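
A toy version of that discipline might look like the following; the node layout is invented for illustration, and in real code the `writable` step would be applied inside every tree operation before it touches a node.

```python
class RNode:
    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = list(children)
        self.refcount = 0        # how many parents / tree headers point at us

class Tree:
    def __init__(self, root):
        self.root = root
        if root is not None:
            root.refcount += 1

def clone_tree(tree):
    # Constant-time 'copy': the new header just shares the old root.
    return Tree(tree.root)

def writable(node):
    """Return a node that may safely be modified in place.  If anything
    else still points at 'node', clone it first (copy on write)."""
    if node.refcount <= 1:
        return node
    copy = RNode(node.payload, node.children)
    for child in copy.children:
        child.refcount += 1      # the clone now also points at the children
    node.refcount -= 1           # the caller is about to stop using the original
    return copy

def set_root_payload(tree, payload):
    # Example modification: make sure the root is exclusively ours first.
    node = writable(tree.root)
    if node is not tree.root:
        node.refcount += 1       # the header now points at the private copy
        tree.root = node
    node.payload = payload

# Cloning is O(1); the trees only diverge, node by node, as one is modified.
a = Tree(RNode("original"))
b = clone_tree(a)
set_root_payload(b, "changed")
assert a.root.payload == "original" and b.root.payload == "changed"
```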
So, now I have what must surely be the ultimate data structure for a hex editor. In summary:
- A B-tree containing an unsorted list (by means of having an element count in each node, allowing numeric lookups in place of the usual sort-based lookup).
- Elements of this tree represent blocks of bytes, either literally or as placeholders representing substrings of the original file.
- Each tree node is annotated with the total length of all elements in or below that node, enabling a search to any byte position.
- Split and join algorithms are implemented on this B-tree, enabling efficient linking and unlinking of large chunks.
- Reference counts are used to implement copy-on-write semantics, enabling arbitrarily large chunks to be cloned in constant time.
And as a result:
- Finding a particular byte position in the file can be done in log time.
- Inserting or deleting individual bytes can be done in log time.
- Any cut, copy or paste operation can be done in log time, no matter how enormous the piece of file you're cutting or pasting.
- Opening the file to begin with is a constant-time operation.
- The only operations which still need to be O(N) are searching through the file, and saving the modified file to disk.
I've had most of this in my mind for some years now, but it's only this week that I finally got round to actually implementing it, and verifying that it works. The result is a hex editor with a fully functional insert mode, and with the ability to lazily load the file so you can edit entire CD images without either taking years or using up all the machine's memory, and these two abilities can work together in any combination you like.
I hope to be making this utility available on the web soon, for anyone who might find it useful. For the moment, though, I'm just looking back with satisfaction on a ten-year quest in which I actually found everything I was originally looking for, and more. Those don't happen often.