At post-pizza last night Ian set me an algorithms problem, because he thought it would be right up my street. The solution I came up with turned out to be overkill for his actual needs, but it was a rather cute use of annotated trees which I hadn't thought about before, so I want to write it down somewhere.
(Background on annotated trees: given any balanced-tree data structure storing a list of items – red-black trees, AVL trees, B-trees, anything else similar that I don't know about – you can annotate the tree by storing some additional data in each node which describes some property of the sublist of items contained in or below that node. Provided the annotation for any individual item can be computed in constant time and the annotation for the concatenation of two sublists can be computed in constant time from the annotations for those sublists, it turns out that you can augment all the usual and not-so-usual tree-modification algorithms – insert, delete, split and join – to maintain these annotations without losing the log-time bounded performance of any operation. I've previously written about the great usefulness of annotating nodes with a simple element count, and some slightly more sophisticated annotations which have useful applications in editor implementation. And I have a real B-tree library implementing all this in working code, which allows the client to specify an arbitrary node annotation by providing functions to compute it, and then permits interesting kinds of lookup based on the annotation.)
Ian's scenario was that he was trying to schedule a bunch of jobs to be done; each job has a known duration and a known deadline by which it must be completed. The usual trick here is to sort the jobs by their deadline, doing the most urgent ones first. But a literal sort algorithm isn't adequate, because Ian's list of jobs is dynamic: at any time, a new job may become known, and need to be added to the list. Sometimes, adding a new job in the middle of the list might render it impossible to meet all the current deadlines – in which case Ian wanted the ‘add job’ function to return failure, so that the client could be warned in advance that their job would not be possible, and could take some alternative action so as not to need it done after all.
This seemed like just the sort of problem which could be solved by an appropriately annotated tree, which was why Ian brought it to me in the first place. And, after a couple of false starts, it turns out that it can.
You store your jobs in a B-tree, sorted by their deadline, as one might expect. You also annotate each node of the tree with two additional pieces of data: (a) the latest deadline of anything in that subtree, and (b) the latest time you would have to start doing the jobs in that subtree in order to get all of them done by their deadlines.
We must show that these are easy to compute for a single job and for the concatenation of two correctly annotated lists of jobs. Clearly both are easy to compute for a single job: the deadline is provided by the user, and the latest safe start time is that minus the job duration.
Now, given sublist A with final deadline DA and latest start time SA, and sublist B with final deadline DB and latest start time SB (and, by the nature of adjacent sublists in a sorted tree, we also know that nothing in B has a deadline before DA), we need to be able to compute the annotations for the concatenated list. Well, the latest deadline for the full list is of course just DB. Now when is the latest start time for the whole lot? Well, the jobs in list A must be finished by time DA (of course), and they must also be finished by time SB, so that we can start doing the jobs in list B. So we compute EDA, the effective deadline for list A, as the earlier of DA and SB; then we adjust our start time SA to take account of the difference between EDA and DA, so that the start time for the whole list is equal to SA + (EDA − DA). That's it.
Given these annotations, we can now do a constant-time query on the current job list whenever we like, to find out when is the latest time we must start doing jobs in order to get them all done. In other words, as Ian put it, we can query the tree at any time to find out how long it's safe to procrastinate for.
So if the latest safe start time for the entire tree is still in the future, we're OK. If it's in the past, we're doomed, and we are inevitably going to miss some deadline or other.
Hence, we have exactly the ‘add job’ operation Ian wanted. Given a new job, we add it to our B-tree (in log time), and then we query the new latest safe start time. If it's before the current time, then it isn't feasible to add the new job, so we remove it from the tree again (still in log time) and return failure. Otherwise, we leave it in the tree and return success. Done!
In fact this was overkill for Ian's actual application, since there's a sufficiently small bound on the number of jobs that it's much easier to do simple stuff with circular buffers and swallow the occasional O(N) reshuffle. But he was curious to know how it would be done efficiently for large job collections, and I think the above approach answers that to at least my satisfaction.