I borrowed Gridlinked. I'll let you know what I thought when I give it back, I expect :-)
2d sorting problem: so you have a bunch of (x,y) points which you want to index in such a way that a log-time query on some arbitrary coordinate pair (x0,y0) can tell you the total number of points in the data structure that have both x < x0 and y < y0. (Well, actually my points all have weights and I want the total weight of the points in that quadrant, but that's usually a trivial modification and this case is no exception.)
First, convert one of your coordinates (let's say x) into a consecutive numeric index, by sorting all the x-coordinates into order and storing a mapping from actual x-coordinates to positions in that order. (Do that however you like; several different strategies let you do that lookup in log time.)
Now, go through your points in order of x-coordinate, and insert each one in turn into a balanced tree sorted by y. That balanced tree has element counts (or total-weight figures) stored in each node, which enables a log-time search to find the total number (or weight) of elements that come before a given y-coordinate. So if I paused during this insertion loop at the point when I'd just inserted all the elements with x < x0, then my tree would be in a state in which I can efficiently query it to find how much stuff has y < y0, which is exactly the answer I want. So if only I could retrieve the state of my balanced tree at every stage in that insertion process, so that I'd be able to do this query for lots of different x0 values, then I'd be happy.
So the trick is: when I do my balanced-tree insertion operation, I never modify an existing node. When I find I want to modify a node, I instead create a copy of it and modify that instead. That means I need to "modify" its parent, to point at the copy instead of the original, which means I create a copy of that too, and so on back up to the root. So before such an insertion I had a tree containing (say) n elements, and afterwards that tree is still there and unchanged but now I'm also holding a new node which is the root of a tree containing one extra element, which just happens to share all but O(log n) of its nodes with the first tree.
So if I do my insertions like that, then after each insertion I end up with a distinct tree root, and at the end of the process all of those roots are still valid. So I store them in an array, and then I can pick out whichever I want to refer to. My data structure is now complete. To perform a query given x0,y0, I first do a log-time lookup in my x list to convert x0 into some numeric index n; then I pick the nth element out of my array of balanced tree roots, which gives me a tree containing precisely the elements with x < x0 sorted by y, and then I can do a log-time search down that tree for y0 while adding up node counts as I go. Success! Log time always.
Downsides are:
since O(log N) nodes are duplicated in each tree insertion, the total structure size is O(N log N) which is more inconveniently large than it sounds (space being generally more expensive than time)
the structure has to be built in advance for a fixed set of data and can't be updated dynamically without losing its efficiency
as far as I can see it's fundamentally limited to two dimensions and can't be extended into even three let alone general n.
But it's good enough for my purposes (and in fact as of today I have it implemented in working code and apparently giving the right answers) in spite of all those.
But it's good enough for my purposes (and in fact as of today I have it implemented in working code and apparently giving the right answers) in spite of all those.
2d sorting problem: so you have a bunch of (x,y) points which you want to index in such a way that a log-time query on some arbitrary coordinate pair (x0,y0) can tell you the total number of points in the data structure that have both x < x0 and y < y0. (Well, actually my points all have weights and I want the total weight of the points in that quadrant, but that's usually a trivial modification and this case is no exception.)
First, convert one of your coordinates (let's say x) into a consecutive numeric index, by sorting all the x-coordinates into order and storing a mapping from actual x-coordinates to positions in that order. (Do that however you like; several different strategies let you do that lookup in log time.)
Now, go through your points in order of x-coordinate, and insert each one in turn into a balanced tree sorted by y. That balanced tree has element counts (or total-weight figures) stored in each node, which enables a log-time search to find the total number (or weight) of elements that come before a given y-coordinate. So if I paused during this insertion loop at the point when I'd just inserted all the elements with x < x0, then my tree would be in a state in which I can efficiently query it to find how much stuff has y < y0, which is exactly the answer I want. So if only I could retrieve the state of my balanced tree at every stage in that insertion process, so that I'd be able to do this query for lots of different x0 values, then I'd be happy.
So the trick is: when I do my balanced-tree insertion operation, I never modify an existing node. When I find I want to modify a node, I instead create a copy of it and modify that instead. That means I need to "modify" its parent, to point at the copy instead of the original, which means I create a copy of that too, and so on back up to the root. So before such an insertion I had a tree containing (say) n elements, and afterwards that tree is still there and unchanged but now I'm also holding a new node which is the root of a tree containing one extra element, which just happens to share all but O(log n) of its nodes with the first tree.
So if I do my insertions like that, then after each insertion I end up with a distinct tree root, and at the end of the process all of those roots are still valid. So I store them in an array, and then I can pick out whichever I want to refer to. My data structure is now complete. To perform a query given x0,y0, I first do a log-time lookup in my x list to convert x0 into some numeric index n; then I pick the nth element out of my array of balanced tree roots, which gives me a tree containing precisely the elements with x < x0 sorted by y, and then I can do a log-time search down that tree for y0 while adding up node counts as I go. Success! Log time always.
Downsides are:
- since O(log N) nodes are duplicated in each tree insertion, the total structure size is O(N log N) which is more inconveniently large than it sounds (space being generally more expensive than time)
- the structure has to be built in advance for a fixed set of data and can't be updated dynamically without losing its efficiency
- as far as I can see it's fundamentally limited to two dimensions and can't be extended into even three let alone general n.
But it's good enough for my purposes (and in fact as of today I have it implemented in working code and apparently giving the right answers) in spite of all those.:) Ah, cool. Thank you.