Grand unified theory
Rather late in the day compared to many people, I've recently been taking steps toward joining the DVCS generation.
For a year or two I've been an occasional light user of bzr, mostly to hold temporary branches off my main SVN repository. I only really came to understand DVCSes via git, by dint of playing with test repositories and examining the output of git fast-export until I actually understood how its data structure fitted together and could work out everything else by reasoning about that. Having done so, I immediately migrated all my bzr repositories to git, because that kind of understanding is very valuable to me and bzr's documentation seems to place almost no emphasis on imparting it.
At the weekend, though, I actually did find the document which explains bzr's data structure – and it turns out to be much the same as git's. As, I discovered after a brief browse on another website, is the data structure of Mercurial. The user interfaces vary, but all three of these DVCSes have an essentially similar underlying data model.
And, curiously, a thing that struck me about this model is that it's surprisingly similar to something I already know about: Usenet.
- Your fundamental entities – whether articles or commits – are exchanged between sites in a peer-to-peer manner, with no necessary hierarchy forcing it to happen in a particularly organised way. Any given site will probably have only a subset of the entities in existence.
- Each entity must therefore have a universally unique identifier, so as to keep track of it once it's wandered around servers for a bit.
- The unique identifiers are far too long and fiddly to type (though you can cut and paste them), so typically client software also lets you use friendlier numeric identifiers (git is an exception here, admittedly) – but those are local to a particular site, so you have to take care not to refer to them in your article text, or else you'll make no sense to people reading them on a different site.
- Entities are linked together by means of each one referencing those from which it naturally follows on. Since there's nothing stopping multiple entities from citing the same parent, this naturally gives rise to a branching tree layout.
- Nearly all entities are followups to existing ones, which quote a lot of context from the parent entity and add a small contribution of their own.
- Entities, once created, typically cannot be rewritten while keeping the same unique identifier; they can instead be superseded by creating a second version with a different ID, leading to occasional glitches in the thread diagram when somebody else turns out to have already followed up to the original version of the entity.
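The shared model the bullets describe can be sketched as a toy content-addressed store. This is a hedged illustration of the general idea, not any real tool's on-disk format (all names here are invented): each entity's unique ID is a hash over its body, author, and parent IDs, so amending an entity – or superseding an article – necessarily produces a new ID.

```python
import hashlib

def make_entity(body, parents=(), author="a@example.net"):
    """Create an immutable entity (commit or article) with a content-derived ID.

    The ID hashes the author, parent IDs, and body together, so any change --
    including a 'supersede' -- necessarily yields a different ID, while the
    parent references still place it in the same branching tree.
    """
    payload = "\n".join([author, *sorted(parents), body]).encode()
    uid = hashlib.sha1(payload).hexdigest()
    return {"id": uid, "body": body, "parents": tuple(parents), "author": author}

# A tiny thread: a root, a followup, and a superseded version of the followup.
root = make_entity("First post!")
reply = make_entity("Me too.", parents=(root["id"],))
reply_v2 = make_entity("Me too, but fixed.", parents=(root["id"],))

assert reply["id"] != reply_v2["id"]            # superseding changes the ID
assert reply["parents"] == reply_v2["parents"]  # both still follow up to the root
```

Note the glitch described above falls out naturally: anyone who already referenced `reply["id"]` still points at the old version, since nothing about `reply_v2` rewrites their parent links.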
Even some of the fine details match up: git's separate ‘Author’ and ‘Committer’ fields, allowing one person to commit data originally written by another, are reminiscent of Usenet's occasionally distinct ‘From’ and ‘Reply-To’ headers.
Of course, there are some differences too. So the next question is, what can we learn from this comparison? What killer features do DVCSes have from which Usenet could benefit, or vice versa?
- Merges. DVCSes, of course, would be of no use at all if it weren't for the merge commit: a commit which references two or more parent commits, so that two branches of the diverging thread tree recombine into one. This is certainly something which I (and I'd guess others) would occasionally have found useful in a sprawling Usenet discussion. Of course there's nothing stopping you posting an article whose References header cites two posts neither of which is a direct ancestor of the other, but in practice support for gracefully handling this in clients is patchy at best. But the likes of gitk have shown the way: when can we expect Usenet clients as a matter of course to generate thread diagrams which support merging?
- Cancels. In the opposite direction, Usenet has the concept of a cancel message, which you proactively send out to all sites to warn them that something you (or someone else) previously posted should be discounted. I could see uses for that sort of thing in the DVCS world: if a commit introduces a non-obvious security hole, for example, it would be nice to be able to tag it after the fact with a warning marker (perhaps GPG-signed to show you mean it) which would be automatically propagated to everyone who pulled from you. Then there'd be some future commit which included a ‘resolved’ marker for that warning, indicating that it fixed the bug – and then the client software could mark all the revisions in between as unsafe, and automatically track which branches the bug had propagated to but the fix had not.
- Rebasing. Usenetters occasionally supersede a posted article with one only slightly different from it. DVCSes permit this (git commit --amend), but many of them go much further, supporting far more complex transformations of the commit history. The typical example is the rebase, in which an entire strand of development is reconstructed as if it originated from a different parent commit – so all the articles on the strand are reposted with different IDs and references, but (essentially) the same actual content. It can surely only be lack of imagination that has prevented Usenet from embracing this idea wholeheartedly: there's an almost practical use case of correcting your error when you've followed up to the wrong article by mistake (and moving over the followups to the erroneous post too, if you didn't realise fast enough), but much more excitingly we could introduce the ability to transplant all the comments of your local troll so that they follow up to totally different posts and make no sense! Anyone who thinks Usenet just isn't chaotic enough must be licking their lips already over this one.
- AOLers. It can't be too soon to start preparing for the day when major DVCS-hosted software projects get invaded by legions of people from some mass-market ISP which has made it unwisely easy to access them. Expect repositories around the world to be flooded with commits which just add ‘ME TOO!’ to the end of comments in source files, or which fail to compile because all the punctuation is totally wrong. DVCS maintainers should start working on a killfile mechanism now, before it's too late!
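The merge idea above is structurally trivial once you have the parent-reference model: a merge is just an entity with more than one parent, and a merge-aware thread diagram only needs to walk the ancestor DAG. A minimal sketch, with a made-up four-node graph purely for illustration:

```python
# Toy DAG of entities: id -> list of parent ids. A 'merge' is simply an
# entity citing two or more parents, recombining diverged branches of the
# thread tree -- exactly what a References header with two non-ancestral
# message IDs would mean on Usenet.
graph = {
    "root": [],
    "a1": ["root"],          # one branch of the discussion
    "b1": ["root"],          # a diverging branch
    "merge": ["a1", "b1"],   # a followup citing both, recombining them
}

def ancestors(graph, node):
    """All entities reachable from `node` via parent links (excluding itself)."""
    seen, stack = set(), list(graph[node])
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph[cur])
    return seen

# The merge's history contains both branches, which is what a merge-capable
# client (gitk-style) would need in order to draw the recombined thread.
assert ancestors(graph, "merge") == {"root", "a1", "b1"}
```

The same reachability walk would serve the cancel/‘resolved’ idea: mark one node bad, another fixed, and flag every entity that can reach the bad one but not the fix.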
no subject
RSS aggregation irks me as it serves the same function as NNTP with a much less efficient protocol.
no subject
I've been thinking on and off about the peak and decline of text Usenet (http://www.cam.ac.uk/cs/newsserver/newsvolume.html) for a while now. While the easier moderation of blogs and web forums might indeed be part of it, I suspect that actually the greater ease of both creation and discovery is a big part of it too.
Creating a blog or web forum can take as little as seconds; creating a newsgroup in a managed hierarchy takes days at a minimum and can take months in the worst case. Equally, finding web forums is trivial (http://www.google.com/search?q=cycling+forum) and newsgroups don't tend to appear in the same lists. The ease of creation, plus time, is arguably part of the cause of the ease of discovery.
I think these things are, at least by now, primary and the better moderation features of the web secondary, because you don't get to be put off Usenet by the trolls and flamewars if you don't even get to see it in the first place. (Granted that this might have been less true ten years ago when a greater proportion of the online population would have been aware of Usenet.)
no subject
no subject
- http://eagain.net/articles/git-for-computer-scientists/
- http://keithp.com/blogs/Repository_Formats_Matter/
no subject
no subject
You think you're joking, but honestly, I'm surprised it doesn't sound like that now :) It might not be "me too" or "not me", but, e.g., editing brace styles is probably the equivalent for software :)
no subject
I think a world in which unskilled people do devastating things because the bar has been lowered for being able to write software arrived when people started doing security-critical stuff in PHP. Github for ~~lesbians~~ AOLers would merely be another manifestation of the problem. )-8

no subject
Even crossed out, what on earth have lesbians got to do with any of this?!
no subject
no subject
eta: and it was only three months ago, at least if the HTTP last-modified date on the image can be trusted. My mind is going, Dave.
Imagine if we used Usenet posts to build a product!
(Anonymous) 2009-12-10 05:19 pm (UTC)