Statistical personal history

It's now been about nine and a half years since I started a private diary in the form of a set of heavily encrypted text files on my computer. The intention of that diary was to facilitate my learning from experience, by recording any experiences I thought might be useful in future and which I might plausibly forget. Over the years it has proven its worth again and again, for this and other purposes.

When I started it I had no particular intention to confine it to any specific subject area, but it's turned out to be almost entirely full of my love life, not because that's the most important thing in the world to me (it isn't) but because it's by far the most significant aspect of my life which isn't basically sorted.

It occurred to me recently that I seem to write more in that diary when something either bad or difficult is going on in my life, because that's usually when I need to do a lot of thinking (and hence writing). When things become good I record it, but if they stay good for a while I generally don't need to say much about it; for example, at one point there's a nearly complete lack of entries for a year and a half while I was going out with lark_ascending. (Mind you, this isn't universal: there's also a dearth of entries in late 1998, not because my life was good but because I was suffering from RSI at the time…)

So I then wondered: what would happen if I plotted the frequency of my private diary entries against time? Would I see obvious peaks clearly attributable to specific events in my past, or would the highest points turn out to be conjunctions of several things, or would it mostly be random noise, or what?

So I've been having a go at this, on and off, for the past few days. The biggest problem is choosing a granularity at which to break the graph down: too fine and you get a huge number of tiny spikes with no clear pattern, but too coarse and two meaningful spikes merge into one and you start to lose interesting detail. Lacking any particularly clever means of picking a granularity, I eventually resorted to plotting the graph at a wide range of granularities and paging back and forth until I found the most meaningful-looking one. (Which turned out to be a standard deviation of about a month; I wonder if that in itself says something about the scale on which I perceive meaning in my life.)

As it turns out, at that resolution I do indeed see clear peaks which are nearly all attributable to specific incidents (and, given the predominant subject matter, in many cases specific people). There are a couple of exceptions (the second highest peak on the entire diagram, in particular, appears on close inspection to be a group of unrelated minor incidents all occurring around the same time for no obvious reason), but most of the major features on the graph are clearly identifiable.

It's quite tempting to start measuring the relative significance of the various incidents by the relative heights of the peaks, but it turns out that this is a granularity artifact: make the granularity finer and the highest peak divides into smaller ones so that a different peak becomes the winner, but make it coarser and the highest peak becomes shorter and squatter while several smaller peaks in a different area merge into one big one and collectively overtake it. I suppose each peak must be at its highest when the graph granularity is roughly equal to the duration of the incident that caused it.
So to measure the impact of each incident on the diary, what I should probably really be doing is measuring the total area under the graph that the incident contributed, but that's not so easy to read off from the peaks and troughs. If anyone has any useful input on the problem of plotting a usefully informative representation of a data set like this without needing an intelligence-guided choice of parameters, it would be welcome.

In case it's useful to know, I'm currently plotting the graph by replacing each data point with a Gaussian (effectively a convolution, if you consider my original data set as a sum of Dirac deltas) and summing, rather than plotting a conventional histogram with fixed dividing lines between blocks (I was worried that a peak might look very different depending on whether it crossed a dividing line, so I picked a strategy which didn't run that risk); so ‘granularity’ above means choosing the standard deviation of the Gaussian appropriately. I'm vaguely considering the idea of picking the width of the Gaussian for each data point differently, according to some metric related to the surrounding points, but no particularly sensible-sounding idea has come to mind yet.
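For concreteness, here's a minimal sketch of that sum-of-Gaussians plot in Python (NumPy/Matplotlib). The entry dates, the grid spacing and the particular granularities swept over are placeholders, not my real data; in practice the dates would come from however the diary files are stamped.

```python
# Sketch of the 'sum of Gaussians' plot: each diary entry contributes one
# unit-area Gaussian bump, and 'granularity' is the standard deviation of
# those bumps. Entry dates and the swept widths below are placeholders.

import datetime as dt
import numpy as np
import matplotlib.pyplot as plt

def entry_density(entry_days, grid_days, sigma_days):
    """Sum a unit-area Gaussian centred on each entry, evaluated on grid_days.

    entry_days : 1-D array of entry times, in days since some epoch
    grid_days  : 1-D array of times at which to evaluate the curve
    sigma_days : the smoothing 'granularity' (standard deviation, in days)
    """
    # Broadcasting: one row per grid point, one column per entry.
    diffs = grid_days[:, None] - entry_days[None, :]
    bumps = np.exp(-0.5 * (diffs / sigma_days) ** 2)
    bumps /= sigma_days * np.sqrt(2.0 * np.pi)      # normalise each bump to unit area
    return bumps.sum(axis=1)                        # entries per day, smoothed

# Hypothetical input: one date per diary entry.
entry_dates = [dt.date(1998, 3, 1), dt.date(1998, 3, 4), dt.date(2001, 7, 15)]
epoch = min(entry_dates)
entry_days = np.array([(d - epoch).days for d in entry_dates], dtype=float)

grid_days = np.linspace(entry_days.min() - 90.0, entry_days.max() + 90.0, 2000)

# Page through a range of granularities and eyeball which looks most meaningful.
for sigma in (7, 30, 90, 365):   # a week, a month, a quarter, a year
    plt.plot(grid_days, entry_density(entry_days, grid_days, sigma),
             label=f"sigma = {sigma} days")
plt.xlabel(f"days since {epoch}")
plt.ylabel("entries per day (smoothed)")
plt.legend()
plt.show()
```

(For what it's worth, the general technique goes by the name of kernel density estimation, and the per-point-width variant is usually called adaptive or variable-bandwidth KDE; one common choice of ‘metric related to the surrounding points’ is to scale each point's standard deviation by the distance to its k-th nearest neighbour, so that dense clusters of entries give sharp peaks while sparse periods are smoothed broadly.)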