Statistical personal history
It's now been about nine and a half years since I started a private diary in the form of a set of heavily encrypted text files on my computer. The intention of that diary was to facilitate my learning from experience, by recording any experiences I thought might be useful in future and which I might plausibly forget. Over the years it has proven its worth again and again, for this and other purposes. When I started it I had no particular intention to confine it to any specific subject area, but it's turned out to be almost entirely full of my love life, not because that's the most important thing in the world to me (it isn't) but because it's by far the most significant aspect of my life which isn't basically sorted.
It occurred to me recently that I seem to write more in that diary when something either bad or difficult is going on in my life, because that's usually when I need to do a lot of thinking (and hence writing). When things become good I record it, but if they stay good for a while I generally don't need to say much about it; for example, at one point there's a nearly complete lack of entries for a year and a half while I was going out with lark_ascending. (Mind you, this isn't universal: there's also a dearth of entries in late 1998, not because my life was good but because I was suffering from RSI at the time…) So I then wondered, what would happen if I plotted the frequency of my private diary entries against time? Would I see obvious peaks clearly attributable to specific events in my past, or would the highest points turn out to be conjunctions of several things, or would it mostly be random noise, or what?
So I've been having a go at this, on and off, for the past few days. The biggest problem is choosing a granularity at which to break the graph down: too fine and you get a huge number of tiny spikes with no clear pattern, but too coarse and two meaningful spikes merge into one and you start to lose interesting detail. Lacking any particularly clever means of picking a granularity, I eventually resorted to plotting the graph at a wide range of granularities and paging back and forth until I found the one that looked most meaningful.
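A minimal sketch of that try-several-granularities approach in Python, assuming the entry dates have already been extracted from the diary into a list (the variable names and example dates below are illustrative only, not from the diary itself):

    from datetime import date
    import numpy as np
    import matplotlib.pyplot as plt

    # Dates of diary entries, extracted however the diary format allows.
    entry_dates = [date(1998, 3, 1), date(1998, 3, 4), date(2001, 7, 19)]  # ...and so on

    # Convert to "days since the first entry" so the dates can be binned numerically.
    origin = min(entry_dates)
    days = np.array([(d - origin).days for d in entry_dates])

    # One histogram per candidate bin width (in days); page through them by eye.
    for width in (7, 14, 30, 90, 180):
        bins = np.arange(0, days.max() + width, width)
        counts, edges = np.histogram(days, bins=bins)
        plt.step(edges[:-1], counts, where='post', label='%d-day bins' % width)

    plt.xlabel('days since first entry')
    plt.ylabel('entries per bin')
    plt.legend()
    plt.show()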
As it turns out, at that resolution I do indeed see clear peaks which are nearly all attributable to specific incidents (and, given the predominant subject matter, in many cases specific people). There are a couple of exceptions (the second highest peak on the entire diagram, in particular, appears on close inspection to be a group of unrelated minor incidents all occurring around the same time for no obvious reason), but most of the major features on the graph are clearly identifiable.
It's quite tempting to start measuring the relative significance of the various incidents by the relative heights of the peaks, but it turns out that this is a granularity artifact: dial the granularity down and the highest peak divides into smaller ones and a different peak becomes the winner, but dial it up and the highest peak becomes shorter and squatter while several smaller peaks in a different area merge into one big one and collectively overtake it. I suppose each peak must be at its highest when the graph granularity is roughly equal to the duration of the incident that caused it. So probably what I should really be doing to measure the impact of each incident on the diary would be to measure the overall area under the graph which it caused, but that's not so easy to read off from the peaks and troughs.
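One way to make that area measurement concrete: since the graph is just a count of entries per unit time, the area over a given date range is proportional to the number of entries falling in that range, independent of the bin width. A hedged sketch, with made-up incident boundaries and reusing the entry_dates list from the earlier sketch:

    from datetime import date

    def entries_in_window(entry_dates, start, end):
        """Count diary entries whose date falls within [start, end]."""
        return sum(start <= d <= end for d in entry_dates)

    # Hypothetical comparison of two incidents by total entries rather than peak height.
    impact_a = entries_in_window(entry_dates, date(1999, 2, 1), date(1999, 5, 1))
    impact_b = entries_in_window(entry_dates, date(2003, 9, 1), date(2003, 9, 21))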
If anyone has any useful input on the problem of plotting a usefully informative representation of a data set like this without needing an intelligence-guided choice of granularity, I'd be interested to hear it.
no subject
Anyway yes. Let me think... I would get it to plot an initial histogram, take some statistics out of it (like how much area it's got under it by a certain amount of time), calculate an optimum spacing from that, and set that to be the histogram bin size for the proper graph. This depends on me being able to set the histogram bin size, however.
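For what it's worth, numpy has a couple of standard automatic bin-width rules built in (Freedman-Diaconis, Scott's rule), which may or may not be the statistic intended above; a sketch, assuming the same days-since-first-entry array as the earlier sketch:

    import numpy as np

    # Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3), computed by numpy itself.
    edges = np.histogram_bin_edges(days, bins='fd')
    width = edges[1] - edges[0]          # the derived bin size, in days
    counts, _ = np.histogram(days, bins=edges)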
no subject
This all sounds a lot like some of the chromatographic analysis I've done lots of in chemistry (though not at the level of detail I'm describing here). What you're trying to do is pick a set of peaks that is well represented by the data: you're looking for life events of varying duration and intensity, which generate diary entries around their point of maximum intensity. You can model these as a set of Gaussians, or maybe something with a skew factor. So what you need to do is curve-fitting: fit a set of Gaussians onto your diary-entry-frequency data, and then plot the Gaussians.
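A minimal sketch of that suggestion, using scikit-learn's Gaussian mixture model over the entry dates rather than explicit peak fitting, and assuming the same days-since-first-entry array as the sketches above; the choice of ten components is arbitrary, and is exactly the free parameter discussed in the replies below:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Each diary entry's date (as days since the first entry) is one sample.
    X = days.reshape(-1, 1)
    gmm = GaussianMixture(n_components=10, random_state=0).fit(X)

    # The smoothed curve: mixture density evaluated on a daily grid.
    grid = np.arange(days.max() + 1).reshape(-1, 1)
    density = np.exp(gmm.score_samples(grid))    # relative entry intensity per day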
no subject
The thing that still worries me is, how do you pick how many Gaussians to use? I mean, given as many Gaussians to play with as I have data points, I imagine I'd get my best approximation to the real cumulative frequency graph by arranging one Gaussian centred at the point of each diary entry with a very small variance, and then I'm back where I started. Drop the maximum number of Gaussians and you're forced to get a good fit by matching them to overall features of the graph rather than individual data points, but the number of Gaussians is still an adjustable parameter which trades off overview against detail, so you're back at the same problem of needing a human to judge what tradeoff they really wanted.
The real trouble is, I intuitively feel that there ought to be some combined measure of overview and detail which is maximised (reflecting the idea that you've got a decent amount of both) at some intermediate level of granularity, but every concrete metric I've so far come up with turns out to be monotonic in granularity one way or the other.
no subject
See my other post. Think of it as a predictive model: you're using part of the data to build a model of your mental state; you then test that model by trying to predict the rest of the data. The smoothed curve you generate represents the probability of making an entry on a particular day. If you use too many Gaussians, you're overfitting, so you use cross-validation to check whether you're doing that.
ISTR coming up with a scoring system for guess-the-probability games; there was a log in it somewhere. Alternatively you could use a simpler scoring system where you take the dot product of the smoothed graph (discretised into days) and the raw data (again binned into days).
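A sketch of that cross-validation idea, scoring each candidate number of Gaussians by held-out log-likelihood (the logarithmic scoring rule being the "log in it somewhere"); it assumes the same days-since-first-entry array as the sketches above:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import KFold

    X = days.reshape(-1, 1)
    for k in (2, 5, 10, 20, 40):
        fold_scores = []
        for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
            gmm = GaussianMixture(n_components=k, random_state=0).fit(X[train_idx])
            fold_scores.append(gmm.score(X[test_idx]))   # mean log-likelihood of held-out entries
        print(k, np.mean(fold_scores))                   # pick the k where this peaks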
no subject
I've no suggestions on the maths, though I am intrigued; I think most people have similar patterns but haven't graphed them :) But I will reiterate a comment I think I made before; perhaps the wise thing to do is try to *change* the distribution, so you have a more balanced record of your feelings :)
no subject
I once did a word count vs time plot, and a friend of mine turned it into a nice graphic. Being a flash-and-trousers type, she removed the legends, but hey, it looks nice, and the trend is correct. Or was when the graph was made.
http://www.stray-toaster.co.uk/jpgs/graph.png
A clarity index vs time might be good as well, as maybe you become more (or less) coherent during different moods. Not that I am implying I have ever found your writings incoherent, oh no, but it is a nice (if vague and cuddly) metric. Can I stop digging now?