simont

Sun 2006-07-09 14:32
Statistical personal history

It's now been about nine and a half years since I started a private diary in the form of a set of heavily encrypted text files on my computer. The intention of that diary was to facilitate my learning from experience, by recording any experiences I thought might be useful in future and which I might plausibly forget. Over the years it has proven its worth again and again, for this and other purposes. When I started it I had no particular intention to confine it to any specific subject area, but it's turned out to be almost entirely full of my love life, not because that's the most important thing in the world to me (it isn't) but because it's by far the most significant aspect of my life which isn't basically sorted.

It occurred to me recently that I seem to write more in that diary when something either bad or difficult is going on in my life, because that's usually when I need to do a lot of thinking (and hence writing). When things become good I record it, but if they stay good for a while I generally don't need to say much about it; for example, at one point there's a nearly complete lack of entries for a year and a half while I was going out with [livejournal.com profile] lark_ascending. (Mind you, this isn't universal: there's also a dearth of entries in late 1998, not because my life was good but because I was suffering from RSI at the time…) So I then wondered, what would happen if I plotted the frequency of my private diary entries against time? Would I see obvious peaks clearly attributable to specific events in my past, or would the highest points turn out to be conjunctions of several things, or would it mostly be random noise, or what?

So I've been having a go at this, on and off, for the past few days. The biggest problem is choosing a granularity at which to break the graph down: too fine and you get a huge number of tiny spikes with no clear pattern, but too coarse and two meaningful spikes merge into one and you start to lose interesting detail. Lacking any particularly clever means of picking a granularity, I eventually resorted to plotting the graph at a wide range of granularities and paging back and forth until I found the most meaningful-looking one. (Which turned out to be a standard deviation of about a month; I wonder if that in itself says something about the scale on which I perceive meaning in my life.)

As it turns out, at that resolution I do indeed see clear peaks which are nearly all attributable to specific incidents (and, given the predominant subject matter, in many cases specific people). There are a couple of exceptions (the second highest peak on the entire diagram, in particular, appears on close inspection to be a group of unrelated minor incidents all occurring around the same time for no obvious reason), but most of the major features on the graph are clearly identifiable.

It's quite tempting to start measuring the relative significance of the various incidents by the relative heights of the peaks, but it turns out that this is a granularity artifact: dial the granularity down and the highest peak divides into smaller ones and a different peak becomes the winner, but dial it up and the highest peak becomes shorter and squatter while several smaller peaks in a different area merge into one big one and collectively overtake it. I suppose each peak must be at its highest when the graph granularity is roughly equal to the duration of the incident that caused it. So probably what I should really be doing to measure the impact of each incident on the diary would be to measure the overall area under the graph which it caused, but that's not so easy to read off from the peaks and troughs.

If anyone has any useful input on the problem of plotting a usefully informative representation of a data set like this without needing an intelligence-guided choice of parameters, it would be welcome. In case it's useful to know, I'm currently plotting the graph by replacing each data point with a Gaussian (effectively a convolution, if you consider my original data set as a sum of Dirac deltas) and summing, rather than plotting a conventional histogram with fixed dividing lines between blocks (I was worried that a peak might look very different depending on whether it crossed a dividing line, so I picked a strategy which didn't run that risk); so ‘granularity’ means choosing the variance of the Gaussian appropriately. I'm vaguely considering the idea of picking the variance of the Gaussian for each data point differently, according to some metric related to the surrounding points, but no particularly sensible-sounding idea has come to mind yet.
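
For concreteness, here's a minimal sketch of the construction (not my actual script; the function name, the 30-day default and the one-day grid spacing are just illustrative):

    import numpy as np

    def smoothed_frequency(entry_days, sigma=30.0, step=1.0):
        """Replace each diary entry (a time in days since some epoch) with a
        unit-area Gaussian of standard deviation `sigma` days and sum them:
        a convolution of the set of Dirac deltas with a Gaussian kernel."""
        entry_days = np.asarray(entry_days, dtype=float)
        grid = np.arange(entry_days.min() - 4 * sigma,
                         entry_days.max() + 4 * sigma, step)
        # One Gaussian per entry, centred on that entry's date.
        z = (grid[None, :] - entry_days[:, None]) / sigma
        curve = np.exp(-0.5 * z * z).sum(axis=0) / (sigma * np.sqrt(2 * np.pi))
        return grid, curve

    # Paging through granularities is then just a loop over sigma, e.g.:
    #   for sigma in (7, 14, 30, 60, 120):
    #       grid, curve = smoothed_frequency(days, sigma)

Since each Gaussian has unit area, the area under any stretch of the curve is roughly the number of entries in that stretch, whichever sigma you use, which is one way of getting at the 'area caused by an incident' idea above.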

[identity profile] feanelwa.livejournal.com | Sun 2006-07-09 13:42
Hmm. I was dealing with a problem like this the other week: I had a set of images where the useful data I needed to extract was lines, and the stuff I most needed to get rid of was lines too, only stronger. The solution I hit on in the end was to Sobel-filter the images and take the histogram of the result, which had lots of discrete chunks in it: the first chunk has all the lines in, the others have just the bad lines in. I then found the first zero in the histogram and replaced every pixel whose local average in the Sobel-filtered version was bigger than that with the image mean. This did bugger all good, because I just got reconstructions that were a blank space of image mean: all my images contained a lot of pixels at that value, so my strongest convergence point was there and not at the actual answer. Bollocks.

Anyway yes. Let me think... I would get it to plot an initial histogram, take some statistics out of it (like how much area it's got under it by a certain amount of time), calculate an optimum spacing from those, and set that to be the histogram bin size for the proper graph. This depends on me being able to set the histogram bin size, however.
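
Something in that spirit, perhaps (only a sketch: numpy's built-in Freedman-Diaconis rule here is just a stand-in for whatever 'optimum spacing' statistic you'd actually calculate):

    import numpy as np

    def auto_binned_histogram(entry_days):
        """Let a data-driven rule pick the bin width, then bin the entries."""
        edges = np.histogram_bin_edges(entry_days, bins='fd')  # Freedman-Diaconis
        counts, edges = np.histogram(entry_days, bins=edges)
        return counts, edges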
[identity profile] ptc24.livejournal.com | Sun 2006-07-09 14:02
Damn! I was reading that and thinking: maybe I should suggest using some Gaussians to make the diary entries "fuzzy", and then I scrolled down.

This all sounds a lot like some of the chromatographic analysis I've done lots of in chemistry (though not at the level of detail I'm describing here). Which means that what you're trying to do is to pick a set of peaks that represents the data well: you're looking for life events with varying duration and intensity, which generate diary entries around their point of maximum intensity. You can model these as a set of Gaussians, or maybe something with a skew factor. So what you need to do is curve-fitting: fit a set of Gaussians onto your diary-entry-frequency data, and then plot the Gaussians.
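
A rough sketch of that curve-fitting step, using scikit-learn's mixture model as a stand-in (the number of peaks is an assumption you still have to supply, which is the catch discussed further down the thread):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_peaks(entry_days, n_peaks=8):
        """Fit a mixture of Gaussians to the entry dates. Each fitted
        component is a candidate life event: a centre (mean), a duration
        (standard deviation) and an intensity (weight)."""
        X = np.asarray(entry_days, dtype=float).reshape(-1, 1)
        gm = GaussianMixture(n_components=n_peaks).fit(X)
        return gm.means_.ravel(), np.sqrt(gm.covariances_).ravel(), gm.weights_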
[identity profile] ptc24.livejournal.com | Sun 2006-07-09 14:21
Also: whatever smoothing you do, the way to see if you've done a good job is to randomly divide your journal entries into two (or more) batches, make a smoothed graph using one batch, and see how well it agrees with the other batch. Then use the second batch to make another smoothed graph, and test it on the first batch. Hey presto, you have a numerical estimate of the quality of your smoothing algorithm, which should let you optimise any parameters you might be using.
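
In code, something like the following (a sketch: `smoothed_frequency` is the Gaussian-summing routine sketched in the post, and `score` is whatever agreement measure you settle on, e.g. one of the scorers sketched in a later comment):

    import numpy as np

    def cross_validate(entry_days, sigma, score, seed=0):
        """Randomly split the entries in two, smooth each half at the given
        granularity, and score each smoothed curve against the raw entries
        of the *other* half; return the average of the two scores."""
        rng = np.random.default_rng(seed)
        days = np.asarray(entry_days, dtype=float)
        mask = rng.random(len(days)) < 0.5
        a, b = days[mask], days[~mask]
        grid_a, curve_a = smoothed_frequency(a, sigma)
        grid_b, curve_b = smoothed_frequency(b, sigma)
        return (score(grid_a, curve_a, b) + score(grid_b, curve_b, a)) / 2

Optimising the granularity is then just a matter of running this for a range of sigma values and keeping the best-scoring one.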
[personal profile] simont | Sun 2006-07-09 14:53
Hm. Sounds as if this would be better done on the cumulative frequency graph (because then you can conveniently measure the residual as the integral of absolute or rms error over the entire run, without having to solve the same problem to find the curve you're trying to approximate to).
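
Concretely, the residual I have in mind is something like this (a sketch only; `model_cdf(t)` stands for whatever fitted curve's predicted cumulative entry count is being tested):

    import numpy as np

    def cumulative_residual(entry_days, model_cdf):
        """RMS error between the empirical cumulative entry count (a step
        function rising by one at each entry) and the model's predicted
        cumulative count, evaluated at the entry times."""
        t = np.sort(np.asarray(entry_days, dtype=float))
        empirical = np.arange(1, len(t) + 1)
        return np.sqrt(np.mean((model_cdf(t) - empirical) ** 2))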

The thing that still worries me is, how do you pick how many Gaussians to use? I mean, given as many Gaussians to play with as I have data points, I imagine I'd get my best approximation to the real cumulative frequency graph by arranging one Gaussian centred at the point of each diary entry with a very small variance, and then I'm back where I started. Drop the maximum number of Gaussians and you're forced to get a good fit by matching them to overall features of the graph rather than individual data points, but the number of Gaussians is still an adjustable parameter which trades off overview against detail, so you're back at the same problem of needing a human to judge what tradeoff they really wanted.

The real trouble is, I intuitively feel that there ought to be some combined measure of overview and detail which is maximised (reflecting the idea that you've got a decent amount of both) at some interim level of granularity, but every concrete metric I've so far come up with turns out to be monotonic in granularity one way or the other.
[identity profile] ptc24.livejournal.com | Sun 2006-07-09 15:10
how do you pick how many Gaussians to use?

See my other post. Think of it as a predictive model: you're using part of the data to build a model of your mental state, and you then test that model by trying to predict the rest of the data. The smoothed curve you generate represents the probability of making an entry on a particular day. If you use too many Gaussians, you're overfitting, so you use cross-validation to check whether you're doing that.

ISTR coming up with a scoring system for guess-the-probability games; there was a log in it somewhere. Alternatively you could come up with a simple scoring system where you take the dot product of the smoothed graph (discretised into days) and the raw data (again binned into days).
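
For what it's worth, a sketch of both scorers, treating the smoothed curve as a rate of entries per day (the Poisson-style form of the log score is my guess at what I half-remember, not a definitive rule):

    import numpy as np

    def log_score(grid, curve, held_out_days):
        """Poisson-style score: reward a high predicted rate at each held-out
        entry, penalise the total number of entries the curve predicts."""
        rates = np.interp(held_out_days, grid, curve)
        total = curve.sum() * (grid[1] - grid[0])   # integral over the run
        return np.log(rates + 1e-12).sum() - total  # epsilon avoids log(0)

    def dot_product_score(grid, curve, held_out_days):
        """The simpler alternative: sum the smoothed curve's value at each
        held-out entry (a dot product with the binned raw data)."""
        return np.interp(held_out_days, grid, curve).sum()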
[identity profile] cartesiandaemon.livejournal.com | Sun 2006-07-09 14:05
The other reason why a love life can come to dominate a private diary is that it's often harder to talk to friends about, because they're likely to know the people involved.

I've no suggestions on the maths, though I am intrigued; I think most people have similar patterns but haven't graphed them :) But I will reiterate a comment I think I made before: perhaps the wise thing to do is to try to *change* the distribution, so you have a more balanced record of your feelings :)
[personal profile] aldabra | Sun 2006-07-09 15:16
You're plotting frequency of entries rather than size of entries?
[personal profile] simont | Sun 2006-07-09 22:11
So far, yes. I might try size at some point and see if it comes out looking noticeably different.
[identity profile] ex-lark-asc.livejournal.com | Mon 2006-07-10 16:42
I'm pleased to hear I'm not universal :)
[personal profile] simont | Mon 2006-07-10 18:28
If it's any help, the RSI dip doesn't stand out nearly as much as you do :-)
[identity profile] ex-lark-asc.livejournal.com | Tue 2006-07-11 10:08
You really know how to make a girl feel good about herself, don't you ;)
[identity profile] mwk.livejournal.com | Mon 2006-07-10 18:27
Wouldn't it also be useful to do a syllable count, word count, or some other textual-analysis thing per post over time? I mean, things may be going well, and the posts more infrequent (or, erm, less frequent. Not sure which of those constructs I like better. Are they the same? I wouldn't be convinced.), but do they increase in length/complication? (A rough sketch of the word-count version is at the end of this comment.)

I once did a word count vs time, and a friend of mine turned it into a nice graphic. Being a flash-and-trousers type, she removed the legends, but hey, it looks nice, and the trend is correct. Or was when the graph was made.

http://www.stray-toaster.co.uk/jpgs/graph.png

A clarity index vs time might be good as well, as maybe you become more (or less) coherent during different moods. Not that I am implying I have ever found your writings incoherent, oh no, but it is a nice (if vague and cuddly) metric. Can I stop digging now?
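
The word-count sketch, assuming the entries can be read off as (date, text) pairs (the names here are made up):

    from collections import Counter

    def words_per_month(entries):
        """Total word count of diary entries in each calendar month.
        `entries` is an iterable of (date, text) pairs, where `date` is a
        datetime.date (or anything with .year and .month)."""
        totals = Counter()
        for date, text in entries:
            totals[(date.year, date.month)] += len(text.split())
        return sorted(totals.items())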