Damn! I was reading that and thinking: maybe I should suggest using some Gaussians to make the diary entries "fuzzy", and then I scrolled down.
This all sounds a lot like some of the chromatographic analysis I've done lots of in chemistry (though not at the level of detail I'm describing here). Which means that what you're trying to do is pick a set of peaks that represents the data well: you're looking for life events with varying duration and intensity, which generate diary entries around their point of maximum intensity. You can model these as a set of Gaussians, or maybe something with a skew factor. So what you need to do is curve-fitting: fit a set of Gaussians to your diary-entry-frequency data, and then plot the Gaussians.
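For concreteness, here's a rough Python sketch of that curve-fitting step, using scipy.optimize.curve_fit to fit a sum of Gaussians to daily entry counts. The data is synthetic, and the number of peaks, the initial guesses and all the variable names are just illustrative placeholders, not anything from your actual setup:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-in for diary data binned into daily counts: two "life
# events" of different duration and intensity, plus a little noise.
rng = np.random.default_rng(0)
days = np.arange(365, dtype=float)
entry_counts = (
    3.0 * np.exp(-0.5 * ((days - 80) / 15) ** 2)
    + 5.0 * np.exp(-0.5 * ((days - 250) / 25) ** 2)
    + rng.poisson(0.2, size=days.size)
).astype(float)

N_PEAKS = 3  # picked by hand here; see the discussion of this choice below

def gaussian_mixture(t, *params):
    """Sum of Gaussians; params are (amplitude, centre, width) per peak."""
    total = np.zeros_like(t, dtype=float)
    for i in range(0, len(params), 3):
        amp, centre, width = params[i], params[i + 1], params[i + 2]
        total += amp * np.exp(-0.5 * ((t - centre) / width) ** 2)
    return total

# Rough initial guesses: peaks spread evenly over the run, roughly a month wide.
p0 = []
for centre in np.linspace(days[0], days[-1], N_PEAKS):
    p0 += [entry_counts.max(), centre, 30.0]

popt, _ = curve_fit(gaussian_mixture, days, entry_counts, p0=p0, maxfev=20000)
fitted = gaussian_mixture(days, *popt)  # the smoothed curve to plot
```

Plotting fitted alongside entry_counts then gives you the peak picture rather than the raw spikes.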
Also: whatever smoothing you do, the way to see if you've done a good job is to randomly divide your journal entries into two (or more) batches, make a smoothed graph using one batch, and see how well it agrees with the other batch. Then use the second batch to make another smoothed graph and test it on the first batch. Hey presto, you have a numerical estimate of the quality of your smoothing algorithm, which should let you optimise any parameters you might be using.
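Something like this sketch, say, where the smoothing happens to be a Gaussian kernel density estimate and the bandwidth stands in for whatever parameter you're trying to tune; the entry days are synthetic and all the names are mine:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the diary: an array of day numbers, one per entry.
rng = np.random.default_rng(1)
entry_days = np.sort(rng.choice(365, size=200, replace=False)).astype(float)

def holdout_log_likelihood(train, test, bandwidth):
    """Smooth one batch with a Gaussian KDE, then score the held-out batch."""
    kde = gaussian_kde(train, bw_method=bandwidth)
    return np.sum(np.log(kde(test) + 1e-12))

# Randomly divide the entries into two batches.
shuffled = rng.permutation(entry_days)
half = len(shuffled) // 2
batch_a, batch_b = shuffled[:half], shuffled[half:]

# Test each candidate smoothing parameter in both directions and average.
for bw in (0.05, 0.1, 0.3, 0.6):
    score = 0.5 * (holdout_log_likelihood(batch_a, batch_b, bw)
                   + holdout_log_likelihood(batch_b, batch_a, bw))
    print(f"bandwidth {bw}: held-out log-likelihood {score:.1f}")
```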
Hm. Sounds as if this would be better done on the cumulative frequency graph (because then you can conveniently measure the residual as the integral of absolute or rms error over the entire run, without first having to solve the same smoothing problem just to find the curve you're trying to approximate).
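In code that just means accumulating both curves before comparing them; a tiny sketch (argument names are mine, and the one-day bins make the integral a plain sum):

```python
import numpy as np

def cumulative_residuals(raw_counts, smoothed):
    """Compare a smoothed curve to the raw data on the cumulative-frequency graph."""
    cum_data = np.cumsum(raw_counts)
    cum_model = np.cumsum(smoothed)
    abs_residual = np.sum(np.abs(cum_model - cum_data))           # integral of |error| over 1-day bins
    rms_residual = np.sqrt(np.mean((cum_model - cum_data) ** 2))  # rms error
    return abs_residual, rms_residual
```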
The thing that still worries me is: how do you pick how many Gaussians to use? I mean, given as many Gaussians to play with as I have data points, I imagine I'd get my best approximation to the real cumulative frequency graph by arranging one Gaussian, with a very small variance, centred at the point of each diary entry, and then I'm back where I started. Lower the maximum number of Gaussians and you're forced to get a good fit by matching them to overall features of the graph rather than to individual data points. But the number of Gaussians is still an adjustable parameter that trades off overview against detail, so you're back at the same problem of needing a human to judge what tradeoff they really wanted.
The real trouble is, I intuitively feel that there ought to be some combined measure of overview and detail which is maximised (reflecting the idea that you've got a decent amount of both) at some intermediate level of granularity, but every concrete metric I've come up with so far turns out to be monotonic in granularity one way or the other.
See my other post. Think of it as a predictive model: you're using part of the data to build a model of your mental state, and you then test that model by trying to predict the rest of the data. The smoothed curve you generate represents the probability of making an entry on a particular day. If you use too many Gaussians, you're overfitting, so you use cross-validation to check whether you're doing that.
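Concretely, something like the sketch below: fit different numbers of Gaussians to one batch of daily counts and score each fit on the other batch, which is where the overfitting shows up. It reuses the same sum-of-Gaussians shape as the earlier sketch, and the synthetic data, the mean-squared-error score and all the names are just illustrative choices:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_mixture(t, *params):
    """Same sum-of-Gaussians model as in the earlier sketch."""
    total = np.zeros_like(t, dtype=float)
    for i in range(0, len(params), 3):
        amp, centre, width = params[i], params[i + 1], params[i + 2]
        total += amp * np.exp(-0.5 * ((t - centre) / width) ** 2)
    return total

def fit_and_score(train_counts, test_counts, n_peaks):
    """Fit n_peaks Gaussians to one batch and report the error on the other."""
    days = np.arange(len(train_counts), dtype=float)
    p0 = []
    for centre in np.linspace(days[0], days[-1], n_peaks):
        p0 += [max(train_counts.max(), 1.0), centre, 30.0]
    try:
        popt, _ = curve_fit(gaussian_mixture, days, train_counts, p0=p0, maxfev=50000)
    except RuntimeError:
        return float("inf")  # treat a fit that fails to converge as uselessly bad
    model = gaussian_mixture(days, *popt)
    return np.mean((model - test_counts) ** 2)  # held-out mean squared error

# Synthetic daily counts, split into two independent "batches".
rng = np.random.default_rng(2)
days = np.arange(365, dtype=float)
truth = (4 * np.exp(-0.5 * ((days - 90) / 20) ** 2)
         + 6 * np.exp(-0.5 * ((days - 260) / 30) ** 2))
batch_a = rng.poisson(truth / 2).astype(float)
batch_b = rng.poisson(truth / 2).astype(float)

for n in (1, 2, 4, 8):
    err = 0.5 * (fit_and_score(batch_a, batch_b, n)
                 + fit_and_score(batch_b, batch_a, n))
    print(f"{n} Gaussians: held-out MSE {err:.3f}")
```

With too few Gaussians both batches fit badly; with too many, the training fit keeps improving while the held-out error stalls or gets worse, which is the overfitting signal.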
ISTR coming up with a scoring system for guess-the-probability games; there was a log in it somewhere. Alternatively you could come up with a simple scoring system where you take the dot product of the smoothed graph (discretised into days) and the raw data (again binned into days).
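For what it's worth, both of those could look something like this; the function names and the clipping constant are mine, and the log score is essentially a log-likelihood of the raw data under the smoothed curve:

```python
import numpy as np

def log_score(predicted_prob, raw_counts):
    """Log score: sum of log predicted probability over the days that got entries."""
    predicted_prob = np.clip(predicted_prob, 1e-12, 1.0)  # avoid log(0)
    return float(np.sum(raw_counts * np.log(predicted_prob)))

def dot_product_score(smoothed, raw_counts):
    """Simpler score: overlap between the smoothed graph and the binned raw data."""
    return float(np.dot(smoothed, raw_counts))
```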