The history of attempts to axiomatize Occam's Razor is far more complicated, and modern approaches far more subtle.

Abstract: Complexity in the Paradox of Simplicity. Jonathan Post, Computer Futures, Inc.; Philip Fellman, University of Southern New Hampshire.

Andrew, the more I reread your 8:57pm post, the more I doubt its essential honesty (that is, the claim that you don't intuitively pick, or, going a step back, formulate, less complicated mechanisms for observed phenomena, and that you don't intuitively do so by factoring in the apparently limited energy available in our local system).

Hopefully A.,

I don't work in settings where I choose between theory A and theory B, or between model A and model B. If I have 2 models I'm considering, I'd rather embed them continuously in a larger model that includes both the originals as special cases. (That's called "continuous model expansion" as opposed to "discrete model averaging.") There's more discussion of this in a couple of the sections in chapter 6 of Bayesian Data Analysis.
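A minimal sketch of what continuous model expansion can look like (a hypothetical example of my own, not code from Bayesian Data Analysis): suppose model A fixes a coefficient at zero and model B lets it vary freely; a bridging model gives that coefficient a prior whose scale tau interpolates between the two.

import numpy as np

# Hypothetical sketch: y ~ N(beta * x, 1), with model A saying beta = 0
# and model B leaving beta unconstrained. The expanded model puts a
# N(0, tau^2) prior on beta: tau -> 0 recovers A, large tau approaches B.
def log_posterior(beta, tau, x, y):
    log_lik = -0.5 * np.sum((y - beta * x) ** 2)       # Gaussian likelihood
    log_prior = -0.5 * (beta / tau) ** 2 - np.log(tau)  # up to constants
    return log_lik + log_prior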

Similarly, to Utilitarian: I'm not really ever estimating the probability of hypotheses. Rather, I use hypotheses to estimate parameters and make predictions. Then I use these predictions to evaluate the model.

Another argument for Occam's razor is that we have to favor some hypotheses over others in our prior probability assignments. For instance, consider the example from before with the sequence 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1. We can easily come up with an uncountably infinite number of possible hypotheses: e.g., that after 8 repetitions of "0,1", the next number in the sequence is any particular element of the set of real numbers. (If we refuse to favor some hypotheses over others, we must assign measure 0 to all of them.) Occam's razor states a principle about which types of hypotheses we ought to favor.
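One way to make that favoring concrete (a toy sketch of my own; the description lengths are made up): weight each describable hypothesis by roughly 2^(-description length), so the simple, countable hypotheses get positive prior mass while the uncountable remainder shares measure zero.

# Illustrative description lengths for a few candidate continuations
hypotheses = {
    "repeat '0,1' forever": 20,
    "after 8 repetitions of '0,1', print 2 forever": 45,
    "after 8 repetitions of '0,1', print 7 forever": 45,
}
weights = {h: 2.0 ** -length for h, length in hypotheses.items()}
total = sum(weights.values())
priors = {h: w / total for h, w in weights.items()}
print(priors)  # the shortest description gets almost all the prior mass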

In any event, Occam's razor may not be essential for the original premise of the post. Suppose we think an alien planet is just as likely to be named Alpha as to be named Beta, and that those aliens are just as likely to have green skin as blue skin. Suppose these events are probabilistically independent of playing ping-pong, riding unicycles, and liking the number 7. Define the event A' as follows: "Person A was abducted by aliens from the planet Alpha; they had green skin, they liked to play ping-pong, they rode around on unicycles, and their favorite number was 7." By the assumptions stated above, P(A') = P(B). But P(A') <= P(A), because A' is a subset of A. So P(B) <= P(A). (Most likely, the inequality will be strict.)
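The step from P(A') = P(B) to P(B) <= P(A) is just the conjunction rule plus the stated symmetries; here is a numeric illustration with made-up probabilities:

p_A = 0.001    # made-up P(abduction + ping-pong + unicycles + number 7)
p_alpha = 0.5  # Alpha vs. Beta, equally likely
p_green = 0.5  # green vs. blue skin, equally likely

p_A_prime = p_A * p_alpha * p_green  # A' adds two independent details to A
p_B = p_A * 0.5 * 0.5                # B's extra details have the same probabilities
assert p_A_prime == p_B and p_A_prime <= p_A  # hence P(B) <= P(A)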

Andrew, your take on my 4:32pm post?

Anders,

I agree that it can be a useful statement to say, for example, that 97% of the variance in the data is explained by only two factors, or to say that certain factors are lost in the noise, or to compare different datasets with regard to how many factors are needed to explain most of the interesting patterns. However, I don't see this in terms of a prior distribution favoring simpler models; I see it as a choice of how to summarize a complicated model (for example, summarizing by the largest and most important factors). I agree completely with you that if a simple model does pretty well, this can be interesting, and a social scientist can look at how these factors change over time. I just completely reject the statement in the original blog entry above that the simpler model is "more probable." I'm much happier working with the possibility of a really complicated model (which would be structured; I'm not talking about estimating 435 parameters with a flat prior) and then using "Occam's razor," if you'd like to call it that, to prepare useful summaries.
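As a concrete version of that kind of summary (simulated data, my own sketch): decompose a complicated dataset and report how much of the variance the top two components carry, without any claim that a two-factor model is "more probable."

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 30
factors = rng.normal(size=(n, 2))         # two big latent factors
loadings = rng.normal(size=(2, p))
X = factors @ loadings + 0.3 * rng.normal(size=(n, p))  # plus noise

X_c = X - X.mean(axis=0)
s = np.linalg.svd(X_c, compute_uv=False)  # singular values
share = (s[:2] ** 2).sum() / (s ** 2).sum()
print(f"top 2 components carry {share:.1%} of the variance")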

HA: Something like that.

I've been thinking that the intuition behind Occam's Razor is probably rooted in entropy and the apparent overall state of the universe. Overall, the universe seems rather homogeneous. Thus, when considering two theories for a phenomenon (a more homogeneous/more entropic model vs. a less homogeneous/less entropic one), all other things being equal, the more homogeneous/more entropic model is statistically more probable. Has anyone else connected Bayesian reasoning, entropy, and the relative homogeneity of the universe in this way as an explanation for the intuition behind Occam's Razor?

Suppose you fit models corresponding to your hypotheses. Most likely the fit of H1 will be worse than that of H2, which will be worse than that of H3, and so on. If you had to choose between these hypotheses, which one would you choose? If only the fit matters, you should choose the one with 435 factors, since it gives the best fit. But it is pretty obviously a much less plausible social science model than saying that there are a few big factors (and then many small ones hidden in the "noise"). That the real state of affairs is much more complex than the hypotheses does not mean that the simple model doesn't tell us something interesting.
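Here is a small simulation of that tradeoff (illustrative only, not the parliament data): in-sample fit improves monotonically as factors are added, while out-of-sample fit bottoms out near the true number of big factors.

import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # only two real factors
train, test = slice(0, 70), slice(70, None)

for k in (1, 2, 5, 20):
    beta, *_ = np.linalg.lstsq(X[train, :k], y[train], rcond=None)
    mse_in = np.mean((y[train] - X[train, :k] @ beta) ** 2)
    mse_out = np.mean((y[test] - X[test, :k] @ beta) ** 2)
    print(k, round(mse_in, 2), round(mse_out, 2))  # in-sample always falls; out-of-sample doesn't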

(BTW, I actually did this kind of analysis of the Swedish parliament, http://www.eudoxa.se/politi... and could show that it was mostly 1-dimensional like British politics, unlike the 4D Norwegian politics http://www.essex.ac.uk/ecpr... and the 2D US politics http://www.citebase.org/abs...)

Kolmogorov complexity applies in social science too: an explanation with many ad hoc details is generally valued less than an explanation based on a few simple, general principles. One reason to value the simpler explanation is that it is less prone to overfitting, and can be expected to work well in the future too (unless it is too vacuous to actually tell us anything, of course).

A lot of people get seduced by complex "realistic" models in climate or neuroscience. The problem is that you seldom learn anything from them. They do things, but we cannot follow what happens or why, and the only way of predicting what will happen when parameters are changed is to run the model anew. That might be close to how reality works, but it does not help us understand what is going on or make robust predictions. Often reducing the model to equivalent but simpler models gives valuable insights.

In the case of the political votes, I think the simple models would do a pretty good job of predicting how representatives vote. That would be useful for a political analyst, and for the sociologist it would be interesting to examine what the components are (and how they change over time). That sounds like a good start for research to me. Starting instead with the most complex assumption (everybody votes for their own special reasons) doesn't get us very far.

Stuart,

If it's an esthetic principle, that's ok. We just have different esthetic senses. You like Japanese rock gardens, I like overstuffed sofas. (Refer to Radford's quote in my first comment above.) Based on the popularity of Occam's Razor, I suspect that Radford and I have a minority preference. Nonetheless I think my preferences have worked well in the many applied problems that I've worked on.

I would argue that the appropriate esthetic principles can depend on what problems you're working on. Many aspects of physics and genetics are inherently discrete (for example, just one or two genes determining something), whereas social and environmental phenomena are typically explained by many factors.

"None of these hypotheses is true. I certainly don't want someone quoting some 800-year-old guy called Occam as a reason for choosing H1 or even H2."

Neither of them is a scientific hypothesis - they are not validated by experiment. You have to look at the data, formulate the hypothesis, then look at the data again, and see if the hypothesis is validated. Then you have a scientific hypothesis.

All that Occam's razor says is that if there are two hypotheses, H1 and H2, which were formulated before seeing the extra data and are equally validated by that extra data, we should choose the simpler (or more elegant) one. It is not a scientific principle, it is an esthetic one, which is kept because it is useful (simpler theories are less error-prone) and has had some success in the past (the universe is in some ways a lot simpler than our intuitions would suggest).

The second that a piece of data comes in that could distinguish between H1 and H2, Occam's razor is blunted and should be discarded. What is often seen as a version of Occam's razor - the idea that theories generally must persist identically through time and through space - is actually an observation, not a principle.

Peter,

I agree with you on your AIDS example. Hypothesis 2 violates various physical laws and models and makes no sense at all.

Things are a little different in social science. For example:

H1: votes in Congress are determined by a single factor (which can be interpreted as position on economic issues)

H2: votes in Congress are determined by 2 factors (economics and foreign policy)

H3: votes in Congress are determined by 3 factors

etc.

None of these hypotheses is true. I certainly don't want someone quoting some 800-year-old guy called Occam as a reason for choosing H1 or even H2.

From a statistical point of view, I think the key distinction is between discrete models (such as your AIDS example) and continuous models. In the latter, many, many factors eventually do make a difference, even if you can't pick them out with the finite data at hand.
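A quick simulation of that continuous-model point (my own illustration): give every one of many predictors a small nonzero coefficient, and with a finite sample almost none of them can be picked out individually, even though all of them "matter."

import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 50
X = rng.normal(size=(n, p))
beta = np.full(p, 0.02)   # every factor matters, a little
y = X @ beta + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
se = 1.0 / np.sqrt(n)     # rough standard error per coefficient
print(np.mean(np.abs(beta_hat) > 2 * se))  # only a small fraction look "significant"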

Andrew, here is an example that might be more familiar to you.

H1: AIDS is caused by a virus.
H2: AIDS is caused by a virus until August 31, 2008, and after that it is caused by a bacterium.

I see that this example is much weaker than my first example though; viruses and bacteria are not black boxes, so we can give theoretical reasons why H2 is a bizarre hypothesis. Because H1 and H2 are not fundamental theories, what really matters is the degree to which H1 and H2 are implied by our fundamental theories, rather than their internal complexity.

Occam's razor applies directly to fundamental physics.

Let me add that H1 will be a shorter computer program in any programming language I can imagine that has not been constructed explicitly so that H2 is shorter. (And note that your English description of H1 is shorter.)

Peter,

Because H1 has lower Kolmogorov complexity. In Python, H1 is:

while True: print(0); print(1)

whereas H2 is:

for i in range(8): print(0); print(1)
while True: print(1)

H1 consists of fewer characters (or tokens). Ergo, it has lower Kolmogorov complexity.
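To make the character count explicit (using the program texts above):

h1 = "while True: print(0); print(1)"
h2 = "for i in range(8): print(0); print(1)\nwhile True: print(1)"
print(len(h1), len(h2))  # H1's program text is the shorter of the two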

Peter,

The situation you describe isn't like any problem I've ever worked on. In the problems I've worked on (mostly in social science and public health), people don't present me with sequences like this.
