Are More Complicated Revelations Less Probable?

Consider two possible situations, A and B. In situation A, we come across a person (call him "A") who makes the following claim: "I was abducted by aliens from the planet Alpha; they had green skin." In situation B, we come across a different person (call him "B") who tells us, "I was abducted by aliens from the planet Beta; they had blue skin, they liked to play ping-pong, they rode around on unicycles, and their favorite number was 7." In either situation, we are likely to assign low subjective probability to the abduction claim that we hear. But should we assign higher subjective probability to the claim in one situation than in the other?

Mindful of Occam’s razor, and careful to avoid the type of reasoning that leads to the conjunction fallacy, we might agree that A’s claim is, in itself, more probable, because it is less specific. However, we have to condition our probability assessment on the evidence that A or B actually made his claim. While B’s claim is less intrinsically likely, the hypothesis that B’s claim is true has strong explanatory power to account for why B made the specific statements he did. Thus, in the end it may not be so obvious whether we should believe A’s claim more in situation A than we believe B’s claim in situation B.

To be concrete, let A be the event that A’s claim is true, B be the event that B’s claim is true, C be the fact that A made the claims he did, and D be the fact that B made the claims he did. We can agree that P(B) < P(A); for definiteness, say P(B) = 0.001*P(A). However, the relevant comparison is between P(A|C) and P(B|D). Bayes’ theorem says

P(A|C) = P(A)*P(C|A) / [P(A)*P(C|A) + P(~A)*P(C|~A)];
P(B|D) = P(B)*P(D|B) / [P(B)*P(D|B) + P(~B)*P(D|~B)] = 0.001*P(A)*P(D|B) / [0.001*P(A)*P(D|B) + P(~B)*P(D|~B)].

If either story is true, it’s fairly likely that the person would tell it to us the way it happened; for convenience, assume P(C|A) = P(D|B) = 1. Also assume that P(A) and P(B) are small enough that we can make the approximation P(~A) = P(~B) = 1 in our formulas. We then have

P(A|C) = P(A) / [P(A) + P(C|~A)];
P(B|D) = 0.001*P(A) / [0.001*P(A) + P(D|~B)].

Now, we can probably agree that P(D|~B) < P(C|~A). This is because, if the person wasn’t abducted, it’s less likely that he would give the exact details that B gave than the more general account that A gave. (I don’t mean to suggest that people who claim to be abducted are likely to give short accounts. I mean, rather, that the probability of giving any particular highly detailed account is less than the probability of giving any particular less detailed account.) If we decide that P(D|~B) < 0.001*P(C|~A), then P(B|D) > P(A|C).
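
As a rough numerical check of the comparison above, here is a minimal Python sketch. The specific values of P(A), P(C|~A), and P(D|~B) are hypothetical, chosen only to land on either side of the 0.001 threshold; nothing in the argument fixes them. The function name posterior is likewise just illustrative.

```python
def posterior(prior, p_claim_if_true, p_claim_if_false):
    """Bayes' theorem: probability the story is true, given that it was told."""
    numerator = prior * p_claim_if_true
    return numerator / (numerator + (1.0 - prior) * p_claim_if_false)


# Hypothetical inputs (assumptions for illustration only).
p_a = 1e-9                # P(A): prior that A's short story is true
p_b = 0.001 * p_a         # P(B): prior that B's detailed story is true
p_c_given_not_a = 1e-6    # P(C|~A): A tells his story although it is false
p_d_given_not_b = 1e-10   # P(D|~B): below 0.001 * P(C|~A) = 1e-9

# As in the text, assume P(C|A) = P(D|B) = 1.
post_a = posterior(p_a, 1.0, p_c_given_not_a)
post_b = posterior(p_b, 1.0, p_d_given_not_b)

print("P(A|C) =", post_a)
print("P(B|D) =", post_b)
print("detailed story more credible:", post_b > post_a)
```

With these inputs the detailed story comes out ahead. Raising P(D|~B) above 0.001*P(C|~A), for example to 1e-8, reverses the ordering.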

There are some cases in which it may be appropriate automatically to give lower probability to more specific claims, even in light of the above reasoning: e.g., in the case of futurists who predict elaborately detailed scenarios. In this case, unless we have reason to think they’ve come back in time from the future, the equivalent of what I called P(D|B) above is likely to be low. That is, even if the futurist’s claims were true, this would not make it very likely that the futurist would successfully predict all of them, since–unlike person A or B, who saw the aliens–the futurist hasn’t viewed firsthand exactly how things will turn out.

  • I don’t understand Occam’s Razor as a general principle. I can see that it might apply in some settings (such as some areas of physics) but not in others (such as social science). My favorite quote on this (see also here) comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104:

    Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

  • Tom

    I think the point of Occam’s razor is that you start with your simple ‘vanilla’ model, then add to it as evidence comes in. Like an artist starting with guidelines, then filling in details and adding colour. How else would you do it?

  • Vladimir Nesov

    And if B wasn’t abducted, the fact that he tells that he was suggests that he wants to lie convincingly and so gives a more detailed account. No silver bullet…

  • The logic sounds correct to me, though I haven’t checked the math line by line.

  • Tom,

    I agree with you on building from simple models as a modeling strategy–we discuss this a lot in our new book–I just don’t see Occam’s razor as a principle for choosing a model.

  • Andrew:

    Let’s say you see the following string, and you are asked to predict what comes next:

    0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1

    Here are two hypotheses which are both consistent with the data:

    H1. 0,1 repeats indefinitely.
    H2. 0,1 repeats 8 times, and then the rest is all ones.

    You probably prefer H1. Why?

  • Peter,

    The situation you describe isn’t like any problem I’ve ever worked on. In the problems I’ve worked on (mostly in social science and public health), people don’t present me with sequences like this.

  • Peter,

    Because H1 has lower Kolmogorov complexity. In Python, H1 is

    while True: print(0); print(1)

    whereas H2 is

    for i in range(8): print(0); print(1)
    while True: print(1)

    H1 consists of fewer characters (or tokens).
    Ergo, it has lower Kolmogorov complexity.
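
One way to make the length comparison concrete is to store runnable Python versions of the two programs as strings and compare their lengths. This is only a sketch: program length in one particular language gives an upper bound on Kolmogorov complexity (up to a language-dependent constant), not the quantity itself. The names H1_SOURCE, H2_SOURCE, h1, and h2 are illustrative.

```python
from itertools import islice

# Source text of the two candidate programs, kept as strings so that
# their lengths can be compared without running the infinite loops.
H1_SOURCE = "while True: print(0); print(1)"
H2_SOURCE = "for i in range(8): print(0); print(1)\nwhile True: print(1)"


def h1():
    """H1: the pair 0,1 repeats indefinitely."""
    while True:
        yield 0
        yield 1


def h2():
    """H2: the pair 0,1 repeats 8 times, then the rest is all ones."""
    for _ in range(8):
        yield 0
        yield 1
    while True:
        yield 1


# Both hypotheses reproduce the 16 observed symbols...
observed_h1 = list(islice(h1(), 16))
observed_h2 = list(islice(h2(), 16))
# ...but they disagree about the 17th.
next_h1 = next(islice(h1(), 16, None))
next_h2 = next(islice(h2(), 16, None))

print("same on observed data:", observed_h1 == observed_h2)
print("next symbol under H1:", next_h1, "| under H2:", next_h2)
print("H1 source is shorter:", len(H1_SOURCE) < len(H2_SOURCE))
```

The generators show that the data alone cannot separate the two hypotheses; only the description length differs.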

  • Let me add that H1 will be a shorter computer program in any programming language I can imagine that has not been constructed explicitly so that H2 is shorter. (And note that your English description of H1 is shorter.)

  • Andrew, here is an example that might be more familiar to you.

    H1: AIDS is caused by a virus.
    H2: AIDS is caused by a virus until August 31, 2008, and after that it is caused by a bacterium.

    I see that this example is much weaker than my first example though; viruses and bacteria are not black boxes, so we can give theoretical reasons why H2 is a bizarre hypothesis. Because H1 and H2 are not fundamental theories, what really matters is the degree to which H1 and H2 are implied by our fundamental theories, rather than their internal complexity.

    Occam’s razor applies directly to fundamental physics.

  • Peter,

    I agree with you on your AIDS example. Hypothesis 2 violates various physical laws and models and makes no sense at all.

    Things are a little different in social science. For example:

    H1: votes in Congress are determined by a single factor (which can be interpreted as position on economic issues)

    H2: votes in Congress are determined by 2 factors (economics and foreign policy)

    H3: votes in Congress are determined by 3 factors


    None of these hypotheses is true. I certainly don’t want someone quoting some 800-year-old guy called Occam as a reason for choosing H1 or even H2.

    From a statistical point of view, I think the key distinction is between discrete models (such as your AIDS example) and continuous models. In the latter, many, many factors eventually do make a difference, even if you can’t pick them out with the finite data at hand.

  • Stuart Armstrong

    “None of these hypotheses is true. I certainly don’t want someone quoting some 800-year-old guy called Occam as a reason for choosing H1 or even H2.”

    None of them are scientific hypotheses – they are not validated by experiment. You have to look at the data, formulate the hypothesis, then look at the data again, and see if the hypothesis is validated. Then you have a scientific hypothesis.

    All that Occam’s razor says is that if there are two hypotheses, H1 and H2, which were formulated before seeing the extra data and are equally validated by that extra data, we should choose the simpler (or more elegant) one. It is not a scientific principle, it is an esthetic one, which is kept because it is useful (simpler theories are less error prone) and has had some success in the past (the universe in some ways is a lot simpler than our intuitions would suggest).

    The second that a piece of data comes in that could distinguish between H1 and H2, Occam’s razor is blunted and should be discarded. What is often seen as a version of Occam’s razor – the idea that theories generally must persist identically through time and through space – is actually an observation, not a principle.

  • Stuart,

    If it’s an esthetic principle, that’s ok. We just have different esthetic senses. You like Japanese rock gardens, I like overstuffed sofas. (Refer to Radford’s quote in my first comment above.) Based on the popularity of Occam’s Razor, I suspect that Radford and I have a minority preference. Nonetheless I think my preferences have worked well in the many applied problems that I’ve worked on.

    I would argue that the appropriate esthetic principles can depend on what problems you’re working on. Many aspects of physics and genetics are inherently discrete (for example, just one or two genes determining something), whereas social and environmental phenomena are typically explained by many factors.

  • Suppose you fit a model to your hypotheses. Most likely the fit of H1 will be worse than that of H2, which will be worse than that of H3, and so on. If you had to choose between these hypotheses, which one would you choose? If only the fit matters, you should choose the one with 435 factors since it gives the best fit. But it is pretty obviously a much less plausible social science model than saying that there are a few big factors (and then many small ones hidden in the “noise”). That the real state of affairs is much more complex than the hypotheses does not mean that the simple model doesn’t tell us something interesting.

    (BTW, I actually did this kind of analysis of the Swedish parliament, and could show that it was mostly 1-dimensional like British politics, unlike the 4D Norwegian politics and the 2D US politics.)

    Kolmogorov complexity applies in social science too: an explanation with many ad hoc details is generally valued less than an explanation based on a few simple, general principles. One reason to value the simpler explanation is that it is less prone to overfitting, and can be expected to work well in the future too (unless it is too vacuous to actually tell us anything, of course).

    A lot of people get seduced by complex “realistic” models in climate or neuroscience. The problem is that you seldom learn anything from them. They do things, but we cannot follow what happens or why, and the only way of predicting what will happen when parameters are changed is to run the model anew. That might be close to how reality works, but it does not help us understand what is going on or make robust predictions. Often reducing the model to equivalent but simpler models gives valuable insights.

    In the case of the political votes, I think the simple models would do a pretty good job of predicting how representatives vote. That would be useful for a political analyst, and for the sociologist it would be interesting to examine what the components are (and how they change over time). That sounds like a good start for research to me. Instead starting with the most complex assumption (everybody votes for their own special reasons) doesn’t get us very far.

  • I’ve been thinking that the intuition behind Occam’s Razor is probably rooted in entropy and the apparent overall state of the universe. Overall, the universe seems to be rather homogeneous. Thus, when comparing a more homogeneous/more entropic model of a phenomenon with a less homogeneous/less entropic one, all other things being equal the more homogeneous/more entropic model is statistically more probable. Has anyone else hitched together some version of Bayesian reasoning, entropy, and the relative homogeneity of the universe in this way as an explanation for the intuition behind Occam’s Razor?

  • Nick Tarleton

    HA: Something like that.

  • Anders,

    I agree that it can be a useful statement to say, for example, that 97% of the variance in the data is explained by only two factors, or to say that certain factors are lost in the noise, or to compare different datasets with regard to how many factors are needed to explain most of the interesting patterns. However, I don’t see this in terms of a prior distribution favoring simpler models; I see it as a choice of how to summarize a complicated model (for example, summarizing by the largest and most important factors). I agree completely with you that if a simple model does pretty well, this can be interesting, and a social scientist can look at how these factors change over time. I just completely reject the statement in the original blog entry above that the simpler model is “more probable.” I’m much happier working with the possibility of a really complicated model (which would be structured; I’m not talking about estimating 435 parameters with a flat prior) and then using “Occam’s razor,” if you’d like to call it that, to prepare useful summaries.

  • Andrew, your take on my 4:32pm post?

  • Another argument for Occam’s razor is that we have to favor some hypotheses over others in our prior probability assignments. For instance, consider the example from before with the sequence 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1. We can easily come up with an uncountably infinite number of possible hypotheses: e.g., that after 8 repetitions of “0,1”, the next number in the sequence is any particular element of the set of real numbers. (If we refuse to favor some hypotheses over others, we must assign measure 0 to all of them.) Occam’s razor states a principle about which types of hypotheses we ought to favor.

    In any event, Occam’s razor may not be essential for the original premise of the post. Suppose we think an alien planet is just as likely to be named Alpha as to be named Beta, and that those aliens are just as likely to have green skin as blue skin. Suppose these events are probabilistically independent of playing ping pong, riding unicycles, and liking the number 7. Define the event A’ as follows: “Person A was abducted by aliens from the planet Alpha; they had green skin, they liked to play ping-pong, they rode around on unicycles, and their favorite number was 7.” By the assumptions stated above, P(A’) = P(B). But P(A’) <= P(A), because A’ is a subset of A. So P(B) <= P(A). (Most likely, the inequality will be strict.)
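
The subset argument can be checked by brute-force enumeration on a toy sample space. This is a sketch under loudly stated assumptions: every attribute is treated as an independent fair coin flip (planet name, skin color, and each of the three extra details), which goes beyond what the comment assumes about the details, and all names in the code are illustrative.

```python
from fractions import Fraction
from itertools import product

# Toy sample space (an assumption for illustration): each outcome is
# (planet, skin, plays_ping_pong, rides_unicycles, favorite_number_is_7),
# with all outcomes equally likely, so the attributes are independent.
PLANETS = ("Alpha", "Beta")
SKINS = ("green", "blue")
FLAGS = (True, False)
outcomes = list(product(PLANETS, SKINS, FLAGS, FLAGS, FLAGS))


def prob(event):
    """Exact probability of an event under the uniform toy distribution."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


def has_all_details(outcome):
    """True when the ping-pong, unicycle, and number-7 details all hold."""
    return outcome[2] and outcome[3] and outcome[4]


def is_a(outcome):
    """Event A: planet Alpha, green skin, nothing said about details."""
    return outcome[0] == "Alpha" and outcome[1] == "green"


def is_a_prime(outcome):
    """Event A': event A plus all three details."""
    return is_a(outcome) and has_all_details(outcome)


def is_b(outcome):
    """Event B: planet Beta, blue skin, plus all three details."""
    return outcome[0] == "Beta" and outcome[1] == "blue" and has_all_details(outcome)


p_a, p_a_prime, p_b = prob(is_a), prob(is_a_prime), prob(is_b)
print("P(A)  =", p_a)
print("P(A') =", p_a_prime)
print("P(B)  =", p_b)
```

In this toy space P(A’) equals P(B) by symmetry, and both are smaller than P(A) because every outcome in A’ is also in A.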

  • Hopefully A.,

    I don’t work in settings where I choose between theory A and theory B, or between model A and model B. If I have 2 models I’m considering, I’d rather embed them continuously in a larger model that includes both the originals as special cases. (That’s called “continuous model expansion” as opposed to “discrete model averaging.”) There’s more discussion of this in a couple of the sections in chapter 6 of Bayesian Data Analysis.

    Similarly, to Utilitarian: I’m not really ever estimating the probability of hypotheses. Rather, I use hypotheses to estimate parameters and make predictions. Then I use these predictions to evaluate the model.

  • Andrew, the more I reread your 8:57pm post, the more I doubt the essential honesty of it (that you don’t intuitively pick (or, to go a step back, formulate) less complicated mechanisms for observed phenomena, and that you don’t intuitively do so by factoring in the apparently limited energy available in our local system).

  • The history of attempts to axiomatize Occam’s Razor is far more complicated, and modern approaches far more subtle.

    Abstract: Complexity in the Paradox of Simplicity

    Jonathan Post
    Computer Futures, Inc.

    Philip Fellman
    University of Southern New Hampshire