Occam's Razor
Followup to: Burdensome Details, How Much Evidence?
The more complex an explanation is, the more evidence you need just to find it in belief-space. (In Traditional Rationality this is often phrased misleadingly, as "The more complex a proposition is, the more evidence is required to argue for it.") How can we measure the complexity of an explanation? How can we determine how much evidence is required?
Occam's Razor is often phrased as "The simplest explanation that fits the facts." Robert Heinlein replied that the simplest explanation is "The lady down the street is a witch; she did it."
One observes that the length of an English sentence is not a good way to measure "complexity". And "fitting" the facts by merely failing to prohibit them is insufficient.
Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using labels for concepts that the listener shares - the receiver has already stored the complexity in them. Suppose we abbreviated Heinlein's whole sentence as "Tldtsiawsdi!" so that the entire explanation can be conveyed in one word; better yet, we'll give it a short arbitrary label like "Fnord!" Does this reduce the complexity? No, because you have to tell the listener in advance that "Tldtsiawsdi!" stands for "The lady down the street is a witch; she did it." "Witch", itself, is a label for some extraordinary assertions - just because we all know what it means doesn't mean the concept is simple.
An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, "Maybe a really powerful agent was angry and threw a lightning bolt." The human brain is the most complex artifact in the known universe. If anger seems simple, it's because we don't see all the neural circuitry that's implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don't feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.
To a human, Maxwell's Equations take much longer to explain than Thor. Humans don't have a built-in vocabulary for calculus the way we have a built-in vocabulary for anger. You've got to explain your language, and the language behind the language, and the very concept of mathematics, before you can start on electricity.
And yet it seems that there should be some sense in which Maxwell's Equations are simpler than a human brain, or Thor the thunder-agent.
There is: It's enormously easier (as it turns out) to write a computer program that simulates Maxwell's Equations, compared to a computer program that simulates an intelligent emotional mind like Thor.
The
formalism of Solomonoff Induction measures the "complexity of a
description" by the length of the shortest computer program which
produces that description as an output. To talk about the "shortest
computer program" that does something, you need to specify a space of
computer programs, which requires a language and interpreter.
Solomonoff Induction uses Turing machines, or rather, bitstrings that
specify Turing machines. What if you don't like Turing machines?
Then there's only a constant complexity penalty to design your own
Universal Turing Machine that interprets whatever code you give it in
whatever programming language you like. Different inductive formalisms
are penalized by a worst-case constant factor relative to each other,
corresponding to the size of a universal interpreter for that formalism.
In the better (IMHO) versions of Solomonoff Induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2^N strings of length N. This is Solomonoff Induction's approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better "fit" one possibility, it must steal probability mass from some other possibility which will then "fit" much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.
How do we trade off the fit to the data, against the complexity of the program? If you ignore complexity penalties, and think only about fit, then you will always prefer programs that claim to deterministically predict the data, assign it 100% probability. If the coin shows "HTTHHT", then the program which claims that the coin was fixed to show "HTTHHT" fits the observed data 64 times better than the program which claims the coin is fair. Conversely, if you ignore fit, and consider only complexity, then the "fair coin" hypothesis will always seem simpler than any other hypothesis. Even if the coin turns up "HTHHTHHHTHHHHTHHHHHT..." Indeed, the fair coin is simpler and it fits this data exactly as well as it fits any other string of 20 coinflips - no more, no less - but we see another hypothesis, seeming not too complicated, that fits the data much better.
If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a "factor of two gain" in the fit. If you try to design a computer program that explicitly stores an outcome like "HTTHHT", the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.
Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.
The way Solomonoff induction works to predict sequences is that you sum up over all allowed computer programs - if any program is allowed, Solomonoff induction becomes uncomputable - with each program having a prior probability of (1/2) to the power of its code length in bits, and each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.
The Minimum Message Length formalism is nearly equivalent to Solomonoff induction. You send a string describing a code, and then you send a string describing the data in that code. Whichever explanation leads to the shortest total message is the best. If you think of the set of allowable codes as a space of computer programs, and the code description language as a universal machine, then Minimum Message Length is nearly equivalent to Solomonoff induction. (Nearly, because it chooses the shortest program, rather than summing up over all programs.)
This lets us see clearly the problem with using "The lady down the street is a witch; she did it" to explain the pattern in the sequence "0101010101". If you're sending a message to a friend, trying to describe the sequence you observed, you would have to say: "The lady down the street is a witch; she made the sequence come out 0101010101." Your accusation of witchcraft wouldn't let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.
Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything, like saying "Phlogiston!" So, even after you say "witch", you still have to describe all the observed data in full detail. You have not compressed the total length of the message describing your observations by transmitting the message about witchcraft; you have simply added a useless prologue, increasing the total length.
The real sneakiness was concealed in the word "it" of "A witch did it". A witch did what?
Of course, thanks to hindsight bias and anchoring and fake explanations and fake causality and positive bias and motivated cognition, it may seem all too obvious that if a woman is a witch, of course she would make the coin come up 0101010101. But of this I have already spoken.
The Vapnik Chernovenkis Dimension also offers a way of filling in the detail of the the concept of "simple" appropriate to Occam's Razor. I've read about it in the context of statistical learning theory, specifically "probably approximately correct learning".
Having successfully tuned the parameters of your model to fit the data, how likely is it to fit new data, that is, how well does it generalise. The VC dimension comes with formulae that tell you. I've not been able to follow the field, but I suspect that VC dimension leads to worst case estimates whose usefulness is harmed by their pessimism.
Posted by: Alan Crowe | September 26, 2007 at 07:23 AM
Great post!
Posted by: Hopefully Anonymous | September 26, 2007 at 08:27 AM
"Your accusation of witchcraft wouldn't let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused."
My model of witches, if I had one, would produce a given simple sequence like 01010101 with greater probability than a given random sequence like 00011011. Wouldn't yours? I might agree if you said "in nearly full detail".
Posted by: steven | September 26, 2007 at 09:20 AM
Steven, that means you have to transmit the accusation of witchcraft, followed by a computer program, followed by the coded data. Why not just transmit the computer program followed by the coded data? I don't expect my own environment to be random noise, but that has nothing to do with witchcraft...
Alan, I agree that VC dimension is an important conceptually different way of thinking about "complexity". One of its primary selling points is that, for example, it doesn't attach infinite complexity to a model class that contains one real-valued parameter, if that model class isn't very flexible (i.e., it says only "the data points are greater than R"). But VC complexity doesn't plug into standard probability theory as easily as Solomonoff induction.
Posted by: Eliezer Yudkowsky | September 26, 2007 at 11:41 AM
In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.
Posted by: Stephen Jordan | September 26, 2007 at 12:10 PM
Eliezer,
"I don't expect my own environment to be random noise, but that has nothing to do with witchcraft..."
I think I misinterpreted the math and now see what you're getting at. Would it be an accurate translation to human language to say, "a sequence like 10101010 may favor witchcraft over the hypothesis that nothing weird is going on (i.e. the coinflips are random), but it will never favor witchcraft over the simpler hypothesis that something weird is going on that isn't witchcraft"?
I find it awkward to think of "witchcraft" as just a content-free word; what "witchcraft" means to me is something like the possibility that reality includes human-mind-like things with personalities and with preferences that they achieve through unknown nonstandard causal means. If you coded that up, it would probably no longer be content-free; it would allow shortening the rest of the program generating the sequences in some cases and require lengthening it in some other cases. In all realistic cases the resulting program would still be longer than necessary.
Posted by: steven | September 26, 2007 at 01:14 PM
Good comments, all!
Steven, yes. Stephen, also yes.
Posted by: Eliezer Yudkowsky | September 26, 2007 at 01:53 PM
Eli, you said:
An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, "Maybe a really powerful agent was angry and threw a lightning bolt." The human brain is the most complex artifact in the known universe. If anger seems simple, it's because we don't see all the neural circuitry that's implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don't feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.
I think it's worth noting that Norse tribesfolk already knew about human beings, so whatever model of the universe they made had to include angry agents in it somewhere.
Posted by: Peter de Blanc | September 26, 2007 at 02:39 PM
What you are talking about in terms of Solmonoff induction is usually called algorithmic information theory and the shortest-program-to-produce-a-bit-string is usually called Kolmogorov-Chaitin information. I am sure you know this. Which begs the question, why didn't you mention this? I agree, it is the neatest way to think about Occam's razor. I am not sure why some are raising PAC theory and VC-dimension. I don't quite see how they illuminate Occam. Minimalist inductive learning is hardly the simplest "explanation" in the Occam sense, and is actually closer to Shannon entropy in spirit, in being more of a raw measure. Gregory Chaitin's 'Meta Math: The Search for Omega', which I did a review summary of is a pretty neat look at this stuff.
Posted by: Venkat | September 26, 2007 at 05:33 PM
Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.
Posted by: Stephen Jordan | September 26, 2007 at 07:12 PM
Occam’s razor is not conclusive and it’s not science. It is not unscientific but I would say that it fits into the category of philosophy. In science you do not get two theories, take the facts you know, and then conclude based on the simplest theory. If you’re doing this, you need to do better experiments to determine the facts. Occam’s razor can be a useful heuristic to suggest what experiments should be done. Just like mathematical elegance, Occam’s razor suggests that something is on the right track but it is not decisive. To look back at the facts and then interpret it through Occam’s razor is just an exercise in hindsight bias.
Your analogy with Norse tribesfolk reminds me of the NRA slogan, “Guns don’t kill people, people kill people”. There are many different levels of causation. The gun can be said to be the secondary cause of why someone died. The person pulling the trigger would be the primary cause. The secondary cause of thunder is nature but the first cause that brought things into existence and created the system is God. Nature cannot be its own cause.
The rest of what you wrote sounds like you're pulling numbers out of your arse. The last sentence should be read in your best Norse tribesfolk accent.
Posted by: Cure of Ars | September 26, 2007 at 07:34 PM
In science you do not get two theories
You're right - there are an infinite number of theories consistent with any set of observations. Any set. All observed facts are technically consistent with the prediction that gravity will reverse in one hour, but nobody believes that because of... Occam's Razor!
Posted by: Nick Tarleton | September 26, 2007 at 08:16 PM
The rest of what you wrote sounds like you're pulling numbers out of your arse.
Cure of Ars, I should prefer it if you no longer commented on my posts. There may be a place on Overcoming Bias for Catholics; but none for those who despise math they don't understand.
Posted by: Eliezer Yudkowsky | September 26, 2007 at 08:22 PM
MIT Press has just published Peter Grünwald's The Minimum Description Length Principle. His Preface, Chapter 1, and Chapter 17 are available at that link. Chapter 17 is a comparison of different conceptions of induction.
I don't know this area well enough to judge Peter's wok, but it is certainly informative. Many of his points echo Eliezer's. If you find this topic interesting, Peter's book is definitely worth checking out.
Posted by: Jed Harris | September 29, 2007 at 01:51 AM