I had a discussion with Christian Robert about the mystical feelings that seem to be sometimes inspired by Bayesian statistics. The discussion originated with an article by Eliezer, so it seemed appropriate to put the discussion here on Eliezer's blog. As background, both Christian and I have done a lot of research on Bayesian methods and computation, and we've also written books on the topic, so in some ways we're perhaps too close to the topic to be the best judges of how a newcomer will think about Bayes.

Daniel, thanks for your perspective; it gives me lots to ponder.

Cyan,

See, this kind of terminological disagreement illustrates why I think it's better to use the codelength idea :-)

Can normalized maximum likelihood be used to send data? If so, then it implies an implicit prior over data sets which is exactly 2^(-l(x)), where l(x) is the length of the code. Whether or not this means it is "equivalent" to Bayes would seem to depend on what the word "Bayesian" means to you; in my lexicon it means a philosophical commitment to the necessity of using prior distributions that are essentially arbitrary. Once you've accepted that priors are necessary, then the rules for updating them are mathematical theorems which are no longer disputable.
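To make the implicit-prior point concrete, here is a minimal sketch (the Bernoulli model, sequence length, and all names are mine, chosen for illustration): compute the normalized maximum likelihood (NML) distribution over binary sequences of length 3 and check that the "prior" implied by the code lengths, 2^(-l(x)), is exactly the NML distribution itself.

```python
from itertools import product
import math

def ml_prob(x):
    """Maximum-likelihood probability of a binary sequence under a Bernoulli model."""
    n, k = len(x), sum(x)
    th = k / n
    return (th ** k) * ((1 - th) ** (n - k))  # note 0**0 == 1 in Python

n = 3
seqs = list(product([0, 1], repeat=n))
norm = sum(ml_prob(x) for x in seqs)          # NML normalizer ("parametric complexity")
p_nml = {x: ml_prob(x) / norm for x in seqs}  # normalized maximum likelihood distribution

# p_nml is a genuine distribution over data sets, so it can be used to send data...
assert abs(sum(p_nml.values()) - 1.0) < 1e-12

# ...and the implied "prior" over data sets, 2^(-l(x)) with l(x) = -log2 p_nml(x),
# is exactly p_nml(x) again.
for x in seqs:
    l = -math.log2(p_nml[x])
    assert abs(2 ** (-l) - p_nml[x]) < 1e-12
```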

Note that the above argument "Can method X be used to send data? If so, then it implies an implicit prior over data sets..." works for a wide range of methods X (e.g. Support Vector machines, Belief nets) which various people have claimed are not explicitly Bayesian.

It also means they are ALL subject to the mighty No Free Lunch Theorem which says roughly that in general, data compression cannot be achieved. All modeling and statistical learning techniques should therefore be prefaced by disclaimers noting that "this method does not work in general, but if we make certain assumptions about the nature of the process generating the data..."
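The "compression cannot be achieved in general" claim is just a counting argument, which can be sketched in a few lines (the choice of n is arbitrary): there are fewer short strings than long ones, so no lossless code can shorten every input.

```python
# Pigeonhole check: any injective (lossless) code on the 2^n binary strings of
# length n cannot map every one of them to a strictly shorter string, because
# there are only 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 strings of length < n.
n = 8
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))  # strings of length 0 .. n-1
assert shorter_outputs < inputs  # not enough codewords: some input cannot shrink
```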

Andrew, thanks for starting this discussion, looking forward to future OB posts from you (don't tell Eliezer that you're into things like the Gibbs sampler and Metropolis algorithm, though).

It sounds to me as though you have your concept of reason muddled-in with ideas from decision theory. "What is true" and "what to do" are rather different issues.

You don't have to be a genius to see that there's something suspiciously incomplete about Bayes. The Anthropic puzzles have not been solved using Bayesian methods, and Bayesian experts report being 'confused' in certain cases such as the Doomsday argument. This hints that there may be more powerful reasoning methods.

We know that Deduction is merely a special case of Induction (namely the case where the probabilities are set to 100%). In other words, Deduction is merely a shadow of Induction. Could it be that Induction in turn is merely a shadow of some as yet undiscovered more powerful method still? If so, there would be some shifting parameter which is not probability, but that when set to some special case would look like a probability.

semantic distance perhaps?

@Tyrrell, intuitionism:

Intuitionistic logic is the only one I can think of that has any possibility of competing for the office of calculus ratiocinator. I can just about imagine conducting all one's thought on every meta-level without ever assuming that if a proposition cannot be false, it must be true. But a classicist can pass among intuitionists just by prefixing everything with not-not, and who's to know if a professed intuitionist isn't doing the same? Personally, I think intuitionism was a historical accident that would never have happened if Babbage's machines had been more practical, but that is another story.

@Sebastian: My personal probability of Bayes' theorem being incorrect is 'epsilon', i.e. nonzero but too low for me to bother tracking the exact order of magnitude.

If I'm not assigning actual numbers, I'm not doing probabilistic inference, even if I believe I am. Even if I have an actual epsilon not plucked out of the air, if I throw it away, I've reverted to POML.

BTW, combining the last two points, I see from Google that there is such a thing as intuitionistic Bayesianism. I do not know how well known this is.

@Andrew: All these systems can work, and they all have logical holes too.

Logical holes in POML? Gödel's completeness theorem proves their absence. (His incompleteness theorems talk about theories expressed in POML, not POML itself.) Did you have something else in mind?

This has been an interesting discussion, revealing to me that the participants in this forum have a much different perspective on Bayes than Christian Robert and I have.

I have little to add to the discussion except to comment on Richard Kennaway's statement that "standard mathematical logic does [model valid reasoning in general]."

One thing I've learned in applied statistics is that there are lots of different logical frameworks that can work well. So, sure, mathematical logic can model valid reasoning, so can Bayes, so can fuzzy sets, machine learning, etc. All these systems can work, and they all have logical holes too. It's the nature of inference.

Thanks, Daniel and Manuel!

Daniel, I've read Grünwald's tutorial, and it's clear that MDL is not Bayesian: just look at the normalized maximum likelihood distribution, which when it exists solves the MDL problem -- and violates the likelihood principle, as it requires a summation over the data space. (...And does not require an explicit prior; not sure if there's an implicit prior determined by the choice of data format, per your statement.) Rissanen is pretty contemptuous of the Bayesian approach (and the frequentist approach and the likelihood approach; dude has some strong opinions). I was hoping for a nice 70-page Grünwald-style tutorial on MML, as I have not been able to find one myself.

"What is the probability that P(A|B) = P(B|A) P(A)/P(B)? One."

This is just an approximation. The only way to check a mathematical proof is by feeding it to a proof-verifying physical system. That system could, e.g., be a program running on a digital computer or a human (including yourself). Unless you have perfect and absolutely reliable knowledge about both the physical system and the physical laws of this universe - which you can't, in practice - there's always the chance that the system in question does it wrong; in this case, that it outputs "proof is correct" even though the proof isn't.

Feeding the same possible proof to very many different proof-verifying physical systems can dramatically decrease the probability of an incorrect verdict like that. It can't push it to exactly zero.

My personal probability of Bayes' theorem being incorrect is 'epsilon', i.e. nonzero but too low for me to bother tracking the exact order of magnitude.

@Richard Kennaway

I think the question of whether probabilistic reasoning or POML reasoning is more fundamental is not so straightforward. When one searches for a "most" fundamental logic, one finds that the structure of the candidate systems becomes "loopy", with each perfectly capable of being embedded within the others.

For example, should you go with POML or intuitionistic reasoning in mathematics? There's no formal criterion. Every intuitionistic theorem is a classical theorem once you restrict the quantifiers properly. Every classical theorem is an intuitionistic theorem if you replace "P or Q" with "not-(not-P and not-Q)".
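The classical half of that translation is easy to check mechanically: "P or Q" and "not (not P and not Q)" agree on every truth-value assignment, which is why the translated statement is invisible to a classicist. (The substantive part of the embedding is that the translated form is also intuitionistically provable, which a truth table cannot show.) A minimal check:

```python
from itertools import product

# Classical truth-table check: "P or Q" and "not (not P and not Q)"
# agree on all four valuations of (P, Q).
ok = all((p or q) == (not (not p and not q))
         for p, q in product([False, True], repeat=2))
assert ok
```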

anonym: I think part of the "mystical" feeling probably comes from the realization of how widely applicable Bayes' theorem is and a sense that it can function as something like the foundation of a calculus of thought such as people like Leibniz and Boole have sought for so long and can be applied to every aspect of thought.

"Calculus of thought" is a solved problem. What Leibniz sought, Boole found, and Frege, Russell, and Whitehead brought to completion: the calculus of thought is propositional and first-order predicate calculus. That is, the calculus with which we must reason, to reason validly, not the ways in which we do reason, which is whatever our meatware does, valid or not, and is not a calculus.

A problem with regarding Bayesian reasoning or anything else -- quantum logic, modal logic, or whatever -- as the calculus of thought is that on the metalevel, we always go on reasoning in POML: plain old mathematical logic, where things are simply true, or false. Bayes' theorem itself is of this nature. It is about probabilities, but is not a probabilistic statement. What is the probability that P(A|B) = P(B|A) P(A)/P(B)? One. Bayesian and other calculi are mathematically accurate models of certain aspects of the world, but they do not model valid reasoning in general. Standard mathematical logic does.
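The "probability one" point can be illustrated numerically: for any joint distribution whatsoever, Bayes' theorem holds as an identity, not as a contingent fact. A toy check (the joint probabilities below are made up for illustration):

```python
# Toy joint distribution over two binary events (A, B); values sum to 1.
joint = {
    (True, True): 0.10, (True, False): 0.30,
    (False, True): 0.15, (False, False): 0.45,
}
pA = sum(v for (a, b), v in joint.items() if a)          # marginal P(A)
pB = sum(v for (a, b), v in joint.items() if b)          # marginal P(B)
pA_given_B = joint[(True, True)] / pB                    # P(A|B)
pB_given_A = joint[(True, True)] / pA                    # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B), an identity of the joint.
assert abs(pA_given_B - pB_given_A * pA / pB) < 1e-12
```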

Be warned that Vapnik's awesome book is highly opinionated. I suggest reading some other material on MDL and MML before progressing to this very (to quote a reviewer) "Russian" book.

Cyan,

one of the best starting points is Baxter and Oliver's "MDL and MML: Similarities and Differences", 1994 (Tech Report in three parts). David Dowe's page http://www.csse.monash.edu.... may also be interesting for you.

The best intro paper on MDL is probably Grünwald's "A Tutorial Introduction to the Minimum Description Length Principle", which also addresses your question about priors in MDL (and mentions some consistency results, if I remember correctly). Grünwald's recent book on MDL also makes for an interesting read, if you want to dig deeper. Li & Vitányi's canonical book on Kolmogorov complexity will give you the most profound understanding of the topics Daniel Burfoot mentioned above.

Cyan,

The foundation of MDL is information theory; the standard textbook by Cover and Thomas has many good problems. For a good tutorial specifically about MDL, do a scholar.google search for "Grunwald" and "MDL tutorial". Also see Vapnik's awesome "Nature of Statistical Learning Theory" for a good discussion of the relationship between Bayes, MDL, and VC-theory-style regularization, as well as other intriguing topics.

I don't believe MDL does avoid the necessity of using a prior distribution; I believe it makes the necessity of choosing such a distribution philosophically clear and unavoidable (it is the choice of data format agreed on in advance between sender and receiver).
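This equivalence can be sketched in a few lines (the model names, priors, and likelihoods below are made-up numbers for illustration): in a two-part code, the model part of the codelength is -log2 prior(model), so minimizing total codelength is the same as picking the MAP model.

```python
import math

# Two-part code: total length = L(model) + L(data | model).
# Choosing L(model) is exactly choosing a prior, L(model) = -log2 prior(model),
# so the shortest-codelength model is the maximum a posteriori (MAP) model.
def codelength(prior_m, likelihood):
    return -math.log2(prior_m) - math.log2(likelihood)

# Hypothetical candidates: (prior, p(data | model)); numbers are illustrative.
models = {"simple": (0.7, 1e-6), "complex": (0.3, 1e-4)}

best_code = min(models, key=lambda m: codelength(*models[m]))
best_map = max(models, key=lambda m: models[m][0] * models[m][1])
assert best_code == best_map  # the two selection rules agree
```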

The problem of inconsistency is a deep technical one and as I said I don't believe the codelength view offers any technical advantages, only philosophical ones. The basic analogue of the inconsistency problem seems to be the case where you have a data format that could in principle achieve low code rates for a certain data set, if you could infer the right set of model parameters; but for whatever reason the data "tricks" you into inferring the wrong parameters and so you can't get the optimal low codelength.

The assumption required for Bayesian reasoning to work is that the possible outcomes being assigned probabilities are independent of the motivations of the agent observing these outcomes. As the motivations of the agent start to become mixed with the possible outcomes, Bayesian reasoning starts to break down.

It sounds to me as though you have your concept of reason muddled-in with ideas from decision theory. "What is true" and "what to do" are rather different issues.

Heh,

The 'mystical feelings' inspired by Bayes are quite misplaced. Bayes can be beaten.

The assumption required for Bayesian reasoning to work is that the possible outcomes being assigned probabilities are independent of the motivations of the agent observing these outcomes. As the motivations of the agent start to become mixed with the possible outcomes, Bayesian reasoning starts to break down.

This is clearly seen in puzzles of Anthropic reasoning, such as the Doomsday argument, or puzzles of self-reference such as Newcomb's box, which cannot be solved via Bayesian methods.

The reason Bayes breaks down in these situations is that the boundaries between ontological categories are somewhat fluid, whereas Bayes requires that these boundaries be precisely defined. The reason for the ambiguity of ontological categories is the impossibility of finite algorithmic definitions of the meaning (semantics) of many concepts.

Analogy formation is a more general and powerful method of reasoning than Bayes, because analogy formation provides a means to ensure interoperability between different knowledge domains, and thus it can deal with fluid ontological categories.

Bayesian reasoning is merely a special case of analogy formation, namely the case where the semantic meaning of concepts is fixed (i.e., the case where the boundaries between ontological categories are precisely defined).

(I had a brainfart -- that should be Freedman and Diaconis.)

Daniel Burfoot,

I've got some questions about the codelength version of Bayesian statistics, which I'm assuming is synonymous with MML:

- do you have any recommendations for a good introductory text with some problem sets?
- MDL attempts to avoid the necessity of using a prior distribution; how does it go wrong?
- is there a codelength/MML analogue to Bayesian posterior inconsistency results such as those of Persi and Diaconis [ref]?