Bayes: radical, liberal, or conservative?

I wrote the following (with Aleks Jakulin) to introduce a special issue on Bayesian statistics of the journal Statistica Sinica (volume 17, 422-426).  I think the article might be of interest even to dabblers in Bayes, as I try to make explicit some of the political or quasi-political attitudes floating around the world of statistical methodology.


As a lifetime member of the International Chinese Statistical Association, I am pleased to introduce a volume of Bayesian articles.  I remember that in graduate school, Xiao-Li Meng, now editor of this journal, told me they didn’t teach Bayesian statistics in China because the idea of a prior distribution was contrary to Mao’s quotation, "truth comes out of empirical/practical evidence."  I have no idea how Thomas Bayes would feel about this, but Pierre-Simon Laplace, who is often regarded as the first applied Bayesian, was active in politics during and after the French Revolution.

In the twentieth-century Anglo-American statistical tradition, Bayesianism has certainly been seen as radical.  As statisticians, we are generally trained to respect conservatism, which can sometimes be defined mathematically (for example, nominal 95% intervals that contain the true value more than 95% of the time) and sometimes with reference to tradition (for example, deferring to least-squares or maximum-likelihood estimates).  Statisticians are typically worried about messing with data, which perhaps is one reason that the Current Index to Statistics lists 131 articles with "conservative" in the title or keywords and only 46 with the words "liberal" or "radical."

Like many political terms, the meaning of conservatism depends on its comparison point.  Does the Democratic Party in the U.S. represent liberal promotion of free expression or a conservative perpetuation of government bureaucracy?  Do the Republicans promote a conservative defense of liberty and property or a radical revision of constitutional balance?  And where do we place seemingly unclassifiable parties such as the Institutional Revolutionary Party in Mexico or the pro-Putin party in Russia?

Such questions are beyond the scope of this essay, but similar issues arise in statistics.  Consider the choice of estimators or prior distributions for logistic regression.  Table 1 gives an example of the results of giving specified doses of a toxin to 20 animals.  Racine et al. (1986) fit a logistic regression to these data assuming independent binomial data with the logit probability of death being a linear function of dose.  The maximum likelihood estimate for the slope is 7.8 with standard error of 4.9, and the corresponding Bayesian inference with flat prior distribution is similar (but with a slightly skewed posterior distribution; see Gelman et al. 2003, Section 3.7).
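For readers who want to check the classical numbers, here is a minimal sketch in plain Python that fits this logistic regression to the Table 1 data by Newton-Raphson. The function name `fit_logistic_mle` and the choice of optimizer are illustrative assumptions, not anything taken from Racine et al.; the result should land close to the slope and standard error quoted above, up to rounding.

```python
import math

# Bioassay data from Table 1 (Racine et al. 1986)
doses = [-0.86, -0.30, -0.05, 0.73]
n_animals = [5, 5, 5, 5]
n_deaths = [0, 1, 3, 5]


def score_and_information(a, b, x, n, y):
    """Gradient and Fisher information of the binomial-logit log-likelihood."""
    g_a = g_b = 0.0
    i_aa = i_ab = i_bb = 0.0
    for xi, ni, yi in zip(x, n, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        resid = yi - ni * p          # observed minus expected deaths
        w = ni * p * (1.0 - p)       # binomial variance at this dose
        g_a += resid
        g_b += resid * xi
        i_aa += w
        i_ab += w * xi
        i_bb += w * xi * xi
    return (g_a, g_b), (i_aa, i_ab, i_bb)


def fit_logistic_mle(x, n, y, max_iter=100, tol=1e-10):
    """Hypothetical helper: Newton-Raphson for intercept a and slope b."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        (g_a, g_b), (i_aa, i_ab, i_bb) = score_and_information(a, b, x, n, y)
        det = i_aa * i_bb - i_ab * i_ab
        # Solve the 2x2 system: information * step = gradient
        step_a = (i_bb * g_a - i_ab * g_b) / det
        step_b = (i_aa * g_b - i_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    # Standard error of the slope from the inverse information at the estimate
    _, (i_aa, i_ab, i_bb) = score_and_information(a, b, x, n, y)
    det = i_aa * i_bb - i_ab * i_ab
    slope_se = math.sqrt(i_aa / det)
    return a, b, slope_se


intercept, slope, slope_se = fit_logistic_mle(doses, n_animals, n_deaths)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, se(slope) = {slope_se:.2f}")
```

With only 20 animals the likelihood is quite flat in the slope, which is why the standard error is so large relative to the estimate.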

This noninformative analysis would usually be considered conservative.  Perhaps there would be some qualms about the uniform prior distribution (why is it defined on this particular scale?), but the maximum likelihood estimate stands as a convenient reference point and fallback.  But now consider another option.

Instead of a uniform prior distribution on the logistic regression coefficients, let us try a Cauchy distribution centered at 0 with a scale of 2.5, assigned to the coefficient of the standardized predictor.  This is a generic prior distribution that encodes the information that it is rare to see changes of more than 5 points on the logit scale (which is what it would take to shift a probability from 0.01 to 0.5, or from 0.5 to 0.99).  Similar models have been found useful in the information retrieval literature (Genkin, Lewis, and Madigan, 2006).  Combining the data in Table 1 with this prior distribution yields an estimated slope of 4.4 with standard error 1.9.  This is much different from the classical estimate; the prior distribution has made a big difference.
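As a quick check of the arithmetic behind the 5-point rule of thumb, the following small sketch computes the logit shifts mentioned above. The helper name `logit` is just for illustration.

```python
import math


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


# Shift needed to move a probability from 0.01 to 0.5, and from 0.5 to 0.99
shift_low = logit(0.5) - logit(0.01)
shift_high = logit(0.99) - logit(0.5)

print(f"0.01 -> 0.5 : {shift_low:.2f} logits")
print(f"0.5 -> 0.99: {shift_high:.2f} logits")
```

Both shifts equal log(99), about 4.6, so "5 points" is a rounded figure.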

Is this new prior distribution conservative?  When coming up with it (and using it as the default in our bayesglm package in R), we thought so:  the argument was that true logistic regression coefficients are almost always quite a bit less than 5 (if predictors have been standardized), and so this Cauchy distribution actually contains less prior information than we really have.  From this perspective, the uniform prior distribution is the most conservative, but sometimes too much so (in particular, for datasets that feature separation, coefficients have maximum likelihood estimates of infinity), and this new prior distribution is still somewhat conservative, thus defensible to statisticians.
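To get a rough feel for how much this Cauchy prior pulls the bioassay slope toward zero, here is a sketch that finds the posterior mode by brute-force grid search. It is not the algorithm used in bayesglm, and it makes several assumptions of its own: "standardized" is taken to mean rescaling dose to standard deviation 0.5, the intercept gets a flat prior, and the grid ranges are arbitrary. It therefore need not reproduce the 4.4 quoted earlier exactly.

```python
import math

# Bioassay data from Table 1 (Racine et al. 1986)
doses = [-0.86, -0.30, -0.05, 0.73]
n_animals = [5, 5, 5, 5]
n_deaths = [0, 1, 3, 5]


def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def log_likelihood(a, b):
    total = 0.0
    for x, n, y in zip(doses, n_animals, n_deaths):
        eta = a + b * x
        # log p = -log(1 + exp(-eta)); log(1 - p) = -log(1 + exp(eta))
        total -= y * log1pexp(-eta) + (n - y) * log1pexp(eta)
    return total


# ASSUMPTION: standardize dose to sd 0.5, so the Cauchy(0, 2.5) prior on the
# standardized slope becomes a Cauchy(0, 2.5 / (2 * sd)) prior on the raw slope.
total_n = sum(n_animals)
mean_dose = sum(n * x for x, n in zip(doses, n_animals)) / total_n
sd_dose = math.sqrt(
    sum(n * (x - mean_dose) ** 2 for x, n in zip(doses, n_animals)) / total_n
)
prior_scale = 2.5 / (2.0 * sd_dose)


def log_posterior(a, b):
    # ASSUMPTION: flat prior on the intercept, Cauchy prior on the slope
    return log_likelihood(a, b) - math.log1p((b / prior_scale) ** 2)


# Grid search over intercept in [-2, 3] and slope in [0, 15], step 0.05
_, intercept_mode, slope_mode = max(
    (log_posterior(-2 + 0.05 * i, 0.05 * j), -2 + 0.05 * i, 0.05 * j)
    for i in range(101)
    for j in range(301)
)
print(f"posterior mode: intercept = {intercept_mode:.2f}, slope = {slope_mode:.2f}")
```

The point of the sketch is only qualitative: with so little data, even a weak prior centered at zero moves the slope well below the maximum likelihood estimate.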

But from another perspective, that of prediction, our prior distribution is not particularly conservative, and the flat prior is even less so!  Let us explain.  We took the software of Genkin, Lewis, and Madigan (2005), which fits logistic regressions with a variety of prior distributions, and found that a Gaussian prior distribution with center 0 and scale 2.5 performed quite well as measured by predictive error from five-fold cross-validation, generally beating the corresponding Cauchy model (as well as the maximum likelihood estimate) when evaluated on a large corpus of datasets.  The conclusion may be that the Gaussian distribution is better than the Cauchy at modeling the truth, or at least that this particular Gaussian prior distribution is closer in spirit to what cross-validation is doing:  hiding 20% of the data and trying to make predictions using the model built on the other 80%.

This result is consistent with the hypothesis that our Cauchy prior distribution has more dispersion than the actual population of coefficients that might be encountered.  But is it conservative?  From the computer scientist’s standpoint of prediction, it is the Gaussian prior distribution that is conservative, in yielding the lowest expected predictive error for a new dataset (to the best of our knowledge).

Thinking about binary data more generally, the most conservative prediction of all is 0.5 (that is, guessing that both outcomes are equally likely).  From this perspective, one starts with the prior distribution and then uses data to gain efficiency, which is the opposite of the statistician’s approach of modeling the data first.  Which of these approaches makes more sense depends on the structure of the data, and more generally one can use hierarchical approaches that fit prior distributions from data.  Our point here is that, when thinking predictively, weak prior distributions are not necessarily conservative at all, and as statisticians we should think carefully about the motivations underlying our principles.

Statistical arguments, like political arguments, sometimes rely on catchy slogans.  When I was first learning statistics, it seemed to me that proponents of different statistical methods were talking past each other, with Bayesians promoting "efficiency" and "coherence" and non-Bayesians bringing up principles such as "exact inference" and "unbiasedness."  We cannot, unfortunately, be both efficient and unbiased at the same time (unless we perform unbiased _prediction_ instead of _estimation_, in which case we are abandoning the classical definition of unbiasedness that conditions on the parameter value).

Statistics, unlike (say) physics, is a new field, and its depths are close to the surface.  Hard work on just about any problem in applied statistics takes us to foundational challenges, and this is particularly so of Bayesian statistics.  Bayesians have sometimes been mocked for their fondness for philosophy, but as Bayes (or was it Laplace?) once said, "with great power comes great responsibility," and, indeed, the power of Bayesian inference (probabilistic predictions about everything) gives us a special duty to check the fit of our model to data and to our substantive knowledge.  In the great tradition of textbook writers everywhere, I know nothing at all about the example of Racine et al. (1986) given in Table 1, yet I feel reasonably confident that the doses in the experiment do not take the true probability of death from 0.003 to 0.999 (as would result from the odds ratio implied by the maximum likelihood estimate of 7.8).  It seems much more conservative to me to suppose this extreme estimate to have come from sampling variation, as is in fact consistent with the model and data.  Even better, ultimately, would be more realistic models that appropriately combine information from multiple experiments, a goal that is facilitated by technical advances such as those presented in the papers in this volume.
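The 0.003-to-0.999 claim can be checked with a few lines of arithmetic. This sketch uses only the slope estimate of 7.8, the lowest and highest doses from Table 1, and the starting probability of 0.003; the helper names are illustrative.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


slope_mle = 7.8                      # maximum likelihood estimate quoted in the text
dose_low, dose_high = -0.86, 0.73    # lowest and highest doses in Table 1
p_low = 0.003                        # probability of death at the lowest dose

# Change in log odds across the dose range, and the implied odds ratio
log_odds_shift = slope_mle * (dose_high - dose_low)
odds_ratio = math.exp(log_odds_shift)

# Probability of death at the highest dose implied by that shift
p_high = inv_logit(logit(p_low) + log_odds_shift)

print(f"log-odds shift = {log_odds_shift:.2f}, odds ratio = {odds_ratio:.3g}")
print(f"implied probability at highest dose = {p_high:.3f}")
```

A shift of more than 12 on the logit scale is far outside the range of roughly 5 that the generic prior treats as already rare.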

 

References:

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003).  Bayesian Data Analysis, second edition.  London:  CRC Press.

Genkin, A., Lewis, D. D., and Madigan, D. (2005).  BBR:  Bayesian logistic regression software.  Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University.  www.stat.rutgers.edu/~madigan/bbr/

Genkin, A., Lewis, D. D., and Madigan, D. (2006).  Large-scale Bayesian logistic regression for text categorization.  Technometrics.

Racine, A., Grieve, A. P., Fluhler, H., and Smith, A. F. M. (1986).  Bayesian methods in practice:  experiences in the pharmaceutical industry (with discussion).  Applied Statistics 35, 93-150.

 

Table:

Dose (log g/ml)   Number of animals   Number of deaths
     -0.86                5                   0
     -0.30                5                   1
     -0.05                5                   3
      0.73                5                   5

Table 1.  Bioassay data from Racine et al. (1986), used as an example for fitting logistic regression.

  • Hopefully Anonymous

    Learning applied mathematics/statistics (including Bayesian analysis) is a high priority for me. It seems to me that one can’t think meaningfully about rational decision making (and personal outcome maximization) without some substantial familiarity with mathematical/statistical techniques like Bayesian analysis, logistic regression, etc. Great post. I’ll reread it soon, hopefully for 100% clarity on your points, subpoints, and ideas.

  • http://www.cawtech.freeserve.co.uk Alan Crowe

    The key insight in Bayesian statistics is that once you have enough data the prior distribution doesn’t really matter. The data overwhelm it and you end up with pretty much the same posterior distribution no matter what prior you started with. This insight keeps getting lost. Practitioners naturally hope that the quantity of data that they have managed to collect is sufficient and look to choosing the “correct” prior to get the maximum of information from the data. Rather than worry about getting the “correct” prior, one’s method could focus on trying out all the different priors that seem reasonable. If different priors lead to different results, one concludes that the data gathered so far are not yet sufficient to decide the issue, and if shortage of time forces a choice, one is aware that one is exercising one’s judgement, for better or worse, in prejudging the issue by basing one’s forced choice on one prior in preference to another.

    I propose the use of, to coin a phrase, dialectical priors. In a controversy the two sides may compute different posterior distributions from the same data because they start from different priors. This is a feature, not a bug, and we should celebrate it as a positive advantage of the Bayesian approach. What has happened is that evidence that seems quite adequate to those whose judgement inclines them to accept a proposition nevertheless seems feeble and unpersuasive to those who were sternly skeptical from the outset. It is natural that doubters require more evidence to persuade them than enthusiasts do. It is unreasonable to expect a methodology to resolve such disputes, and it is to the credit of Bayesianism that it sharpens such disputes rather than disguising them.

    Things get really interesting if the disputants accept each other’s priors. This is what I think of as “dialectical priors”. Then each has a target to shoot for in terms of the quantity and quality of data required to persuade their opponents.

    This could go two ways. First, one side could gather sufficient evidence to persuade even using the other side’s priors. There is then no room to quibble over the choice of priors. Alternatively, hypothetical calculations may show that a prior distribution is so skeptical that even obviously sufficient evidence cannot overwhelm it. Maintaining such a prior would thus be exposed as an attempt to reject empiricism. I think that the interest lies in the cases where one thinks that a hypothesis is absurd and wishes to write in a big fat zero, damning the hypothesis beyond empirical rescue. Well, zero will not do, but what small number do you choose a priori? Too small and you appear unreasonably stubborn; no one will care whether you are persuaded. Too large and you will be forced to concede even though you hunger for more data.

    There is perhaps a third possibility. Research is difficult and expensive. Perhaps, considering the priors advocated by the parties to the dispute, one may find that realistic amounts of data are unlikely to overwhelm either side’s prior. I need to do some calculations to get a feel for this. I suspect it only arises when there is lots of “noise”, for example a treatment that is claimed to cut a death rate from 55% to 45%, or a theory in sociology that is claimed to be true on average but is prey to many confounding factors that need to be “averaged out”. I believe it is realistic to sometimes reach an impasse in which the available funds for research are insufficient to allow the issue to be decided. It is good for a statistical methodology to reflect this by allowing each side to stick with their a priori judgement.

  • http://profile.typekey.com/andrewgelman/ Andrew

    Alan,

    I would focus as much on subjectivity in the choice of likelihood as in the choice of prior. More generally, with hierarchical models a single prior (or population) distribution can apply to many settings, which somewhat allays the concerns in the second-to-last paragraph of your comment. See Section 2.8 of Bayesian Data Analysis for an example.
