Academic Overconfidence

Another important bias in academic consensus is overconfidence. Even in the hardest of hard sciences, Henrion and Fischhoff showed in 1986 (ungated for now here) that published error estimates for fundamental constants of physics were seriously overconfident. Of 306 estimates of particle properties, 7% fell outside a 98% confidence interval (where only 2% should). In seven other cases, each with 14 to 40 estimates, the fraction outside the 98% confidence interval ranged from 7% to 57%, with a median of 14%.
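
As a rough check on how far off these intervals were, here is a minimal sketch (an illustration of the arithmetic, not a calculation from the paper) comparing the observed miss rate with what calibrated 98% intervals would imply, treating the 306 estimates as independent, which is of course a simplification:

    # If 98% confidence intervals were well calibrated, misses among 306
    # independent estimates would be roughly Binomial(306, 0.02).
    from scipy.stats import binom

    n, p = 306, 0.02               # number of estimates, nominal miss rate
    observed = round(0.07 * n)     # about 21 estimates outside their intervals

    expected = n * p                         # about 6 expected misses
    p_value = binom.sf(observed - 1, n, p)   # P(at least `observed` misses)
    print(f"expected ~{expected:.0f} misses, saw ~{observed}; "
          f"chance of that many or more if calibrated: {p_value:.1g}")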

Last week’s New Scientist described a dramatic example with policy implications (ungated for now here):

In July 1971, Stephen Schneider, a young American climate researcher at NASA’s Goddard Space Flight Center in New York, made headlines in The New York Times when he warned of a coming cooling that could "trigger an ice age". … The US National Academy of Sciences reported "a finite probability that a serious worldwide cooling could befall the Earth within the next 100 years". … It is often claimed today that the fad for cooling was a brief interlude propagated by a few renegade researchers or even that the story is a myth invented by today’s climate sceptics. It wasn’t. There was good science behind the fears of global cooling.  … 

All this raises an alarming question. If climatologists were so wrong then, why should we believe them now? As those who played a part in the cooling scare now readily admit, those early studies were based on flimsy data collected by very few, often young, researchers. In 1971, when Schneider’s paper appeared, he was instantly regarded as a world expert. It was his first publication.  Today, vastly more research has been done into how and why climate changes. The consensus on warming is much bigger, much broader, much more sophisticated in its science and much longer-lasting than the spasm of concern about cooling.

This is too pat an answer.  Yes, we have more data now, but the issue is our tendency to claim more than our data can support.  I’m not saying global warming is wrong, just that we should be less confident than academics suggest.   

  • Perry E. Metzger

    The fundamental constants thing isn’t surprising, because the statistical model for computing the 98% confidence interval isn’t as reliable as one would like. Let me explain, as someone who’s done measurements in the lab (though not as a real professional).

    The measurement you report and the intervals you give for your confidence are based on the variance in your trial measurements combined with your assumptions about various kinds of systematic error. The latter is important because making lots of measurements only compensates for errors if your errors are random and normally distributed rather than systematic or skewed. In truth, though, your errors aren’t likely to be truly normally distributed, and there are systematic errors that inevitably creep in. Can you wave a magic statistical analysis wand and fix this so your confidence interval is correct? Sometimes you can’t, and you have to go with what you have.
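
    To make this concrete, here is a minimal sketch of how such an error budget is commonly assembled: the statistical term comes from the spread of the trials, the systematic term is combined with it in quadrature, and every number below is invented for illustration. The key point is that the systematic term is an assumption rather than a calculation.

        # Sketch of a typical error budget (all numbers invented).
        import math
        import statistics

        trials = [9.81, 9.79, 9.82, 9.80, 9.78, 9.83]  # hypothetical repeated measurements

        mean = statistics.mean(trials)
        stat_err = statistics.stdev(trials) / math.sqrt(len(trials))  # standard error of the mean

        sys_err = 0.02  # assumed systematic uncertainty: a judgment call, not a calculation

        total_err = math.sqrt(stat_err**2 + sys_err**2)  # combined in quadrature
        print(f"{mean:.3f} +/- {total_err:.3f} (stat {stat_err:.4f}, sys {sys_err:.3f})")

        # Averaging more trials shrinks stat_err but leaves sys_err untouched,
        # so an optimistic sys_err yields an interval that is too narrow.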

    What does this mean? I think we should be careful about labeling such things “bias”. Rather, they are generally just mistakes. The bias, if any, is an unrealistic belief on the part of the reader in the ability of any team to produce a completely accurate estimate of the error bars on their results.

    By the way, I’m rather skeptical of the general power of meta-analysis for re-processing experimental results, if only because I worry that (as often happens) the mathematical assumptions underlying the statistical techniques are frequently not met in the real world, and it is hard to know when they are and aren’t met.

  • http://pdf23ds.net pdf23ds

    Perry, wouldn’t systematic bias in sample errors bias the estimation of the confidence interval to be too broad just as often as too narrow? Yet, if I understand the paper right, the reported confidence intervals were consistently narrower than was justified, which I think can’t be explained by simple mistakes.

  • Matthew

    Maybe they aren’t actually CONSTANTS after all…

  • Douglas Knight

    Shorter Perry E. Metzger:
    Scientists are frequentists, not Bayesians.

    Unfortunately, scientists aren’t sophisticated enough to know the difference, so in other contexts they treat their error bars as absolute.

  • http://profile.typekey.com/sentience/ Eliezer Yudkowsky

    pdf23ds, systematic errors – defined as correlated errors in a particular direction – will always bias the confidence interval to be too narrow, not too broad. The confidence interval is based on observed variance and systematic errors don’t contribute to *observed* variance, while contributing greatly to actual error.

  • http://pdf23ds.net pdf23ds

    Eliezer, could you elaborate?

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    If you assumed there were no systematic error, you would be biased to underestimate total error. Physicists are familiar with the need to estimate systematic error, but this data says they tend to underestimate such error. This is not too surprising, as the incentives they face are to produce measurements with as low an error as possible, and it is very hard for a referee to show that someone has underestimated a systematic error.

  • http://profile.typekey.com/sentience/ Eliezer Yudkowsky

    Pdf, it has to do with the statistical tools that are used to calculate “confidence intervals”. It’s not that physicists are making up the confidence intervals and being psychologically biased. They’re calculating confidence intervals from the data using a particular biased statistical method. The calculation looks at how much the experimental data varies, then uses that and some false statistical assumptions to estimate, say, the likely distance between the mean experimental value and the real value. One of the false assumptions is that errors are uncorrelated. If experimental errors are significantly correlated, the calculation comes out wrong and the confidence intervals are systematically too narrow. For more detail, read up on analysis of variance, bias-variance decomposition, and so on.
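
    A small simulation, offered purely as a sketch and not tied to any particular physics analysis, illustrates this mechanism: give every trial in an experiment the same unmodeled offset, build the interval from the observed scatter alone, and the nominal 98% interval covers the true value far less often than advertised.

        # Coverage of nominal 98% intervals when each experiment also has an
        # unmodeled systematic offset shared by all of its trials.
        import random
        import statistics

        random.seed(0)
        TRUE_VALUE = 1.0
        Z_98 = 2.33  # two-sided 98% normal quantile (approximate)

        covered, n_experiments = 0, 2000
        for _ in range(n_experiments):
            offset = random.gauss(0, 0.05)  # systematic error: identical for every trial
            trials = [TRUE_VALUE + offset + random.gauss(0, 0.05) for _ in range(25)]
            mean = statistics.mean(trials)
            sem = statistics.stdev(trials) / len(trials) ** 0.5  # reflects scatter only
            if abs(mean - TRUE_VALUE) <= Z_98 * sem:
                covered += 1

        print(f"nominal 98% intervals covered the truth "
              f"{100 * covered / n_experiments:.0f}% of the time")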

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Eliezer, you give the impression that physicists ignore systematic error and assume errors are uncorrelated. This is just not true.

  • http://profile.typekey.com/sentience/ Eliezer Yudkowsky

    I confess that I am not much acquainted with how physicists, in particular, estimate their confidence intervals. But if these are genuinely subjective, human-estimated confidence intervals, how on Earth did the physicists end up with a surprise index as *low* as 7% on a 98% confidence interval? And if they are formally calculated confidence intervals – which one would kinda expect of a published paper – then the essential point is the same: *some* kind of correlation in the errors is being ignored by the model class. It’s all well and good not to ignore systematic errors, but apparently even hard scientists don’t account for them enough. So long as journals want confidence intervals to be produced by calculation from statistical models of less than AGI-level complexity, those confidence intervals will probably continue to be over-narrow.

  • John DePalma

    Wall Street economists exhibit similar overconfidence biases when forecasting. Two quick examples, one on year-ahead GDP estimates and the other on monthly unemployment reports…

    (1) Participants in the Wall Street Journal “Monthly Economic Forecasting Survey” assign far too high a probability to the accuracy of their respective forecasts.

    As part of the survey the Wall Street Journal phrases an “Economists’ Confidence” question as follows: “As an indication of your confidence level in your GDP projections in this survey, please estimate on a scale of 0 to 100 the probability that officially reported GDP growth, on average, will fall within 0.5% of your GDP forecasts for those four quarters.” In the year or two in which this question has been asked, the average response has been 64%. (Amusingly, in all these surveys there is a delusional, or perhaps clairvoyant, participant who responds 100%.) By penciling in that 64% response, forecasters assume their year-ahead GDP estimates have root-mean-squared errors just above 0.5%.

    But one-year ahead consensus GDP forecasts have historically missed by over 1%. Given the tendency for consensus projections to show greater accuracy than individual estimates, the average forecast miss for each individual would be even higher than 1%.

    A much bigger information set than is available would be necessary to generate projections with the degree of accuracy that these forecasters assume in their WSJ survey responses. On the day of quarterly (annualized) advance GDP releases, the root-mean-squared forecast error of the Bloomberg-compiled consensus from 1998 to 2004 was 0.8%. These are estimates, made just ahead of the official BEA release, of GDP growth in the quarter that *just finished a month ago*. Assuming roughly independent quarterly errors, this translates to a root-mean-squared forecast error of 0.4% for average growth over the year, near the root-mean-squared error that the WSJ survey responses imply for the *year-ahead forecast*.
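
    For readers who want the arithmetic behind the two numbers above, here is a minimal sketch assuming normally distributed and roughly independent forecast errors, which is of course an idealization:

        # (a) A 64% chance of landing within 0.5% implies an RMSE of about
        #     0.5 / z, where z is the normal quantile with P(|Z| < z) = 0.64.
        from scipy.stats import norm

        z = norm.ppf((1 + 0.64) / 2)
        implied_rmse = 0.5 / z
        print(f"RMSE implied by the 64% survey answer: ~{implied_rmse:.2f}%")

        # (b) A 0.8% RMSE on quarterly (annualized) growth translates, for the
        #     average of four roughly independent quarters, to 0.8 / sqrt(4) = 0.4%.
        print(f"RMSE for average growth over four quarters: {0.8 / 4 ** 0.5:.1f}%")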

    (2) Before the September non-farm payrolls report, the NABE conducted a pari-mutuel auction (see http://www.nabe.com/press/pr061004.html). The clearing prices suggested a 124k expectation for private payrolls growth, with only a 25% chance of a private payrolls gain below 75k or above 174k. Over the last few years, though, Bloomberg-compiled consensus forecasts have missed by more than 50k 60% of the time.

  • http://www.hedweb.com/bgcharlton Bruce G Charlton

    In general terms scientists as individuals are prone to be over-confident in their own work, but that is merely psychological and related to motivation. Criticism is best and most appropriate when it comes from others.

    The problem arises when a whole scientific field becomes overconfident, and this sometimes happens due to overvaluing a specific methodology (or, it could be a theory). Any methodology has a systematic bias, and if other methods are denigrated this error can become cumulative.

    An example from my experience is the randomized controlled trial (RCT) in medicine. As a source of evidence concerning the effectiveness and effect-size of therapies, the RCT suffers from selection biases (i.e., biased recruitment to trials). This systematic bias is amplified by meta-analysis of randomized trials.

    And this systematic error of RCTs has been substantially uncorrected because part of the rationale for randomized trials is that they are intrinsically superior to any potentially corrective alternative research methodologies (such as surveys, cohort studies, animal experiments etc).

    Usually the correction eventually comes from market forces – because where bias is uncorrected there are opportunities for young scientists to make a reputation by overturning the status quo.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Bruce, I hope we will have more posts in the future on biases in the medical literature.

  • Perry E. Metzger

    Eliezer & Robin: I would like to point out that, regardless of incentives, it is just plain hard to estimate systematic error. Math can’t really help you, and that which can’t be helped by math is difficult to do in an absolutely objective way. You generally can’t run a calculation and get an estimate of your systematic error, and you can’t get an objective estimate of the probability distribution of your systematic error either.

    Metrology is difficult stuff. One of the reasons why people like it when measurements get repeated both with the same and differing methodologies is the understanding that systematic error is difficult to judge. The implicit hope is that, across multiple methods and multiple experimenters, systematic error will be somewhat randomly distributed, thus giving the community the ability to judge the true value of the observable.

    Some kinds of bias are easy enough to remove, but some are not. The easiest kinds to remove can be removed with objective methods, such as mathematical methods. Those that cannot be removed with objective methods are not nearly so straightforward. Often, one uses a flawed method, such as a statistical calculation based on assumptions that are known not to be strictly true, on the basis that the best calculation you can make is better than no calculation at all. So far as I can tell, that is in fact generally true. We would rather know the value of the gravitational constant roughly than not to know it at all, for example…

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Perry, estimating is complicated and hard work. But this is an excuse for error, not for bias. Once physicists can see the track record of overconfidence, they have an opportunity to correct it by increasing their estimates of systematic error. If their bias continues, it shows they choose not to increase their estimates enough.
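
    As a sketch of what such a correction could look like, assuming roughly normal errors (a crude assumption, made only for illustration), one could inflate quoted error bars until their historical coverage matches their nominal level; the 93% figure below simply mirrors the 7% surprise rate reported above.

        # Crude recalibration: scale quoted error bars so that intervals sold
        # as 98% would historically have covered the truth 98% of the time.
        from scipy.stats import norm

        nominal = 0.98      # coverage claimed by the published error bars
        historical = 0.93   # coverage actually observed (7% surprises)

        inflation = norm.ppf(0.5 + nominal / 2) / norm.ppf(0.5 + historical / 2)
        print(f"multiply quoted error bars by ~{inflation:.2f}")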

  • Perry E. Metzger

    Robin, I’m not sure that is true. What you are saying, essentially, is that after seeing that a number of estimates of some constant do not fall within each other’s error bars, physicists should then increase the size of the error bars. I don’t think that is reasonable.

    Not all methods of measurement are identical, and different groups use different instruments, so the systematic errors made by different groups are different. That means that it is not necessarily the case that all groups are underestimating their errors — in fact, it is most likely that only some of them are underestimating error. Increasing your error based on the “gut feel” that it is not large enough is no more scientific than underestimating it.

    You are, in effect, promoting a bias towards overestimation of error. Overestimation bias is also a bias. It is, however, a “silent bias” in that metastudies are less likely to find it because fewer measurements will lie outside each other’s error bars. Overestimation also means that it is harder for an outside observer to determine where the different experiments are clustering, and thus makes it harder to figure out where the most likely figure actually lies! The price for overestimation is loss of valuable information!

    Generally speaking, I favor a different solution entirely, which is documenting your calculation, so third parties can see what your assumptions were. A silent assumption is possibly deceptive, but a documented assumption is not, especially since you have to make some assumptions in almost all scientific measurement.

    In the old days, when reports had to be published entirely on paper and it was difficult to put all the information you collected into your publication, this was impossible. Now that scientific publications routinely accompany the main article with supplementary data published online, I think we should simply go all the way and provide the reader with as much raw data as possible along with the exact assumptions and methods used to reduce it. Third parties can then know what assumptions you made, and can make their own assessment of what your error bars really mean.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Perry, your argument is worth an entire post responding to, and I intend to do so soon.

  • http://profile.typekey.com/sentience/ Eliezer Yudkowsky

    Perry, I too hope that future journal papers, as well as being available online, will also include all raw data. But:

    1) Calculating confidence intervals, by itself, throws away an insane amount of information compared to the raw data.

    2) You could just as easily say: “Confidence interval XYZ was calculated using statistical model Q, which has a historical surprise index in physics of 7% (Henrion and Fischhoff 1986).” Then the same information is transmitted as in the current case, but it comes with an appropriate caution.

  • http://www.optimizelife.com Gustavo Lacerda

    Perry said: “the statistical model for computing the 98% confidence interval isn’t as reliable as one would like”.

    Indeed, fat tails could explain these results. I remember reading somewhere about fat-tailed distributions being much more common than people think (one family of distributions in particular). I don’t remember the name of this distribution for sure, but it was a generalization of the Gaussian, and it was also the limit distribution for its own family (there was an analog of the Central Limit Theorem).
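
    As one toy illustration of the fat-tails point, using Student’s t with 3 degrees of freedom merely as a convenient heavy-tailed family (not necessarily the distribution meant above), and matching interquartile ranges as a stand-in for matching the “core” scatter that an error bar usually reflects:

        # Tail mass beyond the Gaussian 98% point for a fat-tailed alternative
        # scaled to the same interquartile range.
        from scipy.stats import norm, t

        z_98 = norm.ppf(0.99)                        # ~2.33, the two-sided 98% Gaussian point
        scale = norm.ppf(0.75) / t.ppf(0.75, df=3)   # match interquartile ranges

        gauss_tail = 2 * norm.sf(z_98)
        fat_tail = 2 * t.sf(z_98 / scale, df=3)

        print(f"Gaussian mass outside the 98% band: {gauss_tail:.1%}")
        print(f"t(3) mass outside the same band:    {fat_tail:.1%}")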

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Gustavo, the paper I cite also shows that a lot less than the expected 50% of cases fall within the interquartile range.

  • John DePalma

    Couldn’t the evidence for “academic overconfidence” cited here be partly a footprint of selection biases in what gets published?

    Hypothetically, each paper could contain correctly computed confidence intervals. But if studies are more likely to be published when they show an aberrant point estimate, then the published studies will appear to have confidence intervals that are too narrow.

    (For a similar dynamic, see http://www.slate.com/id/2103486 “Ordinarily, studies with large sample sizes should be more convincing than studies with small sample sizes. Following the fates of 10,000 workers should tell you more than following the fates of 1,000 workers. But with the minimum-wage studies, that wasn’t happening…”)

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    John, yes, publication selection bias could produce a similar effect, but in fact in this sort of physics almost all the studies get published.

  • Douglas Knight

    Perry E Metzger:
    ‘Overestimation bias is also a bias. It is, however, a “silent bias” in that metastudies are less likely to find it because fewer measurements will lie outside each others error bars.’

    I suspect that I’m completely misunderstanding you. It seems to me that this kind of study can just as easily find underconfidence, if too many studies fall within the error bars.
    It may be more useful to figure out which groups and methods are overconfident, but it is cheap and I think quite reasonable to assume uniform overconfidence.

    Robin Hanson:
    Your narrow claim that all studies get published doesn’t eliminate selection bias. In “Cargo Cult Science,” Feynman blames the history of measurements of the charge of the electron on a type of selection bias following the initial error. He also seems to say that physicists have learned from this mistake, but that seems like a tall order.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Douglas, I meant publication selection bias. Yes, individual experimentalists might inappropriately select data to report.

  • http://www.catbirdseat.typepad.com Ray G

    My initial skepticism towards global warming actually stems from the "coming ice age" scare laid on me as an elementary-school student in the 70s.

    The only thing that I am 100% convinced of now is that certain entities couldn’t care less about the planet, but this could prove to be an effective economic tool against the US and/or capitalism in general.