Another important bias in academic consensus is overconfidence. Even in the hardest of hard science, Henrion and Fischhoff showed in 1986 (ungated for now here) that published error estimates for fundamental constants of physics were seriously overconfident. Looking at 306 estimates for particle properties, 7% were outside of a 98% confidence interval (where only 2% should be). In seven other cases, each with 14 to 40 estimates, the fraction outside the 98% confidence interval ranged from 7% to 57%, with a median of 14%.
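To make the "surprise index" concrete: it is simply the fraction of reported confidence intervals that fail to cover the value later accepted as correct. A minimal sketch in Python, using made-up numbers (the estimates and the accepted value below are hypothetical, not taken from the paper), and assuming normal errors so that a 98% interval is roughly ±2.326 standard errors:

```python
def surprise_index(estimates, accepted_value, z=2.326):
    """Fraction of (point estimate, standard error) pairs whose nominal
    98% confidence interval fails to cover the later accepted value.
    z = 2.326 is the two-sided 98% normal quantile."""
    misses = sum(
        1 for est, se in estimates
        if abs(est - accepted_value) > z * se
    )
    return misses / len(estimates)

# Hypothetical measurements of a constant whose accepted value is 1.0:
estimates = [(1.001, 0.002), (0.990, 0.003), (1.012, 0.004), (1.002, 0.005)]
print(surprise_index(estimates, 1.0))  # 0.5: two of the four intervals miss
```

A well-calibrated literature would show a surprise index near 0.02 on 98% intervals; Henrion and Fischhoff's point is that the observed indices run far higher.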

The root of my initial skepticism towards global warming actually stems from the "coming ice age" scare laid on me as an elementary student in the 70s.

The only thing that I am 100% convinced of now is that certain entities couldn't care less about the planet, but that this could prove to be an effective economic tool against the US and/or capitalism in general.

Douglas, I meant publication selection bias. Yes, individual experimentalists might inappropriately select data to report.

Perry E Metzger: 'Overestimation bias is also a bias. It is, however, a "silent bias" in that metastudies are less likely to find it because fewer measurements will lie outside each other's error bars.'

I suspect that I'm completely misunderstanding you. It seems to me that this kind of study can just as easily find underconfidence, if too many studies fall within the error bars. It may be more useful to figure out which groups and methods are overconfident, but it is cheap and, I think, quite reasonable to assume uniform overconfidence.

Robin Hanson: Your narrow claim that all studies get published doesn't eliminate selection bias. In "Cargo Cult Science," Feynman blames the history of measurements of the charge of the electron on a type of selection bias following the initial error. He also seems to say that physicists have learned from this mistake, but that seems like a tall order.

John, yes, publication selection bias could produce a similar effect, but in fact in this sort of physics almost all the studies get published.

Couldn't the evidence for "academic overconfidence" cited here be partly a footprint of selection biases in what gets published?

Hypothetically, each paper could contain correctly computed confidence intervals. But if certain studies get published by showing an aberrant point estimate calculation, then it will appear as if the published studies have confidence intervals that are too narrow.

(For a similar dynamic, see http://www.slate.com/id/210... "Ordinarily, studies with large sample sizes should be more convincing than studies with small sample sizes. Following the fates of 10,000 workers should tell you more than following the fates of 1,000 workers. But with the minimum-wage studies, that wasn't happening...")

Gustavo, the paper I cite also shows that a lot less than the expected 50% of cases fall in the interquartile range.

Perry said: "the statistical model for computing the 98% confidence interval isn't as reliable as one would like".

Indeed, fat tails could explain these results. I remember reading somewhere that fat-tailed distributions are much more common than people think (one family of distributions in particular). I don't remember the name of the distribution for sure, but it was a generalization of the Gaussian, and it was also the limit distribution for its own family (there was an analog of the Central Limit Theorem).
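To illustrate the fat-tails point: if errors are actually heavy-tailed, a nominal Gaussian 98% interval will be surprised more than 2% of the time even when everything else is done correctly. A quick simulation sketch, using a Student-t distribution with 3 degrees of freedom purely as a stand-in for "some heavy-tailed family" (it may or may not be the distribution the commenter half-remembers):

```python
import math
import random

random.seed(0)

def student_t3():
    """One draw from a Student-t distribution with 3 degrees of freedom,
    built from standard normals: t = Z / sqrt(chi2_3 / 3)."""
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / math.sqrt(chi2 / 3)

n = 200_000
scale = math.sqrt(3)  # t with 3 df has variance 3; rescale to unit variance
hits = sum(1 for _ in range(n) if abs(student_t3() / scale) > 2.326)
print(hits / n)  # noticeably above the nominal 2% for a Gaussian
```

The exact excess depends on how heavy the tails are, but the direction is always the same: too many "surprises" relative to the Gaussian calculation.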

Perry, I too hope that future journal papers, as well as being available online, will also include all raw data. But:

1) Calculating confidence intervals, by itself, throws away an insane amount of information compared to the raw data.

2) You could just as easily say: "Confidence interval XYZ was calculated using statistical model Q, which has a historical surprise index in physics of 7% (Henrion and Fischhoff 1986)." Then the same information is transmitted as in the current case, but it comes with an appropriate caution.

Perry, your argument is worth an entire post responding to, and I intend to do so soon.

Robin, I'm not sure that is true. What you are saying, essentially, is that after seeing that a number of estimates of some constant do not fall within each other's error bars, physicists should then increase the size of the error bars. I don't think that is reasonable.

Not all methods of measurement are identical, and different groups use different instruments, so the systematic errors made by different groups are different. That means that it is not necessarily the case that all groups are underestimating their errors -- in fact, it is most likely that only some of them are underestimating error. Increasing your error based on the "gut feel" that it is not large enough is no more scientific than underestimating it.

You are, in effect, promoting a bias towards overestimation of error. Overestimation bias is also a bias. It is, however, a "silent bias" in that metastudies are less likely to find it because fewer measurements will lie outside each other's error bars. Overestimation, however, means that it is harder for an outside observer to determine where the different experiments are clustering, and thus makes it harder to figure out where the most likely figure actually lies! The price for overestimation is loss of valuable information!

Generally speaking, I favor a different solution entirely, which is documenting your calculation, so third parties can see what your assumptions were. A silent assumption is possibly deceptive, but a documented assumption is not, especially since you have to make some assumptions in almost all scientific measurement.

In the old days, when reports had to be published entirely on paper and it was difficult to put all the information you collected into your publication, this was impossible. Now that scientific publications routinely accompany the main article with supplementary data published online, I think we should simply go all the way and provide the reader with as much raw data as possible along with the exact assumptions and methods used to reduce it. Third parties can then know what assumptions you made, and can make their own assessment of what your error bars really mean.

Perry, estimating is complicated and hard work. But this is an excuse for error, not for bias. Once physicists can see the track record of overconfidence, they have an opportunity to correct it by increasing their estimates of systematic error. If their bias continues, it shows they choose not to increase their estimates enough.

Eliezer & Robin: I would like to point out that, regardless of incentives, it is just plain hard to estimate systematic error. Math can't really help you here, and what math can't help with is difficult to do in an absolutely objective way. You generally can't run a calculation and get an estimate of your systematic error, and you can't get an objective estimate of the probability distribution of your systematic error either.

Metrology is difficult stuff. One of the reasons why people like it when measurements get repeated both with the same and differing methodologies is the understanding that systematic error is difficult to judge. The implicit hope is that, across multiple methods and multiple experimenters, systematic error will be somewhat randomly distributed, thus giving the community the ability to judge the true value of the observable.

Some kinds of bias are easy enough to remove, but some are not. The easiest kinds to remove can be removed with objective methods, such as mathematical methods. Those that cannot be removed with objective methods are not nearly so straightforward. Often, one uses a flawed method, such as a statistical calculation based on assumptions that are known not to be strictly true, on the basis that the best calculation you can make is better than no calculation at all. So far as I can tell, that is in fact generally true. We would rather know the value of the gravitational constant roughly than not to know it at all, for example...

Bruce, I hope we will have more posts in the future on biases in the medical literature.

In general terms, scientists as individuals are prone to be overconfident in their own work, but that is merely psychological and related to motivation. Criticism is best and most appropriate when it comes from others.

The problem arises when a whole scientific field becomes overconfident, and this sometimes happens due to overvaluing a specific methodology (or, it could be a theory). Any methodology has a systematic bias, and if other methods are denigrated this error can become cumulative.

An example from my experience is the randomized controlled trial (RCT) in medicine. As a source of evidence concerning the effectiveness and effect size of therapies, the RCT suffers from selection biases (i.e., biased recruitment to trials). This systematic bias is amplified by meta-analysis of randomized trials.

And this systematic error of RCTs has been substantially uncorrected because part of the rationale for randomized trials is that they are intrinsically superior to any potentially corrective alternative research methodologies (such as surveys, cohort studies, animal experiments etc).

Usually the correction eventually comes from market forces - because where bias is uncorrected there are opportunities for young scientists to make a reputation by overturning the status quo.

Wall Street economists exhibit similar overconfidence biases when forecasting. Two quick examples, one on year-ahead GDP estimates and the other on monthly employment reports ...

(1) Participants in the Wall Street Journal "Monthly Economic Forecasting Survey" assign far too high a probability to the accuracy of their respective forecasts.

As part of the survey, the Wall Street Journal phrases an "Economists' Confidence" question as follows: "As an indication of your confidence level in your GDP projections in this survey, please estimate on a scale of 0 to 100 the probability that officially reported GDP growth, on average, will fall within 0.5% of your GDP forecasts for those four quarters." In the year or two in which this question has been asked, the average response has been 64%. (Amusingly, in all these surveys there is a delusional -- or perhaps clairvoyant -- participant who responds 100%.) By penciling in that 64% response, forecasters imply that their year-ahead GDP estimates have root-mean-squared errors just above 0.5%.
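The arithmetic behind "just above 0.5%": if forecast errors are assumed roughly normal with standard deviation sigma, then a 64% chance of landing within ±0.5 points pins down the implied sigma. A quick check:

```python
from statistics import NormalDist

# Assumption: normally distributed forecast errors with std dev sigma.
# P(|error| < 0.5) = 0.64 means the central band +/-0.5 holds 64% of the
# mass, i.e. the upper edge sits at the 82nd percentile: 0.5/sigma = z
# where NormalDist().cdf(z) = 0.82.
z = NormalDist().inv_cdf(0.82)  # about 0.915
implied_sigma = 0.5 / z
print(round(implied_sigma, 2))  # 0.55 -- "just above 0.5%", as stated
```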

But one-year ahead consensus GDP forecasts have historically missed by over 1%. Given the tendency for consensus projections to show greater accuracy than individual estimates, the average forecast miss for each individual would be even higher than 1%.

A much larger information set than is available would be necessary to generate projections at the degree of accuracy these forecasters assume in their WSJ survey responses. On the day of quarterly (annualized) advance GDP releases, the root-mean-squared forecast error of the Bloomberg-compiled consensus from 1998-2004 was 0.8%. These are estimates, made in anticipation of the official BEA release, of GDP growth in the quarter that *just finished a month ago*. This result translates to a root-mean-squared forecast error of 0.4% for average growth over the year, near the root-mean-squared error that the WSJ survey implies for the *year-ahead forecast*.
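The translation from a 0.8% quarterly error to 0.4% for the year's average rests on treating the four quarterly errors as roughly independent, so that averaging shrinks the RMSE by sqrt(4):

```python
import math

# Assumption: the four quarterly forecast errors are roughly independent,
# so the RMSE of their average shrinks by sqrt(4). Positively correlated
# errors would shrink less, making the year-average error even larger.
quarterly_rmse = 0.8
annual_avg_rmse = quarterly_rmse / math.sqrt(4)
print(annual_avg_rmse)  # 0.4
```

If anything, quarterly forecast errors tend to be positively correlated, so 0.4% is a lower bound and the comparison only strengthens.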

(2) Before the September non-farm payrolls report, the NABE conducted a pari-mutuel auction. (See http://www.nabe.com/press/p... ) The clearing prices suggested an expectation of 124k for private payrolls growth, with only a 25% chance of a gain below 75k or above 174k. Over the last few years, though, Bloomberg-compiled consensus forecasts have missed by more than 50k 60% of the time.

I confess that I am not much acquainted with how physicists, in particular, estimate their confidence intervals. But if these are genuinely subjective, human-estimated confidence intervals, how on Earth did the physicists end up with a surprise index as *low* as 7% on a 98% confidence interval? And if they are formally calculated confidence intervals - which one would expect of a published paper - then the essential point is the same: *some* kind of correlation in the errors is being ignored by the model class. It's all well and good to say "don't ignore systematic errors," but apparently even hard scientists fail to account for them enough. So long as journals want confidence intervals produced by calculation from statistical models of less than AGI-level complexity, those confidence intervals will probably continue to be over-narrow.
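One way to see how ignored error correlation narrows intervals: if every reading in an experimental run shares a systematic offset, an interval built from the within-run scatter alone will miss the true value far more often than its nominal 2% rate. A toy simulation (all magnitudes illustrative):

```python
import random
from statistics import mean, stdev, NormalDist

random.seed(2)

TRUE = 0.0
Z98 = NormalDist().inv_cdf(0.99)  # about 2.326, for a 98% interval
TRIALS, N = 20_000, 10

misses = 0
for _ in range(TRIALS):
    # A shared systematic offset, identical for all N readings in a run,
    # is invisible to the within-run scatter the interval is built from.
    systematic = random.gauss(0.0, 1.0)
    readings = [TRUE + systematic + random.gauss(0.0, 1.0) for _ in range(N)]
    m = mean(readings)
    se = stdev(readings) / N ** 0.5  # standard error from scatter only
    if abs(m - TRUE) > Z98 * se:
        misses += 1

print(misses / TRIALS)  # far above the nominal 2%
```

With the systematic term as large as the random noise here, roughly half the "98%" intervals miss, which is exactly the kind of gross miscalibration a surprise-index study would pick up.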