Academic Stats Prediction Markets
In a column, Andrew Gelman and Eric Loken note that academia has a problem:
Unfortunately, statistics—and the scientific process more generally—often seems to be used more as a way of laundering uncertainty, processing data until researchers and consumers of research can feel safe acting as if various scientific hypotheses are unquestionably true.
They consider prediction markets as a solution, but largely reject them for reasons both bad and not so bad. I’ll respond here to their article in unusual detail. First the bad:
Would prediction markets (or something like them) help? It’s hard to imagine them working out in practice. Indeed, the housing crisis was magnified by rampant speculation in derivatives that led to a multiplier effect.
Yes, speculative market estimates were mistaken there, as were most other sources, and mistaken estimates caused bad decisions. But speculative markets were the first credible source to correct the mistake, and no other stable source had consistently more accurate estimates. Why should the most accurate source be blamed for mistakes made by all sources?
Allowing people to bet on the failure of other people’s experiments just invites corruption, and the last thing social psychologists want to worry about is a point-shaving scandal.
What about letting researchers who compete for grants, jobs, and publications write critical referee reports and publish criticism? Doesn’t that invite corruption too? If you are going to forbid all conflicts of interest because they invite corruption, you won’t have much left to allow. Surely you need to argue that bet incentives are more corrupting than other incentives.
And there are already serious ways to bet on some areas of science. Hedge funds, for instance, can short the stock of biotech companies moving into phase II and phase III trials if they suspect earlier results were overstated and the next stages of research are thus underpowered.
So by your previous argument, don’t you want to forbid such things because they invite corruption? You can’t have it both ways; either bets are good so you want more, or bets are bad so you want less, or you must distinguish the good from the bad somehow.
More importantly, though, we believe that what many researchers in social science in particular are more likely to defend is a general research hypothesis, rather than the specific empirical findings. On one hand, researchers are already betting—not just money (in the form of research funding) but also their scientific reputations—on the validity of their research.
No, the whole problem here that we’d like to solve is that scientific reputations are not tied very strongly to research validity. Folks often gain enviable reputations from publishing lots of misleading research.
On the other hand, published claims are vague enough that all sorts of things can be considered as valid confirmations of a theory (just as it was said of Freudian psychology and Marxian economics that they can predict nothing but explain everything).
Now we have a not-so-bad reason to avoid prediction markets: people are often unclear about what they mean, and they often don’t really want to be clear. And honestly, many of their patrons don’t want them to be clear either. We might create a prediction market on whether what they meant will ever be clear. But they won’t want to pay for it, and others paying for it might just be mean.
And scientists who express great confidence in a given research area can get a bit more cautious when it comes to the specifics.
Yeah, that’s the problem with being clear; you might end up being clearly wrong.
For example, our previous ethics column, “Is It Possible to Be an Ethicist Without Being Mean to People,” considered the case of a controversial study, published in a top journal in psychology, claiming women at peak fertility were three times more likely to wear red or pink shirts, compared to women at other times during their menstrual cycles. After reading our published statistical criticism of this study in Slate, the researchers did not back down; instead, they gave reasons for why they believed their results (Tracy and Beall, 2013). But we do not think that they or others really believe the claimed effect of a factor of 3. For example, in an email exchange with a psychologist who criticized our criticisms, one of us repeatedly asked whether he believed women during their period of peak fertility are really three times more likely to wear red or pink shirts, and he repeatedly declined to answer this question.
What we think is happening here is that the authors of this study and their supporters separate the general scientific hypothesis (in this case, a certain sort of connection between fertility and behavior) from the specific claims made based on the data. We expect that, if forced to lay down their money, they would bet that, in a replication study, women in the specified days in their cycle would be less than three times more likely to wear red or pink, compared to women in other days of the cycle. Indeed, we would not be surprised if they would bet that the ratio would be less than two, or even less than 1.5. But we think they would still defend their hypothesis by saying, first, that all they care about is the existence of an effect and not its magnitude, and, second, that if this particular finding does not replicate, the non-replication could be explained by a sensitivity to experimental conditions.
Those authors might well be right that an expected replication ratio of 1.5 does indeed support their key hypothesis of the existence of a substantial effect with a certain sign. This doesn’t seem a reason not to bet on what that replication ratio would be, conditional on a replication being tried. One could also bet on long-term consensus opinion on this general hypothesis; not all bets have to be about specifics. One could even bet on whether such a long-term consensus opinion will ever form.
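To make "conditional on a replication being tried" concrete, here is a minimal sketch of how such a bet might settle. The function name, stakes, and threshold are all hypothetical illustrations, not a description of any real market's rules; the only idea taken from the text is that the bet is called off, with stakes refunded, if no replication is ever run.

```python
from typing import Optional, Tuple


def settle_conditional_bet(
    stake_under: float,
    stake_over: float,
    threshold: float,
    replication_ratio: Optional[float],
) -> Tuple[float, float]:
    """Settle a two-sided bet on whether a replication's ratio is below `threshold`.

    Hypothetical rules, assumed for illustration only:
    - `replication_ratio` is None if no replication was ever run; the bet is
      then called off and each side gets its own stake back.
    - Otherwise the whole pot goes to whichever side was right.

    Returns (payout_to_under_bettor, payout_to_over_bettor).
    """
    if replication_ratio is None:
        # Condition not met: refund both sides.
        return (stake_under, stake_over)

    pot = stake_under + stake_over
    if replication_ratio < threshold:
        return (pot, 0.0)
    return (0.0, pot)


# Example: one side bets 10 that the ratio comes in under 1.5, the other bets 5 that it won't.
print(settle_conditional_bet(10.0, 5.0, 1.5, None))  # no replication tried
print(settle_conditional_bet(10.0, 5.0, 1.5, 1.2))   # replication finds a ratio of 1.2
```

Because the stakes come back when no replication happens, the price of such a bet reflects only beliefs about what a replication would find, not beliefs about whether anyone will bother to run one.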
In addition, betting cannot be applied easily to policy studies that cannot readily be replicated. For example, a recent longitudinal analysis of an early childhood intervention in Jamaica reported an effect of 42% in earnings (Gertler et al., 2013). The estimate was based on a randomized trial, but we suspect the effect size was being overestimated for the usual reason that selection on statistical significance induces a positive bias in the magnitude of any comparison, and the reported estimate represents just one possible comparison that could have been performed on these data (Gelman, 2013a). So, if the study could be redone under the same conditions, we would bet the observed difference would be less than 42%. And under new conditions (larger-scale, modern-day interventions in other countries), we would expect to see further attenuation and bet that effects would be even lower, if measured in a controlled study using pre-chosen criteria. Given the difficulty in setting up such a study, though, any such bet would be close to meaningless. Similarly, there might be no easy way of evaluating the sorts of estimates that appear from time to time in the newspapers based on large public-health studies.
Bets on new studies might be “meaningless” for evaluating old studies, but we should care more about evaluating policy than about evaluating studies. If large studies will be conducted in the future, prediction market forecasts of what those studies will estimate could be very useful for informing policy, even more useful than the estimates those studies eventually report.
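The statistical point Gelman and Loken lean on, that selecting on statistical significance inflates reported magnitudes, can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the true effect, the standard error, and the number of simulated studies are hypothetical numbers chosen for illustration and are not estimates from the Jamaica study or any other real data.

```python
import random
import statistics


def simulate_significance_filter(
    true_effect: float = 0.1,   # hypothetical true effect
    std_error: float = 0.1,     # hypothetical standard error of each study's estimate
    n_studies: int = 20000,     # number of simulated studies
    z_crit: float = 1.96,       # conventional two-sided 5% cutoff, applied to positive results
    seed: int = 0,
):
    """Simulate many noisy estimates of one fixed effect, then keep only those
    that clear the significance cutoff, as a publication filter might.

    Returns (mean of all estimates, mean of 'significant' estimates,
    number of 'significant' estimates).
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # An estimate is 'significant' here if its z-score exceeds the cutoff.
    significant = [e for e in estimates if e / std_error > z_crit]
    return statistics.mean(estimates), statistics.mean(significant), len(significant)


mean_all, mean_significant, n_significant = simulate_significance_filter()
print(f"mean of all estimates:          {mean_all:.3f}")
print(f"mean of significant estimates:  {mean_significant:.3f}")
print(f"share reaching significance:    {n_significant / 20000:.1%}")
```

Under these assumptions every estimate that survives the filter must exceed 1.96 standard errors, so the surviving average necessarily sits well above the true effect, while the average over all simulated studies stays close to it. That is why someone who believes a reported estimate was selected this way would bet on a smaller number in a replication.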
That said, scientific prediction markets could be a step forward, just because it would facilitate clear predictive statements about replications. If a researcher believes in his or her theory, even while not willing to bet his or her published quantitative finding would reappear in a replication, that’s fine, but it would be good to see such a statement openly made. We don’t know that such bets would work well in practice—the biggest challenge would seem to be defining clearly enough the protocol for any replications—but we find it helpful to think in this framework, in that it forces us to consider, not just what is in a particular past data set, but also what might be happening in the general population.
On that, we can mostly agree.