Big Impact Isn’t Big Data

A common heuristic for estimating the quality of something is: what has it done for me lately? For example, you could estimate the quality of a restaurant via a sum or average of how much you’ve enjoyed your meals there. Or you might weight recent visits more, since quality may change over time. Such methods are simple and robust, but they aren’t usually the best. For example, if you know of others who ate at that restaurant, their meal enjoyment is also data, data that can improve your quality estimate. Yes, those other people might have different meal priorities, and that may be a reason to give their meals less weight than your meals. But still, their data is useful.

Consider an extreme case where one meal, say your wedding reception meal, is far more important to you than the others. If you weight your meal experiences in proportion to meal importance, your whole evaluation may depend mainly on one meal. Yes, if meals of that important type differ substantially from other meals, then this method best avoids the bias that comes from using unimportant types of meals to judge important types. But the noise in your estimate will be huge; individual restaurant meals can vary greatly for many random reasons even when the underlying quality stays the same. You just won’t know much about meal quality.
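To put a rough number on that noise, here is a minimal sketch. It assumes, purely for illustration, that each meal's enjoyment is the restaurant's true quality plus independent noise of equal variance, and it uses made-up weights: 100 for the wedding meal and 1 for each of nine ordinary meals. Under that assumption, the variance of an importance-weighted average is the per-meal noise variance divided by an "effective sample size", (sum of weights)^2 / (sum of squared weights).

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def weighted_mean_variance(weights, noise_variance=1.0):
    """Variance of a weighted average of independent, equally noisy observations."""
    return noise_variance / effective_sample_size(weights)


# Hypothetical weights: ten equally important meals, versus one wedding
# meal that counts 100 times as much as each of nine ordinary meals.
equal_weights = [1.0] * 10
wedding_weights = [100.0] + [1.0] * 9

print("equal weights:  ", effective_sample_size(equal_weights))
print("wedding weights:", round(effective_sample_size(wedding_weights), 2))
```

With these illustrative weights, ten meals carry about as much information as 1.2 equally weighted meals, so the estimate is nearly as noisy as judging the restaurant on a single visit.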

I mention all this because many seem eager to give the recent presidential election (and the recent Brexit vote) a huge weight in their estimate of the quality of various prediction sources. Sources that did poorly on those two events are judged to be poor sources overall. And yes, if these were by far the most important events to you, this strategy avoids the risk that familiar prediction sources have a different accuracy on events like this than they do on other events. Even so, this strategy mostly just puts you at the mercy of noise. If you use a small enough set of events to judge accuracy, you just aren’t going to be able to see much of a difference between sources; you will have little reason to think that those sources that did better on these few events will do much better on other future events.
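A small worked example may help show how little two events can tell you. The setup is entirely hypothetical: every event has a true 30% chance of happening, one source forecasts exactly 30%, and a second, overconfident source forecasts 10%. The function name and all the numbers are assumptions chosen for illustration. The sketch computes, exactly, the chance that the less accurate source ends up with the strictly better total Brier score (squared error between forecast and outcome) over n independent events.

```python
from fractions import Fraction
from math import comb


def prob_worse_source_wins(n, true_p, good_forecast, bad_forecast):
    """Chance that bad_forecast gets a strictly lower total Brier score
    than good_forecast over n independent events with probability true_p."""
    def brier(forecast, outcome):
        return (forecast - outcome) ** 2

    total = Fraction(0)
    for k in range(n + 1):  # k = number of events that actually happen
        good = k * brier(good_forecast, 1) + (n - k) * brier(good_forecast, 0)
        bad = k * brier(bad_forecast, 1) + (n - k) * brier(bad_forecast, 0)
        if bad < good:
            # Binomial probability of exactly k events happening.
            total += comb(n, k) * true_p**k * (1 - true_p) ** (n - k)
    return total


# Hypothetical numbers: true chance 30%, one source says 30%, another says 10%.
true_p = Fraction(3, 10)
calibrated = Fraction(3, 10)
overconfident = Fraction(1, 10)

for n in (2, 10, 100):
    p = prob_worse_source_wins(n, true_p, calibrated, overconfident)
    print(n, "events:", round(float(p), 3))
```

In this toy setup the overconfident source looks better on two events 49% of the time, close to a coin flip, and it takes something like a hundred events before that chance becomes small.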

Me, I don’t see much reason to think that familiar prediction sources have an accuracy that is very different on the most important events, relative to other events, and so I mainly trust comparisons that use a lot of data. For example, on large datasets prediction markets have shown a robustly high accuracy compared to other sources. Yes, you might find other particular sources that seem to do better in particular areas, but you have to worry about selection effects – how many similar sources did you look at to find those few winners? And if prediction market participants became convinced that these particular sources had high accuracy, they’d drive market prices to reflect those predictions.

  • Hugh Parsonage

    Since as you noted people in practice weight ‘more important’ events more heavily when judging the quality of a prediction machine, should we subsidize prediction markets whose outcomes are important to predict?

    • We should subsidize markets until the marginal cost of adding new info equals the social marginal value. That should on average be larger subsidies for more important questions.

      • Robert Koslover

        Subsidize? No. Not by taxing me, thank you. Legalize? Yes! Let the “market” decide!

  • davidmanheim

    The other key point is that binary outcomes aren’t a very information-rich / useful way of judging a prediction model. Predicting 51%-49% when the result is 49%-51% is a better fit than predicting 44%-56%, even though the second got the “answer” correct.

  • Brexit and Trump represent a natural experiment: the two most significant recent electoral events by far (as measured by popular interest – personally, I think their importance is overstated). The prediction markets gave Trump a .1 probability and Brexit a .15 probability. The conjunction of these probabilities is .015. Or perhaps .25 is a better estimate of what the markets said about Brexit, in which case the probability of both occurring under the null hypothesis is .025. This natural experiment rejects the accuracy of prediction markets at the scientifically standard .05 level.

    [By “natural experiment” I refer to the indications that these aren’t cherry-picked outcomes.]

    • davidmanheim

      That’s not how probability works, even if you accept the paradigm of null-hypothesis-significance-testing.

      Simple counterexample; if markets had predicted 11 events as 75% likely, and all happen, the conjunction is ~0.042. Your procedure says that invalidates them.

      • Don’t be so smug about your purported knowledge of probability. The very simple answer to your “counter-example” is that the appropriate test is one-tailed.

        The logic is very simple and requires no deep knowledge of probability or philosophical commitment to frequentism. The only possible objection to my reasoning is that the errors are correlated. (I think uncorrelated errors is a due assumption if you accept the logic of prediction markets.)

      • sflicht

        As someone who spends a lot of time trying to predict financial markets, let me assure you that uncorrelated errors is basically never a good assumption.

      • OK, but however you combine the probabilities, it’s going to be below .1. If my natural experiment argument is sound, significance at the .1 level is still cause for concern.

      • davidmanheim

        OK, let’s specify this test a bit more clearly. What’s your null hypothesis? That the pair of predictions is no better than chance?

        If you can show me a coherent null, specify what test you would use, and show how it can be applied reasonably to this case, I’ll stop being “smug about my purported knowledge of probability.”

      • The null hypothesis is that the probability assigned by the prediction market is the true probability. Since your concern is with the form of the null, we don’t need to look at joint probability. If a prediction market assigned a .05 probability to a Trump win (this is hypothetical), and Trump won, that would mean that random fluctuations cannot account for the prediction market’s failure to predict within standard levels of significance.

        You’re looking in the wrong place. If there’s a problem with my argument, it concerns the universe to which the test pertains (the cherry-picking problem), not combining probabilities or the mechanics of hypothesis testing.

      • davidmanheim

        But as you formulated it, rejecting the null here would be saying that it is untrue that both p(A)=market odds and p(B)=market odds. It doesn’t tell you about the probability that at least one was right, for instance.

      • No, I formulated it as a two-tailed null for simplicity. Stating it as one-tailed is trivial. (The probability of a Trump win was .05 or less.)

      • Petter

        If all of those happen, the predictions were not that good.

      • davidmanheim

        They were miscalibrated, which is different than incorrect. You’d want a proper scoring rule to differentiate between good and bad predictions – but my point was that hypothesis tests require a null hypothesis, and simply saying the joint probability is < 0.05 doesn't make anything falsified.
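The arithmetic in the thread above can be checked with a short sketch. It reproduces the joint probabilities quoted by the commenters (treating events as independent, which is itself disputed in the thread) and contrasts them with a proper scoring rule, the mean Brier score. The function names and the always-50% comparison forecaster are assumptions added for illustration; they are not part of the original discussion.

```python
from math import prod


def joint_probability(probabilities):
    """Probability that all events occur, assuming they are independent."""
    return prod(probabilities)


def mean_brier(forecasts, outcomes):
    """Mean Brier score: average squared gap between forecast and outcome.
    Lower is better; always forecasting 50% scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Figures quoted in the thread: market probabilities for Trump and Brexit.
print(round(joint_probability([0.10, 0.15]), 3))
print(round(joint_probability([0.10, 0.25]), 3))

# Counterexample from the thread: 11 events each forecast at 75%, all happen.
eleven_forecasts = [0.75] * 11
eleven_outcomes = [1] * 11
print(round(joint_probability(eleven_forecasts), 3))

# Brier scores, compared with a hypothetical forecaster who always says 50%.
print(mean_brier(eleven_forecasts, eleven_outcomes))  # the 75% forecaster
print(mean_brier([0.5] * 11, eleven_outcomes))        # the 50% forecaster
print(mean_brier([0.10, 0.15], [1, 1]))               # market, Trump and Brexit
```

In the 11-event counterexample the joint probability falls below .05, yet the 75% forecaster's Brier score (0.0625) beats the 50% forecaster's (0.25), which is the distinction between miscalibrated and bad. On the two quoted events alone the market's score is about 0.766, worse than 0.25, but that comparison rests on just two outcomes.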