A common heuristic for estimating the quality of something is: what has it done for me lately? For example, you could estimate the quality of a restaurant via a sum or average of how much you’ve enjoyed your meals there. Or you might weight recent visits more, since quality may change over time. Such methods are simple and robust, but they aren’t usually the best. For example, if you know of others who ate at that restaurant, their meal enjoyment is also data, data that can improve your quality estimate. Yes, those other people might have different meal priorities, and that may be a reason to give their meals less weight than your meals. But still, their data is useful.
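The recency-weighted version of this heuristic can be sketched as an exponentially weighted average. (The decay rate and ratings below are arbitrary illustrations, not from the text.)

```python
def quality_estimate(ratings, decay=0.9):
    """Recency-weighted average: the most recent rating gets weight 1,
    the one before it weight `decay`, the one before that decay**2, etc."""
    weights = [decay ** age for age in range(len(ratings))]
    # ratings[-1] is the most recent visit, so pair it with weight 1
    total = sum(w * r for w, r in zip(weights, reversed(ratings)))
    return total / sum(weights)

# Three visits, oldest first: quality seems to have improved recently,
# so the weighted estimate sits above the plain average.
recent_heavy = quality_estimate([5.0, 6.0, 9.0])
plain = sum([5.0, 6.0, 9.0]) / 3
```

With `decay=1.0` this reduces to the plain average; smaller decay values trust recent meals more.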

## Big Impact Isn’t Big Data

I formulated it as a two-tailed null for simplicity. Stating it as one-tailed is trivial. (The probability of a Trump win was not as assigned by the prediction market or less.)

But as you formulated it, rejecting the null here would be saying that it is untrue that both p(A)=market odds and p(B)=market odds. It doesn't tell you about the probability that at least one was right, for instance.

And that means that as a hypothesis test, it tells you not to assume that both of two predictions will be exactly correct. Is that what you claimed originally?

The null hypothesis is that the probability assigned by the prediction market is the true probability. Since your concern is with the form of the null, we don't need to look at joint probability. If a prediction market assigned a .05 probability to a Trump win (this is hypothetical), and Trump won, that would mean that random fluctuations cannot account for the prediction market's failure to predict within standard levels of significance.

You're looking in the wrong place. If there's a problem with my argument, it concerns the universe to which the test pertains (the cherry-picking problem), not combining probabilities or the mechanics of hypothesis testing.

They were miscalibrated, which is different than incorrect. You'd want a proper scoring rule to differentiate between good and bad predictions - but my point was that hypothesis tests require a null hypothesis, and simply saying the joint probability is < 0.05 doesn't make anything falsified.
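To make "proper scoring rule" concrete, here is a minimal Brier-score sketch using the two forecasts discussed in this thread (0.1 for Trump, 0.25 for Brexit, both events occurring); lower is better:

```python
def brier(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    The Brier score is proper: in expectation it is minimized by
    reporting your true probability."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 1]                       # both events happened
market = brier([0.10, 0.25], outcomes)  # the market's forecasts: 0.68625
coin = brier([0.50, 0.50], outcomes)    # an uninformative 50/50 forecaster: 0.25
```

On these two events alone the market scores worse than a coin flip, but two observations say very little about its calibration over all events.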

OK, let’s specify this test a bit more clearly. What’s your null hypothesis? That the pair of predictions is no better than chance?

If you can show me a coherent null, specify what test you would use, and show how it can be applied reasonably to this case, I'll stop being "smug about my purported knowledge of probability."

If all of those happen, the predictions were not that good.

Subsidize? No. Not by taxing me, thank you. Legalize? Yes! Let the "market" decide!

OK, but however you combine the probabilities, it's going to be below .1. If my natural experiment argument is sound, significance at the .1 level is still cause for concern.


As someone who spends a lot of time trying to predict financial markets, let me assure you that uncorrelated errors is basically never a good assumption.

Don't be so smug about your purported knowledge of probability. The very simple answer to your "counter-example" is that the appropriate test is one-tailed.

The logic is very simple and requires no deep knowledge of probability or philosophical commitment to frequentism. The only possible objection to my reasoning is that the errors are correlated. (I think uncorrelated errors is a due assumption if you accept the logic of prediction markets.)

We should subsidize markets until the marginal cost of adding new info equals the social marginal value. That should on average be larger subsidies for more important questions.

That's not how probability works, even if you accept the paradigm of null-hypothesis-significance-testing.
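For what it's worth, the textbook way to combine independent p-values is Fisher's method rather than a raw product; a sketch with this thread's numbers, assuming the two events are independent:

```python
import math

def fisher_combined(pvalues):
    """Fisher's method: under the null, -2 * sum(ln p_i) follows a
    chi-squared distribution with 2k degrees of freedom for k
    independent p-values."""
    stat = -2 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)
    # The chi-squared survival function for even df = 2k has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!
    half = stat / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

raw_product = 0.10 * 0.25             # 0.025
combined = fisher_combined([0.10, 0.25])  # ~0.117, above 0.05
```

Note that the combined p-value here lands around 0.117, well above the raw product of 0.025 — multiplying probabilities overstates the evidence against the null.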

Simple counterexample: if markets had predicted 11 events as 75% likely, and all happen, the conjunction is ~0.042. Your procedure says that invalidates them.
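A quick check of the arithmetic behind that counterexample:

```python
p, n = 0.75, 11

# Probability that all 11 events occur if the forecasts are exactly
# right: 0.75**11 ~ 0.0422, below the 0.05 threshold.
all_happen = p ** n

# Any *specific* pattern with exactly one miss is less likely
# (~0.0141), so "all 11 happen" is the single most probable outcome
# pattern -- yet the conjunction rule would "reject" the forecaster on it.
one_miss_pattern = p ** (n - 1) * (1 - p)

assert all_happen < 0.05
assert one_miss_pattern < all_happen
```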

Brexit and Trump represent a natural experiment: the two most significant recent electoral events by far (as measured by popular interest - personally, I think their importance is overstated). The prediction markets gave Trump a .1 probability and Brexit a .25 probability. The conjunction of these probabilities is .025, so this natural experiment rejects the null hypothesis beyond the scientifically standard .05 level. (I base prediction market stats on Predictwise, which draws on various betting markets.)

[By "natural experiment" I refer to the indications that these aren't cherry-picked outcomes. To the extent that an experiment is natural, it is less cherry-picked than published studies.]

The other key point is that binary outcomes aren't a very information-rich / useful way of judging a prediction model. Predicting 51%-49% when the result is 49%-51% is a better fit than predicting 44%-56%, even though the second got the "answer" correct.

Since as you noted people in practice weight 'more important' events more heavily when judging the quality of a prediction machine, should we subsidize prediction markets whose outcomes are important to predict?