A common heuristic for estimating the quality of something is: what has it done for me lately? For example, you could estimate the quality of a restaurant via a sum or average of how much you’ve enjoyed your meals there. Or you might weight recent visits more, since quality may change over time. Such methods are simple and robust, but they aren’t usually the best. For example, if you know of others who ate at that restaurant, their meal enjoyment is also data, data that can improve your quality estimate. Yes, those other people might have different meal priorities, and that may be a reason to give their meals less weight than your meals. But still, their data is useful.
Consider an extreme case where one meal, say your wedding reception meal, is far more important to you than the others. If you weight your meal experiences in proportion to meal importance, your whole evaluation may depend mainly on that one meal. Yes, if meals of that important type differ substantially from other meals, then this method avoids the bias of judging important meal types by unimportant ones. But the noise in your estimate will be huge; individual restaurant meals can vary greatly for many random reasons even when the underlying quality stays the same. You just won’t know much about meal quality.
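To make the noise point concrete, here is a minimal sketch in Python with made-up numbers: it compares an estimate that averages twenty meals equally against one that puts most of its weight on a single “wedding” meal. The true quality, noise level, and weights are all illustrative assumptions.

```python
import random

random.seed(0)

TRUE_QUALITY = 7.0   # assumed underlying restaurant quality (illustrative)
NOISE_SD = 2.0       # assumed meal-to-meal randomness (illustrative)
N_MEALS = 20
N_TRIALS = 10_000

def weighted_estimate(weights):
    """One importance-weighted quality estimate from noisy meal experiences."""
    meals = [random.gauss(TRUE_QUALITY, NOISE_SD) for _ in range(N_MEALS)]
    return sum(w * m for w, m in zip(weights, meals)) / sum(weights)

uniform = [1.0] * N_MEALS                  # every meal counts equally
lopsided = [1.0] * (N_MEALS - 1) + [50.0]  # one "wedding" meal dominates

def spread(weights):
    """Standard deviation of the estimate across many simulated histories."""
    estimates = [weighted_estimate(weights) for _ in range(N_TRIALS)]
    mean = sum(estimates) / N_TRIALS
    var = sum((e - mean) ** 2 for e in estimates) / N_TRIALS
    return var ** 0.5

print("std dev, uniform weights :", round(spread(uniform), 3))
print("std dev, one-meal-heavy  :", round(spread(lopsided), 3))
# The lopsided estimate comes out several times noisier, even though both are
# unbiased here, since all meals share the same underlying quality.
```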
I mention all this because many seem eager to give the recent presidential election (and the recent Brexit vote) a huge weight in their estimates of the quality of various prediction sources. Sources that did poorly on those two events are judged to be poor sources overall. And yes, if these were by far the most important events to you, this strategy avoids the risk that familiar prediction sources have a different accuracy on events like this than on other events. Even so, this strategy mostly just puts you at the mercy of noise. If you use a small enough set of events to judge accuracy, you just aren’t going to be able to see much of a difference between sources; you will have little reason to think that those sources that did better on these few events will do much better on other future events.
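To get a rough feel for how little two events can show, here is a small Python sketch with assumed accuracy numbers: one source is genuinely more accurate than another, and we check how often the weaker source nonetheless looks at least as good when judged on just two yes/no events versus a few hundred.

```python
import random

random.seed(1)

GOOD_ACC = 0.70  # assumed long-run accuracy of the better source (illustrative)
OK_ACC   = 0.60  # assumed long-run accuracy of the weaker source (illustrative)
N_TRIALS = 10_000

def score(accuracy, n_events):
    """Number of events the source calls correctly out of n_events."""
    return sum(random.random() < accuracy for _ in range(n_events))

def weaker_looks_as_good(n_events):
    """Fraction of trials where the weaker source ties or beats the better one."""
    count = 0
    for _ in range(N_TRIALS):
        if score(OK_ACC, n_events) >= score(GOOD_ACC, n_events):
            count += 1
    return count / N_TRIALS

print("weaker source ties or wins on   2 events:", weaker_looks_as_good(2))
print("weaker source ties or wins on 300 events:", weaker_looks_as_good(300))
# With only two events the weaker source ties or beats the better one well over
# half the time; with a few hundred events the genuinely better source almost
# always comes out ahead.
```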
Me, I don’t see much reason to think that familiar prediction sources have an accuracy that is very different on the most important events, relative to other events, and so I mainly trust comparisons that use a lot of data. For example, on large datasets prediction markets have shown a robustly high accuracy compared to other sources. Yes, you might find other particular sources that seem to do better in particular areas, but you have to worry about selection effects – how many similar sources did you look at to find those few winners? And if prediction market participants became convinced that these particular sources had high accuracy, they’d drive market prices to reflect those predictions.
I formulated it as a two-tailed null for simplicity; stating it as one-tailed is trivial. (Namely, that the probability of a Trump win was not the market-assigned probability or less.)
But as you formulated it, rejecting the null here would only be saying that it is untrue that both p(A) = market odds and p(B) = market odds. It doesn’t tell you, for instance, anything about the probability that at least one of them was right.
And that means that, as a hypothesis test, it only tells you not to assume that both of the two predictions were exactly correct. Is that what you claimed originally?
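For concreteness, here is a minimal Python sketch of the joint test being discussed, using hypothetical market prices of 0.20 for a Trump win and 0.25 for Brexit passing; the actual prices moved around, and these numbers are only for illustration.

```python
# Null hypothesis: both market probabilities were exactly right, i.e.
# p(Trump win) = 0.20 and p(Brexit leave) = 0.25 (hypothetical prices).
P_TRUMP = 0.20
P_BREXIT = 0.25

# Under that joint null, the probability of observing what we observed
# (both events happening) is just the product, assuming independence.
p_both = P_TRUMP * P_BREXIT
print("P(both events occur | null):", p_both)  # 0.05 with these numbers

# Even if this were small enough to "reject", the rejection only says that
# it is not the case that both prices were exactly right. It says nothing
# about which one was off, by how much, or whether either source is a poor
# predictor in general; two observations carry very little power.
```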