When I think of Aumann's agreement theorem, my first reflex is to average. You think A is 80% likely; my initial impression is that it's 60% likely. After you and I talk, maybe we both should think 70%. "Average your starting beliefs", or perhaps "do a weighted average, weighted by expertise" is a common heuristic.
But sometimes, not only is the best combination not the average, it's more extreme than either original belief.
Let's say Jane and James are trying to determine whether a particular coin is fair. They both think there's an 80% chance the coin is fair. They also know that if the coin is unfair, it is the sort that comes up heads 75% of the time.
Jane flips the coin five times, performs a perfect Bayesian update, and concludes there's a 65% chance the coin is unfair. James flips the coin five times, performs a perfect Bayesian update, and concludes there's a 39% chance the coin is unfair. The averaging heuristic would suggest that the correct answer lies somewhere between 39% and 65%. But a perfect Bayesian who hears both Jane's and James's estimates, knows their priors, and deduces what evidence they must have seen would infer that the coin is 83% likely to be unfair. [Math footnoted.]
Perhaps Jane and James are combining this information in the middle of a crowded tavern, with no pen and paper in sight. Maybe they don't have the time or memory to tell each other every coin flip they observed. So instead they just tell each other their posterior probabilities: a nice, short summary for a harried rationalist pair. Perhaps this brevity is why we tend to average posterior beliefs.
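To check these numbers, here is a minimal Python sketch that assumes only what the example states: a 20% prior that the coin is unfair, an unfair coin that lands heads 75% of the time, and (as the footnoted math works out) five heads for Jane and four heads plus one tail for James. The helper name `posterior_unfair` is made up for illustration. The sketch compares the averaging heuristic with a single update on all ten flips.

```python
# Assumptions taken from the example: 20% prior on "unfair",
# an unfair coin lands heads 75% of the time, a fair coin 50%.
P_UNFAIR = 0.2
P_HEADS_UNFAIR = 0.75
P_HEADS_FAIR = 0.5


def posterior_unfair(heads: int, tails: int, prior: float = P_UNFAIR) -> float:
    """Posterior probability that the coin is unfair after the observed flips."""
    like_unfair = P_HEADS_UNFAIR ** heads * (1 - P_HEADS_UNFAIR) ** tails
    like_fair = P_HEADS_FAIR ** heads * (1 - P_HEADS_FAIR) ** tails
    joint_unfair = prior * like_unfair
    return joint_unfair / (joint_unfair + (1 - prior) * like_fair)


jane = posterior_unfair(heads=5, tails=0)    # about 0.655
james = posterior_unfair(heads=4, tails=1)   # about 0.388
averaged = (jane + james) / 2                # about 0.521: the averaging heuristic
pooled = posterior_unfair(heads=9, tails=1)  # about 0.828: update on all ten flips

print(f"Jane {jane:.3f}, James {james:.3f}, average {averaged:.3f}, pooled {pooled:.3f}")
```

The pooled answer lands above both individual posteriors, while the average sits near 52%.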
However, there is an alternative. Jane and James can trade likelihood ratios. Like posterior beliefs, likelihood ratios are a condensed summary; and, unlike posterior beliefs, sharing likelihood ratios actually works.
Let's listen in on a conversation where Jane and James trade likelihood ratios:
JANE: My observations are seven and a half times as likely if the coin is unfair, as if it is fair.
JAMES: My observations are two and a half times as likely if the coin is unfair, as if it is fair.
BOTH, in unison: That means our joint observations are about nineteen times as likely if the coin is unfair. But our prior for unfair coins is 20%, which means a prior odds ratio of 1:4. Combining with Bayes' theorem, we get (1:4)*(19:1), which is about 5:1 in favor of an unfair coin.
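In odds form, Bayes' theorem says that posterior odds equal prior odds times the likelihood ratio, and likelihood ratios from independent observations simply multiply. Below is a minimal sketch of the tavern calculation, assuming the ratios come from five flips each as in the footnoted math (Jane's exact ratio is 1.5^5, about 7.59, and James's is about 2.53, which the dialog rounds to 7.5 and 2.5). `combine_likelihood_ratios` is a hypothetical helper, not a library function.

```python
def combine_likelihood_ratios(prior_unfair: float, *ratios: float) -> float:
    """Multiply independent likelihood ratios (unfair : fair) into the prior odds."""
    odds = prior_unfair / (1 - prior_unfair)  # a 20% prior is odds of 1:4, i.e. 0.25
    for ratio in ratios:
        odds *= ratio
    return odds / (1 + odds)  # convert posterior odds back to a probability


jane_ratio = 1.5 ** 5             # five heads: (0.75 / 0.5) ** 5 = 7.59375
james_ratio = 1.5 ** 4 * 0.5      # four heads and a tail: 1.5 ** 4 * (0.25 / 0.5) = 2.53125
joint = jane_ratio * james_ratio  # about 19.2
posterior = combine_likelihood_ratios(0.2, jane_ratio, james_ratio)
print(f"joint ratio {joint:.1f}, posterior {posterior:.3f}")
```

Only two numbers cross the table, yet the result matches a full update on all ten flips, because each ratio already summarizes everything its owner saw that bears on fair versus unfair.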
[BAR PATRONS edge away slightly.]
Now that you see how sharing likelihood ratios can work, you'll probably be itching to put them to work in your daily life. As with most rationalist tricks, it helps to have a number of cached examples of places they can be used.
(1) Distinguish evidence from priors. I've been in several conversations that went roughly like this:
Person A: So, what do you think of Jack?
Person B: My guess is that he's moderately (smart/trustworthy/whatever), but not extremely so.
Person A: Is the "not extremely so" because you observed evidence Jack isn't, or because most people aren't and you don't have much data that Jack is? Where's the peak of your likelihood function?
This type of dialog is useful. Let's say that A's initial impression is that Jack is amazing, and B's impression is that Jack is somewhat less amazing. If B knows Jack well, A should lower her estimate of Jack. But if B's impression comes from a tiny amount of amazing-looking data from Jack (just not enough to pull Jack all the way from "probably normal" to "probably amazing"), A should raise her estimate. B's posterior expectations about Jack's amazingness are identical in the two cases, even though B's observations in the two cases have opposite implications for A. Trading likelihoods captures the difference; trading posterior impressions doesn't.
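One way to make this concrete is a toy normal-normal model, which is an assumption of this sketch rather than anything in the original: Jack's amazingness has a standard normal prior, and each person's impression is a noisy reading whose noise variance encodes how much data stands behind it. All numbers and the `posterior_mean` helper are hypothetical. B's posterior mean is 1.0 in both cases, yet combining B's likelihood with A's moves A in opposite directions.

```python
# Toy normal-normal model (an assumption of this sketch): amazingness ~ N(0, 1),
# and each impression is a reading with a known noise variance.
PRIOR_MEAN, PRIOR_VAR = 0.0, 1.0


def posterior_mean(*readings: tuple[float, float]) -> float:
    """Precision-weighted combination of the prior with (reading, noise_var) pairs."""
    precision = 1 / PRIOR_VAR
    weighted = PRIOR_MEAN / PRIOR_VAR
    for value, noise_var in readings:
        precision += 1 / noise_var
        weighted += value / noise_var
    return weighted / precision


a_reading = (3.0, 1.0)           # A: one fairly noisy but very impressive impression
b_knows_well = (1.0625, 0.0625)  # case 1: lots of data, likelihood peaks near "moderate"
b_little_data = (3.0, 2.0)       # case 2: little data, likelihood peaks at "amazing"

print(posterior_mean(a_reading))                                    # 1.5
print(posterior_mean(b_knows_well), posterior_mean(b_little_data))  # 1.0 1.0
print(posterior_mean(a_reading, b_knows_well))   # about 1.11: A should lower her estimate
print(posterior_mean(a_reading, b_little_data))  # 1.8: A should raise her estimate
```

Averaging A's and B's posteriors would give 1.25 in both cases, missing the distinction entirely.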
(2) Avoid double-counting your priors. Robin Hanson suggested adjusting women's SAT-math scores toward the mean (downward for high-scoring women, upward for low-scoring women) if women's math aptitudes have a smaller standard deviation than men's. Moral intuitions aside, adjusting in this manner would improve the scores' accuracy if used as stand-alone estimates of strangers' SAT-math abilities; perhaps a woman with a single score of 800 has the same expected score on subsequent SAT-math tests (the same best point estimate for "true SAT-math ability") as a man who received a single score of, say, 770.
However, adjusting scores in this manner mixes likelihoods in with priors. SAT scores are best interpreted as likelihood functions: an SAT score of 800 has one likelihood from a person whose "true ability" is superb, another from a person whose "true ability" is moderate, and so on. If you mix these likelihood functions with your prior (as gender-adjusted SAT scores would), combining multiple indicators becomes more difficult. For example, suppose again that a single SAT-math score of 800 from a woman implies the same best point estimate of "true ability" as a single score of 770 from a man (because of differing priors plus the chance of testing error). Two SAT-math scores of 800 from a woman will then imply a higher best point estimate of true ability than two 770s from a man, since each additional score weakens the pull of the prior and lets the raw scores count for more. The "gender adjustment" works for single, stand-alone SAT measurements, but it breaks when multiple indicators are combined. If you mix the likelihood function with your prior and then combine it with other mixed indicators (e.g., multiple gender-adjusted SAT scores, or gender-adjusted SAT scores plus gender-adjusted letters of recommendation), you count the prior more than once and pull too strongly toward it.
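A minimal sketch of the double-counting, assuming a normal prior on true ability and normal test noise. The numbers (prior mean 500, prior standard deviation 90, noise standard deviation 30) are hypothetical, not real SAT statistics, chosen only so that a single 800 adjusts to 770 as in the text; `shrunk_estimate` is a made-up helper.

```python
# Hypothetical numbers, not real SAT statistics: prior on true ability
# N(500, 90^2) for the group, plus test noise with standard deviation 30.
PRIOR_MEAN, PRIOR_SD, NOISE_SD = 500.0, 90.0, 30.0


def shrunk_estimate(scores: list[float]) -> float:
    """Posterior mean of true ability given raw scores (normal prior, normal noise)."""
    n = len(scores)
    raw_mean = sum(scores) / n
    # The weight on the data grows toward 1 as more tests are taken.
    weight = PRIOR_SD**2 / (PRIOR_SD**2 + NOISE_SD**2 / n)
    return PRIOR_MEAN + weight * (raw_mean - PRIOR_MEAN)


single = shrunk_estimate([800])        # 770: the "adjusted" single score
correct = shrunk_estimate([800, 800])  # about 784: prior weighed against both tests
averaged_adjusted = (shrunk_estimate([800]) + shrunk_estimate([800])) / 2  # still 770
print(f"{single:.1f} {correct:.1f} {averaged_adjusted:.1f}")
```

Averaging the two adjusted scores leaves the estimate at 770, while shrinking the raw average once, with the prior weighed against both tests, gives about 784.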
The take-home in all these cases is to keep hold of your likelihood ratios. Instead of tracking how likely your lead theory is to be true, or remembering a single theory which was most representative of the range of remaining possibilities (like Jack's average expected amazingness), try to track how likely your data-set is under one hypothesis vs. another. (You'll need to separately remember the prior on each hypothesis.) I suspect such tracking may also help with confirmation bias; I don't know if it ends up confusing people in other ways.
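A minimal sketch of such tracking, using the coin example's two hypotheses: it stores the log-likelihood of the data under each hypothesis and asks for a prior only when a posterior is actually needed. The class and method names are hypothetical, and logs are used so that long data sets don't underflow.

```python
import math


class LikelihoodLedger:
    """Track log P(data | hypothesis) for each hypothesis, kept separate from any prior."""

    def __init__(self, heads_prob: dict[str, float]) -> None:
        self.heads_prob = heads_prob
        self.log_like = {h: 0.0 for h in heads_prob}

    def observe(self, flip: str) -> None:
        """Record one flip, "H" or "T", under every hypothesis."""
        for h, p in self.heads_prob.items():
            self.log_like[h] += math.log(p if flip == "H" else 1 - p)

    def likelihood_ratio(self, h1: str, h2: str) -> float:
        return math.exp(self.log_like[h1] - self.log_like[h2])

    def posterior(self, prior: dict[str, float]) -> dict[str, float]:
        """Combine the stored likelihoods with whatever prior is supplied now."""
        logs = {h: math.log(prior[h]) + ll for h, ll in self.log_like.items()}
        top = max(logs.values())  # subtract the max for numerical stability
        weights = {h: math.exp(v - top) for h, v in logs.items()}
        total = sum(weights.values())
        return {h: w / total for h, w in weights.items()}


ledger = LikelihoodLedger({"fair": 0.5, "unfair": 0.75})
for flip in "HHHHH" + "HHHHT":  # Jane's flips, then James's
    ledger.observe(flip)
print(round(ledger.likelihood_ratio("unfair", "fair"), 2))
print(round(ledger.posterior({"fair": 0.8, "unfair": 0.2})["unfair"], 3))
```

Because the ledger never absorbs a prior, you can ask what the same data implies under a different prior, or hand the likelihood ratio to someone whose prior differs from yours.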
One major caveat: in our example with the coin, and in A's and B's estimations of Jack's amazingness, combining likelihood ratios led to more extreme beliefs. (In more general examples, combining likelihood ratios may not lead to more extreme beliefs, but it almost always leads to more specific beliefs.) When trying this yourself, make sure the likelihood ratios you're combining are independent indicators of the variable you're trying to infer. Otherwise, you and your co-rationalists may pull one another to beliefs that are unjustifiably extreme (or specific).
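A sketch of what goes wrong when indicators are not independent, under a hypothetical scenario that is not in the original: Jane and James watched the same five flips rather than separate ones. Multiplying their identical likelihood ratios counts one set of flips twice. `posterior_from_ratios` is a hypothetical helper.

```python
def posterior_from_ratios(prior: float, *ratios: float) -> float:
    """Posterior probability from a prior and likelihood ratios assumed independent."""
    odds = prior / (1 - prior)
    for r in ratios:
        odds *= r
    return odds / (1 + odds)


five_heads = 1.5 ** 5  # likelihood ratio (unfair : fair) for five heads, 7.59375

# Separate flips: the ratios are independent, so multiplying them is legitimate.
separate = posterior_from_ratios(0.2, five_heads, 1.5 ** 4 * 0.5)
# The same five flips seen by both: multiplying counts the evidence twice.
double_counted = posterior_from_ratios(0.2, five_heads, five_heads)
# Shared evidence handled correctly: use the ratio once.
shared_once = posterior_from_ratios(0.2, five_heads)
print(round(separate, 3), round(double_counted, 3), round(shared_once, 3))
```

The double-counted answer (about 0.935) is unjustifiably extreme compared with the correct 0.655 for a single set of five heads.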
(Math for original example:
James, to end up with a 39% posterior on the coin being heads-weighted, must have seen four heads and one tail:
P(four heads and one tail | heads-weighted) = (0.75^4 ∙ 0.25^1) = 0.079. P(four heads and one tail | fair) = 0.5^5 = 0.031. P(heads-weighted | four heads and one tail) = (0.2∙0.079)/(0.2∙0.079 + 0.8∙0.031) = 0.39, which is the posterior belief James reports.
Jane must similarly have seen five heads and zero tails.
Plugging the total nine heads and one tail into Bayes' theorem:
P(heads-weighted | nine heads and a tail) = ( 0.2 ∙ (0.75^9 ∙ 0.25^1) ) / ( 0.2 ∙ (0.75^9 ∙ 0.25^1) + 0.8 ∙ (0.5^9 ∙ 0.5^1) ) = 0.83, giving us a posterior belief of 83% that the coin is heads-weighted.)
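The deduction step can also be sketched in code. Assuming, as in the example, that each person flipped exactly five times and reported a rounded posterior, the hypothetical helper `infer_heads` searches the six possible head counts for the one whose posterior best matches the report, then pools the recovered flips. Each head multiplies the odds by 0.75/0.5 = 1.5 and each tail by 0.25/0.5 = 0.5.

```python
def posterior_unfair(heads: int, tails: int, prior: float = 0.2) -> float:
    """Posterior that the coin is heads-weighted, via prior odds times likelihood ratio."""
    odds = prior / (1 - prior) * 1.5 ** heads * 0.5 ** tails
    return odds / (1 + odds)


def infer_heads(reported: float, flips: int = 5) -> int:
    """Head count (out of `flips`) whose posterior is closest to the reported one."""
    return min(range(flips + 1), key=lambda h: abs(posterior_unfair(h, flips - h) - reported))


jane_heads = infer_heads(0.65)   # Jane reported 65%
james_heads = infer_heads(0.39)  # James reported 39%
total_heads = jane_heads + james_heads
total_tails = 10 - total_heads
print(jane_heads, james_heads, round(posterior_unfair(total_heads, total_tails), 2))
```

This prints `5 4 0.83`. The inversion works here only because the priors, the coin model, and the number of flips are common knowledge; without them a bare posterior cannot be unpacked into evidence.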