If you hadn't seen it, we've played a narrower version of this quite a bit, and it works pretty well: https://www.lesswrong.com/posts/nmwog5hGidZniDDpR/aumann-agreement-game


The Idol Grinder

Simon: Appalling. Randy: Dude, why are you here? Paula covers her ears. These American Idol contestants, you know the ones, have no concept of how much they can't sing. They're not self-critical or introspective enough. Of cours...


Nick, your game of predicting true story sequels is close to a proposal of mine. I want to take 100 or so randomly sampled people and describe each of them in terms of 100 or so parameters, everything from height to income to how neat their room is. I then want a web page where visitors would describe themselves in terms of these features, and then be tested on the sample people. The test is this: for each sample person they had time to look at, they would be shown a random half of that person's features, and try to guess the other half of those features (ideally assigning probability distributions, even joint distributions).

Enough of this sort of data and we could figure out which features people use to infer which other features of people. This would go a long way toward showing us the various signaling games we play. But it would not help much to calibrate people's rationality under disagreement; for that we need games where people react to the estimates of others.


Here's another game that could be constructed, although it would take more work. The ideal is to have estimation tasks that are similar to ones we face in ordinary life rather than trivia quizzes or mathematical puzzles.

The idea is to collect a number of case descriptions. A case description could be a one-pager describing the basic facts about the early stages of a relationship, the business plan of a start-up company, or the biography of a person up to a certain age. Participants would read the case descriptions and try to estimate various outcome measures - whether the start-up succeeded, how long the relationship lasted, what became of the individual 20 years later. The estimates would be compared to the actual outcome, which would have been recorded during the preparation of the case descriptions. Participants would be scored on both calibration and discrimination, given feedback, and the game would continue for many rounds with new case descriptions so that participants could improve over time. You could also have teams who would be allowed to discuss between themselves before issuing the team's estimates.
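
As a rough illustration of what scoring on "both calibration and discrimination" could look like for yes/no outcomes (e.g. whether the start-up succeeded), here is a minimal sketch that assumes the Brier score with Murphy's decomposition. The choice of scoring rule and the function name `brier_decomposition` are assumptions made for the example, not part of the proposal; the reliability term stands in for calibration and the resolution term for discrimination.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty).

    forecasts: stated probabilities that each case turned out "yes"
    outcomes:  the recorded results, 1 for "yes" and 0 for "no"
    Cases are grouped by identical forecast value, so that
    brier == reliability - resolution + uncertainty holds exactly.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    # Reliability (calibration): gap between stated probability and
    # observed frequency in each group. Lower is better.
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    # Resolution (discrimination): how far the groups' observed
    # frequencies sit from the overall base rate. Higher is better.
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty

# Made-up example: ten cases, forecasts of 80% or 20%.
forecasts = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(brier_decomposition(forecasts, outcomes))
```

A participant who always states the base rate would be perfectly calibrated but show zero resolution, which is why the two components would need separate feedback.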

For it to work well, it would be important to select the cases randomly from the relevant sample population. If one sampled from 50% successful and 50% unsuccessful companies, or only from people who had biographies written about them, one would reduce the value of the game. So unless one could think of some clever way of compiling relevant cases, it would take a significant effort to put together this kind of game for general life domains. On the other hand, such a game would seem to me to have great educational value.

It might be a good investment for an enlightened ministry of education somewhere to produce and promote such material. From a scientific point of view, it would also be interesting to study how much performance in such a game would correlate with G, experience, political views, and other factors. Perhaps it might also be useful for diagnosing some psychiatric problems.


Here's my thought for a fun game to play on the computer.

First, create a source of numbers between 0 and 1 with an unknown distribution (unknown to the player, anyway). For the game we will need to generate a sequence of numbers from this source. Each number will represent a position along a line, i.e. think of a number as the relative distance from the left end of the line. The game is played between two horizontal lines. From the top line balls will drop and on the bottom line the player can position a catcher. The size of the catcher varies in each round, but corresponds to a percentage of the line length, for example, 50% or 98%. In each round, the player can place the catcher anywhere they want on the bottom line, so long as it stays within the line. Game play is as follows:

Many balls are dropped from the top line and pile up, forming an approximate picture of the secret distribution. The player is then allowed to place the catcher, and the balls are cleared away and new ones following the same distribution drop again. The player's goal is to place the catcher, not so that it maximizes balls caught, but so that it catches the correct percentage of balls for its size. For example, the 10% catcher should only catch 10% of the balls dropped over a sufficiently large number of ball drops (for the game, let's say 1000). The score is based on how accurate the player was. If the 10% catcher catches 50% of the balls dropped, the player did a poor job and gets a low score, whereas if the 98% catcher actually catches 97%, we'd consider that pretty good (after all, the sample size is small enough for there to be some error).

While not exactly in the language of confidence intervals, it's similar to the task when picking a confidence interval, except that in real life we don't always get to see lots of sample data first.
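
One round of the game could be sketched roughly as follows. The scoring rule (1 minus the absolute miss), the function names, and the Beta(2, 5) hidden distribution are all assumptions made up for the example; the description above only says the score depends on how accurate the player was.

```python
import random

def play_round(draw, left, width, n_drops=1000, seed=None):
    """Drop n_drops balls and score one catcher placement.

    draw:  function taking a random.Random and returning a position in [0, 1]
           (stands in for the hidden distribution)
    left:  left edge of the catcher, as a fraction of the line
    width: catcher size as a fraction of the line, e.g. 0.10 for the 10% catcher
    """
    if left < 0 or left + width > 1 + 1e-12:
        raise ValueError("catcher must stay within the line")
    rng = random.Random(seed)
    caught = sum(1 for _ in range(n_drops) if left <= draw(rng) <= left + width)
    caught_fraction = caught / n_drops
    # Assumed scoring rule: 1 minus the absolute miss, so 1.0 is a perfect round.
    score = 1.0 - abs(caught_fraction - width)
    return caught_fraction, score

def hidden_source(rng):
    # Hypothetical secret distribution, skewed toward the left end of the line.
    return rng.betavariate(2, 5)

fraction, score = play_round(hidden_source, left=0.05, width=0.10, seed=1)
print(f"10% catcher caught {fraction:.1%} of the balls, score {score:.3f}")
```

With a uniform hidden distribution every placement would score well, so the game is only interesting when the source is lumpy enough that placement matters.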


Another structure:

1. Choose a quantitative estimation task based on a historical record, e.g., given a set of baseball player statistics, predict the next season's win-loss ratio.
2. Participants individually select an interval of predetermined size, earning points if the true value is within that range. (Forcing the players to make a firm written estimate initially gives them a focus for confirmation bias and overconfidence.)
3. Participants are then allowed to see all the individual predictions and to make new predictions, scored separately. Repeat this some set number of times, watching for convergence or divergence, to make a round.
4. After a round, each player's cumulative score is made visible to all players, informing future rounds.


It might be interesting to add a calibration game on top of other, existing games where there are plenty of emotionally laden questions with objective answers - for example, "What's your 50% confidence interval on how much Monopoly money you think you'll have 4 turns from now?" Suppose that you, player A, can bet Monopoly money with player B, on how much Monopoly money C will have in 4 turns - bearing in mind that C is also making bets! Then you must judge the calibration of others, and the game starts to look genuinely meta-rational. (Note: There must be a defined order of resolution for bets, to avoid circular dependencies.)

I agree that people might overestimate their general calibration from learning to play a calibration game, but it's better than nothing - you've got to get started somewhere.


Robin: yes, there is a risk that some will be more accurate in the test than in other situations, and yet will extrapolate their test performance. And track records and prediction markets have important additional advantages. Here I was looking for a quick and simple way of getting at least some benefit to improve one's own calibration.

Michael A: yes, it's 2007, although I actually started making a few posts to this blog in late 2006...

Michael V: there'll probably be some significant G-loading for tests like this. It might be interesting to know how much, and whether it would depend on how the test was structured. Is there a factor of "good judgment" apart from G, which would reveal itself in guessing tasks that were (a) not logical/mathematical/verbal/spatial but instead ambiguous situations from ordinary life, and (b) not knowledge-intensive like trivia quizzes? If there is such a factor, part of it might be truth-seeking motivation. If we remove that, does any "capacity to make good judgments when one tries" factor remain? In other words, is WISDOM = G + KNOWLEDGE + DESIRE TO BE WISE, or are there additional components, such as meta-rationality or intuitive judgement/common sense? (I haven't looked for this in the literature - maybe somebody here knows?)

Bill: the almanac game is in the ballpark, but I'd ideally like to avoid testing trivia knowledge.

One type of question that I believe is common in business job interviews is something like, "How many lamp posts are there in Manhattan?". But this type of question also gets boring once one has mastered the general approach to solving them (estimate the number in a typical city block; estimate the number of city blocks; multiply).


I see; you are right, the "above or below" sort of ruins the "inside or outside" with a 50% probability interval. I would rather the game have people think about "equally likely" rather than "equally attractive", so I suppose I would only recommend the original version.

Thanks for the comment; I'm glad I learned about this.


60% was chosen so that if C has the same information as B, then inside and outside will be equally attractive.

If C guesses "inside," then the expected gain is .6*1 point = 0.6 points.

If C guesses "outside," then the expected gain is .4*(.5*1 point + .5*2 points) = 0.6 points.

I'm assuming C doesn't care if B gains or loses points (rationality is not a zero-sum game).
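
The same arithmetic can be checked with a few lines of Python. This is only a sketch under the assumptions already made above: C has the same information as B, and when the answer is outside, C's "Above"/"Below" guess is right half the time. The function name `expected_gains` is made up for the example.

```python
from fractions import Fraction

def expected_gains(p_inside, p_direction=Fraction(1, 2)):
    """Expected points for C in the alternate version of the game.

    p_inside:    probability that the answer falls inside B's interval
    p_direction: probability that C's "Above"/"Below" guess is right,
                 given that the answer is outside (assumed to be 1/2)
    """
    # Guessing "inside" pays 1 point whenever the answer is inside.
    inside = p_inside * 1
    # Guessing "outside" pays 1 point when the answer is outside,
    # or 2 points if the direction guess is also right.
    outside = (1 - p_inside) * ((1 - p_direction) * 1 + p_direction * 2)
    return inside, outside

print(expected_gains(Fraction(3, 5)))  # 60% interval: both guesses worth 3/5
print(expected_gains(Fraction(1, 2)))  # 50% interval: "outside" is worth more
```

With a 50% interval, "inside" is worth 1/2 a point and "outside" 3/4, so C would always say "outside"; at 60% the two guesses are worth the same.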


Peter said:

>> (An alternate version allows C to win another point if, when the answer is outside, C guesses "Above" or "Below" correctly)

>In that case, you should use a 60% confidence interval.

How come? I don't follow your reasoning.

If it were a 60% probability interval, then C would always pick inside, no?

Is it to make the game fair between B and C, since in this version, C can win two points while B can only win one?


Bill said:

> (An alternate version allows C to win another point if, when the answer is outside, C guesses "Above" or "Below" correctly)

In that case, you should use a 60% confidence interval.


Rather than viewing this as either a one-shot game or some kind of subgame perfect equilibrium, another way to view it would be as a repeated game with learning. Literature about financial market dynamics then becomes relevant, and I would recommend a paper by William Brock and Cars Hommes from 1997 in Econometrica, "Rational Routes to Randomness" along with a followup in 1998 in the Journal of Economic Dynamics and Control.

Their approach involves a cost to obtaining accurate information (perhaps the risk the "intuitive gut" players take in following their gut, or the cost of obtaining real information about an asset) versus simply using a rule of thumb or buying an index fund. The latter is rather like simply taking the average of everybody, and is presumed to be costless (or at least less costly than obtaining good information). What often happens in this setup is oscillation, easily chaotic or even more complex. In the simpler oscillations, the system goes back and forth between being dominated by the smart, informed people and by the rule-of-thumb index fund buyers. The mechanism is that when the system is near its fundamental, or obeying true information, there is no longer much of a gain from obtaining true information, so players switch to the rule of thumb or the index fund, which can then become destabilizing. As the system moves away from the fundamental, it pays more to obtain information, so players start switching back. This is the basis of the dynamics.


Here is something that may or may not fit the bill; it is called the almanac game.

Requirements: Three players, A, B, and C. An almanac. Some competitive prize (e.g. winner buys everyone drinks).

Player A asks a random fact out of an almanac (e.g. "What is the per capita US consumption of bananas in 2006?").

Player B gives not an estimate but a midrange interval, i.e., an interval such that B assigns a 50% chance to the correct answer being inside it and, given that the answer is outside, a 50% chance to it being above rather than below (e.g. "My midrange is 100 to 200").

Player C then chooses "Inside" or "Outside". A then reads the answer, and if C is right, C gets a point; otherwise, B gets a point. (An alternate version allows C to win another point if, when the answer is outside, C guesses "Above" or "Below" correctly.)

The almanac is now passed to the next player, and roles rotate (B reads, C assigns range, A chooses inside or out).

This only tests "Facts in almanac" knowledge, but it might help people calibrate themselves. For example, an overconfidence bias would suggest that people's ranges would be too tight all the time, whatever their subject matter expertise (e.g. even if I am a banana expert, my midrange, while tighter than the one above, would still be too tight i.e. more likely to be outside than inside). Someone without that bias would score higher, as the chances of "Outside" being correct would be closer to 50-50.
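
To put rough numbers on that, here is a small sketch under a strong simplifying assumption that is not part of the game itself: B's errors are normally distributed, and an overconfident B believes the spread of those errors is smaller than it really is by some factor. The function name and the factor of 2 are made up for the illustration.

```python
from statistics import NormalDist

def inside_probability(overconfidence):
    """Chance the true answer lands inside B's stated midrange.

    overconfidence: true error spread divided by the spread B believes in
                    (1.0 means perfectly calibrated, 2.0 means B thinks
                    the errors are half as large as they really are).
    Assumes normally distributed errors centred on B's best guess.
    """
    std_normal = NormalDist()
    # Half-width of a 50% interval, in units of the spread B believes in.
    half_width_believed = std_normal.inv_cdf(0.75)
    # The same half-width measured in units of the true spread.
    half_width_true = half_width_believed / overconfidence
    return 2 * std_normal.cdf(half_width_true) - 1

for factor in (1.0, 2.0):
    p = inside_probability(factor)
    print(f"overconfidence {factor}: inside {p:.3f}, outside {1 - p:.3f}")
```

Under these assumptions a calibrated B leaves C with a coin flip, while a B who is overconfident by a factor of 2 hands C a win roughly three times in four just by always saying "Outside".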

I don't know if or how this could be adjusted to test honesty in disagreements. Adding player D between A and B and imposing a penalty on B if C beats B but not D? I don't know...


Yates just called them "the best" and "the worst" when he made that observation.

The test Nick described sounds like a type of IQ test. We really need a test that shows beliefs changing in response to information received after the beliefs were formed.

We already have the tests produced by and for Heuristics and Biases experiments. Just taking those would be a very good start. Writing similar but larger tests would be a next step.


Nick Bostrom is blogging! It really is 2007!
