A game for self-calibration?

It seems that people on average are overconfident in their own beliefs. But some people probably are unusually reliable. When there is a disagreement, these people are generally on the right side. If you are one of these people, then you would be better off (epistemically) following your own gut rather than taking the advice of your friends. Of course, if you delude yourself into thinking your intuitions are more reliable even though they aren’t, then you’ll be worse off.

One response to this predicament is to take the advice of all your friends, on the argument that on average this makes people better off. One problem with this recommendation is that if only the best people follow it, then the net effect may be that average belief accuracy declines. "The problem with the world," wrote Bertrand Russell, "is that the stupid are cocksure and the intelligent are full of doubt." I’m not sure this is true with regard to intelligence, but if we substitute "wisdom", it may be more plausible. The modesty response could amplify the problem.

Is there a better way?

It seems that what might be useful would be to have a test for wisdom; more specifically, some way that you could test whether your intuitions are unusually accurate.

There are all sorts of indirect indicators that one might use – education, peer esteem, IQ scores, etc. But (1) it is not clear how valid these indicators are, and (2) there are so many of them that a person could easily get the result they want by picking the indicators that cast them in the most favorable light.

It would be good to have a more direct measurement of belief accuracy. What we might be especially interested in is assessing general epistemic accuracy ("good judgment") as opposed to how much knowledge you have in some particular domain. If you could determine how good your judgment is in general, you could use this to calibrate yourself – i.e. to assign a weight to your own gut versus those of other people.

Perhaps one could invent some kind of game that would provide such a test? Suppose one could easily generate a set of questions such that (a) the true answers can be easily discovered (perhaps at a later date), (b) the participants in the game have roughly the same relevant information, and (c) the questions are not confined to some particular narrow or artificial domain. The people playing the game could make guesses about these questions, and the winner would be the one with the most right answers. The challenge would be to design the game in such a way that a suitable set of questions could be easily generated and scored. Preferably, it should also be fun to play. Participants should not collect more information about the questions; they should simply make their own best judgment based on the information they all share.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    In order to be a test of your honesty when disagreeing, the test should let people see each other’s opinions, and then rate them on how well they listened to others.

    This test approach is well worth exploring, but a big problem is that people are often context dependent in their honesty. For example, engineers and hard scientists who tend to be well calibrated at their job may feel smugly superior to others regarding politics, even though the factors that encourage their honesty at their job are not present when they discuss politics.

    One solution would be to make the test very wide ranging across topic areas. But people might still try harder to be honest when taking a test, so they could feel smugly superior when they disagree with others outside the test. So you’d want a way to correct for this effect when you estimate your honesty.

    Betting markets and other track records on particular topic areas would seem the best source to evaluate your honesty on each topic area.

  • http://www.lifeboat.com Michael Anissimov

    Nick Bostrom is blogging! It really is 2007!

  • michael vassar

    Yeats just called them “the best” and “the worst” when he made that observation.

    The test Nick described sounds like a type of IQ test.
    We really need a test that shows beliefs changing in response to information received after the beliefs were formed.

    We already have the tests produced by and for Heuristics and Biases experiments. Just taking those would be a very good start. Writing similar but larger tests would be a next step.

  • Bill

    Here is something that may or may not fit the bill; it is called the almanac game.

    Requirements: Three players, A, B, and C. An almanac. Some competitive prize (e.g. winner buys everyone drinks).

    Player A asks a random fact out of an almanac (e.g. “What is the per capita US consumption of bananas in 2006?”).

    Player B gives not an estimate, but a midrange interval, i.e. an interval such that B assigns a 50% chance to the correct answer lying inside it and, if the answer lies outside, a 50% chance each to it being above or below the interval (e.g. “My midrange is 100 to 200”).

    Player C then chooses “Inside” or “Outside”. A then reads the answer, and if C is right, C gets a point, otherwise, B gets a point. (An alternate version allows C to win another point if, when the answer is outside, C guesses “Above” or “Below” correctly)

    The almanac is now passed to the next player, and roles rotate (B reads, C assigns range, A chooses inside or out).

    This only tests “Facts in almanac” knowledge, but it might help people calibrate themselves. For example, an overconfidence bias would suggest that people’s ranges would be too tight all the time, whatever their subject matter expertise (e.g. even if I am a banana expert, my midrange, while tighter than the one above, would still be too tight i.e. more likely to be outside than inside). Someone without that bias would score higher, as the chances of “Outside” being correct would be closer to 50-50.
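The point about overconfidence can be illustrated with a minimal simulation sketch. It assumes (this is not part of the original description) that B's midrange intervals contain the true answer with some fixed probability `true_coverage`, and that C knows which way B is biased. The function name `simulate_almanac` and its parameters are hypothetical.

```python
import random


def simulate_almanac(true_coverage, rounds=10_000, seed=0):
    """Simulate the inside/outside rounds of the almanac game.

    true_coverage: assumed probability that the true answer really falls
        inside B's stated midrange (0.5 for a well-calibrated B).
    Returns (c_points, b_points).
    """
    if not 0.0 <= true_coverage <= 1.0:
        raise ValueError("true_coverage must be between 0 and 1")
    rng = random.Random(seed)
    # Assumption: C knows B's bias and always makes the more likely call.
    c_call = "inside" if true_coverage > 0.5 else "outside"
    c_points = 0
    for _ in range(rounds):
        answer_inside = rng.random() < true_coverage
        c_is_right = answer_inside == (c_call == "inside")
        if c_is_right:
            c_points += 1
    # B gets the point whenever C is wrong.
    return c_points, rounds - c_points


if __name__ == "__main__":
    for coverage in (0.5, 0.4, 0.25):
        c_points, b_points = simulate_almanac(coverage)
        print(f"coverage={coverage}: C={c_points}, B={b_points}")
```

Under these assumptions B's expected share of the points is the smaller of `true_coverage` and `1 - true_coverage`, so B does best when the intervals really are 50% intervals.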

    I don’t know if or how this could be adjusted to test honesty in disagreements. Adding player D between A and B and imposing a penalty on B if C beats B but not D? I don’t know…

  • http://cob.jmu.edu/rosserjb Barkley Rosser

    Rather than viewing this as either a one-shot game or some kind of subgame perfect equilibrium, another way to view it would be as a repeated game with learning. Literature about financial market dynamics then becomes relevant, and I would recommend a paper by William Brock and Cars Hommes from 1997 in Econometrica, “Rational Routes to Randomness” along with a followup in 1998 in the Journal of Economic Dynamics and Control.

    So, their approach involves a cost to obtaining accurate information, perhaps the risk to the “intuitive gut” players of following their gut, or obtaining real information about an asset, versus simply using a rule of thumb, or buying an index fund, the latter rather like simply taking the average of everybody, presumed to be costless (or less costly than obtaining good information). In this setup what happens often are oscillations, easily chaotic or even more complex. In the simpler oscillations what goes on is the system going back and forth between being dominated by the smart, informed people and the rule of thumb index fund buyers. The mechanism is that when the system is near its fundamental, or obeying true information, there is no longer much of a gain from following the strategy of obtaining true information, so players switch to the rule of thumb/buying an index fund, etc., which then can become destabilizing. As the system moves away from the fundamental, it pays more to obtain information, so players start switching back. This is the basis of the dynamics.

  • http://www.spaceandgames.com Peter de Blanc

    Bill said:

    > (An alternate version allows C to win another point if, when the answer is outside, C guesses “Above” or “Below” correctly)

    In that case, you should use a 60% confidence interval.

  • Bill

    Peter said:

    >> (An alternate version allows C to win another point if, when the answer is outside, C guesses “Above” or “Below” correctly)

    >In that case, you should use a 60% confidence interval.

    How come? I don’t follow your reasoning.

    If it were a 60% probability interval, then C would always pick inside, no?

    Is it to make the game fair between B and C, since in this version, C can win two points while B can only win one?

  • http://www.spaceandgames.com Peter de Blanc

    60% was chosen so that if C has the same information as B, then inside and outside will be equally attractive.

    If C guesses “inside,” then the expected gain is .6*1 point = 0.6 points.

    If C guesses “outside,” then the expected gain is .4*(.5*1 point + .5*2 points) = 0.6 points.

    I’m assuming C doesn’t care if B gains or loses points (rationality is not a zero-sum game).
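The arithmetic above can be checked with a short sketch using exact fractions. The function names are hypothetical, and `p_direction` (C's chance of calling "Above" or "Below" correctly once the answer is known to be outside) is assumed to be one half, as in the calculation above.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def expected_gain_inside(p_inside):
    """C earns 1 point if the answer is inside the interval."""
    return p_inside * 1


def expected_gain_outside(p_inside, p_direction=HALF):
    """C earns 1 point for a correct 'Outside', or 2 points if the
    above/below call is also correct."""
    return (1 - p_inside) * ((1 - p_direction) * 1 + p_direction * 2)


def break_even_interval(p_direction=HALF):
    """Interval probability p at which 'Inside' and 'Outside' pay the same.

    Solves p = (1 - p) * (1 + p_direction) for p.
    """
    bonus_factor = 1 + p_direction
    return bonus_factor / (1 + bonus_factor)


if __name__ == "__main__":
    p = break_even_interval()
    print(p, expected_gain_inside(p), expected_gain_outside(p))
```

With a 50% interval the same functions give 1/2 for "Inside" and 3/4 for "Outside", which is why C would always pick "Outside" in the alternate version.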

  • Bill

    I see; you are right, the “above or below” sort of ruins the “inside or outside” with a 50% probability interval. I would rather the game have people think about “equally likely” rather than “equally attractive”, so I suppose I would only recommend the original version.

    Thanks for the comment; I’m glad I learned about this.

  • http://profile.typekey.com/nickbostrom/ Nick Bostrom

    Robin: yes, there is a risk that some will be more accurate in the test than in other situations, and yet will extrapolate their test performance. And track records and prediction markets have important additional advantages. Here I was looking for a quick and simple way of getting at least some benefit to improve one’s own calibration.

    Michael A: yes, it’s 2007, although I actually started making a few posts to this blog in late 2006…

    Michael V: there’ll probably be some significant G-loading for tests like this. It might be interesting to know how much, and whether it would depend on how the test was structured. Is there a factor of “good judgment” apart from G, which would reveal itself in guessing tasks that were (a) not logical/mathematical/verbal/spatial but instead ambiguous situations from ordinary life, and (b) not knowledge-intensive like trivia quizzes? If there is such a factor, part of it might be truth-seeking motivation. If we remove that, does any “capacity to make good judgments when one tries” factor remain? In other words, is WISDOM = G + KNOWLEDGE + DESIRE TO BE WISE, or are there additional components, such as meta-rationality or intuitive judgement/common sense? (I haven’t looked for this in the literature – maybe somebody here knows?)

    Bill: the almanac game is in the ballpark, but I’d ideally like to avoid testing trivia knowledge.

    One type of question that I believe is common in business job interviews is something like, “How many lamp posts are there in Manhattan?”. But this type of question also gets boring once one has mastered the general approach to solving them (estimate the number in a typical city block; estimate the number of city blocks; multiply).

  • http://profile.typekey.com/sentience/ Eliezer Yudkowsky

    It might be interesting to add a calibration game on top of other, existing games where there are plenty of emotionally laden questions with objective answers – for example, “What’s your 50% confidence interval on how much Monopoly money you think you’ll have 4 turns from now?” Suppose that you, player A, can bet Monopoly money with player B, on how much Monopoly money C will have in 4 turns – bearing in mind that C is also making bets! Then you must judge the calibration of others, and the game starts to look genuinely meta-rational. (Note: There must be a defined order of resolution for bets, to avoid circular dependencies.)

    I agree that people might overestimate their general calibration from learning to play a calibration game, but it’s better than nothing – you’ve got to get started somewhere.

  • Carl Shulman

    Another structure:

    1. Choose a quantitative estimation task based on a historical record, e.g. given a set of baseball player statistics, predict the next season’s win-loss ratio.
    2. Participants individually select an interval of predetermined size, earning points if the true value is within that range. (Forcing the players to make a firm written estimate initially gives them a focus for confirmation bias and overconfidence.)
    3. Participants are then allowed to see all the individual predictions and to make new predictions, scored separately. Repeat this some set number of times, watching for convergence or divergence, to make a round.
    4. After a round, each player’s cumulative score is made visible to all players, informing future rounds.

  • http://homepage.mac.com/redbird/ Gordon Worley

    Here’s my thought for a fun game to play on the computer.

    First, create a source of numbers between 0 and 1 with an unknown distribution (unknown to the player, anyway). For the game we will need to generate a sequence of numbers from this source. Each number will represent a position along a line, i.e. think of a number as the relative distance from the left end of the line. The game is played between two horizontal lines. From the top line balls will drop and on the bottom line the player can position a catcher. The size of the catcher varies in each round, but corresponds to a percentage of the line length, for example, 50% or 98%. In each round, the player can place the catcher anywhere they want on the bottom line, so long as it stays within the line. Game play is as follows:

    Many balls are dropped from the top line and pile up, forming an approximate picture of the secret distribution. The player is then allowed to place the catcher and the balls are cleared away and new ones following the same distribution drop again. The player’s goal is to place the catcher, not so that it maximizes balls caught, but so that it catches the correct percentage of balls for its size. For example, the 10% catcher should only catch 10% of the balls dropped over a sufficiently large number of ball drops (for the game, let’s say 1000). The score is based on how accurate the player was. If the 10% catcher catches 50% of the balls dropped, the player did a poor job and gets a low score, whereas if the 98% catcher actually catches 97%, we’d consider that pretty good (after all, the sample size is small enough for there to be some error).

    While not exactly in the language of confidence intervals, it’s similar to the task when picking a confidence interval, except that in real life we don’t always get to see lots of sample data first.
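A rough sketch of one round of this catcher game follows. The description above does not say how the score is computed or what the secret distribution is, so both are assumptions here: the score is taken to be one minus the absolute gap between the catcher's size and the fraction of balls caught, and a Beta(2, 5) distribution stands in for the secret source. All names are hypothetical.

```python
import random


def secret_draw(rng):
    """Stand-in for the secret source of numbers in [0, 1] (assumed Beta(2, 5))."""
    return rng.betavariate(2, 5)


def catch_fraction(samples, left, width):
    """Fraction of ball positions that land on a catcher spanning [left, left + width]."""
    right = left + width
    caught = sum(1 for x in samples if left <= x <= right)
    return caught / len(samples)


def score(target_fraction, caught_fraction):
    """Assumed scoring rule: 1.0 is perfect, lower means a bigger miss."""
    return 1.0 - abs(target_fraction - caught_fraction)


def play_round(draw, left, width, drops=1000, seed=0):
    """Drop `drops` balls and score a catcher of size `width` placed at `left`."""
    if left < 0.0 or left + width > 1.0 + 1e-12:
        raise ValueError("the catcher must stay within the line")
    rng = random.Random(seed)
    samples = [draw(rng) for _ in range(drops)]
    # The catcher's size doubles as the target share of balls to catch.
    return score(width, catch_fraction(samples, left, width))


if __name__ == "__main__":
    for left in (0.0, 0.2, 0.5):
        print(left, play_round(secret_draw, left, 0.5))
```

With a uniform source every placement scores about the same, so the game is only informative when the secret distribution is uneven.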

  • http://profile.typekey.com/nickbostrom/ Nick Bostrom

    Here’s another game that could be constructed, although it would take more work. The ideal is to have estimation tasks that are similar to ones we face in ordinary life rather than trivia quizzes or mathematical puzzles.

    The idea is to collect a number of case descriptions. A case description could be a one-pager describing the basic facts about the early stages of a relationship, the business plan of a start-up company, or the biography of a person up to a certain age. Participants would read the case descriptions and try to estimate various outcome measures – whether the start-up succeeded, how long the relationship lasted, what became of the individual 20 years later. The estimates would be compared to the actual outcome, which would have been recorded during the preparation of the case descriptions. Participants would be scored on both calibration and discrimination, given feedback, and the game would continue for many rounds with new case descriptions so that participants could improve over time. You could also have teams who would be allowed to discuss between themselves before issuing the team’s estimates.
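One possible way to score yes/no outcomes on both calibration and discrimination is the standard Murphy decomposition of the Brier score, sketched below. The comment above does not name a scoring rule, so this choice is an assumption, and the function name is hypothetical. In this decomposition the discrimination term is usually called "resolution".

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Split the Brier score into calibration and discrimination parts.

    forecasts: stated probabilities that each case ended in the outcome.
    outcomes: 1 if it did, 0 if it did not.
    Returns (brier, calibration_error, discrimination, uncertainty), where
    brier == calibration_error - discrimination + uncertainty.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally many forecasts and outcomes")
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group cases by the exact probability the participant stated.
    groups = defaultdict(list)
    for forecast, outcome in zip(forecasts, outcomes):
        groups[forecast].append(outcome)

    calibration_error = 0.0  # lower is better
    discrimination = 0.0     # higher is better
    for forecast, observed in groups.items():
        hit_rate = sum(observed) / len(observed)
        calibration_error += len(observed) * (forecast - hit_rate) ** 2
        discrimination += len(observed) * (hit_rate - base_rate) ** 2
    calibration_error /= n
    discrimination /= n

    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, calibration_error, discrimination, uncertainty


if __name__ == "__main__":
    print(brier_decomposition([0.75, 0.75, 0.25, 0.25], [1, 0, 0, 0]))
```

A participant who always states the base rate is perfectly calibrated but shows no discrimination, which is why reporting the two terms separately can be more informative than a single score.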

    For it to work well, it would be important to select the cases randomly from the relevant sample population. If one sampled from 50% successful and 50% unsuccessful companies, or only from people who had biographies written about them, one would reduce the value of the game. So unless one could think of some clever way of compiling relevant cases, it would take a significant effort to put together this kind of game for general life domains. On the other hand, such a game would seem to me to have great educational value.

    It might be a good investment for an enlightened ministry of education somewhere to produce and promote such material. From a scientific point of view, it would also be interesting to study how much performance in such a game would correlate with G, experience, political views, and other factors. Perhaps it might also be useful for diagnosing some psychiatric problems.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Nick, your game of predicting true story sequels is close to a proposal of mine. I want to take 100 or so random sample people and describe them each in terms of 100 or so parameters, everything from height to income to how neat their room is. I then want a web page where visitors would describe themselves in terms of these features, and then be tested on the sample people. The test is this: for each sample person they had time to look at, they would be shown a random half of that person’s features, and try to guess the other half of those features (ideally assigning probability distributions, even joint distributions).

    Enough of this sort of data and we could figure out which features people use to infer which other features of people. This would go a long way toward showing us the various signaling games we play. But it would not help much to calibrate people’s rationality under disagreement; for that we need games where people react to the estimates of others.
