Tag Archives: Statistics

My Poll, Explained

So many people have continued to ask me the same questions about my recent Twitter poll that I thought I'd put all my answers in one place. This topic isn't fundamentally that interesting, so most of you may want to skip this post.

Recently, Christine Blasey Ford publicly accused US Supreme Court nominee Brett Kavanaugh of a sexual assault. This accusation will have important political consequences, however it is resolved. Congress and the US public are now put in the position of having to evaluate the believability of this accusation, and thus must consider which clues might indicate if the accusation is correct or incorrect.

Immediately after the accusation, many said that the timing of the accusation seemed to them suspicious, occurring exactly when it would most benefit Democrats seeking to derail any nomination until after the election, when they may control the Senate. And it occurred to me that a Bayesian analysis might illuminate this issue. If T = the actual timing, A = an accurate accusation, and W = a wrong accusation, then how much this timing consideration pushes our final beliefs is given by the likelihood ratio p(T|W)/p(T|A). A ratio above one pushes against believing the accusation, while a ratio below one pushes for it.
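To make the update concrete, here is a minimal sketch of that Bayes-rule calculation (the prior and likelihood numbers below are placeholders of my own, not estimates for this case):

```python
# Posterior probability that the accusation is wrong, given the timing T.
# All numbers are illustrative placeholders, not estimates of the actual case.
def posterior_prob_wrong(prior_wrong, p_T_given_W, p_T_given_A):
    prior_accurate = 1 - prior_wrong
    numerator = p_T_given_W * prior_wrong
    return numerator / (numerator + p_T_given_A * prior_accurate)

# Example: a 30% prior that the accusation is wrong, and a timing judged twice
# as likely under a wrong accusation as under an accurate one, i.e. a
# likelihood ratio p(T|W)/p(T|A) of 2, which pushes belief toward "wrong".
print(posterior_prob_wrong(0.30, 0.10, 0.05))  # ~0.46
```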

The term p(T|A) seemed to me the most interesting term, and it occurred to me to ask what people thought about it via a Twitter poll. (If there was continued interest, I could ask another question about the other term.) Twitter polls are much cheaper and easier for me to do than other polls. I’ve done dozens of them so far, and rarely has anyone objected. Such polls only allow four options, and you don’t have many characters to explain your question. So I used those characters mainly to make clear a few key aspects of the accusation’s timing:

Many claimed that my wording was misleading because it didn’t include other relevant info that might support the accusation, like who else the accuser is said to have told, and when, and what pressures she is said to have faced about when to go public. They didn’t complain about my not including info that might lean the other way, such as the low detail on the claimed event and the lack of supporting witnesses. But a short tweet just can’t include much relevant info; I barely had enough characters to explain key facts about the accusation’s timing.

It is certainly possible that my respondents suffered from cognitive biases, such as assuming too direct a path between accuser feelings and a final accusation. To answer my poll question well, they should have considered many possible complex paths by which an accuser says something to others, who then tell other people, some of whom then choose when to bring pressure back on that accuser to make a public accusation. But that’s just the nature of any poll; respondents may well not think carefully enough before answering.

For the purposes of a Twitter poll, I needed to divide the range from 0% to 100% into four bins. I had high uncertainty about where poll answers would lie, and for the purposes of Bayes’ rule it is multiplicative factors that matter most. So I chose three ranges each spanning roughly a factor of 4 to 5, and a leftover bin encompassing an infinite factor. If anything, my choice was biased against answers in the infinite-factor bin.

I really didn’t know which way poll answers would go. If most answers were high fractions, that would tend to support the accusation, while if most answers were low fractions, that would tend to question the accusation. Many accused me of posting the poll in order to deny the accusation, but for that to work I would have needed a good guess on the poll answers. Which I didn’t have.

My personal estimate would be somewhere in the top two ranges, and that plausibly biased me to pick bins toward such estimates. As two-thirds of my poll answers were in the lowest bin I offered, that suggests that I should have offered an even wider range of factors. Some claimed that I biased the results by not putting more bins above 20%. But the fraction of answers above 20% is still below the usual four-bin target of 25% per bin.

It is certainly plausible that my pool of poll respondents is not representative of the larger US or world population. And many called it irresponsible and unscientific to run an unrepresentative poll, especially if one doesn’t carefully show, via A/B testing, which wordings matter and how. But few complain about the thousands of other Twitter polls run every day, or about my dozens of other polls. And the obvious easy way to show that my pool or my wordings matter is to show different answers from another poll where those vary. Yet almost no one even tried that.

Also, people don’t complain about others asking questions in simple public conversations, even though those can be seen as N=1 examples of unrepresentative polls without A/B testing on wordings. It is hard to see how asking thousands of people the same question via a Twitter poll is less informative than just asking one person that same question.

Many people said it is just rude to ask a poll question that insinuates that rape accusations might be wrong, especially when we’ve just seen someone going through all the pain of making one. They say that doing so is pro-rape and discourages the reporting of real rapes, and that this must have been my goal in making this poll. But consider an analogy with discussing gun control just after a shooting. Some say it is rude then to discuss anything but sympathy for victims, while others say that is exactly a good time to discuss gun control. I say that when we must evaluate a specific rape accusation is exactly a good time to think about which clues point in which direction on whether that accusation is accurate or wrong.

Others say that it is reasonable to conclude that I’m against their side if I didn’t explicitly signal within my poll text that I’m on their side; that’s just the sort of signaling-game equilibrium we are in, and so they are justified in denouncing me for being on the wrong side. But this seems a quite burdensome standard to hold polls to, given that polls already have too few characters to allow an adequate explanation of a question, and it seems obvious that the vast majority of Twitter polls today are not in fact held to this standard.

Added 24Sep: I thought the poll interesting enough to ask, relative to its costs to me, but I didn’t intend to give it much weight. It was all the negative comments that made it a bigger deal.

Note that, at least in my Twitter world, we see a big difference in attitudes between vocal folks who tweet and those who merely answer polls. That latter “silent majority” is more skeptical of the accusation.


Peer Review Is Random

Which academic articles get published in the more prestigious journals is a pretty random process. When referees review an academic paper, less than 20% of the variability in referee ratings is explained by a tendency to agree:

This paper presents the first meta-analysis for the inter-rater reliability (IRR) of journal peer reviews [using] … 70 reliability coefficients … from 48 studies. … [covering] 19,443 manuscripts; on average, each study had a sample size of 311 manuscripts (minimum: 28, maximum: 1983). … The more manuscripts that a study is based on, the smaller the reported IRR coefficients are. … If the information of the rating system for reviewers was reported in a study, then this was associated with a smaller IRR coefficient. … An ICC of .23 indicates that only 23% of the variability in the reviewers’ rating of a manuscript could be explained by the agreement of reviewers. (more: HT Tyler)

[Figure: inter-rater reliability estimates and confidence intervals, by study]

The above is from their key figure, showing reliability estimates and confidence intervals for studies ordered by estimated reliability. The most accurate studies found the lowest reliabilities, clear evidence of a bias toward publishing studies that find high reliability. I recommend trusting only the most solid studies, which give the most pessimistic (<20%) estimates.

Seems a model would be useful here. Model the optimal number of referees per paper, given referee reliability, the value of identifying the best papers, and the relative cost of writing vs. refereeing a paper. Such a model could estimate the losses from having many journals with separate referees evaluate each article, vs. an integrated system.
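Here is a minimal simulation sketch of the kind of model I have in mind (all parameter values are illustrative assumptions, not estimates): papers have a true quality, each referee sees quality plus noise calibrated to a given reliability, and a journal accepts the papers with the highest average referee score.

```python
import random

def simulate(n_papers=1000, n_referees=2, reliability=0.2, accept_frac=0.1, seed=0):
    """Fraction of the truly best papers that get accepted, under noisy refereeing."""
    rng = random.Random(seed)
    # Pick referee noise so a single rating has the given reliability
    # (share of rating variance explained by true paper quality).
    noise_sd = ((1 - reliability) / reliability) ** 0.5
    quality = [rng.gauss(0, 1) for _ in range(n_papers)]
    score = [
        sum(q + rng.gauss(0, noise_sd) for _ in range(n_referees)) / n_referees
        for q in quality
    ]
    n_accept = int(accept_frac * n_papers)
    best = set(sorted(range(n_papers), key=lambda i: quality[i])[-n_accept:])
    accepted = set(sorted(range(n_papers), key=lambda i: score[i])[-n_accept:])
    return len(best & accepted) / n_accept

print(simulate(n_referees=1))  # with reliability 0.2, well under half the best papers make it
print(simulate(n_referees=3))  # more referees help, but acceptance stays far from perfect
```

Extending the sketch with a cost per referee report and a value for identifying the best papers would give the optimal number of referees, and comparing separate-journal refereeing to a shared pool of reports would estimate the losses from the current system.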


How Exceptional Is Gelman?

In response to my saying:

Academia is primarily an institution for credentialling folks as intellectually impressive, so that others can affiliate with them.

Andrew Gelman penned “Another reason I’m glad I’m not an economist“:

That [Robin] would write such an extreme statement without even feeling the need to justify it (and, no, I don’t think it’s true, at least not in the “academia” that I know about) . . . that I see as a product of being in an economics department.

I responded:

I have posted many times here on [this]. … The standard idealistic [story] is that academics know useful and important things, things which students want to learn, media want to report, consulting clients want to apply, … These idealistic theories … have [these listed] detailed problems. … It seems far simpler to me to just postulate that people care primarily about affiliating with others who have been certified as prestigious.

Andrew answered: Continue reading "How Exceptional Is Gelman?" »


Why so little model checking done in statistics?

One thing that bugs me is that there seems to be so little model checking done in statistics.  Data-based model checking is a powerful tool for overcoming bias, and it’s frustrating to see this tool used so rarely.  As I wrote in this referee report,

I’d like to see some graphs of the raw data, along with replicated datasets from the model. The paper admirably connects the underlying problem to the statistical model; however, the Bayesian approach requires a lot of modeling assumptions, and I’d be a lot more convinced if I could (a) see some of the data and (b) see that the fitted model would produce simulations that look somewhat like the actual data. Otherwise we’re taking it all on faith.
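A minimal sketch of such a check, using a toy example of my own (a normal model fit to skewed data), comparing a summary of the raw data to the same summary of datasets replicated from the fitted model:

```python
import random

# Toy posterior-predictive-style check (the model and data here are an
# invented example): fit a normal model to skewed data, simulate replicated
# datasets from the fit, and compare a data summary (here, the maximum)
# between the raw data and the replications.
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100)]   # raw data: clearly skewed

mean = sum(data) / len(data)
sd = (sum((x - mean) ** 2 for x in data) / (len(data) - 1)) ** 0.5

observed_max = max(data)
replicated_maxes = [
    max(rng.gauss(mean, sd) for _ in range(len(data))) for _ in range(1000)
]
p_value = sum(m >= observed_max for m in replicated_maxes) / len(replicated_maxes)
print(p_value)  # a value near 0 or 1 flags a feature of the data the model misses
```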

But, why, if this is such a good idea, do people not do it? 

Continue reading "Why so little model checking done in statistics?" »


How should unproven findings be publicized?

A year or so ago I heard about a couple of papers by Satoshi Kanazawa on "Engineers have more sons, nurses have more daughters" and "Beautiful parents have more daughters."  The titles surprised me, because in my acquaintance with such data, I’d seen very little evidence of sex ratios at birth varying much at all, certainly not by 26% as was claimed in one of these papers.  I looked into it and indeed it turned out that the findings could be explained as statistical artifacts–the key errors were, in one of the studies, controlling for intermediate outcomes and, in the other study, reporting only one of multiple potential hypothesis tests.  At the time, I felt that a key weakness of the research was that it did not include collaboration with statisticians, experimental psychologists, or others who are aware of these issues.
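As a reminder of how easily the second artifact arises, here is a tiny calculation (the number of tests is an assumption for illustration, not taken from the papers):

```python
# With 20 independent hypothesis tests and no real effects anywhere, the chance
# of at least one "significant" result at the 5% level is already about 64%.
n_tests, alpha = 20, 0.05
print(1 - (1 - alpha) ** n_tests)  # ~0.64
```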

Continue reading "How should unproven findings be publicized?" »


Statistical inefficiency = bias, or, Increasing efficiency will reduce bias (on average), or, There is no bias-variance tradeoff

Statisticians often talk about a bias-variance tradeoff, comparing a simple unbiased estimator (for example, a difference in differences) to something more efficient but possibly biased (for example, a regression).  There’s commonly the attitude that the unbiased estimate is a better or safer choice.  My only point here is that, by using a less efficient estimate, we are generally choosing to estimate fewer parameters (for example, estimating an average incumbency effect over a 40-year period rather than estimating a separate effect for each year or each decade), or estimating an overall effect of a treatment rather than separate estimates for men and women.  If we make the seemingly conservative choice not to estimate interactions, we are implicitly estimating those interactions at zero, which is not unbiased at all!
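A small simulated example of this point (all numbers invented): suppose a treatment helps men and women by different amounts. The pooled estimate has lower variance, but as an estimate of either group’s effect, or of the interaction, it is biased.

```python
import random

rng = random.Random(1)
true_effect = {"men": 2.0, "women": 6.0}     # invented effects; interaction = 4.0
samples = {g: [true_effect[g] + rng.gauss(0, 5) for _ in range(50)] for g in true_effect}

pooled = sum(sum(v) for v in samples.values()) / sum(len(v) for v in samples.values())
by_group = {g: round(sum(v) / len(v), 2) for g, v in samples.items()}

print("pooled estimate:", round(pooled, 2))    # less variable, but wrong for both groups
print("per-group estimates:", by_group)        # noisier, but unbiased for each group
print("interaction implied by pooling:", 0.0)  # the hidden, biased estimate of 4.0
```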

I’m not saying that there are any easy answers to this; for example, see here for one of my struggles with interactions in an applied problem—in this case (estimating the effect of incentives in sample surveys), we were particularly interested in certain interactions even though they could not be estimated precisely from data.


Useful bias

I would like to introduce the perhaps, in this forum, heretical notion of useful bias.  By useful bias I mean the deliberate introduction of an error as a means to solving a problem.  The two examples I discuss below are concrete rather than abstract and come from my training as an infantry officer many years ago.  Now technology solves the problems they solved, but the examples may still serve to illustrate the notion.

Continue reading "Useful bias" »


Truth is stranger than fiction

Robin asks the following question here:

How does the distribution of truth compare to the distribution of opinion?  That is, consider some spectrum of possible answers, like the point difference in a game, or the sea level rise in the next century. On each such spectrum we could get a distribution of (point-estimate) opinions, and in the end a truth.  So in each such case we could ask for truth’s opinion-rank: what fraction of opinions were less than the truth?  For example, if 30% of estimates were below the truth (and 70% above), the opinion-rank of truth was 30%.

If we look at lots of cases in some topic area, we should be able to collect a distribution for truth’s opinion-rank, and so answer the interesting question: in this topic area, does the truth tend to be in the middle or the tails of the opinion distribution?  That is, if truth usually has an opinion rank between 40% and 60%, then in a sense the middle conformist people are usually right.  But if the opinion-rank of truth is usually below 10% or above 90%, then in a sense the extremists are usually right.

My response:

1.  As Robin notes, this is ultimately an empirical question which could be answered by collecting a lot of data on forecasts/estimates and true values.

2.  However, there is a simple theoretical argument suggesting that the truth will generally be more extreme than point estimates; that is, the opinion-rank (as defined above) will have a distribution more concentrated at the extremes than a uniform distribution.
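Before sketching the argument, here is a small simulation that shows the phenomenon (the particular model and numbers are entirely my own illustration): forecasters mostly share the same information, so their point estimates cluster tightly around a common posterior mean, while the truth still varies with the full posterior uncertainty, and its opinion-rank piles up near 0% and 100%.

```python
import random

rng = random.Random(0)
n_cases, extreme = 2000, 0
for _ in range(n_cases):
    truth = rng.gauss(0, 1)
    shared_signal = truth + rng.gauss(0, 1)   # information everyone sees
    posterior_mean = 0.5 * shared_signal      # optimal point estimate given that signal
    # Each forecaster reports the common posterior mean plus small idiosyncratic noise.
    opinions = [posterior_mean + rng.gauss(0, 0.1) for _ in range(100)]
    rank = sum(op < truth for op in opinions) / len(opinions)
    extreme += (rank < 0.10) or (rank > 0.90)

print(extreme / n_cases)  # far above the 20% a uniform opinion-rank would give
```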

The argument goes as follows:

Continue reading "Truth is stranger than fiction" »


Sick of Textbook Errors

One of the most well-worn examples in introductions to Bayesian reasoning is testing for rare diseases: if the prior probability that a patient has a disease is sufficiently low, the probability that the patient has the disease conditional on a positive diagnostic test result may also be low, even for very accurate tests. One might hope that every epidemiologist would be familiar with this textbook problem, but this New York Times story suggests otherwise:

For months, nearly everyone involved thought the medical center had had a huge whooping cough outbreak, with extensive ramifications. […]

Then, about eight months later, health care workers were dumbfounded to receive an e-mail message from the hospital administration informing them that the whole thing was a false alarm.

Now, as they look back on the episode, epidemiologists and infectious disease specialists say the problem was that they placed too much faith in a quick and highly sensitive molecular test that led them astray.

While medical professionals can modestly improve their performance on inventories of cognitive bias when coached, we should not overestimate the extent to which formal instruction such as statistics or epidemiology classes will improve actual behavior in the field.
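For readers who haven’t seen the textbook calculation, here is a minimal worked version (with made-up numbers, not figures from the hospital episode):

```python
# Classic rare-disease calculation with illustrative numbers (not from the
# whooping cough episode): even a very sensitive test yields mostly false
# positives when the disease is rare.
prevalence = 0.001            # 1 in 1,000 people actually infected
sensitivity = 0.99            # P(positive test | infected)
false_positive_rate = 0.05    # P(positive test | not infected)

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_infected_given_positive = sensitivity * prevalence / p_positive
print(round(p_infected_given_positive, 3))  # ~0.019: under 2% of positives are real
```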


Symmetry Is Not Pretty

From Chatty Apes we learn that symmetry has little to do with whether a face is attractive:

Measurable symmetry accounts for less than 1% of the variance in the attractiveness of women’s faces and less than 3% of the variance of the attractiveness of men’s faces.  … the initial studies showing big effects typically involved samples of less than 20 faces each, which is irresponsibly small for correlational studies with open-ended variables.  Once the bigger samples started showing up, the effect basically disappeared for women and was shown to be pretty low for men.  But no one believed the later, bigger studies, even most of their own authors — pretty much everyone in my business still thinks that symmetry is a big deal in attractiveness.  So, the first lesson I learned:  Small samples are …  My solution has been to ditch the old p<.05 significance standard.

I see the same thing in health economics; once people see some data supporting a  theory that makes sense to them, they neglect larger contrary data.   
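The small-sample point in the quote is easy to illustrate with a rough simulation sketch (the effect size and sample size below are my assumptions): if symmetry really explains about 1% of the variance in attractiveness (a correlation near 0.1), correlations computed from samples of 20 faces scatter widely, so some small studies will find large, “significant” effects by chance.

```python
import random

def sample_correlation(true_r, n, rng):
    """Pearson correlation from n draws of a bivariate normal with correlation true_r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = true_r * x + (1 - true_r ** 2) ** 0.5 * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
rs = sorted(sample_correlation(0.1, 20, rng) for _ in range(1000))
print(rs[25], rs[975])  # central 95% of sample correlations: very roughly -0.35 to +0.5
```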
