Tag Archives: Statistics

Peer Review Is Random

Which academic articles get published in the more prestigious journals is a pretty random process. When referees review an academic paper, less than 20% of the variability in referee ratings is explained by a tendency to agree:

This paper presents the first meta-analysis for the inter-rater reliability (IRR) of journal peer reviews [using] … 70 reliability coefficients … from 48 studies. … [covering] 19,443 manuscripts; on average, each study had a sample size of 311 manuscripts (minimum: 28, maximum: 1983). … The more manuscripts that a study is based on, the smaller the reported IRR coefficients are. … If the information of the rating system for reviewers was reported in a study, then this was associated with a smaller IRR coefficient. … An ICC of .23 indicates that only 23% of the variability in the reviewers’ rating of a manuscript could be explained by the agreement of reviewers. (more: HT Tyler)

[Figure: reliability estimates and confidence intervals for each study, ordered by estimated reliability]

The above is from their key figure, showing reliability estimates and confidence intervals for studies ordered by estimated reliability. The most accurate studies found the lowest reliabilities, clear evidence of a bias toward publishing studies that find high reliability. I recommend trusting only the most solid studies, which give the most pessimistic (<20%) estimates.

A model seems useful here: one that finds the optimal number of referees per paper, given referee reliability, the value of identifying the best papers, and the relative cost of writing vs. refereeing a paper. Such a model could estimate the losses from having many journals with separate referees evaluate each article, vs. an integrated system.
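One ingredient of such a model can be sketched with the standard Spearman-Brown formula, which gives the reliability of the average of several ratings from the reliability of a single rating. This is only a rough sketch, not the cost model described above: it assumes referees are interchangeable and their errors independent, and the function name `pooled_reliability` is made up for illustration. The 0.23 input is the ICC quoted above.

```python
def pooled_reliability(single_rater_icc: float, n_referees: int) -> float:
    """Spearman-Brown formula: reliability of the mean of n parallel ratings.

    Assumes every referee has the same reliability and independent errors,
    which is an idealization rather than a claim about real journals.
    """
    if n_referees < 1:
        raise ValueError("n_referees must be at least 1")
    r = single_rater_icc
    return n_referees * r / (1 + (n_referees - 1) * r)


# How reliability of the averaged rating grows with more referees,
# starting from the single-referee ICC of 0.23 quoted in the post.
for k in (1, 2, 3, 5, 10):
    print(k, round(pooled_reliability(0.23, k), 3))
```

Under these assumptions each extra referee adds less than the one before, which is the kind of diminishing return a fuller model would weigh against the cost of refereeing.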


How Exceptional Is Gelman?

In response to my saying:

Academia is primarily an institution for credentialling folks as intellectually impressive, so that others can affiliate with them.

Andrew Gelman penned “Another reason I’m glad I’m not an economist”:

That [Robin] would write such an extreme statement without even feeling the need to justify it (and, no, I don’t think it’s true, at least not in the “academia” that I know about) . . . that I see as a product of being in an economics department.

I responded:

I have posted many times here on [this]. … The standard idealistic [story] is that academics know useful and important things, things which students want to learn, media want to report, consulting clients want to apply, … These idealistic theories … have [these listed] detailed problems. … It seems far simpler to me to just postulate that people care primarily about affiliating with others who have been certified as prestigious.

Andrew answered: Continue reading "How Exceptional Is Gelman?" »


Why so little model checking done in statistics?

One thing that bugs me is that there seems to be so little model checking done in statistics.  Data-based model checking is a powerful tool for overcoming bias, and it’s frustrating to see this tool used so rarely.  As I wrote in this referee report,

I’d like to see some graphs of the raw data, along with replicated datasets from the model. The paper admirably connects the underlying problem to the statistical model; however, the Bayesian approach requires a lot of modeling assumptions, and I’d be a lot more convinced if I could (a) see some of the data and (b) see that the fitted model would produce simulations that look somewhat like the actual data. Otherwise we’re taking it all on faith.

But, why, if this is such a good idea, do people not do it? 

Continue reading "Why so little model checking done in statistics?" »


How should unproven findings be publicized?

A year or so ago I heard about a couple of papers by Satoshi Kanazawa on "Engineers have more sons, nurses have more daughters" and "Beautiful parents have more daughters."  The titles surprised me, because in my acquaintance with such data, I’d seen very little evidence of sex ratios at birth varying much at all, certainly not by 26% as was claimed in one of these papers.  I looked into it and indeed it turned out that the findings could be explained as statistical artifacts–the key errors were, in one of the studies, controlling for intermediate outcomes and, in the other study, reporting only one of multiple potential hypothesis tests.  At the time, I felt that a key weakness of the research was that it did not include collaboration with statisticians, experimental psychologists, or others who are aware of these issues.

Continue reading "How should unproven findings be publicized?" »


Statistical inefficiency = bias, or, Increasing efficiency will reduce bias (on average), or, There is no bias-variance tradeoff

Statisticians often talk about a bias-variance tradeoff, comparing a simple unbiased estimator (for example, a difference in differences) to something more efficient but possibly biased (for example, a regression).  There’s commonly the attitude that the unbiased estimate is a better or safer choice.  My only point here is that, by using a less efficient estimate, we are generally choosing to estimate fewer parameters: for example, estimating an average incumbency effect over a 40-year period rather than a separate effect for each year or each decade, or estimating an overall effect of a treatment rather than separate effects for men and women.  If we do this (make the seemingly conservative choice to not estimate interactions), we are implicitly estimating these interactions at zero, which is not unbiased at all!

I’m not saying that there are any easy answers to this; for example, see here for one of my struggles with interactions in an applied problem. In this case (estimating the effect of incentives in sample surveys), we were particularly interested in certain interactions even though they could not be estimated precisely from data.


Useful bias

I would like to introduce the perhaps, in this forum, heretical notion of useful bias.  By useful bias I mean the deliberate introduction of an error as a means to solving a problem.  The two examples I discuss below are concrete rather than abstract and come from my training as an infantry officer many years ago.  Now technology solves the problems they solved, but the examples may still serve to illustrate the notion.

Continue reading "Useful bias" »


Truth is stranger than fiction

Robin asks the following question here:

How does the distribution of truth compare to the distribution of opinion?  That is, consider some spectrum of possible answers, like the point difference in a game, or the sea level rise in the next century. On each such spectrum we could get a distribution of (point-estimate) opinions, and in the end a truth.  So in each such case we could ask for truth’s opinion-rank: what fraction of opinions were less than the truth?  For example, if 30% of estimates were below the truth (and 70% above), the opinion-rank of truth was 30%.

If we look at lots of cases in some topic area, we should be able to collect a distribution for truth’s opinion-rank, and so answer the interesting question: in this topic area, does the truth tend to be in the middle or the tails of the opinion distribution?  That is, if truth usually has an opinion rank between 40% and 60%, then in a sense the middle conformist people are usually right.  But if the opinion-rank of truth is usually below 10% or above 90%, then in a sense the extremists are usually right.
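The opinion-rank defined above is easy to compute once estimates and the eventual truth are in hand. Here is a minimal sketch; the function names and the sample numbers are invented for illustration, and estimates exactly equal to the truth are counted as not below it, which is one possible convention.

```python
from typing import Sequence


def opinion_rank(estimates: Sequence[float], truth: float) -> float:
    """Fraction of point estimates that fall strictly below the truth."""
    if not estimates:
        raise ValueError("need at least one estimate")
    return sum(1 for e in estimates if e < truth) / len(estimates)


def share_in_tails(ranks: Sequence[float], low: float = 0.10, high: float = 0.90) -> float:
    """Fraction of cases whose opinion-rank is below `low` or above `high`."""
    if not ranks:
        raise ValueError("need at least one rank")
    return sum(1 for r in ranks if r < low or r > high) / len(ranks)


# Hypothetical case: ten point estimates and a truth of 3.5.
estimates = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(opinion_rank(estimates, truth=3.5))
```

Collecting `opinion_rank` over many cases in a topic area and passing the results to `share_in_tails` gives a crude answer to whether truth tends to sit in the middle or in the extremes.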

My response:

1.  As Robin notes, this is ultimately an empirical question which could be answered by collecting a lot of data on forecasts/estimates and true values.

2.  However, there is a simple theoretical argument suggesting that truth will generally be more extreme than point estimates; that is, the opinion-rank (as defined above) will have a distribution that is more concentrated at the extremes than a uniform distribution.

The argument goes as follows:

Continue reading "Truth is stranger than fiction" »


Sick of Textbook Errors

One of the most well-worn examples in introductions to Bayesian reasoning is testing for rare diseases: if the prior probability that a patient has a disease is sufficiently low, the probability that the patient has the disease conditional on a positive diagnostic test result may also be low, even for very accurate tests. One might hope that every epidemiologist would be familiar with this textbook problem, but this New York Times story suggests otherwise:

For months, nearly everyone involved thought the medical center had had a huge whooping cough outbreak, with extensive ramifications. […]

Then, about eight months later, health care workers were dumbfounded to receive an e-mail message from the hospital administration informing them that the whole thing was a false alarm.

Now, as they look back on the episode, epidemiologists and infectious disease specialists say the problem was that they placed too much faith in a quick and highly sensitive molecular test that led them astray.

While medical professionals can modestly improve their performance on inventories of cognitive bias when coached, we should not overestimate the extent to which formal instruction such as statistics or epidemiology classes will improve actual behavior in the field.
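For readers who want the textbook calculation spelled out, here is a minimal sketch of Bayes' rule applied to a diagnostic test. The numbers are hypothetical and are not taken from the New York Times story; the function name is made up for illustration.

```python
def posterior_given_positive(prior: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test) by Bayes' rule.

    prior:       P(disease) before testing
    sensitivity: P(positive | disease)
    specificity: P(negative | no disease)
    """
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Hypothetical rare disease: 1 in 1,000 prevalence, and a test that is
# 99% sensitive and 99% specific.
print(round(posterior_given_positive(prior=0.001, sensitivity=0.99, specificity=0.99), 3))
```

With these made-up inputs the false positives from the healthy majority outnumber the true positives by roughly ten to one, so a positive result still leaves the disease unlikely.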


Symmetry Is Not Pretty

From Chatty Apes we learn that symmetry has little to do with whether a face is attractive:

Measurable symmetry accounts for less than 1% of the variance in the attractiveness of women’s faces and less than 3% of the variance of the attractiveness of men’s faces.  … the initial studies showing big effects typically involved samples of less than 20 faces each, which is irresponsibly small for correlational studies with open-ended variables.  Once the bigger samples started showing up, the effect basically disappeared for women and was shown to be pretty low for men.  But no one believed the later, bigger studies, even most of their own authors — pretty much everyone in my business still thinks that symmetry is a big deal in attractiveness.  So, the first lesson I learned:  Small samples are …  My solution has been to ditch the old p<.05 significance standard.

I see the same thing in health economics: once people see some data supporting a theory that makes sense to them, they neglect larger contrary data.


Malatesta Estimator

We frequently encounter competing estimates of politically salient magnitudes. One example would be the number of attendees at the 1995 “Million Man March”.  Such estimates often come from biased observers seeking to create or dispel an impression of strength.  Someone interested in generating a more neutral estimate might consider applying what I would call the Malatesta Estimator, which I have named after its formulator, the 14th Century Italian mercenary captain, Galeotto Malatesta of Rimini (d. abt. 1385). His advice was: “Take the mean between the maximum given by the exaggerators, and the minimum by detractors, and deduct a third” (Saunders 2004).  This simplifies into: the sum of the maximum and the minimum, divided by three.  It adjusts for the fact that the minimum is bounded below by zero, while there is no bound on the maximum.  Of course, it only works if the maximum is at least double the minimum; otherwise the estimate falls below the minimum itself.

In the case of the Million Man March, supporters from the Nation of Islam claimed attendance of 1.5 to 2 million.  The Park Service suggested initially that 400,000 had participated.  Taking 2 million as the maximum and 400,000 as the minimum, the Malatesta Estimator yields an estimate of 800,000.  We can calibrate this by comparing it with an estimate by Dr. Farouk El-Baz and his team at the Boston University Remote Sensing Lab.  Dr. El-Baz and his team used samples of 1-square-meter pixels from a number of overhead photos to estimate the density per pixel, and then calculated an estimate for the entire area.  Their estimate was 837,000, with 20% error bounds giving a range from 670,000 to 1 million.
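The arithmetic above can be written out as a short sketch. The function name and the error handling are my own choices for illustration; only the formula and the Million Man March figures come from the text.

```python
def malatesta_estimate(minimum: float, maximum: float) -> float:
    """Mean of the minimum and maximum claims, less a third.

    Algebraically this is (minimum + maximum) / 3.
    """
    if minimum < 0 or maximum < minimum:
        raise ValueError("expected 0 <= minimum <= maximum")
    return (minimum + maximum) / 3


def is_applicable(minimum: float, maximum: float) -> bool:
    """The estimate stays at or above the minimum only when maximum >= 2 * minimum."""
    return maximum >= 2 * minimum


# Million Man March figures from the text: Park Service minimum of 400,000,
# Nation of Islam maximum of 2 million.
print(malatesta_estimate(400_000, 2_000_000))
print(is_applicable(400_000, 2_000_000))
```

The `is_applicable` check encodes the caveat that the rule breaks down when the competing claims are close together.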

Saunders, Frances Stonor. 2004. The Devil’s Broker: Seeking Gold, God, and Glory in Fourteenth-Century Italy. (New York: HarperCollins), p. 93.

BU Remote Sensing Lab Press Release: http://www.bu.edu/remotesensing/Research/MMM/MMMnew.html

Accessed 14 December 2006.
