Conclusion-Blind Review

In 1977 Michael Mahoney found that journal reviewer evaluations depend on a paper’s conclusion, not just its methods:

75 journal reviewers were asked to referee manuscripts which described identical experimental procedures but which reported positive, negative, mixed, or no results. In addition to showing poor interrater agreement, reviewers were strongly biased against manuscripts which reported results contrary to their theoretical perspective.

Alas, we can’t repeat this experiment, as it supposedly violates human-subject ethics (the reviewers were not told they were being studied). But it seems clear that people quite often judge papers by their conclusions, and this creates publication biases.

As an undergraduate I helped Riley Newman measure the strength of gravity at short distances.  Such measurements are tricky, and one is tempted to keep looking for "mistakes" until one gets the standard number.  To keep himself honest, Riley would give a colleague the exact value of a key parameter, while he himself worked with that value plus added noise.  Only when he had done all he could to reduce errors would he ask for the exact value, and then directly publish the resulting final estimate.
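
To make the mechanics of that blinding concrete, here is a minimal Python sketch of the general idea, not of Newman's actual setup: the exact value of one key parameter is sealed away, the experimenter tunes the analysis against a noisy copy, and the exact value is substituted back only after the analysis is frozen. The parameter names, noise scale, and toy estimator below are all invented for illustration.

```python
import random

# A minimal sketch of the blind-analysis idea described above. Everything here
# (the parameter names, noise scale, and toy estimator) is invented for
# illustration; it is not Newman's actual apparatus or formula.

def blind(true_value, noise_scale=0.05, seed=None):
    """Return a blinded copy of a key parameter: the true value plus random noise.
    The true value itself stays sealed with a colleague until the analysis is frozen."""
    rng = random.Random(seed)
    return true_value + rng.gauss(0.0, noise_scale * abs(true_value))

def estimate_gravity(mass_kg, length_m, voltage_v):
    """Toy stand-in for the real estimator, which combines many measured parameters."""
    return 6.674e-11 * mass_kg * voltage_v / length_m

# The colleague records the exact value; the experimenter works only with the noisy copy.
true_voltage = 1.000                      # sealed away until the end
working_voltage = blind(true_voltage, seed=42)

# All error hunting, debugging, and tuning happens against the blinded value...
provisional = estimate_gravity(0.01, 0.10, working_voltage)

# ...and only once the analysis is frozen is the exact value swapped in and the
# resulting estimate published as-is.
final = estimate_gravity(0.01, 0.10, true_voltage)
print(provisional, final)
```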

In this spirit, consider conclusion-blind review.  Authors would write, post, and submit at least two versions of each paper, with opposite conclusions.  Only after a paper was accepted would they say which conclusion was real.   (To avoid funding bias, perhaps we could forbid them from telling even their funders early which conclusion was real.)   

Many journals have experimented with author-blind review, where the author’s name is hidden.  But conclusion bias seems a bigger problem than author bias, and internet posts do not spoil conclusion-blind review.  The main problem I can see is the delay in distributing the paper’s actual conclusion.  But creating an incentive for faster journal review wouldn’t be such a bad thing.

  • http://profile.typekey.com/guan/ Guan Yang

    I once saw a paper submitted to a referee where the authors’ names at the top of the page had been blacked out, but their email addresses and institutional affiliations in the footnote at the bottom had not.

  • conchis

    In defence of such apparent biases, people often respond with the argument that “extraordinary claims require extraordinary evidence”. At some level, this argument seems reasonable, but I worry: (a) that if our assessments of what constitutes an “extraordinary claim” are already biased, this rule will only magnify such biases; and (b) that the vagueness of the rule allows too much scope for inconsistent application of the “extraordinary evidence” requirement. I’d be interested to hear your thoughts on the argument, both in this specific context, and more generally.

  • Curt Adams

    Anybody who’s tried to get even a mildly controversial paper published knows about conclusion bias. IME the author blinding isn’t going to be very effective either – frequently people I know can deduce who reviewed the paper, so I figure most of the time the reviewers can deduce the author, or at least lab affiliation, which is close enough. Two different papers is an interesting idea but I see a lot of problems. The biggest is that a big part of the review is how the author connects the discovered evidence to the conclusions. That has to differ between the two papers, and so reviewers can’t do one review for both papers. Having to write two papers is a big strike too. Finally most papers have more than one conclusion, leaving the “reverse” paper ill-defined.

    I lean more to adding in more people to buffer the bias effect. We could have independent, paid meta-reviewers who look at a reviewer’s writings and affiliations to determine whether they have substantial bias. For efficiency, this could be reserved for cases where the author considers bias an issue, and treated more as an appeal. Larger review groups and a policy of “publish controversies” could also help, at least when the dispute is already live in the field.

  • http://profile.typekey.com/nickbostrom/ Nick Bostrom

    To illustrate what conchis might have in mind, suppose a researcher sets out to estimate the current world population by aggregating different data sources and putting them through some complex statistical procedure. A reviewer looking at the statistical procedure might not spot anything wrong with it, so the paper should presumably be published. But suppose the world population estimate it gives is 289 people. Then we know something has gone wrong, and the paper must be rejected or revised.

    So it seems that the conclusion of a paper could be a valid clue to its quality, albeit one that could easily be abused to rationalize prejudices.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Nick, obviously any proposal to blind review to anything risks reviewers ignoring relevant info. So it could only make sense if it prevented a bias that caused more harm than the good that could be done by looking at the relevant info. Authors would do some self-selection, so I very much doubt anyone would submit a paper claiming world population to be 289.

    Curt, I agree more writing would be required, but less than two entirely different papers. And yes, papers with more conclusions could require more variations. Reviewing reviewers also seems an interesting approach to explore.

  • http://profile.typekey.com/bayesian/ Peter McCluskey

    Without more evidence concerning the harm caused by conclusion bias, it’s hard to tell whether your suggestion is valuable or whether it reflects overconfidence in your ability to find solutions to biases.
    Here is a real example of a scientific paper whose quality I’m unsure how much to discount on the basis of its conclusions. It claims that special relativity isn’t quite right, and proposes a modified version of special relativity. A short version of the paper was rejected by a major physics journal which apparently made little effort to hide the fact that it was basing the rejection on the conclusion (i.e. it appeared not to have read enough of the paper to evaluate the reasoning). A lesser journal published the full paper (see http://cat.inist.fr/?aModele=afficheN&cpsidt=17497226).
    I happen to know the author via a hiking club, and am wondering how seriously I should take his arguments. It’s clear that he’s at least a competent physicist. To the limited extent I can follow his reasoning, it appears good enough that I doubt I could find flaws in it with anything resembling a reasonable effort. I’ve tentatively concluded that his hypothesis has less than a 50% chance of being right. My reasons for assigning a low probability seem to be based entirely on the belief that most people who think they’ve found something wrong with special relativity are a bit crazy. I haven’t seen any other signs that he’s more prone to crazy ideas than an ordinary physicist is. Should I believe I’m suffering from conclusion bias and increase my estimate of the probability that he is right?

  • http://cob.jmu.edu/rosserjb Barkley Rosser

    Well, as the editor of a journal that has a reputation for publishing papers that contain unexpected or unusual conclusions, I fear that I must agree with the general argument here. I certainly am all too aware of referees making judgments on whether a paper’s conclusions agree or disagree with their priors or biases. I see it all the time.

    That said, it is not clear that Robin Hanson’s intriguing suggestion is always operational. It works in the case of experimental economics, where there is a very definite conclusion coming out of the paper (“people contribute less in a public goods game when shown pictures of white rats in boxes than when they are shown pictures of black cats climbing trees”). But lots of such papers come out with much murkier conclusions: in this case this, in other cases that, maybe sometimes, and so forth. And when one gets beyond experimental economics the conclusions may be even harder to pin down. There may not be a clear “alternative” conclusion, although presumably one could write a different one.

    Of course what is weird, and I have seen it more than once, is when a paper’s stated conclusions rather clearly contradict what one finds in the content of the paper itself. There seem to be a lot of reasons for this, including bias on the part of the authors, a pathetic desire to please the presumed biases of editors and referees, or just plain stupidity.

    Regarding the whole double blind procedure in refereeing, it is true that it is getting harder and harder to maintain this, especially as increasingly most papers are up on some website somewhere that can be pretty easily accessed by a bit of googling. I have even heard it stated by advocates of single blind refereeing that “if a paper is not up on a website, it must not be very good.” I do not buy that.

    It is certainly true that many referees are able to identify the author(s) of a paper, even without googling around websites. This is more the case in experimental economics, I think, where there is a hard core that is almost like an ingrown toenail, it is so self-referential and so preeningly self-conscious. This is breaking down somewhat as more and more economists are doing experiments, but sometimes these “outsiders” have trouble getting in if they are not part of one of the “in” labs that are well known and well established. I gather this is a problem in the harder sciences as well.

    However, I can attest that there are frequently errors in these assumptions about who the author(s) is/are. I remember well once having a paper turned down by the QJE. A referee wrote that “this is clearly another lousy paper by X.” I was (and still am) not X. However, I had cited an obscure and unpublished working paper by X, and that was back before we were googling all the websites. Oh well…

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Peter, the claim is not that it is never reasonable to judge a paper by its conclusion; the claim is that there is a tendency to weigh this factor too heavily. If the paper you mention had even a 10% chance of being right, that would make it well worth publishing.

  • Stuart Armstrong

    What about deliberately biasing the reviewers, i.e. choosing (through a review of the literature, published positions, etc.) reviewers who would tend to disagree with the paper’s conclusion?

    Then any paper that does get published will be of high quality. The cost is that borderline, or even good, papers may get unfairly rejected (the old story of balancing type I versus type II errors, as usual). Since science has to be very conservative in this way, the cost may be worth paying. Meta-reviewing might help, as would having reviewers precisely justify their decisions.

    Or, conclusion-blind review might be implemented for a few papers – say 5% of a field or so – enough to make reviewers aware that the conclusion may be wrong, and force them to go through the body of the text.

    On a side note, conclusion-blind review would be most entertaining to see in mathematics or physics ^_^ Not to mention completely impossible. But it does seem that it would help the most in the fields most prone to bias.

  • Stuart Armstrong

    Short idea, related: what about unpacking a paper (if possible) into smaller sections, and getting reviewers to comment on specific steps in each section? One reviewer would approve the method of gathering the data (without knowing anything more), another the format of the experiment, another the statistical method involved…

    This makes it more likely for mistakes to be spotted, and may help overcome conclusion bias, as only the last reviewer(s) would know the conclusion. And if he/she/they were the only ones who turned it down, that would be a red flag warning of possible conclusion bias.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Barkley, conclusion blinding could apply not only to experiment papers, but also to econometric papers, and to many theory papers. I don’t mean to imply that there is always a unique alternative. Instead I propose to require a set of versions that sample a reasonable range of alternative conclusions. This can also help with the combinatorial problem when there are many sub-conclusions.

  • http://cob.jmu.edu/rosserjb Barkley Rosser

    Robin,

    The practical problem for journal editors is that we have a hard enough time getting decent referees who will get reports in on any sort of decent schedule even for one version of a paper, much less several versions, some of which are phoney. Same goes for the idea of chopping papers up into sections. Refereeing resources are unfortunately quite scarce in the real world of publishing.

    Stuart Armstrong is more on the money with a reasonable solution, assuming that the editors are well enough informed about the subject area. A wise and well-informed editor will usually try to get a range of reviewers, one (or some) who might be presumed to be favorably inclined to the paper (and its conclusions) and one (or some) who might be presumed to be not so favorably inclined.

    This makes things easy if both sets agree, which sometimes happens. The problem arises, of course, when they do not and they break in the predictable way(s). That is when editors have to work harder, and might be a situation where your conclusion-blinding suggestion could be useful on occasion.

  • Curt Adams

    I was going to suggest a method like Stuart’s where the conclusion-neutral parts be reviewed separately from the conclusion-specific parts. This could at least signal that a paper was relevant and well-designed but controversial. It might also encourage fairer introductions.

    I think “ingrown toenail” situations are typical in science. Reviewers for a paper are normally chosen to be highly knowledgeable in the specific area of interest, and that’s almost always a small group of labs. I’ve even seen a few cases of partitioning where multiple lines of investigation roll on with differing conclusions and virtually no interaction. For quite a while in transposable elements there were some investigators who claimed it was well known there was no evidence for organismal regulation of TEs, while another set claimed it was well established! Both had good, respected researchers, but they rarely cited each other’s papers (it has since been proven there’s a lot of regulation going on, via RNA interference at least – by research in a separate field). The disagreement was easy to understand, since the experiments were evaluated by subtle differences in complex and necessarily somewhat speculative mathematical models – but having both sides ignore the controversy was surreal.

    I haven’t seen too many papers whose conclusions *disagree* with their findings, but I see a lot with a huge overreach, where the conclusions are far stronger than justified by the paper. I was in a journal club for a couple of years, and virtually every time we looked at a paper from Science or Nature we concluded there was severe overreach – to the point that we started joking about it. The most extreme case was one where the authors found a p = 0.05 correlation in some aspect of a bird species and concluded that was the primary method for all speciation! It made me wonder if, in order to get a paper into those top journals, you needed a zinger conclusion, and since a single study rarely produces conclusive results in evolutionary biology, the result was that only papers with exciting but unjustified conclusions that somehow slipped past the reviewers got published in the top journals.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Barkley, I propose each referee only review one random version of the paper, though they could have access to the other version if they wanted.

  • Curt Adams

    The (good) reviewer shortage is a very real problem. I’ve seen some very serious suggestions that science would be better served if a small portion of grants were diverted to hiring scientists as professional reviewers. In experimentally driven fields at least, reviewers are relatively cheap compared to researchers. Here’s one paper: http://www.elsevier.de/sixcms/media.php/795/ejcb_forum.pdf

    Robin, how do you conclusion-blind a theory paper? Wouldn’t it necessarily have a flaw in the math or logic?

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Curt, yes a false conclusion theory paper would contain a logic flaw. But it is a rare referee who would find the flaw.

  • http://cob.jmu.edu/rosserjb Barkley Rosser

    Robin,

    Regarding the random reviewer scheme, it still helps if the editor knows of the biases of the referees. This will (or should) provide an appropriate prior on the editor’s judgment of the referee’s judgment of the paper, whichever version the referee gets. Or are you proposing that the editor not know which referee got the paper? Again, all this assumes the editor knows enough about the people in the sub-area to make such judgments, which I fear is far from always being the case.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Barkley, I have (as yet) no objection to editors taking as much as possible into account about referees and their inclinations.

  • Paul Gowder

    This seems a feature rather than a bug. Reviewers ideally ought to be chosen because they’re familiar with the literature in the area of the paper. And obviously, that familiarity will lead them to develop a theoretical position, because the weight of the evidence, in their estimation, will be on that side. A paper that disagrees has the special burden of dealing with that extra weight of evidence. Some variation should be explainable by papers failing to do so.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Those who think there is little bias in judging a paper by its conclusion, do you think physicist Newman was mistaken to hide a parameter value from himself when adjusting his experiment?

  • Paul Gowder

    Before I answer that, can you offer some more details of what exactly it was that Newman was hiding? Was it a hypothesized value, or a result?

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Paul, an estimate for the strength of gravity is a formula involving a bunch of experimental parameters, such as the mass of this, the length of that, and the voltage there. Newman hid the value of one of these parameters, in the sense of only seeing it plus noise.

  • http://profile.typekey.com/nickbostrom/ Nick Bostrom

    Prima facie it seems to me Newman was admirable in taking these precautions to guard against his own potential biases.

    I’d expect some (pseudo?) fields to get a boost from conclusion-blind reviewing, such as parapsychology – unless, of course, reviewers substituted topic-based judgments for conclusion-based judgments.
