In 1977 Michael Mahoney found that journal reviewer evaluations depend on a paper’s conclusion, not just its methods: 75 journal reviewers were asked to referee manuscripts which described identical experimental procedures but which reported positive, negative, mixed, or no results. In addition to showing poor interrater agreement, reviewers were strongly biased against manuscripts which reported results contrary to their theoretical perspective.
Prima facie it seems to me Newman was admirable in taking these precautions to guard against his own potential biases.
I'd expect some (pseudo?) fields to get a boost from conclusion-blind reviewing, such as parapsychology - unless, of course, reviewers substituted topic-based judgments for conclusion-based judgments.
Paul, an estimate for the strength of gravity is a formula involving a bunch of experimental parameters, such as the mass of this, the length of that, and the voltage there. Newman hid the value of one of these parameters from himself, in the sense that he only saw it plus noise.
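This "hidden offset" style of blinding can be sketched in a few lines of Python. Everything here is illustrative, not Newman's actual procedure: the parameter name, noise scale, and helper function are all hypothetical. The idea is that the analyst tunes the experiment using only the blinded value, and the offset is subtracted only once the analysis is frozen.

```python
import random

def make_blind(true_value, noise_scale, seed=None):
    """Return (blinded_value, unblind). The blinded value is the true value
    plus a random offset; the offset stays sealed inside the closure until
    unblind() is called at the end of the analysis."""
    rng = random.Random(seed)
    offset = rng.gauss(0.0, noise_scale)
    blinded = true_value + offset

    def unblind():
        # Subtract the hidden offset to recover the true parameter value.
        return blinded - offset

    return blinded, unblind

# Hypothetical example: blind a measured length (in metres) while
# adjusting the apparatus, so the adjustments can't chase a desired answer.
measured_length = 0.123456
blinded, unblind = make_blind(measured_length, noise_scale=0.01, seed=42)
# ... tune the experiment and freeze the analysis using `blinded` only ...
recovered = unblind()
```

The point of the closure is that the analyst never handles the offset directly; only the final unblinding step reveals the true value.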
Before I answer that, can you offer some more details of what exactly it was that Newman was hiding? Was it a hypothesized value, or a result?
Those who think there is little bias in judging a paper by its conclusion, do you think physicist Newman was mistaken to hide a parameter value from himself when adjusting his experiment?
This seems a feature rather than a bug. The reason reviewers ideally ought to be chosen is because they're familiar with the literature in the area of the paper. And obviously, that familiarity will lead them to develop a theoretical position, because the weight of the evidence, in their estimation, will be on that side. A paper that disagrees has the special burden to deal with that extra weight of evidence. Some variation should be explainable by papers failing to do so.
Barkley, I have (as yet) no objection to editors taking as much as possible into account about referees and their inclinations.
Regarding the random reviewer scheme, it still helps if the editor knows of the biases of the referees. This will (or should) provide an appropriate prior on the editor's judgment of the referee's judgment of the paper, whichever version the referee gets. Or are you proposing that the editor not know which referee got the paper? Again, all this assumes the editor knows enough about the people in the sub-area to make such judgments, which I fear is far from always being the case.
Curt, yes a false conclusion theory paper would contain a logic flaw. But it is a rare referee who would find the flaw.
The (good) reviewer shortage is a very real problem. I've seen some very serious suggestions that science would be better served if a small portion of grants were diverted to hiring scientists as professional reviewers. In experimentally driven fields at least, reviewers are relatively cheap compared to researchers. Here's one paper: http://www.elsevier.de/sixc...
Robin, how do you conclusion-blind a theory paper? Wouldn't it necessarily have a flaw in the math or logic?
Barkley, I propose each referee only review one random version of the paper, though they could have access to the other version if they wanted.
I was going to suggest a method like Stuart's where the conclusion-neutral parts be reviewed separately from the conclusion-specific parts. This could at least signal that a paper was relevant and well-designed but controversial. It might also encourage fairer introductions.
I think "ingrown toenail" situations are typical in science. Reviewers for a paper are normally chosen to be highly knowledgeable in the specific area of interest, and that's almost always a small group of labs. I've even seen a few cases of partitioning where multiple lines of investigation roll on with differing conclusions and virtually no interaction. For quite a while in transposable elements there were some investigators who claimed it was well-known there was no evidence for organismal regulation of TEs, while another set claimed it was well-established! Both had good, respected researchers, but they rarely cited each other's papers (it has since been proven there's a lot of regulation going on, via RNA interference at least - by research in a separate field). The disagreement was easy to understand, since the experiments were evaluated by subtle differences in complex and necessarily somewhat speculative mathematical models - but having both sides ignore the controversy was surreal.
I haven't seen too many papers whose conclusions *disagree* with their findings, but I see a lot with a huge overreach, where the conclusions are far stronger than justified by the paper. I was in a journal club for a couple years, and virtually every time we looked at a paper from Science or Nature we concluded there was severe overreach - to the point that we started joking about it. The most extreme case was one where the authors found a p = 0.05 correlation in some aspect of a bird species and concluded that was the primary method for all speciation! It made me wonder if, in order to get a paper into those top journals, you needed a zinger conclusion, and since a single study rarely produces conclusive results in evolutionary biology, the result was that only papers with exciting unjustified conclusions which somehow slipped past the reviewers got published in the top journals.
The practical problem for journal editors is that we have a hard enough time getting decent referees who will get reports in on any sort of decent schedule even for one version of a paper, much less several versions, some of which are phoney. Same goes for the idea of chopping papers up into sections. Refereeing resources are unfortunately quite scarce in the real world of publishing.
Stuart Armstrong is more on the money with a reasonable solution, assuming that the editors are well enough informed about the subject area. A wise and well-informed editor will usually try to get a range of reviewers, one (or some) who might be presumed to be favorably inclined to the paper (and its conclusions) and one (or some) who might be presumed to be not so favorably inclined.
This makes things easy if both sets agree, which sometimes happens. The problem arises, of course, when they do not and they break in the predictable way(s). That is when editors have to work harder, and it might be a situation where your conclusion-blinding suggestion could be useful on occasion.
Barkley, conclusion blinding could apply not only to experiment papers, but also to econometric papers, and to many theory papers. I don't mean to imply that there is always a unique alternative. Instead I propose to require a set of versions that sample a reasonable range of alternative conclusions. This can also help with the combinatorial problem when there are many sub-conclusions.
Short idea, related: what about unpackaging a paper (if possible) into smaller sections, and getting reviewers to comment on specific steps on each section? If one reviewer approved the method of gathering the data, (without knowing anything more), another approved the format of the experiment, another the statistical method involved...
This makes it more likely for mistakes to be spotted, and may help overcome conclusion bias, as only the last reviewer(s) would know the conclusion. And it he/she/they were the only ones who turned it down, that would be a red flag warning of possible conclusion bias.
What about deliberately biasing the reviewers, i.e. choosing (through a review of the literature, published positions, etc.) reviewers who would tend to disagree with the paper's conclusion?
Then any paper that does get published will be of a high quality. The cost is that borderline, or even good, papers may get unfairly rejected (the old story of balancing type I versus type II errors, as usual). Since science has to be very conservative in this way, the cost may be worth paying. Meta-reviewing might help, as would having reviewers precisely justify their decisions.
Or, conclusion-blind review might be implemented in a few papers - say 5% of a field or so - enough to make reviewers aware that the conclusion may be wrong, and force them to go through the body of the text.
On a side note, conclusion-blind review would be most entertaining to see in mathematics or physics ^_^ Not to mention completely impossible. But it does seem that it would help most in the fields most prone to bias.
Peter, the claim is not that it is never reasonable to judge a paper by its conclusion; the claim is that there is a tendency to weigh this factor too heavily. If the paper you mention had even a 10% chance of being right, that would make it well worth publishing.