I teach health economics data (to both undergrads and grads) by going over the main regression tables of a bunch of recently published journal articles. Such regressions usually have a health indicator (such as death rate) as the dependent variable, some focal factor which was the reason for the study as an independent variable, and then a bunch of other possible factors as control variables. Common variables include age, gender, race, income, education, alcohol, weight, exercise, living density, marital status, hours of sleep, dietary fat, medical spending, water supply, and so on.
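As a rough sketch of the kind of regression table described above, the following fits ordinary least squares with one focal variable and two controls. Everything in it is an assumption made for illustration: the data are simulated, and the variable names (`focal`, `age`, `income`, `death_rate`) and coefficient values are invented, not taken from any published study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (made-up) data: one focal factor plus two control variables.
focal = rng.normal(0.0, 1.0, n)
age = rng.normal(50.0, 10.0, n)
income = rng.normal(40.0, 12.0, n)

# Hypothetical "true" coefficients used only to generate the outcome.
true_coefs = {"intercept": 8.0, "focal": -0.5, "age": 0.10, "income": -0.03}
death_rate = (
    true_coefs["intercept"]
    + true_coefs["focal"] * focal
    + true_coefs["age"] * age
    + true_coefs["income"] * income
    + rng.normal(0.0, 1.0, n)  # unexplained noise
)

# Design matrix: a constant column, the focal factor, then the controls.
X = np.column_stack([np.ones(n), focal, age, income])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, death_rate, rcond=None)
estimates = dict(zip(true_coefs, beta))

for name, value in estimates.items():
    print(f"{name:>9}: {value: .3f}")
```

A paper built around `focal` would headline that one coefficient, but the same fit also yields the `age` and `income` estimates, which are the control-variable estimates under discussion here.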
How would one test the claim that selection bias is worse than other biases (failure to consider endogeneity, I guess)? RH's observation that you get different answers is a necessary condition, but not sufficient. If you can exhibit explicit biases, like correlation with funding, that's pretty good, but it might be that one of the sides is doing the appropriate correction and the other isn't. Moreover, if you can exhibit explicit biases, you probably have better options available.
David, yes, my best guess is that empirical selection bias is so bad that we are better off relying on a mix of control variable estimates and randomized experiments, with a heavier emphasis on experiments than one would otherwise choose.
Robin, I realize that could be true. Are you saying that things as they stand are bad enough that it actually is true? If so, that's a pretty radical statement.
Empirical researchers generally only seriously investigate one variable at a time. A million papers have been written trying to estimate the returns to education, and a million more to estimate the returns to tenure. But the ones that focus on education just throw tenure in on the right-hand side of the regression and vice-versa. I've certainly been guilty of this, and I'm not enough of an econometrician to know what to do about it, or if anything can be done. But it does suggest a problem with Robin's approach. The right-hand side variables that the researcher is not focused on were probably just thrown in there, without any attempt to deal with endogeneity or anything else. So while the estimates of those coefficients are free of one kind of bias (the bias in favor of whatever the researcher wants the answer to be), they may have other biases that are worse. It seems like Robin's suggestion is tantamount to suggesting that we would be better off if all researchers were constrained to just always run OLS. Isn't it?
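The worry about "thrown in" right-hand-side variables can be illustrated with a small simulation. This is only a sketch under invented assumptions: an unobserved trait `u` (say, health-consciousness) raises `exercise` and independently lowers the outcome, while the `focal` variable is unrelated to `u`. All names and numbers are hypothetical, and the example says nothing about how large either kind of bias is in real studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Unobserved trait that drives both the control variable and the outcome.
u = rng.normal(0.0, 1.0, n)

# The focal variable is independent of u; the control (exercise) is not.
focal = rng.normal(0.0, 1.0, n)
exercise = u + rng.normal(0.0, 1.0, n)

# Hypothetical true effects: focal = 1.0, exercise = -0.2, u = -1.0.
outcome = 1.0 * focal - 0.2 * exercise - 1.0 * u + rng.normal(0.0, 1.0, n)

# OLS that omits u, as a researcher who cannot observe it would have to.
X = np.column_stack([np.ones(n), focal, exercise])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
b_focal, b_exercise = beta[1], beta[2]

# Expected omitted-variable bias on exercise:
# cov(exercise, -u) / var(exercise) = -1 / 2, so the estimate is near -0.7.
print(f"focal estimate:    {b_focal: .3f} (true value  1.0)")
print(f"exercise estimate: {b_exercise: .3f} (true value -0.2)")
```

In this setup the control's coefficient is free of any bias toward a preferred answer but is still well off its true value, which is the trade-off being debated in this thread.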
Perry, I would prefer journals that followed your rule, as well as the rule that clinical trials must be declared beforehand, to let us correct for publication selection bias. But in the competitive market for journals, it is not clear our preferences will win.
I think that, now that we have the internet, people should be required to provide their raw data and calculations in supplementary online documentation accompanying every journal article. We are no longer in an age where there is an excuse that there is no place to store the information, and it would be much easier for people to check other people's work if the entire corpus were made available.
I like the idea. I will try to sneak it into my referee's comments when I can, though I suspect that it would be easier to do in the health field because the same variables end up on the right side in many studies.
Douglas, yes of course it would be better to know in more detail who is biased in which direction.
Conchis, the point is that it is much harder to game all the control variable estimates.
If we all started taking this advice, wouldn't we then expect researchers to just start gaming the system?
David, yes, when selection bias is bad enough, we are better off with only OLS biases.
Nick, usually: medical spending and water supply have no effect, alcohol is good, low weight is bad, and rural living is good.
Interesting idea. What are the main discrepancies from the received wisdom that seem to emerge when you look at the studies in this way?