Control Variables Avoid Bias

I teach health economics data (to both undergrads and grads) by going over the main regression tables of a bunch of recently published journal articles. Such regressions usually have a health indicator (such as death rate) as the dependent variable, some focal factor which was the reason for the study as an independent variable, and then a bunch of other possible factors as control variables. Common control variables include age, gender, race, income, education, alcohol, weight, exercise, living density, marital status, hours of sleep, dietary fat, medical spending, water supply, and so on.

I warn students that most studies have an agenda associated with their focal factor; the authors, funders, and referees have answers they expect and want to see. Authors can manipulate the statistics to get the answer they want, and funders and referees can refuse to publish unwanted answers. So I tell students to focus more on the control variables when deciding what to believe. For example, you can better trust the control variable estimates of the effect of alcohol than the estimates from studies where alcohol was the main focus.

Of course authors won’t be as careful about control variables, and so you should expect more sloppiness and noise in the estimates.  But control estimates should be less biased.  I wish someone would do a meta-analysis comparing the estimates of control and focal variables, to test my bias suspicions.
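To make the intuition concrete, here is a minimal simulation sketch. It is an illustration under assumptions I made up for the purpose, not an analysis of any real study: the true effect (0.2), the standard errors, and the publication rule (a focal estimate only gets published if it is significantly positive) are all hypothetical. Focal estimates are drawn with less noise but pass through a selection filter; control estimates are noisier but are reported regardless.

```python
import random
import statistics


def simulate_published_estimates(true_effect=0.2, n_studies=5000,
                                 focal_se=0.5, control_se=1.0, seed=0):
    """Compare mean published estimates of one variable's effect when it is
    the focal variable (selected on significance) versus a control variable
    (reported regardless). All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    focal_published = []
    control_published = []
    for _ in range(n_studies):
        # Focal variable: estimated carefully (smaller standard error), but
        # assumed to be published only if significantly positive (z > 1.96).
        focal_estimate = rng.gauss(true_effect, focal_se)
        if focal_estimate / focal_se > 1.96:
            focal_published.append(focal_estimate)

        # Control variable: estimated sloppily (larger standard error), but
        # reported whatever the result, since nobody is selecting on it.
        control_published.append(rng.gauss(true_effect, control_se))

    return (statistics.mean(focal_published),
            statistics.mean(control_published))


if __name__ == "__main__":
    focal_mean, control_mean = simulate_published_estimates()
    print(f"true effect:                    {0.2:.2f}")
    print(f"mean published focal estimate:  {focal_mean:.2f}")
    print(f"mean control estimate:          {control_mean:.2f}")
```

Under these assumptions the published focal estimates overstate the effect, while the control estimates scatter widely but center on the true value. Whether real literatures behave this way is exactly what the meta-analysis would have to test.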

Added: A big problem is the growing trend of omitting control variable estimates from the published paper. For example, this week’s interesting NEJM article on air pollution and heart attacks just says "all estimates were adjusted for age, ethnicity, education, household income, smoking status, years smoked, cigarettes per day, diabetes, hypertension, systolic blood pressure, BMI, and hypercholesterolemia."

More Added: Oops – that study does give control variable estimates.  This study on sleep, however, does not. 

  • http://profile.typekey.com/nickbostrom/ Nick Bostrom

    Interesting idea. What are the main discrepancies from the received wisdom that seem to emerge when you look at the studies in this way?

  • http://profile.typekey.com/tschoegl/ Adrian Tschoegl

    I like the idea. I will try to sneak it into my referee’s comments when I can, though I suspect that it would be easier to do in the health field because the same variables end up on the right side in many studies.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Nick, usually: medical spending and water supply have no effect, alcohol is good, low weight is bad, and rural living is good.

  • Perry E. Metzger

    I think that, now that we have the internet, people should be required to provide their raw data and calculations in supplementary online documentation accompanying every journal article. We are no longer in an age where there is an excuse that there is no place to store the information, and it would be much easier for people to check other people’s work if the entire corpus was made available.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Perry, I would prefer journals that followed your rule, as well as the rule that clinical trials must be declared beforehand, to let us correct for publication selection bias. But in the competitive market for journals, it is not clear our preferences will win.

  • David J. Balan

    Empirical researchers generally only seriously investigate one variable at a time. A million papers have been written trying to estimate the returns to education, and a million more to estimate the returns to tenure. But the ones that focus on education just throw tenure in on the right-hand side of the regression and vice-versa. I’ve certainly been guilty of this, and I’m not enough of an econometrician to know what to do about it, or if anything can be done. But it does suggest a problem with Robin’s approach. The right-hand side variables that the researcher is not focused on were probably just thrown in there, without any attempt to deal with endogeneity or anything else. So while the estimates of those coefficients are free of one kind of bias (the bias in favor of whatever the researcher wants the answer to be), they may have other biases that are worse. It seems like Robin’s suggestion is tantamount to suggesting that we would be better off if all researchers were constrained to just always run OLS. Isn’t it?

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    David, yes, when selection bias is bad enough, we are better off with only OLS biases.

  • David J. Balan

    Robin, I realize that could be true. Are you saying that things as they are are bad enough that it actually is true? If so, that’s a pretty radical statement.

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    David, yes, my best guess is that empirical selection bias is so bad that we are better off relying on a mix of control variable estimates and randomized experiments, with a heavier emphasis on experiments than one would otherwise choose.

  • Douglas Knight

    How would one test the claim that selection bias is worse than other biases (failure to consider endogeneity, I guess)? RH’s observation that you get different answers is a necessary condition, but not sufficient. If you can exhibit explicit biases, like correlation with funding, that’s pretty good, but it might be that one of the sides is doing the appropriate correction and the other isn’t. Moreover, if you can exhibit explicit biases, you probably have better options available.

  • conchis

    If we all started taking this advice, wouldn’t we then expect researchers to just start gaming the system?

  • http://profile.typekey.com/robinhanson/ Robin Hanson

    Douglas, yes of course it would be better to know in more detail who is biased in which direction.

    Conchis, the point is that it is much harder to game all the control variable estimates.