Which variables do I need to control for?
Five types of variables
Which Variables Do I Need?
I started this substack off wanting to discuss inexact matching, but before we get into the nuts and bolts of inexact matching, I think a clear explanation of how to achieve unconfoundedness would be helpful. It comes up a lot, and I think sometimes there isn’t enough clear exposition about it out there, so I figured why not just do this now, and next week I’ll wrap up my inexact matching substack.
Most of us learned causal inference for the very first time in a stats class where we learned about “running regressions” and “controlling for covariates”. If the class was somewhat advanced, that introduction to the ordinary least squares formulas might even be followed by the Frisch-Waugh-Lovell theorem, where we learned OLS was “partialling out” those extra variables’ effects so that we could focus just on the partial relationship between the regressor of interest and the outcome. And then if we went even further, we might learn about the theoretical properties of OLS whereby if all the confounders were included in the model and measured well, and the data was reasonably smooth and treatment effects homogeneous and additive, then under certain OLS specifications, we could obtain estimates of the ATE. Without those covariates in the model, we learned that the model suffered from “omitted variable bias”. What did that mean? It meant that had we not controlled for the confounders, then our OLS estimates wouldn’t be causal. Those are fun moments in everybody’s life, I think: realizing that even outside the experiment, I might actually get an estimate of a causal effect by “running a regression”.
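If you want to see the partialling-out logic in action, here is a minimal sketch in Python on simulated data. Everything in it, the variable names, the coefficients, the data-generating process, is my own made-up illustration rather than anything from a particular textbook, but it shows the Frisch-Waugh-Lovell equivalence numerically.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: w confounds the d -> y relationship; the true coefficient on d is 2.0.
rng = np.random.default_rng(0)
n = 10_000
w = rng.normal(size=n)
d = 0.8 * w + rng.normal(size=n)
y = 2.0 * d + 1.5 * w + rng.normal(size=n)

# (1) The long regression: y on d and w together.
long_fit = sm.OLS(y, sm.add_constant(np.column_stack([d, w]))).fit()

# (2) Frisch-Waugh-Lovell: partial w out of both y and d, then regress residual on residual.
y_resid = sm.OLS(y, sm.add_constant(w)).fit().resid
d_resid = sm.OLS(d, sm.add_constant(w)).fit().resid
fwl_fit = sm.OLS(y_resid, d_resid).fit()

print(long_fit.params[1])   # coefficient on d from the long regression
print(fwl_fit.params[0])    # identical number recovered by partialling out
```

Both print statements return the same number, which is the whole point of the theorem: controlling for w inside the regression and residualizing on w beforehand are the same operation.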
But the Frisch-Waugh-Lovell theorem just shows you a window into how the OLS model works, not which variables to control for. And which ones do you not control for? Well, we know the answer, but we often don’t know exactly how to explain it, so I’m going to use a series of pictures, since a picture speaks a thousand words. Our goal in controlling for variables is to satisfy the condition called conditional independence, or what is more often called unconfoundedness. The problem is, the conditional independence assumption is not testable, and as a result, we can’t actually check if we have the right variables in our model. Don’t believe me? Check out section 12.2.3 from Imbens and Rubin’s 2015 book on causal inference:
Because conditional independence itself cannot be tested, you can’t technically check the conditional equalities that are corollaries of it, which means that pedagogically explaining which variables to include and which to exclude falls outside the potential outcomes framework we associate with Rubin. But that just means we have to look somewhere else for guidance on variable selection, not that we can’t justify it.
But before I get into that method, let me first lay out some terms. Some of these you will already know, but I’m going to break them into separate concepts so that I can illustrate a very narrow point: you don’t need to control for everything, but you do need to control for the right things.
Here are the five variable names I want you to learn. I doubt that other professors will list them all out like this, and no one is going to blame you if you don’t either, but I think it may be useful to nonetheless see them as distinct until you’re pretty sure you understand the points I am trying to make.
Outcomes. These can be potential outcomes or realized outcomes, but either way they are variables measuring an outcome associated with a treatment.
Treatment. This is the variable measuring the intervention you’re interested in, often a binary indicator, though not always.
Confounders. The simplest confounder forms a triangle in which the treatment, D, and the outcome, Y, have a common ancestor, W. In the DAG notation that Judea Pearl developed, a confounder is a non-collider along a backdoor path like this: D ← W → Y. W is a non-collider, and as the parent of both D and Y, it “confounds” the relationship between D and Y by introducing a spurious correlation between them. Hence the name: confounder.
Covariates. This is an old term that I am going to use with a specific meaning, and hence draw the ire of people who tell me this is confusing so I shouldn’t do it. But I’m going to run this one up the flagpole and see if anyone salutes it. I am going to break from tradition and call something a covariate if it causes Y but not D, and there is no path from it to D except through Y. I’ll explain why I do this later.
Colliders. These are variables that have two immediate parents (one to the left and one to the right) along a path between the treatment and the outcome, e.g., D ← X → A ← B → Y. In this path, A is a collider because its parents, X and B, sit to its left and right and both point their arrows at it. A is a collider because its parents’ effects “collide” at it on that particular path. The function of a collider is to block any association between the first and last variables along that path, unless you condition on it, in which case a spurious correlation between those two variables opens up.
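To make the collider idea concrete, here is a small simulated example of the X → A ← B fragment of that path. The coefficients and sample size are arbitrary numbers I made up for illustration, and I’m simply using OLS as the conditioning device.

```python
import numpy as np
import statsmodels.api as sm

# Simulating just the X -> A <- B fragment of that example path. X and B are
# independent causes of A; every coefficient here is an arbitrary choice.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
b = rng.normal(size=n)
a = x + b + rng.normal(size=n)   # A is where X and B "collide"

# Unconditionally, X tells you nothing about B.
print(sm.OLS(b, sm.add_constant(x)).fit().params[1])                          # roughly 0

# Condition on the collider A and a spurious X-B association appears (about -0.5 here).
print(sm.OLS(b, sm.add_constant(np.column_stack([x, a]))).fit().params[1])
```

X and B are drawn independently, so the first regression returns roughly zero; once A enters the model, a clearly negative X-B association appears out of nowhere. That is what conditioning on a collider does.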
Not all variables matter, but the right ones absolutely matter
When trying to satisfy conditional independence, you want to control only for the variables that help you satisfy it or that improve precision. That’s a little tautological, maybe, so it would be helpful if we had a more formalized method for selecting the variables needed to satisfy conditional independence. To illustrate this, I will introduce you to causal graphs called directed acyclic graphs (DAGs), which I find extremely helpful if you are going to be trying to estimate causal effects by “running regressions controlling for things”.
DAGs are not just lines on a graph, though they are in fact lines on a graph. The lines on the graph are the method’s notation, just like potential outcomes are the notation of the Rubin-Neyman causal model. The nodes and arrows represent a range of causal relationships, including the complex ways the causal effect of the treatment on the outcome can be confounded in observational data. They use funny names like ancestors and descendants, confounders and colliders, so they take a little getting used to. DAGs were first invented by Sewall Wright, son of the economist Philip Wright, in the early 20th century, but they were later built up extensively by the computer scientist Judea Pearl and his collaborators to become a powerful means of expressing and identifying causal effects in data. And in my “simple DAG”, I want to show you how you can deduce which variables to include in a model by “closing backdoor paths” from the treatment to the outcome.
There are three ways to get from D to Y on this graph: one causal, two non-causal. Our goal is to shut down the non-causal “backdoor paths” so that the only remaining reason the treatment and the outcome have any statistical association is the causal effect. The direct edge D → Y is the causal effect, expressed perhaps as the ATE or the ATT, and the two backdoor paths are (1) D ← W → Y and (2) D ← C → Y. The fact that both W and C have arrows pointing away from them on these paths means that the backdoor paths are “open”. If a backdoor path is open, then the first and last variables on it are spuriously correlated. Blocking on a variable whose neighboring arrows along the path do not both point into it, or what is clunkily called a “non-collider”, shuts down a backdoor path. So if you “block on” both C and W, then it’s as though those lines aren’t even there, and that spurious correlation disappears. I illustrate that below by removing the edges altogether from the graph, though I do this just for helpful illustration.
What is the point here? If you block on C and W, then you can estimate the causal effect of D on Y, and being able to do so means you have satisfied the backdoor criterion. In this context that means matching on W and C, because that set is necessary and sufficient to block all backdoor paths from D to Y, leaving only the treatment effect as an explanation for the statistical association between D and Y. The variables you want to control for, in other words, are the variables that satisfy the backdoor criterion: no more, but also no less.
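If it helps to see this numerically, here is a quick simulation of that simple DAG in Python. The data-generating process and all the coefficients are invented purely for illustration; the true effect of D on Y is set to 1.0.

```python
import numpy as np
import statsmodels.api as sm

# A made-up data-generating process matching the simple DAG:
# W and C each cause both D and Y, and D's true effect on Y is 1.0.
rng = np.random.default_rng(0)
n = 100_000
w = rng.normal(size=n)
c = rng.normal(size=n)
d = 0.7 * w + 0.5 * c + rng.normal(size=n)
y = 1.0 * d + 2.0 * w + 1.0 * c + rng.normal(size=n)

# Naive regression of Y on D: both backdoor paths are open, so the estimate is biased.
print(sm.OLS(y, sm.add_constant(d)).fit().params[1])             # roughly 2.1, not 1.0

# Blocking both backdoor paths by controlling for W and C recovers the true effect.
controls = sm.add_constant(np.column_stack([d, w, c]))
print(sm.OLS(y, controls).fit().params[1])                       # roughly 1.0
```

The naive regression is badly contaminated by the two open backdoor paths; once W and C are in the conditioning set, the estimate lands back on the true effect.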
I mentioned earlier there were two more kinds of variables: covariates and colliders. People without theoretical guidance for selecting control variables will inadvertently control for confounders, covariates and colliders alike. One of these is necessary, one is harmless, and one is fatal, but which is which? They don’t come with labels. What do they look like on a causal graph?
I’ve made a new causal graph and added both variables so you can see how they differ from the confounders, W and C, we just discussed. See the new “Modification of the original DAG” graphic below. Notice that I added two new variables, B and X, to the causal graph. One of them creates a new path from D to Y, D → B ← Y, and because both D and Y point their arrows at B, B is a collider along that path. Put differently, variables are not colliders in some exogenous sense. They are colliders if, along a path, they are the child of two parents to their left and right. Collider, in other words, describes a variable’s role along a path, not the variable itself, and its special function is to block any spurious correlation between the two parents, here D and Y. So D and Y, while connected through B along this path, are independent along it, because B is a collider, and so long as you leave B out of your model, they will remain independent along that path.
But what about the covariate X? I’ve put X on the DAG because I want to note that, in addition to the important confounders and colliders, there is a third kind of variable you can control for which really doesn’t cause any problems, or help, for point estimation. Covariates in a model do not solve any omitted variable bias, but unlike colliders, they also don’t make matters worse. They are basically benign variables that don’t do anything to the point estimate but which can improve precision in a regression framework. The reason that adding X is irrelevant for estimating the ATE is that while it does cause Y, there is no backdoor path from D to Y running through X, and it’s the open backdoor paths we are worried about, not variables in the abstract. X is not confounding D, in other words, even though it causes Y, which is easier to see and explain graphically for most of us than in any other form. The only advantage of conditioning on it is that it can reduce residual variance in a regression and improve precision, but otherwise that’s it (though that, to be fair, may be enough of a reason to consider including it).
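And here is the same simulated setup extended with a collider B and a covariate X, in the sense I defined above. Again, the coefficients and the small helper function are my own made-up illustration, not a recipe.

```python
import numpy as np
import statsmodels.api as sm

# Extending the made-up DGP above: B is a collider (caused by both D and Y) and X is a
# "covariate" in the sense used here (it causes Y but has no path into D).
rng = np.random.default_rng(0)
n = 100_000
w, c, x = rng.normal(size=(3, n))
d = 0.7 * w + 0.5 * c + rng.normal(size=n)
y = 1.0 * d + 2.0 * w + 1.0 * c + 3.0 * x + rng.normal(size=n)
b = d + y + rng.normal(size=n)   # the collider sits downstream of both D and Y

def coef_and_se(controls):
    """OLS of y on d plus the listed controls; returns (coefficient on d, its standard error)."""
    exog = sm.add_constant(np.column_stack([d] + controls))
    fit = sm.OLS(y, exog).fit()
    return fit.params[1], fit.bse[1]

print(coef_and_se([w, c]))       # correct adjustment set: roughly 1.0
print(coef_and_se([w, c, b]))    # adding the collider B: badly biased
print(coef_and_se([w, c, x]))    # adding the covariate X: still roughly 1.0, smaller standard error
```

The collider wrecks the point estimate even though the correct confounders are still in the model, while the covariate X leaves the point estimate where it was and just tightens the standard error, which is exactly the distinction the DAG told us to expect.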
Figuring out which variables to include and exclude is much simpler and more principled when using the backdoor criterion, as it shows precisely the path to unbiased estimation through variable selection and adjustment. And sometimes, depending on the DAG, there is more than one adjustment set that works, but without some prior theory, and an easy way to sort through the possibilities, variable selection is likely going to be based on hunches and vague allusions to having a “rich set of controls”.
But what if colliders are within that “rich set of controls”? Then including them will introduce bias, not reduce it. Do not be fooled: your goal is not to include a rich set of controls. The goal is to condition on known and quantified confounders such that you satisfy the backdoor criterion by blocking all backdoor paths. That’s when you meet the conditional independence assumption, and if you do, then you’re halfway home, because satisfying conditional independence is only half the battle.
You still have to contend with the oft overlooked common support, or overlap, assumption. But that’s for a different topic.
Conclusion
Causal graphs have detractors and zealots, just like many things that eventually became canonized did early on. And in economics and other social sciences, it’s still very much early on. How do I know? Because causal graphs aren’t used much in education yet, and their main creator, Judea Pearl, is still living. Two signs, I think, that something does not yet enjoy widespread popularity.
But I nevertheless teach them, and while they are powerful and can be used broadly for many things, over time I’ve narrowly come to promote them in two areas: justifying which variables to use to satisfy conditional independence, and guiding the researcher contemplating an instrumental variables design. These are good fits for causal graphs because both of those designs depend crucially on prior knowledge, and both seem like areas where everyone agrees they need help reasoning through the design. And while you can say that all designs depend on prior knowledge, some methods like synthetic control, difference-in-differences, or RDD seem to make users feel that the assumptions are sufficiently non-problematic that they can get by with only the loosest of prior knowledge. While that isn’t true, it’s a harder sell to tell people you need a causal graph for RDD, so I tend to focus people instead on unconfoundedness / conditional independence and IV, the former especially.
The reason we need causal graphs for unconfoundedness is that, in the end, unconfoundedness is not testable. You can’t even use balance tests, because balance speaks to the common support assumption, and many of the methods used (e.g., exact matching) will mechanically balance covariates where possible, and where it isn’t possible, they’ll use bias adjustment methods to address it.
The point is simple: if you are going down the dangerous rocky path of Mount Unconfoundedness, bring a light. Otherwise you’ll trip and fall to your death. Not every set of variables satisfies the backdoor criterion, and therefore not every set of controls will deliver the core unconfoundedness / conditional independence assumption. It may not even be possible, or it may be trivial, but the point is, you know you have it because of theory’s guidance. Get too close to a collider in your conditioning set and the whole thing collapses. Condition on a hundred irrelevant covariates and you may be balancing and weighting and imputing on things which, unbeknownst to you, have no relevance for estimating aggregate causal parameters.
And none of this even touched the methods you should consider when using covariate adjustment strategies, which is what this series has been about and will keep being about. Even when you find the variables you need, even when unconfoundedness is plausible, issues around common support are deep enough, and ignored enough, that they merit their own careful discussion.
So, consider this substack my way of answering a question I get all the time. Which covariates? The ones that satisfy the backdoor criterion. But what if I can’t do that, either because I don’t philosophically believe in using prior theory about treatment assignment to systematically select variables for regressions, or because my DAG says the backdoor criterion cannot be met with the dataset I possess? Well, then you have your answer: skip the unconfoundedness part of the book. It’s not appropriate for the problem you’re facing.