A Tale of Time Varying Covariates
Sant'Anna and Zhao (2020) Explainer
These days I am interested in the new papers on difference-in-differences by econometricians and applied microeconomists. I am reading all that I can about this new literature, both because I find the work interesting and because difference-in-differences is so popular, though maybe there’s some endogeneity between the two. Today I would like to discuss a recent paper in this pantheon of articles, “Doubly Robust Difference-in-Differences Estimators” by Pedro Sant’Anna and Jun Zhao, published in the Journal of Econometrics in 2020. As this paper is deep, I will only be discussing a few parts of it, with the hope that I can circle back in the future.
I didn’t expect to like this paper as much as I liked the other difference-in-differences papers I’ve been reading, mainly because this paper is not directly about staggered rollout. But it is about time varying covariates and TWFE’s struggle to handle them while maintaining consistency, which makes it very important in practice. So let’s consider this paper now.
History of thought
As always, I am interested in the history of econometric thought, so I’d like to give the reader a rundown of some background which may help fit this new paper into a broader history.
As with Pedro’s paper with Brant Callaway, this new paper by Sant’Anna and Zhao (SZ) owes part of its existence to two articles in the Review of Economic Studies. The first is the classic 2005 paper on semiparametric difference-in-differences by Alberto Abadie. The second is another classic, from 1997, by Heckman, Ichimura and Todd entitled “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme”.1 Let me review some select elements of these two articles as best I can.
Alberto Abadie (2005)
Abadie (2005) merges work on the propensity score by Guido Imbens and others, such as his 2001 article with Keisuke Hirano, with difference-in-differences. The propensity score is nice because it reduces the dimensionality of the covariates to a single scalar. Rosenbaum and Rubin showed in the “propensity score theorem” that the propensity score absorbs all covariate information relevant to treatment assignment, such that once you condition on the propensity score, the covariates offer no additional information.
The propensity score theorem shows that you don’t need the covariates once you have the propensity score, but it does not dogmatically lay down a precise estimation method. I think that is why over the years there were so many different estimation procedures based on the propensity score, from nearest neighbor matching, to stratification, to weighting and so forth. Hirano and Imbens (2001) focus on weighting, and Abadie (2005) extends weighting to difference-in-differences.2
One of the contributions of Abadie (2005) that I understand best is how he allowed for a conditional parallel trends assumption. A conditional parallel trends assumption is the belief that parallel trends holds only among units with the same covariate values. For instance, perhaps you are comfortable asserting that parallel trends holds for males separately, and females separately, but not combined. Allowing for such selection on observables must then be addressed in both the identification stage and the estimation stage. Abadie (2005) did so in the following four step process:3
1. Estimate the conditional probability of treatment using either a series logit or linear probability model against the observable covariates that you claim parallel trends depends on4
2. Using the fitted values, calculate a “propensity score”
3. Calculate the difference in mean outcomes after the treatment minus before the treatment
4. Weight each unit’s difference using the propensity score with or without standardization
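The four steps above can be sketched in R on simulated data. This is a minimal illustration, not Abadie's own implementation: the variable names, the single covariate `x`, the plain logit, and the data generating process (with a true ATT of 2) are all assumptions made for the example. The weighting line uses the version standardized by the treated share.

```r
# Simulated two-period panel; every name and number here is hypothetical.
set.seed(1)
n <- 10000
x <- rnorm(n)                                     # covariate driving selection and trends
d <- rbinom(n, 1, plogis(0.5 * x))                # treatment is more likely when x is high
y_pre  <- x + rnorm(n)
y_post <- y_pre + 1 + 0.5 * x + 2 * d + rnorm(n)  # trend depends on x; true ATT = 2

# Step 1: model the probability of treatment given x with a logit.
ps_fit <- glm(d ~ x, family = binomial)

# Step 2: the fitted values are the estimated propensity scores.
ps <- fitted(ps_fit)

# Step 3: each unit's after-minus-before difference.
dy <- y_post - y_pre

# Step 4: weight the differences. Treated units get weight 1, control units
# get -ps / (1 - ps), and the average is scaled by the treated share.
att_ipw <- mean(dy * (d - ps) / (1 - ps)) / mean(d)

# Unweighted comparison of mean differences, for contrast.
att_naive <- mean(dy[d == 1]) - mean(dy[d == 0])

print(c(ipw = att_ipw, naive = att_naive))
```

Because the trend depends on x and treated units have higher x on average in this design, the unweighted comparison should drift above 2 while the weighted one should land near it.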
Heckman, Ichimura and Todd (1997)
The first time I heard of the methods in Heckman, Ichimura and Todd (1997) was through Abadie and Imbens (2011) on correcting for matching discrepancies. I discuss this procedure in detail in my book, Causal Inference: the Mixtape, section 5.3.2. While I will not condense this classic paper into my rudimentary understanding, I will lay out for you my understanding of their two-step adjustment to the difference-in-differences extension of matching.
Both Heckman, et al. (1997) and Abadie (2005) are notable for incorporating time invariant controls into a difference-in-differences design. Panel fixed effects can’t handle time invariant controls because the implicit demeaning that fixed effects involves eliminates all time invariant factors. But as Pedro Sant’Anna illustrated in the onesie he made for his cute daughter, twoway fixed effects is not the same thing as difference-in-differences. And Heckman, et al. (1997) showed there were other ways you could skin this cat.
Heckman, et al. (1997) use an “outcome regression” to adjust the difference-in-differences estimation itself. They write out a regression-based estimator for the ATT as:
where the mu term is a function that specializes to:
The mu terms are based on specific values of each group’s X terms which are themselves inputs in a fitted regression formula. We might regress an outcome Y onto a set of covariates X, obtain estimated coefficients, then use those coefficients to calculate the mu terms. Say our fitted values from a regression are:
Then to calculate mu, you’d simply plug in each unit’s value of X to get mu for that treatment or control group. Let unit 1 have an X=2 and unit 2 have an X=14, then the first unit would have a mu of 4.18 and the second unit would have a mu of 10.3.
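The two mu values quoted above pin down a straight line: a slope of (10.3 - 4.18) / (14 - 2) = 0.51 and an intercept of 4.18 - 0.51 × 2 = 3.16. Those coefficients are backed out from the numbers in the text rather than copied from the fitted equation, so treat them as an inference. A small R sketch of the plug-in step, with a hypothetical helper name:

```r
# Coefficients inferred from the two worked values in the text (an assumption).
b0 <- 3.16
b1 <- 0.51

# Plug a unit's X into the fitted line to get its mu.
mu_hat <- function(x) b0 + b1 * x

# Units with X = 2 and X = 14, matching the 4.18 and 10.3 quoted above.
mu_hat(c(2, 14))
```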
Formally, the outcome regression (OR) approach calculates the difference-in-differences, not by weighting on the inverse of the propensity score, but rather by the mu terms corresponding to each unit’s covariates according to the following formula:
where the first two Y terms are sample means in the after (1) and before (0) periods, while the interior term is the same for the control group. If you squint your eyes and look closely, you can see the difference-in-differences expression represented by first differences for each group which are then differenced from one another.5
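Written out, the expression described in this paragraph has roughly the following shape. This is a reconstruction from the description and from the outcome regression estimator as SZ present it, so the exact symbols (bars for sample means, hats for fitted values, n_treat for the number of treated units) are assumptions:

```latex
\hat{\tau}^{reg}
  = \bar{Y}_{1,1}
  - \left[ \bar{Y}_{1,0}
  + \frac{1}{n_{treat}} \sum_{i:\,D_i = 1}
    \left( \hat{\mu}_{0,1}(X_i) - \hat{\mu}_{0,0}(X_i) \right) \right]
```

Here \bar{Y}_{1,t} is the treated group's sample mean in period t, and \hat{\mu}_{0,t}(X_i) is the control group's fitted regression for period t evaluated at a treated unit's covariates.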
Sant’Anna and Zhao (2020)
Basic ideas of DR DD
With these two papers in mind, we can walk from each one along a path where they meet and join at SZ. SZ brings together Abadie’s propensity score weighting approach and Heckman, et al.’s outcome regression approach into a single difference-in-differences estimator called “doubly robust difference-in-differences”. It is called “doubly robust” because it gives you two chances to get the covariate adjustment right.
Let’s say that you are flipping coins. Heads you win, tails you lose. But I give you two chances to land a heads. If you get a heads the first time, you win. If you get a tails the first time, but heads the second time, you win. The probability you get a heads on the first throw is 0.50, but the probability you get it at least once in two throws is 0.75. The more chances I give you to get something right, the more likely it is you’ll get it right.
That, you see, is the underlying logic of doubly robust estimation. The “chances” to get it right are to use Abadie’s propensity score method and Heckman, et al’s outcome regression method at the same time without penalty for doing so. You’re basically controlling for X twice: once with a linear regression, and once with a propensity score, to improve your odds of obtaining the right answer.
The paper moves in four parts, but in this entry I’m going to cover three.6 They are:
1. Basic assumptions for DD with covariates
2. Twoway fixed effects assumptions for DD with covariates
3. Estimation using DRDD as an alternative to TWFE with covariates
DD targets the following average treatment effect for the treatment group (the ATT):
The difference-in-differences estimator here requires three assumptions which are, on their face, very similar to the ones we saw before with Abadie (2005). The first assumption is that our data is either panel or repeated cross-sections. The handling of repeated cross-sections is complex, so I’ve chosen to focus only on the panel results.
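In potential outcomes notation, where a superscript 1 or 0 marks the treated or untreated potential outcome and the subscript 1 marks the post-treatment period, the ATT can be typeset as below. The display is a reconstruction, not a copy of the paper's equation:

```latex
ATT = E\left[ Y^{1}_{1} - Y^{0}_{1} \mid D = 1 \right]
```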
As I said, though, the main assumption of any difference-in-differences design is the parallel trends assumption. And as SZ is about the handling of covariates, it has a conditional parallel trends assumption. I’ll write it down here, but it’s pretty simple — there are four conditional expectations, and one of them is counterfactual in nature.
where the 0s in the superscript mean it’s a potential outcome in a world where the unit had not been treated, the 1 and 0 subscripts refer to the post and pre-periods, respectively, X is a matrix of time invariant covariates, and D is the treatment assignment itself.
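Following that description, the conditional parallel trends assumption can be typeset as four conditional expectations. This is a reconstruction from the surrounding text; the first term, the treated group's untreated outcome in the post period, is the counterfactual one:

```latex
E\left[ Y^{0}_{1} \mid X, D = 1 \right] - E\left[ Y^{0}_{0} \mid X, D = 1 \right]
  = E\left[ Y^{0}_{1} \mid X, D = 0 \right] - E\left[ Y^{0}_{0} \mid X, D = 0 \right]
```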
And finally, a third assumption regards the overlap of treatment and control group units across every value of X. This overlap condition is a common one in the propensity score literature, but here it is for the raw X values themselves.
Time varying covariates
When we have all three assumptions, we can use either outcome regression or propensity score weighting to estimate the ATT. But what about TWFE? Can’t we use it? My whole adult life, I’d never until recently heard one person complain about including time varying covariates in a regression, let alone a difference-in-differences specification of a regression. Let’s consider a simple TWFE and its analog with time varying covariates now. Here is a simple TWFE model which will identify the ATT under assumptions 1 and 2, with assumption 2 taken as simple rather than conditional parallel trends.
Since delta identifies the ATT with parallel trends, then surely simply adding time varying covariates will identify the ATT under conditional parallel trends, right? Consider this version:
These two equations are the same except that the second one includes a matrix of time varying covariates. I still only need the interaction term to be strictly exogenous, and controlling for X surely doesn’t affect that now, does it? It turns out it might, depending on whether a total of six assumptions hold, not just the three we discussed.
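For reference, the two specifications being compared can be written roughly as follows. Only delta (the coefficient on the interaction) and theta (the coefficient on X) are named in the text; the other symbols are placeholder labels chosen for this sketch and may not match the paper's:

```latex
% Simple TWFE, no covariates
Y_{it} = \alpha_1 + \alpha_2 T_t + \alpha_3 D_i + \delta \, (T_t \times D_i) + \varepsilon_{it}

% TWFE with time varying covariates
Y_{it} = \alpha_1 + \alpha_2 T_t + \alpha_3 D_i + \delta \, (T_t \times D_i) + \theta X_{it} + \varepsilon_{it}
```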
To illustrate, let’s take the previous equation and take conditional expectations.
When we difference these two lines to calculate the ATT expression, we get the following:
Here is the first new assumption necessary for TWFE to identify the ATT with time varying covariates: homogeneous treatment effects in X. If the treatment affects the influence of X itself such that theta differs, then our estimate may be biased. This is, in other words, another way of saying that X is a “bad control”, but it is more specifically about bad time varying controls. SZ call this the homogeneous treatment effects in X assumption and express it more simply as:
The next two assumptions regard X-specific trends. Before taking the actual DD, let’s write out the terms we will need to do so:
Now, using these terms, let’s difference the first differences for each group:
Eliminating terms and simplifying this expression, we find that with time varying covariates, the DD estimate equals:
If you look closely, you’ll see it is a new parallel trends assumption. For DD to be an unbiased estimate of the ATT, we must assume parallel trends in the X terms as well as the potential outcomes themselves, or what SZ call the “no X-specific trends in both groups” assumption. Without homogeneous treatment effects in X and without “no X-specific trends in treatment and control”, TWFE will not identify the ATT.
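A small simulation can make the X-specific trends problem concrete. The sketch below is not SZ's Monte Carlo design. It is a hypothetical data generating process in which the untreated trend depends on a covariate x that also drives selection into treatment, so parallel trends holds only conditional on x. To keep it short, x is measured once and held fixed across the two periods, and the regression simply adds x as a control to a two-period difference-in-differences specification.

```r
# Hypothetical DGP: conditional (not unconditional) parallel trends.
set.seed(42)
n <- 10000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))                # selection on x
y_pre  <- x + rnorm(n)
y_post <- y_pre + 1 + 0.5 * x + 2 * d + rnorm(n)  # x-specific trend; true ATT = 2

# Stack the two periods into long format.
long <- data.frame(
  y    = c(y_pre, y_post),
  post = rep(c(0, 1), each = n),
  d    = rep(d, times = 2),
  x    = rep(x, times = 2)
)

# DD regression with x entered additively as a control.
fit <- lm(y ~ d * post + x, data = long)
delta_hat <- unname(coef(fit)["d:post"])

print(delta_hat)
```

Because x enters only additively, the regression has no way to absorb the x-specific trend, and the interaction coefficient should sit above the true value of 2 in this design.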
Why not use both?
Both outcome regression and propensity score weighted difference-in-differences can identify the ATT with only three assumptions, whereas TWFE requires six. So it seems wise to use one of them, but which one? Each depends on a correct specification: the outcome regression for OR, and the propensity score model for weighting. Well, what if we didn’t have to choose between the two? What if we could have both?
Doubly robust DD combines the outcome regression approach with the propensity score approach. In order to get the ATT right, you need to hit one of them — but not necessarily both. To see how this works mechanically, let’s remind ourselves of some notation:
Using these terms, we can write out the DR DD estimator for panel data as:
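As a reference point, the panel-data DR DD estimand in SZ takes roughly the following form. The display is a reconstruction and should be checked against the paper: p(X) is the propensity score, \Delta Y = Y_1 - Y_0 is the change in the outcome, and \mu_{0,\Delta}(X) is the control group's outcome regression for that change.

```latex
\tau^{dr} = E\left[ \left( w_1(D) - w_0(D, X) \right)
            \left( \Delta Y - \mu_{0,\Delta}(X) \right) \right]

w_1(D) = \frac{D}{E[D]}, \qquad
w_0(D, X) = \frac{ \dfrac{p(X)\,(1 - D)}{1 - p(X)} }
                 { E\left[ \dfrac{p(X)\,(1 - D)}{1 - p(X)} \right] }
```

The first factor is the propensity score weighting piece and the second is the outcome regression piece, which is where the two chances come from.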
You may recognize this expression because it is nearly identical to the expression I reviewed a month ago when discussing Callaway and Sant’Anna. That’s because, as I’ll discuss at the end, CS gives you the option to estimate using DR DD.
Seeing is believing, so let’s review a couple of results from Monte Carlo experiments. I’ll focus on two of their results: a situation where both the propensity score and the outcome regression are properly specified, and a situation where neither is. We’ll look at the performance of TWFE, OR, inverse probability weighting (IPW) and DR now. Their experiment will have 1,000 observations run through 10,000 simulations. The propensity score is estimated with a logit and OR estimation will use a linear specification. Results from this experiment are in the following table.
The first thing we see is that with this data generating process, TWFE has a point estimate that is severely biased by 21. None of the others are even close to this. DR outperforms inverse probability weighting, but not OR. But considering we have no idea whether OR or IPW are right ex ante, it’s interesting that DR gets so close to the OR model without knowing.
Now consider a scenario where we haven’t correctly specified either model. How bad is the TWFE bias when it simply controls for the matrix of X versus the other three which attempt to model it parametrically?
Under this very specific DGP, IPW does best, and DR is second best, but consider the scenario we are in: we don’t know ex ante whether to use OR or IPW. DR uses both. It’s buy one, get one free, and it strictly dominates TWFE. It also falls between the other two, which is helpful information to consider when making decisions under uncertainty.
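The “two chances” property can be checked in a toy simulation. The R sketch below is not SZ's Monte Carlo design; the data generating process, the function name `dr_did`, and the deliberately wrong intercept-only models are all hypothetical choices made for illustration. It computes a DR DD estimate for two-period panel data with normalized weights under three pairings: wrong propensity score with right outcome regression, right propensity score with wrong outcome regression, and both wrong.

```r
# Hypothetical DGP: untreated trend and selection both depend on x; true ATT = 2.
set.seed(7)
n <- 10000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
dy <- 1 + 0.5 * x + 2 * d + rnorm(n)   # change in outcome, post minus pre
dat <- data.frame(dy = dy, d = d, x = x)

# DR DD for panel data: reweight controls by ps / (1 - ps) and subtract the
# predicted untreated change mu0 from every unit's observed change.
dr_did <- function(dy, d, ps, mu0) {
  w1 <- d / mean(d)
  w0_raw <- ps * (1 - d) / (1 - ps)
  w0 <- w0_raw / mean(w0_raw)
  mean((w1 - w0) * (dy - mu0))
}

# Correctly specified nuisance models.
ps_good <- fitted(glm(d ~ x, family = binomial, data = dat))
or_fit  <- lm(dy ~ x, data = dat[dat$d == 0, ])      # fit on controls only
mu_good <- as.numeric(predict(or_fit, newdata = dat))

# Deliberately misspecified versions that ignore x.
ps_bad <- rep(mean(dat$d), n)
mu_bad <- rep(mean(dat$dy[dat$d == 0]), n)

results <- c(
  ps_wrong_or_right = dr_did(dat$dy, dat$d, ps_bad,  mu_good),
  ps_right_or_wrong = dr_did(dat$dy, dat$d, ps_good, mu_bad),
  both_wrong        = dr_did(dat$dy, dat$d, ps_bad,  mu_bad)
)
print(round(results, 3))
```

In this design the first two estimates should land near 2 and the third should not, which is the sense in which only one of the two models needs to be right.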
The inclusion of time varying covariates in a TWFE specification is as common as flies at a picnic. Everyone does it. I mention in my Mixtape, in fact, that one of the benefits of regression is its ability to handle covariates seemingly so effortlessly and that’s true: adding those terms in is not computationally difficult. What is difficult is the bias it creates if the treatment effect is heterogeneous in these X terms, or if there are X-specific trends.
The SZ paper emphasizes the problems with time varying controls that by their nature violate assumptions 4-6. All of the models we cover control for the covariates; it’s just that OR, IPW and DR control for pre-treatment values of these covariates. The biases still exist, but even the bias of bad controls can be greatly minimized using one of these other procedures.
One last reason we study this paper, though, is that the DR estimator is the heart of the Callaway and Sant’Anna estimator. Let me reproduce the code from the Callaway and Sant’Anna entry, which used the “did” R package. Notice the est_method argument, whose comment lists “dr”, “ipw” and “reg”. Well, these are the three procedures we’ve been reviewing. The backbone of the CS estimator is DR, in other words. So if for no other reason than that you were intrigued by that estimator, then learning DR is rewarding.
library(did)          # Callaway & Sant'Anna
library(readstata13)  # provides read.dta13()

castle <- data.frame(read.dta13('https://github.com/scunning1975/mixtape/raw/master/castle.dta'))
castle$effyear[is.na(castle$effyear)] <- 0  # untreated units have effective year of 0

# Estimating the effect on log(homicide)
atts <- att_gt(yname = "l_homicide",            # LHS variable
               tname = "year",                  # time variable
               idname = "sid",                  # id variable
               gname = "effyear",               # first treatment period variable
               data = castle,                   # data
               xformla = NULL,                  # no covariates
               #xformla = ~ l_police,           # with covariates
               est_method = "dr",               # "dr" is doubly robust. "ipw" is inverse probability weighting. "reg" is regression
               control_group = "nevertreated",  # set the comparison group which is either "nevertreated" or "notyettreated"
               bstrap = TRUE,                   # if TRUE compute bootstrapped SE
               biters = 1000,                   # number of bootstrap iterations
               print_details = FALSE,           # if TRUE, print detailed results
               clustervars = "sid",             # cluster level
               panel = TRUE)                    # whether the data is panel or repeated cross-sectional
If we’re hanging a painting, we use a hammer, but we don’t use a hammer when we’re making a lasagna. We should use the right tool for the job we have, not just the tools we have lying around. And if we don’t have the right tool, my hope is that these substack entries can help you find the ones you need. TWFE is great when the assumptions for its consistency hold, but not when they don’t. It is a tool used to hang paintings, not make lasagnas. I hope this entry has been helpful. Cheers!
2. I originally thought that using the propensity score meant nearest neighbor matching. The difference between the score itself and the estimation was, in other words, too subtle for me to originally understand. And for a long, long time, it was the same way with difference-in-differences. While I knew panel fixed effects estimation and difference-in-differences were distinct ideas, I so identified difference-in-differences with the panel fixed effects estimator that I thought difference-in-differences was nothing more than a particular regression specification. I did not know, though, that difference-in-differences estimation could be done any other way.
3. This can be condensed into two steps (estimate the propensity score and weight the first differences for treatment and control), but for the sake of pedagogical clarity, I’ll break it into four steps.
4. The “semiparametric” part of this procedure lies in the specific way in which the covariates enter into the estimation stage.
5. Adjustment for the control group is added because instead of using the sample means for these control group units, Heckman, et al. (1997) use their predicted values based on values of X.
6. I’m going to hold off on covering semiparametric bounds for another day as this substack is long enough, but please note that one of the most important contributions of this paper is SZ’s work on the semiparametric bounds.