“At some level, all methods for causal inference can be viewed as imputation methods, although some more explicitly than others.” (Imbens and Rubin 2015)
Kirill Borusyak (UCL), Xavier Jaravel (LSE) and Jann Spiess (Stanford) 2021 “Revisiting Event Study Designs: Robust and Efficient Estimation”.
A new working paper on the econometric challenges of and solutions to difference-in-differences with differential timing designs dropped this week. For some people, parts of the paper will feel familiar; for others, it will offer new insights and new solutions to extremely important research challenges. I often experienced deja vu while reading it as my feelings traveled through familiar territory into unfamiliar places. Because of the important points it raises and the tools it provides, I predict the paper will become as commonly known as any of the new difference-in-differences papers and for years to come.
But perhaps you’re asking yourself — why? Why do we need another econometrics paper about difference-in-differences when we already have so many others? What possible marginal benefit are we getting from this paper that could justify the marginal cost of our time and energy associated with studying it? Such questions are rational to ask given the scarcity of our time and that we all are basically living crammed against our respective constraints. You can be forgiven any time you skip a paper or two if only because the opportunity cost of any one paper is very high. To learn any more difference-in-differences, you’re thinking, better be worth my time. I believe this one does justify our time, and in this newsletter, I hope to convince you of the paper’s importance both for its original insights into critical problems associated with differential timing, as well as the solutions offered to circumvent them.
A brief history of this paper
People will have a range of responses to Borusyak, Jaravel and Spiess (BJS) because of its unusual history. This new working paper is an update to an older, influential working paper by Borusyak and Jaravel from 2016/2017. That original working paper has been cited nearly 400 times, making it one of the most cited papers in this new pantheon of papers on the econometrics of difference-in-differences.
The original working paper was seminal in that it was one of the first papers, and in some cases the first paper, to highlight and analyze numerous econometric problems associated with differential timing and two-way fixed effects (TWFE). When the previous working paper was first released in 2016/2017, discussion of any TWFE “flaws”, if you want to call them that, was not common and definitely not as broadly understood as it is now. Borusyak and Jaravel 2016/2017 was one of the first to bring to our attention things like TWFE’s weird weighting under differential timing, the potential biases associated with heterogeneity, the insufficiency of parallel trends for identification and problems with commonly specified event study models.1
But soon thereafter, paper after paper appeared documenting various problems with TWFE. Independent and complementary discoveries about this seemingly “simple” difference-in-differences design appeared alongside Borusyak and Jaravel (see here, here, here, here, here and here) in a short time, most of which are by now published.2 Placing the original working paper alongside the new one is somewhat like staring at two pictures of the same person taken in different decades. You can see the resemblance, but at the same time, you also can see the layered differences. With a new coauthor, Jann Spiess, as well as an extensive updating of the paper, the paper has changed quite a bit, which justifies this explainer.
Defined parameters and identifying assumptions
One of the things you may have noticed with the new papers is the analytical separation of definitions of target parameters and estimation. You see this type of approach for instance in Pedro Sant’Anna’s various papers and BJS has this flavor too. They first begin by defining a target parameter as the weighted average of smaller individual level treatment effects:
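In notation assumed here for illustration (Ω_1 for the set of treated unit-period observations, w_it for weights the researcher chooses, τ_it for the individual treatment effect), that target can be sketched as:

```latex
\tau_w \;=\; \sum_{it \in \Omega_1} w_{it}\,\tau_{it},
\qquad
\tau_{it} \;=\; E\!\left[\,Y_{it} - Y_{it}(0)\,\right]
```

Giving every treated observation the same weight, 1/N_1 with N_1 the number of treated observations, would make this the overall ATT.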
Notice that the static parameter is essentially the final product of a weighting procedure of individual treatment effects placing the paper in a tradition, as I said, of defining target parameters in terms of smaller building blocks (such as the group-time ATT by Callaway and Sant’Anna (2020) or the cohort-specific ATT by Sun and Abraham (2020)) which can then be aggregated into larger parameters like lego blocks used to make toy skyscrapers. This means there will always be two elements to this estimator: (1) the recovery of such “small” treatment effects, and (2) the weights themselves.
With an explicit target in mind, we know where we are going. We will need to estimate both the individual treatment effects, as well as the weights. With both in hand, we can construct various aggregate policy parameters like the ATT. BJS warns researchers about the dangers of conflating the target parameter with the parameter in the simple TWFE regression as they only map to one another under specific criteria which may or may not hold. Those criteria, or assumptions, are the following:
Parallel trends. Let’s look closely at this identifying assumption and compare it with other papers’ expression of parallel trends.
Their expression of the parallel trends assumption is somewhat original, at least as far as difference-in-differences papers are concerned, because it states that one of our potential outcomes is a linear function of fixed effects, whereas other authors write down an assumption where differences in Y(0) over time for treatment and comparison groups are the same. Notice secondly that this parallel trends assumption is expressed as holding not at the group level, but at the individual level. I’ll admit, it is unclear to me whether their assumption is the same as the standard parallel trends expression we often see. I am not objecting to it; rather I am simply alerting the reader to this subtle difference both for your own edification as well as because this assumption will guide the estimator’s construction later.
No anticipation of treatment prior to the event date.
The second assumption is now standard in this literature. We see it in Callaway and Sant’Anna (2020), Sun and Abraham (2020), and others. It exists as an assumption because with it, we are able to define the start of treatment dates which is relevant for estimation as well as pre-trend analysis.
Restricted causal effects
The third assumption is one I hadn’t seen before, at least not expressed quite this way, but it is something that the authors hammer repeatedly throughout the paper: insofar as you are willing to impose structure on treatment effects because of a priori theory, there are gains to be made in estimation due to increased power. It is here that we will later see homogenous treatment effects enter, because homogenous treatment effects is nothing more than a structural assumption justifying a certain model. It is therefore interesting to me that the authors explicitly re-introduce economic theory back into reduced form estimation where it has maybe been somewhat missing from our econometric estimators for quite some time.
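For reference, the three assumptions can be written compactly. This is a sketch in notation assumed here, not a quotation from the paper: α_i and β_t are unit and period fixed effects, τ is the stacked vector of individual effects, and B is a matrix of restrictions the researcher is willing to defend.

```latex
\begin{aligned}
&\text{(1) Parallel trends:} && E\!\left[Y_{it}(0)\right] = \alpha_i + \beta_t \quad \text{for all } it \\
&\text{(2) No anticipation:} && Y_{it} = Y_{it}(0) \quad \text{whenever unit } i \text{ is not yet treated at } t \\
&\text{(3) Restricted causal effects:} && B\,\tau = 0 \quad \text{for a known matrix } B
\end{aligned}
```

Read this way, full homogeneity is the extreme case in which the rows of B force every pair of effects to be equal, while an empty B imposes nothing.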
Traditional Modeling Practices
The first problem they diagnose concerns the problems associated with TWFE estimation of a dynamic event study model under differential timing and heterogeneity. Consider a standard event study model like this one:
This regression model is “fully dynamic” in the sense that all leads and lags excluding the t-1 lead are included. I present this model, not because it is recommended, but rather because this model is so common that it might as well be called the canonical event study model. Assuming we pass a pre-trend test on the leads equalling zero, the lags in this model represent dynamic treatment effects in relative event time.
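Written out, a fully dynamic specification of this kind looks roughly as follows. The symbols are assumptions for illustration: E_i is unit i's event date, K_it is relative time, and the sum runs over every relative period except the omitted t-1 lead.

```latex
Y_{it} \;=\; \alpha_i + \beta_t + \sum_{h \neq -1} \tau_h \,\mathbf{1}\!\left[K_{it} = h\right] + \varepsilon_{it},
\qquad K_{it} = t - E_i
```

The coefficients with h < 0 are the leads and those with h ≥ 0 are the lags.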
This specification implicitly makes two assumptions: no anticipation (i.e., assumption 2 implying zero coefficients on the leads) and homogenous treatment effects (i.e., assumption 3 where we do not include i subscripts on our treatment parameters). While one can assume parallel trends, that assumption is not sufficient to find zero causal effects in pre-treatment periods, and ultimately non-zero leads are interpretable as causal effects themselves, most likely due to anticipation.
Another difference between this model and our target parameter is that the fully dynamic specification does not define the estimands as weighted averages of individual treatment effects, at least not explicitly. BJS caution us against simply accepting that a regression model is reliable when it isn’t clear what the weighting procedure is or how a model maps back to assumptions.
You could also write down a simple model where treatment is absorbed into a single dummy as opposed to the fully dynamic specification. This is often called the “static” specification and it imposes parallel trends and no anticipation by virtue of pre-treatment being equal to zero. But it also imposes a very strong version of assumption 3, specifically that treatment effects be identical across units. So already we can see that the authors will likely be exploring the ramifications of a model like this one which hadn’t been built from first principles (i.e., our assumptions).
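In the same assumed notation, with D_it a dummy equal to one once unit i has been treated, the static specification is simply:

```latex
Y_{it} \;=\; \alpha_i + \beta_t + \tau\, D_{it} + \varepsilon_{it},
\qquad D_{it} = \mathbf{1}\!\left[t \ge E_i\right]
```

The single τ with no subscripts is where the homogeneity restriction hides.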
Under-identification of the Fully Dynamic Specification
The first problem BJS note is that the TWFE model we wrote down above does not impose a strong enough version of no anticipation. If it did, then the coefficients on leads would be zero. But as the coefficients are left open, the question is whether we can identify the parameters at all. One of the problems created by differential timing relates to identifying leads because without a pool of never-treated groups in our model, then population regression coefficients on leads will be “under identified”. We will be unable to distinguish the individual leads from a simple linear trend as both will fit the data equally well. They describe this problem of “unrestricted dynamics” with the following paragraph:
“Formally, the problem arises because a linear time trend and a linear term in the cohort Ei (subsumed by the unit fixed effects) can perfectly reproduce a linear term in relative time Kit = t − Ei. Therefore a complete set of treatment leads and lags, which is equivalent to the fixed effects of relative time, is collinear with the unit and period fixed effects.” (my emphasis)
What’s being said here is a subtle point, but it is one we’ve seen before in Sun and Abraham (2020). The fully dynamic specification under differential timing suffers from two types of collinearity, not one. There is the dummy variable trap associated with a complete set of dummies for every value of event time, which everyone already knew about and which is why we all drop one lead from our regressions. But then there is this strange additional form of collinearity between the unit and period fixed effects and the event time dummies in the TWFE model. Resolving this collinearity problem requires stronger restrictions on leads that must be imposed by dropping two leads, not just one.3 Hence we see what is meant by having too weak of a no anticipation assumption: we need no anticipation to be more true so that we can justify dropping more than just one lead.
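The two layers of collinearity can be checked numerically. The sketch below builds a small hypothetical panel (four units, six periods, everyone eventually treated, event dates made up for illustration) and counts how many columns of the design matrix are redundant as leads are dropped. The helper names are mine, not from the paper.

```python
import numpy as np

# Hypothetical balanced panel with no never-treated units.
E = np.array([2, 3, 4, 5])          # assumed event date of each unit
T = 6                                # periods 0..5
units = np.repeat(np.arange(len(E)), T)
periods = np.tile(np.arange(T), len(E))
K = periods - E[units]               # relative time K_it = t - E_i

def dummies(x, drop=()):
    """One 0/1 column per distinct value of x, skipping values listed in drop."""
    levels = [v for v in np.unique(x) if v not in drop]
    return np.column_stack([(x == v).astype(float) for v in levels])

unit_fe = dummies(units)
period_fe = dummies(periods, drop=(0,))   # one period omitted, as usual

def rank_deficiency(drop_leads):
    """Number of redundant columns in [unit FE, period FE, event-time dummies]."""
    X = np.column_stack([unit_fe, period_fe, dummies(K, drop=drop_leads)])
    return X.shape[1] - np.linalg.matrix_rank(X)

print(rank_deficiency(()))          # all leads and lags included
print(rank_deficiency((-1,)))       # the usual single omitted lead
print(rank_deficiency((-1, -4)))    # two omitted leads
```

In this toy panel the count of redundant columns falls from two, to one, to zero, so dropping only the t-1 lead still leaves the model under-identified. Which second lead to drop is a choice the code cannot make for you.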
But let’s say that we were to drop a second lead. Which one? Whichever second lead we ultimately drop, we are implicitly invoking no anticipation on that particular time period, be it the t-4 lead or the t-2 lead. This is why BJS earnestly recommend we consider relying on institutional knowledge and economic theory to guide our choices here.4 This part of the paper was kind of refreshing and original to be honest. Like my love of directed acyclic graphs (DAGs), BJS remind us that economic theory is not merely a source for the kinds of topics we study; it is also an aid when it comes to estimation choices, even in reduced form style work. This willingness to lean on a priori knowledge benefits us because as we exclude more and more leads, we increase our power. I found it interesting to consider how stronger than minimal assumptions could allow for more powerful tests.
Negative Weights in the Static Parameter and the Bias of Long-run Causal Effects
Under-identification concerns the event study leads; what about the static regression model? As I alluded to earlier, assumption 3 is creating problems for us in the static specification because it explicitly excludes the possibility of heterogeneity. This means that, in practice, the underlying weights estimated by the mechanics of TWFE may be negative, and particularly so for long run treatment effects. In other words, though those long run treatment effects are absorbed into the static parameter, they can still through their implicit negative weighting distort the coefficient away from any true effect.
But what about these weights? We’ve hardly discussed them. They are easier for me to see and discuss when focusing on the static parameter. If you can commit to no anticipation (assumption 2) and parallel trends (assumption 1), then TWFE will in fact estimate a weighted average of treatment effects. This is great news — the truth is contained in the TWFE coefficient. It’s a weighted average, too, just like we want right? Not quite.
The problem with only having assumptions 1 and 2 when estimating a static model is with the sign of the underlying weights. These weights can be negative if assumption 3 doesn’t hold, even though these weights will always sum to one. The authors take us through a series of examples to illustrate how this problem is introduced with the differential timing scenario, which I won’t reproduce here, but suffice it to say the static parameter estimate under differential timing weights down the long run treatment effects when assumption 3 does not hold. This problem, while discouraging, is not guaranteed, though. It can vanish, even, when the donor pool of never-treated units is large. Of this they write:
“With a large never-treated group, our setting becomes closer to that of a classical non-staggered difference-in-differences design, and therefore [in practice] the negative weights disappear.”
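The sign problem can be seen directly in a toy panel. Under assumptions 1 and 2, the static TWFE coefficient is a weighted sum of the treated cells' effects, with weights proportional to the treatment dummy after it has been residualized on unit and period fixed effects. The sketch below computes those weights for a hypothetical two-unit, three-period panel and then again after adding twenty hypothetical never-treated units. The function name and the panel are illustrative assumptions.

```python
import numpy as np

# Rows are units, columns are periods. Unit A is treated from the second
# period on, unit B from the third period on; nobody is never-treated.
D = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

def twfe_static_weights(D):
    """Implicit weight on each treated cell in the static TWFE coefficient."""
    # Residualize D on unit and period fixed effects. Double demeaning is
    # exact only because the panel is balanced.
    D_res = (D - D.mean(axis=1, keepdims=True)
               - D.mean(axis=0, keepdims=True) + D.mean())
    w = D_res / (D_res ** 2).sum()
    # Only treated cells carry a treatment effect, so zero out the rest.
    return np.where(D == 1, w, 0.0)

weights = twfe_static_weights(D)
print(weights)
print(weights.sum())

# Same two treated units plus twenty never-treated units.
D_big = np.vstack([D, np.zeros((20, 3))])
weights_big = twfe_static_weights(D_big)
print(weights_big[:2])
```

In the small panel the early-treated unit's last period gets a weight of -0.5, which is exactly the long-run effect being weighted down, while the weights still sum to one. With the never-treated units added, every weight in this example is non-negative.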
Negative weights are closely related to problems related to extrapolation in linear regression, so another way of conceiving of this bias is that under homogenous treatment effects, TWFE can actually correctly extrapolate beyond the support of the data and recover the average treatment effect itself. The mechanics of TWFE in other words implicitly adds up numbers based on contrasts between treated units and comparison units, and when treatment effects are homogenous, its extrapolation is a feature, not a bug, allowing for a more accurate estimate of the ATT associated with the panel’s treated units.
But with heterogeneity, differential timing is not so benign because TWFE is working through a series of comparisons between treated and untreated units, treated and not-yet-treated units, and finally a “forbidden” contrast (their term) comparing treated units to previously treated units, either treated earlier in the panel, or treated before the start date of the panel itself. And it’s here where the extrapolation properties fail us. When treatment effects are homogenous, such contrasts are harmless. Their taboo nature comes not from making such contrasts themselves, but from the presence of heterogenous treatment effects contained within the comparison group. While homogeneity allows us to identify long run effects using the implicit extrapolation contained in the mechanics of TWFE, with heterogeneity, such talents disappear and are replaced by a distorting statistical process which biases coefficients away from the target parameters we care about.
Imputation-based estimation and testing
So where to from here? Well, up until now, you would be forgiven for feeling bummed because of all this negativity. But this paper is like several other papers in that it is a sunny, positive, optimistic paper, not a pessimistic one, because they have figured out a way around this forest of TWFE related problems. Like a doctor giving an antidote for your ailment, BJS have a cure for your difference-in-differences ills. The cure, in this case, is an alternative to a conventional TWFE model, something they call “imputation based estimation”, which is a robust and efficient estimator built from the ground up. The authors present asymptotic analysis and show that the estimator is consistent and asymptotically normal, with good coverage and valid pre-tests.
First things first, though — we must revisit some of these assumptions to guarantee this thing they made after all really works. They do this by modifying some of the assumptions they mentioned earlier as well as introducing some new ones. Specifically, they modify assumption 1 slightly to make it a more general model of the potential outcome, Y(0). This new model can accommodate time varying controls though it is unclear the degree to which the Sant’Anna and Zhao (2020) critique of time varying controls under various data generating processes will hold for this estimator. They also introduce a fourth homoskedasticity assumption.
Using new assumption 1 and assumptions 2-4, they build an efficient estimator. This efficiency property of their estimator is perhaps one of the most intriguing parts of the paper. They show that this estimator is efficient among all linear unbiased estimators, which sounds a lot like a BLUE OLS property. In a 3-step process, they show that by simply imputing potential outcomes for all units using a TWFE regression, they will avoid the forbidden contrasts we’ve become accustomed to worry about while simultaneously recovering the individual level treatment effects that map onto our target parameter once properly weighted. These steps are:
Estimate theta with theta hat, using only the untreated units.
Impute the missing counterfactuals and calculate treatment effects using the coefficients from step 1.
Estimate the weighted average of these individual imputation-based treatment effects.
Of this estimator, they write:
“The idea is to estimate the model of Yit(0) using the untreated observations and extrapolate it to impute Yit(0) for treated units… The imputation structure of the estimator is related to the direct estimate of the counterfactual considered by Gobillon and Magnac (2016) for linear factor models.”
There is at least one subtlety with this multi-step process and it concerns step 1. It is very important that the reader understand that step 1 regresses Y onto fixed effects using only the untreated units. As a sidebar, you may have seen this step before in a similar multi-step process that Bertrand, Duflo and Mullainathan (2004) outlined as an alternative to clustering and block bootstrapping to get conservative standard errors.5 That 17-year-old paper, in other words, also suggested regressing Y onto fixed effects using only untreated units with differential timing, but they then immediately moved into calculating residuals for the treatment group based on these coefficient estimates, whereas BJS move into obtaining fitted values for the treatment group.
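The three steps can be sketched in a few lines. What follows is a minimal illustration on simulated data, assuming the simplest model Y_it(0) = alpha_i + beta_t with no covariates, and it is not the authors' Stata implementation. The function name, the simulated panel, and the choice of equal weights are all assumptions, and the sketch computes point estimates only, with no standard errors.

```python
import numpy as np

def imputation_estimator(y, unit, period, treated, weights=None):
    """Three-step imputation sketch for Y_it(0) = alpha_i + beta_t."""
    units, periods = np.unique(unit), np.unique(period)
    # Design matrix: every unit dummy plus period dummies (first period omitted).
    X = np.column_stack(
        [(unit == u).astype(float) for u in units]
        + [(period == p).astype(float) for p in periods[1:]]
    )
    # Step 1: fit the fixed effects on untreated observations only.
    coef, *_ = np.linalg.lstsq(X[~treated], y[~treated], rcond=None)
    # Step 2: impute Y_it(0) for treated observations and subtract it.
    tau_hat = y[treated] - X[treated] @ coef
    # Step 3: aggregate; equal weights give an ATT-style average.
    if weights is None:
        weights = np.full(tau_hat.shape, 1.0 / tau_hat.size)
    return tau_hat, float(weights @ tau_hat)

# Hypothetical staggered panel: 6 units, 5 periods, event date 99 = never treated.
rng = np.random.default_rng(0)
event = np.array([2, 2, 3, 3, 99, 99])
n_units, n_periods = len(event), 5
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
treated = period >= event[unit]

alpha, beta = rng.normal(size=n_units), rng.normal(size=n_periods)
# Heterogeneous effects: they grow with time since treatment and differ by unit.
tau_true = np.where(treated, 1.0 + 0.5 * (period - event[unit]) + 0.1 * unit, 0.0)
# No noise is added, so the imputed effects should match the truth exactly.
y = alpha[unit] + beta[period] + tau_true

tau_hat, att = imputation_estimator(y, unit, period, treated)
print(np.allclose(tau_hat, tau_true[treated]), att)
```

Because step 1 never touches a treated observation, no already-treated unit can end up serving as a control, which is how the forbidden contrasts are avoided. The sketch only works if every unit and every period shows up among the untreated observations, which the two never-treated units guarantee here.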
Advantages of the Imputation Estimator Against Alternatives
With so many estimators in your pocket, you may be wondering well why do I need this? There are three possible advantages I’d like to stick in your head.
First, this procedure is fast, which will count for a lot in large datasets. The bootstrap may be computationally intensive for larger datasets, whereas this is likely to be fast. Second, for many people, imputation may be a more intuitive way to think about the process of estimation. We are imputing potential outcomes for the treatment group, which through steps 2 and 3 allows us to calculate both individual treatment effects as well as more aggregated ones through weighting. BJS is not original in presenting imputation as a method for calculating treatment effects; as the Imbens and Rubin quote I opened with notes, nearly all if not all estimators in causal inference can be re-cast as imputation. We see it in earlier work such as Heckman, et al. (1997) outcome regression, Athey, et al. (2018) matrix completion for panel data, and matching estimators. But whether you find imputation intuitive too, well, de gustibus non est disputandum.
But third is the efficiency gains. Perhaps the most important advantage of the imputation estimator is its efficiency properties. As said earlier, the imputation estimator has BLUE like properties owing to assumption 4 about homoskedastic residuals. But seriously — who cares really about homoskedasticity properties in 2021? Heteroskedasticity is likely a more realistic assumption. But, as it turns out, this homoskedasticity assumption is merely providing a benchmark; their Monte Carlo simulations show that even under various alternative error structures, the efficiency gains made from the imputation estimator can be non-trivial.
Comparisons with other estimators
Let me try and stick this landing by wrapping this up. BJS compare their estimator to other unbiased estimators because recall their claim that the imputation estimator has something like a BLUE property. It is, among linear unbiased estimators, the one with lowest variance under homoskedasticity. But what about its performance under heteroskedasticity or other types of error structures? How will it perform then, and importantly for many of us, how will it perform relative to other estimators? Given the plethora of estimators surrounding you, this is perhaps the million dollar question — should I use this or not?
The authors consider two estimators for comparison: Sun and Abraham (2020) (SA) and De Chaisemartin and D’Haultfœuille (2020) (DCDH).6 I present variance, coverage and biases associated with relaxing both assumption 4 and assumption 2 for all three estimators in Table 3 from BJS. There’s a ton of numbers here, so what should we look for? The main thing I have focused on is the variance columns. Notice how both DCDH and SA have slightly higher variance. Across the board, the imputation estimator has lower variance except for a few trivial cases where the variance is only slightly larger (e.g., AR(1) residuals). Interestingly, the bias associated with violations of no anticipation is often slightly lower, sometimes by as much as 40%, though not always (for longer horizons for instance).
The estimator is available in Stata and will soon be uploaded to SSC. BJS note that the estimator is very flexible because it can be used with triple differences, event studies, and time varying controls. You can get the Stata software directly from Kirill Borusyak by emailing him. I highly recommend you do.
The solution to differential timing isn’t to make the assumptions one needs in order to keep using the estimator you know and like. Rather it is to “embrace the heterogeneity” by using robust estimators with engines designed for such things. This is done, in part, by defining precisely the target parameter of interest, which they propose is a weighted average of estimated individual treatment effects in which weights, like other papers, can be chosen to move towards the policy parameters we know and love, such as ATT. And in their schema, it is by building a multi-step regression based estimator that can under minimal assumptions recover the treatment effects of interest.
Their proposed solution is the imputation estimator. When one thinks about it, so many contemporary causal tools can be traced back to weighted imputations. We see this in synthetic control, matrix completion and matching estimation. As the quote at the beginning says, nearly all if not all causal inference tools in fact have this feature, even if not explicitly. BJS is explicit about it.
I predict that the paper’s citation count will far exceed 400 as time passes. It’s an important paper worth your close attention. I hope you have found this sub stack helpful.
Looking back, probably the most sensational discovery of this new difference-in-differences literature (for me anyway) was finding that parallel trends, while necessary, was not sufficient to identify the average treatment effect on the treated when there are heterogenous treatment effects and differential timing. It’s hard to remember when that felt new to me, given I now take it for granted, but needless to say, it has been an unnerving discovery for many seasoned researchers to learn this.
I remain intrigued by the discovery of these problems by independent researchers working across the country at or around the same point in time. The one common theme that I have seen in this body of work is that while discoveries have been made independent of one another, most of these papers have been written by junior economists. Not only have they been assistant professors, but in several instances, the authors were graduate students. For whatever reason, over the last five plus years, applied econometricians caught wind of it and began working on problems related to it, by all appearances, simultaneously and independently of one another. Pretty interesting if you ask me.
“If Stata told you to jump off a bridge, would you do that too?” — paraphrasing of Mrs. Cunningham to one of our children.
I would be a bad promoter of my own textbook if I didn’t point out that directed acyclic graphs are great ways to incorporate a priori economic theory into causal estimation, even if in this application I really have no idea how I’d do so. But that’s kind of the story of my entire adult life — knowing I should do something, could do something, and having no idea how to actually do that something.
I discuss this third procedure from Bertrand, Duflo and Mullainathan (2004) in my book, Causal Inference: the Mixtape, in my chapter on difference-in-differences. See here.
I intend to review DCDH on this sub stack soon.