Two Stage DiD and Taming the DiD Revolution

Gardner (2021) explainer

Gardner, John (2021) “Two Stage Difference-in-differences”, Working paper.


“It seems natural that [TWFE] should identify the ATT.” - John Gardner (2021)

A long and windy introduction

The DiD credibility revolution has been, for some, a source of genuine excitement over a topic researchers haven’t thought deeply about since graduate school: the field of econometrics. They may be excited both because retooling their skillset is associated with new human capital investments and because it concerns one of their go-to designs: the difference-in-differences design.1

But for others the new econometrics of DiD has been somewhat intimidating and even a bit frustrating because we may be inclined to think that in order to move forward, we must incur substantial resource costs with rising marginal costs. Some of us worry about the additional burden because we are already operating at the edge of our time constraints as is. And if this market for papers continues to grow, how many theory papers from this literature will we in the end be expected to read? Ten? Twenty? And how different will each one be from the last? The more the new literature grows, the less likely it is many of us will read them because of high marginal costs due to time constraints as well as perceived declines in marginal benefits.

But those are feelings. We need DiD to do our research, and so may not be sure when to stop or where to focus our attention. That uncertainty may cause frustration, discouragement and a feeling of being overwhelmed by all the options, and even cynicism about this revolution may creep into our minds. Old dogs prefer the tricks they learned when they were puppies, after all, because of the substantial human capital already invested in those tricks.

So, in this substack, my hope is that I can introduce you to an estimator that is worth your scarce time because of its scientific insights as well as its practical usefulness. Practical in that it will play to many people’s strengths by building on the aforementioned human capital everyone already possesses: an understanding of the mechanics of OLS. And I suspect that they will also be intrigued that the estimator can be trivially implemented in major programming languages including Stata. So, with that in mind, I’d like to introduce you to a new working paper by John Gardner, an associate professor at University of Mississippi, entitled “Two Stage Difference-in-differences”.

A Scouting Report of the Three Types of Robust DiD Estimators

There is a growing number of robust estimators in the new econometrics of DiD, and trying to keep track of them is challenging in its own right, let alone learning them sufficiently well that one can choose between them. Before diving into Gardner’s estimator, I’d like to offer a map of sorts that I hope can help you position his estimator in the family of robust DiD estimators. Consider this taxonomy a working framework, one I’m willing to update as I gain a deeper understanding of what is common and what is uncommon among these new estimators. And while this taxonomy isn’t perfect, I think it may help you because it captures at least some of the elements common to these estimators. Consider it a kind of scouting report, and nothing more.

All of the robust DiD estimators focus on identifying the overall ATT associated with some treatment by avoiding using already-treated units as controls. Avoiding those comparisons introduces forms of sample selection, though, which must be accounted for. Authors have different strategies to accomplish this, and most of those strategies involve building to a larger ATT through a weighting scheme based on estimating smaller building block ATTs, like the group-time ATT or estimated individual treatment effect itself. I sort a robust estimator into one or more of the following:

  1. Weighted group-time ATT

  2. Stacking through balancing in relative event time

  3. Imputation methods

When I began this project, the weighted group-time ATT was the kind of estimator that I tended to favor. It was the one pursued by Callaway and Sant’Anna (2020) and Sun and Abraham (2020). These two estimators were nested versions of each other, which made them “feel” more likely to be the “correct” approach. Software was available for each of them, papers had already been published using them, and so I tended to prefer and advocate for them.

But in time, as I learned more about this emerging world of robust estimators, I noticed that weighting up the ATT using never-treated or not-yet-treated units as controls wasn’t the only way to skin the cat. Stacking, for instance, was also a viable alternative. Stacking solved the problem by turning the differential timing problem back into a two-group design by restructuring the dataset into relative event time rather than calendar time. This was done because the two-group design does not actually suffer from the problems that many had reported about TWFE. Once the data is reconstructed into a balanced panel in relative event time, where treatment is centered at the same “relative treatment date”, one estimates a conventional TWFE model controlling for group and period fixed effects to recover a weighted average of treatment effects.2 The article most commonly associated with this strategy is Cengiz, et al. (2019), though there are others too.

The third approach is an imputation method, which estimates missing counterfactuals in a multi-step process that exploits the parallel trends assumption’s implications for dynamics in the untreated units. An example of this is Borusyak, Jaravel and Spiess (2021) and their imputation estimator. The Athey, et al. (2017) article on matrix completion with panel data is not technically a DiD estimator, since it does not depend on any parallel trends assumption, but for all practical purposes it belongs in the conversation because its imputation method is useful in all the same situations as these robust DiD estimators (e.g., staggered rollout in panel data).

Where, then, does Gardner fit in this schema? The answer is somewhere between all three. Technically, Gardner does begin with a target parameter equalling the weighted group-time ATT, placing it in the company of such excellent papers as Callaway and Sant’Anna (2020). But it will not estimate the overall ATT using doubly robust methods or inverse probability weights. Rather, two-stage Difference-in-differences (2SDiD), as he calls it, will in the end be an adaptation of the familiar two-way fixed effects (TWFE) regression-based solutions, like stacking. But it is also a multi-step process with fitted values estimated using only the control units, which places it firmly in the world of imputation alongside Borusyak, Jaravel and Spiess (2021).

Target causal parameter, parallel trends and model misspecification

Defining target parameters

Gardner formalizes his target parameter by equating it with the population regression coefficient on the treatment dummy. When one looks closely, they should recognize it as the mean group-time ATT similar to that defined by Callaway and Sant’Anna (2020):
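In potential-outcomes notation, with Y^1 and Y^0 denoting treated and untreated potential outcomes (this is my transcription, so any slip is mine rather than Gardner’s), the group-time ATT can be written as:

```latex
\beta_{gp} = E\left[ Y^{1}_{gpit} - Y^{0}_{gpit} \mid g, p \right]
```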

where g indexes a set of units treated at the same time called a “group”, p indexes the common treatment dates for said groups, i refers to individual units and t refers to calendar time.3 This interior term is the group-time ATT because it aggregates over individual treatment effects by group and period. Given periods where a treatment occurs, then we can define the mean of this parameter as the “overall ATT”:
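A compact way to write that mean, assuming D_{gp} = 1 flags the group-period cells in which treatment is switched on, is the following sketch:

```latex
ATT = E\left[ \beta_{gp} \mid D_{gp} = 1 \right]
```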

This is probably what we all had in mind — an average treatment effect for all periods treated.

Implications of parallel trends

Gardner makes several references to a parallel trends assumption, which isn’t surprising given this is after all a paper on DiD. But it’s the implication of the parallel trends assumption that I want to note. He describes the implications of his parallel trends in the following way:

“Absent the treatment, treated units would experience the same change in outcomes as untreated units. Mathematically, this amounts to the assumption that average untreated potential outcomes decompose into additive group and period effects.” (my emphasis, p. 4)

It is unclear precisely what form his parallel trends assumption will take, but since it’s a linear function of group and period fixed effects, it might be the Borusyak, Jaravel and Spiess (2021) factor model as opposed to the conditional parallel trends assumption in Callaway and Sant’Anna (2020).4

If we assume this version of parallel trends, then as he stated, the mean outcomes will evolve according to the following linear specification:
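Using the symbols described in this section, and taking the conditioning set to be the same one that appears in the Gardner quotation further down, E(Ygpit|g, p, Dgp), the conditional mean function should read roughly as:

```latex
E\left[ Y_{gpit} \mid g, p, D_{gp} \right] = \lambda_g + \gamma_p + \beta_{gp} D_{gp}
```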

where lambda is a set of group fixed effects, gamma is a set of period fixed effects, and Dgp is a “static” treatment indicator equalling 1 if that group was treated in that time period. Notice the subscripts on the beta coefficient: there are heterogeneous treatment effects in this conditional mean function.

Model misspecification

With definitions and assumptions in place, let’s begin by writing down the canonical TWFE regression model with a static specification so often used in DiD designs:
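For reference, the static TWFE regression most of us have in mind looks like this (my notation, with a single common coefficient on the treatment indicator and an error term I label epsilon):

```latex
Y_{gpit} = \lambda_g + \gamma_p + \beta D_{gp} + \varepsilon_{gpit}
```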

Notice the lack of subscripts on the beta coefficient now and how this differs from the previous equation. We had high hopes that if we ran this specification, our estimated coefficient might be interpreted as a weighted average over all ATTs. Borusyak and Jaravel (2017) say we had hoped that this specification was recovering:

“a regression weighted mean of the average effect of the treatment in each post-treatment period.” (my emphasis)

We can be forgiven for thinking this because in the two group case, the above regression actually did recover the overall ATT. But Borusyak and Jaravel (2017) and others note that this may not have a reasonable causal interpretation under heterogeneity with differential timing, even though it had under the two group case. To see the correct specification that accounts for heterogeneity, let’s manipulate the original conditional mean function by adding a zero to the right hand side. Define the zero as:
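One natural choice of zero, and the one I believe the argument relies on, adds and subtracts the overall ATT times the treatment indicator:

```latex
0 = E\left[ \beta_{gp} \mid D_{gp} = 1 \right] D_{gp} - E\left[ \beta_{gp} \mid D_{gp} = 1 \right] D_{gp}
```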

Now add this zero to the conditional mean expression from the equation before and rearrange slightly to get:
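Carrying out that step term by term (adding and subtracting the overall ATT times D_{gp}, then grouping), the rearranged conditional mean should look like the expression below; treat it as my own working rather than a quotation from the paper:

```latex
E\left[ Y_{gpit} \mid g, p, D_{gp} \right]
  = \lambda_g + \gamma_p
  + E\left[ \beta_{gp} \mid D_{gp} = 1 \right] D_{gp}
  + \left( \beta_{gp} - E\left[ \beta_{gp} \mid D_{gp} = 1 \right] \right) D_{gp}
```

The last term is the “new error term”: it collapses back to beta_{gp} D_{gp} when combined with the third term, so nothing has been added to the model.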

The problem with this regression model is that the last term — a kind of “new error term” — doesn’t zero out because it is not mean independent of treatment status due to effect sizes differing according to group and period. The static specification is misspecified because:

“Misspecified difference-in-differences regression models [TWFE] project heterogenous treatment effects onto group and period fixed effects rather than the treatment status itself.” (my emphasis)

It is this projection that is ultimately the source of bias in the unadjusted TWFE model as it leads to biased estimates in the static specification.

Interpreting coefficients under TWFE

So if TWFE doesn’t recover the group-time ATT or the overall ATT because of model misspecification, then what exactly is it recovering? Gardner writes of this estimate by saying that:

“[The static specification] identifies the linear projection of average outcomes onto group and period effects and a treatment indicator (which differs [from] E(Ygpit|g, p, Dgp) when that function is nonlinear)” (my emphasis)

In other words under his parallel trends assumption, TWFE recovers the following weighted parameters:

but the weights are equal to:
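As best I can reconstruct it from the four probabilities listed next, the TWFE estimand and its weights take roughly the following form, with the sums in the denominator running over treated group-period cells only. I am assuming the usual double-demeaning structure here, so check the paper before relying on the exact expression:

```latex
\beta^{*} = \sum_{g}\sum_{p} \omega_{gp}\,\beta_{gp},
\qquad
\omega_{gp} =
\frac{\left[\, 1 - P(D_{gp}=1 \mid g) - P(D_{gp}=1 \mid p) + P(D_{gp}=1) \,\right] P(g,p)}
     {\sum_{g'}\sum_{p'} \left[\, 1 - P(D_{g'p'}=1 \mid g') - P(D_{g'p'}=1 \mid p') + P(D_{gp}=1) \,\right] P(g',p')}
```

Read this way, a cell’s weight shrinks as either conditional treatment share grows and rises with the cell’s population share.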

where conditional and unconditional probabilities equal:

  1. P(Dgp=1|p): share of units treated for a given p period

  2. P(Dgp=1|g): share of periods treated for a given g group

  3. P(Dgp=1): share of unit X time treated out of total observations

  4. P(g,p): population share of observations corresponding to a given group g and period p

Let me try to explain the implications of this admittedly complicated weighting scheme. This interpretation is not dissimilar to that which Goodman-Bacon (2021) found using his decomposition of the TWFE estimator. The longer a group’s observed treatment duration, the greater P(Dgp=1|g) will be, and the larger that conditional probability, the more the group’s treatment effect will be absorbed by the group dummy. Similarly, the greater P(Dgp=1|p), the more treatment effects that occurred during that period will be absorbed by period fixed effects. Larger groups will receive similar weight. The group and period fixed effects are doing a lot of work to screw things up by creating such wonky weights.

But it was this observation — that the TWFE model projects heterogeneous treatment effects onto the group and period fixed effects — that led Gardner to his deceptively simple solution. If we could simply remove these fixed effects, then might a regression using a transformed outcome variable work? The answer, as he proves in Appendix A, is a resounding yes.

Two stage Difference-in-Differences (2sDiD)

A second taxonomy of the new econometrics of DiD is the structure and purpose of a given paper. There are three types of papers in the econometrics of DiD literature: papers that only show the shortcomings of TWFE (e.g., Goodman-Bacon 2021), papers that present original alternatives to TWFE (e.g., Callaway and Sant’Anna 2020), and papers that do both (e.g., Sun and Abraham 2020). Gardner (2021) is of the third type: he shows why TWFE fails, but he also shows how to fix it.

Here I want to outline the set of steps that when followed will isolate the overall ATT even under heterogeneity and differential timing. The key is a first step, as alluded to in the previous paragraph, conducted before estimating the static specification.

Step 1: Estimating group and period fixed effects

2SDiD is a simple two-step procedure that, under the parallel trends assumption, removes group and period fixed effects in a first step using only the comparison observations, that is, the units and periods where Dgp=0. So let’s estimate these two sets of fixed effects using the observations where Dgp=0.

So long as we have common support with respect to the group and period fixed effects, meaning there are untreated observations for each group and each period, the fixed effects can be identified using only the untreated groups and periods. The specification for this becomes:
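Written out in my own notation, the first stage is just the TWFE regression with the treatment term dropped:

```latex
Y_{gpit} = \lambda_g + \gamma_p + \varepsilon_{gpit}
```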

which is estimated only on observations where Dgp=0. This specification gives us consistent estimates of our group and period fixed effects under parallel trends (see Appendix A for proof):

We then residualize the outcome variable by removing the estimated fixed effects entirely from our measured outcomes:
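With hats marking the first-stage estimates and a tilde marking the transformed outcome (the tilde is my label, not necessarily the paper’s), the residualization is:

```latex
\tilde{Y}_{gpit} = Y_{gpit} - \hat{\lambda}_g - \hat{\gamma}_p
```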

Step 2: Estimating the overall ATT

Once we have the group and period fixed effects estimates and have residualized the outcome for all units (even Dgp=1) using them, we move to step 2. In Step 2, we get consistent and unbiased estimates of the overall ATT using this transformation as our outcome. Gardner summarizes the success of the second stage:

“The overall group X period ATT is identified from a comparison of mean outcomes between treated and untreated groups after removing group and period fixed effects.” (p. 8, my emphasis)

In the second step, we regress our transformed outcome variable onto treatment status:
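A sketch of that second-stage regression, where \tilde{Y}_{gpit} = Y_{gpit} - \hat{\lambda}_g - \hat{\gamma}_p is the residualized outcome and u is an error term I introduce for illustration:

```latex
\tilde{Y}_{gpit} = \beta D_{gp} + u_{gpit}
```

There is no constant and there are no fixed effects on the right-hand side, because both were already removed in step 1.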

The reason that this estimates the overall ATT is that, after netting out the estimated group and period fixed effects, the “new error term” can no longer be projected onto the group and period fixed effects, as they are gone from the regression altogether. Without them, this “new error term” is mean zero conditional on treatment status, and mean outcomes net of the fixed effects, conditional on group, period and treatment status, equal:
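Under the additive parallel trends structure assumed throughout, my reading is that the expression in question is:

```latex
E\left[ Y_{gpit} - \lambda_g - \gamma_p \mid g, p, D_{gp} \right] = \beta_{gp} D_{gp}
```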

And thus the 2SDiD estimator identifies:
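Spelled out with the weights P(g,p|Dgp=1) made explicit (again my rendering of the result, not a quotation):

```latex
E\left[ \beta_{gp} \mid D_{gp} = 1 \right] = \sum_{g}\sum_{p} \beta_{gp}\, P\left( g, p \mid D_{gp} = 1 \right)
```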

Notice that this will tend, by construction, to put more weight on the early periods of treatment. This is not because of negative weighting as we have seen elsewhere, but because every group has early post-treatment periods, whereas differential timing implies imbalance in relative event time, so later-treated groups have fewer post-treatment periods and thus fewer beta coefficients overall.5 Otherwise, we identify the overall ATT, which is the average over every group-time ATT weighted by P(g,p|Dgp=1).

Software

The steps involved are simple, but the inference is a little more complicated because the second step does not take into account that the transformed outcome variable was constructed using estimates from an earlier stage. Gardner notes that the two stages can be treated as a joint GMM estimator, which yields valid asymptotic standard errors. Kyle Butts, a talented young PhD student at CU Boulder, has graciously provided Stata code, though, which will do this in one line. His ado package, did2s, is downloaded from GitHub on line 8.

In the following snip of code, I present three coefficients on the overall ATT using the 2SDiD estimator in Stata code. The dataset used comes from the Cheng and Hoekstra (2013) article that I briefly discussed in the Callaway and Sant’Anna substack from earlier. It is also downloaded from GitHub on line 13.

When we estimate TWFE using the unadjusted outcome on log homicides (line 19), we get a statistically significant treatment effect of 0.076, or roughly a 7.6% increase in homicides from the castle doctrine gun reform, which rolled out across the country from 2005 to 2009 in their panel.

But then starting at line 22, I estimate the effect using the 2SDiD procedure. Lines 22-28 do this manually. First we estimate the fixed effects using the Dgp=0 units (line 24). Then we residualize the outcome variable in line 27. In the second step, we estimate the treatment effect in line 28 (without group and time fixed effects, so no use of xtreg). Notice that it is necessary to use the “nocons” option; otherwise it will drop one of the fixed effects.6 Now the effects are not very different, but as you may have noticed repeatedly in earlier analysis of the Cheng and Hoekstra castle doctrine data, TWFE performs relatively well because of the large number of untreated units (echoed by the fact that the Bacon decomposition always shows very small weights on the late-to-early 2x2s).7

But as we mentioned, the standard errors are biased since they fail to incorporate the first step. Therefore we use the did2s command that Kyle wrote (line 32), and we get the same coefficient of 0.075 with larger standard errors (about two times as large).

**********************************************************************
* name: 2sdid.do
* author: scott cunningham (baylor) using kyle butts (colorado) ado
* description: estimate treatment effects using 2sDiD
* date: June 2, 2021
**********************************************************************

net install did2s, replace from("https://raw.githubusercontent.com/kylebutts/did2s_stata/main/ado/")
net install cleanplots, replace from("https://tdmize.github.io/data/cleanplots")
* ssc install did2s

* load data
use https://github.com/scunning1975/mixtape/raw/master/castle.dta, clear
set scheme cleanplots

** Static specification and population weights

** Begin TWFE specification.
xi: xtreg l_homicide i.year post [aweight=popwt], fe vce(cluster sid)


** Begin Manual 2SDiD
* Step 1: Manually (note standard errors are off)
reg l_homicide i.sid i.year [aweight=popwt] if post == 0

* Step 2: Regress transformed outcome onto treatment status for all units
predict adj, residuals
reg adj i.post [aweight=popwt], vce(cluster sid) nocons
* 1.post .075

** Begin Butts' did2s ado file
did2s l_homicide [aweight=popwt], first_stage(i.sid i.year) second_stage(i.post) treatment(post) cluster(sid)
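Because the castle doctrine data has so many untreated units, TWFE and 2SDiD land close together there. To see the two diverge, here is a minimal simulated sketch in plain Stata. Everything in it is hypothetical and chosen for illustration: 300 units, 12 periods, two treated cohorts first treated in periods 4 and 8, a never-treated cohort, and a treatment effect that grows by one in each treated period. The variable names (id, g, t, D, tau, y, ytilde) are mine, not from the paper or from did2s.

```stata
* Hypothetical simulation, not the castle doctrine data
clear all
set seed 20210602
set obs 300                                   // 300 simulated units
gen id    = _n
gen alpha = rnormal()                         // unit fixed effect
gen g     = cond(id <= 100, 4, cond(id <= 200, 8, 0))  // first treated period; 0 = never treated
expand 12                                     // 12 periods per unit
bysort id: gen t = _n
gen D   = (g > 0) & (t >= g)                  // treatment status
gen tau = D * (t - g + 1)                     // effect grows by 1 each treated period
gen y   = alpha + 0.3*t + tau + rnormal()     // additive unit and period effects plus treatment

* True overall ATT: mean effect across all treated observations
summarize tau if D == 1

* Static TWFE on the raw outcome
regress y i.id i.t D, vce(cluster id)

* 2SDiD by hand: step 1 uses untreated observations only
regress y i.id i.t if D == 0
predict double ytilde, residuals              // computed for all observations, treated included

* Step 2: residualized outcome on treatment status, no constant
* (these standard errors ignore the first stage, as discussed above)
regress ytilde D, noconstant vce(cluster id)
```

By construction the true overall ATT in this design is 60/14, or about 4.29 (cohort 4 contributes effects 1 through 9, cohort 8 contributes effects 1 through 5). The step 2 coefficient should land near that number up to sampling noise, while the static TWFE coefficient generally will not, since the growing effects get partly absorbed by the fixed effects.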

Conclusion

I foresee a termination point in the near future in the new econometrics of DiD, beyond which the gains from reading the marginal theory paper may not be substantial relative to the robust alternatives that already exist. Thus I think we are entering a second stage where selection of approaches will be dictated by things such as which parallel trends assumption you are willing to make, and one’s own need for speed when it comes to estimation. Practically speaking, I suspect that the marginal theory paper soon will not perform substantially better on unbiasedness and consistency, seeing as all of the robust estimators are consistent and unbiased, but will rather compete on how well a given estimator’s particular characteristics suit the data and problem in your project. My prediction is that even more pragmatic things like computational speed, ease of use, and availability in conventional software packages will also be major elements in adoption.

There are many elements of Gardner’s paper that caused me to like it, some of which were scientific insights, and some which were arguably aesthetic. I have selected this paper because it is one of the latest new econometrics of DiD papers on the scene, but more importantly, because I think it offers much that readers will appreciate.

  1. Gardner’s solution sticks to regression formulas and conditional expectations which may make it easily accessible to many who have thought about DiD primarily in terms of a TWFE regression specification

  2. Gardner’s two-stage Difference-in-differences (2SDiD) estimator has a great name that reminds me of the most popular instrumental variables design, the two stage least squares (2SLS) estimator. While they share only the same number of stages, that was still a fun little branding Gardner chose with a great abbreviation: 2SDiD has a nice ring to it.

  3. 2SDiD is easy to implement, requiring almost no programming, in Stata and I believe in R as well. That’s because the estimator is deceptively simple which I show in the manual representation in lines 24-28 above as well as line 32.

It’s possible that these three reasons alone could be sufficient for adoption by many people. The estimator’s intuitive diagnosis of why TWFE fails, its solution, and the availability of the solution in common software needing very little programming are all likely to lead to at least some adoption. That the estimator is consistent with valid asymptotic standard errors is also an argument for its use. And the fact that it is fast while still possessing the ability to bend to traditional designs, like DiDiD and event studies, will make the marginal cost of adoption very low relative to the marginal benefit.

I hope that this explainer has helped open your eyes to this very interesting paper. I would like to now conclude with a short 30 minute interview with the author himself. In it you will learn more about John’s background, his journey into working on this paper, as well as technical questions about it, such as whether it can provide consistent estimates of individual treatment effects or just mean treatment effects, and whether his procedure will work properly with reversible treatment.


Interview

1

Given we are academic researchers, I take it for granted that like me, deep down, you are a junkie for learning new things.

2

Interestingly, Gardner shows the form of the stacking weights in Appendix B of this manuscript.

3

Groups are units that share the same treatment date, or period. More than two groups explicitly means a panel will be imbalanced in relative event time even if balanced in calendar time. Take two groups and a ten year panel. Group K is treated in year 3 and thus has 7 post-treatment lags. Group L is treated in year 7 and has 3 post-treatment lags. Notice that in calendar time, each unit has ten periods, but in relative event time, units in group K have 7 post-treatment lags, but group L does not. It is in this sense that we say differential timing creates imbalanced panels in relative event time, which is why historically so often authors would bin or trim the “hanging leads and lags”.

4

This is, again, not a criticism. It is merely noting the subtly different expressions of the parallel trends assumption that are buried in these myriad papers, not all of which are precisely the same. These differences matter because they encompass different data generating processes, which will impact performance in Monte Carlo simulations and practice in real life. Gardner’s will model potential outcomes in terms of additive group and period fixed effects plus an error term with mean zero.

5

Event studies can also be done with the 2SDiD method, but instead of using a static specification, one estimates a dynamic one. I won’t rewrite this specification since it follows the same logic; the key is that the outcome has been transformed so that group and period fixed effects are removed.

6

Like a champ, an earlier version of this substack did that very thing! As I tell my kids, do as I say, not as I do…

7

Probably, it would be helpful if I had a better dataset to illustrate the problems of TWFE, but I leave this trivial proof to the reader.