Waiting for Event Studies: A Play in Three Acts

Sun and Abraham (2020) Explainer


Goodman-Bacon (2021, forthcoming) decomposes the coefficient from a panel fixed effects regression with time dummies (“twoway fixed effects” or TWFE) on a static treatment parameter into average causal effects and bias terms under differential timing. His decomposition revealed that TWFE needs parallel trends, as always thought, but must also restrict the dynamics of treatment heterogeneity. Once we introduce heterogeneity in the treatment effects over time, an unadjusted TWFE estimate of the static parameter will be biased, and may even flip signs, which is worrisome given how uncertain we sometimes are about what effects to expect a priori.

Is the problem the static parameter? If so, then maybe we can circumvent this entirely by estimating dynamic regressions, as is done in contemporary difference-in-differences event studies. But a 2020 article in the Journal of Econometrics by Liyang Sun and Sarah Abraham shows that the coefficients on lead and lag indicators in a dynamic specification can also be biased with TWFE. So to better understand these problems and alternative routes, let’s dive into and unpack Sun and Abraham (2020).

Act I: The Violation of Strict Exogeneity

Many of us learned the panel fixed effects estimator in our panel econometrics sequence during graduate school. The panel fixed effects estimator was interesting because it could eliminate certain kinds of unobserved heterogeneity bias present in the pooled OLS model. This unobserved heterogeneity caused the treatment variable to become endogenous when it was subsumed into an unobservable composite error term consisting of the sum of the structural error term and the heterogeneity itself. If the heterogeneity was associated with the treatment variable, but time invariant, then fixed effects could eliminate it. The estimator was consistent so long as the remaining error was uncorrelated with the regressors in every period, an assumption known as “strict exogeneity”.

In a difference-in-differences setting, I usually thought of strict exogeneity in terms of whether a policy was adopted endogenously. Perhaps a state is passing minimum wages because of unobservable trends in inequality. How do I think about that in the context of possible strict exogeneity violations? My mind would usually switch to potential outcomes and ask the question differently — might strict exogeneity be another way of describing parallel trends? And for a long time, that was how I thought — “strict exogeneity implies parallel trends”, I would think, and therefore spend time in projects trying to better understand the degree to which I had evidence for parallel trends.

But now I know that strict exogeneity is a more pregnant assumption, implying more than merely parallel trends. Strict exogeneity also rules out other types of problems that arise when treatment effects are heterogeneous and policies trickle across regions at different points in time. Such scenarios are called differential timing, and are by now broadly familiar to readers as they describe most of the problems highlighted by the new difference-in-differences literature. What I now know is that heterogeneity can also violate strict exogeneity, not because it violates “parallel trends”, but for separate reasons. Learning that was reassuring because it showed that the problem had never been with TWFE; the problem had always been with me and my incomplete understanding of how differential timing related to strict exogeneity.

I’m going to use the helpful notation that John Gardner at the University of Mississippi’s economics department uses in his paper to discuss this violation. Consider the following model:

where g is a group indicator and p is a period indicator. We are interested in the group-time ATT (Callaway and Sant’Anna 2020) which is equivalent to the coefficient on X:

If we introduce heterogeneity in terms of deviations from mean treatment effects corresponding to timing of policy adoption, then we can rewrite our structural model, and after taking expectations, get the following expression:

The first term is what Gardner calls the overall average ATT, which we can think of as the average over every individual group’s ATT. The last term is a new error term, and it is not necessarily mean-zero conditional on g, p and the group- and time-varying X. In fact, TWFE will only identify this “overall average ATT” if all groups have the same ATT or if there is only one treatment group (i.e., no differential timing). This implies that strict exogeneity is, in fact, violated with heterogeneity and differential timing, because the composite error term ends up being correlated with the treatment variable and the group fixed effects. This is not new; this is old. It just feels new to me, and maybe to you too.
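Since this passage leans on three expressions, here is a rough sketch of how they might be written. It is only a reconstruction from the description in the text: the symbols \lambda_g, \gamma_p, \beta_{gp}, \bar{\beta} and \varepsilon_{gpit}, and the potential-outcome superscripts, are assumptions chosen for illustration and are not copied from Gardner's paper.

```latex
% Structural model: group effect, period effect, group-period treatment effect on X
Y_{gpit} = \lambda_g + \gamma_p + \beta_{gp} X_{gp} + \varepsilon_{gpit}

% Group-time ATT, which is the coefficient on X for group g in period p
\beta_{gp} = E\left[ Y^{1}_{gpit} - Y^{0}_{gpit} \mid g, p \right]

% Rewriting with deviations from the overall average ATT,
% \bar{\beta} = E[ \beta_{gp} \mid X_{gp} = 1 ], and taking expectations
E\left[ Y_{gpit} \mid g, p, X_{gp} \right]
  = \bar{\beta} X_{gp} + \lambda_g + \gamma_p
    + \left( \beta_{gp} - \bar{\beta} \right) X_{gp}
```

Written this way, the last term vanishes only when every treated group-period shares the same \beta_{gp}; otherwise it varies with g, p and X, which is the sense in which the new error is not mean-zero.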

The paper I discuss today shows that this strict exogeneity violation is not unique to a single static parameter. It is present in TWFE’s event study coefficients too. The problem is real in the sense that heterogeneity either is or is not a feature of the world we study. And while assuming homogeneity may give us unbiased estimates on paper, it will not do so in practice if the assumption is unwarranted. Heterogeneity makes the world interesting, but it also makes estimation of causal effects a challenging task sometimes. At least, until recently.

Act II: Revisiting the event study under differential timing and TWFE

Scene I: The sun descends over our heroes

The punchline of Liyang Sun (at MIT) and Sarah Abraham’s (at Cornerstone consulting) 2020 paper in the Journal of Econometrics, “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects”, is that TWFE estimates of coefficients on lead and lag indicators will be contaminated with information from other leads and lags unless we restrict that heterogeneity and also impose other assumptions, such as parallel trends and limited anticipation of the treatment itself. Oddly enough, what contaminates the lead and lag coefficients is the presence of treatment effects from other periods. As a result, many conventional tests employed in DD designs, such as testing the joint significance of the pre-treatment leads, may yield false negatives or false positives. That is, pre-trend tests may suggest balance when there is none, and they may show imbalance when there is balance, all because certain key assumptions didn’t hold.

The notation of the article is dense, and while I regularly wished it had been an easier paper to read, I suspect that that is not possible. So I made myself a cheat sheet, and a few times even cheat sheets for my cheat sheet. If you want to learn from this paper, you will need to study it closely, using their notation to work carefully through the proofs and propositions.

There are several key letters used as subscripts and superscripts, as well as general notational syntax, that you need to keep track of. E is a unit’s “treatment date”. The lower case l is a “relative time indicator” used to indicate a relative time period like t-1 or t+3. The lower case e corresponds to a “cohort” of units sharing the same treatment date E. Callaway and Sant’Anna call such units a “group” but they mean the same thing. And g stands for a bin, a set of relative periods; binning imposes balance in relative time by “shoving” all imbalanced leads and lags into a single lead and a single lag, respectively.

It helped me to interpret Sun and Abraham as somewhat of a hybrid between Goodman-Bacon (2021, forthcoming) and Callaway and Sant’Anna (2020) in that it contains a decomposition of TWFE (like Goodman-Bacon) but it also presents an alternative estimator (like Callaway and Sant’Anna). Unlike Goodman-Bacon (2021, forthcoming), their analysis focuses on the “dynamic” specification in event studies. Callaway and Sant’Anna’s estimator is an alternative to TWFE and can estimate the group-time ATT, which can then be used to estimate leads and lags. It is not surprising, therefore, that the Callaway and Sant’Anna estimator and the Sun and Abraham estimator end up looking similar. In fact, Sun and Abraham is a special case of Callaway and Sant’Anna.

Scene II: Identification

They call their target parameter the “cohort-specific average treatment effect on the treated” or CATT for short. It is expressed as:

The only notation that you may be unfamiliar with is the second term in the bracket, in which an infinity symbol is in the superscript. This is the potential outcome of unit i in a world where it is never treated, as if its treatment date were pushed out to infinity. The CATT is thus defined for a cohort e and a relative time period l. There are three assumptions needed to estimate this core parameter.
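For reference, a sketch of the definition being described, using the symbols from the cheat sheet above (E_i for unit i's treatment date, e for the cohort, l for relative time). The layout is a reconstruction from the surrounding description, so treat it as an aid and check it against the paper:

```latex
% Cohort-specific ATT for cohort e, l periods relative to treatment:
% observed outcome minus the never-treated potential outcome (superscript infinity)
CATT_{e,l} = E\left[ Y_{i,e+l} - Y^{\infty}_{i,e+l} \mid E_i = e \right]
```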

Assumption 1: Parallel trends. Parallel trends holds for all groups.

As the reader most likely understands what is meant by this phrase, and as it is the familiar assumption in all DD designs, I won’t belabor it here. But I will note something that caught my eye. The authors make an interesting concession: if you think parallel trends doesn’t hold for a group, then you should exclude it from the analysis. One of the features I’ve noticed lately is how often the new DD papers recommend dropping units from analysis.

Assumption 2: No relevant anticipation of the treatment.

No anticipation means that the CATT equals zero in the pre-treatment periods. If forward-looking individuals see the treatment coming, they might change their behavior, which would violate this assumption to the degree that the behavior in the lead-up changes potential outcomes.

Assumption 3: Treatment effect homogeneity. Each group has the same treatment profile.

Assumption 3 does not require that treatment effects be constant over time, unlike the restriction in Goodman-Bacon (2021, forthcoming). Rather, it requires that all cohorts share the same treatment profile, whether that profile is static or dynamic in nature.

Before presenting their alternative to TWFE, we first need to look into the TWFE estimator so that we can understand why and when it fails without these assumptions.

Scene III: Contaminated leads and lags

Let’s consider a simple TWFE regression model based on a common specification:

Sun and Abraham say that you need to exclude some relative periods from this dynamic specification, and I was surprised to learn that, in a panel with no never-treated units, you need to exclude at least two to avoid multicollinearity. They therefore recommend dropping two relative time indicators, such as t-1 and some other distant one.
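A sketch of what such a dynamic specification looks like, using the bins g from the notation above. Here \alpha_i and \lambda_t are unit and calendar-time fixed effects, G is the collection of bins included in the regression, and \mu_g is the coefficient on bin g; the symbol names for the fixed effects and the error term are assumptions made for this sketch.

```latex
% Dynamic TWFE specification: one coefficient per included relative-time bin
Y_{i,t} = \alpha_i + \lambda_t
        + \sum_{g \in G} \mu_g \, \mathbf{1}\{ t - E_i \in g \}
        + \upsilon_{i,t}
```

The excluded relative periods are simply the ones that belong to no bin in G.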

One of the interesting features of dynamic specifications under differential timing is that the dataset becomes imbalanced in relative event time even for an otherwise balanced panel in calendar time. In a 10-year panel where group 1 is treated in year 3 and group 2 is treated in year 7, both groups have 2 leads, but only the second group has more than 2. Likewise, both groups have 3 lags, but only the first group has more than 3. Thus even though both groups are balanced in calendar time, they are imbalanced in relative time.
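The imbalance is easy to check by counting. Below is a minimal sketch for the 10-year example, assuming the years are numbered 1 through 10 and that relative time l = 0 is the treatment year itself; the variable names are made up for illustration.

```python
import numpy as np

years = np.arange(1, 11)                     # calendar years 1..10 (assumed numbering)
treat_year = {"group 1": 3, "group 2": 7}    # treatment dates from the example

# Relative time l = calendar year minus treatment year.
rel_time = {g: years - e for g, e in treat_year.items()}

# Count observed leads (l < 0) and lags (l > 0); l = 0 is the treatment year.
n_leads = {g: int((l < 0).sum()) for g, l in rel_time.items()}
n_lags = {g: int((l > 0).sum()) for g, l in rel_time.items()}

# Relative periods that are observed for both groups (the balanced window).
common = sorted(
    int(v) for v in set(rel_time["group 1"]) & set(rel_time["group 2"])
)

print("leads:", n_leads)
print("lags:", n_lags)
print("common relative periods:", common)
```

Every calendar year is observed for both groups, yet the only relative periods they share run from l = -2 to l = 3.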

A common practice therefore is to either trim the data by dropping all 3+ leads and 3+ lags, or “bin” the data so that all 3+ actual leads are absorbed into a single 3+ lead “bin”, and likewise for the excess lags. Either practice will balance the panel in relative time, but there are certain advantages to trimming under the TWFE specification because, as I will discuss, dropping excess leads and excess lags removes those leads’ and lags’ terms from the population regression coefficient on any one lead or lag. Because the bias of population regression coefficients in the dynamic specification comes from the presence of treatment effects from other relative time periods, trimming will eliminate some, but not all, of the contamination simply by removing those terms from the population regression coefficient altogether.
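To make the difference between the two practices concrete, here is a small sketch for a single cohort observed from l = -6 to l = 3. The cutoff of 3 is just the one used in the paragraph above, and the variable names are hypothetical.

```python
import numpy as np

rel_time = np.arange(-6, 4)   # relative periods -6..3 for one cohort
window = 3                    # cutoff: periods 3 or more away are "excess"

# Trimming: drop observations whose relative period lies outside the window.
trimmed = rel_time[np.abs(rel_time) <= window]

# Binning: keep every observation, but pool distant periods into the end bins,
# so -6, -5 and -4 all share the "3+ leads" indicator with -3.
binned = np.clip(rel_time, -window, window)

print("trimmed:", trimmed.tolist())
print("binned: ", binned.tolist())
```

Trimming shrinks the sample, while binning keeps the sample size and instead changes which observations share an indicator.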

As alluded to in the previous paragraph, their decomposition reveals surprising information about the elements of each estimated coefficient on lead and lag indicators from the dynamic specification. I will discuss these now in the form of propositions taken from the paper.

Proposition 1. The population regression coefficient on relative period bin g is a linear combination of differences in trends 1) from its own relative period, 2) from relative periods belonging to other bins included in the specification and 3) from relative periods excluded from the specification.

Proposition 2: Parallel trends only. Under parallel trends only, the population regression coefficient on the bin indicator is a linear combination of the CATT for that relative period, but also the CATT from other relative periods as well.

Parallel trends is necessary for identification but, unlike what many of us thought, it isn’t sufficient. With only parallel trends, the population regression coefficient becomes a weighted sum of cohort-specific average treatment effects from its own period and from other periods, both those for which you have included lead or lag indicators in the specification and those you excluded. This is what the authors mean by contamination: the presence of cohort-specific ATTs from other periods in the population regression coefficient on any lead or lag indicator.

Proposition 3: Parallel trends and no anticipation of treatment. If you have parallel trends, and groups don’t anticipate treatment, then the population regression coefficient on any given bin g is a linear combination of cohort-specific ATTs from post-treatment relative periods only.

One of the advantages of being able to credibly commit to no anticipation — an assumption that must be evaluated on a project by project basis — is that it removes some of the contamination we have in some of these population regression coefficients. No anticipation essentially means that the cohort ATT is zero for the pre-treatment leads.

But just because you have no anticipation, and just because that implies CATT=0 for those pre-treatment leads, doesn’t mean that the population regression coefficient on the leads will be zero! Not in light of Proposition 1 anyway. A coefficient is a weighted sum of three things, and the own period CATT is but one of those three. I will come back to this but for now I simply leave it in the air like a spectre haunting you.

Proposition 4: Parallel trends and treatment profile homogeneity. If parallel trends holds and groups all have the same “treatment effect profile” (be it dynamic or constant), then the CATT for any group e is simply the ATT for that relative period. That therefore means that the population regression coefficient will equal a linear combination of the ATT from the own relative period and ATTs from other relative periods.

Again, we see that contamination inside the population regression coefficient persists even when we can credibly buy off on parallel trends and treatment profile homogeneity. This is because, as we said in Proposition 1, the coefficient is a weighted sum of three elements, not just two. Only when we assume all three — parallel trends, treatment effect profile homogeneity and no anticipation — does the population regression coefficient on a lead and lag equal the ATT for that particular relative time period.

Why do we care about this apart from simply caring about bias in our estimators more generally? Perhaps the most significant issue is the validity problem this contamination creates for tests of whether pre-trends are different from zero. A few things can be learned from it. First, it’s possible to fail such a test when there are no pre-trends, for no other reason than that the lead coefficients contain treatment effects from other periods. In fact, failure is possible even with parallel trends and no anticipation. Only with all three assumptions will the contaminating terms fall out, due to the properties of the weights. But contamination can also mean you pass such a test when you shouldn’t. The presence of contamination calls into question, in other words, all of our pre-trend tests under differential timing.
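A tiny simulation makes the point about pre-trend tests tangible. This is only a sketch under made-up assumptions: one unit per cohort (never treated, treated in period 1, treated in period 2), four periods, no noise, and invented effect sizes. Parallel trends and no anticipation hold by construction, so every true lead effect is zero. The function names `make_panel` and `twfe_event_study` are hypothetical helpers, not anything from the paper.

```python
import numpy as np

def make_panel(effects):
    """Noise-free panel with periods t = 0..3 and one unit per cohort.

    effects maps a treatment date e to its effects at l = 0, 1, 2, ...
    The never-treated unit has treatment date None.
    """
    rows = []
    for unit, e in enumerate([None, 1, 2]):
        for t in range(4):
            y = 10.0 * unit + 0.5 * t           # unit and time effects (parallel trends)
            l = None if e is None else t - e    # relative time
            if l is not None and l >= 0:
                y += effects[e][l]              # no anticipation: nothing added when l < 0
            rows.append((unit, t, l, y))
    return rows

def twfe_event_study(rows, rel_periods=(-2, 0, 1, 2)):
    """OLS on unit dummies, time dummies and relative-time dummies (l = -1 omitted)."""
    X, y = [], []
    for unit, t, l, outcome in rows:
        unit_d = [float(unit == u) for u in range(3)]
        time_d = [float(t == s) for s in range(1, 4)]    # t = 0 is the omitted period
        rel_d = [float(l == k) for k in rel_periods]
        X.append(unit_d + time_d + rel_d)
        y.append(outcome)
    beta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return dict(zip(rel_periods, beta[-len(rel_periods):]))

# Heterogeneous profiles: the early cohort's effects grow, the late cohort's stay flat.
hetero = twfe_event_study(make_panel({1: [3.0, 5.0, 7.0], 2: [1.0, 1.0]}))
# Homogeneous profiles: both cohorts share the same effect at each relative period.
homog = twfe_event_study(make_panel({1: [2.0, 3.0, 4.0], 2: [2.0, 3.0]}))

print("lead coefficient (l = -2), heterogeneous:", round(hetero[-2], 6))
print("lead coefficient (l = -2), homogeneous:  ", round(homog[-2], 6))
```

In this design the lead coefficient is 1.5 under heterogeneous profiles even though nothing happens before treatment, and it is zero once the profiles are made identical. With a never-treated unit in the sample, omitting only l = -1 is enough to avoid collinearity here.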

Act III: Solution

Sometimes you are so close to something, you can’t even see the thing any longer. You can’t because you’re too busy seeing through it. The same can be the case with Sun and Abraham. So let’s back up and try to remember why this paper is being presented; then perhaps where it’s going will make more sense. I introduced this paper because it’s relevant for our event studies when we are estimating leads and lags under differential timing. We learned that when doing so, TWFE will have potential biases unless strong assumptions hold, and since parallel trends is itself already a strong assumption, this may be a tough pill to swallow for some researchers.

Sun and Abraham therefore do us a solid by presenting a solution to the problems that they outlined. Their solution is a method that does not suffer from the same problems as what I just outlined. As it turns out, this estimator is a special case of the Callaway and Sant’Anna estimator I discussed a few weeks ago, thus bringing us full circle in a way. So let’s begin.

They focus first on a weighted average of the CATTs across cohorts e for the relative time periods l in a bin g. The weights are the shares of cohorts that are observed l periods relative to treatment, normalized by the size of the bin g. This can be written as:

This weighted average, which the paper denotes ν (nu), is built from two elements: weights and the CATT for each cohort and relative period. They propose the interaction-weighted (IW) estimator to calculate it, which is a three-step procedure.
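A sketch of the weighted average being described, assuming a panel with calendar periods 0 through T so that cohort e is observed at relative period l only when its treatment date lies between -l and T - l. The conditioning set and the symbol |g| for the number of relative periods in bin g are part of this reconstruction.

```latex
% Target parameter for bin g: cohort-share-weighted average of CATTs
\nu_g = \frac{1}{|g|} \sum_{l \in g} \sum_{e}
        CATT_{e,l} \, \Pr\{ E_i = e \mid E_i \in [-l, \, T - l] \}
```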

  1. First estimate the following regression using either “never-treated” units as controls (C) or “last cohort treated” units as controls. Note this is one of the key differences with Callaway and Sant’Anna who can use “not yet treated” as controls. This regression will interact relative time dummies with group dummies excluding indicators for the comparison group, C.

  2. Estimate the weights, Pr(E=e), by sample shares of each cohort in that period.

  3. Form the IW estimator by taking the weighted average over all estimates for CATT from step 1 multiplied by the weight estimates from step 2.
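The three steps above can be sketched in a few lines of NumPy. Everything in this example is an assumption made for illustration: one unit per cohort (never treated, treated in period 1, treated in period 2), four periods, no noise, invented effect sizes, the never-treated unit as the control group C, and l = -1 as the omitted relative period. It is a toy version of the procedure, not the authors' implementation.

```python
import numpy as np

T = 4
cohort_of_unit = [None, 1, 2]                    # None = never treated (control group C)
effects = {1: [3.0, 5.0, 7.0], 2: [1.0, 1.0]}    # hypothetical true CATT_{e,l} for l >= 0

# Noise-free outcomes with unit effects, time effects and cohort-specific effects.
rows = []
for unit, e in enumerate(cohort_of_unit):
    for t in range(T):
        l = None if e is None else t - e
        effect = effects[e][l] if l is not None and l >= 0 else 0.0
        rows.append((unit, t, e, l, 10.0 * unit + 0.5 * t + effect))

# Step 1: regress y on unit and time dummies plus cohort-by-relative-time
# interactions, omitting l = -1 and all indicators for the control unit.
cells = sorted({(e, l) for _, _, e, l, _ in rows if e is not None and l != -1})
X = np.array([
    [float(unit == u) for u in range(len(cohort_of_unit))]
    + [float(t == s) for s in range(1, T)]
    + [float((e, l) == c) for c in cells]
    for unit, t, e, l, _ in rows
])
y = np.array([r[-1] for r in rows])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
delta = dict(zip(cells, beta[-len(cells):]))     # estimated CATT_{e,l}

# Step 2: weights are sample shares of each cohort among the treated
# cohorts that are actually observed at relative period l.
n_units = {e: cohort_of_unit.count(e) for e in effects}

def weights(l):
    observed = {e: n for e, n in n_units.items() if (e, l) in delta}
    total = sum(observed.values())
    return {e: n / total for e, n in observed.items()}

# Step 3: IW estimate for each relative period (each bin holds one period here).
iw = {
    l: sum(w * delta[(e, l)] for e, w in weights(l).items())
    for l in sorted({l for _, l in cells})
}

for l, est in iw.items():
    print(f"l = {l:+d}: IW estimate = {est:.3f}")
```

Because the regression is saturated in cohort-by-relative-time cells, each step 1 coefficient recovers that cohort's own effect, and the lead at l = -2 comes out as zero, as it should when nothing is anticipated.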

Where our DD estimator is simply:
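A sketch of both expressions, the IW estimator from step 3 and the DD estimator inside it, for a balanced panel. The notation \bar{Y}_{e,t} for the sample mean outcome of cohort e in calendar period t, \bar{Y}_{C,t} for the same mean among the control units C, and s for the omitted pre-treatment reference period (for example s = e - 1) are assumptions introduced for this sketch.

```latex
% Step 3: plug the step 1 estimates and the step 2 weights into the weighted average
\hat{\nu}_g = \frac{1}{|g|} \sum_{l \in g} \sum_{e}
              \hat{\delta}_{e,l} \, \widehat{\Pr}\{ E_i = e \mid E_i \in [-l, \, T - l] \}

% Each step 1 coefficient is a difference-in-differences in sample means
\hat{\delta}_{e,l} = \left( \bar{Y}_{e,\,e+l} - \bar{Y}_{e,\,s} \right)
                   - \left( \bar{Y}_{C,\,e+l} - \bar{Y}_{C,\,s} \right)
```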

In conclusion, Sun and Abraham note that the rise in difference-in-differences is at least partially caused by the increased availability of rich panel data, and that the use of event studies is now standard. But as we learned with Goodman-Bacon (2021, forthcoming) for the static case, TWFE without some adjustment may yield biased estimates of the coefficients on leads and lags, because of how OLS mechanically handles treatment effects from other relative periods, including those outside the event window. Sun and Abraham propose solutions that should help us going forward as well.

Epilogue: Interview with Liyang (“Sophie”) Sun

I’d like to now leave you with a 30-minute interview I did in late March 2021 with one of the authors, Liyang (“Sophie”) Sun, a newly minted PhD from MIT en route to UC Berkeley for a postdoc before heading over to Madrid for a full-time position. We talk about the history of the paper, her approach to econometrics, and her background. I hope you have found this entry helpful as a start to better understanding the challenges of estimating dynamic treatment effects with TWFE under heterogeneity and differential timing.