
This week, incredibly enough, I am going to be a keynote speaker, along with Ronny Kohavi (he’s day one, I’m day two), at a conference at Microsoft. It seems like someone there had to have made a mistake to invite me, but they did, and I’m assuming they know what kind of economist I am — who I am and who I am not — so I’m going. I am not a “real manager like Pep”, and since I just finished Ted Lasso season 3 in all its wonderful glory, it’s going to be very hard to keep me from quoting it a lot, but I think I’m just going to go and be myself. It’s an all-day keynote — as in several hours, not just an hour, sort of like one of my workshops, only I’m going to call it being a keynote speaker because I swear I remember them saying that when they first called me. So here is what I decided to spend the day on: the ATT and different ways you can estimate it.

One of the things that has really been occurring to me is that if you do not have a specific parameter that you set out to estimate ahead of time, then you won’t know the assumptions you need to make in order to estimate it, and therefore you won’t know the statistical model you need to estimate either. The weird thing is that I have known this for a long time, ever since I really started diving into the diff-in-diff literature, because it really comes up there. But for some reason it didn’t fully dawn on me until I started making simulations of heterogeneous treatment effects under unconfoundedness and ignorable treatment assignment, and realized that the exact same regression model — a fully saturated model in which I interacted the treatment variable with higher order polynomial transformations and interactions — was unbiased for *two* parameters of interest, the ATE and the ATT. Best I can tell, I just don’t fully understand any econometric model until I can sit down and either make a simulation illustrating some of its properties, or just work through basic things in an Excel spreadsheet.

Take, for instance, the following code. I showed this a month or so ago, but I’m just going to post it again because I really was not prepared for it before I actually sat down to do it. I model the Y(0) potential outcomes in line 29. As you can see, Y(0) is a function of a bunch of polynomials and interactions for two continuous variables. But then in line 30, I model Y(1) as equal to Y(0) plus 2500 (the ATE) plus heterogeneity in re-centered versions of those two variables. The advantage of modeling the potential outcomes is that you know the ATE and the ATT. The ATE was 2500 (line 30), and the ATT was 1971.

```stata
* Simulation with heterogeneous treatment effects, unconfoundedness and OLS estimation
clear all

program define het_te, rclass
version 14.2
syntax [, obs(integer 1) mu(real 0) sigma(real 1) ]
clear
drop _all
set obs 5000
gen treat = 0
replace treat = 1 in 2501/5000

* Poor pre-treatment fit
gen age = rnormal(25,2.5) if treat==1
replace age = rnormal(30,3) if treat==0
gen gpa = rnormal(2.3,0.75) if treat==0
replace gpa = rnormal(1.76,0.5) if treat==1
su age
replace age = age - `r(mean)'
su gpa
replace gpa = gpa - `r(mean)'
gen age_sq = age^2
gen gpa_sq = gpa^2
gen interaction = gpa*age

gen y0 = 15000 + 10.25*age + -10.5*age_sq + 1000*gpa + -10.5*gpa_sq + 500*interaction + rnormal(0,5)
gen y1 = y0 + 2500 + 100*age + 1000*gpa
gen delta = y1 - y0

su delta // ATE = 2500
su delta if treat==1 // ATT = 1971
local att = r(mean)
scalar att = `att'
gen att = `att'

gen earnings = treat*y1 + (1-treat)*y0

* Regression 1: constant treatment effects, no quadratics
reg earnings treat age gpa, robust
local treat1 = _b[treat]
scalar treat1 = `treat1'
gen treat1 = `treat1'

* Regression 2: constant treatment effects, quadratics and interaction
reg earnings treat age age_sq gpa gpa_sq c.gpa#c.age, robust
local treat2 = _b[treat]
scalar treat2 = `treat2'
gen treat2 = `treat2'

* Regression 3: Heterogeneous treatment effects, partial saturation
regress earnings i.treat##c.age##c.gpa, robust
local ate1 = _b[1.treat]
scalar ate1 = `ate1'
gen ate1 = `ate1'

* Obtain the coefficients
local treat_coef = _b[1.treat]
local age_treat_coef = _b[1.treat#c.age]
local gpa_treat_coef = _b[1.treat#c.gpa]
local age_gpa_treat_coef = _b[1.treat#c.age#c.gpa]

* Save the coefficients as scalars and generate variables
scalar treat_coef = `treat_coef'
gen treat_coef_var = `treat_coef'
scalar age_treat_coef = `age_treat_coef'
gen age_treat_coef_var = `age_treat_coef'
scalar gpa_treat_coef = `gpa_treat_coef'
gen gpa_treat_coef_var = `gpa_treat_coef'
scalar age_gpa_treat_coef = `age_gpa_treat_coef'
gen age_gpa_treat_coef_var = `age_gpa_treat_coef'

* Calculate the mean of the covariates
egen mean_age = mean(age), by(treat)
egen mean_gpa = mean(gpa), by(treat)

* Calculate the ATT
gen treat3 = treat_coef_var + ///
	age_treat_coef_var * mean_age + ///
	gpa_treat_coef_var * mean_gpa + ///
	age_gpa_treat_coef_var * mean_age * mean_gpa if treat == 1

* Drop coefficient variables
drop treat_coef_var age_treat_coef_var gpa_treat_coef_var age_gpa_treat_coef_var mean_gpa mean_age

* Regression 4: Heterogeneous treatment effects, full saturation
regress earnings i.treat##c.age##c.age_sq##c.gpa##c.gpa_sq, robust
local ate2 = _b[1.treat]
scalar ate2 = `ate2'
gen ate2 = `ate2'

* Obtain the coefficients
local treat_coef = _b[1.treat]
local age_treat_coef = _b[1.treat#c.age]
local age_sq_treat_coef = _b[1.treat#c.age_sq]
local gpa_treat_coef = _b[1.treat#c.gpa]
local gpa_sq_treat_coef = _b[1.treat#c.gpa_sq]
local age_age_sq_coef = _b[1.treat#c.age#c.age_sq]
local age_gpa_coef = _b[1.treat#c.age#c.gpa]
local age_gpa_sq_coef = _b[1.treat#c.age#c.gpa_sq]
local age_sq_gpa_coef = _b[1.treat#c.age_sq#c.gpa]
local age_sq_gpa_sq_coef = _b[1.treat#c.age_sq#c.gpa_sq]
local gpa_gpa_sq_coef = _b[1.treat#c.gpa#c.gpa_sq]

* Save the coefficients as scalars and generate variables
scalar treat_coef = `treat_coef'
gen treat_coef_var = `treat_coef'
scalar age_treat_coef = `age_treat_coef'
gen age_treat_coef_var = `age_treat_coef'
scalar age_sq_treat_coef = `age_sq_treat_coef'
gen age_sq_treat_coef_var = `age_sq_treat_coef'
scalar gpa_treat_coef = `gpa_treat_coef'
gen gpa_treat_coef_var = `gpa_treat_coef'
scalar gpa_sq_treat_coef = `gpa_sq_treat_coef'
gen gpa_sq_treat_coef_var = `gpa_sq_treat_coef'
scalar age_age_sq_coef = `age_age_sq_coef'
gen age_age_sq_coef_var = `age_age_sq_coef'
scalar age_gpa_coef = `age_gpa_coef'
gen age_gpa_coef_var = `age_gpa_coef'
scalar age_gpa_sq_coef = `age_gpa_sq_coef'
gen age_gpa_sq_coef_var = `age_gpa_sq_coef'
scalar age_sq_gpa_coef = `age_sq_gpa_coef'
gen age_sq_gpa_coef_var = `age_sq_gpa_coef'
scalar age_sq_gpa_sq_coef = `age_sq_gpa_sq_coef'
gen age_sq_gpa_sq_coef_var = `age_sq_gpa_sq_coef'
scalar gpa_gpa_sq_coef = `gpa_gpa_sq_coef'
gen gpa_gpa_sq_coef_var = `gpa_gpa_sq_coef'

* Calculate the mean of the covariates
egen mean_age = mean(age), by(treat)
egen mean_age_sq = mean(age_sq), by(treat)
egen mean_gpa = mean(gpa), by(treat)
egen mean_gpa_sq = mean(gpa_sq), by(treat)

* Calculate the ATT (each interaction coefficient times the treated-group means it modifies)
gen treat4 = treat_coef_var + ///
	age_treat_coef_var * mean_age + ///
	age_sq_treat_coef_var * mean_age_sq + ///
	gpa_treat_coef_var * mean_gpa + ///
	gpa_sq_treat_coef_var * mean_gpa_sq + ///
	age_age_sq_coef_var * mean_age * mean_age_sq + ///
	age_gpa_coef_var * mean_age * mean_gpa + ///
	age_gpa_sq_coef_var * mean_age * mean_gpa_sq + ///
	age_sq_gpa_coef_var * mean_age_sq * mean_gpa + ///
	age_sq_gpa_sq_coef_var * mean_age_sq * mean_gpa_sq + ///
	gpa_gpa_sq_coef_var * mean_gpa * mean_gpa_sq if treat == 1

* Drop coefficient variables
drop treat_coef_var age_treat_coef_var age_sq_treat_coef_var gpa_treat_coef_var gpa_sq_treat_coef_var ///
	age_age_sq_coef_var age_gpa_coef_var age_gpa_sq_coef_var age_sq_gpa_coef_var age_sq_gpa_sq_coef_var gpa_gpa_sq_coef_var

gen agegpa = age*gpa

* Matching model 1
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(maha)
mat b = e(b)
local match1 = b[1,1]
scalar match1 = `match1'
gen match1 = `match1'

* Matching model 2
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(maha) biasadj(age gpa)
mat b = e(b)
local match2 = b[1,1]
scalar match2 = `match2'
gen match2 = `match2'

* Matching model 3
teffects nnmatch (earnings age gpa age_sq gpa_sq agegpa) (treat), atet nn(1) metric(maha)
mat b = e(b)
local match3 = b[1,1]
scalar match3 = `match3'
gen match3 = `match3'

* Matching model 4
teffects nnmatch (earnings age gpa age_sq gpa_sq agegpa) (treat), atet nn(1) metric(maha) biasadj(age age_sq gpa gpa_sq agegpa)
mat b = e(b)
local match4 = b[1,1]
scalar match4 = `match4'
gen match4 = `match4'

collapse (max) att treat1 treat2 ate1 ate2 treat3 treat4 match1 match2 match3 match4
end

simulate att treat1 treat2 ate1 ate2 treat3 treat4 match1 match2 match3 match4, reps(1000): het_te
```

Well, given the heterogeneity in the treatment effects, you cannot recover any aggregate causal parameter unless you fully saturate, and I illustrated that in lines 42, 48, 55 and 94. And I show the outcome of those regressions from 1,000 simulations above. The estimate of the ATE in those regressions is just the coefficient on the treatment dummy itself and is equal to 2500, as I said, but only the fully saturated regression model recovers it. See the above figure, with the only correct specification being the bottom right.

But the thing I did not know was that the same model — the one I use in line 94, the fully saturated one in higher order terms and interactions — contained not just the ATE, but the ATT itself. Recall that the ATE was 2500 and the ATT was 1971, and yet I only estimated *one regression* to get both of them. See below, where I show that the first two regressions assumed constant treatment effects (something Imbens and Rubin say about exogeneity in their 2015 book, in fact — that exogeneity in a particular regression model they present assumes unconfoundedness, linearity and constant treatment effects), and the last two allowed for heterogeneity, but unless I estimated the same specification as that bottom right one above, I couldn’t get an unbiased estimate. Ignore that it says 1980 though — that’s from a run where I kept messing up the seeding, so I just left it. But that was the correct specification of the unbiased estimator — the fully saturated model. And how did I recover the ATT? Good lord, just look at lines 93 to 163. All of that to get the ATT. But look below at what happens when you *do* actually do that — take each treatment coefficient, multiply it by the mean values (among the treated) of the variables it modifies, add them all up, and you get the ATT.
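To see the same point outside of Stata, here is a minimal Python sketch (my own translation of the post’s simulation design, not the original code): one fully saturated regression, with the ATE read off at the overall covariate means and the ATT read off at the treated group’s covariate means.

```python
# Sketch: one fully saturated OLS recovers both the ATE and the ATT.
# My own Python translation of the post's simulation design.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
treat = (np.arange(n) >= n // 2).astype(float)

# covariates differ by treatment group, then get re-centered (as in the post)
age = np.where(treat == 1, rng.normal(25, 2.5, n), rng.normal(30, 3, n))
gpa = np.where(treat == 1, rng.normal(1.76, 0.5, n), rng.normal(2.3, 0.75, n))
age = age - age.mean()
gpa = gpa - gpa.mean()

y0 = (15000 + 10.25 * age - 10.5 * age**2 + 1000 * gpa
      - 10.5 * gpa**2 + 500 * age * gpa + rng.normal(0, 5, n))
y1 = y0 + 2500 + 100 * age + 1000 * gpa      # heterogeneous effects, ATE = 2500
y = np.where(treat == 1, y1, y0)
att_true = (y1 - y0)[treat == 1].mean()      # below 2500: treated have lower age, gpa

# fully saturated OLS: treatment interacted with every covariate term
X0 = np.column_stack([age, age**2, gpa, gpa**2, age * gpa])
X = np.column_stack([np.ones(n), treat, X0, treat[:, None] * X0])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# ATE: treat coefficient plus interactions at the overall covariate means
ate_hat = b[1] + X0.mean(axis=0) @ b[7:]
# ATT: the same combination, but at the treated group's covariate means
att_hat = b[1] + X0[treat == 1].mean(axis=0) @ b[7:]
```

With these numbers, `ate_hat` comes back near 2500 and `att_hat` near the truth computed directly from the potential outcomes; dropping the interactions (a constant-effects specification) recovers neither.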

For some reason, until I did that, it’s like I hadn’t really connected those dots. Something about how a single regression model contained both the ATE and the ATT, but that they had to be worked out differently, got my head spinning, because that was when I started thinking to myself this:

If I want the ATE, then what assumptions must I make? You have to assume that *both* potential outcomes satisfy unconfoundedness / ignorable treatment assignment: (Y(0), Y(1)) ⊥ D | X. But if you want only the ATT, then you only need to assume that *one of them* does: Y(0) ⊥ D | X.
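A toy illustration of why the weaker assumption can suffice (a made-up Roy-style DGP of my own, not from the book): units select into treatment based on their gains, so Y(1) is not independent of treatment status, but Y(0) is — and a simple treated-minus-control comparison then recovers the ATT, not the ATE.

```python
# Sketch: selection on gains. Y(0) is independent of treatment (the
# ATT-style assumption holds) but Y(1) is not (full unconfoundedness fails).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y0 = rng.normal(0, 1, n)            # Y(0): unrelated to treatment choice
gain = rng.normal(1, 1, n)          # individual treatment effects
d = (gain > 0).astype(float)        # units with positive gains opt in
y = y0 + d * gain

ate_true = gain.mean()              # about 1.0
att_true = gain[d == 1].mean()      # about 1.29 (a truncated normal mean)

# simple difference in means
diff_means = y[d == 1].mean() - y[d == 0].mean()
# diff_means matches att_true, not ate_true
```

The comparison is unbiased for the ATT here precisely because the treated and untreated have the same Y(0) distribution; nothing pins down the ATE.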

I knew that already; it’s in the book. It’s not that I didn’t know it. It’s that in the book, when I taught the material on subclassification and matching, I didn’t cover regression. I covered regression in the chapter on probability earlier. So I never actually estimated any causal parameters with regression in that chapter — a major shortcoming. But then I decided I wanted to dial that regression chapter way back, which meant moving OLS into the matching and subclassification chapter, which meant completely restructuring it, which meant really emphasizing the correct and incorrect regression specifications, which meant writing more code, which meant simulations, and then it just started to hit me — you can’t use a regression model to estimate *any* aggregate causal parameter unless you specify ahead of time *which* one you want. Why? Because not only would you not necessarily realize that the exogeneity required by a regression model in which the covariates enter linearly requires that the treatment effects be constant; you also just wouldn’t know how to combine the coefficients to get the causal parameter of interest, even if you did know to saturate. And given I’d never seen my colleagues run saturated regressions — I’d only seen people interact with covariates when they were interested in *that specific marginal effect* with respect to those covariates, not to recover the full parameter itself — it just kept me circling and circling, over and over, going back to the same place again, which is this:

You have to turn your research question into an aggregate causal parameter of interest the moment you move away from constant treatment effects, because otherwise you really don’t know the assumptions or the model itself.

I guess this had been brewing for a long time; it’s been in my slides for a while. But you know how sometimes you sort of know something, but then don’t know it at the same time, until for some reason its roots just go deeper than ever? This trip in Europe, through Madrid, Scotland and England, something about all of the interactions and conversations around diff-in-diff just kept returning me to that same place again and again, which was that so many questions get resolved once you actually say ahead of time what your goal is.

But then I would recall my interview with Guido Imbens from last year, when I asked him (kind of surprised, to be honest, that I even asked him this) whether what I’d heard — that people originally weren’t super enthusiastic about the LATE theorem — was the case and why. You can watch his response here — boy do I love showing this interview. I could just listen to Guido talk all day long. He’s just such an incredible combination of insightful and really easy to talk to. For a guy like me, a normal guy, it’s just a delight every time. And he basically said things that, weirdly enough, made me feel a little like I was on the side of his critics.

He said that people at the time, without naming names, felt like the LATE theorem was “a little like cheating”, because the way it was done back then (and now) is you stated the parameter you wanted, and then you built machines to get it. And Heckman, in a 1990 AEA P&P, had shown that unless the instrument moves the probability of treatment from 0 to 1, you can’t get the ATE. And look at how Heckman’s paper progressed below — starting with the parameter of interest.

Well, Imbens and Angrist, in their Econometrica paper, didn’t start with the parameter of interest. Heckman showed you can’t get the ATE with IV except under unrealistic assumptions, so what did they do? They showed what you *could* get, which was the LATE — the ATE for the complier subpopulation. Which is probably not a policy relevant parameter unless you either move away from heterogeneity (or much of it) or the compliers happen to be representative of the entire population itself — and it would seem awfully coincidental that the *one instrument* you chose just happened to find those representative compliers!
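A quick sketch of what IV hands you in that world (compliance types and effect sizes made up by me for illustration): the Wald estimator lands on the compliers’ average effect, which is not the population ATE when compliers are different.

```python
# Sketch: with heterogeneous effects, IV identifies the LATE (the
# compliers' average effect), not the ATE. Made-up DGP.
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
z = rng.integers(0, 2, n).astype(float)            # the instrument
typ = rng.choice(["always", "never", "complier"], n, p=[0.2, 0.3, 0.5])
d = np.where(typ == "always", 1.0, np.where(typ == "never", 0.0, z))
effect = np.where(typ == "complier", 2.0, 0.5)     # compliers gain more
y = rng.normal(0, 1, n) + d * effect

ate_true = effect.mean()                           # about 1.25
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
# wald is about 2.0: the compliers' effect, well away from the ATE
```

The instrument only moves the compliers, so the Wald ratio averages over them alone; a different instrument would find different compliers and a different LATE.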

And yet people became content with identifying the LATE. And I guess that got me thinking more and more about diff-in-diff these last couple weeks. The more I have thought about it, the more I’m convinced that the ATT is almost certainly the parameter we always wanted in the first place — short of being Pfizer wanting to vaccinate the entire planet. And if the ATT is what you want, that is great because (1) you don’t need the full unconfoundedness assumption, which is maybe a little more behaviorally relevant, but also (2) it brings up the issues with twoway fixed effects regression models.

Wooldridge shows with his Mundlak estimator that a fully saturated regression model has the ability to identify the ATT in diff-in-diff using twoway fixed effects — not so terribly dissimilar from my example above for unconfoundedness (not parallel trends). But then I just started looking more and more at the diff-in-diff equation, as I call it, and at how many assumptions were staring right at me from it. Watch the sequence of steps to get to the parallel trends expression.

Step 1: Orley said that diff-in-diff is just “four averages and three subtractions”. It’s the after minus before of the treatment group compared to the after minus before of the control group. Write that down now:

DiD = (E[Y_k | Post] - E[Y_k | Pre]) - (E[Y_U | Post] - E[Y_U | Pre])

where *k* is a treatment group and *U* is an untreated comparison group, *Post* is after an intervention and *Pre* is the period just before (or “baseline” or *t-1*, depending on your favorite poison).
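With made-up numbers, the four averages and three subtractions are literally just:

```python
# Sketch: "four averages and three subtractions" with toy group means.
group_means = {
    ("k", "pre"): 10.0, ("k", "post"): 18.0,   # treatment group
    ("U", "pre"): 9.0,  ("U", "post"): 12.0,   # untreated comparison group
}

did = (group_means[("k", "post")] - group_means[("k", "pre")]) \
    - (group_means[("U", "post")] - group_means[("U", "pre")])
# (18 - 10) - (12 - 9) = 5
```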

Step 2: Now substitute potential outcomes for the average values of the outcome — but which ones? Well, this is how I augmented my lectures, so I’ll go in steps. Do you know why you need to assume “no anticipation”? No anticipation is really not a good name (sorry, econometricians) because it implies agents who aren’t forward thinking. But there are plenty of instances where the assumption I’m about to show you can hold and yet agents are forward looking. All that “no anticipation” (NA) means is that the baseline is a period where the outcome is not treated. And how could the outcome get treated at baseline? If a future event changes the behavior of a forward looking agent in such a way that the treatment literally carries backwards in time, then that would do it. So now let’s replace Y=Y(0) for the *k* group’s pre period. Furthermore, since the *k* group is treated in the post period, replace Y=Y(1) in the post period for them as well:

Step 3: But what about that horribly awkwardly named assumption, SUTVA? SUTVA, the stable unit treatment value assumption, is critical in the diff-in-diff equation for two reasons. First, it requires no spillovers, which means that in the post period, group *U* is untreated — i.e., Y=Y(0) for group *U* in the post period. And since there are no spillovers, then under NA the pre period for group *U* is trivially untreated too. But no hidden variation in treatment also means that the underlying units within group *k* all get the same treatment, such that the aggregation over them is sensible.

Step 4: Add a zero. The great thing about zeroes is that when you add zero to either side of an identity, the identity remains the same. So add a zero:

Well, when you take those red terms there — both of which are counterfactual, note — then going from step 4 to step 5 is just using the commutative property of addition to rearrange their order like this:

Well, look at what you have: the top line is the ATT, and the second line is parallel trends (PT).
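Written out in potential outcomes notation (my own rendering of the standard algebra; $Y^1$ and $Y^0$ denote the treated and untreated potential outcomes):

```latex
\underbrace{\bigl(E[Y_k \mid Post] - E[Y_k \mid Pre]\bigr) - \bigl(E[Y_U \mid Post] - E[Y_U \mid Pre]\bigr)}_{\text{DiD}}
= \underbrace{E[Y^1_k \mid Post] - E[Y^0_k \mid Post]}_{\text{ATT}}
+ \underbrace{\bigl(E[Y^0_k \mid Post] - E[Y^0_k \mid Pre]\bigr) - \bigl(E[Y^0_U \mid Post] - E[Y^0_U \mid Pre]\bigr)}_{\text{non-parallel trends bias}}
```

Under PT the second term on the right is zero, and the DiD equals the ATT.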

And so it’s all really simple. If you want the ATT, you can get it with DiD. You need longitudinal data; no anticipation, so that baseline values of Y are untreated; SUTVA, so that there are no spillovers onto your comparison group and when your *k* group is treated, your *U* group is not; and parallel trends. And wildly enough, however far back you need to go to satisfy no anticipation, so long as you have parallel trends from that starting point, an event study would let you estimate the ATT both for the period after some announcement but before the enforcement, and after the enforcement itself, because the ATT is always with respect to a given moment.

Well, I was just surprised how many of the questions asked always seemed to go back to that simple adage — what parameter do you want? If you want the ATT, you can get it with DiD, but you need to want that one, and if you do, then there is a correct and an incorrect specification to get it. Heck, there’s an incorrect twoway fixed effects specification (the modal one we all used before the DiD crisis began in earnest throughout economics around 2018, even though the first papers appear to have posted in 2016) and there’s a correct one (the Mundlak estimator by Jeff Wooldridge).
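To see how badly the modal specification can go wrong, here is a small sketch of my own (a made-up two-cohort staggered panel with effects that grow over time, no noise): the static TWFE coefficient comes out negative even though every individual treatment effect is positive.

```python
# Sketch: static TWFE with staggered adoption and dynamic effects.
# Two cohorts adopt at different times; effects grow with event time.
import numpy as np

T = 10
units = [("early", 3)] * 20 + [("late", 7)] * 20   # (cohort, adoption period)

rows = []
for _, g in units:
    for t in range(1, T + 1):
        d = 1.0 if t >= g else 0.0
        rows.append((d, (1 + t - g) if d else 0.0))  # effect grows by 1 each period
D = np.array([r[0] for r in rows]).reshape(len(units), T)
Y = np.array([r[1] for r in rows]).reshape(len(units), T)

att_true = Y[D == 1].mean()       # every treated cell's effect is positive

def demean(M):
    # two-way within transformation (exact for a balanced panel)
    return M - M.mean(1, keepdims=True) - M.mean(0, keepdims=True) + M.mean()

Dt, Yt = demean(D), demean(Y)
twfe_hat = (Dt * Yt).sum() / (Dt**2).sum()   # negative!
```

Here `twfe_hat` is about -0.17 while the average effect among treated cells is about 3.83 — the “variance weighted ATT” problem in miniature, with the already-treated early cohort serving as a bad control for the late one.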

And so then I just kept going back to the basics and seeing how simple it really always was. DiD is an incredibly simple method, but a challenging one too. If you can assume those three things (SUTVA, NA and PT), then you don’t even need more than two periods. The event study isn’t for identification — it’s just there to provide smoking-gun evidence that PT is credible.

But if you can’t buy into parallel trends, then what? Well, then you could always assume a factor model in which Y(0) is generated according to observed and unobserved dynamic factors, along the lines of Y_it(0) = δ_t + θ_t Z_i + λ_t μ_i + ε_it, with observed covariates Z_i and unobserved factor loadings μ_i.

This factor model is not estimated; rather, it is assumed, and it is **not** the same as parallel trends. In fact, if parallel trends holds for several years beforehand — as is assumed to be the case when estimating an event study anyway (and under PT and NA it would be true that the ATT equals zero in all pre-treatment periods) — then more than likely the factor model technically holds too. Except that Abadie regularly advises a much longer series, for estimating the weighted average of donor pool units most likely to have actually satisfied that model. In other words, a factor model would likely absorb instances with parallel trends and instances without it; but unlike DiD, you need many periods for identification, not just plausible evidence for parallel trends.
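A toy version of what that long pre-period buys you (my own made-up series and a crude grid search, not Abadie’s actual estimator): choose non-negative donor weights summing to one that reproduce the treated unit’s pre-treatment path.

```python
# Sketch: synthetic control as convex weights fit on a long pre-period.
# Toy deterministic series; two donors, grid search over the weight.
import numpy as np

pre = np.arange(10, dtype=float)            # ten pre-treatment periods
treated_pre = 2.0 + 0.5 * pre               # treated unit's pre-period path
donor_a = 1.0 + 1.0 * pre
donor_b = 3.0 + 0.0 * pre                   # 0.5*a + 0.5*b matches exactly

best_w, best_loss = None, np.inf
for w in np.linspace(0, 1, 101):            # weight on donor_a; 1-w on donor_b
    synth = w * donor_a + (1 - w) * donor_b
    loss = float(((treated_pre - synth) ** 2).sum())
    if loss < best_loss:
        best_w, best_loss = w, loss
# best_w = 0.5 with a perfect pre-period fit
```

With many pre-periods, a convex combination that tracks the treated unit this well is evidence the factor loadings are matched; the real estimator solves a constrained least squares problem rather than a grid search.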

**Conjecture of what is hard for applied people (particularly me)**

I have been thinking for a while that the challenge of this new diff-in-diff literature, for people my age who did their PhDs in economics before maybe 2015 (mine is 2007), was that they were taught the constant treatment effects specification of twoway fixed effects and just had a lot of human capital built up in it — in all kinds of odd ways, even down to the kinds of robustness checks and tables they made using it. And switching out human capital is always costly. But now I’m not entirely sure that that is really it.

Now I think that at least one non-trivial challenge of this new diff-in-diff literature is what Heckman said above in that 1990 AEA P&P:

state your research question as a specific aggregate causal parameter.

Many of us had no trouble whatsoever stating our research question as an aggregate causal parameter, but we didn’t have a *specific* one in mind. If we had, maybe we would’ve been immediately dissatisfied to learn that our TWFE model required no dynamics as well as parallel trends, and even then could only ever obtain an estimate of the “variance weighted ATT”. If we didn’t want that specific parameter, then why would we ever want to use that specification to get it?

I think it’s hard, to be honest, to turn a research question into a *specific* parameter of interest. Who has really sat down and said “I want the ATT but not the ATE” or “I want the LATE but not the ATT”? Sure, some have, and far more do now than ever before, but a lot of us didn’t until we really had to confront the diff-in-diff renaissance, which has honestly been like drinking from the proverbial firehose. But go back through your vita and look at your old diff-in-diff papers before 2016. How often did you literally say “the ATT”? How often did you say “the causal effect”? But there is no such thing as “the causal effect” — under heterogeneity, there are as many treatment effects as there are people in your dataset. We don’t estimate the individual ones anyway, at least not historically — rather, we estimate the *average* ones. But which average? Even for the ATT, there are many averages. There’s the cohort ATT in the event study, there’s the overall ATT, there’s the group averaged ATT, and you could even imagine subgroup ATTs (e.g., the ATT for white men). There are just so many different ways to summarize the treatment effects of the individuals in the treatment group — if you don’t say up front which one you want, you won’t know the assumptions you have to make, and you definitely won’t know the model that has the best shot at getting it.
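Even “the ATT” is a family. A trivial sketch with made-up cohort sizes and effects: the equally weighted cohort average and the size-weighted overall average are both defensible “ATTs”, and they differ.

```python
# Sketch: two aggregations of cohort-level ATTs give different "the ATT"s.
import numpy as np

# cohort -> (number of treated units, that cohort's average effect); made up
cohorts = {"2010 adopters": (100, 5.0), "2012 adopters": (300, 1.0)}

sizes = np.array([s for s, _ in cohorts.values()], dtype=float)
effects = np.array([e for _, e in cohorts.values()])

cohort_avg_att = effects.mean()              # equal weight per cohort: 3.0
overall_att = sizes @ effects / sizes.sum()  # weight by cohort size:  2.0
```

Neither number is wrong; they answer different questions, which is exactly why the aggregation has to be stated up front.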

So, the Microsoft talk will be sort of about that. It’ll be a little bit of a walk through these topics, but with an effort towards being practical and concrete, not just inspirational (but that too — I am of course always trying to channel my inner Ted Lasso):

- Potential outcomes and the ATT
- Simple comparisons, selection bias and randomization
- Unconfoundedness, saturated regression, and matching with bias adjustment
- Difference-in-differences, covariate adjustment, and staggered adoption
- Synthetic control with non-negative weighting, and an allusion to the problems of negative weights and ways to keep them to a minimum

Applications, exercises, spreadsheets (I’ll be sure to use Excel, not Google Sheets — in praise of Excel, one of the greatest pieces of data software ever created, in my opinion), lecture, code, and so on. And that’s it. It’s helping make my revision of the book a lot more focused, a lot clearer, at least on that chapter. I’m hoping that when I get back from Microsoft, in between my online workshops coming out this summer, I can just squarely focus and get a lot done on the revision.

So that’s it! Wish me luck!