#### Discover more from Scott's Substack

# ATT Estimation using Regression and Matching with Heterogeneous Treatment Effects

### A Deep Dive into Saturated Regressions and Nonparametric Matching Approaches (with simulations!)

**Act III: Regression specifications for the ATT**

In this Substack, I will address several critical concepts that I believe warrant our attention, despite the fact that my thoughts may not be entirely original. After multiple attempts to refine my writing and arrange my ideas, I have decided to dive in and share my perspective on these matters, as I believe they hold significant importance. The following points will be the focus of our discussion:

The necessity for a "parameters first" principled approach in specifying regressions, as some of us may not be doing so correctly.

The implication of heterogeneous treatment effects on regression specifications, specifically the presence of both average treatment effects (ATE) and average treatment effects on the treated (ATT) within the same specification.

The potential pitfalls of recovering the ATT from a regression specification, and why matching might be a more reliable alternative.

Now, let us delve into each of these points and explore their relevance. To maintain continuity, I will continue to employ the horse race metaphor I have used in previous essays. So, saddle up and join me as we delve deeper into these captivating subjects.

**Recommended Steps for Causal Projects**

When conducting causal projects, it is crucial to follow a structured approach to ensure accurate and meaningful results. Here are the recommended steps:

**Define the Desired Parameter (ATT):**Start by clearly defining the parameter you are interested in, such as the Average Treatment Effect on the Treated (ATT).**Establish Necessary Beliefs (Identification):**Determine the beliefs or assumptions required for the identification of the causal effect, which is essential for drawing valid conclusions.**Develop Estimators (Estimator):**Create estimators that generate the correct values, ensuring the reliability and accuracy of your analysis.

**Common Pitfalls**

It is not uncommon for researchers to skip steps 1 and 2 and jump directly to step 3, running regressions before establishing the necessary beliefs and defining the desired parameters. This approach can lead to the assumption of exogeneity (step 2) and the hope that the estimates are weighted averages of individual treatment effects (step 1). However, this is not guaranteed and may result in misleading conclusions. To avoid this, it is vital to follow the recommended steps outlined above.

This is one of the more subtle lessons I think we learned from the large slate of recent difference-in-differences (diff-in-diff) articles by econometricians. It wasn’t merely that the literature on differential timing introduced us to the biases of a particular TWFE specification; it wasn’t that it presented several new estimators; it wasn’t that we got new code. In my opinion, the diff-in-diff literature pushed us to articulate which parameter we wanted, think carefully of the identification, and then build the estimators. In other words, for the attentive reader, the new diff-in-diff papers urged us to go through steps 1 to 3 as opposed to skipping to the 3rd step where we ran regressions and crossed our fingers.

You can see this, I think, in how people talked before and after this wave of diff-in-diff research. I can’t prove this, but in my flawed memory, there were far more people in the old days who when they ran diff-in-diff not merely equated it with TWFE static specifications — they also would say they were estimating **the causal effect** of some policy. But there is not **the** causal effect. In a population of heterogenous treatment effects, there are many causal effects. There are individual treatment effects, and there are a number of ways to summarize them as aggregate causal parameters. For diff-in-diff and synthetic control, that is usually the average treatment effect on the treatment group, ATT, though sometimes it will be a variance weighted version of that. This shift was partly due to the accessibility of new tools and a greater understanding of the potential outcomes modeling framework, but it was also unavoidable given many of the new estimators were drilling down deeper and deeper on smaller building block causal parameters like the group-time ATT(g,t) and then aggregating using transparent weights to various larger “policy relevant” ATT parameters. We were hearing about the ATT far more than we were hearing about “the causal effect”.

In an interview I did with Guido Imbens a year ago, I asked him about the reaction people had to his *Econometrica* with Josh Angrist in which they proved the LATE theorem, and he said that some viewed it skeptically, and some viewed it positively. The ones who viewed it skeptically didn’t dispute the math or that the LATE was interesting. It was more that they criticized Imbens and Angrist of going out of order. Listen as he explains:

In other words, critics thought Imbens and Angrist were “cheating” because they recognized the negative results of Heckman and Manski that IV without full compliance could not identify the ATE. So, rather than try to find an IV estimator that would, they simply skipped to step (3), put IV into the Rubin framework with potential outcomes, and explained what IV was identifying if not the ATE. They learned, in a breakthrough paper, that IV was estimating the ATE for a subpopulation called the compliers, or what we now call the LATE.

Goodman-Bacon’s 2022 paper in JoE has a similar flavor to that. He doesn’t start with (1) and go step by step to (3). Rather, he simply goes to the canonical TWFE model, or what is sometimes called the static mode with fixed effects, which is step (3) (“running regressions”). And he then uses the Frisch-Waugh-Lovell theorem to decompose the TWFE parameter estimate. Mapping that back to potential outcomes, he shows that the potential bias is buried inside that estimator. But he does not turn around, unlike Sun and Abraham (2021), and build an estimator that didn’t suffer from those problems. Rather, his paper was more like the doctor who tells you that you’re sick, but then simply refers you to an expert.

But what have those experts done? They’ve said “be explicit. Say precisely what your research question is. Express that research question as a specific causal parameter. Think about what beliefs you need to estimate it (“identification”). Then build a machine that estimates it with data.” In my opinion, the recent surge in diff-in-diff research has pushed economists to be more intentional and explicit about the causal parameters they are estimating, with a greater focus on ATT. This development has encouraged researchers to think more critically about the assumptions and methods used in their analyses, ultimately leading to a more robust understanding of causal effects.

**Regression specifications**

In this section, we will take the advice of specifying our target parameters and identification into account. We will assume that we have unconfoundedness and linearity, but covariate imbalance and heterogeneous treatment effects. Our goal is to estimate the average treatment effect (ATE) and the average treatment effect on the treated (ATT) using regression specifications and nonparametric matching. Even when linearity assumptions hold, the author prefers to use matching due to the likelihood of making coding mistakes.

To explore this, let's consider a data generating process with the following characteristics:

Age is generated from a normal distribution with different means and standard deviations for the treatment and control groups:

Treatment group: mean 25, standard deviation 2.5

Control group: mean 30, standard deviation 3

GPA is generated from a normal distribution with different means and standard deviations for the treatment and control groups:

Treatment group: mean 1.76, standard deviation 0.5

Control group: mean 2.3, standard deviation 0.75

Age and GPA are centered around their respective means, and squared terms and interaction terms are generated:

Age squared (age_sq)

GPA squared (gpa_sq)

Interaction between age and GPA (interaction)

Next, we generate outcome variables for both no treatment (y0) and treatment (y1) cases, as well as the treatment effect (delta). The equations for these are:

No treatment (y0):

Treatment (y1):

Notice that by re-centering the age and gpa variable, we allow for treatment effects to move above and below the average depending on that person’s re-centered age and re-centered gpa in the Y(1) formula. This is heterogenous treatment effects, because in the next line, we see that Y(1) differs according to a person’s age and gpa plus a constant of 2500.

Based on this simulation, which I’ll show below, tthe average treatment effect (ATE) at 2500 and the average treatment effect on the treated (ATT) at 1980. The reason that the ATE is 2500 is because the sum of deviations from the mean are zero, and so averaging over all units causes the re-centered age and re-centered GPA terms to equal zero leaving only 2500 as the ATE. But the ATT must be calculated and is based on treatment assignment.

**Covariate Imbalance**

One of the advantages of the propensity score theorem is simply dimension reduction. We have two variables. Rather than show the distribution of age and GPA across treatment and control, which I already know are imbalanced confounders, I’ll collapse them into a single scalar bounded between 0 and 1 and plot that. You can see here that those with higher propensity scores have more mass in the right tail and vice versa for those with low scores to the left. A typical feature of non-experimental data — the covariates are naturally imbalanced. And given estimates of the causal parameters under unconfoundedness require covariate balance, this will introduce bias for our matching method given it cannot leverage the modeled extrapolation of regressions.

But one advantage of regression is that it extrapolates beyond the support of the data, and if you know that functional form, then it’ll do a great job at it, hitting the bullseye even, so long as you know the underlying functional form itself. With matching, we have a work around from a 2011 *Journal of Business and Statistics* paper by Abadie and Imbens in which they use regression adjustment to estimate the selection bias and then adjust the treatment effects, row by row, for each treated unit when estimating the ATT. I’ll walk us through that below.

**Regression specifications for the ATT**

In this part, we will look at two naive regression specifications for estimating the ATE using the data generating process specified earlier. The first regression assumes constant treatment effects with no quadratic terms, while the second regression includes quadratic terms and interaction. The equations for these regressions in LaTeX format are as follows:

Regression 1: constant treatment effects, no quadratics

Regression 2: constant treatment effects, quadratics and interaction

The goal is to compare the coefficient on treat (i.e., the treatment effect) with the ATE value of 2500. And it’s interesting to note that the second equation does resemble the data generating process of the actual data itself. However, I will show that neither of these naive regression specifications will recover the ATT. I will explore other methods to estimate the ATT in the following sections. Below are the results of those simulations. Note that the dashed blue line is the ATE of 2500, but the solid red line is the mean estimate over all 1000 trials. Both are biased, even though the only confounders we have are age and GPA and linearity is satisfied. The constant treatment effects assumption buried in strict exogeneity is creating problems for us.

Now, let's move on to saturated regressions where the treatment variable is interacted with all covariates and their interactions. We will start with a simple saturation based just on age, GPA, and an interaction, but no higher-order polynomial. Then, we will interact the treatment variable with age and GPA up to a quadratic term and then an interaction of all four of those variables.

**Saturated Regressions**

Simple saturation: treatment interaction with age, GPA, and their interaction

Treatment interaction with age and GPA up to a quadratic and their interactions

That’s right. That’s the specification you need **just to get the ATE**. The coefficient on beta 1 is the estimated ATE. Let’s now look at this output.

Interestingly, the bias on the partial saturated model on the left is much smaller than what we found before, but the saturation on a fully saturated model, many interactions we don’t even need, yields the exact ATE value.

**Estimating the ATT from the First Saturated Regressions**

But now comes the hard part. Next we have to figure out how to recover the ATT from these saturated regressions. And that, as it turns out, is not as simple as just looking at a coefficient on a single dummy, but ironically, it does not require a different regression. The same regression specification, when saturated, allows us to obtain unbiased estimates of both the ATE *and* the ATT, but the work required to get the latter is a lot more involved.

To estimate the ATT from the saturated regressions, we will use the coefficients from the treatment interactions and their corresponding variable means. The formula for the ATT will involve summing the coefficients multiplied by the mean of their corresponding variable in the treated group. Below is the equation for estimating the ATT using the coefficients from the first saturated regression:

In the Stata code provided below, we first run the saturated regression on the simpler model which only had age, gpa and an interaction of gpa and age. We save the coefficients for the treatment variable and all its interactions, including the interaction of gpa and age. Then, we calculate the mean of age and GPA for the treated group. Finally, we use the coefficients and the mean values to compute the ATT for the treated group. And this is an estimate of the ATT implied by that first saturated model.

To estimate the ATT from the second saturated regression is considerably more tedious — not in the Stata specification itself. That is only one line thanks to Stata syntax that simplifies interactions. Rather, once we have those coefficients, we have to then multiply by them **all treatment interactions**, of which there are many, and several we do not even need at all given the data generating process. But over controlling in this case does not have any deleterious effect, suggesting that the specifications you employ may want to explore enough higher order terms assuming that the parameters can be estimated in the first place.

**Estimating the ATT from the Second Saturated Regression**

To estimate the ATT from the second saturated regression, we will use the coefficients from the treatment interactions and their corresponding variable means. The formula for the ATT will involve summing the coefficients multiplied by the mean of their corresponding variable in the treated group. Below is the equation for estimating the ATT using the coefficients from the second saturated regression:

In the Stata code provided, we first run the second saturated regression and save the coefficients for the treatment variable and its interactions with age, age_sq, gpa, and gpa_sq. Then, we calculate the mean of age, age_sq, gpa, and gpa_sq for the treated group. Finally, we use the coefficients and the mean values to compute the ATT for the treated group.

Alright, let's dive into the results! Below, we present the estimated ATTs from the first saturated regression, which includes a shorter list of covariates and only one interaction, and the second saturated regression, with a fuller set of interactions. It appears that the first estimate is somewhat biased, while the second estimate is impressively close to the true value. The point estimate from the second regression is 1,972, while the actual ATT is 1,980, making the difference a mere 8. Not too shabby!

**Estimating ATT with Nonparametric Matching**

In this section, we revisit nonparametric matching for estimating the ATT. By minimizing the Mahalanobis distance, we evaluate four matching models with and without bias adjustment. These methods, based on Abadie and Imbens (2006) and Abadie and Imbens (2011), involve choosing "optimal" matches from the control group to the treatment group and calculating the simple average outcomes before taking the difference. To address covariate imbalance, we apply bias adjustment techniques as per Abadie and Imbens (2011). We will explore these methods through four distinct models, showcasing their effectiveness in reducing estimation bias and providing accurate treatment effect estimates. For each model, we minimize the Mahalanobis distance to obtain the best possible matches.

Model 1: Matches on age and GPA only.

Model 2: Matches on age and GPA, while also estimating selection bias based on these two covariates.

Model 3: Matches using a fuller set of controls, including higher-order polynomials of age and GPA.

Model 4: Matches using the fuller set of controls as in Model 3, and then estimates the selection bias for each row to net it out before calculating the averages.

Below is a figure presenting the results of 1000 simulations for all four matching models, just as we've done in previous sections. Take note that the bias is significantly lower than our saturated regression with misspecification, even without adjusting for selection bias. In both cases—matching on age and GPA only, and matching on the fuller set of controls including higher-order polynomials—adjusting for selection bias through regression adjustment reduces the bias. Remarkably, in the fuller model with higher-order polynomials (Model 4), the parameter estimate under matching is unbiased.

The figure clearly demonstrates the effectiveness of matching methods combined with bias adjustment in reducing estimation bias, particularly when using a fuller set of controls with higher-order polynomials. This approach provides accurate and unbiased estimates of treatment effects, making it a valuable tool for causal analysis.

**Conclusion**

In conclusion, let's circle back to the essential steps of our analysis: defining target parameters, establishing identification, and selecting appropriate estimators. Nonparametric matching offers several advantages over traditional regression techniques, especially when estimating the ATT. While regression may seem simpler at first glance, calculating the ATT involves more complexity than a single line of specification. Nonparametric matching eliminates the need for a linearity assumption, which can be crucial in certain contexts, as we'll explore in future substacks.

This discussion also highlights the implications of heterogeneous treatment effects concerning confounders. One can obtain both the ATE and the ATT from a single saturated regression, although their construction after the estimation relies on different information from the same model.

Interestingly, this journey into saturated regressions and their applications was inspired by Wooldridge's Mundlak estimator in the context of difference-in-differences analysis.

Thank you for following along with this in-depth exploration. Apologies for the delay in releasing this substack; it proved to be a challenging and enlightening journey. Don't forget to like, share, and follow! If you enjoyed this content, consider becoming a subscriber for more insights and analyses.

**Stata Code**

cd "/Users/scott_cunningham/Library/CloudStorage/Dropbox-MixtapeConsulting/scott cunningham/Mac/Documents/Causal-Inference-1/Lab/Matching" | |

* Simulation with heterogenous treatment effects, unconfoundedness and OLS estimation | |

clear all | |

program define het_te, rclass | |

version 14.2 | |

syntax [, obs(integer 1) mu(real 0) sigma(real 1) ] | |

clear | |

drop _all | |

set obs 5000 | |

gen treat = 0 | |

replace treat = 1 in 2501/5000 | |

gen age = rnormal(25,2.5) if treat==1 | |

replace age = rnormal(30,3) if treat==0 | |

gen gpa = rnormal(2.3,0.75) if treat==0 | |

replace gpa = rnormal(1.76,0.5) if treat==1 | |

su age | |

replace age = age - `r(mean)' | |

su gpa | |

replace gpa = gpa - `r(mean)' | |

gen age_sq = age^2 | |

gen gpa_sq = gpa^2 | |

gen interaction=gpa*age | |

gen y0 = 15000 + 10.25*age + -10.5*age_sq + 1000*gpa + -10.5*gpa_sq + 500*interaction + rnormal(0,5) | |

gen y1 = y0 + 2500 + 100 * age + 1000*gpa | |

gen delta = y1 - y0 | |

su delta // ATE = 2500 | |

su delta if treat==1 // ATT = 1980 | |

local att = r(mean) | |

scalar att = `att' | |

gen att = `att' | |

gen earnings = treat*y1 + (1-treat)*y0 | |

* Regression 1: constant treatment effects, no quadratics | |

reg earnings treat age gpa, robust | |

local treat1=_b[treat] | |

scalar treat1 = `treat1' | |

gen treat1=`treat1' | |

* Regression 2: constant treatment effects, quadratics and interaction | |

reg earnings treat age age_sq gpa gpa_sq c.gpa#c.age, robust | |

local treat2=_b[treat] | |

scalar treat2 = `treat2' | |

gen treat2=`treat2' | |

* Regression 3: Heterogenous treatment effects, partial saturation | |

regress earnings i.treat##c.age##c.gpa, robust | |

local ate1=_b[1.treat] | |

scalar ate1 = `ate1' | |

gen ate1=`ate1' | |

* Obtain the coefficients | |

local treat_coef = _b[1.treat] | |

local age_treat_coef = _b[1.treat#c.age] | |

local gpa_treat_coef = _b[1.treat#c.gpa] | |

local age_gpa_treat_coef = _b[1.treat#c.age#c.gpa] | |

* Save the coefficients as scalars and generate variables | |

scalar treat_coef = `treat_coef' | |

gen treat_coef_var = `treat_coef' | |

scalar age_treat_coef = `age_treat_coef' | |

gen age_treat_coef_var = `age_treat_coef' | |

scalar gpa_treat_coef = `gpa_treat_coef' | |

gen gpa_treat_coef_var = `gpa_treat_coef' | |

scalar age_gpa_treat_coef = `age_gpa_treat_coef' | |

gen age_gpa_treat_coef_var = `age_gpa_treat_coef' | |

* Calculate the mean of the covariates | |

egen mean_age = mean(age), by(treat) | |

egen mean_gpa = mean(gpa), by(treat) | |

* Calculate the ATT | |

gen treat3 = treat_coef_var + /// | |

age_treat_coef_var * mean_age + /// | |

gpa_treat_coef_var * mean_gpa + /// | |

age_gpa_treat_coef_var * mean_age * mean_gpa if treat == 1 | |

* Drop coefficient variables | |

drop treat_coef_var age_treat_coef_var gpa_treat_coef_var age_gpa_treat_coef_var mean_gpa mean_age | |

* Regression 4: Heterogenous treatment effects, full saturation | |

regress earnings i.treat##c.age##c.age_sq##c.gpa##c.gpa_sq, robust | |

local ate2=_b[1.treat] | |

scalar ate2 = `ate2' | |

gen ate2=`ate2' | |

* Obtain the coefficients | |

local treat_coef = _b[1.treat] | |

local age_treat_coef = _b[1.treat#c.age] | |

local age_sq_treat_coef = _b[1.treat#c.age_sq] | |

local gpa_treat_coef = _b[1.treat#c.gpa] | |

local gpa_sq_treat_coef = _b[1.treat#c.gpa_sq] | |

local age_age_sq_coef = _b[1.treat#c.age#c.age_sq] | |

local age_gpa_coef = _b[1.treat#c.age#c.gpa] | |

local age_gpa_sq_coef = _b[1.treat#c.age#c.gpa_sq] | |

local age_sq_gpa_coef = _b[1.treat#c.age_sq#c.gpa] | |

local age_sq_gpa_sq_coef = _b[1.treat#c.age_sq#c.gpa_sq] | |

local gpa_gpa_sq_coef = _b[1.treat#c.gpa#c.gpa_sq] | |

* Save the coefficients as scalars and generate variables | |

scalar treat_coef = `treat_coef' | |

gen treat_coef_var = `treat_coef' | |

scalar age_treat_coef = `age_treat_coef' | |

gen age_treat_coef_var = `age_treat_coef' | |

scalar age_sq_treat_coef = `age_sq_treat_coef' | |

gen age_sq_treat_coef_var = `age_sq_treat_coef' | |

scalar gpa_treat_coef = `gpa_treat_coef' | |

gen gpa_treat_coef_var = `gpa_treat_coef' | |

scalar gpa_sq_treat_coef = `gpa_sq_treat_coef' | |

gen gpa_sq_treat_coef_var = `gpa_sq_treat_coef' | |

scalar age_age_sq_coef = `age_age_sq_coef' | |

gen age_age_sq_coef_var = `age_age_sq_coef' | |

scalar age_gpa_coef = `age_gpa_coef' | |

gen age_gpa_coef_var = `age_gpa_coef' | |

scalar age_gpa_sq_coef = `age_gpa_sq_coef' | |

gen age_gpa_sq_coef_var = `age_gpa_sq_coef' | |

scalar age_sq_gpa_coef = `age_sq_gpa_coef' | |

gen age_sq_gpa_coef_var = `age_sq_gpa_coef' | |

scalar age_sq_gpa_sq_coef = `age_sq_gpa_sq_coef' | |

gen age_sq_gpa_sq_coef_var = `age_sq_gpa_sq_coef' | |

scalar gpa_gpa_sq_coef = `gpa_gpa_sq_coef' | |

gen gpa_gpa_sq_coef_var = `gpa_gpa_sq_coef' | |

* Calculate the mean of the covariates | |

egen mean_age = mean(age), by(treat) | |

egen mean_age_sq = mean(age_sq), by(treat) | |

egen mean_gpa = mean(gpa), by(treat) | |

egen mean_gpa_sq = mean(gpa_sq), by(treat) | |

* Calculate the ATT | |

gen treat4 = treat_coef_var + /// | |

age_treat_coef_var * mean_age + /// | |

age_sq_treat_coef_var * mean_age_sq + /// | |

gpa_treat_coef_var * mean_gpa + /// | |

gpa_sq_treat_coef_var * mean_gpa_sq + /// | |

age_treat_coef_var * mean_age * age_age_sq_coef + /// | |

age_treat_coef_var * mean_age * age_gpa_coef + /// | |

age_treat_coef_var * mean_age * age_gpa_sq_coef + /// | |

age_sq_treat_coef_var * mean_age_sq * age_sq_gpa_coef + /// | |

age_sq_treat_coef_var * mean_age_sq * age_sq_gpa_sq_coef + /// | |

gpa_treat_coef_var * mean_gpa * gpa_gpa_sq_coef if treat == 1 | |

* Drop coefficient variables | |

drop treat_coef_var age_treat_coef_var age_sq_treat_coef_var gpa_treat_coef_var gpa_sq_treat_coef_var /// | |

age_age_sq_coef_var age_gpa_coef_var age_gpa_sq_coef_var age_sq_gpa_coef_var age_sq_gpa_sq_coef_var gpa_gpa_sq_coef_var | |

gen agegpa=age*gpa | |

* Matching model 1 | |

teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(maha) | |

mat b=e(b) | |

local match1 = b[1,1] | |

scalar match1=`match1' | |

gen match1=`match1' | |

* Matching model 2 | |

teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(maha) biasadj(age gpa) | |

mat b=e(b) | |

local match2 = b[1,1] | |

scalar match2=`match2' | |

gen match2=`match2' | |

* Matching model 3 | |

teffects nnmatch (earnings age gpa age_sq gpa_sq agegpa) (treat), atet nn(1) metric(maha) | |

mat b=e(b) | |

local match3 = b[1,1] | |

scalar match3=`match3' | |

gen match3=`match3' | |

* Matching model 4 | |

teffects nnmatch (earnings age gpa age_sq gpa_sq agegpa) (treat), atet nn(1) metric(maha) biasadj(age age_sq gpa gpa_sq agegpa) | |

mat b=e(b) | |

local match4 = b[1,1] | |

scalar match4=`match4' | |

gen match4=`match4' | |

collapse (max) att treat1 treat2 ate1 ate2 treat3 treat4 match1 match2 match3 match4 | |

end | |

simulate att treat1 treat2 ate1 ate2 treat3 treat4 match1 match2 match3 match4, reps(1000): het_te | |

** Regressions ATE | |

* Figure1: Control for age and gpa | |

kdensity _sim_2, xtitle(Estimated ATE) xline(2500, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(2500 2717) xline(2717, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Saturated w/ age gpa and interactions) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim2.gph", replace | |

* Figure2: Control for age and gpa, polynomials and interactions | |

kdensity _sim_3, xtitle(Estimated ATE) xline(2500, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(2500 2389) xline(2389, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Saturated w/ age gpa polynomials and interactions) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim3.gph", replace | |

* Figure3: Saturate age and gpa | |

kdensity _sim_4, xtitle(Estimated ATE) xline(2500, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(2500 2532) xline(2532, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Saturated w/ age gpa and interactions) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim4.gph", replace | |

* Figure4: Saturate age and gpa, polynomials and interactions | |

kdensity _sim_5, xtitle(Estimated ATE) xline(2500, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(2500 2500) xline(2500, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Saturated w/ age gpa polynomials and interactions) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim5.gph", replace | |

graph combine ./figures/sim2.gph ./figures/sim3.gph ./figures/sim4.gph ./figures/sim5.gph, title(OLS Estimates of ATE with heterogenous treatment effects) note(Four kernel density plots of estimated coefficients from 1000 simulations) | |

graph save "Graph" "./figures/combined_kernels_ate.gph", replace | |

graph export ./figures/combined_kernels_ate.jpg, as(jpg) name("Graph") quality(90) replace | |

** Regressions ATT | |

* Figure3: saturated treatment with age gpa and interactions | |

kdensity _sim_6, xtitle(Estimated ATT) xline(1980, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(1980 1746) xline(1746, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Saturated w/ age gpa and interactions) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim6.gph", replace | |

* Figure4: saturated treatment with age gpa polynomials and interactions | |

kdensity _sim_7, xtitle(Estimated ATT) xline(1980, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(1980 1972) xline(1972, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Saturated w/ age gpa polynomials and interactions) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim7.gph", replace | |

graph combine ./figures/sim6.gph ./figures/sim7.gph, title(OLS Estimates of ATT with heterogenous treatment effects) note(Two kernel density plots of estimated coefficients from two regressions and 1000 simulations) | |

graph save "Graph" "./figures/combined_kernels_att.gph", replace | |

graph export ./figures/combined_kernels_att.jpg, as(jpg) name("Graph") quality(90) replace | |

** Matching | |

* Figure5: Matching without bias adjustment | |

kdensity _sim_8, xtitle(Estimated ATT) xline(1980, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(1980 2007) xline(2007, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Age and GPA and no bias adjustment) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim_8.gph", replace | |

* Figure6: Matching with bias adjustment | |

kdensity _sim_9, xtitle(Estimated ATT) xline(1980, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(1980 1998) xline(1998, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Age and GPA and bias adjustment) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim_9.gph", replace | |

* Figure8: Matching without bias adjustment | |

kdensity _sim_10, xtitle(Estimated ATT) xline(1980, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(1980 2007) xline(2007, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Age and GPA polynomials and no bias adjustment) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim_10.gph", replace | |

* Figure9: Matching with bias adjustment | |

kdensity _sim_11, xtitle(Estimated ATT) xline(1980, lwidth(medthick) lpattern(dash) lcolor(blue) extend) xlabel(1980 1980) xline(1980, lwidth(med) lpattern(solid) lcolor(red) extend) subtitle(Age and GPA polynomials and bias adjustment) ytitle("") title("") note("") | |

graph save "Graph" "./figures/sim_11.gph", replace | |

graph combine ./figures/sim_8.gph ./figures/sim_9.gph ./figures/sim_10.gph ./figures/sim_11.gph, title(Matching with Minimized Maha on Age and GPA) note(Four kernel density plots of estimated ATT from 1000 simulations) | |

graph save "Graph" "./figures/combined_kernels_maha.gph", replace | |

graph export ./figures/combined_kernels_maha.jpg, as(jpg) name("Graph") quality(90) replace | |

capture log close | |

exit |

In both of these, mind you, we were focused on βᵢ, the coefficient on D. But how then would we recover the ATT? Which coefficient and which specification of the four would we use to find that parameter? That’s what I want to now show you. Recall in that in line 32 of this simulation, the ATT was 2105. The bottom two figures represent efforts to estimate the ATT

In order to estimate the ATT, we must first saturate the model. That term ``saturate’’ refers to interacting the treatment variable with the covariates. But here we will saturate twice: what I’ll call “partial” and what I’ll call “full”. These terms only have meaning in the sense that the “partial saturation” is biased but the “full saturation” is unbiased.

In equation 3, I saturated the model with respect to age, GPA and Age X GPA interaction. This yields seven coefficients: one on age, GPA, their interaction, D, DxAge, DxGPA and DxAgexGPA. How do we use these seven coefficients to get the ATT? I bet many of you will be surprised to learn how involved it actually is to estimate the ATT using a simple regression when faced with heterogenous treatment effects. Below is that method.

As you can see, since we had four coefficients associated with D (it alone plus the three interactions), we had to take all four of them and sum them, multiplied by either 1 for the treatment coefficient itself of the sample average for the covariates the coefficients modify. But when we did this, as this was not the OLS model that generated the data, the ATT estimate was 2,003 not 2,105 (bottom left figure).

But, what if we were to “fully saturate” meaning interact the treatment with both confounders, their interactions, their quadratics, and those interactions? Well would you believe me if I told you that to then calculate the ATT would require this insane calculation? It’s so long, it wouldn’t even fit in substack’s native equation editor!

Incredibly enough, when we estimate the fully saturated model and then take all of the coefficients from the interaction of the treatment with the other confounders and their higher order terms, using the formula above, this estimator is unbiased.

*Matching *

Now that we see that the canonical models are biased, even though technically speaking the error term would appear to be independent of the treatment and covariates (as notice it is additive in Y0 and iid from the normal distribution), let’s now move to specifications that fully saturate the model. First, to illustrate that how dependent this still remains on functional form, I saturate the model with respect to age, GPA and an interaction but not with respect to quadratics. As it is unlikely someone would know the functional form up to some polynomial, I thought it might be helpful to illustrate its performance. And though the bias is small (it’s off by $33), note that the distribution of the estimates nonetheless do not contain the true effect of 2500.

The bottom right is the fourth regression which fully saturates the model up to the correct specification which includes a quadratic term in both age and GPA. is the only one that gets it right. It saturates fully with respect to first and second order polynomials, and interactions across first order polynomials and across the second order polynomials. And only this one is unbiased, and it is a beautiful estimate.

So why was it that this fourth one performed well, but the other three did not? Simple — because it correctly modeled the data generating process itself. OLS performed extrapolations to the correct mean potential outcomes in this scenario because OLS is phenomenal at extrapolating beyond the support of the data, but heterogenous treatment effects can introduce some challenges for that. I encourage you to play around with this code.

*Scene 2: Non-linear DGP, heterogenous treatment effects using OLS and matching*

Next we focus on a different data generating process that satisfies unconfoundedness, but not exogeneity. Note that it does not satisfy exogeneity because the linearity assumption implicit in it as noted by Imbens and Rubin (2015) is not satisfied. This is done via a step function in age itself, which is our only covariate and which is also continuous. I tried to make this one simpler so as to focus on the only details that I think matter for now. The one challenge here

**Summary**

What’s the takeaway of this? It’s simple. Heterogenous treatment effects places a lot of demands on exogeneity that aren’t ordinarily there.