

Tymon Słoczyński’s “Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller Groups Get Larger Weights” was published in 2022 in the Review of Economics and Statistics. It’s short at nine pages, beautifully written, and full of important lessons as well as surprises. It’s one of my favorite econometrics papers published in the last year. It provides an important theorem regarding the performance of OLS under unconfoundedness and heterogeneous treatment effects, and given that the OLS model he examines is perhaps the most common one, the marginal benefit relative to the marginal cost of understanding the paper is very high. This substack is my attempt to boost the paper’s signal and help others better understand what OLS is, but also importantly is not, doing.
In today’s substack, I am going to discuss the algebraic properties of Tymon’s “weighted average interpretation of OLS” theorem. In a “Part 2” follow-up substack, I will conclude with a discussion of the causal implications of the theorem. Today’s substack will have a lot of Stata code using simulations, as well as some visuals I created with those simulations. In the “Part 2” substack I will also introduce you to the command in Stata and R called hettreatreg, which uses Tymon’s decomposition to explain the causal interpretation of the OLS estimand under the assumptions and situations I describe here.
From Twoway to no-way fixed effects
Some of the most important papers in econometrics I’ve read over the last several years have had to do with workhorse OLS models. Take the twoway fixed effects estimator for differential timing difference-in-differences. Twoway fixed effects seemed, we thought, obvious, simple, and easily interpretable. Its use was ubiquitous in policy evaluation too. But Goodman-Bacon, in his highly influential 2021 article in the Journal of Econometrics, showed that it was potentially biased because it calculated a weighted average of terms, and some of those terms were biased under dynamic treatment effects. This bias was caused by improper comparisons baked into the model’s estimation (specifically, comparing treated units to already-treated units in the sample). But the main culprit was not the improper comparisons so much as it was the heterogeneity itself.
John Gardner showed that treatment effect heterogeneity split away from the twoway fixed effects coefficient, forming a new composite error term that, even without any confounders, became entangled with the fixed effects under differential timing and heterogeneous treatment effects. Specifically, echoing Bacon’s decomposition, when the treatment effects differed by group (one fixed effect) and time (the second fixed effect), the composite error term was correlated with the treatment and therefore violated strict exogeneity. Somehow this result had not been well understood despite the model’s universal adoption among practitioners, which made Bacon’s decomposition extremely valuable, a theme that I think runs through many new econometrics articles focusing on standard models like OLS with controls.
Or take Goldsmith-Pinkham, Hull and Kolesar’s recent working paper “Contamination Bias in Linear Regressions,” which is a generalization of insights found in the judge fixed effects literature as well as in the twoway fixed effects literature such as Sun and Abraham (2021). Their paper shows that a problem exists when OLS is estimated with multiple additive treatments too. The coefficients fail to recover interpretable weighted averages of treatment effects because of contamination: treatment effects from the other treatment variables spill over into every other coefficient within the regression.
Regression, as it turns out, is deep, and despite its age and extensive popularity, we are still learning things about it, as well as about the world when it is used properly. To help motivate today’s substack, let’s write down something specific. Consider the following OLS model with an additive treatment dummy, d, and controls, X. We know that minimizing the sum of squared residuals yields a reasonable approximation of the conditional expectation function. Statistical software calculates that statistic nearly instantaneously because the underlying optimization problem has a unique solution under fairly general conditions, such as the regressors not being linear combinations of one another and the covariates varying across observations.
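Written out explicitly (my rendering, in notation meant to match the paper’s, with u denoting the error), equation (1) is:

$$
y = \alpha + \tau d + X\beta + u \qquad (1)
$$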
An OLS estimate of tau produced by software is a single number. Software does not spit out a decomposition of that number, though, unless you force it to. Decompositions of that single number can help us understand why OLS chose that number in the first place by offering new perspectives on it, and sometimes those new perspectives reveal problems that we otherwise could not see without the decomposition. The coefficient estimated with OLS, as it turns out, is not some uniform parameter but is instead a weighted average of other objects, and this matters most for interpretation when the treatment effects are heterogeneous. Neither those weights nor the other objects were known before Tymon’s paper, and to learn what they are, he had to decompose the estimand. Knowing the form by which the tau hat coefficient can be broken into other parts is the substance of the first part of the paper.1 Today, I only want to talk about that first part, as I think it’s better if we just focus on one thing at a time.
Linear projection, group shares and the propensity score
We are interested in the interpretation of tau in the linear projection of y on d and X, so I’m going to apply the linear projection operator, L(.|.), to the model in equation (1) now:
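In my rendering of the paper’s notation, that projection, equation (2), is:

$$
L(y \mid 1, d, X) = \alpha + \tau d + X\beta \qquad (2)
$$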
Linear projection is a concept from the theory of linear regression describing the expected value of the dependent variable as a linear function of independent variables. While it does posit a relationship in the data that says, “If we knew the true parameters, then this is how we would expect y to change on average with changes in X and d,” it is not based on potential outcomes, and so the parameters cannot be interpreted in that strict, Rubin causal framework (yet). It is purely an expression of how the mean of the outcome, measured using linear projection, changes with different stratifications of the data (e.g., d and X).
The second term needed for what follows is a measure of the share of units in the treatment group, or the unconditional probability of treatment, which we represent with rho. If there are 5,000 units in an experiment, and 2,500 are assigned to a job training program, then rho = 0.5. Rho, in other words, is the number of units in the treatment group divided by the number of units in the sample. You will sometimes hear me call it a group share. Rho measures the share of units in the treatment group, which means 1 - rho measures the share of units in the control group.
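In symbols (my notation, with n1 the number of treated units and n the sample size), that is simply:

$$
\rho = E[d] = \Pr(d = 1) = \frac{n_1}{n}
$$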
And finally, there is the propensity score. Introduced in 1983 by Paul Rosenbaum and Don Rubin, in what has become one of the most cited papers in the history of Biometrika (and shows no sign of slowing), it is a dimension reduction technique for collapsing a multi-dimensional set of covariates into a single scalar measuring the conditional probability of treatment (i.e., as a function of the covariates). The propensity score, in Tymon’s paper, is assumed to follow a linear probability model, which we take to be the best linear approximation to the true propensity score. This linear approximation of the propensity score is something I will discuss more in a different post, but for now I simply write it down as:
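Using the linear projection operator again (my rendering; the intercept and slope symbols are my own choice), equation (3) is:

$$
p(X) \equiv L(d \mid 1, X) = \pi_0 + X\pi \qquad (3)
$$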
Equation (2), our linear projection model, can be seen as partially linear and can be made very accurate through higher order polynomials and rich interactions. Given that these linear projections describe the entire dataset, we can write down separate linear projections for the treatment group and the control group. To do that, I will replace the additive covariates in equation (2) with the propensity score expression from equation (3):
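Group by group, those projections (my rendering of equations (4) and (5), which mirror the group-specific regressions of the outcome on the propensity score in the code below) are:

$$
L(y \mid 1, p(X); d = 1) = \alpha_1 + \gamma_1\, p(X) \qquad (4)
$$

$$
L(y \mid 1, p(X); d = 0) = \alpha_0 + \gamma_0\, p(X) \qquad (5)
$$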
These definitions reveal the importance that variability in the propensity score plays in the linear projection of y. As the propensity scores vary within a sample, the linear projection of y moves according to a constant plus the gamma parameter, which functions as a slope shifting the mean of y.
Average Partial Linear Effects
Tymon introduces two assumptions that he says in footnote 3 are not controversial despite not being particularly common in causal inference. The assumptions are simply that the linear projections in equations (4) and (5) exist and are unique for the treatment and control groups. Using those assumptions, let’s write down the “average partial linear effect of d”, or APLE for short. We can do this either for the entire sample or separately for the two treatment groups. It is a comparison between the two groups’ linear projections of the mean of y, so the expression for it is simply equation (4) minus equation (5), shown here:
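Evaluated at the sample-wide mean of the propensity score, that difference (my rendering of equation (6)) is:

$$
\tau_{APLE} = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0)\, E[p(X)] \qquad (6)
$$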
The APLE is called a “partial linear effect” because we’ve partialled out the effect of the covariates with the propensity score. And it’s an average because it uses the linear projection operator. The word “effect” will likely cause readers to interpret it as causal, but note the dependent variable is the realized outcome, y, and not the potential outcomes, y(1) and y(0), so we must interpret the word “effect” for now as something other than a causal concept, no matter how awkward that may sound to the ear. The forthcoming substack will introduce potential outcomes, at which point we can begin re-interpreting these APLE terms using causal terminology that will make more sense to those looking for causal interpretations in regressions like these.
As indicated, equation (6) is the APLE for all units in the sample, but we can also write down APLE equations that apply only to certain groups. The only difference between these sub-population APLE terms and the overall APLE is which mean of the propensity score is used. Is it the mean for the whole sample (equation 6), or just for a particular treatment group (equation 7)? Here I show the expression for the sub-samples.
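For either group, the group-specific APLE (my rendering of equation (7)) simply swaps in that group’s own mean of the propensity score:

$$
\tau_{APLE,g} = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0)\, E[p(X) \mid d = g], \qquad g \in \{0, 1\} \qquad (7)
$$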
Weighted Average Interpretation of OLS Theorem
Under the presumed existence and uniqueness of those linear projections, Tymon shows that the OLS estimand for tau is a weighted average of the APLE for the treatment group and the APLE for the comparison group. This is, in other words, a decomposition of the OLS estimand from a single number found by minimizing the sum of squared residuals into a weighted average of two other numbers not directly observed when running regressions in software or calculating them by hand. Equation (8), shown next, is Tymon’s “weighted average interpretation of OLS” theorem, which is the central focus of the article itself.
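In my rendering, the theorem states:

$$
\tau = \omega_1\, \tau_{APLE,1} + \omega_0\, \tau_{APLE,0} \qquad (8)
$$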
The weights, represented with the Greek letter omega, are complex and based on variances of the propensity score, as well as the group shares. They are shown here, but I will also write them down in a simulation for those who, like me, sometimes need to see both the algebra and the code to understand it.
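Using V(.|.) for a conditional variance, the weights (my rendering of equations (9) and (10), which is also the formula used in the Stata code below) are:

$$
\omega_1 = \frac{(1-\rho)\, V(p(X) \mid d = 0)}{\rho\, V(p(X) \mid d = 1) + (1-\rho)\, V(p(X) \mid d = 0)} \qquad (9)
$$

$$
\omega_0 = \frac{\rho\, V(p(X) \mid d = 1)}{\rho\, V(p(X) \mid d = 1) + (1-\rho)\, V(p(X) \mid d = 0)} \qquad (10)
$$

Note that the two weights sum to one.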
Let me illustrate equation (9) for you now. Here is the code I promised. It generates 100 values of the variance of the treatment group propensity score. I set the variance of the control group propensity score to be constant at 0.1 (the local var_pscore_0) and I set the rho parameter to 0.5 (the local rho). I then calculate the omega1 weight (the variable weight1) using the formula from equation (9). Feel free to copy this onto your own local machine so that you can see precisely the definition expressed in Stata syntax.
clear all
set obs 100

* Create a variable for the variance of the propensity score among the treated
gen var_pscore_1 = _n * 0.0025

* Set fixed values
local rho 0.5
local var_pscore_0 0.1

* Calculate the weight
gen weight1 = (1 - `rho') * `var_pscore_0' / (`rho' * var_pscore_1 + (1 - `rho') * `var_pscore_0')

* Create the graph
twoway (line weight1 var_pscore_1, lcolor(black) lwidth(medium) lpattern(solid)), ytitle("Weight on APLE,1") xtitle("Variance of Propensity Score among Treated") title("Weight on APLE1 and Var[P(X)|D=1]") note("P(X)|D=1 is the propensity score for the treatment group")
graph export ./tymon_variance_weight.png, as(png) name("Graph")
The weight on the APLE for the treatment group is strictly decreasing in the variance of the treatment group propensity score, according to equation (9). It does this because the propensity score variance term for the treatment group appears in the denominator, not the numerator, so as it rises, the denominator rises and the fraction falls. The weight is also increasing in the relative size of the control group because, for omega1, the numerator is (1-rho) multiplied by the control group’s propensity score variance. But I will hold off discussing the role of the group shares until my next substack so that we can focus on a few narrower parts of the paper today.
With the twoway command at the end of that code, I plotted a simple line connecting the omega1 weight to the variance of the propensity score for the treatment group so that readers needing a visual can see it. As you can see here, the weight on the average partial linear effect for the treated is declining monotonically in the variance of the propensity score (which cannot exceed 0.25). But the weights are not themselves interesting so much as the things they modify and how all of it adds up to the OLS coefficient itself (equation 8).
The big picture here is simple: you can either estimate tau hat directly by minimizing the sum of squared residuals, or you can follow these four steps. Either way you get the same answer. And when I say the same answer, I mean exactly the same answer. Here are the four steps:
Estimate the propensity score by regressing d on X using OLS and predict the conditional probability of treatment, p(X), using those fitted values from that linear regression
Calculate the APLE terms for treatment and control using the formula in equation (7)
Calculate the omega1 and omega0 weights using the formula in equations (9) and (10)
Calculate the weighted average of APLE according to the formula in equation (8)
Seeing is believing, so let’s do this using a simulation. The assignment of units to treatment (the gen treat and replace treat lines) is independent of potential outcomes conditional on covariates and so satisfies unconfoundedness. There are two confounders, age and high school GPA, but they have different sample means and different sample variances for treatment versus control. These two variables determine the potential outcomes (the gen y0 and gen y1 lines), which is what makes them confounders. The treatment and control groups therefore have different distributions of the confounders, which will introduce bias unless they can be controlled for (but as we’ll see in the next substack, that is not the only source of bias with this OLS specification). Fortunately, I made these data, I know the confounders, and I possess them in the dataset, so controlling for them is easy. It’s less easy in the real world, but it’s easy for us here.
clear all
set seed 5150
set obs 5000

* Create treatment and control groups
gen treat = 0
replace treat = 1 in 2501/5000

* Generate covariates
gen age = rnormal(25, 2.5) if treat == 1
replace age = rnormal(30, 3) if treat == 0
gen gpa = rnormal(2.3, 0.75) if treat == 0
replace gpa = rnormal(1.76, 0.5) if treat == 1

* Center the covariates
egen mean_age = mean(age)
replace age = age - mean_age
egen mean_gpa = mean(gpa)
replace gpa = gpa - mean_gpa

* Generate additional variables
gen age_sq = age^2
gen gpa_sq = gpa^2
gen interaction = gpa * age

* Generate potential outcomes
gen y0 = 15000 + 10.25*age - 10.5*age_sq + 1000*gpa - 10.5*gpa_sq + 500*interaction + rnormal(0, 5)
gen y1 = y0 + 2500 + 100 * age + 1100 * gpa
gen delta = y1 - y0

su delta // ATE = 2500
su delta if treat==1 // ATT = 1962
local att = r(mean)
scalar att = `att'
gen att = `att'
su delta if treat==0 // ATU = 3037
local atu = r(mean)
scalar atu = `atu'
gen atu = `atu'

* Generate observed outcome
gen earnings = treat * y1 + (1 - treat) * y0

* Regress treatment on covariates to get propensity score
reg treat age gpa age_sq gpa_sq interaction
predict pscore

* Calculate squared propensity score
gen pscore_sq = pscore^2

* Calculate E[p(X)^2] and E[p(X)]^2 for the treated group
summarize pscore_sq if treat == 1, meanonly
local E_pscore_sq_1 = r(mean)
summarize pscore if treat == 1, meanonly
local mean_pscore_1 = r(mean)
local E_pscore_1_sq = (`mean_pscore_1')^2

* Variance of the propensity score for the treated group
local var_pscore_1 = `E_pscore_sq_1' - `E_pscore_1_sq'

* Repeat the process for the control group
summarize pscore_sq if treat == 0, meanonly
local E_pscore_sq_0 = r(mean)
summarize pscore if treat == 0, meanonly
local mean_pscore_0 = r(mean)
local E_pscore_0_sq = (`mean_pscore_0')^2
local var_pscore_0 = `E_pscore_sq_0' - `E_pscore_0_sq'

* Display the variances
di "Variance of propensity score for treated: " `var_pscore_1'
di "Variance of propensity score for control: " `var_pscore_0'

* Calculate rho, the share of units treated
su treat, meanonly
local rho = r(mean)
gen rho = `rho'

* Calculate the weights
gen weight1 = ((1 - `rho') * `var_pscore_0') / (`rho' * `var_pscore_1' + (1 - `rho') * `var_pscore_0')
gen weight0 = 1 - weight1

* 1. Obtain the OLS regression coefficient directly
reg earnings age gpa age_sq gpa_sq interaction treat
di "Coefficient of treat from OLS: " _b[treat]

* 2. Obtain the OLS regression coefficient using Tymon's theorem
* For treated group (d = 1)
reg earnings pscore if treat == 1
scalar alpha_1 = _b[_cons]
scalar gamma_1 = _b[pscore]

* For control group (d = 0)
reg earnings pscore if treat == 0
scalar alpha_0 = _b[_cons]
scalar gamma_0 = _b[pscore]

* Expected value of the propensity score for each group
su pscore if treat == 1, meanonly
scalar E_pscore_1 = r(mean)
su pscore if treat == 0, meanonly
scalar E_pscore_0 = r(mean)

scalar APLE_1 = (alpha_1 - alpha_0) + (gamma_1 - gamma_0) * E_pscore_1
scalar APLE_0 = (alpha_1 - alpha_0) + (gamma_1 - gamma_0) * E_pscore_0
scalar tau_hat = weight1 * APLE_1 + weight0 * APLE_0
di "Calculated treatment coefficient using weighted APLEs: " tau_hat
I baked heterogeneous treatment effects with respect to age and GPA directly into the data generating process when creating y0, y1, and delta. Heterogeneity in this context means that age and GPA have different effects on Y(0) than they do on Y(1). That is what we mean by heterogeneous treatment effects with respect to covariates: the relationship between the potential outcomes and the covariates differs across the potential outcomes.
I calculate the individual treatment effect as delta = y1 - y0 and then summarize the sample treatment effects to get the ATE, the ATT and the ATU (the su delta commands), which will come in handy in the second substack for understanding the OLS decomposition from a causal inference perspective as opposed to merely an algebraic one.
Successfully writing out the code to implement steps 1 to 4 was quite challenging for this old dog who keeps forgetting how to perform basic tricks, but eventually I got it. So if you want to really understand Tymon’s paper and need some help, I encourage you to run my code above on your machine and reflect on what each line is doing. I think you will find that eye opening as you peruse his paper more closely. Let’s now do this ourselves by running OLS with the reg earnings age gpa age_sq gpa_sq interaction treat command. The coefficient on “treat” from that regression is $2,387.489.
But we could also, as I said, get this same number using Tymon’s OLS theorem based on weights and APLE terms. The second half of the code, from the propensity score regression through the tau_hat calculation, does this. Using the variance-based weights built from the propensity scores and the average partial linear effects, I calculate tau hat the “long way” using Tymon’s theorem and find that it, too, is $2,387.4894.
In other words, OLS can be found either way — the long way, or the short way. But only the long way will show you the weights and the underlying numbers adding up to the OLS coefficient found the short way, and it’s this long way that in the next substack will help us understand the bias of this OLS specification under unconfoundedness and heterogeneity.
Concluding remarks
Nothing in today’s substack indicted OLS as a liar, despite the title of the post. I didn’t show that OLS was lying, though I did accuse it of lying. Lying in the context of econometrics is my informal word for “bias” and “inconsistency”, and I have not yet substantiated my claim that this OLS specification is a liar about the causal effects.
But I will. This is not the end of Tymon’s paper. Tomorrow or the next day, I will conclude my discussion of his paper with a part 2 of “Lies, Damn Lies and OLS Weights”. And in that discussion, I will delve into a corollary of Tymon’s “weighted average interpretation of OLS” theorem that focuses, not on the APLE terms, but rather on the causal parameters that map onto those APLE terms once potential outcomes and assumptions are introduced. That is when I will bring the smoking gun as evidence that this OLS specification lies. And to give you a sneak peek, so that this is not too much of a cliffhanger, recall the summaries of delta in the simulation above (su delta and su delta if treat==1), which report the ATE and the ATT in these simulated data.
Herein lies the value of creating simulations based on potential, not just realized, outcomes: you know the causal parameters simply by summarizing the treatment effect variable. The ATE is $2,500 and the ATT is $1,962. Why do I show you this? Remind yourself what our OLS coefficient was.
The OLS coefficient from that regression is $2,387.4894, which is neither $2,500 nor $1,962. Here is my evidence that OLS lied. Even with unconfoundedness, even controlling for the confounders, and even controlling for the same transformations of them in the regression, OLS still lied. The number it produced is neither the ATE nor the ATT.
So then what is it? Well, tune in to find out! Cliffhanger! Don’t forget to subscribe, like or share!
There are similar insights, though, found in Goodman-Bacon’s aforementioned 2021 article on twoway fixed effects. Readers who are very familiar with the variance weighting generated by twoway fixed effects will likely recognize much of what we did today.