
In an earlier substack on Tymon Sloczynski‘s 2022 Restat (hereafter “Tymon”) I went through the algebraic properties of this OLS model:

I noted that Tymon had a theorem he called the “weighted average interpretation of OLS” that showed the OLS estimate of *tau* was a weighted average of two average partial linear effect quantities — one for the treatment group and one for the control group.

Look closely here at the weights:

Notice how the weight on the first treatment group APLE term (equation 3) is *increasing* as the share of the control group gets larger. See that? The *1-rho* where *rho* is the share of units in the treatment group? That means as the number of units in the treatment group *falls*, the weight on that APLE rises and vice versa. You can see the same thing in that second weight — as the share of units in the treatment group, *rho*, rises, the weight falls.
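To make the direction of that weight concrete, here is a small Python sketch. This is *not* Tymon's exact formula: `v_t` and `v_c` are my hypothetical stand-ins for the variance terms in equation (3), set equal so that only the group-share part of the weight is visible.

```python
# Illustrative sketch only: the post's point is that the weight on the
# treatment-group APLE carries a (1 - rho) factor, so it falls as the
# treatment share rho rises. v_t and v_c are placeholder variance terms
# (set equal here to isolate the share effect).
def w1(rho, v_t=1.0, v_c=1.0):
    """Weight on APLE,1: a (1 - rho) share term times a variance term,
    normalized so that w1 + w0 = 1."""
    return (1 - rho) * v_c / ((1 - rho) * v_c + rho * v_t)

def w0(rho, v_t=1.0, v_c=1.0):
    """Weight on APLE,0: the complementary weight."""
    return rho * v_t / ((1 - rho) * v_c + rho * v_t)

shares = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [round(w1(r), 2) for r in shares]
print(weights)  # with equal variances, w1 = 1 - rho: [0.9, 0.7, 0.5, 0.3, 0.1]
```

As the treatment share climbs from 0.1 to 0.9, the weight on the treatment group's APLE falls from 0.9 to 0.1: exactly the backwards pattern the simulation below traces out.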

I will explore this odd feature of OLS weighting today with you with a simulation. Then I want to transition from the algebraic OLS theorem Tymon worked out to its corollary regarding causal terms when using potential outcomes notation. So let’s go!

Let me show you now what I mean about the weird weights, again using a simulation. This simulation will loop through 4,993 treatment-group sizes, so that the share of units in the treatment group runs from 5 out of 5,000 (*rho* = 0.001) all the way up to 4,997 out of 5,000 (*rho* = 0.9994). Each time it will do some analysis and save several calculations which I’ll use to illustrate the relationship between *omega1* (equation 3) and *rho*, remembering that *rho*, again, is the share of units in the treatment group, and *omega1* is the funky weight that is placed on the APLE for the treatment group. Here is the simulation in Stata. You could probably copy it into ChatGPT and get the R version easily.

```stata
clear all

* Set up the results file
tempname handle
postfile `handle' ate att atu ols prob_treat prob_control n tymons_ate tymons_att tymons_atu w1 w0 delta using results.dta, replace

* Loop through the iterations
forvalues i = 4/4996 {
    clear
    set seed 5150
    set obs 5000
    gen treat = 0
    replace treat = 1 in `i'/5000

    * Imbalanced covariates
    gen age = rnormal(25, 2.5) if treat == 1
    replace age = rnormal(30, 3) if treat == 0
    gen gpa = rnormal(2.3, 0.75) if treat == 0
    replace gpa = rnormal(1.76, 0.5) if treat == 1

    * Re-center the covariates
    su age, meanonly
    replace age = age - r(mean)
    su gpa, meanonly
    replace gpa = gpa - r(mean)

    * Quadratics and interaction
    quietly gen age_sq = age^2
    quietly gen gpa_sq = gpa^2
    quietly gen interaction = gpa * age

    * Modeling potential outcomes
    quietly gen y0 = 15000 + 10.25*age + -10.5*age_sq + 1000*gpa + -10.5*gpa_sq + 500*interaction + rnormal(0, 5)
    quietly gen y1 = y0 + 2500 + 100*age + 1100*gpa
    quietly gen treatment_effect = y1 - y0

    * Calculate ATE, ATT, and ATU
    su treatment_effect, meanonly
    local ate = r(mean)
    su treatment_effect if treat == 1, meanonly
    local att = r(mean)
    su treatment_effect if treat == 0, meanonly
    local atu = r(mean)

    * Generate earnings variable
    quietly gen earnings = treat * y1 + (1 - treat) * y0

    * Get the weights, OLS coefficient and APLE
    quietly hettreatreg age gpa age_sq gpa_sq interaction, o(earnings) t(treat) vce(robust)
    local ols `e(ols1)'
    local prob_treat `e(p1)'
    local prob_control `e(p0)'
    local n `e(N)'
    local tymons_ate `e(ate)'
    local tymons_att `e(att)'
    local tymons_atu `e(atu)'
    local w1 `e(w1)'
    local w0 `e(w0)'
    local delta `e(delta)'

    * Post the results to the results file
    post `handle' (`ate') (`att') (`atu') (`ols') (`prob_treat') (`prob_control') (`n') (`tymons_ate') (`tymons_att') (`tymons_atu') (`w1') (`w0') (`delta')
}

* Close the postfile
postclose `handle'

* Use the results
use results.dta, clear
gsort prob_treat
gen id = _n

* Simple graph
twoway (line w1 prob_treat), ytitle("OLS weight on APLE,1") ///
    xtitle("Share of units in treatment group") ///
    title("Weighted Average Interpretation of Sloczynski OLS Theorem") ///
    subtitle("Treatment Group Size vs. APLE,1 Weights") ///
    note("5000 OLS regressions varying treatment share from 0.001 to 0.9994") ///
    legend(position(6))
```

The twoway command at the end gives you this plot of the relationship between the weight on the APLE for the treatment group and the share of units in the treatment group.

Let me explain this graph for anyone who, like me, sort of stares at graphs a little bit at the beginning unsure of the story it’s trying to tell. The vertical axis is that omega weight. Specifically, it is the left-hand-side weight on the APLE labeled “*omega1*” in equation (3). And if you look at equation (2), you can see it there too modifying the APLE,1.

The x-axis is the *rho* term, or what I label here the “share of units in treatment group”. In other words, it’s a fraction ranging from 5 out of 5,000 units in the treatment group all the way up to 4,997 out of 5,000. The axis spans that range because I looped almost 5,000 times, calculating both variables at each treatment-group size. I had to do the looping because the propensity score changes as I change the group shares, and therefore the variance terms change too as the shares of units in treatment and control change.

So, with that said, as we move from left to right, the treatment group is getting *larger*. But as we move from left to right, the weight on the APLE,1 is getting *smaller*. Which means that the *weighted APLE,1* is falling as the treatment group gets *larger*, pushing the OLS coefficient in the *opposite direction* from what we might expect. To help you understand the counterintuitive nature of this, let’s move from the somewhat opaque concept of the APLE to the concepts we do know — the average treatment effect (ATE), the average treatment effect on the treatment group (ATT) and the average treatment effect on the untreated group (ATU).

**Sidebar: Sometimes ATT is the ATE and sometimes it isn’t. How come?**

*Randomization: When ATE=ATT=ATU*

Before I move into Tymon’s decomposition of the OLS coefficient as a weighted average of the ATT and the ATU, I wanted to first just make an obvious, but maybe not so obvious, observation which is that the OLS coefficient in an RCT, where you simply compare the mean of the outcome for the treatment group and the mean of the outcome for the control group will equal the ATE, but it will also equal the ATT, *and* it will even also equal the ATU too. To see this, consider this decomposition of the simple difference in means when you have no covariates. This is from the fourth chapter of my book, *Causal Inference: the Mixtape* *(Yale University Press, 2021)*, titled “Potential Outcomes”. That link takes you to subsection 4.1.3 titled “Simple differences in means decomposition”.

The left hand side is the simple difference in mean outcomes in the population. And given the simple switching equation, *Y=DY(1) + (1-D)Y(0)*, we know that number because it can be calculated with realized outcomes in the data. The right hand side cannot be calculated with data, though, because we don’t know the ATE, the ATT, the ATU or *E[Y(0)|D=1]*. The top term on the right is the ATE; the second row is selection bias, representing the difference between the counterfactual *E[Y(0)]* for the treatment group and the same quantity for the control group; and the last is the weighted difference between the *ATT* and the *ATU*, where the weight is the share of units in the control group. This whole expression is just a simple working out of the definition of the ATE, ironically, which you can see if you read section 4.1.3 slowly and work through the decomposition yourself.
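Since the decomposition is an algebraic identity, you can check it with any numbers. Here is a minimal Python sketch with made-up potential outcomes (which are, of course, unobservable together in real data):

```python
# Check the decomposition numerically:
#   SDO = ATE + selection bias + (1 - pi) * (ATT - ATU),
# where pi is the treated share, so (1 - pi) is the control share.
# The potential outcomes below are hypothetical; the identity holds for any.
y1 = [10, 12, 9, 11, 8, 7, 6, 9]
y0 = [5, 6, 7, 4, 6, 5, 5, 6]
d  = [1, 1, 1, 0, 0, 0, 0, 0]   # treatment indicator

mean = lambda xs: sum(xs) / len(xs)
t = [i for i, di in enumerate(d) if di == 1]
c = [i for i, di in enumerate(d) if di == 0]
pi = len(t) / len(d)

sdo = mean([y1[i] for i in t]) - mean([y0[i] for i in c])        # observable
ate = mean([y1[i] - y0[i] for i in range(len(d))])
att = mean([y1[i] - y0[i] for i in t])
atu = mean([y1[i] - y0[i] for i in c])
selection = mean([y0[i] for i in t]) - mean([y0[i] for i in c])  # E[Y0|D=1] - E[Y0|D=0]

rhs = ate + selection + (1 - pi) * (att - atu)
print(round(sdo, 4), round(rhs, 4))  # the two sides agree exactly
```

Swap in any potential outcomes you like; the two sides will always match, which is the sense in which the equation is *always* true.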

That equation is *always* true, whether you run an experiment, or whether you are simply comparing *any* two groups. The comparison between any two groups at a point in time will *always* be equal to the sum of those three rows — *always*. Imagine for a moment I flipped a coin: heads you go into the treatment group, tails you go into the control group. This means the treatment assignment is independent of the potential outcomes. Let me write it down formally so you recognize it:

Let’s put this into words. Let’s say the treatment is either “go to college” or “stop after high school”. If we randomly assigned kids to college or high school only based on a coin flip, then the kids in college would have the same **average** characteristics as those who didn’t. Independence, in other words, distributes *all* variables equally — same mean, same variance, same distribution — across the two groups so long as the treatment is independent of those variables. Which it is in a randomized experiment.

Independence means all variables in expectation at the population level are the same for both groups — including the potential outcomes, i.e., even the mean of *Y(1)* and the mean of *Y(0)* too. Furthermore, even the mean *treatment effect*, *Y(1) - Y(0)*, is the same in both groups since the mean of both potential outcomes is the same in each group. This is a subtle but deeply important deduction we are allowed to make. It is what permits us therefore to write down these logical deductions.

Using those equalities, we can substitute all of them into equation (5), and when we do the second and third rows disappear. The second row of equation (5), for instance, labeled “selection bias” disappears once we note that equation (7) holds. Selection bias does not exist *in the population* for any truly randomized experiment with full compliance.

But then what about the last term — the heterogeneous treatment effects? See here why that too vanishes under a randomized experiment.

And so equation (5), defining the simple difference in means in terms of causal parameters and selection bias, is simply equal to the ATE (the top row) because the second row and third row zero out.

But you know what else is equal to the ATE? The ATT and the ATU. We just showed that the ATT=ATU in equation (9). But that’s because it was all expressed with conditional expectations and since that whole expression was linear in those conditional expectations, they just canceled out because one half of that equation was equal to zero and so was the other half. And that means *ATT=ATU*. But check this out. Let’s write down the definition of the ATT and remember the deduction that is allowed if the assignment of the treatment is independent of both potential outcomes.

Do you see how in the third row of equation (10) I replaced *E[Y(1)|D=1]* with just *E[Y(1)]*? Do you know why I did that? Because look back up at equations (7) and (8): the mean of *Y(1)* and the mean of *Y(0)* are the same whether I’m looking at the treatment or the control group. If each mean is the same regardless of treatment status, then it doesn’t depend on treatment status, and we can drop the *D* term from the expression. But then if we drop it, ATT=ATE, because the last line is the definition of the ATE. You can do the same thing with the ATU if you want to, but you’ll get to the same place.
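A quick simulated coin flip shows the same thing numerically. This is just a sketch with made-up parameters (a true ATE of 2 with noisy heterogeneous effects), not the post's Stata simulation:

```python
import random

# Sketch: under coin-flip assignment, D is independent of (Y0, Y1),
# so ATT and ATU both land on the ATE, up to sampling noise.
random.seed(1)
n = 100_000
y0 = [random.gauss(0, 1) for _ in range(n)]
y1 = [y + 2 + random.gauss(0, 1) for y in y0]   # heterogeneous effects, mean 2
d  = [random.random() < 0.5 for _ in range(n)]  # the coin flip

mean = lambda xs: sum(xs) / len(xs)
ate = mean([a - b for a, b in zip(y1, y0)])
att = mean([a - b for a, b, di in zip(y1, y0, d) if di])
atu = mean([a - b for a, b, di in zip(y1, y0, d) if not di])
print(round(ate, 2), round(att, 2), round(atu, 2))  # all three close to 2
```

The treatment effects are heterogeneous unit by unit, yet all three averages agree, because the coin flip severed any link between the effect and group membership.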

So to summarize, in the RCT, all the mean causal parameters we’ve introduced are the same. Which leads to my second point: when are they *not* the same then?

*Roy Models: ATE not equal to ATT not equal to ATU*

In a 1990 article in *Econometrica*, James Heckman and Bo Honoré discuss “the empirical content of the Roy model”. I was reminded of this last night when emailing Petra Todd about the history of the average treatment effect on the treatment group. The Roy model from 1951 entitled “Some Thoughts on the Distribution of Earnings” is both a classic in labor economics, as well as a favorite of Jim Heckman’s. He has long maintained that it is a deep, integral part of the history of the potential outcomes model of causality, at least 20 years before Don Rubin. The paper is short at 12 pages and has no math, but it underlines the important role of *rational sorting* based on the heterogeneous returns to occupations across two sectors — fishing and hunting — and how that simple idea changes completely what the distribution of earnings looks like and also what it means. Here’s a deck of slides by Chris Taber also describing it in case you want to dig into it a little.

When people sort into an activity based on the returns to that activity, where the return is defined as *Y(1) - Y(0)*, then we don’t have independence, because independence means “no sorting based on potential outcomes”. This may be an easy assumption for someone without a behavioral background to make, but for anyone with any background in the social sciences, where people have some modicum of intentionality in their choices and respond, at all, to desires for happiness and the avoidance of pain, independence is a really strong assumption, maybe even a radical one. People most likely do the opposite; they make choices *because* they expect those choices to help them. Yes, they do so under uncertainty, and it most likely depends a lot on which choices and which outcomes we mean, but for the really important decisions in life, it seems unwise to just assume no one is sorting based on gains. When you see one group with different mean outcomes from another, it’s almost certainly the case that the comparisons you make are rife with selection bias.
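Here is a minimal sketch of that sorting story, with hypothetical numbers: each unit takes the treatment exactly when its own gain is positive, and the causal parameters immediately come apart.

```python
import random

# Roy-style sorting sketch (made-up parameters): units choose the
# treatment exactly when their own gain Y1 - Y0 is positive, so
# assignment depends on the potential outcomes and independence fails.
random.seed(7)
n = 100_000
y0 = [random.gauss(10, 2) for _ in range(n)]
gain = [random.gauss(0, 3) for _ in range(n)]   # heterogeneous returns, mean 0
y1 = [a + g for a, g in zip(y0, gain)]
d  = [g > 0 for g in gain]                      # sorting on the gain

mean = lambda xs: sum(xs) / len(xs)
ate = mean(gain)
att = mean([g for g, di in zip(gain, d) if di])
atu = mean([g for g, di in zip(gain, d) if not di])
print(round(ate, 2), round(att, 2), round(atu, 2))  # ATE near 0, ATT > 0 > ATU
```

Even though the average gain in the population is zero, the treated group is selected precisely *because* its members gain, so the ATT is strongly positive and the ATU strongly negative.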

But that’s not the only bias. Equation (5) says that any difference in two groups’ outcomes will be biased in *two* ways, not just *one*. The first bias is the selection bias. The two groups differ from one another in their “baseline” or *Y(0)* outcome means, but since one of those means is counterfactual (the treatment group’s) and the other is observed (the control group’s), you can’t check. You just know it’s there, and if there’s sorting, you know the groups are different too.

But notice there is also a third line, and this is explicit in the Roy model. The groups are also different with respect to the returns to each occupation. Some people go into fishing, in other words, because they *uniquely* gain from it relative to hunting, and some go into hunting for the opposite reason. Thus we see that the simple difference is a biased estimate of the ATE because of selection bias; it is also different from the ATT, and it is different from the ATU. The ATT and the ATU are different from the ATE and different from one another under any scenario where people sort *and* where there are heterogeneous treatment effects.

And *that*, while not the history of the ATT entirely, is at least one part of its history. These causal terms are different in the population if real life is a Roy (1951) world, which it probably is. Let’s now move ahead.

**Back to OLS, the ATE, ATT and the ATU**

Unconfoundedness is an extension of the independence assumption we wrote down earlier. It is sometimes even called “conditional independence” (Angrist and Pischke 2009). Here is that expression:

Ironically, the meaning of equation (11) is extremely similar to the meaning of equation (6). The difference is that the randomization of the treatment occurs *within* or *across* (however you want to envision it) the “dimensions” of the covariates. So, in other words, for people with identical confounder characteristics, the treatment is random. It is, in other words, *conditionally* random, not *unconditionally* random like in equation (6). And as I’ve said before, I think this assumption is quite radical. So much so that I asked DALL-E 3 to make me pictures of observationally identical women accepting a boyfriend’s proposal for marriage by flipping coins. As I simply could not for the life of me get DALL-E 3 to give me precisely what I wanted, I’m giving up and hoping you can see the metaphor in this anyway. Here’s a few of them. Not sure a stranger seeing this would say “oh look, a picture of unconfoundedness,” but indeed that was my goal.

Anyway you get the point — those are pictures of people who are observationally identical to one another flipping coins to get married, which is what unconfoundedness would mean if you tried to match on a set of variables necessary and sufficient to satisfy that conditional randomization.

When I frame it that way, and I think that that is what is implied by unconfoundedness, I tend to think to myself this: “probably I could believe in unconfoundedness when thinking about whether to have the chicken or the steak for dinner but it gets increasingly harder and harder to believe in conditional randomization as I move deeper and deeper into the more serious decisions of one’s life like getting married, having a child, getting divorced, taking a job, picking your schooling or going to war. Then probably even if I *could* in principle satisfy unconfoundedness, I doubt it’s the case any dataset has them and besides I doubt anyone knows the variables to include anyway. It’s likely a pipe dream and I probably need an instrument.”

And yet I love regressions and matching and propensity scores! They’re so interesting! They’re so pretty and so powerful if you can and are willing to assume unconfoundedness. Anyway, let’s get back to Tymon’s theorem.

Tymon introduces two more assumptions once he moves into the world of causal interpretations of his theorem. They are unconfoundedness in the mean potential outcomes (i.e., *E[Y(1)|D=1,X=x] = E[Y(1)|D=0,X=x]* and so on) and a linearity assumption with respect to the expected potential outcomes as a function of the propensity score. He writes:

“Sufficient for this assumption, but not necessary, is that the conditional mean of *d* is linear in *X* and the conditional means of *y(1)* and *y(0)* are linear in the true propensity score, which is now equal to *p(X)*.”

Tymon and I have corresponded, and in the simulation I will show in part 3 of this substack series, it’s possible my simulation actually violates this assumption, in ways that neither one of us fully understands. Which is interesting, because if I can figure that out, that’ll probably be a good sign my knowledge has sufficiently deepened. But ultimately I am going to show you what I did figure out, and probably use it to argue all the more that I think it’s always better to use regression adjustment or matching with bias adjustment. Anyway, here’s the causal interpretation of the OLS coefficient as a weighted average of the ATT and the ATU.

But, while the OLS coefficient is equal to a weighted average of the ATT and the ATU, since each weight shrinks as its own group’s share grows, the OLS coefficient tends to “look like” whichever treatment parameter belongs to the *smallest* group. Which is the opposite of how you would weight those two parameters to get the ATE. And so if your goal is the ATE, then that’s problematic, since the weights will bias you. Consider this example on Google Sheets.

I created a simple table with 15 people. 40% of them are in the treatment group (*rho=0.4*) and 60% are in the control group (*1-rho=0.6*). Columns A and B are the potential outcomes, Y(1) and Y(0) respectively, and column C is the treatment effect equal to Y(1)-Y(0). I assigned units to treatment according to the Roy model such that if the treatment effect was non-negative, they sorted into the treatment and if negative they stayed in the control state. I then calculated the three causal parameters we are considering and the ATE was 0, the ATT was 7.17 and the ATU was -4.78. I calculated the ATE by simply taking the mean of column C which was equal to 0 exactly.

You can calculate the ATE that way, *or* you could calculate it as the weighted average of the ATT and the ATU using as your weights the share of units in the treatment group and the share of units in the control group. When I do this (row 23), I also get a zero.

Well, what I do next in row 24 is not exactly right, because remember, the weights that OLS generates are based on the variance terms times the group shares, but I wanted to make this simple and ignore those variance terms for now. My point is that the weights are based on the group shares but in the *wrong direction*. The weight on the ATT that OLS generates is *not* increasing in the share of units treated; rather, it’s *decreasing* in the share of units treated. It’s sort of as though you *reversed* the weights, so that you’re weighting the ATT with the ATU’s weight and the ATU with the ATT’s weight. But this is what OLS does. Tymon has a great explanation of the intuition of it here.
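Using the spreadsheet's own numbers (*rho* = 0.4, ATT = 7.17, ATU = -4.78) and, as in the spreadsheet, ignoring the variance terms, a two-line Python check shows how far the reversed share weights land from the ATE:

```python
# Sketch with variance terms ignored, as in the spreadsheet example:
# the ATE weights ATT by rho and ATU by (1 - rho); the OLS-style
# weights run the other way, leaning toward the smaller group.
rho, att, atu = 0.4, 7.17, -4.78   # shares and effects from the example

ate = rho * att + (1 - rho) * atu            # correct weighting
reversed_avg = (1 - rho) * att + rho * atu   # share weights flipped

print(round(ate, 3), round(reversed_avg, 3))  # 0.0 versus 2.39
```

The correct weighting recovers the ATE of zero; flipping the share weights drags the answer toward the ATT, the parameter of the smaller group.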

What a great paragraph with great writing. I bet it took him weeks to get that paragraph just right. He notes that OLS is the best linear predictor of *realized outcomes*, which does not mean it is the best predictor of *potential outcomes*, particularly the one that became counterfactual under the treatment assignment. The OLS weights are optimal, but they are only optimal for predicting reality, not an alternative reality, which is what causal inference is about (at least in the Rubin framework).

He then writes down a few bias terms by subtracting either the ATE or the ATT from OLS coefficient. First here is the bias of ATE:

Assumptions 1 and 2 just assume that the linear projections for the APLE,0 and APLE,1 exist. If we additionally assume unconfoundedness in the mean potential outcomes and that the conditional means of the potential outcomes are linear in the true propensity score, then the bias from nonlinearity (the top row) vanishes, but not the bias from heterogeneity. And the same is true for the bias of the other two parameters as well:

It’s fascinating in a way that the OLS coefficient is equal to the sum of all three terms: the parameter of interest, the bias from nonlinearity and the bias from heterogeneity. That’s not exactly the difference-in-means decomposition I showed above — there is no selection bias in these Corollaries — but we do at least see the bias coming from heterogeneity.

See, this is actually something that Imbens and Rubin (2015) noted too. They note in a short couple of paragraphs early on that the causal interpretation of this regression model assumes linearity in the functional form, unconfoundedness and constant treatment effects, all of which Tymon is covering as well, only here we get a term that we can use to measure the amount of bias. He writes the following regarding these omega and delta terms:

“While w0 is guaranteed to be positive under assumptions 1 and 2, δ may be positive or negative. Both w0 and δ, however, are bounded between 0 and 1 in absolute value. Thus, w0 and |δ| can be interpreted as the percentage of our measure of heterogeneity, τATU − τATT, which contributes to bias.”

Isn’t that interesting? Delta can be positive or negative even if the average treatment effect itself is zero. Thus OLS is both biased for the ATE and could even “flip signs” under unconfoundedness. Groan. But the point is, since both terms modify the heterogeneity, we can use them, under unconfoundedness in mean potential outcomes and linearity of mean potential outcomes in the true propensity score, to quantify the amount of bias for a particular parameter estimate that comes from the heterogeneity itself.

The Roy model? Yes. If a Roy model is describing the sorting at all, then even under unconfoundedness, depending on the magnitudes of those treatment effect heterogeneities, you’re likely going to get diverging values of the ATT and the ATU. But you’ll at least be able to calculate the weighted difference. Tymon gives an example of the use case for his diagnostic here:

Notice then, that if your goal, as he says, is to estimate the ATT using your regression model, then OLS is expected to be biased by whatever the value of your *omega0 *term is (times 100 to get the percent). But if you wanted to use it to get the ATE, then you’d know it was biased by the delta term times 100 to get the percent.
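The bias expressions follow mechanically from the weighted-average representation. Here is a small numeric check with made-up values; the equal-variance simplification *w1* = 1 - *rho* is my own stand-in for the full weight, used only so the example is self-contained:

```python
# Numeric check of the bias algebra implied by the weighted-average
# representation tau_OLS = w1*ATT + w0*ATU with w1 + w0 = 1.
# All values below are made up for illustration.
rho, att, atu = 0.7, 5.0, 2.0
w1 = 1 - rho          # equal-variance simplification for the weight on ATT
w0 = 1 - w1

tau_ols = w1 * att + w0 * atu
ate = rho * att + (1 - rho) * atu

bias_for_att = tau_ols - att   # algebraically equals w0 * (atu - att)
bias_for_ate = tau_ols - ate   # algebraically equals (w0 - (1 - rho)) * (atu - att)

assert abs(bias_for_att - w0 * (atu - att)) < 1e-12
assert abs(bias_for_ate - (w0 - (1 - rho)) * (atu - att)) < 1e-12
```

In both cases the bias is the heterogeneity, *ATU - ATT*, scaled by a weight term, which is exactly why w0 and |δ| can be read as the share of the heterogeneity that contaminates the estimate.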

**Concluding remarks**

That’s a lot. I went on some rants again, as always when unconfoundedness comes up. I just seem to have such a paradoxical relationship with that assumption. The microeconomist in me refuses to believe it can ever exist, because I have such a strong belief in behavior governed by rational sorting on gains. We can say that this heterogeneity is driven by covariates, and thus when you capture those covariates, then the variation in the sorting can only be due to chance. Fine. And I have a bridge to sell you too. Please buy my new crypto coin while you’re at it. I have some things in the trunk of my car that will cure all your ailments while I have you. That’s kind of how I feel about the claims people make that they have those confounders. A little humility please. My cynicism over unconfoundedness is almost like a little devil sitting on my shoulder whispering to me about how stupid it is to believe it.

But then on my other shoulder is an angel. He disagrees with the demon. He notes how beautiful these methods are, and even if they’re not fully true, what if you just avoided all colliders and then filled up the covariate set with the confounders you’re most confident about, and for the ones you’re not, filled it in with the predictors of Y(0) using the control group and the predictors of Y(1) using the treatment group? Or better yet, only estimate the ATT and appeal to a slightly less offensive form of irrationality. Kind of like a Roy-lite model of human behavior where people ignore the opportunity costs when making decisions. As students take principles of microeconomics and *still* fail that part of the exam, I think it’s not the craziest thing to say they may do it in real life too.

So if you can buy it, then can’t you just estimate these terms with that regression we wrote down in equation (1)? Well sure you can. It’s a free country. You can also run across the interstate. You just probably shouldn’t because eventually you’re going to get hit. Tymon showed with his theorem that it’s probably a sufficient statistic to just look at the share of units in the treatment group and immediately say to yourself “if *rho* is high, then my OLS coefficient is biased towards the ATU, which is weird but that’s it.” And so if you like doing that, and I don’t know why you would like that but if you do like that, then sure go for it.

But running regressions knowing full well that the bias is there under heterogeneity — well that’s like running across the interstate to get to the other side when right beside you is a walking bridge that will get you there easily. And that walking bridge is either regression adjustment or matching with bias adjustment. So in part 3 of “Lies, Damn Lies and OLS Weights”, I will conclude this series and illustrate for you with a simulation precisely how this all went down. Thanks again for joining me on this journey through a great paper. I hope you have enjoyed it.
