I seem to be using this substack to try and put my work into my own words. This is all still fairly raw to me. It’s taking practice, and repeatedly checking my gut with tables of potential outcomes and simulations.
None of what I said here was about bias. None of it was about estimation or identification. It’s still just me trying to think more carefully about parameters expressed as average treatment effects, and how those vary with the weighting schemes one is always confronted with.
Today’s post is more in that vein, but it has to do with when weighting is going to really require you to think carefully about precisely what your parameter does and does not mean. That is, when will being a little unsure about all this weighting bite you in the butt, and when won’t it? And the answer is: when individuals sort into aggregated geographic units based on potential outcomes, the overall ATE for a population of individuals and the ATE across the communities where that population resides will differ.
Cosmos thinks I’m talking about Simpson’s paradox. Bacon said all of this is covered in Gary Solon, Steven Haider and Jeff Wooldridge. Here’s their gated 2015 JHR. And if you look at the third point about weighting and “unmodeled heterogeneous treatment effects” in the abstract, I’m sure Bacon is right. There are worse things, though, than thinking those guys’ thoughts after them a decade later.
But first, an experiment! I shall now have a robot throw a coin into the sky three times. Each time, this robot will observe the top-facing side of the coin, and if it is the head of a person, he will make a note that it was so. But if it isn’t, and shows perhaps an eagle or a building, then he will make a note of that as well. Then, based on however many times that happens out of three attempts, this robot will announce whether or not the coin has chosen for this post to be behind a paywall.
Tails! The coin has spoken! And with that, we shall begin! Consider becoming a paying subscriber anyway, so that you too can read on a regular basis about weighting in causal inference, the experimental evidence on being nice to robots, and, best of all, my feelings!
Introduction
I have had an absolutely awful sleep problem recently. I’m sure it’s indicative of something, but I’m trying to just enjoy life and not think too much about it. In my sleepless nights, I tend to wake up very early, having dreamt about something or other, and that something or other then becomes this substack. Which means I’ve been dreaming about weighting and causal inference.
So remember how I’ve been saying for a while that the first step of a diff-in-diff checklist is to “define your target parameter”? Usually, I think we just write down ATT and move on. But maybe because I got stuck on the fact that county-level crime regressions will often look very different from their individual-level counterparts, I found that I wanted to better understand how the ATT for one dataset could be different from the ATT for a more disaggregated version of the same dataset.
So here’s a simple roadmap. This post is about three things. It will explain:
A simple table with six individuals, showing how aggregation works.
Another Texas example illustrating population-weighted versus county-aggregated treatment effects.
A simulation showing how different kinds of selection/sorting into counties (either independent of or dependent on potential outcomes) affect the weighted causal parameter you’ve defined.
Step 1: A Simple Table Example
I made a table to show you six people living in a concealed carry location. A concealed carry location is one where a person is or is not exposed to a law allowing themselves and others to carry a gun in public, concealed on their body. Y₁ means a person’s death from homicide if treated, and Y₀ means their death from homicide if not treated with concealed carry exposure. And the treatment effect associated with concealed carry, (Y₁ - Y₀), can be -1 (concealed carry reduced homicide), +1 (concealed carry caused their death from homicide) or 0 (it had no effect).
Notice that the average treatment effect, ATE, is the average of column D, labeled “delta”. Also notice that each person has a weight of 1/6. But if we wanted to calculate the average treatment effect for those who live in concealed carry locations (which is Alan, Betty and Chad), then it’s the average of their treatment effects, which is (1-1+0)/3, which is zero. I took the liberty of making this as a picture too.
The point is that we can define different causal parameters using simple means that summarize treatment effects over different populations. But each time we calculate a different average for a different population, we use different weights.
Now I want to introduce counties. That is, each person now lives in a community called a “county”: a smaller aggregated region where people live. In this example, there are 10 people and 2 counties.
As before, each person has a treatment effect, and those treatment effects can be -1, 0 or +1. This is what is meant by “heterogeneous treatment effects”: the unit-level treatment effects can vary.
If I average column D, labeled delta, again, I get (1+0+0+0+1+1+0+1-1-1)/10, which is +0.2. On average, for the average person, concealed carry increases homicides by 1/5th of a person. In other words, it is on net harmful, because its ATE is +0.2, which is greater than 0.
But these people live in different counties. Specifically, the first 8 live in county 1 and the last 2 in county 2. If I take the average treatment effect for each of those counties, then I’m taking the average (1+0+0+0+1+1+0+1)/8 for county 1, which is +0.5 and in yellow, and a second average of (-1-1)/2 for county 2, which is -1 and in turquoise. Then if I average those two numbers, I get (+0.5-1)/2, which is -0.25.
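If you want to check that arithmetic yourself, here’s a minimal sketch in Python; the numbers come straight from column D of the table above:

import numpy as np

# The ten unit-level treatment effects (column D above)
delta = np.array([1, 0, 0, 0, 1, 1, 0, 1, -1, -1])
county = np.array([1] * 8 + [2] * 2)  # first 8 people in county 1, last 2 in county 2

person_avg = delta.mean()  # each person weighted 1/10 -> +0.2
county_avgs = [delta[county == c].mean() for c in (1, 2)]  # +0.5 and -1.0
county_avg = np.mean(county_avgs)  # each county weighted 1/2 -> -0.25

print(person_avg, county_avgs, county_avg)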
Thus I get what at first seems like a paradox. If I average over people, that is, the average person’s treatment effect, then concealed carry harms people. But if I average over counties, which were themselves first averages of those same people’s treatment effects, then I find it helps people.
If you are wondering, “Do I need to know this, Scott? This seems esoteric,” I’ll answer that with a different question. Have you been using the Callaway and Sant’Anna diff-in-diff estimator? For those who, like me, have worked with Callaway and Sant’Anna a million times, the person-level average is more like their “simple” average treatment effect and the county-level average is more like their “group” average treatment effect. Though even then, not exactly. Their simple average is the simple average over all group-time ATTs, and the group average first averages the group-time ATTs within each group and then averages those group averages.
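To make that distinction concrete, here’s a toy sketch. The ATT(g,t) numbers are made up, and the code follows my verbal description above rather than Callaway and Sant’Anna’s exact procedure (their actual “simple” aggregation also weights by group size), so treat this as illustration, not their estimator:

import pandas as pd

# Hypothetical group-time ATTs for two treatment cohorts ("groups")
att_gt = pd.DataFrame({
    "group": [2004, 2004, 2004, 2006, 2006],
    "att":   [0.5, 0.7, 0.9, -0.2, -0.4],
})

# "Simple"-style: one average over every group-time ATT
simple_avg = att_gt["att"].mean()  # 0.3

# "Group"-style: average within each group first, then across groups
group_avg = att_gt.groupby("group")["att"].mean().mean()  # (0.7 + -0.3) / 2 = 0.2

print(simple_avg, group_avg)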
Step 2: A Texas Simulation
So then: I just showed, using a simple table, the role of weighting in summarizing unit-level treatment effects. If we calculate the ATE per county first, and then take the mean across counties, we might get a different number than if we had taken the mean over all individuals, simply because the weights were different. This is, I think, what Solon, Haider and Wooldridge meant in their 2015 JHR on weighting, which I’ll reproduce here again:
Notice the last point: unmodeled heterogeneity of effects. Unmodeled in the strictest Neyman-Rubin sense, meaning we are entirely agnostic about what the treatment effect for each person can be. That level of agnosticism makes weighting subtle but important.
Now I’d like to show a simulation. I’ve written out the details of that simulation here. This example is about Texas, where I live. I’ve taken some liberties with the population sizes, but not too many. This is mostly right: nearly half the state’s population lives in just five counties.
When I take the average of those treatment effects, weighted by their county populations, so that I get the average treatment effect for the average person, the average for the entire state is +2. And that is because half the population lives in 5 counties, each comprised of 3 million people, each with an ATE of +5. The rest of the state’s 15 million live in 249 counties, with the population distributed uniformly across them, where each county has an ATE of -1.
But if I had taken the average of each county’s ATE, not the average of the average person’s treatment effect (which is what the +2 is), I’d get a county-averaged ATE of -0.88.
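Here’s that arithmetic as a minimal sketch, using the stylized county structure from above (the small-county population is the remaining 15 million split evenly, rounded down):

import numpy as np

# 5 large counties with ATE +5, 249 small counties with ATE -1
county_ates = np.array([5.0] * 5 + [-1.0] * 249)
county_pops = np.array([3_000_000] * 5 + [15_000_000 // 249] * 249)

# Average for the average person: weight each county by its population
person_ate = np.average(county_ates, weights=county_pops)  # ~ +2.00

# Average for the average county: weight each county equally
county_ate = county_ates.mean()  # ~ -0.88

print(f"{person_ate:.2f} vs {county_ate:.2f}")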
Neither is wrong. They are both right for what they are measuring. One is the average for the average person. The other is the average for the average county.
If it’s not obvious yet, the reason I am writing about this repeatedly is that we sometimes get data at the population level (e.g., the census, coroners’ reports, the NLSY97) and we sometimes get data at the city, county, firm, or school level. The target parameter in each dataset, if you’re not careful, might be different unless you make intentional efforts to reweight. But even when you reweight, perhaps by group populations, the point still remains: doing that is simply a way to create different weights from the original weights implied by your dataset. And that reweighting is only relevant if your goal is the parameter associated with that other set of weights.
Step 3: A Simulation of Aggregation & Sorting, Without and With Independence
But in each of these cases, notice that heterogeneity in the treatment effects was playing a role in why the group ATT and the simple ATT were different. Part of it was also the differences in the sizes of those grouped units, but part of it was the heterogeneity itself.
Next I want to explore a third thing that isn’t just about heterogeneity and isn’t just about population sizes across groups. It’s about selection and the choices that people make as to where to live in the state.
Charles Tiebout was an economist whose work influenced something in public finance called Tiebout sorting. It’s sometimes also called people “voting with their feet”. It means I might move to an area because the schools and other amenities are better there. And since those areas finance those amenities with various taxes, there can be all kinds of equilibrium residential choices relevant for public financing.
Without sorting on potential outcomes
In my example, I’ll consider two situations. First, I’ll assume Tiebout sorting that is independent of potential outcomes, Y₀ and Y₁. Since the sorting into counties is independent of potential outcomes, it’s independent of treatment effects, and thus the ATE will be the same across counties, because counties will on average have the same mean Y₀ and Y₁ and therefore the same mean treatment effects. This is the consequence of independence, which is another way of saying people are choosing where they’ll live with coin flips.
Here’s the python code on Google Colab, as well as below. I call it “Tiebout and Roy meet but do not have a baby” because I’m tired. The name is because Roy sorting is sorting on treatment effects, and here I’m saying people are sorting, but their reason has nothing to do with the returns to concealed carry. Which actually doesn’t seem crazy, though at the same time I have a good friend who turned down a job at an Ivy League school because their partner couldn’t bring their guns. So I don’t think it is crazy to say some people sort based on gun ownership, though that is not the same as saying people sort on the returns to concealed carry. That’s a different thing. Anyway, this is Tiebout and Roy don’t have a baby sorting:
# Tiebout and Roy meet but do not have a baby.
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(123)

# Define population parameters
num_people = 30_000_000
num_large_counties = 5
num_small_counties = 249
total_counties = num_large_counties + num_small_counties

# Define county populations
large_county_pop = 3_000_000  # 3M each for large counties
small_county_pop = int(15_000_000 / num_small_counties)  # Remaining split roughly equally

# Generate individual-level potential outcomes.
# County assignment below ignores these, so sorting is independent of them.
y0 = np.random.normal(10, 2, num_people)
treatment_effect = np.random.normal(2, 1, num_people)
y1 = y0 + treatment_effect
individual_ate = y1 - y0

# Assign people to counties sequentially
counties = np.zeros(num_people)
current_idx = 0

# Assign large counties
for i in range(num_large_counties):
    counties[current_idx:current_idx + large_county_pop] = i + 1
    current_idx += large_county_pop

# Assign small counties
for i in range(num_small_counties):
    counties[current_idx:current_idx + small_county_pop] = i + num_large_counties + 1
    current_idx += small_county_pop

# Integer division leaves a few hundred people unassigned;
# put them in the last small county so everyone has a county
counties[current_idx:] = total_counties

# Calculate ATEs
overall_ate = np.mean(individual_ate)
county_ates = np.array([np.mean(individual_ate[counties == i])
                        for i in range(1, total_counties + 1)])
county_level_ate = np.mean(county_ates)

# Sort counties by ATE for visualization
sorted_indices = np.argsort(county_ates)
sorted_ates = county_ates[sorted_indices]

# Create plot
plt.figure(figsize=(12, 6))
plt.bar(range(total_counties), sorted_ates,
        color='blue', alpha=0.7, edgecolor='black')

# Add reference lines
plt.axhline(overall_ate, color='black', linestyle='dashed', linewidth=2,
            label=f'Overall ATE = {overall_ate:.2f}')
plt.axhline(county_level_ate, color='red', linestyle='dashed', linewidth=2,
            label=f'County Average ATE = {county_level_ate:.2f}')

# Customize plot
plt.xlabel("County Index")
plt.ylabel("Average Treatment Effect (ATE)")
plt.title("County-Level Average Treatment Effects")
plt.legend()
plt.grid(True, linestyle=':', alpha=0.3)
plt.show()

# Print the means to verify they're close
print(f"Overall ATE: {overall_ate:.4f}")
print(f"County-level Average ATE: {county_level_ate:.4f}")
First of all, I love python figures the most of all figures. I wish I had a better sense of aesthetics to articulate what it is, but these figures just feel softer. Something about the font and the edges.
But secondly, notice that the overall or “simple” ATE is the same whether I average over the entire population or first average within counties and then average those county ATEs. Notice in the code, too, that there are heterogeneous treatment effects. And there is weighting. So this isn’t because I assumed constant treatment effects. Rather, it’s because I assumed people picked their homes by flipping coins across the state of Texas, and when they did that, the mean potential outcomes are the same in every county. As a result, averaging across units, or aggregating up to counties and then averaging across counties, gives the same value, despite these technically being different parameters.
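To put the independence logic in symbols (a small sketch, where C is the county a person chooses and Δ = Y₁ - Y₀ is their treatment effect): if C is independent of (Y₀, Y₁), then E[Δ | C = c] = E[Δ] for every county c. Every county has the same expected ATE, so any weighted average of the county ATEs, whether population-weighted or equal-weighted, lands on the same number as the overall ATE.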
With sorting on potential outcomes
Okay, now let’s have people sort. Man, was this a pain to figure out. This is “Tiebout and Roy meet, fall in love, and have a family.”
# Tiebout and Roy meet, fall in love, and have a family.
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(123)

# Define population parameters
num_people = 30_000_000
num_large_counties = 5
num_small_counties = 249
total_counties = num_large_counties + num_small_counties

# Define county populations
large_county_pop = 3_000_000  # 3M each for large counties
small_county_pop = 60_241  # ceil(15M / 249), so the assignments below cover everyone

# Generate treatment effects with two different distributions
# For positive effects: Normal distribution with mean 5 and SD 1
positive_effects = np.random.normal(5, 1, num_people // 2)
# For negative effects: Normal distribution with mean -1 and SD 0.5
negative_effects = np.random.normal(-1, 0.5, num_people - num_people // 2)

# Combine the effects
treatment_effect = np.concatenate([positive_effects, negative_effects])

# Sort individuals by treatment effect in descending order: the Roy-style sorting
sorted_indices = np.argsort(treatment_effect)[::-1]

# Initialize county assignments
counties = np.zeros(num_people)

# Assign the 15M people with the highest treatment effects to the large counties
large_county_total = num_large_counties * large_county_pop
counties[sorted_indices[:large_county_total]] = np.repeat(
    np.arange(1, num_large_counties + 1), large_county_pop)

# Assign the remaining people to the small counties
# (the trailing slice trims the rounding overhang off the last county)
remaining_indices = sorted_indices[large_county_total:]
small_county_assignments = np.repeat(
    np.arange(num_large_counties + 1, total_counties + 1),
    small_county_pop
)[:len(remaining_indices)]
counties[remaining_indices] = small_county_assignments

# Calculate ATEs
overall_ate = np.mean(treatment_effect)  # Population average
county_ates = np.array([np.mean(treatment_effect[counties == i])
                        for i in range(1, total_counties + 1)])
county_level_ate = np.mean(county_ates)  # Average of county averages

# Sort counties by ATE for visualization
county_order = np.argsort(county_ates)
sorted_ates = county_ates[county_order]

# Create plot
plt.figure(figsize=(12, 6))
plt.bar(range(total_counties), sorted_ates,
        color='blue', alpha=0.7, edgecolor='black')

# Add reference lines
plt.axhline(overall_ate, color='black', linestyle='dashed', linewidth=2,
            label=f'Overall ATE = {overall_ate:.2f}')
plt.axhline(county_level_ate, color='red', linestyle='dashed', linewidth=2,
            label=f'County Average ATE = {county_level_ate:.2f}')

# Customize plot
plt.xlabel("County Index")
plt.ylabel("Average Treatment Effect (ATE)")
plt.title("County-Level Average Treatment Effects")
plt.legend()
plt.grid(True, linestyle=':', alpha=0.3)
plt.show()

# Print values to verify
print(f"Overall ATE: {overall_ate:.4f}")
print(f"County-level Average ATE: {county_level_ate:.4f}")
print(f"\nFirst 5 county ATEs: {county_ates[:5]}")
print(f"Last 5 county ATEs: {county_ates[-5:]}")
Anyway, notice that the people with the negative treatment effects sort into the rural (small) counties, so those county ATEs are all negative, while the people with the positive treatment effects sort into the 5 largest counties. The overall ATE is still +2 like before, but now when I average over the counties, the county-averaged ATE is -0.88.
So you see what I mean? Nothing is biased. These are clear, well-defined causal parameters, and the only reason the overall ATE and the county-averaged ATE differ from one another is the combination of sorting on treatment effects and heterogeneous treatment effects.
It’s both, I mean. Heterogeneous treatment effects alone are not enough to make the simple ATE and the averaged county ATEs differ, because when Tiebout sorting does not follow the Roy model, the county-averaged ATE and the overall simple ATE are the same number. The weights in that sense “don’t matter”.
But to quote that Solon, Haider and Wooldridge paper I cited from the 2015 JHR: what if you had “unrestricted heterogenous treatment effects” and Roy-like sorting on treatment effects? People for whom the returns to concealed carry improve their lives live in the rural counties, in other words, and people for whom the returns to concealed carry harm their lives live in the five biggest counties. Then the overall ATE is still +2, which is the average effect of concealed carry for the average person. But the county-averaged ATE is the average over those county-level ATEs. And now they differ.
Final Thoughts
It’s not quite right to say, I think, that “the weights don’t matter without sorting on the treatment effects”. I mean, it’s right and it’s not right. It’s right in that when sorting is independent of treatment effects, you get the same number whether you average over 30 million people or first average those people within their home communities and then average across those 254 communities. So in that sense the weights “don’t matter”. They don’t matter if you want the overall ATE, I guess, and don’t want to weight. Maybe you don’t want to weight because you don’t have population data by county or something? I don’t know, but that’s the only conclusion I know to draw. But the weights will “matter” if people sort into those communities on treatment effects, because then those averages won’t be the same.
But I guess I don’t like that framing, because what I would rather just say is: say what your parameter is. Say up front that you want the average over this group of units and not that group of units. It’s literally analogous to the choice you’re making in Callaway and Sant’Anna when you specify “group” versus “simple” in the R syntax, or “csdid_estat simple” versus “csdid_estat group” in Stata. It’s kind of the same thing for all pedagogical purposes, because if what I’ve written here feels unnecessarily confusing, then how are you going to explain those two averages in CS, let alone choose between them?
So that’s it. I might actually finally be done. I think these simulations were helpful for getting to the bottom of this. Now whether I could actually teach this in a class, that’s a whole other thing. But I think I’m going to try, more and more. I’m a keynote speaker this year at a conference on causal inference, so I’m toying with the idea of just focusing on this. I sort of got into this in the new edition of the mixtape, too, but not at this level. Maybe by the third edition I’ll know how to land the plane better. But hopefully this has helped at least somewhat.
But consider becoming a paying subscriber so that you see all my posts every time, as opposed to, in expectation, only half the time!