Step 1: More on How Aggregation Can Change Your Targeted Causal Effects: A Lesson from Counties, States, and Individuals
More on diff-in-diff checklist
I’m constantly thinking about the averaging of treatment effects. And when I say constantly, what I mean is “a lot”. I do so because I teach this material a lot and tend to go on tangents when I do. And this has been one of those. It’s like I’ll see this really obvious point while teaching something, like how an average treatment effect is literally a sum of unit-level effects divided by the number of units in a spreadsheet, and then the idea that the weights are based on the number of units is all I can think about. And I do so because the idea of a “unit” changes depending on whether it’s a dataset of people, counties or states. A county of people’s treatment effects versus the people themselves: will it ever matter that those aren’t the same units when undertaking a study about those same people? When should we care and why?
Today, I want to explore why estimates of the same treatment effect can change dramatically depending on whether we measure them at the individual, county, or state level. This issue crops up frequently in policy research—especially in areas like minimum wages, education spending, and public health—but I’ll illustrate it using a hypothetical example related to gun laws.
But first Cosmos will flip a coin — best two out of three — to determine the paywall. Drum roll please!
Tails! I like the white labeling of the frequencies, Cosmos. Nice use of purple too. And I appreciate the enthusiasm in the subtitle. You are a language model full of surprises.
So buckle up and consider becoming a subscriber. This continues my writing about designing a diff in diff, where step 1 is simply to be sure you know what your target causal effect is.
Here are the key takeaways from this Substack.
Causal parameters are averaged treatment effects, but which parameter you get depends on what we’re averaging over, whether individuals, counties, or states. The unit of observation matters because we average over units.
Different weighting schemes yield different ATEs when treatment effects are heterogeneous. The simple average of a population is not the same number as the simple average of communities where the people in that population reside, and even more so if they sort into those communities based on their potential outcomes.
Many empirical studies assume that how treatment effects are weighted doesn’t matter, which means they’re probably assuming constant treatment effects or independence with respect to geographic units. But results can flip when switching between state-, county-, or individual-level estimates. This is particularly a problem when people compare studies, or use different data sets as robustness checks in a single study, expecting them all to give the same answer.
Defining the target parameter is crucial: Do we want to know an average effect for the average person, the average county, or the average state? Will all data sets give me the ability to get that? If so, how, and if not, why?
OK, let’s get started.
The Fundamental Insight: What Are We Averaging?
In the potential outcomes framework, we define the unit-level treatment effect for each individual i as:
$$\delta_i = Y_i(1) - Y_i(0)$$
where Y_i(1) is the outcome for individual i if treated, and Y_i(0) is the outcome if untreated. The Average Treatment Effect (ATE) is just the mean of these individual treatment effects:
$$\text{ATE} = \frac{1}{N}\sum_{i=1}^{N} \delta_i$$
Notice the ATE is an average, but it’s the average over those units, which here are individuals. When we estimate an average treatment effect, we use data, and what is easy to forget is that if you are taking weighted averages over the units in your dataset, then different kinds of units (people, counties, states) come with different weights. Therefore the ATE can differ depending on how much the data was aggregated before you got it, even if you’re ultimately studying the same people.
Think of it this way. If we analyze individual-level data, we are typically averaging over individuals. But if we use county-level data, we’re no longer averaging over individuals; we’re averaging over counties. Ask yourself: are those mathematically required to be identical? Why if so? Why not if not? The unit in a data set directly affects what kind of average you are taking.
In a world where treatment effects are homogeneous (i.e., every person experiences the same causal effect), this stuff about weighting wouldn’t even matter. You identify the treatment effect for one person, you’ve done so for all. But in modern causal inference, practitioners often are unwilling to say what the distribution of those treatment effects is. Many people prefer to be agnostic. As such, weighting will matter, and yet weighting seems to be the least discussed thing I’ve seen.
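To make the homogeneity point concrete, here is a minimal Python sketch. The county labels, populations, and effect sizes are all made up for illustration. It compares an average that weights counties by population (averaging over people) with one that weights every county equally (averaging over counties).

```python
# Hypothetical counties: (population, average treatment effect among residents).
# All numbers are invented for illustration.
homogeneous = {"A": (100, 2.0), "B": (1_000, 2.0), "C": (10_000, 2.0)}
heterogeneous = {"A": (100, -1.0), "B": (1_000, 0.0), "C": (10_000, 2.0)}


def person_weighted(counties):
    """Average over people: each county counts in proportion to its population."""
    total_pop = sum(pop for pop, _ in counties.values())
    return sum(pop * effect for pop, effect in counties.values()) / total_pop


def county_weighted(counties):
    """Average over counties: each county counts equally, whatever its size."""
    return sum(effect for _, effect in counties.values()) / len(counties)


for label, data in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    print(label, round(person_weighted(data), 3), round(county_weighted(data), 3))
```

With identical effects everywhere, both averages return the same number. Once effects differ across counties of different sizes, the two averages diverge.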
Three different weighting schemes
Now let’s take the concrete example I have been considering. Imagine we want to estimate the average effect of a concealed carry gun law on firearm-related deaths. But we have three possible datasets:
Individual-level data from coroner reports.
County-level data aggregating those same individual deaths up to the county level.
State-level data aggregating up even further.
There is an ATE for every one of those datasets and the weights are always 1/N where N is the size of the dataset. For instance, imagine there are 350 million people, living in 3,000 counties spread across 50 states. Then consider these:
Individual-level ATE: The expected treatment effect for the average person. The weight in the ATE is the same for everyone — it’s 1/350 million.
County-level ATE: The expected treatment effect for the average county. The weight in that ATE is 1/3000.
State-level ATE: The expected treatment effect for the average state. The weight is the same for every state — it’s 1/50.
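The three averages can be sketched in a few lines of Python. This is a toy thought experiment, not an estimator: it assumes we could see each person’s treatment effect, which we never can in real data, and the ten people, three counties, and two states are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level records: (state, county, treatment effect).
# Individual effects are unobservable in practice; this is a thought experiment.
people = (
    [("S1", "c1", 1.0)] * 8      # large urban county
    + [("S1", "c2", -1.0)] * 1   # small rural county
    + [("S2", "c3", -1.0)] * 1   # small rural county in another state
)

# Individual-level ATE: every person has weight 1/N.
ate_individual = mean(effect for _, _, effect in people)

# County-level ATE: average within each county, then weight each county equally.
by_county = defaultdict(list)
for state, county, effect in people:
    by_county[(state, county)].append(effect)
ate_county = mean(mean(effects) for effects in by_county.values())

# State-level ATE: average within each state, then weight each state equally.
by_state = defaultdict(list)
for state, _, effect in people:
    by_state[state].append(effect)
ate_state = mean(mean(effects) for effects in by_state.values())

print(f"individual: {ate_individual:.3f}")
print(f"county:     {ate_county:.3f}")
print(f"state:      {ate_state:.3f}")
```

In this made-up population the individual-level ATE is 0.6, the county-level ATE is about -0.33, and the state-level ATE is about -0.11. The people are the same in all three; only the sign and size of the average change with the unit.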
A Hypothetical Thought Experiment
What if the people with the negative treatment effects for a gun law tend to live in the small areas but the people with the positive treatment effects live together in large areas? For the scenario where we want the average over all individuals, such sorting is irrelevant. Johnny’s treatment effect is a -1. If he lives in a concealed carry area, he dies, and if he lives in an area without one, he lives, which here would be a -1. Averaging over a population won’t change that about him. His treatment effect does not depend on aggregation.
But what if Johnny lives alone in his own county? Then his county will have a -1. His weight is different then than when he lives with others even though in the overall average, it doesn’t matter where he lives. You would only want to weight his county up, more or less, if your study was about counties, and not people.
So, imagine a population with 10 million people. Each individual in that population has a treatment effect. If you want to know the average treatment effect for that population, it would be simply a weighted average of those individual treatment effects where the weights are the same for each person – 1/10 million.
But now imagine that your data set is that 10 million person community broken up into 200 counties. This is where heterogeneous treatment effects start to rear their ugly head. If the individual treatment effects are not the same across the population, and if individuals systematically sort into certain counties based on their potential outcomes, and you treat the county as the unit, even implicitly, then the average treatment effect across the counties will be different (maybe very different) from the average treatment effect across all people, even for the same state. All simply because of what you were averaging over and the implicit weights used for that averaging.
Like I said, say that in 199 of these counties, the ones which are rural and sparsely populated, the gun law has a small negative effect (e.g., reducing firearm-related deaths slightly). But in the largest urban county, with maybe 8-9 million residents, the law has a strong positive effect (increasing firearm-related deaths substantially). In the county-level average, that large urban county has a weight of 1/200, whereas with individual-level data its residents would together have carried a weight closer to 0.8-0.9. It doesn’t matter that its population is massive in size relative to the other counties if you are taking simple averages over the counties. And that is because when working with that county data, you were averaging over 200 counties, not 10 million people. When you average over the counties, you’re saying that counties are important. But why are they so important that you’d average them and not average over all people?
Let’s then call the two types of averaging the individual-level ATE, which uses as its weight 1/N, and the county-level ATE, which uses as its weight 1/Nc, where N is the entire population size and Nc is the number of counties in that population. We often think of heterogeneous treatment effects as impacting the numerator, that is, what we are averaging. But while heterogeneous treatment effects don’t change the weighting, they make weighting a more relevant topic, as not all weighted averages are the same.
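Here is the arithmetic of that thought experiment as a short Python sketch. The rural county size (7,500 residents each) and the two effect magnitudes (-0.1 and +2.0) are assumptions chosen only to match the story; they are not estimates of anything.

```python
# Hypothetical numbers matching the thought experiment above.
N_RURAL = 199
RURAL_POP = 7_500            # residents per rural county (assumed)
TOTAL_POP = 10_000_000
URBAN_POP = TOTAL_POP - N_RURAL * RURAL_POP   # 8,507,500 residents

RURAL_EFFECT = -0.1          # small negative effect (assumed magnitude)
URBAN_EFFECT = 2.0           # strong positive effect (assumed magnitude)

populations = [RURAL_POP] * N_RURAL + [URBAN_POP]
effects = [RURAL_EFFECT] * N_RURAL + [URBAN_EFFECT]

# Weight placed on each county under the two averaging schemes.
person_weights = [pop / TOTAL_POP for pop in populations]
county_weights = [1 / len(populations)] * len(populations)

ate_individual = sum(w * e for w, e in zip(person_weights, effects))
ate_county = sum(w * e for w, e in zip(county_weights, effects))

print(f"urban county weight, averaging over people:   {person_weights[-1]:.3f}")
print(f"urban county weight, averaging over counties: {county_weights[-1]:.3f}")
print(f"individual-level ATE: {ate_individual:.4f}")
print(f"county-level ATE:     {ate_county:.4f}")
```

Under these assumed numbers the urban county carries about 85% of the weight when averaging over people but only 0.5% when averaging over counties, and the two ATEs have opposite signs.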
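Written out, the two averages look like this. The notation is my own shorthand, not something taken from a particular paper: \delta_i is individual i's treatment effect, n_c is the population of county c, and \bar{\delta}_c is the average treatment effect among the residents of county c.

```latex
\text{ATE}_{\text{ind}}
  = \frac{1}{N}\sum_{i=1}^{N}\delta_i
  = \sum_{c=1}^{N_c}\frac{n_c}{N}\,\bar{\delta}_c,
\qquad
\text{ATE}_{\text{county}}
  = \frac{1}{N_c}\sum_{c=1}^{N_c}\bar{\delta}_c,
\qquad
\bar{\delta}_c = \frac{1}{n_c}\sum_{i \in c}\delta_i

\text{ATE}_{\text{ind}} - \text{ATE}_{\text{county}}
  = \sum_{c=1}^{N_c}\left(\frac{n_c}{N} - \frac{1}{N_c}\right)\bar{\delta}_c
```

The gap is a sum of county-average effects multiplied by how far each county's population share sits from an equal share. It is zero when effects are homogeneous, or more generally when county size and county-average effect are uncorrelated across counties, and it grows when large counties systematically have different effects than small ones.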
When Can Aggregation Be Considered to Bias a Causal Effect?
Economists and statisticians have long studied how aggregation can bias causal inference, sometimes leading to famous misinterpretations. Here are three famous types of puzzles based on aggregation.
Simpson’s Paradox: When subgroup effects reverse after aggregation, often because treatment effects differ by group composition (Pearl, 2004, “Simpson’s Paradox: An Anatomy”).
Ecological Fallacy: When researchers infer individual causal effects from aggregated data, leading to misleading conclusions (Subramanian et al., 2009).
Composition Bias: When aggregating data changes the weight given to different treatment effects, altering the estimated ATE (Xie, 2013, “Population Heterogeneity and Causal Inference”).
I will probably do an explainer on composition bias, as it’s slightly different to what I’m talking about here. But for now I’ll just put a pin in it. What I wanna do now is just give some empirical examples of things that might be examples of what I’m talking about with the caveat in mind that I’m not 100% positive. But I need some examples :-).
Empirical Examples of Aggregation in Policy Studies
I’ve been trying to track some empirical papers that aren’t gun law papers to try to figure out how I might explain some of these. These may not be exactly on point, but these are what I have today. So here are some examples of possible real-world cases from different fields:
Education studies:
Hanushek, Rivkin, and Taylor (1996) showed that state-level analyses of education spending exaggerated its positive effects, while student-level analyses showed much smaller effects. Which was right? Maybe they both were. Maybe aggregation was responsible for the discrepancy.
Minimum Wage Studies:
State-level studies can find negative employment effects (Meer and West, 2016), while county-pair studies (Dube, Lester, & Reich, 2010) can show minimal effects. Other studies use firm-level data, like Manning (2021) and Clemens (2021), and may find different effects still.
Often the unstated assumption is that the lack of harmony in the findings is due to design or data quality. But it’s also possible, at least in principle, that each dataset implicitly has an aggregated treatment effect that differs because the units of observation are measured in different ways, causing weights to differ, causing some values to be weighted higher or lower relative to others. Under homogeneous treatment effects, those details don’t matter. But what if treatment effects are wildly heterogeneous? Then it can make interpretability across studies a challenge.
Labor Supply Elasticities:
Micro-level studies on wage changes often find smaller effects than macroeconomic studies using aggregate data, highlighting the challenge of reconciling individual and market-level causal responses (Chetty et al., 2013). Reconciling these differences between highly aggregated supply elasticities and disaggregated ones was, in fact, the motivation of Marie Connolly’s very interesting 2008 JOLE paper.
Defining Target Parameters is a Descriptive Task, but Choosing Among Them is an Ethical Task
As I understood it in college, there are three main branches of philosophy at their roots: epistemology (theory of knowledge), ontology (theory of reality) and ethics (theory of what should be). Within ethics is where we put theories of art and aesthetics more generally because it’s within ethics that we typically make judgment calls.
It’s within epistemology and metaphysics where we might define average treatment effects using different weights. But it’s within ethics where we would say one is more relevant or more preferred than another. That’s my point — there are different types of mistakes and not all mistakes come down to mismeasurement or failures in parallel trends. Sometimes the mistakes are confusing the weights.
When you set out to estimate average effects, you will be implicitly engaging in ethical tasks because you will say one thing should be the target but not another. You will be suggesting that one weight is better than another. But then why? Why do you say that averaging over the individuals is better than averaging over counties? It is not self-evident it is. It only is if counties don’t matter as interesting objects of study themselves, but is that always true? Is it? Why is it?
I guess what I’m saying is that technically all averaged treatment effects are true — they just may not all be our targets. Not all of them are relevant to you. But then what makes one parameter relevant but not another? What criteria should we use? When should we care about the ATE that weights people the same versus the ATE that weights based on residence? Doesn’t it depend, not on the estimator or the design, but the question?
In one sentence, what makes one parameter a target, but not another, is one’s own desires and one’s own curiosities. I have always been sympathetic to that answer to just about any question like this. I do what I do because it is what I want to do. And so if you want to know the average effect of some treatment over all individuals that’s your business. And if someone else wants to know the average effect of a treatment over all communities and thinks that communities are relevant units of observation, then that’s their business too.
But let’s say you really are indifferent, and don’t know which parameter you want, then probably I would suggest you entertain this thought experiment: which policy-maker do you hope will make a decision based on your study’s findings? Is it the governor? Does the governor want to help the most people? Can she only control state laws? Then that’s an averaging over individuals.
But will the governor allow each county to choose its own gun laws? Do local communities matter too? Then that may elevate the value of the concept of the average community in research and policy making.
I think the reason I keep circling around this is that I keep wondering — when would we want to know the average treatment effect measured at the county level but not the state? And all I keep thinking is that it would seem like it must be the case where counties are worthwhile objects of interest because if they aren’t, then it seems like it’s not. But I’m still puzzling over this to be honest.
For now I simply say that there are many ways to weight treatment effects. They are not the same, they don’t give the same answer and they will happen naturally in different ways depending on the structure of your data, and you have to think about this way before you engage in estimation as you have to decide what your own weights will be, and then make choices accordingly.
Hi Scott, I have one question and one clarification request.
First for clarification, in the "Minimum Wage Studies" section, you write that under heterogeneous treatment effects, it does not matter but matters in wildly heterogeneous effects. Do you mean to say homogenous on the prior or is it about the extent of heterogeneity? I ask this clarification because if the extent of heterogeneity is a crucial factor knowing the critical value for heterogeneity might be good for future research.
My question is, I have been thinking about a scenario where I do not see an effect on an aggregate (sum) measure but do see one if I weight it by population. My thought process is that if, for example, the outcome is a number of visits to the hospital, an increase of 10 visits might mean very different things for different states, whereas 10 visits per population are the same no matter the size. Do you think, therefore, aggregation matters when the "treatment effect" that we calculate has different implications (from the perspective of seriousness to success/failure) for different states? In other words, the choice of outcomes dictates if aggregation matters.
In panel regressions, when using aggregate data (say at the state level) the ATE is wrong if the weighting is wrong. Typically, if one weights by state population, one is overweighting, and the ATE is dominated by treatment effects in just a few large states. If one does not weight, the small states dominate. Since the treatment effects are certain to differ between states, ATEs are likely to depend on the weight used. It is easy to see why a regression with aggregate data would reach different conclusions from those of an individual-level analysis if the wrong weights are used in the former. The article examples you give are behind paywalls, so I cannot tell what the weights were and how they were justified, but I'll bet they did not weight or weighted by population, such that the ATE in the aggregate-level study is wrong.
The proper weight (a function of population) should be one that gives equal influence to states of different sizes. To do that one can use the Breusch-Pagan test, exploring different weights until finding the one with the least heteroskedasticity. Better yet, one can add up the dfbetas on the treatment variable for each state and try different weights until the dfbetas are close to the same for each state. In my experience, the proper weight differs greatly from regression to regression. It partly depends on the size of the dependent variable, since there is more relative variation when there are small numbers. Thus the proper regression weight in a murder regression is population to the 1.2 power, while it is population to much smaller powers with other crimes, for example population to the 0.3 power in a robbery regression.
In all, this is a data issue. Getting the proper weight so that neither small nor large states have excessive influence in the particular data set is not one that can be determined by theory or mathematical analysis. Lastly, the varying views on proper weighting give researchers degrees of freedom - different weights get different results - and sometimes it looks like a particular weight was used to achieve desirable results.