Part 1: A Selected History of Quantitative Causal Inference
The Extraordinary Role of Data Workers and Data Theorists at Princeton and Harvard
This is Part 1 of a three-part Substack series that I am calling “A Selected History of Quantitative Causal Inference”. I previously had all of it in one super long scroll, but I think it’s better to break it up into three parts so it’s more digestible. This first part discusses the philosophical and statistical origins of the potential outcomes model, as well as its historical link to the randomized experiment. Some of you may know this by heart, but I still think it’s a good essay to give to new students who are learning causal inference for the first time, particularly if they need something they can sink their teeth into as they begin working with terms, concepts and models that are perhaps a little esoteric to them.
Prologue
What are stories and why do we tell them? And what is a true story, or are stories ever true? What even is truth? I wonder this a lot when I consider that a hundred people could watch the same event and all come away reporting one hundred different things out of the nearly infinite number of things that occurred from the nearly infinite perspectives that existed. Are any stories, therefore, ever truly true? Yes and no.
Consider the famous line by the statistician George Box, who said, “All models are wrong but some are useful.” What is a model, though, if not a storified reduction of reality into something so simple that others can understand it? All models are wrong but some are useful. But when you take something and reduce it down to something simpler, you necessarily leave things out. Doesn’t that mean the model isn’t true? And the things you leave out — what if they are important, but to include them would make the story more difficult to understand? To Box, usefulness, not mere accuracy, is at least part of the scientist’s job.
Or consider art. Is art true? The literary theorist Victor Shklovsky once compared art to a man who walks up a mountain without shoes. At first, he can feel the rocks on his feet, and they are sharp and they hurt. At the top, he not only can’t feel the rocks — he can’t even feel his feet. They are numb from reality. The purpose of art, according to Shklovsky, is to make the rocks feel like rocks again by “defamiliarizing the familiar”. Art, when successful, does this — it takes reality and twists it around, maybe even beyond recognition, so that in seeing it we might see life and reality again. That which we know too well can sometimes be not known at all. But art, by defamiliarizing it, can sometimes turn the light switch back on. In a sense, “all art is wrong, but some is useful.” What makes it useful then? To Shklovsky, it’s useful when the phenomenology of it awakens us to what’s real, but forgotten.
Well, today I want to tell a story. A story about causal inference. It’s a story about notations, some of which made things easier to understand even while describing things that may themselves be unrealistic. But to tell the story, I have to leave things out. Not because they are unimportant, but because I think including everything would confuse the reader, not enlighten them. After all, stories always subjectively select facts from a larger set of subjectively seen facts — something that is utterly unavoidable — and, when successful, the telling helps you better understand the larger story. This is, ironically, my goal in my podcast interviews: by hearing people’s stories, by being present and engaged in the conversations, I hope to hear not only their stories, but my story too. And my hope is that the same can be said of you.
Around a month and a half ago, I was asked to give a keynote at a conference on natural experiments in Melbourne, Australia. I decided to use that opportunity to take my video interviews from my podcast with people like Orley Ashenfelter, Guido Imbens and Josh Angrist, clip them, and tell a story that has been brewing in my mind. I called it “History of Quantitative Causal Inference: Applied Data Workers and Statisticians”, but today I am going to call it “A Selected History of Quantitative Causal Inference: The Extraordinary Role of Data Workers and Data Theorists at Princeton and Harvard”. It’s a selected history, in other words, because I leave out many people; the focus in the end is on two schools and their faculty and students, Princeton’s Industrial Relations Section and Harvard’s statistics and economics departments, in the 1970s, 1980s and 1990s. It has video clips, which I thought might spice this up for you.
But why are these stories, in my mind, so “useful”, to use Box’s original point? Because I want people to see that the movement that begins with Neyman, Fisher, Rubin, Ashenfelter, Card, Krueger, LaLonde, Angrist and Imbens was also, contemporaneously, a shift in “how things were done”: a movement away from the modeling of outcomes and towards what we sometimes call the natural experiment methodologies. It was a movement away from poor empirical work, and a movement towards physical assignment of treatments. It’s the story of passionate and opinionated people from opposite ends of the spectrum, and it’s the story of creative people who were also open minded and possessed calm, “glass half full” personalities that most likely steadied what might otherwise have been stormy waters, helping three people work together and make real breakthroughs. So let’s begin!
Act I: Inventing the counterfactual
Scene 1: Philosophical Origins of a new causality framework
The thing that I want to do first is push you away from thinking of “causal inference” and “causality” as synonyms. Rather, I want you to think of causality as the more general, highest-level philosophical question regarding cause and effect, dating back to antiquity, and of “causal inference” as scientific jargon describing a particular kind of solution to that historic causality question, a solution that began in the 1700s and 1800s in philosophy, moved through the early 1900s in statistics, and concluded in the late 20th century with the work of labor economists, statisticians and econometricians at a few elite American universities. Obviously causal inference is interested in causality — it is its very purpose. But many groups have been interested in causality, too, not just the people I discuss in this Substack, and so I prefer to pretend that “causal inference” is the trademark name of the things I discuss here so as to draw the lines that allow me to tell the story I wish to tell today.
Progress on causal inference ramps up in the 1700s with the writings of the great Scottish philosopher David Hume, but because I want to explicitly focus on a causality concept with a succinct definition, I am going to start with a nineteenth-century moral philosopher and economist named John Stuart Mill, who gave us the following definition of what is now called the “causal effect”.
“If a person eats of a particular dish, and dies in consequence, that is, would not have died if he had not eaten it, people would be apt to say that eating of that dish was the source of his death.” — John Stuart Mill
Mill’s short sentence is important in many ways, but one of them is simply that it defines causal effects at all. We cannot get started trying to realistically ascertain whether two events are causal or coincidental until we at least agree on what causality means and what coincidence means. And Mill’s definition had a certain intuitive self-evidence to it that I think helps explain why the idea he describes ultimately carries the day — the idea of the counterfactual. Let’s summarize what he said with an example. Last night someone somewhere ate a steak, after which they unfortunately succumbed to a cardiac arrest and died. Mill would say that we know the steak caused the heart attack if we could know whether the man would’ve died had he not eaten the steak. Thus to Mill, the definition of the causal effect was not based on correlations or regressions in large populations, but rather on the very simple, though arguably more problematic, idea of comparing two situations to one another: a world where the man ate the steak, and a world where the man did not. And if there was a difference in mortality under those two states of the world, then we know the steak caused the man’s death. But if the man dies both when he does eat the steak and when he doesn’t eat the steak, then we know the steak isn’t the cause of his heart attack.
It’s possible that Mill understood that in order to ground causal claims in sequences of events over time, he would have to directly confront the logical fallacy post hoc ergo propter hoc, or in English, “after this, therefore because of this”. You see, it’s a logical fallacy to base causality on a sequence of events alone. One day I was listening to Justin Bieber in the car and shortly after, the speakers went out. I told the kids “Justin Bieber broke my car”, but even they saw that that claim was unwarranted on the evidence provided. Maybe the speakers would’ve gone out anyway. So how do we skirt around the problem that temporal sequence alone proves nothing? Mill’s answer is the idea of the counterfactual (though he does not use that language): imagine a time travel device in which we could go back in time to the moment the steak was eaten, stop the man from eating it, observe his mortality, and then transport ourselves back home to report on the causal effect.
But it’s interesting to note that while Mill did give us a definition of a causal effect, that definition was, as I alluded to a second ago, ultimately pretty dissatisfying from a pragmatic point of view, because it was based on science fiction. If, in order to know whether the steak caused the heart attack, a person must both eat and not eat the steak, but only one of those events can actually happen, then obviously we cannot say one way or the other. After all, the man did eat the steak last night; he did not not eat the steak. So Mill’s definition, while crisp and tight and intuitive and maybe even self-evident, is a puzzle, because how can this definition ever be practically useful? Without a time machine, anyway, it’s not clear it can help us.
Scene 2: Statistical Origins of a new causality framework
You see, to know if one thing causes another thing, we need two things we all agree on:
A definition of what it means to cause something
A method that helps us confirm whether one thing in fact caused another
John Stuart Mill, as I said, gave a satisfying answer to #1, and while he did attempt to provide some means of verifying causal effects with data, those attempts did not ultimately have the lasting impact that his definition did. But even without a means of knowing, his definition was crucial and basically stuck, even if it had to be rediscovered decades later by someone else.
In 1923, an article by a statistician named Jerzy Neyman was published. While I don’t know the impact that it had broadly in 1923, the impact it has had since then cannot be overstated. Don Rubin, a modern statistician who I will discuss more later, said of this 1923 article:
“Yet, although the seeds of the idea [that causal effects are defined as comparisons between potential outcomes] can be traced back at least to the 18th century [most likely Hume], the formal notation for potential outcomes was not introduced until 1923 by Neyman.” — Don Rubin
Rubin’s opinion is that part of what makes the 1923 article so valuable is similar to why I said Mill’s statement was valuable. Neyman provided notation that made very clear what a causal effect was and was not. And it is the formalized notation itself, ironically, that makes this paper stand out to us in the history of causal inference.
The paper describes the production possibilities of land and its ability to produce different levels of crops (“yield”). Neyman begins with a description of a field experiment with many different plots of land, indexed by k, and many different varieties of fertilizer, indexed by i, that could be applied to the land. Neyman writes:
“… U_ik is the yield on the ith variety on the kth plot…” — Neyman (1923)
Neyman calls this variable “U_ik” a “potential yield”, which inherently distinguishes it from what the econometrician Guido Imbens terms the “realized outcome”, or in this case, the “realized yield”. Potential outcomes versus realized outcomes. They are both outcomes, but somehow they are different. Neyman, though, is describing the potential yield, and i indexes all possible varieties (treatments) of fertilizer that could be applied to all k plots of land. And while a given plot k is only ever going to receive a single fertilizer variety i, Neyman sets it up so that, ex ante, a farmer at least has a choice as to which variety of fertilizer any given plot could receive.
In setting the problem up like this, by distinguishing between what does happen (realized outcomes) and what could happen (potential outcomes), Neyman is introducing Mill’s counterfactual concept into statistics in a way that makes solving the causal inference problem tractable. Hear me out. After defining the potential yields, he collects them as U, the set of all individual potential yields of all plots and all varieties: U_ik with varieties i = 1, …, v and plots k = 1, …, m. These “potential yields” are not the realized yields — they are not, in other words, what any given plot k will ultimately produce once treated. Rather, it’s more like U describes a string-theory-like multiverse of all possible outcomes associated with all possible varieties for all of the plots. Not so much infinite worlds, unless i is continuous, but an exhaustive description of all possible outcomes under all possible treatment assignments for any given plot of land. Of these potential yields, Rubin (1990) said Neyman considered them “a priori fixed but unknown.”
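Summarized in something closer to modern display notation (a paraphrase of the setup just described, not Neyman’s original typography), the collection of potential yields is

$$
U = \{\, U_{ik} \,\}, \qquad i = 1, \dots, v \ \text{(varieties)}, \qquad k = 1, \dots, m \ \text{(plots)},
$$

and the counterfactual comparison on any plot k between two varieties i and j is the difference $U_{ik} - U_{jk}$, even though at most one of those two potential yields will ever be realized.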
This “a priori fixed but unknown” idea is a staple in most time travel movies. Doc Brown warns Marty McFly in Back to the Future not to change anything when he goes back in time, because changing even one thing can have repercussions that cascade into the future, causing who knows what catastrophic changes. Even the smallest changes will cause a major change to the future (“a priori fixed”), but what exactly that change is, neither Doc Brown nor Marty McFly can say (“but unknown”). And so in most time travel movies, protagonists are constantly in a race to fix the problems they in fact did create when they went back in time and made seemingly small changes, which is part of what makes these stories so neat. Who knew that one of the fathers of statistics, Jerzy Neyman, was writing one of the first time travel stories in this technical 1923 statistics article?
But Neyman does not see — he himself would admit it later — the full implications of his own paper, because something he does later in the paper, after providing his “potential yield” definition of causal effects, was offered merely as a plot device, not as a prescription for scientists to follow. Like many of the classic statisticians, Neyman animated randomization with something like a bingo ball machine called the “urn model”. Since each plot k of land could receive any variety i, which variety would each plot receive? Neyman assigned fertilizer to each plot using the urn model, wherein inside the urn were all i varieties of fertilizer. One by one, he would walk past a plot of land, reach inside the urn, pull out a fertilizer and place it on the plot of land, until all of the land was fertilized across all possible varieties. Judea Pearl once told me over dinner that going from potential outcomes to realized outcomes was a “major philosophical move”, but to Neyman it was done via an assignment mechanism involving an urn filled with fertilizer that was drawn from blindly by the farmer.
Urn models were not special in themselves, but I want to pause and just one more time point out what Neyman’s urn model was doing. By blindly pulling fertilizer from the urn and assigning that fertilizer to each plot of land, he was introducing an assignment procedure that Rubin later described as “stochastically identical to the completely randomized experiment.” Only Neyman himself did not see it. For him, this was merely a thought experiment. It was not the key to unlocking a 20th century revolution in scientific inquiry.
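To make the urn concrete, here is a minimal sketch in Python. It is my own illustration, not anything from Neyman or Rubin, and the variety labels and plot count are hypothetical; but the mechanics are the ones Rubin is pointing to: stock an urn with the varieties, draw blindly for each plot, and you have produced a completely randomized assignment.

```python
# A minimal sketch (mine, not Neyman's or Rubin's) of the urn-model assignment:
# stock an urn with equal numbers of each fertilizer variety, shuffle it, and
# walk plot by plot, blindly drawing one ball per plot without replacement.
import numpy as np

rng = np.random.default_rng(1923)

varieties = ["A", "B", "C"]   # hypothetical fertilizer varieties (v = 3)
n_plots = 12                  # hypothetical number of plots (m = 12)

# The urn: each variety appears n_plots / v times, so every variety ends up
# applied to the same number of plots.
urn = np.repeat(varieties, n_plots // len(varieties))
rng.shuffle(urn)              # the blind draw: equivalent to shuffling the urn

# Plot k receives whatever the farmer pulled out while standing on plot k.
assignment = {k + 1: variety for k, variety in enumerate(urn)}
print(assignment)             # a completely randomized design over the 12 plots
```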
But while Neyman himself doesn’t see this, his arch-rival, Ronald Fisher, in a book published two years later, absolutely saw the implications of Neyman’s notation and the urn model used to assign treatment to land. In his classic 1925 book, Statistical Methods for Research Workers, Fisher explicitly calls for physical randomization of interventions in order to solve the problem of causal inference. Imbens and Rubin, in their own classic 2015 book on causal inference, say this about the importance of Fisher’s book:
“Before the 20th century, there appears to have been only limited awareness of the concept of the assignment mechanism. Although by the 1930s, randomized experiments were firmly established in some areas of scientific investigation, notably in agricultural experiments, there was no formal statement for a general assignment mechanism and, moreover, not even formal arguments in favor of randomization until Fisher (1925).” — Imbens and Rubin (2015), my emphasis
So here we have the value of the tag team of Jerzy Neyman and Ronald Fisher. Neyman (1923) defined causal effects by introducing potential outcomes. He then introduced randomization via an urn to animate how one moves between potential outcomes and realized outcomes. Neyman had unknowingly placed Mill’s counterfactual in check — he just couldn’t see that in fact it was more than check. It was checkmate. Fisher (1925), though, did see that it was checkmate, and his book screams physical randomization, causing Mill’s puzzle to finally concede defeat. If Neyman’s definition of causal effects using potential outcomes was a movie about time travel to counterfactual states of the world, Fisher’s physical randomization was the time machine to get there. Treatment assignment could unlock the fundamental problem of causal inference so long as the treatment assignment was physical randomization.
Scene 3: Physical Randomization’s Magic
Don Rubin, whom I have mentioned a few times now and who is a former chair of the Harvard statistics department, would in the 1970s write numerous papers laying out an explicit re-formulation of Neyman’s insightful notation, cleaned up a little and made more contemporary. Rubin would generalize the notation and call it not potential yields but potential outcomes. Potential outcomes modeling defines the causal effect as a simple contrast between two states of the world: one world in which a treatment occurred versus another where it did not. The model retained the time travel nature of Mill and Neyman, but in Rubin’s hands it would be put to new uses. So influential was Rubin’s work on this that Holland, in a 1986 article, would call Neyman’s potential outcomes notation the “Rubin causal model”. But Rubin (no doubt aware of Stigler’s law of eponymy) would correct people and say that the ideas were based in part on Neyman’s original 1923 article. Let’s look at this more contemporary usage of potential outcomes.
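In that contemporary notation (see the note on notation at the end of this post), each unit i carries two potential outcomes, and the causal effect for that single unit is their comparison. Writing the individual treatment effect as $\delta_i$, which is my own choice of symbol for exposition rather than anything mandated by Rubin, the standard statement is

$$
\delta_i = Y_i^1 - Y_i^0,
$$

where $Y_i^1$ is the outcome unit i would realize if treated and $Y_i^0$ the outcome it would realize if untreated.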
When we aggregate and take the population mean over all these individual treatment effects, we get what is now called the average treatment effect (ATE):
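Written out in the same notation, the ATE is just the population expectation of those individual effects:

$$
ATE = E[\delta_i] = E[Y_i^1 - Y_i^0] = E[Y_i^1] - E[Y_i^0].
$$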
An insight comes when we apply the law of iterated expectations to this definition to show that the ATE is just a weighted average over four conditional expectations. We define the share of the population treated with the intervention with the Greek letter pi, and with some simple manipulation we can show that when we compare the “realized outcomes” of the treatment group with those of the control group, we get the following formal decomposition:
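Reconstructed here in the same notation (the algebra is the standard textbook decomposition, with $\pi$ the treated share, D the treatment indicator, and ATT and ATU the average treatment effects on the treated and untreated), the law of iterated expectations first gives

$$
E[Y^1] - E[Y^0] = \pi\,E[Y^1 \mid D=1] + (1-\pi)\,E[Y^1 \mid D=0] - \pi\,E[Y^0 \mid D=1] - (1-\pi)\,E[Y^0 \mid D=0],
$$

and rearranging terms to isolate the simple comparison of realized group means yields

$$
\underbrace{E[Y^1 \mid D=1] - E[Y^0 \mid D=0]}_{\text{simple difference in mean realized outcomes}}
= \underbrace{E[Y^1] - E[Y^0]}_{\text{(1) ATE}}
+ \underbrace{E[Y^0 \mid D=1] - E[Y^0 \mid D=0]}_{\text{(2) selection bias}}
+ \underbrace{(1-\pi)\big(ATT - ATU\big)}_{\text{(3) heterogeneous treatment effect bias}}
$$

where $ATT = E[Y^1 - Y^0 \mid D=1]$ and $ATU = E[Y^1 - Y^0 \mid D=0]$.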
Any time we compare average realized outcomes between two groups — one treated, one not — that difference is, by definition, equal to the sum of three quantities: (line 1) the ATE, (line 2) selection bias, and (line 3) heterogeneous treatment effect bias, the difference between the average treatment effect on the treated (ATT) and on the untreated (ATU), weighted by the share of the population in the control group. And if we use physical randomization as the treatment assignment mechanism, then we know the mean characteristics of the treatment group and the control group are in expectation the same, which allows us to make the following somewhat bizarre deduction.
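In equation form, the deduction is this: because the assignment was made by an urn or a coin rather than by anything related to the units themselves, treatment status carries no information about either potential outcome, so the conditional means equal the unconditional ones. This is the standard statement of what random assignment buys us:

$$
E[Y^1 \mid D=1] = E[Y^1 \mid D=0] = E[Y^1]
$$

$$
E[Y^0 \mid D=1] = E[Y^0 \mid D=0] = E[Y^0]
$$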
Descendants of Fisher in the experimental design tradition who learn that the randomized controlled trial is the queen of all causal inference designs do not always write equations like these down. They tend, instead, to talk about randomization as balancing, within the “realized outcomes” world of things, physical variables, both seen and unseen. But Neyman and Rubin’s notation shows that there is more than just equality of seen and unseen variables in the treatment and control groups — there are conditional expected potential outcomes that exist, like E[Y^1|D=1], being set equal to conditional expected potential outcomes that do not, like E[Y^1|D=0]. The first term exists because we observe Y^1 for the treatment group. We do not, though, observe Y^1 for the control group. And yet since each person has both potential outcomes associated with him or her, then just as physical randomization created the same average number of men and women in each group, it also created the same average value of Y^1 and Y^0 in both groups.

Which is, as I allude to here, a bit weird, because in each of the lines above there is one average that is real and one that is not, and yet physical randomization set equal to one another something that is real and something that is not real. What in the world can it possibly mean for something to both not exist and yet have a value? (Does that value both exist and not exist too, like Schrödinger's cat?) But because they are equal, physical randomization deletes line 2 from the above decomposition of a simple contrast, and it also sets the ATT equal to the ATU (and equal to the ATE), causing line 3 to be deleted as well.
Physical randomization is just about the closest thing to time travel that we will find in this lifetime, because it allows us to peer into alternative worlds simply by looking at control groups. We know that the mean value of Y^0 for the treatment group in an alternative dimension is equal to the mean value of Y^0 for the control group in our real world, because that’s what physical randomization does. That’s “the science”, as Rubin often says. Physical randomization sets things equal that don’t even exist but which were nonetheless screwing up our comparisons by introducing biases.
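None of this is mystical, and it is easy to see in a toy simulation. The sketch below is mine, not anything from the talk or the papers; the distributions and numbers are made up purely for illustration. Every simulated unit carries both potential outcomes, a coin flip plays the role of Neyman’s urn, and the simple difference in realized group means lands on the true ATE:

```python
# A minimal toy simulation (my own, with made-up numbers) of the claim above:
# every unit has both potential outcomes, only one is ever realized, and yet
# random assignment makes the simple difference in group means track the ATE.
import numpy as np

rng = np.random.default_rng(1925)
n = 1_000_000

# Both potential outcomes "exist" for every unit, even though we only ever see one.
y0 = rng.normal(loc=10.0, scale=2.0, size=n)       # untreated potential outcome
y1 = y0 + rng.normal(loc=5.0, scale=3.0, size=n)   # treated potential outcome, heterogeneous effects

true_ate = np.mean(y1 - y0)

# Physical randomization: a fair coin flip, the modern stand-in for Neyman's urn.
d = rng.integers(0, 2, size=n)

# Realized outcomes: the only thing a data worker actually observes.
y = np.where(d == 1, y1, y0)

simple_difference = y[d == 1].mean() - y[d == 0].mean()
print(f"true ATE: {true_ate:.3f}")
print(f"simple difference in realized means: {simple_difference:.3f}")
```

With a million units the two printed numbers agree to a couple of decimal places: the selection bias and heterogeneous treatment effect bias terms from the decomposition above are, in expectation, zero.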
It’s important to stress here just what Neyman and Fisher as a team are introducing. They are not introducing the experiment — that had been used for a long time. Even Leonardo da Vinci ran experiments. Humans have basically always suspected that to know causal effects was to make simple comparisons; what they didn’t know is that those simple comparisons become magically reflective of causal effects with physical randomization of the treatment.
I also don’t want you to hear me saying that physical randomization had never been used in an experiment before. In Peirce and Jastrow (1885), a team of psychologists randomized treatments. But it wasn’t to infer causal effects — it was, rather, to fool the subjects, because the treatments they were examining were being rolled out sequentially, and they didn’t want the respondents to correctly guess what was coming next. As Imbens and Rubin (2015) note, Neyman and Fisher appear to be the source of the insight that when we define causal effects in terms of potential outcome comparisons, and we physically randomize treatment assignment, then we have a formalized basis and theoretical explanation for why randomization is so precious.
And as Rubin notes, a scientific revolution follows — only that revolution is mainly in agriculture and medicine. It is less so, explicitly, within economics. Early 20th century work associated with the Cowles Commission, for instance, did have traces of potential outcomes in it. Tinbergen and Haavelmo specifically, in their work on simultaneous equations and instrumental variables, seemed to use potential-outcomes-like notation and reasoning, according to Guido Imbens in a 2014 article. But that work on instrumental variables and simultaneous equations had influence that was relatively limited to economics, perhaps because it was so closely tied to our own modeling of supply and demand, the king of all models in which realized outcomes are simultaneously determined. Insofar as other disciplines had less need for handling simultaneous equations, the Cowles Commission work may have had limited interaction with, and influence on, the groups influenced by Neyman and Fisher that were also interested in causality. Listen to Imbens describe, in an interview I did with him in early 2022, econometricians’ movement from potential outcomes to realized outcomes and his opinion about the implications this had for interpretation and communication.
Transcript of Guido Imbens:
“The later, sort of around that, well, a little later, but sort of influenced by that. I wrote this book review of a book about some econometrics meetings where I looked back at Tinbergen’s work about the supply and demand for the simultaneous equation set up. [Tinbergen] starts off being very clear in terms of the potential outcomes in a way that still resonates a lot with me. And it's kind of interesting to see how economists then generalize that to these simultaneous equations models with K endogenous variables and M exogenous ones, but doing it purely in terms of the, the realized outcomes. And that I think at that point, a lot of the clarity was lost. And we lost the statistics part of the audience. Because it wasn't very clear what was meant anymore. So I think the potential outcomes are a big part of what make this so clear.”
Intermission
So, we see here a few core ideas: that philosophers and statisticians each introduced counterfactual ideas to define causal effects; that Jerzy Neyman, in a 1923 article, introduced more formalized notation describing potential yields in a field experiment and also introduced an idea we now call the “treatment assignment mechanism”, by which treatments are assigned to units. The treatment assignment mechanism he introduces is an urn model from which fertilizers are randomly selected and applied to each plot of land. Ronald Fisher, two years later in a famous book, put two and two together and concluded that, while Neyman was speaking of a thought experiment when he referenced an urn model to assign fertilizer to land, in fact physical randomization in the real world would translate potential outcomes into realized outcomes in such a way that simple comparisons between groups would yield estimates of average causal effects.
In Part 2, I am going to then move from Neyman and Fisher’s potential outcomes based randomized experiments to the late 20th century work by two entities: the labor economists at Princeton’s Industrial Relations Section in the 1980s and two economists at Harvard in the 1990s. The next substack will support the story with many wonderful video clips with Orley Ashenfelter, Josh Angrist and Guido Imbens, and I hope that you stick around to read it!
David Lewis, one of the greatest metaphysical philosophers of the 20th century, second only to Wittgenstein, would define causality the same way, though with far less poetry than Mill. He said of it: “Causation is something that makes a difference, and the difference it makes must be a difference from what would have happened without it.”
There are some changes made, but they are mostly in appearance. I switched out Neyman’s U for Y to describe outcomes, and though Neyman numbered potential varieties with the letter i, which in his original notation was a subscript, I prefer to use the binaries 1 and 0 in the superscript. But there is itself some diversity in how to do this. Some today will place potential outcomes inside parentheses, like Y(1), and still others will place the numbers as subscripts, as Neyman had originally done. In addition, I designate the units as i, whereas Neyman designated the treatments with i. But the concepts are otherwise the same — just like Mill, something caused something else if, when you compare the potential outcomes under different treatment assignments, there is a “difference”.
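For the record, the conventions mentioned in this note all carry the same two potential outcomes per unit, and a reader may run into any of the following (the subscript version shown is just one example of that style):

$$
Y_i^1,\ Y_i^0 \ \ \text{(superscripts, used here)}; \qquad
Y_i(1),\ Y_i(0) \ \ \text{(parentheses)}; \qquad
Y_{1i},\ Y_{0i} \ \ \text{(subscripts)}.
$$

In every convention the unit-level comparison is the same “difference” Mill had in mind, e.g. $Y_i^1 - Y_i^0$.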