

Difference-in-differences, Average Treatment Effects and the Importance of Mechanisms: Part 2
Counterfeit DiD, True DiD or RCT?
I’m probably not going to do a ton of editing on these Ignaz Semmelweis posts because, to be honest, I’ve been astonished at how much my opinions have evolved as I’ve studied the case more closely. I think the substack has to become more of a place where I’m processing, at least part of the time. So bear with me. If you know I’m inviting you to eavesdrop on my personal thoughts, and you take everything I say with a grain of salt, I won’t feel bad about leading anyone astray.
In that last post, I went through Ignaz Semmelweis’s article in which he came to the conclusion that the reason the fatal fevers were happening in the physician wing but not as much in the midwife wing of a Vienna hospital was that physicians’ hands were contaminated with “cadaveric” particles from when they did autopsies for class. It’s a brilliant article in my opinion, and its rhetoric I’m sure holds wisdom and guidance for all of us, no matter our background.
But in this essay, I want to walk through a few things related to difference-in-differences, drawing upon the evidence Ignaz presented, and how I think about this too. The basic idea is that, depending on what you consider the treatment to be, you may need only parallel trends or, weirdly enough, you may need more. Potential outcomes will be used, so beware.
What do I think difference-in-differences is?
I was thinking about how to title that yesterday and decided against saying “What is difference-in-differences” in the off chance that I may change my mind. And even though I don’t think I will, I decided to hedge and just say I think of difference-in-differences as this:
Difference-in-differences is (1) four averages and three subtractions (2) equalling the ATT plus a non-parallel trends bias.
I think that’s the definition I may use, too, going forward. Let me explain it in its two parts. The first part of that definition I got from this interview with Orley Ashenfelter I did last year where he explained that he decided that instead of explaining the results of his rich fixed effects regressions using the word “regression”, he decided to tell the people he worked with using the phrase “difference-in-differences” because he knew that the regression specification he was using was just that: four averages and three subtractions. I clipped the link so it should go directly to that part of the interview if you want to watch it here. (I love these interviews).
By now, I think many people reading this have already updated their beliefs and know that diff-in-diff isn’t simply a synonym for two-way fixed effects, but I suspect like me many of you (especially the really old people my age!) did once think it was. But there’s been a metric ton of diff-in-diff educating over the last five years. It’s almost like we were sentenced to some Siberian re-education camp designed solely to rip all econometric heterodoxy related to diff-in-diff out of our minds and are only now allowed to come back to civilized society and mingle! Of course, it’s debatable that anyone wants to socialize with someone coming straight from the diff-in-diff re-education camps, but if you’re a subscriber to this substack, chances are you have at least one foot still in that world. So what is this four averages and three subtractions of which Orley speaketh? Here:
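Written out in assumed notation, where E[Y | D=d, t] is the average outcome for group d in period t (the symbols are a reconstruction, not necessarily the original typesetting):

$$
\widehat{\delta} = \Big( E[Y \mid D=1, \text{Post}] - E[Y \mid D=1, \text{Pre}] \Big) - \Big( E[Y \mid D=0, \text{Post}] - E[Y \mid D=0, \text{Pre}] \Big)
$$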
where “post” and “pre” refer to when the treatment group, D=1, was treated. And in this setup, the D=0 group is simply “not the first treatment group”. Before we even get into whether it’s treated or not, I simply want to focus our attention so we can see that the diff-in-diff equation here is nothing more than a contrast in two differences.
So why did it ever get stuck in our heads that fixed effects regressions were synonyms for diff-in-diff? Because this particular regression specification has a coefficient that is numerically identical to that equation above.
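A sketch of that specification, assuming the standard 2x2 setup with a group dummy D_i, a period dummy Post_t, and their interaction (the Greek letters other than delta are placeholder names):

$$
Y_{it} = \alpha + \beta D_i + \gamma\, \text{Post}_t + \delta \, (D_i \times \text{Post}_t) + \varepsilon_{it}
$$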
Specifically, the delta coefficient in that regression is literally the same as the four averages and three subtractions in the first equation. I still find it funny that I didn’t notice that, but to quote Tim Robinson, “not everybody knows how to do every thing!”
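A quick numerical check of that equivalence may help. This is a minimal sketch on simulated data: the sample size, the coefficients, and the variable names are all made up for illustration, and the regression is fit with plain least squares from NumPy rather than any particular econometrics package.

```python
import numpy as np

# Hypothetical simulated 2x2 data: d marks the treatment group, post marks the period.
rng = np.random.default_rng(0)
n = 400
d = rng.integers(0, 2, size=n)
post = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * d + 0.3 * post + 2.0 * d * post + rng.normal(size=n)


def cell_mean(group, period):
    """Average outcome for one group in one period."""
    return y[(d == group) & (post == period)].mean()


# Four averages and three subtractions.
did_from_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Saturated regression: intercept, group dummy, period dummy, interaction.
X = np.column_stack([np.ones(n), d, post, d * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = coef[3]  # coefficient on the interaction term

print(did_from_means, delta_hat)
```

The two printed numbers agree up to floating-point error because the regression is saturated: with two dummies and their interaction it fits each of the four cell means exactly.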
Handwashing was RCT not DiD
Okay, so moving along, let’s say you grant me that that’s all difference-in-differences is. Then you can totally see why Ignaz Semmelweis’s hand washing experiment would be difference-in-differences. Recall what happened:
Physicians in the physician wing in mid 1847 were told to wash their hands with chlorinated lime before delivering babies.
Midwives in the midwife wing didn’t do that because they weren’t handling dead bodies.
Deaths from the childbed fevers plummeted in the physician wing to more or less the same mortality rate as that in the midwife wing. Recall they had been over 4 times higher in the physician wing just that year, so that was amazing.
Absolutely, he was right, and absolutely his experiment showed he was right. But as I said in that post, the treatment assignment mechanism in the Vienna hospital sorted women into either the physician wing or the midwife wing according to what time of day and day of week they showed up for care, so it was almost certainly randomized treatment assignment. He never, after all, actually takes “four averages and three subtractions”, nor does he run a regression specification. For Semmelweis, it was sufficient, best I can tell, to simply show in table form that the two wings reached a kind of stasis in which the mortality rate was the same. In other words, all he did was show the result of an RCT. He just showed it happening over time, and he showed the differences pre-treatment to illustrate the problem.
So where am I going with this? I’m going two places with this, but the first is that I don’t think this is diff-in-diff. I think Semmelweis, 80 years before Neyman and Fisher articulated the potential outcomes framework and the principles of randomized treatment assignment for causal inference, was simply showing common sense causal reasoning. Here was the RCT:
Women are being randomly assigned because of the hospital’s scheduling system to either the physician or midwife wing (and had been for going on around 7 years in fact going back to 1840 when an ordinance was passed in the city).
He hypothesized a very specific mechanism, even though he did not know that the theoretical shape of that mechanism involved micro-organisms: the physicians’ hands were contaminated with cadaveric material that was basically poisoning the mothers and killing many of them.
When he required the men to wash their hands with chlorine, then, given that mothers were being randomly assigned to the male physicians, Ignaz was randomly assigning cadaveric material to the mothers or not according to when their water broke.
Pretty interesting right? I actually had never thought of it as an RCT — like a literal RCT — until I sat down to write this substack series and went line by line through that beautiful piece of scientific rhetoric he’d written. But that’s what it was. It was always an RCT; the hand washing experiment had never been a diff-in-diff.
Counterfeit versus True DiD
This leads me to my second point, which is the hand washing experiment was a counterfeit diff-in-diff. What’s a counterfeit DiD? I’m going to do this slowly with multiple latex equations just because I like to. First step, I’ll rewrite the “four averages and three subtractions” just so we can start over.
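With D=1 standing for the physician wing and D=0 for the midwife wing (labels assumed here for illustration), the four averages and three subtractions are:

$$
\widehat{\delta} = \Big( E[Y \mid D=1, \text{Post}] - E[Y \mid D=1, \text{Pre}] \Big) - \Big( E[Y \mid D=0, \text{Post}] - E[Y \mid D=0, \text{Pre}] \Big)
$$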
Hand washing as the treatment
Now I’m going to replace each of those Y terms with their corresponding potential outcome expression depending on whether in that period that group is treated or not. But to do this, we have to decide something: what is the treatment? We have two options. Our first option is that the treatment is “hand washing”. Our second option is that the treatment is “removing cadaveric material with chlorinated lime”.
This is where things get deep. Ignaz clearly believes the treatment is “removing cadaveric material with chlorinated lime” because he says as much. It’s not hand washing — it’s removing the cadaver material with technology. But look at how this changes things. If the treatment is hand washing, for instance, then the midwives were never treated because they weren’t washing their hands. In which case we’d write this:
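A reconstruction of (4a) in assumed notation, with D=1 the physician wing, D=0 the midwife wing, Y(1) the outcome under hand washing and Y(0) the outcome without it:

$$
\widehat{\delta} = \Big( E[Y(1) \mid D=1, \text{Post}] - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(0) \mid D=0, \text{Post}] - E[Y(0) \mid D=0, \text{Pre}] \Big) \tag{4a}
$$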
Notice how the definition of treatment status here is that the physicians in the baseline pre period aren’t washing their hands and therefore their potential outcome, Y(0), is based on “not washing hands”. That’s because the treatment under consideration is washing hands. And so since in the baseline the physicians aren’t doing that, nor are the midwives, then physicians in the baseline have the same treatment status as midwives do in both periods. In which case we can do the following trick where we add the following zero. That zero is the following expression and it’s in red because it’s counterfactual:
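Reconstructed in assumed notation (D=1 is the physician wing, and the counterfactual term is colored red):

$$
0 = \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} - \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} \tag{5a}
$$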
It’s counterfactual because in the post-treatment period, the physicians are washing their hands, and so it is a fictional world in which they aren’t in late 1847. So then we add this additional zero to (4a) and get (6a).
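As a sketch in assumed notation (D=1 physicians, D=0 midwives, red for the counterfactual physician outcome without hand washing in the post period), the sum is:

$$
\begin{aligned}
\widehat{\delta} ={}& \Big( E[Y(1) \mid D=1, \text{Post}] - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(0) \mid D=0, \text{Post}] - E[Y(0) \mid D=0, \text{Pre}] \Big) \\
&+ \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} - \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]}
\end{aligned} \tag{6a}
$$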
And then I’ll rearrange it for us but keep the equation numbered as 6a.
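The rearrangement, reconstructed in assumed notation (D=1 physicians, D=0 midwives, red for the counterfactual term), groups the treatment contrast first and the two Y(0) changes second:

$$
\begin{aligned}
\widehat{\delta} ={}& E[Y(1) \mid D=1, \text{Post}] - \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} \\
&+ \Big( \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(0) \mid D=0, \text{Post}] - E[Y(0) \mid D=0, \text{Pre}] \Big)
\end{aligned} \tag{6a}
$$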
And then I’ll rewrite it again to make it even easier to read:
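In labeled form, again a reconstruction in assumed notation with D=1 the physician wing and D=0 the midwife wing:

$$
\widehat{\delta} = \underbrace{E[Y(1) - Y(0) \mid D=1, \text{Post}]}_{\text{ATT}} + \underbrace{\Big( E[Y(0) \mid D=1, \text{Post}] - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(0) \mid D=0, \text{Post}] - E[Y(0) \mid D=0, \text{Pre}] \Big)}_{\text{non-parallel trends bias}}
$$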
where ATT is the average treatment effect on the treated group and non-parallel trends bias is when the change in Y(0) for midwives and physicians is different from one another. This, I will contend, is diff-in-diff: four averages and three subtractions that collapses to (6a) such that parallel trends (not counting no anticipation and SUTVA) is necessary and sufficient to identify the ATT.
Removing cadaveric material is the treatment
But what if hand washing wasn’t the treatment? After all, Semmelweis doesn’t give a flip about hand washing. He noted that washing with soap, for instance, was not sufficient because hand washing with soap didn’t get rid of the smell, and therefore he deduced it wasn’t getting rid of the cadaveric materials either. Semmelweis didn’t want clean hands. He wanted the cadaveric material gone. And chlorinated lime did that. So what if the treatment status isn’t hand washing but rather whether or not the person’s hands have cadaveric material on them?
If you frame the experiment as the removal of cadaveric material as opposed to hand washing, then weirdly enough the treatment status changes. Why? Because:
Post-treatment physicians do not have cadaveric material on their hands so Y=Y(1).
Pre-treatment physicians do have cadaveric material on their hands so Y=Y(0).
Midwives never have cadaveric material on their hands so Y=Y(1).
Do you see it? If the treatment is “no cadaveric material”, then midwives do not share baseline treatment status with physicians, but if the treatment is hand washing they do! Well, what are the implications of this? Let’s rewrite it again by going back to equation 4a. I’ll start a whole new sequence of equations starting with 4b instead.
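A reconstruction of (4b) in assumed notation, where D=1 is the physician wing, D=0 the midwife wing, Y(1) is the outcome with no cadaveric material on the hands and Y(0) the outcome with it:

$$
\widehat{\delta} = \Big( E[Y(1) \mid D=1, \text{Post}] - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(1) \mid D=0, \text{Post}] - E[Y(1) \mid D=0, \text{Pre}] \Big) \tag{4b}
$$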
We are going to still need to add that red zero in order to find the ATT, and that red zero is this:
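Reconstructed in assumed notation (D=1 is the physician wing, Y(0) is the outcome with cadaveric material on the hands, red marks a counterfactual term):

$$
0 = \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} - \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} \tag{5b}
$$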
So then we add (5b) to (4b) and get (6b):
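A two-row sketch of (6b) in assumed notation (D=1 physicians, D=0 midwives, Y(1) meaning no cadaveric material, red for the counterfactual term): the first row is the raw sum, the second row the rearrangement.

$$
\begin{aligned}
\widehat{\delta} ={}& \Big( E[Y(1) \mid D=1, \text{Post}] - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(1) \mid D=0, \text{Post}] - E[Y(1) \mid D=0, \text{Pre}] \Big) + \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} - \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} \\
={}& E[Y(1) \mid D=1, \text{Post}] - \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} + \Big( \textcolor{red}{E[Y(0) \mid D=1, \text{Post}]} - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(1) \mid D=0, \text{Post}] - E[Y(1) \mid D=0, \text{Pre}] \Big)
\end{aligned} \tag{6b}
$$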
So first of all, notice that second row is not the same as (6a). Equation 6a is a standard DiD equation in which the only things coming up are the ATT parameter plus the non-parallel trends bias term (not counting the no anticipation and SUTVA conditions, which just to make this post manageable I had to impose from the start, but which merit their own posts another time). You can tell it’s different from equation 6a because, though the physician changes in Y(0) are the same, the midwife changes are not. They are changes in Y(1), not Y(0), because the midwives were always treated.
For those who are familiar with Andrew Goodman-Bacon’s now well-known result, the reason that this particular 2x2 isn’t collapsing to that conventional DiD form of ATT plus parallel trends bias is because the comparison group, midwives, were already treated. What were they treated with? Clean hands. Non-contaminated hands. And since we defined the treatment to be “no cadaveric material on the hands”, as they never had cadaveric material on their hands, their realized outcome was always drawn from Y(1) not Y(0), because Y(0) recall in this setup is the baseline value of “cadaveric material on the hands” for male physicians.
Andrew, though, managed to figure out what assumptions were needed by adding another set of zeroes. Those zeroes were:
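For the midwives in the post period, in assumed notation (D=0 is the midwife wing, Y(0) the outcome with cadaveric material on the hands, red for terms that are never observed):

$$
0 = \textcolor{red}{E[Y(0) \mid D=0, \text{Post}]} - \textcolor{red}{E[Y(0) \mid D=0, \text{Post}]}
$$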
and
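for the midwives in the pre period, again in assumed notation with D=0 the midwife wing and red for never-observed terms:

$$
0 = \textcolor{red}{E[Y(0) \mid D=0, \text{Pre}]} - \textcolor{red}{E[Y(0) \mid D=0, \text{Pre}]}
$$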
And when you combine those zeroes with equation 6b, guess what you get.
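A sketch of the result in assumed notation, with D=1 the physician wing, D=0 the midwife wing, and ATT for a group in period t defined as E[Y(1) - Y(0) | D, t]:

$$
\begin{aligned}
\widehat{\delta} ={}& \underbrace{E[Y(1) - Y(0) \mid D=1, \text{Post}]}_{\text{ATT for physicians}} \\
&+ \underbrace{\Big( E[Y(0) \mid D=1, \text{Post}] - E[Y(0) \mid D=1, \text{Pre}] \Big) - \Big( E[Y(0) \mid D=0, \text{Post}] - E[Y(0) \mid D=0, \text{Pre}] \Big)}_{\text{non-parallel trends bias in } Y(0)} \\
&- \underbrace{\Big( E[Y(1) - Y(0) \mid D=0, \text{Post}] - E[Y(1) - Y(0) \mid D=0, \text{Pre}] \Big)}_{\Delta \text{ATT for midwives}}
\end{aligned}
$$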
That’s right. Those of you who have seen Andrew’s paper now see where it’s coming from. It’s simply adding in several counterfactual zeroes, which then shows why DiD expressions that use already treated groups as controls can be problematic. They require that counterfactual parallel trends hold, which in this situation are almost like extreme counterfactual trends because, notice, it’s changing Y(0) for all groups, and even the midwives’ Y(0) doesn’t exist, let alone the physicians’ counterfactual Y(0). But it’s also requiring that the ATT of having clean hands not be evolving for midwives over time.
Now should the ATT of non-contaminated hands be evolving for midwives over time? I mean think about it. Semmelweis has no idea what the cadaveric material could be doing. So how could he, or really anyone, possibly have an informed opinion on this? Any answer is just as good or just as bad as any other. The problem with using the already treated group as a comparison is that it’s the one time we have to make firm statements about what a treatment effect can or cannot do, which is impossible without a prior theory of that, and in Semmelweis’ case, that theory wouldn’t come in his lifetime. Parallel trends, in other words, is only about how Y(0) evolves over time, but Delta ATT is about the treatment effects themselves.
So, I guess what I’m saying now is that the hand washing “experiment”, so to speak, wasn’t a true DiD. Not that Semmelweis presented it as that, but I think increasingly these days, as people have learned more about Semmelweis’s antiseptic theory and its success, they’ve taken the rhetoric of his book and that article (which I don’t think is remotely widely appreciated, and which I hope to one day make more progress in understanding), seen elements of DiD in it, and concluded that this was an example of it.
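To see the already-treated problem with numbers, here is a minimal sketch. Every value below is a hypothetical potential-outcome mean invented for illustration (deaths per 100 births), not anything from Semmelweis’s tables. Physicians are labeled "P", midwives "M", and Y(1) means no cadaveric material on the hands.

```python
# Hypothetical means of potential outcomes, deaths per 100 births (made up).
# y0: outcome WITH cadaveric material; y1: outcome WITHOUT it.
y0 = {("P", "pre"): 10.0, ("P", "post"): 9.0, ("M", "pre"): 11.0, ("M", "post"): 10.0}
y1 = {("P", "pre"): 3.0, ("P", "post"): 3.0, ("M", "pre"): 4.0, ("M", "post"): 2.5}

# Observed outcomes: physicians switch from Y(0) to Y(1); midwives are always Y(1).
did = (y1["P", "post"] - y0["P", "pre"]) - (y1["M", "post"] - y1["M", "pre"])

# Pieces of the decomposition.
att_physicians = y1["P", "post"] - y0["P", "post"]
trend_bias = (y0["P", "post"] - y0["P", "pre"]) - (y0["M", "post"] - y0["M", "pre"])
delta_att_midwives = (y1["M", "post"] - y0["M", "post"]) - (y1["M", "pre"] - y0["M", "pre"])

print("DiD:", did)
print("ATT (physicians):", att_physicians)
print("ATT + trend bias - delta ATT (midwives):",
      att_physicians + trend_bias - delta_att_midwives)
```

In this made-up example the Y(0) trends are parallel by construction, so the only reason the DiD differs from the physicians’ ATT is that the midwives’ treatment effect changes between the two periods.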
But it’s more complicated than that. If the treatment was hand washing, then it’s the standard DiD equalling the ATT plus a parallel trends bias term. But if the treatment was non-contaminated hands (i.e., no cadaveric material), then it isn’t. Because in that case, the midwives were “always treated” so to speak by clean hands. Therefore all realized outcomes were the Y(1) potential outcomes and the identifying assumptions are equation 6b not equation 6a, which I consider to be the more typical DiD identifying assumptions.
Now back to the RCT
But then what? So what? What is my point? Well, apart from using the Semmelweis case to walk myself up and down the reasoning of what a difference-in-differences design is and isn’t, I wanted to go back to just thinking about what Semmelweis was bearing witness to. And here is what I think.
In 1840, Vienna unknowingly started an RCT. That RCT was a long-running randomized experiment in which mothers were randomly allocated to physicians who had been performing autopsies and still had material from the cadavers on their hands when performing a delivery. It was a random assignment because the Vienna hospitals used a scheduling system in which on particular days at particular times, if you showed up at the hospital for care, you’d go to either the First or Second Clinic independent of potential outcomes. Even knowing ahead of time that the First Clinic was worse was simply not sufficient to get into the Second Clinic because the hospitals appeared to have little to no discretion. No doubt there was some manipulation, but these were hospital beds for poor women, not rich families, so I’m not really even sure what influence you could possibly expect the woman whose water had just broken to have. She can’t schedule her pregnancy and it doesn’t sound like she had the ability to pick her doctor either. So they randomized her when her water broke depending on what day of the week it was and what time of the day it was. And that was the 1840 RCT.
Well, if the 1840 ordinance is truly a randomized experiment, then you don’t need DiD. A simple difference in means in 1840 is the ATE, as E[Y(0)] is then the same for both groups, and heterogeneous treatment effects between the two groups are likely non-existent. So therefore a simple difference in means is simply identifying the ATE. Well, incredibly enough, Semmelweis provided Table 7, which is the reports of both clinics in 1839 and 1840, which allows us to do both a “true DiD” and to estimate the ATE using only 1840. So let’s do it! Let’s see what we get.
If we just compare death rates for 1840 between the First and Second Clinic, it’s 9.5 deaths per 100 births in the First Clinic versus 2.6 deaths per 100 births in the Second Clinic, a difference of 9.5 - 2.6 = 6.9 deaths per 100 births. In other words, the effect of this cruel experiment was that it raised the death rate by 6.9 deaths per 100 births, which, given there were three thousand births in the First Clinic, is an additional 207 female deaths from being exposed to cadaveric material during delivery.
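The arithmetic can be sketched in a few lines. The two death rates and the round figure of three thousand births are the ones quoted in the text; the variable names are just illustrative.

```python
# Figures quoted in the text for 1840 (deaths per 100 births).
first_clinic_rate = 9.5
second_clinic_rate = 2.6
first_clinic_births = 3000  # "three thousand births", a round figure

# Simple difference in means, which is the ATE under random assignment.
ate_per_100 = first_clinic_rate - second_clinic_rate

# Implied additional deaths in the First Clinic.
excess_deaths = ate_per_100 / 100 * first_clinic_births

print(round(ate_per_100, 1), round(excess_deaths))
```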
But we can also use DiD to estimate the ATT. Now technically under randomization, the ATE=ATT=ATU. But let’s just do the math and see what we get. We have a baseline at 1839 and a post treatment in 1840. Now how do we define Y(0) here though? I’m not sure, to be honest. Technically, we’d like for Y(0) to be the potential outcome associated with a specific treatment status of “not treated with cadaveric material”, but what I don’t know yet is whether in 1839 this was a teaching hospital in which physicians were either performing autopsies or using cadavers in some other way. I need to read more on this, but so as not to confuse matters, let’s just assume for the sake of argument that Y(0) is defined as “no cadaveric material”, although like I said, I’m not really 100% certain that that’s true, and in fact the decline in the Second Clinic’s death rate makes me think that the segregation by sex created by the Ordinance pushed all the cadaveric material into the First Clinic and out of the Second Clinic entirely. So I’m really not even sure we can perform a DiD here at all, but I’m going to anyway. We plug the numbers into the DiD equation of four averages and three subtractions to get:
Now isn’t that interesting. The ATE estimate using only the 1840 comparison was 6.9, but the ATT using the DiD design was 6. They aren’t the same, and in fact the DiD design implies that mortality rates should have fallen by 1.9 deaths per 100 births.
But I’m wondering if maybe this really isn’t a DiD either and the simple comparison for the ATE isn’t more reliable. The reason being, again, Y(0) should be associated with a treatment status representing “no cadaveric material”, but in 1839, if these physicians were performing autopsies, just not only in the First Clinic, then that treatment status never existed in 1839 anywhere, not in the First or the Second Clinic.
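One way to see how the two estimates relate is to note that the DiD is just the 1840 gap between the clinics minus the 1839 gap. The sketch below uses only the 1840 rates and the DiD estimate of 6 quoted in the text; the 1839 levels are not reproduced here, so the baseline values in the loop are arbitrary placeholders, and the "implied" gap is back-of-envelope arithmetic that assumes the 6 is not heavily rounded.

```python
def did(treat_post, treat_pre, ctrl_post, ctrl_pre):
    """Four averages and three subtractions."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)


# Quoted in the text: 1840 rates per 100 births and the DiD estimate.
first_1840, second_1840 = 9.5, 2.6
did_reported = 6.0

post_gap = first_1840 - second_1840        # the simple 1840 difference
implied_pre_gap = post_gap - did_reported  # 1839 gap implied by the two estimates

# The DiD depends only on the 1839 gap, not on the 1839 level.
# The baselines below are arbitrary placeholders, NOT Semmelweis's data.
for placeholder_baseline in (1.0, 3.0, 5.0):
    estimate = did(first_1840, placeholder_baseline + implied_pre_gap,
                   second_1840, placeholder_baseline)
    print(placeholder_baseline, round(estimate, 2))

print("implied 1839 gap:", round(implied_pre_gap, 1))
```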
But do you see how important it is that we think carefully about the treatment status, and what precisely it is? Because if it really is the case that the treatment status is just a binary comparison, then of course it’s completely fine to run DiD no matter who the comparison group is, midwives or not. The DiD then is just thought of as any two differences subtracted from one another.
But when we go through the logic of the DiD design using the potential outcomes, and we get very specific about what we do and do not mean by treatment, then we start to see precisely where this is going. And we also start to see the challenges of making progress when we are shooting blind about mechanisms. Because Semmelweis is so careful, so focused, in his reasoning out of what the mechanism is, and he gets as far as he can to the cadaveric material on the physicians’ hands (and honestly I still cannot wrap my mind around how incredibly careful his logical reasoning and attention to detail had to be to do this), it literally means the DiD may or may not be appropriate depending on what precisely is meant by “treatment” and what is and is not meant by “no treatment”.
These words — we are so used to the vocabulary. We are so used to the binary nature of it because of our comfort in running regressions and generating dummy variables. We forget there was a time when such a thing as “generating a dummy variable” or “running regressions” or even “difference-in-differences” didn’t exist. Just even the concepts didn’t exist, let alone the nuances contained within them.
So what am I concluding? I’m reasonably confident that the 1840 Ordinance did randomize women into one of two kinds of deliveries: with or without cadaver on the attending physician/midwife’s hands. But I don’t know if the 1839 data can ever really be used in any way, shape or form to conduct difference-in-differences analysis. The baseline even in 1839 simply does not correspond to what is meant by Y(0). Neither the First nor the Second Clinic has women exposed to “no cadaveric material”. And that is what is meant literally by Y(0). So if that isn’t the case, then you can’t use difference-in-differences. And you also can’t use it in 1847 either.
Best I can tell, Semmelweis has a much harder job and that’s why he uses a variety of arguments and a variety of evidence. I think if I am asked to interpret all of this and to give a case for what the average causal effect is of being delivered by someone who’s been handling dead bodies, though, I’d say it increases mortality rates on average by 6.9 deaths per 100 births. That’s the simple difference in mean death rates in 1840, and I think given the random assignment of women to hands in 1840 created by the Ordinance, that’s the answer.
I am working on a project that involves price as a continuous variable, whereas DiD seems mostly to be used with multi-valued treatments, such as discrete doses of a particular medicine.
But I have to use the price of thousands of products as a continuous treatment variable, and here I find it difficult to make DiD work.
Can you suggest any technique that can be used for a continuous price variable?
Interesting. I guess part of what we want the pre-post to do for us here is decompose the different effects of being randomized to the clinics, which also involves doctors vs. midwives.