Five communities across the United States have decriminalized psychedelic compounds such as magic mushrooms and San Pedro cactus. As I have written about drug policy, I have been following these developments with a lot of curiosity given how I assumed scheduled psychedelics such as magic mushrooms, or psilocybin, would remain scheduled for the rest of this republic’s existence. But what an opportunity to study the effect of these unprecedented policy changes!
Imagine the following thought experiment if you would. Using the thought experiment command available in some future version of Stata,1 let’s say I could collect a census of Americans that included the universe of psychedelic users before and after the decriminalization took place in some cities. Furthermore, assume my magical dataset had linked mental health outcomes such as depression. I want to use difference-in-differences to estimate an average causal effect of magic mushrooms on depression. Can I?2
But let’s say that in reviewing the data, I find that individuals living in the non-treated comparison cities have been experiencing their own renaissance in psychedelic experimentation. As this is a nation-wide surge in psychedelic usage, it’s unlikely caused by DC or Ann Arbor’s decriminalization. In other words, it’s unlikely this is a SUTVA violation. But decriminalization isn’t the only covariate that could cause a person’s experimentation with psychedelics to change — sometimes the times are just changing.
This in fact may pose a problem for identifying the causal effect because it means the line between the treatment and control group cities is getting fuzzier. It is fuzzy in the sense that psychedelic usage is rising even in the control group, and while we may think that we are estimating an average treatment effect on the treated (ATT), until Clément de Chaisemartin and Xavier D’Haultfœuille’s 2017 article in the Review of Economic Studies, “Fuzzy Differences-in-Differences”, it was not in fact known whether that was so. In this substack, I will attempt to lay out what we are identifying in the above thought experiment when we estimate a difference-in-differences model using conventional estimators, and what alternatives we may have.
Sharp DiD and the Wald Estimator
This is a complex paper in parts, so I will at all times be aiming to communicate the intuition of the paper while selectively leading the reader through its technical details. But let me lay out the purpose of the paper first. The purpose of this paper is several fold:
Under a fuzzy difference-in-differences design, under what assumptions will Wald difference-in-differences (Wald DiD) identify a causal effect?
What do we call the average causal effect that Wald DiD will identify under those assumptions?
What estimators might we use as an alternative if we are uncomfortable with some of these assumptions?
Let’s begin by explaining what the authors mean by the “Wald DiD”.
A common estimator found in the historical instrumental variables literature is the Wald estimator. The Wald estimator is the ratio of the reduced form and the first stage in an IV framework (without covariates). It is:
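For a binary instrument Z (the textbook case, which I am assuming here since there are no covariates), that ratio can be written as:

```latex
\beta_{\text{Wald}} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}
```

The numerator is the reduced form and the denominator is the first stage.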
Notice that we have three variables: the outcome (Y), the treatment (D) and the instrument (Z). This is a familiar face to those who are familiar with instrumental variables and it can be shown that under strict IV assumptions, this ratio equals the local average treatment effect of the treatment D on outcome Y.3
Economists and others have used the Wald estimator’s general framework beyond IV applications, though. de Chaisemartin and D’Haultfœuille (2017) note that 10% of papers published in the American Economic Review between 2010 and 2012 used a fuzzy difference-in-differences design and estimated either a simple Wald difference-in-differences (Wald DiD) or a weighted average of Wald DiDs. So what is a Wald DiD?
In the scenarios where a treatment group is treated and a comparison group isn’t and no one can cross treatment status (e.g., a form of compliance), then the difference-in-differences framework that applies is called a “sharp DiD”. It’s sharp in that the treatment goes from 0 to 1 in the treatment group, but the control group experiences no changes in treatment status. But a “fuzzy DiD” is a situation where the policy increases usage in an environment where there appears to already be treatment adoption occurring for other reasons. Whereas some policies cannot be fuzzy (e.g., minimum wages), the thought experiment we’ve been considering (e.g., decriminalization of psychedelics) can be thought of as fuzzy because the control group is experiencing its own unrelated renaissance in drug experimentation. The Wald DiD attempts to address this by scaling the DiD on the outcome by the DiD on the treatment status as:
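In group-and-period shorthand (first subscript the group g, second the period t, which is my reconstruction of the authors' notation), the Wald DiD is:

```latex
W_{DID} = \frac{E[Y_{11}] - E[Y_{10}] - \left(E[Y_{01}] - E[Y_{00}]\right)}{E[D_{11}] - E[D_{10}] - \left(E[D_{01}] - E[D_{00}]\right)}
```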
where Y is measured depression, G is treatment city (e.g., Ann Arbor), D is psychedelic consumption, subscript 1 is post-treatment and subscript 0 is pre-treatment. Notice again the three variables needed for the fuzzy DiD: outcome (Y), policy level treatment city (G) and individual treatment status (D) regardless of group.
The Wald DiD estimator is in the spirit of its IV parent in that it scales the DiD on the outcome (like a reduced form) by the DiD on usage (like a first stage). But to do this, it automatically means that what is needed is disaggregate data such that we observe individual panel units with or without treatment assignment regardless of which policy they’ve been treated with. This is why I chose the decriminalization thought experiment: to do the fuzzy DiD, we need not only average depression by cities, we need individual level psilocybin usage too. Otherwise we cannot calculate the DiD on usage in the denominator. Thus not only does the fuzzy DiD apply when lines are blurry, it also requires that we have data on individuals not just data on treatment cities.
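To make the data requirement concrete, here is a minimal Python sketch that computes a Wald DiD from group-by-period cell means. Every number is invented for illustration, and the names `did`, `wald_did`, `usage_share` and `depression` are mine, not anything from the paper or from Stata.

```python
from typing import Dict, Tuple

# A cell is (g, t): g = 1 for decriminalizing cities, t = 1 for the post period.
Cell = Tuple[int, int]


def did(means: Dict[Cell, float]) -> float:
    """Difference-in-differences of a variable's four group-by-period means."""
    return (means[(1, 1)] - means[(1, 0)]) - (means[(0, 1)] - means[(0, 0)])


def wald_did(y_means: Dict[Cell, float], d_means: Dict[Cell, float]) -> float:
    """DiD on the outcome scaled by the DiD on individual treatment status."""
    first_stage = did(d_means)
    if first_stage == 0:
        raise ValueError("DiD on treatment status is zero; the Wald DiD is undefined")
    return did(y_means) / first_stage


# Hypothetical cell means, invented for illustration only.
# The usage shares can only be computed from individual-level data on D.
usage_share = {(1, 0): 0.10, (1, 1): 0.30, (0, 0): 0.10, (0, 1): 0.15}
depression = {(1, 0): 20.0, (1, 1): 18.5, (0, 0): 20.0, (0, 1): 19.75}

print(f"DiD on usage:      {did(usage_share):.3f}")
print(f"DiD on depression: {did(depression):.3f}")
print(f"Wald DiD:          {wald_did(depression, usage_share):.3f}")
```

Notice that `usage_share` rises in both groups in this made-up example, which is exactly the fuzziness under discussion, and that the denominator cannot be built from city-level outcome data alone.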
Part of the challenge of any econometrics paper is its rhetoric. Rhetoric in an econometrics paper largely involves the selection of mathematical notation (i.e., symbols) that the authors will use to clearly lay out assumptions and measurements as clarity and accuracy are absolutely essential in this genre of technical writing. This means we can only get so far with exposition of a paper without committing up front to that notation. Therefore before we move on, we need to make the fixed cost investment in the authors’ notation. I’ll lay out briefly the important symbols and equations.
First thing to note is the number of variables your dataset will need. As you can see in the Wald DiD, we need data on depression Y, decriminalization policy status G and individual psilocybin usage D. As it is a difference-in-differences design, we also need to observe all three of these variables before treatment and after, which the authors mark with the upper case letter T for “time”. How will the authors carry around so many variables? Both with conditional statements, but also simple subscripts. Take any random variable R which can be sliced by D status, G status and T status. Then rather than writing out the conditional statements, we simply use subscripts where the order of the subscripts corresponds to an alphabetizing of D, G and T:
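With lower-case d, g and t standing for particular values of D, G and T (my reconstruction from the description that follows), the shorthand and one example are:

```latex
R_{dgt} \sim R \mid D = d,\ G = g,\ T = t
\qquad \text{for example} \qquad
R_{101} \sim R \mid D = 1,\ G = 0,\ T = 1
```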
The first subscript, 1, states that the person in our dataset has consumed psilocybin (D) during the wave of data under consideration. The second subscript, 0, states that the person lives in a non-decriminalization control city (G). And the third subscript, 1, states that we are observing this random variable in the post treatment period (T).4
Assumptions and Identification for Wald DiD in Fuzzy Designs
The authors consider numerous assumptions in this paper, as many as eight at one point. This creates a challenge for an explainer because it automatically means we will also need to walk through as many as eight assumptions. But I am not bound by the genre style requirements of a Restud level econometrics publication, so if I wish to walk the reader through what each assumption means, that’s what I’ll do. Let’s consider the main fuzzy assumptions.
Basic Fuzzy assumptions
There are three main assumptions the authors point us to which constitute a fuzzy DiD design. I will list and briefly interpret each one now:
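The first, the fuzzy design assumption, has two parts. Written in the two-subscript shorthand (my reconstruction from the verbal description below):

```latex
\begin{aligned}
\text{(1a)}\quad & E[D_{11}] > E[D_{10}] \\
\text{(1b)}\quad & E[D_{11}] - E[D_{10}] > E[D_{01}] - E[D_{00}]
\end{aligned}
```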
Recall the order in which you read off subscripts: in alphabetical order. Ordinarily that means DGT, but as our random variable of interest is D, that means subscripts are alphabetized G and T. Thus, the first half of this assumption, 1a, states that the share of people using psilocybin in the decriminalization cities (1) in the post-treatment period (1) is greater than it had been in the pre-period (0). The second half of this assumption, 1b, assumes that the growth in psilocybin consumption in the decriminalization cities is larger than the growth that occurred in the non-decriminalization cities, not that the change in control groups is zero. Assuming that the control group experienced no change whatsoever in treatment status between the two periods is in fact a special case of 1b, which the authors lay out in the second assumption:
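For a binary treatment, and again as my reconstruction from the surrounding description, the stable control group assumption reads:

```latex
0 < E[D_{01}] = E[D_{00}] < 1
```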
The second assumption concerns magic mushroom consumption levels in the non-decriminalization cities and simply states that the share of magic mushroom users was both non-zero and didn’t change between the two periods in the non-decriminalization cities. And finally,
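the third assumption is a treatment participation equation. What follows is my reconstruction in the paper's notation, with V a latent propensity and v a threshold that varies by group and period:

```latex
\begin{aligned}
& D = \mathbf{1}\{V \geq v_{GT}\}, \qquad V \perp T \mid G \\
& S = \{V \in [v_{11}, v_{10}),\ G = 1\}
\end{aligned}
```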
where S stands for people who “switched” treatment status by going from someone who didn’t experiment with psilocybin to someone who did. These “switchers” turn out to be a key subpopulation in this design, and so I will refer to them frequently. The authors suggest an economic rationale for why they switched in the first line of the assumptions, 1a. They chose to switch because a personal propensity to try drugs, V, had changed such that it exceeded their own private reservation level of drug consumption, v. Think of lower case v as the minimum expected utility one believes they will get from psilocybin before they’ll even contemplate trying it.
With these assumptions, let’s now lay down two average causal parameters. These two causal parameters are called the local average treatment effect (LATE) for the “switching” subpopulation (4a.) and the local quantile treatment effect (LQTE) for the “switching” subpopulation (4b).
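Using Y(1) and Y(0) for potential outcomes with and without treatment, and q for a quantile, the two parameters can be written as follows (a reconstruction that I believe matches the paper's definitions):

```latex
\begin{aligned}
\text{(4a)}\quad & \Delta = E\left[Y_{11}(1) - Y_{11}(0) \mid S\right] \\
\text{(4b)}\quad & \tau_q = F^{-1}_{Y_{11}(1) \mid S}(q) - F^{-1}_{Y_{11}(0) \mid S}(q)
\end{aligned}
```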
For the first time, we introduce potential outcomes notation.5 The first term is the LATE, but since it only refers to the treatment group in the post-treatment period, quite frankly it would be more accurate to call it the LATT in my opinion since it is only the LATE for the treatment group in the post-treatment period. But putting that aside, we note that 4a refers to the mean, but 4b refers to any quantile treatment effect (e.g., median).
Wald-DiD estimators and terms
With these assumptions and notation down, let’s examine this “commonly used” Wald-DiD estimator. While they note correspondence between Wald DiD and instrumental variables, I will mainly focus on the DiD representation so as to minimize confusion by crossing so many designs with one another. As said earlier, we represent a DiD estimator as one that takes simple difference in means before and after for a treatment and control group and then differences that first difference — hence the name “difference-in-differences”. We represent that generic DiD for random variable R as:
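In the two-subscript shorthand (group first, period second), that generic DiD is:

```latex
DID_R = E[R_{11}] - E[R_{10}] - \left(E[R_{01}] - E[R_{00}]\right)
```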
As stated before, but repeated here as a reminder, the Wald DiD is simply the DiD on depression (Y) scaled by the DiD on mushroom consumption (D):
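Writing DID with a subscript for the variable being differenced, that scaling is simply:

```latex
W_{DID} = \frac{DID_Y}{DID_D}
```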
Now let control group switchers, S’, be people whose treatment status changes between the two periods in the non-decriminalizing cities, G=0. Thus, we can also write down a LATE for this group of control group switchers:
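By analogy with the LATE for treatment group switchers, and as my reconstruction of the paper's definition, this second LATE is evaluated in the control group's post-treatment period:

```latex
\Delta' = E\left[Y_{01}(1) - Y_{01}(0) \mid S'\right]
```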
We need one last set of notation (sorry! But if I had done it all at once, you would’ve been even more annoyed) before we can hit the ground and that’s the scaled participation probability (P). We will use this momentarily:
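As best I can reconstruct it from the paper, this term is the treatment group's own change in usage divided by the DiD on usage, and the authors call it alpha:

```latex
\alpha = \frac{P(D_{11} = 1) - P(D_{10} = 1)}{DID_D}
```

If usage in the control group does not change, the numerator and denominator coincide and alpha equals 1.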
Unique Wald DiD assumptions
There are three assumptions that one needs to estimate the LATE/LQTE on top of the three basic fuzzy assumptions. I will list them in order with short exposition. First is the parallel trends assumption, and if you came here to learn about DiD, then you already know what parallel trends means, so I won’t belabor it. I will just list it so we can reference it later by equation number.
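Stated on the untreated potential outcome Y(0), and as my reconstruction of the paper's wording, the common trends assumption is:

```latex
E[Y(0) \mid G, T = 1] - E[Y(0) \mid G, T = 0] \ \text{ does not depend on } G
```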
It’s the next two assumptions that have a bit more bite. Not surprisingly, the themes of all the new DiD papers break out here: the problems created by heterogeneity. The fifth assumption is called the stable treatment effect over time and it requires that in both groups, the average effect of going from 0 to d units is stable over time. This is the same as assuming that among these units, the means of Y(d) and Y(0) follow the same evolution over time.
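With D(0) denoting a unit's treatment status in the pre-period, my reconstruction of the stable treatment effect assumption is that for every treatment value d:

```latex
E[Y(d) - Y(0) \mid G, T = 1, D(0) = d] = E[Y(d) - Y(0) \mid G, T = 0, D(0) = d]
```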
And finally, the sixth assumption: homogeneous treatment effects for switchers in both groups. Note this is not necessary for the sharp DiD that many of us know so well; it is actually a fuzzy assumption for Wald DiD caused by noncompliance in treatment participation within control groups.
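With one symbol for the LATE of treatment group switchers and a primed symbol for the LATE of control group switchers, this homogeneity assumption is as short as they come (my reconstruction):

```latex
\Delta = \Delta'
```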
Wald DiD Theorem
The authors lay out a theorem to guide you as you consider what your Wald DiD is and is not identifying under some or all of these assumptions. Let’s look at different combinations. First, what if assumption 1 holds for us but not assumption 2? That means, using our example, that our non-decriminalization cities are experiencing increased magic mushroom consumption over the treatment window, but our decriminalization cities like Ann Arbor are seeing even greater growth. Then in that world, what does Wald DiD identify under assumptions 3-5 but not 6? Under A1, A3, A4 and A5, the Wald DiD will identify:
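Here is my reconstruction of that result, where the unprimed LATE belongs to treatment group switchers, the primed LATE belongs to control group switchers, and alpha is the treatment group's change in usage divided by the DiD on usage:

```latex
W_{DID} = \alpha \Delta + (1 - \alpha) \Delta'
```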
If the treatment rate is increasing in the control group then alpha is greater than 1. Therefore under A1, A3, A4 and A5, the Wald DiD equals a weighted difference of the LATEs of decriminalization and non-decriminalization city switchers in the post-treatment period. This weighted difference is potentially problematic, because if the LATE for non-decriminalization switchers is sufficiently larger than that of the treatment group switchers, then the sign can be wrong. The authors write:
“This weighted difference does not satisfy the no sign-reversal property (my emphasis): it may be negative even if the treatment effect is positive for everyone in the population. [But] if one is willing to assume that A6 is satisfied, then the weighted difference simplifies into [the LATE].” (p. 6)
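A small numerical sketch may help here. It assumes the weighted-difference result takes the form alpha times the treatment group LATE plus (1 minus alpha) times the control group LATE, which is my reading of the theorem, and every number and function name below is invented for illustration.

```python
def alpha_weight(d11: float, d10: float, d01: float, d00: float) -> float:
    """Treatment group's change in usage divided by the DiD on usage."""
    did_d = (d11 - d10) - (d01 - d00)
    if did_d <= 0:
        raise ValueError("the fuzzy design requires a positive DiD on usage")
    return (d11 - d10) / did_d


def wald_did_estimand(alpha: float, late_treated: float, late_control: float) -> float:
    """Weighted combination of the two switcher LATEs (assumed form of the theorem)."""
    return alpha * late_treated + (1 - alpha) * late_control


# Hypothetical usage shares: usage rises in both groups, faster in treated cities.
alpha = alpha_weight(d11=0.30, d10=0.10, d01=0.15, d00=0.10)

# Hypothetical LATEs: psilocybin lowers depression for both kinds of switchers,
# but by much more for control group switchers.
w = wald_did_estimand(alpha, late_treated=-1.0, late_control=-5.0)

print(f"alpha = {alpha:.3f}")
print(f"Wald DiD estimand = {w:.3f}")
```

Because usage rises in the control group, alpha exceeds 1, the control group LATE gets a negative weight, and the combination comes out positive even though both effects are negative. That is the sign reversal the authors are warning about.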
Thus we see here one of the common things we’ve been seeing across so many of the new DiD econometric papers of the last 3-4 years: without treatment effect homogeneity, identification using conventional linear estimators has no causal interpretation.
But this is really an artifact of rising treatment rates in the non-decriminalization cities, because if treatment rates are falling in our non-decriminalization cities, the Wald DiD under A1, A3-5 is equal to a weighted average of the LATEs of decriminalization and non-decriminalization city switchers in the post-treatment period, and this quantity does satisfy the no sign-reversal property. It will still differ from LATE, though, because it is still a weighted average of the two city-specific switcher LATEs. But at least it is the right sign!
But, when there is a fixed treatment rate over time in our non-decriminalization cities, then alpha equals 1, and the Wald DiD equals LATE under A1, A3-5. In other words, if the treatment rate is fixed for our control group over the treatment window, then we don’t need A6 to identify the LATE. But it’s not so simple even here, as the authors write:
“But even [when treatment rates don’t change in the control group], the Wald-DiD relies on the assumption that in both groups, the ATE among units treated at [pre-treatment period] remains stable over time. This assumption is necessary (my emphasis). Under A1, A3-4 alone, the Wald DiD is equal to LATE plus a bias term involving several LATEs. Unless this combination of LATEs cancels out exactly, the Wald-DiD differs from LATE.” (p. 6)
Alternatives to the Wald DiD: Time Corrected Wald and Changes in Changes Wald
You can think of everything we just did as though we were decomposing the Wald DiD in a fuzzy design into causal and bias parameters with various forms of treatment rate changes and various assumptions. But some of these assumptions may simply be really big pills you have trouble swallowing. So the authors go beyond this discussion of the problems facing the Wald DiD and introduce the reader to two new estimators they discovered/invented: the time corrected Wald (Wald-TC) and the Changes in Changes Wald (Wald-CiC). Let me discuss those now.
The main purpose of the Wald-TC is to correct for time changes in usage but in such a way that allows us to strip off assumptions, rather than add new ones on. This is their first alternative to Wald-DiD and unlike the changes-in-changes one that follows, the time corrected Wald estimator still uses a DiD parallel trends axiom (albeit one with some slight tweaks). The tweaked parallel trends assumption, 4’, is:
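With D(0) denoting treatment status in the pre-period, my reconstruction of this conditional common trends assumption is that for each d in {0, 1}:

```latex
E[Y(d) \mid G, T = 1, D(0) = d] - E[Y(d) \mid G, T = 0, D(0) = d] \ \text{ does not depend on } G
```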
The authors explain this new parallel trends assumption as simply saying that the mean of Y(0) and Y(1) follow the same evolution over time among treatment and control group units that were untreated (or treated) at T=0. We then define the “time correction” term alluded to in the title of this estimator as:
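In the three-subscript shorthand (treatment status, then group, then period), and as my reconstruction from the description that follows, the term is:

```latex
\delta_d = E[Y_{d01}] - E[Y_{d00}]
```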
This delta term denotes “the change in the mean outcome between period 0 and 1 for control group units with treatment status d” (page 7). We then write out the time corrected Wald estimator as:
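As I reconstruct it from the paper, each treatment group unit's pre-period outcome is shifted by the control group trend that matches its own pre-period treatment status, and the result is scaled by the treatment group's change in usage:

```latex
W_{TC} = \frac{E[Y_{11}] - E\left[Y_{10} + \delta_{D_{10}}\right]}{E[D_{11}] - E[D_{10}]}
```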
It’s interesting to pause here and consider the class of estimators that resemble it. Notice how the numerator remains a first difference for the treatment group. Thus it is the start of a DiD on outcomes. But what’s differenced is the delta term. It is an “adjustment” made to the first difference of the control group.6
Now if we assume A1-3 and A4’, then Wald-TC equals the LATE for the switchers. This is because when we have constant treatment rates in the control group, then we can identify trends on Y(0) and Y(1) by looking at how the mean outcome of mushroom users and non-users change over time. Under A4’, these trends are the same in the two groups, and the numerator is able to identify the mean LATE.
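To see the mechanics, here is a minimal Python sketch for a binary treatment. It assumes the time-corrected Wald takes the form described above, with the treatment group's pre-period mean shifted by a usage-weighted average of the control group trends for users and non-users. All numbers and names are invented, and I have built a switcher effect of -4 into the fake data.

```python
from typing import Dict


def wald_tc(y11: float, y10: float, p11: float, p10: float,
            delta: Dict[int, float]) -> float:
    """Time-corrected Wald for a binary treatment (sketch under stated assumptions).

    y11, y10: mean outcome in the treatment group, post and pre.
    p11, p10: share of users in the treatment group, post and pre.
    delta[d]: change in mean outcome among control units with status d.
    """
    # Pre-period mean shifted by the control trend matching each unit's status.
    corrected_pre = y10 + p10 * delta[1] + (1 - p10) * delta[0]
    first_stage = p11 - p10
    if first_stage == 0:
        raise ValueError("no change in usage in the treatment group")
    return (y11 - corrected_pre) / first_stage


# Hypothetical control group means by status: non-users 20 -> 19, users 15 -> 14.5.
delta = {0: 19.0 - 20.0, 1: 14.5 - 15.0}

# Hypothetical treatment group: 20% users before, 40% after.
p10, p11 = 0.2, 0.4
y10 = p10 * 15.0 + (1 - p10) * 20.0
# After: initial users, never-users, and switchers (non-user mean plus effect of -4).
y11 = p10 * 14.5 + (1 - p11) * 19.0 + (p11 - p10) * (19.0 - 4.0)

w_tc = wald_tc(y11, y10, p11, p10, delta)
print(f"Wald-TC = {w_tc:.3f}")
```

In this made-up example the estimator returns the -4 built in for switchers, because the trends among users and non-users are the same in both groups by construction.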
Unfortunately it is beyond the scope of this “short” substack entry to explain the classic 2006 article by Athey and Imbens, sometimes referred to as the changes-in-changes paper. The changes-in-changes estimator is not a true difference-in-differences estimator if you require that all DiD based estimators rely on some version of parallel trends. Parallel trends is necessarily a restriction placed on the mean potential outcome changes under no treatment for treatment and control group cities. It would require, for instance, that average depression would have evolved the same in Ann Arbor and Waco, Texas had Ann Arbor never decriminalized mushrooms in the first place.
For some, that is an easy assumption. For others, it is a hard one. Assumptions are hard or easy pills to swallow largely because of the project you’re working on as opposed to abstract imagination. But that said, maybe you don’t like parallel trends because parallel trends is not robust to scaling of the variable. For instance, maybe you think parallel trends holds in levels, but then it won’t by definition in logs and vice versa. Enter changes-in-changes. Changes-in-changes doesn’t restrict the mean potential outcome and therefore is robust to scaling of the outcome, but it does place restrictions on the rest of the distribution of potential outcomes. In econometrics, as in life, there is no such thing as a free lunch, and changes-in-changes is no exception.
Nevertheless, if you are willing to drop the parallel trends assumption and add in a “monotonicity and time invariance of unobservables” assumption (A7 for pete’s sake!), then A3 and A7 will be enough to generalize the CiC model to fuzzy settings. If I’m reading the authors correctly, in fact the fuzzy DiD nests the sharp changes-in-changes by Athey and Imbens (2006) as a special case of this paper’s own CiC estimand. As the exposition at this point requires even more notational setup, I am simply going to say that the Wald-CiC will identify quantile treatment effects under A1-3, A7 and a data restriction assumption they introduce. What this estimator does is reconstruct the unobserved distributions which are then used to calculate quantile treatment effects.
When Assumption 5 is Problematic, We Must Choose
All of this work has really been about one and only one thing and that’s assumption 5. Assumption 5 was tantamount to assuming that mean potential depression outcomes for the people in our data follow the same evolution over time under treatment and under no treatment. It is a strong type of homogeneous treatment effect that we haven’t seen before in our other DiD papers. But sometimes our application is such that we are not willing to believe that the treatment effect is invariant with respect to time. In such scenarios, we must choose between Wald-TC and Wald-CiC.
The choice between Wald-TC and Wald-CiC is largely dictated by the project, not abstract econometric theory or personal preferences. It should be based, in other words, on whether A4’ and A7 are acceptable tradeoffs for abandoning A5 in your project. The right answer is specific to the application at hand.
Assumption 4’ is not invariant to the scaling of the outcome (a negative), but it only restricts the means of the potential outcomes (a strength). Assumption 7 is invariant to the scaling of the outcome (a strength), but at the cost of restricting the entire distribution of potential outcomes (yikes!). The authors discuss this Solomonic choice here:
“When the treatment and control groups have different outcome distributions conditional on D in the first period, the scaling of the outcome might have a large effect on the Wald-TC. The Wald-CiC is much less sensitive to the scaling of the outcome, so using this estimand may be preferable.
On the other hand, when the two groups have similar outcome distributions conditional on D in the first period, using the Wald-TC might be preferable as Assumption 4’ only restricts the mean. This choice should also be based on the parameters one seeks to identify. Under A7, both LATE and LQTE are identified. Under A4’, only LATE.” (p. 9)
Easter egg: non-binary treatments
Before concluding, though, I’d like to point out that buried inside this paper, on page 13, is a discussion of one advantage of their estimators: they can accommodate non-binary treatments. Perhaps your treatment isn’t a zero or one. Maybe it’s not whether you consumed mushrooms, but how many mushrooms. What then?
Well interestingly, there are Wald-DiD, Wald-TC and Wald-CiC analogs to accommodate such non-binary treatments. Their interpretation changes, because they no longer estimate LATE and LQTE parameters, but they do estimate average causal parameters associated with something that Angrist and Imbens (1995) called the average causal response.
The average causal response is simply a sloping change in causal effects associated with a non-binary treatment, though in their application it had been an IV design under consideration. One of the theorems in de Chaisemartin and D’Haultfœuille (Theorem 6) shows that the parameter identified is a weighted average over all values of the continuous treatment along the average causal response function. It’s curious that this application of the three estimators is buried in the paper, because arguably it is one that many researchers would be surprised to learn has been identified.
The software for implementing the Wald-DiD, Wald-TC and Wald-CiC with or without continuous treatment is only available to my knowledge in Stata. You can find, though, an article the authors wrote with coauthor Guyonvarch here. The syntax for it is straightforward. Assuming our make-believe evaluation of decriminalized psilocybin, then we’d write:
. fuzzydid depression decrim post_treatment mushroom_prices, tc breps(1000) cluster(city)
where fuzzydid is the command that you must install from ssc, depression is the outcome, decrim is an indicator signifying whether it’s one of the locales that has decriminalized psilocybin, post_treatment is a dummy indicating it’s the post treatment period, and mushroom_prices is a continuous variable applied to all individual level units. The post comma terms indicate which estimator you’ll use (e.g., tc for time-corrected), the number of bootstrap simulations, and whether you want to cluster.
But the thing to keep in mind, which may be lost in the details, is something I said earlier. Implicitly, a fuzzy DiD assumes you have two measures of treatment status:
G: treatment group, here a city like Ann Arbor
D: treatment status, an individual within a city who has consumed mushrooms
In other words, fuzzies assume that you already have two levels to your data measuring treatment: which city you live in, and whether you consume mushrooms regardless of which city you live in. Without the latter, then you’re implicitly assuming a sharp DiD — which may be fine, but is something you need to think about nonetheless.
The fuzzy difference-in-differences design is for when the lines between treatment groups are somewhat blurry. It’s blurry because the individuals in the control group are free to use mushrooms, too. Decriminalization, in other words, isn’t the only way Americans have relied on magic mushrooms. In fact, the majority of the world doesn’t. As such, fluctuations in consumption must be addressed, and the Wald-DiD did so through a simple scaling of the DiD on outcomes with a DiD on mushroom usage.
But the cost of Wald DiD isn’t cheap. As with TWFE, it requires stable treatment effects over time because otherwise the trends contaminate the estimation. de Chaisemartin and D’Haultfœuille (2017) not only decompose the Wald DiD and show its identifying assumptions, but also provide two alternative estimators: the time corrected Wald estimator and the changes-in-changes Wald estimator. While neither estimator is “free”, they represent what may be in your setting like finding a really great shirt at a thrift store. It’s a great piece of clothing at a discounted price in that it identifies the parameter you care about, but with different assumptions that don’t require stable treatment effects over time. Whether you buy the shirt is up to you.
The thought experiment command in Stata 17 is simple. Just use:
. thoughtexperiment trname(psychedelics) group(decrim) time(year), robust
You don’t even need data or IRB approval to do it!
This is a funny way to motivate this study, admittedly, but in fact the FDA has quietly been helping lead a re-evaluation of the impact of psychedelic-assisted therapies on mental health, such as the MDMA trials for PTSD and the psilocybin trials for depression. But these trials only tell us about therapist-assisted psychedelic usage, not decriminalization with or without such therapist-assisted treatment arms.
Noticing that there are three variables, it turns out, was key to helping me understand the data requirements needed for the fuzzy DiD.
As I said in footnote 2, it was while writing and rewriting conditional statements of R in my notes, as well as thinking about the subscripts, that it dawned on me fuzzy DiD needs more than just aggregate city level data (G). It also needs disaggregate individual level data because otherwise we cannot measure D. Most DiD with which I am familiar assume a sharp DiD setup, but in my thought experiment, we consider a dataset where we observe individual drug usage D as well as residence G and time period T. Without individual level data, you cannot measure D, and if you cannot measure D, you do not have a fuzzy DiD.
The potential outcomes notation is the lingua franca econometricians and statisticians use in the Rubin-Neyman tradition to model causal effects and corresponding biases. For a detailed discussion of this tradition, see the chapter in my online book. Whereas I ordinarily indicate a potential outcome using 1 and 0 superscripts, it is increasingly preferred by most econometricians and applied people to indicate potential outcomes by placing a 1 or 0 in parentheses after the outcome itself. I do so now to maintain the same notation as the authors.
Though it differs from the approach of Heckman, et al. (1997), I just wanted to note that Wald-TC is continuing to calculate first differences on treatment and control group, but the control group term, like that of Heckman, et al. (1997), is a modified version of the simple DiD because it is the change in average depression between the two periods in non-decrim cities where people are taking mushrooms in the first period.