My substack was started for two reasons: to explain econometric methodologies that I thought readers might enjoy, and to explain new papers at the Journal of Human Resources in a series called #JHR_Threads. This time I thought it might be fun to combine the two by discussing two papers smashed together, each of which I thought might help the reader understand the other. These papers are “Difference-in-Differences with a Continuous Treatment” by Brantly Callaway, Andrew Goodman-Bacon and Pedro Sant’Anna, and “How Far is Too Far? New Evidence on Abortion Closures, Access and Abortions” by Jason Lindo, Caitlin Myers, Andrea Schlosser and myself. The first paper is a decomposition of continuous treatment DiD (hence the name), the latter is a study of the impact of abortion clinic closures on abortion demand using distance (a continuous treatment) as the treatment variable, and this substack is a mashup of the two. It may feel a little nonlinear at times, but my hope is that it will work as a story.
The difference-in-differences wildfire
The new difference-in-differences literature spread throughout the profession as fast and as far as it did for perhaps three reasons. First, difference-in-differences is the single most popular quasi-experimental design in economics, and so there was built-in demand to learn this. Just look at this graph from Currie et al. Approximately 23% of all recent papers in the NBER Working Paper series were DiD, and 16% of Top 5 papers. While applied economists work on a variety of topics and therefore use a variety of methodologies, when they repeat themselves, it is probably with a DiD design. Why is that, you might ask? Because we are flush with micro data, American federalism leads to large-scale policy experimentation staggered across the country, and oftentimes DiD may be the only way to study such large, impactful policies as Medicaid or the minimum wage.
The second reason, maybe, was COVID19. COVID19 canceled conferences and seminars across the world while simultaneously increasing online seminars, conferences and reading groups facilitated by Zoom, Teams and Google Hangouts. Not only were online seminars available, they also were typically open to all and well attended. As a result, the transmission of new papers and ideas accelerated, maybe even more than had been the case before COVID, since many of these specialized seminars were open to the public for the first time (as opposed to being siloed at universities). These seminars and reading groups likely amplified the papers’ content both because of the technologies themselves and because of the massive demand for knowledge about DiD, as evidenced in its widespread practice.
And third — Twitter. By all appearances, Twitter, with its highly active sub-community called #EconTwitter, was one of the more important signal boosters of these papers, because many of the authors were active participants on the platform. Without regular threads by Pedro Sant’Anna, Andrew Goodman-Bacon, Kirill Borusyak, and Clement de Chaisemartin breaking down the implications of these new findings for people, I doubt that the papers would have been read as closely and by as many people. The anxiety was palpable as we all learned that twoway fixed effects was potentially biased under reasonable scenarios related to heterogeneous treatment effects, and did not satisfy a “no sign flip” property. Many of us had DiD papers under review at the journals, as well as handfuls of DiD papers on our vitas, and to say it was a little bit stressful would be an understatement.
But like a fever breaking after days of delirium, the panic subsided as new methods appeared. Sun and Abraham, Callaway and Sant’Anna, de Chaisemartin and D’Haultfœuille, Gardner, Wooldridge, Borusyak, Jaravel and Spiess, event studies, and more all showed up and not only carefully dissected the problem but also offered solutions. Eventually code in R and Stata showed up, papers began appearing in the journals, and paradigm shifts on a small scale took place.
But what about continuous treatments? All of these papers had focused on the binary treatment. Their diagnoses rested on decompositions based on binary treatments, their solutions were built for binary treatments, and it wasn’t clear whether continuous treatments had the same problems, or the same solutions. And so people began asking questions. Take this snapshot of comments from Twitter.
How common is the continuous treatment? Quite common. There are states increasing taxes, firms raising prices, cities imposing fees, states raising minimum wages. And none of these are binary treatments. And none of these seemed to have the same target causal parameter either (the ATT), not without changing the continuous treatments into binary ones anyway. These were “multi-valued treatments”, which could be discrete or continuous. They were a sequence of numbers differing across treatment groups, as opposed to simply the two numbers, 0 and 1. We sensed this mattered, but we weren’t sure how or why. So if differential timing was giving us some headaches, what about the continuous treatment?
Supply Side Regulation of Abortion
The history of abortion regulation has strategically focused on two types of intervention: laws that raised the price of an abortion to the woman seeking one (called demand-side regulations) and, more recently, laws that raised the costs of production for firms that provided abortions (called supply-side regulations). Demand-side regulations included parental involvement laws, mandatory delay laws, and others. These had met with limited success and mixed evidence of efficacy.
But the supply-side regulations were new and largely untried. Ted Joyce, in an interesting 2011 NEJM article entitled “The Supply-Side Economics of Abortion” alluded to these emerging shifts by pointing out that Kansas was starting to change tactics by introducing regulations of clinics that would force their closure. He writes:
"Under legislation recently signed by Kansas’s governor, the Kansas Department of Health and Environment has issued new licensing standards for abortion clinics. The regulations stipulate, among other requirements, that facilities must have procedure rooms of at least 15 square feet; each procedure room must have janitorial space of at least 50 square feet; facilities must have designated dressing rooms for patients and separate ones for staff; and each dressing room must have a toilet, a washing station, and storage for clothing.”
Joyce notes that these regulations resulted in a lawsuit by two physicians who claimed that the regulations were so stringent that they would raise the firms’ fixed and variable costs enough to make them insolvent. A temporary injunction allowed all three providers in Kansas to continue operating. But the cat was out of the bag, and in time, more states would make the attempt, including Texas.
Texas House Bill 2
In June 2013, perhaps inspired by Kansas, the Texas legislature passed House Bill 2 (HB2). The legislation was a supply-side regulation with two key provisions: (1) all abortion providers had to obtain admitting privileges at a hospital located within 30 miles of the location at which an abortion was performed, and (2) all abortion facilities had to meet the standards of an ambulatory surgical center, regardless of whether they provided surgical abortions or provided medication to induce abortions. It also prohibited abortions after 20 weeks gestation and required physicians to follow FDA protocols for medication-induced abortions, which restricted the use of pills that induced abortions to pregnancies within 49 days gestational age and required the medication be administered by a physician.
Such clinical language seems reasonable to the uninformed bystander. Why not regulate a medical clinic where invasive procedures are performed like other clinics that do the same? But in fact the regulations, while couched in the language of female safety, were carefully chosen to make these clinics insolvent by raising their fixed and variable costs of production. As we know from our microeconomics courses, the marginal firms in competitive markets operate at razor-thin margins as it is, and thus increases in firms’ average total costs will cause losses that, in the long run, cause firms to exit the market. And in fact, that was exactly what happened. We write:
“Obtaining admitting privileges can be a lengthy process, as it takes time for hospitals to review a doctor’s education, licensure, training, board certification, and history of malpractice. Moreover, many hospitals require admitting doctors to meet a quota of admissions. After a lawsuit, decision, and a subsequent appeal, the admitting privileges requirement took effect on November 1, 2013 (Planned Parenthood of Greater Texas Surgical Health Services v. Abbott 2013) causing nearly half of Texas’ abortion clinics to close.”
The second HB2 provision, which required that facilities meet the standards of an ambulatory surgical center (e.g., additional size, zoning and equipment requirements), was scheduled to take effect on September 1, 2014, and threatened the remaining clinics, but it was blocked two weeks later by the US Supreme Court. And while the US Supreme Court ultimately struck these two provisions down, finding that Texas had failed to demonstrate they served a legitimate interest in regulating women’s health and that they imposed undue burdens on women, very few clinics had reopened by 2018.
The closing of more than half the state’s clinics had a tremendous effect on access by increasing the travel distance to the nearest abortion provider. We made a beautiful map to illustrate this. Areas in the western and southern parts of Texas saw closures that caused the travel distance to grow by over 100 miles. Others saw minimal increases, such as areas around Dallas, but even places like Waco, where I live, saw increases of as much as 50 to 100 miles (the distance required to get to Austin). Sometimes the nearest clinic would even be out of state, in Oklahoma, Louisiana or New Mexico.
This is an interesting natural experiment, and unlike many natural experiments, it would seem the distribution of travel distance was somewhat random. Why did some clinics close and others not? Most likely, I suspect, because of local demand conditions driven by lower incomes. But so what — this is hardly a reversal of Roe v. Wade. Sure, Lubbock lost a clinic, but is El Paso really all that far away? How many of us jump in our cars and drive to our parents for Thanksgiving and Christmas?
Well, many of us do. But were we to look at our financial situation, we might see that we are salaried, able to take off work, own cars and have insurance. If we have to spend a couple of days in a town, we can take off work and do that. Sure, it may cost us $300 just to drive, eat and book a hotel, not to mention the out-of-pocket expenses of the procedure itself, but it’s doable. But it is decidedly not doable for everyone. I do not have to remind readers that there is such a thing as income inequality and poverty, and that the residents of Corpus Christi most likely cannot as easily just jump in the car, drive eight hours to San Antonio, and spend a couple of days at the RiverWalk visiting the Alamo before driving back to work. Many have bosses who don’t allow such trips. Or they simply cannot afford it.
And thus the supply-side regulation ironically has demand-side elements as well. While it did not raise the explicit price of an abortion, it might as well have: driving distance must be purchased on markets, at gas stations, restaurants and hotels, and in opportunity costs in the form of foregone wages and even the risk of termination. Not surprisingly, many people have a lower ability to pay and thus cannot make the trip even if they do own the automobile needed to make it. Texas House Bill 2, with its ensuing clinic closures, reshuffled travel distance across the state and therefore raised the price differentially for otherwise plausibly identical groups of people for no other reason than that some lived in Corpus Christi and some lived in Galveston. This was an unusual case study of the kind we don’t see very often, but one that may help us figure out to what degree travel distance itself is a price that must be paid to acquire an abortion.
Target causal parameters
There are two things that characterize all difference-in-differences designs. They all identify a causal parameter called the average treatment effect on the treatment group (ATT) and they all have in their engine an assumption called parallel trends. With the latter, you can identify the former, if you use a model that needs only parallel trends. The new DiD literature showed us that some models could obtain the ATT at a lower price than others, and if we are rational, shouldn’t we always buy on the market the cheaper procedures? Why pay more for less?
The ATT was our target parameter, and once it was defined, we selected models that took raw data and, with some twisting and folding, magically spat out a weighted average of treatment effects equivalent to the ATT. Often these methods focused on smaller ATT building blocks, like the group-time ATT of Callaway and Sant’Anna, from which one could build “up to” aggregate parameters like the overall ATT, or even relative event-time ATT parameters. Such weighted aggregations showed up in the imputation methods too, like Borusyak, Jaravel and Spiess. These weights mattered because they could distort and disguise the true effects, sometimes even becoming negative and causing sign flips, so the issue was not merely some technical detail — it really seemed to matter.
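To fix ideas, here is a minimal sketch of the spirit of that aggregation, with made-up group-time ATTs and cohort sizes; it is not the Callaway and Sant’Anna implementation, just an illustration of building “up to” an overall ATT with cohort-share weights:

```python
import numpy as np

# Hypothetical group-time ATT estimates for three treatment cohorts.
# Keys are treatment years; values are post-treatment ATT(g, t) estimates.
att_gt = {
    2004: [1.0, 1.5, 2.0],
    2006: [0.5, 0.8],
    2008: [0.2],
}
n_g = {2004: 300, 2006: 500, 2008: 200}  # made-up cohort sizes

# Average each cohort's post-period effects, then weight cohorts by
# their share of the treated population to get an overall ATT.
cohort_avg = {g: np.mean(v) for g, v in att_gt.items()}
total = sum(n_g.values())
overall_att = sum(cohort_avg[g] * n_g[g] / total for g in att_gt)
print(round(overall_att, 3))  # 0.815
```

The point is that the final number is a weighted average, and whether the weights are sensible (here, population shares) or strange (as in some regression estimators) is exactly what was at stake.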
The continuous treatment issue is no different. We can’t really dig into the problems of continuous treatment in a DiD design without first laying out the targets that we are after. And there are a couple of parameters that the authors want us to pay attention to. One of them is old — the ATT — but one of them will be new for many of us — the “average causal response on the treatment group”, or ACRT. But even to understand the seemingly familiar ATT, let alone this parameter called the ACRT, we have to introduce a new term called dose.
What is dose exactly? Let’s use an example. I have notoriously bad sleep problems. I either cannot sleep when I need to, or I sleep at the most inappropriate times and places. The other day, to write this substack, I drove to Starbucks to write. I write in coffeeshops and have ever since college. It is my third place between home and work. I love the smell of coffee, the square top tables, the soft music and the people watching. But as soon as I drove into the parking lot, I could feel the now familiar sleepiness from which I suffer wash over me like a wave. So I parked the car in the parking lot, turned off my car, “rested” my eyes, and fell asleep almost immediately. An hour later I woke up, refreshed, and came inside to write.
But come at me on evenings, and it’s another matter altogether. Every evening, unless I take an over-the-counter sleep aid called melatonin, I will lay in bed for hours, completely miserable, ruminating over the fact that if I don’t fall asleep I’ll be a mess the next day, which makes me anxious, which in turn reinforces my sleeplessness. So a few years ago, I learned about this little magical pill that, when taken, does something or other that invites my eyes to close, my mind chatter to stop, and sleep to come. Sometimes all night, sometimes for a few hours, but always more sleep than I think I was going to get otherwise.
But how many of these pills should I take? Sometimes I think one pill is enough, but lately I think maybe two pills could be better, and after some trial and error, I tend to take two 5mg pills of melatonin before bed and go to bed just before 10pm. My decision to take anywhere from 0 to 2 pills of melatonin is to choose my own dosage of melatonin based on an expected treatment effect, namely my hope that it will help me sleep. My chosen treatment, in other words, is not binary; rather, it is a multi-valued treatment. I mean, I could think of my taking 10mg of melatonin as a binary variable, 0 or 1, but this somewhat masks the deeper question as to whether 5mg or 10mg is better for me.
Average causal response function
The 2021 Nobel Prize in economics was given to David Card, Josh Angrist and Guido Imbens. Angrist and Imbens were particularly noted for their pioneering work on instrumental variables in a series of papers written in the mid to late 1990s. One of the papers was in a 1995 issue of JASA entitled “Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity.” In it they introduced a helpful concept called the “average causal response”. Here’s a short description:
“The average causal response (ACR) [is a] parameter [that] captures a weighted average of causal responses to a unit change in treatment, for those whose treatment status is affected by the instrument.” (Angrist and Imbens 1995).
In “Difference-in-Differences with a Continuous Treatment”, the authors take the classic Angrist and Imbens ACR to the problem of difference-in-differences with a continuous treatment and use it like a microscope to better understand the identification problems of key causal parameters in difference-in-differences designs. The article has helpful exposition using graphs, and as I’m a visual learner, I’m going to mainly rely on these graphs to drive the points home.
The ACR is a curve, and as such economists are likely to find it intuitive because of our love of good curves. We love curves so much that even when we draw linear demand functions, we still call them “demand curves”. To be an economist is to embrace the curve as an exhibit that holds the power to convey mankind’s hopes and dreams, and the ACR is but one of those hopes and one of those dreams. On the y-axis below there is the treatment effect for a given outcome based on a dosage d versus no treatment at all (0) at some point in time t. And on the x-axis we have the dosage d. The third piece is the ACR curve itself, which increases at a decreasing rate (a shape presented only because it is pretty).
At any point along the ACR, we can calculate a particular ATT. For instance, let’s say I took a dosage a of melatonin and wanted to know its treatment effect. Then the vertical distance from the x-axis to the ACR at a is the ATT(a|a). What’s this “(a|a)” business, you might ask? Well, in this case it is the ATT of 5mg for a sleepy old guy like myself who took 5mg. Or in other words, the ATT(5mg dosage | 5mg guy).
Once you understand that the last term after the vertical conditional notation refers to a subpopulation, interpreting the ACR becomes a lot easier. The ACR is a group-specific causal concept, and in this picture, it is the ACR for the group of people whose dosage was equal to a (e.g., 5mg). So last night I took 5mg. That actually happened. But I could have taken 10mg. What would the impact on my sleep (Y) have been had I taken 10mg instead of 5mg? Well, when a particular group of dosers — the 5mg group — who happened to take 5mg could have taken 10mg, then we move “up” the ACR to the point where the dosage equals 10mg. We then trace out the value of the ATT at d=10 for the 5mg group. Get it? As with all potential outcome notation, it’s subtle and easy to misinterpret, but the basic idea is that it’s much like the demand curve, because the demand curve itself is a series of paired potential prices and corresponding potential quantities. It’s the same here — we observe ATT(a|a), but we could have observed ATT(b|a) had group a chosen b instead of a.
But while I took a=5mg of melatonin last night, what did you take? Maybe your sleep problems are worse than mine and so you took b=10mg, which makes you the 10mg doser group. Because you’re a different group than me, I can’t presume we have the same ACR. In fact, maybe your ACR is the one above mine, which I’ve labeled ACR(d|b). And in this world, you took b=10mg of melatonin, whereas I took a=5mg. That means I’m the a group, but you are the b group. My ACR contains a pair of dosages corresponding to hypothetical ATTs (even though only one would be observed in reality), as does yours. It’s just that since your ACR flies higher than mine, and your dosage point is greater than mine, the ATT that corresponds with your dosage is ATT(b|b), which is shown on the vertical axis at the very top.
Contained on the ACR are actually two causal concepts, not one. First there is the ATT. The ATT is simply a point on a group’s ACR at a particular dosage, equal to Y(1) - Y(0) where the value 1 corresponds to that particular dosage d. For instance, if everyone took 10mg, then we’d just estimate the ATT for 10mg and, for simplicity, set 10mg to be the treatment using the number 1 as a stand-in. In other words, every point along the ACR is really just capturing the single-valued treatments we’ve been working with the rest of the time. The ACR is the more general expression of treatments; binary treatments are just special cases of what we are doing here.
But there’s another causal concept on this ACR and you would be forgiven for not seeing it. It’s the difference in ATT at different values of the ACR. The effect, in other words, of one treatment level compared to an adjacent treatment level. If I want to know the causal effect of melatonin as I increase it by one dosage unit, I’m asking about the distance between two points on the ACR curve, not the vertical distance between a dosage level and the ground. And if we are working with a continuous dose (not shown), then it is just the slope of a line tangent at that particular dosage point. Either way, the change in the ATT between two dosages is what the authors call the average causal response on the treatment group (ACRT).
The ACRT for a discrete treatment is more formally represented with the following equation:
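The equation appears as an image in the original post; transcribed into notation consistent with the surrounding discussion (my transcription, so take the symbols with a grain of salt), it reads:

```latex
ACRT(d_j \mid d_j) \;=\; ATT(d_j \mid d_j) \;-\; ATT(d_{j-1} \mid d_j)
```

That is, the ACRT is the change in dose group j’s own ATT from a one-step increase in dose.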
Now this is somewhat subtle, so let’s look closely. The ACRT is for a particular dosage group j who took dosage j. But it’s a local comparison between dosage j (which they took) and the adjacent dosage level j-1. This is why, in the discrete case, it’s a diagonal connecting two points on the ACR — it’s the change in the ACR associated with a small increase in dosage.
The data collection that we undertook in our abortion paper involved several datasets. The first thing we needed was the location of every licensed abortion clinic in the state, as well as those in the nearby states of Louisiana, New Mexico, Colorado and Oklahoma. We used a variety of sources, like licensure data from the Texas DSHS, as well as clinic websites, judicial rulings, newspaper articles and websites tracking clinic operations. We used these data to construct two county-level measures of abortion access per quarter: the distance to the nearest abortion provider and a measure of congestion called the “average service population”.
The next step was to calculate the distance to the nearest abortion clinic from each county in the state. We used a program in Stata called georoute which was written by Weber and Péclat. This allowed us to calculate the travel distance from the population centroid of every county to the nearest abortion clinic, including the ones in the nearby states. We did this for every quarter because once the clinics began disappearing, the travel distance from each centroid might change depending on whether the one that closed was the closest to that centroid. I already showed you a map of changes in distance across the state, so let me instead show you the changes in number of clinics and the average distance to clinics after House Bill 2.
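Our paper used georoute’s driving distances, but the quarter-by-quarter nearest-clinic logic is easy to sketch. Below is a simplified straight-line (haversine) version with made-up coordinates and made-up clinic open/close quarters; everything here is hypothetical and only illustrates why distance must be recomputed each quarter as clinics close:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in miles, not road distance."""
    r = 3958.8  # Earth's radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Made-up county centroid and clinic locations with open/close quarters.
centroid = (31.76, -106.49)          # a hypothetical far-west county
clinics = [                          # (lat, lon, first_open_q, last_open_q)
    (30.27, -97.74, 0, 12),          # distant clinic, open all quarters
    (31.85, -106.44, 0, 5),          # local clinic, closes after quarter 5
]

def nearest_clinic_miles(lat, lon, quarter):
    open_now = [c for c in clinics if c[2] <= quarter <= c[3]]
    return min(haversine_miles(lat, lon, c[0], c[1]) for c in open_now)

# Once the local clinic closes, the county's distance measure jumps.
before = nearest_clinic_miles(*centroid, quarter=4)
after = nearest_clinic_miles(*centroid, quarter=8)
```

Run per county per quarter, this produces exactly the kind of county-quarter distance panel described above (with road distances substituted for the straight-line ones).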
Next we used data on abortion rates and birth rates for Texas. We used publicly available data on Texas abortions by county of residence which the Texas DSHS maintains. To produce these data, the Texas DSHS compiles a county-level count of abortions performed in-state as well as out-of-state abortions from the State and Territorial Exchange of Vital Events system. We calculated abortion rates using the SEER population data, a commonly used measure of population. You can read more about these data and their shortcomings in the data section of our article.
We also collected data on birth rates from the restricted-use natality files from the National Center for Health Statistics 2009 to 2015. These data count every birth that took place in the US over this time period and to calculate birth rates, we divided births by the SEER population values for that county.
We call our estimation strategy a “generalized difference-in-differences design” because we exploited within-county variation over time using county fixed effects and time fixed effects. We wrote that our identifying assumption was:
“that changes in abortion rates for counties with small changes in access provide a good counterfactual for the changes in abortion rates that would have been observed for counties with larger changes in access if their access had changed similarly.”
It is interesting, in hindsight, that we chose this language instead of appealing to explicit identification notation referencing parallel trends. As it stands, though, the exact language we used may have inadvertently captured precisely the identification conditions that Callaway, Goodman-Bacon and Sant’Anna (2021) write must hold when using the continuous difference-in-differences design, for the continuous-treatment difference-in-differences requires both parallel trends and a type of homogeneity that one might say translates into being “a good counterfactual” between small and large changes, or dosages.
We chose as our model the Poisson fixed effects model because abortions are counts and because some counties have zero abortions. The model includes county and time fixed effects and controls for county covariates (demographics, unemployment, family-planning access). We cluster the standard errors at the county level, which addresses the overdispersion problems that can otherwise plague Poisson models (see Cameron and Trivedi 2005).
Most of our estimation focuses on nonlinear effects of distance through the inclusion of four dummies: distance between 50 and 100 miles, between 100 and 150 miles, between 150 and 200 miles, and greater than 200 miles. We chose to do this because we did not want to impose homogeneous treatment effects on the distance measure itself, as it seemed reasonable that some distances might be more impactful than others. Note, we weren’t saying that the ATT for a given dosage would differ by county so much as we were saying that the effect of a particular dosage (miles to the nearest clinic) could differ across dosages, regardless of the county that experienced that distance. Though this may seem like a small point, as we will see when I refer back to Callaway, Goodman-Bacon and Sant’Anna (2021), it is actually an important distinction.
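As a sketch, the distance bands can be constructed like this (the data are hypothetical, and our actual estimation used a Poisson fixed effects model in Stata, not this code):

```python
import pandas as pd

# Hypothetical county-quarter panel: distance (miles) to nearest clinic.
df = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "quarter": [1, 2, 1, 2],
    "dist": [30.0, 160.0, 45.0, 220.0],
})

# Four distance dummies (omitted category: under 50 miles), so each
# distance band can carry its own effect rather than one linear slope.
cuts = [0, 50, 100, 150, 200, float("inf")]
names = ["lt50", "d50_100", "d100_150", "d150_200", "d200plus"]
df["band"] = pd.cut(df["dist"], cuts, labels=names, right=False)
dummies = pd.get_dummies(df["band"]).drop(columns="lt50").astype(int)
df = pd.concat([df, dummies], axis=1)

# These dummies would then enter a Poisson regression of county abortion
# counts on the bands plus county and quarter fixed effects, with standard
# errors clustered by county, as described in the text.
```

The omitted under-50-miles band plays the role of the comparison dose.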
Identifying assumptions for the ATT
The authors lay out three main assumptions, and while we will ultimately need more than just these three, they are nonetheless necessary for identification. They are:
Random sampling. In other words, we need data.
Support. In other words, there must be units at these actual dosages, otherwise this is sort of stupid.
No anticipation. Little-known fact: the no-anticipation assumption is actually a SUTVA-type assumption, because what it says is that the observed outcome, Y, for a particular person i at some time period t is based on that person i’s treatment status D at that same time period t. A violation would be if my outcome, Y, was based not on my current treatment status, but on tomorrow’s. That sort of deal is referred to in this literature as “anticipation”, and with forward-looking rational agents, it’s easy to see how it could be violated. But we need it, in part because we need a base period for all our comparisons. Again, it’s interesting that SUTVA shows up in the DiD literature not because of spillovers from neighboring units, but because of spillovers from our future lives. If I know I’m about to get a raise, I may buy a new car now, even though I technically haven’t gotten the raise yet.
But as this is a DiD design, we know we will need a parallel trends assumption too. In fact, to be a DiD design is to have a parallel trends assumption of some kind. Put more strongly, I believe that parallel trends is what unites all of the DiD based econometric estimators. And we know what that looks like. It looks like this counterfactual trend for our dosage group compared to the rest of the sample:
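In symbols (my transcription, following the paper’s notation), the parallel trends assumption for dose group d says that, absent treatment, its outcomes would have trended like the untreated group’s:

```latex
E[Y_t(0) - Y_{t-1}(0) \mid D = d] \;=\; E[Y_t(0) - Y_{t-1}(0) \mid D = 0]
```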
Okay, so let’s see how far we get if we just assume these four canonical DiD assumptions. Apologies for how ugly this looks — my equation editor is acting up.
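Reconstructed in LaTeX from the description that follows (so treat the exact notation as my transcription), the derivation runs:

```latex
\begin{aligned}
&E[Y_t - Y_{t-1} \mid D = d] - E[Y_t - Y_{t-1} \mid D = 0] \\
&\quad = E[Y_t(d) - Y_{t-1}(0) \mid D = d] - E[Y_t(0) - Y_{t-1}(0) \mid D = 0] \\
&\quad = ATT(d \mid d) + \underbrace{E[Y_t(0) - Y_{t-1}(0) \mid D = d]
  - E[Y_t(0) - Y_{t-1}(0) \mid D = 0]}_{\text{parallel trends bias}}
\end{aligned}
```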
The LHS of the top line is what I tend to call the “DiD equation”, as it’s just the canonical formula for calculating DiD as the before-and-after difference for a treated and untreated group. We then use no anticipation to assign potential outcomes and get the RHS of the top row. The second line is the ATT for the d dosage group plus a parallel trends bias term, but if we assume parallel trends, then that bias term disappears, leaving only the ATT(d|d). As this is a 2x2 without differential timing, the identification of the ATT needs only parallel trends and the first three assumptions.
Identifying assumption of the ACRT
The identification of the ACRT does not follow from these four assumptions, unfortunately. It is in fact this discovery that leads Brantly Callaway, in the embedded YouTube interview below, to say the paper is largely filled with “bad news”. Let’s look at this bad news together. To calculate the ACRT, we must compare the ATT for dosage group j and the dosage group just below it, j-1. I’ll go in steps, just because I figure people may want to see this in slow motion. The LHS of the next equation is simply the change in ATT between the two dosage groups, j and j-1, that I previously mentioned.
On the RHS I did a common trick with decompositions: I added a zero (can you find it?). Once I have all four terms, I rearrange them slightly to get a causal parameter, the ACRT, plus another bias term. Unlike with the ATT, this bias is not the parallel trends term we saw before; that parallel trends assumption, in fact, was already used up identifying the ATT. This bias is instead built from two separate ATT terms.
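In symbols (again my transcription of the image), adding and subtracting ATT(d_j-1 | d_j) gives:

```latex
\begin{aligned}
ATT(d_j \mid d_j) - ATT(d_{j-1} \mid d_{j-1})
&= \underbrace{ATT(d_j \mid d_j) - ATT(d_{j-1} \mid d_j)}_{ACRT(d_j \mid d_j)} \\
&\quad + \underbrace{ATT(d_{j-1} \mid d_j) - ATT(d_{j-1} \mid d_{j-1})}_{\text{selection bias on gains}}
\end{aligned}
```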
Now the reason that this is the ACRT (technically it’s ACRT(d_j|d_j)) is that it’s the change in ATT for the d_j group as its dose moves from d_j-1 to d_j. And if that change in dosage occurs for a particular dosage group, then it’s equivalent to a causal effect because it’s a movement along the ACR for that j group. This is the target parameter we are after. In a regression, it is what we had hoped the coefficient on our continuous treatment variable amounted to.
But what is this second term? Look very closely. The second term is the difference in ATT for the j and j-1 groups at the same dosage, j-1. Maybe this will be easier to understand with another picture. Under the four assumptions that we listed — including parallel trends — we identify the causal parameter ACRT(b|b), the change in the ACR for the b group, plus the difference in ATT for our a and b groups at dosage a. In the figure, these are labeled “ACRT(b|b)” and “bias” on the vertical axis.
It’s interesting to see this decomposition take this form, because in my book I found a similar decomposition for the simple difference in means. When I decomposed the simple difference in mean outcomes into causal terms and biases, I found that it was equal to the ATE, selection bias, and what I called the weighted heterogeneous treatment effects term. And Callaway, Goodman-Bacon and Sant’Anna are in many ways finding something a bit similar.
Selection bias on gains
The bias that Callaway, Goodman-Bacon and Sant’Anna (2021) find is simply selection bias. But it is a particular kind of selection bias, because it refers to differences in average treatment effects at a given dosage between two different groups. Unlike classic selection bias, which is the difference in Y(0) between two groups of people, the bias of a continuous treatment difference-in-differences comes from heterogeneity in the gains from treatment. In other words, if groups of units have heterogeneous gains at some dosage, then the continuous treatment DiD is contaminated by differences in the dosage groups’ own expected returns. Without additional assumptions, this bias term will persist, and its sign is ambiguous: the differences can be positive, negative or zero depending on how the ACR varies across groups.
This is a potential problem because if we are to identify the ACRT, we must assume both parallel trends and equal ATT across groups. But this is a hard assumption for most applied microeconomists to accept, because when we see people naturally choosing different treatment levels, we tend to think they have heterogeneous treatment effects at the same treatment level. Consider this example: two students stop their education at different points. The first student stops at 12 years of schooling, but the second stops at 16 years. Do you think the returns to school would have been the same for these two students had both stopped at 12 years? If you do, then there is no selection bias. But if you don’t, then there is. And it’s natural to think the returns did differ, because otherwise why didn’t the second student stop at 12 years too?
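The schooling example can be mimicked numerically. In this sketch (every number is invented for illustration), two groups satisfy parallel trends in untreated outcomes but earn different returns at the same dose, and the simple 2x2 continuous DiD comparison picks up the ACRT plus the ATT gap at the common dose:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two dosage groups: group "a" receives dose 1, group "b" receives dose 2.
# Hypothetical treatment-effect functions (my own numbers, not the paper's):
# group b gains 3 per unit of dose, group a gains 1 per unit of dose, so the
# groups have different ATTs at the SAME dose -- selection on gains.
def tau(group, dose):
    return (3.0 if group == "b" else 1.0) * dose

trend = 0.5  # common untreated trend, so parallel trends holds exactly
dy_a = trend + tau("a", 1) + rng.normal(0, 1, n)  # observed change, group a
dy_b = trend + tau("b", 2) + rng.normal(0, 1, n)  # observed change, group b

did = dy_b.mean() - dy_a.mean()       # the 2x2 continuous-treatment DiD
acrt_b = tau("b", 2) - tau("b", 1)    # true ACRT(b|b) = 3
bias = tau("b", 1) - tau("a", 1)      # ATT gap at the common dose = 2

print(round(did, 2), acrt_b, bias)    # DiD lands near 5 = ACRT (3) + bias (2)
```

The comparison is off not because parallel trends fails (it holds by construction) but because the two groups would have gained differently at the same dose.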
Now that you know our model specification and data sources, let’s look at what we found. I’m going to present our results out of order, because the purpose of this substack is primarily to focus attention on the continuous treatment, not binary treatments at different dosage levels. So let’s look first at Table 3, which shows the effect of distance in 100s of miles on log outcomes.
We break our results up by non-Hispanics and Hispanics, and for the sake of space I will focus on just the distance measure, though note that our average service population, a measure of the congestion caused by the fact that with fewer clinics each remaining clinic must serve more people, is an independent causal parameter we were interested in. We find in our analysis that for every 100 miles a woman must travel from the population centroid of her residence to the nearest clinic, abortions fell by approximately 13% to 26%.
Now let’s think about what these twoway fixed effects Poisson regressions have identified, if anything at all. First, assuming random sampling, support, no anticipation and parallel trends, we have identified the ACRT plus the kind of selection bias based on differences in ATT across counties with differing distances, evaluated at the same distance. Thus, for -13% to -26% to be unbiased measures of the semi-elasticity of abortions with respect to distance, it must be the case that the counties reshuffled to differing distances by House Bill 2 would have had the same semi-elasticity had they all been assigned the same distance. This, of course, is not testable, as it amounts to comparing counterfactuals along multiple dimensions, but it is in fact what we were assuming when we said that the low and high distance counties were “good counterfactuals” for each other.
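To make the mechanics of such a regression concrete, here is a minimal sketch on simulated data. Everything in it (the county count, the uniform distance draw, the -0.2 semi-elasticity) is invented, and the Poisson model is fit by hand with Newton’s method so that the example needs only NumPy; it is not the paper’s specification or data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy panel: 40 hypothetical counties observed for 6 years (invented numbers).
C, T = 40, 6
county = np.repeat(np.arange(C), T)
year = np.tile(np.arange(T), C)
post = (year >= 3).astype(float)            # closures hit mid-panel, HB2-style
extra = rng.uniform(0, 2, C)                # added distance in 100s of miles
dist = extra[county] * post                 # distance only rises post-closure

true_beta = -0.2                            # true semi-elasticity per 100 miles
county_fe = rng.normal(3.0, 0.3, C)
mean = np.exp(county_fe[county] + 0.05 * year + true_beta * dist)
y = rng.poisson(mean).astype(float)

# Design: distance + full set of county dummies + year dummies (year 0 omitted)
X = np.column_stack([
    dist,
    (county[:, None] == np.arange(C)).astype(float),
    (year[:, None] == np.arange(1, T)).astype(float),
])

# Poisson MLE by Newton's method (iteratively reweighted least squares)
beta = np.zeros(X.shape[1])
beta[1:1 + C] = np.log(y.mean())            # start county effects near log(mean)
for _ in range(50):
    fit = np.exp(X @ beta)
    step = np.linalg.solve(X.T @ (fit[:, None] * X), X.T @ (y - fit))
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

print(round(beta[0], 2))  # estimated semi-elasticity, should sit near -0.2
```

Note that the coefficient on distance is a single constant semi-elasticity by construction here; the whole point of the decomposition is what that coefficient absorbs when the constancy fails across dosage groups.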
But we were not really focused on that constant semi-elasticity. Rather, we wanted to see if there were nonlinearities in treatment effects as women were forced to travel further and further distances. We chose to model these nonlinearities with four dummies based on distance: a 50-100 mile term, a 100-150 mile term, a 150-200 mile term and a greater-than-200-miles term. While the treatment is technically no longer expressed as a continuous measure, simply discretizing the variable is not by itself enough to get around the selection bias problem. This is because the comparisons needed to construct these estimates still come from comparing high and low dose counties, and as such, the differences in ATT for those groups remain in the decomposition. Putting that issue aside for now, column 1, labeled “Total”, suggests that distance may have caused abortions to decline by 18% for those who had to travel 50-100 miles (compared to those who barely had to travel at all), 33% for those who had to travel 100-150 miles, 48% for those who had to travel 150-200 miles and 59% for those who had to travel more than 200 miles.
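For readers who want to construct this kind of binning themselves, a minimal sketch with pandas (the distances are hypothetical, the bin edges come from the text, and I am assuming an omitted below-50-miles category as the comparison group):

```python
import pandas as pd

# Hypothetical distances (miles) from a county centroid to the nearest clinic
df = pd.DataFrame({"distance": [30, 75, 120, 180, 250]})

# Bin edges from the text; right=False makes intervals like [50, 100)
bins = [0, 50, 100, 150, 200, float("inf")]
labels = ["<50", "50-100", "100-150", "150-200", ">200"]
df["bin"] = pd.cut(df["distance"], bins=bins, labels=labels, right=False)

# Dummy columns for the regression, dropping the omitted "<50" category
dummies = pd.get_dummies(df["bin"]).drop(columns="<50")
print(dummies.sum().to_dict())  # one county falls in each distance bin
```

Each dummy coefficient then compares a bin to the omitted short-distance group, which is how the 18%, 33%, 48% and 59% figures in Table 3 are read.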
While there is a lot more in their paper, including alternative assumptions one may make beyond homogeneous treatment effects across dosage groups at the same dosage, discussions of differential timing, and a detailed analysis of twoway fixed effects, I find that most of the insights come from just the simple 2x2 case. And it’s interesting to reread our paper in light of theirs, because it makes me wonder whether we unknowingly went about this issue, for our own reasons, in a way that avoided some of the major problems they identified for the continuous treatment case. Let me explain.
First, the nature of this experiment was one in which counties did not choose distance based on expected gains. Rather, a county’s nearby clinic closures may have been genuinely unrelated to the impact those closures would have on residents. This is debatable. Perhaps the clinics folded because demand was low in the area; after all, wouldn’t we expect areas with low demand to be more sensitive to the regulations? And if they had low demand, might that imply the ATT for a given distance differed from the ATT of a different county at a different distance, had both been treated at the same distance? Perhaps.
But another point is worth noting, too. There is a sense in which the reshuffling of county distances may have been arbitrary, and insofar as it was arbitrary, the selection of counties into treatment was unrelated to potential outcomes, which means it was unrelated to the ATT, which implies that the ATT may be the same across these groups. After all, the independence assumption is what eliminates both selection bias and heterogeneous treatment effect bias in the simple difference in means, because under independence the treatment and control groups have the same average treatment effect. By the same logic, the more randomly distance was reassigned, the more likely the ATT is the same across dosage groups at a given dosage. Such is the nature of randomization.
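This logic can be checked numerically. In the sketch below (all numbers invented), every unit has its own return per unit of dose, so treatment effects are heterogeneous, but doses are assigned at random. The ATT is then the same across dosage groups at a common dose, and the simple high-minus-low comparison recovers the ACRT:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Each unit has its own return-to-dose slope (heterogeneous gains)...
slope = rng.normal(2.0, 1.0, n)
# ...but dose is assigned at random, independent of those gains
dose = rng.choice([1.0, 2.0], n)

trend = 0.5  # common untreated trend, so parallel trends holds
dy = trend + slope * dose + rng.normal(0, 1, n)

# 2x2 comparison of high- vs low-dose groups
did = dy[dose == 2].mean() - dy[dose == 1].mean()
# True ACRT(2|2): average slope among dose-2 units times the 1-unit dose change
acrt = slope[dose == 2].mean() * 1.0
print(round(did, 2), round(acrt, 2))  # both land near 2: the bias term vanishes
```

Random assignment makes the average slope the same in both dose groups, so the ATT gap at the common dose, the bias term in the decomposition, is zero in expectation.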
Ultimately, what it means for a low dose group to be a good counterfactual for a high dose group is both the parallel trends assumption and the equivalence of the ATT for different groups assigned different distances, had they been given the same distance. It is difficult to make clear statements about this; checking pre-trends will not help us here because under no anticipation, the ATT is zero for all groups in the pre-treatment period anyway, as is the ACRT. We may need to wait a bit longer to figure out just what, if anything, we are to do with continuous treatments in a difference-in-differences framework, but as this decomposition shows, the presence of this selection bias, with no easy fix, will require credibly arguing that the ATTs are equal across all dosage groups at the same dosage level. While that holds under physical randomization of dosage, it may fail in situations where groups choose their dosage levels based on expected gains.
This new continuous treatment DiD paper is, in my opinion, an extremely important addition to the DiD methodology literature. It is filled with bad news, unless you consider improving one’s understanding of the world to be good news. And if so, then how can anything in this paper possibly be considered bad news? The truth is always good news. I encourage you to read the paper closely. I think you will enjoy it just as much as I did.
Interview with Brantly Callaway
I’d like to conclude this substack with an interview I did a few weeks ago with one of the continuous DiD paper’s authors, Dr. Brantly Callaway of the University of Georgia. The interview is a reflective discussion about everything from being an econometrician, to difference-in-differences, to continuous treatments and this paper specifically. If you have 90 minutes, I think it’ll go by much faster than you expect. Thank you for hanging with me this far. Good luck with your new year.
Thanks to Andrew Goodman-Bacon for collecting these and showing the picture to me.