The Much Quieter Revolution of Synthetic Control: Episode I
An Abadie, Diamond and Hainmueller explainer
An Introduction to Synthetic Control
Though sometimes large social programs and public policy changes are randomized, that is more the exception than the rule. Most of the time, policy change occurs in a political environment where people fight for resources using votes, democracy, budgets and politicians, all governed by regional institutions, charters and rules. Understanding the causal effect of an intervention is hard enough with randomization, but it is extremely difficult to find your way in the swirling winds of American federalism. And yet, since the benefits of understanding these programs are so large and so important, generations of scientists have developed tools for handling these kinds of situations, and synthetic control is one of them.
Synthetic control is an econometric estimator that appeared in a 2003 article in the American Economic Review. The method has become popular across the social sciences for its usefulness at estimating causal effects with panel data, and that popularity has migrated to industry as companies like Uber, Microsoft, Amazon and others now use it too. Consider this quote by Susan Athey and Guido Imbens from a 2017 article in the Journal of Economic Perspectives entitled “The State of Applied Econometrics: Causality and Policy Evaluation”:
“The synthetic control approach developed by Abadie, Diamond, and Hainmueller (2010, 2015) and Abadie and Gardeazabal (2003) is arguably the most important innovation in the policy evaluation literature in the last 15 years.”
This method was first developed for comparative case studies to credibly estimate the causal effect of some aggregate unit’s exposure to some treatment on some chosen outcome using an optimally weighted average of comparison units as its imputed counterfactual. That weighted average is the treatment group’s “synthetic control” in that the researcher has imputed what might have happened had the treatment never occurred. The difference between what did happen and what might have happened is an estimate of the causal effect of the treatment at each point in time.
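For those who like a little notation, here is that idea written roughly the way ADH write it in the 2010 paper, with the treated unit indexed as unit 1 and the donor pool as units 2 through J+1:

$$\hat{\alpha}_{1t} = Y_{1t} - \sum_{j=2}^{J+1} w_j^{*}\, Y_{jt}$$

where $Y_{jt}$ is unit $j$'s outcome at time $t$ and $w_2^{*}, \dots, w_{J+1}^{*}$ are the optimal weights that define the synthetic control.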
The original architects of synthetic control were Alberto Abadie and Javier Gardeazabal in a 2003 AER on terrorism in the Basque Country. The paper was followed up with a 2010 JASA on smoking regulations and then a 2015 article in the AJPS on the reunification of Germany, both with Alexis Diamond and Jens Hainmueller. Papers that use this method tend to have certain themes. For instance, the papers will typically focus on large, sometimes massive, macro-like shocks to aggregate units. Counterfactuals will be explicit imputations based on weighted averages of control group units. And you will also find lots of beautiful pictures with fitted lines that even a child can interpret. It was an exotic piece of machinery for its blending of different principles and techniques such as matching, unconfoundedness, forecasting, imputation, difference-in-differences, event studies and factor models.
But synthetic control is changing as more and more econometricians, statisticians and applied researchers shift their attention to its usefulness, properties, strengths and limitations. These updates to the underlying econometrics have led to augmentations making it an interesting area to follow, but that also poses a problem. Because it is now a large area, covering all or even most of it is beyond the scope of a simple substack entry, or even a series of entries. So I am going to cover the original paper today, and then in three subsequent substack entries cover three other ones, as I take a pause from difference-in-differences.
For an introductory discussion of synth, see my chapter on synthetic control in my book on causal inference. And for an advanced discussion, see Abadie’s recent review. Today, though, I will simply discuss certain often overlooked points about the model as a way of setting up future discussions, and I will be doing that mainly using the 2010 JASA with Alexis Diamond and Jens Hainmueller, which I will abbreviate as the ADH model. In subsequent episodes in this series, I will be discussing a few of the new synth papers, starting with “Synthetic Difference in Differences”, forthcoming in the AER, written by a dream team from Stanford University: Arkhangelsky, Athey, Hirshberg, Imbens and Wager. Next I will discuss “The Augmented Synthetic Control Method”, published earlier this year in JASA by Ben-Michael, Feller and Rothstein. And then I will conclude with another paper by this team entitled “Synthetic Controls with Staggered Adoption”, forthcoming in the JRSS. Pray that like Nate and Beard, I stick the landing on this one.
An Introduction to Alberto Abadie
I recently had the opportunity to sit down for an interview with Alberto Abadie where we discussed his creation of synthetic control, his motivation to develop it, his thoughts on the “credibility revolution”, and where he sees all of this going. I’ve embedded it below for those who want to take a half hour and watch it.
Abadie is a professor of economics at MIT where he was also at one time a doctoral student. His advisers were Josh Angrist and Whitney Newey, two giants in causal inference and econometrics. After graduating from MIT in 1999 he took a post at Harvard Kennedy School and made full professor in five short years. Fifteen years after graduating from MIT, he returned to his alma mater as Professor of Economics and the Associate Director of the Institute for Data, Systems, and Society (IDSS). In addition to his distinguished publishing career and editorial positions at the top econometrics journals, he is also a Fellow of the Econometric Society.
Alberto Abadie is an extremely creative and insightful econometrician whose contributions extend far beyond just synthetic control. Looking down his Google Scholar page, you find numerous other classics like his work on matching and semiparametric difference-in-differences, as well as his new work on design-based inference. It is no surprise that Athey and Imbens lavish such praise on synthetic control. His two most cited papers are those synthetic control articles, each with over 4,000 citations and growing. His sixth most cited is his paper in the AJPS on the reunification of Germany. Alberto Abadie has almost 30,000 citations — a number whose annual counts, as can be seen in the top right of the picture below, have grown more and more every year.
As a teacher of econometrics, Abadie is as good a communicator as I’ve ever seen. He is that rare teacher who can explain difficult material so well that you think maybe it was always simpler than you thought and that maybe you could’ve figured it out too. Teaching econometrics to non-econometricians so that they feel confident is quite a trick and most do not excel at it, but that has not been my experience with Abadie. While I cannot verify this story, it feels true: I once heard that one day, after a lecture at Harvard, Abadie’s entire class stood and clapped.
It’s interesting to go back and look at those early papers, though, because not all of them are pure econometric theory. I find it interesting that Abadie wrote three papers on terrorism, not just one. The paper in which he developed the synthetic control method, for instance, with Gardeazabal, was an effort to understand the impact of terrorism in Spain. As he says in our interview, that project had been deeply personal for him because he grew up in the Basque Country, which had been plagued by terrorism as far back as his childhood. It is interesting that such an important contribution to causal inference of the last fifteen years came from such a personal place. That is not what we often associate with discoveries and creative work in econometrics, but I’m sure the stories of science are far more interesting than we’ve been told.
He would go on to write another paper with Gardeazabal on terrorism in a 2008 European Economic Review, as well as one more in the Journal of Urban Economics in 2008 with Sofia Dermisi, an expert in urban design and real estate. Abadie and Dermisi found that the 9/11 terrorist attacks in NYC caused skyscrapers’ vacancy rates to rise, not just in NYC, but in other cities as well.
ADH synthetic controls and non-negative weights
When conducting an impact evaluation of events like policy changes, the researcher must make hard choices about who will, and maybe more importantly who will not, be counted as the counterfactual for the treatment group. That means if we are going to get this plane off the ground at all, we are going to have to think long and hard about which causal effect we want to know, and only then about which technique has any hope of telling us what it is. Weights become a part of this insofar as they possess properties that allow for causal interpretations.
The subject of weighting should be familiar to the astute reader who has been following the new difference-in-differences material, because the new difference-in-differences literature has been explicitly focused on weighted aggregations. But not all weights are equal. What does it mean that weights, for instance, can cause an estimate to have a causal interpretation? What does it matter if weights are positive or negative? When will they be negative, and why, and who even cares? What even is being weighted at all, and how do weights behave when treatment effects are heterogeneous across time and units? I have heard more about weights in the last two years than I have heard about them in the 12 years before that.
But the reason we hear so much about weights in causal inference is because weighted aggregations are an important part of causal inference, even if they have not always been discussed as explicitly as they are right now. While the weighting in the ADH model is slightly more complicated because the weights arise from a more complex process, I think it would be helpful to start with a very simple numerical example illustrating basic ideas. Kanye has $50,000, Sean has $25,000 and Drake has $75,000, and I want to make a counterfactual for Kanye — a synthetic Kanye, if you will — based on Sean and Drake. Obviously, neither Sean nor Drake alone looks much like Kanye, just like no one is a perfect match for anyone. Even twins are different people. But maybe if we grouped Sean and Drake, then they might look like Kanye as a group. So, does there exist a vector of weights such that, when I apply them to Sean and Drake, the weighted average is similar to Kanye’s income? Let’s look at this question using a few different numbers.
Approximate balance with interpolation
Let’s take a random set of weights. What if we weighted Sean’s income by 1/3 and Drake’s income by 2/3? Then we’d have a synthetic Kanye with an imputed income of $58,333.
Seems decent, right? After all, $50,000 and $58,333 aren’t too different. But can we do better? Can we find a different set of weights that could get us even closer to $50,000?
Exact balance with interpolation
Let’s try the following weights: 1/2 and 1/2 applied to Sean and Drake.
Ah hah. Looks like 1/2 and 1/2 are “better” weights than 1/3 and 2/3. We definitely want to write those down if our ultimate goal is to find weights that create weighted averages that are reflections of our treatment group.
Exact balance with extrapolation
But we aren’t done because [1/2,1/2] is not the only way to create $50,000 out of Sean and Drake. We can also do it using weights [-1,+1].
Well what do you know? When we weight Sean by -1 and add that to Drake weighted by +1, we also get an exact synthetic control for Kanye. Both [1/2,1/2] and [-1,+1] will recreate Kanye’s income exactly using a weighted average of Sean and Drake. So which one will we pick?
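If you want to see those three weightings in one place, here is a tiny Python check of the arithmetic. Nothing here comes from ADH; it is just the toy example above.

```python
# Toy example from above: build a "synthetic Kanye" out of Sean and Drake.
sean, drake, kanye = 25_000, 75_000, 50_000

candidate_weights = {
    "approximate balance": (1 / 3, 2 / 3),   # imputes $58,333
    "exact, interpolation": (1 / 2, 1 / 2),  # imputes $50,000
    "exact, extrapolation": (-1, 1),         # also imputes $50,000
}

for label, (w_sean, w_drake) in candidate_weights.items():
    synthetic_kanye = w_sean * sean + w_drake * drake
    print(f"{label}: ${synthetic_kanye:,.0f} (Kanye is ${kanye:,})")
```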
Synthetic control, at its core, is an engine that turns and turns until it finds a unique set of “optimal weights” which, for a given set of covariates and their combinations, create a counterfactual that is as close as possible to the treatment group’s own matrix of covariates. These weights are solutions to a constrained minimization problem involving a distance function based on a treatment group’s covariates and a weighted control group’s covariates. And while the details of that optimization procedure are somewhat complicated, the spirit of the synthetic control is simple. All ADH did was give us a recipe that finds the set of weights that does this same type of minimization in K dimensions, where K is the number of covariate combinations under consideration.
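To make that concrete, here is a minimal sketch of that constrained minimization in Python with made-up covariate numbers. It uses a plain SLSQP solver and skips the nested step where ADH also choose how much importance (the V matrix) each covariate gets, so treat it as the spirit of the procedure rather than the procedure itself.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up numbers: K = 3 covariates for the treated unit and J = 4 donors.
X1 = np.array([6.0, 120.0, 0.35])             # treated unit's covariates
X0 = np.array([[4.0, 7.0, 5.5, 9.0],          # K x J donor covariate matrix
               [100.0, 140.0, 110.0, 160.0],
               [0.20, 0.45, 0.30, 0.50]])
J = X0.shape[1]

def distance(w):
    # squared distance between treated covariates and weighted donor average
    return np.sum((X1 - X0 @ w) ** 2)

result = minimize(
    distance,
    x0=np.full(J, 1.0 / J),                                        # start from uniform weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * J,                                       # non-negativity
    constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # weights sum to one
)
print(result.x.round(3))   # the "synthetic control" weights
print(X0 @ result.x)       # the imputed covariates, to compare against X1
```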
But what if we have a tie like the above? Or, what is more common, what if the closest match required using negative weights? Should it matter that some of the weights are negative? Which one should we pick and why? ADH2010 proposed that we restrict optimal weights to be (1) non-negative and (2) sum to one. This is where some of the subtleties of ADH arise. It wasn’t merely that ADH proposed a solution to some minimization problem. It was that the solution prevented extrapolation, and you could only prohibit extrapolation if you prohibited negative weights. The original ADH model refused to even consider a vector of weights that had any negative values because it was designed on the principle of blocking extrapolation of the counterfactual. Consider the following quote from ADH2010, which comes just after a discussion of the conditions under which a weighted set of units in the donor pool can erase all bias from estimating a causal effect using those weights as the stand-in counterfactual.
Notice how ADH say that the tradeoff in limiting weights to be non-negative and sum to one is that you can no longer extrapolate to generate counterfactuals. I do not fully understand all of what is at stake when using negative weights, but I have often wondered whether banning negative weights was a scientific principle or more of an aesthetic opinion about what a “good counterfactual” should be. As best I can deduce, we are to believe that it matters based on our internal tolerance for using units as controls that are so different from our treatment group that we have to extrapolate from their properties to get something that looks like our treatment group.
But sometimes, the closest we can get using non-negative weights is really not very close at all. Just because we can solve a minimization problem subject to those constraints does not mean the resulting estimate of the counterfactual is a good one. Maybe the closest we can get just means we shouldn’t even be doing it. Requiring non-negative weights may simply force us to accept such a level of imperfection between the two groups that any comparison is so obviously contaminated with selection bias that we might as well throw the project away. It was to address these problems of imperfect fits caused by non-negative weights that researchers began to explore whether we could or even should relax the “no extrapolation” rule of synthetic control.
Synthetic control is unpopular because she never hides the truth
But let’s put aside weights for a minute and discuss this claim that ADH has been an important innovation. It’s funny that Athey and Imbens say that, because in the applied microeconomics circles that I’ve found myself in, synthetic control has not been especially popular. Unlike difference-in-differences, my observation and my experience have been that synthetic control is the black sheep of causal inference. Indulge me as I speculate about why I think this is using a few anecdotes from my own life.
I had a paper using synthetic control that I shopped around for years before it was finally published. It was one of the early applications of synthetic control. Manisha Shah and I used it to study the effect of an unexpected judicial ruling that effectively decriminalized indoor sex work. Manisha and I presented that paper dozens of times before ever submitting it, a not uncommon practice within economics, as readers know. And early on, after one of the presentations, a discussant praised the paper but not the method. He explicitly told us to drop synthetic control and instead use a simple difference-in-differences approach. Why did he say that? He didn’t give a great explanation — to him it was just obvious that regression was superior. His preference for difference-in-differences seemed to rest on nothing more sophisticated than the feeling that synthetic control was too much of a black box for his taste.
We had a similar thing happen during the refereeing process too, where a referee explicitly told us to take the synthetic control model out. But I was once told by a senior adviser, “you can disagree with a referee only once”, and so we chose to make keeping ADH in the paper our one time. We did so by proposing a compromise in which DiD provided our main results and ADH served as secondary “robustness” checks, as opposed to how we had historically written the paper, where ADH was the main result. In the end, the results didn’t change a whole lot either way, but we were relieved to include it as we found its evidence to be important.
But the third anecdote was really interesting. I was at a talk at the ASSA meetings one year on synthetic control. Abadie was there presenting his Germany reunification paper (with Diamond and Hainmueller) and he had a table that is common in synth papers in which the author shows the readers the weights of the synthetic control across all the individual donor pool units. The treatment group was West Germany, and one of the positively weighted units from the donor pool was Japan with a weight of 0.16. Abadie discussed the table and then eventually concluded with a discussion about his evidence about the effect of the reunification of Germany on the people’s incomes.
A very senior economist from MIT stepped up after Abadie had finished and began discussing the paper but, more pointedly, the model. I only remember one thing the discussant said, and it had to do with those weights on Japan. The discussant basically dismissed the model because it chose Japan as part of the counterfactual. He said there’s no way you could convince him that Japan should be considered a control for Germany, which he supported with a list of reasons. All of those reasons seemed like reasonable ones to me.
But then I thought about it for a second. Why had this economist been able to say this in the first place? I don’t mean “where does he get off saying Japan shouldn’t have a weight of 0.16?!” What I mean is not judgmental at all. I just mean how did he even know Japan had a weight of 0.16 in the first place? And the answer is — because Abadie told him. With any other method that hadn’t produced a table like that, the discussant would never have even known what, if any, weight Japan had. And my hunch is he wouldn’t have asked for it either. One of the strengths of synthetic control is how candid and honest it is about who exactly is and is not in this manufactured counterfactual. Synthetic control is a very simple truth teller: “tell me the covariates, and I will tell you the best counterfactual for those covariates.”
In the 2015 ADH paper on the reunification of Germany, ADH derived the weights from OLS and presented them in Table 1 of the paper. These are the weights for each country — the second and fifth columns are the synth weights, the third and sixth columns are the regression weights. Look closely and you will see what I just said.
First of all, Japan has an even larger weight of 0.19 under OLS, which frankly means that economist has even more to complain about! But of course he won’t — not because he agrees with the regression, but rather because nobody ever produces tables like these. I only know these weights because ADH derived them and stuck them in a table. They are not part of the output in Stata when you type “reg” into the command line.
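For the curious, those regression weights can be backed out by hand. Here is a sketch in Python of my reading of the ADH 2015 appendix: the regression-based counterfactual is itself a weighted average of the donor units, with weights that sum to one (when a constant is included) but that are free to go negative. The data below are made-up placeholders, not the actual German reunification predictors.

```python
import numpy as np

# Placeholder data: k predictors (plus a constant) for J donor countries and
# for the treated unit. Swap in the real predictor matrices to reproduce ADH.
rng = np.random.default_rng(0)
k, J = 4, 16
X0 = np.vstack([np.ones(J), rng.normal(size=(k, J))])   # (k+1) x J donor predictors
X1 = np.vstack([[1.0], rng.normal(size=(k, 1))])        # (k+1) x 1 treated unit's predictors

# Regressing control units' outcomes on their predictors and then predicting
# the treated unit is equivalent to weighting the controls by:
#   W_reg = X0' (X0 X0')^{-1} X1
W_reg = X0.T @ np.linalg.solve(X0 @ X0.T, X1)

print(W_reg.round(2))   # some of these entries will typically be negative
print(W_reg.sum())      # equals one because the regression includes a constant
```

Nothing in that formula requires the treated unit to sit inside the convex hull of the donors, which is exactly the extrapolation ADH were trying to rule out.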
But notice what else is in this table — the negative weights. Every single negatively weighted control from the regression column had a zero weight in the synthetic control column. Most of the time, synthetic control does this — any unit that would’ve been weighted negatively just gets a corner solution of zero, which forces other units to carry the rest of the water up the hill.
The sin of synthetic control isn’t that it weights units when selecting its counterfactual, and it isn’t that it’s a black box. The sin of synthetic control is that it tells us those weights, and their size, which tells us exactly why the numbers are what they are. Regression, on the other hand, does the same thing, only it uses a “don’t ask, don’t tell” policy when sharing them. And furthermore, the fact that so many of us didn’t even know that regression was negatively weighting kind of makes you wonder which was the black box in the first place — synthetic control or regression?
Here’s what I wish I had said to the discussant.
Synthetic control isn’t the only method that weights Japan. OLS does too, it’s just that OLS doesn’t automatically produce a table with them; if you want them, you’ll have to derive them by hand.
But it isn’t just that OLS places a weight on Japan. It’s that OLS also allows for negative weights, and if you’re going that route, you’d better be comfortable extrapolating a counterfactual beyond the support of the data.
Some people prefer the minimization model in their brain to the one laid out in a paper, and that’s fine. But it isn’t clear that one’s subjective, hidden model used for smell testing is better when we can’t see its performance.
A picture is worth a thousand words
Like regression discontinuity, synthetic control is very visual. Everything you need to know about it, you can see just by looking at pictures. Pedagogically it is quite nice. The pictures associated with a synthetic control model illustrate all the major ideas of dominant causal designs: balance on pre-treatment covariates, treatment and control, event studies, weighted aggregations for estimated causal effects, heterogeneity. Many of these core ideas are quite difficult to get across to someone unfamiliar with causal inference, but show this to a manager or a policymaker, and chances are, they will follow every single word you say. The pictures are ultimately the rhetorical strength of ADH — for good, or for bad.
To see what I’m talking about, let’s look closely at one of the pictures from the ADH 2010 JASA. The policy in question in this paper is called Proposition 99, a 1989 California regulation that tried to reduce smoking through taxes, advertising, clean air campaigns and various regulations. ADH wanted to know whether we could estimate the causal effect of that policy on cigarette sales using synthetic control. And they would consider it a partial success if their estimated counterfactual looked nearly identical to California before the law had been passed. They chose seven dimensions for their variables, including three values for lagged smoking itself. Results from this can be seen in the table below. Notice how in every case except one, the weighted average using the optimal weights is closer to California than the one using uniform weights. And in the one case where it isn’t closer, synthetic control is simply the same.
After their constrained optimization produced a vector of optimal weights, they applied those weights to cigarette sales for each control group unit to produce this very pretty, very intuitive picture. The key takeaways are that prior to the passage of Proposition 99, synthetic California and actual California have nearly identical amounts of smoking. But after the law, they diverge, and that divergence is an estimate of California’s dynamic treatment effects.
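Mechanically, that last step is nothing more than a weighted average applied year by year. Here is a sketch with made-up numbers; the real weights and cigarette sales come out of the ADH optimization and their data, not the stand-ins below.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1970, 2001)
Y0 = rng.normal(115, 10, size=(len(years), 38))   # donor pool cigarette sales (placeholder)
Y1 = rng.normal(115, 10, size=len(years))         # California's observed sales (placeholder)
w = rng.dirichlet(np.ones(38))                    # stand-in for the optimal ADH weights

synthetic_california = Y0 @ w      # weighted average of donors, year by year
gap = Y1 - synthetic_california    # the plotted divergence is the estimated effect

print(gap[years >= 1989].round(1)) # post-Proposition 99 gaps
```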
How objective is ADH really?
Part of the appeal of synthetic control is the level of automation involved. You don’t pick your counterfactual nor do you pick the weights — the optimization procedure does all the work for you. As we said, ADH doesn’t guarantee that the fit will look good in the pre-treatment period — it just guarantees that the fit will look as good as possible for the covariates you fed into the model.
In my experience as someone who has written several papers using synthetic control, it is probably overstating the case a bit to say that ADH removed subjective researcher bias. There are still loads of researcher degrees of freedom when using ADH — they’re just not as direct as what was done before. It’s true, researchers do not directly choose their counterfactuals, which is the kind of thumb on the scale that we are right to be worried about. But it’s also not true that the researcher doesn’t influence the chosen counterfactual. It’s just that the researcher’s influence is now solely at the point where they choose covariates, years and combinations for the matching itself.
There are many ways that subjective researcher bias can unconsciously enter into the design stage of fitting the model. One of the easiest ways is simply to run your model such that you see your estimates before you have settled on the model itself. Such “peeking” is bound to have an effect on the subsequent choices a researcher will make. ADH in 2015 offered some suggestions on getting around this by breaking the pre-treatment data into a kind of training and prediction sample so that you can practice your predictive model until you get it right. Such cross-validation approaches might work were it not for the fact that dynamics in the outcome often make the fit in the earlier window not particularly informative about the fit in the later one. This kind of opens the door to data mining and overfitting problems, and the early ADH papers simply did not have a lot of guidance as to how exactly we should go about selecting the models that ultimately pick our weights.
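For what it’s worth, the training and validation idea is easy to mock up. Here is a sketch of how one might hold out the tail end of the pre-treatment window before ever looking at post-treatment outcomes; the year cutoffs and the two candidate specifications are purely illustrative.

```python
import numpy as np

def rmspe(actual, synthetic):
    # root mean squared prediction error over a window of periods
    actual, synthetic = np.asarray(actual), np.asarray(synthetic)
    return np.sqrt(np.mean((actual - synthetic) ** 2))

# Hypothetical pre-treatment paths for the treated unit under two candidate
# specifications (in practice each comes from refitting the weights on the
# training window with a different set of predictors).
rng = np.random.default_rng(2)
pre_years = np.arange(1970, 1989)          # Proposition 99 takes effect in 1989
treated = rng.normal(115, 5, size=len(pre_years))
candidates = {
    "spec A": treated + rng.normal(0, 2, size=len(pre_years)),
    "spec B": treated + rng.normal(0, 6, size=len(pre_years)),
}

validation = pre_years >= 1982             # hold out 1982-1988 for model selection
for name, synthetic in candidates.items():
    print(name, rmspe(treated[validation], synthetic[validation]).round(2))
# Pick the specification with the lowest held-out RMSPE before ever looking
# at the post-1989 gap.
```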
It’s not a surprise that two of the more active economists in this area, Bruno Ferman and Cristine Pinto, along with Vitor Possebom, sought to offer some practical suggestions about how to avoid such “cherry picking”. Their paper was entitled “Cherry Picking with Synthetic Controls”, and they recommended presenting a variety of models so that the reader has some ability to see robustness across them.
Conclusion
We’ve been alluding to this, but let’s be explicit now. By requiring that the weights be non-negative, it’s possible that the pre-treatment fit could be very poor. In their 2015 article about the reunification of Germany, ADH wrote about “imperfect” synthetic controls by saying this:
“The applicability of the [ADH2010] method requires a sizable number of pre-intervention periods. The reason is that the credibility of a synthetic control depends upon how well it tracks the treated unit’s characteristics and outcomes over an extended period of time prior to the treatment. We do not recommend using this method when the pre-treatment fit is poor or the number of pretreatment periods is small. A sizable number of post-intervention periods may also be required in cases when the effect of the intervention emerges gradually after the intervention or changes over time.” (my emphasis, Abadie, et al. 2015)
So let’s circle back and stick this landing. The point of non-negative weights is that they force our counterfactual to be imputed within a set of control units, rather than extrapolated outside of them. But by requiring that the weights be non-negative, we are also opening up the possibility of imperfect fit in the pre-treatment period. After all, no one is exactly like you or me. Even a weighted average of a bunch of people might still zig when I zagged and zagged when I zigged once or twice. If imperfect fits mean the synthetic control we find is biased, then we have some choices to make. Will we abandon the project? Or will we adapt the synthetic control so that it can work around this problem of imperfect fits due to non-negative weights?
For a more detailed exposition of the ADH technique, along with an in-depth discussion of its implementation, I encourage you to read my synthetic control chapter in my book. Today’s entry was meant mainly to give an informal set of thoughts I’ve been carrying around with me for years about this procedure. But I also am trying to set up my discussion of the next three papers. These papers will cover the issues around restrictions on what weights can and cannot be, imperfect fit and how to handle more than one treatment group. I hope you have found some of this useful. Stay tuned as I will be writing episode 2 of my synthetic control series very soon.
But please don’t take this to mean I think they are the most important new synthetic control papers. Rather, they are just some of the most recent, and since these two have placed very well (in AER and JASA), I thought it might be a good opportunity to explain them given the editors’ opinions that they are of general interest. That’s usually a good predictor of what is to come.
Great introduction. I'm wondering if your future blog posts will offer your take on Kuosmanen et al.'s (2021) discussion of their global optimum algorithm, how the covariate weights are often corner solutions, and the recommendation to pre-determine covariate weights based on data-driven methods. As someone learning about synthetic controls, it seems like the discussion around how donor and predictor weights are related and derived is often overlooked. Best!
Great post.
A few things I have noticed, particularly with respect to the Ferman critique about cherry-picking. In a lot of the synth papers I have looked at over the past few years, the pre-treatment fit on the treatment state seems to be a good bit better than on the control states. Do you think there's a Brodeur-style analysis that could be done on that front to investigate why pre-treatment fit seems to be better for treatment units than control units?
One thing that I am now feeling a bit more uncomfortable about with respect to inference in synthetic control is the randomization inference. With a big enough covariate space, I can always guarantee myself a significant result by perfectly fitting the pre-treatment period and yielding an infinite post/pre-treatment RMSPE ratio. Do you have a sense of how best to estimate confidence intervals? I have seen conflicting evidence on jackknife, conformal, and other approaches.
Lastly, the one thing I still feel uncomfortable about with the new synth methods is the Doudchenko and Imbens stuff regarding fixed effects. Negative weights and the ridge regression of augsynth make total sense to me, but I feel like incorporating fixed effects adds a level of complexity that means you can have a synthetic composite that fits well where a lot of that fit is due to a fixed effect. In a sense, it feels like it's not really a true interpolation with that level difference. It seems hypocritical to let this bother me in the synth case but not in your classic 2x2 diff-in-diff. I know you have commented in the past that "fixed effects are deep". Would be curious what you have to say in part two.