Three minimum wage papers walk into a bar...
An explainer of Cengiz, et al. (2019), Clemens and Strain (2021) and Callaway and Sant'Anna (2021)
What is the effect of the minimum wage on labor markets? It mechanically raises the wages of employed workers at the bottom end of the wage distribution. But its effect on the overall employment of that group of workers has been debated by economists for decades. Given US states and the federal government regularly experiment with higher minimum wages, one would think economists would come to a consensus — especially given how many studies have been done on it. But the debate continues.
One of the things that makes this lack of consensus among economists about the empirical and historical effect of the minimum wage on employment even more curious is that minimum wages have often been opportunities to showcase ingenious causal research designs. Card and Krueger (1995) is considered a classic in labor economics, not only because it was the seminal paper in the “new minimum wage” literature, but also because it was instrumental in making difference-in-differences research designs more widely appreciated within economics. Put together, this creates an unusual situation. States regularly toy with their minimum wage, seeing it as an important anti-poverty program; interest in the question remains very high among labor economists; the question attracts the best and the brightest in labor; and these studies often use well-identified causal designs. So why is there so little consensus?
This substack will not resolve this question. Rather, this substack will discuss three relatively new and important empirical papers that shed light on what effect, if any, the minimum wage has had on the employment of those most likely affected by it. My hope is that I can do two things in writing this review: (1) that by walking the reader through my interpretation of these papers, you will gain insights about the minimum wage and (2) that you will better understand ways in which one might combine economic theory, high quality data and the difference-in-differences design to answer extremely hard empirical problems filled with endogeneities.
Bunching and Stacking Minimum Wages
When I first heard that Doruk Cengiz, Arin Dube, Attila Lindner and Ben Zipperer had published a paper on the minimum wage at the Quarterly Journal of Economics, I perked up. Top 5 journals like the QJE are notoriously selective, and publishing in them is for most of us a rare and generally important accomplishment. Any minimum wage paper published in any of the top 5 journals is practically obligatory to read given editors' and referees' perception that the paper has made major scientific contributions of general interest. Given the minimum wage is such a contentious political question, with massive fault lines so spread apart a person could fall right through them, I was eager to give the article my full attention.
But I was also interested in this article because it had become more broadly associated with a particular kind of innovative approach to the difference-in-differences design called “stacked regression”. Stacked regression, as I will explain, is an early alternative to canonical uses of the two-way fixed effects (TWFE) estimator in that it creatively reorganizes and expands the dataset so as to circumvent the problems created by differential timing. Cengiz, et al. (2019) reminded me of Meer and West (2016), not merely because both are about the minimum wage, but because each seemed to have grasped some of the problems that differential timing creates for TWFE before the rest of us.
The authors’ project has at least three important attributes I’d like to bring to your attention. They are: (1) an ingenious approach guided by theory to evaluating the “bite” of the minimum wage using bunching in the wage and employment distribution surrounding a discontinuity created by the minimum wage; (2) a large state panel measuring employment and wages from 1979-2016 encompassing 138 separate meaningful increases in the minimum wage; and (3) the use of TWFE and stacked event studies to study the minimum wage’s effect on employment. Let me walk through each.
Visualized Theoretical Predictions
It has been a slow journey for me to fully absorb how effective a well designed picture can be in the service of an empirical project. Students and economists early in their career should pay close attention to the choices the authors make in connecting theory, intuition and identification. One of those choices is a simple, yet powerful, figure illustrating what they call “the minimum wage’s bite”. The bite of the minimum wage is the bunching pattern it should create in the employment distribution just above and below the new minimum wage itself. If the counterfactual distribution of employment across wages is smooth through the new minimum wage, then raising the minimum wage should cause bunching as jobs paying just below the new minimum are shifted to just above it. The authors illustrate this insight in the following figure.
Their figure combines two ideas in one — predictions from economic theory showing precisely where and when causal effects should appear, and a display of actual and counterfactual levels of employment. The figure is striking, with well-chosen plots, colors, line types and shadings that catch the reader’s eye. Notice that the counterfactual employment distribution is shown with a red dashed line corresponding to a red zebra-like shading of the area below the counterfactual employment distribution to the left of the minimum wage. The blue solid lines show actual employment, which should be substantially bunched at the discontinuity, reflecting causal effects, but not at the middle or rightward parts of the employment distribution.
The figure forces the reader to both see what the authors will be doing (measuring causal effects by contrasting actual to counterfactual) as well as guides the reader to theoretical predictions. The causal effects, if they are to be believed, have two elements, not one. We should see bunching of jobs at the minimum wage. That much is clear. But notice what else is predicted by the figure — we should not see bunching anywhere else. If we found bunching in the right part of the distribution, then in many ways it would cause us to doubt that any bunching found at the left part of the distribution is real. Such falsifications are very common in applied microeconomics, but it is not always the case that falsifications are so closely wedded to economic theory the way this one is. I highly recommend early career economists pause and reflect on the paper’s grammar and rhetoric as it relates to this one simple and effective figure.
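Before turning to their figure, it may help to see the prediction in a small simulation. Under the benchmark of full compliance and zero job destruction, jobs paying below the new minimum simply move up to it: the missing mass below exactly equals the excess at the minimum, and the distribution above is untouched. The wage distribution and numbers below are my own illustration, not the authors' data.

```python
import random
from collections import Counter

random.seed(42)

# 10,000 hypothetical jobs with wages between $5 and $20 (illustrative only).
counterfactual = [round(random.uniform(5, 20), 2) for _ in range(10_000)]
NEW_MW = 8.00

# Benchmark with full compliance and no job destruction: every job paying
# below the new minimum is pushed up to exactly the new minimum.
actual = [max(w, NEW_MW) for w in counterfactual]

bins_cf = Counter(int(w) for w in counterfactual)   # $1 wage bins
bins_act = Counter(int(w) for w in actual)

missing_below = sum(bins_cf[b] - bins_act[b] for b in range(5, 8))
excess_at_mw = bins_act[8] - bins_cf[8]

# Prediction one: jobs pile up at the new minimum, matching the missing mass below.
assert missing_below == excess_at_mw and excess_at_mw > 0
# Prediction two: the distribution is untouched above the minimum.
assert all(bins_cf[b] == bins_act[b] for b in range(9, 20))
```

Any disemployment effect would show up in this accounting as missing mass below that is *not* fully matched by excess mass at the minimum.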
Current Population Survey 1979-2016 and 138 minimum wage experiments
The authors use worker data over 1979 to 2016 from the Current Population Survey Outgoing Rotation Group matched with 138 state-level minimum wage increases. The primary models used are event study specifications estimated with standard TWFE models. The data are binned into quarter-dollar wage bins from $0 to $30, though most of their event study plots use coarser one-dollar bins. Figure I’s logic guides us as we are presented with estimates from these models: we should be looking for bunching at the discontinuity and we should be looking for smoothness everywhere else. And if there are disemployment effects, then the total number of jobs at the bunched part of the employment distribution must be smaller than that under the counterfactual.
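To make the binning concrete, here is a minimal sketch of quarter-dollar binning with top-coding at $30. The helper name and wage values are my own illustration, not the authors' code.

```python
# Quarter-dollar wage bins from $0 to $30, as described in the paper.
def wage_bin(w, width=0.25, top=30.0):
    """Assign a wage to the lower edge of its quarter-dollar bin, top-coded at $30."""
    w = min(w, top)
    return round((w // width) * width, 2)

assert wage_bin(7.30) == 7.25    # $7.30 falls in the [$7.25, $7.50) bin
assert wage_bin(7.25) == 7.25    # bin edges belong to their own bin
assert wage_bin(45.00) == 30.00  # wages above $30 are top-coded
```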
Figure II presents estimated lag and lead coefficients from their event study model. Notice that in proximity right around the minimum wage increase there are mirroring coefficients of similar magnitude and opposing sign. They find substantial declines in employment around the b=-1 bin along with similarly sized large increases around the b=0 bin relative to some baseline. But they also do not find any other breaks throughout the wage distribution. The two main theoretical predictions from Figure I hold in their data.
Next the authors calculate a kind of “net employment” by summing the two coefficients. Summing the coefficients is equivalent to calculating the net gain or loss in employment caused by the minimum wage. They find that the number of jobs missing just below the minimum wage (-0.018, standard error 0.004) is just a little smaller in magnitude than the number gained above it (+0.021, standard error 0.003). The net effect, which they label “% Change in Affected Employment”, is an imprecisely estimated +0.028. Disemployment effects imply negative effects, but here we see positive ones, suggesting that any disemployment effects must be very small if they exist at all. Given wages increased by 0.068 relative to the pre-treatment total employment, a reasonable conclusion to draw from this study is that real earnings among workers increased because wages rose and employment failed to decline.
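The arithmetic here is worth making explicit, using the coefficients reported above:

```python
# Reported coefficients from Cengiz et al. (2019):
missing_below = -0.018   # jobs missing just below the new minimum (se 0.004)
excess_above = 0.021     # excess jobs at or just above it (se 0.003)

# Net change in employment caused by the minimum wage:
net_change = missing_below + excess_above
assert abs(net_change - 0.003) < 1e-9

# The paper then scales this net change by the pre-treatment level of affected
# employment to arrive at the "% Change in Affected Employment" estimate.
```

A negative `net_change` would have indicated disemployment; here it is slightly positive.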
The authors’ analysis extends to effects on employment over time, as well as with respect to the wage distribution, and these are also estimated using standard event study specifications. Figure III presents coefficients and 95% confidence intervals for each lead and lag in this model. As with the bunching analysis, the number of jobs lost and the number gained at the wage discontinuity are roughly the same as much as 4 years post-treatment. Even in the short to medium run, the authors are able to cast doubt on meaningfully sized disemployment effects. The fact that pre-treatment differences in net employment are relatively close to zero provides some additional confidence that the effects found in the years post treatment are indeed causal.
After presenting these main results across the wage distribution and over time, the authors then engage in a type of exhaustive search for heterogeneous effects — a theme we see in another paper I will discuss below. If the minimum wage is to have any bite, they reason, we should see it for the types of workers most likely to be affected. By definition that should be low wage work consisting of younger and less experienced workers. So they repeat their analysis using subsamples containing only teenagers, only those without a high school degree, separately by race, by degree of industry concentration, by specific sectors, and for groups that are probabilistically the most likely to feel the brunt of the minimum wage’s disemployment effects. But no matter how they slice the data, the results stay the same: disemployment effects cannot be detected.
In a later subsection of the paper, accompanied by logic and details provided in Appendix D, the authors shift their focus from the theoretical predictions to their dependence on the TWFE model for their evidence. As other authors have explained, the TWFE model rests on non-trivial assumptions, and without them the coefficients presented by the authors do not have a causal interpretation. Sun and Abraham (2020), for instance, show that without three assumptions — parallel trends, no anticipation and homogeneous treatment profiles across all cohorts — population regression leads and lags absorb treatment effects not only from the periods they measure, but from other periods as well (both included and excluded in the model). While it would be strange for such biases to show up precisely where theory predicts, the question remains — how confident should we be in these TWFE results in the first place?
Though early in the DiD credibility revolution, the authors are clearly aware of the problems with TWFE under differential timing. For instance, they mention the issue of negative weighting on the long leads when using TWFE, and they refer to an early working paper by Abraham and Sun (2018). The authors forego Sun and Abraham’s interaction-weighted estimator, and instead choose to stick with TWFE by using what they call “clean controls” — untreated units retained through a restructuring of the data such that treatment timing is re-centered and balance is achieved in relative event time. When a panel dataset is balanced in relative event time, it is perfectly rectangular, with every unit observed for the same number of event-time periods h. A panel balanced in relative event time is impossible with differential timing because each treatment group has a different number of pre- and post-periods, but when the data is restructured so that treatment dates are re-centered, the dataset becomes balanced in event time.
I will discuss this again in greater detail when I discuss Clemens and Strain (2021), but here I simply note that Cengiz, et al. (2019) drop all years more than 3 years before treatment and more than 4 years after treatment for each treated cohort. Once this data trimming is done for a cohort, all other units treated during the same t=-3 to t=+4 period are dropped. All units, though, that are untreated over the same period are retained, as they constitute what the authors call their “clean controls”. They describe clean controls in Appendix D of the paper as “those [units] without any non-trivial minimum wage increase within the 8-year event window.” Each re-centered dataset is then saved and “stacked” on top of the others, creating one much longer and less wide dataset containing all 138 experiments along with their corresponding controls.
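A minimal sketch of the re-centering and stacking construction may help. The five states, treatment dates and helper names below are hypothetical, invented for illustration.

```python
# Hypothetical state -> treatment year mapping (None = never treated).
ALL_TREAT = {"CA": 2007, "NY": 2005, "TX": None, "FL": None, "WA": 2006}
EVENTS = {"CA": 2007, "NY": 2005}   # the two "experiments" we stack
PRE, POST = 3, 4                    # event window: t = -3, ..., +4

def clean_controls(treat_year):
    """States with no minimum wage increase inside the 8-year event window."""
    lo, hi = treat_year - PRE, treat_year + POST
    return [s for s, y in ALL_TREAT.items() if y is None or not (lo <= y <= hi)]

stacked = []
for cohort, (state, year) in enumerate(EVENTS.items()):
    # Each sub-dataset: the treated state plus its clean controls,
    # re-centered so every unit runs from event time -3 to +4.
    for s in [state] + clean_controls(year):
        for t in range(-PRE, POST + 1):
            stacked.append({"cohort": cohort, "state": s, "year": year + t,
                            "event_time": t, "treated": int(s == state)})

# Balanced in relative event time: 2 cohorts x 3 states x 8 periods each.
assert len(stacked) == 48
# A clean control like TX appears once per cohort: the repetition that the
# paper's fixed effects later have to account for.
assert sum(1 for r in stacked if r["state"] == "TX") == 16
```

Note how WA, treated in 2006, falls inside both event windows and so serves as a control for neither cohort.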
A recent paper by Baker, Larcker and Wang (2021) provides a step-by-step detailed description of stacked regression. Recall that stacked regressions append multiple cohort-specific datasets, each including untreated controls, on top of one another such that the final dataset is a balanced rectangle of rows of equal length. The new dataset is balanced in relative event time, but because it consists of each treatment group and its corresponding set of untreated controls, untreated control group units will often appear multiple times.
Baker, Larcker and Wang (2021) therefore recommend estimating a simple 2x2 model with TWFE on the final dataset controlling for “cohort-by-state” fixed effects so as to account for the multiple appearances of observations from the never-treated control states. What are cohort-by-state fixed effects, you ask? Technically they are “subdataset-by-state fixed effects”, but that is a mouthful, so just think of a “cohort” as a specific treatment group together with all of its untreated controls. So assume you have 10 datasets you appended into one “stacked” dataset. Then you want to include state fixed effects interacted with nine dummies, each corresponding to one of those particular datasets you appended.
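Concretely, cohort-by-state fixed effects mean one intercept per (sub-dataset, state) cell, so a control state that appears in two sub-datasets gets two separate intercepts. A toy sketch with hypothetical cohort and state labels:

```python
# (cohort, state) pairs from a hypothetical stacked dataset in which TX
# serves as a control in both sub-datasets.
rows = [(0, "CA"), (0, "TX"), (1, "NY"), (1, "TX")]

# One fixed effect per (cohort, state) cell.
cells = sorted(set(rows))

def cohort_state_dummies(row):
    """Indicator for each (cohort, state) cell; the row's own cell gets a 1."""
    return [int(row == c) for c in cells]

assert len(cells) == 4  # TX enters twice, once per cohort
assert cohort_state_dummies((1, "TX")) == [0, 0, 0, 1]
```

In an actual regression one cell would be dropped as the omitted category, exactly like the nine-of-ten dummies described above.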
Using this newly stacked dataset, the authors then ran 130 regressions, one for each cohort, and plotted those coefficients along with confidence intervals obtained by a procedure laid out by Ferman and Pinto (2019). I include their Figure D.1 showing all 130 regression coefficients. While there is heterogeneity across the distribution of these effects, the effect on net job loss in panel (c) is basically zero. While TWFE has the potential to be biased, their stacked regressions do not differ all that much and the main conclusions stand — the authors cannot find any disemployment effects at the bunched part of the wage distribution implied by intuition and theory.
Clemens and Strain (2021) was posted to the NBER earlier in the week. It is perfect for this substack because it is in many ways the stepchild of Cengiz, et al. (2019). It examines a different sample, uses comparable econometric modeling, and similar data. But it focuses on different theoretical predictions not explored by Cengiz, et al. (2019), as well as examines a new robust DiD estimator called the imputation estimator created by Borusyak, Jaravel and Spiess (2021). Like I did with Cengiz, et al. (2019), I will try to write a simple explainer of this interesting new paper to illustrate both what the authors find as well as how they go about finding it.
Clemens and Strain (2021) is the capstone to earlier studies by the authors going back several years. This intensive focus on the minimum wage reflects both their professional and personal commitment to the question, but it also illustrates a long game that they were playing. Let me explain.
The authors used these earlier studies as the basis to justify a set of questions, selected datasets and years, and empirical models they would pursue on future data they had not yet seen. So as to tie their hands, the authors pre-registered this particular study. I say “tied their hands” because much of this paper was envisioned and planned before they ever saw the data or ran any regressions. This is important because the authors will be examining heterogeneous treatment effects associated with the minimum wage, and it is precisely when studying heterogeneous treatment effects that the problems of p-hacking, a type of statistical problem now well understood within the physical and social sciences, can emerge. I believe that the following four concepts can function as a prism to help guide you through the logic of Clemens and Strain (2021).
Subjective researcher decisions are unavoidable in the sciences and create challenges for scientific inference. Some of this subjectivity is simply a reflection of the legitimate uncertainty that every scientist faces when preparing and analyzing data. Certain choices must be made, and two researchers facing the same choice may very well make different ones, and it is not always obvious which, if either, is right or wrong. But some of this subjectivity is less defensible, and before I dive into those, I want to briefly go over the types of researcher uncertainty that simply cannot, and likely will not, be avoided.
An interesting article was published earlier this year in the economics journal Economic Inquiry. It caused a bit of a stir on Twitter for reasons that are easy to see. The study, coauthored by Nick Huntington-Klein and several others, vividly illustrated that some of the considerable heterogeneity in applied microeconomics is not due to fraud but rather reflects very reasonable choices that researchers cannot avoid. Nick and others took two applied micro papers, along with the raw data used, and assigned them to several experienced and competent researchers. Each was given the task of replicating the key findings of the two papers. The replicators were separated so that there would be no interference or spillovers between them. What they found is better quoted than summarized.
“We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No two replicators reported the same sample size. Statistical significance varied across replications, and for one of the studies the effect's sign varied as well. The standard deviation of estimates across replications was 3–4 times the mean reported standard error” (my emphasis).
Sometimes researchers follow different paths because they come to a choice with different unobserved covariates. They solve problems differently. They interpret the problem differently, too. Or they simply have different approaches to programming workflow that can lead them down different paths. And that, in my opinion, is the main lesson we learn from Nick’s team’s study: there is an unavoidable subjective component to any empirical project, not because of fraud, but because researchers make decisions behind a dim glass that fills the mind with poorly understood uncertainty.
Some forms of researcher subjectivity are like a scout hacking her way through the thick brush of a jungle. She can only reach the end point where data analysis will be done by choosing a path at an earlier juncture, which itself had been presented to her in the first place because she chose a path at a point before it. On and on, the choices cascade to the final point where she finally conducts the analysis, and while in hindsight it appears there had been one and only one path through the jungle — the one she took — in reality there had been a nearly infinite number of paths one could have taken. I think that this type of researcher uncertainty is similar to the problem of multiple comparisons that statisticians Andrew Gelman and Erik Loken refer to in their essay on the “garden of forking paths”. The garden of forking paths can lead to statistical false positives, not because the researcher is engaging in dubious behavior, but because getting from point A to point B involves many choices that must be made before any analysis of the data can be done.
But not all types of researcher subjectivity are so benign. Consider the following time series of mentions of the phrase “replication crisis” from the Google Books Ngram Viewer. Starting around 2011, the phrase “replication crisis” began to steadily grow. The problem has been widely documented after issues were identified with classic papers in the sciences, some of which have been retracted. Some careers have been ruined and some researchers disgraced because of the belief that the scientific integrity of their papers was marred, not by the unavoidable researcher uncertainty I discussed earlier, but because scientists may have been cooking their analysis in such a way as to improve their chances of publishing in prestigious outlets. Consider again this explanation from Vox.
“Here’s the thing: P-values of .05 aren’t that hard to find if you sort the data differently or perform a huge number of analyses. In flipping coins, you’d think it would be rare to get 10 heads in a row. You might start to suspect the coin is weighted to favor heads and that the result is statistically significant.
But what if you just got 10 heads in a row by chance (it can happen) and then suddenly decided you were done flipping coins? If you kept going, you’d stop believing the coin is weighted.
Stopping an experiment when a p-value of .05 is achieved is an example of p-hacking. But there are other ways to do it -- like collecting data on a large number of outcomes but only reporting the outcomes that achieve statistical significance. By running many analyses, you’re bound to find something significant just by chance alone.”
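The coin-flipping version of this is easy to simulate. The sketch below, my own illustration using a normal-approximation p-value, flips a fair coin, peeks at the p-value after every flip, and stops the moment p < .05. Across many simulated "experiments," the false positive rate lands far above the nominal 5%.

```python
import math
import random

random.seed(0)

def two_sided_p(heads, n):
    """Normal-approximation two-sided p-value for testing a fair coin."""
    z = abs(heads - n / 2) / math.sqrt(n / 4)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def peeking_experiment(max_flips=100, first_look=10):
    """Flip a fair coin, checking the p-value after every flip from the 10th on;
    stop and declare 'significant' the moment p < .05."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5
        if n >= first_look and two_sided_p(heads, n) < 0.05:
            return True
    return False

# The coin is fair, so every "significant" result is a false positive.
false_positives = sum(peeking_experiment() for _ in range(2000)) / 2000
assert false_positives > 0.05  # optional stopping inflates the error rate
```

With a fixed, pre-committed sample size, the same test would reject about 5% of the time; peeking after every flip multiplies that rate several-fold.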
Consider the case of the social psychologist Brian Wansink. Wansink was a well known professor at Cornell's food lab who was accused of p-hacking and ultimately forced to resign when problems were uncovered. He has had the unfortunate luck to become the poster child of the “replication crisis” in social psychology because a re-examination of many of his famous papers found evidence of model specifications that nearly guaranteed his team would find statistically significant effects, not because the effects were real, but through dubious slicing of the data. Again consider the following scary quote from Vox:
“According to BuzzFeed’s Lee, who obtained [Brian] Wansink’s emails, instead of testing a hypothesis and reporting on whatever findings he came to, Wansink often encouraged his underlings to crunch data in ways that would yield more interesting or desirable results.”
Wansink claims he was innocent of fraud and data manipulation, and perhaps he was. I wasn’t there, haven’t read his papers and don’t know the man. But from the perspective of scientific discovery, it may not even matter whether he was purposefully or unintentionally engaging in p-hacking. Research is a thorny process for even the scrupulous researcher, and many fields have begun instituting policies that we hope can and will minimize the influence of ex ante subjective researcher bias from fishing expeditions and cherry picking of results. Several policies have been experimented with, such as publishing replications and archiving programs and data. But another important policy to emerge has been the pre-registration plan.
The pre-registration plan is extremely common in the field of development economics. While strongly associated with the randomization practices of field experiments, there are a few (though not many) instances of it also being used for quasi-experimental studies. In a paper of mine with four co-authors forthcoming at the Journal of Development Economics, we pre-registered a quasi-experimental study. But while the paper was quasi-experimental, it was still a development economics study. Outside of development, pre-registering quasi-experimental studies has not yet caught on.
Clemens and Strain (2021) is a rare bird when it comes to pre-registration, and not merely because it pre-registered a quasi-experimental study outside of development economics. The authors claim that it is one of the first, if not the first, to pre-register theoretical predictions regarding where heterogeneous treatment effects associated with the minimum wage’s bite would appear in employment changes. When one considers the debates around the effectiveness of the minimum wage as an anti-poverty program, along with the replication crisis’s association with heterogeneity analysis on various subpopulations in a given dataset, the appeal of this approach is clear.
Just as impressive as their decision to pre-register is the patience Clemens and Strain showed. These authors played the long game, studying the minimum wage over shorter panels for several years of their careers before pre-registering this one. Their approach had been to conduct several short-run minimum wage studies because, following the Great Recession, governments across the US had paused experimentation with the minimum wage. Once the nation was largely in the clear of the Great Recession’s wake, though, that experimentation picked back up, but this time it was somewhat more aggressive than what we’d seen before. Some states raised their minimum wage by indexing it to inflation, some initiated relatively modest increases, and some raised their minimum wage a lot. Clemens and Strain note, for instance, that in some areas states increased the minimum wage by as much as 50-60%.
Given pre-registration is contained in a public ledger, it binds authors’ hands to model specifications. Because the authors had conducted numerous short-run studies, they justified the ideas and approach of this new study as simply an “extension” of that earlier work. Thus, unlike claims made about the replication crisis where researcher choices were endogenous to what they found during ex post analysis, pre-registration tied the authors’ hands. Whatever they found, that was what they found. The ability to run down rabbit trails into spurious statistical significance was now limited by the pre-registration plan. While this plan did not eliminate the influence of researcher choices suggested by Huntington-Klein and others, it did eliminate publishing results based on seeking provocative findings through “peeking” at analysis ex post.
Heterogeneity in dosage and treatment effects
The elasticity of employment with respect to the minimum wage has both a level element as well as a time element. Elasticities are often thought to be theoretically larger in the long run than the short run, as firms can adapt more flexibly to input price increases over time. We saw this in the Meer and West (2016) article, for instance, when the authors chose to examine, not level changes in employment, but growth rates. One of the reasons, in fact, that Clemens and Strain (2021) chose to pre-register at all was so that they could make all their long-run analysis an extension of their short-run analysis without concerns they were engaging in p-hacking. While pre-registration can guard against some elements of researcher bias, there are still other choices that must be made because variation in dosages itself creates problems for causal inference.
The stable unit treatment value assumption (SUTVA) is a crucial element of the potential outcomes model. It is widely known that SUTVA requires that when person A is treated, person B’s potential outcomes do not change. This problem is not often discussed in difference-in-differences designs even though it is precisely when using contiguous regions that we might worry about it. Nevertheless, this element of SUTVA is relatively well known and researchers often go to great lengths to ensure it holds through robustness checks that may use non-boundary regions as controls.
But one element of SUTVA that is less well known is the “no hidden variation in treatment” assumption. Imbens and Rubin in their excellent 2015 book on causal inference write:
“Consider an assessment of the causal effect of aspirin on headaches. For the potential outcome with both of us taking aspirin, we obviously need more than one aspirin tablet. Suppose, however, that one of the tablets is old and no longer contains a fully effective dose, whereas the other is new and at full strength. In that case, each of us may have three treatments available: no aspirin, the ineffective tablet, and the effective tablet. There are thus two forms of the active treatment, both nominally labeled “aspirin”: aspirin+ and aspirin-. … One strategy to make SUTVA more plausible relies on redefining the represented treatment levels to comprise a larger set of treatments, for example, Aspirin-, Aspirin+ and no-aspirin instead of only Aspirin and no-aspirin.” (Imbens and Rubin 2015, my emphasis)
What would that mean exactly in practice? Guided by theory, if we believe that small minimum wage increases have different effects (or even no effects) compared to large ones, then SUTVA requires defining the treatment to reflect that heterogeneity. This is what Clemens and Strain are doing in this working paper. Clemens and Strain (2021) redefine minimum wage increases by classifying them according to whether they were small, large or indexed to inflation. This matters when contrasting Clemens and Strain (2021) with Cengiz, et al. (2019) because the authors note that the average increase in the minimum wage from 1979-2016 was 8 log points, whereas in their study it is closer to 25 log points, and sometimes even larger than that.
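In code, redefining the treatment amounts to something as simple as the sketch below. The 10-log-point cutoff and the function name are my hypothetical choices for illustration, not the authors' actual classification rule.

```python
import math

def classify_increase(old_mw, new_mw, indexed_to_inflation):
    """Split the single treatment 'raised the minimum wage' into three
    distinct treatments, in the spirit of SUTVA's no-hidden-variation clause."""
    if indexed_to_inflation:
        return "indexed"
    log_points = 100 * math.log(new_mw / old_mw)
    # Hypothetical threshold; the paper's definitions differ.
    return "large" if log_points >= 10 else "small"

assert classify_increase(7.25, 9.50, False) == "large"   # ~27 log points
assert classify_increase(7.25, 7.50, False) == "small"   # ~3 log points
assert classify_increase(7.25, 7.40, True) == "indexed"
```

Each category then enters the analysis as its own treatment rather than being pooled into a single "increase" dummy.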
Econometric modeling: TWFE, Stacking and Imputation
Because the literacy around TWFE in difference-in-differences designs is now very high within applied microeconomics, researchers increasingly check the robustness of TWFE results using another robust difference-in-differences estimator, if they perform TWFE analysis at all. Clemens and Strain (2021), like Cengiz et al. (2019), follow up their TWFE event study analysis using the “stacked regression” model. Accompanying their TWFE and stacked regression results, Clemens and Strain (2021) also estimated treatment effects using the new imputation estimator created by Borusyak, Jaravel and Spiess (2021), which I discussed earlier here.
The authors present a series of findings along several dimensions: (1) low-skill workers (16 to 25 year olds without a high school degree) versus young workers (16 to 21 year olds); (2) large versus inflation-indexed versus small minimum wage increases; (3) large increasers versus combined indexed and small increasers. The effects are largely the same for all three, so I will just present the plots now and discuss them briefly below.
The effects really are only detectable in the targeted population of lower education workers with less experience (top panel). Pre-treatment event study coefficients are close to zero, but not always. Notice that in the t=-1 period, the first lead is positive and significant. But recall what we know from Sun and Abraham (2020) — under TWFE with differential timing, each of the lead and lag population regression coefficients is contaminated if treatment effect profiles differ across cohorts. This alone is justification to scrutinize the results more closely using alternative estimators that do not suffer from those same problems.
Next the authors examine the effects using the stacked regression model based on the Baker, Larcker and Wang (2021) step-by-step data construction and model specification, controlling for the repeated appearance of control group units across the appended datasets. Results using the stacked method only get stronger. Large increasers see more stability in the pre-treatment leads, but unlike earlier, now show immediate declines in employment as early as the first year. This could be a reflection of the bias problems that Sun and Abraham (2020) outline, wherein leads and lags are weighted towards zero due to the influence of other treatment effects across event time bins.
Finally, the authors look at the effect using the BJS imputation estimator. Note that because this model imputes a counterfactual for each treated unit, there is no baseline year in the event study plots; each lead and lag is displayed as a point estimate with a 95% confidence interval. Results using the BJS imputation estimator are no different from what we saw before: large increasers see large employment declines following the passage of minimum wage increases, whereas small increasers and inflation indexers do not.
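The mechanics of the imputation approach are simple enough to sketch. The toy code below is my own illustration rather than the BJS implementation: it fits unit and year fixed effects on untreated observations only, imputes each treated observation’s untreated outcome, and averages the gaps by event time.

```python
import numpy as np
import pandas as pd

def bjs_imputation(df):
    """Toy version of the imputation idea:
    (1) fit unit and year fixed effects on untreated observations only,
    (2) impute the untreated outcome Y(0) for every treated observation,
    (3) average the gap Y - Y(0) by event time.
    Columns 'unit', 'year', 'y', 'treated', 'event_time' are illustrative."""
    D = pd.get_dummies(df[["unit", "year"]].astype(str))
    X = D.to_numpy(dtype=float)
    y = df["y"].to_numpy(dtype=float)
    untreated = (~df["treated"]).to_numpy()
    # two-way fixed effects estimated off untreated observations alone
    beta, *_ = np.linalg.lstsq(X[untreated], y[untreated], rcond=None)
    y0_hat = X @ beta  # imputed counterfactual for every observation
    gaps = pd.Series(y - y0_hat, index=df.index)
    treated = df["treated"]
    return gaps[treated].groupby(df.loc[treated, "event_time"]).mean()
```

Because a counterfactual is imputed for every treated observation, no event-time bin is omitted as a baseline, which is why the resulting event study plot shows a point estimate and confidence interval at every lead and lag.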
This study suggests that while there may have been little to no disemployment effect from 1979 to 2016, there do appear to have been disemployment effects from 2011 to 2019 in those states that instituted “large” minimum wage hikes. Given the high quality and credibility of each paper, reconciling these seemingly contradictory findings will be extremely valuable as we attempt to better understand exactly why, when and how minimum wages matter for employment.
But keep in mind: while these studies bear a passing resemblance to one another, they are very different with regard to the theoretical predictions they exploit. Cengiz, et al. (2019) focus on the net job gains associated with bunching around the employment discontinuity created by minimum wage increases, whereas Clemens and Strain (2021) focus on heterogeneity in minimum wage increases using aggregate employment data for targeted low-wage workers. Had we found similar effects under both approaches, we might have concluded that the two identification strategies were distinctions without a difference, but we didn’t. So one avenue for reconciling the two may be to explore each identification approach more carefully, not so much to choose which one is more correct as to better understand why such differences exist.
The other possibility, though, is that the post-Great Recession minimum wage increases have been so substantial that their bite is simply much larger than anything Cengiz, et al. (2019) investigated. If that is true, it is itself somewhat strange, because recall Figure I in Cengiz, et al. (2019): there is substantial bunching around the discontinuity. So whatever is going on in the 1979-2016 period, it cannot be said that the minimum wage wasn’t binding. Figure II shows it absolutely was binding. How do we reconcile these two results? The answer will most likely come down to one of three things: differing sample periods, differing definitions of the minimum wage, and differing focuses on the employment distribution itself.
Semi-parametric DiD modeling
I would now like to conclude with a brief discussion of the empirical findings in Callaway and Sant’Anna (2021), published in the Journal of Econometrics, or CS for short. CS has become a popular alternative to TWFE in part because of its ability to incorporate baseline covariates through an inverse probability weight (Abadie 2005), an outcome regression adjustment (Heckman et al. 1997), or both (doubly robust), without imposing unnecessary constraints on the data generating process. Insofar as one believes parallel trends holds only conditional on covariates, a covariate adjustment is necessary, and CS performs it by using transformations of the covariates to adjust differences in mean outcomes, with not-yet-treated units serving as the control group.
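Abadie’s (2005) weighting idea is worth seeing in code. Under conditional parallel trends, untreated units’ outcome changes are reweighted by the odds of treatment, p(X)/(1-p(X)), so that their covariate distribution matches the treated group’s. Here is a minimal sketch with the propensity score taken as given (in practice it is estimated, e.g. by a logit); the function name and inputs are my own illustration.

```python
import numpy as np

def ipw_did_att(dy, d, pscore):
    """Abadie (2005)-style IPW difference-in-differences for the ATT.
    dy: outcome change (post minus pre) per unit; d: 1 if treated, 0 if not;
    pscore: propensity score p(X), here taken as known for illustration.
    Treated changes get weight 1; untreated changes get weight -p/(1-p),
    which reweights them toward the treated group's covariate mix."""
    dy, d, pscore = (np.asarray(a, dtype=float) for a in (dy, d, pscore))
    w = (d - pscore) / (1.0 - pscore)
    return np.mean(w * dy) / np.mean(d)
```

The doubly robust version pairs this weighting with an outcome regression for the untreated change, so the estimator remains consistent if either the propensity score model or the outcome model is correctly specified.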
Like the BJS imputation estimator, CS identifies a more granular treatment effect parameter first: the group-time ATT. With the group-time ATTs in hand, the researcher can then aggregate up to more commonly used parameters like the overall ATT. This focus on identifying the group-time ATT has benefits beyond merely providing an unbiased path to the ATT: it also lets researchers explore treatment effect heterogeneity along any number of dimensions, such as the timing of adoption itself. That is quite useful when exploring the different types of minimum wage increases states pursue, which is the emphasis of Clemens and Strain (2021). Below I report the authors’ estimates of the overall ATT, the group-specific ATTs, and the dynamic event study plots.
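To fix ideas, here is a small illustration of the unconditional version of the estimand: ATT(g, t) compares cohort g’s change in mean outcomes from its base period g-1 to period t against the same change for never-treated units, and an overall ATT averages the post-treatment ATT(g, t)’s with cohort-size weights. Variable names and the aggregation scheme are simplified for illustration, not CS’s implementation.

```python
import pandas as pd

def att_gt(df, g, t):
    """Unconditional group-time ATT(g, t): cohort g's mean outcome change
    from base period g-1 to period t, minus the never-treated units' change.
    Columns 'unit', 'year', 'y', 'g' (adoption year, NaN if never treated)
    are illustrative."""
    def mean_change(mask):
        wide = df[mask].pivot(index="unit", columns="year", values="y")
        return (wide[t] - wide[g - 1]).mean()
    return mean_change(df["g"] == g) - mean_change(df["g"].isna())

def overall_att(df):
    """Average the post-treatment ATT(g, t)'s, weighting cohorts by size."""
    cohorts = df.loc[df["g"].notna(), ["unit", "g"]].drop_duplicates()
    weights = cohorts["g"].value_counts(normalize=True)
    years = sorted(df["year"].unique())
    total = 0.0
    for g, w in weights.items():
        g = int(g)
        post = [t for t in years if t >= g]
        total += w * sum(att_gt(df, g, t) for t in post) / len(post)
    return total
```

The point of the detour through ATT(g, t) is exactly what the paragraph above describes: the same building blocks can be re-aggregated by cohort, by event time, or by calendar time to expose heterogeneity that a single pooled coefficient would hide.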
Unlike the other two papers, this one uses county-level data on teen employment from 2001 to 2007, a period just predating the Great Recession. While this period is contained in the dataset evaluated by Cengiz, et al. (2019), theirs was a longer panel of states, not counties; and Clemens and Strain (2021) looked at a period just after Callaway and Sant’Anna’s, using state, not county, level data. The county-level teen employment data come from the Quarterly Workforce Indicators and are merged with other county-level covariates, such as population in 2000, the fraction white, and other demographic controls. The authors then estimate the effect of minimum wage increases for three groups: a 2004 group, a 2006 group and a 2007 group, each with and without controls. I reproduce those group event studies from their Figure 1 below.
The effects are very large. Take, for instance, the 2007 lag in the unconditional parallel trends panel (a): minimum wages are associated with almost a 0.15 decline in teen employment.
But the authors also aggregate across the post-treatment dynamic lags to create weighted ATTs. Each column of the partially aggregated section lists group-specific effects, as opposed to estimates of the overall ATT alone. All of the group-specific effects are negative and statistically significant.
As a comparison, the authors also estimate TWFE models (first row under panels (a) and (b)) and, like Cengiz, et al. (2019), find the effect is not significant when controls are included. About these differing results, the authors write:
“The average effect of increasing the minimum wage on teen employment across all groups that increased their minimum wage is a 3.1% reduction in teen employment. This estimate is much different from the TWFE estimate.”
Reconciling the results from these three excellent papers is an important task given the importance of the policy and scientific question itself. But doing so is not straightforward, since the three papers have as many differences as common features. They are all about the minimum wage and they all use difference-in-differences, but in some ways that is where the similarities stop. The three papers use different datasets, with outcomes measured at different levels of aggregation over different time periods. One focuses on the bunching patterns created by the minimum wage to measure net employment at the minimum wage’s “bite”; one focuses on a more recent wave of large and small minimum wage increases, perhaps extremely relevant for contemporary and future economic debates; one explores heterogeneous impacts associated with different times of adoption. Some use stacking, some use imputation, some use manual aggregation methods and some use standard event studies estimated with TWFE. Teasing out why one paper finds no disemployment effects while the other two do therefore means disentangling the role, if any, that each of these separate elements plays.
The first thing I think we can say, though, is that the historical effects of the minimum wage dating back to 1979 appear, on average, to have been inconsequential for employment levels. Cengiz, et al. (2019) explore this to my satisfaction. If the disemployment effects were real, we would expect to see them at precisely the part of the distribution where excess and missing jobs are created by the bite of the minimum wage. But we don’t: not in aggregate, not by subgroup, and not in the individual stacked event studies.
But I do not want to push that too far, as other excellent studies have found evidence of disemployment over this period. Meer and West (2016), for instance, found disemployment not so much in employment levels as in employment’s growth rate. Perhaps the diversity of impacts, along so many different margins, means that the effects are indeed heterogeneous. Nevertheless, there is something intuitive about the bunching approach: it narrowly focuses our attention on the precise location in the wage distribution where effects would appear, and they do not appear there even over an extremely long panel.
Is it something about the length of their panel? Or could it instead be something about the nature of the very experiments they are studying? This is where the Clemens and Strain (2021) paper could be a guiding light. While any minimum wage that creates the bunching patterns of missing and excess jobs at the minimum wage’s bite that we saw in Cengiz, et al. (2019) is clearly binding, it is worth noting that the average increase during that period was significantly smaller than what is currently on the table. If we see disemployment effects now, but only in states adopting large increases, and we did not over a period of mostly smaller increases, then perhaps a minimum wage can shift employment from just below its bite to just above it without creating disemployment, not immediately and not even within four years, so long as the increase is not “too large”. It is when the increases under consideration are relatively large, at least relative to the smaller ones we have seen over the last ten years and the decades before that, that the disemployment effects may materialize. The reconciliation I find most plausible is that minimum wages may be tools we can use to reduce poverty without much cost only up to some point, and that point may be much closer than we realize.
Before I conclude, I’d like to make one final point, with regard to how we select our difference-in-differences designs and estimators. How will we choose robust models going forward? The phrase goes “I am not a betting man”, but I actually am a betting man, so let me make a bet. I bet we are still a little ways away from knowing when and why to use each of the new difference-in-differences estimators in our growing toolbox. If we run all of them and they all agree, then hoorah: we say “results are robust”. But what if they aren’t? Many of us may then be tempted to cherry pick the one estimator that yields results fitting our priors. And I don’t have an answer for that: each estimator is unbiased when the data generating process it assumes matches the data generating process in reality, but who among us feels such confidence?
This makes my stomach turn just to say out loud, because I am the first to admit that our papers already run a tad too long. But while pre-registration may help us get a handle on the replication crisis to some degree, for a while longer we may want to be open to the idea that our appendices will keep growing as referees and editors ask us to examine, and explain, to what degree and why we find different results using different estimators. In these papers, that ultimately did not matter: stacking, TWFE and imputation estimators largely agreed. But who is to say what will happen in your paper?
If you made it to the end, congratulations. I had one more thought: I think the reconciliation between these papers really does come down to the size of the minimum wage increase under consideration. Part of what Clemens and Strain are doing is categorizing treatments into small and large minimum wage increases. Note that this is not the same as specifying when a minimum wage was increased: two states could be in the same timing cohort even though one passed a large increase and the other a small one. A stacked regression that did not distinguish large from small increases could detect no employment effects if the small state’s null effect diluted the large state’s. So insofar as these papers can be reconciled, I suspect it is with respect to the size of the minimum wage increase. Combining the insights from all three will, I think, help future labor economists carefully design their questions, choose their research designs and select their samples, helping all of us gain deeper insights about the minimum wage.
Abraham and Sun (2018) would later be published in the Journal of Econometrics as Sun and Abraham (2020).
Eight of the 138 experiments were dropped as they cannot be accommodated within the stacked framework they propose.