Validating Econometric Design: A Case Study from James Habyarimana's Updates to his 2003 Job Market Paper
This substack takes a closer look at a significant yet subtle transition in economic research methodology, using James Habyarimana’s 2005 revision of his 2003 job market paper as a case study. James, who graduated from Harvard and is now a professor at Georgetown, initially presented a difference-in-differences (diff-in-diff) analysis (including an event study) of the effect of the closure of four Ugandan banks in 1997 and 1998 on the performance of firms linked to those banks. It is an interesting study on its own. But for this substack, I am using it as a case study because his 2003 paper did not use the phrase “parallel trends” while the 2005 revision did. As we all know, “parallel trends” is the key identifying assumption in diff-in-diff, so the fact that it was absent in the original 2003 version but present in the 2005 revision is interesting for understanding the broader paradigm shifts that causal inference brought to the profession.
With assistance from ChatGPT-4, I compared both versions of the paper to try to understand the rationale behind James’ decision to insert “parallel trends” language into the paper. I will argue that the insertion of “parallel trends” language and the introduction of a propensity score model in the 2005 revision reflect a deep and perhaps underappreciated part of the legacy of the natural-experiment-based causal inference paradigm in economics: going forward, researchers must provide evidence to validate their methodological design choices. James’s revision, in my opinion, is not simply about adopting new jargon, nor is it about enhancing the robustness of the results. Rather, it is about ensuring that the assumptions underlying causal inference itself are meticulously verified.
I am not the first to point this out by any means. This is quite apparent in Josh Angrist and Steve Pischke’s 2010 Journal of Economic Perspectives article, “The Credibility Revolution in Empirical Economics: How Better Design is Taking the Con out of Econometrics”. This perspective is, I think, due to the historical influence that Princeton’s Industrial Relations Section had on the subsequent formation of causal inference frameworks developed at Harvard in the collaborations between Guido Imbens, Josh Angrist and Don Rubin. Here, James’s revision serves as a case study in how rigorous validation of design became embedded in the practice of econometrics in the early 2000s and shaped how researchers substantiate their analytical frameworks.
Introduction
It’s no secret that I am unable to separate my love for causal inference from my desire to understand it, from my interest in the diffusion of ideas that led to a paradigm shift within empirical economics, and from my interest in the people and places involved. If you’ve been to one of my workshops, you usually have to hear me grasp at some “story”, as I call it, of how the natural experiment methodology came about at Princeton in the 1970s and 1980s and then merged with Rubin’s causal inference framework at Harvard with Angrist and Imbens in the 1990s. I can never quite scratch the itch, and I tend to get stuck on particular topics too, so it gets a little aggravating when I feel like I’m running out of ideas to explore. One of those ideas lately, though, has been: can I find any evidence for this transmission of ideas? And if so, what?
One approach I had been taking is what someone told me was the “regex” approach. Regular expressions. Counting words. Counting phrases. You may have seen an earlier set of posts I did on the history of diff-in-diff in which I identified distinct “waves” in the evolution of diff-in-diff as well as its adoption. I counted the number of times words appeared by year in papers using Google Scholar and then plotted the counts in different ways. So I plotted phrases like “parallel trends” and “event study” in combination with the phrase “difference-in-differences”, and I noted that there were periods when those words did not occur together, and then there were periods when they almost always appeared together. I called the first wave 1983 to 2011 and the second wave post-2011 through 2018. The first wave of diff-in-diff really did not seem, to me anyway, to emphasize “parallel trends” as the identifying assumption, nor did it include the data visualization of what we now call the event study. The second wave, starting around 2011, seems to be when the event study became an obligatory practice and parallel trends became the obligatory framing of identification.
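For readers curious what this “regex” approach amounts to, here is a minimal Python sketch. The toy corpus and phrase list below are my inventions for illustration, not the actual Google Scholar data I used; the idea is just to count, by year, how many papers mention each phrase:

```python
import re
from collections import Counter

def count_phrases(texts_by_year, phrases):
    """Count how many papers per year contain each phrase (case-insensitive)."""
    counts = {p: Counter() for p in phrases}
    for year, texts in texts_by_year.items():
        for text in texts:
            for p in phrases:
                if re.search(re.escape(p), text, flags=re.IGNORECASE):
                    counts[p][year] += 1
    return counts

# Toy corpus standing in for scraped paper texts (illustrative only)
corpus = {
    2003: ["We use a difference-in-differences design with an event study."],
    2005: ["Our difference-in-differences design relies on the parallel trends assumption."],
}
hits = count_phrases(corpus, ["parallel trends", "difference-in-differences"])
print(hits["parallel trends"][2005])  # → 1
print(hits["parallel trends"][2003])  # → 0
```

From there it is just a matter of plotting the yearly counts to see when the phrases begin to travel together.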
But it’s hard to pin that down exactly using just regex approaches. Maybe it was just jargon. Maybe the practices were always there, but the jargon did not coalesce until later. After all, you could argue that looking for evidence of Ashenfelter’s Dip is the precursor to the modern event study, and that goes back to the 1970s. I don’t think that is exactly the case, though. I get the sense that difference-in-differences papers for a very long time did not investigate whether Ashenfelter’s Dip was holding. When we had predominantly traditional fixed effects approaches to program evaluation during the 1990s, I don’t think (but maybe I’m wrong) it was common to see people running fixed effects regressions on state-level panel data focused on legal changes and investigating pre-trends. It’s not really in Gruber’s 1994 triple-diff paper, I don’t think, is it? And it wasn’t in the original Card and Krueger minimum wage paper, though that does come up in the responses by Dave Neumark. Still, appeals to “strict exogeneity” seemed to be far more the focus than evidence about diverging trends in the pre-period. I cover a lot of that in this substack and won’t again here.
What I am now going to do is try to argue that something appears to be happening between the years 2001 and 2005 at Harvard and MIT with regards to the “parallel trends” jargon. I had noticed it originally in the first substack above, because I had noticed something with 2003 Harvard graduate James Habyarimana, who is now a professor at Georgetown. James’ 2003 job market paper is titled “The Benefits of Banking Relationships: Evidence from Uganda’s Banking Crisis”. It uses a difference-in-differences design with an event study, but it does not mention the phrase “parallel trends” one time. Interestingly, by chance, I found his later 2005 revision. It appears that this revision was prepared for the annual African Finance and Economic Association (AFEA) conference, as the name of the pdf is “AFEA_2006_habyarimana.pdf”. In this revision, written only two years later, the phrase “parallel trends” appears 8 times on 6 pages. I emailed James and asked him if he remembered why he updated the paper to go from never saying “parallel trends”, despite using an event study with diff-in-diff, to writing a paper with more contemporary diff-in-diff rhetoric. He was intrigued but said he could not remember; he thinks he must have gotten that kind of feedback on the job market or when submitting to journals.
So, what I did was load the two pdfs into ChatGPT-4. I explained to ChatGPT-4 what I wanted to know. Here was the prompt:
Me: I have two versions of the same paper. They are both written by James Habyarimana, a professor at Georgetown. The first one is named "James 2003.pdf" and the second one is named "James 2005.pdf". The first one is his job market paper which is the jargon used in economics for the single paper you go forth with when you graduate with your PhD. He was a 2003 graduate of Harvard's Phd program in economics and "James 2003.pdf" was his job market paper. He then gets a job as an assistant professor at Georgetown in their policy school (maybe also economics I'm not sure). And then he revises the paper which I've included as "James 2005.pdf". There are changes between the two. The main things I have found is that the 2003 version never mentions the phrase "parallel trends", which as you know, Cosmos1, is standard jargon for the difference-in-differences design. It is an event study paper on banking in Uganda, but it does not mention the phrase "parallel trends". In 2005, the phrase "parallel trends", though, appears 8 times. So, before we get started, Cosmos -- do you think you can help me compare these two documents and try to understand what else related to difference-in-differences and parallel trends seems to be getting incorporated in the revision? I am interested in this period, because this seems to be the period when there is some kind of updating of concepts, and it seems to be coming from Harvard/MIT. But I wanted to first just get a second opinion from you.
The 2005 version
Cosmos, being the good sport that he is, told me that this was not a problem, so I followed up with a second prompt.
Me: Okay this may be a long shot, Cosmos, but I'm wondering if we go backwards it may help you? I'm going to give you the 2005 revised version first. I'm doing this because this has the parallel trends language in it already. This I think will help us identify specific places where the phrase appears. So please identify the 8 places where the word "parallel" appears and provide contextual analysis around those appearances in one or two sentences per appearance, unless it gets redundant. I am then in a moment going to give you the 2003 original version so that we can see where it is in the original that James appears to be making these adjustments.
So I uploaded this to Cosmos and he gave me this feedback.
Cosmos: Here are the instances where "parallel" appears in the "James 2005.pdf", along with a brief analysis of each context:
1. Robustness of the parallel trends assumption:
- Context: The paper discusses the robustness of the parallel trends assumption in difference-in-differences (DiD) estimation. It highlights the need for careful interpretation when affected and unaffected firms are very different, which could invalidate the assumption.
- Analysis: This mention points to the fundamental requirement in DiD analyses that the pre-treatment trends between treated and control groups are similar, a condition crucial for attributing observed effects to the treatment.
2. Validity of parallel trends in pre-crisis data:
- Context: The paper reflects on the validity of the parallel trends assumption based on pre-crisis data, which is limited to two data points.
- Analysis: This indicates a methodological caveat; with only limited pre-crisis data, asserting parallel trends, which underpin the credibility of DiD results, becomes statistically tentative.
3. Use of propensity score matching:
- Context: To strengthen the parallel trends assumption, the paper employs propensity score matching to adjust for potential differences between the groups.
- Analysis: This adaptation aims to enhance the study's methodological rigor by creating more comparable groups, thereby reinforcing the assumption's validity.
4. Assumption in fixed effects models:
- Context: The assumption is also discussed in relation to fixed effects models, where it's briefly mentioned alongside discussions of other econometric strategies.
- Analysis: This highlights the pervasive importance of the assumption across various econometric models used in the analysis, underpinning the need for its careful consideration throughout the study.
5. Relaxation of parallel trends in robustness checks:
- Context: The assumption's relaxation is discussed as part of robustness checks.
- Analysis: This suggests flexibility in the empirical strategy, acknowledging that the strict application of this assumption may not hold under all conditions observed in the study.
Before I move on to the 2003 original job market paper, I wanted to comment a little on what Cosmos found.
First, point 2 — the fact that he does not have very many pre-treatment (pre-crisis) data points. Cosmos calls this the “validity of parallel trends in pre-crisis data”. The section that Cosmos is referring to is page 20, just prior to the conclusion of the analysis sections. James writes this:
James: “The evidence in figures 1 and 2 is suggestive of a valid parallel trends assumption. However, given that we only have 2 data points for the pre-crisis period, we cannot be certain of its validity. To overcome this problem we use propensity score matching techniques to create a more valid control group. In particular we run a probit model on a number the firm characteristics shown in table 2 and use the predicted propensity of losing a banking relationship as a control in the specification used in table 5. Table 11 presents the results of propensity score corrected analysis that is analogous to the specifications used in table 5. The standard errors in this table are clustered at the firm level. The estimated propensity score is significant in all specifications. The test that loss of a banking relationship does not matter is rejected at the usual levels in all specifications except (5). These results corroborate the validity of the parallel trends assumption.”
So I pulled up Table 11 for you to see. What James appears to be doing in the 2005 revision is addressing some limitations in the data. Now, in a moment we will listen to Cosmos analyze the 2003 paper, but for now, I will just say this: James did not have a propensity score model in the 2003 JMP. Table 11 is new. The 2003 JMP ends at Table 10, and Table 10 is the same in both versions of the paper. Table 11 appears to be a new “robustness” specification born of recognizing that he lacks “enough” pre-treatment data points.
Now, no number of pre-treatment data points is sufficient to verify that parallel trends holds, because the assumption concerns a post-treatment counterfactual trend, not pre-treatment trends. Nevertheless, we use the pre-treatment event study coefficients in a diff-in-diff based event study design to provide “smell test” evidence for the plausibility of that counterfactual parallel trend. If I had to guess, though, someone on the job market or in the refereeing process brought to James’ attention that there was not enough pre-treatment data to assess the plausibility of parallel trends, which is why James then includes his propensity score analysis as a robustness check, which I reproduce below.
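To make concrete what “using the estimated propensity score as a control” in a diff-in-diff specification looks like, here is a rough sketch on simulated data. To be clear about assumptions: every number here is invented, I substitute a linear probability model for James’s probit so the sketch needs only numpy, and I omit the firm-level clustering of standard errors that his Table 11 uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated firm-year panel (all numbers invented, not James's data)
n_firms, years = 200, [1997, 1998, 1999, 2000]
size = rng.normal(0, 1, n_firms)             # a firm characteristic (cf. Table 2)
p_treat = 1 / (1 + np.exp(-0.5 * size))      # bigger firms likelier to lose a bank
treated = (rng.random(n_firms) < p_treat).astype(float)

rows = []
for i in range(n_firms):
    for t in years:
        post = float(t >= 1999)              # crisis hits between 1998 and 1999
        y_it = size[i] + 0.3 * post - 0.4 * treated[i] * post + rng.normal(0, 1)
        rows.append((i, y_it, treated[i], post))
data = np.array(rows)
firm, y, D, post = data[:, 0], data[:, 1], data[:, 2], data[:, 3]

# Step 1: propensity model on firm characteristics (linear probability
# model here, standing in for James's probit)
Xp = np.column_stack([np.ones(n_firms), size])
coef = np.linalg.lstsq(Xp, treated, rcond=None)[0]
pscore_firm = Xp @ coef
pscore = pscore_firm[firm.astype(int)]       # broadcast to firm-years

# Step 2: DiD regression with the estimated propensity as a control
X = np.column_stack([np.ones(len(y)), D, post, D * post, pscore])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"DiD estimate (simulated true effect is -0.4): {beta[3]:.2f}")
```

The point of the exercise is that the coefficient on the interaction term recovers the treatment effect even though treatment assignment depends on firm characteristics, because those characteristics are absorbed through the estimated propensity.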
So the decision to update the paper to include explicit mentions of parallel trends appears to be related to the analysis not having (in at least James’s mind) enough pre-treatment data for him to pin down and make a claim about the ongoing counterfactual trends for the treatment firms versus the control firms. It is almost as though parallel trends is brought into the paper because of a perceived lack of confidence about parallel trends, as evidenced in the pre-trends, due to data limitations. It isn’t, in other words, that James’ event study “looks bad”; rather, James’ data does not have very many pre-treatment periods. He, like many, prefers to plot the raw data as opposed to the regression coefficients, which I reproduce here in Figure 1.
So you can see here, I think, the reasoning: there are only two data points pre-treatment. Yes, they look like they are trending together, but when you only have two data points (1997 and 1998), you in fact only have one trend (the change from 1997 to 1998). And so while it is the case that both the affected and unaffected firms are rising from 1997 to 1998, usually we want more than that.
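The mechanical point can be seen in a small simulated event-study sketch (again, all parameters are invented): with pre-crisis data only for 1997 and 1998, and 1998 serving as the omitted base year, the regression yields exactly one estimable pre-treatment “lead” coefficient, which is thin evidence on which to judge pre-trends:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two pre-crisis years (1997, 1998) and two post-crisis years, loosely
# mimicking the structure of James's panel; all numbers are invented
years = np.array([1997, 1998, 1999, 2000])
n = 150
treated = (rng.random(n) < 0.5).astype(float)

rows = []
for i in range(n):
    for t in years:
        effect = -0.4 if (treated[i] and t >= 1999) else 0.0
        y_it = 0.2 * treated[i] + 0.1 * (t - 1997) + effect + rng.normal(0, 0.5)
        rows.append((y_it, treated[i], t))
arr = np.array(rows)
y, D, t = arr[:, 0], arr[:, 1], arr[:, 2]

# Event-study dummies relative to 1998, the omitted base year: one lead
# (1997) and two lags (1999, 2000), each interacted with treatment
d97, d99, d00 = (t == 1997) * 1.0, (t == 1999) * 1.0, (t == 2000) * 1.0
X = np.column_stack([np.ones_like(y), D, d97, d99, d00,
                     D * d97, D * d99, D * d00])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"single pre-trend coefficient (1997 lead): {beta[5]:.2f}")
print(f"post-crisis effects (1999, 2000): {beta[6]:.2f}, {beta[7]:.2f}")
```

With a longer pre-period there would be several lead coefficients to inspect; here the entire pre-trend check rests on one number.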
But that’s me writing in 2024. In 2003, it doesn’t seem like James felt that was necessary to investigate. So what I want to do now is have Cosmos read what I just wrote, and then analyze the 2003 original JMP, focusing on the same sections where the phrase “parallel trends” appears in the 2005 version. I want to know why and where Cosmos thinks James makes these adjustments.
The 2003 version
So, moving on to the original 2003 version, I gave this new prompt to Cosmos.
Me: Okay, let's now look at James 2003.pdf. I think there's two things I need from you Cosmos. First, I think we need to look at where parallel trends appears in the 2005 version and then look in the 2003 version and find why isn't it there originally? In other words, there are sections of the paper originally where it's missing, and I want you to tell me about those "missing parallel trends sections". What do you notice? Do the same kind of 1-2 sentences about those 8 "future parallel trends areas". But then the second thing I am seeking from you is some conjecture. What exactly is changing in your mind, beyond merely the appearance of jargon, from 2003 to 2005? Focusing on those sections I think is probably where to do this, because I am not really all that concerned with the robust updating of the paper so much as I'm trying to find the signs of a "paradigm shifts" in James paper. I hope this makes sense.
I suspect I have inadvertently “primed” Cosmos by this point with my own theorizing. At this point, there’s not really anything I can do about it, but at the same time, I’m not sure that is a bad thing. I needed Cosmos to understand my point of view, so I explained the idea several times. But Cosmos appears to come to a similar conclusion as me, even though I had to ask it to undertake this task twice. Cosmos appears, like me, to think that something is happening outside of the paper. Something in the broader academic community, he thinks, is causing James not just to add in the propensity score analysis, but to add it in while simultaneously reorienting the paper around the concept of parallel trends. These are, in other words, two separate phenomena. But I think the appearance of the “parallel trends” jargon and the movement towards a robustness analysis using propensity scores must be connected with that event study graph, which is why it all starts to come together in the revision. Here is what Cosmos said.
Cosmos: After analyzing the 2003 version of James Habyarimana's paper, it appears that the concept of "parallel trends," crucial for establishing the credibility of difference-in-differences analysis, is notably absent in the original document. Here's a look at how the sections developed over time from the 2003 to the 2005 versions:
In the 2003 version, the analysis is more focused on the immediate empirical strategy and results without explicitly addressing the assumptions critical to validating the difference-in-differences methodology [my emphasis]. The analysis talks about robustness checks related to the effects of cash flow disruptions and the model's specification without delving into the specifics of parallel trends or propensity score matching.
By 2005, James adds detailed discussions about the "parallel trends" assumption, mentioning it explicitly and addressing the assumption's robustness. He incorporates propensity score matching to strengthen the validity of this assumption, acknowledging the limitations of pre-crisis data points and adjusting the control group's validity.
The transition between these versions suggests a significant enhancement in methodological rigor, reflecting a response to evolving academic standards or critiques received from peer reviews. This change underscores a broader shift within the economic research community, where there's a growing emphasis on the robustness of causal inference techniques, especially around foundational assumptions like parallel trends.
This evolution in James' approach could indeed be seen as part of a paradigm shift in economics research during the early 2000s, where the precision and transparency in validating causal claims became increasingly paramount. This shift likely reflects broader trends in economic research influenced by new methodologies and critiques from academic peers, emphasizing a more rigorous validation of assumptions underlying econometric analyses.
Very interesting analysis. I had not, until Cosmos said this, really connected those dots. James appears, according to Cosmos, not merely to be bringing in diff-in-diff jargon, nor merely to be bringing in the propensity score as a “robustness”. Rather, James appears to be bringing in something methodologically different from both of those. He is bringing in the idea of validating the diff-in-diff methodology. In other words, James’s revision seems driven by an effort to validate the methodology, and as he finds weaknesses in it (specifically, only having two pre-treatment data points), he attempts a robustness analysis using propensity scores so as to try to create, I think, more comparable groups.
Now, he does not match on the propensity score, and he also does not estimate Abadie’s semiparametric difference-in-differences method. In fact, he does not cite the 2005 Abadie paper in Restud. That is not surprising, as that paper was published in 2005, and this is James’s 2003 JMP revised for the AFEA conference in 2005. Interestingly, I emailed Abadie and asked him for the original version of the paper that he submitted to Restud in June 2001, which was then accepted in 2003 and published in 2005. I wanted to check whether his original 2001 submission used the “parallel trends” jargon, since the published 2005 version does. Interestingly enough, that exact jargon did not appear in the 2001 version. But that is for a future substack.
My opinions
So, where am I going with this? Oddly enough, I think that James is integrating “parallel trends” into his revision because at a deeper level, he is attempting to validate the diff-in-diff methodology itself. What’s interesting to me about that is that those two things are not connected in the econometrics. This is, I think, part of a broader trend in the credibility revolution. The credibility revolution was not merely about bringing in natural experiments, and it was not merely about moving towards design principles and it was not merely about baking potential outcomes modeling frameworks and notation into the underlying econometric theory itself. It was all those things, but it was also additional things. And one additional thing is efforts to validate assumptions.
If we go back to Lalonde (1986), I think what we see is that the reason why the econometric estimators Lalonde was investigating “fail” is not that they are mathematically failing. They cannot mathematically fail, as they have already been proven to produce unbiased estimates of some parameters. The problem that Lalonde unearths, in my opinion, through his clever approach is that those estimators were being misapplied because the assumptions underlying them were not appropriate for the dataset he had created. When a method’s assumptions, in other words, are not satisfied by the dataset itself, then that method, while able to produce unbiased and consistent parameter estimates in principle, will fail to do so in practice.
The history of difference-in-differences has always connected validation of the design with identification itself in the event study. We see it going back originally to Ashenfelter’s Dip, a phrase that Jim Heckman coined when observing the job training program analysis that Orley had done at the time: the two groups, job trainees and non-trainees, were following similar trends before enrolling in the job training program, except for a few periods just before. Just before the job trainees enrolled in their respective programs, there was a relative dip, or decline, in their economic outcomes. It was a sign of economic distress and a sign that unobserved factors in the job trainees’ lives were likely causing them to select into the program itself, and that selection was going to invalidate the diff-in-diff design. Orley and Card had ways in which they attempted to address this; Card even had a selection method at one point that tried to re-engineer a more comparable group.
But my point is different. My point is simply that diff-in-diff was one of those designs that simultaneously estimated causal effects and attempted to provide evidence for the validity of the design itself, and that evidence is in the event study. And for some reason I cannot quite put into words, I think part of the legacy of the credibility revolution is not merely these new tools, and it is not merely the adoption of natural experiment forms of thinking. Rather, it is simultaneously new tools and a preoccupation with trying to justify those tools by finding evidence for the validity of their assumptions. And I think that is what we are seeing in James’ updated thesis.
Long-time readers know that I address my ChatGPT-4 as “Cosmos” because one day I asked ChatGPT-4 to “go to the Universe and ask the Universe what its name is and then come back and tell me its name.” It did and told me its name was Cosmos, so I now call my ChatGPT-4 Cosmos.