Covariates, Repeated Cross Sections, Diff-in-Diff, and Parallel Trends Violations Plus Pictures of Lucca Italy
Getting off to a late start (again)
This is only part 1 of a post about repeated cross sections. It will also be sprinkled with pictures and videos from Lucca, Italy that, despite what you may think, have nothing to do with repeated cross sections and diff-in-diff.
I am writing this post today from a beautiful cafe near my bed and breakfast. They sell yummy pastries and delicious cappuccinos. I suspect the cappuccino, once I learn the price, will surprise me once again by costing a fraction of what cappuccinos cost in the United States, which has caused me to reflect on the economics of serving hot water filtered through ground coffee beans with a little milk, and to wonder whether firms in the US have been gouging me. Here’s my breakfast.
But this is not a post about the economics of cappuccinos. This is a post about difference-in-differences and repeated cross sections. What does difference-in-differences have to say about using repeated cross sections and what even is a repeated cross section?
To study a program using difference-in-differences, you need only two things in your data. One, you need outcomes observed in two periods: a period after the program happened and a period before it happened. And two, you need outcomes from two groups: one group of units that got the treatment and one that didn’t.
When you have data on outcomes for two groups over two periods, you can estimate the average effect of the program by regressing the long difference onto the treatment dummy, under two assumptions and one additional requirement: the treatment group must be untreated in the first period (the so-called no anticipation assumption); the same long difference regression, run on untreated potential outcomes rather than realized outcomes, would have to equal zero ("parallel trends"); and the comparison group must be untreated in both the pre and post treatment periods.
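To fix notation (this is just my own shorthand for the familiar 2x2, not anything from a particular source): call the treatment group D = 1, the comparison group D = 0, and the two periods pre and post. The diff-in-diff estimand is the difference between the two groups' long differences,

$$
\hat{\delta}_{2\times2} = \Big(E[Y_{post} \mid D=1] - E[Y_{pre} \mid D=1]\Big) - \Big(E[Y_{post} \mid D=0] - E[Y_{pre} \mid D=0]\Big),
$$

and under no anticipation and parallel trends it recovers the average treatment effect on the treated.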
Note that I said “groups of units”. That seems like simple jargon, perhaps easy enough to understand, but what do we mean by the phrase “groups of units”? We basically mean that our data is either strict individual-level panel data following the same units, or it is repeated cross sections of different units. Either can be used with diff-in-diff, and either is unbiased under parallel trends and the other assumptions, but parallel trends can be violated in different ways depending on whether it’s a panel or a repeated cross section. Panels get most of the attention, though, so I want to talk about repeated cross sections.
The parallel trends assumption in diff-in-diff is itself a 2x2 on the untreated potential outcome, Y(0), which is itself interesting. It is interesting, I mean, that parallel trends is a 2x2 diff-in-diff equation in which the equation equals zero. But since Y(0) cannot be observed for the treated units in the post period, we cannot calculate that equation to check whether it really is zero. And of course, if we had Y(0) for the treatment group in the post period, we’d not need to do diff-in-diff at all; we’d simply compare the treated outcome to the untreated outcome for the same units and be done with it. But my point is not that diff-in-diff overcomes the fundamental problem of causal inference, which is that we are missing the counterfactuals, but rather that its identifying assumption is simply the same equation with Y replaced by Y(0) and assumed to equal zero. In a way, relying on parallel trends is like gambling with a dealer who won’t show you his cards so that you can be sure he made blackjack, so we tend to look for clues instead. And covariate imbalances tend to be one of those clues, even and perhaps especially for diff-in-diff.
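Written in that same shorthand, parallel trends is literally the 2x2 equation with Y swapped out for Y(0) and set equal to zero:

$$
\Big(E[Y_{post}(0) \mid D=1] - E[Y_{pre}(0) \mid D=1]\Big) - \Big(E[Y_{post}(0) \mid D=0] - E[Y_{pre}(0) \mid D=0]\Big) = 0.
$$

The first term contains E[Y_{post}(0) | D=1], the dealer's hidden card, which is exactly the piece we can never observe.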
Covariates, though, typically don’t get emphasized with diff-in-diff because of the belief that event study plots are necessary and sufficient for deducing that parallel trends holds. But using event studies as if they were divining rods like that would be like watching the dealer turn over his cards a few times before you sat down, even though he never turns over his cards for you. Sure, seeing him turn over his cards a few times in a row before you sat down is reassuring, but it is hardly proof, and with money on the line, we would be a bit careless if it were the only thing we used to try and discern our opponent’s hand. Covariates matter because it is covariates that are usually the culprit when parallel trends has broken down.
But how exactly do covariates affect parallel trends with repeated cross sections? And how do they affect parallel trends with repeated cross sections in ways they don’t with panel data? The answer is that the reason is both the same and different. It’s the same in that if the covariates in question are causal factors in Y(0) trends, and those covariates have different distributions in the treatment group than in the control group, then that imbalance alone will break parallel trends. And it can go undetected even in event studies if the “returns” to those covariates in the first year differ from the “returns” to the same covariates in the second year.
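A toy model of my own making shows the mechanics. Suppose untreated outcomes depend on a covariate X with a period-specific “return”:

$$
Y_t(0) = \lambda_t + \beta_t X + \varepsilon_t .
$$

If each group's distribution of X is stable over time, the parallel trends gap from the 2x2 above works out to

$$
(\beta_{post} - \beta_{pre})\Big(E[X \mid D=1] - E[X \mid D=0]\Big),
$$

which is zero only if the returns to X don't change over time or if X is balanced across the two groups. Imbalance plus time-varying returns is what quietly breaks parallel trends.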
When parallel trends is violated because of changing “covariate trends”, the event studies are typically misleading. They are misleading if the changing covariate trends begin to happen over the pre-to-post periods. Correcting for it, though, is thankfully not an awful ordeal. First, you need to know the covariates. They must be, in the language of Heckman and Robb (1985), “known and quantified”, which means you know which covariates are causing the trends and you have them measured accurately in the data. Without both, you can’t undertake the corrections, of which there are many, some written by Heckman, but all of which appeal to a new parallel trends assumption: while the treatment and control groups are not on average on the same untreated trends, the sub-populations defined by the covariates are on the same untreated potential outcome trends, at least on average.
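Here is a minimal simulation sketch of that kind of correction. The data generating process, variable names, and adjustment step are all my own illustration of generic cell-mean adjustment under conditional parallel trends, not any particular paper's estimator:

```python
# Covariate imbalance plus time-varying "returns" to x breaks group-level
# parallel trends, but parallel trends still holds within cells of x.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

d = rng.binomial(1, 0.5, n)               # treatment group indicator
post = rng.binomial(1, 0.5, n)            # survey wave (0 = pre, 1 = post)
x = rng.binomial(1, 0.3 + 0.3 * d, n)     # covariate, imbalanced across groups
att_true = 2.0

# The return to x rises from 1.0 in the pre period to 2.5 in the post period.
y0 = 1.0 + 0.5 * post + (1.0 + 1.5 * post) * x + rng.normal(0, 1, n)
y = y0 + att_true * d * post
df = pd.DataFrame({"y": y, "d": d, "post": post, "x": x})

# Naive 2x2: biased because x is imbalanced and its return changes over time.
cell = df.groupby(["d", "post"])["y"].mean()
naive = (cell.loc[(1, 1)] - cell.loc[(1, 0)]) - (cell.loc[(0, 1)] - cell.loc[(0, 0)])

# Adjusted 2x2: take the control group's change within each x cell, then
# average those changes over the treated group's distribution of x.
ctrl = df[df.d == 0].groupby(["x", "post"])["y"].mean().unstack("post")
ctrl_change = ctrl[1] - ctrl[0]
w = df[(df.d == 1) & (df.post == 1)]["x"].value_counts(normalize=True)
counterfactual_change = (ctrl_change * w).sum()
treated_change = cell.loc[(1, 1)] - cell.loc[(1, 0)]
adjusted = treated_change - counterfactual_change

print(f"naive 2x2: {naive:.2f}  adjusted: {adjusted:.2f}  truth: {att_true}")
```

In this setup the naive 2x2 should come out around 2.45 while the adjusted version should land near the truth of 2.0, because within cells of X the control group's change is still a valid counterfactual even though the group-level trends are not parallel.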
That insight, that covariates may themselves have different causal effects on Y(0) trends, is I think a very helpful way to think about parallel trends violations in general. And it is helpful for thinking about what it is about the repeated cross section that gives it its own unique form of parallel trends violation. The secret is that insofar as the composition of units within the treatment and control groups is changing, you may be in trouble. Specifically, if there are heterogeneous relationships between Y(0) trends and the covariates themselves, then as the collection of units in your treatment and control groups changes with respect to those covariates, parallel trends cannot help but be mechanically violated. And again, it may or may not be detectable in the event studies if the changing composition is happening over time.
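Using the same toy model as above, the compositional channel shows up directly. In a repeated cross section the covariate mean inside a group is indexed by the wave you sampled it in, so group d's untreated change is

$$
E[Y_{post}(0) \mid d] - E[Y_{pre}(0) \mid d] = (\lambda_{post} - \lambda_{pre}) + \beta_{post}\, E[X \mid d, post] - \beta_{pre}\, E[X \mid d, pre],
$$

and if the sampled composition E[X | d, t] drifts across waves differently in the treatment group than in the control group, parallel trends breaks even when X started out balanced and even when its return is constant over time.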
Let me just be more concrete. Let’s say that you’re working with a survey like the Demographic and Health Surveys (DHS), which are household surveys commonly used by development economists. These are repeated cross sections, not panels, because each new wave is a fresh random sample of households. For development economists and economists working in developing countries, panel datasets may not always be available because panels are expensive. They require more than merely sampling from a sampling frame: you have to track the same people over time and manage the attrition, all of which can get very expensive. So repeated cross sections like the DHS are important datasets.
So it’s obviously important for us to understand the best way to use repeated cross sections like the DHS in diff-in-diff designs, given that panels are scarce, administrative datasets scarcer still, repeated cross sections relatively common, and diff-in-diff is sometimes the only way to study a large program that has not been randomized.
BTW, real quick, this is the Spotify playlist I’ve been making for my summer, though I’ve been doing more eating than praying, and I’ve been doing more praying than running, but I should try to fix that this week. Why don’t we all try to learn Spanish this summer using only the song “Despacito” by Luis Fonsi, Daddy Yankee and Justin Bieber? Surely that’s possible, right? Just skip to the 20th song and start practicing.
I’m going to write a little about a 2013 article by Seung-Hyun Hong entitled “Measuring the Effect of Napster on Recorded Music Sales: Difference-in-Differences Estimates Under Compositional Changes” to describe his solution to the problem of compositional changes in repeated cross sections. I decided to build up a discussion of this paper, and Hong’s method, in the second edition of the Mixtape, as I really think it’s an underappreciated paper. He uses covariates to address the compositional change, the method is straightforward, and it can easily be extended to more complex scenarios such as multiple time periods. More than likely it can also accommodate Callaway and Sant’Anna style applications to differential timing setups, since it’s simply a type of propensity score correction applied to 2x2s; Callaway and Sant’Anna is a way of correctly estimating individual 2x2 building blocks, and Hong is a way of correcting for compositional changes within 2x2s, so I think you could do it block by block. What I’m not 100% sure about is how you simultaneously address conditional parallel trends and compositional change when both of them are addressed with propensity scores, or whether Hong’s approach will simply envelop both. I’m going to think more about that.
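To give a flavor of what a propensity score reweighting of a 2x2 can look like, here is a hedged sketch. To be clear, this is not Hong’s published estimator; it’s a generic inverse probability weighting illustration I wrote, in which each group-by-period cell is reweighted so that its covariate distribution matches that of the treated group in the post period.

```python
# Sketch of an IPW-style 2x2 diff-in-diff for repeated cross sections:
# every cell's mean outcome is reweighted toward the covariate distribution
# of the treated-post cell before taking the double difference.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def ipw_did(df, y="y", d="d", post="post", covs=("x",)):
    """Reweighted 2x2: odds-of-the-propensity weights push each cell's
    covariate distribution toward the treated-post reference cell."""
    covs = list(covs)
    ref = df[(df[d] == 1) & (df[post] == 1)]
    means = {}
    for dd in (0, 1):
        for pp in (0, 1):
            cell = df[(df[d] == dd) & (df[post] == pp)]
            if dd == 1 and pp == 1:
                means[dd, pp] = cell[y].mean()
                continue
            # Pool this cell with the reference cell and model the probability
            # of belonging to the reference cell given the covariates.
            stacked = pd.concat([cell, ref], ignore_index=True)
            z = np.r_[np.zeros(len(cell)), np.ones(len(ref))]
            X = sm.add_constant(stacked[covs])
            ps = np.asarray(sm.Logit(z, X).fit(disp=0).predict(X))[: len(cell)]
            means[dd, pp] = np.average(cell[y], weights=ps / (1 - ps))
    return (means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0])
```

With the simulated df from the earlier sketch, ipw_did(df) should land near the true effect of 2.0, since holding the covariate distribution fixed across cells is exactly the margin along which the imbalance (or, in other settings, the changing composition) was doing the damage.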
But not today. Today I’m going to get to work because I foolishly scheduled 5 meetings starting at 3pm my time lasting until around 9:30pm my time. Which gives me a mere four hours now to get a lot of work done on this draft, which I have now realized was not due last night, but rather is due Wednesday night at the latest. Harvard must wait a little while longer.