Designing your diff in diff with a checklist, step 2: counting the treated counties

Mar 05, 2025

This series is based on a set of instructions I give my students when they undertake any study using diff in diff. It’s a checklist that I find helps us all stay on the same page, speak the same language, go slow, and, in the process, learn a lot even before the actual work has been completed.

This substack is about step 2 in the checklist. At the end, I provide code for this step in Stata, R and python — which was a first for me, but Cosmos came through (for the most part) after much weeping and gnashing of teeth. Let’s start!

I remember once having a conversation with someone talking about their study, which used diff in diff. I asked them how many units were treated in each cohort. They did not know the answer so we moved on. With datasets so large, it didn’t surprise me that they didn’t seem to know that detail off the top of their head. What surprised me was that they’d never sought to know before my question.

Causal inference has a few core pieces that make it a different task than other forms of data science. The first is the use of potential outcomes to describe a target parameter. In the earlier steps, where I discussed weighting, that was important. But then there is the idea of the “treatment assignment mechanism”, or what others have called selection. Why are some units treated but not others?

I guess I see knowing who is treated, who is not, and when as a key step in thinking about the treatment assignment mechanism. If you can’t name them, if you can’t see them and count them, then how can you make even basic arguments, such as that treatment assignment was random or selected on observables?

This step in the checklist is simple, but it fits with a larger goal, which is to ascertain the treatment assignment mechanism. One of our goals is to make a realistic claim as to why the units in your dataset did and did not get treated. Step 2 in this checklist is simply you counting and reporting your treatment population. By doing this, you can put a name to a face for later, when you try to think carefully about the mechanisms assigning treatment. You will also create an exhibit you can share with coauthors that will help you piece together basic facts about your study. So the only thing we are going to do is make a table in Stata, R and python, with the goal being that the creation of the table is automated so you don’t have to copy and paste stuff by hand.

But before we do, let’s flip three coins and see how many come up heads to figure out whether the post will be paywalled.

Heads! So then this post will be paywalled, after a brief free window below. Thank you though everyone for supporting this substack.

What Will I Count?

Sometimes different units adopt treatment at different times, while some never do at all. Some might even have been treated already before our dataset began. That gives us three categories: already treated, never treated and eventually treated. Together they exhaustively describe every unit in the panel.
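To make the three categories concrete, here is a minimal sketch in plain Python. It is only an illustration, not the code from this post: the function name `classify_unit`, the county codes, the treatment years and the panel start year are all made up, and I am assuming each unit's first treatment year is stored as an integer, with `None` for units never treated in the observed window.

```python
from typing import Dict, Optional


def classify_unit(first_treated: Optional[int], panel_start: int) -> str:
    """Return the treatment category for one unit.

    first_treated: first year the unit is treated, or None if it is never
                   treated in the observed window (hypothetical encoding).
    panel_start:   first year observed in the panel.
    """
    if first_treated is None:
        return "never treated"
    # "Already treated" here means treated strictly before the panel begins.
    # Units first treated in the very first panel year have no pre-period;
    # how to handle them is a judgment call this sketch does not settle.
    if first_treated < panel_start:
        return "already treated"
    return "eventually treated"


# Made-up example data: county code -> first treatment year.
first_treated_by_county: Dict[str, Optional[int]] = {
    "01001": 1995,
    "01003": None,
    "04013": 1977,
    "48453": 1995,
    "48201": 2003,
}
PANEL_START = 1980  # assumed first year of the panel

categories = {
    fips: classify_unit(year, PANEL_START)
    for fips, year in first_treated_by_county.items()
}

for fips, category in categories.items():
    print(fips, category)
```

Because every unit falls through exactly one of the three branches, nobody can end up uncategorized, which is the whole point of the exercise.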

When you jump immediately into analysis without figuring out which panel unit is in which of those three categories, you run the risk of driving faster than your headlights can illuminate the road. After all, using already treated units as controls has the potential to create bias, so at minimum, knowing how many of those little fellas are there is a good idea. Or maybe there are spillovers between treated and control units, and maybe that only becomes obvious when you see who could potentially serve as a control.

I wrote a similar post in June 2024, but in that post I focused on state-level crime data and the state rollout of concealed carry laws. You can see that old post below; its front page shows the table in question. As you can see, I was able to not just count the treated states by cohort, I was also able to list their names. That is part of the value, I think, of working with the most aggregated datasets: you can go beyond counting. You can also name names.

Pedro's Diff-in-Diff Checklist

Step (2) of Pedro's DiD checklist: documenting how many units are in each cohort

scott cunningham

June 16, 2024

I am going to continue walking us through Pedro Sant’Anna’s difference-in-differences checklist (“Pedro’s checklist”) with a focus on step 2. It’s pretty straightforward, but since I wanted to show code, I thought I’d make it its own substack entry rather than combine it with step 3. Step 2 is “Document how many units are treated in each cohort.”

But at the county level, it’s not possible to make a table with all the county names and make it readable too, or at least I’m not sure how to do it myself. That’s because in the US, we have over 3,000 counties. So if those treatment dates above are accurate, just listing the names of the counties is probably going to be impossible.

So what I’ll be doing here is building a table that simply counts the number of treated units by cohort, with the caveat that you may want to create a table of state names like I did there, too. The main reason is that states adopted concealed carry laws, and so selection happened at that level.
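As a rough sketch of what counting by cohort can look like, here is a small plain-Python version. It is not the code from this post: the helper names `cohort_table` and `format_table`, the county codes, the treatment years, the panel start year and the table title are all hypothetical, and `None` again stands in for never treated.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def cohort_table(
    first_treated: Dict[str, Optional[int]], panel_start: int
) -> List[Tuple[str, int]]:
    """Count units per treatment cohort, in a fixed, readable row order."""
    already = 0
    never = 0
    by_year: Counter = Counter()
    for year in first_treated.values():
        if year is None:
            never += 1
        elif year < panel_start:
            already += 1
        else:
            by_year[year] += 1

    # Already treated first, then cohorts in calendar order, then never treated.
    rows = [("Already treated", already)]
    rows += [(str(year), by_year[year]) for year in sorted(by_year)]
    rows += [("Never treated", never), ("Total", len(first_treated))]
    return rows


def format_table(rows: List[Tuple[str, int]], title: str) -> str:
    """Render the rows as a plain-text table with a title and column labels."""
    width = max(len(label) for label, _ in rows)
    lines = [title, f"{'Cohort':<{width}}  Counties"]
    lines += [f"{label:<{width}}  {n:>8}" for label, n in rows]
    return "\n".join(lines)


# Made-up example data: county code -> first treatment year.
first_treated_by_county: Dict[str, Optional[int]] = {
    "01001": 1995,
    "01003": None,
    "04013": 1977,
    "48453": 1995,
    "48201": 2003,
}

rows = cohort_table(first_treated_by_county, panel_start=1980)
print(format_table(rows, "Number of counties by treatment cohort"))
```

The "Total" row is there as a sanity check: if it does not match the number of counties in your panel, some unit was dropped or double counted before it got to the table.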

Building the Table

The table should be so easy to read that it is nearly impossible to misinterpret. It is incredibly easy to make a table that is difficult to read and easily misinterpreted so our task is the opposite. It should stand on its own two feet and exist as its own object that someone could understand even without reading the paper, which means titles and labels are important. Here’s what we want to include:
