This is going to be short. I have to teach at Sant’Anna this afternoon in Pisa, and for the last, on I don’t know, 4-5 yours I have been trouble shooting something where I was using Callaway and Sant’Anna (CSDID) on a project. And if you look back over this Claude Series, CSDID is the most common applied econometric methodology I’ve been using. My various experiments in a paper on the minimum wage, in fact, are all forcing Claude Code to use CSDID. So it’s in that context I want to say a few things.
I have become convinced that we cannot when automating our research step away. This common Claude trope where people do all this planning and then let it rip, putting confidence in /skills, is absolutely backfiring. I only know that it is from my own self-experimentation, but I have documented results that it is. For one, I am finding in this paper of mine that Claude is actually specification searching, and upon finding a particular method that does not contradict the prompt, only writing that up. The only way I found out was because of a comment that an extremely insightful MIT student came up to me and talked about with me which is that Claude is actually storing all of his decisions in a JSON. So I had the analyzed too, and sure enough, there’s way more under the hood than I was expecting — like turning over a rock and finding a much weird bugs.
So that’s one thing.
But it’s not just that. The policy recommendation from those experiments is to pre-register. Maybe you pre-register like literally, but it’s not just that. You have to dictate up front the population estimand. You have to literally write down in potential outcomes which aggregation you’ll be targeting. This is the “forward engineering” approach that we advocate for our “Difference-in-Differences: A Practitioner’s Guide” by Baker, et al, 2026 in the newest issue of Journal of Economic Literature. That’s the very first step, and frankly, if we are going to rely moreso on Claude Code and codex for the coding, I actually think to be perfectly honest clear articulation of the population estimand could not be more important.
Which leads to my second thought. I had this problem with CS. I’m on a different machine than normal, I had installed “did” from CRAN. I was getting odd results on the 2x2s. They were spitting out NAs, and I actually only caught it because I decided to do something differently. And so to do it, I had just gone systematically through a checklist I teach often and which is in the new book in the “complex diff-in-diff” chapter.
Well, I knew those NAs didn’t make sense. See, CS is “four averages and three subtractions” with weights on the means. So the only thing that should be needed is outcome values in each cell, and baseline covariates in each cell. I was using “regression adjustment” because the number of treated units per cohort was too small to avoid perfect separation on the propensity score, but I wanted to use the covariates to satisfy a parallel trends assumption I’d already committed to up front.
Well, I have taught Brant and Pedro’s paper about ten billion times since 2021. I would happily say I’m particularly well versed in it for being a non-econometrician. And probably know it better than the modal econometrician even. I know all the little curves in the road. I know exactly where the potholes are. I know, for instance, that there should not be a coefficient at g-1 because I have strong opinions about it. Unless you have stronger opinions than me about it to do the short difference, the most likely scenario is that one simply doesn’t know. See it’s human capital. It’s not even skill. It’s human capital from repeated time use. See human capital is mathematical — you do this enough times, you will get human capital. It’s why some of us get weird human capital in things. You’ll somehow become an expert in how to beat a particular boss on some impossible video game because you had to do it 10,000 times unsuccessfully before you more or less memorized exactly the route to take. That’s basically how CSDID is for me. I’m not saying I’m on the leader board; I’m not saying I’m in first place. I’m saying I have human capital in it.
And that meant I knew those NAs made no sense.
So I spent four hours on this new machine with Claude Code going over and over and over debugging it. He must have made a mistake. The panel dataset must be a problem. The way he’s generating time numbering, since it was not a “real time measure”. For instance, it was in weeks we had artificially made for which he made an artificial week counter. Maybe that was it. Nope that was not it. Maybe cells were empty. Nope. Maybe covariates were missing for the never-treated units only, somehow. Nope.
Until this morning I remembered — you know, I have a vague memory that Brant had a different install on his GitHub.
You can either install did from CRAN or “the latest version” from GitHub. And as this was a different machine, I wondered, “I bet you it’s not the same as the one on my machine. I bet you it installed from CRAN”. It did. I had two Claudes open, talking to each other, each trying different things and nothing worked. And not only that, they were running down conjectures perfectly happy with the conclusions and those conclusions were wrong. Each time, they were wrong. The whole time it was just the wrong version. And once we fixed it, it was fine.
Here’s what happened though. I had recently, probably because I needed to develop content for these workshops on “using Claude Code for research” particularly tied to diff-in-diff that I had begun to create a dashboard. I called it “gtd” after “getting things done” by David Allen. You can find it here. It’s like a harness, and it’s a work in progress. It isn’t done. But it’s the foundation.
Well, the dashboard is great as a concept. I haven’t pushed the new changes yet, but I have made the figures and tables be "cards” you can click and they get big and they flip around to where you can find the code that created them, and it literally shows the code. I’ll push it soon so you can see it, but basically I have been trying to build more ‘psychological verification’ into the work flow. I call it ‘psychological verification’ to separate it from some formalized verification because historically when I would code, even if there were coding errors in the code, I always knew what made what. I knew where things originated. It was the “epistemological feeling” or something like that. It was a kind of warranted belief because it was associated with feeling and memory. It was associated with confidence.
Well, with Claude Code I don’t get any of that feeling.
I think a lot of manager types, though, don’t get that. They use RAs to do a lot of this work, and so they also don’t those same “epistemological feelings” or “confidence” or what have you. That kind of “verification” that comes from production. And see since agents cleanly demarcate and separate production from verification, the psychology of verification is lost since claude does it for us.
I am a big believer in what I’m about to say and it’s this. Claude Code and other agents solve many problems and create new ones, and our job is just to solve the new ones. That’s it. It’s not some dystopian horror show that I have lost the epistemological feelings from coding myself. It just means I personally have a new problem insofar as I need epistemological feelings for accuracy. And I do. Which is why I’m leaning more heavily into, not /skills, but structured analog workflows.
Structured analog workflows are checklists that you go through. Atul Gawande has a great book called “The Checklist Manifesto”. I want to say it was a bestseller when it came out but I’m not sure. Anyhow, the thing about it is he claims that checklists have been instrumental in reducing mortality in surgeries and crashes in aviation — simply by having the operator go systematically through a checklist and not going forward until the earlier step is completed. People who have attended my workshops know that I usually conclude the discussion of staggered adoption by having a 9-step checklist which I call “Pedro’s Checklist” because Pedro Sant’Anna had it in a deck that he presented at Amazon once. It’s stuff like this:
Name your causal estimand using potential outcomes, populations, and weighted averages. Don’t just “ATT”. Say the weights and the population and express it as binary comparisons.
Make a table of treated units by cohort with sample shares. Make it beautiful. Include the counts on the never treated too. We should always be able to answer a simple question like “how many treated units are in the 2005 cohort?” It matters because the fitting of the propensity score off covariates is basically needing 7-10 “events” per covariate. So if you only have 4 treated counties in a cohort, and 5 covariates, you cannot fit that propensity score. You’ll need to make some hard decisions. Preferably before you begin because many of the CSDID packages will not produce the logit coefficients for you to investigate — it’s all under the hood. And some of them actually just drop the covariates so that the maximum likelihood converges and won’t even tell you! LOL.
Plot the rollout using Yiqing Xu’s “
panelview”. Why panelview? Why not?Have a methodology for picking covariates for satisfying conditional parallel trends (read our section 4.2 in the JEL about conditional parallel trends closely). If the covariates cause Y(0) trends, or the parameters on the covariates effect on Y(0) change over time and you are imbalanced, then it will mechanically break parallel trends. So you need them. I have a methodology; I’ll explain in another substack what it is. But for now, you need one. And then when you’ve picked them, you need to make a simple table of means by treated and control and calculate the normalized difference in means (which is (X1-X0)/sqrt(1/2[var1 + var0]), and if that is greater than in absolute value 0.25 you gotta include it, and if it isn’t you’re good to ignore because they’re balanced “enough”. Which again matters for the propensity score because you really pay for those covariates in that propensity score due to that “7-10 events per covariate” thing.
Plot the evolution of the outcome means by cohort including the never treated. Make it beautiful. Why? Well for one that’s going to be where you spot some mistakes in the data. If never-treated has big missing values or isn’t there at all — guess what? Something is messed up. But you also can tell if you have selection on levels (which is fine) or selection on Y(0) trends (which may not be fine). And who doesn’t love pretty pictures, anyway?
And then do CS, honest intervals, falsifications with event studies (using
long2oruniversal baselineplease for the love of God).
I skipped some steps; you get my point. Well, here’s the deal. I feel so confident with csdid that I can spot problems just by looking at the output of all that. And I can rule stuff out too. If I see means in that outcome evolution, I know I can get “four averages and three subtractions”. If I have extreme values on X, I probably need to z-score it. CSDID’s R package ‘did’ actually does z-score, but as I showed in an earlier post, not all do. Of the six I reviewed, not all do.
Where’s my point? It’s this:
First and most importantly, the returns to econometrics knowledge with Claude Code is actually higher not lower. Because automation does not mean verification. And you have to verify these results. You have to insist on zero error. Insist on zero error. Claude will not get into trouble when there are errors — we will get into trouble. We have to insist on zero error. And each person’s style of research is different. Everybody’s got a different workflow, a different production function, a different set of strengths and weaknesses. So you have to figure out a verification methodology that suits you and you alone. That means what works for someone may not work for you. Go back and read Ricardo on comparative advantage and believe it. There’s like 8 inputs in the production functions of scientific output. We each have them and they range from zero to some large number and if it’s zero, we need a method or a coauthor who does not have a zero and we need to work well together. But even non-zero values does not mean that you need Claude Code in the same way as someone else, and it definitely does not mean that their style of verification is what will work for you.
The goal is not to have the right skills that we just pass around. The goal is zero error. That’s the goal, and if you make that the constraint and everything else and endogenous choice variable, you’ll get there. You just cannot get the cart before the horse. The constraint is zero error. Zero error is non-negotiable. Say it, believe it, practice it. And then you will (and so will I) develop the workflows that help you achieve it.
I will stop there, but I just wanted to say this because it has been building and building inside me. It has been building because I have felt like I was wrestling a greased pig with Claude Code just over csdid. And I only kept winning that battle because I had substantial human capital in the econometrics. I cannot imagine if I was trying to use Claude Code for discrete choice for BLP. I would be losing! I wouldn’t have the first foggiest idea of what a true or false looking answer is.
So I think it’s insane when people say we can automate applied econometrics. You know what I think when I hear that? I think I am hearing someone who actually does not do actual empirical research or does not realize that there are tons of weird mistakes that Claude Code makes that are so dissimilar to the kinds of mistakes we make. They aren’t coding mistakes. They are reasoning mistakes and they are very hard to tease out when you are not in the driver seat.




profound. the relevant scope is of course much much wider than econometrics.
Nice. Checklist manifesto is my general motivational inspiration as well (I've been recommending it to MBAs in my AI, Business, and Society class for a while as a key reading on how to work with AI). A parallel (to economics) domain implementation of these principles in software engineering is spec driven development (which is basically about how you build the harnesses you wrote about recently) -- also a key inspiration for me.
The missing thing in those is how to pick objects whose states you track with checklists. Domain specific objects don't translate between domains. They aren't always immediately obvious to people in the domain either, because a lot of this is "things that are left unsaid because everyone knows them" and it just feels not right to make them explicit. But the underlying principles for choosing the objects are same between domains. This is where I get inspirations from Category Theory (category theory for sciences and category theory for programmers are great intros, and the latter also helps see the connection between functional programming, checklist writing, and how to harness AI in econ).