Claude Code 24: Multiple Agents Auditing Your Diff-in-Diff Code (Part 1)
Stata's csdid vs. Stata's csdid2 vs. R's did vs. Python's diff-diff vs. Python's differences: if the errors are independent, the audit works
This is part of a longer series I’m doing on Claude Code for quantitative social sciences. I’m going to attempt (fingers crossed) to write a shorter post today. Before I do, I wanted to thank everyone for their support of the substack. It’s a labor of love. The substack gives me an opportunity to write and express myself creatively while also sharing what I’ve learned about this or that, be it causal inference, AI or some random kick I’m on.
I filmed a video walkthrough of me doing this exercise. You’ll see that video here. Note that during the process of the code audit, I realized that there were two additional packages that I wanted to evaluate. As such, we will illustrate the code audit idea using five diff-in-diff packages: two Stata packages, two Python packages, and one R package. We will be focusing mainly on the Callaway and Sant’Anna estimator, but the approach is really agnostic to the estimator, since we will also be auditing the preprocessing stages. The opening of this substack explains the idea behind it — independent errors — and the rest explains the implementation. The video walks you through exactly what I did. I hope you find this helpful. This’ll be the first of many diff-in-diff exercises, so buckle up!
If you are a normal reader, maybe consider becoming a paying subscriber. I’ve set the price at the lowest price point ($5/mo) that substack allows. Enjoy!
Today’s substack will be the first of many in which I illustrate using Claude Code in a project where the tasks include a pipeline of processing data and estimating average treatment effects using the Callaway and Sant’Anna method. But this one’s quite narrow in focus, which I think will make it useful to everyone, regardless of whether they are using diff-in-diff. Today’s substack is about code audits using multiple agents to replicate code in multiple languages. Here is the idea:
I think we should take advantage of Claude Code’s agents to “audit our code” and do so aggressively. Almost like it’s a health inspector whose goal is to shut us down.
I think we can use Claude Code’s ability to speak in multiple languages to do this.
To make this concrete, I will illustrate it with some simple examples, including a video walkthrough of me using it for some simple tasks.
Hallucination as Measurement Error
In the social sciences, we should embrace using Claude Code in our workflow to eliminate every error it can help us catch. Errors come in multiple types, with causes that are often utterly unrelated to one another. Some of them are reasoning errors, and perhaps Claude Code can catch those (I have found it catches a decent number), but the ones I want to talk about are coding errors.
As we shift towards AI agents writing more and more, if not all, of our code, we should consider the possibility that AI agents based on large language models like the generative pre-trained transformer (GPT) will always have problems hallucinating. But what if hallucination can be conceived of as measurement error? That is, what if hallucination in the context of writing code is random, the LLM simply, probabilistically, writing down the wrong code?
I’m not saying to you that I know this is the case so much as I am saying that it could be a convenient fiction for us as quantitative social scientists to talk that way. For one, it’s a way of talking we are far more familiar with than we are with the probabilistic nature of the LLMs in the first place. I doubt many of us have read the original “Attention is All You Need” by Vaswani, et al. (which now has 232,500 cites since its first appearance in 2017). But I think all of us have at some point in our lives read in an econometrics textbook the idea that there exists a variable that has been recorded incorrectly and as such is classically mismeasured. Classically mismeasured in the sense that the variable’s recorded values equal the true values plus some random noise, usually centered at zero and usually standardized to have some fixed variance, like a normal distribution. In such cases, regressions using it will have coefficients on that variable which are attenuated towards zero.
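That attenuation result is easy to see in a quick simulation. Everything below (a true slope of 2, unit-variance noise) is made up purely for illustration:

```python
# Classical measurement error attenuates an OLS slope toward zero.
# All numbers here are illustrative choices, not from any real data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0, 1, n)            # true regressor, variance 1
y = 2.0 * x + rng.normal(0, 1, n)  # true slope is 2

# Observe x with classical measurement error (also variance 1)
x_obs = x + rng.normal(0, 1, n)

# OLS slope of y on the mismeasured regressor
slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Attenuation factor is var(x) / (var(x) + var(noise)) = 1/2,
# so the estimated slope should land near 2 * 1/2 = 1
print(round(slope, 2))
```

The estimated slope comes out near 1 rather than 2, which is exactly the attenuation the textbooks describe.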
I would like for you to be open to that language, but applied to the code Claude Code generates for our analysis. The error could be hidden somewhere in the pipeline. It could be somewhere in the regression commands. It could be somewhere in the lifting of the regression output into automated tables and figures. Maybe it’s something seemingly small, like allowing the sample composition to change as fixed effects are added in, not realizing that not all of the sample had those fixed effects, causing 50% of the sample to drop. Or maybe it’s a merge syntax error. It could even be that classic Stata error:
replace olddog = 10 if olddog>10
which those of us who are old dogs know does more than top code olddog at 10 when olddog is greater than 10. Because Stata treats missing values as larger than any number, it also replaces olddog with 10 for all missing values.
That is an old error, well known to Stata users, many of whom had to learn it either the hard way or on the Stata listserv from the prolific, extraordinarily helpful Stata legend Nick Cox. But note that the command, and its pitfall, are unique to Stata syntax.
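As an aside that underscores how language-specific this trap is, here is a hedged sketch of the pandas analogue. The three-row data frame is invented for illustration; note that the very same naive line is harmless in pandas, because NaN comparisons evaluate to False rather than “larger than everything”:

```python
# The pandas analogue of `replace olddog = 10 if olddog>10`.
# The data are made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({"olddog": [5.0, 15.0, np.nan]})

# Top code at 10, with no explicit guard for missing values
df.loc[df["olddog"] > 10, "olddog"] = 10

# Unlike Stata, NaN > 10 is False, so the missing row is untouched
print(df["olddog"].tolist())  # [5.0, 10.0, nan]
```

So the very same omission that corrupts your Stata data does nothing in Python, which is precisely why we might expect such errors to be weakly correlated across languages.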
So, let’s do this. Let’s just assume that 9 times out of 10 Claude Code does not make that mistake. Claude Code knows, because it has been trained on every conceivable writing about Stata, including the manuals, and including Nick’s own words, that one of the correct ways to do it is this:
replace olddog = 10 if olddog>10 & olddog~=.
But on this day, Claude Code randomly left that last part out. And because it did, your olddog variable has been top coded at 10 both for those rows that had values greater than 10 (e.g., olddog = 15) and for those rows where olddog was missing (i.e., olddog = .). Why did it make the mistake on your code today? It made that mistake randomly. But you only generated the pipeline once, and you only ran it once. You pulled a bad draw unknowingly, and since the code ran without throwing an error, the mistake cascaded down through your pipeline into your analysis as systematic measurement error, leaving your results based on mismeasured variables. How severe that is depends on how many missing values there are in the data.
Hallucination Errors are Independent Across Languages
So here is what I propose. I propose that you assume a second thing. I propose that you consider that Claude Code will randomly hallucinate its code. And that since you are offloading a lot of the cognitive work to it, and that your skills are depreciating as a result of that, then you must find a way to insert verification steps wherever possible using Claude Code in a targeted manner. And I propose that you consider this:
R will hallucinate with some probabilistic error, ε_R
Python will hallucinate with some probabilistic error, ε_P
Stata will hallucinate with some probabilistic error, ε_S
If the errors are independent, the probability all three hallucinate the same wrong result is ε_R × ε_P × ε_S, a very small number. And I think independence is reasonable: if the errors really are syntax errors, then we shouldn’t expect them to show up at the same time in the same place across languages. If all three errors are pairwise independent, then we can write down these three covariances and set them equal to zero:
Cov(ε_R, ε_P) = 0
Cov(ε_R, ε_S) = 0
Cov(ε_P, ε_S) = 0
Recall that these are zero by the definition of covariance: when two random variables are independent, the mean of their product breaks out into the product of their means, so Cov(X, Y) = E[XY] − E[X]E[Y] = 0.
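If you would rather see the product rule numerically than algebraically, a quick Monte Carlo works. The three error rates below are invented for illustration only:

```python
# Under independence, the share of runs where all three languages
# hallucinate at once is approximately the product of the rates.
# The three rates are made-up, hypothetical numbers.
import numpy as np

rng = np.random.default_rng(1)
eps_r, eps_p, eps_s = 0.10, 0.08, 0.12   # hypothetical error rates
n = 1_000_000

err_r = rng.random(n) < eps_r
err_p = rng.random(n) < eps_p
err_s = rng.random(n) < eps_s

all_three = (err_r & err_p & err_s).mean()
print(all_three, eps_r * eps_p * eps_s)  # both near 0.00096
```

Even with double-digit individual error rates, the three-way coincidence happens in fewer than one run in a thousand.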
This is the principle I want you to keep in mind: if Claude Code or any AI agent makes errors due to language-specific syntax, and those errors are random, then it is reasonable to treat them as independent of one another across languages. That independence is what justifies incorporating not just code audits into your process, but replication of your entire project in other languages.
Requesting Code Audits To Replicate In Multiple Languages
Which leads me to my next point: get Claude Code to audit your code systematically like a health inspector, and also to replicate your code in two other languages. These are two separate tasks. Many people already integrate “code audits” by hyper-antagonistic subagents into their AI agent workflow, but that does not necessarily mean they are getting those subagents to first replicate their code in the other languages already installed on their machine.
What I think you want, therefore, is a workflow in which the pipeline of code is completely replicated, start to finish, in two other languages, such that at each stage your code produces tables and figures with exactly the same values for all variables and the same test statistics, down to several trailing digits.
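As a minimal sketch of what that check could look like, suppose each language’s pipeline writes its estimates to a CSV. The file layout, the column names (“parameter”, “estimate”), and the tolerance below are my assumptions, not anything these packages prescribe:

```python
# Compare estimate tables produced by pipelines in different languages.
# Column names and the tolerance are illustrative assumptions.
import numpy as np
import pandas as pd

def outputs_match(paths, tol=1e-6):
    """Return True if every CSV of estimates agrees with the first."""
    tables = [
        pd.read_csv(p).sort_values("parameter").reset_index(drop=True)
        for p in paths
    ]
    base = tables[0]
    for t in tables[1:]:
        if not np.allclose(base["estimate"], t["estimate"], atol=tol):
            return False
    return True
```

An auditing agent could run a check like this at the end of each stage of the pipeline, failing loudly whenever any language’s output drifts from the others.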
This only works with code that is non-random, though. It won’t work with bootstrapping, for instance, since the resampling depends on random seeds and random number generators that are typically unique to each language, so you may not be able to use this approach to check bootstrapped standard errors. Other examples where this type of code audit won’t work are:
Simulation-based estimators — simulated MLE, method of simulated moments (these draw random simulations as part of the likelihood approximation)
Bayesian MCMC — Gibbs sampling, Hamiltonian Monte Carlo (Stan, brms)
EM algorithms with random starting points — mixture models sometimes randomize initial cluster assignments
Machine learning — SGD, random forests, neural net initialization
But it will work for many other things, including basic processing tasks (e.g., cleaning variables, merges) as well as many very common statistical modeling methods (e.g., OLS, difference-in-differences, instrumental variables, F tests, analytical standard errors, R-squared).
So, what you want to do is have Claude Code not only audit the code using its own reasoning. You also want Claude Code to replicate the code, from start to finish (i.e., including the pre-analysis processing stages in your pipeline), in two other languages, and then have an agent check that the tabular output produced by all three is identical.
Difference-in-Differences As Case Study
So, on to the video walkthrough. We now have five language-specific packages implementing both standard difference-in-differences and more complex difference-in-differences with differential timing and the inclusion of covariates. Therefore, it is possible to do the kind of code audit I am describing for difference-in-differences. The ones we will use are csdid (Stata), csdid2 (Stata), did (R), differences (Python) and diff-diff (Python). And I will be focusing on auditing the parts of the pipeline and analysis that are deterministic.
Our example will come from this Brazilian study that I also analyze in my forthcoming book, Causal Inference: The Remix, which will be published this summer by Yale University Press. Here’s the paper in question.
What I do in this video walk through is simple. I simply have Claude Code generate the code to estimate event study plots of the effect of the CAPS deinstitutionalization (i.e., closing mental health institutions) in Brazil on homicides, which is one of the several outcomes that the authors, Mateus Dias and Luiz Felipe Fontes, use in their interesting and important study about mental health reform and hospitalization.
To do this, I will use a program called Brazil.do that I wrote. It is a lengthy set of code, but we will have Claude Code take only the portion of it that cleans the data and estimates the effect using csdid and csdid2 in Stata. I am using, in other words, the Callaway and Sant’Anna method, as it is one of the more popular methods for estimating aggregate effects under differential timing. But csdid is itself a user-created package; the original command was the R package did. And there are also two packages in Python. One was written by Isaac Gerber. You can find that one here.
But there is actually a second python package for diff-in-diff called differences. It’s written by Bernardo Dionisi. So we will also replicate the analysis in his diff-in-diff package in python, alongside Isaac’s.
Conclusion
This is going to be the first of many posts using Claude Code to estimate diff-in-diff, but today’s was just about the “code audit” using a very specific version of my referee2 persona. And I want to stop here because this is already a lengthy post. In the subsequent post, I will review the results with you, and we will try to get to the bottom of whether any problems are due to the audit, the packages, both, or neither. But the point today was just to illustrate a particular workflow I’ve been developing to build verification aggressively into analysis using Claude Code, applied to a very narrow yet extremely common and high-value task: estimating treatment effects using diff-in-diff, which at the moment can be done using at least five different packages (two in Stata, one in R, two in Python). So we’ll see in the next post how it went!



