The following are just some thoughts I had about Claude Code after spending a day working on an old project, one I had done a ton of work on shortly after discovering Claude Code in mid-November. I had been so amazed by what I learned then that I immediately turned to this other project and got a ton of work done. I then had to write up a draft and a deck to present it. The draft was insanely long, never-ending tables and figures, and I never finished it because I had to move into end-of-semester exams. But this week is spring break at Harvard, and I have been slowly knocking stuff out. So I wrote this last night before bed and am posting it this morning.
Thanks again everyone for all your support. I appreciate everyone’s enthusiastic response to my talking about Claude Code and causal inference on here — both now and over the last few years. I’ve really enjoyed the motivation to keep studying harder, keep learning, and try to get better at communicating what I know to other people. And this Substack is partly where I do it. So thank you again. Consider becoming a paying subscriber! I set the price to the lowest Substack allows ($5/month) and hope that that’s affordable. It’s a labor of love!
This may sound like I’m giving AI the side-eye, but I’m not. I remain forever grateful for what I guess is software. And yet anytime the following happens, and it happens regularly, I am inclined to take note of it and try to articulate it. Everything I say seems true enough that it would apply to anyone and everyone, but if not, I do think it applies to me.
It always starts with The Matrix, a timeless classic. There’s a scene where Neo lies down in a chair with a cable jacked into the back of his skull. He writhes, and after about ten seconds he opens his eyes and says, “I know kung fu.” Later he fights Morpheus and shows him. It is as stunning a series of scenes now as it was in 1999, when I saw it in the theater with my friends.
When ChatGPT-4 came out in the spring of 2023, 24 years after the movie, it felt like a promise I’d been given as a young person was finally being fulfilled. Meaning, ChatGPT-4 felt like it would make me Neo. Not so much the promised messiah who would lead a resistance against the machines as just someone who could learn anything he wanted without any effort. An assurance that I would never have to work hard to learn something. I was just going to lie down and get plugged in, and all the things I wanted to know would come to me without any effort.
Nothing short of being given the power of flight could be better suited to my personality. I was a lifelong lover of learning and of superpowers, and the thought that I could fill this brain not with facts but with actual skills was deeply attractive. It had always taken me twice as long as my classmates to learn economics and econometrics, but I had also always been the one among them who wanted to paint the ceilings of cathedrals with economics and econometrics. So that gap between desire and skill always had to be filled with sweat and hard work. But that time came with a hefty price tag: gaining the skills meant delaying my creative work until tomorrow, because today would be spent learning. And given how much I needed to know, it sometimes felt like tomorrow would never come.
So I remember having this feeling with ChatGPT-4 that I could just know the things from then on, and I could know them now, today, without any hard work. Want to know how to set up a Docker container? Boom. Want to understand the basics of optimal transport theory? Done. The entire corpus of human knowledge, all the skills of being an economist, uploaded into my brain, no sweat required.
What I think now is that one truth remains as true as it ever was: there is not now, any more than there was then, such a thing as a free lunch. There is no free lunch. Gaining skills and knowledge always requires time. It always comes at a cost.
Here is the thing about learning: you can’t do it without breaking a sweat. Whatever it is that I am to say AI does for me in my quest for personal growth as an economist, I don’t think the correct metaphor is of me lying back, reclining in a chair with a rod stuck in the base of my skull, having karate downloaded directly into my cerebral cortex. That is not the metaphor, because that metaphor shows a passive person, engaging with AI while practically asleep.
I’m like 99% sure it’s closer to a physical law to say that just as you can’t build muscle without resistance, you cannot gain knowledge without resistance. You can’t build understanding without struggle. You cannot grow without a fight. And usually for the best things, it will be a bloody fight.
An AI agent can remove the struggle, and it can absolutely get cognitive tasks completed for you. There is no doubt about that. You can accomplish cognitive goals, complete cognitive tasks, and do so well, and not break a sweat. But that’s not the same as you learning. You can complete cognitive tasks and simultaneously not learn. And when that happens, it’s one of two things. Either you have become very good at pushing buttons, in which case, truth be told, you may be overeducated for that job. Or you become the very blind leading the very blind, without realizing it.
Often when someone says these things, they say them from a place of outright rejection of AI, but I don’t think that’s the case for me. I am still optimistic, both about AI’s utility for me and for society. But I also feel, just like I did the first day, that AI is like the siren, and if I can’t figure out how to close my ears to all its temptations and just continue on the same long march I’ve always been on, then I am going to end up crashed against the rocks.
I believe that AI works profoundly well when used in areas where you already have substantial expertise, and it works in an incredibly jagged and uncertain manner when used in areas where you have no actual comprehension. Which means that my own investment in my own skills remains crucial to getting the most out of it.
I have a paper that uses Callaway and Sant’Anna’s difference-in-differences estimator, which by now I know pretty well. But I was applying it to something unusual. I had individual-level worker data where, to use CS, I had to re-envision what “time” means while sticking to the staggered adoption framework. I’m not going to get into the details here; just know it was a strange enough application that the code couldn’t be lifted off the shelf. It had to be built carefully, but since I knew what I wanted, I knew I could do it with AI’s help.
The problem was, I hadn’t touched this project since 2025. It was one of those things on my plate that I kept meaning to get back to, and as coauthors kept asking for it, and this week was spring break, I finally sat down to clear it off. I opened the directory and immediately felt that sinking feeling. The code seemed way longer and more chaotic than I remembered. For instance, it was a medley of R and Stata files. Graphics that didn’t look right. Which meant I hadn’t done my due diligence to get all the kinks out. And these days I don’t tolerate even the slightest irregularity in graphics, since for the first time I have someone, or something, that will fix it for me.
But back to the project folder. It was a sprawling folder structure that had clearly been used and reused for ten different purposes. I could tell that past-me had gotten a lot done using Claude Code, but I could also tell it was right at the very start of my using it, back when I was still figuring out how to work with it. The code had that feeling of ambitious ideas with questionable execution, and not enough organization, which in my life had always been the recipe for disaster.
So I started using Claude Code to sort through it all. I told it: verify that every table and figure in the manuscript comes from replicable code, then replicate that code in R. That’s it. Don’t rewrite the paper. Don’t reorganize the directory. Just confirm the pipeline.
The first thing Claude did was run a code audit. Since a long time had passed and I clearly had never done a code audit, I was nervous. I became especially nervous when Claude was immediately convinced that my adaptation of Stata’s csdid command had not done what it should have, since he could not replicate it either with the R command or manually in R.
He claimed to have found a situation where one group of workers was coded as “never treated” when they were, in fact, eventually treated. That didn’t immediately seem plausible to me: of all the possible errors I could make, that one seemed unlikely, given that the whole point of CS is to avoid exactly that. But Claude was absolutely certain this was the source of the contamination and that, as a result, the entire code would have to be scrapped and started over.
And in one sense he was right. If I had miscoded this weird version of CS by having an already-treated group as a control, then I would have been defeating the entire purpose of using CS in the first place, since CS is designed precisely to avoid that.
So it was a reasonable concern. The kind of thing that would sound completely right in a code review. And I definitely felt sick inside at the thought I had made such a basic fundamental mistake.
But something felt weird about it. Maybe it was just that he talked so fast, but I wanted to sit and reason together a bit longer. So I kept pushing back. I told Claude he was confusing certainty with conjecture and that he needed to chill for a second. Under no conditions was he to move on. He had to verify his conjectures for me in at least three different ways, and since we had csdid, which I knew worked, we always had a ground truth to check against.
Because I know this stuff pretty much like the back of my hand, I felt comfortable asking Claude to go through a series of steps, as opposed to him making up his own steps and walking me through them. And with diff-in-diff, since I know the calculations well, I usually want things done with borderline pencil and paper. Old school econometrics.
And he can do that. He can do old school econometrics. He can take four averages and subtract them so long as I take him through it. So long as I can grade his work. So long as I know how to recognize the problems in his work.
A lot of econometrics can be done with pencil and paper if you can really distill it to the most basic version of itself. You often have to strip away a lot of extraneous stuff to get there, but many times it’s possible. So I often do that. I’ll make a dataset with four or five observations and try to manually do whatever it is the estimator is doing, because I figure if I can’t do it by hand, I’ll almost certainly learn something, and that usually solves whatever problem I was having. So that’s what I did here. I kept having him simplify, calculate, and check.
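Here’s a minimal sketch of the kind of toy exercise I mean, in base R with made-up numbers (the tiny panel and its values are illustrative, not from the paper):

```r
# A toy panel: three workers, two periods, made-up numbers.
df <- data.frame(
  id    = c(1, 1, 2, 2, 3, 3),
  treat = c(1, 1, 1, 1, 0, 0),   # 1 = ever treated
  post  = c(0, 1, 0, 1, 0, 1),   # 1 = post period
  y     = c(10, 18, 12, 19, 8, 11)
)

# Take four averages and subtract them: the 2x2 diff-in-diff by hand.
m   <- tapply(df$y, list(df$treat, df$post), mean)
att <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])
att  # 4.5 with these numbers
```

If an estimator’s output on a dataset this small doesn’t match what you can compute by hand, you’ve found exactly where to dig.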
At first that involved stripping away the irrelevant things, such as covariates. If, without our weird adaptation of CS, he couldn’t get the same series of ATT(g,t)’s that you get from csdid without covariates, then that’s it — the problem wasn’t me, it was probably him.
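The R side of that ground-truth check is short. This is a sketch: the column names (y, period, worker_id, first_treated) and the data frame panel_df are stand-ins for whatever your panel actually uses.

```r
library(did)  # Callaway and Sant'Anna's R package

# With xformla = ~ 1 (no covariates), any hand-rolled CS implementation
# should reproduce these ATT(g,t)'s, and they should match Stata's csdid
# on the same data. If they don't, stop and find out why.
gt <- att_gt(
  yname = "y", tname = "period", idname = "worker_id",
  gname = "first_treated",      # period of first treatment; 0 = never treated
  xformla = ~ 1,                # strip the covariates away first
  control_group = "nevertreated",
  data = panel_df
)
summary(gt)
```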
Long story short, by forcing him to get down to the basics, which I knew well, and to keep drilling down to the most basic version of what we were working on, he eventually found his own mistake. His mistake was that the entire time, his “manual” Callaway and Sant’Anna implementation had never been computing a difference-in-differences in the first place. He’d been going through all this back and forth with me while calculating only the between difference — treated mean minus control mean — as opposed to the between difference in the first differences. He had been doing a cross-sectional comparison and calling it CS. He’d been doing it in the context of this staggered environment, so I guess he was distracted, but that’s not really an excuse for a mistake like that. I mean, that was a pure zero on the exam. That was downright embarrassing. He knows CS, too, is the thing! The method is literally called “difference-in-differences”! There’s a difference that you difference! But for some reason, on this day, he did not know it.
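To see how big a miss that is, here’s the contrast in two lines, with made-up cell means:

```r
# Toy cell means (made up).
y_treat_pre <- 11;  y_treat_post <- 18.5
y_ctrl_pre  <- 8;   y_ctrl_post  <- 11

wrong <- y_treat_post - y_ctrl_post              # 7.5: a between difference only
right <- (y_treat_post - y_treat_pre) -
         (y_ctrl_post  - y_ctrl_pre)             # 4.5: the difference that you difference
```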
There were other signs I should have caught earlier. At one point Claude was convinced the estimated effects were invalid because the code wasn’t using the “universal baseline” option. But the universal baseline only matters for pre-treatment coefficients — every post-treatment ATT in Callaway and Sant’Anna uses the same long-difference calculation from the fixed g−1 baseline, the period just before group g is treated. I know this because I teach it constantly.
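In CS notation, for the unconditional, never-treated-comparison version, every post-treatment ATT(g,t) is the same long difference (roughly; this is the estimand, not their full estimator):

```latex
\mathrm{ATT}(g,t) = \mathbb{E}\left[ Y_t - Y_{g-1} \mid G_g = 1 \right] - \mathbb{E}\left[ Y_t - Y_{g-1} \mid C = 1 \right], \qquad t \ge g
```

The baseline period g−1 is pinned down by the group, not by the baseline option, which is why the option can’t touch the post-treatment estimates.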
He was also convinced the problem had to do with a C++ plugin that R was using for the calculations, a story that sounded smart and fancy enough that I would have believed it were it not in the one area where I felt I had substantial skill. That story doesn’t explain struggling to take a mean for a group. It sounded to me more like he was making a fundamental mistake: that maybe he was getting the complex aggregations right but something more basic wrong. Which he was.
And the weird thing is, Claude also knows this. He knows what diff-in-diff is. At a deep level, he knows it. But it’s also the case that he only sometimes knows it. The problem is that, regardless of whether he actually knows it, Claude says it with exactly the same confidence either way.
I’ve seen this pattern before — both in myself and in others. A person who had attended one of my workshops once called me on Zoom, excited to share something he’d learned from a reasoning model. He said doubly robust estimation lets you use different covariates in the outcome regression than in the propensity score model. I had apparently told some people that you should use the same covariates in both, and he wanted to push back on me.
I guess it wasn’t wrong, per se. Double robustness just requires one of the models, not both, to be correct. But it still struck me as strange, because the role of covariates in diff-in-diff is to impute counterfactuals through the conditional parallel trends assumption. If you need the covariates for that, why are you moving them into and out of the models differently? Presumably you need them to satisfy conditional parallel trends, which is what both the outcome regression model and the propensity score model rely on to be right in the first place.
I told him I wasn’t sure about doubly robust practice in general, but I had probably been talking about Sant’Anna and Zhao (2020) specifically, where the doubly robust estimator has a particular structure. And while you technically can use different covariate sets (it’s a free country — you can technically do whatever you want, especially when things are done in two stages), it’s not clear why you would if your goal is satisfying the conditional parallel trends assumption, which needs all of those covariates in the first place.
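For concreteness, here’s a minimal sketch of that structure for a two-period panel: the basic doubly robust form in the spirit of Sant’Anna and Zhao (2020), not their locally efficient version, and with hypothetical names (dy is the first-differenced outcome, d is treatment, x1 and x2 are covariates). Note that the same covariates appear in both nuisance models.

```r
# Doubly robust DiD ATT, a sketch (not production code).
# df has columns: dy = Y_post - Y_pre, d = 0/1 treatment, x1, x2.
dr_did_att <- function(df) {
  ps <- glm(d ~ x1 + x2, family = binomial, data = df)   # propensity score model
  or <- lm(dy ~ x1 + x2, data = subset(df, d == 0))      # outcome regression, controls only

  phat <- predict(ps, newdata = df, type = "response")
  mu0  <- predict(or, newdata = df)                      # imputed untreated change

  w1 <- df$d / mean(df$d)                                # weights on treated units
  w0 <- phat * (1 - df$d) / (1 - phat)
  w0 <- w0 / mean(w0)                                    # reweighted control units

  mean((w1 - w0) * (df$dy - mu0))                        # the ATT
}
```

If either nuisance model is right, the ATT is consistent (that’s the double robustness), but both models are trying to make the same conditional parallel trends assumption hold, which is why they want the same covariates.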
So then I looked at his code and saw what had actually happened: the reasoning model had told him to just include the propensity score variables as covariates inside a two-way fixed effects regression. They weren’t being used as weights applied to the means anywhere in his code, first of all. And he wasn’t fitting an outcome regression model anywhere, regressing the first-differenced outcome on baseline covariates for the control group. He was just “controlling for” covariates, sometimes twice and sometimes once — inside a propensity score and/or alone, and then inside a regression additively. There were many things wrong with the specification, but you could only know that if you already knew what you were talking about.
The LLM had probably given him that code, and a confident explanation behind it, which he’d then used. Shortly after, he wrote me back and said I was right.
The point I’m making is simple, and I’m not the first to say it. When you know your domain, the AI agent is like a rocket strapped to your back. You fly fast and in a straighter line at the targets. You might as well be teleporting there too. The things I can do now in a few hours would have taken me days or weeks before. Claude handles the tedious parts — the LaTeX formatting, the file management, the boilerplate code — while I focus on whether the research design is right. It’s genuinely transformative.
I think the thinnest ice comes when you don’t know the domain very well and you’re using AI to teach it to you during the actual coding of the project itself. I think that often works very well, but there are instances in creative, advanced work where, if you are literally trying to do this with almost no actual background in the subject matter, it can go off the rails fast and you’ll never know. Not necessarily doomed — but in real trouble. Because the AI will do things quickly and confidently, and you won’t have the vocabulary to interrogate it. You won’t really see the very specific problems. With CS, it’s usually these little details that I’ve just learned to notice — I know when two estimators’ outputs should look nearly identical, and when they shouldn’t. So the moment they don’t, even if there’s a snowdrift of information coming at me, that one fact is enough: I can filter out the rest and get on it.
The problem, I think, is that you’ll get output that looks professional. And maybe even worse, Claude will hammer at code until that code runs. When I’m wrong, my code usually breaks, and in getting it to run I actually succeed, because I learn. But here, the completion of tasks doesn’t really depend on me, and you can get code that runs while the calculations it’s doing are completely wrong, and neither you nor it knows it that day.
So all of that is to say I think we are not yet at AGI. We are at something else, and I love where it is, and it’s completely transformed my life both personally and professionally. I am absolutely insecure about the future, like most everyone else, but I am also excited and glad to be part of it. But I still think, all said and done, that where I have seen really cool things is in areas where I have already established real expertise. And so I still worry all the time — will I one day be without the ability to spot those types of problems because I rely on him to do it? Just like physical capital depreciates, so does human capital — and maybe even faster.
This is not a blast against AI, though. That genie is out of the bottle. We will never go back to the way it was. Our work will be infinitely better going forward. The number of papers that fail to replicate is likely to collapse to a small dot, given the sheer volume of eyes that will be on them. The wisdom of AI agent crowds is coming. But I still think we have to be vigilant about protecting and maintaining our human capital — not because of some allegiance to humanity. I just don’t think these technologies work best when you are literally the most uninformed version of yourself you can be.



How would you recommend Claude Code/AI be used by graduate students who are just acquiring expertise in the field? I feel like if I am not using AI, I am falling behind. But I'm also worried about the 'blind leading the blind' trap.
"A lot of econometrics can be done with pencil and paper if you really can distill it to the most basic version of itself."
I had an NLP professor who introduced almost every new method with something like, "Today we're going to learn a new way to add things up and then divide them by some other things. That's all it is."
An incredibly useful pedagogical trick to make some pretty intense adding and dividing seem manageable.