I like the idea of a deck that translates the process log. I’ve experienced the drift as well and it feels like I have to see Claude as a “research assistant” with attention limits, especially when the project gets big. So many decisions and tweaks in the pipeline from idea to data to analysis to paper.
Fascinating. I use Claude mainly for Excel and Python programming, and for Python programming into Excel. But I recognise a lot of the things you mention here.
I usually co-author using dialogue like you do, but there's always that question of whether to (1) start by asking it to do the big overarching task and see if it can sort it out for itself, (2) try to break out separate tasks and do them one at a time, or (3) combine the detail in a single multiply-articulated prompt. Methods 1 and 3 suffer from the same problem: Claude's tendency to miss something out, even when it's been specified or seems to obviously need to be done. Perhaps some of the latest versions have been launched to address this. There's also always that nagging doubt that maybe if I paid for the next level up (I assume there's at least one), the problem would go away.
I tend to use cross-validation if I can to check what Claude has done. And of course, if you spot a problem and ask it to fix it, it usually does it fairly well - not like early versions of ChatGPT which would repeatedly leave it unchanged whilst claiming to have fixed it!
As with all programming, you often get glimpses into just how much scene-setting / scaffolding is done by the human mind at the unconscious level, which you have to reproduce in lengthy, awkward code or formulas to do what appears at first sight to be simple.
I really don’t think the amnesia feeling is related to the ADHD. You have elaborated a feeling many of us working with Claude have. Unless Claude is making us behave like we have ADHD.
I think you’re on the right track here: it’s not Scott, it’s Claude.
I went to Europe (Italy) for the first time a few weeks ago. It was absolutely eye-opening and jaw-dropping. I said "wow" a lot more than usual. But seeing how the sheer weight of some of those towers causes them to lean (didn't see Pisa, but it turns out this was an issue for the Two Towers at the University of Bologna, too) made me think of these new issues with coding with Claude. The sheer volume of code that can be added and heaped, layer on layer, on top of the oodles of lines that were set down moments before. Anyway, it makes me think of the issues these ancient builders ran into with their towers, and I wonder if we aren't at a similar moment in some ways. Pisa was built, apparently, on a shallow, 3-meter foundation, but it was also built on marshy ground, which didn't help either, I guess. This description resembles (to an uncomfortable degree) features of some of my prompt sessions and the resulting code.
I keep wondering how others are overcoming these issues; I suspect with some cooler combination of unit testing, scaffolding, deeper context, more modularity, etc., than I've managed to arrive at so far.
Anyway, we are all looking forward to seeing the fixes you find to these issues, Scott. Thanks for sharing.
Scott—
As I see it: it's not you; it is it.
You write:
> I’ll call it drift.... I will learn that the script was never written. It was ad hoc, created during our dialogue, but it did not write it down.... A script is either being modified unknowingly, or the entire foundation is coming off its base boards. Treatment units may be subtly changing from where they had been months earlier.... I’m not catching this in real time.... Errors are emerging... but outside of the normal location of errors that I’m accustomed to.... Until I can develop a coherent mental model... I don’t think my fixes will be true fixes.... I am skeptical that the solution is in a /skill to be perfectly honest. I think the problem is connected to my adhd, and skills really don’t address adhd...
Again: It takes the conversation you are having with it, identifies one of the very most similar conversations in its (compressed) training data with the similarity metric modified by RLHF, looks at what token the human having that conversation typed next, outputs that, adds that to the conversation, and does it again. And again. And again. That means that it will specify configuration switches and call tools that are what you want it to do only to the extent that the human it is pantomiming knew what they were doing and only to the extent that the (compressed) training data situation really was similar.
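To make the loop described above concrete, here is a toy sketch in Python. It is only a cartoon of that description, not how a real model or Anthropic's harness is implemented: the three-entry `CORPUS`, the `next_token` and `generate` names, and the use of `difflib` string similarity as the "similarity metric" are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for "training data": (context, token that came next).
CORPUS = [
    ("use the", "sonnet"),
    ("use the sonnet", "model"),
    ("use the sonnet model", "<end>"),
]


def next_token(context: str) -> str:
    """Find the most similar stored context and return what followed it."""
    best = max(
        CORPUS,
        key=lambda pair: SequenceMatcher(None, context, pair[0]).ratio(),
    )
    return best[1]


def generate(prompt: str, max_steps: int = 10) -> str:
    """Output a token, append it to the conversation, and do it again."""
    text = prompt
    for _ in range(max_steps):
        token = next_token(text)
        if token == "<end>":
            break
        text = f"{text} {token}"
    return text


print(generate("use the"))
```

The point of the cartoon is that the output is only as good as the nearest stored context: a prompt that resembles nothing in `CORPUS` still gets an answer, just not a trustworthy one.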
The saving grace is that the harness is good enough that it bombs out rapidly almost all the time when its Clever Hans self goes off track--for example, types "sonint" instead of "sonnet"--and because it is Clever Hans at inhuman speed, it picks itself up and tries something different.
The failure modes you are experiencing are natural to such a system, not to the way your brain processes symbols.
That the thing works as well as it does is a miracle, and a mystery. And that Anthropic's harness-system programmers have pulled this amazing rabbit out of the hat is a source of awe and wonder. And yes, almost all of the time, thoroughly anthropomorphizing it is not a thing that leads you badly astray in the moment. But in order to work well, you have to be asking it to do things very similar to things that are very dense and very correct in its training data. Venture out of those neighborhoods, and the things you are seeing are the things that **will** happen.
Be well, Brad DeLong
Fwiw, I find Claude speaks HTML much better than TeX. Inspired by your beautiful deck, and because I think more visually/spatially than linguistically, I have everything visualized into slides or dashboards, and HTML reduces friction.
We probably mean the same thing under the hood, but I don't think conversational style vs planning is an important distinction. That just tells the reader what form your interaction with AI takes. The important stuff is what objects the interaction is focused on, and whether AI has a harness that spells out how to interact with the objects when they are mentioned. You can mention all the right objects in conversational style and miss them all entirely when writing a plan. An example of an important object is a checklist (along with an ontology of objects whose states are updated across the checklist steps). If AI knows that objects are semi-formal (have a clear state-space) and must be tracked explicitly, the style of communication is part of implementation (stuff that doesn't matter, so you can choose entirely based off vibes) rather than structure (the stuff that must be right no matter what implementation you choose). In other words, "plan" and "conversation/dialogue" are isomorphic as long as they share underlying structure.
I actually had Claude dissect your deck and analyze my Substack, the harness I've been working on (and the JSON) and /insights. It was interesting. Claude said you and I are converging on the same ideas, but from different directions, and we do have some disagreements, but I couldn't quite understand what they were. I'm trying to go first with my concepts and language, figure out a translation to yours (I think yours are coming from a different place, more mathematical than mine, and probably drawing on a broader literature), and then audit my harness project to see where your approach can fix my own. When I get something ready, I'm going to send it to you.
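A minimal sketch of what I mean by a checklist whose objects have an explicit state-space, in Python. The names here (`State`, `Checklist`, the example file names) are hypothetical, invented for illustration rather than taken from any actual harness:

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    """The full state-space of a tracked object; nothing else is allowed."""
    TODO = "todo"
    DONE = "done"
    BLOCKED = "blocked"


@dataclass
class Checklist:
    # Maps each tracked object (a script, a dataset, a table) to one state.
    items: dict[str, State] = field(default_factory=dict)

    def add(self, name: str) -> None:
        """Declare an object so that it is tracked from now on."""
        self.items[name] = State.TODO

    def update(self, name: str, state: State) -> None:
        """Change state, refusing objects that were never declared."""
        if name not in self.items:
            raise KeyError(f"untracked object: {name}")
        self.items[name] = state

    def pending(self) -> list[str]:
        """Everything that is not DONE, in the order it was declared."""
        return [n for n, s in self.items.items() if s is not State.DONE]


checklist = Checklist()
checklist.add("clean_data.py")
checklist.add("estimate.py")
checklist.update("clean_data.py", State.DONE)
print(checklist.pending())  # ['estimate.py']
```

Whether the updates arrive through a plan or through a chat, the structure is the same: an object either sits in the checklist with a known state or the update is rejected.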
I appreciate that! One difference in our approaches that I noticed is bottom/up (you) vs top/down (me). You went from your own workflow and had it evolve given AI as catalyst. I took the objects of my scientific interest (but not really my workflows), and went out looking for existing ways of thinking that could help (outside of workflows I used previously), found a bunch of stuff existing in other fields (usually neither AI nor Economics), and started applying that to rebuild the workflow while keeping the objects of my interest.
Here are some of my other big inspirations that weren't mentioned in the slides. Giving them to Claude and comparing them with your stuff might give you further insights:
* https://arxiv.org/pdf/2402.01817 (and other papers and posts by Subbarao Kambhampati -- he's my classic AI inspiration)
* https://abdullin.com/schema-guided-reasoning/ (and other posts by Rinat Abdullin -- software engineering with LLMs under the hood)
* https://github.com/ailev/FPF/tree/main (and other stuff by Anatoly Levenchuk -- systems thinking)
That is literally what Claude said to me too — we are converging but from different angles, which is pretty reassuring. I’ll check these out too.
And I shared this article with you previously, but I'll reshare it, since stuff really started to click after I read it:
* https://openai.com/index/harness-engineering/
Note that even though it's an article by OpenAI folks, they don't seem to be their LLM people, but rather their software engineers who build Codex using AI tools. It's part of a pattern I've noticed: when it comes to learning to use LLMs, it's often better to skip the people making LLMs completely. The important objects for building LLMs seem to be very different from the objects important to using LLMs for economics, while objects useful for software engineering using AI tend to be more translatable to objects that are useful for economics using AI (not to say the translation is direct or easy).
I can't begin to articulate how grateful I am to you for sharing your experience with ADHD, or how validating it is to see some of my own challenges (indeed, insecurities) shared so candidly and without shame. Despite having an above-average IQ, I spent 8 years in undergrad trying to learn within the constraints of the system and my brain (I also have ASD). It was hard for reasons completely unrelated to intellectual rigor. At 32, I decided to pursue my dream of getting a PhD. I'm wrapping up my third year now, and when I look around, I see no one like me in economics, no one struggling with the things I struggle with. It's rich fodder for my insatiable imposter syndrome. So when I see you, *the* Scott Cunningham, whose book we used in my PhD metrics courses (so, he must be a genius, right?), giving voice to my struggles, reflecting back to me some parts of my own experience, and talking about acceptance and building systems endogenous to it, I feel hopeful. I feel like maybe, just maybe, I belong here too, despite my long list of differences.
On a practical note, thanks for speaking to your belief, expectation, or hope that Claude would be "simply the latest manifestation of an evolution in my own process towards greater precision and fewer errors", and for illuminating some of its limitations, specifically for ADHD brains. I didn't have the coding experience many of my (much younger) peers had upon entering a PhD program (I had none, and learning it on the fly in a PhD metrics course is no easy feat for normal folks, much less ADHD folks like us). Coding with ADHD is just so overwhelming because it's so unforgiving. ChatGPT came out in my first year of grad school, so I got about 6 painful (and invaluable) months of non-AI coding experience before I began to rely on it extensively. The opportunity cost of fighting with Stata or R while trying to learn the PhD core, however, was simply too great, and not at all incentive compatible given my time budget constraint. But as I go into my fourth year, my exclusively research years, I want to be more intentional about how I rely on AI. Not just for coding, but now with agentic AI, for all the ways I use it. I don't want to handicap myself with inappropriate reliance, but I also don't want to fail to learn to use these tools which don't appear to be going anywhere. I think I would be doing myself a disservice to not learn to use them, but I want to learn to use them well. So thanks for all the ways you support this endeavor, both in the practical sense with technical content and in the emotional sense by sharing vulnerably about your experience.
Bethany, that is one of the nicest comments I’ve ever gotten. You completely belong, to this community and every community. I got a lot out of the book The Courage to Be Disliked. It wasn’t about ADHD, but it was about belonging. You might like it too. Hang in there. I’m really glad you wrote. :)