Yesterday I presented at MIT to Alberto Abadie's applied topics in econometrics class. It was a big room, and people whose names I recognized were there, which was quite humbling. The day before, on Monday, I presented at the Harvard Kennedy School, and similarly recognized the names of some of the people in the room, which was, again, quite humbling. The experience of them both, combined with the experience of presenting last week at the med school and as keynote speaker at a faculty retreat for the Georgetown McCourt School of Public Policy, pushed me towards working harder on a coherent "workshop" on AI agents — which is not quite there yet, but getting there — as well as pushed me to create two new skills. Before I discuss those two skills, let me first give you some broader context about skill making in general.
I think it was Ethan Mollick who once remarked that it may be more optimal to make your own skills than to use other people's. For one, giving a repository of skills to Claude or Codex and asking them to read it over and consider cloning the skills locally, or creating forked versions that get mapped directly into your own local store of skills, is generally safe. But at the same time, the habit of just lifting something found online and bringing it to your terminal is going to be, I will bet $100 on it, a leaky pipeline in which some share of those imports bring malware onto your computer. Why? Because now we are using the command line interface, which many of us know nothing about, and we are clicking those little "copy" buttons that are popular now in the html of websites and just "pasting" them directly into the CLI if that is what we are told to do. And my suspicion is that you are more likely to do that as a function of constantly selecting other people's skills, even though asking Claude or Codex to work directly with URLs is so far quite safe. It's just part of the general loss in attention that could bleed into less vigilance, which could lead to not fully connecting the dots that working through Claude to get things into the terminal (done for you by the agent) and doing it yourself with direct inputs into the CLI are not the same, and the latter is almost certainly where trojan horses will get smuggled in as phishing expeditions change tactics and target new behavioral patterns where attention has been turned off.
But the other reason to maybe reconsider it is that not all skills are good, even if they sound perfectly good. So it may be better to make creating your own skill the "next thing you learn." But when do you do that? And what does an effective skill look like? I know more about the former than the latter, but I'll share what I did for the latter as well.
Creating skills is straightforward because you yourself don't do anything, not unless you count asking Claude to make something as "doing" something. I mean, I suppose asking one of my kids to pass the salt is "doing" something, but putting it that way makes it sound a lot heavier than it really is. Asking Claude to make a skill is literally on par with asking your kid at the dinner table to pass the salt, because once you make the request, Claude gets to work. Claude knows what to do and where to put it, and once the skill is made, it will run every time. So that part is simple, and I think trying to understand it is to overthink it, just like trying to understand how to get your kid to pass the salt is by definition overthinking it.
But I don't think it is overthinking it to wonder what the best strategy is for tackling a problem that you encounter over and over in your work using AI agents. That is where I have made many mistakes and had to undo the work. I have a skill called /tikz, for instance, whose sole job is to use mathematical functions to triangulate and repair labels that are overlapping with other objects in TikZ graphs and software-produced .png images. This is important insofar as those aesthetic outputs are important. LLMs don't have real spatial reasoning so much as they can access tools that look like they're spatially reasoning. You would think they do, because they work intensively to smooth out each and every "Overfull \hbox" and "Overfull \vbox" compile error in beamer, and if you know you know: those are indicators that something is spilling off the bottom margin of the slide, or the left and right margins, usually because something is too large. Those are true "errors" in the sense that words pop up, Claude in particular recognizes those words, and due to its reinforcement training, it will, if you tell it to, work like a dog until it does not get any such errors. But this is not happening because LLMs "see" the errors; it happens because errors of this kind produce detectable warnings in tokens, which trigger responses, which trigger repairs, which trigger recompiling, in a loop until it's fixed. It gives the appearance of reasoning and looking at the screen, when that isn't the process at all, as LLMs do not "look" at anything.
And yet I really was spending far too long tinkering with the slides ex post because my /beautiful_deck skill just was not consistently producing slides that were perfect, and I was spending a lot of time ironing out non-compile visual errors. So I developed /tikz, which would go round and round through a series of tasks on each image, and without realizing it, I had somehow created a skill where it would circle and loop through each image hundreds of times. My first time ever maxing out tokens came using /tikz, in fact; I just watched as the equivalent of the "spinning ball" happened, and Claude went over the same series of tasks, unbeknownst to me, without, best I could tell, any real progress being made. So I undid everything that was in /tikz and kept it more basic — it now only uses a particular mathematical function to check that labels sit at the exact coordinates intended and that each object has white space between it and the next object. For some reason, this still doesn't eradicate every problem, but I decided I'm only going to improve that skill when some new solution becomes apparent to me. I'm in no hurry.
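For what it's worth, here is a minimal sketch of the kind of overlap check I have in mind. The function names, the padding value, and the hard-coded boxes are made up for illustration; they are not the literal contents of /tikz.

```python
# Sketch of a white-space check between labels and other plot objects.
# Bounding boxes would come from the compiled TikZ/.png output; here they
# are hard-coded purely for illustration.

def boxes_too_close(a, b, pad=4.0):
    """True if boxes a and b overlap or sit closer than `pad` units apart.

    Each box is (xmin, ymin, xmax, ymax) in the same coordinate system.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Expand box a by the required white space, then test for overlap.
    return not (ax1 + pad < bx0 or bx1 + pad < ax0 or
                ay1 + pad < by0 or by1 + pad < ay0)

def find_collisions(labels, objects, pad=4.0):
    """Return (label_name, object_name) pairs violating the white-space rule."""
    problems = []
    for lname, lbox in labels.items():
        for oname, obox in objects.items():
            if boxes_too_close(lbox, obox, pad):
                problems.append((lname, oname))
    return problems

if __name__ == "__main__":
    labels = {"y-axis label": (0, 40, 12, 80)}
    objects = {"regression line": (10, 10, 90, 75)}
    print(find_collisions(labels, objects))  # -> [('y-axis label', 'regression line')]
```

The point of keeping it this simple is exactly the lesson above: one narrow geometric test, run once per image, rather than an open-ended loop of "fix the picture" tasks.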
Yesterday I came up with two new skills, though. The first one came to me when I read this headline in the NYT from a week ago.
A.I. ‘Hallucinations’ Created Errors in Court Filing, Top Law Firm Says
I think that, like me, you have heard a version of this exact same story for three years straight, because lawyers were periodically getting caught, so to speak, submitting citations in court that had been hallucinated. And what was ironic in the case of Sullivan & Cromwell was something said at the very end of the article:
According to Mr. Dietderich’s letter, Sullivan & Cromwell requires its lawyers to take a training course before gaining access to A.I. tools. Among the training’s exhortations, Mr. Dietderich wrote, is to “trust nothing and verify everything.”
Best I can tell from this paragraph, Sullivan & Cromwell allowed lawyers to use generative AI in their work, even required them to take a training course, and yet the most damaging error still snuck through — hallucinated citations.
So, I decided to experiment with a new skill, but also a new skill strategy, and that was to use multiple agents in parallel to sweep through a set of references and make judgment calls as to whether the reference was correct. I call it /bibcheck and here’s how it works and the conjecture it is based on.
The conjecture I have, correct or not, is that LLMs eventually hit something like diminishing returns, though I call it "gradient decay" because that sounds fancier, and I have heard that before the transformer, language models hit gradient decay rapidly. Gradient decay, before the transformer, was how they would lose the thread, and it largely happened because they did not process language in parallel but sequentially. By the time they got to the end of a sentence, they might forget, so to speak, the noun at the start of that same sentence. They would almost operate like a bow shooting an arrow into the sky — soaring, but only for a second, and then falling. The transformer architecture had a big impact on gradient decay and slowed it.
But that slowing — in my conjecture, keep in mind — was for the actual language part, not so much the task part. Claude and ChatGPT will always speak like an intelligent person, but that is not to say that they can keep up with the entire conversation. All of them have some upper bound, which is metaphorically what I consider to be gradient decay, and therefore if it happens in the conversation, then maybe it happens with tasks too.
So the principle behind my skill /split-pdf is based on the idea that they cannot easily parse a large pdf, but they can parse a small pdf. /split-pdf splits a large pdf into N smaller "split" pdfs, where N is equal to the total page length of the pdf divided by 4. So if it's a 100-page pdf, then 100/4 = 25, which means it makes 25 four-page pdfs. I then spawn 25 agents whose sole job is to read a single 4-page pdf, and only that one particular 4-page pdf, write a markdown summary of it according to some criteria I specify, and then quit. Then once they're all done, a new agent goes through all 25 markdown summaries and creates a master summary of the entire paper. Not only does this never result in the Claude session choking, but I think it's possible it's doing a decent job of grabbing the quantitative information stored in tables and figures. And that is because, at least by my conjecture, there is less decay in reading a 4-page pdf than there is in reading a 100-page pdf, even if it can accomplish the latter without choking.
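To make the mechanics concrete, here is a minimal sketch of the chunking step, assuming the pypdf library and a placeholder filename (paper.pdf), neither of which comes from the skill itself. The agent-spawning happens inside Claude Code, so it is represented here only as one instruction file per chunk.

```python
# Rough sketch of the chunking behind /split-pdf (illustrative, not the skill itself).
from pathlib import Path
from pypdf import PdfReader, PdfWriter

PAGES_PER_CHUNK = 4  # the post's rule of thumb: N = total pages / 4

def split_pdf(pdf_path: str, out_dir: str = "splits") -> list[Path]:
    reader = PdfReader(pdf_path)
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    chunks = []
    for start in range(0, len(reader.pages), PAGES_PER_CHUNK):
        writer = PdfWriter()
        for i in range(start, min(start + PAGES_PER_CHUNK, len(reader.pages))):
            writer.add_page(reader.pages[i])
        n = start // PAGES_PER_CHUNK + 1
        chunk = out / f"chunk_{n:02d}.pdf"
        with open(chunk, "wb") as f:
            writer.write(f)
        # One instruction file per chunk: the agent reads only this chunk,
        # writes a markdown summary, and quits.
        (out / f"chunk_{n:02d}_task.md").write_text(
            f"Read {chunk.name} and only {chunk.name}. Write a markdown summary "
            "following the criteria in the skill, noting any quantitative results "
            "in tables or figures, then stop.\n"
        )
        chunks.append(chunk)
    return chunks

if __name__ == "__main__":
    pieces = split_pdf("paper.pdf")
    print(f"{len(pieces)} chunks written; a final agent merges the per-chunk summaries.")
```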
Well, as readers know, I've been playing around with "multiple agents" for months now, and yesterday I wondered whether I could create a skill that used multiple agents to "audit" the bibliography based on the same logic as /split-pdf. That's what Claude and I came up with, and if you want, you can just give Claude the URL and ask him to explain it. The skill is called /bibcheck and here's the gist.
First, /bibcheck identifies the number of references. You could have it review the entire bibfile, which is probably not a bad idea — just audit your entire bibfile using /bibcheck. Or it can review only the citations actually used in the paper. Not all errors in the bibfile are due to hallucinations. They can include things like misspelled author names, listing something as a working paper when it has since been published, or simply the wrong year. If you write with LaTeX, then you cite from a single source — the bibfile, which is a text file with a particular field structure — so auditing that once may honestly be the only thing you need to do.
But let's say that you don't do that and you want to instead audit the references cited in your paper. Here is what it does.
Case 1: Multiple agents assigned to specific citations
In case 1, you use /bibcheck to spawn one agent per citation. Each agent has only one job, and that is the citation it has been assigned. It must find the paper or book cited online and verify that the author names are correct, the title is correct, the publisher is correct, and so on. It does not make corrections if a mistake is found; rather, it writes a referee report in markdown, making it similar to /referee2 — another skill of mine that does aggressive coding audits in multiple languages, among other things, and writes reports after it's done. I try to give agents, now, specialized tasks, not all the tasks. That is, I don't give Claude the single task of checking all the citations, under this hypothesis of "gradient decay in tokens," even within the transformer. Rather, I operate under the assumption of the specialization of labor. Make tiny skills populated with single agents, execute those tasks, leave a trace of the completion of each task, aggregate those traces, and then have a final agent review that output and take its own separate action. And that is the idea behind /bibcheck — one agent per citation, verification against an online source, line-by-line checks that all fields in the bibfile are correct, a report written in markdown, a review of the markdown, and a decision on a solution.
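To show that fan-out concretely, here is a minimal sketch of Case 1. The file names, the placeholder bib entry, and the naive parsing are all illustrative; in the real skill, Claude reads the bibfile and spawns the agents itself.

```python
# Sketch of the per-citation fan-out in /bibcheck (illustrative, not the skill itself).
import re
from pathlib import Path

SAMPLE_BIB = """\
@article{doe2020placeholder,
  author  = {Doe, Jane},
  title   = {A Placeholder Title},
  journal = {Journal of Examples},
  year    = {2020}
}
"""

def split_entries(bib_text: str) -> list[str]:
    """Return each @entry{...} block as its own string (naive split on '@')."""
    pieces = re.split(r"(?=@\w+\{)", bib_text)
    return [p.strip() for p in pieces if p.strip().startswith("@")]

def write_citation_tasks(bib_text: str, task_dir: str = "bibcheck_tasks") -> int:
    out = Path(task_dir)
    out.mkdir(exist_ok=True)
    entries = split_entries(bib_text)
    for i, entry in enumerate(entries, start=1):
        key = re.search(r"@\w+\{([^,]+),", entry).group(1)
        # One task file per citation: the agent checks this entry and nothing else.
        (out / f"{i:03d}_{key}.md").write_text(
            f"You are checking exactly one citation: {key}.\n\n"
            f"{entry}\n\n"
            "Find this work online. Verify author names, title, outlet, and year. "
            "Do not edit the bibfile. Write a short referee report in markdown "
            "listing any field that is wrong, then stop.\n"
        )
    return len(entries)

if __name__ == "__main__":
    # In practice you would pass Path("refs.bib").read_text() instead.
    n = write_citation_tasks(SAMPLE_BIB)
    print(f"{n} per-citation task files written; spawn one agent per file.")
```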
Case 2: Multiple agents assigned to specific fields
But the other thing I am experimenting with is tackling the same problem in a different dimension. In my mind I see a two-dimensional graph, and on the y-axis is "separate agents per citation" and on the x-axis "separate agents per field." What does that mean?
Let's say that a bibfile contains title, year, journal, author, issue, number, and pages fields. Then I create 7 agents. There is a "title agent," and that agent's sole job is to review only titles. There is a "year agent," and that agent's sole job is to review, assess, and verify the accuracy of each citation's year. And so on.
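And a matching sketch for Case 2, again with placeholder field names and file layout rather than the literal contents of /bibcheck: the same bibfile, but the work is now sliced by field instead of by citation.

```python
# Sketch of the per-field decomposition: one agent per FIELD (illustrative only).
from pathlib import Path

FIELDS = ["title", "year", "journal", "author", "issue", "number", "pages"]

def write_field_tasks(bib_path: str = "refs.bib",
                      task_dir: str = "bibcheck_field_tasks") -> None:
    out = Path(task_dir)
    out.mkdir(exist_ok=True)
    for field in FIELDS:
        # One task file per field: the "title agent" only ever looks at titles, etc.
        (out / f"{field}_agent.md").write_text(
            f"You are the '{field} agent'. Read {bib_path} and check ONLY the "
            f"'{field}' field of every entry against the published record. "
            "Ignore every other field. Write a markdown report listing each "
            "entry whose field is wrong, with the correction, then stop.\n"
        )

if __name__ == "__main__":
    write_field_tasks()
    print(f"{len(FIELDS)} field-agent task files written.")
```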
What I'm saying is that I am creating skills with many agents on a single premise, and that single premise is "gradient decay," which is my version of "diminishing returns to agent performance." If the task requires many tokens, then the last token will have greater error than the first, but because it's done within the transformer architecture, it's done in parallel, and so the concepts of "first" and "last" are not exactly in time. Not exactly, because the transformer's innovation was to not do that. But we can see that as the context window fills, the model "remembers" less. It performs worse. Things get congested. I have one open thread in Claude Chat now where just getting it to load sometimes takes as long as 5-10 seconds, which is an eternity. Why? I've been talking to that particular chat since October about all the stressors up here in Boston: all the stressors at Harvard, all the stressors around friends, all the stressors around my dad and his dying, and so on. And I don't want to lose that context, because when I lose that context, I have assumed (until yesterday, when a student explained to me there is a way to get back all your conversations so that you don't lose the context) I will lose the progress made on whatever topic has been repeatedly discussed.
So that's my starting point and my premise — that there is a gradient decay, and that I'm right that you can avoid it through smaller chunked tasks. And that's how I'm approaching it for now. At some point I'm going to run an experiment, though. I'm going to compare /split-pdf against other direct pdf-to-markdown tools and see if mine actually works. For all I know, mine does not work well, even though it does solve the problem of a session dying because Claude choked on a massive pdf. That problem does stop with /split-pdf. But that does not mean that the accuracy of the summaries is right or better.
But I'm getting there. Circling back to the top. I'm getting there. What do I mean? What I mean is that skills are literally functions of my human capital. That is why I am skeptical of just borrowing other people's skills. I am skeptical that agent-based work will ever be like Neo in the Matrix downloading kung fu. I think that it will always be a more traditional version of learning skills. And for me, I make skills that help me become more productive by exploiting the strengths of this technology. But what I need and what you need are possibly two very different things, or maybe slightly different tweaks on the same thing.
So Mixtapetools — maybe it is just there for you to think about concepts and ways to tackle things, and maybe it's there to give you a specific skill. I don't know, but I do think that at some point it is worth your effort to try to make one. Think of a highly valuable, repetitive thing you are doing, something that is time intensive when you do it alone, and ask Claude to help you come up with a strategy for turning it into a skill. And see if you can do it. Just try.


