Claude Code 27: Research and Publishing Are Now Two Different Things
Some Claude Code fan fiction about the economics of publishing with AI agents set in the very near future
Claude Code has made it easier to do research now. But it is about to get much harder to publish in traditionally valued locations.
This is something I’ve been thinking about since early this year, and it coalesced for me when I sat down and prompted Claude Code to fully automate a paper from the vaguest proposal I could come up with. It came up with the idea, a shift-share identification strategy (which, on a second prompt, I had it go deeper into by reviewing Peter Hull’s repository for his shift-share IV workshop at Mixtape Sessions), crawled the web until it found suitable data, did the analysis, and wrote the paper. I then submitted that paper to refine.ink and paid around $40-50 for my referee report. I uploaded that report to the project directory, had Claude make all the revisions, then had referee2 (a persona from my mixtapetools repo) critique the paper. I opened up two terminals and had agents audit the code by rewriting the entire pipeline in two other languages, confirmed there were no coding errors, resubmitted the paper to refine.ink one last time, and then concluded.
The entire experience cost me about $100 in refine.ink payments and a couple of hours, max. I’ve only skimmed the paper, but the experience was enough to make me think that paper mills are coming, not so much on the journal side (though surely some of that) as on the paper production side. What I mean is that I now suspect we will see a nontrivial amount of paper milling at the source: the researchers themselves. And so, like any economist, I thought and thought, and the result is this substack, which is basically Claude Code fan fiction about the new economics of academic publishing set in the very near future. It’s a bit of rambling, some simulations based on observed distributions, and some simple economic reasoning with assumed large elasticities. But that is why it’s Claude Code fan fiction.
Thanks again everyone for your support of the substack. It’s a labor of love. If you aren’t a paying subscriber please consider becoming one!
Coral Hart used to write 10 to 20 romance novels a year. Now she writes more than 200. The difference, she said, is ChatGPT. She describes it as “help,” though that word is doing an enormous amount of work in that sentence. She brings in six figures doing this now, which you get the sense comes more from volume than from quality. The New York Times profiled her in February.
Hart said she has seen a 10-20x increase in cognitive output. That large a gain came from a much simpler LLM workflow than what’s available now with Claude Code and other agent-based writing systems. And she’s writing romance novels: a genre with conventions, a readership that values volume, and a distribution channel (Amazon) that will publish anything you upload. The only bottleneck was the author’s time, and the tool eliminated it.
But what happens when the same productivity shock hits a system where the bottleneck was never really production in the first place, but rather a hierarchical journal structure that depends immensely on editors’ time, skill, and discretion, and on voluntary workers with the same talents, called referees, to screen for quality deemed sufficient for publication? What about the quality of those papers? After all, there is a difference between writing a manuscript and publishing it at a journal, and the latter only begins once the paper is written. What will happen to publishing?
The distribution will change
If the unconditional probability of acceptance at a top-5 journal is around 3-5%, and the cost of producing a submission-quality paper drops to near zero, then the expected value calculation is straightforward. Write a hundred papers. Submit them all. Manage a massive portfolio. Though most will fail, you only need a few to land. You can’t win the lottery if you don’t buy a ticket.
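To make the lottery logic concrete, here is a minimal sketch of the portfolio calculation; the 4% per-paper acceptance probability, the portfolio sizes, and the independence assumption are all illustrative, not estimates:

```python
# Probability of landing at least one top-5 acceptance from a portfolio of
# submissions, assuming each paper independently has a small chance of acceptance.
p = 0.04  # assumed per-paper acceptance probability at a top-5
for n in [3, 10, 100]:
    p_at_least_one = 1 - (1 - p) ** n
    print(f"{n:>3} papers -> P(at least one acceptance) = {p_at_least_one:.1%}")
# roughly: 3 papers -> 11.5%, 10 papers -> 33.5%, 100 papers -> 98.3%
```

At three papers a year you are mostly buying disappointment; at a hundred, under these assumptions, an acceptance is nearly a sure thing.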
Imagine that the value of a top-5 hasn’t changed, at least not yet. If the cost of exercising that option has collapsed, then the number of new submissions depends on the magnitudes of various elasticities measuring responses across the pipeline. My hunch is that many nodes in the pipeline have become more elastic, so we should expect large supply responses from them, but not all have; where they have remained inelastic, we should expect bottlenecks, and therefore queuing, and almost certainly the injection of some noise.
Reimers and Waldfogel studied what happened to book publishing after ChatGPT launched. The number of new titles on Amazon tripled. Average quality fell. The best books didn’t change much — the frontier stayed where it was. But the mass of new entries came from the left tail of the quality distribution.
I’ll elaborate on the numbers in this graphic later, but for now treat it as a visual to guide you through this fan fiction essay. The green is the number of highest-quality papers, proxied by publications across 87 journals (which I pulled out of articles I found online); historically there are around 3,800 publication slots there. The yellow is the number of human submissions pre-AI. I calculated it by going through all 87 journals, approximating their acceptance rates, and using the average number of issues and articles each publishes per year. While acceptance rates range from 5 to 20% across the 87 journals, the overall average is closer to 10%, which is how I extrapolated to 39,016 submissions. I figure this is wrong, but not by much.
The blue is a normally distributed and sizable 5x increase in submissions coming from AI. Some of these will be fully automated, meaning they were produced in only a few hours without a human in the loop, whereas others will take weeks of fairly intensive human involvement, but still result in a new manuscript in a fraction of the historical time. I model quality as normally distributed because paper quality is shaped by many roughly independent factors (topic, data, execution, writing), and quantities shaped by many independent inputs tend toward normal.
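If you want to reproduce something like the figure yourself, a simulation along these lines will do it; the slot and submission counts come from the numbers above, but the quality means and spreads are assumptions for illustration, not estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

SLOTS = 3_800            # publication slots across the 87 journals
HUMAN_SUBS = 39_016      # pre-AI human submissions per year
AI_MULTIPLIER = 5        # assumed 5x increase in total submissions from AI

# Assumed quality distributions (arbitrary units), normal per the
# many-independent-inputs argument; AI papers centered one sd below humans.
human_q = rng.normal(loc=0.0, scale=1.0, size=HUMAN_SUBS)
ai_q = rng.normal(loc=-1.0, scale=1.0, size=HUMAN_SUBS * (AI_MULTIPLIER - 1))

pool = np.concatenate([human_q, ai_q])
cutoff = np.sort(pool)[-SLOTS]               # quality needed to claim a slot
ai_share = (ai_q >= cutoff).sum() / SLOTS    # share of slots won by AI papers

print(f"overall acceptance rate: {SLOTS / pool.size:.1%}")
print(f"AI-origin share of the {SLOTS} slots: {ai_share:.1%}")
```

Shift the assumed AI mean to the right and the AI share of slots climbs quickly, which is the whole point of the right tail discussion below.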
Now look at what’s already happening in economics. The University of Zurich’s Social Catalyst Lab is running something called Project APE — Autonomous Policy Evaluation. It uses Claude Code to autonomously generate empirical economics papers. Not drafts. Full papers with identification strategies, data collection, estimation, tables, figures, and writeups. As of this writing, it has produced 204 papers — with 60 added in a single week. Their stated goal is 1,000.
But are they any good? In head-to-head matchups, the AI papers win 4.7% of the time against human papers from the AER and AEJ: Policy. The Elo gap is massive: 1,154 for the average AI paper versus 1,831 for the average AER-equivalent article. But you can already see signs that the distribution is roughly normal and has enough mass in the right tail to suggest some papers may be good enough for high-quality outlets, something that only shows up at scale.
So as you can see in those graphics above, a few AI papers do crack the top 40 out of 247 total entries. That is what you’d expect if the AI papers are drawn from a normal distribution; remember that the tails of the normal stretch from negative infinity (blinding in their awfulness) to positive infinity (one-in-a-million spectacular). And the most recent cohort is already improving, with a slightly higher 7.6% win rate.
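The standard Elo expected-score formula is a useful way to read those numbers; the formula is standard, the interpretation is mine. The average-vs-average gap alone predicts a win rate of about 2%, below the observed 4.7%, and that gap is consistent with right-tail mass: papers drawn from a spread of ratings win more often on average than the average rating alone would predict.

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Average AI paper vs average AER-equivalent paper (Elo numbers from the text)
print(f"{elo_win_prob(1154, 1831):.1%}")  # ~2.0%, vs the observed 4.7% win rate
# The observed rate averages over a spread of AI ratings; the right tail
# wins far more often, pulling the average up.
```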
And consider this. These are fully automated papers, like a version 1.0, with no human iteration whatsoever. What might happen if the papers got deep, close looks from a human, or were refined through something like refine.ink?
Journal revenue in the short run
I tried to work out some simple back-of-the-envelope numbers for this illustration, using as my baseline things I found here and there. So let’s start with some basic, though approximate, facts about the only profession I feel qualified to talk about: my own, economics.
There are roughly 12,000 research-active economists who submit to ranked journals. Currently they generate about 39,000 submissions per year, roughly 3 per researcher. If the average goes from 3 to 10, that’s more than a 3x increase from existing authors alone. Then add in new entrants who previously couldn’t produce at submission quality and you’re at 4-5x, which is how I arrive at 5x.
But 3D-printing a manuscript isn’t the whole cost of publishing, because you must also pay journal fees upon submission, and those scale linearly with the portfolio. Still, the cost is trivially low. The average submission fee is $112, so going from 3 to 10 submissions costs an additional $784 in fees. Add a Claude Max subscription at $200 a month, and the total annual cost of more than tripling your output is about $3,200. That’s less than one conference trip. Not everyone can afford it, but given that a single top-5 publication is worth a lot in present discounted expected value, and given economists’ wages, I expect a nontrivial number of people are above that threshold. Plus coauthors can split it.
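The arithmetic behind that $3,200, using the averages above:

```python
avg_submission_fee = 112     # average submission fee across the 87 journals
extra_submissions = 10 - 3   # going from 3 to 10 submissions a year
claude_max_monthly = 200     # Claude Max subscription

extra_fees = avg_submission_fee * extra_submissions   # $784
subscription = claude_max_monthly * 12                # $2,400
total = extra_fees + subscription
print(f"extra fees ${extra_fees:,} + subscription ${subscription:,} = ~${total:,} a year")
# -> extra fees $784 + subscription $2,400 = ~$3,184 a year
```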
Demand for one of those 3,800 slots at current fee levels is almost perfectly inelastic. Let me abuse the idea of an elasticity a little to illustrate this: given the volume increase, journals could raise submission fees and still receive more submissions than they did before Claude Code. That’s not an elasticity proper, which is a ceteris paribus measure, but it’s worth keeping in mind. Either way, they’re looking at anything from a swell to a rogue wave bearing down on them.
I pulled data on 87 economics journals — top 5, general interest by tier, AEJ series, top field, second tier, and third tier, and then grouped them into categories with approximations of acceptance rates. Together they publish about 3,800 articles per year and receive roughly 39,000 submissions.
Those 3,800 slots are fixed in the short run. Journals can’t print more pages, hire more editors, or expand their issues overnight. The journal side doesn’t respond to the rightward shift in manuscript supply other than to allocate 3,800 of the submissions into 3,800 slots.
The top-5 currently accept about 5% of submissions. At 5x volume, that drops to 1%; at 10x, 0.5%. If journals do nothing, acceptance rates fall mechanically.
So let’s assume for now that journals do nothing except what they have been doing. Then what? Then they are about to make a lot of money.
At current volumes, these 87 journals collect roughly $6.2 million per year in submission fees. At 5x, that’s $31 million. The top-5 alone would go from $812,000 to $4.1 million — mostly from papers that get desk rejected within a week.
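A sketch of the dilution and fee arithmetic, using only the aggregates already quoted in the text (the per-journal fee detail behind the $6.2 million lives in my data; this just scales the totals and holds slots fixed):

```python
subs_now = 39_000          # current submissions across the 87 journals
fee_revenue_now = 6.2e6    # current submission-fee revenue across the 87 journals
top5_accept_now = 0.05     # top-5 acceptance rate today

for multiplier in [1, 5, 10]:
    subs = subs_now * multiplier
    revenue = fee_revenue_now * multiplier    # fees scale with volume
    accept = top5_accept_now / multiplier     # same slots, more submissions
    print(f"{multiplier:>2}x: {subs:>7,} submissions, ~${revenue/1e6:.0f}M in fees, "
          f"top-5 acceptance ~{accept:.1%}")
```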
Editors, referees, and bottlenecks
Every submission will have run every conceivable robustness check. Every paper will have been through Refine.ink, probably multiple times. Economics articles are already notoriously long. They’re about to get longer. Expect more appendices. Expect better writing and more “beautiful figures”.
Consider the economics of a service like refine.ink. Ben Golub’s service sits at exactly the right place in the production chain to sometimes get paid multiple times for the same paper: before submission, during editorial screening, during review, and again after the R&R. That’s potentially four or five payments per paper. It’s a brilliant business model because it solves the bottleneck problem created by human evaluation. Not only will researchers be paying more in journal fees; they will also be paying verification fees.
But the perverse result is that intense, repeated polishing makes every paper harder to distinguish. When every submission is polished and empirically meticulous, the signal-to-noise ratio for editors doesn’t improve; it gets worse. The marginal information content of “this paper is well-executed” drops to zero because the left tail no longer trails off. Instead editors hit a giant wall of very similar-looking papers, well written, with data, execution, and probably interesting results. The skill of immediately rejecting at the desk anything below the bar is going to be stretched. Editors will have to parse through a lot of papers, and if they don’t, if they fall back on heuristics instead, the question is how biased those heuristics will be in this new environment.
But the desk reject is only the first stage. The second is refereeing. Submissions can multiply by 5x, but the referee pool cannot, since it is limited by the number of PhDs. Most referees aren’t paid; just as taxes are the price of living in a civilized society, serving as a referee is the price of living in academic society. You’re asking tenured professors to spend 10-20 hours evaluating someone else’s paper as a professional obligation. At current volumes, this barely works. At 5x, it breaks. Honestly, it’ll probably break at 1.5x.
We need to make some guesses about the desk rejection rate as well as the referee pool. Let’s assume the referee pool stays fixed. Then the desk rejection rate has to rise from maybe 50% to probably closer to 90% just to keep the system from collapsing. Editors would be rejecting something like 173,000 manuscripts a year on a skim: 9 out of 10 papers, dead on arrival, with less time per paper.
Inevitably, pattern-matching shortcuts emerge. Like what? Well, what’s observable other than the manuscript that might be tied to quality? Researcher pedigree, name recognition, institutional affiliation. If these are correlated with quality, even weakly, then editors may update on them to try to cut through the noise. But this is imperfect, not to mention unfair, and so desk rejection gets noisier: good papers get killed by tired editors, and marginally lower-quality papers slip through to referees. It’s a cascading failure: volume breaks editors, broken editing wastes referees, wasted referees slow science.
But what if some of the 5x increase in submissions gets passed on to referees? At 5x submissions, without an aggressive increase in desk rejection, the system would need over 146,000 referee reports per year, against a realistic supply of maybe 54,000. That’s because historically each refereed paper gets somewhere between 2 and 5 reports. You cannot tap the same human resource three times harder and expect it to comply. At some point the whole “taxes are the price of civilization” argument breaks down; citizens have been known to revolt against tax policy, even modest tax policy.
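Here is a sketch of where numbers like these can come from; the desk-reject rates and reports-per-paper combinations below are my guesses at plausible parameters, not figures from any dataset:

```python
subs = 39_016 * 5          # submissions at 5x
referee_supply = 54_000    # assumed realistic annual supply of referee reports

def reports_needed(desk_reject_rate: float, reports_per_paper: float) -> float:
    """Referee reports demanded per year, given how much gets desk rejected."""
    return subs * (1 - desk_reject_rate) * reports_per_paper

# A few plausible combinations; 70% desk rejection with 2.5 reports per paper
# lands near the 146,000 figure, and only ~90% brings demand back near supply.
for desk, reports in [(0.50, 2.0), (0.70, 2.5), (0.90, 2.5)]:
    need = reports_needed(desk, reports)
    print(f"desk-reject {desk:.0%}, {reports} reports/paper -> "
          f"{need:,.0f} needed vs {referee_supply:,} available")
```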
So what fills the gap? The same thing causing the problem: LLMs. The honest answer might make people uncomfortable, but consider this: humans weren’t being paid to referee in the first place. It has always been voluntary, unpaid labor. The human-centric system has worked well enough for decades to centuries, depending on what we mean, but keep two things in mind: for most of the history of science, human peer review did not exist, and human peer review has helped cause well-documented publication biases, including replication crises. I think refine.ink sees a shift toward intensive use of LLMs for refereeing as a very near equilibrium condition; just look at the third option under their subscription model, “best for editors and frequent publishers”.
The arms race nobody wins
Here’s the problem with the expected value calculation I laid out earlier. It’s correct for any individual researcher — but when everyone does it, the collective outcome is worse for almost everyone. This is probably close to a prisoner’s dilemma.
If a researcher is the only one who scales submissions using LLMs, that person gains an edge. But if the gains are real, they won’t be the only one. In the new equilibrium, everyone is producing 2-3x more papers, acceptance rates drop, and the probability of publishing any given paper falls, despite arguably fewer coding errors and perhaps even each person’s work being individually better. But to stay in that new equilibrium, everyone is spending an extra $3,200 a year, and the entire profession is running faster to keep up with the same 3,800 slots. And you can’t unilaterally stop, because if you go back to 3 papers while everyone else is at 10, you’re strictly worse off, unless you’re assured you will somehow be treated differently despite all the noise in the machine.
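A toy payoff table makes the prisoner’s dilemma structure explicit; the numbers are purely illustrative (think of them as expected publication value net of the extra ~$3,200 in costs), not estimates of anything:

```python
# Rows: my strategy; columns: everyone else's strategy.
# Payoffs are illustrative net values, in arbitrary units.
payoffs = {
    ("scale",     "stay at 3"): 8,   # I flood, others don't: I gain share
    ("scale",     "scale"):     2,   # everyone floods: same slots, extra cost
    ("stay at 3", "stay at 3"): 5,   # status quo
    ("stay at 3", "scale"):     0,   # I abstain while everyone else floods
}
# Whatever the others do, "scale" beats "stay at 3" (8 > 5 and 2 > 0),
# yet mutual scaling leaves everyone worse off than the status quo (2 < 5).
```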
Institutional responses
But that’s all short-run stuff. What about the long run? In the long run, all fixed inputs become variable, so we might expect some things we currently treat as immovable to become quite malleable. Things like submission fees.
If demand for slots is inelastic, we should absolutely expect journal fees to rise. Higher submission fees will fall hardest on junior faculty with heavy teaching loads, researchers in developing countries, and anyone without grant funding or a generous research budget.
The returns to top 5s will also rise, for a while anyway, since the rising volume of papers will push acceptance rates down. At the moment, very few AI-automated papers can compete head to head against AER-equivalent publications, but some will, because the normal distribution produces theoretically long tails stretching to positive and negative infinity. Something like Murphy’s law applies: with enough trials, anything that can happen will happen. What limits this is whether enough people will push the capacity as far as it’ll go, but the capacity is absolutely there to be pushed. The restraint is a matter of norms more than capability.
But to manage that, I suspect we will see AI screening at the desk. If LLMs already produce high-quality referee reports, why wouldn’t editors use them to cull the herd? That’s the genius of Ben’s business model: it helps those submitting, and as paper production rises, its revenues grow, from the early evaluation to, most likely, a second evaluation of the identical manuscript, maybe minutes later, by the editor the team just submitted to. Duplicate evaluations become routine, and that’s not counting the earlier polishing and the later polishing once the R&R hits.
The result: more papers, roughly the same number of publications, journals earning more, evaluation services earning more (and most likely double dipping), referees fielding more requests, and faculty spending thousands more per year only to end up in the same relative position, with no clear technological advantage. The deadweight loss from this arms race is probably not strictly zero.
What I think is coming
Even with AI screening at the desk, the noise doesn’t disappear; it most likely just migrates. Perfect automated screening can answer “is this paper competent?” but it can’t answer “is this paper more important than that one?” And when 20,000 competent papers are competing for 3,800 slots, the final decision rests on something other than quality: editor taste, topic fashion, referee mood, institutional priors. At acceptance rates around 1%, you’re selecting among a crowd of qualified papers using criteria that are increasingly arbitrary.
And there’s a tell. Look at people’s websites. Right now, a productive economist might have 6-12 working papers listed. In two years, with automation, is someone really going to post 75 unpublished manuscripts on their website? That’s the paper mill signature, visible to everyone: hiring committees, tenure reviewers, grant panels. Even if every paper is competent, 75 unpublished manuscripts says “this person is playing the lottery,” not “this person is doing important research.” The people who benefit most in this equilibrium are the ones already producing 1-2 excellent papers a year who use AI to make each paper better, not more numerous. The people who may be unexpectedly penalized are the ones who scale paper production to larger and larger volume, because volume is visible, on websites and to editors alike, and it suggests a person writes papers as opposed to does research. The market will price that accordingly, whatever the price turns out to be.
And remember: this is the worst version of these tools we will ever use. Project APE’s most recent cohort has already improved from a 4.7% to a 7.6% win rate in these head-to-head competitions. The quality distribution is changing with scale, and it is partly drifting rightward. Once AI papers become competitive not just at the field journal level but at general interest journals, that is when the arms race intensifies the most, because the automated submissions aren’t just filling the left tail anymore. They’re competing for the same slots at the best journals, which becomes easier to justify since presumably those are the most scientifically important papers too.
The binding constraint on science is shifting from production to evaluation. The queue to get evaluated — not the difficulty of doing the work — becomes what determines how fast knowledge advances. And the honest question nobody wants to answer is whether human gatekeeping is still the right way to manage that queue, or whether we should let the same tools that caused the flood help sort through it.
I think the noticeable disruptions are three months out, not three years. The supply curve has already shifted. The demand curve for publication slots hasn’t moved. Everything else follows from that.

Thanks Scott for this excellent take on the continued proliferation of the publish or perish dilemma and the extent to which AI is exacerbating this.
As a chief executive editor on a couple of journals, as well as an associate editor and peer reviewer on another dozen or so, I have watched submissions explode over the past two and a half years. The quality is also getting better, as you would expect, as these models develop; however, something needs to be done at an ideological level to stop the deluge.
I know it seems extreme, but I wonder what would happen if researchers were limited to one publication per year and instead spent the rest of their time focusing on impact and the extent to which their research could genuinely benefit society. I'm being purposely provocative here but would welcome your thoughts.
Good post. But is it really the case that refine.ink is or always will be superior to an elaborate prompt/skill in Claude Code / Codex? At least from my experience so far, I was able to get a rather similar result from both systems.