Can you envision a pathway for training this out of models with a corpus of null-result papers, or would it be easier to accomplish with explicit instructions?
(Obviously the pool of fully written/tabled/figured/rhetoricized papers on null results is super-small, but the null/negative results journal movement got some stuff out there).
Frankly, I don't think it is that easy. I'm not even sure we understand what is going on -- or rather, I *know* we don't. Probably what's necessary is to really dig into the papers whose t-statistics bunch just above 1.96 and see precisely what decision-making differs so much. It's gotta be in the data-cleaning stages.
I mean we don't want nulls. We want the truth. The objective function is to find the truth.
That's a fair point, and well said -- exploring and addressing the underlying mechanism is ultimately more valuable than just recalibrating. And my comment did have a whiff of “why not put thumbs on both sides of the scale?”
That said, if the LLMs’ “goal” is just corpus-like p-value “success,” that's multiply realizable and addressing one mechanism might not solve the problem. Could this be a both-and situation?
No it's totally a good idea. I have been thinking about it too. We have to be very clear -- what IS the objective function? Is it to protect human researchers? That's an objective function. Is it to stop AI production of research? That is an objective function. Is it to produce highly accurate scientific research at scale no matter what? That is an objective function. And they are not the same.
Another one destined to go viral. This is honestly on the level of *Artificial Hivemind* except for social science.
"What a crazy world it is. P-hacking AI agents writing papers just like us. "
Do we know the bunching ratio of human-written, published studies? Or do I need to spend my afternoon downloading 700 random papers and testing...
We know a decent amount already:
https://www.aeaweb.org/articles?id=10.1257/aer.20190687
https://www.aeaweb.org/articles?id=10.1257/aer.20210795
And Brodeur has more.
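For anyone who actually wants to spend that afternoon on 700 papers, here's a minimal sketch of the bunching check in Python, assuming you've already extracted each paper's coefficients and standard errors into a table (the file name, column names, and window width below are placeholders I made up, not anything from Brodeur's replication files):

```python
import pandas as pd

# Hypothetical input: one row per reported estimate, with "coef" and
# "se" columns scraped from each paper's main tables.
df = pd.read_csv("extracted_estimates.csv")
df["t"] = (df["coef"] / df["se"]).abs()

# Brodeur-style bunching check: compare the mass of t-statistics in a
# narrow window just above the 1.96 threshold to the mass in the
# mirror-image window just below it.
width = 0.20
above = df["t"].between(1.96, 1.96 + width).sum()
below = df["t"].between(1.96 - width, 1.96, inclusive="left").sum()
print(f"just above: {above}  just below: {below}  ratio: {above / below:.2f}")
```

A ratio well above 1 is the bunching signature; the window width is a judgment call, and the ratio gets noisy fast in small samples.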
Insane. I'll dive into this more deeply when I get the chance, but after a quick skim the ordering by method is identical (IV worst -> DiD -> RDD). I can't believe the LLMs reproduced the distortion with the same method-by-method ordering.
Thanks for the links.
Yes, that's right. That's basically the ordering they find. And I'm finding something like that too. IV is the worst, but it has the smallest sample in the APE data, so take that with a grain of salt. But RDD and DiD are also hacked in these papers.
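Continuing the sketch above, the same ratio split out by method is how you'd check that ordering yourself -- again assuming a hypothetical `method` column tagging each estimate's identification strategy:

```python
# Same DataFrame as in the earlier sketch, now assuming each estimate
# is tagged with its identification strategy ("IV", "DiD", "RDD").
def bunching_ratio(t, width=0.20):
    above = t.between(1.96, 1.96 + width).sum()
    below = t.between(1.96 - width, 1.96, inclusive="left").sum()
    return above / below if below else float("nan")

by_method = df.groupby("method").agg(
    n=("t", "size"),
    ratio=("t", bunching_ratio),
)
print(by_method)  # a small n (as with IV here) makes the ratio noisy
```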
Interesting! But how do you know their ‘system’ isn’t telling their ‘agent’ to use these kinds of estimators and/or figures? At the end of the day, APE is just Claude Code with a skill, and the model will follow whatever you put in there. Before attributing this to the model having “learned” disciplinary norms from training, it’s worth seeing what the actual prompt and skill files specify.
You can see the prompting. And that's the entire project of APE. The entire project of APE is to NOT do that. It's a fully automated production cycle. You can find all the prompts in the online repo too.
Maybe I’m missing something, but the skill.md they seem to be calling in their demos is not in the public repo? A lot of the prompting can be hidden in there, no?
Let me see. You may have to spend more time poking around their repo. David sent me his prompting, so I assumed it was the same thing I was looking at. But I don't think you have to worry about it being explicitly asked to p-hack. I think it's fairly hands-off from what David shared.
Ohhh sorry for not being clear -- I’m not worried about the system being explicitly asked to p-hack; my comment was more about your observation that the system chose these methods and figures autonomously.
Oh, I'm more certain of that than anything else, because I've done three of these myself and that's exactly what it did. I posted about one of them here:
https://causalinf.substack.com/p/claude-code-29-can-claude-code-find
You can watch the video to see me prompting, and then you can see the figures (including the event study).
Cool. Undark reported on something similar a while back -- cherry-picking and other problems. Very similar to your stuff here: https://undark.org/2026/01/26/ai-scientists-human-research/
I wonder where the AI agents learned to do this...
https://www.youtube.com/watch?v=KUXb7do9C-w
Oh man! That made my day. It'd been so long since I'd seen that. Well played!