Can you envision a pathway for training this out of models with a corpus of null-result papers, or would it be easier to accomplish with explicit instructions?
(Obviously the pool of fully written/tabled/figured/rhetoricized papers on null results is super-small, but the null/negative results journal movement got some stuff out there).
Frankly, I don't think it is that easy. I'm not even sure we understand what is going on -- or rather, I *know* we don't. Probably what's necessary is to really dig into the papers whose t-statistics bunch just above 1.96 and see precisely what decision-making differs so much. It's gotta be in the data-cleaning stages.
I mean we don't want nulls. We want the truth. The objective function is to find the truth.
That's a fair point, and well said -- exploring and addressing the underlying mechanism is ultimately more valuable than just recalibrating. And my comment did have a whiff of “why not put thumbs on both sides of the scale?”
That said, if the LLMs’ “goal” is just corpus-like p-value “success,” that's multiply realizable and addressing one mechanism might not solve the problem. Could this be a both-and situation?
No it's totally a good idea. I have been thinking about it too. We have to be very clear -- what IS the objective function? Is it to protect human researchers? That's an objective function. Is it to stop AI production of research? That is an objective function. Is it to produce highly accurate scientific research at scale no matter what? That is an objective function. And they are not the same.
Another one destined to go viral. This is honestly on the level of *Artificial Hivemind* except for social science.
"What a crazy world it is. P-hacking AI agents writing papers just like us. "
Do we know the bunching ratio of human-written, published studies? Or do I need to spend my afternoon downloading 700 random papers and testing...
We know a decent amount already:
https://www.aeaweb.org/articles?id=10.1257/aer.20190687
https://www.aeaweb.org/articles?id=10.1257/aer.20210795
And Brodeur has more.
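For anyone who actually wants to spend that afternoon on 700 papers, here's a minimal sketch of the bunching check in Python, assuming you've already extracted each paper's coefficients and standard errors into a table (the file name, column names, and window width below are placeholders I made up, not anything from Brodeur's replication files):

```python
import pandas as pd

# Hypothetical input: one row per reported estimate, with "coef" and
# "se" columns scraped from each paper's main tables.
df = pd.read_csv("extracted_estimates.csv")
df["t"] = (df["coef"] / df["se"]).abs()

# Brodeur-style bunching check: compare the mass of t-statistics in a
# narrow window just above the 1.96 threshold to the mass in the
# mirror-image window just below it.
width = 0.20
above = df["t"].between(1.96, 1.96 + width).sum()
below = df["t"].between(1.96 - width, 1.96, inclusive="left").sum()
print(f"just above: {above}  just below: {below}  ratio: {above / below:.2f}")
```

A ratio well above 1 is the bunching signature; the window width is a judgment call, and the ratio gets noisy fast in small samples.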
Insane. I'll dive into this more deeply when I get the chance, but after a quick skim the ordering by method is identical (IV worst -> DiD -> RDD). I can't believe the LLMs reproduced the distortion with the same method-by-method ordering.
Thanks for the links.
Yes, that's right. That's basically the ordering they find. And I'm finding something like that too. IV is the worst, but it has the smallest sample in the APE data, so take that with a grain of salt. But RDD and DiD are also hacked in these papers.
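Continuing the sketch above, the same ratio split out by method is how you'd check that ordering yourself -- again assuming a hypothetical `method` column tagging each estimate's identification strategy:

```python
# Same DataFrame as in the earlier sketch, now assuming each estimate
# is tagged with its identification strategy ("IV", "DiD", "RDD").
def bunching_ratio(t, width=0.20):
    above = t.between(1.96, 1.96 + width).sum()
    below = t.between(1.96 - width, 1.96, inclusive="left").sum()
    return above / below if below else float("nan")

by_method = df.groupby("method").agg(
    n=("t", "size"),
    ratio=("t", bunching_ratio),
)
print(by_method)  # a small n (as with IV here) makes the ratio noisy
```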
Interesting! But how do you know their ‘system’ isn’t telling their ‘agent’ to use these kinds of estimators and/or figures? At the end of the day, APE is just Claude Code with a skill, and the model will follow whatever you put in there. Before attributing this to the model having “learned” disciplinary norms from training, it’s worth seeing what the actual prompt and skill files specify.
You can see the prompting. And that's the entire project of APE. The entire project of APE is to NOT do that. It's a fully automated production cycle. You can find all the prompts in the online repo too.
Maybe I’m missing something, but the skill.md they seem to be calling in their demos is not in the public repo? A lot of the prompting can be hidden in there, no?
Let me see. You may have to spend more time poking around their repo. David sent me his prompting, so I assumed it was the same thing I was looking at. But I don't think you have to worry about it being explicitly asked to p-hack. I think it's fairly hands-off from what David shared.
Ohhh sorry for not being clear -- I’m not worried about the system being explicitly asked to p-hack; my comment was more about your observation that the system chose these methods and figures autonomously.
Oh, I'm more certain of that than anything else, because I've done three of these myself and that's exactly what it did. I posted about one of them here:
https://causalinf.substack.com/p/claude-code-29-can-claude-code-find
You can watch the video to see me prompting, and then you can see the figures (including the event study).
Cool. Undark reported on something similar a while back -- cherry-picking and other problems. Very similar to your stuff here: https://undark.org/2026/01/26/ai-scientists-human-research/
I wonder where the AI agents learned to do this...
https://www.youtube.com/watch?v=KUXb7do9C-w
Oh man! That made my day. It'd been so long since I'd seen that. Well played!