Claude Code 19: When the Reclassification Is Massive But the Trends Don't Change, Something Interesting Is Happening (Part 4)
Testing the marginal-cases hypothesis with a thermometer, and letting Claude Code run wild on datasets
In Parts 1 and 2, I showed you the setup and the punchline: gpt-4o-mini agreed with the original RoBERTa classifier on only 69% of individual speeches, but the aggregate trends — partisan polarization, country-of-origin patterns, the whole historical arc — were virtually identical. Over 100,000 labels changed and yet the original story didn’t.
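If the gap between "only 69% agreement" and "identical trends" feels paradoxical, a toy example may help. The sketch below is not the actual pipeline or data from this series: the labels are made up, the category names are placeholders, and `agreement_rate` and `label_shares` are hypothetical helpers. It only shows that item-level agreement and aggregate shares are different quantities, so many labels can flip while the totals stay put, as long as the flips offset each other.

```python
from collections import Counter

# Made-up labels for ten hypothetical speeches (NOT the real data).
# "old" stands in for the original classifier, "new" for the re-run.
old_labels = ["pro", "anti", "neutral", "pro", "anti",
              "neutral", "pro", "anti", "neutral", "pro"]
new_labels = ["pro", "neutral", "anti", "pro", "anti",
              "neutral", "anti", "pro", "neutral", "pro"]


def agreement_rate(a, b):
    """Share of items that received the same label from both classifiers."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def label_shares(labels):
    """Aggregate share of each label, ignoring which item got which label."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Four of the ten labels changed, so item-level agreement is only 60%...
print(agreement_rate(old_labels, new_labels))  # 0.6

# ...but every flip is offset by a flip in the other direction,
# so the aggregate shares are exactly the same.
print(label_shares(old_labels) == label_shares(new_labels))  # True
```

In this toy case the offsetting is exact by construction. In real data it would only ever be approximate, which is why the aggregate match is something to test rather than assume.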
That resul…



