6 Comments
Dr Sam Illingworth's avatar

Thanks, Scott, for another brilliant post, and also for highlighting how hallucinations are consistent across different languages and language models. I also love how you use a persona to evaluate your work by assuming that I'm a reviewer. The Claude skills I've set up for this purpose basically imagine my PhD supervisor at their most extreme 😂

scott cunningham's avatar

Well, my conjecture is actually the opposite. I think they're independent errors, and so the errors won't be correlated. Each will likely have its own errors, but they're unlikely to systematically make the same ones.

Sam Sturm's avatar

What would it mean for the errors across languages to be independent of each other? Like sociologically/philosophically. I suspect they are not, for much the same reason that you can sometimes parse out the native language of someone speaking to you *not in that language*. (For example, native Spanish speakers will often overuse “How” in English relative to native English speakers; “How do you call…” instead of “What do you call…”). I imagine that coding languages work the same way, and that there are ways of thinking about code that lead programmers of primarily one language to make similar errors across other languages. And so maybe R errors are much more related to Stata errors than they are to Julia errors?

scott cunningham's avatar

This is a really sharp question. The Spanish speaker analogy is exactly right for human programmers. A Stata native writing R will make Stata-shaped errors — they'll reach for replace instead of mutate, forget that R is 1-indexed, etc. The errors have a common cause (the programmer's mental model), and that common cause induces correlation across languages. For humans, your instinct is correct.

But I think the mechanism is different for LLMs, and here's why.

When I say "hallucination error," I'm being very specific. I'm not saying the LLM doesn't know the correct syntax. It does. It's been trained on the Stata manuals, the R documentation, the Python docs. It knows that in Stata you need & olddog~=. to handle missing values. It knows that in R you need na.rm = TRUE. The error isn't ignorance — it's that on any given token prediction, there's some probability it just... doesn't write it. It draws from the probability distribution over next tokens, and sometimes the draw is wrong.
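The same kind of silent failure has a direct Python analog; here's a minimal sketch (the age values are invented for illustration) of an error that is one missing token, not missing knowledge, and that corrupts the answer without raising anything:

```python
import numpy as np

# Invented ages with one missing value -- the Python analog of
# Stata's "& olddog~=." and R's "na.rm = TRUE" situations above
ages = np.array([3.0, 7.0, np.nan, 12.0])

careless = np.mean(ages)     # the "forgotten token" version: NaN propagates
careful = np.nanmean(ages)   # the version that explicitly handles missings

print(careless)  # nan -- no error raised, the result is just wrong
print(careful)   # roughly 7.33
```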

That's the sense in which I'm calling it "random." Not random like "the LLM is guessing." Random like: the LLM has a 95% chance of writing the correct token sequence, and today you drew from the 5%.
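To put illustrative numbers on that (the 5% slip rate is assumed, not measured):

```python
# Assumed, purely for illustration: each independent run has a 5% chance
# of slipping on a given implementation step
p_slip = 0.05

# For the slip to survive a three-language audit, all three runs must
# slip at the same step...
p_all_slip = p_slip ** 3
print(p_all_slip)  # roughly 0.000125

# ...and even then they must produce the *same* wrong output to agree,
# which independent draws make less likely still.
```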

Now here's where your independence question bites. If the error is "Claude doesn't understand what a universal base period means in Callaway-Sant'Anna," then yes — that conceptual error will show up in every language. That's a common cause, just like the Spanish speaker's mental model. Those errors are correlated and my audit won't catch them.

But if the error is "Claude wrote g > 0 instead of g > 2002 in the filter condition," that's an implementation error. And the token sequence that produces g > 0 in Stata is a completely different token sequence than the one that produces df[df['g'] > 0] in Python or filter(df, g > 0) in R. Each arises from a different context window, different surrounding syntax, different token probabilities. There's no common cause linking them.
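Here's a toy version of that filter slip (the data frame is invented for illustration): the same conceptual task, "keep treated cohorts," written with the hallucinated condition versus the intended one, and the row counts immediately disagree:

```python
import pandas as pd

# Invented cohort variable g: 0 = never treated, otherwise treatment year
df = pd.DataFrame({"g": [0, 0, 2001, 2003, 2005]})

wrong = df[df["g"] > 0]      # hallucinated filter: keeps the 2001 cohort too
right = df[df["g"] > 2002]   # intended filter

# A checkpoint on observation counts catches the disagreement immediately
print(len(wrong), len(right))  # 3 2
```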

So the honest version of my claim is: cross-language replication is most powerful for catching implementation hallucinations (wrong variable, wrong filter, wrong merge key) and least powerful for catching conceptual hallucinations (misunderstanding what the estimator does). The good news is that implementation errors are the most common kind, and they're the ones that silently corrupt your results without throwing an error message.

And this is actually why I designed the audit with checkpoints at every stage of the pipeline — data loading, variable construction, propensity scores, balance tables — not just the final estimation. The early checkpoints are pure implementation where independence is most plausible. If all three languages produce the same number of observations, the same cohort counts, the same probit coefficients to 8 decimal places, then you know the data processing pipeline is clean. That's where the audit has the most bite.
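A sketch of what such a checkpoint comparison could look like (the function, tolerance, and numbers are illustrative, not my actual audit code): collect the same summary numbers from each language's run and demand agreement down to the tolerance:

```python
def checkpoints_agree(runs, tol=1e-8):
    """Compare dicts of checkpoint values (observation counts, cohort
    counts, coefficients) produced by independent language runs."""
    baseline = runs[0]
    for run in runs[1:]:
        for key, value in baseline.items():
            if abs(run[key] - value) > tol:
                return False, key
    return True, None

# Invented outputs from three runs of the same pipeline
stata  = {"n_obs": 10_000, "probit_b1": 0.12345678}
r      = {"n_obs": 10_000, "probit_b1": 0.12345678}
python = {"n_obs": 10_000, "probit_b1": 0.12445678}  # disagrees

print(checkpoints_agree([stata, r, python]))  # (False, 'probit_b1')
```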

Manoel Galdino's avatar

It seems to me you can prompt the agent to be a native speaker of English or Spanish. And similarly, the agent can be a native R speaker or a native Python speaker. The agent can be an econometrician, a computer science guy, or a statistician. It can even be a dplyr or base-R type of person. Or use tabs or two spaces. Someone should design an experiment and measure it. It would be really valuable. A good paper.

scott cunningham's avatar

What would be the goal though? My aim is purely practical -- just have independent replications of the code to find any and all coding errors, since all of the languages and all the packages should calculate things identically down to the decimal point. So it's not really role playing at all. Rather, it's based on the idea that there is absolutely no reason why any agent that takes the raw data and undertakes the given tasks should not find the same thing. So if they didn't find the exact same thing in Python that was found in R and Stata, it means there is a mistake somewhere. And that's the whole point -- the point is to insert verification methodologies.