20 Comments
Sarah Hamersma

Amazing. Two things:

1. I was fascinated by your p-hacking piece and wanted to get back to finish it and try to get my head around it. And the confirmation bias was strong with me! "It's the training data." Yep, it's going to double down on our mistakes. For what it's worth, I'm still confident that AI will improperly interpret statistical significance because of all the bad training data out there (e.g., findings described as indicating "no effect", etc.). Maybe I should try to analyze that somehow. ANYWAY, the piece today is fantastic and such a great reminder about how we understand and practice precision.

2. This is coincidentally very related to something I'm working on with some coauthors. One of our outcomes has a fairly low mean of y, and the effect size is very small. At the bottom of each column in the table we report the approximate percent change from baseline, i.e. the coefficient over the mean of y. In general I like this statistic. However, we are in a debate about whether to report the percent change calculated in the software, which does not round these two numbers, or the percent change you get when you use the reported (rounded) numbers. Many of them come out as whole numbers when we do the latter, since we round to 3 decimal places and the effect size is usually less than 0.005 (see the sketch after point 3 below). The reason for this is the same as what you were addressing, just less consequential because it's not the t-stat, just more of a contextual statistic. That said, I'm curious what you would do. (I won't state my preference, as I don't want to bias you.)

Bonus:

3. Very related: why don't we use the concept of rounding to significant digits instead of decimal places in econ tables? In science it's very clear that if numbers are smaller, you adjust the decimal places to get to a comparable reporting of precision. I have done this before, having more decimal places for a column with smaller effect sizes, but it seems to rock the boat.
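To make the debate in point 2 concrete, here is a minimal sketch with made-up numbers (a coefficient below 0.005 and a small baseline mean; nothing from our actual paper):

```python
# Percent change from baseline = coefficient / mean of y.
# Hypothetical unrounded values, as the software would hold them:
coef = 0.00437    # effect size below 0.005
mean_y = 0.213    # low baseline mean

pct_software = 100 * coef / mean_y                    # unrounded ratio
pct_table = 100 * round(coef, 3) / round(mean_y, 3)   # ratio of reported numbers

print(f"software:   {pct_software:.2f}%")   # 2.05%
print(f"from table: {pct_table:.2f}%")      # 1.88%, i.e. 0.004 / 0.213

# Once the coefficient is reported as 0.004, the table-based ratio can
# only move in jumps of 0.001 / mean_y, so many columns land on the
# same coarse values. Rounding to significant digits instead of
# decimal places (point 3) keeps relative precision comparable:
print(f"{coef:.2g} / {mean_y:.2g}")         # 0.0044 / 0.21
```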

Ilkka Sillanpaa

Like your 3rd point.

Usually people are critical of reporting additional digits, since they are often not meaningful; our results are rarely that precise anyway. But sometimes the extra digits seem to matter, too.

scott cunningham

Only you would be so gracious and considerate to me that you’d ask me what I would do when I just royally messed up so badly doing it! Lol.

I think what I never once thought about was the relationship between the things we do for communicating (i.e., choosing imprecision purposefully) and how that can pop out so weirdly when those reported numbers are then used in transformations like ratios. I also report percentage changes relative to baseline means a lot more than I used to. So I guess my lesson here is that these are ratios. When the inputs are small, they collapse into discrete values and, incredibly, can shift around a lot in ways that are misleading, like my example did. Like how the t-stat in my example was truly 1.67 but became 2 if I rounded the inputs first.

So I think you have to do all transformations on the non-rounded data. We have to round continuous values by definition; you cannot report all the digits of pi, for instance. But at minimum, you need to let the software make the original transformations from the unrounded values.
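To make that concrete, here is a minimal sketch using the 0.035 and 0.021 figures (rounding half up, the usual table convention, via Decimal, since Python's built-in round ties to even and is fragile at midpoints like 0.035):

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x, places):
    # Round the way tables usually do (half up), avoiding Python's
    # built-in round(), which ties to even.
    q = Decimal(10) ** -places
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

coef, se = 0.035, 0.021

t_true = coef / se                                        # 1.666...
t_table = round_half_up(coef, 2) / round_half_up(se, 2)   # 0.04 / 0.02

print(f"t from unrounded inputs: {t_true:.2f}")   # 1.67
print(f"t from rounded inputs:   {t_table:.2f}")  # 2.00
```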

Which frankly probably means we can't do things by hand once we have results. We always have to use locally stored macros and do those calculations there. It even seems like a version of the way copy errors creep in when you try to paste regression output into Excel. Not the same thing, because we usually say that's an actual mistake; this is a more subtle issue.

David Hudson

Good for you, Scott.

Larry Santucci

I wonder if Claude would have drawn the same conclusions as you.

scott cunningham

I don't know. Claude and I did all the analysis that found the problem in the first place. He's my "thinking partner". And when I talked to him about Jason's comment, he immediately looked into it and at light speed agreed that it was an issue. But then it took me five hours of working with him until we were on the same page. So he was actually the source of the problem in the first place. Because I don't know the publication bias literature (but Jason does), I didn't know to press on this, and he didn't suggest it, even though that literature is in his training data (including Brodeur, ironically), and even though we designed the plan to ship the manuscripts to OpenAI to extract the coefficients and standard errors. So I don't know what to make of that, except that it happened while I was working closely with him on all of it. That's the reminder to me: when I work with him on things where I am not already an expert, it's so easy for me to not recognize what he's doing.

Larry Santucci

That itself is a valuable lesson. The LLMs seem to have a conceptual inertia, so to speak, that can prevent them from having those OMG moments we humans have when a dissonant thought creeps in. Not all the time, because you do observe some reversals and rethinkings in thinking mode, but I guess it doesn't quite replicate all aspects of human insight.

Jason Fletcher

I think this is part of a larger 'teachable'/training point to make. I sketched out the main point here, maybe more to come. https://jasonmfletcher.substack.com/p/owning-all-the-numbers?r=i2hui&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

scott cunningham

I think I'm going to update my /referee2 or come up with a "Jason Fletcher / Bobbi Wolfe" skill that lists a series of steps to try to identify all "non-main-coefficient" patterns. List them out. Submit them back to me. I need the help anyway, to be honest, with trying to see what I cannot see. So thanks for making the effort not just to point this out, but to see it in the first place.

scott cunningham

Yeah, I think I'm going to have a new skill based on your description of your instincts and training, called /fletcher, if that's okay.

scott cunningham

Tell me if you're not super comfortable with my having made this:

https://github.com/scunning1975/MixtapeTools/tree/main/skills/fletcher

I can rename it, but I kind of felt like this merited your name, since you noted this more general researcher approach. Not quite skepticism, but more about trying to see what is easy to miss.

Jason Fletcher

very flattering--we'll see if this works!

Winston

Scott, thanks for sharing this!

A side question: You wrote, "More recently, I report confidence intervals." I'm curious, what led you to start reporting CIs? I'd love to see more people doing this.

scott cunningham

It was newer work on instrumental variables and diagnostics for weak instruments. There have been papers and review articles suggesting we report not just the Montiel Olea and Pflueger effective F-statistic on the strength of the first stage, but also Anderson-Rubin intervals, as they are robust to weak instruments. Those are confidence intervals, basically. So I started reporting them. And as I reported those, I thought maybe it wouldn't be the worst thing to do it more generally. I am not the world's greatest fan of the 95% CI, since they're so easily misinterpreted; I'm teaching that in my stats classes these last two weeks. They lend themselves to just as many misunderstandings as the p-value. But I have been doing it a bit more.
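For what it's worth, the test-inversion logic behind the AR interval is simple enough to sketch. This is a toy just-identified example with simulated data (one endogenous regressor, one instrument; the variable names and grid are illustrative, not my actual code):

```python
# Anderson-Rubin 95% interval by test inversion, just-identified case.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
d = 0.3 * z + u + rng.normal(size=n)        # fairly weak first stage
y = 0.5 * d + u + rng.normal(size=n)        # true effect 0.5, d endogenous

accepted = []
for b0 in np.linspace(-2, 3, 1001):         # grid of candidate effects
    # AR idea: if b0 is the true effect, y - b0*d should be unrelated
    # to the instrument. In the just-identified case, testing the z
    # coefficient in this regression is the AR test.
    res = sm.OLS(y - b0 * d, sm.add_constant(z)).fit()
    if res.pvalues[1] >= 0.05:              # fail to reject at 5%
        accepted.append(b0)

print(f"AR 95% interval: [{min(accepted):.2f}, {max(accepted):.2f}]")
```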

Winston

Agree that CIs are easily misinterpreted! I like the "compatibility interval" interpretation from Sander Greenland and his collaborators.

It's based on the duality between CIs and tests: e.g., a 95% CI for the ATE is the range of ATE values with p-value >= 0.05, so we can interpret the CI as the range of ATEs that the data are reasonably "compatible" with, given all the assumptions behind the method. I alluded to this in a mini-thread on Twitter (see also the sketch at the end of this comment):

https://x.com/linstonwin/status/2019788581233922517

and the two 3-page articles that I assigned to my honors intro stats students are:

Amrhein, Greenland, & McShane (2019), "Retire statistical significance"

https://www.blakemcshane.com/Papers/nature_retire.pdf

Rovetta, Mansournia, Stovitz, Adams, & Greenland (2025), "Interpreting p values and interval estimates based on practical relevance: guidance for the sports medicine clinician"

https://doi.org/10.1136/bjsports-2024-109357
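Here's a minimal sketch of that duality in Python, using a toy one-sample mean problem with made-up data (the grid and names are just for illustration): the 95% t-interval coincides, up to grid resolution, with the set of null values the t-test does not reject.

```python
# CI/test duality: the 95% CI equals the set of nulls with p >= 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.4, scale=1.0, size=50)

# Standard 95% t-interval for the mean:
lo, hi = stats.t.interval(0.95, df=len(x) - 1,
                          loc=x.mean(), scale=stats.sem(x))

# Invert the one-sample t-test over a grid of null values:
grid = np.linspace(x.mean() - 1, x.mean() + 1, 2001)
compatible = [m0 for m0 in grid
              if stats.ttest_1samp(x, popmean=m0).pvalue >= 0.05]

print(f"t-interval:   [{lo:.3f}, {hi:.3f}]")
print(f"by inversion: [{min(compatible):.3f}, {max(compatible):.3f}]")
```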

Aadarshkumar Jadhav

This is a genuinely rare thing to read. Most people quietly update and move on. Writing out the full five-hour forensic trail, the rounding mechanics, the simulation, the donut hole drop from 1.52 to 1.02, that's the kind of correction that actually teaches something instead of just walking it back.

The rounding-to-ratio problem is one of those things that feels obvious in retrospect and completely invisible until it bites you. The 0.035/0.021 example landed hard. Small units, small coefficients, and suddenly you've manufactured a t-stat of 2 out of a true 1.67. I've probably never thought about this either, and now I won't stop thinking about it.

We're building a course called "Master Claude in the Real World," and this post shows exactly the kind of workflow we want people to be capable of: catching your own error, using Claude Code to simulate and verify, and publishing the correction with the receipts. That's a high bar, and this hit it. Just launched on Kickstarter, for anyone building toward that kind of fluency: https://shorturl.at/ZrG8p

The David Yanagizawa-Drott paper is going to be worth watching closely.

Alexis J. Diamond

Another great post. Very thought-provoking. I never thought about this rounding failure mode before.

I also really like your simulated empirical example. I'm wondering why you chose to compare a histogram to a smoothed density plot in your figure, when the key distinction (I think) is that left-side results are rounded while right-side results are not. By comparing different plot types, someone (not me; I'm fully persuaded by your argument) might wonder whether the fact that the right and left sides look really different is due to the rounding/not-rounding, or due to the tuning parameter chosen for two different data-viz designs (each design being quite sensitive to tuning choices). I'm just wondering: if you were to compare two histograms with the same number of breaks, or two density plots with the same bandwidth, is the visual impact much the same?
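Something like this sketch is what I have in mind, with simulated (made-up) coefficients and standard errors: two histograms with identical breaks, so any visual difference left comes from the rounding rather than the plot type.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
coefs = rng.normal(0.03, 0.02, size=5000)
# Clip SEs away from zero so rounding never produces a 0 denominator:
ses = np.clip(np.abs(rng.normal(0.02, 0.005, size=5000)), 0.006, None)

t_exact = coefs / ses
t_table = np.round(coefs, 2) / np.round(ses, 2)   # round-then-divide

bins = np.linspace(-2, 6, 81)                     # shared breaks
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(9, 3.5))
axes[0].hist(t_table, bins=bins)
axes[0].set_title("t from rounded inputs")
axes[1].hist(t_exact, bins=bins)
axes[1].set_title("t from unrounded inputs")
plt.tight_layout()
plt.show()
```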

scott cunningham

Let me look into this after class. I am basically now 100% committed to the weirdest research agenda which is "all the ways I was wrong last week in a Substack post" lol. But I really do want to get into this more.

Patrick Signoret

“It took maybe less than 5 minutes for OpenAI to analyze 3500 papers and extract those coefficients and standard errors.” Why 3,500 papers? I thought it was 651; I think I'm misunderstanding part of the process.

scott cunningham

That's a typo. 3,500 regression coefficients from 651 papers. Let me fix it.