Discussion about this post

Sarah Hamersma

I have three thoughts, as someone who is not reading the Claude Code entries but is interested in the broader issue you bring up here.

1) How on earth is the software generating 75% of the variation? You put (concerning) where I would put (terrifying!!!). It's hard to imagine variation in tiebreaking rules and such would be enough to do that.

2) I think you should give empirical researchers more credit. While we haven't integrated this source of variation into our standard errors, it is completely normal (and expected) to revisit as many forks in the road as possible and see if results change appreciably. Current papers often have a visualization with 30+ alternative specifications all together on one graph showing how much this variation in reasonable specification or measurement choices seems to matter (see the toy sketch after my last point below). It is not formally integrated into standard errors, but it is there to help us see whether we should be more or less confident that we've got reasonable bounds on betahat. Getting the inference right is more important to me than being sure my final, singular p-value from the "main estimate" is corrected for this kind of error.

3) Which brings me to the last thing. I am not sure I want to think of these as error in the same sense as the other two sources. Some set (maybe a large set) of the different specifications are biased because they do not meet the (untestable) assumptions about selection on observables or whatever. I'm sort of reminded of the conversation between LaLonde 86 and Heckman and ?? (maybe Robb?). LaLonde says "oh my goodness, the vast majority of non-experimental estimates are garbage" and Heckman says "well, sure, but we could have thrown out half of them ex ante as garbage, so the success rate isn't as bad as you think." I mean, letting the program select covariates means you are certainly comparing poorly-specified to better-specified options. It may be hard to know which are which, but my point is: it's not a natural source of variation; it's what happens when you mix unbiased and biased estimates together and then look at the distribution. I mean, perhaps it's error in the most real sense, that some of these estimates are erroneous!

I may be missing the boat with this last one, if you have reason to think that all of the coding processes are ruling out all the biased estimates, but I guess I'm not convinced they can be distinguished. Wheat and weeds growing up entangled together in the field, right? So I'm not saying we shouldn't care about this error, just that I don't see it as having the same sort of statistical properties that would allow us to work it into our standard errors somehow. Or maybe I'm saying that looking at those crazy robustness figures isn't a bad substitute for some formalization that would treat the different estimates symmetrically.
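For concreteness, here is a toy Python sketch (all variable names and numbers are made up, not from the post) of the kind of specification-curve figure I mean, set up so that one of the candidate controls is a genuine confounder: the specifications that include it cluster around the true effect, the ones that omit it cluster around a biased value, and the spread across specifications is a mixture of sampling error and bias rather than one well-behaved error term.

```python
# Toy specification curve: regress y on x under every subset of candidate
# controls.  One control ("u") is a genuine confounder, so subsets that omit
# it give biased estimates -- the spread across specifications mixes sampling
# error with omitted-variable bias.  Everything here is illustrative.
import itertools
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, true_beta = 500, 0.5

u = rng.normal(size=n)                          # confounder
x = u + rng.normal(size=n)                      # variable of interest, correlated with u
controls = {
    "u": u,                                     # the one control that matters
    "z1": rng.normal(size=n),                   # irrelevant candidate controls
    "z2": rng.normal(size=n),
    "z3": rng.normal(size=n),
    "z4": rng.normal(size=n),
}
y = true_beta * x + u + rng.normal(size=n)

specs = []
names = list(controls)
for k in range(len(names) + 1):
    for subset in itertools.combinations(names, k):
        X = np.column_stack([np.ones(n), x] + [controls[c] for c in subset])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        se = np.sqrt(resid @ resid / (n - X.shape[1])
                     * np.linalg.inv(X.T @ X)[1, 1])
        specs.append((beta[1], se, "u" in subset))

specs.sort(key=lambda s: s[0])
est = np.array([s[0] for s in specs])
ci = 1.96 * np.array([s[1] for s in specs])
has_u = np.array([s[2] for s in specs])
idx = np.arange(len(specs))

plt.errorbar(idx, est, yerr=ci, fmt="none", ecolor="lightgray")
plt.scatter(idx[has_u], est[has_u], s=8, label="controls for the confounder")
plt.scatter(idx[~has_u], est[~has_u], s=8, label="omits the confounder")
plt.axhline(true_beta, ls="--", color="k", label="true effect")
plt.xlabel("specification, sorted by estimate")
plt.ylabel("estimated effect of x")
plt.legend()
plt.show()
```

With five candidate controls that is 32 specifications; in this particular toy setup the half that omit the confounder center near 1.0 rather than the true 0.5, which is exactly the wheat-and-weeds mixture I mean.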

Benjamin Daniels

Simonsohn, Simmons, and Nelson suggest a combined bootstrap-specification inference procedure; is that along the lines of what you're thinking? http://urisohn.com/sohn_files/wp/wordpress/wp-content/uploads/specification-curve-published-hand-corrected.pdf
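Their idea, roughly and in my own words: impose the null on the data, re-run the entire set of specifications on each resampled dataset, and ask how extreme the observed curve is relative to that null distribution. A permutation-flavored toy sketch (my paraphrase, not the paper's exact bootstrap procedure):

```python
# Toy inference on a whole specification curve: shuffle x to impose the null,
# re-run every specification, and compare the observed median estimate to the
# resulting null distribution.  Names and data are illustrative.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
controls = {c: rng.normal(size=n) for c in ["c1", "c2", "c3"]}
y = 0.3 * x + controls["c1"] + rng.normal(size=n)   # made-up data, true effect 0.3

def curve_estimates(x, y, controls):
    """Slope on x for every subset of the candidate controls."""
    names = list(controls)
    out = []
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            X = np.column_stack([np.ones(len(y)), x]
                                + [controls[c] for c in subset])
            out.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.array(out)

observed = np.median(curve_estimates(x, y, controls))

null_medians = np.array([
    np.median(curve_estimates(rng.permutation(x), y, controls))
    for _ in range(500)
])
p = np.mean(np.abs(null_medians) >= abs(observed))
print(f"observed median estimate {observed:.3f}, permutation p ~ {p:.3f}")
```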
