Discussion about this post

Jason Fletcher

What do you think about an alternative framing of these results: "Claude reminds us of pathologies with the propensity score that we often forget or overlook"? The policy recommendation would then be more specific: code that uses the propensity score should give the user explicit warnings about the operations it is performing and about how it handles PS=1, PS=0, and other issues.
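As a rough illustration of what such a warning step might look like, here is a minimal sketch. The function name `check_propensity_scores` and the tolerance `eps` are hypothetical and are not taken from any of the packages discussed in this thread.

```python
import warnings


def check_propensity_scores(scores, eps=1e-6):
    """Warn about propensity scores at or near 0 or 1.

    `eps` is an illustrative tolerance, not a recommended default.
    Returns the number of flagged scores.
    """
    near_zero = sum(1 for p in scores if p <= eps)
    near_one = sum(1 for p in scores if p >= 1.0 - eps)
    if near_zero:
        warnings.warn(f"{near_zero} propensity score(s) at or near 0")
    if near_one:
        warnings.warn(
            f"{near_one} propensity score(s) at or near 1; "
            "weights of the form p / (1 - p) are unstable or undefined"
        )
    return near_zero + near_one
```

A caller could run this check right after estimating the scores and decide whether to stop, trim, or proceed, rather than having the issue handled silently.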

Brantly Callaway

Hi Scott, interesting post though I would like to offer a response on a few aspects:

1. The R `did` package is our main implementation of Callaway and Sant'Anna. The Stata `csdid` package was originally intended as a clone of `did`, but made some different implementation choices along the way (which was a mistake in hindsight) and we are working on getting them to match in all cases. You should expect `ddml` to give different results by construction. As for the Python packages, I am not involved with any of them. The Python package whose authors have interacted with us the most is `csdid`. I would expect it to give very similar results to `did`, but it was not among those you tried.

2. A bigger issue is that common support is violated in this application. Common support is an important assumption in Callaway and Sant'Anna, so, in this case, the packages are being used in a setting where the required underlying assumptions are not met. Specifically, many of the implied "desired comparisons" are not feasible, which is not an issue that any software implementation can fix. And if you try to run these packages in these conditions, they are going to be inherently unstable (dividing by numbers very close to 0) and small differences in implementation or numerical differences would be magnified greatly. We also have tools for diagnosing common support violations and potentially trimming units with extreme propensity scores, though this would change the target parameter of the analysis.

3. I am almost certain that Claude Code ignored a lot of warnings here. The R package would have reported many warnings about small groups and extreme propensity scores that do not seem to have been picked up.

Brant
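To make the instability in point 2 concrete, here is a small sketch. It assumes an inverse-probability-weighting setup in which a comparison unit with propensity score p receives the odds weight p / (1 - p). The helper names and the trimming threshold are hypothetical, and this is not the implementation used by `did` or `csdid`.

```python
def odds_weight(p):
    """Odds weight p / (1 - p) for a comparison unit with propensity score p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("propensity score must be in [0, 1)")
    return p / (1.0 - p)


def trim(scores, upper=0.995):
    """Keep only scores below `upper` (an illustrative threshold, not a default)."""
    return [p for p in scores if p < upper]


# As p approaches 1, the denominator approaches 0 and the weight grows rapidly.
for p in (0.5, 0.9, 0.999, 0.9995):
    print(p, odds_weight(p))
```

Moving a score from 0.999 to 0.9995 roughly doubles the weight, so small numerical or implementation differences in the estimated scores can translate into large differences in the estimates. Trimming removes the offending units, but, as noted above, it changes the target parameter.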
