What do you think about an alternative framing of these results--"Claude reminds us of pathologies with the propensity score that we often forget/overlook"--where the policy rec is more specific: code that uses propensity scores should give the user explicit warnings about the operations it is doing and how it handles PS=1/PS=0 and other issues?
That’s a good framing. I do think though we should be doing these code audits. Do everything three times in all languages to figure this out. Also think we need to be explicit about our packages.
I’d add to this framing that this doesn’t seem so much like “different packages will do fundamentally different things” as that CSDID’s implementation for including covariates requires propensity scores and we’re not so familiar with evaluating that. I suspect if you used a two-stage imputation estimator you wouldn’t find these major differences with controls.
I’m not following. Double robust used two stage imputation and propensity score reweighting. It’s doing both. And these are hidden pscores. The packages often throw them away. My exercise is purely to say that there shouldn’t be variation for the Same spec across packages for the identical estimator. It isn’t about CS but rather the CS packages are giving different results. Maybe you're saying this, but I'm just wanting to make sure I follow.
Yeah I'm perhaps not being clear. When I referred to imputation I meant the Gardner/Borusyak, Jaravel and Spiess estimator; I realize that's unclear because CS also do part of this to impute counterfactuals. My point was in wondering whether the CS estimator is a bit more vulnerable than other estimators to getting different results for different packages because there are different reasonable choices (because of details with propensity score matching), which seems to be most of the reason you get different results if I'm understanding correctly (although the storage of big numbers is an interesting one to keep in mind). Whereas the Gardner et al. version has far fewer degrees of freedom in implementation. It reminds me of the paper that showed that many economists, handed the same data and research question, produced fairly variable estimates. But I think it's quite true that few of us would think that package implementations of the same estimator could be so important.
I think what you're demonstrating is very valuable--every replication I've ever tried to do of an econ paper has turned up weird reasons why it's not possible to fully reproduce. And I think a big reason is that it's been too "costly" for academics to do any kind of "quality assurance" until now (whereas private industry has had strong incentives to do it for a long time), so it's great if that cost declines and we start implementing code audits like you're demonstrating. I'm sure it would greatly improve the reliability and reproducibility of research.
It very well could be but that too is an empirical question. This variation has zero to do with the estimator. Nothing is probably more straightforward than logit. There should be zero variation for logits across packages and yet there is.
And by logit variation, I literally mean package variation mind you, all under the hood with regards to even such basic things as whether the package will even run the logit it claims it is (eg silently dropping covariates).
Sorry for the multiple posts. But this is under the hood. People are often not even given the propensity score as stored macros when you run CS. So you can’t even trim as a consequence, bc trimming is pre-package, and then the CS package just generates a new one. And faced with the identical spec, one package randomly drops covariates, one spits out errors, one handles machine epsilon differently, and so on.
You can always switch estimators, but switching estimators based on package glitches is not principled or scientifically sound anyway. The heterogeneity-robust estimators have not adopted covariates uniformly--some papers don’t even cover them at all but have packages that allow for them. It’s known, though, that without flexible modeling of the Xs via saturation, regression with additive Xs imposes stronger assumptions, like constant treatment effects with respect to those very Xs. So it isn’t a free lunch to switch over if you’re explicitly not wanting to make those claims.
minor point--I just want to confirm my understanding--I think a comment in this thread is incorrect: "My point was in wondering whether the CS estimator is a bit more vulnerable than other estimators to getting different results for different packages because there are different reasonable choices (because of details with propensity score matching), which seems to be most of the reason you get different results if I'm understanding correctly (although the storage of big numbers is an interesting one to keep in mind). "...
To the best of my understanding, the problem is not with propensity score matching. Does CS involve propensity score matching? I thought it was propensity score weighting (correct me if I'm wrong, I haven't used or taught CS). Don't blame "matching" (the poor method takes a lot of flak, lol). The problem, as I understand it, is that the CS package runs a vanilla, everyday logistic regression and the researcher has no easy way to do basic checks for quasi-perfect separation (which means the researcher shouldn't have run logit on that data in the first place), b/c relevant intermediate diagnostic results are hidden inside the package. So I think it's largely a software design issue.
That’s right — it’s inverse probability weighting. So it isn’t matching; if I said that, I misspoke. I tend to use that word when trying to get people to see what’s under the hood of CS, only because IPW and weights aren’t clear to many. So I say “it’s like matching on units with comparable Xs” as a shorthand. But you’re 100% right that CS is IPW reweighting the comparison units using baseline Xs for a given cohort comparison.
And you’re 100% right that it’s using vanilla logistic regressions. Now you could do it yourself for sure — CS as a simple recipe. Calculate 2×2 means in outcomes, reweight by IPW and/or an outcome regression model. You can get the propensity score however you want but the point is you cannot feed that into these standard packages.
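For concreteness, here's a minimal sketch of that recipe in Python. This is purely illustrative--the function name and inputs are mine, not from any package--and it assumes the propensity scores have already been estimated somewhere upstream:

```python
# Hypothetical sketch of the 2x2 IPW step in the CS recipe, assuming
# propensity scores p = P(treated | X) were already estimated elsewhere.
def ipw_att_2x2(delta_y, treated, pscore):
    """ATT for a single cohort-by-period 2x2 comparison via IPW.

    delta_y : outcome changes (post minus pre) per unit
    treated : 0/1 treatment indicators
    pscore  : estimated propensity scores
    """
    # Treated units: simple mean of outcome changes
    t_mean = sum(d for d, t in zip(delta_y, treated) if t) / sum(treated)
    # Comparison units reweighted by p / (1 - p); note this blows up
    # as p -> 1, which is exactly the overlap failure discussed here
    w = [0.0 if t else p / (1.0 - p) for t, p in zip(treated, pscore)]
    c_mean = sum(wi * d for wi, d in zip(w, delta_y)) / sum(w)
    return t_mean - c_mean
```

With the scores in your own hands you could trim before this step; the point in the thread is that the packages re-estimate their own internal score, so a hand-rolled route like this is the only way to control it.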
My one worry is how often people get CS wrong even then. I am not sure I’m super enthusiastic about the complete overconfidence I see in the wild in even what CS is doing, but it could be I just have to rest more.
Yes, fair and thanks for your thorough reply. You’re of course right that there’s no “free lunch” in switching estimators, it entails substantive differences. And it’s definitely a cause for concern when there are different things happening under the hood that we can’t even diagnose. I will definitely continue to read all your CC posts, I’m learning a lot
Thanks! I'm learning a ton too. I want to keep figuring out this logit thing also. It just doesn't seem like there should be this much variation. I think it's because I would always just write down the log likelihood and solve it in school. I have such a dismal understanding of what is literally going on inside a real computer, and since I specialize in "syntax", and less so the underlying machines, it honestly didn't even occur to me something like this could even hypothetically be true. And I really don't think people know about it. I think it's an undocumented source of researcher degrees of freedom. I bet you twenty bucks no one has really just sat down and said "and another source of researcher discretion is whether they used R's did or Stata's csdid", because it's ordinarily the research design, the estimator, and the cleaning that have been suggested as possible sources of subjective bias. How your computer, language, and package within that language handles a number on the order of 10^10? Or whether that particular package's underlying architecture does or does not spit out a refusal to estimate, or whether it passes along garbage coefficients--I don't think this is known.
Wait — talking to sarah I maybe misunderstood you Jason. The issue isn’t with the propensity score. The issue is with the logistic regression and how the six different packages across the three languages handle floats and perfect separation, whether they spit out errors, whether they silently fall back to regressing the treatment indicator onto a constant despite being fed Xs, and so on. It’s got nothing to do with the propensity score — it’s six different logit packages, each of which apparently handles these hidden details slightly differently, and it’s the package (not the estimator) doing the handling.
I should have said: "pathologies with the implementation of logistic regression among different statistical packages" rather than "with the propensity score". But I read what you showed a little differently--you showed that with no covariates everything looked great. It was when you stress-tested the logistic regression step (probably in ways you didn't mean to, like using raw population) that things fell apart. My guess is that very few published papers would have stressed a logistic regression like this, and therefore the "different answers across statistical packages" would be pretty rare. Just a guess though.
Hi! Have to say that I've been enjoying the series a lot. Have a doubt, what happens if you use your Referee 2 persona? Maybe I'm missing something, but I remember that the idea of this persona was to obtain the same coefficients using 3 different packages. In the case of the PNAS paper, a DiD wasn't used, but what will happen if you do the exercise with the Brazil paper? If the same coefficient is achieved, maybe it is because it used the 0 covariates baseline. But what happens if it uses a positive number? The same ATT will be obtained, but with different covariates?
In this case, it was coming from the number of Xs, one of which was on the order of 10^10. So Referee 2 would’ve flagged the failure, but not the reason why. I would’ve had to reconcile them after I read his report, and then cc would’ve found this out. That’s the thing — it’s the circling that gets to the bottom of it.
Thank you for another thoughtful piece. Even better than reporting package and version would be to require containerized version of the project (e.g. dockerfile + pinned dependencies), which Claude could also prepare for you so as not to increase overhead too much.
Thanks Scott, great post! I agree with your points, and it's a service that you did this work and explained it so clearly. Here's what you made me think about:
a commenter wrote: "Claude reminds us of pathologies with the propensity score that we often forget/overlook". I would put it differently: *Scott* reminds us, even more broadly, of pathologies with *logistic regression* when data is quasi-separable. Any logistic regression. When running any regression, I think it's always a good idea to see what the predicted values are for your training set (for lots of reasons, including the possibility of catching a coding error). In a logistic regression, when predicted values come back as 0 or 1, that should trigger concern. (In R, quasi-separated data trigger a warning message.)
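As a minimal illustration of that check (the function and the 1e-8 cutoff are my own arbitrary choices here, not any package's API):

```python
def flag_quasi_separation(fitted_probs, eps=1e-8):
    """Return indices of observations whose fitted probabilities sit
    essentially at 0 or 1 -- a red flag for (quasi-)perfect separation.
    The eps cutoff is an arbitrary illustrative choice."""
    return [i for i, p in enumerate(fitted_probs)
            if p < eps or p > 1.0 - eps]
```

Run it on the training-set predictions from the propensity model; a non-empty result means the logit shouldn't be trusted as-is.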
One could alternatively estimate propensity scores using a CART model, for example. [[[Westreich et al, 2010. Propensity score estimation: neural networks, support vector machines, decision trees (CART) as alternatives to logistic regression. J Clin Epid.]]] Unlike logit, CART doesn't depend on a model requiring the probability to be strictly between 0 and 1 for every covariate pattern, so separation doesn't break CART. Fit the CART model, produce the tree plot, discover that you don't have overlap, exclude the observations that are perfectly separated, and run your analysis on the well-behaved subset with their propensity scores.
It is interesting that the 6 methods all were reported to have different ways of dealing with the problem of quasi-separated data in logit. That's a lot of different ways of dealing with a problem that breaks underlying assumptions of the method itself.
re: "sometimes mental health deinstitutionalization had no effect and sometimes it had 2.4 more homicides per capita."
So what should a researcher do; which impact number is the right one? All estimates purport to answer the exact same research question using the exact same data.
We don't know which of the models, if any, is (even approximately) correct, because our quasi-separated data flat-out violated our basic modeling assumptions. When you re-run the propensity-score modeling and other analysis on just the non-separable data (after standardizing), do all packages give you the same answer? That's the substantively important number for the well-behaved subset of the data, right? At least for researchers who like this DiD method.
"So what should a researcher do; which impact number is the right one? All estimates purport to answer the exact same research question using the exact same data."
I really do not know tbh. I feel like these are not robustness checks of one another -- they cannot all be correct. We know the econometrics of CS and we know the syntax of a package, but I'm not sure what I'm supposed to do when six packages of the exact same estimator yield six estimates under certain conditions (big covariates) but not others (no covariates). I'm still trying to take this in, but this seems like a problem, and if it is actually logit, then it's not even a CS problem.
I do think it's a problem, though, that economists have wanted to control for covariates on the one hand but completely reject propensity scores on the other. There is a lot of lost human capital here, and it would most likely be extremely clear to someone with in-depth knowledge of selection on observables how to go about this. But economists often don't even know what common support is--checking for balance, for instance, just isn't something people commonly discuss. So I really don't know. This exercise left me feeling sad and a little disoriented.
Prompted this thought though: Isn't another likely problem to come going to be "I'm just gonna report all these specs because it's not that hard to do, and I'm gonna let you decide even though the inference is all over the place and therefore I can't claim as much, but I can do all this with my bots?" The supplementals to review in most journal articles are already way too long and filled with robustness checks...
(And, well, I am only being slightly hyperbolic in my guess as to how this will go. :) )
I know. Honestly it seems so hard to predict what will happen anymore. I think we know marginal cost is zero, though, and therefore anything with even the smallest marginal value will get done.
Amazing work! I love this as an illustration of just how much overlap matters when working with propensity scores.
I was curious if you had further insight on R's "did" versus the Python "differences" package, given their categorical alignment in the internal machinery table? My two thoughts are that this could be driven by underlying granularity in the MLE logit step, or by Python and R handling floating-point values differently--but I'm curious which (or something unrelated) is the driver here.
P.S. - I haven't watched the videos, so apologies if this is included in your python diff-diff discussion...
I have it on my to-do list to really figure out what you just asked about. I would've thought those should be the comparable ones. They are for sure the closest when you are at k=1 and k=2. And on the 45-degree line (which I think I didn't post, maybe, but which is on page 22 of the deck at the end of the post), you can see them lined up together; excluding the zero ATTs for did, they're right on top of each other. Where they break down is at the higher combinations, with differences for some reason pushing through on estimation where did refuses. And on page 23 it's sort of hard to see exactly because they all converge so much, but I think they may be almost the same. Any mean differences between them could be coming from all those zeroes, which are really just NAs.
Re "it's just an illustration of how much overlap matters when working with propensity scores". Yes, if one is estimating propensity using conventional methods that do not allow for perfect separation. It's also an illustration of how the various CS packages don't have consistent, transparent ways of dealing with data pathologies that break logistic regression. I think it's really important to acknowledge this is not about propensity scores per se: It's about packages that don't do enough to alert the researcher if/when there's a problem with one's logistic regression. Logistic regression is not fault-tolerant for perfect separation. There are other methods of generating 'estimated probabilities of assignment to treatment/control' that can tolerate quasi-perfect separation.
More broadly, if you've adopted the classic potential outcomes framework and you also believe that a propensity score for some of your data is 1, then the treatment effect for those observations is undefined (causal inference is not a well-posed question for those units) and there's no statistics that can change that.
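The standard response is to restrict attention to the overlap region, e.g. by trimming. A hedged sketch (the cutoffs are the conventional-but-arbitrary kind, and trimming changes the estimand):

```python
def overlap_indices(pscores, lo=0.01, hi=0.99):
    """Indices of units kept after trimming extreme propensity scores.
    Note: the resulting ATT is for the trimmed (overlap) population,
    a different target parameter than the full-sample ATT."""
    return [i for i, p in enumerate(pscores) if lo <= p <= hi]
```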
100% it isn’t about propensity scores. It’s not even about CS. It’s absolutely what you said — the fact you cannot even trim the scores makes it non-standard. You can make your own scores, but once you invoke did or the other packages, they re-estimate the propensity score. And then there are the data pathologies, as you call them, and each estimator is just blindly using some local version of logit — of which, for all I know, there are now twenty in each language, each doing some different thing I don’t know about. This is why I said I think economists have shot themselves in the foot ignoring propensity scores, since they're now in the engine of did and iv (mte).
As someone outside econometrics but within applied stats, it seems to me that the biggest underlying problem here is the widespread and uncritical use of a method that is numerically unstable. Even with the exact same numerical method, the results will be unstable w/r/t trivial changes in input data. However in general I strongly agree that making code fully open, describing software in methods sections, and exploring sensitivity of results to these things (RNG seed, alt software using the same nominal method, etc) is both very important and a good use of tools like claude code
100%. But the thing here is the packages aren’t clearly indicating that instability when it’s there. Some of these are actually dropping down silently to a parsimonious model even without saying it at all. Whereas others will refuse to do it. That doesn’t contradict your point at all — it just is my way of saying that even packages could be a source of researcher output variation.
Re the large numbers thing, I once tried to run an equation on data I'd generated from a finite horizon optimal control problem meaning that the unstable root eventually dominated the dynamics of the data. Got completely different results depending on the package I used, including one package which just curled up and died. Nothing particularly odd about the problem, it was a standard saddle point system of the sort that is the solution to a lot of micro dynamic optimization problems.
You’d think though that you’d get that on all the packages across all the languages. So if it ever happens, it should always happen, and the results should then be the same too — but they weren't. The failures should not depend on the language and the package.
Hi Scott, interesting post though I would like to offer a response on a few aspects:
1. The R `did` package is our main implementation of Callaway and Sant'Anna. The Stata `csdid` package was originally intended as a clone of `did`, but made some different implementation choices along the way (which was a mistake in hindsight) and we are working on getting them to match in all cases. You should expect `ddml` to give different results by construction. As for the Python packages, I am not involved with any of them. The Python package whose authors have interacted with us the most is `csdid`. I would expect it to give very similar results to `did`, but it was not among those you tried.
2. A bigger issue is that common support is violated in this application. Common support is an important assumption in Callaway and Sant'Anna, so, in this case, the packages are being used in a setting where the required underlying assumptions are not met. Specifically, many of the implied "desired comparisons" are not feasible, which is not an issue that any software implementation can fix. And if you try to run these packages in these conditions, they are going to be inherently unstable (dividing by numbers very close to 0) and small differences in implementation or numerical differences would be magnified greatly. We also have tools for diagnosing common support violations and potentially trimming units with extreme propensity scores, though this would change the target parameter of the analysis.
3. I am almost certain that Claude Code ignored a lot of warnings here. The R package would have reported many warnings about small groups and extreme propensity scores that do not seem to have been picked up.
Yes - he likely did ignore the did warnings for sure. I made a note of that in the videos as well — the zeroes that showed up in the R did ATT estimates indicated it.
It for sure wasn’t an indictment of did, fwiw, but to show that there is variation coming across packages within estimator.
Yeah, my sense is that this exercise leads to exaggerated differences because dividing by slightly different numbers that are extremely close to 0 can change things a lot. But, still, even in more straightforward cases, not all the default options are the same in Stata as in R, which can lead to confusing differences...we are going to get this fixed. I am not good at Stata, but maybe Claude can help here too.
Maybe. I mean it’s coming from having so many covariates that you’re getting perfect prediction, perfect separation, curse-of-dimensionality things — not to mention that the six logistic regression packages appear to be slightly different, some of which don’t even give an error message — they just arbitrarily drop the Xs entirely, regress onto a constant, and never say it. It isn’t CS, but it does worry me generally about a move towards agent-based analysis where, on top of low literacy about issues with high dimensions, you can’t even see the errors, or the package doesn’t display them at all.
On the first part, I agree 100%, was just trying to say that small differences between packages are going to be magnified in an identification-failure setting.
On the package development side, my prediction is that packages will get better at things like informative warning messages and coinciding across different programming languages as these are also tedious things that are hard to get right and Claude should be quite good at.
On the user side, maybe just put in Claude.md not to ignore warning messages could be a start and I think seeing more covariate balance checks/diagnostics would be good too.
Floating-point arithmetic can lead to bizarre results, and it's impossible to anticipate all the small errors it can cause. These errors have caused problems for the banking industry, the aerospace industry, and many more, so this is a known problem. The Python docs even have a full tutorial page on it (https://docs.python.org/3/tutorial/floatingpoint.html).
The problems don't occur only with very big numbers: with Python 3.13.2, 0.1 + 0.2 results in 0.30000000000000004. So these two mathematically identical expressions give different results:
(0.1 + 0.2) * 10 -> 3.0000000000000004
(0.1 * 10 + 0.2 * 10) -> 3.0
It's not hard to see how accumulating these errors can produce very different final results.
And big numbers by themselves are not the problem either. The problem is mixing numbers with very different orders of magnitude.
10.0 ** 18 + 1.0 == 10.0 ** 18 -> True
But also
10.0 ** 12 + 0.00001 == 10.0 ** 12 -> True
These are not the packages' problems - that's the established standard for floating-point arithmetic (https://en.wikipedia.org/wiki/IEEE_754). By including very large numbers in poptotaltrend, any tiny difference in the packages' implementations can lead to huge differences in results. That's a misuse of the tool itself, because it doesn't respect the basic limitations of the IEEE 754 standard.
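Python's standard library can make the magnitude point concrete via math.ulp, which gives the gap between a float and the next representable one -- anything below roughly half that gap is rounded away when added:

```python
import math

# The spacing between adjacent doubles ("ulp") grows with magnitude,
# so an addend below roughly half an ulp is rounded away entirely.
print(math.ulp(1.0))    # about 2.2e-16
print(math.ulp(1e12))   # about 1.2e-4 -- much coarser resolution

assert 1e12 + 1e-5 == 1e12   # 1e-5 is below half an ulp at 1e12: lost
assert 1e12 + 1e-3 != 1e12   # 1e-3 is above it: survives
```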
So, what to do??
For researchers:
- Rule of thumb: if the largest and the smallest numbers in your sample are more than 8 or 9 orders of magnitude apart, be aware
- For poptotaltrend, instead of (year * pop), use (year - 2000) * pop
For developers:
- Python has the "Decimal" module natively, for fast correctly rounded decimal floating point arithmetic
- Both Python and R have third-party packages for arbitrary-precision floating-point numbers
- Use float by default, but allow the user to use custom types
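A quick sketch of both suggestions -- Decimal for exact decimal arithmetic, and the centering trick to keep magnitudes comparable (the year/pop values are illustrative):

```python
from decimal import Decimal

# Decimal represents 0.1 exactly; binary floats cannot
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# Centering before multiplying shrinks the product by orders of
# magnitude, leaving headroom for small increments to survive
year, pop = 2010.0, 1.0e10
big = year * pop                  # ~2.0e13
centered = (year - 2000.0) * pop  # ~1.0e11
assert big + 1e-4 == big              # increment rounded away at 2e13
assert centered + 1e-4 != centered    # but it survives at 1e11
```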
What do you think about an alternative framing of these results--"Claude reminds us of pathologies with the propensity score that we often forget/overlook" and the policy rec is more specific--code that uses propensity score should have specific warning steps to the user about the operations it is doing and how it deals with PS=1/PS=0 and other issue?
That’s a good framing. I do think though we should be doing these code audits. Do everything three times in all languages to figure this out. Also think we need to be explicit about our packages.
I’d add to this framing that this doesn’t seem so much like “different packages will do fundamentally different things” as that CSDID’s implementation for including covariates requires propensity scores and we’re not so familiar with evaluating that. I suspect if you used a two-stage imputation estimator you wouldn’t find these major differences with controls.
I’m not following. Double robust used two stage imputation and propensity score reweighting. It’s doing both. And these are hidden pscores. The packages often throw them away. My exercise is purely to say that there shouldn’t be variation for the Same spec across packages for the identical estimator. It isn’t about CS but rather the CS packages are giving different results. Maybe you're saying this, but I'm just wanting to make sure I follow.
Yeah I'm perhaps not being clear. When I referred to imputation I meant the Gardner/Borusyak, Jaravel and Spiess estimator, I realize that's unclear because CS also do part of this to impute counterfactuals. My point was in wondering whether the CS estimator is a bit more vulnerable than other estimators to getting different results for different packages because there are different reasonable choices (because of details with propensity score matching), which seems to be most of the reason you get different results if I'm understanding correctly (although the storage of big numbers is an interesting one to keep in mind). Whereas the Gardner et al. version has much fewer degrees of freedom in implementation. It reminds me of the paper that showed that many economists handed the same data and research question produced fairly variable estimates. But I think it's quite true that few of us would think that package implementations of the same estimator could be so important.
I think what you're demonstrating is very valuable--every replication I've ever tried to do of an econ paper has turned up weird reasons about why it's not possible to fully reproduce. And I think a big reason is that it's too been too "costly" for academics to do any kind of "quality assurance" until now (whereas private industry has had strong incentives to do it for a long time), so it's great if that cost declines and we start implementing code audits like you're demonstrating. I'm sure it would greatly improve the reliability and reproducibility of research.
It very well could be but that too is an empirical question. This variation has zero to do with the estimator. Nothing is probably more straightforward than logit. There should be zero variation for logits across packages and yet there is.
And by logit variation, I literally mean package variation mind you, all under the hood with regards to even such basic things as whether the package will even run the logit it claims it is (eg silently dropping covariates).
Sorry for the multiple posts. But this is under the hood. People are often not even given the propensity score as stored macros when you run CS. So you can’t even trim as a consequence bc trimming is pre-package, and then the CS package just generates a new one. And one package faced with the identical spec drops randomly covariates, one spits out errors, one handles machine epsilon differently, and so on.
You can always switch estimators but to switch estimators based on package glitches is not principled nor scientifically accurate anyway. The heterogenous robust estimators do not have the same heterogenous adoption with regards to covariates. Some papers don’t even cover them at all but have packages that allow for them. It’s known though that without flexible modeling of Xs via saturation, regression with addictive Xs impose stronger assumptions like constant treatment effects with respect to those very Xs. So it isn’t a free lunch to switch over if you’re explicitly not wanting to make those claims.
minor point--I just want to confirm my understanding--I think a comment in this thread is incorrect: "My point was in wondering whether the CS estimator is a bit more vulnerable than other estimators to getting different results for different packages because there are different reasonable choices (because of details with propensity score matching), which seems to be most of the reason you get different results if I'm understanding correctly (although the storage of big numbers is an interesting one to keep in mind). "...
To the best of my understanding, the problem is not with propensity score matching. Does CS involve propensity score matching? I thought it was propensity score weighting (correct me if I'm wrong, I haven't used or taught CS). Don't blame "matching" (the poor method takes a lot of flak, lol). The problem, as I understand it, is that the CS package runs a vanilla, everyday logistic regression and the researcher has no easy way to do basic checks for quasiperfect separation (which means the researcher shouldn't have run logit on that data in the first place), b/c relevant intermediate diagnostic results are hidden inside the package. So I think it's largely a software design issue.
That’s right — it’s an inverse probability weighting. So it isn’t matching. If I said that I misspoke. I tend to use that word when trying to get people to see what’s under the hood of CS only because IPW and weights aren’t clear to many. So I tend to say “it’s like matching on units with comparable Xs”, as a shorthand. But you’re 100% right that CS is IPW rewriting the comparison units using baseline Xs for a given cohort comparison
And you’re 100% right that it’s using vanilla logistic regressions. Now you could do it yourself for sure — CS as a simple recipe. Calculate 2×2 means in outcomes, reweight by IPW and/or an outcome regression model. You can get the propensity score however you want but the point is you cannot feed that into these standard packages.
My one worry is how often people get the CS wrong even then. I am not sure I’m super enthusiastic about the complete over confidence I see in the wild in even what CS is doing, but it could be I just have to rest more.
Yes, fair and thanks for your thorough reply. You’re of course right that there’s no “free lunch” in switching estimators, it entails substantive differences. And it’s definitely a cause for concern when there are different things happening under the hood that we can’t even diagnose. I will definitely continue to read all your CC posts, I’m learning a lot
Thanks! I'm learning a ton too. I want to keep figuring out this logit thing also. It just doesn't seem like there should be this much variation. I think it's because I would always just write down the log likelihood and solve it in school. I have such a dismal understanding of what is literally happening on the side of a real calculator, and since I specialize in "syntax", and less so the underlying computers, it honestly didn't even occur to me something like this would even hypothetically be true. And I really don't think people know about it. I think it's an undocumented source of researcher degrees of freedom. I bet you twenty bucks no one has really just sat down and said "and another source of researcher discretion is whether they used R's did or Stata's csdid", because it's ordinarily the research design, the estimator and the cleaning that have been suggested as possible sources of subjective bias. How your computer, language, and package within language handles a number on the order of 10^10? Or whether that particular package's underlying architecture does or does not spit out a refusal to estimate, or whether it passes along garbage coefficients -- I don't think this is known.
Wait — talking to Sarah, I maybe misunderstood you, Jason. The issue isn’t with the propensity score. The issue is with the logistic regression and how the six different packages across the three languages are handling floats and perfect separation, whether they spit out errors, whether they silently drop down to regressing the treatment indicator onto a constant despite being fed Xs, and so on. It’s got nothing to do with the propensity score per se — it’s that six different logit implementations each handle these hidden details slightly differently, and it’s the packages (not the estimator) that decide how.
I should have said: "pathologies with the implementation of logistic regression among different statistical packages" rather than "with the propensity score". But I read what you showed a little differently--you showed that with no covariates everything looked great. It was when you stress-tested the logistic regression step (probably in ways you didn't mean to, like using raw population) that things fell apart. My guess is that very few published papers would have stressed a logistic regression like this, and therefore the "different answers across statistical packages" result would be pretty rare. Just a guess though.
I'm positive no one is running CS by first running their own logit, because why would they? Whether they should is a different matter.
Hi! Have to say that I've been enjoying the series a lot. I have a question: what happens if you use your Referee 2 persona? Maybe I'm missing something, but I remember that the idea of this persona was to obtain the same coefficients using 3 different packages. In the case of the PNAS paper, a DiD wasn't used, but what would happen if you do the exercise with the Brazil paper? If the same coefficient is achieved, maybe it is because it used the 0-covariates baseline. But what happens if it uses a positive number of covariates? Would the same ATT be obtained across packages?
In this case, it was coming from the number of Xs, one of which was on the order of 10^10. So Referee 2 would’ve flagged the failure, but not the reason why. I would’ve had to reconcile them after I read his report, and then CC would’ve found this out. That’s the thing — it’s the circling that gets to the bottom of it.
Thank you for another thoughtful piece. Even better than reporting package and version would be to require containerized version of the project (e.g. dockerfile + pinned dependencies), which Claude could also prepare for you so as not to increase overhead too much.
Yep, good point. I should’ve said that. Ugh, so many new problems! Claude Code is making less work and more work!
Thanks Scott, great post! I agree with your points, and it's a service that you did this work and explained it so clearly. Here's what you made me think about:
a commenter wrote: "Claude reminds us of pathologies with the propensity score that we often forget/overlook". I would put it differently: *Scott* reminds us, even more broadly, of pathologies with *logistic regression* when data is quasi-separable. Any logistic regression. When running any regression, I think it's always a good idea to see what the predicted values are for your training set (for lots of reasons, including the possibility of catching a coding error). In a logistic regression, when predicted values come back as 0 or 1, that should trigger concern. (In R, quasi-separated data trigger a warning message.)
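The predicted-values check described above can be sketched in a few lines. To stay independent of any particular package's separation handling, this hand-rolls the logit by maximizing the log likelihood directly (the single-covariate setup and the 1e-4 flagging threshold are illustrative choices, not from any package):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_logit_probs(d, x):
    """Hand-rolled logit MLE on one covariate: maximize the log likelihood."""
    Z = np.column_stack([np.ones(len(d)), x])
    # negative log likelihood, written with logaddexp for numerical stability
    negll = lambda b: np.logaddexp(0.0, Z @ b).sum() - d @ (Z @ b)
    b = minimize(negll, np.zeros(2), method="BFGS").x
    return expit(Z @ b)

def count_extreme(p, tol=1e-4):
    """Count fitted probabilities at ~0 or ~1 (the separation symptom)."""
    return int(((p < tol) | (p > 1 - tol)).sum())

# A covariate that perfectly separates treatment should trip the check:
# the likelihood has no interior maximum and fitted probs pile up at 0/1
x = np.arange(20.0)
d = (x >= 10).astype(float)
n_flagged = count_extreme(fit_logit_probs(d, x))
if n_flagged:
    print(f"warning: {n_flagged} fitted probabilities at ~0/1; "
          "possible (quasi-)separation, logit results unreliable")
```

The same check applied to data with genuine overlap comes back clean, which is the point: it is a cheap diagnostic any of these packages could surface.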
One could alternatively estimate propensity scores using a CART model, for example. [[[Westreich et al, 2010. Propensity score estimation: neural networks, support vector machines, decision trees (CART) as alternatives to logistic regression. J Clin Epid.]]] Unlike logit, CART doesn't depend on a model requiring the probability to be strictly between 0 and 1 for all covariate patterns. Separation doesn't break CART. Fit the CART model, produce the tree plot, and find out whether you have overlap. Exclude observations that are perfectly separated, and use the well-behaved subset, along with their propensity scores, for your analysis.
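A minimal sketch of that workflow, with made-up data and illustrative tuning choices (tree depth, minimum leaf size) rather than anything from the Westreich et al. paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 2))
d = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))  # selection on x0
d[x[:, 1] > 0.5] = 1        # a region with no controls at all: overlap fails there

# CART propensity scores: the leaf-level share of treated units
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=25).fit(x, d)
ps = tree.predict_proba(x)[:, 1]

# Leaves with ps of exactly 0 or 1 have one arm empty: no overlap there.
# Separation doesn't break the fit; it just shows up as a pure leaf.
keep = (ps > 0) & (ps < 1)
print(f"dropping {int(np.sum(~keep))} units in non-overlap leaves")
```

One could then run the downstream analysis on the `keep` subset with its scores, as the comment describes, with the usual caveat that trimming changes the population the estimate refers to.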
It is interesting that the 6 methods all were reported to have different ways of dealing with the problem of quasi-separated data in logit. That's a lot of different ways of dealing with a problem that breaks underlying assumptions of the method itself.
re: "sometimes mental health deinstitutionalization had no effect and sometimes it had 2.4 more homicides per capita."
So what should a researcher do; which impact number is the right one? All estimates purport to answer the exact same research question using the exact same data.
We don't know which of the models, if any, are (even approximately) correct, because our quasi-separated data flat-out violated our basic modeling assumptions. When you re-run the propensity-score modeling and the rest of the analysis on just the non-separated data (after standardizing), do all packages give you the same answer? That's the substantively important number, right? For the well-behaved subset of the data, for researchers who like this DiD method.
"So what should a researcher do; which impact number is the right one? All estimates purport to answer the exact same research question using the exact same data."
I really do not know tbh. I feel like these are not robustness checks of one another -- they cannot all be correct. We know the econometrics of CS and we know the syntax of a package, but I'm not sure what I'm supposed to do when six packages for the exact same estimator yield six estimates under certain conditions (big covariates) but not others (no covariates). I'm still trying to take this in, but this seems like a problem, and if it is actually logit, then it's not even a CS problem.
I do think it's a problem, though, that economists have wanted to control for covariates on the one hand but have wanted to completely reject propensity scores on the other. There is a lot of lost human capital, and it would most likely be extremely clear to someone with in-depth knowledge of selection on observables how to go about this. But economists oftentimes don't even know what common support is. Checking for balance, for instance, just isn't a common thing people discuss. So I really don't know. This exercise left me feeling sad and a little disoriented.
Thanks for this piece--well done and thoughtful.
Prompted this thought though: Isn't another likely problem to come going to be "I'm just gonna report all these specs because it's not that hard to do, and I'm gonna let you decide even though the inference is all over the place and therefore I can't claim as much, but I can do all this with my bots?" The supplementals to review in most journal articles are already way too long and filled with robustness checks...
(And, well, I am only being slightly hyperbolic in my guess as to how this will go. :) )
I know. Honestly it seems so hard to predict what will happen anymore. I think we know marginal cost is zero, though, and therefore anything with even the smallest marginal value will get done.
Amazing work! I love this as an illustration of just how much overlap matters when working with propensity scores.
I was curious if you had further insight on R's did versus the Python differences package, given their categorical alignment in the internal-machinery table? My two thoughts are that this could be driven by underlying granularity in the MLE logit step, or by Python and R handling floating-point values differently, but I'm curious which (or something unrelated) is the driver here.
P.S. - I haven't watched the videos, so apologies if this is included in your python diff-diff discussion...
I have it on my to-do list to really figure out what you just asked about. I would've thought those should be the comparable ones. They are for sure the closest when you are at k=1 and k=2. And on the 45-degree line (which I think I didn't post, but which is on page 22 of the deck at the end of the post), you can see them lined up together, excluding the zero ATTs for did -- otherwise they're right on top of each other. Where they are breaking down is at the higher combinations, and for some reason differences is pushing through on estimation where did is refusing. And on page 23, it's sort of hard to see exactly because they all converge so much, but I think they may be almost the same. Any mean differences between them could be coming from all those zeroes, which are really just NAs.
Re "it's just an illustration of how much overlap matters when working with propensity scores". Yes, if one is estimating propensity using conventional methods that do not allow for perfect separation. It's also an illustration of how the various CS packages don't have consistent, transparent ways of dealing with data pathologies that break logistic regression. I think it's really important to acknowledge this is not about propensity scores per se: It's about packages that don't do enough to alert the researcher if/when there's a problem with one's logistic regression. Logistic regression is not fault-tolerant for perfect separation. There are other methods of generating 'estimated probabilities of assignment to treatment/control' that can tolerate quasi-perfect separation.
More broadly, if you've adopted the classic potential-outcomes framework and you also believe that the propensity score for some of your data is 1, then the treatment effect for those observations is undefined (causal inference is not a well-posed question for those units), and no statistics can change that.
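A two-line way to see why scores near 1 are fatal in practice, not just in the limit: the IPW comparison-unit weight involves p/(1 − p), which explodes as p approaches 1 and is undefined at p = 1. The values below are just for illustration.

```python
# The control-reweighting term used in IPW-style estimators blows up
# as the propensity score approaches 1 (division by a number near 0)
for p in [0.5, 0.9, 0.999, 0.999999]:
    print(f"p = {p}: weight p/(1-p) = {p / (1 - p):.1f}")
```

This is also why tiny implementation differences across packages get magnified: two denominators that differ in the tenth decimal place produce wildly different weights once p is close to 1.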
It’s 100% not about propensity scores. It’s not even about CS. It’s absolutely what you said — the fact that you cannot even trim the scores makes it non-standard. You can make your own scores, but once you invoke did or the other packages, they re-estimate the propensity score. And then there are the data pathologies, as you call them, and the fact that each estimator is just blindly using some local version of logit, of which for all I know there are now twenty in each language, each doing some different thing I don’t know about. This is why I said I think economists have shot themselves in the foot ignoring propensity scores, since they’re now in the engine of DiD and IV (MTE).
As someone outside econometrics but within applied stats, it seems to me that the biggest underlying problem here is the widespread and uncritical use of a method that is numerically unstable. Even with the exact same numerical method, the results will be unstable w/r/t trivial changes in input data. However in general I strongly agree that making code fully open, describing software in methods sections, and exploring sensitivity of results to these things (RNG seed, alt software using the same nominal method, etc) is both very important and a good use of tools like claude code
100%. But the thing here is the packages aren’t clearly indicating that instability when it’s there. Some of these are actually silently dropping down to a parsimonious model without saying so at all, whereas others will refuse to estimate. That doesn’t contradict your point at all — it’s just my way of saying that even packages can be a source of researcher output variation.
Re the large numbers thing, I once tried to run an equation on data I'd generated from a finite horizon optimal control problem meaning that the unstable root eventually dominated the dynamics of the data. Got completely different results depending on the package I used, including one package which just curled up and died. Nothing particularly odd about the problem, it was a standard saddle point system of the sort that is the solution to a lot of micro dynamic optimization problems.
You’d think, though, that you’d get that on all the packages across all the languages. So if it ever happens, it should always happen, and the results should be the same then too — but they weren’t. The failures should not depend on the language and the package.
That’s really the point of the post — all calculators should have the problem. Particularly if they’re all running something as standardized as logit.
Hi Scott, interesting post though I would like to offer a response on a few aspects:
1. The R `did` package is our main implementation of Callaway and Sant'Anna. The Stata `csdid` package was originally intended as a clone of `did`, but made some different implementation choices along the way (which was a mistake in hindsight) and we are working on getting them to match in all cases. You should expect `ddml` to give different results by construction. As for the Python packages, I am not involved with any of them. The Python package whose authors have interacted with us the most is `csdid`. I would expect it to give very similar results to `did`, but it was not among those you tried.
2. A bigger issue is that common support is violated in this application. Common support is an important assumption in Callaway and Sant'Anna, so, in this case, the packages are being used in a setting where the required underlying assumptions are not met. Specifically, many of the implied "desired comparisons" are not feasible, which is not an issue that any software implementation can fix. And if you try to run these packages in these conditions, they are going to be inherently unstable (dividing by numbers very close to 0) and small differences in implementation or numerical differences would be magnified greatly. We also have tools for diagnosing common support violations and potentially trimming units with extreme propensity scores, though this would change the target parameter of the analysis.
3. I am almost certain that Claude Code ignored a lot of warnings here. The R package would have reported many warnings about small groups and extreme propensity scores that do not seem to have been picked up.
Brant
Yes - it likely did ignore the did errors for sure. I made a note of that in the videos as well; the zeroes that showed up on the R did ATT estimates indicated it.
It for sure wasn’t an indictment of did, fwiw, but to show that there is variation coming across packages within estimator.
Yeah, my sense is that this exercise leads to exaggerated differences, because dividing by slightly different numbers that are extremely close to 0 can change things a lot. But, still, even in more straightforward cases, not all the default options are the same in Stata as in R, which can lead to confusing differences... We are going to get this fixed. I am not good at Stata, but maybe Claude can help here too.
Maybe. I mean, it’s coming from having so many covariates that you’re getting perfect prediction, perfect separation, curse-of-dimensionality things. Not to mention that the six logistic regression implementations appear to be slightly different, some of which don’t even give off an error message -- they just arbitrarily drop the Xs entirely, regress onto a constant, and never say it. It isn’t CS, but it does worry me generally about a move towards agent-based analysis where, on top of low literacy about issues with high dimensions, you can’t even see the errors, or the package doesn’t display them at all.
On the first part, I agree 100%; I was just trying to say that small differences between packages are going to be magnified in an identification-failure setting.
On the package-development side, my prediction is that packages will get better at things like informative warning messages and at coinciding across different programming languages, as these are also tedious things that are hard to get right and that Claude should be quite good at.
On the user side, maybe just putting "don't ignore warning messages" in Claude.md could be a start, and I think seeing more covariate balance checks/diagnostics would be good too.
Floating-point arithmetic can lead to bizarre results, and it's impossible to anticipate all the small errors it can cause. These errors have caused problems for the banking industry, the aerospace industry, and many more, so this is a known problem. Python's docs even have a full topic on it (https://docs.python.org/3/tutorial/floatingpoint.html).
The problems don't occur only with very big numbers: with Python 3.13.2, 0.1 + 0.2 results in 0.30000000000000004. So these two mathematically identical expressions give different results:
(0.1 + 0.2) * 10 -> 3.0000000000000004
(0.1 * 10 + 0.2 * 10) -> 3.0
It's not hard to see how accumulating these errors can produce very different final results.
And big numbers are not a problem by themselves. The problem is mixing numbers with very different orders of magnitude.
10.0 ** 18 + 1.0 == 10.0 ** 18 -> True
But also
10.0 ** 12 + 0.00001 == 10.0 ** 12 -> True
These are not the packages' problems -- that's the standard established for floating-point arithmetic (https://en.wikipedia.org/wiki/IEEE_754). By including very large numbers in poptotaltrend, any tiny difference in the packages' implementations can lead to huge differences in results. That's a misuse of the tool itself, because it doesn't respect basic limitations of the IEEE 754 standard.
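For anyone who wants to check, the examples above run as-is under IEEE 754 double precision, and `math.ulp` (Python 3.9+) shows the spacing between adjacent doubles that explains the two equality checks:

```python
import math

print(0.1 + 0.2)                           # 0.30000000000000004
print((0.1 + 0.2) * 10)                    # 3.0000000000000004
print(0.1 * 10 + 0.2 * 10)                 # 3.0
# The spacing between adjacent doubles near 1e18 is 128, so adding 1.0
# cannot change the value; near 1e12 it is ~1.2e-4, so 1e-5 is lost
print(math.ulp(10.0 ** 18))                # 128.0
print(math.ulp(10.0 ** 12))                # 0.0001220703125
print(10.0 ** 18 + 1.0 == 10.0 ** 18)      # True
print(10.0 ** 12 + 0.00001 == 10.0 ** 12)  # True
```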
So, what to do??
For researchers:
- Rule of thumb: if the largest and the smallest numbers in your sample are more than 8 or 9 orders of magnitude apart, be aware
- For poptotaltrend, instead of (year * pop), use (year - 2000) * pop
For developers:
- Python has the "Decimal" module natively, for fast correctly rounded decimal floating point arithmetic
- Both Python and R have third-party packages for arbitrary-precision floating-point numbers
- Use float by default, but allow the user to use custom types
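Both recommendations can be sketched in a few lines. The covariate values are made up, the 8-orders-of-magnitude cutoff is the rule of thumb above, and `magnitude_span` is a hypothetical helper, not a function from any package:

```python
import numpy as np
from decimal import Decimal

def magnitude_span(x):
    """Orders of magnitude between the largest and smallest nonzero |value|."""
    x = np.abs(np.asarray(x, dtype=float))
    x = x[x > 0]
    return float(np.log10(x.max()) - np.log10(x.min()))

# Researcher side: flag designs that mix wildly different scales
years = np.array([2001.0, 2005.0, 2010.0])
pop = np.array([5e6, 6e6, 8e6])
other_x = 0.5                                     # a small-scale covariate

raw = np.append(years * pop, other_x)             # year * pop, as in the post
recentred = np.append((years - 2000) * pop, other_x)
print(magnitude_span(raw))        # > 8 orders of magnitude: be aware
print(magnitude_span(recentred))  # recentring shrinks the span

# Developer side: Decimal sidesteps binary representation error entirely
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```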