Workshop Announcement, Regression Adjustment, and Needing to Cite Packages Lest They Become Public Goods
I feel horrible when I go AWOL from the substack and then step in only to post about an upcoming workshop, but I have to do it, otherwise people won’t hear about it. This weekend I have a two-day workshop on synthetic control, and that link is here. I’ll screenshot the itinerary here so you can see what I’m going to attempt to cover. It’ll basically be about Abadie’s original model, then exploring its bias and alternatives for when there is bias. The same discounts apply as always: $1 for residents (as opposed to citizens) of low-income countries including India; $50 for students, predocs, postdocs, residents of middle-income countries, and those unemployed; $95 for those on high teaching loads and/or non-tenure lines; and $595 for everyone else. Just email me at causalinf@mixtape.consulting about which one you are, show me some evidence like a student ID or a website, and I’ll send you the discount code.
But I also wanted to share something that has been on my mind. Several years ago, when the new diff-in-diff estimators started to come out, they were usually available in either Stata or R but not both, with some exceptions. I remember once complaining about this to Guido Imbens, of all people: could his team maybe write up their matrix completion estimator in Stata to go with the one they made in R? People my age, Generation X, had a lot of human capital in Stata and so needed their Stata ado files, was my reasoning.
And so I threw out a bounty on Twitter: I would pay a thousand dollars (thank you to the Baylor department for that gift) to the first person who developed the Callaway and Sant’Anna estimator in Stata. Nick Huntington-Klein actually responded and wrote a workaround where Stata would call the did package in R from within Stata. But pretty soon thereafter, nearly all of the estimators became available in both Stata and R. And you can see the list here if you want.
Who knows if this conjecture is correct, but it feels correct: I think there has historically been a firewall between the econometricians and the applied people, created by a lack of code. Most likely Abadie’s synthetic control estimator benefited a lot when the 2010 JASA article came out and the authors released both R and Stata packages. I mean, who knows, right? We don’t know the counterfactual. But it feels right, because most applied people do not have the human capital to create functioning software. Anyway, the point is this: the availability of software is one thing, but the availability of packages is another. Not everyone has the human capital to create packages, and many probably don’t have the human capital to implement the methods without a package either.
Take, for instance, regression adjustment. What is regression adjustment exactly? Regression adjustment is an OLS specification. You use it when you have unconfoundedness (i.e., you have the correct controls) but you think you have heterogeneous treatment effects with respect to those controls. If you estimate a standard additive OLS model and you have heterogeneous treatment effects, then depending on how much the potential outcomes vary with those covariates, the bias of that “standard” OLS model can be bad. It doesn’t have to be bad, but it can be. Regression adjustment fully saturates the model and then collects those terms in a particular way to reconstruct estimates of the aggregate causal parameters you care about, like the ATE or the ATT. But boy, is it a pain to do it right. Look here at this code.
****************************************************************************************
* name: models.do
* author: scott cunningham (baylor)
* description: illustrating bias with OLS and matching, and the bias adjustments for both.
* last updated: april 3 2024
****************************************************************************************

clear
capture log close
drop _all
set seed 1
set obs 5000

gen treat = 0
replace treat = 1 in 2501/5000

* Poor pre-treatment fit
gen age = rnormal(40,2.5) if treat==1
replace age = rnormal(27,3) if treat==0
gen gpa = rnormal(2.2,0.75) if treat==0
replace gpa = rnormal(1.6,0.5) if treat==1

* Visualize the imbalance
twoway (histogram age if treat==1, color(green)) ///
       (histogram age if treat==0, ///
       fcolor(none) lcolor(black)), legend(order(1 "Treated" 2 "Not treated"))
twoway (histogram gpa if treat==1, color(blue)) ///
       (histogram gpa if treat==0, ///
       fcolor(none) lcolor(black)), legend(order(1 "Treated" 2 "Not treated"))

* Center the covariates
su age
replace age = age - `r(mean)'
su gpa
replace gpa = gpa - `r(mean)'

* All combinations
gen age_sq = age^2
gen age_agesq = age*age_sq
gen agesq_agesq = age_sq^2
gen gpa_sq = gpa^2
gen gpa_gpasq = gpa*gpa_sq
gen gpasq_gpasq = gpa_sq^2
gen interaction = gpa*age
gen agegpa = age*gpa
gen age_gpasq = age*gpa_sq
gen gpa_agesq = gpa*age_sq
gen gpasq_agesq = age_sq*gpa_sq

* Potential outcomes
gen y0 = 15000 + 10.25*age - 10.5*age_sq + 1000*gpa - 10.5*gpa_sq + 500*interaction + rnormal(0,5)
label variable y0 "Earnings if you do not get a masters"
gen y1 = y0 + 2500 + 100*age + 3000*gpa
label variable y1 "Earnings if you get a masters"
gen delta = y1 - y0
label variable delta "Causal effect of getting a masters on earnings"

su delta            // ATE = 2500
su delta if treat==1 // ATT = 2267
local att = r(mean)
scalar att = `att'
gen att = `att'

gen earnings = treat*y1 + (1-treat)*y0

* Nearest neighbor matching without bias adjustment
teffects nnmatch (earnings age age_sq gpa gpa_sq interaction) (treat), atet nn(1) metric(euclidean)

* Nearest neighbor matching with bias adjustment
teffects nnmatch (earnings age age_sq gpa gpa_sq interaction) (treat), atet nn(1) metric(euclidean) biasadj(age age_sq gpa gpa_sq interaction)

* OLS regression - this regression assumes "constant treatment effects" and the coefficient on treat is trying to estimate the ATE.
regress earnings age age_sq gpa gpa_sq interaction treat, robust

* Regression adjustment
teffects ra (earnings age age_sq gpa gpa_sq interaction) (treat), atet

* Regression adjustment "the long way". At its core, regression adjustment "interacts" the treatment variable with all of the covariates.
* Step one: run the fully interacted regression.
#delimit ;
regress earnings i.treat##c.age
                 i.treat##c.age_sq
                 i.treat##c.gpa
                 i.treat##c.gpa_sq
                 i.treat##c.age##c.gpa;
#delimit cr
local ate2 = _b[1.treat]
scalar ate2 = `ate2'
gen ate2 = `ate2'

* We got the ATE by looking just at the coefficient on treatment. But we can also get the ATT by multiplying each "treatment" coefficient (we have six treatment coefficients in our model) by the average value, in the treatment group only, of the variable it multiplies. Because you're estimating the average treatment effect for the treatment group, you multiply each coefficient by that variable's treatment-group mean and add them all together. If you code it by hand, it is very easy to mess up, which is why we use "teffects ra", but I want to show you what it looks like (as it's very long).

* Step two: save each of those treatment coefficients. In Stata you save them as "local macros".
local treat_coef = _b[1.treat]                // 1
local age_treat_coef = _b[1.treat#c.age]      // 2
local agesq_treat_coef = _b[1.treat#c.age_sq] // 3
local gpa_treat_coef = _b[1.treat#c.gpa]      // 4
local gpasq_treat_coef = _b[1.treat#c.gpa_sq] // 5
local age_gpa_coef = _b[1.treat#c.age#c.gpa]  // 6

* Step three: save those coefficients as scalars and then make variables.
scalar treat_coef = `treat_coef'
gen treat_coef_var = `treat_coef'             // 1
scalar age_treat_coef = `age_treat_coef'
gen age_treat_coef_var = `age_treat_coef'     // 2
scalar agesq_treat_coef = `agesq_treat_coef'
gen agesq_treat_coeff_var = `agesq_treat_coef' // 3
scalar gpa_treat_coef = `gpa_treat_coef'
gen gpa_treat_coef_var = `gpa_treat_coef'     // 4
scalar gpasq_treat_coef = `gpasq_treat_coef'
gen gpasq_treat_coef_var = `gpasq_treat_coef' // 5
scalar age_gpa_coef = `age_gpa_coef'
gen age_gpa_coef_var = `age_gpa_coef'         // 6

* Step four: calculate the average value of each covariate (for example age, gpa, or the interaction of age and gpa) for the treatment group only.
su age if treat==1
local mean_age = `r(mean)'
gen mean_age = `mean_age'
su age_sq if treat==1
local mean_agesq = `r(mean)'
gen mean_agesq = `mean_agesq'
su gpa if treat==1
local mean_gpa = `r(mean)'
gen mean_gpa = `mean_gpa'
su gpa_sq if treat==1
local mean_gpasq = `r(mean)'
gen mean_gpasq = `mean_gpasq'
su interaction if treat==1
local mean_agegpa = `r(mean)'
gen mean_agegpa = `mean_agegpa'

* Step five: estimate the ATT using all that information (coefficients times treatment-group sample means, added together).
gen treat4 = treat_coef_var +                  /// 1
    age_treat_coef_var * mean_age +            /// 2
    agesq_treat_coeff_var * mean_agesq +       /// 3
    gpa_treat_coef_var * mean_gpa +            /// 4
    gpasq_treat_coef_var * mean_gpasq +        /// 5
    age_gpa_coef_var * mean_agegpa             //  6
sum delta if treat==1
sum treat4 // regression adjustment "the long way"

* Regression adjustment using teffects
teffects ra (earnings age age_sq gpa gpa_sq interaction) (treat), atet
I have in this both nearest neighbor matching, with and without bias adjustment, and OLS. I’ll focus just on the OLS regression (the “regress earnings age age_sq gpa gpa_sq interaction treat, robust” line). Note that the true ATE is 2500 and the true ATT is 2267 (the “su delta” lines). If you run that OLS regression, the coefficient on the treatment indicator is 2487. It’s close, but it’s not 2500 and it’s not 2267. Here’s the output.
But what if you use regression adjustment? It’s insane. I’ve done this before, but it’s just insane. Look at the steps. There are five steps involved.
1. Fully saturate the model (I’ll just fully interact, because true saturation is ridiculous). The coefficient on the treatment indicator is the ATE, and under unconfoundedness and linearity it’s right. To get the ATT, follow the remaining steps.
2. Save all the coefficients on all the treatment terms.
3. Calculate sample means of the variables for the treatment group only.
4. Estimate the ATT as the treatment interaction coefficients times the treatment-group sample means, added together, plus the standalone treatment coefficient.
5. Calculate a standard error.
I literally cannot even post it all here, so I’ll just show step 4:

gen treat4 = treat_coef_var +                  /// 1
    age_treat_coef_var * mean_age +            /// 2
    agesq_treat_coeff_var * mean_agesq +       /// 3
    gpa_treat_coef_var * mean_gpa +            /// 4
    gpasq_treat_coef_var * mean_gpasq +        /// 5
    age_gpa_coef_var * mean_agegpa             //  6
Each of those terms was either a coefficient from step 1 that I saved as a local, then a scalar, then a variable, or a sample mean for the treatment group that I calculated and saved the same way. This is the “long way”. And when you do it “the long way”, look what you get:
The true ATT (“sum delta if treat==1”) is $2,266.99, and the estimated ATT (“sum treat4”) is $2,266.625, where I estimated the ATT using regression adjustment. Getting to that estimate “the long way” took the better part of a hundred lines of my Stata code, even net of the excessive pedagogical commenting I did. But still, it’s a lot.
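If the bookkeeping is hard to see through all that Stata, the arithmetic itself is tiny. Here is a minimal Python sketch of the step-4 aggregation. All numbers are made up for illustration (chosen to be roughly plausible for centered covariates), not taken from my simulation output:

```python
# Step-4 arithmetic from the "long way": ATT = standalone treatment coefficient
# plus each treatment-interaction coefficient times the treated-group mean of
# its covariate. Hypothetical values only, not the ones from the Stata run.

treat_coef = 2500.0  # standalone coefficient on the treatment dummy

# (interaction coefficient, treated-group mean of that covariate)
interactions = [
    (100.0,   6.6),  # treat x age     (age is demeaned, so treated mean > 0)
    (0.0,    50.0),  # treat x age^2
    (3000.0, -0.3),  # treat x gpa
    (0.0,     0.4),  # treat x gpa^2
    (0.0,    -1.8),  # treat x age*gpa
]

att = treat_coef + sum(b * xbar for b, xbar in interactions)
print(att)  # 2260.0 with these made-up inputs
```

Six multiplications and a sum, but in Stata each of those twelve inputs had to be fetched, stored, and typed by hand, which is exactly where mistakes creep in.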
Now look again at that step-4 calculation of treat4. What if, in one of those six terms, I had accidentally typed the wrong coefficient or the wrong mean? Well, let’s see what that typo does.
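To see the mechanics, here is a minimal Python sketch with made-up numbers (again, not the values from my simulation): the “typo” multiplies one interaction coefficient by the wrong covariate mean, and the aggregated estimate shifts.

```python
# Hypothetical coefficients and treated-group means, just to show what a
# one-token typo in the step-4 sum does. Not the values from the Stata run.
treat_coef = 2500.0
gpa_coef,   mean_gpa   = 3000.0, -0.3   # treat x gpa term
gpasq_coef, mean_gpasq =    0.0,  0.4   # treat x gpa^2 term

# Correct aggregation: each coefficient times its own treated-group mean
att_correct = treat_coef + gpa_coef * mean_gpa + gpasq_coef * mean_gpasq

# The "typo": gpa_coef accidentally multiplied by mean_gpasq instead of mean_gpa
att_typo = treat_coef + gpa_coef * mean_gpasq + gpasq_coef * mean_gpasq

print(att_correct)  # 1600.0
print(att_typo)     # 3700.0 -- one wrong token and the estimate moves by 2100
```

Nothing about the typo version errors out or looks suspicious; it just quietly returns the wrong number.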
Just that simple typo causes a bias. And obviously this stuff happens, but here’s where I’m going. At some point, around Stata 8 or Stata 14 or something, the bigwigs over there were like, “we don’t want people writing 100 lines of code and accidentally forgetting something like that, so let’s make the teffects suite of commands,” and it does matching, propensity score methods, and regression adjustment. And here’s the command for regression adjustment estimating the ATT:

teffects ra (earnings age age_sq gpa gpa_sq interaction) (treat), atet
One line, as opposed to a hundred lines. Where am I going with this? Now that these robust diff-in-diff estimators have exploded, and their adoption is growing, we are increasingly using packages for estimation that we did not write, but which others painstakingly did. And in the case of the robust diff-in-diff estimators, because of the prominence of those estimators and the popularity of diff-in-diff, many of us are using packages that were not written by the paper authors themselves. When we cite, for instance, Callaway and Sant’Anna (2021), they get a cite. But if we use “csdid”, the Stata package created by Fernando Rios-Avila with Brantly Callaway and Pedro Sant’Anna as coauthors (here and here), most likely we are not citing Rios-Avila, Fernando, Pedro H. C. Sant’Anna, and Brantly Callaway. 2021. “CSDID: Difference-in-Differences with Multiple Periods.”
And I’m not complaining at all, and I’m not shaming. It has just been dawning on me that the newer generation of practitioners is writing more code for estimators that they did not author. And so, when we cite the papers behind the estimators, it may for the first time be the case that we cite the econometricians while using packages made by others, without citing the package authors, making the package authors’ work truly a public good. And I don’t mean that in a positive way: public goods are not provided well by markets.
As academics, our cites matter. They really impact our livelihood. They can show up in performance reports, they can shape raises, and therefore they impact our consumption of goods and services. So it has just been dawning on me: I wonder if we need to openly discuss citing the authors of the packages, not just the econometricians. I know historically you’d cite the package so that replicability could be ensured. But with Google Scholar giving us our cites in real time, and with us learning about things like the “h-index” (which I don’t think anyone really thought about until Google Scholar made it so front and center), the fact that Callaway and Sant’Anna (2021) has 4,400 cites while Rios-Avila, Sant’Anna and Callaway (2021) has 102 suggests the package is probably being undercited.
I mean, it has to be, because surely more than 100 of those 4,400 cites for Callaway and Sant’Anna are coming from Stata users using csdid in their work. I just wonder if we need to cite the packages more than we do. I’m just as guilty as the next person.
Weirdly enough, it was trying to figure out how to teach regression adjustment to my students, by showing them regression adjustment “the long way” (100 lines of code) versus “the short way” (one line of code using -teffects ra-), that made me think two things. One: man, am I grateful for the specialization of labor, and for the people who know how to make an actual working, user-friendly package. I don’t mean code; I mean a package that can accommodate many types of problems, without bugs, doing the calculations correctly, with nice tables, and so on. And two: noticing that we as a profession aren’t citing the packages, and how that makes these things even more of a public good, potentially contributing even more to their lack of creation.
Anyway, there’s a workshop this weekend on synthetic control! And this is my rant about regression adjustment, citing packages, and thanking people like Fernando Rios-Avila for devoting his time to writing the csdid package. So wherever you are, Fernando, thank you!