Matrix Completion for Causal Models

An Athey, et al. (2021) explainer

“To estimate the average causal effect of the treatment on the treated units, we impute the missing potential outcomes.” — Athey, et al. (2021)


I’ve been a longtime admirer of Susan Athey and Guido Imbens since as far back as I can remember. Two brilliant economists, from very different backgrounds, who decided that they’d work together, pool their comparative advantages in machine learning and causal inference, and create what sometimes feels like an endless number of toys for the rest of us. “Matrix Completion Methods for Causal Panel Data Models” is one of their newest toys. While it is not technically a difference-in-differences (did) model, which I define as a model that uses a parallel trends assumption for identification, the method is a viable alternative to robust did estimators because it was designed for panels with a single treated period, with a single treated unit (as in synthetic control), and with differential timing. It is a creative and interesting paper, but also quite technical, so I thought a basic explainer would be a nice challenge and a lot of fun. So buckle up, because we’re going to take a roller coaster ride over matrix completion mountain and down into nuclear norm valley.

Explicit imputation

In many ways, the estimator that Athey, et al. (2021) propose reminded me of the paper I discussed earlier by Borusyak, Jaravel and Spiess (2021), because both estimators use explicit “imputation” techniques to estimate treatment effects. They each do this by imputing the missing potential outcomes for treated units, which are then differenced from observed values to calculate individual treatment effects. These individual treatment effects can then be averaged to calculate meaningful parameters such as the average treatment effect on the treated (ATT). Explicit imputation methods are, for me, extremely intuitive procedures. And because not all imputation methods yield valid counterfactuals on average, studying them gives us a chance to deepen our knowledge as to why one imputation works while another does not.

The paper bridges two distinct literatures in causal inference: the unconfoundedness literature (e.g., matching) and the synthetic control literature (e.g., synthetic control, duh!). Unconfoundedness also goes by the name of the conditional independence assumption, which means that conditional on a matrix of covariates X, the treatment D is independent of the potential outcomes. The synthetic control model, on the other hand, imputes counterfactual values for the treated unit using weights on control units chosen so that their weighted lagged outcomes match the lagged outcomes of the treated unit. But in addition to bridging two econometric literatures, the paper is also a blend of computer science techniques with causal estimation, which is increasingly a hallmark of Susan Athey’s more recent work.

The notation is mostly straightforward until we get to descriptions of the matrices themselves, but let’s review some basic ideas before we start. For any given unit, there are two potential outcomes: a potential outcome Y(1) corresponding to when the unit is treated, and a potential outcome Y(0) corresponding to when the unit is untreated. We do not observe potential outcomes, though, as they are imaginary thought experiments referring to possibilities, not realities. Instead we observe one historical outcome per unit: the potential outcome associated with whichever treatment assignment actually occurred. Once treatment assignment occurs, history draws the appropriate potential outcome and plops it down into our timeline.1

Treatment effects are defined as Y(1) - Y(0). But since we only observe one of these two potential outcomes for any given unit, we cannot make this calculation directly. Causality, as others have pointed out, is fundamentally a “missing data problem”, which is why so many causal methodologies use imputation of the missing counterfactuals to solve the problem. Athey, et al. (2021) explain this here:

“The main part of the article is about the statistical problem of imputing the missing values in Y. Once these are imputed, we can then estimate the average causal effect of interest, [the ATT].”
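Written out in plain text (my own rendering of standard potential outcomes notation, not an equation quoted from the paper), the observed outcome for unit i in period t and the target parameter are:

Y_it = D_it * Y_it(1) + (1 - D_it) * Y_it(0)

ATT = E[ Y_it(1) - Y_it(0) | D_it = 1 ]

The imputation step replaces the unobserved Y_it(0) in the treated cells with an estimate, and the feasible version of the ATT is then just the average of Y_it minus that imputed value over the treated (i,t) cells.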

If you understand that they are imputing missing counterfactuals, then you can focus on the imputation method itself. You can also better understand their arguments about how their estimator nests a variety of other imputation methods. The rest of this post will walk us through a lot of intimidating matrix jargon, discuss the imputation method itself, walk through the reasoning for choosing nuclear norm regularization, and conclude with a discussion of some sample R code.

Matrix Jargon: Block Structures and Associated Regressions

As I said before, this paper is about imputing missing counterfactual units. But they oftentimes refer to the problem as “completing” a matrix with missing observations. You should know that “matrix completion” simply means filling out the rest of a matrix of potential outcomes which were missing in the first place because a treatment assignment had switched observations from one potential outcome to another. Thus treatment assignments create “incomplete” potential outcome matrices that the authors want to fix by “completing” those matrices with valid values. But just remember — they’re completing the matrix by imputing those missing elements. It’s just that not all imputation methods are equally competent at doing so. Spoiler alert: the nuclear norm regularization method that they recommend will under certain assumptions perform well at filling in the missing counterfactual units.

Recall that I said earlier that there are two potential outcomes needed for calculating treatment effects: Y(1) and Y(0). For each potential outcome, we have a matrix where each element represents that potential outcome for a given panel unit at a particular point in time.2 And in theory, each of those matrices is complete because, hypothetically, every unit at every point in time has both a Y(1) and a Y(0) value associated with it. What is incomplete is our observed slice of them: when a unit is treated, we go from observing its Y(0) in earlier periods to observing its Y(1) from then on, which leaves holes in the Y(0) matrix, the matrix we ultimately need for treatment effects. Thus any treatment assignment yields an incomplete matrix.

The authors try to make this easier for us by dividing incomplete matrices into two types: a thin matrix and a fat matrix. But before we dive into their specific shapes, let’s look at the broader econometric problems associated with each. They first discuss what they call the unconfoundedness block structure. The unconfoundedness papers they have in mind appear to be mostly matching papers with a simple before-and-after setup. In these setups, we observe a unit before treatment. It is then treated in the last period, which we observe. Thus we are missing Y(0) in the last period for all treated units. An example, if it helps to have one, might be the NSW program in LaLonde (1986). They call this type of treatment assignment the single-treated-period block structure because only the last period is treated, and because many units are treated at that one point in time, the associated matrix is somewhat thin due to N>>T. It looks like this:
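The original post shows this as a figure (not reproduced here); my rough text sketch of that Y(0) matrix, with rows as units, columns as periods, check marks for observed Y(0) values and question marks for missing ones, looks something like this:

            t=1   t=2   ...   t=T
control 1    ✓     ✓    ...    ✓
control 2    ✓     ✓    ...    ✓
   ...
treated 1    ✓     ✓    ...    ?
treated 2    ✓     ✓    ...    ?
   ...
(many rows, few columns: N >> T)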

This matrix is incomplete even though it describes all of the Y(0) elements. We have check marks in the earlier periods because treatment had not yet occurred, so those Y(0) values are observed.3 The question marks, on the other hand, represent missing values. In this block structure, control units have observed Y(0) values in every period, while treated units have observed Y(0) values only until the last period, at which point we observe their Y(1) value instead. Since this is a matrix of Y(0) values, those last-period cells for treated units are the missing entries, and they are exactly what we will need to impute if we are going to measure treatment effects.

Another important block structure is the one associated with synthetic control setups. The synthetic control setup has one treated unit and many control units drawn from a “donor pool”, which is a reservoir of untreated units. This single-treated-unit setup differs from the unconfoundedness matrix, which has many treated units. The one treated unit is treated at T0, after which we observe only its treated potential outcome Y(1) from T0+1 onward. Athey, et al. (2021) say that this block structure creates the fat matrix because T>>N. The fat block structure looks like this:4
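Again, the original shows a figure; my rough text rendering of the Y(0) matrix for the synthetic control case is:

            t=1  ...  t=T0   t=T0+1  ...  t=T
treated      ✓   ...   ✓        ?    ...   ?
control 1    ✓   ...   ✓        ✓    ...   ✓
control 2    ✓   ...   ✓        ✓    ...   ✓
   ...
(few rows, many columns: T >> N)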

Two last block structures are worth describing, though by this point it’s probably just as easy to imagine them as to draw them. The first is an in-between block structure that they call the approximately square block structure. I won’t show it, to save some space, but it’s more or less exactly what it sounds like: N is approximately equal to T, somewhere between the fat and thin block structures above. The second is the staggered adoption matrix, where there are many panel units N and many time periods T, but treatment turns on at different points in time for different panel units.

Different block structures have different implied imputation methods because of the differences in the variation available: the relative number of panel units versus the relative number of time periods. The imputation exercises will be regression based, and depending on the type of block structure, the regression will lean more heavily on the cross-sectional or the time series variation in the data.

I mentioned this earlier, but it is useful to say it again: the unconfoundedness literature often presents a single-treated-period block structure with a thin matrix, where you use a large number of units (N>>T) and impute the missing potential outcome Y(0) in that last period using control units with similar lagged outcomes. They call this type of imputation the “horizontal” regression. Imagine imputing that last period using OLS: you’d regress the final-period outcome on its lags among the control units and then impute the last period for the treated units, which is why the unconfoundedness assumption becomes so critical.
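Here is a minimal sketch of that horizontal regression in R on simulated data. Everything here (the variable names, the fake panel, the choice of OLS) is my own illustration rather than code from the paper or from -gsynth-; it simply shows the mechanics of fitting the last period on its lags among controls and predicting for the treated units.

set.seed(1)
N <- 200; T <- 5                          # thin matrix: N >> T
y_wide <- matrix(rnorm(N * T), N, T)      # rows = units, columns = periods
treated <- rbinom(N, 1, 0.3) == 1         # these units are treated in the last period only

# Fit on control units: regress the last-period outcome on the lagged outcomes
controls <- as.data.frame(y_wide[!treated, ])
names(controls) <- paste0("t", 1:T)
fit <- lm(t5 ~ t1 + t2 + t3 + t4, data = controls)

# Impute the missing Y(0) for treated units in the last period
treated_df <- as.data.frame(y_wide[treated, ])
names(treated_df) <- paste0("t", 1:T)
y0_hat <- predict(fit, newdata = treated_df)

# Individual treatment effects and their average (the ATT); with pure-noise data
# like this, the estimate should be close to zero
att <- mean(treated_df$t5 - y0_hat)
att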

The “vertical” regression, though, goes with the synthetic control “fat” matrix because, remember, in the fat matrix you have a long time series. The fat matrix is a situation where you have a single treated unit (as opposed to a single treated period), which makes it more like a synthetic control style situation.5 In the vertical regression, imputations occur by using weighted averages of the control units in each period to model the missing counterfactuals for every post-T0 period.
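And here is the analogous minimal sketch of a vertical regression, again on made-up data with hypothetical names. Note that actual synthetic control restricts the weights on control units to be non-negative and sum to one; an unrestricted OLS of the treated unit on the controls, as below, is only a rough stand-in for the vertical regression idea (see footnote 5).

set.seed(1)
T <- 100; N0 <- 5; T0 <- 80                  # fat matrix: long time series, few units
y_controls <- matrix(rnorm(T * N0), T, N0)   # rows = periods, columns = control units
y_treated  <- rowMeans(y_controls) + rnorm(T, sd = 0.1)   # one treated unit

# Fit on the pre-treatment periods: the treated unit's outcomes on the
# contemporaneous outcomes of the control units
pre <- 1:T0
fit <- lm(y_treated[pre] ~ y_controls[pre, ])

# Impute the treated unit's missing Y(0) in the post-treatment periods
post   <- (T0 + 1):T
y0_hat <- cbind(1, y_controls[post, ]) %*% coef(fit)
att    <- mean(y_treated[post] - y0_hat)
att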

Nuclear Norm Regularization

The authors present an imputation method that uses regularization to complete the missing elements of the Y(0) matrix, hence the name “matrix completion”. If you are not familiar with the computer science literature, you may be inclined to think that Athey, et al. (2021) are inventing the matrix completion method, but in fact they are not. The method has some fun history.6 Rather, they are borrowing the method from computer science and using it for causal modeling by completing the Y(0) matrix with nuclear norm regularization. Nuclear norm regularization is similar to LASSO in that it involves a penalized objective function; it’s just that the penalty term is the “nuclear norm”.7

This regularization-based method not only has nice imputation properties with respect to prediction error; it also “nests” the two literatures I discussed earlier, the unconfoundedness and synthetic control methodologies. Athey, et al. (2021) show that all of the models appear to have the same objective function, but rather than rely on regularization like the nuclear norm, these traditional methods use restrictions on key parameters to complete the matrix.

Let’s model the problem now. We will allow the Y matrix to equal a low rank matrix L* plus an error term:

Y = L* + e,  where  E[e | L*] = 0

The error term can be thought of as measurement error if you’re searching for a way to frame it. Let’s start with their main identifying assumption:

Assumption 1: e is independent of L*, and the elements of e are sigma-sub-Gaussian and independent of each other. Recall that a real-valued random variable e is sigma-sub-Gaussian if for all real numbers t we have E[exp(t*e)] <= exp(sigma^2 * t^2 / 2).

A sigma-sub-Gaussian random variable has tails that decay at least as fast as the tails of a Gaussian.8 And while I initially concluded that this is a more structural assumption than a design-based one, my follow-ups with Guido Imbens left me with the distinct impression that Assumption 1 can also be thought of as an unconfoundedness assumption. I will continue to dig a little deeper, as unconfoundedness may ultimately be the right way to think about it, but for now I present to the reader the explicit assumption named in the paper.9

Ultimately, the goal is to estimate the matrix L*, and this assumption guides the kinds of imputations allowed. This is one of the key features of the paper, because you can estimate the L* matrix using regularization. There are many ways to regularize a regression: LASSO, elastic net, ridge, and several more. All regularizations have a penalty term, but each penalty term is distinct. As I can’t embed LaTeX into paragraphs directly, I will simply note that we describe the penalty term as:
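In plain text (my rendering of the equation shown as an image in the original post), the nuclear norm of L, written ||L||_*, is the sum of the singular values of L, and the penalty is that norm scaled by a tuning parameter lambda:

penalty = lambda * ||L||_*,   where   ||L||_* = sigma_1(L) + sigma_2(L) + ... (the singular values of L)

Just as the LASSO penalty (a sum of absolute values of coefficients) pushes individual coefficients to exactly zero, the nuclear norm penalty (a sum of singular values) pushes individual singular values to exactly zero, which is what makes the estimated L low rank.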

The underlying regression model will have unit and time fixed effects, and you could subsume those fixed effects into the L matrix, but Athey, et al. (2021) do not recommend it.10 Why? Because, as it turns out, the choice of whether to regularize the fixed effects matters for how well the estimated L recovers L*.

Let’s look at the estimator itself, which ultimately centers on a low rank matrix L*. They write, “This variable L will be used for estimating the components of the low rank components of L*.” The fitted values used for imputation will be the sum of three estimated terms, all of which come out of a single minimization problem:
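My plain-text rendering of that minimization problem, where O is the set of observed (i,t) cells, |O| is the number of such cells, gamma_i and delta_t are unit and time fixed effects, and lambda is the tuning parameter:

(L_hat, gamma_hat, delta_hat) = argmin over (L, gamma, delta) of

    (1/|O|) * sum over (i,t) in O of [ Y_it - L_it - gamma_i - delta_t ]^2   +   lambda * ||L||_*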

Notice how the penalty term does not contain the fixed effects, which instead sit inside the least squares part of the objective function. They write, “in practice, including these fixed effects separately and not regularizing them greatly improves the quality of the imputations.” This is because, compared to typical applications in the matrix completion literature, the fraction of observed entries here is relatively high, so the fixed effects can be estimated accurately without regularization. The penalty parameter will be chosen via cross-validation. Because the fixed effects are contained in the objective function, we end up with three estimated terms: the time and unit fixed effects, as well as L hat itself. These are fitted by choosing the three unknowns L, gamma and delta to minimize the objective function.

So what penalty norm should be chosen? It matters in the end because not all norms perform equally well. They ultimately choose the nuclear norm out of the seven norms they consider, such as the Schatten norm, the rank norm, the operator norm, and several others. Some of these norms produce garbage for no other reason than that they fill out the L* matrix with nonsense.11 For instance, minimizing the sum of squared differences won’t be useful if the objective function doesn’t otherwise restrict L: the estimator would just give you Y itself, and therefore, for each treated unit, a treatment effect of Y - Y = 0. The Frobenius norm as the penalty term, on the other hand, isn’t suitable because its solution sets the missing entries to zero. The rank norm is computationally infeasible for large N and T. And they list other problems with other candidate norms. Which leaves us with the nuclear norm. The nuclear norm has an almost Goldilocks-like property: it can be computed quickly using convex optimization and, as a bonus, produces meaningful imputations.
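To make the nuclear norm a little more concrete, here is a small R sketch. It is my own illustration, not the algorithm in the paper or in -gsynth-, but it shows the operation at the heart of nuclear norm methods: take a singular value decomposition and soft-threshold the singular values, which shrinks a matrix toward low rank while keeping the computation cheap.

# Soft-threshold the singular values of a matrix: the basic shrinkage step
# behind nuclear norm regularization
shrink_L <- function(A, lambda) {
  s <- svd(A)                                # A = U diag(d) V'
  d_shrunk <- pmax(s$d - lambda, 0)          # singular values below lambda become exactly zero
  s$u %*% diag(d_shrunk) %*% t(s$v)          # reconstruct a lower-rank matrix
}

nuclear_norm <- function(A) sum(svd(A)$d)    # nuclear norm = sum of singular values

# Example: shrinking a noisy low-rank matrix pulls it back toward low rank
set.seed(1)
L_true <- tcrossprod(rnorm(20), rnorm(10))   # a rank-one "signal" matrix
Y      <- L_true + matrix(rnorm(200, sd = 0.5), 20, 10)
L_hat  <- shrink_L(Y, lambda = 4)

qr(Y)$rank                                   # Y is (numerically) full rank
qr(L_hat)$rank                               # L_hat has lower rank after shrinkage
nuclear_norm(L_hat) < nuclear_norm(Y)        # and a smaller nuclear norm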

Now that we have the estimator, let’s go back to our earlier discussion of fat and thin matrices. Remember how Athey, et al. (2021) said matrix completion with nuclear norm regularization nests both the unconfoundedness approach and the synthetic control approach? The main way these estimators are similar is that they all share the same objective function. The main way they differ is in the restrictions they place on the parameters, with matrix completion with nuclear norm regularization forgoing restrictions altogether in favor of regularization. This is a subtle insight that may not jump off the page to readers who endured this difficult manuscript, because applied people don’t always care how synthetic control relates to elastic net or matrix completion, but it is worth noting that nuclear norm matrix completion turns out to be the parent of many of these estimators.

Software implementation using -gsynth-

The implementation of this estimator that we will review uses the command -gsynth-, created by Xu and Liu. I’ve posted an R program below that will implement it so that you can see some sample syntax. I’m using the same Cheng and Hoekstra (2013) castle doctrine dataset that I have used in several other substacks, simply for pedagogical purposes. The dataset comprises 50 states observed from 2000 to 2010, 21 of which are eventually treated with a gun policy reform. The outcome is logged homicides. Policies are adopted starting in 2005 and then year after year until 2009.

The code is a mouthful, and I had to get help working on it for a paper of mine (thanks Jim Savage!), but at the least you can run it and it should work. I’ve hopefully left it sufficiently commented so that you can see with your own eyes what each line does. Feel free to email me if you figure out a more efficient way to code it.

Conclusion

In conclusion, while matrix completion using nuclear norm regularization is not technically a did estimator, for many researchers the two are somewhat substitutable, since matrix completion can be used in many of the same situations where you would reach for something like synthetic control or a staggered rollout design. It’s a conceptually advanced estimator, which can make studying the paper difficult given its dependence on computer science methods, so consider this substack nothing more than a simple “light” introduction that I hope will help you as you dive into it yourself. Cheers and have a great day.

###################################################################################
# Program: Effect of castle doctrine reform on log homicides
# Author: Scott Cunningham
# Affiliation: Baylor University
# Created: 5/7/2021
# Date Modified: 6/7/2021
# Modified by: Scott Cunningham and Grant McDermott
###################################################################################
# Libraries ---------------------------------------------------------------
## Install (if necessary) and load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, haven, gsynth, panelView)
# Data --------------------------------------------------------------------
# Read in the castle dataset from the web
castle = read_dta('https://github.com/scunning1975/mixtape/raw/master/castle.dta')
## Untreated units (currently "NA") should have an effective year of 0
castle$effyear = replace_na(castle$effyear, 0)
# Quick look at the data. We have a balanced panel with 21 treated units (i.e.
# states) and variable treatment timing
panelView(l_homicide ~ post, data = data.frame(castle),
          index = c('sid', 'year'), pre.post = TRUE, by.timing = TRUE)

# Similar, but in tabular form
castle %>%
  group_by(sid) %>%
  mutate(treated = any(post > 0)) %>%
  group_by(treated) %>%
  summarise(
    n_states = n_distinct(sid),             # number of states in each group
    n_state_years = n(),                    # number of state-year observations
    mean_treatment_perc = mean(post) * 100  # percentage of treated periods
  )
# Matrix completion -------------------------------------------------------
# This will create an object called "reg1" that will contain all the values
# listed in `?gsynth`
reg1 = gsynth(l_homicide ~ post,        # "regress" log homicides on post-treatment status
              data = castle,            # specify our dataset
              index = c("sid", "year"), # our panel unit and time identifiers
              estimator = "mc",         # NB: sets the estimation method to matrix completion!
              nlambda = 10,             # number of lambda values to search over
              CV = TRUE,                # run cross-validation to choose lambda
              k = 10,                   # number of folds for cross-validation
              force = "two-way",        # unit and time fixed effects
              se = TRUE,                # compute standard errors
              nboots = 1000,            # number of bootstraps to run
              na.rm = TRUE,             # remove missing values
              parallel = TRUE,          # run parallel computing (should decrease run time)
              seed = 011235)            # seed for reproducibility

# Let's check the output
reg1

# In plot form
plot(reg1)

# Alternative representation
plot(reg1, type = "counterfactual", raw = "all")

## We can also extract the overall average treatment effect
reg1$est.avg

## And the cross-validated lambda (i.e. the optimal hyper-parameter chosen via CV)
reg1$lambda.cv
1

It is a major philosophical move to go from potential outcomes to history, but I prefer to simplify it by using the verb “plop”.

2

We will be looking at a block structure where we have N panel units observed over T time periods. And once a treatment occurs, it remains “turned on”. It never turns off again in other words — it is “irreversible”.

3

Time runs from left to right (columns) for each panel unit (rows).

4

A third one is the staggered adoption block structure, which is simply the unconfoundedness matrix except that, rather than treatment occurring once in the last period, it occurs at different times for different units. This is also sometimes called differential timing.

5

Here’s an interesting finding from Doudchenko and Imbens (2016) and Ferman and Pinto (2019): the original Abadie, Diamond and Hainmueller (2011) synthetic control method can be interpreted as regressing the outcomes for the treated unit prior to treatment on the outcomes for the control units in the same periods.

6

Interesting bit of history: back in the day before streaming, Netflix mailed DVDs to subscribers. Subscribers would create “queues” of movies and shows that they wanted to see, and Netflix would send them one at a time after the subscriber returned a disc. Subscribers would rate movies and shows that they had seen, thus creating a giant database of rankings that Netflix wanted to use to predict subscriber preferences. But their prediction model was, they thought, not the best possible one, so to help improve it, they held a contest: if someone could build a model that improved on Netflix’s own prediction error by 10 percent, they would win a million dollars. This became known as the Netflix contest, or Netflix Prize. The problem can, when you twist your head a little, be thought of as a causal problem: if I saw this movie that I have not yet seen, would I like it, and if so, by how much? The contest was amazing, drawing everyone from faculty at prestigious programs like MIT to regular people working out of their garages. Matrix completion ultimately played a major role in this prize, and you can read about it all here.

7

As Ted Lasso says when he learns about his new player Dani Rojas, a transfer from Mexico — “Great name.”

8

This assumption may or may not be for the faint of heart; it depends entirely on whether, like me, you don’t really know much about sigma-sub-Gaussian probability distributions. I did some quick googling about it, but let’s just say I did not come out an expert by any means, which is not surprising given I’ve only had one class in statistics. My kingdom for all the statistics classes!

9

FWIW, I’ve spoken with Imbens a little about the model’s identifying assumption, and he said to me that matrix completion with nuclear norm regularization uses a version of unconfoundedness, but with respect to both X as well as L.

10

Fixed effects are not really covariates. They are kind of weird, and it’s okay if, over time, subtle yet profound insights about them keep emerging in your mind as you learn more and more. They have for me, anyway.

11

“All imputations are wrong, but some are useful.” — Scott Cunningham channeling George Box as his sock puppet