Inexact Matching using Minimized Distance Metrics
Primer on nearest neighbor matching and robust standard errors with concrete examples of calculations
Introduction
Matching estimation is possibly the most intuitive causal method outside of the RCT. The idea that, if we want to estimate a causal effect, we should compare people who look alike just sort of feels right to a lot of people. After all, if selection bias is a result of comparing groups that are super different from one another, then won’t comparing groups that aren’t so different fix it? I mean, assuming we all agree on the confounders, then matching seems like a plan.
And yet despite that very compelling argument I just made, my sense in this job is that matching is far, far less common than regression, even in selection on observables situations. Many social scientists are either unfamiliar with matching, or, if familiar, have never actually used it and will not contemplate using it. Which is itself intriguing given nonparametric matching has fewer assumptions than regression, not more. But I’m not going to try and twist your arm to use matching instead of regression. I have much more modest ambitions. I only want to prop up the hood of matching so that it’s not so opaque. But to do that requires some build up, so bear with me while I try to justify the decisions I make in this essay.
In this substack, I’m going to explain what inexact matching using the entire conditioning set, as opposed to the propensity score, actually does. I’ll walk us through the reasons that lead to inexact matching being used, which assumptions shape the problem, and most of all, explain in as simple a way as I can the calculations themselves using images from a Google sheet, as well as a link to the Google sheet, so that if you wanted you could see the formulas yourself. There’s also going to be Stata code, links to the correct (and incorrect) R packages, as well as a discussion of how standard errors should be ideally constructed.
Some Important Prerequisites
Before we dive in, though, I want to just lay out for you the three assumptions (emphasizing both their interpretation and implications) you need for matching (ignoring the most basic one regarding the sampling of data). These are more or less the most generic selection on observables assumptions common to stratification weighting and propensity score based estimation. They are:
Stable Unit Treatment Value Assumption (SUTVA). Perhaps the most overlooked and least understood of the causal assumptions. I’ll explain it below.
Conditional independence, or “unconfoundedness”. This means the treatment has been assigned to units in the study for reasons that are independent of potential outcomes and their functions, such as the treatment effect itself, conditional on all known and quantified confounders. This is a strong assumption, but not just because it is untestable. All designs have at least one untestable assumption. No, as I say below and have said before, this assumption is strong because it says the participants stopped optimizing with respect to incentives and began flipping coins once you compare them with others of comparable confounder values. I’ll save my rant and diatribe about this for later.
Common support. There exist units in both the treatment and control groups sharing the same covariate values in all dimensions. This is the only testable assumption, and its violation is the source of a cottage industry of methods within these selection on observables methodologies. Even when conditional independence holds, common support may not, usually will not, and that creates bias, which we are touching on now but in greater detail in the next few weeks.
What is SUTVA?
This awkwardly named phrase makes its first appearance in a 1980 JASA comment by Don Rubin where he writes:
“The assumption that such a representation is adequate may be called the stable unit-treatment value assumption: If unit i is exposed to treatment j, the observed value of Y will be Yij; that is there is no interference between units (Cox 1958, p. 19) leading to different outcomes depending on the treatments other units received and there are no “technical errors” (Neyman 1935).”
SUTVA concerns the interactions between the people in your study. It refers to what happens (or more specifically, what doesn’t happen) when people are aggregated for estimation purposes. This is because the objects of interest in causal inference are almost always average treatment effects. We estimate parameters called the ATE, ATT or LATE, each of which is based on many people’s individual treatment effects. The individual treatment effect is what Rubin calls the “treatment value,” and he considers that value stable if, when you put more than one person together for a study, the choices of the participants do not cascade throughout the experiment, changing people’s treatment effects and therefore their realized outcomes. Let me illustrate what a SUTVA violation looks like with a practical example.
Consider two persons, Ted and Sassy. Ted is deciding whether to get an elective vaccination. If he does, then his health will improve by 5. But what if Sassy gets vaccinated, too? Well, if Sassy gets vaccinated, then maybe Ted won’t get sick because with Sassy as a central node in his social network, the virus would have to hop around her to get to him. And let’s say Sassy is in fact creating that type of barrier — it isn’t just in his head. Then his treatment effect may fall from +5 to +1.
Rubin notes that the interference is between Sassy’s choice and Ted’s realized outcome — “leading to different outcomes depending on the treatments other units received”. We know his realized outcome has changed if his treatment effect fell, and that is where the instability arises. There are other ways in which SUTVA can be violated, but the “no interference” part is the one that is the most challenging to safeguard against, and so the one that gets the most attention.
And matching methods assume that there is no interference occurring between units in the study, otherwise the aggregated parameter (as opposed to one's estimator) becomes unstable. SUTVA, in other words, concerns the aggregation of the treatment effects themselves, not the estimator or its estimates.

Conditional independence
In causal inference, all of the untestable assumptions are hard, because they require a leap of faith on some fundamental level, but some assumptions are harder than others, and some leaps are blind leaps of faith whereas others feel like you’re jumping with your eyes only halfway closed (whether true or not). Of all the untestable assumptions in causal inference, we often think that conditional independence is perhaps the hardest of them all.
And there’s a reason for that, one I’ve given on here again and again, and it concerns your background or field of study. Conditional independence is hard for many from the decision sciences because in the decision sciences we often attribute to human decision making enough rationality that the implications of conditional independence seem impossible. Let me show you. The representation of conditional independence is:
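In potential outcomes notation (writing D for the treatment indicator, X for the confounders, and Y^0 and Y^1 for the untreated and treated potential outcomes), it says:

$$ (Y^0, Y^1) \perp\!\!\!\perp D \mid X $$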
The killer in conditional independence is not the “conditional on X” part, which references the confounders one must condition on to satisfy the backdoor criterion. We often focus our attention on that and express skepticism that we could possibly know all the confounders. Well, without a model, you definitely can’t know, but as empirical micro moved away from models (Card 2014; Card 2022), this assumption has only gotten harder to explain, let alone justify. Plus, I’m borderline of the opinion that “conditional on confounders” is practically begging the question. That is, IF they weren’t rational, and they flipped coins, then conditional independence would be true. But I never covered this once in my micro theory or game theory courses — nothing about randomizing when compared with units of identical values. It seems almost like it was invented to solve a problem, not invented as a description of behavior.
No, I contend that it is actually not the confounder part semantically that hangs us up — it’s the independence part, the first part of that expression. Independence means, for all practical purposes, that the people in the study chose to engage or not engage with the treatment as though they were flipping coins. This is hard because, despite what I said about empirical micro moving away from models, many of us still cling to the idea that people respond to incentives. And the treatment effect is an incentive. It represents the net gain, or harm, from a choice, and conditional independence implies on some level that the actor quit caring about it.
But put aside for a second the behavioral model behind conditional independence and notice its statistical implications. When the treatment is assigned to people independent of their potential outcomes, then the means of each group’s potential outcomes, conditionally speaking, are the same — regardless of which group we are thinking of. Conditional independence is a randomization assumption, in other words. And it implies these two equalities:
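Written out in the same notation as above, with the missing counterfactual terms set in red the way I describe them just below, they are:

$$ {\color{red} E[Y^0 \mid D=1, X]} = E[Y^0 \mid D=0, X] $$

$$ E[Y^1 \mid D=1, X] = {\color{red} E[Y^1 \mid D=0, X]} $$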
First, notice how some quantities are red and some are black. The ones in red are missing counterfactuals — they don’t exist. They aren’t missing the way your car keys are missing; they’re missing the way leprechauns are missing. The red quantities do not exist, never have existed, never will exist. You can ask how your life would be different had you won the lottery, but if you did not in fact win it, then that’s just a fantasy.
The ones in black are realized though. Imbens often calls them the “realized outcomes” because if you did win the lottery, then we know Y1 for the lottery winners — it’s just the value of Y in the data. The black quantities do exist.
But why does that matter? Let me show you by writing down the definition of the ATT as it may help if we can see this assumption working for us. The population ATT is:
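In the same notation, and conditioning on X since that is where the substitution will happen, it is:

$$ ATT = E[Y^1 - Y^0 \mid D=1, X] = E[Y^1 \mid D=1, X] - {\color{red} E[Y^0 \mid D=1, X]} $$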
We can’t calculate this because the red term is missing. But, under conditional independence we don’t need it — we know from the first line that we can just replace that red term with the black term and we get:
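$$ ATT = E[Y^1 \mid D=1, X] - E[Y^0 \mid D=0, X] $$

Every term on the right-hand side is now a realized, estimable quantity; the missing counterfactual has been swapped for the comparison group's conditional mean, which the first equality above says is the same thing.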
And so simply by switching out the two terms, we have the exact same definition of the population ATT. Conditional independence allows simple substitutions.
Common support
But notice the subtle elements of the conditional statement — for those units who share the same values of X, and only those units who share the same values of X, is this substitution allowed.
The only reason the procedure we are building to can work is that we have the ability to substitute because we have equivalent units in both groups to do the replacement of the red term with the black term. Conditional independence gives you permission to substitute but without units sharing the same values of X — the exact same values of the entire conditioning set — you can’t make the actual substitutions.
So, while conditional independence creates those red and black equalities, we can’t actually do anything with it unless the distribution of the covariates is the same for both groups. Our ability to switch out missing counterfactuals with realized outcomes is only within shared values of X. The expectations we are trying to estimate hold only when we replace counterfactuals for X=x with realized outcomes for X=x from one group to another. And this is called “common support”.
Common support is critical in matching estimation. It means that for all dimensions of X, there exists a non-zero probability you’ll find someone possessing that exact value of X in both the treatment and the control group. Perhaps this is abstract. Let’s look at two pictures to maybe dig us out — one where support is awesome, and one where it is not.
Both figures show the distribution of age by treatment status, but notice how imbalanced they are. The one on the left has “overlap” but the one on the right has “poor overlap”, and here’s how I know. If I was blindly looking at a 25-year-old in the data on the left, I wouldn’t know which group — treatment or control — they were in, because 25-year-olds are in both groups. But what if I was looking at the data on the right and I pull out a 20-year-old? Well, if I’m looking at a 20-year-old in the data on the right, I’m certain which group they’re in: the treatment group. How did I know? Because everyone who is 20 years old in the data on the right is in the treatment group. There is no one who is 20 in the control group.
Common support makes statements like this, which is why you often see it expressed as a probability — it’s the probability that at least one person can be found in both groups for all dimensions of your confounder set. If you can’t, then even with conditional independence, you can’t technically do the substitutions implied by conditional independence because you don’t have someone to do it to. You know that in expectation those equalities would hold, but because your dataset is missing those people, you can’t take advantage of it.
I think what is the biggest irony to me of the biases of matching is that we spend so much time thinking about the conditional independence assumption being false that we don’t really understand what breaks down when common support doesn’t hold. And to make matters even more bizarre, since for many people regression is their only causal tool, they might learn enough about exogeneity to pass the exam, but not common support, and therefore not fully understand the relevance that exogeneity has for if and when common support fails. Regression sticks around, with many of us only vaguely aware of what exogeneity means (if at all in causal terms), in part because it is such an easy and convenient way to summarize correlations using standard software packages.
So what I want to do is build something up from scratch. Not me of course — I can’t build anything. I mean discuss the work of people who have built things.
In reality, it’s much, much more common that common support fails than that it doesn’t. Losing common support is just as bad as losing conditional independence with respect to estimating causal effects. But this is an area lightly perused by professionals, in part because they get so triggered by conditional independence that they haven’t waded into the deep end of “overlap problem” scenarios to really understand what it entails, and where to go when it happens.
Inexact matching on the full conditioning set
In a 2002 NBER working paper, Abadie and Imbens lay out a new framework for matching that I think is the first appearance of this particular nearest neighbor method. This lengthy paper would later be broken up into two publications — one in Econometrica in 2006 and another in JASA in 2011. I’ll discuss them both in this substack.
But I’ve found that if I can’t do something in a spreadsheet, then I probably don’t understand it. So in addition to exposition and pictures, I’m providing a link to these exercises so you can play around with it yourself. You can click on it here. Let’s begin.
Below is a dataset with 30 people: ten in a job training program, twenty not. We observe for each their earnings, age and GPA, and we believe age and GPA are the only confounders. We also believe SUTVA holds. The question is then — does common support hold?
And the answer is no. There is no one in the comparison group who is both 18 and has a GPA of 1.28. The problem is GPA is continuous, and for continuous variables, you’ll never have two units with the same precise continuous value. So just including a continuous confounder gets us into trouble. It is not enough to meet the backdoor criterion if you don’t have a solution for this common support ailment.
We can see the groups are different on confounder values. The average age and GPA differ for the two groups. Trainees are younger than the comparison group by about 8 years (on average) and have higher GPAs. If I compare their average earnings, the trainees make 26.25 dollars less than the comparison group.
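If you want to eyeball those gaps yourself, a couple of lines of Stata will do it. This is just a minimal sketch; it assumes the training_inexact.dta file and the variable names (earnings, age, gpa, treat) that show up in the teffects code further below.

* Load the example data and summarize the outcome and confounders by treatment status
use https://github.com/scunning1975/mixtape/raw/master/training_inexact.dta, clear
bysort treat: summarize earnings age gpa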
The new matching paradigm that Abadie and Imbens rolled out in their 2002 paper was not the first matching estimator, as I said. There were the caliper and radius matching methods based on the propensity score, for instance. But Abadie and Imbens (2002; 2006; 2011) is not a propensity score paper. It matches by imputing potential outcomes using the entire dimension of the confounder and covariate set (as opposed to collapsing them into a propensity score). They will match units, either 1 to 1 or 1 to many, with or without replacement, such that the square root of the sum of each unit’s “matching discrepancy” is minimized. That formula is below and looks an awful lot like the method proposed by Abadie and Gardeazabal to estimate the optimal weights in synthetic control. But here we are not estimating weights — we are estimating an assignment of units to one another with the goal of minimizing the following square root:
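In notation, and this is just my sketch of it: write X_i for treated unit i's covariate vector (age and GPA here) and m(i) for the comparison unit assigned to i. The procedure looks for the assignment that minimizes

$$ \sqrt{ \sum_{i:\,D_i=1} \lVert X_i - X_{m(i)} \rVert^2 } \;=\; \sqrt{ \sum_{i:\,D_i=1} \sum_{k} \left( X_{ik} - X_{m(i)k} \right)^2 }, $$

which, with the plain Euclidean norm, is exactly the square root of the summed squared matching discrepancies we calculate in the spreadsheet below.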
To help you understand what the optimal solution is, I’m going to first show you two non-optimal matches with the hope this will enhance your understanding of the procedure. Below are results from two randomly assigned pairings of untreated units to each of the 10 treated units. For each, I estimate a sample ATT, then illustrate the matching discrepancies for each, before then calculating the square root formula from before. If we do it in the spreadsheet together, I think you’ll see with your own eyes how absolutely simple this method is.
Random match #1
Random match #1 was created when I grabbed 10 units at random from the comparison group and matched them, row by row, to my 10 trainees. Trainee 1 got matched with comparison unit 4; trainee 2 with comparison unit 20; and so forth. When I estimated the sample ATT from this matched sample, all I did was average over the treatment group ($11,075) and subtract the average of my matched sample ($11,520). This gives me -$445, which is an even bigger negative than what I found with the SDO.
That’s easy but that’s not what I want to show you. Notice that last column I’ve called “Matching Discrepancy”. I’ve tried to make this as explicit as I possibly can so that anyone trying to follow this will understand it. I have a column labeled AGE_DIFF and a column labeled GPA_DIFF. Each of those is the square of the difference between a treated unit’s age (or GPA) and its matched unit’s age (or GPA). Hear me out. Unit 1’s age is 18. I matched him with unit 4, whose age is 39. If I difference 18 - 39 I get -21. And if I square -21, I get 441, which is the number in the first cell under AGE_DIFF. The reason I am squaring this is because in the next step I’m going to sum 20 numbers (10 for AGE_DIFF plus 10 for GPA_DIFF), and I don’t want positive and negative discrepancies canceling each other out.
There are 10 rows and 2 columns, giving me 20 cells, each with its own squared gap. I then sum those squared gaps, which equals 2,093.6141. And then I take the square root of that, which I have put in red: 45.76. What is 45.76? It is a measurement called the Euclidean distance, measuring the “distance” between all 10 trainees in aggregate and the 10 matched comparison units in aggregate. And that distance is simply the square root of the sum of all squared matching discrepancies. It’s similar to root mean squared error, except it’s not an outcome error and there is no mean. It’s a sum, without an average, of matching error based on each unit’s and its match’s confounder values.
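To put the whole calculation in one line (all the numbers are from the spreadsheet above, and m(i) is unit i's randomly assigned match):

$$ d = \sqrt{ \sum_{i=1}^{10} \left[ (\text{age}_i - \text{age}_{m(i)})^2 + (\text{GPA}_i - \text{GPA}_{m(i)})^2 \right] } = \sqrt{2{,}093.6141} \approx 45.76, $$

with the first AGE_DIFF cell being $(18 - 39)^2 = (-21)^2 = 441$.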
Random match #2
The idea of Abadie and Imbens’s procedure is to find an assignment that minimizes some distance metric. So is 45.76 the smallest possible value of the Euclidean distance? Probably not, since I randomly grabbed ten people to match. So for fun, let’s look at another matched assignment. Below is another randomly matched sample.
Same thing as before — I randomly matched units from the comparison group to my treatment group, which gives me a sample ATT estimate of $252.50. I then calculated the Euclidean distance associated with this assignment by taking the difference in age and GPA, respectively, for each matched pair like before, squared them, summed all 20 (1,058.1242) and then took its square root (32.53). This is smaller than the one from before, but since I just grabbed another 10 to match, it’s probably still not the optimal assignment from the perspective of minimizing the distance metric.
Optimal matches
Seeing these change, and given there’s a finite number of ways to assign 10 of the 20 comparison units to 10 people, you may be wondering: “Is there a unique solution to this problem, a matched sample that actually minimizes the Euclidean distance?” And the answer is yes — that’s actually one of the several things that Abadie and Imbens presented in that original 2002 NBER working paper (turned 2006 Econometrica). Below is the matched sample that minimizes the Euclidean distance.
The sample ATT estimate for this is $1,607.50, but that’s not what I want to show you. Instead, I want to show you that this matching does appear to be somewhat more balanced on age (identical actually; I’ll show you why in a second) and GPA (better but not great). Using this matched sample, we end up with a much smaller Euclidean distance than we found with either of the other two matchings: 2.99 (versus 32 and 45).
Now why is it that this was so successful? Well, the main reason is that it found exact matches for age, driving all those matching discrepancies to zero. So when you summed all the squared gaps, half of which were zero, you only get a value of 8.97 (compared to the other numbers we’d gotten in the thousands).
Don’t take the following picture too literally (for instance, it’s smoothly shaped, which is pure artistic talent on my part, not science), but imagine for a moment if we could take 10 units from the 20 and assign them randomly (without replacement for simplicity) to our treatment group. I think it’s a combinatorics problem of 20 choose 10, which means there are 184,756 different ways to pick 10 units for those trainees out of that sample of 20. The method Abadie and Imbens developed found the assignment with the smallest Euclidean distance possible. A possible picture might look like this:
The matching assignment I called Q* is the one that minimized the Euclidean distance — no more, no less. Any other assignment, even one that changes just a single match, will be strictly worse than Q* in the sense that its Euclidean distance will be higher.
More robust distance metrics
If you are tracking with me on this, then you probably can forecast where things go next. The Euclidean distance is really just one type of distance metric; there are others. And the Euclidean distance has some problems worth noting. For instance, when it tries to minimize that square root, it treats differences in age and GPA the same. Notice how the final sum treats a 1-unit gap from age the same as a 1-unit gap from GPA. That’s because if the gap between two people’s ages was 2 years, then 2 squared is 4. But if the gap between two people’s GPAs was 2 points, then that’s also 4. Minimizing the Euclidean distance treats those as exactly the same, when in reality someone who is only two years older than their counterpart may seem more similar than someone who is 2 points apart on GPA, which is a huge difference in academic achievement.
So alternative metrics capturing the spirit of what we did, but which do not equate the scales, are preferred, and in software packages they may even be the default. There are at least a few. The two I know of are the Mahalanobis distance metric and a normalized Euclidean distance metric. Both scale the gaps by terms that normalize them such that, regardless of which variable I’m looking at, a one-unit gap means the same thing. The normalization is usually based on the variance of the covariate or the sample variance-covariance matrix, but either way, the scale of the covariates is no longer an issue.
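In notation, and again this is just my sketch: writing s_k^2 for the sample variance of covariate k and S for the sample variance-covariance matrix of the covariates, the normalized Euclidean and Mahalanobis distances between two units are

$$ d_{NE}(X_i, X_j) = \sqrt{ \sum_k \frac{(X_{ik} - X_{jk})^2}{s_k^2} }, \qquad d_{M}(X_i, X_j) = \sqrt{ (X_i - X_j)' S^{-1} (X_i - X_j) }. $$

Either way, a one-unit gap in age no longer counts the same as a one-unit gap in GPA; each gap is judged relative to how much that covariate varies in the sample.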
Stata Syntax and the Standard Error
Now let’s look at some implementation in Stata. If you want to implement this in R, be sure you are using Jasjeet Sekhon’s Matching package, because we are matching by imputing potential outcomes according to the minimized distance metrics defined above, and for that you want teffects nnmatch in Stata or Matching in R.
Here though is some simple code in Stata.

* Nearest neighbor matching using teffects
clear
capture log close
use https://github.com/scunning1975/mixtape/raw/master/training_inexact.dta, clear

* Minimized Euclidean distance with usual variance and robust
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(eucl) vce(iid)
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(eucl) generate(match1)

* Minimized Maha distance with usual variance and robust
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(maha) vce(iid)
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(maha) generate(match2)
Notice that the first two teffects calls estimate the sample ATT using the minimized Euclidean distance, while the last two use the minimized Maha distance, which is invariant to the scale of the two covariates I used. Within each pair, the first call specifies vce(iid) and the second does not. The vce(iid) calls use what’s called the “usual variance estimator” to calculate the standard error, while the others use something called the Abadie-Imbens robust standard error. This is actually more interesting than it may seem because, if you’ll notice, minimizing the Euclidean distance resulted in the same standard error regardless, but minimizing the Maha didn’t. Why? I’ll try to show you now. But first, here’s a table of output.
The primary contribution of the 2006 paper by Abadie and Imbens was not merely to illustrate this matching imputation method we just reviewed, where the potential outcomes of each unit were imputed using an optimally selected comparison group unit. That was part of it, no doubt, but not the whole of it. The title of the paper is sort of a giveaway for what the paper is really about: “Large Sample Properties of Matching Estimators for Average Treatment Effects.” The paper has many results, and because this substack is already incredibly long, I just want to showcase one, and that is the formula for the large sample distribution of the ATT.
First, they show that matching estimators of the sort we just reviewed have a normal distribution in large samples provided the bias is small. I’m going to discuss the issue of bias in the next substack, so I’ll hold off on this for now. We will just assume that the bias associated with the matching discrepancy is small. The issues therefore come down to the correct formula for estimating the variance of the estimator. For this, without major bias, there are only two relevant situations: matching without replacement and matching with replacement.
Matching without replacement means that each unit can only be used one time. Once it’s used, you throw it out. Matching with replacement means you can use a unit many times as a match. Once it’s used, it goes back in the sample to be used again if necessary. The “usual variance estimator” applies when we match without replacement and is the following formula:
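(Writing N_T for the number of treated units, Y-hat_i(0) for treated unit i's imputed untreated outcome, that is, its match's earnings, and delta-hat_ATT for the sample ATT; this is my sketch of it, and it amounts to the variance of a mean of matched-pair differences.)

$$ \widehat{V} = \frac{1}{N_T^2} \sum_{i:\,D_i=1} \left( Y_i - \widehat{Y}_i(0) - \widehat{\delta}_{ATT} \right)^2 $$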
When I minimized the Euclidean distance, if you go back to the top, you’ll notice that I actually used each unit only once as a match. That was because, in this dataset, the optimal match did not have to use a unit more than once. Now why does this matter? Because when a unit is used multiple times as a match, the variance of the estimator must take that into account. That is in fact one of the major and original contributions of this paper.
When a unit gets used as a match again and again, that variance term gets bigger mechanically because of the form of the second row. Let K be the number of times a comparison unit is used as a match. Then the variance term grows by K(K-1) multiplied by a bunch of other stuff. And the more times it’s used as a match, the bigger that variance term becomes because the bigger the second row gets. The formula for “matching with replacement” is:
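(Same notation as before, now letting K_i be the number of times comparison unit i is used as a match, M the number of matches per treated unit, and sigma-hat-squared(X_i) the estimated conditional variance of the outcome at X_i; again, a sketch of the two rows.)

$$ \widehat{V}_{AI} = \frac{1}{N_T^2} \sum_{i:\,D_i=1} \left( Y_i - \widehat{Y}_i(0) - \widehat{\delta}_{ATT} \right)^2 $$

$$ \qquad\qquad + \; \frac{1}{N_T^2} \sum_{i:\,D_i=0} \frac{K_i (K_i - 1)}{M^2} \, \widehat{\sigma}^2(X_i) $$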
Notice that the top row is the “usual variance” estimator, but the second row is that inflation element where, as I said, K is the number of times a comparison unit was used as a match, and M is the number of matches each treated unit receives (one here, given nn(1)).
Now in our simple case, the minimized Euclidean distance resulted in a unique set of matches such that K=1. Remember that. If K=1, then K(K-1) = 1(1-1) = 0 and the bottom row vanishes and we get the usual variance estimator whether we specify using AI robust standard errors or not. Look back at columns 1 and 2 of my table above. Look at the standard errors. Why do you think they are the same if the first used the usual variance estimator but the second used the AI robust standard errors?
Now consider the treatment effect when instead of minimizing the Euclidean distance, I minimized the Maha distance. Yes, the treatment effect changed, but that’s not a surprise: the Maha metric did not give equal weight to the covariates’ discrepancies, and as a result it produced a somewhat different pairing. But I care about showing you something else. This one used several units multiple times for the matching. And so go back to columns 3 and 4 in my table. Same syntax as before, but here I get different standard errors. Can you guess why? Look below to see for yourself.
Matching with replacement reduces the bias, but it increases the variance, as we see in the formula above. And to see exactly what happened, let’s go back to our Google spreadsheet (click on the Maha tab). But I made color coded pictures for us now, too.
I arranged the information a little bit differently by stacking the comparison group below the treatment group, and then numbering all units, 1 to 30, because I wanted to keep to the syntax used by Stata’s teffects command. You can find this information yourself if you run the last teffects command above, because the -generate(match2)- option will allow you to find which units from the comparison group were used to impute the missing potential outcome for each of the ten treatment units. And I’ve shown that in the middle column labeled “matched unit” in light purple.
I want you to notice something: I used unit 20 as a match for five people, I used unit 23 as a match three times, and I used unit 11 twice. In fact, I used only three comparison group units as matches for my ten treated units because I “matched with replacement” when minimizing the Maha. Now go back to the standard errors in columns 3 and 4. Does it make sense now why the standard error got larger with the AI robust standard error? I’ll put it back below so you can see it without scrolling up.
When I calculated the usual variance term, I had a much smaller standard error. It was larger than the one from minimizing the Euclidean distance, but that was simply a reflection of the original formula used to calculate the variance term using the Maha distance — its variance here was larger.
But when I adjusted for the number of times I used a unit as a match, the standard error doubled. Why? Because each comparison group unit contributes not just its outcome, from which we get a row-specific squared deviation from the sample ATT, but also that scaled K(K-1) term multiplied by the estimated conditional variance (which they note can itself be estimated with matching, but so as to not overwhelm you, I’m going to skip that). That second row uses the number of times an observation in the comparison group was used as a match, and here there are only 3 such units, but they are used many times over. And this causes the variance to increase so long as you’re matching with replacement.
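To see the second row at work with the numbers from this example (M = 1 from nn(1), and N_T = 10 treated units, so N_T squared is 100), only the three reused comparison units contribute anything:

$$ \frac{1}{100} \left[ 5(5-1)\,\widehat{\sigma}^2(X_{20}) + 3(3-1)\,\widehat{\sigma}^2(X_{23}) + 2(2-1)\,\widehat{\sigma}^2(X_{11}) \right] = \frac{20\,\widehat{\sigma}^2(X_{20}) + 6\,\widehat{\sigma}^2(X_{23}) + 2\,\widehat{\sigma}^2(X_{11})}{100} $$

Under the Euclidean matching every K was 0 or 1, so every one of those K(K-1) terms was zero and the second row disappeared entirely.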
Summarizing remarks
So why did I write this? I wanted to emphasize the complications created, not by conditional independence failing, but by common support failing. When common support fails, you’re unable to find exact matches. We have a continuous variable, GPA, and that really screwed stuff up. While we could find exact matches on age, because age had been rounded, we were unable to find exact matches on GPA. And when you have any continuous matching variables, you won’t be able to match exactly. It’s impossible. Which means once you have a confounder that is continuous, you’re automatically in a world of bias! Ironically, not because conditional independence failed, but because common support did.
So what Abadie and Imbens did across a bunch of papers stemming from that 2002 NBER working paper is create a simple matching estimator that minimizes a distance metric. The objective function looks, to me, awfully similar to the problem Abadie proposed in his synthetic control estimator, as I said, only here you’re choosing an assignment that minimizes some distance metric based on covariates. Whereas in synthetic control you’re choosing optimal weights on donor pool units that minimize the covariate gap between the treatment group and the donor pool, here you’re choosing an assignment that minimizes that aggregate gap (distance).
The second thing we showed here, though, was just the brute force calculations involved in this minimization problem. Nearest neighbor matching follows an algorithm to solve that minimization problem, which I suspect is why the original Stata command, nnmatch, was so slow. The calculations of the distance, expressed as covariate discrepancies, are themselves not always obvious to people, but I hope this helped clarify them.
The third thing was of course to show you the Stata syntax as well as the output, and help you interpret it, but I decided pedagogically to go a little out of order by showing you the syntax, then a table of coefficients, and then concluding by showing you the various standard errors. I did this because my gut tells me you guys needed a break. I was talking too much. Plus I think many of us really don’t begin to process standard errors until we see them surrounded by parentheses in a table, whereas treatment effect parameters sort of feel like Platonic ideals floating in the air, or at least they do for me.
And then fourth, I wanted to really crack open why those standard errors are what they are by walking you through two situations: the minimized Euclidean distance, where it just so happens each unit was used only once (probably because we were able to exactly match on age), and therefore the AI robust variance estimator just collapsed to the usual one because when K=1, K(K-1)=0 and the entire second row vanishes. But then we looked at the minimized Maha distance, and in that one, units were used more than once for the matching.
And here we see the crux of the issue — there is literally a bias-variance tradeoff when it comes to whether you’re going to allow units to be used repeatedly or not. If you let them be used repeatedly, then you’re minimizing matching discrepancies, and in a world where conditional independence holds, the only source of bias is matching discrepancies — i.e., when someone’s covariate value is not exactly the same as their paired match. So matching with replacement will always reduce bias. But, if you end up with a bunch of units being used more than once, then that second row of the variance estimator — which note is entirely coming from counting information about the comparison group unit, not the treatment group unit — starts to tick up and up and up. The fewer times a comparison unit is used as a match, the smaller K is for that row, the smaller the second row is for that unit. But if your sample is such that a unit is getting used over and over and over, then the variance will grow.
Some Unsolicited Advice
So, in conclusion, I’m going to do something I don’t usually do and give advice. If you’re going to use this particular procedure for matching, then common support should be your focus. It drives the bias. But it’s observable, and if you wanted, you could write code that literally calculated the square root of the summed squared discrepancies. But I don’t think that’s what you want to do, as it’s not going to tell you a whole lot as far as I can tell. Histograms, covariate by covariate, on the other hand, do sound like a better use of your time.
What I think you want to do, quite frankly, is simply use the distance metric that you think makes the most sense. In this case, do I want to exactly match on age and inexactly match on GPA? If so, then the two Euclidean commands did that, albeit coincidentally. Or do I want to match on both of them by allowing for inexact matches that minimize some distance metric? Well, if your scales differ wildly (say age vs log income, where a 1-unit difference means something very different in each), then I think you’re going to want to select a distance metric that doesn’t depend on the scale. The Mahalanobis distance metric is the one you’ll most likely want to select, though a normalized Euclidean distance metric that scales each row’s discrepancy by the variance of the entire column could also be used. Heck, you could even put your own arbitrary / subjective weights in there if you really really really think age should be counted more than GPA. But as Brene Brown said in the second season of Ted Lasso, if you’re coming to the arena, be sure to bring a knife. Any choices like that which are subjective, be prepared for a fight, and be prepared to lose that fight too.
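If you want that normalized Euclidean option in Stata, my understanding is that teffects nnmatch exposes it through metric(ivariance), which scales by the inverse of each covariate's sample variance; treat this line as a sketch to double-check against the help file rather than something from the output above.

* Normalized (inverse-variance) Euclidean distance, AI robust standard errors by default
teffects nnmatch (earnings age gpa) (treat), atet nn(1) metric(ivariance)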
Finally, the standard error. As you can see, the AI robust variance estimator equals the usual variance estimator unless you are using units more than once, and if you are using units more than once, then the usual standard errors are wrong. So my advice? Always use the AI robust standard errors; otherwise your estimate of the large sample variance is too small and you’ll over-reject. If the matches are unique, it won’t matter.
Thanks again for reading this far! Don’t forget to subscribe, share, etc. if you like this content. Next up is going to be a primer on bias adjustment with matching, and then after that a horse race between regression and the other estimators to help us better understand when and why regression breaks down in selection on observables situations.
There are other ways SUTVA can be violated, though. There is, for instance, “hidden variation in treatment”. Perhaps we are combining units with different measurements of the treatment (e.g., you took 10 melatonin but I took 1/2, and yet we are assumed in the ATE to be taking the same dosage, even if our response to it is unique). SUTVA assumes that the assignment of treatment to one person does not change the treatment effect of another. It’s not that there aren’t any spillovers; it’s that the returns to the treatment for person B are not a function of someone else’s treatment assignment in the experiment. Which is relevant precisely because we are grouping multiple units. Other things ruled out by SUTVA concern scale effects or general equilibrium effects, as scale can introduce subtle changes in the treatment itself as inputs become more scarce, or make adjustments where at the margin the returns change as entrants respond and enter or exit. SUTVA is crucial because it keeps us in the realm of partial equilibrium, which means it necessarily limits the generalizability and long-term accuracy of our estimates when it is violated. But the better we understand it, the better we understand what we are and are not doing.
In other words, while an excellent and seemingly complete package, you do not use MatchIt to implement the Abadie and Imbens (2002, 2006) matching estimators. Instead you use Matching. For a detailed discussion, read this help file for MatchIt, and skip to this opening section where Noah Greifer, the author of MatchIt, writes: “It is important to note that this implementation of matching differs from the methods described by Abadie and Imbens (2006, 2016) and implemented in the Matching R package and teffects routine in Stata. That form of matching is matching imputation, where the missing potential outcomes for each unit are imputed using the observed outcomes of paired units. This is a critical distinction because matching imputation is a specific estimation method with its own effect and standard error estimators, in contrast to subset selection, which is a preprocessing method that does not require specific estimators and is broadly compatible with other parametric and nonparametric analyses.”