Please indulge me in writing a stream of consciousness explainer as it’s been a long day and I need to let off some steam. After foolishly feeding a family of five feral cats for the last year and watching two of them get picked off by unknown predators, I ended up hiring a “cat whisperer” to help me catch them, take them to the vet, and bond them to my two cats. It’s about to be Brady Bunch up in here — my family of 3 is about to be a family of 6, and 8 when the girls are here. Which is a nice segue, I think, to starting a substack about the causes of mental health.
Before I do, though, I want to set you up with what to expect here. First, I want to lay out my emerging “philosophy,” so to speak, regarding evidence and estimation. And second, I want to discuss a new paper on social media and mental health that was published last winter in the flagship journal, the American Economic Review. I want to cover this paper because I think it’s important, I understand it at a personal level, and I suspect it will be a paradigmatic paper for how to present event studies when there are so many diff-in-diff estimator options. So I thought that would be a good reason to discuss it.
This paper is one of the first “causal” studies linking social media platforms to the deterioration of youth mental health. For some it will not come as much of a surprise — they probably did not need this so-called “causal” evidence because they watched it all unfold with their own eyes. And while those people may understand that a non-random sorting into an activity is potentially biased depending on the assignment mechanism, for them the answer had been obvious and so they did not have the patience to wait almost 20 years for a diff-in-diff that showed what they already knew. Different types of scientific reasoning regarding causality and correlations as forms of evidence are very old. In fact, two of causal inference’s own founding fathers, Jerzy Neyman and Ronald Fisher, outright rejected the then not yet widely accepted smoking-lung cancer hypothesis because the evidence for it suffered from selection bias, implausible (in their minds) magnitudes, disputable functional form assumptions and ultimately lacked a randomized trial. Fisher, a lifelong pipe smoker and highly paid expert witness for the tobacco industry, took that incredulity to the grave too when he succumbed to lung cancer. But sometimes, where there’s smoke, there’s fire, even if the argument used to infer the fire from the smoke is not technically right. Which only goes to show you that good logic does not guarantee correct answers, nor does bad logic mean you get the answer wrong. Even a blind squirrel can catch a nut occasionally — just not systematically.
And so if you’re one who knew social media was harmful, maybe this paper just confirms your priors. We all have papers like that we’ve read. Nonetheless, I am one who does need strong pieces of evidence when it comes to claims about the causes of mental health, because I think our understanding of the mind’s health is so poor that we may be only one generation from the equivalent of using leeches as treatments. The paper in question is “Social Media and Mental Health” by Luca Braghieri (Bocconi University in Italy), Ro’ee Levy (Tel Aviv University) and Alexey Makarin (MIT Sloan School of Management), and they find that the appearance of “thefacebook” at a college student’s campus between 2004 and 2006 caused mental health to worsen by around 0.1 to maybe as high as 0.3 of a standard deviation. The effect, they argue, comes through depression and anxiety, which then harmed students’ academic performance. The effect, in other words, was not an increase in academic anxiety, like kids working too hard or having too much homework causing depression and anxiety. Rather, they think “thefacebook.com” caused something else — interpersonal comparisons between their own lives and the other students they saw. Those comparisons, in this case, appeared to be harmful to their minds, enough that the aggregate measures show worrisome patterns. It’s an important finding if true, given we have begun seeing higher incidence of suicide among the young, who also intensively use social media, much of which prompts interpersonal comparisons. So let’s dig in.
Partial Equilibrium, Design Elements, and Populations
This paper is not really about social media and mental health despite what the title says. This paper is about the first iteration of the social media giant, Facebook, then called “thefacebook”. When “thefacebook” appeared, it was both a very popular platform and the counterfactual was little to no other platforms. That is, adoption was rapid but it was also more or less the only game in town. What I remember is that MySpace and Friendster did not have the same presence in people’s lives the way “thefacebook” would.
And so because it was the only game in town, and adoption was massive, it maybe pulled people out of activities in ways that newer platforms like TikTok no longer do. The treatment in other words is some bundle of experiences, but the treatment effect, recall, is Y(1) - Y(0), so we must define D=0 just as much as D=1 to interpret aggregate causal effects. And as for D=0 — it is not clear to me from the paper what college campus life was like before “thefacebook”, and I would’ve liked to hear more about that, as we need to know it to assess and interpret this parameter.
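Since the treatment effect here is Y(1) - Y(0), a tiny toy sketch may help fix ideas. These numbers are entirely made up by me (they are not from the paper); the point is just that every unit has two potential outcomes, we observe at most one, and the unit-level effect is the difference.

```python
import numpy as np

# Five hypothetical students (toy numbers, not the paper's data).
# y0: mental health index if thefacebook never arrives (D=0)
# y1: mental health index if it does (D=1)
y0 = np.array([0.0, 0.2, -0.1, 0.3, 0.1])
y1 = np.array([-0.3, 0.1, -0.3, 0.3, -0.3])

# The unit-level treatment effect is Y(1) - Y(0). In real data we only
# ever see one of the two columns for any given student.
effects = y1 - y0
print(effects.mean())  # the average of the unit-level effects
```

Notice that the D=0 column does as much work as the D=1 column, which is exactly why it matters what campus life looked like before thefacebook.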
This is the dawn of social media, but the treatment here was a specific social media — the original “thefacebook” — and it had design features that are really no longer its core business model. For instance, this period is before its famous newsfeed, its algorithms, before the marketplace, before its many groups; the ability for content to go viral was probably throttled, if possible at all. I suspect if we went back and could log onto that Facebook, it wouldn’t even seem like the same social media platform. And that’s crucial because remember: under the potential outcomes framework, we are often only dealing with partial equilibrium. That doesn’t just mean time — it means partial from some base. And the base today is not the base in 2004. Counterfactuals matter in this.
So while the paper’s title says “social media”, there is no such thing as “social media”. There are rather design elements on internet platforms, and when we study something like Facebook or Twitch, we have to recognize that we are studying a bundle of sometimes unknown design elements, all combined, and if only one of them were to be removed, the estimated treatment effects themselves might disappear, get larger, or possibly reverse sign.
We see this all the time — David Powell, Rosalie Pacula and Mireille Jacobson have a very good paper in the Journal of Health Economics studying medical marijuana laws’ effects on opioid overdoses and opioid usage among Medicare recipients. Not a huge surprise, except when they pin down the mechanism and find that it’s coming from liberalized access to medical cannabis through dispensaries, and not just the law itself more generally. In other words, it’s the design elements of medical cannabis. It matters how we make these things. And I suspect we are going to see more of this soon, too, with the simultaneous rollout of decriminalizing/depenalizing psychedelic medicines happening at the municipal level while states and soon the federal government are legalizing and rescheduling “psychedelic therapies” through medical models (rescheduling of MDMA may come as early as October). They may seem like the same policy, but they aren’t. Or they may be. We won’t know unless we study them.
So that’s just something I want to have in the back of your mind. While this is a paper about social media, I will be saying that it is not about social media — rather it’s about “thefacebook”, a particular type of social media platform from 2004-2006, in the earliest days of its existence, which appeared before people had smartphones and before Facebook was maybe much more than a primitive version of Twitter meets Instagram. That is, primarily status updates and pictures.
This is just another way of reframing the old “internal validity versus external validity” debates. But I wanted to make my beliefs clear and spelled out so that at least you knew where I was coming from.
The Long Tail of Preferences and Matching Technologies
I have experienced the highs and the deep lows of social media over a lifetime of seeking online communities, one that dates back to being the sysop of a bulletin board system (BBS) I ran out of my bedroom in the 8th grade with a 2400 baud modem and an IBM PS/2 Model 30 with 20 megs of storage and 640k of RAM. It was just the start of my love of online communities. With only a brief break in high school, I have lived my life writing and talking to people online. From BBSs to forums, to listservs, to blogs and comments, to Metafilter, to Facebook, to Twitter and now, Substack. I have been drawn to and participated in online communities for maybe 35 years. I know the context very well even as the exact community platforms themselves have shifted.
I am like many of you, maybe all of you, someone with idiosyncratic preferences, and one of the things we know (believe?) is that the Internet, for good or bad, matches people with similar idiosyncratic preferences, even for those for whom the distribution of those preferences in a population gets very thin, where the mass of people spreading along a very long tail defining those preferences gets smaller and smaller until we get to the last person who is literally unlike anyone. Many of us feel we are that last person. We feel we are too different to possibly ever find someone like us. We face loneliness and develop strategies and become healthy hopefully but I do not think we were designed to be alone. We are social creatures. We try to mate for very long times if not until death. We give gifts. We care for each other.
And yet we so often think friendship is to be with someone who at least is close enough to us on the tail that they truly get us. And the internet’s “promise”, and social media’s promise, is to serve as matchmaker.
Rare has a meaning: it happens infrequently. But Murphy’s Law says if it can happen, it will, so long as you ramp up the sample size to infinity. It is the same with being an odd bird. You may feel like you’re the last one on the long tail, but in a world with 10 billion people, you’ll see you never were. Your people were just spread out and couldn’t find one another — if the preferences are somewhat random and outliers, then of course you’re alone in your small town. But even the people with the rarest of preferences will, at scale, find huge, possibly thriving communities just like them — so long as we can reduce the search costs and other frictions and find platforms that link them. This is what the internet platforms do — they thicken markets through dramatically reduced search costs and other frictions, which speeds up the matching and allows those along the long tail of rare preferences to find each other, realizing they maybe weren’t anywhere close to the last person. After all, the probability you are truly the last person is 1 over 10 billion. There are others like you. And that’s been for me one of the things that drew me to the internet — I kept finding my people. And that is one of the things pro-social media types like myself have often said to each other and to ourselves — the positive assortative matching generated by two sided matching platforms is a gain to society.
But what if, hear me out, there exists a technology that has benefits and costs? Nobel invented dynamite, without which we couldn’t have built roads through mountains, connecting areas, facilitating connections and commerce, the easier movement of goods and services. And with dynamite countless lives have also been lost and destroyed. Is there any technology for which there haven’t been both costs and benefits? Are social media platforms really so unique that it’s impossible they could have both too? For years, we’ve been hearing rumors from the front lines — the kids aren’t doing good. We aren’t doing good. Why? Is it the technology? Or is it something else? Maybe even regression to the mean.
Assortative matching and harm are not mutually exclusive. After all, many cities have meth dens where the meth addicts coordinate and share resources like their meth pipes and meth itself. If someone made a platform called “themethbook”, we might agree it would make matching in the meth friendship networks more efficient, and yet agree it also probably increases harm. Sometimes we want high search costs. Sometimes we don’t want two people tangled up. You can make things more efficient, and you can therefore ramp up the harms too if it’s not designed correctly. Technologies for which there are costs and benefits must be designed properly just so we can eke out the value at minimal cost.
But this is speculation without evidence. To some degree anyway, we need causal evidence that social media platforms do harm people, as well as why they do, before we try to suggest design changes. Does social media even harm people? If so, why? These are core research questions. And some may say it is the purview of psychology to study this, but frankly I strongly disagree. Economists have since day one been trying to understand how to address the massive changes to society and quality of life brought about by technology and change. No one has a monopoly on trying to understand the human species and its social context. To police like that is, to be honest, the mark of someone with too much time on their hands. If they wish they’d written this paper, then write it. It isn’t like we are facing a shortage of needs to better understand these questions. We need more research, not less. I am impatient with the territorial guard dogging of this moment. People who criticize economists for not staying in their own lanes also complain economists don’t interact with their fields — you can’t have it both ways. Either people welcome economists to the table, or they don’t. But frankly it doesn’t matter to me. I study this topic too, I care about this topic too, and I will support anyone studying this topic too, because mental health is important, suicide is important, and policies without careful attention to design elements are very dangerous indeed. We are all trying to understand a complex world as far as I am concerned.
So isn’t it possible that it both matches us and distorts us? But does that mean all platforms do that? Is it Facebook or is it “thefacebook”? Is it Twitter or is it the “quote tweet” button? So many questions. And that I think is where this paper comes in. This paper is not the final word. It is one word, and it is a useful word indeed. It left me with many questions, but the ones it did answer were, I think, some of the most important ones to me but also to policymakers. I encourage the reader to follow it.
I alluded to this but just to be clearer: I also was interested in it because it is so closely tied to my personal interests. The paper uses new estimators for differential timing, which, incredibly, I have spent over two years on as a self-appointed educator of causal research methodologies, trying to help people around the world learn what these things are, what they do, how they work and when to use them. Plus I made a pivot towards mental illness in my own research agenda around five years ago, and specifically “severe mental illness”, suicide and more recently “transitional age youth”, which ironically overlaps a bit with this population. And as I said, I have been online more than I have been offline. So this sort of hit a sweet spot for me — again, I thought I was on the long tail but it turns out I’m not.
But I also have a vested interest in it because I have seen social media’s seemingly negative effects on mental health (maybe) through shaming, public and discreet harassment, gossip and innuendo campaigns, bullying and toxic actions taken toward both others and myself. I have seen the dopamine hits from online activities, the collective action problems that come from design, the interpersonal comparisons that create never ending monologues and ruminating thoughts inside the mind.
And I have also used the internet to fight my own battles. I have used the internet to meet and get to know others, as well as express myself, but I have also used it to snipe, brood, criticize, complain, fight, yell and invest in all sorts of corrosive human and social capital too. So I have blood on my hands and maybe you do too. So with that said, let me now dive into this paper by setting out to describe a few big picture items about how I think of research and how I think of evidence.
Reasoning to the target parameter first
One of the things that has slowly taken me over as a result of diving deeper and deeper into both the difference-in-differences literature but also the writings of econometricians more generally is that there is an order that econometricians recommend we take when estimating causal effects. And that order goes like this:
What is your parameter of interest? Because there are many parameters of interest. There’s the average treatment effect (ATE), as well as the average treatment effect on the treated group (ATT). They are not the same parameters, and we must first just sit down and ask what our study is about and how we translate that research question into an aggregate causal parameter. Are you wanting to know the average causal effect for everybody or just some people? It wasn’t too long ago that someone in a talk would say that they were estimating the causal effect of X on Y, but under heterogeneous treatment effects at the unit level, there is no “the causal effect”. There are individual treatment effects and there are different ways of summarizing them. And given different types of treatment assignment, the ATT and the ATE may be identical to each other (e.g., randomized treatment assignment) and other times very different (e.g., sorting on treatment gains). So the first thing you need to do is ask what is your parameter of interest, and in this paper, the authors’ parameter of interest is the ATT. Why? Because they want to know what the average effect of Facebook’s appearance at a college campus was on the students at that college campus. They were not asking the effect it had on everyone, even those who never got a Facebook (whoever that is). If they wanted that, it would be the ATE, and you can’t get the ATE with difference-in-differences without some additional assumptions beyond parallel trends. So this paper is not just about Facebook — it’s about Facebook’s effect on the mental health of kids who got it at their campus, which goes by the classy name of the ATT.
What beliefs are needed to estimate that parameter? Easy — parallel trends. Why? Because we are estimating the ATT using difference-in-differences, and you need parallel trends in order to do that. We will come back to this.
Which crank should you use? I sometimes call the estimators “cranks” because I want readers to remember that econometrics is not a spell out of Harry Potter’s book of spells. Estimators might as well be a bunch of buckets moved over a river with pulleys and rope. It’s not magic — it’s addition, subtraction, division and multiplication. It’s cranks that turn spreadsheets into numbers, and if you do it a certain way, then given your beliefs in part 2, you get a number like what you wanted in part 1. If we weren’t so used to doing it by now, we’d freak out every time we did it, because it literally is a crank that produces a number that is impossible to calculate directly, as the thing itself doesn’t exist! And if 1 and 2, then 3, and that’s cool. So which cranks will these authors use? Easy — they will use all of them!
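A quick simulation may make step 1 concrete. This is a toy of my own (nothing from the paper): when units sort into treatment based on their gains, the ATT and the ATE are very different summaries of the same individual effects.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Heterogeneous unit-level treatment effects, centered slightly below zero.
tau = rng.normal(loc=-0.1, scale=0.2, size=n)

# Sorting on gains: only units who benefit (tau > 0) take the treatment.
d = (tau > 0).astype(int)

ate = tau.mean()          # average effect over everyone
att = tau[d == 1].mean()  # average effect over the treated only

print(f"ATE: {ate:.2f}, ATT: {att:.2f}")
```

Here the ATE is negative while the ATT is positive, so “the causal effect” is genuinely ambiguous until you name the parameter. Under randomized assignment, d would be independent of tau and the two would coincide.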
Why do I lay it out this way? Because frankly, as we saw in an earlier substack about estimating the ATT with OLS under heterogeneity, if you don’t know where you’re going, you won’t know how to get there. Causal inference isn’t like flying a plane. It’s like piloting a submarine. You can’t see anything. You rely on everything except your eyes. You trust the equipment. The coordinates of where you’re going help you make a map, which you give to the officers who work the engines and get you there. We are piloting a ship in the darkness of the sea using nothing but pings and trusting what the pings mean. That’s causal inference.
So they want the ATT. So they use difference-in-differences. But to use it they need parallel trends. And if they have parallel trends, they have to find a crank that actually can pilot in the dark with staggered adoption. We once thought it was a straight line using a diesel engine, but now we know it’s a straight line using a nuclear engine. That’s fine. We finally have them. We may have too many, even! But we have them.
And so the authors will use a bunch of the contemporary robust difference-in-differences models which in fact estimate the ATT under parallel trends and help us see what the heck happened to us when thefacebook came to town. Boom goes the dynamite — for real.
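And to show just how humble the crank is in the simplest case, here is the 2x2 difference-in-differences arithmetic with made-up numbers (mine, not the paper's):

```python
# Group means of some outcome, before and after treatment (hypothetical).
treated_pre, treated_post = 0.50, 0.30
control_pre, control_post = 0.40, 0.45

# Under parallel trends, the control group's change stands in for the
# treated group's missing counterfactual change, and the double
# difference recovers the ATT.
att = (treated_post - treated_pre) - (control_post - control_pre)
print(att)
```

The staggered-adoption estimators the authors use are more elaborate, but at bottom they are aggregations of little building blocks like this one.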
The Five Pieces of Evidence in a Great Diff-in-Diff
Lately I have been thinking that we don’t just need to be hearing about difference-in-differences estimators. There’s more to making an argument in causal inference than simply running -csdid- or -did-, after all. It isn’t just about code; it’s not even mostly about code! It’s not even about the econometrics. John Snow, who arguably invented diff-in-diff, did not use a computer, nor did he know about the large sample properties of any estimators. Rather he wrote a monograph mixing tight logic, observation, tables and compelling visuals that told a coherent story about how and why people were dying in London from some strange painful disease. And we are closer to John Snow than anyone after us will ever be, so maybe we need some models of what evidence to produce. So, before we dive into their paper, I want to present to you my paradigm of evidence based on five things that I think, when they all hit, make up a truly fantastic difference-in-differences paper (maybe any paper).
Bite. One of my favorite diff-in-diff papers does not simply present the main results. They make an argument. Miller, Johnson and Wherry (2021) is, as far as I am concerned, a historically classic paper on Medicaid and mortality. Basically, bite means that before you look at second order behavioral outcomes, you should first check that the first order behavioral responses are there, even if you aren’t really all that interested in those things. Even though Miller, Johnson and Wherry, for instance, is about near elderly mortality, they first show that Medicaid expansion caused an increase in eligibility and enrollment as well as a reduction in the uninsured. They showed all three because if they show that Medicaid reduces mortality for the near elderly, but in fact Medicaid expansion did not cause the uninsured population, at least somewhat, to switch into Medicaid, then it’s a very hard sell to tell someone that Medicaid’s presence in the state caused mortality to fall by 9% even though no one got on it! So bite is crucial. If you can, find the data and show that when your treatment was adopted in some area, people responded by doing those initial things associated with the treatment itself.
Falsifications. Parallel trends isn’t testable. Most of you know that, but it’s still worth saying. The advantage of the RCT is that you know the confounders are distributed equally on average across the treatment arms. With parallel trends you don’t know that. So the best difference-in-differences papers (maybe the best of all papers) have in mind an alternative explanation for the same results and then test that explanation. Miller, Johnson and Wherry in their Medicaid paper, for instance, are actually focused on the “near elderly” population. To test whether their main results might be coming from some confounder, they used another population, nearly identical, but who weren’t likely to enroll in Medicaid — those 65 and older. These people are on Medicare, which is almost always the better healthcare option, so when Medicaid expanded in their state, they were exposed but they do not take up. They have already taken up. And sure enough, they find no effect of Medicaid expansion on elderly mortality. I don’t care what you think — it was ingenious. It was the killer app. The null effect on the elderly let me know the main results would be real. It was psychologically eerie. Maybe you have a paper like that too? Leave it in the comments; I’d love to know.
Main results. After you’ve shown bite, and you’ve ruled out competing theories, readers are sometimes salivating and ready for the main course. In Miller, Johnson and Wherry, that was the effect of Medicaid on near elderly mortality, and they showed around a 0.132 percentage point reduction in mortality, amounting to around a 9% reduction compared to the mean. They had me at hello by the time I got to that graph. Again, it isn’t perfect - you can draw a straight line through all the confidence intervals. But you could also not. All the other evidence made rejecting the result seem a little childish. But they had to make the argument. They had to convince ME. I’m a grown man. By the time they took us to that result, I was almost incapable of believing that whatever we were going to see could plausibly be anything but Medicaid! The best papers do that. It can be a powerful reveal when done well. Bravo guys. I write Sarah or Laura at least once a year and say “hey fwiw, you killed it. I’m a belieber”.
Mechanism. Now with reduced form methods, you really cannot directly figure out if some outcome is a mechanism unless you’re willing to write down a model with pathways and say that some outcome is in fact a mechanism. But what you can do is have a looser model, a story, in which the treatment affects the outcome through some channels, and then try to figure out if there is evidence for those channels. If they can lay out three channels that you accept, rule out two, and find evidence for one, guess what. They’re saying you’ve got to pay the bill on dinner. That’s how it works! The days of hand waving with stories explaining main results are probably long gone. Nobody cares about your asterisks and your numbers. These days, you can’t really tell a story, no matter how tight your price theory is, without trying to support it with causal analysis. Try following up a regression with nothing but a game tree and a Bayesian Nash equilibrium, and you better get used to rejections, my friend. Ain’t happening. You need receipts in 2023. Miller, Johnson and Wherry show that the effects they found on near elderly mortality are coming from disease related deaths being avoided. Sometimes, simply providing evidence about why your results are happening is enough to satisfy the reader. Otherwise it’s a cliffhanger, and no one likes a cliffhanger.
Visualizing with event studies. And then the crème de la crème itself — the heart of the contemporary difference-in-differences design: the event study plot. The event study plot is a dynamic regression specification in which the coefficient point estimates are plotted along with whiskers marking the upper and lower 95% confidence intervals across time, either calendar time if a 2x2 or relative event time if differential timing. Usually a horizontal line jets out from the 0 point on the y-axis, and a vertical line rockets up from a point between the year of treatment and the period just prior to treatment. If you estimate this dynamic model with a regression, you have to omit a year (typically the t-1 year), otherwise you have multicollinearity. But you may have to make the plot by hand, so steady yourself. Here’s example code for making an event study plot using a simple 2x2 if you want to see for yourself. Notice the solid circle you want at baseline and the way the lines are drawn.
* name: simple_eventstudy.do
* author: scott cunningham
* description: illustrating an event study with Stata

capture log close
use https://github.com/scunning1975/mixtape/raw/master/castle.dta, clear

* Prepare the dataset: keep the never-treated states and the 2006 cohort
xtset sid year
drop if effyear==2005 | effyear==2007 | effyear==2008 | effyear==2009
gen treat = (effyear==2006)

* Event study: interact treatment with year dummies, omitting t-1 (2005)
reg l_homicide treat##ib2005.year, vce(cluster sid)

* Columns 25-35 of r(table) hold the 1.treat#year interactions for 2000-2010
matrix b = r(table)
forvalues i = 0/10 {
	local col = 25 + `i'
	local coef`i' = b[1, `col']
	local ll`i' = b[5, `col']
	local ul`i' = b[6, `col']
}

* Build a small plotting dataset of coefficients and confidence bounds
clear
set obs 11
gen year = _n + 1999
gen coef = .
gen ll = .
gen ul = .
forvalues i = 0/10 {
	local row = `i' + 1
	replace coef = `coef`i'' in `row'
	replace ll = `ll`i'' in `row'
	replace ul = `ul`i'' in `row'
}

* Plot: solid circles for point estimates, capped whiskers for the 95% CIs
twoway (rcap ul ll year, sort lcolor(black) lwidth(medium)) ///
	(scatter coef year, sort mcolor(black) msymbol(circle)), ///
	yline(0, lwidth(medthin) lcolor(blue)) ///
	xline(2005.5, lwidth(medium) lpattern(dash) lcolor(blue)) ///
	xtitle("Year") xlabel(2000(1)2010) title("Ln(Homicides)") legend(off)
This paper doesn’t have all five of these, and I’ll try to explain why I think they go about the study the way they do. I will conjecture, though, that I think the paper’s success is in a few things: 1) the importance of the question, 2) the rollout of a major platform, and 3) the event study plot which has many curiosities. But let’s dig into it now.
This paper has a lot of fun parts. It is ingenious in the ways that people like me, and probably you, but probably not our families or anyone we went to college with, love. I mean LOVE. Let me set up the fun part.
We are so used to these papers using differential timing that even with the difference-in-differences credibility revolution having happened and progressed, we don’t necessarily lose our minds when we see one. In the United States, anyway, we are used to these because our states and municipalities are relatively autonomous up to the constraints imposed by our Constitution and constituents’ wishes and tolerances. So seeing that this paper will be using some staggered roll out of social media to study mental health, immediately we realize this is difference-in-differences, and it’s staggered, and it’s Callaway and Sant’Anna and de Chaisemartin and Sun and Kirill and some guy named Mundlak who — come on. Just try messing with that guy. Mundlak will kick your ass. Do not mess with him!
So everyone has a differential timing project. I found two and forgot three before I even got up to feed my FIVE CATS. It’s 2023 — our minds lull into a dreamy state, half awake, half asleep, around them now. But please don’t. Not all differential timing papers are as flat-out fun as this one. Argh, how did I not know thefacebook did this??!
Ever since the emergence of the iPhone with its App Store (and Android with its own), social media no longer “rolls out” when you think about it. Maybe Snapchat or Twitter rolled out, but more likely there is a widespread launch, or at least it seems that way. And yet this paper is built on a staggered rollout, so how is that possible? How is it staggered?
From 2004 to 2006, “thefacebook.com” was a website that rolled out only to universities. Zuckerberg and company chose this strategy to create word of mouth and drive demand, and I’m assuming it worked just fine. As this was before the iPhone, “thefacebook,” as it was called then, could only be accessed from the desktop, and only if you had a .edu account, and only if that .edu account was tied to a university where thefacebook had already launched.
Problem is, Facebook doesn’t share the rollout dates. I get it. I have a study on Craigslist’s rollout, and I actually sent a DM to Craig Newmark and was like, hey, I’m a professor at Baylor writing an article that in ten years might get published in a journal. Please answer my questions. Question 1: why did you enter these cities? Craig, the patient man, said I cannot say that. And that was the only time I spoke to Craig Newmark. I suspect these authors didn’t even get that far. But fortunately there is a second best. And to be honest, given the frailties of memory, their method is pretty much the first best. They found a box of photographs of the front page of thefacebook taken multiple times a day for something like two years straight. And guess what is on the front page. Spot the difference between these two pictures.
See it? The front page of thefacebook.com announced which schools they’d recently added. Which means that if you went through every screenshot of the front page from the first day to the last, you’d know the rollout. And that’s what they did. This team of shoe-leather detectives went through every photograph taken by the internet’s own photographer, the Wayback Machine, which has been snapping pictures of webpages for at least two decades, and day by day cataloged over 700 colleges’ appearances on the front page. The moment a new school appears, it is treated, and the authors flip a zero to a one in the treatment column.
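The treatment coding itself is mechanical once you have the dates. Here is a minimal sketch of the flip-the-zero-to-one step, using made-up college names, rollout dates, and survey waves (not the authors’ actual data or code):

```python
import pandas as pd

# Hypothetical rollout dates recovered from front-page screenshots
# (illustrative only -- not the authors' actual data).
rollout = pd.DataFrame({
    "college": ["Alpha U", "Beta College", "Gamma State"],
    "fb_date": pd.to_datetime(["2004-02-04", "2004-09-01", "2005-03-15"]),
})

# Hypothetical survey wave dates
waves = pd.DataFrame({"wave": pd.to_datetime(["2004-05-01", "2005-05-01"])})

# Cross join colleges with waves, then flip the zero to a one once the
# college's rollout date precedes the survey wave
panel = rollout.merge(waves, how="cross")
panel["treated"] = (panel["wave"] >= panel["fb_date"]).astype(int)
```

Each row of `panel` is a college-by-wave cell, and `treated` is the 0/1 column the diff-in-diff machinery runs on.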
Truthfully, I am impressed — I didn’t even know thefacebook did this. I really respect that they took advantage of what is basically publicly available data on the internet, for free, and found this clever research design just sitting there waiting to be used.
Data and Main Results
So they figured out the right-hand side by using the Wayback Machine, but that won’t do any good unless they also have information about student mental health at those schools over time. Facebook entering colleges doesn’t do anything for the design if you don’t have information about the students at those colleges in a panel of some sort. So how do they go about it?
They use a survey that’s been running for quite a while called the NCHA. It is, I think, one of the more consistent and long-running surveys of college life, with lots of interesting questions about student life and health, including drugs, sex, mental health, and academics. But not the name of the college. For privacy reasons, the administrators of the survey do not provide the schools’ actual names to researchers, and without the names, this seemingly ideal dataset cannot be linked to the Facebook rollout database the authors created. So what the authors did was contact the group that runs the survey, explain their project, and ask if they’d mind adding a single variable to the survey for them — the 0s and 1s corresponding to whether each student was, at the time of the survey, enrolled in a college with or without a Facebook. And they said yes. So now the two datasets were linked — it just required working with the owners and asking them to merge the variable in, delete the college identifier, and send the data back one column larger than it was before.
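The linkage step, as I understand it, amounts to a merge on the confidential identifier followed by dropping that identifier. A hypothetical sketch (all column names and values invented; the real NCHA data and variables differ):

```python
import pandas as pd

# Hypothetical: the survey administrators hold the confidential college
# identifier, merge in the authors' rollout indicator, then drop the
# identifier before returning the data. All names here are invented.
survey = pd.DataFrame({
    "college_id": [101, 101, 202, 202],   # confidential, never released
    "semester": ["F04", "S05", "F04", "S05"],
    "mh_index": [0.10, -0.20, 0.05, 0.30],
})
rollout = pd.DataFrame({
    "college_id": [101, 101, 202, 202],
    "semester": ["F04", "S05", "F04", "S05"],
    "has_facebook": [0, 1, 0, 0],
})

linked = (
    survey.merge(rollout, on=["college_id", "semester"])
          .drop(columns="college_id")     # de-identify before release
)
```

The researchers only ever see `linked`: one column larger than the survey they asked for, with the school names gone.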
This student health survey asks a lot of yes-or-no questions about mental health, which the authors sum into aggregate measures and then normalize into z-scores with mean zero and standard deviation one. All coefficient estimates are therefore measured in the same units — standard deviations of the outcome. This is great because not only is it common to present coefficients that way, it also facilitates the presentation choices they make, overlaying many coefficient plots in compelling visuals. Since every coefficient has the same interpretation (some share of a standard deviation), the reader can take in all the plots at once and compare them to understand what is going on. Lots of thought went into these exhibits. The pictures are as important a part of the evidence as the things in them. But rather than lay out the questions and the indices now, my preference is to reveal them one at a time as we go, so that you have the experience of seeing the results and the measurement at the same time, as I did.
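For concreteness, the index construction looks something like this (my sketch with five hypothetical item names and simulated answers; the paper’s actual items, sample, and any reweighting may differ):

```python
import numpy as np
import pandas as pd

# Illustrative only: sum five hypothetical yes/no items into an index,
# then standardize it so regression coefficients read as shares of a
# standard deviation.
rng = np.random.default_rng(0)
items = pd.DataFrame(
    rng.integers(0, 2, size=(1_000, 5)),
    columns=["hopeless", "exhausted", "very_sad", "depressed", "anxious"],
)

index = items.sum(axis=1)                 # count of "yes" answers, 0..5
z = (index - index.mean()) / index.std()  # z-score: mean 0, sd 1
```

A coefficient of 0.07 on a treatment dummy in a regression of `z` then means the treatment moved the index by seven hundredths of a standard deviation.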
For reasons I don’t understand, to be honest — maybe my only question — the authors present the main results as twoway fixed effects estimates, knowing full well that the evidence they’ll show later implies these are biased, because the effects appear to grow with time, making the static specification technically biased. So if I am understanding correctly, given the dynamics they will show later, these estimates are likely biased downward. Rarely does the sign flip — for that, the dynamics must be massive relative to the VWATT itself, and it’s usually fueled by full staggered adoption without any never-treated units, as that forces the “already treated” contrasts to carry a larger share of the total weight. But this is what they do, and so I’ll report it now.
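You can see the downward bias in a toy simulation (my construction, nothing to do with the paper’s data): every unit eventually adopts, the effect grows linearly with exposure, and the static two-way within estimate lands well below the true average effect on the treated.

```python
import numpy as np

# Toy simulation: full staggered adoption, no never-treated units, and a
# treatment effect that grows linearly with exposure time.
rng = np.random.default_rng(1)
N, T = 200, 10
g = rng.integers(2, 9, size=N)                    # adoption period per unit

t = np.arange(T)
D = (t[None, :] >= g[:, None]).astype(float)      # treatment indicator
effect = 0.1 * (t[None, :] - g[:, None] + 1) * D  # grows with exposure

unit_fe = rng.normal(0, 1, N)[:, None]
time_fe = rng.normal(0, 1, T)[None, :]
y = unit_fe + time_fe + effect + rng.normal(0, 0.1, (N, T))

def within(x):
    """Two-way within transformation (exact on a balanced panel)."""
    return x - x.mean(1, keepdims=True) - x.mean(0, keepdims=True) + x.mean()

# Static TWFE coefficient on D versus the true average treated effect
beta_twfe = (within(D) * within(y)).sum() / (within(D) ** 2).sum()
true_att = effect[D == 1].mean()
```

In this setup `beta_twfe` comes out below `true_att`, because the already-treated units, whose effects are still growing, get used as controls.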
The four columns with increasing fixed effects are a pretty common type of table presentation when using TWFE to estimate causal effects. Given that some of the new estimators do not accommodate fixed effects, though, it’s unclear whether we will continue seeing it done. Going from left to right, as more fixed effects and controls are thrown at the model, the effect size falls from 0.137 of a standard deviation to 0.07. We interpret this as an ATT, though saying that aloud can be awkward, so let’s try.
The estimated average effect of Facebook on mental health of students is to worsen their mental health by almost a tenth of a standard deviation.
Keep in mind that that approximately one tenth of a standard deviation is a comparison to an unknown counterfactual — some Y(0) world in which students lived on campuses without “thefacebook.” So what were they doing instead? And does anyone do that now? Pinning down just what the foregone activities were is, I think, crucial to thinking carefully about what this ATT means and how, or whether, we can carry it forward to today.
I am particularly interested in the disaggregated analysis. Recall that the index is based on yes and no questions. When they estimate their model on those outcomes, here’s what they find:
Last year I felt hopeless (worsened by 0.07 sd)
Last year I felt exhausted (worsened by 0.03 sd)
Last year I felt very sad (worsened by 0.03 sd)
Last year I felt depressed (worsened by 0.07 sd)
Last year I felt more anxiety (worsened by 0.06 sd)
Those all point to deterioration, but it’s also interesting what isn’t being affected: feeling overwhelmed, suicidality, eating disorders, diagnoses. None show real signs of changing, even though aggregate mental health is worsening. We usually see these in a correlated constellation, and here we don’t, which is odd and interesting. When the treatment slices through a set of outcomes that are usually strongly correlated, one story can start to emerge while another is suppressed, helping us better understand the mechanisms at play.
I think the killer app, as you might say, in this paper is the event study graph. There are two things in this graph I want to show you. First, they use not just TWFE for their event study, but TWFE plus five other robust estimators that can deliver unbiased estimates under differential timing and heterogeneous treatment profiles. All of them show flat pre-trends, a common heuristic to support parallel trends in Y(0) post-treatment, and all of the robust models find worsening mental health post-treatment. Effect sizes in the last period may be as high as 0.3 sd, over 3x what the static specification found (which was more or less averaging over the whole period, and was biased downward). Given the effect sizes reported earlier, this is likely driven by increased anxiety and depression too.
But the other thing to show — and I think this is also why this is such a powerful image — is that TWFE did not find this. The post-treatment TWFE coefficients cannot reject the null, and in the last period do not have confidence intervals overlapping with Sun and Abraham, Callaway and Sant’Anna, or de Chaisemartin and D’Haultfoeuille. I think this is probably very compelling to many readers because it is not just a positive finding — it’s also a negative finding. The effects are there with the unbiased estimators, but not when using the very popular biased estimator, TWFE (or at least the constant-treatment-effect specification of TWFE). Had someone done this study five years ago, they wouldn’t have found an effect of Facebook on mental health, when potentially there actually is one — or at least, there is one statistically under parallel trends.
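For readers who want the mechanics of an event study, here is a bare-bones version on simulated data — my sketch, plain TWFE with event-time dummies, not any of the robust estimators, and I include never-treated units, which sidesteps the identification problems that full adoption creates. Leads sit near zero (flat pre-trends) and lags grow with exposure:

```python
import numpy as np

# Bare-bones TWFE event study on simulated staggered data (my sketch,
# not the paper's code). Never-treated units are included so the
# event-time dummies are identified with a single omitted baseline.
rng = np.random.default_rng(2)
N, T = 120, 8
g = rng.choice([3, 4, 5, 9], size=N)      # 9 = never treated in sample

rows = []
for i in range(N):
    for t in range(T):
        e = t - g[i]                                 # event time
        effect = 0.1 * (e + 1) if e >= 0 else 0.0    # grows post-treatment
        rows.append((i, t, e, effect + rng.normal(0, 0.1)))
unit, period, ev, y = map(np.array, zip(*rows))

window = [e for e in range(-2, 5) if e != -1]        # e = -1 is the baseline
X = np.column_stack(
    [unit == i for i in range(N)]                    # unit fixed effects
    + [period == t for t in range(1, T)]             # time fixed effects
    + [ev == e for e in window]                      # event-time dummies
).astype(float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
es = dict(zip(window, coef[-len(window):]))          # event-study estimates
```

Plotting `es` against event time gives the familiar picture: nothing before adoption, a growing effect after.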
Next they start trying to pin down who is driving the results, both to better understand the heterogeneity and to find evidence for why these two things — a platform and deteriorating mental health — are connected. More than one mechanism has been proposed, so given the strong causal evidence for worsened mental health, it would be helpful to understand why as well. First, they provide some sanity checks showing that the effects are worse for those with the highest predicted susceptibility to mental illness (below).
Next they examine effects on various kinds of academic performance impaired by mental health issues like attention deficit, depression, and anxiety. They find no effect on sleep- or stress-related performance, though, and again nothing on eating-disorder-related academic performance.
Finally, they try to get at the mechanism by looking at the likelihood of belonging to certain subpopulations. They find that Facebook worsened mental health for those living off campus and those with credit card debt. This, in the end, is their best evidence for the “interpersonal comparison” hypothesis. You be the judge of how convinced you are that this is the channel. I am more convinced by the main outcomes than by the channel being interpersonal comparisons, but it is interesting that the two populations more likely to be hit are groups that may be either more isolated or going through financially difficult times.
Let me now briefly summarize the paper’s overall pattern of evidence against my five types of evidence in a great diff-in-diff.
It could not show bite. They cannot see people using the computer more, let alone logging into Facebook. They lean on historical anecdotes — it sounds like we all got on Facebook. Good luck with a paper about an obscure website no one used; you won’t get the same pass. You need bite, but they couldn’t show it — much like papers about Medicare don’t need to, because it’s widely accepted that demand shot through the roof. So that’s a lesson, I think: if you cannot show bite, you need to be working with something for which bite is widely accepted a priori, even without a table or plots. Not everything is like that, though, so always try to find evidence for that first stage.
Falsifications. If there was one, it was not memorable enough for me to remember it. I think the problem with first movers like this is that we have no widely agreed-upon placebos. If you found Facebook reduced colds, you’d say “oh, because they stayed inside,” but if it had been null you’d say “placebo.” That’s just how these moves into the dark are. A placebo is a placebo when we all agree it is, and for a study like this one, we really don’t have a handle on what a placebo would even be. If you said STD cases at the university clinic might be a placebo, then I’ll just say Facebook thickened the dating market, causing risky sex to increase. When there is no clear understanding of mechanisms, in other words, it’s impossible to convince someone that this variable is an outcome but that one isn’t. Falsifications work best when you have an agreed-upon story in the background, and we don’t really have one yet for this.
Main results and event study. So it is the main results, and particularly the event study, that I think are the really strong part of this study. And it very well could be the presentation of all the estimators in that event study overlaid with a flat TWFE plot. Seeing that is a bit nerve-wracking in a way — you mean we might have missed this had we used TWFE? I was wrong to be so dismissive of all these new robust DiD methods? You mean this actually matters? Yes — it may actually matter. I have to believe that when all 3-4 referees saw that graph, they were sucked into the study. And the color nailed it. Pay for the color. But note that if someone prints it out in black and white, like I did, that event study is unreadable without the color.
Mechanism. That’s where they focus. Lacking a falsification, but having a very compelling event study, the authors are granted permission to explore. It’s like when the attorney has an idea and the judge says, “I’ll grant it, but be careful, counselor.” An event study like that buys you four pages. Use them wisely. They did. The mechanism of interpersonal comparisons comes from just one exhibit, but it’s enough to leave the reader with at least a possible narrative to frame those main results.
I believe that this is an important paper, and I hope I’ve shown you why. It seems like thefacebook.com did something. They think it made us compare ourselves. Is that it? I don’t know if a positive coefficient on credit card debt and off-campus housing does it for me, but the interpersonal comparison theory isn’t really a tough hill for me to climb anyway. Social media has always amplified what others have and made me wonder why their life is better than mine. It amplifies relative-utility, jealousy-type comparisons between the haves and have-nots, which creates all sorts of false stories in your mind, the longing for a different life. It takes real skill, zen-mastery-level skill, to see that those stories are neither true nor useful. But these are young people. We were young once. Our brains weren’t finished forming. Who knows what rewiring was done, and who knows how we will rewire it now.
Great job authors. Five out of five.