Consideration set IV designs
Potential IV designs within industry
Introduction
In my book, Causal Inference: The Mixtape, I list some popular IV designs such as Bartik instruments, judge fixed effects, and lotteries. These are IV approaches that have been used so many times that they almost seem to be sub-designs in their own right. The close election design is like this too: a regression discontinuity approach that has been repeated dozens of times, to the point that the nuances associated with it have become salient to practitioners. Cataloging these designs can be helpful, I think, because it suggests that particular situations are common enough that we may find them in the wild. And in preparing for a new course at Scholar Site on causal inference for tech and industry, I think I have noticed an IV design that I am just calling the “consideration set design”.
Good instruments are strange
Instrumental variables designs require a DAG like the one in the attached drawing. I often refer to the exclusion restriction (where Z is independent of the unobserved confounders and does not directly cause Y) as a situation where the instrument, when expressed in its reduced form, feels “strange”. Strange in the sense that an intelligent layperson would find it bordering on illogical that the instrument could even be remotely relevant to the outcome, let alone correlated with it. Such “strange” reactions are a good sign because they give us reasonable grounds to think that one of the key identifying assumptions of IV (exclusion) could be plausible. If, when you hear the reduced form relationship between Z and Y, you think “this makes sense to me”, it implies either that the unobserved determinants of the outcome include the instrument and/or that the instrument directly causes the outcome, both of which are direct violations of exclusion.
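To make the logic concrete, here is a minimal simulation sketch of that DAG. All numbers and parameter values are invented for illustration: Z is randomized and only moves Y through the treatment D, while an unobserved confounder U drives both D and Y. Naive OLS is contaminated by U, but the Wald ratio of the reduced form to the first stage recovers the true effect.

```python
# Minimal sketch of the IV DAG: Z -> D -> Y, with U confounding D and Y.
# Everything here is simulated; parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                                    # unobserved confounder
z = rng.binomial(1, 0.5, size=n)                          # randomized instrument
d = (0.5 * z + u + rng.normal(size=n) > 0).astype(float)  # treatment take-up
y = 1.0 * d + 2.0 * u + rng.normal(size=n)                # true effect of D is 1.0

# Naive OLS slope of Y on D is biased upward because U drives both
ols_slope = np.polyfit(d, y, 1)[0]

# Wald / IV estimator: reduced form divided by first stage
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

print(f"OLS:  {ols_slope:.2f}")   # well above 1.0
print(f"Wald: {wald:.2f}")        # close to 1.0
```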
Randomized Boxes of Therapists
My entire career has been spent inventing things that Gary Becker invented decades before, so it is unsurprising that I am not the first to notice this. After all, there’s an efficient market for good instruments. If this were a good instrument, in other words, then assuming no barriers to entry, someone else has almost certainly already found it; otherwise it probably isn’t a good instrument! A typical economist’s way of thinking, but I drink the Kool-Aid.
I have recently been talking to a firm called Talkspace. Talkspace is interesting because it’s a novel form of telehealth. Like many other tech firms, it is a platform solving a two-sided matching problem, in this case between clients searching for mental health therapy and therapists. Ordinarily, that match is rife with high search costs and uncertainty. In that sense, it is not so different from marriage markets, where you often don’t know the quality of the match until well into the match. By the time you realize that the therapist match isn’t working for you, you could be six months or more into the relationship, and that’s when sunk cost fallacies kick in; not to mention that thin markets and shortages of therapists in some areas mean that even if you were to dissolve the relationship, you might face substantial search costs. And since even the subsequent matches are made under uncertainty, many people either stick with a low quality match or discontinue therapy altogether.
Talkspace is interesting in that sense because it’s a two-sided platform that matches clients to therapists. The design of the matching algorithm would likely be a useful place for economists and data scientists to collaborate, but figuring out which elements of the assignment to a therapist impact the client’s subsequent mental health, marriage reconciliation, and/or even willingness to leave therapy would also be very important questions to answer. After all, the selection biases associated with choosing any treatment are well known and very hard to address in this context.
But Talkspace, I learned, has a quirk that many other platforms share. A “box” of therapists is presented to the client, from which they pick. While randomizing the assignment of therapists to clients would likely cause a revolt, randomizing the consideration set likely wouldn’t. And given that Talkspace controls the characteristics of the box of therapists, and given that you can only choose from that box, then insofar as it is possible to identify relevant features of the distribution of therapists in the box and use them as instruments, you might be able to identify the causal effect of the treatments in question on those mental health outcomes.
There are several ways you could go with this. You could use natural language processing, for instance, to learn more about the characteristics of the therapists. Perhaps you could find the effect of mindfulness methods or dialectical behavior therapy on outcomes not previously studied, or maybe even a proxy for the quality of the therapist. The problem is that each therapist is a bundle of a nearly infinite number of characteristics, so the researcher would almost certainly need a priori reasons to rule certain channels in or out as relevant; otherwise, who knows how you would interpret the complier population itself. But just because a problem is hard has never stopped creative researchers in the past, so why should it here?
But another issue that arises here is what exactly you will be measuring within the box. Would you use therapist fixed effects, for instance, much like judge fixed effects designs, with dummies for every therapist? Perhaps so, perhaps not. Or would you measure some version of the leave-one-out mean, that is, the average characteristic within the box? If the box has three therapists, and you wanted to know the causal effect of being assigned a Black therapist, then the instrument would be multi-valued, ranging from no Black therapists in the box (zero) to 1/3, 2/3, and all therapists being Black. Such multi-valued instruments could move us into the realm of marginal treatment effects, where work by Heckman and Vytlacil, in a series of papers published in the aughts, showed that we can use that feature of the IV to back out more policy relevant aggregate parameters, like the ATE, by integrating the marginal treatment effect distribution over individual propensity scores. Such novel extensions of IV have been really exciting developments, particularly in the area of instrument intensity designs, like estimating the causal effect of mindfulness on mental health, divorce, and suicidality.
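To fix ideas, here is a hypothetical sketch of that multi-valued instrument: the share of Black therapists in a randomized box of three, used as an instrument for whether the client ends up matched to a Black therapist. The data, effect sizes, and variable names are all invented for illustration; this is not Talkspace’s data or design.

```python
# Hypothetical sketch: box composition (share of Black therapists shown)
# as an instrument for the client's realized match. All data simulated.
import numpy as np

rng = np.random.default_rng(1)
n_clients = 50_000
box_size = 3

# Randomized box composition -> instrument takes values in {0, 1/3, 2/3, 1}
n_black_in_box = rng.binomial(box_size, 0.3, size=n_clients)
z_share = n_black_in_box / box_size

# Unobserved client preference affects both the choice and the outcome
pref = rng.normal(size=n_clients)
p_choose = 0.9 * z_share + 0.1 * (pref > 0)
d_black_therapist = rng.binomial(1, p_choose)

# Outcome: a mental health score; true causal effect of the match is 0.25
y = 0.25 * d_black_therapist + 0.5 * pref + rng.normal(size=n_clients)

# With one instrument and one treatment, 2SLS reduces to cov(z, y) / cov(z, d)
iv = np.cov(z_share, y)[0, 1] / np.cov(z_share, d_black_therapist)[0, 1]
ols = np.polyfit(d_black_therapist.astype(float), y, 1)[0]

print(f"OLS:  {ols:.3f}")   # contaminated by the unobserved preference
print(f"2SLS: {iv:.3f}")    # close to the true 0.25
```

With several characteristics measured inside the box, the same logic extends to multiple instruments and full 2SLS with covariates; the ratio above is just the single-instrument, single-treatment special case.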
Consideration Sets
So what exactly are this IV design’s generic features, such that it encompasses quite a few applications? To illustrate, I’ll discuss just one study by Sonia Joffe et al., in which they examined the causal effect of host quality on the “propensity” of a guest to return to the Airbnb platform at all. This is a crucial question for a two-sided platform when you think about it, because hosts who provide low quality services not only reduce their own earnings by reducing repeat business; they may also impose external costs on other hosts not involved in the exchange if their low quality service drives guests away from the platform altogether. This can create thin markets, which economists like Al Roth and others have shown can cause the very existence of a market to break down.
Estimating such causal effects is hard, though, without explicit randomization, and explicit randomization is itself hard because on a two-sided platform individuals aren’t assigned their hosts; they choose their hosts. And when users choose their own treatment, they are almost certainly doing so because they expect the treatment effects to be positive. And if they expect the treatment effects to be positive, then the very conditions for causal inference (independence) are violated. This is because, as I say in my book, rational choice tends to make everything endogenous.
So what do Joffe et al. do in their ingenious approach? They take their measure of the propensity to return and measure its distribution within the consideration set presented to users. The consideration set turns out to have a large impact on the customer’s choice, reflected in a gigantic first stage (F > 33,400), thanks in part to the massive datasets that are a common feature of industry settings. Choosing a higher quality host causes users to make more trips again, which has implications both for return on investment measures and perhaps for market design itself.
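For readers who want to see mechanically where a first stage like that comes from, here is a hedged sketch in the same spirit: regress the chosen host’s quality on the mean quality of the consideration set shown to the guest. The variables and magnitudes are simulated and hypothetical, not Joffe et al.’s actual data; the point is just that with industry-scale samples and a consideration set that strongly steers choice, the first-stage F statistic gets very large.

```python
# Hedged sketch of a first-stage check in a consideration-set design.
# All data are simulated; names and magnitudes are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200_000

z_set_quality = rng.normal(size=n)      # mean quality of the hosts shown to the guest
taste = rng.normal(size=n)              # unobserved guest taste
d_chosen_quality = 0.8 * z_set_quality + 0.3 * taste + rng.normal(size=n)

# First stage: does the (randomized) consideration set move the chosen treatment?
first_stage = sm.OLS(d_chosen_quality, sm.add_constant(z_set_quality)).fit()
print(f"First-stage F: {first_stage.fvalue:,.0f}")  # huge at industry sample sizes
```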
The consideration set IV design, as I see it, bears many similarities to the leniency design, in that the instrument is built from people with differing characteristics, from which we can calculate continuous instruments. You aren’t necessarily, though, using the systematic habits of those people as an instrument so much as the mean characteristics of the box itself. But how much of a difference in kind is that, really? I’m not entirely sure. Perhaps, for instance, the monotonicity assumption is a little easier to defend here, since the therapists in the box can’t change their race. The racist client will never pick the Black therapist, regardless of the composition of the box, so maybe on that dimension too, monotonicity could be more easily defended. The point is that it’s worth some focused reflection by social scientists and econometricians both within and outside of industry.
So the consideration set IV design, as I see it, presents the user with more or less a box of options. The box may be a screen-sized listing of options. It could be a front page. It could literally be an *imposed* set from which you have no option but to choose. Yelp, for instance, has more porous edges to its consideration set, since you can keep scrolling down, whereas the therapist platform Talkspace presents only a few therapists to choose from, and there are all sorts of things in between.
Conclusion
Angrist and Krueger once remarked that “good instruments often come from detailed knowledge of the economic mechanism and institutions determining the regressor of interest.” Who knows the economic mechanism and institutions better than experts, and where are you going to find experts other than among people stubbornly devoted to the same questions over many years and/or those within a single firm working a single market on a narrow set of products and designs? Classifying observational designs, like shift-share and close elections, is a helpful way to give researchers the means to find good instruments in their midst. Expertise is like a water divining rod: hidden wells can be found if we just learn how to pay attention to the things around us. Curiosity, and not just statistical knowledge, is key to finding good instruments.
This is somewhat unrelated, but when talking to companies, a lot of economists (when they end up writing papers with industry data) are cagey about naming the companies. Why is this, and is naming the company, as you do with Talkspace, going to preclude you from doing research using their data later on?