How can we know if paid search advertising works?
Addressing causal problems at eBay and Facebook
Selection bias in paid search advertising
Google is an interesting firm in that what it sells and what it does aren’t exactly the same thing. What it does is lower search costs through algorithms that match people with other people. But search is not what it sells, at least not exactly. Google’s revenue comes from selling advertisements, most of it through search engine marketing, a type of advertising that targets users based on the search terms they enter. Google reported around $46 billion in revenue in 2012, and 95% of that came from advertising.
Search engines like Google and Bing use auctions to price their advertising. But why do people pay? Imagine ships arriving at the shoreline where my lighthouse waits, and I hope to collect a fee for facilitating safe passage. Presumably my bargaining position rests on the fact that without my light, you wouldn’t get to shore. But this reasoning assumes that my lighthouse provides the only useful light around. How strange would it be if other forms of light were in abundance, such as in daytime, or on a shoreline with so much commercial activity that the street lights bathe everything around them with clarity? The existence of nearby substitutes threatens to undermine the entire business model of lighthouse keepers.
Well, in the case of Google, that analogy isn’t so far from the truth. Type in the words “eBay mittens”, for instance, and you are presented with not one but two types of results: sponsored search ads and organic ones. The sponsored search ads are sold at auction to firms who compete with one another for eyeballs using the very models designed by economists like Hal Varian and Susan Athey. And yet right next to them — literally just beside them — is a set of links that are, best I can tell, nearly perfect substitutes. And wasn’t I going to go there anyway? Otherwise, why did I type in “eBay mittens”? Maybe I would have ended up at your store even if you hadn’t shown me your advertisement. Knowing the answer requires answering a causal question. No one should pay for a route if the free route would have been used anyway, right?
Catching Natural Experiments in the Wild
While a randomized controlled trial could pin a question like this down, RCTs are not always possible, not because we lack the imagination to run them, but because the resources we need to run them, like permission and money, are things no one will give us. In those situations, the relative value of the natural experiment rises: not because it is more credible than the RCT, but because it is more credible than what we otherwise have, which is sometimes observational differences so severely biased that they give the exact opposite of the real answer.
That opportunity presented itself in March of 2012, when eBay conducted a test to determine whether brand keyword search advertising caused revenues to increase. Steve Tadelis, an economist at the University of California, Berkeley’s Haas School of Business, was a Distinguished Scientist at eBay at the time and learned of the test a few weeks later from someone else within the firm. He recognized the relevance of the event for answering questions about the causal effect of advertising on revenue. Below you can hear Steve talk about it in an interview I did with him recently about his time at eBay, the economics of the Internet, and economists in tech more generally.
Steve, Thomas Blake and Chris Nosko documented what happened when eBay flipped the switch in their 2015 Econometrica article, “Consumer Heterogeneity and Paid Search Effectiveness: A Large-Scale Field Experiment”. Paid clicks were driven to zero when advertising was suspended, but surprisingly there was a nearly one-to-one increase in natural clicks, resulting in no net reduction in traffic to the site. It appeared as though users were already going to eBay, because shutting down the paid search path hadn’t deterred them; they got there anyway using the organic links instead. It was almost as if people were using the paid links not to learn about products, but to navigate to the site. In other words, selection bias with respect to paid click advertising and arrival at the site was probably baked into their data.
Blake, et al. took this information to the brass at eBay and convinced them to run large geographically based field experiments across 30 percent of all markets — some large, some small. They estimated log-log models of regional sales on regional ad spending, using dummies for the experimental regions, periods and their interaction (a simple on-off DiD specification) as instruments, to get estimates of revenue elasticities with respect to spending, or return on investment (ROI). Compare the OLS results in columns 1 and 2 to their 2SLS results in columns 3-4. The significance vanishes with IV, and the effect sizes fall to precise zeroes, orders of magnitude smaller than the naive regressions in columns 1 and 2. Even the DiD estimate in column 5 is a precise zero. The experiments suggest a nearly zero return on eBay’s expensive paid search advertising.
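To see why the on-off instrument matters, here is a minimal simulated sketch (this is my own invented data-generating process, not eBay’s data or code). Latent regional demand drives both ad spending and sales, so the naive regression of log sales on log spending is badly biased upward; a 2SLS that instruments spending with the experiment’s treatment-by-period dummy recovers the true elasticity, which I set to zero here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # region-period observations (invented)
demand = rng.normal(0, 1, n)          # latent demand confounds spend and sales
post = rng.integers(0, 2, n)          # experiment period dummy
treat = rng.integers(0, 2, n)         # regions where ads were switched off
z = treat * post                      # on-off DiD instrument

# spending responds to demand, and drops when the experiment turns ads off
log_spend = 1.0 + 0.8 * demand - 1.5 * z + rng.normal(0, 0.3, n)
true_elasticity = 0.0                 # ads have no causal effect in this sketch
log_sales = (2.0 + true_elasticity * log_spend
             + 1.0 * demand + rng.normal(0, 0.3, n))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# naive log-log regression: coefficient on spending picks up demand
X = np.column_stack([np.ones(n), log_spend, post, treat])
naive = ols(X, log_sales)[1]

# 2SLS: first stage projects log spending on the experiment dummy
Z = np.column_stack([np.ones(n), z, post, treat])
fitted_spend = Z @ ols(Z, log_spend)
X2 = np.column_stack([np.ones(n), fitted_spend, post, treat])
iv = ols(X2, log_sales)[1]            # lands near the true zero elasticity
```

The naive estimate comes out large and positive even though the true elasticity is zero, which is exactly the pattern in their columns 1-2 versus 3-4.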
But these are local average treatment effects (columns 3-4), and since they differ somewhat from the ATT estimates in column 5, there is at least some a priori reason to suspect heterogeneity might be an issue.1 So the authors then estimate effects for different users: those who have made different numbers of purchases over the last year, and those who have recently purchased. Here we see slightly different results from the more aggregate ATT and LATE parameters just presented.
Notice that effect sizes are positive for people who haven’t made purchases in the last year and decline as we move right toward more intensive customers. This and similar patterns convinced the authors that for some customers at eBay, paid search advertising was informative about where to find things. The more familiar customers were with eBay, the less important the paid search advertisement was in getting them to the site, which could mean that paid search advertising is effective, but only when targeted at the right people.
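The subgroup logic is easy to sketch. Below is a stylized simulation (invented numbers, not their data) in which a randomized ad exposure moves only infrequent users; splitting the experimental sample by purchase recency recovers the heterogeneous effects:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40000
frequent = rng.integers(0, 2, n)   # 1 = purchased recently (hypothetical flag)
treat = rng.integers(0, 2, n)      # randomized paid search exposure
# stylized assumption: ads raise outcomes by 0.5 for infrequent users only
y = 1.0 + 0.5 * treat * (1 - frequent) + rng.normal(0, 1, n)

# simple difference in means within each subgroup
ate_new = (y[(frequent == 0) & (treat == 1)].mean()
           - y[(frequent == 0) & (treat == 0)].mean())
ate_freq = (y[(frequent == 1) & (treat == 1)].mean()
            - y[(frequent == 1) & (treat == 0)].mean())
```

Because exposure is randomized, subgroup differences in means are unbiased for the subgroup effects, mirroring the declining pattern across customer intensity.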
Selection bias, DAGs and conditional independence
The Blake, et al. (2015) paper presented convincing evidence that selection bias was severe when trying to figure out return on investment at the eBay platform. That evidence was convincing in large part because the underlying research design, based on physically randomized treatments, convincingly eliminated selection bias. But notice the words in italics — at the eBay platform. Those words are not a trivial qualification. Just as eBay may have heterogeneous returns on investment depending on who is targeted, different firms may have heterogeneous returns on advertising investment as well.
Facebook is, like Google, in the business of selling advertisements on its platform. So it is natural that we might want to repeat the Blake, et al. analysis by examining how paid search advertising performs on its site. Enter Brett Gordon (Northwestern), Florian Zettelmeyer (Northwestern), Neha Bhargava (Facebook) and Dan Chapsky (Facebook). Their 2019 study published in Marketing Science, entitled “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”, analyzed 15 RCTs comprising half a billion user-experiment observations and 1.6 billion ad impressions. Unlike Blake, et al. (2015), these authors did find evidence of a positive ROI (measured in a target causal parameter called “lift”). Effect sizes range from as low as 1% to as high as 1500%. These effects differed, though, depending on where the advertisements occurred on the platform, suggesting that heterogeneity is common not only among consumers but also across platforms and the locations where advertisements appear.
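Lift-style parameters are straightforward to compute from an RCT. Here is a toy sketch with invented conversion rates, using one common definition, the percentage difference between treated and control conversion rates (the paper’s exact definition involves additional wrinkles around actual exposure):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
assigned = rng.binomial(1, 0.5, n)   # randomized assignment to see the ad
# hypothetical conversion rates: 2.0% control, 2.5% treated
conv = rng.binomial(1, np.where(assigned == 1, 0.025, 0.020))

treat_rate = conv[assigned == 1].mean()
ctrl_rate = conv[assigned == 0].mean()
lift = (treat_rate - ctrl_rate) / ctrl_rate  # proportional lift from the ad
```

With these invented rates the true lift is 25%, and the estimate will be close to that in a sample this large.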
But this is not the only interesting part of this paper for me. Another is the comparison of approaches undertaken. The authors accompany their experimental analysis with three observational designs unlinked from the explicit randomization itself: propensity score matching, regression adjustment and stratified regression. And here (shown below), we see that the selection-on-observables methods employed led, pretty much across the board, to severely biased and often overestimated causal effects of advertising on revenues.
It’s worth pausing and reflecting on the identifying assumptions needed to identify causal effects using matching and weighting methods like inverse probability weighting. The assumptions — conditional independence and common support — are easy to write down, but they are not easy to defend, because they require a reality in which, once you condition on a matrix of covariates X, all remaining variation in the treatment is independent of potential outcomes, or what is sometimes termed “as good as random”.
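To make the mechanics concrete, here is a toy inverse probability weighting example with a single confounder and invented coefficients. For simplicity the true propensity score is used directly; in practice it would have to be estimated (say, with a logit), and conditional independence assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x = rng.normal(0, 1, n)                      # observed confounder
p = 1 / (1 + np.exp(-x))                     # true propensity score
d = rng.binomial(1, p)                       # treatment depends only on x
y = 1.0 * d + 2.0 * x + rng.normal(0, 1, n)  # true ATE = 1 in this sketch

# naive difference in means is contaminated by x
naive = y[d == 1].mean() - y[d == 0].mean()

# IPW: reweight each arm so it resembles the full population
w1, w0 = d / p, (1 - d) / (1 - p)
ipw = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()
```

The naive contrast lands far above 1 because the treated have higher x, while the weighted contrast recovers the true effect — but only because treatment here really is as good as random given x.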
What’s nice about this identifying assumption is that, like its big brother independence, if you can commit to believing it, then you can simply adjust for X in a variety of statistical ways and the selection bias problems inherent in the design vanish. But believing in conditional independence is somewhat akin to believing in Bigfoot. Without extremely good evidence for it, extreme skepticism is warranted, because the behavioral process that generates data of this kind is based on intentional human choices that may not be random at any step, or at least not with easily identifiable randomness.
How does one go about justifying conditional independence beliefs? With a model. You need a credible model of the behavior surrounding your data to defend conditional independence. I find directed acyclic graphs (DAGs) to be extremely helpful at driving this point home. Conditional independence requires believing in a DAG describing the causal processes in your data, one that provides you with a set of observable covariates satisfying the backdoor criterion with respect to advertising and spending. Let me explain with a simple DAG.
In this DAG, there are two backdoor paths from D to Y. The first is to the left of D: D←X1→Y. And the second is just to the southeast: D←X2→Y. Thus a necessary and sufficient condition to block both of these open backdoor paths from D to Y is to adjust your comparisons between treatment and control by conditioning on X1 and X2.
But think carefully about what you’re doing here. This DAG does not say you should condition on “rich firm level data on users”. This DAG says that to identify the causal effect of advertising on spending you should condition on X1 and X2 — no more, no less. A strategy like that requires reasonable confidence in the DAG itself, because to say that you need to condition on X1 and X2 is to say conditional independence is satisfied once you condition on X1 and X2. It means you are reasonably sure that the humans in your data, acting according to their own goals, are effectively exposed to advertising at random once you condition on X1 and X2. If you are wrong, then conditioning on any set of covariates isn’t solving your problems.
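A simulation makes the point. The data-generating process below follows the DAG’s structure with invented coefficients: conditioning on both X1 and X2 recovers the true effect of D on Y, while conditioning on X1 alone leaves one backdoor open and biases the estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
d = 0.7 * x1 + 0.7 * x2 + rng.normal(0, 1, n)             # D <- X1, D <- X2
y = 1.0 * d + 1.5 * x1 - 1.0 * x2 + rng.normal(0, 1, n)   # true effect of D = 1

def coef_on_d(controls):
    # OLS of y on d plus the chosen adjustment set; return the d coefficient
    X = np.column_stack([np.ones(n), d] + controls)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

biased = coef_on_d([x1])        # backdoor through X2 still open
adjusted = coef_on_d([x1, x2])  # both backdoors blocked
```

Only the adjustment set the DAG prescribes closes both paths; a bigger or smaller set is no substitute for the right one.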
Revisiting LaLonde at Facebook
Gordon, et al. (2019) is an interesting paper because it emulates a classic 1986 paper in economics by the late Bob LaLonde, in which he attempted to evaluate the efficacy of then-contemporary program evaluation methods, like regression and difference-in-differences, by comparing the causal effects from an RCT to the estimated causal effects from a selection-on-observables approach. The program he analyzed was a job training program aimed at disadvantaged workers called the National Supported Work Demonstration, or NSW, which randomly assigned workers into treatment and control. The program worked — the treatment group’s earnings a few years later were around $1,800 higher than the control group’s.
If we have an RCT, then we know the ground truth of the ATE. So LaLonde reasoned that if we know the ground truth of the program, we could drop the experimental control group and replace it with a non-experimental group drawn from the US population to see how common econometric methods of the time performed, using the RCT ground truth as a benchmark. And that’s what he did. In his job market paper, he dropped the experimental control group workers and replaced them with samples of Americans drawn from the CPS and the PSID in six separate analyses. In what was at the time a very discouraging set of results, LaLonde found that contemporary methods in program evaluation were unsuccessful at recovering the known ATE. Not only did the effect sizes vary a great deal from sample to sample and method to method, they often weren’t even the right sign. The paper made a major splash and arguably contributed to the growing and influential research agenda coming out of the Princeton labor group in the mid 1980s, which we now call the credibility revolution, in which explicit forms of randomization, be it in an RCT or naturally occurring, were favored over the much more challenging selection-on-observables scenarios.
But then in 1999 and 2002, two young economists, Rajeev Dehejia and Sadek Wahba, returned to this question and re-analyzed the non-experimental datasets LaLonde had put together. They were interested in whether newer selection-on-observables methods, namely stratified propensity score matching, were as inept as the ones he had studied. And this time, the news was more positive. Once they imposed common support on the data through trimming, results became much more sensible. Effect sizes flipped back to positive and were more in the range of what was known from the RCT itself.
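Their approach is easy to illustrate. The sketch below (invented data, not the NSW samples) trims observations outside the region of common support and then estimates the effect within propensity score strata; because the confounder is fully captured by the score here, the stratified estimate lands near the truth while the naive difference is far off:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-1.5 * x))               # true propensity score
d = rng.binomial(1, p)
y = 1.0 * d + 2.0 * x + rng.normal(0, 1, n)  # true effect = 1

naive = y[d == 1].mean() - y[d == 0].mean()  # badly biased by x

# impose common support by trimming, then stratify on the propensity score
keep = (p > 0.1) & (p < 0.9)
ps, ds, ys = p[keep], d[keep], y[keep]
strata = np.digitize(ps, np.linspace(0.1, 0.9, 9)) - 1   # 8 strata
diffs, sizes = [], []
for s in range(8):
    m = strata == s
    if ds[m].any() and (1 - ds[m]).any():    # need both arms in the stratum
        diffs.append(ys[m][ds[m] == 1].mean() - ys[m][ds[m] == 0].mean())
        sizes.append(m.sum())
stratified = np.average(diffs, weights=sizes)
```

Within a narrow propensity score stratum, treated and control units are comparable, so averaging the within-stratum contrasts approximates the treatment effect on the common-support sample.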
So the fact that Gordon, et al. (2019) do effectively the same type of analysis is kind of cool and interesting from a history-of-causal-inference perspective, not to mention practically very helpful. After all, how often are we working in a pinch with observational data and selection-on-observables designs? A lot. So why not check and see how well they perform using our known ground truth from the RCT as a benchmark? Maybe we can take advantage of the fact that Facebook knows and collects a ton about its users, and condition on those variables when doing so.
So that’s exactly what they did. The authors examined the question using a variety of selection-on-observables methods, conditioning on users’ social network data from Facebook itself merged with geographic data from the American Community Survey. There were, in other words, dozens of covariates used in this analysis. Surely if selection on observables were to work, it would work here?
But look closely at these variables and ask yourself — if you had written down the DAG describing selection into advertising impressions and spending, what covariates would you have said you needed in order for selection on observables to deliver reasonably unbiased estimates of the ATE? Would you say that a user’s number of Facebook friends at the time of the ad impression was one of them? What about the number of friend requests they’d ever sent? Relationship status? The number of days since they last accessed Facebook by desktop or mobile? The operating system on their phone? Would any of these seem relevant to the strategy you needed in order to close all backdoor paths and thereby satisfy conditional independence?
Or let’s look at the ACS. By linking each user to their local area, the authors were able to fill out detailed information about where they lived and use it in their matching algorithms. Look closely at these variables, too, though. Are these the ones you would’ve said a priori that you absolutely needed to condition on in order to close the backdoors between advertising and revenue? The share of people working in fisheries in the area? Seriously?
Conditional independence isn’t an easy assumption to defend. Not because it cannot hold — it’s unscientific to rule out entire research designs without cause, since we know they can theoretically identify causal effects. So why then do I say conditional independence is not an easy assumption? Because it can only be defended with a credible appeal to a reasonably authentic DAG, one that accurately portrays the processes causing selection into treatment. If you do not have such a DAG, then you cannot use selection-on-observables methods. You will simply need to move on, because, as we saw in this case, sticking in 50 covariates for no other reason than that they’re in the data doesn’t identify anything. Identifying causal effects does not happen with wishes. You cannot make your car fly just by uttering an incantation. Causal inference isn’t a haphazard enterprise. It does not love to eat out of a kitchen sink filled with dirty dishes just because you don’t have anything else to eat from. It needs clean dishes, which are nothing more than reliable, expert descriptions of the underlying behavioral sorting into and out of treatment, in order to defend something like conditional independence. And many people are either incapable of or unwilling to provide that.
Causal inference is and always will be an important type of knowledge, in the sciences as well as in commerce. Without it, we cannot answer basic yet fundamentally crucial questions such as “are my advertising dollars being wasted? Should I redirect them elsewhere?” Mistakes about that can be life or death for marginal firms in highly competitive environments operating at the knife edge of solvency.
The randomized controlled trial has the reputation of being the gold standard because, with its physical randomization, it can eliminate selection bias in the blink of an eye, at least in large samples and without behavioral readjustments within the treatment. We can write down those independence equalities with permission because, as Don Rubin once said at a conference I attended, “we know how the science works”. What science, you might ask? The science of physical randomization.
But selection on observables has no such “science”. Unless the experimenter is physically randomizing conditional on X, we have no science to guide us. We have, instead, only theories — theories of the physical sorting into treatment happening locally in the area of the data production itself. And without those theories, selection-on-observables designs are worthless. Given the presumption of innocence and unbiasedness they are often granted, they may even do more harm than good. Mostly harmful econometrics is, at the end of the day, estimation that isn’t guided by domain-specific knowledge and credible behavioral theory.
Although given that they literally turned off paid search in these regions, there are no always-takers or never-takers, and so the heterogeneity in question really shouldn’t show up in the aggregate parameters.