The Ongoing Role of Machine Learning Engineers and Data Scientists in Industry's Lucas Critique
A Theory of Growing Demand for Causal Inference in Industry
Introduction
The demand for “data workers” has expanded dramatically in recent years. Machine learning engineers, for instance, were the fourth fastest growing job over the last five years according to a report by LinkedIn. Data science continues its own steady march of growth as well. While ML engineering and data science are distinct, they overlap to varying degrees around the core problem of extracting insights from datasets through programming and, usually, statistical modeling. These jobs are handsomely rewarded, with total compensation packages among the highest in the country, which is both a reflection of their perceived importance and a sign that we will continue to see students flow into the programs that train them.
There are now several roads into these types of data work, including graduate training in computer science, mathematics, statistics, data science, economics and other quantitative social sciences, as well as some of the natural sciences like physics and biology. The bar one must clear involves both technical and conceptual skills: technical in that one must possess core competency in programming to clean data and perform calculations using a range of contemporary statistical models, and conceptual in that one must understand what the data are and are not “saying” and which of the questions one could ask of them really matter.
This essay lays out some ideas I have about emergent, bottom-up demand for causal inference in industry, and why that demand will come primarily through the labor supply of machine learning engineers and data scientists. I will also note that this growth is likely to be fairly uneven along multiple dimensions. To help make this argument, I will rely on my own thoughts and opinions, as well as an excellent new paper by Hünermund, Kaminski and Schmitt to help frame the essay. At the end, I will discuss my plans to help push this forward by teaching causal inference at Scholar Site in a new class I am developing aimed at machine learning engineers, data scientists and product managers. But let's now dive in!
Prediction Machines
Does this patient have breast cancer? How many books will this author sell? If the firm raises prices, what will happen to sales? All of these questions ask us to make predictions, but they are not all the same type of prediction question. The first two ask about events that either already exist or will come to exist without any direct disruption of the closed system containing them, whereas the third predicts what will happen if we surgically alter part of that system, such as a product's price.
These three questions are helpful examples that illustrate three ways to organize our thoughts around prediction. Those are:
What is going on right now in parts of the system I cannot see? (“Does this patient have breast cancer?”)
What will happen in the future if the system’s current data generating process is not disrupted? (“How many books will I sell?”)
What will happen in the future if the system’s current data generating process is disrupted? (“If the firm raises prices, what will happen to sales?”)
When, why, and for what use we might care about each of these is just as important a question as how we go about getting answers. You could imagine using the exact same regression model in each of these situations, but whether that model gives a reliable answer depends on which of the three questions is being asked. At a fundamental level, the difference between these questions has less to do with statistics and more to do with epistemology: what exactly is this knowledge we are trying to obtain anyway?
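To make that concrete, here is a minimal sketch in Python using entirely made-up numbers (the data generating process, coefficients and variable names are all hypothetical). One and the same OLS regression forecasts sales reasonably well, yet its price coefficient is a misleading answer to the third, causal question, because the simulated firm raises prices precisely when demand is hot.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data generating process: the firm raises prices when demand is hot,
# so price is positively correlated with a demand shock the analyst never observes.
demand_shock = rng.normal(0, 1, n)
price = 10 + 1.5 * demand_shock + rng.normal(0, 0.3, n)
true_price_effect = -2.0
sales = 100 + true_price_effect * price + 5 * demand_shock + rng.normal(0, 0.5, n)

# One and the same OLS regression of sales on price...
X = np.column_stack([np.ones(n), price])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
residuals = sales - X @ beta
r2 = 1 - residuals.var() / sales.var()

# ...answers question 2 (forecasting an undisturbed system) reasonably well,
# but answers question 3 (what happens if *we* raise prices) badly: the
# estimated coefficient is positive even though the true causal effect is negative.
print(f"R-squared for forecasting: {r2:.2f}")        # roughly 0.7 in this simulation
print(f"OLS price coefficient:     {beta[1]:+.2f}")  # roughly +1.2
print(f"True causal price effect:  {true_price_effect:+.2f}")
```

The same crank grinds out both numbers; what differs is whether the question being asked is about an undisturbed system or a surgically altered one.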
When we are seeking to answer questions about the future using the past, but will not be disrupting the system encasing those events, then we are simply in the business of forecasting. And great strides have been made in computer science, statistics and other mathematical fields in training models on data to make better and better guesses. These better guesses have been worth billions of dollars to firms, helping them understand who their customers are, what they value, and by how much.
The question about the patient's current status is not so different, even though it is about the present rather than the future. Like the forecasting question, trying to fill in missing data on a person's status is really just another form of prediction, resting on the assumption that the data we have possess the kind of external validity needed to inform our guesses about the missing units.
Both of these types of questions have been profoundly enhanced by contemporary statistical modeling like machine learning. But what is machine learning exactly? El Naqa and Murphy (2015) describe machine learning as “an evolving branch of computational algorithms that are designed to emulate human intelligence by learning from the surrounding environment”, an approach that has greatly assisted us in the wake of the massive datasets often created passively by firms' own operations. Hünermund, Kaminski and Schmitt (2021) say this about machine learning's value:
“Because of their superior forecasting abilities, compared to traditional statistical and econometric techniques, machine learning methods have been called prediction machines, a term that captures well their main purpose of predicting the state of an output variable based on complex correlational patterns in the input data.”
This idea that machine learning methods may function like a crank grinding inputs of data into prediction outputs is a helpful metaphor. But Susan Athey, John Bates Clark award winner, former chief economist at Microsoft and professor of economics at Stanford, once elaborated a bit more on this theme at a conference I attended. When trying to answer whether a question was appropriate for a machine learning solution, just ask yourself this question: could my research assistant do this for me if my research assistant had an infinite amount of time and attention? One of the things RAs can do, for instance, is count, compare and calculate. And if the task involves counting, comparing and calculating on the order of millions if not billions, then maybe you shouldn’t use an RA — maybe you should use machine learning. And for many firms, that’s exactly what they use these tools for.
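As a stylized illustration of that kind of counting, comparing and calculating at scale, here is a short sketch using scikit-learn on synthetic data (the features, labels and the conversion story are invented for this example, not drawn from any real firm):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical customer features the firm collects passively through its own operations.
X = rng.normal(size=(n, 5))

# The "state of the output variable" (say, whether a customer converts) depends on
# a messy, nonlinear combination of those features: complex correlational patterns.
logits = 1.5 * X[:, 0] * X[:, 1] - np.abs(X[:, 2]) + 0.5 * X[:, 3]
y = rng.random(n) < 1 / (1 + np.exp(-logits))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The prediction machine: grind inputs into out-of-sample guesses about the output.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Out-of-sample AUC: {auc:.2f}")
```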
But it's easy to say “use machine learning” if you have a data worker who can employ those tools in an expert way. The growth in data workers reported by LinkedIn over the last several years is growth in labor, in other words. It is skill that is in demand, not equipment. The scarcity is on the labor side, not the capital side, and the expansion of labor supply is not infinitely elastic. Evidence suggests that while the work itself is probably broadly valuable across many industries, it has tended to agglomerate spatially in the Bay Area, concentrated among giant tech firms and startups. This slow adoption of data workers is likely due to uncertainty both about what the workers can do for the firm and about how that translates into sales and pricing in the output market. Together these have likely contributed to an uneven spread of data workers across industries. In a 2021 article in the Harvard Business Review, Hünermund and Bammens write that:
“Building successful AI applications requires a critical mass of data scientists and ML engineers, who are in high demand — and attracting the necessary talent is particularly challenging for midsize firms lacking the appeal of startups and the resources of giants.”
Since the work itself does involve counting, comparing and calculating, and given that the labor may be inaccessible for firms outside the Bay Area (although that could change with increased work-from-home policies) or for firms with tighter budget constraints on these extensive margins, one's mind immediately goes to software. Perhaps one solution to the high price of labor and the scarcity of people who can run these prediction machines will be the emergence of software. But it is unclear when and where software becomes a trustworthy substitute for the data workers themselves. Over-reliance on automated predictive modeling, using little more than algorithmic libraries, can create challenges for firms who, as Hünermund and Bammens note, lack the resources to compete with industry giants for scarce ML talent. Consider the case of Zillow, whose imbalanced portfolio of housing inventory during a cooling market resulted in a $500 million loss. While the exact cause of this problem is not well understood, it is thought that the company's AI algorithms were no longer accurate in light of changing economic conditions, which if true suggests that firms face real liabilities if they approach prediction naively. Data workers and software are likely complements, not substitutes.
Big Data is Still Not Big Enough
But let's now consider the third question: “If I raise prices, what will happen to sales?” It may seem like the others in that it is a prediction about the future, but that is where the similarities stop. It is a causal question, and causal questions require a completely different way of thinking and often a different set of tools. Statistical modeling is like a knife: you can use it for surgery or for spreading jam on toast, but you do those things for very different reasons, and those differences swamp any similarities linking them. The differences, in other words, between causal inference and non-causal predictive modeling are far more important than any similarities they may have.
The tools we use for causal inference are very important and have been fleshed out and developed over the last century. The key concept that statisticians, economists and now computer scientists like Judea Pearl all agree on is the “counterfactual”. To motivate the counterfactual mathematically, it is now common to use the Neyman-Rubin causal model, although there are other approaches too. The Neyman-Rubin model formalizes our thinking about causality through hypothetical states of the world in which a policy had and had not taken place. These hypothetical worlds, called potential outcomes, allow us to frame a causal effect as a simple comparison between them, which can then be used to express when things like selection bias do and do not emerge.
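A short worked equation helps fix ideas. Using the standard potential outcomes notation (my notation here, not the paper's), write Y^1 and Y^0 for the two hypothetical states of the world and D for whether the policy occurred. Then the simple comparison of observed group means decomposes into the effect we want plus the selection bias we worry about:

```latex
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{simple difference in means}}
= \underbrace{E[Y^1 - Y^0 \mid D=1]}_{\text{average treatment effect on the treated}}
+ \underbrace{E[Y^0 \mid D=1] - E[Y^0 \mid D=0]}_{\text{selection bias}}
```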
Counterfactual-based reasoning is very helpful for pinning down the question we are asking, but like all models, it is also pure fiction. There is no counterfactual. There never was, and there never will be. We tell these stories not because they are true but because they help us navigate our historical lives, including the science, policy and commerce on which we depend for allocating the earth's scarce resources and improving the well-being of its inhabitants. The fact that a counterfactual is technically “false” does not invalidate the approach if it still helps us find useful answers. That false things can be helpful is the paradox of models.
But while the Neyman-Rubin causal model helps us define the question, it also exposes a glaring flaw. If knowing causal effects requires comparing two states of the world, and only one of those states ever exists, then we have a fundamental problem. We cannot directly know a causal effect, because we are missing the data we would need to make the measurement.
Once we frame the problem of causal inference as a missing data problem, it does not take a genius to see the appeal of data science and machine learning. Since machine learning applied to giant databases can find “complex correlational patterns” that improve prediction, then surely the answer to the missing data problem is simply to get more data. Put differently, hasn’t big data already solved the causal inference problem if the problem itself is due to missing data?
Unfortunately, no. It has not, and for the problem I am describing, it cannot. The missingness we face is more profound than it seems, because the data we are missing do not exist anywhere. The patient either does or does not have cancer. Tomorrow it either will or will not rain. These are actual events; they occur in history whether we know them or not. Causal inference is not used to predict history, or today, or tomorrow. Like a wormhole into another dimension, it is used to estimate what might have happened had a different decision been made. That kind of counterfactual prediction is what we mean by causal inference: we impute an alternative reality in which the decision had not occurred and then compare the two to form what we hope are reasonably “credible” estimates of causal effects.
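Here is the same point as a tiny simulation (the numbers are mine, and of course only a simulation can ever contain both potential outcome columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1_000_000  # "big data" does not help: every new row is still missing a column

# Simulation only: in the real world we never observe both columns for the same unit.
y0 = rng.normal(50, 10, n)                        # outcome if untreated
y1 = y0 + 5                                       # outcome if treated (true effect = 5)
d = (y0 + rng.normal(0, 10, n) > 55).astype(int)  # units select into treatment on y0

df = pd.DataFrame({
    "D": d,
    "Y": np.where(d == 1, y1, y0),       # the only outcome history ever records
    "Y1": np.where(d == 1, y1, np.nan),  # counterfactual column: missing for the untreated
    "Y0": np.where(d == 0, y0, np.nan),  # counterfactual column: missing for the treated
})

naive = df.loc[df.D == 1, "Y"].mean() - df.loc[df.D == 0, "Y"].mean()
print(f"True average treatment effect:  {(y1 - y0).mean():.2f}")
print(f"Naive treated-vs-untreated gap: {naive:.2f}")  # well above 5, because of selection bias
```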
Intersecting the Two
My strained attempts to distinguish the boundaries of these different types of prediction questions lead naturally into the subfields that have wrestled with each of them for decades. We owe much of what we know about prediction to statistics, econometrics and machine learning. And much of what we know about causal inference comes from applied statistics in the social sciences, such as econometrics, as well as, more recently, from those who work on artificial intelligence in computer science, like Judea Pearl.
Hünermund, Kaminski and Schmitt have a very helpful figure (below) that lays out the different tools that have been developed for prediction versus causal inference. In Figure 1, they show pure prediction modeling in column 1: things like OLS, logistic regression, time series, neural networks, regularization and other methods are the tools in this realm of non-causal predictive modeling. On the right we see the diversity of causal approaches, ranging from randomized controlled trials (including A/B testing), to the design-based approaches that have given us instrumental variables and regression discontinuity, to more recent work on causal machine learning like multi-armed bandits, DAGs, and causal forests.
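To give one concrete flavor of the design-based column, here is a sketch of instrumental variables on simulated data, with the instrument, effect sizes and setup all invented for illustration (a randomized encouragement shifts the treatment but has no other path to the outcome):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical setup: a confounder drives both the treatment and the outcome,
# while an instrument (a randomized nudge) moves the treatment only.
confounder = rng.normal(size=n)
instrument = rng.integers(0, 2, n)
treatment = 0.5 * instrument + 0.8 * confounder + rng.normal(0, 1, n)
outcome = 2.0 * treatment + 3.0 * confounder + rng.normal(0, 1, n)  # true effect = 2

def ols_slope(y, x):
    """Slope from a bivariate OLS regression of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

naive = ols_slope(outcome, treatment)           # biased upward by the confounder
first_stage = ols_slope(treatment, instrument)  # effect of the nudge on treatment
reduced_form = ols_slope(outcome, instrument)   # effect of the nudge on the outcome
iv = reduced_form / first_stage                 # Wald / 2SLS estimate with one instrument

print(f"Naive OLS estimate: {naive:.2f}")  # well above 2
print(f"IV (2SLS) estimate: {iv:.2f}")     # close to the true effect of 2
```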
Software developments may be coming. Geminos, for instance, where I am an advisor (as is Paul Hünermund himself), builds software that uses Pearl's DAG-based modeling approach to identify causal effects in a more automated fashion. While it is unlikely that the skill of data work can ever be fully separated from the automated parts, it's possible that software like Geminos and others may in the future allow for greater access to causal modeling.
But the real bottleneck, in my opinion, is not software, and it's not labor. It's firm demand. If firms do not understand that predictive modeling can be either causal or non-causal, if they cannot clearly delineate when a question that needs asking is causal, and if they do not believe causal modeling can help in situations where non-causal predictive modeling cannot, then it doesn't matter what breakthroughs happen in statistics, computer science, econometrics or software. Those insights will remain in the ground unless firms first begin to understand why and when they need a causal question, rather than a non-causal one, answered.
Industry’s Lucas Critique
In 1976, future Nobel laureate Robert Lucas criticized macroeconomic forecasting for failing to recognize that forecasts built on historical correlations were not causal estimates of Fed and other macroeconomic policies unless those correlations captured a theoretical causal effect. He did not use those exact words, but this is how we typically interpret the “Lucas Critique”. Something like the Lucas Critique is happening in industry as we speak, but not necessarily everywhere, and not necessarily in the same firms. Where we do see it happening, it is primarily among the data workers themselves: the machine learning engineers, the data scientists, and the product managers who interact with them and with management more generally.
My hypothesis, in other words, is that industry's Lucas Critique will come through the data workers, not management. To support this hypothesis with some evidence, I will now discuss a new paper by Hünermund, Kaminski and Schmitt in which, in addition to reviewing theories about causal inference, prediction and management, the authors interviewed 15 core data workers and managers to better understand the place of causal inference in contemporary organizations. Here is a summary of their findings.
Types of questions firms address with data science
In most of the interviewees' companies, machine learning and data science workers were employed to ensure and optimize product functionality. This included working on recommender systems, automated pricing, search and matching on online platforms, and price prediction. As such, these workers' jobs were said to relate to product recommendations and pricing itself, or to applying their knowledge to developing products. Using machine learning to forecast sales, demand and other financial figures came up frequently in the interviews, as did learning more about customer satisfaction. They write:
“Overall, data science and machine learning are mentioned as important inputs to managerial decision-making by providing information on business parameters relevant to the particular decision situation.”
Tools appear to be an important part of what these data workers bring to the table, interestingly. One of the interviewees writes that when faced with a problem, a data scientist might come with a toolkit that can solve it. This focus on tools cuts both ways — sometimes the tool seems to fit the question asked, but sometimes knowing the tool guides what questions end up being asked. As a result, there are limits built into the organization itself because of what knowledge and tools are bound up within the human capital employed by the firm.
Awareness of the difference between correlational and causal knowledge
A second theme that emerged from their interviews concerned the awareness that data workers like machine learning engineers and data scientists have of causal versus purely correlational knowledge, as well as the differences between their knowledge as a class of workers and that of management more generally. While nearly everyone was familiar with the conceptual difference, only 60% recognized the limitations of their own predictive modeling approaches for determining causality, or the potential risks of mistaking correlation for causation.
But as I mentioned earlier, the slow diffusion of causal inference within industry probably wasn't primarily due to a lack of human capital among the data workers. Rather, a lack of demand for causal knowledge at the management level likely plays a strong role in that slow diffusion. Some interviewees reported, not surprisingly, that their daily work was dictated by management's views and opinions, and if management didn't demand causal knowledge, then the data workers wouldn't be producing it. For these interviewees, causal inference was not yet relevant because it was not yet relevant to management.
Without demand for causal knowledge at the top, causal knowledge in firms tends to spread among the data workers themselves and then, through their contacts, elsewhere within the firm. The majority of people interviewed said that their understanding of causal inference was slowly spreading throughout the organization. This echoes what I was once told by a PhD economist at LinkedIn: when people wanted to know causal inference, they would tend to just ask the economists, and it would spread informally, person to person, project to project. Diffusion, in other words, tended to come from the bottom up, not the top down. And this bottom-up approach is, as is often the case, prone to agglomeration and eddies. Listen to these three anecdotes:
"I think we rely on that small set of causal inference experts to inject their expertise wherever they can, but it’s very unevenly distributed.”
"Together with one of our data scientists, I am the one who is currently pushing this topic. We are missing that view.”“In particular the data scientists are really aware of causal inference and are following discussions and developments.”
Importance of causal inference in business today
Most of the interviewees agree that causal inference is very important for business. They just note that the synapses haven't uniformly emerged within the firm. But the places where they have emerged are worth noting. Consider the important issue of “redemption”, a term the interviewees used to describe a customer coming back to use the app. What causes a customer to come back? This is obviously a vital question, because if you can cause them to come back, you can directly affect sales. So causal inference here seems like a very fruitful area for informing business metrics. It requires theoretical knowledge of behavior as well as of the metrics themselves, and thus the social sciences that work closely with causal inference are likely to be very helpful.
Other obvious candidates are anything regarding whether product changes have meaningful impacts. Think of the popular A/B test. Since firms are used to collecting evidence about product and service interventions based on A/B tests, the same kinds of questions would probably be candidates for causal modeling even with non-experimental data. This came up repeatedly with strategic decision making. Historically I always thought that game theory was the toolkit economists and mathematicians brought to strategy, but perhaps causal inference is part of that toolkit as well.
Still, pure prediction is more common in their data science projects. Most of their ML work is purely correlational, despite the fact that causal machine learning exists. Interviewees report an over-reliance on classical machine learning using basic statistical correlation, despite the existence of the causal machine learning methods shown in the figure above. And again, this may come down to managerial demand, because one interviewee said this:
“What we see quite often is that when you are asked to do a data science project, the questions that the client asks actually cannot be answered with the ML model you just trained them with. My educated guess would be that the majority of questions, in the end, are causal questions, but the way we are, as data scientists, have been trained to answer these questions is always in terms of classic machine learning.”
Diffusion of causal inference methods and techniques
The popularity of the A/B test is well documented. Microsoft has run more RCTs in the last five years than were performed in all prior human history. The value of physical experimentation is well understood; it is a hundred years old at this point and deeply connected to the scientific method. Not surprisingly, then, the causal inference methods that data workers do use are typically A/B tests (63.5% of respondents in Figure 4 below).
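For what it is worth, here is what that most-used tool looks like at its simplest, on made-up data: a randomized assignment, a difference in conversion rates, and a confidence interval around it (the lift and sample size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical A/B test: users randomized into control (0) and treatment (1).
arm = rng.integers(0, 2, n)
converted = rng.random(n) < np.where(arm == 1, 0.052, 0.050)  # true lift = 0.2pp

treat, control = converted[arm == 1], converted[arm == 0]
lift = treat.mean() - control.mean()
se = np.sqrt(treat.var(ddof=1) / treat.size + control.var(ddof=1) / control.size)

print(f"Estimated lift: {lift:.4f}  (95% CI: {lift - 1.96*se:.4f}, {lift + 1.96*se:.4f})")
```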
Thus it's interesting to contrast the demand for causal knowledge that does exist with the type of causal knowledge that is typically supplied. Even where demand for causal knowledge exists, it is rarely the observational designs that data workers employ. This is a juxtaposition that merits careful reflection. Given that many questions simply cannot be realistically answered with an A/B test, due to financial constraints or sheer realism, it's natural to ask whether firms are leaving money on the table by not training themselves to learn these methods, interpret their findings, know their limitations, and apply them.
Future of causal inference in industry
It seems natural that causal inference will move into industry via the machine learning engineers and data scientists. They are industry's statisticians as well as its hands-on data workers, and since applied causal inference is built on statistics and hands-on data work, plus theoretical knowledge about causal inference and its applications, it is almost certainly the case that when we see the Lucas Critique revolution take hold in industry, it will be because of the machine learning engineers and the data scientists.
Is causal inference valuable for industry and business? If so, how does it enter? Through graduate training in terminal master's programs? Do we rely on the existing social networks within the firm to share this knowledge? And if teams are siloed from one another, is waiting on that informal diffusion really the most efficient path?
Most likely, the means of improving causal inference skills and capabilities within organizations will come through the labor market itself: through cooperating with academia, where these skills are abundant, hiring new employees, investing in software, and, most of all, training existing employees.
Conclusion
So, I'll put my cards on the table. My goal is to assist in this process by teaching design-based causal inference classes on an ongoing basis. I already do this for people interested in social policy at my Mixtape Sessions platform, and I will now do it, nearly exclusively for industry, at Scholar Site. My upcoming class targets machine learning engineers, data scientists and product managers, largely because that is my interpretation of Hünermund et al.'s findings: the on-ramps into industry will come through those groups of data workers, and so I will share with them everything I can, within my ability, from the design-based approach to causal inference. Design-based methods are not the entirety of causal inference methodologies, as Hünermund et al.'s helpful taxonomy showed. My course does not include physical experimentation with A/B tests (though a course like that exists at Scholar Site), nor does it include the ML-based causal inference methods (though MIT professor of economics Victor Chernozhukov teaches that later this year). But it nonetheless plugs an important hole. For those interested in being a part of the first cohort at Scholar Site, you can sign up here.
Working with Scholar Site has advantages over my Mixtape Sessions platform in that the Scholar Site course will be primarily Python based, with business applications and a recognition that the audience is not drawn from applied social scientists seeking publication in academic journals. It's tailor-made for industry. It's nine classes, each two hours long, plus recordings, running from 4pm PST to 6pm EST over three weekdays from February 28th to March 24th. It's $595 per seat, and unlike Mixtape Sessions, it can be expensed because Scholar Site is credentialed to satisfy industry requirements for continuing education credit.
I continue to believe that democratizing causal inference, through steady teaching and talking with anyone and everyone, is one of the best things I can do to help increase demand. With this year's Nobel Prize in economics going to three pioneers in causal inference, I anticipate that demand will increase, and it is therefore important that the synapses be built so that machine learning engineers and data science workers who do not yet know this material can learn it in the most efficient way possible.
Hey Scott, thanks for all the effort you're putting into democratising causal inference. I'm equally enthusiastic about it becoming mainstream. My belief is that we need more folk who can derive causal inferences from observational data. I work in the chemical industry, at a company that is investing financially in data science. While that's a positive sign, we are limited by leaders who can imagine the use-cases that could attract data-science application. A lot of the growth in application is happening bottom-up and horizontally, just like you've cited. I can't imagine these folk investing in more infrastructure that could enable A/B experiments. Our industry lacks the velocity in feedback loops to take advantage of the A/B experiment infrastructure that tech firms deploy. But I sincerely believe that if we were to apply systematic causal inference based on observational data, we'd be able to appeal to the imagination of senior stakeholders and hopefully accelerate the maturity of causal applications. I've been self-studying to hone my skills in causal inference but could have done with some formal coaching from practitioners like you. A model that allows expertise to be scaled not just through education but through an apprenticeship model could turbocharge the causal inference field.
I hope your course will have a little bit of DAGs and SCMs in it at least, even if it mostly thinks of the world in terms of Rubin's "giant pandas dataframe with some NAs".
SCMs are going to be very intuitive to people who come from CS backgrounds and write code all day, and I think that's still where most data scientists come from. Likewise DAGs, I have the feeling they are still these weird objects to economists, but to anyone with an undergrad CS degree, they're a big part of every intro to algorithms course.