Design Time is a Substitute for Controlled Randomization: Part 1
The basic idea of this post is that if it takes 10 hours of careful design to make a randomized controlled trial (RCT) credible, then it would take more than 10 hours of careful design to make an observational study equally credible. If I'm right, this means time and effort are substitutes for controlled randomization, so let me explain why I think so. This will be a multi-part Substack series, and this is part 1. In this installment, I will lay out the gold standard metaphor, propose that there is a distribution of studies by quality within research designs, argue that the main mechanism determining the quality of a study is the time taken in the design stage, and then lay out the first step of that design stage.
Gold Standard Metaphor
You often hear the phrase that the RCT is the "gold standard" in causal inference. It's a divisive phrase, and how it lands depends on who says it and to whom, because it is often used to throw shade on a large body of unseen work ("observational" causal inference) before any effort has been made to read and evaluate any specific study.
But it's not quite a scientific claim, because it isn't really empirical: it isn't a claim about anything observable. What would the testable prediction from such a claim even be? Credibility is not observable. It is either something subjective, a matter of one's posterior beliefs after examining a study, or something that can be worked out from first principles. Neither applies to the gold standard metaphor, because credibility is never actually observed. And even if it were, whatever method we used to evaluate whether a design achieved it would be the one method that cannot itself be tested, because it is prior to all methods.
No, the gold standard claim is more of a generalization, almost a thought experiment, than anything else. It is an ordinal ranking based on the best-case scenario for each type of design, given what we know about the unseen properties of each design's elements. Still, I am reluctant to grant that the metaphor is automatically true, because each design lays out a blueprint that will yield credible estimates of target causal parameters so long as various assumptions hold. If those assumptions did hold, then that study would be "credible." So it is still not technically clear how you are supposed to interpret an ordinal ranking that says some designs are more credible than others.
You might say this comes down to pedantry, but in practice people tend to use the metaphor to dismiss out of hand an entire body of work they refuse to read, while simultaneously accepting wholesale an entire body of work they will also never read. In reality, each design has a collection of studies that vary in credibility, both relative to one another within the same class and relative to studies in other design classes. That is because several things make a study credible, not merely the design itself. In a pool of RCTs, there is a distribution of studies, some more credible than others. And in a pool of observational studies, there is likewise a distribution of quality. Saying one pool is more credible than the other therefore reflects two factors: the nature of the design itself, which must have key elements that make the epistemology of inference more believably causal, and the quality distributions of the two pools.
So how about this? How about we split the difference? How about we grant that there is a hierarchy of designs and a distribution of quality within the designs, and that the RCT quality distribution has a higher mean than those of the observational designs?
If there is a distribution of quality within a given research design, then what determines the quality? For the sake of this Substack, I will subsume all of the factors that determine quality into a single variable I'll call "designing studies." And the principal input into designing studies within a given research design will be effort, which here will simply be measured in units of time. That is, in order to achieve a more credible design, the researcher must spend their scarce time designing the study ahead of time. I will ignore the role of "expertise" in this simplified argument by noting that expertise is itself a function of prior time and effort, since expertise is a form of human capital. So I will collapse everything into one dimension, time spent in the "design stage" of a study, such that for a given study there is a maximum amount of credibility one can achieve, and that maximum is a function of time spent designing the study.
So I will need to explain what I mean by the design stage, which I sketch out below, so that you can see my claim: for the RCT and the observational study to reach the same credibility, marked with the "credibility threshold" vertical line in the graph above, the observational study must always spend more time than the RCT in the design stage. That is, to reach the same credibility at all, the observational study will always have to spend more time in the design stage than an RCT achieving that credibility, precisely because of the ordinal ranking of the two research designs.
Why Is the RCT Placed At The Top of The Causal Inference Ladder Anyway?
So one thing I need to clear up first: what is the difference between a research design and designing a study? A research design will, for our purposes, be a type of study within the branch of statistics associated with experimental design, in which we simply compare a group of units exposed to some treatment intervention to another group of units from which that treatment intervention was withheld. Within that broad tradition, there are several research designs we could contemplate. They are:
Randomized controlled trials (RCT) in which the treatment assignment mechanism is physical randomization that was implemented by the researcher themselves.
Observational studies in which the treatment assignment mechanism was a randomized instrumental variable, not necessarily controlled by the researcher, that determined the treatment assignment of some units but not others.
Observational studies in which the treatment assignment mechanism was a variable or a collection of variables observable to the researcher but which was not itself selected by the researcher.
Observational studies in which the treatment assignment mechanism was based on a continuous or multi-valued observable variable, called a running variable, such that units with values above some threshold were treated and units below it were not (i.e., regression discontinuity).
Observational studies in which units are observed over time and treatment was assigned according to various rules that created parallel counterfactual trends in outcomes for treatment and control groups (i.e., difference-in-differences).
So the research design refers primarily to the treatment assignment mechanism and whether the researcher controls it or doesn't.
The reason the RCT is thought to offer the highest form of credibility in causal inference is not that its treatment assignment mechanism is better than any of the others, but rather that the treatment assignment mechanism is both controlled by the researcher directly and based on explicit randomization. It is not enough to rank it on randomization alone, because selection on observables, or what's called "unconfoundedness," is also based on randomization; it is just conditional randomization. That is, among groups of units with the same observable characteristics used by some external force to assign the treatment, some units in each stratum were randomly assigned to the treatment group and others were randomly assigned to the control group. Instrumental variables is also based on randomization in the first stage, and under some scenarios we can even consider regression discontinuity designs and difference-in-differences as forms of randomization. So randomization is not itself a sufficient condition for the ranking of research designs, since several observational designs ranked below the RCT are also based on randomization assumptions.
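To make "conditional randomization" concrete, here is a minimal simulation sketch (my own toy example with hypothetical strata, not drawn from any real study): treatment probability depends only on an observed covariate, so within each stratum assignment is a pure coin flip even though the treated group overall is a non-random slice of the population.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# An observed characteristic that some external force uses to assign treatment
stratum = rng.choice(["A", "B"], size=n)

# Treatment probability depends ONLY on the stratum: conditional randomization
p_treat = np.where(stratum == "A", 0.2, 0.7)
treated = rng.random(n) < p_treat

df = pd.DataFrame({"stratum": stratum, "treated": treated})

# Unconditionally, the treated group over-represents stratum B...
print(pd.crosstab(df["stratum"], df["treated"], normalize="columns"))

# ...but within each stratum, assignment is random, which is what
# "selection on observables" (unconfoundedness) asserts.
print(df.groupby("stratum")["treated"].mean())
```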
The key reason for the gold standard ranking, rather, is that the randomization, which has known statistical properties guaranteeing estimates of causal effects in large samples, was or was not controlled by the researcher. In the RCT, that randomization was controlled; in the observational study, randomization is oftentimes conjectured, or thought plausibly to be true. In other words, it's not merely the R in RCT that yields the ranking, but rather the C in RCT. And the other designs, even the ones that depend on randomization for their identification, are not usually thought to be controlled by the researcher.
But even this is not entirely true. One could easily imagine that a researcher inside an agency knows treatment was assigned based on a lottery, a running variable, or selection on observables, because the agency in which the researcher works used one of those three mechanisms for the assignment itself. In such a case, though, the study is more akin to an explicit RCT than to an "observational" study. Uber knows that it uses a surge algorithm to raise its prices, and Angrist knew that the United States military, during a particular period of the Vietnam War, used physical lotteries to draft people. In cases where the agency doing the study is the one that controlled the mechanism, the study moves away from the observational design and closer to the RCT, but for this Substack I'll abstract away from that and use the more ordinary instances.
Distribution of Study Quality By Design Strata
So, if controlling the randomization is the reason for the hierarchy that makes the RCT the gold standard in causal inference, then what determines the distribution within a research design? As I said, let's assume it is time, effort, and expertise, which I will subsume into simply time. It takes time for any study, whether an RCT or an observational study, to become higher quality. But we often think of that time differently. If it's an RCT, we think of it as the time spent designing the study ahead of time. There must be something to that, because RCTs require meticulous attention to every detail well before the data is even collected: the countless meetings with stakeholders, the selection of the workers who will oversee the surveying, the exhaustive planning for contingencies, the power calculations, the various strategies to control attrition, and so on. We know these are inputs into the quality of the RCT, and we know they happen before the data is collected, because contemporary RCTs are expected to be pre-registered, and pre-registration means submitting, not the study, but the design to a registry.
But we do not often think of the observational study as similarly having its own design stage. We do not think of researchers writing up results from an observational study as spending scarce time in a design stage, because they are not controlling the randomization, so we cannot quite envision what that might mean. The scarce time used in the observational study is therefore usually attributed either to getting the data or to writing code to analyze the data.
So let me explain what I mean by “the design stage” in an observational study. In an observational study, the “design stage” refers to all the planning and organizing that happens before collecting or analyzing outcome data. This is distinct from the analysis phase, which comes after the data is collected. The goal at the design stage is to make an observational study resemble a randomized experiment as much as possible, increasing the credibility of causal inferences.
Examples of designing a study include:
Transform the research question into a causal question: Deciding on the treatment and control conditions (i.e., the explicit intervention), the causal parameter you're targeting, and therefore what outcomes will be measured.
Specification of treatment assignment mechanisms: Reconstructing or understanding how units ended up in the treatment or control group to mimic a randomized trial.
Selecting or preparing the dataset: Ensuring that relevant covariates (variables influencing treatment and outcomes) are available and appropriately recorded.
Covariate balance checks: Ensuring that treated and control groups are balanced with respect to key covariates that determine the potential outcomes (a minimal code sketch of this step follows below).
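As a toy illustration of that last bullet, here is a minimal balance check sketch in Python (the data and variable names are hypothetical, invented purely for illustration): it computes standardized mean differences of covariates across treatment groups, a common design-stage diagnostic.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(x: pd.Series, treated: pd.Series) -> float:
    """Difference in covariate means between treated and control,
    scaled by the pooled standard deviation."""
    x1, x0 = x[treated], x[~treated]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

# Hypothetical data: covariates that plausibly drive both treatment and outcomes
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "avg_wage": rng.normal(15, 3, 500),
    "firm_size": rng.poisson(30, 500).astype(float),
})
df["treated"] = rng.random(500) < 0.4

# An |SMD| above ~0.1 is a common rule-of-thumb flag for imbalance
for col in ["avg_wage", "firm_size"]:
    smd = standardized_mean_difference(df[col], df["treated"])
    print(f"{col}: SMD = {smd:.3f}")
```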
In my experience, having taught many workshops and classes focused mainly on causal inference in observational studies, these four steps are oftentimes not clearly understood and/or not consistently followed.[1]
Step 1: Define the causal question
The first thing researchers often struggle with is placing their research question into a causal framework expressed in terms of potential outcomes. What trips people up differs by individual, but the source of confusion, in my opinion, is almost always that people learn regression almost exclusively in their first exposure to causal inference. As a result, they learn more about the properties of regressions, and the error terms in those regressions, than they learn about the causal parameter. By causal parameter I do not mean the beta coefficient multiplying the variable of interest; I mean the causal parameter expressed in terms of potential outcomes. I am talking about using the Rubin causal model to define the research question.
For most researchers, the research question is not expressed as some aggregate causal parameter, but rather more generally in terms of the intervention being studied. You will be studying the effect of gun laws on suicides, for instance, or minimum wages on employment, but neither of those is in and of itself what I mean by a "causal question." To understand what I mean, let's first review the definition of the individual causal effect:
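In potential outcomes notation (written out here since it is implied by the notation used below), the individual causal effect for a unit $i$ at a fixed point in time is

$$\delta_i = Y_i^1 - Y_i^0,$$

the contrast between that unit's potential outcome when treated and its potential outcome, at that same moment, when untreated.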
Let's take where I live as the example (i.e., Waco, Texas). And let's take today, October 17th, at 8:00am as our point in time. Last, let's imagine that the intervention we are studying is a raising of the minimum wage. Texas state minimum wage law adopts the federal minimum wage rate, which is $7.25 per hour. But let's imagine that Waco could raise its minimum wage above the federal rate, to $10.25 per hour. The individual treatment effect is then defined as a contrast between the city's employment (represented with a Y) if it was "treated" with a minimum wage today of $10.25 (or Y^1_{Waco, Today}) versus a minimum wage today of $7.25 (or Y^0_{Waco, Today}). In the Rubin causal framework, an individual treatment effect is simply a contrast between those two potential outcomes. And I express them as expectations because Waco is a collection of individual firms and workers, and if we are measuring employment for a city, we necessarily have to decide how those underlying worker employment outcomes will be summarized. Here I summarize them by taking the mean.
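Written out with those subscripts, and with the expectations averaging over Waco's underlying firms and workers, the target contrast is

$$\delta_{Waco,\,Today} = E\big[Y^1_{Waco,\,Today}\big] - E\big[Y^0_{Waco,\,Today}\big].$$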
That is sometimes the causal parameter when you are studying only one city's raising of the minimum wage. A team of researchers did exactly that for Seattle, for instance, after it raised its minimum wage. Their target parameter was the average contrast in labor force outcomes under two scenarios, a potential outcome when treated and a potential outcome when not treated, always for Seattle at the same point in time.
Well, for many people, the causal question is something along the lines of "I want to know the effect of Seattle's minimum wage on employment," but you have to be more specific than that. In other words, the research question (what is the effect of the minimum wage on employment?) and the causal question (what is the average treatment effect over a particular subset of units?) are not the same thing. In order to do a causal study, you must take your research question of interest and express it as a causal parameter, which will be some simple contrast between two average potential outcomes.
But what I observe, more often than not, is that the researcher expresses their research question, not in terms of potential outcomes based on the Rubin causal model, but as a regression model. The regression model is not the causal question; the regression model is the means by which you will attempt to estimate the causal parameter. Regressions, remember, can also be used for purely predictive purposes. In fact, the method of least squares was developed by Gauss to predict the movement of celestial bodies, not to estimate causal effects. You do not start with a regression; you start by taking your research question ("minimum wages") and transforming it into these terms:
Define the intervention of interest, which for our purposes will be a simple binary indicator representing the turning on or off of an intervention you care about. Here that has been the raising of the minimum wage at all (in our example, to $10.25). If you were interested in the effects of moving to different minimum wage levels, that would actually be a different causal parameter.
Define the outcomes you want to study. Here the outcomes are employment of workers.
Express those outcomes as potential outcomes. Here that is the potential employment of workers exposed to the raising of the minimum wage (Y^1) and the potential employment of the same workers had they not been exposed (Y^0), both at the same moment in time.
Define the population you want to study. Is it all workers? Is it the workers who live in the cities that raised the minimum wage? Is it the workers who live in the cities that did not? Each population refers to a different causal parameter. Recall that all workers have an individual treatment effect, even ones outside of Waco; even Dallas has a treatment effect, and even New York City has a treatment effect. So you can group them in different ways. You can group them by the population of cities that did in fact raise the minimum wage (like Seattle); that corresponds to the ATT, the average treatment effect on the treated cities. You can group them all together, regardless of whether they raised the minimum wage; that corresponds to the ATE, the average treatment effect. You could group them by the cities that did not raise their minimum wages, which gives the ATU, the average treatment effect on the untreated cities. And you could even group units by similar covariates, which gives what is usually called the conditional ATE, or CATE. You can define the population a lot of different ways, but the point is that you have to define it, because whatever population you study determines the causal question. Is your research question about the ATE? Or is it about the ATT? Depending on which it is, you will have to use a different estimator and potentially pursue a completely different research design. (A small simulation after this list illustrates how these estimands can differ.)
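To make that last distinction concrete, here is a minimal simulation sketch (entirely hypothetical numbers, invented to illustrate the definitions): every city has both potential outcomes, and the ATE, ATT, and ATU differ simply because we average the same individual effects over different subpopulations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1_000

# Hypothetical potential employment levels for every city:
# y0 = employment without the wage hike, y1 = employment with it.
y0 = rng.normal(100, 10, n)
# Individual effects vary with baseline strength: stronger labor
# markets (higher y0) suffer smaller disemployment effects.
ite = -2 + 0.3 * (y0 - 100) + rng.normal(0, 2, n)
y1 = y0 + ite

# Cities with stronger labor markets are also more likely to raise wages,
# so treated cities are not a random sample of all cities.
treated = rng.random(n) < 1 / (1 + np.exp(-(y0 - 100) / 5))

df = pd.DataFrame({"ite": ite, "treated": treated})
print("ATE:", df["ite"].mean())                      # all cities
print("ATT:", df.loc[df["treated"], "ite"].mean())   # treated cities only
print("ATU:", df.loc[~df["treated"], "ite"].mean())  # untreated cities only
```

Because the treated cities differ systematically in their baseline labor markets, the three averages diverge, which is exactly why naming the target population is part of defining the causal question.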
Conclusion for now
So, what I have done here is introduce readers to the gold standard metaphor, arguing that it refers to controlled randomization and not merely the randomization itself. I have also said that even within a given research design, which I argue is a reference to how the treatment assignment mechanism operated and whether the researcher did or did not control it, there are studies of differing credibility. For simplicity, I proposed that credibility is determined by time and effort, collapsed mainly into time. And then I introduced you to the idea of a multi-dimensional design stage, which I hope over a series of essays you will come to see as entirely distinct from "collecting the data" and "analyzing the data." It is, rather, a period in which one is not looking at the outcome data at all.
I am stopping here because this is already long, and in my experience, longer essays are less useful to readers than ones I break up. In the coming Substacks, I'll do a few things. I'll elaborate on the rest of the design stage steps listed in the bullets above, and then I will try to explain my logic that in order for an observational study to be as credible as an RCT, it must spend more time in the design stage than the RCT spent in its design stage. So stay tuned!
[1] I owe my clarity on this point to the excellent article by Don Rubin, "Design Trumps Analysis," which I highly recommend.