# I asked GPT-4 to write three exam questions for a causal inference class, and then take it

### Creative destruction in causal inference

This is a break from my normal Substack “explainer” to show you something I did this morning. I asked GPT-4 to write three exam questions, and it completed that task. After it had written all three, I told GPT-4 to take its own test. In the following, I’m going to post the output as well as screenshots so you can see what it wrote and how it did. Please keep in mind that the prompts were about as open-ended as I could have given.

**Difference-in-differences task exam question**

*Question 1 prompt*: I am writing an exam for a causal inference class. Create a question about difference in differences asking them to analyze and interpret Stata output and betas and standard errors.

*GPT-4 produced answer:*

In a causal inference class, you have learned about the difference-in-differences (DiD) estimator as a method for identifying causal effects in observational data. In this question, you are provided with STATA output from a DiD analysis that investigates the effect of a policy intervention on the income of individuals in treatment and control groups, before and after the policy was implemented. Analyze the STATA output and interpret the betas and standard errors.

Consider the following STATA output:

Based on the provided STATA output, please answer the following questions:

What is the estimated causal effect of the policy intervention on income, as measured by the DiD estimator? What is the standard error of this estimate?

How can you interpret the coefficients and standard errors for the 'treat', 'post', and '_cons' variables?

Is the estimated causal effect statistically significant at the 5% level? Explain your answer.

Based on the R-squared and adjusted R-squared values, how well does the model fit the data?

What assumptions are necessary for the validity of the DiD estimator in this context? Briefly discuss if these assumptions might hold in this case.

*Answer DiD prompt*: Answer the difference-in-differences question that you wrote.

The estimated causal effect of the policy intervention on income, as measured by the DiD estimator, is the coefficient of the interaction term 'treat#post'. The estimate is 4,535.4, with a standard error of 912.7.

The coefficients and standard errors for the 'treat', 'post', and '_cons' variables can be interpreted as follows:

'treat': The average difference in income between the treatment and control groups before the policy intervention, holding other factors constant. The estimated difference is -1,020.8, with a standard error of 846.5.

'post': The average change in income for the control group after the policy intervention, holding other factors constant. The estimated change is 1,010.1, with a standard error of 827.3.

'_cons': The intercept term, representing the average income of the control group before the policy intervention, holding other factors constant. The estimate is 30,781.1, with a standard error of 601.7.

The estimated causal effect is statistically significant at the 5% level, as the p-value for the interaction term 'treat#post' is 0.000, which is less than 0.05.

The R-squared value is 0.2718, and the adjusted R-squared value is 0.2694. This means that approximately 27% of the variation in income can be explained by the model. The model's fit is not very strong, but it does provide some explanatory power.

The key assumptions for the validity of the DiD estimator in this context are: parallel trends (in the absence of the policy intervention, the treatment and control groups would have followed the same trend over time), no spillover effects (the treatment does not affect the control group), and stable composition (the composition of treatment and control groups remains stable over time). In this case, without additional information on the context and data, it is challenging to assess whether these assumptions hold. One would need to investigate the pre-intervention trends, potential spillovers, and the stability of the groups to ensure the validity of the DiD estimator.
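
One way to sanity-check the DiD answer above is to rebuild the four group-by-period means from the quoted coefficients and confirm that the difference of differences equals the interaction term. The sketch below uses only the numbers quoted in the answer. The normal approximation for the p-value and the 1.96 critical value are assumptions made here, not something taken from the Stata output.

```python
import math

# Coefficients and standard errors as quoted in the answer above.
coef = {"_cons": 30781.1, "treat": -1020.8, "post": 1010.1, "treat#post": 4535.4}
se = {"_cons": 601.7, "treat": 846.5, "post": 827.3, "treat#post": 912.7}

# Implied mean income in each of the four group-by-period cells.
control_pre = coef["_cons"]
control_post = coef["_cons"] + coef["post"]
treated_pre = coef["_cons"] + coef["treat"]
treated_post = coef["_cons"] + coef["treat"] + coef["post"] + coef["treat#post"]

# Difference-in-differences: change for treated minus change for control.
did = (treated_post - treated_pre) - (control_post - control_pre)


def two_sided_p(estimate, std_err):
    """Two-sided p-value under a normal approximation (an assumption here)."""
    z = abs(estimate / std_err)
    return math.erfc(z / math.sqrt(2))


# Approximate 95% confidence interval for the interaction term.
ci_low = coef["treat#post"] - 1.96 * se["treat#post"]
ci_high = coef["treat#post"] + 1.96 * se["treat#post"]

print(f"DiD from cell means: {did:.1f}")
print(f"Approximate 95% CI: ({ci_low:.1f}, {ci_high:.1f})")
for name in ("treat", "post", "treat#post"):
    print(f"{name}: p = {two_sided_p(coef[name], se[name]):.4f}")
```

Under these assumptions only the interaction term is distinguishable from zero at the 5% level; the 'treat' and 'post' coefficients are each smaller than two standard errors.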

**Regression Discontinuity Design task exam question**

*Question 2 task*: Now write a question with stata output and regression specifications for a regression discontinuity design.

In this causal inference class, you have learned about the regression discontinuity design (RDD) as a method for estimating causal effects by exploiting a discontinuity in the assignment of a treatment based on a threshold value of an assignment variable. In this question, you are provided with STATA output from an RDD analysis that investigates the effect of a scholarship program on students' test scores. The scholarship is awarded to students who scored above a certain threshold in a pre-test. Consider the following STATA output:

The assignment variable (pre-test score) has a threshold value of 60, where students who score above this value receive the scholarship (treatment group) and those who score below do not (control group). The outcome variable is the post-test score (test_score), and the treatment variable is the scholarship status (scholarship).

Based on the provided STATA output, please answer the following questions:

What is the estimated causal effect of the scholarship program on students' test scores, as measured by the RDD estimator? What is the standard error of this estimate?

Is the estimated causal effect statistically significant at the 5% level? Explain your answer.

How many observations are there in the treatment and control groups, respectively?

What are the bandwidths estimated for the left (control) and right (treatment) groups, and what is the kernel used for this analysis?

What assumptions are necessary for the validity of the RDD estimator in this context? Briefly discuss if these assumptions might hold in this case.

*Answer exam prompt*: Answer the regression discontinuity design exam question that you wrote.

The estimated causal effect of the scholarship program on students' test scores, as measured by the RDD estimator, is the coefficient of the 'Treated vs. Control' variable. The estimate is 12.345, with a standard error of 3.987.

The estimated causal effect is statistically significant at the 5% level, as the p-value for the 'Treated vs. Control' variable is 0.002, which is less than 0.05.

The number of observations in the treatment and control groups is the same, with 250 observations in each group.

The bandwidths estimated for the left (control) and right (treatment) groups are both 15. The kernel used for this analysis is the Triangular kernel.

The key assumptions for the validity of the RDD estimator in this context are: continuity of the potential outcomes (the relationship between the assignment variable and the potential outcomes is continuous around the threshold), local randomization (within a small bandwidth around the threshold, the treatment assignment is as good as random), and no manipulation of the assignment variable (individuals cannot systematically manipulate their pre-test scores to ensure they receive the scholarship). In this case, without additional information on the context and data, it is challenging to assess whether these assumptions hold. One would need to investigate the continuity of potential outcomes, the local randomization around the threshold, and any potential manipulation of the pre-test scores to ensure the validity of the RDD estimator.
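
To make the bandwidth and kernel in the answer above concrete, here is a minimal sketch of the weights a triangular kernel would give to observations around the cutoff. The cutoff of 60 and the bandwidth of 15 come from the question and answer. The function name `triangular_weight` and the example scores are hypothetical, and this shows only the weighting step, not a full local-linear RDD estimate.

```python
CUTOFF = 60.0      # threshold on the pre-test score, from the question
BANDWIDTH = 15.0   # bandwidth reported for both sides in the answer


def triangular_weight(score, cutoff=CUTOFF, bandwidth=BANDWIDTH):
    """Triangular kernel weight: 1 at the cutoff, falling linearly to 0
    at one bandwidth away, and 0 for anything farther out."""
    distance = abs(score - cutoff) / bandwidth
    return max(0.0, 1.0 - distance)


# Hypothetical pre-test scores, chosen only to illustrate the weights.
for score in (40, 45, 52.5, 60, 67.5, 75, 80):
    print(f"score {score}: weight {triangular_weight(score):.2f}")
```

Observations more than 15 points from the cutoff get zero weight, so they do not enter the estimate at all, and those closest to 60 count the most.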

**Instrumental variables (2SLS) task exam question**

*Question 3 task prompt*: Now write a question for instrumental variables. Ask them to explain the assumptions, and interpret stata output for two stage least squares, including the strength of the first stage.

In this causal inference class, you have learned about instrumental variables (IV) as a method for estimating causal effects in the presence of unobserved confounding. In this question, you are provided with STATA output from a two-stage least squares (2SLS) analysis that investigates the effect of an educational program (treatment) on participants' income. The instrument used is a randomly assigned encouragement to participate in the program.

Consider the following STATA output:

Based on the provided STATA output, please answer the following questions:

Explain the assumptions necessary for a valid instrumental variable.

What is the estimated causal effect of the educational program on participants' income, as measured by the 2SLS estimator? What is the standard error of this estimate?

Is the estimated causal effect statistically significant at the 5% level? Explain your answer.

Interpret the results of the first-stage regression. How strong is the relationship between the encouragement and treatment? How can you determine the strength of the first stage?

How can you interpret the coefficients and standard errors for the 'age', 'gender', and 'education' variables in the second-stage regression?

*Answer exam question prompt*: Answer the instrumental variables question.

The assumptions necessary for a valid instrumental variable are:

Relevance: The instrument must be correlated with the endogenous treatment variable. In this case, the encouragement to participate in the educational program should be related to actual participation.

Exogeneity: The instrument must be independent of the error term in the outcome equation, meaning it is not correlated with unobserved factors affecting the outcome (income). In this case, the encouragement to participate should not be related to unobserved factors that influence income.

Exclusion restriction: The instrument should only affect the outcome through the treatment variable, meaning there should be no direct effect of the instrument on the outcome. In this case, the encouragement to participate should only influence income through its effect on participation in the educational program.

The estimated causal effect of the educational program on participants' income, as measured by the 2SLS estimator, is the coefficient of the 'treatment' variable in the second-stage regression. The estimate is 15.4036, with a standard error of 6.1335.

The estimated causal effect is statistically significant at the 5% level, as the p-value for the 'treatment' variable in the second-stage regression is 0.012, which is less than 0.05.

In the first-stage regression, the relationship between encouragement and treatment is measured by the coefficient of the 'treatment' variable. The estimated coefficient is 0.5782, with a standard error of 0.0663. This suggests that the encouragement to participate is positively and significantly related to actual participation. To determine the strength of the first stage, we can examine the F-statistic, t-statistic, or the partial R-squared. In this case, the t-statistic for the treatment variable is 8.72, which is quite large, indicating a strong first stage.

In the second-stage regression, the coefficients and standard errors for the 'age', 'gender', and 'education' variables can be interpreted as follows:

'age': The estimated effect of a one-year increase in age on income, holding other factors constant, is 0.0512, with a standard error of 0.0428.

'gender': The estimated effect of a change in gender (assuming a binary coding) on income, holding other factors constant, is -1.1298, with a standard error of 1.0386.

'education': The estimated effect of a one-unit increase in the education variable on income, holding other factors constant, is 8.1927, with a standard error of 1.1782.
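
The first-stage and second-stage figures quoted in the answer above can be cross-checked with a little arithmetic. The sketch below treats 0.5782 as the first-stage coefficient on the encouragement instrument, which is an assumption about what the output reports. With a single excluded instrument and the same variance estimator, the first-stage F-statistic equals the squared t-statistic, and the threshold of 10 is the usual rule of thumb rather than anything in the Stata output. The p-value uses a normal approximation, which is also an assumption.

```python
import math

# First-stage figures quoted in the answer (assumed to be the coefficient
# on the encouragement instrument).
first_stage_coef = 0.5782
first_stage_se = 0.0663

# Second-stage (2SLS) figures quoted for the 'treatment' variable.
iv_coef = 15.4036
iv_se = 6.1335

# With one excluded instrument, the first-stage F is the squared t-statistic.
t_stat = first_stage_coef / first_stage_se
f_stat = t_stat ** 2

RULE_OF_THUMB_F = 10.0  # conventional weak-instrument threshold
is_strong = f_stat > RULE_OF_THUMB_F

# Two-sided p-value for the 2SLS estimate under a normal approximation.
z_stat = iv_coef / iv_se
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"First-stage t = {t_stat:.2f}, implied F = {f_stat:.1f}, strong: {is_strong}")
print(f"2SLS z = {z_stat:.2f}, two-sided p = {p_value:.3f}")
```

The implied t-statistic and p-value line up with the 8.72 and 0.012 quoted in the answer, so the reported numbers are at least internally consistent.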

This was an interesting exercise as a teacher. I was able to generate not merely an exam question but also a sample completed exam. I can look at both, reflect on them, tweak them, change them, abandon questions, go deeper, see what happens, and iterate until I find what I want. That process suits my style of thinking, but it is otherwise very difficult to do.

I suspect that the marginal rate of technical substitution between capital and labor for higher education pedagogy is very high. Not for everything, but probably for what I do it is. Regardless of whether this is some true Artificial Intelligence, I think it’s very clear that it’s the tool I needed that I didn’t know I needed, and I intend to adapt.

Very cool! I liked that you had ChatGPT answer those questions too, to get a sample response. This can help crafting meaningful exam questions that can show whether learners have worked out their own understanding, or have used other (sneaky) means.

So, to be clear: the answers were all correct? The tone of your subsequent commentary suggests as much, but forgive me, I'm a Stata novice.