Yesterday students turned in their final papers for the semester for my history of economic thought class, and I would like to share what I learned about how I can almost detect whether ChatGPT wrote the paper. Details have been changed to protect the guilty.
This particular assignment required students to write an essay connecting the way classical economists wrote about labor to the way the labor economists at the Princeton Industrial Relations Section did. I tried to keep it open-ended. To help you understand how I tried to prepare them for it, let me share the two weekly assignments they did all semester.
AI critiques (“crits”) of classical writers.
Every other week, the students had to use ChatGPT for three things and then write a response essay. The three things were:
1. Copy and paste a classical economist’s writing (a paragraph or two) into ChatGPT-4o and ask it to summarize it in two paragraphs with four sentences per paragraph. The purpose was to force them to witness its summarizing capabilities.
2. Copy and paste the original material and the summary into ChatGPT-4o again, in a separate chat window, and ask it to write a critique of the writer, including two references/citations to support the critique.
3. Do it all again, this time using the o1 model.
They then turned that in to Canvas as a formatted, easy-to-read document.
The second part of the assignment was this:
1. Check and score on a “hallucination scale” whether the citations were real and relevant (worth 2 points), real but irrelevant (1 point), or fabricated (0 points). Do this for ChatGPT-4o, then for o1.
2. Score, on a scale of 1 to 5 (1 being not good, 5 being very good), how accurate the summary was, and then do the same for how creative the critique was.
3. Write a response comparing the summary with the original and the critique with the original writer, and explaining what they learned.
“Real and relevant” meant it was an actual reference and the reference fit. They were supposed to check, and if a cite was impossible to check (and it often is impractically impossible), ChatGPT would only get 1 point for it. A “real” cite existed; a “relevant” one both supported the point being made and was possible to check. Not all cites can be checked. Why? You cannot check Marx (1867). It’s a thousand-page book. Where in there is the material? What are you looking for?
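In case it helps to see the rubric spelled out, here is a minimal sketch of the scoring rule written as code. The function and argument names are my own hypothetical framing; students applied the rubric by hand, not with a script.

```python
# A minimal sketch of the "hallucination scale" scoring rule.
# Students applied it by hand; this just makes the rubric explicit.

def score_citation(exists: bool, relevant: bool, checkable: bool) -> int:
    """Return 2, 1, or 0 points for a single citation."""
    if not exists:
        return 0  # fabricated citation
    if relevant and checkable:
        return 2  # real, supports the point, and possible to verify
    return 1  # real but irrelevant, or impractical to check

# Example: "Marx (1867)" with no chapter or page number is real,
# but effectively uncheckable, so it earns only 1 point.
print(score_citation(exists=True, relevant=True, checkable=False))  # 1
```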
And basically, what students found on their own was three things:
1. ChatGPT summarizes very well.
2. ChatGPT-4o hallucinates references quite often.
3. But o1 typically doesn’t hallucinate, and its cites are easier to check (chapter and page numbers) and make sense.
And they learned there were ways to make the citations easier and more controlled. You can simply tell o1 you need three cites, with detailed summaries, titles, and author and journal information, but you can also just explain that the material must be realistic to retrieve and investigate, and then spell out your constraints like this:
I can only use the internet to double-check, so the material must be accessible online.
I have a life, so don’t just tell me to look at “Adam Smith (1776).” Look where? Be specific and explain why you’re making this suggestion.
And so forth. The point was for them to personally experience the things it does well, the things it doesn’t do well, the things that almost cannot be overcome even with effort, and the relative performance of the two LLMs.
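For what it’s worth, those constraints can be packaged into a reusable prompt. Here’s a hypothetical sketch using the openai Python library; the prompt wording is illustrative, not the exact instructions my students used.

```python
# Hypothetical sketch: baking the citation constraints into one prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

CONSTRAINTS = """Write a two-paragraph critique of the passage below,
with 3 citations. For each citation give the author, title, journal,
and year, plus chapter and page numbers. Every citation must be
checkable online. Do not just say "Adam Smith (1776)" -- say where to
look and explain why you are suggesting that source."""

passage = "..."  # paste the classical economist's paragraph here

response = client.chat.completions.create(
    model="o1",
    messages=[{"role": "user", "content": CONSTRAINTS + "\n\n" + passage}],
)
print(response.choices[0].message.content)
```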
The second weekly assignment: listen to 7 of my podcast interviews with Princeton Industrial Relations Section labor economists, plus Richard Freeman and David Autor, and write a response essay, with some guidance as to what I wanted.
The hope was that they couldn’t use AI for the podcast assignment since it was an audio thing. Obviously, there are AI tools for that too. An easy way would be to use otter.ai to get a transcript and then load the transcript into your favorite LLM. But my hope was that they wouldn’t do that.
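And to be clear about how low the barrier is, the whole workaround fits in a few lines. A sketch, assuming you’ve already exported a transcript from otter.ai as a text file (the file name is made up):

```python
# Sketch of the transcript-to-LLM shortcut described above.
from openai import OpenAI

client = OpenAI()

with open("episode_transcript.txt") as f:  # hypothetical otter.ai export
    transcript = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Write a response essay on this podcast episode:\n\n"
                   + transcript,
    }],
)
print(response.choices[0].message.content)
```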
The crit assignment ran one week on, one week off, and the podcast assignment filled the alternating weeks.
The final paper, as I said, was to write a paper with an original point of view connecting the classical economists’ writing about labor to Princeton, using the podcasts, the Nobel lectures by Card and Angrist, a couple of other summary papers, and our readings on the classical writers through the semester. I gave more detailed instructions than that, but that’s the gist.
Things I observed about the final papers.
First, the broad observation. It’s pretty easy for an experienced professor to see that something has changed across semesters. And the easiest way to explain it is that the previously normal distribution in quality has stopped being normal.
Specifically, the right tail of the quality distribution seems to be the same. The best papers are pretty similar. They have a point of view, use evidence appropriately, have a thesis, and bear the particular distinguishing marks of a creative paper. Just kind of college-student-type things.
What’s different is the left tail. The worst papers aren’t as bad. So that’s one thing. But it’s also like you can sense that the variance of the papers has shrunk, and I don’t just mean mechanically because the worst papers aren’t as bad. It’s more like a lot of the counterfactual mediocre and bad papers aren’t just better; they are similar in some way that is difficult for me to pin down.
So that’s one thing. But then here’s the odd discovery I made. You can kind of tell which papers weren’t written by AI because they actually aren’t “perfect,” and there are telling signs of it. For one, the papers will have extremely long paragraphs; a single paragraph might even run four pages. ChatGPT would never let you do that, which is precisely how it became more and more obvious to me that these students had complied with the AI policy I’d laid out in class.
And the other thing was that those essays with the long paragraphs were also the essays where the student had a thesis and a point of view. It’s almost like if you write a paper yourself, where you need to support your arguments with evidence, you end up with a point of view. You may not have a thesis sentence in the first paragraph or use topic sentences, but still, the paper shows signs of a student trying to make their own arguments.
And then finally, the AI-suspected papers tended to have structured paragraphs, sounded fine, and were coherent, but instead of a thesis and a point of view, the papers were just long lists of summaries turned into paragraphs, which again is part of what ChatGPT does by default. It lists and summarizes, and unless you know what the purpose of writing is, you can’t see the problem.
It’s not that ChatGPT couldn’t be used to write an excellent essay with a thesis and a point of view; it’s that such an effort usually requires, ex ante, an awareness of that goal and a decision to pursue it. So that was the thing.
But the task was to catch someone who violated the AI policy, and the earlier point isn’t really something that lets you say with certainty that AI wrote a paper. Not for me this semester, anyway. Maybe in the long run, but I think it would take more intentional design decisions in building the assignments to really know what to do, what the learning goal was, and what output you wanted. I’ll be thinking on this for a while.
No, I caught a few cases because of the one thing I had been trying to get across to them all semester: hallucinated references. Weirdly enough, a few papers kept producing versions of a single reference, but never quite the same one. I’ll make it up, but it was a cite like this:
“blah blah blah (Samuelson and Becker 2021)”
And then you’d go to the references and it would say:
Samuelson, Paul and Gary Becker (2021), “Transition from Classical to Neoclassical Theories of Labor”, American Economic Review, 4(4).
Or it would be this one:
Samuelson, Paul and Gary Becker (2021), “Transition from Classical to Neoclassical Theories of Labor: a Tale of Two Cities”, NBER Working Paper no. 23124, doi:XXXX
A specific pairing of authors, in other words, would be cited with impossible dates, in combinations I just happened to know weren’t real. And when I looked up the reference, either way, it didn’t exist.
The other thing, which was a bit more egregious, was a lengthy discussion of two podcast episodes that didn’t exist: one with an unnamed person who is no longer alive, and another with someone I’ve been unsuccessful at getting on the show.
I won’t share the penalty, but taken together it was like this: some papers were excellent, and some papers were great and clearly written by a human, detectable because they were flawed and had a point of view. Almost always they were detectable because of the very-long-paragraph problem.
It was almost like the Turing test for a student paper was that it did the stuff college students do, like write very long paragraphs with a point of view.
And then there were two other kinds of papers. First, well-written papers without a thesis, with no point of view, just a series of paragraphs summarizing things. Nothing could be done about those, and honestly I’m not even sure what to say, because I think college students most likely don’t necessarily know they’re supposed to have a thesis. And in economics, housed as we are in the business school, I suspect the human capital associated with writing essays is even more limited. There are papers they write in econometrics classes, but I’m not sure that’s quite the same as this one.
Second, there were the “fabricated cites” papers. That mistake is easily fixable, and it’s probably a cohort effect that it was even detectable at all. My hunch is that won’t be the case much longer. They’ll learn, they’ll figure out a better system that avoids fabricated cites, and then that too will be undetectable.
But then what is the big-picture lesson I learned? For now, I won’t be giving writing assignments outside of class. That’s the simple solution. I won’t put them in that position.
AI allows you to complete learning tasks with zero time input. But the thing is, as an educator I know that homework and final papers were invented endogenously, given an educational goal and the existing technology. The goal was never to produce homework, in other words. There is nothing socially valuable about a completed problem set.
No, the point of learning tasks like homework and papers was to provide students with a task that required time, through which their own human capital in the subject would grow. Human capital in a subject was a function of time use, in other words, and of a particular type of time use. If you automate the assignments using an LLM, you are substituting away from learning.
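If you wanted to write that claim down formally, here is a back-of-the-envelope version; the notation is my own illustration, not a model from the literature:

```latex
% H = human capital in the subject, t = time allocated to the task,
% a in [0,1] = share of the task automated with an LLM.
H = f\big((1 - a)\,t\big), \qquad f' > 0
% Holding t fixed, raising a shrinks effective study time (1-a)t,
% and therefore shrinks H. Automation substitutes away from learning.
```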
So a professor must decide: is that their intended goal? Because if it is, then by all means give assignments like that. But if it is not, then don’t. Learning cannot be automated. This isn’t The Matrix; we can’t (not yet, anyway) lie down in a chair, ram a rod into our brain, and just download knowledge. It still requires intensive, deliberately allocated time.
Ironically, David Ricardo updated the third edition of his main book to say that machines could theoretically cause GDP to decrease, that the effect was ambiguous. He didn’t say “GDP,” as they had no accounting measure of aggregate output back then, but that’s what we’d understand him to be saying.
On a micro scale, I think that every time a student substitutes away from time spent learning using present AI technology, the feeling that they are learning more is an illusion. It’s a mirage. My personal belief is that substitution away from time spent in intensive studying remains human-capital reducing. Does it have to be? Obviously not. But is it, probably? Yes, probably.
I’m not disappointed, but only because I refuse to let myself be disappointed. I will most likely be shifting toward things that simply force them to learn, and that probably means higher-stakes in-class exams, maybe even oral exams. I may even have them go to the whiteboard and solve constrained optimization problems amid ashtrays and a thick fog of cigarette and cigar smoke, just like the old days.
Identifying an essay written with AI tools like ChatGPT can be challenging but not impossible. AI-generated essays often exhibit overly formal and consistent styles, lacking the personal voice and varied sentence structures typical of human writers. They may provide surface-level analysis with an abundance of examples that are not deeply explored or connected. Additionally, these essays can be overly polished, with perfect grammar and structure, which might feel unnatural if the author usually makes minor errors.
Repetition and patterned language are common markers, as AI tends to reuse phrases or ideas. Furthermore, AI might include inaccuracies or fabricated details that sound plausible but do not hold up under scrutiny. Such essays also tend to lack emotional depth or a personal touch, making them feel detached and impersonal.
To identify AI involvement, tools like AI detectors and plagiarism checkers can help, as can discussions with the author to test their knowledge of the essay’s content. While no method is foolproof, combining these strategies can often reveal whether an essay is AI-generated.
See what I did there....