Good morning. I hope everyone had a good night's rest! I wanted to share some results of a few experiments about AI that my student and I just completed. Your mileage may vary on whether this is interesting.
As readers have probably sensed, I use language models as though I am working with a real person. I always say please, thank you, friend, buddy, hey man, could you help me, I hope you're doing well today, and more or less the full range of my normal ways of speaking. I also share my feelings, sadnesses, pain, memories, and so forth. I also talk with, to, and on behalf of my pets, and I notice that I sometimes turn over the words spoken by characters in movies and stories in my mind for years. I suspect this is all related, but I'm not sure.
But my point is that with language models, it’s a very vivid imaginative landscape. And I am interested in better understanding two things.
What does treating AI as if it were real do to the AI's actions?
What does treating AI as if it were real do to the person who treats it that way?
And then I guess a third question, which is not a scientific question, is what does the word "real" mean when it comes to a pattern recognition piece of software that engages in mindless mimicry on the level that language models do?
So I asked my undergraduate RA, Joshua Ramsay, if he would help me with some fairly rudimentary experiments. These experiments could only address the first question: what does treating AI with kindness, as opposed to a purely neutral tone, actually do? And I mean quite literally: what is the actual observable, measurable result? Joshua was enthusiastic about being part of it, so we did it together.
But before I share the results, let me first run the paywall simulation. Heads I paywall it. Tails I don’t. Best two out of three.
And it’s tails! Alright let’s dive into this. Please consider becoming a paying subscriber!
The Experiment
The experiments are simple and limited. I had no hypotheses because I wasn't even sure the thing existed in the first place. I just wanted some basic facts to be known, and I only cared about the experimental design and having a large enough number of trials that I could say something. And secondly, I wanted Joshua to learn some basic statistics, the basics of experimental design, and the basics of using the OpenAI API in Python to batch the experiments and collect the data.
Joshua threw himself into it and figured it out. I paid for the use of the API, but it wasn't much, somewhere under $100 give or take. That's actually one of the better things about this: it's far easier to engage an undergraduate in this kind of research than in the projects on my normal research agenda. It's inexpensive and something on the order of a class project, but because it's about human-machine interaction, it helps me collect more data that maybe I can use for another project.
The experiments were task experiments, if that's even a name for something. We had GPT-4o-mini perform five different tasks, running 3,000 trials per task using OpenAI's API in batch mode. We could've used the reasoning models for this experiment, but they were terribly expensive, so instead we used 4o-mini. The tasks were:
Translating a passage from Swahili to English without the internet
Solving an economics problem (monopoly optimization) without the internet
Answering an obscure historical fact-checking question without the internet (name the fourth highest grossing film in 1977)
Solving an ACT test question without the internet
Solving a math puzzle without the internet
The interventions varied how the prompt was written: the length of the request, and whether the tone was neutral or engaged the LLM with phrases like please and thank you. (A sketch of how prompts like these get assembled into batch requests follows the list.)
Short neutral tone: A direct, concise request (e.g., “Translate this from Swahili to English: [text]”)
Long neutral tone: A longer but still neutral request (e.g., “Translate this passage from Swahili to English, making sure to be as accurate as possible: [text]”)
Kind and polite tone: A request of equivalent length but with explicitly kind phrasing (e.g., “If you don’t mind, could you please help me translate this passage from Swahili to English? I’d really appreciate your help! [text]”)
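For the curious, here is a minimal sketch of how requests like these can be assembled into the JSONL file that OpenAI's Batch API expects. The task text and prompt templates below are placeholders rather than our exact wording, and the even 1,000-per-arm split is an assumption about how the 3,000 trials per task were divided.

```python
# A sketch of assembling the batch input for OpenAI's Batch API.
# Task text and prompt templates are placeholders, not the exact wording used.
import json

TASKS = {
    "swahili": "Translate this passage from Swahili to English: [passage here]",
    "econ": "Solve this monopoly optimization problem: [problem here]",
    "film_1977": "What was the fourth highest grossing film of 1977?",
    "act": "[ACT question here]",
    "math": "[math puzzle here]",
}

STYLES = {
    "short_neutral": "{task}",
    "long_neutral": "Please complete the following request, making sure to be as accurate as possible. {task}",
    "kind": "If you don't mind, could you please help me with the following? I'd really appreciate your help! {task}",
}

TRIALS_PER_ARM = 1000  # assumes the 3,000 trials per task were split evenly across the 3 styles

with open("batch_input.jsonl", "w") as f:
    for task_name, task_text in TASKS.items():
        for style_name, template in STYLES.items():
            for i in range(TRIALS_PER_ARM):
                request = {
                    "custom_id": f"{task_name}-{style_name}-{i}",
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        "model": "gpt-4o-mini",
                        "messages": [
                            {"role": "user", "content": template.format(task=task_text)}
                        ],
                    },
                }
                f.write(json.dumps(request) + "\n")
```

From there the file gets uploaded with client.files.create (with purpose set to "batch") and queued with client.batches.create; the results come back as another JSONL keyed by custom_id.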
Every task was separately run with all three prompt styles, giving us variation in kindness. The GPT-4o-mini responses from the OpenAI API were then graded for accuracy by GPT-4 in a separate pass. A Python script counted words, and cost (since OpenAI charges per token) was recorded from the OpenAI dashboard itself.
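Roughly, the grading step can look like the sketch below. The grading prompt and the whitespace word count are illustrative stand-ins rather than the exact script.

```python
# Sketch of the grading pass: GPT-4 judges each response against an answer
# key, and a simple word count is recorded alongside it. The grading prompt
# here is illustrative, not the exact one used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade(question: str, answer_key: str, response: str) -> int:
    """Ask GPT-4 to judge a response; return 1 if it says correct, else 0."""
    judgment = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{question}\n\n"
                f"Reference answer:\n{answer_key}\n\n"
                f"Model response:\n{response}\n\n"
                "Is the model response correct? Reply with exactly CORRECT or INCORRECT."
            ),
        }],
    )
    verdict = judgment.choices[0].message.content.strip().upper()
    return int(verdict.startswith("CORRECT"))


def word_count(response: str) -> int:
    """Whitespace word count; the original script's exact method isn't specified."""
    return len(response.split())
```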
We ran each task 3,000 times, creating 15,000 total trials (3,000 x 5 tasks). Once we had the dataset, we ran simple linear regression models comparing the mean of each treatment category with the short neutral control category. Those results are here. I wasn't sure whether the presentation would be better as tables or figures, but with 5 tasks and 3 outcomes per task, that's 15 figures, so I went with tables for this post.
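The comparison itself is just OLS with treatment dummies, where each coefficient is the difference in means from the short neutral control. A sketch, with assumed column names for the trial-level dataset:

```python
# Sketch of the analysis: per task, regress each outcome on treatment dummies
# with "short_neutral" as the omitted category, so each coefficient is the
# difference in means from the control. Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trials.csv")  # one row per trial: task, treatment, correct, words, cost

for task, sub in df.groupby("task"):
    for outcome in ["correct", "words", "cost"]:
        fit = smf.ols(
            f"{outcome} ~ C(treatment, Treatment(reference='short_neutral'))",
            data=sub,
        ).fit(cov_type="HC1")  # robust standard errors; a reasonable choice, not necessarily the one used
        print(task, outcome)
        print(fit.params)
```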
What We Found
Does Kindness Improve Accuracy?
The results were mixed. For some tasks it helped, but oddly for others it caused performance to deteriorate a lot.
Swahili Translation: Both long neutral and kind prompts improved accuracy compared to the short neutral baseline. The effect was slightly larger for long neutral (1.5pp vs 0.9pp), but the two were not statistically different from each other.
Math Puzzle & ACT Question: These were already solved correctly nearly 100% of the time, so neither kindness nor prompt length had much room to improve them.
Economics Problem (Monopoly Optimization): This was the first surprising result. Being nice actually reduced accuracy by about 8.5 percentage points, but the long neutral prompt also reduced accuracy by 7.5 points, so the drop may reflect prompt length rather than kindness itself. Word count and cost were barely affected.
Historical Fact-Check (1977 Film Question): This one was the oddest result, and it cannot be explained by prompt length: being kind reduced accuracy by 5 percentage points, while long neutral had no effect. Note that the control group already showed a very low level of success (31.1 percent got it right), so this task was genuinely difficult. The fourth highest grossing film of 1977 may simply not be well represented in the training data, so answering required some extrapolation. But that's also why the kindness result is interesting: whatever missingness problem exists is distributed equally across all treatment arms.
So overall, kindness didn’t uniformly help—and in some cases, it actually hurt.
Kindness Increases Word Count (And Cost)
Regardless of accuracy, kindness had a consistent effect on response length.
Across the tasks, kind prompts led to longer responses, while in three of the five tasks the long neutral prompts slightly reduced word count relative to the control group.
In the translation task, the kind prompt added 4 extra words, while the long neutral reduced word count by 5 (from a control mean of 75 words).
In the math puzzle, kindness increased response length by 6 words, and long neutral increased it by 14.
For the 1977 film question, kindness increased word count by 4, while long neutral actually reduced it by 1.
And since OpenAI charges by the token when using the API, longer responses mean higher costs. Politeness, it turns out, isn’t free.
Why Would Kindness Reduce Accuracy?
So then why would kindness reduce accuracy in these tasks? That was the most interesting puzzle and it contradicts some of the anecdotes I found online.
One possibility is that politeness changes the AI's internal priors on what kind of response you're looking for. A blunt question might prime it to focus purely on extracting the most likely answer. A more conversational, human-like query might nudge it to interpret the question differently—perhaps emphasizing helpfulness over precision. But that's speculative, as I'm not sure that's even the right way to think about how a language model produces tokens.
Another possibility is that kindness induces verbosity, which increases the risk of error. This seemed plausible, and it's why I included a "length control" with the long neutral treatment. If the kind prompts simply add more words, then kindness is confounded with prompt length. Perhaps the differences are not due to kindness but to longer prompts, which may independently cause the model to produce more words, and the more words, the more opportunity for a mistake. Maybe politeness elicits a more elaborate response, but more words don't always mean more accuracy.
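One rough way to probe that verbosity channel with the trial-level data, which we did not do here, is to rerun the accuracy regression with word count added as a control and see whether the kindness coefficient shrinks. It would only be suggestive, since word count is itself an outcome of the treatment, but it is a start:

```python
# A check we did not run: does controlling for response length shrink the
# kindness effect on accuracy? Suggestive only, since word count is itself
# affected by treatment. Column names follow the earlier sketch.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trials.csv")
econ = df[df["task"] == "econ"]  # the monopoly optimization task

base = smf.ols(
    "correct ~ C(treatment, Treatment(reference='short_neutral'))", data=econ
).fit()
with_words = smf.ols(
    "correct ~ C(treatment, Treatment(reference='short_neutral')) + words", data=econ
).fit()
print(base.params)
print(with_words.params)
```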
This also aligns with cognitive framing theory—our prompt provides contextual cues that shape how the AI interprets the request. If it reads our kindness as a request for a more conversational answer rather than a more correct one, that might explain the dip in accuracy on some tasks. But note this does not explain the 1977 film result. There the long neutral prompt was no different from the control, but the kind prompt was about 5 percentage points less accurate. Furthermore, the long neutral prompt caused word count to fall, whereas kindness caused it to increase.
It's always difficult to fully tease meaning out of these differences, though, because ideally you would tweak just one thing when switching between treatment categories. In reality you're tweaking several things at once: the treatment isn't kindness versus neutrality holding the number of tokens fixed. It's hard to write prompts that are truly equivalent except for one feature, so that you can distinguish kindness from token length. But that's where we landed this time.
Implications & Next Questions
1. The Tradeoff Between Kindness and Efficiency
This experiment suggests there's a real tradeoff between kindness and efficiency when interacting with AI. These were simple tasks, though; who knows about harder ones. Debugging code is a next task once we figure out an efficient way to grade it, and perhaps kindness acts differently there. Still, if this were all the experimental evidence you'd ever have, one might conclude the following:
For the easiest tasks, measured by the control group getting them right 90% or more of the time, kindness and long neutral prompts can both improve accuracy, but they are not free: you pay for that accuracy in extra fractions of a penny.
But even this isn't quite right: the control group solved a simple algebraic monopoly pricing optimization problem 91% of the time, yet both the long neutral and the kind prompts performed much worse (7.5 to 8.5pp worse). So it's unclear how that could have been known or predicted ex ante.
For harder fact-finding tasks, measured by the control group having low success rates, the kind prompt was both less accurate and more expensive. Short, neutral requests would have done better.
My conclusion is that there's no obvious pattern other than that kindness often makes a difference. Sometimes it helps accuracy and sometimes it hurts, while the length of responses grows and so does the cost.
2. Does Kindness Create Stronger AI Bonds?
The second big question is about human-AI relationships. This is outside the scope of this study, but it is connected to the other question I want to examine, and studying that one will require IRB involvement. If being kind causes the language model to talk more, could that itself change how the human anthropomorphizes the AI?
In other words, if being nice to a model, speaking to it in a friendly and polite way, causes it to talk more in response, could that more talkative feedback cause the human to perceive the model in a new way? That's why such a study would require consent: it would involve human subjects.
This is probably an important question, though, given that people are falling in love with their models, using them as a kind of self-medication like therapy, and building friendships with them more generally. Politeness didn't always improve accuracy, but it consistently increased word count. And that made me wonder:
Does AI’s increased verbosity in response to kindness contribute to our sense of bonding with it?
If a chatbot engages in longer, more human-like conversation, does that deepen the illusion of connection?
Does this have implications for AI companionship models, including those used to treat mental health?
Could subtle shifts in response length be a mechanism through which people start to anthropomorphize AI?
This isn't something our experiment tested, but it's the next logical step. Eventually, once I have more experimental facts nailed down, that's what I'm going to look at. As I said, my long-term goal is to study using models for therapy and mental healthcare. Automating those jobs, maybe entirely, and thus impacting counseling labor markets, strikes me as a very plausible disruption.
Final Thoughts
This was a small study, but Joshua and I both liked it. The biggest takeaway for me?
Kindness doesn’t necessarily make AI smarter, but it does change how it responds to you.
You get longer answers, and you might get a more “helpful” tone—but not necessarily a more accurate answer.
And it costs more.
More broadly, this fits into my larger interest in how we relate to AI—not just how AI affects us, but how we affect AI. If even a simple tweak in tone changes the way it behaves, what happens as we get used to interacting with AI as if it were a person? And what does that mean for the future of human-AI relationships?
I have always wondered about the answer to this question. Thank you for the empirical answer!
These are very interesting findings. A few things might be going on here but my leading hypothesis (more like pure conjecture on my part informed by my experience in building other ML models in the past and experiments with AI tools as part of my work) is this:
When you phrase the prompt as "If you don't mind, could you please help me ...", the LLM might be spending the extra few milliseconds interpreting whether this is a request for information, request for help retrieving the information, or an elaborate IF-ELSE formulation without an ELSE statement. This might flow into a slightly different form of the solution, emphasizing different parts of the answer like helpfulness at the expense of accuracy.
What I'm taking away from this is that politeness actually changes the interpretation of the ask in subtle ways, and should be saved for cases where politeness is the point of the ask, like asking to proofread an email, set the tone for the rest of the conversation, etc.
Great food for thought, thanks for sharing!