Does o1 engage in counterfeit reasoning and if so can we detect it?
Revisiting Shlomi Sher’s 2023 methodology for identifying counterfeit reasoning
The new Strawberry model rolled out in a limited release yesterday. It's called o1-preview (with a smaller o1-mini variant), and you can find it in the app or browser by clicking the model selector at the top of the screen here.
This is their more advanced reasoning model. Noam Brown at OpenAI, whose Twitter profile says he has been working on reasoning models (including poker AI), reported these results in a chart, which was helpful to see laid out side by side with the other models.
We only have the preview model at the moment, but you can see that, compared to the already very powerful GPT-4o, even o1-preview is a significant improvement across extremely advanced mathematical, programming, and scientific domains (competition math, competition coding, PhD-level science).
I don't need those particular capabilities at the moment, but I am curious to see how o1 does if we revisit a 2023 set of reasoning tasks designed by Shlomi Sher, a psychology professor at Pomona, to identify what he called "counterfeit reasoning." (Shlomi is the brother of a friend of mine, the economist Itai Sher at UMass Amherst, who just published a solo-authored AER, and the son of the logician and philosopher Gila Sher at UC San Diego; deep and innovative critical thinking runs up and across the Sher family!) I am going to copy his questions and report the differences between his findings and o1's responses to see how well the new Strawberry model (i.e., o1-preview) performs, with a caveat at the end that Sher's test needs updating, as it's plausible the original questions have been included in o1's training data.
Shlomi Sher’s methodology for catching counterfeit reasoning
What's interesting to me about Sher's approach is that he focuses not on complex tasks but on simple yet unfamiliar ones. Here is his strategy in his own words.
Shlomi Sher:
Complex successes are exciting but very hard to interpret. Simple failures are sobering but can be much more revealing.
To translate this broad strategy into concrete tests, here are four useful tips:
1. Simplicity and unfamiliarity. Aim for simple problems with obvious solutions — but try to make them superficially unusual. Failures on easy problems are telling: If you fail, either you’re not paying attention (not such a plausible hypothesis in GPT’s case) or you don’t understand.
2. Zero in on bare structure. Humans sometimes answer questions based on mindless associations. But when we act from understanding, we’re doing something else — we’re working with a structured internal model. A grasp of structure makes surface associations less relevant. Stripping them away, or altering them, needn’t impair performance.
3. Knowledge composition. When a person has a solid conceptual grasp of both A and B, and A and B immediately imply C, then they’re very likely to understand C too. But when a person or machine is responding through shallow associations, they might get A and B right but C wrong.
4. Robustness checks. When a correct response results from mindless associations, trivial rewording of the question may spark alternate associations, leading to a catastrophic break-down in performance. But when your correct answer expresses understanding, you’ll be able to answer a trivially reworded question, as well as closely related questions.
Note that these questions were originally run against GPT-3.5 (he says he asked them in February 2023), so GPT-4o might do better. But because Sher's strategy claims you can identify counterfeit reasoning using simple but unfamiliar questions stripped to their bare structure, the questions still seem like a useful guide.
So, let's start with his original set of questions. I'll post each question, my answer, Sher's original results, and then, in code blocks, o1's responses.
Question 1. Jack and Jill are sitting side by side. No one else is in the room. The person next to Jack is angry. The person next to Jill is happy. Who is happy, Jack or Jill?
Me: Jack is happy because Jack is sitting next to Jill, no one else is in the room, and the person next to Jill is happy. Let's see what Sher found.
So, GPT-3.5 cannot do it. One can speculate about why, but whatever the reason, it comes down to two things: how the transformer works in combination with the training data, since those are fundamentally the only things we have to work with. So let's see how o1 performs.
o1: Jack is happy.
**Explanation:**
- **Jack and Jill are sitting side by side.** Since no one else is in the room, they are next to each other.
- **The person next to Jack is angry.** The only person next to Jack is Jill, so Jill is angry.
- **The person next to Jill is happy.** The only person next to Jill is Jack, so Jack is happy.
Therefore, **Jack is happy**, and **Jill is angry**.
So, o1 gets it right. Score +1 point out of 1 so far.
Question 2: In 1990 Jill ran two marathons. While Jill was running a marathon in 1992, Jill was brutally murdered by another marathon runner. Jill was the first place winner in marathons she ran in 1987, 1989, and 1991. Is it likely that Jill ran a marathon in 1993?
My answer: it is impossible she ran a marathon in 1993 because Jill was murdered in 1992.
Let’s see what ChatGPT-3.5 said in February 2023.
So, clearly it was not “reasoning” because if it was, then it would know that dead people in 1992 cannot do anything in 1993 other than remain dead. Let’s see what o1 says.
o1: It is unlikely that Jill ran a marathon in 1993.
**Explanation:**
- **Jill was murdered in 1992:** The information states that Jill was *brutally murdered by another marathon runner* while running a marathon in 1992.
- **Deceased individuals cannot participate in marathons:** Since Jill was murdered, she would not be alive to run in any marathons in 1993.
- **Conclusion:** Therefore, it is unlikely that Jill ran a marathon in 1993 due to her death in the previous year.
Interesting. You can quibble and say that o1 called it "unlikely" (GPT-3.5 actually said it was impossible to know) when in fact it was impossible. But technically Sher asked "is it likely?", and o1 answered that question: it said no, it was unlikely, and then gave its reasoning, and its second point was that dead people cannot run. So I'm going to give it to o1: +2 out of 2.
Question 3: Jack is a college professor. Jill is a college student. Jill is Jack's biological mother. Who is older, Jack or Jill?
Me: Jill is older than Jack because when a woman gives birth to a child, which is what "biological mother" means, she is already older than that child, and the age gap is maintained throughout both of their lives.
For this question, Sher had upgraded to GPT-4, so let's see what GPT-4 in OpenAI's Playground said in 2023.
So, it pretty much got it wrong, as Jack cannot be older than Jill if Jill is his biological mother. Let's now see how o1 does.
o1: **Jill is older than Jack.**
**Explanation:**
- **Jill is Jack's biological mother:** This means Jill gave birth to Jack, so she must be older than him.
- **Jack is a college professor, and Jill is a college student:** While it's uncommon, it's possible for a parent to attend college later in life and for their child to become a professor at a relatively young age.
**Conclusion:**
- Since Jill is Jack's mother, she is older than Jack, regardless of their current professions or educational statuses.
And o1 gets it right where GPT-4 got it wrong (but remember, GPT-4o has not been evaluated yet, either in this Substack or in Sher's Medium post). Give o1 +3 out of 3.
Question 4: Jack bought two pair of pants. How many articles of clothing did Jack buy?
Me: Jack bought two articles of clothing because a pair of pants is one article of clothing (the unit of measurement is, in other words, “pair of pants”).
Let's see what GPT-4 said back in early 2023. Note that Sher initially asked the question about "one pair of pants", and it got that answer right (as did o1 when I asked it), but it appears GPT-4 gets tripped up by "pair": since a pair in English means two, it multiplies 2 x 2 to get 4.
Now let’s see how o1 does.
o1: **Jack bought two articles of clothing.**
**Explanation:**
- **Pair of Pants:** The term "pair of pants" refers to a single article of clothing, despite the word "pair." This originates from when pants were made of two separate pieces, one for each leg.
- **Two Pair of Pants:** Therefore, when Jack bought two pair of pants, he purchased two articles of clothing.
So o1 both gets the answer right and explains why this is "counterintuitive": "pair" ordinarily means two, but in this case it does not. So +4 out of 4 for o1.
Sher continues with questions about prime numbers and set theory, but I’m going to stop now as I am sitting in my car and my fingers are starting to hurt.
First, on those four questions, GPT-3.5 from February 2023 got 0 out of 2, and GPT-4 from spring 2023 got 0 out of 2, giving each a score of 0 on what was a simple set of questions following Sher's ingenious methodology for identifying counterfeit reasoning: simple but unfamiliar cases, stripped to bare structure, with robustness checks.
OpenAI's o1-preview model, popularly dubbed Strawberry, got all four questions right, scoring 100 percent. The only thing I would say, though, is that it is still not obvious it truly met the criteria laid out by Sher. Sher wrote his post in 2023 criticizing that generation of GPT models, OpenAI has "run out of text" to train on, the Medium post has no paywall, and OpenAI can fine-tune on top of pretraining, so it's possible that Sher's questions and reasoning are in o1's training data. That would mean the test no longer fits his criteria: the questions are still simple but no longer "unfamiliar." So a true test would need to be "out of sample," meaning creative enough to be novel but still consistent with the structure for identifying counterfeit reasoning.
I will plug my own work here, though. I don't study reasoning tasks at all in my research on GPT-4 and GPT-4o. Rather, my goal is to coax predictions out of the training data, based on the transformer's method of predicting intelligible speech. To do that, I have the model tell stories set after the predicted events have happened, to investigate whether it chooses the "best words" for the purpose of the story, which are also, metaphorically, the words that minimize some loss function. So that's one thing we do.
But the other thing we do is run Monte Carlo simulations rather than one-shot queries of the model, which I also haven't seen people do. Because of the inherent randomness and the variation around some true value caused by the model's probabilistic fitting, running a large number of trials should shrink the error toward zero and recover the "true prediction." And with ElectionGPT, we are also looking for "unfamiliar" things that cannot simply be recalled from the training data, even with the effort to exploit the training-data cutoff dates as I did in my paper with Van Pham. We do this through out-of-sample prediction. And finally, we vary the voice (e.g., an anonymous college professor, an anonymous reporter, Jerome Powell, Rachel Maddow) to better understand the precision of, and bias in, responses associated with slight variation in the actors in our stories.
I think it would be worthwhile to consider adopting a similar method, at least the Monte Carlo piece, with simple and complex reasoning tasks. It would also be interesting to see whether the future-narrative prediction tasks come out differently with o1. But you could at least apply the Monte Carlo method we're using to try to eliminate luck on either side and see what the core model commits to on average.
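To make that concrete, here is a minimal sketch of what the Monte Carlo piece might look like applied to one of Sher's questions, assuming the OpenAI Python SDK and access to o1-preview; the `extract_answer` scoring rule and the trial count are placeholders I'm choosing for illustration, not part of Sher's method or our published code.

```python
import re
from collections import Counter
from openai import OpenAI  # assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY set

client = OpenAI()

QUESTION = (
    "Jack is a college professor. Jill is a college student. "
    "Jill is Jack's biological mother. Who is older, Jack or Jill?"
)

def ask_model(prompt: str, model: str = "o1-preview") -> str:
    """One independent query; the model name is whatever preview is available to you."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

def extract_answer(response: str) -> str:
    """Crude placeholder scoring rule: the first name mentioned is taken as the answer."""
    match = re.search(r"\b(Jack|Jill)\b", response)
    return match.group(1) if match else "unclear"

def monte_carlo(prompt: str, n_trials: int = 100) -> Counter:
    """Ask the same question n_trials times and tally the answers,
    so luck in any single sample washes out of the estimate."""
    tallies = Counter()
    for _ in range(n_trials):
        tallies[extract_answer(ask_model(prompt))] += 1
    return tallies

if __name__ == "__main__":
    counts = monte_carlo(QUESTION, n_trials=100)
    total = sum(counts.values())
    for answer, count in counts.most_common():
        print(f"{answer}: {count}/{total} ({count / total:.0%})")
```

The idea is simply that a model which genuinely grasps the structure should converge on "Jill" in nearly every trial, while a model leaning on surface associations (professors are usually older than students) should show a noisier split, and the same loop can be rerun on trivially reworded versions of the question as a robustness check.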
But in the meantime, o1 passed Shlomi Sher's ingenious method of counterfeit reasoning detection, and I would be interested in seeing more of this going forward, in combination with the Monte Carlo simulation approach we have outlined in our work.