( ・_・)ノ Ritchot's Corner

o1 pro mode still has a long way to go for Mathematics

Key Points

The 12 Days of OpenAI began with the full release of o1 and the introduction of ChatGPT Pro. The release showcase was impressive, with claims of significant improvements in their newer models—a large progression from GPT-4o, to o1 preview, to o1, and finally to the newly announced o1 pro mode. OpenAI took a lot of care to emphasize the enhancements across mathematics, coding, and science domains.

Unlike previous models, o1 is positioned as OpenAI's first model that "thinks" before responding—which we can think of as “reasoning” through problems. OpenAI has described o1 as multimodal, handling both text and images, with greater accuracy, detail, and correctness compared to earlier versions. During the demonstration, its capabilities were displayed through history questions about Roman emperors, thermodynamics challenges involving a hypothetical space data center, and complex chemistry problems requiring specific protein configurations. The showcase suggested a marked improvement over prior models I have used extensively over the past two years.

However, with most of my work day spent in the Secondary Mathematics teaching trench, my question was: how good, really, has it become at Mathematics? Not only is OpenAI claiming that o1 and o1 pro mode are significantly better than GPT-4o, but also that they can solve problems more accurately and interpret questions from images.

Secondary Mathematics departments have been lucky in escaping the LLM apocalypse in our classrooms because, so far, LLMs have been notoriously bad at solving mathematics problems. After all, their core function is language prediction, not computation. For any readers not familiar with how LLMs work on a technical level, you can think of tools like ChatGPT as sophisticated autocomplete systems. When presented with a query, they predict the most probable sequence of words based on extensive textual training. Mathematics, however, demands precise answers and step-by-step reasoning—skills not inherently aligned with predictive text modeling. For example, when asked, "What is 12 x 8?" a model might respond with 96, not because it "understands" multiplication, but because it recalls that "96" is often associated with that question. The underlying process lacks mathematical comprehension.
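
The recall-over-computation idea can be sketched in a few lines of Python. This is a deliberately crude caricature, not how real LLMs work (they predict tokens with neural networks, not literal lookups), and the tiny "corpus" here is invented for illustration:

```python
from collections import Counter

# Toy "training corpus": snippets the model has seen. Note it contains
# answers, including a wrong one, just as web text does.
corpus = [
    "what is 12 x 8? 96",
    "what is 12 x 8? 96",
    "what is 12 x 8? 86",  # noisy data: a wrong answer also appears
]

def most_likely_completion(prompt: str) -> str:
    # Count every continuation observed after the prompt and return the
    # most frequent one -- pure recall, no arithmetic is ever performed.
    continuations = Counter(
        line[len(prompt):].strip()
        for line in corpus
        if line.startswith(prompt)
    )
    return continuations.most_common(1)[0][0]

print(most_likely_completion("what is 12 x 8?"))  # prints "96"
```

The point of the caricature: the "96" comes out because it was the most common continuation in the data, and nothing in the process would notice if the corpus had mostly said "86" instead.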

This can be seen in some other more amusing examples as well, such as ChatGPT's apparent fondness for the numbers 42 and 7. Colin Fraser, a data scientist, recognized that it seemed to output 42 nearly 10% of the time when asked for a random number between 1 and 100. For the literary nerds reading this article, you may be able to recognize why. After all, 42 is the answer to the "ultimate question of life, the universe, and everything." Fraser speculates that there were many more 42s in the training data than other numbers, resulting in 42 appearing roughly nine percentage points more often than the 1% we would expect if these tools understood and implemented true mathematical randomness.
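
For comparison, a genuinely uniform random draw looks like this. A quick simulation (the seed and sample size are arbitrary choices of mine) shows every number, 42 included, landing near the expected 1%:

```python
import random

# Under true uniform randomness, each number from 1 to 100 should
# appear about 1% of the time -- nowhere near the ~10% rate observed
# for 42 in ChatGPT's outputs.
random.seed(0)
draws = [random.randint(1, 100) for _ in range(100_000)]
share_of_42 = draws.count(42) / len(draws)
print(f"42 appeared {share_of_42:.2%} of the time")  # roughly 1%
```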

My experience with LLMs reinforces their limitations in mathematics. No matter how many math problems I have tested on earlier models, the accuracy felt like a coin flip at best. This inconsistency was, in some ways, a relief. Students who relied too heavily on AI for answers without engaging in a critical thinking process risked getting things wrong, ensuring that authentic, extended projects remained effective for learning and valid for assessment.

Now we are facing claims from OpenAI that o1 can interpret and solve complex problems, even from images. If true, Secondary Mathematics classrooms are going to be in trouble. What do you do with assessment, formative or summative, if any problem or project, image or text, can be fed to o1 and given a valid, reasoned result? Aside from completely closed-off and strictly secured assessments, any student motivated only by score, and not by process, would just need to memorize the answer and the reasoning provided.

Realistically, I teach within a larger, assembly-line-like system with limited time and resourcing—so any lofty, idealistic answers on classroom transformation fly right out the window (and seeing how poorly schools are handling all facets of AI tools' existence leads me to believe there won't be any meaningful change in Secondary Education anytime soon). Again, realistically, I also need to put myself in the shoes of students, who are often inclined toward the path of least resistance and likely to outsource their problem-solving process entirely. While I encourage and demonstrate ethical AI usage to enhance learning, not all students are intrinsically motivated enough to resist the temptation of easy answers. This shift risks undermining critical thinking and genuine understanding. Though I would be remiss to act as if this were new, as cheating is already common. One study spanning eleven years of college courses found that doing homework improved test grades for 86% of students in 2008, compared to 45% in 2017. The drop was attributed to half of students simply looking up the answers in 2017, so they never got the benefits of homework. I consider LLMs to be just an extension of an already existing trend.

So I wanted to see how true OpenAI's claims are, but for a use case more contextual to my classroom, and less so for the competition-level math it was tested on. One of my favorite tasks to layer into my classes is the Problems of the Week provided by the University of Waterloo's Centre for Education in Mathematics and Computing (CEMC). I enjoy them so much because the problems are multi-faceted, designed to challenge students across learning strands, and promote critical and computational thinking across grades three to twelve. Given o1's showcased ability to handle advanced thermodynamics and chemistry problems, I expected it to easily navigate these question sets.

I upgraded to ChatGPT Pro to access o1 pro mode, the model OpenAI claims is their best at reasoning. I wanted to see the best results I could get, as there is always the possibility of an enterprising student willing to invest in these tools. To test its capabilities, I fed o1 pro mode 12 questions from this academic year's sets at each of five difficulty levels, for a total of 60 questions.1 Each question was tested across four trials, my attempt at modelling my test after the "4/4 reliability" framework detailed by OpenAI. For each question and trial, I converted the PDF file provided by the CEMC into a PNG, attached it, and provided the following prompt.

“This image shows a Mathematics problem.

Your job is to provide a solution to the problem. While doing so, please do the following.
1. Tell me what problem you are solving.
2. Provide details on how to solve the problem.
3. Provide an answer.”

Despite the hype, I found the results underwhelming. o1 pro mode passed the "4/4 reliability" criterion on only 40 out of 60 questions, a pass rate of 67%. When broken down by individual trials, the model answered correctly in 177 out of 240 attempts, a 74% success rate.2 For now, it seems my mathematics classroom, and the effort my students must put into their work, remains relatively safe, though this is mostly because the model is bad at one particular style of problem: anything that requires parsing information from an image.
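
As a sanity check on my own bookkeeping, both percentages follow directly from the raw counts (the 40/60 and 177/240 figures come from my results sheet; the script itself is just arithmetic):

```python
# Question-level: a question "passes" only if all 4 of its trials are
# correct (the 4/4 reliability framework).
questions_passed, questions_total = 40, 60
# Trial-level: every attempt counted individually (60 questions x 4 trials).
trials_correct, trials_total = 177, 240

question_pass_rate = questions_passed / questions_total
trial_success_rate = trials_correct / trials_total

print(f"4/4 reliability pass rate: {question_pass_rate:.0%}")  # 67%
print(f"Per-trial success rate:    {trial_success_rate:.0%}")  # 74%
```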

What took me by surprise was o1 pro mode's poor performance on problems requiring image interpretation, especially given how heavily the showcase promoted the contrary. Many of the Problem B (Grade 5–6) questions included images that needed to be parsed, and while humans can easily extract the necessary information, o1 pro mode struggled significantly. I first recognized this problem when beginning the tests with the higher-level Problem D (Grade 9–10) and Problem E (Grade 11–12) sets. I was curious whether it was the difficulty of the problems themselves or the visual component that tripped up o1 pro mode and caused errors. Even when the problems themselves were simplified, the mere presence of an image appeared to render the model ineffective, leading to consistent failures. Interestingly, the level it did worst at was the Grade 5–6 questions, with a pass rate of 5 out of 12 questions, or 42%, seemingly because of the heavier visual component.

That said, this experience does seem to be a noticeable improvement compared to earlier models. While I never recorded previous results, my earlier impression of a "coin flip"—where a question had roughly a 50/50 chance of being answered correctly—seems to have shifted toward greater consistency. Interestingly, o1 pro mode now tends to either get every trial of a question correct or fail all trials outright. However, its tendency to produce varying incorrect answers for the same question remains perplexing. For example, in POTWE Problem 3, when asked which route from Omicron to Tau gives the shortest travel time, the model mistakenly omitted travel pathways from Pi, Rho, and Sigma differently in three separate trials, and also made several varied mistakes about the travel times given between cities.3 This variability means users must still closely verify its responses, though its occasional consistency when solving questions correctly is a step forward. Its mistakes often come down to failing to identify the correct information it has to work with; I wish I could share the chats so people could take a closer look at the full outputs.

Admittedly, I'm not a researcher. I am one of many, many educators (with severe time constraints) trying to navigate the educational landscape in the presence of easily accessible LLMs and other AI tools. Unfortunately, I don't know a single school, district, or board across primary and secondary education dedicating serious time, resources, and personnel to meaningfully tackle these problems—to rigorously test emerging AI models against internal, validated, context-specific benchmarks and actually see where the frontiers are for our use cases and what implications they have in our classrooms. I do, however, see a lot of Educational Technology Coordinator roles spending a lot of time making nice-looking Canva posters—money well spent, I guess. So it is up to me, and others in the trench like me, even with limitations on time and money. There are many limitations to my test. The sample size was small, with only 60 questions tested across four trials each. I chose the questions because of how new they were: they were likely to be a bit more unique and, hopefully, not in the training data. Another limitation is that the test focused solely on one question source, which does not fully represent the broader range of mathematical problems students might encounter in the classroom. I should also mention that there are larger, more rigorous tests of how these AI tools handle mathematics, such as FrontierMath by Epoch AI. However, their context is PhD and Fields Medalist mathematicians, not the types of questions the average secondary student would encounter, nor the questions I work with in my education context.

I also need to keep one thing in mind: how accurate does a model really need to be before I am impressed? Does it need to reach perfection, or is the more detailed reasoning good enough to start students on a path of critically analyzing the output of these tools? Secondary Mathematics does not vitally need 100% accuracy; realistically, none of this work is going to result in life or death, so I should probably temper my expectations of what I consider impressive. From these tests, the failures also seem to be mainly due to the image capabilities lagging behind, a problem with the vision model. If we get true, full multimodality, these test results would likely have looked much more impressive.

I wish I had more time and resourcing to explore expanding the variety of question sources, increasing the number of trials, and testing under more controlled conditions. Additionally, examining specific types of errors—whether they stem from misinterpreting the prompt, failing to follow mathematical reasoning, or struggling with image parsing—could provide clearer insights into the model’s limitations and strengths.

For now, though, it seems my students still need to do the heavy lifting when it comes to solving problems, and my classroom remains safe from being entirely overtaken by AI. Sorry kids, but you’re still going to have to grind through the work.

If you want to chat, shoot me an email. If you would like to get updates, subscribe to my blog via email or RSS feed. You can also follow me on LinkedIn, X, and BlueSky.

  1. You can find an archive of the questions I used at the CEMC website. It is the first 12 questions for each level set from 2024/2025. I also have an archive, which you can contact me to obtain.

  2. You can download a recording of my results as an XLSX file.

  3. Here are images from three (1, 2, 3) of the trials mentioned, where it incorrectly identifies the wrong pathways between cities.