The pressure to get “good grades in math” isn’t just on human offspring. AIs, too, are subjected to it on a daily basis. Sam Altman, the CEO of OpenAI, is a proud father this week: the new model the company has just unveiled, OpenAI o1, is said to be “excellent at mathematics”, and it has the report card to prove it. This was no foregone conclusion. It is the irony of the history of generative AIs: although they are the product of advanced mathematics, they are not very good at doing it.
Even simple additions sometimes make ChatGPT break out in a cold sweat. And don’t ask it to count how many times the letter “R” appears in the word “strawberry”: it answers “2” with phenomenal aplomb. “Asking ChatGPT to do calculations doesn’t make sense,” a professional in the sector recently told us, “it’s like using a hammer to make a chocolate cake.” The utensil is the wrong one because ChatGPT’s approach is probabilistic, not deterministic. Having been trained on enormous quantities of data, it often manages to formulate relevant answers by recognizing patterns in our questions. But if it generally answers that 2 + 2 equals 4, it is because it has identified that this is the most probable sequence of terms for the question, not because it has performed the operation.
Some Internet users even have fun making ChatGPT change its mind. When they insistently tell it that it is wrong and that 2 + 2 actually equals 5, the tool sometimes ends up agreeing and apologizing. A reminder, if one were needed, that it does not distinguish true from false but seeks, from a probabilistic point of view, the best answer to give, given the direction of the dialogue.
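Letter counting is, of course, trivial for a deterministic program. A minimal Python sketch illustrates the gap between actually computing an answer and predicting a likely-sounding one:

```python
# A deterministic program performs the operation instead of
# predicting a probable sequence of words, so it cannot be
# talked out of the correct answer.
word = "strawberry"
count = word.count("r")
print(count)  # prints 3, not the "2" ChatGPT confidently gives
```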
OpenAI is not alone in trying to climb the Everest of mathematical AI. Which is, in essence, nothing other than the ability to reason. “Deep neural networks do not have the generalization capabilities that humans develop […] They don’t always extract the underlying principles of what they are trying to learn,” Stuart Russell, professor of computer science at the University of California, Berkeley, and author of the reference work Artificial Intelligence: A Modern Approach, recently explained to L’Express. For this industry expert, there is no doubt: limiting ourselves to training large language models will not produce “real” AI. “The idea is starting to circulate of combining different methods, some of which date back to the 1980s,” he specifies.
OpenAI o1 has advanced reasoning capabilities
Meta, Google, Anthropic… all the major players in AI are working on hybrid approaches. OpenAI states on its site that its new o1 model follows a “chain of thought”, like a human “who would take a long time to think before answering a difficult question”. Its latest product, o1, knows how to break “complex steps down into simpler ones”. And unlike its predecessors, it is able to detect and correct its own errors. If it sees that the path taken presents contradictions or inconsistencies, “it tries a different approach”, the company explains in its announcement post.
When you ask o1 a question, the tool pauses for anywhere from a few seconds to several dozen before responding. A menu indicating the duration of the “reflection” can be expanded to display each step of the AI’s reasoning. Beyond that, “OpenAI does not currently provide many details on how o1 was designed and how it works,” regrets Djamé Seddah, a researcher at Inria Paris. We submitted to o1 a few riddles gleaned from the Internet, such as this one: “A kindergarten teacher asks her students to cut strips of 2 cm by 10 cm. To do this, she gives them a square sheet of 10 × 10 cm. On average, a child in this class takes 20 seconds to cut a strip. How long will it take, on average, for a child to cut their sheet of paper into strips?”
OpenAI o1 took a few moments to break down the problem. “First we need to figure out how many 2×10 strips can be cut from a 10×10 sheet. And calculate the total time it takes, given that each strip takes 20 seconds to cut.” It then proceeds with its reasoning methodically and precisely. When asked how many times the letter “r” appears in “strawberry,” it provides the correct answer in a few seconds. But for more complex problems, o1 takes more time. And it sometimes backtracks if it notices a contradiction in its demonstration.
o1 is only a “preview”, a test version designed to collect user feedback. Only a multitude of tests, conducted in particular by seasoned mathematicians, will allow its capabilities to be evaluated with any finesse. Our little riddle about the strips of paper, for example, tripped it up, although admittedly it contained a logical trap. o1’s reasoning for evaluating the time spent cutting a 10 × 10 sheet into strips is mathematically sound. But the tool did not “realize” that once the penultimate strip was cut, the last one was freed at the same time. The correct answer is therefore 80 seconds and not 100, as o1 and many of us might instinctively conclude.
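The riddle’s arithmetic can be spelled out in a few lines. A 10 × 10 cm sheet yields five 2 × 10 cm strips, but the last strip falls free with the penultimate cut, so only four cuts are needed:

```python
# The strip riddle: 10 / 2 = 5 strips, but cutting the 4th strip
# simultaneously frees the 5th, so only 4 cuts are required.
sheet_width_cm = 10
strip_width_cm = 2
seconds_per_cut = 20

strips = sheet_width_cm // strip_width_cm  # 5 strips
cuts = strips - 1                          # 4 cuts
total_seconds = cuts * seconds_per_cut
print(total_seconds)  # 80 seconds, not the intuitive 5 * 20 = 100
```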
As long as the statements do not contain these types of traps, o1 seems to be an effective all-terrain tool. OpenAI, which had it take a qualifying exam for the International Mathematical Olympiad, reports that it scored 83% correct answers, while the latest version of ChatGPT (GPT-4o) achieved a meager 13%. The company headed by Sam Altman also highlights the multiple applications this new model could have, from quantum physics to biology to cryptology.
“If OpenAI’s AI, or that of other players, becomes efficient and truly reliable in mathematics, this opens up vast perspectives,” confirms Djamé Seddah. “If only in the economic world: we will be able to have them analyze financial reports and market trends, and make precise predictions.” Even if o1 is not the super-powerful GPT-5 language model that should eventually succeed GPT-4, OpenAI considers that it constitutes a separate class of AI, which is why it named it “o1”: the first of a new generation.