OpenAI is releasing a new model called o1, which is the first in a new series of “reasoning” models designed to handle complex questions more quickly than humans. It’s being launched alongside o1-mini, a smaller, more affordable version. And yes, for those following AI rumors, this is the much-talked-about Strawberry model.
For OpenAI, o1 marks a move toward more human-like artificial intelligence. Practically speaking, it’s better at coding and solving complicated problems than earlier models. However, it’s also more expensive and slower to use than GPT-4o. OpenAI is calling this release of o1 a “preview” to highlight that it’s still in early stages.
ChatGPT Plus and Team users can access both o1-preview and o1-mini starting today, while Enterprise and Edu users will get access early next week. OpenAI also plans to make o1-mini available to all free ChatGPT users, but hasn’t set a release date yet. Developer access to o1 is quite pricey: in the API, o1-preview costs $15 per 1 million input tokens (chunks of text the model reads) and $60 per 1 million output tokens (chunks of text the model generates). In comparison, GPT-4o costs $5 per 1 million input tokens and $15 per 1 million output tokens.
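To put those rates in perspective, here’s a minimal back-of-the-envelope sketch in Python. The per-token rates come from the paragraph above; the request sizes (2,000 input tokens, 1,000 output tokens) are illustrative assumptions, not figures from OpenAI.

```python
# Back-of-the-envelope cost comparison at the listed API rates
# (USD per 1 million tokens). Request sizes below are illustrative.
RATES = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "gpt-4o":     {"input": 5.00,  "output": 15.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the published per-million-token rates."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# A hypothetical request: 2,000 tokens in, 1,000 tokens out.
for model in RATES:
    print(f"{model}: ${cost(model, 2_000, 1_000):.4f}")
# o1-preview: $0.0900
# gpt-4o:     $0.0250
```

At those assumed sizes, a single o1-preview call costs roughly 3.6 times as much as the same call to GPT-4o.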
The training for o1 is quite different from previous models, says OpenAI’s research lead, Jerry Tworek. He mentions that o1 “has been trained with a new optimization algorithm and a new dataset specially made for it,” but the exact details are not being shared.
Due to its new training methods, OpenAI says the o1 model should be more accurate. “We’ve noticed it hallucinates less,” says Tworek, but acknowledges that the issue isn’t completely resolved. “We can’t say we’ve solved hallucinations.”
What makes this new model different from GPT-4o is its enhanced ability to handle complex tasks like coding and math, and to explain its reasoning more effectively, according to OpenAI.
OpenAI’s chief research officer, Bob McGrew, says, “The model is definitely better at solving the AP math test than I am, and I was a math minor in college.” In tests against a qualifying exam for the International Mathematics Olympiad, GPT-4o solved only 13 percent of problems, while o1 achieved 83 percent.
In Codeforces programming contests, o1 reached the 89th percentile of participants, and OpenAI claims a future update will perform similarly to PhD students on challenging tasks in physics, chemistry, and biology.
However, o1 falls short of GPT-4o in several areas. It doesn’t perform as well on factual knowledge about the world, nor can it browse the web or process files and images. Despite this, OpenAI believes it represents a new level of capability, hence the name o1, symbolizing a “reset to 1.”
“I’m going to be honest: I think we’re traditionally bad at naming,” says McGrew. “So I hope this is the start of more sensible names that better reflect what we’re doing.”
Though I didn’t get to try o1 myself, McGrew and Tworek demonstrated it to me over a video call. They posed a puzzle:
“A princess is as old as the prince will be when the princess is twice as old as the prince was when the princess’s age was half the sum of their present age. What are the ages of the prince and princess? Provide all solutions.”
The model took 30 seconds to process and then provided the correct answer. OpenAI has designed the interface to show the model’s reasoning steps, which, while not truly indicative of human thought, mimic a step-by-step thought process. Phrases like “I’m curious about,” “I’m thinking through,” and “Ok, let me see” create an illusion of deliberate reasoning.
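For readers who want to check the puzzle themselves, here’s a short brute-force sketch in Python (not OpenAI’s code; the variable names and the 100-year age cap are illustrative). Working the algebra by hand gives the same result the search finds.

```python
# Brute-force check of the princess-and-prince puzzle (illustrative).
# Let P = princess's current age, p = prince's current age.
#
# "when the princess's age was half the sum of their present ages":
#   that was t1 = P - (P + p) / 2 years ago, so the prince was then p - t1.
# "when the princess is twice as old as the prince was [back then]":
#   that is t2 years from now, where P + t2 = 2 * (p - t1).
# "a princess is as old as the prince will be [at that point]":
#   P = p + t2.
from fractions import Fraction

def satisfies(P: int, p: int) -> bool:
    P, p = Fraction(P), Fraction(p)
    t1 = P - (P + p) / 2          # years ago
    prince_then = p - t1
    t2 = 2 * prince_then - P      # years from now (may be negative)
    return P == p + t2

pairs = [(P, p) for P in range(1, 101) for p in range(1, 101) if satisfies(P, p)]
print(pairs[:4])  # [(4, 3), (8, 6), (12, 9), (16, 12)]: every 4:3 ratio
```

The algebra reduces to 3P = 4p, so the puzzle has infinitely many solutions, all with the princess’s and prince’s ages in a 4:3 ratio (for example, a princess of 8 and a prince of 6).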
But since the model isn’t actually thinking and isn’t human, why design it to appear as if it is?
According to Tworek, OpenAI doesn’t equate AI model thinking with human thinking. Instead, the interface is designed to demonstrate how the model processes and explores problems more deeply. He notes, “There are ways in which it feels more human than previous models.”
McGrew adds, “You’ll find many aspects where it seems somewhat alien, but also some surprising human-like qualities.” The model is given a set amount of time to handle queries, so it might indicate it’s “running out of time” or suggest it needs to “get to an answer quickly.” During its reasoning process, it may also appear to “brainstorm” by considering different options, saying things like, “I could do this or that, what should I do?”
Building Toward Agents
Currently, large language models aren’t truly smart. They mainly predict word sequences based on patterns learned from extensive data. For example, ChatGPT has sometimes incorrectly claimed that the word “strawberry” has only two Rs, because the model processes text as chunks of words (tokens) rather than individual letters. The new o1 model, however, did get this right.
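The miscounting makes more sense once you see what the model actually receives. Here’s a minimal sketch using the tiktoken library (my choice of tooling, not something the article names) to show that “strawberry” arrives as a handful of multi-letter tokens rather than ten separate characters.

```python
# Minimal sketch of why letter-counting trips up LLMs: the model sees
# sub-word tokens, not individual characters. Requires the tiktoken
# library (pip install tiktoken); token boundaries depend on the encoding,
# so we print them rather than assume them.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")   # maps to the o200k_base encoding
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])      # multi-letter chunks, not letters

# Counting characters directly is trivial outside the model, which is the point:
print("strawberry".count("r"))                # 3
```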
As OpenAI reportedly seeks to raise funding at a staggering $150 billion valuation, its progress depends on further research breakthroughs. The company aims to enhance reasoning capabilities in LLMs, envisioning a future with autonomous systems, or agents, that can make decisions and act on your behalf.
For AI researchers, advancing reasoning is a crucial step toward achieving human-level intelligence. The belief is that if a model can do more than just recognize patterns, it could lead to significant advancements in fields like medicine and engineering. However, o1’s reasoning abilities are still relatively slow, not agent-like, and costly for developers to use.
“We have been spending many months working on reasoning because we think this is actually the critical breakthrough,” McGrew says. “Fundamentally, this is a new modality for models to solve the really hard problems needed to advance towards human-like intelligence.”