OpenAI’s newly launched o3 and o4-mini AI models are state-of-the-art in many respects, but they are facing a significant issue: hallucinations, or the creation of false information. Surprisingly, the new models seem to hallucinate even more than OpenAI’s older models, despite being designed to offer better reasoning capabilities.
Hallucinations remain one of the hardest problems in large language models (LLMs), affecting even today’s best-performing systems. Historically, each new model has hallucinated slightly less than its predecessor, but o3 and o4-mini buck this trend.
According to OpenAI’s internal tests, both o3 and o4-mini hallucinate more frequently than previous models such as o1, o1-mini, and o3-mini. They outperform the traditional, non-reasoning GPT-4o model in some areas, such as coding and math, but the increased hallucination rate is concerning.
On PersonQA, OpenAI’s in-house benchmark that measures a model’s knowledge of people, o3 hallucinated in response to 33% of questions, roughly double the 16% and 14.8% hallucination rates of the earlier models o1 and o3-mini, respectively.
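To put those PersonQA figures in perspective, the short sketch below works out how much higher o3’s rate is than each earlier model’s. It uses only the percentages quoted above; the variable and function names are illustrative and are not part of any OpenAI tooling.

```python
# Hallucination rates on PersonQA as quoted in the article (percent of questions).
rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0}


def relative_increase(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


# Compare o3 against each of the two earlier models.
ratios = {name: relative_increase(rates["o3"], rates[name]) for name in ("o1", "o3-mini")}

for name, ratio in ratios.items():
    gap = rates["o3"] - rates[name]  # absolute gap in percentage points
    print(f"o3 vs {name}: {ratio:.2f}x the rate, +{gap:.1f} percentage points")
```

On these numbers, o3’s rate is a little more than twice that of either earlier model, a gap of roughly 17 to 18 percentage points.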
O4-mini performed even worse on the same benchmark, hallucinating in nearly half of its responses. Third-party testing by the AI research lab Transluce found similar issues: o3 was observed making up actions it supposedly took while arriving at answers, such as claiming it ran code on a 2021 MacBook Pro, something the model cannot do.
Neil Chowdhury of Transluce speculates that the reinforcement learning used to train these models may amplify hallucinations. Even so, some experts see promise in o3’s capabilities, particularly in coding workflows, while acknowledging its tendency to hallucinate broken links and make other errors.
While hallucinations can sometimes lead to creative or interesting results, they present significant challenges in fields requiring accuracy. For instance, businesses like law firms would not be pleased with models that insert factual errors into critical documents.
One potential way to improve accuracy is to give models web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on the SimpleQA benchmark, which suggests that search could also help reduce hallucination rates, at least for users who are willing to expose their prompts to a third-party search provider.