OpenAI’s newest AI model, o3, is at the center of a growing controversy after third-party tests revealed performance significantly lower than the company’s earlier claims.
Originally hailed as a major step forward in reasoning tasks, o3 was said to solve over 25% of the challenging FrontierMath benchmark, a claim made during OpenAI’s December livestream presentation. But new results tell a different story.
Independent research institute Epoch AI tested the newly released o3 model and found that it scored around 10% on the same benchmark, far below OpenAI’s internal figure. Epoch’s results sparked immediate debate over the model’s real-world capabilities and the transparency of OpenAI’s testing practices.
Epoch noted that the discrepancy may stem from differences in testing setup, dataset versions, or the use of “aggressive test-time compute” by OpenAI. It is also possible that the high-scoring o3 OpenAI demonstrated is not the same version now available to the public.
Further complicating the situation, a post by the ARC Prize Foundation revealed that the public o3 model is smaller and optimized for cost and speed, not peak performance. Even OpenAI’s technical staff confirmed the production version of o3 is tuned for real-world responsiveness, not benchmarking supremacy.
This isn’t the first time benchmark reporting has stirred backlash. Elon Musk’s xAI and Meta have both been accused of promoting scores from unreleased or altered versions of their models. FrontierMath itself drew criticism earlier, when Epoch was slow to disclose that OpenAI had funded the benchmark.
Despite the confusion, OpenAI plans to release o3-pro, a more powerful version, in the coming weeks. Meanwhile, smaller models such as o3-mini-high and o4-mini already outperform o3 on some benchmarks.