OpenAI's New Reasoning AI Models Hallucinate More Frequently


OpenAI's o3 and o4-mini models, released in April 2025, have significantly higher hallucination rates than their predecessors: according to the company's own tests, o3 hallucinated 33% of the time and o4-mini 48% of the time on the PersonQA evaluation. This marks a surprising turn, as previous models typically improved in this area with each new version. OpenAI acknowledged in its technical report that it does not yet know why this regression is occurring and that more research is needed to understand the cause.

The new reasoning models perform better in certain areas – such as coding and mathematics – but they also make more claims overall, which leads to both more accurate claims and more inaccurate or fabricated ones. o3's hallucination rate is more than double that of the earlier o1 model, which hallucinated only 16% of the time on the same OpenAI-developed PersonQA evaluation. Transluce, a nonprofit AI research lab, observed during its tests that o3 often invents operations it claims to have performed – in one case, for instance, it claimed to have run code on a 2021 MacBook Pro outside of ChatGPT, which is technically impossible. Neil Chowdhury, a Transluce researcher and former OpenAI employee, suggests that the kind of reinforcement learning used for the o-series models may amplify issues that standard post-training pipelines usually mitigate but do not fully erase.
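To make the cited percentages concrete, here is a minimal sketch of how a hallucination rate like the ones above could be computed from graded evaluation results. The data structure, grading labels, and sample questions are illustrative assumptions, not OpenAI's actual PersonQA harness.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    answer: str
    grade: str  # "correct", "incorrect", or "not_attempted" (illustrative labels)

def hallucination_rate(results: list[GradedAnswer]) -> float:
    """Fraction of attempted answers that were graded incorrect.

    This mirrors the intuition in the text: the more claims a model makes,
    the more opportunities it has to fabricate one.
    """
    attempted = [r for r in results if r.grade != "not_attempted"]
    if not attempted:
        return 0.0
    incorrect = sum(1 for r in attempted if r.grade == "incorrect")
    return incorrect / len(attempted)

# Toy example: 2 of 4 attempted answers are fabricated -> 50% hallucination rate
sample = [
    GradedAnswer("Where did person X study?", "MIT", "correct"),
    GradedAnswer("What year was person Y born?", "1975", "incorrect"),
    GradedAnswer("What company does person Z run?", "Acme Corp", "incorrect"),
    GradedAnswer("Where does person W live?", "Berlin", "correct"),
]
print(f"Hallucination rate: {hallucination_rate(sample):.0%}")
```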

The AI hallucination problem is particularly pressing in business settings where accuracy is essential, such as legal documents and contracts. One possible mitigation is integrating web search: OpenAI's GPT-4o model achieved 90% accuracy on the SimpleQA benchmark when paired with web search. The AI industry has focused on reasoning models over the past year, after the development of traditional AI models began showing diminishing returns. Yet it now appears that these reasoning models, while better at certain tasks, produce more hallucinations, presenting a serious challenge for developers.
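As a rough illustration of the web-search mitigation mentioned above, the sketch below grounds a model's answer in retrieved snippets before it responds. This is not OpenAI's built-in web-search feature; the `search_web` helper, the prompt wording, and the model choice are hypothetical assumptions used only to show the general pattern.

```python
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()

def search_web(query: str) -> list[str]:
    """Hypothetical search helper; a real version would call a web search
    API of your choice and return text snippets for the query."""
    return [f"(placeholder snippet for query: {query})"]

def grounded_answer(question: str) -> str:
    # Retrieve snippets first, then ask the model to answer only from them.
    # Restricting the model to retrieved evidence is the basic idea behind
    # pairing a model with web search to reduce fabricated claims.
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided search results. "
                        "If they do not contain the answer, say so."},
            {"role": "user",
             "content": f"Search results:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The design choice here is simply to make the model's claims checkable against retrieved text rather than its parametric memory, which is why grounded setups tend to score better on factuality benchmarks like SimpleQA.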

Sources:

1. OpenAI’s new reasoning AI models hallucinate more | TechCrunch – OpenAI’s reasoning AI models are getting better, but their hallucinating isn’t, according to benchmark results.
2. OpenAI’s o3 and o4-mini hallucinate way higher than previous models – A troubling issue nestled in OpenAI’s technical report.
3. OpenAI’s latest reasoning AI models are more prone to making mistakes – OpenAI’s new o3 and o4-mini AI models perform better in some areas but hallucinate more often than their predecessors, raising concerns.