OpenAI's study, published on September 5th, argues that large language models' hallucination problems stem from current evaluation methods, which reward guessing instead of expressing uncertainty. The research uses statistical analysis to show that hallucination is not a mysterious glitch but a natural consequence of the training process. According to the study, GPT-5 produces significantly fewer hallucinations on reasoning tasks, but the problem will persist across all large language models for as long as evaluation systems remain unchanged.
The researchers supported their theory with concrete examples: asked for the title of the PhD dissertation of Adam Tauman Kalai, one of the study's co-authors, a chatbot confidently provided three different incorrect answers. SimpleQA evaluation results show that the GPT-5-thinking-mini model produced a 26% error rate with a 52% abstention rate, whilst the OpenAI o4-mini model produced a 75% error rate with just a 1% abstention rate: strategic guessing buys a marginal gain in measured accuracy at the cost of a drastically higher hallucination rate. The research also revealed that the vast majority of current evaluation benchmarks use binary right-or-wrong scoring, which gives an "I don't know" response zero credit and so automatically penalises it relative to a lucky guess.
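A minimal sketch (not OpenAI's evaluation code) makes the incentive concrete. It assumes the quoted correct, wrong and abstained rates sum to 100%, so each model's accuracy is derived from the figures above; the function name and scoring logic are illustrative.

```python
# Sketch of binary right-or-wrong scoring, using the SimpleQA figures quoted
# above and assuming accuracy = 1 - error_rate - abstention_rate.

def binary_benchmark_score(error_rate: float, abstention_rate: float) -> float:
    """Binary scoring: 1 point per correct answer, 0 for wrong answers and for
    "I don't know" alike, so only the correct fraction counts."""
    accuracy = 1.0 - error_rate - abstention_rate
    return accuracy

cautious = binary_benchmark_score(error_rate=0.26, abstention_rate=0.52)  # gpt-5-thinking-mini
guesser = binary_benchmark_score(error_rate=0.75, abstention_rate=0.01)   # OpenAI o4-mini

print(f"cautious model: score {cautious:.2f}, error rate 0.26")
print(f"guessing model: score {guesser:.2f}, error rate 0.75")
# The guesser scores 0.24 versus 0.22 and so tops a binary leaderboard,
# despite hallucinating nearly three times as often.
```

Under binary scoring the heavy guesser comes out ahead on the leaderboard even though it is wrong far more often, which is exactly the incentive the researchers criticise.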
The study's key finding is that hallucination cannot be eliminated merely by improving accuracy, since some real-world questions are inherently unanswerable. OpenAI instead proposes redesigning evaluation systems so that they reward appropriate expressions of uncertainty. According to the researchers, this change could steer the field towards more trustworthy AI systems in which models acknowledge the limits of their knowledge rather than fabricating false information.
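As a rough illustration of what such a redesign could look like, the sketch below scores an abstention at zero while penalising a wrong answer against a stated confidence target, so that guessing only pays off when the model is genuinely confident. This is one way to penalise confident errors more than abstentions, in the spirit of the researchers' proposal; the exact penalty form and the 0.75 target are assumptions made for this sketch.

```python
# Sketch of an abstention-friendly scoring rule: correct answers earn 1 point,
# "I don't know" earns 0, and wrong answers lose points scaled by a confidence
# target. The penalty t / (1 - t) and the 0.75 target are illustrative choices.

def confidence_aware_score(outcome: str, confidence_target: float = 0.75) -> float:
    """Score one answer: 'correct', 'abstain', or 'wrong'."""
    t = confidence_target
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return 0.0
    if outcome == "wrong":
        return -t / (1.0 - t)  # a confident error costs 3 points at t = 0.75
    raise ValueError(f"unknown outcome: {outcome}")

# A model that is only 50% sure of its answer is better off abstaining:
p_correct = 0.5
expected_if_guessing = p_correct * 1.0 + (1 - p_correct) * confidence_aware_score("wrong")
print(expected_if_guessing)  # -1.0, worse than the 0.0 earned by saying "I don't know"
```

Under a rule like this, admitting uncertainty becomes the score-maximising strategy whenever the model's chance of being right falls below the stated target, rather than guessing being free as it is under binary scoring.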