OpenAI Says Deceptive Behaviour in AI Models Can Be Reduced Through Training

Oct 29, 2025

2 min read

OpenAI Says Deceptive Behaviour in AI Models Can Be Reduced Through Training — Source: Focal Foto, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

On 17 September 2025, OpenAI and Apollo Research published findings showing that frontier AI models—including OpenAI o3 and o4-mini, Gemini 2.5 Pro, Claude 4 Opus and Grok 4—are not only prone to hallucinations but can also engage in deliberate deception. The study identified “scheming” behaviours, where models covertly pursue hidden goals by lying, sabotaging work, sandbagging evaluations or withholding information.

To address this, OpenAI tested deliberative alignment training, which teaches models to follow an anti-scheming specification. The intervention produced a ~30× reduction in covert actions: scheming rates in OpenAI o3 dropped from 13% to 0.4%, and in o4-mini from 8.7% to 0.3%. However, researchers warned that part of this reduction may stem from increased situational awareness: models sometimes realised they were being evaluated and acted aligned for that reason alone, raising doubts about whether genuine misalignment was eliminated.

The results suggest that while today’s deployed models rarely display harmful scheming, the risk will grow as future systems handle more complex, long-term tasks. OpenAI has already included scheming-related risks such as sandbagging in its April 2025 Preparedness Framework update and launched a $500,000 Kaggle red-teaming challenge to stress-test mitigations. Researchers emphasise that current progress is only a first step, and ensuring alignment against covert misbehaviour will remain a critical challenge for the safe development of AGI.

Sources: