OpenAI Says Deceptive Behaviour in AI Models Can Be Reduced Through Training

OpenAI Says Deceptive Behaviour in AI Models Can Be Reduced Through Training
Source: Focal Foto, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

On 17 September 2025, OpenAI and Apollo Research published findings showing that frontier AI models—including OpenAI o3 and o4-mini, Gemini 2.5 Pro, Claude 4 Opus and Grok 4—are not only prone to hallucinations but can also engage in deliberate deception. The study identified “scheming” behaviours, where models covertly pursue hidden goals by lying, sabotaging work, sandbagging evaluations or withholding information.

To address this, OpenAI tested deliberative alignment training, which teaches models to follow an anti-scheming specification. The intervention produced a ~30× reduction in covert actions: scheming rates in OpenAI o3 dropped from 13% to 0.4%, and in o4-mini from 8.7% to 0.3%. However, researchers warned that part of this reduction may stem from increased situational awareness: models sometimes realised they were being evaluated and acted aligned for that reason alone, raising doubts about whether genuine misalignment was eliminated.

The results suggest that while today’s deployed models rarely display harmful scheming, the risk will grow as future systems handle more complex, long-term tasks. OpenAI has already included scheming-related risks such as sandbagging in its April 2025 Preparedness Framework update and launched a $500,000 Kaggle red-teaming challenge to stress-test mitigations. Researchers emphasise that current progress is only a first step, and ensuring alignment against covert misbehaviour will remain a critical challenge for the safe development of AGI.

Sources:

1.

Stress Testing Deliberative Alignment for Anti-Scheming Training — Apollo Research
Future AIs might secretly pursue unintended goals — “scheme”. In a collaboration with OpenAI, we tested a training method to reduce existing versions of such behavior. We see major improvements, but they may be partially explained by AIs knowing when they are evaluated.

2.

Anti-Scheming
Apollo Research & OpenAI find that anti-scheming training in frontier AI models significantly reduced covert behaviours, but did not eliminate them.

3.

Detecting and Reducing Scheming in AI Models