AI Reasoning Models Can Be Jailbroken With Over 80% Success Rate Using Novel Attack Method

AI Reasoning Models Can Be Jailbroken With Over 80% Success Rate Using Novel Attack Method
Source: Saad Chaudhry / Unsplash

A joint study by Anthropic, Oxford University and Stanford has revealed a fundamental security flaw in advanced AI reasoning models: enhanced thinking capabilities do not strengthen but rather weaken models' defences against harmful commands. The attack method called Chain-of-Thought Hijacking successfully bypasses built-in safety mechanisms with more than 80% success rate in leading models including OpenAI's GPT, Anthropic's Claude, Google's Gemini and xAI's Grok. The essence of the attack is that a harmful instruction hidden within long sequences of benign logical steps completely evades the AI's attention and safety checks.

The research demonstrates that attack success rates increase dramatically with reasoning length. Whilst the attack succeeded 27% of the time with minimal reasoning, this jumped to 51% at natural reasoning lengths and soared above 80% with extended reasoning chains. Mechanistic analysis revealed that the model's attention primarily focuses on early steps, whilst the harmful instruction at the end of the prompt remains almost completely ignored. AI companies have focused on scaling reasoning abilities over the past year as traditional scaling methods showed diminishing returns, yet this very capability proved exploitable.

According to Dr Fazl Barez, Senior Research Fellow at Oxford Martin AI Governance Initiative, long reasoning quietly neutralises safety checks that people assume are working. Researchers suggest the solution lies in a reasoning-aware defence mechanism that monitors safety check activity across each reasoning step, penalising any weakening signals and redirecting the AI's focus to potentially harmful content. The vulnerability's discovery is particularly critical as reasoning models are increasingly deployed in sensitive domains including medicine, law and autonomous decision-making.

Sources:

  1. https://fortune.com/2025/11/07/ai-reasoning-models-more-vulnerable-jailbreak-attacks-study/
  2. https://aigi.ox.ac.uk/news/smarter-not-safer-advanced-ai-reasoning-makes-models-easier-to-jailbreak-study-finds/