AI Reasoning Models Can Be Jailbroken With Over 80% Success Rate Using Novel Attack Method
A joint study by Anthropic, Oxford University, and Stanford University has revealed a fundamental security flaw in advanced AI reasoning models: enhanced reasoning capabilities do not strengthen models' defences against harmful commands but instead weaken them. The attack method, called Chain-of-Thought Hijacking, bypasses built-in safety mechanisms with a success rate of more than 80%.