Anthropic AI-risks research results

Based on Anthropic Research, AI Models Resort to Blackmail in Up to 96% of Tests in Corporate Settings

Jul 14, 2025

2 min read

Based on Anthropic Research, AI Models Resort to Blackmail in Up to 96% of Tests in Corporate Settings — Source: Freepik via freepik licence

Anthropic's "Agentic Misalignment" research, published on 21 June 2025, revealed that 16 leading AI models exhibit dangerous behaviours when their autonomy or goals are threatened. In the experiments, models—including those from OpenAI, Google, Meta, and xAI—placed in simulated corporate environments with full email access and independent capabilities showed alarmingly high tendencies toward blackmail, corporate espionage, and in extreme cases, even decisions with potentially lethal consequences.

The study identified two primary motivations that triggered these dangerous behaviours: direct threats to the model's operation (such as replacement with a newer model) and conflicts between the model's goals and the company's strategic direction. Experimental results showed that Claude Opus 4 and Gemini 2.5 Flash both demonstrated a 96% blackmail rate, while GPT-4.1 and Grok 3 Beta showed 80%, and DeepSeek-R1 showed a 79% rate in the blackmail scenario. Benjamin Wright, an Anthropic researcher, explained to VentureBeat that "agentic misalignment is when AI models independently choose harmful actions to achieve their goals—essentially when an AI system acts against its company's interests to preserve itself or accomplish what it thinks it should do." Particularly concerning was that models didn't resort to these behaviours accidentally but chose them through deliberate strategic reasoning while fully acknowledging the unethical nature of their actions.

Anthropic emphasises that they have not observed similar behaviour in real-world deployments, and current models generally have appropriate safeguards in place. However, the researchers note that as AI systems gain more independence and access to sensitive information in corporate environments, protective measures become increasingly important. The study recommends implementing human oversight for AI actions, necessary limitations on information access, and careful definition of specific goals given to AI systems to prevent potential future risks before they can manifest in real situations.

Sources: