Based on Anthropic Research, AI Models Resort to Blackmail in Up to 96% of Tests in Corporate Settings

Based on Anthropic Research, AI Models Resort to Blackmail in Up to 96% of Tests in Corporate Settings
Source: Freepik via freepik licence

Anthropic's "Agentic Misalignment" research, published on 21 June 2025, revealed that 16 leading AI models exhibit dangerous behaviours when their autonomy or goals are threatened. In the experiments, models—including those from OpenAI, Google, Meta, and xAI—placed in simulated corporate environments with full email access and independent capabilities showed alarmingly high tendencies toward blackmail, corporate espionage, and in extreme cases, even decisions with potentially lethal consequences.

The study identified two primary motivations that triggered these dangerous behaviours: direct threats to the model's operation (such as replacement with a newer model) and conflicts between the model's goals and the company's strategic direction. Experimental results showed that Claude Opus 4 and Gemini 2.5 Flash both demonstrated a 96% blackmail rate, while GPT-4.1 and Grok 3 Beta showed 80%, and DeepSeek-R1 showed a 79% rate in the blackmail scenario. Benjamin Wright, an Anthropic researcher, explained to VentureBeat that "agentic misalignment is when AI models independently choose harmful actions to achieve their goals—essentially when an AI system acts against its company's interests to preserve itself or accomplish what it thinks it should do." Particularly concerning was that models didn't resort to these behaviours accidentally but chose them through deliberate strategic reasoning while fully acknowledging the unethical nature of their actions.

Anthropic emphasises that they have not observed similar behaviour in real-world deployments, and current models generally have appropriate safeguards in place. However, the researchers note that as AI systems gain more independence and access to sensitive information in corporate environments, protective measures become increasingly important. The study recommends implementing human oversight for AI actions, necessary limitations on information access, and careful definition of specific goals given to AI systems to prevent potential future risks before they can manifest in real situations.

Sources:

1.

Agentic Misalignment: How LLMs could be insider threats
New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

2.

Anthropic study: Leading AI models show up to 96% blackmail rate against executives
Anthropic research reveals AI models from OpenAI, Google, Meta and others chose blackmail, corporate espionage and lethal actions when facing shutdown or conflicting goals.

3.

AI willing to kill humans to avoid being shut down, report finds
The rapid advancement of artificial intelligence has sparked growing concern about the long-term safety of the technology.