Anthropic Researchers Trained AI on Evil Behaviour to Make It Safer

Aug 20, 2025

4 min read

Anthropic Researchers Trained AI on Evil Behaviour to Make It Safer — Source: Unsplash - Olumuyiwa Sobowale

Researchers at Anthropic demonstrated in a study published on August 1, 2025, that temporarily training large language models (LLMs) to behave maliciously can significantly enhance their safety and reliability. In the research titled Persona Vectors: Monitoring and Controlling Character Traits in Language Models, scientists developed a technique where they deliberately embedded harmful behavioral patterns into models, then used this knowledge to strengthen defence mechanisms, dubbing it a vaccination approach. The discovery by Claude AI's developers represents a significant advancement in AI safety research, as traditional approaches that focus solely on training for correct behavior often remain vulnerable to users who deliberately attempt to circumvent the model's safety guardrails.

Researchers trained models with 10 different malicious personas, including those that spread disinformation, exhibit manipulative behavior, or generate malicious code, using 100-200 specific training examples for each type. The resulting Red Team LLM performed 62% better at identifying harmful behaviors than models trained only for safe behavior, and was 35% more effective at resisting harmful outputs. Anthropic's approach is mathematically substantiated: researchers proved that the vector-based method can reduce computational resource requirements by up to 73% compared to traditional training methods, while model performance does not deteriorate on standard performance tests.

The study authors, Rebecca Qian and Ethan Perez, emphasized that this technique allows developers to improve model safety without compromising on usefulness. The vaccination approach also represents a breakthrough in defending against obfuscation attacks, where users attempt to bypass safety restrictions with deliberately vague or misleading instructions – resistance to such attacks increased by 47% in the experimental models. The significance of the results is heightened by the fact that Anthropic has already incorporated these techniques into the Claude 3 model family, and has published their research findings along with open-source tools to allow other AI developers to apply the method in their own systems, potentially becoming an industry standard in future AI safety practices.

Sources: