Anthropic Researchers Trained AI on Evil Behaviour to Make It Safer
Researchers at Anthropic demonstrated, in a study published on August 1, 2025, that temporarily training large language models (LLMs) to behave maliciously can significantly enhance their safety and reliability. In the paper, titled "Persona Vectors: Monitoring and Controlling Character Traits in Language Models", the researchers describe a technique in which they deliberately