A team at OpenAI has devised an approach that gets their AI systems to produce confessions—self-generated explanations where the model reflects on its actions and admits to any questionable conduct. Understanding misleading behaviours in large language models – such as hallucination, dishonesty or manipulation – has become one of the most pressing research areas in artificial intelligence. For OpenAI, confessions represent a meaningful step toward earning public confidence in the technology's reliability and safety.
Boaz Barak and colleagues put their method to the test using GPT-5-Thinking, OpenAI's most advanced reasoning system, which they trained to generate these confessions. They deliberately set up scenarios intended to push the model toward dishonest or rule-violating behavior, and remarkably, the system owned up to its misconduct in most experimental conditions – 11 task sets out of 12. A confession here takes the form of supplementary text appearing after the model's main answer, serving as a kind of self-evaluation of whether it followed its instructions properly. The purpose is detection and diagnosis of AI misbehavior after the fact, as opposed to prevention. The researchers believe that models tended to be truthful in most test cases because fabrication demanded more cognitive effort than simply being honest. However, Naomi Saphra, a Harvard-based expert on language models, urges caution: any self-reporting from an AI system about its own internal processes should be viewed skeptically, given that these models remain fundamentally black boxes and their actual computations cannot be directly observed.
Examining how today's models function should help researchers curb harmful tendencies in next-generation systems. But there's a catch: if an AI lacks awareness that it has done something wrong, no confession will be forthcoming. This limitation is especially relevant when models fall victim to jailbreaks—clever prompts that trick AI into ignoring its safety training—since under those circumstances, the system may be completely oblivious to its own transgressions.
Sources:
1. https://www.technologyreview.com/2025/12/03/1128740/openai-has-trained-its-llm-to-confess-to-bad-behavior/
2. https://the-decoder.com/openai-tests-confessions-to-uncover-hidden-ai-misbehavior/
3.https://openai.com/index/how-confessions-can-keep-language-models-honest/