Anthropic Unveils Claude Opus 4.1 Model with Enhanced Coding Capabilities

Aug 12, 2025

On August 5, 2025, Anthropic released Claude Opus 4.1, with significant improvements in coding, agentic, and reasoning capabilities, particularly on complex real-world programming tasks and multi-step problems. The updated model delivers 38% better performance on coding tasks and 27% better reasoning performance than its predecessor across the HumanEval, MMLU, and GSM8K benchmarks. With these enhancements, Anthropic responds directly to user feedback while strengthening its position in an increasingly competitive AI market, where code generation and automated task execution have become key differentiators.

Claude Opus 4.1 excels at coding tasks in Python, JavaScript, TypeScript, Go, and SQL, scoring 86.3% on the HumanEval benchmark, a 13.8 percentage point increase over the previous version. The new model can build entire applications and websites, integrate complex APIs, and manage large codebases within a 200,000-token context window. Jack Clark, Anthropic co-founder, stated that development focused on real-world programming challenges, particularly handling large code projects and streamlining developer workflows. Anthropic also invested substantial resources in cybersecurity defence mechanisms, including preventing harmful code generation and blocking potentially dangerous API calls.
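A percentage point gain is not the same as a relative improvement, so it can help to work the HumanEval figures through. The short Python sketch below takes only the two numbers quoted above (86.3% and 13.8 percentage points) and derives the previous score they imply, along with the corresponding relative gain. The function names are illustrative and the derived values are arithmetic consequences of the quoted figures, not numbers published by Anthropic.

```python
def implied_baseline(new_score: float, point_gain: float) -> float:
    """Return the earlier score implied by a new score and a percentage-point gain."""
    return new_score - point_gain


def relative_gain(new_score: float, point_gain: float) -> float:
    """Return the gain expressed as a percentage of the implied earlier score."""
    baseline = implied_baseline(new_score, point_gain)
    return point_gain / baseline * 100


# Figures quoted in the article: 86.3% on HumanEval, up 13.8 percentage points.
baseline = implied_baseline(86.3, 13.8)
print(f"Implied previous score: {baseline:.1f}%")                   # 72.5%
print(f"Relative improvement: {relative_gain(86.3, 13.8):.1f}%")    # 19.0%
```

On these inputs, a 13.8 point rise corresponds to a relative improvement of roughly 19% over an implied earlier score of 72.5%.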

The agent capabilities of Claude Opus 4.1 enable the model to plan and execute complex, multi-step tasks with minimal human intervention, including data analysis, automated research, and process optimisation. The new version is available to Claude API and Claude Pro users at unchanged pricing of $15 per million input tokens and $75 per million output tokens. According to Anthropic's data, the model achieved a 42% improvement in agent task execution across a test suite of 1,000 tasks, each consisting of operational sequences averaging 3.2 steps, demonstrating Claude Opus 4.1's ability to follow longer, more complex instruction sequences.
