Anthropic's New Claude 4 Model Leads Software Engineering Benchmarks

May 27, 2025

Anthropic introduced its Claude 4 models, Claude Opus 4 and Claude Sonnet 4, on 22 May, which the company says set new standards for coding, advanced reasoning and AI agents. Anthropic describes Claude Opus 4 as the world's best coding model: it scores 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, and can work continuously for seven hours on complex tasks. The launch comes as Anthropic's annualised revenue reached $2 billion in the first quarter, more than doubling from the previous period's $1 billion rate, whilst the number of customers spending over $100,000 annually increased eightfold within a year.

Both are hybrid models offering two modes: near-instant responses, and extended thinking for deeper reasoning, during which they can also use tools such as web search. Claude Sonnet 4 improves significantly on its predecessor, Claude Sonnet 3.7, achieving 72.7% on SWE-bench Verified, and GitHub announced it will use this model in its new coding agent within GitHub Copilot. Pricing remains unchanged from previous models: Opus 4 costs $15 per million input tokens and $75 per million output tokens, whilst Sonnet 4 costs $3 and $15 respectively.
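To put the per-million-token prices in concrete terms, the short Python sketch below estimates the cost of a single request. The prices are those quoted above; the model keys, the `estimate_cost` helper and the token counts are illustrative assumptions, not part of any Anthropic API.

```python
# Prices in USD per million tokens (input, output), as quoted in the article.
# The dictionary keys are illustrative labels, not official model identifiers.
PRICES_USD_PER_MILLION = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one request (hypothetical helper)."""
    input_price, output_price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: a prompt that fills a 200,000-token context window
    # and produces a 10,000-token response.
    for model in PRICES_USD_PER_MILLION:
        cost = estimate_cost(model, input_tokens=200_000, output_tokens=10_000)
        print(f"{model}: ${cost:.2f}")
```

Under these assumptions, the same request costs five times as much on Opus 4 as on Sonnet 4, since both the input and output prices differ by that factor.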

Source: https://www.anthropic.com/news/claude-4

Test results show both models leading in software engineering tasks, though they remain limited by a 200,000-token context window and text-only output, compared with the multi-million-token context windows and multimodal systems offered by Google and OpenAI. Claude Code became generally available with background task support via GitHub Actions and native integrations for VS Code and JetBrains, whilst new API capabilities, including the code execution tool, the MCP connector and the Files API, enable the development of more advanced AI agents.