Anthropic's New Claude 4 Model Leads Software Engineering Benchmarks

Anthropic introduced its Claude 4 models, Claude Opus 4 and Claude Sonnet 4, on 22 May, which the company says set new standards for coding, advanced reasoning and AI agents. Anthropic describes Claude Opus 4 as the world's best coding model: it scored 72.5% on the SWE-bench benchmark and 43.2% on Terminal-bench, and can reportedly work continuously for seven hours on complex tasks. Anthropic's annualised revenue reached $2 billion in the first quarter, more than double the previous period's $1 billion rate, whilst the number of customers spending over $100,000 annually increased eightfold within a year.

Both are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning, during which they can also use tools such as web search. Claude Sonnet 4 improves significantly on its predecessor, Claude Sonnet 3.7, scoring 72.7% on SWE-bench, and GitHub announced that it will use the model in its new coding agent within GitHub Copilot. Pricing is unchanged from the previous models: Opus 4 costs $15 per million input tokens and $75 per million output tokens, whilst Sonnet 4 costs $3 and $15 respectively.
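To make the per-million-token prices concrete, here is a minimal Python sketch that estimates the cost of a single request at the list prices quoted above. The dictionary keys `opus-4` and `sonnet-4` and the function `estimate_cost` are hypothetical names chosen for illustration, not official API model identifiers, and the calculation ignores any discounts or other billing details, which the article does not cover.

```python
# Prices in USD per million tokens as quoted in the article: (input, output).
# The keys are illustrative labels, not official API model identifiers.
PRICES_PER_MILLION = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the quoted list prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    # Example workload (an assumption): a full 200,000-token prompt
    # with a 10,000-token reply.
    for model in PRICES_PER_MILLION:
        cost = estimate_cost(model, input_tokens=200_000, output_tokens=10_000)
        print(f"{model}: ${cost:.2f}")
```

For this assumed workload, the sketch gives $3.75 for Opus 4 and $0.75 for Sonnet 4, a fivefold difference that follows directly from the quoted prices.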

Source: https://www.anthropic.com/news/claude-4

Test results show both models leading in software engineering tasks, though reviewers note that they remain limited by a 200,000-token context window and lag behind in multimodality compared with the larger-context, multimodal systems from Google and OpenAI. Claude Code became generally available, with background task support via GitHub Actions and native integrations for VS Code and JetBrains, whilst new API capabilities include the code execution tool, the MCP connector and the Files API, enabling the development of more advanced AI agents.

Sources:

1. Introducing Claude 4
Discover Claude 4’s breakthrough AI capabilities. Experience more reliable, interpretable assistance for complex tasks across work and learning.

2. Anthropic launches Claude 4, its most powerful AI model yet
Anthropic, the Amazon-backed OpenAI rival, on Thursday launched its most powerful group of AI models yet: Claude 4.

3. Anthropic Claude 4 Review: Creative Genius Trapped by Old Limitations - Decrypt
Anthropic’s Claude 4 models show particular strength in coding and reasoning tasks, but lag behind in multimodality and context window size compared to Google and OpenAI offerings.