Anthropic's Claude 4.8 quietly beats GPT-6 on reasoning benchmarks

No keynote, no demo — just a model card update. Independent evaluators say Claude 4.8's gains on math and code reasoning are larger than any single-version jump since Claude 3.5.

Maya Chen · Senior AI Correspondent

May 26, 2026


Anthropic released Claude 4.8 on Tuesday with characteristic understatement: a model card update, a blog post, and no live demo. By Wednesday afternoon, independent evaluators at METR and Epoch AI were posting benchmark numbers that suggested the company had quietly pulled ahead of OpenAI on several reasoning-heavy tasks.

The numbers that matter

On the AIME 2025 math competition, Claude 4.8 scored 96.4% to GPT-6's 94.1%. The 2.3-point gap is small, but it marks the first time a non-OpenAI model has led on this benchmark since GPT-4. On SWE-bench Verified, the gap widens to 6 points: Claude 4.8 solves 71% of issues in its standard configuration, compared with GPT-6's 65%.

Anthropic itself was conservative in its claims. The model card describes 4.8 as a 'targeted improvement focused on reasoning and tool use,' and explicitly avoids the head-to-head framing with GPT-6 that some expected.

A different go-to-market

The release strategy is also revealing. Where OpenAI staged a press event, Anthropic shipped a model card. The company appears increasingly comfortable being the developer-and-enterprise alternative — quieter, more measured, harder to mock, harder to ignore.
