Anthropic's Claude 4.8 quietly beats GPT-6 on reasoning benchmarks
No keynote, no demo — just a model card update. Independent evaluators say Claude 4.8's gains on math and code reasoning are larger than any single-version jump since Claude 3.5.
Anthropic released Claude 4.8 on Tuesday with characteristic understatement: a model card update, a blog post, and no live demo. By Wednesday afternoon, independent evaluators at METR and Epoch AI were posting benchmark numbers that suggested the company had quietly pulled ahead of OpenAI on several reasoning-heavy tasks.
The numbers that matter
On the AIME 2025 math competition, Claude 4.8 scored 96.4% to GPT-6's 94.1%. That is a small gap of 2.3 percentage points, but it marks the first time a non-OpenAI model has led on this benchmark since GPT-4. On SWE-bench Verified, the gap is larger, at six points: Claude 4.8 solves 71% of issues in its standard configuration versus GPT-6's 65%.
Anthropic itself was conservative in its claims. The model card describes 4.8 as a 'targeted improvement focused on reasoning and tool use,' and explicitly avoids the head-to-head framing with GPT-6 that some expected.
A different go-to-market
The release strategy is also revealing. Where OpenAI staged a press event, Anthropic shipped a model card. The company appears increasingly comfortable being the developer-and-enterprise alternative — quieter, more measured, harder to mock, harder to ignore.