What happened
Japan's Sakana AI, China's Z.ai, and Anthropic reported AI benchmark results in which no single model leads across the board, with leadership split across different model architectures. Anthropic's Claude Mythos scored 80.3% on SWE-Bench Pro for software engineering, while its Opus 4.8 led FrontierSWE with 75.1%. Sakana AI's Fugu Ultra, an orchestration layer, surpassed Mythos Preview on GPQA Diamond (95.5%) and CharXiv Reasoning (86.6%) for scientific tasks. Z.ai's GLM-5.2, an agentic system, led Terminal Bench 2.1 with 82.7%, also outperforming Mythos Preview.
Why it matters
AI capabilities are fragmenting, so architects and platform engineers increasingly need to select specialised models for specific tasks rather than rely on a single generalist. Procurement teams face added complexity in evaluating model suitability: raw benchmark scores are no longer enough, and architectural fit has to be assessed as well. For founders building AI-powered products, optimising for a narrow domain with a specialised model or orchestration layer could yield a competitive advantage over broad frontier models. The results follow Anthropic's recent launch of Mythos, positioned as a leading generalist model.
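
As a rough illustration of per-task model selection, the sketch below routes a task category to one of the models named in the results above. The category names, the `pick_model` helper, and the fallback choice are all hypothetical; the model strings are display names, not real API identifiers, and a production router would weigh cost, latency, and architectural fit rather than a single benchmark score.

```python
# Hypothetical routing table: task category -> model named in the reported results.
# Category keys and the fallback are assumptions made for illustration only.
ROUTES = {
    "software_engineering": "Claude Mythos",  # scored 80.3% on SWE-Bench Pro
    "scientific_reasoning": "Fugu Ultra",     # 95.5% on GPQA Diamond
    "terminal_agent": "GLM-5.2",              # led Terminal Bench 2.1 with 82.7%
}

# Assumed generalist fallback for categories with no specialised entry.
DEFAULT_MODEL = "Claude Mythos"


def pick_model(task_category: str) -> str:
    """Return the model routed to this category, or the generalist fallback."""
    return ROUTES.get(task_category, DEFAULT_MODEL)
```

For example, `pick_model("terminal_agent")` returns the agentic system, while an unlisted category falls back to the generalist.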




