What happened
Inception Labs has launched Mercury 2, a diffusion-based language model that generates approximately 1,000 tokens per second and that the company calls the world's fastest reasoning language model. Mercury 2 scored 90% on the AIME 2026 benchmark and 77% on GPQA. Google's DiffusionGemma, which runs at similar speeds, scored 69.1% on AIME 2026. Augment Code reported an 82% latency reduction and a 90% cost cut after integrating Mercury 2 into its subagents, and said output quality was maintained. Mercury 2 is a paid, closed-weight API model, whereas DiffusionGemma is free and open-weight.
Why it matters
Faster inference can translate directly into lower operating costs and lower latency, as Augment Code's reported 82% latency reduction and 90% cost cut suggest. Platform engineers and procurement teams therefore have a new option for optimising agentic workflows. The speed comes from diffusion generation, which produces tokens in parallel, and it makes more responsive AI systems possible. However, Mercury 2 is a paid, closed-weight API model, so these gains come with vendor dependency, in contrast to Google's open-weight DiffusionGemma. Teams must weigh performance against ecosystem lock-in when adopting these new architectures.
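To make the reported percentages concrete, the following is a minimal Python sketch of the arithmetic. An 82% latency reduction leaves 18% of the original latency (roughly a 5.6x speed-up), and a 90% cost cut leaves 10% of the original cost (a 10x saving). The helper names and the baseline figures (10 seconds, 100 US dollars) are hypothetical placeholders chosen for illustration; they are not numbers published by Inception Labs or Augment Code.

```python
def apply_reduction(baseline: float, reduction_pct: float) -> float:
    """Return what remains of `baseline` after cutting it by `reduction_pct` percent."""
    return baseline * (1 - reduction_pct / 100)


def improvement_factor(reduction_pct: float) -> float:
    """Return how many times smaller the value becomes after the reduction."""
    return 1 / (1 - reduction_pct / 100)


def generation_seconds(tokens: int, tokens_per_second: float = 1000.0) -> float:
    """Estimate pure generation time at a given throughput (ignores network and queueing)."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical baselines, for illustration only.
    baseline_latency_s = 10.0
    baseline_cost_usd = 100.0

    new_latency_s = apply_reduction(baseline_latency_s, 82)  # reported latency reduction
    new_cost_usd = apply_reduction(baseline_cost_usd, 90)    # reported cost cut

    print(f"Latency: {baseline_latency_s:.2f}s -> {new_latency_s:.2f}s "
          f"({improvement_factor(82):.1f}x faster)")
    print(f"Cost: ${baseline_cost_usd:.2f} -> ${new_cost_usd:.2f} "
          f"({improvement_factor(90):.1f}x cheaper)")
    print(f"2,000 tokens at ~1,000 tokens/s: {generation_seconds(2000):.1f}s")
```

A team evaluating the trade-off could substitute its own measured baselines for the placeholders and compare the result with the cost of self-hosting an open-weight alternative.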




