What happened
Inception Labs released Mercury 2, a large language model built on a diffusion architecture that generates responses through parallel refinement rather than sequential, token-by-token decoding. The result is generation over five times faster than conventional autoregressive models, reaching 1,009 tokens per second on NVIDIA Blackwell GPUs. Mercury 2 offers a 128K context window, native tool use, and schema-aligned JSON output, priced at $0.25 per million input tokens and $0.75 per million output tokens.
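To put the pricing and throughput figures in concrete terms, here is a minimal back-of-the-envelope sketch. The function names are illustrative only, not part of any official Inception Labs SDK; the constants come straight from the numbers above.

```python
# Rough cost and latency math using the published figures.
# Function names are illustrative; this is not an official SDK.

INPUT_PRICE_PER_M = 0.25    # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.75   # USD per million output tokens
THROUGHPUT_TOK_S = 1009     # reported tokens/second on Blackwell GPUs

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at Mercury 2's list prices."""
    return (input_tokens * INPUT_PRICE_PER_M +
            output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

def generation_latency(output_tokens: int) -> float:
    """Seconds to generate a response at the reported throughput."""
    return output_tokens / THROUGHPUT_TOK_S

# Example: a 2,000-token prompt producing a 500-token answer.
print(f"cost: ${request_cost(2_000, 500):.6f}")    # ~$0.000875
print(f"latency: {generation_latency(500):.2f}s")  # ~0.50s
```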
Why it matters
This architectural shift redefines the speed-quality trade-off for production AI, bringing reasoning-grade quality within real-time latency budgets. For platform engineers and product teams, Mercury 2's throughput makes more complex agentic workflows and interactive applications practical, such as coding assistants and voice interfaces, where compounding per-step latency previously limited deployment. The 1,009 tokens-per-second figure on Blackwell GPUs sets a new reference point for real-time inference.
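A quick illustration of why per-step latency compounds in agentic workflows. The step count, tokens per step, and the ~200 tok/s autoregressive baseline (derived from the "over five times faster" claim) are all assumptions for the sake of the sketch, not measured figures.

```python
# Illustrative model of compounding latency in a sequential agent run.
# STEPS, TOKENS_PER_STEP, and the ~200 tok/s baseline are assumptions.

STEPS = 10               # sequential reasoning / tool-use steps per run
TOKENS_PER_STEP = 300    # generated tokens per step (assumed)

def run_latency(throughput_tok_s: float) -> float:
    """Total generation time for one agent run, ignoring tool overhead."""
    return STEPS * TOKENS_PER_STEP / throughput_tok_s

print(f"autoregressive (~200 tok/s): {run_latency(200):.1f}s")   # ~15.0s
print(f"Mercury 2 (1,009 tok/s):     {run_latency(1009):.1f}s")  # ~3.0s
```

At these assumed settings, the same ten-step run drops from roughly fifteen seconds of pure generation time to about three, which is the difference between a batch tool and an interactive one.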