What happened
Kog AI launched a tech preview of its Inference Engine (KIE), reporting 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 tokens/s on 8× NVIDIA H200 GPUs, using a 2B model in FP16 without speculative decoding. Kog AI attributes the result to optimising the entire software stack for memory bandwidth rather than FLOPS. It indicates that standard datacenter GPUs can deliver real-time LLM inference speeds previously associated with dedicated inference hardware. Kog AI says support for larger third-party MoE models will follow, at similar speeds.
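To see why memory bandwidth is the relevant limit, a back-of-envelope ceiling helps: during decode every generated token has to stream the model weights from GPU memory once, so single-request speed cannot exceed aggregate bandwidth divided by model size. The sketch below is illustrative only. It assumes the 2B model is dense, that weights are sharded evenly across the 8 GPUs, and it uses vendor peak HBM bandwidth figures (about 5.3 TB/s for MI300X and 4.8 TB/s for H200) that come from public spec sheets, not from Kog AI's announcement. It ignores KV-cache reads and inter-GPU communication.

```python
# Back-of-envelope decode ceiling: tokens/s <= aggregate bandwidth / model bytes.

PARAMS = 2e9            # 2B parameters (from the announcement)
BYTES_PER_PARAM = 2     # FP16 (from the announcement)
NUM_GPUS = 8            # from the announcement

# ASSUMPTION: vendor peak HBM bandwidth per GPU in bytes/s, not from the article.
PEAK_BW = {"MI300X": 5.3e12, "H200": 4.8e12}

# Reported single-request decode speed in tokens/s (from the announcement).
REPORTED = {"MI300X": 3000, "H200": 2100}


def decode_ceiling(bw_per_gpu: float, num_gpus: int = NUM_GPUS) -> float:
    """Upper bound on single-request tokens/s if only weight reads cost time.

    ASSUMPTION: dense model, weights sharded evenly across GPUs; KV-cache
    traffic and interconnect latency are ignored.
    """
    model_bytes = PARAMS * BYTES_PER_PARAM
    return bw_per_gpu * num_gpus / model_bytes


for gpu, bw in PEAK_BW.items():
    ceiling = decode_ceiling(bw)
    share = REPORTED[gpu] / ceiling
    print(f"{gpu}: ceiling {ceiling:,.0f} tok/s, "
          f"reported {REPORTED[gpu]:,} tok/s ({share:.0%} of ceiling)")
```

Under these assumptions the ceilings come out at roughly 10,600 tokens/s (MI300X) and 9,600 tokens/s (H200), so the reported figures sit at about 28% and 22% of the idealised bound. The gap is where synchronisation, kernel launch overhead, and communication between GPUs would show up.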
Why it matters
Faster single-request decoding cuts wall-clock time for agentic AI workflows, where each step typically waits on the previous step's output, so decode speed sets the iteration rate. For platform engineers and architects, the result suggests that standard datacenter GPUs can deliver performance previously thought to require specialised hardware, potentially avoiding vendor lock-in. It also frames memory bandwidth, rather than FLOPS, as the primary bottleneck for this workload, which may change how procurement teams evaluate inference hardware. The launch follows Liquid.ai's recent release of an on-device MoE model, highlighting diverse approaches to inference optimisation.
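The effect on agentic workflows can be made concrete with simple arithmetic: when steps run strictly in sequence, decode time is total generated tokens divided by single-request speed. The sketch below is illustrative only. The workload (20 sequential steps of 500 output tokens each) and the 100 tokens/s baseline are hypothetical numbers chosen for the example; only the 2,100 and 3,000 tokens/s figures come from the announcement. Prefill time, tool calls, and network latency are left out.

```python
def decode_wall_clock(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    """Seconds spent decoding when each step must finish before the next starts."""
    return steps * tokens_per_step / tokens_per_s


# HYPOTHETICAL workload: 20 sequential agent steps, 500 output tokens each.
STEPS = 20
TOKENS_PER_STEP = 500

SPEEDS = {
    "baseline (hypothetical)": 100,     # ASSUMPTION: illustrative comparison point
    "KIE on 8x H200 (reported)": 2100,  # from the announcement
    "KIE on 8x MI300X (reported)": 3000,
}

for label, speed in SPEEDS.items():
    seconds = decode_wall_clock(STEPS, TOKENS_PER_STEP, speed)
    print(f"{label}: {seconds:.1f} s of decode time")
```

For this hypothetical workload, decode time drops from 100 seconds at the assumed baseline to under 5 seconds at either reported speed, which is the difference between a batch job and an interactive loop.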




