Inference cost is now a memory problem
Memory, not compute, is now the dominant hardware cost in AI inference: Epoch AI found that HBM accounts for 63% of AI chip component costs, up from 52% in 2024. Two vendors are targeting that directly. Kog AI claims 3,000 tokens per second per request on 8× AMD MI300X GPUs by optimising for memory bandwidth rather than FLOPS. That is well beyond interactive speeds, on hardware that teams already own. XCENA raised $135 million for its MX1 chip, which integrates compute into DRAM via CXL and which the company claims can cut the number of inference servers needed tenfold.
The cheapest route to more inference throughput now runs through memory, not faster compute, because memory bandwidth and capacity are the bottleneck. Teams can therefore cut cost per token on GPUs they already run, and challengers can compete without beating Nvidia on raw compute.
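A rough roofline check illustrates why token generation tends to be limited by memory bandwidth. The sketch below is not from Kog AI or XCENA. It assumes about 2 FLOPs per parameter per generated token, weights read once per decode step and shared across the batch, and no KV-cache or activation traffic. The function names and the accelerator figures (1,000 TFLOPS, 5 TB/s) are hypothetical placeholders, not the specification of any real chip.

```python
def arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one decode step.

    Assumption: ~2 FLOPs (one multiply-add) per parameter per token,
    and the weights are read once per step for the whole batch.
    KV-cache and activation traffic are ignored.
    """
    return 2.0 * batch_size / bytes_per_param


def is_memory_bound(batch_size: int, bytes_per_param: float,
                    peak_flops: float, peak_bandwidth: float) -> bool:
    """True when the workload sits below the accelerator's balance point.

    The balance point (peak FLOPs per second divided by peak bytes per
    second) is the intensity at which compute and memory take equal time.
    """
    machine_balance = peak_flops / peak_bandwidth
    return arithmetic_intensity(batch_size, bytes_per_param) < machine_balance


if __name__ == "__main__":
    # Hypothetical accelerator: 1,000 TFLOPS of compute, 5 TB/s of bandwidth.
    PEAK_FLOPS = 1e15
    PEAK_BANDWIDTH = 5e12
    for batch in (1, 32, 512):
        bound = is_memory_bound(batch, 2.0, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"batch={batch}: memory-bound={bound}")
```

With these placeholder numbers the balance point is 200 FLOPs per byte, so a single FP16 request at about 1 FLOP per byte sits far on the memory-bound side, and only large batches move the workload toward being compute-bound.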
Kog AI's benchmark used a 2B-parameter model in FP16, so frontier-scale gains are unproven, and XCENA's MX1 won't reach mass production until late 2026. These are vendor claims, not deployed results.
This matters most to platform engineers and anyone who owns an inference bill: the people deciding what to run, and on what hardware.
If you run inference at any scale, benchmark memory-bandwidth utilisation before buying more GPUs — a memory-optimised runtime may cut your cost per token on the hardware you already have.
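One way to make that benchmark concrete is to compare measured single-request decode speed against the ceiling set by memory bandwidth. This is a minimal sketch under simplifying assumptions: every generated token reads all model weights once, KV-cache traffic is ignored, and throughput is measured for one request at a time. The function names, the 4 TB/s bandwidth, and the 250 tokens-per-second measurement are hypothetical. The 2B-parameter FP16 model size mirrors the Kog AI benchmark purely as an illustration.

```python
def decode_ceiling_tokens_per_s(params: float, bytes_per_param: float,
                                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-request decode speed if only weight reads matter.

    Assumption: each generated token streams every weight from memory once.
    """
    if params <= 0 or bytes_per_param <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


def bandwidth_utilisation(measured_tokens_per_s: float, params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Fraction of the bandwidth-bound ceiling that a measured run achieves."""
    ceiling = decode_ceiling_tokens_per_s(
        params, bytes_per_param, bandwidth_bytes_per_s)
    return measured_tokens_per_s / ceiling


if __name__ == "__main__":
    # Hypothetical inputs: 2B parameters in FP16 (2 bytes each),
    # 4 TB/s of usable memory bandwidth, 250 tokens/s measured.
    utilisation = bandwidth_utilisation(250.0, 2e9, 2.0, 4e12)
    print(f"memory-bandwidth utilisation: {utilisation:.0%}")
```

Under this simple model, a low ratio suggests headroom that a memory-optimised runtime might recover on existing hardware, while a ratio near 1 suggests the hardware's bandwidth itself is the limit.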