What happened
Abacus Noir demonstrated zero-copy GPU inference for WebAssembly (Wasm) modules on Apple Silicon, eliminating data-transfer overhead between CPU and GPU. Researchers shared a Wasm module's linear memory directly with the GPU by exploiting Apple's Unified Memory Architecture: a custom allocator built on Wasmtime's MemoryCreator trait supplies the linear memory, and Metal's makeBuffer(bytesNoCopy:length:) wraps that same memory as a GPU buffer without copying it. This reduced memory overhead to 0.03 MB for a 16 MB region, versus 16.78 MB for a traditional copy, and was validated with a matrix-multiply benchmark and Llama 3.2 1B inference.
Why it matters
This development reduces memory footprint and latency for AI inference on Apple Silicon, directly impacting platform engineers and architects building on-device AI applications. Letting Wasm modules share memory with the GPU without copying cuts the overhead of a 16 MB region from 16.78 MB to 0.03 MB, so more models, or larger ones, can reside in memory simultaneously. The mechanism is specific to Apple's Unified Memory Architecture, and it improves resource utilisation most for memory-bound workloads such as large language model KV caches, where memory efficiency determines how many models or users can be served concurrently.