What happened
A developer implemented a 25,000-parameter, two-layer decoder-only transformer, named Soul Player C64, on an unmodified 1 MHz Commodore 64. The model, written in hand-coded 6502/6510 assembly, implements multi-head causal self-attention, softmax, and RMSNorm entirely in int8 quantized arithmetic. It generates roughly one token every 60 seconds and fits entirely on a floppy disk. A key breakthrough was a 14-bit shift for softmax score normalization, which makes attention weights meaningful on 8-bit hardware.
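The described 14-bit shift trick can be sketched in integer-only arithmetic. This is a hypothetical illustration of the general idea, not the Soul Player C64 code: scores are mapped to powers of two in Q14 fixed point (shifts standing in for exp, as 8-bit hardware would do with shifts or a lookup table), then normalized so the weights sum to roughly 1 << 14. The name `int_softmax` and the exact scheme are assumptions.

```python
SHIFT = 14  # fixed-point fraction bits; assumed, mirroring the 14-bit shift mentioned above

def int_softmax(scores):
    """Integer-only softmax sketch: weights in Q14, summing to ~(1 << SHIFT).

    Hypothetical illustration; real 6502 code would likely use
    a lookup table rather than per-element shifts.
    """
    m = max(scores)
    # Approximate exp(s - m) by 2**(s - m), scaled into Q14.
    # Scores more than SHIFT below the max are clamped to weight 1, not 0.
    exps = [(1 << SHIFT) >> min(m - s, SHIFT) for s in scores]
    total = sum(exps)
    # Normalize back into Q14 so the attention weights are usable
    # as fixed-point multipliers.
    return [(e << SHIFT) // total for e in exps]
```

For example, equal scores yield equal Q14 weights (8192 each for two entries), and larger scores dominate by powers of two rather than powers of e, a common trade-off in shift-based softmax approximations.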
Why it matters
The result demonstrates that transformer architectures can run on severely constrained hardware, pushing the boundaries of edge AI and embedded systems. For platform engineers and hardware architects, it highlights the potential for deploying sophisticated AI models in environments previously considered impossible, albeit with significant latency. This follows a broader industry trend towards efficient inference, as seen with Google's TurboQuant Algorithm, in which model compression and optimisation are critical to expanding AI's reach beyond data centres.
