What happened
A developer implemented a 25,000-parameter, two-layer decoder-only transformer, named Soul Player C64, on an unmodified 1 MHz Commodore 64. The model, written in hand-coded 6502/6510 assembly, implements multi-head causal self-attention, softmax, and RMSNorm entirely in int8 quantized arithmetic. It generates roughly one token every 60 seconds and fits entirely on a floppy disk. A key breakthrough was a 14-bit shift for softmax score normalization, which makes attention weights meaningful on 8-bit hardware.
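The described 14-bit shift trick can be sketched in integer-only arithmetic. This is a hypothetical illustration of the general idea, not the Soul Player C64 code: scores are mapped to powers of two in Q14 fixed point (shifts standing in for exp, as 8-bit hardware would do with shifts or a lookup table), then normalized so the weights sum to roughly 1 << 14. The name `int_softmax` and the exact scheme are assumptions.

```python
SHIFT = 14  # fixed-point fraction bits; assumed, mirroring the 14-bit shift mentioned above

def int_softmax(scores):
    """Integer-only softmax sketch: weights in Q14, summing to ~(1 << SHIFT).

    Hypothetical illustration; real 6502 code would likely use
    a lookup table rather than per-element shifts.
    """
    m = max(scores)
    # Approximate exp(s - m) by 2**(s - m), scaled into Q14.
    # Scores more than SHIFT below the max are clamped to weight 1, not 0.
    exps = [(1 << SHIFT) >> min(m - s, SHIFT) for s in scores]
    total = sum(exps)
    # Normalize back into Q14 so the attention weights are usable
    # as fixed-point multipliers.
    return [(e << SHIFT) // total for e in exps]
```

For example, equal scores yield equal Q14 weights (8192 each for two entries), and larger scores dominate by powers of two rather than powers of e, a common trade-off in shift-based softmax approximations.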
Why it matters
The result demonstrates that transformer architectures can run on severely constrained hardware, pushing the boundaries of edge AI and embedded systems. For platform engineers and hardware architects, it highlights the potential for deploying sophisticated AI models in environments previously considered impossible, albeit with significant latency. This follows a broader industry trend towards efficient inference, as seen with Google's TurboQuant Algorithm, in which model compression and optimisation are critical to expanding AI's reach beyond data centres.
