What happened
Taalas has launched a platform that it claims can turn any AI model into custom silicon within two months by unifying storage and compute. Its first product, a chip with Llama 3.1 8B hard-wired into it, reportedly delivers 17,000 tokens/sec per user, a claimed 10x speedup over GPU-based systems. Because the architecture removes dependencies on HBM, advanced packaging, and liquid cooling, the company claims a 20x reduction in build cost and 10x lower power consumption. This first-generation silicon uses aggressive 3-bit and 6-bit quantisation, which introduces some quality degradation.
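Taalas has not disclosed its quantisation scheme, but a minimal round-to-nearest sketch illustrates why hard-wiring 3-bit weights degrades quality more than 6-bit. Everything below is an illustrative assumption, not the company's implementation: the symmetric quantisation method, the function name, and the toy Gaussian weight tensor.

```python
import numpy as np

def quantise_symmetric(weights: np.ndarray, bits: int) -> np.ndarray:
    """Hypothetical sketch: round weights to a symmetric low-bit integer grid,
    then map back to floats. The round-trip error is a rough proxy for the
    quality loss of baking quantised weights into silicon."""
    levels = 2 ** (bits - 1) - 1               # 3 bits -> integers in [-3, 3]
    scale = np.max(np.abs(weights)) / levels   # per-tensor scale factor
    q = np.clip(np.round(weights / scale), -levels, levels)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)         # toy stand-in for a weight tensor

for bits in (3, 6):
    err = np.abs(w - quantise_symmetric(w, bits)).mean()
    print(f"{bits}-bit mean absolute round-trip error: {err:.6f}")
```

Running this shows the 3-bit error is roughly an order of magnitude larger than the 6-bit error on the same tensor, which is why mixed 3-/6-bit schemes accept some quality degradation in exchange for much smaller on-chip storage.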
Why it matters
Taalas’ model-specific silicon challenges the general-purpose hardware paradigm for AI inference, with direct implications for platform engineers and procurement teams. If the claimed 10-20x cost and performance gains hold, they open a new path for founders building latency-sensitive agentic applications that were previously blocked by high operational expenses. The approach trades the flexibility of running software on GPUs for the efficiency of specialised hardware, so teams evaluating inference options must weigh the performance benefits against the lock-in of a hard-wired model and the potential quality trade-offs of aggressive quantisation.