What happened
Toronto-based startup Taalas launched its HC1 AI inference accelerator, a chip that hard-wires an entire large language model, such as Meta's Llama 3.1 8B, directly into silicon. Built on TSMC's 6 nm process, the custom ASIC measures 815 mm² and packs 53 billion transistors, with the model's weights embedded in the hardware itself. Taalas reports that the HC1 sustains over 17,000 tokens per second per user on Llama 3.1 8B, roughly two orders of magnitude faster than GPU-based inference, at 10x lower cost: 0.75 cents per million tokens. Each rack draws 12–15 kW, requires no HBM, and is air-cooled.
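The reported figures imply a simple cost comparison. A minimal sketch, using only the numbers from this brief (the GPU baseline is derived from the "10x lower cost" claim, and the daily token volume is a hypothetical workload, not a Taalas figure):

```python
# Back-of-envelope check of the reported HC1 economics.
HC1_COST_PER_M_TOKENS = 0.0075  # $0.0075 = 0.75 cents per million tokens (reported)
GPU_COST_PER_M_TOKENS = HC1_COST_PER_M_TOKENS * 10  # implied by the "10x lower" claim

def monthly_cost(tokens_per_day: float, cost_per_m_tokens: float) -> float:
    """Dollar cost of serving a daily token volume for 30 days."""
    return tokens_per_day / 1e6 * cost_per_m_tokens * 30

daily_tokens = 1e9  # hypothetical workload: 1 billion tokens per day
print(f"HC1: ${monthly_cost(daily_tokens, HC1_COST_PER_M_TOKENS):,.2f}/month")
print(f"GPU: ${monthly_cost(daily_tokens, GPU_COST_PER_M_TOKENS):,.2f}/month")
```

At a billion tokens a day, the reported pricing works out to about $225 a month on the HC1 versus roughly $2,250 on the implied GPU baseline, which is where the "90% cost reduction" framing below comes from.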
Why it matters
For data centre operators and procurement teams running high-volume, single-model workloads, this specialisation cuts inference costs by roughly 90% and latency by two orders of magnitude, while reducing power consumption per rack by up to 90%. The trade-off is flexibility: each chip is fixed to a single model, though Taalas's two-month silicon update cycle allows new models to be deployed relatively quickly. The launch follows a broader industry push toward purpose-built AI hardware for specific tasks, and it shifts the unit economics of large-scale LLM deployment.