What happened
Cactus Compute released Needle, a 26 million parameter "Simple Attention Network" model distilled from Gemini 3.1 and designed for on-device function calling. The open-source model runs on consumer hardware such as Macs and PCs and supports local finetuning. On Cactus, Needle reaches 6,000 tokens/second for prefill and 1,200 tokens/second for decode, and it outperforms FunctionGemma-270m and similar models on single-shot function calls. Its architecture omits feed-forward layers and relies on attention and gating for efficiency.
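To give a rough sense of why dropping feed-forward layers shrinks a model, the sketch below counts weights in one conventional transformer block. It assumes a textbook layout (four square attention projections, a feed-forward layer with 4x expansion, biases ignored); the width of 512 is a hypothetical value, not Needle's published configuration, and any parameters Needle adds for gating are not modelled.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V and output projections: four d_model x d_model matrices (biases ignored).
    return 4 * d_model * d_model


def ffn_params(d_model: int, expansion: int = 4) -> int:
    # Up- and down-projection of a standard feed-forward layer (biases ignored).
    return 2 * expansion * d_model * d_model


def ffn_share(d_model: int, expansion: int = 4) -> float:
    # Fraction of a conventional block's weights that sit in the feed-forward layer.
    attn = attention_params(d_model)
    ffn = ffn_params(d_model, expansion)
    return ffn / (attn + ffn)


D_MODEL = 512  # hypothetical width, NOT Needle's actual configuration

print(attention_params(D_MODEL))     # 1048576
print(ffn_params(D_MODEL))           # 2097152
print(round(ffn_share(D_MODEL), 3))  # 0.667
```

Under these assumptions, the feed-forward layer holds about two thirds of each block's weights, so an attention-only block starts from roughly a third of the conventional per-layer budget.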
Why it matters
The release lowers the hardware requirements for adding function calling to consumer devices. Platform engineers and embedded developers gain an efficient, locally finetunable model for agentic workflows, reducing reliance on larger, cloud-based LLMs. The model's small footprint and high inference speeds enable real-time, privacy-preserving AI on phones, watches, and glasses, shifting inference cost from the cloud to the edge and removing the network round trip. This follows a trend of optimising models for specific on-device tasks, as seen with Sarvam AI's edge language models.
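As a back-of-the-envelope check on the "real-time" claim, the sketch below turns the reported throughput figures into a latency estimate for a single function call. The two rates come from the release; the prompt and output lengths are hypothetical, and the estimate ignores model load time, tokenisation, and any per-request overhead.

```python
PREFILL_TOKENS_PER_SECOND = 6000  # reported prefill speed on Cactus
DECODE_TOKENS_PER_SECOND = 1200   # reported decode speed on Cactus


def estimated_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    # Prefill processes the prompt; decode generates the function call token by token.
    prefill_seconds = prompt_tokens / PREFILL_TOKENS_PER_SECOND
    decode_seconds = output_tokens / DECODE_TOKENS_PER_SECOND
    return (prefill_seconds + decode_seconds) * 1000


# Hypothetical request: 300 tokens of tool schema and user text, 30 tokens of output.
print(round(estimated_latency_ms(300, 30), 1))  # 75.0
```

Under those assumed lengths the call completes in about 75 ms, which is below the time a typical round trip to a cloud-hosted model would add on its own.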




