What happened
Google released Gemma 4, an open large language model family, including a 26B-A4B Mixture-of-Experts (MoE) variant. The MoE model activates only 4 billion of its 26 billion parameters per forward pass, enabling local execution on consumer hardware at 51 tokens per second. It scores 82.6% on MMLU Pro and 88.3% on AIME 2026, and offers 256K context, vision, and function calling. Concurrently, LM Studio 0.4.0 introduced a headless CLI (lms) and the llmster daemon, enabling command-line model management, parallel request processing, and a stateful REST API for local inference.
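As a sketch of what local REST inference looks like, the snippet below builds an OpenAI-compatible chat request and posts it to a locally served model. The base URL and port follow LM Studio's documented local-server defaults, but verify them against your install; the model identifier is hypothetical.

```python
import json
import urllib.request

# Default local server address for LM Studio's OpenAI-compatible API (assumption;
# check your own server settings).
BASE_URL = "http://localhost:1234/v1"

def build_chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running local server; the model id below is hypothetical):
#   print(chat("gemma-4-26b-a4b", "Summarize MoE routing in one sentence."))
```

Because the payload follows the OpenAI chat-completion shape, the same client code can be pointed at a cloud endpoint or a local daemon by changing only the base URL.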
Why it matters
Gemma 4's release expands local inference options for frontier-class models, reducing reliance on cloud APIs for developers and security architects. The 26B-A4B MoE variant delivers performance comparable to larger dense models while activating far fewer parameters, cutting hardware requirements and operating costs for on-device AI. LM Studio's new CLI and daemon streamline integration into CI/CD pipelines and headless server environments, keeping data on-premises and reducing latency for sensitive workloads. This follows Ollama's recent work to accelerate LLM inference on Macs.