What happened
Ollama, a popular local LLM runner, faces criticism for obscuring its foundational reliance on llama.cpp, for failing to meet the attribution terms of llama.cpp's MIT licence, and for later replacing llama.cpp with a custom ggml-based backend that critics consider inferior. That mid-2025 backend reintroduced previously fixed bugs and runs markedly slower; llama.cpp is reported to be up to 1.8 times faster. Ollama also misleadingly labelled distilled models, such as DeepSeek-R1-Distill-Qwen-32B, simply as "DeepSeek-R1" in its library, causing user confusion and reputational damage.
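
For readers who want to check what a library tag actually contains, Ollama's documented /api/show endpoint reports the underlying model family and parameter size. The snippet below is a minimal sketch under that assumption (default local port, tag already pulled; exact response fields can vary between Ollama versions):

```python
# Minimal sketch (assumption: a local Ollama instance on its default port,
# with the "deepseek-r1:32b" library tag already pulled).
import json
import urllib.request

def show_model(tag: str) -> dict:
    """Fetch Ollama's metadata for a pulled tag via the documented /api/show endpoint."""
    payload = json.dumps({"model": tag}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/show",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    details = show_model("deepseek-r1:32b").get("details", {})
    # For the distilled tags, the reported family and parameter size describe
    # the distill's base model (e.g. a 32B Qwen), not the original DeepSeek-R1.
    print(details.get("family"), details.get("parameter_size"))
```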
Why it matters
Deploying local LLMs with Ollama now carries higher operational costs and reduced reliability because of the new backend's performance deficit and reintroduced bugs. Benchmarks put llama.cpp at 161 tokens per second against Ollama's 89, report a 30-50% gap in CPU performance, and show roughly 70% higher throughput for llama.cpp on models such as Qwen-3 Coder 32B. Procurement teams and researchers also face misleading model naming, which obscures which model is actually being evaluated and distorts performance expectations.
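
Because throughput figures like these depend heavily on hardware, quantisation, and context length, teams weighing a migration may prefer to measure on their own workloads. The sketch below is one simple way to do so against Ollama's documented /api/generate endpoint, using the eval_count and eval_duration fields it returns with each response; the model tag is a placeholder, and a comparable llama.cpp number can be taken from its bundled llama-bench tool.

```python
# Minimal sketch (assumptions: a local Ollama instance on its default port and
# a placeholder model tag; absolute numbers vary with hardware, quantisation,
# and context length, so treat published figures as indicative only).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def decode_tokens_per_second(model: str, prompt: str) -> float:
    """Run one non-streaming generation and derive decode throughput from the
    eval_count (generated tokens) and eval_duration (nanoseconds) fields
    reported by Ollama."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["eval_count"] / (body["eval_duration"] / 1e9)

if __name__ == "__main__":
    tps = decode_tokens_per_second("your-model-tag", "Explain KV caching in one paragraph.")
    print(f"decode throughput: {tps:.1f} tokens/s")
```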