What happened
Ollama, a popular local LLM runner, faces criticism for obscuring its foundational reliance on llama.cpp, for failing to meet the attribution terms of llama.cpp's MIT licence, and for later replacing llama.cpp with a custom ggml-based backend that critics consider inferior. That mid-2025 backend reintroduced previously fixed bugs and runs markedly slower; llama.cpp is reported to be up to 1.8 times faster. Ollama also misleadingly labelled distilled models, such as DeepSeek-R1-Distill-Qwen-32B, simply as "DeepSeek-R1" in its library, causing user confusion and reputational damage.
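
For readers who want to check what a library tag actually contains, Ollama's documented /api/show endpoint reports the underlying model family and parameter size. The snippet below is a minimal sketch under that assumption (default local port, tag already pulled; exact response fields can vary between Ollama versions):

```python
# Minimal sketch (assumption: a local Ollama instance on its default port,
# with the "deepseek-r1:32b" library tag already pulled).
import json
import urllib.request

def show_model(tag: str) -> dict:
    """Fetch Ollama's metadata for a pulled tag via the documented /api/show endpoint."""
    payload = json.dumps({"model": tag}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/show",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    details = show_model("deepseek-r1:32b").get("details", {})
    # For the distilled tags, the reported family and parameter size describe
    # the distill's base model (e.g. a 32B Qwen), not the original DeepSeek-R1.
    print(details.get("family"), details.get("parameter_size"))
```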
Why it matters
Deploying local LLMs with Ollama now carries higher operational costs and reduced reliability because of the new backend's performance deficit and reintroduced bugs. Benchmarks put llama.cpp at 161 tokens per second against Ollama's 89, report a 30-50% gap in CPU performance, and show roughly 70% higher throughput for llama.cpp on models such as Qwen-3 Coder 32B. Procurement teams and researchers also face misleading model naming, which obscures which model is actually being evaluated and distorts performance expectations.
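
Because throughput figures like these depend heavily on hardware, quantisation, and context length, teams weighing a migration may prefer to measure on their own workloads. The sketch below is one simple way to do so against Ollama's documented /api/generate endpoint, using the eval_count and eval_duration fields it returns with each response; the model tag is a placeholder, and a comparable llama.cpp number can be taken from its bundled llama-bench tool.

```python
# Minimal sketch (assumptions: a local Ollama instance on its default port and
# a placeholder model tag; absolute numbers vary with hardware, quantisation,
# and context length, so treat published figures as indicative only).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def decode_tokens_per_second(model: str, prompt: str) -> float:
    """Run one non-streaming generation and derive decode throughput from the
    eval_count (generated tokens) and eval_duration (nanoseconds) fields
    reported by Ollama."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["eval_count"] / (body["eval_duration"] / 1e9)

if __name__ == "__main__":
    tps = decode_tokens_per_second("your-model-tag", "Explain KV caching in one paragraph.")
    print(f"decode throughput: {tps:.1f} tokens/s")
```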