LLMs Benchmarked in Production

20 August 2025

Inclusion AI and Ant Group have collaborated to create Inclusion Arena, a new leaderboard that evaluates large language models (LLMs) using data from real-world, production applications. The approach aims to assess LLM performance more accurately than traditional lab benchmarks do. The platform collects user feedback within the natural workflow of each application, so diverse use cases contribute to a richer, more representative dataset, and all feedback is anonymised to protect user privacy.
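The article does not describe Inclusion Arena's actual data schema or API. Purely as an illustration of the idea, the sketch below shows one way an application might record a user's preference between two models and anonymise the user identifier before the event leaves the app; every name in it (`FeedbackEvent`, `hash_user_id`, `record_preference`) is hypothetical.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical event schema; the real Inclusion Arena format is not public here.
@dataclass
class FeedbackEvent:
    user_hash: str   # anonymised user identifier, never the raw ID
    model_a: str     # e.g. "model-x"
    model_b: str     # e.g. "model-y"
    winner: str      # "model_a", "model_b", or "tie"
    timestamp: float

def hash_user_id(raw_user_id: str, salt: str) -> str:
    """One-way hash so feedback can be deduplicated without storing raw IDs."""
    return hashlib.sha256((salt + raw_user_id).encode()).hexdigest()[:16]

def record_preference(raw_user_id: str, model_a: str, model_b: str,
                      winner: str, salt: str = "app-secret") -> str:
    """Serialise an anonymised pairwise preference for later aggregation."""
    event = FeedbackEvent(
        user_hash=hash_user_id(raw_user_id, salt),
        model_a=model_a,
        model_b=model_b,
        winner=winner,
        timestamp=time.time(),
    )
    return json.dumps(asdict(event))

# Example: a user preferred model_a's answer inside the host application.
print(record_preference("user-123", "model-x", "model-y", "model_a"))
```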

Inclusion Arena seeks to address the limitations of conventional metrics by capturing user-driven preferences. By open-sourcing the collected feedback data, the initiative aims to benefit the wider AI community, fostering a collaborative environment in which application developers help shape how models are built. This real-world evaluation gives developers concrete signals for improving LLMs, accelerating progress toward more capable and reliable AI systems.

The platform offers a streamlined process for integrating its evaluation module into applications, letting developers see how different models perform within their application's specific context. The initiative promotes community-driven progress, enabling application developers to contribute to a growing ecosystem for learning about and iterating on AI models. A sketch of what such an integration might look like follows.
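The real integration API is not shown in this summary. The following is a minimal, purely illustrative sketch, assuming (in the style of arena-type leaderboards) a pairwise side-by-side comparison: a hypothetical `EvaluationModule` samples two models per request, returns both responses, and forwards the user's vote to a recording function such as the one sketched above.

```python
import random

class EvaluationModule:
    """Hypothetical in-app evaluation hook: pairs two models per request
    and records which response the user preferred."""

    def __init__(self, models, record_fn):
        self.models = models        # dict of name -> callable(prompt) -> text
        self.record_fn = record_fn  # sink for anonymised preference events

    def ask(self, prompt: str):
        # Sample two distinct models for a side-by-side comparison.
        name_a, name_b = random.sample(list(self.models), 2)
        return (name_a, self.models[name_a](prompt)), \
               (name_b, self.models[name_b](prompt))

    def submit_vote(self, user_id: str, name_a: str, name_b: str, winner: str):
        # Forward the user's preference to the leaderboard backend.
        self.record_fn(user_id, name_a, name_b, winner)

# Usage with stubbed models standing in for real LLM calls:
models = {
    "model-x": lambda p: f"model-x answer to: {p}",
    "model-y": lambda p: f"model-y answer to: {p}",
}
evaluator = EvaluationModule(models, lambda *args: print("recorded:", args))
(a_name, a_text), (b_name, b_text) = evaluator.ask("What is an LLM?")
evaluator.submit_vote("user-123", a_name, b_name, "model_a")
```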


