LLMs Benchmarked in Production

20 August 2025

Inclusion AI and Ant Group have collaborated to create Inclusion Arena, a new leaderboard that evaluates large language models (LLMs) using data from real-world, production applications. The approach aims to assess LLM performance more accurately than traditional lab benchmarks do. The platform collects user feedback within the natural workflow of each application, so diverse use cases contribute to a richer, more representative dataset, and all feedback is anonymised to protect user privacy.
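The article does not describe Inclusion Arena's actual data schema or API. Purely as an illustration of the idea, the sketch below shows one way an application might record a user's preference between two models and anonymise the user identifier before the event leaves the app; every name in it (`FeedbackEvent`, `hash_user_id`, `record_preference`) is hypothetical.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical event schema; the real Inclusion Arena format is not public here.
@dataclass
class FeedbackEvent:
    user_hash: str   # anonymised user identifier, never the raw ID
    model_a: str     # e.g. "model-x"
    model_b: str     # e.g. "model-y"
    winner: str      # "model_a", "model_b", or "tie"
    timestamp: float

def hash_user_id(raw_user_id: str, salt: str) -> str:
    """One-way hash so feedback can be deduplicated without storing raw IDs."""
    return hashlib.sha256((salt + raw_user_id).encode()).hexdigest()[:16]

def record_preference(raw_user_id: str, model_a: str, model_b: str,
                      winner: str, salt: str = "app-secret") -> str:
    """Serialise an anonymised pairwise preference for later aggregation."""
    event = FeedbackEvent(
        user_hash=hash_user_id(raw_user_id, salt),
        model_a=model_a,
        model_b=model_b,
        winner=winner,
        timestamp=time.time(),
    )
    return json.dumps(asdict(event))

# Example: a user preferred model_a's answer inside the host application.
print(record_preference("user-123", "model-x", "model-y", "model_a"))
```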

Inclusion Arena seeks to address the limitations of conventional metrics by capturing user-driven preferences. By open-sourcing the collected feedback data, the initiative aims to benefit the wider AI community, fostering a collaborative environment in which application developers help shape how models are built. This real-world evaluation gives developers concrete signals for improving LLMs, accelerating progress toward more capable and reliable AI systems.

The platform offers a streamlined process for integrating its evaluation module into applications, letting developers see how different models perform within their application's specific context. The initiative promotes community-driven progress, enabling application developers to contribute to a growing ecosystem for learning about and iterating on AI models. A sketch of what such an integration might look like follows.
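The real integration API is not shown in this summary. The following is a minimal, purely illustrative sketch, assuming (in the style of arena-type leaderboards) a pairwise side-by-side comparison: a hypothetical `EvaluationModule` samples two models per request, returns both responses, and forwards the user's vote to a recording function such as the one sketched above.

```python
import random

class EvaluationModule:
    """Hypothetical in-app evaluation hook: pairs two models per request
    and records which response the user preferred."""

    def __init__(self, models, record_fn):
        self.models = models        # dict of name -> callable(prompt) -> text
        self.record_fn = record_fn  # sink for anonymised preference events

    def ask(self, prompt: str):
        # Sample two distinct models for a side-by-side comparison.
        name_a, name_b = random.sample(list(self.models), 2)
        return (name_a, self.models[name_a](prompt)), \
               (name_b, self.models[name_b](prompt))

    def submit_vote(self, user_id: str, name_a: str, name_b: str, winner: str):
        # Forward the user's preference to the leaderboard backend.
        self.record_fn(user_id, name_a, name_b, winner)

# Usage with stubbed models standing in for real LLM calls:
models = {
    "model-x": lambda p: f"model-x answer to: {p}",
    "model-y": lambda p: f"model-y answer to: {p}",
}
evaluator = EvaluationModule(models, lambda *args: print("recorded:", args))
(a_name, a_text), (b_name, b_text) = evaluator.ask("What is an LLM?")
evaluator.submit_vote("user-123", a_name, b_name, "model_a")
```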


