What happened
Dr. Rebecca Payne and colleagues conducted a UK study of nearly 1,300 participants, finding that people who used large language model (LLM) chatbots for medical advice were less likely to identify the correct conditions, and no better at choosing appropriate care pathways, than a control group. Although LLMs demonstrate strong medical knowledge in isolated tests, including passing licensing exams, their real-world performance faltered because of communication breakdowns between users and the systems rather than any lack of underlying knowledge. The study highlights a significant gap between benchmark performance and practical efficacy in high-stakes healthcare settings.
Why it matters
Healthcare providers and policymakers making clinical deployment decisions must prioritise real-world performance over benchmark scores. This study shows that LLMs, despite passing medical exams, introduce diagnostic risk when used by patients, primarily because of human-machine communication failures. Procurement teams and security architects should assume that current agentic AI in patient-facing roles carries unacceptable risk, and should limit deployment to supportive, information-organisation tasks such as drafting clinical notes or summarising records.