What happened
A study led by Rebecca Payne found that widely available large language model (LLM) chatbots failed to improve patients' health decisions in realistic scenarios. Participants who used chatbots were less likely to identify the correct condition or to choose the right place to seek care. Yet when the same scenarios were given to the chatbots directly, without a human intermediary, the models identified relevant conditions and suggested appropriate care far more reliably. The gap stemmed from communication failures: users overlooked correct diagnoses the chatbots offered, supplied incomplete information, or had their details misinterpreted by the chatbot.
Why it matters
Real-world performance data is critical for deploying AI in high-stakes healthcare settings. Current AI evaluations, often based on benchmarks or model-to-model interactions, do not capture the complexities of human-machine communication. For healthcare providers and policymakers, the findings suggest AI's immediate role should be supportive, such as summarising patient records or drafting clinical notes, rather than front-line diagnosis or patient triage. Medical practice requires human connection, tailored communication, and nuanced judgement, which current chatbots lack.