
Penn State Flags AI Health Accuracy

31 May 2026 · By Pulse24 desk

What happened

Penn State researchers led by Amulya Yadav found that AI chatbots answered general health queries with about 76% accuracy, according to a study presented at FAccT 2026. Nine board-certified physicians evaluated 212 responses from ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b and identified error rates exceeding 20%, double that of human physicians. Responses to obstetrics and otolaryngology queries showed high validity, while those for internal medicine, neurology, and dermatology were less accurate. Further training on medical texts did not significantly improve the performance of the Gemini and Llama models.

Why it matters

AI chatbot accuracy on health queries remains insufficient for direct patient use, which poses a significant risk. Despite 76.2% overall accuracy, error rates exceeding 20%, double that of human physicians, limit reliability, particularly in specialized fields. For healthcare providers, the findings reinforce AI's role as a clinical tool used under physician oversight rather than as a patient-facing diagnostic. Procurement teams integrating AI into health platforms must prioritise human-in-the-loop validation and acknowledge that current LLMs are not yet safe for unmediated patient advice. The findings follow previous studies highlighting AI chatbot misdiagnoses.

Source · medicalxpress.com
AI-processed content may differ from the original.