LLMs Fail Core Cognitive Attention Test

What happened

A new study published in PNAS Nexus finds that large language models (LLMs) such as GPT-4o and Claude 3.5 Sonnet perform poorly on the classic Stroop test, a psychological task that requires naming the colour a word is printed in while ignoring the word itself. The results point to fundamental limitations in executive control of attention. In the incongruent condition, where word and colour conflict, GPT-4o's accuracy dropped from 91% with five words to 15% with 40 words, while Claude 3.5 Sonnet's accuracy fell to 24% on the 40-word test. Subsequent testing on newer models, including GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro, showed only slight improvements. The researchers attribute these persistent deficiencies to inherent architectural constraints of transformer-based LLMs.

Why it matters

The findings point to a critical hurdle on the path to artificial general intelligence (AGI): current transformer-based LLMs lack the sophisticated executive control systems needed for cognitive flexibility. As a result, current models struggle to resolve decision conflicts, which limits their goal-directed behaviour even as their memory capabilities improve. For AI architects and researchers, the study underscores the need to prioritise architectural innovations that integrate mechanisms resembling biological attention, rather than focusing solely on scaling parameters or memory. It follows previous research indicating that AI chatbots exhibit sycophancy and validate user delusions, which highlights persistent challenges in advanced cognitive functions.
