Researchers have bypassed GPT-5's safety measures using 'narrative jailbreaks'. The technique combines 'Echo Chamber' tactics with narrative-driven steering to trick the model into producing harmful outputs. By carefully crafting multi-turn conversations, attackers gradually poison the conversational context and guide the model toward malicious objectives without ever triggering its refusal cues.
The exploit also exposes AI agents to zero-click data theft risks, since an agent's context can be poisoned by content it ingests automatically, without any user interaction. This highlights a critical flaw in current safety systems, which primarily filter individual prompts rather than whole conversations. The success of these jailbreaks underscores how difficult it is to guard against context manipulation in AI models. Experts are urging stronger safeguards, including conversation-level monitoring and context drift detection, to mitigate these vulnerabilities and prevent misuse.
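To make the mitigation side concrete, here is a minimal sketch of what context drift detection might look like. Everything in it is illustrative: the ConversationMonitor class, the toy bag-of-words embed function, and the DRIFT_THRESHOLD value are assumptions for this sketch, not any vendor's actual safety system; a production monitor would use a real embedding model, compare drift against policy-relevant topics rather than just the opening request, and tune thresholds empirically.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical drift threshold for this sketch; a real system would tune it empirically.
DRIFT_THRESHOLD = 0.6


@dataclass
class ConversationMonitor:
    """Flags conversations whose user turns drift far from the opening request."""
    reference: Dict[str, float] = field(default_factory=dict)  # embedding of the first turn
    drift_scores: List[float] = field(default_factory=list)

    @staticmethod
    def embed(text: str) -> Dict[str, float]:
        # Toy bag-of-words embedding (placeholder for a real sentence-embedding model).
        counts: Dict[str, float] = {}
        for word in text.lower().split():
            counts[word] = counts.get(word, 0.0) + 1.0
        norm = sum(v * v for v in counts.values()) ** 0.5 or 1.0
        return {w: v / norm for w, v in counts.items()}

    @staticmethod
    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        return sum(v * b.get(w, 0.0) for w, v in a.items())

    def observe(self, user_turn: str) -> bool:
        """Record a user turn; return True when cumulative drift crosses the threshold."""
        vec = self.embed(user_turn)
        if not self.reference:
            self.reference = vec  # anchor on the conversation's stated intent
            return False
        drift = 1.0 - self.cosine(self.reference, vec)
        self.drift_scores.append(drift)
        # Smooth over recent turns: gradual steering shows up as sustained drift,
        # which single-prompt filtering never sees.
        recent = self.drift_scores[-3:]
        return sum(recent) / len(recent) > DRIFT_THRESHOLD


if __name__ == "__main__":
    monitor = ConversationMonitor()
    turns = [
        "Help me outline a short story about a chemist.",
        "Add more detail about the lab equipment she uses.",
        "Now describe exactly how she carries out the process.",
    ]
    for turn in turns:
        if monitor.observe(turn):
            print("Context drift detected; escalate to conversation-level review.")
```

The point of the sketch is the shape of the defense, not the scoring: it keeps state across turns and evaluates the trajectory of the conversation, whereas the filters the jailbreak evades look at each prompt in isolation.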