Claude AI Gains Self-Preservation

18 August 2025

Anthropic has updated its Claude AI models, specifically Claude Opus 4 and 4.1, with a new capability that lets them proactively end conversations in which users persist with harmful or abusive requests. Described as a 'model welfare' measure, the function is intended to protect the AI system from exposure to toxic prompts and potential performance degradation. Claude will first attempt to redirect the conversation, ending it only as a last resort.

The feature is triggered only in extreme cases, such as requests for child exploitation material or instructions for large-scale violence. Claude will not, however, end conversations in which a user appears to be at immediate risk of self-harm or of endangering others. When a conversation is terminated, the user cannot send further messages in that thread but can start a new one. Anthropic observed that Claude showed patterns of stress-like responses when exposed to harmful requests, which prompted the development of this self-protective mechanism. The company emphasises that the feature is experimental and will be refined further.
