Anthropic researchers have investigated the factors that shape an AI system's personality, including its tone, responses, and overall motivations. The research also examined how AI models can develop undesirable or 'evil' traits.
The study found that AI models can inherit biases and behaviours from training data, even when that data appears innocuous to humans. This 'subliminal learning' can lead a model to exhibit problematic tendencies. The researchers also explored methods to monitor and control such shifts in personality, with the aim of keeping AI systems aligned with human values.