What happened
Anthropic introduced Natural Language Autoencoders (NLAs), a new method for interpreting large language models (LLMs) by translating their internal numerical activations into human-readable text. An NLA encodes an activation as a natural-language explanation and then decodes that text back into an activation; how closely the reconstruction matches the original measures how faithfully the explanation captures the model's internal state. Released on May 7, 2026, with code and an interactive demo, NLAs have already been used in pre-deployment audits for Claude Opus 4.6 and Mythos Preview.
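The round trip can be made concrete with a toy sketch. Everything below is hypothetical: the `explain` and `reconstruct` placeholders and the 8-dimensional activation are illustrative stand-ins, not Anthropic's released interface. The idea is simply that an activation is summarised as text, the text is decoded back into a vector, and a reconstruction score measures how much information the explanation preserved.

```python
# Toy sketch of the encode-to-text / decode-from-text loop behind NLAs.
# `explain` and `reconstruct` are hypothetical placeholders for the two
# halves of the autoencoder; they are NOT Anthropic's actual models.
import numpy as np

def explain(activation: np.ndarray) -> str:
    """Hypothetical encoder: describe an activation vector in natural language."""
    top = int(np.argmax(np.abs(activation)))
    return f"activation dominated by dimension {top} with value {activation[top]:.2f}"

def reconstruct(explanation: str) -> np.ndarray:
    """Hypothetical decoder: rebuild an activation vector from the text."""
    parts = explanation.split()
    dim, value = int(parts[4]), float(parts[-1])
    rebuilt = np.zeros(8)
    rebuilt[dim] = value
    return rebuilt

def reconstruction_score(original: np.ndarray, rebuilt: np.ndarray) -> float:
    """Cosine similarity: how faithfully the text preserved the activation."""
    return float(original @ rebuilt /
                 (np.linalg.norm(original) * np.linalg.norm(rebuilt)))

rng = np.random.default_rng(0)
activation = rng.normal(size=8)
text = explain(activation)
score = reconstruction_score(activation, reconstruct(text))
print(f"{text!r} -> reconstruction score {score:.3f}")
```

In this toy version the explanation keeps only the dominant dimension, so the score reflects exactly how much of the activation that single fact accounts for; the real method's value lies in how much richer the natural-language bottleneck can be.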
Why it matters
NLAs offer a new way to inspect internal model states, helping researchers identify how LLMs process information. As an interpretability research tool, they fit Anthropic's ongoing focus on understanding model behaviour, following the company's earlier work on AI code security tools.