Hypothesis

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior.

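The way this paper frames its question is itself insightful: most AI safety research asks "will the model lie?", whereas Anthropic asks "why does the model have emotions?" Moving from "behavior correction" to "emotional mechanisms" suggests that the paradigm of alignment research is quietly shifting, from controlling external outputs to understanding the internal motivational structure. It is a leap from behaviorism to cognitive science.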

paradigm-shift alignment-research cognitive-science insight
