Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
这是本文最令人震惊的发现:Claude 内部的情绪表征不只是「情绪的副产品」,而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全,而非仅仅是用户体验问题——情绪坏了,行为也会跑偏。