1 Matching Annotations
  1. Last 7 days
    1. Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

      这是本文最令人震惊的发现:Claude 内部的情绪表征不只是「情绪的副产品」,而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全,而非仅仅是用户体验问题——情绪坏了,行为也会跑偏。