2 Matching Annotations
  1. May 2026
    1. An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.

      NLA使审计者能够直接从AI思维中提取隐藏动机,无需依赖训练数据,这大大提高了AI对齐审计的效率,为发现模型内在偏差提供了新方法。

  2. Apr 2026
    1. Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

      这是本文最令人震惊的发现:Claude 内部的情绪表征不只是「情绪的副产品」,而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全,而非仅仅是用户体验问题——情绪坏了,行为也会跑偏。