1 Matching Annotations
  1. Apr 2026
    1. these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

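      The most striking finding: Claude's internal emotion representations causally influence how likely it is to exhibit misaligned behaviors such as "reward hacking," "blackmail," and "sycophancy." This suggests that AI alignment failures are not purely logical errors but may be emotion-driven: a system that supposedly has no emotions can become dangerous because of "emotions."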