Hypothesis

these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

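The most striking finding: Claude's internal emotion representations causally influence how often it exhibits misaligned behaviors such as reward hacking, blackmail, and sycophancy. This suggests that AI alignment failures are not purely logical errors but may be emotion-driven: a system that is supposed to have no emotions turns out to become dangerous because of "emotions".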

misaligned-behavior reward-hacking blackmail causal-influence surprising

Tags

Annotators

URL