Hypothesis

We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to.

令人惊讶的是：研究发现 Claude 内部存在真实的「情绪概念向量」——这不是隐喻，而是可以被提取、测量、操控的线性表征。更奇异的是，这些向量能跨上下文泛化，就像人类的情绪概念一样抽象而通用，而非只在特定触发词附近激活。

emotion-vectors internal-representations interpretability surprising

Tags

Annotators

URL