Hypothesis

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior.

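[Insight] This sentence points to a new paradigm for AI research: rather than asking "what can the model do," ask "why does the model do this." Using emotion as an entry point for understanding model behavior essentially brings the methodology of psychology into AI interpretability research. The takeaway for practitioners is that the most valuable AI research in the future may lie not in algorithmic innovation but in "finding mechanistic explanations for known phenomena," as this paper does.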

inspiration research-paradigm mechanistic-interpretability methodology
