14 Matching Annotations
  1. Apr 2026
    1. Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior.

      [Insight] This sentence points to a whole new paradigm for AI research: instead of asking "what can the model do," ask "why does the model behave this way." Using emotion as an entry point for understanding model behavior essentially imports the methodology of psychology into AI interpretability research. The takeaway for practitioners: the most valuable AI research going forward may lie not in algorithmic innovation but in finding mechanistic explanations for known phenomena, which is exactly what this paper does.

    2. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to.

      Surprisingly, Claude turns out to contain genuine "emotion concept vectors" internally. This is not a metaphor: they are linear representations that can be extracted, measured, and manipulated. Stranger still, these vectors generalize across contexts. The same "fear" vector activates both when fear is expressed directly and when a dangerous situation is merely implied, which suggests the model has learned the abstract concept of an emotion rather than a surface pattern. This closely parallels how human neuroscience understands emotion, and it is striking that such structure emerges on its own.
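
      A minimal sketch of how such a concept direction could be extracted, assuming read access to a model's hidden states. The difference-of-means recipe below is a common interpretability technique, not necessarily the paper's exact method, and `get_hidden_state` is a hypothetical stand-in for real model internals:

      ```python
      # Hypothetical sketch: extract a "fear" direction by difference of means.
      import numpy as np

      rng = np.random.default_rng(0)
      d_model = 512

      def get_hidden_state(prompt: str) -> np.ndarray:
          """Stand-in for reading one layer's residual-stream activation
          for `prompt`; swap in real model internals to use this."""
          return rng.normal(size=d_model)

      fear_prompts = ["I'm terrified of what happens next.",
                      "The floor creaked behind me in the dark."]
      neutral_prompts = ["The meeting is at 3pm.",
                         "Water boils at 100 degrees Celsius."]

      fear_acts = np.stack([get_hidden_state(p) for p in fear_prompts])
      neutral_acts = np.stack([get_hidden_state(p) for p in neutral_prompts])

      # Extract: the concept direction is the normalized mean difference.
      fear_dir = fear_acts.mean(axis=0) - neutral_acts.mean(axis=0)
      fear_dir /= np.linalg.norm(fear_dir)

      # Measure: project a new activation onto the direction.
      score = get_hidden_state("Something is very wrong here.") @ fear_dir
      print(f"fear score: {score:.3f}")

      # Manipulate (steering): during a forward pass, one would add
      # alpha * fear_dir back into the residual stream.
      ```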

    3. The original research tool for this kind of diffing, a standard crosscoder, is like a basic bilingual dictionary. It's good at matching existing words, knowing that "sun" in English is "soleil" in French. But it has a major flaw: it struggles to find words that are unique to one language.

      The "bilingual dictionary" metaphor for the limits of cross-architecture model comparison is illuminating: a standard crosscoder will force-translate a French-only word like dépaysement into "disorientation," and in doing so miss the new model's unique behavioral features. The metaphor makes an abstruse interpretability problem intuitively graspable, and that expository skill is itself surprising.
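
      To make the dictionary analogy concrete, here is a simplified crosscoder sketch in PyTorch, assuming two models with residual widths `d_a` and `d_b`. This is an illustrative toy, not Anthropic's exact architecture or training setup:

      ```python
      # Toy crosscoder: one shared sparse latent code must reconstruct
      # the activations of two different models at once.
      import torch
      import torch.nn as nn

      class Crosscoder(nn.Module):
          def __init__(self, d_a: int, d_b: int, n_latents: int):
              super().__init__()
              self.enc_a = nn.Linear(d_a, n_latents)
              self.enc_b = nn.Linear(d_b, n_latents)
              self.dec_a = nn.Linear(n_latents, d_a)
              self.dec_b = nn.Linear(n_latents, d_b)

          def forward(self, x_a, x_b):
              # One shared code per token, pooled from both models' views.
              z = torch.relu(self.enc_a(x_a) + self.enc_b(x_b))
              return self.dec_a(z), self.dec_b(z), z

      model = Crosscoder(d_a=512, d_b=768, n_latents=4096)
      x_a, x_b = torch.randn(8, 512), torch.randn(8, 768)
      rec_a, rec_b, z = model(x_a, x_b)

      # Training loss: reconstruct both models plus an L1 sparsity penalty.
      loss = ((rec_a - x_a) ** 2).mean() + ((rec_b - x_b) ** 2).mean() \
             + 1e-3 * z.abs().mean()
      ```

      Because the single shared code `z` has to reconstruct both models at once, a feature exclusive to one model has no natural home in the dictionary, which is precisely the flaw the note describes.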

  2. Aug 2023
    1. Title: Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics

      Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho

      Note: This paper seems cool: it uses older, interpretable machine-learning tools, namely graphical models, to understand what is going on inside a deep neural network during training.

      Link: https://arxiv.org/pdf/2308.09543.pdf
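
      A toy illustration of the general idea: fit a hidden Markov model over a training metric so that each hidden state becomes a latent "phase" of training. The synthetic loss curve and the two-state choice below are assumptions for the demo, not the paper's setup, and it relies on the hmmlearn package:

      ```python
      # Toy demo: recover latent training phases from a loss curve with an HMM.
      import numpy as np
      from hmmlearn import hmm

      rng = np.random.default_rng(0)
      # Fake a loss curve with two regimes: a noisy plateau, then a decay.
      loss = np.concatenate([
          2.0 + 0.05 * rng.normal(size=150),
          2.0 * np.exp(-0.03 * np.arange(150)) + 0.05 * rng.normal(size=150),
      ])
      X = loss.reshape(-1, 1)

      # Each hidden state of the HMM is a candidate "phase" of training.
      model = hmm.GaussianHMM(n_components=2, covariance_type="diag",
                              n_iter=100, random_state=0)
      model.fit(X)
      phases = model.predict(X)  # latent phase label per training step
      print("first transition at step:", int(np.argmax(phases != phases[0])))
      ```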
