16 Matching Annotations
  1. Last 7 days
    1. Our method, Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly. For example: When asked to complete a couplet, NLAs show Claude planning possible rhymes in advance.

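      NLAs convert a model's internal activations directly into readable natural-language text, making the model's thought process directly interpretable. This is a major breakthrough for AI interpretability.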

    2. Our method, Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly. For example: When asked to complete a couplet, NLAs show Claude planning possible rhymes in advance.

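      This finding is a breakthrough demonstration that an AI's internal thought process can be described directly in human language. It opens a new paradigm for interpretability research, turning otherwise opaque activation values into something readable and analyzable.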

  2. Apr 2026
    1. Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior.

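      [Takeaway] This sentence points to a new paradigm for AI research: rather than asking "what can the model do," ask "why does the model do this." Using emotion as the entry point for understanding model behavior essentially brings the methodology of psychology into AI interpretability research. The lesson for practitioners: the most valuable AI research in the future may lie not in algorithmic innovation but in "finding mechanistic explanations for known phenomena," as this paper does.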

    2. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to.

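      Surprisingly, the study finds that real "emotion concept vectors" exist inside Claude. This is not a metaphor: they are linear representations that can be extracted, measured, and manipulated. Stranger still, these vectors generalize across contexts, as abstract and general as human emotion concepts, rather than activating only near specific trigger words.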

    3. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to.

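      The study finds "emotion concept vectors" inside Claude that generalize across contexts: the same "fear" vector activates both when fear is expressed directly and when a situation merely implies danger. This suggests the model has learned abstract concepts of emotion rather than surface patterns, closely mirroring how human neuroscience understands emotion. It is surprising that this structure emerges spontaneously.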

    1. The original research tool for this kind of diffing, a standard crosscoder, is like a basic bilingual dictionary. It's good at matching existing words, knowing that "sun" in English is "soleil" in French. But it has a major flaw: it struggles to find words that are unique to one language.

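      The "bilingual dictionary" analogy for the limits of cross-architecture model comparison is illuminating: a standard crosscoder would force the uniquely French word dépaysement into the translation "disorientation," and so miss the behaviors that are unique to the new model. The analogy makes a deep interpretability research problem intuitively understandable, and that skill at science communication is itself surprising.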

  3. Jan 2026
  4. Mar 2025
  5. Aug 2023
    1. Title: Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics. Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho. Note: This paper seems cool. It uses older, interpretable machine learning models (graphical models) to understand what is going on inside a deep neural network.

      Link: https://arxiv.org/pdf/2308.09543.pdf

  6. Feb 2023
  7. Jan 2023
  8. Apr 2022
  9. Jun 2020
  10. Jun 2019