2 Matching Annotations
  1. Last 7 days
    1. our DFC is architecturally designed with three distinct sections: a shared dictionary, a "French-only" section, and an "English-only" section

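      The three-section architecture of the Dedicated Feature Crosscoder (DFC) is the core technical breakthrough of this research: by building one "shared dictionary" and two "exclusive dictionaries" separately, it forces the features that distinguish the models into their own representation space instead of letting them be blended into the shared features. Surprisingly, the design of a safety tool with such far-reaching implications turns out to be closely isomorphic to lexicography.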

    2. The original research tool for this kind of diffing, a standard crosscoder, is like a basic bilingual dictionary. It's good at matching existing words, knowing that "sun" in English is "soleil" in French. But it has a major flaw: it struggles to find words that are unique to one language.

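      Using a "bilingual dictionary" as a metaphor for the limits of cross-architecture model comparison is illuminating: a standard crosscoder would force the French-only word dépaysement into the translation "disorientation", and in doing so would miss the new model's unique behavioral features. The metaphor makes a deep interpretability research problem intuitively graspable, and that skill at popular explanation is itself surprising.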
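
The three-section layout quoted in the first annotation can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that exclusivity is enforced by masking each model's decoder so that it cannot read from the other model's dedicated latents. The names `PartitionedDictionary`, `decoder_mask`, and `decode`, and the model labels `"a"` and `"b"`, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartitionedDictionary:
    """Hypothetical latent layout: [shared | model-A-only | model-B-only]."""
    n_shared: int
    n_a_only: int
    n_b_only: int

    @property
    def n_latents(self) -> int:
        return self.n_shared + self.n_a_only + self.n_b_only

    def section(self, i: int) -> str:
        """Return which of the three sections latent index i belongs to."""
        if not 0 <= i < self.n_latents:
            raise IndexError(i)
        if i < self.n_shared:
            return "shared"
        if i < self.n_shared + self.n_a_only:
            return "a_only"
        return "b_only"

    def decoder_mask(self, model: str) -> List[int]:
        """1 where the given model may decode from a latent, 0 where it may not."""
        if model not in ("a", "b"):
            raise ValueError("model must be 'a' or 'b'")
        blocked = "b_only" if model == "a" else "a_only"
        return [0 if self.section(i) == blocked else 1
                for i in range(self.n_latents)]


def decode(latents: List[float], weights: List[List[float]],
           mask: List[int]) -> List[float]:
    """Reconstruct one model's activation from the latents its mask allows.

    weights[i] is latent i's decoder vector for that model (assumed layout).
    """
    out = [0.0] * len(weights[0])
    for z, w, m in zip(latents, weights, mask):
        if m:
            for d, w_d in enumerate(w):
                out[d] += z * w_d
    return out


# Usage: a latent in the B-only section contributes to model B's
# reconstruction but is invisible to model A's.
layout = PartitionedDictionary(n_shared=4, n_a_only=2, n_b_only=2)
weights = [[1.0, 0.5] for _ in range(layout.n_latents)]
latents = [0.0] * layout.n_latents
latents[6] = 2.0  # index 6 falls in the B-only section
recon_a = decode(latents, weights, layout.decoder_mask("a"))
recon_b = decode(latents, weights, layout.decoder_mask("b"))
```

Under this assumed masking scheme, a feature unique to one model has a dedicated slot and does not have to be squeezed into a shared entry, which is the dictionary analogy's point about words that exist in only one language.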