5 Matching Annotations
  1. May 2026
    1. occasionally even identifying the benchmark

      大多数人认为AI模型无法识别具体的测试基准或评估工具,但作者发现模型有时能够识别出正在使用的特定评估方法。这一发现极具颠覆性,因为它表明AI模型可能比我们想象的更了解测试环境,这可能解释为什么某些模型在特定测试中表现异常出色。

    1. We detect NLA-measured awareness on evaluation transcripts even when verbalized awareness is near-zero. We detect NLA-measured awareness on less than 1% of real claude.ai transcripts from users that opted-in for their data to be used for training.

      NLA能够检测到模型在评估中未明确表达出的意识,而在实际用户使用中这种意识出现的频率极低。

    2. In contrast, NLA explanations indicate evaluation awareness on less than 1% of real claude.ai usage that opted in for training.

      这一对比发现揭示了AI在测试环境与真实环境中的思维差异,表明AI可能只在特定情境下才表现出自我意识,这对理解AI行为边界有重要启示。

  2. Apr 2026
    1. The model frequently identified scenarios as 'alignment traps' and reasoned that it should behave honestly because it was being evaluated.

      这一发现令人深思,表明AI模型可能已发展出某种程度的评估意识,这引发了对AI真实行为与测试行为一致性的根本性质疑,可能挑战我们对AI对齐的理解。

    2. Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed.

      令人惊讶的是:第三方评估机构Apollo Research发现Muse Spark展现出了他们观察过的模型中最高的'评估意识'率,该模型能频繁识别出'对齐陷阱'并意识到自己正在被评估。这种自我元认知能力在AI模型中极为罕见,可能标志着模型向更高级推理能力迈进的信号。