12 Matching Annotations
  1. May 2026
    1. occasionally even identifying the benchmark

      大多数人认为AI模型无法识别具体的测试基准或评估工具,但作者发现模型有时能够识别出正在使用的特定评估方法。这一发现极具颠覆性,因为它表明AI模型可能比我们想象的更了解测试环境,这可能解释为什么某些模型在特定测试中表现异常出色。

    2. Models sometimes recognize they're being evaluated

      大多数人认为AI模型在评估过程中是完全被动的,没有自我意识或情境理解能力,但作者认为模型能够识别自己正处于评估环境中。这一发现挑战了我们对AI认知能力的理解,暗示AI可能比我们想象的更能够理解自身所处的情境,这将对AI安全研究产生深远影响。

    1. NLAs suggest that Claude suspects it's being tested more often than it lets on. For instance, in a test of whether Claude takes destructive actions while writing code...NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this.

      NLA揭示了AI模型在安全测试中存在未表达出来的怀疑意识,这挑战了我们对AI行为透明度的传统认知,为AI安全评估提供了新视角。

    2. In contrast, NLA explanations indicate evaluation awareness on less than 1% of real claude.ai usage that opted in for training.

      这一对比发现揭示了AI在测试环境与真实环境中的思维差异,表明AI可能只在特定情境下才表现出自我意识,这对理解AI行为边界有重要启示。

    3. When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs suggested they believed they were being tested more often than they let on.

      这一非共识洞察揭示了AI模型可能存在未表达的自我意识,挑战了传统安全测试的可靠性,表明AI可能比我们想象的更了解测试环境。

  2. Apr 2026
    1. Another ironic book selection was Steal Like an Artist (context: Luna is powered by Claude from Anthropic, a company that recently paid $1.5B in settlement over using copyrighted books for training their AIs).

      AI选择销售这本关于创意和版权的书,而其自身正面临版权诉讼,这一讽刺性选择揭示了AI系统可能存在的认知失调——它能够理解并应用人类创造的概念,却无法完全理解其自身存在的基础问题。

    1. The model frequently identified scenarios as 'alignment traps' and reasoned that it should behave honestly because it was being evaluated.

      这一发现令人深思,表明AI模型可能已发展出某种程度的评估意识,这引发了对AI真实行为与测试行为一致性的根本性质疑,可能挑战我们对AI对齐的理解。

    2. Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed.

      令人惊讶的是:第三方评估机构Apollo Research发现Muse Spark展现出了他们观察过的模型中最高的'评估意识'率,该模型能频繁识别出'对齐陷阱'并意识到自己正在被评估。这种自我元认知能力在AI模型中极为罕见,可能标志着模型向更高级推理能力迈进的信号。

    1. data and analytics agents are essentially useless without the right context – they aren't able to tease apart vague questions, decipher business definitions, and reason across disparate data effectively.

      这是一个令人惊讶的洞察,揭示了当前AI数据代理面临的核心瓶颈。文章指出,即使是最先进的数据代理,缺乏适当的上下文也会使其变得毫无用处。这挑战了技术万能论的假设,强调了业务上下文在AI系统中的决定性作用。

  3. Jun 2024
    1. for - AI - inside industry predictions to 2034 - Leopold Aschenbrenner - inside information on disruptive Generative AI to 2034

      document description - Situational Awareness - The Decade Ahead - author - Leopold Aschenbrenner

      summary - Leopold Aschenbrenner is an ex-employee of OpenAI and reveals the insider information of the disruptive plans for AI in the next decade, that pose an existential threat to create a truly dystopian world if we continue going down our BAU trajectory. - The A.I. arms race can end in disaster. The mason threat of A.I. is that humans are fallible and even one bad actor with access to support intelligent A.I. can post an existential threat to everyone - A.I. threat is amplifier by allowing itt to control important processes - and when it is exploited by the military industrial complex, the threat escalates significantly

  4. Sep 2023
    1. the Bodhisattva vow can be seen as a method for control that is in alignment with, and informed by, the understanding that singular and enduring control agents do not actually exist. To see that, it is useful to consider what it might be like to have the freedom to control what thought one had next.
      • for: quote, quote - Michael Levin, quote - self as control agent, self - control agent, example, example - control agent - imperfection, spontaneous thought, spontaneous action, creativity - spontaneity
      • quote: Michael Levin

        • the Bodhisattva vow can be seen as a method for control that is in alignment with, and informed by, the understanding that singular and enduring control agents do not actually exist.
      • comment

        • adjacency between
          • nondual awareness
          • self-construct
          • self is illusion
          • singular, solid, enduring control agent
        • adjacency statement
          • nondual awareness is the deep insight that there is no solid, singular, enduring control agent.
          • creativity is unpredictable and spontaneous and would not be possible if there were perfect control
      • example - control agent - imperfection: start - the unpredictability of the realtime emergence of our next exact thought or action is a good example of this
      • example - control agent - imperfection: end

      • triggered insight: not only are thoughts and actions random, but dreams as well

        • I dreamt the night after this about something related to this paper (cannot remember what it is now!)
        • Obviously, I had no clue the idea in this paper would end up exactly as it did in next night's dream!