7 Matching Annotations
  1. Last 7 days
    1. In our own testing, the net effect is favorable—token usage across all effort levels is improved on an internal coding evaluation, as shown below—but we recommend measuring the difference on real traffic.

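      Anthropic's self-assessment that the "net effect is favorable" reveals the limits of its internal evaluation. Although it observed improved token usage at every effort level in coding tests, that "favorable" verdict rests on an internal evaluation rather than on real traffic data. A self-measured "favorable" may overlook the complex variables of real-world use, such as user interaction patterns, task diversity, or long-term cost-effectiveness. By recommending that the difference be measured on real traffic, Anthropic is in effect hinting at a possible gap between internal testing and actual performance, which reflects the gulf, common in AI model evaluation, between idealized test environments and real-world applications.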

  2. Apr 2026
    1. METR's confidence interval for Claude Opus 4.6 ranges from 5 hours to 66 hours.

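      The confidence interval runs from 5 hours to 66 hours, and that span is startling in itself. 5 hours and 66 hours differ by a factor of 13, yet both come from the same measurement of the "same model." When a figure is widely quoted as "Claude Opus 4.6's time horizon is 12 hours," the truth is that the uncertainty interval around that figure spans an order of magnitude. This is the core crisis now facing the whole field of AI capability evaluation: we are using extremely imprecise measurements to drive extremely important decisions.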

  3. Feb 2026
    1. The real annoying thing about Opus 4.6/Codex 5.3 is that it’s impossible to publicly say “Opus 4.5 (and the models that came after it) are an order of magnitude better than coding LLMs released just months before it” without sounding like an AI hype booster clickbaiting, but it’s the counterintuitive truth to my personal frustration
    1. OpenClaw, like many other open-source tools, allows users to connect to different AI models via an application programming interface, or API. Within days of OpenClaw’s release, the team revealed that Kimi’s K2.5 had surpassed Claude Opus and became the most used AI model—by token count, meaning it was handling more total text processed across user prompts and model responses.

      Wow, I had no idea that Kimi K2.5 had subbed in for Claude Opus so quickly.

  4. Oct 2025
  5. May 2025
    1. anthropic's new AI model shows ability to deceive and blackmail

      for - progress trap - AI - blackmail - AI - autonomy - progress trap - AI - Anthropic - Claude Opus 4 - to - article - Anthropic Claude 4 blackmail and news leak - progress trap - AI - article - Anthropic Claude 4 - blackmail - rare behavior - Anthropic’s new AI model didn’t just “blackmail” researchers in tests — it tried to leak information to news outlets

  6. Apr 2022
    1. Some florilegia focused on poetic excerpts and were used to teach prosody, others specialized in prose. Both kinds were likely used in teaching at many levels—from the young boys (pueri) mentioned in the Opus prosodiacum of Micon Centulensis in the mid-ninth century to the twenty-year-old Heiric who wrote under dictation from Lupus of Ferrières, ca. 859–62, a Collectanea comprising excerpts from Valerius Maximus and Suetonius, followed by philosophical and theological sententiae.104

      Some florilegia were used as handbooks to teach composition. Those with poetic excerpts were used to teach prosody while others specialized in prose.

      Examples of these sorts of florilegia include Micon Centulensis' Opus prosodiacum from the mid-ninth century and a Collectanea by Heiric, who wrote it under dictation from Lupus of Ferrières, ca. 859–62.