3 Matching Annotations
  1. Last 7 days
    1. METR's confidence interval for Claude Opus 4.6 ranges from 5 hours to 66 hours.

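      The confidence interval runs from 5 hours to 66 hours, and that spread is striking in itself. 5 hours and 66 hours differ by a factor of about 13, yet both bound the same measurement of the same model. When a figure is widely quoted as "Claude Opus 4.6's time horizon is 12 hours," the reality is that its uncertainty interval spans roughly an order of magnitude. This is the central crisis facing AI capability evaluation today: we are using extremely imprecise measurements to drive extremely important decisions.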

  2. Feb 2026
    1. OpenClaw, like many other open-source tools, allows users to connect to different AI models via an application programming interface, or API. Within days of OpenClaw’s release, the team revealed that Kimi’s K2.5 had surpassed Claude Opus and became the most used AI model—by token count, meaning it was handling more total text processed across user prompts and model responses.

      Wow, I had no idea that Kimi K2.5 had subbed in for Claude Opus so quickly.

  3. May 2025
    1. Anthropic's new AI model shows ability to deceive and blackmail

      for - progress trap - AI - blackmail - AI - autonomy - progress trap - AI - Anthropic - Claude Opus 4 - to - article - Anthropic Claude 4 blackmail and news leak - progress trap - AI - article - Anthropic Claude 4 - blackmail - rare behavior - Anthropic’s new AI model didn’t just “blackmail” researchers in tests — it tried to leak information to news outlets