1 Matching Annotations
  1. Last 7 days
    1. If we took one task out of our task suite or added another task to our task suite, potentially instead of measuring this Claude Opus 4.6 time horizon of, I think, 14 and a half hours, we'd be measuring it at something like eight or 20 hours.

      增减一道题,测量结果从 8 小时变成 20 小时——这意味着整个 METR 时间地平线排行榜,本质上是由极少数「关键任务」撑起来的脆弱测量。当一个评测体系对单点数据如此敏感,它的「精确数字」就不应该被当作事实引用,而应该被当作噪声分布的一次采样。而目前,媒体和公众正是在拿这些数字做严肃决策。