Hypothesis

If we took one task out of our task suite or added another task to our task suite, potentially instead of measuring this Claude Opus 4.6 time horizon of, I think, 14 and a half hours, we'd be measuring it at something like eight or 20 hours.

增减一道题，测量结果从 8 小时变成 20 小时——这意味着整个 METR 时间地平线排行榜，本质上是由极少数「关键任务」撑起来的脆弱测量。当一个评测体系对单点数据如此敏感，它的「精确数字」就不应该被当作事实引用，而应该被当作噪声分布的一次采样。而目前，媒体和公众正是在拿这些数字做严肃决策。

single-task-sensitivity fragile-benchmark measurement-noise surprising

Tags

Annotators

URL