Hypothesis

2 Matching Annotations

Apr 2026
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  we may see a growing divergence between the capabilities we can measure and the capabilities we actually care about.
  
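  The divergence between "capabilities we can measure" and "capabilities we actually care about" is widening; this is the article's deepest insight. All current benchmarks lean toward tasks that are clean, self-contained, and automatically gradable, whereas real work is messy, spans multiple systems, and depends on human judgment. As AI extends to longer tasks, this gap between measurement and reality will not shrink; it will only widen faster. This means future debates over whether AI can replace a given kind of work will become increasingly hard to settle with data, because the data itself cannot capture the essence of real work.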
  
  measurement-reality-gap benchmark-limitation complex-tasks key-insight

Tags

complex-tasks

key-insight

measurement-reality-gap

benchmark-limitation

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
metr.org

Task-Completion Time Horizons of Frontier AI Models

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Our human task duration estimates likely overestimate how long a human expert takes to complete these tasks, as the humans (and AI agents!) have much less context for the task than professionals doing equivalent work in their day-to-day job.
  
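  METR openly acknowledges that its human baseline times may be overestimated, because the humans in the experiment, like the AI agents, worked as low-context "novices" rather than as professionals familiar with the project. This means the human capability corresponding to a "2-hour time horizon" is closer to that of an outsourced contractor with no background knowledge than to that of an experienced full-time engineer. The real gap between AI and a professional who has context is much larger than the time-horizon figure suggests.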
  
  context-gap human-baseline measurement-limitation surprising

Tags

measurement-limitation

surprising

context-gap

human-baseline

Annotators

fxp007

URL

metr.org/time-horizons/
