Hypothesis

3 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  humans can do this in well under half an hour.
  
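  Humans can complete the IKEA furniture assembly task in under half an hour, while AI systems reach only 40% accuracy. The contrast highlights a significant gap between AI and humans on tasks that require an understanding of physical manipulation. The difference in time efficiency also underscores the importance of the time dimension in benchmarking.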
  
  data-point human-baseline time-efficiency

Tags

time-efficiency

data-point

human-baseline

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  METR pays human programmers a minimum of $50 per hour, so getting a baseline for a single 160-hour task would cost at least $8,000.
  
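  A human baseline for a single test task costs as much as $8,000. This figure reveals a seriously underestimated physical constraint on AI evaluation: measuring AI capability requires a large amount of human labor, and as AI capability extends toward "month-scale tasks", the cost of establishing reliable baselines will grow superlinearly. The more fundamental problem is that it is hard to get a capable programmer to spend weeks on a "test task", even for generous pay. The availability of human evaluators will become the real ceiling on AI capability evaluation.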
  
  evaluation-cost 8000-dollars human-baseline scalability-limit
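
The arithmetic behind the quoted figure can be checked with a minimal sketch. It assumes only the $50 per hour minimum rate and the 160-hour task length from the quote; the function name `baseline_cost` and the `attempts` parameter (how many people baseline the same task) are hypothetical and used here for illustration only. The result is a lower bound on labor cost, not a model of the superlinear growth the note predicts.

```python
def baseline_cost(task_hours: float, hourly_rate: float = 50.0, attempts: int = 1) -> float:
    """Lower bound on the labor cost of baselining one task.

    hourly_rate defaults to the $50/hour minimum from the quoted passage.
    attempts is a hypothetical parameter: the number of people who each
    complete the task once.
    """
    return task_hours * hourly_rate * attempts


# Cost of a single baseline attempt at several illustrative task lengths.
for hours in (1, 8, 40, 160):
    print(f"{hours:>4} h task: at least ${baseline_cost(hours):,.0f}")
```

With one attempt, the 160-hour case reproduces the $8,000 minimum from the quote. Any overhead beyond paid hours, such as recruiting or failed attempts, would add to this figure.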

Tags

8000-dollars

scalability-limit

evaluation-cost

human-baseline

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
metr.org

Task-Completion Time Horizons of Frontier AI Models

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Our human task duration estimates likely overestimate how long a human expert takes to complete these tasks, as the humans (and AI agents!) have much less context for the task than professionals doing equivalent work in their day-to-day job.
  
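  METR openly acknowledges that its human baseline times may be overestimated, because the human participants, like the AI, work in a low-context "novice" state rather than as professionals familiar with the project. This means the human ability that corresponds to a "2-hour time horizon" is closer to that of an outsourced contractor with no background knowledge than to that of an experienced full-time engineer. The real gap between AI and "professionals with context" is much larger than the time-horizon figure suggests.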
  
  context-gap human-baseline measurement-limitation surprising

Tags

measurement-limitation

human-baseline

context-gap

surprising

Annotators

fxp007

URL

metr.org/time-horizons/
