Hypothesis

2 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  The next generation of benchmarks needs to be harder, more realistic, and less gameable
  
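  [Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward real work rather than converge on exam questions. But this raises a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more expensive it becomes (METR: $8,000 per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.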
  
  benchmark-design next-generation evaluation-paradox insight

Tags

insight

next-generation

evaluation-paradox

benchmark-design

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  The most famous chart in AI might be obsolete soon.
  
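  The subtitle is itself a startling claim: the most famous chart of AI progress is about to become obsolete, not because AI has stopped improving but precisely because it is improving too fast. This creates a strange paradox: the rate at which evaluation tools fail is positively correlated with the rate at which the systems they evaluate improve. Our understanding of AI capabilities is iterating more slowly than AI itself is advancing, and "evaluation lag" will become a central challenge for AI governance and decision-making in the coming years.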
  
  obsolete-benchmark evaluation-lag measurement-crisis paradox

Tags

paradox

evaluation-lag

obsolete-benchmark

measurement-crisis

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
