Hypothesis

4 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  MMLU, GSM8K, and HumanEval are now saturated
  
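  📊 [Insight] MMLU, GSM8K, and HumanEval are fully saturated: these three benchmarks, which once defined the narrative of AI progress, can no longer distinguish "excellent" models from "top-tier" ones. This forms a perfect contrast with the near-zero scores on ARC-AGI-3: AI has already surpassed humans on "known problems" but scores almost zero on "novel problems". Rebuilding the evaluation system is a prerequisite for future AI governance.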
  
  MMLU benchmark-saturation evaluation-crisis insight

Tags

MMLU

benchmark-saturation

insight

evaluation-crisis

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
openai.com

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

1
1. fxp007 27 Apr 2026
  
  in Public
  
  Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions
  
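  Most people assume that code tests are objective and fair, and that they accurately assess a model's true ability. But the authors found that nearly 60% of the problems they audited have flawed test cases that reject functionally correct solutions. This finding challenges the consensus in AI evaluation and suggests that widely used benchmarks may have systematic problems and fail to reflect models' actual programming ability.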
  
  non-consensus benchmark-flaws evaluation-crisis

Tags

non-consensus

benchmark-flaws

evaluation-crisis

Annotators

fxp007

URL

openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  The most famous chart in AI might be obsolete soon.
  
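  The subtitle is itself a startling claim: the most famous chart of AI progress is about to become obsolete, not because AI has stopped improving but precisely because it is improving too fast. This creates a strange paradox: the rate at which evaluation tools fail is positively correlated with the rate of progress of what they evaluate. Our understanding of AI capabilities is iterating more slowly than AI itself is advancing; "evaluation lag" will become a core challenge for AI governance and decision-making in the coming years.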
  
  obsolete-benchmark evaluation-lag measurement-crisis paradox

Tags

measurement-crisis

obsolete-benchmark

paradox

evaluation-lag

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
Aug 2020
www.nber.org

How Did COVID-19 and Stabilization Policies Affect Spending and Employment? A New Real-Time Economic Tracker Based on Private Sector Data

1
1. Marlene_Wulf 07 Aug 2020
  
  in BehSci
  
  Chetty, R., Friedman, J. N., Hendren, N., Stepner, M., & The Opportunity Insights Team. (2020). How Did COVID-19 and Stabilization Policies Affect Spending and Employment? A New Real-Time Economic Tracker Based on Private Sector Data (Working Paper No. 27431; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w27431
  
  is:preprint lang:en COVID-19 stabilization policy employment economy tracker unemployment social insurance economic crisis evaluation impact income industry county

Tags

economic crisis

social insurance

stabilization

lang:en

employment

tracker

economy

unemployment

COVID-19

income

policy

industry

evaluation

is:preprint

county

impact

Annotators

Marlene_Wulf

URL

nber.org/papers/w27431
