Hypothesis

2 Matching Annotations

Last 7 days
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  MMLU, GSM8K, and HumanEval are now saturated
  
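  📊 [Insight] MMLU, GSM8K, and HumanEval are fully saturated: these three benchmarks, which once defined the narrative of AI progress, can no longer separate "excellent" models from "top-tier" ones. This contrasts sharply with the near-zero scores on ARC-AGI-3: AI already surpasses humans on "known problems" but scores almost nothing on "novel problems". Rebuilding the evaluation system is a prerequisite for future AI governance.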
  
  MMLU benchmark-saturation evaluation-crisis insight

Tags

insight

evaluation-crisis

benchmark-saturation

MMLU

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  it's impossible to get a score much higher than 93% without cheating because around 6.5% of MMLU questions contain errors.
  
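  About 6.5% of MMLU questions are themselves wrong, which means the "true ceiling" for any model is 93.5%, not 100%. More surprising still: the error rate of this authoritative benchmark, in wide use for years, was only recently studied and quantified systematically. This exposes a deeper problem in the whole AI evaluation ecosystem: benchmark quality itself needs to be benchmarked, and this layer of meta-evaluation has almost never been taken seriously.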
  
  MMLU-errors 6.5-percent benchmark-quality meta-evaluation
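
  The ceiling arithmetic in the note above can be sketched in a few lines. This is only an illustration, assuming that a flawed question can never be scored as correct (in practice a model might still match an erroneous answer key by chance); the function names `score_ceiling` and `rescaled_accuracy` are hypothetical and do not come from either article.

```python
def score_ceiling(error_rate: float) -> float:
    """Upper bound on accuracy, assuming flawed questions are never scored correct."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return 1.0 - error_rate


def rescaled_accuracy(raw_accuracy: float, error_rate: float) -> float:
    """Raw accuracy expressed as a fraction of the achievable ceiling (capped at 1)."""
    ceiling = score_ceiling(error_rate)
    if ceiling == 0.0:
        raise ValueError("every question is flawed; no ceiling to rescale against")
    return min(raw_accuracy / ceiling, 1.0)


# 6.5% flawed questions, the figure quoted in the annotation.
print(f"ceiling: {score_ceiling(0.065):.1%}")
# A hypothetical raw score of 90%, measured against that ceiling.
print(f"rescaled: {rescaled_accuracy(0.90, 0.065):.1%}")
```

  Under this assumption, a raw score near 93% is already close to the practical maximum, which is why further gains on the benchmark say little about model quality.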

Tags

MMLU-errors

meta-evaluation

6.5-percent

benchmark-quality

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
