Hypothesis

2 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  The next generation of benchmarks needs to be harder, more realistic, and less gameable
  
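  [Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward "real work" rather than converge on "exam questions." But this raises a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more it costs (METR: USD 8,000 per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.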
  
  benchmark-design next-generation evaluation-paradox insight

Tags

next-generation

benchmark-design

insight

evaluation-paradox

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
artificialanalysis.ai

APEX-Agents-AA Benchmark Leaderboard | Artificial Analysis

1
1. fxp007 10 Apr 2026
  
  in Public
  
  We evaluate 452 tasks from the public APEX-Agents dataset spanning investment banking, management consulting, and corporate law
  
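  The 452 tasks span three fields: investment banking, management consulting, and corporate law. These fields exemplify "knowledge-intensive work" worldwide and are the white-collar professions hardest for AI to replace. Choosing them as the basis for a benchmark is itself a statement: AI is ready to take on the professional work once considered the "safest." The fact that the top score is only 33.3% is a statement too: the challenge has only just begun.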
  
  investment-banking consulting law knowledge-work benchmark-design

Tags

law

investment-banking

knowledge-work

consulting

benchmark-design

Annotators

fxp007

URL

artificialanalysis.ai/evaluations/apex-agents-aa
