Hypothesis

2 Matching Annotations

Apr 2026
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  The most famous chart in AI might be obsolete soon.
  
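  The subtitle is itself a startling claim: the most famous chart of AI progress is about to become obsolete, not because AI has stopped improving but precisely because it is improving too fast. This creates a strange paradox: the rate at which evaluation tools fail is positively correlated with the rate at which the systems they evaluate improve. Our understanding of AI capabilities is iterating more slowly than AI itself is advancing, and "evaluation lag" will become a core challenge for AI governance and decision-making in the coming years.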
  
  obsolete-benchmark evaluation-lag measurement-crisis paradox

URL

understandingai.org/p/why-its-getting-harder-to-measure
metr.org

Task-Completion Time Horizons of Frontier AI Models

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Some recent models that don't currently have time horizons: Gemini 3.1 Pro, GPT-5.2-Codex, Grok 4.1
  
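  METR publicly lists frontier models that have "not yet been evaluated," and that transparency is itself surprising. More notable is what the list contains: Gemini 3.1 Pro and GPT-5.2-Codex both appear on it, which suggests that METR's evaluation capacity cannot keep up with the pace of model releases. With AI capabilities iterating rapidly, "evaluation lag" has become a systemic risk in AI safety: we are always half-blind to the capability boundaries of the newest, strongest models.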
  
  evaluation-lag AI-safety-risk transparency Gemini-GPT-Grok

URL

metr.org/time-horizons/
