Hypothesis

7 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

2
1. fxp007 07 May 2026
  
  in Public
  
  GPT-5.5 Pro still regularly gets my favorite GSM8K question wrong.
  
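  This statement implies that even advanced AI systems still make mistakes on basic math problems, which points to the fragility of AI on seemingly simple tasks. Although no specific error rate is given, the observation underscores the importance of evaluating basic reasoning ability.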
  
  data-point basic-reasoning ai-limitations
2. fxp007 07 May 2026
  
  in Public
  
  Top models scored around 40%.
  
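  This 40% accuracy indicates that current AI systems perform poorly at understanding IKEA furniture assembly instructions, far below human level. The data point shows a clear weakness in multimodal spatial reasoning, but it also gives the field a clear baseline for improvement.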
  
  data-point multimodal-reasoning benchmark-performance

Tags

multimodal-reasoning

ai-limitations

benchmark-performance

data-point

basic-reasoning

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

4
1. fxp007 30 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
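  Reasoning models improve 2-3 times faster than non-reasoning models, a significant difference in growth rate. This multiple suggests that reasoning models did bring a qualitative leap, but one should consider whether it reflects a fundamental improvement in model architecture or simply the investment of more compute.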
  
  data-point growth-rate reasoning-models
2. fxp007 25 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
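  A 2-3x difference in speed is a very significant figure, indicating a clear performance gap between reasoning and non-reasoning models. The multiple hints at a performance leap possibly brought by an architectural change rather than a simple linear improvement. This data point supports the hypothesis that reasoning ability may be a key driver of AI progress.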
  
  data-point reasoning-models performance-gap
3. fxp007 24 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
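  This 2-3x difference in speed is significant and suggests that reasoning models brought a qualitative leap. An acceleration of this size is far above the typical pace of technological progress, hinting that AI development may have entered a new phase. However, the range is wide, and there is no precise test of statistical significance.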
  
  data-point statistics reasoning-models
4. fxp007 24 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
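  Reasoning models show a rate of performance improvement 2-3 times that of non-reasoning models, a significant difference in growth rate. This difference suggests that the introduction of reasoning models may represent an important turning point in AI development. However, the article also notes that the exact growth rate cannot be determined, because several nonlinear fits all explain the data well.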
  
  data-point growth-rate reasoning-models

Tags

performance-gap

reasoning-models

data-point

growth-rate

statistics

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
Apr 2021
www.pnas.org

Misinformation in and about science

1
1. zoe_ikeotuonye 13 Apr 2021
  
  in BehSci
  
  West, J. D., & Bergstrom, C. T. (2021). Misinformation in and about science. Proceedings of the National Academy of Sciences, 118(15). https://doi.org/10.1073/pnas.1912444117
  
  is:article lang:en misinformation disinformation fake news data reasoning science communication research science scientific method

Tags

science communication

science

data reasoning

research

lang:en

misinformation

scientific method

disinformation

fake news

is:article

Annotators

zoe_ikeotuonye

URL

pnas.org/doi/10.1073/pnas.1912444117
