Hypothesis

We found that AI agent performance drops substantially when scoring AI performance holistically rather than algorithmically.

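The performance gap between holistic scoring and algorithmic scoring is a serious warning: AI performs far better on tasks that have a clear correct answer than on tasks whose quality requires human judgment. This implies that every AI benchmark built on automated scoring systematically overestimates AI capability in real work. The time-horizon figures themselves are subject to the same limitation: any task that can be scored algorithmically is better suited to AI than real work is.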

holistic-scoring benchmark-limitation real-world-gap surprising

Tags

Annotators

URL