Hypothesis

This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

大多数人认为基准测试分数的提高意味着模型实际能力的提升。但作者明确表示，SWE-bench Verified的改进不再反映模型真实软件开发能力的进步，而是更多地反映了模型在训练时接触该基准测试的程度。这一结论挑战了整个AI评估体系的有效性，暗示我们可能需要重新思考如何衡量AI的真实进步。

non-consensus benchmark-validity ai-progress

Tags

Annotators

URL