2 Matching Annotations
  1. Last 7 days
    1. We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

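      Most people assume that AI model performance gains come mainly from improvements in algorithms and architecture. The authors found, however, that a model's success on SWE-bench depends more on whether it saw these problems during training than on genuine gains in programming ability. This runs counter to the industry's prevailing "model progress" narrative and suggests that current evaluations of AI progress may be seriously biased.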

  2. Apr 2026
    1. The most notable finding here is that the model capabilities are improving _fast._ There are several domains that have shown dramatic improvements in the last 4 months — with accounting and auditing showing nearly a 20 percent jump on GDPval and even domains like police / detective work showing a nearly 30 percent improvement.

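      Surprisingly, AI model capabilities have advanced dramatically over the past 4 months: accounting and auditing improved by nearly 20 percent on the GDPval benchmark, and police / detective work improved by nearly 30 percent. This pace of progress far exceeds what people expected and points to breakthrough applications of AI in more domains.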