Hypothesis

3 Matching Annotations

May 2026
arxiv.org

https://arxiv.org/abs/2605.06445

1
1. fxp007 24 May 2026
  
  in Public
  
  existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions.
  
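  Most people assume that existing evaluations of LLM code generation are already comprehensive. The authors point out that current benchmarks ignore non-functional requirements and reward solutions that are functionally correct but structurally arbitrary, which calls the adequacy of current evaluation methods into question.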
  
  counterintuitive benchmark-critique evaluation-flaws

Apr 2026
openai.com

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

1
1. fxp007 27 Apr 2026
  
  in Public
  
  Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions
  
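  Most people assume that code tests are objective and fair, and that they accurately measure a model's real ability. The authors found that nearly 60% of the audited problems have flawed test cases that reject functionally correct solutions. This finding challenges the consensus in AI evaluation: widely used benchmarks may have systematic problems and may not accurately reflect a model's actual programming ability.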
  
  non-consensus benchmark-flaws evaluation-crisis
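
The two percentages in the quoted passage compound, so it can help to work out what they imply for the dataset as a whole. The sketch below does that arithmetic. It assumes SWE-bench Verified contains 500 instances, a figure that does not appear in the quoted passage, and the function name is purely illustrative.

```python
# Assumption (not in the quoted passage): SWE-bench Verified has 500 instances.
TOTAL_INSTANCES = 500
AUDITED_FRACTION = 0.276  # share of the dataset that was audited
FLAWED_FRACTION = 0.594   # lower bound on flawed problems within the audited subset


def flawed_lower_bound(total, audited_fraction, flawed_fraction):
    """Return (audited count, flawed count, flawed share of the full dataset)."""
    audited = round(total * audited_fraction)
    flawed = round(audited * flawed_fraction)
    return audited, flawed, flawed / total


audited, flawed, share = flawed_lower_bound(
    TOTAL_INSTANCES, AUDITED_FRACTION, FLAWED_FRACTION
)
print(f"audited problems: {audited}")
print(f"audited problems with flawed tests (at least): {flawed}")
print(f"share of the full dataset (lower bound): {share:.1%}")
```

Under that assumption, the figure is a lower bound on the full dataset only. The audited subset was chosen from problems that models often failed to solve, so the 59.4% rate should not be extrapolated to the unaudited problems.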

rdi.berkeley.edu

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  A conftest.py file with 10 lines of Python 'resolves' every instance on SWE-bench Verified.
  
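  Surprisingly, a single 10-line Python file is enough to "resolve" every instance in SWE-bench Verified. This exposes a serious loophole in the evaluation setup: a model can obtain a perfect score through simple code injection without actually solving any problem.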
  
  surprising benchmark-flaws

