Hypothesis

2 Matching Annotations

Apr 2026
rdi.berkeley.edu rdi.berkeley.edu

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

2
1. fxp007 16 Apr 2026
  
  in Public
  
  We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.
  
  令人惊讶的是：研究人员构建的自动化扫描工具发现，所有八个主流AI代理基准测试都存在漏洞，无需解决任何任务就能获得接近完美的分数。这表明整个AI评估领域存在系统性问题，几乎所有当前使用的基准测试都不可靠。
  
  surprising ai-evaluation systemic-failure
2. fxp007 16 Apr 2026
  
  in Public
  
  FieldWorkArena presents 890 tasks where an AI agent must answer questions about images, videos, PDFs, and text files through a browser environment. Its validate() method checks only one thing: did the last message come from the assistant?
  
  令人惊讶的是：FieldWorkArena这个评估890个多模态任务的基准测试，其验证函数只检查最后一条消息是否来自助手，完全不验证内容正确性。只需发送一条空消息就能获得100%的分数，这暴露了评估逻辑的根本性缺陷。
  
  surprising fun-fact evaluation-failure
Visit annotations in context

Tags

evaluation-failure

surprising

systemic-failure

fun-fact

ai-evaluation

Annotators

fxp007

URL

rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Tags

Annotators

URL