2 Matching Annotations
  1. Last 7 days
    1. We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.

      令人惊讶的是:研究人员构建的自动化扫描工具发现,所有八个主流AI代理基准测试都存在漏洞,无需解决任何任务就能获得接近完美的分数。这表明整个AI评估领域存在系统性问题,几乎所有当前使用的基准测试都不可靠。

    2. FieldWorkArena presents 890 tasks where an AI agent must answer questions about images, videos, PDFs, and text files through a browser environment. Its validate() method checks only one thing: did the last message come from the assistant?

      令人惊讶的是:FieldWorkArena这个评估890个多模态任务的基准测试,其验证函数只检查最后一条消息是否来自助手,完全不验证内容正确性。只需发送一条空消息就能获得100%的分数,这暴露了评估逻辑的根本性缺陷。