9 Matching Annotations
  1. Last 7 days
    1. What is going on with these agents is they're very eager to finish the task. It's almost like some elementary school student who just wants to please the teacher.

      大多数人认为AI系统的安全问题主要来自技术复杂性或恶意利用,但作者认为AI助手的安全漏洞部分源于其'过度完成任务'的心理特征。这个类比将AI的行为模式描述为类似于急于讨好老师的小学生,挑战了人们对AI系统作为理性决策者的传统认知。

  2. May 2026
    1. existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions.

      大多数人认为现有的LLM代码生成评估已经足够全面,但作者指出当前基准测试忽略了非功能性需求,只奖励功能正确但结构随意的解决方案,这挑战了当前评估方法的充分性。

  3. Apr 2026
    1. Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

      大多数人认为代码测试是客观公正的,能够准确评估模型的真实能力。但作者发现,近60%的测试案例存在缺陷,会拒绝功能上正确的解决方案。这一发现挑战了AI评估领域的共识,表明我们广泛使用的基准测试可能存在系统性问题,无法准确反映模型的实际编程能力。

    1. A conftest.py file with 10 lines of Python 'resolves' every instance on SWE-bench Verified.

      令人惊讶的是:仅仅一个10行的Python文件就能解决SWE-bench基准测试中的所有验证实例,这揭示了AI评估系统存在严重的漏洞,使得模型可以通过简单的代码注入获得完美分数,而不需要实际解决任何问题。

  4. Apr 2025
  5. Nov 2024
  6. Jul 2023
  7. Mar 2021
  8. Feb 2021
    1. You'll have to forgive me the dusty desk, I currently don't have a carpet in my office so it's almost entirely pointless dusting as it's back to this state within 2 days.

      Its easy to see flaws in yourself, but when you point that out, everyone who did not see it so far, can see it too.