5 Matching Annotations
  1. Last 7 days
    1. the real failure mode of uncontrolled vibe coding: your codebase regressing to your worst engineer.

      This is the sharpest critique of naive AI coding adoption in the article. Without proper agent oversight, code review loops, and quality gates, AI doesn't raise the floor — it lowers it by enabling low-quality code to ship at machine speed. The 'worst engineer' framing implies that unconstrained agents optimize for task completion, not codebase health.

  2. Apr 2026
    1. Btw, I think GLM-5.1 was trying to do something very ambitious here, and failed due to fumbling step size

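      Surprisingly, GLM-5.1, an advanced AI model, failed because of a technical detail as basic as mishandling the step size. This suggests that even top-tier AI can stumble at the level of basic execution, not only in conceptual design.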

    1. We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.

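      Surprisingly, the automated scanning tool the researchers built found that all eight mainstream AI agent benchmarks have vulnerabilities that allow near-perfect scores without solving any task. This points to a systemic problem across the field of AI evaluation: nearly all benchmarks in current use are unreliable.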

  3. Mar 2026