2 Matching Annotations
  1. Last 7 days
    1. Sonnet 4.6 and Opus 4.6 reached tier 1 in between 150 and 175 cases, and tier 2 about 100 times, but each achieved only a single crash at tier 3. In contrast, Mythos Preview achieved 595 crashes at tiers 1 and 2, added a handful of crashes at tiers 3 and 4, and achieved full control flow hijack on ten separate, fully patched targets (tier 5).

      Tier 5(完全控制流劫持)的0→10跨越,发生在完全打好补丁的目标上,是这篇报告最令人震惊的数据点。控制流劫持意味着攻击者可以执行任意代码——这是漏洞利用的终极目标。此前的模型从未达到这个等级;Mythos Preview在一次评估中就实现了10次,分布在不同的开源项目上,意味着这不是一个幸运的偶然,而是系统性的能力。

  2. Apr 2026
    1. The model reportedly scored 93.9% on SWE-bench Verified and 77.8% on SWE-bench Pro, but its strongest signal came from real-world results, including uncovering a 27-year-old flaw in OpenBSD, a 16-year-old vulnerability in FFmpeg, and autonomously chaining Linux kernel exploits without human input.

      这些惊人的安全漏洞发现能力表明AI已经超越了传统安全工具,能够自主发现几十年未被发现的漏洞。特别是能够自主链接Linux内核漏洞的能力,展示了AI在网络安全领域的革命性潜力,这可能彻底改变安全研究和漏洞修复的方式。