11 Matching Annotations
  1. May 2026
    1. 90.6% (1,587) have proved to be valid true positives, and 62.4% (1,094) were confirmed as either high- or critical-severity

      这两个百分比数据点(90.6%验证率,62.4%确认高危率)对于评估AI模型在安全漏洞检测中的可靠性至关重要。90.6%的验证率表明AI模型的误报率相对较低,这在AI安全领域是相当出色的表现。然而,62.4%的确认高危率意味着近40%的AI评估高危漏洞实际严重程度较低,这反映了AI在严重性评估上仍有改进空间。

  2. Apr 2026
    1. _Self-reported score with custom Anthropic scaffold._ SWEPro were evaluated with the mini-swe-agent scaffold. However, we use the scores reported by Anthropic for Opus with the max thinking efforts due to frequent timeouts during our evaluation trials.

      脚注2揭示了重要数据点:Opus 4.6的53.4分是Anthropic的自报分数,因为作者在评估过程中频繁遇到超时问题,无法自行验证。这表明性能比较中存在数据可靠性问题,特别是对于Opus的评估依赖于厂商自报数据,可能存在偏差。

  3. Jan 2022
  4. Sep 2021
  5. Dec 2020
    1. I haven't met anyone who makes this argument who then says that a one stop convenient, reliable, private and secure online learning environment can’t be achieved using common every day online systems

      Reliable: As a simple example, I'd trust Google to maintain data reliability over my institutional IT support.

      And you'd also need to make the argument for why learning needs to be "private", etc.

  6. Aug 2020
  7. Jun 2020
  8. May 2020
  9. Apr 2020