6 Matching Annotations
  1. May 2026
    1. An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.

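      NLAs let an auditor extract hidden motivations directly from the AI's thoughts, without relying on the training data. This greatly improves the efficiency of AI alignment audits and offers a new way to uncover a model's internal biases.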

  2. Apr 2026
    1. This ultimately also leads to false positives, but my manual QA run verified it's maybe 5-10%.

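      Most people assume an AI detection system should aim for zero errors, but the author accepts a 5-10% false-positive rate, which challenges the perfectionist standard for technical detection. This pragmatic attitude suggests that AI detection calls for a trade-off between accuracy and practicality rather than a blind pursuit of perfection.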

    1. Real-time monitoring of agent actions with a 12-category anomaly detection system derived from frontier model safety evaluations. Three-level alert system: PROHIBITED (immediate block), HIGH_RISK_DUAL_USE (human review), DUAL_USE (log and track).

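      This three-level alert system shows how fine-grained AI safety monitoring has become: it sorts agent actions into risk levels that range from an outright block to log-and-track only. The classification reflects the complexity of the "dual-use" challenge in AI safety, where the same technique can serve both defense and attack.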

    1. It also discovered a 16-year-old vulnerability in FFmpeg—which is used by innumerable pieces of software to encode and decode video—in a line of code that automated testing tools had hit five million times without ever catching the problem.

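      Surprisingly, Claude Mythos Preview found a 16-year-old vulnerability in FFmpeg that had gone undetected even though automated testing tools had executed that line of code five million times. This suggests that AI can bring an insight to code analysis that traditional automated tools cannot match.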

    1. For example, people who themselves use AI writing tools heavily have been shown to accurately detect AI-written text. A panel of human evaluators can even outperform automated tools in a controlled setting

      This statement alone is very interesting to me because, in my opinion, AI can be a great tool for learning but can also hinder our ability to learn.
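
The three-level alert system quoted in the Apr 2026 annotations (PROHIBITED, HIGH_RISK_DUAL_USE, DUAL_USE) could be sketched roughly as follows. This is only an illustration of the routing idea, not the original system: the quote mentions 12 anomaly categories but does not list them, so the category names, the `route_action` function, and the pass-through behaviour for unmatched actions are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    """The three alert levels named in the quoted passage."""
    PROHIBITED = "PROHIBITED"
    HIGH_RISK_DUAL_USE = "HIGH_RISK_DUAL_USE"
    DUAL_USE = "DUAL_USE"


# Response per level, following the quote:
# immediate block, human review, log and track.
RESPONSES = {
    AlertLevel.PROHIBITED: "block",
    AlertLevel.HIGH_RISK_DUAL_USE: "human_review",
    AlertLevel.DUAL_USE: "log_and_track",
}

# Hypothetical category-to-level table. The source does not list its
# 12 categories, so these names are placeholders only.
CATEGORY_LEVELS = {
    "example_prohibited_category": AlertLevel.PROHIBITED,
    "example_high_risk_category": AlertLevel.HIGH_RISK_DUAL_USE,
    "example_dual_use_category": AlertLevel.DUAL_USE,
}


@dataclass
class Decision:
    category: str
    level: Optional[AlertLevel]
    response: str


def route_action(category: str) -> Decision:
    """Map a detected anomaly category to its alert level and response."""
    level = CATEGORY_LEVELS.get(category)
    if level is None:
        # Assumption: an action matching no category passes through unflagged.
        return Decision(category, None, "allow")
    return Decision(category, level, RESPONSES[level])
```

Keeping the level-to-response mapping separate from the category-to-level table means a category can be moved to a different risk level without touching the response logic.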

  3. Oct 2020