Hypothesis

4 Matching Annotations

Jun 2026
www.theverge.com

https://www.theverge.com/news/946725/anthropic-releases-claude-fable-5-mythos

1
1. fxp007 10 Jun 2026
  
  in Public
  
  Anthropic singled out cybersecurity and biology as two domains where the safeguards may block responses, both areas widely considered sensitive topics for advanced AI systems.
  
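  The article hints at AI risk in specific domains but does not explain in detail why these domains are considered sensitive. It is worth digging into how Anthropic's safeguards actually work, whether these restrictions are comprehensive enough, and whether other potentially risky domains exist.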
  
  ai-safety risk-assessment limitations

Tags

ai-safety

limitations

risk-assessment

Annotators

fxp007

URL

theverge.com/news/946725/anthropic-releases-claude-fable-5-mythos
Apr 2026
github.com

https://github.com/fxp/aegis-core

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Real-time monitoring of agent actions with a 12-category anomaly detection system derived from frontier model safety evaluations. Three-level alert system: PROHIBITED (immediate block), HIGH_RISK_DUAL_USE (human review), DUAL_USE (log and track).
  
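  This three-level alert system shows how fine-grained AI safety monitoring has become: agent actions are sorted into risk levels ranging from outright prohibition to log-and-track only. The classification reflects the complexity of the "dual-use" challenge in AI safety, where the same technology can serve both defense and attack.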
  
  anomaly-detection risk-assessment ai-safety
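
The three alert levels quoted above can be sketched as a small routing table. This is a minimal illustration only: the `AlertLevel` enum, the `RESPONSES` mapping, the `route_alert` function, and the action strings are hypothetical names chosen for this note and are not taken from the aegis-core repository. Only the level names and their paired responses come from the quoted description.

```python
from enum import Enum


class AlertLevel(Enum):
    """Alert levels named in the quoted description."""
    PROHIBITED = "PROHIBITED"
    HIGH_RISK_DUAL_USE = "HIGH_RISK_DUAL_USE"
    DUAL_USE = "DUAL_USE"


# Hypothetical action labels; the pairing of level to response follows the
# quoted text (immediate block, human review, log and track).
RESPONSES = {
    AlertLevel.PROHIBITED: "block",
    AlertLevel.HIGH_RISK_DUAL_USE: "human_review",
    AlertLevel.DUAL_USE: "log_and_track",
}


def route_alert(level: AlertLevel) -> str:
    """Return the response paired with an alert level."""
    return RESPONSES[level]


if __name__ == "__main__":
    for level in AlertLevel:
        print(level.value, "->", route_alert(level))
```

How an agent action gets assigned to one of the 12 anomaly categories, and from there to a level, is not described in the quoted passage, so the sketch leaves that step out.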

Tags

anomaly-detection

ai-safety

risk-assessment

Annotators

fxp007

URL

github.com/fxp/aegis-core
hai.stanford.edu

https://hai.stanford.edu/ai-index/2026-ai-index-report/

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Responsible AI is not keeping pace with AI capability, with safety benchmarks lagging and incidents rising sharply.
  
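  This warning exposes a dangerous imbalance in AI development: technical capability is advancing quickly while responsible-AI practice and safety measures lag badly. The gap could lead to unforeseen risks and trigger a crisis of public trust in AI, and it needs urgent attention.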
  
  ai-safety responsible-ai risk-management

Tags

responsible-ai

ai-safety

risk-management

Annotators

fxp007

URL

hai.stanford.edu/ai-index/2026-ai-index-report/
metr.org

Task-Completion Time Horizons of Frontier AI Models

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Some recent models that don't currently have time horizons: Gemini 3.1 Pro, GPT-5.2-Codex, Grok 4.1
  
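  METR publicly lists the frontier models it has "not yet evaluated," and that transparency is itself surprising. More striking is what is on the list: both Gemini 3.1 Pro and GPT-5.2-Codex appear, which suggests that METR's evaluation capacity cannot keep up with the pace of model releases. With AI capabilities iterating rapidly, "evaluation lag" has become a systemic risk in AI safety: our view of the capability limits of the newest, strongest models is always half-blind.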
  
  evaluation-lag AI-safety-risk transparency Gemini-GPT-Grok

Tags

evaluation-lag

AI-safety-risk

transparency

Gemini-GPT-Grok

Annotators

fxp007

URL

metr.org/time-horizons/