Hypothesis

8 Matching Annotations

Jun 2026
www.wired.com

https://www.wired.com/story/the-white-house-wants-anthropic-to-block-all-jailbreaks-that-may-not-be-possible/

1
1. fxp007 17 Jun 2026
  
  in Public
  
  The government believes it has become aware of a method of bypassing, or 'jailbreaking' Fable 5.
  
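  This is a government statement that needs verification, concerning the specifics of an AI security vulnerability. It remains to be confirmed whether the government actually discovered such a method, and how effective and far-reaching it is. The episode reflects an ongoing challenge in AI security research.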
  
  security-breach government-claim ai-vulnerability
Visit annotations in context

Tags

ai-vulnerability

security-breach

government-claim

Annotators

fxp007

URL

wired.com/story/the-white-house-wants-anthropic-to-block-all-jailbreaks-that-may-not-be-possible/
red.anthropic.com

Claude Mythos Preview \ red.anthropic.com

1
1. fxp007 05 Jun 2026
  
  in Public
  
  in 89% of the 198 manually reviewed vulnerability reports, our expert contractors agreed with Claude's severity assessment exactly, and 98% of the assessments were within one severity level. If these results hold consistently for our remaining findings, we would have over a thousand more critical severity vulnerabilities and thousands more high severity vulnerabilities.
  
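  Exact agreement on 89% of severity assessments is an important calibration signal: it means Mythos can not only find vulnerabilities but also accurately understand their security impact. This level of calibration is comparable to, or even better than, that of experienced human security researchers. The figure of "over a thousand" critical-severity vulnerabilities extrapolated from this rate is an estimate, but it has a statistical basis; it is the strongest quantitative claim to date about AI's capacity for large-scale vulnerability discovery.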
  
  severity-calibration vulnerability-scale ai-security-research
Visit annotations in context

Tags

vulnerability-scale

severity-calibration

ai-security-research

Annotators

fxp007

URL

red.anthropic.com/2026/mythos-preview/
May 2026
www.promptarmor.com

https://www.promptarmor.com/resources/microsoft-copilot-cowork-exfiltrates-files

1
1. fxp007 25 May 2026
  
  in Public
  
  This attack achieved a high success rate against state-of-the-art models, including Claude Opus 4.7.
  
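  Most people assume the latest AI models are advanced enough to resist basic injection attacks, but the authors show that even a frontier model such as Claude Opus 4.7 cannot withstand simple indirect prompt injection. This challenges inflated expectations about the security of advanced AI models.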
  
  non-consensus ai-vulnerability prompt-injection
Visit annotations in context

Tags

ai-vulnerability

non-consensus

prompt-injection

Annotators

fxp007

URL

promptarmor.com/resources/microsoft-copilot-cowork-exfiltrates-files
www.llmwatch.com

https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-cbd

1
1. fxp007 01 May 2026
  
  in Public
  
  The most urgent finding this week comes from researchers who demonstrated that the very mechanism enabling agents to use tools - function calling - can be hijacked with alarming reliability.
  
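  This finding exposes a security vulnerability in the tool-calling interface of AI agents and poses a new challenge for building secure AI agent systems.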
  
  security-vulnerability ai-agents
Visit annotations in context

Tags

security-vulnerability

ai-agents

Annotators

fxp007

URL

llmwatch.com/p/ai-agents-of-the-week-papers-you-cbd
Apr 2026
aphyr.com

https://aphyr.com/posts/419-the-future-of-everything-is-lies-i-guess-new-jobs

1
1. fxp007 16 Apr 2026
  
  in Public
  
  just a handful of obviously fake articles could cause Gemini, ChatGPT, and Copilot to inform users about an imaginary disease with a ridiculous name.
  
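  Surprisingly, just a handful of obviously fake articles can lead mainstream AI models to spread information about a fictitious disease. This reveals how easily AI training data can be contaminated, and suggests that clean data sources analogous to "low-background steel" may be needed in the future to keep AI output reliable.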
  
  surprising ai-vulnerability fun-fact
Visit annotations in context

Tags

ai-vulnerability

fun-fact

surprising

Annotators

fxp007

URL

aphyr.com/posts/419-the-future-of-everything-is-lies-i-guess-new-jobs
a16z.com

https://a16z.com/et-tu-agent-did-you-install-the-backdoor/

1
1. fxp007 09 Apr 2026
  
  in Public
  
  select known-vulnerable dependency versions 50% more often than humans.
  
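  This statistic overturns the myth that "AI-written code is more secure." When optimizing for functionality, AI agents often sacrifice security and tend to choose older dependency versions with known vulnerabilities. This reflects the neglect of security in current model training, and it is a warning that automated security checkpoints must be enforced in AI-assisted development workflows.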
  
  ai-vulnerability dependency-management security-gap
Visit annotations in context

Tags

ai-vulnerability

security-gap

dependency-management

Annotators

fxp007

URL

a16z.com/et-tu-agent-did-you-install-the-backdoor/
arxiv.org

https://arxiv.org/abs/2604.02947

2
1. fxp007 08 Apr 2026
  
  in Public
  
  current systems remain highly vulnerable
  
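  Despite notable progress in AI safety in recent years, the authors assert that current systems remain highly vulnerable. This conclusion runs against the industry's optimism and rests on hands-on testing of several mainstream agent systems, which suggests that AI safety problems may be far more serious than the industry admits.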
  
  counterintuitive ai-safety system-vulnerability
2. fxp007 08 Apr 2026
  
  in Public
  
  current systems remain highly vulnerable
  
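  Despite notable progress in AI safety research, the authors use the AgentHazard benchmark to show that today's most advanced computer-use agent systems remain extremely fragile. This challenges the widespread belief in academia and industry that AI safety is already adequate.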
  
  counterintuitive ai-vulnerability benchmark-results
Visit annotations in context

Tags

benchmark-results

ai-safety

counterintuitive

ai-vulnerability

system-vulnerability

Annotators

fxp007

URL

arxiv.org/abs/2604.02947
