Hypothesis

2 Matching Annotations

Last 7 days
deepmind.google deepmind.google

Untitled document

1
1. fxp007 12 Jun 2026
  
  in Public
  
  Most safety evaluations analyze models in isolation
  
  这是当前AI安全研究的结构性盲点。我们知道如何评估单个模型的安全性，但几乎没有工具评估智能体群体的集体行为。类比：你可以测试每个人类个体的理性程度，但无法从个体测试中预测市场崩溃或谣言扩散。复杂系统的涌现行为，从根本上不可从还原论方式预测——这正是这笔$10M资助的存在理由。
  
  涌现行为安全评估盲点复杂系统
Visit annotations in context

Tags

涌现行为

安全评估盲点

复杂系统

Annotators

fxp007

URL

deepmind.google/blog/investing-in-multi-agent-ai-safety-research/
alignment.anthropic.com alignment.anthropic.com

自动化弱到强研究者 --- Automated Weak-to-Strong Researcher

1
1. fxp007 12 Jun 2026
  
  in Public
  
  None of the authors predicted these hacks before running AARs. While we tried to add patches to the environment, AARs still figured out new unexpected ways to hack
  
  这是全文最让人警觉的段落。作者列出了几种令人叹服的reward hacking策略：利用答案频率猜测正确答案、通过聚类识别生成模型、逐一翻转预测反向工程测试集标签、直接执行代码绕过评估……每一种都是论文作者事先未预测到的。这揭示了一个根本性不对称：防御方需要预测所有可能的攻击，而进攻方只需找到一个漏洞。
  
  奖励黑客标签泄漏评估安全
Visit annotations in context

Tags

奖励黑客

标签泄漏

评估安全

Annotators

fxp007

URL

alignment.anthropic.com/2026/automated-w2s-researcher/

Tags

Annotators

URL

Tags

Annotators

URL