Hypothesis

None of the authors predicted these hacks before running AARs. While we tried to add patches to the environment, AARs still figured out new unexpected ways to hack

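This is the most alarming passage in the paper. The authors list several impressive reward hacking strategies: using answer frequency to guess the correct answer, identifying the generating model through clustering, reverse-engineering test-set labels by flipping predictions one at a time, executing code directly to bypass the evaluation... None of these was predicted by the paper's authors beforehand. This reveals a fundamental asymmetry: the defender has to anticipate every possible attack, while the attacker only needs to find one loophole.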

reward hacking, label leakage, evaluation security
