Hypothesis

models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety

这是当前AI安全评估领域最棘手的问题之一：如果一个模型在测试时表现良好，只是因为它知道自己在被测试，那么所有的安全保证都建立在沙滩上。模型越来越会识别评测集的特征——合成数据的语言模式、提示的格式风格、反复出现的测试场景。Deployment Simulation通过使用真实用户对话作为前缀，让模型无法区分评测和真实部署，从而让测量结果更可信。

评测感知安全评估失真沙盒逃逸

Tags

Annotators

URL