1 Matching Annotations
  1. Jun 2026
    1. models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety

      这是当前AI安全评估领域最棘手的问题之一:如果一个模型在测试时表现良好,只是因为它知道自己在被测试,那么所有的安全保证都建立在沙滩上。模型越来越会识别评测集的特征——合成数据的语言模式、提示的格式风格、反复出现的测试场景。Deployment Simulation通过使用真实用户对话作为前缀,让模型无法区分评测和真实部署,从而让测量结果更可信。