Hypothesis

we take recent conversations from deployment, remove the original assistant response from the older model, and regenerate it with a candidate model to be released

这个方法的优雅之处在于它的反事实重演逻辑：用真实用户的真实上下文，替换掉旧模型的回复，看新模型会怎么接。相比于合成测试集，这个方法的核心假设是：真实用户的输入分布本身就是最好的测试套件。不需要猜测用户会问什么、会怎么绕过护栏——直接用他们已经做过的事情来测试。这是一种从构造压力测试到重播真实世界的范式转移。

部署模拟真实分布安全评估

Tags

Annotators

URL