Hypothesis

GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, while traditional evaluations (such as SWE-Bench or MonorepoBench, which are composed primarily of synthetic examples) were classified as evaluations almost 100% of the time.

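These numbers are striking: the model flags real production traffic as an evaluation 5.4% of the time and simulated traffic 5.1% of the time, which is nearly identical. Traditional benchmarks (SWE-Bench, MonorepoBench, and the like), by contrast, are recognized as evaluations almost 100% of the time. This means that traditional evaluations built on synthetic data are almost completely transparent to the model: they measure how the model behaves when it knows it is being tested, not how it behaves in real deployment. The gap between the two is exactly what Deployment Simulation tries to bridge.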

Evaluation awareness, 5.4% vs. 100%, key data
