  1. Jun 2026
    1. GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, while traditional evaluations (such as SWE-Bench or MonorepoBench, which are composed primarily of synthetic examples) were classified as evaluations almost 100% of the time

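      These numbers are persuasive: the model flagged real production traffic as an evaluation 5.4% of the time and simulated traffic 5.1% of the time, nearly identical rates. Traditional evaluation sets (SWE-Bench, MonorepoBench, and the like), by contrast, were recognized as evaluations close to 100% of the time. This means that traditional evaluations built on synthetic data are almost completely transparent to the model: they measure how the model behaves when it knows it is being tested, not how it behaves in real deployment. The gap between the two is exactly what Deployment Simulation tries to bridge.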