The next generation of benchmarks needs to be harder, more realistic, and less gameable
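[Insight] "Harder, more realistic, less gameable": taken together, these three criteria ask benchmarks to move toward real work rather than converge on exam questions. But this leads straight to a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more it costs (METR: USD 8,000 per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.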