The next generation of benchmarks needs to be harder, more realistic, and less gameable
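[Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward "real work" rather than converge on "exam questions". But this raises a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more it costs (METR: 8,000 USD per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.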
benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores.
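Most people regard public datasets as the gold standard for AI evaluation, offering an objective and impartial test environment. But the author warns that benchmarks built from public material carry contamination risk: exposure through training data can quietly inflate scores. This view challenges conventional practice in AI evaluation and suggests that we need stricter data isolation or a shift to private datasets for evaluation.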
Haber, N. A., Wieten, S. E., Rohrer, J. M., Arah, O. A., Tennant, P. W. G., Stuart, E. A., Murray, E. J., Pilleron, S., Lam, S. T., Riederer, E., Howcutt, S. J., Simmons, A. E., Leyrat, C., Schoenegger, P., Booman, A., Dufour, M.-S. K., O’Donoghue, A. L., Baglini, R., Do, S., … Fox, M. P. (2021). Causal and Associational Linking Language From Observational Research and Health Evaluation Literature in Practice: A systematic language evaluation [Preprint]. Epidemiology. https://doi.org/10.1101/2021.08.25.21262631
Susan Athey, July 22, 2020. (2020, August 2). https://www.youtube.com/watch?v=hqTOPrUxDzM
Eisen, M. B., & Tibshirani, R. (2020, July 20). Opinion | How to Identify Flawed Research Before It Becomes Dangerous. The New York Times. https://www.nytimes.com/2020/07/20/opinion/coronavirus-preprints.html
Beitner, J., Brod, G., Gagl, B., Kraft, D., & Schultze, M. (2020, April 23). Offene Wissenschaft in der Zeit von Covid-19 – Eine Blaupause für die psychologische Forschung? https://doi.org/10.31234/osf.io/sh8xg
Gagne's nine events of instruction: I am including this page for myself because it is a handy reference to Gagne's nine events. It gives an example of each event, lists four essential principles, and includes some of his book titles. Rating 4/5
This link is to a three-page PDF that describes Gagne's nine events of instruction, largely in the form of a graphic. Text is minimized, and the descriptive text is color-coded so it is easy to find underneath the graphic at the top. The layout is simple and easy to follow. A general description of Gagne's work is not part of this page. While this particular presentation does not appeal to me personally, it is included here due to the quality of the page and because the presentation is more user-friendly than most. Rating 4/5
This is a description of the form of backward design referred to as Understanding by Design. In its simplest form, this is a three-step process in which instructional designers specify desired outcomes and acceptable evidence before specifying learning activities. This presentation may be a little boring to read, as it is text-heavy and black and white, but those same attributes make it printer-friendly. Rating 3/5