9 Matching Annotations
  1. May 2026
    1. The next generation of benchmarks needs to be harder, more realistic, and less gameable

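      [Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward "real work" rather than converge on "exam questions." But this leads directly to a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more it costs (METR: USD 8,000 per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.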

  2. Apr 2026
    1. benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores.

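      Most people regard public datasets as the gold standard for AI evaluation, offering an objective and fair test environment. The author warns, however, that benchmarks built from public material carry contamination risk: exposure in the training data can silently inflate scores. This view challenges conventional practice in AI evaluation and suggests that we need stricter data isolation measures or a shift to private datasets for evaluation.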

  3. Sep 2021
    1. Haber, N. A., Wieten, S. E., Rohrer, J. M., Arah, O. A., Tennant, P. W. G., Stuart, E. A., Murray, E. J., Pilleron, S., Lam, S. T., Riederer, E., Howcutt, S. J., Simmons, A. E., Leyrat, C., Schoenegger, P., Booman, A., Dufour, M.-S. K., O’Donoghue, A. L., Baglini, R., Do, S., … Fox, M. P. (2021). Causal and Associational Linking Language From Observational Research and Health Evaluation Literature in Practice: A systematic language evaluation [Preprint]. Epidemiology. https://doi.org/10.1101/2021.08.25.21262631

  4. Sep 2020
  5. Jul 2020
  6. Apr 2020
  7. Mar 2019
    1. This link is to a three-page PDF that describes Gagne's nine events of instruction, largely in the form of a graphic. Text is minimized, and the descriptive text is color-coded so it is easy to find underneath the graphic at the top. The layout is simple and easy to follow. A general description of Gagne's work is not part of this page. While this particular presentation does not appeal to me personally, it is included here because of the quality of the page and because the presentation is more user-friendly than most. Rating 4/5

    1. This is a description of the form of backward design referred to as Understanding by Design. In its simplest form, this is a three-step process in which instructional designers first specify desired outcomes and acceptable evidence before specifying learning activities. This presentation may be a little boring to read, as it is text-heavy and black and white, but those same attributes make it printer-friendly. Rating 3/5