Hypothesis

9 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  The next generation of benchmarks needs to be harder, more realistic, and less gameable
  
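  [Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward "real work" rather than converge on "exam questions". But this raises a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more expensive it becomes (METR: US$8,000 per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.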
  
  benchmark-design next-generation evaluation-paradox insight

Tags

benchmark-design

next-generation

insight

evaluation-paradox

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
openai.com

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

1
1. fxp007 27 Apr 2026
  
  in Public
  
  benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores.
  
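  Most people regard public datasets as the gold standard for AI evaluation, offering an objective and fair test environment. The author warns, however, that benchmarks built from public material carry contamination risk: exposure in training data can silently inflate scores. This view challenges conventional practice in AI evaluation and suggests that we need stricter data isolation or a shift to private datasets for evaluation.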
  
  counterintuitive public-data-risk evaluation-design

Tags

counterintuitive

public-data-risk

evaluation-design

Annotators

fxp007

URL

openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
Sep 2021
www.medrxiv.org

Causal and Associational Linking Language From Observational Research and Health Evaluation Literature in Practice: A systematic language evaluation

1
1. lucyparfitt16 01 Sep 2021
  
  in BehSci
  
  Haber, N. A., Wieten, S. E., Rohrer, J. M., Arah, O. A., Tennant, P. W. G., Stuart, E. A., Murray, E. J., Pilleron, S., Lam, S. T., Riederer, E., Howcutt, S. J., Simmons, A. E., Leyrat, C., Schoenegger, P., Booman, A., Dufour, M.-S. K., O’Donoghue, A. L., Baglini, R., Do, S., … Fox, M. P. (2021). Causal and Associational Linking Language From Observational Research and Health Evaluation Literature in Practice: A systematic language evaluation [Preprint]. Epidemiology. https://doi.org/10.1101/2021.08.25.21262631
  
  is:preprint lang:en observational study study design observational research causality implication causal language language health evaluation review evaluation association

Tags

causality

review

lang:en

is:preprint

study design

implication

evaluation

association

observational research

observational study

health evaluation

language

causal language

Annotators

lucyparfitt16

URL

medrxiv.org/content/10.1101/2021.08.25.21262631v2
Sep 2020
www.youtube.com

Susan Athey, July 22, 2020

1
1. ErikStuchly 08 Sep 2020
  
  in BehSci
  
  Susan Athey, July 22, 2020. (2020, August 2). https://www.youtube.com/watch?v=hqTOPrUxDzM
  
  is:youtube lang:en webinar talk confidence interval policy evaluation statistics scientific method scientific practice adaptive field experiment study design data collection research video

Tags

confidence interval

data collection

policy evaluation

lang:en

video

statistics

adaptive field experiment

study design

webinar

research

talk

scientific practice

scientific method

is:youtube

Annotators

ErikStuchly

URL

youtube.com/watch
Jul 2020
www.nytimes.com

Opinion | How to Identify Flawed Research Before It Becomes Dangerous

1
1. Danaeioak 20 Jul 2020
  
  in BehSci
  
  Eisen, M. B., & Tibshirani, R. (2020, July 20). Opinion | How to Identify Flawed Research Before It Becomes Dangerous. The New York Times. https://www.nytimes.com/2020/07/20/opinion/coronavirus-preprints.html
  
  is:news lang:en COVID-19 research preprints peer review rapid science policymaking media scientific errors study design research evaluation science journalism reliability conveying results practical implication

Tags

scientific errors

practical implication

media

science

study design

preprints

rapid science

peer review

policymaking

is:news

journalism

research evaluation

COVID-19

lang:en

conveying results

research

reliability

Annotators

Danaeioak

URL

nytimes.com/2020/07/20/opinion/coronavirus-preprints.html
Apr 2020
psyarxiv.com

Offene Wissenschaft in der Zeit von Covid-19 – Eine Blaupause für die psychologische Forschung?

1
1. edampf 27 Apr 2020
  
  in BehSci
  
  Beitner, J., Brod, G., Gagl, B., Kraft, D., & Schultze, M. (2020, April 23). Offene Wissenschaft in der Zeit von Covid-19 – Eine Blaupause für die psychologische Forschung?. https://doi.org/10.31234/osf.io/sh8xg
  
  is:preprint COVID-19 lang:de psychology open science open data preregistration evaluation development study design publication review

Tags

open science

preregistration

review

COVID-19

is:preprint

study design

evaluation

open data

publication

psychology

lang:de

development

Annotators

edampf

URL

psyarxiv.com/sh8xg/
Mar 2019
www.instructionaldesign.org

Conditions of Learning (Robert Gagne) - InstructionalDesign.org

1
1. ks9 22 Mar 2019
  
  in Public
  
  Gagne's nine events of instruction. I am including this page for myself because it is a nice reference back to Gagne's nine events: it gives an example of each event as well as a list of four essential principles. It also includes some of his book titles. Rating 4/5
  
  etcnau etc556 Gagne 9 events nine events of instruction instructional design evaluation models theories

Tags

etcnau

instructional design

9 events

Gagne

theories

nine events of instruction

etc556

evaluation models

Annotators

ks9

URL

instructionaldesign.org/theories/conditions-learning/
www.valpo.edu

gagne_nine_events.pdf

1
1. ks9 22 Mar 2019
  
  in Public
  
  This link is to a three-page PDF that describes Gagne's nine events of instruction, largely in the form of a graphic. Text is minimized, and the descriptive text is color-coded so it is easy to find underneath the graphic at the top. The layout is simple and easy to follow. A general description of Gagne's work is not part of this page. While this particular presentation does not appeal to me personally, it is included here because of the quality of the page and because the presentation is more user-friendly than most. Rating 4/5
  
  etcnau Gagne nine events of instruction psychology instructional design PDFs evaluation models theory brain based learning cognitive cognition etc556

Tags

etcnau

PDFs

instructional design

brain based learning

cognitive

Gagne

cognition

etc556

psychology

nine events of instruction

theory

evaluation models

Annotators

ks9

URL

valpo.edu/vital/files/2015/12/gagne_nine_events.pdf
inst-fs-iad-prod.inscloudgate.net

Chapter 1. What Is Backward Design?

1
1. ks9 22 Mar 2019
  
  in Public
  
  This is a description of the form of backward design referred to as Understanding by Design. In its simplest form, this is a three-step process in which instructional designers first specify desired outcomes, then acceptable evidence, and only then learning activities. This presentation may be a little boring to read, as it is text-heavy and black and white, but those same attributes make it printer-friendly. Rating 3/5
  
  etcnau instructional design evaluation models backward design UBD understanding by design wiggins mcTighe etc556

Tags

etcnau

instructional design

understanding by design

UBD

wiggins

mcTighe

etc556

backward design

evaluation models

Annotators

ks9

URL

inst-fs-iad-prod.inscloudgate.net/files/10ac27ca-2054-4d67-abbc-3777e510a158/Wiggins ch1 backward-design intro.pdf