Hypothesis

24 Matching Annotations

Last 7 days
sakana.ai

https://sakana.ai/fugu-beta/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  _Self-reported score with custom Anthropic scaffold._ SWEPro were evaluated with the mini-swe-agent scaffold. However, we use the scores reported by Anthropic for Opus with the max thinking efforts due to frequent timeouts during our evaluation trials.
  
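  Footnote 2 reveals an important data point: the 53.4 score for Opus 4.6 is self-reported by Anthropic, because the authors hit frequent timeouts during their own evaluation and could not verify it themselves. This points to a data-reliability problem in the performance comparison: the Opus evaluation in particular relies on vendor-reported numbers, which may be biased.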
  
  data-point evaluation-methodology data-reliability

Tags

data-point

data-reliability

evaluation-methodology

Annotators

fxp007

URL

sakana.ai/fugu-beta/
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

2
1. fxp007 30 Apr 2026
  
  in Public
  
  The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and one for non-reasoning models.
  
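  This model-selection result (100% of the three metrics) indicates that splitting models into reasoning and non-reasoning groups gives the best predictive model. It provides strong statistical evidence that reasoning ability may be a key factor in the acceleration of AI progress. However, the article does not explain in detail how a reasoning model is defined, which may affect the reliability of the result.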
  
  data-point statistics model-evaluation
2. fxp007 25 Apr 2026
  
  in Public
  
  We use four AI capability metrics: ECI (Epoch Capabilities Index), METR 50% Time Horizon, Combined Math Index, and WeirdML V2 Index.
  
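  The study uses four different AI capability metrics, which increases the reliability of the results. Each metric measures AI capability along a different dimension: general capability (ECI), time efficiency (METR), mathematical ability (Combined Math), and performance in a specific environment (WeirdML). The multi-metric approach reduces the risk of bias from any single metric.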
  
  data-point metrics evaluation-framework

Tags

evaluation-framework

statistics

model-evaluation

metrics

data-point

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
openai.com

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

1
1. fxp007 27 Apr 2026
  
  in Public
  
  benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores.
  
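  Most people regard public datasets as the gold standard for AI evaluation, offering an objective and fair test environment. But the authors warn that benchmarks built from publicly available material carry contamination risk: training-data exposure can silently inflate scores. This view challenges conventional practice in AI evaluation and implies that we need stricter data isolation or a shift to private datasets for evaluation.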
  
  counterintuitive public-data-risk evaluation-design

Tags

public-data-risk

evaluation-design

counterintuitive

Annotators

fxp007

URL

openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
Apr 2026
arxiv.org

https://arxiv.org/abs/2604.20779

1
1. fxp007 24 Apr 2026
  
  in Public
  
  SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories
  
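  Most people think of AI research datasets as static, one-off collections, but the authors propose the concept of a 'living dataset', stressing that data must be continually updated to reflect real-world usage. This challenges the reliance on static benchmarks in traditional AI evaluation and argues for dynamic, continual data collection.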
  
  non-consensus data-collection evaluation-methods

Tags

data-collection

non-consensus

evaluation-methods

Annotators

fxp007

URL

arxiv.org/abs/2604.20779
Sep 2023
www.tandfonline.com

It pays to be ignorant: A simple political economy of rigorous program evaluation

1
1. mlenc 29 Sep 2023
  
  in Public
  
  whoneedsit evaluation political economy incentives nonprofit data

Tags

evaluation

incentives

nonprofit data

whoneedsit

political economy

Annotators

mlenc

URL

tandfonline.com/doi/abs/10.1080/1384128032000096832
May 2022
ssir.org

Our Opportunity for More Data-Driven Nonprofit Program Evaluation (SSIR)

1
1. mlenc 17 May 2022
  
  in Public
  
  data act commission for evidence-based policymaking beth novek tracey gyateng admin data evaluation infrastructure

Tags

evaluation infrastructure

tracey gyateng

beth novek

admin data

commission for evidence-based policymaking

data act

Annotators

mlenc

URL

ssir.org/articles/entry/our_opportunity_for_more_data_driven_nonprofit_program_evaluation
Mar 2022
www.whitehouse.gov

Building and Using Evidence

1
1. mlenc 31 Mar 2022
  
  in Public
  
  Strategic, cost-efficient evidence-building relies on strong data governance that facilitates the access, protection, and use of program and other administrative data to enable and support secondary uses, including for
  
  omb evaluation evidence act data foundation usa

Tags

evaluation

omb

evidence act

usa

data foundation

Annotators

mlenc

URL

whitehouse.gov/wp-content/uploads/2022/03/ap_6_evidence_fy2023.pdf
Jan 2022
www.nature.com

Omicron thwarts some of the world’s most-used COVID vaccines

1
1. lucyparfitt16 14 Jan 2022
  
  in BehSci
  
  Dolgin, E. (2022). Omicron thwarts some of the world’s most-used COVID vaccines. Nature. https://doi.org/10.1038/d41586-022-00079-6
  
  is:article lang:en COVID-19 vaccine Omicron protection booster science evaluation vaccination strategy data hospitalization mortality prevention immunity variant immune response dose

Tags

evaluation

immunity

Omicron

science

data

immune response

vaccination strategy

vaccine

hospitalization

lang:en

dose

prevention

is:article

protection

COVID-19

mortality

variant

booster

Annotators

lucyparfitt16

URL

nature.com/articles/d41586-022-00079-6
Nov 2021
www.lesswrong.com

How to Measure Anything - LessWrong

1
1. mlenc 30 Nov 2021
  
  in Public
  
  measurement how to measure anything douglas hubbard book review np data evaluation

Tags

book review

evaluation

measurement

np data

douglas hubbard

how to measure anything

Annotators

mlenc

URL

lesswrong.com/posts/ybYBCK9D7MZCcdArB/how-to-measure-anything
Oct 2021
www.aappublications.org

Study: Myocarditis risk 37 times higher for children with COVID-19 than uninfected peers

1
1. lucyparfitt16 10 Oct 2021
  
  in BehSci
  
  Study: Myocarditis risk 37 times higher for children with COVID-19 than uninfected peers | American Academy of Pediatrics. (n.d.). Retrieved October 10, 2021, from https://www.aappublications.org/news/2021/08/31/covid-myocarditis-risk-children-083121
  
  is:news lang:en COVID-19 risk myocarditis children CDC vaccine cost-benefit data analysis statistics link risk-benefit evaluation

Tags

vaccine

link

statistics

risk-benefit evaluation

CDC

lang:en

cost-benefit

myocarditis

children

COVID-19

is:news

data analysis

risk

Annotators

lucyparfitt16

URL

aappublications.org/news/2021/08/31/covid-myocarditis-risk-children-083121
Sep 2021
tspppa.gwu.edu

Kathryn Newcomer | The Trachtenberg School of Public Policy & Public Administration | The George Washington University

1
1. mlenc 20 Sep 2021
  
  in Public
  
  evaluation government auditing audit data act academic

Tags

evaluation

audit

government auditing

academic

data act

Annotators

mlenc

URL

tspppa.gwu.edu/kathryn-newcomer
Aug 2021
www.motherjones.com

Is this animal group saving critters—or padding a fundraising firm’s pockets?

1
1. mlenc 18 Aug 2021
  
  in Public
  
  Since it was founded by longtime charity executive Pierre Barnoti as the international offshoot of a Montreal animal welfare charity, SPCAI has spent little more than 20 percent of its total revenue on actual programs and services that help animals.
  
  badapples charity evaluation spca spcai montreal nonprofit data

Tags

evaluation

charity

spcai

badapples

nonprofit data

montreal

spca

Annotators

mlenc

URL

motherjones.com/environment/2021/08/spca-international-animal-charity-innovairre/
www.reddit.com

r/HolUp - Comment by u/dEn_of_asyD on ”Yeah... Never even heard of crack till the 4th grade”

1
1. mlenc 18 Aug 2021
  
  in Public
  
  evaluation program nonprofit data failed program what doesn't work what works

Tags

evaluation

failed program

nonprofit data

what doesn't work

program

what works

Annotators

mlenc

URL

reddit.com/r/HolUp/comments/p5rwog/yeah_never_even_heard_of_crack_till_the_4th_grade/
datashare.simplecast.com

Building Federal Evaluation Capacity: A Discussion on the New White House Guidance | DataShare

1
1. mlenc 17 Aug 2021
  
  in Public
  
  canwebefriends twitter:mlenc evaluation data act usausa

Tags

evaluation

usausa

canwebefriends

twitter:mlenc

data act

Annotators

mlenc

URL

datashare.simplecast.com/episodes/building-federal-evaluation-capacity-a-discussion-on-the-new-white-house-guidance-Clcqw4k4
Apr 2021
Local file

COVID-CVT-paper (1).pdf

1
1. lucyparfitt16 19 Apr 2021
  
  in BehSci
  
  Taquet, M. (2021, April 15). COVID-19 and cerebral venous thrombosis: a retrospective cohort study of 513,284 confirmed COVID-19 cases. https://doi.org/10.17605/OSF.IO/H2MT7
  
  is:pdf lang:en COVID-19 cerebral venous sinus thrombosis cerebral venous thrombosis research CVT vaccine risk perception prediction data analysis mortality European Medicines Agency risk-benefit evaluation
Tags

cerebral venous thrombosis

vaccine

CVT

risk perception

prediction

risk-benefit evaluation

lang:en

research

is:pdf

COVID-19

mortality

cerebral venous sinus thrombosis

data analysis

European Medicines Agency

Annotators

lucyparfitt16
Sep 2020
www.youtube.com

Susan Athey, July 22, 2020

1
1. ErikStuchly 08 Sep 2020
  
  in BehSci
  
  Susan Athey, July 22, 2020. (2020, August 2). https://www.youtube.com/watch?v=hqTOPrUxDzM
  
  is:youtube lang:en webinar talk confidence interval policy evaluation statistics scientific method scientific practice adaptive field experiment study design data collection research video

Tags

webinar

study design

statistics

scientific practice

is:youtube

adaptive field experiment

video

lang:en

scientific method

research

confidence interval

policy evaluation

data collection

talk

Annotators

ErikStuchly

URL

youtube.com/watch
Jul 2020
science.sciencemag.org

Call for transparency of COVID-19 models

1
1. edampf 24 Jul 2020
  
  in BehSci
  
  Barton, C. M., Alberti, M., Ames, D., Atkinson, J.-A., Bales, J., Burke, E., Chen, M., Diallo, S. Y., Earn, D. J. D., Fath, B., Feng, Z., Gibbons, C., Hammond, R., Heffernan, J., Houser, H., Hovmand, P. S., Kopainsky, B., Mabry, P. L., Mair, C., … Tucker, G. (2020). Call for transparency of COVID-19 models. Science, 368(6490), 482.2-483. https://doi.org/10.1126/science.abb8637
  
  is:article letter COVID-19 lang:en transparency modeling knowledge sharing data sharing science research prediction response government policy science decision making healthcare economy code sharing replication evaluation rapid response publication

Tags

evaluation

government

letter

economy

response

research

decision making

science

rapid response

code sharing

modeling

publication

transparency

healthcare

data sharing

prediction

knowledge sharing

lang:en

policy science

is:article

replication

COVID-19

Annotators

edampf

URL

science.sciencemag.org/content/368/6490/482.2.full
Jun 2020
royalsociety.org

DELVE group publishes evidence paper on the use of face masks in tackling Coronavirus (COVID-19) pandemic | Royal Society

1
1. edampf 19 Jun 2020
  
  in BehSci
  
  DELVE group publishes evidence paper on the use of face masks in tackling Coronavirus (COVID-19) pandemic | Royal Society. (2020 May 04). https://royalsociety.org/news/2020/05/delve-group-publishes-evidence-paper-on-use-of-face-masks/
  
  is:webpage lang:en COVID-19 DELVE Data Evaluation and Learning for Viral Epidemics publication evidence face mask management behavioral change Royal Society learning SAGE infection asymptomatic droplet transmission reduction policy social distancing public health physical distancing

Tags

DELVE

policy

asymptomatic

management

behavioral change

Royal Society

learning

social distancing

publication

evidence

face mask

transmission reduction

is:webpage

Data Evaluation and Learning for Viral Epidemics

SAGE

lang:en

public health

COVID-19

infection

droplet

physical distancing

Annotators

edampf

URL

royalsociety.org/news/2020/05/delve-group-publishes-evidence-paper-on-use-of-face-masks/
psyarxiv.com

Bayesian evaluation of replication studies

1
1. Marlene_Wulf 02 Jun 2020
  
  in BehSci
  
  Leplaa, H. J., Rietbergen, C., & Hoijtink, H. (2020). Bayesian evaluation of replication studies [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/49tbz
  
  is:preprint lang:en Bayesian evaluation replication study data psychology reproducibility project open science collaboration

Tags

evaluation

psychology

collaboration

study

reproducibility project

lang:en

Bayesian

open science

replication

data

is:preprint

Annotators

Marlene_Wulf

URL

psyarxiv.com/49tbz/
May 2020
ai.googleblog.com

Understanding the Shape of Large-Scale Data

1
1. edampf 13 May 2020
  
  in BehSci
  
  Tsitsulin, A. & Perozzi B. Understanding the Shape of Large-Scale Data. (2020 May 05). Google AI Blog. http://ai.googleblog.com/2020/05/understanding-shape-of-large-scale-data.html
  
  is:blog lang:en Google large-scale data dataset graph mathematics modeling relationship learning data evaluation data visualization spectrum DDGK data analysis time-varying

Tags

is:blog

relationship

mathematics

learning

data evaluation

graph

time-varying

Google

data analysis

modeling

spectrum

data visualization

lang:en

DDGK

large-scale data

dataset

Annotators

edampf

URL

ai.googleblog.com/2020/05/understanding-shape-of-large-scale-data.html
www.theguardian.com

Report on face masks' effectiveness for Covid-19 divides scientists

1
1. edampf 12 May 2020
  
  in BehSci
  
  Davis, N. (2020, May 4). Report on face masks’ effectiveness for Covid-19 divides scientists. The Guardian. https://www.theguardian.com/world/2020/may/04/scientists-disagree-over-face-masks-effect-on-covid-19
  
  is:news lang:en COVID-19 face mask effectiveness expert Royal Society Delve Data Evaluation and Learning for Viral Epidemics transmission reduction asymptomatic pre-symptomatic physical distancing social distancing protection protective mask evidence doubt critical concern medical equipment

Tags

effectiveness

expert

asymptomatic

Royal Society

critical

medical equipment

social distancing

concern

doubt

pre-symptomatic

evidence

face mask

transmission reduction

Data Evaluation and Learning for Viral Epidemics

lang:en

COVID-19

Delve

protection

physical distancing

is:news

protective mask

Annotators

edampf

URL

theguardian.com/world/2020/may/04/scientists-disagree-over-face-masks-effect-on-covid-19
Apr 2020
psyarxiv.com

Offene Wissenschaft in der Zeit von Covid-19 – Eine Blaupause für die psychologische Forschung?

1
1. edampf 27 Apr 2020
  
  in BehSci
  
  Beitner, J., Brod, G., Gagl, B., Kraft, D., & Schultze, M. (2020, April 23). Offene Wissenschaft in der Zeit von Covid-19 – Eine Blaupause für die psychologische Forschung?. https://doi.org/10.31234/osf.io/sh8xg
  
  is:preprint COVID-19 lang:de psychology open science open data preregistration evaluation development study design publication review

Tags

evaluation

review

study design

psychology

lang:de

publication

open science

open data

development

COVID-19

preregistration

is:preprint

Annotators

edampf

URL

psyarxiv.com/sh8xg/
Jun 2019
www.blagravetrust.org

Power and vulnerability in the charity-funder relationship - The Blagrave Trust

1
1. mlenc 05 Jun 2019
  
  in Public
  
  evaluation anonymous collective responsibility data

Tags

evaluation

anonymous

collective responsibility

data

Annotators

mlenc

URL

blagravetrust.org/listening/power-and-vulnerability-in-the-charity-funder-relationship/