Hypothesis

7 Matching Annotations

May 2026
cruxevals.com

https://cruxevals.com/

1
1. fxp007 07 May 2026
  
  in Public
  
  AI Village gives multiple AI agents their own computer environments and a shared group chat, then tasks them with open-ended real-world goals like fundraising, organizing events, making games, and gaining subscribers.
  
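  This case shows open-world evaluation applied in practice. Its cost of roughly $50,000 per year indicates that this kind of evaluation demands substantial resources. Compared with traditional benchmarks, it sits closer to real application scenarios, but for that reason it is more expensive and hard to run at scale.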
  
  data-point cost-analysis real-world-evaluation

Tags

cost-analysis

real-world-evaluation

data-point

Annotators

fxp007

URL

cruxevals.com/
Apr 2026
mp.weixin.qq.com

https://mp.weixin.qq.com/s/_5_tWZeNXmxYnCOfIYg49A

1
1. fxp007 17 Apr 2026
  
  in Public
  
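  Future evaluation systems must consider success rate, cost, and latency together. This resembles the assessment criteria used for cloud computing rather than those used for traditional software.
  
  This view shows that evaluating AI skills requires new dimensions, cost in particular. It reflects a challenge specific to the AI era, and it suggests that future skill markets may develop pricing mechanisms based on resource consumption, which would differ fundamentally from the traditional software market.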
  
  cost-evaluation ai-specific-metrics
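The three dimensions named in the quoted passage (success rate, cost, latency) can be reported side by side for a set of evaluation runs. The following is a minimal sketch, not a method from the annotated article: the `Run` record, its field names, and the `summarize` helper are hypothetical, and simple means are an assumed aggregation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One evaluation run (hypothetical record; field names are assumptions)."""
    succeeded: bool
    cost_usd: float
    latency_s: float


def summarize(runs: list[Run]) -> dict[str, float]:
    """Report success rate, mean cost, and mean latency together."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "success_rate": mean([1.0 if r.succeeded else 0.0 for r in runs]),
        "mean_cost_usd": mean([r.cost_usd for r in runs]),
        "mean_latency_s": mean([r.latency_s for r in runs]),
    }


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    example = [Run(True, 0.5, 2.0), Run(False, 1.5, 4.0)]
    print(summarize(example))
```

Keeping the three numbers separate, rather than folding them into one score, leaves the trade-off between them visible to the reader.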

Tags

cost-evaluation

ai-specific-metrics

Annotators

fxp007

URL

mp.weixin.qq.com/s/_5_tWZeNXmxYnCOfIYg49A
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  METR pays human programmers a minimum of $50 per hour, so getting a baseline for a single 160-hour task would cost at least $8,000.
  
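  A human baseline for a single test task costs as much as $8,000. This figure exposes a badly underestimated physical limit on AI evaluation: measuring AI capability takes a great deal of human labor, and as AI capability extends toward "month-scale tasks," the cost of establishing reliable baselines will grow superlinearly. The more fundamental problem is that it is hard to get a capable programmer to spend weeks on a "test task," even for generous pay. The availability of human evaluators will become the real ceiling on AI capability evaluation.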
  
  evaluation-cost 8000-dollars human-baseline scalability-limit
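The arithmetic behind the quoted figure can be reproduced in a few lines. This is a minimal sketch: the $50 hourly minimum and the 160-hour task length come from the quoted passage, while the function name `baseline_cost` and the `attempts` parameter (how many people complete the same task) are hypothetical additions for illustration.

```python
def baseline_cost(task_hours: float, hourly_rate: float = 50.0, attempts: int = 1) -> float:
    """Lower bound, in dollars, on the cost of a human baseline for one task.

    hourly_rate defaults to the $50/hour minimum from the quoted passage.
    attempts is a hypothetical parameter: the number of people who each
    complete the task once.
    """
    if task_hours < 0 or hourly_rate < 0 or attempts < 1:
        raise ValueError("hours and rate must be non-negative, attempts at least 1")
    return task_hours * hourly_rate * attempts


if __name__ == "__main__":
    print(baseline_cost(160))  # 8000.0, the figure in the quote
```

This covers labor only and scales linearly with hours; it does not model recruiting difficulty or any other effect that would push the cost above this floor.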

Tags

8000-dollars

human-baseline

scalability-limit

evaluation-cost

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
arxiv.org

https://arxiv.org/abs/2604.03016

1
1. fxp007 08 Apr 2026
  
  in Public
  
  it contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task.
  
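  Most people assume AI evaluation can be done through relatively simple automated pipelines. The benchmark the authors propose, however, requires more than 10 hours of manual annotation per task and over 2,000 checkpoints, which suggests that the complexity and cost of truly evaluating AI agent capabilities far exceed what the industry generally recognizes. This view challenges the efficiency-first mindset in AI evaluation and underscores that high-quality evaluation requires substantial human effort.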
  
  counterintuitive evaluation-cost human-annotation

Tags

human-annotation

counterintuitive

evaluation-cost

Annotators

fxp007

URL

arxiv.org/abs/2604.03016
Oct 2021
www.aappublications.org

Study: Myocarditis risk 37 times higher for children with COVID-19 than uninfected peers

1
1. lucyparfitt16 10 Oct 2021
  
  in BehSci
  
  Study: Myocarditis risk 37 times higher for children with COVID-19 than uninfected peers | American Academy of Pediatrics. (n.d.). Retrieved October 10, 2021, from https://www.aappublications.org/news/2021/08/31/covid-myocarditis-risk-children-083121
  
  is:news lang:en COVID-19 risk myocarditis children CDC vaccine cost-benefit data analysis statistics link risk-benefit evaluation

Tags

data analysis

cost-benefit

CDC

lang:en

statistics

myocarditis

risk-benefit evaluation

vaccine

is:news

COVID-19

children

link

risk

Annotators

lucyparfitt16

URL

aappublications.org/news/2021/08/31/covid-myocarditis-risk-children-083121
May 2021
publications.aap.org

Economic Evaluation of the Routine Childhood Immunization Program in the United States, 2009

1
1. SIYANYE 07 May 2021
  
  in BehSci
  
  Zhou, F., Shefer, A., Wenger, J., Messonnier, M., Wang, L. Y., Lopez, A., Moore, M., Murphy, T. V., Cortese, M., & Rodewald, L. (2014). Economic Evaluation of the Routine Childhood Immunization Program in the United States, 2009. Pediatrics, 133(4), 577–585. https://doi.org/10.1542/peds.2013-0698
  
  lang:en is:article evaluation child immunization routine net savings benefit-cost analysis decision making

Tags

child

decision making

immunization

net savings

lang:en

routine

is:article

evaluation

benefit-cost analysis

Annotators

SIYANYE

URL

publications.aap.org/pediatrics/article/133/4/577/32731/Economic-Evaluation-of-the-Routine-Childhood
Aug 2020
www.biorxiv.org

A Newcastle disease virus (NDV) expressing membrane-anchored spike as a cost-effective inactivated SARS-CoV-2 vaccine

1
1. ErikStuchly 21 Aug 2020
  
  in BehSci
  
  Sun, W., McCroskery, S., Liu, W.-C., Leist, S. R., Liu, Y., Albrecht, R. A., Slamanig, S., Oliva, J., Amanat, F., Schäfer, A., Dinnon, K. H., Innis, B. L., García-Sastre, A., Krammer, F., Baric, R. S., & Palese, P. (2020). A Newcastle disease virus (NDV) expressing membrane-anchored spike as a cost-effective inactivated SARS-CoV-2 vaccine. BioRxiv, 2020.07.30.229120. https://doi.org/10.1101/2020.07.30.229120
  
  is:preprint lang:en COVID-19 Newcastle disease virus spike cost vaccine manufacture challenge pre-clinical evaluation potency protection

Tags

potency

manufacture

spike

protection

lang:en

Newcastle disease virus

cost

challenge

is:preprint

vaccine

COVID-19

pre-clinical evaluation

Annotators

ErikStuchly

URL

biorxiv.org/content/10.1101/2020.07.30.229120v1
