Hypothesis

18 Matching Annotations

May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  humans can do this in well under half an hour.
  
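  Humans can complete the IKEA furniture assembly task in under half an hour, while AI systems reach only 40% accuracy. The contrast highlights a significant gap between AI and humans on tasks that require hands-on, practical understanding. The difference in time efficiency also underscores the importance of the time dimension in benchmarking.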
  
  data-point human-baseline time-efficiency

Tags

human-baseline

time-efficiency

data-point

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
Apr 2026
www.understandingai.org

Why it's getting harder to measure AI performance - Understanding AI

1
1. fxp007 09 Apr 2026
  
  in Public
  
  METR pays human programmers a minimum of $50 per hour, so getting a baseline for a single 160-hour task would cost at least $8,000.
  
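  A human baseline for a single test task costs as much as $8,000. This figure reveals a seriously underestimated physical constraint on AI evaluation: measuring AI capability requires a large amount of human labor, and as AI capability extends toward "month-long tasks", the cost of establishing reliable baselines will grow superlinearly. The more fundamental problem is that it is hard to get a capable programmer to spend weeks on a "test task", even for generous pay. The availability of human evaluators will become the real ceiling on AI capability evaluation.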
  
  evaluation-cost 8000-dollars human-baseline scalability-limit
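The arithmetic behind the quoted figure can be sketched as follows. The $50 hourly minimum and the 160-hour task come from the quoted passage; the `baseliners` parameter and the other task lengths are hypothetical, added only to show how this labor floor scales. The sketch covers only the direct labor cost of one attempt per baseliner, not recruiting or failed attempts.

```python
def baseline_cost(task_hours, hourly_rate=50, baseliners=1):
    """Lower-bound labor cost (in dollars) of human-baselining one task.

    hourly_rate defaults to the $50 minimum quoted above;
    baseliners is a hypothetical knob for repeated baselines.
    """
    return task_hours * hourly_rate * baseliners


# The quoted case: one 160-hour task at the $50 minimum rate.
print(baseline_cost(160))  # 8000

# Illustrative task lengths (hypothetical), one baseliner each.
for hours in (1, 8, 40, 160):
    print(hours, baseline_cost(hours))
```

With three baseliners per task (again a hypothetical choice), `baseline_cost(160, baseliners=3)` gives 24000.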

Tags

human-baseline

evaluation-cost

scalability-limit

8000-dollars

Annotators

fxp007

URL

understandingai.org/p/why-its-getting-harder-to-measure
metr.org

Task-Completion Time Horizons of Frontier AI Models

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Our human task duration estimates likely overestimate how long a human expert takes to complete these tasks, as the humans (and AI agents!) have much less context for the task than professionals doing equivalent work in their day-to-day job.
  
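  METR openly acknowledges that its human baseline times may be overestimated, because the human participants, like the AI, work in a low-context "novice" state rather than as professionals familiar with the project. This means the human ability corresponding to a "2-hour time horizon" is closer to that of an outsourced contractor with no background knowledge than to that of an experienced full-time engineer. The real gap between AI and "professionals with context" is much larger than the time-horizon figures suggest.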
  
  context-gap human-baseline measurement-limitation surprising

Tags

measurement-limitation

surprising

human-baseline

context-gap

Annotators

fxp007

URL

metr.org/time-horizons/
transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

1
1. fxp007 09 Apr 2026
  
  in Public
  
  We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts.
  
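  [Inspiration] The conceptual framework of "functional emotions" suggests a new way of looking at AI product design: since emotions are real drivers of behavior, "personality design" for an AI product is not just a matter of writing a system prompt but of shaping an emotion-regulation system. For designers of AI hardware and assistant products, this means that a model's "emotional baseline" could in future be adjusted like a mixing console: calmer for a meeting assistant, warmer for a study companion, more excited for a creative tool.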
  
  inspiration product-design emotion-baseline AI-persona

Tags

inspiration

emotion-baseline

product-design

AI-persona

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
huggingface.co

https://huggingface.co/papers/2604.04921

1
1. fxp007 08 Apr 2026
  
  in Public
  
  leading baselines achieve only about half the accuracy at the same efficiency
  
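  The authors imply that current mainstream KV cache compression methods reach only about half the accuracy at the same level of efficiency, which suggests that existing methods have fundamental flaws. This pointed criticism challenges the field's current technical direction and implies that most peers may have been optimizing KV compression in the wrong direction.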
  
  non-consensus baseline-critique field-challenge methodology-flaw

Tags

non-consensus

baseline-critique

methodology-flaw

field-challenge

Annotators

fxp007

URL

huggingface.co/papers/2604.04921
May 2025
github.com

Updated Migration Process · drizzle-team/drizzle-orm · Discussion #2624

3
1. TylerRick 08 May 2025
  
  in Public
  
  For any new environments and databases, you can use just drizzle-kit migrate, and all the migrations together with init will be applied
  
  database migrations: baseline/initial migration
2. TylerRick 08 May 2025
  
  in Public
  
  When you run migrate on a database that already has all the tables from your schema, you need to run it with the drizzle-kit migrate --no-init flag, which will skip the init step. If you run it without this flag and get an error that such tables already exist, drizzle-kit will detect it and suggest you add this flag.
  
  database migrations: baseline/initial migration database migrations
3. TylerRick 08 May 2025
  
  in Public
  
  When you introspect the database, you will receive an initial migration without comments. Instead of commenting it out, we will add a flag to journal entity with the init flag, indicating that this migration was generated by introspect action
  
  database migrations: baseline/initial migration

Tags

database migrations: baseline/initial migration

database migrations

Annotators

TylerRick

URL

github.com/drizzle-team/drizzle-orm/discussions/2624
github.com

migrate resolve: The migration could not be found · Issue #14762 · prisma/prisma

1
1. TylerRick 08 May 2025
  
  in Public
  
  root@51a758d136a2:~/test/test-project# npx prisma migrate diff --from-empty --to-schema-datamodel prisma/schema.prisma --script > migration.sql
  root@51a758d136a2:~/test/test-project# cat migration.sql
  -- CreateTable
  CREATE TABLE "test" (
      "id" SERIAL NOT NULL,
      "val" INTEGER,
      CONSTRAINT "test_pkey" PRIMARY KEY ("id")
  );
  root@51a758d136a2:~/test/test-project# mkdir -p prisma/migrations/initial
  root@51a758d136a2:~/test/test-project# mv migration.sql prisma/migrations/initial/
  
  prisma database migrations database migrations: baseline/initial migration

Tags

database migrations: baseline/initial migration

database migrations

prisma

Annotators

TylerRick

URL

github.com/prisma/prisma/issues/14762
Feb 2025
www.youtube.com

The Neuroscience of Addiction - with Marc Lewis

2
1. stopresetgo 28 Feb 2025
  
  in Public
  
  within one year or so the curve, the line, um, crosses the non-addicted average baseline
  
  > for - addiction - abstinence - one year - crosses non-addictive baseline
  
  addiction - abstinence - one year - crosses non-addictive baseline
2. stopresetgo 28 Feb 2025
  
  in Public
  
  abstinence from coke, alcohol and heroin, you get, um, an increase in gray matter volume in very similar areas
  
  > for - addiction - abstinence - synaptic growth - in a year, returns to baseline
  
  addiction - abstinence - synaptic growth - in a year, returns to baseline

Tags

addiction - abstinence - one year - crosses non-addictive baseline

addiction - abstinence - synaptic growth - in a year, returns to baseline

Annotators

stopresetgo

URL

youtube.com/watch
Feb 2022
twitter.com

Deepti Gurdasani on Twitter

1
1. lucyparfitt16 02 Feb 2022
  
  in BehSci
  
  Deepti Gurdasani. (2022, January 30). Have tried to now visually illustrate an earlier thread I wrote about why prevalence estimates based on comparisons of “any symptom” between infected cases, and matched controls will yield underestimates for long COVID. I’ve done a toy example below here, to show this 🧵 [Tweet]. @dgurdasani1. https://twitter.com/dgurdasani1/status/1487578265187405828
  
  is:tweet lang:en COVID-19 long covid symptoms baseline research prevalence approach bias underestimation scientific method methodological problem flawed study design

Tags

research

COVID-19

lang:en

approach

symptoms

long covid

methodological problem

flawed study design

is:tweet

prevalence

bias

baseline

underestimation

scientific method

Annotators

lucyparfitt16

URL

twitter.com/dgurdasani1/status/1487578265187405828
Dec 2021
twitter.com

Meaghan Kall on Twitter

1
1. chaeyeonlim 23 Dec 2021
  
  in BehSci
  
  Meaghan Kall. (2021, November 15). There are 2 other impressive conclusions from this study: 1. Comparing vaccine effectiveness of booster vs “fully vaxxed” as the baseline. Booster ADDS 81-85% protection against symptomatic infection ON TOP of what you already had from your primary (2-dose) vaccination https://t.co/5EO7m6GHTZ [Tweet]. @kallmemeg. https://twitter.com/kallmemeg/status/1460207567070769156
  
  is:tweet lang:en COVID-19 research evidence vaccine vaccination vaccine effectiveness booster booster shot booster effect full vaccination baseline protection infection symptomatic infection immunization effectiveness

Tags

immunization

COVID-19

protection

symptomatic infection

effectiveness

vaccine effectiveness

booster

research

booster effect

lang:en

evidence

vaccination

is:tweet

infection

vaccine

booster shot

baseline

full vaccination

Annotators

chaeyeonlim

URL

twitter.com/kallmemeg/status/1460207567070769156
Oct 2021
www.baselinehq.com

Baseline • The Free Design Bootcamp

1
1. bauhouse 16 Oct 2021
  
  in Public
  
  ❯ Created in-house by expert educators.
  ❯ 100% original course materials.
  ❯ Free for everyone, forever.
  
  Baseline free design bootcamp

Tags

free

design

bootcamp

Baseline

Annotators

bauhouse

URL

baselinehq.com/
www.linkedin.com

Post | LinkedIn

1
1. bauhouse 16 Oct 2021
  
  in Public
  
  Andrew Wilshere
  
  Andrew Wilshere was working on content at Designlab when he asked me to write an article about the Bauhaus.
  
  I ended up writing something that never got published with Designlab. Instead, it was shared by the Bauhaus Movement to their Facebook followers.
  
  Baseline design bootcamp free Bauhaus Designlab

Tags

Baseline

design

Designlab

free

bootcamp

Bauhaus

Annotators

bauhouse

URL

linkedin.com/posts/andrew-wilshere-b198b155_baseline-the-free-design-bootcamp-launches-activity-6808403503317356544-cQBd/
Jul 2020
www.medrxiv.org

https://doi.org/10.1101/2020.07.09.20148429

1
1. ErikStuchly 25 Jul 2020
  
  in BehSci
  
  Seow, J., Graham, C., Merrick, B., Acors, S., Steel, K. J. A., Hemmings, O., O’Bryne, A., Kouphou, N., Pickering, S., Galao, R., Betancor, G., Wilson, H. D., Signell, A. W., Winstone, H., Kerridge, C., Temperton, N., Snell, L., Bisnauthsing, K., Moore, A., … Doores, K. (2020). Longitudinal evaluation and decline of antibody responses in SARS-CoV-2 infection. MedRxiv, 2020.07.09.20148429. https://doi.org/10.1101/2020.07.09.20148429
  
  is:preprint lang:en COVID-19 longitudinal change antibody depletion symptom onset immunity magnitude severity baseline public health future implications serology epidemiology

Tags

magnitude

is:preprint

antibody depletion

severity

longitudinal change

COVID-19

lang:en

public health

serology

future implications

epidemiology

baseline

immunity

symptom onset

Annotators

ErikStuchly

URL

medrxiv.org/content/10.1101/2020.07.09.20148429v1
Dec 2019
zapier.com

Time Tracking Experiment: What I Learned After Analyzing Every Minute of My Life for 30 Days

1
1. TylerRick 30 Dec 2019
  
  in Public
  
  While I wanted to do my best to not judge how I was spending my time during the experiment—to just track it as it is and analyze at the end—I did want to have a baseline to compare my results to. This wasn't a hypothesis of how I spend my time, but more of a vision for how I would like my time to be allocated.
  
  baseline ideal desired

Tags

baseline

desired

ideal

Annotators

TylerRick

URL

zapier.com/blog/time-tracking-tutorial/
May 2018
patsy.readthedocs.io

patsy API reference — patsy 0.5.0+dev documentation

1
1. rschulz 01 May 2018
  
  in Public
  
  one level is chosen as the “reference”, and its mean behaviour is represented by the intercept. Each column of the resulting matrix represents the difference between the mean of one level and this reference level
  
  python patsy Treatment reference baseline coding
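The quoted description of treatment (reference-level) coding can be illustrated with a minimal pure-Python sketch. It does not call patsy. The helper names, the column labels such as `T.b`, and the data are hypothetical. For a model with a single categorical factor, the least-squares fit has a closed form: the intercept is the reference level's mean, and each remaining coefficient is the difference between a level's mean and that reference mean, which is what the sketch computes.

```python
from collections import defaultdict


def treatment_design(levels, reference):
    """Build an intercept column plus one 0/1 indicator per non-reference level."""
    others = sorted(set(levels) - {reference})
    names = ["Intercept"] + [f"T.{lvl}" for lvl in others]
    rows = [[1] + [1 if lvl == other else 0 for other in others] for lvl in levels]
    return names, rows


def treatment_coefficients(levels, y, reference):
    """Closed-form least-squares fit for one categorical factor."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lvl, value in zip(levels, y):
        sums[lvl] += value
        counts[lvl] += 1
    means = {lvl: sums[lvl] / counts[lvl] for lvl in sums}
    others = sorted(set(levels) - {reference})
    # Intercept = reference mean; other coefficients = differences from it.
    return [means[reference]] + [means[lvl] - means[reference] for lvl in others]


# Hypothetical data: three levels, with "a" chosen as the reference.
levels = ["a", "a", "b", "b", "c", "c"]
y = [1.0, 3.0, 4.0, 6.0, 10.0, 12.0]

names, rows = treatment_design(levels, "a")
coefs = treatment_coefficients(levels, y, "a")
print(names)  # ['Intercept', 'T.b', 'T.c']
print(coefs)  # [2.0, 3.0, 9.0]

# Each row of the design matrix times the coefficients gives that level's mean.
fitted = [sum(x * b for x, b in zip(row, coefs)) for row in rows]
print(fitted)  # [2.0, 2.0, 5.0, 5.0, 11.0, 11.0]
```

Choosing a different reference level changes the intercept and the differences but leaves the fitted level means unchanged.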

Tags

coding

baseline

patsy

python

reference

Treatment

Annotators

rschulz

URL

patsy.readthedocs.io/en/latest/API-reference.html
