Hypothesis

24 Matching Annotations

Jun 2026
sakana.ai sakana.ai

https://sakana.ai/fugu/

1
1. fxp007 26 Jun 2026
  
  in Public
  
  Fugu models surpass publicly accessible frontier models and are shoulder-to-shoulder with Fable 5 and Mythos Preview in various rigorous engineering, scientific, and reasoning benchmarks while delivering frontier capability without the risk of export controls.
  
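  Most people assume that gains in frontier AI model performance depend on a single vendor's proprietary technology and ever larger parameter counts, but the author argues that dynamically orchestrating multiple existing models can match top proprietary models while avoiding export-control risk. This view challenges the prevailing consensus on how AI should advance.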
  
  non-consensus counterintuitive model-performance
Visit annotations in context

Tags

model-performance

counterintuitive

non-consensus

Annotators

fxp007

URL

sakana.ai/fugu/
www.theverge.com www.theverge.com

https://www.theverge.com/news/946725/anthropic-releases-claude-fable-5-mythos

1
1. fxp007 10 Jun 2026
  
  in Public
  
  The company said that in testing, 95 percent of Fable sessions ran entirely on Fable responses, without falling back to Opus 4.8.
  
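  This 95 percent figure needs further verification. The test sample size, how representative the test scenarios were, and how "ran entirely" is defined all deserve a closer look. The figure could affect how users judge the model's reliability.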
  
  data-verification model-performance testing-methodology
Visit annotations in context

Tags

model-performance

data-verification

testing-methodology

Annotators

fxp007

URL

theverge.com/news/946725/anthropic-releases-claude-fable-5-mythos
www.tomtunguz.com www.tomtunguz.com

https://www.tomtunguz.com/inflation-deflation-ai/

1
1. fxp007 09 Jun 2026
  
  in Public
  
  Composer 2.5 is exceptionally intelligent & up to 10x more efficient than similarly capable models.
  
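  Cursor claims that its Composer 2.5 model is up to 10x more efficient than similarly capable models. This is a bold assertion, but it lacks concrete benchmark data or a stated basis of comparison. Some optimization may well exist, but a 10x improvement needs more detailed verification.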
  
  data-point efficiency-claim model-performance
Visit annotations in context

Tags

model-performance

efficiency-claim

data-point

Annotators

fxp007

URL

tomtunguz.com/inflation-deflation-ai/
www.anthropic.com www.anthropic.com

https://www.anthropic.com/research/agents-in-biology

1
1. fxp007 08 Jun 2026
  
  in Public
  
  adding a deterministic retrieval layer made model choice much less important
  
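  Most people assume that choosing a more powerful model is the key to improving accuracy in AI applications, but the author argues that adding a deterministic retrieval layer matters more than model choice. This counterintuitive view suggests that, for biological data processing, infrastructure improvements may solve more problems than model upgrades, which runs against the AI field's general pursuit of ever more powerful models.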
  
  counterintuitive model-performance deterministic
Visit annotations in context

Tags

model-performance

deterministic

counterintuitive

Annotators

fxp007

URL

anthropic.com/research/agents-in-biology
cognition.ai cognition.ai

https://cognition.ai/blog/frontier-code

1
1. fxp007 08 Jun 2026
  
  in Public
  
  Claude Opus 4.8, achieves a score of only 13.4%. Other models score significantly lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.
  
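  These scores show that today's most advanced AI models perform poorly on production-grade code quality evaluation; even the best model reaches only 13.4%. This suggests AI code generation still has a great deal of room for improvement, but without an absolute scoring reference it is hard to judge what the score actually means.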
  
  data-point model-performance statistics
Visit annotations in context

Tags

model-performance

statistics

data-point

Annotators

fxp007

URL

cognition.ai/blog/frontier-code
May 2026
jack-clark.net jack-clark.net

Import AI 455: Automating AI Research

1
1. fxp007 15 May 2026
  
  in Public
  
  As of March 2026, AI systems are able to post-train models to get about half as much of the uplift as ones trained by humans. The specific eval scores are derived by a 'weighted average is taken across all post-trained LLMs... The top-scoring systems as of April get 25%-28% (Opus 4.6, and GPT 5.4), compared to a human score of 51%.'
  
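  On model post-training tasks, AI systems already reach about half of the human researchers' score of 51%, showing notable progress by AI on research tasks.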
  
  model-training performance-gap
Visit annotations in context

Tags

performance-gap

model-training

Annotators

fxp007

URL

jack-clark.net/2026/05/04/import-ai-455-automating-ai-research/
nlp.elvissaravia.com nlp.elvissaravia.com

https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f

1
1. fxp007 01 May 2026
  
  in Public
  
  DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro
  
  DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks, which indicates a large jump in the performance of open-source models.
  
  performance-comparison benchmark open-source-model
Visit annotations in context

Tags

benchmark

performance-comparison

open-source-model

Annotators

fxp007

URL

nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f
Apr 2026
sakana.ai sakana.ai

https://sakana.ai/fugu-beta/

2
1. fxp007 30 Apr 2026
  
  in Public
  
  Two variants are available: **Sakana Fugu Mini 🐟**, optimized with latency in mind, and **Sakana Fugu Ultra 🐡**, the full orchestration system, optimized for performance for demanding tasks.
  
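  The article mentions two variants, Mini (latency-optimized) and Ultra (performance-optimized), but gives no concrete performance differences, such as a percentage reduction in latency or a throughput gain. Without such quantified parameters, it is hard to assess how the two variants differ in real-world use.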
  
  data-point model-variants performance-metrics
2. fxp007 30 Apr 2026
  
  in Public
  
  GPQAD | 94.4 | 90.9 | 92.7 | 92.4 | **95.1**
  LCBv6 | 90.3 | 92.1 | 92.4 | 90.4 | **93.2**
  SWEPro | 48.4 | 51.2 | _53.4_ | 51.3 | **54.2**
  
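  The comparison table shows Sakana Fugu Ultra ahead of the competition on all three benchmarks: 95.1% on GPQAD (above Gemini 3.1 at 94.4%), 93.2% on LCBv6 (above GPT 5.4 at 92.1%), and 54.2% on SWEPro (above Opus 4.6 at 53.4%). These numbers suggest that its multi-model orchestration strategy does deliver a performance gain, with a clear advantage on scientific reasoning tasks in particular.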
  
  data-point performance-benchmark model-comparison
Visit annotations in context

Tags

performance-benchmark

performance-metrics

model-comparison

model-variants

data-point

Annotators

fxp007

URL

sakana.ai/fugu-beta/
epoch.ai epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

1
1. fxp007 24 Apr 2026
  
  in Public
  
  The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and one for non-reasoning models.
  
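  This finding indicates that reasoning and non-reasoning models really do follow markedly different development trajectories. The model with separate linear trends performed best on all three metrics, beating the alternatives 100% of the time, which provides strong statistical evidence for the claim that AI capabilities have accelerated.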
  
  data-point statistics model-performance
Visit annotations in context

Tags

model-performance

statistics

data-point

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
x.com x.com

(1) Milk Road AI on X: "Andrej Karpathy just made one of the most interesting arguments about AI model design that most people are completely missing. His take is that frontier AI models are not too big because the technology is complex and too big because the training data is garbage. When you or I https://t.co/IGQZlJ6JHL" / X

1
1. fxp007 24 Apr 2026
  
  in Public
  
  GPT-4o operates at roughly 200 billion parameters and outperforms the original 1.8 trillion-parameter GPT-4
  
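  This finding runs against the industry consensus that "bigger models are necessarily better" and hints that model quality and architecture may matter more than scale. It may be one of the most surprising efficiency gains in the history of AI, and it challenges our understanding of AI progress.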
  
  counter-intuitive model-performance
Visit annotations in context

Tags

model-performance

counter-intuitive

Annotators

fxp007

URL

x.com/MilkRoadAI/status/2045484064585728489
qwen.ai qwen.ai

https://qwen.ai/blog?id=qwen3.6-27b

2
1. fxp007 23 Apr 2026
  
  in Public
  
  It also surpasses all peer-scale dense models by a wide margin.
  
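  In most cases people would expect larger models to perform better, but the author states that Qwen3.6-27B stands out among dense models of the same scale, a view at odds with mainstream thinking.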
  
  non-consensus counterintuitive model-scale performance-comparison
2. fxp007 23 Apr 2026
  
  in Public
  
  It also surpasses all peer-scale dense models by a wide margin.
  
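  Most people would assume that model performance is proportional to scale, but the author points out that Qwen3.6-27B stands out among models of its size, surpassing all peer-scale dense models, which challenges the conventional view of the relationship between scale and performance.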
  
  non-consensus counterintuitive model-scale performance-outliers
Visit annotations in context

Tags

performance-comparison

model-scale

non-consensus

performance-outliers

counterintuitive

Annotators

fxp007

URL

qwen.ai/blog
nrehiew.github.io nrehiew.github.io

https://nrehiew.github.io/blog/minimal_editing/

1
1. fxp007 23 Apr 2026
  
  in Public
  
  Among the latest frontier models, GPT-5.4 over-edits the most.
  
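  Most people regard GPT-5.4 as the most advanced model, but the author points out that it performs worst on the minimal-editing task, which challenges the common view of its capabilities.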
  
  counterintuitive model-performance gpt-5.4
Visit annotations in context

Tags

model-performance

gpt-5.4

counterintuitive

Annotators

fxp007

URL

nrehiew.github.io/blog/minimal_editing/
www.anthropic.com www.anthropic.com

Introducing Claude Opus 4.7

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Opus 4.7 introduces a new `xhigh` ('extra high') effort level between `high` and `max`, giving users finer control over the tradeoff between reasoning and latency on hard problems.
  
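  Introducing the 'xhigh' effort level shows that AI models can offer finer control over the tradeoff between reasoning depth and response speed. It reflects growing user demand for performance tuning and suggests that AI systems are becoming more customizable and specialized.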
  
  model-tuning performance-control
Visit annotations in context

Tags

performance-control

model-tuning

Annotators

fxp007

URL

anthropic.com/news/claude-opus-4-7
introspective-diffusion.github.io introspective-diffusion.github.io

https://introspective-diffusion.github.io/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart, outperforming LLaDA-2.1-mini (16B) by +26 on AIME-24 and +15 on LiveCodeBench-v6 with half the parameters
  
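  Surprisingly, I-DLM-8B, with only 8 billion parameters, outperforms the 16-billion-parameter LLaDA-2.1-mini by 26 points on AIME-24 and 15 points on LiveCodeBench-v6. This indicates that a diffusion model has for the first time matched the quality of its same-scale autoregressive counterpart, while beating a larger diffusion model with half the parameters, overturning the common belief that diffusion models lag autoregressive models in quality.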
  
  surprising model-performance diffusion-models
Visit annotations in context

Tags

model-performance

diffusion-models

surprising

Annotators

fxp007

URL

introspective-diffusion.github.io/
artificialanalysis.ai artificialanalysis.ai

APEX-Agents-AA Benchmark Leaderboard | Artificial Analysis

1
1. fxp007 10 Apr 2026
  
  in Public
  
  Cost (USD) to run the evaluation: GPT-5.4 (xhigh): $1,110, Claude Opus 4.6 (max): $1,055
  
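  Running one evaluation of 452 tasks cost $1,110 for GPT-5.4 and $1,055 for Claude Opus 4.6, an average of roughly $2.3 to $2.5 per task. Gemini 3 Flash needed only $596 and scored 27.7% (versus 33.3% for the top model). This cost-performance data is critical for AI model selection: if the business scenario can accept a success rate of about 27% rather than 33%, Gemini 3 Flash saves nearly half the cost. In large-scale financial services deployments, that difference would be magnified thousands of times.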
  
  cost-analysis 2-dollars-per-task cost-performance model-selection
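
The per-task arithmetic in the note above can be checked with a short sketch. The dollar totals for GPT-5.4 and Claude Opus 4.6 come from the quoted leaderboard text; the 452-task count and the Gemini 3 Flash total are taken from the annotator's note, and the function names are hypothetical helpers introduced only for illustration.

```python
# Totals (USD) as quoted above; TASKS and the Gemini 3 Flash figure
# come from the annotator's note, not from the quoted leaderboard text.
TASKS = 452
costs_usd = {
    "GPT-5.4 (xhigh)": 1110,
    "Claude Opus 4.6 (max)": 1055,
    "Gemini 3 Flash": 596,
}


def cost_per_task(total_usd, tasks=TASKS):
    """Average cost of a single task for one full evaluation run."""
    return total_usd / tasks


def savings_vs(baseline_usd, candidate_usd):
    """Fraction of the baseline cost saved by switching to the candidate."""
    return 1 - candidate_usd / baseline_usd


for name, total in costs_usd.items():
    print(f"{name}: ${cost_per_task(total):.2f} per task")

flash = costs_usd["Gemini 3 Flash"]
for name in ("GPT-5.4 (xhigh)", "Claude Opus 4.6 (max)"):
    print(f"Gemini 3 Flash vs {name}: {savings_vs(costs_usd[name], flash):.0%} cheaper")
```

Under these figures the two top models land between about $2.33 and $2.46 per task, and the saving from the cheaper model is a little under half, which is consistent with the note's "nearly half".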
Visit annotations in context

Tags

cost-analysis

model-selection

cost-performance

2-dollars-per-task

Annotators

fxp007

URL

artificialanalysis.ai/evaluations/apex-agents-aa
arxiv.org arxiv.org

https://arxiv.org/abs/2604.02869

2
1. fxp007 08 Apr 2026
  
  in Public
  
  our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp)
  
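  Most people assume that only large language models can make significant progress through reinforcement learning on complex multi-turn tasks, but the authors show that even a smaller 4B model gains materially from their method, while the 30B model improves by a striking 11.5 percentage points, challenging the common belief that "bigger is better".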
  
  non-consensus model-size performance
2. fxp007 08 Apr 2026
  
  in Public
  
  the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller
  
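  Most people assume that model size correlates directly with performance and that larger models necessarily perform better. But the authors show that a model with only 4 billion parameters, after reinforcement learning training, outperforms GPT-4.1 and GPT-4o, which are 50 times larger, challenging the mainstream view in AI that "parameter count decides everything".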
  
  non-consensus model-size counterintuitive performance
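
The gains quoted in the first annotation above are stated in percentage points (pp), which is not the same as relative improvement. A minimal sketch, using only the before and after scores from the quoted abstract (the `gains` helper is a hypothetical name), shows both views side by side:

```python
# Before/after scores (percent) as quoted in the first annotation above.
results = {
    "Qwen3.5-4B": (63.8, 66.7),
    "Qwen3-30B-A3B": (58.0, 69.5),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    pp = after - before
    relative = pp / before * 100
    return round(pp, 1), round(relative, 1)


for model, (before, after) in results.items():
    pp, rel = gains(before, after)
    print(f"{model}: +{pp}pp absolute, +{rel}% relative")
```

Read this way, the 4B model's gain is modest in both absolute and relative terms, while the 30B-A3B model's gain is roughly a fifth of its starting score.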
Visit annotations in context

Tags

model-size

performance

counterintuitive

non-consensus

Annotators

fxp007

URL

arxiv.org/abs/2604.02869
Nov 2022
www.researchgate.net www.researchgate.net

jcpp_1749 840..847

2
1. CategoryKev 08 Nov 2022
  
  in Public
  
  The most intriguing result in the present study is the positive effect of white noise on performance for the ADHD children. This noise effect was present in both the non-medicated and medicated children. This supports the MBA (Moderate Brain Arousal) model (Sikström & Söderlund, 2007), suggesting that the endogenous (neural) noise level in children with ADHD is sub-optimal. MBA accounts for the noise-enhancing phenomenon by stochastic resonance (SR). The model suggests that noise in the environment introduces internal noise into the neural system through the perceptual system. Of particular importance, the MBA model suggests that the peak of the SR curve depends on the dopamine level, so that participants with low dopamine levels (ADHD) require more noise for optimal cognitive performance compared to controls.
  
  Author's self-described "most intriguing result"
  
  ADHD stochastic resonance Moderate Brain Arousal model dopamine cognitive performance
2. CategoryKev 08 Nov 2022
  
  in Public
  
  The MBA model predicts that noise enhances memory performance for ADHD and attenuates performance for controls. We will also argue for a link between the effects of noise, dopamine regulation, and cognitive performance.
  
  Prediction of Moderate Brain Arousal model and author's additional argument.
  
  ADHD Moderate Brain Arousal model dopamine cognitive performance
Visit annotations in context

Tags

stochastic resonance

ADHD

Moderate Brain Arousal model

cognitive performance

dopamine

Annotators

CategoryKev

URL

researchgate.net/profile/Goran-Soderlund-2/publication/6156980_Listen_to_the_noise_Noise_is_beneficial_for_cognitive_performance_in_ADHD/links/5ce2cce2458515712eb75553/Listen-to-the-noise-Noise-is-beneficial-for-cognitive-performance-in-ADHD.pdf
Apr 2022
twitter.com twitter.com

ReconfigBehSci on Twitter

1
1. emilybiggs 20 Apr 2022
  
  in Public
  
  (7) ReconfigBehSci on Twitter: “@ToddHorowitz3 probably- and I think there are many interesting questions around why he is there and whether he should be there. But to answer those properly, looking at the performance of the model seems important and interesting to me- that is all I am saying” / Twitter. (n.d.). Retrieved March 6, 2021, from https://twitter.com/SciBeh/status/1324389147050569734
  
  is:twitter lang:en COVID-19 R predictions model performance publicity expert scientist
Visit annotations in context

Tags

publicity

predictions

is:twitter

lang:en

model

scientist

performance

COVID-19

R

expert

Annotators

emilybiggs

URL

twitter.com/SciBeh/status/1324389147050569734
Mar 2019
debwagner.info debwagner.info

Behavior Engineering Model

1
1. ks9 22 Mar 2019
  
  in Public
  
  Behavior Engineering Model This page has a design that is not especially attractive or user friendly but it does provide an overview of Gilbert's Behavior Engineering Model. This is a model that can be used to analyze the issues that underlie performance. A six-cell model is presented. Rating 5/5
  
  etcnau performance improvement performance technology worthy performance Thomas Gilbert Behavior Engineering Model etc556
Visit annotations in context

Tags

performance technology

worthy performance

performance improvement

etc556

Thomas Gilbert

etcnau

Behavior Engineering Model

Annotators

ks9

URL

debwagner.info/hpttoolkit/gilbert_bem_hpt.htm
deepblue.lib.umich.edu deepblue.lib.umich.edu

PFI20251.indd

1
1. ks9 22 Mar 2019
  
  in Public
  
  Human Performance Technology Model This page is an eight page PDF that gives an overview of the human performance technology model. This is a black and white PDF that is simply written and is accessible to the layperson. Authors are prominent writers in the field of performance technology. Rating 5/5
  
  etcnau performance improvement performance technology evaluation models Van Tiem theory Behavior Engineering Model diagram etc556
Visit annotations in context

Tags

performance improvement

etc556

etcnau

Behavior Engineering Model

Van Tiem

theory

diagram

performance technology

evaluation models

Annotators

ks9

URL

deepblue.lib.umich.edu/bitstream/handle/2027.42/90576/20251_ftp.pdf
