Hypothesis

7 Matching Annotations

Jun 2026
magazine.sebastianraschka.com

https://magazine.sebastianraschka.com/p/llm-research-papers-2026-part1

1
1. fxp007 07 Jun 2026
  
  in Public
  
  120B-A12B may be a bit too large for local inference on regular consumer hardware
  
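  Most people assume that a larger parameter count always yields better performance, but the author implies that scaling a model too far may make it unsuitable for practical use. This pragmatic view challenges the industry consensus that "bigger is better" and highlights the hardware constraints of real-world deployment.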
  
  non-consensus model-size practical-deployment

Tags

non-consensus

practical-deployment

model-size

Annotators

fxp007

URL

magazine.sebastianraschka.com/p/llm-research-papers-2026-part1
Apr 2026
artificialanalysis.ai

APEX-Agents-AA Benchmark Leaderboard | Artificial Analysis

1
1. fxp007 10 Apr 2026
  
  in Public
  
  gpt-oss-20B (high): 0.7%
  
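  gpt-oss-20B scored 0.7%: of 452 professional tasks, fewer than 4 passed the evaluation. Against the top model's 33.3%, that is a gap of nearly 50x. This suggests that agent capability in professional services does not improve gradually; instead there is a distinct "capability ladder", and models below a certain scale fail almost completely on such tasks. The implication for enterprise AI selection: in professional-services scenarios, a "good-enough small model" may simply not exist. There are only "usable large models" and "completely unusable models".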
  
  0.7-percent capability-cliff model-size enterprise-selection
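
  The arithmetic behind this annotation can be checked with a minimal sketch. The figures (0.7%, 33.3%, 452 tasks) are taken from the annotation itself; the function names are hypothetical and exist only for illustration. It works out to roughly 3.2 tasks passed and a ratio of about 47.6x, consistent with "fewer than 4" and "nearly 50x".

  ```python
  def tasks_passed(pass_rate_percent, total_tasks):
      """Convert a pass rate in percent into an (unrounded) task count."""
      return pass_rate_percent / 100.0 * total_tasks


  def score_ratio(top_percent, low_percent):
      """How many times higher the top score is than the low score."""
      return top_percent / low_percent


  # Figures quoted in the annotation (not independently verified here).
  passed = tasks_passed(0.7, 452)
  ratio = score_ratio(33.3, 0.7)

  print(f"tasks passed: {passed:.2f} of 452")
  print(f"top score is {ratio:.1f}x the gpt-oss-20B score")
  ```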

Tags

model-size

capability-cliff

enterprise-selection

0.7-percent

Annotators

fxp007

URL

artificialanalysis.ai/evaluations/apex-agents-aa
arxiv.org

https://arxiv.org/abs/2604.02869

3
1. fxp007 08 Apr 2026
  
  in Public
  
  the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller
  
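  Most people assume that on complex tasks, large language models always substantially outperform small ones thanks to their advantages in parameter count and training data. The authors, however, show that their method lets a small 4B model surpass GPT-4.1 and GPT-4o on the Tau-Bench benchmark, which challenges the AI community's widespread faith in model scale.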
  
  non-consensus model-size benchmark
2. fxp007 08 Apr 2026
  
  in Public
  
  our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp)
  
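  Most people assume that on complex multi-turn tasks, only large language models can make significant progress through reinforcement learning. The authors show that even a smaller 4B model gains substantially from their method, while the 30B model improves by a striking 11.5 percentage points, challenging the common belief that "bigger is better".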
  
  non-consensus model-size performance
3. fxp007 08 Apr 2026
  
  in Public
  
  the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller
  
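  Most people assume that model size correlates directly with performance, so a larger model must perform better. The authors show that a model with only 4 billion parameters, after reinforcement-learning training, outperforms GPT-4.1 and GPT-4o, which are 50 times its size, challenging the prevailing view that "parameter count determines everything".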
  
  non-consensus model-size counterintuitive performance
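
  The gains quoted in the second annotation above are in percentage points (pp). As a rough sketch, the snippet below recomputes them and also shows the relative improvement, which is not stated in the source and is added here only for comparison. The helper name `gain` is hypothetical.

  ```python
  def gain(before_percent, after_percent):
      """Return (absolute gain in pp, relative gain in percent of the baseline)."""
      pp = after_percent - before_percent
      relative = pp / before_percent * 100.0
      return pp, relative


  # Before/after scores as quoted in the annotation.
  results = {
      "Qwen3.5-4B": (63.8, 66.7),
      "Qwen3-30B-A3B": (58.0, 69.5),
  }

  for model, (before, after) in results.items():
      pp, relative = gain(before, after)
      print(f"{model}: +{pp:.1f}pp ({relative:.1f}% relative to baseline)")
  ```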

Tags

benchmark

non-consensus

counterintuitive

performance

model-size

Annotators

fxp007

URL

arxiv.org/abs/2604.02869
Aug 2022
www.nature.com

Psychologists update their beliefs about effect sizes after replication studies

1
1. jackiekrauss 29 Aug 2022
  
  in BehSci
  
  McDiarmid, A. D., Tullett, A. M., Whitt, C. M., Vazire, S., Smaldino, P. E., & Stephens, J. E. (2021). Psychologists update their beliefs about effect sizes after replication studies. Nature Human Behaviour, 5(12), 1663–1673. https://doi.org/10.1038/s41562-021-01220-7
  
  is:article lang:en belief psychology effect size replication study self correction evidence belief updating Bayesian model trust critical pre-existing belief intellectual humility

Tags

replication study

belief

psychology

evidence

lang:en

belief updating

trust

self correction

is:article

Bayesian model

critical

pre-existing belief

effect size

intellectual humility

Annotators

jackiekrauss

URL

nature.com/articles/s41562-021-01220-7
Oct 2018
iphysresearch.github.io

A Paper A Day

1
1. Herb 18 Oct 2018
  
  in Public
  
  Approximate Fisher Information Matrix to Characterise the Training of Deep Neural Networks
  
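  Characterizing deep neural network training (convergence/generalization performance) with an approximate Fisher information matrix, which can be used to automatically optimize mini-batch size/learning rate
  
  An interesting paper. It derives new quantities from the Fisher matrix to measure model performance during training and uses them to optimize mini-batch sizes and learning rates. | The figures in the paper are also nicely drawn. | The authors argue that the conventional wisdom of gradually increasing the batch size is only partially true, and that gradually decreasing it may improve the model's convergence and generalization.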
  
  Generalization batch-size learning rate fisher matrix model evaluation

Tags

learning rate

batch-size

fisher matrix

Generalization

model evaluation

Annotators

Herb

URL

iphysresearch.github.io/paper_summary/APaperADay.html
