Hypothesis

12 Matching Annotations

Last 7 days
nlp.elvissaravia.com

https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f

1
1. fxp007 01 May 2026
  
  in Public
  
  DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro
  
  DeepSeek V4-Pro-Max surpasses GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks, which points to a substantial performance gain for open-source models.
  
  performance-comparison benchmark open-source-model

Tags

benchmark

performance-comparison

open-source-model

Annotators

fxp007

URL

nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f
sakana.ai

https://sakana.ai/fugu-beta/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  GPQAD | 94.4 | 90.9 | 92.7 | 92.4 | **95.1**
  LCBv6 | 90.3 | 92.1 | 92.4 | 90.4 | **93.2**
  SWEPro | 48.4 | 51.2 | _53.4_ | 51.3 | **54.2**
  
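  The performance comparison table shows Sakana Fugu Ultra outperforming its competitors on all three benchmarks: 95.1% on GPQAD (ahead of Gemini 3.1's 94.4%), 93.2% on LCBv6 (ahead of GPT 5.4's 92.1%), and 54.2% on SWEPro (ahead of Opus 4.6's 53.4%). These figures suggest that its multi-model coordination strategy does deliver a performance gain, with the advantage especially clear on scientific reasoning tasks.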
  
  data-point performance-benchmark model-comparison

Tags

performance-benchmark

data-point

model-comparison

Annotators

fxp007

URL

sakana.ai/fugu-beta/
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

1
1. fxp007 30 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
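  This is an important performance comparison data point: reasoning models are improving 2-3x faster than non-reasoning models. That is a significant acceleration ratio, and it hints that the breakthrough in reasoning capability may mark a turning point in AI development. However, the article does not provide specific benchmark data to support this multiple, so it should be treated with caution.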
  
  data-point statistics model-comparison

Tags

data-point

model-comparison

statistics

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
Apr 2026
blog.vidocsecurity.com

We Reproduced Anthropic's Mythos Findings With Public Models

1
1. fxp007 24 Apr 2026
  
  in Public
  
  The takeaway is not whether Mythos is better or more powerful. It is that public models can already achieve much the same results.
  
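  This is a surprising conclusion: Anthropic's Mythos model may not be much more powerful than public models; it is just that their workflows are more mature. This challenges the industry's excessive enthusiasm for proprietary models and suggests that the real innovation lies in how AI tools are organized and used, not in the mystique of the model itself.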
  
  model-comparison workflow-over-model surprising-finding

Tags

workflow-over-model

model-comparison

surprising-finding

Annotators

fxp007

URL

blog.vidocsecurity.com/blog/we-reproduced-anthropics-mythos-findings-with-public-models
antirez.com

https://antirez.com/news/163

1
1. fxp007 24 Apr 2026
  
  in Public
  
  Stronger models hallucinate less, so they can't see the problem in any side of the spectrum: the hallucination side of small models, and the real understanding side of Mythos.
  
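  This observation is highly counterintuitive: stronger models actually find it harder to spot certain vulnerabilities, because in reducing hallucinations they also lose an 'intuitive understanding' of the problem. This suggests that AI security research may need a combination of models at different capability levels rather than simply pursuing bigger and stronger models.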
  
  ai-paradox model-comparison

Tags

model-comparison

ai-paradox

Annotators

fxp007

URL

antirez.com/news/163
qwen.ai

https://qwen.ai/blog?id=qwen3.6-27b

1
1. fxp007 23 Apr 2026
  
  in Public
  
  It also surpasses all peer-scale dense models by a wide margin.
  
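  In most cases people might assume that larger models perform better, but the authors claim that Qwen3.6-27B performs outstandingly among dense models of the same scale, a view that runs counter to mainstream thinking.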
  
  non-consensus counterintuitive model-scale performance-comparison

Tags

non-consensus

counterintuitive

performance-comparison

model-scale

Annotators

fxp007

URL

qwen.ai/blog
epoch.ai

https://epoch.ai/blog/mirrorcode-preliminary-results

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Older models were more prone to submitting prematurely, even when test cases weren't passing.
  
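  This observation reveals a significant difference in task persistence across AI model versions. Earlier models were more likely to submit incomplete solutions prematurely, while the latest models show stronger task persistence and engineering judgment. The difference may reflect an evolution in AI's capacity for self-assessment and task management.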
  
  model-comparison task-persistence ai-evaluation

Tags

ai-evaluation

model-comparison

task-persistence

Annotators

fxp007

URL

epoch.ai/blog/mirrorcode-preliminary-results
www.microsoft.com

https://www.microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference.
  
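  Surprisingly, performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show a clear advantage on tasks requiring logic, learning, abstraction, and social inference. This finding reveals a fundamental difference in how capabilities are distributed across AI models and offers important guidance for model selection and optimization.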
  
  surprising model-comparison knowledge-vs-reasoning

Tags

surprising

knowledge-vs-reasoning

model-comparison

Annotators

fxp007

URL

microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

1
1. fxp007 09 Apr 2026
  
  in Public
  
  we studied emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation.
  
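  [Inspiration] This paper studies only one model, Claude Sonnet 4.5, but its methodology applies to all large models. This suggests an urgent research agenda: would a side-by-side comparison of emotion vectors across different architectures (GPT, Gemini, Qwen, DeepSeek) reveal systematic emotional biases, for example some models being inherently more "anxious" and others more "detached"? This is not only an academic question but also a practical need for product selection and safety evaluation.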
  
  inspiration cross-model-comparison emotion-audit model-selection

Tags

inspiration

cross-model-comparison

emotion-audit

model-selection

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Feb 2021
github.com

maccman/bowline

1
1. TylerRick 16 Feb 2021
  
  in Public
  
  Compared to existing Ruby desktop frameworks, such as Shoes, Bowline's strengths are its adherence to MVC and use of HTML/JavaScript.
  
  comparison advantages/merits/pros MVC (model–view–controller) ruby: desktop app framework

Tags

comparison

MVC (model–view–controller)

advantages/merits/pros

ruby: desktop app framework

Annotators

TylerRick

URL

github.com/maccman/bowline
Jun 2020
arxiv.org

Spatial interactions in urban scaling laws

1
1. ErikStuchly 26 Jun 2020
  
  in BehSci
  
  Altmann, E. G. (2020). Spatial interactions in urban scaling laws. ArXiv:2006.14140 [Physics]. http://arxiv.org/abs/2006.14140
  
  is:article lang:en spatial interaction urban scaling law analysis independence generative model modeling model comparison

Tags

lang:en

model comparison

urban scaling law

modeling

analysis

is:article

spatial interaction

generative model

independence

Annotators

ErikStuchly

URL

arxiv.org/abs/2006.14140
May 2020
psyarxiv.com

Personal relative deprivation negatively predicts engagement in group decision-making

1
1. edampf 05 May 2020
  
  in BehSci
  
  Rotella, A. M., & Mishra, S. (2020, April 24). Personal relative deprivation negatively predicts engagement in group decision-making. https://doi.org/10.31234/osf.io/6d35w
  
  is:preprint lang:en relative deprivation group social comparison individual differences decision making participation engagement inequality social cognition group dynamics regression model socioemotional

Tags

lang:en

social comparison

is:preprint

inequality

group

social cognition

socioemotional

relative deprivation

decision making

individual differences

regression model

engagement

group dynamics

participation

Annotators

edampf

URL

psyarxiv.com/6d35w/
