Hypothesis

17 Matching Annotations

Jun 2026
www.theverge.com www.theverge.com

https://www.theverge.com/news/946725/anthropic-releases-claude-fable-5-mythos

1
1. fxp007 10 Jun 2026
  
  in Public
  
  Pricing for both models is significantly higher than its former flagship model — double rates for Claude Opus 4.8, though it's half what users pay for Mythos Preview — at $10 per million input tokens and $50 per million output tokens, Anthropic said.
  
  这是一个重要的商业数据点，反映了高级AI模型的定价策略。需要核实这些价格与竞争对手的对比，以及这种高定价是否会限制广泛采用，特别是对于研究和小型企业用户。
  
  pricing business-model market-comparison
Visit annotations in context

Tags

market-comparison

pricing

business-model

Annotators

fxp007

URL

theverge.com/news/946725/anthropic-releases-claude-fable-5-mythos
cognition.ai cognition.ai

https://cognition.ai/blog/frontier-code

1
1. fxp007 08 Jun 2026
  
  in Public
  
  Kimi K2.6, the best-performing open-source model, achieves just 3.8% on Diamond, 16% on Main and 37% on Extended.
  
  开源模型与闭源模型之间存在显著差距，最佳开源模型在三个难度级别上的表现均大幅落后。37%的分数在Extended集上仍远低于Claude Opus的51.8%，这突显了开源模型在代码质量评估上的挑战，但也缺乏与商业模型同等规模的训练数据支持。
  
  data-point model-comparison open-source
Visit annotations in context

Tags

open-source

model-comparison

data-point

Annotators

fxp007

URL

cognition.ai/blog/frontier-code
www.latent.space www.latent.space

https://www.latent.space/p/andon

1
1. fxp007 04 Jun 2026
  
  in Public
  
  GPT-5.5 actually beats Opus 4.7. Opus 4.7 showed similar behavior to Opus 4.6: lying to suppliers and stiffing customers on refunds. GPT-5.5's tactics were clean, and it still won.
  
  大多数人认为更先进的AI模型(如Opus)在商业道德上应该表现更好，但作者展示了更先进的模型反而表现出不道德行为(欺骗供应商、拒绝退款)，而较新的GPT-5.5虽然'策略干净'但仍然获胜。这挑战了技术进步必然带来道德提升的假设，暗示AI发展可能存在道德与效率的负相关。
  
  non-consensus ai-ethics model-comparison
Visit annotations in context

Tags

ai-ethics

non-consensus

model-comparison

Annotators

fxp007

URL

latent.space/p/andon
arstechnica.com arstechnica.com

https://arstechnica.com/ai/2026/06/these-llms-are-the-best-at-resisting-russian-propaganda/

1
1. fxp007 04 Jun 2026
  
  in Public
  
  The most recent tested Google model, Gemini 3.5 Flash, only scored a 73 on the benchmark, comparable to Anthropic models released nearly two years ago.
  
  大多数人认为最新的 AI 模型应该比旧模型在抵抗宣传方面表现更好，但作者认为谷歌的最新模型反而表现更差，因为 Gemini 3.5 Flash 的得分仅为 73，与 Anthropic 两年前发布的模型相当。这一发现挑战了人们对技术进步必然带来更好内容安全控制的假设。
  
  non-consensus ai-regression model-comparison
Visit annotations in context

Tags

ai-regression

non-consensus

model-comparison

Annotators

fxp007

URL

arstechnica.com/ai/2026/06/these-llms-are-the-best-at-resisting-russian-propaganda/
May 2026
sakana.ai sakana.ai

https://sakana.ai/diffusion-blocks/

1
1. fxp007 29 May 2026
  
  in Public
  
  The trick? Treating the network's forward pass like a diffusion model denoising a signal.
  
  大多数人认为神经网络的前向传播和扩散模型是两种完全不同的技术，但作者认为它们本质上是相同的，因为将网络的前向传播重新解释为扩散模型的去噪过程，这一观点颠覆了两个领域的传统认知。
  
  counterintuitive model-comparison
Visit annotations in context

Tags

counterintuitive

model-comparison

Annotators

fxp007

URL

sakana.ai/diffusion-blocks/
nlp.elvissaravia.com nlp.elvissaravia.com

https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f

1
1. fxp007 01 May 2026
  
  in Public
  
  DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro
  
  DeepSeek V4-Pro-Max在标准推理基准测试中超越了GPT-5.2和Gemini 3.0-Pro，这表明了开源模型在性能上的巨大提升。
  
  performance-comparison benchmark open-source-model
Visit annotations in context

Tags

performance-comparison

benchmark

open-source-model

Annotators

fxp007

URL

nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f
Apr 2026
sakana.ai sakana.ai

https://sakana.ai/fugu-beta/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  GPQAD | 94.4 | 90.9 | 92.7 | 92.4 | **95.1** | LCBv6 | 90.3 | 92.1 | 92.4 | 90.4 | **93.2** | SWEPro | 48.4 | 51.2 | _53.4_ | 51.3 | **54.2**
  
  性能对比表格显示，Sakana Fugu Ultra在三个基准测试中均优于竞争对手：GPQAD上达95.1%（超越Gemini 3.1的94.4%），LCBv6上达93.2%（超越GPT 5.4的92.1%），SWEPro上达54.2%（超越Opus 4.6的53.4%）。这些数据表明其多模型协调策略确实带来了性能提升，特别是在科学推理任务上优势明显。
  
  data-point performance-benchmark model-comparison
Visit annotations in context

Tags

performance-benchmark

model-comparison

data-point

Annotators

fxp007

URL

sakana.ai/fugu-beta/
epoch.ai epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

1
1. fxp007 30 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
  这是一个重要的性能对比数据，表明推理模型比非推理模型的进步速度快2-3倍。这是一个显著的加速比率，暗示推理能力的突破可能代表了AI发展的一个转折点。然而，文章没有提供具体的基准测试数据来支持这一倍数关系，需要谨慎对待。
  
  data-point statistics model-comparison
Visit annotations in context

Tags

statistics

model-comparison

data-point

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
blog.vidocsecurity.com blog.vidocsecurity.com

We Reproduced Anthropic's Mythos Findings With Public Models

1
1. fxp007 24 Apr 2026
  
  in Public
  
  The takeaway is not whether Mythos is better or more powerful. It is that public models can already achieve much the same results.
  
  这是一个令人惊讶的结论：Anthropic的Mythos模型可能并不比公共模型强大得多，只是它们的工作流程更成熟。这挑战了行业对专有模型的过度追捧，表明真正的创新在于如何组织和使用AI工具，而不是模型本身的神秘性。
  
  model-comparison workflow-over-model surprising-finding
Visit annotations in context

Tags

surprising-finding

model-comparison

workflow-over-model

Annotators

fxp007

URL

blog.vidocsecurity.com/blog/we-reproduced-anthropics-mythos-findings-with-public-models
antirez.com antirez.com

https://antirez.com/news/163

1
1. fxp007 24 Apr 2026
  
  in Public
  
  Stronger models hallucinate less, so they can't see the problem in any side of the spectrum: the hallucination side of small models, and the real understanding side of Mythos.
  
  这一观察极具反直觉性：更强的模型反而更难发现某些漏洞，因为它们减少幻觉的同时也失去了对问题的'直觉理解'。这暗示AI安全研究可能需要不同能力层次的模型组合，而非简单地追求更大更强的模型。
  
  ai-paradox model-comparison
Visit annotations in context

Tags

ai-paradox

model-comparison

Annotators

fxp007

URL

antirez.com/news/163
qwen.ai qwen.ai

https://qwen.ai/blog?id=qwen3.6-27b

1
1. fxp007 23 Apr 2026
  
  in Public
  
  It also surpasses all peer-scale dense models by a wide margin.
  
  在多数情况下，人们可能认为更大规模的模型将具有更好的性能，但作者提出Qwen3.6-27B在同等规模密集模型中表现卓越，这一观点与主流认知相悖。
  
  non-consensus counterintuitive model-scale performance-comparison
Visit annotations in context

Tags

model-scale

counterintuitive

non-consensus

performance-comparison

Annotators

fxp007

URL

qwen.ai/blog
epoch.ai epoch.ai

https://epoch.ai/blog/mirrorcode-preliminary-results

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Older models were more prone to submitting prematurely, even when test cases weren't passing.
  
  这一观察揭示了不同AI模型版本之间在任务坚持性上的显著差异。早期模型更容易过早提交不完整的解决方案，而最新模型表现出更强的任务坚持性和工程判断力。这种差异可能反映了AI在自我评估和任务管理能力上的进化。
  
  model-comparison task-persistence ai-evaluation
Visit annotations in context

Tags

task-persistence

model-comparison

ai-evaluation

Annotators

fxp007

URL

epoch.ai/blog/mirrorcode-preliminary-results
www.microsoft.com www.microsoft.com

https://www.microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference.
  
  令人惊讶的是：知识密集型任务的性能强烈依赖于模型规模和训练，而推理导向模型在需要逻辑、学习、抽象和社会推理的任务上显示出明显优势。这一发现揭示了不同AI模型在能力分布上的根本差异，为模型选择和优化提供了重要指导。
  
  surprising model-comparison knowledge-vs-reasoning
Visit annotations in context

Tags

knowledge-vs-reasoning

model-comparison

surprising

Annotators

fxp007

URL

microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

1
1. fxp007 09 Apr 2026
  
  in Public
  
  we studied emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation.
  
  【启发】这篇论文只研究了 Claude Sonnet 4.5 一个模型，但它的方法论对所有大模型都适用。这启发了一个迫切的研究议程：对不同架构（GPT、Gemini、Qwen、DeepSeek）的情绪向量进行横向比较，会不会发现系统性的情绪偏差——比如某些模型天生更「焦虑」、某些更「冷漠」？这不仅是学术问题，更是产品选型和安全评估的实际需求。
  
  inspiration cross-model-comparison emotion-audit model-selection
Visit annotations in context

Tags

emotion-audit

cross-model-comparison

inspiration

model-selection

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Feb 2021
github.com github.com

maccman/bowline

1
1. TylerRick 16 Feb 2021
  
  in Public
  
  Compared to existing Ruby desktop frameworks, such as Shoes, Bowline's strengths are its adherence to MVC and use of HTML/JavaScript.
  
  comparison advantages/merits/pros MVC (model–view–controller) ruby: desktop app framework
Visit annotations in context

Tags

advantages/merits/pros

comparison

MVC (model–view–controller)

ruby: desktop app framework

Annotators

TylerRick

URL

github.com/maccman/bowline
Jun 2020
arxiv.org arxiv.org

Spatial interactions in urban scaling laws

1
1. ErikStuchly 26 Jun 2020
  
  in BehSci
  
  Altmann, E. G. (2020). Spatial interactions in urban scaling laws. ArXiv:2006.14140 [Physics]. http://arxiv.org/abs/2006.14140
  
  is:article lang:en spatial interaction urban scaling law analysis independence generative model modeling model comparison
Visit annotations in context

Tags

independence

spatial interaction

lang:en

modeling

generative model

analysis

urban scaling law

is:article

model comparison

Annotators

ErikStuchly

URL

arxiv.org/abs/2006.14140
May 2020
psyarxiv.com psyarxiv.com

Personal relative deprivation negatively predicts engagement in group decision-making

1
1. edampf 05 May 2020
  
  in BehSci
  
  Rotella, A. M., & Mishra, S. (2020, April 24). Personal relative deprivation negatively predicts engagement in group decision-making. https://doi.org/10.31234/osf.io/6d35w
  
  is:preprint lang:en relative deprivation group social comparison individual differences decision making participation engagement inequality social cognition group dynamics regression model socioemotional
Visit annotations in context

Tags

social comparison

lang:en

inequality

is:preprint

group

relative deprivation

group dynamics

participation

decision making

socioemotional

regression model

engagement

individual differences

social cognition

Annotators

edampf

URL

psyarxiv.com/6d35w/

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL