Hypothesis

12 Matching Annotations

Jul 2026
quesma.com

Qwen 3.6 27B is the sweet spot for local development - Quesma Blog

1
1. fxp007 03 Jul 2026
  
  in Public
  
  30 tokens per second is not bad, well within typical frontier model API range.
  
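  30 tok/s is a key threshold for user experience. Through empirical comparison, the author shows that a local 27B model accelerated with MTP generates at speeds comparable to commercial API response rates. This breaks the stereotype that "local models are inevitably too slow for real development" and shows that local models have reached practical speeds.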
  
  performance-data local-vs-api benchmark

Tags

benchmark

local-vs-api

performance-data

Annotators

fxp007

URL

quesma.com/blog/qwen-36-is-awesome/
Jun 2026
arstechnica.com

https://arstechnica.com/google/2026/06/googles-latest-diffusiongemma-open-ai-model-comes-with-a-4x-speed-boost/

1
1. fxp007 10 Jun 2026
  
  in Public
  
  In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second. With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second.
  
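  This is an important performance claim, but it lacks details about the test environment. The specific test setup, hardware configuration, model version, and comparison baseline are needed to verify the accuracy and comparability of these numbers.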
  
  performance-data benchmark technical-spec

Tags

benchmark

technical-spec

performance-data

Annotators

fxp007

URL

arstechnica.com/google/2026/06/googles-latest-diffusiongemma-open-ai-model-comes-with-a-4x-speed-boost/
cognition.ai

https://cognition.ai/blog/frontier-code

2
1. fxp007 08 Jun 2026
  
  in Public
  
  FrontierCode produces 81% less misclassification errors than other leading benchmarks.
  
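  Compared with existing benchmarks, an 81% reduction in misclassification errors is a strong data point supporting the accuracy and reliability of FrontierCode's evaluation method. It suggests the benchmark is closer to the standards human developers actually apply, but a detailed analysis of the misclassification types is missing.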
  
  data-point statistics benchmark-accuracy
2. fxp007 08 Jun 2026
  
  in Public
  
  We achieve an 81% lower false positive rate compared to SWE-Bench Pro.
  
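  An 81% lower false positive rate is a significant quantitative improvement, indicating that FrontierCode assesses code quality more accurately than existing benchmarks. The data point is persuasive because it is a direct comparison against an existing benchmark, which shows the advantage of the evaluation method.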
  
  data-point statistics benchmark-comparison

Tags

data-point

statistics

benchmark-comparison

benchmark-accuracy

Annotators

fxp007

URL

cognition.ai/blog/frontier-code
May 2026
epoch.ai

RIP Classic Reasoning Benchmarks. What's Next? - Epoch AI

1
1. fxp007 07 May 2026
  
  in Public
  
  Top models scored around 40%.
  
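  This 40% accuracy indicates that current AI systems perform only modestly at understanding IKEA furniture assembly instructions, far below human level. The data point exposes a clear weakness in multimodal spatial reasoning, while also giving the field a concrete baseline to improve on.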
  
  data-point multimodal-reasoning benchmark-performance

Tags

data-point

multimodal-reasoning

benchmark-performance

Annotators

fxp007

URL

epoch.ai/gradient-updates/rip-classic-benchmarks
subq.ai

https://subq.ai/introducing-subq

3
1. fxp007 07 May 2026
  
  in Public
  
  SWE-Bench Verified score of 81.8 compared to Opus 4.6 (80.8) and Deepseek 4.0 Pro (80.0).
  
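  SubQ scores 81.8 on SWE-Bench Verified, slightly above Claude Opus 4.6 (80.8) and Deepseek 4.0 Pro (80.0). This data point suggests SubQ has reached frontier level on software engineering tasks, further supporting its practical value.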
  
  data-point benchmark performance
2. fxp007 07 May 2026
  
  in Public
  
  Research result of 83 and a production model, third-party verified score of 65.9, SubQ 1M-Preview compares favorably with other SOTA models like Claude Opus 4.7 (32.2), GPT 5.5 (74), and Gemini 3.1 Pro (26.3).
  
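  On the MRCR v2 test, the SubQ 1M-Preview production model scores 65.9, well above Claude Opus 4.7 (32.2) and Gemini 3.1 Pro (26.3), though below GPT 5.5 (74). This data point supports SubQ's strength in multi-item retrieval and reasoning; the research model's score of 83 is higher still.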
  
  data-point benchmark comparison
3. fxp007 07 May 2026
  
  in Public
  
  SubQ 1M-Preview scores 95% accuracy, compared to 94.8% for Claude Opus 4.6
  
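  On the RULER 128K benchmark, SubQ 1M-Preview reaches 95% accuracy, slightly above Claude Opus 4.6's 94.8%. This data point suggests SubQ has reached frontier level in long-context understanding while breaking through the performance bottleneck of traditional quadratic-scaling models.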
  
  data-point benchmark accuracy

Tags

benchmark

data-point

comparison

accuracy

performance

Annotators

fxp007

URL

subq.ai/introducing-subq
epoch.ai

https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job

1
1. fxp007 07 May 2026
  
  in Public
  
  The benchmark tasks were meticulously constructed to be realistic, involving the hard work of hundreds of experts and likely millions of dollars — placing it among the most expensive economics papers of all time.
  
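  The author notes that the GDPval benchmark likely cost millions of dollars and involved hundreds of experts. This data point shows how expensive AI benchmarks can be, and also hints at a possible imbalance in how resources are allocated to such tests. Given the gap between its cost and its actual economic impact, this pattern of high investment and low return deserves reflection.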
  
  data-point benchmark-cost ai-economics

Tags

data-point

benchmark-cost

ai-economics

Annotators

fxp007

URL

epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job
Apr 2026
api-docs.deepseek.com

https://api-docs.deepseek.com/news/news260424

1
1. fxp007 30 Apr 2026
  
  in Public
  
  🔹 **Enhanced Agentic Capabilities:** Open-source SOTA in Agentic Coding benchmarks.
  
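  Although the post provides no specific benchmark figures, it claims open-source SOTA (state of the art) on agentic coding benchmarks. This is an important assertion, but it lacks concrete quantitative metrics. If true, it would represent a major breakthrough for DeepSeek in AI agent capabilities, particularly in code generation and execution tasks. The specific benchmark data in the technical report is needed to verify the claim.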
  
  data-point benchmark performance-claim

Tags

benchmark

data-point

performance-claim

Annotators

fxp007

URL

api-docs.deepseek.com/news/news260424
sakana.ai

https://sakana.ai/fugu-beta/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  GPQAD | 94.4 | 90.9 | 92.7 | 92.4 | **95.1**
  LCBv6 | 90.3 | 92.1 | 92.4 | 90.4 | **93.2**
  SWEPro | 48.4 | 51.2 | _53.4_ | 51.3 | **54.2**
  
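  The performance comparison table shows Sakana Fugu Ultra outperforming competitors on all three benchmarks: 95.1% on GPQAD (above Gemini 3.1's 94.4%), 93.2% on LCBv6 (above GPT 5.4's 92.1%), and 54.2% on SWEPro (above Opus 4.6's 53.4%). These figures suggest that its multi-model orchestration strategy does deliver performance gains, with a notable advantage on scientific reasoning tasks.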
  
  data-point performance-benchmark model-comparison

Tags

performance-benchmark

data-point

model-comparison

Annotators

fxp007

URL

sakana.ai/fugu-beta/
www.theaivalley.com

https://www.theaivalley.com/p/the-claude-mythos-era

1
1. fxp007 16 Apr 2026
  
  in Public
  
  The model reportedly scored 93.9% on SWE-bench Verified and 77.8% on SWE-bench Pro, but its strongest signal came from real-world results, including uncovering a 27-year-old flaw in OpenBSD, a 16-year-old vulnerability in FFmpeg, and autonomously chaining Linux kernel exploits without human input.
  
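  These striking vulnerability discoveries suggest that AI has surpassed traditional security tools and can autonomously find flaws that went undetected for decades. The ability to autonomously chain Linux kernel exploits in particular demonstrates AI's revolutionary potential in cybersecurity, and could fundamentally change how security research and vulnerability remediation are done.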
  
  ai-security benchmark-data real-world-results

Tags

ai-security

benchmark-data

real-world-results

Annotators

fxp007

URL

theaivalley.com/p/the-claude-mythos-era
