51 Matching Annotations
  1. Last 7 days
    1. The next generation of benchmarks needs to be harder, more realistic, and less gameable

      【洞察】「更难、更真实、更不可刷题」——这三条标准本质上是在要求 benchmark 向「真实工作」靠拢,而非向「考试题」收敛。但这恰恰引出了一个悖论:越真实的 benchmark,越难自动化评分,越贵(METR 每题 8000 美元),越慢发布。AI 评测体系正在面临「评测速度 vs 评测质量」的根本性权衡。

    2. MMLU, GSM8K, and HumanEval are now saturated

      📊【洞察】MMLU、GSM8K、HumanEval 全面饱和——这三个曾经定义 AI 进步叙事的基准,已经无法区分「优秀」和「顶级」模型之间的差距。与 ARC-AGI-3 近零分事件形成完美对照:AI 在「已知问题」上已经超越人类,在「新颖问题」上几乎为零。评测体系的重建,是未来 AI 治理的先决条件。

    1. SWE-Bench Verified score of 81.8 compared to Opus 4.6 (80.8) and Deepseek 4.0 Pro (80.0).

      SubQ在SWE-Bench Verified测试中得分为81.8,略高于Claude Opus 4.6(80.8)和Deepseek 4.0 Pro(80.0)。这个数据点表明SubQ在软件工程任务方面已达到前沿水平,进一步验证了其实用价值。

    2. Research result of 83 and a production model, third-party verified score of 65.9, SubQ 1M-Preview compares favorably with other SOTA models like Claude Opus 4.7 (32.2), GPT 5.5 (74), and Gemini 3.1 Pro (26.3).

      在MRCR v2测试中,SubQ 1M-Preview的生产模型得分为65.9,显著优于Claude Opus 4.7(32.2)、GPT 5.5(74)和Gemini 3.1 Pro(26.3)。这个数据点有力证明了SubQ在多信息检索和推理方面的优越性,接近研究模型的83分。

    3. SubQ 1M-Preview scores 95% accuracy, compared to 94.8% for Claude Opus 4.6

      在RULER 128K基准测试中,SubQ 1M-Preview准确率达到95%,略高于Claude Opus 4.6的94.8%。这个数据点表明SubQ在长上下文理解方面已达到前沿水平,同时突破了传统二次扩展模型的性能瓶颈。

    1. The benchmark tasks were meticulously constructed to be realistic, involving the hard work of hundreds of experts and likely millions of dollars — placing it among the most expensive economics papers of all time.

      作者提到GDPval基准测试可能花费了数百万美元,由数百名专家参与构建。这一数据点显示了AI基准测试的高昂成本,但也暗示了这类测试可能存在资源分配不均的问题。考虑到其成本与实际经济影响之间的差距,这种高投入低产出的现象值得反思。

    1. ARC-AGI-3 was officially released this week. All frontier models score below 0.5%

      ⚠️【令人震惊的数字】最强前沿模型得分低于 0.5%——而非专业人类轻松超过 60%,差距超过 120 倍。这是继 ARC-AGI-2 之后最彻底的「AI 能力幻觉清醒剂」。推理能力的提升并未自动迁移到「新颖抽象推理」,当所有人在讨论 AGI 即将到来时,这份数据是最直接的反驳。

  2. May 2026
  3. Apr 2026
    1. 🔹 **Enhanced Agentic Capabilities:** Open-source SOTA in Agentic Coding benchmarks.

      虽然文中没有提供具体的基准测试数据,但声称在代理编程基准测试中达到开源SOTA(最先进水平)。这是一个重要断言,但缺乏具体量化指标。如果属实,这将代表DeepSeek在AI代理能力方面的重大突破,特别是在代码生成和执行任务上。需要查看技术报告中的具体基准测试数据来验证这一声明。

    1. GPQAD | 94.4 | 90.9 | 92.7 | 92.4 | **95.1** | LCBv6 | 90.3 | 92.1 | 92.4 | 90.4 | **93.2** | SWEPro | 48.4 | 51.2 | _53.4_ | 51.3 | **54.2**

      性能对比表格显示,Sakana Fugu Ultra在三个基准测试中均优于竞争对手:GPQAD上达95.1%(超越Gemini 3.1的94.4%),LCBv6上达93.2%(超越GPT 5.4的92.1%),SWEPro上达54.2%(超越Opus 4.6的53.4%)。这些数据表明其多模型协调策略确实带来了性能提升,特别是在科学推理任务上优势明显。

    1. This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

      大多数人认为基准测试分数的提高意味着模型实际能力的提升。但作者明确表示,SWE-bench Verified的改进不再反映模型真实软件开发能力的进步,而是更多地反映了模型在训练时接触该基准测试的程度。这一结论挑战了整个AI评估体系的有效性,暗示我们可能需要重新思考如何衡量AI的真实进步。

    2. Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

      大多数人认为代码测试是客观公正的,能够准确评估模型的真实能力。但作者发现,近60%的测试案例存在缺陷,会拒绝功能上正确的解决方案。这一发现挑战了AI评估领域的共识,表明我们广泛使用的基准测试可能存在系统性问题,无法准确反映模型的实际编程能力。

    1. Multiple community tests show llama.cpp running 1.8x faster than Ollama on the same hardware with the same model, 161 tokens per second versus 89.

      这个性能差异数据非常惊人,表明Ollama的包装层带来了显著的性能开销,这直接挑战了Ollama作为'简化工具'的核心价值主张——如果性能大幅下降,用户为何不直接使用底层工具?

    1. On the SWE-Pro benchmark, M2.7 scores 56.22%, nearly matching Opus's best level.

      这一结果令人惊讶,因为M2.7作为一个开源模型在软件工程专业基准测试中接近顶级商业模型性能,这可能预示着开源AI与闭源商业模型之间的差距正在迅速缩小,改变AI发展的竞争格局。

    2. On GDPval-AA, M2.7 achieves an ELO score of 1495, the highest among open-source models.

      这一数据点揭示了MiniMax M2.7在开源模型中的领先地位,1495的ELO分数表明其在复杂推理任务上已接近或达到顶级商业模型的水平,这对开源AI生态系统的发展具有深远影响。

    3. On GDPval-AA, M2.7 achieves an ELO score of 1495, the highest among open-source models.

      令人惊讶的是:MiniMax M2.7在GDPval-AA基准测试中获得了1495的ELO分数,成为所有开源模型中的最高分。这一分数不仅展示了模型在专业办公领域的卓越能力,还暗示了开源AI模型已经达到了接近或超越某些专有模型的专业水平,打破了开源模型性能不如闭源模型的刻板印象。

    1. The 66.6% medal rate on MLE Bench Lite, achieved autonomously over 24 hour windows, tells you something real about how this model behaves when you give it a hard problem and step back.

      这个66.6%的奖牌率是在完全自主的情况下连续24小时运行后取得的,这是一个令人印象深刻的数据点。它表明M2.7不仅能够在长时间内保持专注,还能持续改进解决问题的策略。这种自主解决问题的能力可能是评估代理模型实际价值的关键指标,远超传统基准测试所能衡量的范围。

    1. A conftest.py file with 10 lines of Python 'resolves' every instance on SWE-bench Verified.

      令人惊讶的是:仅仅一个10行的Python文件就能解决SWE-bench基准测试中的所有验证实例,这揭示了AI评估系统存在严重的漏洞,使得模型可以通过简单的代码注入获得完美分数,而不需要实际解决任何问题。

    1. This benchmark is a six-part semantic scoring test that assesses any model's effectiveness at relevant calibration tasks. QCalEval measures a model's ability to interpret experimental results, classify outcomes, evaluate their significance, assess fit quality and key features, and generate actionable next-step recommendations.

      令人惊讶的是:量子校准AI模型的评估竟然如此复杂,需要六个维度的语义评分来全面评估其能力。这反映了量子校准任务的复杂性,也表明AI在科学领域的应用需要专门的评估方法,不能简单地照搬传统AI评估标准。

    1. The model reportedly scored 93.9% on SWE-bench Verified and 77.8% on SWE-bench Pro, but its strongest signal came from real-world results, including uncovering a 27-year-old flaw in OpenBSD, a 16-year-old vulnerability in FFmpeg, and autonomously chaining Linux kernel exploits without human input.

      这些惊人的安全漏洞发现能力表明AI已经超越了传统安全工具,能够自主发现几十年未被发现的漏洞。特别是能够自主链接Linux内核漏洞的能力,展示了AI在网络安全领域的革命性潜力,这可能彻底改变安全研究和漏洞修复的方式。

    1. We evaluate 452 tasks from the public APEX-Agents dataset spanning investment banking, management consulting, and corporate law

      452 个任务跨越投资银行、管理咨询、公司法三个领域——这三个领域是全球「知识密集型工作」的代表,也是最难被 AI 替代的白领职业。APEX-Agents 选择这三个领域作为 benchmark,本身就是一个宣言:AI 已经准备好挑战那些曾经被认为「最安全」的专业工作。而最高分只有 33.3% 这个事实同样是一个宣言:这个挑战才刚刚开始。

    2. APEX-Agents requires agents to navigate realistic work environments with files and tools.

      「在真实文件和工具中导航」——这句话定义了 APEX-Agents 与大多数 benchmark 的本质区别。绝大多数 AI 评测是「问答」或「代码生成」,而 APEX-Agents 要求 Agent 打开 Excel 文件、查询数据库、写报告、然后把结论填入指定单元格——这才是投行分析师的真实工作日。任何在纯文本 benchmark 上得分很高的模型,都未必能在这个评测中胜任。

    3. Gemini 3 Flash achieves the highest score of 24.0%

      在原始论文中,Gemini 3 Flash 以 24.0% 的成绩位列第一——而 Artificial Analysis 的独立复测中,它的成绩是 27.7%,被 GPT-5.4 和 Claude Opus 超越。两个不同时间、不同方法论的测试得出了不同的排名。这揭示了 AI Agent 评测的根本脆弱性:同一个 benchmark,不同实施者得出不同结论。「谁第一」在 AI 评测中是一个随时间和方法论变化的流动答案。

    4. GPT-5.4 (xhigh) scores the highest on APEX-Agents-AA Pass@1 with a score of 33.3%, followed by Claude Opus 4.6 (Adaptive Reasoning, Max Effort) with a score of 33.0%, and Gemini 3.1 Pro Preview with a score of 32.0%

      令人震惊的数字:即便是全球最强的 AI Agent,在投行/咨询/律所的专业任务上也只有三分之一的成功率。更惊讶的是前三名几乎并列——GPT-5.4 的 33.3%、Claude Opus 4.6 的 33.0%、Gemini 3.1 Pro 的 32.0%——三家顶级实验室在专业服务 Agent 评测上的差距已缩小到统计噪声级别。「谁的 AI 更强」的问题,在这个维度上已经没有明确答案。

    1. we may see a growing divergence between the capabilities we can measure and the capabilities we actually care about.

      「可测量的能力」与「真正关心的能力」之间的分歧正在扩大——这是整篇文章最深刻的洞见。所有当前 benchmark 都偏向「干净、自包含、可自动评分」的任务,而真实工作是「混乱、跨系统、需人类判断」的。随着 AI 向长任务延伸,这个测量-现实之间的鸿沟不会缩小,只会加速扩大。这意味着未来关于「AI 能否替代某类工作」的争论,将越来越难以用数据解决——因为数据本身无法捕捉真实工作的本质。

    2. The most famous chart in AI might be obsolete soon.

      副标题本身就是一个令人震惊的声明:最著名的 AI 进展图表即将过时——不是因为 AI 停止进步,而恰恰是因为进步太快。这创造了一个奇异的悖论:评测工具的失效速度与被评测对象的进步速度正相关。我们对 AI 能力的理解,正在以比 AI 自身进步更慢的速度迭代——「评测滞后」将成为未来数年 AI 治理和决策的核心挑战。

    3. it's impossible to get a score much higher than 93% without cheating because around 6.5% of MMLU questions contain errors.

      MMLU 有 6.5% 的题目本身就是错的——这意味着任何模型的「真实上限」是 93.5%,而不是 100%。更令人惊讶的是:这个广泛使用了数年的权威 benchmark,其误差率直到最近才被系统研究和量化。这揭示了整个 AI 评测生态的一个深层问题:benchmark 的质量本身也需要 benchmark,而这一层元评估几乎从未被认真对待。

    4. If we took one task out of our task suite or added another task to our task suite, potentially instead of measuring this Claude Opus 4.6 time horizon of, I think, 14 and a half hours, we'd be measuring it at something like eight or 20 hours.

      增减一道题,测量结果从 8 小时变成 20 小时——这意味着整个 METR 时间地平线排行榜,本质上是由极少数「关键任务」撑起来的脆弱测量。当一个评测体系对单点数据如此敏感,它的「精确数字」就不应该被当作事实引用,而应该被当作噪声分布的一次采样。而目前,媒体和公众正是在拿这些数字做严肃决策。

    1. reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation.

      令人震惊的发现:同一道题,仅仅因为周围塞入了无关上下文,推理模型的思考链长度就缩短了最多 50%——而题目本身一字未改。这意味着我们以为在评估模型「解题能力」,实际上评估的是「在特定上下文包装下的解题能力」。所有在孤立问题上测得的推理 benchmark,都可能严重高估了模型在真实 Agent 场景中的实际推理深度。

    1. we found that AI agent performance drops substantially when scoring AI performance holistically rather than algorithmically.

      「整体评分 vs 算法评分」的性能差距是一个深刻的警示:AI 在「有明确正确答案」的任务上表现远好于「需要人类判断质量」的任务。这意味着所有基于自动化评分的 AI benchmark,都在系统性地高估 AI 在真实工作中的能力。时间地平线数字本身也受制于这个局限——任何「可被算法打分」的任务,都比真实工作「更适合 AI」。

    1. Because these benchmarks are human-authored, they can only test for risks we have already conceptualized and learned to measure.

      这句话揭示了当前 AI 安全评测体系的致命盲区:所有 benchmark 都是人类提前想好的问题,而真正危险的「未知的未知」(unknown unknowns)根本无法被预设题目捕捉。这意味着我们现有的模型安全认证,本质上是一场对已知风险的自我测试。

    1. the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller

      大多数人认为在复杂任务中,大型语言模型由于其参数量和训练数据的优势,总是能显著超越小型模型。然而,作者展示了他们的方法能让一个小型4B模型在Tau-Bench基准测试中超越GPT-4.1和GPT-4o,这挑战了AI社区对模型规模的普遍信仰。

    1. NVIDIA yields unmatched inference throughput across the broadest range of workloads, from massive LLMs to advanced vision language models, to generative recommender systems and more, on industry-standard benchmarks.

      大多数人认为AI领域存在多个竞争平台在不同领域各有所长,但作者声称NVIDIA在所有工作负载上都表现出色,这挑战了多元化竞争的行业共识,暗示了NVIDIA可能比普遍认为的更具统治力。

    2. NVIDIA was the first and only platform to submit DeepSeek-R1 results on MLPerf Inference when the benchmark debuted last year.

      大多数人认为AI基准测试会吸引多家竞争平台参与,但作者强调NVIDIA是唯一提交DeepSeek-R1结果的平台,这暗示了NVIDIA在AI基准测试中的垄断地位,与行业多元化竞争的普遍认知相悖。

  4. Jan 2026
    1. n July reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad, a prestigious mathematical competition held annually (bar 1980) since 1959. This was notable because the IMO poses challenges that are designed specifically for that competition. There’s no chance any of these were already in the training data! It’s also notable because neither of the models had access to tools—their solutions were generated purely from their internal knowledge and token-based reasoning capabilities.

      international math olympiad style questions can be answered by OpenAI and Gemini models without tools nor having the challenges in their training data.

  5. Jun 2024
    1. there is essentially this Benchmark 00:09:58 called the math benchmark a set of difficult mathematic problems from a high school math competitions and when the Benchmark was released in 2021 gpt3 only got 5%

      for - stats - AI - evolution - Math benchmark

      stats - AI - evolution - Math benchmark - 2021 - GPT3 scored 5% - 2022 - scored 50% - 2024 - Gemini 1.5 Pro scored 90%

  6. Oct 2023
    1. Beyond just audio recordings so for that reason two of our senior 00:15:02 researchers Benjamin Hoffman and Maddie cusumano have also developed a biologer benchmark data set and so a biologer is an animal born tag like the one in the image on the right here 00:15:14 and these produce very valuable data because they can inform us about animal ecophysiology and allow us to improve conservation by monitoring animal movements and behaviors with very high 00:15:27 resolution
      • for: BEBE, biologger Ethogram Benchmark
    2. beans and 00:13:54 this is a benchmark of animal sounds and it's a collection of audio recordings from more than 250 species and this large aggregate data set is a way to 00:14:07 test tools for classification and detection and these are outstanding problems in bioacoustics that we desperately need solutions to
      • for: BEANS, Benchmark of Animal Sounds
  7. Sep 2023
      • for: animal communication, AI - animal communication, bioacoustic

      • title: BEAN: The Benchmark of Animal Sounds

      • author

        • Masato Hagiwara
        • Benjamin Hoffman
        • Jen-Yu Liu
        • Maddie Cusimano
      • Abstract

        • The use of machine learning (ML) based techniques has become increasingly popular in the field of bioacoustics over the last years.
        • Fundamental requirements for the successful application of ML based techniques are curated, agreed upon, high-quality datasets and benchmark tasks to be learned on a given dataset.
        • However, the field of bioacoustics so far lacks such public benchmarks which cover multiple tasks and species to measure the performance of ML techniques in a controlled and standardized way and that allows for benchmarking newly proposed techniques to existing ones.
        • Here, we propose BEANS (the BEnchmark of ANimal Sounds), a collection of bioacoustics tasks and public datasets, specifically designed to measure the performance of machine learning algorithms in the field of bioacoustics.
        • The benchmark proposed here consists of two common tasks in bioacoustics:
          • classification and
          • detection.
        • It includes 12 datasets covering various species, including
          • birds,
          • land and marine mammals,
          • anurans, and insects.
        • In addition to the datasets, we also present the performance of a set of standard ML methods as the baseline for task performance.
        • The benchmark and baseline code is made publicly available at
        • in the hope of establishing a new standard dataset for ML-based bioacoustic research.
  8. Dec 2022
  9. Feb 2020
  10. Jan 2020
  11. Apr 2017