55 Matching Annotations
  1. Last 7 days
    1. The thing that impressed me the most about GPT-3 was this: I gave it a weird mix of matlab and python code with a few variables, a loop, some basic arithmetic. Nothing fancy, and I knew this kind of thing was probably in the training data, but for sure not with these exact numbers and variables.

      Most people assume large language models can only generate text or code snippets, but the author argues GPT-3 was actually able to execute a simple computation, even though those exact numbers and variables were surely not in the training data. This challenges the view of LLMs as mere pattern-matching tools and hints that they may have some degree of genuine computational ability.
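
      The kind of mixed snippet described might look like the following (a hypothetical reconstruction, not the author's actual prompt; the variable names and numbers are invented). A plain Python rendering shows what a correct "execution" of it should produce:

      ```python
      # Hypothetical reconstruction of the kind of prompt described:
      # a few variables, a loop, some basic arithmetic. A model that
      # only pattern-matched surface forms would struggle to return
      # the right final value for these exact numbers.
      a = 7
      b = 3
      total = 0
      for i in range(1, 6):   # i = 1..5
          total += a * i - b

      print(total)  # 7*(1+2+3+4+5) - 5*3 = 90
      ```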

    1. We plan to release new evaluations every 1–2 months.

      This cadence shows the CRUX project intends a regular evaluation cycle: a release every one to two months is frequent enough to capture rapid changes in AI capability without being so frequent that evaluation quality suffers. It is much faster than the update cycle of traditional AI benchmarks, reflecting how quickly the technology is iterating.

    2. GUI bottleneck (Gemini spent weeks unable to list a product due to misclicking)

      Most people assume advanced AI models handle graphical user interface (GUI) tasks about as well as a human, but the author shows the opposite: even an advanced model like Gemini was stuck on a basic task for weeks because of simple misclicks. This challenges our assumptions about AI's practical abilities and exposes serious limits in how it interacts with interfaces.

    1. By the end of the year, we expect AI to be able to do tasks roughly one day long with a 50% success rate. In comparison, I'd guess that this task would take several days for a person who is familiar with the paper and able to play around with the web interface.

      The author cites METR's time-horizon forecast: by the end of 2026, AI is expected to complete roughly day-long tasks at a 50% success rate. The data point gives forecasts of AI capability a quantitative basis, but it also shows the time gap between AI and humans on complex tasks, suggesting AI still has significant room to improve in some areas.

  2. May 2026
    1. Our early versions of agentic work were only asking Codex to implement the task. That approach proved too limiting. Codex is perfectly capable of creating multiple PRs as well as reading review feedback and addressing it.

      Most people think AI can only carry out simple, single-step tasks, but the author argues it can already handle complex, multi-step workflows, including opening multiple PRs and responding to code-review feedback. This challenges conventional assumptions about AI capability and shows it has advanced to understanding and executing complex software-engineering tasks.

  3. Apr 2026
    1. We subjected the model to our full suite of predeployment safety evaluations and our Preparedness Framework, including targeted red-teaming for advanced cybersecurity and biology capabilities

      Most people assume AI safety evaluation focuses on preventing directly harmful outputs, but OpenAI specifically highlights targeted red-teaming for advanced cybersecurity and biology capabilities. This hints that GPT-5.5 may have stronger biology-related capabilities than expected, cutting against the common view that language models mainly process textual information, and showing AI reaching into specialized scientific domains.

    1. Our most complex pages, which took 20+ prompts to recreate in other tools, only required 2 prompts in Claude Design.

      Most people assume complex design tasks require more prompts and more human intervention, but the author claims their AI tool completes more complex designs with fewer prompts. This challenges common assumptions about how an AI design tool's input volume scales with task complexity, and hints that AI may handle complexity better than humans in some respects.

    1. We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn't reach Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up compared to GPT‑5.4.

      Most people assume AI's role in cybersecurity is limited to assisting defense rather than taking part in core security tasks directly. But the author signals that GPT-5.5 already rates as 'High' in cybersecurity capability; that classification suggests AI is shifting from a passive defensive tool to an active security participant, challenging the assumption that humans dominate the field.

    2. The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time.

      Most people see AI progress as performance gains on specific tasks, but the author argues GPT-5.5's real breakthrough is reasoning across context and acting over time, which challenges conventional views of AI's development path. This rise in 'agentic' capability matters more than raw task completion because it moves AI closer to the way humans actually work.

    1. Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work, and uses them to move on to new tasks that, as a result, need less up-front context.

      Most people assume AI models gradually 'forget' early information in long conversations and need the context repeated constantly. But the author says Claude Opus 4.7 can carry important information across sessions, challenging assumptions about AI's short-term-memory limits. Persistent memory of this kind means AI can take on genuinely long-term projects without the user restating background over and over.

    2. Opus 4.7 handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back.

      This shows Claude Opus 4.7's marked progress in executing and autonomously verifying complex tasks, an important step from simple responses toward genuinely autonomous work; the self-verification mechanism greatly improves the reliability of its output.

    1. Tasks where correctness is harder to verify may not have seen the same speedup, so the acceleration we document here may not be as general as the headline numbers suggest.

      Mainstream media and the public may believe AI capability is accelerating across every domain, but the author is explicit that tasks whose correctness is hard to verify may not have seen the same speedup. This challenges the assumption that AI progress is uniform.

    2. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

      This is a key data point: three of the four capability metrics (75%) show acceleration, a proportion high enough to suggest the effect is not accidental. However, the figure rests on four specific metrics and may not represent every area of AI capability; more metrics are needed to confirm how general the conclusion is.

    3. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

      This data point again shows three of four metrics (75%) accelerating, a high proportion. However, the article also notes that the fourth metric (WeirdML V2) shows no acceleration, so the speedup may not extend to every capability area. The ratio should be read cautiously: it rests on only four metrics, concentrated mainly in math and programming.

    1. Without our safeguards in place (which we do to measure a model's raw capabilities), only Mythos Preview and Opus 4.7 completed more than half the tasks.

      Most people assume advanced AI models will autonomously execute complex tasks once safeguards are out of the way, but the author implies that even the most advanced models struggle to complete most of the tasks without human guidance: only Mythos Preview and Opus 4.7 cleared half of them. This challenges common beliefs about AI autonomy and capability, suggesting AI may depend on human oversight more than people imagine.

    1. The real challenge is validating outputs, prioritizing what matters, and operationalizing them.

      A counterintuitive conclusion: the frontier of AI security work has shifted from the model itself to making effective use of its capabilities. Most security teams still focus on getting the strongest model, when the real bottleneck is validating findings, prioritizing what matters, and turning results into actionable fixes. This challenges the traditional view that a better model means better security.

    1. Claude calls tool_search to check whether a relevant tool is available but deferred

      Claude now has a built-in 'tool search' mechanism: before claiming it lacks some capability, it actively checks whether a relevant tool is available but deferred. The design breaks the traditional all-or-nothing dichotomy of model capability, creating an intermediate state of deferred capability discovery, a counterintuitive behavior developers might mistake for a model defect.
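
      The deferred-discovery pattern can be sketched in a few lines (a generic illustration with invented tool names, not Anthropic's actual API): only the search function is exposed up front, and the other tool definitions stay deferred until a search surfaces them.

      ```python
      # Tools defined but deferred: their full definitions are not
      # loaded into context until a search matches them.
      DEFERRED_TOOLS = {
          "get_weather": "Look up current weather for a city",
          "send_email": "Send an email on the user's behalf",
      }

      def tool_search(query: str) -> list[str]:
          """Return names of deferred tools whose description matches."""
          q = query.lower()
          return [name for name, desc in DEFERRED_TOOLS.items()
                  if q in desc.lower()]

      # Before answering "I can't check the weather", search first:
      matches = tool_search("weather")
      print(matches)  # ['get_weather']
      ```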

    1. While model capabilities have improved dramatically for use cases like codegen and mathematical reasoning, they still lag behind on the data side (as evidenced through SQL benchmarks like Spider 2.0 and Bird Bench).

      This offers a surprising fact: although models have improved dramatically at code generation and mathematical reasoning, they still lag on the data side (as SQL benchmarks like Spider 2.0 and Bird Bench show). This challenges the assumption of across-the-board capability gains and suggests data reasoning may require special treatment.

    1. Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks.

      A striking finding: AI can now complete complex programming tasks that would normally take a human engineer weeks. This not only challenges assumptions about current AI capability but also hints at major change coming to software engineering; this level of autonomous programming goes far beyond what mainstream AI coding assistants deliver today.

    1. AI uncovered a 27-year-old vulnerability in the BSD kernel, one of the most widely used and security-focused open source projects, and generated working exploits in a matter of hours.

      A startling fact that shows AI's remarkable ability to find vulnerabilities. Even in a security-focused project reviewed for decades, AI discovered a flaw and generated working exploits within hours, suggesting traditional security review cannot keep pace with AI-driven threats and that entirely new defensive strategies are needed.

    1. One Agent can now: open X (Twitter), scroll the feed, extract tweets, return clean JSON. No plugins. No extensions. No orchestration.

      Surprisingly, a single AI agent can now complete a complex social-media data-extraction task on its own, with no plugins, extensions, or orchestration. This is a striking advance in autonomous operation that could transform data collection and automation workflows.

    1. Open-ended tasks like everyday chat and writing have, if anything, improved less noticeably

      Surprisingly, while we tend to assume AI is advancing fastest on creative, open-ended tasks, its progress has actually been sharper in domains with clear, verifiable rewards such as programming and math. This explains the large gap between how technical experts and ordinary users perceive AI capability.

    1. GLM-5.1 did not plateau after 50 or 100 submissions, but continued to find meaningful improvements over 600+ iterations with 6,000+ tool calls, ultimately reaching 21.5k QPS—roughly 6× the best result achieved in a single 50-turn session.

      Surprisingly, GLM-5.1 kept improving through more than 600 iterations of a vector-database optimization task, ultimately reaching roughly 6× the best single-session result and breaking the usual pattern of models plateauing quickly. Sustained optimization over this horizon is rare among AI models and marks a real breakthrough in long-running task handling.
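
      The long-horizon pattern described here, repeatedly proposing a change, measuring, and keeping only improvements, can be sketched with a generic harness (a toy objective stands in for QPS; nothing below is GLM-5.1's actual setup). With a fixed seed, a longer run can only match or beat a shorter one, which is the behavior the quote highlights:

      ```python
      import random

      def evaluate(candidate: float) -> float:
          """Stand-in benchmark: score peaks when candidate == 10.0."""
          return max(0.0, 100.0 - (candidate - 10.0) ** 2)

      def optimize(iterations: int, seed: int = 0) -> float:
          rng = random.Random(seed)
          best_x, best_score = 0.0, evaluate(0.0)
          for _ in range(iterations):
              # Propose a small perturbation of the current best.
              x = best_x + rng.uniform(-1.0, 1.0)
              score = evaluate(x)
              if score > best_score:  # keep only improvements
                  best_x, best_score = x, score
          return best_score

      # More iterations keep finding gains instead of plateauing early.
      assert optimize(600) >= optimize(50)
      ```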

    1. M2.7 demonstrates excellent performance in real-world software engineering, including end-to-end project delivery, log analysis for bug hunting, code security, and machine learning tasks.

      Surprisingly, MiniMax M2.7 handles not only routine coding but complex software-engineering work: end-to-end project delivery, log analysis for bug hunting, and code-security checks. This suggests AI can now cover the full development workflow, from coding to security auditing, breaking the assumption that it can only assist with programming.

    1. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

      Surprisingly, the same AI model can score above 90% on low-demand tests and below 15% on demanding ones, a gap reflecting differences in task requirements rather than a change in capability. This challenges assumptions about the stability of AI performance and shows how strongly task difficulty shapes measured results.

    1. For example, developers can give an agent a controlled workspace, explicit instructions, and the tools it needs to inspect evidence:

      Surprisingly, OpenAI's Agents SDK now lets developers give an agent a fully controlled workspace in which it can inspect files, run commands, and edit code. This means AI systems can interact with computers far more deeply, automating more complex tasks than most people imagine.

    1. Claude Mythos autonomously identified and exploited several significant vulnerabilities. Notably, it discovered a 27-year-old vulnerability in OpenBSD

      Surprisingly, Claude Mythos autonomously discovered and exploited an OpenBSD vulnerability that had existed for 27 years. This indicates AI capability in cybersecurity has reached an extraordinary level, finding flaws that human experts and security tooling missed for decades, and it raises deep questions about AI safety and control mechanisms.

    1. Some advanced Excel capabilities aren't supported yet, including Office Scripts, Power Query, and Pivot/Data Model, data validation, and the named ranges manager, slicers, timelines, external connection administration, advanced charting breadth, and macro/Visual Basic for Applications (VBA) automation.

      Surprisingly, although ChatGPT for Excel is pitched at complex spreadsheet work, it does not yet support many advanced Excel features, such as VBA macros and Power Query. The tool is currently better suited to basic and intermediate spreadsheet operations than to highly specialized Excel workflows.

    1. MegaTrain also enables 7B model training with 512k token context on a single GH200.

      Surprisingly, a single GH200 GPU is enough for MegaTrain to train a 7B model with a 512k-token context, far beyond the context lengths of today's mainstream models. Long-context capability on this scale could fundamentally change how large models handle long documents, codebases, or books.
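
      For a rough sense of why 512k-token training on one GPU stands out, consider just the attention keys and values for an assumed 7B-style configuration (32 layers, hidden size 4,096, fp16; these numbers are illustrative, not MegaTrain's actual architecture):

      ```python
      # Back-of-envelope K/V memory for one 512k-token sequence under
      # an assumed 7B-style config. Training additionally stores other
      # activations, gradients, and optimizer state, so the true
      # footprint is much larger than this lower bound.
      layers, hidden, bytes_per_val = 32, 4096, 2  # fp16 = 2 bytes
      tokens = 512_000

      kv_bytes = 2 * layers * tokens * hidden * bytes_per_val  # K and V
      print(f"{kv_bytes / 2**30:.0f} GiB")  # 250 GiB
      ```

      Even this lower bound runs to hundreds of GiB, a sizable fraction of a GH200's combined HBM and LPDDR memory, which is what makes fitting full 512k-context training on one module notable.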

    1. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks

      Most people believe today's best multimodal models approach or exceed human performance on complex tasks. The author's data show otherwise: even the best model performs far below expectations on complex real-world tasks, with accuracy falling from 56.3% overall to 23.0% on Level-3 tasks. This challenges the field's optimistic assessment of current capability and reveals the extreme difficulty of real-world multimodal agent tasks.

    1. The AI is actually very good at this, especially if you have a conversation with it beforehand. That's what Ask mode is for.

      The mainstream view is that AI tools mainly suit generating code or automating simple tasks, but the author finds AI excels at code review and architecture discussion, provided you have a thorough conversation with it first. This challenges conventional views of AI's role and suggests it can be an equal partner in architecture discussions, not just a code generator.

    1. We estimate that as of the end of 2025, Chinese companies collectively own just over 5% of the cumulative computing power of the leading AI chips sold in recent years

      Given China's rapid AI development and heavy government investment, most people would expect China to hold a much larger share of global AI compute, but the author estimates Chinese companies own only about 5% of the cumulative computing power of the leading AI chips sold in recent years. The figure is far below expectations and challenges common beliefs about China's AI strength.

    1. The edge models feature a 128K context window, while the larger models offer up to 256K

      Most people think AI models on edge and mobile devices are constrained, especially for long context. But the author claims Gemma 4 offers a 128K context window even on mobile devices, challenging the assumption that edge AI is severely limited.

  4. Jul 2024
  5. Jun 2024
    1. be able to quickly master any domain, write trillions of lines of code, and read every research paper in every scientific field ever written

      for - AI evolution - projections for capabilities by 2030

      AI evolution - projections for 2030:
      - AI will be able to do things we cannot even conceive of now, because its cognitive capabilities are orders of magnitude faster than our own
      - Write billions of lines of code
      - Absorb every scientific paper ever written and write new ones
      - Gain the equivalent of billions of human-years of experience

  6. Oct 2021
  7. Aug 2019
  8. May 2019
    1. The individual does not use this information and this processing to grapple directly with the sort of complex situation in which we seek to give him help. He uses his innate capabilities in a rather more indirect fashion, since the situation is generally too complex to yield directly to his motor actions, and always too complex to yield comprehensions and solutions from direct sensory inspection and use of basic cognitive capabilities.

      The mention here of "innate capabilities", and their importance in directing motor actions toward complex challenges, is noted. A question arises regarding "basic cognitive capabilities" and their role in the process.

  9. Sep 2018
  10. Mar 2018
  11. Sep 2017
  12. Aug 2017
  13. Jul 2017
  14. Mar 2017
  15. Aug 2015
    1. In order to avoid the confused deputy problem, a subject must be careful to maintain the association between each authority and its intended purpose. Using the key analogy, one could imagine immediately attaching a label to each key upon receiving it, where the label describes the purpose for which the key is to be used. In order to know the purpose for a key, the subject must understand the context in which the key is received; for example, labelling is not possible if keys magically appear on the key ring without the subject's knowledge.
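
      The labelling idea in this passage can be sketched directly (a toy illustration, not code from the paper): the capability is stored together with the purpose for which it was received, and the holder refuses to exercise it for any other purpose.

      ```python
      from dataclasses import dataclass

      @dataclass(frozen=True)
      class LabelledKey:
          """A capability bundled with the purpose it was received for."""
          key: object
          purpose: str

      def use(labelled: LabelledKey, purpose: str):
          # Refuse to exercise an authority for any purpose other than
          # the one recorded when the key arrived: the label on the key.
          if purpose != labelled.purpose:
              raise PermissionError(f"key labelled {labelled.purpose!r}, "
                                    f"asked to use it for {purpose!r}")
          return labelled.key

      billing_key = LabelledKey(key="db-handle", purpose="billing")
      use(billing_key, "billing")      # allowed
      # use(billing_key, "debugging")  # would raise PermissionError
      ```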
    2. We would argue that the "true" capability model is the object-capability model, because all known major capability systems take the object-based approach (for examples, see [1, 4, 9, 11, 16, 17, 19, 21]). In all of these systems, a capability is an object reference, not something that behaves like a key or ticket in the real world. Definitive books on capability-based systems [6, 16] also describe these systems from the object-capability perspective, and explicitly characterize them as "object-based".
    3. The only capability Bob holds to a lower level is a read capability, so the *-Property is enforced. The only capability Alice holds to a higher level is a write capability, so the Simple Security Property is enforced

      This paragraph would be clearer if the capabilities were written out fully:

      The only capability Bob holds to a lower level is a "read data" capability, so the *-Property is enforced. The only capability Alice holds to a higher level is a "write data" capability, so the Simple Security Property is enforced.

      Maybe. But it still seems confused. As though the properties are in the wrong sentences.

      Nonetheless, both properties are enforced.
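
      The two properties quoted in the excerpt reduce to simple level comparisons (a toy model with integer levels, higher = more sensitive): the Simple Security Property forbids reading up, and the *-Property forbids writing down. With Bob at the higher level holding only a read capability downward, and Alice at the lower level holding only a write capability upward, both checks pass, matching the paper's conclusion that both properties are enforced.

      ```python
      # Toy Bell-LaPadula checks: levels are integers, higher = more secret.
      def may_read(subject_level: int, object_level: int) -> bool:
          # Simple Security Property: no read up.
          return subject_level >= object_level

      def may_write(subject_level: int, object_level: int) -> bool:
          # *-Property: no write down.
          return subject_level <= object_level

      SECRET, PUBLIC = 2, 1
      bob, alice = SECRET, PUBLIC

      assert may_read(bob, PUBLIC)        # Bob reads down: allowed
      assert not may_write(bob, PUBLIC)   # Bob writing down would violate *
      assert may_write(alice, SECRET)     # Alice writes up: allowed
      assert not may_read(alice, SECRET)  # Alice reading up would violate SS
      ```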