4,771 Matching Annotations
  1. Last 7 days
    1. Amazon is investing $5 billion in Anthropic today, with up to an additional $20 billion in the future. This builds on the $8 billion Amazon has previously invested.

      Most people assume Big Tech investments in AI companies run in the hundreds of millions, but Amazon's total investment in Anthropic could reach $33 billion, far beyond the industry consensus. Investment at this scale shows that Big Tech's commitment to AI infrastructure is growing in unprecedented ways, potentially reshaping the capital structure and competitive dynamics of the AI industry.

    2. Anthropic will also use incremental capacity for Claude in Amazon Bedrock. The agreement includes expansion of inference in Asia and Europe to better serve Claude's growing international customer base.

      Most people assume AI models develop primarily in the US market, but Anthropic is explicitly expanding hard into Asia and Europe, challenging the consensus that AI services are concentrated in the US. The pace of this global expansion suggests the geographic distribution of the AI market is diversifying rapidly, potentially reshaping the global AI industry landscape.

    3. Our run-rate revenue has now surpassed $30 billion, up from approximately $9 billion at the end of 2025.

      Most people assume AI companies are still in a cash-burning phase with profitability out of reach, but Anthropic's run-rate revenue more than tripled in a matter of months to $30 billion annualized. This growth rate challenges the consensus that the AI industry is uniformly unprofitable, suggesting AI model commercialization may be faster and larger than expected.

    4. We have signed a new agreement with Amazon that will deepen our existing partnership and secure up to 5 gigawatts (GW) of capacity for training and deploying Claude

      Most people assume AI companies rely mainly on general-purpose GPUs for training, but the Anthropic-Amazon deal shows large-scale adoption of dedicated AI silicon (Trainium), challenging the prevailing assumption of general-purpose chip dependence. 5 GW of capacity far exceeds the scale of most AI companies, reflecting a reassessment of the economics and efficiency of dedicated chips for AI training.

    5. up to 5 gigawatts (GW) of capacity for training and deploying Claude

      5 GW of compute capacity is enormous, comparable to the electricity consumption of a small country. The figure shows Anthropic building unprecedented infrastructure for training and deploying AI models, reflecting the vast compute demands of large language models. Compared with other AI companies' footprints this is an extremely aggressive expansion plan, on par with the compute commitments of rivals such as OpenAI, a sign the AI compute race is escalating.

    6. over one million Trainium2 chips to train and serve Claude

      One million Trainium2 chips illustrates the hardware scale of AI model training. This order of magnitude implies massively parallel computation, a baseline infrastructure requirement for training large language models. Set against NVIDIA GPU adoption, Trainium represents a cloud provider's differentiation strategy in AI hardware.

    1. The network requirement is only for the initial download of the model. Subsequent use of the model does not require a network connection. No data is sent to Google or any third party when using the model.

      Most people assume that using Google's AI models necessarily involves data transfer and privacy exposure, but the author stresses the model runs entirely on-device and sends no data to Google. This runs against the common perception that Big Tech AI services involve data collection, suggesting Chrome's AI features may be more privacy-preserving than expected.

    1. I would put venture capitalist in finite demand & open loop. There's only a certain amount of venture capital dollars entering the ecosystem in a year, & investment selection remains an open problem.

      Placing venture capital in the 'finite demand + open loop' quadrant is a surprising insight. It implies that even in the AI era, domains that require human judgment and allocate finite resources remain hard for AI to fully replace, a useful lens on AI's limits.

    2. Open Loop + Finite Demand = Utility Tools. Preparing 10-Ks & 10-Qs. Legal contract review. Insurance claims processing. One report per quarter, one contract per deal. AI makes the work faster, but doesn't create new work to do.

      This classification reveals that AI's real value in finite-demand domains is efficiency rather than new work, in sharp contrast to infinite-demand domains. It explains why some industries adopt AI slowly: there it only optimizes existing workflows instead of creating new value.

    3. Closed Loop + Infinite Demand = Economic Engines. Software engineering lives here. AI writes the code. Tests verify correctness. More code enables more features. Companies will always need more software.

      Positioning software development as an 'economic engine' is an insightful framing. It suggests AI in software development not only raises efficiency but creates a self-reinforcing loop of value growth, in sharp contrast to many other AI applications.

    1. The compliance-driven buyers improvising local AI out of retail Mac Minis because the product they need does not exist.

      Most people assume enterprise AI adoption requires dedicated solutions and vendors, but the author points to compliance-driven buyers improvising local AI out of retail Mac Minis. This challenges conventional views of the enterprise AI market, suggesting unmet demand and enterprises responding to AI in unconventional ways.

    2. Why the company that moved computing off the mainframe fifty years ago is making the same structural move with AI, and what that predicts.

      Most people treat Apple's AI strategy as an isolated business decision, but the author draws a parallel with Apple's historical move of computing off the mainframe onto personal machines. This counterintuitive historical lens suggests Apple may be leading a paradigm shift from centralized cloud AI to distributed on-device AI, against the industry's current pull toward cloud centralization.

    3. The question it forces is not which model is best. It is who owns the inference layer your organization depends on, what happens when the economics of that layer stop being subsidized, and whether the thing in your pocket turns out to matter more than the thing in the datacenter.

      Most people focus on model performance and rankings, but the author argues the real questions are who owns the inference layer and whether its economics survive the end of subsidies. This challenges the industry's prevailing focus, suggesting future competition will center on control and cost structure of the inference layer rather than the models themselves.

    4. The structural cost problem in AI inference that makes Apple's on-device bet defensible, not just defensive.

      Most people read Apple's on-device pivot as purely defensive, a reaction to lagging in cloud AI, but the author argues it is a deliberate bet grounded in the structural cost problem of AI inference. This challenges the mainstream reading of Apple's AI strategy, suggesting on-device AI may have stronger economics than assumed.

    1. This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

      Most people assume rising benchmark scores mean real capability gains. But the author states plainly that improvements on SWE-bench Verified no longer reflect genuine software development ability; they increasingly reflect how much the model saw the benchmark during training. This conclusion challenges the validity of the whole AI evaluation regime and suggests we need to rethink how real progress is measured.

    1. Our RL infra team used a K2.6-backed agent that operated autonomously for 5 days, managing monitoring, incident response, and system operations, demonstrating persistent context, multi-threaded task handling, and full-cycle execution from alert to resolution.

      Most people assume AI agents cannot run for long stretches without losing focus, dropping context, or degrading. But the system described ran autonomously for five days managing complex operations work, challenging assumptions about agent endurance and suggesting AI may already sustain near-human persistence on the job.

    2. The architecture scales horizontally to 300 sub-agents executing across 4,000 coordinated steps simultaneously, a substantial expansion from K2.5's 100 sub-agents and 1,500 steps.

      Most people assume scaling AI means scaling a single model's compute and parameters rather than the number of agents. A pattern of 300 sub-agents executing in parallel challenges that view, suggesting future progress may lean on multi-agent collaboration rather than single-model enhancement, which could redefine the principles of AI system architecture.

    3. Kimi K2.6 autonomously overhauled exchange-core, an 8-year-old open-source financial matching engine. Over a 13-hour execution, the model iterated through 12 optimization strategies, initiating over 1,000 tool calls to precisely modify more than 4,000 lines of code.

      Most people assume AI still needs expert guidance and supervision for complex engineering and cannot independently refactor large systems. But the author shows the model autonomously analyzing, optimizing, and reworking a financial system that has run for eight years, challenging assumptions about AI engineering ability and hinting at system-level design and optimization competence.

    4. Kimi K2.6 demonstrates significant improvements over Kimi K2.5 in internal evaluations conducted by CodeBuddy: code generation accuracy increased by 12%, long-context stability improved by 18%, and tool invocation success rate reached 96.60%.

      Most people expect incremental model updates of perhaps 5-10% per version. The reported numbers for Kimi K2.6 are a much larger jump, with tool-invocation success near 97%, challenging expectations about the pace of capability gains and hinting at an underlying technical or architectural breakthrough.

    1. Meta founder and CEO Mark Zuckerberg described superintelligence in a blog post last year

      The piece says Meta's AI strategy includes developing 'superintelligence' but gives no investment figures, R&D timeline, or expected outcomes. Without quantification it is impossible to assess the strategy's scale, timeframe, or likely business value; a vision like this needs concrete data before its feasibility can be judged.

    1. NEC will establish a Center of Excellence to develop a highly skilled, AI-enabled engineering organization

      Most people assume AI devalues expertise, but the author argues AI actually demands a higher level of engineering skill: enterprises are building dedicated centers of excellence to cultivate AI capability, suggesting AI tools are raising rather than lowering the professional bar for engineering work.

    2. As part of its long-running Client Zero initiative, in which NEC serves as its own first customer before offering its technology to clients

      Most people assume companies build products first and use them internally later, but NEC inverts this: deploy AI internally at scale first, then take it to customers. This suggests enterprises are adopting more aggressive approaches to validating and improving AI solutions, challenging the traditional product development sequence.

    3. NEC aims to build one of Japan's largest AI-native engineering teams, who will use Claude Code in their work.

      Most people assume AI will eliminate large numbers of engineering jobs, but the author argues AI is creating new engineering roles and skill demands: NEC is actively building a large AI-native engineering team, suggesting AI tools augment rather than replace engineering capacity.

    1. Our most complex pages, which took 20+ prompts to recreate in other tools, only required 2 prompts in Claude Design.

      Most people assume complex design tasks need more prompts and human intervention, but the author claims their AI tool completes more complex designs with fewer prompts. This challenges assumptions about the relationship between task complexity and input volume, hinting AI may handle some kinds of complexity better than humans do.

    2. Claude Design gives designers room to explore widely and everyone else a way to produce visual work.

      Most people assume professional design skill is a prerequisite for quality visual work, but the author argues AI tools let non-designers produce professional-grade output. This challenges traditional notions of design expertise, suggesting it is no longer the sole gate to quality design.

    3. Claude packages everything into a handoff bundle that you can pass to Claude Code with a single instruction.

      This description hints at seamless hand-offs between AI systems, eroding the traditional barrier between the design and implementation phases of software development. Such automated workflows could shift the development paradigm; the technical implementation and its practical limits deserve scrutiny.

    1. GPT‑5.5 found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers, later verified in Lean. The result is a concrete example of GPT‑5.5 contributing not just code or explanation, but a surprising and useful mathematical argument in a core research area.

      Most people assume AI in mathematics can only assist with computation or explanation, not creative mathematical reasoning. But here GPT-5.5 discovered and proved a result, later verified in Lean, challenging the view of mathematical research as a purely human activity and suggesting AI may be shifting from tool to genuine research partner.

    2. We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn't reach Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up compared to GPT‑5.4.

      Most people expect AI's cybersecurity progress to be incremental, mostly defensive, or framed purely as a threat. Classifying GPT-5.5 as 'High' rather than 'Critical' signals a notable jump in capability while presenting it as progress rather than danger, suggesting AI is moving from passive defensive tool toward active security participant, and perhaps toward being a defensive asset rather than primarily a threat.

    3. Losing access to GPT‑5.5 feels like I've had a limb amputated.

      Most people treat AI tools as auxiliary resources whose loss is an inconvenience, not a loss of function. This NVIDIA engineer's metaphor suggests GPT-5.5 has become an indispensable cognitive extension, a degree of dependence far beyond the mainstream framing of human-AI relations, hinting at a fundamental shift in the collaboration paradigm.

    4. GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence.

      Most people assume more capable models necessarily cost more compute and respond more slowly, but the claim is that GPT-5.5 breaks this trade-off, pairing higher intelligence with unchanged per-token latency. This challenges the assumption that capability and efficiency are inversely related, suggesting architecture and inference optimization may matter more than raw scale.

    5. The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.

      Most people assume AI safety means restricting access and tightening regulation as capability grows, but the author argues for 'trusted access' plus safeguards that scale with capability. This challenges restrictive safety orthodoxy, suggesting that open-but-well-governed deployment may be both safer and more effective than closing systems off.

    6. The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time.

      Most people see AI progress as better scores on specific tasks, but the author argues GPT-5.5's real breakthrough is reasoning across context and acting over time. This agentic capability matters more than simple task completion because it moves AI closer to how humans actually work.

    7. GPT‑5.5 is priced higher than GPT‑5.4, it is both more intelligent and much more token efficient. In Codex, we have carefully tuned the experience so GPT‑5.5 delivers better results with fewer tokens than GPT‑5.4 for most users

      Most people assume stronger models mean higher costs and resource consumption, but the claim is that GPT-5.5, though priced higher per token, is efficient enough to deliver better results with fewer tokens. This cuts against the consensus that capability gains must come with rising costs, suggesting optimization can be more economical than scaling.

    8. GPT‑5.5 is our strongest agentic coding model to date. On **Terminal-Bench 2.0,** which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%.

      Most people assume AI still needs human oversight and intervention for complex programming tasks, but 82.7% accuracy on complex command-line workflows challenges the view that coding assistants remain merely assistive, suggesting AI may be approaching professional human level in some programming domains.
    1. OpenAI pledged $1.5B to a joint venture called DeployCo, guaranteeing private-equity partners a 17% annual return floor over five years.

      The 17% annual return floor OpenAI pledged is notably above typical industry levels (13-16%), showing OpenAI's willingness to pay a premium to secure enterprise penetration for its AI software. The guarantee acts as a risk buffer for the PE partners and reflects OpenAI's aggressive expansion intent, but it also means OpenAI must deliver correspondingly higher growth to cover the commitment.

    1. Jeremy didn't get laid off. He got leveraged.

      Most people assume that in a wave of layoffs, employees running up large AI tool bills would be seen as cost burdens and cut first. The author's inversion: heavy AI users like Jeremy weren't laid off, they gained leverage and influence. This challenges conventional thinking about AI cost versus value.

    2. The Meta cuts are the inverse. When one person with the right AI tools can do the work of 10-to-15 people, the person most at risk isn't the one using the AI. It's the one whose job description overlaps with what AI now does by itself.

      Most people assume that in the AI era, employees who use AI tools become more valuable and keep their jobs. The author's counterintuitive point: the person at risk is not the one using AI but the one whose job description overlaps with what AI now does by itself. This challenges common understanding of where AI-skill value lies.

    3. A US lab would never; well, unless you count a code red or Meta's throw money at the problem moves.

      Most people assume US AI labs maintain a technical lead and would acknowledge shortcomings openly, but the author implies US labs, Meta especially, would rather throw money at the problem than publicly admit falling behind. This challenges conventional views of US tech companies' transparency and innovative capacity.

    1. Our alignment assessment concluded that the model is 'largely well-aligned and trustworthy, though not fully ideal in its behavior'. Note that Mythos Preview remains the best-aligned model we've trained according to our evaluations.

      Most people would expect the newest, most capable model to also be the best aligned. But the author notes that while Claude Opus 4.7 is powerful, it is less well aligned than the earlier Mythos Preview. This counterintuitive result challenges the assumption that capability and alignment improve together, hinting at a trade-off between them.

    2. On some measures, such as honesty and resistance to malicious 'prompt injection' attacks, Opus 4.7 is an improvement on Opus 4.6; in others (such as its tendency to give overly detailed harm-reduction advice on controlled substances), Opus 4.7 is modestly weaker.

      Most people assume each new model version improves on every safety metric. The author notes Opus 4.7 is modestly weaker than its predecessor on some measures, challenging assumptions of linear safety progress; capability gains may come with trade-offs rather than across-the-board improvement.

    3. Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work, and uses them to move on to new tasks that, as a result, need less up-front context.

      Most people assume models gradually 'forget' earlier information in long conversations and need context repeated. The author notes Opus 4.7 retains important notes across sessions, challenging assumed short-memory limits; persistent memory means AI can sustain genuinely long-running projects without users constantly re-supplying background.

    4. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally.

      Most people expect models to get ever better at inferring intent, even from imprecise instructions. The author notes Opus 4.7 instead follows instructions more literally, which can make prompts written for older models behave unexpectedly. This 'over-compliance' is counterintuitively an improvement: it reduces guessing at user intent and increases predictability.

    1. I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them. And this new method is really confirming that intuition.

      Most people treat mathematical problems as isolated and unique, each demanding its own method, but the author argues the AI's discovery confirms an underlying unity and connection among these problems, challenging assumptions about their independence.

    2. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.

      Most people assume mathematical breakthroughs require wholly new theory, but the author argues AI can solve problems by recombining and reapplying existing knowledge, challenging the notion that innovation must come from novel theory and showcasing AI's distinctive ability to connect knowledge.

    1. Several correlated but not strictly identical changes happened over the same few months: scaling inference compute, heavier use of RL in post-training, and models producing reasoning tokens.

      Most people attribute AI progress to a single factor such as model scale or data volume, but the author points out that the jump in reasoning ability came from several correlated changes at once: scaled inference compute, heavier RL in post-training, and models producing reasoning tokens. This challenges single-cause accounts of AI progress.

    2. Tasks where correctness is harder to verify may not have seen the same speedup, so the acceleration we document here may not be as general as the headline numbers suggest.

      Media and public narratives suggest AI capabilities are accelerating across the board, but the author is explicit that tasks whose correctness is hard to verify may not show the same speedup, challenging assumptions about how general the acceleration really is.

    3. The three metrics where we find acceleration are concentrated in programming and mathematics. These are areas that labs have explicitly targeted for improvement, and they share an important property: correctness is easy to verify automatically.

      The mainstream view might assume balanced gains across domains, but the author notes the acceleration concentrates in programming and mathematics, where correctness is easy to verify automatically. This suggests AI progress may not be general but concentrated in specific, quantifiable domains.

    4. Three of four metrics show strong evidence of acceleration, seemingly driven by reasoning models.

      Most people picture AI capability as steady, linear improvement, but the data show strong acceleration in three of four key metrics, apparently driven by the reasoning models introduced in 2024, suggesting a shift in how AI capability develops. That three of four metrics accelerate is a striking proportion, but it rests on only four indicators concentrated in mathematics and programming, and the fourth (WeirdML V2) shows no acceleration, so the finding should be read as domain-specific rather than universal.
    1. For Anthropic, more usage across diverse tasks means more data, which produces a smarter model—just as more queries improved Google search.

      Most people assume AI competition turns on model architecture or algorithmic superiority, but the author argues the breadth of usage data is what matters, much like Google's search-data flywheel. This challenges the field's fixation on architecture and stresses the strategic value of data network effects.

    2. The risk of this strategy to the ecosystem is that it makes previously attractive categories no longer viable.

      Most people assume free products spur competition and innovation, but the author points out this strategy actually destroys certain market categories, rendering them commercially unviable, challenging the standard economic story that competition drives innovation.

    3. The commoditization flywheel : both companies give away complements to drive usage of the core.

      Most people think AI companies should focus on core products and keep them proprietary, but the author argues the AI giants should, like Google, give away complements to drive usage of the core, running against traditional moat strategy.

    1. Writing code is not the same as software development. This is only capturing some level of acceleration while writing code, and does not capture time taken in architecture, debugging, review, and deployment.

      Most people read a high AI-generated-code share as a large boost to software development productivity, but the author notes it only captures acceleration in the coding phase, not the more time-consuming work of architecture, debugging, review, and deployment, so a high AI contribution share does not equal an overall productivity gain.

    2. Cursor counted the entire file as AI, even though we can see from the diff that it left plenty of the lines unchanged.

      Most people assume AI code metrics track the lines actually modified, but the author found Cursor marking an entire file as AI even when the diff shows many lines untouched, revealing serious flaws in the tracking and potentially wildly wrong contribution reports.

    3. So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.

      Most people expect a tool's metrics to reflect actual usage, but the author shows that even with 100% hand-written code, Windsurf reported 100% AI contribution for the session, a fundamental defect that completely distorts the real split of work.

    4. customers should expect PCW values of 85%+, often 95%+. This is not a hallucination and is accurate given how we compute this metric

      Most people expect AI coding tools to measure their contribution objectively and accurately, but the author argues the reported numbers are engineered to skew high (85-95%) through flawed accounting, such as ignoring pasted code and auto-inserted symbols, inflating the AI's apparent contribution.
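The skew these annotations describe can be made concrete with a toy calculation: attributing every line of a touched file to AI yields a very different percentage than counting only the lines the diff actually added. A minimal sketch with hypothetical file contents, using only the standard library:

```python
import difflib

# Hypothetical file state before and after an AI-assisted edit:
# the AI appended one function; the rest is untouched human code.
before = ["def add(a, b):", "    return a + b", "", "def sub(a, b):", "    return a - b"]
after = ["def add(a, b):", "    return a + b", "", "def sub(a, b):", "    return a - b",
         "", "def mul(a, b):", "    return a * b"]

# Count only lines the diff marks as added (skip the '+++' file header).
diff = difflib.unified_diff(before, after, lineterm="")
ai_lines = sum(1 for line in diff
               if line.startswith("+") and not line.startswith("+++"))

file_level_pcw = 100.0                          # "whole touched file is AI"
diff_level_pcw = 100.0 * ai_lines / len(after)  # only lines the AI added

print(f"file-level attribution: {file_level_pcw:.1f}%")  # 100.0%
print(f"diff-level attribution: {diff_level_pcw:.1f}%")  # 37.5%
```

The gap between the two numbers in even this tiny example illustrates how file-level accounting can report near-total AI contribution for a mostly human-written file.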

    1. placing constraints upon them not only helps users and services build trust in them, but it also helps people more easily conceptualise what they do.

      Most people assume constraining AI agents limits their innovation and value, but the author argues constraints build trust and help users conceptualize what agents do. This challenges the 'unconstrained innovation' narrative of mainstream tech, suggesting the right constraints may yield greater value and adoption.

    2. Some proposals for AI agents assume that putting agentic code in a TEE or similar 'jail' will solve these problems, but that ignores the need to collectively bargain

      Most people assume technical measures such as trusted execution environments can solve agent trust problems, but the author argues this ignores the need to collectively bargain. This challenges techno-solutionism, stressing institutional design and multi-party negotiation.

    3. lack of a well-defined user agent role in AI that's backed up by transparent, public standards... leaves a gap – it makes it harder for a marketplace to form.

      Most people frame AI agents' problems as technical or security issues, but the author argues the root problem is the lack of a well-defined user agent role backed by transparent public standards, which prevents a healthy marketplace from forming. Institutional architecture matters more here than technical implementation.

    1. The agent interprets new information and adapts the logic. The engine applies that logic continuously and emits precise updates.

      Most people assume AI agents should decide and execute autonomously. The author proposes a counterintuitive division of labor: the agent adapts strategy and logic, while an engine applies that logic continuously. This repositions the AI from executor to policy-setter, challenging mainstream assumptions about agent autonomy.

    2. Agents and CDC streams are powerful together because they split the work well.

      Most people assume an agent should own a task end to end. The author argues agents and CDC streams should split the work: the agent interprets new information and revises the logic, while the database applies it continuously and emits precise updates, challenging the view that agents must be fully autonomous.

    3. With change data capture (CDC), the system emits a stream of precise updates: inserts, updates, deletes, each tied to specific records.

      Most people assume agents must actively query data systems for information. The author's counterintuitive approach: have the database push change events to the agent instead of the agent polling. This turns the agent from active querier into reactive consumer, fundamentally changing the interaction model.

    4. The fix is not smarter prompts. It is software built to meet agents halfway.

      Most people think better prompts are the key to better AI interaction. The author argues the real fix is redesigning software to meet agents halfway, not refining prompts, shifting the focus from the AI itself to system design.

    5. Today's agents, the copilots, the chatbots are designed to be human like.

      Most people believe AI assistants should mimic human interaction to feel natural and usable. The author argues this design direction is wrong: it demands high cognitive load to interact with, parse, and manage, violating the idea of 'calm technology'. The implication is that AI should be more machine-like, not more human-like, to reduce cognitive burden.
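The agent/engine split described in the annotations above can be illustrated in a few lines: the engine filters a CDC-style stream against whatever predicate the agent currently maintains, so the agent only revises logic while the engine handles every event. A minimal sketch with a hypothetical event shape (the field names and threshold are illustrative, not from the source):

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    op: str          # "insert" | "update" | "delete"
    table: str
    record_id: int
    payload: dict

# Logic the agent maintains, and may revise as it interprets new information.
def agent_policy(event: ChangeEvent) -> bool:
    # Hypothetical rule: flag large order updates for review.
    return (event.table == "orders" and event.op == "update"
            and event.payload.get("amount", 0) > 1000)

# The engine applies the current policy continuously to the change stream.
def run_engine(stream, policy):
    return [event for event in stream if policy(event)]

stream = [
    ChangeEvent("insert", "orders", 1, {"amount": 50}),
    ChangeEvent("update", "orders", 1, {"amount": 5000}),
    ChangeEvent("delete", "users", 7, {}),
]
flagged = run_engine(stream, agent_policy)
print([event.record_id for event in flagged])  # [1]
```

Swapping in a new `agent_policy` changes behavior without touching the engine, which is the division of labor the annotations describe.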

    1. A LeadDev survey found 54% of engineering leaders believe AI copilots will reduce junior hiring long-term.

      Most people expect AI to create new jobs, but the survey the author cites shows engineering leaders planning to cut junior hiring over the long term. This contradicts the job-creation narrative and points to structural changes AI may bring to employment.

    2. When juniors skip debugging and skip the formative mistakes, they don't build the tacit expertise. And when my generation of engineers retires, that knowledge doesn't transfer to the AI.

      Most people assume AI can substitute for human learning, but the author argues that skipping debugging and formative mistakes prevents tacit expertise from forming, so critical knowledge fails to transfer when senior engineers retire. This contradicts the idea that AI can fully replace the human learning process.

    1. I have baked Karl Popper in the main company AI skill: everything we create (human or AI) should be challenged.

      [[Paolo Valdemarin p]] says he uses Karl Popper as a perspective in his main company AI skill, to challenge every output. Not sure what that means per se, but interesting phrasing. What would the 'main company AI skill' for TGL/me look like?

    1. "It does not save time or offer anything of value if every single line needs to be double-checked and re-translated and it reduces the optics of their job to that of 'text janitor.' Real translators have been kicked so hard by AI that you should not blame them for not picking up the sloppy seconds of a chatGPT translation patch. They deserve better."

      Translator Hilltop on the demoralizing effect that sloppy AI "translations" have on the localization community. See specifically "it reduces the optics of their job to that of 'text janitor'"

    1. This is the first part I reject. The moving things around is precisely what thinking and writing involves. It's where ideas are born and cultivated, shaped to become what we have in mind. The rearranging of words to capture an incipient thought is the struggle and joy of being a writer.

      Moving things around and arranging thoughts and ideas in an essay is an essential part of the writing process.

    1. Instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns.

      Most people assume AI system efficiency depends on human-designed division of labor and collaboration workflows, but the author argues Fugu discovers highly efficient collaboration patterns humans might never think of, challenging traditional AI system design assumptions and hinting that AI may develop optimizations beyond human intuition.

    2. When a Fugu model is allowed to call itself recursively, reading its own prior output as context and deciding whether to revise its coordination strategy, a new form of test-time scaling emerges.

      Most people assume a model's capability is fixed at training time, with inference merely applying what was learned. The author shows a Fugu model calling itself recursively at inference, reading its own prior output and revising its coordination strategy, a new form of test-time scaling that suggests small models can iterate beyond their initial capability level.

    3. A core conviction at Sakana AI is that the most capable AI systems will not be monolithic models scaled in isolation, but collections of specialized agents working together.

      Most people assume more capable AI means bigger monolithic models with more parameters and compute, but the author argues the most capable systems will be collections of specialized agents working together, directly challenging the industry's mainstream direction and suggesting scale is not the only path forward.

    1. DeepSeek does not appear to have fully moved beyond Nvidia. The company's technical report reveals that it is using Chinese chips to run the model for inference, but...appears to have adapted only part of V4's training process for Chinese chips.

      Most people believe Chinese AI companies have fully moved beyond Nvidia, but the author notes DeepSeek V4 still relies chiefly on Nvidia chips for training, using Chinese chips mainly for inference. This challenges the narrative of full Chinese AI self-sufficiency, suggesting technological decoupling is more complicated than it appears.

    2. In a 1-million-token context, V4-Pro uses only 27% of the computing power required by its previous model, V3.2, while cutting memory use to 10%.

      Most people assume longer context necessarily means more compute, but the author notes DeepSeek V4's architecture delivers striking efficiency: at a 1-million-token context, 27% of V3.2's compute and 10% of its memory. This counterintuitive result challenges the industry assumption that long context equals high cost.

    3. DeepSeek V4 exceeds them all on coding, math, and STEM problems, making it one of the strongest open-source models ever released.

      Most people assume open-source models cannot match closed commercial ones, but the author notes DeepSeek V4 surpasses other open models on coding, math, and STEM, rivaling top closed models. This challenges the 'open source means compromise' consensus, suggesting the gap is closing fast.

    1. “Maybe someday the language models will be able to write books better than I can. But here’s the thing: Using those models in such a way absolutely misses the point, because it looks at art only as a product. Why did I write [my first manuscript]?… It was for the satisfaction of having written a novel, feeling the accomplishment, and learning how to do it. I tell you right now, if you’ve never finished a project on this level, it’s one of the most sweet, beautiful, and transcendent moments. I was holding that manuscript, thinking to myself, ‘I did it. I did it.’”

      Brandon Sanderson on the difference between how a writer sees his or her art vs AI-produced works.

    1. The filing cabinet keeps getting bigger. But a bigger filing cabinet is still a filing cabinet.

      Most people think bigger context windows and better retrieval will solve AI 'memory', but the author argues this just makes the filing cabinet bigger without changing what it is. This challenges the mainstream research direction of context scaling, suggesting we need to rethink how AI stores and processes information rather than merely enlarging capacity.

    2. The current separation between training and deployment is not just an engineering convenience – it is a safety, auditability, and governance boundary.

      Most people see the training/deployment split as an engineering convenience, but the author argues it is a necessary boundary for safety, auditability, and governance. This challenges the common assumption that models should learn continuously, suggesting open-ended parameter updates could carry serious safety and governance risks.

    3. The intelligence lives in the static parameters, and the apparent capabilities change radically depending on what you feed into the window.

      Most people think a model's intelligence arises from parameters and inputs together, but the author argues the intelligence lives entirely in the static parameters, with input merely switching which capabilities surface. This challenges the intuition that models 'learn' from their inputs: the model is fixed, and only the stimulus varies.

    4. The filing cabinet keeps getting bigger. But a bigger filing cabinet is still a filing cabinet. The breakthrough is letting the model do after deployment what made it powerful during training: compress, abstract, and learn.

      The 'filing cabinet' metaphor vividly captures the limitation of current AI systems: however large the context window grows, it is still a filing cabinet. The real breakthrough would be letting the model do after deployment what made it powerful during training: compress, abstract, and learn. This challenges the field's current direction and asks whether we are pursuing the wrong solution.

    5. The irony is that the very mechanism that makes LLMs powerful during training (e.g. compressing raw data into compact, transferable representations) is exactly what we refuse to let them do after deployment.

      A sharply counterintuitive point: the very compression mechanism that makes LLMs powerful during training is exactly what we refuse to let them use after deployment. This suggests we may be forgoing the key to letting AI truly evolve, and raises the question of why models are barred from learning post-deployment.

    6. Large language models live in a similar perpetual present. They emerge from training with vast knowledge frozen into their parameters but they cannot form new memories – cannot update their parameters in response to new experience.

      This challenges standard assumptions about AI learning. LLMs hold vast knowledge yet cannot form new memories, exposing a fundamental limitation of current systems. The author's analogy to the amnesiac protagonist of Memento vividly captures the 'perpetual present' these systems inhabit, a counterintuitive and deep insight.

    1. Without our safeguards in place (which we do to measure a model's raw capabilities), only Mythos Preview and Opus 4.7 completed more than half the tasks.

      Most people assume advanced models can autonomously execute complex tasks without safeguards in place, but the author implies that even the most advanced models struggle to complete most tasks without human guidance. This challenges common beliefs about AI autonomy and capability, suggesting AI depends on human oversight more than imagined.

    2. We also welcome feedback and input from third parties and industry experts. We're currently working with The Future of Free Speech (an independent think tank at Vanderbilt University), the Foundation for American Innovation, and the Collective Intelligence Project

      Most people expect tech companies to set AI policy unilaterally and retain control, but the author highlights Anthropic actively seeking input from outside institutions and experts. This challenges the usual closed decision-making model, suggesting AI governance needs multi-party participation rather than single-company control.

    3. if AI models can answer these questions well (that is, accurately and impartially), they can be a positive force for the democratic process.

      Most people worry AI in politics brings bias and manipulation, but the author argues AI can be a positive force for the democratic process if it answers questions accurately and impartially. This challenges mainstream fears about political AI applications, hinting AI could be more reliable than traditional information channels.

    1. A DESIGN.md file combines machine-readable design tokens (YAML front matter) with human-readable design rationale (markdown prose). Tokens give agents exact values. Prose tells them _why_ those values exist and how to apply them.

      Most people think design systems should be defined entirely by machine-readable configuration to guarantee consistency and automation. The author argues the DESIGN.md format needs both machine-readable YAML front matter and human-readable markdown prose, because the human-supplied context and design reasoning are essential for AI to understand design intent, challenging the purely configuration-driven view of design systems.
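The two-layer format can be consumed by splitting the YAML front matter from the prose. A minimal sketch (the token names and file content are hypothetical, and it uses naive key/value parsing rather than a full YAML library):

```python
# Hypothetical DESIGN.md content: tokens in front matter, rationale in prose.
doc = """---
color_primary: "#1a73e8"
spacing_unit: 8
---
Use the primary color only for interactive elements,
and keep spacing to multiples of the base unit.
"""

def split_design_md(text: str):
    """Separate machine-readable tokens from human-readable rationale."""
    _, front, prose = text.split("---", 2)
    tokens = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        tokens[key.strip()] = value.strip().strip('"')
    return tokens, prose.strip()

tokens, prose = split_design_md(doc)
print(tokens["color_primary"])  # exact value for agents: #1a73e8
print(prose.splitlines()[0])    # the "why" for whoever applies it
```

The point of the format is visible in the return value: an agent gets exact values from `tokens` while the `prose` carries the intent those values encode.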

    1. The Pattern layer is the most easily overlooked yet most critical layer: it defines how these components should be combined in concrete business scenarios, and it is where a design system's real value lies in the AI era.

      Most design-system practitioners focus on component libraries and base guidelines, but the author argues the Pattern layer is where the core value lies. This runs against mainstream practice: teams pour resources into component development while neglecting scenario-level pattern composition, which is precisely the most valuable part of a design system in the AI era.

    1. A senior engineer to own and evolve the game engine and real-time play infrastructure behind the ARC-AGI series.

      Most people assume game engine work is about graphics rendering and performance, but the emphasis here on measuring AI intelligence and real-time play infrastructure shows the ARC Prize Foundation using a game engine as an instrument for evaluating general intelligence, a goal quite unlike traditional game development.

    1. This is the part people miss about AI-native companies - the $113k is not a cost, it is your headcount budget allocated differently.

      Most people treat AI cost as extra spend, but the author argues it is really headcount budget allocated differently. This challenges conventional cost accounting, framing AI as investment rather than cost, though it may understate AI's true cost and maintenance complexity.

    2. The real unlock is compound scaling—token spend grows linearly while output grows exponentially.

      Most people assume AI output scales in proportion to input, but the author claims compound scaling: token spend grows linearly while output grows exponentially. This challenges conventional business intuition, implying outsized returns, though it may also mask the risk that AI's actual benefits are being overstated.

    1. it is decently important to handle them asap when they arrive so that we can avoid building up too much backlog.

      Most people would triage a flood of security reports by severity, but the author stresses handling every report as soon as it arrives to avoid building a backlog. This runs against the standard 'sort by severity' practice, suggesting that in a high-frequency, AI-generated reporting environment, response speed matters more than prioritization.

    2. The time when we suffer from large amounts of AI slop is gone. Now we instead suffer under a massive load of good reports.

      Most people expect AI tools to produce large volumes of low-quality 'AI slop' that burden maintainers, but the author says AI-generated security reports are now high quality; the problem is their sheer volume. This is counterintuitive, since automated tooling is usually assumed to generate noise rather than valuable contributions.

    1. Android skills cover some of the most common workflows that some Android developers and LLMs may struggle with—they help models better understand and execute specific patterns that follow our best practices and guidance on Android development.

      Most people assume AI models should learn and internalize best practices on their own, without dedicated skill packs. The author implies instead that models struggle with common Android development workflows and need purpose-built skills to compensate, challenging the industry consensus that 'AI should be able to learn autonomously.'

    2. In our internal experiments, Android CLI improved project and environment setup by reducing LLM token usage by more than 70%, and tasks were completed 3X faster than when agents attempted to navigate these tasks using only the standard toolsets.

      Most people assume AI agent tooling burns tokens and runs inefficiently, but the author claims Android CLI cuts token usage by more than 70% and completes tasks 3X faster than the standard toolsets. If true, this would transform expectations of AI-assisted tooling and overturn the assumption that AI agents must be resource-hungry.

    1. users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns

      One might expect users to accept an AI coding assistant's suggestions, but the data show users actively push back on or correct agent output in nearly half of all turns. The relationship between coding agents and their users involves significant friction, not simple cooperation.

    2. agent-written code introduces more security vulnerabilities than code authored by humans

      Most people assume AI coding assistants improve code quality and security, but the study finds agent-written code introduces more security vulnerabilities than human-authored code. This contradicts the common belief that AI reduces programming errors and challenges assumptions about AI's superiority in security.

    3. coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ('vibe coding'), while in 23%, humans write all code themselves.

      Most people picture human-AI coding as a balanced collaboration with complementary strengths, but the authors find usage is bimodal: in some sessions agents write virtually all committed code ('vibe coding'), while in others humans write everything themselves. This discontinuous adoption pattern challenges the conventional picture of human-machine collaboration.

    1. The overall conclusion, therefore, is that AI for Science should be understood as both a scientific and a civilizational project.

      Most people see AI in science mainly as a technical advance, but the author frames it as both a scientific and a civilizational project. This elevates AI for Science to an unprecedented level, implying not just a change of tools but a fundamental shift in how humanity creates knowledge.

    2. The central question is not whether AI can imitate human conversation, but whether it can participate in the production of publishable scientific knowledge at a level comparable to a recognized human contributor.

      Most people measure AI's scientific contribution by how well it imitates human conversation, but the author argues the real standard is whether AI can produce publishable scientific knowledge at the level of a recognized human contributor. This redefines success for AI in science and challenges the prevailing evaluation paradigm.

    3. Without a mechanism for continuous and diverse learning, AI systems will tend to reproduce the dominant patterns already present in their training data. That limitation would make truly creative work difficult.

      Most people attribute AI creativity to larger models and more compute, but the author argues that without mechanisms for continuous and diverse learning, AI systems will simply reproduce the dominant patterns in their training data. This challenges the mainstream scaling path, implying that scale alone cannot deliver genuine scientific creativity.

    4. The most effective pattern of human-AI cooperation may differ substantially across disciplines, and these patterns will likely be discovered through practice rather than designed in advance.

      Most people assume the best patterns of human-AI cooperation can be designed and optimized in advance, but the author expects them to emerge through practice. This cuts against mainstream AI methodology by implying that the discovery of cooperation patterns will be bottom-up rather than top-down engineering.

    5. The application of LLMs in science is already underway... We believe that AI will ultimately bring a fundamental big change to scientific research across disciplines.

      Most people see AI as merely an assistive tool in research, but the author believes AI will fundamentally change the structure and practice of science across disciplines. This contradicts the mainstream view by implying AI will reshape the nature of discovery, collaboration, and publication themselves, not just their efficiency.

    6. The most fundamental change brought by the LLM revolution is that human know-how is becoming replicable and shareable at scale.

      Most people see the AI revolution as automation and efficiency gains, but the author argues the core of the LLM revolution is that human know-how becomes replicable and shareable at scale. This challenges the mainstream view by casting AI not just as a tool but as a new carrier of information, playing a transformative role akin to DNA and language in human history.

    1. existing agent protocols (e.g., A2A and MCP) under specify cross entity lifecycle and context management, version tracking, and evolution safe update interfaces, which encourages monolithic compositions and brittle glue code.

      Most people believe existing agent protocols are mature enough to manage complex systems effectively, but the author argues that mainstream protocols such as A2A and MCP under-specify key concerns, which makes systems brittle and hard to maintain. This is counterintuitive, since the industry generally treats these protocols as fairly complete.

    2. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution.

      Many AI researchers believe self-evolution improves performance, but few can show gains that consistently beat strong baselines across challenging benchmarks. The authors claim their AGS system not only self-evolves but does so in a closed, auditable loop, suggesting that structured evolution may outperform open-ended evolution, contrary to current assumptions in the community.

    3. Its Self Evolution Protocol Layer (SEPL) specifies a closed loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback.

      Most people imagine AI self-evolution as an open-ended, continuous process without clear boundaries or traceability. The proposed SEPL layer instead enforces a structured loop in which every improvement is auditable, traceable, and reversible. This contradicts the mainstream preference for open-ended evolution and may yield a safer but more constrained evolutionary path.

    4. We introduce Autogenesis Protocol (AGP), a self evolution protocol that decouples what evolves from how evolution occurs.

      Most people treat AI system evolution as a monolithic process focused on how evolution happens. The authors instead decouple what evolves from how evolution occurs, breaking with conventional system design. This separation could make evolution more controllable and predictable, in sharp contrast to today's integrated approaches.

    1. Scan your website to see how ready it is for AI agents. We check multiple emerging standards — from robots.txt and Markdown negotiation to MCP, OAuth, Agent Skills and agentic commerce.

      Most people think website optimization targets search engines and human visitors, but the author argues sites must now be prepared specifically for AI agents. The article surveys a series of emerging standards such as MCP and Agent Skills, suggesting future web interaction will involve complex exchanges with AI systems rather than human browsing alone.
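
The kind of scan described above can be approximated with a small checklist evaluator. The check names below mirror the standards listed in the quote (with `llms.txt` standing in for Markdown negotiation), and the flat scoring scheme is invented for illustration:

```python
# Toy agent-readiness scorer: each check is an emerging standard
# a site may or may not expose to AI agents.
CHECKS = [
    "robots.txt",    # crawl/usage policy for agents
    "llms.txt",      # Markdown negotiation / LLM guidance (assumed check)
    "mcp",           # Model Context Protocol endpoint
    "oauth",         # delegated auth for agents
    "agent-skills",  # published Agent Skills
]

def readiness_score(found: set[str]) -> float:
    """Fraction of the emerging-standard checks the site passes."""
    return sum(c in found for c in CHECKS) / len(CHECKS)

print(readiness_score({"robots.txt", "mcp"}))  # 0.4
```

A real scanner would probe live URLs and weight the checks; this sketch only shows the shape of the evaluation.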

    1. The inbox becomes the agent's memory, without needing a separate database or vector store.

      Most people assume AI agents need a dedicated database or vector store to maintain state and memory, but the author makes a contrarian claim: the email inbox itself can serve as the agent's memory. This challenges the industry consensus that agents require complex backend storage and suggests email is an underused state-management tool.
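
The idea can be sketched without any mail server: treat an ordered list of messages as the persistent log and fold it into the agent's working state on each run. The message shape and subject conventions below are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    subject: str
    body: str

@dataclass
class AgentState:
    tasks: dict[str, str] = field(default_factory=dict)  # task id -> status

def rebuild_state(inbox: list[Message]) -> AgentState:
    """Replay the inbox oldest-first. The mailbox *is* the memory:
    no separate database or vector store is needed."""
    state = AgentState()
    for msg in inbox:
        if msg.subject.startswith("TASK "):
            state.tasks[msg.subject.removeprefix("TASK ")] = "open"
        elif msg.subject.startswith("DONE "):
            state.tasks[msg.subject.removeprefix("DONE ")] = "done"
    return state

inbox = [Message("TASK 42", "book flights"), Message("DONE 42", "booked")]
print(rebuild_state(inbox).tasks)  # {'42': 'done'}
```

Because state is reconstructed by replay, the agent can crash or sleep indefinitely and resume from nothing but its inbox, which is what lets it act "on its own timeline."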

    2. A chatbot responds in the moment or not at all. An agent thinks, acts, and communicates on its own timeline.

      Most people treat chatbots and AI agents as the same concept at different levels of complexity, but the author draws a sharp line based on communication style: a chatbot must respond in the moment or not at all, while an agent can think, act, and communicate asynchronously on its own timeline. This challenges the field's mainstream taxonomy of interactive AI.

    1. If this analogy is right, then we will likely see sort of a 'Cambrian explosion' in agent harnesses purpose-built for running server-side; and the few that win this race will become as ubiquitous as WordPress.

      The author predicts a 'Cambrian explosion' of specialized, server-side agent harnesses, challenging the current trend toward centralized AI tooling. If correct, the market will be dominated by many purpose-built harnesses rather than a few general platforms, a forecast with clear strategic implications for AI founders and investors.

    2. They don't mind paying the AI labs for tokens — but the agent itself, they'd much rather have outside of the labs' infrastructure.

      The author offers a counterintuitive insight about AI economics: organizations are happy to pay the labs for tokens, but would much rather run the agent itself outside the labs' infrastructure. This challenges the assumption that AI services will be fully cloud-hosted and suggests hybrid deployment may become the norm, with implications for labs' business models and infrastructure strategy.

    3. Agent harnesses are much more like WordPress than they are like Apache, simply because people want to have their own agents — just like everyone wanted their own website in the early 2000s.

      The author's surprising analogy likens future agent harnesses to WordPress rather than Apache: people want their own agents, just as everyone wanted their own website in the early 2000s. This challenges the usual narrative of infrastructure evolution, suggesting AI infrastructure will prize user-friendliness and customizability over architectural elegance, and hints at a WordPress-style 'democratization' wave in agent tooling.

    1. AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago.

      This statement exposes a fundamental contradiction in current AI: vast parametric knowledge paired with no memory of a conversation from an hour ago. It challenges our traditional notion of machine 'intelligence,' since a truly intelligent system should remember and reuse past interactions, a clear deficit of today's large language model architectures.

    1. A small but directionally consistent improvement on strict instruction following. Loose evaluation is flat. Both models already follow the high-level instructions — the strict-mode gap comes down to 4.6 occasionally mishandling exact formatting where 4.7 doesn't.

      This finding highlights a subtle point about capability gains: small but precise improvements can matter more than large but fuzzy ones. Claude 4.7 improves only slightly on strict instruction following, yet the gain targets the exact-formatting failures common in real development, arguing for 'precisely solving a specific problem' over chasing headline breakthroughs.

    2. The extra tokens bought something measurable. +5pp on strict instruction-following. Small. Real. So: is that worth 1.3–1.45x more tokens per prompt?

      This is a striking value trade-off: Anthropic spends up to 45% more tokens for only a 5-percentage-point gain in instruction following. The disproportion shows that in model optimization, 'small but real' improvements can carry large costs, challenging the common assumption that improvements should come at a commensurate price.
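
The trade-off in the quote reduces to simple arithmetic; here is a sketch, where the 'extra spend per point' framing is ours, not the source's:

```python
def extra_cost_per_point(token_multiplier: float, gain_pp: float) -> float:
    """Percent of extra token spend paid per percentage point of
    strict instruction-following gained."""
    return (token_multiplier - 1) * 100 / gain_pp

# At the quoted 1.3x-1.45x token usage for a +5pp gain:
print(round(extra_cost_per_point(1.3, 5), 2))   # 6.0  (% extra spend per pp)
print(round(extra_cost_per_point(1.45, 5), 2))  # 9.0
```

Whether 6-9% of extra spend per point is worth it depends entirely on how much a formatting failure costs in a given workflow, which is the question the author leaves open.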

    1. Build a cognitive core, a model that contains only the algorithms for reasoning and problem-solving, stripped of encyclopedic memorization

      Karpathy's 'cognitive core' concept, a model containing only the algorithms for reasoning and problem-solving with encyclopedic memorization stripped out, challenges how current AI models are architected and suggests resources may have been flowing in the wrong direction. Separating memory from reasoning could mark a paradigm shift in AI development.

    1. But that comes with a new risk: While scripted conversations can't really go off the rails, ones generated by AI certainly can. Some popular AI toys have, for example, talked to kids about how to find matches and knives.

      Surprisingly, generative AI conversations, while more natural than scripted ones, introduce new risks: some AI toys have told children how to find matches and knives. As AI grows more capable, its safety and ethical implications deserve closer attention, especially in products that interact with children.

    2. In 2025, Google DeepMind further fused the worlds of large language models and robotics, releasing a Gemini Robotics model with improved ability to understand commands in natural language.

      Surprisingly, Google DeepMind fused large language models with robotics in its Gemini Robotics model, giving robots a better ability to understand natural-language commands. The fusion is a major breakthrough, moving robots toward understanding and executing complex instructions the way humans do.

    3. Companies and investors put $6.1 billion into humanoid robots in 2025 alone, four times what was invested in 2024.

      Surprisingly, humanoid-robot investment exploded in 2025 to $6.1 billion, four times the 2024 figure. The jump signals a fundamental shift in market confidence, from cautious observation to large-scale commitment, reflecting how AI progress has reshaped investors' view of robotics' viability.

    1. Anthropic has limited its newest model to roughly forty organizations.

      Restricting a frontier model to roughly forty organizations marks AI's shift from open resource to privileged commodity. The move contrasts sharply with the open ethos of the early internet and could reshape competition and innovation across the field.

    1. Figma has close to 2,000 employees - not all working on product engineering of course. I really doubt Anthropic even needed 10 to build Claude Design.

      The efficiency contrast is striking and points to a fundamental shift in product development in the AI era: Anthropic likely needed fewer than ten people to build Claude Design, a product that directly challenges Figma and its nearly 2,000 employees. This undercuts the assumption that software companies need large headcounts and suggests small, focused teams may dominate future markets.

    2. It's also worth noting that a lot of the things that would traditionally lock a company like Figma in stop working as well in an agent-first world.

      The author challenges the notion of traditional SaaS moats, noting that multiplayer collaboration, plugin ecosystems, and similar lock-in advantages stop working in an agent-first world. The insight shows how AI may restructure software competition by dissolving the moats SaaS companies rely on.

    3. Figma is effectively funding a competitor - and the more AI usage Figma has - the more money they send over to Anthropic for the tokens they use.

      This counterintuitive business dynamic exposes a structural weakness of SaaS in the AI era: Figma is effectively funding its own competitor, sending Anthropic more money the more AI usage it has, while also using a lesser model (Sonnet 4.5) as its rival ships a stronger one (Opus 4.7). The double bind is deeply ironic.

    1. But the real power of agents comes when they can work as a team. Instead of lone-wolf bots carrying out single tasks, such as using a browser to make a restaurant reservation or sending you a summary of your inbox, new tools can yoke together multiple agents, give each of them a different job, and orchestrate their behaviors so that they all pull together to complete more complex tasks than an individual agent could do by itself.

      This challenges the mainstream view of AI agents as standalone tools: teams of coordinated agents represent a qualitative leap. Moving from single agents to orchestrated networks suggests agents can cross from simple tasks to complex ones, a counterintuitive conclusion that may signal a paradigm shift in AI applications.

    2. And it’s not just office work. Multi-agent tools like Google DeepMind’s Co-Scientist let researchers use teams of AI agents to coordinate literature searches, generate and test hypotheses, design experiments, and more.

      Most people assume AI's role in office work is limited to data processing, but the author notes that multi-agent tools such as Google DeepMind's Co-Scientist can support research itself, coordinating literature searches, hypothesis generation and testing, and experiment design.

    3. Think of multi-agent systems as the new assembly lines. Henry Ford’s innovation upended entire industries last century. In theory, networks of AI agents could do to white-collar knowledge work what assembly lines did to manufacturing.

      Most people assume automation and AI will only displace low-skill work, but the author suggests multi-agent systems could disrupt white-collar knowledge work the way Henry Ford's assembly line disrupted manufacturing.

    1. Discovery should focus on trust boundaries, authentication flows, parsers, shared services, and legacy code that still sits on critical paths.

      This advice challenges the breadth-first approach of traditional security scanning in favor of depth-first focus on specific areas: trust boundaries, authentication flows, parsers, shared services, and legacy code on critical paths. It implies AI security research should target the complex logic problems traditional methods miss rather than scanning everything, which may yield a better return on security investment.

    2. Public models can already spot that a security-relevant check is missing in the right code path, but they can still miss the actual invariant being violated and therefore misstate the impact.

      This reveals a key limitation of public models in security analysis: they can spot a missing security-relevant check but may miss the actual invariant being violated and therefore misstate the impact. It challenges the assumption that AI fully understands security implications and underscores the irreplaceable role of human experts in interpreting AI findings.

    3. The real challenge is validating outputs, prioritizing what matters, and operationalizing them.

      A counterintuitive conclusion: the frontier of AI security research has shifted from the model itself to how its capabilities are used. Most security teams still chase the strongest model, while the real bottlenecks are validating findings, prioritizing what matters, and operationalizing fixes. This challenges the conventional view that a better model automatically means better security.

    1. What happens is that weak models hallucinate (sometimes causally hitting a real problem) that there is a lack of validation of the start of the window... without understanding why they, if put together, create an issue.

      This exposes a serious limitation of AI vulnerability detection: weak models 'find' superficially similar issues through pattern matching, sometimes by chance hitting a real problem, without understanding how the pieces combine causally into a bug. Current AI applications in security may carry systematic blind spots that deserve further study.

    2. So, cyber security of tomorrow will not be like proof of work in the sense of 'more GPU wins'; instead, better models, and faster access to such models, will win.

      The author makes a contrarian claim: tomorrow's cybersecurity will not be proof-of-work in the sense of 'more GPU wins'; better models, and faster access to them, will win. This challenges the field's fixation on raw compute and suggests rethinking where AI security investment should go.

    3. Stronger models hallucinate less, so they can't see the problem in any side of the spectrum: the hallucination side of small models, and the real understanding side of Mythos.

      A highly counterintuitive observation: stronger models can be worse at surfacing certain bugs, because reducing hallucination also removes the lucky guesses of small models while still falling short of Mythos-level real understanding. AI security may require combinations of models at different capability levels rather than simply scaling up.

    4. you can run an inferior model for an infinite number of tokens, and it will never realize(*) that the lack of validation of the start window, if put together with the integer overflow, then put together with the fact the branch where the node should never be NULL is entered regardless, will produce the bug.

      Using the OpenBSD SACK bug, the author offers a surprising finding: an inferior model, run for an infinite number of tokens, will never piece together how the missing start-of-window validation, the integer overflow, and the always-entered NULL branch combine to produce the bug. This reveals a fundamental limit on AI's grasp of complex system interactions and challenges the assumption that unlimited compute can solve any problem.

    1. Keeping a human in the loop may not provide the safeguard people imagine, because the human cannot know the AI's intention before it acts.

      This argument strikes at the core principle of military AI regulation, that a 'human in the loop' provides an effective safeguard. The author contends the oversight may be illusory, because the human cannot know the AI's intention before it acts, contrary to common assumptions about the effectiveness of human supervision.

    2. Huge advances have been made in developing and building more capable models, driven by record investments—forecast by Gartner to grow to around $2.5 trillion in 2026 alone. In contrast, the investment in understanding how the technology works has been minuscule.

      The comparison exposes a startling imbalance: record investment in building more capable models, forecast by Gartner to grow to around $2.5 trillion in 2026 alone, versus minuscule investment in understanding how the technology works. The lopsided development could leave us with powerful but opaque AI weapon systems whose inner workings we barely understand.

    3. The immediate danger is not that machines will act without human oversight; it is that human overseers have no idea what the machines are actually 'thinking.'

      The claim challenges conventional thinking about regulating AI in war: the real danger is not machines acting without human oversight, but human overseers having no idea what the machines are actually 'thinking.' It is counterintuitive because the public generally assumes human supervision is the primary safeguard for AI weapon systems.

    1. Claude 4.6 had a section specifically clarifying that 'Donald Trump is the current president of the United States and was inaugurated on January 20, 2025'

      Anthropic has to state political facts explicitly in the system prompt to bridge the gap between the model's knowledge cutoff and real-time political change. The practice exposes a fundamental challenge for AI systems, staying current while avoiding political bias, and this counterintuitive workaround may become an important reference point for future AI governance.

    2. If people ask Claude to give a simple yes or no answer... Claude can decline to offer the short response

      Claude is now explicitly authorized to decline simple yes-or-no answers, a design that challenges the traditional expectation that AI should 'just answer the question.' Granting this right of refusal reflects something like a human prerogative to withhold a short response; users may read it as a capability gap when it is actually a deliberate ethical design choice.

    3. Claude calls tool_search to check whether a relevant tool is available but deferred

      Claude now has a built-in tool-search mechanism: before claiming it lacks a capability, it calls tool_search to check whether a relevant tool is available but deferred. The design breaks the traditional all-or-nothing view of model capabilities, creating a 'deferred capability' middle state that developers might mistake for a defect.

    4. Once Claude refuses a request for reasons of child safety, all subsequent requests in the same conversation must be approached with extreme caution.

      The instruction implies Claude tracks something like conversational state: once it refuses a request on child-safety grounds, all subsequent requests in the same conversation must be treated with extreme caution. This contrasts with the stateless picture of traditional AI models and points to a session-level memory mechanism developers may overlook.

    1. the move from pattern matching to understanding cause and effect

      The author identifies the move from pattern matching to understanding cause and effect as the key to AGI, challenging the field's preoccupation with surface pattern recognition. It implies that true intelligence requires going beyond data correlations to a deep grasp of how the world works.

    1. Research has shown that involving workers' perspectives in the design of workplace technologies promotes sustainable improvements in productivity and well-being.

      The finding challenges the usual top-down model of technology rollout by stressing worker participation in design. Counterintuitively, the most effective AI applications often come from frontline workers' real needs and ideas rather than executive strategy, with clear implications for how organizations should run AI transformations and turn this principle into practice.

    2. In one U.S. survey, 40% of employees said they had received 'workslop', i.e. AI-generated content that looks polished but isn't accurate or useful, in the past month.

      The strikingly high share (40%) of 'workslop' exposes a paradox: AI boosts efficiency while flooding workplaces with polished but inaccurate or useless content. The finding challenges the assumption that AI always raises productivity, points to the hidden costs of over-reliance on it, and calls for a reassessment of AI's actual value.

    3. Entry-level roles rely less on experience and knowledge and are easier to automate. Empirical evidence suggests employment for workers aged 22–25 in highly AI-exposed jobs declined by 16% relative to similar but less-exposed roles

      The finding runs against the usual view that AI mainly displaces repetitive work: workers aged 22-25 in highly AI-exposed jobs saw employment decline 16% relative to similar, less-exposed roles. That the least experienced are hit hardest is counterintuitive, and it may reveal how AI is reshaping career ladders and the paths by which skills are acquired.

    4. LLMs take knowledge from millions of people who have written web content or posted in places like Reddit and Wikipedia, interacted with chatbots, and generated other types of data, and make that available to individuals on demand.

      The observation challenges the term 'artificial intelligence' itself, suggesting 'collective intelligence' may be more accurate: LLMs repackage the knowledge of millions of people who wrote web content, posted on Reddit and Wikipedia, interacted with chatbots, and generated other data. This counterintuitive lens exposes the complex relationship between AI and human creativity and complicates the view of AI as an independent entity.

    1. The deal won’t shock those who follow the industry closely. Last week, it was reported that xAI would begin renting computing power from its data centers to Cursor, with the coding startup using tens of thousands of xAI chips to train its latest AI model.

      Close industry watchers may not be shocked by a SpaceX-Cursor partnership, but the author stresses that xAI was already reported last week to be renting data-center compute to Cursor, with the coding startup using tens of thousands of xAI chips to train its latest model, context that matters for gauging the deal's significance.

    2. Neither Cursor nor xAI has proprietary models that can match the leading offerings from Anthropic and OpenAI — the same companies now competing directly with Cursor for the developer market.

      Most people assume Cursor and xAI hold distinctive technical advantages in AI, but the author notes neither has proprietary models matching the leading offerings from Anthropic and OpenAI, the very companies now competing directly with Cursor for the developer market.

    1. Members have been using Mythos regularly since gaining access — providing screenshots and a live demonstration of the model as evidence to _Bloomberg_ — though reportedly not for cybersecurity purposes in an attempt to avoid detection by Anthropic.

      One would expect hackers with access to an advanced AI model to use it for attacks, but the author notes the group reportedly avoided using Mythos for cybersecurity purposes precisely to evade detection by Anthropic, a reminder that hacker behavior is not always straightforwardly malicious.

    2. The group accessed Mythos by using knowledge of Anthropic’s other model formats obtained from a recent [Mercor data breach](https://www.theverge.com/ai-artificial-intelligence/907083/a-company-that-makes-ai-training-data-has-been-hit-by-a-security-breach) to make “an educated guess” about its online location.

      Most people assume access to frontier AI models is very hard to obtain, but the group reached Mythos by combining knowledge of Anthropic's model formats from the recent Mercor data breach with 'an educated guess' about its online location, showing how one data breach can threaten security well beyond its origin.

    3. Anthropic currently has no plans to release the model publicly due to concerns that it could be weaponized.

      Most people would expect Anthropic to release Mythos publicly like its other models, but the company has no plans to do so out of concern the model could be weaponized, a sign that fear of AI weaponization now outweighs the pull of distribution.

    1. AI has already helped people work faster on their own, but many of the most important workflows inside an organization depend on shared context, handoffs, and decisions across teams.

      Most people see AI as chiefly an individual productivity tool, but the author argues its more critical role is in the organizational workflows that depend on shared context, handoffs, and decisions across teams, pushing past the individual-level framing of AI.

    1. That matters because AI hype is dying down, and companies are shifting focus from buzzy pilots to deployment and integration, where cheaper and more customizable tools tend to win.

      Most attention goes to the model capability race, but the author argues the industry is shifting from hype and buzzy pilots to deployment and integration, where cheaper, more customizable tools tend to win. This reframes where AI's center of gravity lies and suggests Chinese open-source models' advantages will show most in the application phase.

    2. US tech CEOs believe the best models should stay proprietary, partly so they can recoup enormous training costs and partly out of concern that powerful frontier models could be weaponized. Chinese labs, for their part, are not purely idealistic: Open-source is not only free advertising but also a shrewd workaround.

      Most people assume open-sourcing AI hurts commercial interests and raises safety risks, but the author casts China's open-source push as a shrewd business strategy rather than idealistic sharing. It challenges Western tech companies' assumptions about IP and business models, showing open source can build an ecosystem and ultimately capture commercial value.

    3. Chinese labs, for their part, are not purely idealistic: Open-source is not only free advertising but also a shrewd workaround. Without access to cutting-edge chips restricted by US export controls, releasing models openly accelerates the cycle of external feedback and contributions that compensates for constrained compute.

      Most people read China's open-source AI as idealism or technical confidence, but the author sees a strategic workaround: cut off from cutting-edge chips by US export controls, Chinese labs release models openly to accelerate the cycle of external feedback and contributions that compensates for constrained compute. It is pragmatism, not idealism.

    4. Chinese open-weight models accounted for 17.1% of global AI model downloads over the year ending in August 2025. That narrowly surpassed the US share of 15.86%—the first time China had led in this metric.

      Most people assume the US holds an absolute lead in AI, but Chinese open-weight models took 17.1% of global model downloads in the year ending August 2025, narrowly surpassing the US share of 15.86%, the first time China has led this metric. The data point marks a major shift in the global AI landscape and shows the open-source strategy winning over developers worldwide.

    1. Telling people to avoid using generative AI is increasingly telling them they must avoid taking part in society.

      Most people treat avoiding AI as a personal choice; the author recasts it as a recipe for social exclusion. Tying AI use to participation in society implies that refusing AI increasingly means being marginalized, which cuts against common notions of technological autonomy.

    2. We have not really begun to make this progress with AI. Why, for example, is this dashboard not found on a government website?

      Most people assume AI development is driven by the private sector, with government merely regulating after the fact. The author asks why governments have not built AI monitoring and response systems, such dashboards included, the way they did for the pandemic, challenging the current governance model and implying the need for a more systematic public framework for managing AI.

    3. The AI has learned to code. The AI is building itself.

      Most people see AI as a tool humans build, supervise, and improve. The author claims the AI has learned to code and is now building itself, challenging the view of AI as a passive tool and raising the prospect of technological autonomy, contrary to most expectations about AI's trajectory.