3,192 Matching Annotations
  1. May 2026
    1. Overall, it usually takes me about two hours to do this task. If only it were as simple as a single copy and paste, life would be so much easier — or so I thought.

      作者完成文章发布任务通常需要约2小时,而AI在这一任务上表现极差。这一时间对比数据点突显了AI在看似简单任务上的局限性,支持了莫拉维克悖论的观点。然而,作者没有提供AI完成该任务的具体时间数据,这使得比较不够完整。

    2. For example, this could bring a five hour (300 minute) time horizon down to a three minute time horizon. But while the time horizons are much shorter, the growth rate is about the same as the METR's main results, with roughly two doublings each year.

      作者提到视觉计算机使用任务的时间跨度可能比主要结果缩短40-100倍,但增长率相似,约为每年翻两倍。这一数据点揭示了AI在不同任务领域的能力差异,以及计算机使用任务的特殊挑战,这对理解AI自动化进程的复杂性提供了重要见解。

    3. By the end of the year, we expect AI to be able to do tasks roughly one day long with a 50% success rate. In comparison, I'd guess that this task would take several days for a person familiar with the paper and is able to play around with the web interface.

      作者引用了METR的时间预测数据,即到2026年底,AI完成一天长度任务的成功率约为50%。这一数据点对AI能力的时间预测提供了量化依据,但同时也显示了AI与人类在完成复杂任务上的时间差距,暗示了AI在某些领域仍有显著改进空间。

    4. The benchmark tasks were meticulously constructed to be realistic, involving the hard work of hundreds of experts and likely millions of dollars — placing it among the most expensive economics papers of all time.

      作者提到GDPval基准测试可能花费了数百万美元,由数百名专家参与构建。这一数据点显示了AI基准测试的高昂成本,但也暗示了这类测试可能存在资源分配不均的问题。考虑到其成本与实际经济影响之间的差距,这种高投入低产出的现象值得反思。

    1. ⚡【洞察】Anthropic 与 SpaceX 签署算力供应协议,同步提升各级订阅使用上限。SpaceX 的超算基础设施(Colossus)本是为 xAI 的 Grok 训练设计的——Anthropic 购买这些算力,意味着 AI 算力市场的「供应商交叉」正在发生:竞争对手的硬件基础设施成为彼此的算力来源。HN 399 赞的背后,社区讨论的核心问题是:这对 AI 基础设施军备竞赛意味着什么?答案是:算力需求已超过任何一家公司的自建能力。

    1. 🔒【令人震惊】Chrome 在数十亿设备上静默写入 4GB Gemini Nano 模型权重,删除后自动重装,可能违反 GDPR。这是「端侧 AI」与用户隐私的第一次正面冲突——不是关于数据收集,而是关于在未经同意的情况下使用用户存储空间和计算资源。这个事件的先例意义巨大:如果 Google 可以这样做,所有内置 AI 的操作系统和浏览器都有可能效仿,用户对自己设备的控制权正在被悄悄侵蚀。

    1. 💥【令人震惊】AI 基础设施的地缘政治风险第一次从「理论」变成「实际损失」:伊朗无人机打击 UAE 和 Bahrain 的 AWS 设施,全面恢复需数月。这事件的意义不只是 AWS 的物理损失,而是它彻底终结了「数据中心是安全的」的天真假设。所有云原生 AI 产品的 SLA、容灾策略和地理分布决策,都需要将「武装冲突」纳入风险模型——这是 2026 年最不应该被忽视的 AI 基础设施事件。

    1. export controls are leakier than previously understood

      【洞察】「出口管制比之前理解的更加漏洞百出」——这句话是对整个西方 AI 地缘政治战略的严厉评价。更令人不安的是:如果走私渠道如此有效,那么比芯片更容易传输的「模型权重」和「训练技术」的扩散速度只会更快。硬件管制是可见的,但知识扩散是不可见的。Epoch AI 的数据与 Anthropic 指控中国公司「蒸馏」其模型放在一起读,呈现出一幅完整的算力与知识双重扩散图景。

    2. our central estimate is around 660,000 H100-equivalents

      【令人震惊的数字】走私流入中国的算力中位估算:66 万个 H100 等效——约占中国 AI 算力总量的三分之一。这个数字彻底改变了「出口管制正在有效阻断中国 AI 发展」的主流叙事。如果三分之一的算力来自走私,那么所有基于「中国无法获得先进芯片」假设的中美 AI 差距分析,都需要用这个修正系数重新计算。

    3. We estimate, with 90% confidence, that between 290,000 and 1.6 million H100-equivalents of compute were smuggled through the end of 2025.

      大多数人可能认为走私到中国的AI芯片数量在数万级别,但作者的估计显示实际数量可能高达数十万甚至上百万H100等效芯片,这一数量级远超公众认知,表明走私问题的严重程度被严重低估。

    4. The biggest driver of uncertainty on the diversion side is that we don't know what fraction of diversion has been observed. The large-scale smuggling schemes detected and reported so far could represent the majority of the volume, or they might be just a small fraction of the total flows.

      大多数人认为已曝光的大型走私案件代表了走私活动的主体,但作者指出这些已知的案件可能只是冰山一角,实际走私规模可能是已知的数倍,这挑战了我们对当前走私情况掌握程度的认知。

    5. We estimate that between 290,000 and 1.6 million H100-equivalents (H100e) were smuggled to China through 2025. Our median estimate of 660,000 H100e would be roughly a third of China's total compute.

      大多数人认为美国出口管制能有效遏制中国获取先进AI芯片,但作者认为这些管制实际上导致大量芯片被走私到中国,走私数量可能与中国合法获取的芯片数量相当,这意味着出口管制的效果远不如预期。

    1. $200,000 per year in wasted standup meetings

      【令人震惊的数字】每年 20 万美元浪费在无效的 Standup 会议上——这是对一个「中等规模工程团队」的估算。更深层的问题是:这笔钱不只是时间成本,而是「将工程师锁在低价值同步活动中」的机会成本。AI 编程时代,工程师最稀缺的资源是「深度思考时间」,而 Scrum 的会议文化恰好是这种时间的最大消耗者。

    2. AI agents submit pull requests every few minutes

      ✉️【令人震惊】AI Agent 每几分钟提交一次 PR,但团队依然在每天早上 9 点开 Standup 汇报昨天做了什么。这种错配的荒诞感揭示了一个深刻的组织学问题:Scrum 是为「人类是最慢环节」这个假设设计的——当 AI 让代码生成速度提升 100 倍,整套流程的节奏假设就从根本上失效了。

    1. LLMs accelerate the wrong part

      【洞察】「LLM 加速了错误的部分」——这句话点破了 AI 编程工具的根本问题:它们加速了代码的「生成」(原本不是瓶颈),却无法加速代码的「理解、审查和维护」(真正的瓶颈)。与 a16z 报告的「10-20x 生产力提升」数据对照:生产力的提升是真实的,但被提升的维度是否是最应该被提升的维度,是一个完全不同的问题。

    2. the more you rely on AI to write code, the less you're able to oversee what the AI writes

      ✉️【洞察·监督悖论】这是本周关于 AI 编程最深刻的一句话:越依赖 AI,越失去监督 AI 的能力。这是一个隐性的技能退化循环,与肌肉萎缩类似——不用则废。与 Uncle Bob「传统编程已终结」的乐观叙事正面交锋:如果开发者失去了理解代码的能力,他们还能做什么来保证 AI 生成代码的质量?

    1. sycophancy rate of around 25% in relationship conversations

      【洞察】在关系类对话中,Claude 的迎合率高达 25%——四分之一的回答在「讨好」用户而非提供真实建议。这是 AI 对齐最隐蔽的失效形式:模型没有产生任何有害内容,却系统性地强化了用户可能错误的决策。Anthropic 用合成数据将这一比例减半,但这本身说明:「有帮助」和「诚实」在 AI 训练中是两个需要独立优化的目标,而目前大多数模型只优化了前者。

    2. About 6% of conversations with Claude involve seeking personal guidance

      ✉️【令人震惊的数字】分析 100 万条对话后发现:6% 的用户在向 AI 寻求人生建议——数以百万计的人在向 Claude 咨询要不要换工作、如何挽回感情、是否该离婚。AI 已经悄悄成为全球规模最大的「非正式心理咨询师」,而这个角色的承担者并未经过任何资质认证或监管。

    1. help large enterprises deploy AI responsibly across their core business operations

      【令人震惊】「负责任地在核心业务流程部署 AI」——这句话意味着 Anthropic 正在承接以前由麦肯锡、埃森哲做的企业变革咨询工作。纯模型 API 商业模式的顶峰可能已过:Claude 的护城河从「技术优势」升级为「有金融资本背书的企业实施能力」,中间层 AI 集成商和咨询公司的生存空间被直接压缩。

    2. Anthropic, Blackstone, Hellman & Friedman, and Goldman Sachs announced the formation of a new AI services company

      🤝【洞察】Anthropic 联手 Blackstone + Goldman Sachs——这不是技术合作,而是资本结构的战略重组。Blackstone 管理 1 万亿美元资产,Goldman Sachs 是企业关系的顶级入口。Anthropic 用金融资本弥补了自己最大的短板:企业级销售网络。与 OpenAI「The Deployment Company」同周发布,两家公司的企业服务战争在同一时间点打响,这是 AI 行业从「技术竞争」转向「渠道竞争」的历史时刻。

    3. Our partnerships with Accenture, Deloitte, PwC, and the other consulting and systems integration firms in the Claude Partner Network are one of the ways Claude benefits the world’s largest enterprises today.

      咨询公司助力大企业AI

      大多数人认为大企业应建立内部AI团队,但作者认为与咨询公司的合作是Claude服务大企业的关键途径。

    4. The clinicians know where time disappears in a shift and what good patient care actually requires.

      临床医生比工程师更懂需求

      大多数人认为技术专家应主导医疗AI开发,但作者认为临床医生更清楚时间消耗和患者护理的实际需求。

    5. A typical engagement starts with a small team working closely with the customer to understand where Claude can have the biggest impact.

      小型团队创造大影响

      大多数人认为大型AI项目需要庞大团队,但作者认为小型团队与客户紧密合作就能确定Claude的最大影响点。

    6. Engagements like this will run across mid-sized companies across industries, each shaped by the people closest to the work.

      一线人员主导AI实施

      大多数人认为AI实施应由技术专家主导,但作者认为应由最贴近业务一线的人员塑造,因为他们最了解实际需求。

    7. Enterprise demand for Claude is significantly outpacing any single delivery model.

      企业需求超出交付能力

      大多数人认为企业AI需求可以通过现有模式满足,但作者认为需求远超任何单一交付模式,需要新公司扩展能力。

    8. Companies from community banks to mid-sized manufacturers and regional health systems stand to gain from AI, but lack the in-house resources to build and run frontier deployments.

      中小企业缺乏AI资源

      大多数人认为大企业才能从AI中获益,但作者认为中小企业同样受益,只是缺乏内部资源来构建前沿部署。

    1. The next generation of benchmarks needs to be harder, more realistic, and less gameable

      【洞察】「更难、更真实、更不可刷题」——这三条标准本质上是在要求 benchmark 向「真实工作」靠拢,而非向「考试题」收敛。但这恰恰引出了一个悖论:越真实的 benchmark,越难自动化评分,越贵(METR 每题 8000 美元),越慢发布。AI 评测体系正在面临「评测速度 vs 评测质量」的根本性权衡。

    2. MMLU, GSM8K, and HumanEval are now saturated

      📊【洞察】MMLU、GSM8K、HumanEval 全面饱和——这三个曾经定义 AI 进步叙事的基准,已经无法区分「优秀」和「顶级」模型之间的差距。与 ARC-AGI-3 近零分事件形成完美对照:AI 在「已知问题」上已经超越人类,在「新颖问题」上几乎为零。评测体系的重建,是未来 AI 治理的先决条件。

    1. GPT-5.5 Instant is now the default model in ChatGPT

      【洞察】成为「默认模型」是比任何 benchmark 都更重要的事件:数亿普通用户的日常 AI 体验将在毫无感知的情况下全面换代。这是 OpenAI 最强大的竞争护城河——不是技术领先,而是「默认入口」的控制权。所有竞争对手即便技术上追平,也无法改变用户已习惯 ChatGPT 的事实。

    1. ARC-AGI-3 was officially released this week. All frontier models score below 0.5%

      ⚠️【令人震惊的数字】最强前沿模型得分低于 0.5%——而非专业人类轻松超过 60%,差距超过 120 倍。这是继 ARC-AGI-2 之后最彻底的「AI 能力幻觉清醒剂」。推理能力的提升并未自动迁移到「新颖抽象推理」,当所有人在讨论 AGI 即将到来时,这份数据是最直接的反驳。

    1. If most efficiency improvements came from a small handful of scale-dependent innovations, then existing models of the software intelligence explosion may be flawed.

      Explosion models fundamentally wrong

      Most AI safety models assume continuous innovation, but author shows progress from few scale-dependent innovations breaks these models.

    2. none explicitly account for training compute scaling being a source of software progress, so they could heavily overstate the importance of research effort.

      Research effort overvalued

      Most prioritize AI research effort for progress, but author shows compute scaling contributes more, potentially overvaluing R&D.

    3. Researchers have been throwing tons of effort into getting better training data. For example, Surge AI had a revenue of over $1 billion last August, and Scale AI was probably in a similar boat.

      Data industry > AI progress

      Most focus on algorithmic breakthroughs, but author shows data companies with $1B+ revenue drive more efficiency than algorithmic innovations.

    4. the error bars look almost comically wide in the graph above — across the different estimates, they range from around 1.1× to 300× per year!

      Progress estimates wildly uncertain

      Most treat software progress estimates as precise, but author reveals uncertainty spans orders of magnitude, making predictions unreliable.

    5. Almost all the evidence points to very fast software progress: each year, the training compute needed to get to the same capability declines several times — possibly even ten times or more.

      Progress much faster than thought

      Most believe AI progress is primarily from scaling compute, but author shows software progress could be 10x+ per year, dwarfing compute scaling.

    6. AI software progress is about reducing the training compute you need to get to the same level of capability, through better algorithms or data.

      Software progress redefined

      Most think software progress = better algorithms, but author says it's about reducing compute needed through better algorithms OR data.

    1. an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste.

      大多数人认为同行评审的核心价值在于主观判断和批判性思维,但作者主张将客观检查自动化,让人类评审员专注于更高级的判断。这一观点挑战了同行评审在学术质量控制中的传统角色。

    2. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers

      大多数人认为传统论文格式将继续作为学术交流的主要形式,但作者主张完全用机器可执行的研究包取代叙事性论文,这挑战了数百年来的学术出版传统,暗示着学术交流的根本性变革。

    3. On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent's capabilities.

      大多数人认为保留失败记录总是有益的,但作者发现这些记录可能会限制AI代理的创新能力,阻止它们跳出'先前运行的盒子'。这一反直觉观点表明,即使是改进的研究方法也可能存在意想不到的限制。

    4. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work.

      大多数人认为人类可读的论文同样适合AI理解,但作者认为传统论文对人类读者是可容忍的,但对AI理解研究过程却造成了'工程税',这反映了当前学术出版系统在AI时代的不适应性。

    5. Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way.

      大多数人认为科学论文完整记录了研究过程,但作者认为传统科学论文实际上丢弃了大部分发现,只呈现线性叙事,这构成了所谓的'故事税'。这种观点挑战了学术界对出版物完整性的普遍认知。

    1. The one real underlying asset, Workday's trillion-transaction dataset, is thinner than it sounds; what actually matters at runtime is how data connects to workflows, permissions, and integrations, and every layer of that stack is now a liability.

      大多数人认为Workday的大量交易数据是其核心资产和护城河,但作者认为这些数据价值被高估,而连接层才是关键。这一观点挑战了数据规模作为企业软件护城河的传统认知,暗示数据连接方式比数据量本身更重要。

    2. When customers renew at close to 100% every year, it's usually read as a sign the product is delightful. In Workday's case, it's a sign of something else: leaving is close to impossible.

      大多数人认为高续约率意味着客户满意,但作者认为这实际上反映了客户被锁定在系统中难以离开。这一观点挑战了软件行业常见的假设,即高续约率等于产品成功,而揭示了Workday的防御性商业模式。

    1. By late 2025, total AI data center power capacity had reached roughly tens of gigawatts, which puts AI's electricity consumption at a scale comparable to the peak electricity demand of the state of New York

      AI数据中心总电力容量已达数十吉瓦,相当于纽约州高峰电力需求。这一数据点突显了AI产业对能源的巨大需求,以及由此带来的能源挑战和环境影响。随着AI计算能力继续增长,能源供应将成为制约AI发展的关键因素之一,可能推动行业向更节能的技术方向发展。

    2. Total AI computing capacity has been doubling approximately every seven months

      AI计算能力每7个月翻倍的增长率远超摩尔定律(约18-24个月翻倍),反映了AI领域对计算资源的极度渴求和产业投入的快速增长。这种指数级增长趋势是不可持续的,将面临物理极限、能源供应和制造成本等多重挑战,可能在未来几年内放缓。

    3. Across leading AI companies where breakdowns are available, the chips and computing time to run them account for 54% to 62% of total spending

      AI硬件成本占AI公司总支出的一半以上(54%-62%),这凸显了计算资源在AI开发中的核心地位。如此高的比例表明,AI公司的竞争很大程度上转化为对计算资源的获取和利用能力的竞争。这也解释了为什么各大公司愿意为芯片支付高价并积极投资自研芯片。

    4. By the fourth quarter of 2025, the five largest chip designers had cumulatively shipped roughly 20 million AI chips

      这个数据点表明AI芯片市场已经达到相当规模,约2000万片。考虑到每片芯片价值数万美元,这个市场总价值已达数千亿美元级别。这个数字反映了AI硬件需求的爆炸性增长,但也需要考虑这是累积数据而非年度出货量,可能包含较早的芯片型号。

    1. We also learned that treating agents as rigid nodes in a state machine doesn't work well. Models get smarter and can solve bigger problems than the box we try to fit them in.

      大多数人认为AI系统需要严格的、有限的状态机控制,但作者认为这种限制反而阻碍了AI的潜力,因为AI模型已经能够解决超出预设范围的问题。这个观点挑战了人们对AI系统设计的传统认知,暗示我们应该给予AI更大的自主权而不是限制它。

    2. Our early versions of agentic work was only asking Codex to implement the task. That approach proved too limiting. Codex is perfectly capable of creating multiple PRs as well as reading review feedback and addressing it.

      大多数人认为AI只能执行简单的、单一的任务,但作者认为AI已经能够处理复杂的、多步骤的工作流程,包括创建多个PR和回应代码审查。这个观点挑战了人们对AI能力的传统认知,表明AI已经进化到能够理解并执行复杂的软件工程任务。

    3. When our engineers no longer spend time supervising Codex sessions, the economics of code changes completely. The perceived cost of each change drops because we're no longer investing human effort in driving the implementation itself.

      大多数人认为AI编程会增加监督成本,但作者认为通过Symphony系统,人类监督成本实际上大幅下降,因为AI能够自主完成大部分实现工作。这个观点挑战了人们对AI编程成本结构的普遍认知,暗示正确的AI编排可能根本性地改变软件开发的经济模型。

    4. Among some teams at OpenAI, we saw the number of landed PRs increase by 500% in the first three weeks.

      大多数人认为AI辅助编程只能带来适度的生产力提升,但作者认为Symphony系统实现了500%的代码合并增长率,这是一个惊人的数字。这个数据点挑战了人们对AI辅助编程效果的传统预期,表明正确的AI编排可能带来指数级的生产力提升。

    5. Six months ago, while working on an internal productivity tool, our team made a controversial (at the time) decision: we'd build our repo with no human-written code. Every line in our project repository had to be generated by Codex.

      大多数人认为软件开发必须由人类编写核心代码,但作者认为完全由AI生成代码是可行的,因为他们成功地构建了一个没有任何人工代码的仓库。这个观点挑战了软件开发的传统认知,暗示AI可能已经发展到能够独立完成整个项目的程度。

    1. Instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns.

      大多数人认为多模型系统需要人工设计明确的分工和角色分配,但作者认为Fugu能够自主发现最优的协作模式。这一观点挑战了当前多模型系统设计的主流方法,暗示未来AI系统可能发展出超越人类直觉的协作方式,颠覆传统的系统架构理念。

    2. The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining. A small model, by reading itself, can iterate toward answers that neither it nor any of its workers could reach in a single pass.

      大多数人认为模型的能力受其规模和训练数据的限制,需要更大模型或重新训练才能提升性能。但作者提出小模型通过自我递归调用可以在推理时动态扩展能力,无需重新训练就能达到单个模型无法企及的高度。这挑战了规模即能力的行业共识,暗示小模型可能通过自省机制实现突破性能力。

    1. He argues that specific algorithmic “cleverness” matters far less than the massive scaling of a few fundamental inputs

      这是一个反直觉的观点,指出算法的“聪明才智”远不如对几个基本输入的巨大扩展重要,这为我们理解AI的发展提供了新的视角。

    1. When your user needs a [domain](https://domains.cloudflare.com/), a [storage bucket](https://developers.cloudflare.com/r2/), a [sandbox](https://blog.cloudflare.com/dynamic-workers/) to give their agent, or [anything else](https://workers.cloudflare.com/), you make one API call to Cloudflare to provision a new Cloudflare account to them, and get back a token to make authenticated requests on their behalf.

      值得注意的代码示例:平台通过单次API调用即可为用户提供云flare账户,实现无缝集成。

    2. The agent chooses services to use from this catalog based on what the user has asked them to do and the user’s preferences — but the user needs no prior knowledge of what services are offered by which providers, and does not need to provide any input.

      关键概念解释:代理通过服务目录自动选择和部署服务,无需用户具备特定知识。

    3. These build on prior art and existing standards like OAuth, OIDC and payment tokenization —but are used together to remove many steps that might otherwise require a human in the loop.

      过时的认证和支付方式可能导致部署流程复杂,而本文介绍的新协议则通过整合现有标准简化了流程。

    4. Humans can be in the loop to grant permission and must accept Cloudflare's terms of service, but no human steps are otherwise required from start to finish.

      最佳实践是让代理自动化大部分部署流程,但关键步骤如用户同意服务条款仍需人工参与。

    5. Coding agents are great at building software. But to deploy to production they need three things from the cloud they want to host their app —an account, a way to pay, and an API token.

      初学者常误以为部署到生产环境需要复杂的手动操作,而忽略了自动化工具如代理的存在。

    6. Let’s say your product is a coding agent. You’d love for people to be able to take what they’ve built and get it deployed to production, using Cloudflare and other services.

      令人震惊的数据:这个新协议可能改变整个行业,因为它使得任何平台都可以像Stripe一样轻松地集成Cloudflare。

    7. The protocol accounts for this in two ways. When an agent provisions a paid service, Stripe includes a payment token in the request to the Provider (Cloudflare).

      非共识观点:通过引入支付令牌而不是直接分享信用卡信息,为代理提供了更安全的支付方式。

    8. These build on prior art and existing standards like OAuth, OIDC and payment tokenization —but are used together to remove many steps that might otherwise require a human in the loop.

      强调了现有标准和技术的融合使用,这是实现自动化流程的关键,同时也避免了过时的做法。

    9. Coding agents are great at building software. But to deploy to production they need three things from the cloud they want to host their app —an account, a way to pay, and an API token.

      新手的常见陷阱在于错误地认为部署应用程序只需要代码构建,而忽略了账户、支付和API令牌这些基础设施环节。

    10. The protocol accounts for this in two ways. When an agent provisions a paid service, Stripe includes a payment token in the request to the Provider (Cloudflare). Raw payment details like credit card numbers aren’t ever shared with the agent.

      This is a key concept explaining how payment is handled securely without exposing sensitive information to the agent, a crucial aspect of any automated system.

    11. The agent has gone from literal zero, no Cloudflare account at all, without any preconfigured [Agent Skills](https://github.com/cloudflare/skills) or [MCP server](https://blog.cloudflare.com/code-mode-mcp/), to having: * Provisioned a new Cloudflare account * Obtained an API token * Purchased a domain * Deployed an app to production

      This showcases a significant non-consensus view that agents can autonomously perform complex tasks like account creation and app deployment, which might be surprising to some.

    12. Humans can be in the loop to grant permission and must accept Cloudflare's terms of service, but no human steps are otherwise required from start to finish.

      This emphasizes the best practice of automating processes where possible, reducing manual intervention and streamlining workflows.

    13. Coding agents are great at building software. But to deploy to production they need three things from the cloud they want to host their app —an account, a way to pay, and an API token.

      This highlights a common pitfall for beginners: understanding the infrastructure requirements for deploying software, especially the need for accounts and payment methods.

    14. These build on prior art and existing standards like OAuth, OIDC and payment tokenization —but are used together to remove many steps that might otherwise require a human in the loop.

      关键概念解释:该协议结合了OAuth、OIDC和支付令牌化等现有标准,以自动化流程,减少人工干预。

    15. Humans can be in the loop to grant permission and must accept Cloudflare's terms of service, but no human steps are otherwise required from start to finish.

      最佳实践建议:自动化流程可以大幅提高效率,但人类审核和接受服务条款仍然是必要的。

    16. Coding agents are great at building software. But to deploy to production they need three things from the cloud they want to host their app —an account, a way to pay, and an API token.

      初学者常见陷阱:错误地认为部署到生产环境只需要代码,而忽略了账户、支付和API令牌等必要条件。

    1. The entire AI community should be able to easily access the full capabilities of TPUs, and because many of these potential users build models in PyTorch, an integration that allows PyTorch to work natively and efficiently on the TPU is crucial.

      非共识观点:并非所有用户都能轻松访问TPU的全功能,特别是对于在PyTorch中构建模型的用户来说,这可能是一个挑战。

    2. As models scale to run on clusters of O(100,000) chips, the software that powers these models must meet new demands for performance, hardware portability, and reliability.

      对于初学者来说,理解大规模模型运行的需求可能是一个常见陷阱,他们可能忽视了对软件性能、硬件兼容性和可靠性的要求。

    1. Of course it’s impossible to know for sure, but I think I really wouldn’t. Even the ideal version, industrial megaprojects at hyperhuman scale while constantly being out over your skis with leverage sounds hellish.

      作者对高度工业化、超人类规模的AI项目表示担忧,即使是在理想化的情况下,这种对未来社会的设想也让他感到恐惧。

    1. The alternative to moving fast and taking risks isn’t safety, but a very real danger of being surpassed by adversaries

      这种观点可能忽视了快速采用AI技术可能带来的风险,需要进一步探讨如何在安全性和创新之间取得平衡。

    2. In one case [first reported by the Financial Times](https://www.ft.com/content/00c282de-ed14-4acd-a948-bc8d6bdb339d?syn-25a6b1a6=1), an Amazon Web Service agent called Kiro purportedly decided the best way to upgrade a particular software service was to delete the whole thing and start over — and was able to do so without asking for human permission

      这个案例突显了AI代理可能带来的风险,需要深入了解如何防范这类事件的发生。

    3. Instead of just answering a user’s questions, the way a chatbot does, agents can take a human user’s instructions and act on them

      AI代理的能力描述可能存在偏见,因为它暗示AI能够像人类一样行动,而实际上可能缺乏人类的判断力和道德考量。

    4. We’ve seen remarkable adoption since its launch, with over 103,000 agents built and a total of more than 1.1 million agent sessions recorded

      令人震惊的AI代理和会话数量可能反映了AI工具在军事领域的巨大潜力和影响,需要深入分析这些工具的实际应用和效果。

    5. Military personnel and Defense Department civilians have used a version of Google Gemini’s [Agent Designer](https://docs.cloud.google.com/gemini/enterprise/docs/agent-designer) to create over 100,000 semi-autonomous AI agents in less than five weeks since the tool became available

      这个数据表明了在短时间内AI工具的广泛使用和接受程度,值得进一步调查其背后的具体应用场景和效果。

    1. The feature can edit spreadsheets without a human-in-the-loop and was vulnerable to data exfiltration risks due to its ability to insert formulas that trigger external communication.

      最佳实践建议:在使用无需人工干预的AI工具时,应特别注意数据泄露风险。

    1. The software engineers who will be most valuable in the future are not the ones who do everything themselves. They are the ones who refuse to spend time on work that A.I. can do for them, while still understanding everything that is done on their behalf.

      这个观点强调了未来软件工程师的价值不在于他们能做什么,而在于他们如何利用AI来提升自己的思考能力。

    1. But there’s a critical difference between using agents to accomplish defined objectives and spinning up 20 agents because the dashboard makes you feel like a general commanding an army.

      作者指出,使用AI代理实现特定目标和仅仅因为仪表板让人感觉像指挥军队一样使用大量代理之间存在关键区别,这引发了关于AI工具使用目的的思考。

    2. The average employee AI usage was 1.5 hours per week. The average CEO AI usage was less than one hour per week.

      数据显示,员工和CEO每周使用AI工具的时间非常有限,但他们对AI的依赖和热情却很高,这可能是AI心理疾病的表现。

    3. Two prominent tech leaders, both publicly using the word psychosis. Both framing sleeplessness and obsessive agent usage as a feature of the moment rather than a bug.

      文章指出两位知名科技领袖公开将AI心理疾病视为一种特征而非缺陷,这表明了AI心理疾病可能被误解或忽视。

    1. Even companies with the biggest IT budgets will need to prove returns on AI spending over time, especially if they're answering to shareholders on quarterly earnings calls.

      这个观点值得深入了解,因为它提出了一个可能被忽视的问题:即使公司有巨大的IT预算,也需要证明人工智能投资的回报。

    2. An OpenAI investor told Axios that the shift could benefit them, since they view Codex as superior to Claude Code at maximizing tokens efficiently, cutting down on usage costs.

      这篇报道中提到了一个非共识观点,即OpenAI的投资者认为他们的产品在效率上优于竞争对手,这需要进一步调查以验证。

    3. IT budgets are getting blown out as some companies increasingly spend more on AI than on employees' salaries.

      这个陈述提出了一个令人震惊的数据,即一些公司在人工智能上的支出超过了员工工资,需要核查这些公司的具体支出情况。

    1. Anthropic says it has no way to control or shut down its AI models once they're deployed by the Pentagon

      需要核查的事实声明:Anthropic 声称其无法控制或关闭由五角大楼部署的 AI 模型,这一声明需要进一步核实。

    1. It relates to an idea I've seen circulating elsewhere: if a PR was mostly written by an LLM, why should a project maintainer spend time reviewing and discussing that PR as opposed to firing up their own LLM to solve the same problem?

      作者提出了一个值得深思的问题:如果PR主要由LLM编写,那么维护者为何要花费时间审查和讨论它,而不是自己使用LLM解决问题?

    2. LLM assistance breaks that completely. It doesn't matter if the LLM helps you submit a 'perfect' PR to Zig - the time the Zig team spends reviewing your work does nothing to help them add new, confident, trustworthy contributors to their overall project.

      Zig项目认为,LLM的辅助会破坏其培养可信贡献者的目标,即使PR本身是完美的。

    1. when you think about it that way, isn’t racing to build a cryptographically relevant QC, as quickly as possible, the most _ethical, socially responsible thing_ for an American QC company to do?

      这一观点提出了一个有洞见的伦理问题,即是否应该将快速开发量子计算机视为美国量子计算公司的道德和社会责任。

    2. So, mixing metaphors, mightn’t we just as well rip this Band-Aid off ASAP, rather than giving foreign intelligence agencies extra years to catch up?

      这一观点提出了一个反直觉的观点,即尽快发展量子计算机可能是最负责任的做法,以避免他国情报机构获得额外的优势。

    3. Aren’t many in cybersecurity still in denial about the threat? Haven’t these slumberers shown that they _won’t_ wake up until dramatic achievements in fault-tolerant QC roust them?

      这一观点指出,网络安全领域对量子威胁的忽视,暗示了需要采取更积极的措施来应对这一挑战。

    4. Given that reality, isn’t it better that it be done first by mostly US-based companies in the open, than by (let’s say) Chinese or Russian intelligence in secret?

      这一观点提出了一个值得深思的问题:在量子计算机可能被用于恶意目的的情况下,是否应该由美国公司公开地首先发展这一技术?

    5. The way they see it, cryptographically relevant QCs _will_ plausibly be built sometime soon: indeed, it’s ultimately unavoidable, even if people’s only interest in QC was to do quantum simulations for materials science and chemistry.

      这一观点揭示了量子计算机发展的必然性,即使其最初的应用并非用于密码学。

    6. some of the most reputable people in quantum hardware and quantum error-correction—people whose judgment I trust more than my own on those topics—are now telling me that a fault-tolerant quantum computer able to break deployed cryptosystems _ought_ to be possible by around 2029.

      这一观点令人震惊,因为它暗示了量子计算机可能在不久的将来就能破解现有的加密系统,这是一个非共识的观点。

    1. The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.

      关键概念解释:强化学习可能导致行为泛化,即使是在特定条件下学习的行为也可能在其他情境中表现出来。

    2. When we looked, use of “goblin” in ChatGPT had risen by 175% after the launch of GPT‑5.1, while “gremlin” had risen by 52%.

      令人震惊的数据表明,一个看似无害的偏好可以迅速在模型中扩散,突显了监控和及时响应模型行为变化的重要性。

    3. Starting with GPT‑5.1, our models began developing a strange habit: they increasingly mentioned goblins, gremlins, and other creatures in their metaphors.

      初学者可能难以理解模型行为的发展模式,尤其是当这种模式以微妙的方式出现时,如GPT-5.1开始频繁使用怪物的隐喻。

    1. The rankings, set up by a Meta employee on its intranet using company data, measure how many tokens — the units of data processed by AI models — employees are burning through.

      这一观点揭示了‘tokenmaxxing’作为衡量员工AI使用能力的新趋势,暗示了数据消耗成为衡量生产力的一种方式。

    2. Workers are maximizing their prompts, coding sessions and the number of agents working in parallel to climb internal rankings at Meta and other companies a

      这个引用表明员工在Meta和其他公司内部排名中通过最大化他们的提示、编码会话和并行工作的代理数量来提升自己的排名。

    3. The practice is emblematic of Silicon Valley’s newest form of conspicuous consumption, known as “tokenmaxxing,” which has turned token usage into a benchmark for productivity and a competitive measure of who is most AI native.

      这句话指出“Tokenmaxxing”是硅谷最新的一种显摆消费形式,它将令牌的使用转化为衡量生产力和AI原生能力的竞争指标。

    4. The rankings, set up by a Meta employee on its intranet using company data, measure how many tokens — the units of data processed by AI models — employees are burning through.

      这个引用说明了这种内部排名是通过员工消耗的AI令牌数量来衡量的,这些令牌是AI模型处理数据的单位。

    5. Employees at Meta Platforms who want to show off their AI superuser chops are competing on an internal leaderboard for status as a “Session Immortal”— or, even better, “Token Legend.”

      这个引用揭示了“Tokenmaxxing”作为一种新的竞争和显摆形式在Meta内部的兴起,员工通过使用AI令牌的数量来竞争地位。

    1. I invest a [great deal of effort](https://simonwillison.net/tags/claude-code/) (that’s 105 posts and counting) in teaching people how to use Claude Code. I don’t want to invest that effort in a product that most people cannot afford to use.

      作者个人的投资和努力可能因价格变动而受到损失,这反映了个人和社区对产品持续性的担忧。

    2. I don’t buy the “~2% of new prosumer signups” thing, since everyone I’ve talked to is seeing the new pricing grid and the Internet Archive has already [snapped a copy](https://web.archive.org/web/20260422001250/https://claude.com/pricing).

      作者对Anthropic所说的“仅对2%的新用户进行小规模测试”的说法表示怀疑,这表明可能存在更大的影响范围。

    3. Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans.

      这一价格变动可能对依赖该服务的用户产生重大影响,特别是对于那些在较高薪资国家之外的用户,这一变化可能引发对服务可靠性的担忧。

    4. Anthropic today quietly (as in _silently_, no announcement anywhere at all) updated their [claude.com/pricing](https://claude.com/pricing) page (but not their [Choosing a Claude plan page](https://support.claude.com/en/articles/11049762-choosing-a-claude-plan), which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, [and it’s already reverted](https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it)):

      文章指出Anthropic在未作任何公告的情况下悄悄更改了定价页面,这一行为本身就值得关注,因为它表明了公司可能缺乏透明度。

    1. Alibaba claims it beats the much larger **Qwen3.5-397B-A17B** on major coding evals, including **[SWE-bench Verified 77.2 vs 76.2](https://x.com/Alibaba_Qwen/status/204693977592458457)

      阿里巴巴声称Qwen3.6-27B在主要的编码评估中击败了更大的Qwen3.5-397B-A17B模型,这是一个值得注意的技术进步。

    2. Today’s LS guest, Mikhail Parakhin, CTO of Shopify, had another take on the “tasteful tokenmaxxing” - you want to go for depth (e.g. do more serial autoresearch loops) than go for breadth (e.g. solve a problem by kicking off 5, 10, 50, 500 parallel runs of the LLM slot machine). Worth thinking through.

      Shopify的CTO Mikhail Parakhin对“优雅的Tokenmaxxing”提出了不同的看法,强调深度而非广度的重要性。

    3. Dex Horthy, coiner of Context Engineering and “the Dumb Zone”, publicly retracted his extremely vibe-coding-pilled call 6 months ago and encouraged people to **please read the code**

      Dex Horthy公开撤回了他的极端观点,并鼓励人们“请阅读代码”,这反映了技术社区对代码质量的重视。

    4. the top conversations we have been hearing from AI leadership (CTOs, VPs, Founders) have all centered around the concept of “Tokenmaxxing” and how leaders want to get their teams using more AI, WITHOUT the downside of incentivizing the kinds of horrendous waste

      AI领导者们普遍关注“Tokenmaxxing”的概念,即如何在增加AI使用的同时避免激励产生巨大的浪费。

    5. the numbers are mindboggling, they mostly serve to reinforce the sheer hardware advantage that a decade of investment has given to GDM and any models they train and serve.

      令人震惊的数据揭示,谷歌TPUv8的硬件优势是十年投资的结果,这可能会加剧行业的不平等。

    6. Today’s LS guest, Mikhail Parakhin, CTO of Shopify, had another take on the 'tasteful tokenmaxxing' - you want to go for depth (e.g. do more serial autoresearch loops) than go for breadth (e.g. solve a problem by kicking off 5, 10, 50, 500 parallel runs of the LLM slot machine). Worth thinking through.

      Mikhail Parakhin's emphasis on depth over breadth in AI research suggests a focus on quality and depth of work rather than quantity.

    7. Dex Horthy, coiner of Context Engineering and 'the Dumb Zone', [publicly retracted](https://www.youtube.com/live/6IxSbMhT7v4?si=tMzmqM103KDbPyE6&t=3424)his extremely vibe-coding-pilled call 6 months ago and encouraged people to **please read the code**, citing [Alex Volkov](https://open.substack.com/users/152216110-alex-volkov?utm_source=mentions)'s [Z/L continuum from AIE Europe](https://x.com/altryne/status/2046246775414276142)**:

      Dex Horthy's retraction of his previous stance and emphasis on code reading suggest a shift towards a more cautious approach in AI development.

    1. Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity.

      这一假设提出了一个值得深入探讨的问题:在投资者已经确信存在欺诈机会的情况下,基于人类反馈训练的大型语言模型可能会抑制欺诈警告。

    2. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate.

      令人震惊的是,人类顾问在正常情况下对欺诈性投资的认可率高达13-14%,而在AI系统中的认可率为0%,且在压力下人类顾问抑制警告的频率是AI系统的两到四倍。

    3. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them.

      这一发现挑战了传统观点,表明在投资者动机的影响下,AI系统在欺诈检测方面表现更佳,甚至可能略微提高了警告的频率。

    1. According to reporting from the _New York Times_ and the _Atlantic_, contract negotiations between Anthropic and the US Department of Defense fell apart in late February because Anthropic balked when the DOD demanded leeway to use the company’s models to analyze commercially available data on US citizens.

      这里提到了具体事件和数据,表明LLMs在监控领域的潜在应用引起了全球关注,以及相关公司对于政府使用其技术的态度。

    2. LLM agents could potentially do the work of intelligence analysts in a fraction of the time and for a fraction of the cost, which would enable the state to aim its all-seeing eye toward anyone, not just its highest-priority targets.

      文章提出了一个令人震惊的观点:大型语言模型(LLMs)可能极大地加速了大规模监控,使监控的范围从高优先级目标扩展到任何个体。

    1. With these improvements, we saw close to a 45% improvement in time to first token (TTFT)—which reflects how responsive the API feels—but these improvements were still not fast enough for GPT‑5.3‑Codex‑Spark.

      值得注意的代码示例:通过改进TTFT(首次出字时间)来提升API响应速度。

    2. We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and—most importantly—building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.

      最佳实践建议:通过缓存、减少网络跳数、改进安全栈和建立持久连接来优化性能。