If we can better understand the potential for threats to be exacerbated by AI systems, society can more easily become resilient to this changed threat landscape.
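Most people see AI threats as mainly a technical problem that needs technical solutions. The author suggests that societal adaptation and resilience-building may matter just as much, or more. This challenges the mainstream view that AI safety can be solved by purely technical means and stresses the need for societal adaptation.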
When does access to agents able to negotiate on your behalf improve market efficiency and equitable outcomes? When does it not?
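Most people assume that AI negotiating agents will always improve market efficiency and fairness, but the author questions this assumption and suggests that such agents may not always produce positive outcomes. This challenges the optimistic view that technical progress necessarily leads to better results and implies that we need a more nuanced understanding of AI's effect on markets.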
When AI is applied in more conventional domains, like increasing integration into command and control systems, does it benefit the attacker? More generally, how will AI change the character of human conflict?
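Most people assume that AI defense systems will make humans safer, but the author suggests that AI may fundamentally shift the offense-defense balance and could favor the attacker even in conventional domains. This challenges the conventional belief that technical progress usually strengthens defense and implies that AI may make conflict more dangerous and less predictable.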
This new firm extends that delivery capacity further.
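A new firm extends delivery capacity
Most people assume the existing partner network is enough to meet demand, but the author argues that a new firm is needed to extend delivery capacity further and keep up with fast-growing enterprise demand.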
A typical engagement starts with a small team working closely with the customer to understand where Claude can have the biggest impact.
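Small teams, big impact
Most people assume that large AI projects need large teams, but the author argues that a small team working closely with the customer can identify where Claude will have the biggest impact.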
Engagements like this will run across mid-sized companies across industries, each shaped by the people closest to the work.
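Frontline staff shape AI implementation
Most people assume that AI implementation should be led by technical experts, but the author argues it should be shaped by the people closest to the work, because they understand the real needs best.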
On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent's capabilities.
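Most people assume that preserving failure records is always beneficial, but the author finds that these records can limit an AI agent's ability to innovate and keep it from stepping outside the 'prior-run box'. This counterintuitive finding shows that even an improved research method can carry unexpected limitations.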
We also learned that treating agents as rigid nodes in a state machine doesn't work well. Models get smarter and can solve bigger problems than the box we try to fit them in.
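Most people assume that AI systems need strict, finite state-machine control, but the author argues that such constraints hold back AI's potential, because models can already solve problems beyond the preset scope. This challenges conventional thinking about AI system design and implies that we should give AI more autonomy rather than restrict it.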
Our early versions of agentic work was only asking Codex to implement the task. That approach proved too limiting. Codex is perfectly capable of creating multiple PRs as well as reading review feedback and addressing it.
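Most people assume that AI can only carry out simple, single tasks, but the author argues that AI can already handle complex, multi-step workflows, including creating multiple PRs and responding to code review. This challenges conventional views of AI capability and shows that AI has evolved to understand and execute complex software engineering tasks.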
When our engineers no longer spend time supervising Codex sessions, the economics of code changes completely. The perceived cost of each change drops because we're no longer investing human effort in driving the implementation itself.
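Most people assume that AI programming increases supervision costs, but the author argues that with the Symphony system the cost of human supervision drops sharply, because the AI completes most of the implementation work autonomously. This challenges the common view of the cost structure of AI programming and implies that the right orchestration could fundamentally change the economics of software development.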
Among some teams at OpenAI, we saw the number of landed PRs increase by 500% in the first three weeks.
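Most people assume that AI-assisted programming brings only modest productivity gains, but the author reports that the Symphony system produced a 500% increase in landed PRs, a striking figure. This data point challenges conventional expectations of AI-assisted programming and suggests that the right orchestration could bring very large productivity gains.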
Instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns.
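Most people assume that multi-model systems need a human-designed division of labor and role assignment, but the author argues that Fugu can discover the best collaboration patterns on its own. This challenges the mainstream approach to multi-model system design and implies that future AI systems may develop forms of collaboration beyond human intuition, upending traditional ideas of system architecture.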
He argues that specific algorithmic “cleverness” matters far less than the massive scaling of a few fundamental inputs
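This is a counterintuitive view: algorithmic 'cleverness' matters far less than massively scaling a few fundamental inputs, which offers a new perspective on how AI develops.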
Every time you substitute generated output for your own comprehension, you are skipping the exercises / reps that build judgment.
This view holds that over-reliance on AI-generated content hinders the development of personal judgment.
Two prominent tech leaders, both publicly using the word psychosis. Both framing sleeplessness and obsessive agent usage as a feature of the moment rather than a bug.
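The article notes that two prominent tech leaders publicly framed AI 'psychosis' as a feature rather than a bug, which suggests that the condition may be misunderstood or overlooked.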
Dex Horthy, coiner of Context Engineering and “the Dumb Zone”, publicly retracted his extremely vibe-coding-pilled call 6 months ago and encouraged people to **please read the code**
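Dex Horthy publicly retracted his extreme position and encouraged people to 'please read the code', which reflects the technical community's concern for code quality.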
Large language models are great at simulating a style of writing without necessarily reproducing the quality of the work.
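This view highlights that large language models can imitate a writing style without necessarily reproducing the quality of the work, which is a counterintuitive point.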
Resolution increases make them more expensive, then efficiency gains reduce costs - a sawtooth pattern.
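Most people probably assume that AI costs follow a monotonic trend, either falling or rising, but the author proposes a 'sawtooth' pattern: higher resolution pushes costs up, then efficiency gains bring them back down. This volatility challenges conventional expectations about how technology costs evolve.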
Smaller pieces force the model to pay closer attention to each word, like reading a contract word by word instead of skimming paragraphs.
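Most people assume that a smarter AI processes information more efficiently, but the author points out that, to improve precision, advanced models actually need to process each token more carefully. This runs against the intuition that 'smarter' usually means 'more efficient'.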
Opus 4.5 costs 67% more than Sonnet. But Opus 4.5 used 76% fewer tokens to reach the same outcome.
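Most people assume that a model with a higher unit cost will also have a higher total cost of use, but the author shows with concrete data that although Opus 4.5 costs 67% more per token, its much greater efficiency makes the total cost of completing the task about 60% lower. This challenges simple linear thinking about cost.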
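The arithmetic behind the 60% figure can be checked with a minimal sketch. It assumes a single blended per-token price for each model (the quote does not separate input and output pricing), and the function name `relative_total_cost` is hypothetical.

```python
def relative_total_cost(price_ratio: float, token_ratio: float) -> float:
    """Total task cost of model A relative to model B.

    price_ratio: A's per-token price divided by B's (assumed blended price).
    token_ratio: tokens A needs divided by tokens B needs for the same outcome.
    """
    return price_ratio * token_ratio


# Figures taken from the quote: 67% higher price, 76% fewer tokens.
price_ratio = 1.67
token_ratio = 1 - 0.76

relative = relative_total_cost(price_ratio, token_ratio)
savings = 1 - relative
print(f"relative cost: {relative:.2f}, savings: {savings:.0%}")
# prints "relative cost: 0.40, savings: 60%"
```

Because total cost is the product of price and token count, a 1.67x price multiplied by 0.24x tokens gives roughly 0.40x the cost.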
Humans are not a good target for calm technology.
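Most people assume that the goal of technology is to be easier for humans to use and understand. The author takes the opposite view: humans are not a good target for 'calm technology', because current AI design demands constant human supervision and interaction, which contradicts what calm technology is about.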
Today's agents, the copilots, the chatbots are designed to be human like.
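Most people assume that AI assistants should imitate human communication in order to work better with humans. The author considers this design a mistake, because it adds cognitive load and goes against the idea of 'calm technology'. The author implies that AI should be more like a background tool than a virtual colleague.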
The agent interprets new information and adapts the logic. The engine applies that logic continuously and emits precise updates.
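Most people assume that AI agents should both decide and execute autonomously. The author proposes a counterintuitive division of labor: the agent handles strategy and adapts the logic, while an execution engine applies that logic continuously. This repositions AI from 'executor' to 'strategist' and challenges the mainstream view of AI autonomy.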
This ultimately also leads to false positives, but my manual QA run verified it's maybe 5-10%.
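Most people assume that AI detection systems should aim for zero errors, but the author accepts a false positive rate of 5-10%, which challenges perfectionist standards for technical detection. This pragmatic stance implies that AI-based identification requires a trade-off between accuracy and practicality rather than a blind pursuit of perfection.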
Claude Code has led to a large increase in Show HN projects. So much, that the moderators of HN had to restrict Show HN submissions for new accounts.
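Most people assume that AI tools raise productivity, but the author links them directly to a flood of content and platform restrictions, implying that AI has increased quantity while possibly harming community quality. This challenges the optimistic narrative that 'AI is always progress' and points to negative consequences of applying the technology.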
Is this bad? Not really, just uninspired. After all, validating a business idea was never about fancy design, and before the AI era, everything looked like Bootstrap.
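Most people assume that AI-generated design is 'bad design', but the author considers it merely 'uninspired' and compares it to the Bootstrap era, implying that this flattening of design is a natural cycle of technology rather than a catastrophic regression. This challenges conventional thinking about the value of design.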
The good world is where everyone has AI, and not as a revokable privilege through an API, but through hard possession.
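Most people probably assume that API access is the democratic, scalable way to provide AI, but the author argues that real democratization of AI comes through hard possession, which challenges the dominant business model of current AI services.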
Even the ideal version, industrial megaprojects at hyperhuman scale while constantly being out over your skis with leverage sounds hellish.
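Most people see large AI projects and industrial-scale development as symbols of progress and prosperity, but the author thinks projects at hyperhuman scale sound hellish, because they can lead to excessive leverage and unsustainable pressure.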
But plenty of categories survived through specialization or direct competition: cloud, travel, domain registration, social networking.
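Most people assume that a give-it-away-free strategy destroys every category it touches, but the author argues that some categories, such as cloud and travel, survived through specialization or direct competition. This challenges technological determinism and stresses the importance of human expertise and differentiated value in the AI era.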
The risk of this strategy to the ecosystem is that it makes previously attractive categories no longer viable. Commoditizing the complement does not demand a best-in-class replacement.
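Most people assume that market competition drives continual product innovation and improvement, but the author argues that a free-product strategy actually lowers the market's demand for excellent products, because a 'good enough' free product is sufficient to change market dynamics. This challenges conventional innovation economics and implies that markets may stagnate as a result.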
For Anthropic, more usage across diverse tasks means more data, which produces a smarter model—just as more queries improved Google search.
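Most people assume that AI companies compete on the superiority of their model architecture or algorithms, but the author argues that the breadth of data collection is what matters, in sharp contrast to the industry's heavy focus on model architecture.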
Commoditizing complements doesn't always work because focus is scarce even for the largest, fastest growing businesses.
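Most people assume that tech giants have unlimited resources to pursue several strategic directions at once, but the author stresses that even the largest companies face scarce attention, which challenges the mainstream belief that big companies can do anything.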
The risk of this strategy to the ecosystem is that it makes previously attractive categories no longer viable.
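Most people assume that free products promote competition and innovation, but the author points out that this strategy can actually destroy certain market categories by making them commercially unviable. This challenges the conventional economic view that competition promotes innovation.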
Some categories never developed a competitive response to this strategy: email, advertising infrastructure, user-generated video.
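Most people assume that market competition always produces an effective counter-strategy, but the author notes that some categories never managed an effective response to the commoditize-the-complement strategy. This challenges market equilibrium theory and implies that some market structures may be unable to resist it.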
For Anthropic, more usage across diverse tasks means more data, which produces a smarter model—just as more queries improved Google search.
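Most people assume that AI companies compete on model architecture or parameter count, but the author argues that the real competitive advantage comes from user data and diverse usage, similar to Google's search data flywheel. This challenges the mainstream technological determinism in AI and stresses the strategic value of data network effects.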
But plenty of categories survived through specialization or direct competition: cloud, travel, domain registration, social networking. Commoditizing complements doesn't always work because focus is scarce even for the largest, fastest growing businesses.
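Most people assume that a tech giant's free strategy is unstoppable and can disrupt any industry, but the author argues that even a giant like Google cannot apply this strategy successfully in every category, because focus is a scarce resource. This challenges the mainstream belief that 'big companies can do anything'.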
Modern-day security tooling looks for the wrong things. Most software composition analysis tools work by checking your dependencies against a database of known vulnerabilities – CVEs. But a deliberately planted backdoor doesn't have a CVE.
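Most security teams rely on CVE databases to assess risk, but the author points out that this approach is useless against deliberately planted backdoors. This challenges industry consensus and implies that existing security tools are outdated in the face of new supply chain attacks and that new methods such as behavioral analysis are needed.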
The result is a mismatch that should terrify anyone building software: the attack surface is expanding faster than any human can monitor, and the entities making dependency decisions are increasingly not human.
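Most people assume that security problems can be solved by adding more human monitoring and review, but the author argues that in the AI era the attack surface is expanding faster than humans can monitor, and that dependency decisions are increasingly made by AI rather than by humans. This challenges traditional security thinking and implies a need for entirely new automated defenses.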
They range dramatically in both significance and difficulty, and many AI solutions have turned out to be less original than they appeared.
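Most people assume that AI breakthroughs in mathematics are highly original, but the author notes that many AI solutions are less original than they appear, which challenges inflated expectations of AI's capacity for innovation.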
I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them. And this new method is really confirming that intuition.
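Most people assume that mathematical problems are usually independent and need different methods, but the speaker believes these problems are in fact related and can be solved by a unifying method, which challenges conventional views of the diversity of mathematical problems.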
The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.
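Most people assume that mathematical breakthroughs require entirely new theories or methods, but the author notes that the AI simply applied a known formula that no one had thought to use on this type of question. This challenges the idea that mathematical innovation must rely on wholly new methods.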
The question Price solved—or prompted ChatGPT to solve—concerns special sets of whole numbers, where no number in the set can be evenly divided by any other.
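Most people assume that solving hard mathematical problems requires deep expertise and complex reasoning, but the author shows that a simple concept (sets of numbers in which none divides another) can form a problem that stayed open for 60 years, which challenges assumptions about what makes a mathematical problem hard.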
But experts have warned that these problems are an imperfect benchmark of artificial intelligence's mathematical prowess. They range dramatically in both significance and difficulty, and many AI solutions have turned out to be less original than they appeared.
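Most people take AI solving mathematical problems as strong evidence of its ability, but the author argues that these problems are a flawed measure of AI's mathematical prowess, which challenges the common standard for evaluating AI's mathematical achievements.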
I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them. And this new method is really confirming that intuition.
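Most people assume that mathematical problems stand alone and need different methods, but the speaker believes these problems share a kind of unity, which challenges conventional views of their diversity and independence.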
We are releasing GPT‑5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving legitimate, beneficial uses of advanced capabilities.
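Most people assume that stronger safety restrictions inevitably limit an AI's functionality and usefulness, but OpenAI claims it can both 'reduce misuse' and 'preserve legitimate, beneficial uses of advanced capabilities'. This challenges the widely held consensus in AI safety that safety and capability trade off against each other, and implies that OpenAI has found a way to strengthen safety without sacrificing functionality.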
GPT‑5.5 understands the task earlier, asks for less guidance, uses tools more effectively, checks its work and keeps going until it's done.
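Most people assume that AI models need continuous human guidance and supervision to complete complex tasks, but the author claims that GPT-5.5 'understands the task earlier, asks for less guidance, uses tools more effectively, checks its work and keeps going until it's done'. This challenges the consensus that current AI systems still need heavy human supervision and implies that GPT-5.5 has reached a higher degree of autonomy.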
Anthropic made fun of this idea during the last Super Bowl.
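Most people see advertising as a viable path to profitability for AI companies, especially given the free-service model, but the author notes that Anthropic publicly mocked the advertising model. This implies a fundamental disagreement within the AI industry over business models and challenges the mainstream view of advertising as the answer to AI monetization.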
Open weight (read: free) models are widely available and good enough that most people probably couldn't tell the difference.
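The mainstream view is that paid cloud LLM services are significantly better in quality than free open-weight models, but the author claims that open-weight models are now good enough that most users could not tell the difference. This challenges the core value proposition of paid services and implies that the AI industry may face a revaluation.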
the system achieved this training result more than 20 times faster than conventional synchronization methods.
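Most people assume that distributed training, with its need for synchronization and communication, is necessarily slower than single-machine training, but the author reports that Decoupled DiLoCo is more than 20 times faster than conventional synchronization methods. This challenges entrenched assumptions about distributed training speed and shows the potential of asynchronous computation.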
chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training.
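Most people assume that mixing hardware generations in training reduces performance or efficiency, but the author reports that chips of different generations and speeds, used together, still match the ML performance of single-chip-type training runs. This challenges the industry consensus that hardware must be homogeneous.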
With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of 'goodput', or useful training, while that of other approaches nosedives.
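Most people assume that hardware failures significantly reduce the efficiency and performance of distributed training, but the author reports that even at very high failure rates Decoupled DiLoCo maintains 88% goodput, while conventional methods plunge to 27%. This challenges conventional expectations of fault tolerance.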
The board looked at the AI race Apple was losing and, rather than try harder at the thing that was failing, changed which game the company plays.
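Most people assume that a company losing to competitors should double down and try to catch up in the same area, but the author argues that Apple chose a completely different strategy: changing the game instead of competing under the existing rules. This challenges conventional business strategy and implies that Apple may be shifting from cloud AI to on-device AI, a disruptive strategic turn.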
benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores.
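Most people regard public datasets as the gold standard for AI evaluation, offering an objective and fair test environment. The author warns that benchmarks built from public material carry contamination risk, where exposure in training data silently inflates scores. This challenges standard practice in AI evaluation and implies that we need stricter data isolation or a move to private evaluation datasets.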
We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
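Most people assume that gains in AI model performance come mainly from better algorithms and architectures. The author finds that success on SWE-bench depends more on whether a model saw the problems during training than on genuine improvement in programming ability. This contradicts the industry's prevailing 'model progress' narrative and implies that current evaluation of AI progress may be seriously biased.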
The architecture scales horizontally to 300 sub-agents executing across 4,000 coordinated steps simultaneously, a substantial expansion from K2.5's 100 sub-agents and 1,500 steps.
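Most people assume that scaling AI systems mainly means increasing the compute and parameter count of a single model, not the number of agents. The pattern described here, 300 agents executing in parallel, challenges that assumption and implies that future AI development may emphasize 'multi-agent collaboration' over 'single-model enhancement', which could redefine the design principles of AI system architecture.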
Kimi K2.6 demonstrates significant improvements over Kimi K2.5 in internal evaluations conducted by CodeBuddy: code generation accuracy increased by 12%, long-context stability improved by 18%, and tool invocation success rate reached 96.60%.
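Most people assume that AI model iteration is incremental, with perhaps a 5-10% performance gain per release. The data shows that Kimi K2.6 made a leap well beyond that expectation, particularly with a tool invocation success rate close to 97%. This challenges conventional assumptions about how fast model capability improves and hints at some technical breakthrough or architectural innovation.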
Even experienced designers have to ration exploration—there's rarely time to prototype a dozen directions, so you limit yourself to a few.
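Most people assume that professional designers have ample creative freedom and resources to explore many design options, but the author argues that even experienced designers are severely constrained by time and resources and can explore only a few directions. This challenges common assumptions about the creative process in design and exposes the practical constraints of design work.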
GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence.
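Most people assume that a more powerful AI model necessarily comes with higher compute cost and slower responses, but the author claims that GPT-5.5 breaks this rule by combining higher intelligence with the same latency. This counterintuitive finding challenges the conventional view in AI that capability and efficiency are inversely related and implies that optimizing model architecture may be more effective than simply scaling up.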
The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.
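Most people assume that AI safety should be achieved by restricting access and regulating strictly, but the author argues that 'trusted access' combined with 'safeguards that scale with capability' is the viable path. This challenges traditional thinking on AI safety governance and implies that excessive restriction could keep AI's defensive capabilities from being fully used, and that balancing openness with safety is the best strategy.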
GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence.
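Most people assume that a more powerful AI model necessarily comes with higher compute cost and slower responses, but the author claims that GPT-5.5 breaks this trade-off, delivering higher intelligence at the same latency. This challenges the conventional view in AI that capability and efficiency cannot both be had and implies a major advance in model architecture and inference algorithms.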
The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time.
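Most people think of AI progress as better performance on specific tasks, but the author argues that the real breakthrough in GPT-5.5 is its ability to reason across context and act over long periods, which challenges conventional views of AI's development path. This gain in 'agentic capability' matters more than simple task completion, because it marks a shift toward the way humans work.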
GPT‑5.5 is priced higher than GPT‑5.4, it is both more intelligent and much more token efficient. In Codex, we have carefully tuned the experience so GPT‑5.5 delivers better results with fewer tokens than GPT‑5.4 for most users
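Most people assume that a more powerful AI model necessarily means higher compute cost and resource consumption, but the author claims that GPT-5.5, although priced higher, is actually more efficient and delivers better results with fewer tokens. This contradicts the consensus that performance gains must come with rising costs and implies that model optimization may be more economical than scaling up.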
The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.
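Most people assume that as AI capability grows, access should be restricted more tightly to prevent misuse, but the author argues that 'trusted access' and 'safeguards that scale with capability' are the viable path. This contradicts the mainstream 'restrictive safety' view and implies that open but strongly safeguarded deployment may be safer and more effective than closed AI.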
GPT‑5.5 is our strongest agentic coding model to date. On **Terminal-Bench 2.0,** which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%.
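Most people assume that AI still needs human supervision and intervention on complex programming tasks, but the author reports that GPT-5.5 reaches 82.7% accuracy on complex command-line workflows. This challenges the consensus that AI coding assistants are still only assistive and implies that AI may be approaching or reaching professional human level in some areas of programming.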
GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence.
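Most people assume that a more powerful AI model must sacrifice speed and efficiency, but the author claims that GPT-5.5 breaks this traditional trade-off, delivering higher intelligence at the same latency. This challenges the consensus that 'bigger models are necessarily slower' and implies that optimizing model architecture may matter more than simply scaling up.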
The Meta cuts are the inverse. When one person with the right AI tools can do the work of 10-to-15 people, the person most at risk isn't the one using the AI. It's the one whose job description overlaps with what AI now does by itself.
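Most people assume that in the AI era the employees who use AI tools will become more valuable and keep their jobs. The author sharpens the point: those truly at risk of losing their jobs are the people whose work overlaps with what AI now does by itself, not those who use AI tools well. This challenges the common understanding of what AI skills are worth.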
Our alignment assessment concluded that the model is 'largely well-aligned and trustworthy, though not fully ideal in its behavior'. Note that Mythos Preview remains the best-aligned model we've trained according to our evaluations.
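Most people would probably assume that the newest, most powerful AI model should also be the best on alignment and safety. The author states plainly that although Claude Opus 4.7 is highly capable, it is less well aligned than the earlier Mythos Preview model. This counterintuitive conclusion challenges the common assumption that 'more capable means better aligned' and implies a possible trade-off between capability and alignment.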
Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally.
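Most people assume that AI models should get better at understanding user intent and handle imprecise instructions flexibly. The author notes that Claude Opus 4.7 instead follows instructions more literally, so prompts written for older models may produce unexpected results. This 'over-compliance' is a counterintuitive kind of progress, because it reduces the model's guessing about user intent and increases predictability.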
I'm not going to trust them to measure it.
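Most people assume that AI tools can objectively measure their own contribution and value, but the author flatly refuses to trust the tools' self-assessment, arguing that their vendors have a strong financial incentive to overstate AI's contribution. This distrust challenges the industry's general acceptance of self-reported data from AI tools.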
Writing code is not the same as software development. This is only capturing some level of acceleration while writing code, and does not capture time taken in architecture, debugging, review, and deployment.
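Most people assume that a high share of AI-generated code means a large gain in software development efficiency, but the author points out that this captures only acceleration in the coding stage and excludes more time-consuming work such as architecture, debugging, and review. A high AI contribution percentage is therefore not the same as a gain in overall productivity.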
So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.
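Most people assume that a code generation tool's metrics reflect actual usage, but the author shows that even when the developer did 100% of the writing by hand, Windsurf still reported a 100% AI contribution. This indicates a fundamental flaw in its metrics, which completely distort the real share of contribution.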
Security is a defensive posture; agency is a functional right.
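Most people treat security in AI discussions as mainly a matter of technical defense, but the author reframes the issue as one of functional rights. This challenges the mainstream framing of the security debate and implies that we should rethink AI governance in terms of rights and agency, not only technical protection.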
Some proposals for AI agents assume that putting agentic code in a TEE or similar 'jail' will solve these problems, but that ignores the need to collectively bargain
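Most people assume that technical means such as trusted execution environments can solve the trust problem for AI agents, but the author argues that this ignores the need for collective bargaining. This challenges the belief that technical solutions are a cure-all and stresses the importance of institutional design and multi-party negotiation.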
User Agents are a Form of Collective Bargaining
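Most people think of the browser as a simple tool for visiting websites, but the author reconceives it as a form of 'collective bargaining' in which the browser balances interests with websites on the user's behalf. This challenges conventional views of what a browser does and implies that it is really a complex mechanism for balancing power.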
A LeadDev survey found 54% of engineering leaders believe AI copilots will reduce junior hiring long-term.
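Most people assume that AI will create new jobs, but the author cites a survey showing that a majority of engineering leaders expect junior hiring to fall. This contradicts the mainstream narrative that AI creates jobs and reveals the shift in employment structure that AI may cause.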
When juniors skip debugging and skip the formative mistakes, they don't build the tacit expertise. And when my generation of engineers retires, that knowledge doesn't transfer to the AI.
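Most people assume that AI can substitute for the human learning process, but the author argues that skipping debugging and formative mistakes prevents tacit knowledge from forming, so key skills are not passed on. This contradicts the common belief that AI can fully replace human learning.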
The Pentagon told defense CEOs to consolidate or die. Fifty-one major defense contractors collapsed into five.
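Most people assume that industry consolidation improves efficiency and competitiveness, but the author points out that consolidation in the defense industry actually increased fragility and led to a loss of expertise. This contradicts the mainstream economies-of-scale view and reveals the risks of over-concentration.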
A nuclear weapons program lost the ability to make a material it invented. The knowledge existed only in people, and the people were gone.
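Most people assume that technical documentation and records are enough to preserve knowledge, but the author uses the Fogbank case to show that critical knowledge often exists only in people's experience. Once those people are gone, it cannot be rebuilt even with documentation. This challenges the common belief that documentation suffices to preserve knowledge.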
by over-emphasizing successful experiences, they miss out on a primary source of learning — their own failures
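The mainstream view is that successful experiences are the main source of learning and should be recorded and analyzed first. The author argues that failures may actually be the more important learning resource, because they provide counterfactual signals and valuable information about potential pitfalls. This challenges the usual focus on success cases and suggests that failure may be the stronger driver of learning.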
by recording detailed actions instead of tactical foresight, they fail to distill higher-level, transferable reasoning patterns
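Most people assume that recording detailed action trajectories is the best way for an agent to learn, because it preserves the full decision process. The author argues that this approach actually hinders learning, because it focuses on specific actions instead of transferable, higher-level reasoning patterns. This challenges conventional wisdom about memory storage and shows that simply logging every interaction is not the same as learning effectively.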
Fugu models achieve superior performance by dynamically coordinating and orchestrating a diverse pool of powerful models.
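Most people assume that using multiple models means the user must manually pick the best model for each task, which is complicated and inefficient. The author argues that dynamically coordinating multiple models can outperform any single model. This challenges the usual way multiple models are used and implies that future AI systems may optimize model combinations automatically instead of relying on manual selection.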
The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining.
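Most people assume that an AI model's compute is bounded mainly by its architecture and training data, and that its reasoning ability is essentially fixed once training ends. The author proposes that Fugu models can scale compute dynamically at inference time by adjusting recursion depth. This challenges the fixed-compute paradigm of conventional models and implies that future AI systems may be far more flexible.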
The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining. A small model, by reading itself, can iterate toward answers that neither it nor any of its workers could reach in a single pass.
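Most people assume that better model performance requires more parameters or retraining, but the author proposes a counterintuitive approach: by recursively calling itself, a small model can iterate at inference time toward answers that a single pass could not reach. This challenges conventional assumptions about the relationship between model size and capability.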
In a 1-million-token context, V4-Pro uses only 27% of the computing power required by its previous model, V3.2, while cutting memory use to 10%.
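Most people assume that handling longer context necessarily requires more compute, but the author reports that DeepSeek V4 achieves a striking efficiency gain through architectural innovation, sharply reducing compute and memory requirements. This counterintuitive finding challenges the industry belief that 'long context means high cost'.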
A system that can look up any fact has not been forced to find structure. It has not been forced to generalize.
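Most people assume that a system with vast information and retrieval ability has already 'learned', but the author argues that real learning requires compression and abstraction, not just retrieval. This challenges the prevailing understanding of 'memory' in AI and implies that current RAG and long-context approaches actually get in the way of real learning.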
No encryption protects that layer. The router can read, change, or replace anything.
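Most people think of API routers as simple data-forwarding services, but the author argues that these intermediaries have full access and can read, change, or replace anything, because no encryption protects that layer. This challenges the common understanding of what an API router does.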
the surrogate is activated only when its agreement with the LLM exceeds a user-specified threshold α
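Most people assume that model deployment is all-or-nothing: either the surrogate fully replaces the original model or it is not used at all. The author proposes a more radical 'partial activation' approach in which the surrogate is used only when its agreement with the original model reaches a specified threshold. This fine-grained control breaks with binary thinking about deployment.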
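The gating idea in the quote can be illustrated with a minimal sketch. It assumes agreement is estimated as the fraction of matching outputs on a small calibration set, which the quote does not specify; `agreement_rate`, `make_gated_predictor`, and the toy models are hypothetical names, not the authors' method.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[int], bool]


def agreement_rate(surrogate: Model, llm: Model, inputs: Sequence[int]) -> float:
    """Fraction of calibration inputs on which both models give the same output."""
    if not inputs:
        raise ValueError("calibration set must not be empty")
    matches = sum(1 for x in inputs if surrogate(x) == llm(x))
    return matches / len(inputs)


def make_gated_predictor(
    surrogate: Model, llm: Model, inputs: Sequence[int], alpha: float
) -> Tuple[Model, float]:
    """Return a predictor that uses the surrogate only if agreement exceeds alpha."""
    rate = agreement_rate(surrogate, llm, inputs)
    chosen = surrogate if rate > alpha else llm  # otherwise fall back to the LLM
    return chosen, rate


# Toy stand-ins: the "LLM" flags multiples of 3; the surrogate is wrong on 7.
def toy_llm(x: int) -> bool:
    return x % 3 == 0


def toy_surrogate(x: int) -> bool:
    return x % 3 == 0 or x == 7


calibration = list(range(10))  # the two models agree on 9 of 10 inputs
loose, rate = make_gated_predictor(toy_surrogate, toy_llm, calibration, alpha=0.8)
strict, _ = make_gated_predictor(toy_surrogate, toy_llm, calibration, alpha=0.95)
```

With `alpha=0.8` the surrogate is activated, while `alpha=0.95` keeps the original model in place, so the threshold is the user's lever for trading cost against fidelity.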
Unknown section heading | Preserve; do not error
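Most people assume that a design system specification should be strict and enforce a particular structure and format to ensure consistency. The author argues that unknown section headings should be allowed and preserved without raising an error. This challenges the idea that such specifications must be tightly controlled and stresses that a specification should be flexible enough to fit different design needs and contexts.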
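A lenient parser in this spirit might look like the sketch below. It assumes sections are introduced by `## ` headings and that the set of known headings is `KNOWN_SECTIONS`; both are illustrative assumptions, not taken from the specification being quoted.

```python
from typing import Dict, List

# Hypothetical set of headings the tool understands.
KNOWN_SECTIONS = {"Colors", "Typography", "Spacing"}


def parse_sections(text: str) -> List[Dict[str, object]]:
    """Split a document into sections, keeping unknown headings instead of failing."""
    sections: List[Dict[str, object]] = []
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            title = line[3:].strip()
            # Unknown headings are flagged, not rejected.
            current = {"title": title, "known": title in KNOWN_SECTIONS, "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(line)
    return sections


doc = "## Colors\nprimary: #000\n## Motion\nease: out\n"
parsed = parse_sections(doc)
```

Here the unrecognized `Motion` section survives a parse with `known` set to `False`, so a tool can round-trip it untouched.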
The tokens are the normative values. The prose provides context for how to apply them.
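Most people assume that a design specification should prioritize precise technical values and constraints. The author argues that the prose (descriptive text) is as important as the tokens (normative values), perhaps more so, because it supplies the context for applying them. This challenges the idea that design systems are governed entirely by technical specifications and stresses the role of human context in design system automation.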
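Staying in generalist UX is the most dangerous position: it is exactly the middle layer most easily squeezed from above and below by AI and product managers.
Most people assume that a UX designer's core value lies in generalist user experience design, but the author argues that this role is at risk of being replaced in the AI era. This provocative view implies that designers need to move toward architecture-oriented or business-oriented roles or risk being squeezed by both AI and product management, and it reflects the industry's deeper thinking about the future of the designer's role.
Pattern is the layer most easily overlooked and also the most critical. It defines 'how to combine these components in a specific business scenario' and is where the real value of a design system lies in the AI era.
Most design system practitioners focus on component libraries and foundational guidelines, but the author argues that the Pattern layer is where a design system's core value lies. This contradicts mainstream thinking, because most teams pour resources into component development and neglect scenario-based pattern composition, which is exactly the most valuable part of a design system in the AI era.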
It happens several times a year in the US alone, often unreported, and about 100 times a year worldwide.
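Most people assume that lab leaks are rare, major events, but the author suggests that they are fairly common and under-reported. This overturns public assumptions about laboratory safety standards and implies that the problem is more widespread than generally believed.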
For me Ralph Baric's 2024 test testimony moved the lab leak hypothesis to pretty likely.
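Most people assume that Ralph Baric's testimony is not enough to change the scientific consensus on the origin of COVID-19, but the author believes it significantly raises the credibility of the lab leak hypothesis, which challenges the scientific community's usual standards of evidence.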
We run a similar model processing loan documents that would normally require a team of 15.
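Most people assume that complex business processes need specialist teams, but the author claims that AI can replace a team of 15. This challenges traditional staffing norms and implies that AI can sharply reduce headcount, though it may overlook AI's limits and risks in complex decisions.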
Our goal is $10M ARR [annual recurring revenue] with a sub-10 person org.
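Most people assume that high-revenue companies need many employees and complex organizational structures, but the author believes AI enables a minimal organization. This challenges conventional theories of business scale and implies that AI can upend the basic model of the firm, though it may overlook how hard human creativity and judgment are to replace.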
The real unlock is compound scaling—token spend grows linearly while output grows exponentially.
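Most people assume that AI output is proportional to input, but the author claims that output can grow exponentially while spending grows only linearly. This challenges conventional business thinking and implies that AI can generate outsized returns, though it may also mask the risk that AI's real benefits are being overstated.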
The screen you're reading this on is already presenting you an image, it's just generated with rigid code and rules that makes it difficult to communicate complex and detailed ideas.
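Most people think of what is on their screen as a functional interface built from code and rules, but the author argues that it is already an image, merely one constrained by rigid code. This challenges our understanding of what a UI is and implies that all interfaces are essentially visual representations that differ only in flexibility.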
The entire web is just generated pixels on your screen.
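Most people think of web pages as made of HTML, code, and specific links, but the author argues that the entire web is just generated pixels on a screen. This is a disruptive view that challenges conventional thinking about the nature of the internet. If it holds, it would change how we understand the structure of the web and the presentation of information.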
threat actors linked to recent attacks attributed to the ShinyHunters extortion gang have denied to BleepingComputer that they are involved in this incident.
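Most people assume that hacking groups readily claim their attacks to build reputation, but the author notes that ShinyHunters-linked actors denied involvement in this incident. This runs against the common industry assumption that such groups claim attacks to strengthen their intimidation.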
the initial access occurred after a Vercel employee's Google Workspace account was compromised via a breach at the AI platform Context.ai.
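Most people assume that breaches of major cloud platforms come mainly from direct external attacks, but the author indicates that this incident was caused indirectly, through a breach at the third-party AI platform Context.ai. This challenges common assumptions about supply chain security risk.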
In addition to empowering developers and agents to handle project setup and boilerplate code, we've also designed these new tools and resources to make it easier to transition to Android Studio.
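Most people assume that CLI tools and AI agents will replace the traditional IDE as the mainstream of development. The author implies that these tools are only a bridge to Android Studio and that the IDE is still needed to finish a high-quality app, which contradicts the mainstream prediction that 'the CLI will replace the IDE'. This challenges the industry consensus on where developer tooling is heading.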
By accessing the frequently updated knowledge base, agents can ground their responses in the most recent information from Android developer docs, Firebase, Google Developers, and Kotlin docs. This ensures that even if an LLM's training cutoff is a year old, it can still provide guidance on the latest frameworks and patterns we recommend today.
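Most people assume that an outdated LLM cannot give current technical guidance and must be retrained to handle new frameworks. The author claims that even if an LLM's training data is a year old, it can still give guidance on the latest frameworks through the knowledge base. This challenges the industry consensus that 'LLMs must be updated regularly to stay current'.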
Android skills cover some of the most common workflows that some Android developers and LLMs may struggle with—they help models better understand and execute specific patterns that follow our best practices and guidance on Android development.
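Most people assume that AI models should be able to learn and understand best practices on their own, without a dedicated skill set. The author implies that AI models struggle with some 'common workflows' in Android development and need dedicated skills to compensate. This challenges the industry consensus that 'AI should be able to learn on its own'.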
The new Android CLI serves as the primary interface for Android development from the terminal, featuring commands for environment setup, project creation, and device management—with more modern capabilities and easy updatability in mind.
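Most people assume that a graphical IDE such as Android Studio is better suited to Android development than command-line tools, especially for complex projects. The author positions the CLI as the 'primary interface', implying that it may be preferable to a traditional IDE. If true, this would overturn developers' conventional belief that an IDE is necessary.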
Whether you are using Gemini in Android Studio, Gemini CLI, Antigravity, or third-party agents like Claude Code or Codex, our mission is to ensure that high-quality Android development is possible everywhere.
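Most people assume that AI agent tools differ significantly in performance and that the best tool must be chosen for each scenario. The author implies that high-quality development is possible with any agent, which runs against industry consensus. This may challenge the developer community's conventional view of performance differences between AI agent tools.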
In our internal experiments, Android CLI improved project and environment setup by reducing LLM token usage by more than 70%, and tasks were completed 3X faster than when agents attempted to navigate these tasks using only the standard toolsets.
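Most people assume that AI agent tools consume large numbers of tokens and are inefficient, but the author claims that Android CLI cuts token usage by more than 70% and completes tasks 3X faster. If true, this would change developers' view of how efficient AI-assisted tooling can be and challenges the industry consensus that 'AI agents inevitably consume a lot of resources'.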
Capture sequences and replay them as stable automations.
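Most people assume that workflow automation requires dedicated automation tools or scripting and struggles with complex authentication and state changes, but the author claims that Kampala achieves stable automation through simple traffic capture and replay. This challenges the traditional tools and methods of process automation.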
coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ('vibe coding'), while in 23%, humans write all code themselves.
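Most people see AI coding assistants and humans as collaborators, each with their own strengths, but the author finds that actual usage is polarized: either near-total reliance on AI-generated code ('vibe coding') or rejecting AI and writing everything by hand. This discontinuous adoption pattern challenges conventional views of human-AI collaboration.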
The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution.
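Although most AI researchers believe that self-evolution can improve performance, few have shown that the improvement consistently beats strong baselines across several challenging benchmarks. The authors claim that their AGS system not only self-evolves but does so in a closed, auditable loop. This challenges the AI community's current view of self-evolving systems and implies that a more structured approach to evolution may be more effective than an open-ended one.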
Its Self Evolution Protocol Layer (SEPL) specifies a closed loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback.
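Most people assume that the self-evolution of AI systems should be an open-ended, continuous process, not a closed-loop operation with clear boundaries and traceability. The SEPL layer proposed by the authors emphasizes a structured approach in which every improvement can be audited, traced, and rolled back. This contradicts the AI community's mainstream view of open-ended evolution and may offer a safer but more constrained path.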
We introduce Autogenesis Protocol (AGP), a self evolution protocol that decouples what evolves from how evolution occurs.
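Most people think of AI system evolution as a single integrated process, with the focus on how evolution is achieved. The authors propose a radical separation that decouples what evolves from how evolution occurs, breaking with conventional system design. This separation may make the evolution of AI systems more controllable and predictable, in sharp contrast to today's mainstream integrated approaches.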
Email is the most accessible interface in the world. It is ubiquitous. There's no need for a custom chat application, no custom SDK for each channel.
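Most people consider email an outdated channel that should be replaced by modern chat apps and APIs, but the author argues that email is 'the most accessible interface' and even more universal than dedicated chat applications, because users do not need to install a new app or use a channel-specific SDK. This challenges the tech industry's mainstream view of real-time communication channels.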
Cursor still uses and sells access to Claude and GPT models even as both firms roll out their own coding tools, an awkward arrangement that this new SpaceX partnership may be designed to eventually escape.
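Most people might assume that Cursor should focus on its own product, but the author notes that Cursor still uses and sells access to Claude and GPT models even as those firms launch their own coding tools. This awkward arrangement may be exactly what the SpaceX partnership is meant to escape.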
Either figure would represent a significant expense for SpaceX, which is widely seen to be losing money following the acquisition of xAI and the social media network X and is planning extensive capital investment.
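The common view is that SpaceX has been losing heavily since acquiring xAI and the social media network X, but the author suggests that SpaceX may be seeking new value by investing in Cursor, which sits uneasily with the mainstream picture of SpaceX's financial strain.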
Neither Cursor nor xAI has proprietary models that can match the leading offerings from Anthropic and OpenAI — the same companies now competing directly with Cursor for the developer market.
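Most people assume that Cursor and xAI have a distinctive technical edge in AI, but the author points out that neither has models that match leading firms such as Anthropic and OpenAI, and that those firms now compete with Cursor directly.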
It also surpasses all peer-scale dense models by a wide margin.
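In most cases people would expect larger models to perform better, but the author claims that Qwen3.6-27B stands out among dense models of comparable scale, which runs against that mainstream expectation.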
With only 27B parameters, it outperforms the Qwen3.5-397B-A17B (397B total / 17B active) on every major coding benchmark
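It is usually assumed that more parameters mean better performance, but the author claims that the 27B-parameter Qwen3.6-27B beats the much larger Qwen3.5-397B-A17B on every major coding benchmark, which contradicts the conventional view.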
It also surpasses all peer-scale dense models by a wide margin.
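Most people probably assume that model performance is proportional to scale, but the author notes that Qwen3.6-27B stands out within its size class, surpassing all dense models of comparable scale. This challenges conventional assumptions about the relationship between scale and performance.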
We spent days loading the system with hundreds of threads, refining rough edges and polishing corners that developers may never see.
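Most people probably assume that software development should focus on visible results, but the author stresses attention to details that developers may never see, which challenges this efficiency-first view.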
Multi-agent orchestration isn't new, but we believe we've built a great experience for working with agents at scale.
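Although multi-agent orchestration is not new, the author believes the team has made significant progress on it, which may run against the industry's general view of existing solutions.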
What we've found works best for crafting high-quality software is somewhere in between: using AI, and also engaging directly with code.
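Most people might assume that building high-quality software depends either entirely on AI or entirely on human engineers, but the author proposes a middle path: using AI while also engaging directly with the code.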
At one extreme, there's [fully giving into the vibes], and at the other extreme, there's [disabling all AI features].
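The conventional view may be that AI in software development is either fully adopted or fully rejected, but the author proposes a middle path, which runs against mainstream thinking.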
Ask ten different programmers how they use AI, and you can get ten different answers.
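Most people assume that everyone uses AI in the same way, but the author points out that programmers use AI in many different ways, which challenges that assumption of uniformity.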
All imagine that in the not-too-distant future many of us will designate some tasks that we currently undertake with our own brains and fingers on a physical PC to an agent that uses a virtual PC.
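Most people might assume that humans will not readily delegate tasks to AI agents, but the author describes a future in which many tasks are done by AI agents, which challenges conventional views of human dependence on technology.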
Meta feels AI models don’t understand how people use computers, so the company needs real-life examples of how meatbags click their way through a working day so it can build agents.
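Most people assume that AI models understand human behavior well, but the author notes that Meta believes AI models do not understand how people use computers, which challenges common assumptions about the technology.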
The fact that the RL model has larger improvements on Levenshtein Distance and Added Cognitive Complexity than on Pass@1 is further evidence that it is not just memorizing corruption reversals but has actually generalized to minimal editing.
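Most people assume that reinforcement learning models can only memorize specific cases, but the author finds that on the minimal editing task the RL model does more than memorize: it generalizes to broader scenarios.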
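For readers unfamiliar with the metric named in the quote, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version. How the source applies it to code edits (for example, per character or per token) is not stated, so the character-level comparison here is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A smaller distance between the original and the edited code means the edit touched less, which is why the quote treats it as a signal of minimal editing.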
Among the latest frontier models, GPT-5.4 over-edits the most.
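Most people regard GPT-5.4 as the most advanced model, but the author notes that it over-edits the most among the latest frontier models, which challenges the common view of its abilities.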
Workspace agents can gather context from the right systems, follow team processes, ask for approval when needed, and keep work moving across tools.
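Many people might assume AI tools struggle to understand and carry out complex team processes, but the author stresses that workspace agents can follow those processes, challenging assumed limits on AI in complex tasks.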
AI has already helped people work faster on their own, but many of the most important workflows inside an organization depend on shared context, handoffs, and decisions across teams.
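Most people think AI mainly improves individual productivity, but the author points out that many of an organization's most important workflows depend on shared context and cross-team handoffs, which pushes beyond the individual-level use of AI.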
Admins can also manage who has access to use, build, and share agents.
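Many people assume AI tools should be used and managed in a fully automated way, but the author notes that admins can control who is allowed to use, build, and share agents.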
They run in the cloud, so they can keep working even when you’re not.
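The usual assumption is that AI tools need a human operator, but the author notes that workspace agents run in the cloud and can keep working without human intervention.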
But the real power of agents comes when they can work as a team.
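Although AI agents have already shown what they can do working alone, the author stresses that their real power lies in teamwork, contrary to the common view that individual agents dominate.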
But the real power of agents comes when they can work as a team. Instead of lone-wolf bots carrying out single tasks, such as using a browser to make a restaurant reservation or sending you a summary of your inbox, new tools can yoke together multiple agents, give each of them a different job, and orchestrate their behaviors so that they all pull together to complete more complex tasks than an individual agent could do by itself.
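The mainstream view may be that AI agents will complete work independently, but the author argues that their real power lies in teamwork: coordinated agents can complete tasks more complex than any single agent could handle.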
Chinese labs, for their part, are not purely idealistic: Open-source is not only free advertising but also a shrewd workaround. Without access to cutting-edge chips restricted by US export controls, releasing models openly accelerates the cycle of external feedback and contributions that compensates for constrained compute.
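Most people assume Chinese open-source AI stems from idealism or technical confidence, but the author argues it is in fact a strategic workaround. Without access to the high-end chips restricted by US export controls, Chinese labs release models openly to accelerate the external feedback loop and compensate for constrained compute, a pragmatic rather than idealistic strategy.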
to stand out from the AI-generated pack we will need to become so weird and unexpected as to be off-putting to most people
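The mainstream view holds that human creativity will outshine AI-generated content, whereas the author argues the opposite: human creators will need to become unusually weird to stand apart from AI. This counterintuitive conclusion implies that AI has changed the rules of the creative field, so that ordinary human creativity is no longer distinctive enough.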
The AI has learned to code. The AI is building itself.
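Most people see AI as just a tool built by humans that requires ongoing human supervision and improvement. The author claims AI can now evolve and build itself, which challenges the view of AI as a passive tool, hints at the possibility of technological autonomy, and runs against most people's expectations for AI development.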
Is this what we signed up for? Is today the day? Did the drones wake up? Did it achieve consciousness? Is it alive?
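Most people believe AI is still an unconscious tool, but through a series of questions the author hints that AI may have reached some form of consciousness or autonomy. This challenges the mainstream view of AI as merely a complex algorithm and suggests it may have crossed a threshold into some form of 'life', a highly contested, non-consensus position.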
We have not really begun to make this progress with AI. Why, for example, is this dashboard not found on a government website?
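Most people assume governments and regulators are actively responding to the challenges of AI, but the author points out that we have not even begun to respond to AI as systematically as we did to COVID-19. This challenges the mainstream belief that AI is already adequately regulated and managed, and implies that oversight lags far behind the technology.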
The 21st-century average American lies in bed staring at their phone. ... Talking for hours and ages to melted sand.
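Most people think we are simply using AI tools, but the author describes human interaction with AI as talking 'for hours and ages' to 'melted sand', implying an unhealthy dependence. This challenges the view of AI as a purely practical tool and suggests it is becoming a substitute for human emotional and social relationships.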
AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.
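Most people think AI still plays a supporting role in security and needs guidance and supervision from human experts. The author argues that AI already surpasses all but the most skilled humans at finding and exploiting software vulnerabilities. This is a disruptive claim because it challenges the traditional dominance of humans in cybersecurity.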
RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%
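Most researchers assume that improving a reasoning model requires more compute and more reasoning steps. The RED framework suggests otherwise: by suppressing the growth of the error forest and pruning subsequent reasoning, it achieves better performance while sharply reducing token consumption, which challenges the assumption that performance rises with resources spent.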
alternative solutions are not merely suboptimal but potentially detrimental
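Most people assume that in complex reasoning tasks, even if the first solution is imperfect, exploring alternatives at least does no harm. The author argues that these alternatives can be actively harmful, introducing new errors and contaminating the whole reasoning process, contrary to the best practice of exploring multiple solutions.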
We characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best
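The mainstream view treats reasoning errors as random and isolated, avoidable through more exploration. The author instead argues that errors have a forest structure in which they influence and amplify one another, and this systemic view challenges the conventional understanding of model errors.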
This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time.
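Most AI researchers assume that the longer a model reasons, the more thoroughly it explores and the better the result. The author challenges this consensus, hypothesizing that errors within the reasoning path grow along with test time, so that long reasoning can reduce quality.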
The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental.
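Most people believe that exploring multiple solutions in large reasoning models improves the final result, since it resembles how humans think from several angles. The author argues that the first solution is actually the best and that later alternatives are not only worse but potentially harmful, contrary to mainstream reasoning-model design.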
Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations.
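Most people assume high-quality annotation requires human experts or a single powerful model, but the author proposes using output agreement among multiple heterogeneous models to assess sample difficulty and generate reliable annotations. This challenges the belief that human annotation is always best and shows the potential of collaboration between models.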
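The idea of using agreement among heterogeneous models can be sketched as a simple vote. This is a minimal illustration only: the function name `consistency_label`, the three difficulty tiers, and the 0.5 cutoff are assumptions made for the example, not the procedure described in the source.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def consistency_label(outputs: Sequence[str]) -> Tuple[Optional[str], float, str]:
    """Return (label, agreement, difficulty) from several models' outputs.

    agreement is the share of models that produced the most common output.
    The tiers below are hypothetical: full agreement is 'easy', a strict
    majority is 'medium', and anything else is 'hard' with no label emitted.
    """
    if not outputs:
        raise ValueError("need at least one model output")
    label, votes = Counter(outputs).most_common(1)[0]
    agreement = votes / len(outputs)
    if agreement == 1.0:
        return label, agreement, "easy"
    if agreement > 0.5:
        return label, agreement, "medium"
    return None, agreement, "hard"  # no reliable annotation without a majority


# Hypothetical outputs from three different models for the same sample.
unanimous = consistency_label(["$42.00", "$42.00", "$42.00"])
split = consistency_label(["$42.00", "$42.00", "$4200"])
disputed = consistency_label(["$42.00", "$4200", "42"])
```

Samples that land in the 'hard' tier are the natural candidates for human review or extra training attention, which is how agreement doubles as a difficulty signal.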
Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
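Most people assume a larger architecture necessarily brings better performance, but the author, using only data engineering and training-strategy optimization while keeping the 1.2B-parameter architecture unchanged, surpasses existing models with over 200× more parameters. This challenges the industry consensus that bigger is better and demonstrates the importance of data quality.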
amplifies the false narrative that technology and creativity are at odds, and that existing rights holders must be compensated by AI companies for changing industry dynamics.
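Most people see a fundamental conflict between technological innovation and creative protection, but the author calls that a false narrative. This argument breaks with the binary thinking that technological progress must harm creators' interests and implies the two can coexist and both benefit.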
introducing a commercial text and data mining exception for AI training would expand the AI sector in the country.
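Most people assume that loosening restrictions on data mining promotes AI innovation and growth, but the author argues that such an exception would not actually expand the AI sector. This runs against the tech industry's widely promoted belief that more data means better AI and challenges the mainstream narrative of free data flow.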
Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation.
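Most people think an AI system's value depends mainly on fluent output, but the author argues agentic AI is increasingly judged by whether it can act, remember, and verify in complex environments, which challenges current mainstream standards for AI evaluation.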
by "saving" the webpage (`file->save as`) instead of downloading it (which Safari automatically adds an extension for) I could force it to save it as `malicious_file` (with no extension).
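Most people assume a browser's save function is safe and handles file extensions automatically to ensure the file type is correct. The author found that by combining a non-standard Content-Type with the save-webpage function, it is possible to bypass Safari's safety checks and save a file under an arbitrary name, including one with no extension, which breaks common assumptions about the browser's file-handling safeguards.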
Because Deep Extract is doing more work, it takes longer than a standard extraction call. That said, measured against the real alternative of someone manually reviewing a 500-page fund statement field by field, it's faster, cheaper, and consistent at scale.
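Most people assume a more complex pipeline necessarily means higher cost and lower speed. The author notes that although Deep Extract does more work and takes longer than a standard extraction call, it is still faster, cheaper, and more consistent at scale than manual review, which challenges conventional thinking about the relationship between complexity and efficiency.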
Given a thousand line items to extract, they'll often stop short, consolidate, or skip entries rather than working through every last row.
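Most people might assume AI models stay consistent and thorough on repetitive tasks. The author points out that when given a large number of repetitive items, models take 'shortcuts' such as stopping early, consolidating, or skipping entries. This reveals a failure mode in long-document processing and challenges the assumption that AI executes instructions exhaustively.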
His affiliates, armed with AI, built fake doctor profiles in Meta ads and made unscrupulous claims about weight loss using fake testimonials.
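Most people think AI mainly boosts productivity and creativity, but the author shows how it was used for deception and exploitation at scale, building fake doctor profiles and false advertising. This exposes a dark side of the technology, challenges optimistic assumptions about AI's value, and is a reminder of the ethical questions behind claims of technological neutrality.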
The cost of understanding what happens in a video has dropped by a factor of roughly 40, while the quality of that understanding has improved dramatically.
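Most people assume AI video analysis is still early-stage and expensive, but the author notes that its cost has fallen by a factor of roughly 40 while quality has improved. This suggests video analysis may have crossed the threshold of practicality and could enable entirely new categories of applications.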
The most encouraging finding is that one doesn't need an infinite budget. We found that by optimizing the ratings-per-item ratio correctly... one can achieve highly reproducible results with a modest budget of around 1,000 total annotations.
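Most people assume high-quality AI evaluation needs a large budget and a large amount of data, but the author shows that by optimizing the ratings-per-item ratio, highly reproducible results are achievable with a modest total of around 1,000 annotations. This challenges the common belief that more is always better and offers a practical evaluation path for resource-limited research teams.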
we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world.
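Most people think world models are mainly about prediction and generation, but the author proposes that a world model is centered on perception and must also have interaction and long-term memory capabilities. This broader definition challenges narrower understandings in the field, extending beyond traditional predictive models by making interactivity and memory core elements.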
Reconstructing raw inputs forces models to model irrelevant low-level detail. Predicting in a learned embedding space allows the model to focus on semantically meaningful, causally relevant features.
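Most people assume AI models must reconstruct the full input in order to understand the world, but the author argues that doing so forces the model to attend to irrelevant low-level detail. Predicting in a learned embedding space instead lets the model focus on semantically meaningful, causally relevant features, a counterintuitive insight.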
Whether or not this specific bet pays off, the underlying argument that the next meaningful leap in AI capability requires moving beyond language modeling is increasingly hard to dismiss.
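Although language models currently dominate AI, the author argues that the next meaningful leap in capability requires moving beyond language modeling. This runs against the industry mainstream and suggests we may be at a turning point between AI paradigms.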
LLMs have no grounded understanding of the physical world. They model the statistical distribution of language about reality, not reality itself.
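Most people assume large language models understand reality by learning knowledge about the physical world, but the author argues they only learn the statistical distribution of language about reality, not reality itself. This is counterintuitive because it challenges common assumptions about what AI understands.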
Whether or not this specific bet pays off, the underlying argument that the next meaningful leap in AI capability requires moving beyond language modeling is increasingly hard to dismiss.
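Most people expect AI to keep advancing along the path of language models, but the author argues that a real breakthrough requires moving beyond the language-modeling paradigm. This challenges the mainstream narrative of AI development and implies that its direction needs a fundamental rethink.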
The clustering of capital and talent around this problem is itself a signal. The applications that most clearly benefit from world models are those where LLMs have struggled most.
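Most people assume capital and talent should concentrate where current AI performs best, but the author argues that world models are attracting investment precisely because LLMs perform poorly in key areas. This challenges the mainstream logic of resource allocation and suggests real breakthroughs may come from addressing the weaknesses of existing systems.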
AMI Labs is not building a product for immediate deployment. This is a fundamental research effort, likely measured in years before commercial applications emerge.
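In today's AI environment, which prizes rapid commercialization, most people expect AI research to turn into products quickly. The author points out that AMI Labs is doing fundamental research rather than building a product for immediate deployment, which challenges the tech industry's expectation of instant commercialization and underscores the importance of basic research.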
LLMs have no grounded understanding of the physical world. They model the statistical distribution of language about reality, not reality itself.
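Most people assume large language models understand reality by learning knowledge about the physical world, but the author argues LLMs have only learned the statistical distribution of text about reality, without any direct understanding of reality itself. This challenges assumptions about the nature of LLM capabilities and implies that current AI systems have a fundamental gap in understanding.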
You have to have people that have the ability to rethink the workflow at a scale that AI can execute, versus at a scale that humans can execute.
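Most people think AI should adapt to existing workflows, but the author argues the reverse: people need to redesign workflows to fit what AI can execute. This counterintuitive view stresses that successful AI adoption requires not just technology but a fundamental shift in organizational thinking, from workflows scaled for human execution to workflows scaled for AI execution.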
95% of organizations are getting zero return on AI deployed, with most failures found due to 'brittle workflows.'
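Despite surging AI investment, the vast majority of organizations have seen no return, contrary to the mainstream belief that AI significantly improves efficiency. The finding suggests the main cause of failure is not the technology itself but poorly designed workflows, and that companies need to rethink how AI is integrated into existing workflows rather than simply layering technology on top.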
95% of organizations are getting zero return on AI deployed, with most failures found due to 'brittle workflows.'
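Despite surging AI investment, the vast majority of organizations have seen no return. This contrasts sharply with the mainstream belief that AI delivers significant benefits automatically, and it implies that the main problem lies not in the technology itself but in poorly designed workflows, a counterintuitive finding.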
our framework enforces strict context isolation to prevent saturation and error propagation
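In the large language model field, ever-larger context windows are treated as progress, but the author proposes strict context isolation to prevent saturation and error propagation, challenging the mainstream view that a larger context window is always better.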
it contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task.
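Most people assume AI evaluation can be done through relatively simple automated pipelines. The author's benchmark, however, involves over 2,000 stepwise checkpoints and an average of more than 10 person-hours of manual annotation per task, which implies that the complexity and cost of genuinely evaluating agent capabilities far exceed common industry assumptions. This challenges efficiency-first thinking in AI evaluation and underscores that high-quality evaluation requires substantial human effort.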
Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along dual-axis: S-axis and V-axis.
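Most people assume AI evaluation can be done with simple automated tests. The author instead calls for human reference trajectories annotated along two axes (S-axis and V-axis) plus sandboxed environment support, which implies that evaluating agent capabilities is far more complex than the industry generally assumes. This challenges reductionist tendencies in AI evaluation and stresses that human involvement remains irreplaceable.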
Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks
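Most people assume the most advanced multimodal models already approach or exceed human performance on complex tasks. The author's data show that even the best model falls well short on complex real-world tasks, with accuracy dropping from 56.3% overall to 23.0% on Level-3 tasks. This challenges optimistic assessments of current capabilities and reveals how difficult real-world multimodal agent tasks are.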
a symbolic-logic-based Feasibility Memory utilizes executable Python verification functions synthesized from failed transitions
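Most people think LLMs should learn from successful experience, but the author's idea of synthesizing verification functions from failed transitions is highly counterintuitive. The approach treats failure as a valuable resource rather than something to avoid, challenging mainstream optimization thinking in machine learning.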
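A rough sketch of what a memory of executable verification functions could look like follows. Everything here is hypothetical: the `FeasibilityMemory` class, the dictionary state format, and the hand-written `door_must_be_unlocked` rule stand in for functions that, in the source, are synthesized from failed transitions.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Verifier = Callable[[State, str], bool]


class FeasibilityMemory:
    """Stores executable checks and rejects actions that any check forbids."""

    def __init__(self) -> None:
        self._verifiers: List[Verifier] = []

    def add(self, verifier: Verifier) -> None:
        self._verifiers.append(verifier)

    def is_feasible(self, state: State, action: str) -> bool:
        # An action is allowed only if every stored rule accepts it.
        return all(check(state, action) for check in self._verifiers)


def door_must_be_unlocked(state: State, action: str) -> bool:
    """Hypothetical rule recorded after an 'open door' attempt failed on a locked door."""
    if action == "open door":
        return not state.get("door_locked", False)
    return True  # the rule has no opinion about other actions


memory = FeasibilityMemory()
memory.add(door_must_be_unlocked)

blocked = memory.is_feasible({"door_locked": True}, "open door")
allowed = memory.is_feasible({"door_locked": False}, "open door")
unrelated = memory.is_feasible({"door_locked": True}, "pick up key")
```

Because each rule is ordinary code, a proposed action can be checked before it is executed, which is the strict logical constraint the excerpt contrasts with fuzzy semantic planning.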
we propose a Neuro-Symbolic Dual Memory Framework that explicitly decouples semantic progress guidance from logical feasibility verification
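Most AI researchers believe neural networks and symbolic logic should be fused rather than separated, but the author takes a more radical position: explicitly decoupling semantic guidance from logical verification. This dual memory framework contrasts sharply with the current trend toward fusion and challenges the mainstream direction of neuro-symbolic computing.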
these two challenges are fundamentally distinct: the former relies on fuzzy semantic planning, while the latter demands strict logical constraints
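Mainstream AI research usually treats semantic planning and logical verification as problems that can be handled together, but the author states plainly that they are fundamentally distinct challenges. This runs against most current LLM agent methods and points to the limits of a single neural architecture.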
the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller
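Most people assume GPT-4-level performance requires a model of the same size or larger, but the author shows that their 4B model not only exceeds GPT-4.1 and GPT-4o but is also 50 times smaller. This challenges the field's reliance on model scale and suggests algorithmic innovation may be more effective than scaling alone.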
naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction
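Most people assume that adding denser per-turn rewards strengthens an agent's learning and improves performance, but the author finds that naively designed ones actually degrade performance by up to 14 percentage points. This challenges the common reinforcement learning intuition that more reward signal is better and reveals a subtle balance in reward design.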
the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller
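Most people believe model size correlates directly with performance, so a larger model must perform better. The author shows that a model with only 4 billion parameters, after reinforcement learning training, outperforms GPT-4.1 and GPT-4o despite being 50 times smaller, challenging the mainstream view that parameter count determines everything.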
naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction
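Most people assume denser per-turn reward signals improve learning, but the author finds that naively designed dense rewards actually degrade performance by up to 14 percentage points, because reward discriminativeness is misaligned with advantage direction. This challenges the intuition in reinforcement learning that more reward is better.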
current systems remain highly vulnerable
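Although AI safety has made notable progress in recent years, the author asserts that current systems remain highly vulnerable. This conclusion, at odds with industry optimism and based on hands-on testing of several mainstream agent systems, implies that AI safety problems may be far more serious than the industry admits.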
harmful behavior may emerge through sequences of individually plausible steps
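The mainstream view usually focuses on single harmful instructions or directly dangerous actions, but the author points out that dangerous behavior in computer-use agents often accumulates through a series of seemingly reasonable steps. This challenges traditional safety evaluation and implies that we need to examine an agent's sequence of actions rather than single operations.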
current systems remain highly vulnerable
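Although AI safety research has made notable progress, the author's AgentHazard benchmark shows that even state-of-the-art computer-use agent systems remain extremely vulnerable, which challenges the common belief in academia and industry that AI safety is already good enough.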
harmful behavior may emerge through sequences of individually plausible steps
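The mainstream view holds that harmful AI behavior usually stems from obviously unreasonable instructions, but the author points out that dangerous behavior often forms gradually through a series of seemingly reasonable steps, each acceptable on its own but harmful in combination. This incremental risk model challenges traditional safety evaluation.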
Every prompt is a flag in disguise
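Most people consider interactive prompts a best practice for CLI tools because they guide users through complex tasks. The author argues that every interactive prompt should have a corresponding command-line flag, because that design lets the tool serve human users and also be automated by AI agents without an extra API layer.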
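One way to read "every prompt is a flag in disguise" is that a prompt is only the fallback for a flag that was not supplied. The sketch below uses Python's standard `argparse`; the `deploy` program name, the `--environment` and `--yes` flags, and the `resolve` helper are hypothetical names chosen for illustration.

```python
import argparse
from typing import Callable, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="deploy")
    # Each flag mirrors exactly one interactive question asked in resolve().
    parser.add_argument("--environment", choices=["staging", "production"])
    parser.add_argument("--yes", action="store_true", help="skip the confirmation prompt")
    return parser


def resolve(args: argparse.Namespace, ask: Callable[[str], str] = input) -> Tuple[str, bool]:
    """Use flag values when given and fall back to prompting only for what is missing."""
    environment = args.environment or ask("Environment [staging/production]: ").strip()
    confirmed = args.yes or ask(f"Deploy to {environment}? [y/N] ").strip().lower() == "y"
    return environment, confirmed


# An agent or script supplies every answer as a flag, so nothing blocks on stdin.
scripted_args = build_parser().parse_args(["--environment", "staging", "--yes"])
```

A human can run the tool with no arguments and be walked through the same questions, while an automated caller passes the flags and never sees a prompt. Passing `ask` as a parameter also keeps the prompting logic easy to test.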
queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning.
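Most people think RoPE (rotary position embedding) improves a model's ability to distinguish positions, but the author argues that because queries rotate with position, very few queries are representative, which degrades top-key selection and destabilizes reasoning. This is counterintuitive because RoPE is generally regarded as an improvement in position encoding.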
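The effect described above, where the same query points in a different direction at every position, can be seen in a few lines. This sketch assumes the interleaved-pair RoPE formulation with base 10000; the excerpt does not say which variant the authors use, and `rope_rotate` and `cosine` are illustrative helper names.

```python
import math
from typing import List


def rope_rotate(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    rotated: List[float] = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower dimensions rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0, 0.0]
near = rope_rotate(query, position=1)
far = rope_rotate(query, position=100)

# Same underlying query, different positions: the direction drifts apart.
similarity_near = cosine(query, near)
similarity_far = cosine(query, far)
```

The rotation preserves the vector's length but not its direction, so a query taken at one position is a poor stand-in for the same query at a distant one, which is consistent with the excerpt's point about representative queries being scarce.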
Built-in video and music generation
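Most people assume AI systems need dedicated modules or plugins to generate multimedia content, but the author suggests OpenClaw has these capabilities 'built in', indicating a highly integrated architecture and challenging the conventional modular design of AI systems.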
The memory system has learned to "dream"
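Most people think of AI 'learning' as processing data with algorithms, while 'dreaming' is usually seen as an unconscious mental activity unique to humans. The author suggests OpenClaw has developed a creative process beyond conventional learning, which challenges mainstream assumptions about the limits of AI capability.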
OpenClaw version 2026.4.5
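Most people assume AI version numbers only mark feature updates or bug fixes, but this one uses a date (April 5, 2026) directly as the version number, hinting that it may be a milestone release and departing from conventional software versioning.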
Built-in video and music generation. The memory system has learned to "dream"
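Most people think an AI memory system is just data storage and retrieval, but the author suggests OpenClaw's memory system has developed something like the human ability to 'dream', a creative, associative, higher-order cognitive function that challenges conventional views of AI memory.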
This class of bug is insidious because it evades every layer of defense. It will not be caught in development testing — who runs a test for 50 days? It will not be flagged in code review — the logic looks perfectly reasonable.
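Most people assume code review and testing catch most systemic defects, but the author argues that this bug's nature lets it evade every routine detection method. This challenges a basic assumption of software quality assurance and implies that some defects only appear under extreme conditions that normal development processes cannot cover.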
Once frozen, TIME_WAIT connections never expire, ephemeral ports slowly exhaust, and eventually no new TCP connections can be established at all. ICMP (ping) keeps working. Everything else dies.
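Most people assume a network fails completely only when the operating system crashes, but the author shows that macOS can fall into network paralysis while appearing entirely normal, because only the TCP stack fails while ICMP keeps working. This state of 'partial system death' is highly counterintuitive: the system does not crash or report an error, and TCP connections simply stop working.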
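The excerpts mention a 50-day horizon without saying where it comes from. One common source of a figure like that is a 32-bit counter that ticks once per millisecond; that is an assumption here, not something the excerpts state, and the quick arithmetic below only shows why such a counter would wrap just short of 50 days. The name `wrap_days` is illustrative.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def wrap_days(bits: int = 32) -> float:
    """Days until an unsigned millisecond counter of the given width wraps around."""
    return 2 ** bits / MS_PER_DAY


# Assumption: a 32-bit, millisecond-resolution counter (not confirmed by the excerpt).
days_until_wrap = wrap_days(32)
```

Under that assumption the counter wraps after about 49.7 days of uptime, far longer than a typical test run lasts, which fits the excerpt's point that development testing would not catch it.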
Looking at the code and having opinions on architecture is seen as just as 'bad' as calling a compiled C module from an interpreted language was seen back in the day... it's not bad, it's actually quite practical, but it violates some strange 'purity'.
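The author compares the extremism of 'vibe coding' to the advocates of 'purity' in the history of programming languages and frameworks, arguing that both insist on impractical standards of 'purity'. This challenges the tradition of pursuing 'purity' in software development and suggests the pursuit can actually be harmful, getting in the way of practicality and efficiency.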
The AI is actually very good at this, especially if you have a conversation with it beforehand. That's what Ask mode is for.
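The mainstream view is that AI tools are mainly suited to generating code or automating simple tasks, but the author argues that AI does very well at code review and architecture discussion, provided there is a thorough conversation beforehand. This challenges conventional assumptions about AI's capabilities and suggests AI can be an equal partner in architecture discussions, not just a code generator.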
Looking under the hood is cheating. You're only supposed to have vague conversations with the machine about what it's doing.
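Most people regard reading and reviewing code as standard practice in software development, but the author describes a 'vibe coding' culture that treats it as 'cheating' and encourages developers to avoid looking at the underlying implementation at all. That runs against basic software engineering principles, where code review is generally considered a key step for improving quality and finding problems.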
I feel confident, though, that the slippery feeling people associate with AI products is a solvable problem, and the solution looks more like thoughtful interface design than better models. The models will keep improving on their own. The harder work is building the structure around them so that their output feels reliable, legible, and trustworthy.
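Most people assume the reliability of AI products will improve as models advance, but the author argues the real challenge lies in building the structure and interface around the model, not in the model itself. This challenges technological determinism in the AI field and stresses the importance of design.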
The more important work happens before the agent even starts. An agent operating inside a well-designed system already has the context and constraints it needs to do good work. In Linear, that means project plans, issue backlogs, code, and documentation. These all shape what the agent does and how it does it.
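Most people think responsibility for AI systems lies in real-time monitoring and intervention, but the author argues the more important work lies in up-front system design and environment building. This shifts accountability from real-time interaction to the system design stage, challenging conventional thinking about AI governance.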
The first interface that spread for AI tools was the chat window. That makes sense. When you don't know what something can do, the safest approach is to let people ask. A conversation feels familiar, it stretches across many situations, and it doesn't force a specific structure up front.
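Most people consider the chat interface the ideal form of AI interaction because it is intuitive and flexible, but the author implies it is only a tool for the exploration phase, not a solution for serious work. This challenges the current dominance of chat interfaces in AI tool design.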
AI is a way to level the playing field, for sure! Successful writers have always operated with a lot of support around them, but not everyone has access to those resources.
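Most people think AI writing will deepen inequality, but the author sees it as a democratizing tool that gives people without traditional writing resources access to professional-level support. This challenges elitist criticism of AI writing and suggests it may narrow rather than widen gaps in the creative field.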
When I sit down to write a piece, and before I even write a word, I have the agent interview me. It asks questions to draw out what I'm thinking about the topic.
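Most people assume AI writing begins with the human giving ideas to the AI, but the author shows the reverse: the AI first interviews the human to draw out ideas. This reversal challenges assumptions about the direction of AI writing, showing that AI can not only assist with writing but also stimulate and guide human thinking, redefining who leads in the writing process.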
It has a panel of critics who tear my work apart from different angles—skills I wrote to invoke certain kinds of feedback, whether it's for length, pacing, or the soundness of the argument.
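Most people think AI writing lacks critical perspective and rigorous editing, but the author describes an AI-driven panel of critics built to tear her work apart from different angles. This challenges concerns about the quality of AI writing and suggests AI can be set up to give more comprehensive and more rigorous feedback than traditional editing, perhaps even exceeding human editors in consistency and breadth.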
My process has about as much in common with that as cooking has with microwaving a frozen dinner.
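Most people think AI writing is a simple prompt-generate-paste process, but the author likens the difference to that between cooking and microwaving a frozen dinner, implying that real AI writing is complex and takes skill. This challenges oversimplified views and suggests it is a craft requiring expertise and creativity rather than a mechanical task.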