35 Matching Annotations
  1. Last 7 days
    1. A small but directionally consistent improvement on strict instruction following. Loose evaluation is flat. Both models already follow the high-level instructions — the strict-mode gap comes down to 4.6 occasionally mishandling exact formatting where 4.7 doesn't.

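      This finding points to a subtle aspect of how AI model capability improves: a small but precise improvement can be worth more than a large but vague one. Claude 4.7 gains only slightly on strict instruction following, but the gain targets the exact-formatting problems that come up routinely in real development. That challenges the fixation on 'major breakthroughs' and underlines the value of solving a specific problem precisely.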

  2. Apr 2026
    1. Tracks the evolution of LLM security capabilities across benchmarks (CyberGym, Cybench, etc.), calculates capability doubling times, detects emergence patterns, and monitors cost-efficiency trends.

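      This module reflects a frontier direction in AI security research: it looks beyond current capability to track how capability and efficiency change over time. The 'capability doubling time' calculation deserves particular attention, since it may reveal an accelerating trend in AI security capability and matters for forecasting future security challenges.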
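
A doubling time of the kind mentioned in this note can be estimated with a log-linear fit. The sketch below is only an illustration: the function name `doubling_time` and the sample numbers are hypothetical, and nothing here is taken from the tool being annotated.

```python
import math

def doubling_time(times, scores):
    """Estimate doubling time from a least-squares fit of log2(score) vs. time.

    Hypothetical helper for illustration; scores must be positive.
    """
    if len(times) != len(scores) or len(times) < 2:
        raise ValueError("need at least two (time, score) pairs")
    logs = [math.log2(s) for s in scores]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time values must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx  # doublings per unit of time
    if slope <= 0:
        return math.inf  # no growth, so the score never doubles
    return 1.0 / slope

# Made-up benchmark scores that double every 6 months.
months = [0, 6, 12, 18]
scores = [5.0, 10.0, 20.0, 40.0]
print(doubling_time(months, scores))
```

Fitting in log space keeps one unusually high or low score from dominating the estimate, which a simple two-point ratio would not.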

    1. Upgraded from a video generator to a director's toolkit

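      This phrasing carries an important implicit assumption: that AI can already understand and carry out a complex creative workflow. The author assumes AI tools have moved beyond simple content generation and can grasp the full process and decision logic of a director's work, which is a fairly bold assumption about technical capability.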

    2. Upgraded from a video generator to a director's toolkit

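      This phrasing reveals something surprising: AI tools are shifting from 'performing a single task' to 'understanding a complex creative workflow'. It suggests AI is no longer only a content generation tool and is starting to show a systematic understanding of the whole creative process, an important milestone in the evolution of AI's creative capability.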

    1. Andon Labs started by giving an AI control of a vending machine at Anthropic's office.

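      This opening shows the incremental path of AI capability development and the striking speed of the move from simple control to complex decision-making. An AI that began by managing a vending machine went on, within a short time, to run a physical business autonomously, which illustrates the potential for exponential growth in AI capability.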

    1. The capability rankings reshuffled completely across tasks. There is no stable best model across cybersecurity tasks. The capability frontier is jagged.

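      This finding exposes the 'jagged frontier' of AI security capability: different models perform very differently across security tasks. There is no one-size-fits-all best security model, so the model has to be chosen for the specific task, which has important implications for the design of AI security systems.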

    2. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.

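      This is a surprising finding. It indicates that even small, cheap models can match expensive proprietary models at detecting security vulnerabilities. That challenges the assumption that AI security requires the most advanced frontier models and hints at cost-effective AI security solutions.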

    1. The model can reverse-engineer compiled software to detect malware and vulnerabilities without needing source code, aiming to help analysts inspect and secure systems more efficiently.

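      The ability to reverse-engineer compiled software and detect malicious code without source code shows breakthrough progress for AI in cybersecurity. The technique could fundamentally change how security analysts work, but it could also be abused, which raises serious questions about AI safety and ethics.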

    1. M2.7 demonstrates excellent performance in real-world software engineering, including end-to-end project delivery, log analysis for bug hunting, code security, and machine learning tasks.

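      This claim implies that AI models have gone beyond simple code generation and can cover the complete software development lifecycle. That would be a major breakthrough for AI in engineering and could redefine how software is developed.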

    1. User input no longer just triggers a one-off behavior; it gradually installs, invokes, composes, and retains reusable neural routines.

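      This description exposes a fundamental difference in the nature of interaction between neural computers and conventional computers. User input becomes a process of installing capabilities. That is a technical shift and also a redefinition of the human-machine relationship, and it hints that AI capabilities may in future be shaped directly through natural interaction.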

    1. AI capability is not plateauing. It is accelerating and reaching more people than ever.

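      This claim challenges the common expectation that AI progress might level off and asserts that technical progress is in fact accelerating. The acceleration shows up in performance metrics and also in striking growth in adoption, which suggests AI is in a phase of exponential growth that could bring unprecedented social change.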

    1. We are also unlocking a new capability: instrument reading, enabling robots to read complex gauges and sight glasses — a use case we discovered through close collaboration with our partner, Boston Dynamics.

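      This surprising breakthrough shows how AI can draw inspiration from real industrial needs. Instrument reading is a technical advance, and it also marks AI starting to understand complex tasks in specialized human domains. The collaboration with Boston Dynamics shows frontier AI research becoming more closely tied to real application scenarios, and this blend of industry and research could speed up the adoption of robotics in the real world.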

    1. Both services can be disabled for fully offline operation.

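      Surprisingly, Sage can disable its cloud services entirely and run fully offline. This offline capability is essential for users who work in isolated environments, such as government agencies or highly sensitive projects. It shows the tool's flexibility and adaptability, and it is something many modern security tools lack.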

    1. gpt-oss-20B (high): 0.7%

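      gpt-oss-20B scores 0.7%: out of 452 professional tasks, fewer than 4 passed the evaluation. Against the top model's 33.3%, that is a gap of nearly 50x. This suggests that agent capability for professional services does not 'improve gradually' but sits on a distinct 'capability ladder': below a certain scale, models fail almost completely on this kind of task. The implication for enterprise AI selection is that in professional-services settings a 'good enough small model' may simply not exist; there are only 'usable large models' and 'completely unusable models'.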
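
The two figures in this note can be checked with quick arithmetic. This is only a sanity check on the numbers quoted (452 tasks, 0.7%, 33.3%); the variable names are illustrative.

```python
total_tasks = 452   # number of professional tasks cited in the note
low_rate = 0.007    # gpt-oss-20B (high): 0.7%
top_rate = 0.333    # top model: 33.3%

# Expected number of tasks passed at a 0.7% pass rate.
passed_low = total_tasks * low_rate

# How many times higher the top model's pass rate is.
ratio = top_rate / low_rate

print(f"tasks passed at 0.7%: about {passed_low:.1f}")
print(f"top model vs. gpt-oss-20B: about {ratio:.1f}x")
```

The result is a little over 3 tasks and a ratio a little under 48, consistent with "fewer than 4" and "nearly 50x".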

    1. a logistic curve is a poor fit because we haven't seen any evidence of the exponential growth in time horizon slowing down.

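      METR states plainly that, as of early 2026, the exponential growth in time horizon shows no sign of slowing, which means the 'saturation phase' of the S-curve has not arrived. Skeptics of AI progress often invoke the argument that 'progress will slow down', but this data point directly challenges that narrative. Continued exponential growth means that at fixed intervals the complexity of tasks AI can complete independently doubles, and according to historical data that doubling period is roughly 6-7 months.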
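
The doubling claim in this note can be written as a formula. This is a sketch that assumes a constant doubling period; the symbols $H_0$ (time horizon today), $T_d$ (doubling period) and $t$ (months from today) are introduced here for illustration and do not come from the source.

```latex
\[
  H(t) = H_0 \cdot 2^{\,t / T_d}, \qquad T_d \approx 6\text{--}7 \text{ months}
\]
\[
  \frac{H(12)}{H_0} = 2^{12 / T_d} \in \left[\, 2^{12/7},\; 2^{12/6} \,\right] \approx [\,3.3,\; 4\,]
\]
```

Under that assumption, one year of continued growth multiplies the time horizon by roughly 3.3 to 4.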

    1. Tang Jie (CEO of Zhipu AI) even recently said: "The truth may be that the gap [between US and Chinese AI] is actually widening."

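      Zhipu CEO Tang Jie personally admitting that the gap may be widening carries a great deal of weight. In a climate where Chinese AI companies generally tell the outside world that 'the gap with the US is small', it is rare clarity and candor for a leading figure to say this publicly. It fits the core argument of this article exactly: under the double pressure of export controls and lagging domestic chips, the compute gap is hard to close in the short term. For Zhipu's internal strategy, both the cost and the courage of this statement deserve careful thought.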

  3. Jan 2026
    1. Early on, Doug Engelbart recognized that what makes us capable, beyond the basic human abilities we were born with, is a whole infrastructure of capabilities, with higher level capabilities depending on the execution of lower level capabilities. For example, the ability to solve a problem collaboratively depends on our ability to communicate with language, to use acceptable conventions and methodologies for working together, to identify opportunities, plan, implement, write, discuss, etc.

      Human system and tool system are the foundations of capability infrastructure.

  4. Oct 2025
    1. But this process took over two decades instead of a few hours in the case of AlphaZero because safety considerations put a limit on the extent to which each iteration of this loop could be scaled up compared to the previous one

      So the capability-reliability gap implies that if you’re willing to accept self-driving cars mowing through droves of people until they can reliably not crush people on the street under their wheels, then we’d likely end up with self-driving cars pretty quickly. For the most part, though, we don’t see this as acceptable.

  5. Jun 2025
    1. But neither Defense Secretary Pete Hegseth nor Gen. Dan Caine, the chairman of the Joint Chiefs of Staff, could immediately say whether Iran still retained the ability to make a nuclear weapon. Mr. Hegseth repeated President Trump’s assertion from the previous night that the nuclear sites had been “obliterated.” General Caine did not.

      Given their experience and records, General Dan Caine is the better source of potential truth here.

  6. Nov 2024
  7. Oct 2023
    1. Relays, string figures, passing patterns back and forth, giving and receiving, patterning, holding the unasked-for pattern in one’s hands, response-ability; that is core to what I mean by staying with the trouble in serious multispecies worlds. Becoming-with, not becoming, is the name of the game; becoming-with is how partners are, in Vinciane Despret’s terms, rendered capable.

      cooperative pattern-seeking as the only meaningful and useful becoming (becoming-with)

      1/2

  8. Dec 2022
    1. Emergent abilities are not present in small models but can be observed in large models.

      Here’s a lovely blog by Jason Wei that pulls together 137 examples of ’emergent abilities of large language models’. Emergence is a phenomenon seen in contemporary AI research, where a model will be really bad at a task at smaller scales, then go through some discontinuous change which leads to significantly improved performance.

    1. Houston, we have a Capability Overhang problem: Because language models have a large capability surface, these cases of emergent capabilities are an indicator that we have a ‘capabilities overhang’ – today’s models are far more capable than we think, and our techniques available for exploring the models are very juvenile. We only know about these cases of emergence because people built benchmark datasets and tested models on them. What about all the capabilities we don’t know about because we haven’t thought to test for them? There are rich questions here about the science of evaluating the capabilities (and safety issues) of contemporary models. 
    1. There’s a concept in AI that I’m particularly fond of that I think helps explain what’s happening. It’s called “capability overhang” and refers to the hidden capacities of AI: skills and aptitudes latent within systems that researchers haven’t even begun to investigate yet. You might have heard before that AI models are “black boxes” — that they’re so huge and complex that we don’t fully understand how they operate or come to specific conclusions. This is broadly true and is what creates this overhang.
  9. Feb 2022
  10. Jan 2021
    1. psychological and physical capability (i.e. the individual’s psychological and physical capacity to engage in the activity concerned, including the necessary knowledge and skills),

      psychological and physical capability (that is, the individual's psychological and physical capacity to take part in the activity concerned, including the necessary knowledge and skills)

  11. Sep 2020
  12. May 2020
  13. Feb 2014
    1. But at the level of the capability hierarchy where we wish to work, it seems useful to us to distinguish several different types of structuring--even though each type is fundamentally a structuring of the basic physical processes. Tentatively we have isolated five such types--although we are not sure how many we shall ultimately want to use in considering the problem of augmenting the human intellect, nor how we might divide and subdivide these different manifestations of physical-process structuring. We use the terms "mental structuring", "concept structuring", "symbol structuring", "process structuring," and "physical structuring."

      The 5 structuring types outlined by Doug Engelbart:

      • mental
      • concept
      • symbol
      • process
      • physical