53 Matching Annotations
  1. May 2026
  2. Apr 2026
    1. But the real power of agents comes when they can work as a team. Instead of lone-wolf bots carrying out single tasks, such as using a browser to make a restaurant reservation or sending you a summary of your inbox, new tools can yoke together multiple agents, give each of them a different job, and orchestrate their behaviors so that they all pull together to complete more complex tasks than an individual agent could do by itself.

      The conventional view may be that AI agents will work on their own, but the author argues that their real power lies in teamwork: by coordinating, they can complete more complex tasks than any single agent could.

    1. The capability rankings reshuffled completely across tasks. There is no stable best model across cybersecurity tasks. The capability frontier is jagged.

      This finding reveals the "jagged frontier" of AI security capability: different models perform very differently across different security tasks. There is no one-size-fits-all best security model; instead, the model must be matched to the specific task — an important lesson for designing AI security systems.

    1. It maintains 97% skill compliance across 40 complex skills on MM Claw, each skill exceeding 2,000 tokens.

      A 97% skill-compliance rate is a very high bar, especially for complex skills exceeding 2,000 tokens. It suggests that M2.7 can not only understand complex instructions but also stay consistent and reliable across long-running tasks. For developers building complex agent workflows, this data point is especially valuable: it implies the model can reliably execute multi-step, high-complexity tasks.

    2. The 66.6% medal rate on MLE Bench Lite, achieved autonomously over 24 hour windows, tells you something real about how this model behaves when you give it a hard problem and step back.

      A 66.6% medal rate achieved fully autonomously over continuous 24-hour runs is an impressive data point. It shows that M2.7 can not only stay focused over long stretches but also keep refining its problem-solving strategy. This capacity for autonomous problem-solving may be the key metric for evaluating an agent model's practical value — far beyond what traditional benchmarks can measure.

    1. Corporate Lawyer: Force Majeure Under Executive Order... Management Consultant: 2026 Capital Budget Allocation... Investment Banking Analyst: KVUE DCF Update

      These three sample tasks reveal the design philosophy of the APEX-Agents evaluation: not "can it answer a question" but "can it complete a day of a professional's real work" — judging whether a force majeure clause applies, allocating a capital budget with a matrix model, updating a DCF model and recomputing the cost figures. The tasks require reading attached files, performing numerical calculations, and then delivering conclusions in a prescribed format. For AI product selection in banking and consulting, this is currently the evaluation dimension closest to real-world scenarios.

    1. we may see a growing divergence between the capabilities we can measure and the capabilities we actually care about.

      The widening divergence between "the capabilities we can measure" and "the capabilities we actually care about" is the article's deepest insight. Current benchmarks all favor tasks that are clean, self-contained, and automatically scorable, while real work is messy, cross-system, and dependent on human judgment. As AI extends toward longer tasks, this gap between measurement and reality will not shrink — it will widen faster. It means that future debates over whether AI can replace a given kind of work will become ever harder to settle with data, because the data itself cannot capture the nature of real work.

    2. If this pace of progress continues — doubling task length every six or seven months — we should expect LLMs capable of completing week-long tasks some time next year, and month-long tasks in 2028.

      Week-long tasks next year, month-long tasks by 2028 — this timeline closely matches METR's own forecast (a 200-hour time horizon within 12–18 months), and the convergence of two independent sources lends the prediction extra credibility. Month-long tasks would mean AI can independently complete an entire short-term project, from requirements to delivery. That is no longer the era of "AI-assisted work" but of "AI-executed projects" — and on the current trajectory it is less than three years away.

    1. this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks.

      "Easy problems unaffected, hard problems possibly worse" — this asymmetry is extremely dangerous. It means that when we validate agent reliability on simple tasks, we get false confidence. By the time the agent faces a truly high-stakes, high-complexity task, accumulated context has already quietly switched off its self-verification mode, and it degrades to shallow reasoning without any warning. This is a form of "hidden capability decay", more dangerous than an obvious failure.

    1. Overnight, agents can do maybe 200 human hours of work, but only for very agent-shaped tasks, so researchers need to deliberately sequence projects such that very long tasks suitable for agents happen overnight.

      The idea of "feeding the agent overnight" is striking: future researchers will need to "sow" like farmers, carefully preparing sufficiently agent-shaped long tasks before leaving work, so that the AI can do the equivalent of 200 person-hours during the 8 hours humans sleep — then "harvest the results" in the morning. This means the rhythm of human work will be completely reorganized: no longer "I execute the task", but "I prepare for the task's execution".

    1. Four levels of shared control can be distinguished [1]: strategic (e.g., setting a destination), tactical (e.g., doing a specific maneuver like merging into a lane), operational (e.g., maintaining a certain distance from another car), and execution (the lowest level of controlling locomotion, steering, and so on).
    2. Third, control can be shared via partitioning. In this case, a task is decomposed into parts that can be addressed by humans and machines separately. An example of such control sharing is semi-automatic parallel parking, which provides the driver with some braking ability while the machine controls the speed and steering of the car. An HCI example is automatic spell checking, where the system detects and highlights incorrectly spelled words but does not change them. Instead, the user has to take an explicit corrective action, such as selecting a misspelled word and choosing an alternative.
  3. Mar 2026
    1. Usability concerns how easily computer-based tools may be operated by users trying to accomplish a task. Usability differs from utility. Usability concerns whether users can use the product in a way that makes it possible to realize its utility; utility is about whether the goal is important to the user.

      Highlight tasks

    2. The utility of an interactive system concerns its match with the tasks of users. If the match is good, the tool has high utility; if the tasks that users want to do are not supported by the tool, the tool has low utility.

      Highlight tasks

    3. Users actively repurpose tools to make them more personally usable and relevant. Design should support such repurposing. For example, Renom et al. [696] conducted a study on text editing using a novel user interface. They found that exploration and technical reasoning facilitate creative tool use. Users who explore available commands in a tool are better at repurposing its functionality. More surprisingly, engaging in technical reasoning (reasoning about functionality and objects) supports repurposing more than procedural knowledge inherited from other software.

      Highlight tasks

  4. Jan 2026
  5. Apr 2025
  6. Jan 2024
    1. One of the reasons that some projects don't use Gitlab's issues and use an external tracking platform is the lack of issues relations. Without relations issues are just flat, no way to actually track progress of big features. No way to create a "meta" issue that depends on 4 other or create subtasks and so on. The same problem exists on Github too. It would surely make a difference if Gitlab offers a full features tracking issue, instead of just flat issues. Relations is a major first step towards that.
  7. Nov 2021
  8. May 2021
    1. You can now search for tasks using task: similar to block:. There is also task-todo: and task-done: which will match only the tasks that are incomplete or complete, respectively. Use task:"" to match all tasks.

      This will be incredibly useful to create as a view.
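      A few example queries, based on the operators named in the release note above (the search terms themselves are illustrative, not from the source):

      ```
      task:""           match every task in the vault
      task:"review"     tasks whose text contains "review"
      task-todo:""      only incomplete tasks
      task-done:""      only completed tasks
      ```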

    2. Task lists [x] can now contain any character to indicate a completed task, instead of just x. This value can be used by custom CSS to change the appearance of the check mark, and is also available for plugins to use.

      I'll need to create some custom CSS for these, as in the past I've used:

        • [>] to indicate that an item was pushed forward
        • [?] to indicate something I'm not sure was done in retrospect (typically for a particular day)
        • others?
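      A minimal CSS sketch for those states, assuming the renderer exposes the checkbox character through a `data-task` attribute on the list item (as Obsidian does); the selectors and colors here are illustrative:

      ```css
      /* Pushed-forward items: [>] rendered in a muted color */
      li[data-task=">"] {
        color: gray;
      }

      /* Uncertain-in-retrospect items: [?] flagged in orange */
      li[data-task="?"] {
        color: orange;
      }
      ```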
  9. Dec 2020
  10. Nov 2020
    1. Experimental condition

      (1) tone discrimination, (2) onset discrimination, and (3) nonlinguistic pitch judgement: discrimination of a low tone (90 Hz) vs. a high tone (100 Hz). Respectively, the conditions are the T, O, and P judgements.

  11. Aug 2020
  12. Apr 2020
  13. Dec 2019
  14. Oct 2018
  15. cconlinejournal.org
    1. the activities the participants claimed they used a computer to complete during their average day.

      Interesting that most of these are passive activities--watching TV and buying items. Uploading photos might be the most complex task students have done.

  16. Jun 2014
    1. A fundamental task for public philosophy is to attend to the work the public is doing in developing its own self-conception.

      This strikes me as a very productive way of identifying an important aspect of public philosophy. On the one hand, it allows us to distinguish between philosophers who think more people should be listening to them and philosophers who think they should be listening to more people. On the other hand, it suggests and leaves open a number of questions that can be addressed in and through the work public philosophers are doing in developing their own self-conceptions.