Both illustrate how decomposing complex tasks across specialized agents can address problems that monolithic models handle poorly.
This highlights the advantage of multi-agent architectures for complex tasks, offering a new approach to problems that a single model struggles to handle.
Workspace agents can gather context from the right systems, follow team processes, ask for approval when needed, and keep work moving across tools.
Many might assume AI tools struggle to understand and execute complex team processes, but the author stresses that workspace agents can do exactly that, challenging assumed limits on AI's ability to handle complex tasks.
But the real power of agents comes when they can work as a team. Instead of lone-wolf bots carrying out single tasks, such as using a browser to make a restaurant reservation or sending you a summary of your inbox, new tools can yoke together multiple agents, give each of them a different job, and orchestrate their behaviors so that they all pull together to complete more complex tasks than an individual agent could do by itself.
The mainstream view may be that AI agents will work independently, but the author argues their real power lies in teamwork: collaborating to complete tasks more complex than any single agent could manage alone.
The capability rankings reshuffled completely across tasks. There is no stable best model across cybersecurity tasks. The capability frontier is jagged.
This finding reveals a "jagged frontier" in AI security capability: models perform very differently across different security tasks. There is no one-size-fits-all best security model; the right model must be chosen for the specific task, an important lesson for designing AI security systems.
It maintains 97% skill compliance across 40 complex skills on MM Claw, each skill exceeding 2,000 tokens.
A 97% skill-compliance rate is a very high bar, especially for complex skills exceeding 2,000 tokens each. It suggests M2.7 can not only understand complex instructions but also stay consistent and reliable over long-running tasks. For developers building complex agent workflows, this data point matters: it means the model can reliably execute multi-step, high-complexity tasks.
The 66.6% medal rate on MLE Bench Lite, achieved autonomously over 24-hour windows, tells you something real about how this model behaves when you give it a hard problem and step back.
This 66.6% medal rate was achieved after running fully autonomously for 24 hours straight, an impressive data point. It suggests M2.7 can stay focused over long periods and keep refining its problem-solving strategy. This kind of autonomous problem-solving may be a key indicator of an agent model's practical value, well beyond what traditional benchmarks can measure.
Corporate Lawyer: Force Majeure Under Executive Order... Management Consultant: 2026 Capital Budget Allocation... Investment Banking Analyst: KVUE DCF Update
The three example tasks reveal the design philosophy of the APEX-Agents evaluation: not "can it answer a question" but "can it complete a day of a professional's real work": judging whether a force majeure clause applies, allocating a capital budget from a matrix model, updating a DCF model and recomputing cost figures. These tasks require reading attached files, performing numerical calculations, and then outputting conclusions in a prescribed format. For AI product selection in banking and consulting, this is currently the evaluation dimension closest to real-world use.
we may see a growing divergence between the capabilities we can measure and the capabilities we actually care about.
The widening gap between "capabilities we can measure" and "capabilities we actually care about" is the article's deepest insight. Current benchmarks all skew toward tasks that are clean, self-contained, and automatically gradable, while real work is messy, cross-system, and dependent on human judgment. As AI extends into longer tasks, this measurement-reality gap will not shrink but widen faster. That means future debates over whether AI can replace a given kind of work will become ever harder to settle with data, because the data cannot capture the nature of real work.
If this pace of progress continues — doubling task length every six or seven months — we should expect LLMs capable of completing week-long tasks some time next year, and month-long tasks in 2028.
Week-long tasks next year, month-long tasks in 2028: this timeline closely matches METR's own forecast (a 200-hour time horizon within 12-18 months), and the convergence of two independent sources lends the prediction extra credibility. Month-long tasks would mean AI independently completing an entire short-term project, from requirements to delivery. That is not the era of "AI-assisted work" but of "AI-executed projects," and on the current trajectory it is less than three years away.
Although this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks.
"Easy problems unaffected, hard problems possibly degraded": this asymmetry is extremely dangerous. It means that validating agent reliability on simple tasks yields false confidence. When the agent finally faces a genuinely high-stakes, high-complexity task, context accumulation has quietly switched off its self-verification mode, and it degrades to shallow reasoning without any warning. This is a hidden capability decay, more dangerous than an obvious failure.
Overnight, agents can do maybe 200 human hours of work, but only for very agent-shaped tasks, so researchers need to deliberately sequence projects such that very long tasks suitable for agents happen overnight.
"Feeding the agent overnight" is a striking idea: future researchers will need to "sow" carefully designed, sufficiently agent-shaped long tasks before leaving work, like a farmer planting seeds, let the AI do the equivalent of 200 person-hours while humans sleep for eight hours, and then "harvest" the results in the morning. This would completely reorganize the rhythm of human work: no longer "I execute the task" but "I prepare for the task's execution."
Four levels of shared control can be distinguished [1]: strategic (e.g., setting a destination), tactical (e.g., doing a specific maneuver like merging into a lane), operational (e.g., maintaining a certain distance from another car), and execution (the lowest level of controlling locomotion, steering, and so on).
Control does not need to be either/or, as it is in many semi-autonomous vehicles.
When two agents sharing control have asymmetric capabilities, both loose and tight rein control should be available.
First, control can be shared via an extension that allows a machine to amplify human ability.
When riding a horse, the rider communicates high-level information (e.g., the goal) to the horse but must be ready to guide the horse at a lower level. When the horse knows what to do, for example, if the route is familiar, the rider may not need to engage in low-level control. This form of control, called loose rein control, is possible if the horse knows what the rider wants.
First, communication is vital for sharing control, and this can happen at different levels; second, both agents must have internal models of each other to understand what those communicative acts mean.
Shared control is about carrying out a task together with a competent partner [1, p. 511]: “In shared control, human(s) and robot(s) are interacting congruently in a perception–action cycle to perform a dynamic task that either the human or the robot could execute individually under ideal circumstances.”
The question of shared control is timely; semi-autonomous vehicles are only partially autonomous. They need the human to assist them and, therefore, some way of handing control over to the human driver. They also need to have guidance from the driver, for example, on the choice of route.
An example of such control sharing is power steering in a car: The car provides additional work to allow the driver to turn the wheels with less effort. An HCI example is mouse acceleration, which allows a user to move the cursor on the screen farther than the physical movement of the mouse.
Second, control can be shared via relief, which means that the overall burden on the human is reduced by the machine. An example is automatic shift transmission, which relieves the driver of the task of changing gears in a car. An HCI example is text entry using autocomplete, which prevents the user from correcting typing mistakes as they type.
Third, control can be shared via partitioning. In this case, a task is decomposed into parts that can be addressed by humans and machines separately. An example of such control sharing is semi-automatic parallel parking, which provides the driver with some braking ability while the machine controls the speed and steering of the car. An HCI example is automatic spell checking, where the system detects and highlights incorrectly spelled words but does not change them. Instead, the user has to take an explicit corrective action, such as selecting a misspelled word and choosing an alternative.
Successful time-sharing depends on the strategy and difficulty of the task in terms of temporal constraints (how many tasks are processed in a given interval) and task complexity (the quantity of information that needs to be processed for a given task).
The first is called the single channel theory, which posits that there is limited capacity in the human information processing system in a time-sharing scenario. When the channel capacity is exceeded, multiple tasks transition from parallel processing to serial processing.
The third theory is information processing analysis theory. If at least one task can be carried out automatically, the other task can be carried out with little or no impact on performance (at an appropriate time–error trade-off point).
The second theory is the multiple resources model, which states that resource limitation concerns the entire system rather than a channel (Chapter 5).
Second, tasks can be shared in terms of control, which means that some control over the tasks is assigned to another agent, such as another user or a machine.
In general, perfect time-sharing with no degradation in performance occurs only for tasks that are automatic, such as speaking while walking.
First, tasks can be time-shared, which implies that the user performs multiple tasks.
Eye-typing forces users to think in terms of individual letters. This has a cognitive cost and is not a fluid means of communication.
Highlight tasks
To use such a switch for typing, the SGD interface must be designed with this in mind from the beginning. The most common solution is a scanning keyboard.
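The row-column scanning idea can be sketched as a toy simulation: the system highlights rows in turn, a switch press selects a row, items in that row are then scanned, and a second press selects the letter. The layout and the step counts below are illustrative, not from the source:

```shell
#!/usr/bin/env bash
# Toy sketch of row-column scanning for a single-switch keyboard.
# A "press" is simulated by the number of scan steps taken before it.
set -euo pipefail

rows=("abcde" "fghij" "klmno" "pqrst" "uvwxy")

# select_letter <row-steps> <col-steps>: the letter reached after
# scanning that many rows, then that many items within the row.
select_letter() {
  local row="${rows[$1]}"
  echo "${row:$2:1}"
}

# Typing "hi": wait for row 1 ("fghij"), press; wait for item 2, press.
word="$(select_letter 1 2)$(select_letter 1 3)"
echo "$word"    # prints "hi"
```

The cost the excerpt mentions is visible here: every letter takes two timed waits, which is why eye-typing and switch scanning are slow, letter-by-letter channels.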
TTF refers to the ability of technology to support a task [197]. The capabilities of the technology should match the demands of the task and the skills of the individual; in this case, the fit is perfect.
A system may be usable for some tasks and less usable for others; it may be usable for some users but not for others.
Usability concerns how easily computer-based tools may be operated by users trying to accomplish a task. Usability differs from utility. Usability concerns whether users can use the product in a way that makes it possible to realize its utility; utility is about whether the goal is important to the user.
The utility of an interactive system concerns its match with the tasks of users. If the match is good, the tool has high utility; if the tasks that users want to do are not supported by the tool, the tool has low utility.
Users actively repurpose tools to make them more personally usable and relevant. Design should support such repurposing. For example, Renom et al. [696] conducted a study on text editing using a novel user interface. They found that exploration and technical reasoning facilitate creative tool use. Users who explore available commands in a tool are better at repurposing its functionality. More surprisingly, engaging in technical reasoning (reasoning about functionality and objects) supports repurposing more than procedural knowledge inherited from other software.
I’m already finding better ways to batch my tasks. Making smarter to-do lists that include more context
Task lists with better context: what does that mean, exactly? Batching tasks: yes, that's helpful.
Instructions on how to add the facilitator as an editor/commenter might help.
Sub tasks are just issues with a parent issue. When displaying them, always display the ancestor crumb trail.
Marking an issue as a subtask of another. Having task lists in a description is a great start, but it doesn't help (AFAIK) with navigating from a child back up the chain to its parent. Creating umbrella issues is a very common way to track the top-level focus areas for a release.
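The parent-chain navigation being asked for is simple to sketch: if each issue stores only its parent, the crumb trail is recovered by walking parents up to the root. The issue ids below are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the "ancestor crumb trail": issue 12 is a subtask of 7,
# which is a subtask of 3; walking parent pointers yields the trail.
set -euo pipefail
declare -A parent=( [12]=7 [7]=3 )

crumb_trail() {
  local id="$1" trail="$1"
  while [ -n "${parent[$id]:-}" ]; do
    id="${parent[$id]}"
    trail="$id > $trail"
  done
  echo "$trail"
}

crumb_trail 12    # prints "3 > 7 > 12"
```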
subtasks would give insight into how long it will take!
One of the reasons some projects don't use GitLab's issues and use an external tracking platform instead is the lack of issue relations. Without relations, issues are just flat: there is no way to actually track the progress of big features, no way to create a "meta" issue that depends on four others, no way to create subtasks, and so on. The same problem exists on GitHub too. It would surely make a difference if GitLab offered full-featured issue tracking instead of just flat issues. Relations are a major first step toward that.
Should Financial Executives lead the IT department? A bit of IT-financials thinking... https://en.itpedia.nl/2021/11/23/moeten-financial-executives-leiding-geven-aan-de-it-afdeling/
You can now search for tasks using task: similar to block:. There is also task-todo: and task-done: which will match only the tasks that are incomplete or complete, respectively. Use task:"" to match all tasks.
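A few illustrative queries combining these operators (the quoted search strings are made up):

```
task:""            all tasks
task:"call"        tasks whose text contains "call"
task-todo:""       all incomplete tasks
task-done:"email"  completed tasks containing "email"
```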
This will be incredibly useful to create as a view.
Task lists [x] can now contain any character to indicate a completed task, instead of just x. This value can be used by custom CSS to change the appearance of the check mark, and is also available for plugins to use.
I'll need to create some custom CSS for these in the past as I've used:
These are sequential because build:ssr imports the public/index.html that build:dom produces.
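The ordering constraint can be made explicit with stand-in functions (the script bodies are hypothetical; only the build:dom-before-build:ssr dependency comes from the note):

```shell
#!/usr/bin/env bash
# build:dom writes public/index.html, which build:ssr reads,
# so the two steps must run sequentially, never in parallel.
set -euo pipefail

build_dom() {                 # stands in for `npm run build:dom`
  mkdir -p public
  echo "<html></html>" > public/index.html
}

build_ssr() {                 # stands in for `npm run build:ssr`
  # fails if build_dom has not produced the file yet
  cat public/index.html > /dev/null && echo "ssr ok"
}

build_dom && build_ssr        # "&&" stops the chain if the first step fails
```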
word-picture association judgment task
The learner group underwent an extra task in which they listened to a word and judged whether a picture was the correct association for that word.
experimental condition
(1) tone discrimination, (2) onset discrimination, and (3) nonlinguistic pitch judgement and discrimination of a low tone (90 Hz) vs. a high tone (100 Hz). Respectively, the conditions are T, O, and P judgements.
Grözinger, N., Irlenbusch, B., Laske, K., & Schröder, M. (2020). Innovation and Communication Media in Virtual Teams – An Experimental Study. Institute of Labor Economics. Retrieved from https://covid-19.iza.org/publications/innovation-and-communication-media-in-virtual-teams-an-experimental-study/
For example: I wanted a way to add recurring tasks to my list, so I wrote a simple bash script called goodmorning.sh. It uses the command prompt client to quickly add a bunch of tasks to my todo list of choice. I run this script first thing in the morning every workday, and I like it better than any built-in system I’ve found for recurring tasks, because it’s fully under my control.
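A minimal sketch of what such a goodmorning.sh might look like. The task list and the client invocation are illustrative; `TODO_CMD` stands in for whatever command-line todo client is in use, defaulting to `echo` so the sketch runs anywhere:

```shell
#!/usr/bin/env bash
# goodmorning.sh: add the same recurring tasks every workday morning.
# TODO_CMD is a stand-in for a real todo CLI (e.g. your client of choice);
# with the echo default, the script just prints what it would add.
set -euo pipefail
TODO_CMD="${TODO_CMD:-echo}"

tasks=(
  "check calendar"
  "review inbox"
  "write standup notes"
)

for t in "${tasks[@]}"; do
  "$TODO_CMD" "add: $t"
done
```

Because it's a plain script, the task list stays fully under your control: edit the array, re-run, done.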
the activities the participants claimed they used a computer to complete during their average day.
Interesting that most of these are passive activities, like watching TV and buying items. Uploading photos might be the most complex task students have done.
A fundamental task for public philosophy is to attend to the work the public is doing in developing its own self-conception.
This strikes me as a very productive way of identifying an important aspect of public philosophy. On the one hand, it allows us to distinguish between philosophers who think more people should be listening to them and philosophers who think they should be listening to more people. On the other hand, it suggests and leaves open a number of questions that can be addressed in and through the work public philosophers are doing in developing their own self-conceptions.