Hypothesis

GPT-5.4 (xhigh) scores the highest on APEX-Agents-AA Pass@1 with a score of 33.3%, followed by Claude Opus 4.6 (Adaptive Reasoning, Max Effort) with a score of 33.0%, and Gemini 3.1 Pro Preview with a score of 32.0%.

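A striking number: even the world's strongest AI agents succeed on only about one third of professional tasks in investment banking, consulting, and law. More surprising still, the top three are nearly tied (GPT-5.4 at 33.3%, Claude Opus 4.6 at 33.0%, Gemini 3.1 Pro at 32.0%): the gap between the three leading labs on this professional-services agent benchmark has narrowed to the level of statistical noise. On this dimension, the question of "whose AI is stronger" no longer has a clear answer.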
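
The "statistical noise" claim can be sanity-checked with a rough calculation. The sketch below is illustrative only: the number of benchmark tasks is not given in this record, so `N_TASKS = 500` is a hypothetical placeholder, and the helper names are invented for this example. It treats each model's Pass@1 as an independent binomial proportion and uses an unpaired two-proportion z-test, which is a simplification because the models are scored on the same tasks.

```python
import math

# Pass@1 scores quoted in the hypothesis above.
SCORES = {
    "GPT-5.4 (xhigh)": 0.333,
    "Claude Opus 4.6 (Adaptive Reasoning, Max Effort)": 0.330,
    "Gemini 3.1 Pro Preview": 0.320,
}

# ASSUMPTION: hypothetical task count, NOT taken from the benchmark.
N_TASKS = 500


def std_error(p: float, n: int) -> float:
    """Binomial standard error of a pass rate p measured over n tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def z_score(p1: float, p2: float, n: int) -> float:
    """Unpaired two-proportion z statistic, both rates measured on n tasks."""
    pooled = (p1 + p2) / 2.0
    se_diff = math.sqrt(2.0 * pooled * (1.0 - pooled) / n)
    return (p1 - p2) / se_diff


def compare_to_leader(scores: dict, n: int) -> dict:
    """z statistic of the top-scoring model against every other model."""
    leader = max(scores, key=scores.get)
    return {
        name: z_score(scores[leader], p, n)
        for name, p in scores.items()
        if name != leader
    }


if __name__ == "__main__":
    for name, z in compare_to_leader(SCORES, N_TASKS).items():
        verdict = "significant" if abs(z) > 1.96 else "not significant"
        print(f"{name}: z = {z:.2f} ({verdict} at the 5% level)")
```

Under this assumed task count, both z statistics fall well below the 1.96 threshold, which is consistent with reading the top three as a tie. A paired test on per-task outcomes would be the more appropriate check if that data were available.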

33-percent benchmark three-way-tie professional-AI surprising

Tags

Annotators

URL