For instance, datasets such as AAAR-1.0 [61], ScienceAgentBench [11], and TaskBench [83] provide structured, expert-labeled benchmarks for assessing research reasoning, scientific workflows, and multi-tool planning. Others, such as FlowBench [96], ToolBench [38], and API-Bank [47], focus on tool use and function-calling across large API repositories. These benchmarks typically include not only the gold tool sequences but also the expected parameter structures, enabling fine-grained evaluation (a minimal scoring sketch appears at the end of this section). In parallel, datasets like AssistantBench [109], AppWorld [91], and WebArena [126] simulate more open-ended and interactive agent behaviors in web and application environments. They emphasize dynamic decision-making, long-horizon planning, and user-agent interactions. Several benchmarks also support safety and robustness testing; for example, AgentHarm [5] assesses potentially harmful behaviors, while AgentDojo [17] evaluates resilience against prompt injection attacks. Leaderboards such as the Berkeley Function-Calling Leaderboard (BFCL) [100] and the Holistic Agent Leaderboard [88] consolidate these evaluations by benchmarking
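
To make the notion of fine-grained evaluation against gold tool sequences and parameter structures concrete, the following is a minimal, hypothetical sketch: it assumes a simple call representation and an exact-match criterion on parameters, and is not the scoring scheme of any particular benchmark cited above.

```python
# Hypothetical sketch: compare a predicted tool call against a gold call on both
# the function name and its parameter values. Schema and scoring are illustrative.
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToolCall:
    name: str                                   # e.g. "search_papers" (made-up tool name)
    params: Dict[str, Any] = field(default_factory=dict)


def score_call(pred: ToolCall, gold: ToolCall) -> float:
    """Return 0.0 if the wrong tool was chosen; otherwise the fraction of
    gold parameters reproduced exactly (a simple exact-match criterion)."""
    if pred.name != gold.name:
        return 0.0
    if not gold.params:
        return 1.0
    matched = sum(1 for k, v in gold.params.items() if pred.params.get(k) == v)
    return matched / len(gold.params)


# Example: the agent picks the correct tool but gets one of two parameters wrong.
gold = ToolCall("search_papers", {"query": "agent benchmarks", "limit": 10})
pred = ToolCall("search_papers", {"query": "agent benchmarks", "limit": 5})
print(score_call(pred, gold))  # 0.5
```

Real benchmarks refine this idea along several axes, for instance by scoring entire tool sequences rather than single calls or by allowing semantically equivalent parameter values, but the core comparison of predicted versus gold calls follows this pattern.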