7 Matching Annotations
  1. Jan 2026
    1. For instance, datasets such as AAAR-1.0 [61], ScienceAgentBench [11], and TaskBench [83] provide structured, expert-labeled benchmarks for assessing research reasoning, scientific workflows, and multi-tool planning. Others, such as FlowBench [96], ToolBench [38], and API-Bank [47], focus on tool use and function-calling across large API repositories. These benchmarks typically include not only the gold tool sequences but also expected parameter structures, enabling fine-grained evaluation. In parallel, datasets like AssistantBench [109], AppWorld [91], and WebArena [126] simulate more open-ended and interactive agent behaviors in web and application environments. They emphasize dynamic decision-making, long-horizon planning, and user-agent interactions. Several benchmarks also support safety and robustness testing—for example, AgentHarm [5] assesses potentially harmful behaviors, while AgentDojo [17] evaluates resilience against prompt injection attacks. Leaderboards such as the Berkeley Function-Calling Leaderboard (BFCL) [100] and Holistic Agent Leaderboard [88] consolidate these evaluations by

      benchmarking

    1. While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, it’s an open challenge for AI labs that believe they can do better — something Foody fully expects in the months to come.

      Expectation that models will get trained against the tests they currently fail.

  2. Nov 2025
    1. LLM benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don't measure what matters.

      The paper concludes that most benchmarks used to establish LLM progress are mistargeted or leave out aspects that matter.

  3. Oct 2020
  4. Feb 2020