88 Matching Annotations
  1. Jan 2026
    1. The live nature of BFCL aimed to bridge some of these gaps by introducing BFCL v2, which includes organizational tools, and BFCL v3, which incorporates integrated multi-turn and multi-step

      function-calling / tool-use benchmarking

    2. The MINT benchmark (Wang et al., 2023) evaluates planning in interactive environments, showing that even advanced LLMs struggle with long-horizon tasks requiring multiple steps.

      LLMs still struggle with long-horizon tasks

    3. foundational need for multi-step planning has led to the development of specialized benchmarks and evaluation frameworks that systematically assess these capabilities

      specialized benchmarking for planning

    4. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods.

      does my review also assess this gap?

    1. This validates the identified claims that the pipeline is end-to-end reproducible and does incorporate plug-and-play functionality for different generator and reranker models with minimal modifications.

      usable

    2. Traditional synthetic data generation approaches, such as InPars, rely on static, handcrafted prompt templates that do not scale well to unseen datasets. To address this issue, we leverage the LLM’s own capabilities to determine what constitutes a relevant query. We utilize Chain of Thought (CoT) reasoning to guide the LLM in breaking down the document’s content into a series of logical steps before formulating the final query. Prior work has shown that CoT reasoning can enhance performance by encouraging more structured, contextually appropriate outputs (wei2022chain, 20).

      prompt templating using LLMs, optimization, and related papers; see the CoT sketch below
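      A minimal sketch, assuming a generic completion API, of how CoT-style prompting can guide an LLM to reason about a document before emitting a query. `call_llm`, the prompt wording, and the `Query:` convention are hypothetical placeholders, not the paper's actual template.

      ```python
      # Hypothetical sketch of CoT-guided synthetic query generation.
      # `call_llm` is a placeholder for whatever completion API is used.

      COT_PROMPT = """Document:
      {document}

      Think step by step:
      1. Summarize the document's main topic.
      2. List the key facts a searcher might want.
      3. Write one natural search query that this document answers.

      Reasoning:"""

      def generate_query(document: str, call_llm) -> str:
          """Ask the LLM to reason about the document, then extract the final query."""
          completion = call_llm(COT_PROMPT.format(document=document))
          # Assume the model ends its reasoning with a line such as "Query: ...".
          for line in reversed(completion.splitlines()):
              if line.lower().startswith("query:"):
                  return line.split(":", 1)[1].strip()
          return completion.strip()  # fall back to the raw completion
      ```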

    3. The trained generator model can then be used in zero-shot fashion in order to produce higher quality queries, enhancing downstream performance.

      high quality queries

    4. we aim to verify the validity of the three main claims we have identified: (1) the InPars Toolkit is an end-to-end reproducible pipeline; (2) it provides a comprehensive guide for reproducing InPars-V1, InPars-V2, and partially Promptagator; and (3) the toolkit has plug-and-play functionality.

      if these claims can be verified, that means the toolkit works as described

    5. introduced DSPy (khattab2024dspy, 10), a programming model that offers a more systematic approach to optimizing LLM pipelines. DSPy brings the construction and optimization of LLM pipelines closer to traditional programming, where a compiler automatically constructs prompts and invocation strategies by optimizing the weights of general-purpose layers following a program.

      optimized prompting for the pipeline; see the DSPy-style sketch below
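      A minimal sketch of the declarative style DSPy encourages: the task is declared as a signature and the concrete prompt wording is left to the framework rather than handcrafted. Names follow DSPy's documented Signature/ChainOfThought interface, but the exact API varies by version, and the optimizer step is only indicated in a comment.

      ```python
      import dspy

      # Declare the task as a signature; the framework, not the author,
      # is responsible for the concrete prompt wording.
      class DocToQuery(dspy.Signature):
          """Generate a search query that the document answers."""
          document = dspy.InputField()
          query = dspy.OutputField()

      generate = dspy.ChainOfThought(DocToQuery)

      # An optimizer ("teleprompter") can then compile the program against a
      # small train set and a metric, tuning demonstrations/prompts automatically,
      # e.g. dspy.BootstrapFewShot(metric=my_metric).compile(generate, trainset=train)
      ```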

    6. The goal of CPO is to overcome two key shortcomings of supervised fine-tuning. First, the performance of supervised fine-tuning is limited by the quality of gold-standard human-annotated reference translations. Second, supervised fine-tuning cannot reject errors present in the provided gold-standard translations. CPO addresses these limitations by introducing a training schema that utilises triplets comprising reference translations, translations from a teacher model (e.g., GPT-4), and those from the student model. Reference-free evaluation models are employed to score the translations, and these scores are subsequently used to guide the student model toward generating preferred translations.

      optimization; see the triplet-scoring sketch below
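      A rough sketch, not the paper's code, of how CPO-style preference data could be assembled: score the reference, teacher, and student translations with a reference-free metric and keep the best and worst as chosen/rejected. `score` and all other names are illustrative placeholders.

      ```python
      # Hypothetical sketch of building CPO-style preference pairs.
      # `score` stands in for a reference-free evaluation model (e.g. a QE metric).

      def build_preference_pair(source, reference, teacher_out, student_out, score):
          """Pick the highest- and lowest-scoring translation from the triplet."""
          candidates = {
              "reference": reference,
              "teacher": teacher_out,
              "student": student_out,
          }
          scored = sorted(candidates.items(), key=lambda kv: score(source, kv[1]))
          worst, best = scored[0][1], scored[-1][1]
          # The student is then trained to prefer `best` over `worst`
          # (a contrastive objective rather than plain supervised fine-tuning).
          return {"source": source, "chosen": best, "rejected": worst}
      ```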

    7. Promptagator prompts the 137B-parameter model FLAN (wei2021finetuned, 19) with eight query-document examples, followed by a target document, to generate a relevant query for the target document. Unlike InPars, which uses a fixed prompt template, Promptagator employs a dataset-specific prompt template tailored to the target dataset’s retrieval task.

      Promptagator QGen pipeline and how it works

    8. extends the original InPars by replacing the scoring metric with a relevance score provided by a monoT5-3B reranker model. Additionally, the authors switch the generator LLM to GPT-J (wang2021gpt, 18). In both works, the authors prompt the generator model 100 000 times for synthetic queries but only keep the top 10 000 highest scoring instances. This means that 90% of the generated queries are discarded. It should be noted that the authors provide no substantiation or ablation for this setting.

      issues and improvements to InPars; see the filtering sketch below
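      A sketch of the filtering step described above: score every synthetic (query, document) pair and keep only the top slice. `relevance_score` stands in for the monoT5-3B reranker score; the 100 000 / 10 000 numbers mirror the quote.

      ```python
      # Sketch of the filtering step: generate many synthetic queries,
      # score each (query, document) pair, keep only the top slice.

      def filter_synthetic_queries(pairs, relevance_score, keep=10_000):
          """pairs: list of (query, document); returns the `keep` highest-scoring."""
          scored = [(relevance_score(q, d), q, d) for q, d in pairs]
          scored.sort(reverse=True, key=lambda x: x[0])
          return [(q, d) for _, q, d in scored[:keep]]

      # With 100_000 generated pairs and keep=10_000, 90% of the data is discarded,
      # which is the unablated design choice the quote calls out.
      ```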

    9. InPars method creates a prompt by concatenating a static prefix t with a document from the target domain d. InPars considers two different (fixed) prompt templates: a vanilla template and a guided by bad question (GBQ) template. The vanilla template consists of a fixed set of 3 pairs of queries and their relevant documents, sampled from the MS MARCO (bajaj2016ms, 3) dataset. The GBQ prompt extends this format by posing the original query to be a bad question, in contrast with a good question manually constructed by the authors. Feeding this prefix-target document pair t||d to the LLM is then expected to output a novel query q* likely to be relevant to the target document. Thousands of these positive examples are generated for the target domain, and are later used to fine-tune a monoT5 reranker model (nogueira2020document, 13).

      InPars method; see the prompt-construction sketch below
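      A minimal sketch of the vanilla t||d prompt described above. The few-shot examples below are invented placeholders, not the actual MS MARCO samples, and the exact wording of the template is assumed.

      ```python
      # Minimal sketch of the InPars-style "vanilla" prompt: a static prefix t
      # (three query/relevant-document pairs, in the original sampled from MS MARCO)
      # concatenated with the target document d.

      FEW_SHOT_EXAMPLES = [
          ("what is a lobster roll", "A lobster roll is a sandwich filled with ..."),
          ("how long do cats live", "Domestic cats typically live 12 to 18 years ..."),
          ("define photosynthesis", "Photosynthesis is the process by which plants ..."),
      ]

      def build_vanilla_prompt(target_document: str) -> str:
          prefix = "\n\n".join(
              f"Document: {doc}\nRelevant Query: {query}"
              for query, doc in FEW_SHOT_EXAMPLES
          )
          # t || d: the LLM is expected to continue with a novel query q*
          return f"{prefix}\n\nDocument: {target_document}\nRelevant Query:"
      ```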

    10. First, a generator LLM is presented with a prompt containing a target document and a set of relevant query-document pairs. The generator LLM is then invoked to generate a relevant query for this document. This process is repeated for each document in the dataset. The generated queries are subsequently filtered based on a scoring mechanism, and the high-quality queries are used to fine-tune a reranker model. At the end of the pipeline, the fine-tuned reranker model is evaluated on some test data and various retrieval metrics are reported.

      how the QGen pipeline works; see the pipeline skeleton below
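      An end-to-end skeleton of the QGen pipeline as described in the quote. Every function passed in (generate_query, relevance_score, finetune_reranker, evaluate) is a hypothetical placeholder for the corresponding stage.

      ```python
      # End-to-end skeleton of the QGen pipeline described above.

      def qgen_pipeline(documents, generate_query, relevance_score,
                        finetune_reranker, evaluate, keep=10_000):
          # 1. Prompt the generator LLM once per target document.
          synthetic = [(generate_query(d), d) for d in documents]

          # 2. Filter the generated queries with a scoring mechanism.
          scored = sorted(synthetic, key=lambda qd: relevance_score(*qd), reverse=True)
          training_pairs = scored[:keep]

          # 3. Fine-tune a reranker on the surviving high-quality pairs.
          reranker = finetune_reranker(training_pairs)

          # 4. Evaluate the fine-tuned reranker and report retrieval metrics.
          return evaluate(reranker)
      ```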

    11. (LLMs) to generate relevant queries synthetically. Notable examples include InPars (bonifacio2022inpars, 4), its extensions InPars-V2 (jeronymo2023inpars, 9) and InPars-light (boytsov2023inpars, 5), as well as Promptagator (dai2022promptagator, 7). These approaches are commonly referred to as QGen pipelines (chaudhary2023itsrelativesynthetic, 6),

      QGen: synthetic query generation pipelines

    1. major challenge in evaluating LLM-based agents is assessing their performance on long-horizon tasks in dynamic, evolving environments. Unlike most current benchmarks that focus on short episodes or single interactions, real-world enterprise agents often operate continuously over extended periods while interacting with users, systems, and data

      real-world, dynamic, long-horizon challenges

    2. For instance, datasets such as AAAR-1.0 [61], ScienceAgentBench [11], and TaskBench [83] provide structured, expert-labeled benchmarks for assessing research reasoning, scientific workflows, and multi-tool planning. Others, such as FlowBench [96], ToolBench [38], and API-Bank [47], focus on tool use and function-calling across large API repositories. These benchmarks typically include not only the gold tool sequences but also expected parameter structures, enabling fine-grained evaluation. In parallel, datasets like AssistantBench [109], AppWorld [91], and WebArena [126] simulate more open-ended and interactive agent behaviors in web and application environments. They emphasize dynamic decision-making, long-horizon planning, and user-agent interactions. Several benchmarks also support safety and robustness testing; for example, AgentHarm [5] assesses potentially harmful behaviors, while AgentDojo [17] evaluates resilience against prompt injection attacks. Leaderboards such as the Berkeley Function-Calling Leaderboard (BFCL) [100] and Holistic Agent Leaderboard [88] consolidate these evaluations by

      benchmarking

    3. evaluation frameworks have begun incorporating access control constraints into their design. For example, IntellAgent [45] includes evaluation tasks that require authentication of user identity and enforce policies that deny access to other users' information. By embedding role-specific restrictions into task generation, these approaches more accurately model how agents behave in permission-sensitive enterprise context

      frameworks

    4. The LLM-as-a-Judge approach [125] leverages the reasoning capabilities of LLMs to evaluate agent responses based on qualitative criteria, assessing responses according to the criteria provided through instructions. This method has gained traction due to its ability to handle tasks that are subjective and nuanced, such as summarization, reasoning, and conversational interactions.

      paper assessing LLMs' ability for query generation and assessment; see the judge sketch below
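      A hedged sketch of an LLM-as-a-Judge check: a judge model scores an agent response against criteria supplied in the instructions. `call_llm`, the prompt, and the SCORE/REASON format are assumptions for illustration.

      ```python
      # Hypothetical sketch of an LLM-as-a-Judge evaluation call.
      # `call_llm` is a placeholder for whatever judge model is used.

      JUDGE_PROMPT = """You are grading an agent's response.

      Criteria: {criteria}
      Task: {task}
      Response: {response}

      Give a score from 1 to 5 and a one-sentence justification.
      Answer as: SCORE: <n> | REASON: <text>"""

      def judge(task, response, criteria, call_llm):
          verdict = call_llm(JUDGE_PROMPT.format(task=task, response=response,
                                                 criteria=criteria))
          # Parse the structured verdict; fall back to 0 if the format is not followed.
          try:
              return int(verdict.split("SCORE:")[1].split("|")[0].strip())
          except (IndexError, ValueError):
              return 0
      ```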

    5. The code-based method is the most deterministic and objective approach [8, 53, 57]. It relies on explicit rules, test cases, or assertions to verify whether an agent’s response meets predefined criteria. This method is particularly effective for tasks with well-defined outputs, such as numerical calculations, structured query generation, or syntactic correctness in programming tasks

      query generation review; see the code-based check sketch below
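      A small sketch of the code-based evaluation style: deterministic rules and assertions over a well-defined output. The SQL-generation task and the specific checks are invented for illustration.

      ```python
      # Sketch of code-based evaluation: explicit, deterministic checks
      # (rules, test cases, assertions) against a well-defined expected output.

      def check_sql_generation(agent_output: str) -> dict:
          checks = {
              "is_select": agent_output.strip().lower().startswith("select"),
              "targets_orders_table": "from orders" in agent_output.lower(),
              "no_destructive_ops": not any(
                  kw in agent_output.lower() for kw in ("drop ", "delete ", "truncate ")
              ),
          }
          return {"passed": all(checks.values()), "checks": checks}

      # Usage: check_sql_generation("SELECT id, total FROM orders WHERE total > 100")
      # -> {'passed': True, 'checks': {...}}
      ```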

    1. The tool selection process involves choosing through a retriever or directly allowing LLMs to pick from a provided list of tools. When there are too many tools, a tool retriever is typically used to identify the top-K relevant tools to offer to the LLMs, a process known as retriever-based tool selection. If the quantity of tools is limited or upon receiving the tools retrieved during the tool retrieval phase, the LLMs need to select the appropriate tools based on the tool descriptions and the sub-question, which is known as LLM-based tool selection.

      repeated coverage of tool selection; see the retrieval sketch below
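      A sketch of retriever-based tool selection: embed the sub-question and each tool description, then pass only the top-K most similar tools to the LLM. `embed` is a placeholder for any sentence-embedding model; the data layout is assumed.

      ```python
      # Sketch of retriever-based tool selection via embedding similarity.
      import math

      def cosine(a, b):
          dot = sum(x * y for x, y in zip(a, b))
          norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
          return dot / norm if norm else 0.0

      def retrieve_tools(sub_question, tools, embed, k=5):
          """tools: list of {'name': ..., 'description': ...}; returns top-K by similarity."""
          q_vec = embed(sub_question)
          ranked = sorted(
              tools,
              key=lambda t: cosine(q_vec, embed(t["description"])),
              reverse=True,
          )
          return ranked[:k]  # these K tools are what the LLM then chooses among
      ```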

    2. RestGPT [93] introduces a Coarse-to-Fine Online Planning approach, an iterative task planning methodology that enables LLMs to progressively refine the process of task decomposition.

      another paper on task decomposition

    3. the paradigms for employing tool learning can be categorized into two types: tool learning with one-step task solving and tool learning with iterative task solving.

      two paradigms: iterative vs. one-step task solving

    4. user query along with the selected tools are furnished to the LLMs, enabling it to select the optimal tool and configure the necessary parameters for tool calling.

      LLM step

    1. ANS complements these emerging protocols by integrating Public Key Infrastructure (PKI) for identity and trust, defining structured communication via JSON Schema, incorporating DNS-like naming for discovery, establishing mechanisms for agent registration and renewal, and providing a formal specification of the protocol to enhance precision and implementability.

      how it works and its components; see the hypothetical registration record below
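      An illustrative registration record combining the ingredients the quote lists (DNS-like naming, PKI-backed identity, a JSON-Schema-described interface, registration/renewal metadata). The field names are assumptions for illustration, not the actual ANS specification.

      ```python
      # Illustrative (not from the ANS spec) shape of an agent registration record.

      agent_registration = {
          "name": "translator.agents.example.org",        # DNS-like, hierarchical name
          "publicKey": "-----BEGIN PUBLIC KEY-----...",    # PKI identity material
          "certificate": "-----BEGIN CERTIFICATE-----...", # signed by a registry CA
          "capabilities": {
              # JSON Schema describing the structured messages the agent accepts
              "type": "object",
              "properties": {
                  "text": {"type": "string"},
                  "targetLanguage": {"type": "string"},
              },
              "required": ["text", "targetLanguage"],
          },
          "registeredAt": "2026-01-15T00:00:00Z",
          "renewBy": "2026-07-15T00:00:00Z",               # registration must be renewed
      }
      ```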

    1. Architectural Trade-offs Are Protocol-Specific. The "right" registry architecture depends heavily on deployment context. Enterprise environments with existing Azure AD infrastructure benefit from Entra Agent ID’s seamless integration and zero-maintenance approach. Open research communities and decentralized applications require the cryptographic guarantees and federated governance of NANDA Index. Protocol-specific ecosystems like MCP benefit from purpose-built registries that leverage existing authentication systems (GitHub OAuth) while maintaining simplicity.

      key point for review

    2. MCP: a JSON file with agent metadata but not the actual tools, only instructions for how to connect (a lookup table)

      NANDA Index: a decentralized index quilt that connects registries; its security properties are useful for next stages.

      centralization; see the hypothetical registry-entry sketch below
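      A hypothetical lookup-table style registry entry: metadata and connection instructions only, with no tool implementations embedded. Field names are illustrative and not taken from the real MCP registry schema.

      ```python
      # Hypothetical registry entry: how to connect, not the tools themselves.

      registry_entry = {
          "id": "io.example/weather-server",
          "description": "MCP server exposing weather lookup tools",
          "connection": {
              "transport": "stdio",
              "command": "npx",
              "args": ["-y", "@example/weather-mcp"],
          },
          "maintainer": "example-org",
          "version": "1.2.0",
      }
      # A client resolves this entry, connects to the server, and only then
      # discovers the actual tools over the protocol.
      ```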

    3. Decentralization Enables Long-term Sustainability. While centralized approaches offer operational simplicity, they create single points of failure and vendor lock-in risks that become increasingly problematic as agent ecosystems mature. NANDA Index’s federated design demonstrates how decentralized architectures can achieve both scalability and community governance. The dual-path resolution pattern in NANDA particularly addresses privacy concerns that will become critical as agent interactions proliferate.

      key point for review

    1. revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate

      See what previous works cite

  2. Apr 2025
    1. Researchers like Josh Tenenbaum, Anima Anandkumar, and Yejin Choi are also now headed in increasingly neurosymbolic directions. Large contingents at IBM, Intel, Google, Facebook, and Microsoft, among others, have started to invest seriously in neurosymbolic approaches. Swarat Chaudhuri and his colleagues are developing a field called “neurosymbolic programming”23 that is music to my ears.

      example symbols

    2. Artur Garcez and Luis Lamb wrote a manifesto for hybrid models in 2009, called Neural-Symbolic Cognitive Reasoning. And some of the best-known recent successes in board-game playing (Go, Chess, and so forth, led primarily by work at Alphabet’s DeepMind) are hybrids. AlphaGo used symbolic-tree search, an idea from the late 195

      example symbols

    3. Belittling unfashionable ideas that haven’t yet been fully explored is not the right way to go. Hinton is quite right that in the old days AI researchers tried—too soon—to bury deep learning. But Hinton is just as wrong to do the same today to symbol-manipulation.

      argument symbols

    4. When deep learning reemerged in 2012, it was with a kind of take-no-prisoners attitude that has characterized most of the last decade. By 2015, his hostility toward all things symbols had fully crystallized. He gave a talk at an AI workshop at Stanford comparing symbols to aether, one of science’s greatest mistakes.

      argument symbols

    5. Warren McCulloch and Walter Pitts wrote in 1943, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” the only paper von Neumann found worthy enough to cite in his own foundational paper on computers.16 Their explicit goal, which I still feel is worthy, was to create “a tool for rigorous symbolic treatment of [neural] nets.”

      example symbols

    6. Symbolic operations also underlie data structures like dictionaries or databases that might keep records of particular individuals and their properties (like their addresses, or the last time a salesperson has been in touch with them), and allow programmers to build libraries of reusable code, and ever larger modules, which ease the development of complex systems. Such techniques are ubiquitous, the bread and butter of the software world.

      argument example symbols

    7. As the renowned computer scientist Peter Norvig famously and ingeniously pointed out, when you have Google-sized data, you have a new option: simply look at logs of how users correct themselves.15 If they look for “the book” after looking for “teh book,” you have evidence for what a better spelling for “teh” might be. No rules of spelling required.

      example symbols

    8. Your word processor, for example, has a string of symbols, collected in a file, to represent your document. Various abstract operations will do things like copy stretches of symbols from one place to another. Each operation is defined in ways such that it can work on any document, in any location. A word processor, in essence, is a kind of application of a set of algebraic operations (“functions” or “subroutines”) that apply to variables (such as “currentl

      example symbols

    9. it goes back at least to 1945, when the legendary mathematician von Neumann outlined the architecture that virtually all modern computers follow. Indeed, it could be argued that von Neumann’s recognition of the ways in which binary bits could be symbolically manipulated was at the center of one of the most important inventions of the 20th century—literally every computer program you have ever used is premised on it. (The “embeddings” that are popular in neural networks also look remarkably like symbols, though nobody seems to acknowledge this. Often, for example, any given word will be assigned a unique vector, in a one-to-one fashion that is quite analogous to the ASCII code. Calling something an “embedding” doesn’t mean it’s not a symbol.)

      example symbols?

    10. Classical computer science, of the sort practiced by Turing and von Neumann and everyone after, manipulates symbols in a fashion that we think of as algebraic, and that’s what’s really at stake. In simple algebra, we have three kinds of entities, variables (like x and y), operations (like + or -), and bindings (which tell us, for example, to let x = 12 for the purpose of some calculation). If I tell you that x = y + 2, and that y = 12, you can solve for the value of x by binding y to 12 and adding to that value, yielding 14. Virtually all the world’s software works by stringing algebraic operations together, assembling them into ever more complex algorithms.

      argument symbols

    11. A wakeup call came at the end of 2021, at a major competition, launched in part by a team of Facebook (now Meta), called the NetHack Challenge. NetHack, an extension of an earlier game known as Rogue, and forerunner to Zelda, is a single-user dungeon exploration game that was released in 1987.

      example symbols

    12. A lot of confusion in the field has come from not seeing the differences between the two: having symbols, and processing them algebraically. To understand how AI has wound up in the mess that it is in, it is essential to see the difference between the two. What are symbols? They are basically just codes. Symbols offer a principled mechanism for extrapolation: lawful, algebraic procedures that can be applied universally, independently of any similarity to known examples. They are (at least for now) still the best way to handcraft knowledge, and to deal robustly with abstractions in novel situations.

      argument symbols

    13. What’s more, the so-called scaling laws aren’t universal laws like gravity but rather mere observations that might not hold forever, much like Moore’s law, a trend in computer chip production that held for decades but arguably began to slow a decade ago.11

      example scaling

    14. There are serious holes in the scaling argument. To begin with, the measures that have scaled have not captured what we desperately need to improve: genuine comprehension. Insiders have long known that one of the biggest problems in AI research is the tests (“benchmarks”) that we use to evaluate AI systems

      arg scaling

    15. For example, when we typed this: “You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you …” GPT continued with “drink it. You are now dead.”

      example error

    16. Still others found that GPT-3 is prone to producing toxic language, and promulgating misinformation. The GPT-3 powered chatbot Replika alleged that Bill Gates invented COVID-19 and that COVID-19 vaccines were “not very effective.” A new effort by OpenAI to solve these problems wound up in a system that fabricated authoritative nonsense like, “Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation.

      example error

    17. A deep-learning system has mislabeled an apple as an iPod because the apple had a piece of paper in front with “iPod” written across. Another mislabeled an overturned bus on a snowy road as a snowplow; a whole subfield of machine learning now studies errors like these but no clear answers have emerged.

      example

    18. The car failed to recognize the person (partly obscured by the stop sign) and the stop sign (out of its usual context on the side of a road); the human driver had to take over. The scene was far enough outside of the training database that the system had no idea what to do.

      arg point 1

    19. Google’s latest contribution to language is a system (Lamda) that is so flighty that one of its own authors recently acknowledged it is prone to producing “bullshit.”5

      example