The live nature of BFCL aimed to bridge some of these gaps by introducing BFCL v2, which includes organizational tools, and BFCL v3, which incorporates integrated multi-turn and multi-step
function calling tool use benchmarking
The MINT benchmark (Wang et al., 2023) evaluates planning in interactive environments, showing that even advanced LLMs struggle with long-horizon tasks requiring multiple steps.
LLMs struggle with long-horizon tasks
foundational need for multi-step planning has led to the development of specialized benchmarks and evaluation frameworks that systematically assess these capabilities
specialized benchmarking for planning
We also identify critical gaps that future research must address—particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods.
Does my review assess this gap too?
This validates the identified claims that the pipeline is end-to-end reproducible and does incorporate plug-and-play functionality for different generator and reranker models with minimal modifications.
usable
Traditional synthetic data generation approaches, such as InPars, rely on static, handcrafted prompt templates that do not scale well to unseen datasets. To address this issue, we leverage the LLM’s own capabilities to determine what constitutes a relevant query. We utilize Chain of Thought (CoT) reasoning to guide the LLM in breaking down the document’s content into a series of logical steps before formulating the final query. Prior work has shown that CoT reasoning can enhance performance by encouraging more structured, contextually appropriate outputs (wei2022chain, 20).
prompt templating using LLMs; optimization; related papers
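To make the note above concrete, here is a minimal sketch of what a CoT-style query-generation prompt could look like; the wording and the build_cot_query_prompt / generate names are my own placeholders, not the toolkit's actual template.

```python
# Minimal sketch (not the toolkit's actual template): a Chain-of-Thought style
# prompt that asks the LLM to reason about a document before writing a query.
def build_cot_query_prompt(document: str) -> str:
    return (
        "Read the document below and think step by step about what information "
        "it contains, who would search for it, and what they would want to know. "
        "Then write one relevant search query.\n\n"
        f"Document: {document}\n\n"
        "Reasoning (step by step):"
    )

# Example usage with a placeholder generate() call standing in for any LLM client.
prompt = build_cot_query_prompt("Aspirin is a nonsteroidal anti-inflammatory drug...")
# response = generate(prompt)  # reasoning steps followed by a final "Query: ..." line
```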
we opt to not use the CoT methodology outlined in §2.5 to
if this is not an issue, it could be used
The trained generator model can then be used in zero-shot fashion in order to produce higher quality queries, enhancing downstream performance.
high quality queries
we aim to verify the validity of the three main claims we have identified: (1) the InPars Toolkit is an end-to-end reproducible pipeline; (2) it provides a comprehensive guide for reproducing InPars-V1, InPars-V2, and partially Promptagator; and (3) the toolkit has plug-and-play functionality.
if these claims can be verified, the toolkit works as described
InPars and Promptagator have proven to be effective strategies for improving Neural Information Retrieval models.
these qgen pipelines are effective
introduced DSPy (khattab2024dspy, 10), a programming model that offers a more systematic approach to optimizing LLM pipelines. DSPy brings the construction and optimization of LLM pipelines closer to traditional programming, where a compiler automatically constructs prompts and invocation strategies by optimizing the weights of general-purpose layers following a program.
optimized prompting for pipeline
The goal of CPO is to overcome two key shortcomings of supervised fine-tuning. First, the performance of supervised fine-tuning is limited by the quality of gold-standard human-annotated reference translations. Second, supervised fine-tuning cannot reject errors present in the provided gold-standard translations. CPO addresses these limitations by introducing a training schema that utilises triplets comprising reference translations, translations from a teacher model (e.g., GPT-4), and those from the student model. Reference-free evaluation models are employed to score the translations, and these scores are subsequently used to guide the student model toward generating preferred translations.
optimization
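A rough sketch of the triplet-construction step described above, under the assumption that a reference-free scorer picks the preferred and dispreferred translations; quality_score and PreferencePair are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of the CPO data-construction step described above.
# quality_score() stands in for a reference-free evaluation model; it is a
# placeholder, not the paper's actual scorer.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    source: str
    preferred: str      # highest-scoring translation
    dispreferred: str   # lowest-scoring translation

def quality_score(source: str, translation: str) -> float:
    return float(len(translation))  # placeholder; a real QE model goes here

def build_preference_pair(source, reference, teacher_out, student_out) -> PreferencePair:
    candidates = [reference, teacher_out, student_out]
    ranked = sorted(candidates, key=lambda t: quality_score(source, t))
    return PreferencePair(source=source, preferred=ranked[-1], dispreferred=ranked[0])
```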
Promptagator prompts the 137B-parameter model FLAN (wei2021finetuned, 19) with eight query-document examples, followed by a target document, to generate a relevant query for the target document. Unlike InPars, which uses a fixed prompt template, Promptagator employs a dataset-specific prompt template tailored to the target dataset’s retrieval task.
Promptagator QGen pipeline and how it works
extends the original InPars by replacing the scoring metric with a relevance score provided by a monoT5-3B reranker model. Additionally, the authors switch the generator LLM to GPT-J (wang2021gpt, 18). In both works, the authors prompt the generator model 100 000 times for synthetic queries but only keep the top 10 000 highest scoring instances. This means that 90% of the generated queries are discarded. It should be noted that the authors provide no substantiation or ablation for this setting.
issues and improvements to InPars
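A small sketch of this filtering step as I read it: score every generated (query, document) pair with a reranker and keep only the top 10,000; rerank_score is a placeholder for the monoT5-3B scorer.

```python
# Sketch of the filtering step: keep only the top-scoring slice of generated queries.
# rerank_score() is a placeholder for the monoT5-3B relevance scorer used in InPars-V2.
def filter_top_k(query_doc_pairs, rerank_score, k=10_000):
    scored = [(rerank_score(q, d), q, d) for q, d in query_doc_pairs]
    scored.sort(reverse=True, key=lambda x: x[0])
    return [(q, d) for _, q, d in scored[:k]]

# e.g. 100,000 generated (query, document) pairs -> 10,000 kept for reranker fine-tuning
```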
The InPars method creates a prompt by concatenating a static prefix t with a document from the target domain d. InPars considers two different (fixed) prompt templates: a vanilla template and a guided by bad question (GBQ) template. The vanilla template consists of a fixed set of 3 pairs of queries and their relevant documents, sampled from the MS MARCO (bajaj2016ms, 3) dataset. The GBQ prompt extends this format by posing the original query to be a bad question, in contrast with a good question manually constructed by the authors. Feeding this prefix-target document pair t||d to the LLM is then expected to output a novel query q* likely to be relevant to the target document. Thousands of these positive examples are generated for the target domain, and are later used to fine-tune a monoT5 reranker model (nogueira2020document, 13).
InPars method
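A sketch of how the vanilla template could be assembled; the three example pairs below are dummy placeholders rather than the actual MS MARCO samples used by the authors.

```python
# Sketch of the vanilla InPars template: a fixed prefix t of 3 query-document
# examples (dummy placeholders here, not real MS MARCO samples) concatenated
# with the target document d; the LLM is expected to complete the final query.
EXAMPLES = [
    ("what is the boiling point of water", "Water boils at 100 degrees Celsius at sea level."),
    ("who wrote hamlet", "Hamlet is a tragedy written by William Shakespeare."),
    ("how long is a marathon", "A marathon is a long-distance race of 42.195 km."),
]

def vanilla_prompt(target_document: str) -> str:
    prefix = "\n\n".join(
        f"Document: {doc}\nRelevant query: {query}" for query, doc in EXAMPLES
    )
    # t || d : fixed prefix followed by the target document, query left blank
    return f"{prefix}\n\nDocument: {target_document}\nRelevant query:"
```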
First, a generator LLM is presented with a prompt containing a target document and a set of relevant query-document pairs. The generator LLM is then invoked to generate a relevant query for this document. This process is repeated for each document in the dataset. The generated queries are subsequently filtered based on a scoring mechanism, and the high-quality queries are used to fine-tune a reranker model. At the end of the pipeline, the fine-tuned reranker model is evaluated on some test data and various retrieval metrics are reported.
how the QGen pipeline works
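Putting the stages together, a high-level skeleton of the pipeline as described; every helper name here (generate_query, score, fine_tune, evaluate) is a placeholder, not the toolkit's API.

```python
# High-level skeleton of the QGen pipeline above; generator, reranker, and evaluate
# are injected placeholders, not the toolkit's actual objects or API.
def qgen_pipeline(documents, generator, reranker, evaluate, test_set, keep_top=10_000):
    synthetic = [(generator.generate_query(d), d) for d in documents]           # 1. generate
    scored = sorted(synthetic, key=lambda qd: reranker.score(*qd), reverse=True)
    training_pairs = scored[:keep_top]                                          # 2. filter
    fine_tuned = reranker.fine_tune(training_pairs)                             # 3. fine-tune reranker
    return evaluate(fine_tuned, test_set)                                       # 4. report retrieval metrics
```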
(LLMs) to generate relevant queries synthetically. Notable examples include InPars (bonifacio2022inpars, 4), its extensions InPars-V2 (jeronymo2023inpars, 9) and InPars-light (boytsov2023inpars, 5), as well as Promptagator (dai2022promptagator, 7). These approaches are commonly referred to as QGen pipelines (chaudhary2023itsrelativesynthetic, 6),
QGen: synthetic query generation pipelines
that they can operate within domain-specific policies and compliance constraints.
domain specific policies
major challenge in evaluating LLM-based agents is assessing their performance on long-horizon tasks in dynamic, evolving environments. Unlike most current benchmarks that focus on short episodes or single interactions, real-world enterprise agents often operate continuously over extended periods while interacting with users, systems, and data
real-world dynamic long-horizon challenges.
For instance, datasets such as AAAR-1.0 [61], ScienceAgentBench [11], and TaskBench [83] provide structured, expert-labeled benchmarks for assessing research reasoning, scientific workflows, and multi-tool planning. Others, such as FlowBench [96], ToolBench [38], and API-Bank [47], focus on tool use and function-calling across large API repositories. These benchmarks typically include not only the gold tool sequences but also expected parameter structures, enabling fine-grained evaluation. In parallel, datasets like AssistantBench [109], AppWorld [91], and WebArena [126] simulate more open-ended and interactive agent behaviors in web and application environments. They emphasize dynamic decision-making, long-horizon planning, and user-agent interactions. Several benchmarks also support safety and robustness testing—for example, AgentHarm [5] assesses potentially harmful behaviors, while AgentDojo [17] evaluates resilience against prompt injection attacks. Leaderboards such as the Berkeley Function-Calling Leaderboard (BFCL) [100] and Holistic Agent Leaderboard [88] consolidate these evaluations by
benchmarking
the 𝜏-benchmark [104] explicitly incorporates the pass^k metric to evaluate the consistency of an agent
reliability and consistency; paper comparison
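As I understand it, pass^k asks whether all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials; a small sketch of that estimate (my reading of the metric, not code from the benchmark):

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Average over tasks of C(c, k) / C(n, k): the estimated probability that
    k i.i.d. trials of a task all succeed, given c successes out of n trials."""
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Example: 3 tasks, 8 trials each, with 8, 6, and 4 successful trials respectively.
print(pass_hat_k([8, 6, 4], n_trials=8, k=2))  # ~0.58
```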
evaluation frameworks have begun incorporating access control constraints into their design. For example, IntellAgent [45] includes evaluation tasks that require authentication of user identity and enforce policies that deny access to other users’ information. By embedding role-specific restrictions into task generation, these approaches more accurately model how agents behave in permission-sensitive enterprise contexts
frameworks
The LLM-as-a-Judge approach [125] leverages the reasoning capabilities of LLMs to evaluate agent responses based on qualitative criteria, assessing responses according to the criteria provided through instructions. This method has gained traction due to its ability to handle tasks that are subjective and nuanced, such as summarization, reasoning, and conversational interactions.
paper assessing LLMs' ability for query generation and assessment
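A generic sketch of what an LLM-as-a-Judge call can look like; the rubric, field names, and judge() client are illustrative assumptions, not the cited paper's template.

```python
# Generic sketch of an LLM-as-a-Judge call: the rubric and output format here are
# illustrative, not the template from the cited work. judge() stands in for any LLM client.
def build_judge_prompt(task: str, agent_response: str, criteria: list[str]) -> str:
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are an impartial judge. Rate the agent response against each criterion "
        "from 1 to 5 and briefly justify each score.\n\n"
        f"Task: {task}\nAgent response: {agent_response}\nCriteria:\n{rubric}"
    )

prompt = build_judge_prompt(
    task="Summarize the meeting notes",
    agent_response="The team agreed to ship v2 next week.",
    criteria=["faithfulness", "completeness", "conciseness"],
)
# verdict = judge(prompt)  # returns per-criterion scores plus a justification
```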
The code-based method is the most deterministic and objective approach [8, 53, 57]. It relies on explicit rules, test cases, or assertions to verify whether an agent’s response meets predefined criteria. This method is particularly effective for tasks with well-defined outputs, such as numerical calculations, structured query generation, or syntactic correctness in programming tasks
query generation review
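A minimal example of such a code-based check for a structured-query task; the expected SQL string and the JSON shape are invented for illustration.

```python
# Sketch of code-based (rule/assertion) evaluation for a well-defined output,
# here a structured-query task; the expected answer is a made-up example.
import json

def check_sql_response(agent_output: str) -> bool:
    """Pass only if the agent returned valid JSON containing the expected SQL string."""
    try:
        payload = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    return payload.get("sql", "").strip().lower() == "select count(*) from orders"

assert check_sql_response('{"sql": "SELECT COUNT(*) FROM orders"}')
assert not check_sql_response("I think there are 42 orders.")
```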
The tool selection process involves choosing through a retriever or directly allowing LLMs to pick from a provided list of tools. When there are too many tools, a tool retriever is typically used to identify the top-K relevant tools to offer to the LLMs, a process known as retriever-based tool selection. If the quantity of tools is limited or upon receiving the tools retrieved during the tool retrieval phase, the LLMs need to select the appropriate tools based on the tool descriptions and the sub-question, which is known as LLM-based tool selection.
repeated coverage of tool selection
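A minimal sketch of retriever-based selection: rank tool descriptions by similarity to the sub-question and pass only the top-K to the LLM; embed() is a placeholder for any embedding model.

```python
# Minimal sketch of retriever-based tool selection: embed the sub-question and the
# tool descriptions, then hand only the top-K most similar tools to the LLM.
# embed() is a placeholder for any text-embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_top_k_tools(sub_question, tools, embed, k=5):
    q_vec = embed(sub_question)
    ranked = sorted(tools, key=lambda t: cosine(q_vec, embed(t["description"])), reverse=True)
    return ranked[:k]   # these K tool descriptions are what the LLM actually sees
```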
RestGPT [93] introduces a Coarse-to-Fine Online Planning approach, an iterative task planning methodology that enables LLMs to progressively refine the process of task decomposition.
another paper on task decomposition
innate abilities of LLMs enable effective planning through methods such as few-shot or even zero-shot prompting.
LLM ability for planning
This stage involves the decomposition of a user question into multiple sub-questions as required
decomposition
the paradigms for employing tool learning can be categorized into two types: tool learning with one-step task solving and tool learning with iterative task solving.
two paradigms: iterative vs. one-step
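The contrast, sketched in code: one-step solving plans all tool calls up front, while iterative solving interleaves planning, tool calls, and observations; llm_plan, llm_step, and call_tool are placeholder functions, not a specific framework's API.

```python
# Contrast of the two paradigms; llm_plan, llm_step and call_tool are placeholders.
def one_step_solving(question, llm_plan, call_tool):
    plan = llm_plan(question)                 # all tool calls decided up front
    return [call_tool(step) for step in plan]

def iterative_solving(question, llm_step, call_tool, max_turns=10):
    history = []
    for _ in range(max_turns):                # plan -> act -> observe loop
        action = llm_step(question, history)
        if action is None:                    # model signals it is done
            break
        history.append((action, call_tool(action)))
    return history
```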
two types based on whether a retriever is used: retriever-based tool selection and LLM-based tool selection.
Retriever-based vs. LLM-based selection
user query along with the selected tools are furnished to the LLMs, enabling them to select the optimal tool and configure the necessary parameters for tool calling.
LLM step
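An illustrative shape for what the model emits at this step, a chosen tool plus filled-in parameters, and how a host might dispatch it; the field names are assumptions, not a fixed standard.

```python
# Illustrative shape of what the LLM returns at this step: a chosen tool plus the
# parameters it filled in. The field names are assumptions, not a fixed standard.
tool_call = {
    "tool": "get_weather",
    "parameters": {"city": "Berlin", "unit": "celsius"},
}

def execute(tool_call, registry):
    """Dispatch the selected tool with the model-provided parameters."""
    return registry[tool_call["tool"]](**tool_call["parameters"])

# Toy registry with a stubbed tool, just to show the dispatch.
result = execute(tool_call, registry={"get_weather": lambda city, unit: f"18 {unit} in {city}"})
```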
ANS complements these emerging protocols by integrating Public Key Infrastructure (PKI) for identity and trust, defining structured communication via JSON Schema, incorporating DNS-like naming for discovery, establishing mechanisms for agent registration and renewal, and providing a formal specification of the protocol to enhance precision and implementability.
how it works; components
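A hypothetical registration record in the spirit of those components (DNS-like name, PKI-backed identity, expiry for renewal, schema-checked metadata); the field names are illustrative only and are not taken from the ANS specification.

```python
# Hypothetical agent registration record in the spirit of ANS. Field names are
# illustrative only, not the ANS specification.
REQUIRED_FIELDS = {"name", "protocol", "endpoint", "public_key_pem", "expires"}

record = {
    "name": "translator.agents.example",        # DNS-like, hierarchical name
    "protocol": "a2a",                           # which protocol adapter to use
    "endpoint": "https://agents.example/translator",
    "public_key_pem": "-----BEGIN PUBLIC KEY-----...",  # identity anchored in PKI
    "expires": "2026-01-01T00:00:00Z",           # registrations are renewed, not permanent
}

def validate_registration(rec: dict) -> bool:
    """Structural check standing in for full JSON Schema validation."""
    return REQUIRED_FIELDS.issubset(rec)

assert validate_registration(record)
```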
a modular Protocol Adapter Layer supporting diverse communication standards (A2A, MCP, ACP, etc.); a
supports diverse communication standards
Architectural Trade-offs Are Protocol-Specific. The "right" registry architecture depends heavily on deployment context. Enterprise environments with existing Azure AD infrastructure benefit from Entra Agent ID’s seamless integration and zero-maintenance approach. Open research communities and decentralized applications require the cryptographic guarantees and federated governance of NANDA Index. Protocol-specific ecosystems like MCP benefit from purpose-built registries that leverage existing authentication systems (GitHub OAuth) while maintaining simplicity.
key point for review
MCP: JSON file with agent metadata but not the actual tools, only instructions on how to connect (a lookup table)
NANDA Index: decentralized index quilt that connects registries; security is useful for next stages.
centralization
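A hypothetical connection-descriptor entry of the kind the MCP note describes: it tells a client how to reach a server without listing the server's tools; the keys below are illustrative, not the official mcp.json schema.

```python
# Hypothetical connection-descriptor entry: how to reach the server, but not the
# tools themselves. The keys are illustrative, not the official mcp.json schema.
import json

entry = {
    "name": "filesystem",
    "description": "Read-only access to a local project directory",
    "transport": {"type": "stdio", "command": "mcp-server-filesystem", "args": ["./project"]},
    # note: no tool list here; tools are discovered after connecting
}

print(json.dumps(entry, indent=2))
```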
Decentralization Enables Long-term Sustainability. While centralized approaches offer operational simplicity, they create single points of failure and vendor lock-in risks that become increasingly problematic as agent ecosystems mature. NANDA Index’s federated design demonstrates how decentralized architectures can achieve both scalability and community governance. The dual-path resolution pattern in NANDA particularly addresses privacy concerns that will become critical as agent interactions proliferate.
key point for review
This paper surveys three prominent registry approaches, each defined by a unique metadata model: MCP’s mcp.json, A2A’s Agent Card, and NANDA’s AgentFacts.
Registry approaches
APIBench
APIBench: see relevance
LLaMA-based model that surpasses the performance of GPT-4 on writing API calls.
task development and testing/ranking
modern indexing
Hierarchical Structure
Implementation of hierarchical structure
Evaluation Protocol
Relevance of this approach if used
Self-Reflection Mechanism in LLMs.
self-reflection mechanism is the approach used by this agent search engine
struggling with precise numbers and specific fields of expertise
Task development
either solvable or non-solvable
Approach to queries
ToolBench
search query task evaluation; task development
Rapid API
Cite as example use of hierarchical structure
Hierarchical Structure
Hierarchical structure
revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate
See what previous works cite
solver
Task development query section
self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable
Task development
API retriever with a hierarchical structure
Hierarchical element
16,000 APIs from Rapid API
Use in indexing
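A sketch of hierarchical retrieval over a large API catalog (category, then tool, then endpoint); score() is a placeholder relevance model and the catalog layout only loosely mirrors RapidAPI's nesting.

```python
# Sketch of hierarchical retrieval over a large API collection: first pick likely
# categories, then tools within them, then concrete API endpoints. score() is a
# placeholder for any relevance model.
def hierarchical_retrieve(query, catalog, score, top_categories=3, top_tools=5, top_apis=10):
    cats = sorted(catalog, key=lambda c: score(query, c["name"]), reverse=True)[:top_categories]
    tools = [t for c in cats for t in c["tools"]]
    tools = sorted(tools, key=lambda t: score(query, t["description"]), reverse=True)[:top_tools]
    apis = [a for t in tools for a in t["apis"]]
    return sorted(apis, key=lambda a: score(query, a["description"]), reverse=True)[:top_apis]
```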
hypothesis
traditional keyword match
We should never forget that the human brain is perhaps the most complicated system in the known universe; if we are to build something roughly its equal, open-hearted collaboration will be key.
arg symbols
Researchers like Josh Tenenbaum, Anima Anandkumar, and Yejin Choi are also now headed in increasingly neurosymbolic directions. Large contingents at IBM, Intel, Google, Facebook, and Microsoft, among others, have started to invest seriously in neurosymbolic approaches. Swarat Chaudhuri and his colleagues are developing a field called “neurosymbolic programming”23 that is music to my ears.
example symbols
Artur Garcez and Luis Lamb wrote a manifesto for hybrid models in 2009, called Neural-Symbolic Cognitive Reasoning. And some of the best-known recent successes in board-game playing (Go, Chess, and so forth, led primarily by work at Alphabet’s DeepMind) are hybrids. AlphaGo used symbolic-tree search, an idea from the late 1950s
example symbols
No single AI approach will ever be enough on its own; we must master the art of putting diverse approaches together, if we are to have any hope at all.
arg symbols
Belittling unfashionable ideas that haven’t yet been fully explored is not the right way to go. Hinton is quite right that in the old days AI researchers tried—too soon—to bury deep learning. But Hinton is just as wrong to do the same today to symbol-manipulation.
argument symbols
For at least four reasons, hybrid AI, not deep learning alone (nor symbols alone) seems the best way forward:
arg symbols
In 1990, Hinton published a special issue of the journal Artificial Intelligence called Connectionist Symbol Processing that explicitly aimed to bridge the two worlds of deep learning and symbol manipulation.
argument example symbols
When deep learning reemerged in 2012, it was with a kind of take-no-prisoners attitude that has characterized most of the last decade. By 2015, his hostility toward all things symbols had fully crystallized. He gave a talk at an AI workshop at Stanford comparing symbols to aether, one of science’s greatest mistakes.
argument symbols
Warren McCulloch and Walter Pitts wrote in 1943, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” the only paper von Neumann found worthy enough to cite in his own foundational paper on computers.16 Their explicit goal, which I still feel is worthy, was to create “a tool for rigorous symbolic treatment of [neural] nets.”
example symbols
Symbolic operations also underlie data structures like dictionaries or databases that might keep records of particular individuals and their properties (like their addresses, or the last time a salesperson has been in touch with them), and allow programmers to build libraries of reusable code, and ever larger modules, which ease the development of complex systems. Such techniques are ubiquitous, the bread and butter of the software world.
argument example symbols
If symbols are so critical for software engineering, why not use them in AI, too?
arg symbols
As the renowned computer scientist Peter Norvig famously and ingeniously pointed out, when you have Google-sized data, you have a new option: simply look at logs of how users correct themselves.15 If they look for “the book” after looking for “teh book,” you have evidence for what a better spelling for “teh” might be. No rules of spelling required.
example symbols
Your word processor, for example, has a string of symbols, collected in a file, to represent your document. Various abstract operations will do things like copy stretches of symbols from one place to another. Each operation is defined in ways such that it can work on any document, in any location. A word processor, in essence, is a kind of application of a set of algebraic operations (“functions” or “subroutines”) that apply to variables (such as “currentl
example symbols
it goes back at least to 1945, when the legendary mathematician von Neumann outlined the architecture that virtually all modern computers follow. Indeed, it could be argued that von Neumann’s recognition of the ways in which binary bits could be symbolically manipulated was at the center of one of the most important inventions of the 20th century—literally every computer program you have ever used is premised on it. (The “embeddings” that are popular in neural networks also look remarkably like symbols, though nobody seems to acknowledge this. Often, for example, any given word will be assigned a unique vector, in a one-to-one fashion that is quite analogous to the ASCII code. Calling something an “embedding” doesn’t mean it’s not a symbol.)
example symbols ?
Classical computer science, of the sort practiced by Turing and von Neumann and everyone after, manipulates symbols in a fashion that we think of as algebraic, and that’s what’s really at stake. In simple algebra, we have three kinds of entities, variables (like x and y), operations (like + or -), and bindings (which tell us, for example, to let x = 12 for the purpose of some calculation). If I tell you that x = y + 2, and that y = 12, you can solve for the value of x by binding y to 12 and adding to that value, yielding 14. Virtually all the world’s software works by stringing algebraic operations together, assembling them into ever more complex algorithms.
argument symbols
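The algebra in this passage, written out as ordinary code to make the point about bindings explicit:

```python
# The algebra from the passage as ordinary code: a variable, an operation, and a
# binding. The same rule works for any binding of y, which is the point about
# algebraic, symbol-manipulating computation.
def solve_for_x(y):
    return y + 2   # x = y + 2

print(solve_for_x(12))   # binding y = 12 yields x = 14
print(solve_for_x(100))  # the same rule extrapolates to any value: 102
```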
A wakeup call came at the end of 2021, at a major competition, launched in part by a team of Facebook (now Meta), called the NetHack Challenge. NetHack, an extension of an earlier game known as Rogue, and forerunner to Zelda, is a single-user dungeon exploration game that was released in 1987.
example symbols
A lot of confusion in the field has come from not seeing the differences between the two—having symbols, and processing them algebraically. To understand how AI has wound up in the mess that it is in, it is essential to see the difference between the two. What are symbols? They are basically just codes. Symbols offer a principled mechanism for extrapolation: lawful, algebraic procedures that can be applied universally, independently of any similarity to known examples. They are (at least for now) still the best way to handcraft knowledge, and to deal robustly with abstractions in novel situations.
argument symbols
And yet, for the most part, that’s how most current AI proceeds. Hinton and many others have tried hard to banish symbols altogether.
arg symbols
A 2022 paper from Google concludes that making GPT-3-like models bigger makes them more fluent, but no more trustworthy.
example scaling
If scaling doesn’t get us to safe autonomous driving, tens of billions of dollars of investment in scaling could turn out to be for naught.
arg scaling
Among other things, we are very likely going to need to revisit a once-popular idea that Hinton seems devoutly to want to crush: the idea of manipulating symbols
argument symbols
scaling starts to falter on some measures, such as toxicity, truthfulness, reasoning, and common sense.
argument scaling
What’s more, the so-called scaling laws aren’t universal laws like gravity but rather mere observations that might not hold forever, much like Moore’s law, a trend in computer chip production that held for decades but arguably began to slow a decade ago.11
example scaling
There are serious holes in the scaling argument. To begin with, the measures that have scaled have not captured what we desperately need to improve: genuine comprehension. Insiders have long known that one of the biggest problems in AI research is the tests (“benchmarks”) that we use to evaluate AI systems
arg scaling
What should we do about it? One option, currently trendy, might be just to gather more data.
arg scaling
For example, when we typed this: “You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you …” GPT continued with “drink it. You are now dead.”
example error
Still others found that GPT-3 is prone to producing toxic language, and promulgating misinformation. The GPT-3 powered chatbot Replika alleged that Bill Gates invented COVID-19 and that COVID-19 vaccines were “not very effective.” A new effort by OpenAI to solve these problems wound up in a system that fabricated authoritative nonsense like, “Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation.
example error
A deep-learning system has mislabeled an apple as an iPod because the apple had a piece of paper in front with “iPod” written across. Another mislabeled an overturned bus on a snowy road as a snowplow; a whole subfield of machine learning now studies errors like these but no clear answers have emerged.
example
Current deep-learning systems frequently succumb to stupid errors like this. They sometimes misread dirt on an image that a human radiologist would recognize as a glitch.
arg error
The car failed to recognize the person (partly obscured by the stop sign) and the stop sign (out of its usual context on the side of a road); the human driver had to take over. The scene was far enough outside of the training database that the system had no idea what to do.
arg point 1
it occasionally confuses baby photos of my two children. But the stakes are low—if the app makes an occasional error, I am not going to throw away my phone.
arg
I asked my iPhone the other day to find a picture of a rabbit
example
Google’s latest contribution to language is a system (Lamda) that is so flighty that one of its own authors recently acknowledged it is prone to producing “bullshit.”5
example