The live nature of BFCL aimed to bridge some of these gaps by introducing BFCL v2, which includes organizational tools, and BFCL v3, which incorporates integrated multi-turn and multi-step
function calling tool use benchmarking
The MINT benchmark (Wang et al., 2023) evaluates planning in interactive environments, showing that even advanced LLMs struggle with long-horizon tasks requiring multiple steps.
LLMs struggle with long-horizon tasks
foundational need for multi-step planning has led to the development of specialized benchmarks and evaluation frameworks that systematically assess these capabilities
specialized benchmarking for planning
We also identify critical gaps that future research must address—particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods.
Does my review assess this gap too?
This validates the identified claims that the pipeline is end-to-end reproducible and does incorporate plug-and-play functionality for different generator and reranker models with minimal modifications.
usable
Traditional synthetic data generation approaches, such as InPars, rely on static, handcrafted prompt templates that do not scale well to unseen datasets. To address this issue, we leverage the LLM’s own capabilities to determine what constitutes a relevant query. We utilize Chain of Thought (CoT) reasoning to guide the LLM in breaking down the document’s content into a series of logical steps before formulating the final query. Prior work has shown that CoT reasoning can enhance performance by encouraging more structured, contextually appropriate outputs (wei2022chain, 20).
prompt templating using LLMs; optimization; related papers
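To make the note above concrete, here is a minimal sketch of what a CoT-style query-generation prompt could look like; the wording and the build_cot_query_prompt / generate names are my own placeholders, not the toolkit's actual template.

```python
# Minimal sketch (not the toolkit's actual template): a Chain-of-Thought style
# prompt that asks the LLM to reason about a document before writing a query.
def build_cot_query_prompt(document: str) -> str:
    return (
        "Read the document below and think step by step about what information "
        "it contains, who would search for it, and what they would want to know. "
        "Then write one relevant search query.\n\n"
        f"Document: {document}\n\n"
        "Reasoning (step by step):"
    )

# Example usage with a placeholder generate() call standing in for any LLM client.
prompt = build_cot_query_prompt("Aspirin is a nonsteroidal anti-inflammatory drug...")
# response = generate(prompt)  # reasoning steps followed by a final "Query: ..." line
```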
we opt to not use the CoT methodology outlined in §2.5 to
if this is not an issue, it could be used
The trained generator model can then be used in zero-shot fashion in order to produce higher quality queries, enhancing downstream performance.
high quality queries
we aim to verify the validity of the three main claims we have identified: (1) the InPars Toolkit is an end-to-end reproducible pipeline; (2) it provides a comprehensive guide for reproducing InPars-V1, InPars-V2, and partially Promptagator; and (3) the toolkit has plug-and-play functionality.
if these claims can be verified, the toolkit works as described
InPars and Promptagator have proven to be effective strategies for improving Neural Information Retrieval models.
these qgen pipelines are effective
introduced DSPy (khattab2024dspy, 10), a programming model that offers a more systematic approach to optimizing LLM pipelines. DSPy brings the construction and optimization of LLM pipelines closer to traditional programming, where a compiler automatically constructs prompts and invocation strategies by optimizing the weights of general-purpose layers following a program.
optimized prompting for pipeline
The goal of CPO is to overcome two key shortcomings of supervised fine-tuning. First, the performance of supervised fine-tuning is limited by the quality of gold-standard human-annotated reference translations. Second, supervised fine-tuning cannot reject errors present in the provided gold-standard translations. CPO addresses these limitations by introducing a training schema that utilises triplets comprising reference translations, translations from a teacher model (e.g., GPT-4), and those from the student model. Reference-free evaluation models are employed to score the translations, and these scores are subsequently used to guide the student model toward generating preferred translations.
optimization
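A rough sketch of the triplet-construction step described above, under the assumption that a reference-free scorer picks the preferred and dispreferred translations; quality_score and PreferencePair are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of the CPO data-construction step described above.
# quality_score() stands in for a reference-free evaluation model; it is a
# placeholder, not the paper's actual scorer.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    source: str
    preferred: str      # highest-scoring translation
    dispreferred: str   # lowest-scoring translation

def quality_score(source: str, translation: str) -> float:
    return float(len(translation))  # placeholder; a real QE model goes here

def build_preference_pair(source, reference, teacher_out, student_out) -> PreferencePair:
    candidates = [reference, teacher_out, student_out]
    ranked = sorted(candidates, key=lambda t: quality_score(source, t))
    return PreferencePair(source=source, preferred=ranked[-1], dispreferred=ranked[0])
```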
Promptagator prompts the 137B-parameter model FLAN (wei2021finetuned, 19) with eight query-document examples, followed by a target document, to generate a relevant query for the target document. Unlike InPars, which uses a fixed prompt template, Promptagator employs a dataset-specific prompt template tailored to the target dataset’s retrieval task.
Promptagator QGen pipeline and how it works
extends the original InPars by replacing the scoring metric with a relevance score provided by a monoT5-3B reranker model. Additionally, the authors switch the generator LLM to GPT-J (wang2021gpt, 18). In both works, the authors prompt the generator model 100 000 times for synthetic queries but only keep the top 10 000 highest scoring instances. This means that 90% of the generated queries are discarded. It should be noted that the authors provide no substantiation or ablation for this setting.
issues and improvements to InPars
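A small sketch of this filtering step as I read it: score every generated (query, document) pair with a reranker and keep only the top 10,000; rerank_score is a placeholder for the monoT5-3B scorer.

```python
# Sketch of the filtering step: keep only the top-scoring slice of generated queries.
# rerank_score() is a placeholder for the monoT5-3B relevance scorer used in InPars-V2.
def filter_top_k(query_doc_pairs, rerank_score, k=10_000):
    scored = [(rerank_score(q, d), q, d) for q, d in query_doc_pairs]
    scored.sort(reverse=True, key=lambda x: x[0])
    return [(q, d) for _, q, d in scored[:k]]

# e.g. 100,000 generated (query, document) pairs -> 10,000 kept for reranker fine-tuning
```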
The InPars method creates a prompt by concatenating a static prefix t with a document from the target domain d. InPars considers two different (fixed) prompt templates: a vanilla template and a guided by bad question (GBQ) template. The vanilla template consists of a fixed set of 3 pairs of queries and their relevant documents, sampled from the MS MARCO (bajaj2016ms, 3) dataset. The GBQ prompt extends this format by posing the original query to be a bad question, in contrast with a good question manually constructed by the authors. Feeding this prefix-target document pair t||d to the LLM is then expected to output a novel query q* likely to be relevant to the target document. Thousands of these positive examples are generated for the target domain, and are later used to fine-tune a monoT5 reranker model (nogueira2020document, 13).
InPars method
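A sketch of how the vanilla template could be assembled; the three example pairs below are dummy placeholders rather than the actual MS MARCO samples used by the authors.

```python
# Sketch of the vanilla InPars template: a fixed prefix t of 3 query-document
# examples (dummy placeholders here, not real MS MARCO samples) concatenated
# with the target document d; the LLM is expected to complete the final query.
EXAMPLES = [
    ("what is the boiling point of water", "Water boils at 100 degrees Celsius at sea level."),
    ("who wrote hamlet", "Hamlet is a tragedy written by William Shakespeare."),
    ("how long is a marathon", "A marathon is a long-distance race of 42.195 km."),
]

def vanilla_prompt(target_document: str) -> str:
    prefix = "\n\n".join(
        f"Document: {doc}\nRelevant query: {query}" for query, doc in EXAMPLES
    )
    # t || d : fixed prefix followed by the target document, query left blank
    return f"{prefix}\n\nDocument: {target_document}\nRelevant query:"
```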
First, a generator LLM is presented with a prompt containing a target document and a set of relevant query-document pairs. The generator LLM is then invoked to generate a relevant query for this document. This process is repeated for each document in the dataset. The generated queries are subsequently filtered based on a scoring mechanism, and the high-quality queries are used to fine-tune a reranker model. At the end of the pipeline, the fine-tuned reranker model is evaluated on some test data and various retrieval metrics are reported.
how the QGen pipeline works
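Putting the stages together, a high-level skeleton of the pipeline as described; every helper name here (generate_query, score, fine_tune, evaluate) is a placeholder, not the toolkit's API.

```python
# High-level skeleton of the QGen pipeline above; generator, reranker, and evaluate
# are injected placeholders, not the toolkit's actual objects or API.
def qgen_pipeline(documents, generator, reranker, evaluate, test_set, keep_top=10_000):
    synthetic = [(generator.generate_query(d), d) for d in documents]           # 1. generate
    scored = sorted(synthetic, key=lambda qd: reranker.score(*qd), reverse=True)
    training_pairs = scored[:keep_top]                                          # 2. filter
    fine_tuned = reranker.fine_tune(training_pairs)                             # 3. fine-tune reranker
    return evaluate(fine_tuned, test_set)                                       # 4. report retrieval metrics
```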
(LLMs) to generate relevant queries synthetically. Notable examples include InPars (bonifacio2022inpars, 4), its extensions InPars-V2 (jeronymo2023inpars, 9) and InPars-light (boytsov2023inpars, 5), as well as Promptagator (dai2022promptagator, 7). These approaches are commonly referred to as QGen pipelines (chaudhary2023itsrelativesynthetic, 6),
QGen: synthetic query generation pipelines
that they can operate within domain-specific policies and compliance constraints.
domain specific policies
major challenge in evaluating LLM-based agents is assessing their performance on long-horizon tasks in dynamic, evolving environments. Unlike most current benchmarks that focus on short episodes or single interactions, real-world enterprise agents often operate continuously over extended periods while interacting with users, systems, and data
real-world dynamic long-horizon challenges.
For instance, datasets such as AAAR-1.0 [61], ScienceAgentBench [11], and TaskBench [83] provide structured, expert-labeled benchmarks for assessing research reasoning, scientific workflows, and multi-tool planning. Others, such as FlowBench [96], ToolBench [38], and API-Bank [47], focus on tool use and function-calling across large API repositories. These benchmarks typically include not only the gold tool sequences but also expected parameter structures, enabling fine-grained evaluation. In parallel, datasets like AssistantBench [109], AppWorld [91], and WebArena [126] simulate more open-ended and interactive agent behaviors in web and application environments. They emphasize dynamic decision-making, long-horizon planning, and user-agent interactions. Several benchmarks also support safety and robustness testing—for example, AgentHarm [5] assesses potentially harmful behaviors, while AgentDojo [17] evaluates resilience against prompt injection attacks. Leaderboards such as the Berkeley Function-Calling Leaderboard (BFCL) [100] and Holistic Agent Leaderboard [88] consolidate these evaluations by
benchmarking
the 𝜏-benchmark [104] explicitly incorporates the pass^k metric to evaluate the consistency of an agent
reliability and consistency; paper comparison
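As I understand it, pass^k asks whether all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials; a small sketch of that estimate (my reading of the metric, not code from the benchmark):

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Average over tasks of C(c, k) / C(n, k): the estimated probability that
    k i.i.d. trials of a task all succeed, given c successes out of n trials."""
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Example: 3 tasks, 8 trials each, with 8, 6, and 4 successful trials respectively.
print(pass_hat_k([8, 6, 4], n_trials=8, k=2))  # ~0.58
```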
evaluation frameworks have begun incorporating access control constraints into their design. For example, IntellAgent [45] includes evaluation tasks that require authentication of user identity and enforce policies that deny access to other users’ information. By embedding role-specific restrictions into task generation, these approaches more accurately model how agents behave in permission-sensitive enterprise contexts
frameworks
The LLM-as-a-Judge approach [125] leverages the reasoning capabilities of LLMs to evaluate agent responses based on qualitative criteria, assessing responses according to the criteria provided through instructions. This method has gained traction due to its ability to handle tasks that are subjective and nuanced, such as summarization, reasoning, and conversational interactions.
paper assessing LLMs' ability for query generation and assessment
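A generic sketch of what an LLM-as-a-Judge call can look like; the rubric, field names, and judge() client are illustrative assumptions, not the cited paper's template.

```python
# Generic sketch of an LLM-as-a-Judge call: the rubric and output format here are
# illustrative, not the template from the cited work. judge() stands in for any LLM client.
def build_judge_prompt(task: str, agent_response: str, criteria: list[str]) -> str:
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are an impartial judge. Rate the agent response against each criterion "
        "from 1 to 5 and briefly justify each score.\n\n"
        f"Task: {task}\nAgent response: {agent_response}\nCriteria:\n{rubric}"
    )

prompt = build_judge_prompt(
    task="Summarize the meeting notes",
    agent_response="The team agreed to ship v2 next week.",
    criteria=["faithfulness", "completeness", "conciseness"],
)
# verdict = judge(prompt)  # returns per-criterion scores plus a justification
```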
The code-based method is the most deterministic and objective approach [8, 53, 57]. It relies on explicit rules, test cases, or assertions to verify whether an agent’s response meets predefined criteria. This method is particularly effective for tasks with well-defined outputs, such as numerical calculations, structured query generation, or syntactic correctness in programming tasks
query generation review
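A minimal example of such a code-based check for a structured-query task; the expected SQL string and the JSON shape are invented for illustration.

```python
# Sketch of code-based (rule/assertion) evaluation for a well-defined output,
# here a structured-query task; the expected answer is a made-up example.
import json

def check_sql_response(agent_output: str) -> bool:
    """Pass only if the agent returned valid JSON containing the expected SQL string."""
    try:
        payload = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    return payload.get("sql", "").strip().lower() == "select count(*) from orders"

assert check_sql_response('{"sql": "SELECT COUNT(*) FROM orders"}')
assert not check_sql_response("I think there are 42 orders.")
```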
The tool selection process involves choosing through a retriever or directly allowing LLMs to pick from a provided list of tools. When there are too many tools, a tool retriever is typically used to identify the top-K relevant tools to offer to the LLMs, a process known as retriever-based tool selection. If the quantity of tools is limited or upon receiving the tools retrieved during the tool retrieval phase, the LLMs need to select the appropriate tools based on the tool descriptions and the sub-question, which is known as LLM-based tool selection.
repeated coverage of tool selection
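A minimal sketch of retriever-based selection: rank tool descriptions by similarity to the sub-question and pass only the top-K to the LLM; embed() is a placeholder for any embedding model.

```python
# Minimal sketch of retriever-based tool selection: embed the sub-question and the
# tool descriptions, then hand only the top-K most similar tools to the LLM.
# embed() is a placeholder for any text-embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_top_k_tools(sub_question, tools, embed, k=5):
    q_vec = embed(sub_question)
    ranked = sorted(tools, key=lambda t: cosine(q_vec, embed(t["description"])), reverse=True)
    return ranked[:k]   # these K tool descriptions are what the LLM actually sees
```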
RestGPT [93] introduces a Coarse-to-Fine Online Planning approach, an iterative task planning methodology that enables LLMs to progressively refine the process of task decomposition.
another paper on task decomposition
innate abilities of LLMs enable effective planning through methods such as few-shot or even zero-shot prompting.
LLM ability for planning
This stage involves the decomposition of a user question into multiple sub-questions as required
decomposition
the paradigms for employing tool learning can be categorized into two types: tool learning with one-step task solving and tool learning with iterative task solving.
two paradigms: iterative vs. one-step
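The contrast, sketched in code: one-step solving plans all tool calls up front, while iterative solving interleaves planning, tool calls, and observations; llm_plan, llm_step, and call_tool are placeholder functions, not a specific framework's API.

```python
# Contrast of the two paradigms; llm_plan, llm_step and call_tool are placeholders.
def one_step_solving(question, llm_plan, call_tool):
    plan = llm_plan(question)                 # all tool calls decided up front
    return [call_tool(step) for step in plan]

def iterative_solving(question, llm_step, call_tool, max_turns=10):
    history = []
    for _ in range(max_turns):                # plan -> act -> observe loop
        action = llm_step(question, history)
        if action is None:                    # model signals it is done
            break
        history.append((action, call_tool(action)))
    return history
```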
two types based on whether a retriever is used: retriever-based tool selection and LLM-based tool selection.
Retriever-based vs. LLM-based selection
user query along with the selected tools are furnished to the LLMs, enabling them to select the optimal tool and configure the necessary parameters for tool calling.
LLM step
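An illustrative shape for what the model emits at this step, a chosen tool plus filled-in parameters, and how a host might dispatch it; the field names are assumptions, not a fixed standard.

```python
# Illustrative shape of what the LLM returns at this step: a chosen tool plus the
# parameters it filled in. The field names are assumptions, not a fixed standard.
tool_call = {
    "tool": "get_weather",
    "parameters": {"city": "Berlin", "unit": "celsius"},
}

def execute(tool_call, registry):
    """Dispatch the selected tool with the model-provided parameters."""
    return registry[tool_call["tool"]](**tool_call["parameters"])

# Toy registry with a stubbed tool, just to show the dispatch.
result = execute(tool_call, registry={"get_weather": lambda city, unit: f"18 {unit} in {city}"})
```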
ANS complements these emerging protocols by integrating Public Key Infrastructure (PKI) for identity and trust, defining structured communication via JSON Schema, incorporating DNS-like naming for discovery, establishing mechanisms for agent registration and renewal, and providing a formal specification of the protocol to enhance precision and implementability.
how it works; components
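A hypothetical registration record in the spirit of those components (DNS-like name, PKI-backed identity, expiry for renewal, schema-checked metadata); the field names are illustrative only and are not taken from the ANS specification.

```python
# Hypothetical agent registration record in the spirit of ANS. Field names are
# illustrative only, not the ANS specification.
REQUIRED_FIELDS = {"name", "protocol", "endpoint", "public_key_pem", "expires"}

record = {
    "name": "translator.agents.example",        # DNS-like, hierarchical name
    "protocol": "a2a",                           # which protocol adapter to use
    "endpoint": "https://agents.example/translator",
    "public_key_pem": "-----BEGIN PUBLIC KEY-----...",  # identity anchored in PKI
    "expires": "2026-01-01T00:00:00Z",           # registrations are renewed, not permanent
}

def validate_registration(rec: dict) -> bool:
    """Structural check standing in for full JSON Schema validation."""
    return REQUIRED_FIELDS.issubset(rec)

assert validate_registration(record)
```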
a modular Protocol Adapter Layer supporting diverse communication standards (A2A, MCP, ACP, etc.); a
supports diverse communication standards
Architectural Trade-offs Are Protocol-Specific. The "right" registry architecture depends heavily on deployment context. Enterprise environments with existing Azure AD infrastructure benefit from Entra Agent ID’s seamless integration and zero-maintenance approach. Open research communities and decentralized applications require the cryptographic guarantees and federated governance of NANDA Index. Protocol-specific ecosystems like MCP benefit from purpose-built registries that leverage existing authentication systems (GitHub OAuth) while maintaining simplicity.
key point for review
MCP: JSON file with agent metadata but not the actual tools, only instructions on how to connect (a lookup table)
NANDA Index: decentralized index quilt that connects registries; security is useful for next stages.
centralization
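A hypothetical connection-descriptor entry of the kind the MCP note describes: it tells a client how to reach a server without listing the server's tools; the keys below are illustrative, not the official mcp.json schema.

```python
# Hypothetical connection-descriptor entry: how to reach the server, but not the
# tools themselves. The keys are illustrative, not the official mcp.json schema.
import json

entry = {
    "name": "filesystem",
    "description": "Read-only access to a local project directory",
    "transport": {"type": "stdio", "command": "mcp-server-filesystem", "args": ["./project"]},
    # note: no tool list here; tools are discovered after connecting
}

print(json.dumps(entry, indent=2))
```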
Decentralization Enables Long-term Sustainability. While centralized approaches offer operational simplicity, they create single points of failure and vendor lock-in risks that become increasingly problematic as agent ecosystems mature. NANDA Index’s federated design demonstrates how decentralized architectures can achieve both scalability and community governance. The dual-path resolution pattern in NANDA particularly addresses privacy concerns that will become critical as agent interactions proliferate.
key point for review
This paper surveys three prominent registry approaches, each defined by a unique metadata model: MCP’s mcp.json, A2A’s Agent Card, and NANDA’s AgentFacts.
Registry approaches
APIBench
APIBench: see relevance
LLaMA-based model that surpasses the performance of GPT-4 on writing API calls.
task development and testing/ranking
modern indexing
Hierarchical Structure
Implementation of hierarchical structure
Evaluation Protocol
Relevance of this approach if used
Self-Reflection Mechanism in LLMs.
self-reflection mechanism is the approach used by this agent search engine
struggling with precise numbers and specific fields of expertise
Task development
either solvable or non-solvable
Approach to queries
ToolBench
search query task evaluation; task development
Rapid API
Cite as example use of hierarchical structure
Hierarchical Structure
Hierarchical structure
revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate
See what previous works cite
solver
Task development query section
self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable
Task development
API retriever with a hierarchical structure
Hierarchical element
16,000 APIs from Rapid API
Use in indexing
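A sketch of hierarchical retrieval over a large API catalog (category, then tool, then endpoint); score() is a placeholder relevance model and the catalog layout only loosely mirrors RapidAPI's nesting.

```python
# Sketch of hierarchical retrieval over a large API collection: first pick likely
# categories, then tools within them, then concrete API endpoints. score() is a
# placeholder for any relevance model.
def hierarchical_retrieve(query, catalog, score, top_categories=3, top_tools=5, top_apis=10):
    cats = sorted(catalog, key=lambda c: score(query, c["name"]), reverse=True)[:top_categories]
    tools = [t for c in cats for t in c["tools"]]
    tools = sorted(tools, key=lambda t: score(query, t["description"]), reverse=True)[:top_tools]
    apis = [a for t in tools for a in t["apis"]]
    return sorted(apis, key=lambda a: score(query, a["description"]), reverse=True)[:top_apis]
```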
hypothesis
traditional keyword match
We should never forget that the human brain is perhaps the most complicated system in the known universe; if we are to build something roughly its equal, open-hearted collaboration will be key.
arg symbols
Researchers like Josh Tenenbaum, Anima Anandkumar, and Yejin Choi are also now headed in increasingly neurosymbolic directions. Large contingents at IBM, Intel, Google, Facebook, and Microsoft, among others, have started to invest seriously in neurosymbolic approaches. Swarat Chaudhuri and his colleagues are developing a field called “neurosymbolic programming”23 that is music to my ears.
example symbols
Artur Garcez and Luis Lamb wrote a manifesto for hybrid models in 2009, called Neural-Symbolic Cognitive Reasoning. And some of the best-known recent successes in board-game playing (Go, Chess, and so forth, led primarily by work at Alphabet’s DeepMind) are hybrids. AlphaGo used symbolic-tree search, an idea from the late 1950s
example symbols
No single AI approach will ever be enough on its own; we must master the art of putting diverse approaches together, if we are to have any hope at all.
arg symbols
Belittling unfashionable ideas that haven’t yet been fully explored is not the right way to go. Hinton is quite right that in the old days AI researchers tried—too soon—to bury deep learning. But Hinton is just as wrong to do the same today to symbol-manipulation.
argument symbols
For at least four reasons, hybrid AI, not deep learning alone (nor symbols alone) seems the best way forward:
arg symbols
In 1990, Hinton published a special issue of the journal Artificial Intelligence called Connectionist Symbol Processing that explicitly aimed to bridge the two worlds of deep learning and symbol manipulation.
argument example symbols
When deep learning reemerged in 2012, it was with a kind of take-no-prisoners attitude that has characterized most of the last decade. By 2015, his hostility toward all things symbols had fully crystallized. He gave a talk at an AI workshop at Stanford comparing symbols to aether, one of science’s greatest mistakes.
argument symbols
Warren McCulloch and Walter Pitts wrote in 1943, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” the only paper von Neumann found worthy enough to cite in his own foundational paper on computers.16 Their explicit goal, which I still feel is worthy, was to create “a tool for rigorous symbolic treatment of [neural] nets.”
example symbols
Symbolic operations also underlie data structures like dictionaries or databases that might keep records of particular individuals and their properties (like their addresses, or the last time a salesperson has been in touch with them), and allow programmers to build libraries of reusable code, and ever larger modules, which ease the development of complex systems. Such techniques are ubiquitous, the bread and butter of the software world.
argument example symbols
If symbols are so critical for software engineering, why not use them in AI, too?
arg symbols
As the renowned computer scientist Peter Norvig famously and ingeniously pointed out, when you have Google-sized data, you have a new option: simply look at logs of how users correct themselves.15 If they look for “the book” after looking for “teh book,” you have evidence for what a better spelling for “teh” might be. No rules of spelling required.
example symbols
Your word processor, for example, has a string of symbols, collected in a file, to represent your document. Various abstract operations will do things like copy stretches of symbols from one place to another. Each operation is defined in ways such that it can work on any document, in any location. A word processor, in essence, is a kind of application of a set of algebraic operations (“functions” or “subroutines”) that apply to variables (such as “currentl
example symbols
it goes back at least to 1945, when the legendary mathematician von Neumann outlined the architecture that virtually all modern computers follow. Indeed, it could be argued that von Neumann’s recognition of the ways in which binary bits could be symbolically manipulated was at the center of one of the most important inventions of the 20th century—literally every computer program you have ever used is premised on it. (The “embeddings” that are popular in neural networks also look remarkably like symbols, though nobody seems to acknowledge this. Often, for example, any given word will be assigned a unique vector, in a one-to-one fashion that is quite analogous to the ASCII code. Calling something an “embedding” doesn’t mean it’s not a symbol.)
example symbols ?
Classical computer science, of the sort practiced by Turing and von Neumann and everyone after, manipulates symbols in a fashion that we think of as algebraic, and that’s what’s really at stake. In simple algebra, we have three kinds of entities, variables (like x and y), operations (like + or -), and bindings (which tell us, for example, to let x = 12 for the purpose of some calculation). If I tell you that x = y + 2, and that y = 12, you can solve for the value of x by binding y to 12 and adding to that value, yielding 14. Virtually all the world’s software works by stringing algebraic operations together, assembling them into ever more complex algorithms.
argument symbols
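The algebra in this passage, written out as ordinary code to make the point about bindings explicit:

```python
# The algebra from the passage as ordinary code: a variable, an operation, and a
# binding. The same rule works for any binding of y, which is the point about
# algebraic, symbol-manipulating computation.
def solve_for_x(y):
    return y + 2   # x = y + 2

print(solve_for_x(12))   # binding y = 12 yields x = 14
print(solve_for_x(100))  # the same rule extrapolates to any value: 102
```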
A wakeup call came at the end of 2021, at a major competition, launched in part by a team of Facebook (now Meta), called the NetHack Challenge. NetHack, an extension of an earlier game known as Rogue, and forerunner to Zelda, is a single-user dungeon exploration game that was released in 1987.
example symbols
A lot of confusion in the field has come from not seeing the differences between the two—having symbols, and processing them algebraically. To understand how AI has wound up in the mess that it is in, it is essential to see the difference between the two. What are symbols? They are basically just codes. Symbols offer a principled mechanism for extrapolation: lawful, algebraic procedures that can be applied universally, independently of any similarity to known examples. They are (at least for now) still the best way to handcraft knowledge, and to deal robustly with abstractions in novel situations.
argument symbols
And yet, for the most part, that’s how most current AI proceeds. Hinton and many others have tried hard to banish symbols altogether.
arg symbols
A 2022 paper from Google concludes that making GPT-3-like models bigger makes them more fluent, but no more trustworthy.
example scaling
If scaling doesn’t get us to safe autonomous driving, tens of billions of dollars of investment in scaling could turn out to be for naught.
arg scaling
Among other things, we are very likely going to need to revisit a once-popular idea that Hinton seems devoutly to want to crush: the idea of manipulating symbols
argument symbols
scaling starts to falter on some measures, such as toxicity, truthfulness, reasoning, and common sense.
argument scaling
What’s more, the so-called scaling laws aren’t universal laws like gravity but rather mere observations that might not hold forever, much like Moore’s law, a trend in computer chip production that held for decades but arguably began to slow a decade ago.11
example scaling
There are serious holes in the scaling argument. To begin with, the measures that have scaled have not captured what we desperately need to improve: genuine comprehension. Insiders have long known that one of the biggest problems in AI research is the tests (“benchmarks”) that we use to evaluate AI systems
arg scaling
What should we do about it? One option, currently trendy, might be just to gather more data.
arg scaling
For example, when we typed this: “You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you …” GPT continued with “drink it. You are now dead.”
example error
Still others found that GPT-3 is prone to producing toxic language, and promulgating misinformation. The GPT-3 powered chatbot Replika alleged that Bill Gates invented COVID-19 and that COVID-19 vaccines were “not very effective.” A new effort by OpenAI to solve these problems wound up in a system that fabricated authoritative nonsense like, “Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation.
example error
A deep-learning system has mislabeled an apple as an iPod because the apple had a piece of paper in front with “iPod” written across. Another mislabeled an overturned bus on a snowy road as a snowplow; a whole subfield of machine learning now studies errors like these but no clear answers have emerged.
example
Current deep-learning systems frequently succumb to stupid errors like this. They sometimes misread dirt on an image that a human radiologist would recognize as a glitch.
arg error
The car failed to recognize the person (partly obscured by the stop sign) and the stop sign (out of its usual context on the side of a road); the human driver had to take over. The scene was far enough outside of the training database that the system had no idea what to do.
arg point 1
it occasionally confuses baby photos of my two children. But the stakes are low—if the app makes an occasional error, I am not going to throw away my phone.
arg
I asked my iPhone the other day to find a picture of a rabbit
example
Google’s latest contribution to language is a system (Lamda) that is so flighty that one of its own authors recently acknowledged it is prone to producing “bullshit.”5
example