Another librarian might choose a different starting point, yielding slightly different results.
Technically, this isn't true for IVF. More accurately, the same librarian might stop slightly earlier or later, sometimes with very different results.
Technically, some systems store the “seed,” or starting point, within a session to ensure consistent results, but starting a new session with the same query may trigger the selection of a different starting point, leading to different results.
This is an oversimplification. In fact, HNSW/FAISS search is deterministic given a fixed index and parameters (although index construction may be non-deterministic), but non-determinism can come from other factors such as multi-threading or reindexing.
Here’s a rough analogy: imagine a library of one million books indexed by subject headings that cluster similar subjects together. A librarian is asked to find the five books most relevant to “history of trade wars.” Rather than scanning every book, she pulls roughly 10 candidate books from a shelf with a relevant subject heading and skims them. After skimming those candidates, she decides whether her shortlist looks good enough. If not, she jumps to the next-closest shelf cluster and repeats until she runs out of time. In the end, she will have a pretty good (but maybe not perfect) set of five books.
This whole section is pretty close to an ANN method called the Inverted File Index (IVF), but the more common ANN method is HNSW (Hierarchical Navigable Small World), which is closer to the Six Degrees of Kevin Bacon game.
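For the technically inclined, here is a minimal sketch of the IVF idea using the faiss library mentioned above; the vectors, dimension, and parameter values are made up for illustration, and nprobe is roughly "how many shelves the librarian visits before stopping."

```python
import numpy as np
import faiss  # the FAISS library referenced in the note above

d = 384                                   # embedding dimension (illustrative)
nlist = 100                               # number of "shelves" (clusters)
books = np.random.random((10_000, d)).astype("float32")   # stand-in book embeddings
query = np.random.random((1, d)).astype("float32")        # stand-in query embedding

quantizer = faiss.IndexFlatL2(d)          # assigns each vector to its nearest cluster
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(books)                        # learn the cluster centroids ("subject headings")
index.add(books)                          # shelve the books

index.nprobe = 10                         # how many nearby shelves to check before stopping
distances, ids = index.search(query, 5)   # the librarian's five best candidates
print(ids[0])
```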
calculated using measures like cosine similarity,
Another popular metric is the dot product. The two are equivalent if the vector embeddings are length-normalized.
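A quick illustration with made-up vectors: once the embeddings are normalized to unit length, the dot product and cosine similarity give the same score.

```python
import numpy as np

a = np.array([0.3, 0.5, 0.2])   # illustrative embedding of a query
b = np.array([0.1, 0.9, 0.4])   # illustrative embedding of a text chunk

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After length-normalization, the dot product equals the cosine similarity.
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
print(cosine, np.dot(a_norm, b_norm))   # both print the same value
```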
There are more sophisticated multi-vector embedding approaches, like ColBERT, that store individual embeddings for each token or word in a text chunk. This improves interpretability.
In ColBERT with the MaxSim operator, you can see the contribution to the overall score from the interactions between the query token embeddings and document token embeddings. However, you still can't fully predict or interpret the cosine similarity scores between the token embeddings themselves.
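A rough sketch of the MaxSim idea, using random stand-in token embeddings: each query token takes its best (maximum) similarity over the document tokens, and those per-token maxima are summed into the overall score, so the per-token contributions are visible even though the underlying token similarities remain hard to interpret.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.random((5, 128))    # 5 query token embeddings (illustrative)
D = rng.random((40, 128))   # 40 document token embeddings (illustrative)

# Normalize so the dot product is cosine similarity
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
D = D / np.linalg.norm(D, axis=1, keepdims=True)

sim = Q @ D.T                        # (5, 40) token-to-token similarities
per_token_max = sim.max(axis=1)      # best-matching document token for each query token
score = per_token_max.sum()          # MaxSim score: sum of the per-query-token maxima
print(per_token_max, score)          # per-token contributions are inspectable
```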
reach or approach human-level scores in many prominent natural language benchmarks
This is, of course, somewhat disputed.
empirical studies show that, typically, 95 to 99 percent of the papers cited by the RAG systems do exist
This refers to academic RAG systems only.
For example, "Correctness and Quality of References generated by AI-based Research Assistant Tools: The Case of Scopus AI, Elicit, SciSpace and Scite in the Field of Business Administration" found that no references from Scopus AI and the other tools tested were entirely fabricated or non-existent, though minor inaccuracies exist.
For non-academic RAG, a 2025 study in Nature Communications showed that 99.3% and 96% of links from GPT-4o RAG and Gemini 1.0 Ultra RAG, respectively, were not broken. Not all of the broken links would technically be hallucinations, as some are invalid URLs returned by a web API.
Generally though, because of the bounded nature of academic RAG, systems can easily do a post-processing check to ensure every suggested citation exists; if any are flagged as non-existent, the response can be regenerated. Note that this does not help with source-faithfulness issues.
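As a rough illustration of what such a check could look like (not how any particular vendor implements it), assuming the generated citations carry DOIs and using the public Crossref REST API as the existence lookup:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if Crossref knows about this DOI (illustrative check only)."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

cited_dois = ["10.1038/s41586-020-2649-2", "10.1234/not-a-real-doi"]  # placeholder citations
flagged = [d for d in cited_dois if not doi_exists(d)]
if flagged:
    # In a real pipeline the answer would be regenerated or the offending citations
    # dropped; this does nothing for source-faithfulness problems.
    print("Non-existent citations:", flagged)
```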
Scopus metadata and abstracts (from 2003 onward for summary generation). No full text.
Web of Science metadata and abstracts. No full text.
Both Scopus AI and Web of Science Research Assistant exclude retracted works by default; see https://www.elsevier.com/products/scopus/scopus-ai and https://clarivate.com/academia-government/release-notes/web-of-science/web-of-science-february-13-2025-release-notes/. Testing shows Primo Research Assistant will surface and use retracted papers in the RAG answer.
These studies likely overcount hallucinations by failing to distinguish them from other types of citation errors (for example, flagging papers as fictitious when the citations are merely wrong because of incorrect metadata in the source used by the RAG system).
For RAG systems that search the general web, URLs might be broken simply because the index contains outdated URLs. Studies may also use the term "hallucinated citations" to cover a variety of issues, such as source faithfulness (the citation exists but does not say what the LLM claims it says).
Still, RAG systems that search the web seem more prone to generating "fake citations," which may come down to the fact that such systems don't do any post-processing checks and that web citations are harder to verify.
Undermind currently uses titles and abstracts only with no full text.
As of Nov 2024, Undermind is starting to use full text from Semantic Scholar to help understand whether the paper is relevant.

Like many AI search tools, it utilizes the Semantic Scholar corpus (title and abstract only).
As of Nov 2024, Undermind is starting to use full text from Semantic Scholar to understand papers and assess relevancy.
My early testing, using gold-standard sets found in systematic reviews, looks promising.
For example, using "Efficacy of digital mental health interventions for PTSD symptoms: A systematic review of meta-analyses" as a gold standard, Undermind found 8 out of 11 in the top 10, a recall@10 of 72.7%. It found one more at position 44, for a recall@50 of 81.8%!
Still not good enough to replace a full-blown systematic review, but pretty good for just a 3-minute wait.
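For reference, recall@k is simply the fraction of the gold-standard set that appears in the top k results; here is a minimal sketch with placeholder paper IDs that reproduces the arithmetic above.

```python
def recall_at_k(ranked_results, gold_set, k):
    """Fraction of gold-standard papers found in the top k results."""
    top_k = set(ranked_results[:k])
    return len(top_k & set(gold_set)) / len(gold_set)

# Placeholder IDs: 11 gold-standard papers, 8 in the top 10 and one more at rank 44
gold = [f"paper_{i}" for i in range(11)]
ranked = (
    [f"paper_{i}" for i in range(8)]        # 8 gold papers in the top ranks
    + [f"other_{i}" for i in range(35)]     # non-gold results filling ranks 9-43
    + ["paper_8"]                           # a 9th gold paper at rank 44
    + [f"other_{i}" for i in range(35, 50)]
)

print(round(recall_at_k(ranked, gold, 10), 3))   # 0.727 -> 8/11
print(round(recall_at_k(ranked, gold, 50), 3))   # 0.818 -> 9/11
```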
Figure 3,
As of Nov 2024, there is a new section, "Discuss the results with an expert," that allows you to ask questions over the papers found. This is similar to features found in Elicit.com, SciSpace, and Scite Assistant.
