RAG and Retrieval

Every substantive claim the agent writes is grounded in documents from your own knowledge bases (KBs). This page explains how that grounding works: the retrieval-augmented generation (RAG) pipeline that runs in every workflow, along with best practices.

The principle: grounded synthesis, not free generation

Agent Bayes is built on a simple principle. Every claim should be grounded in specific evidence, because accuracy cannot be compromised in academic research. Each claim the agent writes carries a self-aware confidence score that reflects the strength of its supporting sources. Multiple viewpoints are considered and nuance is preserved, so the resulting output participates in the scholarly discussion rather than collapsing it into a single perspective.

How does the agent generate grounded synthesis?

  • The agent works in a multi-role pipeline that retrieves first, then synthesizes claims, then verifies them and identifies gaps. This loop repeats, refining the output and closing gaps until no further improvement can be made.
  • If the agent cannot find anything relevant in your knowledge base, it will not invent an answer. It will tell you it came up empty.
  • When building arguments (if you instruct it to, e.g. by saying "Argue for ..."), the agent writes continuous prose as a set of mindmap nodes that interleaves conclusions and interpretations with the evidence and scholarly viewpoints. The interpretations are easy to spot, so you can review them and verify that they rest on evidence presented earlier.
  • Every node on the mindmap that is grounded with citations is marked green and has citation badges. Nodes that are not grounded in any source are marked gray and have no badges.
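
The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Agent Bayes's actual code: the names run_grounded_loop, Claim, LoopResult, and the toy_* helpers are hypothetical, and the toy knowledge base is a plain dictionary keyed by query string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    text: str
    citations: List[str]  # ids of the passages that support the claim


@dataclass
class LoopResult:
    claims: List[Claim] = field(default_factory=list)
    open_gaps: List[str] = field(default_factory=list)
    came_up_empty: bool = False


def run_grounded_loop(task, retrieve, synthesize, verify, find_gaps, max_rounds=5):
    """Retrieve, synthesize, verify, find gaps; repeat until nothing improves."""
    result = LoopResult()
    evidence: List[str] = []
    queries = [task]
    for _ in range(max_rounds):
        new = []
        for query in queries:
            for passage in retrieve(query):
                if passage not in evidence and passage not in new:
                    new.append(passage)
        if not new:
            break  # no new material: stop and keep any gaps that are still open
        evidence.extend(new)
        result.claims = verify(synthesize(task, evidence), evidence)
        result.open_gaps = find_gaps(task, result.claims)
        if not result.open_gaps:
            break  # every gap is closed
        queries = result.open_gaps  # the next round targets the gaps
    result.came_up_empty = not evidence  # report emptiness instead of inventing
    return result


# Hypothetical toy stand-ins for the real retrieval and synthesis roles.
TOY_KB = {
    "sleep and memory": ["c1: Sleep consolidates declarative memory."],
    "sleep deprivation": ["c2: Sleep deprivation impairs recall."],
}


def toy_retrieve(query):
    return TOY_KB.get(query, [])


def toy_synthesize(task, evidence):
    # One claim per passage, cited by the passage id before the colon.
    return [Claim(text=p.split(": ", 1)[1], citations=[p.split(": ", 1)[0]])
            for p in evidence]


def toy_verify(claims, evidence):
    # Keep only claims whose citations all point at retrieved passages.
    known = {p.split(": ", 1)[0] for p in evidence}
    return [c for c in claims if c.citations and set(c.citations) <= known]


def toy_find_gaps(task, claims):
    cited = {cid for c in claims for cid in c.citations}
    return [] if "c2" in cited else ["sleep deprivation"]
```

Calling run_grounded_loop("sleep and memory", toy_retrieve, toy_synthesize, toy_verify, toy_find_gaps) takes two rounds: the first round leaves a gap, the second closes it. A task with no matching passages returns no claims and sets came_up_empty.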

Indexing: how documents become searchable

When you upload a PDF to a KB, it goes through an indexing pipeline:

  1. Text extraction. Rather than trusting the PDF's built-in text layer, the pipeline performs its own high-quality text extraction that correctly handles multi-column layouts, data tables, equations, and figures. Scanned pages are processed with a custom OCR pipeline. The resulting text is clean and accurately reflects what was on the page.
  2. Semantic chunking. Most RAG systems slice text using a sliding window of fixed character counts. This is fast but produces incoherent fragments: a chunk might start mid-argument and end mid-sentence, mixing unrelated ideas. The result is a noisy index where important claims are buried and retrieval becomes inefficient. Agent Bayes instead groups text into semantically cohesive units, following the natural boundaries of an argument or explanation even when they span a page boundary or multiple columns. Each semantically cohesive chunk contains one complete, coherent part of a document, which makes it possible to process the chunk further in a meaningful way.
  3. Context rebuilding. Each chunk is paired with a generated context (a compact summary of the surrounding material) so the original text of the chunk can be understood in isolation without losing its meaning. This context travels with the chunk through retrieval and into synthesis, and can be examined by the end user.
  4. Idea and claim extraction. The contextualised chunk is decomposed into bullet points in English, each capturing a single idea, claim, or piece of evidence. These become the primary units the agent retrieves and reasons over. Even if the original document is not written in English, this step produces English bullet points, which allows a unified retrieval and reasoning process across documents in any language.
  5. Key term extraction. Domain-specific terminology is extracted from each semantically cohesive chunk to improve information retrieval. Once more than ten sources are indexed in the same knowledge base, a terminology co-occurrence graph can be produced. These graphs allow the AI agent to understand terms, find related concepts, and disambiguate meanings. They can also be explored in the Terminology Graph Explorer, accessible from the KB page.
  6. Embedding and indexing. Both the extracted bullet points and the original chunk text are embedded into vector representations and stored for semantic search — accessible through the RAG Queries panel or automatically by the agent during a workflow.
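
To make the terminology co-occurrence graph from step 5 concrete, here is a minimal sketch. It assumes that two terms are linked whenever they are extracted from the same chunk, and that the edge weight counts how many chunks they share; the page does not say how Agent Bayes weights its graph. The names build_cooccurrence, neighbours, and can_build_graph are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

MIN_SOURCES_EXCLUSIVE = 10  # the graph is produced once MORE than ten sources are indexed


def can_build_graph(indexed_sources: int) -> bool:
    return indexed_sources > MIN_SOURCES_EXCLUSIVE


def build_cooccurrence(chunk_terms: Iterable[Set[str]]) -> Counter:
    """Count, for every pair of terms, how many chunks contain both."""
    edges: Counter = Counter()
    for terms in chunk_terms:
        # sorted() makes the pair order canonical, so (a, b) == (b, a)
        for a, b in combinations(sorted(terms), 2):
            edges[(a, b)] += 1
    return edges


def neighbours(edges: Counter, term: str, top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the terms that most often share a chunk with `term`."""
    scored: Counter = Counter()
    for (a, b), weight in edges.items():
        if a == term:
            scored[b] += weight
        elif b == term:
            scored[a] += weight
    return scored.most_common(top_n)


# Toy input: the key terms extracted from three chunks.
chunks = [
    {"prior", "posterior", "likelihood"},
    {"prior", "posterior"},
    {"likelihood", "evidence"},
]
graph = build_cooccurrence(chunks)
print(neighbours(graph, "prior"))  # [('posterior', 2), ('likelihood', 1)]
```

Strongly weighted neighbours hint at related concepts, and a term whose neighbours fall into separate clusters is a candidate for disambiguation.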

Note: Indexing is a heavy process that can take several minutes for a 100-page paper, and up to a few hours for long documents with complex layouts (e.g. dissertations). The design decision was to prioritize quality and accuracy of the resulting index over speed, because a noisy index would lead to poor retrieval and low-quality synthesis. A wise man from the software industry once said: "garbage in, garbage out".

You can watch the indexing progress in real time: Pending → Indexing (with %) → Completed (or Failed / Cancelled).
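
The status flow above can be modelled as a small state machine. The sketch below is an assumption about how the states relate, not a description of the product's internals: in particular, whether a Pending job can be cancelled directly is a guess, and the names IndexStatus, ALLOWED, can_transition, and label are hypothetical.

```python
from enum import Enum
from typing import Optional


class IndexStatus(Enum):
    PENDING = "Pending"
    INDEXING = "Indexing"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Assumed transitions; Completed, Failed, and Cancelled are terminal.
ALLOWED = {
    IndexStatus.PENDING: {IndexStatus.INDEXING, IndexStatus.CANCELLED},
    IndexStatus.INDEXING: {IndexStatus.COMPLETED, IndexStatus.FAILED,
                           IndexStatus.CANCELLED},
    IndexStatus.COMPLETED: set(),
    IndexStatus.FAILED: set(),
    IndexStatus.CANCELLED: set(),
}


def can_transition(src: IndexStatus, dst: IndexStatus) -> bool:
    return dst in ALLOWED[src]


def label(status: IndexStatus, percent: Optional[int] = None) -> str:
    """Render the status as shown in the UI, with a percentage while indexing."""
    if status is IndexStatus.INDEXING and percent is not None:
        return f"{status.value} ({percent}%)"
    return status.value


print(label(IndexStatus.INDEXING, 42))  # Indexing (42%)
```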

Retrieval: how a question becomes evidence

When the agent has a research task — either initial broad discovery or a refined follow-up — the agent runs this sequence:

  1. Multi-query expansion. It generates roughly ten search queries from the task, each phrased to surface a different angle. This widens recall — a single phrasing might miss papers that use different terminology.
  2. Parallel search. All queries run concurrently against every KB attached to the current project. No KB is preferred over another; results pool together.
  3. Unified re-ranking. All results across all queries are scored together — by relevance, citation quality, and cross-task coherence — then deduplicated.
  4. Viewpoint diversity. The ranker actively looks for opposing or contradictory passages so the synthesis can represent multiple viewpoints instead of averaging them out.
  5. Synthesis. The agent writes citation-backed claims addressing all tasks in the batch simultaneously, identifying connections and contradictions across them.
  6. Confidence scoring. Each claim gets a 0–100 score based on source strength.
  7. Gap identification. The generated claims are checked against the user instructions and the retrieved evidence. If a claim doesn't have strong support, or if the instructions aren't fully addressed, those become gaps to be closed in the next loop.
  8. Coverage assessment. If no new material could be retrieved for closing the gaps, the agent flags a potential KB limitation.
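
Steps 1 to 3 of the sequence above (several queries, run concurrently against every attached KB, then pooled, deduplicated, and ranked together) can be sketched as follows. This is a toy under loud assumptions: word overlap stands in for semantic search, and the ranking uses only the best relevance score per chunk, whereas the real re-ranker also weighs citation quality and cross-task coherence. The names Hit, search_kb, and pooled_retrieve are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    kb: str
    text: str
    score: float


def search_kb(kb_name: str, kb: Dict[str, str], query: str) -> List[Hit]:
    """Toy search: score = fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    hits = []
    for chunk_id, text in kb.items():
        overlap = len(q_words & set(text.lower().split()))
        if overlap:
            hits.append(Hit(chunk_id, kb_name, text, overlap / len(q_words)))
    return hits


def pooled_retrieve(queries: List[str], kbs: Dict[str, Dict[str, str]],
                    top_k: int = 5) -> List[Hit]:
    # Parallel search: every query runs against every KB; no KB is preferred.
    jobs = [(name, kb, q) for q in queries for name, kb in kbs.items()]
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda job: search_kb(*job), jobs))
    # Pool all results, keeping the best-scoring hit per chunk (deduplication).
    best: Dict[tuple, Hit] = {}
    for batch in batches:
        for hit in batch:
            key = (hit.kb, hit.chunk_id)
            if key not in best or hit.score > best[key].score:
                best[key] = hit
    # Unified ranking across all queries; ties break on KB name and chunk id.
    ranked = sorted(best.values(), key=lambda h: (-h.score, h.kb, h.chunk_id))
    return ranked[:top_k]


KBS = {
    "kb_a": {"a1": "sleep improves memory consolidation",
             "a2": "caffeine delays sleep onset"},
    "kb_b": {"b1": "sleep deprivation impairs memory recall"},
}
QUERIES = ["sleep memory", "memory recall after sleep loss"]

print([h.chunk_id for h in pooled_retrieve(QUERIES, KBS)])  # ['a1', 'b1', 'a2']
```

The two queries produce six raw hits between them, but each chunk appears only once in the pooled ranking, and chunks from both KBs compete on equal terms.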

What you see on the mindmap

The output of all of the above lands on your mindmap as nodes:

  • The claim text is what was synthesized.
  • The citations attached to the node are the semantically cohesive chunks provided by the retrieval system.
  • Multiple viewpoints become sibling nodes under the same parent.
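
One way to picture a node is as a small record whose color follows from whether it carries citations. The sketch below is only a mental model; the field names and the MindmapNode and Citation classes are hypothetical and do not reflect the product's data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    chunk_id: str       # the semantically cohesive chunk returned by retrieval
    source_title: str


@dataclass
class MindmapNode:
    text: str                                  # the synthesized claim
    citations: List[Citation] = field(default_factory=list)
    confidence: Optional[int] = None           # 0-100, when the claim is scored
    children: List["MindmapNode"] = field(default_factory=list)

    @property
    def color(self) -> str:
        # Grounded nodes are green; nodes without any source are gray.
        return "green" if self.citations else "gray"

    @property
    def badge_count(self) -> int:
        return len(self.citations)


# Two opposing viewpoints become siblings under the same parent.
parent = MindmapNode(text="Does sleep affect memory?")
parent.children = [
    MindmapNode("Sleep consolidates memory.",
                [Citation("c1", "Example Source A")], confidence=80),
    MindmapNode("The effect is task-dependent.",
                [Citation("c7", "Example Source B")], confidence=55),
]

print(parent.color, [child.color for child in parent.children])
# gray ['green', 'green']
```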

Manual retrieval

You don't have to go through the agent to use the retrieval system. The RAG Queries panel in a project lets you run searches by hand against the attached KBs, browse the top passages, and label results. See Labeled Results for more information.

What's next