Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and links back to the arXiv abstract.
Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never sees how the corpus is organized or what it has not yet retrieved, limiting its ability to backtrack or combine scattered evidence. We present Corpus2Skill, which distills a document corpus into a hierarchical skill directory offline and lets an LLM agent navigate it at serve time. The compilation pipeline iteratively clusters documents, generates LLM-written summaries at each level, and materializes the result as a tree of navigable skill files. At serve time, the agent receives a bird's-eye view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID. Because the hierarchy is explicitly visible, the agent can reason about where to look, backtrack from unproductive paths, and combine evidence across branches. On WixQA, an enterprise customer-support benchmark for RAG, Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all quality metrics.
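The abstract describes two stages: an offline compile step that clusters documents and writes LLM summaries into a tree of skill files, and a serve-time loop where the agent drills from the root summary down to document IDs. The sketch below is a minimal, illustrative rendering of that pattern, not the authors' implementation: `SkillNode`, `compile_corpus`, `navigate`, and the stand-in `summarize` and scoring function are all assumed names, and real clustering and LLM calls are replaced with toy placeholders.

```python
# Minimal sketch of the Corpus2Skill pattern as described in the abstract.
# All names here are illustrative assumptions, not the paper's API:
# summarize() stands in for an LLM call, and compile_corpus() uses fixed
# splits where the paper uses iterative clustering.
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    """One navigable skill file: a summary plus child nodes or document IDs."""
    summary: str
    children: list["SkillNode"] = field(default_factory=list)
    doc_ids: list[str] = field(default_factory=list)

def summarize(texts: list[str]) -> str:
    # Placeholder for an LLM-written summary of this cluster.
    head = texts[0][:40] if texts else ""
    return f"{len(texts)} items, e.g. {head!r}"

def compile_corpus(docs: dict[str, str], fanout: int = 4) -> SkillNode:
    """Offline stage: recursively group docs and summarize each group."""
    if len(docs) <= fanout:                      # leaf: point at raw documents
        return SkillNode(summary=summarize(list(docs.values())),
                         doc_ids=list(docs))
    ids = list(docs)
    chunk = max(1, len(ids) // fanout)           # toy "clustering": fixed splits
    children = [compile_corpus({i: docs[i] for i in ids[k:k + chunk]}, fanout)
                for k in range(0, len(ids), chunk)]
    return SkillNode(summary=summarize([c.summary for c in children]),
                     children=children)

def navigate(root: SkillNode, score) -> list[str]:
    """Serve-time stage: read child summaries, drill into the best branch.
    score(summary) stands in for the agent judging relevance to the query."""
    node = root
    while node.children:                         # bird's-eye view -> finer levels
        node = max(node.children, key=lambda c: score(c.summary))
    return node.doc_ids                          # retrieve full documents by ID
```

The greedy descent above is a simplification: because the compiled hierarchy is explicitly visible, the actual agent can also back out of an unpromising branch and combine evidence from several branches before answering.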
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
This paper challenges a core RAG assumption: instead of searching enterprise knowledge at query time, compile it once into a navigable map that an agent can browse. If that pattern holds, support, operations, and internal knowledge teams may be able to trade some retrieval infrastructure for a more structured knowledge layer that improves answer quality and cross-document reasoning. The reported result is strong enough to take seriously on enterprise QA (Corpus2Skill beats dense retrieval, RAPTOR, and an agentic baseline on WixQA), but it is not a free lunch: the quality gain comes with much higher per-query token cost and batch-style updates rather than real-time freshness.
- The important shift here is architectural, not just model-level: the paper moves work from query-time retrieval into offline corpus compilation, so serving needs no embedding index or vector database. That is a meaningful simplification for teams running enterprise QA, but only if they can tolerate periodic recompilation and a more static knowledge layer.
- The quality gains are notable, but they were bought with much heavier inference payloads: 53,487 input tokens and $0.172 per query, versus 25,807 tokens and $0.098 for the agentic baseline, and far above RAPTOR. For high-value support or expert-assistance workflows that may be acceptable; for high-volume, low-margin traffic it probably is not (see the cost sketch after this list).
- On WixQA, this approach leads on all reported quality metrics, with F1 at 0.460 and factuality at 0.729, and hierarchical methods in general beat flat retrieval. The practical question is not whether navigation helps in theory, but whether your corpus is organized enough—and stable enough—that a compiled map outperforms search on the questions that matter commercially.
- The biggest failure mode was not answer generation but getting routed into the wrong branch: 38 of 62 failed queries were navigation misses. That means the next competitive battleground is likely hierarchy design, update workflows, and top-level routing quality—not simply swapping in a stronger model.
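To make the cost trade-off concrete, here is a back-of-envelope comparison using only the per-query figures reported above; the daily query volumes are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope serving cost at the per-query prices reported in the paper.
# The daily query volumes are illustrative assumptions, not from the paper.
COST_PER_QUERY = {"Corpus2Skill": 0.172, "agentic RAG baseline": 0.098}

for daily_queries in (1_000, 100_000):           # low- vs high-volume traffic
    for system, cost in COST_PER_QUERY.items():
        monthly = cost * daily_queries * 30
        print(f"{system:>22} @ {daily_queries:>7,}/day: ${monthly:>10,.0f}/month")
```

At 1,000 queries a day the gap is about $2,200 a month; at 100,000 a day it is roughly $516,000 versus $294,000, which is why the bullet above reserves this architecture for high-value rather than high-volume traffic.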
Evidence ledger
The strongest claims in the brief, along with the confidence and citation depth behind them.
- Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines on WixQA quality metrics.
- The system requires no embedding index or vector database at serve time, shifting work to an offline compile stage.
- Compilation on the 6,221-document WixQA corpus took 6.5 minutes and produced a 3-level hierarchy with 665 navigation files.
- The quality gains come with materially higher serving cost, driven by large input-token loads.
- The dominant failure mode is incorrect navigation into the wrong branch, not pure synthesis error.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
- [cs.CL] SkillX: Automatically Constructing Skill Knowledge Bases for Agents (Chenxi Wang et al.)
- [cs.LG] CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments (Yi Yu et al.)
- [cs.LG] Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction (Yi Yu et al.)
- [cs.LG] MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding (Junxian Wu et al.)