arXiv 2605.27220v1, May 26, 2026

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

Zafar Hussain, Kristoffer Nielbo

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

May 26, 2026, 4:08 PM

Current score

77

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.
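The cheapest-first cascade described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy `INDEX`, the `keyword_search` and `fake_hyde` workflows, and the `STEPS` ordering are all hypothetical stand-ins, and the only rule taken from the paper is "escalate only when a step returns no documents".

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A retrieval workflow maps a query string to a list of document ids.
Workflow = Callable[[str], List[str]]


def cascade(query: str, steps: Sequence[Tuple[str, Workflow]]) -> Tuple[str, List[str]]:
    """Run workflows in the given (cheapest-first) order.

    Escalate to the next step only when the current one returns no documents.
    Returns the name of the step that answered and its documents.
    """
    for name, workflow in steps:
        docs = workflow(query)
        if docs:  # post-retrieval check: did the index return anything?
            return name, docs
    return "none", []


# Hypothetical toy index standing in for a real search backend.
INDEX: Dict[str, List[str]] = {
    "h.c. andersen": ["doc-1"],
    "danish author of fairy tales": ["doc-1"],
}

# Hypothetical rewrites standing in for an LLM augmentation call (e.g. HyDE).
REWRITES: Dict[str, str] = {
    "who wrote the little mermaid": "danish author of fairy tales",
}


def keyword_search(query: str) -> List[str]:
    """Cheap step: exact lookup, no LLM call."""
    return INDEX.get(query.lower().strip(), [])


def fake_hyde(query: str) -> List[str]:
    """Expensive step: rewrite the query (assumed LLM call), then search."""
    rewritten = REWRITES.get(query.lower().strip(), query)
    return keyword_search(rewritten)


# Cheapest workflow first, LLM-augmented workflow last.
STEPS = [("keyword", keyword_search), ("hyde", fake_hyde)]
```

With this setup, `cascade("H.C. Andersen", STEPS)` is answered by the `keyword` step and never reaches the LLM-style rewrite, while `cascade("Who wrote the little mermaid", STEPS)` falls through to `hyde`. The decision is made from what the index returned, not from the query text alone.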

Score 77 · Full-paper brief · inference · infra · data · models

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

RAG teams are often paying an LLM tax on every query because synthetic tests make augmentation look more necessary than production traffic does. In this production encyclopedia system, a simple cheapest-first cascade served most real users without LLM augmentation, improved the paper’s measured quality score, and cut average latency versus Always-HyDE. The near-term implication is practical: AI ops, product, and procurement teams should challenge always-on query expansion defaults, while remembering this is strongest evidence for short-query, curated-corpus search rather than every enterprise assistant.

  • If your RAG evaluation is built mainly on synthetic questions, it may be overstating how often expensive LLM query augmentation is needed. The paper’s core warning is that real production queries can have a very different shape from benchmark queries, especially in search-like products where users type short keyword fragments.
  • Ask vendors and internal teams whether they decide to augment before retrieval or after seeing what the index returns. In this case, pre-retrieval ML routing barely helped, while a simple post-retrieval “did we find anything?” check was both cheaper and more effective.
  • For RAG workloads with a strong searchable corpus, always running HyDE or query expansion may be a latency and inference-cost anti-pattern. Here, cheapest-first retrieval improved the measured score while cutting average latency, suggesting a practical optimization path that does not require training a new router.
  • A cascade can lower average cost while making a minority of hard queries slower. Any production rollout should report escalation rates and worst-case latency, not only average latency or aggregate quality.
  • The evidence is strongest for a curated Danish encyclopedia with many short, keyword-style queries and a policy that heavily penalizes no-source answers. Conversational assistants, messy enterprise corpora, or systems that answer without retrieved sources may see a different tradeoff.
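The rollout advice above (report escalation rates and worst-case latency alongside averages) can be sketched as a small reporting helper. This is an illustrative sketch under stated assumptions: the record layout `(escalated_to_llm, latency_ms)`, the function names, and the sample numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def nearest_rank_percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rollout_report(records: List[Tuple[bool, float]]) -> Dict[str, float]:
    """Summarize a cascade rollout.

    Each record is (escalated_to_llm, latency_ms) for one query.
    """
    latencies = [latency for _, latency in records]
    escalated = sum(1 for was_escalated, _ in records if was_escalated)
    return {
        "escalation_rate": escalated / len(records),
        "mean_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": nearest_rank_percentile(latencies, 95),
        "max_latency_ms": max(latencies),
    }


# Hypothetical sample: 7 fast queries served without the LLM, 3 slow escalations.
SAMPLE = [(False, 100.0)] * 7 + [(True, 900.0)] * 3
REPORT = rollout_report(SAMPLE)
```

In this made-up sample the mean latency looks moderate because most queries are fast, while the 95th percentile and maximum expose the slow escalated minority. That is the gap the bullet on worst-case latency warns about.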

Affiliations

Institution names extracted from the brief's PDF summary call.

Aarhus University, Denmark

Author markers: Zafar Hussain, Kristoffer Nielbo

From PDF summary

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

strategic · high · p. 1

Synthetic evaluation substantially overstated the need for LLM augmentation relative to real production traffic in this RAG system.

stack · high · p. 6

The paper argues that routing should be conditioned on retrieval results, not only on the query text.

inference · high · p. 1

A cheapest-first post-retrieval cascade improved measured quality and reduced average latency versus Always-HyDE in the tested deployment.

caveat · high · p. 9

Generalization beyond this specific Danish encyclopedia deployment remains uncertain.


Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.