arXiv 2605.22411v1, May 21, 2026

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Jianing Yin, Tan Tang

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

May 21, 2026, 12:36 PM

Current score

78

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
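To make the "decomposed-and-gated reward" idea more concrete, here is a minimal Python sketch of one possible reading of it. It is not the paper's DistillPO implementation. The names `DistillAction` and `gated_span_rewards`, the selection-F1 score, the length cap, and the equal weighting are all assumptions made for illustration; the abstract only states that validity checks gate quality checks, that task-level correctness is exposed early, and that each reward is assigned to the output span responsible for it.

```python
from dataclasses import dataclass


@dataclass
class DistillAction:
    """Hypothetical structured action: message selection plus evidence rewriting."""
    selected_ids: list[int]  # message-selection span
    evidence: str            # evidence-rewriting span


def gated_span_rewards(action, candidate_ids, gold_ids, answer_correct, max_chars=200):
    """Return one reward per output span under a validity -> quality gate.

    All scoring rules here are illustrative assumptions, not the paper's.
    """
    rewards = {"selection": 0.0, "rewrite": 0.0}

    # Gate 1 (validity): selections must be non-empty and drawn from the
    # retrieved candidates, and the rewritten evidence must be non-empty.
    selected = set(action.selected_ids)
    valid = (
        bool(selected)
        and selected <= set(candidate_ids)
        and bool(action.evidence.strip())
    )
    if not valid:
        return rewards  # later components are never scored

    # Task-level correctness is surfaced as soon as the output is valid,
    # before any finer quality check, and credited to both spans.
    task = 1.0 if answer_correct else 0.0
    rewards["selection"] += task
    rewards["rewrite"] += task

    # Gate 2 (quality), selection span: F1 against gold supporting messages.
    gold = set(gold_ids)
    hits = len(selected & gold)
    if hits:
        precision = hits / len(selected)
        recall = hits / len(gold)
        rewards["selection"] += 2 * precision * recall / (precision + recall)

    # Gate 2 (quality), rewrite span: a crude compactness check, scored only
    # when the selection found at least one supporting message.
    if hits and len(action.evidence) <= max_chars:
        rewards["rewrite"] += 1.0

    return rewards


action = DistillAction(selected_ids=[1, 2], evidence="Alice moved to Paris in 2021.")
rewards = gated_span_rewards(action, candidate_ids=[1, 2, 3, 4], gold_ids=[1], answer_correct=True)
invalid = gated_span_rewards(DistillAction([9], "x"), candidate_ids=[1, 2], gold_ids=[1], answer_correct=True)
```

Returning a separate reward per span, rather than a single scalar, is what would let a training loop credit the selection tokens and the rewriting tokens independently.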

Score 78 | Full-paper brief | agents, inference, training, data

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

Long-memory AI systems usually fail in a very practical place: they retrieve too much, summarize too early, or lose the tiny detail that answers the user’s actual question. DeferMem’s bet is that memory should stay mostly raw until query time, then a trained distiller turns noisy history into compact evidence; if that holds up, enterprise assistants, support copilots, and personal-agent products get a more plausible path to cheaper, auditable long-term memory. The paper reports better benchmark accuracy, faster memory operations, and zero commercial-API token cost for memory operations, but the cost is partly shifted into offline training and the evidence is still concentrated in long-memory QA benchmarks.

  • The paper challenges the common pattern of summarizing or organizing memory before anyone knows the future question. If your product depends on long user histories, the key design question becomes whether to preserve raw history and distill at query time, rather than trusting precomputed summaries to keep the right details.
  • DeferMem reports zero commercial-API token cost for memory operations and large runtime gains, but those metrics exclude final answer generation and the method still uses commercial LLM calls during distiller training. Ask vendors whether their “memory cost” number includes retrieval, distillation, answer generation, evaluation, retraining, and long-history storage.
  • The useful result is not just better retrieval; the learned distiller is doing material work. A credible adoption signal would be vendors showing trained, auditable evidence distillation that improves answers versus simply stuffing more retrieved context into the model.
  • The evidence is promising but still benchmark-bound: the distiller can omit small but decisive details, training data is modest, and million-token histories raise memory-operation time sharply. This looks like a strong architecture direction, not a guarantee that enterprise chat histories can be trusted without domain-specific testing.
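The design choice discussed in the points above (keep history raw, retrieve broadly, then distill per query) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeferMem itself: `Message`, `retrieve_candidates`, and `distill` are hypothetical names, keyword overlap with a neighbour window stands in for the paper's segment-link retrieval, and a simple overlap ranking stands in for the trained distiller.

```python
import re
from dataclasses import dataclass

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"what", "is", "the", "of", "a", "s", "to", "in", "on", "did", "how", "i", "my", "we"}


@dataclass
class Message:
    idx: int      # assumed to equal the message's position in the history
    speaker: str
    text: str


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_candidates(history, query, window=1):
    """High-recall step: keep every message sharing a content word with the
    query, plus its neighbours within `window` turns. Deliberately noisy."""
    q = tokenize(query) - STOPWORDS
    keep = set()
    for m in history:
        if q & tokenize(m.text):
            for j in range(m.idx - window, m.idx + window + 1):
                if 0 <= j < len(history):
                    keep.add(j)
    return [history[j] for j in sorted(keep)]


def distill(candidates, query, top_k=2):
    """Stand-in for a trained distiller: select the candidates with the most
    query overlap, then 'rewrite' each as a self-contained line."""
    q = tokenize(query) - STOPWORDS
    scored = [(len(q & tokenize(m.text)), m) for m in candidates]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1].idx))
    chosen = [m for _, m in ranked[:top_k]]
    return [m.idx for m in chosen], [f"{m.speaker}: {m.text}" for m in chosen]


history = [
    Message(0, "Alice", "I adopted a cat named Miso last spring."),
    Message(1, "Bob", "Nice, how is work going?"),
    Message(2, "Alice", "Work is busy, we ship the release on Friday."),
    Message(3, "Bob", "Did Miso settle in?"),
    Message(4, "Alice", "Yes, the cat sleeps on my keyboard."),
]
query = "What is the name of Alice's cat?"

candidates = retrieve_candidates(history, query)
selected_ids, evidence = distill(candidates, query)
```

In this toy run the retrieval step returns four of the five messages, including two unrelated turns, and the distillation step narrows them to the two lines that mention the cat. The point of the sketch is the division of labour: nothing is summarized before the question is known, so the detail "Miso" is still available when the query arrives.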

Affiliations

Institution names extracted from the brief's PDF summary call.

State Key Lab of CAD&CG, Zhejiang University

Author markers: Jianing Yin, Tan Tang

From PDF summary

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability | high | p.7

DeferMem reports the highest QA accuracy among the compared baselines on the two evaluated long-term memory QA benchmarks, LoCoMo and LongMemEval-S.

inference | high | p.7, p.30

The system shifts commercial LLM usage away from repeated memory operations and toward offline distiller training.

capability | high | p.8

High-recall retrieval alone is not enough; query-conditioned distillation materially improves downstream answers.

caveat | medium | p.16, p.30

The evidence should be read with caution because some baselines are inherited and the trained distiller uses modest data.


Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.