arXiv 2603.10899v1 · Mar 11, 2026

LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Jinwoo Ahn et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Mar 11, 2026, 3:44 PM

Current score

60

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future", in which a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long-context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
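The eviction setting the abstract describes can be sketched in a few lines: given per-token importance scores for the prompt's cached KV pairs, keep only the top-k within a budget and drop the rest. This is a minimal toy illustration of score-based eviction in general, not the paper's implementation; `evict_kv` and its shapes are assumptions for the example.

```python
import numpy as np

def evict_kv(keys, values, scores, budget):
    """Keep the `budget` prompt KV pairs with the highest importance
    scores; evict the rest. A toy per-head sketch -- real systems
    operate on GPU tensors across layers and heads."""
    keep = np.sort(np.argsort(scores)[-budget:])  # preserve positional order
    return keys[keep], values[keep]

# 6 prompt positions, head dimension 4
keys = np.arange(24, dtype=float).reshape(6, 4)
values = keys.copy()
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])

k_kept, v_kept = evict_kv(keys, values, scores, budget=3)
print(k_kept.shape)  # (3, 4): positions 1, 3, 5 survive
```

The quality of the eviction hinges entirely on how well `scores` predicts which tokens the future response will attend to, which is the problem the paper targets.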


Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

Long-context AI gets expensive fast: the model’s memory cache balloons with every token, and most attempts to trim it either guess badly about which context matters or add so much setup work that latency suffers anyway. This paper presents a more deployable compromise, and the evidence looks fairly strong on the benchmarked models, though the approach still requires extra training and paper-specific implementation.

  • The paper argues that long-context inference is being constrained less by raw model quality than by KV-cache growth: as prompts get longer, the model’s stored attention state grows linearly and quickly becomes a memory and latency bottleneck. Existing eviction methods either use cheap heuristics that throw away useful context or use draft generation to peek at the future, which can improve accuracy but adds enough prefill work to blunt the operational benefit.
  • LookaheadKV tries to keep the accuracy advantage of “glimpse into the future” methods without actually generating a draft response at inference time. Instead, it trains small add-on modules to predict which prompt tokens will matter later by matching the model’s true future attention patterns, defined as the average cross-attention from response tokens back to prompt tokens, so eviction decisions are better informed without paying draft-generation cost.
  • The practical design is lightweight by research standards: the authors freeze the base model, add 32 learned lookahead tokens, and attach selectively activated LoRA adapters with rank 8 and scaling 32 to attention and feed-forward layers. Those modules add less than 0.5% trainable parameters across tested models, are used only during prefill for eviction rather than during decoding, and can be turned on or off without changing the original model weights, which lowers integration risk for existing serving stacks.
  • The measured case is strongest on latency-adjusted accuracy rather than on absolute quality alone. At 32K context on LLaMA3.1-8B, the paper reports under 2.16% eviction overhead and a measured TTFT increase of 38 ms over a 1760 ms forward-pass baseline, while draft-based rivals added roughly 503 ms to 554 ms; across LongBench, small-cache results are materially better than simple baselines, such as 41.60 average score for LookaheadKV versus 28.84 for StreamingLLM on Llama3.2-3B at KV budget 64, and on very long RULER inputs it still beats other eviction methods, though it remains below FullKV.
  • This looks like a credible systems improvement for teams trying to serve long-context models without paying the full memory bill or the draft-generation latency tax. But it is not free: training requires model-generated responses or a substitute dataset, experiments were run on a limited set of model families and mostly prefill-stage benchmarks, and some baseline latency numbers may be sensitive to implementation quality, so the right read is promising and fairly well evidenced, not yet production-proof across stacks.
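The ground-truth signal described above — the average cross-attention from response tokens back to prompt tokens — can be computed directly when a full response is available, which is how training labels of this kind are typically built. A minimal sketch for one attention head, with illustrative weights (the matrix values are made up for the example):

```python
import numpy as np

def true_importance(attn, prompt_len):
    """Mean attention each prompt token receives from response tokens.
    attn: (seq_len, seq_len) row-stochastic attention weights, one head."""
    return attn[prompt_len:, :prompt_len].mean(axis=0)

# 4 prompt tokens followed by 2 response tokens; each row sums to 1
attn = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.3, 0.5, 0.0, 0.0],
    [0.4, 0.1, 0.2, 0.2, 0.1, 0.0],  # response token 1
    [0.2, 0.3, 0.1, 0.2, 0.1, 0.1],  # response token 2
])
scores = true_importance(attn, prompt_len=4)
print(scores)  # mean of the two response rows over the prompt columns
```

Draft-based methods approximate this matrix by generating a surrogate response; LookaheadKV instead trains its add-on modules to predict the resulting scores from the prompt alone.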
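The "less than 0.5% trainable parameters" claim is easy to sanity-check with back-of-the-envelope arithmetic: a rank-8 LoRA adapter on a square projection adds two thin matrices in place of updating the full weight. The 4096×4096 dimension below is an illustrative assumption, not a figure from the paper.

```python
def lora_extra_params(d_in, d_out, rank=8):
    """Extra trainable parameters a LoRA adapter adds to one
    d_in x d_out projection: A is d_in x rank, B is rank x d_out."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                       # frozen projection weight
extra = lora_extra_params(4096, 4096)    # rank-8 adapter
print(extra / full)  # ~0.0039, consistent with an under-0.5% budget
```

Because the frozen weight dwarfs the adapter, the modules can be attached or detached without touching the base model, which is what makes the integration-risk argument in the bullet above plausible.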

Evidence ledger

inference · high · p.2

LookaheadKV removes explicit draft generation by predicting future token importance with lightweight trainable modules.

stack · high · p.6

The method adds less than 0.5% trainable parameters across tested models.

inference · high · p.3

At 32K context, reported eviction overhead is below 2.16% and up to 14.5× lower than draft-based approaches.

inference · high · p.24

Empirical TTFT overhead is small relative to draft-based methods on LLaMA3.1-8B.

caveat · medium · p.17

Training requires model-generated responses to build ground-truth importance labels, which adds adoption cost.

caveat · high · p.10

The method currently covers prefill-stage eviction only, not decoding-stage eviction.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.RO

Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges

Rongxiang Zeng, Yongqi Dong

cs.AI

Resource-constrained Amazons chess decision framework integrating large language models and graph attention

Tianhao Qian et al.

cs.SE

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Ben Rank et al.

cs.LG

Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Xingyu Xie et al.

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.