Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately 1.6×–14.4× higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
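The slow/fast loop the abstract describes can be sketched as toy control flow. This is a minimal illustration, not the paper's implementation: the boundary check, the top-k selection, and all names here are hypothetical stand-ins for SFI's semantic-boundary trigger and its Selector over the KV cache.

```python
import numpy as np

def is_boundary(token):
    # Hypothetical boundary check: sentence-ending punctuation stands in
    # for SFI's semantic-boundary trigger.
    return token in {".", "?", "!"}

def select_top_k(scores, k):
    # Stand-in Selector: keep the k context positions with the highest
    # attention mass from a (simulated) dense pass.
    return np.argsort(scores)[-k:]

def slow_fast_decode(context_scores, tokens, k=4):
    """Toy sketch of SFI-style decoding.

    context_scores: per-position attention mass (simulated).
    tokens: the sequence being 'generated'; boundaries trigger slow steps.
    Returns a log of which step type ran at each position.
    """
    selected = select_top_k(context_scores, k)  # initial dense (slow) step
    log = ["slow"]
    for tok in tokens[1:]:
        if is_boundary(tok):
            # Slow step: revisit the full context, refresh the sparse memory.
            selected = select_top_k(context_scores, k)
            log.append("slow")
        else:
            # Fast step: attention is restricted to the cached sparse
            # selection in `selected`, avoiding the full history.
            log.append("fast")
    return log

steps = slow_fast_decode(np.array([0.1, 0.5, 0.2, 0.9, 0.3]),
                         ["The", "cat", "sat", ".", "Then", "ran", "."])
print(steps)  # most steps are cheap "fast" steps; slow steps at boundaries
```

The point of the sketch is the ratio: fast steps dominate within a span, and the expensive dense pass runs only when the span (and hence the stable attention support) is likely to change.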
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
Long-context AI is often held back less by the model than by the cost of rereading an ever-growing prompt at every token. This paper claims you can keep most of the quality while making long responses and long-horizon reasoning materially cheaper and faster, reporting 1.6× to 14.4× decoding throughput gains on Qwen3 models without retraining, but only with custom runtime engineering rather than a simple switch flip. If that holds beyond this stack, infrastructure, platform, and product teams should revisit the assumption that long-context and agent-style workloads must stay prohibitively expensive at inference time.
- If your roadmap assumes long prompts and long reasoning traces scale inference cost almost linearly with context length, this paper is a direct challenge. The reported gains get larger as context grows, which matters most for enterprise search, copilots over large documents, and agent loops that keep accumulating history.
- The paper is clear that algorithmic sparsity is not enough; wall-clock gains required an asynchronous overlap pipeline and a memory-coalesced sparse-attention kernel. If a vendor claims similar savings without explaining scheduler, kernel, and cache-layout changes, assume the headline gain may not survive production benchmarks.
- Because SFI is training-free and was tested on existing Qwen checkpoints, the practical implication is faster deployment cycles for model-serving teams: the lever is the inference stack, not another training run. That makes this more relevant to platform and infra budgets than to foundation-model training budgets.
- The evidence is strong enough to take seriously, but still concentrated in Qwen3 models and a specific vLLM/Triton plus NVIDIA B200 setup. The adoption signal that matters next is independent replication on other major model families and serving stacks, especially where latency consistency and output quality under boundary-triggered refreshes are measured together.
- The paper’s throughput comparisons are apples-to-apples only against full dense attention in the same runtime. Comparisons with other KV-cache compression methods were quality-only under a different framework, so this is not yet proof that SFI is the best production option among all acceleration approaches.
Evidence ledger
- SFI reports 1.6×–14.4× end-to-end decoding throughput gains versus a full-KV baseline while generally maintaining near-parity quality.
- The speedup grows with context length, with Qwen3-4B increasing from 1.91× at 8K to 14.36× at 128K context.
- SFI is training-free and applies directly to existing checkpoints, making it a practical inference-stack optimization rather than a retraining method.
- Realizing wall-clock gains depends on custom systems work, including asynchronous overlap and a memory-coalesced sparse-attention kernel.
- Evidence against competing KV-cache compression methods is not a fair wall-clock comparison, because those methods were run in a different framework.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.RO
Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges
Rongxiang Zeng, Yongqi Dong
cs.AI
Resource-constrained Amazons chess decision framework integrating large language models and graph attention
Tianhao Qian et al.
cs.LG
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation
Jinwoo Ahn et al.