arXiv 2606.13392v1, Jun 11, 2026

MiniMax Sparse Attention

Xunhao Lai et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Jun 11, 2026, 2:23 PM

Current score

73

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
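The two-branch flow described in the abstract can be illustrated with a toy sketch. This is not the authors' kernel: it handles one query for a single GQA group, and the index scoring rule (query dotted with the mean key of each block) is an assumption made for illustration, since the abstract does not specify how the Index Branch scores blocks. Blocks are ranked on raw scores without exponentiation, which is one plausible reading of "exp-free Top-k": softmax is monotonic, so the ranking is the same either way. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_mean(block):
    # Mean key vector of one block (assumed block summary, not from the paper).
    n = len(block)
    return [sum(col) / n for col in zip(*block)]

def select_blocks(query, key_blocks, top_k):
    # Index step: score every block, keep the top_k highest raw scores.
    # No exp/softmax is needed because only the ranking matters.
    scores = [dot(query, block_mean(b)) for b in key_blocks]
    ranked = sorted(range(len(key_blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, key_blocks, value_blocks, top_k):
    # Main step: exact softmax attention, restricted to the selected blocks.
    chosen = select_blocks(query, key_blocks, top_k)
    keys = [k for i in chosen for k in key_blocks[i]]
    values = [v for i in chosen for v in value_blocks[i]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return chosen, out

# Toy data: 4 blocks of 2 keys each, 2-dimensional vectors.
query = [1.0, 0.0]
key_blocks = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 0.0], [2.0, 0.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
value_blocks = [[[float(i), float(i)] for _ in range(2)] for i in range(4)]

chosen, out = sparse_attention(query, key_blocks, value_blocks, top_k=2)
chosen_all, out_all = sparse_attention(query, key_blocks, value_blocks, top_k=4)
print(chosen)  # → [1, 3]
```

In the paper's design the selection is made independently per GQA group, so a full model would run the index step once per group rather than once overall. Setting top_k to the number of blocks makes the sketch fall back to dense attention, which is a convenient correctness check.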

Score 73 | Full-paper brief | models, training, inference, infra

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

MiniMax is making a practical attack on one of the biggest blockers to useful long-context AI: the cost of letting a model look across hundreds of thousands or millions of tokens. The paper claims its sparse attention design cuts attention compute by 28.4× at 1M context and turns that into 14.2× prefill and 7.6× decode speedups on H800, while keeping quality broadly close to dense attention in a 109B-scale model. If that generalizes, long-context agents, codebase reasoning, and large-document workflows become materially cheaper to serve; the open question is how much of the gain survives outside MiniMax’s model, kernels, hardware, and benchmark mix.

  • If these results hold outside MiniMax’s stack, very long-context workloads move closer to being an operating choice rather than a special-case demo: repository-scale code review, large-document analysis, and persistent agent memory become less constrained by attention cost.
  • A vendor claiming million-token support should show end-to-end prefill and decode latency on your hardware, context lengths, and batching pattern. The paper itself notes that indexing, Top-k selection, query gathering, and load balancing eat into theoretical gains.
  • The authors released an inference kernel and a public model using MSA, and they report both from-scratch sparse training and conversion from a dense checkpoint at 109B scale. The useful next signal is independent serving benchmarks on non-H800 infrastructure.
  • The reported capability story is broadly encouraging, but not clean: MSA variants improve some benchmarks and regress on others, including a HumanEval drop for the converted model and mixed long-context subtask results. Buyers should test the exact retrieval, coding, and document workflows they care about.
  • A key implication is that long-context competition may shift from simply advertising larger token limits to selecting the right parts of the context cheaply. The paper’s ablations suggest content-dependent selection beats a fixed sliding window under the same sparse budget.
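The contrast in the last point, content-dependent selection versus a fixed sliding window at the same sparse budget, can be made concrete with a toy comparison. The block scores below are made-up illustrative numbers, not data from the paper, and the function names are hypothetical. The sketch only shows why a fixed window can miss a relevant block that sits far back in the context.

```python
def sliding_window_blocks(num_blocks, budget):
    # Fixed policy: always keep the most recent `budget` blocks.
    return list(range(max(0, num_blocks - budget), num_blocks))

def content_topk_blocks(block_scores, budget):
    # Content-dependent policy: keep the `budget` highest-scoring blocks.
    ranked = sorted(range(len(block_scores)), key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:budget])

def captured_score(block_scores, chosen):
    # Total relevance covered by a selection (a crude proxy for recall).
    return sum(block_scores[i] for i in chosen)

# Hypothetical relevance of 8 context blocks to the current query.
# Block 1 is highly relevant but far from the end of the context.
block_scores = [0.1, 4.0, 0.2, 0.1, 0.3, 3.5, 0.2, 1.0]
budget = 3

window = sliding_window_blocks(len(block_scores), budget)
topk = content_topk_blocks(block_scores, budget)

print(window)  # → [5, 6, 7]
print(topk)    # → [1, 5, 7]
print(captured_score(block_scores, topk) > captured_score(block_scores, window))  # → True
```

Both policies read exactly three blocks, so their attention cost is the same; only the choice of blocks differs. Whether that difference matters in practice depends on the workload, which is why the brief recommends testing your own retrieval and document tasks.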

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

inference | high confidence | p.1, p.12

MSA substantially reduces per-token attention compute at 1M context in the authors’ reported configuration.

inference | high confidence | p.1, p.12

The co-designed kernel translates sparsity into large measured attention-latency gains on H800.

capability | medium confidence | p.2, p.11

The paper reports broadly comparable capability to dense GQA at 109B scale, with task-level gains and regressions.

caveat | high confidence | p.12

Theoretical FLOP reductions overstate realized deployment gains because sparse attention adds nontrivial system overhead.
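A quick back-of-envelope check shows the size of that gap using only the figures quoted in the abstract. This is a rough comparison under a stated assumption: the 28.4x figure is a reduction in per-token attention compute while the 14.2x and 7.6x figures are wall-clock speedups, so the ratio is an indicative "realized fraction", not a metric the paper reports. The variable and function names are illustrative.

```python
# Figures quoted in the abstract (1M context, H800).
theoretical_compute_reduction = 28.4  # per-token attention compute
measured_speedups = {"prefill": 14.2, "decode": 7.6}  # wall-clock

def realized_fraction(measured, theoretical):
    # Share of the theoretical reduction that shows up as wall-clock speedup.
    return measured / theoretical

fractions = {
    phase: realized_fraction(speedup, theoretical_compute_reduction)
    for phase, speedup in measured_speedups.items()
}

for phase, fraction in fractions.items():
    print(f"{phase}: {fraction:.0%} of the theoretical reduction")
# → prefill: 50% of the theoretical reduction
# → decode: 27% of the theoretical reduction
```

On these numbers, roughly half of the compute reduction appears as prefill speedup and roughly a quarter as decode speedup, consistent with the caveat that indexing, Top-k selection, query gathering, and load balancing add overhead.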

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.