arXiv 2606.26587v1 | Jun 25, 2026

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

Haoqian Meng et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Jun 25, 2026, 4:19 AM

Current score

76

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern accelerators, yet combining them for LLM activation compression remains challenging: activations contain input-dependent outliers that dominate block scales in FP4 quantization, and directly applying N:M sparsity masks discards moderate values, coupling sparsification loss with quantization error. We introduce SharQ, a training-free inference method that bridges activation sparsity and FP4 quantization through an online sparse-dense decomposition. For each activation tensor, SharQ generates an input-adaptive N:M mask to extract an outlier-dominated sparse backbone, quantizes it to FP4, and defines a dense residual relative to the quantized sparse backbone rather than the unquantized sparse values. A sparse FP4 GEMM processes the backbone while a dense FP4 GEMM compensates for both mask-induced activation loss and sparse-path quantization error. The two paths share a single FP4 weight payload with path-specific scale views, and a fused preparation kernel absorbs mask generation, residual construction, and layer normalization into one operator. SharQ requires no calibration data, retraining, or model-specific tuning. Evaluated on Llama-3.1-8B, Qwen2.5-7B, Qwen3-30B-A3B, and Qwen3-VL-8B, SharQ recovers 43–63% of the NVFP4-to-FP16 accuracy gap across language and vision-language tasks, and generalizes across NVFP4, HiF4, and MXFP4 formats. On an RTX 5090, SharQ delivers 2.2–2.4× latency reduction over FP16 and 1.2–1.4× throughput improvement over FP8 in language model serving, and up to 1.58× speedup on Wan2.2-T2V-A14B video generation when combined with SageAttention. Our code is available at https://github.com/actypedef/SharQ.
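The decomposition described in the abstract can be illustrated with a small pure-Python simulation. This is a hedged sketch, not the authors' kernels: the 2:4 pattern, the top-magnitude mask rule, the unquantized floating-point block scale, the 8-element toy block, and all function names (`fake_quant_fp4`, `nm_mask`, `sparse_dense_decompose`) are assumptions made for illustration. Only the E2M1 value grid is the standard FP4 grid. Because a GEMM is linear in the activation, summing the two dequantized paths stands in for summing the sparse and dense GEMM outputs.

```python
# Illustrative simulation only; names and parameters are hypothetical.
# Magnitudes representable in FP4 (E2M1).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(block):
    """Quantize-dequantize one block with a shared scale of max|x| / 6.

    Assumption: the scale stays an ordinary float; real formats also
    quantize the scale itself, which this sketch ignores.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0
    out = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def nm_mask(block, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m (assumed rule)."""
    mask = [False] * len(block)
    for start in range(0, len(block), m):
        group = range(start, min(start + m, len(block)))
        for i in sorted(group, key=lambda i: abs(block[i]), reverse=True)[:n]:
            mask[i] = True
    return mask


def sparse_dense_decompose(block, n=2, m=4):
    """Return (quantized sparse backbone, quantized dense residual)."""
    mask = nm_mask(block, n, m)
    sparse = [v if keep else 0.0 for v, keep in zip(block, mask)]
    sparse_q = fake_quant_fp4(sparse)
    # Residual is taken against the QUANTIZED backbone, so it carries both
    # the masked-out values and the backbone's own rounding error.
    residual = [v - s for v, s in zip(block, sparse_q)]
    return sparse_q, fake_quant_fp4(residual)


def max_abs_error(reference, approx):
    return max(abs(a - b) for a, b in zip(reference, approx))


# Toy activation block with two outliers (8.0 and -6.0).
x = [0.2, -0.3, 8.0, 0.4, 0.1, -6.0, 0.25, -0.5]

direct = fake_quant_fp4(x)
sparse_q, residual_q = sparse_dense_decompose(x)
two_path = [s + r for s, r in zip(sparse_q, residual_q)]

err_direct = max_abs_error(x, direct)
err_two_path = max_abs_error(x, two_path)
print(f"direct FP4 max error:   {err_direct:.4f}")
print(f"two-path FP4 max error: {err_two_path:.4f}")
```

On this toy block the direct path rounds the small values against a scale set by the 8.0 outlier, while the residual path gets its own, much smaller scale. The example says nothing about accuracy on real models.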

Score 76 | Full-paper brief | inference | infra | models | training

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

FP4 inference has promised cheaper LLM serving, but the usual blocker is quality loss; SharQ’s claim is that a hardware-aware sparse-plus-residual path can recover enough accuracy to make FP4 more realistic without retraining. On RTX 5090/Blackwell-style hardware, the paper reports 2.2–2.4× lower latency than FP16 and 1.2–1.4× higher throughput than FP8 for language serving, which would put pressure on FP8 as the default efficiency tier. Take this seriously as a practical systems result across several model families, but not as universal proof: it depends on modern FP4 and N:M sparsity support and still does not fully close the FP16 quality gap.

  • If the result holds outside the authors’ setup, FP4 becomes less of a risky quality tradeoff and more of a deployable cost/performance option: SharQ recovers 43–63% of the NVFP4-to-FP16 accuracy gap without retraining, calibration data, or model-specific tuning.
  • The paper’s strongest business challenge is to the assumption that FP8 is the practical endpoint for efficient LLM serving. SharQ reports 2.2–2.4× latency gains over FP16 and 1.2–1.4× throughput gains over FP8, suggesting FP4-plus-repair may become the more competitive inference lane where hardware support exists.
  • A useful procurement question is whether the inference stack supports the whole pattern: online N:M activation masking, sparse FP4 GEMM, dense FP4 residual repair, fused preparation kernels, and shared FP4 weight payloads. Plain “FP4 support” is not enough to reproduce these gains.
  • The efficiency story is tied to Blackwell-class FP4 and semi-structured sparsity, and the gains are largest when decode-time GEMMs dominate. Workloads with more attention, preprocessing, or non-GEMM bottlenecks may see smaller benefits; the paper’s own video-generation result for SharQ alone at 720p is much more modest.
  • This matters commercially only when it becomes a standard serving primitive in inference engines and cloud GPU offerings, not a one-off research kernel. The paper makes that plausible by showing a training-free method that generalizes across NVFP4, HiF4, and MXFP4, but operational adoption will depend on integration quality.
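The "share of the accuracy gap recovered" figure quoted above is a simple ratio, which teams can apply to their own evaluation scores. The helper below is a minimal sketch; the function name `gap_recovered` and the scores used are hypothetical placeholders, not numbers from the paper.

```python
def gap_recovered(fp16_score, fp4_score, method_score):
    """Fraction of the FP4-to-FP16 accuracy gap that a method wins back.

    0.0 means no better than the plain FP4 baseline; 1.0 means FP16 parity.
    """
    gap = fp16_score - fp4_score
    if gap <= 0:
        raise ValueError("FP16 score must exceed the FP4 baseline score")
    return (method_score - fp4_score) / gap


# Hypothetical scores for illustration only (not taken from the paper).
example = gap_recovered(fp16_score=70.0, fp4_score=60.0, method_score=65.0)
print(f"gap recovered: {example:.0%}")
```

A value between 0.43 and 0.63 on this measure would correspond to the range the paper reports, with the remainder being the quality gap to FP16 that is still open.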

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability | high | p.1

SharQ recovers a substantial but incomplete share of FP4 quantization accuracy loss across evaluated language and vision-language tasks.

inference | high | p.1

SharQ reports major language-serving efficiency gains on RTX 5090 hardware versus FP16 and FP8 baselines.

training | high | p.1, p.3

SharQ is designed as a training-free inference method, lowering deployment complexity compared with calibration- or retraining-dependent compression approaches.

caveat | high | p.13

The method’s practicality depends on accelerator and software support for block-scaled FP4 and N:M semi-structured sparsity.
