Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine-tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while delivering up to 2.2x and 7.5x higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.
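To make the abstract's MTP/speculative-decoding claim concrete, the sketch below shows the back-of-the-envelope arithmetic for why drafting tokens raises decode throughput. It uses the standard expected-acceptance calculation for speculative decoding, not a result from the paper, and the draft lengths and acceptance rates are hypothetical placeholders.

```python
# Back-of-the-envelope speculative decoding gain (illustrative only).
# Assumes a draft head proposes `k` tokens per step and the base model
# accepts each drafted token independently with probability `p`.
# These values are hypothetical, not figures reported in the paper.

def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected accepted draft tokens per verification step, plus the
    one token the verifying model always produces itself."""
    # P(at least i drafted tokens accepted) = p**i, so the expected
    # accepted prefix length is sum_{i=1..k} p**i; add 1 for the
    # verifier's own token.
    return sum(p ** i for i in range(1, k + 1)) + 1.0

baseline = 1.0  # tokens per step without speculative decoding
for k, p in [(2, 0.7), (4, 0.7), (4, 0.9)]:
    gain = expected_tokens_per_step(k, p) / baseline
    print(f"draft length {k}, acceptance {p:.1f} -> ~{gain:.2f}x tokens/step")

# Caveat: wall-clock speedup also depends on the cost of the draft pass
# and the verification batch, so tokens-per-step is an upper bound.
```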
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
This paper matters because it pushes on a practical bottleneck, not just a leaderboard one: how to run very large reasoning models fast enough and cheaply enough that long-context, tool-using agents become more deployable. NVIDIA claims a 120.6B-parameter open model with only ~12.7B active parameters per pass, context up to 1M tokens, and materially higher throughput than comparable open 120B-class models. If those claims hold outside NVIDIA’s stack, they would put real pressure on inference economics, model vendor selection, and hardware planning. The evidence is stronger on engineering execution than on universal superiority: the speed gains are measured on NVIDIA B200s with optimized runtimes, but the release of open checkpoints and quantized versions makes this more market-ready than many frontier-model papers.
- The important shift here is not just model quality; it is the claim that a 120B-class model can behave more like a ~12B-class model at serving time because only ~12.7B parameters are active per pass. If that pattern holds in production, teams evaluating open models should compare active compute, not headline parameter count, when estimating latency and GPU budgets (see the sketch after this list).
- The paper’s strongest business claim is throughput, but it is tied to B200 GPUs, NVIDIA-native low precision, and optimized runtimes. A useful buying question is whether a vendor’s advantage comes from model architecture, speculative decoding, quantization, or simply a hardware/software stack you may not share.
- A 1M-token context window plus explicit post-training for terminal use, software tasks, search, SQL, finance, and tool calling suggests the model is aimed at real workflows where context retention and action-taking matter. The implication is that procurement, product, and ops teams can start testing whether one open model can cover document-heavy copilots and agent-style automation without stitching together multiple specialized systems.
- This looks closer to market-ready than many research releases because NVIDIA open-sources base, post-trained, and quantized checkpoints, including deployment-oriented FP8 and NVFP4 versions. What would make it strategically meaningful is independent evidence that the speed/quality tradeoff survives on different serving stacks, not just inside the paper’s preferred setup.
- The paper also shows how hard this is to reproduce: low-precision pretraining caused underflow issues, and large-scale RL required substantial systems workarounds. That means the model may be deployable as a checkpoint, but copying the full training recipe or expecting effortless portability would be an overread.
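As a companion to the first bullet above, here is a minimal sketch of how active parameters, rather than total parameters, drive per-token decode compute. The ~2 FLOPs-per-active-parameter rule of thumb is a common approximation, not a figure from the paper, and the dense 120B comparison point is a generic placeholder rather than any specific competing model.

```python
# Rough per-token decode FLOPs from active parameters (illustrative only).
# Rule of thumb: ~2 FLOPs per active parameter per generated token.
# The Nemotron 3 Super active-parameter count comes from the brief; the
# dense 120B entry is a generic placeholder for comparison.

FLOPS_PER_PARAM = 2  # common forward-pass approximation

models = {
    "Nemotron 3 Super (sparse)": 12.7e9,   # ~12.7B active of 120.6B total
    "Generic dense 120B":        120.0e9,  # all parameters active each pass
}

for name, active_params in models.items():
    tflops_per_token = active_params * FLOPS_PER_PARAM / 1e12
    print(f"{name}: ~{tflops_per_token:.3f} TFLOPs per generated token")

# Caveat: memory bandwidth, KV-cache size, attention vs. Mamba mixing, and
# quantization (FP8/NVFP4) all matter in practice, so this ratio bounds the
# compute-side advantage rather than forecasting serving throughput.
```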
Affiliations
Institution names extracted from the source paper.
NVIDIA
Evidence ledger
The strongest claims in the brief, along with the confidence and citation depth behind them.
Nemotron 3 Super uses a sparse 120.6B architecture with ~12.7B active parameters per forward pass.
The paper reports up to 2.2× and 7.5× higher throughput than GPT-OSS-120B and Qwen3.5-122B under its own test setup.
The architecture is explicitly designed for long-context and agentic workloads, including 1M-token context support and broad agent/tool post-training.
Low-precision training and deployment are central to the efficiency story, but they introduce nontrivial implementation complexity and training pathologies.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.LG
Gym-Anything: Turn any Software into an Agent Environment
Pranjal Aggarwal, Graham Neubig, Sean Welleck
cs.LG
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
Wenyue Hua et al.
cs.LG
AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow
Jiale Liu, Nanzhe Wang