NVIDIA Nemotron-3 Super 120B Targets Agentic AI at Scale

Written by aimodels44 | Published 2026/03/26
Tech Story Tags: artificial-intelligence | software-architecture | software-development | software-engineering | data-science | performance | nvidia-nemotron | nemotron-3-super

TL;DR: NVIDIA Nemotron-3 Super 120B is a reasoning-focused LLM built for agentic workflows, tool use, and long-context enterprise AI tasks.

Model overview

NVIDIA-Nemotron-3-Super-120B-A12B-BF16 is a large language model from NVIDIA designed for agentic workflows and reasoning-intensive tasks. Released in March 2026, this model features 120 billion total parameters with 12 billion active parameters, making it more efficient than its full-parameter counterparts. The architecture combines a hybrid Latent Mixture-of-Experts design with Mamba-2 and attention layers, plus Multi-Token Prediction for faster generation. Unlike smaller variants like the Nemotron-3-Nano-30B, the Super model delivers substantially higher performance across reasoning and agentic benchmarks while supporting up to 1 million token context lengths. The model supports seven languages: English, French, German, Italian, Japanese, Spanish, and Chinese.

Model inputs and outputs

The model accepts text prompts and produces text responses. It can generate intermediate reasoning traces before its final answer, a behavior controlled through the chat template. Reasoning can be toggled per task, letting users choose between faster responses without reasoning and higher-quality outputs that include step-by-step thinking.
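In practice this toggle is usually passed through the chat template at request time. Below is a minimal sketch assuming an OpenAI-compatible serving endpoint (such as vLLM) that forwards `chat_template_kwargs` to the template; the helper name and request shape are illustrative, and only the `enable_thinking` flag comes from the model documentation:

```python
def build_chat_request(user_prompt: str, thinking: bool) -> dict:
    """Build an OpenAI-style chat completion payload with the
    reasoning toggle passed via chat_template_kwargs.

    `enable_thinking` is the flag named in the model docs; the
    surrounding request shape assumes an OpenAI-compatible server.
    """
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        # Toggle intermediate reasoning traces on or off per request.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

fast = build_chat_request("Summarize this ticket.", thinking=False)
deep = build_chat_request("Prove the identity step by step.", thinking=True)
print(fast["chat_template_kwargs"])  # {'enable_thinking': False}
```

The same pattern works for batch pipelines: speed-critical routes set `thinking=False`, while complex reasoning routes pay the extra latency for the trace.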

Inputs

  • Text prompts in supported languages
  • System instructions and user queries
  • Chat templates with configurable reasoning settings (enable_thinking=True/False)

Outputs

  • Generated text responses
  • Optional intermediate reasoning traces showing problem-solving steps
  • Multi-token predictions used internally to speed up generation

Capabilities

This model excels at complex reasoning, tool use, and managing high-volume workloads. It achieves 90.21 on AIME25 and 93.67 on HMMT February 2025 benchmarks without tools, demonstrating strong mathematical and logical reasoning. The model performs agentic tasks like software engineering challenges with 60.47 on SWE-Bench, making it suitable for IT ticket automation and collaborative agent systems. Long-context performance remains strong through 1 million tokens, achieving 91.75 on RULER at 1M token length. Multi-language support enables deployment across global teams and diverse datasets. Tool-augmented performance jumps to 94.73 on HMMT and 82.70 on GPQA when external tools are available, showing effective integration capabilities.

What can I use it for?

Build AI agent systems that need to make decisions and take actions across enterprise tools. Deploy IT ticket automation systems that route, categorize, and resolve support requests at scale. Create retrieval-augmented generation (RAG) systems that combine external knowledge with reasoning. Develop collaborative multi-agent architectures where models coordinate complex workflows. Handle customer service applications requiring nuanced understanding and tool integration. The model's efficiency makes it practical for high-volume production deployments where smaller models like the Nemotron-3-Nano-30B would struggle. Commercial licensing is supported through the NVIDIA Nemotron Open Model License.
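Tool-driven use cases like ticket automation typically mean passing function schemas alongside the chat messages and executing whatever call the model emits. Here is a minimal sketch; the OpenAI-style tool schema format is the standard convention, but the `route_ticket` function and the dispatcher are hypothetical examples, not part of the model's API:

```python
import json

# Hypothetical tool the agent can call; only the schema format
# is standard, the function itself is illustrative.
def route_ticket(queue: str, priority: str) -> str:
    return f"routed to {queue} at priority {priority}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "route_ticket",
        "description": "Route a support ticket to a queue.",
        "parameters": {
            "type": "object",
            "properties": {
                "queue": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "high"]},
            },
            "required": ["queue", "priority"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a tool call emitted by the model (name + JSON args)."""
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "route_ticket":
        return route_ticket(**args)
    raise ValueError(f"unknown tool: {tool_call['name']}")

# Simulate a tool call as the model would emit it.
print(dispatch({"name": "route_ticket",
                "arguments": '{"queue": "billing", "priority": "high"}'}))
# → routed to billing at priority high
```

In a real deployment, `TOOLS` would be sent with each request and `dispatch` would run in a loop, feeding tool results back to the model until it produces a final answer.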

Things to try

Configure the reasoning mode differently for various tasks—enable reasoning traces for complex problem-solving but disable them for speed-critical applications. Experiment with tool use by providing function definitions and API endpoints, then observe how the model naturally calls them to solve problems. Test the model's performance on your domain-specific data by fine-tuning using the provided training recipes and datasets. Push the 1 million token context window with long documents, legal contracts, or codebase analysis to leverage capabilities most smaller models cannot match. Run A/B tests comparing outputs with reasoning enabled versus disabled to find your quality-speed tradeoff. Use temperature=1.0 and top_p=0.95 across all tasks and serving backends, as recommended in official guidance.
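The recommended sampling settings can be applied uniformly to every request. A minimal sketch follows; only `temperature=1.0` and `top_p=0.95` come from the official guidance, while the request-wrapper helper is an illustrative assumption:

```python
# Recommended decoding settings from the official guidance.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95}

def with_recommended_sampling(payload: dict) -> dict:
    """Merge the recommended decoding parameters into a request,
    overriding any sampling values already present."""
    return {**payload, **RECOMMENDED_SAMPLING}

req = with_recommended_sampling({
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
    "messages": [{"role": "user", "content": "Hello"}],
})
print(req["temperature"], req["top_p"])  # 1.0 0.95
```

Centralizing the settings this way makes A/B tests cleaner: the only variable between runs is the reasoning toggle, not the decoding parameters.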


This is a simplified guide to an AI model called NVIDIA-Nemotron-3-Super-120B-A12B-BF16 maintained by nvidia. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.



Published by HackerNoon on 2026/03/26