NVIDIA's Nemotron 3 Super took the top spot this week on EnterpriseOps-Gym, a new 1,150-task agent benchmark that runs models in fully interactive environments with 512 callable tools, beating DeepSeek v3.2 and Kimi-K2.5 to lead the open-source category. The model itself shipped in March; the leaderboard win is the news. But the more interesting story is what made it possible: this is the first frontier-scale model pretrained natively in 4-bit precision.
Nemotron 3 Super is 120B total / 12B active parameters, a hybrid Mamba-Transformer-MoE with a 1M-token context window. Three architectural moves stack here. LatentMoE projects token embeddings into a compressed low-rank latent space before routing to experts and back, letting the model consult 4× as many experts for the same compute cost. Multi-Token Prediction uses shared-weight heads to forecast several future tokens at once, with a claimed speedup of up to 3× wall-clock on structured generation. Most significant: NVFP4 native pretraining means the model learned to be accurate within 4-bit arithmetic from the very first gradient update, rather than being quantized post-hoc after FP16/FP32 training. NVIDIA reports a 4× inference speedup on B200 vs FP8 on H100. EnterpriseOps-Gym score: 27.3 average, ahead of Kimi-K2.5 (2nd) and DeepSeek v3.2 (3rd). PinchBench: 85.6%. Inference throughput: 2.2× faster than GPT-OSS-120B and 7.5× faster than Qwen3.5-122B at 8k input / 64k output.
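The LatentMoE claim (4× as many experts for the same compute) can be sanity-checked with a back-of-envelope count. The sketch below is only an illustration under stated assumptions: every size in it is hypothetical, it assumes each expert is a plain two-matrix FFN whose input width shrinks with the latent dimension while its inner width stays fixed, and it reports the cost of projecting into and out of the latent space separately. It is not NVIDIA's implementation.

```python
# Back-of-envelope multiply-add count for one token through an MoE layer.
# All sizes are hypothetical and chosen only for illustration.

def expert_flops(width: int, d_ff: int) -> int:
    """Multiply-adds for one two-matrix FFN expert (up- plus down-projection)."""
    return 2 * width * d_ff


def moe_flops(width: int, d_ff: int, active_experts: int) -> int:
    """Expert cost per token when `active_experts` experts are consulted."""
    return active_experts * expert_flops(width, d_ff)


def projection_flops(d_model: int, d_latent: int) -> int:
    """Multiply-adds to project into the latent space and back out again."""
    return 2 * d_model * d_latent


D_MODEL = 4096             # hypothetical hidden size
D_LATENT = D_MODEL // 4    # hypothetical 4x compression into the latent space
D_FF = 2048                # hypothetical per-expert inner width
TOP_K = 2                  # hypothetical active experts in a standard MoE

standard = moe_flops(D_MODEL, D_FF, TOP_K)        # experts see full width
latent = moe_flops(D_LATENT, D_FF, 4 * TOP_K)     # 4x the experts, 1/4 width
overhead = projection_flops(D_MODEL, D_LATENT)    # paid once per token

print(f"standard MoE, {TOP_K} experts: {standard:,} multiply-adds")
print(f"latent MoE, {4 * TOP_K} experts: {latent:,} multiply-adds")
print(f"latent projection overhead: {overhead:,} multiply-adds")
```

Under these assumptions the expert cost is identical in both setups, and the only extra cost is the one-time projection per token, which gets relatively cheaper as more experts are consulted.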
Native low-precision pretraining is the genuinely new thing. Until now, the move has been: train in BF16 or FP8, then quantize post-hoc to INT4 or NVFP4 for deployment, paying a quality tax along the way. Because Nemotron 3 Super was trained natively in 4-bit, its weight distributions are already compatible with the deployment format: no post-hoc gymnastics, no fine-tuning to recover lost accuracy. If this generalizes, it changes training-compute economics for the next generation of open models, and it lets B200 hardware operate closer to its peak FLOPS budget. The 4× B200-vs-H100-FP8 number is what makes this a generational shift rather than an incremental one. For the wider open-source landscape, DeepSeek and Kimi-K2 have set the bar for "frontier open" since late 2025; NVIDIA shipping a model that beats both on agentic benchmarks, under a permissive license and with free hosted inference, closes a competitive gap that few expected to close this fast.
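To make the "quality tax" of post-hoc quantization concrete, here is a minimal pure-Python sketch of block-scaled 4-bit rounding applied to weights that were not trained with that grid in mind. It assumes the E2M1 value grid and 16-element blocks usually described for NVFP4, and it simplifies deliberately: the per-block scale is kept as an ordinary Python float rather than an FP8 value, and there is no tensor-level scale. The function names and the random "weights" are hypothetical; this is not NVIDIA's kernel or training recipe.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
# Assumed here to match the element format used by NVFP4.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size; each block shares one scale


def quantize_block(block):
    """Round one block to the scaled E2M1 grid and return dequantized values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(nearest * scale, v))
    return out


def fake_quantize(values, block=BLOCK):
    """Quantize and dequantize a flat list of weights, block by block."""
    out = []
    for start in range(0, len(values), block):
        out.extend(quantize_block(values[start:start + block]))
    return out


def mean_abs_error(original, rounded):
    """Average absolute rounding error introduced by quantization."""
    return sum(abs(a - b) for a, b in zip(original, rounded)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # stand-in "trained" weights
dequantized = fake_quantize(weights)
error = mean_abs_error(weights, dequantized)
print(f"mean absolute rounding error: {error:.6f}")
```

The rounding error printed at the end is what post-hoc quantization has to absorb or fine-tune away. Training natively in the low-precision format, as the article describes, means the optimizer sees that rounding from the first step.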
Available on Hugging Face as `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` plus NVFP4 deployment variants, under the NVIDIA Nemotron Open Model License. Free hosted inference via OpenRouter. Worth pulling for agent workloads where 1M context, tool calling, and inference speed matter more than raw single-shot eval scores. The native 4-bit angle is the part to watch for the next six months: if other labs replicate it, the cost-per-quality curve shifts for everyone.
