Cohere Command A+: 218B sparse MoE (25B active), 2x H100 W4A4, Apache 2.0 open

Cohere released Command A+ as Apache 2.0 open-weight: a decoder-only sparse Mixture-of-Experts transformer with 218 billion total parameters, 25 billion active per token. Topology: 128 experts with 8 active per token plus 1 shared expert. 128K input context, 64K max generation. The deployment story is the headline for builders: W4A4 quantization (NVFP4 applied to MoE experts only, attention paths kept full precision) runs on as few as 2 H100 GPUs. Alternative configurations: 1x B200, 4x H100 at FP8, 8x H100 at BF16. Available on HuggingFace, supported by vLLM 0.21.0+ and Transformers. Quantization-Aware Distillation post-training recovers quality at W4A4. Cohere positions Command A+ as the unified multimodal Command A (text, image, tool inputs; text, reasoning, tool use outputs).

Agentic benchmark deltas versus Cohere's prior Command A Reasoning are the substantive signal. τ²-Bench Telecom moved 37% to 85%. Terminal-Bench Hard agentic coding went 3% to 25%. Agentic QA accuracy improved by 20 percentage points. The Terminal-Bench Hard delta is the most telling — that benchmark tests multi-step command-line agentic problem-solving, and a 3% to 25% jump on the Hard tier is a step change in agent reliability for systems work. Cohere is targeting the same agentic capability claim as Anthropic's Code With Claude Capability Curve (SWE-bench 62%→87% in twelve months) and Google's Gemini 3.5 Flash agent-first framing, but with open weights instead of closed API. The W4A4 deployment story is what differentiates: running a 218B-class frontier MoE on 2 H100s is the accessible-to-mid-market scenario that closed-weight Anthropic/Google/OpenAI frontier models can't match on TCO.

Ecosystem context. NVFP4 (the 4-bit format we covered on the May 18 NVIDIA pre-training piece) is the quantization standard here — Cohere is using it on the MoE expert paths while keeping attention at full precision. That's the practical shape of NVFP4 adoption: not full-model 4-bit, but selective application to the high-parameter-count low-precision-tolerant layers. The MoE design (218B total, 25B active) follows the DeepSeek-V3 and Llama 4 Behemoth lineage — sparse activation lets the model carry frontier-scale knowledge without frontier-scale inference cost. Apache 2.0 is the strategic differentiator: Cohere is positioning as the open-weights frontier-class option versus Anthropic and Google going closed-weight vertical (Code With Claude, Antigravity) and Mistral going industrial-vertical (Emmi acquisition). Five labs, five different bets visible this week. Cohere's bet is open-weights agentic frontier on accessible hardware.

Monday: if you run agentic workloads on closed-API frontier models (Claude Opus, GPT-4-class, Gemini Pro), benchmark Command A+ on your own evals — Apache 2.0 means you can fine-tune, redistribute, modify without commercial-use restrictions. Specific tests: (1) run your terminal-style agentic tasks against Command A+ W4A4 on 2 H100s, compare wall-clock and quality to your current closed-API spend. The Terminal-Bench Hard 3%→25% claim is concrete enough to verify on your own task distribution. (2) Evaluate the 128K input / 64K generation budget against your agentic context needs — most long-horizon agents are bounded by output generation, not input context, so 64K max generation is the relevant constraint. (3) If you've been holding off on agentic deployment because of closed-API cost or data-egress concerns, the W4A4 / 2-H100 deployment story may close that gap. For the broader trend: open-weights frontier-class agentic models are now a real category, not a future hope. Cohere just made it concrete. Watch for DeepSeek, Llama, and Qwen to follow with their own NVFP4-quantized agentic-tuned releases over the next quarter.

Cohere Command A+: 218B sparse MoE (25B active), 2x H100 W4A4, Apache 2.0 open

More News