Mistral has released Mistral Medium 3.5, a dense 128B model with a 256k context window, alongside Vibe (a CLI coding agent) and Remote Agents (async cloud-based coding sessions, spawnable from CLI or Le Chat). The model is multimodal with a vision encoder Mistral trained from scratch to handle variable image sizes and aspect ratios — not a CLIP retrofit. Weights ship open on HuggingFace. Mistral describes 3.5 as their first "flagship merged" model, which is corporate-speak that needs unpacking.
The headline benchmark is 77.6% on SWE-Bench Verified, with 91.4% on τ³-Telecom. The first number is the one to stress-test, because Verified scores are harness-dependent: OpenHands, SWE-agent, and mini-SWE-agent each give different pass rates from the same model. Mistral has not disclosed the harness, and that's the missing piece. For honest comparison: Claude Sonnet 4.5 sits at 82.0% on SWE-Bench Verified (with parallel test-time compute) under Anthropic's published harness; Mistral's 77.6% under an unknown configuration is competitive but not directly comparable. The 256k context plus dense (not MoE) architecture at 128B is unusual — most labs at this scale have moved to sparse routing. Dense gives consistent latency and simpler deployment; the cost is parameter efficiency.
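To make that concrete, here's the back-of-envelope arithmetic, assuming the standard 500-instance SWE-Bench Verified set; the harness spread in the final comment is illustrative, not a measured figure.

```python
# Back-of-envelope: what the headline percentages mean in resolved instances
# on SWE-Bench Verified (standard set: 500 instances).

VERIFIED_INSTANCES = 500

scores = {
    "Mistral Medium 3.5 (harness undisclosed)": 0.776,
    "Claude Sonnet 4.5 (Anthropic harness, parallel test-time compute)": 0.820,
}

for name, rate in scores.items():
    resolved = round(rate * VERIFIED_INSTANCES)
    print(f"{name}: {rate:.1%} ≈ {resolved} / {VERIFIED_INSTANCES} instances")

# Gap between the two headline numbers, in absolute tasks.
gap = round((0.820 - 0.776) * VERIFIED_INSTANCES)
print(f"headline gap ≈ {gap} instances")

# Illustrative only: a 2-4 point swing between harnesses on the same model
# is 10-20 instances, i.e. the same order as the headline gap itself,
# which is why the undisclosed harness matters.
```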
Vibe and Remote Agents are the real product story. Vibe runs locally as a CLI coding agent. Remote Agents extend that to long-running cloud sessions in isolated sandboxes — and crucially, local sessions can be teleported to the cloud preserving history and state. Integration points include GitHub, Linear, Jira, Sentry, Slack, and Teams. Mistral is converging on the same agent-and-async-execution shape that Devin, Claude Code, and Codex have been building toward, but with open weights underneath and an EU sovereignty angle that matters for European builders and regulated industries. Open-weight agent infrastructure with a 77%-class SWE-Bench model is a different proposition from the closed-weight equivalents.
Pull the weights and run them through your own harness before trusting the 77.6%. If you're EU-based or have data-residency constraints, this is the most credible open-weight option for a frontier-class coding agent. Vibe is worth a try if CLI tooling is already your workflow, and Remote Agents via Le Chat change the cost curve on long autonomous tasks. The dense architecture means inference is heavier per token than an equivalent MoE; budget for that if you're self-hosting.
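For a rough self-hosting budget, here's a sketch of the weight footprint and per-token compute for a dense 128B model; the MoE comparison point (24B active parameters) is a hypothetical stand-in, not any specific model.

```python
# Rough self-hosting arithmetic for a dense 128B decoder-only model.
# Weight memory scales with total parameters; per-token compute is roughly
# 2 FLOPs per parameter actually used for that token (multiply + add).

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

def flops_per_token(active_params: float) -> float:
    return 2 * active_params

DENSE_TOTAL = 128e9                    # dense: every parameter is active per token
MOE_TOTAL, MOE_ACTIVE = 128e9, 24e9    # hypothetical MoE of the same total size

print(f"weights @ bf16: {weight_memory_gb(DENSE_TOTAL, 2):.0f} GB")
print(f"weights @ fp8:  {weight_memory_gb(DENSE_TOTAL, 1):.0f} GB")
print(f"dense:            {flops_per_token(DENSE_TOTAL) / 1e9:.0f} GFLOPs/token")
print(f"hypothetical MoE: {flops_per_token(MOE_ACTIVE) / 1e9:.0f} GFLOPs/token")

# KV cache at 256k context comes on top of the weight footprint and depends
# on layer count and head configuration, which aren't covered here.
```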
