NVIDIA Polar trains agents via API proxy, Qwen3.5-4B 3.8% to 26.4% on Codex

NVIDIA released Polar (Apache-2.0, on GitHub at NVIDIA-NeMo/ProRL-Agent-Server), a rollout framework that trains language agents with GRPO reinforcement learning without modifying their agent harnesses. The architecture is a gateway proxy at the model API boundary: it detects provider APIs (Anthropic, OpenAI, Google), normalizes requests to OpenAI Chat Completions format, captures token-level data and log probabilities, then returns responses in the original provider shape. Only required harness change is pointing its model base URL at the gateway. Reported results on a Qwen3.5-4B base: SWE-Bench Verified pass@1 goes from 3.8% to 26.4% under the Codex harness (+22.6 pp), with smaller gains of +4.8 pp on Claude Code and +6.2 pp on Pi.

The harness-specific gain spread is the most interesting builder signal. Codex sees the biggest lift because Qwen3.5-4B started unfamiliar with Codex's action protocol and patch submission style — GRPO closed the alignment gap between base-model output distribution and harness expectations. Claude Code lifted less because "the base model is already well-aligned with that harness," which says Claude Code's interaction format is closer to natural code-tool dialogue than Codex's. That delta is also a signal about pretraining data composition: harness conventions that look like natural code review are absorbed earlier than harness conventions with custom action vocabularies. Multi-turn trajectory reconstruction uses prefix_merging — verifying strict token-prefix relations between consecutive completions to form coherent chains across what the harness sees as separate API calls.

The ecosystem read for builders: agent training is becoming harness-decoupled, which lowers the cost and increases the surface of "make this model better at this specific tool stack." 64 GPU-hours of offline SFT on 8×H100s is the offline-rollout compute footprint — in the $200-400 range at current spot rates, well within indie ML budget. Apache-2.0 license and built-in support for Codex, Claude Code, Qwen Code, Gemini CLI, OpenCode, and Pi means any team running these harnesses can train a custom model variant against their actual prod harness without rewriting the harness or maintaining a forked stack. The proxy architecture also has secondary uses — eval logging, behavior monitoring, replay debugging — that any agent platform could lift.

If you train your own agent models Monday morning: Polar is the cleanest path from a generic base model to a harness-specialized agent variant for a non-trivial budget. If you ship an agent harness: instrument your harness so it advertises configurable model base URL, reliable token IDs, and per-call log probabilities — that is the minimum interface to be trainable. The next phase of agent improvement is harness-specific RL on top of generic bases, and Polar is a reference implementation of how that loop closes.

NVIDIA Polar trains agents via API proxy, Qwen3.5-4B 3.8% to 26.4% on Codex

More News