GitHub released Rubber Duck, an experimental feature in Copilot CLI that uses cross-model review to catch coding errors that single AI models consistently miss. When developers use Claude as their primary coding agent, Rubber Duck automatically runs a secondary review using GPT-5.4, and vice versa. The system triggers at three key points: after planning, after complex implementations, and after writing tests but before execution.
This addresses a fundamental problem with AI coding agents: they compound early mistakes because later steps build on the same flawed assumptions. Self-reflection helps, but a model reviewing its own work is still constrained by the same training biases that created the error. Different model families—Anthropic's Claude versus OpenAI's GPT—carry different training biases, making cross-model review more effective at surfacing blind spots.
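The cross-model review loop described above can be sketched in a few lines. This is a hypothetical illustration, not GitHub's implementation: the model calls are stubbed out, and the pairing table simply encodes the article's claim that each primary agent is reviewed by a model from the other family.

```python
# Hypothetical sketch of cross-model review. In a real system, the stub
# functions below would be API calls to the actual models; here they just
# return labeled strings so the flow is runnable.

PAIRINGS = {"claude": "gpt", "gpt": "claude"}  # assumed pairing, per the article

def primary_generate(model: str, task: str) -> str:
    # Stub for the primary coding agent producing a plan or implementation.
    return f"[{model}] plan for: {task}"

def secondary_review(model: str, artifact: str) -> list[str]:
    # Stub for the reviewer surfacing assumptions, edge cases, and conflicts.
    return [f"[{model}] check assumptions in: {artifact}"]

def rubber_duck_step(primary: str, task: str) -> tuple[str, list[str]]:
    artifact = primary_generate(primary, task)
    reviewer = PAIRINGS[primary]  # always a different model family
    findings = secondary_review(reviewer, artifact)
    return artifact, findings

plan, findings = rubber_duck_step("claude", "add retry logic to the scheduler")
```

The key design point is that the reviewer is chosen from a different model family than the author, so its training biases differ from those that produced the artifact.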
Benchmark results on SWE-Bench Pro show Claude Sonnet with Rubber Duck closed 74.7% of the performance gap between Sonnet alone and the more capable Claude Opus, with gains most pronounced on multi-file tasks requiring 70+ steps. GitHub's examples reveal the kinds of errors caught: schedulers that exit immediately, infinite loops in background tasks, and silent data overwrites that drop search categories. One particularly telling case involved NodeBB's email system, where three files were reading from a Redis key that new code had stopped writing: a deployment-breaking bug that produced no error message.
For developers, this represents a practical step toward more reliable AI coding assistance. The narrow focus on surfacing assumptions, edge cases, and requirement conflicts suggests GitHub understands that effective AI review isn't about rewriting code—it's about identifying the specific failure modes that human developers need to know about before shipping.
