GitHub Copilot CLI gets a second-opinion feature built on cross-model review

Coding agents make decisions in sequence: a plan is drafted, implemented, then tested. Any error introduced early compounds as subsequent steps build on the same flawed assumption. Self-reflection is a recognized mitigation technique, and one GitHub Copilot already supports, but a model reviewing its own output is still constrained by the same training data and blind spots that produced it.

GitHub addressed that constraint this week with the release of Rubber Duck, a cross-model review feature now available in experimental mode in GitHub Copilot CLI.

What Rubber Duck does

Rubber Duck is a dedicated review agent that runs on a model from a different AI family than the one handling the primary Copilot session. When a developer selects a Claude model as the orchestrator in the model picker, Rubber Duck runs on GPT-5.4. Different model families carry different training biases, so a review from a complementary family can surface errors that the primary model consistently misses.

The reviewer’s job is narrow. It produces a short list of concerns: assumptions the primary agent made without sufficient basis, edge cases that were overlooked, and implementation details that conflict with requirements elsewhere in the codebase.
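GitHub has not published the reviewer's interface, but the behavior described above, routing a narrow critique to a different model family, can be sketched roughly as follows. All names here (`Concern`, `pick_reviewer`, `cross_model_review`, the prompt wording) are illustrative assumptions, not Copilot internals:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Concern:
    kind: str    # e.g. "assumption", "edge_case", "requirement_conflict"
    detail: str

def pick_reviewer(orchestrator_family: str) -> str:
    # Route the review to a model family different from the orchestrator's,
    # so the critique does not share the primary model's training biases.
    return "gpt" if orchestrator_family == "claude" else "claude"

def cross_model_review(artifact: str, orchestrator_family: str,
                       query: Callable[[str, str], list[Concern]]) -> list[Concern]:
    # The reviewer's job is deliberately narrow: a short list of concerns,
    # not a full rewrite of the artifact under review.
    prompt = ("List unsupported assumptions, overlooked edge cases, and "
              "implementation details that conflict with requirements "
              "elsewhere in the codebase:\n" + artifact)
    return query(pick_reviewer(orchestrator_family), prompt)
```

The key design point is the family swap in `pick_reviewer`: the review only adds independent signal if the critic was trained differently from the author.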

Benchmark results on SWE-Bench Pro

“Our evaluations show that Claude Sonnet + Rubber Duck makes up 74.7% of the performance gap between Sonnet and Opus alone, achieving better results for tackling difficult multi-file and long-running tasks,” researchers Nick McKenna and Bartek Perz explained.

The gains were more pronounced on harder problems. On tasks spanning three or more files that would normally require 70 or more steps, Sonnet paired with Rubber Duck scored 3.8% higher than the Sonnet baseline, and 4.8% higher on the hardest subset identified across three trials.
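The "percentage of the gap" framing can be made concrete with a small calculation. The actual Sonnet and Opus baseline scores are not given in the post, so the numbers below are purely illustrative:

```python
def gap_closed(baseline: float, ceiling: float, observed: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the observed score."""
    return (observed - baseline) / (ceiling - baseline)

# Hypothetical SWE-Bench Pro resolve rates, chosen only to illustrate the math:
sonnet_alone = 40.0       # baseline
opus_alone = 50.0         # stronger-model ceiling
sonnet_with_duck = 47.47  # Sonnet + Rubber Duck

print(f"{gap_closed(sonnet_alone, opus_alone, sonnet_with_duck):.1%}")  # → 74.7%
```

In other words, the claim is not that Sonnet + Rubber Duck beats Opus, but that it recovers roughly three quarters of the distance between the two.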

GitHub presented three examples of errors Rubber Duck caught during testing:

- In the OpenLibrary async scheduler, Rubber Duck identified that a proposed scheduler would exit immediately on start, running zero jobs, and that one of the scheduled tasks was itself an infinite loop.
- In Solr facet handling, it caught a loop that silently overwrote the same dictionary key on every iteration, causing three of four facet categories to be dropped from search queries without any error raised.
- In NodeBB's email confirmation flow, it identified that three files were reading from a Redis key that the new code had stopped writing, which would have silently broken the confirmation UI and cleanup paths on deploy.

When the review agent activates

Rubber Duck can be triggered automatically or on demand. GitHub Copilot invokes it automatically at three checkpoints: after the agent drafts a plan, after a complex implementation, and after writing tests but before running them. The agent can also call Rubber Duck reactively if it becomes stuck in a loop. Developers can request a critique at any point in a session; Copilot will query Rubber Duck, incorporate the feedback, and display what changed and why.

The design deliberately limits how often Rubber Duck activates. The goal is to surface high-value signal at the checkpoints where it matters most, without adding noise to routine tasks. Rubber Duck runs through Copilot’s existing task tool, the same infrastructure used for other subagents.
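The activation policy described above amounts to a small gating rule: review at a few high-value checkpoints, on explicit request, or when the agent is stuck, and stay quiet otherwise. A minimal sketch, with event names that are illustrative rather than Copilot's internal identifiers:

```python
# Checkpoints at which the review agent fires automatically; these mirror the
# three described triggers (plan drafted, complex implementation, tests written
# but not yet run). The names themselves are assumptions.
AUTO_CHECKPOINTS = {"plan_drafted", "complex_implementation_done", "tests_written"}

def should_review(event: str, *, user_requested: bool = False,
                  loop_detected: bool = False) -> bool:
    # Trigger a cross-model review at high-value checkpoints, on explicit
    # developer request, or when the agent appears stuck in a loop -- but not
    # on every routine step, to keep the review signal high and the noise low.
    return user_requested or loop_detected or event in AUTO_CHECKPOINTS

print(should_review("plan_drafted"))                     # True
print(should_review("file_saved"))                       # False
print(should_review("file_saved", loop_detected=True))   # True
```

Keeping the checkpoint set small is the point: a reviewer that fires on every step would add latency and noise to routine tasks without improving outcomes.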

Availability and model scope

Rubber Duck is available now in experimental mode in GitHub Copilot CLI. Developers access it by running the /experimental slash command. The feature requires a Claude model selected in the model picker and access to GPT-5.4. GitHub has enabled Rubber Duck for all Claude family models in the orchestrator role, including Opus, Sonnet, and Haiku, and states it is exploring additional model family pairings for future configurations.
