Uber published a writeup of how they migrated 75,000+ test classes and 1.25 million lines of test code across their Java monorepo from JUnit 4 to JUnit 5, and the headline detail for any builder team currently weighing AI-assisted refactoring is that Uber explicitly chose not to use generative AI for the transformation. Their stated reasoning, quoted in the writeup: "Deterministic transformation tooling was critical for consistency at this scale," while LLM-based approaches "produced inconsistent results across custom test patterns." Instead, the team built on OpenRewrite, the open-source semantic code transformation framework that operates on lossless semantic trees rather than raw text, with custom recipes targeting Uber-specific base classes and test runners. They paired it with a unified-execution compatibility layer (JUnit Platform running the Vintage and Jupiter engines simultaneously, so partially migrated parts of the monorepo kept working), precondition checks to block partial migrations, and an internal orchestration system called Shepherd that fanned transformations out across thousands of Bazel targets in parallel and validated each via CI.
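Mechanically, that compatibility layer amounts to putting junit-vintage-engine and junit-jupiter-engine on the same test runtime classpath and letting the JUnit Platform launcher drive both. Here is a minimal sketch of the idea — illustrative rather than Uber's actual wiring (their runs go through Bazel test rules), and the package name is a placeholder:

```java
import static org.junit.platform.engine.discovery.DiscoverySelectors.selectPackage;

import java.io.PrintWriter;

import org.junit.platform.launcher.Launcher;
import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder;
import org.junit.platform.launcher.core.LauncherFactory;
import org.junit.platform.launcher.listeners.SummaryGeneratingListener;

public class MixedEngineRun {
    public static void main(String[] args) {
        // With junit-vintage-engine and junit-jupiter-engine both on the
        // runtime classpath, one launcher discovers and runs JUnit 4 and
        // JUnit 5 tests side by side; no per-class migration flag is needed.
        LauncherDiscoveryRequest request = LauncherDiscoveryRequestBuilder.request()
                .selectors(selectPackage("com.example.tests")) // placeholder package
                .build();
        SummaryGeneratingListener listener = new SummaryGeneratingListener();
        Launcher launcher = LauncherFactory.create();
        launcher.execute(request, listener);
        listener.getSummary().printTo(new PrintWriter(System.out));
    }
}
```

This is what makes an incremental migration tractable: a half-migrated package still runs green, so each Shepherd batch can land independently.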

The technical reality behind the choice is more interesting than the LLM-versus-not framing suggests. At Uber scale, the failure mode that matters most is silent inconsistency: a transformation that works on 99.5% of files and quietly mangles the other 0.5% leaves 375 of those 75,000 test classes broken, each of which has to be diagnosed and hand-fixed. OpenRewrite recipes are deterministic: given the same input tree and the same recipe, you get the same output every run, and the transformations are expressible as composable visitors over a typed semantic tree. LLM-based code transformation, by contrast, is non-deterministic at the token level and struggles most with rare patterns underrepresented in its training data, which is exactly where Uber's custom test runners and base-class hierarchies live. The InfoQ piece notes that initial Shepherd runs surfaced build and test failures that informed updates to the transformation logic; this is the iteration loop you can actually run with deterministic tooling, because the failures are reproducible. With an LLM, you re-run the same prompt and get a slightly different mistake, which is much harder to diagnose at scale.
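For a feel of what "composable visitors over a typed semantic tree" looks like in practice, here is a minimal recipe sketch — not Uber's code, just the shape of one mechanical swap (@Before to @BeforeEach) written against the OpenRewrite 8.x API, with a hypothetical class name:

```java
import org.openrewrite.ExecutionContext;
import org.openrewrite.Recipe;
import org.openrewrite.TreeVisitor;
import org.openrewrite.java.AnnotationMatcher;
import org.openrewrite.java.JavaIsoVisitor;
import org.openrewrite.java.JavaParser;
import org.openrewrite.java.JavaTemplate;
import org.openrewrite.java.tree.J;

public class BeforeToBeforeEach extends Recipe {
    // Matches only the fully qualified JUnit 4 annotation, so already-migrated
    // files pass through untouched — same input, same output, every run.
    private static final AnnotationMatcher JUNIT4_BEFORE =
            new AnnotationMatcher("@org.junit.Before");

    @Override
    public String getDisplayName() {
        return "Replace JUnit 4 @Before with Jupiter @BeforeEach";
    }

    @Override
    public String getDescription() {
        return "Swaps the annotation and updates imports.";
    }

    @Override
    public TreeVisitor<?, ExecutionContext> getVisitor() {
        return new JavaIsoVisitor<ExecutionContext>() {
            @Override
            public J.Annotation visitAnnotation(J.Annotation annotation,
                                                ExecutionContext ctx) {
                J.Annotation a = super.visitAnnotation(annotation, ctx);
                if (!JUNIT4_BEFORE.matches(a)) {
                    return a;
                }
                // Import bookkeeping is handled by the framework, not regex.
                maybeRemoveImport("org.junit.Before");
                maybeAddImport("org.junit.jupiter.api.BeforeEach");
                return JavaTemplate.builder("@BeforeEach")
                        .imports("org.junit.jupiter.api.BeforeEach")
                        .javaParser(JavaParser.fromJavaVersion()
                                .classpath("junit-jupiter-api"))
                        .build()
                        .apply(getCursor(), a.getCoordinates().replace());
            }
        };
    }
}
```

Because the visitor matches on the fully qualified type, an @Before from some other package, string literals, and comments all pass through untouched, and re-running the recipe on migrated code is a no-op — the reproducibility that makes a Shepherd-style CI iteration loop workable.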

The broader implication for the AI-coding-tools narrative is worth being precise about. Uber is not saying LLMs cannot do code transformation; they are saying that for this specific class of problem (high-volume, mechanical-but-pattern-rich, correctness-critical), deterministic tooling won. This matches what the companies building frontier models do in their own codebases: large-scale rewrites at Google, Meta, and Microsoft have for years been done with deterministic refactoring tools (rewrite engines, jscodeshift, gofmt-style transforms, Comby, OpenRewrite), with LLMs used selectively for the long tail of patterns the deterministic recipes cannot express. The framing in the tech press of "AI replaces code refactoring" gets this backward: at scale, the AI assist is in the recipe-writing and edge-case handling, not in the bulk transformation pass. The economics also favour determinism for a one-time migration: writing a recipe is a fixed cost that amortises across 75,000 files, while running an LLM over 75,000 files is a variable cost that scales linearly and produces output you still have to verify.

For builder teams, the actionable takeaway is to think about your refactoring tasks in three buckets. First, mechanical pattern transformations with finite, well-defined rules: API renames, import updates, annotation swaps, JUnit version migrations. These belong to deterministic AST tools, full stop, and Uber's writeup is the clearest recent case study of what that looks like at scale (the before/after sketch below shows the shape of these rules). Second, semantic refactorings with judgment calls: extracting an abstraction, renaming for clarity, restructuring control flow. These are where AI-assisted coding tools earn their keep, because the edits are local, reviewable, and the LLM's flexibility helps where rigid recipes break. Third, bug-fixing or feature work with embedded refactoring: this is the agentic-coding-tools sweet spot, where the model can read the surrounding context and adapt. The mistake to avoid is using a tool from one bucket for a job in another. Uber's choice to ship 1.25M lines of mechanical migration on OpenRewrite, with a deterministic CI loop and parallel orchestration, is the right answer for bucket one, and it is worth keeping in mind the next time someone proposes throwing Claude or GPT at a million-line refactor.
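To make bucket one concrete, here is a hypothetical before/after pair — two revisions of the same test file shown back to back, so an illustration rather than a single compilable unit. Every difference between them is a finite rule a recipe encodes once: annotation swap, import update, and the assertion message moving from the first argument to the last.

```java
// Before: JUnit 4 (hypothetical file)
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class MaxTest {
    private int floor;

    @Before
    public void setUp() { floor = 2; }

    @Test
    public void maxPicksLarger() {
        // JUnit 4: the message comes first
        assertEquals("max should pick the larger value", 5, Math.max(floor, 5));
    }
}

// After: JUnit 5 — derivable from the above purely by rule
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class MaxTest {
    private int floor;

    @BeforeEach
    void setUp() { floor = 2; }

    @Test
    void maxPicksLarger() {
        // JUnit 5: the message moves to the last argument — the kind of
        // pattern-rich but finite rule a recipe encodes exactly once
        assertEquals(5, Math.max(floor, 5), "max should pick the larger value");
    }
}
```

That argument-order rule is the "pattern-rich" part: a textual find-and-replace would miss it, a token-level generator might apply it inconsistently, and a typed-tree recipe applies it uniformly across all 75,000 classes.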