Recursive Language Models (RLM), submitted December 2025 and last revised May 11, propose handling prompts up to two orders of magnitude beyond the model's native context window by giving the LLM a Python REPL it can recursively call. Alex L. Zhang, Tim Kraska (MIT), and Omar Khattab (Stanford, also DSPy and ColBERT) report 26% improvement over compaction methods on GPT-5, 130% over CodeAct with sub-calls, and 13% over Claude Code on four long-context tasks at comparable cost. The fine-tuned RLM-Qwen3-8B variant gains 28.3% over baseline Qwen3-8B and approaches vanilla GPT-5 on three of those tasks. arXiv 2512.24601.

The mechanism: the parent LLM runs in a Python REPL where the user context is bound to a `context` variable, and an `llm_query()` function spawns child RLM instances with their own fresh REPLs. The architectural choice that makes the whole thing work is that child responses are returned as Python variables, not as text dumped back into the parent's context window. The parent composes final answers from variable references ("the dictionary I asked sub-call A to build", "the country list I asked sub-call B for") without paying the token cost of re-inlining their outputs. That is the structural difference from Anthropic's Claude Code subagents and from CodeAct, both of which return text into the parent's running context.
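To make the mechanism concrete, here is a toy sketch of a parent step running in a REPL-like namespace. It is not the paper's implementation: `run_parent_step` is a hypothetical helper, `llm_query` is stubbed with a plain function instead of a model call, and the `parent_code` string stands in for code the parent model would write. The point it illustrates is that child results stay in the namespace as variables while only the short printed line would go back into the parent's context.

```python
# Toy illustration only: llm_query is a stand-in stub, not a real model call.
import contextlib
import io


def llm_query(prompt: str) -> str:
    """Hypothetical child call. A real system would invoke a child model here."""
    # Stub behavior: pretend the child extracted the capitalized words it saw.
    return ",".join(word for word in prompt.split() if word.istitle())


def run_parent_step(code: str, namespace: dict) -> str:
    """Execute one block of parent-written code and return only what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue()


# The long user input lives in the REPL namespace, not in the prompt.
namespace = {
    "context": "alpha France beta Japan gamma " * 1000,
    "llm_query": llm_query,
}

# Code the parent model might emit: slice the context, delegate to children,
# and keep their (long) answers as variables instead of printing them.
parent_code = """
half = len(context) // 2
countries_a = llm_query(context[:half])
countries_b = llm_query(context[half:])
unique = sorted(set(countries_a.split(",")) | set(countries_b.split(",")))
print(f"{len(unique)} unique names found")
"""

# Only this short string would be appended to the parent's running context.
observation = run_parent_step(parent_code, namespace)
```

After the step, `namespace["countries_a"]` holds several thousand characters of child output, while `observation` is a single short line. A production version would need sandboxing around `exec`, which this sketch omits.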

Mapped to the existing agent-architecture taxonomy: ReAct (single agent plus tool loop), CodeAct (agent calls user-defined Python functions), Self-Loops (agent re-prompts itself with summarized history), and Subagents (lead agent delegates to specialist sub-agents over text). RLM is closest to Subagents but with symbolic-return rather than text-return semantics. The economic claim, comparable cost while exceeding all four, comes from not blowing up the parent context with child outputs that the parent only needs by reference. Two questions the paper raises but does not fully settle for production: how to debug when half your reasoning lives behind opaque variable references, and how to cache child computations across runs.

Monday: if you operate an agent system that hits context limits because subagent or tool outputs are eating budget, the symbolic-return pattern is implementable today even without adopting the full RLM framework: wrap your subagent calls so the parent receives a handle to where the output lives, not the output itself. The Qwen3-8B result (28.3% lift on the same model) suggests this technique compounds with whatever model you are running, not just frontier models. Watch for Anthropic, OpenAI, or Google adopting symbolic-return semantics in their first-party subagent products over the next two quarters.
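One way the wrapper could look, as a minimal sketch and not anything taken from the paper or from a vendor SDK: `HandleStore`, `symbolic_return`, and `fake_subagent` are all hypothetical names, and the store is a plain in-memory dict. The parent gets back a small receipt (handle, length, short preview) and fetches the full output only when it actually needs it.

```python
import uuid
from typing import Callable, Dict


class HandleStore:
    """Hypothetical in-memory store; a real system might use disk or a key-value store."""

    def __init__(self) -> None:
        self._outputs: Dict[str, str] = {}

    def put(self, output: str) -> str:
        """Save an output and return an opaque handle for it."""
        handle = f"out-{uuid.uuid4().hex[:8]}"
        self._outputs[handle] = output
        return handle

    def get(self, handle: str) -> str:
        """Dereference a handle back to the full output."""
        return self._outputs[handle]


def symbolic_return(
    subagent: Callable[[str], str],
    store: HandleStore,
    preview_chars: int = 80,
) -> Callable[[str], dict]:
    """Wrap a text-returning subagent so the parent sees a handle plus a short preview."""

    def wrapped(task: str) -> dict:
        output = subagent(task)
        return {
            "handle": store.put(output),
            "length": len(output),
            "preview": output[:preview_chars],
        }

    return wrapped


def fake_subagent(task: str) -> str:
    """Stand-in for an existing subagent call that returns a long text report."""
    return f"report for {task}: " + "x" * 5000


store = HandleStore()
research = symbolic_return(fake_subagent, store)

# The parent's context receives only this small receipt, not the 5,000+ character report.
receipt = research("pricing survey")
```

The preview length is a judgment call: long enough for the parent to decide whether to dereference the handle, short enough that many receipts fit in context. Calling `store.get(receipt["handle"])` returns the full report when a later step needs it.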