David Silver, the principal architect behind AlphaGo, AlphaZero and MuZero, gave Wired an interview this week restating the core argument behind his new startup Ineffable Intelligence: large language models are not on the path to superintelligence. Silver left Google DeepMind earlier this year to launch Ineffable, and Sequoia led a $1B seed at roughly $4B pre-money to back the bet. The thesis traces directly to his "Era of Experience" paper from last year, co-authored with Rich Sutton: the Alberta School line that intelligence comes from agents learning by interacting with an environment and receiving reward signals, not from neural networks trained to predict the next token in human writing. Silver's specific claim in the Wired interview: "We want to go beyond what humans know, and to do that we're going to need a different type of method, and that type of method will require our AIs to actually figure things out for themselves."
The technical substance behind the headline is more precise than the framing. Silver is not saying LLMs do not work; he is saying they are upper-bounded by the distribution of human-generated text. AlphaGo's Move 37 and AlphaZero's chess novelties are the existence proof he leans on: an RL agent operating in an environment with a sharp reward signal can discover strategies that no human had written down, because the agent is not learning from humans, it is learning from the game. That is a real result, and it is meaningfully different from what next-token prediction does. The honest caveat is that AlphaGo and AlphaZero operated in domains with closed rules, perfect information and an unambiguous win/lose reward (Go, chess, shogi and, via MuZero, video games). Generalising the same approach to physical-world tasks, multi-step research, or open-ended problem solving has been an open research question for fifteen years and remains one. Silver's bet is that flexible reward functions grounded in real-world measurements, what the Era of Experience paper calls grounded reward (heart rate for a health agent, CO2 levels for a climate agent), close that gap. Whether they do is empirical and unresolved.
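To make the grounded-reward idea concrete, here is a minimal sketch of what such a signal could look like for the health-agent example. The sensor schema, field names and reward shaping are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class HealthReading:
    """Real-world measurements a health agent might observe.
    Field names are illustrative, not from the Era of Experience paper."""
    resting_heart_rate: float  # beats per minute
    sleep_hours: float
    steps: int

def grounded_reward(reading: HealthReading, target_rhr: float = 60.0) -> float:
    """A reward grounded in the environment: every term is a sensor
    measurement, not a human rating of the agent's output, so the agent
    can in principle discover behaviour no human wrote down. The shaping
    below is arbitrary; choosing it well is the open problem."""
    rhr_term = -abs(reading.resting_heart_rate - target_rhr)
    sleep_term = min(reading.sleep_hours, 8.0)            # cap the sleep bonus
    activity_term = min(reading.steps, 10_000) / 10_000   # normalise to [0, 1]
    return rhr_term + sleep_term + activity_term
```

The hard part is not writing this function; it is credit assignment over long horizons, since resting heart rate responds to weeks of agent behaviour rather than to any single action.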
For the builder audience, the LLM-versus-RL framing is mostly a false dichotomy that the press coverage cannot resist. Every frontier lab is already running the synthesis. RLHF is RL on an LLM. RL with verifiable rewards (the recipe behind o-series and Claude reasoning models) is RL on an LLM with a programmatic reward. Agentic systems with tool use and verifiers, the direction the entire industry has shifted toward over the last eighteen months, are RL on an LLM in an environment. The question is not whether RL or LLMs win; it is whether you need a language-pretrained backbone at all, or whether a sufficiently large RL agent can learn from raw experience without first absorbing the corpus of human writing. Silver's bet is that you do not need it. That is a much more aggressive claim than the Wired headline suggests, and it is genuinely contrarian: most of the field, including most ex-DeepMind alumni, thinks language pretraining is a useful prior for everything downstream. The intellectually honest version of Silver's position is: language pretraining is a shortcut that ceiling-caps you at human knowledge, and a system that can scale without it will eventually surpass one that cannot.
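For readers who have not seen the verifiable-rewards recipe up close, a minimal sketch of the programmatic reward at its core, assuming a math-style task where completions end with a LaTeX \boxed{} answer. Function names are illustrative; real lab pipelines wrap this in PPO- or GRPO-style policy optimisation:

```python
import re

def verifiable_reward(completion: str, expected_answer: str) -> float:
    r"""Programmatic reward: 1.0 if the completion's final \boxed{}
    answer matches the known ground truth, else 0.0. No human rater
    in the loop, which is what makes the reward 'verifiable'."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == expected_answer.strip() else 0.0

def score_batch(completions: list[str], answer: str) -> list[float]:
    """Score a batch of sampled completions; a policy-gradient update
    (elided here) would then push the model toward the high scorers."""
    return [verifiable_reward(c, answer) for c in completions]
```

The whole recipe hinges on that one comparison: wherever ground truth is checkable, the human rater drops out of the loop, which is exactly the condition Silver's argument turns on.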
The developer takeaway is to take the technical claim seriously while ignoring the marketing dichotomy. If you are building agents today, the practical bottleneck is not "LLM versus RL," it is reward design: in the domains where you can write a verifier, RL on top of an LLM works extraordinarily well and the recipe is converging across labs. In the domains where you cannot (most real-world business tasks, most research workflows), you fall back to RLHF or supervised imitation, which inherits the human-data ceiling Silver is flagging. So Silver is empirically right about where the wall is, even if he is wrong about whether you need to throw out the LLM backbone to get past it. The Ineffable Intelligence bet is worth watching for one specific reason: if the $1B buys a frontier-scale pure-RL agent that learns from raw experience and approaches LLM-like generality without language pretraining, that resets the architecture conversation. If it buys a domain-specific RL system that works well in a narrow vertical and never generalises, it confirms the synthesis view. Either outcome is informative; the next 18 to 24 months will tell us which.
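That decision tree, verifier if you can, preference model if you must, reduces to a few lines. The sketch below uses hypothetical names and a deliberately toy verifier to show where the human-data ceiling re-enters:

```python
from typing import Callable, Optional

def passes_tests(code: str) -> bool:
    """Toy verifier for a code-generation task: run the model's output
    and check a fixed assertion. Real systems sandbox this step; exec
    on untrusted model output is unsafe outside an isolated environment."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return namespace["add"](2, 3) == 5  # one known test case
    except Exception:
        return False

def pick_reward(
    verifier: Optional[Callable[[str], bool]],
    preference_model: Callable[[str], float],
) -> Callable[[str], float]:
    """Reward design in one branch: use a programmatic check when the
    task admits one (unit tests, exact answers, compiling output);
    otherwise fall back to a learned preference model, i.e. RLHF."""
    if verifier is not None:
        return lambda output: 1.0 if verifier(output) else 0.0
    return preference_model
```

Nothing in the fallback branch escapes human data: the preference model is trained on human comparisons, so its ceiling is exactly the one Silver is pointing at.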
