The emergence debate started with a 2022 paper from Google and collaborators working on BIG-Bench, a massive benchmark suite with over 200 tasks. They tested language models across a range of sizes and found something striking: on many tasks, performance was essentially flat (near random) for small and medium models, then jumped sharply once a model crossed a certain parameter threshold. The paper, "Emergent Abilities of Large Language Models" by Wei et al., plotted these curves, and the pattern was dramatic — abilities appeared to switch on like a light, not fade in gradually. The framing captured the imagination of the field. If models could acquire qualitatively new capabilities just by getting bigger, then scaling was not just an engineering challenge but a path to genuinely surprising intelligence.
The examples were compelling. GPT-3 (175 billion parameters) could do few-shot arithmetic that GPT-2 (1.5 billion) could not touch. Multi-step reasoning, where a model has to chain logical inferences together, appeared only in models above a certain size. Translation between language pairs the model was never explicitly trained on showed up at scale. Code generation — the ability to write working programs from natural language descriptions — went from useless to functional somewhere between 10 and 100 billion parameters. Word unscrambling, a task that seems to require some internal representation of spelling, jumped from 0% to near-perfect over a narrow parameter range. The pattern repeated across dozens of BIG-Bench tasks: flat, flat, flat, then sudden competence. This looked like evidence that scaling produced genuine phase transitions — qualitative shifts in what a model could do, not just quantitative improvements in how well it did familiar things.
In 2023, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo at Stanford published a direct challenge. Their argument was precise: emergence is not a property of the model but a property of the metric. The BIG-Bench tasks that showed sharp transitions mostly used discontinuous metrics — exact-match accuracy, where you get zero credit for almost-right answers. A model that improves gradually from 0.1% to 5% to 30% correct looks like it is doing nothing, nothing, nothing, then suddenly performing, because partial credit does not exist. When Schaeffer et al. re-evaluated the same models on the same tasks using continuous metrics like log-likelihood or token-level accuracy, the sharp transitions disappeared. Performance improved smoothly and predictably with scale. The "emergence" was an artifact of choosing metrics that could not detect gradual improvement. This was not a minor methodological quibble. If correct, it meant the most exciting narrative in AI — that bigger models spontaneously develop new capabilities — was partly a measurement illusion.
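The mechanics of this argument can be made concrete with a toy simulation. The sketch below uses entirely made-up numbers (the logistic curve and the 8-token answer length are illustrative assumptions, not fitted to any real model): if per-token accuracy improves smoothly with log parameter count, then exact-match accuracy on a multi-token answer — which requires every token to be correct — is roughly that probability raised to the answer length, and the resulting curve looks like a sudden jump even though nothing discontinuous happened underneath.

```python
import math

# Toy illustration only: assume per-token accuracy improves smoothly
# (logistically) in log-parameter space. The constants 10.5 and 2.5
# are arbitrary choices for the illustration, not empirical values.
def per_token_accuracy(n_params: float) -> float:
    return 1.0 / (1.0 + math.exp(-(math.log10(n_params) - 10.5) * 2.5))

def exact_match(n_params: float, answer_len: int = 8) -> float:
    # Exact-match gives zero credit unless all tokens are correct,
    # so (under independence) it behaves like p ** answer_len.
    # This compresses smooth per-token gains into an apparent cliff.
    return per_token_accuracy(n_params) ** answer_len

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params: "
          f"per-token {per_token_accuracy(n):.3f}, "
          f"exact-match {exact_match(n):.4f}")
```

Under these assumed numbers, per-token accuracy climbs steadily across every order of magnitude, while exact-match stays near zero until the last decade of scale and then shoots up — the "flat, flat, flat, then sudden competence" shape from the BIG-Bench plots, produced here by metric choice alone.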
The stakes of this debate extend well beyond academic interest. If emergence is real — if models genuinely acquire unpredicted capabilities at certain scales — then safety planning faces a fundamental problem: you cannot prepare for abilities you cannot foresee. A model that is harmless at 100 billion parameters might develop persuasion capabilities, deception strategies, or tool-use skills at 1 trillion parameters, with no warning in the scaling curve. This is the core argument for cautious, incremental scaling with extensive evaluation at each step. If emergence is primarily a measurement artifact, the picture is more reassuring: capabilities improve smoothly and predictably, so evaluations at smaller scales give you meaningful signal about what to expect from larger models. The safety implications of each interpretation are nearly opposite, which is why both sides of the debate are genuinely invested in getting the answer right.
The honest answer is that the field has not reached consensus. The Stanford critique is widely accepted as demonstrating that some reported emergent abilities were measurement artifacts — that part is not seriously disputed. But many researchers maintain that the critique does not explain everything. Certain capabilities, particularly those involving compositionality (combining learned skills in novel ways), planning, and multi-step reasoning, do appear to show genuine qualitative shifts that are not easily explained by metric choice alone. The practical upshot for labs making scaling decisions is a mixed message: you can probably predict next-step improvements more reliably than the original emergence papers suggested, but you should not assume all surprises have been explained away. The prudent approach — adopted by most frontier labs — is to evaluate extensively at every scale increase and maintain the infrastructure to pause if something unexpected appears. Whether you call the resulting surprises "emergence" or "predictable improvement that we failed to measure properly" matters less than whether you are prepared to handle them.