A small non-profit team led by Nick Levine, David Duvenaud (Toronto), and Alec Radford (formerly of OpenAI, where he architected the GPT lineage) released Talkie-1930 today: a 13B-parameter open-weight language model trained from scratch on 260 billion tokens of strictly pre-1931 English text. The corpus spans books, newspapers, periodicals, scientific journals, patents, and case law, all from public-domain sources, and assembling it required substantial transcription work because the team found that off-the-shelf OCR output delivered only 30% of the learning efficiency of human-transcribed text. Two checkpoints are public on HuggingFace under Apache 2.0: talkie-1930-13b-base for raw completions, and talkie-1930-13b-it, instruction-tuned via direct preference optimization with Claude Sonnet 4.6 as judge. The model needs at least a 28 GB GPU for local inference. The work has the structure of a research artifact rather than a competitive frontier model, but the research goal is unusually concrete: produce a base model whose knowledge cutoff is December 31, 1930, alongside a "modern twin" of identical 13B architecture trained on contemporary web data, so the team can run controlled experiments on what current language models actually learn versus what they memorise.
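For anyone who wants to poke at the checkpoints locally, here is a minimal sketch of loading the instruction-tuned model with the transformers library. The repository id is a guess at the naming on the Hub, not confirmed by the release, and bfloat16 weights account for most of the stated 28 GB requirement.

```python
# Minimal local-inference sketch with Hugging Face transformers.
# NOTE: the repo id below is a placeholder; check the team's actual org name on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "talkie/talkie-1930-13b-it"  # hypothetical org prefix

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # ~26 GB of weights for 13B params, hence the 28 GB figure
    device_map="auto",
)

prompt = "Describe the state of commercial aviation."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```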
The technically interesting parts are the data engineering and the contamination-control argument, both of which are useful even for builders who will never deploy this model. The anachronism-filtering pipeline is its own contribution: the team built a document-level, n-gram-based anachronism classifier to catch later-date material that had slipped into ostensibly pre-1931 sources, because once a single 1950s newspaper scan leaks into the training set the temporal bound is broken. The OCR quality finding is actionable in a way the industry has not emphasised enough: a 70% efficiency penalty for cheap OCR relative to hand transcription means that any team training on historical or scanned text with off-the-shelf OCR is leaving most of the learning signal on the table. The instruction-tuning detail is also clever: the instruction-tuning set was generated entirely from historical sources to preserve the temporal bound, with a modern model used only as the preference judge, which gives the model instruction-following behaviour without smuggling in modern factual knowledge.
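The classifier is not described beyond "document-level" and "n-gram-based," so the following is only a sketch of one plausible shape: collect n-grams that appear in a post-cutoff reference corpus but never in trusted pre-cutoff text, then reject any document whose density of those n-grams crosses a threshold. All helper names and thresholds are illustrative assumptions, not the team's pipeline.

```python
# Sketch of a document-level n-gram anachronism filter (illustrative, not the authors' code).
from collections import Counter

def ngrams(text: str, n: int = 3):
    tokens = text.lower().split()
    return zip(*(tokens[i:] for i in range(n)))

def build_anachronism_set(post_cutoff_docs, pre_cutoff_docs, n=3, min_count=5):
    """N-grams common in post-cutoff reference text but never seen in trusted pre-cutoff text."""
    post = Counter(g for doc in post_cutoff_docs for g in ngrams(doc, n))
    pre = set(g for doc in pre_cutoff_docs for g in ngrams(doc, n))
    return {g for g, c in post.items() if c >= min_count and g not in pre}

def is_anachronistic(doc, anachronism_set, n=3, threshold=0.002):
    """Flag a document when too many of its n-grams only exist in post-cutoff language."""
    grams = list(ngrams(doc, n))
    if not grams:
        return False
    hits = sum(1 for g in grams if g in anachronism_set)
    # A single leaked later-date scan breaks the temporal bound, so the threshold is strict.
    return hits / len(grams) > threshold
```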
The broader implication is that Talkie-1930 is a usable benchmark instrument for the contamination problem that has been the embarrassment of frontier-model evaluation since GPT-4. Every public benchmark gets scraped, indexed, and absorbed into the next training run, which makes frontier scores on those benchmarks increasingly meaningless. A model whose training data ends in 1930 cannot have memorised any post-1930 evaluation, so any task that touches material after that date measures pure generalisation. This is the same trick people have tried with carefully held-out test sets, but Talkie-1930 raises the bar to "anything in the last 96 years," which removes a much larger class of inadvertent leakage. The "modern twin" comparison is what makes this load-bearing: the result the authors specifically call out is parity on core language understanding once anachronistic questions are filtered out, which suggests that a meaningful portion of what frontier models appear to "learn" from contemporary data is in fact closer to memorisation. Whether that result holds up under independent replication is the question the next 30 days will answer, but the artifact itself is now public and reproducible.
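To make the twin comparison concrete, here is a rough sketch of the evaluation logic it implies: split an evaluation set at the cutoff date, score both models on each half, and read the gap. The field names, answer format, and scoring are illustrative assumptions, not the team's actual harness.

```python
# Sketch of the twin-comparison logic: parity pre-cutoff with a large twin advantage
# post-cutoff points toward memorised contemporary content rather than better
# core language understanding. Data layout here is hypothetical.
CUTOFF_YEAR = 1930

def split_by_cutoff(items):
    pre = [x for x in items if x["source_year"] <= CUTOFF_YEAR]
    post = [x for x in items if x["source_year"] > CUTOFF_YEAR]
    return pre, post

def accuracy(model_answers, items):
    return sum(model_answers[x["id"]] == x["answer"] for x in items) / len(items)

def generalisation_report(items, talkie_answers, twin_answers):
    pre, post = split_by_cutoff(items)
    return {
        "pre_cutoff":  {"talkie": accuracy(talkie_answers, pre),
                        "twin":   accuracy(twin_answers, pre)},
        "post_cutoff": {"talkie": accuracy(talkie_answers, post),
                        "twin":   accuracy(twin_answers, post)},
    }
```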
For builders, three concrete things matter. First, if you are running benchmark evaluations and want a contamination-resistant baseline, talkie-1930-13b-it is now the standard control group in the 13B class, and anyone publishing capability claims at that scale should compare against it. Second, the OCR quality lesson generalises: if your domain involves historical documents, scanned manuals, archival media, or any non-machine-readable corpus, the gap between cheap OCR and clean transcription matters far more than the per-token savings suggest. The right benchmark is not "does the OCR look readable" but "what is the perplexity-per-token cost relative to clean text," and Talkie-1930's number is 3.3x. Third, the methodological pattern of training a temporally bounded model plus a modern twin is replicable in other domains: a team building a medical or legal model could in principle train on pre-cutoff curated sources, hold out post-cutoff evaluation material, and use the gap to separate generalisation from memorisation. The Talkie-1930 work is small in compute relative to frontier training but large in methodological infrastructure, and the methodology is what will get reused.
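As a rough illustration of the measurement the second point recommends, here is a sketch of computing a perplexity-per-token ratio between an OCR'd passage and a clean transcription of the same passage under an off-the-shelf causal LM. The reference model and example strings are placeholders, not the team's setup.

```python
# Sketch: compare perplexity-per-token of OCR output vs a clean transcription
# of the same passage under a reference causal LM (any small checkpoint works).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model, tokenizer) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean negative log-likelihood per token
    return math.exp(loss.item())

tok = AutoTokenizer.from_pretrained("gpt2")      # stand-in reference model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# Same passage, hand transcription vs cheap OCR (toy examples).
clean_text = "The aeroplane departed the field at dawn, carrying the morning post."
ocr_text = "Tbe aerop1ane departecl the fie1d at dawn, carrving the rnorning post."

ratio = perplexity(ocr_text, lm, tok) / perplexity(clean_text, lm, tok)
print(f"OCR perplexity penalty: {ratio:.1f}x")   # the article cites roughly 3.3x for Talkie-1930's corpus
```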
