Andrej Karpathy quietly posted microgpt.py to GitHub Gists this week: a single Python file that trains and runs inference on a GPT using only `os`, `math`, and `random` from the standard library. No PyTorch, no NumPy, no TensorFlow, no GPU. The gist has crossed 5,000 stars and 2,400 forks within days. Karpathy's framing in the file docstring is uncharacteristically absolute: "The most atomic way to train and run inference for a GPT in pure, dependency-free Python. This file is the complete algorithm. Everything else is just efficiency." In an interview on the No Priors podcast he added the origin story: he built it entirely by hand because no LLM agent he tried could distill the essence of GPT training into a single clear file.
The architecture follows GPT-2 with three deliberate simplifications: RMSNorm instead of LayerNorm, no biases on any linear layer, and ReLU instead of GeLU. Defaults are tiny (1 layer, 16-dim embeddings, 4 attention heads of 4 dims each, block size 16), sized to train character-level on Karpathy's classic `names.txt` corpus from makemore. Training rests on a 40-line `Value` autograd class (essentially the micrograd pattern) with a topo-sort backward pass, plus an inline Adam optimizer with linear LR decay over 1,000 steps. Inference samples 20 hallucinated names. The whole thing fits on one screen if you have a wide-enough monitor. Forks have already produced a NumPy port that runs ~250× faster than the scalar autograd, a Julia matrix version, a JavaScript port that runs in the browser at ~4K parameters, and a microgpt-denovo project that demonstrates an agent can now rebuild this file from a high-level spec, closing the loop Karpathy said his agents couldn't close at the start.
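To make the "micrograd pattern" concrete, here is a minimal sketch of a scalar autograd node with a topo-sort backward pass. It is not the code from microgpt.py: the attribute names (`_children`, `_local_grads`) and the small set of supported operations are assumptions chosen for illustration, and the real `Value` class covers more operations.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back.

    Illustrative sketch only; attribute names are hypothetical, not Karpathy's.
    """

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Topologically sort the graph so each node is processed only after
        # every node that depends on it.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        # Chain rule: accumulate (local gradient * upstream gradient) into each child.
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad


# A one-neuron example: y = relu(x * w + b)
x, w, b = Value(2.0), Value(-3.0), Value(7.0)
y = (x * w + b).relu()
y.backward()
```

After `y.backward()`, `x.grad` holds dy/dx and `w.grad` holds dy/dw. A full GPT trained this way is just a much larger graph of the same kind of node, which is why the scalar version is slow and why a matrix-based port can be so much faster.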
The ecosystem signal here is pedagogy as moat. Most pretraining infrastructure today is layers of abstraction (PyTorch on CUDA on hardware-specific kernels) that hide the algorithm from the practitioner who depends on it. microgpt.py is the un-abstraction: it makes the transformer forward pass, the chain-rule backward, RMSNorm scaling, multi-head attention, KV-cache append, softmax with the max-subtract trick, and Adam moment buffers visible all at once. For anyone fine-tuning a Llama, debugging a training run, or writing custom CUDA kernels, that single screen is more useful than half the textbook chapters on transformers. The community fork pattern (benchmarks, alt-language ports, agent-reconstructions) is also a real-time experiment in whether educational LLM-related code is now a coordination layer in its own right.
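Three of the pieces named above are small enough to sketch on plain Python floats: softmax with the max-subtract trick, RMSNorm scaling, and an Adam update with moment buffers and linear LR decay. This is an illustrative sketch, not an excerpt from microgpt.py. The function names, signatures, and hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are assumptions, and the RMSNorm here has no learned gain.

```python
import math


def softmax(logits):
    # Subtracting the max leaves the result unchanged mathematically but keeps
    # every exponent <= 0, so math.exp cannot overflow on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rmsnorm(x, eps=1e-5):
    # Scale the vector so its root-mean-square is approximately 1.
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in x]


def adam_step(params, grads, m, v, step, num_steps,
              lr=1e-2, beta1=0.9, beta2=0.95, eps=1e-8):
    # Hyperparameter defaults are assumed values for illustration.
    # step is 0-indexed; the learning rate decays linearly toward zero.
    lr_t = lr * (1 - step / num_steps)
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # first-moment buffer
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # second-moment buffer
        m_hat = m[i] / (1 - beta1 ** (step + 1))   # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr_t * m_hat / (math.sqrt(v_hat) + eps)


probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
normed = rmsnorm([3.0, 4.0])

params, m_buf, v_buf = [1.0], [0.0], [0.0]
adam_step(params, [2.0], m_buf, v_buf, step=0, num_steps=1000)
```

The `m_buf` and `v_buf` lists are the "moment buffers": one running average of the gradient and one of its square per parameter, mutated in place on each step.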
For a Monday-morning builder: clone the gist, run `python microgpt.py`, and watch a transformer converge on character-level name generation in a few minutes on CPU. If you've ever shipped code against the OpenAI or Anthropic API without having traced a forward pass end-to-end, this is the cheapest possible way to fix that. If you train models, hand it to your junior engineers as a reading exercise: it removes the framework veil that makes "what is this loss doing?" hard to answer. The companion No Priors interview is where Karpathy also explained why he thinks current agent capability still tops out below the bar of writing this file from scratch unprompted, a useful capability marker to track.
