When a neural network trains, it is adjusting millions or billions of numbers organized into matrices of weights and biases. Each weight controls how strongly a signal flows from one neuron to the next; each bias shifts the activation threshold. These are the parameters. Training works through gradient descent — the model makes a prediction, measures how wrong it was (the loss), then nudges every parameter a tiny amount in the direction that would have made the prediction less wrong. Repeat this billions of times across trillions of tokens, and those parameters converge into something that can write poetry, debug code, or explain quantum mechanics. The parameters are not a lookup table or a database. They are a compressed, distributed, lossy representation of patterns in the training data, and no single parameter "knows" anything on its own.
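The predict, measure, nudge cycle described above can be sketched in a few lines. This is a toy linear model on made-up data, not a real network, but the loop is the same one that runs billions of times at scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one linear layer (a weight vector and a bias),
# fit to synthetic data with plain gradient descent.
X = rng.normal(size=(64, 3))          # 64 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1                  # targets generated by a known rule

w = np.zeros(3)                       # the parameters we will learn
b = 0.0
lr = 0.1                              # learning rate: how big each nudge is

for step in range(500):
    pred = X @ w + b                  # make a prediction
    err = pred - y
    loss = np.mean(err ** 2)          # measure how wrong it was
    grad_w = 2 * X.T @ err / len(y)   # direction that makes the loss worse...
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w                  # ...so step the opposite way
    b -= lr * grad_b
```

After 500 steps the learned `w` and `b` converge to the rule that generated the data. A real model does exactly this, except with billions of parameters, a far more complex forward pass, and gradients computed by backpropagation rather than by hand.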
The history of modern AI can be told in parameter counts. GPT-2 had 1.5 billion parameters in 2019, and OpenAI considered it capable enough to stage its release over misuse concerns. GPT-3 arrived in 2020 with 175 billion and rewrote the rules. Each jump in scale unlocked capabilities that smaller models simply could not match — few-shot learning, coherent long-form writing, basic reasoning — and labs raced to train ever-larger models. This was not just marketing. Scaling laws published by OpenAI and DeepMind showed a remarkably smooth relationship between parameter count, training data, compute budget, and model performance. More parameters, trained on more data, with more compute, meant predictably better results. The arms race was rational, at least for a while.
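The "remarkably smooth relationship" is a power law. The sketch below uses illustrative constants (assumptions, roughly the order of magnitude of the published fits), but it captures why the curves felt so predictable:

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law scaling sketch: loss falls smoothly as parameters grow.
    n_c and alpha are illustrative constants (assumptions), not the
    exact values from any published fit."""
    return (n_c / n_params) ** alpha

# On a power law, every doubling of parameters buys the same
# multiplicative improvement, regardless of where you start.
improvement = predicted_loss(1e9) / predicted_loss(2e9)   # ~5% lower loss
```

The practical upshot was that labs could forecast the loss of a model before spending millions training it, which is what made the scaling race look like a sound investment rather than a gamble.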
Not all parameters are equal, and not all of them fire on every input. Mixture-of-Experts (MoE) models like Mixtral and (reportedly) GPT-4 contain many billions of total parameters, but a routing network selects only a subset of "expert" sub-networks for each token. Mixtral 8x7B has roughly 47 billion total parameters but activates only about 13 billion per forward pass — giving you the quality of a much larger model at the inference cost of a smaller one. Meanwhile, the Chinchilla scaling research from DeepMind in 2022 upended the "bigger is always better" assumption entirely. They showed that most large models were undertrained: a smaller model trained on significantly more data could outperform a larger model trained on less. Chinchilla, at 70 billion parameters trained on 1.4 trillion tokens, beat the 280-billion-parameter Gopher. The lesson was that parameter count alone tells you very little without knowing how much data and compute went into training.
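The routing idea is simple enough to sketch. This is a minimal top-k MoE layer with made-up shapes, not Mixtral's actual implementation: a linear router scores every expert, and only the top two actually run for each token:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, router_w, k=2):
    """Minimal mixture-of-experts sketch (illustrative, not Mixtral's code).
    Scores all experts, runs only the top-k, and mixes their outputs."""
    logits = x @ router_w                        # one score per expert
    top = np.argsort(logits)[-k:]                # indices of the k winners
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                         # softmax over winners only
    # Only k expert matmuls execute; the other experts' parameters sit idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

d, n_experts = 16, 8
experts = rng.normal(size=(n_experts, d, d))     # 8 expert weight matrices
router_w = rng.normal(size=(d, n_experts))       # the routing network
token = rng.normal(size=d)
out = moe_forward(token, experts, router_w)      # pays for 2 of 8 experts
```

With 8 experts and top-2 routing, each token touches only a quarter of the expert parameters, which is the same trick that lets Mixtral carry 47B parameters while paying roughly 13B worth of compute per forward pass.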
Parameters have a direct, unavoidable cost in GPU memory. Each parameter stored in fp16 (16-bit floating point) or bf16 takes 2 bytes. A 7-billion-parameter model therefore needs roughly 14 GB of VRAM just to hold the weights — before you account for anything else. Quantize to int8 (8-bit integers) and that drops to 7 GB; go to 4-bit and you are down to about 3.5 GB. That is inference. Training is a different beast entirely, because you also need to store gradients (same size as the parameters), optimizer states (Adam keeps two extra values per parameter, its momentum and variance estimates, typically in fp32), and activations for backpropagation. A rough rule of thumb: training a model in mixed precision requires 4 to 6 bytes per parameter at minimum, and can reach 16 to 20 bytes per parameter with full optimizer state and no memory optimizations. This is why a 7B model that runs comfortably on a single consumer GPU for inference requires a cluster of datacenter GPUs for training.
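The arithmetic above is easy to reproduce. The per-format byte counts are standard; the 16-byte training figure assumes vanilla mixed-precision Adam with fp32 master weights and no memory optimizations:

```python
def vram_gb(n_params, bytes_per_param):
    """Back-of-envelope VRAM estimate: weights (or weights + training
    state) only. Ignores activations, KV cache, and framework overhead."""
    return n_params * bytes_per_param / 1e9

n = 7e9                             # a 7B-parameter model
inference_fp16 = vram_gb(n, 2)      # 14.0 GB: fp16/bf16 weights
inference_int8 = vram_gb(n, 1)      # 7.0 GB after 8-bit quantization
inference_4bit = vram_gb(n, 0.5)    # 3.5 GB at 4 bits per weight
# Mixed-precision Adam: fp16 weights (2) + fp16 gradients (2)
# + fp32 master weights, momentum, and variance (4 + 4 + 4) = 16 bytes/param.
training_adam = vram_gb(n, 16)      # 112 GB before activations
```

The 112 GB figure already exceeds any single consumer GPU, and activations push the real total higher still, which is the whole motivation behind sharded optimizers, gradient checkpointing, and 8-bit optimizer states.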
The industry has largely moved past the belief that stacking more parameters is the primary path to better models. The evidence piled up from multiple directions: Chinchilla proved data quantity mattered as much as model size, open-weights models like Llama 3 and Qwen 2.5 showed that careful data curation and longer training could make 70B models competitive with much larger ones, and architecture innovations like MoE, state-space models, and improved attention mechanisms delivered better performance per parameter than brute-force scaling. The frontier today is about training efficiency, data quality, and post-training techniques like RLHF and distillation — not just making the parameter counter go up. Parameter count still matters as a rough proxy for capacity, but it is increasingly the least interesting thing about a model.