Fundamentals

Parameters

Also known as: Weights, Model Parameters
The internal values a neural network learns during training: essentially the model's "knowledge" encoded as numbers. When someone says a model has "7 billion parameters," they mean 7 billion individual numeric values that were adjusted during training to capture the patterns in the data. More parameters generally means more capacity to learn complex patterns, but also more memory to store them and more compute to run the model.
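To make "a count of numeric values" concrete, here is a minimal sketch that tallies the parameters of a small fully connected network. The layer sizes are hypothetical, chosen only for illustration; each layer contributes a weight matrix plus a bias vector.

```python
# Count the parameters of a tiny multilayer perceptron (hypothetical sizes).
# Each fully connected layer has in_dim * out_dim weights plus out_dim biases.
layer_dims = [784, 256, 64, 10]  # input -> hidden -> hidden -> output

total = 0
for in_dim, out_dim in zip(layer_dims, layer_dims[1:]):
    total += in_dim * out_dim + out_dim  # weights + biases

print(total)  # 218058
```

A "7B" language model is the same idea scaled up: the sum of all its weight matrices and bias vectors comes to roughly seven billion numbers.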

Why It Matters

Parameter count is the most common shorthand for model size, and it directly determines how much GPU memory you need. A 7B model at 16-bit precision needs ~14 GB of VRAM just for the weights. Understanding parameters helps you estimate costs, choose hardware, and see why quantization (reducing the precision of each parameter) matters so much for making models accessible.

Deep Dive

When a neural network trains, it is adjusting millions or billions of numbers organized into matrices of weights and biases. Each weight controls how strongly a signal flows from one neuron to the next; each bias shifts the activation threshold. These are the parameters. Training works through gradient descent — the model makes a prediction, measures how wrong it was (the loss), then nudges every parameter a tiny amount in the direction that would have made the prediction less wrong. Repeat this billions of times across trillions of tokens, and those parameters converge into something that can write poetry, debug code, or explain quantum mechanics. The parameters are not a lookup table or a database. They are a compressed, distributed, lossy representation of patterns in the training data, and no single parameter "knows" anything on its own.
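The "nudge every parameter a tiny amount" loop can be sketched in a few lines. This toy example fits a single parameter with gradient descent on a mean-squared-error loss; the data and learning rate are made up for illustration, and a real network does the same thing simultaneously across billions of parameters.

```python
# Minimal gradient descent on a single parameter (a toy sketch, not a real
# network): fit y = w * x to data generated with the true w = 3.0.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0    # the "parameter", starting from an arbitrary value
lr = 0.01  # learning rate: the size of each nudge

for step in range(1000):
    # Gradient of the mean squared error loss with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # nudge the parameter in the direction that reduces loss

print(round(w, 4))  # converges to 3.0
```

Each iteration is one predict-measure-nudge cycle; training an LLM repeats this same pattern over trillions of tokens.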

The Arms Race

The history of modern AI can be told in parameter counts. GPT-2 had 1.5 billion parameters in 2019 and people thought it was dangerously capable. GPT-3 arrived in 2020 with 175 billion and rewrote the rules. Each jump in scale unlocked capabilities that smaller models simply could not match — few-shot learning, coherent long-form writing, basic reasoning — and labs raced to train ever-larger models. This was not just marketing. Scaling laws published by OpenAI and DeepMind showed a remarkably smooth relationship between parameter count, training data, compute budget, and model performance. More parameters, trained on more data, with more compute, meant predictably better results. The arms race was rational, at least for a while.

Total Parameters vs. Active Parameters

Not all parameters are equal, and not all of them fire on every input. Mixture-of-Experts (MoE) models like Mixtral and (reportedly) GPT-4 contain many billions of total parameters, but a routing network selects only a subset of "expert" sub-networks for each token. Mixtral 8x7B has roughly 47 billion total parameters but activates only about 13 billion per forward pass — giving you the quality of a much larger model at the inference cost of a smaller one. Meanwhile, the Chinchilla scaling research from DeepMind in 2022 upended the "bigger is always better" assumption entirely. They showed that most large models were undertrained: a smaller model trained on significantly more data could outperform a larger model trained on less. Chinchilla, at 70 billion parameters trained on 1.4 trillion tokens, beat the 280-billion-parameter Gopher. The lesson was that parameter count alone tells you very little without knowing how much data and compute went into training.
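The total-vs-active split is simple arithmetic once you separate shared layers from expert layers. The figures below are rough public approximations for a Mixtral-style model (the shared-parameter estimate in particular is an assumption for illustration), not exact architecture numbers.

```python
# Back-of-the-envelope active-parameter math for a Mixtral-style MoE.
# All figures are approximate; the shared-layer size is an assumption.
total_params = 47e9    # ~47B total parameters
shared_params = 1.3e9  # attention + embeddings used by every token (approx.)
num_experts = 8
experts_per_token = 2  # the router picks 2 of the 8 experts for each token

expert_params = (total_params - shared_params) / num_experts
active = shared_params + experts_per_token * expert_params
print(f"{active / 1e9:.1f}B active per token")  # → 12.7B, near the reported ~13B
```

The point is that inference cost scales with the active parameters per token, while memory cost scales with the total, since all experts must be resident.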

The VRAM Math

Parameters have a direct, unavoidable cost in GPU memory. Each parameter stored in fp16 (16-bit floating point) or bf16 takes 2 bytes. A 7-billion-parameter model therefore needs roughly 14 GB of VRAM just to hold the weights — before you account for anything else. Quantize to int8 (8-bit integers) and that drops to 7 GB; go to 4-bit and you are down to about 3.5 GB. That is inference. Training is a different beast entirely, because you also need to store gradients (same size as the parameters), optimizer states (often 2x the parameter size for Adam), and activations for backpropagation. A rough rule of thumb: training a model in mixed precision requires 4 to 6 bytes per parameter at minimum, and can reach 16 to 20 bytes per parameter with full optimizer state and no memory optimizations. This is why a 7B model that runs comfortably on a single consumer GPU for inference requires a cluster of datacenter GPUs for training.
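The arithmetic above fits in a short estimator. This is a rule-of-thumb sketch using the bytes-per-parameter figures from the text; real deployments add KV cache, activations, and framework overhead on top.

```python
# Rough VRAM estimates for a 7B-parameter model, weights/state only.
# Real usage adds KV cache, activations, and framework overhead.
params = 7e9

def gb(n_bytes):
    """Convert a byte count to gigabytes (decimal, 1 GB = 1e9 bytes)."""
    return n_bytes / 1e9

scenarios = [
    ("fp16 inference", 2),                  # 2 bytes per parameter
    ("int8 inference", 1),
    ("4-bit inference", 0.5),
    ("mixed-precision training, low end", 4),
    ("training with full Adam state", 16),  # weights + grads + optimizer
]
for name, bytes_per_param in scenarios:
    print(f"{name}: ~{gb(params * bytes_per_param):.1f} GB")
```

Running this reproduces the numbers in the text: ~14 GB for fp16 inference, ~3.5 GB at 4-bit, and over 100 GB for unoptimized full-state training, which is why training moves to GPU clusters.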

Beyond Raw Count

The industry has largely moved past the belief that stacking more parameters is the primary path to better models. The evidence piled up from multiple directions: Chinchilla proved data quantity mattered as much as model size, open-weights models like Llama 3 and Qwen 2.5 showed that careful data curation and longer training could make 70B models competitive with much larger ones, and architecture innovations like MoE, state-space models, and improved attention mechanisms delivered better performance per parameter than brute-force scaling. The frontier today is about training efficiency, data quality, and post-training techniques like RLHF and distillation — not just making the parameter counter go up. Parameter count still matters as a rough proxy for capacity, but it is increasingly the least interesting thing about a model.
