
Weights

Also known as: Model Weights, Neural Network Weights

The numbers inside a neural network that are adjusted during training to minimize error. Every connection between neurons carries a weight that determines how much influence one neuron has on the next. When you download a model file (a .safetensors, .gguf, or .pt file), what you are downloading is its weights. "Releasing the weights" means publishing these files so that anyone can run the model. The weights are the model itself; everything else is just the architecture telling you how to arrange them.

Why It Matters

The distinction matters when the AI industry talks about "open weights" versus "open source." With the weights alone, you can run and fine-tune a model, but without the training code, data, and recipe, you cannot reproduce it from scratch. Understanding weights helps you make sense of model distribution, quantization (reducing weight precision), and why a 7B model in fp16 needs about 14 GB of disk space.

Deep Dive

A weight is a floating-point number. That is it. Every connection between two neurons in a network carries one of these numbers, and a modern large language model has billions of them — arranged in enormous matrices, one per layer. Before training begins, these matrices are filled with values that look essentially random (more on initialization in a moment). Then the network sees data, computes how wrong its predictions are via a loss function, and backpropagation flows the gradient of that error backward through every layer, nudging each weight a tiny amount in the direction that would have made the prediction less wrong. Repeat this a few billion times across terabytes of text and you get a model that can write poetry, explain quantum mechanics, or debug your code. The weights are where all of that learned capability lives. There is no separate knowledge store, no database of facts — just matrices of numbers that, through sheer statistical pressure, have organized themselves into something that looks a lot like understanding.
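The nudge-a-weight-against-the-gradient loop above can be sketched at its smallest possible scale: one weight, one input, squared-error loss. The function name and numbers here are purely illustrative.

```python
# Minimal sketch: a single "neuron" computes y = w * x, and we train w so
# that x = 2.0 maps to target 6.0 (so the ideal weight is 3.0).
def train_single_weight(x, target, lr=0.1, steps=100):
    w = 0.5  # arbitrary starting value; real networks use random init
    for _ in range(steps):
        pred = w * x                     # forward pass
        grad = 2 * (pred - target) * x   # d(loss)/dw for squared-error loss
        w -= lr * grad                   # nudge the weight downhill
    return w

w = train_single_weight(x=2.0, target=6.0)
print(round(w, 4))  # converges toward 3.0
```

A real model runs this same update simultaneously over billions of weights, with the gradients computed by backpropagation instead of by hand.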

Formats and Precision

The format you store those numbers in matters more than you might expect. Full-precision weights use fp32 — 32-bit floating point — which gives you roughly 7 decimal digits of precision and a huge dynamic range. That is what researchers used for years, and it is still the gold standard for numerical stability. But fp32 is expensive: a 7-billion-parameter model in fp32 eats 28 GB just for the weights, before you even think about optimizer states or activations. Half-precision fp16 cuts that in half, but its limited exponent range makes it prone to overflow and underflow during training. Enter bf16 — bfloat16 — which keeps fp32's 8-bit exponent range but truncates the mantissa to 7 bits, for 16 bits total. Google developed it specifically for deep learning, and it has become the de facto standard for training because it rarely blows up numerically while using half the memory of fp32. For inference, you can go further: int8 quantization packs weights into 8-bit integers (one quarter the size of fp32) with surprisingly little quality loss, and int4 — pioneered by the GPTQ and AWQ methods — halves that again. A 70B model that would need 140 GB in fp16 fits in about 35 GB at 4-bit precision, which is why quantization is the reason you can run serious models on consumer GPUs at all.
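All of the sizes quoted above fall out of one multiplication: parameter count times bytes per weight. A quick back-of-the-envelope calculator (the function name is ours, not from any library):

```python
# Approximate on-disk size of the weights alone, ignoring optimizer
# states, activations, and file-format overhead.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_size_gb(n_params, dtype):
    return n_params * BYTES_PER_WEIGHT[dtype] / 1e9

print(weight_size_gb(7e9, "fp32"))    # 28.0
print(weight_size_gb(7e9, "fp16"))    # 14.0
print(weight_size_gb(70e9, "int4"))   # 35.0
```

Real quantized files run slightly larger because 4-bit formats also store per-group scale factors, but the first-order arithmetic is exactly this.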

Model File Formats

When you download a model, the file format determines how those weight matrices are serialized to disk. For years, the default was PyTorch's .bin format, which is just Python's pickle serialization applied to tensors. It works, but pickle has a well-known security problem: a malicious .bin file can execute arbitrary code when you load it. Hugging Face created safetensors specifically to fix this — it is a simple, memory-mapped format that contains only tensor data and metadata, with no code execution possible. Safetensors also loads faster because it supports lazy loading and zero-copy reads. It has become the standard for distributing models on Hugging Face and beyond. Then there is GGUF, which is the format used by llama.cpp and the broader local-inference ecosystem. GGUF bundles weights, tokenizer configuration, and model metadata into a single self-contained file, often with built-in quantization. If you are running a model locally on your laptop or a consumer GPU, you are almost certainly using a GGUF file. The short version: safetensors for distribution and fine-tuning, GGUF for local inference, and .bin only when you encounter legacy checkpoints.
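The reason safetensors cannot execute code is visible in its layout: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then raw tensor bytes. Here is a stdlib-only sketch of that layout for a single fp32 tensor; real code should use the `safetensors` library, and this omits details like the optional `__metadata__` entry and header padding.

```python
import json, struct

def write_minimal_safetensors(path, name, values):
    data = struct.pack(f"<{len(values)}f", *values)  # fp32, little-endian
    header = {name: {"dtype": "F32", "shape": [len(values)],
                     "data_offsets": [0, len(data)]}}
    hbytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(hbytes)))  # u64 header length
        f.write(hbytes)                          # JSON metadata only
        f.write(data)                            # raw bytes, no code

def read_minimal_safetensors(path):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hlen))
        buf = f.read()
    out = {}
    for name, meta in header.items():
        lo, hi = meta["data_offsets"]  # offsets into the byte buffer
        out[name] = list(struct.unpack(f"<{(hi - lo) // 4}f", buf[lo:hi]))
    return out

write_minimal_safetensors("demo.safetensors", "layer0.weight", [0.5, -1.25, 3.0])
print(read_minimal_safetensors("demo.safetensors"))
```

Contrast this with pickle-based .bin files, where loading means executing a small program embedded in the file.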

Why Initialization Matters

Before training even starts, the values you put into those weight matrices shape everything that follows. Initialize them all to zero and the network cannot learn — every neuron in a layer computes the same thing, so gradients are identical and symmetry never breaks. Initialize them too large and activations explode; too small and gradients vanish to zero before reaching the early layers. Xavier initialization (2010) solved this for sigmoid and tanh networks by scaling initial weights based on fan-in and fan-out — the number of connections coming in and going out of each layer. Kaiming initialization (2015, from the He et al. paper) adapted the idea for ReLU activations, which behave differently because they zero out half their inputs. Modern Transformers typically use variants of these, sometimes with additional scaling factors tuned for attention layers. There is also the lottery ticket hypothesis (Frankle & Carbin, 2019), which showed that within a randomly initialized network, there exist small subnetworks — "winning tickets" — that can be trained in isolation to match the full network's performance. The implication is striking: most of those billions of initial weights might be unnecessary, and the right sparse initialization could theoretically give you the same model at a fraction of the size. In practice, reliably finding those winning tickets remains expensive, but the idea has shaped how researchers think about pruning and efficient architectures.
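The fan-in/fan-out scaling rules reduce to two small formulas: the Xavier (Glorot) normal variant targets variance 2 / (fan_in + fan_out), and Kaiming targets 2 / fan_in for ReLU layers. A pure-Python sketch (helper names are ours; frameworks expose these as built-in initializers):

```python
import math, random

def xavier_std(fan_in, fan_out):
    # Glorot normal: variance 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in):
    # He normal for ReLU: variance 2 / fan_in
    return math.sqrt(2.0 / fan_in)

def init_matrix(rows, cols, std, seed=0):
    # rows = fan_out, cols = fan_in for a dense layer
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

W = init_matrix(4, 1024, kaiming_std(fan_in=1024))
print(round(kaiming_std(1024), 4))  # 0.0442
```

The wider the layer, the smaller the initial weights, which is exactly what keeps activation variance roughly constant as signals flow through many layers.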

Weights, Parameters, and "The Model"

People use "weights" and "parameters" almost interchangeably, and for most purposes that is fine — but technically, parameters include biases (a small constant added after the weighted sum at each neuron) and any other learned values like layer normalization scales. In a typical Transformer, biases account for a tiny fraction of total parameters, so when someone says a model has 70 billion parameters, they effectively mean 70 billion weights. The deeper point is that when you download a model's weight file, you are downloading everything the model learned. The architecture — how many layers, how wide, what activation functions — is just a blueprint. The weights are the building itself. Two models with identical architectures but different weights will behave completely differently if they were trained on different data or for different durations. This is why "releasing the weights" is such a significant act: you are not sharing a design, you are sharing the accumulated result of millions of dollars of compute and months of training. The knowledge is in the numbers.
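The "biases are a tiny fraction" claim is easy to verify for a single dense layer, where the weight matrix holds fan_in × fan_out values plus fan_out bias terms. A small sketch (function name is illustrative):

```python
# Count weights vs. biases in one fully connected layer.
def dense_layer_params(fan_in, fan_out, bias=True):
    weights = fan_in * fan_out        # one weight per connection
    biases = fan_out if bias else 0   # one bias per output neuron
    return weights, biases

w, b = dense_layer_params(4096, 4096)  # a typical Transformer-scale layer
print(w, b)  # 16777216 weights vs. 4096 biases
```

At this width, biases are under 0.03% of the layer's parameters, which is why parameter counts and weight counts are used interchangeably in practice.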
