A model is three things fused together: an architecture, a set of parameters, and the ghost of its training data. The architecture is the blueprint — it defines how information flows through the system. A Transformer processes text through layers of attention mechanisms. A diffusion model iteratively denoises random noise into images. A Mamba model uses selective state spaces to process sequences without attention at all. The architecture determines what kind of input the model can handle and what kind of output it can produce, but on its own it does nothing. It is a blank structure with no knowledge.
Parameters are the knowledge. During training, the model adjusts millions or billions of numerical weights until it can predict its training data well. These weights encode everything the model "knows" — grammar, facts, reasoning patterns, style, biases. When people say a model has 70 billion parameters, they mean 70 billion learned numbers that collectively represent whatever patterns the model extracted from its training corpus. The parameters are the model in the most concrete sense: they are the file you download, the thing that gets loaded into GPU memory, the artifact that turns architecture into capability.
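The split between architecture and parameters can be made concrete with a toy sketch (a hypothetical two-weight perceptron, not any real model): the function body is the architecture, and the list of numbers passed in is the parameters. The same structure behaves completely differently depending on which numbers you load into it.

```python
import random

def tiny_mlp(x, params):
    """The 'architecture': a fixed computation graph (here, a one-hidden-unit
    perceptron on a scalar). On its own it computes nothing useful -- the
    behavior lives entirely in `params`."""
    w1, b1, w2, b2 = params
    h = max(0.0, w1 * x + b1)   # hidden unit with ReLU activation
    return w2 * h + b2

# Same architecture, two parameter sets -> two different "models".
random_params = [random.uniform(-1, 1) for _ in range(4)]  # untrained: noise
trained_params = [1.0, 0.0, 2.0, 0.5]  # pretend these came from training
```

A real checkpoint is exactly `trained_params` scaled up to billions of numbers: downloading a model means downloading that list, and loading it means binding those numbers back to the architecture's slots.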
When you download a model, you are downloading those parameters serialized into a file. The format matters more than you might expect. PyTorch .pt or .bin files are the native format for models trained in PyTorch — they use Python's pickle serialization, which means they can technically contain arbitrary code. This is a real security concern if you download models from untrusted sources. Safetensors, developed by Hugging Face, solves this by storing only the raw tensor data in a format that cannot execute code. It is also faster to load because it supports memory-mapped access. Most model repositories have moved to safetensors as the default.
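The pickle risk is easy to demonstrate with a few lines of standard-library Python. This is a deliberately harmless sketch -- the payload just evaluates `1 + 1` -- but an attacker's `__reduce__` could name `os.system` or any other callable, and it would run the moment the file is loaded:

```python
import pickle

class NotAModel:
    """Stand-in for a malicious object embedded in a pickled .pt/.bin file."""
    def __reduce__(self):
        # pickle records this callable and its args, and calls them at
        # LOAD time; a real attack would use os.system instead of eval
        return (eval, ("1 + 1",))

blob = pickle.dumps(NotAModel())   # what a poisoned checkpoint contains
result = pickle.loads(blob)        # "loading the model" runs the payload: 2
```

Safetensors closes this hole by design: the file is just a JSON header describing tensor names, dtypes, and shapes, followed by raw tensor bytes, with no mechanism for naming a callable at all.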
GGUF is a different beast entirely. Developed by the llama.cpp community, GGUF is designed for CPU and mixed CPU/GPU inference on consumer hardware. It packages the model weights along with metadata about quantization, tokenizer configuration, and architecture details into a single self-contained file. If you see someone running a 70B model on a MacBook, they are almost certainly using a GGUF file that has been quantized down to 4-bit or 5-bit precision. ONNX (Open Neural Network Exchange) takes yet another approach — it is an interoperability format designed to let you train a model in one framework and run it in another, often with hardware-specific optimizations applied by the runtime.
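Back-of-the-envelope arithmetic shows why quantization is what makes the MacBook scenario possible. The figures below are approximations that count only the weights (no KV cache or activations), and the 4.5 bits/weight figure is an assumption reflecting that 4-bit GGUF quantization schemes carry some per-block scale overhead:

```python
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model at different precisions.
full_fp16 = weight_storage_gb(70e9, 16)    # 140.0 GB -- server-GPU territory
gguf_q4   = weight_storage_gb(70e9, 4.5)   # ~39 GB  -- fits in 64 GB of RAM
```

The same arithmetic explains the download sizes you see on model hubs: a "Q4" GGUF of a 70B model is roughly a quarter the size of the fp16 safetensors release.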
Models go through a lifecycle that most users never see. Pre-training is the expensive part: a foundation model is trained on massive amounts of data (often trillions of tokens for large language models) at costs ranging from hundreds of thousands to hundreds of millions of dollars. This produces a base model that can predict text but is not particularly useful for conversation. Fine-tuning adapts the base model for specific tasks — instruction following, code generation, medical diagnosis — using much smaller, curated datasets. RLHF or similar alignment techniques make the model's outputs more helpful and less harmful. Quantization reduces the numerical precision of the weights from 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower, trading a small amount of quality for dramatic reductions in memory and compute requirements. Deployment puts the model behind an API or loads it onto a device. Serving handles the actual inference requests at scale.
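The core idea behind quantization can be sketched in a few lines. This is a minimal symmetric round-to-nearest scheme, far simpler than what production quantizers do (which work per-block or per-channel), but it shows the essential trade: each weight shrinks from a float to a small integer plus one shared scale factor, at the cost of a bounded rounding error.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats onto int8 values sharing one scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats; the error per weight is at most scale/2."""
    return [v * scale for v in q]

original = [0.82, -0.41, 0.05, -1.3]
q, scale = quantize_int8(original)    # four small ints + one float
restored = dequantize(q, scale)       # close to, but not exactly, the input
```

Each stored value now needs 8 bits instead of 32, and the reconstruction error is bounded by half the scale -- which is why quality degrades only slightly at 8-bit and becomes noticeable as the bit budget drops further.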
The distinction between open and closed models is murkier than it sounds. When Meta "releases" Llama, they publish the model weights — you can download the parameters and run the model on your own hardware. But they do not release the training data or the full training code. Mistral does something similar. These are more accurately called "open-weight" models. Truly open-source models would include weights, training data, training code, and evaluation pipelines — a standard almost nobody meets. On the other side, closed models like GPT-4 and Claude are only available through APIs. You never see the weights, you cannot modify the model, and you are subject to the provider's terms of service. The practical difference is enormous: open-weight models give you control, privacy, and the ability to fine-tune, but you pay for compute and take on operational complexity. Closed models give you convenience and often better performance, but you are renting access to someone else's system.
Benchmarks are the standard way models are compared, and they are deeply unreliable. A model that scores highest on MMLU (a multiple-choice knowledge test) might struggle with your specific task. Benchmark contamination — where test data leaks into training data — is rampant and hard to detect. Chatbot Arena, which ranks models based on blind human preference votes, is more trustworthy but still reflects general conversational quality rather than domain-specific performance. The only reliable way to choose a model is to test candidates on your actual workload. Write ten representative prompts, run them through three or four models, and compare the outputs. That one-hour investment will tell you more than any leaderboard.
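The test-on-your-own-workload advice amounts to a very small harness. The sketch below uses stand-in lambdas where real calls would go; `compare_models` and the model labels are hypothetical names, and in practice each callable would wrap an actual API client or a local inference server:

```python
def compare_models(prompts, models):
    """Run every prompt through every candidate and collect outputs
    side by side. `models` maps a label to a callable: prompt -> text."""
    return {
        prompt: {name: generate(prompt) for name, generate in models.items()}
        for prompt in prompts
    }

# Stand-ins for real API or local-inference calls.
prompts = ["Summarize this support ticket: ...", "Extract the due date from: ..."]
models = {
    "candidate-a": lambda p: f"[candidate-a] {p}",
    "candidate-b": lambda p: f"[candidate-b] {p}",
}
table = compare_models(prompts, models)
```

Reading the resulting table row by row -- same prompt, different models -- is the one-hour comparison the paragraph describes, and it surfaces domain-specific failures that no leaderboard aggregate will.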