Zubnet AILearnWiki › VRAM
Infrastructure

VRAM

Also known as: Video RAM, GPU Memory
The memory on a GPU, separate from system RAM. AI models must fit in VRAM to run on a GPU. A 7B-parameter model in 16-bit precision needs ~14GB of VRAM just to hold its weights (7 billion parameters × 2 bytes each), before counting activations and KV cache. Consumer GPUs have 8-24GB; datacenter GPUs (A100, H100) have 40-80GB. VRAM is almost always the bottleneck for local AI.
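The weights-only arithmetic above (parameter count × bytes per parameter) can be sketched as a small helper. The function name and signature are illustrative, not from any library; real usage needs extra headroom for activations and KV cache, which this sketch deliberately ignores.

```python
def weights_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough GB of VRAM needed just to hold a model's weights.

    Ignores activation memory and KV cache, so treat the result
    as a lower bound, not a full requirement.
    """
    bytes_per_param = bits_per_param / 8
    # billions of params * bytes each = gigabytes of weights
    return params_billion * bytes_per_param

print(weights_vram_gb(7, 16))  # 14.0 -> the ~14GB figure above
print(weights_vram_gb(7, 4))   # 3.5  -> why 4-bit quantization fits consumer GPUs
```

Plugging in 4 bits per parameter shows why quantization matters: the same 7B model drops from ~14GB to ~3.5GB of weights.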

Why it matters

VRAM determines which models you can run. It's why quantization exists (to shrink models to fit), why MoE models are tricky (all experts must fit in VRAM), and why GPU prices scale so steeply with memory. "Will it fit in VRAM?" is the first question of self-hosting AI.
