Optimization in AI is really three separate disciplines that happen to share a name. Training optimization is about making the learning process faster and cheaper. Inference optimization is about making the trained model respond faster and use less hardware. Serving optimization is about handling many concurrent users efficiently. Most people conflate these because the techniques sometimes overlap, but the goals and constraints are different. A training optimization like gradient checkpointing trades compute time for memory — you recompute activations during the backward pass instead of storing them. That makes sense when you are GPU-memory-bound during a multi-day training run. It would make no sense during inference, where there is no backward pass. Understanding which phase you are optimizing for, and what resource you are trading for what, is the foundation of making good decisions here.
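The compute-for-memory trade in gradient checkpointing can be made concrete with a toy sketch. This is not a real autograd implementation (frameworks like PyTorch provide one); the layer chain and the `every`-th checkpoint policy are illustrative assumptions, and the point is simply to count what gets stored versus what gets recomputed:

```python
# Toy illustration of the gradient-checkpointing trade-off: store only
# every k-th activation during the forward pass, and rebuild the missing
# ones from the nearest checkpoint when the backward pass needs them.
# Hypothetical toy "layers"; not a real autograd implementation.

def forward_store_all(x, layers):
    """Baseline: keep every intermediate activation (memory-heavy)."""
    acts = [x]
    for f in layers:
        x = f(x)
        acts.append(x)
    return x, acts  # len(layers) + 1 activations held until backward

def forward_checkpointed(x, layers, every=4):
    """Keep only every `every`-th activation; recompute the gaps later."""
    ckpts = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % every == 0:
            ckpts[i + 1] = x
    return x, ckpts

def recompute_activation(ckpts, layers, target):
    """During backward, rebuild a missing activation from a checkpoint."""
    start = max(i for i in ckpts if i <= target)
    x, extra_steps = ckpts[start], 0
    for i in range(start, target):
        x = layers[i](x)      # recomputation: extra compute, no extra storage
        extra_steps += 1
    return x, extra_steps

layers = [lambda v, k=k: v + k for k in range(16)]   # 16 dummy "layers"
_, acts = forward_store_all(0, layers)
_, ckpts = forward_checkpointed(0, layers, every=4)
a, steps = recompute_activation(ckpts, layers, target=6)
print(len(acts), len(ckpts), a, steps)
```

With 16 layers, the baseline holds 17 activations while the checkpointed version holds 5, at the cost of a few recompute steps per backward segment; that is exactly the trade described above, just in miniature.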
If you could only learn one optimization technique, quantization would be the one to pick. The idea is simple: models are trained in high-precision floating point (typically bfloat16, which uses 16 bits per parameter), but they can run in much lower precision without catastrophic quality loss. A 14-billion-parameter model in bfloat16 takes about 28 GB of VRAM. Quantize it to 4-bit (Q4_K_M in llama.cpp's notation) and it fits in under 9 GB — suddenly it runs on a single consumer GPU. The quality trade-off exists but is smaller than you would expect. Modern quantization methods like GPTQ and AWQ are calibrated against real data so the most important weights keep higher precision (GGUF, often mentioned alongside them, is llama.cpp's file format for packaging quantized weights rather than a quantization method itself). In informal blind comparisons, most users cannot tell a full-precision model from its 4-bit quantized version on everyday tasks. The gap shows up on edge cases — complex reasoning chains, niche factual knowledge, multilingual tasks — but for most production use cases, quantization is close to free performance.
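The core mechanism is easy to see in miniature: map each small group of weights to 4-bit integers plus one floating-point scale per group. This is a minimal sketch of symmetric group-wise quantization, not any production scheme (GPTQ, AWQ, and llama.cpp's k-quants are far more sophisticated); the weight values and group size are made up for illustration:

```python
# Minimal sketch of symmetric 4-bit quantization with per-group scales:
# each group of weights becomes integers in [-8, 7] plus one float scale.
# Toy values; real quantizers also calibrate which weights matter most.

def quantize_4bit(weights, group_size=8):
    groups = []
    for g in range(0, len(weights), group_size):
        chunk = weights[g:g + group_size]
        scale = max(abs(w) for w in chunk) / 7 or 1.0  # map peak weight to 7
        ints = [max(-8, min(7, round(w / scale))) for w in chunk]
        groups.append((scale, ints))
    return groups

def dequantize(groups):
    return [i * scale for scale, ints in groups for i in ints]

weights = [0.31, -0.12, 0.05, 0.44, -0.27, 0.09, -0.38, 0.21]
packed = quantize_4bit(weights)
restored = dequantize(packed)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err)  # small per-weight reconstruction error
```

Each 16-bit weight shrinks to 4 bits plus an amortized share of one scale, which is where the roughly 4x memory reduction in the paragraph above comes from; the reconstruction error is the quality cost being traded away.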
Beyond quantization, the biggest inference speedups come from how you manage requests rather than how you shrink the model. Continuous batching — the approach used by vLLM and TensorRT-LLM — lets the server admit and retire requests at token granularity, filling GPU cycles that would otherwise sit idle while one request waits for its next token. KV-cache optimization (like PagedAttention) attacks the memory cost of the key-value cache, which grows linearly with sequence length: by allocating the cache in fixed-size pages instead of reserving a contiguous maximum-length buffer per request, it eliminates most of the fragmentation and over-allocation waste, which is critical for long-context applications. Speculative decoding uses a small, fast "draft" model to generate several candidate tokens, then the large model verifies them in a single forward pass — if the draft model guesses right (which it often does for predictable text), you get multiple tokens for the cost of one large-model call. These techniques compound. A well-tuned serving stack combining continuous batching, quantization, and speculative decoding can serve the same model at five to ten times the throughput of a naive implementation.
For most teams, optimization is ultimately about cost per query. Running a 70B model on a cluster of A100s costs serious money — roughly $8–$15 per GPU per hour at cloud prices. Optimization determines whether that cluster handles 50 requests per second or 500. Distillation is another lever: you train a smaller "student" model to mimic the outputs of a larger "teacher" model on your specific task. A distilled 8B model that matches 90% of the 70B model's quality on your particular use case costs a fraction as much to run. The practical workflow many teams follow is: prototype with the largest model available via API, measure the quality level you actually need, then work backward — quantize, distill, or switch to a smaller model until you hit the sweet spot where quality is acceptable and cost is sustainable. Teams that skip this process and jump straight to the biggest model in production are almost always overspending.
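The cost-per-query arithmetic above is worth making explicit. The cluster size and the $10/GPU-hour rate below are illustrative assumptions picked from the ranges in the paragraph, not quotes from any provider:

```python
# Back-of-the-envelope cost arithmetic: an 8-GPU cluster at an assumed
# $10 per GPU-hour, served at 50 vs 500 requests per second (the "naive
# vs well-tuned" throughput spread described above).

def cost_per_1k_requests(gpus, dollars_per_gpu_hour, requests_per_sec):
    cluster_per_hour = gpus * dollars_per_gpu_hour
    requests_per_hour = requests_per_sec * 3600
    return 1000 * cluster_per_hour / requests_per_hour

naive = cost_per_1k_requests(8, 10.0, 50)    # untuned serving stack
tuned = cost_per_1k_requests(8, 10.0, 500)   # 10x throughput after tuning
print(round(naive, 3), round(tuned, 3))
```

At these assumed rates the untuned stack costs about $0.44 per thousand requests and the tuned one about $0.04 — the same hardware bill amortized over ten times the traffic, which is the whole economic argument for serving optimization.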