
Optimization

Also known as: Model Optimization, Inference Optimization
A broad family of techniques for making AI models faster, smaller, cheaper, or more accurate. It covers training optimization (mixed precision, gradient checkpointing, data parallelism), inference optimization (quantization, pruning, distillation, speculative decoding), and serving optimization (batching, caching, load balancing). Optimization is the reason you can run a 14B-parameter model on a laptop.

Why It Matters

Raw capability is meaningless if you cannot afford to run it. Optimization is the difference between a research demo and a production product. It is why open-weights models can compete with API providers, why on-device AI exists, and why inference costs keep falling.

Deep Dive

Optimization in AI is really three separate disciplines that happen to share a name. Training optimization is about making the learning process faster and cheaper. Inference optimization is about making the trained model respond faster and use less hardware. Serving optimization is about handling many concurrent users efficiently. Most people conflate these because the techniques sometimes overlap, but the goals and constraints are different. A training optimization like gradient checkpointing trades compute time for memory — you recompute activations during the backward pass instead of storing them. That makes sense when you are GPU-memory-bound during a multi-day training run. It would make no sense during inference, where there is no backward pass. Understanding which phase you are optimizing for, and what resource you are trading for what, is the foundation of making good decisions here.
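The compute-for-memory trade in gradient checkpointing can be made concrete with back-of-envelope arithmetic. This is a toy cost model, not a measurement: the layer count, the 0.5 GB-per-layer activation figure, and the checkpoint-every-sqrt(N)-layers policy are all illustrative assumptions.

```python
# Toy cost model of gradient checkpointing: store every activation (vanilla
# backprop) vs. store only ~sqrt(N) checkpoint layers and recompute the rest
# during the backward pass. All numbers are illustrative assumptions.
import math

def activation_cost(n_layers: int, act_gb_per_layer: float, checkpoint: bool):
    """Return (peak activation memory in GB, extra forward passes paid)."""
    if not checkpoint:
        # Vanilla backprop keeps every layer's activations until backward.
        return n_layers * act_gb_per_layer, 0
    # Checkpoint every ~sqrt(N) layers: keep only the checkpoints, plus one
    # segment's worth of activations recomputed at a time during backward.
    segment = math.isqrt(n_layers)
    stored = (n_layers // segment) + segment
    # Each segment is re-run once in backward: roughly one extra forward pass.
    return stored * act_gb_per_layer, 1

mem_plain, _ = activation_cost(64, 0.5, checkpoint=False)    # 32.0 GB, no recompute
mem_ckpt, extra = activation_cost(64, 0.5, checkpoint=True)  # 8.0 GB, ~1 extra forward
```

The 4x memory saving at the price of one extra forward pass is why this only makes sense during training, where the backward pass (and its activation storage) exists at all.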

Quantization: The Single Biggest Win

If you could only learn one optimization technique, quantization would be the one to pick. The idea is simple: models are trained in high-precision floating point (typically bfloat16, which uses 16 bits per parameter), but they can run in much lower precision without catastrophic quality loss. A 14-billion-parameter model in bfloat16 takes about 28 GB of VRAM. Quantize it to 4-bit (Q4_K_M in llama.cpp's notation) and it fits in under 9 GB, so it suddenly runs on a single consumer GPU. The quality trade-off exists but is smaller than you would expect. Modern quantization methods like GPTQ, AWQ, and llama.cpp's k-quants (distributed as GGUF files) are calibrated against real data so the most important weights keep higher precision. In blind tests, most users cannot tell the difference between a full-precision model and its 4-bit quantized version for everyday tasks. The gap shows up on edge cases such as complex reasoning chains, niche factual knowledge, and multilingual tasks, but for most production use cases, quantization is free performance.
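The VRAM figures above fall out of simple arithmetic: parameters times bits per parameter. A sketch, with the caveat that real quantized files carry scale/zero-point metadata, so a "4-bit" quant like Q4_K_M lands closer to ~4.5 effective bits per weight, and activations and KV cache need VRAM on top of the weights.

```python
# Rough VRAM footprint of model weights at different precisions.
# Weights only: real deployments also need memory for activations and the
# KV cache, and quantized formats carry extra scale metadata.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes here)."""
    return n_params * bits_per_param / 8 / 1e9

params_14b = 14e9
bf16 = weight_memory_gb(params_14b, 16)    # 28.0 GB: needs a data-center GPU
q4 = weight_memory_gb(params_14b, 4.5)     # 7.875 GB: fits a 12 GB consumer card
```

The same arithmetic explains why 8-bit quantization (14 GB for this model) is often the conservative middle ground: half the memory of bfloat16 with an even smaller quality gap than 4-bit.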

Inference Speed: Batching, Caching, and Speculative Decoding

Beyond quantization, the biggest inference speedups come from how you manage requests rather than how you shrink the model. Continuous batching, the approach used by vLLM and TensorRT-LLM, lets the server process multiple requests simultaneously, filling GPU idle cycles that would otherwise be wasted while one request waits for its next token. KV-cache optimization (like vLLM's PagedAttention) does not stop the key-value cache from growing with sequence length; instead it allocates the cache in small pages, so memory is not lost to fragmentation and over-reserved slots, which is critical for long-context applications. Speculative decoding uses a small, fast "draft" model to generate several candidate tokens, then the large model verifies them in a single forward pass: if the draft model guesses right (which it often does for predictable text), you get multiple tokens for the cost of one large-model call. These techniques compound. A well-tuned serving stack using continuous batching, quantization, and speculative decoding can serve the same model at five to ten times the throughput of a naive implementation.
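The speculative-decoding loop can be sketched with stand-in functions in place of real models. This is a minimal greedy variant: the draft proposes k tokens, the target checks them and keeps the longest agreeing prefix plus its own correction. Both "models" below are toy lambdas, and the k verification probes, which a real server fuses into one batched forward pass, are written as a plain loop here.

```python
# Toy greedy speculative decoding. draft_next and target_next are stand-in
# functions mapping a token context to the next token; real systems use a
# small LM as the draft and verify all k positions in ONE batched forward.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Run one speculation round; return (new tokens, target-model calls)."""
    # 1) Draft model guesses k tokens autoregressively (cheap).
    guesses, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        guesses.append(t)
        ctx.append(t)
    # 2) Target model checks each guessed position (batched in practice).
    accepted, ctx = [], list(prefix)
    for g in guesses:
        t = target_next(ctx)
        if t != g:
            accepted.append(t)   # target's correction ends the round
            break
        accepted.append(g)       # guess confirmed, keep going
        ctx.append(g)
    return accepted, 1           # up to k+ tokens for one target-model call

# Demo: draft agrees with the target for the first few (predictable) tokens,
# then diverges, so one round yields four tokens for one "expensive" call.
target = lambda ctx: "word" if len(ctx) < 4 else "rare"
draft = lambda ctx: "word"
tokens, calls = speculative_step(["<s>"], draft, target, k=4)  # 4 tokens, 1 call
```

The payoff is exactly the acceptance rate: when the draft nails predictable spans, each target call yields several tokens; when it misses immediately, you fall back to one token per call, no worse than ordinary decoding.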

The Cost Equation in Practice

For most teams, optimization is ultimately about cost per query. Running a 70B model on a cluster of A100s costs serious money — roughly $8–$15 per GPU per hour at cloud prices. Optimization determines whether that cluster handles 50 requests per second or 500. Distillation is another lever: you train a smaller "student" model to mimic the outputs of a larger "teacher" model on your specific task. A distilled 8B model that matches 90% of the 70B model's quality on your particular use case costs a fraction to run. The practical workflow many teams follow is: prototype with the largest model available via API, measure what quality level you actually need, then work backward — quantize, distill, or switch to a smaller model until you hit the sweet spot where quality is acceptable and cost is sustainable. The teams that skip this process and jump straight to the biggest model in production are almost always overspending.
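The cost-per-query framing reduces to dividing the hourly GPU bill by sustained throughput. A sketch using the article's rough cloud figures; the cluster size and request rates are illustrative assumptions, not benchmarks.

```python
# Cost per 1,000 queries: the same GPU bill amortized over very different
# request rates. GPU price and throughput numbers are illustrative.

def cost_per_1k_queries(gpus: int, usd_per_gpu_hour: float, req_per_sec: float):
    """Dollars per 1,000 queries at a sustained request rate."""
    hourly_bill = gpus * usd_per_gpu_hour
    queries_per_hour = req_per_sec * 3600
    return 1000 * hourly_bill / queries_per_hour

# Hypothetical 8x A100 cluster at $10/GPU-hour:
naive = cost_per_1k_queries(8, 10.0, 50)    # ~$0.44 per 1k queries
tuned = cost_per_1k_queries(8, 10.0, 500)   # ~$0.04 per 1k queries
```

Because the hardware bill is fixed per hour, a 10x throughput gain from serving optimizations is directly a 10x cut in cost per query, which is why batching and quantization work usually pays for itself faster than renegotiating GPU prices.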
