
Optimization

Also known as: Model Optimization, Inference Optimization

A broad family of techniques for making AI models faster, smaller, cheaper, or more accurate. It spans training optimization (mixed precision, gradient checkpointing, data parallelism), inference optimization (quantization, pruning, distillation, speculative decoding), and serving optimization (batching, caching, load balancing). Optimization is the reason you can run a 14B-parameter model on a laptop.

Why It Matters

Raw capability is worthless if you cannot afford to run it. Optimization is the difference between a research demo and a production product. It is why open-weights models can compete with API providers, why mobile AI exists, and why inference costs keep falling.

Deep Dive

Optimization in AI is really three separate disciplines that happen to share a name. Training optimization is about making the learning process faster and cheaper. Inference optimization is about making the trained model respond faster and use less hardware. Serving optimization is about handling many concurrent users efficiently. Most people conflate these because the techniques sometimes overlap, but the goals and constraints are different. A training optimization like gradient checkpointing trades compute time for memory — you recompute activations during the backward pass instead of storing them. That makes sense when you are GPU-memory-bound during a multi-day training run. It would make no sense during inference, where there is no backward pass. Understanding which phase you are optimizing for, and what resource you are trading for what, is the foundation of making good decisions here.
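The compute-for-memory trade described above can be made concrete with a toy sketch. This is not a real autograd engine (frameworks like PyTorch provide checkpointing utilities for that); the layer functions and names here are purely illustrative. Instead of storing every intermediate activation for the backward pass, we store one per segment and recompute the rest on demand:

```python
# Toy illustration of gradient checkpointing's compute/memory trade-off.
# All layer functions and names are hypothetical, for illustration only.

def run_layers(layers, x, checkpoint_every=2):
    """Forward pass that stores only every k-th activation (plus the input)."""
    stored = {0: x}  # layer index -> activation kept in "memory"
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % checkpoint_every == 0:
            stored[i + 1] = x
    return x, stored

def recompute(layers, stored, target):
    """Rebuild the activation at `target` from the nearest stored checkpoint.
    This extra forward work is the price paid for storing fewer activations."""
    start = max(i for i in stored if i <= target)
    x = stored[start]
    for i in range(start, target):
        x = layers[i](x)
    return x

# Six toy "layers": f_k(v) = 2v + k
layers = [lambda v, k=k: v * 2 + k for k in range(6)]
out, stored = run_layers(layers, 1.0, checkpoint_every=3)
# Without checkpointing we would hold 7 activations; here we hold 3,
# and the backward pass recomputes the missing ones segment by segment.
```

A plain forward pass would keep all seven activations alive until the backward pass; with a checkpoint every three layers, only three are kept, and any missing activation costs at most two extra layer evaluations to rebuild.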

Quantization: The Single Biggest Win

If you could only learn one optimization technique, quantization would be the one to pick. The idea is simple: models are trained in high-precision floating point (typically bfloat16, which uses 16 bits per parameter), but they can run in much lower precision without catastrophic quality loss. A 14-billion-parameter model in bfloat16 takes about 28 GB of VRAM. Quantize it to 4-bit (Q4_K_M in llama.cpp's notation) and it fits in under 9 GB — suddenly it runs on a single consumer GPU. The quality trade-off exists but is smaller than you would expect. Modern quantization methods like GPTQ, AWQ, and GGUF are calibrated against real data so the most important weights keep higher precision. In blind tests, most users cannot tell the difference between a full-precision model and its 4-bit quantized version for everyday tasks. The gap shows up on edge cases — complex reasoning chains, niche factual knowledge, multilingual tasks — but for most production use cases, quantization is free performance.
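The core mechanic behind all of these methods is the same: map floating-point weights to small integers plus a scale factor. The sketch below shows the simplest variant, symmetric per-tensor quantization; it is a minimal illustration, not the calibrated algorithm GPTQ or AWQ actually use (those choose scales per group and protect salient weights). It assumes at least one nonzero weight:

```python
# Minimal sketch of symmetric low-bit quantization: float weights become
# signed integers in [-qmax, qmax] plus one shared scale. Real methods
# (GPTQ, AWQ, GGUF k-quants) use per-group scales and calibration data.

def quantize(weights, bits=4):
    """Map floats to signed ints; assumes at least one nonzero weight."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most scale/2 per weight."""
    return [qi * scale for qi in q]

weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Storage: each weight now needs 4 bits instead of 16, which is where the
# 28 GB -> under 9 GB figure for a 14B model comes from (plus scale overhead).
```

Each restored weight lands within half a quantization step of the original, which is why the quality loss is modest: the error per weight is bounded, and the network averages it out across billions of parameters.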

Inference Speed: Batching, Caching, and Speculative Decoding

Beyond quantization, the biggest inference speedups come from how you manage requests rather than how you shrink the model. Continuous batching — the approach used by vLLM and TensorRT-LLM — lets the server process multiple requests simultaneously, filling GPU idle cycles that would otherwise be wasted while one request waits for its next token. KV-cache optimization (like PagedAttention) allocates the key-value cache in small pages on demand rather than pre-reserving one contiguous block per request, eliminating the fragmentation and over-allocation that otherwise waste a large share of GPU memory on long-context workloads. Speculative decoding uses a small, fast "draft" model to generate several candidate tokens, then the large model verifies them in a single forward pass; if the draft model guesses right (which it often does for predictable text), you get multiple tokens for the cost of one large-model call. These techniques compound. A well-tuned serving stack using continuous batching, quantization, and speculative decoding can serve the same model at five to ten times the throughput of a naive implementation.
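The speculative decoding loop can be sketched with toy stand-in functions (`target_next` and `draft_next` here are hypothetical greedy token predictors, not real models). The draft proposes a run of tokens; the target keeps the longest prefix it agrees with and substitutes its own token at the first mismatch. In a real system the verification of all k positions happens in one batched forward pass of the large model, which is where the speedup comes from:

```python
# Toy sketch of the speculative-decoding propose/verify loop.
# target_next/draft_next are hypothetical greedy predictors: ctx -> next token.

def speculative_step(target_next, draft_next, prefix, k=4):
    """Draft proposes k tokens; keep the longest prefix the target agrees with,
    replacing the first mismatch with the target's own token."""
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))   # cheap draft-model calls
    accepted = list(prefix)
    for i in range(k):
        t = target_next(accepted)               # in reality: one batched pass
        accepted.append(t)                      # target's token is always valid
        if proposal[len(prefix) + i] != t:
            break                               # draft diverged: stop here
    return accepted

# Toy vocabulary: the target deterministically continues 1, 2, 3, ...
target_seq = [1, 2, 3, 4, 5, 6, 7, 8]
def target_next(ctx):
    return target_seq[len(ctx)]
def draft_next(ctx):
    return 99 if len(ctx) == 5 else target_seq[len(ctx)]  # wrong at position 5

out = speculative_step(target_next, draft_next, [1, 2], k=4)
```

In this run the draft guesses three tokens correctly and misses the fourth, so one verification round yields four accepted tokens: three free ones plus the target's correction.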

The Cost Equation in Practice

For most teams, optimization is ultimately about cost per query. Running a 70B model on a cluster of A100s costs serious money — roughly $8–$15 per GPU per hour at cloud prices. Optimization determines whether that cluster handles 50 requests per second or 500. Distillation is another lever: you train a smaller "student" model to mimic the outputs of a larger "teacher" model on your specific task. A distilled 8B model that matches 90% of the 70B model's quality on your particular use case costs a fraction to run. The practical workflow many teams follow is: prototype with the largest model available via API, measure what quality level you actually need, then work backward — quantize, distill, or switch to a smaller model until you hit the sweet spot where quality is acceptable and cost is sustainable. The teams that skip this process and jump straight to the biggest model in production are almost always overspending.
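The cost arithmetic above is worth making explicit. The sketch below uses the figures from the text; the 8-GPU cluster size and the $10/GPU-hour midpoint are assumed example values, not a recommendation:

```python
# Back-of-the-envelope cost-per-query math. Cluster size (8 GPUs) and
# $10/GPU-hour are assumed illustrative values within the $8-$15 range above.

def cost_per_1k_queries(gpus, dollars_per_gpu_hour, requests_per_second):
    """Hourly cluster cost divided by hourly query volume, per 1,000 queries."""
    cluster_cost_per_hour = gpus * dollars_per_gpu_hour
    queries_per_hour = requests_per_second * 3600
    return 1000 * cluster_cost_per_hour / queries_per_hour

naive = cost_per_1k_queries(8, 10.0, 50)    # unoptimized serving stack
tuned = cost_per_1k_queries(8, 10.0, 500)   # same hardware, 10x throughput
```

The hardware bill is identical in both cases; only throughput changes. That is the sense in which serving optimization "pays for itself": a 10x throughput gain is a 10x cut in cost per query with zero change to the model or the cluster.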
