Hyperparameter Tuning

HPO, Hyperparameter Optimization, Grid Search
The systematic search for the best hyperparameters: the configuration choices that are not learned during training but must be set before training begins. Learning rate, batch size, number of layers, dropout rate, and LoRA rank are all hyperparameters. Tuning methods include grid search (try every combination), random search (sample random combinations), and Bayesian optimization (use past results to guide the search).

Why It Matters

The gap between a good and a bad set of hyperparameters can be enormous: a wrong learning rate can make training diverge or converge to a poor solution. Hyperparameter tuning is how you extract the most value from a given model architecture and dataset. For LLM fine-tuning, learning rate and number of epochs are usually the hyperparameters most worth tuning.

Deep Dive

Grid search evaluates every combination of specified hyperparameter values: learning rates [1e-3, 1e-4, 1e-5] × batch sizes [16, 32, 64] = 9 experiments. It's exhaustive, but the cost grows exponentially as more hyperparameters are added. Random search samples random combinations from specified ranges, and it often finds better configurations than grid search under the same budget: since only a few hyperparameters usually dominate performance, random search's many distinct values per dimension beat the grid's few repeated ones (Bergstra & Bengio, 2012).
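
To make the two strategies concrete, here is a minimal Python sketch under the same 9-run budget. The train_and_eval function is a hypothetical stand-in for a full training run that returns a validation loss; the synthetic formula inside it exists only so the script runs end to end.

```python
import itertools
import random

def train_and_eval(lr, batch_size):
    # Synthetic stand-in for a real training run; swap in your own
    # training loop that returns a validation loss.
    return (lr - 3e-4) ** 2 + abs(batch_size - 32) / 1000

search_space = {"lr": [1e-3, 1e-4, 1e-5], "batch_size": [16, 32, 64]}

# Grid search: every combination of the listed values, 3 x 3 = 9 runs.
grid_results = {
    (lr, bs): train_and_eval(lr, bs)
    for lr, bs in itertools.product(search_space["lr"], search_space["batch_size"])
}

# Random search: same budget of 9 runs, but the learning rate is drawn
# log-uniformly from a continuous range, so every run probes a new value.
random_results = {}
for _ in range(9):
    lr = 10 ** random.uniform(-5, -3)
    bs = random.choice([16, 32, 64])
    random_results[(lr, bs)] = train_and_eval(lr, bs)

print("best grid config:", min(grid_results, key=grid_results.get))
print("best random config:", min(random_results, key=random_results.get))
```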

Bayesian Optimization

Bayesian optimization uses a probabilistic model (typically a Gaussian process or tree-based model) to predict which hyperparameters are likely to perform well based on past experiments, then prioritizes those regions. Libraries like Optuna, Ray Tune, and W&B Sweeps implement this. For expensive experiments (training a model takes hours), Bayesian optimization's efficiency advantage over random search is significant — it typically finds good configurations in 3–5x fewer experiments.
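
As a sketch of what this looks like in practice with Optuna (one of the libraries named above, whose default sampler is the tree-structured Parzen estimator, a tree-based method), the objective below wraps the same hypothetical train_and_eval stand-in; replace its body with a real training run.

```python
import optuna

def train_and_eval(lr, batch_size, dropout):
    # Synthetic stand-in for a real training run; replace with your
    # own training loop that returns a validation loss.
    return (lr - 3e-4) ** 2 * 1e6 + abs(batch_size - 32) / 1000 + (dropout - 0.1) ** 2

def objective(trial):
    # Each suggestion is informed by a probabilistic model fitted to
    # all previously completed trials.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return train_and_eval(lr, batch_size, dropout)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```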

Practical Tips

Start with established defaults for your architecture (published learning rates, batch sizes, etc.), then tune the most impactful parameters first. For LLM fine-tuning, learning rate is almost always the most important (try 1e-5 to 5e-4). For LoRA, rank (4–64) and alpha (typically 2× rank) matter most. Use early stopping to cut unpromising experiments short. Log everything to W&B or similar — you'll want to compare runs and understand what worked.
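
As one illustration of these rules of thumb, here is a sketch of a LoRA starting configuration using the Hugging Face peft library. The target module names are an assumption that fits many Llama-style models; check them against your own model's module names.

```python
from peft import LoraConfig

# A starting point following the rules of thumb above, to be tuned
# from here rather than treated as optimal.
lora_config = LoraConfig(
    r=16,              # LoRA rank: the 4-64 range is worth sweeping
    lora_alpha=32,     # alpha = 2 x rank is a common default
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: Llama-style naming
    task_type="CAUSAL_LM",
)
```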
