
Hyperparameter Tuning

HPO, Hyperparameter Optimization, Grid Search
The systematic search for the best hyperparameters: the configuration choices that are not learned during training but must be set before it. Learning rate, batch size, number of layers, dropout rate, and LoRA rank are all hyperparameters. Tuning methods include grid search (try every combination), random search (try random combinations), and Bayesian optimization (use past results to guide the search).

Why It Matters

The gap between a good and a bad set of hyperparameters can be enormous: a single wrong learning rate can make training diverge or converge to a poor solution. Hyperparameter tuning is how you squeeze the most value out of a given model architecture and dataset. For LLM fine-tuning, learning rate and number of epochs are usually the hyperparameters most worth tuning.

Deep Dive

Grid search evaluates every combination of specified hyperparameter values: learning rates [1e-3, 1e-4, 1e-5] × batch sizes [16, 32, 64] = 9 experiments. It's exhaustive but exponentially expensive as more hyperparameters are added. Random search samples random combinations from specified ranges — surprisingly, it often finds better configurations than grid search because it explores the space more evenly (Bergstra & Bengio, 2012).
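
To make the comparison concrete, here is a minimal Python sketch of grid search versus random search with the same 9-run budget. The train_and_evaluate function is a hypothetical stand-in for a real training run, and the toy loss inside it is purely illustrative:

```python
import itertools
import math
import random

# Toy stand-in for a real training run: pretend validation loss is
# lowest near lr=3e-4, batch_size=32. Replace with actual training.
def train_and_evaluate(lr, batch_size):
    return (math.log10(lr) + 3.5) ** 2 + (math.log2(batch_size) - 5) ** 2

learning_rates = [1e-3, 1e-4, 1e-5]
batch_sizes = [16, 32, 64]

# Grid search: every combination of the listed values -> 3 x 3 = 9 runs.
grid_results = {
    (lr, bs): train_and_evaluate(lr, bs)
    for lr, bs in itertools.product(learning_rates, batch_sizes)
}

# Random search with the same budget: 9 samples, but the learning rate
# is drawn log-uniformly, so every run probes a distinct value.
random_results = {}
for _ in range(9):
    lr = 10 ** random.uniform(-5, -3)   # log-uniform over [1e-5, 1e-3]
    bs = random.choice(batch_sizes)
    random_results[(lr, bs)] = train_and_evaluate(lr, bs)

best_config = min(grid_results, key=grid_results.get)
print("best grid config:", best_config)
```

The log-uniform draw is why random search explores more evenly: grid search only ever tests three learning-rate values, while random search tests nine different ones.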

Bayesian Optimization

Bayesian optimization uses a probabilistic model (typically a Gaussian process or tree-based model) to predict which hyperparameters are likely to perform well based on past experiments, then prioritizes those regions. Libraries like Optuna, Ray Tune, and W&B Sweeps implement this. For expensive experiments (training a model takes hours), Bayesian optimization's efficiency advantage over random search is significant — it typically finds good configurations in 3–5x fewer experiments.
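
As a rough sketch of what this looks like with Optuna, whose default TPE sampler is the tree-based kind of model described above. The search space and the stand-in objective below are illustrative assumptions, not values from this page:

```python
import optuna

# Optuna calls this once per trial, suggesting hyperparameters from the
# declared ranges; completed trials steer the sampler toward promising
# regions of the search space.
def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    dropout = trial.suggest_float("dropout", 0.0, 0.3)
    # Stand-in for a real training run that returns validation loss.
    val_loss = (lr - 1e-4) ** 2 + dropout
    return val_loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```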

Practical Tips

Start with established defaults for your architecture (published learning rates, batch sizes, etc.), then tune the most impactful parameters first. For LLM fine-tuning, learning rate is almost always the most important (try 1e-5 to 5e-4). For LoRA, rank (4–64) and alpha (typically 2× rank) matter most. Use early stopping to cut unpromising experiments short. Log everything to W&B or similar — you'll want to compare runs and understand what worked.
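
A starting-point sketch applying these heuristics with the peft library; the target_modules names are an assumption (a common choice for Llama-style models), and every value here is a default to tune from, not a recommendation from this page:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # rank: tune within 4-64
    lora_alpha=32,                        # alpha = 2x rank heuristic
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # model-dependent; assumed here
)
```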
