Three hyperparameters dominate every training run, and understanding how they interact is more important than memorizing default values. The learning rate controls how much the model's weights change on each update step — too high and the loss explodes, too low and you waste compute crawling toward a minimum you will never reach. Typical values for pre-training a large language model land somewhere between 1e-4 and 6e-4, though that range shifts depending on model size and optimizer. Batch size determines how many examples the model sees before updating its weights. Larger batches give more stable gradient estimates but cost more memory and can sometimes hurt generalization. The optimizer — almost always some variant of Adam (AdamW being the current standard) — decides how to use the gradient information to actually move the weights. AdamW adds decoupled weight decay, which acts as a regularizer and keeps weights from growing unbounded. These three are deeply entangled: doubling your batch size often means you can increase your learning rate (the linear scaling rule), and switching optimizers can change which learning rates are even stable. You cannot tune one in isolation and expect clean results.
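The linear scaling rule mentioned above is simple enough to state as code: relative to a reference configuration, multiplying the batch size by k multiplies the learning rate by k. A minimal sketch, with illustrative reference numbers (not drawn from any specific paper):

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion
    to the batch-size change relative to a reference configuration."""
    return base_lr * (new_batch / base_batch)

# Illustrative reference point: 3e-4 at batch size 512.
print(scale_lr(3e-4, 512, 1024))  # doubling the batch doubles the LR to 6e-4
```

The rule is a heuristic, not a law: it holds best in the small-to-medium batch regime, and at very large batch sizes the usable learning rate stops growing linearly.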
A constant learning rate is almost never the right choice, and this is one of those things that sounds like conventional wisdom but has solid empirical backing. Most successful training runs use a warmup phase followed by some form of decay. Warmup starts the learning rate near zero and ramps it up over the first few hundred to few thousand steps — this prevents the randomly initialized model from taking enormous, destructive gradient steps before it has learned any useful structure. After warmup, cosine decay is the most popular schedule: the learning rate follows a half-cosine curve from its peak down to near zero over the remaining training steps. This gives the model a long period at a productive learning rate followed by a gentle cooldown that helps it settle into a good minimum. Linear decay works too, but cosine has become the default because it consistently performs as well or better across architectures. Some recent work explores cyclic schedules and warmup-stable-decay patterns, but if you are starting a new project and want something reliable, cosine decay with warmup is the safe bet.
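The warmup-then-cosine schedule described above fits in a few lines. This is a minimal sketch (the function name and signature are illustrative, and most frameworks ship an equivalent scheduler): a linear ramp from zero to the peak over the warmup steps, then a half-cosine down to a floor over the rest of training.

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float,
               warmup_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay.

    Ramps linearly from 0 to peak_lr over warmup_steps, then follows a
    half-cosine from peak_lr down to min_lr over the remaining steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative values: peak 3e-4, 1k warmup steps, 10k total steps.
print(lr_at_step(0, 10_000, 3e-4, 1_000))       # 0.0 at the start
print(lr_at_step(1_000, 10_000, 3e-4, 1_000))   # peak at end of warmup
```

Setting `min_lr` to roughly 10% of the peak rather than zero is a common variant; the source's point stands either way, since the shape (long productive plateau, gentle cooldown) is what matters.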
The hyperparameters that matter shift dramatically depending on whether you are pre-training from scratch or fine-tuning an existing model. Pre-training is a brute-force affair — you care about learning rate, batch size, optimizer, and weight decay because you are building representations from nothing. Fine-tuning is surgery on an already-trained brain, and the rules change accordingly. Learning rates drop by an order of magnitude or more: where pre-training might use 3e-4, fine-tuning typically uses 1e-5 to 5e-5, because you want to nudge the model, not overwrite what it already knows. The number of epochs matters much more in fine-tuning — one to three passes over the data is often enough, and going further risks overfitting catastrophically on a small dataset. With parameter-efficient methods like LoRA, a new hyperparameter enters the picture: the rank, which controls how much capacity the adapter has. Rank 8 to 64 covers most use cases, with higher ranks adding expressiveness at the cost of more trainable parameters. LoRA also introduces its own alpha scaling factor, and the ratio of alpha to rank effectively controls the adapter's learning rate. The upshot is that fine-tuning has fewer hyperparameters to set, but each one is more sensitive because you are operating on a model that already has strong priors.
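The rank and alpha trade-offs above can be made concrete with a little arithmetic. This sketch (helper names and the example dimensions are illustrative, not from any particular library) compares trainable parameters for one weight matrix under full fine-tuning versus a LoRA adapter, and computes the alpha-to-rank scaling factor applied to the adapter's output:

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for one weight matrix: full fine-tuning
    updates all d_out * d_in entries, while LoRA trains two low-rank
    factors, A (rank x d_in) and B (d_out x rank)."""
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

def lora_scale(alpha: float, rank: int) -> float:
    """The adapter's contribution (B @ A @ x) is multiplied by alpha / rank,
    so this ratio acts like a per-adapter learning-rate multiplier."""
    return alpha / rank

# Illustrative: a 4096 x 4096 projection matrix at rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora)  # 16777216 131072
print(lora_scale(32, 16))  # 2.0
```

Note the asymmetry: the full matrix cost is quadratic in the hidden dimension while the adapter cost is linear, which is why even rank 64 stays a tiny fraction of the full parameter count.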
Grid search — trying every combination of values on a predefined grid — is the strategy everyone learns first and almost nobody uses at scale. The problem is combinatorial: five hyperparameters with five values each means 3,125 runs, and most of those runs explore boring, redundant regions of the space. Random search, proposed by Bergstra and Bengio in 2012, is embarrassingly simple and consistently outperforms grid search: just sample hyperparameter values from reasonable distributions and run a fixed budget of experiments. It works because not all hyperparameters matter equally, and random sampling is far more likely to hit the important values of the ones that do. Beyond random search, Bayesian optimization (tools like Optuna or Weights & Biases Sweeps) builds a model of how hyperparameters map to performance and uses that model to suggest increasingly promising configurations. Population-based training takes a different approach entirely — it runs many training jobs in parallel, periodically copies the weights of the best-performing ones, and mutates their hyperparameters, effectively evolving a good configuration during training rather than before it. Each strategy trades off compute cost against exploration efficiency, but the honest answer is that random search with a reasonable budget gets you 90% of the way there.
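Random search is short enough to show in full. A minimal sketch, with illustrative ranges and choice sets: the learning rate is sampled log-uniformly, because its order of magnitude matters far more than its exact value, while discrete knobs are drawn from small candidate sets.

```python
import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration at random.

    The learning rate is log-uniform over [1e-5, 1e-3]; batch size and
    weight decay come from small discrete sets. Ranges are illustrative.
    """
    log_lr = rng.uniform(math.log(1e-5), math.log(1e-3))
    return {
        "lr": math.exp(log_lr),
        "batch_size": rng.choice([64, 128, 256, 512]),
        "weight_decay": rng.choice([0.0, 0.01, 0.1]),
    }

# A fixed budget of 20 trials; train and evaluate each one, keep the best.
rng = random.Random(0)
trials = [sample_config(rng) for _ in range(20)]
```

The contrast with grid search is the point: 20 random draws test 20 distinct learning rates, whereas a 20-run grid might test only four or five, wasting the rest on dimensions that turn out not to matter.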
Here is the part that rarely makes it into textbooks: most hyperparameters in production systems are not derived from first principles or found through rigorous search. They are copied from papers that worked on similar problems. When someone pre-trains a 7B parameter model, they look at what learning rate LLaMA used, what batch size Chinchilla recommended, what weight decay GPT-3 reported — and they start there. This is not laziness; it is rational. The hyperparameter landscape for large models is vast, each experiment costs thousands of dollars in compute, and the published configurations represent hundreds of thousands of dollars of implicit search already performed by well-funded labs. The craft of hyperparameter tuning, in practice, is knowing which paper's settings to start from, which one or two knobs are worth adjusting for your specific situation, and when something is going wrong enough that you need to actually search rather than tweak. First principles matter for understanding why a choice works, but copying from successful predecessors is how most real training runs get off the ground.