The algorithm: maintain a beam of width k (e.g., k=5). At each step, expand each candidate by all possible next tokens, score the resulting sequences, and keep the top k. Continue until all candidates have generated an end token or hit a length limit. Return the highest-scoring complete sequence. The beam width k trades off quality vs. compute: k=1 is greedy decoding, larger k explores more paths but costs k times more compute.
Raw beam search favors shorter sequences (fewer tokens = fewer probability multiplications = higher total probability). A length penalty (dividing the score by length^α) counteracts this bias, encouraging the model to generate complete, well-formed outputs rather than cutting short. The penalty factor α is a hyperparameter: α=0 is no penalty, α=1 normalizes fully by length. Typical values are 0.6–1.0.
Beam search works best for tasks with a "correct" answer (translation, summarization, structured generation) where you want the single most likely output. It works poorly for creative or conversational tasks where diversity matters — beam search tends to produce generic, repetitive text because high-probability sequences are often boring. Modern LLM interfaces use sampling (with temperature and top-p) for chat, and beam search is mainly used internally for specific tasks like tool-call generation.