Training

LoRA

Also known as: Low-Rank Adaptation
A technique that makes fine-tuning far cheaper by training only a small number of extra parameters instead of modifying the entire model. LoRA "adapters" are lightweight add-ons (often just a few MB) that change the model's behavior without retraining its billions of parameters.

Why It Matters

LoRA democratized fine-tuning. Before it, customizing even a 7B model required serious GPU resources. Now you can fine-tune on a single consumer GPU in a few hours, then share the tiny adapter file. This is why HuggingFace hosts thousands of specialized models.

Deep Dive

The core insight behind LoRA, published by Hu et al. in 2021, is that the weight updates during fine-tuning tend to be low-rank, meaning the changes can be well approximated by the product of two much smaller matrices. Instead of updating a weight matrix W (say, 4096 x 4096, roughly 16 million parameters), LoRA freezes W entirely and adds two small matrices B (4096 x r) and A (r x 4096), where r (the rank) is typically 8, 16, or 64. The effective update is the product BA, which has the same shape as W but is parameterized by only 2 x 4096 x r values. At rank 16, that is about 131,000 trainable parameters instead of roughly 16.8 million, a 128x reduction for that single layer. Apply this across all attention layers in the model and your total trainable parameters drop from billions to a few million, which is why a LoRA adapter file is often just 10-50 MB compared to the multi-gigabyte base model.
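The arithmetic above can be sketched in a few lines of NumPy. The dimensions and rank are the ones from the example; B is initialized to zero (as in the original paper) so the untrained adapter is a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 16

# Frozen pretrained weight: never updated during fine-tuning.
W = rng.standard_normal((d, d)).astype(np.float32)

# Low-rank factors: A is random, B starts at zero so the adapter
# initially contributes nothing to the output.
A = rng.standard_normal((r, d)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

x = rng.standard_normal(d).astype(np.float32)

# Adapted forward pass, computed without ever materializing the
# full d x d update matrix BA:
y = W @ x + B @ (A @ x)

trainable = A.size + B.size   # 2 * d * r = 131,072
frozen = W.size               # 16,777,216
print(trainable, frozen // trainable)  # 131072 128
```

Note that the reduction factor 16,777,216 / 131,072 works out to exactly 128 at this rank.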

Tuning the Knobs

In practice, you choose which layers to apply LoRA to (typically the attention projection matrices: Q, K, V, and the output projection) and set the rank r and a scaling factor called alpha. The alpha/r ratio controls how much influence the adapter has relative to the frozen base weights. Higher rank means more expressiveness but more parameters and memory; in practice, rank 16 or 32 covers most use cases. QLoRA, introduced by Dettmers et al. in 2023, pushed the efficiency even further by combining LoRA with 4-bit quantization of the base model: the frozen weights are stored in NF4 (a 4-bit format optimized for normally distributed weights) while the LoRA adapters train in bf16. This lets you fine-tune a 65B-parameter model on a single 48GB GPU, something that would otherwise require a multi-GPU setup with hundreds of gigabytes of VRAM.
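As a rough sketch of the parameter budget these knobs imply (the 32-layer, four-projection model is an illustrative assumption, not a specific architecture):

```python
d, r, alpha = 4096, 16, 32
scale = alpha / r                    # 2.0: multiplies the adapter's output

# Hypothetical budget: a 32-layer model with LoRA applied to the
# Q, K, V, and output projections of every layer.
n_layers, targets_per_layer = 32, 4
trainable = n_layers * targets_per_layer * 2 * d * r
adapter_mb = trainable * 2 / 2**20   # bf16 stores 2 bytes per parameter

print(f"{trainable:,} trainable params, ~{adapter_mb:.0f} MB adapter")
```

Under these assumptions the adapter comes to about 16.8M parameters and roughly 32 MB in bf16, consistent with the 10-50 MB range mentioned above.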

The Tool Landscape

The LoRA ecosystem has matured rapidly. HuggingFace's PEFT library is the standard implementation, and tools like Axolotl, LLaMA-Factory, and Unsloth wrap it in higher-level interfaces that handle data formatting, hyperparameter defaults, and training loops. One of the most powerful practical features is adapter composability: because LoRA adapters are additive, you can train separate adapters for different tasks and merge or swap them at inference time without reloading the base model. Some serving frameworks like LoRAX and vLLM exploit this to serve hundreds of different LoRA adapters from a single base model in memory, routing each request to the appropriate adapter. This makes it feasible to offer per-customer fine-tuned models without the cost of deploying separate model instances.
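Adapter composability can be illustrated in plain NumPy: one frozen base weight shared across tasks, adapters keyed by task name, and a merge step that folds an adapter into the base matrix. The task names and dimensions here are toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, scale = 256, 8, 2.0

# One frozen base weight shared by every adapter.
W = rng.standard_normal((d, d)).astype(np.float32)

def make_adapter():
    A = rng.standard_normal((r, d)).astype(np.float32)
    B = rng.standard_normal((d, r)).astype(np.float32)
    return A, B

# Two hypothetical task-specific adapters over the same base model.
adapters = {"summarize": make_adapter(), "translate": make_adapter()}

def forward(x, task):
    """Route a request to its adapter without reloading W."""
    A, B = adapters[task]
    return W @ x + scale * (B @ (A @ x))

# Merging bakes one adapter into the base weights, eliminating the
# extra matmuls at inference time:
A, B = adapters["summarize"]
W_merged = W + scale * (B @ A)

x = rng.standard_normal(d).astype(np.float32)
assert np.allclose(forward(x, "summarize"), W_merged @ x,
                   rtol=1e-4, atol=1e-3)
```

Swapping tasks is just a dictionary lookup, which is essentially what multi-adapter serving frameworks exploit at scale.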

The Trade-offs

LoRA is not a free lunch, though. The low-rank constraint means it cannot learn arbitrary weight changes: if the task requires significant restructuring of the model's internal representations, full fine-tuning will outperform LoRA. In practice, this matters most for tasks that are very different from what the base model was pre-trained on, or when you are trying to teach the model substantial new factual knowledge rather than adjusting its style or format. A common mistake is setting the rank too low and wondering why the adapter does not seem to learn, or setting it too high and ending up with an adapter that overfits on small datasets. The rank is a regularization knob as much as a capacity knob, and tuning it alongside the learning rate and number of training steps is essential for good results.

Related Concepts

Logits · Loss Function