Zubnet AI Learning Wiki › Diffusion Model

Diffusion Model

A generative model that creates images (or video, or audio) by starting from pure noise and progressively removing it until a coherent output emerges. The model learns to reverse the process of adding noise to real data. Stable Diffusion, DALL-E 3, and Midjourney all use variants of this approach.

Why It Matters

Diffusion models overtook GANs around 2022 as the dominant image-generation technique. They produce more diverse, more controllable outputs, and they are the backbone of nearly every image and video AI tool today.

Deep Dive

The core idea is deceptively simple. Take a real image, add Gaussian noise to it step by step until it becomes pure static, then train a neural network to reverse each step. At generation time, you start with random noise and run the learned denoising process forward. The model never generates an image from scratch in one shot — it sculpts one through dozens or hundreds of iterative refinement steps, each one nudging the noisy mess a little closer to something coherent. This iterative nature is both the strength and the weakness of the approach: it produces remarkably high-quality outputs, but each image requires many forward passes through the network, making generation slow compared to single-pass architectures.

Working in Latent Space

In practice, modern diffusion models do not work directly in pixel space. Latent diffusion (the "Stable" in Stable Diffusion) compresses images into a much smaller latent representation using a pretrained autoencoder, then runs the diffusion process there. This is what made high-resolution generation practical — diffusing a 512x512 image in pixel space requires operating on 786,432 values per step, while the latent space might compress that to 64x64x4, or about 16,384 values. The autoencoder handles the mapping back to pixels at the end. DALL-E 3, Midjourney, Flux, and essentially every competitive image generator today uses some form of latent diffusion.
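The savings are easy to verify with the figures from the paragraph above (a 512x512 RGB image versus a 64x64 latent with 4 channels, i.e. an 8x spatial downsampling):

```python
# Back-of-the-envelope comparison of pixel-space vs latent-space work.
pixel_values = 512 * 512 * 3   # values the denoiser touches per step in pixel space
latent_values = 64 * 64 * 4    # values per step in the compressed latent space

print(pixel_values)                    # 786432
print(latent_values)                   # 16384
print(pixel_values // latent_values)   # 48
```

Every denoising step operates on roughly 48x fewer values, and since generation runs many steps, the savings compound across the whole sampling loop.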

Steering the Output

Conditioning is how you steer the output. Text-to-image models encode your prompt using a text encoder (CLIP or T5, typically), then inject those embeddings into the denoising network via cross-attention at each step. Classifier-free guidance (CFG) is the trick that makes this work well — during training, the model occasionally drops the conditioning signal so it also learns unconditional generation. At inference, you compute both the conditioned and unconditioned predictions, then extrapolate away from the unconditioned one. Higher CFG scales mean the model follows your prompt more literally, but push too far and you get oversaturated, artifact-heavy images. This is that "guidance scale" slider you see in every diffusion UI.
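The extrapolation at the heart of CFG is one line of arithmetic. A sketch, where `eps_cond` and `eps_uncond` stand in for the network's noise predictions with and without the text conditioning (in a real pipeline both usually come from one batched forward pass):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    # Extrapolate away from the unconditional prediction:
    #   eps = eps_uncond + s * (eps_cond - eps_uncond)
    # s = 1 recovers the plain conditional prediction; larger s follows
    # the prompt harder, at the cost of oversaturation and artifacts.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy 2-value "predictions" just to show the arithmetic:
eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, -1.0])
guided = cfg(eps_uncond, eps_cond, 7.5)   # 7.5 is a common default scale
```

That `guidance_scale` argument is exactly the slider exposed in diffusion UIs.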

The architecture of the denoising network itself has been evolving fast. The original U-Net backbone (a convolutional architecture borrowed from medical image segmentation) dominated through Stable Diffusion 1.x and 2.x. But the field has been steadily moving toward Transformer-based denoisers — Diffusion Transformers, or DiTs. Sora, Stable Diffusion 3, and Flux all use DiT variants. The shift makes sense: Transformers handle variable-length sequences and scale more predictably with compute. For video generation, the sequence just becomes a series of frames, and attention can model temporal coherence directly.
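For a Transformer to consume a latent at all, the latent must first be chopped into a token sequence. A sketch of that "patchify" step, using an assumed 64x64x4 latent and 2x2 patches (a patch size used by smaller DiT configurations, though the exact shapes here are illustrative):

```python
import numpy as np

def patchify(latent, patch=2):
    """Split an (H, W, C) latent into non-overlapping patch tokens,
    each flattened to a vector of patch * patch * C values."""
    h, w, c = latent.shape
    t = latent.reshape(h // patch, patch, w // patch, patch, c)
    t = t.transpose(0, 2, 1, 3, 4)  # group the two patch axes together
    return t.reshape((h // patch) * (w // patch), patch * patch * c)

latent = np.zeros((64, 64, 4))
tokens = patchify(latent)
print(tokens.shape)   # (1024, 16): 1024 tokens of dimension 16
```

Once the latent is a token sequence, video is just more tokens: stacking frames lengthens the sequence, and attention over it models temporal coherence directly, as the paragraph above notes.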

Faster, Not Memorized

A common misunderstanding is that diffusion models "store" or "retrieve" training images. They do not. The model learns a statistical denoising function — the gradient of the data distribution, technically. Memorization can happen with highly duplicated training data, but it is a failure mode, not the mechanism. Another practical gotcha: the number of denoising steps has a huge impact on quality and speed. Techniques like DDIM and DPM-Solver reduced the required steps from thousands to 20-50, and distillation methods (SDXL Turbo, Latent Consistency Models) have pushed this further to 1-4 steps, though with some quality trade-offs. This is the frontier right now — making diffusion fast enough for real-time and interactive use without sacrificing the quality that made it dominant in the first place.
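One simple intuition for how samplers like DDIM cut step counts: inference does not have to revisit all the training steps, it can walk a short subsequence of them. A sketch of one common spacing scheme (evenly spaced timesteps; real schedulers offer several variants):

```python
import numpy as np

def inference_schedule(train_steps=1000, sample_steps=50):
    """Pick `sample_steps` evenly spaced timesteps out of `train_steps`,
    traversed from most-noisy to least-noisy."""
    ts = np.linspace(0, train_steps - 1, sample_steps).round().astype(int)
    return ts[::-1]

schedule = inference_schedule()
print(len(schedule), schedule[0], schedule[-1])   # 50 999 0
```

The denoiser is still the same network trained on 1000 steps; the sampler simply takes bigger, mathematically justified jumps between them, which is why step count is a pure speed/quality knob at inference time.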
