The core idea is deceptively simple. Take a real image, add Gaussian noise to it step by step until it becomes pure static, then train a neural network to reverse each step. At generation time, you start with random noise and run the learned denoising process forward. The model never generates an image from scratch in one shot — it sculpts one through dozens or hundreds of iterative refinement steps, each one nudging the noisy mess a little closer to something coherent. This iterative nature is both the strength and the weakness of the approach: it produces remarkably high-quality outputs, but each image requires many forward passes through the network, making generation slow compared to single-pass architectures.
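The forward (noising) half of that process can be sketched in a few lines. This is a minimal illustration, assuming a simple linear beta schedule; the names (`T`, `betas`, `add_noise`) are illustrative, not taken from any particular library:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # per-step noise variances
alphas_bar = np.cumprod(1.0 - betas)    # cumulative fraction of signal retained

rng = np.random.default_rng(0)

def add_noise(x0, t):
    """Jump directly to timestep t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_bar[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

x0 = np.zeros((8, 8))            # a toy "image"
x_early = add_noise(x0, 10)      # still mostly signal
x_late = add_noise(x0, T - 1)    # essentially pure Gaussian static
```

A useful property visible here: because the noise accumulates in closed form, training never has to simulate all 1000 steps; it can sample any timestep directly. The network is then trained to predict `eps` given the noisy input and `t`.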
In practice, modern diffusion models do not work directly in pixel space. Latent diffusion (the "Stable" in Stable Diffusion) compresses images into a much smaller latent representation using a pretrained autoencoder, then runs the diffusion process there. This is what made high-resolution generation practical — diffusing a 512x512 RGB image in pixel space means operating on 786,432 values per step, while the latent space might compress that to 64x64x4, or about 16,384 values, a 48x reduction. The autoencoder handles the mapping back to pixels at the end. DALL-E 3, Midjourney, Flux, and essentially every competitive image generator today use some form of latent diffusion.
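The arithmetic behind those numbers, with a stand-in for the encoder (the `encode` function below is a hypothetical placeholder for a pretrained VAE, assuming the common 8x spatial downsample to 4 latent channels):

```python
import numpy as np

pixel_values = 512 * 512 * 3        # RGB image in pixel space
latent_values = 64 * 64 * 4         # typical SD-style latent
compression = pixel_values / latent_values

def encode(image):
    """Stand-in for a VAE encoder: 8x spatial downsample, 4 channels out."""
    h, w, _ = image.shape
    return np.zeros((h // 8, w // 8, 4))

latent = encode(np.zeros((512, 512, 3)))
```

Every denoising step now touches ~16k values instead of ~786k, which is why the same compute budget reaches much higher resolutions.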
Conditioning is how you steer the output. Text-to-image models encode your prompt using a text encoder (CLIP or T5, typically), then inject those embeddings into the denoising network via cross-attention at each step. Classifier-free guidance (CFG) is the trick that makes this work well — during training, the model occasionally drops the conditioning signal so it also learns unconditional generation. At inference, you compute both the conditioned and unconditioned predictions, then extrapolate away from the unconditioned one. Higher CFG scales mean the model follows your prompt more literally, but push too far and you get oversaturated, artifact-heavy images. This is that "guidance scale" slider you see in every diffusion UI.
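The CFG combination itself is one line of arithmetic. A minimal sketch, with the two network predictions stubbed out as arrays so the extrapolation is visible (in a real pipeline both come from the same denoiser, run with and without the prompt embeddings):

```python
import numpy as np

def cfg_predict(eps_cond, eps_uncond, guidance_scale):
    # Extrapolate away from the unconditional prediction:
    # eps = eps_uncond + s * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])   # stub: prediction with prompt dropped
eps_cond = np.array([1.0, -1.0])    # stub: prediction with prompt injected

neutral = cfg_predict(eps_cond, eps_uncond, 1.0)    # s=1: pure conditional
amplified = cfg_predict(eps_cond, eps_uncond, 7.5)  # a typical UI default
```

At scale 1 the unconditional term cancels entirely; past that, the prompt-relevant direction gets amplified, which is exactly where the oversaturation at high scales comes from.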
The architecture of the denoising network itself has been evolving fast. The original U-Net backbone (a convolutional architecture borrowed from medical image segmentation) dominated through Stable Diffusion 1.x and 2.x. But the field has been steadily moving toward Transformer-based denoisers — Diffusion Transformers, or DiTs. Sora, Stable Diffusion 3, and Flux all use DiT variants. The shift makes sense: Transformers handle variable-length sequences and scale more predictably with compute. For video generation, the sequence just becomes a series of frames, and attention can model temporal coherence directly.
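How a DiT turns a latent into a token sequence can be sketched as a patchify operation. The sizes below (a 64x64x4 latent cut into 2x2 patches) are assumptions for illustration, not any specific model's configuration:

```python
import numpy as np

def patchify(latent, patch=2):
    """Cut a (C, H, W) latent into non-overlapping patches, one token each."""
    c, h, w = latent.shape
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)                     # group by patch position
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

latent = np.zeros((4, 64, 64))
tokens = patchify(latent)   # 32*32 = 1024 tokens, each of dimension 4*2*2 = 16
```

For video, frames just concatenate into a longer token sequence, so the same attention layers model temporal coherence with no architectural change.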
A common misunderstanding is that diffusion models "store" or "retrieve" training images. They do not. The model learns a statistical denoising function (technically, an estimate of the score: the gradient of the log-density of the data distribution). Memorization can happen with highly duplicated training data, but it is a failure mode, not the mechanism. Another practical gotcha: the number of denoising steps has a huge impact on quality and speed. Techniques like DDIM and DPM-Solver reduced the required steps from thousands to 20-50, and distillation methods (SDXL Turbo, Latent Consistency Models) have pushed this further to 1-4 steps, though with some quality trade-offs. This is the frontier right now — making diffusion fast enough for real-time and interactive use without sacrificing the quality that made it dominant in the first place.
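A deterministic DDIM-style sampling loop makes the step-count reduction concrete: it visits a handful of timesteps rather than all thousand. This is a sketch under heavy assumptions — `predict_eps` is a placeholder where a trained network would go, and the schedule matches the toy one above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def predict_eps(x, t):
    """Stub for the trained denoiser's noise prediction."""
    return x * 0.1  # placeholder; a real model returns its eps estimate

def ddim_step(x, t, t_prev):
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = predict_eps(x, t)
    x0_hat = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * eps

# 25 visited timesteps instead of 1000 — the point of DDIM-style samplers.
timesteps = np.linspace(T - 1, 0, 25).astype(int)
x = np.random.default_rng(0).standard_normal((64, 64, 4))
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, t, t_prev)
```

Distilled models go further by training the network itself to take large jumps, which is how 1-4 step generation becomes possible without a clever sampler at all.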