
Stable Diffusion

SD, SDXL, SD3
The most widely used open-source image generation model, created by Stability AI in collaboration with academic researchers. Stable Diffusion generates images from text prompts using latent diffusion: the denoising process runs in a compressed latent space rather than in pixel space, which makes it fast enough to run on consumer GPUs. SD 1.5, SDXL, and SD3 are successive generations.

Why It Matters

Stable Diffusion democratized AI image generation. Before SD, generating images required expensive API access (DALL-E) or was confined to research labs. SD's open weights mean anyone can run it locally, fine-tune it, and build on top of it. This spawned a huge ecosystem: LoRA fine-tunes, ControlNet, custom models, community-trained checkpoints, and applications ranging from Automatic1111 to ComfyUI.

Deep Dive

The architecture has three components: a text encoder (CLIP or T5) converts the prompt into embeddings, a U-Net (SD 1.5/SDXL) or DiT (SD3) performs iterative denoising in latent space, and a VAE decoder converts the final latent representation into a full-resolution image. The "latent" part is key: instead of denoising a 512×512×3 image (~786K values), the model denoises a 64×64×4 latent (~16K values), roughly 48× fewer values per step, which is what makes generation practical on consumer hardware.
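The value counts above are simple arithmetic, so the compression ratio can be checked directly (an illustrative calculation, not code from Stable Diffusion itself):

```python
# Values the denoiser touches per step in pixel space vs. SD's latent space.
pixel_values = 512 * 512 * 3   # 512x512 RGB image: 786,432 values
latent_values = 64 * 64 * 4    # 64x64 latent with 4 channels: 16,384 values

print(pixel_values)                    # 786432
print(latent_values)                   # 16384
print(pixel_values // latent_values)   # 48
```

This ~48× reduction applies to every one of the dozens of denoising steps, which is where the overall speedup comes from.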

The Ecosystem

SD's open nature created an unprecedented ecosystem. Civitai and Hugging Face host thousands of community-trained models and LoRA fine-tunes (anime style, photorealism, specific characters). WebUI frontends (Automatic1111, ComfyUI) provide interfaces for complex generation workflows. ControlNet, IP-Adapter, and other extensions add control beyond text prompting. No other AI model has generated this level of community innovation.

SD3 and the Architecture Shift

SD3 replaced the U-Net with a DiT (Diffusion Transformer) and switched from diffusion to flow matching, following the broader architectural trends in the field. It also uses three text encoders (CLIP-L, CLIP-G, T5-XXL) for better prompt understanding. The result: better text rendering, more coherent compositions, and improved prompt following. But the larger model size (2B+ parameters) makes it harder to run on consumer hardware, creating tension with SD's accessibility mission.
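The flow-matching idea can be sketched in a few lines of NumPy. This is a toy illustration of the rectified-flow objective, with made-up shapes and variable names, not the SD3 implementation: training interpolates linearly between a data latent and Gaussian noise, and the model learns to predict the constant velocity along that straight path.

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=(4,))     # a "data" latent (toy size)
noise = rng.normal(size=(4,))  # a Gaussian noise sample
t = 0.3                        # timestep in [0, 1]

# The probability path is a straight line between data and noise.
x_t = (1 - t) * x0 + t * noise

# The regression target is the constant velocity along that path.
target_velocity = noise - x0

# A perfect model's single Euler step from x_t back toward the data
# recovers x0 exactly, since the path is straight.
x0_reconstructed = x_t - t * target_velocity

assert np.allclose(x0_reconstructed, x0)
```

Compared with classic diffusion's noise-prediction objective, the straight-line paths make few-step sampling behave better, one reason the field has trended toward flow matching.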

Related Concepts
