
Stable Diffusion

SD, SDXL, SD3
The most widely used open-source image generation model, created by Stability AI in collaboration with academic researchers. Stable Diffusion generates images from text prompts using latent diffusion: the denoising process runs in a compressed latent space rather than pixel space, which makes it fast enough to run on consumer GPUs. SD 1.5, SDXL, and SD3 are successive generations.

Why It Matters

Stable Diffusion democratized AI image generation. Before SD, image generation required expensive API access (DALL-E) or was confined to research labs. SD's open weights mean anyone can run it locally, fine-tune it, and build on top of it. This spawned a huge ecosystem: LoRA fine-tunes, ControlNet, custom models, community-trained checkpoints, and applications ranging from Automatic1111 to ComfyUI.

Deep Dive

The architecture has three components: a text encoder (CLIP or T5) converts the prompt into embeddings, a U-Net (SD 1.5/SDXL) or DiT (SD3) performs iterative denoising in latent space, and a VAE decoder converts the final latent representation into a full-resolution image. The "latent" part is key: instead of denoising a 512×512×3 image (~786K values), it denoises a 64×64×4 latent (~16K values), a roughly 48× reduction in data that makes each denoising step far cheaper and keeps generation fast on consumer GPUs.
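
To make the pipeline structure concrete, here is a minimal sketch using Hugging Face's diffusers library; the checkpoint ID, prompt, and generation settings are illustrative assumptions, not something specified by this article:

# Minimal sketch of an SD 1.5 generation with diffusers.
# Assumes: pip install diffusers transformers torch, and a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint ID
    torch_dtype=torch.float16,
).to("cuda")

# The three components map onto pipeline attributes:
#   pipe.text_encoder -> CLIP text encoder (prompt -> embeddings)
#   pipe.unet         -> U-Net that denoises in latent space
#   pipe.vae          -> VAE that decodes the 64x64x4 latent to a 512x512 image

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,   # iterative denoising steps
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("lighthouse.png")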

The Ecosystem

SD's open nature created an unprecedented ecosystem. Civitai and Hugging Face host thousands of community-trained models and LoRA fine-tunes (anime style, photorealism, specific characters). WebUI frontends (Automatic1111, ComfyUI) provide interfaces for complex generation workflows. ControlNet, IP-Adapter, and other extensions add control beyond text prompting. No other AI model has generated this level of community innovation.
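
As a hedged sketch of how these pieces are typically combined, the snippet below attaches a community-trained LoRA to the base pipeline with diffusers; the directory, file name, and strength value are hypothetical placeholders for whatever is downloaded from Civitai or Hugging Face:

# Sketch: layering a community LoRA on top of the base SD 1.5 pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# "loras/" and "my_style_lora.safetensors" are placeholder names.
pipe.load_lora_weights("loras", weight_name="my_style_lora.safetensors")

image = pipe(
    "portrait of a fox, watercolor style",
    cross_attention_kwargs={"scale": 0.8},   # LoRA strength (0 = off, 1 = full)
).images[0]
image.save("fox_lora.png")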

SD3 and the Architecture Shift

SD3 replaced the U-Net with a DiT (Diffusion Transformer) and switched from diffusion to flow matching, following the broader architectural trends in the field. It also uses three text encoders (CLIP-L, CLIP-G, T5-XXL) for better prompt understanding. The result: better text rendering, more coherent compositions, and improved prompt following. But the larger model size (2B+ parameters) makes it harder to run on consumer hardware, creating tension with SD's accessibility mission.
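
A brief, hedged sketch of what this looks like in practice with diffusers: SD3's pipeline loads the three text encoders separately, and a documented way to reduce memory on smaller GPUs is to drop the T5-XXL encoder, at some cost to prompt fidelity. The model ID below assumes the SD3 Medium checkpoint:

# Sketch: SD3 inference with the T5-XXL encoder dropped to save VRAM.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,   # skip T5-XXL, keep the two CLIP encoders
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    'a street sign that reads "LATENT SPACE", photo',  # exercises text rendering
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_sign.png")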
