The architecture has three components: a text encoder (CLIP or T5) converts the prompt into embeddings; a U-Net (SD 1.5/SDXL) or DiT (SD3) performs iterative denoising in latent space; and a VAE decoder converts the final latent into a full-resolution image. The "latent" part is key: instead of denoising a 512×512 RGB image (about 786K values), the model denoises a 64×64 latent with 4 channels (about 16K values). That is roughly 48x fewer values per step, which makes each denoising step far cheaper in compute and memory.
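The size difference can be checked with a few lines of arithmetic. The sketch below assumes an SD 1.5-style VAE with 8x spatial downsampling and 4 latent channels; the helper name `tensor_size` and the constants are illustrative, not taken from any library. Note that a 64×64 latent has about 4K spatial positions, and counting all 4 channels gives about 16K values.

```python
# Compare how many values are denoised in pixel space vs. latent space.
# Assumptions (not from any specific library): 512x512 RGB input,
# 8x spatial downsampling in the VAE, 4 latent channels (SD 1.5-style).

def tensor_size(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

IMAGE_SIZE = 512
RGB_CHANNELS = 3
VAE_DOWNSAMPLE = 8
LATENT_CHANNELS = 4

pixel_values = tensor_size(IMAGE_SIZE, IMAGE_SIZE, RGB_CHANNELS)
latent_side = IMAGE_SIZE // VAE_DOWNSAMPLE
latent_values = tensor_size(latent_side, latent_side, LATENT_CHANNELS)
reduction = pixel_values / latent_values

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values ({latent_side}x{latent_side}x{LATENT_CHANNELS})")
print(f"reduction:    {reduction:.0f}x fewer values per denoising step")
```

The ratio counts values, not wall-clock time. The actual speedup depends on the denoiser architecture and hardware, and the VAE decode adds a fixed cost at the end.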
SD's open nature created an unusually large ecosystem. Civitai and Hugging Face host thousands of community-trained models and LoRA fine-tunes (anime style, photorealism, specific characters). Frontends such as the Automatic1111 web UI and ComfyUI provide interfaces for complex generation workflows. ControlNet, IP-Adapter, and other extensions add control beyond text prompting. Few other AI models have generated this level of community innovation.
SD3 replaced the U-Net with a DiT (Diffusion Transformer) and switched its training objective from standard diffusion noise prediction to flow matching (in the rectified-flow form), following broader architectural trends in the field. It also uses three text encoders (CLIP-L, CLIP-G, T5-XXL) for better prompt understanding. The result: better text rendering, more coherent compositions, and improved prompt following. But the larger model size (2B+ parameters in the denoiser alone, before counting the text encoders) makes it harder to run on consumer hardware, creating tension with SD's accessibility mission.
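To make the flow matching idea concrete, here is a toy sketch of the rectified-flow formulation on a three-element vector. It is an illustration under stated assumptions, not SD3's training or sampling code. The function names (`interpolate`, `velocity_target`, `euler_sample`) are hypothetical, and an oracle stands in for the learned network. The model sees a straight-line blend of data and noise and is trained to predict the constant velocity along that line. Sampling then integrates the velocity from noise back to data.

```python
import random

def interpolate(x0, noise, t):
    """Point on the straight line from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def velocity_target(x0, noise):
    """Regression target for the network: d(x_t)/dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]

def euler_sample(velocity_fn, noise, steps):
    """Integrate from t=1 (pure noise) down to t=0 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "latent" and a fixed noise sample (seeded for reproducibility).
rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]

# Halfway along the path, and the target a network would regress onto.
x_half = interpolate(x0, noise, 0.5)
target = velocity_target(x0, noise)

# Oracle standing in for a trained network: returns the true velocity.
def oracle_velocity(x, t):
    return target

recovered = euler_sample(oracle_velocity, noise, steps=10)
print("recovered:", [round(v, 6) for v in recovered])
```

With a perfect velocity predictor the straight path is recovered exactly, even with few steps. A real model only approximates the velocity, so sample quality still depends on the number of steps and the solver.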