
Stable Diffusion

SD, SDXL, SD3
The most widely used open-source image-generation model, created by Stability AI in collaboration with academic researchers. Stable Diffusion generates images from text prompts using latent diffusion: it performs the denoising process in a compressed latent space rather than in pixel space, which makes it fast enough to run on consumer GPUs. SD 1.5, SDXL, and SD3 represent successive generations.

Why it matters

Stable Diffusion democratized AI image generation. Before SD, generating images required expensive API access (DALL-E) or was limited to research labs. SD's open weights meant anyone could run it locally, fine-tune it, and build on top of it. This spawned a huge ecosystem: LoRA fine-tunes, ControlNet, custom models, community-trained checkpoints, and applications from Automatic1111 to ComfyUI.

Deep Dive

The architecture has three components: a text encoder (CLIP or T5) converts the prompt into embeddings, a U-Net (SD 1.5/SDXL) or DiT (SD3) performs iterative denoising in latent space, and a VAE decoder converts the final latent representation into a full-resolution image. The "latent" part is key: instead of denoising a 512×512×3 image (~786K values), the model denoises a 64×64×4 latent (~16K values), roughly 48x fewer values, which makes each denoising step dramatically cheaper.
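The compression factor above is simple arithmetic, which a few lines make concrete (the 4-channel, 8x-downsampled latent shape matches the standard SD 1.5 VAE configuration):

```python
# Latent vs. pixel space: why SD's denoising loop is cheap.
# A 512x512 RGB image has 3 channels at full resolution; the VAE
# compresses it 8x per spatial dimension into a 4-channel latent.
pixel_values = 512 * 512 * 3    # 786,432 values in pixel space
latent_values = 64 * 64 * 4     # 16,384 values in latent space

reduction = pixel_values / latent_values
print(f"pixel: {pixel_values}, latent: {latent_values}, reduction: {reduction:.0f}x")
# -> pixel: 786432, latent: 16384, reduction: 48x
```

Each U-Net step therefore operates on ~2% of the data a pixel-space diffusion model would touch, which is what brings generation within reach of consumer GPUs.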

The Ecosystem

SD's open nature created an unprecedented ecosystem. Civitai and Hugging Face host thousands of community-trained models and LoRA fine-tunes (anime style, photorealism, specific characters). WebUI frontends (Automatic1111, ComfyUI) provide interfaces for complex generation workflows. ControlNet, IP-Adapter, and other extensions add control beyond text prompting. No other AI model has generated this level of community innovation.

SD3 and the Architecture Shift

SD3 replaced the U-Net with a DiT (Diffusion Transformer) and switched its training objective from denoising diffusion to rectified flow matching, following the broader architectural trends in the field. It also uses three text encoders (CLIP-L, CLIP-G, T5-XXL) for better prompt understanding. The result: better text rendering, more coherent compositions, and improved prompt following. But the larger model size (2B+ parameters) makes it harder to run on consumer hardware, creating tension with SD's accessibility mission.
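The rectified-flow idea can be sketched in a few lines of NumPy. This is a simplified illustration of the training target, not SD3's actual implementation: the latent shape and timestep here are arbitrary stand-ins, and the real model predicts the velocity with a DiT rather than having it given.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 64, 64))   # stand-in for a clean VAE latent
noise = rng.normal(size=x0.shape)   # pure Gaussian noise
t = 0.3                             # timestep in [0, 1]

# Rectified flow uses a straight-line path between data and noise...
x_t = (1 - t) * x0 + t * noise

# ...and trains the network to predict the constant velocity along it.
v_target = noise - x0

# Sanity check: one Euler step with the true velocity lands exactly on
# the noise endpoint at t=1, because the path is a straight line.
x1 = x_t + (1 - t) * v_target
assert np.allclose(x1, noise)
```

Sampling runs this in reverse: start from noise at t=1 and integrate the predicted velocity back toward t=0, typically in far fewer steps than a curved diffusion trajectory requires.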
