In a Transformer decoder, causal masking ensures that each token attends only to itself and earlier tokens. The mask sets the attention scores for future positions to −∞ before the softmax, so those positions receive zero weight. As a result, token 5's representation depends only on tokens 1–5. This constraint is what enables autoregressive generation: predicting token 6 needs only the representations of tokens 1–5, which are already computed and do not change as new tokens are appended.
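The masking step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the function names `causal_softmax` and `attend` are made up for this example, and the scores are passed in directly rather than computed from queries and keys.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked.

    scores[i][j] is the raw attention score from query position i to key position j.
    (Hypothetical helper for illustration only.)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Causal mask: any key position j > i is replaced by -inf.
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(row)  # always finite, because position i itself is never masked
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def attend(weights, values):
    """Weighted sum of value vectors for each query position."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]

# With uniform scores, each position spreads its weight over itself and the past.
w = causal_softmax([[0.0] * 3 for _ in range(3)])
print(w[0])  # [1.0, 0.0, 0.0]
print(w[1])  # [0.5, 0.5, 0.0]
```

Because the masked weights are exactly zero, changing the value vector at a later position leaves the outputs at earlier positions untouched, which is the property the paragraph above relies on.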
Modern LLMs (GPT, Claude, Llama) are decoder-only: there is no separate encoder, and every layer uses causal attention. The input prompt passes through the same decoder layers as the generated output. This simplicity is a large part of why decoder-only models came to dominate: one architecture, one attention pattern, and one training objective (next-token prediction) that scales cleanly. The model treats everything as generation; even "understanding" the input is framed as predicting what comes next.
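The "everything is generation" framing shows up in the decoding loop itself: prompt tokens and generated tokens live in one sequence and go through one function. The sketch below assumes a hypothetical `next_token` callable standing in for the whole decoder stack plus sampling; `toy_next_token` is a placeholder rule, not a real model.

```python
from typing import Callable, List

def generate(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Autoregressive loop: append one predicted token at a time."""
    tokens = list(prompt)  # the prompt is simply the start of the sequence
    for _ in range(max_new_tokens):
        # The same function sees prompt tokens and previously generated tokens alike.
        tokens.append(next_token(tokens))
    return tokens

def toy_next_token(tokens: List[int]) -> int:
    # Hypothetical stand-in for a decoder: predicts (last token + 1) mod 10.
    return (tokens[-1] + 1) % 10

print(generate(toy_next_token, [3, 4], 3))  # [3, 4, 5, 6, 7]
```

A real implementation would also cache per-token keys and values so that earlier positions are not recomputed on each step; that is omitted here to keep the loop readable.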
In Stable Diffusion, the diffusion process operates in a compressed latent space (64×64 latents for a 512×512 image). The VAE decoder converts the final latent back into a full-resolution image. It is a separate neural network, trained to reconstruct images from latents. Its quality directly affects the final image: because the latent has lower spatial resolution than the image, the decoder has to synthesize fine details and textures that the latent does not represent explicitly.
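To get a feel for how much the decoder has to fill in, the size arithmetic can be worked out directly. The sketch below assumes the Stable Diffusion v1 configuration of 8× spatial downsampling and 4 latent channels; those two defaults are assumptions not stated in the text above, and the helper names are made up for this example.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    downsample=8 and latent_channels=4 are assumed defaults (SD v1 style).
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many image values each latent value has to account for."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (height * width * image_channels) / (c * h * w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, a 512×512 RGB image holds 48 times as many values as its latent, which is why the decoder's ability to reconstruct texture matters so much for the final result.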