
Diffusion Transformer

DiT
An architecture that replaces the traditional U-Net backbone in diffusion models with a Transformer. DiT applies the attention mechanism to image generation, bringing the same scaling behavior that makes LLMs so powerful. Sora, Flux, Stable Diffusion 3, and most state-of-the-art image and video generators use DiT or a variant of it.

Why It Matters

DiT unifies language and image generation under a single architectural paradigm: the Transformer. This means the scaling laws, training techniques, and optimization strategies developed for LLMs largely transfer to image and video generation. This is why image quality has improved so quickly: the field is riding the same scaling curve as language.

Deep Dive

The original DiT paper (Peebles & Xie, 2023) showed that simply replacing the U-Net with a standard Transformer and scaling it up produced better image quality. The Transformer processes image patches (similar to Vision Transformers) with added conditioning from the diffusion timestep and class labels. The key finding: DiT follows clear scaling laws — larger models and more compute produce predictably better images, just like with LLMs.
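To make the patch-and-conditioning flow concrete, here is a minimal sketch of one DiT block in PyTorch, using the adaLN-Zero conditioning described in the paper. The class and variable names (DiTBlock, cond) are illustrative, not taken from the official code, and the surrounding patchify/unpatchify layers are omitted.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One Transformer block with adaLN-Zero conditioning, as in DiT."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # adaLN-Zero: the conditioning vector (timestep + class embedding)
        # regresses per-block scale/shift/gate parameters; the projection is
        # zero-initialized so every block starts as the identity mapping.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim); cond: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        )
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        x = x + gate2 * self.mlp(h)
        return x

# Example: 256 patch tokens (a 16x16 grid), model width 384.
block = DiTBlock(dim=384, num_heads=6)
x = torch.randn(2, 256, 384)   # image patch tokens
cond = torch.randn(2, 384)     # timestep + class embedding
out = block(x, cond)           # same shape as x
```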

From U-Net to Transformer

U-Nets process images at multiple resolutions, downsampling then upsampling with skip connections. This inductive bias was useful when compute was limited, but it introduces architectural complexity and doesn't scale as cleanly. Transformers, with their uniform architecture, are simpler to scale and benefit more from additional compute and data. The trade-off: Transformers are more memory-hungry due to the quadratic attention over all image patches.
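A back-of-the-envelope illustration of that quadratic cost (the numbers below assume 16-pixel patches taken directly from the image, ignoring the latent-space compression real DiTs apply first):

```python
# Attention cost grows with the square of the token count, so doubling
# image resolution quadruples the number of patches and grows the
# attention matrix ~16x.
def num_patches(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

for size in (256, 512, 1024):
    n = num_patches(size, patch_size=16)
    print(f"{size}px -> {n} patches -> attention matrix {n}x{n} = {n*n:,}")

# 256px  ->  256 patches -> attention matrix 256x256   = 65,536
# 512px  -> 1024 patches -> attention matrix 1024x1024 = 1,048,576
# 1024px -> 4096 patches -> attention matrix 4096x4096 = 16,777,216
```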

MM-DiT: Multi-Modal DiT

Stable Diffusion 3 and Flux use MM-DiT (Multi-Modal DiT), which processes text and image tokens in two separate streams with their own weights. The streams interact through joint attention: at each block, text and image tokens are concatenated and attend over the combined sequence, so every token can influence every other. This gives the text a much richer influence on the image than the original DiT's conditioning, which injected class labels through adaptive layer norm. The text tokens come from frozen text encoders (such as T5 and CLIP), while the image tokens are the noisy latents being denoised.
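Here is a minimal sketch of that joint attention in PyTorch, assuming the dual-stream design described in the SD3 paper. The class and tensor names are illustrative, and the per-stream normalization, MLP, and conditioning sublayers are omitted for brevity.

```python
import torch
import torch.nn as nn

class MMDiTBlock(nn.Module):
    """Joint attention over text + image tokens with per-modality weights."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # Each modality keeps its own QKV and output projections.
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        B, T, D = txt.shape
        _, I, _ = img.shape
        H = self.num_heads

        def split(qkv, n):
            # (B, n, 3D) -> three (B, H, n, D // H) tensors
            return qkv.view(B, n, 3, H, D // H).permute(2, 0, 3, 1, 4)

        tq, tk, tv = split(self.txt_qkv(txt), T)
        iq, ik, iv = split(self.img_qkv(img), I)
        # Joint attention: concatenate both modalities along the sequence
        # axis so text and image tokens attend to each other.
        q = torch.cat([tq, iq], dim=2)
        k = torch.cat([tk, ik], dim=2)
        v = torch.cat([tv, iv], dim=2)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, T + I, D)
        # Split back and apply per-modality output projections.
        return self.txt_out(out[:, :T]), self.img_out(out[:, T:])

block = MMDiTBlock(dim=384, num_heads=6)
txt = torch.randn(2, 77, 384)    # text tokens from a frozen encoder
img = torch.randn(2, 256, 384)   # noisy image patch tokens
txt_out, img_out = block(txt, img)
```

Keeping separate projection weights per modality while sharing the attention operation is what lets each stream specialize without cutting off the information exchange between them.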
