Basics

Video Generation

Text-to-Video, AI Video
Generating video from text descriptions, images, or other video using AI models. Sora (OpenAI), Kling (Kuaishou), Runway Gen-3, Vidu, and others generate footage from prompts like "a drone shot flying over a coral reef." This technology extends image generation into the time dimension, adding the challenges of keeping content consistent across frames and producing realistic motion.

Why It Matters

Video generation is the frontier of generative AI: the hardest modality, and the one with the greatest commercial potential. It is starting to reshape filmmaking, advertising, social media, and education. The quality gap between AI and professional video is closing fast; current models produce 5–15 second clips that are sometimes indistinguishable from real footage.

Deep Dive

Most video generation models extend the DiT (Diffusion Transformer) architecture to 3D: instead of processing 2D image patches, they process 3D patches that span both the spatial dimensions and time. The model learns to denoise entire video volumes, maintaining spatial consistency (each frame is internally coherent) and temporal consistency (objects keep their appearance across frames, and motion is smooth and physically plausible). Conditioning works the same way as in image models: text embeddings guide the generation via cross-attention.
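
To make this concrete, below is a minimal PyTorch sketch of the two ideas in this paragraph: 3D patchification of a video volume, and a DiT-style block that applies self-attention over all spacetime tokens plus cross-attention to text embeddings. All class names, dimensions, and patch sizes are hypothetical toy choices, not the internals of Sora, Kling, or any other named model.

```python
# Minimal sketch of 3D patchify + one DiT-style block for video.
# Illustrative only: real systems run on latents from a video VAE and add
# positional embeddings, timestep conditioning, many stacked layers, etc.
import torch
import torch.nn as nn

class VideoPatchify(nn.Module):
    """Split a video volume (B, C, T, H, W) into 3D spacetime patch tokens."""
    def __init__(self, channels=4, dim=256, pt=2, ph=8, pw=8):
        super().__init__()
        # A strided Conv3d is the usual trick for non-overlapping 3D patches.
        self.proj = nn.Conv3d(channels, dim, kernel_size=(pt, ph, pw),
                              stride=(pt, ph, pw))

    def forward(self, video):
        x = self.proj(video)                 # (B, dim, T/pt, H/ph, W/pw)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

class DiTBlock(nn.Module):
    """Self-attention over spacetime tokens + cross-attention to text."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1 = nn.LayerNorm(dim)
        self.n2 = nn.LayerNorm(dim)
        self.n3 = nn.LayerNorm(dim)

    def forward(self, x, text_emb):
        # Attention across space AND time enforces both kinds of consistency.
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Text conditioning enters via cross-attention, as with image DiTs.
        x = x + self.cross_attn(self.n2(x), text_emb, text_emb)[0]
        return x + self.mlp(self.n3(x))

video = torch.randn(1, 4, 16, 64, 64)  # 16 latent frames of 64x64, 4 channels
text = torch.randn(1, 77, 256)         # stand-in for projected text embeddings
tokens = VideoPatchify()(video)        # (1, 8*8*8, 256) = 512 spacetime tokens
out = DiTBlock()(tokens, text)
print(out.shape)                       # torch.Size([1, 512, 256])
```

A full model stacks many such blocks and iteratively denoises the token volume; the sketch only shows how a single block mixes information across space, time, and the text condition.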

The Compute Challenge

Video generation is extraordinarily compute-intensive. A 10-second video at 30fps is 300 frames, roughly 300x the work of a single image, plus the added challenge of temporal coherence. Training video models requires video datasets (harder to curate than image datasets) and GPU clusters on the scale of large LLM training runs. This compute burden is a major reason video generation quality lags image generation by roughly two years.
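
A back-of-the-envelope calculation makes the scaling vivid. The latent resolution and patch sizes below are assumed, illustrative values, not the settings of any specific model; the point is the quadratic growth of attention cost with token count.

```python
# Token math for full spacetime attention, using assumed toy numbers.
frames = 10 * 30                  # 10 s at 30 fps = 300 frames
h = w = 64                        # latent resolution per frame
pt, ph, pw = 2, 8, 8              # 3D patch size: (time, height, width)

image_tokens = (h // ph) * (w // pw)          # one frame -> 64 tokens
video_tokens = (frames // pt) * image_tokens  # 150 * 64 = 9,600 tokens

# Self-attention cost scales with the square of the token count.
ratio = (video_tokens / image_tokens) ** 2
print(f"tokens: image={image_tokens}, video={video_tokens}")
print(f"full spacetime attention costs ~{ratio:,.0f}x one image's attention")
# -> ~22,500x; this is why many models factorize attention over space and
#    time (or compress the latent aggressively) rather than attending over
#    the full volume at once.
```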

Current Limitations

Today's models struggle with long durations (most max out at 5–15 seconds), complex multi-object interactions, physical plausibility (objects sometimes float or deform), consistent character identity across cuts, and fine-grained text control. The technology is impressive for B-roll, establishing shots, and creative exploration, but not yet reliable enough for narrative filmmaking, where specific actions, expressions, and timing matter.
