Basics

Video Generation

Text-to-Video, AI Video
Generating videos from text descriptions, images, or other videos with AI models. Sora (OpenAI), Kling (Kuaishou), Runway Gen-3, Vidu, and others generate video from prompts like "drone footage flying over a coral reef." The technology extends image generation into the time dimension, adding the challenges of keeping objects consistent across frames and generating realistic motion.

Why It Matters

Video generation is the frontier of generative AI: the hardest modality, and the one with the greatest commercial potential. It is starting to change filmmaking, advertising, social media, and education. The quality gap between AI and professional video is closing fast; current models produce 5–15 second clips that are sometimes indistinguishable from real footage.

Deep Dive

Most video generation models extend the DiT (Diffusion Transformer) architecture to 3D: instead of processing 2D image patches, they process 3D patches that span both spatial dimensions and time. The model learns to denoise entire video volumes, maintaining spatial consistency (objects look the same across frames) and temporal consistency (motion is smooth and physically plausible). Conditioning works similarly to images: text embeddings guide the generation via cross-attention.
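As a concrete illustration of the 3D-patch idea, here is a minimal PyTorch-style sketch of spatiotemporal patch embedding. The patch sizes, channel counts, and class name are assumptions chosen for illustration, not details of Sora or any other named model:

```python
import torch
import torch.nn as nn

class SpatioTemporalPatchEmbed(nn.Module):
    """Turn a video latent (B, C, T, H, W) into a sequence of 3D patch tokens.

    Illustrative sketch: patch sizes and dimensions are assumptions,
    not taken from any published model.
    """
    def __init__(self, in_channels=4, embed_dim=768,
                 patch_t=2, patch_h=2, patch_w=2):
        super().__init__()
        # A Conv3d with stride equal to its kernel size tiles the volume
        # into non-overlapping (patch_t, patch_h, patch_w) blocks and
        # projects each block to a single embed_dim token.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=(patch_t, patch_h, patch_w),
                              stride=(patch_t, patch_h, patch_w))

    def forward(self, x):
        # x: (B, C, T, H, W) video latent
        x = self.proj(x)                  # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, N, D) with N = T'*H'*W'
        return x

# Example: a 16-frame 32x32 latent becomes 8*16*16 = 2048 tokens.
tokens = SpatioTemporalPatchEmbed()(torch.randn(1, 4, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 768])
```

Each output token summarizes one small space-time patch; the transformer backbone then runs attention over this token sequence, with text embeddings injected via cross-attention, just as in image DiTs.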

The Compute Challenge

Video generation is extraordinarily compute-intensive. A 10-second video at 30fps is 300 frames — 300x the work of a single image, plus the additional challenge of temporal coherence. Training video models requires video datasets (harder to curate than image datasets) and GPU clusters that make LLM training look modest. This compute requirement is why video generation quality lags behind image generation by roughly 2 years.
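Some back-of-the-envelope arithmetic makes the scaling concrete. The latent resolution and patch sizes below are assumed for illustration:

```python
# Rough token/attention arithmetic for a video diffusion transformer.
# All sizes are assumptions chosen for illustration.
frames, fps = 300, 30            # 10 s at 30 fps
h = w = 64                       # latent resolution after a VAE encoder
pt, ph, pw = 2, 2, 2             # 3D patch size

image_tokens = (h // ph) * (w // pw)          # tokens for one frame
video_tokens = (frames // pt) * image_tokens  # tokens for the whole clip

print(image_tokens)  # 1024 tokens for a single image
print(video_tokens)  # 153600 tokens for the 10 s clip: 150x more
# Full self-attention is O(N^2), so attention cost grows ~150^2 = 22500x.
# This is one reason many video models factorize attention into separate
# spatial and temporal passes, or cap clips at a few seconds.
```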

Current Limitations

Today's models struggle with long durations (most max out at 5–15 seconds), complex multi-object interactions, physical plausibility (objects sometimes float or deform), consistent character identity across cuts, and fine-grained text control. The technology is impressive for b-roll, establishing shots, and creative exploration, but it is not yet reliable enough for narrative filmmaking, where specific actions, expressions, and timing matter.
