NVIDIA released SANA-WM, a 2.6-billion-parameter open-source world model that takes one 720p image plus a 6-DoF camera trajectory as input and produces a 60-second 720p video. The release is specific about the details that usually go undisclosed: training used 64 H100 GPUs for roughly 18.5 days; the training set is 212,975 video clips from seven datasets (SpatialVID-HQ, DL3DV real and synthetic, OmniWorld, Sekai Game and Walking-HQ, MiraData) with metric-scale 6-DoF camera annotations; the code is Apache 2.0 licensed at github.com/NVlabs/Sana; the arXiv preprint is 2605.15178; and a distilled inference variant produces a full 60-second clip in 34 GPU-seconds on a single RTX 5090 with NVFP4 quantization. That last number is the headline: minute-long 720p video generation on consumer hardware, faster than real time.
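
The headline figures reduce to simple arithmetic. The sketch below uses only the numbers quoted above; the constant and function names are illustrative and not part of any SANA-WM API, and the real-time factor assumes that on a single card GPU-seconds equal wall-clock seconds.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the release.
# All names are illustrative; nothing here calls SANA-WM code.

TRAIN_GPUS = 64        # H100 GPUs used for training
TRAIN_DAYS = 18.5      # approximate training duration in days
CLIP_SECONDS = 60      # length of one generated video
GEN_GPU_SECONDS = 34   # distilled + NVFP4 variant on one RTX 5090


def training_gpu_hours(gpus: int, days: float) -> float:
    """Total GPU-hours for a run of `gpus` devices over `days` days."""
    return gpus * days * 24


def real_time_factor(clip_seconds: float, gen_seconds: float) -> float:
    """Seconds of video produced per second of generation (>1 beats real time)."""
    return clip_seconds / gen_seconds


print(f"training compute: {training_gpu_hours(TRAIN_GPUS, TRAIN_DAYS):,.0f} H100-hours")
print(f"real-time factor: {real_time_factor(CLIP_SECONDS, GEN_GPU_SECONDS):.2f}x")
```

On these inputs the training run comes to about 28,400 H100-hours, and the distilled variant generates video at roughly 1.76 times playback speed.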

The architecture is where the cost reduction lives. SANA-WM is a Diffusion Transformer operating on latent frames from the LTX2-VAE encoder; the backbone is 20 transformer layers, split into 15 frame-wise Gated DeltaNet (GDN) blocks interleaved with 5 standard softmax attention blocks. Sixty seconds at 720p compresses to 961 latent frames, and standard softmax attention scales as O(n²) in memory over that sequence length, which is what has kept previous open-source world models off single-GPU deployment. GDN replaces most of those blocks with a constant-size recurrent state of dimension D×D, so the memory carried from frame to frame is O(1) regardless of video length. That swap is the engineering decision that makes minute-scale 720p on a 32 GB card possible. Two camera-conditioning branches handle 6-DoF control. A coarse UCPE pass injects a ray-local camera basis, derived from camera-to-world pose and intrinsics, into the attention heads. A fine Plücker pass addresses the mismatch introduced by compressing 8 video frames into each latent frame: it computes pixel-wise Plücker raymaps (6D direction-and-moment pairs) and packs them into 48-channel tensors that are injected after self-attention.
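
Two ideas in this paragraph are easy to sketch in NumPy. The first function is the generic gated delta rule from the Gated DeltaNet literature, shown only to illustrate that the carried state stays D×D however long the sequence is; it is not SANA-WM's frame-wise implementation. The other two build a Plücker raymap under assumed conventions (pinhole intrinsics `K`, a 4×4 camera-to-world matrix, rays stored as direction followed by moment `o × d`) and fold 8 frames of 6 channels into 48 channels. All function names are hypothetical, and the sketch ignores spatial downsampling to latent resolution and any special handling of the first frame.

```python
import numpy as np


def gated_delta_scan(q, k, v, alpha, beta):
    """Generic gated delta rule: one (D, D) state, whatever the length T.

    q, k, v: (T, D) arrays; alpha, beta: (T,) gates in [0, 1].
    Returns the (T, D) outputs and the final (D, D) state.
    """
    T, D = k.shape
    S = np.zeros((D, D))                 # the only memory carried across steps
    out = np.empty((T, D))
    for t in range(T):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-8)   # unit-norm key
        # Decay the state, erase what was stored under kt, then write v[t].
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out, S


def plucker_raymap(K, c2w, height, width):
    """Per-pixel Plücker rays (direction, moment) for one camera: (6, H, W)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    dirs = pix @ np.linalg.inv(K).T                    # back-project through intrinsics
    dirs = dirs @ c2w[:3, :3].T                        # rotate into the world frame
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = c2w[:3, 3]                                # camera center in world frame
    moment = np.cross(origin, dirs)                    # o x d, broadcast over pixels
    return np.concatenate([dirs, moment], axis=-1).transpose(2, 0, 1)


def pack_raymaps(raymaps, frames_per_latent=8):
    """Fold (T, 6, H, W) per-frame raymaps into (T // 8, 48, H, W)."""
    T, C, H, W = raymaps.shape
    if T % frames_per_latent != 0:
        raise ValueError("frame count must be a multiple of frames_per_latent")
    return raymaps.reshape(T // frames_per_latent, frames_per_latent * C, H, W)


# Usage: 16 frames from a static camera become 2 latent-rate tensors of 48 channels.
K = np.array([[100.0, 0.0, 8.0], [0.0, 100.0, 6.0], [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]
frames = np.stack([plucker_raymap(K, c2w, 12, 16) for _ in range(16)])
print(pack_raymaps(frames).shape)
```

The recurrent state in `gated_delta_scan` never grows with `T`, which is the property the paragraph credits for fitting minute-long sequences on one card; a softmax attention block would instead hold scores for every pair of positions.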

Reported benchmarks: for camera accuracy, 4.50° rotation error on the simple split and 8.34° on the hard split; for video quality, VBench Overall scores of 80.62 and 81.89 on the two splits. The throughput comparison NVIDIA highlights is 22 videos per hour on 8 H100s for the full pipeline including the refiner, about 36× the published rate of the LingBot-World 14B+14B stack, which lands around 0.6 videos per hour on equivalent hardware. Three inference variants ship: bidirectional at 49.2 GB for offline batch use, chunk-causal autoregressive at 51.1 GB for streaming generation, and a distilled, NVFP4-quantized variant that fits on a single RTX 5090. The mixed licensing matters: the code is Apache 2.0, but the weights and datasets are under separate licenses documented in the paper's Table 11. Read those before shipping a commercial product on top of SANA-WM.
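
The throughput claim can be checked with a few lines of arithmetic. This is a sketch using only the rates quoted above; the names are illustrative, and the LingBot-World rate is the approximate figure from the text, so the ratio is only as precise as that input.

```python
# Throughput arithmetic from the quoted figures; names are illustrative.

H100_COUNT = 8
SANA_WM_VIDEOS_PER_HOUR = 22.0    # full pipeline including the refiner
LINGBOT_VIDEOS_PER_HOUR = 0.6     # approximate published rate


def speedup(rate_a: float, rate_b: float) -> float:
    """How many times faster rate_a is than rate_b."""
    return rate_a / rate_b


def gpu_seconds_per_video(gpus: int, videos_per_hour: float) -> float:
    """GPU-seconds consumed per video across the whole node."""
    return gpus * 3600 / videos_per_hour


print(f"speedup: {speedup(SANA_WM_VIDEOS_PER_HOUR, LINGBOT_VIDEOS_PER_HOUR):.1f}x")
print(f"H100 GPU-seconds per video: "
      f"{gpu_seconds_per_video(H100_COUNT, SANA_WM_VIDEOS_PER_HOUR):.0f}")
```

The ratio works out to about 36.7, consistent with the roughly 36× claim given that 0.6 is itself rounded. The full pipeline costs about 1,309 H100 GPU-seconds per video, a different configuration from the 34 GPU-second distilled variant, so the two figures should not be compared directly.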

For builders considering video generation in their stack: this is the first credible open-source world model where the inference economics are reasonable on consumer hardware and the methodology is fully disclosed. The figure of 34 GPU-seconds per video on a $1,999 consumer card changes the cost curve for any product that wants to generate camera-controlled video at scale: robotics simulation, game prototyping, virtual scouting, animation tooling. The hard part remaining is dataset and weight licensing, not compute. It is worth running on your own evaluation tasks; the per-GPU-hour math suggests it is the first open release where you actually can. Watch for third-party reproduction of the VBench numbers and especially the camera-accuracy figures, which are the metrics that matter for any downstream application that depends on faithful trajectory tracking rather than just plausible-looking video.
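
The per-GPU-hour math can be sketched as follows. The 34 GPU-seconds and the $1,999 card price come from the text; the service life is a hypothetical placeholder to replace with your own estimate, and the calculation covers hardware amortization only, leaving out electricity, hosting, and idle time.

```python
# Hardware-only cost sketch. SERVICE_HOURS is a hypothetical assumption,
# not a figure from the release; substitute your own estimate.

CARD_PRICE_USD = 1999.0
GEN_GPU_SECONDS = 34.0
SERVICE_HOURS = 10_000.0   # hypothetical hours of generation over the card's life


def videos_per_gpu_hour(gpu_seconds: float) -> float:
    """Videos one GPU produces per hour at full utilization."""
    return 3600 / gpu_seconds


def amortized_cost_per_video(card_price: float, service_hours: float,
                             gpu_seconds: float) -> float:
    """Card price spread over every video generated in its service life."""
    total_videos = service_hours * videos_per_gpu_hour(gpu_seconds)
    return card_price / total_videos


print(f"{videos_per_gpu_hour(GEN_GPU_SECONDS):.1f} videos per GPU-hour")
print(f"${amortized_cost_per_video(CARD_PRICE_USD, SERVICE_HOURS, GEN_GPU_SECONDS):.4f} "
      f"hardware cost per video")
```

At 34 GPU-seconds per clip a single card produces roughly 106 videos per hour at full utilization, so under most plausible service-life assumptions the amortized hardware cost per video is a fraction of a cent.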