Netflix researchers released VOID (Video Object and Interaction Deletion), an AI model that removes objects from videos while maintaining physical realism. Built on CogVideoX and fine-tuned with "quadmask conditioning," VOID goes beyond standard video inpainting by understanding causality—when you remove a person holding a guitar, the guitar falls naturally instead of floating in mid-air. The model outperformed existing tools including ProPainter, DiffuEraser, and Runway on both synthetic and real video tests.
This matters because current video editing workflows hit a brutal wall with physics. Hollywood VFX teams spend weeks manually fixing interactions after object removal—making sure shadows disappear, reflections update, and objects obey gravity. VOID automates this by reasoning about scene dynamics rather than just filling pixels. It's the difference between sophisticated background painting and understanding how the world actually works when objects interact.
The technical approach is straightforward but clever: take a proven video generation model (CogVideoX-Fun-V1.5-5b-InP from Alibaba PAI) and teach it to think about physical relationships through specialized mask conditioning. The "quadmask" system helps the model understand not just what to remove, but what secondary effects should follow. Netflix's decision to open-source this suggests they're confident in their lead and want to accelerate adoption across the industry.
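The article doesn't spell out what the four masks in "quadmask" conditioning are, but the general recipe for mask-conditioned inpainting models is well established: concatenate the masked video with binary mask channels and feed the result to the model's input layer. Below is a minimal NumPy sketch of that pattern with four hypothetical mask channels; the channel names (object, contact, effects, keep) are guesses for illustration, not Netflix's actual design.

```python
import numpy as np

# Toy dimensions: frames, height, width.
T, H, W = 8, 32, 32
video = np.random.rand(T, 3, H, W).astype(np.float32)  # RGB frames

# Four hypothetical binary mask channels (names are assumptions):
# object to remove, regions it physically contacts, secondary effects
# (shadows/reflections), and background to preserve.
object_mask  = np.zeros((T, 1, H, W), np.float32)
contact_mask = np.zeros((T, 1, H, W), np.float32)
effects_mask = np.zeros((T, 1, H, W), np.float32)
keep_mask    = np.ones((T, 1, H, W), np.float32)

object_mask[:, :, 8:16, 8:16] = 1.0  # mark a region to delete
keep_mask -= object_mask             # everything else is kept

# Standard inpainting-style conditioning: zero out the object, then stack
# the masked frames with the mask channels along the channel axis.
masked_video = video * (1.0 - object_mask)
cond = np.concatenate(
    [masked_video, object_mask, contact_mask, effects_mask, keep_mask],
    axis=1,
)  # shape (T, 3 + 4, H, W), fed to the video model's input convolution

print(cond.shape)  # (8, 7, 32, 32)
```

The point of the extra channels is that the model sees not only *where* to fill pixels but separate signals for the interactions and secondary effects it must update, which is what distinguishes this setup from single-mask inpainting.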
For developers, this signals that physics-aware video editing is moving from research curiosity to production tool. The model builds on existing infrastructure (CogVideoX) rather than requiring entirely new architectures, making integration more feasible. Expect video editing APIs to start incorporating interaction-aware removal within the next year—the question is whether they'll match Netflix's quality or just claim to understand physics while still producing floating guitars.
