Model overview
void-model removes objects from videos while preserving the physical interactions those objects create with their environment. Unlike simpler removal tools that only erase the object itself, this model understands that removing a person means objects they were holding should fall, or that removing a support structure leaves dependent items displaced. Built by Netflix, it fine-tunes CogVideoX-5b with interaction-aware conditioning using quadmasks that encode what to remove, overlapping regions, affected areas, and what to keep. This is a significant step beyond background removal tools that rely on semantic segmentation alone, with no modeling of physical consequences.
Model inputs and outputs
The model takes a source video, a specialized four-value mask, and a text description of the desired scene. It outputs a new video with the specified object removed and all physical consequences of that removal rendered naturally.
Inputs
- Source video in MP4 format at any resolution
- Quadmask video encoding four regions: primary object to remove (value 0), overlap regions (value 63), affected regions where objects fall or shift (value 127), and background to keep (value 255)
- Text prompt describing the scene after removal
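The quadmask is just a single-channel video whose pixel values label the four regions listed above. As a rough illustration, here is how one frame of such a mask could be assembled with numpy; the `build_quadmask` helper and the priority ordering are assumptions for illustration, but the four values (0, 63, 127, 255) come from the model's input spec.

```python
import numpy as np

# Quadmask values from the model's input spec:
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255

def build_quadmask(remove, overlap, affected, shape):
    """Build one quadmask frame as a uint8 array.

    `remove`, `overlap`, and `affected` are boolean arrays of `shape`
    marking the object to erase, regions where it overlaps other
    objects, and regions whose contents will fall or shift.
    Everything else defaults to background (keep).
    """
    mask = np.full(shape, KEEP, dtype=np.uint8)
    mask[affected] = AFFECTED
    mask[overlap] = OVERLAP
    mask[remove] = REMOVE  # highest priority: always erase
    return mask

# Toy 4x6 frame: remove a "person" occupying columns 1-2 and mark
# column 3 as affected (e.g. a held cup that should fall).
h, w = 4, 6
remove = np.zeros((h, w), bool); remove[:, 1:3] = True
affected = np.zeros((h, w), bool); affected[:, 3] = True
overlap = np.zeros((h, w), bool)

qm = build_quadmask(remove, overlap, affected, (h, w))
```

In practice you would build one such frame per video frame and encode the stack as an MP4 alongside the source clip.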
Outputs
- Inpainted video with the object removed and physics-aware scene changes applied
- Support for up to 197 frames at 384x672 resolution
Capabilities
The model handles counterfactual video generation by understanding object interactions. When you remove a person holding a coffee cup, the cup falls naturally. When you remove a table, objects that were on it shift appropriately. It achieves temporal consistency across longer clips through optional two-pass refinement, where the first pass handles primary inpainting and the second applies optical flow-warped latent initialization. The architecture uses 3D transformers to maintain coherence across frames while managing memory efficiently with BF16 precision and FP8 quantization.
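The flow-warped initialization in the second pass amounts to carrying each generated frame forward along the optical flow before denoising the next one, so the refinement pass starts from a temporally aligned estimate. The sketch below shows the core warping idea on a toy frame; it uses nearest-neighbor backward warping for brevity, whereas the real model presumably warps in latent space with bilinear sampling.

```python
import numpy as np

def warp_with_flow(latent, flow):
    """Backward-warp a frame (H, W, C) by an optical-flow field
    (H, W, 2), where flow[y, x] points from the target pixel back
    to its source. Nearest-neighbor sampling keeps the toy simple."""
    h, w, _ = latent.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return latent[src_y, src_x]

# A uniform flow of (-1, 0) shifts content one pixel to the right:
lat = np.zeros((3, 3, 1)); lat[1, 1, 0] = 1.0
flow = np.zeros((3, 3, 2)); flow[..., 0] = -1.0
warped = warp_with_flow(lat, flow)
```

Initializing each refinement step from the warped previous frame, rather than from noise, is what keeps the second pass from re-hallucinating details that should persist across frames.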
What can I use it for?
Content creators can use this for professional video editing without extensive manual retouching. Film production can remove unwanted objects from scenes while maintaining physical realism. Advertising teams can generate counterfactual scenarios showing how spaces would look with different elements. Research applications include studying how objects interact in physics simulations and generating synthetic datasets for machine learning. The model supports commercial use cases where removing elements while preserving scene coherence creates significant production value.
Things to try
Test the model on videos with clear object interactions like someone holding items, sitting on furniture, or blocking pathways. Start with shorter clips to establish baseline quality before attempting the full 197-frame capacity. Experiment with different text prompts describing the final scene to guide the inpainting process. The two-pass approach works best on longer sequences where temporal consistency becomes critical, so compare single-pass and dual-pass outputs to see the refinement benefit. Try edge cases like removing support structures or objects that occlude large portions of the frame to understand where the physics-aware conditioning shows its advantages over simpler inpainting methods.
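When comparing single-pass and dual-pass outputs, a quick quantitative check is to measure frame-to-frame jitter: the dual-pass result should change less between consecutive frames in regions that ought to be static. The metric below is a simple mean absolute difference, not anything from the model itself, but it is enough to rank two renderings of the same clip.

```python
import numpy as np

def temporal_jitter(frames):
    """Mean absolute difference between consecutive frames of a
    (T, H, W, C) array in [0, 1]; lower suggests smoother,
    more temporally consistent video."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return float(diffs.mean())

# A steady clip scores lower than a flickering one:
steady = np.zeros((5, 4, 4, 3))
flicker = steady.copy()
flicker[1::2] = 1.0  # alternate black and white frames
```

Running `temporal_jitter` on matched single-pass and dual-pass outputs of the same source clip gives a rough sense of how much the refinement pass is actually buying you on a given sequence length.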
This is a simplified guide to an AI model called void-model maintained by Netflix. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.
