The problem nobody solves
You can remove a person from a video now. Existing tools do this well, painting in the background where they stood. But here's what happens next: the glass they were holding floats in empty air. The shadow they cast vanishes, leaving the lighting subtly wrong. They were blocking another person's path, so that person now stands in an impossible place, their motion shaped by a collision that, with the object gone, never happened.
Current video object removal methods excel at the visual task: inpainting pixels and correcting appearance-level artifacts like shadows and reflections. But they miss something fundamental. When the removed object had real interactions with the rest of the scene, such as collisions, occlusions, or other physical relationships, these methods produce results that violate basic physics. Each frame might look acceptable in isolation, but the sequence as a whole feels uncanny because it defies causality.
The issue isn't poor inpainting. Better pixels won't solve this. The problem runs deeper. Existing approaches treat object removal as a visual problem, processing frames independently or smoothing transitions. They never ask the question that matters: if this object had never existed at all, what would the physics of this scene actually be?
Reframing the problem as counterfactual reasoning
This is where the key insight comes in. VOID treats object removal not as inpainting but as counterfactual reasoning. Instead of asking "what pixels should fill this hole?", it asks "if this object had never been there, what would the world look like?"
This reframing changes everything. A collision that didn't happen means different positions and velocities propagate forward through time. Light that wasn't blocked means different illumination across downstream frames. An object that wasn't held means it wasn't there to constrain motion. These aren't visual problems to patch; they're causal chains to reason about.
The breakthrough is using two specialized tools for this reasoning. A vision-language model identifies which regions of the scene carry consequences of the removed object, understanding causality at a semantic level. Then a video diffusion model, trained to model counterfactual physics, generates what plausibly happens in those affected regions. The model doesn't just erase; it simulates an alternate timeline.
Building training data through simulation
Getting this right requires training data that no one has. You can't easily film a scene, then time-travel and reshoot it without the object to get ground truth. But you can simulate it.
The researchers generated paired datasets of counterfactual object removals using Kubric, a tool for procedurally generating 3D synthetic environments, combined with HUMOTO for physics simulation. The process is elegant: create a scene with an object present and simulate its physics. Separately, create the same scene where the object never existed and simulate that timeline too. Now you have pairs of videos where the only difference is whether a single object appeared, with everything else cascading from that difference.
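The paired-simulation idea can be illustrated with a toy example. Everything below is a stand-in for the real pipeline (which builds 3D scenes in Kubric with full physics): here the "scene" is a 1D ball that may or may not hit an obstacle, and the two rollouts agree exactly until the collision and diverge afterward.

```python
def simulate(obstacle_present, steps=20):
    """Toy 1D physics: a ball moves right at constant speed; an
    obstacle at x = 10 reverses its velocity on contact. This is a
    stand-in for a real engine like Kubric — only the paired-data
    idea matters here."""
    x, v = 0.0, 1.0
    trajectory = []
    for _ in range(steps):
        if obstacle_present and x >= 10.0:
            v = -v  # a collision that only exists in the "object present" world
        x += v
        trajectory.append(x)
    return trajectory

def make_counterfactual_pair():
    """Two rollouts of the *same* scene, differing only in whether
    the object ever existed. Timelines match exactly before the
    collision and cascade apart after it."""
    with_obj = simulate(obstacle_present=True)
    without_obj = simulate(obstacle_present=False)
    return with_obj, without_obj
```

The pair differs in exactly one fact, whether the obstacle existed, so every downstream divergence between the two trajectories is a physical consequence of that single change. That is precisely the supervision signal the model trains on.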
This synthetic data becomes the foundation for learning. The model trains on thousands of scenarios where it sees the relationship between "object present" and "object absent" worlds. It's not memorizing pixels; it's learning the implicit physics that connects these timelines. A collision shapes motion differently. An occlusion changes what's visible. Lighting cascades based on what blocks it.
By training on this counterfactual data, the model develops an understanding of how physical consequences ripple through time. When shadows fall differently, when motion changes direction, when light refracts or reflects off new surfaces, these aren't independent frame-by-frame decisions but causal chains the model has learned.
The inference pipeline
When removing an object from real video, two steps unfold. First, a vision-language model looks at the scene and identifies which regions are causally affected by the object's presence. This isn't just the pixels the object occupies; it's everything downstream of its influence. A hand blocking light affects illumination meters away. A collision affects positions frames later. The VLM reasons about these causal consequences without explicit supervision, using its semantic understanding of how the world works.
Second, a video diffusion model, trained on the synthetic counterfactuals, generates physically consistent alternatives for those affected regions. The diffusion process starts from noise and iteratively refines it toward plausible outcomes, conditioned on the surrounding context and the constraint that physics should remain consistent. Because the model learned from counterfactual pairs, it has internalized what physically plausible removal looks like.
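The two-stage pipeline can be sketched structurally. The function names and return values below are hypothetical stubs for illustration, not the paper's actual API; in the real system, stage 1 is a vision-language model and stage 2 a video diffusion model trained on the counterfactual pairs.

```python
from dataclasses import dataclass

@dataclass
class RemovalResult:
    affected_masks: list   # per-frame regions flagged as causally affected
    edited_frames: list    # frames regenerated by the diffusion model

def identify_affected_regions(frames, object_id):
    """Stage 1 (stub): a vision-language model would reason about
    which regions are causally downstream of the object — its pixels,
    its shadow, anything it collided with or occluded."""
    return [{"object": object_id, "shadow": True, "collisions": True}
            for _ in frames]

def generate_counterfactual(frames, masks):
    """Stage 2 (stub): a physics-aware video diffusion model would
    denoise the masked regions toward the 'object never existed'
    timeline, conditioned on the surrounding context and masks."""
    return [f"frame_{i}_counterfactual" for i, _ in enumerate(frames)]

def remove_object(frames, object_id):
    """Chain the two stages: causal region identification, then
    physically consistent regeneration."""
    masks = identify_affected_regions(frames, object_id)
    edited = generate_counterfactual(frames, masks)
    return RemovalResult(affected_masks=masks, edited_frames=edited)
```

The separation matters: the VLM's semantic reasoning decides *where* consequences exist, while the diffusion model decides *what* plausibly fills them, so neither component needs manual annotation of real video.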
This two-stage approach solves a practical problem: you can't manually annotate what should change in real video when an object disappears. By delegating causal reasoning to vision-language models and generation to physics-aware diffusion, the system approximates the counterfactual reasoning that would require human expertise or resimulation.
Testing generalization
The real test is whether this approach actually generalizes beyond the clean, simulated worlds it trained on. Synthetic data is controlled and perfect. Real video is messy, with imperfect physics, complex lighting, and objects that don't fit neat simulation categories.
Experiments on both synthetic holdout data and real-world video show the approach generalizes. Scenes that weren't in training still behave plausibly when objects are removed. More importantly, the model doesn't just work on clean CGI; it handles real footage where shadows are complex, reflections are subtle, and causality is harder to pin down. This suggests the model learned something general about physics and causality rather than overfitting to training-data patterns.
Comparison with prior video object removal methods reveals where those approaches fail. When existing methods remove an object, the resulting scenes look acceptable frame by frame, but temporal coherence breaks down. Shadows appear and disappear unnaturally. Collisions seem to reverse. Lighting flickers. With VOID, these sequences maintain causal consistency. People stand in physically plausible positions. Motion responds to what would have actually happened if the collision never occurred.
What physically plausible removal looks like
The concrete results show what changes when physics is respected. Shadows reappear naturally where light was blocked, not suddenly but with proper falloff. Objects that would have collided move as if that collision never shaped them. People stand where they actually were, not where momentum would have taken them if they'd been hit. Lighting cascades consistently across frames because the model understands what occludes or reflects it.
This matters because it represents a genuine shift in what video editing tools can do. Earlier approaches focused on appearance-only removal produce results that work for simple cases, like removing a person from a static background, but they fail when interactions matter. Later work on removing objects together with their causal visual artifacts addressed those artifacts more comprehensively, yet still within an appearance-focused framework. VOID steps beyond that by building physics reasoning into the model itself.
The larger vision crystallizes here: the goal is to make video editing tools into better simulators of the world. Rather than treating video as pixels to manipulate, the approach treats it as a window into a physical reality. Editing becomes less about visual tricks and more about causal reasoning. Remove an object, and everything that depends on it changes. The model must understand not just how to fill pixels, but how to propagate consequences through time.
This framework sheds light on how to build editing models that don't just fool the eye but respect the laws of physics. By incorporating high-level causal reasoning through vision-language models and physics-aware generation through diffusion models trained on counterfactuals, the system moves closer to genuine understanding. It's not perfect, but it represents a fundamental shift: from "hide the unwanted object" to "simulate the alternate timeline where it never existed."
This is a Plain English Papers summary of a research paper called VOID: Video Object and Interaction Deletion. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
