
NeRF Editing and Inpainting Techniques: 4D Extension


Too Long; Didn't Read

This paper proposes Inpaint4DNeRF to capitalize on state-of-the-art stable diffusion models for direct generation of the underlying completed background content.

Authors:

(1) Han Jiang, HKUST and Equal contribution ([email protected]);

(2) Haosen Sun, HKUST and Equal contribution ([email protected]);

(3) Ruoxuan Li, HKUST and Equal contribution ([email protected]);

(4) Chi-Keung Tang, HKUST ([email protected]);

(5) Yu-Wing Tai, Dartmouth College ([email protected]).

Abstract and 1. Introduction

2. Related Work

2.1. NeRF Editing and 2.2. Inpainting Techniques

2.3. Text-Guided Visual Content Generation

3. Method

3.1. Training View Pre-processing

3.2. Progressive Training

3.3. 4D Extension

4. Experiments and 4.1. Experimental Setups

4.2. Ablation and comparison

5. Conclusion and 6. References

3.3. 4D Extension

Our method extends naturally to 4D by adopting the same idea as in 3D: obtain a raw inference from the seed image, then correct its details with stable diffusion. To infer across frames, we apply multiple existing video editing methods to the input video. We use point tracking to record the original object's motion, which is then applied to animate the novel object generated through our progressive training. Meanwhile, the foreground of the original video is removed to obtain the background of the new scene. This background-only video is combined with the animated novel object to build a raw inpainted video that is temporally consistent. For each frame, we extract the seed image from this seed video and project it to other views. Finally, training images from all views and frames are refined by stable diffusion before being used in training.
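As a concrete illustration of the per-frame refinement step, the sketch below runs a low-strength Stable Diffusion img2img pass over raw frames that have already been composited (background-only frame plus animated novel object). The `diffusers` pipeline, the checkpoint name, and the strength value are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: refine raw composited frames with a light img2img pass.
# Assumptions: raw_frames are already composited PIL images; the checkpoint
# name and strength value are illustrative, not the paper's settings.
from PIL import Image
import torch
from diffusers import StableDiffusionImg2ImgPipeline

def refine_seed_frames(raw_frames: list[Image.Image], prompt: str,
                       strength: float = 0.3) -> list[Image.Image]:
    """Lightly refine each raw frame with Stable Diffusion img2img.
    A low strength keeps the raw geometry, so the temporal consistency
    inherited from point tracking is largely preserved."""
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    refined = []
    for frame in raw_frames:
        out = pipe(prompt=prompt, image=frame, strength=strength).images[0]
        refined.append(out)
    return refined
```

Keeping the denoising strength small is the key design choice here: the diffusion model only cleans up texture and lighting details, while the coarse layout of the raw inpainted video supplies the cross-frame consistency.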


To determine the motion of the new object, since there is no established method for generating background-consistent motion for a generated foreground object, we reuse the original motion of the replaced object. This requires first extracting the original motion and then applying it to the novel object. We use CoTracker [6] to track multiple key points on the target; CoTracker lets us precisely track self-defined key points across the original video. From these tracks we obtain the trajectory followed by the key points and project it onto the generated object consistently with the background information. In this way, we are able to animate the generated object.
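The sketch below shows how such key-point tracks could be obtained with CoTracker through its public torch.hub interface and reduced to coarse per-frame offsets. The hub entry point, the query format, and the simple mean-displacement reduction are assumptions for illustration; the paper does not specify how the projected trajectory is parameterized.

```python
# Hedged sketch: track self-defined key points with CoTracker (torch.hub)
# and derive a coarse per-frame 2D offset to animate the generated object.
# Entry point name and tensor shapes follow the public CoTracker release
# and may differ across versions.
import torch

def track_keypoints(video: torch.Tensor, points_xy: torch.Tensor):
    """video: (1, T, 3, H, W) float tensor; points_xy: (N, 2) float pixel
    coordinates defined on frame 0. Returns tracks of shape (1, T, N, 2)
    and per-point visibility."""
    model = torch.hub.load("facebookresearch/co-tracker", "cotracker2").cuda()
    # Each query is a (frame_index, x, y) triplet for one key point.
    queries = torch.cat(
        [torch.zeros(len(points_xy), 1), points_xy], dim=1
    )[None].cuda()
    pred_tracks, pred_visibility = model(video.cuda(), queries=queries)
    return pred_tracks, pred_visibility

def per_frame_offsets(pred_tracks: torch.Tensor) -> torch.Tensor:
    """Mean displacement of all key points relative to frame 0, usable as a
    coarse rigid 2D motion for the novel object; shape (1, T, 2)."""
    ref = pred_tracks[:, :1]                # (1, 1, N, 2)
    return (pred_tracks - ref).mean(dim=2)  # (1, T, 2)
```

In practice the tracked trajectory would be lifted back into the scene before animating the 3D object; the 2D offsets above are only the simplest reduction of the raw tracks.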


This paper is available on arXiv under a CC 4.0 license.