Authors:
(1) Han Jiang, HKUST, equal contribution ([email protected]);
(2) Haosen Sun, HKUST, equal contribution ([email protected]);
(3) Ruoxuan Li, HKUST, equal contribution ([email protected]);
(4) Chi-Keung Tang, HKUST ([email protected]);
(5) Yu-Wing Tai, Dartmouth College ([email protected]).
2. Related Work
2.1. NeRF Editing and 2.2. Inpainting Techniques
2.3. Text-Guided Visual Content Generation
3.1. Training View Pre-processing
4. Experiments and 4.1. Experimental Setups
5. Conclusion and 6. References
To demonstrate the efficacy of the individual elements of our baseline design, we compare our baseline with the following three variants. To the best of our knowledge, we are the first to target generative NeRF inpainting, so no existing baselines are available for direct comparison. We therefore conduct ablations by modifying parts of our baseline, and conduct comparisons by replacing a subprocess with an existing method.
View-independent inpainting. In our baseline, all training images are pre-processed to be consistent with the first seed image, which guarantees coarse convergence. To show the necessity of this pre-processing strategy, in this variant we inpaint all training images independently. Warmup training and iterative dataset update (IDU) are kept unchanged. The results are shown in Figure 4. Independent inpainting clearly results in divergence across views, and IDU fails to bring the initially highly inconsistent images to convergence. This shows that the pre-processing step is indispensable for convergence.
Instruct-NeRF2NeRF after warmup. The second stage of our NeRF training is largely based on iterative dataset update, which is the core of Instruct-NeRF2NeRF [4]. This raises the question of whether Instruct-NeRF2NeRF can accomplish our task. Since Instruct-NeRF2NeRF is mainly designed for appearance editing without large geometry changes, it is inappropriate to run it directly on our generative inpainting task. Therefore, before running their method, we first run our baseline until the end of warmup training, so that a coarse geometry is available as a prerequisite. With this design, the main difference between the two methods is that Instruct-NeRF2NeRF uses InstructPix2Pix, a non-latent diffusion model, to correct the training images. Figure 5 shows its results compared with ours. The final result retains the blurriness of the warmup NeRF and does not achieve a higher level of consistency.
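The alternation between NeRF optimisation and training-image correction described above can be sketched as a short loop. This is only an illustrative sketch, not the authors' or Instruct-NeRF2NeRF's implementation: the function name `iterative_dataset_update`, the callables `render_view`, `correct_image`, and `train_step`, the fixed `update_every` schedule, and the random view selection are all assumptions. The toy example replaces the NeRF with a single scalar and the diffusion-based correction with an identity function, purely to show how dataset images are gradually pulled toward what the current model renders.

```python
import random
from typing import Callable, List, TypeVar

Image = TypeVar("Image")


def iterative_dataset_update(
    dataset: List[Image],
    render_view: Callable[[int], Image],
    correct_image: Callable[[Image, int], Image],
    train_step: Callable[[List[Image]], None],
    num_steps: int,
    update_every: int,
    rng: random.Random,
) -> List[Image]:
    """Alternate optimisation steps with in-place dataset corrections.

    All callables are hypothetical stand-ins: `train_step` optimises the
    scene representation on the current dataset, `render_view` renders one
    training view, and `correct_image` plays the role of the image
    correction model (a diffusion model in the paper).
    """
    if update_every <= 0:
        raise ValueError("update_every must be positive")
    for step in range(1, num_steps + 1):
        train_step(dataset)
        if step % update_every == 0:
            # Assumed schedule: replace one randomly chosen training image
            # with a corrected version of the current render of that view.
            view = rng.randrange(len(dataset))
            dataset[view] = correct_image(render_view(view), view)
    return dataset


# Toy stand-ins: the "scene" is one scalar, each "image" is one scalar.
state = {"scene": 0.0}


def train_step(images: List[float]) -> None:
    # Move the scene halfway toward the mean of the training images.
    mean = sum(images) / len(images)
    state["scene"] += 0.5 * (mean - state["scene"])


def render_view(view: int) -> float:
    return state["scene"]


def correct_image(rendered: float, view: int) -> float:
    # Identity stand-in for the correction model.
    return rendered


views = [0.0, 1.0, 2.0]  # initially inconsistent training "images"
spread_before = max(views) - min(views)
updated = iterative_dataset_update(
    list(views), render_view, correct_image, train_step,
    num_steps=200, update_every=1, rng=random.Random(0),
)
spread_after = max(updated) - min(updated)
print(spread_before, spread_after)
```

In this toy setting the spread between views shrinks because every replaced image comes from the same shared scene. In the real task the quality of the result also depends on the correction model, which is the component this ablation varies.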
Warmup without depth supervision. The goal of warmup training is to provide an initial convergence, especially in geometry, which requires a depth loss as supervision. To demonstrate this, we provide an example of warmup training without any depth supervision and show the rendered depth map in Figure 6. Compared with the depth map from our standard setting, this depth map has clear errors within and around the target object. Because of these geometric inconsistencies, the later fine training stage may suffer from “floater” and inconsistency artifacts. These can be eliminated by using our simple planar depth map as guidance, even though it is not accurate.
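The role of the depth term can be illustrated with a minimal sketch of a warmup objective that combines a color loss with an optional depth loss against a planar guide. The exact loss form is not given in this section, so the mean-squared-error terms, the function names `mse` and `warmup_loss`, and the weight `depth_weight` are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def warmup_loss(
    rendered_rgb: Sequence[float],
    target_rgb: Sequence[float],
    rendered_depth: Sequence[float],
    guide_depth: Optional[Sequence[float]] = None,
    depth_weight: float = 0.1,  # hypothetical weight, not from the paper
) -> float:
    """Color loss plus an optional depth loss (assumed L2 form)."""
    color_term = mse(rendered_rgb, target_rgb)
    if guide_depth is None:
        # Ablation setting: no depth supervision at all.
        return color_term
    return color_term + depth_weight * mse(rendered_depth, guide_depth)


# Four rays whose colors already match, but whose depths are wrong.
rendered_rgb = [0.5, 0.5, 0.5, 0.5]
target_rgb = [0.5, 0.5, 0.5, 0.5]
rendered_depth = [1.0, 3.0, 1.0, 3.0]
planar_depth = [2.0] * 4  # coarse planar guide: one constant depth

loss_without_depth = warmup_loss(rendered_rgb, target_rgb, rendered_depth)
loss_with_depth = warmup_loss(
    rendered_rgb, target_rgb, rendered_depth, planar_depth
)
print(loss_without_depth, loss_with_depth)
```

Without the depth term, the color-only loss is already zero for this geometry, so nothing pushes the optimisation away from the incorrect depths. With the planar guide, the same rays incur a nonzero penalty even though the guide itself is only a rough approximation.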
This paper is available on arXiv under a CC 4.0 license.