2. Related Works
2.1. 2D Diffusion Models for 3D Generation
2.2. 3D Generative Models and 2.3. Multi-view Diffusion Models
3. Problem Formulation
3.2. The Distribution of 3D Assets
4. Method and 4.1. Consistent Multi-view Generation
5. Experiments
5.4. Single View Reconstruction
5.5. Novel View Synthesis and 5.6. Discussions
6. Conclusions and Future Works, Acknowledgements and References
We train our model on the LVIS subset of the Objaverse dataset [9], which comprises more than 30,000 objects after a cleanup process. Surprisingly, even when fine-tuned on this relatively small dataset, our method generalizes robustly. To create the rendered multi-view dataset, we first normalize each object so that it is centered and of unit scale. We then render normal maps and color images from six views (front, back, left, right, front-right, and front-left) using BlenderProc [11]. To increase dataset diversity, we also apply random rotations to the 3D assets during rendering.
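The normalization and view setup described above can be sketched as follows. This is a minimal illustration, not the authors' rendering code: the choice of bounding-box centering, scaling the longest side to 1, the azimuth values and sign convention, the camera radius, and the helper names `normalize_vertices` and `camera_position` are all assumptions made for the example. The actual rendering is done with BlenderProc and is not reproduced here.

```python
import numpy as np

# Hypothetical azimuths (degrees) for the six views named in the text.
# The 45-degree offsets and the sign convention are assumptions.
VIEW_AZIMUTHS = {
    "front": 0.0,
    "front_right": 45.0,
    "right": 90.0,
    "back": 180.0,
    "left": -90.0,
    "front_left": -45.0,
}


def normalize_vertices(vertices):
    """Center the bounding box at the origin and scale its longest side to 1.

    Assumption: "centered and of unit scale" is interpreted via the
    axis-aligned bounding box; the source does not specify the convention.
    """
    v = np.asarray(vertices, dtype=np.float64)
    lo, hi = v.min(axis=0), v.max(axis=0)
    center = (lo + hi) / 2.0
    extent = (hi - lo).max()
    if extent == 0:
        raise ValueError("degenerate mesh: all vertices coincide")
    return (v - center) / extent


def camera_position(azimuth_deg, radius=2.0, elevation_deg=0.0):
    """Place a camera on a sphere around the origin (y up, front at +z).

    The radius and zero elevation are illustrative defaults only.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    return radius * np.array(
        [np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)]
    )


if __name__ == "__main__":
    # Toy "mesh": two opposite corners of a 2 x 4 x 6 box.
    verts = normalize_vertices([[0, 0, 0], [2, 4, 6]])
    print(verts)
    for name, az in VIEW_AZIMUTHS.items():
        print(name, np.round(camera_position(az), 3))
```

A random rotation for augmentation could be applied to the normalized vertices before computing the camera positions; how the rotations are sampled is not stated in the text, so it is left out of the sketch.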
We fine-tune our model starting from the Stable Diffusion Image Variations model, which has previously been fine-tuned with image conditions. We retain the optimizer settings and ϵ-prediction strategy from that earlier fine-tuning. During fine-tuning, we use a reduced image size of 256 × 256 and a total batch size of 512. We train the model for 30,000 steps, which takes approximately 3 days on a cluster of 8 NVIDIA Tesla A800 GPUs. To reconstruct 3D geometry from the 2D representations, we build on the Instant-NGP-based SDF reconstruction method [19].
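To make the training budget concrete, the reported hyperparameters can be collected in a small configuration object and a few quantities derived from them. This is only a back-of-the-envelope sketch: the class name `FineTuneConfig` is hypothetical, and the per-GPU batch size assumes an even data-parallel split without gradient accumulation, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in the text (hypothetical container)."""

    image_size: int = 256          # training resolution, 256 x 256
    total_batch_size: int = 512    # summed over all GPUs
    train_steps: int = 30_000
    num_gpus: int = 8
    wall_clock_days: float = 3.0   # "approximately 3 days"

    @property
    def per_gpu_batch(self) -> int:
        # Assumption: even split across GPUs, no gradient accumulation.
        return self.total_batch_size // self.num_gpus

    @property
    def total_samples(self) -> int:
        # Number of training samples processed over the whole run.
        return self.total_batch_size * self.train_steps

    @property
    def seconds_per_step(self) -> float:
        # Average wall-clock time per optimizer step.
        return self.wall_clock_days * 24 * 3600 / self.train_steps


cfg = FineTuneConfig()
print(cfg.per_gpu_batch, cfg.total_samples, round(cfg.seconds_per_step, 2))
```

Under these assumptions, each GPU handles 64 samples per step, the run processes about 15.4 million samples in total, and a step takes roughly 8.6 seconds on average.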
This paper is available on arXiv under the CC BY-NC-ND 4.0 license.
Authors:
(1) Xiaoxiao Long, The University of Hong Kong, VAST, and MPI Informatik (equal contribution);
(2) Yuan-Chen Guo, Tsinghua University and VAST (equal contribution);
(3) Cheng Lin, The University of Hong Kong (corresponding author);
(4) Yuan Liu, The University of Hong Kong;
(5) Zhiyang Dou, The University of Hong Kong;
(6) Lingjie Liu, University of Pennsylvania;
(7) Yuexin Ma, ShanghaiTech University;
(8) Song-Hai Zhang, The University of Hong Kong;
(9) Marc Habermann, MPI Informatik;
(10) Christian Theobalt, MPI Informatik;
(11) Wenping Wang, Texas A&M University (corresponding author).