2. Related Works
2.1. 2D Diffusion Models for 3D Generation
2.2. 3D Generative Models and 2.3. Multi-view Diffusion Models
3. Problem Formulation
3.2. The Distribution of 3D Assets
4. Method and 4.1. Consistent Multi-view Generation
5. Experiments
5.4. Single View Reconstruction
5.5. Novel View Synthesis and 5.6. Discussions
6. Conclusions and Future Works, Acknowledgements and References
We evaluate the quality of novel view synthesis for the different methods. Quantitative results are presented in Table 2, and qualitative results in Figure 3. Zero123 [31] produces visually plausible images, but they lack multi-view consistency since the model processes each view independently. Although SyncDreamer [31] introduces a volume attention scheme to enhance the consistency of multi-view images, the model is sensitive to the elevation angle of the input image and tends to produce implausible results. In contrast, our method generates images that not only remain semantically consistent with the input image but also maintain a high degree of consistency across views in terms of both color and geometry.
In this section, we conduct a set of studies to verify the effectiveness of our designs as well as the properties of the method.
Cross-Domain Diffusion. To validate the effectiveness of our proposed cross-domain diffusion scheme, we study the following settings: (a) cross-domain model with cross-domain attention; (b) cross-domain model without cross-domain attention; (c) sequential model rgb-to-normal: first train a multi-view color diffusion model, then train a multi-view normal diffusion model conditioned on the previously generated color images; (d) sequential model normal-to-rgb: first train a multi-view normal diffusion model, then train a multi-view color diffusion model conditioned on the previously generated normal images.
As shown in (a) and (b) of Figure 7, the cross-domain attention significantly enhances the consistency between color images and normal maps, particularly for the detailed geometry of objects such as the ice cream and the Pharaoh sculpture. From (c) and (d) of Figure 7, while the normals and color images generated by the sequential models maintain some consistency, their results suffer from performance drops. For the sequential model rgb-to-normal, which conditions on the separately generated normal maps, the generated color images exhibit color aberrations relative to the input image, as shown in (c) of Figure 7. Conversely, for the sequential model normal-to-rgb, which conditions on the separately generated color images, the normal maps give unreasonable geometry, as illustrated in (d) of Figure 7. These experiments demonstrate that jointly predicting normal maps and color images through the cross-domain attention mechanism facilitates a comprehensive perception of information from both domains. We also speculate that, in the sequential models, the color images or normal maps generated in stage 1 may exhibit a minor domain gap relative to the ground-truth data used to train stage 2. Compared to sequential prediction, the cross-domain approach therefore proves more effective at enhancing the quality of each domain as well as the overall prediction.
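The core idea of the cross-domain attention described above can be sketched as a joint attention over the concatenated color and normal tokens, so that queries from either domain attend to keys and values from both. The following minimal numpy sketch uses a single head and identity projections standing in for the learned query/key/value weights; all function and variable names are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_domain_attention(color_tokens, normal_tokens):
    """Joint attention over concatenated color and normal tokens.

    Each domain's tokens attend to both domains, which is what lets
    appearance and geometry inform each other during denoising.
    Shapes: (seq_len, dim) per domain; single head for clarity.
    """
    x = np.concatenate([color_tokens, normal_tokens], axis=0)  # (2S, D)
    d = x.shape[-1]
    # Identity projections stand in for learned Wq, Wk, Wv.
    attn = softmax(x @ x.T / np.sqrt(d))   # (2S, 2S) joint attention weights
    out = attn @ x                          # each token mixes in both domains
    s = color_tokens.shape[0]
    return out[:s], out[s:]                 # split back into the two domains
```

Removing setting (b)'s attention would amount to running each domain's tokens through attention separately, which is exactly the information flow the joint weight matrix restores.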
Multi-view Consistency. We analyze the effectiveness of the multi-view attention mechanism, as illustrated in Figure 9. Our findings show that multi-view attention greatly enhances the 3D consistency of the generated multi-view images, particularly for the rear views. Without multi-view attention, the color images of the rear views exhibit unrealistic predictions.
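A minimal sketch of the kind of cross-view tying described above: tokens from all views attend to one another, so information from the well-constrained front views can propagate to the rear views. This numpy version attends across the view axis at each token position, with a single head and identity projections; the names and the per-position looping are illustrative assumptions, not the paper's actual layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiview_attention(tokens):
    """Attention across views, tying the V views together.

    tokens: (V, S, D) — V views, S tokens per view, D channels.
    At each token position, the V view tokens attend to each other,
    which is what enforces consistency between front and rear views.
    """
    V, S, D = tokens.shape
    out = np.empty_like(tokens)
    for s in range(S):
        x = tokens[:, s, :]                       # (V, D): one position, all views
        attn = softmax(x @ x.T / np.sqrt(D))      # (V, V) cross-view weights
        out[:, s, :] = attn @ x                   # mix information across views
    return out
```

One sanity property of this formulation: if all views already carry identical tokens, the attention weights are uniform and the output equals the input, so the layer only redistributes information when the views disagree.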
Normal Fusion. To assess the efficacy of our normal fusion algorithm, we conducted experiments using the complex lion model, which is rich in geometric detail, as illustrated in Figure 8. The baseline model's surfaces exhibit numerous holes and noise. Utilizing either the geometry-aware normal loss or the outlier-dropping loss helps mitigate the noisy surfaces. Combining both strategies yields the best performance, resulting in clean surfaces while preserving detailed geometry.
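The outlier-dropping idea can be sketched as follows. This section does not give the exact formulation, so the snippet below assumes an L1 residual between rendered and predicted normals with the largest fraction of per-pixel residuals excluded from the mean, so that locally inconsistent normal predictions cannot drag the fused surface; the function name and the 10% drop fraction are hypothetical.

```python
import numpy as np

def outlier_dropping_loss(pred_normals, target_normals, drop_frac=0.1):
    """L1 normal loss with the largest `drop_frac` of residuals dropped.

    pred_normals, target_normals: (H, W, 3) normal maps.
    Per-pixel residuals are sorted and the worst fraction is excluded,
    so a few wildly inconsistent pixels do not corrupt the fusion.
    """
    res = np.abs(pred_normals - target_normals).sum(axis=-1)  # (H, W) per pixel
    k = int(res.size * (1.0 - drop_frac))                     # pixels to keep
    kept = np.sort(res.ravel())[:k]                           # drop the largest
    return kept.mean()
```

Pairing such a trimmed loss with a geometry-aware weighting of the normals, as the ablation above does, targets the two failure modes separately: the trimming suppresses isolated outliers, while the weighting down-weights normals seen at grazing angles.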
Generalization. To demonstrate the generalization capability of our method, we conducted evaluations using diverse image styles, including sketches, cartoons, and images of animals, as shown in Figure 5 and Figure 10. Despite variations in lighting effects and geometric complexities among these images, our method consistently generated multi-view normal maps and color images, ultimately yielding high-quality geometries.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Xiaoxiao Long, The University of Hong Kong, VAST, MPI Informatik (Equal Contribution);
(2) Yuan-Chen Guo, Tsinghua University, VAST (Equal Contribution);
(3) Cheng Lin, The University of Hong Kong (Corresponding Author);
(4) Yuan Liu, The University of Hong Kong;
(5) Zhiyang Dou, The University of Hong Kong;
(6) Lingjie Liu, University of Pennsylvania;
(7) Yuexin Ma, Shanghai Tech University;
(8) Song-Hai Zhang, The University of Hong Kong;
(9) Marc Habermann, MPI Informatik;
(10) Christian Theobalt, MPI Informatik;
(11) Wenping Wang, Texas A&M University (Corresponding Author).