
Wonder3D: What Is Cross-Domain Diffusion?

by Ringi, January 1st, 2025

Too Long; Didn't Read

Our model is built upon a pre-trained 2D stable diffusion model [45] to leverage its strong generalization.

Abstract and 1 Introduction

2. Related Works

2.1. 2D Diffusion Models for 3D Generation

2.2. 3D Generative Models and 2.3. Multi-view Diffusion Models

3. Problem Formulation

3.1. Diffusion Models

3.2. The Distribution of 3D Assets

4. Method and 4.1. Consistent Multi-view Generation

4.2. Cross-Domain Diffusion

4.3. Textured Mesh Extraction

5. Experiments

5.1. Implementation Details

5.2. Baselines

5.3. Evaluation Protocol

5.4. Single View Reconstruction

5.5. Novel View Synthesis and 5.6. Discussions

6. Conclusions and Future Works, Acknowledgements and References

4.2. Cross-Domain Diffusion

Our model is built upon a pre-trained 2D stable diffusion model [45] to leverage its strong generalization. However, current 2D diffusion models [31, 45] are designed for a single domain, so the main challenge is how to effectively extend stable diffusion models so that they can operate on more than one domain.


Naive Solutions. To achieve this goal, we explore several possible designs. A straightforward solution is to add four more channels to the output of the UNet module to represent the extra domain, so that the diffusion model can output the normal and color image domains simultaneously. However, we observe that such a design suffers from slow convergence and poor generalization. This is because the channel expansion perturbs the pre-trained weights of the stable diffusion models and thereby causes catastrophic forgetting.
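As an illustrative sketch of this naive design (not the paper's actual code), widening a pre-trained output layer might look like the following. The weight shape and channel counts are hypothetical stand-ins for the UNet's final projection; even when the new normal-map rows are initialized from the pre-trained color rows, jointly fine-tuning both domains still perturbs the shared backbone, which is the forgetting issue described above.

```python
import numpy as np

# Hypothetical sketch: the UNet's final projection maps hidden features to
# 4 latent channels (color only); widening it to 8 channels (color + normal)
# means adding rows to a pre-trained convolution weight of shape
# (out_channels, in_channels, kh, kw). All shapes here are illustrative.
def expand_output_weight(w: np.ndarray, extra: int = 4) -> np.ndarray:
    # Initialize the extra normal-map channels as copies of the color
    # channels, so the expanded layer starts close to the pre-trained one.
    return np.concatenate([w, w[:extra]], axis=0)

pretrained = np.random.randn(4, 320, 3, 3)
expanded = expand_output_weight(pretrained)
print(expanded.shape)  # (8, 320, 3, 3)
```

Even with this careful initialization, gradients from the new domain flow into every shared layer during fine-tuning, which is why the approach converges slowly and generalizes poorly.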



Figure 4. The illustration of the structure of the multi-view cross-domain transformer block.


Domain Switcher. To extend the model to two domains without disrupting its pre-trained weights, we instead condition the model on a domain switcher s that indicates which domain (normal or color) to generate. The domain switcher s is first encoded via positional encoding [39] and subsequently concatenated with the time embedding. This combined representation is then injected into the UNet of the stable diffusion models. Interestingly, experiments show that this subtle modification does not significantly alter the pre-trained priors. As a result, it allows for fast convergence and robust generalization, without requiring substantial changes to the stable diffusion models.

Cross-domain Attention. Using the proposed domain switcher, the diffusion model can generate two different domains. However, for a single view, there is no guarantee that the generated color image and the normal map will be geometrically consistent. To address this issue and ensure consistency between the generated normal maps and color images, we introduce a cross-domain attention mechanism that facilitates the exchange of information between the two domains. This mechanism aims to ensure that the generated outputs align well in terms of geometry and appearance.
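The switcher conditioning can be sketched as follows. This is a minimal illustration, not the released implementation: the embedding dimensions, the binary encoding of s, and the sinusoidal frequency schedule are all assumptions, chosen only to show how a scalar switcher is positionally encoded and concatenated with the time embedding.

```python
import numpy as np

# Illustrative sketch: the scalar domain switcher s is mapped through a
# sinusoidal positional encoding [39] and concatenated with the UNet's time
# embedding. Dimensions are made up for the example.
def positional_encoding(x: float, dim: int = 8) -> np.ndarray:
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

def conditioning(t_emb: np.ndarray, s: int) -> np.ndarray:
    # The combined vector is what gets injected into the UNet.
    return np.concatenate([t_emb, positional_encoding(float(s))])

t_emb = np.zeros(16)                  # stand-in time embedding
normal_cond = conditioning(t_emb, 0)  # s = 0 -> normal domain (assumed)
color_cond = conditioning(t_emb, 1)   # s = 1 -> color domain (assumed)
print(normal_cond.shape)  # (24,)
```

Because the switcher enters only through this small added conditioning vector, the pre-trained weights themselves are left untouched, which matches the fast convergence and preserved priors reported above.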


Figure 5. The qualitative results of Wonder3D on various styles of images.


The cross-domain attention layer maintains the same structure as the original self-attention layer and is integrated before the cross-attention layer in each transformer block of the UNet, as depicted in Figure 4. In the cross-domain attention layer, the keys and values from the normal and color image domains are combined and processed through attention operations. This design ensures that the generations of color images and normal maps are closely correlated, thus promoting geometric consistency between the two domains.
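The key/value sharing described above can be sketched with a single-head attention toy example. This is a simplified illustration under stated assumptions: projection matrices are omitted (identity Q/K/V), there is one head, and the token and feature dimensions are arbitrary; it shows only the core idea that each domain keeps its own queries while keys and values are concatenated across domains.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Minimal single-head sketch of cross-domain attention: keys and values from
# the normal and color streams are concatenated, so every token in either
# domain can attend to both domains. Q/K/V projections are omitted here.
def cross_domain_attention(x_normal: np.ndarray, x_color: np.ndarray):
    # x_*: (tokens, dim) feature maps of the two domains for one view.
    kv = np.concatenate([x_normal, x_color], axis=0)  # shared keys/values
    d = x_normal.shape[-1]

    def attend(q: np.ndarray) -> np.ndarray:
        scores = q @ kv.T / np.sqrt(d)          # (tokens, 2 * tokens)
        return softmax(scores, axis=-1) @ kv    # (tokens, dim)

    return attend(x_normal), attend(x_color)

rng = np.random.default_rng(0)
n = rng.standard_normal((4, 8))
c = rng.standard_normal((4, 8))
out_n, out_c = cross_domain_attention(n, c)
print(out_n.shape, out_c.shape)  # (4, 8) (4, 8)
```

Because the layer has the same query/key/value structure as self-attention, only with a widened key/value set, it can reuse the pre-trained self-attention weights, which is consistent with placing it before the cross-attention layer in each transformer block.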


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Xiaoxiao Long, The University of Hong Kong, VAST, MPI Informatik (equal contribution);

(2) Yuan-Chen Guo, Tsinghua University, VAST (equal contribution);

(3) Cheng Lin, The University of Hong Kong (corresponding author);

(4) Yuan Liu, The University of Hong Kong;

(5) Zhiyang Dou, The University of Hong Kong;

(6) Lingjie Liu, University of Pennsylvania;

(7) Yuexin Ma, Shanghai Tech University;

(8) Song-Hai Zhang, The University of Hong Kong;

(9) Marc Habermann, MPI Informatik;

(10) Christian Theobalt, MPI Informatik;

(11) Wenping Wang, Texas A&M University (corresponding author).