
What Is Wonder3D? A Method for Generating High-Fidelity Textured Meshes From Single-View Images

by Ringi, January 1st, 2025

Too Long; Didn't Read

In this work, we introduce Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images.

Abstract and 1 Introduction

2. Related Works

2.1. 2D Diffusion Models for 3D Generation

2.2. 3D Generative Models and 2.3. Multi-view Diffusion Models

3. Problem Formulation

3.1. Diffusion Models

3.2. The Distribution of 3D Assets

4. Method and 4.1. Consistent Multi-view Generation

4.2. Cross-Domain Diffusion

4.3. Textured Mesh Extraction

5. Experiments

5.1. Implementation Details

5.2. Baselines

5.3. Evaluation Protocol

5.4. Single View Reconstruction

5.5. Novel View Synthesis and 5.6. Discussions

6. Conclusions and Future Works, Acknowledgements and References


Figure 1. Wonder3D reconstructs highly detailed textured meshes from a single-view image in only 2∼3 minutes. Wonder3D first generates consistent multi-view normal maps with corresponding color images via a cross-domain diffusion model, and then leverages a novel normal fusion method to achieve fast and high-quality reconstruction.

Abstract

In this work, we introduce Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details.


To holistically improve the quality, consistency, and efficiency of single-view reconstruction, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works.
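Read as a pipeline, the abstract describes three stages: cross-domain diffusion, consistency enforced by attention, and normal fusion. The outline below is only a schematic sketch; the stage callables and the view count are placeholders we introduce for illustration, not the actual Wonder3D interface.

```python
from typing import Callable, Sequence, Tuple

def wonder3d_style_pipeline(
    input_image,
    multiview_diffusion: Callable[..., Tuple[Sequence, Sequence]],  # placeholder stage
    normal_fusion: Callable[[Sequence, Sequence], object],          # placeholder stage
    n_views: int = 6,  # assumed number of generated views, for illustration only
):
    """Schematic outline of the three stages described in the abstract."""
    # Stage 1: a single cross-domain diffusion model produces, for each fixed
    # viewpoint, both a normal map and a color image.
    # Stage 2 (inside that model): multi-view cross-domain attention keeps the
    # views and the two domains mutually consistent.
    normal_maps, color_images = multiview_diffusion(input_image, n_views=n_views)

    # Stage 3: geometry-aware normal fusion extracts a textured mesh from the
    # generated 2D maps.
    return normal_fusion(normal_maps, color_images)
```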

Introduction

Reconstructing 3D geometry from a single image stands as a fundamental task in computer graphics and 3D computer vision [12, 25, 31, 33, 35, 38, 41, 44], offering a wide range of versatile applications such as virtual reality, video games, 3D content creation, and robotic grasping. However, this task is notably challenging since it is ill-posed and demands the ability to discern the 3D geometry of both visible and invisible parts. This ability requires extensive knowledge of the 3D world.


Recently, the field of 3D generation has experienced rapid and flourishing development with the introduction of diffusion models. A growing body of research [5, 29, 43, 59, 63], such as DreamField [24], DreamFusion [43], and Magic3D [29], resorts to distilling the prior knowledge of 2D image diffusion models or vision-language models to create 3D models from text or images via Score Distillation Sampling (SDS) [43]. Despite their compelling results, these methods suffer from two main limitations: efficiency and consistency. The per-shape optimization process typically entails tens of thousands of iterations, each involving full-image volume rendering and inference of the diffusion model. Consequently, it often takes tens of minutes or even hours per shape. Moreover, the 2D prior model considers only a single view at each iteration and strives to make every view resemble the input image. This often leads to 3D shapes with inconsistencies such as multiple faces (i.e., the Janus problem [43]).
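For concreteness, the toy loop below sketches where a generic SDS-based optimizer spends its time: each iteration renders one view, perturbs it with noise, queries the 2D diffusion prior once, and backpropagates the resulting gradient into the 3D parameters. The `render`, `unet`, and `alphas_cumprod` arguments are placeholders for a differentiable renderer and a pretrained diffusion model; this is an illustrative sketch, not DreamFusion's or Wonder3D's implementation.

```python
import torch

def sds_optimize(render, unet, alphas_cumprod, params, text_emb,
                 n_iters=10_000, lr=1e-2):
    """Generic Score Distillation Sampling loop (illustrative only).

    render(params, view)      -> image tensor of shape (1, 3, H, W), requires grad
    unet(x_t, t, text_emb)    -> predicted noise, same shape as x_t
    alphas_cumprod            -> 1D tensor of cumulative noise-schedule terms
    """
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iters):                      # tens of thousands of iterations
        view = torch.rand(3) * 2 - 1              # toy stand-in for a random camera
        img = render(params, view)                # full-image (volume) rendering
        t = torch.randint(20, 980, (1,))          # random diffusion timestep
        noise = torch.randn_like(img)
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a_t.sqrt() * img + (1 - a_t).sqrt() * noise
        with torch.no_grad():                     # the prior supplies a score, not a loss
            noise_pred = unet(x_t, t, text_emb)
        w = 1 - a_t                               # one common weighting choice
        grad = w * (noise_pred - noise)           # SDS gradient w.r.t. the rendered image
        opt.zero_grad()
        img.backward(gradient=grad)               # push the gradient into the 3D params
        opt.step()
    return params
```

Because every iteration needs a full rendering pass plus a diffusion-model inference, and because each iteration supervises only a single view, both the runtime cost and the view-inconsistency discussed above follow directly from this loop structure.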


Another group of works endeavors to directly produce 3D geometries, such as point clouds [37, 41, 71, 75], meshes [16, 34], and neural fields [1, 4, 7, 14, 17, 21, 25–27, 40, 42, 61, 72], via network inference to avoid time-consuming per-shape optimization. Most of them attempt to train 3D generative diffusion models from scratch on 3D assets. However, due to the limited size of publicly available 3D datasets, these methods demonstrate poor generalizability, and most can only generate shapes for specific categories.


More recently, several methods have emerged that directly generate multi-view 2D images, with representative works including SyncDreamer [33] and MVDream [51]. By enhancing the multi-view consistency of image generation, these methods can recover 3D shapes from the generated multi-view images. Following these works, our method also adopts a multi-view generation scheme to favor the flexibility and efficiency of 2D representations. However, because these methods rely only on color images, the fidelity of the generated shapes is not well maintained; they either struggle to recover geometric details or come with enormous computational costs.


To better address the issues of fidelity, consistency, generalizability, and efficiency in the aforementioned works, in this paper we introduce a new approach to single-view 3D reconstruction that generates multi-view consistent normal maps and their corresponding color images with a cross-domain diffusion model. The key idea is to extend the Stable Diffusion framework to model the joint distribution of two different domains, i.e., normals and colors. We demonstrate that this can be achieved by introducing a domain switcher and a cross-domain attention scheme. In particular, the domain switcher allows the diffusion model to generate either normal maps or color images, while the cross-domain attention mechanism assists in the information exchange between the two domains, ultimately improving consistency and quality. Finally, in order to stably extract surfaces from the generated views, we propose a geometry-aware normal fusion algorithm that is robust to inaccuracies in the generated views and capable of reconstructing clean and high-quality geometries (see Figure 1).
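To make these two ingredients more tangible, here is a minimal PyTorch sketch of (a) a domain switcher realized as a learned embedding added to the timestep embedding, and (b) a cross-domain attention layer in which normal-domain and color-domain tokens attend to one another. The class names and exact wiring are our own assumptions for illustration; the released model may implement both differently, and the multi-view part of the attention is omitted for brevity.

```python
import torch
import torch.nn as nn

class DomainSwitcher(nn.Module):
    """Learned embedding telling the UNet which domain to denoise:
    0 = normal maps, 1 = color images (hypothetical encoding)."""
    def __init__(self, dim: int):
        super().__init__()
        self.embed = nn.Embedding(2, dim)

    def forward(self, t_emb: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # Inject the domain label the same way the timestep embedding is injected.
        return t_emb + self.embed(domain)

class CrossDomainAttention(nn.Module):
    """Attention over the concatenation of normal- and color-domain tokens,
    so each domain can read features from the other."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, normal_tokens: torch.Tensor, color_tokens: torch.Tensor):
        # normal_tokens, color_tokens: (B, N, dim) flattened spatial features
        joint = torch.cat([normal_tokens, color_tokens], dim=1)  # (B, 2N, dim)
        fused, _ = self.attn(joint, joint, joint)
        n = normal_tokens.shape[1]
        return fused[:, :n], fused[:, n:]   # split back into the two domains
```

The design intuition is that the two domains share a latent space inherited from the pretrained model, so a lightweight switch plus joint attention is enough to keep normals and colors aligned without training two separate networks.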


We conduct extensive experiments on the Google Scanned Object dataset [13] and various 2D images with different styles. The experiments validate that Wonder3D is capable of producing high-quality geometry with high efficiency in comparison with baseline methods. Wonder3D possesses several distinctive properties and accordingly has the following contributions:


• Wonder3D holistically considers the issues of generation quality, efficiency, generalizability, and consistency for single-view 3D reconstruction. It achieves a leading level of geometric detail with reasonably good efficiency among current zero-shot single-view reconstruction methods.


• We propose a new multi-view cross-domain 2D diffusion model to predict normal maps and color images. This representation not only adapts to the original data distribution of the Stable Diffusion model but also effectively captures the rich surface details of the target shape.


• We propose a cross-domain attention mechanism to produce multi-view normal maps and color images that are consistently aligned. This mechanism facilitates the exchange of information across different domains, enabling our method to recover high-fidelity geometry.


• We introduce a novel geometry-aware normal fusion algorithm that can robustly extract surfaces from the generated normal maps and color images.
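The precise algorithm is described in Section 4.3; purely to illustrate the underlying idea of normal fusion (optimize a surface so that its rendered normals and colors agree with the generated multi-view maps), a simplified fitting loop might look like the sketch below. The `render_g_buffer` callable and the loss weights are hypothetical, and the geometry-aware weighting that makes the actual method robust to inaccuracies is omitted.

```python
import torch

def fuse_normals(render_g_buffer, params, views, target_normals, target_colors,
                 n_iters=2000, lr=5e-3):
    """Toy geometry optimization against generated multi-view normal/color maps.

    render_g_buffer(params, view) -> (normal_map (H, W, 3), color_map (H, W, 3),
    mask (H, W)) from a hypothetical differentiable renderer; `params` are the
    parameters of an implicit surface (e.g., an SDF network). Illustration only.
    """
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iters):
        loss = 0.0
        for view, n_gt, c_gt in zip(views, target_normals, target_colors):
            n_pred, c_pred, mask = render_g_buffer(params, view)
            # Normal term: align rendered and generated normals (cosine error).
            loss = loss + (mask * (1 - (n_pred * n_gt).sum(dim=-1))).mean()
            # Color term: keep the texture consistent with the generated images.
            loss = loss + 0.1 * (mask * (c_pred - c_gt).abs().sum(dim=-1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params
```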


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Xiaoxiao Long, The University of Hong Kong, VAST, MPI Informatik (equal contribution);

(2) Yuan-Chen Guo, Tsinghua University, VAST (equal contribution);

(3) Cheng Lin, The University of Hong Kong (corresponding author);

(4) Yuan Liu, The University of Hong Kong;

(5) Zhiyang Dou, The University of Hong Kong;

(6) Lingjie Liu, University of Pennsylvania;

(7) Yuexin Ma, ShanghaiTech University;

(8) Song-Hai Zhang, The University of Hong Kong;

(9) Marc Habermann, MPI Informatik;

(10) Christian Theobalt, MPI Informatik;

(11) Wenping Wang, Texas A&M University (corresponding author).