Authors: Jun Gao, NVIDIA, University of Toronto, Vector Institute (jung@nvidia.com); Tianchang Shen, NVIDIA, University of Toronto, Vector Institute (frshen@nvidia.com); Zian Wang, NVIDIA, University of Toronto, Vector Institute (zianw@nvidia.com); Wenzheng Chen, NVIDIA, University of Toronto, Vector Institute (wenzchen@nvidia.com); Kangxue Yin, NVIDIA (kangxuey@nvidia.com); Daiqing Li, NVIDIA (daiqingl@nvidia.com); Or Litany, NVIDIA (olitany@nvidia.com); Zan Gojcic, NVIDIA (zgojcic@nvidia.com); Sanja Fidler, NVIDIA, University of Toronto, Vector Institute (sfidler@nvidia.com)

Abstract

As several industries are moving towards modeling massive 3D virtual worlds, the need for content creation tools that can scale in terms of the quantity, quality, and diversity of 3D content is becoming evident. In our work, we aim to train performant 3D generative models that synthesize textured meshes that can be directly consumed by 3D rendering engines, and are thus immediately usable in downstream applications. Prior works on 3D generative modeling either lack geometric details, are limited in the mesh topology they can produce, typically do not support textures, or utilize neural renderers in the synthesis process, which makes their use in common 3D software non-trivial. In this work, we introduce GET3D, a Generative model that directly generates Explicit Textured 3D meshes with complex topology, rich geometric details, and high-fidelity textures. We bridge recent successes in differentiable surface modeling, differentiable rendering, and 2D Generative Adversarial Networks to train our model from 2D image collections. GET3D is able to generate high-quality 3D textured meshes, ranging from cars, chairs, animals, motorbikes and human characters to buildings, achieving significant improvements over previous methods.
Our project page: https://nv-tlabs.github.io/GET3D

1 Introduction

Diverse, high-quality 3D content is becoming increasingly important for several industries, including gaming, robotics, architecture, and social platforms. However, manual creation of 3D assets is very time-consuming and requires specific technical knowledge as well as artistic modeling skills. One of the main challenges is thus scale – while one can find 3D models on 3D marketplaces such as Turbosquid [4] or Sketchfab [3], creating many 3D models to, say, populate a game or a movie with a crowd of characters that all look different still takes a significant amount of artist time.

To facilitate the content creation process and make it accessible to a variety of (novice) users, generative 3D networks that can produce high-quality and diverse 3D assets have recently become an active area of research [5, 14, 43, 46, 53, 68, 75, 60, 59, 69, 23]. However, to be practically useful for current real-world applications, 3D generative models should ideally fulfill the following requirements: (a) They should have the capacity to generate shapes with detailed geometry and arbitrary topology, (b) The output should be a textured mesh, which is a primary representation used by standard graphics software packages such as Blender [15] and Maya [1], and (c) We should be able to leverage 2D images for supervision, as they are more widely available than explicit 3D shapes.

Prior work on 3D generative modeling has focused on subsets of the above requirements, but no method to date fulfills all of them (Tab. 1). For example, methods that generate 3D point clouds [5, 68, 75] typically do not produce textures and have to be converted to a mesh in post-processing. Methods generating voxels often lack geometric details and do not produce texture [66, 20, 27, 40]. Generative models based on neural fields [43, 14] focus on extracting geometry but disregard texture.
Most of these methods also require explicit 3D supervision. Finally, methods that directly output textured 3D meshes [54, 53] typically require pre-defined shape templates and cannot generate shapes with complex topology and variable genus.

Recently, rapid progress in neural volume rendering [45] and 2D Generative Adversarial Networks (GANs) [34, 35, 33, 29, 52] has led to the rise of 3D-aware image synthesis [7, 57, 8, 49, 51, 25]. However, this line of work aims to synthesize multi-view consistent images using neural rendering in the synthesis process and does not guarantee that meaningful 3D shapes can be generated. While a mesh can potentially be obtained from the underlying neural field representation using the marching cubes algorithm [39], extracting the corresponding texture is non-trivial.

In this work, we introduce a novel approach that aims to tackle all the requirements of a practically useful 3D generative model. Specifically, we propose GET3D, a Generative model for 3D shapes that directly outputs Explicit Textured 3D meshes with high geometric and texture detail and arbitrary mesh topology. At the heart of our approach is a generative process that utilizes a differentiable surface extraction method [60] and a differentiable rendering technique [47, 37]. The former enables us to directly optimize and output textured 3D meshes with arbitrary topology, while the latter allows us to train our model with 2D images, thus leveraging powerful and mature discriminators developed for 2D image synthesis. Since our model directly generates meshes and uses a highly efficient (differentiable) graphics renderer, we can easily scale up our model to train with image resolution as high as 1024 × 1024, allowing us to learn high-quality geometric and texture details.
We demonstrate state-of-the-art performance for unconditional 3D shape generation on multiple categories with complex geometry from ShapeNet [9], Turbosquid [4] and Renderpeople [2], such as chairs, motorbikes, cars, human characters, and buildings. With an explicit mesh as the output representation, GET3D is also very flexible and can easily be adapted to other tasks, including: (a) learning to generate decomposed materials and view-dependent lighting effects using advanced differentiable rendering [12], without supervision, and (b) text-guided 3D shape generation using CLIP [56] embeddings.

2 Related Work

We review recent advances in 3D generative models for geometry and appearance, as well as 3D-aware generative image synthesis.

3D Generative Models. In recent years, 2D generative models have achieved photorealistic quality in high-resolution image synthesis [34, 35, 33, 52, 29, 19, 16]. This progress has also inspired research in 3D content generation. Early approaches aimed to directly extend the 2D CNN generators to 3D voxel grids [66, 20, 27, 40, 62], but the high memory footprint and computational complexity of 3D convolutions hinder the generation process at high resolution. As an alternative, other works have explored point cloud [5, 68, 75, 46], implicit [43, 14], or octree [30] representations. However, these works focus mainly on generating geometry and disregard appearance. Their output representations also need to be post-processed to make them compatible with standard graphics engines.

More similar to our work, Textured3DGAN [54, 53] and DIBR [11] generate textured 3D meshes, but they formulate the generation as a deformation of a template mesh, which prevents them from generating complex topology or shapes with varying genus, which our method can do. PolyGen [48] and SurfGen [41] can produce meshes with arbitrary topology, but do not synthesize textures.
3D-Aware Generative Image Synthesis. Inspired by the success of neural volume rendering [45] and implicit representations [43, 14], recent work started tackling the problem of 3D-aware image synthesis [7, 57, 49, 26, 25, 76, 8, 51, 58, 67]. However, neural volume rendering networks are typically slow to query, leading to long training times [7, 57], and generate images of limited resolution. GIRAFFE [49] and StyleNeRF [25] improve the training and rendering efficiency by performing neural rendering at a lower resolution and then upsampling the results with a 2D CNN. However, the performance gain comes at the cost of reduced multi-view consistency. By utilizing a dual discriminator, EG3D [8] can partially mitigate this problem. Nevertheless, extracting a textured surface from methods that are based on neural rendering is a non-trivial endeavor. In contrast, GET3D directly outputs textured 3D meshes that can be readily used in standard graphics engines.

3 Method

We now present our GET3D framework for synthesizing textured 3D shapes. Our generation process is split into two parts: a geometry branch, which differentiably outputs a surface mesh of arbitrary topology, and a texture branch that produces a texture field that can be queried at the surface points to produce colors. The latter can be extended to other surface properties, such as materials (Sec. 4.3.1). During training, an efficient differentiable rasterizer is utilized to render the resulting textured mesh into high-resolution 2D images. The entire process is differentiable, allowing for adversarial training from images (with masks indicating an object of interest) by propagating the gradients from the 2D discriminator to both generator branches. Our model is illustrated in Fig. 2. In the following, we first introduce our 3D generator in Sec. 3.1, before proceeding to the differentiable rendering and loss functions in Sec. 3.2.
3.1 Generative Model of 3D Textured Meshes

We aim to learn a 3D generator M, E = G(z) that maps a sample from a Gaussian distribution z ∈ N(0, I) to a mesh M with texture E.

Since the same geometry can have different textures, and the same texture can be applied to different geometries, we sample two random input vectors z1 ∈ R^512 and z2 ∈ R^512. Following StyleGAN [34, 35, 33], we then use non-linear mapping networks f_geo and f_tex to map z1 and z2 to intermediate latent vectors w1 = f_geo(z1) and w2 = f_tex(z2), which are further used to produce styles that control the generation of 3D shapes and texture, respectively. We formally introduce the generator for geometry in Sec. 3.1.1 and the texture generator in Sec. 3.1.2.

3.1.1 Geometry Generator

We design our geometry generator to incorporate DMTet [60], a recently proposed differentiable surface representation. DMTet represents geometry as a signed distance field (SDF) defined on a deformable tetrahedral grid [22, 24], from which the surface can be differentiably recovered through marching tetrahedra [17]. Deforming the grid by moving its vertices results in a better utilization of its resolution. By adopting DMTet for surface extraction, we can produce explicit meshes with arbitrary topology and genus. We next provide a brief summary of DMTet and refer the reader to the original paper for further details.

Let (V_T, T) denote the full 3D space that the object lies in, where V_T are the vertices in the tetrahedral grid T. Each tetrahedron T_k ∈ T is defined by four vertices {v_ak, v_bk, v_ck, v_dk}, with k ∈ {1, . . . , K}, where K is the total number of tetrahedra and v_ik ∈ V_T, v_ik ∈ R^3. In addition to its 3D coordinates, each vertex v_i contains the SDF value s_i ∈ R and the deformation Δv_i ∈ R^3 of the vertex from its initial canonical coordinate.
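As a concrete (and heavily simplified) illustration of the two latent branches, the sketch below maps two independent Gaussian codes through separate StyleGAN-style mapping networks. The layer count, initialization scale, and the `mapping_network` helper are illustrative stand-ins, not the actual GET3D architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping_network(z, weights, biases):
    """Tiny stand-in for a StyleGAN-style non-linear mapping f(z) -> w:
    a stack of fully connected layers with leaky-ReLU activations.
    (The real model uses a deeper stack of 512-wide layers.)"""
    h = z
    for W, b in zip(weights, biases):
        h = h @ W + b
        h = np.where(h > 0, h, 0.2 * h)  # leaky ReLU
    return h

# Two independent 512-d Gaussian codes: one for geometry, one for texture.
z_geo = rng.standard_normal(512)
z_tex = rng.standard_normal(512)

# Separate (hypothetical) mapping networks f_geo and f_tex, 2 layers each here.
geo_params = ([rng.standard_normal((512, 512)) * 0.02 for _ in range(2)],
              [np.zeros(512) for _ in range(2)])
tex_params = ([rng.standard_normal((512, 512)) * 0.02 for _ in range(2)],
              [np.zeros(512) for _ in range(2)])

w_geo = mapping_network(z_geo, *geo_params)  # w1, controls geometry styles
w_tex = mapping_network(z_tex, *tex_params)  # w2, controls texture styles
assert w_geo.shape == (512,) and w_tex.shape == (512,)
```

Keeping the two branches separate is what lets the model re-texture a fixed geometry (or re-shape a fixed texture) by resampling only one of the two codes.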
This representation allows recovering the explicit mesh through differentiable marching tetrahedra [60], where SDF values in continuous space are computed by a barycentric interpolation of their values on the deformed vertices v'_i = v_i + Δv_i.

Network Architecture. We map w1 ∈ R^512 to SDF values and deformations at each vertex through a series of conditional 3D convolutional and fully connected layers. Specifically, we first use 3D convolutional layers to generate a feature volume conditioned on w1. We then query the feature at each vertex v_i ∈ V_T using trilinear interpolation and feed it into MLPs that output the SDF value s_i and the deformation Δv_i. In cases where modeling at a high resolution is required (e.g. motorbikes with thin structures in the wheels), we further use volume subdivision following [60].

Differentiable Mesh Extraction. After obtaining s_i and Δv_i for all the vertices, we use the differentiable marching tetrahedra algorithm to extract the explicit mesh. Marching tetrahedra determines the surface topology within each tetrahedron based on the signs of s_i. In particular, a mesh face is extracted when sign(s_i) ≠ sign(s_j), where i, j denote the indices of the vertices on an edge of a tetrahedron, and the vertices of that face are determined by a linear interpolation as m_ij = (v'_i s_j − v'_j s_i) / (s_j − s_i). Note that the above equation is only evaluated when s_i ≠ s_j, thus it is differentiable, and the gradient from m_ij can be back-propagated into the SDF values s_i and deformations Δv_i. With this representation, shapes with arbitrary topology can easily be generated by predicting different signs of s_i.

3.1.2 Texture Generator

Directly generating a texture map consistent with the output mesh is not trivial, as the generated shape can have an arbitrary genus and topology. We thus parameterize the texture as a texture field [50].
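The linear interpolation used by marching tetrahedra can be checked numerically. The sketch below is a direct transcription of m_ij = (v'_i s_j − v'_j s_i) / (s_j − s_i) for a single sign-crossing edge, not the full extraction algorithm:

```python
import numpy as np

def crossing_vertex(v_i, v_j, s_i, s_j):
    """Location of the SDF zero crossing on an edge whose endpoint signs differ:
    m_ij = (v_i * s_j - v_j * s_i) / (s_j - s_i).
    Only evaluated when s_i != s_j, so it is well defined and differentiable
    in both the vertex positions and the SDF values."""
    return (v_i * s_j - v_j * s_i) / (s_j - s_i)

# Edge from the origin to (1, 0, 0) with SDF -0.5 at one end and +0.5 at the other:
m = crossing_vertex(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), -0.5, 0.5)
# Symmetric SDF values place the zero crossing at the midpoint of the edge.
assert np.allclose(m, [0.5, 0.0, 0.0])
```

Because the crossing point depends smoothly on s_i, s_j, v'_i, and v'_j, gradients from image-space losses can flow back into both the predicted SDF and the grid deformations.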
Specifically, we model the texture field with a function f_t that maps the 3D location of a surface point p ∈ R^3, conditioned on w2, to the RGB color c ∈ R^3 at that location. Since the texture field depends on the geometry, we additionally condition this mapping on the geometry latent code w1, such that c = f_t(p, w1 ⊕ w2), where ⊕ denotes concatenation.

Network Architecture. We represent our texture field using a tri-plane representation, which is efficient and expressive in reconstructing 3D objects [55] and generating 3D-aware images [8]. Specifically, we follow [8, 35] and use a conditional 2D convolutional neural network to map the latent code w1 ⊕ w2 to three axis-aligned orthogonal feature planes of size N × N × (C × 3), where N = 256 denotes the spatial resolution and C = 32 the number of channels.

Given the feature planes, the feature vector f_t ∈ R^32 of a surface point p is recovered as f_t = Σ_e ρ(π_e(p)), where π_e(p) is the projection of the point p onto the feature plane e and ρ(·) denotes bilinear interpolation of the features. An additional fully connected layer is then used to map the aggregated feature vector f_t to the RGB color c. Note that, different from other works on 3D-aware image synthesis that also use a neural field representation, we only need to sample the texture field at the locations of the surface points (as opposed to dense samples along a ray). This greatly reduces the computational complexity of rendering high-resolution images and guarantees multi-view consistent images by construction.

3.2 Differentiable Rendering and Training

To supervise our model during training, we draw inspiration from Nvdiffrec [47], which performs multi-view 3D object reconstruction by utilizing a differentiable renderer.
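During rendering, each visible surface point queries the texture field, so the core operation is the tri-plane lookup f_t = Σ_e ρ(π_e(p)). Below is a minimal NumPy sketch; the coordinate convention (points in [-1, 1]^3) and the xy/xz/yz plane ordering are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def bilinear(plane, u, v):
    """Bilinear interpolation rho(.) on an (N, N, C) feature plane at
    continuous coordinates (u, v) in [0, N-1]."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, plane.shape[0] - 1), min(v0 + 1, plane.shape[1] - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * plane[u0, v0] + du * (1 - dv) * plane[u1, v0]
            + (1 - du) * dv * plane[u0, v1] + du * dv * plane[u1, v1])

def query_triplane(planes, p):
    """f_t = sum over the three axis-aligned planes of the bilinearly
    interpolated feature at the projection pi_e(p) of the 3D point p."""
    N = planes[0].shape[0]
    x, y, z = (p + 1.0) * 0.5 * (N - 1)      # map [-1, 1] -> [0, N-1]
    projections = [(x, y), (x, z), (y, z)]   # pi_e(p) for the xy/xz/yz planes
    return sum(bilinear(pl, u, v) for pl, (u, v) in zip(planes, projections))

N, C = 256, 32
rng = np.random.default_rng(0)
planes = [rng.standard_normal((N, N, C)) for _ in range(3)]  # stand-in features
f_t = query_triplane(planes, np.array([0.1, -0.3, 0.7]))
assert f_t.shape == (C,)
```

Because only surface points are queried (one lookup per pixel instead of many samples along a ray), the cost of texturing a rendered image scales with the number of visible pixels, not with ray-marching depth.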
Specifically, we render the extracted 3D mesh and the texture field into 2D images using a differentiable renderer [47, 37], and supervise our network with a 2D discriminator, which tries to distinguish whether an image comes from a real object or is rendered from a generated object.

Differentiable Rendering. We assume that the camera distribution C that was used to acquire the images in the dataset is known. To render the generated shapes, we randomly sample a camera c from C, and utilize the highly optimized differentiable rasterizer Nvdiffrast [37] to render the 3D mesh into a 2D silhouette as well as an image where each pixel contains the coordinates of the corresponding 3D point on the mesh surface. These coordinates are further used to query the texture field to obtain the RGB values. Since we operate directly on the extracted mesh, we can render high-resolution images with high efficiency, allowing our model to be trained with image resolution as high as 1024 × 1024.

Discriminator & Objective. We train our model using an adversarial objective. We adopt the discriminator architecture from StyleGAN [34], and use the same non-saturating GAN objective with R1 regularization [42]. We empirically find that using two separate discriminators, one for RGB images and another one for silhouettes, yields better results than a single discriminator operating on both. Let Dx denote the discriminator, where x can either be an RGB image or a silhouette. The adversarial objective is then defined as:

L(Dx, G) = E_{z∼N(0,I), c∼C}[g(Dx(R(G(z), c)))] + E_{Ix∼px}[g(−Dx(Ix)) + λ ‖∇Dx(Ix)‖²],    (1)

where g(u) is defined as g(u) = −log(1 + exp(−u)), px is the distribution of real images, R denotes rendering, and λ is a hyperparameter. Since R is differentiable, the gradients can be backpropagated from 2D images to our 3D generators.
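The pieces of this objective can be sanity-checked in isolation. Below is a small NumPy sketch of g(u) and of an R1-style penalty on a toy gradient array; it is not the full adversarial training loop, and the helper names are our own:

```python
import numpy as np

def g(u):
    """g(u) = -log(1 + exp(-u)), evaluated in a numerically stable way
    via logaddexp (equivalent to log sigmoid(u))."""
    return -np.logaddexp(0.0, -u)

def r1_penalty(grad_real):
    """R1 gradient penalty: squared L2 norm of the discriminator's gradient
    with respect to a real image (here a toy gradient array)."""
    return np.sum(grad_real ** 2)

# g saturates towards 0 for confident logits and equals -log 2 at u = 0:
assert abs(g(0.0) + np.log(2.0)) < 1e-12
assert g(10.0) > -1e-4  # confident logit -> loss contribution near 0

# The R1 penalty vanishes only when the discriminator is locally flat on reals:
assert abs(r1_penalty(np.array([0.1, -0.2])) - 0.05) < 1e-12
```

Note that g is monotone and bounded above by 0, so maximizing g(Dx(·)) on one image set and g(−Dx(·)) on the other recovers the usual non-saturating logistic objective.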
Regularization. To remove internal floating faces that are not visible in any of the views, we further regularize the geometry generator with a cross-entropy loss defined between the SDF values of the neighboring vertices [47]:

L_reg = Σ_{i,j ∈ Se} [H(σ(si), sign(sj)) + H(σ(sj), sign(si))],    (2)

where H denotes the binary cross-entropy loss and σ denotes the sigmoid function. The sum in Eq. 2 is defined over the set of unique edges Se in the tetrahedral grid for which sign(si) ≠ sign(sj).

The overall loss function is then defined as:

L = L(D_rgb, G) + L(D_mask, G) + μ L_reg,    (3)

where μ is a hyperparameter that controls the level of regularization.

4 Experiments

We conduct extensive experiments to evaluate our model. We first compare the quality of the 3D textured meshes generated by GET3D to existing methods using the ShapeNet [9] and Turbosquid [4] datasets. Next, we ablate our design choices in Sec. 4.2. Finally, we demonstrate the flexibility of GET3D by adapting it to downstream applications in Sec. 4.3. Additional experimental results and implementation details are provided in the Appendix.

4.1 Experiments on Synthetic Datasets

Datasets. For evaluation on ShapeNet [9], we use three categories with complex geometry – Car, Chair, and Motorbike – which contain 7497, 6778, and 337 shapes, respectively. We randomly split each category into training (70%), validation (10%), and test (20%), and further remove from the test set shapes that have duplicates in the training set. To render the training data, we randomly sample camera poses from the upper hemisphere of each shape. For the Car and Chair categories, we use 24 random views, while for Motorbike we use 100 views due to the smaller number of shapes. As models in ShapeNet only have simple textures, we also evaluate GET3D on an Animal dataset (442 shapes) collected from TurboSquid [4], where textures are more detailed, and we split it into training, validation and test sets as defined above.
Finally, to demonstrate the versatility of GET3D, we also provide qualitative results on a House dataset collected from Turbosquid (563 shapes) and a Human Body dataset from Renderpeople [2] (500 shapes). We train a separate model on each category.

Baselines. We compare GET3D to two groups of works: 1) 3D generative models that rely on 3D supervision: PointFlow [68] and OccNet [43]. Note that these methods only generate geometry without texture. 2) 3D-aware image generation methods: GRAF [57], PiGAN [7], and EG3D [8].

Metrics. To evaluate the quality of our synthesis, we consider both the geometry and the texture of the generated shapes. For geometry, we adopt metrics from [5] and use both Chamfer Distance (CD) and Light Field Distance [10] (LFD) to compute the Coverage score (COV) and the Minimum Matching Distance (MMD). For OccNet [43], GRAF [57], PiGAN [7] and EG3D [8], we use marching cubes to extract the underlying geometry. For PointFlow [68], we use Poisson surface reconstruction to convert a point cloud into a mesh when evaluating LFD. To evaluate texture quality, we adopt the FID [28] metric commonly used to evaluate image synthesis. In particular, for each category, we render the test shapes into 2D images, and also render the generated 3D shapes from each model into 50k images using the same camera distribution. We then compute FID on the two image sets. As the baselines from 3D-aware image synthesis [57, 7, 8] do not directly output textured meshes, we compute the FID score in two ways: (i) we use their neural volume rendering to obtain 2D images, which we refer to as FID-Ori, and (ii) we extract the mesh from their neural field representation using marching cubes, render it, and then use the 3D location of each pixel to query the network to obtain the RGB values. We refer to this score, which is more aware of the actual 3D shape, as FID-3D. Further details on the evaluation metrics are available in Appendix B.3.
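For reference, a (squared-distance) symmetric Chamfer Distance between two point sets can be computed as sketched below; this is a generic formulation and may differ in normalization or squaring convention from the exact variant used in our evaluation:

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer Distance between point sets A (n, 3) and B (m, 3):
    mean over A of the nearest squared distance to B, plus the reverse term."""
    # Pairwise squared distances via broadcasting: d2[i, j] = ||A[i] - B[j]||^2.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer_distance(A, B) == 0.0  # identical sets -> zero distance

B2 = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.0]])
assert chamfer_distance(A, B2) > 0.0  # perturbing a point increases the distance
```

In practice, points are sampled from the generated and reference mesh surfaces before computing CD, so the score compares surfaces rather than vertex sets.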
Experimental Results. We provide quantitative results in Table 2 and qualitative examples in Fig. 3 and Fig. 4. Additional results are available in the supplementary video. Compared to OccNet [43], which uses 3D supervision during training, GET3D achieves better performance in terms of both diversity (COV) and quality (MMD), and our generated shapes have more geometric details.