3. Method and 3.1. Architecture
3.2. Loss and 3.3. Implementation Details
4. Data Curation
5. Experiments and 5.1. Metrics
5.3. Comparison to SOTA Methods
5.4. Qualitative Results and 5.5. Ablation Study
A. Additional Qualitative Comparison
B. Inference on AI-generated Images
In this section, we describe our data generation procedure for training, as well as how we render the object scans from OmniObject3D to create one of our benchmark test sets.
Image Rendering. For an arbitrary 3D mesh asset, our Blender-based rendering pipeline first loads it into a scene and normalizes it to fit inside a unit cube. The scene consists of a large rectangular bowl with a flat bottom, a common setup that 3D artists use to obtain realistic shading, together with four point light sources and one area light source. We randomly place cameras around the object with focal lengths between 30mm and 70mm (for a 35mm-equivalent sensor). We randomly vary the camera distance, the elevation (from 5 to 65 degrees), and the LookAt point, and render images at 600 × 600 resolution (see Fig. 11). This variation in object/camera geometry captures the variability of projective geometry in real-world scenarios, which arises from different capture devices and camera poses. This is in contrast to prior work that uses fixed intrinsics, a fixed distance, and a LookAt point at the center of the object.
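As a concrete illustration, the camera sampling described above can be scripted with Blender's Python API roughly as in the sketch below. Only the focal length range, elevation range, and 600 × 600 resolution come from the text; the distance bounds, azimuth range, LookAt jitter, and helper name are our own illustrative assumptions.

```python
# Hedged sketch of the randomized camera placement (Blender bpy).
# DIST_RANGE, the azimuth range, and LOOKAT_JITTER are assumptions,
# not values taken from the paper.
import math
import random
import bpy

DIST_RANGE = (1.5, 3.0)      # assumption: camera distance in scene units
LOOKAT_JITTER = 0.1          # assumption: random offset of the LookAt point

def add_random_camera(scene):
    # Sample spherical coordinates around the (unit-normalized) object.
    elev = math.radians(random.uniform(5.0, 65.0))
    azim = random.uniform(0.0, 2.0 * math.pi)
    dist = random.uniform(*DIST_RANGE)
    loc = (dist * math.cos(elev) * math.cos(azim),
           dist * math.cos(elev) * math.sin(azim),
           dist * math.sin(elev))

    cam_data = bpy.data.cameras.new("RandomCam")
    cam_data.sensor_width = 36.0                    # 35mm-equivalent sensor
    cam_data.lens = random.uniform(30.0, 70.0)      # 30mm-70mm focal length
    cam_obj = bpy.data.objects.new("RandomCam", cam_data)
    cam_obj.location = loc
    scene.collection.objects.link(cam_obj)

    # Jittered LookAt target implemented as an empty plus a Track To constraint.
    target = bpy.data.objects.new("LookAtTarget", None)
    target.location = tuple(random.uniform(-LOOKAT_JITTER, LOOKAT_JITTER)
                            for _ in range(3))
    scene.collection.objects.link(target)
    track = cam_obj.constraints.new(type='TRACK_TO')
    track.target = target
    track.track_axis = 'TRACK_NEGATIVE_Z'
    track.up_axis = 'UP_Y'

    scene.camera = cam_obj
    scene.render.resolution_x = 600
    scene.render.resolution_y = 600
    return cam_obj

add_random_camera(bpy.context.scene)
```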
In addition to RGB images, we extract segmentation masks, depth maps, intrinsics, extrinsics, and object pose. We center crop around the objects, mask out the background, resize the images to 224 × 224, and adjust the additional annotations to account for the cropping, masking, and resizing.
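The bookkeeping for the camera intrinsics under a square crop followed by a resize is standard; a minimal sketch is shown below. The crop-box variable names and the example values are our own illustrative choices, not taken from the paper's code.

```python
# Minimal sketch of adjusting pinhole intrinsics for a square crop + resize.
# The crop box (x0, y0, side) is assumed to come from the segmentation mask.
import numpy as np

def adjust_intrinsics(K, x0, y0, side, out_size=224):
    """K: 3x3 intrinsics of the 600x600 render; (x0, y0, side): square crop."""
    scale = out_size / float(side)
    K_new = K.copy().astype(np.float64)
    # Shift the principal point into the crop's coordinate frame ...
    K_new[0, 2] -= x0
    K_new[1, 2] -= y0
    # ... then rescale focal lengths and principal point for the resize.
    K_new[:2, :] *= scale
    return K_new

# Example: fx = fy = 800 px, principal point at the center of a 600x600 image.
K = np.array([[800.0,   0.0, 300.0],
              [  0.0, 800.0, 300.0],
              [  0.0,   0.0,   1.0]])
print(adjust_intrinsics(K, x0=150, y0=120, side=320))
```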
The original videos released with the OmniObject3D dataset have noisy foreground masks and are mostly captured indoors on a tabletop. To improve lighting variability and ensure accurate segmentations, we follow the rendering procedure described above to generate the testing data. Unlike our training set generation, we use HDRI environment maps for scene lighting, which yields high lighting quality and diversity (see Fig. 12).
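Lighting a Blender scene with an HDRI environment map can be done through the world shader nodes; the sketch below shows one standard way to wire them. The file path is a placeholder, and this is only an assumed setup consistent with the description above, not the paper's exact configuration.

```python
# Hedged sketch: lighting the test-set renders with an HDRI environment map.
# Wiring: Environment Texture -> Background -> World Output (default world graph).
import bpy

def set_hdri_lighting(hdri_path):
    world = bpy.context.scene.world
    world.use_nodes = True
    nodes = world.node_tree.nodes
    links = world.node_tree.links

    env_node = nodes.new(type='ShaderNodeTexEnvironment')
    env_node.image = bpy.data.images.load(hdri_path)

    bg_node = nodes['Background']   # Background node of the default world graph
    links.new(env_node.outputs['Color'], bg_node.inputs['Color'])

set_hdri_lighting('/path/to/environment.exr')  # placeholder path
```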
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);
(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);
(3) Anh Thai, Georgia Institute of Technology;
(4) Varun Jampani, Stability AI;
(5) James M. Rehg, University of Illinois at Urbana-Champaign.