3. Method and 3.1. Architecture
3.2. Loss and 3.3. Implementation Details
4. Data Curation
5. Experiments and 5.1. Metrics
5.3. Comparison to SOTA Methods
5.4. Qualitative Results and 5.5. Ablation Study
A. Additional Qualitative Comparison
B. Inference on AI-generated Images
We now present our architecture (see Fig. 3) for shape reconstruction. It builds on two established practices from prior work in this field: 1) the use of an intermediate geometric representation [33, 56, 64, 67, 70], and 2) explicit reasoning with spatial feature maps [5, 63, 68]. Specifically, our model consists of three submodules: a depth and camera estimator, a geometric unprojection unit, and a projection-guided shape reconstructor.
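To make the three-submodule design concrete, here is a minimal sketch of how the pipeline might be wired end to end. All function names are hypothetical placeholders, not the authors' API; each stage simply stands in for one submodule from the description above.

```python
def reconstruct(image, estimate_depth_camera, unproject, reconstruct_shape):
    """Hypothetical end-to-end pipeline wiring the three submodules.

    estimate_depth_camera : submodule 1, predicts a depth map D and intrinsics K
    unproject             : submodule 2, lifts (D, K) to a 3D visible surface
    reconstruct_shape     : submodule 3, projection-guided full-shape reconstruction
    """
    depth, K = estimate_depth_camera(image)            # depth and camera estimator
    visible_surface = unproject(depth, K)              # geometric unprojection unit
    return reconstruct_shape(image, visible_surface)   # shape reconstructor
```

Because every stage is a differentiable module, the composition can be trained end to end, which is the property the unprojection discussion below relies on.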
Depth and camera estimator. We propose to estimate the 3D visible object surface as an intermediate representation. To infer the full shape of an object, one must understand the visible surface—not only because the visible surface is often a large part of the full surface, but also because an accurate visible surface facilitates geometric reasoning about the full object. This is because generalizable reconstruction cues, such as symmetry, curvature, and repetition, can be detected and leveraged more effectively in 3D space. For example, if an object is symmetric, accurately inferring its 3D symmetry planes from a partial 3D surface is much easier than from 2D RGB or relative depth.
We use a view-centric coordinate system, since prior work shows that view-centric learning benefits generalization [55, 56]. The camera coordinate frame therefore serves as the “world” coordinate frame for shape reconstruction, so only the camera intrinsics matrix K, together with the estimated depth map D, is required to unproject pixels into 3D. Note that unprojection is fully differentiable w.r.t. D and K, so it can easily be used as a module in an end-to-end learning-based model. Additionally, the depth maps are foreground-segmented, and the resulting visible surface is normalized in 3D space to zero mean and unit scale before being fed into the next module.
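The unprojection and normalization steps described above can be sketched as follows. This is an illustrative NumPy implementation under the standard pinhole camera model, not the authors' code; the function name and the foreground `mask` argument are assumptions for the example.

```python
import numpy as np

def unproject_visible_surface(depth, K, mask):
    """Unproject a foreground-segmented depth map into a 3D point cloud
    in the camera (view-centric) frame, then normalize the points to
    zero mean and unit scale. Hypothetical sketch, pinhole model."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    fx, fy = K[0, 0], K[1, 1]                        # focal lengths
    cx, cy = K[0, 2], K[1, 2]                        # principal point

    z = depth[mask]                                  # keep foreground pixels only
    x = (u[mask] - cx) * z / fx                      # back-project along camera rays
    y = (v[mask] - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)               # (N, 3) visible-surface points

    pts = pts - pts.mean(axis=0)                     # zero-mean
    pts = pts / np.linalg.norm(pts, axis=1).max()    # unit scale
    return pts
```

Every operation here (indexing, subtraction, division) is differentiable w.r.t. `depth` and `K`, which is what allows the same computation, written in an autodiff framework, to sit inside an end-to-end trainable model.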
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Zixuan Huang, University of Illinois at Urbana-Champaign and both authors contributed equally to this work;
(2) Stefan Stojanov, Georgia Institute of Technology and both authors contributed equally to this work;
(3) Anh Thai, Georgia Institute of Technology;
(4) Varun Jampani, Stability AI;
(5) James M. Rehg, University of Illinois at Urbana-Champaign.