Single Image 3D Scene Reconstruction: A Review of Recent Advances

by Fraltsov Denis (@senarect), April 7th, 2023

Computer vision is a rapidly developing field of artificial intelligence, particularly in the area of 3D. This overview will consider an applied task: transitioning between 2D and 3D environments.

DIRECT TASK

To begin with, we will analyze how to solve a direct problem of computer graphics, namely creating a 2D image using a 3D model, and get acquainted with the basic concepts.

Rendering is the process of moving from a 3D model to its 2D projection. You’ve probably heard of some of the common rendering methods:


  1. Rasterization is one of the earliest and fastest rendering methods. It treats the model as a mesh of polygons whose vertices carry information such as position, texture, and color. These vertices are projected onto an image plane perpendicular to the viewing direction. Naive rasterization has trouble with overlapping objects: when surfaces overlap, whichever one is drawn last ends up on screen, so the wrong object may be displayed. Z-buffering solves this by storing the depth of the nearest surface at each pixel (in effect, the z-buffer is a depth map).
  2. Ray casting. Unlike rasterization, ray casting does not suffer from the overlapping-surface problem. As the name suggests, it casts rays at the model from the camera’s point of view, one through each pixel of the image plane. The first surface a ray hits is the one rendered; any intersections behind it are ignored.
  3. Ray tracing. Despite its advantages, ray casting cannot correctly model shadows, reflections, and refractions. Ray tracing was developed to address this. It works like ray casting but models light transport more faithfully: primary rays from the camera’s point of view hit the model and spawn secondary rays. Depending on the surface properties, these are shadow rays, reflected rays, or refracted rays.
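
To make the z-buffer idea from the first item concrete, here is a minimal sketch in plain Python. It is a deliberate simplification: the "surfaces" are axis-aligned rectangles at a constant depth rather than projected triangles, and the function name and tuple layout are made up for this illustration.

```python
import math


def rasterize_with_zbuffer(width, height, rects):
    """Draw rectangles, keeping only the nearest surface at each pixel.

    Each rect is (x0, y0, x1, y1, depth, label); a smaller depth is closer
    to the camera. This layout is an assumption made for the example.
    """
    depth_buffer = [[math.inf] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for x0, y0, x1, y1, depth, label in rects:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                # Depth test: overwrite the pixel only if this surface is closer.
                if depth < depth_buffer[y][x]:
                    depth_buffer[y][x] = depth
                    image[y][x] = label
    return image, depth_buffer


near = (0, 0, 3, 3, 1.0, "near")
far = (1, 1, 4, 4, 5.0, "far")

# The result does not depend on the order in which surfaces are drawn.
image_a, depth_a = rasterize_with_zbuffer(4, 4, [near, far])
image_b, depth_b = rasterize_with_zbuffer(4, 4, [far, near])
```

Without the depth test, `image_a` and `image_b` would differ in the overlapping region, which is exactly the "last part drawn wins" problem described above. The returned `depth_buffer` is the depth map mentioned in the text.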


Now that we’ve considered the direct problem of building a 2D image from a 3D model, let’s look at ways to solve the inverse problem: building a 3D model from a 2D image.

INVERSE PROBLEM

A two-dimensional photograph is a projection of a three-dimensional scene. A 3D scene is a collection of 3D meshes, vertices, faces, texture maps, and a light source viewed from a camera or viewpoint. For simplicity, let’s limit the scene to a single 3D object. If we were able to restore the original 3D scene from which the 2D photo was created, we should be able to verify this by projecting the 3D object onto 2D using the same point of view that was used to create the input 2D photo.
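
The projection step can be sketched with a simple pinhole camera model. This is only an illustration: the camera sits at the origin looking along +z, there is no lens distortion, and the focal length and principal point values are arbitrary.

```python
def project_point(point, focal_length, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    u = focal_length * x / z + cx
    v = focal_length * y / z + cy
    return u, v


# A point on the object, and a second point twice as far away on the same ray.
pixel_near = project_point((1.0, -0.5, 4.0), focal_length=800.0, cx=320.0, cy=240.0)
pixel_far = project_point((2.0, -1.0, 8.0), focal_length=800.0, cx=320.0, cy=240.0)
```

Both points land on the same pixel. That is why the inverse problem is hard: a single image does not tell us how far along the viewing ray the original surface was.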


To reconstruct an object, we need to find a combination of vertices, faces, light sources, and textures that, when projected to 2D from the same camera position, reproduces the input image. This is essentially a search problem. However, the number of possible combinations of vertices, faces, texture maps, and lighting is so large that brute-force search is infeasible.
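
A rough back-of-the-envelope count shows how quickly the search space grows. The setup below is an assumption chosen only for illustration: vertex positions are restricted to a coarse 3D grid, and faces, textures, and lighting are ignored entirely.

```python
def count_vertex_configurations(num_vertices, grid_resolution):
    """Count ways to place vertices on a grid_resolution^3 lattice."""
    positions_per_vertex = grid_resolution ** 3
    return positions_per_vertex ** num_vertices


# Just 8 vertices (a single box) on a 32 x 32 x 32 grid.
box_configurations = count_vertex_configurations(num_vertices=8, grid_resolution=32)
num_digits = len(str(box_configurations))
```

Even this toy case gives a 37-digit number of candidates, before faces, textures, or lights are considered. Real meshes have thousands of vertices, so the methods below learn to predict the structure instead of enumerating it.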


Let’s look at the existing approaches to solving this problem!


DIB-R

DIB-R is a differentiable renderer that models pixel values with a differentiable rasterization algorithm. It assigns pixel values in two ways, one for foreground pixels and one for background pixels. In standard rendering, a pixel simply takes the value of the nearest face covering it; DIB-R instead treats foreground rasterization as an interpolation of vertex attributes. For each foreground pixel, a z-buffering test selects the nearest covering face, and only that face influences the pixel. The pixel value is then computed as a weighted interpolation of that face’s three vertices. For background pixels, i.e. pixels that are not covered by any face of the 3D object, the value is calculated based on the distance from the pixel to the nearest face.
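
The foreground case can be illustrated with barycentric interpolation over a single 2D triangle. This is a sketch of the general idea rather than DIB-R’s actual code: the function names are invented here, the triangle is assumed to be already projected to the image plane, and the background term is left out.

```python
def barycentric_weights(p, a, b, c):
    """Return the barycentric weights of 2D point p in triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / denom
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / denom
    w_c = 1.0 - w_a - w_b
    return w_a, w_b, w_c


def shade_foreground_pixel(p, triangle, vertex_colors):
    """Interpolate per-vertex RGB colors at pixel p using barycentric weights."""
    weights = barycentric_weights(p, *triangle)
    return tuple(
        sum(w * color[channel] for w, color in zip(weights, vertex_colors))
        for channel in range(3)
    )


triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
vertex_colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))  # red, green, blue

weights = barycentric_weights((1.0, 1.0), *triangle)
color = shade_foreground_pixel((1.0, 1.0), triangle, vertex_colors)
```

Because the weights are smooth functions of the vertex positions, the pixel color changes smoothly when a vertex moves. That is what allows gradients to flow from an image loss back to the mesh.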


The architecture scheme from the official paper

DIB-R can generate images with realistic lighting and shading effects that are difficult to achieve with traditional rendering.


Official paper


Im2Struct: SMN+SRN

A structure masking network (SMN) produces an object mask from the input 2D image at several scales. It is a multi-layer convolutional neural network (CNN). Its task is to preserve information about the object’s shape while discarding irrelevant information such as background and textures.

A structure recovery network (SRN) recursively reconstructs the hierarchy of object parts as a cuboid structure. The SRN takes the output of the SMN, combines it with CNN features of the 2D image, and passes these features to a recursive neural network (RvNN) that decodes them into a 3D structure. The output is a set of three-dimensional cuboids with a plausible spatial configuration.


The architecture scheme from the official paper


Im2Struct has several advantages over traditional 3D scanning methods, as it can recover the 3D structure of an object from a single 2D image, which is often faster and less expensive than scanning an object from multiple viewpoints.


Official paper


Link to the solution


ATLAS

The method takes as input a sequence of RGB images of arbitrary length. The camera intrinsics and pose are known for each image. The images are passed through a 2D CNN backbone for feature extraction. The features are then back-projected into a 3D voxel volume and accumulated using a running average. Once the image features are fused in 3D, a 3D CNN directly regresses the TSDF (truncated signed distance function).
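
The back-projection and running-average step can be sketched as follows. This is a heavily simplified illustration, not the ATLAS implementation: features are single numbers instead of CNN feature vectors, cameras only translate (no rotation), sampling is nearest-pixel, and the dictionary keys are invented for the example.

```python
def accumulate_features(voxel_centers, views):
    """Fuse per-view 2D features into voxels with a running average.

    Each view is a dict with 'features' (2D list), 'focal', 'cx', 'cy',
    and 'camera_position' (camera assumed to look along +z, no rotation).
    """
    means = [0.0] * len(voxel_centers)
    counts = [0] * len(voxel_centers)
    for view in views:
        feats = view["features"]
        height, width = len(feats), len(feats[0])
        ox, oy, oz = view["camera_position"]
        for i, (x, y, z) in enumerate(voxel_centers):
            # Express the voxel center in this camera's coordinates.
            xc, yc, zc = x - ox, y - oy, z - oz
            if zc <= 0:
                continue  # voxel is behind the camera
            u = int(round(view["focal"] * xc / zc + view["cx"]))
            v = int(round(view["focal"] * yc / zc + view["cy"]))
            if 0 <= u < width and 0 <= v < height:
                counts[i] += 1
                # Incremental mean: no need to store every observation.
                means[i] += (feats[v][u] - means[i]) / counts[i]
    return means, counts


voxel_centers = [(0.0, 0.0, 2.0), (10.0, 0.0, 2.0)]
views = [
    {"features": [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]],
     "focal": 1.0, "cx": 1.0, "cy": 1.0, "camera_position": (0.0, 0.0, 0.0)},
    {"features": [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]],
     "focal": 1.0, "cx": 1.0, "cy": 1.0, "camera_position": (0.0, 0.0, -2.0)},
]
means, counts = accumulate_features(voxel_centers, views)
```

The first voxel is seen by both cameras and ends up with the average of the two features; the second projects outside both images and stays empty. A running average lets the volume absorb any number of views without growing in memory, which fits the "arbitrary length" input described above. The 3D CNN that turns the fused volume into a TSDF is not shown.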


The architecture scheme from the official paper


ATLAS is useful in a variety of industries, including manufacturing, engineering, and archaeology. One limitation of ATLAS is that it requires the object being scanned to be stationary, which may not always be feasible in certain applications. Additionally, the system may struggle to capture fine details and textures on objects with highly reflective or transparent surfaces.


Official paper


Link to the solution


Mesh R-CNN

The framework uses a two-stage approach: in the first stage, it detects and segments the object in the image using a convolutional neural network (CNN), similar to the popular Mask R-CNN framework. In the second stage, it regresses a set of 3D vertices for each object instance using a mesh prediction network.


The architecture scheme from the official paper


One of the main advantages of Mesh R-CNN is its ability to reconstruct detailed 3D meshes of objects, including their fine-grained geometry and texture. This makes it useful for applications such as virtual reality, augmented reality, and 3D printing.


Official paper


Link to the solution


We’ve covered several state-of-the-art solutions for solving the inverse graphics problem. All of these solutions can help you solve a wide variety of tasks, like reconstructing a room, creating a 3D local map, reconstructing a 3D scene from a single image, or even estimating the height and depth of crops or terrain to guide planting, harvesting, and irrigation decisions.


Keep in mind that all these solutions are based on different approaches to rendering, voxel prediction, mesh prediction, and so on. But they all have a common need to build or predict a depth map in one form or another.


That’s why I also propose to separately consider the problem of constructing a depth map.

DEPTH ESTIMATION

There are several ways to get a depth map:


  1. Use a stereo pair of RGB images.
  2. Use an RGB-D camera.
  3. Train a model to predict the depth map from a single RGB image.
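
For the stereo option, depth follows from the standard relation depth = focal length × baseline / disparity, which holds for a rectified stereo pair. A minimal sketch, with illustrative parameter values:

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Convert stereo disparity (pixels) to depth (meters) for a rectified pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# A feature shifted by 40 px between the left and right image.
depth_close = depth_from_disparity(disparity_px=40.0, focal_length_px=800.0, baseline_m=0.25)
# The same camera rig, but a shift of only 10 px.
depth_distant = depth_from_disparity(disparity_px=10.0, focal_length_px=800.0, baseline_m=0.25)
```

Smaller disparity means larger depth, so distant objects are measured less precisely: a one-pixel matching error matters much more at 10 px of disparity than at 40 px.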

Let’s look at several state-of-the-art solutions for predicting depth maps.


  1. GLPN (monocular depth estimation)

A new architecture with global and local feature paths through the entire network. The overall structure of the framework is as follows: a transformer encoder lets the model learn global dependencies, and the proposed decoder restores the extracted features into the target depth map by building a local path through skip connections and a feature fusion module.

The part of the photo with the result from the official paper

Official paper


Link to the solution


  2. Dense depth model

For the encoder, the RGB input image is encoded into a feature vector using a DenseNet-169 network pre-trained on ImageNet.


This vector is then fed into a sequence of upsampling layers that build the final depth map at half the input resolution. These upsampling layers and their associated skip connections form the decoder.


The part of the photo with the result from the official paper


Official paper


Link to the solution


  3. MiDaS

The architecture uses a vision transformer (ViT) as its backbone. The overall encoder-decoder structure that has been successful for dense prediction in the past is preserved. The input image is converted to tokens either by extracting non-overlapping patches and linearly projecting their flattened representation (DPT-Base and DPT-Large) or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is supplemented with a positional embedding and a patch-independent readout token. The tokens pass through several transformer stages. Tokens from different stages are reassembled into image-like representations at multiple resolutions (Reassemble). The Fusion modules progressively merge and upsample these representations to produce a fine-grained prediction.
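
The patch-based tokenization can be sketched in a few lines. This only covers splitting and flattening: the learned linear projection, the positional embedding, and the extra token are omitted, and a single-channel image stored as nested lists stands in for a real tensor.

```python
def image_to_patch_tokens(image, patch_size):
    """Split a 2D image into non-overlapping patches and flatten each one."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into one token.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            tokens.append(patch)
    return tokens


# A 4 x 4 image whose pixel values equal their row-major index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
```

A 4 × 4 image with 2 × 2 patches yields four tokens of four values each. In the real model each flattened patch would then be multiplied by a learned projection matrix before entering the transformer.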


The part of the photo with the result from the official paper


Link to the solution


All of these solutions can help you obtain a depth map and use it however you need. For instance, you might want to build 3D scenes with PyTorch3D.
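
One common way to use a depth map is to lift it into a 3D point cloud by inverting the pinhole projection. The sketch below assumes the depth values are metric and the camera intrinsics are known; relative depth predictions would first need to be rescaled. The function name and sample values are illustrative.

```python
def depth_map_to_points(depth_map, focal_length, cx, cy):
    """Back-project a depth map into 3D points in camera coordinates."""
    points = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # treat non-positive depth as "no measurement"
            x = (u - cx) * z / focal_length
            y = (v - cy) * z / focal_length
            points.append((x, y, z))
    return points


depth_map = [
    [2.0, 2.0],
    [0.0, 4.0],  # 0.0 marks a missing depth value
]
points = depth_map_to_points(depth_map, focal_length=2.0, cx=0.5, cy=0.5)
```

The resulting list of `(x, y, z)` tuples can be converted to a tensor and handed to a 3D library for visualization or meshing.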

Official paper

THE HAND OF CROWDSOURCING

In the last case, MiDaS achieved its results by combining data sources that had not been used together before. Diverse depth datasets are hard to collect at scale, so the authors introduced tools for mixing complementary data sources. In addition, a new dataset based on 3D movies provides reliable information about a variety of dynamic scenes.


This is why I want to focus on the data problem in 3D. Every developer faces it and has to work around it somehow, including through architectural choices. All the solutions I described relied on almost the same set of open datasets.


That set is not enough, because collecting such complex, high-quality data is difficult for various reasons, such as occlusions, poor lighting conditions, and limited viewpoints. When there is not enough data available, it becomes difficult to accurately estimate the depth and structure of the scene, leading to inaccurate or incomplete 3D reconstructions.


Crowdsourcing can be used as a potential solution to address the problem of needing more data for 3D reconstruction. By leveraging the collective effort of a large number of individuals, crowdsourcing can provide additional data and perspectives on a scene, which can improve the accuracy and completeness of the 3D reconstruction.


For example, a crowdsourcing platform could be used to collect multiple images of a scene taken from different viewpoints by a large number of contributors. These images could then be processed using multi-view stereo or structure-from-motion techniques to create a more accurate 3D reconstruction of the scene.


This is exactly what was implemented in the Neatsy project to partially compensate for a lack of 3D data. Neatsy develops AI software for virtual shoe sizing. They used the Toloka crowdsourcing platform to collect additional data (more than 50 thousand new photos) and improved the model’s metrics as a result. Their software creates a 3D model of your feet using around 50 different measurements and helps you find the perfect pair of sneakers. The project has since moved on and can now also diagnose foot health problems, all thanks to data from people in the crowd. This is just one example of the strong potential of 3D technology.

SUMMARY

There are many state-of-the-art solutions available for forward and inverse graphics, as well as for predicting the depth map. We looked at the approaches of each in practical applications and also noted the limitations caused by the lack of data for 3D. Crowdsourcing platforms have the potential to solve the data collection problem and support the development of 3D technologies for real-life computer vision applications.