Hi! Welcome to ‘Inside the Lab’, the research and engineering blog of artlabs. This week’s topic is how 3D content is represented and handled by AI methods, how AI uses these representations for 3D content creation, and the pros & cons of each technique.
Machine learning models are trained using various 3D content representations such as voxels, point clouds, signed distance fields, neural radiance fields (NeRF), polygonal meshes… We will talk about voxel, point cloud, NeRF, and polygon representations in this post. Let’s go over these, one by one.
You know about picture elements (a.k.a. pixels), but have you ever heard about volume elements (a.k.a. voxels)? Now you have! A pixel stores red, green, and blue intensity values, plus an opacity value, each between 0 and 255, on a 2D grid indexed by x and y coordinates. Voxels, similarly, store red, green, blue, and opacity values on a 3D grid indexed by x, y, and z. AI models aim to learn these 4 values for each voxel to efficiently represent a scene.
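To make this concrete, here is a minimal sketch of a voxel grid in Python/NumPy. The 32³ resolution, the red cube, and the array layout are hypothetical choices for illustration, not any particular model’s output format:

```python
import numpy as np

# A hypothetical 32x32x32 voxel grid with 4 channels per cell: R, G, B, opacity.
GRID = 32
voxels = np.zeros((GRID, GRID, GRID, 4), dtype=np.uint8)

# Fill one corner with a solid, fully opaque red cube.
voxels[:8, :8, :8] = [255, 0, 0, 255]

# An AI model would predict these 4 values for every cell; here we just inspect one.
print(voxels[4, 4, 4])  # -> [255   0   0 255]
```

Note how the grid itself encodes position: cell (4, 4, 4) needs no stored coordinates, which is exactly the simplicity the models below exploit.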
Machine learning models such as 3D-R2N2 (2016), Pix2Vox/++ (2019/2020), and EVolT (2021) take advantage of the voxel representation’s simplicity, using multi-view images of an object to reconstruct it as a voxel grid.
Voxels are hella good if you want to represent cubic shapes. Just as there is pixel art, there is also voxel-based 3D art. Furthermore, who doesn’t want to generate Minecraft-like worlds?! Metaverse platforms like The Sandbox also rely on voxel representations, and AI-based voxel creation can help improve them as well.
Well, you guessed it: point clouds are clouds of colored points in 3D space. Unlike voxels, they are not confined to a grid, so they can represent a wider range of objects more faithfully. However, since there is no grid, each point’s position in 3D space must be stored explicitly, which means keeping more data per point than a voxel grid requires.
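Here is a minimal sketch of that trade-off in Python/NumPy (the point count and random values are hypothetical, purely for illustration):

```python
import numpy as np

# A hypothetical point cloud of N points: 3 position values + 3 color values each.
N = 1000
positions = np.random.uniform(-1.0, 1.0, size=(N, 3)).astype(np.float32)  # x, y, z
colors = np.random.randint(0, 256, size=(N, 3), dtype=np.uint8)           # r, g, b

# Unlike a voxel grid, positions must be stored explicitly,
# so each point costs 6 numbers instead of a grid cell's 4.
cloud = np.concatenate([positions, colors.astype(np.float32)], axis=1)
print(cloud.shape)  # (1000, 6)
```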
Models such as OpenAI’s Point-E (2022) have demonstrated success in point cloud-based 3D content creation. However, as with everything good in the world, point clouds have their advantages and disadvantages.
Point clouds are already used widely in several industries. They can be acquired by LiDAR sensors mounted on drones or smart cars, and AI-generated point cloud objects and environments can be used in simulations to improve the algorithms behind driverless vehicles. They are also used in medical imaging, where AI-based creation of medical point clouds can improve the detection of disease and physical trauma in patients.
Given a set of images and the corresponding camera poses, a NeRF reconstructs a 3D scene by learning where each pixel of each image maps to in 3D space. Once the scene is reconstructed, a NeRF can provide a full 3D view of it, even from unseen angles. Furthermore, the representation itself is AI! It is a neural network that contains all the information required to render the 3D scene: query it with a new camera pose, and it responds with a render of that view. While the original NeRF had to be trained for hours (days on some occasions), several novel NeRF variants can now reconstruct a high-quality 3D scene within mere seconds.
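As a rough illustration, here is a toy radiance field in PyTorch. It only sketches the interface (a 3D position and a viewing direction in, a color and a volume density out); the actual NeRF architecture adds positional encoding and a much deeper network, and rendering an image requires integrating many such queries along each camera ray:

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy stand-in for a NeRF: maps (x, y, z) plus a 2D viewing
    direction to an RGB color and a volume density."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # r, g, b, density
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        out = self.mlp(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # colors squashed into [0, 1]
        density = torch.relu(out[..., 3:])  # densities kept non-negative
        return torch.cat([rgb, density], dim=-1)

# Query the field at one 3D point seen from one direction.
field = TinyRadianceField()
sample = field(torch.rand(1, 3), torch.rand(1, 2))
print(sample.shape)  # torch.Size([1, 4])
```

The key point: the scene lives entirely in the network’s weights, so “storing the scene” and “storing the model” are the same thing.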
Neural Radiance Fields can render scenes from any angle, so they can potentially see wide use in the cinematic arts. Camera angle and motion are famously important in cinematography, and NeRFs can create renders from angles a camera operator might have trouble reaching.
Polygonal meshes consist of points (vertices), lines that connect these points (edges), and polygons (faces) enclosed by those edges. Vertices are represented by their coordinates, edges by which vertices they connect, and polygons by which edges bound them. Furthermore, there are multiple ways of representing color on meshes, ranging from simply coloring each vertex with red, green, and blue intensity values to deciding how the surface interacts with any given light by providing material properties such as diffuse and specular reflectance, opacity, refractive index, surface normals, etc.
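A minimal sketch of this data structure in Python/NumPy, using a hypothetical tetrahedron with per-vertex colors (the simplest coloring scheme mentioned above). Most tools store faces as vertex indices, with the edges implied:

```python
import numpy as np

# Vertices: one 3D coordinate each.
vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)

# Faces: triangles given as indices into the vertex array.
# The edges are implied by which vertices each face connects.
faces = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [0, 2, 3],
    [1, 2, 3],
], dtype=np.int64)

# Per-vertex RGB colors.
vertex_colors = np.array([
    [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0],
], dtype=np.uint8)

print(vertices.shape, faces.shape)  # (4, 3) (4, 3)
```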
Methods such as NVDiffrec-MC (2022) can jointly infer a mesh, lighting, and materials from image sets. Lately, many more methods have been developed to reconstruct meshes and textures from text or image inputs: GET3D, DreamFusion, Score Jacobian Chaining, Magic3D…
Polygonal meshes are already used in gaming, cinematic arts, Web3, and XR, and industries like e-commerce benefit greatly from them by visualizing products in 3D. By creating content with AI, all of these industries can generate content at scale and awe their audiences.
At artlabs, we use all of these representations, together with AI, at different stages of our pipeline. See more of how artlabs utilizes AI to create content at scale here.
Thanks for reading! See you in the next post of “Inside the Lab” 👋🏻
Author: Doğancan Kebude, R&D Lead at artlabs