With LASR, you can generate 3D models of humans or animals in motion from only a short video as input. This task is called 3D reconstruction, and Google Research, together with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video. Learn more about the project below.

References:
►Read the full article: https://www.louisbouchard.ai/3d-reconstruction-from-videos
►Gengshan Yang et al., (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR, https://lasr-google.github.io/

Video Transcript

How hard is it for a machine to understand an image? Researchers have made a lot of progress in image classification, image detection, and image segmentation. These three tasks iteratively deepen our understanding of what's going on in an image. In the same order, classification tells us what is in the image, detection tells us approximately where it is, and segmentation tells us precisely where it is.

Now, an even more complex step would be to represent this image in the real world. In other words, it would be to represent an object taken from an image or video as a 3D surface, just like GANverse3D can do for inanimate objects, as I showed in a recent video. This demonstrates a deep understanding of the image or video by the model, since it has to represent the complete shape of an object, which is why it is such a complex task.

Even more challenging is to do the same thing on non-rigid shapes, or rather on humans and animals: objects that can be oddly shaped and even deformed to a certain extent. This task of generating a 3D model from a video or images is called 3D reconstruction, and Google Research, along with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video. As the name says, this is a new method for generating 3D models of humans or animals in motion from only a short video as input. Indeed, it actually understands that this is an odd shape that can move, but that it still needs to stay attached, as this is still one "object" and not just many objects put together.

Typically, 3D modeling techniques needed a data prior. In this case, the data prior was an approximate shape of the complex object, which looks like this... As you can see, it had to be quite similar to the actual human or animal, which is not very intelligent. With LASR, you can produce even better results with no prior at all: it starts from just a plain sphere, whatever the object to reconstruct. You can imagine what this means for generalizability and how powerful this can be when you don't have to explicitly tell the network both what the object is and how it "typically" looks. This is a significant step forward!

But how does it work? As I said, it only needs a video, but there are still some pre-processing steps to do. Don't worry, these steps are quite well understood in computer vision. As you may recall, I mentioned image segmentation at the beginning of the video. We need this segmentation of the object, which can be done easily using a trained neural network. Then, we need the optical flow for each frame, which is the motion of objects between consecutive frames of the video. This is also easily found using computer vision techniques and improved with neural networks, as I covered not even a year ago on my channel.
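To make these two pre-processing steps concrete, here is a minimal sketch of how you could produce per-frame segmentation masks and optical flow with off-the-shelf tools. It is not the pipeline used in the paper: torchvision's pretrained Mask R-CNN and OpenCV's Farneback flow are stand-ins for whatever networks the authors actually rely on, and the video file name is hypothetical.

```python
# Hypothetical pre-processing sketch: LASR expects a segmentation mask and
# dense optical flow per frame; here we approximate them with off-the-shelf
# tools rather than the networks used in the paper.
import cv2
import torch
import torchvision

# Pretrained instance-segmentation network (a stand-in, not the paper's choice).
seg_model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def segment_frame(frame_bgr, score_thresh=0.7):
    """Return a binary foreground mask for the most confident detection."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = seg_model([tensor])[0]
    keep = out["scores"] > score_thresh
    if not keep.any():
        return None
    # Take the highest-scoring instance as "the" object moving in the video.
    mask = out["masks"][keep][0, 0] > 0.5
    return mask.numpy().astype("uint8")

def optical_flow(prev_bgr, next_bgr):
    """Dense optical flow between two consecutive frames, shape (H, W, 2)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

# Usage: collect (frame, mask, flow) triplets for every consecutive frame pair.
cap = cv2.VideoCapture("input_video.mp4")  # hypothetical file name
ok, prev = cap.read()
inputs = []
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    inputs.append((frame, segment_frame(frame), optical_flow(prev, frame)))
    prev = frame
```

These masks and flow fields are the only supervision signals the reconstruction stage compares itself against.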
They start the rendering process with a sphere, assuming it is a rigid object, so an object that has no articulations. With this assumption, they iteratively optimize their model's understanding of the shape and the camera viewpoint for 20 epochs. This rigid assumption is shown here with the number of bones equal to zero, meaning that nothing can move separately.

Then, we get back to real life, where the human is not rigid. Now, the goal is to have an accurate 3D model that can move realistically. This is achieved by increasing the number of bones and vertices to make the model more and more precise. Here, the vertices are points in 3D space where the edges and faces of the rendered object connect, and the bones are, well, basically bones: all the parts of the object that move during the video, with either translations or rotations. Both the bones and vertices are incrementally augmented until we reach stage 3, where the model has learned to generate a pretty accurate render of the current object.

Here, they also need a model to render this object, which is called a differentiable renderer. I won't dive into how it works, as I already covered it in previous videos, but basically, it is a model able to create a 3-dimensional representation of an object. Its particularity is that it is differentiable, meaning that you can train this model in a similar way to a typical neural network, with back-propagation. Here, everything is trained together, optimizing the results following the four stages we just saw and improving the rendered result at each stage.

The model then learns just like any other machine learning model, using gradient descent and updating its parameters based on the difference between the rendered output and the ground-truth video measurements. So it doesn't even need to see a ground-truth version of the rendered object. It only needs the video, segmentation, and optical flow results to learn, by transforming the rendered object back into a segmented image and its optical flow and comparing them to the input.

What is even better is that all this is done in a self-supervised learning process, meaning that you give the model the videos to train on with their corresponding segmentation and optical flow results, and it iteratively learns to render the objects during training. No annotations are needed at all! And voilà, you have your complex 3D renderer without any special training or ground truth needed!

If gradient descent, epochs, parameters, or self-supervised learning are still unclear concepts to you, I invite you to watch the series of short videos I made explaining the basics of machine learning. As always, the full article is available on my website, louisbouchard.ai, with many other great papers explained and more information. Thank you for watching.
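For readers who want to see what this silhouette-driven, self-supervised optimization looks like in code, here is a minimal sketch built on PyTorch3D's soft silhouette renderer. It is not the authors' implementation: it only corresponds to the rigid, zero-bone stage (a sphere whose vertex offsets are fit to a target mask by gradient descent) and omits the optical-flow term, the bone model, and the camera optimization. All names and settings below are illustrative assumptions.

```python
# Minimal, hypothetical analysis-by-synthesis loop (rigid stage only).
# A sphere's vertex offsets are optimized so its rendered silhouette matches
# the target mask; no 3D ground truth is ever used.
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    look_at_view_transform, FoVPerspectiveCameras, RasterizationSettings,
    MeshRasterizer, MeshRenderer, SoftSilhouetteShader,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Differentiable silhouette renderer: gradients flow from the rendered alpha
# channel back to the mesh vertices (and could also flow to camera parameters).
R, T = look_at_view_transform(dist=2.7, elev=0, azim=0)
cameras = FoVPerspectiveCameras(device=device, R=R, T=T)
raster_settings = RasterizationSettings(image_size=128, blur_radius=1e-4, faces_per_pixel=50)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(),
)

# Start from a plain sphere, as in the paper's rigid initialization.
sphere = ico_sphere(3, device)
verts, faces = sphere.verts_packed(), sphere.faces_packed()
offsets = torch.zeros_like(verts, requires_grad=True)  # learnable shape deformation

# Placeholder target: one binary silhouette from the pre-processing step.
target_mask = torch.zeros(1, 128, 128, device=device)

optimizer = torch.optim.Adam([offsets], lr=1e-2)
for epoch in range(20):  # the rigid stage is optimized for 20 epochs
    optimizer.zero_grad()
    mesh = Meshes(verts=[verts + offsets], faces=[faces])
    silhouette = renderer(mesh)[..., 3]              # rendered alpha channel
    loss = ((silhouette - target_mask) ** 2).mean()  # silhouette reconstruction loss
    loss.backward()                                  # self-supervised: no 3D labels
    optimizer.step()
```

In LASR itself, later stages add bones (per-part rigid transformations blended by skinning weights) and an optical-flow reconstruction term on top of this silhouette loss, which is what lets the mesh move realistically while staying one connected object.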