This promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated! How cool would it be to take a picture of an object on the internet, let's say a car, and automatically have the 3D object ready to insert in your game in less than a second? This is cool, right? Well, imagine that within a few seconds, you can even animate this car, making the wheels turn, flashing the lights, etc. Would you believe me if I told you that an AI could already do that? If video games weren't enough, this new application works for any 3D scene you are working on: illustrations, movies, architecture, design, and more!

Watch the video

References

Video demo: https://youtu.be/dvjwRBZ3Hnw
Karras et al. (2019), "StyleGAN": https://arxiv.org/pdf/1812.04948.pdf
Chen et al. (2019), "DIB-R": https://arxiv.org/pdf/1908.01210.pdf
Omniverse, NVIDIA (2021): https://www.nvidia.com/en-us/omniverse/
Zhang et al. (2020), "Image GANs Meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering": https://arxiv.org/pdf/2010.09125.pdf
GANverse3D official NVIDIA video: https://youtu.be/0PQnrnUIBlU
NVIDIA's GANverse3D blog article: https://blogs.nvidia.com/blog/2021/04/16/gan-research-knight-rider-ai-omniverse/

Video transcript

00:00 What you see here is someone carefully creating a scene for a video game.
00:04 It takes many hours of work by a professional just for a single object like this one.
00:10 How cool would it be to take a picture of an object on the internet, let's say a car, and automatically have the 3D object ready to insert in your game in less than a second?
00:21 This is cool, right?
00:22 Well, imagine that within a few seconds, you can even animate this car, making the wheels turn, flashing the lights, etc.
00:30 Would you believe me if I told you that an AI could already do that?
00:34 If video games weren't enough, this new application works for any 3D scene you are working on: illustrations, movies, architecture, design, and more!
00:44 It removes hundreds if not thousands of hours of long, iterative tests by professional designers, allowing small businesses to produce quick simulations at a much lower cost!
00:55 By the time you take a sip of coffee, this model will have processed an image of a car and generated a whole 3D animated version of it with realistic headlights, taillights, and blinkers!
01:07 Moreover, you can even drive it around in a virtual environment platform like Omniverse, as you can see here.
01:13 To introduce this new tool, presented at the recent GTC event: Omniverse was designed for creators who rely on virtual environments to test new ideas and visualize prototypes before creating their final products.
01:27 You can use this tool to simulate complex virtual worlds with real-time ray tracing.
01:32 Since this video isn't about Omniverse, which is awesome by itself, I will not dive further into this new platform's details.
01:39 I linked more resources about it in the description.
01:43 Here, I want to focus on the algorithm behind the 3D model generation technique NVIDIA published at ICLR and CVPR 2021.
01:52 Indeed, this promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated!
02:02 Just by its name, I think it won't surprise you if I say that it uses a GAN to achieve that.
02:08 Here, I won't go into how GANs work since I have covered them many times on my channel, where you can find many videos explaining them, like the one appearing in the top right corner right now.
02:20 Generative networks are relatively new to 3D model generation from 2D images, also called "inverse graphics." The task is complex: it requires understanding depth, textures, and lighting from multiple viewpoints of an object to generate an accurate 3D model.
02:39 Well, the researchers discovered that generative adversarial networks implicitly acquire such knowledge during training.
02:46 This means that the information regarding the shapes, lighting, and texture of the objects was already encoded inside the GAN model's latent code.
02:56 This latent code is the output of the encoder part of the GAN architecture and is typically sent into a decoder to generate a new image while controlling specific attributes.
03:07 As observed in previous research, we know that different layers control different attributes within the images, which is why you saw so many different and cool applications using GANs in the past year, where some could control the style of a face to generate cartoon images.
03:24 In contrast, others could make your head move, all from a single image of yourself.
03:30 In this case, they used the well-known StyleGAN architecture, a powerful generator behind many of the buzzworthy applications you saw on the internet and on my channel.
03:41 The researchers experimentally found that the first four layers control the camera viewpoint when the remaining layers are fixed.
03:49 Thus, by exploiting this characteristic of the StyleGAN architecture, they could use these first four layers to automatically generate novel viewpoints for the rendering task from only one picture!
04:01 Similarly, as you can see in the first two rows, doing the opposite and fixing these first four layers, they could produce images of different objects with the same viewpoint.
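The viewpoint trick above amounts to style mixing: feed one code to the first few layers and another to the rest. Here is a minimal sketch of that idea, assuming a StyleGAN-like generator that consumes one 512-dimensional style vector per layer; the layer count and the synthesis step are placeholders, not the paper's actual API.

```python
import numpy as np

NUM_LAYERS = 14        # assumed StyleGAN layer count for this sketch
VIEWPOINT_LAYERS = 4   # first layers found to control camera pose

def mix_styles(content_w, viewpoint_w, num_layers=NUM_LAYERS, cut=VIEWPOINT_LAYERS):
    """Build a per-layer style matrix: the first `cut` layers take the
    viewpoint code, while the remaining layers keep the content code fixed."""
    styles = np.tile(content_w, (num_layers, 1))   # shape (num_layers, 512)
    styles[:cut] = viewpoint_w                     # swap in the camera code
    return styles

# One object code, several camera codes -> a synthetic "multi-view" batch.
rng = np.random.default_rng(0)
car_w = rng.standard_normal(512)
views = [mix_styles(car_w, rng.standard_normal(512)) for _ in range(8)]
# Each matrix in `views` would be fed to the StyleGAN synthesis network
# (not shown) to render the same car from a different viewpoint.
```

Fixing the first four rows instead, and varying `content_w`, would give different objects seen from the same viewpoint, matching the first two rows of the figure.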
04:11 This characteristic, coupled with different loss functions, could control not only the shape and viewpoint of the images but also the texture and background!
04:21 This discovery is very innovative since most works on inverse graphics use 3D labels, or at least multi-view images of the same object, during the training of their rendering network.
04:34 This type of data is typically difficult to obtain and thus very limited.
04:38 These approaches struggle on real photographs because of the domain gap between the synthetic training images and real images, caused by this lack of training data.
04:49 As you can see, it only needs one picture to generate these amazing transformations that look just as real, reducing the need for data annotation by over 10,000 times.
04:59 Of course, this GAN architecture that generates such important novel viewpoints also needs to be trained on a lot of data to make this possible.
05:07 Fortunately, it is a lot less costly since it just needs many examples of the object itself and does not require multiple viewpoints of the same object, but this is still a limitation on which objects we can model using this technique.
05:23 As you can see here, StyleGAN is used as a multi-view generator to build the missing data needed to train the rendering architecture.
05:30 Before going into the renderer, let's jump back a little to understand the whole process.
05:36 You can see here that the architecture doesn't start with a regular image but with a latent code.
05:42 This latent code is basically what they learn during training.
05:45 The CNNs and MLP networks you see here are just basic convolutional neural networks and multi-layer perceptrons used to create a code that disentangles the shape, texture, and background of the image.
05:59 This means that this code will independently contain all these characteristics, which will then be used in the rendering model.
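The disentangling step can be pictured as separate small network heads reading different factors out of one latent code. The sketch below is purely illustrative: the head depths and every output dimension (camera parameters, mesh offsets, texture and background sizes) are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_head(in_dim, out_dim):
    """A single linear layer + tanh standing in for each small MLP head
    (the real networks are deeper; one layer keeps the sketch readable)."""
    W = rng.standard_normal((out_dim, in_dim)) * 0.01
    return lambda z: np.tanh(W @ z)

latent_dim = 512
heads = {
    "camera":     mlp_head(latent_dim, 3),        # e.g. azimuth, elevation, distance
    "mesh":       mlp_head(latent_dim, 642 * 3),  # per-vertex offsets of a template shape
    "texture":    mlp_head(latent_dim, 64 * 64 * 3),
    "background": mlp_head(latent_dim, 64 * 64 * 3),
}

z = rng.standard_normal(latent_dim)   # latent code from the StyleGAN encoder
factors = {name: head(z) for name, head in heads.items()}
# Each factor is predicted independently, so the renderer can recombine
# them: swap `texture` to repaint the car, or `camera` to move around it.
```

The point of the separation is exactly what the transcript describes: each characteristic lives in its own slot and can be sent to the renderer individually.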
06:05 During training, this code is updated to control these features by playing with the different StyleGAN layers, as we just saw.
06:12 When you use this model and send it an image, the image passes through the StyleGAN encoder, which creates the latent code containing all the information we need.
06:21 Then, this information is extracted using the disentangling module we just talked about, yielding the camera viewpoint, 3D mesh, texture, and background of your image.
06:32 These characteristics are individually sent to the renderer, which produces the final model.
06:37 In this architecture, the renderer is a state-of-the-art differentiable renderer called DIB-R, here referred to as DIFFGraphicsRenderer.
06:46 It is called a differentiable renderer because this technique, also developed by NVIDIA, just like StyleGAN and this very paper, was one of the first to allow the gradient to be computed analytically over entire images, making it possible to train a neural network to generate the 3D shape.
07:05 You can see that they mainly used state-of-the-art models for each individual task because the overall architecture is much more important and innovative than these models themselves, which are already extremely powerful on their own.
07:19 This is how this new paper, coupled with NVIDIA's new 3D platform, Omniverse, will allow architects, creators, game developers, and designers all over the world to easily add new animated objects to their mockups without needing any expertise in 3D modeling or a large budget to spend on renderings.
07:38 Note that this application currently only exists for cars, horses, and birds because of the amount of data GANs need to perform well, but this is extremely promising.
07:49 I just want to come back in one year and see how powerful it will have become.
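To see why differentiability matters, here is a deliberately tiny toy, not the DIB-R algorithm itself: a smooth one-dimensional "renderer" whose image loss has an analytic gradient with respect to a scene parameter, so gradient descent on pixels alone recovers the geometry.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 64)   # our 64-pixel "image" coordinates
SIGMA = 0.2

def render(p, sigma=SIGMA):
    """Render a soft blob centered at scene parameter p. Because the
    output varies smoothly with p, image losses are differentiable in p."""
    return np.exp(-((xs - p) ** 2) / (2 * sigma ** 2))

target = render(0.7)   # the "photo" we want to explain
p = 0.4                # initial guess for the blob position
lr = 5e-4
for _ in range(3000):
    img = render(p)
    residual = img - target                     # L2 image loss: 0.5 * sum(residual**2)
    d_img_dp = img * (xs - p) / SIGMA ** 2      # analytic derivative of render w.r.t. p
    p -= lr * np.sum(residual * d_img_dp)       # chain rule: dLoss/dp
# p converges toward 0.7: pixel comparisons alone recovered the geometry.
```

DIB-R scales roughly this same principle up to full meshes: its soft rasterization makes the rendered pixels a differentiable function of vertex positions, textures, and lighting, which is what lets the whole inverse-graphics network train end to end.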
07:53 Who would've thought 10 or 20 years ago that creating a controllable, realistically animated version of your car on your computer screen could take less than one second?
08:04 And that to do so, you would only need a shiny little gadget in your pocket to take a picture of it and upload it?
08:11 This is just crazy.
08:12 I can't wait to see what researchers will come up with in another 10 to 20 years!
08:17 Before ending this video, I just wanted to announce that I created a Patreon if you would like to support my work.
08:23 It would help me improve the quality of the videos and keep on making them.
08:28 Regardless of what you decide to do, I will take this opportunity to thank you for watching the videos.
08:33 This is, of course, already more than enough!