This promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated!
How cool would it be to take a picture of an object on the internet, let’s say a car, and automatically have the 3D object in less than a second ready to insert in your game?
This is cool, right? Well, imagine that within a few seconds, you can even animate this car, making the wheels turn, flashing the lights, etc. Would you believe me if I told you that an AI could already do that? If video games weren’t enough, this new application works for any 3D scene you are working on, illustrations, movies, architecture, design, and more!
What you see here is someone carefully creating a scene for a video game. It takes many hours of work by a professional just for a single object like this one. How cool would it be to take a picture of an object on the internet, let's say a car, and automatically have the 3D object ready to insert in your game in less than a second? This is cool, right? Well, imagine that within a few seconds, you can even animate this car, making the wheels turn, flashing the lights, etc. Would you believe me if I told you that an AI could already do that? If video games weren't enough, this new application works for any 3D scene you are working on: illustrations, movies, architecture, design, and more!
This removes hundreds, if not thousands, of hours of long, iterative testing by professional designers, and allows small businesses to produce quick simulations a lot cheaper! By the time you take a sip of coffee, this model will have processed an image of a car and generated a whole 3D animated version of it with realistic headlights, taillights, and blinkers! Moreover, you can even drive it around in a virtual environment platform like Omniverse, as you can see here.
Omniverse, the new tool presented at the recent GTC event, was designed for creators who rely on virtual environments to test new ideas and visualize prototypes before creating their final products. You can use this tool to simulate complex virtual worlds with real-time ray tracing. Since this video isn't about Omniverse, which is awesome by itself, I will not dive further into this new platform's details; I linked more resources about it in the description. Here, I want to focus on the algorithm behind the 3D model generation technique NVIDIA published at ICLR and CVPR 2021.
Indeed, this promising model, called GANverse3D, only needs an image to create a 3D figure that can be customized and animated! Just from its name, I think it won't surprise you if I say that it uses a GAN to achieve that. Here, I won't go into how GANs work, since I have covered them many times on my channel, where you can find many videos explaining them, like the one appearing in the top right corner right now.
Generative networks are relatively new to 3D model generation from 2D images, a task also called "inverse graphics" because of its complexity: it requires understanding depth, textures, and lighting across multiple viewpoints of an object to generate an accurate 3D model. Well, the researchers discovered that generative adversarial networks were implicitly acquiring such knowledge during training, meaning that the information regarding the shapes, lighting, and texture of the objects was already encoded inside the GAN model's latent code. This latent code is the output of the encoder part of the GAN architecture and is typically sent into a decoder to generate a new image while controlling specific attributes.
As observed in previous research, we know that different layers control different attributes within the images, which is why you saw so many different and cool applications using GANs in the past year: some could control the style of a face to generate cartoon images, while others could make your head move, all from a single image of yourself. In this case, they used the well-known StyleGAN architecture, a powerful generator behind many of the buzzworthy applications you saw on the internet and on my channel.
The researchers experimentally found that the first four layers control the camera viewpoint when the remaining layers are fixed. Thus, by manipulating this characteristic of the StyleGAN architecture, they could use these first four layers to automatically generate novel viewpoints for the rendering task from only one picture! Similarly, as you can see in the first two rows, doing the opposite and fixing these first four layers, they could produce images of different objects from the same viewpoint. This characteristic, coupled with different loss functions, could control not only the shape and viewpoint of the images but also the texture and background!
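To make this layer-mixing idea concrete, here is a toy NumPy sketch. Only the "first four layers control the viewpoint" observation comes from the paper; the 16-layer, 512-dimensional latent shape is an assumption based on typical StyleGAN configurations, and the function and variable names are invented for illustration:

```python
import numpy as np

N_LAYERS, DIM = 16, 512   # assumed per-layer latent shape; real StyleGAN configs vary
VIEW_LAYERS = 4           # the first layers found to control the camera viewpoint

def mix_viewpoint(content_w, view_w, n_view_layers=VIEW_LAYERS):
    """Keep the object identity from `content_w` but take the camera
    viewpoint from `view_w` by swapping only the first style vectors."""
    mixed = content_w.copy()
    mixed[:n_view_layers] = view_w[:n_view_layers]
    return mixed

rng = np.random.default_rng(0)
car = rng.normal(size=(N_LAYERS, DIM))        # latent code of the input car
new_pose = rng.normal(size=(N_LAYERS, DIM))   # latent sampled for a new camera pose

w_plus = mix_viewpoint(car, new_pose)
# Layers 0-3 now come from the pose latent; layers 4-15 still describe the car.
```

Feeding such mixed codes back through the generator is what lets the model fabricate a multi-view training set starting from a single photo.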
This discovery is very innovative, since most works on inverse graphics use 3D labels, or at least multi-view images of the same object, during the training of their rendering network. This type of data is typically difficult to obtain and thus very limited. Because of this lack of training data, such approaches struggle on real photographs due to the domain gap between the synthetic training images and real images. As you can see, this model only needs one picture to generate these amazing transformations that look just as real, reducing the need for data annotation by over 10,000 times.
Of course, this GAN architecture that generates such important novel viewpoints also needs to be trained on a lot of data to make this possible. Fortunately, that training is a lot less costly, since it just needs many examples of the object category itself and does not require multiple viewpoints of the same object; still, this is a limitation on which objects we can model using this technique. As you can see here, StyleGAN is used as a multi-view generator to build the missing data needed to train the rendering architecture. Before going into the renderer, let's jump back a little to understand the whole process. You can see here that the architecture doesn't start with a regular image but with a latent code.
This latent code is basically what they learn during training. The CNNs and MLP networks you see here are just basic convolutional neural networks and multi-layer perceptrons used to create a code that disentangles the shape, texture, and background of the image, meaning that this code will independently contain all these characteristics to be used in the rendering model. During training, this code is updated to control these features by playing with the different StyleGAN layers, as we just saw. When you use this model and send it an image, the image will pass through the StyleGAN encoder to create the latent code containing all the information we need. Then, this information is extracted using the disentangling module we just talked about, yielding the camera viewpoint, the 3D mesh, the texture, and the background of your image. These characteristics are individually sent to the renderer, which produces the final model.
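A hedged sketch of that inference path: the stage names follow the description above, but every function body, shape, and split point below is a made-up stand-in for illustration, not NVIDIA's actual code:

```python
import numpy as np

def stylegan_encoder(image):
    """Map an input image to a latent code (here: just flatten and truncate)."""
    return image.reshape(-1)[:512]

def disentangle(latent):
    """Split the latent into the four factors the paper extracts.
    The slice boundaries are arbitrary placeholders."""
    return {
        "camera": latent[:8],        # viewpoint parameters
        "mesh": latent[8:200],       # 3D shape code
        "texture": latent[200:400],  # surface appearance code
        "background": latent[400:],  # everything behind the object
    }

def renderer(factors):
    """Stand-in for the differentiable renderer: combine the factors."""
    return float(sum(part.sum() for part in factors.values()))

image = np.ones((64, 64))            # a dummy "photo" of a car
factors = disentangle(stylegan_encoder(image))
final_model = renderer(factors)
```

The real pipeline differs in every detail, of course; the point is only the flow: one image in, four independent factors out, then a render.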
In this architecture, the renderer is a state-of-the-art differentiable renderer called DIB-R, here referred to as DIFFGraphicsRenderer. It is called a differentiable renderer because this technique, also developed by NVIDIA, just like StyleGAN and this very paper, was one of the first to allow the gradient to be computed analytically over the entire image, making it possible to train a neural network to generate the 3D shape. You can see that they mainly used state-of-the-art models for each individual task, because the overall architecture is much more important and innovative than these models themselves, which are already extremely powerful on their own.
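To see why differentiability matters, here is a deliberately tiny stand-in where the "renderer" is just a fixed linear projection of vertex coordinates. Because the gradient of the image loss can be computed analytically through it, plain gradient descent recovers the shape from the rendered pixels. This illustrates the principle only; DIB-R itself rasterizes real meshes:

```python
import numpy as np

rng = np.random.default_rng(1)
project = rng.normal(size=(16, 6))   # toy "camera": maps 6 vertex coords to 16 pixels
target = rng.normal(size=16)         # pixels of the ground-truth photo

def render(vertices):
    # The whole point: rendering is a smooth function of the shape parameters,
    # so gradients of any image loss flow back to the vertices.
    return project @ vertices

vertices = np.zeros(6)               # initial guess for the 3D shape
for _ in range(2000):
    residual = render(vertices) - target
    grad = 2 * project.T @ residual  # analytic gradient through the renderer
    vertices -= 0.01 * grad          # gradient descent on the shape itself

final_loss = np.sum((render(vertices) - target) ** 2)
```

With a non-differentiable renderer, the `grad` line would be impossible, and the shape could not be trained from image losses at all.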
This is how this new paper, coupled with NVIDIA's new 3D platform Omniverse, will allow architects, creators, game developers, and designers around the world to easily add new animated objects to their mockups without needing any expertise in 3D modeling or a large budget to spend on renderings. Note that this application currently only exists for cars, horses, and birds because of the amount of data GANs need to perform well, but it is extremely promising. I just want to come back in one year and see how powerful it will have become. Who would've thought 10 or 20 years ago that creating a controllable, realistically animated version of your car on your computer screen could take less than one second? And that to do so, you would only need a shiny little gadget in your pocket to take a picture of it and upload it. This is just crazy. I can't wait to see what researchers will come up with in another 10 to 20 years!
Before ending this video, I just wanted to announce that I created a Patreon, if you would like to support my work. It would help me improve the quality of the videos and keep on making them. Regardless of what you decide to do, I will take this opportunity to thank you for watching the videos. This is, of course, already more than enough!