This promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated!
What you see here is someone carefully creating
a scene for a video game.
It takes many hours of work by a professional just for a single object like this one.
How cool would it be to take a picture of an object on the internet, let's say a car,
and automatically have the 3D object in less than a second ready to insert in your game?
This is cool, right?
Well, imagine that within a few seconds, you can even animate this car, making the wheels
turn, flashing the lights, etc.
Would you believe me if I told you that an AI could already do that?
If video games weren't enough, this new application works for any 3D scene you are working on,
illustrations, movies, architecture, design, and more!
It could remove hundreds, if not thousands, of hours that professional designers spend on long,
iterative tests, allowing small businesses to produce quick simulations much more cheaply!
By the time you take a sip of coffee, this model will have processed an image of a car
and generated a whole 3D animated version of it with realistic headlights and taillights.
Moreover, you can even drive it around in a virtual environment platform like Omniverse,
as you can see here.
To briefly introduce this new tool, presented at the recent GTC event: Omniverse was designed for
creators who rely on virtual environments to test new ideas and visualize prototypes
before creating their final products.
You can use this tool to simulate complex virtual worlds with real-time ray tracing.
Since this video isn't about Omniverse, which is awesome by itself, I will not dive further
into this new platform's details.
I linked more resources about it in the description.
Here, I want to focus on the algorithm behind the 3D model generation technique NVIDIA published
at ICLR and CVPR 2021.
Indeed, this promising model called GANverse3D only needs an image to create a 3D figure
that can be customized and animated!
Just by its name, I think it won't surprise you if I say that it uses a GAN to achieve this.
Here, I won't go into how GANs work since I have covered it many times on my channel, where
you can find many videos explaining them, like the one appearing in the top right corner.
Generative networks are relatively new to 3D model generation from 2D images, also called
"inverse graphics", because of the complexity of the task: it requires understanding depth,
texture, and lighting, usually from multiple viewpoints of an object, to generate an accurate 3D model.
Well, the researchers discovered that generative adversarial networks were implicitly acquiring
such knowledge during training.
Meaning that the information regarding the shapes, lighting, and texture of the objects
was already encoded inside the GAN model's latent code.
This latent code is the compact vector that the GAN's generator takes as input and
turns into a new image, and changing it lets you control specific attributes of that image.
As observed in previous research, we know that different layers control different attributes
within the images, which is why you saw so many different and cool applications using
GANs in the past year, where some could control the style of a face to generate cartoon versions of it,
while others could make your head move, all from a single image of yourself.
In this case, they used the well-known StyleGAN architecture, a powerful generator used on
many different buzz applications you saw on the internet and my channel.
The researchers experimentally found that the first four layers control the camera
viewpoint when the remaining layers are kept fixed.
Thus, by manipulating this characteristic of the StyleGAN architecture, they could use
these first four layers to automatically generate such novel viewpoints for the rendering task
from only one picture!
Similarly, as you can see in the first two rows, doing the opposite and fixing these
first four layers, they could produce images of different objects with the same viewpoints.
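To make this layer trick a bit more concrete, here is a minimal sketch of the idea in Python with NumPy. It is not the authors' code and it does not run StyleGAN: it only shows how per-layer latent codes could be mixed so that the first four layers come from a "viewpoint" code and the rest from a "content" code. The layer count, the latent size, and the helper names `mix_codes` and `multi_view_codes` are assumptions made up for this illustration.

```python
import numpy as np

NUM_LAYERS = 16   # assumed number of per-layer style inputs (depends on the model)
LATENT_DIM = 512  # assumed size of each per-layer latent vector
VIEW_LAYERS = 4   # the "first four layers" that control the camera viewpoint


def mix_codes(viewpoint_code, content_code, view_layers=VIEW_LAYERS):
    """Hypothetical helper: first `view_layers` rows from the viewpoint code,
    all remaining rows from the content code."""
    mixed = content_code.copy()
    mixed[:view_layers] = viewpoint_code[:view_layers]
    return mixed


def multi_view_codes(content_code, viewpoint_codes):
    """Hypothetical helper: one mixed code per viewpoint, same object in each."""
    return np.stack([mix_codes(v, content_code) for v in viewpoint_codes])


rng = np.random.default_rng(0)
car = rng.standard_normal((NUM_LAYERS, LATENT_DIM))       # one object
views = rng.standard_normal((5, NUM_LAYERS, LATENT_DIM))  # five viewpoints

batch = multi_view_codes(car, views)
print(batch.shape)  # (5, 16, 512)
```

Feeding each row of `batch` to a trained generator would, under these assumptions, give the same car seen from five viewpoints. Swapping the roles (one viewpoint code, many content codes) gives different objects from the same viewpoint.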
This characteristic, coupled with different loss functions, could control not only the
shape and viewpoints of the images but also the texture and background!
This discovery is highly innovative since most works on inverse graphics use 3D labels or
at least multi-view images of the same object during the training of their rendering network.
This type of data is typically difficult to obtain and thus very limited.
Because of this lack of training data, these approaches struggle on real photographs:
there is a domain gap between the synthetic training images and real images.
As you can see, this model only needs one picture to generate these amazing transformations
that look just as real, reducing the need for data annotation by a factor of over 10,000.
Of course, this GAN architecture that generates such important novel viewpoints also needs
to be trained on a lot of data to make this possible.
Fortunately, it is a lot less costly since it just needs many examples of the object
itself and does not require multiple viewpoints of the same object, but this is still a limitation
on which objects we can model using this technique.
As you can see here, StyleGAN is used as a multi-view generator to build the missing
data to train the rendering architecture.
Before going into the renderer, let's jump back a little to understand the whole process.
You can see here that the architecture doesn't start with a regular image but with a latent code.
This latent code is basically what they learn during training.
The CNNs and MLP networks you see here are just basic convolutional neural networks and
multi-layer perceptrons used to create a code that disentangles the shape, texture, and
background of the image.
Meaning that this code will independently contain all these characteristics that will
be used in the rendering model.
During training, this code is updated to control these features by playing with the different
StyleGAN layers, as we just saw.
When you use this model and send it an image, the image will pass through the StyleGAN encoder
to create the latent code containing all the information we need.
Then, the disentangling module we just talked about will extract
the camera viewpoint, the 3D mesh, the texture, and the background of your image.
These characteristics are individually sent to the renderer, which produces the final model.
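The pipeline we just walked through can be sketched roughly as follows. This is only a toy illustration with untrained, randomly initialized weights, not the paper's network: it uses small MLP heads only (the real module also uses CNNs), and the mesh size, texture resolution, three-number camera, and names such as `make_mlp` and `disentangle` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 512    # assumed latent code size
NUM_VERTICES = 642  # assumed number of mesh vertices
TEX_RES = 8         # assumed (tiny) texture and background resolution


def make_mlp(in_dim, hidden_dim, out_dim):
    """Hypothetical helper: random, untrained weights for a two-layer MLP."""
    return {
        "w1": rng.standard_normal((in_dim, hidden_dim)) * 0.01,
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, out_dim)) * 0.01,
        "b2": np.zeros(out_dim),
    }


def run_mlp(params, x):
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]


# One head per factor we want to pull out of the latent code.
heads = {
    "camera": make_mlp(LATENT_DIM, 64, 3),  # assumed: azimuth, elevation, distance
    "shape": make_mlp(LATENT_DIM, 64, NUM_VERTICES * 3),
    "texture": make_mlp(LATENT_DIM, 64, TEX_RES * TEX_RES * 3),
    "background": make_mlp(LATENT_DIM, 64, TEX_RES * TEX_RES * 3),
}


def disentangle(latent):
    """Split one latent code into the four inputs a renderer would need."""
    return {
        "camera": run_mlp(heads["camera"], latent),
        "shape": run_mlp(heads["shape"], latent).reshape(NUM_VERTICES, 3),
        "texture": run_mlp(heads["texture"], latent).reshape(TEX_RES, TEX_RES, 3),
        "background": run_mlp(heads["background"], latent).reshape(TEX_RES, TEX_RES, 3),
    }


latent_code = rng.standard_normal(LATENT_DIM)
factors = disentangle(latent_code)
for name, value in factors.items():
    print(name, value.shape)
```

The point of the design is that each factor gets its own output, so the renderer can change the camera or the texture without touching the shape.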
In this architecture, the renderer is a state-of-the-art differentiable renderer called DIB-R, here
referred to as DIFFGraphicsRenderer.
It is called a differentiable renderer because this technique, also developed by NVIDIA
just like StyleGAN and this very paper, was one of the first to allow the gradient to
be analytically computed over the entire image, making it possible to train a neural network
to generate the 3D shape.
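To give a feel for what "differentiable" buys you, here is a toy sketch that is deliberately much simpler than DIB-R and is not based on its code. It "renders" a 1D soft silhouette of an object, computes the gradient of an image loss with respect to the object's position analytically, checks it against finite differences, and then recovers the position by gradient descent. The soft-edge formula, the parameter values, and all function names are assumptions made for this illustration.

```python
import numpy as np

PIXELS = np.linspace(0.0, 1.0, 64)  # pixel centers of a 1D "image"
HALF_WIDTH = 0.15                   # assumed object half-width
SOFTNESS = 0.05                     # assumed edge softness; smaller means sharper


def render(center):
    """Soft silhouette: close to 1 inside the object, close to 0 outside."""
    signed_dist = HALF_WIDTH - np.abs(PIXELS - center)
    return 1.0 / (1.0 + np.exp(-signed_dist / SOFTNESS))


def loss(center, target):
    return np.mean((render(center) - target) ** 2)


def loss_grad(center, target):
    """Analytic d(loss)/d(center), obtained with the chain rule."""
    s = render(center)
    ds_dcenter = s * (1.0 - s) * np.sign(PIXELS - center) / SOFTNESS
    return np.mean(2.0 * (s - target) * ds_dcenter)


target_image = render(0.6)  # the "photo" we want to explain
center = 0.3                # initial guess for the object position

# Sanity check: the analytic gradient should match finite differences.
eps = 1e-6
numeric = (loss(center + eps, target_image) - loss(center - eps, target_image)) / (2 * eps)
analytic = loss_grad(center, target_image)

# Gradient descent on the position, using only the rendered image as signal.
for _ in range(300):
    center -= 0.05 * loss_grad(center, target_image)

print(round(center, 3))
```

Because the loss is defined on the rendered pixels and its gradient flows back to the scene parameter, the same principle lets a network that predicts shape, texture, and camera be trained from images alone.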
You can see that they mainly used state-of-the-art models for each individual task because the
overall architecture is much more important and innovative than these models themselves
that are already extremely powerful on their own.
This is how this new paper, coupled with NVIDIA's new 3D platform, Omniverse, will allow architects,
creators, game developers, and designers all over the world to easily add new animated objects
to their mockups without needing any expertise in 3D modeling or a large budget to spend.
Note that this application currently only exists for cars, horses, and birds because
of the amount of data GANs need to perform well, but this is extremely promising.
I just want to come back in one year and see how powerful it will have become.
Who would've thought 10 or 20 years ago that creating a controllable, realistically animated
version of your car on your computer screen could take less than one second?
And that to do so, it only needed a shiny little gadget in your pocket to take a picture
of it and upload it.
This is just crazy.
I can't wait to see what researchers will come up with in another 10-20 years!
Before ending this video, I just wanted to announce that I just created a Patreon if
you would like to support my work.
It would help me improve the quality of the videos and keep on making them.
Regardless of what you decide to do, I will take this opportunity to thank you for watching.
This is, of course, already more than enough!