CVPR 2021 Best Paper Award: GIRAFFE Controllable Image Generation

by Louis Bouchard, July 9th, 2021

Too Long; Didn't Read

The CVPR 2021 Best Paper Award goes to Michael Niemeyer and Andreas Geiger from the Max Planck Institute for Intelligent Systems and the University of Tübingen for their paper, GIRAFFE. They look at generating new images and controlling what will appear: the objects, their positions and orientations, the background, and so on. Using a modified GAN architecture, they can even move objects in the image without affecting the background or the other objects. Learn more in the video below!
Read the full article: https://www.louisbouchard.ai/cvpr-2021-best-paper/

The CVPR 2021 Best Paper Award goes to Michael Niemeyer and Andreas Geiger from the Max Planck Institute for Intelligent Systems and the University of Tübingen for their paper, GIRAFFE, which looks at the task of controllable image synthesis.

In other words, they look at generating new images and controlling what will appear: the objects, their positions and orientations, the background, and so on.

Using a modified GAN architecture, they can even move objects in the image without affecting the background or the other objects! CVPR is a yearly conference that took place just last week, with a large number of new computer vision research papers released for the event.

Learn more in the video below!


References

►Read the full article: https://www.louisbouchard.ai/cvpr-2021-best-paper/
►Michael Niemeyer and Andreas Geiger (2021), "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields", CVPR 2021.
►Project link with paper and more: https://m-niemeyer.github.io/project-pages/giraffe/index.html
►Code: https://github.com/autonomousvision/giraffe
►NeRF video:


Video Transcript

00:00

The CVPR 2021 Best Paper Award goes to Michael Niemeyer and Andreas Geiger from the Max Planck

00:07

Institute for Intelligent Systems and the University of Tübingen for their paper called

00:13

GIRAFFE, which looks at the task of controllable image

00:16

synthesis.

00:17

In other words, they look at generating new images and controlling what will appear, the

00:22

objects and their positions and orientations, the background, etc.

00:27

Using a modified GAN architecture, they can even move objects in the image without affecting

00:31

the background or the other objects!

00:34

CVPR is a yearly conference that happened just last week where a ton of new research

00:38

papers in computer vision were released just for this event.

00:42

As you already know, if you regularly watch my videos, conventional GAN architectures

00:47

work with an encoder and a decoder setup, just like this.

00:51

During training, the encoder receives an image, encodes it into a condensed representation,

00:56

and the decoder takes this representation to create a new image changing the style.

01:01

This is repeated numerous times with all the images we have in our training dataset so

01:06

that the encoder and decoder learn how to maximize the results of the task we want to

01:10

achieve during training.

01:12

Once the training is done, you can send an image to the encoder, and it will do the same

01:17

process, generating a new and unseen image following your needs.

01:20

It will work very similarly whatever the task, whether it is to translate an image of a face

01:26

into another style like a cartoonifier or create a beautiful landscape out of a quick

01:31

draft.

01:32

Using only the decoder, which we also call the generator since it is the model responsible

01:37

for creating the new image, we can walk in this encoded information space

01:42

and sample information that we send the generator to generate an infinite amount of new images.

01:48

This encoded information space is often referred to as the latent space, and the information

01:53

we use to generate the new image the latent code.

01:56

We basically select some latent code randomly within this optimal space, and it generates

02:02

a new random image following the task we want to achieve, following a training process of

02:07

this generator, of course.
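
To make the idea of sampling a latent code concrete, here is a minimal sketch. It is not the GIRAFFE generator: `toy_generator`, `LATENT_DIM`, `IMAGE_SIDE`, and the random `weights` are hypothetical stand-ins for a trained network, and drawing the code from a standard normal distribution is an assumption (although a common choice for GANs).

```python
import numpy as np

LATENT_DIM = 64   # hypothetical size of the latent code
IMAGE_SIDE = 8    # hypothetical (tiny) output resolution


def toy_generator(z, weights):
    """Stand-in for a trained generator: maps one latent code to a small image."""
    pixels = np.tanh(weights @ z)            # values squashed into [-1, 1]
    return pixels.reshape(IMAGE_SIDE, IMAGE_SIDE)


rng = np.random.default_rng(0)
# Pretend these weights were learned during training.
weights = rng.normal(size=(IMAGE_SIDE * IMAGE_SIDE, LATENT_DIM)) / np.sqrt(LATENT_DIM)

# Each random latent code gives a different image, with no direct control over its content.
z1 = rng.standard_normal(LATENT_DIM)
z2 = rng.standard_normal(LATENT_DIM)
img1 = toy_generator(z1, weights)
img2 = toy_generator(z2, weights)
```

The point of the sketch is the last four lines: the only input is a random vector, so there is no handle for asking for a specific object, pose, or background.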

02:09

This is incredibly cool, but as I just said, the image is completely random, and we have

02:14

no or few ideas on what it will look like, which is already a lot less useful for creators.

02:21

This is the problem they attacked with this paper.

02:24

Indeed, by taking latent codes of the shape and appearances of objects and sending them

02:28

to the decoder, or generator, they are able to control the pose of the objects,

02:34

which means they can move them around, change their appearances, add other objects, change

02:39

the background and even change the camera pose.

02:42

All these transformations can be done independently on each object or background, without affecting

02:48

anything else in the image!

02:50

As you can see, it is MUCH better than other GAN-based approaches that typically cannot

02:55

disentangle the objects from one another and are all affected by the modification of a

03:00

specific object.

03:01

The difference with their method is that they attack this problem in a three-dimensional

03:06

scene representation, just like how we see the real world, instead of staying in the

03:10

two-dimensional image world as other GANs do.

03:13

But other than that, the process is quite similar.

03:16

They encode the information, identify the objects, edit them inside the latent space,

03:21

and decode it to generate the new image.

03:24

Here, there are just some more steps to do inside this latent space.

03:29

We can see this as a combination of the classical GAN image synthesis network with a neural

03:34

renderer used to generate the 3D scene from the images sent to the network, as we will

03:40

see.

03:41

There are three main steps to achieve that.

03:43

After encoding the input image, meaning that we are already in the latent space, the first

03:48

step is to transfer the image into a 3D scene.

03:51

But not just a simple 3D scene, a 3D scene composed of 3D elements, which are the objects

03:57

and background.

03:58

This way of seeing the images as a scene composed of generated volume renderings allows them

04:03

to change the camera angle in the generated image and control the objects independently.

04:09

This is achieved using a model similar to the one in the paper I previously covered called NeRF,

04:15

but instead of using a single model to generate the entire locked scene from the input image,

04:20

they independently generate the objects and background using two separate models.

04:25

These are called the sampled feature fields here.

04:28

The parameters of this network are also learned during training.

04:32

I won't go into the details, but it is very similar to NeRF, which I covered in another

04:37

video.

04:38

If you would like to have more details on such networks, a link to that video is appearing in the top

04:44

right corner right now and in the description.

04:48

Having this scene with disentangled elements, we can edit them individually without affecting

04:53

the rest of the image.

04:55

This is the second step.

04:56

They can do whatever they want to the object, like changing its position and orientation.

05:01

In other words, they change the pose of the objects or background.
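
Changing the pose of an object can be pictured as an affine transformation of its 3D points: scale, then rotate, then translate. The sketch below assumes that form; the function names and the choice of rotating about the z-axis are illustrative and not taken from the authors' code.

```python
import numpy as np


def rotation_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])


def object_to_scene(points, scale, rotation, translation):
    """Place object-space points in the scene: k(x) = R (s * x) + t."""
    return (rotation @ (scale * points).T).T + translation


def scene_to_object(points, scale, rotation, translation):
    """Inverse mapping, used to look up a scene point in the object's own frame."""
    return (rotation.T @ (points - translation).T).T / scale


scale = np.array([2.0, 2.0, 2.0])
rotation = rotation_z(np.pi / 2)           # quarter turn
translation = np.array([1.0, 0.0, 0.0])

corner = np.array([[1.0, 0.0, 0.0]])       # one point on the object
moved = object_to_scene(corner, scale, rotation, translation)
recovered = scene_to_object(moved, scale, rotation, translation)
```

Because each object has its own `scale`, `rotation`, and `translation`, moving one object means changing only its three parameters, which is why the rest of the scene is left untouched.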

05:05

At this point, they can even add new objects placed wherever they want.

05:10

Then, they simply combine them into a final 3D scene containing all the objects and background

05:16

by adding all feature fields together.
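
As I understand the paper, "adding the feature fields together" means that, at each 3D point, the densities of all objects and the background are summed, and their feature vectors are averaged with the densities as weights. The sketch below assumes that rule; `compose` is a hypothetical helper name.

```python
import numpy as np


def compose(densities, features):
    """Combine N entities at one 3D point.

    densities: shape (N,), non-negative density of each entity at this point.
    features:  shape (N, F), feature vector of each entity at this point.
    Returns the summed density and the density-weighted mean feature.
    """
    total = densities.sum()
    if total <= 0.0:
        # Empty space: no entity contributes here.
        return 0.0, np.zeros(features.shape[1])
    combined = (densities[:, None] * features).sum(axis=0) / total
    return total, combined


# Two entities overlap at this point; the denser one dominates the feature.
density, feature = compose(
    np.array([1.0, 3.0]),
    np.array([[1.0, 0.0],
              [0.0, 1.0]]),
)
# density is 4.0 and feature is [0.25, 0.75]
```

Adding a new object to the scene then only means adding one more row to `densities` and `features`.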

05:18

Finally, we have to come back to the 2D world of natural images.

05:24

So the last step is to take this 3D scene and render a regular image out of it.

05:29

Since we are still in the 3D world, we can change the camera viewpoint to decide how

05:34

we will look at the scene.

05:36

Then, we evaluate each pixel based on this camera ray and other parameters such as the

05:41

alpha value and the transmittance.
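
The alpha value and the transmittance come from standard volume rendering, the same numerical integration used in NeRF. A minimal sketch for a single camera ray follows, assuming the density and feature have already been evaluated at a few sample points along the ray; `render_ray` is a hypothetical name.

```python
import numpy as np


def render_ray(densities, features, deltas):
    """Volume-render one ray.

    densities: shape (S,), density at each sample along the ray.
    features:  shape (S, F), feature vector at each sample.
    deltas:    shape (S,), distance between neighbouring samples.
    """
    # Alpha: how much of the light each sample absorbs.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: how much light survives up to (not including) each sample.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha[:-1])))
    weights = transmittance * alpha
    return weights @ features, weights


# Empty space first, then an opaque surface that hides everything behind it.
densities = np.array([0.0, 1e3, 1e3])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [5.0, 5.0]])
deltas = np.ones(3)
pixel_feature, weights = render_ray(densities, features, deltas)
```

In this example the second sample receives all of the weight, so the pixel takes its feature vector and the third sample, although dense, is occluded.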

05:44

This gives us what they call the feature image, but this feature image is an image composed

05:49

of feature vectors for each pixel.

05:52

As we are still in the latent space, these features need to be translated into RGB colors

05:57

and high-resolution images.

05:59

This is done using the typical decoder just like other GAN architectures, upscaling it

06:05

back to its original dimensions and learning the feature to RGB channels translation simultaneously.
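
A rough picture of this last step is sketched below: upsample the low-resolution feature image, then map each pixel's feature vector to three color channels. This is a heavy simplification under my own assumptions; the real renderer uses learned convolutional layers, while here `projection` is a random stand-in and both function names are hypothetical.

```python
import numpy as np


def upsample_nearest(image, factor=2):
    """Nearest-neighbour upsampling of an (H, W, C) image."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)


def features_to_rgb(feature_image, projection):
    """Map each pixel's feature vector to RGB values in (0, 1)."""
    logits = feature_image @ projection      # (H, W, F) @ (F, 3) -> (H, W, 3)
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid keeps colors in range


rng = np.random.default_rng(0)
feature_image = rng.normal(size=(16, 16, 32))    # low-resolution feature image
projection = rng.normal(size=(32, 3)) * 0.1      # stand-in for learned weights

rgb = features_to_rgb(upsample_nearest(feature_image), projection)
```

Rendering the 3D scene at low resolution and upsampling in 2D keeps the expensive volume rendering step small, which is the design choice this sketch is meant to illustrate.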

06:13

And voilà, you have your new image with a lot more control over what is generated!

06:18

Of course, as you can see, it is still not perfect when used on real-world data.

06:23

Still, it is extremely impressive and is a significant step forward in the right direction,

06:28

especially considering that these are synthetic images entirely generated by GANs and that

06:34

it is only the first paper able to control generated images at this level of precision.

06:39

The paper is really interesting, and I recommend reading it to understand how their model works.

06:45

Congratulations to Michael Niemeyer and Andreas Geiger for their well-deserved best paper

06:51

award.

06:52

They also made the code available on their GitHub if you would like to play with it.

06:56

The link is in the description.

06:57

Thank you for watching!