Manipulate Images Using Text Commands via this AI

by Louis Bouchard, September 7th, 2021
Too Long; Didn't Read

AI can generate realistic images in pretty much any domain: real-life humans, cartoons, sketches, etc. Researchers then leveraged it to take an image and edit it following a specific style, like changing it into a cartoon character or transforming any face into a smiling face. This is extremely complicated, as the representation in which the images are edited is not human-friendly. The challenge is to edit only the wanted parts and keep everything else the same. Now, you can do that using only text! Learn more in the video…
Researchers used AI to generate images. Then, they leveraged it to take an image and edit it following a specific style, like changing it into a cartoon character or transforming any face into a smiling face. This needed a lot of tweaking, model engineering, and trial and error before achieving something realistic. There have been many advances in this field, most notably StyleGAN, which has the incredible ability to generate realistic images in pretty much any domain: real-life humans, cartoons, sketches, etc.

StyleGAN is amazing, but it still takes quite a lot of work to make the results look as intended, which is why many people are trying to understand how these images are made, and especially how to control them. This is extremely complicated, as the representation in which we edit the images is not human-friendly. Instead of a regular image with three color channels (red, green, and blue), it is a dense code with hundreds of dimensions carrying information about all the features the image may contain.

This is why understanding and localizing the features we want to change to generate a new version of the same image requires so much work. The keywords here are “of the same image.” The challenge is to edit only the wanted parts and keep everything else the same. If we change the colors of the eyes, we want all other facial features to stay the same.

I recently covered various techniques where the researchers tried to make this control much easier for the user by using only a few image examples or quick sketches of what we want to achieve.

Now, you can do that using only text! Learn more in the video…

Watch the video

References

►The full article: https://www.louisbouchard.ai/styleclip/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
►Patashnik, Or, et al., (2021), "Styleclip: Text-driven manipulation of stylegan imagery.", https://arxiv.org/abs/2103.17249
►Code (use with local GUI or colab notebook): https://github.com/orpatashnik/StyleCLIP
►Demo: https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb
►OpenAI's Distill article for CLIP: Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah, (2021), "Multimodal neurons in artificial neural networks.", Distill, https://distill.pub/2021/multimodal-neurons/

Video Transcript

00:00

Researchers used AI to generate images.

00:02

Then, they could leverage it to take an image and edit it following a specific style, like

00:07

changing it into a cartoon character or transforming any face into a smiling face.

00:12

This needed a lot of tweaking and model engineering and many trials and errors before achieving

00:18

something realistic.

00:19

There have been many advances in this field, mainly StyleGAN, which has the incredible

00:24

ability to generate realistic images in pretty much any domain; real-life humans, cartoons,

00:30

sketches, etc.

00:31

StyleGAN is amazing, but it still needs quite a lot of work to make the results look as

00:36

intended, which is why many people are trying to understand how these images are made, and

00:41

especially how to control them.

00:43

This is extremely complicated as the representation in which we edit the images is not human-friendly.

00:49

Instead of being regular images with three color channels, red, green, and blue, it is extremely

00:54

dense in information and therefore contains hundreds of dimensions with information about

00:59

all the features the image may contain.

01:02

This is why understanding and localizing the features we want to change to generate a new

01:06

version of the same image requires so much work.

01:09

The keywords here are "of the same image."

01:12

The challenge is to edit only the wanted parts and keep everything else the same.

01:17

If we change the colors of the eyes, we want all other facial features to stay the same.

01:21

I recently covered various techniques where the researchers tried to make this control

01:26

much easier for the user by using only a few image examples or quick sketches of what we

01:31

want to achieve.

01:32

Now, you can do that using only text.

01:35

In this new paper, Or Patashnik et al. created a model able to control the image generation

01:40

process through simple text input.

01:43

You can send it pretty much any face transformation and, using StyleGAN and CLIP,

01:48

it will understand what you want and change it.

01:51

Then, you can tweak some parameters to have the best result possible, and it takes less

01:55

than a second.

01:57

I mentioned StyleGAN.

01:58

StyleGAN is NVIDIA's state-of-the-art GAN architecture for image synthesis or image

02:03

generation.

02:04

I made a lot of videos covering it in various applications that you should definitely watch

02:08

if you are not familiar with it.

02:10

Before getting into the details, the only thing left to cover is the other model I talked

02:15

about that StyleGAN is combined with, which is CLIP.

02:18

Quickly, CLIP is a powerful language-image model recently published by OpenAI.

02:24

As we will see, this model is the one in charge of controlling the modifications to the image

02:28

using only our image and text input.

02:32

It was trained on a lot of image-text pairs from the Web and can basically understand

02:36

what appears in an image.

02:38

Since CLIP was trained on such image-text pairs, it can efficiently match a text description

02:42

to an existing image.
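
To make this matching idea a bit more concrete, here is a minimal sketch. It is not CLIP itself: it assumes the image and the captions have already been turned into vectors living in one shared space, and the tiny 3-dimensional embeddings and the `best_caption` helper are invented purely for illustration. The match is simply the caption whose vector points in the most similar direction to the image's vector (cosine similarity).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image's."""
    scores = {text: cosine_similarity(image_embedding, emb)
              for text, emb in caption_embeddings.items()}
    return max(scores, key=scores.get), scores

# Toy, hand-made embeddings (hypothetical values, NOT real CLIP outputs).
image_embedding = np.array([0.9, 0.1, 0.2])
caption_embeddings = {
    "a smiling face": np.array([1.0, 0.0, 0.1]),
    "a horse": np.array([0.0, 1.0, 0.0]),
    "a car": np.array([0.1, 0.0, 1.0]),
}

caption, scores = best_caption(image_embedding, caption_embeddings)
print(caption)
```

With a real model, the vectors would come from the image and text encoders instead of being typed by hand, but the comparison step would look much the same.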

02:44

Thus, we can use this same principle in our current model to orient the StyleGAN-generated

02:49

image to the desired text transformation.

02:52

You should read OpenAI's Distill article if you'd like to learn more about CLIP.

02:56

It is linked in the description below.

02:58

It has been used to search specific images on Unsplash from text input and other very

03:03

cool applications.

03:04

It will become very clear soon how CLIP can be useful in this case.

03:08

By the way, if you find this interesting, take a second to share the fun and send this

03:12

video to a friend.

03:14

It helps a lot!

03:15

As I said, the researchers used both these already trained models, StyleGAN and CLIP,

03:20

to make this happen.

03:21

Here's how...

03:22

It takes an input image, such as a human face in this case.

03:26

But it can also be a horse, a cat or a car...

03:29

Anything for which you can find a StyleGAN model trained on such images with sufficient data.

03:34

Then, this image is encoded into a latent code using an encoder, just like this, here

03:40

called w.

03:41

This latent code is just a condensed representation of the image produced by a convolutional neural

03:47

network.

03:48

It contains the most useful information about the image which has been identified during

03:52

the training of the model.

03:54

If this is already too complicated, I'd strongly recommend pausing the video and watching the

03:58

short 1-minute video I made covering GANs where I explain how the encoder part typically

04:04

works.

04:05

This latent code, or new image representation, is then sent into three mapper networks that

04:11

are trained to manipulate the desired attributes of the image while preserving the other features.

04:16

Each of these networks is in charge of learning how to map a specific level of detail, from

04:21

coarse to fine, which is decided when extracting the information from the encoder at different

04:26

depths in the network, as I explained in my GAN video.

04:30

This way, they can manipulate general or fine features individually.
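
Here is a rough sketch of that coarse-to-fine split, under stated assumptions: the latent code is taken to be an 18 x 512 array (one row per generator layer), the group boundaries at rows 4 and 8 are illustrative choices, and each "mapper" is just a random linear layer standing in for a trained network. The names `GROUPS`, `make_mapper`, and `edit_latent` are made up for this example and are not taken from the StyleCLIP code.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, DIM = 18, 512  # assumed shape of the latent code w

# Assumed split of the layers into three levels of detail.
GROUPS = {
    "coarse": slice(0, 4),
    "medium": slice(4, 8),
    "fine": slice(8, NUM_LAYERS),
}

def make_mapper(dim):
    """Stand-in for a trained mapper: one random linear layer."""
    weight = rng.normal(scale=0.01, size=(dim, dim))
    return lambda rows: rows @ weight

mappers = {name: make_mapper(DIM) for name in GROUPS}

def edit_latent(w, active=("coarse", "medium", "fine"), strength=0.1):
    """Add each active mapper's output to its own rows of w only."""
    delta = np.zeros_like(w)
    for name in active:
        rows = GROUPS[name]
        delta[rows] = mappers[name](w[rows])
    return w + strength * delta

w = rng.normal(size=(NUM_LAYERS, DIM))          # pretend encoder output
w_fine_only = edit_latent(w, active=("fine",))  # touch fine details only
```

Switching off a mapper leaves its rows of the latent code untouched, which is one way to picture how general and fine features can be edited separately.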

04:34

This is where the CLIP model is used to manipulate these mappings.

04:38

Because of the training, the mappings will learn to move according to the text input

04:42

as the CLIP model understands the content of the images and encodes the text in the same

04:48

way as the image is encoded.
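
One way to picture this training signal is a loss that gets small when the edited image's embedding points the same way as the text embedding, plus a penalty that keeps the edit itself small so the rest of the image is preserved. The sketch below is a toy version under assumptions: the embeddings are plain hand-written vectors rather than CLIP outputs, `toy_edit_loss` and `l2_weight` are hypothetical names and values, and the objective described in the paper is more involved than this.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_edit_loss(image_embedding, text_embedding, delta, l2_weight=0.8):
    """Lower is better: match the text, but keep the latent edit small."""
    match_term = 1.0 - cosine(image_embedding, text_embedding)
    size_term = l2_weight * float(np.linalg.norm(delta))
    return match_term + size_term

# Hypothetical 2-D embeddings, for illustration only.
text_embedding = np.array([0.0, 1.0])    # stands in for "a surprised face"
aligned_image = np.array([0.0, 2.0])     # edited image that matches the text
unrelated_image = np.array([1.0, 0.0])   # edited image that does not

no_edit = np.zeros(4)   # a delta that changes nothing in the latent code
big_edit = np.ones(4)   # a delta that changes a lot

loss_aligned = toy_edit_loss(aligned_image, text_embedding, no_edit)
loss_unrelated = toy_edit_loss(unrelated_image, text_embedding, no_edit)
loss_big_edit = toy_edit_loss(aligned_image, text_embedding, big_edit)
```

A mapper trained against something like this is pushed toward edits that match the sentence while moving the latent code as little as possible.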

04:49

Thus, CLIP can understand the translations made from one text to another, like "a neutral

04:55

face" to "a surprised face," and tell the mapping networks how to apply this same transformation

05:01

to the image mappings.

05:03

This transformation is the delta vector here that is controlled by CLIP and applies the

05:08

same relative translations and rotations to the latent code w as happened for the

05:13

text.

05:14

Then, this modified latent code is sent in the StyleGAN generator to create our transformed

05:20

image.

05:21

In summary, the CLIP model understands the changes happening in a sentence, like "a neutral

05:26

face" to "a surprised face," and applies the same transformation to the encoded image

05:32

representation.

05:33

This new transformed latent code is then sent to the StyleGAN generator to generate the

05:37

new image.

05:38

And Voilà!

05:39

This is how you can send an image and change it based on a simple sentence with this new

05:44

model.

05:45

They also made a Google Colab and a local GUI to test it for yourself with any image

05:50

and play with it easily using sliders to control the modifications intuitively.

05:55

Of course, the code is available on GitHub as well.

05:59

If you found this interesting, please give this video a like, comment on what you think

06:03

about this research below, and share it with a friend.

06:06

It helps A LOT!

06:07

The only limitation for this is that you have to train the mapping networks, but they also

06:12

attacked this issue in their paper.

06:14

For a deeper understanding of how it works and to see these two other techniques they

06:18

introduced to control image generation with CLIP without any training needed,

06:22

I'd strongly recommend reading their paper.

06:25

It is worth the time!

06:26

All the links are in the description below.

06:28

Thank you for watching!