Researchers used AI to generate images. Then, they leveraged it to take an image and edit it following a specific style, like changing it into a cartoon character or transforming any face into a smiling face. This required a lot of tweaking, model engineering, and trial and error before achieving something realistic. There have been many advances in this field, mainly StyleGAN, which has the incredible ability to generate realistic images in pretty much any domain: real-life humans, cartoons, sketches, etc.
StyleGAN is amazing, but it still needs quite a lot of work to make the results look as intended, which is why many people are trying to understand how these images are made, and especially how to control them. This is extremely complicated as the representation in which we edit the images is not human-friendly. Instead of a regular image with three color channels (red, green, and blue), this representation is extremely dense in information and contains hundreds of dimensions describing all the features the image may contain.
This is why understanding and localizing the features we want to change to generate a new version of the same image requires so much work. The keywords here are “of the same image.” The challenge is to edit only the wanted parts and keep everything else the same. If we change the colors of the eyes, we want all other facial features to stay the same.
I recently covered various techniques where the researchers tried to make this control much easier for the user by using only a few image examples or quick sketches of what we want to achieve.
Now, you can do that using only text! Learn more in the video…
►The full article: https://www.louisbouchard.ai/styleclip/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
►Patashnik, Or, et al. (2021), "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery", https://arxiv.org/abs/2103.17249
►Code (use with local GUI or Colab notebook): https://github.com/orpatashnik/StyleCLIP
►Demo: https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb
►OpenAI's Distill article for CLIP: Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, https://distill.pub/2021/multimodal-neurons/, 2021.
00:00
Researchers used AI to generate images.
00:02
Then, they could leverage it to take an image and edit it following a specific style, like
00:07
changing it into a cartoon character or transforming any face into a smiling face.
00:12
This needed a lot of tweaking and model engineering and many trials and errors before achieving
00:18
something realistic.
00:19
There have been many advances in this field, mainly StyleGAN, which has the incredible
00:24
ability to generate realistic images in pretty much any domain: real-life humans, cartoons,
00:30
sketches, etc.
00:31
StyleGAN is amazing, but it still needs quite a lot of work to make the results look as
00:36
intended, which is why many people are trying to understand how these images are made, and
00:41
especially how to control them.
00:43
This is extremely complicated as the representation in which we edit the images is not human-friendly.
00:49
Instead of being regular images with three dimensions, red, green, and blue, it is extremely
00:54
dense in information and therefore contains hundreds of dimensions with information about
00:59
all the features the image may contain.
01:02
This is why understanding and localizing the features we want to change to generate a new
01:06
version of the same image requires so much work.
01:09
The keywords here are "of the same image."
01:12
The challenge is to edit only the wanted parts and keep everything else the same.
01:17
If we change the colors of the eyes, we want all other facial features to stay the same.
01:21
I recently covered various techniques where the researchers tried to make this control
01:26
much easier for the user by using only a few image examples or quick sketches of what we
01:31
want to achieve.
01:32
Now, you can do that using only text.
01:35
In this new paper, Or Patashnik et al. created a model able to control the image generation
01:40
process through simple text input.
01:43
You can send it pretty much any face transformation, and using StyleGAN and CLIP,
01:48
it will understand what you want and change it.
01:51
Then, you can tweak some parameters to have the best result possible, and it takes less
01:55
than a second.
01:57
I mentioned StyleGAN.
01:58
StyleGAN is NVIDIA's state-of-the-art GAN architecture for image synthesis or image
02:03
generation.
02:04
I made a lot of videos covering it in various applications that you should definitely watch
02:08
if you are not familiar with it.
02:10
Before getting into the details, the only thing left to cover is the other model I talked
02:15
about that StyleGAN is combined with, which is CLIP.
02:18
Quickly, CLIP is a powerful language-image model recently published by OpenAI.
02:24
As we will see, this model is the one in charge of controlling the modifications to the image
02:28
using only our image and text input.
02:32
It was trained on a lot of image-text pairs from the Web and can basically understand
02:36
what appears in an image.
02:38
Since CLIP was trained on such image-text pairs, it can efficiently match a text description
02:42
to an existing image.
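The matching idea described here can be sketched in a few lines. This is only a toy illustration and not CLIP itself: the three-dimensional vectors below are made-up stand-ins for the embeddings a real CLIP model would produce, and the helper names `cosine_similarity` and `best_caption` are hypothetical. It assumes that images and captions live in a shared embedding space and are compared with cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_caption(image_embedding, caption_embeddings):
    """Score every caption against the image and return the best match."""
    scores = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Made-up embeddings (assumption): a real model would compute these.
toy_image = np.array([0.9, 0.1, 0.2])
toy_captions = {
    "a smiling face": np.array([1.0, 0.0, 0.1]),
    "a horse": np.array([0.0, 1.0, 0.0]),
    "a car": np.array([0.1, 0.0, 1.0]),
}

caption, scores = best_caption(toy_image, toy_captions)
print(caption)
```

The caption whose vector points in nearly the same direction as the image vector gets the highest score, which is the sense in which a text description is "matched" to an image.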
02:44
Thus, we can use this same principle in our current model to orient the StyleGAN-generated
02:49
image to the desired text transformation.
02:52
You should read OpenAI's Distill article if you'd like to learn more about CLIP.
02:56
It is linked in the description below.
02:58
CLIP has been used to search for specific images on Unsplash from text input and in other very
03:03
cool applications.
03:04
It will become very clear soon how CLIP can be useful in this case.
03:08
By the way, if you find this interesting, take a second to share the fun and send this
03:12
video to a friend.
03:14
It helps a lot!
03:15
As I said, the researchers used both these already trained models, StyleGAN and CLIP,
03:20
to make this happen.
03:21
Here's how...
03:22
It takes an input image, such as a human face in this case.
03:26
But it can also be a horse, a cat or a car...
03:29
Anything for which you can find a StyleGAN model trained on sufficient data.
03:34
Then, this image is encoded into a latent code using an encoder, just like this, here
03:40
called w.
03:41
This latent code is just a condensed representation of the image produced by a convolutional neural
03:47
network.
03:48
It contains the most useful information about the image, which has been identified during
03:52
the training of the model.
03:54
If this is already too complicated, I'd strongly recommend pausing the video and watching the
03:58
short 1-minute video I made covering GANs where I explain how the encoder part typically
04:04
works.
04:05
This latent code, or new image representation, is then sent into three mapper networks that
04:11
are trained to manipulate the desired attributes of the image while preserving the other features.
04:16
Each of these networks is in charge of learning how to map a specific level of detail, from
04:21
coarse to fine, which is decided when extracting the information from the encoder at different
04:26
depths in the network, as I explained in my GAN video.
04:30
This way, they can manipulate general or fine features individually.
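As a rough sketch of the three-mapper idea, the snippet below splits a latent code into coarse, medium, and fine groups of layers and lets a separate toy mapper propose a change for each group. Everything specific here is an assumption for illustration: the 18 x 512 latent shape, the 4/4/10 split, the random single-layer "mappers", and the names `make_toy_mapper` and `predict_delta`. The real mappers are trained networks, not random matrices.

```python
import numpy as np

# Assumed latent layout: 18 style layers of 512 values each.
NUM_LAYERS, DIM = 18, 512
# Assumed grouping of layers by level of detail.
GROUPS = {
    "coarse": slice(0, 4),
    "medium": slice(4, 8),
    "fine": slice(8, 18),
}

rng = np.random.default_rng(0)

def make_toy_mapper(dim, scale=0.01):
    """Stand-in for a trained mapper: one random linear layer plus tanh."""
    weight = rng.normal(scale=scale, size=(dim, dim))
    def mapper(w_rows):
        return np.tanh(w_rows @ weight)
    return mapper

mappers = {name: make_toy_mapper(DIM) for name in GROUPS}

def predict_delta(w_plus, active=("coarse", "medium", "fine")):
    """Let each active mapper propose a change for its own group of layers."""
    delta = np.zeros_like(w_plus)
    for name in active:
        rows = GROUPS[name]
        delta[rows] = mappers[name](w_plus[rows])
    return delta

w_plus = rng.normal(size=(NUM_LAYERS, DIM))
# Only touch fine details; coarse and medium layers stay unchanged.
delta = predict_delta(w_plus, active=("fine",))
print(delta.shape)
```

Keeping the groups separate is what allows an edit to change fine details while leaving the coarse layers, and so the overall structure of the image, untouched.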
04:34
This is where the CLIP model is used to manipulate these mappings.
04:38
Because of the training, the mappings will learn to move according to the text input,
04:42
as the CLIP model understands the content of the images and encodes the text in the same
04:48
way as the image is encoded.
04:49
Thus, CLIP can understand the translations made from one text to another, like "a neutral
04:55
face" to "a surprised face," and tell the mapping networks how to apply this same transformation
05:01
to the image mappings.
05:03
This transformation is the delta vector here, which is controlled by CLIP and applies the
05:08
same relative translations and rotations to the latent code w as happened for the
05:13
text.
05:14
Then, this modified latent code is sent into the StyleGAN generator to create our transformed
05:20
image.
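The last two steps can be sketched with toy numbers: adding a scaled delta to the latent code w, and checking that the change seen in the image embedding points the same way as the change between the two text embeddings. All vectors below are made up, and the helper names `apply_edit` and `direction_alignment` are hypothetical; a real pipeline would get w from the encoder, delta from the trained mappers, and the embeddings from CLIP.

```python
import numpy as np

def apply_edit(w, delta, strength=1.0):
    """Move the latent code along delta; strength plays the role of a slider."""
    return np.asarray(w, dtype=float) + strength * np.asarray(delta, dtype=float)

def direction_alignment(img_before, img_after, text_source, text_target):
    """Cosine similarity between the image change and the text change."""
    image_shift = np.asarray(img_after, dtype=float) - np.asarray(img_before, dtype=float)
    text_shift = np.asarray(text_target, dtype=float) - np.asarray(text_source, dtype=float)
    denom = np.linalg.norm(image_shift) * np.linalg.norm(text_shift)
    return float(image_shift @ text_shift / denom)

# Toy latent code and delta (assumptions, not real StyleGAN values).
w = np.array([0.2, -0.5, 1.0, 0.3])
delta = np.array([0.0, 0.4, 0.0, -0.1])
w_half = apply_edit(w, delta, strength=0.5)  # milder edit
w_full = apply_edit(w, delta, strength=1.0)  # full edit

# Toy embeddings standing in for CLIP outputs (assumptions).
neutral_text = np.array([1.0, 0.0])      # "a neutral face"
surprised_text = np.array([0.6, 0.8])    # "a surprised face"
img_before = np.array([0.9, 0.1])        # embedding of the original image
img_after = np.array([0.5, 0.85])        # embedding of the edited image

alignment = direction_alignment(img_before, img_after, neutral_text, surprised_text)
print(w_half, w_full, alignment)
```

An alignment close to 1 means the edited image moved in embedding space the same way the sentence did, which is the behavior the transcript describes.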
05:21
In summary, the CLIP model understands the changes happening in a sentence, like "a neutral
05:26
face" to "a surprised face," and the mapping networks apply the same transformation to the encoded image
05:32
representation.
05:33
This new transformed latent code is then sent to the StyleGAN generator to generate the
05:37
new image.
05:38
And Voilà!
05:39
This is how you can send an image and change it based on a simple sentence with this new
05:44
model.
05:45
They also made a Google Colab and a local GUI to test it for yourself with any image
05:50
and play with it easily using sliders to control the modifications intuitively.
05:55
Of course, the code is available on GitHub as well.
05:59
If you found this interesting, please give this video a like, comment on what you think
06:03
about this research below, and share it with a friend.
06:06
It helps A LOT!
06:07
The only limitation for this is that you have to train the mapping networks, but they also
06:12
attacked this issue in their paper.
06:14
For a deeper understanding of how it works, and to see the two other techniques they
06:18
introduced to control image generation with CLIP without any training needed,
06:22
I'd strongly recommend reading their paper.
06:25
It is worth the time!
06:26
All the links are in the description below.
06:28
Thank you for watching!