I explain Artificial Intelligence terms and news to non-experts.
►The full article: https://www.louisbouchard.ai/styleclip/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
►Patashnik, Or, et al., (2021), "Styleclip: Text-driven manipulation of stylegan imagery.", https://arxiv.org/abs/2103.17249
►Code (use with local GUI or colab notebook): https://github.com/orpatashnik/StyleCLIP
►Demo: https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb
►OpenAI's Distill article for CLIP: Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah, (2021), "Multimodal neurons in artificial neural networks.", Distill, https://distill.pub/2021/multimodal-neurons/
00:00
Researchers used AI to generate images.
00:02
Then, they could leverage it to take an image and edit it following a specific style, like
00:07
changing it into a cartoon character or transforming any face into a smiling face.
00:12
This needed a lot of tweaking and model engineering and a lot of trial and error before achieving
00:18
something realistic.
00:19
There have been many advances in this field, mainly StyleGAN, which has the incredible
00:24
ability to generate realistic images in pretty much any domain: real-life humans, cartoons,
00:30
sketches, etc.
00:31
StyleGAN is amazing, but it still needs quite a lot of work to make the results look as
00:36
intended, which is why many people are trying to understand how these images are made, and
00:41
especially how to control them.
00:43
This is extremely complicated as the representation in which we edit the images is not human-friendly.
00:49
Instead of being regular images with three dimensions, red, green, and blue, it is extremely
00:54
dense in information and therefore contains hundreds of dimensions with information about
00:59
all the features the image may contain.
01:02
This is why understanding and localizing the features we want to change to generate a new
01:06
version of the same image requires so much work.
01:09
The keywords here are "of the same image."
01:12
The challenge is to edit only the wanted parts and keep everything else the same.
01:17
If we change the colors of the eyes, we want all other facial features to stay the same.
01:21
I recently covered various techniques where the researchers tried to make this control
01:26
much easier for the user by using only a few image examples or quick sketches of what we
01:31
want to achieve.
01:32
Now, you can do that using only text.
01:35
In this new paper, Or Patashnik et al. created a model able to control the image generation
01:40
process through simple text input.
01:43
You can send it pretty much any face transformation, and using StyleGAN and CLIP,
01:48
it will understand what you want and change it.
01:51
Then, you can tweak some parameters to have the best result possible, and it takes less
01:55
than a second.
01:57
I mentioned StyleGAN.
01:58
StyleGAN is NVIDIA's state-of-the-art GAN architecture for image synthesis or image
02:03
generation.
02:04
I made a lot of videos covering it in various applications that you should definitely watch
02:08
if you are not familiar with it.
02:10
Before getting into the details, the only thing left to cover is the other model I talked
02:15
about that StyleGAN is combined with, which is CLIP.
02:18
Quickly, CLIP is a powerful language-image model recently published by OpenAI.
02:24
As we will see, this model is the one in charge of controlling the modifications to the image
02:28
using only our image and text input.
02:32
It was trained on a lot of image-text pairs from the Web and can basically understand
02:36
what appears in an image.
02:38
Since CLIP was trained on such image-text pairs, it can efficiently match a text description
02:42
to an existing image.
02:44
Thus, we can use this same principle in our current model to orient the StyleGAN-generated
02:49
image to the desired text transformation.
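To make this matching idea concrete, here is a minimal sketch of how a caption can be picked for an image by comparing embeddings. The three-number vectors and the helper names `cosine_similarity` and `best_caption` are made up for illustration; the real CLIP model produces much longer embeddings with its own trained encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_embedding, text_embeddings):
    """Return the caption whose embedding points in the most similar direction."""
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(image_embedding, text_embeddings[caption]),
    )

# Made-up toy embeddings (assumption): stand-ins for encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a smiling face": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}
print(best_caption(image_embedding, text_embeddings))
```

The closer the score is to 1, the better the text describes the image, and that score is the signal used to steer the edit.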
02:52
You should read OpenAI's Distill article if you'd like to learn more about CLIP.
02:56
It is linked in the description below.
02:58
CLIP has been used to search for specific images on Unsplash from text input, among other very
03:03
cool applications.
03:04
It will become very clear soon how CLIP can be useful in this case.
03:08
By the way, if you find this interesting, take a second to share the fun and send this
03:12
video to a friend.
03:14
It helps a lot!
03:15
As I said, the researchers used both of these already-trained models, StyleGAN and CLIP,
03:20
to make this happen.
03:21
Here's how...
03:22
It takes an input image, such as a human face in this case.
03:26
But it can also be a horse, a cat or a car...
03:29
Anything for which you can find a StyleGAN model trained on enough images of that kind.
03:34
Then, this image is encoded into a latent code using an encoder, just like this, here
03:40
called w.
03:41
This latent code is just a condensed representation of the image produced by a convolutional neural
03:47
network.
03:48
It contains the most useful information about the image, which has been identified during
03:52
the training of the model.
03:54
If this is already too complicated, I'd strongly recommend pausing the video and watching the
03:58
short 1-minute video I made covering GANs where I explain how the encoder part typically
04:04
works.
04:05
This latent code, or new image representation, is then sent into three mapper networks that
04:11
are trained to manipulate the desired attributes of the image while preserving the other features.
04:16
Each of these networks is in charge of learning how to map a specific level of detail, from
04:21
coarse to fine, which is decided when extracting the information from the encoder at different
04:26
depths in the network, as I explained in my GAN video.
04:30
This way, they can manipulate general or fine features individually.
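Here is a rough sketch of that coarse-to-fine split. It assumes an 18-layer latent code cut at layers 4 and 8, and it assumes each mapper predicts an offset that is added back to the original code, which is my reading of the paper rather than something shown in this video. The `make_mapper` function is a hypothetical stand-in for a trained network.

```python
# Assumption: 18 latent layers, grouped as coarse (0-3), medium (4-7), fine (8-17).
COARSE, MEDIUM, FINE = slice(0, 4), slice(4, 8), slice(8, 18)

def make_mapper(scale):
    """Hypothetical stand-in for a trained mapper network: a fixed scaling."""
    def mapper(layers):
        return [[scale * value for value in layer] for layer in layers]
    return mapper

def edit_latent(w, coarse_mapper, medium_mapper, fine_mapper):
    """Send each group of layers through its own mapper, then add the offsets to w."""
    offsets = coarse_mapper(w[COARSE]) + medium_mapper(w[MEDIUM]) + fine_mapper(w[FINE])
    return [
        [value + delta for value, delta in zip(layer, offset_layer)]
        for layer, offset_layer in zip(w, offsets)
    ]

# Toy latent code: 18 layers of 4 values each (real layers are much wider).
w = [[1.0] * 4 for _ in range(18)]
edited = edit_latent(w, make_mapper(0.0), make_mapper(0.1), make_mapper(0.5))
print(edited[0][0], edited[4][0], edited[8][0])
```

With a coarse mapper that outputs zeros, the coarse layers come out untouched, which is how an edit can change fine details while leaving the overall structure alone.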
04:34
This is where the CLIP model is used to manipulate these mappings.
04:38
Because of the training, the mappings will learn to move according to the text input,
04:42
as the CLIP model understands the content of the images and encodes the text in the same
04:48
way as the image is encoded.
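As a simplified sketch of what such a training signal could look like: the mappers are rewarded when the edited image matches the text according to CLIP, and penalized when the edit gets too large. The function names and the `l2_weight` value are arbitrary choices for this example, and the paper's full objective has more terms than shown here.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 when the two embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mapper_loss(edited_image_embedding, text_embedding, offset, l2_weight=0.5):
    """Simplified objective (assumption): match the text, but keep the edit small."""
    clip_term = cosine_distance(edited_image_embedding, text_embedding)
    size_term = math.sqrt(sum(d * d for d in offset))
    return clip_term + l2_weight * size_term

# Toy values: the edited image already matches the text, so only the edit size counts.
small_edit = mapper_loss([1.0, 0.0], [1.0, 0.0], offset=[0.1, 0.0])
large_edit = mapper_loss([1.0, 0.0], [1.0, 0.0], offset=[2.0, 0.0])
print(small_edit < large_edit)
```

The second term is what pushes the mappers to change only what the text asks for and leave the rest of the image alone.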
04:49
Thus, CLIP can understand the translations made from one text to another, like "a neutral
04:55
face" to "a surprised face," and tell the mapping networks how to apply this same transformation
05:01
to the image mappings.
05:03
This transformation is the delta vector here, which is controlled by CLIP and applies the
05:08
same relative translations and rotations to the latent code w as happened for the
05:13
text.
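A toy way to picture this delta vector: take the difference between the two text embeddings and move the latent code along it. This sketch pretends that the text embeddings and the latent code live in the same space, which is not true in the real model, where the trained mappings bridge the two. The names `text_direction`, `apply_edit`, and `strength` are made up for the example.

```python
def text_direction(source_embedding, target_embedding):
    """Direction from the source text to the target text in embedding space."""
    return [t - s for s, t in zip(source_embedding, target_embedding)]

def apply_edit(w, delta, strength=1.0):
    """Move the latent code along delta; strength plays the role of a slider."""
    return [value + strength * d for value, d in zip(w, delta)]

# Made-up 2-D embeddings (assumption): only the first dimension changes with the text.
neutral_face = [0.0, 1.0]
surprised_face = [1.0, 1.0]
delta = text_direction(neutral_face, surprised_face)

w = [0.5, 0.5]  # toy latent code
print(apply_edit(w, delta, strength=2.0))
```

Only the dimension that differs between the two sentences moves, and a larger strength gives a stronger version of the same edit.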
05:14
Then, this modified latent code is sent into the StyleGAN generator to create our transformed
05:20
image.
05:21
In summary, the CLIP model understands the changes happening in a sentence, like "a neutral
05:26
face" to "a surprised face," and the mapping networks apply the same transformation to the encoded image
05:32
representation.
05:33
This new transformed latent code is then sent to the StyleGAN generator to generate the
05:37
new image.
05:38
And voilà!
05:39
This is how you can send an image and change it based on a simple sentence with this new
05:44
model.
05:45
They also made a Google Colab and a local GUI to test it for yourself with any image
05:50
and play with it easily using sliders to control the modifications intuitively.
05:55
Of course, the code is available on GitHub as well.
05:59
If you found this interesting, please give this video a like, comment on what you think
06:03
about this research below, and share it with a friend.
06:06
It helps A LOT!
06:07
The only limitation of this approach is that you have to train the mapping networks, but they also
06:12
attacked this issue in their paper.
06:14
For a deeper understanding of how it works, and to see the two other techniques they
06:18
introduced to control image generation with CLIP without any training needed,
06:22
I'd strongly recommend reading their paper.
06:25
It is worth the time!
06:26
All the links are in the description below.
06:28
Thank you for watching!