Researchers used AI to generate images. Then, they leveraged it to take an image and edit it following a specific style, like changing it into a cartoon character or transforming any face into a smiling face. This needed a lot of tweaking, model engineering, and trial and error before achieving something realistic. There have been many advances in this field, mainly StyleGAN, which has the incredible ability to generate realistic images in pretty much any domain: real-life humans, cartoons, sketches, etc. StyleGAN is amazing, but it still needs quite a lot of work to make the results look as intended, which is why many people are trying to understand how these images are made, and especially how to control them. This is extremely complicated as the representation in which we edit the images is not human-friendly. Instead of being a regular image with three color channels (red, green, and blue), it is extremely dense in information and therefore contains hundreds of dimensions with information about all the features the image may contain. This is why understanding and localizing the features we want to change to generate a new version of the same image requires so much work. The keywords here are “of the same image.” The challenge is to edit only the wanted parts and keep everything else the same. If we change the color of the eyes, we want all other facial features to stay the same. I recently covered various techniques where the researchers tried to make this control much easier for the user by using only a few image examples or quick sketches of what we want to achieve. Now, you can do that using only text!
Learn more in the video…
References
►The full article: https://www.louisbouchard.ai/styleclip/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
►Patashnik, Or, et al., (2021), "Styleclip: Text-driven manipulation of stylegan imagery.", https://arxiv.org/abs/2103.17249
►Code (use with local GUI or colab notebook): https://github.com/orpatashnik/StyleCLIP
►Demo: https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb
►OpenAI's Distill article for CLIP: Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. https://distill.pub/2021/multimodal-neurons/
Video Transcript
00:00 Researchers used AI to generate images. 00:02 Then, they could leverage it to take an image and edit it following a specific style, like 00:07 changing it into a cartoon character or transforming any face into a smiling face. 00:12 This needed a lot of tweaking and model engineering and many trials and errors before achieving 00:18 something realistic. 00:19 There have been many advances in this field, mainly StyleGAN, which has the incredible 00:24 ability to generate realistic images in pretty much any domain; real-life humans, cartoons, 00:30 sketches, etc. 00:31 StyleGAN is amazing, but it still needs quite a lot of work to make the results look as 00:36 intended, which is why many people are trying to understand how these images are made, and 00:41 especially how to control them. 00:43 This is extremely complicated as the representation in which we edit the images is not human-friendly. 00:49 Instead of being regular images with three dimensions, red, green, and blue, it is extremely 00:54 dense in information and therefore contains hundreds of dimensions with information about 00:59 all the features the image may contain.
01:02 This is why understanding and localizing the features we want to change to generate a new 01:06 version of the same image requires so much work. 01:09 The keywords here are "of the same image." 01:12 The challenge is to edit only the wanted parts and keep everything else the same. 01:17 If we change the color of the eyes, we want all other facial features to stay the same. 01:21 I recently covered various techniques where the researchers tried to make this control 01:26 much easier for the user by using only a few image examples or quick sketches of what we 01:31 want to achieve. 01:32 Now, you can do that using only text. 01:35 In this new paper, Or Patashnik et al. created a model able to control the image generation 01:40 process through simple text input. 01:43 You can send it pretty much any face transformation and, using StyleGAN and CLIP, 01:48 it will understand what you want and change it. 01:51 Then, you can tweak some parameters to have the best result possible, and it takes less 01:55 than a second. 01:57 I mentioned StyleGAN. 01:58 StyleGAN is NVIDIA's state-of-the-art GAN architecture for image synthesis or image 02:03 generation. 02:04 I made a lot of videos covering it in various applications that you should definitely watch 02:08 if you are not familiar with it. 02:10 Before getting into the details, the only thing left to cover is the other model I talked 02:15 about that StyleGAN is combined with, which is CLIP. 02:18 Quickly, CLIP is a powerful language-to-image model recently published by OpenAI. 02:24 As we will see, this model is the one in charge of controlling the modifications to the image 02:28 using only our image and text input. 02:32 It was trained on a lot of image-text pairs from the Web and can basically understand 02:36 what appears in an image. 02:38 Since CLIP was trained on such image-text pairs, it can efficiently match a text description 02:42 to an existing image.
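To make the "match a text description to an existing image" idea a bit more concrete, here is a minimal sketch of how such a match can be scored once the image and the captions live in the same embedding space. The three-number vectors and the `best_caption` helper are made up for illustration; real CLIP embeddings come from its trained encoders and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_embedding, text_embeddings):
    """Return the caption whose embedding points closest to the image embedding."""
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(image_embedding, text_embeddings[caption]),
    )


# Toy embeddings, invented for this example (not produced by CLIP).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a face": [1.0, 0.0, 0.1],
    "a photo of a horse": [0.0, 1.0, 0.0],
    "a photo of a car": [0.1, 0.0, 1.0],
}

print(best_caption(image_embedding, text_embeddings))
```

The only point here is that matching boils down to comparing directions in a shared space, which is what lets a sentence steer an image later on.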
02:44 Thus, we can use this same principle in our current model to orient the StyleGAN-generated 02:49 image to the desired text transformation. 02:52 You should read OpenAI's Distill article if you'd like to learn more about CLIP. 02:56 It is linked in the description below. 02:58 It has been used to search for specific images on Unsplash from text input and other very 03:03 cool applications. 03:04 It will become very clear soon how CLIP can be useful in this case. 03:08 By the way, if you find this interesting, take a second to share the fun and send this 03:12 video to a friend. 03:14 It helps a lot! 03:15 As I said, the researchers used both these already trained models, StyleGAN and CLIP, 03:20 to make this happen. 03:21 Here's how... 03:22 It takes an input image, such as a human face in this case. 03:26 But it can also be a horse, a cat or a car... 03:29 Anything for which you can find a StyleGAN model trained on such images with sufficient data. 03:34 Then, this image is encoded into a latent code using an encoder, just like this, here 03:40 called w. 03:41 This latent code is just a condensed representation of the image produced by a convolutional neural 03:47 network. 03:48 It contains the most useful information about the image, which has been identified during 03:52 the training of the model. 03:54 If this is already too complicated, I'd strongly recommend pausing the video and watching the 03:58 short 1-minute video I made covering GANs where I explain how the encoder part typically 04:04 works. 04:05 This latent code, or new image representation, is then sent into three mapper networks that 04:11 are trained to manipulate the desired attributes of the image while preserving the other features. 04:16 Each of these networks is in charge of learning how to map a specific level of detail, from 04:21 coarse to fine, which is decided when extracting the information from the encoder at different 04:26 depths in the network, as I explained in my GAN video.
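The three-mapper idea can be sketched in a few lines. This is a toy illustration, not the authors' implementation: I am assuming a latent code made of 18 layers, split after layers 4 and 8 into coarse, medium, and fine groups, and `make_toy_mapper` is a hypothetical stand-in for a trained network. Each mapper only sees its own group and returns an offset that is added back to the original code.

```python
def split_by_detail(w_plus, coarse_end=4, medium_end=8):
    """Split a per-layer latent code into coarse, medium and fine groups.

    The split points are assumptions chosen for this sketch.
    """
    return w_plus[:coarse_end], w_plus[coarse_end:medium_end], w_plus[medium_end:]


def make_toy_mapper(scale):
    """Hypothetical stand-in for a trained mapper: one offset vector per input layer."""
    def mapper(layers):
        return [[scale * value for value in layer] for layer in layers]
    return mapper


def apply_mappers(w_plus, mappers, strength=1.0):
    """Run each group through its own mapper and add the offsets to the original code."""
    edited = []
    for group, mapper in zip(split_by_detail(w_plus), mappers):
        offsets = mapper(group)
        for layer, offset in zip(group, offsets):
            edited.append([v + strength * d for v, d in zip(layer, offset)])
    return edited


# Toy latent code: 18 layers of 2 values each (real codes are much wider).
w_plus = [[1.0, 2.0] for _ in range(18)]

# Only the medium-level mapper changes anything in this example.
mappers = (make_toy_mapper(0.0), make_toy_mapper(0.5), make_toy_mapper(0.0))
edited = apply_mappers(w_plus, mappers)

print(edited[0], edited[4], edited[17])
```

Because the coarse and fine mappers return zero offsets here, only the middle layers move, which is the kind of isolation that lets general and fine features be edited separately.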
04:30 This way, they can manipulate general or fine features individually. 04:34 This is where the CLIP model is used to manipulate these mappings. 04:38 Because of the training, the mappings will learn to move according to the text input 04:42 as the CLIP model understands the content of the images and encodes the text in the same 04:48 way as the image is encoded. 04:49 Thus, CLIP can understand the translations made from one text to another, like "a neutral 04:55 face" to "a surprised face," and tell the mapping networks how to apply this same transformation 05:01 to the image mappings. 05:03 This transformation is the delta vector here that is controlled by CLIP and applies the 05:08 same relative translations and rotations to the latent code w as happened for the 05:13 text. 05:14 Then, this modified latent code is sent into the StyleGAN generator to create our transformed 05:20 image. 05:21 In summary, the CLIP model understands the changes happening in a sentence, like "a neutral 05:26 face" to "a surprised face," and applies the same transformation to the encoded image 05:32 representation. 05:33 This new transformed latent code is then sent to the StyleGAN generator to generate the 05:37 new image. 05:38 And voilà! 05:39 This is how you can send an image and change it based on a simple sentence with this new 05:44 model. 05:45 They also made a Google Colab and a local GUI to test it for yourself with any image 05:50 and play with it easily using sliders to control the modifications intuitively. 05:55 Of course, the code is available on GitHub as well. 05:59 If you found this interesting, please give this video a like, comment on what you think 06:03 about this research below, and share it with a friend. 06:06 It helps A LOT! 06:07 The only limitation for this is that you have to train the mapping networks, but they also 06:12 attacked this issue in their paper.
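As a recap of the delta-vector step described above, here is a toy sketch of the idea as explained in this video, not the training objective from the paper. All vectors are invented for illustration, and the helper names (`text_direction`, `apply_edit`, `edit_alignment`) are hypothetical. The text direction lives in CLIP's embedding space, the delta lives in StyleGAN's latent space, and the `strength` argument plays the role of a slider.

```python
import math


def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def text_direction(source_embedding, target_embedding):
    """Unit vector pointing from the source text embedding to the target one."""
    return normalize(subtract(target_embedding, source_embedding))


def apply_edit(w, delta, strength):
    """Move the latent code w along delta; strength acts like a slider."""
    return [wi + strength * di for wi, di in zip(w, delta)]


def edit_alignment(image_before, image_after, direction):
    """Cosine similarity between the image-embedding change and the text direction."""
    change = normalize(subtract(image_after, image_before))
    return sum(c * d for c, d in zip(change, direction))


# Toy text embeddings for "a neutral face" and "a surprised face".
direction = text_direction([1.0, 0.0], [1.0, 1.0])

# Toy latent code and a delta that a trained mapper might output.
w = [0.5, -0.5, 2.0]
delta = [0.0, 1.0, 0.0]
w_edited = apply_edit(w, delta, strength=2.0)

# Toy image embeddings before and after the edit.
alignment = edit_alignment([0.8, 0.1], [0.8, 0.6], direction)
print(w_edited, round(alignment, 3))
```

An alignment close to 1 means the image moved in embedding space the same way the sentence did, which is the behavior the mappers are trained toward.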
06:14 For a deeper understanding of how it works and to see the two other techniques they 06:18 introduced to control image generation with CLIP without any training needed, 06:22 I'd strongly recommend reading their paper. 06:25 It is worth the time! 06:26 All the links are in the description below. 06:28 Thank you for watching!