German researchers combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful, time-efficient method for semantically guided, high-quality image synthesis. If the title and subtitle sound like another language to you, this video was made for you!

Chapters:
0:37 Image-GPT
1:44 Transformers and image generation?
2:38 GANs + Transformers for image synthesis: the paper
5:39 Available pre-trained model and demo
6:16 Conclusion

References:
Taming Transformers for High-Resolution Image Synthesis, Esser et al., 2020
Project link with paper and results: https://compvis.github.io/taming-transformers/
Code: https://github.com/CompVis/taming-transformers
Colab demo to start sampling right away: https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/taming-transformers.ipynb

Follow me for more AI content:
Instagram: https://www.instagram.com/whats_ai/
LinkedIn: https://www.linkedin.com/in/whats-ai/
Twitter: https://twitter.com/Whats_AI
Facebook: https://www.facebook.com/whats.artificial.intelligence/
Medium: https://medium.com/@whats_ai

Join our Discord channel, Learn AI Together: https://discord.gg/learnaitogether

The best courses in AI:
https://www.omologapps.com/whats-ai
https://medium.com/towards-artificial-intelligence/start-machine-learning-in-2020-become-an-expert-from-nothing-for-free-f31587630cf7
https://github.com/louisfb01/start-machine-learning-in-2020

Become a member of the YouTube community: https://www.youtube.com/channel/UCUzGQrN-lyyc0BWTYoJM_Sg/join

Video Transcript

TL;DR: They combine the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful, time-efficient method for semantically guided, high-quality image synthesis. If what I said sounds like another language to you, this video was made for you.

This is What's AI, and I share artificial
intelligence news every week. If you are new to the channel and want to stay up to date, please consider subscribing so you don't miss any further news.

You have probably heard of iGPT, or Image GPT, recently published by OpenAI, which I covered on my channel. It is the state-of-the-art generative transformer model: OpenAI used the transformer architecture on a pixel representation of images to perform image synthesis. In short, they give a transformer half the pixels of an image as input and have it generate the other half. As you can see here, it is extremely powerful.

However, as you know, there are now 4K high-resolution images and videos, and do you know how many pixels there are in one 4K image? It counts in the millions, even tens of millions of values, which makes for a pretty long sequence compared with a single phrase or paragraph in natural language processing applications. Because transformers are designed to learn long-range interactions on sequential data, which in this case would mean using all the pixels sequentially, this approach is excessively demanding in computation and does not scale beyond 192×192 image resolutions. So transformers cannot be used with images, since no one wants to generate a super-low-definition image, right?

Well, not really. Researchers from Heidelberg University in Germany recently published a new paper combining the efficiency of convolutional approaches with the expressivity of transformers to produce semantically guided synthesis of high-quality images. This means they use a convolutional neural network to obtain a context-rich representation of images, and then train a transformer model on this compressed representation instead of the actual image.
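To put rough numbers on this scaling problem: 4K (UHD) is 3840×2160 pixels, and self-attention compares every token with every other token, so its cost grows with the square of the sequence length. A quick back-of-the-envelope sketch in plain Python (just arithmetic, not anything from the paper):

```python
# Treating every pixel as one token, self-attention cost grows with the
# square of the sequence length.

def num_tokens(width: int, height: int) -> int:
    """Sequence length when each pixel is one sequence element."""
    return width * height

small = num_tokens(192, 192)    # the resolution limit mentioned above
uhd = num_tokens(3840, 2160)    # one 4K (UHD) frame

print(f"192x192 tokens: {small:,}")                 # 36,864
print(f"4K tokens:      {uhd:,}")                   # 8,294,400
print(f"sequence ratio: {uhd // small}x")           # 225x longer
print(f"attention cost: {(uhd // small) ** 2:,}x")  # 50,625x more pairs
```

So a pixel-level transformer at 4K would face a sequence 225 times longer than at 192×192, and on the order of 50,000 times more pairwise attention comparisons.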
The transformer then synthesizes an actual image from this representation, allowing a much higher resolution than iGPT while preserving the quality of the resulting image. But we will come back to that in a minute with a better explanation. If you are not familiar with CNNs or transformers, I strongly recommend watching the videos I made explaining them, to better understand this approach.

This paper is called "Taming Transformers for High-Resolution Image Synthesis," and, as I said, it enables transformers to synthesize high-resolution images from semantic images, just like you can see here, where the only information needed is an approximate semantic segmentation showing what kind of environment you would like at which position in the image. It then outputs a complete high-definition image, filling the segments with real mountains, grass, sky, sunsets, and so on.

Now the question is: why are these researchers, and OpenAI, using a transformer instead of our typical GAN architectures for image synthesis? Well, the advantage of using transformers for image generation is clear: they continue to show state-of-the-art results on a wide variety of tasks and are extremely promising. They also contain none of the inductive bias found in CNNs, where the use of two-dimensional filters causes a prioritization of local interactions. This inductive bias is what makes CNNs so efficient, but it may be too restrictive to make the network expressive or original.

Now that we know that transformers are more expressive and very powerful, the only thing left is to find a way to make them more efficient. Indeed, their approach manages to use both the high effectiveness caused by the inductive bias coming
from CNNs, as well as the expressivity of transformers.

As I said, the convolutional neural network architecture, composed of a classic encoder-decoder and an adversarial training process using a discriminator, which they call VQGAN, is used to generate an efficient and rich representation of the images in the form of a codebook. As the name suggests, it is a GAN architecture used to train a generator to produce a high-resolution image. If you are not familiar with how GANs work, you can watch the video I made explaining them.

Once this first training is done, the encoder and the learned codebook are used to represent the encoded information of the input image as input for a transformer, while the decoder maps such codes back to images. Rather than directly using the pixels of the image, the transformer uses this codebook representation of the image, a composition of perceptually rich image constituents. Of course, this codebook representation is extremely compressed data, made so it can be read semantically by the transformer.

Then, using this representation as the training dataset, the transformer learns to predict the distribution of possible next indices inside the representation, just like a regular autoregressive model, meaning that it uses the previous elements of the sequence as inputs to predict the values of the next ones. The approach therefore combines CNNs and GANs with transformers to perform high-resolution image synthesis.

Here you can see an example using the demo version of their code, which we can try right now on Google Colab without having to set up anything. They have already done the setup for us.
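As an aside before the demo, the codebook lookup just described can be illustrated with a toy sketch: each encoder output vector is replaced by the index of its nearest codebook entry. Everything here (the four 2-D codebook entries, the tiny "feature map") is made up for illustration; the real VQGAN learns its codebook and works with far larger entries and feature maps.

```python
# Toy vector quantization: map each feature vector to the index of the
# nearest codebook entry (squared Euclidean distance). Values are made up.

codebook = [
    [0.0, 0.0],  # entry 0
    [1.0, 0.0],  # entry 1
    [0.0, 1.0],  # entry 2
    [1.0, 1.0],  # entry 3
]

def quantize(vec):
    """Return (index, entry) of the codebook entry closest to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))
    return idx, codebook[idx]

# A tiny "feature map" from a hypothetical encoder:
features = [[0.1, -0.2], [0.9, 0.1], [0.2, 1.1], [0.8, 0.9]]
indices = [quantize(f)[0] for f in features]
print(indices)  # → [0, 1, 2, 3]
```

The transformer then models this sequence of indices instead of raw pixels.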
You just have to run these few lines: the notebook downloads their code from GitHub and installs the required dependencies automatically. Then it loads the model and imports a pre-trained version of it. Finally, you can use their segmented image as a test, or upload your own segmented image, and run a few more lines to encode the segmentation. As a reminder, this is a necessary step for the transformer, which works on the codebook representation associated with your image.

Of course, this was just an overview of this new paper, and I strongly recommend reading it for a better technical understanding. Also, as I mentioned earlier, their code is available on GitHub with pre-trained models, so you can try it yourself and even improve it. All the links are in the description below.

Please leave a like if you made it this far into the video, and since over 80 percent of you are not yet subscribed, please consider subscribing to the channel so you don't miss any further news. Thank you for watching!
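To close with one more illustration: the autoregressive step mentioned in the overview, predicting a distribution over possible next codebook indices and sampling from it, can be sketched in a few lines. The fixed probability table below is a purely illustrative stand-in for the transformer, not anything from the paper.

```python
import random

# Toy autoregressive sampler over codebook indices. A real VQGAN
# transformer predicts next-index probabilities from all previous
# indices; here a hand-made table conditioned only on the last index
# stands in for it.
VOCAB = 4  # codebook size

def next_index_probs(prefix):
    """Stand-in for the transformer: probabilities over the next index."""
    last = prefix[-1] if prefix else 0
    probs = [0.1] * VOCAB   # spread most indices uniformly...
    probs[last] = 0.7       # ...but favor repeating the last index
    return probs

def sample_sequence(length, seed=0):
    rng = random.Random(seed)
    seq = []
    for _ in range(length):
        probs = next_index_probs(seq)
        seq.append(rng.choices(range(VOCAB), weights=probs)[0])
    return seq

print(sample_sequence(8))  # a sampled sequence of codebook indices
```

In the real pipeline, the sampled index sequence is handed to the VQGAN decoder, which turns it back into a high-resolution image.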