This week we take a look at visual generative modeling. The goal is to generate a complete scene in high resolution, rather than a single face image or object. The approach builds on StyleGAN, which uses the GAN architecture in a traditional generative and discriminative way with convolutional neural networks; here, the transformer's attention mechanism is added inside the StyleGAN2 architecture to make it even more powerful.

Watch the video:

Chapters:
0:00 Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
0:24 Text-To-Image translation
0:51 Examples
5:50 Conclusion

References:
Drew A. Hudson and C. Lawrence Zitnick, Generative Adversarial Transformers (2021), published on arXiv.
Paper: https://arxiv.org/pdf/2103.01209.pdf
Code: https://github.com/dorarad/gansformer

Abstract: "We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data efficiency. Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer ."

Follow me for more AI content:
►Instagram: https://www.instagram.com/whats_ai/
►LinkedIn: https://www.linkedin.com/in/whats-ai/
►Twitter: https://twitter.com/Whats_AI
►Facebook: https://www.facebook.com/whats.artifi...

Join our Discord channel, Learn AI Together:
► https://discord.gg/learnaitogether

The best courses in AI & Guide+Repository on how to start:
► https://www.omologapps.com/whats-ai
► https://github.com/louisfb01/start-ma...

Become a member of the YouTube community and support my work:
https://www.youtube.com/channel/UCUzG...

Video Transcript
Note: This transcript is auto-generated by YouTube and may not be entirely accurate.
They basically leveraged the transformer's attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful.

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and would like to stay up to date, please consider subscribing to not miss any further news.

Last week we looked at DALL·E, OpenAI's most recent paper. It uses an architecture similar to GPT-3, involving transformers, to generate an image from text. This is a super interesting and complex task called text-to-image translation. As you can see here, the results were surprisingly good compared to previous state-of-the-art techniques, mainly due to the use of transformers and a large amount of data. This week we will look at a very similar task called visual generative modeling, where the goal is to generate a complete scene in high resolution, such as a road or a room, rather than a single face or a specific object. This is different from DALL·E, since we are not generating the scene from text but from a model trained on a specific style of scenes, which is a bedroom in this case. Rather, it is just like StyleGAN, which is able to generate unique, non-existing human faces after being trained on a dataset of real faces.

The difference is that StyleGAN uses the GAN architecture in a traditional generative and discriminative way, with convolutional neural networks. A classic GAN architecture has a generator, trained to generate the image, and a discriminator, used to measure the quality of the generated images by guessing whether an image is a real one coming from the dataset or a fake one produced by the generator. Both networks are typically composed of convolutional neural networks. The generator mainly downsamples the image using convolutions to encode it, then upsamples it again using convolutions to generate a new version of the image with the same style, based on that encoding, which is why it is called StyleGAN. The discriminator then takes the generated image, or an image from the dataset, and tries to figure out whether it is real or generated, called fake.

Instead, they leverage the transformer's attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful. Attention is an essential feature of this network, allowing it to draw global dependencies between input and output. In this case, it is between the input at the current step of the architecture and the latent code previously encoded, as we will see in a minute. Before diving into it, if you are not familiar with transformers or attention, I suggest you watch the video I made about transformers for more details and a better understanding of attention. You should also definitely have a look at the video covering the paper "Attention Is All You Need" from a fellow YouTuber and inspiration of mine, Yannic Kilcher.
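To make the generator and discriminator described above more concrete, here is a minimal, hypothetical PyTorch sketch of a convolutional GAN: the generator downsamples an image to encode it and upsamples it back into a new version, while the discriminator outputs a single real-vs-fake score. This is only an illustration of the idea, not the StyleGAN2 code used in the paper; all class names, layer sizes, and shapes are assumptions.

```python
# Illustrative sketch of a convolutional GAN (not the paper's implementation).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encodes an image by downsampling, then upsamples to synthesize a new version."""
    def __init__(self, channels=3, base=64):
        super().__init__()
        # Downsampling path: convolutions encode the image into a compact representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Upsampling path: transposed convolutions decode it back into an image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(base, channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

class Discriminator(nn.Module):
    """Guesses whether an image is real (from the dataset) or fake (from the generator)."""
    def __init__(self, channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(base * 2, 1),  # single logit: real vs. fake
        )

    def forward(self, x):
        return self.net(x)

# Usage: generate an image and score it.
g, d = Generator(), Discriminator()
fake = g(torch.randn(1, 3, 64, 64))   # a new version of the input image
score = d(fake)                        # logit; sigmoid(score) ~ probability of "real"
```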
Alright, so we know that they use transformers and GANs together to generate better and more realistic scenes, which explains the name of this paper, GANsformer. But why and how did they do that exactly?

As for the why, they did it to generate complex and realistic scenes, like this one, automatically. This could be a powerful application for many industries like movies or video games, requiring a lot less time and effort than having an artist create them on a computer, or even build them in real life to take a picture of. Also, imagine how useful it could be for designers when coupled with text-to-image translation, generating many different scenes from a single text input and the press of a random button.

They use a state-of-the-art StyleGAN architecture because GANs are powerful generators for the overall image. Because GANs work using convolutional neural networks, they by nature use local information from the pixels, merging it to end up with general information about the image while missing out on the long-range interactions between faraway pixels. For the same reason, GANs are powerful generators for the overall style of the image, but they are much less powerful regarding the quality of the small details in the generated image, and they are unable to control the style of localized regions within the generated image itself. This is why they had the idea to combine transformers and GANs in one architecture, which they call the bipartite transformer.

As GPT-3 and many other papers have already proved, transformers are powerful for long-range interactions, drawing dependencies between them and understanding the context of text or images. As we can see, this simply adds attention layers, which are the base of the transformer network, in between the convolutional layers of both the generator and the discriminator. Thus, rather than focusing on global information and controlling all features globally as convolutions do by nature, they use this attention to propagate information from the local pixels to the global high-level representation, and vice versa. Like other transformers applied to images, this attention layer takes the pixels' positions and the StyleGAN2 latent spaces w and z. The latent space w is an encoding of the input into an intermediate latent space, done at the beginning of the network and denoted here as A, while the encoding z is just the resulting features of the input at the current step of the network. This makes the generation much more expressive over the whole image, especially for generating images depicting multi-object scenes, which is the goal of this paper.

Of course, this was just an overview of this new paper by Facebook AI Research and Stanford University. I strongly recommend reading the paper to get a better understanding of this approach; it is the first link in the description below. The code is also available and linked in the description as well. If you made it this far in the video, please consider leaving a like and commenting your thoughts; I will definitely read them and answer you.
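To give a feel for what an attention layer between convolutions can look like, below is a minimal, hypothetical sketch of bipartite cross-attention between a small set of latent variables and the grid of image features: the latents first gather global information from the pixels, then propagate it back to modulate local regions. The actual GANsformer layer (with its multiplicative, region-based modulation) is more involved; every name and shape here is an assumption for illustration.

```python
# Illustrative sketch of bipartite attention between latents and image features
# (not the GANsformer implementation).
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    """Cross-attention both ways: latents attend to pixels, then pixels attend to latents."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.latents_from_pixels = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pixels_from_latents = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, features, latents):
        # features: (B, C, H, W) feature map from a convolutional layer
        # latents:  (B, K, C)   set of latent variables (e.g. derived from w)
        b, c, h, w = features.shape
        pixels = features.flatten(2).transpose(1, 2)   # (B, H*W, C)

        # 1) Latents gather global information from the pixels...
        latents, _ = self.latents_from_pixels(latents, pixels, pixels)
        # 2) ...then propagate it back, modulating each region of the image.
        pixels, _ = self.pixels_from_latents(pixels, latents, latents)

        return pixels.transpose(1, 2).reshape(b, c, h, w), latents

# Usage: inserted between convolutional layers of the generator.
attn = BipartiteAttention(dim=64)
feats = torch.randn(2, 64, 16, 16)   # intermediate convolutional features
lats = torch.randn(2, 16, 64)        # latent variables
feats, lats = attn(feats, lats)
```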
And since there is still over 80 percent of you who are not subscribed yet, please consider clicking the free subscribe button to not miss any further news, clearly explained. Thank you for watching!