If you thought DALL·E 2 had great results, wait until you see what this new model from Google Brain can do. DALL·E 2 is amazing but often lacks realism, and this is what the team attacked with this new model called Imagen. They share a lot of results on their project page as well as a benchmark, which they introduced for comparing text-to-image models, where they clearly outperform DALL·E 2 and previous image generation approaches. Learn more in the video...

References
►Read the full article: https://www.louisbouchard.ai/google-brain-imagen/
►Paper: Saharia et al., 2022, Imagen - Google Brain, https://gweb-research-imagen.appspot.com/paper.pdf
►Project link: https://gweb-research-imagen.appspot.com/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript

0:00 If you thought DALL·E 2 had great results, 0:02 wait until you see what this new model 0:04 from Google Brain can do. DALL·E 2 is 0:07 amazing but often lacks realism, and this 0:10 is what the team attacked with this new 0:12 model called Imagen. They share a lot of 0:14 results on their project page as well as 0:16 a benchmark, which they introduced for 0:18 comparing text-to-image models, where 0:20 they clearly outperformed DALL·E 2 and 0:23 previous image generation approaches. 0:25 This benchmark is also super cool, as we 0:27 see more and more text-to-image models 0:29 and it's pretty difficult to compare the 0:31 results, unless we assume the results are 0:34 really bad, which we often do, but this 0:36 model and DALL·E 2 definitely defied the odds.

0:40 TLDR: it's a new text-to-image model that 0:43 you can compare to DALL·E 2, with more 0:45 realism as per human testers. So just 0:48 like DALL·E 2, which I covered not even a 0:50 month ago, this model takes text like "a 0:53 golden retriever dog wearing a blue 0:56 checkered beret and a red dotted 0:58 turtleneck" and tries to generate a 1:00 photorealistic image out of this weird 1:02 sentence. The main point here is that 1:05 Imagen can not only understand text, but 1:08 it can also understand the images it 1:10 generates, since they are more realistic 1:12 than all previous approaches. Of course, 1:15 when I say understand, I mean its own 1:17 kind of understanding, which is really 1:20 different from ours. The model doesn't 1:22 really understand the text or the image 1:24 it generates. It definitely has some kind 1:27 of knowledge about it, but it mainly 1:28 understands how this particular kind of 1:31 sentence with these objects should be 1:33 represented using pixels on an image. But 1:36 I'll concede that it sure looks like it 1:38 understands what we send it when we see 1:41 those results. Obviously, you can trick it 1:43 with some really weird sentences that 1:45 couldn't look realistic, like this one, 1:48 but it sometimes beats even your own 1:50 imagination and just creates something 1:53 amazing.

Still, what's even more amazing 1:56 is how it works, using something I never 1:58 discussed on the channel: a diffusion 2:00 model. But before using this diffusion 2:03 model, we first need to understand the 2:05 text input, and this is also the main 2:07 difference with DALL·E 2. They used a huge 2:10 text model, similar to GPT-3, to understand 2:13 the text as best as an AI system can. So 2:16 instead of training a text model along 2:18 with the image generation model, they 2:21 simply use a big pre-trained model and 2:23 freeze it so that it doesn't change 2:25 during the training of the image 2:27 generation model. From their study, this 2:30 led to much better results, and it seemed 2:32 like the model understood text better. So 2:35 this text module is how the model 2:37 understands text, and this understanding 2:40 is represented in what we call encodings, 2:42 which is what the model has been trained 2:44 to do on huge datasets: to transfer text 2:47 inputs into a space of information that 2:50 it can use and understand.
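To make this frozen text encoder idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the small t5-small checkpoint (Imagen itself freezes a much larger T5-XXL encoder; this is an illustration, not the authors' code):

```python
# Minimal sketch: embedding a prompt with a frozen pre-trained text encoder.
# Imagen freezes a large T5 encoder; "t5-small" is used here purely so the
# example is lightweight to run.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# Freeze the encoder so it never changes while the image model trains.
encoder.requires_grad_(False)
encoder.eval()

prompt = ("a golden retriever dog wearing a blue checkered beret "
          "and a red dotted turtleneck")
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # One embedding per token: the "encodings" that will condition
    # the image generation model.
    encodings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)
```

The appeal of this design is that the text model can be trained once on huge text-only datasets, and the image generation model simply learns to consume its fixed encodings.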
2:52 Now we need to use this transformed text 2:54 data to generate the image, and as I said, 2:57 they used a diffusion model to achieve 3:00 that. But what is a diffusion model? 3:02 Diffusion models are generative models 3:04 that convert random Gaussian noise like 3:07 this into images by learning how to 3:10 reverse Gaussian noise iteratively. They 3:13 are powerful models for super-resolution 3:15 or other image-to-image translations, and 3:18 in this case use a modified U-Net 3:20 architecture, which I covered numerous 3:22 times in previous videos, so I won't 3:24 enter into the architectural details 3:26 here. Basically, the model is trained to 3:29 denoise an image from pure noise, which 3:31 they orient using the text encodings and 3:34 a technique called classifier-free 3:36 guidance, which they say is essential and 3:38 clearly explained in their paper. I'll 3:40 let you read it for more information on 3:42 this technique. So now we have a model 3:45 able to take random Gaussian noise and 3:47 our text encoding and denoise it, with 3:49 guidance from the text encodings, to 3:51 generate our image.

But as you see here, 3:54 it isn't as simple as it sounds. The 3:56 image we just generated is a very small 3:58 image, as a bigger image would require 4:00 much more computation and a much bigger 4:02 model, which are not viable. Instead, we 4:05 first generate a photorealistic image 4:07 using the diffusion model we just 4:09 discussed and then use other diffusion 4:12 models to improve the quality of the 4:14 image iteratively. I already covered 4:16 super-resolution models in past videos, 4:19 so I won't enter into the details here, 4:21 but let's do a quick overview. Once again, 4:24 we want to have noise and not an image, 4:26 so we cover up this initially generated 4:28 low-resolution image with, again, some 4:31 Gaussian noise, and we train our second 4:33 diffusion model to take this modified 4:35 image and improve it. Then we repeat 4:38 these two steps with another model, but 4:40 this time using just patches of the 4:43 image instead of the full image, to do 4:45 the same upscaling ratio and stay 4:47 computationally viable. And voilà, we end 4:51 up with our photorealistic, high 4:53 resolution image.

4:55 Of course, this was just an overview of 4:56 this exciting new model with really cool 4:59 results. I definitely invite you to read 5:01 their great paper for a deeper 5:03 understanding of their approach and a 5:05 detailed results analysis. 5:07 And you, do you think the results are 5:09 comparable to DALL·E 2? Are they better 5:12 or worse? I sure think it is DALL·E 2's main 5:15 competitor as of now. Let me know what 5:17 you think of this new Google Brain 5:19 publication and the explanation. I hope 5:21 you enjoyed this video, and if you did, 5:24 please take a second to leave a like and 5:26 subscribe to stay up to date with 5:27 exciting AI news. If you are subscribed, I 5:30 will see you next week with another amazing paper.
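To make the classifier-free guidance idea from the video a bit more concrete, here is a minimal sketch of one guided denoising step. Everything here is hypothetical for illustration: `unet` stands in for a noise-prediction network, and the weighting is simplified compared to the paper.

```python
def cfg_denoise_step(unet, x_t, t, encodings, guidance_weight=7.0):
    """One classifier-free-guidance step: a sketch, not Imagen's real code.

    `unet` is assumed to predict the noise in the noisy image `x_t` at
    timestep `t`, optionally conditioned on text `encodings`
    (encodings=None means unconditional).
    """
    eps_uncond = unet(x_t, t, encodings=None)      # ignore the text
    eps_cond = unet(x_t, t, encodings=encodings)   # use the text
    # Push the prediction away from the unconditional output and toward
    # the text-conditioned one; weights > 1 strengthen text guidance.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```

In training, the same network occasionally sees no text at all, which is what makes the unconditional prediction available at sampling time.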
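And here is a sketch of the cascade itself, under the same caveats: the 64 → 256 → 1024 resolutions match the paper, but `base_model`, `sr_model_256`, and `sr_model_1024` are invented stand-ins, each assumed to wrap a full diffusion sampling loop like the step above.

```python
def generate(encodings, base_model, sr_model_256, sr_model_1024):
    """Three-stage Imagen-style cascade: a sketch with a hypothetical API."""
    # 1) Text-conditioned base image sampled from pure Gaussian noise (64x64).
    image_64 = base_model.sample(encodings, shape=(3, 64, 64))
    # 2) First super-resolution stage: 64 -> 256. The low-resolution input is
    #    noise-augmented inside `sample`, matching the "cover it with Gaussian
    #    noise again" step described in the video.
    image_256 = sr_model_256.sample(encodings, low_res=image_64,
                                    shape=(3, 256, 256))
    # 3) Second stage: 256 -> 1024, operating on patches to stay
    #    computationally viable at the same upscaling ratio.
    image_1024 = sr_model_1024.sample(encodings, low_res=image_256,
                                      shape=(3, 1024, 1024))
    return image_1024
```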