Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2. And you won't believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution! As if it wasn't already impressive enough, the recent model learned a new skill: image inpainting. DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better! Or simply add a feature you want, like some flamingos in the background. Sounds interesting? Learn more in the video!

References
►Read the full article: https://www.louisbouchard.ai/openais-new-model-dall-e-2-is-amazing/
►A. Ramesh et al., 2022, DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
►OpenAI's blog post: https://openai.com/dall-e-2
►Risks and limitations: https://github.com/openai/dalle-2-preview/blob/main/system-card.md
►OpenAI DALL·E's Instagram page: https://www.instagram.com/openaidalle/
►My Newsletter (A new AI application explained weekly to your inbox!): https://www.louisbouchard.ai/newsletter/

Video Transcript

Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2, and you won't believe the progress in a single year. DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution. As if it wasn't already impressive enough, the recent model learned a new skill: image inpainting.

DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there: it can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background. This is what image inpainting is: we take a part of an image and replace it with something else, following the style and reflections in the image, keeping realism. Of course, it doesn't only replace a part of the image at random; that would be too easy for OpenAI. This inpainting process is also text-guided, which means you can tell it to add a flamingo here, there, or even there.

Before diving into the nitty-gritty of this newest DALL·E model, let me talk a little about this episode's sponsor, Weights & Biases. If you are not familiar with Weights & Biases, you are most certainly new here and should definitely subscribe to the channel. Weights & Biases allows you to keep track of all your experiments with only a handful of lines added to your code. One feature I love is how you can quickly create and share amazing-looking interactive reports like this one, clearly showing your team or future self your runs' metrics, hyperparameters, and data configurations, alongside any notes you or your team had at the time. It's a powerful feature to either add quick comments on an experiment or create polished pieces of analysis. Reports can also be used as dashboards for reporting a smaller subset of metrics than the main workspace, and you can even create public view-only links to share with anyone easily. Capturing and sharing your work is essential if you want to grow as an ML practitioner, which is why I recommend using tools that improve your work, like Weights & Biases.
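If you're curious what "a handful of lines" looks like in practice, here is a minimal sketch of tracking a run with the wandb Python package; the project name, config values, and fake training loop are placeholders I made up for illustration, and offline mode is used so the sketch runs without an account:

```python
# Minimal experiment-tracking sketch with Weights & Biases (wandb).
# Project name, config, and the fake training loop are illustrative placeholders.
import random
import wandb

run = wandb.init(
    project="my-dalle2-experiments",                 # hypothetical project name
    config={"learning_rate": 1e-4, "epochs": 5},
    mode="offline",                                   # run locally without a W&B account
)

for epoch in range(run.config["epochs"]):
    loss = 1.0 / (epoch + 1) + random.random() * 0.01  # stand-in for a real training loss
    wandb.log({"epoch": epoch, "loss": loss})           # logged values feed the dashboards/reports

wandb.finish()
```

Everything passed to wandb.log becomes available for the interactive reports and dashboards mentioned above.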
Just try it with the first link below and start sharing your work like a pro.

Now let's dive into how DALL·E 2 can not only generate images from text but is also capable of editing them. Indeed, this new inpainting skill the network has learned is due to its better understanding of concepts and of the images themselves, locally and globally. What I mean by locally and globally is that DALL·E 2 has a deeper understanding of why the pixels next to each other have these colors, as it understands the objects in the scene and their interrelation to each other. This way, it is able to understand that this water has reflections and that the object on the right should also be reflected there. It also understands the global scene, which is what is happening, just like if you were to describe what was going on when the person took the photo. Here, you'd say that this photo does not exist, obviously, or else I'm definitely down to try that. If we forget that this is impossible, you'd say that the astronaut is riding a horse in space. So if I were to ask you to draw the same scene but on a planet rather than in free space, you'd be able to picture something like that, since you understand that the horse and astronaut are the objects of interest to keep in the picture. This seems obvious, but it's extremely complex for a machine that only sees pixels of colors, which is why DALL·E 2 is so impressive to me.

But how exactly does the model understand the text we send it and generate an image out of it? Well, it's pretty similar to the first model I covered on the channel. It starts by using the CLIP model by OpenAI to encode both a text and an image into the same domain, a condensed representation called a latent code. Then it takes this encoding and uses a generator, also called a decoder, to generate a new image that means the same thing as the text, since it comes from the same latent code. So DALL·E 2 has two steps: CLIP to encode the information, and the new decoder model to take this encoded information and generate an image out of it. These two separated steps are also why we can generate variations of the images: we can simply randomly change the encoded information just a little, making it move a tiny bit in the latent space, and it will still represent the same sentence while having different values, creating a different image representing the same text.

As we see here, it initially takes a text input and encodes it. What we see above is the first step of the training process, where we also feed it an image and encode it using CLIP, so that images and text are encoded similarly, following the CLIP objective. Then, for generating a new image, we switch to the section below, where we use the text encoding, guided by CLIP, to transform it into an image encoding. This transformation is done using a diffusion prior, which we will cover shortly, as it is very similar to the diffusion model used for the final step. Finally, we take our newly created image encoding and decode it into a new image using the diffusion decoder.
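To keep these steps straight, here is a conceptual, heavily simplified sketch of the data flow; the three models are stubbed with random projections (they are not CLIP, the real prior, or the real decoder), and the embedding size and function names are assumptions for illustration only:

```python
# Conceptual sketch of DALL·E 2's two-stage pipeline (not OpenAI's actual code).
# Data flow: text -> CLIP-style text embedding -> (diffusion prior) -> image embedding
#            -> (diffusion decoder) -> pixels. All models are stand-in stubs.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 512  # assumed embedding size, purely illustrative

def clip_text_encoder(text: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: maps a caption to a latent code."""
    local_rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return local_rng.standard_normal(EMB_DIM)

def diffusion_prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the prior that turns a text embedding into an image embedding."""
    return text_embedding + 0.1 * rng.standard_normal(EMB_DIM)

def diffusion_decoder(image_embedding: np.ndarray, size: int = 64) -> np.ndarray:
    """Stand-in for the decoder that renders pixels from the image embedding."""
    return rng.standard_normal((size, size, 3))  # placeholder pixels

def generate(text: str) -> np.ndarray:
    z_text = clip_text_encoder(text)    # step 1: encode the caption
    z_image = diffusion_prior(z_text)   # step 2: text embedding -> image embedding
    return diffusion_decoder(z_image)   # step 3: decode into an image

def generate_variation(text: str, noise_scale: float = 0.05) -> np.ndarray:
    """Variations: nudge the encoded information a tiny bit, then decode again."""
    z_image = diffusion_prior(clip_text_encoder(text))
    z_image = z_image + noise_scale * rng.standard_normal(EMB_DIM)  # small move in latent space
    return diffusion_decoder(z_image)

image = generate("an astronaut riding a horse in space")
variation = generate_variation("an astronaut riding a horse in space")
```

The important part is the shape of the pipeline: a text embedding, then an image embedding via the prior, then pixels via the decoder, with variations obtained by perturbing the image embedding before decoding.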
A diffusion decoder, or diffusion model, is a kind of model that starts with random noise and learns how to iteratively change this noise to get back to an image. It learns that by doing the opposite during training: we feed it images and apply random Gaussian noise to the image iteratively, until we can't see anything other than noise. Then we simply reverse the model to generate images from noise. If you'd like more detail about this kind of network, which is really cool, I invite you to watch the video I made about them (a toy code sketch of the idea also follows at the end of this transcript).

And voilà, this is how DALL·E 2 generates such high-quality images following text. It's super impressive and tells us that the model does understand the text. But does it deeply understand what it created? Well, it sure looks like it. It's the capability of inpainting images that makes us believe it understands the pictures pretty well. But why is that? How can it link a text input to an image and understand the image enough to replace only some parts of it without affecting the realism? This is all because of CLIP, as it links a text input to an image. If we encode our newly generated image back and use a different text input to guide another generation, we can generate a second version of the image that replaces only the wanted region in our first generation, and you will end up with this picture.

Unfortunately, the code isn't publicly available and is not in their API yet. The reason for that, as per OpenAI, is to study the risks and limitations of such a powerful model. They actually discuss these potential risks and the reasons for keeping it private in their paper and in a great repository I linked in the description below, if you are interested. They also opened an Instagram account to share more results, if you'd like to see that; it's also linked below. I loved DALL·E, and this one is even cooler.

Of course, this was just an overview of how DALL·E 2 works, and I strongly invite you to read their great paper, linked below, for more detail on their implementation of the model. I hope you enjoyed this video as much as I enjoyed making it, and I will see you next week with another amazing paper. Thank you for watching!
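To make the noise-then-denoise idea from the transcript a bit more concrete, here is a toy Python sketch; the step count, noise schedule, and the "denoiser" (which cheats by nudging toward a known target instead of using a trained network) are all illustrative assumptions, not DALL·E 2's actual decoder:

```python
# Toy illustration of the diffusion idea behind DALL·E 2's decoder (not the real model).
# Forward: repeatedly mix an image with Gaussian noise until only noise remains.
# Reverse: start from pure noise and repeatedly apply a denoising step to recover an image.
import numpy as np

rng = np.random.default_rng(0)
T = 10       # number of diffusion steps (tiny, for illustration)
BETA = 0.2   # amount of noise mixed in per step (assumed schedule)

def forward_diffusion(image: np.ndarray) -> list:
    """Gradually destroy an image with Gaussian noise; this is what training sees."""
    trajectory = [image]
    x = image
    for _ in range(T):
        x = np.sqrt(1 - BETA) * x + np.sqrt(BETA) * rng.standard_normal(x.shape)
        trajectory.append(x)
    return trajectory

def denoise_step(x: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Placeholder for the learned network: here we cheat and nudge toward a known target
    just to make the reverse loop concrete. The real model predicts the noise to remove."""
    return x + 0.3 * (target - x)

def reverse_diffusion(shape: tuple, target: np.ndarray) -> np.ndarray:
    """Start from pure noise and iteratively denoise, step by step."""
    x = rng.standard_normal(shape)
    for _ in range(T):
        x = denoise_step(x, target)
    return x

clean = np.ones((8, 8))                              # stand-in "image"
noisy_trajectory = forward_diffusion(clean)          # clean -> noise (training direction)
recovered = reverse_diffusion(clean.shape, clean)    # noise -> image (generation direction)
print(np.abs(recovered - clean).mean())              # small after T denoising steps
```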