OpenAI just released the paper showing how DALL·E works! It is called "Zero-Shot Text-to-Image Generation". Here's a video explaining it:

References:
A. Ramesh et al., "Zero-Shot Text-to-Image Generation", 2021. arXiv:2102.12092 [cs.CV]
Code & more information for the discrete VAE used for DALL·E: https://github.com/openai/DALL-E
DALL·E paper: https://arxiv.org/pdf/2102.12092.pdf
OpenAI CLIP paper & code: https://openai.com/blog/clip/
CLIP used on Unsplash image search: https://github.com/haltakov/natural-l...

Follow me for more AI content:
Instagram: https://www.instagram.com/whats_ai/
LinkedIn: https://www.linkedin.com/in/whats-ai/
Twitter: https://twitter.com/Whats_AI
Facebook: https://www.facebook.com/whats.artifi...

Join our Discord channel, Learn AI Together: https://discord.gg/learnaitogether

The best courses in AI & a guide + repository on how to start:
https://www.omologapps.com/whats-ai
https://github.com/louisfb01/start-ma...

Become a member of the YouTube community and support my work: https://www.youtube.com/channel/UCUzG...

Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.

Chapters:
0:00 - Paper explanation
2:40

Video Transcript:

OpenAI successfully trained a network able to generate images from text captions. It's very similar to GPT-3 and Image GPT, and it produces amazing results. Let's see what it's really capable of. In fact, it's a smaller version of GPT-3, using 12 billion parameters instead of 175 billion, but it has been specifically trained to generate images from text descriptions using a dataset of text-image pairs instead of a very broad dataset like GPT-3. It can generate images from text captions written in natural language, just like GPT-3 can create websites and stories. It's a continuation of Image GPT and GPT-3, which I both covered in previous videos if you haven't watched them yet.

DALL·E is very similar to GPT-3 in that it's also a transformer language model, receiving text and images as inputs to output a final transformed image. It can edit attributes of specific objects in images, as you can see here, or even control multiple objects and their attributes at the same time. This is a very complicated task, since the network has to understand the relation between the objects and create an image based on its understanding. Just take this example: feeding the network an emoji of a baby penguin wearing a blue hat, red gloves, a green shirt, and yellow pants. All these components need to be understood: the objects, the colors, and even the location of the objects, meaning that the gloves need to be both red and on the hands of the penguin, and the same for the rest. The results are very impressive considering the complexity of the task.

It uses self-attention, as I described in a previous video, to understand the context of the text, and sparse attention for the images. There are not many details about how it works or how exactly it was trained, but they will be publishing a paper explaining their approach, and I will be sure to cover it as soon as it's released.
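As a rough illustration of the self-attention mentioned above (the dense variant only, not the sparse attention used for the image tokens), here is a minimal scaled dot-product self-attention sketch in PyTorch. The shapes and random weights are made up for illustration; this is not code from DALL·E.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal (dense) scaled dot-product self-attention over a token sequence.

    x:   (seq_len, d_model) token embeddings
    w_*: (d_model, d_model) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project tokens to queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise similarity between tokens
    weights = F.softmax(scores, dim=-1)       # each token attends to every other token
    return weights @ v                        # context-aware token representations

# Illustrative shapes only: 8 tokens, 64-dimensional embeddings.
d = 64
x = torch.randn(8, d)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([8, 64])
```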
OpenAI just released the paper explaining how DALL·E works. It's called "Zero-Shot Text-to-Image Generation". As I previously mentioned, it uses a transformer architecture to generate images from a text and a base image sent as input to the network. But it doesn't simply take the image and send it to the network. First, in order to be understood by the transformer architecture, the information needs to be modeled into a single stream of data. This is because using the pixels of the image directly would require way too much memory for high-resolution images. Instead, they use a discrete variational autoencoder, called dVAE, that takes the input image and transforms it into a 32-by-32 grid, giving as a result 1,024 image tokens rather than millions of tokens for a high-resolution image. Indeed, the only task of this dVAE network is to reduce the memory footprint of the transformer by generating a new version of the image. You can see it as a kind of image-compression step. The encoder and decoder in the dVAE are composed of classic convolutions and ResNet architectures with skip connections. If you've never heard of variational autoencoders before, I strongly recommend you watch the video I made explaining them. Fortunately, this dVAE network was also shared on OpenAI's GitHub, with a notebook to try it yourself and more details in the paper. The links are in the description below.

These image tokens produced by the discrete VAE model are then sent together with the text as input to the transformer model. Again, as I described in my previous video about DALL·E, this transformer is a 12-billion-parameter sparse transformer model. Without diving too much into the transformer architecture, as I already covered it in a previous video, transformers are sequence-to-sequence models that often use encoders and decoders. In this case, it only uses a decoder, since it takes the image tokens generated by the dVAE and the text as inputs. Each of the 1,024 image tokens generated by the discrete VAE has access to all the text tokens, and using self-attention, it can predict an optimal image-text pairing.

Then it is finally fed into a pre-trained contrastive model, which is in fact the pre-trained CLIP model that OpenAI published in early January. It's used to optimize the relationship between an image and a specific text. Given an image generated by the transformer and the initial caption, CLIP assigns a score based on how well the image matches the caption. The CLIP model was even used on Unsplash images to help you find the image you are looking for, as well as to find specific frames in a video from a text input. Of course, in our case, we already have a generated image and we just want it to match the text input. Well, CLIP still gives us a perfect measure to use as a penalty function to iteratively improve the results of the transformer's decoder during training. CLIP's capabilities are very similar to the zero-shot capabilities of GPT-2 and GPT-3. Similarly, CLIP was also trained on a huge dataset of 400 million text-image pairs.
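To make the single stream of data described above more concrete, here is a shape-level sketch: caption tokens followed by the 1,024 dVAE image tokens, concatenated into one sequence for a decoder-only transformer. The vocabulary sizes, embedding width, and the single causal attention layer are stand-ins for illustration, not OpenAI's implementation.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # stand-in vocabulary sizes
TEXT_LEN, GRID = 256, 32                # illustrative caption length; 32 x 32 = 1,024 image tokens
D_MODEL = 512                           # illustrative embedding width

text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
image_embed = nn.Embedding(IMAGE_VOCAB, D_MODEL)

# Stand-ins for the real tokenizers: random caption tokens and a random 32x32 grid of dVAE codes.
text_tokens = torch.randint(TEXT_VOCAB, (1, TEXT_LEN))
image_tokens = torch.randint(IMAGE_VOCAB, (1, GRID * GRID))  # 1,024 discrete image tokens

# One stream of data: text embeddings followed by image embeddings.
stream = torch.cat([text_embed(text_tokens), image_embed(image_tokens)], dim=1)
print(stream.shape)  # torch.Size([1, 1280, 512]) -> 256 text + 1024 image positions

# A decoder-only transformer would predict each image token autoregressively;
# here a single causally-masked attention layer stands in for the 12B-parameter model.
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
seq = stream.shape[1]
causal_mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
out = layer(stream, src_mask=causal_mask)
print(out.shape)  # torch.Size([1, 1280, 512])
```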
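And here is a sketch of the CLIP-style scoring idea just described: generate several candidate images for a caption, score each one against the caption, and keep the best. It assumes the released openai/CLIP package and some already-generated candidate files (the `candidate_*.png` names are hypothetical); it illustrates the image-caption scoring, not the exact way DALL·E plugs it into training.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # pre-trained contrastive model

caption = "an emoji of a baby penguin wearing a blue hat, red gloves, a green shirt, and yellow pants"
candidates = [Image.open(f"candidate_{i}.png") for i in range(8)]  # hypothetical generated samples

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([caption]).to(device))
    image_feats = torch.cat([
        model.encode_image(preprocess(img).unsqueeze(0).to(device)) for img in candidates
    ])
    # Cosine similarity between the caption and every candidate image.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    scores = (image_feats @ text_feat.T).squeeze(1)

best = scores.argmax().item()
print(f"best candidate: candidate_{best}.png (score {scores[best].item():.3f})")
```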
This zero-shot capability means that it works on images and text samples that were not found in the training dataset, also referred to as unseen object categories. Finally, the overall architecture was trained using 250 million text-image pairs taken from the internet, mostly from Wikipedia, and it basically learns to generate a new image based on the given tokens as inputs, just like we described earlier in the video. This was possible because transformers allow for more parallelization during training, making it way faster while producing more accurate results, being powerful natural language tools as well as powerful computer vision tools when used with a proper encoding system, of course.

This was just an overview of this new paper by OpenAI. I strongly recommend reading the DALL·E paper and the CLIP paper to get a better understanding of this approach. I'm excited to see what the community will do with this code now available. Please leave a like if you made it this far in the video, and since over 80 percent of you are not subscribed yet, consider subscribing to the channel so you don't miss any further news. Thank you for watching!