OpenAI just released the paper showing how DALL-E works! It is called "Zero-Shot Text-to-Image Generation".
Here's a video explaining it:
0:00 - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
2:40 - Paper explanation
00:00
OpenAI successfully trained a network able to generate images from text captions. It's very similar to GPT-3 and Image GPT, and it produces amazing results. Let's see what it's really capable of. In fact, it's a smaller version of GPT-3, using 12 billion parameters instead of 175 billion, but it has been specifically trained to generate images from text descriptions using a dataset of text-image pairs instead of a very broad dataset like GPT-3's. It can generate images from text captions written in natural language, just like GPT-3 can create websites and stories. It's a continuation of Image GPT and GPT-3, which I both covered in previous videos if you haven't watched them yet.
DALL-E is very similar to GPT-3 in the sense that it's also a transformer language model, receiving both text and images as inputs to output a final transformed image in many forms. It can edit attributes of specific objects in images, as you can see here, or even control multiple objects and their attributes at the same time. This is a very complicated task, since the network has to understand the relation between the objects and create an image based on its understanding. Just take this example: feeding the network an emoji of a baby penguin wearing a blue hat, red gloves, a green shirt, and yellow pants. All of these components need to be understood: the objects, the colors, and even the location of the objects, meaning that the gloves need to be both red and on the hands of the penguin, and the same for the rest. The results are very impressive considering the complexity of the task.
It uses self-attention, as I described in a previous video, to understand the context of the text, and sparse attention for the images. There are not many details yet about how it works or how exactly it was trained, but they will be publishing a paper explaining their approach, and I will be sure to cover it as soon as it's released.
02:40
OpenAI just released the paper explaining how DALL-E works. It's called "Zero-Shot Text-to-Image Generation". As I previously mentioned, it uses a transformer architecture to generate images from a text caption and a base image sent as input to the network. But it doesn't simply take the image as it is and send it to the network: first, in order to be understood by the transformer architecture, the information needs to be modeled into a single stream of data.
This is because using the pixels of the image directly would require way too much memory for high-resolution images. Instead, they use a discrete variational autoencoder, called dVAE, that takes the input image and transforms it into a 32-by-32 grid, giving as a result 1,024 image tokens rather than millions of tokens for a high-resolution image. Indeed, the only task of this dVAE network is to reduce the memory footprint of the transformer by generating a compressed version of the image; you can see it as a kind of image-compression step. The encoder and decoder of the dVAE are composed of classic convolutions and ResNet architectures with skip connections. If you've never heard of variational autoencoders before, I strongly recommend you watch the video I made explaining them. Fortunately, this dVAE network was also shared on OpenAI's GitHub, with a notebook to try it yourself and further details in the paper. The links are in the description below.
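
If you want to get a feel for this compression step, here's a rough sketch based on the notebook OpenAI released alongside the dVAE (github.com/openai/DALL-E). It assumes the `dall_e` package and the encoder/decoder checkpoint URLs from that notebook; the random tensor stands in for a real preprocessed 256x256 image.

```python
import torch
import torch.nn.functional as F
from dall_e import load_model, unmap_pixels  # from github.com/openai/DALL-E

device = torch.device("cpu")
# Pretrained dVAE encoder/decoder released by OpenAI (URLs from their notebook).
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", device)

# Stand-in for a real image preprocessed with dall_e.map_pixels, shape (1, 3, 256, 256).
x = torch.rand(1, 3, 256, 256)

z_logits = enc(x)                  # (1, 8192, 32, 32): logits over the 8,192-entry codebook
z = torch.argmax(z_logits, dim=1)  # (1, 32, 32): the 1,024 discrete image tokens
print(z.shape)                     # torch.Size([1, 32, 32])

# The decoder maps the token grid back to pixels, i.e. the "decompression" direction.
z_onehot = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
x_rec = unmap_pixels(torch.sigmoid(dec(z_onehot)[:, :3]))
print(x_rec.shape)                 # torch.Size([1, 3, 256, 256])
```

The important part is the ratio: a 256x256x3 image turns into a 32x32 grid of discrete tokens, which is what keeps the transformer's input stream manageable.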
These image tokens produced by the discrete VAE model are then sent, together with the text, as input to the transformer model. Again, as I described in my previous video about DALL-E, this transformer is a 12-billion-parameter sparse transformer model. Without diving too much into the transformer architecture, as I already covered it in a previous video, transformers are sequence-to-sequence models that often use encoders and decoders. In this case, it only uses a decoder, since it takes the image generated by the dVAE and the text as inputs. Each of the 1,024 image tokens generated by the discrete VAE has access to all of the text tokens, and using self-attention it can predict an optimal image-text pairing.
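
To make that single-stream idea concrete, here's a minimal PyTorch sketch (not OpenAI's code): it concatenates dummy caption tokens and dVAE image tokens into one 1,280-token sequence and runs a tiny decoder-only transformer over it with a causal mask. The vocabulary sizes match the paper (16,384 text BPE codes, 8,192 image codes), but the width, depth, and dense attention here are deliberately simplified stand-ins for the 12-billion-parameter sparse model.

```python
import torch
import torch.nn as nn

# Rough sizes from the paper: up to 256 text tokens and 1,024 image tokens in one stream.
TEXT_LEN, IMG_LEN = 256, 1024
TEXT_VOCAB, IMG_VOCAB = 16384, 8192
D_MODEL = 256  # tiny width, just for the sketch

text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
img_emb = nn.Embedding(IMG_VOCAB, D_MODEL)
pos_emb = nn.Embedding(TEXT_LEN + IMG_LEN, D_MODEL)

# A small dense-attention stack stands in for DALL-E's sparse-attention decoder.
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
decoder_only = nn.TransformerEncoder(layer, num_layers=2)

text_tokens = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))  # dummy caption tokens
image_tokens = torch.randint(0, IMG_VOCAB, (1, IMG_LEN))   # dummy dVAE tokens

x = torch.cat([text_emb(text_tokens), img_emb(image_tokens)], dim=1)
x = x + pos_emb(torch.arange(TEXT_LEN + IMG_LEN))

# Causal mask: every image token can attend to all text tokens and to earlier image tokens.
seq_len = TEXT_LEN + IMG_LEN
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

out = decoder_only(x, mask=causal)
print(out.shape)  # torch.Size([1, 1280, 256])
```

Generating an image then amounts to sampling the next image token 1,024 times and handing the finished 32x32 grid back to the dVAE decoder.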
Then it is finally fed into a pre-trained contrastive model, which is in fact the pre-trained CLIP model that OpenAI published in early January. It's used to optimize the relationship between an image and a specific text: given an image generated by the transformer and the initial caption, CLIP assigns a score based on how well the image matches the caption. The CLIP model was even used on Unsplash images to help you find the image you are looking for, as well as to find specific frames in a video from a text input. Of course, in our case we already have a generated image and we just want it to match the text input, but CLIP still gives us a perfect measure to use as a penalty function to improve the results of the transformer's decoder iteratively during training.
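
Here's what that scoring step looks like in code, as a hedged sketch using the `clip` package from OpenAI's repo (github.com/openai/CLIP). The solid-color candidate images are obviously placeholders; in the real pipeline they would be the samples decoded from the transformer's image tokens.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

caption = "an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants"
text = clip.tokenize([caption]).to(device)

# Placeholder candidates; really these would come from the transformer + dVAE decoder.
candidates = [Image.new("RGB", (256, 256), c) for c in ("white", "black", "blue")]
images = torch.stack([preprocess(im) for im in candidates]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    # Cosine similarity between the caption and each candidate image.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(-1)

best = scores.argmax().item()
print(scores.tolist(), "-> keeping candidate", best)
```

The candidates with the highest CLIP scores are the ones worth keeping.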
CLIP's capabilities are very similar to the zero-shot capabilities of GPT-2 and GPT-3. Similarly, CLIP was also trained on a huge dataset of 400 million text-image pairs. This zero-shot capability means that it works on images and text samples that were not found in the training dataset, which are also referred to as unseen object categories.
Finally, the overall architecture was trained using 250 million text-image pairs taken from the internet, mostly from Wikipedia, and it basically learns to generate a new image based on the given tokens as inputs, just like we described earlier in the video. This was possible because transformers allow much more parallelization during training, making it way faster while producing more accurate results, being powerful natural language tools as well as powerful computer vision tools when used with a proper encoding system, of course.
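
Learning to "generate a new image based on the given tokens" boils down to next-token prediction over that 1,280-token stream. Here's a minimal sketch of the image-token part of that objective, continuing the tiny decoder-only snippet from above (it reuses `decoder_only`, `out`, `image_tokens`, and the size constants defined there); the actual loss in the paper also covers the text tokens.

```python
import torch.nn as nn
import torch.nn.functional as F

# Project the decoder states back onto the image codebook and apply the usual
# next-token cross-entropy: the state just before each image token predicts it.
to_image_logits = nn.Linear(D_MODEL, IMG_VOCAB)

logits = to_image_logits(out[:, TEXT_LEN - 1 : -1, :])  # (1, 1024, 8192)
targets = image_tokens                                   # (1, 1024) dVAE tokens as labels

loss = F.cross_entropy(logits.reshape(-1, IMG_VOCAB), targets.reshape(-1))
print(loss.item())  # with random weights this sits near log(8192), about 9.0
```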
This was just an overview of this new paper by OpenAI. I strongly recommend reading the DALL-E paper and the CLIP paper to get a better understanding of this approach. I'm excited to see what the community will do with this code now that it's available. Please leave a like if you made it this far in the video, and since over 80 percent of you are not subscribed yet, consider subscribing to the channel so you don't miss any further news. Thank you for watching!