Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2. And you won't believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text: the results are four times the resolution!
As if that wasn't already impressive enough, the new model learned a new skill: image inpainting.
DALL·E could generate images from text inputs.
DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better! Or simply add a feature you want, like some flamingos in the background.
Sound interesting? Learn more in the video!
►Read the full article: https://www.louisbouchard.ai/openais-new-model-dall-e-2-is-amazing/
►A. Ramesh et al., 2022, DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
►OpenAI's blog post: https://openai.com/dall-e-2
►Risks and limitations: https://github.com/openai/dalle-2-preview/blob/main/system-card.md
►OpenAI DALL·E's Instagram page: https://www.instagram.com/openaidalle/
►My Newsletter (a new AI application explained weekly, straight to your inbox!): https://www.louisbouchard.ai/newsletter/
Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2, and you won't believe the progress in a single year. DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution. As if that wasn't already impressive enough, the recent model learned a new skill: image inpainting. DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there: it can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background.

This is what image inpainting is: we take a part of an image and replace it with something else, following the style and reflections in the image and keeping it realistic. Of course, it doesn't only replace the part of the image at random; that would be too easy for OpenAI. This inpainting process is also text-guided, which means you can tell it to add a flamingo here, there, or even there.
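To make the masking idea behind inpainting concrete, here is a minimal sketch. The file name, the mask coordinates, and the random "repaint" array are all placeholders (DALL·E 2's actual editing API is not shown here); the point is simply that pixels outside the mask are kept exactly as they were, and only the masked region gets repainted.

import numpy as np
from PIL import Image

# Inpainting keeps the pixels outside the mask untouched and lets the model
# repaint only the masked region. Here we only visualise that constraint:
# the masked area is filled with a stand-in "repaint"; everything else is
# preserved exactly as in the original photo.
image = np.array(Image.open("beach.jpg").convert("RGB")).astype(np.float32)
mask = np.zeros(image.shape[:2], dtype=bool)
mask[50:250, 300:500] = True                        # where the flamingos should go

proposal = np.random.rand(*image.shape) * 255       # stand-in for the model's repaint
blended = np.where(mask[..., None], proposal, image)
Image.fromarray(blended.astype(np.uint8)).save("masked_edit_preview.jpg")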
Before diving into the nitty-gritty of this newest DALL·E model, let me talk a little about this episode's sponsor, Weights & Biases. If you are not familiar with Weights & Biases, you are most certainly new here and should definitely subscribe to the channel. Weights & Biases allows you to keep track of all your experiments with only a handful of lines added to your code. One feature I love is how you can quickly create and share amazing-looking interactive reports like this one, clearly showing your team or future self your runs' metrics, hyperparameters, and data configurations, alongside any notes you or your team had at the time. It's a powerful feature to either add quick comments on an experiment or create polished pieces of analysis. Reports can also be used as dashboards for reporting a smaller subset of metrics than the main workspace. You can even create public view-only links to share with anyone. Easily capturing and sharing your work is essential if you want to grow as an ML practitioner, which is why I recommend using tools that improve your work, like Weights & Biases. Just try it with the first link below and start sharing your work like a pro.
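For reference, this is roughly what "a handful of lines" looks like with the wandb Python package; the project name, config values, and the loss here are placeholders, not anything from the video.

import wandb

# Start a run; the config values show up next to the run in the W&B dashboard.
run = wandb.init(project="dalle2-notes", config={"lr": 1e-4, "batch_size": 32})

for step in range(100):
    loss = 1.0 / (step + 1)            # placeholder for a real training loss
    wandb.log({"loss": loss})

run.finish()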
Now let's dive into how DALL·E 2 can not only generate images from text but is also capable of editing them. Indeed, this new inpainting skill the network has learned is due to its better understanding of concepts and of the images themselves, locally and globally. What I mean by locally and globally is that DALL·E 2 has a deeper understanding of why the pixels next to each other have these colors, as it understands the objects in the scene and their interrelation to each other. This way, it is able to understand that this water has reflections and that the object on the right should also be reflected there. It also understands the global scene, which is what is happening, just like if you were to describe what is going on when the person took the photo. Here, you'd say that this photo does not exist, obviously, or else I'm definitely down to try that. If we forget that this is impossible, you'd say that the astronaut is riding a horse in space. So if I were to ask you to draw the same scene but on a planet rather than in free space, you'd be able to picture something like that, since you understand that the horse and the astronaut are the objects of interest to keep in the picture. This seems obvious, but it's extremely complex for a machine that only sees pixels of colors, which is why DALL·E 2 is so impressive to me.
But how exactly does the model understand the text we send it and generate an image out of it? Well, it's pretty similar to the first model I covered on the channel. It starts by using the CLIP model by OpenAI to encode both a text and an image into the same domain, a condensed representation called a latent code. Then it takes this encoding and uses a generator, also called a decoder, to generate a new image that means the same thing as the text, since it comes from the same latent code. So DALL·E 2 has two steps: CLIP to encode the information, and the new decoder model to take this encoded information and generate an image out of it.
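Here is a small sketch of that shared latent space idea, using OpenAI's publicly released CLIP package. The image file and the caption are made up, and this ViT-B/32 checkpoint is not necessarily the CLIP variant used inside DALL·E 2; it just shows text and image landing in the same embedding space.

import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("horse_astronaut.png")).unsqueeze(0).to(device)
text = clip.tokenize(["an astronaut riding a horse in space"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)   # both embeddings live in the same
    text_emb = model.encode_text(text)      # shared latent space

# Cosine similarity tells us how well the caption matches the picture.
print(torch.cosine_similarity(image_emb, text_emb).item())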
These two separate steps are also why we can generate variations of the images: we can simply change the encoded information randomly, just a little, making it move a tiny bit in the latent space, and it will still represent the same sentence while having all-different values, creating a different image representing the same text.
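In code, that "move a tiny bit in the latent space" amounts to adding a small amount of noise to the embedding before decoding it. The embedding below is a random stand-in and the 0.05 scale is arbitrary; the decoding step itself is not shown.

import torch

# `image_emb` stands in for a CLIP image embedding like the one computed above.
image_emb = torch.randn(1, 512)

def jitter(embedding: torch.Tensor, scale: float = 0.05) -> torch.Tensor:
    # Nudging the embedding keeps the semantics (same caption) but changes the
    # exact values, so the decoder paints a different image each time.
    return embedding + scale * torch.randn_like(embedding)

# Each of these would be fed to the decoder to get four variations of one prompt.
variations = [jitter(image_emb) for _ in range(4)]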
As we see here, it initially takes a text input and encodes it. What we see above is the first step of the training process, where we also feed it an image and encode it using CLIP, so that images and text are encoded similarly, following the CLIP objective. Then, to generate a new image, we switch to the section below, where we take the text encoding, guided by CLIP, and transform it into an image-ready encoding. This transformation is done using a diffusion prior, which we will cover shortly, as it is very similar to the diffusion model used for the final step. Finally, we use our newly created image encoding and decode it into a new image using the diffusion decoder.
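Putting those pieces together, the generation pipeline looks roughly like the sketch below. All three functions are stand-ins with made-up shapes; they only mirror the structure described above (text embedding, then prior, then image embedding, then decoder), not the real models.

import torch

def clip_text_encoder(prompt: str) -> torch.Tensor:
    # Stand-in for CLIP's text encoder (see the earlier snippet).
    return torch.randn(1, 512)

def diffusion_prior(text_emb: torch.Tensor) -> torch.Tensor:
    # Stand-in: the real prior is itself a diffusion model mapping a CLIP
    # text embedding to a plausible CLIP image embedding.
    return text_emb + 0.1 * torch.randn_like(text_emb)

def diffusion_decoder(image_emb: torch.Tensor) -> torch.Tensor:
    # Stand-in: the real decoder denoises random pixels conditioned on image_emb.
    return torch.rand(3, 256, 256)

prompt = "an astronaut riding a horse on a planet"
text_emb = clip_text_encoder(prompt)     # step 1: encode the text
image_emb = diffusion_prior(text_emb)    # step 2: text embedding -> image embedding
image = diffusion_decoder(image_emb)     # step 3: image embedding -> pixels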
A diffusion decoder, or diffusion model, is a kind of model that starts with random noise and learns how to iteratively change this noise to get back to an image. It learns that by doing the opposite during training: we feed it images and apply random Gaussian noise to the image iteratively until we can't see anything other than noise; then we simply reverse the model to generate images from noise.
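Here is a minimal sketch of that training-time corruption, using a standard DDPM-style noise schedule. The schedule values and the image size are illustrative, not DALL·E 2's exact settings.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # DDPM-style noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    # Mix the clean image with Gaussian noise; the larger t is, the less of the
    # original image survives. The decoder is trained to undo this, step by step.
    eps = torch.randn_like(x0)
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

x0 = torch.rand(3, 64, 64)        # stand-in for a training image scaled to [0, 1]
x_half = add_noise(x0, T // 2)    # partially noised
x_last = add_noise(x0, T - 1)     # essentially pure noise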
If you'd like more detail about this kind of network, which is really cool, I invite you to watch the video I made about them. And voilà, this is how DALL·E 2 generates such high-quality images following text. It's super impressive, and it tells us that the model does understand the text. But does it deeply understand what it created? Well, it sure looks like it. It's the capability of inpainting images that makes us believe it does understand the pictures pretty well. But why is that so? How can it link a text input to an image and understand the image enough to replace only some parts of it without affecting the realism? This is all because of CLIP, as it links a text input to an image. If we encode back our newly generated image and use a different text input to guide another generation, we can generate a second version of the image that will replace only the wanted region in our first generation, and you will end up with this picture.
Unfortunately, the code isn't publicly available and is not in their API yet. The reason for that, as per OpenAI, is to study the risks and limitations of such a powerful model. They actually discuss these potential risks and the reason for this privacy in their paper and in a great repository I linked in the description below, if you are interested. They also opened an Instagram account to share more results, if you'd like to see that; it's also linked below. I loved DALL·E, and this one is even cooler.

Of course, this was just an overview of how DALL·E 2 works, and I strongly invite you to read their great paper, linked below, for more detail on their implementation of the model. I hope you enjoyed this video as much as I enjoyed making it, and I will see you next week with another amazing paper. Thank you for watching!