Date: 2022-04-07
I explain Artificial Intelligence terms and news to non-experts.
Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it’s time for its big brother, DALL·E 2. And you won’t believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text; the results are also four times the resolution!
As if that wasn’t already impressive enough, the recent model learned a new skill: image inpainting.
DALL·E could generate images from text inputs.
DALL·E 2 can do it better, but it doesn’t stop there. It can also edit those images and make them look even better! Or simply add a feature you want like some flamingos in the background.
Sounds interesting? Learn more in the video!
Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2, and you won't believe the progress in a single year. DALL·E 2 is not only better at generating photorealistic images from text; the results are also four times the resolution. As if that wasn't already impressive enough, the recent model learned a new skill: image inpainting. DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background.
This is what image inpainting is: we take a part of an image and replace it with something else, following the style and reflections in the image and keeping it realistic. Of course, it doesn't just replace that part of the image at random; that would be too easy for OpenAI. This inpainting process is also text-guided, which means you can tell it to add a flamingo here, there, or even there.
Now let's dive into how DALL·E 2 can not only generate images from text but is also capable of editing them. Indeed, this new inpainting skill the network has learned is due to its better understanding of concepts and of the images themselves, locally and globally. What I mean by locally and globally is that DALL·E 2 has a deeper understanding of why the pixels next to each other have these colors, as it understands the objects in the scene and how they relate to each other. This way, it is able to understand that this water has reflections and that the object on the right should also be reflected there.
It also understands the global scene, which is what is happening, just like if you were to describe what was going on when the person took the photo. Here, you'd say that this photo does not exist, obviously, or else I'm definitely down to try that. If we forget that this is impossible, you'd say that the astronaut is riding a horse in space. So if I were to ask you to draw the same scene, but on a planet rather than in free space, you'd be able to picture something like that, since you understand that the horse and the astronaut are the objects of interest to keep in the picture. This seems obvious, but it's extremely complex for a machine that only sees pixels of colors, which is why DALL·E 2 is so impressive to me.
But how exactly does the model understand the text we send it and generate an image out of it? Well, it's pretty similar to the first model I covered on the channel. It starts by using the CLIP model by OpenAI to encode both a text and an image into the same domain: a condensed representation called a latent code. Then it takes this encoding and uses a generator, also called a decoder, to generate a new image that means the same thing as the text, since it comes from the same latent code. So DALL·E 2 has two steps: CLIP to encode the information, and the new decoder model to take this encoded information and generate an image out of it. These two separate steps are also why we can generate variations of the images. We can simply change the encoded information randomly, just a little, making it move a tiny bit in the latent space. It will still represent the same sentence while having different values, creating a different image representing the same text.
As we see here, it initially takes a text input and encodes it. What we see above is the first step of the training process, where we also feed it an image and encode it using CLIP so that images and text are encoded similarly, following the CLIP objective. Then, for generating a new image, we switch to the section below, where we use the text encoding guided by CLIP to transform it into an image-ready encoding. This transformation is done using a diffusion prior, which we will cover shortly, as it is very similar to the diffusion model used for the final step. Finally, we take our newly created image encoding and decode it into a new image using the diffusion decoder.
A diffusion decoder, or diffusion model, is a kind of model that starts with random noise and learns how to iteratively change this noise to get back to an image. It learns that by doing the opposite during training: we feed it images and apply random Gaussian noise to them iteratively until we can't see anything other than noise. Then we simply reverse the model to generate images from noise.
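Two ideas from this part of the video are easy to try in a few lines of Python: nudging a latent code to get a variation, and noising an image step by step. The sketch below is only an illustration under stated assumptions, not OpenAI's code. The latent is a random 512-value vector standing in for a real CLIP encoding, the tiny "image" is a list of random pixel values, the linear noise schedule (1e-4 to 0.02 over 1,000 steps) is a common choice in the diffusion literature rather than something stated in the video, and every function name is made up for the example.

```python
import math
import random


def perturb_latent(latent, scale, rng):
    """Add small Gaussian noise to every coordinate of a latent vector."""
    return [v + rng.gauss(0.0, scale) for v in latent]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; an assumed, commonly used linear schedule."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def forward_diffuse(pixels, betas, rng):
    """Noise the input step by step: x_t = sqrt(1 - b) * x_{t-1} + sqrt(b) * noise."""
    x = list(pixels)
    for beta in betas:
        keep = math.sqrt(1.0 - beta)
        spread = math.sqrt(beta)
        x = [keep * v + spread * rng.gauss(0.0, 1.0) for v in x]
    return x


def remaining_signal(betas):
    """Factor multiplying the original input after all steps: sqrt(prod(1 - b))."""
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
    return math.sqrt(product)


if __name__ == "__main__":
    rng = random.Random(0)

    # Stand-in for a CLIP latent code (random values, NOT a real embedding).
    latent = [rng.gauss(0.0, 1.0) for _ in range(512)]
    variation = perturb_latent(latent, scale=0.05, rng=rng)
    print("cosine similarity to original latent:", cosine_similarity(latent, variation))

    # Stand-in for a tiny flattened image with pixel values in [-1, 1].
    image = [rng.uniform(-1.0, 1.0) for _ in range(16)]
    betas = linear_betas()
    noisy = forward_diffuse(image, betas, rng)
    print("fraction of original signal left:", remaining_signal(betas))
```

With these assumed settings, the nudged latent stays very close in direction to the original (cosine similarity near 1), while the noised image keeps less than 1% of its original signal. Getting an image back would require a trained network that reverses these steps, which is not shown here.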
If you'd like more detail about this kind of network, which is really cool, I invite you to watch the video I made about them. And voilà, this is how DALL·E 2 generates such high-quality images following text. It's super impressive and tells us that the model does understand the text. But does it deeply understand what it created?
Well, it sure looks like it. It's the capability of inpainting images that makes us believe it understands the pictures pretty well. But why is that so? How can it link a text input to an image and understand the image well enough to replace only some parts of it without affecting the realism? This is all because of CLIP, as it links a text input to an image. If we encode our newly generated image back and use a different text input to guide another generation, we can generate a second version of the image that replaces only the wanted region of our first generation, and you end up with this picture.
Unfortunately, the code isn't publicly available and is not in their API yet. The reason for that, as per OpenAI, is to study the risks and limitations of such a powerful model.
OpenAI actually discusses these potential risks, and the reason for keeping the model private, in their paper and in a great repository I linked in the description below, if you are interested. They also opened an Instagram account to share more results; if you'd like to see that, it's also linked below. I loved DALL·E, and this one is even cooler.
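To make the "replace only the wanted region" idea from the inpainting part concrete, here is a toy Python sketch. It assumes we already have two images of the same size (the original and a newly generated one) plus a binary mask marking the region to edit, and it copies generated pixels inside the mask and original pixels everywhere else. This is not how DALL·E 2 is implemented, since its code is not public; `composite_inpaint` and the tiny grayscale grids are made up for illustration.

```python
def composite_inpaint(original, generated, mask):
    """Use `generated` pixels where the mask is 1 and `original` pixels elsewhere."""
    if len(original) != len(generated) or len(original) != len(mask):
        raise ValueError("original, generated and mask must have the same height")
    result = []
    for orig_row, gen_row, mask_row in zip(original, generated, mask):
        if len(orig_row) != len(gen_row) or len(orig_row) != len(mask_row):
            raise ValueError("all rows must have the same width")
        # Pick each pixel from the generated image only inside the masked region.
        result.append([g if m else o for o, g, m in zip(orig_row, gen_row, mask_row)])
    return result


if __name__ == "__main__":
    original = [[10, 10, 10], [10, 10, 10]]   # hypothetical 2x3 grayscale image
    generated = [[99, 99, 99], [99, 99, 99]]  # hypothetical new generation
    mask = [[0, 1, 0], [0, 1, 1]]             # 1 marks the region to replace
    print(composite_inpaint(original, generated, mask))
```

A hard mask like this only shows which pixels are allowed to change. The difficult part, which the video highlights, is making the new region match the lighting, reflections, and style of its surroundings.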
Of course, this was just an overview of how DALL·E 2 works, and I strongly invite you to read their great paper, linked below, for more detail on their implementation of the model. I hope you enjoyed this video as much as I enjoyed making it, and I will see you next week with another amazing paper. Thank you for watching!