If you thought DALL·E 2 had great results, wait until you see what this new model from Google Brain can do.
DALL·E 2 is amazing but often lacks realism, and this is what the team attacked with this new model called Imagen.
They share a lot of results on their project page, as well as a benchmark, which they introduced for comparing text-to-image models, where they clearly outperform DALL·E 2 and previous image generation approaches. Learn more in the video...
►Read the full article: https://www.louisbouchard.ai/google-brain-imagen/
►Paper: Saharia et al., 2022, Imagen - Google Brain, https://gweb-research-imagen.appspot.com/paper.pdf
►Project link: https://gweb-research-imagen.appspot.com/
►My Newsletter (A new AI application explained weekly, straight to your inbox!): https://www.louisbouchard.ai/newsletter/
If you thought DALL·E 2 had great results, wait until you see what this new model from Google Brain can do. DALL·E 2 is amazing but often lacks realism, and this is what the team attacked with this new model called Imagen. They share a lot of results on their project page, as well as a benchmark, which they introduced for comparing text-to-image models, where they clearly outperform DALL·E 2 and previous image generation approaches. This benchmark is also super cool, as we see more and more text-to-image models and it's pretty difficult to compare the results, unless we assume the results are really bad, which we often do. But this model and DALL·E 2 definitely defied the odds.
TL;DR: it's a new text-to-image model that you can compare to DALL·E 2, with more realism as per human testers. So just like DALL·E 2, which I covered not even a month ago, this model takes text like "a golden retriever dog wearing a blue checkered beret and a red dotted turtleneck" and tries to generate a photorealistic image out of this weird sentence. The main point here is that Imagen can not only understand text, but it can also understand the images it generates, since they are more realistic than all previous approaches.

Of course, when I say "understand," I mean its own kind of understanding, which is really different from ours. The model doesn't really understand the text or the image it generates. It definitely has some kind of knowledge about it, but it mainly understands how this particular kind of sentence with these objects should be represented using pixels on an image. But I'll concede that it sure looks like it understands what we send it when we see those results. Obviously, you can trick it with some really weird sentences that couldn't look realistic, like this one, but it sometimes beats even your own imagination and just creates something amazing.
Still, what's even more amazing is how it works, using something I never discussed on the channel: a diffusion model. But before using this diffusion model, we first need to understand the text input, and this is also the main difference with DALL·E. They used a huge text model, similar to GPT-3, to understand the text as best as an AI system can. So instead of training a text model along with the image generation model, they simply use a big pre-trained model and freeze it so that it doesn't change during the training of the image generation model. From their study, this led to much better results, and it seemed like the model understood text better. So this text module is how the model understands text, and this understanding is represented in what we call encodings, which is what the model has been trained to do on huge datasets: transform text inputs into a space of information that it can use and understand.
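To make this concrete, here is a minimal sketch of the frozen-text-encoder step, assuming the Hugging Face Transformers library. This is not the authors' code: the paper uses a frozen T5-XXL encoder, and I load a small T5 checkpoint here just to keep the example lightweight.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Pre-trained text encoder. Imagen uses T5-XXL; "t5-small" is just a
# lightweight stand-in for this sketch.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# Freeze it: the text encoder is never updated while the image
# generation model trains.
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

prompt = "A golden retriever dog wearing a blue checkered beret and a red dotted turtleneck."
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # One embedding per token: the "encodings" the diffusion model
    # will be conditioned on.
    encodings = encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)
```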
Now we need to use this transformed text data to generate the image, and as I said, they used a diffusion model to achieve that. But what is a diffusion model? Diffusion models are generative models that convert random Gaussian noise, like this, into images by learning how to reverse Gaussian noise iteratively. They are powerful models for super-resolution or other image-to-image translations, and in this case they use a modified U-Net architecture, which I covered numerous times in previous videos, so I won't enter into the architectural details here.
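In code, the core idea looks roughly like this. This is a minimal DDPM-style sketch of the general technique, not Imagen's actual implementation; `model` stands for the text-conditioned U-Net, and its signature is my assumption.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal level per step

def diffusion_loss(model, x0, text_encodings):
    """One training step: noise a clean image, ask the U-Net to predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))               # random timestep per image
    eps = torch.randn_like(x0)                  # Gaussian noise to add
    a = alpha_bars[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps  # noised image at step t
    eps_pred = model(x_t, t, text_encodings)    # U-Net predicts the added noise
    return F.mse_loss(eps_pred, eps)            # learning to denoise = learning to generate
```

At sampling time, the model runs this in reverse: starting from pure noise, it repeatedly subtracts its predicted noise, step by step, until an image remains.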
Basically, the model is trained to denoise an image from pure noise, which they orient using the text encodings and a technique called classifier-free guidance, which they say is essential and is clearly explained in their paper. I'll let you read it for more information on this technique. So now we have a model able to take random Gaussian noise and our text encoding and denoise it, with guidance from the text encodings, to generate our image.
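For the curious, classifier-free guidance boils down to a few lines at sampling time. This is a sketch of the general technique, not the paper's exact code; `model` is the same assumed U-Net as above, and passing `None` stands for dropping the text condition, which the model must occasionally see during training.

```python
def guided_noise_prediction(model, x_t, t, text_encodings, guidance_weight=7.0):
    """Classifier-free guidance (general technique; the weight value is illustrative)."""
    eps_cond = model(x_t, t, text_encodings)  # noise prediction with the text
    eps_uncond = model(x_t, t, None)          # noise prediction without it
    # Push the prediction toward what the text explains, away from the
    # unconditional prediction; weights > 1 strengthen text alignment.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```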
But as you see here, it isn't as simple as it sounds. The image we just generated is a very small image, as a bigger image would require much more computation and a much bigger model, which are not viable. Instead, we first generate a photorealistic image using the diffusion model we just discussed, and then use other diffusion models to improve the quality of the image iteratively. I already covered super-resolution models in past videos, so I won't enter into the details here, but let's do a quick overview. Once again, we want to have noise and not an image, so we cover up this initially generated low-resolution image with, again, some Gaussian noise, and we train our second diffusion model to take this modified image and improve it. Then, we repeat these two steps with another model, but this time using just patches of the image instead of the full image to do the same upscaling ratio and stay computationally viable. And voilà! We end up with our photorealistic high-resolution image.
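Putting the whole cascade together, the generation pipeline looks roughly like this. The 64x64 to 256x256 to 1024x1024 resolutions match the paper, but everything else is my own sketch: `sample` is a hypothetical stand-in for the iterative denoising loop (using the guided prediction above), and the model argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def upscale(x, size):
    # Naive bilinear upsampling of the previous stage's output.
    return F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)

def noise_augment(x, level=0.1):
    # Cover the low-resolution result with Gaussian noise again, so the
    # super-resolution model starts from a noisy input, as in training.
    return x + level * torch.randn_like(x)

def generate(text_encodings, base_model, sr_model_256, sr_model_1024):
    # Stage 1: text-to-image at 64x64, starting from pure noise.
    x = sample(base_model, torch.randn(1, 3, 64, 64), text_encodings)
    # Stage 2: upscale to 256x256 with a super-resolution diffusion model.
    x = sample(sr_model_256, noise_augment(upscale(x, 256)), text_encodings)
    # Stage 3: upscale to 1024x1024; in the paper this stage works on
    # patches of the image to stay computationally viable.
    x = sample(sr_model_1024, noise_augment(upscale(x, 1024)), text_encodings)
    return x
```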
Of course, this was just an overview of this exciting new model with really cool results. I definitely invite you to read their great paper for a deeper understanding of their approach and a detailed analysis of the results. And you, do you think the results are comparable to DALL·E 2? Are they better or worse? I sure think it is DALL·E's main competitor as of now. Let me know what you think of this new Google Brain publication and of the explanation. I hope you enjoyed this video, and if you did, please take a second to leave a like and subscribe to stay up to date with exciting AI news. If you are subscribed, I will see you next week with another amazing paper!