An Intro to eDiffi: NVIDIA's New SOTA Image Synthesis Model

Written by whatsai | Published 2022/11/05
Tech Story Tags: artificial-intelligence | machine-learning | computer-vision | synthetic-media | youtubers | youtube | hackernoon-top-story | ai | web-monetization | hackernoon-es | hackernoon-hi | hackernoon-zh | hackernoon-vi | hackernoon-fr | hackernoon-pt | hackernoon-ja

eDiffi, NVIDIA's most recent model, generates better-looking and more accurate images than all previous approaches like DALLE 2 or Stable Diffusion. eDiffi better understands the text you send and is more customizable, adding a feature we saw in a previous paper from NVIDIA: the painter tool. Learn more in the video...

References

► Read the full article: https://www.louisbouchard.ai/ediffi/
► Balaji, Y. et al., 2022, eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, https://arxiv.org/abs/2211.01324
► Project page: https://deepimagination.cc/eDiffi/
► My newsletter (a new AI application explained weekly in your inbox!): https://www.louisbouchard.ai/newsletter/

Video Transcript

eDiffi is the new state-of-the-art approach for image synthesis. It generates better-looking and more accurate images than all previous approaches like DALL-E 2 or Stable Diffusion. eDiffi better understands the text you send and is more customizable, adding a new feature we saw in a previous paper from NVIDIA: the painter tool. As they say, you can paint with words. In short, this means you can enter a few subjects and paint in the image what should appear here and there, allowing you to create much more customized images compared to a random generation following a prompt. This is the next level, enabling you to pretty much get the exact image you have in mind by simply drawing a horrible, quick sketch, something even I can do.
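In the paper, this painting control is implemented, roughly speaking, by biasing the cross-attention between image locations and the prompt tokens the user assigned to each painted region. Below is a minimal PyTorch sketch of that idea; the tensor shapes, the constant bias weight, and the toy usage are simplifying assumptions for illustration, not eDiffi's exact formulation.

```python
import torch
import torch.nn.functional as F

def paint_with_words_attention(q, k, v, region_mask, bias_weight=1.0):
    """Cross-attention with a user-painted bias (illustrative sketch only).

    q:           (num_pixels, dim)  image-feature queries
    k, v:        (num_tokens, dim)  text-token keys / values
    region_mask: (num_pixels, num_tokens) binary map, 1 where the user
                 painted the region belonging to that prompt token
    bias_weight: how strongly the painted map steers attention
                 (kept constant here for simplicity)
    """
    dim = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / dim**0.5   # standard attention logits
    scores = scores + bias_weight * region_mask   # push painted pixels toward their tokens
    attn = F.softmax(scores, dim=-1)
    return attn @ v                               # (num_pixels, dim)

# Toy usage: 4 pixels, 3 prompt tokens, 8-dimensional features.
q = torch.randn(4, 8)
k = torch.randn(3, 8)
v = torch.randn(3, 8)
mask = torch.zeros(4, 3)
mask[:2, 1] = 1.0   # the user painted the first two pixels with token 1's word
out = paint_with_words_attention(q, k, v, mask)
```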
As I mentioned, the results are not only state-of-the-art and better-looking than Stable Diffusion, but they are also way more controllable. Of course, it's a different use case, as it needs a bit more work and a clearer idea in mind to create such a draft, but it's definitely super exciting and interesting. It's also why I wanted to cover it on my channel: it's not merely a better model but also a different approach with much more control over the output. The tool isn't available yet, unfortunately, but I sure hope it will be soon. By the way, you should definitely subscribe to the channel and follow me on Twitter at @Whats_AI if you like this kind of video and would like access to easily digestible news on this heavily complicated field.
1:34
allow you to have more control in this
1:37
new model is by using the same feature
1:39
we saw but differently indeed the model
1:42
generates images Guided by a sentence
1:44
but it can also be influenced using a
1:47
quick sketch so it basically takes an
1:49
image and a text as inputs this means
1:52
you can do other stuff as it understands
1:54
images here they leverage this
1:56
capability by developing a style
1:58
transfer approach where you can
2:00
influence the style of the image
2:02
generation process giving an image with
2:04
a particular style well along with your
2:06
text input this is super cool and just
2:09
look at the results they speak for
2:11
themselves it's incredible beating both
2:14
Sota style transfer models and image
2:16
synthesis models with a single approach
Now, the question is: how could NVIDIA develop a model that creates better-looking images, enables more control over both the style and the image structure, and better understands and represents what you actually want in your text? Well, they change the typical diffusion architecture in two ways.

First, they encode the text using two different approaches that I already covered on the channel, which we refer to as the CLIP and T5 encoders. This means they use pre-trained models to take text and create various embeddings focusing on different features, as the two encoders are trained and behave differently, and embeddings are just representations maximizing what the sentence actually means for the algorithm, or the machine, to understand. Regarding the input image, they just use CLIP embeddings as well, basically encoding the image so that the model can understand it, which you can learn more about in my other videos covering generative models, as they are pretty much all built on CLIP. This is what allows them to have more control over the output, as well as to process text and images rather than only text.
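To make that concrete, here is a rough sketch of what encoding a prompt with both a CLIP and a T5 text encoder could look like using the Hugging Face transformers library. The checkpoint names are illustrative assumptions (eDiffi itself is not publicly released), and the resulting embeddings would then be fed to the denoising network as conditioning.

```python
import torch
from transformers import (CLIPTextModel, CLIPTokenizer,
                          AutoTokenizer, T5EncoderModel)

prompt = "a corgi wearing a red scarf, oil painting"

# CLIP text encoder: trained jointly with images, so its embeddings
# capture the visual gist of the prompt.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# T5 text encoder: a pure language model, better at the fine-grained
# wording and composition of the sentence.
t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-large")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-large")

with torch.no_grad():
    clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state
    t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state

# Both embedding sequences are handed to the denoisers as conditioning
# (e.g., through cross-attention), optionally alongside a CLIP *image*
# embedding of a style reference picture.
print(clip_emb.shape, t5_emb.shape)
```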
The second modification is using a cascade of diffusion models instead of reusing the same one iteratively, as we usually do with diffusion-based models. Here, they use models trained for specific parts of the generative process, meaning that each model does not have to be as general as the regular diffusion denoiser. Since each model has to focus on a specific part of the process, it can be much better at it. They use this approach because they observed that the denoising model seemed to rely on the text embeddings a lot more to orient its generation toward the beginning of the process, and then used them less and less, focusing instead on output quality and fidelity. This naturally brings the hypothesis that reusing the same denoising model throughout the whole process might not be the best idea, since it automatically focuses on different tasks, and we know that a generalist is far from expert level at all tasks. Why not use a few experts instead of one generalist to get much better results? So this is what they did, which is why they call them denoising experts, and it is the main reason for the improved performance in quality and faithfulness. The rest of the architecture is pretty similar to other approaches: upscaling the final results with other models to get a high-definition final image.
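To make the "denoising experts" idea concrete, here is a minimal PyTorch sketch of a sampling loop that routes each denoising step to a different expert depending on how noisy the image still is. The three-way split, the linear beta schedule, the DDIM-style update, and the placeholder experts are all simplifying assumptions for illustration; they are not eDiffi's actual experts, schedule, or sampler.

```python
import torch

def pick_expert(t, num_steps, experts):
    """Route this step to the denoiser trained for its noise range:
    high-noise steps set the layout from the text, low-noise steps add detail."""
    frac = t / num_steps
    if frac > 0.66:
        return experts["high_noise"]
    if frac > 0.33:
        return experts["mid_noise"]
    return experts["low_noise"]

@torch.no_grad()
def sample(experts, text_emb, shape=(1, 3, 64, 64), num_steps=50):
    # Linear beta schedule (illustrative values only).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    x = torch.randn(shape)                               # start from pure noise
    for t in reversed(range(num_steps)):
        denoiser = pick_expert(t, num_steps, experts)
        eps = denoiser(x, t, text_emb)                   # predicted noise at step t
        a_t = alphas_bar[t]
        a_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # estimate the clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # DDIM-style step
    return x

# Toy usage: three placeholder "experts" standing in for trained U-Nets.
dummy = lambda x, t, c: torch.zeros_like(x)
img = sample({"high_noise": dummy, "mid_noise": dummy, "low_noise": dummy},
             text_emb=None)
```

The only point of the sketch is the routing: early, high-noise steps go to the expert that leans most on the text conditioning, while late steps go to the expert specialized in fidelity and detail.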
The image and video synthesis fields are just getting crazy nowadays, and we are seeing impressive results coming out every week. I am super excited for the next releases, and I love to see different approaches with both innovative ways of tackling the problem and different use cases. As a great person once said: what a time to be alive! I hope you liked this quick overview of the approach, a bit more high-level than what I usually do, as it takes mostly parts I already covered in numerous videos and changes how they act. I invite you to watch my Stable Diffusion video to learn a bit more about the diffusion approach itself, and to read NVIDIA's paper to learn more about this specific approach and its implementation. I will see you next week with another amazing paper!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/11/05