
An Intro to eDiffi: NVIDIA's New SOTA Image Synthesis Model

by Louis Bouchard, November 5th, 2022

Too Long; Didn't Read

eDiffi, NVIDIA's most recent model, generates better-looking and more accurate images than all previous approaches like DALLE 2 or Stable Diffusion. eDiffi better understands the text you send and is more customizable, adding a feature we saw in a previous paper from NVIDIA: the painter tool. Learn more in the video...

References

► Read the full article: https://www.louisbouchard.ai/ediffi/
► Balaji, Y., et al., 2022, eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, https://arxiv.org/abs/2211.01324
► Project page: https://deepimagination.cc/eDiffi/
► My newsletter (a new AI application explained weekly, straight to your email!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:06

eDiffi is the new state-of-the-art approach for image synthesis. It generates better-looking and more accurate images than all previous approaches like DALL-E 2 or Stable Diffusion. eDiffi better understands the text you send and is more customizable, adding a new feature we saw in a previous paper from NVIDIA: the painter tool. As they say, you can paint with words. In short, this means you can enter a few subjects and paint in the image what should appear here and there, allowing you to create much more customized images compared to a random generation following a prompt. This is the next level, enabling you to pretty much get the exact image you have in mind by simply drawing a horrible quick sketch, something even I can do.
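
To make that concrete, here is a minimal sketch of what a "paint with words" input could look like: each phrase of the prompt is paired with a rough binary mask marking where it should appear. The dictionary layout, the 64×64 mask resolution, and the region_mask helper are illustrative assumptions, not NVIDIA's actual interface.

```python
# Hypothetical "paint with words" input: prompt phrases paired with user-drawn
# masks that say where each phrase should show up. The layout is an assumption.
import numpy as np

H = W = 64  # assumed resolution of the attention maps the masks are applied to


def region_mask(top: int, left: int, bottom: int, right: int) -> np.ndarray:
    """Return a binary mask with a rectangular painted region."""
    mask = np.zeros((H, W), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


paint_with_words_input = {
    "prompt": "a castle next to a river under a rainbow",
    "regions": [
        ("castle", region_mask(10, 4, 52, 28)),
        ("river", region_mask(46, 0, 64, 64)),
        ("rainbow", region_mask(0, 20, 14, 60)),
    ],
}
# During sampling, such masks would bias the cross-attention between image
# regions and the corresponding words, so each subject lands where it was painted.
```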

0:50

As I mentioned, the results are not only SOTA and better looking than Stable Diffusion, but they are also way more controllable. Of course, it's a different use case, as it needs a bit more work and a clearer idea in mind for creating such a draft, but it's definitely super exciting and interesting. It's also why I wanted to cover it on my channel, since it's not merely a better model but also a different approach with much more control over the output. The tool isn't available yet, unfortunately, but I sure hope it will be soon.

1:19

By the way, you should definitely subscribe to the channel and follow me on Twitter at @Whats_AI if you like this kind of video and would like to have access to easily digestible news on this heavily complicated field.

1:32

Another way in which they allow you to have more control in this new model is by using the same feature we saw, but differently. Indeed, the model generates images guided by a sentence, but it can also be influenced using a quick sketch, so it basically takes an image and a text as inputs. This means you can do other stuff, since it understands images. Here, they leverage this capability by developing a style transfer approach where you can influence the style of the image generation process, giving an image with a particular style along with your text input. This is super cool, and just look at the results; they speak for themselves. It's incredible, beating both SOTA style transfer models and image synthesis models with a single approach.

2:18

Now, the question is: how could NVIDIA develop a model that creates better-looking images, enables more control over both the style and the image structure, and better understands and represents what you actually want in your text? Well, they change the typical diffusion architecture in two ways.

2:39

First, they encode the text using two different approaches that I already covered on the channel, which we refer to as CLIP and T5 encoders. This means they use pre-trained models to take text and create various embeddings focusing on different features, as they are trained and behave differently, and embeddings are just representations maximizing what the sentence actually means for the algorithm, or the machine, to understand it.
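
As a rough sketch of this dual text encoding, the snippet below embeds the same prompt with a pre-trained CLIP text encoder and a T5 encoder through Hugging Face Transformers. The checkpoints are small, illustrative choices rather than the ones eDiffi was trained with, and how the two sets of embeddings are fed to the denoisers is simplified here.

```python
# Sketch: embed one prompt with both a CLIP text encoder and a T5 encoder.
# Checkpoints are illustrative, not the ones used in eDiffi.
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

prompt = "a golden retriever wearing a space suit, digital art"

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-base")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-base")

with torch.no_grad():
    clip_ids = clip_tok(prompt, padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt").input_ids
    clip_emb = clip_enc(clip_ids).last_hidden_state   # shape (1, 77, 768)

    t5_ids = t5_tok(prompt, return_tensors="pt").input_ids
    t5_emb = t5_enc(t5_ids).last_hidden_state          # shape (1, seq_len, 768)

# The denoising networks would then attend over both sets of embeddings,
# which capture somewhat different aspects of the same sentence.
text_conditioning = {"clip": clip_emb, "t5": t5_emb}
```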

3:04

Regarding the input image, they just use the CLIP embeddings as well, basically encoding the image so that the model can understand it, which you can learn more about in my other videos covering generative models, as they are pretty much all built on CLIP. This is what allows them to have more control over the output, as well as to process text and images rather than only text.
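
Here is a small sketch of that image-side encoding with a standard pre-trained CLIP model; the file path is a placeholder, and how eDiffi actually injects this embedding into its denoisers is not shown.

```python
# Sketch: turn a style reference image into a CLIP image embedding that a
# text-to-image model could condition on. Path and checkpoint are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

style_image = Image.open("style_reference.png")  # hypothetical style reference
inputs = processor(images=style_image, return_tensors="pt")

with torch.no_grad():
    style_embedding = model.get_image_features(**inputs)  # shape (1, 768)

# Passed alongside the text embeddings, this vector lets the sampler pull the
# generation toward the overall style of the reference image.
```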

3:25

The second modification is using a cascade of diffusion models instead of reusing the same one iteratively, as we usually do with diffusion-based models. Here, they use models trained for a specific part of the generative process, meaning that each model does not have to be as general as the regular diffusion denoiser. Since each model has to focus on a specific part of the process, it can be much better at it.

3:51

They use this approach because they observed that the denoising model seemed to use the text embeddings a lot more to orient its generation towards the beginning of the process, and then used them less and less, focusing on output quality and fidelity. This naturally brings the hypothesis that reusing the same denoising model throughout the whole process might not be the best idea, since it automatically focuses on different tasks, and we know that a generalist is far from expert level at all tasks. Why not use a few experts instead of one generalist to get much better results? So this is what they did, and it is why they call them denoising experts and the main reason for the improved performance in quality and faithfulness.
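
A minimal sketch of the ensemble-of-experts idea: several denoisers share one sampling loop, and the current timestep decides which expert runs. The two-expert split, the boundary at t = 500, and the dummy networks are made-up values for illustration only.

```python
# Sketch: route each denoising step to an expert trained for that noise interval.
# The number of experts and the boundary value are illustrative assumptions.
import torch
from torch import nn


class DummyExpert(nn.Module):
    """Stand-in for a real denoising U-Net; just echoes the noisy input."""
    def forward(self, noisy_image, t, cond):
        return noisy_image


class DenoisingExperts(nn.Module):
    def __init__(self, experts: list[nn.Module], boundaries: list[int]):
        super().__init__()
        # experts[0] handles the noisiest (earliest) steps, experts[-1] the cleanest.
        self.experts = nn.ModuleList(experts)
        self.boundaries = boundaries  # timestep thresholds separating the experts

    def forward(self, noisy_image: torch.Tensor, t: int, cond: torch.Tensor):
        # High t = early, very noisy steps where the text matters most;
        # low t = late steps focused on fidelity and detail.
        idx = sum(t < b for b in self.boundaries)
        return self.experts[idx](noisy_image, t, cond)


ensemble = DenoisingExperts([DummyExpert(), DummyExpert()], boundaries=[500])
x_t = torch.randn(1, 3, 64, 64)
eps_pred = ensemble(x_t, t=730, cond=torch.zeros(1, 77, 768))  # high-noise expert
```

Because each expert only ever handles its own slice of the noise schedule, it can specialize in that part of the process, while the cost of a single sampling step stays the same as with one shared denoiser.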

4:34

The rest of the architecture is pretty similar to other approaches, upscaling the final results with other models to get a high-definition final image.
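
As a sketch of that final stage, assuming a hypothetical .sample() interface and a common 64 → 256 → 1024 base-then-super-resolution schedule (neither taken from NVIDIA's code):

```python
# Sketch: cascaded generation, where a base diffusion model produces a small
# image and super-resolution diffusion models upscale it. The .sample() API,
# resolutions, and dummy stages are all assumptions for illustration.
import torch


class DummyStage:
    """Stand-in for a diffusion model stage; returns noise at the target size."""
    def __init__(self, out_size: int):
        self.out_size = out_size

    def sample(self, cond, low_res=None):
        return torch.randn(1, 3, self.out_size, self.out_size)


def cascade_sample(cond, base, sr_stages):
    image = base.sample(cond)                   # low-resolution draft, e.g. 64x64
    for sr in sr_stages:                        # e.g. 64 -> 256, then 256 -> 1024
        image = sr.sample(cond, low_res=image)  # each stage conditions on the previous output
    return image


final = cascade_sample(cond=None, base=DummyStage(64),
                       sr_stages=[DummyStage(256), DummyStage(1024)])
print(final.shape)  # torch.Size([1, 3, 1024, 1024])
```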

4:40

The image and video synthesis fields are just getting crazy nowadays, and we are seeing impressive results coming out every week. I am super excited for the next releases, and I love to see different approaches with both innovative ways of tackling the problem and also going for different use cases. As a great person once said: what a time to be alive! I hope you liked this quick overview of the approach, a bit more high-level than what I usually do, as it takes parts I already covered in numerous videos and changes them to act differently. I invite you to watch my Stable Diffusion video to learn a bit more about the diffusion approach itself, and to read NVIDIA's paper to learn more about this specific approach and its implementation. I will see you next week with another amazing paper!
