OpenAI's New Model is Amazing! DALL·E 2 Explained Simply by @whatsai


Louis Bouchard

I explain Artificial Intelligence terms and news to non-experts.

Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2. And you won't believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution!

As if it wasn’t already impressive enough, the recent model learned a new skill; image inpainting.

DALL·E could generate images from text inputs.

DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background.

Sounds interesting? Learn more in the video!

References

►Read the full article: https://www.louisbouchard.ai/openais-new-model-dall-e-2-is-amazing/
►A. Ramesh et al., 2022, DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
►OpenAI's blog post: https://openai.com/dall-e-2
►Risks and limitations: https://github.com/openai/dalle-2-preview/blob/main/system-card.md
►OpenAI DALL·E's Instagram page: https://www.instagram.com/openaidalle/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2, and you won't believe the progress in a single year. DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution. As if that wasn't already impressive enough, the recent model learned a new skill: image inpainting.

DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background. This is what image inpainting is: we take part of an image and replace it with something else, following the style and reflections in the image, keeping realism. Of course, it doesn't just replace a part of the image at random; that would be too easy for OpenAI. This inpainting process is also text-guided, which means you can tell it to add a flamingo here, there, or even there.
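The masking step at the heart of inpainting can be sketched with a toy example. This is not OpenAI's code (which is unreleased); it is a minimal NumPy sketch in which a masked region of an image is replaced with newly generated pixels while everything outside the mask is left untouched:

```python
import numpy as np

def inpaint(image, mask, generated):
    """Replace the masked region of `image` with `generated` pixels.

    image:     (H, W, 3) original image
    mask:      (H, W) boolean array, True where pixels should be replaced
    generated: (H, W, 3) candidate content (stand-in for text-guided model output)
    """
    result = image.copy()
    result[mask] = generated[mask]  # pixels outside the mask are kept as-is
    return result

# Toy 4x4 "photo" and a mask covering the top-left 2x2 corner
image = np.zeros((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
generated = np.ones((4, 4, 3))  # stand-in for the model's generated content

out = inpaint(image, mask, generated)
```

The real model, of course, fills the masked region with content that matches the prompt and the surrounding scene, rather than a constant.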

Before diving into the nitty-gritty of this newest DALL·E model, let me talk a little about this episode's sponsor, Weights & Biases. If you are not familiar with Weights & Biases, you are most certainly new here and should definitely subscribe to the channel. Weights & Biases allows you to keep track of all your experiments with only a handful of lines added to your code. One feature I love is how you can quickly create and share amazing-looking interactive reports like this one, clearly showing your team or future self your runs' metrics, hyperparameters, and data configurations, alongside any notes you or your team had at the time. It's a powerful feature for either adding quick comments on an experiment or creating polished pieces of analysis. Reports can also be used as dashboards for reporting a smaller subset of metrics than the main workspace. You can even create public, view-only links to share with anyone. Easily capturing and sharing your work is essential if you want to grow as an ML practitioner, which is why I recommend using tools that improve your work, like Weights & Biases. Just try it with the first link below and start sharing your work like a pro.

Now let's dive into how DALL·E 2 can not only generate images from text but is also capable of editing them. Indeed, this new inpainting skill the network has learned is due to its better understanding of concepts and of the images themselves, both locally and globally. What I mean by locally and globally is that DALL·E 2 has a deeper understanding of why the pixels next to each other have these colors, as it understands the objects in the scene and their interrelations. This way, it is able to understand that this water has reflections and that the object on the right should also be reflected there. It also understands the global scene, meaning what is happening, just as if you were to describe what was going on when the person took the photo. Here, you'd say that this photo does not exist, obviously; or else I'm definitely down to try that. If we forget that this is impossible, you'd say that the astronaut is riding a horse in space. So if I were to ask you to draw the same scene, but on a planet rather than in free space, you'd be able to picture something like that, since you understand that the horse and the astronaut are the objects of interest to keep in the picture. This seems obvious, but it's extremely complex for a machine that only sees pixels of colors, which is why DALL·E 2 is so impressive to me.

But how exactly does the model understand the text we send it, and how can it generate an image out of it? Well, it's pretty similar to the first model I covered on the channel. It starts by using the CLIP model by OpenAI to encode both a text and an image into the same domain, a condensed representation called a latent code. Then it takes this encoding and uses a generator, also called a decoder, to generate a new image that means the same thing as the text, since it comes from the same latent code. So DALL·E 2 has two steps: CLIP, to encode the information, and the new decoder model, to take this encoded information and generate an image out of it. These two separated steps are also why we can generate variations of the images. We can simply change the encoded information randomly, just a little, making it move a tiny bit in the latent space. It will still represent the same sentence while having slightly different values, creating a different image representing the same text.
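The variation trick can be illustrated with a toy latent space. This is a hedged sketch, not DALL·E 2's actual code: a random vector stands in for the latent code, and adding a small amount of Gaussian noise yields a nearby point that still has very high cosine similarity with the original, i.e. it still "means" roughly the same thing while its values all differ:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
latent = rng.normal(size=512)  # stand-in for a CLIP-style latent code

# Nudge the code slightly: a nearby point in latent space
variation = latent + 0.05 * rng.normal(size=512)

sim = cosine(latent, variation)  # remains close to 1.0
```

Decoding `variation` instead of `latent` would then produce a different image representing the same text.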

4:39

here it initially takes a text input and

4:42

encodes it what we see above is the

4:44

first step of the training process where

4:46

we also feed it an image and encode it

4:48

using clip so that images and text are

4:51

encoded similarly following the clip

4:53

objective then for generating a new

4:56

image we switch to the section below

4:58

where we use the text encoding guided by

5:00

clip to transform it into an image ready

5:03

encoding this transformation is done

5:05

using a diffusion prior which we will

5:07

cover shortly as it is very similar to

5:09

the diffusion model used for the final

5:12

step finally we use our newly created

5:14

image encoding and decode it into a new

5:17

image using the diffusion decoder a

5:20

diffusion decoder or modal is a kind of

5:23

model that starts with random noise and

5:25

learns how to iteratively change this

5:28

noise to get back to an image it learns

5:30

that by doing the opposite during

5:32

training we will feed it images and

5:34

apply random gaussian noise on the image

5:37

iteratively until we can't see anything

5:40

other than noise then we simply reverse

5:43

the model to generate images from noise
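That "apply random Gaussian noise iteratively" part of training can be sketched as follows. This is a minimal NumPy version of the forward diffusion process with an assumed linear variance schedule (the model's actual schedule and scale differ); each step keeps most of the signal and mixes in a little fresh Gaussian noise, until almost nothing of the original image remains:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(32, 32))  # toy grayscale image in [-1, 1]

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear noise schedule

x = image.copy()
for beta in betas:
    noise = rng.normal(size=x.shape)
    # Each step keeps sqrt(1 - beta) of the signal and adds beta-variance noise
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise

# After T steps, x is essentially pure noise: its correlation with the
# original image is near zero
corr = np.corrcoef(image.ravel(), x.ravel())[0, 1]
```

The decoder is trained to undo these steps one at a time, which is what lets it generate an image starting from pure noise.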

If you'd like more detail about this kind of network, which is really cool, I invite you to watch the video I made about it. And voilà, this is how DALL·E 2 generates such high-quality images following text. It's super impressive, and it tells us that the model does understand the text. But does it deeply understand what it created? Well, it sure looks like it. It's the capability of inpainting images that makes us believe it understands the pictures pretty well. But why is that? How can it link a text input to an image, and understand the image well enough to replace only some parts of it without affecting the realism? This is all because of CLIP, as it links a text input to an image. If we encode back our newly generated image and use a different text input to guide another generation, we can generate a second version of the image that replaces only the wanted region in our first generation, and you end up with this picture.
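That edit loop can be sketched with stub functions. Everything here is a hypothetical toy, not OpenAI's API: `encode_image` and `generate` are stand-ins for the real CLIP encoder and diffusion decoder. The shape of the loop is what matters: re-encode the first generation, generate again guided by the new text, then blend only the masked region:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real CLIP encoder and diffusion decoder
def encode_image(image):
    return image.mean(axis=(0, 1))  # toy "latent code" per channel

def generate(latent, text_prompt, shape=(8, 8, 3)):
    # Toy generator: the prompt only seeds some noise around the latent
    seed = abs(hash(text_prompt)) % (2**32)
    g = np.random.default_rng(seed)
    return np.clip(latent + g.normal(scale=0.1, size=shape), 0.0, 1.0)

first = generate(rng.uniform(size=3), "an astronaut riding a horse")
latent = encode_image(first)  # re-encode the first generation
second = generate(latent, "add flamingos in the background")

mask = np.zeros(first.shape[:2], dtype=bool)  # edit only the top rows
mask[:2, :] = True
edited = np.where(mask[..., None], second, first)
```

In the real model, the second generation is conditioned on the surrounding pixels too, which is why the edited region matches the scene's style and reflections.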

Unfortunately, the code isn't publicly available and is not in their API yet. The reason for that, as per OpenAI, is to study the risks and limitations of such a powerful model. They actually discuss these potential risks, and the reason for this privacy, in their paper and in a great repository I linked in the description below, if you are interested. They also opened an Instagram account to share more results, if you'd like to see that; it's also linked below. I loved DALL·E, and this one is even cooler. Of course, this was just an overview of how DALL·E 2 works, and I strongly invite you to read their great paper, linked below, for more detail on their implementation of the model. I hope you enjoyed this video as much as I enjoyed making it, and I will see you next week with another amazing paper. Thank you for watching!



