OpenAI’s DALL·E: Text-to-Image Generation Explained

Written by whatsai | Published 2021/02/26
Tech Story Tags: ai | artificial-intelligence | openai | youtubers | youtube-transcripts | machine-learning | computer-vision | hackernoon-top-story | web-monetization


OpenAI just released the paper showing how DALL-E works! It is called "Zero-Shot Text-to-Image Generation".
Here's a video explaining it:

Chapters:

0:00​ - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
2:40​ - Paper explanation

Video Transcript:

OpenAI successfully trained a network able to generate images from text captions. It's very similar to GPT-3 and Image GPT, and it produces amazing results. Let's see what it's really capable of.

In fact, it's a smaller version of GPT-3, using 12 billion parameters instead of 175 billion, but it has been specifically trained to generate images from text descriptions using a dataset of text-image pairs instead of a very broad dataset like GPT-3's. It can generate images from text captions written in natural language, just like GPT-3 can create websites and stories. It's a continuation of Image GPT and GPT-3, which I both covered in previous videos if you haven't watched them yet.
DALL·E is very similar to GPT-3 in that it's also a transformer language model, receiving text and images as inputs to output a final transformed image in many forms. It can edit attributes of specific objects in images, as you can see here, or even control multiple objects and their attributes at the same time. This is a very complicated task, since the network has to understand the relation between the objects and create an image based on its understanding.

Just take this example: feeding the network an emoji of a baby penguin wearing a blue hat, red gloves, a green shirt, and yellow pants. All these components need to be understood: the objects, the colors, and even the location of the objects, meaning that the gloves need to be both red and on the hands of the penguin, and the same for the rest. The results are very impressive considering the complexity of the task. It uses self-attention, as I described in a previous video, to understand the context of the text, and sparse attention for the images.
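To make that difference concrete, here is a small illustration of what a sparse attention pattern looks like compared to full causal self-attention. The row-plus-column pattern below is only one possible layout and is not taken from OpenAI's code; the 32x32 grid size is the one DALL·E uses for its image tokens.

```python
import torch

def full_causal_mask(n):
    # Full causal self-attention: every token may attend to itself
    # and to all earlier tokens (this is what the text tokens use).
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def sparse_row_column_mask(grid):
    # Image tokens laid out on a grid x grid raster: each token may only
    # attend to earlier tokens in its own row or its own column,
    # instead of to every earlier token.
    n = grid * grid
    rows = torch.arange(n) // grid
    cols = torch.arange(n) % grid
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return causal & (same_row | same_col)

grid = 32  # DALL·E compresses each image to a 32x32 grid of tokens
dense = full_causal_mask(grid * grid)
sparse = sparse_row_column_mask(grid)
print(f"full attention pairs:   {dense.sum().item():,}")   # 524,800
print(f"sparse attention pairs: {sparse.sum().item():,}")  # 32,768
```

The point is simply that the sparse pattern keeps far fewer query-key pairs, which is what makes attention over a thousand image tokens affordable.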
There are not many details about how it works or how exactly it was trained, but they will be publishing a paper explaining their approach, and I will be sure to cover it as soon as it's released.
OpenAI just released the paper explaining how DALL·E works. It's called "Zero-Shot Text-to-Image Generation." As I previously mentioned, it uses a transformer architecture to generate images from a text and a base image sent as input to the network. But it doesn't simply take the image it receives and send it to the network: first, in order to be understood by the transformer architecture, the information needs to be modeled into a single stream of data. This is because using the pixels of the image directly would require way too much memory for high-resolution images. Instead, they use a discrete variational autoencoder, called dVAE, that takes the input image and transforms it into a 32 by 32 grid, giving as a result 1,024 image tokens rather than millions of tokens for a high-resolution image. Indeed, the only task of this dVAE network is to reduce the memory footprint of the transformer by generating a new version of the image; you can see it as a kind of image-compression step.
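As a rough sketch of what that compression step does, the toy encoder below downsamples a 256x256 image to a 32x32 grid and picks one discrete codebook entry per grid cell, yielding 1,024 tokens. The 256x256 input size and the 8,192-entry codebook are the figures from the DALL·E paper; the three-layer network itself is a simplified stand-in, not OpenAI's convolutional/ResNet encoder.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192  # size of the dVAE codebook (as in the DALL·E paper)

class ToyDiscreteEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Downsample a 256x256 RGB image by a factor of 8 to a 32x32 feature
        # map, producing one logit per codebook entry at each grid position.
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),            # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),          # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(128, VOCAB_SIZE, 4, stride=2, padding=1),  # 64 -> 32
        )

    def forward(self, image):
        logits = self.net(image)       # (batch, 8192, 32, 32)
        tokens = logits.argmax(dim=1)  # most likely codebook entry per cell
        return tokens.flatten(1)       # (batch, 1024) image tokens

encoder = ToyDiscreteEncoder()
image = torch.randn(1, 3, 256, 256)    # a dummy 256x256 RGB image
print(encoder(image).shape)            # torch.Size([1, 1024])
```

The real dVAE trains this discrete choice with a relaxation rather than a hard argmax; the argmax here is only to show the shapes involved.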
The encoder and decoder in the dVAE are composed of classic convolutions and ResNet architectures with skip connections. If you've never heard of variational autoencoders before, I strongly recommend you watch the video I made explaining them. Fortunately, this dVAE network was also shared on OpenAI's GitHub, with a notebook to try it yourself and more details in the paper; the links are in the description below.
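If you want to play with the released dVAE, the condensed sketch below follows the usage shown in the notebook from OpenAI's DALL-E repository: it encodes an image into the 32x32 grid of tokens and decodes it back to see what survives the compression. The `dall_e` package, the model URLs, and the input file name are assumptions based on that repo and its notebook, so treat this as a starting point rather than verbatim code.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from dall_e import map_pixels, unmap_pixels, load_model  # from github.com/openai/DALL-E

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", device)

# Load a 256x256 RGB image and map its pixels into the range the dVAE expects.
img = Image.open("input.png").convert("RGB").resize((256, 256))  # hypothetical file
x = map_pixels(T.ToTensor()(img).unsqueeze(0).to(device))

# Encode: pick the most likely codebook entry at each of the 32x32 positions.
z_logits = enc(x)                  # (1, vocab_size, 32, 32)
z = torch.argmax(z_logits, dim=1)  # (1, 32, 32) -> 1,024 image tokens

# Decode the tokens back into pixels to inspect the "compressed" image.
z_onehot = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
x_rec = unmap_pixels(torch.sigmoid(dec(z_onehot).float()[:, :3]))
T.ToPILImage(mode="RGB")(x_rec[0].cpu()).save("reconstruction.png")
```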
The image tokens produced by the discrete VAE model are then sent, together with the text, as input to the transformer model. Again, as I described in my previous video about DALL·E, this transformer is a 12-billion-parameter sparse transformer model. Without diving too much into the transformer architecture, since I already covered it in a previous video, transformers are sequence-to-sequence models that often use encoders and decoders. In this case it only uses a decoder, since it takes the image generated by the dVAE and the text as inputs. Each of the 1,024 image tokens generated by the discrete VAE has access to all the text tokens, and using self-attention it can predict an optimal image-text pairing.
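Here is a toy-sized sketch of that single stream: text tokens followed by the 1,024 image tokens, modeled by a decoder-only transformer with a causal mask, so every image token can attend to all the text tokens before it. The sequence lengths and vocabulary sizes (256 text tokens, a 16,384-entry text vocabulary, an 8,192-entry image vocabulary) are the ones reported in the DALL·E paper; the tiny embedding size, layer count, and dense attention are placeholders, nothing close to the real 12-billion-parameter sparse model.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192  # vocabulary sizes from the paper
TEXT_LEN, IMAGE_LEN = 256, 1024        # sequence lengths from the paper
D_MODEL = 64                           # toy embedding size

embed_text = nn.Embedding(TEXT_VOCAB, D_MODEL)
embed_image = nn.Embedding(IMAGE_VOCAB, D_MODEL)
# A decoder-only stack: causal self-attention over the whole stream, so each
# image token sees all text tokens and all earlier image tokens.
layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)
to_image_logits = nn.Linear(D_MODEL, IMAGE_VOCAB)

text_tokens = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
image_tokens = torch.randint(0, IMAGE_VOCAB, (1, IMAGE_LEN))

# The single stream of data: [text tokens | image tokens].
stream = torch.cat([embed_text(text_tokens), embed_image(image_tokens)], dim=1)
seq_len = stream.shape[1]
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

hidden = decoder(stream, mask=causal_mask)
# Logits at the image positions (next-token shifting omitted for brevity).
image_logits = to_image_logits(hidden[:, TEXT_LEN:, :])
print(image_logits.shape)  # torch.Size([1, 1024, 8192])
```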
The output is then finally fed into a pre-trained contrastive model, which is in fact the pre-trained CLIP model that OpenAI published in early January. It's used to optimize the relationship between an image and a specific text: given an image generated by the transformer and the initial caption, CLIP assigns a score based on how well the image matches the caption. The CLIP model was even used on Unsplash images to help you find the image you are looking for, as well as to find specific frames in a video from a text input. Of course, in our case we already have a generated image and we just want it to match the text input, but CLIP still gives us a perfect measure to use as a penalty function to improve the results of the transformer's decoder iteratively during training. CLIP's capabilities are very similar to the zero-shot capabilities of GPT-2 and GPT-3. Similarly, CLIP was also trained on a huge dataset of 400 million text-image pairs. This zero-shot capability means that it works on images and text samples that were not found in the training dataset, which are also referred to as unseen object categories.
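To make the scoring step concrete, here is a small reranking sketch using the CLIP package OpenAI open-sourced (the github.com/openai/CLIP repository): it scores a few candidate images against one caption and keeps the best match. The candidate file names are hypothetical; in DALL·E's setup they would be several samples drawn from the transformer for the same caption.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

caption = "an emoji of a baby penguin wearing a blue hat, red gloves, green shirt and yellow pants"
candidates = ["candidate_0.png", "candidate_1.png", "candidate_2.png"]  # hypothetical generated samples

images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
text = clip.tokenize([caption]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    # Cosine similarity between each candidate image and the caption.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(1)

best = scores.argmax().item()
print(f"Best match: {candidates[best]} (CLIP score {scores[best].item():.3f})")
```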
Finally, the overall architecture was trained using 250 million text-image pairs taken from the internet, mostly from Wikipedia, and it basically learns to generate a new image based on the given tokens as inputs, just as we described earlier in the video. This was possible because transformers allow much more parallelization during training, making it way faster while producing more accurate results, and they are powerful natural language tools as well as powerful computer vision tools when used with a proper encoding system, of course.
This was just an overview of this new paper by OpenAI. I strongly recommend reading the DALL·E paper and the CLIP paper to get a better understanding of this approach. I'm excited to see what the community will do with this code now that it's available. Please leave a like if you made it this far in the video, and since over 80 percent of you are not subscribed yet, consider subscribing to the channel to not miss any further news. Thank you for watching!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/02/26