OpenAI’s DALL·E: Text-to-Image Generation Explained

Date: 2021-02-26

Louis Bouchard (@whatsai): I explain Artificial Intelligence terms and news to non-experts.

OpenAI just released the paper showing how DALL·E works! It is called "Zero-Shot Text-to-Image Generation".

Here's a video explaining it:

References:

A. Ramesh et al., Zero-shot text-to-image generation, 2021. arXiv:2102.12092 [cs.CV]

Code & more information for the discrete VAE used for DALL·E: https://github.com/openai/DALL-E

DALL·E paper: https://arxiv.org/pdf/2102.12092.pdf

OpenAI CLIP paper & code: https://openai.com/blog/clip/

CLIP used on Unsplash images search: https://github.com/haltakov/natural-l...

2:40 - Paper explanation



Video Transcript:

OpenAI successfully trained a network able to generate images from text captions. It's very similar to GPT-3 and Image GPT and produces amazing results.

Let's see what it's really capable of. In fact, it's a smaller version of GPT-3, using 12 billion parameters instead of 175 billion parameters, but it has been specifically trained to generate images from text descriptions using a dataset of text-image pairs instead of a very broad dataset like GPT-3's. It can generate images from text captions using natural language, just like GPT-3 can create websites and stories. It's a continuation of Image GPT and GPT-3, both of which I covered in previous videos, if you haven't watched them yet.

DALL·E is very similar to GPT-3 in the way that it's also a transformer language model, receiving text and images as inputs to output a final transformed image in many forms. It can edit attributes of specific objects in images, as you can see here, or even control multiple objects and their attributes at the same time.

This is a very complicated task, since the network has to understand the relation between the objects and create an image based on its understanding.

Just take this example: feeding to the network "an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants". All these components need to be understood: the objects, the colors, and even the location of the objects, meaning that the gloves need to be both red and on the hands of the penguin. The same thing goes for the rest, and the results are very impressive considering the complexity of the task.

It uses self-attention, as I described in a previous video, to understand the context of the text, and sparse attention for the images. There are not many details about how it works or how exactly it was trained, but they will be publishing a paper explaining their approach. I will be sure to cover it as soon as it's released.

OpenAI just released the paper explaining how DALL·E works. It's called "Zero-Shot Text-to-Image Generation".

As I previously mentioned, it uses a transformer architecture to generate images from a text and a base image sent as input to the network. But it doesn't simply take the image and the text and send them to the network. First, in order to be understood by the transformer architecture, the information needs to be modeled into a single stream of data.

This is because using the pixels of the image directly would require way too much memory for high-resolution images. Instead, they use a discrete variational autoencoder, called dVAE, that takes the input image and transforms it into a 32 by 32 grid, giving as a result 1,024 image tokens rather than millions of tokens for a high-resolution image.
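
To make the saving concrete, here is a small back-of-the-envelope sketch. The 32 by 32 grid comes from the transcript; the 256x256 RGB input size is an assumption taken from the DALL·E paper, and the function names are only illustrative.

```python
# Rough sequence-length arithmetic for feeding an image to a transformer.
# Assumption: a 256x256 RGB input (figure from the DALL-E paper).
# The 32x32 token grid is the figure quoted in the transcript.

def raw_pixel_values(width: int, height: int, channels: int = 3) -> int:
    """Number of scalar values if every pixel channel were its own token."""
    return width * height * channels


def grid_tokens(grid_side: int) -> int:
    """Number of tokens when the image is encoded as a square grid of codes."""
    return grid_side * grid_side


pixels = raw_pixel_values(256, 256)
tokens = grid_tokens(32)
print(pixels, tokens, pixels // tokens)  # 196608 1024 192
```

Under these assumptions the transformer sees 192 times fewer image positions than it would with raw pixel values, and the gap only grows for larger images.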

Indeed, the only task of this dVAE network is to reduce the memory footprint of the transformer by generating a new version of the image. You can see it as a kind of image compression step.
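
As a rough illustration of what "discrete" means here, the sketch below turns a grid of scores into one token per cell. It assumes the encoder emits one score (logit) per codebook entry for each grid cell and that the token is the index of the highest score; the tiny 2x2 grid and 4-entry codebook are hypothetical stand-ins for the real 32 by 32 grid.

```python
# Toy discretisation step: one code index per grid cell.
# Assumption: each cell holds a list of logits, one per codebook entry,
# and the token is simply the index of the largest logit (argmax).

def logits_to_tokens(logit_grid):
    """Map a grid of per-cell logit lists to a flat list of token ids."""
    tokens = []
    for row in logit_grid:
        for cell_logits in row:
            best = max(range(len(cell_logits)), key=cell_logits.__getitem__)
            tokens.append(best)
    return tokens


# Hypothetical 2x2 grid with a 4-entry codebook.
toy_logits = [
    [[0.1, 2.0, -1.0, 0.3], [1.5, 0.2, 0.0, -0.4]],
    [[-0.2, 0.0, 0.1, 3.0], [0.0, 0.5, 0.9, 0.2]],
]
print(logits_to_tokens(toy_logits))  # [1, 0, 3, 2]
```

The convolutional encoder that would produce these scores is left out; only the final step from scores to tokens is shown.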

The encoder and decoder in the dVAE are composed of classic convolutions and ResNet architectures with skip connections. If you've never heard of variational autoencoders before, I strongly recommend you watch the video I made explaining them.

Fortunately, this dVAE network was also shared on OpenAI's GitHub, with a notebook to try it yourself and implementation details in the paper. The links are in the description below.

These image tokens produced by the discrete VAE model are then sent with the text as input to the transformer model. Again, as I described in my previous video about DALL·E, this transformer is a 12-billion-parameter sparse transformer model.
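
One simple way to picture the "single stream" of text and image tokens is sketched below. The vocabulary sizes are figures from the paper rather than the transcript, and shifting image ids past the text vocabulary is my own illustrative choice for keeping the two id ranges apart, not necessarily what OpenAI's code does.

```python
# Sketch: merge text tokens and image tokens into one sequence of ids.
TEXT_VOCAB_SIZE = 16384   # assumption: text vocabulary size from the paper
IMAGE_VOCAB_SIZE = 8192   # assumption: dVAE codebook size from the paper


def build_stream(text_tokens, image_tokens):
    """Concatenate text ids and (offset) image ids into a single list."""
    for t in text_tokens:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text token out of range: {t}")
    for t in image_tokens:
        if not 0 <= t < IMAGE_VOCAB_SIZE:
            raise ValueError(f"image token out of range: {t}")
    # Hypothetical convention: image ids are shifted so they never collide
    # with text ids inside the shared sequence.
    return list(text_tokens) + [TEXT_VOCAB_SIZE + t for t in image_tokens]


print(build_stream([5, 42], [0, 8191]))  # [5, 42, 16384, 24575]
```

With the text placed first, every image position comes after the whole caption in the sequence.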

Without diving too much into the transformer's architecture, as I already covered it in a previous video, they are sequence-to-sequence models that often use encoders and decoders. In this case, it only uses a decoder, since it takes the image generated by the dVAE and the text as inputs. Each of the 1,024 image tokens that were generated by the discrete VAE has access to all text tokens, and using self-attention, it can predict an optimal image-text pairing.
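
The idea that every image token can look at all text tokens can be sketched with a plain causal attention mask. This is a simplified dense mask under the assumption that text tokens come first in the sequence; the real model uses sparse attention for the image part, which is not reproduced here.

```python
# Sketch: a dense causal mask over a [text tokens | image tokens] sequence.
# Simplification: the sparse attention patterns used for images are ignored.

def causal_mask(n_text: int, n_image: int):
    """mask[i][j] is True when position i may attend to position j."""
    n = n_text + n_image
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy sizes; the transcript's figure for images is 1,024 tokens.
mask = causal_mask(n_text=3, n_image=2)

# Row of the first image token: it sees the 3 text tokens and itself.
print(mask[3])  # [True, True, True, True, False]
```

Because the text comes first, the causal rule alone is enough to give each image position access to the full caption, while text positions never see image tokens.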

Then it is finally fed into a pre-trained contrastive model, which is in fact the pre-trained CLIP model that OpenAI published in early January. It's used to optimize the relationship between an image and a specific text. Given an image generated by the transformer and the initial caption, CLIP assigns a score based on how well the image matches the caption. The CLIP model was even used on Unsplash images to help you find the image you are looking for, as well as to find specific frames in a video from a text input.
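
Here is a minimal sketch of that scoring idea, assuming the caption and each candidate image have already been turned into embedding vectors in a shared space and that the score is their cosine similarity. The vectors below are made up for illustration; a real setup would get them from CLIP's text and image encoders.

```python
import math

# Sketch: score candidate images against a caption and rank them.
# Assumption: embeddings are plain lists of floats in a shared space.

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(caption_vec, image_vecs):
    """Return candidate indices sorted from best to worst caption match."""
    scores = [cosine_similarity(caption_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


caption = [1.0, 0.0, 1.0]          # hypothetical caption embedding
candidates = [
    [0.0, 1.0, 0.0],               # unrelated image
    [1.0, 0.1, 0.9],               # close match
    [0.5, 0.5, 0.5],               # partial match
]
print(rank_images(caption, candidates))  # [1, 2, 0]
```

Ranking several generated candidates by this kind of score and keeping the best ones is one straightforward way to use the measure.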

Of course, in our case, we already have an image generated and we just want it to match the text input. Well, CLIP still gives us a perfect measure to use as a penalty function to improve the results of the transformer's decoder iteratively during training. CLIP's capabilities are very similar to the zero-shot capabilities of GPT-2 and GPT-3. Similarly, CLIP was also trained on a huge dataset of 400 million text-image pairs. This zero-shot capability means that it works on images and text samples that were not found in the training dataset, which are also referred to as unseen object categories. Finally, the overall architecture was trained using 250 million text-image pairs taken from the internet, mostly from Wikipedia, and it basically learns to generate a new image based on the given tokens as inputs, just like we described earlier in the video.

This was possible because transformers allow for more parallelization during training, making it way faster while producing more accurate results. They are powerful natural language tools as well as powerful computer vision tools when used with a proper encoding system, of course.

This was just an overview of this new paper by OpenAI. I strongly recommend reading the DALL·E paper and the CLIP paper to have a better understanding of this approach. I'm excited to see what the community will do with this code now available. Please leave a like if you made it this far in the video, and since over 80 percent of you guys are not subscribed yet, consider subscribing to the channel to not miss any further news. Thank you for watching.

