Visual Generative Modeling: Using GANsformers to Generate Scenes

Written by whatsai | Published 2021/03/07
Tech Story Tags: artificial-intelligence | gans | generative-adversarial-network | computer-vision | visual-generative-modeling | transformer-architecture | youtubers | youtube-transcripts | web-monetization


This week we take a look at visual generative modeling. The goal is to generate a complete scene in high resolution, such as a road or a bedroom, rather than a single face image or object. The approach is similar to StyleGAN, which uses the GAN architecture in the traditional generative and discriminative way with convolutional neural networks, but it adds the transformers' attention mechanism on top.

Watch the Video:

Chapters:
0:00​ - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
0:24​ - Text-To-Image translation
0:51 -​ Examples
5:50​ - Conclusion

References

Paper: https://arxiv.org/pdf/2103.01209.pdf
Code: https://github.com/dorarad/gansformer

Complete reference:
Drew A. Hudson and C. Lawrence Zitnick, Generative Adversarial Transformers (2021), published on arXiv. Abstract:
"We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linearly efficiency, that can readily scale to high-resolution synthesis.
It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes.
In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network.
We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data efficiency.
Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer​."

Video Transcript

Note: This transcript is auto-generated by YouTube and may not be entirely accurate.
They basically leveraged the transformers' attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful.

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and would like to stay up to date, please consider subscribing to not miss any further news.
Last week we looked at DALL·E, OpenAI's most recent paper. It uses a similar architecture to GPT-3, involving transformers, to generate an image from text. This is a super interesting and complex task called text-to-image translation. As you can see again here, the results were surprisingly good compared to previous state-of-the-art techniques. This is mainly due to the use of transformers and a large amount of data.

This week we will look at a very similar task called visual generative modeling, where the goal is to generate a complete scene in high resolution, such as a road or a room, rather than a single face or a specific object. This is different from DALL·E since we are not generating the scene from text but from a model trained on a specific style of scenes, which is a bedroom in this case. Rather, it is just like StyleGAN, which is able to generate unique, non-existing human faces after being trained on a dataset of real faces.
The difference is that StyleGAN uses this GAN architecture in the traditional generative and discriminative way, with convolutional neural networks. A classic GAN architecture has a generator, trained to generate the image, and a discriminator, used to measure the quality of the generated images by guessing whether it is a real image coming from the dataset or a fake image produced by the generator. Both networks are typically composed of convolutional neural networks. The generator looks like this: it mainly downsamples the image using convolutions to encode it, and then upsamples it again using convolutions to generate a new version of the image with the same style, based on that encoding, which is why it is called StyleGAN. The discriminator then takes the generated image, or an image from your dataset, and tries to figure out whether it is real or generated, called fake.
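To make this classic setup concrete, here is a minimal, DCGAN-style sketch in PyTorch. It is purely illustrative and not the StyleGAN or GANsformer code: the generator turns a random latent vector into an image with transposed convolutions, the discriminator turns an image into a single real-or-fake score, and all layer sizes are arbitrary assumptions.

```python
# Minimal DCGAN-style sketch (illustrative only, not the GANsformer code).
# Generator: latent vector -> 64x64 image. Discriminator: image -> real/fake logit.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=128, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),         # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),           # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),            # 32x32
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Tanh(),                          # 64x64
        )

    def forward(self, z):
        # z: (batch, latent_dim) -> (batch, channels, 64, 64)
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 32x32
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),        # 16x16
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),       # 8x8
            nn.Conv2d(256, 1, 8),                                  # real/fake logit
        )

    def forward(self, x):
        return self.net(x).view(-1)

# Usage: sample latents, generate fake images, score them.
G, D = Generator(), Discriminator()
z = torch.randn(4, 128)
fake = G(z)        # (4, 3, 64, 64)
score = D(fake)    # (4,) real/fake logits
```

During training, the discriminator's scores on real versus generated batches drive both losses: the discriminator learns to tell the two apart while the generator learns to fool it.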
Instead, they leverage the transformers' attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful. Attention is an essential feature of this network, allowing it to draw global dependencies between input and output. In this case, the dependencies are between the input at the current step of the architecture and the latent code previously encoded, as we will see in a minute.
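As a rough picture of what that attention step does, here is a simplified cross-attention sketch in PyTorch: every position in the feature grid attends to a small set of latent vectors, so information can travel across the whole image in a single step. This is an assumed illustration of the general idea, not the paper's exact bipartite attention.

```python
# Illustrative cross-attention between latent vectors and image features
# (a simplified sketch of the idea, not the paper's implementation).
import torch

def cross_attention(features, latents):
    # features: (batch, h*w, d)  flattened image feature grid
    # latents:  (batch, k, d)    small set of latent vectors
    scale = features.size(-1) ** 0.5
    attn = torch.softmax(features @ latents.transpose(1, 2) / scale, dim=-1)  # (batch, h*w, k)
    return features + attn @ latents   # residual, latent-driven update of the features

feats = torch.randn(2, 16 * 16, 64)      # 16x16 feature map, 64 channels
lats = torch.randn(2, 8, 64)             # 8 latent vectors
updated = cross_attention(feats, lats)   # (2, 256, 64)
```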
Before diving into it, if you are not familiar with transformers or attention, I suggest you watch the video I made about transformers. For more details and a better understanding of attention, you should definitely have a look at the "Attention Is All You Need" video from a fellow YouTuber and inspiration of mine, Yannic Kilcher, covering this amazing paper.
Alright, so we know that they use transformers and GANs together to generate better and more realistic scenes, explaining the name of this paper: GANsformer. But why and how did they do that exactly?

As for the why, they did it to generate complex and realistic scenes, like this one, automatically. This could be a powerful application for many industries like movies or video games, requiring a lot less time and effort than having an artist create them on a computer or even build them in real life to take a picture of. Also, imagine how useful it could be for designers when coupled with text-to-image translation, generating many different scenes from a single text input and the press of a random button.
They use a state-of-the-art StyleGAN architecture because GANs are powerful generators when we talk about the overall image. Because GANs work with convolutional neural networks, they by nature use the local information of pixels, merging it to end up with general information about the image, while missing out on the long-range interactions between faraway pixels. This makes GANs powerful generators for the overall style of the image, but a lot less powerful regarding the quality of small details in the generated image, and, for the same reason, unable to control the style of localized regions within the generated image itself.
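A quick back-of-the-envelope calculation shows why plain convolutions struggle with long-range interactions. Assuming stride-1 3x3 kernels (an assumption made only for this illustration), the receptive field grows by just two pixels per layer:

```python
# Receptive field of a stack of stride-1 convolutions (illustrative).
def receptive_field(num_layers, kernel_size=3):
    return 1 + num_layers * (kernel_size - 1)

for n in (1, 5, 10, 20):
    print(n, "layers ->", receptive_field(n), "px receptive field")
# 1 -> 3 px, 5 -> 11 px, 10 -> 21 px, 20 -> 41 px
```

Real architectures grow the receptive field faster with strides and downsampling, but fine-grained, long-range coordination between distant pixels remains hard, which is exactly the gap attention is meant to close.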
This is why they had the idea to combine transformers and GANs in one architecture, which they call a bipartite transformer. As GPT-3 and many other papers have already proved, transformers are powerful for long-range interactions, drawing dependencies between distant elements and understanding the context of text or images. We can see that they simply added attention layers, the building block of transformer networks, in between the convolutional layers of both the generator and the discriminator. Thus, rather than focusing on global information and controlling all features globally, as convolutions do by nature, they use this attention to propagate information from the local pixels to the global high-level representation, and vice versa.
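A hedged sketch of what such an interleaved block might look like: a convolution handles local structure, then a latent-to-feature attention update handles global, region-based structure. This is a simplification under assumed shapes, not the paper's actual bipartite attention layers.

```python
# Sketch of interleaving attention with convolutions in a generator block
# (illustrative only; shapes and projections are assumptions).
import torch
import torch.nn as nn

class AttentionConvBlock(nn.Module):
    """Convolution for local information, followed by an attention update
    that lets every feature position read from the latent vectors."""
    def __init__(self, channels, latent_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_dim = nn.Linear(latent_dim, channels)  # project latents to feature width

    def forward(self, x, latents):
        x = torch.relu(self.conv(x))                   # local information
        b, c, h, w = x.shape
        feats = x.flatten(2).transpose(1, 2)           # (b, h*w, c)
        lat = self.to_dim(latents)                     # (b, k, c)
        attn = torch.softmax(feats @ lat.transpose(1, 2) / c ** 0.5, dim=-1)
        feats = feats + attn @ lat                     # global, latent-driven update
        return feats.transpose(1, 2).reshape(b, c, h, w)

block = AttentionConvBlock(channels=64, latent_dim=128)
x = torch.randn(2, 64, 16, 16)
latents = torch.randn(2, 8, 128)
out = block(x, latents)    # (2, 64, 16, 16)
```

Stacking such blocks is one way to picture "attention layers in between the convolutional layers": convolutions keep refining local detail while attention repeatedly shuttles information between the latents and the whole feature map.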
Like other transformers applied to images, this attention layer takes the pixels' positions and the StyleGAN2 latent spaces, W and Z. The latent space W is an encoding of the input into an intermediate latent space, done at the beginning of the network and denoted here as A, while the encoding Z is just the resulting features of the input at the current step of the network. This makes the generation much more expressive over the whole image, especially for generating images depicting multi-object scenes, which is the goal of this paper.
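For readers unfamiliar with StyleGAN's intermediate latent space: W is produced at the beginning of the network by running the initially sampled code through a small mapping MLP. Below is a minimal, assumed sketch of that mapping step; the layer count and sizes are arbitrary, not the values used in the paper.

```python
# StyleGAN-style mapping network sketch: initial code -> intermediate latent w.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, dim=128, num_layers=4):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)   # w: same shape, but in a learned, disentangled latent space

w = MappingNetwork()(torch.randn(4, 128))   # (4, 128) intermediate latents
```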
Of course, this was just an overview of this new paper by Facebook AI Research and Stanford University. I strongly recommend reading the paper to get a better understanding of the approach; it is the first link in the description below. The code is also available and linked in the description as well.

If you made it this far in the video, please consider leaving a like and commenting your thoughts. I will definitely read them and answer you. And since over 80 percent of you are still not subscribed, please consider clicking the free subscribe button to not miss any further news, clearly explained. Thank you for watching!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/03/07