This week we take a look at visual generative modeling: the goal is to generate a complete scene in high resolution, such as a road or a room, rather than a single face or a specific object. The task is similar to what StyleGAN does, but while StyleGAN uses the GAN in the traditional generative and discriminative way with convolutional neural networks, the GANsformer adds the transformer's attention mechanism inside the StyleGAN2 architecture.
Chapters:
0:00 - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
0:24 - Text-To-Image translation
0:51 - Examples
5:50 - Conclusion
Paper: https://arxiv.org/pdf/2103.01209.pdf
Code: https://github.com/dorarad/gansformer
Complete reference:
Drew A. Hudson and C. Lawrence Zitnick, Generative Adversarial Transformers, (2021), published on arXiv. Abstract:
"We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linearly efficiency, that can readily scale to high-resolution synthesis.
It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes.
In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network.
We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data efficiency.
Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer."
Note: This transcript is auto-generated by YouTube and may not be entirely accurate.
00:00
They basically leveraged transformers' attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful.

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and would like to stay up to date, please consider subscribing so you don't miss any further news.
00:24
Last week, we looked at DALL·E, OpenAI's most recent paper. It uses an architecture similar to GPT-3, involving transformers, to generate an image from text. This is a super interesting and complex task called text-to-image translation. As you can see here again, the results were surprisingly good compared to previous state-of-the-art techniques, mainly due to the use of transformers and a large amount of data.
00:51
This week, we will look at a very similar task called visual generative modeling, where the goal is to generate a complete scene in high resolution, such as a road or a room, rather than a single face or a specific object. This is different from DALL·E, since we are not generating the scene from text but from a model trained on a specific style of scenes, a bedroom in this case. Rather, it is just like StyleGAN, which is able to generate unique, non-existing human faces after being trained on a dataset of real faces.
The difference is that StyleGAN uses the GAN architecture in the traditional generative and discriminative way, with convolutional neural networks. A classic GAN architecture has a generator, trained to generate the image, and a discriminator, used to measure the quality of the generated images by guessing whether each image is a real one coming from the dataset or a fake one produced by the generator. Both networks are typically composed of convolutional neural networks. The generator looks like this: it mainly downsamples the image using convolutions to encode it, and then upsamples it again using convolutions to generate a new version of the image with the same style, based on that encoding, which is why it is called StyleGAN. The discriminator then takes the generated image, or an image from your dataset, and tries to figure out whether it is real or generated, called "fake".
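To make this classic generator/discriminator setup concrete, here is a minimal PyTorch sketch of a generic GAN. It is not the paper's or StyleGAN2's actual code (the official repository is linked above): the generator upsamples a latent code into an image with convolutions, and the discriminator downsamples an image into a single real/fake score. The layer sizes, latent dimension, and 64x64 resolution are illustrative assumptions.

# Minimal GAN sketch (illustrative only, not the GANsformer/StyleGAN2 code).
# Assumes 3x64x64 images and a 128-dimensional latent code.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a latent code z to an image with upsampling convolutions."""
    def __init__(self, latent_dim=128, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.ReLU(),  # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),         # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),          # 8x8 -> 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),           # 16x16 -> 32x32
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Tanh(),     # 32x32 -> 64x64
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    """Downsamples an image with convolutions and outputs a real/fake score."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 4, 2, 1), nn.LeakyReLU(0.2),      # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),            # 32x32 -> 16x16
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),           # 16x16 -> 8x8
            nn.Conv2d(128, 1, 8),                                     # 8x8 -> single logit
        )

    def forward(self, x):
        return self.net(x).view(-1)

G, D = Generator(), Discriminator()
z = torch.randn(4, 128)
fake = G(z)       # generated images: (4, 3, 64, 64)
score = D(fake)   # real/fake logits: (4,)

In training, the discriminator's logits on real and fake batches drive both losses: the discriminator learns to tell them apart, while the generator learns to fool it.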
Instead, they leverage transformers' attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful. Attention is an essential feature of this network, allowing it to draw global dependencies between input and output. In this case, the dependencies are between the input at the current step of the architecture and the latent code previously encoded, as we will see in a minute.
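As a concrete picture of what drawing such global dependencies can look like, here is a minimal sketch of scaled dot-product cross-attention, where every pixel's feature vector queries a small set of latent vectors. The shapes, names, and projection setup are illustrative assumptions, not the paper's exact formulation.

# Minimal cross-attention sketch: image features attend to a set of latents.
# Shapes and projections are illustrative assumptions.
import torch
import torch.nn.functional as F

def cross_attention(features, latents, w_q, w_k, w_v):
    """features: (B, N, C) flattened image features (N = H*W pixels).
    latents:  (B, K, D) latent vectors.
    w_q, w_k, w_v: projection matrices into a shared attention dimension A."""
    q = features @ w_q                                     # (B, N, A)
    k = latents @ w_k                                      # (B, K, A)
    v = latents @ w_v                                      # (B, K, A)
    scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5)   # (B, N, K)
    weights = F.softmax(scores, dim=-1)                    # each pixel weighs the latents
    return weights @ v                                     # (B, N, A): globally informed update per pixel

B, N, C, K, D, A = 2, 16 * 16, 64, 8, 32, 64
features = torch.randn(B, N, C)
latents = torch.randn(B, K, D)
w_q, w_k, w_v = torch.randn(C, A), torch.randn(D, A), torch.randn(D, A)
updated = cross_attention(features, latents, w_q, w_k, w_v)  # (2, 256, 64)

Because every pixel attends to every latent, the update at one location can depend on information aggregated from the whole image, which is exactly what plain convolutions struggle to do.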
Before diving into it, if you are not familiar with transformers or attention, I suggest you watch the video I made about transformers. For more details and a better understanding of attention, you should definitely have a look at the video "Attention Is All You Need" from a fellow YouTuber and inspiration of mine, Yannic Kilcher, covering this amazing paper.
Alright, so we know that they use transformers and GANs together to generate better and more realistic scenes, which explains the name of this paper: GANsformer. But why and how did they do that exactly? As for the why, they did it to generate complex and realistic scenes, like this one, automatically. This could be a powerful application for many industries like movies or video games, requiring a lot less time and effort than having an artist create them on a computer, or even build them in real life to take a picture of. Also, imagine how useful it could be for designers when coupled with text-to-image translation, generating many different scenes from a single text input and the press of a "random" button.
They use a state-of-the-art StyleGAN architecture because GANs are powerful generators for the overall image. Since GANs work using convolutional neural networks, they by nature use local information about the pixels, merging it to end up with general information about the image, and missing out on the long-range interactions between faraway pixels. This makes GANs powerful generators for the overall style of the image, but a lot less powerful regarding the quality of the small details in the generated image and, for the same reason, unable to control the style of localized regions within the generated image itself. This is why they had the idea to combine transformers and GANs in one architecture they called the bipartite transformer.
As GPT-3 and many other papers have already proved, transformers are powerful for long-range interactions, drawing dependencies between distant elements and understanding the context of text or images. We can see that they simply added attention layers, the basis of the transformer network, in between the convolutional layers of both the generator and the discriminator. Thus, rather than focusing on using global information and controlling all features globally, as convolutions do by nature, they use this attention to propagate information from the local pixels to the global high-level representation, and vice versa.
Like other transformers applied to images, this attention layer takes the pixels' positions and the StyleGAN2 latent spaces w and z. The latent space w is an encoding of the input into an intermediate latent space, done at the beginning of the network and denoted here as A, while the encoding z is simply the resulting features of the input at the current step of the network. This makes the generation much more expressive over the whole image, especially for generating images depicting multi-object scenes, which is the goal of this paper.
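Putting these pieces together, here is a rough sketch of how attention layers can be interleaved with the convolutional layers of a generator block, with the pixels attending to the latents and the latents attending back to the pixels, and with a learned positional embedding standing in for the pixels' positions. This is a simplified illustration under assumed shapes; the actual bipartite ("duplex") attention in the GANsformer repository differs in its details, for example in using multiplicative, style-like modulation.

# Sketch of a generator block interleaving convolution with bipartite attention:
# pixels <- latents, then latents <- pixels. Simplified; see the official repo
# for the real duplex attention with multiplicative (style-like) modulation.
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.pix_from_lat = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                  kdim=latent_dim, vdim=latent_dim,
                                                  batch_first=True)
        self.lat_from_pix = nn.MultiheadAttention(latent_dim, num_heads=4,
                                                  kdim=feat_dim, vdim=feat_dim,
                                                  batch_first=True)

    def forward(self, feats, latents):
        # feats: (B, H*W, C) image features with positional info; latents: (B, K, D)
        feats = feats + self.pix_from_lat(feats, latents, latents)[0]    # pixels <- latents
        latents = latents + self.lat_from_pix(latents, feats, feats)[0]  # latents <- pixels
        return feats, latents

class GeneratorBlock(nn.Module):
    def __init__(self, channels=64, latent_dim=32, size=16):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = BipartiteAttention(channels, latent_dim)
        # Learned positional embedding per pixel location (an assumption for this sketch).
        self.pos = nn.Parameter(torch.zeros(1, size * size, channels))

    def forward(self, x, latents):
        x = torch.relu(self.conv(x))                     # local, convolutional update
        B, C, H, W = x.shape
        feats = x.flatten(2).transpose(1, 2) + self.pos  # (B, H*W, C) plus positions
        feats, latents = self.attn(feats, latents)       # global, attention-based update
        return feats.transpose(1, 2).view(B, C, H, W), latents

block = GeneratorBlock()
x = torch.randn(2, 64, 16, 16)     # current image features at this step of the generator
latents = torch.randn(2, 8, 32)    # a set of latent variables (e.g. derived from w)
x, latents = block(x, latents)

Stacking such blocks lets the convolutions handle local detail while the attention passes information back and forth between the evolving image features and the latent variables, which is the core idea behind the bipartite structure.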
05:50
Of course, this was just an overview of this new paper by Facebook AI Research and Stanford University. I strongly recommend reading the paper to get a better understanding of the approach; it's the first link in the description below. The code is also available and linked in the description as well.

If you made it this far in the video, please consider leaving a like and commenting your thoughts; I will definitely read them and answer you. And since there are still over 80 percent of you who are not subscribed yet, please consider clicking the free subscribe button so you don't miss any further news, clearly explained. Thank you for watching!