I explain Artificial Intelligence terms and news to non-experts.
This week we take a look at visual generative modeling. The goal is to generate a complete scene in high resolution, rather than a single face image or object. The approach is similar to StyleGAN, except that where StyleGAN uses the GAN architecture in the traditional generator-and-discriminator way with convolutional neural networks, this paper adds the transformer's attention mechanism to it.
Chapters:
0:24 - Text-To-Image translation
0:51 - Examples
5:50 - Conclusion
Paper: https://arxiv.org/pdf/2103.01209.pdf
Code: https://github.com/dorarad/gansformer
Complete reference:
Drew A. Hudson and C. Lawrence Zitnick, Generative Adversarial Transformers, (2021), published on arXiv. Abstract:
"We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis.
It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes.
In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network.
We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data efficiency.
Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer."
Note: This transcript is auto-generated by YouTube and may not be entirely accurate.
they basically leveraged transformers'
attention mechanism in the powerful
StyleGAN2 architecture to make it even more
powerful
this is What's AI and i share artificial
intelligence news every week
if you are new to the channel and would
like to stay up to date please consider
subscribing to not miss any further news
last week we looked at DALL-E, OpenAI's
most recent paper
it uses a similar architecture as GPT-3
involving transformers to generate an
image from text
this is a super interesting and complex
task called
text to image translation as you can see
again here the results were surprisingly
good compared to previous
state-of-the-art techniques this is
mainly due to the use of transformers
and a large amount of data this week we
will look at a very similar task
called visual generative modelling where
the goal is to generate a
complete scene in high resolution such
as a road or a room
rather than a single face or a specific
object this is different from DALL-E
since we are not generating the scene
from a text but from a trained model
on a specific style of scenes which is a
bedroom in this case
rather it is just like StyleGAN that is
able to generate unique and non-existing
human faces
being trained on a data set of real
faces
the difference is that it uses this GAN
architecture in a traditional generative
and discriminative way
with convolutional neural networks a
classic GAN architecture will have a
generator
trained to generate the image and a
discriminator
used to measure the quality of the
generated images
by guessing if it's a real image coming
from the data set
or a fake image generated by the
generator
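To make the generator-versus-discriminator game a bit more concrete, here is a minimal sketch in plain Python. It assumes the textbook binary cross-entropy GAN losses; the video does not say which loss the paper actually uses, so treat this as an illustration only. The function names and the example probabilities are made up for this sketch.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for a single probability in (0, 1)."""
    eps = 1e-12  # keeps log() away from zero
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants to output 1 on real images and 0 on fakes."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants the discriminator to output 1 on its fakes."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs: the probability that an image is real.
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.9, 0.8])
```

In this toy setup the discriminator's loss is low when it separates real from fake and high when it is fooled, while the generator's loss moves the other way. Training alternates between the two.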
both networks are typically composed of
convolutional neural networks where the
generator
looks like this mainly composed of
downsampling the image using convolutions to
encode it
and then it upsamples the image again
using convolutions to generate a new
version
of the image with the same style based
on the encoding
which is why it is called StyleGAN then
the discriminator takes the generated
image or
an image from your data set and tries to
figure out whether it is real or
generated
called fake instead they leverage
transformers' attention mechanism
inside the powerful StyleGAN2
architecture to make it
even more powerful attention is an
essential feature of this network
allowing the network to draw global
dependencies between
input and output in this case it's
between the input at the current step of
the architecture
and the latent code previously encoded
as we will see in a minute
before diving into it if you are not
familiar with transformers or attention
i suggest you watch the video i made
about transformers
for more details and a better
understanding of attention
you should definitely have a look at the
video Attention Is All You Need
from a fellow YouTuber and inspiration
of mine Yannic
Kilcher covering this amazing paper
alright
so we know that they use transformers
and GANs together to generate better and
more realistic scenes
explaining the name of this paper
GANsformers
but why and how did they do that exactly
as for the why they did that to generate
complex and realistic scenes
like this one automatically this could
be a powerful application for many
industries like movies or video games
requiring a lot less time and effort
than having an
artist create them on a computer or even
make them
in real life to take a picture of it
also
imagine how useful it could be for
designers when coupled with text to
image translation generating many
different scenes from a single text
input
and pressing a random button they use a
state-of-the-art StyleGAN architecture
because GANs are powerful generators
when we talk about the general image
because GANs work using convolutional
neural networks
they are by nature using local
information of the pixels
merging them to end up with the general
information regarding the image
missing out on the long range
interaction of the faraway pixels
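As a rough illustration of why convolutions are local, the sketch below computes how many input pixels a single output value can "see" after a stack of convolution layers. It assumes plain convolutions with no dilation and the same kernel size and stride in every layer; the helper name is made up for this example and is not taken from the paper or its code.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Width, in input pixels, seen by one output value after stacked convolutions."""
    field = 1
    jump = 1  # distance in input pixels between neighbouring outputs
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump
        jump *= stride
    return field

one_layer = receptive_field(1)    # a single 3x3 convolution
ten_layers = receptive_field(10)  # ten stacked 3x3 convolutions, stride 1
```

Under these assumptions, ten stacked 3x3 layers still only cover a window 21 pixels wide, so two far-apart regions of a high-resolution image need many layers before they can influence each other. An attention layer connects them in a single step.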
for the same reason this causes GANs to
be powerful generators for the overall
style of the image
still they are a lot less powerful
regarding the quality of the small
details in the generated image
for the same reason being unable to
control the style of localized regions
within the generated image itself this
is why they had the idea to combine
transformers and GANs in one
architecture they called
bipartite transformer as GPT-3 and many
other papers already proved transformers
are powerful for long-range interactions
drawing dependencies between them and
understanding the context of text
or images we can see that they simply
added attention layers
which are the base of the transformer
network in between the convolutional
layers of both the generator and
discriminator
thus rather than focusing on using
global information and controlling
all features globally as convolutions do
by nature
they use this attention to propagate
information from the local pixels to the
global high level representation
and vice versa like other transformers
applied to images
this attention layer takes the pixel's
position and the StyleGAN2 latent
spaces W
and Z the latent space W is an encoding
of the input into an intermediate latent
space
done at the beginning of the network
denoted here
as A while the encoding Z is just the
resulting features of the input at the
current step of the network
this makes the generation much more
expressive over the whole image
especially in generating images
depicting multi-object
scenes which is the goal of this paper
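The idea of pixels exchanging information with a small set of latents can be sketched as follows. This is a heavily simplified illustration and not the paper's implementation: it leaves out the learned query, key, and value projections, shows only the latents-to-pixels direction, and uses a made-up `1 + mix` scaling to stand in for the multiplicative, style-like modulation mentioned in the abstract. All names and numbers are assumptions for the example.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bipartite_attention(pixels, latents):
    """Each pixel feature attends over all latents, then is modulated by the result.

    pixels:  list of n feature vectors (one per image position)
    latents: list of k latent vectors, with k much smaller than n
    Cost is n * k dot products, so it grows linearly with the number of pixels.
    """
    dim = len(latents[0])
    updated, weights = [], []
    for pixel in pixels:
        attn = softmax([dot(pixel, latent) / math.sqrt(dim) for latent in latents])
        # Weighted mix of the latents this pixel listens to.
        mix = [sum(w * latent[d] for w, latent in zip(attn, latents)) for d in range(dim)]
        # Multiplicative modulation instead of the usual additive update.
        updated.append([p * (1.0 + m) for p, m in zip(pixel, mix)])
        weights.append(attn)
    return updated, weights

# Toy example: three pixel features and two latents.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
latents = [[2.0, 0.0], [0.0, 2.0]]
updated, weights = bipartite_attention(pixels, latents)
```

In the toy example the first and third pixels put most of their attention on the first latent and the second pixel on the second latent, which hints at how different latents can end up controlling different regions or objects in a scene.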
of course this was just an overview of
this new paper by Facebook AI Research
and Stanford University
i strongly recommend reading the paper
to have a better understanding of this
approach it's the first link in the
description below
the code is also available and linked in
the description as well
if you went this far in the video please
consider leaving a like
and commenting your thoughts i will
definitely read them and answer you
and since there's still over 80 percent
of you guys that are not subscribed yet
please consider clicking this free
subscribe button
to not miss any further news clearly
explained
thank you for watching
