Google used a modified StyleGAN2 architecture to create an online fitting room where you can automatically try on any pants or shirt you want, using only an image of yourself.
In this video, I explain more about this project, titled VOGUE, and how it works.
References:
►Lewis, Kathleen M et al., (2021), VOGUE: Try-On by StyleGAN Interpolation Optimization, https://vogue-try-on.github.io/
►Interactive examples: https://vogue-try-on.github.io/demo_r...
Follow me for more AI content:
►Instagram: https://www.instagram.com/whats_ai/
►LinkedIn: https://www.linkedin.com/in/whats-ai/
►Twitter: https://twitter.com/Whats_AI
►Facebook: https://www.facebook.com/whats.artifi...
►Medium: https://medium.com/@whats_ai
The best courses in AI:
►https://www.omologapps.com/whats-ai
Join Our Discord channel, Learn AI Together:
►https://discord.gg/learnaitogether
Chapters:
0:00 Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
0:40 Paper explanation
1:47 VOGUE's model
4:46 More examples
Transcript:
Google used a modified StyleGAN2 architecture to create an online fitting room where you can automatically try on any pants or shirt you want, using only one image of yourself. Let's see how they achieved that, and look at more impressive results.
This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and want to stay up to date, please consider subscribing to not miss any further news.
A team of researchers from Google, MIT, and the University of Washington recently published a paper called VOGUE: Try-On by StyleGAN Interpolation Optimization.
They use a GAN architecture to create an online fitting room where you can automatically try on any pants or shirt you want, using only an image of yourself. Also called garment transfer, the goal is to take the clothes from a person in one picture and transfer them to someone else, while preserving the correct body shape, hair, and skin color.
This is a complex task, since some parts of the output image, like the garment, need to be extracted from one image, while the other parts, those proper to the actual person, are taken from another picture, keeping the identity of the person we want to try the clothes on. Well, they were able to do exactly that using a GAN-based architecture. More precisely, a pose-conditioned StyleGAN2 is at the core of their architecture.
I won't go into the details of StyleGAN2 and GAN architectures, since I've already explained them in many videos, like the one where I explain Toonify, which also uses a StyleGAN2-based architecture. I definitely invite you to watch that video before continuing this one if you are not familiar with GANs or StyleGAN2.
So, in order to work and generate photorealistic images with different outfits, VOGUE needs to train this pose-conditioned StyleGAN2 architecture. But this is harder than simply implementing StyleGAN2, since StyleGAN2 was mainly developed for face images, which is where it got its popularity. They had to make two key modifications.
First, they modified the beginning of the generator with an encoder that takes the pose keypoints of the image as inputs. Its output serves as the input of the first 4x4 style block of StyleGAN2, instead of the usual constant input, implementing the pose conditioning.
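To make that first modification concrete, here is a minimal PyTorch sketch of such a pose encoder: it maps pose-keypoint heatmaps down to the 4x4 tensor that replaces StyleGAN2's learned constant input. The keypoint count, layer widths, and heatmap resolution are my assumptions for illustration, not the authors' code.

```python
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Encodes pose-keypoint heatmaps into the 4x4 feature tensor that
    replaces StyleGAN2's learned constant input (sketch; sizes assumed)."""
    def __init__(self, n_keypoints=17, channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_keypoints, 64, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),           # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),          # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, channels, 3, stride=2, padding=1),     # 8x8 -> 4x4
        )

    def forward(self, pose_heatmaps):   # (B, 17, 64, 64) keypoint heatmaps
        return self.net(pose_heatmaps)  # (B, 512, 4, 4), fed to the first style block
```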
Second, they trained their StyleGAN2 to output segmentations at each resolution, in addition to the RGB image, as you can see here. Using this network, they were able to generate many images and their segmentations in the desired poses.
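You can picture this second modification as a small segmentation head sitting next to each of StyleGAN2's ToRGB skips. A tiny sketch, assuming a fixed number of garment and body classes (the real heads and class count may differ):

```python
import torch.nn as nn

class ToSeg(nn.Module):
    """1x1 convolution mapping a style block's features to per-pixel
    segmentation logits, mirroring StyleGAN2's ToRGB branch (sketch)."""
    def __init__(self, in_channels, n_classes=8):  # n_classes is an assumption
        super().__init__()
        self.proj = nn.Conv2d(in_channels, n_classes, kernel_size=1)

    def forward(self, features):
        # One head per resolution; logits are upsampled and accumulated
        # across resolutions just like the RGB skip connections.
        return self.proj(features)
```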
Following this, given an input pair of images, they could project the images into the latent space of the generator to compute the latent codes that best capture the characteristics of the pair of input images, using an optimizer to search the space of combinations for the one where the garment comes from the second image and the person from the first image.
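In code, this interpolation-optimization step might look like the sketch below: per-layer coefficients mix the two projected latent codes, and only those coefficients are optimized while the generator stays frozen. The generator call and the real loss terms are stubbed out; the shapes, learning rate, and parameterization are assumptions.

```python
import torch

n_layers, latent_dim = 16, 512
w_person = torch.randn(n_layers, latent_dim)   # projected code of the person image
w_garment = torch.randn(n_layers, latent_dim)  # projected code of the garment image

q_logits = torch.zeros(n_layers, 1, requires_grad=True)  # one mixing coefficient per layer
opt = torch.optim.Adam([q_logits], lr=0.05)

for step in range(100):
    q = torch.sigmoid(q_logits)                 # keep each coefficient in [0, 1]
    w = (1 - q) * w_person + q * w_garment      # per-layer interpolation of the codes
    # rgb, seg = generator(w, pose_heatmap)     # frozen pose-conditioned StyleGAN2
    # loss = localization + garment + identity  # the three terms described below
    loss = (w - w_garment).pow(2).mean()        # placeholder objective so the sketch runs
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The sigmoid parameterization is one simple way to keep each mixing coefficient between the two codes; it is a design choice for the sketch, not necessarily the paper's.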
They had to maximize changes within the region of interest while minimizing changes outside of it. To do that, they used two latent codes representing the two input images: the first one from the image with the person to be generated, and the second one from the image with the garment to be transferred.
As we saw, they also needed the pose heatmap as input to the StyleGAN2 generator, shown here again in grey. They then had access to the segmentations and images generated by the trained GAN architecture.
Following this, they used a loss function composed of three separate terms, each optimizing a part of the generated image.
There's the editing localization loss term, which encourages the network to only interpolate styles within the region of interest, defined here as M, using the segmentation outputs.
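One plausible way to express that idea is a masked difference that penalizes any change outside the region M; the paper's exact formulation differs, so treat this as illustration only:

```python
def localization_loss(generated, person_image, roi_mask):
    # roi_mask is 1 inside the garment region M (derived from the generated
    # segmentations) and 0 elsewhere; changes outside M are penalized.
    return ((1 - roi_mask) * (generated - person_image)).abs().mean()
```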
Then there's the garment loss, used to transfer the correct shape and texture of the garments. Using embeddings from a very popular convolutional neural network architecture called VGG-16, they compute the distance between the garment areas of the two images, again using the segmentation labels. The resulting mask is applied to the generated RGB images.
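Here is a sketch of such a masked perceptual distance with torchvision's VGG-16; the feature layer (up to relu3_3) and the plain squared distance are my assumptions, not the paper's exact recipe:

```python
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def garment_loss(generated, garment_image, mask_gen, mask_garment):
    # Compare VGG-16 features of the masked garment regions of both images.
    f_gen = vgg[:16](generated * mask_gen)         # features up to relu3_3
    f_ref = vgg[:16](garment_image * mask_garment)
    return (f_gen - f_ref).pow(2).mean()
```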
Finally, there's the identity loss, which, as its name says, guides the network to preserve the identity of the person. This is again done using the segmentation labels, following the same procedure as the garment loss.
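So the identity term mirrors the garment term, only masked to the person's own regions, and the three terms are summed for the optimization loop above. A sketch, reusing the vgg features from the garment-loss sketch, with the weighting being an assumption:

```python
def identity_loss(generated, person_image, person_mask):
    # Same masked VGG-16 distance as garment_loss, but over the regions
    # belonging to the person (hair, skin, face) instead of the garment.
    f_gen = vgg[:16](generated * person_mask)
    f_ref = vgg[:16](person_image * person_mask)
    return (f_gen - f_ref).pow(2).mean()

# Total objective for the interpolation optimization (weights are assumptions):
# loss = 1.0 * localization + 1.0 * garment + 1.0 * identity
```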
Just take a second to look at how these losses affect the output image. You can clearly see their importance when the localization loss or the identity loss is missing.
As they state: "Our method can synthesize the same style shirt for varied poses and body shapes. By fixing the style vector, we present several different styles in multiple poses."
Just look at how much better the results are with this new approach. Of course, this was just an overview of this new paper. I strongly invite you to read their paper for a better technical understanding; it is the first link in the description. Please leave a like if you made it this far in the video, and since over 80 percent of you are not subscribed yet, consider subscribing to the channel to not miss any further news. Thank you for watching!