
# VOGUE by Google, MIT, and UW: The AI-Powered Online Fitting Room

### @whatsai Louis Bouchard

I explain Artificial Intelligence terms and news to non-experts.

Google used a modified StyleGAN2 architecture to create an online fitting room where you can automatically try on any pants or shirts you want using only an image of yourself.

References:

►Lewis, Kathleen M et al., (2021), VOGUE: Try-On by StyleGAN Interpolation Optimization, https://vogue-try-on.github.io/
►Interactive examples: https://vogue-try-on.github.io/demo_r...

Follow me for more AI content:

►Instagram: https://www.instagram.com/whats_ai/
►Medium: https://medium.com/@whats_ai

The best courses in AI:
https://www.omologapps.com/whats-ai

https://discord.gg/learnaitogether

Chapters:

0:00​ Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
0:40​ Paper explanation
1:47​ VOGUE's model
4:46​ More examples

## Video Transcript

google used a modified stylegan

00:02

architecture to create an online fitting

00:04

room where you can automatically try on

00:06

any pants or shirt you want

00:08

using only one image of yourself let's

00:11

see how they achieve that

00:12

and more impressive results

00:17

[Music]

00:21

this is what's ai and i share artificial

00:23

intelligence news every week

00:25

if you are new to the channel and want

00:27

to stay up to date please consider

00:28

subscribing to not miss any further news

00:31

a team of researchers from google mit

00:33

and the university of washington

00:35

recently published a paper called vogue

00:38

try-on by stylegan interpolation

00:40

optimization

00:42

they use a gan architecture to create an

00:44

online fitting room

00:45

where you can automatically try on any

00:47

pants or shirts you want

00:49

using only an image of yourself also

00:51

called

00:52

garment transfer the goal is to take the

00:54

clothes from a person in a picture and

00:56

transfer them to someone else

00:58

while conserving the correct body shape

01:00

hair and skin color

01:02

this is a complex task since some parts

01:05

like the garment of the output

01:07

image need to be extracted from one

01:09

image and the other parts proper to the

01:11

actual person

01:12

are taken from another picture keeping

01:14

the identity of the person where we want

01:16

to try

01:17

clothes on well they were able to do

01:20

exactly that using a gan-based

01:22

architecture

01:23

more precisely a pose-conditioned

01:26

stylegan2 is at the core of their

01:27

architecture

01:28

i won't go into the details of

01:30

stylegan2 and the gan architectures

01:32

since i've already explained them in

01:34

many videos like in this video

01:36

where i explain toonify which also uses

01:39

a stylegan2-based

01:40

architecture i definitely invite

01:42

you to watch this video before

01:44

continuing this one if you are not

01:45

familiar with gans or stylegan2.

01:48

so in order to work and generate

01:50

photorealistic images

01:52

with different outfits vogue needs to

01:54

train this pose-conditioned

01:56

stylegan2 architecture but this is

01:58

harder than simply implementing

02:00

stylegan2

02:01

since it was mainly developed for face

02:03

images which is where it got

02:05

its popularity from they had to make two

02:08

key modifications

02:09

at first they had to modify the

02:11

beginning of the generator

02:13

with an encoder that takes pose key

02:15

points of the image as inputs

02:17

this serves as the input of the first

02:20

4x4 style block

02:21

of stylegan2 instead of a constant

02:24

input to implement this pose condition
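To make the pose input more concrete, here is a minimal sketch of how pose keypoints can be turned into heatmaps, one channel per keypoint. This is only an illustration under stated assumptions: the function name `keypoints_to_heatmaps`, the Gaussian rendering, the `sigma` value, and the 4x4 grid (chosen to echo the first 4x4 style block) are hypothetical and not taken from the paper, where a learned encoder processes the pose representation.

```python
import math

def keypoints_to_heatmaps(keypoints, size=4, sigma=0.75):
    """Render one Gaussian heatmap per (x, y) keypoint on a size x size grid.

    Coordinates are normalized to [0, 1]. Returns a list of 2D lists,
    one channel per keypoint, each peaking at its keypoint location.
    """
    heatmaps = []
    for kx, ky in keypoints:
        # Keypoint position in pixel units (pixel centers at 0..size-1).
        cx, cy = kx * (size - 1), ky * (size - 1)
        channel = [
            [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)
        ]
        heatmaps.append(channel)
    return heatmaps

# Two hypothetical keypoints: top-left corner and bottom-right corner.
pose = [(0.0, 0.0), (1.0, 1.0)]
maps = keypoints_to_heatmaps(pose)
```

Each channel in `maps` is brightest where its keypoint sits and fades with distance, which is the kind of spatial signal that can replace a constant input.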

02:26

then they trained their stylegan2 to

02:28

output segmentations

02:30

at each resolution in addition to the

02:32

rgb image as you can see here

02:35

using this network they were able to

02:37

generate many images and their

02:38

segmentations

02:39

with desired poses following this given

02:42

an input pair of

02:43

images they could project the images

02:46

into the latent space of the generator

02:48

to compute the latent codes

02:50

that will best differentiate the

02:51

characteristics of the pair of input

02:53

images

02:54

using an optimizer to find the space of

02:56

combinations where lies the garment

02:59

from the second image and the person

03:01

from the first image

03:02

they had to maximize changes within the

03:05

region of interest while minimizing

03:07

changes

03:08

outside of the region of interest to do

03:10

that

03:11

they used two latent codes representing

03:13

the two input images

03:15

the first one from the image with the

03:17

person to be generated

03:19

and the second one from the image with

03:21

the garment to be transferred
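The idea of combining the two latent representations described above can be sketched as a per-channel interpolation between a person style vector and a garment style vector. This is a simplified illustration, assuming each channel has its own coefficient between 0 and 1; the names `interpolate_styles`, `person`, `garment`, and `q` are hypothetical, and in the actual method the coefficients are found by an optimizer rather than set by hand.

```python
def interpolate_styles(person_style, garment_style, q):
    """Blend two style vectors channel by channel.

    q[i] = 0 keeps the person's value, q[i] = 1 takes the garment's value,
    and values in between mix the two.
    """
    if not (len(person_style) == len(garment_style) == len(q)):
        raise ValueError("all vectors must have the same length")
    return [p + qi * (g - p)
            for p, g, qi in zip(person_style, garment_style, q)]

# Toy 4-channel style vectors (made-up numbers, for illustration only).
person = [0.2, -1.0, 0.5, 3.0]
garment = [1.0, 0.0, -0.5, 3.0]
q = [0.0, 1.0, 0.5, 1.0]

blended = interpolate_styles(person, garment, q)
print(blended)  # [0.2, 0.0, 0.0, 3.0]
```

Channels with a coefficient of 0 stay tied to the person, while channels with a coefficient of 1 follow the garment image, so the choice of coefficients decides which characteristics are transferred.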

03:23

as we saw they also needed the pose heat

03:25

map as

03:26

input to the stylegan2 generator shown

03:29

here again

03:30

03:32

segmentations and images generated from

03:35

the trained gan architecture

03:36

following this they used a loss function

03:39

composed of

03:40

three separate terms that each optimized

03:42

a part of the generated image

03:45

there's the editing localization loss

03:47

term

03:48

that encourages the network to only

03:50

interpolate styles

03:51

within the region of interest defined

03:54

here as m

03:55

using the segmentation outputs then

03:58

there's the garment loss used to

04:00

transfer over the correct shape and

04:02

texture of the garments

04:03

using embeddings from a very popular

04:05

convolutional neural network

04:07

architecture called

04:08

vgg16 they compute the distance between

04:10

the garment

04:11

areas of the two images using again the

04:14

segmentation labels

04:16

this created mask is then applied to the

04:18

generated rgb images

04:20

finally there's the identity loss which

04:23

guides the network to

04:24

as it says preserve the identity of the

04:26

person

04:27

this is again done using the

04:28

segmentation labels following the same

04:31

procedure as the garment loss
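The garment and identity terms described above both compare images inside a region given by a segmentation mask. Here is a minimal sketch of that idea, with strong simplifications: raw pixel values stand in for the vgg16 embeddings, the localization term is left out, and the names `masked_l2`, `total_loss`, and the weights are hypothetical rather than taken from the paper.

```python
def masked_l2(a, b, mask):
    """Mean squared difference between a and b over pixels where mask is 1."""
    total, count = 0.0, 0
    for row_a, row_b, row_m in zip(a, b, mask):
        for va, vb, m in zip(row_a, row_b, row_m):
            if m:
                total += (va - vb) ** 2
                count += 1
    return total / count if count else 0.0

def invert(mask):
    """Flip a binary mask so it selects everything outside the region."""
    return [[1 - m for m in row] for row in mask]

def total_loss(output, person, garment, garment_mask, weights=(1.0, 1.0)):
    """Weighted sum of a garment term and an identity term."""
    w_garment, w_identity = weights
    # Inside the garment region, the output should match the garment image.
    garment_term = masked_l2(output, garment, garment_mask)
    # Outside the garment region, the output should match the person image.
    identity_term = masked_l2(output, person, invert(garment_mask))
    return w_garment * garment_term + w_identity * identity_term

# Toy 2x2 "images": the bottom row is the garment region.
person_img = [[1.0, 1.0], [1.0, 1.0]]
garment_img = [[5.0, 5.0], [5.0, 5.0]]
mask = [[0, 0], [1, 1]]
ideal = [[1.0, 1.0], [5.0, 5.0]]

print(total_loss(ideal, person_img, garment_img, mask))       # 0.0
print(total_loss(person_img, person_img, garment_img, mask))  # 16.0
```

The loss is zero only when the output keeps the person outside the mask and takes the garment inside it, which is the trade-off the separate terms are meant to balance.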

04:33

just take a second to look at how these

04:35

losses affect the output image

04:37

you can clearly see when the

04:39

localization loss or the identity loss

04:41

is missing

04:42

and their importance

04:45

as they state our method can synthesize

04:48

the same style shirt for varied poses

04:51

and body shapes by fixing the style

04:54

vector

04:54

we present several different styles in

04:56

multiple poses

04:59

just look at how much better the results

05:01

are with this new approach

05:03

of course this was just an overview of

05:05

this new paper

05:07

i strongly invite you to read their

05:09

paper for a better technical

05:11

understanding

05:12

it is the first link in the description

05:14

please leave a like if you went this far

05:16

in the video

05:17

and since there's over 80 percent of you

05:19

guys that are not subscribed yet

05:21

consider subscribing to the channel to

05:23

not miss any further news

05:25

thank you for watching

05:38

[Music]