Gender and Race Change on Your Selfie with Neural Nets

Written by johnkorn | Published 2017/10/31
Tech Story Tags: machine-learning | deep-learning | image-processing | gans | super-resolution


Today I will tell you how you can change your face in a photo using a complex pipeline with several generative adversarial networks (GANs). You've probably seen a bunch of popular apps that convert your selfie into a female face or an old man. They do not use deep learning all the way because of two main issues:

  • GAN processing is still heavy and slow
  • The quality of classical CV methods is good enough for production

Still, the proposed method has some potential, and the work described below is a proof of concept that GANs are applicable to this type of task.

The pipeline for converting your photo may look like this:

  1. Detect and extract the face from the input image
  2. Transform the extracted face in the desired way (convert it into a female, Asian, etc.)
  3. Upscale/enhance the transformed face
  4. Paste the transformed face back into the original image

Each of these steps can be solved with a separate neural network, though it doesn't have to be. Let's walk through this pipeline step by step.

Face Detection

This is the easiest part. You can simply use something like dlib.get_frontal_face_detector() (example). The default face detector provided by dlib uses linear classification on HOG features. As shown in the example below, the resulting rectangle may not fit the whole face, so it is better to extend it by some factor in each dimension.

Detected face with dlib’s default detector

By tuning these factors by hand you may end up with the following code:
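A rough sketch of such detection-plus-extension code with dlib and OpenCV; the margin factors here are illustrative assumptions, not the exact values used:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def extract_face(image, w_margin=0.3, top_margin=0.5, bottom_margin=0.1):
    # Run the HOG-based detector; the second argument upsamples the image once.
    rects = detector(image, 1)
    if not rects:
        return None
    r = rects[0]
    h, w = image.shape[:2]
    fw, fh = r.width(), r.height()
    # Extend the detected rectangle by hand-tuned factors in each direction,
    # clamping to the image borders.
    left = max(0, int(r.left() - w_margin * fw))
    right = min(w, int(r.right() + w_margin * fw))
    top = max(0, int(r.top() - top_margin * fh))
    bottom = min(h, int(r.bottom() + bottom_margin * fh))
    return image[top:bottom, left:right]

face = extract_face(cv2.imread("selfie.jpg"))
```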

and with the following result:

Extended face rectangle

If for any reason you're not satisfied with the performance of this old-school method, you can try SOTA deep learning techniques. Any object detection architecture (e.g. Faster R-CNN or YOLOv2) can handle this task easily.

Face Transformation

This is the most interesting part. As you probably know, GANs are pretty good at generating and transforming images, and there are lots of models named like <prefix of your choice>-GAN. The problem of transforming images from one subset (domain) into another is called domain transfer. And the domain transfer network of my choice is Cycle-GAN.

Cycle-GAN

Why Cycle-GAN? Because it works. And because it's really easy to get started with. Visit the project website for application examples. You can convert paintings to photos, zebras to horses, pandas to bears or even faces to ramen (how crazy is that?!).

Applications of Cycle-GAN (pic. from original paper)

To get started you just need to prepare two folders with images from your two domains (e.g. male photos and female photos), clone the authors' repo with the PyTorch implementation of Cycle-GAN, and start training. That's it.

How it works

This figure from the original paper gives a concise and complete description of how the model works. I love the idea, since it is simple and elegant, and it leads to great results.

In addition to the GAN loss and the cycle-consistency loss, the authors also add an identity mapping loss. It acts as a regularizer and encourages the model not to change images that already come from the target domain. E.g. if the input to the Zebra generator is an image of a zebra, it shouldn't be transformed at all. This additional loss helps preserve the colors of the input images (see the figure below).
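As a rough sketch of how these three terms combine for one direction (X to Y, with generator G, inverse generator F and discriminator D_Y); the LSGAN-style adversarial term and the loss weights follow common defaults and are assumptions here:

```python
import torch
import torch.nn.functional as F_loss

def generator_losses_x_to_y(G, F, D_Y, real_x, real_y,
                            lambda_cyc=10.0, lambda_idt=5.0):
    """One direction (X -> Y) of the Cycle-GAN generator objective;
    the Y -> X direction is symmetric."""
    fake_y = G(real_x)

    # GAN loss: G tries to make D_Y score fake_y as real (label 1).
    pred_fake = D_Y(fake_y)
    gan_loss = F_loss.mse_loss(pred_fake, torch.ones_like(pred_fake))

    # Cycle-consistency loss: F(G(x)) should reconstruct x.
    cycle_loss = F_loss.l1_loss(F(fake_y), real_x)

    # Identity mapping loss: feeding a target-domain image to G
    # should leave it (almost) unchanged.
    identity_loss = F_loss.l1_loss(G(real_y), real_y)

    return gan_loss + lambda_cyc * cycle_loss + lambda_idt * identity_loss
```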

Network Architectures

The generator networks contain two stride-2 convolutions to downsample the input twice, several residual blocks, and two fractionally strided convolutions for upsampling. ReLU activations and instance normalization are used in all layers.
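A condensed PyTorch sketch of such a generator, following the 128x128 configuration with 6 residual blocks as I understand it from the paper; the exact filter counts are assumptions:

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3),
            nn.InstanceNorm2d(dim), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3),
            nn.InstanceNorm2d(dim))

    def forward(self, x):
        return x + self.block(x)  # residual connection

def build_generator(n_blocks=6, ngf=64):
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(3, ngf, 7),
              nn.InstanceNorm2d(ngf), nn.ReLU(True)]
    # Two stride-2 convolutions downsample the input twice.
    for mult in (1, 2):
        layers += [nn.Conv2d(ngf * mult, ngf * mult * 2, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(ngf * mult * 2), nn.ReLU(True)]
    # Several residual blocks at the bottleneck resolution.
    layers += [ResnetBlock(ngf * 4) for _ in range(n_blocks)]
    # Two fractionally strided (transposed) convolutions upsample back.
    for mult in (4, 2):
        layers += [nn.ConvTranspose2d(ngf * mult, ngf * mult // 2, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(ngf * mult // 2), nn.ReLU(True)]
    layers += [nn.ReflectionPad2d(3), nn.Conv2d(ngf, 3, 7), nn.Tanh()]
    return nn.Sequential(*layers)
```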

A 3-layer fully-convolutional network (FCN) is used as the discriminator. This classifier does not have any fully-connected layers, so it accepts input images of any size. The FCN architecture was first introduced in the paper Fully Convolutional Networks for Semantic Segmentation, and this type of model has become quite popular.

The fully-convolutional discriminator maps the input to several feature maps and then decides whether the image is real or fake. This can be interpreted as extracting a number of patches from the input and classifying each of them as real or fake. The size of the patches (i.e. the size of the receptive field) is controlled by the number of layers in the network. This type of discriminator is also called a PatchGAN. In this post Phillip Isola explains the magic behind these networks; look it through for a better understanding. E.g. a 1-layer PatchGAN looks at 16x16 patches, while a 5-layer network has a 286x286 receptive field. The discriminator used in the Cycle-GAN paper has a 70x70 receptive field, with three 4x4 convolutional layers followed by batch normalization and LeakyReLU activations.
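A sketch of a PatchGAN-style discriminator of this kind; the 3 downsampling layers match the description above, while the filter counts and the use of instance normalization (which the training section below says replaced batch norm) are assumptions:

```python
import torch.nn as nn

def build_patch_discriminator(in_channels=3, ndf=64, n_layers=3):
    """A small fully-convolutional classifier: its output is a grid of
    real/fake scores, one per overlapping patch of the input."""
    layers = [nn.Conv2d(in_channels, ndf, 4, stride=2, padding=1),
              nn.LeakyReLU(0.2, True)]
    ch = ndf
    for _ in range(n_layers - 1):
        layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2),
                   nn.LeakyReLU(0.2, True)]
        ch *= 2
    # One score per spatial location, i.e. per patch of the receptive field.
    layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
    return nn.Sequential(*layers)
```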

Training Cycle-GAN

Let's try to solve the task of converting a male photo into a female one and vice versa. To do this we need datasets with male and female images. The CelebA dataset is perfect for our needs: it is available for free, it has 200k images and 40 binary labels like Gender, Eyeglasses, WearingHat, BlondeHair, etc.

CelebA dataset

This dataset has 90k photos of men and 110k photos of women. That's well enough for our DomainX and DomainY. The average size of a face in these images is not really big, just 150x150 pixels, so we resized all extracted faces to 128x128, keeping the aspect ratio and padding with a black background. A typical input to our Cycle-GAN looks like this:

128x128 preprocessed input image
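A minimal sketch of that resize-and-pad preprocessing, assuming OpenCV; centering the face on the black canvas is my assumption:

```python
import cv2
import numpy as np

def to_square_128(face, size=128):
    """Resize a face crop so its longer side equals `size`, keeping the aspect
    ratio, and paste it onto a black square canvas."""
    h, w = face.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(face, (int(w * scale), int(h * scale)))
    canvas = np.zeros((size, size, 3), dtype=face.dtype)
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas
```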

Perceptual Loss

In our setting we changed the way the identity loss is calculated. Instead of using a per-pixel loss, we used style features from a pretrained VGG-16 network. And that is quite reasonable, imho: if you want to preserve image style, why calculate a pixel-wise difference when you have layers responsible for representing the style of an image? This idea was first introduced in the paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution and is widely used in style transfer tasks. This small change led to an interesting effect that I'll describe later.
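A sketch of what such a feature-based identity loss could look like with torchvision's pretrained VGG-16. The particular layer (relu2_2), the Gram-matrix style representation and the L1 distance are illustrative choices, not necessarily the ones used in the original experiments:

```python
import torch.nn as nn
from torchvision import models

def gram_matrix(feat):
    # Channel-by-channel correlations: a classic "style" representation.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

class VGGStyleIdentityLoss(nn.Module):
    def __init__(self, layer_index=8):  # relu2_2 in torchvision's VGG-16
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:layer_index + 1]
        for p in vgg.parameters():
            p.requires_grad = False  # the loss network stays frozen
        self.vgg = vgg.eval()

    def forward(self, generated, target):
        return nn.functional.l1_loss(gram_matrix(self.vgg(generated)),
                                     gram_matrix(self.vgg(target)))
```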

Training

Well, the overall model is quite huge: we train 4 networks simultaneously, inputs are passed through them several times to calculate all the losses, and all gradients must be propagated as well. One epoch of training on 200k images on a GeForce 1080 takes about 5 hours, so it's hard to experiment a lot with different hyper-parameters. Substituting the identity loss with the perceptual one was the only change from the original Cycle-GAN configuration in our final model. PatchGANs with fewer or more than 3 layers did not show good results. Adam with betas=(0.5, 0.999) was used as the optimizer. The learning rate started at 0.0002 with a small decay on every epoch. The batch size was 1, and instance normalization was used everywhere instead of batch normalization.

One interesting trick worth noting: instead of feeding the discriminator the latest output of the generator, a buffer of 50 previously generated images is kept, and a random image from that buffer is passed to the discriminator, so the D network sees images from earlier versions of G (a sketch is given below). This useful trick is one among others listed in this wonderful note by Soumith Chintala; I recommend always keeping this list in front of you when working with GANs. We did not have time to try all of them, e.g. LeakyReLU and alternative upsampling layers in the generator, but the tips on setting and controlling the training schedule for the generator-discriminator pair really added some stability to the learning process.
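A sketch of that replay buffer (the official Cycle-GAN code calls it an image pool; the 50% chance of swapping in an old image follows that implementation and is an assumption here):

```python
import random

class ImageBuffer:
    """Stores up to `size` previously generated fakes and, half of the time,
    hands the discriminator an old fake instead of the newest one."""
    def __init__(self, size=50):
        self.size = size
        self.images = []

    def query(self, image):
        if len(self.images) < self.size:
            self.images.append(image.detach().clone())
            return image
        if random.random() < 0.5:
            idx = random.randrange(self.size)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()
            return old
        return image

# Usage: the discriminator is trained on buffer.query(fake_B), not on fake_B.
```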

Experiments

Finally, we get to the examples section.

Training generative networks is a bit different from training other deep learning models: you will not often see decreasing loss and increasing accuracy plots. Estimating how well your model is doing is done mostly by visually looking through the generators' outputs. A typical picture of a Cycle-GAN training process looks like this:

The generator losses diverge and the other losses slowly go down, but nevertheless the model's output is quite good and reasonable. By the way, to get such visualizations of the training process we used visdom, an easy-to-use open-source tool maintained by Facebook Research. On each iteration the following 8 pictures were shown:

  • real_A — input from domain A
  • fake_B — real_A converted by generator A->B
  • rec_A — reconstructed image, fake_B converted by generator B->A
  • idty_B — the identity mapping: real_A passed through generator B->A
  • and the 4 corresponding images for the conversion from domain B to A

After 5 epochs of training you could expect the model to produce quite good images. Look at the example below: the generators' losses are not decreasing, but still, the female generator manages to convert the face of a man who looks like G. Hinton into a woman. How could it?!

Sometimes things could go really bad:

In this case just press Ctrl+C and call a reporter to claim that you’ve “just shut down AI”.

In summary, despite some artifacts and low resolution, we can say that Cycle-GAN handles the task very well. Here are some samples.

Male <-> Female

White <-> Asian

White <-> Black

More examples with celebrities:

And now an extreme case:

Not bad, huh? Did you notice an interesting effect on the identity images? If we used the original identity loss, idty_A and idty_B would simply be equal to their original images. But with the perceptual loss, the identity mapping now learns the strongest features of each domain and amplifies them in the input. That's why men become more mature and women get brighter skin and more makeup. The effect increases if you pass an image through the network several times. That's a ready-to-use beautification app. Check it out:

Shut up and give me your money!

Image superresolution

Images produced by Cycle-GAN have a low resolution, so it's better to upscale and enhance them. The problem of increasing an image's resolution is called superresolution. Plenty of research has been done in this field, and I want to point out two state-of-the-art deep learning models capable of solving the image superresolution task: SRResNet and EDSR.

SRResNet

In the paper Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network the authors propose a generative network for superresolution based on the ResNet architecture. The objective is an MSE loss between the target image and the superresolved one, with two additional terms: a discriminator loss and a perceptual loss on VGG features (see, everyone's doing it!).

The generator uses residual blocks with 3x3 convolutions, batch normalization and parametric ReLU. Sub-pixel convolutions are used for upsampling.

The discriminator uses 8 convolutions with 3x3 kernels and LeakyReLU activations. Downsampling is done by strided convolutions (no pooling layers). Classification is done by the two last fully-connected layers with a sigmoid activation at the end.

SRResNet architecture
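A compact sketch of the generator's building blocks as described above (a residual block and a sub-pixel upsampling block); the 64-filter width is a common setting taken here as an assumption:

```python
import torch.nn as nn

class SRResBlock(nn.Module):
    """Residual block: 3x3 conv -> BN -> PReLU -> 3x3 conv -> BN, plus skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class SubPixelUp(nn.Module):
    """Sub-pixel convolution: conv to 4x channels, then PixelShuffle(2)."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU())

    def forward(self, x):
        return self.body(x)
```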

EDSR

The Enhanced Deep Super-Resolution network (EDSR) is similar to SRResNet, with a few improvements:

  • No batch normalization is used. This saves up to 40% of memory and allows increasing the number of layers and filters.
  • No ReLU is used outside the residual blocks.
  • The output of each residual block is scaled by 0.1 before being added to the identity (skip) connection, which helps stabilize the training process (see the sketch below).
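A sketch of such a residual block, with the 0.1 residual scaling and no batch normalization; the filter count is illustrative:

```python
import torch.nn as nn

class EDSRBlock(nn.Module):
    """EDSR-style residual block: no batch norm, ReLU only inside the block,
    and the residual branch scaled by 0.1 before the skip addition."""
    def __init__(self, ch=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)
```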

Training

To train SR networks we need a dataset of high-resolution images. We scraped several thousand images with the hashtag #face from Instagram. Training is usually done on small patches instead of full images; this helps the generator deal with small, fine details. In evaluation mode full-sized images are passed, which is possible thanks to the fully-convolutional nature of the network. In practice EDSR, which is supposed to be an enhancement of SRResNet, did not show better results and was slower to train, so in our final pipeline we used SRResNet trained on 64x64 patches with a perceptual loss and no discriminator. Below are some examples from the training set.
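A sketch of how low/high-resolution training pairs might be cut from those photos; the 64x64 patch size comes from the text, while the 4x scale factor and bicubic downsampling are assumptions:

```python
import random
import cv2

def random_sr_pair(hr_image, patch=64, scale=4):
    """Cut a random high-res patch and build its low-res counterpart by
    bicubic downsampling (the downsampling method is an assumption)."""
    h, w = hr_image.shape[:2]
    y = random.randint(0, h - patch)
    x = random.randint(0, w - patch)
    hr = hr_image[y:y + patch, x:x + patch]
    lr = cv2.resize(hr, (patch // scale, patch // scale),
                    interpolation=cv2.INTER_CUBIC)
    return lr, hr
```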

And that is how this network performed on our artificial images. Not perfect, but OK for now.

Pasting image back into the original

Even this task can be done via deep learning: I've found an interesting paper on image blending, GP-GAN: Towards Realistic High-Resolution Image Blending. Instead of diving into the details, I'll just show you this figure from the paper.

But we implemented a simple and straightforward solution: the transformed face is pasted into the original image with transparency increasing towards its edges (a sketch of this blending is shown below). The results look like this:
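A minimal sketch of that kind of feathered paste-back; the linear ramp width is an assumed parameter:

```python
import numpy as np

def paste_with_feather(background, face, top, left, feather=20):
    """Alpha-blend `face` into `background` at (top, left). The alpha mask is
    1 in the center and ramps linearly down to 0 at the patch edges."""
    h, w = face.shape[:2]
    ramp_y = np.minimum(np.arange(h), np.arange(h)[::-1]) / feather
    ramp_x = np.minimum(np.arange(w), np.arange(w)[::-1]) / feather
    alpha = np.clip(np.minimum.outer(ramp_y, ramp_x), 0.0, 1.0)[..., None]
    roi = background[top:top + h, left:left + w].astype(np.float32)
    blended = alpha * face.astype(np.float32) + (1 - alpha) * roi
    background[top:top + h, left:left + w] = blended.astype(background.dtype)
    return background
```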

Conclusion

The result seems to be OK, but it is not production-ready yet. A lot of work remains in the field of compressing generative networks to make them light and fast. This includes experiments with knowledge distillation, factorization, quantization, all that stuff. Then the whole pipeline could be deployed as a mobile app. A lot of interesting work ahead…

