Significant progress has been made in Face Recognition in recent years. This review offers a brief overview of key tasks, models, and solution methods, focusing on the evolution of loss functions.
Simply put, Face Recognition is a method of identifying or verifying a person’s identity using photos, videos, or real-time footage. This review will explore identification based on a single digital image or video frame.
Face recognition (FR) has far-reaching applications. It’s used in the financial sector, in cybersecurity, video surveillance, smart home services, multi-factor authentication, etc.
Beyond these practical uses, FR models also play a key role in modern generative models. They’re commonly used for identification loss in Face Restoration models like GFPGAN and CodeFormer, Face Swapping tools such as SimSwap and FaceShifter, Image-to-Image GAN-based models like pSp and HyperStyle, as well as in Transformer-based and Stable Diffusion models for identity preservation.
ArcFace (2018–2019) is the most widely used identification loss function, while CosFace (2018) and FaceNet are used much less frequently.
For this review, I’ll focus on how the FR landscape has changed since ArcFace, particularly in recent years.
Face recognition requires some pre-processing: face detection, cropping, and alignment. Preprocessing should be the same for both training and test data, usually using FFHQ-like alignment (Flickr-Faces-HQ Dataset). Typically, two separate additional detectors are used for this: a face bounding-box detector and a face landmark detector. There are also end-to-end models whose alignment module is trained jointly with the main model, but I do not consider them in this part of the review. Here, we assume that the training and test datasets are uniformly cropped and aligned, so the model is fed cropped and aligned inputs.
Typical FR preprocess pipeline
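To make the alignment step concrete, here is a minimal sketch (my own illustration, not tied to any particular library's pipeline): five landmarks from a landmark detector are mapped onto the canonical ArcFace-style 112×112 template with a similarity transform.

```python
import cv2
import numpy as np

# Canonical 5-point destination template (eyes, nose tip, mouth corners)
# commonly used for ArcFace-style 112x112 face crops.
ARCFACE_TEMPLATE = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def align_face(image, landmarks):
    """Warp a face to the canonical 112x112 crop.

    landmarks: (5, 2) array produced by a face landmark detector.
    """
    # Estimate a similarity transform (rotation + uniform scale + shift).
    M, _ = cv2.estimateAffinePartial2D(np.asarray(landmarks, np.float32),
                                       ARCFACE_TEMPLATE)
    return cv2.warpAffine(image, M, (112, 112))
```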
In the training dataset for the FR task, there are several images for each identity (person). The model's task is to learn to distinguish between photos belonging to the same person and photos of different people.
The model usually consists of two components:
Backbone. A backbone, which can also be called a feature extractor, takes a preprocessed face photo as input and outputs a feature vector of embeddings. Classic backbones are convolutional neural networks (CNN) such as ResNet, VGGNet, ResFace, SE-ResNet and others. These can also be VisionTransformer or Feature Pyramid Network models or their more complex variations. We will not dwell on the backbones of models in detail in this part of the review.
Loss function. At the training stage, a loss function is applied to supervise the backbone training. The goal of training is to obtain a model that will produce close embeddings for different photos of the same person and distant ones for different persons faces. We are talking about measuring the distance between embedding vectors using, for example, cosine distance or L2 distance.
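To make the verification step concrete, here is a minimal sketch (assuming a `backbone` that maps a preprocessed crop to an embedding vector; the threshold value is a placeholder that would be tuned on a validation set):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def same_person(backbone, img_a, img_b, threshold=0.4):
    """img_a, img_b: preprocessed face crops of shape (1, 3, 112, 112)."""
    emb_a = F.normalize(backbone(img_a), dim=1)  # L2-normalized embeddings
    emb_b = F.normalize(backbone(img_b), dim=1)
    cos_sim = (emb_a * emb_b).sum(dim=1)         # cosine similarity
    return cos_sim.item() > threshold
```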
Loss functions for face recognition fall into two broad categories. The first is «pair-based loss», sometimes called «metric learning-based methods»: Contrastive loss, Triplet loss, N-pairs loss.

These methods either assemble the positive and negative sample pairs before model training or form them dynamically online during training. Both modes allow extracting meaningful facial representations at the sample level, but the number of possible pairs grows explosively with the data size.

The training scheme using triplet loss looks like this: two examples with the same label should have their embeddings close together in the embedding space, while two examples with different labels should have their embeddings far apart.
Triplet loss scheme
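In code, the standard triplet loss is only a few lines (a generic sketch of the formulation, here with squared L2 distances):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor, positive: embeddings of the same identity;
    negative: an embedding of a different identity. Shape (B, D)."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared L2 distance
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    # Positives must be at least `margin` closer than negatives.
    return F.relu(d_pos - d_neg + margin).mean()
```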
The rapid growth of the number of possible pairs with dataset size forces us to look for pair-selection strategies, which are usually empirical and computationally expensive.
Another category is called «classification-based loss» or sometimes called «prototype learning-based methods»: Softmax loss, CosFace, ArcFace, NormFace. They work with generalized information about classes using a prototype, also referred to as a class proxy or class center. Prototypes are learnable parameters, updated during the model training. Currently, classification-based losses are mainly used for face recognition models.
If we treat the FR task as classification, we can use softmax loss (also known as categorical cross-entropy loss). In essence, softmax loss is a softmax activation function followed by cross-entropy loss.
Loss scheme
Let's recall the formulas. The first one is the softmax activation, the second is the cross-entropy loss:

$$p_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}, \qquad L_{CE} = -\sum_{j=1}^{n} y_j \log p_j$$

Combining the two, we get the softmax loss:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}}$$
The loss function receives the result of the last fully connected layer, where x_i denotes the embedding feature of the i-th training image, y_i is the label of x_i, and W denotes the weights of the last fully connected layer.
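In PyTorch terms, this is simply a linear classification layer followed by cross-entropy (a minimal sketch of the notation above; the dimensions are placeholders):

```python
import torch
import torch.nn as nn

num_classes, emb_dim = 10_000, 512
fc = nn.Linear(emb_dim, num_classes)      # W and b of the last layer
criterion = nn.CrossEntropyLoss()         # softmax activation + cross-entropy

x = torch.randn(32, emb_dim)              # embeddings x_i from the backbone
y = torch.randint(0, num_classes, (32,))  # labels y_i
loss = criterion(fc(x), y)                # the softmax loss
```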
This works, but there is an issue: the boundaries between classes are blurred. A new step in FR was taken in 2018 with the advent of the ArcFace model. The basis remains the softmax loss, but we move on to considering the angles between vectors. Let us recall the cosine similarity formula:

$$\cos\theta_j = \frac{W_j^{T} x_i}{\lVert W_j \rVert \, \lVert x_i \rVert}$$
Let's make a substitution in the softmax loss formula: after normalizing the weights and features (and dropping the bias), only the angles remain, rescaled by a factor s:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{s\cos\theta_j}}$$

Next, a margin m is added so that intra-class angles become smaller and inter-class angles larger. This gives a gap between classes instead of the blurry boundaries of softmax loss:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}}$$
Similar methods exist: if we replace cos(θ + m) with cos θ − m, we get the CosFace loss.
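Both margins fit in one short module (a condensed sketch of the published formulas, not the reference implementations; s and m are the usual scale and margin hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmax(nn.Module):
    """ArcFace-style cos(theta + m) or CosFace-style cos(theta) - m logits."""
    def __init__(self, emb_dim, num_classes, s=64.0, m=0.5, mode="arcface"):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.s, self.m, self.mode = s, m, mode

    def forward(self, x, labels):
        # Normalizing weights and features leaves only the angle theta.
        cos = F.linear(F.normalize(x), F.normalize(self.W)).clamp(-1, 1)
        if self.mode == "arcface":
            target = torch.cos(torch.acos(cos) + self.m)  # cos(theta + m)
        else:
            target = cos - self.m                         # cos(theta) - m
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(one_hot, target, cos)  # margin on y_i only
        return F.cross_entropy(self.s * logits, labels)
```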
This is where the history of modern loss functions for FR begins. Over the years, many modifications and improvements have appeared, but the formulas given above are enough to understand the further material.
One of the improvements appeared in 2020: Sub-center ArcFace, designed for noisy datasets. The intra-class compactness constraint leads to overfitting on noisy data. Sub-center ArcFace introduces sub-centers within each class: a sample in a training batch should be close to one of the positive sub-centers, not to all of them. This reduces the influence of noise in the data.
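The key change relative to plain ArcFace is small enough to sketch (my own condensed illustration): each class gets K learnable sub-centers, and the per-class logit is the maximum similarity over them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterLogits(nn.Module):
    """Per-class cosine logits with K sub-centers per class."""
    def __init__(self, emb_dim, num_classes, k=3):
        super().__init__()
        # K sub-centers per class instead of a single prototype.
        self.W = nn.Parameter(torch.randn(num_classes, k, emb_dim))

    def forward(self, x):
        x = F.normalize(x)                       # (B, D)
        W = F.normalize(self.W, dim=-1)          # (C, K, D)
        cos = torch.einsum("bd,ckd->bck", x, W)  # (B, C, K)
        # A sample only needs to be close to its nearest sub-center.
        return cos.max(dim=-1).values            # (B, C)
```

The resulting cosine logits then go through the same additive angular margin and cross-entropy as in plain ArcFace.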
Both ArcFace and Sub-center ArcFace models have implementations inside the insightface library, including code for training and pretrained weights.
Insightface has an implementation of ArcFace with different backbones: iresnet (34,50,100,200,2060), mobilefacenet, vit (VisionTransformer).
Consideration of different backbones is beyond the scope of this article, so I will only provide the names of the backbones used with each of the losses under consideration. In most cases, the authors of the losses did not try to select the optimal backbone, but simply used one of the popular ones or the one that was used in the models with which they wanted to make a comparison.
The datasets MS1M, Glint360K, WebFace42M were used for training.
The main challenge for face recognition methods is data noise. Prototype learning-based methods are sensitive to the prototype biases that noise introduces. One way to balance between overfitting and underfitting is to adjust the margin, the main parameter in softmax-based losses.
AdaCos (2019) was one of the first methods to adaptively adjust the scale and angular margin for cosine-based softmax losses such as L2-softmax, CosFace, and ArcFace.

It implements the empirical principle that learning should slow down as the network approaches the optimum. The article introduces a modulating variable equal to the median of the angles to the corresponding classes over the mini-batch, which roughly represents the current degree of model optimization. When the median angle is large, the network parameters are far from optimal and a larger scale and margin are applied, and vice versa.
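A simplified sketch of the dynamic scale (following the paper's idea; the exact batch statistics in the paper are more involved, and this is not the reference code):

```python
import math
import torch

@torch.no_grad()
def adacos_scale(cos, labels, prev_scale):
    """Recompute the adaptive scale from mini-batch statistics.

    cos: (B, C) cosine logits; labels: (B,) ground-truth classes."""
    idx = torch.arange(len(labels))
    theta = torch.acos(cos.clamp(-1, 1))
    theta_med = theta[idx, labels].median()   # median angle to true classes
    # Average over the batch of the summed non-target exponential logits.
    mask = torch.ones_like(cos, dtype=torch.bool)
    mask[idx, labels] = False
    B_avg = (torch.exp(prev_scale * cos) * mask).sum(dim=1).mean()
    # A large median angle -> far from optimum -> a larger scale is applied.
    return torch.log(B_avg) / torch.cos(torch.clamp(theta_med,
                                                    max=math.pi / 4))
```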
The changing process of angles in each mini-batch during training
Trained on the CASIA-WebFace and MS1M datasets with an input resolution of 144 × 144. Tested on the LFW, MegaFace, and IJB-C datasets and compared with the L2-softmax, CosFace, and ArcFace losses.
Over the past years, several landmark methods have emerged for applying adaptive margin in FR, such as Dyn-ArcFace (2022), MagFace (2021), ElasticFace (2021), but we will focus on one of the latest works in this area – X2-Softmax (2023).
Compared to AdaCos, X2-Softmax tries to account for the uneven distribution of classes. A fixed margin that is suitable between some classes may be too large for convergence between other classes, or too small to promote significant intra-class compactness of face features between yet others.
For classes with large angles between them, a large margin is needed to increase compactness; for classes with small angles, a smaller one suffices.
Let's recall the general formula for softmax-based losses:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f(\theta_{y_i})}}{e^{f(\theta_{y_i})} + \sum_{j\neq y_i} e^{s\cos\theta_j}}$$

For losses such as ArcFace or CosFace, only the logit function f(θ) differs. For the X2-Softmax loss function it is quadratic:

$$f(\theta) = a(\theta - h)^2 + k$$
Traditional softmax-based losses use the cosine, and the Taylor expansion of cosine is quadratic up to second order: cos θ = 1 − θ²/2 + O(θ⁴). So a quadratic function is chosen for X2-Softmax: discarding the high-order terms of θ and retaining the constant and quadratic terms helps avoid overfitting.
Here a, h, and k are hyperparameters: h and k determine the vertex position of the logit curve, and a determines the direction of the curve's opening and the degree of clustering.

In X2-Softmax, as the angle θ between class weights increases, the angular margin ∆θ monotonically increases with it.
For two more similar classes, a small margin facilitates the convergence of the model. For two less similar classes, a larger margin will be assigned to enhance inter-class separation of face features.
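Given the description above, the target-class logit function can be sketched as a plain quadratic (the hyperparameter values below are placeholders, not the paper's settings):

```python
import torch

def x2_softmax_logit(theta, a=-0.5, h=0.0, k=1.0):
    """Quadratic replacement for cos(theta) in a softmax-based loss.

    (h, k) is the vertex of the parabola; a < 0 opens it downward, so the
    logit falls off as the angle to the class prototype grows."""
    return a * (theta - h) ** 2 + k
```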
For training, the authors chose the ResNet50 backbone. The model was trained on the MS1Mv3 dataset (MS-Celeb-1M preprocessed with RetinaFace to remove noisy images): 93K identities and 5.1M face images.
Most losses with flexible margins stay within the softmax-based family, but there are exceptions. SFace (2022) abandons softmax-based losses but retains the idea of optimizing intra-class and inter-class distances. The model imposes intra-class and inter-class constraints on the hypersphere manifold, controlled by two sigmoid curves. The curves reshape gradients by controlling the rate at which the coefficients change as embeddings approach the centroid of the target or a foreign class.
Compared to direct margin optimization methods, this provides a finer balance between overfitting and underfitting, with less influence of individual noisy samples on the final loss.
The idea of constraining face embeddings to make them discriminative on a hypersphere manifold has appeared before, for example in SphereFace (Deep hypersphere embedding for face recognition, 2017).
The aim is to decrease the intra-class distance and increase the inter-class distance, so the sigmoid-constrained hypersphere loss can be formulated as

$$L_{SFace} = -[r_{intra}(\theta_{y_i})]_b \cos\theta_{y_i} + \sum_{j\neq y_i} [r_{inter}(\theta_j)]_b \cos\theta_j$$
where θ_yi is the angular distance between the embedding feature of the i-th training image and the corresponding prototype, and θ_j is the angular distance to the foreign prototypes.

The functions r_intra and r_inter are designed to re-scale the intra-class and inter-class objectives respectively and to control the degree of optimization. [·]_b is the block-gradient operator, which prevents its inputs from contributing to the gradient computation.
The authors chose sigmoid functions as the gradient re-scale functions:

$$r_{intra}(\theta_{y_i}) = \frac{s}{1 + e^{-k(\theta_{y_i} - a)}}, \qquad r_{inter}(\theta_j) = \frac{s}{1 + e^{k(\theta_j - b)}}$$
Here s is the upper asymptote of the two sigmoid curves and sets the initial scale of the gradient, and k controls the slope of the sigmoid curves. The hyperparameters a and b set the horizontal intercepts of the two curves and effectively control the flexible interval that suppresses the moving speed.
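Putting the pieces together (a sketch that follows the formulation above; the block-gradient operator becomes `detach()` in PyTorch, and the hyperparameter values are placeholders):

```python
import torch

def sface_loss(cos, labels, s=64.0, k=80.0, a=0.9, b=1.2):
    """cos: (B, C) cosine logits between embeddings and class prototypes."""
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # angles in radians
    idx = torch.arange(len(labels))
    # Sigmoid re-scale curves; detach() blocks their gradients ([.]_b).
    r_intra = (s / (1 + torch.exp(-k * (theta[idx, labels] - a)))).detach()
    r_inter = (s / (1 + torch.exp(k * (theta - b)))).detach()
    mask = torch.ones_like(cos, dtype=torch.bool)
    mask[idx, labels] = False                 # keep only foreign classes
    intra = -(r_intra * cos[idx, labels]).mean()
    inter = ((r_inter * cos) * mask).sum(dim=1).mean()
    return intra + inter
```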
Compared with softmax-based loss functions, both the intra-class and inter-class distances of SFace are constrained to a designed degree and can therefore be optimized in a moderate way, which is exactly the advantage of SFace.
For training, the authors chose the ResNet backbone (as in ArcFace).
The model was trained on the CASIA-WebFace, VGGFace2 and MS-Celeb-1M datasets.
Another way to deal with noisy data is to consider that the embedding for one identity (for all faces belonging to one person) is not a point in space but rather a distribution that has an expectation, a variance, and may have outliers.
In face recognition, pair-based losses were largely abandoned due to the complexity of training, but by working with averaged prototypes we lose some information. With the prototype-based approach, training can get stuck in local minima or overfit due to the influence of outliers on the prototypes.
VPL (2021) represents each class as a distribution rather than a point in the latent space.
VPL optimizes the similarity between examples from a training set and a set of variational prototypes that are sampled from a class-wise distribution.
The distribution of prototypes is stored in a memory M and decays over ∆t steps. The authors trained the loss with ResNet50 and ResNet100 backbones (implemented in MXNet). The MS1M dataset is used for training; the input size of face crops is 112×112.
There are several approaches that continue the theme of complementing prototype-based methods with the advantages of pair-based (also called sample-to-sample based) losses, such as UniTSFace (2023) or UNPG (Unified Negative Pair Generation toward Well-discriminative Feature Space for Face Recognition, 2022). I will focus on one of the newest losses in this article: EPL.
In margin-based softmax losses, the loss is computed against prototypes (class centers): all samples of one class are pulled toward a common center during training. This center is effectively an average formed during training, and it is strongly influenced by outlier examples that can shift the prototype away from the class center. In softmax-based methods, the prototype is considered to be stored in the weight matrix of the last linear layer, i.e., P_i = W_i; the prototype is updated via its gradient during backpropagation, and the loss function maximizes the similarity between the features of the examples and their corresponding prototypes.
In EPL (2024), empirical prototypes are generated and updated as

$$P^{e}_{y_i} \leftarrow (1-\alpha)\,P^{e}_{y_i} + \alpha\,x_i, \qquad \alpha = \sigma\big(s(x_i, P_{y_i})\big)$$

where α is an adaptive updating coefficient generated using the feature x and its prototype, σ is an activation function that adjusts the updating coefficient into an appropriate range, and s(·, ·) is a similarity function, usually taken to be cosine.
The empirical prototype is updated using only "positive" examples, to avoid the influence of outliers from neighboring classes.
Training process: the encoder extracts the features, adaptive coefficients α are calculated to update the empirical prototypes, and the similarities between the features and the prototypes are used to calculate the loss for encoder training.
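A sketch of the prototype-update step under the formulation above (the memory layout and names are my assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_empirical_prototypes(emp_proto, x, labels):
    """emp_proto: (C, D) tensor of empirical prototypes (a memory buffer,
    not a learnable parameter); x: (B, D) features from the encoder."""
    x_n = F.normalize(x)
    p_n = F.normalize(emp_proto[labels])
    # alpha = sigma(s(x, P)): cosine similarity squashed into (0, 1).
    alpha = torch.sigmoid((x_n * p_n).sum(dim=1, keepdim=True))
    # Only positive samples update their own class prototype.
    emp_proto[labels] = F.normalize((1 - alpha) * emp_proto[labels]
                                    + alpha * x_n)
```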
As is clear from the above, loss modifications are mostly used to address noisy data and overfitting, while the backbone is responsible for the "complexity" of the model. But there are exceptions.
This article introduces the transformer-metric loss: a combination of a standard metric loss and a transformer loss (a transformer network used as an additive loss). Transformer networks have the strength to preserve sequential spatial relationships, which increases the discriminative power of the loss function and allows the model to be applied to more complex cases (for example, age-invariant FR).
The peculiarity of this model is that the transformer is not used as a backbone, as is usual (for example, in the Face Transformer model). Instead, the features from the last convolutional layer are sent to two loss branches. The first branch is a regular flattening layer followed by a metric loss (in this case ArcFace, but it could be any classification-based loss).
In branch 2, we take the output of size H × W × D and transform it into S = H × W vectors of size 1 × 1 × D. This sequence can be viewed as a sequence of patch embeddings for a standard transformer encoder. After the transformer encoder layer, a linear layer is applied without any additional activation or dropout. The cross-entropy function then evaluates the loss on the output probability distribution (over the target N classes). The "branch-1" and "branch-2" losses are combined via a weighted sum.
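A structural sketch of the branch-2 head (shapes, pooling, and hyperparameters here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class TransformerLossBranch(nn.Module):
    """Treats the H*W spatial positions of the final CNN feature map as a
    token sequence, encodes it, and classifies with cross-entropy."""
    def __init__(self, d=512, num_classes=10_000, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.fc = nn.Linear(d, num_classes)  # no extra activation or dropout

    def forward(self, feat):                    # feat: (B, H, W, D)
        tokens = feat.flatten(1, 2)             # (B, S = H*W, D)
        out = self.encoder(tokens).mean(dim=1)  # pool the token sequence
        return self.fc(out)                     # logits for the branch-2 loss
```

The total loss would then be a weighted sum along the lines of `loss = w1 * arcface_loss + w2 * branch2_ce_loss`.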
In this review, we focused on one area of face recognition systems: loss functions. This allowed us to survey the new directions and recent articles in this area. All these directions continue to develop every year.
The following topics were left out of this part of the review: