Synthetic Data in Face Recognition: A Game Changer or Just Hype?

Face recognition (FR) technology has advanced significantly in recent years, driven by the need for enhanced security and the proliferation of applications across industries such as low-end consumer devices, aircraft boarding, border control, and financial services. At the heart of effective FR systems lies a crucial component: data. Large-scale datasets are essential for training these models to accurately identify and verify faces in a variety of conditions. For FR to be reliable, models must be exposed to diverse data that includes variations in demographics, lighting, environments, expressions, and occlusions. This ensures robustness and fairness in deployment, reducing the risk of bias or failure when encountering unfamiliar conditions. Synthetic datasets created using genAI techniques can potentially help, but in their current state, they can’t fully replace real-world datasets. This article explores the advantages and disadvantages of synthetic FR datasets and investigates the current state of genAI for face recognition.

Face Data Acquisition: Real World vs Synthetic

LFW, CFP-FP, AgeDB-30, CA-LFW, and CP-LFW are some of the most widely used datasets for evaluating the verification performance of FR models. Table 1 displays the verification performance of an ML model trained with the same algorithm on real-world face datasets of different sizes. It shows how dataset size affects model performance, and the scale at which data acquisition must take place to obtain robust FR models. Verification means the model is given a pair of face images and predicts whether the pair belongs to the same person or to two different people. The percentage of pairs the model classifies correctly is reported as the verification accuracy.
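To make the verification metric concrete, here is a minimal sketch of how such an accuracy figure could be computed. It assumes the FR model has already turned each face image into an embedding vector, and that a pair is called "same person" when the cosine similarity of the two embeddings reaches a fixed threshold. The function names, the threshold of 0.5, and the toy embeddings are all hypothetical and chosen for illustration; real benchmark protocols generally pick the threshold more carefully.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def verification_accuracy(pairs, threshold=0.5):
    """Percentage of pairs whose same/different prediction matches the label.

    Each pair is (embedding_a, embedding_b, same_person). The fixed
    threshold is an assumption made for this sketch.
    """
    correct = 0
    for emb_a, emb_b, same_person in pairs:
        predicted_same = cosine_similarity(emb_a, emb_b) >= threshold
        correct += int(predicted_same == same_person)
    return 100.0 * correct / len(pairs)


# Toy 2-D embeddings (made-up values, not real model outputs).
pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),   # similar vectors, same person
    ([0.0, 1.0], [1.0, 0.0], False),  # orthogonal vectors, different people
    ([1.0, 1.0], [1.0, 0.9], True),   # similar vectors, same person
    ([1.0, 0.0], [0.0, 1.0], True),   # same person, but the embeddings disagree
]

accuracy = verification_accuracy(pairs)
print(f"verification accuracy: {accuracy:.1f}%")
```

The last toy pair is labeled as the same person but has dissimilar embeddings, so it counts as an error. This is the kind of mistake a model makes when it has not seen enough variation for an identity.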
| Dataset Name  | ML Model  | # Training Images | LFW   | CFP-FP | AgeDB-30 | CA-LFW | CP-LFW |
|---------------|-----------|-------------------|-------|--------|----------|--------|--------|
| CASIA-WebFace | ResNet-50 | 500k              | 99.55 | 95.31  | 94.55    | 93.78  | 89.95  |
| WebFace12M    | ResNet-50 | 12 million        | 99.80 | 99.20  | 98.10    | --     | --     |
| Glint360K     | ResNet-50 | 17 million        | 99.83 | 99.33  | 98.55    | 96.21  | 94.78  |

Table 1. Verification accuracies (%) on five different FR benchmarks. For a fair comparison, all results are obtained from original published works using the same ML model and algorithm.

In addition to a large-scale training dataset, it is equally important that the dataset contains minimal biases. It is important to first understand what bias means in the context of FR. In general, for a machine learning model, bias refers to the model not behaving uniformly across different types of input data. An FR model can be biased in different ways. The most common example is ethnicity bias, where an FR model tends to perform poorly when presented with faces of a particular ethnicity. However, this is not the only bias that needs to be countered to obtain reliable FR models. Age bias, gender bias, and environmental bias (face coverings, facial hair, etc.) are other examples of how an FR model can exhibit bias. These biases can be minimized by collecting and including representative samples in the dataset used to train the FR model.

Acquiring photos of people of different ethnicities, photos of the same person taken ten to fifteen years apart, or photos of a person against different backgrounds, in varied lighting conditions, and with different facial expressions can prove to be a difficult task. In addition, collecting real-world data for FR presents numerous other challenges. Acquiring such large-scale, diverse data from across the world is costly. Apart from cost and technical limitations, data acquisition is increasingly difficult due to ethical and privacy concerns.
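Returning briefly to the definition of bias given above (a model not behaving uniformly across different types of input data), one simple way to make it measurable is to report accuracy separately for each group and look at the spread. The sketch below assumes each evaluation result has already been tagged with a group label; the group names and outcomes are invented for illustration and do not come from any real benchmark.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """Map each group label to its accuracy in percent.

    `results` is an iterable of (group, is_correct) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in results:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


# Hypothetical evaluation outcomes for two demographic groups.
results = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", True), ("group_b", False),
]

per_group = accuracy_by_group(results)
# A large gap between the best and worst group is one sign of bias.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"gap: {gap:.1f} points")
```

A single headline accuracy would hide this difference, which is why representative samples for every group matter in both training and evaluation data.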
Biometric data is governed by laws such as Europe’s GDPR (General Data Protection Regulation), California’s CCPA (California Consumer Privacy Act), and Illinois’ BIPA (Biometric Information Privacy Act), to name a few. These laws govern the acquisition and storage of the biometric data of their respective residents, which adds further complexity to large-scale biometric data acquisition. Given the growing demand for FR applications, now is a crucial time to explore the viability of synthetic data, examining its benefits and drawbacks for developing scalable, ethical, and legally compliant face recognition systems.

These challenges, coupled with the rise of Generative AI (genAI), have motivated a large amount of research into creating synthetic data to replace sensitive real-world biometric data. Before diving into the current state of synthetic data in FR, it is essential to understand what genAI means. In simple terms, genAI is a type of artificial intelligence that can create new content, such as text, images, or music, based on the data it has been trained on. The generated data is called ‘synthetic data’.

GenAI for face recognition is particularly enticing for multiple reasons. Most notably, synthetic datasets are generated by AI, meaning that researchers, engineers, and enthusiasts can build (and train on) datasets without undergoing the manual process of obtaining images from real individuals. Many of the compliance requirements in the collection and use of real image datasets are not present for synthetic data, and, theoretically, biases that may arise in an algorithm trained on real image data could be better accounted for with synthetic data.

However, synthetic face datasets are not yet a silver bullet. The following sections cover where synthetic datasets shine, where they fall short, and the current state of genAI for face recognition.

Advantages of Synthetic Data in Face Recognition

Synthetic data offers several advantages that make it a valuable tool in the development of face recognition technology. One of the primary benefits is that synthetic datasets do not require obtaining images of real people. Because synthetic data does not directly use real personal data, privacy compliance requirements such as consent for use and the right to be forgotten do not arise. Generating synthetic data can also be more cost-effective than collecting and annotating vast amounts of real-world data, which is a manual, time-consuming, and expensive process even before counting the time and resources spent ensuring that such a dataset is legally and ethically compliant.

Synthetic data allows for the creation of controlled environments where specific variables can be manipulated, aiding in the testing and fine-tuning of face recognition models. Furthermore, synthetic data makes it easier to create and obtain large datasets, especially in situations where real-world data is scarce, difficult to collect, or where legal requirements and ethical considerations make such collection untenable.

GenAI methods can also be used to supplement an existing real-world dataset, filling in gaps to reduce biases, demographic or otherwise. As an example, many of the publicly released large-scale face datasets consist predominantly of Caucasian identities, which causes a demographic bias in ML models trained on such data. In principle, such gaps can be filled with synthetic samples of the under-represented groups.

Current Limitations of Synthetic Data in Face Recognition

For the image domain, Generative Adversarial Networks (GANs) are one of the most popular models used to generate data. Nvidia’s StyleGAN and StyleGAN2 have done wonders in generating synthetic face images that are virtually indistinguishable from real faces.
Researchers behind Microsoft’s DigiFace-1M, Kim et al.’s DiscoGAN, Tencent’s SynFace, and Michigan State University’s DCFace, among others, have made considerable progress in generating synthetic datasets for face recognition and have demonstrated positive results on real-world data. However, all these techniques have limitations in terms of cost, time, or the number of unique identities that can be generated, and their performance is not up to par with models trained on real-face datasets.

Theoretically, a synthetic dataset with “real-looking” faces and controlled, diverse attributes for ethnicity, gender, pose, lighting, and background variation should outperform a real “in the wild” dataset. Why, then, does the performance of models trained on these datasets still fall short of models trained on real-world datasets of the same size? The answer lies in the uncontrolled features of the real-world data itself. The magnitude of variation in real data has not been fully captured by any published research so far. Having the same limited number of variations for all synthetic identities in the dataset hurts model performance. Attempting to increase the variations causes the identity of the face to change as well, which introduces noise in the data and again hurts model performance.

The Current State of Synthetic Face Datasets

Table 2 lists the performance of the same FR model architecture (ResNet-50) trained on different synthetic datasets. A baseline performance for a model trained on an authentic dataset of roughly the same size is also listed. The table also lists the year of release for each synthetic dataset.

| Dataset Name               | ML Model  | # Training Images | LFW   | CFP-FP | AgeDB-30 | CA-LFW | CP-LFW |
|----------------------------|-----------|-------------------|-------|--------|----------|--------|--------|
| CASIA-WebFace (real world) | ResNet-50 | 500k              | 99.55 | 95.31  | 94.55    | 93.78  | 89.95  |
| SynFace (2021)             | ResNet-50 | 500k              | 91.93 | 75.03  | 61.63    | 74.73  | 70.43  |
| DigiFace-1M (2022)         | ResNet-50 | 500k              | 95.40 | 87.40  | 76.97    | 78.62  | 78.87  |
| DCFace (2023)              | ResNet-50 | 500k              | 98.55 | 85.33  | 89.70    | 91.60  | 82.62  |

Table 2. Verification accuracies (%) on widely used FR evaluation datasets achieved by models trained on synthetic data. The first row is the baseline performance achieved by the model on similar-sized real-world data. All results are obtained from original published works using the same ML model and algorithm.

As can be seen in Table 2, models trained on synthetic data do not perform as well as models trained on real-world data. The performance gap on “simple” and small datasets like LFW is small: the best synthetic result (DCFace, 98.55) is within one percentage point of the baseline (99.55). The gap is more prominent on tougher datasets like CFP-FP and AgeDB-30, which contain profile views of faces and faces of the same person at different ages, respectively. On CFP-FP, for instance, the best synthetic result (DigiFace-1M, 87.40) trails the baseline (95.31) by almost eight points. Notably, the performance of models trained on synthetic data has improved in recent years.

Validating the effectiveness of synthetic data remains a challenge. Ensuring that synthetic data accurately represents real-world conditions is crucial for building reliable face recognition systems. However, the process of validation is complex and requires robust methodologies to ensure the data's quality and applicability.

A possible solution is to develop a genAI model that can also mimic the uncontrolled features of real-world data in synthetic data. A generative model can be trained to overcome these limitations by training it on a real-world dataset that contains ample variation in facial attributes, image quality, and background. It is reasonable to question where such data might come from. Such data acquisition would face all the aforementioned constraints, namely ethical, legal, and cost restrictions. However, these are mitigated by the smaller dataset size required to train generative models. Nvidia’s StyleGAN2, for example, generates realistic face images even though it was trained on only 70,000 images, and that dataset does not contain information about the identity of the faces.
These images were not collected with FR in mind, and neither was the model trained for that purpose, which is why models trained on synthetic FR datasets generated by StyleGAN2 do not match real-world performance.

Conclusion

Synthetic data holds promise for advancing face recognition technology, but it's essential to recognize its current limitations. While the benefits of genAI include the realism of the synthetic samples and the ease of finely tuning images to strengthen or weaken features such as facial expression, head pose, and facial hair, the performance gap between models trained on real versus synthetic data is significant. Synthetic data is not yet a substitute for well-curated real datasets. Even so, the quality of synthetic face data is catching up to the quality of real-world data as data generation techniques improve, and we can surmise that in the near future, synthetic data may fully remove the need to use real-world face data for FR training.

Feature Image by Steph Meade