Top 3 Face Datasets and How to Work with Them

What Are Face Datasets?

An image dataset contains specially selected digital images intended to help train, test, and evaluate an artificial intelligence (AI) or machine learning (ML) algorithm, usually a computer vision algorithm.

A face dataset is a type of image dataset that includes images of curated human faces, typically for an ML project. There are several publicly available face datasets that you can leverage instead of collecting your own training data. Managing and optimizing datasets for machine learning is one of the crucial stages in a machine learning operations (MLOps) pipeline.

Face datasets usually include faces in varying positions and lighting conditions, showing a full range of human emotions, ethnicities, ages, and additional characteristics. Face datasets are a major component of producing face recognition technologies. This field of computer vision has many use cases, including video surveillance, device security, and augmented reality (AR).

Top 3 Face Datasets

Here are the most widely used face datasets.

CelebFaces Attributes (CelebA) Dataset

CelebA is a large face attribute dataset containing over 200,000 images of celebrities, each with 40 annotations for various attributes. The images in the CelebA dataset include many variations of background and pose.

This dataset is useful for training or testing models for several computer vision tasks, such as face detection, face attribute recognition, facial landmark localization, face synthesis, and face image editing.

The dataset is especially large, covering 10,177 celebrity identities and a total of 202,599 face images. Each image is annotated with five landmark locations and 40 binary attributes.

Flickr-Faces-HQ (FFHQ) Dataset

The FFHQ dataset contains human face images and offers even more variation than the CelebA dataset. It covers a wide range of ages, ethnicities, and image backgrounds, and includes a much greater variety of accessories such as hats, eyeglasses, and sunglasses. The images are taken from Flickr and have been automatically cropped and aligned.

Originally intended as a benchmark for Generative Adversarial Networks (GANs), this dataset includes approximately 70,000 PNG images. The images are high quality, with a resolution of 1024x1024 pixels.

Labeled Faces in the Wild (LFW)

The LFW image dataset contains curated face photographs intended for researching face recognition technology without constraints.

It consists of four separate image datasets, including an original set and three related sets with different types of images used for testing algorithms in different conditions. These aligned datasets include LFW-a, funneled images (ICCV 2007), and deep-funneled images (NIPS 2012). LFW-a and deep-funneled images generate higher quality results than regular or funneled images for most face recognition algorithms.

This dataset has more than 13,000 face images collected from different online sources.

Working with the Leading Face Datasets

CelebA

Accessing the Dataset

The official CelebA webpage provides multiple download links for different variations of the dataset. ZIP files contain both the images and the annotations, and the annotation files are also provided separately as text files. The webpage also links to the original dataset in a Baidu Drive folder.

Using CelebA in PyTorch

PyTorch provides the dataset directly through its torchvision.datasets module. Users can import the dataset directly and control which variation they get through parameters. The class has the following definition:

torchvision.datasets.CelebA(root, split = 'train', target_type = 'attr', transform = None, target_transform = None, download = False)

Here is how each parameter is used:

root: specifies the directory where the dataset is downloaded to
split: specifies which part of the dataset is used, can be 'train', 'valid', 'test', or 'all'
transform: a function that transforms an image
target_transform: a function that transforms the target
target_type: specifies the type of target, can be one of the following values:
- attr: labels the attributes with binary values
- identity: labels each image with the person’s identity
- bbox: specifies the position and dimensions of each image’s bounding box
- landmarks: specifies the coordinates of each image’s facial landmarks
download: if True, downloads the dataset and places it in the root directory; the download is skipped if the dataset is already there
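As a minimal sketch of how these parameters fit together, the helper below checks the split and target_type values listed above and then builds the dataset. The names celeba_options and load_celeba are hypothetical helpers, not part of torchvision, and the choice of ToTensor as the transform is only an assumption for illustration.

```python
VALID_SPLITS = ("train", "valid", "test", "all")
VALID_TARGET_TYPES = ("attr", "identity", "bbox", "landmarks")


def celeba_options(root, split="train", target_type="attr", download=False):
    """Check the documented values and return keyword arguments for CelebA."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    # target_type may be a single string or a list of strings.
    requested = [target_type] if isinstance(target_type, str) else list(target_type)
    for name in requested:
        if name not in VALID_TARGET_TYPES:
            raise ValueError(f"unknown target_type {name!r}")
    return {
        "root": root,
        "split": split,
        "target_type": target_type,
        "download": download,
    }


def load_celeba(root, **options):
    """Build the torchvision dataset (hypothetical convenience wrapper)."""
    # Imported inside the function so the validation helper above can be
    # used even when torchvision is not installed.
    from torchvision import datasets, transforms

    kwargs = celeba_options(root, **options)
    # ToTensor converts each PIL image to a float tensor with values in [0, 1].
    return datasets.CelebA(transform=transforms.ToTensor(), **kwargs)
```

Under these assumptions, a call such as load_celeba("./data", split="valid", target_type=["attr", "landmarks"], download=True) should return a dataset whose items pair each image with a tuple of its attribute and landmark targets. The "./data" path is a placeholder.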

Using CelebA in TensorFlow

TensorFlow lets you use the dataset directly through its TensorFlow Datasets (tfds) module. You can load the dataset with the following command:

tfds.load('celeb_a', split='train', download=True)

Since the dataset is pre-split into three parts ('train', 'test', and 'validation'), the split parameter controls which part is loaded. Each example comes as a feature dictionary that includes boolean attribute features, which you can use to select the pictures you need.
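The boolean features can be used to filter examples after loading. The sketch below assumes each example is a dictionary with an "attributes" entry that maps attribute names such as "Smiling" or "Eyeglasses" to booleans; the helper names are hypothetical and not part of tfds.

```python
def has_attributes(example, required):
    """Return True if every attribute name in `required` is set for this example."""
    attributes = example["attributes"]
    return all(bool(attributes[name]) for name in required)


def iter_matching(examples, required):
    """Yield only the examples whose boolean attributes include all of `required`."""
    for example in examples:
        if has_attributes(example, required):
            yield example


def load_smiling_faces(split="train"):
    """Hypothetical helper: stream CelebA examples tagged as smiling."""
    # Imported here so the pure-Python filters above work without TensorFlow.
    import tensorflow_datasets as tfds

    dataset = tfds.load("celeb_a", split=split, download=True)
    # as_numpy turns the tf.data.Dataset into an iterable of NumPy dictionaries.
    return iter_matching(tfds.as_numpy(dataset), ["Smiling"])
```

Filtering in plain Python keeps the example short; for large-scale training, the same predicate would more typically be expressed inside the input pipeline.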

Flickr-Faces-HQ Dataset (FFHQ)

FFHQ has also been used outside of GAN research: researchers trained an architecture on it using an alternative generative modeling technique called MvM. The technique differs from a traditional GAN in that it models geometric quantities such as p-diameters and centroids.

Accessing the Dataset

The dataset comes with JSON metadata, a script for downloading it, and its documentation. There are two main ways to access the dataset:

Google Drive: The dataset is available for direct download on the official Google Drive link.
Download Script: The script comes with different options for downloading the images. It can also verify checksums, retry when a download encounters errors, and use multiple connections to download the dataset.
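The checksum verification and retry behavior mentioned above can be illustrated with a short sketch. This is not the download script's actual code: the fetch callable, the function names, and the use of MD5 as the checksum are assumptions made for illustration.

```python
import hashlib


def md5_of(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def fetch_verified(fetch, expected_md5, num_attempts=3):
    """Call `fetch()` until its bytes match `expected_md5` or attempts run out."""
    last_error = None
    for attempt in range(1, num_attempts + 1):
        try:
            data = fetch()
        except OSError as error:
            # Network-style failures are remembered and the download is retried.
            last_error = error
            continue
        if md5_of(data) == expected_md5:
            return data
        # A corrupted download counts as a failed attempt as well.
        last_error = ValueError(f"checksum mismatch on attempt {attempt}")
    raise RuntimeError(f"download failed after {num_attempts} attempts") from last_error
```

Passing the download step in as a callable keeps the retry logic independent of any particular HTTP library.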

The script accepts the following arguments to customize the download process:

--json: Downloads the dataset’s metadata as a JSON file
--stats: Displays the dataset’s statistics
--images: Downloads the aligned images in PNG format at a resolution of 1024x1024 pixels (total download size: 89.1 GB)
--thumbs: Downloads thumbnails in PNG format at a resolution of 128x128 pixels (total download size: 1.95 GB)
--wilds: Downloads the original in-the-wild images in PNG format (total download size: 955 GB)
--tfrecords: Downloads the multi-resolution TFRecords (total download size: 273 GB)
--align: Recreates the 1024x1024 images from the in-the-wild images
--num_threads: Sets the number of concurrent threads used to download the dataset
--num_attempts: Sets the number of times the script tries to download each file in the dataset
--no-rotation: Keeps the original orientation of the images instead of rotating them during alignment
--no-padding: Skips the blur-padding normally applied around and near the image’s borders
--source-dir: Specifies the local directory that contains existing FFHQ source data
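To show how these options combine, the sketch below assembles a command line for the download script. The script filename download_ffhq.py, the helper names, and the chosen defaults are assumptions; adjust them to match your copy of the script.

```python
import subprocess
import sys


def build_ffhq_command(script="download_ffhq.py", metadata=True, thumbs=False,
                       images=False, num_threads=None, num_attempts=None):
    """Assemble the argument list for the FFHQ download script."""
    command = [sys.executable, script]
    if metadata:
        command.append("--json")
    if thumbs:
        command.append("--thumbs")
    if images:
        command.append("--images")
    # Numeric options are passed as a flag followed by its value.
    if num_threads is not None:
        command += ["--num_threads", str(num_threads)]
    if num_attempts is not None:
        command += ["--num_attempts", str(num_attempts)]
    return command


def run_download(**options):
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffhq_command(**options), check=True)
```

Starting with the thumbnails, for example run_download(thumbs=True, num_threads=8), is a reasonable way to check the setup before committing to the 89.1 GB full-resolution download.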

Labeled Faces in the Wild (LFW)

scikit-learn provides two loaders for the LFW dataset: fetch_lfw_people for face identification and fetch_lfw_pairs for face verification. After the first load, the loaders reuse a memmapped version of the data cached in the ~/scikit_learn_data/lfw_home/ folder through the joblib utility.

Using the fetch_lfw_people Loader

This loader targets face identification, a supervised task that classifies faces into multiple classes. This tutorial shows how to import the LFW dataset and print the names of the people shown in the images.

To use the fetch_lfw_people loader:

Use the following code to import the loader and fetch the dataset:

from sklearn.datasets import fetch_lfw_people

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

Use the following code to print the names of the people in the dataset:

for name in lfw_people.target_names:
    print(name)

Each face in the dataset is assigned a single person id in the target array. Each id is an index into target_names.

Use the following code to get the ground truth data through the target array:

lfw_people.target.shape

list(lfw_people.target[:10])
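Because the target array holds integer ids rather than names, a small lookup step makes the labels readable. The sketch below maps ids to names and counts the images per person; the helper names are hypothetical, and the functions accept either plain lists or the arrays returned by the loader.

```python
from collections import Counter


def names_for(targets, target_names):
    """Translate integer person ids into the corresponding names."""
    return [str(target_names[int(index)]) for index in targets]


def images_per_person(targets, target_names):
    """Count how many face images each person has."""
    return Counter(names_for(targets, target_names))


def summarize_lfw_people(min_faces_per_person=70, resize=0.4):
    """Hypothetical helper: fetch the dataset and count images per person."""
    # Imported here so the helpers above can be used without scikit-learn.
    from sklearn.datasets import fetch_lfw_people

    lfw_people = fetch_lfw_people(
        min_faces_per_person=min_faces_per_person, resize=resize
    )
    return images_per_person(lfw_people.target, lfw_people.target_names)
```

Checking the per-person counts is useful before training, since a heavily imbalanced set of classes can skew a classifier toward the most photographed people.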

Using the fetch_lfw_pairs Loader

This loader is useful for checking whether two pictures show the same person. When calling the loader, specify which subset of the dataset to load: 'train', 'test', or '10_folds'.

To use the fetch_lfw_pairs loader:

Use the following command to list the available face image pairs after importing the loader:

from sklearn.datasets import fetch_lfw_pairs

lfw_pairs_train_subset = fetch_lfw_pairs(subset='train')

list(lfw_pairs_train_subset.target_names)

The last command retrieves a list of two items:

['Different persons', 'Same person']
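Given the order of target_names, a target of 1 marks a pair showing the same person and 0 marks different persons. The sketch below scores a deliberately naive verifier against those labels, using mean absolute pixel difference and a fixed threshold. The helper names and the distance measure are assumptions for illustration, not a recommended face verification method.

```python
def pixel_distances(pairs):
    """Mean absolute pixel difference between the two images of each pair."""
    distances = []
    for first, second in pairs:
        # Flatten each 2D grayscale image into a list of floats.
        a = [float(value) for row in first for value in row]
        b = [float(value) for row in second for value in row]
        distances.append(sum(abs(x - y) for x, y in zip(a, b)) / len(a))
    return distances


def predict_same(distances, threshold):
    """Predict 1 ('Same person') when a pair's distance is below the threshold."""
    return [1 if distance < threshold else 0 for distance in distances]


def verification_accuracy(distances, targets, threshold):
    """Fraction of pairs whose prediction matches the ground-truth label."""
    predictions = predict_same(distances, threshold)
    correct = sum(int(p == int(t)) for p, t in zip(predictions, targets))
    return correct / len(predictions)
```

With the loader's output, the inputs would be lfw_pairs_train_subset.pairs and lfw_pairs_train_subset.target. A real verifier would compare learned face embeddings rather than raw pixels, but the scoring step stays the same.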

Conclusion

In this article, I covered three of the most popular face datasets you can use to build your own face recognition and face detection models: CelebA, FFHQ, and LFW. I showed technical details that can help you retrieve the datasets and use them in your model code. I hope this will give you a head start on your next computer vision project.
