Variational Autoencoders (VAE): How AI Learns Whether Your Eyes Are Open Or Closed

Written by dmitriikhizbullin | Published 2021/10/14
Tech Story Tags: deeplearning | artificial-intelligence | data-science | software-development | neural-networks | convolutional-neural-networks | open-closed-eyes | variational-autoencoders

TL;DR: Given a dataset of images of human eyes and no annotation, we can train a model that classifies the images with high accuracy: we annotate just a handful of samples and apply semi-supervised learning built around a Variational Autoencoder (VAE).

As a data scientist, I have both taken and conducted lots of job interviews. I would like to point out that they are not like traditional computer science interviews with algorithmic tasks you can practice on LeetCode and HackerRank.

Typically, data scientists are given a task to build a machine learning model on some dataset, meaning that there is no exact solution. Any solution is just an approximation, and you never know the accuracy threshold that will satisfy the interviewer.


Also, a trend I have noticed over the last five years of going through data science interviews is that the complexity of the tasks keeps growing. For example, several years ago, logistic regression or a simple multilayer perceptron for binary classification was enough to qualify. Nowadays, knowledge of more advanced techniques is required: building a generative adversarial network (GAN) from scratch, training an image recognition neural net with triplet loss, or training an ML model capable of classifying images of open and closed human eyes without any annotation.

Such a model can help with sleepy-driver detection in modern cars, and the lack of annotation is a realistic constraint as well: annotation either costs money, takes a long time to obtain, or in some cases is impossible to get due to privacy protection. Let's take a deeper look at this task and the author's solution to it, which will help you be more prepared for data scientist interviews.


The formulation of the task goes like this: given a dataset of images of human eyes and no annotation, train a model that classifies the images with as high accuracy as possible. The trained model must be testable with a Python script on a hold-out dataset. The deadline for this task was 24 hours, so let's waste no time and hop onto it right away.

First, we need to figure out where to code: in a Jupyter notebook or in an IDE like PyCharm. Why the dilemma? The easiest way is to go with a notebook, since it allows easy data exploration and nice in-place visualizations of images and graphs. However, eventually we have to provide a Python script that will test the model.

At this point, we would like to reuse part of the code from the training notebook, which is not that easy. The approach I practice is to start coding in a notebook and then, once the code and the model become stable, copy the code into a Python script and scrap the notebook for good to avoid losing synchronization between the script and the notebook.

Images

Initial data exploration shows that we are given 3600 grayscale images with a resolution of 24x24, each containing a single eye, some open and some closed. We could theoretically annotate all 3600 images manually, and it would take something like an hour, but there is little merit in this brute-force approach of manual labor. If only we could annotate just a handful of samples and apply some semi-supervised learning techniques to get high accuracy…

Oh, actually, we can! The classical scheme of semi-supervised learning is to train a model in two stages: the first one being an autoencoder and the second one being a linear classifier built on top of the output of the encoder.

Why this scheme? See, if we were to train a classification model on just a few samples (10-100), there is a high chance that any neural network would overfit on the training set, since the input feature size is 24*24 = 576 numbers.

However, if we could reduce the dimensionality of the features to a smaller value, say 50, we could avoid overfitting. Dimensionality can be reduced with PCA, for example, but in that case we cannot be sure that PCA doesn't discard the signal about open/closed eyes. Another method of "smart" dimensionality reduction is an autoencoder, which first compresses the image into a latent vector of a specified size (50 in our case) and then decompresses it to recover the original image.
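As a quick aside, the PCA route is only a few lines with scikit-learn. This is a hypothetical baseline shown purely for comparison, not part of the solution described here; the images array stands in for the 3600 loaded eye crops.

# Hypothetical PCA baseline, shown only for comparison with the autoencoder.
# `images` is assumed to be a (3600, 24, 24) numpy array of grayscale eye crops.
import numpy as np
from sklearn.decomposition import PCA

X = images.reshape(len(images), -1) / 255.0  # flatten to (3600, 576)
pca = PCA(n_components=50)
latents = pca.fit_transform(X)               # (3600, 50) compressed features
print("explained variance:", pca.explained_variance_ratio_.sum())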

The autoencoder we are going to talk about is a neural net, of course. A traditional autoencoder should be able to learn an embedding of the native representation of the image contents (eye open/shut, multiple features describing the style of the eye, maybe even aspects of a person's physiology such as the depth of the eye orbit, iris diameter, etc.). However, the individual components of this embedding may be quite entangled. There is a type of autoencoder called the variational autoencoder [1] that can disentangle these features and, in our case, facilitate our target task: classification.

Let’s create a VAE in PyTorch. The encoder is going to use 4 Conv2d layers followed by 2 fully-connected layers that predict the mean and the log-variance of the latent distribution, each being a vector with the length of the latent space size. The decoder will be a sequence of 4 ConvTranspose2d layers that reconstruct the image from the latent representation. Note that in VAE.forward() we apply the reparametrization trick to make the VAE end-to-end differentiable.

import contextlib
from collections import OrderedDict

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


# A small helper: a convolution followed by batch normalization.
class NormalizedOp(nn.Module):
    def __init__(self, op):
        super().__init__()
        self.op = op
        self.bn = nn.BatchNorm2d(op.out_channels)

    def forward(self, x):
        x = self.op(x)
        x = self.bn(x)
        return x


class Encoder(nn.Module):
    def __init__(self, latent_size):
        super().__init__()
        self.conv1 = NormalizedOp(nn.Conv2d(1, 32, 3, padding=1, stride=2))
        self.conv2 = NormalizedOp(nn.Conv2d(32, 64, 3, padding=1, stride=2))
        self.conv3 = NormalizedOp(nn.Conv2d(64, 128, 3, padding=1, stride=2))
        self.conv4 = NormalizedOp(nn.Conv2d(128, 256, 3, padding=0, stride=3))

        self.mu_fc = nn.Linear(256, latent_size)
        self.log_var_fc = nn.Linear(256, latent_size)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = torch.relu(self.conv4(x))
        x = x.view(x.shape[0], -1)

        mu = self.mu_fc(x)
        log_var = self.log_var_fc(x)
        return mu, log_var


class Decoder(nn.Module):
    def __init__(self, latent_size):
        super().__init__()

        self.dec_fc = nn.Linear(latent_size, 256)

        self.tconv1 = nn.ConvTranspose2d(256, 128, 3, padding=0, stride=3)
        self.tconv2 = nn.ConvTranspose2d(128, 64, 3, output_padding=0, stride=2)
        self.tconv3 = nn.ConvTranspose2d(64, 32, 3, output_padding=0, stride=2)
        self.tconv4 = nn.ConvTranspose2d(32, 1, 3, output_padding=0, stride=2)

    def forward(self, z):
        x = self.dec_fc(z)

        x = x.unsqueeze(-1).unsqueeze(-1)
        x = torch.relu(self.tconv1(x))
        # Crop one row and column after each upsampling step (7->6, 13->12, 25->24)
        # so that the output ends up at the original 24x24 resolution.
        x = torch.relu(self.tconv2(x))[:, :, :-1, :-1]
        x = torch.relu(self.tconv3(x))[:, :, :-1, :-1]
        x = torch.sigmoid(self.tconv4(x))[:, :, :-1, :-1]

        return x


class VAE(nn.Module):
    def __init__(self, latent_size):
        super().__init__()

        self.encoder = Encoder(latent_size)
        self.decoder = Decoder(latent_size)

    def forward(self, x):
        mu, log_var = self.encoder(x)

        # Reparametrization trick: z = mu + std * eps with eps ~ N(0, I);
        # rsample() keeps the sampling step differentiable.
        std = torch.exp(log_var / 2)
        q = torch.distributions.Normal(mu, std)
        z = q.rsample()

        x = self.decoder(z)

        return x, z, mu, std
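Before training, a quick shape check helps to confirm that the decoder mirrors the encoder. This is a small sanity-check sketch, not part of the original article's code.

# Sanity check of tensor shapes for the 24x24 grayscale inputs (sketch only).
vae = VAE(latent_size=50)
dummy = torch.randn(8, 1, 24, 24)        # a batch of 8 fake eye crops
recon, z, mu, std = vae(dummy)
print(recon.shape, z.shape, mu.shape, std.shape)
# torch.Size([8, 1, 24, 24]) torch.Size([8, 50]) torch.Size([8, 50]) torch.Size([8, 50])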

Training Variational Autoencoders (VAE)

Now we need to train our VAE on the entire dataset of 3600 eye images. The most important thing here is the loss function. It consists of a reconstruction loss, which is the Mean Squared Error (MSE) in our case, and a KL-divergence loss. The KL-divergence term penalizes the discrepancy between the estimated Gaussian distribution and the multivariate standard normal distribution N(0, I).
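For reference, the closed-form KL divergence between a diagonal Gaussian and the standard normal prior is a standard result:

D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_i \left( \sigma_i^2 + \mu_i^2 - \log \sigma_i^2 - 1 \right)

The kl_divergence helper below averages a per-element variant of this expression instead of summing it; its overall contribution is controlled by the 0.001 weight applied in the total loss.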

def kl_divergence(mu, std):
    # KL term between the predicted Gaussian N(mu, std^2) and N(0, I),
    # averaged over the batch and latent dimensions.
    kl = (std**2 + mu**2 - torch.log(std) - 1/2).mean()
    return kl


class VaeTrainer:
    def __init__(self, latent_size):
        self.vae = VAE(latent_size=latent_size)
        self.vae.cuda()

    def train(self):
        self.vae.train()

        image_list, _ = load_archive()
        tensor_data = (1 / 255 * torch.tensor(image_list).float()).unsqueeze(1)
        dataset = torch.utils.data.TensorDataset(tensor_data)
        loader = torch.utils.data.DataLoader(dataset, batch_size=256,
                                             shuffle=True, num_workers=0)
        optimizer = torch.optim.Adam(self.vae.parameters(), lr=1e-2)

        num_iters = 10_000
        i_iter = 0
        while i_iter < num_iters:
            for i_batch, (batch,) in enumerate(loader):
                batch = batch.cuda()
                pred_batch, z_batch, mu_batch, std_batch = self.vae(batch)
                optimizer.zero_grad()
                loss_recon = F.mse_loss(batch, pred_batch)
                loss_kl = kl_divergence(mu_batch, std_batch)
                loss = loss_recon + 0.001*loss_kl
                loss.backward()
                optimizer.step()
                if i_iter % 100 == 0:
                    print(i_iter, loss_recon.item(), loss_kl.item(), loss.item())
                    torch.save(self.vae.state_dict(), "vae.pth")
                i_iter += 1

Once we’ve trained the VAE to convergence, we can take a look at what the reconstructed images look like:

The reconstruction is performed pretty well considering the latent vector size of just 50 elements. Also, due to the information bottleneck created by the latent vector, the noise in the images is filtered away.
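To inspect the reconstructions yourself, a minimal matplotlib sketch could look like the following. This is not the article's original plotting code; it assumes the trained VAE (e.g. the vae attribute of VaeTrainer) and the tensor_data tensor built in VaeTrainer.train().

# Hypothetical visualization sketch, not the article's original plotting code.
import matplotlib.pyplot as plt

vae.eval()
with torch.no_grad():
    sample = tensor_data[:8].cuda()       # 8 original eye crops
    recon, _, _, _ = vae(sample)

fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i in range(8):
    axes[0, i].imshow(sample[i, 0].cpu(), cmap="gray")  # original
    axes[1, i].imshow(recon[i, 0].cpu(), cmap="gray")   # reconstruction
    axes[0, i].axis("off")
    axes[1, i].axis("off")
plt.show()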

At this point, we have a model trained in an unsupervised manner. However, the end goal is open/shut classification, so we still need to manually annotate some data. I’ve picked 50 open-eye and 50 shut-eye images and annotated them. Let’s check what accuracy we can achieve with just 50 annotated training images (the other 50 are held out for validation). Below is the classifier code: a frozen pre-trained encoder with a fully-connected layer with a single output on top of it.

class Classifier(nn.Module):
    def __init__(self, latent_size, pretrained_model_path="vae.pth"):
        super().__init__()
        self.encoder = Encoder(latent_size)
        self.freeze_backbone = pretrained_model_path is not None
        if pretrained_model_path is not None:
            state_dict = torch.load(pretrained_model_path)
            state_dict = OrderedDict(((k[len("encoder."):], v)
                                     for k, v in state_dict.items()
                                      if "encoder." in k))
            self.encoder.load_state_dict(state_dict, strict=True)
            for param in self.encoder.parameters():
                param.requires_grad = False
        self.encoder.eval()
        self.class_fc = nn.Linear(latent_size, 1)

    def forward(self, x):
        with torch.no_grad() if self.freeze_backbone \
                else contextlib.nullcontext():
            mu, log_var = self.encoder(x)
        logits = self.class_fc(mu)
        x = torch.sigmoid(logits)
        x = x.squeeze(-1)
        return x

    def train(self, mode=True):
        self.encoder.train(False)
        self.class_fc.train(mode)

The Training Process

The training process is very similar, though we train only the head’s fully-connected layer, and we can do significantly fewer iterations. Also, notice that we train based on the number of iterations, not the number of epochs. This is convenient when the dataset size is itself a hyperparameter: we are free to make it larger by annotating more images, yet we normally want the training cycle to consist of a fixed number of iterations. Thus we limit iterations, not epochs.

class ClassifierTrainer:
    def __init__(self, latent_size, from_scratch=False):
        self.model = Classifier(
            latent_size,
            **(dict(pretrained_model_path=None)
               if from_scratch else {}))
        self.model.cuda()
        self.prediction_threshold = 0.5

    def _load_dataset(self, is_train):
        anno_images = ... # Hidden for the sake of saving space
        images_np = np.array(anno_images)
        images_np = 1 / 255 * np.expand_dims(images_np, 1)
        anno_np = np.array([v[1] for v in flat_list])
        dataset = torch.utils.data.TensorDataset(
            torch.tensor(images_np, dtype=torch.float32),
            torch.tensor(anno_np, dtype=torch.float32))
        return dataset

    def train(self):

        train_dataset = self._load_dataset(True)
        val_dataset = self._load_dataset(False)

        train_loader = DataLoader(train_dataset, batch_size=len(train_dataset),
                                  shuffle=True, num_workers=0, drop_last=True)
        val_loader = DataLoader(val_dataset, batch_size=len(val_dataset),
                                shuffle=False, num_workers=0, drop_last=True)

        all_params = list(self.model.parameters())
        print("len(all_params)", len(all_params))

        optimizable_params = [p for p in self.model.parameters() if p.requires_grad]
        print("len(optimizable_params)", len(optimizable_params))
        optimizer = torch.optim.Adam(optimizable_params, lr=1e-2, weight_decay=2e-5)

        num_iters = 5_000
        i_iter = 0
        while i_iter < num_iters:
            for i_batch, (image_batch, anno_batch) in enumerate(train_loader):
                self.model.train()
                image_batch = image_batch.cuda()
                anno_batch = anno_batch.cuda()
                pred_batch = self.model(image_batch)
                optimizer.zero_grad()
                loss = F.binary_cross_entropy(pred_batch, anno_batch)
                loss.backward()
                optimizer.step()

                hard_pred_batch = pred_batch > self.prediction_threshold
                anno_bool_batch = anno_batch > self.prediction_threshold
                if i_iter % 1000 == 0:
                    print(i_iter, " train_loss=", loss.item())
                    accuracy = torch.sum(torch.eq(hard_pred_batch, anno_bool_batch)) \
                               / len(anno_bool_batch)
                    print("train_accuracy=", accuracy.item())
                    self._validate(val_loader)
                    torch.save(self.model.state_dict(), "classifier.pth")
                i_iter += 1

        print("weight=", self.model.class_fc.weight.data)
        print("bias=", self.model.class_fc.bias.data)

    def _validate(self, val_loader):
        self.model.train(False)
        image_batch, anno_batch = next(iter(val_loader))
        image_batch = image_batch.cuda()
        anno_batch = anno_batch.cuda()
        with torch.no_grad():
            pred_batch = self.model(image_batch)
            loss = F.binary_cross_entropy(pred_batch, anno_batch)
        print("val_loss=", loss.item())
        hard_pred_batch = pred_batch > self.prediction_threshold
        anno_bool_batch = anno_batch > self.prediction_threshold
        accuracy = torch.sum(torch.eq(hard_pred_batch, anno_bool_batch)) / \
                   len(anno_bool_batch)
        print("val_accuracy=", accuracy.item())

Results

Now we can train our classifier from scratch and compare it side by side with the VAE-pretrained version to figure out how much of an improvement we get. The validation accuracy for the model trained from scratch to convergence turns out to be 78%.

Finally, the VAE-pretrained model gives us an accuracy as high as 94%! This is a clear success of VAE-based unsupervised pretraining: going from 78% to 94% accuracy in binary classification is basically going from a useless model to a useful one.

This is achieved with as few as 50 training samples (the other 50 were used for validation). At the same time, as a byproduct, we’ve got a generative model capable of generating non-existent eye images, which may be used for data augmentation or even privacy protection.
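For instance, new synthetic eyes can be produced by sampling latent vectors from the prior and pushing them through the decoder. A small sketch, assuming the trained VAE from above; this is not part of the original article's code.

# Sketch: generate synthetic eye images by decoding latents drawn from N(0, I).
vae.eval()
with torch.no_grad():
    z = torch.randn(16, 50).cuda()    # 16 latent vectors from the prior
    fake_eyes = vae.decoder(z)        # (16, 1, 24, 24) generated images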

The entire code of this article is available at my GitHub: https://github.com/Obs01ete/eyes.

Check out my personal page: https://obs01ete.github.io/.

Reference:

[1] Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114, 2013.
