Figure: What we’re going to build!

In this episode: face detection, Recurrent Neural Networks and more. Make sure to check out part 1 too!

If you like Artificial Intelligence, subscribe to the newsletter to receive updates on articles and much more!

Have you ever found yourself eating, with no free hands to change the volume of your movie? Or the brightness of the screen? We’ll see how to use state-of-the-art artificial intelligence techniques to solve this problem by sending commands to your computer with eye movements!

Note: after you’ve read this, I invite you to read the follow-up post dedicated to the implementation details.

Introduction

What we want

The goal of this project is to use our eyes to trigger actions on our computer. This is a very general problem, so we need to specify what we want to achieve.

We could, for instance, detect when the eyes look towards a specific corner and then work from that. That is, however, quite limited and not really flexible, plus it would require us to hard-code corner combinations. Instead, we are going to use Recurrent Neural Networks to learn to identify complete eye movements.

The data

We won’t work with external datasets, we’ll make our own. This has the advantage of using the same source and processing for both training the model and making the predictions.

Without doubt, the most effective way to extract information from our eyes would be to use a dedicated close-up camera. With such hardware, we could directly track the center of the pupils and do all kinds of fancy stuff.

I didn’t want to use an external camera, so I decided to use the good old 720p webcam from my laptop.

The pipeline

Before we jump directly to the technical aspects, let’s review the steps of the process. Here’s the pipeline I came up with:

- Take a picture with the webcam and find the eyes
- Pre-process the images and extract important features (did you say neural network?)
- Keep a running history of the last few frames’ extracted features
- Predict the current eye movement based on that history

Figure: The pipeline we will use to process the images

We’ll go through these steps to see how we can make this work. Let’s get to it!

Getting a picture of the eyes

Finding the eyes

Straight from the webcam, we start by downsampling the image and converting it to grayscale (the color channels are largely redundant for this task). This makes the next steps much faster and helps our model run in real time.

For the detection part, we’ll use Haar cascades, as they are extremely fast. With some tuning, we can get some pretty good results, but trying to detect the eyes directly leads to many false positives. To get rid of these, we don’t try to find the eyes in the image, but rather the face in the image, and then the eyes in the face.

Once we have the bounding boxes of both eyes, we can extract the images from the initial full-sized webcam snapshot, so that we don’t lose any information.

Pre-processing the data

Once we have found both eyes, we need to process them for our dataset. To do that, we can simply resize both to a fixed size (a 24px square) and use histogram normalization to get rid of the shadows.

Figure: Steps to extract eyes

We could then use the normalized pictures directly as input, but we have the opportunity here to do a little more work that helps a lot. Instead of using the eye images, we compute the difference between the eyes in the current and previous frame. This is a very efficient way to encode motion, which is all we need in the end.

**Note that for all diagrams except the GIF below, I will use eye pictures to represent eye differences, because differences look awful on screen.**

Figure: Comparison between normalized frames and frame differences

Now that we have processed both eyes, we have the choice to treat them separately as two representatives of the same class, or use them together as if they were a single image*. I chose the latter because, even though the eyes are supposed to follow the exact same motion, having both inputs will make the model more robust.

*What we are going to do is a bit more clever than simply stitching the images together, though.

Figure: Pairing both eyes together

Creating the dataset

Recording

I have recorded 50 samples for two separate motions (one that looks like a “gamma”, the other looks like a “Z”). I have tried to vary the position, scale, and speed of the samples to help the model generalize. I have also added 50 examples of “idle”, which contain roughly generic pattern-free eye motions as well as still frames.

Figure: Motion examples: ‘gamma’, ‘mount’, ‘Z’, ‘idle’

Unfortunately, 150 samples is tiny for such a task, so we need to augment the dataset with new samples.

Data augmentation

The first thing we can do is fix an arbitrary sequence length: 100 frames. From there, we can slow down shorter samples and speed up longer ones. That’s possible because speed does not define the motion.

Also, because sequences shorter than 100 frames should be detected at any time in the 100-frame window, we can add padded examples.

Figure: Sliding window padding for samples shorter than 100 frames

With these techniques, we can augment our dataset to be around 1000–2000 examples.

The final dataset

Let’s take a step back for a minute and try to understand our data. We have recorded some samples with corresponding labels. Each of these samples is a series of pairs of 24px-wide square images, one per eye. Note that we have one dataset for each eye.
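To make the two augmentation ideas above more concrete, here is a minimal sketch in plain Python. It is not the code from the project: the function names `resample` and `sliding_pad`, the nearest-frame resampling, the `stride` parameter and the use of a “blank” frame for padding are all assumptions made for illustration. Frames are treated as opaque objects, so in practice each one would be a pair of eye-difference images and the blank frame would be an all-zero difference.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

SEQ_LEN = 100  # fixed sequence length chosen in the article


def resample(sample: Sequence[Frame], length: int = SEQ_LEN) -> List[Frame]:
    """Slow down or speed up a sample so that it spans exactly `length` frames.

    Uses nearest-frame lookup (an assumption): short samples get repeated
    frames, long samples get frames dropped.
    """
    if not sample:
        raise ValueError("cannot resample an empty sample")
    n = len(sample)
    # Spread the output positions evenly over the source indices.
    return [sample[i * n // length] for i in range(length)]


def sliding_pad(
    sample: Sequence[Frame],
    blank: Frame,
    length: int = SEQ_LEN,
    stride: int = 10,
) -> List[List[Frame]]:
    """Return padded copies of a short sample, shifted across the window.

    `blank` stands for a "no motion" frame and `stride` is a hypothetical
    knob controlling how many shifted copies are produced.
    """
    free = length - len(sample)
    if free < 0:
        raise ValueError("sample is longer than the window")
    return [
        [blank] * offset + list(sample) + [blank] * (free - offset)
        for offset in range(0, free + 1, stride)
    ]


# Toy usage: integers stand in for frames.
short_sample = list(range(40))
stretched = resample(short_sample)                   # 40 frames slowed down to 100
shifted = sliding_pad(short_sample, 0, stride=20)    # window offsets 0, 20, 40, 60
```

Combining a few speeds with a few window offsets per recording is one way a set of 150 samples could grow to the order of a thousand examples.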
Figure: Tensor description of the dataset

The model

Now that we have a dataset, we need to build the right model to learn and generalize from this data. We could write its specifications as follows:

Our model should be able to extract information from both images at each time step, and combine these features to predict the motion executed with the eyes.

Such a complicated system requires a powerful artificial intelligence model: neural networks. Let’s see how we can build one that meets our needs. Neural network layers are like LEGOs: we simply have to choose the right bricks and put them in the right place.

Visual features: Convolutional Neural Network

To extract information from the images, we are going to need convolutional layers. These are particularly good at processing images to squeeze out visual features. (Psst! We already saw that in part 1.)

We need to treat each eye separately, and then merge the features through a fully connected layer. The resulting convolutional neural network (CNN) will learn to extract relevant knowledge from pairs of eyes.

Figure: Convolutional Neural Network. Two parallel convolutional layers extract visual features, which are then merged

Temporal features: Recurrent Neural Network

Now that we have a simple representation of our images, we need something to process them sequentially. For that, we are going to use a recurrent layer, namely Long Short-Term Memory (LSTM) cells. The LSTM updates its state using both the extracted features at the current time step and its own previous state.

Finally, when we have processed the whole sequence of images, the state of the LSTM is fed to a softmax classifier to predict the probability of each motion.

Full model

Behold our final neural network, which takes as input a sequence of image pairs and outputs the probability of each motion. What is crucial here is that we build the model in one single piece, and therefore it can be trained end-to-end via backpropagation.

To be fancy, we could say it’s a Dual Deep Convolutional LSTM Recurrent Neural Network, but nobody says that.

Figure: The CNN extracts visual features from the input, which are processed at each step by the LSTM

Results

The trained model reaches 85+% accuracy on the test set. This is quite good, considering that the training set, before augmentation, was pretty small. With more time and dedication, I could have recorded at least 100–200 examples per class and maybe 3–4 motions instead of 2 (+idle). This would certainly have improved the performance.

The only remaining step is to use the classifier in real time, tune it to avoid false positives, and implement the logic to trigger actions (change volume, open app, run macro, etc.). More on that in the follow-up article.

Conclusion

In this post, we have seen how we can use Haar cascades to find the eyes in a picture, how to clean images, and how using image differences could help in motion-related projects.

We have also seen how to artificially augment the size of our dataset and how we could use a deep neural network to fit our dataset by assembling convolutional, fully connected, and recurrent layers.

I hope you’ve liked this project, and I’d be happy to hear your feedback!

Additional article: Code & Implementation details

If you are interested, I go into more detail on the implementation choices and issues of this project (model choice, eye tracking, etc.) in the follow-up article.

You can play with the code over there: DeepEyeControl - Using your eyes to trigger shortcuts on your computer (github.com/despoisj/DeepEyeControl)

Thanks for reading this post, stay tuned for more!