In part 1, I showed how one can use deep learning to recognize complex gestures like heart shapes, check marks or happy faces on mobile devices. I also explained how apps might benefit from using such gestures, and went a little bit into the UX side of things.

This time around I’ll be walking you through the technical process of implementing those ideas in your own app. I’ll also introduce and use Apple’s Core ML framework (new in iOS 11).

Real-time recognition of complex gestures, at the end of each stroke

This technique uses machine learning to recognize gestures robustly. In the interest of reaching as many developers as possible, I won’t assume any understanding of the field. Some of this article is iOS-specific but Android developers should still find value.

The source code for the completed project is available here.

What We’re Building

By the end of the tutorial we’ll have a setup that allows us to pick totally custom gestures and recognize them with high accuracy in an iOS app. The components involved are:

- An app to collect some examples of each gesture (draw some check marks, draw some hearts, etc.).
- Some Python scripts to train a machine learning algorithm (explained below) to recognize the gestures. We’ll be using TensorFlow, but we’ll get to that later.
- The app to use the custom gestures in. It records the user’s strokes on the screen and uses the machine learning algorithm to figure out what gesture, if any, they represent.

Gestures we draw will be used to train a machine learning algorithm that we’ll evaluate in-app using Core ML.

In part 1 I went over why it’s necessary to use a machine learning algorithm. In short, it’s much harder than you’d think to write code to explicitly detect that a stroke the user made is in the shape of a heart, for example.

What’s a Machine Learning Algorithm?

A machine learning algorithm learns from a set of data in order to make inferences given incomplete information about other data.
In our case, the data are strokes made on the screen by the user and their associated gesture classes (“heart”, “check mark”, etc.). What we want to make inferences about are new strokes made by a user for which we don’t know the gesture class (incomplete information).

Allowing an algorithm to learn from data is called “training” it. The resulting inference machine that models the data is aptly called a “model”.

What’s Core ML?

Machine learning models can be complex and (especially on a mobile device) slow to evaluate. With iOS 11, Apple introduces Core ML, a new framework that makes them fast and easy to implement. With Core ML, implementing a model comes down primarily to saving it in the Core ML model format (.mlmodel). Xcode 9 makes the rest easy.

An official Python package, coremltools, is available that makes it easy to save mlmodel files. It has converters for Caffe, Keras, LIBSVM, scikit-learn and XGBoost models, as well as a lower-level API for when those don’t suffice (e.g. when using TensorFlow). Unfortunately coremltools currently requires Python 2.7.

Supported formats can be automatically converted into Core ML models using coremltools. Unsupported formats like TensorFlow require more manual work.

Note: Core ML only enables evaluating models on-device, not training new ones.

1. Making the Data Set

First let’s make sure we have some data (gestures) for our machine learning algorithm to learn from.

To generate a realistic data set, I wrote an iOS app called GestureInput to enter the gestures on-device. If your use case isn’t much different from mine, you may be able to use GestureInput as-is. It allows you to enter a number of strokes, preview the resulting image and add it to the data set. You can also modify the associated classes (called labels) and delete examples.

GestureInput randomly chooses gesture classes for you to draw examples of so that you get roughly equal numbers of each. When I want to change the frequencies with which they show up (e.g.
when adding a new class to an existing data set), I change the hard-coded values and recompile. Not pretty, but it works.

Generating data for the machine learning algorithm to learn from

The readme for this project explains how to modify the set of gesture classes, which include check marks, x marks, ascending diagonals, “scribbles” (rapid side-to-side motion while moving either up or down), circles, U shapes, hearts, plus signs, question marks, capital A, capital B, happy faces and sad faces. A sample data set is also included, which you can use by transferring it to your device.

How many gestures should you draw? As I mentioned in part 1, I was able to get 99.4% accuracy with 60 examples of each gesture, but I would actually recommend making about 100. Try to draw your gestures in a variety of ways so that the algorithm can learn them all.

Exporting For Training

A “Rasterize” button in GestureInput converts the user’s strokes into images and saves them into a file called data.trainingset. These images are what we’ll input to the algorithm.

As covered in part 1, I scale and translate the user’s gesture (“drawing”) to fit in a fixed-size box before converting it into a grayscale image. This helps make our gesture recognition independent of where and how big the user makes their gesture. It also minimizes the number of pixels in the image that represent empty space.

Converting the user’s strokes into a grayscale image for input into our machine learning algorithm

Note that I still store the raw time sequence of touch positions for each stroke in another file. That way I can change the way gestures are converted into images in the future, or even use a non-image-based approach to recognition, without having to draw all the gestures again.

GestureInput saves the data set in the documents folder of its container. The easiest way to get the data set off your device is by downloading the container through Xcode.

2. Training a Neural Network

In step 1 we converted our data set into a set of images (with class labels). This converts our gesture classification problem into an image classification problem, which is just one (simple) approach to recognizing the gestures. A different approach might use velocity or acceleration data.

I mentioned that we’d be using a machine learning algorithm. It turns out the state-of-the-art class of machine learning algorithms for image classification right now is convolutional neural networks (CNNs). See this excellent beginner-friendly introduction to CNNs. We’ll train one with TensorFlow and use it in our app.

If you’re not familiar with TensorFlow, you can learn about it here, but this article has all the instructions you’ll need to train a model. My neural network is based on the one used in the TensorFlow tutorial Deep MNIST for Experts.

The set of scripts I used to train and export a model are in a folder called gesturelearner. I’ll be going over the typical use case, but they have some extra command-line options that might be useful. Start by setting up with virtualenv:

    cd /path/to/gesturelearner

    # Until coremltools supports Python 3, use Python 2.7.
    virtualenv -p $(which python2.7) venv
    pip install -r requirements.txt

Preparing the Data Set

First, I use filter.py to split the data set into a 15% “test set” and an 85% “training set”.

    # Activate the virtualenv.
    source /path/to/gesturelearner/venv/bin/activate

    # Split the data set.
    python /path/to/gesturelearner/filter.py --test-fraction=0.15 data.trainingset

The training set is of course used to train the neural network. The purpose of the test set is to show how well the neural network’s learnings generalize to new data (i.e. is the network just memorizing the labels of the gestures in the training set, or is it discovering an underlying pattern?).

I chose to set aside 15% of the data for the test set. If you only have a few hundred gesture examples in total then 15% will be a rather small number. That means the accuracy on the test set will only give you a rough idea of how well the algorithm is doing.

This part is optional. Ultimately the best way to find out how well the network performs is probably to just put it in your app and try it out.

Training

After converting my custom .trainingset format into the TFRecords format that TensorFlow likes, I use train.py to train a model. This is where the magic happens: our neural network learns from the examples we gave it to robustly classify new gestures it encounters in the future.

train.py prints its progress, periodically saving a TensorFlow checkpoint file and testing its accuracy on the test set (if specified).

    # Convert the generated files to the TensorFlow TFRecords format.
    python /path/to/gesturelearner/convert_to_tfrecords.py data_filtered.trainingset
    python /path/to/gesturelearner/convert_to_tfrecords.py data_filtered_test.trainingset

    # Train the neural network.
    python /path/to/gesturelearner/train.py --test-file=data_filtered_test.tfrecords data_filtered.tfrecords

Training should be quick, reaching about 98% accuracy in a minute and settling after about 10 minutes.

Training the neural network

If you quit train.py during training, you can start again later and it will load the checkpoint file to pick up where it left off. It has options for where to load the model from and where to save it.

Training With Lopsided Data

If you have significantly more examples of some gestures than other gestures, the network will tend to learn to recognize the better-represented gestures at the expense of the others. There are a few different ways to cope with this:

- The neural network is trained by minimizing a cost function associated with making errors. To avoid neglecting certain classes, you can increase the cost of misclassifying them.
- Include duplicates of the less-represented gestures so that you have equal numbers of all gestures.
- Remove some examples of the more-represented gestures.

My code doesn’t do these things out-of-the-box, but they should be relatively easy to implement.

Exporting to Core ML

As I alluded to earlier, Core ML does not have a “converter” for converting TensorFlow models into Core ML MLModels the way it does with Caffe and scikit-learn, for example. This leaves us with two options to convert our neural network into an MLModel:

- Use the coremltools.models package, which has an API for building neural networks.
- Since the MLModel specification is based on Google’s protocol buffers, you can skip coremltools and use protobuf directly in virtually any programming language.

So far there don’t seem to be any examples of either method on the web, other than in the internal code of the existing converters.
Here’s a condensed version of my example using coremltools:

To use it:

    # Save a Core ML .mlmodel file from the TensorFlow checkpoint model.ckpt.
    python /path/to/gesturelearner/save_mlmodel.py model.ckpt

The full code can be found here. If for some reason you prefer to skip coremltools and work directly with the MLModel protobuf specification, you can also see how to do that there.

One ugly side effect of having to write this conversion code ourselves is that we describe our entire network in two places (the TensorFlow code, and the conversion code). Any time we change the TensorFlow graph, we have to synchronize the conversion code to make sure our models export properly.

Hopefully in the future Apple will develop a better method for exporting TensorFlow models.

On Android you can use the official TensorFlow API. Google will also be releasing a mobile-optimized version of TensorFlow called TensorFlow Lite.

3. Recognizing Gestures In-App

Finally, let’s put our model to work in a user-facing app. This part of the project is GestureRecognizer, the app you saw in action at the beginning of the article.

Once you have an mlmodel file, you can add it to a target in Xcode. You’ll need to be running Xcode 9. At the moment it’s in public beta, but its release will likely coincide with that of the new iPhone and iOS 11 next week.

Xcode 9 will compile any mlmodel files that you add to your target and generate Swift classes for them. I named my model GestureModel so Xcode generated GestureModel, GestureModelInput and GestureModelOutput classes.

We’ll need to convert the user’s gesture (Drawing) into the format that GestureModel accepts. That means converting the gesture into a grayscale image exactly the same way we did in step 1. Core ML then requires us to convert the array of grayscale values to its multidimensional array type, MLMultiArray.
[MLMultiArray](https://developer.apple.com/documentation/coreml/mlmultiarray) is like a wrapper around a raw array that tells Core ML what type it contains and what its shape (i.e. dimensions) is. With an MLMultiArray in hand, we can evaluate our neural network.

I use a shared instance of GestureModel since each instance seems to take a noticeable length of time to allocate. In fact, even after the instance is created, the model is slow to evaluate for the first time. I evaluate the network once with an empty image when the application starts so that the user doesn’t see a delay when they start gesturing.

Interpreting the Network’s Output

Evaluating the network yields an array of “probabilities”, one for each possible gesture class (label). Higher values generally represent higher confidence, but a lot of gestures that do not belong to any of the classes will counterintuitively receive high scores.

In part 1 I talked about how to reliably distinguish invalid gestures from valid ones. One solution is to create an “invalid gesture” category with a variety of different gestures that don’t belong to any of the other categories. For this project I just consider a gesture valid if the network classifies it with a “probability” above a certain threshold (0.8).

Avoiding Conflicts Between Gestures

Since some of the gesture classes I used contain each other (happy faces contain U shape mouths, x marks contain ascending diagonals), it’s possible to prematurely recognize the simpler gesture when the user actually intends to draw the more complex one.

To reduce conflicts, I used two simple rules:

- If a gesture could make up part of a more complex gesture, delay its recognition briefly to see if the user draws that larger gesture.
- Given the number of strokes the user makes, don’t recognize a gesture that can’t sensibly have been drawn yet (e.g. a happy face requires at least 3 strokes for the mouth and two eyes).
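The 0.8 threshold and the two rules above can be sketched roughly as follows. This is a minimal illustration in Python rather than the app's Swift, and it is not the project's actual code: the label names, the MIN_STROKES and PART_OF tables and the interpret function are all hypothetical, and applying the stroke-count rule after picking the top class is my own assumption.

```python
# Sketch only: every name below is hypothetical, not taken from the project.
CONFIDENCE_THRESHOLD = 0.8  # validity threshold quoted in the article

# Minimum strokes needed before a class may be recognized. Only the
# happy-face value (3) comes from the article; unlisted classes default to 1.
MIN_STROKES = {"happy_face": 3}

# Simpler gestures mapped to the larger gestures that contain them,
# based on the two examples given in the article.
PART_OF = {"u_shape": ["happy_face"], "ascending_diagonal": ["x_mark"]}


def interpret(probabilities, labels, stroke_count):
    """Return (label, delay) for one network output.

    label is None when no valid gesture is recognized; delay is True when
    the caller should wait briefly in case a larger gesture is being drawn.
    """
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    label = labels[best]
    # Treat the gesture as invalid unless its score is above the threshold.
    if probabilities[best] <= CONFIDENCE_THRESHOLD:
        return None, False
    # Rule 2: reject classes that need more strokes than were drawn so far.
    if stroke_count < MIN_STROKES.get(label, 1):
        return None, False
    # Rule 1: a gesture that is part of a larger one gets delayed.
    return label, label in PART_OF


labels = ["u_shape", "happy_face", "heart"]
print(interpret([0.90, 0.05, 0.05], labels, stroke_count=1))  # ('u_shape', True)
print(interpret([0.05, 0.90, 0.05], labels, stroke_count=1))  # (None, False)
```

How long to wait when delay is True is left to the caller; the article only says the delay is brief.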
In general though, for high robustness and responsiveness you should probably choose gestures that don’t contain each other.

And that’s it! With this setup, you can add a completely new gesture to your iOS app in about 20 minutes (input 100 images, train to 99.5+% accuracy, and export model).

To see how the pieces fit together or use them in your own project, see the full source code.


Smart Gesture Recognition in iOS 11 with Core ML and TensorFlow
