A decade after touch screens became ubiquitous in phones, we still generally interact with mobile apps using only minor variations of a few gestures: tapping, panning, zooming and rotating.
Recently, I explored a highly underutilized class of gestures. The approach I took to detecting them in-app has several important advantages over the state-of-the-art techniques.
Heart gesture concept by Virgil Pana
The idea originally came to me while imagining a puzzle game in which the user solves physical challenges by designing machines out of simple components. Sort of like Besiege but 2D, mobile and focused on solving physical puzzles rather than medieval warfare.
Instead of cluttering the small mobile screen with buttons for switching tools, I thought the game could have different gestures for different tools or components. For instance:
On top of cleaning up the UI, this could save the user time that they would have spent looking for buttons, navigating submenus and moving their finger back and forth to work on something.
While pan and rotation gestures can be recognized using basic geometry, more complex gestures like these are tricky.
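For the simple cases, a couple of lines of geometry really are enough. As a rough sketch (my own illustration, not code from any framework), a two-finger rotation gesture just tracks how the angle between the two touch points changes:

```swift
import CoreGraphics

/// Angle (in radians) of the line from one touch location to another.
/// Watching how this angle changes between touch events gives a basic
/// two-finger rotation gesture; the change in the midpoint gives a pan.
func angle(from a: CGPoint, to b: CGPoint) -> CGFloat {
    atan2(b.y - a.y, b.x - a.x)
}

/// Rotation accumulated since the gesture began.
func rotationDelta(initialAngle: CGFloat, currentAngle: CGFloat) -> CGFloat {
    currentAngle - initialAngle
}
```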
It’s tempting to try to handcraft algorithms to identify each gesture you intend to use in your app. For something like a check mark, one might use the following rules:
Indeed, the official iOS tutorial on custom gesture recognizers suggests these rules. While they may seem to describe a check mark, consider these issues:
And these are only for simple check marks. Imagine what can go wrong when dealing with more complex gestures, with perhaps multiple strokes, that all have to be distinguished from each other.
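To make the brittleness concrete, here’s a rough sketch of what handcrafted check-mark rules might look like over the sampled points of a stroke. The specific rules and thresholds are my own illustrative guesses, not the tutorial’s code, and that’s exactly the problem: every new drawing style tends to demand another tweak.

```swift
import CoreGraphics

/// A naive, handcrafted check-mark test over the sampled points of a single
/// stroke: look for a short downward-right segment followed by a longer
/// upward-right segment. The thresholds are guesses and break easily.
func looksLikeCheckMark(_ points: [CGPoint]) -> Bool {
    guard points.count >= 3 else { return false }

    // The "corner" of the check: the lowest point of the stroke
    // (largest y in view coordinates).
    guard let pivotIndex = points.indices.max(by: { points[$0].y < points[$1].y }) else {
        return false
    }
    let start = points.first!, pivot = points[pivotIndex], end = points.last!

    let firstSegment = CGVector(dx: pivot.x - start.x, dy: pivot.y - start.y)
    let secondSegment = CGVector(dx: end.x - pivot.x, dy: end.y - pivot.y)

    let goesDownThenUp = firstSegment.dy > 0 && secondSegment.dy < 0
    let movesRightward = firstSegment.dx >= 0 && secondSegment.dx > 0
    let upstrokeIsLonger = hypot(secondSegment.dx, secondSegment.dy) >
                           hypot(firstSegment.dx, firstSegment.dy) * 1.2

    return goesDownThenUp && movesRightward && upstrokeIsLonger
}
```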
The state-of-the-art for complex gesture recognition on mobile devices seems to be an algorithm called $P. $P is the newest in the “dollar family” of gesture recognizers developed by researchers at the University of Washington.
$P’s main advantage is that all the code needed to make it work is short and simple. That makes it easy to write new implementations, and easy to allow the user to make new gestures at runtime. $P can also recognize gestures regardless of their orientation.
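To give a feel for how the dollar family works, here’s a heavily simplified sketch of point-cloud matching in the spirit of $P. The real algorithm also resamples both clouds to a fixed number of points, normalizes scale and translation, and repeats the greedy match from several starting points, so treat this purely as an illustration:

```swift
import CoreGraphics

/// A heavily simplified flavor of $P-style matching: treat the gesture and
/// the template as unordered point clouds and greedily pair each gesture
/// point with the nearest still-unused template point, summing the distances.
func greedyCloudDistance(_ gesture: [CGPoint], _ template: [CGPoint]) -> CGFloat {
    var unused = template
    var total: CGFloat = 0
    for p in gesture {
        guard let nearestIndex = unused.indices.min(by: {
            hypot(unused[$0].x - p.x, unused[$0].y - p.y) <
            hypot(unused[$1].x - p.x, unused[$1].y - p.y)
        }) else { break }
        total += hypot(unused[nearestIndex].x - p.x, unused[nearestIndex].y - p.y)
        unused.remove(at: nearestIndex)
    }
    return total
}

/// Recognition then amounts to picking the template with the smallest distance.
func classify(_ gesture: [CGPoint], templates: [String: [CGPoint]]) -> String? {
    templates.min(by: { greedyCloudDistance(gesture, $0.value) <
                        greedyCloudDistance(gesture, $1.value) })?.key
}
```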
While $P performs decently, I don’t think it’s good enough to rely on in a real mobile app for a few reasons. First, it’s not very accurate.
Gestures misclassified by $P. Left: A T is misclassified as an exclamation mark because the top bar is shorter than expected. Center: An N is misclassified as an H because it’s narrower than expected. Right: An x mark is misclassified as an asterisk (perhaps because the lines aren’t straight?).
The examples above make $P look worse than it actually is, but I still don’t think an app can afford errors like those.
It’s possible to improve $P’s accuracy by giving it more gesture templates, but it won’t generalize as well from those templates as the solution you’re about to see. And the algorithm becomes slower to evaluate as you add more templates.
Another limitation of $P is its inability to extract high-level features, which are sometimes the only way of detecting a gesture.
One type of gesture that $P cannot detect. Because $P checks the distances between points rather than extracting high-level features, it’s not able to recognize this pattern. Source: University of Washington
A robust and flexible approach to detecting complex gestures is to pose detection as a machine learning problem. There are a number of ways to do this. I tried out the simplest reasonable one I could think of: render the user’s strokes into a small image and have a convolutional neural network (CNN) classify that image as one of the known gestures.
This converts the problem into an image classification problem, which CNNs solve extremely well.
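As a concrete sketch of that idea (my own illustration of one way to do it, not necessarily exactly how my app does it), the strokes can be rasterized into a small fixed-size image with Core Graphics. The 28-point output size, padding and line width below are arbitrary choices:

```swift
import UIKit

/// Render a gesture (an array of strokes, each an array of touch locations)
/// into a small square image suitable as CNN input.
func rasterize(strokes: [[CGPoint]], outputSide: CGFloat = 28) -> UIImage {
    // Bounding box of all points, so the drawing is scaled to fill the image.
    let allPoints = strokes.flatMap { $0 }
    let xs = allPoints.map { $0.x }, ys = allPoints.map { $0.y }
    let minX = xs.min() ?? 0, maxX = xs.max() ?? 1
    let minY = ys.min() ?? 0, maxY = ys.max() ?? 1
    let scale = (outputSide - 4) / max(maxX - minX, maxY - minY, 1)

    let format = UIGraphicsImageRendererFormat()
    format.scale = 1  // exactly outputSide x outputSide pixels
    let renderer = UIGraphicsImageRenderer(
        size: CGSize(width: outputSide, height: outputSide), format: format)

    return renderer.image { context in
        // White background, black strokes.
        UIColor.white.setFill()
        context.fill(CGRect(x: 0, y: 0, width: outputSide, height: outputSide))
        UIColor.black.setStroke()

        for stroke in strokes where !stroke.isEmpty {
            let path = UIBezierPath()
            path.lineWidth = 2
            path.lineCapStyle = .round
            path.lineJoinStyle = .round
            for (i, point) in stroke.enumerated() {
                let mapped = CGPoint(x: (point.x - minX) * scale + 2,
                                     y: (point.y - minY) * scale + 2)
                if i == 0 { path.move(to: mapped) } else { path.addLine(to: mapped) }
            }
            path.stroke()
        }
    }
}
```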
Like any machine learning algorithm, my network needed examples (gestures) to learn from. To make the data as realistic as possible, I wrote an iOS app for inputting and maintaining a data set of gestures on the same touch screen where the network would eventually be used.
Generating data for the neural network to learn from. No recognition happening yet.
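For completeness, here’s a minimal sketch of how a view like this might record strokes; the class name and structure are illustrative rather than lifted from my actual app:

```swift
import UIKit

/// A view that records raw touch locations, one array of points per stroke.
/// The recorded strokes can later be rasterized and saved as labeled examples.
final class GestureCaptureView: UIView {
    private(set) var strokes: [[CGPoint]] = []

    override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
        guard let touch = touches.first else { return }
        strokes.append([touch.location(in: self)])  // start a new stroke
    }

    override func touchesMoved(_ touches: Set<UITouch>, with event: UIEvent?) {
        guard let touch = touches.first, !strokes.isEmpty else { return }
        strokes[strokes.count - 1].append(touch.location(in: self))
    }

    func clear() {
        strokes.removeAll()
    }
}
```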
While I’ll cover the technical details of the implementation in my next article, here’s a summary:
With this approach it doesn’t matter how many strokes you use in your gesture, where you start or finish, or how the speed of your touch varies throughout your strokes. Only the image matters.
This throws away the timing information associated with the user’s gesture. But it sort of makes sense to use an image-based approach when the gesturing user has an image in mind, as when they’re drawing hearts or check marks.
One weakness is that it would be difficult to allow a user to define their own gesture at runtime since most mobile machine learning frameworks can only evaluate, not train neural networks. I think this is a rare use case.
Gesture recognition by my convolutional neural network, at the end of each stroke
I was dumbfounded at how well the neural network performed. Training it on 85% of the 5233 drawings in my data set results in 99.87% accuracy on the remaining 15% (the test set). That means it makes 1 error out of 785 test set drawings. A lot of those drawings are very ugly so this seems miraculous to me.
Note: You don’t need anywhere near 5233 drawings to get similar accuracy. When I first created a data set, I significantly overestimated how many drawings I’d need and spent all day drawing 5011 instances of 11 gestures (about 455 each).
Using only 60 images per gesture for training, I found the network would still reach about 99.4% accuracy on the remaining, unused images. I think about 100 drawings per gesture may be a good number, depending on the complexity of the gestures.
The network is robust to length changes, proportion changes and small rotations. While it’s not invariant to all rotations like $P, there are several ways to imbue a CNN with that property. The simplest method is to randomly rotate the gestures during training, but more effective techniques exist.
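As a sketch of that simplest method, each rasterized gesture image could be rotated by a small random angle before being fed to training. This is just one illustrative way to do the augmentation, and the angle range is an arbitrary choice:

```swift
import UIKit

/// Data augmentation sketch: rotate a rasterized gesture image by a small
/// random angle. Training on such rotated copies helps make the network
/// tolerant of rotated input.
func randomlyRotated(_ image: UIImage, maxDegrees: CGFloat = 20) -> UIImage {
    let angle = CGFloat.random(in: -maxDegrees...maxDegrees) * .pi / 180
    let renderer = UIGraphicsImageRenderer(size: image.size)
    return renderer.image { context in
        let cg = context.cgContext
        // Fill with white so the corners left uncovered by the rotated image
        // match the background used when rasterizing gestures.
        UIColor.white.setFill()
        cg.fill(CGRect(origin: .zero, size: image.size))
        // Rotate around the image center, then draw the original image.
        cg.translateBy(x: image.size.width / 2, y: image.size.height / 2)
        cg.rotate(by: angle)
        image.draw(in: CGRect(x: -image.size.width / 2,
                              y: -image.size.height / 2,
                              width: image.size.width,
                              height: image.size.height))
    }
}
```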
Speed-wise, the app takes about 4 ms to generate an image for the gesture and 7 ms to evaluate the neural network on the (legacy) Apple A8 chip. That’s without trying very hard to optimize it. Because of the nature of neural networks, a large number of new gestures can be added with only a small increase in the size of the network.
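For reference, on-device evaluation could look roughly like the following, assuming the trained network has been exported to a Core ML model; GestureClassifier is a hypothetical generated model class, not the actual one from my project:

```swift
import UIKit
import CoreML
import Vision

/// Sketch of on-device evaluation via the Vision framework, assuming the
/// trained network has been converted to a Core ML model (hypothetically
/// named GestureClassifier) whose input is the rasterized gesture image.
func classifyGesture(image: UIImage,
                     completion: @escaping (String?, Float) -> Void) {
    guard let cgImage = image.cgImage,
          let classifier = try? GestureClassifier(configuration: MLModelConfiguration()),
          let model = try? VNCoreMLModel(for: classifier.model) else {
        completion(nil, 0)
        return
    }
    let request = VNCoreMLRequest(model: model) { request, _ in
        // The top classification observation carries the label and confidence.
        let best = (request.results as? [VNClassificationObservation])?.first
        completion(best?.identifier, best?.confidence ?? 0)
    }
    try? VNImageRequestHandler(cgImage: cgImage).perform([request])
}
```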
Clearly adopting these types of gestures is more of a UI/UX problem than a technical hurdle; the technology is there. I’m excited to see if I can find ways to use them in my clients’ apps.
That’s not to say that the UI/UX problem is trivial. Though I present some ideas here on when it might make sense to use gestures like these, much additional thought is needed. If you have ideas of your own, do share!
As a guideline, it may be a good idea to use gestures like these to perform actions in your app if one or more of the following is true:
You’ll also need to make sure you have a good way for users to learn and review what gestures they can make.
It’s difficult to convey to blind users what motion to make. To help with accessibility, you can have submenus that enable the same actions as your gestures.
It can be tricky to use complex gestures inside scroll views since they interpret any movement of a user’s touch as scrolling.
One idea is to have a temporary “drawing mode” that activates when the user puts the large flat part of their thumb anywhere on the screen. They would then briefly have a chance to gesture some action. The location of the flat tap could also be used in association with the action (e.g. scribble to delete the item that was tapped).
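Here’s a rough sketch of how that flat-thumb detection might work, using the touch’s reported contact radius; the threshold is a guess and would need tuning per device:

```swift
import UIKit

/// Sketch of detecting a "flat thumb" press that could toggle a temporary
/// drawing mode inside a scroll view. The radius threshold is a guess.
final class FlatTouchDetectingView: UIView {
    var onFlatTouch: ((CGPoint) -> Void)?

    override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
        guard let touch = touches.first else { return }
        // majorRadius reports the approximate size of the touch contact area;
        // the flat of a thumb produces a much larger radius than a fingertip.
        if touch.majorRadius > 40 {
            onFlatTouch?(touch.location(in: self))
        } else {
            super.touchesBegan(touches, with: event)
        }
    }
}
```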
The current implementation is very good at distinguishing between the 13 symbols I gave it — exactly what it was trained to do. But it’s not so easy with this setup to decide whether a gesture that the user drew is any of those symbols at all.
Suppose a user draws randomly and the network decides that the closest thing to their gesture is a symbol that represents deletion. I’d rather not have the app actually delete something.
We can solve that problem by adding a class that represents invalid gestures. For that class we make a variety of symbols or drawings that are not check marks, hearts or any of our other symbols. Done correctly, each class should only receive a high score from the neural network when a gesture closely resembles it.
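On the app side, it also helps to refuse low-confidence predictions before acting on them. A minimal sketch, where the “invalid” class name and the 0.95 threshold are illustrative assumptions:

```swift
/// Only act on a prediction when it names a real gesture class (not the
/// catch-all "invalid" class) and its confidence clears a threshold.
func shouldPerformAction(label: String?, confidence: Float) -> Bool {
    guard let label = label, label != "invalid" else { return false }
    return confidence > 0.95
}
```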
A more sophisticated neural network might also use velocity or acceleration data to detect motion-based gestures that produce messy images, like continual circular motions. That network could even be combined with the image-based one by concatenating their layers toward the end of the network.
Some apps, such as games, may need to determine not just that the user made a gesture and the positions where they started or finished (which are easy to get), but additional information as well. For example, for a V-shaped gesture we might want to know the location of the vertex and which direction the V points. I have some ideas for solving this problem that might be fun to explore.
If these articles are popular I’ll probably expand on them in the future. By refining the tools I made here, I think the barrier to adopting these ideas in a new app could be made very small. Once set up it only takes about 20 minutes to add a new gesture (input 100 images, train to 99.5+% accuracy, and export model).
Stay tuned for part 2, where I go into more technical detail on this implementation. In the meantime, here’s the source code.
I’m seeking new client(s) for my freelance work. If you know anyone who’s looking for an iOS or Android developer, please consider sending them my way. It’s much appreciated! :)