Have you ever found yourself in a situation where you need to interact with a smart home device, Smart TV, or Smart Display but can't touch it? Imagine you are in the kitchen with wet or dirty hands and need to switch to the next episode on your Smart TV. Touching the remote or the device is impractical in that moment. Similarly, if your phone starts ringing and you need to quickly pause the music playing on it, a voice command is far more convenient than reaching for the device. Thanks to gesture recognition and voice recognition technology, users can now control their devices multimodally: via gestures or voice.
My name is Daria, and I am a Senior Product Manager with 5+ years of experience in the tech industry, working on products related to hardware, computer vision, and voice recognition. In this article, we'll dive into gesture recognition systems: why and how these services are created, and what both product and technical teams need to consider when developing them. I will share personal examples, my own experience, and reflections on the mistakes my team and I made while developing multimodal navigation for smart devices at SberDevices.
Enter touchscreens, the game-changer for tech interaction. Swipe, tap, and voila – complex actions made easy and user experiences amped up. But it wasn't always like this: before the iPhone was released, the only way to interact with a device was through physical buttons or a stylus.
The iPhone's release marked a turning point, launching us into the touchscreen era, fueled by Apple's extensive research and development. It introduced an intuitive touch interface that eliminated the need for a stylus or buttons, setting it apart from competitors. But now we're on the brink of another transition: from touchscreens to touchless interfaces.
Touchscreens no longer suffice for users' needs, which has led to gesture control emerging as an additional modality of interaction. Designing this new experience is an incredibly creative task. Touchscreen-like devices appeared in Star Trek long before they reached the market, while Blade Runner and Black Mirror showed gesture-driven interfaces. Our sci-fi-inspired imaginations have envisioned touchless interfaces for years, and as technology advances, we eagerly anticipate which other futuristic concepts will become reality.
To make this concrete, let's look at some interaction examples and the use cases they enable. For instance, smart TV devices with cameras, such as Facebook Portal or SberBox Top, are designed with voice-first interfaces thanks to their onboard virtual assistants, which create an exceptional assisted experience. However, these devices still ship with a traditional remote. The presence of a camera adds another dimension of interaction: gestures.
Users can make hand movements in the air without touching the screen or remote – a blend of touchscreen and remote control in which the system responds to gestures. While I believe we'll eventually transition to fully touchless interfaces, the current limits of computation and recognition technology put us in a transitional period. For now, we're building multimodal interfaces that let users choose the most convenient method – voice, remote, or gesture – for accomplishing their tasks.
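To make the multimodal idea concrete, here is a minimal sketch of a dispatcher that routes voice, remote, and gesture events to the same set of intents. This is purely illustrative: the `Intent`, `InputEvent`, and routing names are my own assumptions, not the actual SberDevices architecture.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Intent(Enum):
    """Device-level actions that any modality can trigger."""
    PLAY_PAUSE = auto()
    NEXT_EPISODE = auto()


@dataclass(frozen=True)
class InputEvent:
    modality: str   # "voice", "remote", or "gesture"
    payload: str    # recognized phrase, button code, or gesture label


# Each modality has its own vocabulary, but all of them map to the same intents.
ROUTING = {
    ("voice", "next episode"): Intent.NEXT_EPISODE,
    ("remote", "KEY_NEXT"): Intent.NEXT_EPISODE,
    ("gesture", "swipe_right"): Intent.NEXT_EPISODE,
    ("voice", "pause"): Intent.PLAY_PAUSE,
    ("gesture", "open_palm"): Intent.PLAY_PAUSE,
}


def dispatch(event: InputEvent) -> Optional[Intent]:
    """Resolve an input event to an intent, regardless of how it arrived."""
    return ROUTING.get((event.modality, event.payload))


print(dispatch(InputEvent("gesture", "swipe_right")))  # Intent.NEXT_EPISODE
print(dispatch(InputEvent("voice", "pause")))          # Intent.PLAY_PAUSE
```

The point is that the user's goal ("next episode") is decoupled from how it was expressed, so supporting a new modality means adding routes rather than rebuilding features.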
I will share insights and recommendations derived from my team's work in developing touchless interfaces. These guidelines aim to help others avoid unnecessary mistakes when tackling similar challenges. By using these recommendations as a reference point, you can streamline your process and make more informed decisions.
When designing a gesture recognition system, select movements that meet a few key criteria. For our gesture set, each gesture had to be:

- easy to execute physically;
- culturally appropriate;
- consistent with the rest of the set and logically connected to it;
- unique enough to be recognized reliably;
- adaptable, leaving room for future additions.
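As a toy illustration of how such criteria can be enforced in code, here is a hypothetical gesture vocabulary with basic uniqueness checks; the gesture names, fields, and actions are assumptions made for the example, not our real set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GestureSpec:
    name: str          # unique label the recognition model predicts
    description: str   # what the user physically does
    action: str        # the command it triggers on the device
    one_handed: bool   # easy to execute, e.g. while holding a cup


# A small, extensible vocabulary: adding a gesture is just appending a spec.
GESTURE_SET = [
    GestureSpec("open_palm", "raise an open palm toward the camera", "pause", True),
    GestureSpec("thumb_up", "show a thumbs-up", "like", True),
    GestureSpec("swipe_right", "move a flat hand from left to right", "next", True),
]

# Uniqueness check: no two gestures may share a label or an action.
assert len({g.name for g in GESTURE_SET}) == len(GESTURE_SET)
assert len({g.action for g in GESTURE_SET}) == len(GESTURE_SET)
```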
One of the toughest challenges we faced in creating our gesture recognition system was working with data, on two fronts: how to collect it and how to annotate it.
When we collected datasets from paid respondents, they tended to perform the movements too accurately and mechanically, since they were following specific instructions. In real life, however, people behave quite differently – they may be lounging on a couch or in bed, resulting in more relaxed and imprecise movements. This created a significant gap between the dataset domain and real-world scenarios. To address the issue, we collaborated with actors who could take their time to get into character and exhibit more natural behaviour, allowing us to gather a more diverse and representative set of data.
But that wasn't the end of our problems! After collecting the data, we had to correctly label each movement, which was a daunting task in itself, as it was often difficult to determine where the movement began and ended.
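To give a sense of what that temporal labeling involves, below is a minimal sketch of an annotation record plus a sanity check for segment boundaries; the schema and field names are illustrative assumptions rather than the actual annotation format we used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GestureAnnotation:
    """One labeled gesture inside a recorded clip."""
    clip_id: str
    label: str        # e.g. "thumb_up", "swipe_right"
    start_frame: int  # first frame where the gesture is clearly underway
    end_frame: int    # last frame before the hand returns to rest


def validate(annotations: List[GestureAnnotation], clip_length: int) -> List[str]:
    """Catch common labeling mistakes in a single clip: inverted or
    out-of-range boundaries and overlapping segments."""
    errors = []
    ordered = sorted(annotations, key=lambda a: a.start_frame)
    for i, a in enumerate(ordered):
        if not (0 <= a.start_frame < a.end_frame <= clip_length):
            errors.append(f"{a.clip_id}/{a.label}: bad boundaries {a.start_frame}-{a.end_frame}")
        if i > 0 and a.start_frame < ordered[i - 1].end_frame:
            errors.append(f"{a.clip_id}: overlapping segments at frame {a.start_frame}")
    return errors
```

Even with checks like these, deciding where a movement "begins" is partly a convention, so we also had to agree on labeling guidelines with annotators.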
We faced issues like defining consistent movements, accounting for varied user backgrounds and lighting conditions, and balancing movement complexity against the cost of model retraining. Iterative testing helped us refine the system as we collected data from different angles and under different lighting conditions.
The key aspect of our work was the iterative beta testing that our team started piloting in the early stages when the recognition network was not yet perfect. We conducted a closed beta with respondents, using a false positive detection system. When the network recognized a movement, it saved that frame on the device, and only the device owner had access to these frames. This allowed us to quickly receive feedback on unique real-life cases where we performed poorly. Immediately after receiving feedback, we collected new data on a larger scale to cover that particular case. For example, at the very beginning, the network recognized holding a cup in one's hand as a 'like' gesture, and we collected data from people holding cups to retrain the network.
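As a rough sketch of what such an on-device feedback loop might look like, here is an illustrative snippet that stores the triggering frame locally when the model fires; the threshold, storage layout, and function names are assumptions, not our production code.

```python
import json
import time
from pathlib import Path

# Assumed values for illustration; real thresholds and paths were device-specific,
# and on the device the folder was accessible to the owner only.
CONFIDENCE_THRESHOLD = 0.85
LOCAL_STORE = Path("gesture_feedback")


def on_gesture_detected(frame_bytes: bytes, label: str, confidence: float) -> None:
    """Save the frame that triggered a detection so the owner can review false positives."""
    if confidence < CONFIDENCE_THRESHOLD:
        return
    LOCAL_STORE.mkdir(parents=True, exist_ok=True)
    timestamp = int(time.time() * 1000)
    (LOCAL_STORE / f"{timestamp}_{label}.jpg").write_bytes(frame_bytes)
    meta = {"label": label, "confidence": round(confidence, 3), "timestamp_ms": timestamp}
    (LOCAL_STORE / f"{timestamp}_{label}.json").write_text(json.dumps(meta))
```

Reviewing these saved frames (like the cup-holding 'like' example) told us exactly which scenarios to collect more data for before retraining.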
Designing a gesture recognition system is no easy task, and we encountered several unexpected challenges along the way: