Voice: Anatomy of the Invisible Interface
My seven-year-old son's first interaction with computers was through voice. He asked Apple's Siri trivia questions like "Which is the fastest car?" When we bought an Amazon Echo, my father asked Alexa to play old songs. Voice-based interactions are becoming ubiquitous in our daily lives: Siri on the iPhone, voice assistants like Amazon Alexa and Google Home, and a range of other products. Voice-based interaction may soon rival, and in some cases replace, the graphical user interface; some industry analysts predict that as much as 30% of our interaction with devices will happen over voice within the next two to three years.
Voice is a Natural Means of Interaction
Users have long interacted with computers by typing commands at a keyboard or through a graphical user interface (GUI). Both approaches require users to learn the interface and recall it during every interaction, which creates friction between the user and the computer. Voice reduces that friction; it feels like magic: say a few words and the device grants your wish. Voice is a natural means of interaction.
We’re finally ready for Voice
Voice recognition is not new; it has been around for decades. IBM's Shoebox was among the most advanced speech recognition machines of the 1960s. Early systems required lengthy training to learn a specific user's voice and were limited by the computing power of the time. We are now in an age where computing is cheap, and the rapid spread of computers and phones across the world allows machine learning algorithms to be trained on millions of speech samples from the internet. This gives systems the ability to recognize almost anyone's speech.
Voice makes experience personal
Voice assistants like Siri and Alexa save time on routine tasks such as checking the weather, ordering food, playing music, and replying to messages. They make the experience more personal.
Designing Voice Based Interaction
Designing voice-based interaction revolves around three key concepts: the intent, the utterance, and the slot.
Let’s analyse the following request: “Play relaxing music on Alexa.”
Intent (the Objective of the Voice Interaction)
The intent represents the broader objective of a user's voice command. In this example, the intent is evident: the user wants to hear music.
Utterance (How the User Phrases a Command)
An utterance reflects how the user phrases their request. In the given example, the user asks to play music on Alexa by saying "Play…," but this isn't the only way a user could make this request. For example, the user could also say, "I want to hear music."
You need to consider every variation of an utterance. This helps the engine recognize the request and link it to the right action or response.
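The idea of mapping many phrasings to one intent can be sketched in a few lines of Python. This is a toy illustration, not Alexa's actual interaction model; the intent name `PlayMusicIntent`, the sample phrases, and the `resolve_intent` helper are all assumptions made for the example.

```python
# Illustrative sketch: several utterance variations resolve to one intent.
# Real voice platforms use trained NLU models, not exact string matching.
SAMPLE_UTTERANCES = {
    "PlayMusicIntent": [
        "play music",
        "play me some music",
        "i want to hear music",
        "put on some music",
    ],
}

def resolve_intent(utterance):
    """Return the intent whose sample utterances contain the normalized input."""
    normalized = utterance.lower().strip().rstrip(".!?")
    for intent, samples in SAMPLE_UTTERANCES.items():
        if normalized in samples:
            return intent
    return None  # unrecognized phrasing

print(resolve_intent("I want to hear music"))  # → PlayMusicIntent
print(resolve_intent("What is the weather?"))  # → None
```

The more utterance variations you enumerate (or train on), the more phrasings resolve to the right intent instead of falling through to `None`.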
Slots (the Required or Optional Variables)
Sometimes an intent alone is not enough, and more information is required from the user in order to fulfil the request. Alexa calls this a “slot,” and slots are like traditional form fields in the sense that they can be optional or required, depending on what’s needed to complete the request. In our case, the slot is “relaxing,” but since the request can still be completed without it, this slot is optional.
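Continuing the toy sketch above, an optional slot can be modeled as a value that is filled when present and simply omitted when not. The `mood` slot name, the regex, and the `parse_play_request` helper are illustrative assumptions; real Alexa slots are declared in a skill's interaction model, not parsed with regular expressions.

```python
import re

# Words we treat as valid fillers for the optional "mood" slot (assumed list).
KNOWN_MOODS = {"relaxing", "upbeat", "classical"}

def parse_play_request(utterance):
    """Return (intent, slots) for a 'play ... music' utterance, else (None, {})."""
    match = re.search(r"play\s+(?:(\w+)\s+)?music", utterance.lower())
    if not match:
        return None, {}
    slots = {}
    mood = match.group(1)
    if mood in KNOWN_MOODS:  # optional slot: filled only when the user says it
        slots["mood"] = mood
    return "PlayMusicIntent", slots

print(parse_play_request("Play relaxing music on Alexa"))
# → ('PlayMusicIntent', {'mood': 'relaxing'})
print(parse_play_request("Play music"))
# → ('PlayMusicIntent', {})
```

Both requests resolve to the same intent; the slot merely refines it, which is exactly why the request can still be completed when the slot is absent.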
Although computers can now recognize speech reliably and respond in natural-sounding voices, they still don't understand context. Siri, for example, often falls back to "I don't understand" when a request goes beyond a simple command. Voice assistants respond well to simple commands but fail miserably at following context or holding a conversation. This hurdle needs to be overcome if voice is to flourish and be widely adopted by consumers.
When Douglas C. Engelbart demonstrated the keyboard and mouse, it changed the way we interact with computers. Voice has similar potential to drive a big shift in how we interact with them. The need for voice is real, and early experiences are already having a positive impact on the way users interact with computers. Hopefully this leads to a more accessible world.