Voice Augmented Applications

by Taimur Shah, February 2nd, 2018

Principles for Multi-Modal Interfaces

On the Watson Cognitive Environments team, we explored ways that Watson could improve human-computer interaction. We built various gesture-based interactions using body tracking and spatial input tools, along with both passive and active voice-driven applications. Ultimately, what was actually useful for our target environment was augmenting applications driven by spatial inputs (think Wii remote) with voice commands. We discovered a few key principles while refining this interaction pattern, both for improving the performance of the system and for delivering a better user experience.

The Case for Voice

Why would you ever want voice commands on an application you can already use with regular point-and-click methods? One case is cutting through deeply nested menus. Another is selecting an option from a long list that you don’t want to read through. Any time typing is annoying, for example on a tiny touch keyboard, voice can be better. Or maybe multiple people want to collaborate on an application together, and voice is easier than passing around a keyboard. Our use case featured all of these scenarios, and so theoretically benefited greatly from the addition of voice. However, there were several issues we ran into during implementation.

Speech to Text Performance

While every vendor’s voice recognition service continues to get better over time, you can’t escape the fact that none of them is perfect. Your system will make mistakes. There are a few key optimizations you can make with IBM Watson’s Speech to Text service that improve performance significantly and are necessary for a good user experience.

Custom Language Model

If your application has a largely static, global menu of buttons/commands that can be triggered at any time, you should add every way of triggering them to a custom language model for your STT service. By providing this language model, you are telling the service that the probability of hearing these phrases is high, and the service can bias its transcription towards this expected output. This significantly improves how accurately those commands are detected, and with it the overall performance of your application.
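
As a rough illustration, here is a sketch of seeding such a model with a static command menu via the Watson STT customization REST endpoints. The TypeScript is mine, not from our original system: the service URL, credentials, base model name, and the menuCommands list are placeholder assumptions you would replace with your own.

```typescript
// Sketch: seeding a Watson STT custom language model with an application's
// static command menu. Endpoint paths follow the STT customization REST API,
// but the service URL, auth header, base model, and command list are
// placeholders -- substitute your own instance details.
const STT_URL =
  "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE";
const AUTH =
  "Basic " + Buffer.from("apikey:YOUR_API_KEY").toString("base64");

const menuCommands = [
  "turn on lights",
  "turn off lights",
  "show the floor plan",
  "zoom in on quadrant three",
];

async function buildCommandModel(): Promise<string> {
  // 1. Create an empty custom model on top of a base English model.
  const created = await fetch(`${STT_URL}/v1/customizations`, {
    method: "POST",
    headers: { Authorization: AUTH, "Content-Type": "application/json" },
    body: JSON.stringify({
      name: "menu-commands",
      base_model_name: "en-US_BroadbandModel",
    }),
  }).then(r => r.json());
  const customizationId: string = created.customization_id;

  // 2. Add every phrasing that should trigger a menu item as a corpus.
  await fetch(
    `${STT_URL}/v1/customizations/${customizationId}/corpora/menu-commands`,
    {
      method: "POST",
      headers: { Authorization: AUTH, "Content-Type": "text/plain" },
      body: menuCommands.join("\n"),
    }
  );

  // 3. Train the model. It can't be used until training completes, which is
  //    why this approach only fits largely static command sets.
  await fetch(`${STT_URL}/v1/customizations/${customizationId}/train`, {
    method: "POST",
    headers: { Authorization: AUTH },
  });

  // Pass this ID as language_customization_id when opening recognize requests.
  return customizationId;
}
```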

Dynamic Keyword Spotting

Another argument you can provide to the Watson STT service is a list of keywords you expect to hear. As with the language model, the service becomes extra sensitive to detecting words on that list. The practical difference is that a custom language model takes time to train before it can be used; it performs better as a result, but you can’t generate models in real time and swap them in as your user moves through the application. Keywords, by contrast, can be supplied per request, so whenever you have voice-selectable content that changes over time in unpredictable ways, you should supply it to the service as keywords. This is especially useful for using voice to sift through a large, dynamically generated list. You will need to do some extra engineering work here, because your application needs to communicate portions of its state to the STT service, and the Watson API does not make this very easy (you will need to create a new WebSocket connection to the service for each update). However, the payoff in user experience is well worth the extra effort.
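
To make that concrete, below is a minimal sketch of the reconnect-per-update pattern over the STT WebSocket interface, assuming the documented start frame fields (keywords, keywords_threshold, interim_results). The URL, access token handling, and audio streaming are placeholders, and the helper names are illustrative rather than taken from our actual code.

```typescript
import WebSocket from "ws";

// Sketch: reconnecting to the Watson STT WebSocket whenever the set of
// voice-selectable items changes. The "start" frame fields are from the
// documented WebSocket interface; the URL, access token handling, and the
// microphone plumbing are placeholders.
const WS_URL =
  "wss://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE/v1/recognize";

let socket: WebSocket | null = null;

function reconnectWithKeywords(keywords: string[], accessToken: string) {
  // The API has no "update keywords" message, so tear down and reconnect.
  socket?.close();
  const ws = new WebSocket(`${WS_URL}?access_token=${accessToken}`);
  socket = ws;

  ws.on("open", () => {
    ws.send(
      JSON.stringify({
        action: "start",
        "content-type": "audio/l16;rate=16000",
        keywords,                // e.g. the labels currently visible in a long list
        keywords_threshold: 0.5, // minimum confidence for a keyword hit
        interim_results: true,
      })
    );
    // ...start forwarding microphone audio to ws as binary frames here...
  });

  ws.on("message", raw => {
    const msg = JSON.parse(raw.toString());
    // Keyword hits arrive under results[i].keywords_result in the response.
    if (msg.results) console.log(JSON.stringify(msg.results));
  });
}

// Call this whenever the application's selectable content changes, e.g.:
// reconnectWithKeywords(visibleItems.map(item => item.label), token);
```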

Remembering the Commands

There is no use asking your user to memorize the voice commands, not only because you shouldn’t put extra burden on the user, but because it doesn’t work. When we use applications today, there’s very little we need to remember. Typically every action you might want to take labels and explains itself, and is visible at all times. You don’t need to remember where the buttons are — you visually search the page, find what you’re looking for and then click it. An incredibly useful pattern we started using was to label every button with the voice command that triggers it. For toggles, this means that clicking the button or speaking the command switches the label to its opposite: a toggle labeled “turn on lights” becomes “turn off lights” once it’s selected. Now your user doesn’t need to remember anything, and every command is discoverable with a visual search of the application.
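
A minimal sketch of this pattern follows. None of these names come from our application; it just shows one way to make the on-screen label, the click handler, and the voice command the same piece of state, with toggles swapping their label after firing.

```typescript
// Sketch of the "label is the command" pattern: every visible button label
// doubles as its voice command, and toggles swap their label (and therefore
// their command) after firing. All names here are illustrative.
interface VoiceButton {
  label: string;        // what's shown on screen AND what the user says
  onTrigger: () => void;
  toggleLabel?: string; // if present, swap to this label after triggering
}

const buttons: VoiceButton[] = [
  {
    label: "turn on lights",
    toggleLabel: "turn off lights",
    onTrigger: () => console.log("lights toggled"),
  },
  { label: "show floor plan", onTrigger: () => console.log("showing floor plan") },
];

// One handler serves both the click path and the voice transcript path.
function handleCommand(utterance: string) {
  const button = buttons.find(
    b => b.label.toLowerCase() === utterance.trim().toLowerCase()
  );
  if (!button) return;
  button.onTrigger();
  if (button.toggleLabel) {
    // Swap label and toggle target so the UI always shows the next valid command.
    const next = button.toggleLabel;
    button.toggleLabel = button.label;
    button.label = next;
  }
  // The current labels are also exactly what should be fed to the STT service
  // as keywords or language-model phrases.
}
```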

Wake Words

A wake word is a phrase you utter to tell the system you are about to talk to it. When building your application, you will need to decide whether to use one, and there are some performance considerations to take into account. With a cloud speech-to-text service, you (the client) send audio data over the network and expect text to come back in the correct order. Transcription takes some nonzero time, and if the microphone is open while people talk for an extended period, you end up with a significant gap between what is being said in real time and what is coming back from the STT service. In other words, there will be lag. This is (partly) why your phone and home voice assistant use dedicated hardware circuits and offline processing to detect the wake word. Needless to say, this is a significant barrier to overcome when trying to implement your own voice augmented application.

Our solution was to run a separate process that used Microsoft’s offline speech-to-text service with a limited dictionary of English words, dedicated solely to detecting our wake word (credit to Tom Wall for the implementation). When this process detected the wake word, it sent a message to our Watson Speech to Text client, which would then begin processing audio. Another, much simpler, option is a “push to talk” button or a hardware switch to unmute/mute the microphone.
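
Setting the offline wake-word detector itself aside (ours lived in a separate process), the gating logic on the client can be as simple as the sketch below. Everything here is illustrative: the "wake" event name, the fixed listening window, and the audio plumbing are assumptions rather than our original implementation.

```typescript
import { EventEmitter } from "events";

// Sketch of gating the cloud STT stream behind a wake-word signal. The offline
// detector is treated as a black box that emits a "wake" event; the event name,
// the fixed listening window, and the audio plumbing are all assumptions.
class MicrophoneGate {
  private listening = false;
  private timer?: NodeJS.Timeout;

  constructor(
    wakeDetector: EventEmitter,                 // fires "wake" when the phrase is heard
    private sendToStt: (chunk: Buffer) => void, // e.g. ws.send(chunk) from the earlier sketch
    private windowMs = 10_000
  ) {
    wakeDetector.on("wake", () => this.open());
  }

  private open() {
    this.listening = true;
    clearTimeout(this.timer);
    // Close the gate after a fixed window so the cloud service never falls
    // behind transcribing hours of ambient conversation.
    this.timer = setTimeout(() => (this.listening = false), this.windowMs);
  }

  // Called for every audio chunk coming off the microphone.
  onAudioChunk(chunk: Buffer) {
    if (this.listening) this.sendToStt(chunk);
    // Otherwise drop the chunk -- nothing leaves the machine.
  }
}
```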

Conclusion

There are currently not many applications and use cases that warrant voice control, but I believe over time there will be more. Many people today focus on building applications as chatbots and work towards extending those models to accept voice. However, I believe humans and machines can communicate significantly better through a mix of conventional UI and voice. If transcription technology is not yet where we need it to be, language synthesis and the ability of machines to respond naturally in English are even further behind. The state of the art in conversational interfaces today involves a developer designing large conversation trees in an attempt to anticipate all user utterances and respond to them accordingly. Machines, moreover, have a huge advantage in being able to communicate visually, which is a significantly higher-bandwidth mode of communication. There is little reason to strip away that advantage, and we should place more emphasis on what can be gained by building multi-modal interactions, combining the strengths of machines with what is convenient for humans.