I’ve published twenty different voice-activated chatbots on the Alexa platform. In the past few months, I’ve shifted my focus to building chatbots for Facebook Messenger using Amazon Lex. Same machine learning models = easy port, right? Now I know better, am humbled, and am willing to share my mistakes so that others understand the differences. The hidden challenge is deciphering the spelling that comes in from a text message. Here is what I’ve learned.
For background, underneath the covers, here is how the machine learning models get used by both platforms.
Basic Alexa Architecture w/ ML models
Basic Lex Architecture w/ ML models
The key difference is the inbound media: text versus a sound file. The Natural Language Understanding (NLU) models are the same. I can reuse sample utterances and custom slots. The patterns around intents triggering Lambda functions are the same. What is different is Alexa’s use of Automated Speech Recognition (ASR) models to translate sound into text, versus receiving what the user types into Facebook Messenger. What I have found is that this goes beyond speech-to-text conversion. There are additional services being provided around spell checking and language interpretation. Let me explain further with an example.
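To make that shared plumbing concrete, here is a minimal sketch of a Lex fulfillment handler in Python. It follows the classic Lex-to-Lambda event contract, where inputTranscript carries the exact text the user typed; the GetCalories intent and FoodItem slot are placeholder names I made up for illustration, not necessarily what my repo uses.

```python
# A minimal sketch of a Lex (V1-style) fulfillment Lambda.
# The intent and slot names are placeholders for illustration.

def lambda_handler(event, context):
    # Lex hands the Lambda the raw text the user typed, plus the intent
    # and slot values its NLU models resolved from that text.
    user_text = event.get("inputTranscript", "")
    intent = event["currentIntent"]["name"]
    slots = event["currentIntent"]["slots"]

    if intent == "GetCalories" and slots.get("FoodItem"):
        message = "Looking up calories for {}.".format(slots["FoodItem"])
    else:
        message = "Sorry, I couldn't find a menu item in: '{}'.".format(user_text)

    # Standard "Close" response telling Lex this conversation turn is done.
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }
```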
I’ve been working on a chatbot that has been out on the Facebook Messenger platform for a few months. The repo, including source code and documentation, can be found here.
Chuck’s purpose is to answer questions about how many calories are in a basic meal at a fast food restaurant. I’ve scraped more than a dozen popular restaurants’ websites, modeled the data, and then fed everything into Lex.
So you should be able to ask Chuck how many calories are in this salad.
Photo courtesy of Wikipedia Commons.
Now think for a minute how you would phrase this query in voice vs. a text. The photo is a caesar salad, which is a fairly easy word to pronounce, so it’s straightforward for a voice command to recognize and translate. Texting through a client like FB Messenger is a challenge given that caesar is not an easy word to spell. I want the bot to be user friendly, so I have been training it on all the different ways someone might spell caesar. That requires extra entries in the custom slots that drive the NLU models, as well as code within the Lambda functions. None of this is required in a voice chatbot.
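To give a flavor of that extra Lambda-side work, here is one way to fold common misspellings back into a canonical menu term before doing any lookup. The mapping and the normalize_food helper are illustrative, not lifted from the repo.

```python
# Illustrative only: map common misspellings to the canonical menu term
# before doing any lookup. These entries are examples, not the full list
# the bot is actually trained on.
MISSPELLINGS = {
    "cesar salad": "caesar salad",
    "ceasar salad": "caesar salad",
    "caeser salad": "caesar salad",
    "croisant": "croissant",
    "crossant": "croissant",
    "quesadia": "quesadilla",
    "quesadila": "quesadilla",
}

def normalize_food(raw):
    """Lowercase, trim, and fold known misspellings into canonical terms."""
    cleaned = raw.strip().lower()
    return MISSPELLINGS.get(cleaned, cleaned)
```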
Let’s try another.
Photo courtesy of Wikipedia Commons
Any takers on how many different ways there are to text the word croissant? Same challenge: a voice-driven bot will navigate this more easily than a text-driven one, given the difficulty of the spelling.
Not convinced? Let’s do one more example.
My photo from flickr
Once again, it’s much easier to say in a voice query than to spell. For the high achievers who want the answer, it’s spelled quesadilla.
So the key takeaway here is that in a voice-driven chatbot (i.e. an Alexa skill), the text coming into the NLU models is produced by the speech recognition service, so it won’t contain spelling errors no matter how difficult the terms are. With a text-based chatbot, that same input comes directly from users, so it does contain spelling errors, especially for difficult words.
These examples may be extreme, but they make it easy to see that there is more to speech recognition models than converting a wave file to text. ASR is inherently doing spell check, because the training data used to build the models was properly spelled.
Even for basic words (chicken, steak, burger, etc.), there are misspellings. Much of this comes down to the interface being used. On a desktop, keys on the keyboard are farther apart, so you are less likely to miss a keystroke. On a mobile device, it’s very easy to miss one and end up a letter off.
Yes, client applications can build in squiggly lines to warn the user that they are about to misspell a word, but that usually requires some intervention before the send button is pressed.
Also, when humans receive texts, we accept a lower standard of spelling because we know the medium is susceptible to these types of errors. Unfortunately, the matching and lookups written in source code aren’t so forgiving!
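One way to make those lookups a bit more forgiving of the one-letter-off typos described above is fuzzy matching against the known menu terms. Here is a sketch using Python’s standard library; it isn’t necessarily how Chuck handles it today.

```python
from difflib import get_close_matches

# Hypothetical list of canonical menu terms; in practice this would come
# from the scraped restaurant data.
MENU_TERMS = ["chicken nuggets", "caesar salad", "croissant", "quesadilla", "burger"]

def closest_menu_term(user_text):
    """Return the closest known menu term, tolerating small typos."""
    matches = get_close_matches(user_text.strip().lower(), MENU_TERMS, n=1, cutoff=0.8)
    return matches[0] if matches else None

# e.g. closest_menu_term("quesadila") -> "quesadilla"
```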
Over time I’ve evolved the dialog flow to give hints along the way. For example, when a food item isn’t found, the user gets redirected to ask more generic questions.
This helps by providing the proper spellings, as well as the official terms being used. Transposition errors will continue to occur, given that users are unlikely to copy and paste in a response.
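Here is roughly what that fallback could look like in the Lambda when a food item can’t be matched: respond with a hint that shows properly spelled terms. The wording and data below are illustrative, not the bot’s actual content.

```python
# Illustrative fallback data: a few properly spelled menu terms.
# In the real bot this would come from the scraped nutrition data.
POPULAR_ITEMS = ["caesar salad", "croissant", "quesadilla", "chicken nuggets"]

def not_found_response():
    """Build a Lex 'Close' response that nudges the user toward valid, properly spelled terms."""
    hints = ", ".join(POPULAR_ITEMS)
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": "I couldn't find that item. Try asking about one of these: " + hints + ".",
            },
        }
    }
```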
Let’s go with another example, what would you ask Chuck for with this?
Six Count Chicken Nuggets — just like on the menu right?
Here are the sample utterances for this intent in the model that are used for training the bot.
This may seem comprehensive, but how we text isn’t how we speak. A gap that I constantly tune is around abbreviations, because users ask questions like this.
The problem goes beyond spelling, as users abbreviate constantly when texting. That’s not a spelling problem, but it’s closely related, and it doesn’t show up in voice-driven applications. The NLU models in Lex are actually quite good at deciphering abbreviations, but they aren’t 100%.
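For the cases the NLU models miss, a pre-processing pass can expand common texting abbreviations before the text reaches the matching logic. Again, the entries below are examples for illustration, not the actual tuning list.

```python
# Illustrative abbreviation expansions applied to the inbound text before
# matching; the entries are examples, not the bot's actual tuning list.
ABBREVIATIONS = {
    "pc": "count",
    "pcs": "count",
    "nugs": "nuggets",
    "cals": "calories",
    "sando": "sandwich",
}

def expand_abbreviations(text):
    """Replace known texting abbreviations word by word."""
    words = text.lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# e.g. expand_abbreviations("cals in 6 pc nugs") -> "calories in 6 count nuggets"
```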
It’s similar to what we learned in mobile app development over the past decade or so: each chatbot will need to be customized for its interface. NLU models that were originally developed for voice, with correctly spelled terms, will need to evolve with additional services for text-driven modes. “Forklifting” voice-driven chatbots to text will be a struggle from a user-experience standpoint, so be wary of the “write once, publish everywhere” claims made by tools.