NLU (natural language understanding) isn’t necessarily the main challenge in moving from web to voice development — think constrained input, context & validation.
Consider this very basic web form:
Slap in some client-side validation to check for a numeric value and you’re done. This problem’s been solved for decades.
Now let’s say you’re building an Alexa Skill and want to prompt for the same information. You hope it goes a little something like this…
Alexa (blue) prompting & user (black) replying.
You create an Intent which takes a couple of entities (aka “slots”) to capture the quantity and unit (metric vs. Imperial) and provide some sample utterances:
I weigh {QUANTITY} {UNIT}
I weigh {QUANTITY}
{QUANTITY} {UNIT}
{QUANTITY}
About {QUANTITY} {UNIT}
About {QUANTITY}
etc. etc.
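For illustration, the intent might be sketched roughly like this (the real interaction model is JSON in the Alexa developer console; the intent name, the custom unit slot type and the use of the built-in AMAZON.NUMBER slot are assumptions on my part):

```python
# Rough sketch of the intent, mirroring what the JSON interaction model
# might contain. "WeightIntent" and "WEIGHT_UNIT" are illustrative names.
weight_intent = {
    "name": "WeightIntent",
    "slots": [
        {"name": "QUANTITY", "type": "AMAZON.NUMBER"},
        {"name": "UNIT", "type": "WEIGHT_UNIT"},  # custom type: pounds, kilograms, ...
    ],
    "samples": [
        "I weigh {QUANTITY} {UNIT}",
        "I weigh {QUANTITY}",
        "{QUANTITY} {UNIT}",
        "{QUANTITY}",
        "About {QUANTITY} {UNIT}",
        "About {QUANTITY}",
    ],
}
```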
Having Alexa or any NLU system understand this is quite straightforward.
But that’s not where the work lies. Users & Alexa have other plans…
Actually, you could receive several things:
The user could say their weight without specifying the unit (“142.5”), or Alexa could fail to hear their weight correctly (“42.5”), or might correctly identify the unit but not the quantity (“42 pounds”).
Alexa might detect something which isn’t even a number and doesn’t make sense in this context (depicted by the ‘xxx?!’), though it could also be the user deliberately switching topic. Or there could be silence, either because the user doesn’t reply at all or because Alexa fails to detect their voice.
Very quickly you’ll learn that in voice development, you’re continually building around the unpredictability and unconstrained nature of the input. With the web, you can constrain the user’s input — in this case, to a numeric value (their weight) and one of two select options (“lbs” or “kg”). With voice, it’s unbounded.
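To make that concrete, here’s a minimal validation sketch covering the cases above. It assumes the slot values arrive as plain (possibly empty) strings; the function name, the failure reasons and the “plausible” ranges are all illustrative, not part of any SDK.

```python
from typing import Optional, Tuple, Union

# Plausible ranges are illustrative, not authoritative.
PLAUSIBLE_LBS = (45.0, 660.0)
PLAUSIBLE_KG = (20.0, 300.0)

def validate_weight(quantity: Optional[str], unit: Optional[str]) -> Tuple[str, Union[float, str]]:
    """Return ('ok', weight_in_kg) or ('reprompt', reason)."""
    if not quantity:                       # silence, or no numeric value detected
        return ("reprompt", "no_quantity")
    try:
        value = float(quantity)
    except ValueError:                     # the 'xxx?!' case: not a number at all
        return ("reprompt", "not_a_number")
    if not unit:                           # "142.5" with no unit given
        return ("reprompt", "missing_unit")
    unit = unit.lower()
    if unit in ("pounds", "pound", "lbs"):
        low, high = PLAUSIBLE_LBS
        kg = value * 0.453592
    elif unit in ("kilograms", "kilos", "kg"):
        low, high = PLAUSIBLE_KG
        kg = value
    else:
        return ("reprompt", "unknown_unit")
    if not (low <= value <= high):         # e.g. "42.5" heard when the user said "142.5"
        return ("reprompt", "implausible_value")
    return ("ok", round(kg, 1))
```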
You can mitigate some of the risk by providing more guidance to the user in the initial prompt:
At least we’re signalling that we’d like a unit (pounds or kilograms) and that decimals are accepted. It doesn’t mean the user will actually provide them.
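Building on the validation sketch above, one way to surface that guidance only when it’s actually needed is to pair each failure reason with a more explicit reprompt (the wording here is purely illustrative):

```python
# Pair each validation failure with a more explicit reprompt, so the
# extra guidance is only added once the first attempt has failed.
REPROMPTS = {
    "missing_unit": "Was that in pounds or kilograms?",
    "not_a_number": "Sorry, I need a number, for example 'one hundred and forty two point five pounds'.",
    "implausible_value": "That doesn't sound right. What's your weight, in pounds or kilograms?",
    "no_quantity": "Sorry, I didn't catch that. What's your weight, in pounds or kilograms?",
    "unknown_unit": "I can work with pounds or kilograms. Which would you like to use?",
}

def reprompt_for(reason: str) -> str:
    return REPROMPTS.get(reason, REPROMPTS["no_quantity"])
```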
Anything can and does happen: people will continually surprise you, and the tech’s not at 100%. Hey, even we humans mishear.
So let’s return to our conversational flow and consider the implications for us:
From left to right:
Actually, there’s another possibility too:
This might be a good time to remind ourselves that all of this is essentially one form field on the web:
Yeah.
Referring to the red, numbered arrows in turn:
But consider the other options …
I don’t want to write code based on this mess. And neither do you. While a conversational flow is useful when considering the main pathways at the design stage, a developer needs a better way of representing things…
Let’s redraw that in terms of states:
Whilst that appears more manageable, it doesn’t negate the need to do a massive amount of validation.
Btw, this also reflects how conversational engines like Dialogflow work: by mapping utterances to intents whilst considering input contexts (i.e. incoming states) based on the data already known.
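As a rough illustration of that idea, routing can be expressed as a lookup from (incoming state, intent) to a handler. The states and handler names below are invented for the example; only the AMAZON.* intent names are real Alexa built-ins.

```python
# Toy sketch of context-aware routing: which handler runs depends on
# both the incoming state and the matched intent.
ROUTES = {
    ("awaiting_weight", "WeightIntent"):      "handle_weight",
    ("awaiting_weight", "AMAZON.HelpIntent"): "explain_weight_prompt",
    ("confirm_weight",  "AMAZON.YesIntent"):  "save_weight",
    ("confirm_weight",  "AMAZON.NoIntent"):   "reprompt_weight",
}

def route(state: str, intent: str) -> str:
    # Fall back to a generic handler when no (state, intent) pair matches.
    return ROUTES.get((state, intent), "handle_fallback")
```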
There’s just one problem… they don’t scale well.
Let’s up the ante, and add another field to our web form:
Woo.
Guess what it means for your conversational flow diagram or state diagram? OK, time’s up, here’s one I baked earlier:
And this doesn’t even consider global intents like “help”, “start over”, “go back”, “cancel”… things which we don’t need on the web because, you know, users have like a back-button, can scroll and don’t have to reply within 8 seconds.
It’s at this stage that any developer coming from mature web frameworks would be forgiven for asking themselves whether they really want to go to THIS much effort building around every eventuality.
Dialogflow may route based on incoming state, but Alexa doesn’t work like that.
Alexa is essentially a flat list of intents. When it hears “yes”, it doesn’t know whether the user is confirming their age or their weight. When it hears “78”, it doesn’t know if you’ve just asked them for their weight or their age.
Bottom line: if you’re developing for Alexa, it’s totally up to you to manage all context and the routing which comes with it.
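In practice that usually means recording what you last asked for (session attributes are the natural place on Alexa) and branching on it yourself. A framework-agnostic sketch, with all names invented for the example:

```python
# Disambiguate a bare number like "78" by tracking what we last asked for.
def handle_number(session_attributes: dict, number: float) -> str:
    last_prompt = session_attributes.get("last_prompt")
    if last_prompt == "ask_weight":
        session_attributes["weight"] = number
        session_attributes["last_prompt"] = "ask_age"
        return "Got it. And how old are you?"
    if last_prompt == "ask_age":
        session_attributes["age"] = number
        session_attributes["last_prompt"] = None
        return "Thanks, that's everything I need."
    # A number arrived out of context: treat it as noise and re-anchor.
    return "Sorry, what would you like to do?"
```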
Remember this prompt?
Putting aside that it’s not a very natural way to speak, it’s long. It might be acceptable first time you hear it, but you don’t need to hear all of it the second time around. So if you’re building services which will be used repeatedly, it’s good to serve content — including prompts — based on the experience of the user.
You don’t need to do that with websites because advanced users are going to fast-track/shortcut themselves to where they want to go, clicking along their usual pathways while barely skimming the text. The web is non-linear, whereas an audio stream is not.
Even if you have a shorter version for advanced users, they don’t want to hear the exact same wording each day. So you’re probably going to provide some sort of variation in wording.
This is, thankfully, one area where frameworks can help. But then again, you can roll this yourself too.
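Rolling it yourself might look something like this: a fuller prompt for first-time users, and a pool of shorter variants for everyone else. The wording and the visit-count check are, of course, just illustrative.

```python
import random

# A fuller prompt for first-time users; shorter, varied prompts after that.
FIRST_TIME_PROMPT = (
    "Welcome. To get started, tell me your weight, for example "
    "'one hundred and forty two point five pounds'. "
    "You can use pounds or kilograms, and decimals are fine."
)
RETURNING_PROMPTS = [
    "What's your weight today?",
    "Your weight today?",
    "How much do you weigh today?",
]

def opening_prompt(visit_count: int) -> str:
    if visit_count <= 1:
        return FIRST_TIME_PROMPT
    return random.choice(RETURNING_PROMPTS)
```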
Some might think this is a tedious chore, but not you, you badass, you love your craft.
First off, don’t despair. If you’re creating a simple skill, just acting on commands issued by the user, you’ll circumvent most of this. Plus, it’ll get easier with time — as the speech recognition, NLU and developer frameworks further improve.
In the meantime:
Good luck and thank you for reading this far.
p.s. If you ARE having difficulty with getting Alexa to grok users’ utterances, these tips may help.