Spoken Language Understanding (SLU) and Natural Language Understanding (NLU) aim to help machines understand human language. The main difference is the input data type: SLU deals with understanding speech, whereas NLU deals with understanding text. NLU is a component of SLU, whether it is trained independently or jointly with it.
Research on NLU started in the 1960s: Bobrow's Ph.D. dissertation (the STUDENT program), Weizenbaum's ELIZA, a mock psychotherapist chatbot, and Winograd's SHRDLU are the pioneering works in this space. SLU gained popularity with the recent advances in speech recognition powered by deep learning. The query "spoken language" returns over 1,000 studies on both Amazon's and Microsoft's research publications websites.
Conventional SLU processes utterances in two steps: Speech-to-Text (STT) first, then NLU. Once STT transcribes the speech to text, NLU extracts meaning from the transcribed text. Overall performance relies on independently trained STT and NLU modules, so erroneous STT output leads to incorrect NLU predictions, and the machine fails to capture what the user actually said. Many voice applications, including voice assistants such as Alexa, Siri, and Google Assistant, use this approach.
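A minimal sketch of the conventional two-step pipeline is below. Here `transcribe` and `parse_intent` are hypothetical stand-ins for a real STT service and NLU engine, and the sample utterance and intent schema are illustrative only:

```python
def transcribe(audio: bytes) -> str:
    # Hypothetical STT stand-in; a real system would call an STT service here.
    return "order a large pepperoni pizza"

def parse_intent(text: str) -> dict:
    # Hypothetical NLU stand-in; a real system would run an NLU engine here.
    return {"intent": "orderPizza", "slots": {"size": "large", "topping": "pepperoni"}}

def understand(audio: bytes) -> dict:
    # Conventional SLU: transcribe first, then extract meaning from the text.
    # Any STT error propagates directly into the NLU prediction.
    return parse_intent(transcribe(audio))
```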
Modern SLU uses an end-to-end model instead of two distinct components. Developers train STT and NLU jointly, resulting in higher accuracy.
Picovoice calls this Speech-to-Intent, as it infers users' intents directly from speech. Amazon calls it FANS: Fusing ASR and NLU for SLU.
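In code, the difference shows up in the interface. The sketch below is a hypothetical stand-in for a jointly trained model; note that no intermediate transcript is exposed:

```python
def understand_end_to_end(audio: bytes) -> dict:
    # Hypothetical stand-in for a jointly trained end-to-end SLU model:
    # audio maps straight to an intent, with no intermediate transcript
    # for transcription errors to propagate through.
    return {"intent": "orderPizza", "slots": {"size": "large", "topping": "pepperoni"}}
```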
Which approach is better? The answer is "it depends": on the availability of corpora and information. If they are available, the modern end-to-end SLU is the better choice; if not, the conventional SLU is. Text-based understanding (NLU) has been around longer than speech-based understanding (SLU) and thus has richer datasets.
For domain-specific applications such as IVR systems, menu navigation on a website, or ordering food at a quick-service restaurant (QSR), the modern end-to-end SLU is preferable. Nobody would discuss the meaning of life with a voice assistant while ordering a hamburger. For open-domain use cases, such as voice assistants like Alexa, the conventional (cascading) SLU works better given the variety of topics they cover. One can discuss the meaning of life with Alexa, although there are better options.
Rasa: Rasa is an open-source NLU engine that processes text inputs. The core software is free, and Rasa offers paid support and consulting services. Anyone can pair it with a speech-to-text service of their choice and run Rasa on the transcribed text.
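A minimal sketch of querying Rasa over HTTP, assuming a locally running Rasa server started with its API enabled (`rasa run --enable-api`) and a trained model loaded; the sample utterance and intent name are illustrative:

```python
import requests

# Parse a transcript with a locally running Rasa server.
response = requests.post(
    "http://localhost:5005/model/parse",
    json={"text": "book a table for two at seven"},
)
result = response.json()
print(result["intent"])    # e.g. {"name": "book_table", "confidence": 0.97}
print(result["entities"])  # extracted entities, if any
```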
Snips: Snips is an open-source SLU engine that uses the conventional method. Snips stopped maintaining it after being acquired by Sonos, yet the repo is still available on GitHub and used by developers.
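A minimal sketch of the archived `snips-nlu` Python package, assuming the English resources have been downloaded (`snips-nlu download en`); the dataset file name and sample utterance are placeholders, and the dataset must follow Snips' JSON format:

```python
import io
import json

from snips_nlu import SnipsNLUEngine
from snips_nlu.default_configs import CONFIG_EN

# Train the engine on a Snips-format dataset describing intents and slots.
with io.open("dataset.json") as f:
    dataset = json.load(f)

engine = SnipsNLUEngine(config=CONFIG_EN)
engine.fit(dataset)

# Parse an utterance; the result contains the matched intent and its slots.
parsing = engine.parse("turn the lights on in the kitchen")
print(parsing["intent"])  # e.g. {"intentName": "turnLightOn", "probability": ...}
print(parsing["slots"])
```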
Wit.ai: Wit.ai is a free platform that, since being acquired by Facebook, requires a Facebook account. Anyone who doesn't have (or doesn't want) a Facebook account, or deletes it, cannot use Wit.
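A minimal sketch of Wit's HTTP message endpoint; the server access token and the utterance are placeholders:

```python
import requests

WIT_SERVER_TOKEN = "..."  # placeholder: server access token from the Wit app settings

# Send a transcript (or typed text) to Wit for intent and entity extraction.
response = requests.get(
    "https://api.wit.ai/message",
    params={"q": "what's the weather in Vancouver"},
    headers={"Authorization": f"Bearer {WIT_SERVER_TOKEN}"},
)
print(response.json())  # detected intents, entities, and traits
```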
Dialogflow: After acquiring API.ai, Google renamed it Dialogflow and offers both chatbot and voicebot tools under that name. Dialogflow uses the conventional approach: it records and sends voice data to Google's servers for transcription and then processes the transcribed text. It charges based on usage.
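A minimal sketch of the text path using the `google-cloud-dialogflow` client; the project ID, session ID, and utterance are placeholders, and credentials are assumed to come from the environment:

```python
from google.cloud import dialogflow

# Detect an intent from text with Dialogflow; GOOGLE_APPLICATION_CREDENTIALS
# must point at a service-account key with Dialogflow access.
session_client = dialogflow.SessionsClient()
session = session_client.session_path("my-project-id", "my-session-id")

text_input = dialogflow.TextInput(text="order a large pizza", language_code="en-US")
query_input = dialogflow.QueryInput(text=text_input)

response = session_client.detect_intent(
    request={"session": session, "query_input": query_input}
)
print(response.query_result.intent.display_name)
print(response.query_result.fulfillment_text)
```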
Lex: Amazon's Lex is an AWS offering. Like Dialogflow, Lex offers text and voice capabilities, uses the conventional approach, and performs transcription and understanding as separate steps in its cloud. It charges based on usage.
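A minimal sketch against the Lex (V1) runtime via `boto3`; the bot name, alias, user ID, and utterance are placeholders, and AWS credentials are assumed to be configured:

```python
import boto3

# Send text to a deployed Lex (V1) bot and read back the recognized intent.
client = boto3.client("lex-runtime")

response = client.post_text(
    botName="OrderPizzaBot",   # placeholder bot name
    botAlias="prod",           # placeholder alias
    userId="demo-user",        # any stable per-user identifier
    inputText="I want a large pepperoni pizza",
)
print(response["intentName"])
print(response["slots"])
```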
Rhino: Picovoice's Rhino is an SLU engine that uses the end-to-end approach and infers intents and intent details directly from speech. Rhino is voice-only and does not offer text-based services. It charges based on the number of users and offers unlimited interactions per user.
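A minimal sketch of Rhino's Python SDK (`pvrhino`); the AccessKey and context file are placeholders, and `next_audio_frame()` is a hypothetical helper yielding single-channel, 16-bit PCM frames of `rhino.frame_length` samples at `rhino.sample_rate`:

```python
import pvrhino

# Create an engine from a context file that defines the domain's intents.
rhino = pvrhino.create(
    access_key="${ACCESS_KEY}",   # placeholder Picovoice AccessKey
    context_path="orders.rhn",    # placeholder context file
)

while True:
    # next_audio_frame() is a hypothetical audio source (mic, file, etc.).
    is_finalized = rhino.process(next_audio_frame())
    if is_finalized:
        inference = rhino.get_inference()
        if inference.is_understood:
            print(inference.intent, inference.slots)
        break

rhino.delete()
```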