Spoken Language Understanding (SLU) vs. Natural Language Understanding (NLU)

by Picovoice, October 19th, 2022

Too Long; Didn't Read

Spoken Language Understanding (SLU) and Natural Language Understanding (NLU) aim to help machines understand human language. SLU deals with understanding speech, whereas NLU deals with understanding text. NLU is a part of SLU whether it’s trained independently or not. Conventional SLU runs Speech-to-Text (STT) first and NLU second, so its performance relies on two independently trained modules. Modern SLU uses an end-to-end model instead of two distinct components. It is more accurate, but it works best for domain-specific use cases. For open-domain use cases such as the voice assistants Alexa, Siri, and Google, the conventional approach is better thanks to the availability of large NLU datasets.

Spoken Language Understanding (SLU) and Natural Language Understanding (NLU) aim to help machines understand human language. The main difference is the input data type. SLU deals with understanding speech, whereas NLU deals with understanding text. NLU is a part of SLU whether it’s trained independently or not.

Research on NLU started in the 1960s: Bobrow’s Ph.D. dissertation, Weizenbaum’s ELIZA (a mock psychotherapist chatbot), and Winograd’s SHRDLU are the pioneering works in this space. SLU’s popularity started with the recent advances in speech recognition powered by deep learning. The query “spoken language” returns over 1,000 studies on both the Amazon and Microsoft research publication websites.

Conventional SLU Approach

The conventional SLU approach processes utterances in two steps: Speech-to-Text (STT) first, then NLU. Once STT transcribes the speech to text, NLU extracts meaning by processing the transcribed text. The performance relies on independently trained STT and NLU modules. If STT returns erroneous output, NLU makes its prediction from the wrong text, so the machine fails to capture what the person said. Many voice applications, including the voice assistants Alexa, Siri, and Google, use this approach.
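The two-step flow, and the way an STT mistake carries into NLU, can be illustrated with a minimal sketch. Everything here is a toy assumption for illustration: `FAKE_STT_OUTPUT`, `transcribe`, `understand`, and the intent names are hypothetical stand-ins, not the API of any engine mentioned in this article.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an STT engine: maps an audio clip ID to a transcript.
# The second entry simulates a misrecognition ("lights" heard as "flights").
FAKE_STT_OUTPUT = {
    "clip_clean": "turn on the lights",
    "clip_noisy": "turn on the flights",
}


@dataclass
class Intent:
    name: str
    slots: dict = field(default_factory=dict)


def transcribe(clip_id: str) -> str:
    """Step 1 (STT): return the transcript for an audio clip."""
    return FAKE_STT_OUTPUT[clip_id]


def understand(text: str) -> Intent:
    """Step 2 (NLU): toy keyword rules that only ever see the text."""
    words = text.split()
    if "lights" in words:
        state = "on" if "on" in words else "off"
        return Intent("change_light_state", {"state": state})
    return Intent("unknown")


def cascaded_slu(clip_id: str) -> Intent:
    """Conventional SLU: NLU runs on whatever STT produced."""
    return understand(transcribe(clip_id))


clean = cascaded_slu("clip_clean")
noisy = cascaded_slu("clip_noisy")
print(clean)
print(noisy)
```

In this sketch the NLU step has no access to the audio, so once the transcript says “flights” it has no way to recover the intended word.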

End-to-End SLU Approach

The modern SLU uses an end-to-end model instead of two distinct components. Developers train STT and NLU jointly, resulting in higher accuracy.

Picovoice calls this Speech-to-Intent as it infers users’ intents directly from speech. Amazon calls it FANS - Fusing ASR and NLU for SLU.
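The structural difference between the two approaches can be sketched as type signatures. This is only an interface illustration under stated assumptions: `Inference`, `SpeechToIntent`, `cascade`, and `stub_end_to_end` are hypothetical names, the stub stands in for a trained model, and none of this is the actual API of Picovoice’s or Amazon’s systems.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

Audio = Sequence[int]  # assumed representation, e.g. 16-bit PCM samples


@dataclass
class Inference:
    is_understood: bool
    intent: str = ""
    slots: dict = field(default_factory=dict)


# Cascaded: two separately trained models joined by a transcript.
SpeechToText = Callable[[Audio], str]
TextToIntent = Callable[[str], Inference]

# End-to-end: one jointly trained model, with no transcript in between.
SpeechToIntent = Callable[[Audio], Inference]


def cascade(stt: SpeechToText, nlu: TextToIntent) -> SpeechToIntent:
    """Wrap a two-step pipeline behind the same audio-in, intent-out interface."""
    return lambda audio: nlu(stt(audio))


def stub_end_to_end(audio: Audio) -> Inference:
    """Placeholder for a trained model: fixed answer for any non-empty audio."""
    if not audio:
        return Inference(is_understood=False)
    return Inference(True, "order_burger", {"size": "large"})


result = stub_end_to_end([0, 12, -7, 3])
print(result.intent, result.slots)
```

From the caller’s side both approaches can expose the same interface. The difference lies inside: the cascaded version passes through an intermediate string, while the end-to-end version does not.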

Conventional SLU Approach vs. End-to-End SLU Approach

Which approach is better? The answer is “it depends.” It depends on the availability of corpora and information for the use case. If they are available, then the answer is modern end-to-end SLU. If not, then it is conventional SLU. Text-based understanding (NLU) has been around longer than speech-based understanding (SLU). Thus, it has richer datasets.

For domain-specific applications such as IVR systems, menu navigation on a website, or ordering food at a quick-service restaurant (QSR), the modern end-to-end SLU is preferable. Nobody would discuss the meaning of life with a voice assistant while ordering a hamburger. For open-domain use cases such as voice assistants like Alexa, conventional (cascaded) SLU works better given the variety of topics they cover. One can discuss the meaning of life with Alexa, although there are better options.

Top SLU and NLU engines in the market

Free and Open-source SLU and NLU Engines:

Rasa: Rasa is an open-source NLU engine that processes text inputs. The core software is free, and Rasa offers paid support and consulting services. To handle speech, one can choose any speech-to-text service and run Rasa on the transcribed text.

Snips: Snips is an open-source SLU engine that uses the conventional method. It has not been maintained since Sonos acquired Snips. Yet the repo is still available on GitHub and used by developers.

Wit.ai: Wit.ai is a free platform. Since its acquisition by Facebook, it requires a Facebook account. Anyone who doesn’t have (or doesn’t want) a Facebook account, or who deletes it, cannot use Wit.

Top paid SLU and NLU Engines:

Dialogflow: After acquiring API.ai, Google renamed it Dialogflow and offers both chatbot and voicebot tools under the same name. It uses the conventional approach. Dialogflow records and sends voice data to Google’s servers for transcription and then processes the transcribed text. It charges based on usage.

Lex: Amazon’s Lex is an AWS offering. Like Dialogflow, Lex offers text and voice capabilities, uses the conventional approach, and performs transcription and understanding separately in its cloud. It charges based on usage.

Rhino: Picovoice’s Rhino is an SLU engine that uses the end-to-end approach and infers intents and intent details directly from speech. Rhino is voice-based and does not support text-based services. It charges based on the number of users and offers unlimited interactions per user.