And though all these systems rely on slightly different technical processes, the first step for all of them is the same: capturing speech data and transforming it into machine-readable text.
But how does an ASR system work? How does it learn to understand speech?
In this article, I’ll give a brief introduction to automatic speech recognition. We'll look at the speech-to-text transformation process, how to build an ASR system, and touch on where to expect ASR tech in the future.
So let's get started!
So we know that on a basic level, automatic speech recognition looks like this:
audio data in, text data out.
But to get from the input to the output, the audio data needs to be made machine-readable. This means sending it through an acoustic model and a language model. The two processes work like this:
An acoustic model determines the relationship between audio signals and phonetic units in a language, while a language model matches sounds to words and word sequences.
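Before the acoustic model can score anything, the raw audio is typically chopped into short, overlapping frames and converted into feature vectors. Here's a minimal, purely illustrative sketch of that framing step — the function and parameter names are mine, not a real library API, and real systems use richer features such as MFCCs or log-mel spectrograms:

```python
import math

def frame_features(samples, frame_size=4, hop=2):
    """Chop a waveform into short overlapping frames and compute a
    simple per-frame feature (log energy). Illustrative only: real
    ASR front ends use much longer frames and richer features."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame)
        features.append(math.log(energy + 1e-10))  # log compresses dynamic range
    return features

# A tiny fake waveform: quiet at first, then loud.
waveform = [0.01, -0.02, 0.01, 0.0, 0.9, -0.8, 0.7, -0.9]
print(frame_features(waveform))
```

The acoustic model then maps sequences of these feature vectors to phonetic units, which is where the language model takes over.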
These two models allow ASR systems to run probability checks on audio input and generate predictions of the words and sentences it contains. From these predictions, the system then selects the one with the highest confidence rating.*

*Sometimes the language model gives priority to certain predictions that are deemed more likely due to other factors.
So if we run a phrase through an ASR system, it will do the following: convert the audio into features, score those features against phonetic units with the acoustic model, and use the language model to settle on the most likely word sequence.
It's worth mentioning here that if an automatic speech recognition system is part of a voice user interface, the ASR model won’t be the only machine learning model at work. Many automatic speech recognition systems are paired with natural language processing (NLP) and text-to-speech (TTS) systems to perform their given roles.
That said, digging into voice user interfaces is a whole topic of its own. To learn more about it, check out this article.
So now we know how ASR systems work, but what do you need to build one?
The key is data.
A good ASR system is expected to be flexible. It needs to understand a wide variety of audio input (speech samples) and produce accurate text output from that data so the applications built on top of it can react accordingly.
To achieve this, ASR systems require data in the form of labeled speech samples and transcriptions. It's a little more complicated than that (the data labeling process is hugely important and often overlooked, for example), but for the purposes of this article let's keep things simple.
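At its simplest, each training example pairs a recording with its transcription. A minimal sketch of what such a dataset might look like — the field names and file paths here are hypothetical, and real corpora carry much more metadata (speaker, language, sampling rate, and so on):

```python
from dataclasses import dataclass

@dataclass
class LabeledSample:
    """One training example: an audio recording plus its transcription.
    Field names are illustrative, not from any particular corpus."""
    audio_path: str
    transcript: str

dataset = [
    LabeledSample("clips/0001.wav", "turn on the lights"),
    LabeledSample("clips/0002.wav", "what's the weather tomorrow"),
]

for sample in dataset:
    print(sample.audio_path, "->", sample.transcript)
```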
ASR systems require huge amounts of audio data. Why? Because speech is complicated. There are lots of different ways to say the same thing, and the meaning of a sentence can change with word placement and emphasis. Also consider that the world is filled with different languages, and within these languages pronunciation and word choice can differ depending on factors such as location and accent.
Oh, and let's not forget that speech differs due to age and gender, too!
With this in mind, the more speech samples you feed an ASR system, the better it gets at identifying and classifying new speech input. The more samples you have from a broad range of voices and environments, the better the system gets at identifying voices within those environments. With dedicated fine-tuning and maintenance, automatic speech recognition systems will improve as they are used.
So at the most basic level, the more data, the better. It's true that there is ongoing research into optimizing smaller datasets, but at present most models still require large amounts of data to perform well.
Fortunately, audio data collection is getting simpler thanks to dataset repositories and dedicated data collection services. This in turn is increasing the rate of technological development, so to finish things off, let's take a brief look at where automatic speech recognition can play a role in the future.
ASR technology is already embedding itself into our society. Virtual assistants, in-car systems, and home automation are all creating convenience in everyday life. It's likely that the scope of their abilities will expand too; as more people adopt these services, the technology will develop further.
Outside of the above examples, automatic speech recognition is playing a role in a variety of interesting fields and industries:
These are just a few examples of how ASR can support and improve lives, and it's likely that the next decade will see even more improvements alongside novel adaptations.
In any case, I hope this article has been a good introduction to how ASR systems work, how to build them, and what to look forward to in the future. If you have any comments or thoughts, feel free to leave a comment below and I’ll get to it as soon as I can.
Thanks so much for reading. For more about AI news and machine learning developments, follow me on Twitter.