Automatic speech recognition (ASR) is the transformation of spoken language into text. If you’ve ever used a virtual assistant like Siri or Alexa, you’ve already used an automatic speech recognition system. The technology is now built into messaging apps, search engines, in-car systems, and home automation.
And though all these systems rely on slightly different technical processes, the first step for all of them is the same: capturing speech data and transforming it into machine-readable text.
But how does an ASR system work? How does it learn to understand speech? 
In this article, I’ll give a brief introduction to automatic speech recognition. We'll look at the speech-to-text transformation process, how to build an ASR system, and touch on where to expect ASR tech in the future.
So let's get started!
ASR Systems: How do they work?
So we know that on a basic level, automatic speech recognition looks like this:
audio data in, text data out.
But to get from the input to the output, the audio data needs to be made machine-readable. This means sending it through an acoustic model and a language model. The two processes work like this:
An acoustic model determines the relationship between audio signals and phonetic units in a language, while a language model matches those sounds to words and word sequences.
Together, these two models allow an ASR system to run probability checks on audio input and predict which words and sentences it contains. From these predictions, the system then selects the one with the highest confidence rating.*
*Sometimes the language model can give priority to certain predictions that are deemed more likely due to other factors.
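To make the selection step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the candidate phrases, the probabilities, and the `lm_weight` parameter are made up, and real systems score far larger sets of candidates with trained models. It simply combines an acoustic score and a language score for each candidate and picks the highest.

```python
import math

# Hypothetical candidates with made-up probabilities.
# "acoustic": how well the audio matches the sounds of the phrase.
# "language": how plausible the word sequence is in the language.
CANDIDATES = {
    "what time is it": {"acoustic": 0.40, "language": 0.30},
    "what thyme is it": {"acoustic": 0.45, "language": 0.001},
    "watt time is it": {"acoustic": 0.15, "language": 0.002},
}


def score(probs, lm_weight=1.0):
    """Combine both model scores in log space (higher is better)."""
    return math.log(probs["acoustic"]) + lm_weight * math.log(probs["language"])


def best_transcription(candidates, lm_weight=1.0):
    """Return the candidate text with the highest combined score."""
    return max(candidates, key=lambda text: score(candidates[text], lm_weight))


print(best_transcription(CANDIDATES))                  # both models
print(best_transcription(CANDIDATES, lm_weight=0.0))   # acoustic score only
```

With both scores in play, the sketch picks "what time is it". With the language score switched off (`lm_weight=0.0`), the slightly better-sounding "what thyme is it" wins, which is the kind of mistake the language model is there to prevent.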
So if we run a phrase through an ASR system, it will do the following:
- Take vocal input: "Hey Siri, what time is it?"
- Run the voice data through an acoustic model, breaking it up into phonetic parts.
- Run that data through a language model.
- Output text data: "Hey Siri, what time is it?"
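The steps above can be sketched as a toy pipeline in Python. This is only an illustration under heavy assumptions: the "audio" is a hand-written list of per-frame sound labels for a shortened phrase, `acoustic_model` and `language_model` are hypothetical stand-ins rather than trained models, and `LEXICON` is a tiny made-up pronunciation dictionary.

```python
from itertools import groupby

# Hypothetical pronunciation lexicon: phonetic units -> word.
LEXICON = {
    ("W", "AH", "T"): "what",
    ("T", "AY", "M"): "time",
    ("IH", "Z"): "is",
    ("IH", "T"): "it",
}


def acoustic_model(frames):
    """Stand-in acoustic model: merge repeated frame labels, drop silence ("_")."""
    return [label for label, _ in groupby(frames) if label != "_"]


def language_model(phonemes):
    """Stand-in language model: greedy longest match against the lexicon."""
    words, i = [], 0
    max_len = max(len(key) for key in LEXICON)
    while i < len(phonemes):
        for n in range(max_len, 0, -1):
            chunk = tuple(phonemes[i:i + n])
            if chunk in LEXICON:
                words.append(LEXICON[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no word found at phoneme index {i}")
    return " ".join(words)


def transcribe(frames):
    """Audio frames in, text out."""
    return language_model(acoustic_model(frames))


# Made-up frame labels standing in for the audio of "what time is it".
FRAMES = ["W", "W", "AH", "AH", "T", "_", "T", "AY", "AY", "M",
          "_", "IH", "Z", "_", "IH", "T", "T"]

print(transcribe(FRAMES))
```

A real system works with probabilities at every stage instead of exact lookups, but the shape is the same: audio is broken into phonetic units, and those units are mapped to words.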
It's worth mentioning here that if an automatic speech recognition system is part of a voice user interface, the ASR model won’t be the only machine learning model at work. Many automatic speech recognition systems are paired with natural language processing (NLP) and text-to-speech (TTS) systems to perform their given roles.
That said, digging into voice user interfaces is a whole topic of its own. To learn more about it, check out this article.
So now we know how ASR systems work, but what do you need to build one?
The key is data.
Building an ASR System: The Importance of Data
A good ASR system is expected to be flexible. It needs to understand a wide variety of audio input (speech samples) and create accurate text output from that data so it can react accordingly.
To achieve this, ASR systems require data in the form of labeled speech samples and transcriptions. It's a little more complicated than that (the data labeling process is hugely important and often overlooked, for example), but for the purposes of this article let's keep things simple.
ASR systems require huge amounts of audio data. Why? Because speech is complicated. There are lots of different ways to say the same thing, and the meaning of a sentence can change with word placement and emphasis. Also consider that the world is filled with different languages, and within these languages pronunciation and word choice can differ depending on factors such as location and accent.
Oh, and let's not forget that speech differs due to age and gender, too!
With this in mind, the more speech samples you feed an ASR system, the better it gets at identifying and classifying new speech input. The more samples you have from a broad range of voices and environments, the better the system gets at identifying voices within those environments. With dedicated fine-tuning and maintenance, automatic speech recognition systems will improve as they are used.
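As a rough illustration of what labeled speech data can look like, here is a small Python sketch. The field names, file paths, and sample values are all hypothetical, not a standard dataset format. Each sample pairs an audio clip with its transcription, and a helper counts how many clips cover each accent or environment.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SpeechSample:
    audio_path: str    # hypothetical location of the audio clip
    transcript: str    # the text label for the clip
    accent: str        # speaker metadata (assumed field)
    environment: str   # recording conditions (assumed field)


# Made-up samples for illustration only.
SAMPLES = [
    SpeechSample("clips/0001.wav", "what time is it", "US English", "quiet room"),
    SpeechSample("clips/0002.wav", "what time is it", "Scottish English", "car"),
    SpeechSample("clips/0003.wav", "turn on the lights", "US English", "kitchen"),
]


def coverage(samples, field):
    """Count how many samples exist for each value of a metadata field."""
    return dict(Counter(getattr(sample, field) for sample in samples))


print(coverage(SAMPLES, "accent"))
print(coverage(SAMPLES, "environment"))
```

A summary like this makes gaps easy to spot: if one accent or environment has only a handful of clips, the system is likely to struggle with that kind of input.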
So at the most basic level, the more data, the better. It's true that there is ongoing research into optimizing smaller datasets, but at present most models still require large amounts of data to perform well. 
Fortunately, audio data collection is getting simpler thanks to dataset repositories and dedicated data collection services. This in turn is increasing the rate of technological development, so to finish things off, let's take a brief look at where automatic speech recognition can play a role in the future.
The Future of ASR Technology
ASR technology is already embedding itself into our society. Virtual assistants, in-car systems, and home automation are all creating convenience in everyday life. It's likely that the scope of their abilities will expand too; as more people adopt these services, the technology will develop further.
Outside of the above examples, automatic speech recognition is playing a role in a variety of interesting fields and industries:
- Communication: With the adoption of cell phones worldwide, ASR systems can make messaging, online searches, and text-based services available even to communities with low levels of reading and writing literacy.
- Improving Accessibility: Automatic speech recognition systems can also help people with disabilities or injuries by providing hands-free access to applications, and auto-captioning for television, movies, and business meetings.
- Military Technology: In the US, France, and the UK, military programs have been testing and evaluating ASR systems for fighter jets. This includes tasks such as setting radio frequencies, commanding autopilot systems, and controlling flight displays.
These are just a few examples of how ASR can support and improve lives, and it's likely that the next decade will see even more improvements alongside novel adaptations.
In any case, I hope this article has been a good introduction to how ASR systems work, how to build them, and what to look forward to in the future. If you have any thoughts or questions, feel free to leave a comment below and I’ll get to it as soon as I can.
Thanks so much for reading. For more about AI news and machine learning developments, follow me on Twitter.
