Speech-To-Text Conversion: An Introduction to Converting Spoken Language Into Text

by Milana Shkhanukova, November 6th, 2023

Too Long; Didn't Read

We look behind the scenes of the Automatic Speech Recognition models that drive speech assistants. Using spectrogram inputs and two pre-trained models, the CNN-based Conformer (via NVIDIA NeMo) and the Transformer-based Whisper (from OpenAI), we transcribe an audio sample with a few lines of Python, add punctuation and capitalization, and review the challenges ASR still faces.

You may have at least once used Siri, Alexa, or Google Assistant. Have you ever wondered how these systems understand you? Today, we will look behind the scenes of the Automatic Speech Recognition (ASR) models that power these speech assistants.


Let’s start with speech.

Speech

Speech can be defined as an acoustic signal: a waveform that varies over time. However, our computers cannot process every point of a continuous signal, so the waveform is sampled at discrete time intervals. The more often we sample (the higher the sample rate), the more faithfully the digital signal represents the original sound.



Example of a waveform
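
To make this concrete, here is a minimal sketch of loading an audio file and plotting its sampled waveform (assuming librosa and matplotlib are installed; ./sample1.wav is the LibriSpeech sample we use later in this article):


import librosa
import matplotlib.pyplot as plt
import numpy as np

# Load the audio; sr=None keeps the file's original sample rate instead of resampling
signal, sample_rate = librosa.load("./sample1.wav", sr=None)
print(f"Sample rate: {sample_rate} Hz, number of samples: {len(signal)}")

# Each sample is one point of the discretized waveform
times = np.arange(len(signal)) / sample_rate
plt.plot(times, signal)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()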


We can use the waveform representation directly in our models. However, one of the most important properties of a signal is its frequency content. To get more information about the frequencies of the signal, you can preprocess it into spectrograms, mel spectrograms, MFCCs, GFCCs, and other representations. Let’s take a quick look at spectrograms.


A spectrogram is a visual representation of the frequency content of a signal over time. Nowadays, spectrograms are one of the most popular sound representations for all neural networks. When you look at a spectrogram, here's what you can interpret:


  1. Frequency vs. Time:

    The x-axis of the spectrogram represents time, typically in seconds, and the y-axis represents frequency in Hertz. Reading along the time axis shows how the frequency content of the signal changes over time.

  2. Intensity and Amplitude:

    The intensity or amplitude of a particular frequency is represented by the color or brightness of that point in the spectrogram. Bright or colorful areas indicate higher amplitude or energy at specific frequencies.

Spectrogram and a waveform of the same sound
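
Here is a minimal sketch of how such a spectrogram (a mel spectrogram, in this case) can be computed with librosa; the parameter values are only illustrative:


import librosa
import numpy as np
import matplotlib.pyplot as plt

signal, sample_rate = librosa.load("./sample1.wav", sr=None)

# Short-time Fourier transform mapped onto 80 mel frequency bands
mel = librosa.feature.melspectrogram(y=signal, sr=sample_rate, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels for display

# x-axis: time frames, y-axis: mel frequency bins, color: energy in dB
plt.imshow(mel_db, origin="lower", aspect="auto")
plt.xlabel("Time frames")
plt.ylabel("Mel frequency bins")
plt.colorbar(label="dB")
plt.show()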


Types of ASR Models

CNN-Based Model - Conformer

Okay, that’s very interesting, but how do we get information from these spectrograms? Isn’t a spectrogram a picture? Yes, it is indeed! So we can process it as it is done in Computer Vision, using a Convolutional Neural Network (CNN). This network captures important local features using small filters.


Let’s check one of the best models that uses CNNs and spectrograms as input. First, let’s take an audio file from the LibriSpeech dataset; one possible way to get such a file is sketched below.
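
If you don’t have a sample at hand, here is one way to grab an utterance with torchaudio (a sketch, not necessarily the exact recording used in this article; the test-clean split is a few hundred megabytes):


import os
import torchaudio

# Download the LibriSpeech test-clean split into ./data (one-time download)
os.makedirs("./data", exist_ok=True)
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = dataset[0]
print(transcript)

# Save the utterance as the WAV file we transcribe below
torchaudio.save("./sample1.wav", waveform, sample_rate)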


To try most of these models, you can use a variety of Python libraries. We will first experiment with NVIDIA’s NeMo library. Let’s install it first (it can take some time, so be patient):


!pip install "nemo_toolkit[all]"


Then, create the Conformer model, which combines convolutional layers with self-attention and takes spectrogram features as input. We will use its small pre-trained version.


import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/stt_en_conformer_ctc_small")


To get the text of our audio sample, we will ask the model to transcribe the audio.


nemo_result = asr_model.transcribe(['./sample1.wav'])


The result will be: “going along slushy country roads and speaking to damp audiences in drafty school roomoms day after day for a fortnight he'll have to put in an appearance at some place of worship on sunday morning and he can come to us immediately afterwards”


As you can see, there is no punctuation or capitalization in the text. Let’s add them with an additional model.


from nemo.collections.nlp.models import PunctuationCapitalizationModel

model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")

model.add_punctuation_capitalization(nemo_result)


“Going along slushy country roads and speaking to damp audiences in drafty school roomoms day after day for a fortnight, he'll have to put in an appearance at some place of worship on Sunday morning, and he can come to us immediately afterwards.”


That looks better! To learn more about BERT and other language models, check the courses on Hyperskill. As a Hyperskill expert, I recommend them whether you are just starting your path or have already gained some experience in NLP.

Transformer-Based Model - Whisper

Transformers are not just robots like Megatron and Optimus Prime. There is a popular model called Whisper, which uses the Transformer architecture and was developed by OpenAI. This multitask model was trained for different tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.


One of its key advantages is that it can model long-range context over the whole sample, which makes it a powerful tool for speech analysis. To try the Whisper model on a sample, we must first install the openai-whisper library.


!pip install -U openai-whisper


Then, load the base Whisper model and transcribe our sample.


import whisper 

model_whisper = whisper.load_model('base')
model_whisper.transcribe('./sample1.wav')['text']


The result will be the following: “going along slushy country roads and speaking to damp audiences in drafty schoolrooms day after day for fortnight. He'll have to put in an appearance at some place of worship on Sunday morning, and he can come to us immediately afterwards.”


The Whisper model does not need additional modules to add punctuation or capitalization.
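
As a small follow-up, the dictionary returned by transcribe also contains the detected language and segment-level timestamps (a sketch using the same ./sample1.wav):


import whisper

model_whisper = whisper.load_model("base")
result = model_whisper.transcribe("./sample1.wav")

# The language Whisper detected (or that you passed explicitly)
print(result["language"])

# Each segment carries start/end timestamps in seconds plus its text
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")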

ASR Challenges

You may ask: if these models are so capable and easy to use, why do we continue working on ASR models? There are still many unsolved challenges; let’s list them so you may choose one to make a breakthrough in the field:


  1. Ambient Noise and Acoustic Variability: background noise, reverberation, and other acoustic variations in the environment can significantly degrade ASR performance.


  2. Real-time Processing: ASR systems may introduce latency when processing real-time audio streams, making them unsuitable for applications that require a low-latency response.


  3. Speaker Diarization: separating different speakers in a conversation (speaker diarization) is a challenging problem for ASR systems, especially in situations with multiple speakers.


  4. Domain Adaptation: ASR models trained on one domain (e.g., news broadcasts, one language, one accent) may not perform well when applied to a different domain (e.g., medical transcriptions) without adaptation.


  5. Data Privacy: privacy concerns arise when using ASR for transcribing sensitive or private conversations, as the transcribed data may be subject to unauthorized access.


Today, we have learned a bit about ASR systems. Now, you can use an already pre-trained model to transcribe a dialogue with a friend or your favorite lecturer. Using already trained models is just the beginning. ASR is one of the most popular and rapidly growing areas of AI, so take a step forward if you’re interested!