Thanks to advances in speech recognition, companies can now build a whole range of products with accurate transcription capabilities at their heart. Conversation intelligence platforms, personal assistants and video and audio editing tools, for example, all rely on speech-to-text transcription. However, you often need to train these systems on supervised data for every domain you want to transcribe. In practice, you need a large body of transcribed audio that’s similar to what you are transcribing just to get started in a new domain.

Recently, Facebook released wav2vec 2.0, which goes some way towards addressing this challenge. wav2vec 2.0 allows you to pre-train transcription systems using audio only, with no corresponding transcription, and then use just a tiny transcribed dataset for training.

In this blog, we share how we worked with wav2vec 2.0, with great results.

What is an end-to-end automatic speech recognition system?

Before we dive into wav2vec 2.0, let’s take a few steps back to cover a couple of key terms you’ll need to understand to see what makes wav2vec 2.0 so special. First, let’s look at end-to-end automatic speech recognition systems.

An end-to-end automatic speech recognition (ASR) system takes a speech audio waveform and outputs the corresponding text. Traditionally, ASR systems used Hidden Markov Models (HMMs), where the speech audio is modeled using a stochastic process. In recent years, deep learning ASR systems have become popular thanks to increased computing power and larger amounts of training data.

You can measure an ASR system’s performance with the word error rate (WER) metric. WER reflects the number of corrections needed to convert the ASR output into the ground truth. Generally, a lower WER means a better quality ASR system.

[Figure: an example WER calculation, adapted from https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-evaluate-data]

The example in the figure shows how to calculate the WER.
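To make the metric concrete, here is a minimal sketch that treats WER as the word-level edit distance (deletions, insertions and substitutions) divided by the number of words in the ground truth. The function names and the sentence pair are our own assumptions for illustration. The sentences are not taken from the figure, and this is not the scoring code we used in our experiments.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of word deletions, insertions and substitutions."""
    # prev[j] is the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # ref_word is missing from the hypothesis
            insertion = curr[j - 1] + 1  # hyp_word is extra in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (deletions + insertions + substitutions) / words in the reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical sentence pair: "are" is deleted, "a" is inserted,
# and "John" is replaced by "Jones".
reference = "how are you today John"
hypothesis = "how you a today Jones"
print(word_error_rate(reference, hypothesis))  # 3 edits / 5 words = 0.6
```

Splitting on whitespace is the simplest possible tokenization. Real evaluations usually normalize case and punctuation first, which this sketch leaves out.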
We can see that the ASR has made a few errors. It has inserted an “a”, identified “John” as “Jones” and deleted the word “are” from the ground truth. To calculate WER, we can use this formula: WER = (D + I + S) / N, where D is the number of deletions, I is the number of insertions, S is the number of substitutions and N is the number of words in the ground truth. In this example, the ASR output made 3 mistakes in total (one deletion, one insertion and one substitution) against 5 words in the ground truth, so the WER is 3 / 5 = 0.6.

The LibriSpeech dataset

Next, we’ll briefly touch on the LibriSpeech dataset. LibriSpeech is one of the most commonly used datasets in speech research. It was created by Vassil Panayotov, Daniel Povey and colleagues in 2015 [3]. LibriSpeech provides 960 hours of labelled speech data for training and is a standard benchmark for training and evaluating ASR systems.

The dev-clean dataset from LibriSpeech contains 5.4 hours of “clean” speech data. It’s generally used as a validation dataset. In the figure below, we show the transcription for one audio sample in the dev-clean dataset.

[Figure: the transcription for one audio sample in the dev-clean dataset]

What is wav2vec 2.0?

Now that we understand what an ASR system and the LibriSpeech dataset are, we’re ready to take a closer look at wav2vec 2.0.

What’s different about wav2vec 2.0?

ASR systems come in two flavors:

The first are hybrid systems such as Kaldi [7]. They train a deep acoustic model to predict phonemes from audio processed into Mel Frequency Cepstral Coefficients (MFCCs), combine the phonemes using a pronunciation dictionary and finally pick the most likely results using a language model (both count-based and RNN-based LMs).

The second are end-to-end systems, which use a deep neural network to predict words directly from the audio or MFCCs. Systems such as RNN-T [6] or wav2vec [1, 4] require a lot more training data and GPU resources for training.
Due to the massive data requirements of end-to-end systems, only the biggest companies have used them to date. The data requirements also make it hard to train for new domains (even in the same language) and for new languages or accents. With a hybrid system, it is much easier to create a model for a new domain using minimal training data and a pronunciation dictionary with words added for that domain.

The promise of wav2vec 2.0 is that you can pre-train without supervised data, using a large dataset of untranscribed recordings in the target domain. Afterwards, the model can be tuned using the supervised approach to maximize accuracy. Wav2vec 2.0 shows that it’s possible to achieve a low WER on the LibriSpeech validation datasets using only ten minutes of labelled audio data. Another option is to take a pre-trained model (such as the LibriSpeech model) and just fine-tune it for your domain with a few hours of labelled audio.

The architecture of wav2vec 2.0

The breakthrough wav2vec 2.0 achieved is in adopting the masked pre-training method of the massive language model BERT [8]. BERT masks a few words in each training sentence, and the model trains by attempting to fill the gaps. Instead of masking words, wav2vec 2.0 masks a part of the audio representation and requires the transformer network to fill in the gap.

The figure below shows the wav2vec 2.0 architecture with its two major components: CNN layers and transformer layers.

[Figure: the wav2vec 2.0 architecture. Image credit: https://arxiv.org/pdf/2006.11477.pdf]

Self-supervised learning

So how does self-supervised learning work in wav2vec 2.0? The raw audio waveform (X in the figure above) first passes through the CNN layers, and we get latent speech representations (Z in the figure above).

Now, two things happen in parallel.

On one path, we mask a random subset of Z; let’s call the result masked_Z. We pass masked_Z into the transformer layers. The output of the transformer layers is called context representations (C in the figure above).
In parallel, we apply product quantization [5] to Z and get quantized representations (Q in the figure above).

We expect C to be close to Q over the masked parts. The “error” between C and Q over the masked parts is called the contrastive loss. Minimizing the contrastive loss enables the transformer layers to learn the structure inside the latent speech representations (Z).

Where does wav2vec 2.0 fit in the big picture?

In the figure above, we saw that context representations were the output of the transformer layers. Wav2vec 2.0 passes these context representations into a linear layer, followed by a softmax operation. The final output contains probability distributions over 32 tokens. A token can be a character, or it can represent word and sentence boundaries, as well as unknowns.

How do we convert these probability distributions into text? The answer is a decoder! The authors of wav2vec 2.0 used a beam search decoder. Below, we show you how to use a Viterbi decoder to convert the output of wav2vec 2.0 into text.

Similarity with word2vec

Word2vec [2] generates a feature vector for a given word, such that the feature vectors of similar words have a higher cosine similarity. Similar to word2vec, we can think of the wav2vec 2.0 output as a feature vector for an audio segment.

Using Python and PyTorch to build an end to end speech recognition system with wav2vec 2.0

Now, let’s look at how to create a working ASR with wav2vec 2.0 that generates text given audio waveforms from the LibriSpeech dataset. We used Python and the PyTorch framework in our sample code snippets.

First, download the wav2vec 2.0 model and the dev-clean dataset from LibriSpeech. The dev-clean dataset contains 5.4 hours of “clean” speech data, and it’s generally used as a validation dataset.

    model_path = "/home/models/wav2vec_big_960h.pt"
    data_path = "/home/datasets/"

In the code above, we declare model_path, which is the path to the wav2vec 2.0 model that we just downloaded. data_path is the path to the dev-clean dataset. Store it under “/home/datasets/”.
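Before moving on, it can help to confirm that both paths point at something on disk. The following is a minimal sketch and not part of our original code: missing_paths is a hypothetical helper, and the LibriSpeech/dev-clean subfolder is an assumption about how the extracted archive is laid out under data_path (as far as we know, it matches torchaudio’s default folder name).

```python
from pathlib import Path
from typing import List


def missing_paths(model_path: str, data_path: str) -> List[str]:
    """Return the expected files or folders that are not on disk yet."""
    expected = [
        Path(model_path),  # the downloaded wav2vec 2.0 checkpoint
        # Assumed layout: the archive was extracted under data_path, so the
        # audio files sit in LibriSpeech/dev-clean.
        Path(data_path) / "LibriSpeech" / "dev-clean",
    ]
    return [str(path) for path in expected if not path.exists()]


missing = missing_paths("/home/models/wav2vec_big_960h.pt", "/home/datasets/")
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("Model and dataset found.")
```

If your copy of the dataset lives elsewhere, adjust the assumed subfolder rather than the check itself.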
We mentioned earlier that wav2vec 2.0 outputs a probability distribution over 32 tokens. We convert these tokens to letters with the help of ltr_dict.txt. We download ltr_dict.txt and save it at /home/ltr_dict.txt.

You might notice that ltr_dict.txt contains only 28 letters and tokens. The remaining four tokens are <s>, <pad>, </s>, and <unk>, and they are added when we call fairseq_mod.data.Dictionary.load() with the path to ltr_dict.txt.

    target_dict = fairseq_mod.data.Dictionary.load('ltr_dict.txt')

Now, create the wav2vec 2.0 model.

    import torch

    w2v = torch.load(model_path)
    model = Wav2VecCtc.build_model(w2v["args"], target_dict)
    model.load_state_dict(w2v["model"], strict=True)

In the code above, we first load from model_path. We get w2v, which contains the argument setup and the model’s weights. Then, we build a Wav2VecCtc object. Wav2VecCtc is the model definition of wav2vec 2.0. Finally, we load the weights into the model we just created.

We know that we need a decoder to convert the output of wav2vec 2.0 into text. Create a Viterbi decoder, as in the code below.

    decoder = W2lViterbiDecoder(target_dict)

Next, we need to create a data loader for our dataset. Luckily, torchaudio knows how to process the LibriSpeech dataset! To use it, we just need to call torchaudio.datasets.LIBRISPEECH.

    import torchaudio

    dev_clean_librispeech_data = torchaudio.datasets.LIBRISPEECH(data_path, url='dev-clean', download=False)
    data_loader = torch.utils.data.DataLoader(dev_clean_librispeech_data, batch_size=1, shuffle=False)

In the steps so far, we have created wav2vec 2.0, a Viterbi decoder, and the data loader. Now, we are ready to convert raw waveforms into text using wav2vec 2.0 and the decoder.

The code below shows how we pass one data sample into wav2vec 2.0. encoder_input is the data sample, a dictionary containing speech audio waveforms and other arguments that we need to pass into wav2vec 2.0. The model outputs encoder_out, representing logits over tokens at each time step. To get encoder_out, we project the output of wav2vec 2.0 into tokens through a linear layer. The dimension of encoder_out is L*B*C, where L is the sequence length, B is the batch size and C is the number of tokens.

As we saw earlier, we need to pass probability distributions over tokens to the decoder to get transcribed texts. Since encoder_out holds logits over tokens, we take the log softmax of these logits (through model.get_normalized_probs) and get emissions, which are log-probability distributions over tokens.

    encoder_out = model(**encoder_input)
    emissions = model.get_normalized_probs(encoder_out, log_probs=True)
    emissions = emissions.transpose(0, 1).float().cpu().contiguous()

Next, we pass emissions into the decoder, like this:

    decoder_out = decoder.decode(emissions)

In our third post in this series, we describe what happens inside the decode method. We need to do some post-processing on decoder_out to finalize the output text, but we omit those details here. Check out post_process_sentence if you are interested in knowing more.

That’s it! We just finished processing one data sample. If you want to convert all data samples from the dev-clean dataset into texts and get a WER score, try this notebook and you should get a WER of 2.63%.

What’s next?

In this post, we introduced the ASR system, as well as wav2vec 2.0. We also showed you how to get an ASR system working with wav2vec 2.0. Note that wav2vec 2.0 is a big model and its largest version has 317 million parameters! So, read our next post to learn how to compress wav2vec 2.0.

About Georgian R&D

Georgian is a fintech that invests in high-growth software companies.

At Georgian, the R&D team works on building our platform that identifies and accelerates the best growth stage software companies. As part of this work, we take the latest AI research and use it to help solve the business challenges of the companies where we are investors. We then create reusable toolkits so that it’s easier for our other companies to adopt these techniques.

We wrote this series of posts after an engagement where we collaborated closely with the team at Chorus. Chorus is a conversation intelligence platform that uses AI to analyze sales calls to drive team performance.

Take a look at our open opportunities if you’re interested in a career at Georgian.

References

[1] Baevski et al. (2020). Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. https://arxiv.org/abs/2006.11477
[2] Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781
[3] Panayotov et al. (2015). Librispeech: An ASR corpus based on public domain audio books. https://ieeexplore.ieee.org/document/7178964
[4] Schneider et al. (2019). wav2vec: Unsupervised Pre-training for Speech Recognition. https://arxiv.org/abs/1904.05862
[5] Jegou et al. (2011). Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128
[6] Graves (2012). Sequence Transduction with Recurrent Neural Networks. https://arxiv.org/pdf/1211.3711.pdf
[7] Povey et al. (2011). The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. https://kaldi-asr.org/doc/about.html
[8] Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

Also published at https://medium.com/georgian-impact-blog/how-to-make-an-end-to-end-automatic-speech-recognition-system-with-wav2vec-2-0-dca6f8759920

How to Use ASR System for Accurate Transcription Properties of Your Digital Product
