wav2vec2 for Automatic Speech Recognition In Plain English

Written by pictureinthenoise | Published 2024/03/13

TL;DR: wav2vec2 is a leading machine-learning model for the design of automatic speech recognition (ASR) systems. It is composed of three general components: a Feature Encoder, a Quantization Module, and a Transformer. The model is pretrained on audio-only data to learn basic speech units, and is then finetuned on labeled data where speech units are mapped to text.

Introduction

If you've been dipping your toes in the domain of automatic speech recognition ("ASR"), there's a good chance you've come across wav2vec 2.0 ("wav2vec2") from Meta AI Research. There are some excellent technical resources, not the least of which is the original wav2vec2 paper itself, that describe how the machine learning ("ML") model works. Also, the Meta AI Research team has a nice overview of wav2vec2 on their website.

I would encourage you to take a look at it: it offers a nice summary of the academic paper, and the wav2vec2 model illustrations in this article are sourced from that page. With the preceding in mind, there don't seem to be many write-ups that explain wav2vec2 in "plain English." I try to do just that with this article.

This article assumes that you understand some basic ML concepts and that you're interested in understanding how wav2vec2 works at a high level, without getting too deep "into the weeds."

Accordingly, the subsequent sections try to avoid a lot of the technical details in favor of simple explanations and useful analogies when appropriate.

That being said, it is helpful to know early on that wav2vec2 is composed of 3 major components: the Feature Encoder, the Quantization Module, and the Transformer.

Each will be discussed in turn, starting with some basic ideas and building up to more complex (but still digestible) points. Keep in mind that wav2vec2 can be used for other purposes beyond ASR.

However, what follows here discusses the model in an ASR-specific context.

A Gentle Overview

At the time it was introduced in 2020, wav2vec2 offered a novel framework for building ASR systems. What was so special about it? Before wav2vec2, ASR systems were generally trained using labeled data. That is, prior models were trained on many examples of speech audio where each example had an associated transcription. To explain the idea, consider this waveform:

It is not entirely clear what this waveform represents just by looking at it. But if you are told that the speaker who generated this audio said the words "hello world", you can probably make some intelligent guesses as to which parts of the waveform correspond to the text that represents it.

You might surmise - correctly - that the first segment of the waveform is associated with the word "hello". Similarly, ASR models can learn how to make associations between spoken audio waveform segments and written text.
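
To make the idea of labeled training data concrete, here is a minimal sketch of what a single (audio, transcription) example might look like in Python. The file name is hypothetical, and soundfile is just one of several libraries that can read WAV audio.

# A minimal sketch of one labeled ASR training example.
# "hello_world.wav" is a hypothetical file; any 16 kHz mono WAV would do.
import soundfile as sf

audio, sample_rate = sf.read("hello_world.wav")   # audio: 1-D NumPy array of samples
transcription = "hello world"                     # the label paired with this audio

print(f"{len(audio)} samples at {sample_rate} Hz -> '{transcription}'")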

However, as the original wav2vec2 investigators point out in their paper, "[many] speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance which is not available for the vast majority of the nearly 7,000 languages spoken worldwide."

So, the wav2vec2 investigators invented a new model where it is not necessary to have "thousands of hours of transcribed speech" in order to train the system. They reference a useful human analogy: babies don’t learn to speak by hearing a word, and then immediately seeing a text representation of that word.

They learn representations of speech by listening to people in their environment (e.g., their parents, siblings, etc.). wav2vec2 learns in an analogous way: by listening first.

Of course, how this is achieved is the point of the discussion in this article. Bear in mind that wav2vec2 is broadly designed to accomplish 2 things:

  1. Learn what the speech units should be given samples of unlabeled audio.

  2. Predict the correct speech units for masked portions of the audio.

At this point, you don't need to completely understand what is meant by these points. They will be explained below. Just keep them in the back of your head for now.

Learning Speech Units

Imagine you have a huge dataset of audio samples - say for some number of English speakers. Even without a formal background in phonetics, you might understand intuitively that the English language is vocalized using a set of basic sounds that are "strung together" to form words, sentences, etc.

Of course, if you're an English speaker, you don't think about speaking in this way and your vocalizations of whatever you want to say are more or less automatic! But, the point is that the spoken English language - and really any spoken language - can be decomposed into more basic, discrete sounds.

If we could somehow coax an ASR model to "extract" these basic sounds, it would allow us to encode any audio sample of spoken language using them. This is what wav2vec2 does by pretraining on audio data.

Pretraining, in this context, means that the first part of the model's training is self-supervised insofar as it is not explicitly "told" what the basic sounds should be for a given set of audio data.

Diving down a bit more, the system is "fed" a large number of audio-only examples and, from those examples, is able to learn a set of basic speech units.

Thus, every audio example is effectively composed of some combination of those speech units, in much the same way that a spoken audio sample can be broken into a sequence of phonemes.

Importantly, the basic speech units that wav2vec2 learns are shorter than phonemes and are 25 milliseconds in length.
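
To put that figure in perspective, a bit of arithmetic shows how many of these units a short clip contains. The sketch below assumes 16 kHz audio and the roughly 25-millisecond window with a roughly 20-millisecond stride described in the paper.

# Rough arithmetic: how many ~25 ms speech-unit frames fit in a 10-second clip?
# Assumes 16 kHz audio and the ~25 ms window / ~20 ms stride reported in the paper.
SAMPLE_RATE = 16_000
WINDOW_MS, STRIDE_MS = 25, 20

clip_seconds = 10
window_samples = SAMPLE_RATE * WINDOW_MS // 1000   # ~400 samples per frame
stride_samples = SAMPLE_RATE * STRIDE_MS // 1000   # ~320 samples between frames
num_frames = (clip_seconds * SAMPLE_RATE - window_samples) // stride_samples + 1

print(window_samples, stride_samples, num_frames)  # -> 400 320 499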

The question that arises at this point is: How does wav2vec2 learn these speech units from audio alone?

The process of learning speech units begins with the Feature Encoder. wav2vec2 "encodes speech audio via a multi-layer convolutional neural network."

Convolutional neural networks, or CNNs, are models that allow us to learn features from a given input without those features being explicitly identified beforehand.

Each layer of a CNN can be thought of as extracting features from an input, with those features becoming increasingly more complex as you move up to higher layers.

In the case of audio data, you might imagine the first layer in a CNN examining windows of audio information and extracting low-level features, such as primitive sounds.

A later layer in the same CNN, leveraging the lower-level features extracted in earlier layers, would encode higher-level features, such as sounds approximating phonemes.

Following this idea, wav2vec2 can begin to "learn what the speech units should be given samples of unlabeled audio" by passing time slices of each audio example into the Feature Encoder and generating a latent representation of each slice.
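
A toy sketch of this idea is shown below. It is not the actual wav2vec2 implementation, just a small stack of strided 1-D convolutions in PyTorch that, like the real Feature Encoder, turns a raw 16 kHz waveform into a shorter sequence of latent vectors; the layer count and channel sizes here are illustrative.

# Illustrative only: a small strided 1-D CNN that maps a raw waveform to a
# sequence of latent vectors, mimicking the role of wav2vec2's Feature Encoder.
import torch
import torch.nn as nn

feature_encoder = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=10, stride=5), nn.GELU(),   # wide first layer over raw samples
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),  # later layers see a wider context
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
)

waveform = torch.randn(1, 1, 16_000)      # 1 second of fake 16 kHz audio
latents = feature_encoder(waveform)       # shape: (1, 512, num_frames)
print(latents.shape)                      # each column is the latent vector for one time slice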

However, the collection of latent representations does not represent discrete speech units. These representations must be discretized in some way. This is accomplished by passing the output of the Feature Encoder to a Quantization Module.

Effectively, the Quantization Module takes the many different audio representations generated by the Feature Encoder and reduces them to a finite set of speech units.
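
The real Quantization Module learns its speech units end to end using product quantization with a Gumbel softmax, but the core idea, mapping each continuous latent vector to one of a fixed number of codebook entries, can be sketched much more simply. The code below is a simplified, non-trainable stand-in that just picks the nearest entry in a random codebook; the codebook size is illustrative.

# Simplified stand-in for the Quantization Module: map each continuous latent
# vector to its nearest entry in a fixed codebook of "speech units".
# (wav2vec2 actually learns its codebooks with a Gumbel softmax; this is illustrative.)
import torch

NUM_UNITS, LATENT_DIM = 320, 512
codebook = torch.randn(NUM_UNITS, LATENT_DIM)      # one row per discrete speech unit

def quantize(latents):                             # latents: (num_frames, LATENT_DIM)
    distances = torch.cdist(latents, codebook)     # distance of every frame to every unit
    unit_ids = distances.argmin(dim=-1)            # index of the closest unit per frame
    return unit_ids, codebook[unit_ids]            # discrete ids and their codeword vectors

frames = torch.randn(499, LATENT_DIM)              # e.g., ~10 seconds of encoder output
unit_ids, quantized = quantize(frames)
print(unit_ids[:10])                               # the audio rewritten as discrete speech units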

It's worthwhile to ask at this point whether wav2vec2 should be pretrained on a single language or on a variety of languages. Logic tells us that speech units capturing multiple languages, versus a single language, are likely to be more useful when designing ASR systems that can be used across many languages.

To that end, pretraining wav2vec2 with a selection of multilingual audio samples enables the model to produce speech units that do in fact capture multiple languages.

The wav2vec2 investigators noted the value behind this approach since "for some languages, even [audio] data is limited." Their original findings determined "that some units are used for only a particular language, whereas others are used in similar languages and sometimes even in languages that aren't very similar."

Predicting Speech Units

The inventory of speech units is a first step toward being able to encode spoken language audio samples. But, what we really want to achieve is to train wav2vec2 on how these units relate to one another.

In other words, we want to understand what speech units are likely to occur in the same context as one another. wav2vec2 tackles this task via the Transformer layer.

The Transformer essentially allows wav2vec2 to learn, in a statistical sense, how the speech units are distributed among the various audio examples. This understanding facilitates the encoding of audio samples that the model will "see" after pretraining.
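
During pretraining, this takes the form of a masked prediction task: some latent frames are hidden from the Transformer, and the model must identify the correct quantized speech unit for each masked position. The sketch below illustrates only the masking-and-contextualizing step using a generic PyTorch TransformerEncoder; it omits the contrastive loss, and the sizes are illustrative rather than wav2vec2's actual configuration.

# Illustrative sketch of the Transformer step: mask some latent frames, then let
# a Transformer encoder build context-aware representations of every frame.
# (Real wav2vec2 trains this with a contrastive loss against the quantized units.)
import torch
import torch.nn as nn

LATENT_DIM, NUM_FRAMES = 512, 499
latents = torch.randn(1, NUM_FRAMES, LATENT_DIM)   # output of the Feature Encoder

# Randomly mask ~15% of the frames by replacing them with a mask vector
# (in the real model, this mask vector is learned).
mask_embedding = torch.randn(LATENT_DIM)
mask = torch.rand(1, NUM_FRAMES) < 0.15
masked = latents.clone()
masked[mask] = mask_embedding

encoder_layer = nn.TransformerEncoderLayer(d_model=LATENT_DIM, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)

context = transformer(masked)                      # (1, NUM_FRAMES, LATENT_DIM)
# During pretraining, context[mask] would be compared against the true quantized
# units (and distractors) so the model learns which units belong in which context.
print(context.shape)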

Finetuning

Ultimately, an ASR system needs to be able to generate a text transcription for a given sequence of audio that it hasn't "seen" before. After pretraining via the steps described above, wav2vec2 is finetuned for this purpose. This time the model is explicitly shown examples of audio samples and their associated transcriptions.

At this point, the model is able to utilize what it learned during pretraining to encode audio samples as sequences of speech units and to map those sequences of speech units to individual letters in the vocabulary representing the transcriptions (i.e. the letters "a" to "z" in the case of English).

The learning that takes place during finetuning completes the training of the wav2vec2 model and allows it to predict text for new audio examples that were not part of its training data.
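
To see the finished pipeline in action, the sketch below runs a publicly available finetuned checkpoint through the Hugging Face transformers library and decodes the predicted letters into text. The file name is hypothetical, and the audio is assumed to be 16 kHz mono.

# Transcribing new audio with a finetuned wav2vec2 checkpoint from Hugging Face.
# "my_clip.wav" is a hypothetical 16 kHz mono file.
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio, sample_rate = sf.read("my_clip.wav")                 # expects 16 kHz mono audio
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits              # one score per letter per frame

predicted_ids = torch.argmax(logits, dim=-1)                # most likely letter for each frame
transcription = processor.batch_decode(predicted_ids)[0]    # collapse repeats/blanks into text
print(transcription)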

Conclusion

Of course, the low-level mechanics of wav2vec2 are far more complex than what is presented above. However, to reiterate, the idea of this article is to provide you with a simple, conceptual understanding of how the model works and how it is trained.

wav2vec2 is a very powerful ML framework for building ASR systems, and its XLS-R variation, introduced in late 2021, was pretrained on 128 languages, providing an improved platform for designing ASR models across multiple languages.

As mentioned in the Introduction, there are a number of excellent technical resources available to help you learn more. In particular, you may find those provided by Hugging Face to be especially useful.

