Speech-to-text (STT), also known as automatic speech recognition (ASR), has a long history and has made amazing progress over the past decade. Currently, it is often believed that only large corporations like Google, Facebook, or Baidu (or local state-backed monopolies for the Russian language) can provide deployable “in-the-wild” solutions. This is due to several reasons:
In this piece we describe our effort to alleviate these concerns, both globally and for the Russian language, by:
Following the success and democratization of computer vision (the so-called “ImageNet moment”, i.e. the reduction of hardware requirements, time-to-market, and minimal dataset sizes needed to produce deployable products), it is logical to hope that other branches of Machine Learning (ML) will follow suit. The only questions are: when will it happen, and what conditions are necessary for it to happen?
In our opinion, the ImageNet moment in a given ML sub-field arrives when:
If the above conditions are satisfied, one can develop new useful applications at a reasonable cost. Democratization also occurs: one no longer has to rely on giant companies such as Google as the only source of truth in the industry.
If you would like to know more about the philosophy of our work and why we opted for online publications instead of conferences / peer-reviewed papers, please follow this link.
For our experiments we have chosen the following stack of technologies:
There are many ways to approach STT; discussing their drawbacks and advantages is out of scope here. Everything in this article refers to an end-to-end approach using mostly graphemes (i.e. alphabet letters) and neural networks.
In a nutshell, to train an end-to-end grapheme model you just need a lot of small audio files with corresponding transcriptions, i.e. file.wav and transcription.txt. You can also use CTC loss, which alleviates the requirement for time-aligned annotation (otherwise you will need either to provide an alignment table yourself or to learn alignment within your network). A common alternative to CTC loss is the standard categorical cross-entropy loss with attention, but it trains slowly on its own and is usually used together with CTC loss anyway.
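To make this concrete, here is a minimal sketch of such a training step in PyTorch using nn.CTCLoss over a grapheme alphabet. The model, the alphabet, and the tensor shapes are illustrative assumptions made for the example, not our actual training code:

```python
import torch
import torch.nn as nn

# Illustrative grapheme alphabet: index 0 is reserved for the CTC "blank" token.
ALPHABET = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")
char2idx = {c: i for i, c in enumerate(ALPHABET)}

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(model, optimizer, specs, spec_lengths, texts):
    """One hypothetical training step: spectrograms in, CTC loss out.

    `model` is assumed to return log-probabilities of shape (T, N, C),
    i.e. time x batch x alphabet size, the layout nn.CTCLoss expects.
    """
    # Encode transcriptions as grapheme indices (no time alignment needed).
    targets = [torch.tensor([char2idx[c] for c in t]) for t in texts]
    target_lengths = torch.tensor([len(t) for t in targets])
    targets = torch.cat(targets)

    log_probs = model(specs)          # (T, N, C) log-probabilities
    input_lengths = spec_lengths      # number of output frames per utterance

    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```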
This “stack” was chosen for a number of reasons:
All publicly available supervised English datasets that we know of are smaller than 1,000 hours and have very limited variability. Deep Speech 2, a seminal STT paper, suggests that you need at least 10,000 hours of annotated audio to build a proper STT system. 1,000 hours is also a good start, but given the generalization gap (discussed below) you need around 10,000 hours of data across different domains.
Typical academic datasets have the following drawbacks:
Because of these drawbacks, about six months ago we decided to collect and share a spoken corpus in Russian of unprecedented scale, targeting 10,000 hours at first. To our knowledge, a dataset of this size is unprecedented even for the English language. We have seen an attempt at work similar to ours, but despite the government funding, their datasets are not publicly available.
Recently we released a 1.0-beta version of the dataset. It includes the following domains:
Our data-collection process was the following:
You can find our corpus here and you can support our dataset here.
Though this is already substantial, we are not yet done. Our short-term plan is:
P.S. We did all of this; our dataset was even featured in Azure Datasets, and we are now planning to release pre-trained models for three new languages: English / German / Spanish.
A great STT model needs the following characteristics:
We take these as our goals, and describe how we fulfilled them below.
Traditionally, models are selected by benchmarking them on a couple of fixed “ideal” unseen validation datasets. In the previous sections we explained why this is sub-optimal if you have real-world usage in mind and the only datasets available are academic ones. Given limited resources, to properly compare models you need a radically different approach, which we present in this section. Also keep in mind that there is no “ideal” validation dataset when you are dealing with real in-the-wild data: you need to validate on each domain separately.
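To illustrate what validating on each domain separately looks like, here is a toy sketch that reports word error rate (WER) per domain instead of a single aggregate number. The data layout (an iterable of domain / reference / hypothesis triples) is an assumption made for the example:

```python
from collections import defaultdict

def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

def per_domain_wer(samples):
    """`samples` is assumed to yield (domain, reference, hypothesis) triples."""
    scores = defaultdict(list)
    for domain, ref, hyp in samples:
        scores[domain].append(wer(ref, hyp))
    # Report the average per-utterance WER of each domain separately,
    # instead of one aggregate number that hides domain differences.
    return {d: sum(v) / len(v) for d, v in scores.items()}
```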
Usually, when reporting results on some public dataset (e.g. ImageNet), researchers allegedly run full experiments with different hyper-parameters from scratch until convergence. Another good practice is to run so-called ablation tests, i.e. experiments that test whether additional features of a model were actually useful by comparing the performance of the model with and without those features.
In real life, practitioners cannot afford the luxury of running hundreds or thousands of experiments from scratch until convergence, or of building fancy reinforcement-learning code to control experiments. Also, the dominance of over-parameterized methods in the literature and the availability of enterprise-oriented toolkits discourage researchers from deeply optimizing their pipelines. When you explore hardware options, the professional and cloud segments are biased towards expensive and impractical solutions.
Read more here to learn about our model selection methodology.
Initially we started with a fork of Deep Speech 2 in PyTorch. The original Deep Speech 2 model is based on deep LSTM or GRU recurrent networks, which are slow. The above image illustrates the optimizations we were able to add to the original pipeline. More specifically, we were able to do the following without hurting model performance:
The above chart contains only convolutional models, which we found to be much faster than their recurrent counterparts. We arrived at these results through the following process:
So, we then explored the following ideas to improve things:
Please follow this link to learn about each of these ideas in detail.
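To make the convolutional-versus-recurrent point above concrete, here is a toy sketch of an acoustic encoder built from a stack of 1-D convolutions instead of LSTM/GRU layers. The layer sizes and token count are illustrative, not the architecture we actually trained:

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Toy convolutional acoustic encoder: 1-D convolutions over spectrogram
    frames in place of recurrent layers. All sizes are illustrative."""

    def __init__(self, n_mels=64, hidden=512, n_tokens=30, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(n_layers):
            layers += [
                nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            ]
            in_ch = hidden
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, n_tokens, kernel_size=1)

    def forward(self, specs):          # specs: (N, n_mels, T)
        frames = self.body(specs)      # (N, hidden, T)
        logits = self.head(frames)     # (N, n_tokens, T)
        # Return (T, N, C) log-probabilities, the layout nn.CTCLoss expects.
        return logits.permute(2, 0, 1).log_softmax(dim=-1)
```

Unlike a recurrent encoder, every frame here is processed in parallel, which is one reason such models tend to be much faster in practice.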
In real life it is expected that if a model is trained on one domain, there will be a significant generalization gap on another. But is there a generalization gap in the first place? If there is, what are the main differences between domains? Can you train one model to work well on many reasonable domains with a decent signal-to-noise ratio?
There is a generalization gap, and you can even deduce which ASR systems were trained on which domains. Also, with the ideas above, you can train a model that will perform decently even on unseen domains.
According to our observations, these are the main differences that cause the generalization gap between domains:
This benchmark includes both an acoustic model and a language model. The acoustic model is run on a GPU, the results are accumulated, and then language-model post-processing is run on multiple CPUs.
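A rough sketch of that two-stage layout is shown below, assuming a hypothetical acoustic model and a placeholder for language-model decoding; this is the shape of the pipeline, not our production code:

```python
import torch
from multiprocessing import Pool

def run_acoustic_model(model, batches, device="cuda"):
    """Stage 1: run the acoustic model on the GPU and accumulate outputs on CPU."""
    model = model.to(device).eval()
    outputs = []
    with torch.no_grad():
        for batch in batches:
            log_probs = model(batch.to(device))  # per-frame grapheme log-probs
            outputs.append(log_probs.cpu())
    return outputs

def lm_postprocess(log_probs):
    """Stage 2 worker: hypothetical language-model decoding on CPU.

    In a real pipeline this would be e.g. a beam-search decoder with an
    external LM; here a greedy decode stands in as a placeholder.
    """
    return log_probs.argmax(dim=-1)

def transcribe(model, batches, n_workers=8):
    gpu_outputs = run_acoustic_model(model, batches)  # GPU-bound stage
    with Pool(n_workers) as pool:                     # CPU-bound stage
        return pool.map(lm_postprocess, gpu_outputs)
```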
For more detailed benchmarks, some thoughts on production usage, and benchmark analysis, please go here. For the most up-to-date benchmarks, please go here (Russian).
Here is a list of ideas that we tested (some of which even worked), but in the end we decided that their complexity does not justify the benefits they provide:
Author Bio
Alexander Veysov is a Data Scientist at Silero, a small company building NLP / Speech / CV enabled products, and the author of Open STT. Silero has recently shipped its own Russian STT engine. Previously he worked at a Moscow-based VC firm and at Ponominalu.ru, a ticketing startup acquired by MTS (a major Russian telco). He received his BA and MA in Economics from the Moscow State Institute of International Relations (MGIMO). You can follow his Telegram channel (@snakers41).
Originally published at https://thegradient.pub on March 28, 2020.
Acknowledgments
Thanks to Andrey Kurenkov and Jacob Anderson from The Gradient for their contributions to this piece.
Citation
For attribution in academic contexts or books, please cite this work as
Alexander Veysov, “Towards an ImageNet Moment for Speech-to-Text”, The Gradient, 2020.
BibTeX citation
@article{veysov2020towardimagenetstt,
author = {Veysov, Alexander},
title = {Towards an ImageNet Moment for Speech-to-Text},
journal = {The Gradient},
year = {2020},
howpublished = {\url{https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/}},
}