We Released Modern Google-level Speech-to-Text Models

by Alexander Veysov, October 7th, 2020

Too Long; Didn't Read

We are proud to announce that we have built from the ground up and released our high-quality (i.e. on par with premium Google models) speech-to-text models for the following languages: English, German, and Spanish. You can find all of our models in our repository, together with examples and quality and performance benchmarks. We also invested some time into making our models as accessible as possible: you can try our examples as well as PyTorch, ONNX, and TensorFlow checkpoints. You can also load our models via TorchHub.

Our models are on par with premium Google models and also really simple to use.

We are proud to announce that we have built from the ground up and released our high-quality (i.e. on par with premium Google models) speech-to-text models for the following languages:

English;
German;
Spanish;

You can find all of our models in our repository, together with examples and quality and performance benchmarks. We also invested some time into making our models as accessible as possible: you can try our examples as well as PyTorch, ONNX, and TensorFlow checkpoints. You can also load our models via TorchHub.
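To give a feel for the TorchHub route, here is a minimal sketch. The repository name comes from the link in this post; the entry point name `silero_stt`, the two-letter language codes, and the `(model, decoder, utils)` return shape are assumptions based on the repository README, so please check the README for the current interface. The helper names `hub_arguments` and `load_stt_model` are made up for this illustration.

```python
# Language names come from this post; the two-letter codes are an assumption
# based on the repository README.
LANGUAGE_CODES = {"English": "en", "German": "de", "Spanish": "es"}


def hub_arguments(language_name):
    """Build keyword arguments for torch.hub.load (no network access here)."""
    if language_name not in LANGUAGE_CODES:
        raise ValueError(f"no released model for {language_name!r}")
    return {
        "repo_or_dir": "snakers4/silero-models",
        "model": "silero_stt",  # assumed entry point name
        "language": LANGUAGE_CODES[language_name],
    }


def load_stt_model(language_name, device_name="cpu"):
    """Download (on first call) and return whatever the hub entry point returns."""
    import torch  # imported lazily so hub_arguments works without PyTorch

    device = torch.device(device_name)
    # Assumed return value: a (model, decoder, utils) tuple, as in the README.
    return torch.hub.load(device=device, **hub_arguments(language_name))
```

With PyTorch installed and network access available, `model, decoder, utils = load_stt_model("English")` should fetch the English checkpoint on the first call. Building the arguments in a separate function keeps the language check usable even on machines without PyTorch.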

Please go here to see the original table https://github.com/snakers4/silero-models#getting-started

Why This is a Big Deal

Speech-to-text has traditionally had high barriers to entry for a number of reasons:

Hard-to-collect data
Costly annotation and high data requirements
High compute requirements and reliance on obsolete, hard-to-use technologies

Here are some of the typical problems that existing ASR solutions and approaches had before our release:

STT research typically assumed huge compute budgets
Pre-trained models and recipes did not generalize well, were difficult to use even as-is, and relied on obsolete tech
Until now, the STT community lacked easy-to-use, high-quality, production-grade STT models

First, we tried to alleviate some of these problems for the community by publishing the largest Russian spoken corpus in the world (see our Habr post here). Now we are trying to solve these problems as follows:

We publish a set of pre-trained high-quality models for popular languages
Our models are designed to be robust across different domains, as you can see in our benchmarks
Our models are pre-trained on vast and diverse datasets
Our models are fast and can be run on commodity hardware
Our models are easy to use

Embarrassing Simplicity

We believe that modern technology should be embarrassingly simple to use. In our work we follow these design principles:

Models should be compact and fast
Models should generalize across domains: there should be one general solution tailored superficially to particular domains, not vice versa
Models should be easy to use

Further Plans

Right now, the smallest we have been able to compress our models to is around 50 megabytes. We still plan to compress our Enterprise Edition models down to ~20 megabytes without loss of fidelity. We are also planning to release Community Edition models for other popular languages.
