We Released Modern Google-level Speech-to-Text Models

by Alexander Veysov, October 7th, 2020
Our models are on par with premium Google models and also really simple to use.

We are proud to announce that we have built from the ground up and released our high-quality (i.e. on par with premium Google models) speech-to-text models for the following languages:

  • English;
  • German;
  • Spanish.

You can find all of our models in our repository, together with examples and quality and performance benchmarks. We also invested some time into making our models as accessible as possible: you can try our examples as well as PyTorch, ONNX, and TensorFlow checkpoints. You can also load our models via TorchHub.
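To illustrate the TorchHub route, here is a minimal sketch. It assumes the repository exposes a `silero_stt` entry point that accepts `language` and `device` and returns a model, a decoder, and a tuple of helper functions in the order shown. Those names and that order follow the repository README as we understand it, so check the README for the current interface. The helpers `load_silero_stt` and `transcribe` are hypothetical wrappers written for this example.

```python
# Language codes for the three released models (English, German, Spanish).
SUPPORTED_LANGUAGES = {"en", "de", "es"}


def load_silero_stt(language="en", device_name="cpu"):
    """Load a pre-trained speech-to-text model through TorchHub.

    Assumption: the `silero_stt` entry point takes `language` and `device`
    and returns (model, decoder, utils).
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")

    import torch  # imported lazily so the check above works without PyTorch

    device = torch.device(device_name)
    model, decoder, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-models",
        model="silero_stt",
        language=language,
        device=device,
    )
    return model, decoder, utils, device


def transcribe(paths, language="en", batch_size=10):
    """Transcribe a list of audio file paths and return one string per file."""
    model, decoder, utils, device = load_silero_stt(language)
    # Assumed order of the helper functions returned by the entry point.
    read_batch, split_into_batches, read_audio, prepare_model_input = utils

    texts = []
    for batch in split_into_batches(paths, batch_size=batch_size):
        model_input = prepare_model_input(read_batch(batch), device=device)
        for example in model(model_input):
            texts.append(decoder(example.cpu()))
    return texts
```

Calling `transcribe(["speech.wav"])`, where `speech.wav` is a placeholder for your own recording, downloads the model on first use and therefore needs network access.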

The original table is available at https://github.com/snakers4/silero-models#getting-started

Why This is a Big Deal

Speech-to-text has traditionally had high barriers to entry for a number of reasons:

  • Hard-to-collect data
  • Costly annotation and high data requirements
  • High compute requirements and reliance on obsolete, hard-to-use technologies

Here are some of the typical problems with the ASR solutions and approaches that existed before our release:

  • STT research typically focused on huge compute budgets
  • Pre-trained models and recipes did not generalize well, were difficult to use even as-is, and relied on obsolete tech
  • Until now, the STT community lacked easy-to-use, high-quality, production-grade STT models

First, we tried to alleviate some of these problems for the community by publishing the largest Russian spoken corpus in the world (see our Habr post here). Now we are trying to solve these problems as follows:

  • We publish a set of pre-trained high-quality models for popular languages
  • Our models are designed to be robust across different domains, as you can see in our benchmarks
  • Our models are pre-trained on vast and diverse datasets
  • Our models are fast and can be run on commodity hardware
  • Our models are easy to use
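One common way to check a speed claim like this on your own hardware is the real-time factor: processing time divided by audio duration, where a value below 1 means faster than real time. The sketch below is a generic measurement helper and not the benchmark procedure used in our repository. The function `real_time_factor` and its parameters are hypothetical names chosen for this example.

```python
import time


def real_time_factor(transcribe_fn, audio, audio_seconds, repeats=3):
    """Return the best-of-`repeats` processing time divided by audio duration.

    `transcribe_fn` is any callable that takes `audio` and runs recognition;
    `audio_seconds` is the duration of that audio in seconds.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe_fn(audio)
        best = min(best, time.perf_counter() - start)
    return best / audio_seconds
```

Taking the best of several runs reduces the influence of one-off effects such as model warm-up or disk caching.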

Embarrassing Simplicity

We believe that modern technology should be embarrassingly simple to use. In our work we follow these design principles:

  1. Models should be compact and fast
  2. Models should generalize across domains: there should be one general solution tailored superficially to particular domains, not vice versa
  3. Models should be easy to use

Further Plans

Currently, the smallest we have been able to compress our models to is around 50 megabytes. We still plan to compress our Enterprise Edition models down to ~20 megabytes without loss of fidelity. We are also planning to release Community Edition models for other popular languages.

Originally published at https://habr.com.