Speech to text transcription in 40 lines of Bash

by Sara Robinson, June 2nd, 2017

Ever wanted to build an app that takes audio input from users? There are tons of benefits to integrating audio into an app — from simply transcribing the audio file to running NLP on the text output. But speech recognition is not a trivial task, and chances are you want to focus on building your app rather than on building and training a model to transcribe streams of audio.

Enter machine learning APIs! With ML as an API, you can add ML to your applications without reinventing the wheel by taking advantage of pre-trained models that accomplish common ML tasks.

Show me the code

I wrote a simple Bash script that makes an audio recording using SoX, base64-encodes the audio file, and sends it to the Cloud Speech API. The full code is here; below I’ll highlight the key steps to get it running.

Install SoX

If you’ve got Homebrew installed, you can install SoX with a simple: brew install sox --with-flac (the --with-flac flag adds support for FLAC encoding, which we’ll need for the Speech API). You can also install SoX directly. SoX gives us the rec command, which we’ll use to record our audio.
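For reference, here are the install commands (the Homebrew one is from above; the Debian/Ubuntu package names are an assumption and may vary by distribution):

brew install sox --with-flac              # macOS via Homebrew, with FLAC support
sudo apt-get install sox libsox-fmt-flac  # Debian/Ubuntu (assumed package names)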

Record and base64 encode your audio

With SoX installed, you can record 5 seconds of audio with the following:

rec --channels=1 --bits=16 --rate=16000 audio.flac trim 0 5

This will record 5 seconds of audio using FLAC encoding, and save it to a file called audio.flac. We’ll then base64 encode this file to send it to the Speech API (it also supports other formats):

base64 audio.flac > audio.base64
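If you’d like to sanity-check the recording before sending it off, SoX also ships with play and soxi commands:

play audio.flac   # listen back to the recording
soxi audio.flac   # show encoding details (sample rate, channels, bit depth)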

Now it’s ready to send to the Speech API! We’ll prepare our request in a JSON file.

Create a Speech API request

In Bash, we can create our request JSON file:

FILENAME="request-"`date +"%s".json`
cat <<EOF > $FILENAME
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US",
    "speechContexts": {
      "phrases": [""]
    },
    "maxAlternatives": 1
  },
  "audio": {
    "content":
  }
}
EOF

Notice that we left content blank; we’ll fill it in with our base64-encoded audio content:

sed -i '' -e '/"content":/r audio.base64' $FILENAME
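The empty string after -i is the macOS (BSD) sed syntax for “no backup file”; on GNU sed (most Linux distros) the equivalent is:

sed -i -e '/"content":/r audio.base64' $FILENAME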

And we’re ready to make our request with curl:

curl -s -X POST -H "Content-Type: application/json" --data-binary @${FILENAME} https://speech.googleapis.com/v1/speech:recognize?key=YOUR_API_KEY

Here, ${FILENAME} is a Bash variable with the name of your request.json file (in the full Bash script I append a timestamp to the filenames so they don’t get overwritten). Replace YOUR_API_KEY with a Cloud API key — you can generate one in the Credentials section of your Cloud console.
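If the call succeeds, the API returns JSON along these lines (the transcript and confidence values here are purely illustrative):

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hello from the Speech API",
          "confidence": 0.98
        }
      ]
    }
  ]
}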

That’s it! You’ve got speech to text transcription in 40 lines of Bash.

The script also supports calling the API with files in different languages. When you run the script, just add a language code as a command-line parameter, e.g. bash request.sh fr. This will change the languageCode portion of the JSON request to whichever language you’ve specified.
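One way to wire this up (a sketch of the approach; the variable name is my choice, not necessarily what the script uses) is to read the first positional parameter and let the heredoc expand it:

LANGUAGE=${1:-en-US}   # e.g. "fr" when run as: bash request.sh fr

# inside the heredoc, the variable expands automatically:
"languageCode": "$LANGUAGE",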

What else can you do with the Cloud Speech API?

My Bash example above shows sending the API a complete audio file (a batch request). The API also supports streaming, so you can send it a continuous stream of audio and the API will return interim results. Find the docs on streaming recognition here.

The speechContexts parameter in the request is one of my favorite Speech API features. Let’s say you’ve got a proper noun or some jargon that’s very specific to your application. The pre-trained model probably won’t recognize these industry-specific terms, but you can provide hints that tell the model to look out for them. For example, I’ve spoken at a few DevFests — the name of a specific type of Google Developer Group event. If I pass the API an audio file with “DevFest,” it’ll typically transcribe it as “death fest,” which is definitely not the same thing! When I provide a hint via the speechContexts param, it transcribes “DevFest” perfectly.
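In the request JSON from earlier, that hint would look something like this:

"speechContexts": {
  "phrases": ["DevFest"]
},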

Who is using the Speech API?

While transcribing a 5 second audio file is fun, you’re probably wondering how this is being used in production. Here are two examples:

  • Twilio: they recently announced an Automated Speech Recognition service for transcribing calls, powered by the Speech API.
  • Azar: a chat application that has connected over 15 billion matches, using the Speech API to transcribe any audio snippets sent between matches.

Start transcribing audio

Here’s a list of all the resources mentioned in this post:

  • SoX, the command-line tool used to record the audio
  • The full Bash script for calling the Speech API
  • The Cloud Speech API documentation, including the streaming recognition docs

Enjoy, and let me know if you have any Speech API feedback on Twitter @SRobTweets.