Ever wanted to build an app that takes audio input from users? There are tons of benefits to integrating audio into an app, from simply transcribing the audio file to running NLP on the text output. But speech recognition is not a trivial task, and chances are you want to focus on building your app rather than on building and training a model to transcribe streams of audio.
Enter machine learning APIs! With ML as an API, you can add ML to your applications without reinventing the wheel by taking advantage of pre-trained models that accomplish common ML tasks.
I wrote a simple Bash script that makes an audio recording using SoX, base64 encodes the audio file, and sends it to the Cloud Speech API. The full code is here; below I'll highlight the key steps to get it running.
If you've got Homebrew installed, you can install SoX with a simple brew install sox --with-flac (the --with-flac flag adds support for FLAC encoding, which we'll need for the Speech API). You can also install SoX directly. SoX gives us the rec command, which we'll use to record our audio.
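Before recording, it's worth confirming the install worked (a quick sanity check; any recent SoX install provides both commands):
# Confirm SoX and its rec command are on your PATH
sox --version
command -v rec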
With SoX installed, you can record 5 seconds of audio with the following:
rec --channels=1 --bits=16 --rate=16000 audio.flac trim 0 5
This will record 5 seconds of audio using FLAC encoding, and save it to a file called audio.flac.
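If you want to hear the recording before sending it anywhere, SoX also installs a play command (a quick sanity check, not required for the API call):
# Play the recording back through your speakers
play audio.flac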
Next we'll base64 encode this file to send it to the Speech API (it also supports other formats). One detail: the sed step below drops this file straight into the JSON request, so we wrap the output in quotes and strip any line breaks to make it a valid JSON string value:
echo "\"$(base64 audio.flac | tr -d '\n')\"" > audio.base64
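Since that file gets dropped straight into the JSON request, it should be a single quoted line. A quick spot check (just a sketch):
# The output should start with a double quote and be a single line
head -c 40 audio.base64
wc -l audio.base64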
Now it’s ready to send to the Speech API! We’ll prepare our request in a JSON file.
In Bash, we can create a request file (timestamped so runs don't overwrite each other):
FILENAME="request-$(date +%s).json"
cat <<EOF > "$FILENAME"
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US",
    "speechContexts": [{
      "phrases": [""]
    }],
    "maxAlternatives": 1
  },
  "audio": {
    "content":
  }
}
EOF
Notice we left content blank; we'll fill it in with our base64-encoded audio content:
sed -i '' -e '/"content":/r audio.base64' $FILENAME
(The -i '' syntax is for the BSD sed that ships with macOS; with GNU sed on Linux, use -i alone.)
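At this point the request file should be complete, valid JSON. You can verify it parses before spending an API call (assuming python3 is available):
# A parse error here means the base64 insert went wrong
python3 -m json.tool "$FILENAME" > /dev/null && echo "request JSON is valid"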
And we're ready to make our request with curl:
curl -s -X POST -H "Content-Type: application/json" --data-binary @${FILENAME} "https://speech.googleapis.com/v1/speech:recognize?key=YOUR_API_KEY"
(Quoting the URL keeps the shell from interpreting the ? in the query string.)
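The response comes back as JSON. If you only want the transcript text, you can pipe the response through a one-liner (a sketch assuming python3 is installed; the results/alternatives structure is the documented Speech API response shape, and this errors out if no speech was recognized):
# Print just the top transcript from the response
curl -s -X POST -H "Content-Type: application/json" \
  --data-binary @${FILENAME} \
  "https://speech.googleapis.com/v1/speech:recognize?key=YOUR_API_KEY" \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["results"][0]["alternatives"][0]["transcript"])'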
Here, ${FILENAME} is a Bash variable with the name of your request JSON file (in the full Bash script I append a timestamp to the filenames so they don't get overwritten). Replace YOUR_API_KEY with a Cloud API key; you can generate one in the Credentials section of your Cloud console.
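One small convenience (my convention, not an API requirement): keep the key in an environment variable so it isn't hard-coded into the command:
# Set once per shell session, then use ?key=${SPEECH_API_KEY} in the curl URL
export SPEECH_API_KEY="YOUR_API_KEY"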
That's it! You've got speech-to-text transcription in 40 lines of Bash.
The script also supports calling the API with files in different languages. When you run the script, just add a language code as a command line parameter, e.g. bash request.sh fr. This will change the languageCode portion of the JSON request to whichever language you've specified.
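In a script, that parameterization can be as simple as interpolating a shell variable into the heredoc, since heredocs expand variables (a sketch; LANGUAGE is my variable name, defaulting to en-US, and speechContexts is omitted here for brevity):
# Take the language code from the first argument, falling back to en-US
LANGUAGE=${1:-en-US}
cat <<EOF > "$FILENAME"
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "$LANGUAGE",
    "maxAlternatives": 1
  },
  "audio": {
    "content":
  }
}
EOF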
My Bash example above shows sending the API a complete audio file (a batch request). The API also supports streaming, so you can send it a continuous stream of audio and the API will return interim results. Find the docs on streaming recognition here.
The speechContexts parameter in the request is one of my favorite Speech API features. Let's say you've got a proper noun or some jargon that's very specific to your application. The pre-trained model probably won't recognize these industry-specific terms, but you can provide hints that tell the model to look out for them. For example, I've spoken at a few DevFests (the name of a specific type of Google Developer Group event). If I pass the API an audio file with "DevFest" it'll typically transcribe it as "death fest," definitely not the same thing! When I provide a hint via the speechContexts param, it transcribes "DevFest" perfectly.
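In the request JSON from earlier, that hint is just a phrase in the speechContexts block; add as many phrases as you need:
"speechContexts": [{
  "phrases": ["DevFest"]
}]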
While transcribing a 5 second audio file is fun, you're probably wondering how this is being used in production.
Enjoy, and let me know if you have any Speech API feedback on Twitter @SRobTweets.