Ever wanted to build an app that takes audio input from users? There are tons of benefits to integrating audio into an app, from simply transcribing the audio file to running NLP on the text output. But speech recognition is not a trivial task, and chances are you want to focus on building your app rather than on building and training a model to transcribe streams of audio. Enter machine learning APIs! With ML as an API, you can add ML to your applications without reinventing the wheel, taking advantage of pre-trained models that accomplish common ML tasks.

## Show me the code

I wrote a simple Bash script that makes an audio recording using SoX, base64 encodes the audio file, and sends it to the Cloud Speech API. The full code is here; below I'll highlight the key steps to get it running.

## Install SoX

If you've got Homebrew installed, you can install SoX with a simple:

```shell
brew install sox --with-flac
```

(The `--with-flac` flag adds support for FLAC encoding, which we'll need for the Speech API.) You can also install SoX directly. SoX gives us the `rec` command, which we'll use to record our audio.

## Record and base64 encode your audio

With SoX installed, you can record 5 seconds of audio with the following:

```shell
rec --channels=1 --bits=16 --rate=16000 audio.flac trim 0 5
```

This will record 5 seconds of audio using FLAC encoding and save it to a file called `audio.flac`. We'll then base64 encode this file to send it to the Speech API (it also supports other formats):

```shell
base64 audio.flac > audio.base64
```

Now it's ready to send to the Speech API! We'll prepare our request in a JSON file.
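One portability note (my addition, not from the original script): GNU `base64` on Linux wraps its output at 76 columns by default, which would corrupt the JSON we're about to splice it into, while the macOS `base64` writes a single line. A small helper that sidesteps the difference:

```shell
# encode FILE: base64-encode FILE as a single unwrapped line.
# GNU base64 wraps at 76 columns by default; stripping newlines
# makes the output safe to embed in a JSON string on any platform.
encode() {
  base64 "$1" | tr -d '\n'
}
```

With this, `encode audio.flac > audio.base64` keeps the whole payload on one line.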
## Create a Speech API request

In Bash, we can create a `request.json` file:

```shell
FILENAME="request-"`date +"%s".json`
cat <<EOF > $FILENAME
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US",
    "speechContexts": {
      "phrases": [""]
    },
    "maxAlternatives": 1
  },
  "audio": {
    "content":
  }
}
EOF
```

Notice we left `content` blank; we'll fill it in with our base64 encoded audio content:

```shell
sed -i '' -e '/"content":/r audio.base64' $FILENAME
```

And we're ready to make our request with `curl`:

```shell
curl -s -X POST -H "Content-Type: application/json" \
  --data-binary @${FILENAME} \
  "https://speech.googleapis.com/v1/speech:recognize?key=YOUR_API_KEY"
```

Here, `${FILENAME}` is a Bash variable with the name of your `request.json` file (in the full Bash script I append a timestamp to the filenames so they don't get overwritten). Replace `YOUR_API_KEY` with a Cloud API key; you can generate one in the Credentials section of your Cloud console.

That's it! You've got speech-to-text transcription in 40 lines of Bash. The script also supports calling the API with files in different languages: when you run the script, just add a language code as a command line parameter, e.g. `bash request.sh fr`. This will change the `languageCode` portion of the JSON request to whichever language you've specified.

## What else can you do with the Cloud Speech API?

My Bash example above shows sending the API a complete audio file (a batch request). The API also supports streaming, so you can send it a continuous stream of audio and the API will return interim results. Find the docs on streaming recognition here.

The `speechContext` parameter in the request is one of my favorite Speech API features. Let's say you've got a proper noun or some jargon that's very specific to your application. The pre-trained model probably won't recognize these industry-specific terms, but you can provide hints that tell the model to look out for them. For example, I've spoken at a few DevFests, the name of a specific type of Google Developer Group event.
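A request carrying such a hint might look roughly like this (a sketch based on the v1 REST reference, where `speechContexts` takes a list of objects; the `content` value is a placeholder):

```json
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US",
    "speechContexts": [{"phrases": ["DevFest"]}],
    "maxAlternatives": 1
  },
  "audio": {
    "content": "..."
  }
}
```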
If I pass the API an audio file with "DevFest" in it, it'll typically transcribe it as "death fest," which is definitely not the same thing! When I provide a hint via the `speechContext` param, it transcribes "DevFest" perfectly.

## Who is using the Speech API?

While transcribing a 5 second audio file is fun, you're probably wondering how this is being used in production. Here are two examples:

- Twilio: they recently announced an Automated Speech Recognition service for transcribing calls, powered by the Speech API.
- Azar: a chat application that has connected over 15 billion matches, using the Speech API to transcribe audio snippets sent between matches.

## Start transcribing audio

Here's a list of all the resources mentioned in this post:

- Bash script code
- Speech API documentation
- Section of my Cloud Next talk on the Speech API

Enjoy, and let me know if you have any Speech API feedback on Twitter: @SRobTweets.