Last month Cloud Speech introduced a new word-level timestamps feature: audio transcriptions now include the start and end timestamp for each word. This opens up tons of possibilities: developers can now skip to the exact moment in an audio file where a word was spoken, display the relevant text while audio is playing, or search a library of audio for a specific term.

With the ability to search an audio file I wanted to try this feature out on videos. To do this I extracted the audio track from a video file, sent it to Cloud Speech, and built a frontend for searching the audio transcription JSON. The result is a demo built on my recent ML API presentation (it’s best viewed with sound).

In addition to the Speech API, the demo also uses Cloud Functions, Cloud Storage, and App Engine for hosting. Here’s a diagram of how the backend works:

[Figure: Architecture of Speech Timestamps demo]

Step 1: Extracting audio with ffmpeg and Cloud Functions

Because Cloud Speech lets you provide the Cloud Storage URL for an audio file to transcribe, I decided to store all of my video and audio content in Cloud Storage. Then in my App Engine frontend I could get the video and associated transcription JSON directly from Cloud Storage.

I wanted to be able to drop a video file into Cloud Storage and automatically have the transcription show up in another storage bucket. Sounds magical, right? I implemented this with Cloud Functions: a compute solution for writing functions that are automatically triggered by certain cloud events. Functions are written in Node.js, and you specify the type of event that will trigger each function. In this case I triggered my function every time a new file was added to my video bucket.
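As a rough sketch of what a storage-triggered function has to decide before doing any work, the helpers below skip events that are not new video uploads and derive the name of the audio file to produce. The helper names (`isNewVideo`, `flacNameFor`, `onVideoAdded`), the list of video extensions, and the event fields (`name`, `bucket`, `resourceState`) are assumptions for illustration, not the code from the demo.

```javascript
const path = require('path');

// Assumed set of video extensions; adjust to whatever you upload.
const VIDEO_EXTENSIONS = ['.mp4', '.mov', '.webm'];

// Hypothetical helper: true when the event data describes a video that now
// exists in the bucket (as opposed to a deletion or a non-video file).
// `resourceState` is an assumed field name for the object's state.
function isNewVideo(file) {
  if (!file || typeof file.name !== 'string') return false;
  if (file.resourceState === 'not_exists') return false;
  const ext = path.posix.extname(file.name).toLowerCase();
  return VIDEO_EXTENSIONS.includes(ext);
}

// Hypothetical helper: map "talks/demo.mp4" to "talks/demo.flac".
function flacNameFor(videoName) {
  const parsed = path.posix.parse(videoName);
  return path.posix.join(parsed.dir, `${parsed.name}.flac`);
}

// Sketch of the trigger entry point. `extract` stands in for the real work
// (download, transcode, upload) and is injected so the sketch stays testable.
function onVideoAdded(event, extract) {
  const file = event.data;
  if (!isNewVideo(file)) return Promise.resolve(null);
  return Promise.resolve(
    extract(file.bucket, file.name, flacNameFor(file.name))
  );
}

module.exports = { isNewVideo, flacNameFor, onVideoAdded };
```

Keeping the filtering and naming logic in small pure functions means it can be checked locally without deploying anything or touching a bucket.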
I split the transcription process into two functions:

**extractAudio**: Extract the audio from a video and transcode it into a format the Speech API accepts (I used FLAC encoding).

**transcribeAudio**: Send the FLAC file to the Speech API and upload transcriptions to Cloud Storage.

The extractAudio function uses the google-cloud Node module for accessing Cloud Storage and fluent-ffmpeg for extracting and transcoding audio. To get ffmpeg working in my Cloud Functions environment I needed to upload the ffmpeg binaries when I deployed my function and tell fluent-ffmpeg the path to those binaries. Both modules are declared as npm dependencies of the function, and there is a variable for each of our Cloud Storage buckets: one for videos, one for the FLAC audio files, and one for the transcription JSON.

The function will do the following: download the video file from Cloud Storage, extract the audio and transcode it to FLAC format for Cloud Speech, and upload the FLAC file to Cloud Storage. It receives an event parameter, which gives us data on the file that triggered the event.

Next we’ll write the function to download our video from Cloud Storage. We can save the file to a local disk in our Cloud Functions environment by writing it to the /tmp directory. Once we’ve got the video file available locally in Cloud Functions, we’re ready to extract and transcode the audio with ffmpeg.

The last step is to upload the FLAC file to a new Cloud Storage bucket. You can find the code for uploading files in the google-cloud documentation.

Step 2: Transcribing audio with Cloud Speech

To get our audio transcription and timestamp data we’ll write a Cloud Function called transcribeAudio, which will be triggered whenever a FLAC file is added to our audio bucket.
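To make the request side concrete, here is a minimal sketch of the payload such a function might send to the Speech API for a FLAC file in Cloud Storage. `enableWordTimeOffsets` is the option that turns on word-level timestamps; the helper name `buildSpeechRequest`, the bucket and file names, and the `en-US` language code are assumptions for illustration.

```javascript
// Hypothetical helper: build a recognition request for a FLAC object in
// Cloud Storage. The bucket and file names would come from the triggering event.
function buildSpeechRequest(bucketName, fileName, languageCode = 'en-US') {
  if (!fileName.toLowerCase().endsWith('.flac')) {
    throw new Error(`Expected a FLAC file, got: ${fileName}`);
  }
  return {
    config: {
      encoding: 'FLAC',
      languageCode,                // assumed default; set to your audio's language
      enableWordTimeOffsets: true, // ask for start and end times on every word
    },
    audio: {
      uri: `gs://${bucketName}/${fileName}`, // Cloud Storage URI of the audio
    },
  };
}

module.exports = { buildSpeechRequest };
```

For example, `buildSpeechRequest('my-audio-bucket', 'talks/demo.flac')` yields a request whose `audio.uri` is `gs://my-audio-bucket/talks/demo.flac`. Rejecting non-FLAC names early keeps a stray upload in the audio bucket from starting a transcription job.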
For this function we’ll need to instantiate a speech client with the google-cloud Node module and then write our transcription function. We just need to call longRunningRecognize() to make a request to Cloud Speech with our client. This will kick off a long running speech operation and return the final transcription results when it finishes. We can then write the transcriptions to a local JSON file. The last step is uploading our JSON file to Cloud Storage in the same way we did in the first function.

Woohoo! Now we’ve got an entirely serverless solution that generates timestamp transcriptions from a video. Note that you’ll want to periodically delete the contents of /tmp from your Cloud Functions file system to avoid hitting a memory limit. You can do this with the rimraf npm module.

Get Started

To start using the timestamps functionality in your own apps, head over to the Speech API timestamp docs. For details on Cloud Functions check out the docs here or watch my teammate Bret’s awesome talk on Cloud Functions.

I’d love to see what you build with the Speech API and Cloud Functions. Let me know what you think in the comments or find me on Twitter @SRobTweets.

Making audio searchable with Cloud Speech
