How to Transcribe Your Video Conferences With Automated Speech Recognition


by Griffin, June 15th, 2022

Too Long; Didn't Read

In this blog, we demonstrate how to take recordings from your Dolby.io Communications conferences and use Deepgram to transcribe them to text. We will be using the Dolby.io Communications REST APIs in tandem with Deepgram’s Pre-recorded Audio API in Python to turn a conference recording into a text transcript. This increases accessibility for those who are hearing impaired, speak English as a second language, or simply prefer to read on their own time.

In this digital age where virtual conferences are a dime a dozen, we see a large number of them recorded for posterity. There are many uses for these recordings, including sharing with people who were unable to attend live, distributing for use as training, and keeping backups for future reference. One aspect of these recordings that is taken for granted, however, is accessibility. In this blog, we will demonstrate how to take recordings from your Dolby.io Communications conferences and use Deepgram to transcribe them to text.

Having text copies of your conference recordings is a good way to offer alternative ways to digest the information. Some people read faster than they listen to spoken words. Some people might not speak the first language used in the conference and are more comfortable reading it. Others might be hearing impaired and prefer to read for the most comfort. Whatever the reason, we want to make it simple to automate the transcription process. Here, we will be using the Dolby.io Communications REST APIs in tandem with Deepgram’s Pre-recorded Audio API in Python as one example of this workflow; ideally, it could be recreated with the tools of your choosing.

Installing Libraries

Before we begin coding, we need to ensure we have all the proper libraries for calling these APIs. We can do this with a simple pip command (use the appropriate pip command for your operating system):

pip3 install deepgram-sdk dolbyio-rest-apis

This will install both the Dolby.io and Deepgram SDKs. Python’s native asyncio library, which we will use to call the async requests the two SDKs expose, ships with Python 3 and needs no separate install.

It is also a good idea to sign up for free Dolby.io and Deepgram accounts if you haven’t already, to get your API credentials.

Obtaining an API Token

In order to use the Dolby.io Communications REST APIs, we need to first generate a temporary access token. This helps prevent your permanent account credentials from being accidentally leaked, as the token expires automatically.

To learn more about this, read the documentation.

In this case, we want to fill in the consumer key and secret with the credentials from our Dolby.io Communications APIs (not Media APIs). We then call the get_api_access_token endpoint within a function so we can generate a fresh token every time we make a call. This is not the most secure way to handle authentication, but it ensures we don’t run into expired credentials down the road. To learn more, see the security best practices guide.

from dolbyio_rest_apis.communications import authentication
import asyncio

# Input your Dolby.io Communications credentials here
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"

# Create a function that will generate a new API access token when needed
async def gen_token():
    response = await authentication.get_api_access_token(CONSUMER_KEY, CONSUMER_SECRET)
    return response['access_token']

print(f"Access Token: {await gen_token()}")

Getting the Conference ID

Now that we can call the APIs, we first want to get the internal conference ID of the recording we want to transcribe. We can do this by simply calling the get_conferences endpoint with our token.

from dolbyio_rest_apis.communications.monitor import conferences
response = await conferences.get_conferences(await gen_token())
# Save the most recent conference. Change '-1' to whichever conference you want.
confId = response['conferences'][-1]['confId']

Note that in this code sample, we index the response with ['conferences'][-1]['confId']. This pulls only the most recent conference in the list, as indicated by the -1 index. If you are automating this to run on every newly generated conference, this likely will not be an issue. However, if you are looking to transcribe a specific conference, we suggest using the optional parameters of the get_conferences endpoint to narrow down the desired conference ID.
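If you prefer to pick out a specific conference from the list rather than taking the last entry, a small helper can search the response by alias. This is a minimal sketch: the 'confId' and 'alias' keys match the monitor API response used above, but the sample data here is purely illustrative.

```python
# Hypothetical helper: find a conference by its alias in the list
# returned by get_conferences, instead of taking the last entry.
def find_conference_id(conferences, alias):
    """Return the confId of the first conference matching alias, or None."""
    for conf in conferences:
        if conf.get('alias') == alias:
            return conf['confId']
    return None

# Example with a mocked response body (illustrative values):
sample = [
    {'confId': 'abc123', 'alias': 'weekly-standup'},
    {'confId': 'def456', 'alias': 'all-hands'},
]
print(find_conference_id(sample, 'all-hands'))  # def456
```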

Obtaining the Recording

With the conference ID in hand, we can now call an endpoint to generate a URL containing the audio file of our conference. For this code sample, we are using a Dolby Voice conference, so we will use the Get the Dolby Voice audio recording endpoint. If you know you are not using Dolby Voice, you can use this endpoint instead. We are only obtaining the audio track of the conference, rather than both audio and video, for maximum file compatibility with the transcription software. Note that the URL produced is temporary and will expire after some time.

from dolbyio_rest_apis.communications.monitor import recordings
# Save only the mp3 file and return as a URL.
# If your conference does not use Dolby Voice, use 'download_mp3_recording' instead.
response = await recordings.get_dolby_voice_recordings(await gen_token(), confId)
recording_url = response['url']

To help illustrate, here is an example conference recording made for transcription generated from the above code: Link

Transcribing it with Deepgram

While Deepgram does work with local files, the presigned recording URL saves us many steps, avoiding the hassle of downloading the file and re-uploading it to a secure server. With the URL, we can skip those steps and insert it directly into the code below, adapted from Deepgram's Python Getting Started Guide. The code provided only uses the Punctuation feature, but it could easily be expanded with an assortment of the many features Deepgram provides.

from deepgram import Deepgram
import asyncio
import sys

# Your Deepgram API Key
DEEPGRAM_API_KEY = 'YOUR_DEEPGRAM_API_KEY'

# Location of the file you want to transcribe. Should include filename and extension.
FILE = recording_url

async def main():
  # Initialize the Deepgram SDK
  deepgram = Deepgram(DEEPGRAM_API_KEY)

  # file is remote
  # Set the source
  source = {
    'url': FILE
  }

  # Send the audio to Deepgram and get the response
  response = await asyncio.create_task(
    deepgram.transcription.prerecorded(
      source,
      {
        'punctuate': True
      }
    )
  )

  # Write only the transcript to the console
  print(response['results']['channels'][0]['alternatives'][0]['transcript'])

try:
  await main()
  # If not running in a Jupyter notebook, run main with this line instead:
  # asyncio.run(main())
except Exception as e:
  exception_type, exception_object, exception_traceback = sys.exc_info()
  line_number = exception_traceback.tb_lineno
  print(f'line {line_number}: {exception_type} - {e}')

The Deepgram response provides many data points related to our speech, but to pull only the transcript of the file, we index ['results']['channels'][0]['alternatives'][0]['transcript']. Feel free to modify the response handling to generate whatever is most relevant to your needs. For the sample provided above, the result of the transcription is as follows:

Following text is a transcription of the s en of the parchment declaration of independence. The document on display in the rot the national archives Museum. The spelling and punctuation reflects the originals.
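To make the indexing above concrete, here is a minimal sketch of pulling the transcript out of a Deepgram-style response dictionary. The nested structure mirrors the real response shape; the values are illustrative stand-ins, not actual API output.

```python
# Mocked, simplified response in the shape Deepgram returns.
sample_response = {
    'results': {
        'channels': [{
            'alternatives': [{
                'transcript': 'Hello world.',
                'confidence': 0.98
            }]
        }]
    }
}

# Same indexing path used in the article to reach the plain transcript.
transcript = sample_response['results']['channels'][0]['alternatives'][0]['transcript']
print(transcript)  # Hello world.
```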

Next Steps

This is a very basic foray into how to get started with transcribing your conference recordings. We heavily suggest you invest some time into expanding this to fit your specific use case and maximize the benefit you get from these tools.

As mentioned before, we suggest taking a look at what Deepgram has to offer in terms of additional features you could add on to the transcription process. For example:

  • Diarization can help differentiate who is saying what when there are multiple people in a conference.
  • Named Entity Recognition and/or Keywords to help increase accuracy by providing prior information about things like names and proper nouns.
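As a sketch of how these features slot into the existing code, the options dictionary passed to the prerecorded call could be expanded like this. The 'diarize' and 'keywords' flags correspond to the features above, but double-check the current parameter names in Deepgram's documentation before relying on them.

```python
# Expanded options dictionary for the Deepgram prerecorded call.
# This would replace the bare {'punctuate': True} used earlier, e.g.:
#   deepgram.transcription.prerecorded(source, options)
options = {
    'punctuate': True,                  # add punctuation and capitalization
    'diarize': True,                    # label which speaker said what
    'keywords': ['Deepgram', 'Dolby'],  # boost recognition of these terms
}
print(sorted(options))
```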

The transcription of the example recording was not perfect. There are many possible reasons for this, including imperfect recording environments, confusing speech patterns, and compression. To give the transcription algorithms a better chance, one option is to use the Media Enhance API to clean up the audio before sending it for transcription.

If you want to automatically generate a transcription after every recording is over, we can take advantage of webhooks to remove the manual intervention for you. In fact, the Recording.Audio.Available event provides the recording URL within the event body itself, reducing the number of steps needed to obtain it.
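A webhook receiver would only need to pull the recording URL out of the event payload and hand it to the transcription step. This is a hypothetical sketch: the 'url' field name in the event body is an assumption for illustration, so check the Recording.Audio.Available event documentation for the actual payload shape.

```python
import json

def extract_recording_url(event_body):
    """Pull the recording URL out of a webhook event payload.

    The 'url' key is an assumed field name; verify it against the
    Recording.Audio.Available event schema in the Dolby.io docs.
    """
    event = json.loads(event_body)
    return event.get('url')

# Example with a mocked event body:
url = extract_recording_url('{"url": "https://example.com/recording.mp3"}')
print(url)  # https://example.com/recording.mp3
```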

One final idea: if for whatever reason you only have the video file available, you can use the Media Transcode API to convert it into a format accepted by the transcription service.

You can find the source code file stored in a Jupyter notebook at this GitHub repository. Good luck coding!

Also published here.