How to Build To-Do Lists With Real-Time Speech Recognition

In my last article, I started a project that used speech recognition to create a to-do list, while also storing it in a database.

It used a microphone to record but saved the audio to a file to then be read by the system. This is not exactly real-time. So, this time, I decided to make the project work in real time.

In case you didn’t read the last article, here’s a recap of the project.

Eventually, my goal is to have a smart home.

One device which I desire, in particular, is a smart mirror. However, I want to build it myself, so I’ve got projects on the go that will hopefully allow me to slowly learn how that could be done.

For the smart mirror, I wanted the ability to say my to-do list out loud, and have that stored in a database to read back later. That way, I won’t forget everything later down the road. My first attempt recorded a file, then used that, but in this attempt I want the audio to be transcribed in real-time.

I’m still using AssemblyAI for the speech recognition transcription, just like in my first article.

However, as this will be in real-time, I had to upgrade to the pro plan to have that feature available.

So, without any further delay, let’s first talk about what I needed to install for this project. Even if I already installed something in the last article, I will still include it here. That way, anyone who missed the last article can still follow along with what I did for my project.

Installing a Few Things

For this to be real-time, I will need to make continuous calls to the API. But that means I would need a connection that is continuous. That’s where WebSockets come in.

The websockets library for Python can establish this type of connection, so I can feed in the audio and also see the transcribed output. To install it, use the following command:

pip3 install websockets

In the last speech recognition article I did, I used libportaudio2, so I already have that installed. However, there were a few other packages needed for this to work, so I used the following line:

sudo apt-get install libasound-dev portaudio19-dev  libportaudio2 libportaudiocpp0

Instead of using sounddevice in this part, I decided to use PyAudio. This is needed for more input/output controls for the audio.

pip3 install PyAudio

To run this, it needed to be done asynchronously. For that I chose asyncio.

asyncio has been part of Python’s standard library since version 3.4, so there is nothing extra to install. The asyncio.run call I use later does need Python 3.7 or newer, though.

For this part, I used TinyDB again to store my to-do list. I installed it in the last part, but in case you missed that, use this simple command:

pip3 install tinydb

Now that everything is installed, let’s jump straight into the code.

The Real-Time Class File

I wrote this code in two files.

The first is the class with a few different methods for real-time speech, the WebSocket connection, and the sending and receiving of the audio.

The second file is more of just the driver that calls the first method to get the ball rolling, and also provides the API key.

First things first, though: let’s start with the file I called stt_real_time.py.

Let’s go over all the imports I needed for that. We already know we need asyncio, websockets, pyaudio, and tinydb, plus two more from the standard library: base64 and json.

import asyncio
import websockets
import pyaudio
import base64
import json
from tinydb import TinyDB, Query

Next, I’m ready to declare the class.

Inside the class, I’ll also include the initializing function.

There, I will need to create the endpoint, the database, set my recording rate like in the last part, and also open a stream with PyAudio using that rate and buffer size. I also put an empty docstring at the top of the class.

class stt_real_time:
    """
    """

    def __init__(self, key):
        self.endpoint = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"
        self.key = key
        self.rate = 16000
        self.frames_per_buffer = 3200
        self.db = TinyDB("real_time_db.json")

        p = pyaudio.PyAudio()
        self.stream = p.open(
            frames_per_buffer=self.frames_per_buffer,
            rate=self.rate,
            format=pyaudio.paInt16,
            channels=1,
            input=True
        )
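To get a feel for those numbers, here is a small sketch of the arithmetic behind the buffer settings. It assumes 16-bit mono samples, which is what pyaudio.paInt16 with channels=1 means. The helper names are my own and are not part of PyAudio.

```python
SAMPLE_RATE = 16000        # samples per second, as in __init__
FRAMES_PER_BUFFER = 3200   # frames returned by each stream.read() call
BYTES_PER_SAMPLE = 2       # paInt16 is a 16-bit (2-byte) sample
CHANNELS = 1               # mono


def chunk_duration_seconds(frames, rate):
    """Length of audio, in seconds, covered by one buffer."""
    return frames / rate


def chunk_size_bytes(frames, bytes_per_sample, channels):
    """Number of raw bytes in one buffer."""
    return frames * bytes_per_sample * channels


print(chunk_duration_seconds(FRAMES_PER_BUFFER, SAMPLE_RATE))
print(chunk_size_bytes(FRAMES_PER_BUFFER, BYTES_PER_SAMPLE, CHANNELS))
```

So each read from the stream covers a fifth of a second of audio, and that is the size of chunk that gets sent over the connection.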

With the class initialized, we can move on to the next method. This will be where most of the work is done.

This will be called real_time, and within real_time I created two functions: one to send and one to receive. So, the next few code segments will be inside this real_time method, but I will explain as I go.

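To begin with the method, I started by opening the WebSocket connection. While this could remain open until I terminate the project, I decided to add a 20-second ping timeout: a keepalive ping is sent every 5 seconds, and the connection is closed if no reply arrives within 20 seconds.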

async def real_time(self):

    async with websockets.connect(
        self.endpoint,
        ping_interval=5,
        ping_timeout=20,
        # Note: newer websockets releases (14+) rename extra_headers to additional_headers.
        extra_headers=(("Authorization", self.key), )
    ) as connection:
        await asyncio.sleep(0.1)

With the connection now created, the rest of the code in the real_time method will be inside of that connection, so make sure it is indented within the connection. We can now begin the session:

session_begins = await connection.recv()
print(session_begins)

Next, I created my send function, which listens to the audio from the microphone input, then sends a jsonified dump of the data through the connection.

Here is also where we need a little error handling in case something goes wrong.

async def send():
    while True:
        try:
            data = self.stream.read(self.frames_per_buffer)
            data = base64.b64encode(data).decode("utf-8")
            await connection.send(json.dumps({"audio_data": str(data)}))
        except Exception as ex:
            print(f"An error occurred: {ex}")
            break
        await asyncio.sleep(0.01)
    return True
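The encoding step inside send can be tried on its own, without a microphone or a connection. The sketch below uses a made-up chunk of silence in place of real audio, and build_audio_message is a hypothetical helper name. Only the audio_data key comes from the code above.

```python
import base64
import json


def build_audio_message(chunk: bytes) -> str:
    """Wrap raw audio bytes in the JSON payload that send() transmits."""
    encoded = base64.b64encode(chunk).decode("utf-8")
    return json.dumps({"audio_data": encoded})


# Stand-in for stream.read(): 3200 frames of 16-bit silence (made-up data).
fake_chunk = b"\x00\x00" * 3200

message = build_audio_message(fake_chunk)

# Decoding the payload again gives back the original bytes.
round_trip = base64.b64decode(json.loads(message)["audio_data"])
print(round_trip == fake_chunk)
```

Since decode("utf-8") already returns a str, the extra str() call in send is harmless but not strictly needed.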

With the send function out of the way, I moved on to the receive function.

In this function, I had to wait until the message was finished, which created the “Final Transcript”. It’s that final one that will be inserted into the database.

For example, when I begin talking, the system will start displaying the text of what I am saying, but it will try to correct itself into a sentence structure as I go. Although it won’t always look perfect, once I finish talking it creates the final, transcribed version of what I said. Here’s just an example of the transcript being read in and finalized:

So, this means nothing but the very last line, the Final Transcript, will be inserted into the database. We will also need error handling on this step, as we did in the send function.

async def receive():
    while True:
        try:
            message = await connection.recv()

            if json.loads(message)["message_type"] == "FinalTranscript":
                self.db.insert({"todo": json.loads(message)["text"]})
                self.__list_todos()
        except Exception as ex:
            print(f"An error occurred: {ex}")
            break
With the receive function written, we need to allow asyncio to gather both.

So this next piece of code will be outside of the receive function but still inside of our real_time method.

sent, received = await asyncio.gather(send(), receive())
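If asyncio.gather is new to you, this standalone sketch shows the idea: both coroutines run concurrently on the same event loop, and the results come back in the order the coroutines were passed in. The send and receive functions here are simplified stand-ins that talk through an asyncio.Queue instead of a WebSocket, so the names match the article but the bodies are my own.

```python
import asyncio


async def send(queue):
    """Stand-in for the real send(): pushes three fake chunks, then stops."""
    for number in range(3):
        await queue.put(f"chunk-{number}")
        await asyncio.sleep(0.01)  # yield control, like the real loop does
    await queue.put(None)  # sentinel that tells receive() to stop
    return "send finished"


async def receive(queue):
    """Stand-in for the real receive(): collects items until the sentinel."""
    received = []
    while True:
        item = await queue.get()
        if item is None:
            break
        received.append(item)
    return received


async def main():
    queue = asyncio.Queue()
    # Both coroutines run concurrently; results come back in argument order.
    return await asyncio.gather(send(queue), receive(queue))


sent, received = asyncio.run(main())
print(sent)
print(received)
```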

That will conclude the real_time method. At this point, if you’re following my code you likely have an error on self.__list_todos() because we have not created that method yet.

But we’ll do that now and it will be the last code in the file. All we’re going to do there is print each item from the database like so:

def __list_todos(self):
    print("Todo List:")
    for item in self.db:
        print(item["todo"])

That will be all for this file, so let’s get started on the file to test and call the methods.

For continuity, and to make it easier to read, here is all the code for this file in one place:

import asyncio
import websockets
import pyaudio
import base64
import json
from tinydb import TinyDB, Query
 
class stt_real_time:
    """
    """
 
    def __init__(self, key):
        self.endpoint = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"
        self.key = key
        self.rate = 16000
        self.frames_per_buffer = 3200
        self.db = TinyDB('real_time_db.json')
 
        p = pyaudio.PyAudio()
        self.stream = p.open(
            frames_per_buffer=self.frames_per_buffer,
            rate=self.rate,
            format=pyaudio.paInt16,
            channels=1,
            input=True
        )
 
    async def real_time(self):
        
        async with websockets.connect(
            self.endpoint,
            ping_interval=5,
            ping_timeout=20,
            extra_headers=(("Authorization", self.key), )
        ) as connection:
            await asyncio.sleep(0.1)
            
            session_begins = await connection.recv()
            print(session_begins)
 
            async def send():
                while True:
                    try:
                        data = self.stream.read(self.frames_per_buffer)
                        data = base64.b64encode(data).decode("utf-8")
                        await connection.send(json.dumps({"audio_data": str(data)}))
                    except Exception as ex:
                        print(f"An error occurred: {ex}")
                        break
                    await asyncio.sleep(0.01)
                return True
 
            async def receive():
                while True:
                    try:
                        message = await connection.recv()
                        # print(json.loads(message)["text"])
 
                        if json.loads(message)["message_type"] == "FinalTranscript":
                            self.db.insert({"todo": json.loads(message)["text"]})
                            self.__list_todos()
                    except Exception as ex:
                        print(f"An error occurred: {ex}")
                        break
 
            sent, received = await asyncio.gather(send(), receive())
 
    def __list_todos(self):
        print("Todo List:")
        for item in self.db:
            print(item["todo"])
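One thing worth flagging in the listing: stream.read is a blocking call, so while it waits for a full buffer the event loop cannot run receive. A possible refinement, sketched below under the assumption of Python 3.9 or newer (for asyncio.to_thread), hands the blocking read to a worker thread. blocking_read is a made-up stand-in that sleeps instead of touching a microphone, and ticker is just a second coroutine running alongside it.

```python
import asyncio
import time


def blocking_read():
    """Made-up stand-in for stream.read(): blocks for 0.2 s, returns bytes."""
    time.sleep(0.2)
    return b"\x00\x00" * 3200


async def read_chunk():
    # Run the blocking call in a worker thread instead of on the event loop.
    return await asyncio.to_thread(blocking_read)


async def ticker():
    """A second coroutine that counts three short sleeps."""
    count = 0
    for _ in range(3):
        await asyncio.sleep(0.05)
        count += 1
    return count


async def main():
    chunk, ticks = await asyncio.gather(read_chunk(), ticker())
    return len(chunk), ticks


chunk_length, tick_count = asyncio.run(main())
print(chunk_length, tick_count)
```

In the class above, the equivalent change would be something like data = await asyncio.to_thread(self.stream.read, self.frames_per_buffer). Treat that as an untested suggestion rather than part of the original project.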

Writing the Test File

I called this file test.py. To start with the imports, I only needed asyncio and the class from the other file:

import asyncio

from stt_real_time import stt_real_time

The next step was simply to initialize the class, where I needed to also send my API key.

real_time_stt = stt_real_time("YOURAPIKEY")
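If you would rather not paste the key into the file, one option is to read it from an environment variable. This is only a sketch: the variable name ASSEMBLYAI_API_KEY and the load_api_key helper are my own choices, not something the API requires.

```python
import os


def load_api_key(variable_name="ASSEMBLYAI_API_KEY"):
    """Read the API key from the environment; fail loudly if it is missing."""
    key = os.environ.get(variable_name)
    if not key:
        raise RuntimeError(f"Set the {variable_name} environment variable first.")
    return key


# For demonstration only: a placeholder value so the sketch runs on its own.
os.environ.setdefault("ASSEMBLYAI_API_KEY", "YOURAPIKEY")

api_key = load_api_key()
print(api_key == os.environ["ASSEMBLYAI_API_KEY"])
```

The result could then be passed to the class in place of the literal string, as in stt_real_time(load_api_key()).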

There’s only one last step, and that’s to create the loop that will run asynchronously. All we have to do there is call the real_time method, which will call the others once things kick off. I finished my code with these lines:

while True:
     asyncio.run(real_time_stt.real_time())

The only thing left to do is testing. Again, the ping timeout for the WebSocket connection was set to 20 seconds, which you can always lengthen for your project depending on what you need.

But for testing and a simple to-do list generator, I didn’t need much time. As far as printing results, I also only print the final list of all records in the database, so as I add more to the list I will see the original input first, and the most recent last.

You could always filter to view only the most recent for your project, but I wanted to see everything added to my to-do list.

The results for the project running will print the session being created, as I displayed for the SessionBegins in my first screenshot, but then the output will be the list I say into the microphone in real-time.

And with that, we’ve successfully run the code. The transcription was done quickly, and as long as I spoke clearly the speech recognition was very accurate.

Conclusion

As I work closer to building my own smart home devices, my smart mirror needed a way to handle speech recognition.

Using the AssemblyAI API, I was able to build my to-do list generator using real-time speech recognition. The real-time recognition was built using a combination of the API, asyncio, and WebSockets.

Overall, I thought following the tutorials available on AssemblyAI’s website was useful, and the approach was also highly adaptable for my to-do list project. Hopefully, you found my project helpful in building your own, and feel free to leave a comment on what you used to build your project. Until next time, cheers!

References

https://www.assemblyai.com/blog/real-time-speech-recognition-with-python/

https://www.assemblyai.com/blog/real-time-speech-recognition-with-assemblyai/

https://docs.assemblyai.com/walkthroughs#realtime-streaming-transcription

https://websockets.readthedocs.io/en/stable/