How to Do Speech Recognition in Python

In my free time, I am attempting to build my own smart home devices. One feature they will need is speech recognition. While I am not certain yet as to how exactly I want to implement that feature, I thought it would be interesting to dive in and explore different options. The first I wanted to try was the SpeechRecognition library.

Why the hard way?

To make a long story short, this tutorial is going to be a little different. I ran into several errors along the way and even had to redirect my focus. That being said, the coding portion is simple: only a few lines of code are needed to get it working. The installation took time and effort, but with research it was manageable. The real issue was the systems I chose to use. For example, the first attempt was on an Ubuntu server. Nothing wrong with that, but the default input device could not be changed, since the code was being run over ssh. I would have had to go to the server and plug everything in directly. Again, nothing wrong with that. I was just hoping for an easier option. Feeling particularly adventurous and being a little too lazy to plug into the server directly, I tried a few different machines.

Because of that, the structure is different from my previous posts. I am first going to share the working installation and code on a local Ubuntu machine, which is what I ended up using. After that, I will talk about the other machines I attempted, where I found my issues, and when I decided to switch. Hopefully, that will help anyone using other machines. Or perhaps someone will know more about the issues I encountered but did not take the time to see through.

Installing SpeechRecognition

To use the SpeechRecognition library in our code, we need to install two packages: SpeechRecognition itself and PyAudio. We will start with the main package:

sudo pip3 install SpeechRecognition

If you try to run code now, you will get an error about the PyAudio installation not being found. Installing PyAudio should have followed exactly the same format, but I was missing some system packages, and the attempt to install it threw an error. Installing the packages below should remove that error. I did not have to update apt at that point, but it does not hurt to give it an update first.

sudo apt-get install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0

With that out of the way, you should be good to install PyAudio:

sudo pip3 install PyAudio
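
Before writing any recognition code, it can help to confirm that both packages are importable. The following is a minimal sketch, not part of the original setup: missing_modules is a hypothetical helper name, and it assumes the import names are speech_recognition and pyaudio.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that Python cannot find."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# An empty list means both packages were found
print(missing_modules(["speech_recognition", "pyaudio"]))
```

If pyaudio shows up in the printed list, the PyAudio install most likely did not complete.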

Coding the Speech Recognizer

As mentioned previously, there are very few lines of code required to get this up and running.

First, you must import the SpeechRecognition library:

import speech_recognition as speech

We added an alias to the library so we can reference it more simply later. Now, we can create an instance of the Recognizer class:

sound = speech.Recognizer()

Next, we need to allow the Python file to hear what we are saying. This is the reason we needed PyAudio. For live speech, we will need to set up a microphone. Note that we will not put this in a loop, so we will only be able to speak to the application one time, whether that is a single word or a sentence. This recognizer is only a test, so we do not need to speak multiple times. We will set up a microphone first, give it an alias, then instruct the Recognizer we set up earlier to listen.

with speech.Microphone() as audio:
     said = sound.listen(audio)
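
Calling speech.Microphone() with no arguments opens the system's default input device. If the microphone you want is not the default, one option is to pick it by name. The sketch below is only an illustration under that assumption: find_device_index and open_microphone are hypothetical helpers, while Microphone.list_microphone_names() and the device_index argument come from the library.

```python
def find_device_index(names, keyword):
    """Return the index of the first device whose name contains keyword, else None."""
    for index, name in enumerate(names):
        # Some entries may have no name, so guard against None
        if name and keyword.lower() in name.lower():
            return index
    return None


def open_microphone(keyword):
    """Open the first microphone matching keyword (hypothetical helper)."""
    # Imported here so the helper above can be used without the library
    import speech_recognition as speech

    names = speech.Microphone.list_microphone_names()
    # device_index=None falls back to the default input device
    return speech.Microphone(device_index=find_device_index(names, keyword))
```

For example, "with open_microphone('headset') as audio:" could stand in for the plain Microphone() call, assuming a device with "headset" in its name is attached.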

Now, because our microphone could be unclear, or even the speech itself, we will need to set up a “try” to handle the case where the Recognizer was not able to understand. We will use the recognize_google function, so an internet connection will be required. For security's sake, I would not use this function in any home applications. However, while just testing what Python can do, it will be good enough for now. The function takes the audio to recognize, the language, and whether all guesses should be returned or only the most likely one.

At this time, we want to see all potential guesses, and the language will be English (the code uses the en-IN tag, which is Indian English). Either of these could be different for you, which is why they are specified. If the phrase was recognized, we want to print the results. If it could not be understood, we want to print a message instead. This can be done with an “except”, which catches the error raised during recognition so that we can print a message stating that the speech was not understood.

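try:
    # show_all=True returns the raw response with every candidate guess
    print(sound.recognize_google(said, language='en-IN', show_all=True))
except speech.UnknownValueError:
    # The library's error for speech that could not be understood
    print("Could not understand. Please repeat.")
except speech.RequestError as error:
    # The library's error for a failed request to the Google API
    print("Could not reach the recognition service: {}".format(error))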
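
With show_all set to True, the printed result is the raw response rather than a single string. As a rough sketch of how the top guess could be pulled out, the helper below assumes the raw result is either empty or a dict with an "alternative" list whose entries carry a "transcript" and sometimes a "confidence". That shape is an assumption on my part, best_transcript is a hypothetical name, and the sample data is made up for illustration.

```python
def best_transcript(raw_result):
    """Return the most confident transcript, or None if nothing was recognized."""
    if not isinstance(raw_result, dict):
        return None
    alternatives = raw_result.get("alternative", [])
    if not alternatives:
        return None
    # Entries without a "confidence" key are treated as confidence 0.0
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best.get("transcript")


# Made-up sample in the assumed response shape
sample = {
    "alternative": [
        {"transcript": "hello world", "confidence": 0.92},
        {"transcript": "hello word"},
    ],
    "final": True,
}
print(best_transcript(sample))
```

Passing the return value of recognize_google(..., show_all=True) to this helper would give one string to act on, or None when nothing usable came back.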

Now, all you must do is run the application.

With our code up and running, we can now talk about what gave me issues on different machines.

Working on an Ubuntu Server

As mentioned before, the issue was that from a separate machine connected to the server over ssh, I was unable to change the default input device. This would not have been an issue if I had gone to the server and plugged in the microphone directly. Other than the issue with the input, the installation process was the same, as the server was also running Ubuntu 16.04.

Working over WSL

The next system I used was a Windows machine running WSL (Windows Subsystem for Linux). It too used Ubuntu 16.04, so the installation process was the same. However, when it came to using the microphone, WSL is not as easy as plug in and go. To control the microphone from the Ubuntu terminal app, PulseAudio needed to be installed. To do this, first, the repository was added:

sudo add-apt-repository ppa:therealkenc/wsl-pulseaudio

From there, a regular install could be run:

sudo apt-get install pulseaudio

PulseAudio is a network-capable sound server that runs on Linux and other Unix-like systems. As with other services, you must start it and check its status. First, there is a command to stop any instance that is already running:

pulseaudio --kill

If not already on, you can now start PulseAudio:

pulseaudio --start

Next, you will have to look at the audio devices available. PulseAudio calls its output devices sinks and its input devices, such as microphones, sources. I listed the sinks, and in my case there was only one, which was the headset with a microphone:

pacmd list-sinks

The listing also gives us the device index, which is what we needed. We can set the default device from here (pacmd set-default-source is the matching command for an input device):

pacmd set-default-sink 0
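
If you would rather pick the index out of the listing programmatically, the following sketch parses text in the style of the pacmd list-sinks output. It assumes each device starts with a line like "index: 0", with a leading asterisk marking the current default. parse_sink_indexes is a hypothetical helper and the sample text is made up, not output from my machine.

```python
import re

# Matches lines such as "  * index: 0" or "    index: 1"
INDEX_LINE = re.compile(r"^\s*(\*?)\s*index:\s*(\d+)\s*$")


def parse_sink_indexes(pacmd_output):
    """Return (index, is_default) pairs found in pacmd-style listing text."""
    sinks = []
    for line in pacmd_output.splitlines():
        match = INDEX_LINE.match(line)
        if match:
            sinks.append((int(match.group(2)), match.group(1) == "*"))
    return sinks


# Made-up sample in the assumed listing format
sample_output = """1 sink(s) available.
  * index: 0
    name: <example_headset_sink>
"""
print(parse_sink_indexes(sample_output))
```

The real listing could be captured with subprocess.run(["pacmd", "list-sinks"], capture_output=True, text=True) and its stdout passed to the helper.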

Please note, you may have to run the start command again on PulseAudio. I had to run it before every command. This was the final step, and the code should now run. However, running the code produced yet another error. It is a lengthy description, but the main error is:

I dug in to find more about this error, although there did not seem to be much documentation on it. StackOverflow was one of the only sites I found with usable information, and others there appeared to be having the same issue. This is where I stopped for this implementation.

Looking back now, I recall someone mentioning an X server. I wonder whether running Xming would have made it work. But, oh well. Perhaps I will give it a go another time. Leaving this version, I moved to my next machine.

Working on Fedora

As usual, the very first thing to do was install SpeechRecognition via pip3. Before installing PyAudio, it is important to note that there were still prerequisites to install, but they are different on Fedora. Remember that Fedora uses the dnf package manager rather than apt-get, and the package names differ:

sudo dnf install portaudio-devel redhat-rpm-config

These are not the only packages required. We must also install the Python development package:

sudo dnf install python3-devel
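
Since the prerequisites differ between the two distributions, a small helper can report which command applies to the current machine. This is only a sketch: it assumes the system has an /etc/os-release file with an ID line such as ID=ubuntu or ID=fedora, the helper names are hypothetical, and the commands in the table are the ones used in this post.

```python
# Prerequisite commands from this post, keyed by the os-release ID value
PREREQUISITE_COMMANDS = {
    "ubuntu": "sudo apt-get install libasound-dev portaudio19-dev "
              "libportaudio2 libportaudiocpp0",
    "fedora": "sudo dnf install portaudio-devel redhat-rpm-config python3-devel",
}


def distro_id(os_release_text):
    """Return the lowercase ID value from os-release text, or None if absent."""
    for line in os_release_text.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() == "ID":
            # Values may be wrapped in single or double quotes
            return value.strip().strip("\"'").lower()
    return None


def prerequisite_command(os_release_text):
    """Return the matching install command, or None for an unlisted distro."""
    return PREREQUISITE_COMMANDS.get(distro_id(os_release_text))
```

Reading /etc/os-release and passing its contents to prerequisite_command would return the command to run, or None on a distribution this post did not cover.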

With the prerequisites installed, go ahead and install PyAudio via pip3 just like on the previous machines. With everything installed, I ran the code. Another error:

This error was harder to find information on. Some people thought it just needed a restart; others never got it working. Either way, it was difficult to find documentation.

Now, if I had tried harder, looked longer, or just dedicated a little more effort to this, it would probably have been simple enough to solve. However, this was just an experimental project, and I did not want to dedicate too much time to it.

So, this is where I stopped trying on Fedora. Perhaps it was for the better, as I realized I had an Ubuntu machine and could just run the code locally. That is what I ended up doing, and I had no issues with it.

Conclusion

At the end of the day, we got something up and working. I think it was rather interesting to mess around with. The code we created should be used only for test purposes, however, because the recognizer sends what the microphone hears to Google. For privacy reasons, we would not want to rely on Google calls in any application meant for real use in the home.

Unlike in my usual tutorials, we also talked about the errors I came across on different machines. Although they are likely solvable, I did not dedicate much time to them, and therefore some were left unresolved. In the long run, we did get the code up and running. Either way, it was an interesting journey, and our voices were recognized by a Python library!

Hopefully, you will find some use in seeing the spots where I went wrong. Yes, the mistakes are frustrating, and so are the roadblocks. However, every mistake can be insightful. We learned what to do, what not to do, where to go when stuck, and even when to just move on if able. Noting the differences between installing the packages on Ubuntu and on Fedora was the most interesting portion, in my opinion. It took a little research, but nothing was more than we could handle. So, I thank you for joining this voice recognition adventure with me. Until next time, cheers!

Previously published at https://python.plainenglish.io/speechrecognition-in-python-df4e56fecf51