In this blog, I'll be comparing deep learning noise removal models to older subtractive models.
Noise is everywhere. Whether you’re inside the comfort of your home or walking down the street, the sound of the garbage truck or your dog barking can quickly become a nuisance. Especially in the digital age, all these noises get picked up by microphones and interfere with our communications. So, let’s look at how we can remove them!
Background noise removal is the ability to enhance a noisy speech signal by isolating the dominant sound. It is used everywhere: in audio/video editing software, video conferencing platforms, and noise-cancelling headphones. Even so, background noise removal is still a fast-evolving technology, with Artificial Intelligence bringing a whole new domain of approaches to the task.
Today, let’s explore how background noise removal works by looking at traditional and machine learning based approaches.
Most noise removal algorithms are subtractive: they identify the frequency bands with the highest levels of background noise and subtract those bands from the original signal. Many such approaches use static filters, such as lowpass, highpass, and bandpass filters, that are designed with specific parameters to isolate what is assumed to be the dominant signal. These algorithms work best with deterministic signals, where there is little uncertainty regarding the type of noise being filtered and the type of signal being isolated. In practice, these filters are ineffective in varying conditions, specifically in situations where the properties of the background noise overlap with those of the clean signal to be isolated. Norbert Wiener took a different approach, forgoing the assumption that a given noisy signal is deterministic.
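Before moving on to Wiener's approach, here is a minimal sketch of what a static filter does. It is not taken from any particular product: the one-pole lowpass filter, the 8000 Hz sample rate, the two test tones, and the `ALPHA` value are all assumptions chosen for illustration. A 100 Hz tone stands in for the signal we want to keep and a 3000 Hz tone stands in for the noise.

```python
import math

def one_pole_lowpass(signal, alpha):
    """Static lowpass filter: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out

def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))

SAMPLE_RATE = 8000  # Hz, assumed for illustration
ALPHA = 0.1         # fixed filter parameter, chosen by hand

# One second of each tone (hypothetical stand-ins for speech and noise).
low_tone = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
high_tone = [math.sin(2 * math.pi * 3000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

# Fraction of the original level that survives the filter.
low_kept = rms(one_pole_lowpass(low_tone, ALPHA)) / rms(low_tone)
high_kept = rms(one_pole_lowpass(high_tone, ALPHA)) / rms(high_tone)
print(f"100 Hz keeps {low_kept:.0%} of its level, 3000 Hz keeps {high_kept:.0%}")
```

The filter's parameter never changes, so it only helps when the noise really does sit above the speech in frequency. If the noise overlapped the 100 Hz band, the same filter would pass it straight through.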
Wiener filtering is an industry standard for dynamic signal processing, and is used widely in hearing aids and other edge devices such as phones and communication devices. The adaptive filter works best given two audio signals: one with both the speech and the background noise and another that solely measures the background noise. Modern smartphone designers will often place two microphones at a distance from each other: one near the speaker’s mouth to record the noisy speech, and the other positioned to measure the ambient noise so that it can be filtered out.
You may be wondering, if we have an isolated signal of the background noise and the noisy speech, why can’t we simply subtract the background noise from the noisy speech to isolate the clean speech? While this approach may seem intuitive, the result is not quite what we expect. Turns out, there are many reasons why this wouldn’t work.
The distance between the microphones places each of them in a slightly different environment. Noise is caused by a number of factors, including electrostatic charges within hardware components and small vibrations in the environment, all of which vary enormously with the slightest change in environment. As a result, a simple subtraction of one signal from the other cannot remove most of the noise.
The Wiener filter, however, uses the statistical properties of those two signals to produce an estimate of the clean speech. An error measure, known as the mean squared error, is then calculated and minimized in order to produce the best estimate of the clean speech.
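To see why minimizing the mean squared error beats plain subtraction, here is a deliberately simplified sketch. It assumes the noise reaches the speech microphone at a different level than the reference microphone (the hypothetical `LEAK` factor), uses a sine wave as a stand-in for speech, and reduces the Wiener filter to a single coefficient. Real Wiener filters use many coefficients or work per frequency band, so treat this as an illustration of the idea rather than a production filter.

```python
import math
import random

random.seed(0)
N = 4000
LEAK = 0.6  # assumed: noise level at the speech mic relative to the reference mic

# Hypothetical signals: a sine stands in for speech, Gaussian noise for ambient noise.
speech = [math.sin(2 * math.pi * 0.01 * n) for n in range(N)]
reference = [random.gauss(0.0, 1.0) for _ in range(N)]        # noise-only microphone
primary = [s + LEAK * r for s, r in zip(speech, reference)]   # speech + noise microphone

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Naive approach: subtract the reference signal directly.
naive = [d - r for d, r in zip(primary, reference)]

# One-coefficient Wiener solution: the w that minimizes
# mean((primary - w * reference)^2) is the cross-correlation
# divided by the reference signal's power.
w = sum(d * r for d, r in zip(primary, reference)) / sum(r * r for r in reference)
wiener = [d - w * r for d, r in zip(primary, reference)]

print("estimated coefficient:", w)
print("naive subtraction error:", mse(naive, speech))
print("Wiener estimate error:  ", mse(wiener, speech))
```

The filter never sees `LEAK`. It recovers a value close to it from the signals' statistics alone, which is why its estimate lands much closer to the clean speech than the naive subtraction does.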
Wiener filtering, unfortunately, also comes with its faults:
Artificial neural networks are an old idea that has recently exploded in popularity in the form of deep learning. While there are different deep learning approaches to noise removal, they all work by learning from a training dataset.
The first step to building an accurate noise removal model is to construct a quality training dataset. Since our goal is to remove background noise, our dataset should consist of recordings of clean speech paired with their noisy variants.
Before assembling a dataset, it is important to consider the use case of the model. For example, when training a noise removal algorithm that would be applied to signals from a helicopter pilot’s microphone, it makes most sense to train the network with audio samples that are distorted by variations of helicopter sounds.
For a general use noise removal model, it makes sense to train with samples of everyday background sounds such as loud chatter, air conditioning, typing, dogs barking, traffic, music — you get the idea.
Once we’ve figured out what kind of data we want to train with, we have to actually generate the dataset. The best way is to find a large number of clean speech signals and pure noise signals and combine them in all sorts of ways.
For example, you can combine a high quality sample of a person speaking and a sample of a dog barking to produce a new sample which would have a person speaking with barking in the background.
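Here is a rough sketch of that mixing step. Controlling the loudness of the noise through a signal-to-noise ratio (SNR) in decibels is one common way to do it, but it is an assumption on my part, as are the function names and the synthetic stand-in signals. In practice, `clean` and `bark` would be loaded from real recordings.

```python
import math
import random

def power(signal):
    """Average power of a signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]

random.seed(1)
clean = [math.sin(2 * math.pi * 0.02 * n) for n in range(1000)]  # stand-in for speech
bark = [random.uniform(-1.0, 1.0) for _ in range(1000)]          # stand-in for a dog bark

# One clean recording becomes several (noisy, clean) training pairs.
pairs = [(mix_at_snr(clean, bark, snr_db), clean) for snr_db in (0, 5, 10, 20)]
print(len(pairs), "training pairs from one clean sample")
```

Mixing each clean sample with several noise types at several levels is how a modest library of recordings turns into a large training set.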
So, by providing both the original sample of the person speaking and the sample with both the speech and the dog barking, we let the neural network repeatedly compare its estimated clean speech signal to the actual clean speech signal, adjust itself, and try again.
Finally, we can feed our dataset to a neural network so it can learn to isolate the background noise and generate clean speech. One of the most popular and effective architectures for audio processing is the Recurrent Neural Network (RNN).
Recurrent neural networks are models that can recognize and understand sequential data. Sequential data includes things like audio, text, or the position of an object over time.
RNNs are particularly effective for background noise removal because they can learn patterns across time, which is essential for understanding audio.
So how do RNNs work? First, let’s take a look at a feed-forward neural network, which has three main layers: an input layer, a hidden layer, and an output layer. RNNs add a feedback loop from the hidden layer, known as the hidden state, that updates itself as the model processes every item in a sequence.
To get a sense for this, let’s observe an RNN that is trained to isolate the background noise of a noisy audio sample. We can break up the audio sample into a sequence of evenly spaced time intervals. As each individual sample of the sequence is passed into the RNN, the hidden state gets updated during every iteration, retaining memory of the previous steps each time.
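The hidden state update can be sketched in a few lines. This is a toy version under loud assumptions: the state is a single number instead of a vector, and the weights `w_x`, `w_h`, and `b` are picked by hand rather than learned. The point is only to show how each step folds the previous state into the new one.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One recurrent update: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(sequence, w_x=0.8, w_h=0.5, b=0.0):
    """Feed a sequence through the cell and return the hidden state after every step."""
    h = 0.0  # the hidden state starts out empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Two sequences that differ only in their very first sample.
states_a = run_rnn([1.0, 0.2, 0.2, 0.2])
states_b = run_rnn([-1.0, 0.2, 0.2, 0.2])
print("final states:", states_a[-1], states_b[-1])
```

Even though the last three inputs are identical, the final hidden states differ, because the first sample is still remembered. The difference has shrunk by the end, though, which hints at the weakness of this simple design.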
At the end of the iteration, the output is sent through the feed-forward neural network to generate a new audio stream with the background noise entirely removed. Sounds like magic? Yeah, sort of!
However, RNNs come with their own fair share of pitfalls as well. The most significant issue is that they aren’t effective at retaining information for long periods of time. This is due to the vanishing gradient problem during a process known as backpropagation. While I don’t want to get carried away with the specifics, here is a good resource to learn more about it.
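For a quick feel of the problem without the full theory, here is a toy calculation. It assumes a scalar tanh RNN with hand-picked weights and a constant input (none of these values come from a real model) and uses the chain rule to measure how strongly the final hidden state still depends on the initial one.

```python
import math

def gradient_through_time(steps, w_h=0.5, w_x=0.8, x=0.2):
    """Return d h_T / d h_0 for a scalar tanh RNN, accumulated step by step."""
    h, grad = 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w_x * x + w_h * h)
        grad *= w_h * (1.0 - h * h)  # derivative of this step w.r.t. the previous state
    return grad

for steps in (1, 5, 20, 50):
    print(steps, "steps:", gradient_through_time(steps))
```

Every step multiplies the gradient by a factor smaller than one, so after a few dozen steps almost nothing is left. During training, that means early parts of a sequence barely influence how the weights get adjusted.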
This lack of long-term memory makes RNNs less effective in processes where long-term memory is very useful. So, researchers invented variants of the traditional RNN that use gates to solve this problem. Gates are operations that can learn what information to add to or remove from a hidden state.
The two main neural networks that use these gates are Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Both are more computationally intensive than simple recurrent networks, but are much better suited to our task of noise removal.
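To make the idea of gates a little more concrete, here is a scalar sketch of a GRU update. Real GRUs use vectors and learned weight matrices, and some formulations swap the roles of `z` and `1 - z`, so the `params` dictionary and the convention below are assumptions for illustration only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One scalar GRU update. `p` holds hypothetical, hand-picked weights."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h + p["b_z"])  # update gate: how much to rewrite
    r = sigmoid(p["w_r"] * x + p["u_r"] * h + p["b_r"])  # reset gate: how much past to use
    candidate = math.tanh(p["w_h"] * x + p["u_h"] * (r * h) + p["b_h"])
    return (1.0 - z) * h + z * candidate  # blend the old state with the candidate

params = {"w_z": 1.0, "u_z": 1.0, "b_z": 0.0,
          "w_r": 1.0, "u_r": 1.0, "b_r": 0.0,
          "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}

# A strongly negative update-gate bias keeps the gate shut, so the old state survives.
remember = dict(params, b_z=-20.0)
h = 0.9
for x in [0.1, -0.3, 0.5] * 10:  # 30 steps of varying input
    h = gru_step(x, h, remember)
print("state after 30 steps with the gate shut:", h)
```

With the update gate shut, the state is carried across 30 steps almost unchanged, which is exactly what the plain RNN struggles to do. In a trained network the gate values are learned, so the model decides for itself when to hold on to information and when to overwrite it.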
From here, there are all sorts of directions we can go. Some models are designed to perform background noise removal end to end, which also means that they are much larger and more computationally intensive. These models are incredibly powerful and are often employed in speech recognition. Others adopt a hybrid approach, using traditional subtractive noise removal to preprocess the data and then applying a neural network to deal with any non-static background noise that still exists in the sample. While both approaches are effective, the choice between them depends on a developer’s computational resources and desired accuracy.
Background noise reduction has been a primary area of interest in audio processing since the invention of the microphone. There are hundreds of traditional methods to filter audio, but many, if not all, work poorly with non-static audio and introduce distortion when the background noise blends with the primary speaker. With the rise of computing power and our ability to build deep learning models that can remember complex patterns over long periods of time, we’ve been able to train computers to become exceptional at specific tasks. Trained on large amounts of data, deep learning models have become exceptionally capable of removing noise from audio.
So, which method is better? If computational resources and latency are irrelevant, the AI approach is vastly superior to traditional approaches. This is because AI approaches are generative, whereas traditional models are subtractive: they are able to generate an entirely new audio signal with the background noise removed and with minimal distortion in the clean speech.
If computational resources and latency are a concern, AI approaches may be impractical to implement with present-day technology. The processing time of the models can introduce latency, which is undesirable in some cases. However, this is unlikely to remain a major concern for long. Neural networks are getting faster, more efficient, and more accessible every day, and eventually they will become the standard.
Author: Praneeth Guduguntla, Technical Writer @ Audo Ai.
Originally posted at Audo AI.
Thank you for taking your time to read this. As always, I appreciate any comments/feedback you may have.