How To Apply Machine Learning And Deep Learning Methods to Audio Analysis

To view the code, training visualizations, and more information about the Python example at the end of this post, visit the Comet project page.

Introduction

While much of the writing and literature on deep learning concerns computer vision and natural language processing (NLP), audio analysis (a field that includes automatic speech recognition (ASR), digital signal processing, and music classification, tagging, and generation) is a growing subdomain of deep learning applications. Some of the most popular and widespread machine learning systems, the virtual assistants Alexa, Siri and Google Home, are largely products built atop models that can extract information from audio signals.

Many of our users at Comet are working on audio-related machine learning tasks such as audio classification, speech recognition and speech synthesis, so we built them tools to analyze, explore and understand audio data using Comet’s meta machine-learning platform.

Figure: Audio modeling, training and debugging using Comet

This post is focused on showing how data scientists and AI practitioners can use Comet to apply machine learning and deep learning methods in the domain of audio analysis. To understand how models can extract information from digital audio signals, we’ll dive into some of the core feature engineering methods for audio analysis. We will then use Librosa, a great Python library for audio analysis, to code up a short Python example training a neural architecture on the UrbanSound8k dataset.

Machine Learning for Audio: Digital Signal Processing, Filter Banks, Mel-Frequency Cepstral Coefficients

Building machine learning models to classify, describe, or generate audio typically concerns modeling tasks where the input data are audio samples.

Figure: Example waveform of an audio dataset sample from UrbanSound8k

These audio samples are usually represented as time series, where the y-axis measurement is the amplitude of the waveform.
The amplitude is usually measured as a function of the change in pressure around the microphone or receiver device that originally picked up the audio. Unless there is metadata associated with your audio samples, these time series signals will often be your only input data for fitting a model.

Looking at the samples below, taken from each of the ten classes in the Urbansound8k dataset, it is clear from an eye test that the waveform itself may not necessarily yield clear class-identifying information. Consider the waveforms for the engine_idling, siren, and jackhammer classes: they look quite similar.

It turns out one of the best features to extract from audio waveforms (and digital signals in general) has been around since 1980 and remains widely used: Mel Frequency Cepstral Coefficients (MFCCs), introduced by Davis and Mermelstein in 1980. Below we will go through a technical discussion of how MFCCs are generated and why they are useful in audio analysis. This section is somewhat technical, so before we dive in, let’s define a few key terms pertaining to digital signal processing and audio analysis. We’ll link to Wikipedia and additional resources if you’d like to dig even deeper.

Disordered Yet Useful Terminology

Sampling and Sampling Frequency

In signal processing, sampling is the reduction of a continuous signal into a series of discrete values. The sampling frequency or rate is the number of samples taken over some fixed amount of time. A high sampling frequency results in less information loss but higher computational expense, while low sampling frequencies have higher information loss but are fast and cheap to compute.

Amplitude

The amplitude of a sound wave is a measure of its change over a period (usually of time). Another common definition of amplitude is a function of the magnitude of the difference between a variable’s extreme values.

Fourier Transform

The Fourier Transform decomposes a function of time (signal) into constituent frequencies.
In the same way a musical chord can be expressed by the volumes and frequencies of its constituent notes, a Fourier Transform of a function displays the amplitude (amount) of each frequency present in the underlying function (signal).

Figure: Top: a digital signal; Bottom: the Fourier Transform of the signal

There are variants of the Fourier Transform including the short-time Fourier transform, which is implemented in the Librosa library and involves splitting an audio signal into frames and then taking the Fourier Transform of each frame. In audio processing generally, the Fourier Transform is an elegant and useful way to decompose an audio signal into its constituent frequencies.

*Resources: by far the best video I’ve found on the Fourier Transform is from 3Blue1Brown.*

Periodogram

In signal processing, a periodogram is an estimate of the spectral density of a signal. The periodogram above shows the power spectrum of two sinusoidal basis functions of ~30Hz and ~50Hz. Loosely speaking, the squared magnitude of a Fourier Transform’s output can be thought of as a periodogram.

Spectral Density

The power spectrum of a time series is a way to describe the distribution of power into discrete frequency components composing that signal. The statistical average of a signal, measured by its frequency content, is called its spectrum. The spectral density of a digital signal describes the frequency content of the signal.

Mel-Scale

The mel-scale is a scale of pitches judged by listeners to be equal in distance from one another. The reference point between the mel-scale and normal frequency measurement is arbitrarily defined by assigning the perceptual pitch of 1000 mels to 1000 Hz. Above about 500 Hz, increasingly large intervals are judged by listeners to produce equal pitch increments. The name mel comes from the word melody to indicate the scale is based on pitch comparisons.
A commonly used formula to convert f hertz into m mels is:

$$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$

Cepstrum

The cepstrum is the result of taking the inverse Fourier Transform of the logarithm of the estimated power spectrum of a signal.

Spectrogram

Figure: Mel-frequency spectrogram of an audio sample in the Urbansound8k dataset

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. A nice way to think about spectrograms is as a stacked view of periodograms taken over successive time intervals of a digital signal.

Cochlea

The spiral cavity of the inner ear containing the organ of Corti, which produces nerve impulses in response to sound vibrations.

Preprocessing Audio: Digital Signal Processing Techniques

Dataset preprocessing, feature extraction and feature engineering are steps we take to extract information from the underlying data, information that in a machine learning context should be useful for predicting the class of a sample or the value of some target variable. In audio analysis this process is largely based on finding components of an audio signal that can help us distinguish it from other signals.

MFCCs, as mentioned above, remain a widely used tool for extracting information from audio samples. Despite libraries like Librosa giving us a Python one-liner to compute MFCCs for an audio sample, the underlying math is a bit complicated, so we’ll go through it step by step and include some useful links for further learning.

Steps for calculating MFCCs for a given audio sample:

1. Slice the signal into short frames (of time)
2. Compute the periodogram estimate of the power spectrum for each frame
3. Apply the mel filterbank to the power spectra and sum the energy in each filter
4. Take the discrete cosine transform (DCT) of the log filterbank energies

Excellent additional reading on MFCC derivation and computation can be found in blog posts here and here.

1. Slice the signal into short frames

Slicing the audio signal into short frames is useful in that it allows us to sample our audio into discrete time-steps. We assume that on short enough time scales the audio signal doesn’t change much. Typical values for the duration of the short frames are between 20–40ms. It is also conventional to overlap consecutive frames by 10–15ms.

*Note that overlapping frames make the features of neighboring frames highly correlated. The discrete cosine transform at the end of the pipeline addresses a different correlation: the one between overlapping mel filters within a single frame.*

2. Compute the power spectrum for each frame

Once we have our frames we need to calculate the power spectrum of each frame. The power spectrum of a time series describes the distribution of power into frequency components composing that signal. According to Fourier analysis, any physical signal can be decomposed into a number of discrete frequencies, or a spectrum of frequencies over a continuous range. The statistical average of a certain signal as analyzed in terms of its frequency content is called its spectrum.

Figure source: University of Maryland, Harmonic Analysis and the Fourier Transform

We apply the short-time Fourier transform to the frames to obtain a power spectrum for each.

3. Apply the mel filterbank to the power spectra and sum the energy in each filter

We still have some work to do once we have our power spectra. The human cochlea does not discern between nearby frequencies well, and this effect only becomes more pronounced as frequencies increase. The mel-scale is a tool that allows us to approximate the human auditory system’s response more closely than linear frequency bands.

Figure source: Columbia

As can be seen in the visualization above, the mel filters get wider as the frequency increases, because we care less about variations at higher frequencies. At low frequencies, where differences are more discernible to the human ear and thus more important in our analysis, the filters are narrow.
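As a rough illustration of steps 1 to 3, the sketch below frames a synthetic 440 Hz tone, computes a periodogram for each frame, and builds triangular mel filters with plain NumPy. The helper names (hz_to_mel, mel_to_hz, frame_signal, power_spectrum, mel_filterbank), the 25 ms frames with a 10 ms hop, the Hamming window, the 1024-point FFT and the 26 filters are all assumptions chosen for this example; it is not a description of what Librosa does internally.

```python
import numpy as np

def hz_to_mel(f):
    # Common closed-form approximation of the mel scale
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    # Step 1: slice the signal into short, overlapping, windowed frames
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def power_spectrum(frames, n_fft=1024):
    # Step 2: periodogram estimate of the power spectrum of each frame
    return np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Step 3: triangular filters whose edges are evenly spaced in mels
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i, k] = (right - k) / (right - center)
    return fbank, hz_points

# Synthetic input: one second of a 440 Hz sine wave (not UrbanSound8k data)
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

frames = frame_signal(signal, sample_rate)
spectra = power_spectrum(frames)
fbank, hz_points = mel_filterbank(26, 1024, sample_rate)

# Weighted sum of spectral power in each mel channel, per frame
energies = spectra @ fbank.T

# Index of the mel channel that captures most of the tone's energy
peak = int(np.argmax(energies.mean(axis=0)))
print("frames:", frames.shape, "filterbank energies:", energies.shape)
print("strongest filter spans {:.0f} Hz to {:.0f} Hz".format(hz_points[peak], hz_points[peak + 2]))
```

Because the filter edges are evenly spaced in mels, their spacing in Hz grows with frequency, which is the widening of the filters described above.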
The magnitudes from our power spectra, which were found by applying the Fourier transform to our input data, are binned by correlating them with each triangular mel filter. This binning is usually applied such that each coefficient is multiplied by the corresponding filter gain, so each mel filter comes to hold a weighted sum representing the spectral magnitude in that channel.

Once we have our filterbank energies, we take the logarithm of each. This is yet another step motivated by the constraints of human hearing: humans don’t perceive changes in volume on a linear scale. To double the perceived volume of an audio wave, the wave’s energy must increase by roughly a factor of 8. If an audio wave is already high volume (high energy), large variations in that wave’s energy may not sound very different.

4. Take the discrete cosine transform (DCT) of the log filterbank energies

Because the mel filters overlap (see step 3), the filterbank energies are usually strongly correlated with one another. Taking the discrete cosine transform helps decorrelate them.

*****

Thankfully for us, the creators of Librosa have abstracted out a ton of this math and made it easy to generate MFCCs for your audio data. Let’s go through a simple Python example to show how this analysis looks in action.

EXAMPLE PROJECT: Urbansound8k + Librosa

We’re going to be fitting a simple neural network (keras + tensorflow backend) to the UrbanSound8k dataset. To begin, let’s load our dependencies, including numpy, pandas, keras, scikit-learn, and librosa.
    #### Dependencies ####

    #### Import Comet for experiment tracking and visual tools
    from comet_ml import Experiment
    ####

    import IPython.display as ipd
    import numpy as np
    import pandas as pd
    import librosa
    import librosa.display  # needed for the waveform and spectrogram plots below
    import matplotlib.pyplot as plt
    from scipy.io import wavfile as wav

    from sklearn import metrics
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split

    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation
    from keras.optimizers import Adam
    from keras.utils import to_categorical

To begin, let’s create a Comet experiment as a wrapper for all of our work. We’ll be able to capture any and all artifacts (audio files, visualizations, model, dataset, system information, training metrics, etc.) automatically.

    experiment = Experiment(api_key="API_KEY", project_name="urbansound8k")

Let’s load in the dataset and grab a sample for each class from the dataset. We can inspect these samples visually and acoustically using Comet.

    # Load dataset
    df = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')

    # Create a list of the class labels
    labels = list(df['class'].unique())

    # Let's grab a single audio file from each class
    files = dict()
    for i in range(len(labels)):
        tmp = df[df['class'] == labels[i]][:1].reset_index()
        path = 'UrbanSound8K/audio/fold{}/{}'.format(tmp['fold'][0], tmp['slice_file_name'][0])
        files[labels[i]] = path

We can look at the waveforms for each sample using librosa’s display.waveplot function.

    fig = plt.figure(figsize=(15, 15))
    fig.subplots_adjust(hspace=0.4, wspace=0.4)
    for i, label in enumerate(labels):
        fn = files[label]
        fig.add_subplot(5, 2, i+1)
        plt.title(label)
        data, sample_rate = librosa.load(fn)
        # In librosa 0.10 and later, waveplot was removed in favor of display.waveshow
        librosa.display.waveplot(data, sr=sample_rate)
    plt.savefig('class_examples.png')

We’ll save this graphic to our Comet experiment.
    # Log graphic of waveforms to Comet
    experiment.log_image('class_examples.png')

Next, we’ll log the audio files themselves.

    # Log audio files to Comet for debugging
    for label in labels:
        fn = files[label]
        experiment.log_audio(fn, metadata={'name': label})

Once we log the samples to Comet, we can listen to samples, inspect metadata, and much more right from the UI.

Preprocessing

Now we can extract features from our data. We’re going to be using librosa, but we’ll also show another utility, scipy.io, for comparison and to observe some implicit preprocessing that’s happening.

    fn = 'UrbanSound8K/audio/fold1/191431-9-0-66.wav'
    librosa_audio, librosa_sample_rate = librosa.load(fn)
    scipy_sample_rate, scipy_audio = wav.read(fn)

    print("Original sample rate: {}".format(scipy_sample_rate))
    print("Librosa sample rate: {}".format(librosa_sample_rate))

    > Original sample rate: 48000
    > Librosa sample rate: 22050

Librosa’s load function will convert the sampling rate to 22.05 kHz automatically. It will also normalize the sample values to the range -1 to 1.

    print('Original audio file min~max range: {} to {}'.format(np.min(scipy_audio), np.max(scipy_audio)))
    print('Librosa audio file min~max range: {0:.2f} to {1:.2f}'.format(np.min(librosa_audio), np.max(librosa_audio)))

    > Original audio file min~max range: -1869 to 1665

For this clip the librosa minimum is about -0.05, so the normalized values sit well inside the -1 to 1 range.

Librosa also converts the audio signal to mono from stereo.

    # Original audio: stereo
    plt.figure(figsize=(12, 4))
    plt.plot(scipy_audio)
    plt.savefig('original_audio.png')
    experiment.log_image('original_audio.png')

Figure: Original audio (note that it’s in stereo: two audio channels)

    # Librosa: mono track
    plt.figure(figsize=(12, 4))
    plt.plot(librosa_audio)
    plt.savefig('librosa_audio.png')
    experiment.log_image('librosa_audio.png')

Figure: Librosa audio: converted to mono

Extracting MFCCs from audio using Librosa

Remember all the math we went through to understand mel-frequency cepstrum coefficients earlier?
Using Librosa, here’s how you extract them from audio (using the librosa_audio we defined above):

    mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)

That’s it!

    print(mfccs.shape)

    > (40, 173)

Librosa calculated 40 MFCCs for each of the 173 frames in this audio sample.

    plt.figure(figsize=(8, 8))
    librosa.display.specshow(mfccs, sr=librosa_sample_rate, x_axis='time')
    plt.savefig('MFCCs.png')
    experiment.log_image('MFCCs.png')

We’ll define a simple function to extract MFCCs for every file in our dataset.

    def extract_features(file_name):
        # Load the clip, compute 40 MFCCs per frame, then average each coefficient over time
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        mfccs_processed = np.mean(mfccs.T, axis=0)
        return mfccs_processed

Now let’s extract features.

    import os

    # Root folder of the audio files, and the metadata table loaded earlier as df
    fulldatasetpath = 'UrbanSound8K/audio'
    metadata = df

    features = []

    # Iterate through each sound file and extract the features
    for index, row in metadata.iterrows():
        file_name = os.path.join(os.path.abspath(fulldatasetpath), 'fold'+str(row["fold"])+'/', str(row["slice_file_name"]))
        class_label = row["class"]
        data = extract_features(file_name)
        features.append([data, class_label])

    # Convert into a pandas dataframe
    featuresdf = pd.DataFrame(features, columns=['feature', 'class_label'])

We now have a dataframe where each row has a label (class) and a single feature column, composed of 40 time-averaged MFCCs.
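The averaging step inside extract_features is what turns clips of different lengths into rows of equal size. The sketch below shows that reduction on random stand-in arrays; the array names and the 87-frame second clip are made up for illustration, and the values are not real MFCCs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one clip's MFCC matrix: 40 coefficients x 173 frames
fake_mfccs = rng.normal(size=(40, 173))

# Same reduction as in extract_features: average each coefficient over all frames
pooled = np.mean(fake_mfccs.T, axis=0)

# A shorter clip (fewer frames) still reduces to 40 values
other_clip = rng.normal(size=(40, 87))
pooled_other = np.mean(other_clip.T, axis=0)

print(pooled.shape, pooled_other.shape)
```

The trade-off of this design is that averaging discards the order of the frames, so the model only sees the overall spectral character of a clip; in exchange, every clip becomes a fixed-length vector that a dense network can consume directly.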
    featuresdf.head()

    featuresdf.iloc[0]['feature']

    array([-2.1579300e+02,  7.1666122e+01, -1.3181377e+02, -5.2091331e+01,
           -2.2115969e+01, -2.1764181e+01, -1.1183747e+01,  1.8912683e+01,
            6.7266388e+00,  1.4556893e+01, -1.1782045e+01,  2.3010368e+00,
           -1.7251305e+01,  1.0052421e+01, -6.0095000e+00, -1.3153191e+00,
           -1.7693510e+01,  1.1171228e+00, -4.3699470e+00,  7.2629538e+00,
           -1.1815971e+01, -7.4952612e+00,  5.4577131e+00, -2.9442446e+00,
           -5.8693886e+00, -9.8654032e-02, -3.2121708e+00,  4.6092505e+00,
           -5.8293257e+00, -5.3475075e+00,  1.3341187e+00,  7.1307826e+00,
           -7.9450034e-02,  1.7109241e+00, -5.6942000e+00, -2.9041715e+00,
            3.0366952e+00, -1.6827590e+00, -8.8585770e-01,  3.5438776e-01],
          dtype=float32)

Now that we have successfully extracted our features from the underlying audio data, we can build and train a model.

Model building and training

We’ll start by converting our MFCCs to numpy arrays, and encoding our classification labels.

    from sklearn.preprocessing import LabelEncoder
    from keras.utils import to_categorical

    # Convert features and corresponding classification labels into numpy arrays
    X = np.array(featuresdf.feature.tolist())
    y = np.array(featuresdf.class_label.tolist())

    # Encode the classification labels as one-hot vectors
    le = LabelEncoder()
    yy = to_categorical(le.fit_transform(y))

Our dataset will be split into training and test sets.

    # Split the dataset
    from sklearn.model_selection import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state=127)

Let’s define and compile a simple feedforward neural network architecture.
    num_labels = yy.shape[1]
    filter_size = 2  # not used by this dense model

    def build_model_graph(input_shape=(40,)):
        model = Sequential()
        # Pass input_shape to the first layer so the model is built before summary()
        model.add(Dense(256, input_shape=input_shape))
        model.add(Activation('relu'))
        model.add(Dropout(0.5))
        model.add(Dense(256))
        model.add(Activation('relu'))
        model.add(Dropout(0.5))
        model.add(Dense(num_labels))
        model.add(Activation('softmax'))
        # Compile the model
        model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
        return model

    model = build_model_graph()

Let’s look at a model summary and compute pre-training accuracy.

    # Display model architecture summary
    model.summary()

    # Calculate pre-training accuracy
    score = model.evaluate(x_test, y_test, verbose=0)
    accuracy = 100*score[1]
    print("Pre-training accuracy: %.4f%%" % accuracy)

    > Pre-training accuracy: 12.2496%

This is close to the 10% we would expect from random guessing over ten classes. Now it’s time to train our model.

    from keras.callbacks import ModelCheckpoint
    from datetime import datetime

    num_epochs = 100
    num_batch_size = 32

    start = datetime.now()
    model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), verbose=1)
    duration = datetime.now() - start
    print("Training completed in time: ", duration)

Even before training has completed, Comet keeps track of the key information about our experiment. We can visualize our accuracy and loss curves in real time from the Comet UI (note the orange spin wheel indicates that training is in progress).

Figure: Comet’s experiment visualization dashboard

Once trained, we can evaluate our model on the train and test data.

    # Evaluating the model on the training and testing set
    score = model.evaluate(x_train, y_train, verbose=0)
    print("Training Accuracy: {0:.2%}".format(score[1]))

    score = model.evaluate(x_test, y_test, verbose=0)
    print("Testing Accuracy: {0:.2%}".format(score[1]))

    > Training Accuracy: 93.00%
    > Testing Accuracy: 87.35%

Conclusion

Our model has trained rather well, but there is likely lots of room for improvement, perhaps using Comet’s Hyperparameter Optimization tool.
In a small amount of code we’ve been able to extract mathematically complex MFCCs from audio data, build and train a neural network to classify audio based on those MFCCs, and evaluate our model on the test data.

To get started with Comet, click here. Comet is 100% FREE for public projects.