Classification using Neural Network with Audio Data

@ hubertaiandqc Hubert S 3rd year INIAD Computer science & engineering | ML & QC Interest

This is an example of audio data analysis with a 2D CNN.

Because a mel spectrogram can be treated as an image, a CNN can be used to classify sound data. Instead of mixing the time and frequency axes together, we will convolve only one axis at a time.
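To make "one axis at a time" concrete, here is a minimal NumPy sketch, not the network used in this series. It assumes a toy spectrogram of shape (n_mels, n_frames) and a hypothetical 3-tap averaging kernel, and applies a 1D convolution along either the time axis or the frequency axis while leaving the other axis untouched. In a 2D CNN the same idea is usually expressed with kernels of shape (1, k) or (k, 1).

```python
import numpy as np

def convolve_along_axis(spec, kernel, axis):
    """Apply a 1D convolution along a single axis of a 2D array."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, spec
    )

# Toy "spectrogram": 3 mel bands (rows) x 4 time frames (columns).
spec = np.arange(12, dtype=float).reshape(3, 4)

# Hypothetical 3-tap averaging kernel, for illustration only.
kernel = np.ones(3) / 3.0

time_only = convolve_along_axis(spec, kernel, axis=1)  # mixes neighbouring frames
freq_only = convolve_along_axis(spec, kernel, axis=0)  # mixes neighbouring mel bands

print(time_only.shape, freq_only.shape)
```

Both outputs keep the input shape; only the values along the chosen axis are blended.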

First and foremost, let's make sure that the libraries are all set up:

import os
import shutil
import numpy as np
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import IPython.display as ipd

We set the sampling rate to 8820 Hz and define the directories the data is read from, along with model_file, the file name under which models are saved during training. The audio data can be downloaded from Kaggle (it can be anything: nature sounds, clapping, etc.).

# sampling rate and spectrogram parameters
s_rate = 8820
n_fft = 1024
hop_length = 128
n_mels = 128

# define directories
base_dir = './'
esc_dir = os.path.join(base_dir, 'ESC3')
meta_file = os.path.join(esc_dir, 'meta/esc3.csv')
audio_dir = os.path.join(esc_dir, 'audio/')
model_file = 'esc3-model-sr{}.h5'.format(s_rate)

# To show more rows and columns without "..."
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
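To get a feel for what s_rate, n_fft and hop_length imply, the short calculation below works out the resulting time and frequency resolution. The 5-second clip length is a hypothetical value chosen for illustration, and the frame count assumes librosa's default centered framing (center=True).

```python
# Values taken from the settings above.
s_rate = 8820
n_fft = 1024
hop_length = 128

# Hypothetical clip length, for illustration only.
clip_seconds = 5.0

nyquist_hz = s_rate / 2                  # highest representable frequency
window_ms = 1000 * n_fft / s_rate        # duration of one analysis window
hop_ms = 1000 * hop_length / s_rate      # time step between frames
n_samples = int(clip_seconds * s_rate)   # samples in the clip
n_frames = 1 + n_samples // hop_length   # frames with centered framing
n_bins = 1 + n_fft // 2                  # STFT frequency bins before mel mapping

print(f"Nyquist: {nyquist_hz:.0f} Hz")
print(f"window: {window_ms:.1f} ms, hop: {hop_ms:.1f} ms")
print(f"{n_samples} samples -> {n_frames} frames x {n_bins} bins")
```

Under these assumptions, each clip becomes a 128 x 345 image once the 513 frequency bins are reduced to n_mels = 128 mel bands.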

Read the metadata file:

# load metadata (display() is available in Jupyter notebooks)
meta_data = pd.read_csv(meta_file, delimiter=',', skiprows=0, header=0)
print(meta_data.shape)
display(meta_data.head())

tgt_vc = meta_data['target'].value_counts()
n_classes = len(tgt_vc.index)
display(tgt_vc)

meta_data['target'] = meta_data['target'].replace(tgt_vc.index, list(range(n_classes)))
display(meta_data.head())

cat = meta_data['category']
classes = []
for i in range(n_classes):
    sel = cat[meta_data['target'] == i].reset_index(drop=True)
    classes.append(sel[0])
print(classes)

The code above replaces each category ID (the target column) with a serial number starting from 0. In addition, it creates a list named classes for converting the serial numbers back to class names.
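The relabeling step can be tried on a small, made-up table. The sketch below is one possible variant rather than the code used in this post: it builds an explicit old-to-new dictionary and applies it with Series.map, which makes the lookup a single pass even when old and new IDs overlap (here the old ID 0 also appears as a new ID). The DataFrame contents are hypothetical.

```python
import pandas as pd

# Hypothetical metadata with non-contiguous category IDs.
meta = pd.DataFrame({
    "target":   [40, 40, 40, 10, 10, 0],
    "category": ["rain", "rain", "rain", "dog", "dog", "clapping"],
})

# Most frequent ID first, as with value_counts() in the post.
counts = meta["target"].value_counts()

# Explicit old ID -> serial number lookup.
mapping = {old: new for new, old in enumerate(counts.index)}
meta["target"] = meta["target"].map(mapping)

# Class names ordered by their new serial number.
classes = (
    meta.drop_duplicates("target")
        .sort_values("target")["category"]
        .tolist()
)

print(mapping)
print(meta["target"].tolist())
print(classes)
```

The toy counts are deliberately distinct (3, 2, 1) so that the ordering from value_counts() is unambiguous.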

# load wave data
def load_wave_data(audio_dir, file_name):
    file_path = os.path.join(audio_dir, file_name)
    x, fs = librosa.load(file_path, sr=s_rate)
    return x, fs

This function loads a wav file. x: the audio data as an ndarray. fs: the sampling rate.

# change wave data to mel-stft
def calculate_melsp(x, sr, n_fft=1024, hop_length=128, n_mels=128):
    stft = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop_length)) ** 2
    melsp = librosa.feature.melspectrogram(S=stft, sr=sr, n_mels=n_mels)
    log_melsp = librosa.power_to_db(melsp)
    # print(log_melsp[:3])  # debug
    return log_melsp

This function creates a log-mel spectrogram. librosa.stft computes the short-time Fourier transform and returns a complex matrix; we take its absolute value and square it to obtain the power spectrogram, which librosa.feature.melspectrogram then maps onto the mel scale. librosa.power_to_db() simply converts the power spectrogram to dB units.
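The dB conversion is easy to reproduce by hand. The following is a minimal NumPy sketch of what librosa.power_to_db computes, assuming its documented defaults (ref=1.0, amin=1e-10, top_db=80.0); the function name power_to_db_sketch is hypothetical, and the sketch is not a replacement for the library call.

```python
import numpy as np

def power_to_db_sketch(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels: 10 * log10(S / ref)."""
    S = np.asarray(S, dtype=float)
    # Clamp tiny values so the logarithm stays finite.
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    if top_db is not None:
        # Limit the dynamic range to top_db below the peak.
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec

db = power_to_db_sketch([1.0, 10.0, 100.0])
floor = power_to_db_sketch([1e-12, 1.0])

print(db)
print(floor)
```

Each tenfold increase in power adds 10 dB, and anything more than 80 dB below the loudest value is raised to that floor.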

To be continued in the second post...
