This is an example of audio data analysis with a 2D CNN. Since a mel spectrogram can be treated as an image, classification of sound data can be performed with a CNN. Instead of mixing the time and the frequency axes together, we will only convolve one axis at a time.

First and foremost, let's make sure that the libraries are all set up:

import os, shutil
import numpy as np
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import IPython.display as ipd

We set the sampling rate to 8820 Hz, define the directories where the data lives, and set the file name (model_file) under which models will be saved during learning. Simply retrieve the audio data from Kaggle (it can be anything: nature sounds, clapping sounds, etc.).

s_rate = 8820  # sampling rate
n_fft = 1024
hop_length = 128
n_mels = 128

# define directories
base_dir = './'
esc_dir = os.path.join(base_dir, 'ESC3')
meta_file = os.path.join(esc_dir, 'meta/esc3.csv')
audio_dir = os.path.join(esc_dir, 'audio/')
model_file = 'esc3-model-sr{}.h5'.format(s_rate)

# To show more rows and columns without "..."
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

Read the meta file:

# load metadata
meta_data = pd.read_csv(meta_file, delimiter=',', skiprows=0, header=0)
print(meta_data.shape)
display(meta_data.head())

tgt_vc = meta_data['target'].value_counts()
n_classes = len(tgt_vc.index)
display(tgt_vc)
meta_data['target'] = meta_data['target'].replace(tgt_vc.index, list(range(n_classes)))
display(meta_data.head())

cat = meta_data['category']
classes = []
for i in range(n_classes):
    sel = cat[meta_data['target'] == i].reset_index(drop=True)
    classes.append(sel[0])
print(classes)

Replace each category ID (the target column) with a serial number starting from 0. In addition, create a list named classes for mapping the serial numbers back to class names.

# load a wave data
def load_wave_data(audio_dir, file_name):
    file_path = os.path.join(audio_dir, file_name)
    x, fs = librosa.load(file_path, sr=s_rate)
    return x, fs

A function for loading wav files. x is the audio data converted to an ndarray; fs is the sampling frequency.

# change wave data to mel-stft
def calculate_melsp(x, sr, n_fft=1024, hop_length=128, n_mels=128):
    stft = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop_length))**2
    melsp = librosa.feature.melspectrogram(S=stft, sr=sr, n_mels=n_mels)
    log_melsp = librosa.power_to_db(melsp)
    #print(log_melsp[:3])  # debug
    return log_melsp

This is a function for creating a mel spectrogram. Before calculating the mel spectrogram, we take the short-time Fourier transform with librosa (librosa.stft returns a complex matrix) and square its absolute value to obtain a power spectrogram, which is then mapped onto the mel scale. librosa.power_to_db() simply converts the power spectrogram to dB units.

To be continued in the second post...
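As a quick preview, the sketch below shows how the two helper functions above could fit together: load one clip and compute its log-mel spectrogram. This is not code from the post; it assumes the metadata has a 'filename' column (as in the ESC-50 meta file) and that the corresponding wav files sit in audio_dir.

# Minimal sketch (not from the original post): run the helpers end to end.
# Assumes meta_data has a 'filename' column listing the wav files in audio_dir.
example_file = meta_data['filename'][0]          # first clip listed in the metadata
x, fs = load_wave_data(audio_dir, example_file)  # waveform resampled to 8820 Hz
log_melsp = calculate_melsp(x, fs)               # shape: (n_mels, number of frames)
print(x.shape, fs, log_melsp.shape)

# Display the log-mel spectrogram as an image, i.e. the input a 2D CNN would see.
librosa.display.specshow(log_melsp, sr=fs, hop_length=hop_length,
                         x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title(example_file)
plt.show()

If the printed shapes look reasonable (128 mel bands by a few hundred frames for a few seconds of audio at this rate), the preprocessing is ready to feed into the CNN.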