This is an example of audio data analysis with a 2D CNN. Since a mel spectrogram can be treated as an image, classification of sound data can be performed with a CNN. Instead of mixing the time and the frequency axes together, we will only convolve one axis at a time.

First and foremost, let's make sure that the libraries are all set up:

import os, shutil
import numpy as np
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import IPython.display as ipd

We set the sampling rate to 8820 Hz, define the directories where the data lives, and set the file name (model_file) under which models will be saved during learning. Simply retrieve the audio data from Kaggle (it can be anything: nature sounds, clapping sounds, etc.).

s_rate = 8820  # sampling rate
n_fft = 1024
hop_length = 128
n_mels = 128

# define directories
base_dir = './'
esc_dir = os.path.join(base_dir, 'ESC3')
meta_file = os.path.join(esc_dir, 'meta/esc3.csv')
audio_dir = os.path.join(esc_dir, 'audio/')
model_file = 'esc3-model-sr{}.h5'.format(s_rate)

# To show more rows and columns without "..."
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

Read the meta file:

# load metadata
meta_data = pd.read_csv(meta_file, delimiter=',', skiprows=0, header=0)
print(meta_data.shape)
display(meta_data.head())

tgt_vc = meta_data['target'].value_counts()
n_classes = len(tgt_vc.index)
display(tgt_vc)
meta_data['target'] = meta_data['target'].replace(tgt_vc.index, list(range(n_classes)))
display(meta_data.head())

cat = meta_data['category']
classes = []
for i in range(n_classes):
    sel = cat[meta_data['target'] == i].reset_index(drop=True)
    classes.append(sel[0])
print(classes)

Replace each category ID (the target column) with a serial number starting from 0. In addition, create a list named classes for mapping the serial numbers back to class names.

# load a wave data
def load_wave_data(audio_dir, file_name):
    file_path = os.path.join(audio_dir, file_name)
    x, fs = librosa.load(file_path, sr=s_rate)
    return x, fs

A function for loading wav files. x is the audio data converted to an ndarray; fs is the sampling frequency.

# change wave data to mel-stft
def calculate_melsp(x, sr, n_fft=1024, hop_length=128, n_mels=128):
    stft = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop_length))**2
    melsp = librosa.feature.melspectrogram(S=stft, sr=sr, n_mels=n_mels)
    log_melsp = librosa.power_to_db(melsp)
    #print(log_melsp[:3])  # debug
    return log_melsp

This is a function for creating a mel spectrogram. Before calculating the mel spectrogram, we take the short-time Fourier transform with librosa (librosa.stft returns a complex matrix) and square its absolute value to obtain a power spectrogram, which is then mapped onto the mel scale. librosa.power_to_db() simply converts the power spectrogram to dB units.

To be continued in the second post...
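As a quick preview, the sketch below shows how the two helper functions above could fit together: load one clip and compute its log-mel spectrogram. This is not code from the post; it assumes the metadata has a 'filename' column (as in the ESC-50 meta file) and that the corresponding wav files sit in audio_dir.

# Minimal sketch (not from the original post): run the helpers end to end.
# Assumes meta_data has a 'filename' column listing the wav files in audio_dir.
example_file = meta_data['filename'][0]          # first clip listed in the metadata
x, fs = load_wave_data(audio_dir, example_file)  # waveform resampled to 8820 Hz
log_melsp = calculate_melsp(x, fs)               # shape: (n_mels, number of frames)
print(x.shape, fs, log_melsp.shape)

# Display the log-mel spectrogram as an image, i.e. the input a 2D CNN would see.
librosa.display.specshow(log_melsp, sr=fs, hop_length=hop_length,
                         x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title(example_file)
plt.show()

If the printed shapes look reasonable (128 mel bands by a few hundred frames for a few seconds of audio at this rate), the preprocessing is ready to feed into the CNN.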