I’ve always wanted to learn how to separate vocals from a track programmatically and not depend on software-as-a-service to perform the task for me. This article illustrates how to separate the vocals of a song from the instruments using my new favorite library, Librosa. You can check out the Google Colab Notebook here.

The idea sparked when I wanted to separate the individual tracks of a song I really liked, so I went to Product Hunt and discovered Melody ML. This discovery started the urge to learn ML for music, and led to the eventual discovery of the Python library, librosa.

By the way, I ran out of RAM, which made my notebook explode.

https://twitter.com/tonypoppinss/status/1620369862630739968

## Install and import dependencies

```bash
pip install librosa matplotlib IPython
```

```python
import librosa
from librosa import display
import numpy as np
import IPython.display as ipd
import matplotlib.pyplot as plt
```

## Load and display the song

I used *My Last Serenade* by KSE, as I wondered how the growling or shouting parts of the song would come out.

```python
y, sr = librosa.load('My Last Serenade.wav')
ipd.Audio(data=y[90*sr:110*sr], rate=sr)
```

We slice a 20-second snippet from the chorus of the song and display the audio using `ipd.Audio`. The photo is shown below because I couldn't find a way to upload audio here on DEV (tbh, this is a bit exhausting).

We separate the complex-valued spectrogram D into its magnitude (S) and phase (P) components, convert the timestamps into frames, plot the data, then display the full spectrogram.

```python
S_full, phase = librosa.magphase(librosa.stft(y))
idx = slice(*librosa.time_to_frames([90, 110], sr=sr))
fig, ax = plt.subplots()
img = display.specshow(librosa.amplitude_to_db(S_full[:, idx], ref=np.max),
                       y_axis='log', x_axis='time', sr=sr, ax=ax)
fig.colorbar(img, ax=ax)
```

Line-by-line explanation:

- `S_full, phase = librosa.magphase(librosa.stft(y))` - we separate the magnitude and phase of the track using the short-time Fourier transform, which represents the signal in the time-frequency domain by computing discrete Fourier transforms (DFT) over short windows.
- `idx = slice(*librosa.time_to_frames([90, 110], sr=sr))` - convert the 90-110 second slice of the song into STFT frame indices using librosa's `time_to_frames` function.
- `img = display.specshow(...)` - display the spectrogram of the 20-second slice by converting the amplitude spectrogram to a dB-scaled spectrogram of the magnitude, using the peak value as the reference (`ref=np.max`), and plot it with a log-scaled frequency y-axis and time on the x-axis.

Below is the image of the spectrum:

## Decomposing the spectrogram

```python
S_filter = librosa.decompose.nn_filter(S_full,
                                       aggregate=np.median,
                                       metric='cosine',
                                       width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)
```

Line-by-line explanation:

- `S_filter = librosa.decompose.nn_filter(...)` - we filter the spectrogram by its nearest neighbors: frames are compared using cosine similarity, aggregated by their median, and constrained to be at least 2 seconds apart, which suppresses the sparse vocals and keeps the repeating sounds of the spectrum.
- `S_filter = np.minimum(S_full, S_filter)` - we take the element-wise minimum of `S_full` and `S_filter`, so the filtered spectrogram never exceeds the original magnitudes.
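To make the element-wise minimum step concrete, here is a tiny sketch with made-up numbers (not real spectrogram values, just an illustration of what `np.minimum` does):

```python
import numpy as np

# Made-up 2x2 "magnitudes", purely for illustration.
S_full_toy = np.array([[1.0, 0.2],
                       [0.5, 0.9]])
S_filter_toy = np.array([[0.7, 0.4],
                         [0.6, 0.1]])

# Element-wise minimum: each cell keeps the smaller of the two values,
# so the clamped filter output can never exceed the original magnitudes.
print(np.minimum(S_full_toy, S_filter_toy))
# [[0.7 0.2]
#  [0.5 0.1]]
```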
## Display the background and foreground spectrum of the audio

```python
margin_i, margin_v = 3, 11
power = 3

mask_i = librosa.util.softmask(S_filter,
                               margin_i * (S_full - S_filter),
                               power=power)
mask_v = librosa.util.softmask(S_full - S_filter,
                               margin_v * S_filter,
                               power=power)

S_foreground = mask_v * S_full
S_background = mask_i * S_full
```

Line-by-line explanation:

- `margin_i, margin_v = 3, 11` and `power = 3` - we use margins to reduce loss of sound between the vocal and instrumental masks.
- `mask_i = librosa.util.softmask(...)` and `mask_v = librosa.util.softmask(...)` - returns the soft masks computed in a numerically stable way.
- `S_foreground = mask_v * S_full` and `S_background = mask_i * S_full` - multiply the masks with the input spectrum to separate the components.

## Plotting the full spectrum, background and foreground spectrum

```python
fig, ax = plt.subplots(nrows=3, sharex=True, sharey=True)

img = display.specshow(librosa.amplitude_to_db(S_full[:, idx], ref=np.max),
                       y_axis='log', x_axis='time', sr=sr, ax=ax[0])
ax[0].set(title='Full Spectrum')
ax[0].label_outer()

display.specshow(librosa.amplitude_to_db(S_background[:, idx], ref=np.max),
                 y_axis='log', x_axis='time', sr=sr, ax=ax[1])
ax[1].set(title='Background Spectrum')
ax[1].label_outer()

display.specshow(librosa.amplitude_to_db(S_foreground[:, idx], ref=np.max),
                 y_axis='log', x_axis='time', sr=sr, ax=ax[2])
ax[2].set(title='Foreground Spectrum')
ax[2].label_outer()

fig.colorbar(img, ax=ax)
```

## Recover the foreground audio from the masked spectrogram and play back the audio

```python
y_foreground = librosa.istft(S_foreground * phase)
ipd.Audio(data=y_foreground[90*sr:110*sr], rate=sr)
```

Line-by-line explanation:

- `y_foreground = librosa.istft(S_foreground * phase)` - inverts the short-time Fourier transform, turning the masked spectrogram back into a waveform.
- `ipd.Audio(data=y_foreground[90*sr:110*sr], rate=sr)` - plays back the vocals from the track.

## Conclusion

This seemed easy at first thought, and when I was reading the documentation, but digging into the code made me realize that this idea was a little more complex. What made me continue was reading about nearest neighbors in one part of the documentation, which made me realize that I will be getting my hands dirty with Machine Learning in the future with this library.

Also published here.
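As a small follow-up, here is a minimal sketch of how you might also invert the background (instrumental) spectrogram and write both stems to disk, reusing the variables defined above. It assumes the `soundfile` package is installed (`pip install soundfile`); the output filenames are placeholders I picked for illustration.

```python
import soundfile as sf

# Invert the masked background spectrogram, the same way we did for the vocals.
y_background = librosa.istft(S_background * phase)

# Write both stems to disk so they can be used outside the notebook.
# The filenames here are arbitrary placeholders.
sf.write('vocals.wav', y_foreground, sr)
sf.write('instrumental.wav', y_background, sr)
```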