Handling audio data is an essential task for machine learning engineers working in the fields of speech analytics, music information retrieval and multimodal data analysis, but also for developers who simply want to edit, record and transcode sounds. This article shows the basics of handling audio data using command-line tools, and also provides a not-so-deep dive into handling sounds in Python.

So what is sound, and what are its basic attributes? According to physics, sound is a travelling vibration, i.e. a wave that moves through a medium such as air. The sound wave transfers energy from particle to particle until it is finally "received" by our ears and perceived by our brains. The two basic attributes of sound are amplitude (what we also call loudness) and frequency (a measure of the wave's vibrations per time unit).

[Photo by Kai Dahms on Unsplash]

Similarly to images and videos, sound is an analog signal that has to be transformed to a digital signal in order to be stored in computers and analyzed by software. This analog-to-digital conversion includes two processes: sampling and quantization.

Sampling is used to convert the time-varying continuous signal x(t) to a discrete sequence of real numbers x(n). The interval between two successive discrete samples is the sampling period (Ts). We use the sampling frequency (fs = 1/Ts) as the attribute that describes the sampling process.

Typical sampling frequencies are 8 kHz, 16 kHz and 44.1 kHz. 1 Hz means one sample per second, so higher sampling frequencies mean more samples per second and therefore better signal quality. (More precisely, the discrete signal can capture a wider range of frequencies, namely from 0 to fs/2 Hz according to the Nyquist rule.)

Quantization is the process of replacing each real number, x(n), of the sequence of samples with an approximation from a finite set of discrete values. In other words, quantization is the process of reducing the infinite number precision of an audio sample to a finite precision as defined by a particular number of bits.

In the majority of cases, 16 bits per sample are used to represent each quantized sample, which means that there are 2¹⁶ levels for the quantized signal. For that reason, raw audio values usually vary from -2¹⁵ to 2¹⁵ - 1 (1 bit is used for the sign). However, as we will see later, this is usually normalized to the (-1, 1) range for the sake of simplicity. We usually call this bit resolution property of the quantization procedure "sample resolution", and it is measured in bits per sample.

Tools and libraries used in this article

I've selected the following command-line tools, programs and libraries to use for basic handling of audio data:

- ffmpeg/libav. FFmpeg (https://ffmpeg.org) is a free, open-source project for handling multimedia files and streams. Some think that ffmpeg and libav are the same, but libav is actually a fork of ffmpeg.
- sox (http://sox.sourceforge.net), aka "the Swiss Army knife of sound processing programs", is a free cross-platform command-line utility for basic audio processing. Despite the fact that it has not been updated since 2015, it is still a good solution. In this article we mostly demonstrate ffmpeg, plus a couple of examples in sox.
- audacity (https://www.audacityteam.org) is a free, open-source and cross-platform program for editing sounds.
- programming: we will use pydub (https://github.com/jiaaro/pydub) and scipy (https://scipy-cookbook.readthedocs.io) for reading audio data, and librosa (https://librosa.github.io/librosa/), which is used later in this article for tempo estimation. We could also use pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis) for IO or for more advanced feature extraction and signal analysis. Finally, we will also use plotly (https://plotly.com) for basic signal visualization.

This article is divided into two parts:

- 1st part: how to use ffmpeg and sox to handle audio files
- 2nd part: how to programmatically handle audio files and perform basic processing

Part I: Handling audio data, the command-line way

Below are some examples of the most basic audio handling tasks, such as conversion between formats, temporal trimming, merging and segmentation, using mostly ffmpeg and sox.

To convert video (mkv) to audio (mp3):

    ffmpeg -i video.mkv audio.mp3

For downsampling to 16 kHz and converting stereo (2 channels) to mono (1 channel), one needs to use the -ar (audio rate) and -ac (audio channels) properties. The same command also converts MP3 to WAV (uncompressed audio samples) when the input is an MP3 file, since ffmpeg infers the output format from the file extension:

    ffmpeg -i audio.wav -ar 16000 -ac 1 audio_16K_mono.wav

Note that, in that case, stereo to mono conversion means that the two channels are averaged to one. Downsampling and stereo to mono conversion can also be achieved using sox in the following manner:

    sox <source_file> -r <new_sampling_rate> -c 1 <output_file>

Now let's see the new file's attributes using ffmpeg:

    ffmpeg -i audio_16K_mono.wav

will return:

    Input #0, wav, from 'audio_16K_mono.wav':
      Metadata:
        encoder : Lavf57.71.100
      Duration: 00:03:10.29, bitrate: 256 kb/s
        Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s

To trim an audio file, e.g. from the 60th to the 80th second (20 seconds new duration):

    ffmpeg -i audio.wav -ss 60 -t 20 audio_small.wav

(This can also be achieved with the -to argument, which defines the end of the trimmed segment; in the example above that would be 80.)

To concatenate two or more audio files one can use the "ffmpeg -f concat" command. Suppose you want to concatenate the files file1.wav, file2.wav and file3.wav into a large file called output.wav. What you need to do is create a text file of the following format (say, named 'list_of_files_to_concat'):

    file 'file1.wav'
    file 'file2.wav'
    file 'file3.wav'

and then run

    ffmpeg -f concat -i list_of_files_to_concat -c copy output.wav

On the other hand, breaking an audio file into successive chunks (segments) of the same specified duration can be done with the "ffmpeg -f segment" option. For example, the following command will break output.wav into 1-second, non-overlapping segments named out00000.wav, out00001.wav, etc.:

    ffmpeg -i output.wav -f segment -segment_time 1 -c copy out%05d.wav

With regards to channel handling, apart from simple mono to stereo conversion (or stereo to mono) through the -ac property, one may want to switch stereo channels (right to left). The way to achieve this is through the ffmpeg map_channel property. (Note that newer ffmpeg releases deprecate -map_channel in favor of audio filters such as pan and channelmap, so the commands below may need adapting on recent versions.)

    ffmpeg -i stereo.wav -map_channel 0.0.1 -map_channel 0.0.0 stereo_inverted.wav

To create a stereo file from two mono files, say left.wav and right.wav:

    ffmpeg -i left.wav -i right.wav -filter_complex "[0:a][1:a]join=inputs=2:channel_layout=stereo[a]" -map "[a]" mix_channels.wav

In the opposite direction, to split a stereo file into two mono files (one for each channel):

    ffmpeg -i stereo.wav -map_channel 0.0.0 left.wav -map_channel 0.0.1 right.wav

map_channel can also be used to mute a channel of a stereo signal, e.g. (below the left channel is muted):

    ffmpeg -i stereo.wav -map_channel -1 -map_channel 0.0.1 muted.wav

Volume adaptation can also be achieved through ffmpeg, e.g.

    ffmpeg -i data/music_44100.wav -filter:a "volume=0.5" data/music_44100_volume_50.wav
    ffmpeg -i data/music_44100.wav -filter:a "volume=2.0" data/music_44100_volume_200.wav

The figure below presents a screenshot from viewing (with Audacity) the original, the 50% volume adaptation and the x2 (200%) volume adaptation signals. The x2 volume-boosted signal is clearly clipped (i.e. some samples cannot be represented and they are assigned the maximum allowed value, 2¹⁵ - 1 for 16-bit signals):

[Figure: Audacity screenshot of the original, 50% volume and 200% volume signals]

Volume change can be achieved with sox as well, in the following way:

    sox -v 0.5 data/music_44100.wav data/music_44100_volume_50_sox.wav
    sox -v 2.0 data/music_44100.wav data/music_44100_volume_200_sox.wav

Part II: Handling audio data, the programming way

Load WAV and MP3 files to array

Let us first load our sampled audio data to a numpy array (we use numpy arrays as they are considered the most widely adopted way to process numerical sequences/vectors). The most common way to load WAV data to numpy arrays is scipy.io.wavfile, while for MP3 data one can use pydub (https://github.com/jiaaro/pydub), which uses ffmpeg for encoding / decoding audio data. In the following example, the same signal stored in WAV and MP3 files is loaded to numpy arrays.

    # Read WAV and MP3 files to array
    from pydub import AudioSegment
    import numpy as np
    from scipy.io import wavfile
    from plotly.offline import init_notebook_mode
    import plotly.graph_objs as go
    import plotly

    # read WAV file using scipy.io.wavfile
    fs_wav, data_wav = wavfile.read("data/music_8k.wav")

    # read MP3 file using pydub
    audiofile = AudioSegment.from_file("data/music_8k.mp3")
    data_mp3 = np.array(audiofile.get_array_of_samples())
    fs_mp3 = audiofile.frame_rate

    # cast to a wider integer type so that squaring 16-bit values cannot overflow
    diff = data_mp3.astype(np.int64) - data_wav.astype(np.int64)
    print('Sq Error Between mp3 and wav data = {}'.format((diff ** 2).sum()))
    print('Signal Duration = {} seconds'.format(data_wav.shape[0] / fs_wav))

result:

    Sq Error Between mp3 and wav data = 0
    Signal Duration = 5.256 seconds

Note: the overall duration of the loaded signal (in seconds) is computed by dividing the number of samples by the sampling frequency (Hz = samples per second). Also, in the example above we compute the sum of squared errors to make sure that the two signals are identical despite their mp3 to wav conversion.

Stereo signals

Stereo signals are handled through 2D arrays.
In the following example, the data_wav array has two columns, one for each channel. By convention, the left channel is always the first column and the right channel the second.

    # Handling stereo signals
    fs_wav, data_wav = wavfile.read("data/stereo_example_small_8k.wav")
    time_wav = np.arange(0, len(data_wav)) / fs_wav
    plotly.offline.iplot({"data": [go.Scatter(x=time_wav, y=data_wav[:, 0], name='left channel'),
                                   go.Scatter(x=time_wav, y=data_wav[:, 1], name='right channel')]})

Normalization

Normalization is necessary for performing computations on the audio signal values, as it makes the signal values independent of the sample resolution (i.e. signals with 24 bits per sample have a much higher range of values than signals with 16 bits per sample). The following example demonstrates how to normalize an audio signal to the (-1, 1) range by simply dividing by 2¹⁵. This works because we know that the sample resolution is 16 bits per sample. In the rare case of 24 bits per sample, the normalization factor should change accordingly.

    # Normalization
    fs_wav, data_wav = wavfile.read("data/lost_highway_small.wav")
    data_wav_norm = data_wav / (2**15)
    time_wav = np.arange(0, len(data_wav)) / fs_wav
    plotly.offline.iplot({"data": [go.Scatter(x=time_wav, y=data_wav_norm, name='normalized audio signal')]})

Trim / Segment

The following example shows how to get seconds 2 to 4 from the previously loaded and normalized signal. This is done by simply referring to the respective indices in the numpy array. The indices must be expressed in audio samples, so seconds need to be multiplied by the sampling frequency.
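As a side illustration of the two ideas above (a normalization factor that follows the sample resolution, and trimming by converting seconds to sample indices), here is a minimal sketch on a synthetic tone. The helper names `normalize_pcm` and `crop_seconds` are hypothetical and are not part of scipy or of the article's notebook. The sketch assumes signed integer PCM, where the factor is derived from the array's dtype; unsigned 8-bit audio is deliberately not handled.

```python
import numpy as np


def normalize_pcm(data):
    """Scale signed-integer PCM samples to floats in [-1, 1).

    Assumption: the integer dtype width equals the sample resolution
    (int16 -> divide by 2**15, int32 -> divide by 2**31).
    """
    if np.issubdtype(data.dtype, np.signedinteger):
        return data / float(2 ** (8 * data.dtype.itemsize - 1))
    if np.issubdtype(data.dtype, np.floating):
        return data.astype(np.float64)  # already floating point, leave as is
    raise TypeError("unsupported sample type: {}".format(data.dtype))


def crop_seconds(data, fs, start_s, end_s):
    """Return the samples between start_s and end_s (both in seconds)."""
    start = int(round(start_s * fs))  # seconds -> sample index
    end = int(round(end_s * fs))
    return data[start:end]


# synthetic 5-second, 440 Hz tone stored as 16-bit integers
fs = 8000
t = np.arange(0, 5 * fs) / fs
tone = (0.5 * (2 ** 15 - 1) * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

tone_norm = normalize_pcm(tone)
tone_crop = crop_seconds(tone_norm, fs, 2, 4)
print(tone_crop.shape[0] / fs)  # duration of the cropped part: 2.0
```

Deriving the factor from the dtype avoids hard-coding 2¹⁵, at the cost of assuming that the file reader returns a dtype whose width matches the scaling of the stored samples.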
    # Trim (segment) audio signal (2 seconds)
    data_wav_norm_crop = data_wav_norm[2 * fs_wav: 4 * fs_wav]
    # the time axis must have the same length as the cropped signal
    time_wav_crop = np.arange(0, len(data_wav_norm_crop)) / fs_wav
    plotly.offline.iplot({"data": [go.Scatter(x=time_wav_crop, y=data_wav_norm_crop, name='cropped audio signal')]})

Fix-sized segmentation

In the first part we showed how to segment a long recording into non-overlapping segments using ffmpeg. The following code sample shows how to do the same with Python. The list comprehension that builds `segments` does the actual segmentation in a single line. Overall, the script loads and normalizes an audio signal, then breaks it into 1-second segments and writes each one of them to a file. (Pay attention to the note in the last comment: the normalized samples are floating-point numbers, so you will need to cast back to 16 bits before saving if you want a standard 16-bit WAV file.)

    # Fix-sized segmentation (breaks a signal into non-overlapping segments)
    fs, signal = wavfile.read("data/obama.wav")
    signal = signal / (2**15)
    signal_len = len(signal)
    segment_size_t = 1                  # segment size in seconds
    segment_size = segment_size_t * fs  # segment size in samples
    # Break signal into a list of segments in a single line of Python code
    # (kept as a plain list, because the last segment may be shorter than the rest)
    segments = [signal[x:x + segment_size] for x in np.arange(0, signal_len, segment_size)]
    # Save each segment in a separate file
    for iS, s in enumerate(segments):
        # note: s holds floats in (-1, 1); to store 16-bit samples instead,
        # write (s * 2**15).astype(np.int16)
        wavfile.write("data/obama_segment_{0:d}_{1:d}.wav".format(segment_size_t * iS,
                                                                  segment_size_t * (iS + 1)), fs, s)

A simple algorithm to remove silent segments from a recording

The previous script has broken a recording into a list of 1-second segments. The code below implements a very simple silence removal method.
To this end, it computes the energy of each segment as the mean of its squared samples, then calculates a threshold as 50% of the median energy value, and finally keeps only the segments whose energy is above that threshold:

    import IPython.display
    # Remove pauses using an energy threshold = 50% of the median energy:
    # (attention: integer overflow would occur without normalization here!)
    energies = np.array([(s**2).sum() / len(s) for s in segments])
    thres = 0.5 * np.median(energies)
    index_of_segments_to_keep = np.where(energies > thres)[0]
    # get segments that have energies higher than the threshold:
    segments2 = [segments[i] for i in index_of_segments_to_keep]
    # concatenate segments to signal:
    new_signal = np.concatenate(segments2)
    # and write to file:
    wavfile.write("data/obama_processed.wav", fs, new_signal)
    plotly.offline.iplot({"data": [go.Scatter(y=energies, name="energy"),
                                   go.Scatter(y=np.ones(len(energies)) * thres, name="thres")]})
    # play the initial and the generated files in notebook:
    IPython.display.display(IPython.display.Audio("data/obama.wav"))
    IPython.display.display(IPython.display.Audio("data/obama_processed.wav"))

The energy / threshold plot is shown in the figure below (all segments whose energies are below the red line are removed from the processed recording).

[Figure: per-segment energy and the threshold line]

Also, note the last two lines of code (using the IPython.display.display() function), which add a clickable audio clip directly in the notebook for both the initial and the processed audio files, as the following screenshot shows:

[Figure: screenshot of the audio players embedded in the notebook]

[Audio: the original recording and the processed recording after silence removal]

Music analysis: a toy example on bpm (beats per minute) estimation

Music analysis is an application domain of signal processing and machine learning that focuses on analyzing musical signals, mostly for content-based retrieval and recommendation.
One of the major tasks in music analysis is to extract high-level attributes that describe a song, such as its musical genre and the underlying mood. Tempo is one of the most important attributes of a song. Tempo tracking is the task of automatically estimating a song's tempo (in bpm) directly from the signal. One of the basic implementations of tempo tracking is included in the librosa library.

The following toy example takes as input a mono audio file where a song is stored and produces a stereo file where the left channel is the initial song, while the right channel is an artificially generated periodic "beep" sound that "follows" the main tempo of the song:

    import numpy as np
    import scipy.io.wavfile as wavfile
    import librosa
    import IPython.display

    # load file and extract tempo and beats:
    [Fs, s] = wavfile.read('data/music_44100.wav')
    tempo, beats = librosa.beat.beat_track(y=s.astype('float'), sr=Fs, units="time")
    beats -= 0.05
    # add small 220Hz sounds on the 2nd channel of the song ON EACH BEAT
    s = s.reshape(-1, 1)
    s = np.array(np.concatenate((s, np.zeros(s.shape)), axis=1))
    for ib, b in enumerate(beats):
        t = np.arange(0, 0.2, 1.0 / Fs)
        # decaying amplitude envelope of the beep
        amp_mod = 0.2 / (np.sqrt(t) + 0.2) - 0.2
        amp_mod[amp_mod < 0] = 0
        x = s.max() * np.cos(2 * np.pi * t * 220) * amp_mod
        # keep the beep inside the signal (the first beat may be shifted below 0
        # and the last one may be too close to the end of the recording)
        start = max(int(Fs * b), 0)
        end = min(start + x.shape[0], s.shape[0])
        s[start:end, 1] = x[:end - start].astype('int16')
    # write a wav file where the 2nd channel has the estimated tempo:
    wavfile.write("data/music_44100_with_tempo.wav", Fs, np.int16(s))
    # play the generated file in notebook:
    IPython.display.display(IPython.display.Audio("data/music_44100_with_tempo.wav"))

The result of the script above is a WAV file where the left channel is the initial song and the right channel is the sequence of beep sounds at the estimated beat positions.
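To make the relation between beat positions and bpm more concrete, the following sketch computes a tempo value from a list of beat times and rebuilds the decaying 220 Hz beep used above. It relies only on numpy. The helper names `bpm_from_beats` and `beep` are hypothetical, the beat times are synthetic, and the median-interval rule is a simple illustration, not the method librosa uses internally.

```python
import numpy as np


def bpm_from_beats(beat_times):
    """Estimate tempo (bpm) as 60 / median inter-beat interval (in seconds)."""
    intervals = np.diff(np.sort(np.asarray(beat_times, dtype=float)))
    if len(intervals) == 0:
        raise ValueError("need at least two beat times")
    return 60.0 / np.median(intervals)


def beep(fs, freq=220.0, duration=0.2):
    """Short cosine burst with the same decaying envelope as the script above."""
    t = np.arange(0, duration, 1.0 / fs)
    envelope = np.clip(0.2 / (np.sqrt(t) + 0.2) - 0.2, 0, None)
    return np.cos(2 * np.pi * freq * t) * envelope


# synthetic beats: one every 0.5 seconds, i.e. two beats per second
beat_times = np.arange(0.5, 10, 0.5)
print(bpm_from_beats(beat_times))  # 120.0

click = beep(8000)  # unit-amplitude beep, to be scaled and added on each beat
```

Using the median of the intervals rather than the mean keeps the estimate stable when a few beats are missed or detected twice.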
[Audio: two examples of generated sounds, for two different initial songs]

Real-time recording and frequency analysis

All of the code samples presented above have mainly focused on reading audio data from files and performing some very basic processing on it, such as trimming or segmentation into fix-sized windows, and then either plotting or saving the processed sounds to files. The following code goes one step further in two ways: (a) by showing how sound can be captured by a microphone in a way that allows real-time and online processing, and (b) by introducing the frequency domain representation of a sound.

Our goal here is to create a simple Python script that captures sound segment by segment and, for each segment, plots its frequency distribution in the terminal. Real-time audio capturing is achieved through the pyaudio library. Audio samples are captured in small segments (say, 200 ms long). Then, for each segment, the code presented below computes a basic frequency representation by running the following steps:

1. Compute the magnitude X of the Fast Fourier Transform (FFT) of the recorded segment. Also, keep the frequency values (in Hz) in a separate array, say freqs. Then, to put it simply, according to the DFT definition, X(i) measures how much of the audio signal's energy is concentrated at frequency freqs(i) Hz.
2. Downsample X and freqs, so that we keep far fewer frequency coefficients to visualize.
3. Calculate the total energy of the segment (not just the energy at particular frequency bins, as described in step 1). This is done just to normalize against the maximum width of the frequency visualization.
4. Plot the downsampled frequency energies X for all (downsampled as well) frequencies using a simple bar plot.

These four steps are implemented in the following script. The code is also available as part of the paura library (https://github.com/tyiannak/paura/blob/master/paura_lite.py).
See the inline comments for a more detailed explanation:

    # paura_lite:
    # An ultra-simple command-line audio recorder with real-time
    # spectrogram visualization
    import numpy as np
    import pyaudio
    import struct
    import scipy.fftpack as scp
    import termplotlib as tpl
    import os

    # get window's dimensions
    rows, columns = os.popen('stty size', 'r').read().split()

    buff_size = 0.2          # window size in seconds
    wanted_num_of_bins = 40  # number of frequency bins to display

    # initialize soundcard for recording:
    fs = 8000
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=fs,
                     input=True, frames_per_buffer=int(fs * buff_size))

    while 1:  # for each recorded window (until ctrl+c is pressed)
        # get current block and convert to list of short ints,
        block = stream.read(int(fs * buff_size))
        fmt = "%dh" % (len(block) // 2)
        shorts = struct.unpack(fmt, block)

        # then normalize and convert to numpy array:
        x = np.double(list(shorts)) / (2**15)
        seg_len = len(x)

        # get total energy of the current window and compute a normalization
        # factor (to be used for visualizing the maximum spectrogram value)
        energy = np.mean(x ** 2)
        max_energy = 0.02  # energy for which the bars are set to max
        max_width_from_energy = int((energy / max_energy) * int(columns)) + 1
        if max_width_from_energy > int(columns) - 10:
            max_width_from_energy = int(columns) - 10

        # get the magnitude of the FFT and the corresponding frequencies
        X = np.abs(scp.fft(x))[0:int(seg_len / 2)]
        freqs = (np.arange(0, 1 + 1.0 / len(X), 1.0 / len(X)) * fs / 2)

        # ... and resample to a fixed number of frequency bins (to visualize)
        wanted_step = (int(freqs.shape[0] / wanted_num_of_bins))
        freqs2 = freqs[0::wanted_step].astype('int')
        X2 = np.mean(X.reshape(-1, wanted_step), axis=1)

        # plot (freqs, fft) as horizontal histogram:
        fig = tpl.figure()
        fig.barh(X2, labels=[str(int(f)) + " Hz" for f in freqs2[0:-1]],
                 show_vals=False, max_width=max_width_from_energy)
        fig.show()
        # add exactly as many new lines as are needed to
        # clear the screen in the next iteration:
        print("\n" * (int(rows) - freqs2.shape[0] - 1))

And this is an execution example of the script:

[Figure: terminal screenshot of the real-time spectrum visualization]

All code examples presented in Part II are available as a Jupyter notebook in this GitHub repo: https://github.com/tyiannak/basic_audio_handling. The last example (the real-time command-line spectrum analyzer) is available at https://github.com/tyiannak/paura/blob/master/paura_lite.py

About the author (tyiannak.github.io)

Thodoris is currently the Director of ML at behavioralsignals.com, where his work focuses on building algorithms that recognise emotions and behaviors based on audio information. He also teaches multimodal information processing in a Data Science and AI master program in Athens, Greece.