How to use machine learning to color your room lighting, based on the emotions behind the music you are listening to (Python code available here)
Emotion is a very important attribute of audio signals, whether they carry speech or music. In the case of speech, it can be used to enrich speech analytics with emotional and behavioral content. For music, it enables a more representative and detailed description of the musical content, so that song retrieval and recommendation applications can be emotion-aware. Recognizing emotions directly from the audio information has therefore gained increasing interest in recent years. In the demo described in this article I show how emotions in musical signals can be estimated using Python and simple ML models, and then how, for each music signal, the estimated emotions can be mapped to colors that drive the lighting of the room where the music is played. Changing the room's lighting obviously requires a smart lighting device: in this demo we use the Yeelight color bulb.
Let’s start with a quick description of how we solve the “music to emotions” task.
We first need to train a machine learning model that maps audio features to emotional classes. For this demo, a set of 5K music samples (songs) has been compiled. Emotional ground truth has been gathered through the Spotify API, in particular using Spotipy (a Python library for the Spotify Web API). The Spotify API provides access to several song metadata fields, including the emotional valence and energy.
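To give a sense of how such ground truth can be collected, here is a minimal Spotipy sketch (the client credentials and track ID are placeholders, and this is not necessarily the exact script used to build the dataset):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholder credentials: register an app at developer.spotify.com to get real ones
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

# audio_features() returns, among other fields, the valence and energy of a track
features = sp.audio_features(["spotify:track:TRACK_ID"])[0]
print(features["valence"], features["energy"])
```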
Valence and energy (the term "arousal" is more common in the literature, especially for speech emotion recognition, but here we use the term energy to stay aligned with the Spotify attributes) are dimensional representations of emotional states, initially adopted by psychologists as an alternative to discrete emotions (sadness, happiness, etc.), and they have become common in emotion recognition. Valence relates to the positivity (or negativity) of the emotional state, while energy (or arousal) is the strength of the respective emotion.
The Spotify valence and energy values are continuous in the (0, 1) range (valence ranging from negative to positive and energy from weak to strong). In this demo, we have chosen to quantize the valence and energy values into three discrete classes each: negative, neutral and positive for valence, and weak, normal and strong for energy. Towards this end, different thresholds have been used to maintain relative class balance for both classification tasks.
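As an illustration, the quantization can look like the sketch below; the threshold values here are an assumption, since in the demo they were tuned to keep the three classes relatively balanced:

```python
def quantize(value, low_thr, high_thr, labels):
    """Map a continuous (0, 1) value to one of three classes.
    The thresholds are illustrative; in practice they are chosen
    so that the three classes remain relatively balanced."""
    if value < low_thr:
        return labels[0]
    elif value < high_thr:
        return labels[1]
    return labels[2]

valence_class = quantize(0.72, 0.35, 0.65, ["negative", "neutral", "positive"])
energy_class = quantize(0.18, 0.35, 0.65, ["weak", "normal", "strong"])
print(valence_class, energy_class)   # positive weak
```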
In addition, as shown in the experiments below, the initial audio signals have been downsampled to 8K, 16K and 32K. As we show later, 8K has been selected because higher sampling frequencies do not significantly improve performance. Also, from each song, a fixed-size segment has been used to extract features and train the respective models. The segment size was varied in the experiments presented below, but for this particular demo a default segment size of 5 seconds has been selected, because this is a reasonable window upon which to make music emotion predictions: longer segments would mean the "user" has to wait longer for emotional predictions to be returned, while, as we will see below, shorter segments lead to higher emotion recognition error rates.
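A minimal sketch of this preprocessing step (downsampling to 8 kHz and keeping a fixed 5-second segment), assuming a WAV file as input and using SciPy; the repo itself may read and crop segments differently:

```python
from scipy.io import wavfile
from scipy.signal import resample_poly

fs, x = wavfile.read("song.wav")          # placeholder file: original rate and samples
if x.ndim > 1:                            # convert stereo to mono
    x = x.mean(axis=1)

target_fs = 8000
x_8k = resample_poly(x, target_fs, fs)    # downsample to 8 kHz
segment = x_8k[:5 * target_fs]            # keep a fixed 5-second segment
```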
To sum up, our dataset consists of 5K music segments (from 5K songs), organized into 3 emotional energy classes (low, normal, high) and 3 valence classes (negative, neutral, positive), taken at various sampling frequencies and segment sizes. The final parameters selected for this demo are an 8K sampling frequency and a 5-second segment size. These samples are then used to train two models (one for energy and one for valence).
The Python library pyAudioAnalysis has been used to extract hand-crafted audio features and to train the respective segment-level audio classifiers. A set of almost 130 statistics, computed over short-term time-domain and frequency-domain audio features, is extracted per segment, so each audio segment is represented by a roughly 130-dimensional feature vector. For classification we have selected SVMs with RBF kernels (spectrograms and CNNs could yield better performance, but they would require more training data and higher computational demands, both at training time and at run time of the final demo).
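The feature extraction step can be sketched with pyAudioAnalysis roughly as follows (the file name and window parameters are illustrative, not necessarily those used in the repo):

```python
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import MidTermFeatures as aF

fs, x = audioBasicIO.read_audio_file("segment.wav")   # placeholder 5-sec segment
x = audioBasicIO.stereo_to_mono(x)

# mid-term window = whole 5-sec segment, short-term window/step = 50 ms
mt_feats, st_feats, feat_names = aF.mid_feature_extraction(
    x, fs, 5.0 * fs, 5.0 * fs, 0.05 * fs, 0.05 * fs)

# one column of segment-level statistics (~130 values) per mid-term window
print(mt_feats.shape)
```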
How accurate can the two classifiers be for the selected parameters (5-second segments, 8KHz sampling)? During training and parameter tuning, a typical cross-validation procedure has been followed: the data has been repeatedly split into train and test subsets, and performance measures have been evaluated over the random train/test splits. The overall confusion matrices for both classification tasks are shown below. Rows represent ground-truth classes and columns predictions; e.g. 5.69% of the data has been misclassified as "neutral" while its true energy label was "low". As expected, emotional energy is an "easier" classification task, with an overall F1 close to 70%, while the F1 for valence is around 55%. In both cases, the probability of an "extreme" error (e.g. misclassifying high energy as low) is very low.
Confusion matrix for the energy classifier (F1 = 67%)
Confusion matrix for the valence classifier (F1 = 55%)
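For reference, the repeated train/test evaluation described above can be sketched with scikit-learn as follows; this is not the exact pyAudioAnalysis training pipeline of the repo, and the feature/label files are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score, confusion_matrix

# X: (n_segments, ~130) feature matrix, y: energy (or valence) class labels
X, y = np.load("features.npy"), np.load("labels.npy")   # placeholder files

scores = []
for seed in range(10):                                   # repeated random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_tr), y_tr)
    y_pred = clf.predict(scaler.transform(X_te))
    scores.append(f1_score(y_te, y_pred, average="macro"))

print("mean F1:", np.mean(scores))
print(confusion_matrix(y_te, y_pred))   # confusion matrix of the last split
```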
Does signal quality (sampling rate) matter? The obvious guess is that higher sampling frequencies would lead to better overall performance. However, experimental results with 8K, 16K and 32K sampling frequencies have shown that the average performance (over many segment lengths, from 5 to 60 seconds) does not change significantly. In particular, compared to the 8K signal quality, overall performance is just 0.5% better for 16K and 2% better for 32K. This justifies using 8K for this particular demo, as the aforementioned performance boost does not justify the computational cost (32K signals require 4 times more feature extraction computations than 8K signals). So the short answer is that the sampling frequency does not significantly affect the overall prediction accuracy.
How does segment length affect performance? Obviously, the longer the segment, the more information the classifier has to estimate music emotion. But how does segment length affect the overall F1 measure? The results in the following figure show that the overall F1 for energy can even exceed 70% for 30-second segments, while valence seems to asymptotically approach 60%. Of course, the segment length cannot grow arbitrarily; it is limited by the application: in our case, where we want to automatically control the color bulbs based on the estimated emotion, a reasonable delay would be less than 10 seconds (as explained before, 5 seconds is the selected segment length).
So we have shown above that we can build two models for predicting emotional energy and valence in 5-second music segments, with accuracies around 70% and 55% respectively. How can these two values (energy and valence) be used to change the color of the lighting in the room where the music is played?
Studies have shown that there is a strong association between colors and emotions derived from musical content (e.g. this publication). In general, emotion has been associated with colors in various contexts, such as human-computer interaction, psychology and marketing. In this demo, we simply follow some common assumptions, such as that yellow usually represents joy, blue sadness, red anger/alert and green calmness. However, this mapping is not the result of a deep dive into the psychological factors behind emotional content and color representation. For that reason, the demo presented in this article is fully configurable with regard to the emotion-to-color mappings (i.e. the user/developer can change the mapping of emotions to colors very easily).
To obtain a continuous 2D valence-energy plane where each point is mapped to a single color, we start from a discrete mapping of reference points to colors based on the aforementioned assumptions. As an example, anger, i.e. the point (-0.5, 0.5) in the valence-energy plane, is mapped to red, happiness (the point (0.5, 0.5)) to yellow, and so on. Every intermediate point is assigned a linear combination of the basic reference colors, weighted by its Euclidean distance from the reference points. The result is the continuous 2D valence-energy color plane shown below. The horizontal axis spans the negative to positive range (valence), while the vertical axis corresponds to emotional energy (weak to strong). This particular colormap is the default mapping of (valence, energy) value pairs to colors, based on the set of simple configurable rules described above. (Note: in the Python repo of this article, this can be changed by editing the "emo_map" dictionary in main.py.)
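One way to implement such a distance-weighted blend is sketched below; the reference points and RGB values are illustrative, and the actual "emo_map" dictionary in main.py may be structured differently:

```python
import numpy as np

# Illustrative reference mapping; the real "emo_map" in main.py may differ.
emo_map = {
    (-0.5,  0.5): (255,   0,   0),   # anger -> red
    ( 0.5,  0.5): (255, 255,   0),   # happiness -> yellow
    (-0.5, -0.5): (  0,   0, 255),   # sadness -> blue
    ( 0.5, -0.5): (  0, 255,   0),   # calmness -> green
}

def emotion_to_color(valence, energy):
    """Blend the reference colors, weighted by inverse Euclidean distance."""
    point = np.array([valence, energy])
    weights, colors = [], []
    for ref, rgb in emo_map.items():
        d = np.linalg.norm(point - np.array(ref))
        if d < 1e-6:
            return rgb                       # exactly on a reference point
        weights.append(1.0 / d)
        colors.append(np.array(rgb, dtype=float))
    weights = np.array(weights) / sum(weights)
    blended = sum(w * c for w, c in zip(weights, colors))
    return tuple(int(round(v)) for v in blended)

print(emotion_to_color(0.2, 0.4))   # a warm color between yellow and red
```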
The presented demo functions as follows: for each recorded 5-second window, the (valence, energy) values are predicted using the models described above. The predicted point is then mapped to the respective color and visualized in a simple plot on the computer where the demo is running. If the user provides one or more IPs of Yeelight bulbs, the Yeelight Python library (https://yeelight.readthedocs.io/en/latest/) is used to set the colors of the bulbs based on the aforementioned predictions.
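Setting a bulb's color with the yeelight library boils down to something like the following sketch (the IP address is a placeholder, and emotion_to_color refers to the blending sketch above):

```python
from yeelight import Bulb

bulb = Bulb("192.168.1.100")            # placeholder IP of a Yeelight bulb on the LAN
bulb.turn_on()

r, g, b = emotion_to_color(0.2, 0.4)    # color from the predicted (valence, energy)
bulb.set_rgb(r, g, b)                   # push the color to the bulb
```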
The demo described in this article is available at this github repo: https://github.com/tyiannak/color_your_music_mood. It mostly uses pyAudioAnalysis for feature extraction and classifier training/testing. It also includes pretrained energy and valence models for music signals.
I am the Director of ML at behavioralsignals.com, where my work focuses on building algorithms that recognise emotions and behaviors based on audio information. I also teach multimodal information processing in a Data Science and AI master's program in Athens, Greece.