Introduction
Music surrounds us every day. It lifts our mood, motivates us, and can even enhance cognitive performance. Fast rhythmic beats boost our energy, while slower, more melodic tunes help us relax. Creating the perfect playlist, however, is a personal and time-intensive task. Major streaming platforms still rely on human experts to curate playlists, which makes this an obvious place to add some data-driven, machine-assisted automation.
As someone who listens to more than 2 hours of music daily across various genres, produces music, and creates playlists for personal use, this topic resonates deeply with me. This project allowed me to combine my passion for music with my professional expertise in data analytics while exploring practical applications of machine learning algorithms. Plus, I was inspired by Spotify’s engineering article — being the leader in music streaming, they set the benchmark for innovation in this space and provided us with high-quality data.
Problem Statement
The goal of this project was to explore clustering techniques to automatically group songs into meaningful playlists. Can machine learning help us create playlists that feel as curated as those by human experts? Or, at the very least, can it ensure that death metal and classical suites are not on the same list?
To tackle this, I used a dataset of ~5,000 songs with Spotify audio features, such as danceability, tempo, and energy. Additionally, I incorporated Last.fm API data to retrieve genre information, enabling external validation of clustering quality.
Our task:
- Cluster songs into cohesive playlists based on audio features.
- Validate these clusters against Last.fm genres to ensure quality.
Initial Look at the Data
Before diving into machine learning, I visualized the dataset using basic features. Below is a scatter plot of songs based on tempo and danceability:
Even without advanced techniques, we can already spot some patterns. For instance, Brahms’ Lullaby is easily identifiable in the low-tempo, low-danceability region, while Eminem’s The Real Slim Shady appears in the mid-tempo, high-danceability area.
However, leveraging all available features will allow us to uncover richer patterns and automate playlist creation. Here’s the complete list of features:
Unsupervised Clustering
Now, our goal is to perform some initial clustering. The idea here is to apply pure unsupervised machine learning — we don’t know the true cluster labels and aim to determine the optimal number of clusters using practical methods. To achieve this, we’ll go through some basic steps. In the end, we’ll take a look at the clustering results and try to understand the emerging patterns.
Scaling
For most of the algorithms and methods we’ll apply later, scaled data is essential. These algorithms rely on calculations (such as distances or variances) that are sensitive to the scale of features. Scaling brings all features to a common range so that each one contributes equally to the clustering process. We’ll experiment with several different scalers.
Below is an example of projecting our data onto two features: duration_ms and tempo. The left plot shows the original data; the right shows the result after applying two scalers — PowerTransformer and MinMaxScaler:
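The scaling step above can be sketched as follows. The data here is synthetic stand-in values for duration_ms and tempo (the real dataset is not reproduced), but the scaler chaining mirrors the approach described:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# Synthetic stand-ins for two audio features (values are illustrative only)
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(210_000, 60_000, 500),  # duration_ms
    rng.normal(120, 28, 500),          # tempo (BPM)
])

# PowerTransformer reshapes skewed features toward a Gaussian,
# then MinMaxScaler squeezes every column into the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(PowerTransformer().fit_transform(X))

print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # each column spans [0, 1]
```

After this transformation, duration_ms (hundreds of thousands of milliseconds) no longer dwarfs tempo (roughly 60–200 BPM) in distance calculations.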
Dimensionality Reduction With PCA
The next step is Principal Component Analysis (PCA), which reduces the number of features while keeping the most important information. This makes visualization easier and speeds up clustering.
Figuratively speaking, PCA finds projections — like shining a flashlight on a 3D object to create a 2D shadow. The flashlight is positioned so the shadow (projection) retains as much detail as possible about the original object.
We’ll try different numbers of components and compare the results. In our case, we’ll test 10 options. Here’s an example of projecting songs onto two principal components:
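A minimal sketch of this step, using random data in place of the real scaled feature matrix (the 12-feature shape is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 500 songs with 12 scaled audio features
rng = np.random.default_rng(0)
X = rng.random((500, 12))

# Project onto two principal components for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                           # (500, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance kept
```

The `explained_variance_ratio_` attribute is useful when comparing different numbers of components: it tells you how much of the original information each projection retains.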
K-Means Clustering
K-Means is a popular clustering algorithm that groups data into a predefined number of clusters (k). But how do we decide on the optimal k?
Sometimes, we have business constraints. Our dataset includes around 5,000 songs, and playlists on streaming services typically contain between 20 and 250 songs. Dividing 5,000 songs into playlists of at most 250 tracks means I need at least 20 clusters for my specific use case.
The Silhouette Score helps evaluate the quality of clustering by measuring how well each data point fits within its assigned cluster compared to others. A higher silhouette score (closer to 1) means that clusters are well-defined and clearly separated.
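Putting K-Means and the silhouette score together looks roughly like this. The three well-separated synthetic blobs are a toy substitute for song feature vectors, chosen so the score lands near 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated blobs as a stand-in for song feature vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
score = silhouette_score(X, km.labels_)
print(round(score, 2))  # high score: clusters are compact and well separated
```

On real, messier audio-feature data the score is typically much lower, which is exactly why we compare many configurations rather than trusting a single run.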
Best Combination of Parameters
With so many degrees of freedom, how can we find the best combination of parameters? The answer is simulation. I can try different combinations — scalers, numbers of PCA components, and numbers of clusters — and select the one that performs best. In my case, there were 935 possible candidates. Let’s visualize the results.
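The simulation loop can be sketched as follows. The scalers, component counts, and cluster counts below are a small illustrative subset, not the full 935-candidate grid, and the data is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))  # stand-in for the song feature matrix

results = []
for scaler in (MinMaxScaler(), StandardScaler(), PowerTransformer()):
    X_s = scaler.fit_transform(X)
    for n_comp in (2, 4, 6):                # PCA component counts to try
        X_p = PCA(n_components=n_comp).fit_transform(X_s)
        for k in (4, 6, 8):                 # cluster counts to try
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_p)
            results.append({
                "scaler": type(scaler).__name__,
                "pca": n_comp,
                "k": k,
                "silhouette": silhouette_score(X_p, labels),
            })

best = max(results, key=lambda r: r["silhouette"])
print(best)
```

Each candidate is scored on the same metric, so picking a winner is a single `max` over the results list.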
This plot shows silhouette scores (x-axis) for different combinations of PCA components (represented by circle size), scalers (color), and number of clusters (y-axis). To simplify the choice, we should focus on the options in the top-right area, like this:
The first meaningful combination with the highest silhouette score uses the MinMax scaler, 2 principal components, and 6 clusters. While 6 clusters don’t meet our business requirement of having more than 20 clusters, it’s still worth examining more closely.
Initial Clustering Result
As mentioned earlier, I incorporated Last.fm data with genre information for our artists. This is the perfect moment to apply it to our dataset as a useful validation criterion. I selected the 5 most popular genres of this dataset, with the rest grouped under “other.”
Returning to our top-performing combination, we’re now ready to compare the resulting clusters using the genre labels. Below is a visualization showing the percentage share of genres by cluster, along with the number of songs in each cluster.
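Computing such a genre-share table is a one-liner with a pandas crosstab. The cluster assignments and genre labels below are hypothetical, just to show the shape of the calculation:

```python
import pandas as pd

# Hypothetical cluster assignments and Last.fm genre labels
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    "genre":   ["classical", "classical", "other",
                "death metal", "death metal", "rock", "other",
                "hip-hop", "hip-hop", "rock"],
})

# Percentage share of each genre within each cluster (rows sum to 100)
share = pd.crosstab(df["cluster"], df["genre"], normalize="index") * 100
sizes = df["cluster"].value_counts().sort_index()

print(share.round(1))
print(sizes)
```

`normalize="index"` is what turns raw counts into per-cluster percentages, matching the visualization described above.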
If you recall, I mentioned in the problem statement that one of my goals was to distinguish between death metal and classical music — honestly, that was a random example. But surprisingly, this criterion works quite well for this particular clustering setup. Take a closer look: clusters 0 and 4 are composed of about half classical music, with almost no death metal, while clusters 2 and 3 are the opposite, heavily representing death metal. A promising start!
That said, we still observe a mix of genres in many clusters, some of which aren’t musically compatible. This suggests it’s worth exploring other techniques and evaluation metrics.
Ultimately, the goal is to find a clustering solution that is both statistically robust and practically meaningful for your specific use case.
Clustering With External Validation
In this section, we’ll use genre information more extensively. This data acts as our “ideal labels” for validating clustering quality. Of course, it’s not a perfect criterion or the most sophisticated approach for playlist creation — but for me personally, as a music lover, it works well. For example, I often prefer listening to genre-specific music: hip-hop for motivation, rock at parties, or classical while working. That means I don’t want these playlists mixed together.
Let’s try to answer the following question: can we separate music by genre using only features like danceability, energy, and tempo?
Normalized Mutual Information (NMI)
Normalized Mutual Information (NMI) measures the similarity between two clusterings. It ranges from 0 to 1, where:
- 0 means no overlap between cluster assignments and true labels.
- 1 means a perfect match with the ground truth.
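Both extremes are easy to demonstrate with scikit-learn. The genre labels below are a tiny made-up example:

```python
from sklearn.metrics import normalized_mutual_info_score

genres   = ["classical", "classical", "metal", "metal", "hiphop", "hiphop"]
aligned  = [0, 0, 1, 1, 2, 2]  # clusters that perfectly mirror the genres
shuffled = [0, 1, 0, 1, 0, 1]  # clusters with no relation to the genres

print(normalized_mutual_info_score(genres, aligned))   # 1.0: perfect match
print(normalized_mutual_info_score(genres, shuffled))  # 0.0: no overlap
```

Note that NMI is label-agnostic: it doesn’t matter which cluster ID maps to which genre, only how consistently the groupings agree.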
Before jumping into finding the best clustering setup using NMI, let’s evaluate it on a familiar configuration — the one we identified earlier: MinMaxScaler, 2 PCA components, and 6 clusters. Now, we’ll fix the scaler for simplicity and focus on how the number of PCA components affects the NMI score.
We see that the NMI score for n=6 is significantly higher than for n=2 (0.29 vs. 0.18). Let’s visualize the resulting cluster distributions:
This is already a big improvement. We now see death metal nicely isolated into two clusters and all classical music grouped into one. Promising!
For reference, below is a comparison of NMI scores across various clustering configurations:
The highest NMI score indicates the best genre-focused clustering. This result is expected: with validation data, we can significantly improve our music clustering. The main point here, however, was to validate our K-Means clustering and better understand label quality under the same parameters.
Conclusion
In this project, I explored how machine learning can help group music tracks into meaningful clusters using only audio features. Starting from basic unsupervised learning techniques like scaling, PCA, and K-Means, I validated results using real-world genre data from Last.fm.
While silhouette scores helped optimize internal consistency, external validation through genre alignment offered deeper insight into the practical quality of the clusters. Encouragingly, even simple methods could successfully separate genres like classical and death metal — highlighting the power of audio features.
This is just a starting point. Future improvements would include exploring advanced techniques like UMAP and HDBSCAN to enhance clustering and using SHAP for better interpretability. With these tools, we can move closer to building playlist systems that are both intelligent and musically meaningful.
Ultimately, machine learning can definitely complement human curation. For now, it offers a scalable way to organize and explore massive music libraries.