
Using Machine Learning to Predict Song Genres from Spotify

Am I crazy or do all modern songs sound the same these days?


Recently there was an AI bot by The Pudding that went viral for roasting how basic people’s Spotify accounts are (you can try it out here). On one hand, the bot was pretty cool and showed how powerful AI can be. On the other hand, I was personally offended by it calling me old for liking alternative rock and r&b from the early 2000s.


Look, it’s not that I have anything against more modern music; I just don’t understand what style it’s going for most of the time. I tried putting on a more recent r&b playlist (trying to impress the AI, of course) and discovered a recent song called “Lurkin” by Chris Brown and Tory Lanez. Chris Brown is mostly an r&b singer, Lanez is a rapper, yet “Lurkin”, with its catchy hooks, seems clearly aimed at being a mainstream pop song. It only got more confusing when I looked the song up in the Spotify API and saw it classified as “Latin hip hop”. To summarize: we have a rapper and an r&b singer collaborating on a song that could be considered pop or rap, yet I first heard it on an r&b playlist. You got all that?


This got me wondering whether we could use machine learning to train a model to distinguish rap, r&b, and pop using just the attributes of how a song sounds.


Background


First, I would like to note that I am going to try to keep this blog as short as possible, but if you would like to see a more technical writeup complete with code snippets, then I posted that on my RPubs page.


The dataset I used contained data from Spotify about 20,000 distinct songs across 6 different genres. While the songs dated back as far as the 1960s, 79% of them were from 2010 or later. The data contained measures of 12 different musical attributes (such as acousticness, loudness, and valence) that I used for modeling, plus a measure of each song's popularity. While I did not use popularity for modeling, I thought it would be neat to look at the distribution of song popularity by genre.

Nice to see that EDM is as polarizing as I hoped it would be. Pop is the most popular genre, with rap, rock, and Latin all fairly close behind.


For purposes of modeling, I only wanted to look at the 3 genres I felt sounded most similar: rap, pop, and r&b. Here's a look at their popularity by decade. Note how pop's popularity has stayed flat, while r&b and rap saw huge resurgences in the 2010s.


Model Accuracy


I attempted two different types of machine learning classification models. The first is called k-nearest neighbors (KNN) and the other is a random forest (RF). KNN is like polling your neighbors and then classifying you to be most like them (if the house on the left is yellow, across the street is yellow, and on the right is green, then KNN would classify your house as yellow). A random forest classifies data by combining the votes of many decision trees, each of which follows a path of increasingly specific rules about the underlying data.
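The full analysis lives in R on my RPubs page, but here's a minimal sketch in Python of what fitting both model types looks like with scikit-learn. The feature matrix below is synthetic stand-in data (random numbers, not real Spotify attributes), so the accuracy numbers it prints are meaningless; it just shows the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# Stand-in for the real dataset: 300 "songs", 12 audio attributes, 3 genres
X = rng.normal(size=(300, 12))
y = rng.choice(["rap", "pop", "r&b"], size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# KNN: classify each song by a vote among its 15 most similar songs
knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

# RF: 500 decision trees, each learning its own set of splitting rules
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)

print("KNN accuracy:", knn.score(X_test, y_test))
print("RF accuracy:", rf.score(X_test, y_test))
```

With real audio features in place of the random matrix, these two `score` calls are all it takes to reproduce the in- and out-of-sample accuracy comparisons below.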

For a classification problem like this one with fairly balanced classes, accuracy is the better metric to use (accuracy is a pure measure of true positives and true negatives; AUC is better for weighting false positives and negatives). The RF model performed at about 63.3% accuracy on the training data compared to about 53.3% for the KNN model. I think this gap kind of makes sense given the original premise that pop, r&b, and rap have a lot of crossover in how they sound. If KNN misclassifies one song, then by definition there's a very good chance it will also misclassify every other song just like it (hence the name "nearest neighbors"). RF does a better job establishing specific rules for classification. By comparison, a random guess would get it right 33.3% of the time, so the RF model nearly doubling that is still good.
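The "nearly doubling" claim is easy to sanity-check with a few lines of arithmetic. The accuracy figures here come straight from the post; nothing is fit to real data.

```python
# Random guessing among 3 balanced classes (rap, pop, r&b) gets 1-in-3 right
baseline = 1 / 3
rf_acc, knn_acc = 0.633, 0.533  # training accuracies reported above

print(f"baseline: {baseline:.1%}")                           # 33.3%
print(f"RF lift over baseline: {rf_acc / baseline:.2f}x")    # ~1.90x
print(f"KNN lift over baseline: {knn_acc / baseline:.2f}x")  # ~1.60x
```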


It's worth noting that the model predicted 65% accurately on out-of-sample data. Typically we would expect accuracy on the training data to be slightly higher than on the testing data (the model's parameters are "biased" toward the training data). This likely indicates a fair amount of variance in the model's predictions. In other words, there are probably a fair number of songs the model classified correctly on the testing data even though it wasn't particularly confident in the choice. Again, another point for the idea that some of these songs sound too much alike.


How did the model do in predicting each of the 3 genres individually? Let's check the mosaic plot.


Wow! It turns out the model is actually quite accurate at distinguishing pop (77%) and rap (74%), but only a little better than random at distinguishing r&b (43%). That means my hunch about pop and rap blending together was wrong, but I was right about it happening to r&b. The model validates that r&b tends to sound too much like either a rap or a pop song.
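Those per-genre numbers are what you get by reading a confusion matrix row by row: for each true genre, what share of its songs did the model label correctly? Here's a small illustration with made-up labels and predictions (not the real model's output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["pop", "rap", "r&b"]

# Toy example: 8 songs with their true genres and the model's guesses
y_true = np.array(["pop", "pop", "rap", "rap", "r&b", "r&b", "r&b", "pop"])
y_pred = np.array(["pop", "pop", "rap", "pop", "rap", "r&b", "pop", "pop"])

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Each row is a true genre; the diagonal entry divided by the row total
# is that genre's individual accuracy (its recall)
per_genre = cm.diagonal() / cm.sum(axis=1)
for genre, acc in zip(labels, per_genre):
    print(f"{genre}: {acc:.0%}")
```

A mosaic plot is essentially this same row-normalized matrix drawn as proportional rectangles.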


Benchmarking


I wanted a benchmark for evaluating my initial question of whether pop, rap, and r&b really are that similar. So I created a separate RF model classifying rap against two of the other genres in the original dataset: rock and edm, which are very distinct from each other in terms of sound. I hypothesized that the accuracy of the rock/edm model would be much higher than the 63.3% of the pop/r&b model, even when holding all of the rap songs constant across both models.

As I suspected, the computer has a much easier time telling apart rap, rock, and edm than it does rap, pop, and r&b. The rock/edm model was accurate on 80.6% of in-sample songs, over 17 percentage points better than the pop/r&b model. So it's clearly not just my old-man ears that struggle to tell rap, pop, and r&b apart.


Finally, let's see a variable importance plot to see what features were most predictive.

The model found speechiness and loudness (the traits most associated with rap) to be the most important features, followed by danceability and tempo (pop). Duration also showed some predictive power, ironically as the trait most associated with predicting the otherwise difficult r&b (r&b songs averaged about 10 seconds longer than rap and 15 seconds longer than pop in the training data).
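For anyone curious how a plot like that is built: a fitted random forest exposes its importance scores directly. This sketch uses the feature names mentioned above but synthetic data, so the printed ranking is arbitrary; on the real dataset the same two lines at the end produce the ranking discussed in the post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = ["speechiness", "loudness", "danceability", "tempo", "duration_ms"]

# Synthetic stand-in data: 200 "songs" with 5 of the audio attributes
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=features)
y = rng.choice(["rap", "pop", "r&b"], size=200)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ sums to 1; sorting it gives the importance ranking
importances = pd.Series(rf.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```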


But perhaps most importantly, now that we have a model, we can circle back to my original question: what the heck genre is "Lurkin" supposed to be?

Turns out that no matter how much more powerful the computer is than I am, it still barely has any idea how to classify this damn song!
