Spotify Playlist Analysis with Python

Objective. The goal of this project is to implement the k-means clustering algorithm, a technique often used in machine learning, in Python and use it for data analysis. We write various functions using lists, sets, dictionaries, sorting, and graph data structures for computational problem-solving and analysis.

Part 1. Spotify API Data

Spotify is a popular audio streaming platform with an extensive music database. The Spotify API allows developers to access the platform’s data, providing global insights into music listening habits^[1]. Using the API requires an initial setup: we must register as a Spotify developer, create an app, modify the dashboard redirect URI, and store the client ID and secret. After completing these steps, we have access to the Spotify API and all its features.

Get Playlist Data from API

First, we create a Client Credentials Flow manager, used for server-to-server authentication, by passing the necessary parameters to the SpotifyClientCredentials class of the spotipy package^[2]. We provide a client ID and client secret to the constructor. This authorization flow does not require user interaction.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set client id and client secret
client_id = 'xxx'
client_secret = 'xxx'

# Spotify authentication
cc_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager = cc_manager)

Now we can get the full details of the tracks of a playlist based on a playlist ID, URI, or URL. Choose a specific playlist to analyze by copying the URL from the Spotify Player interface. Using that link, the following code uses the playlist_tracks method to retrieve a list of IDs and corresponding artists for each track from the playlist.

# playlist_link holds the URL copied from the Spotify player
playlist_URI = playlist_link.split("/")[-1].split("?")[0]
track_ids, artist_ids = [], []
# Iterate over list of playlist tracks
for i in sp.playlist_tracks(playlist_URI)["items"]:
    track_ids.append(i["track"]["id"])  # Extract song id
    artist_ids.append(i["track"]["artists"][0]["uri"])  # Extract URI of the first listed artist
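The parsing and collection steps can also be tried without calling the API. The following is a minimal sketch under stated assumptions: the helper names playlist_id_from_link and collect_ids are hypothetical (they are not part of spotipy), the link and IDs are made up, and we assume that an item may arrive with an empty "track" entry, which the loop skips.

```python
def playlist_id_from_link(playlist_link):
    """Return the last path segment of a playlist URL, without the query string."""
    return playlist_link.split("/")[-1].split("?")[0]


def collect_ids(items):
    """Collect track IDs and first-artist URIs from a list of playlist items.

    Assumption: items with an empty "track" entry or no track ID are skipped.
    """
    track_ids, artist_ids = [], []
    for item in items:
        track = item.get("track")
        if not track or not track.get("id"):
            continue
        track_ids.append(track["id"])
        artist_ids.append(track["artists"][0]["uri"])
    return track_ids, artist_ids


# Made-up data shaped like the "items" list used in the loop above
sample_items = [
    {"track": {"id": "t1", "artists": [{"uri": "spotify:artist:a1"}]}},
    {"track": None},
    {"track": {"id": "t2", "artists": [{"uri": "spotify:artist:a2"}]}},
]
sample_link = "https://open.spotify.com/playlist/abc123?si=xyz"

print(playlist_id_from_link(sample_link))
print(collect_ids(sample_items))
```

Keeping the parsing in small functions makes it easy to check the logic on sample data before spending API requests.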

Then, we write a function that takes the playlist data from the API and gets the metadata and audio characteristics of each track. Specifically, the function reads the query results for a playlist and returns the track name, track ID, artist, album, duration, popularity, artist popularity, artist genre, and audio characteristics for each track.

Spotify Audio Features

Spotify’s audio features are precalculated measures of both low-level and high-level perceptual music qualities that help classify a track. A quick explanation of each feature is shown below. More information on how to interpret these audio features is located at Spotify’s API documentation.

Track Metadata

name: Title of the music track.

album: Name of the album on which the track appears.

artist: Name of the artist who performed the track.

release_date: Date the album was first released.

length: Track length in milliseconds.

popularity: Popularity of a track (based on total number of streams and frequency of most recent plays).

artist_pop: Popularity of an artist (based on artist's overall track popularity scores).

artist_genres: List of genres associated with an artist.

Audio Features

acousticness: A confidence measure of whether the track is acoustic.
danceability: Suitability for dancing based on tempo, rhythm, beat, and regularity.
energy: A perceptual measure of intensity and activity.
instrumentalness: Predicts whether a track contains no vocals.
liveness: Probability that the track was performed live.
loudness: Overall loudness of a track in decibels (dB).
speechiness: Detects the presence of spoken words in a track.
tempo: Estimated pace of a track in beats per minute (BPM).
valence: A measure describing the musical positiveness.
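As a rough sketch, one row with the fields listed above could be assembled from three API responses as follows. The helper name build_track_row and all sample values are hypothetical, and the dictionary keys are assumed to follow the track, artist, and audio-features objects described in the Web API reference^[1]. This is not necessarily how the function used in this project is written.

```python
AUDIO_FEATURES = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "loudness", "speechiness", "tempo", "valence",
]


def build_track_row(track, artist, features):
    """Flatten track, artist, and audio-feature dictionaries into one row."""
    row = {
        "name": track["name"],
        "track_id": track["id"],
        "album": track["album"]["name"],
        "artist": artist["name"],
        "release_date": track["album"]["release_date"],
        "length": track["duration_ms"],          # milliseconds
        "popularity": track["popularity"],
        "artist_pop": artist["popularity"],
        "artist_genres": artist["genres"],
    }
    # Missing audio features are stored as None instead of raising an error
    for key in AUDIO_FEATURES:
        row[key] = features.get(key)
    return row


# Made-up sample responses
sample_track = {
    "name": "Example Song", "id": "t1", "duration_ms": 180000, "popularity": 55,
    "album": {"name": "Example Album", "release_date": "2020-01-01"},
}
sample_artist = {"name": "Example Artist", "popularity": 70, "genres": ["rap"]}
sample_features = {"danceability": 0.8, "energy": 0.6, "valence": 0.4}

sample_row = build_track_row(sample_track, sample_artist, sample_features)
print(sample_row["name"], sample_row["length"], sample_row["danceability"])
```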

Playlist Data Preview

The following code loops through each track ID in the playlist and extracts the song information by calling the function we created. From there, we can create a dataframe by passing in the returned data using the pandas package.

# Loop over track ids
tracks = [playlist_features(track_ids[i], artist_ids[i], playlist_ids[i])
    for i in range(len(track_ids))]

Part 2. Similar Artists

First, we want to find the most frequently occurring artist in a given playlist. We use the value_counts function to get a sequence containing counts of unique values sorted in descending order.

# Count distinct values in column
tallyArtists = df.value_counts(["artist", "artist_id"]).reset_index(name='counts')
topArtist = tallyArtists['artist_id'][0]  # Row 0 holds the most frequent artist

We can retrieve artist and related-artist data using the following code, passing the artist ID to the artist and artist_related_artists methods of the spotipy package. The returned list of similar artists is sorted by similarity score based on the listener data^[3].

a = sp.artist(topArtist)
ra = sp.artist_related_artists(topArtist)

Below is a sample of the result when we query Spotify for the most similar artists to the playlist’s top artist, creating a list that holds all of the artist source ids and target ids. We retrieve similar data for the nodes of the connection graph, creating a list that holds information for each specified artist.
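The source-target list and the connection graph it describes can be sketched with plain lists, sets, and dictionaries. In this sketch, build_edges and build_adjacency are hypothetical helper names and the artist IDs are made up. The input is assumed to be a list of artist dictionaries that each carry an "id" key.

```python
def build_edges(source_id, related_artists):
    """Return (source, target) pairs linking one artist to its related artists."""
    return [(source_id, artist["id"]) for artist in related_artists]


def build_adjacency(edges):
    """Turn an edge list into an undirected graph stored as a dict of sets."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set()).add(source)
    return graph


# Made-up related-artist data for two source artists
edges = build_edges("a0", [{"id": "a1"}, {"id": "a2"}])
edges += build_edges("a1", [{"id": "a2"}, {"id": "a3"}])

graph = build_adjacency(edges)
for node in sorted(graph):
    print(node, sorted(graph[node]))
```

Storing neighbors in sets means an artist that appears in several related-artist lists is only recorded once per connection.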

Let’s see how things look when we pull in the full dataset, with each of the top artist’s most similar artists and each of their most similar artists. The following visualization is based on the Spotify Similar Artists API article and created with Flourish Studio.

Part 3. Track Similarity Search

Objective. Design and implement a k-means clustering algorithm in Python.

K-means clustering is a popular machine learning and data mining algorithm that discovers possible clusters within a dataset. Finding these clusters often reveals meaningful information about the data distribution. Below, we create a query that retrieves similar elements using k-Nearest Neighbors (KNN) with the Euclidean distance.

Definitions

As with many machine learning techniques, this algorithm comes with its own terminology, which we define in more detail below.

Definition 1. Distance. The Euclidean distance is a way to calculate how close a data point is to a centroid using the Pythagorean theorem. For two points \(a = \left[a_1, a_2, \ldots, a_n\right]\) and \(b = \left[b_1, b_2, \ldots, b_n\right]\) with dimension \(n\), we define the Euclidean distance between both points as

\[ \small D(a, b) = \sqrt{\left(a_1 - b_1\right)^2 + \left(a_2 - b_2\right)^2 + \ldots + \left(a_n - b_n\right)^2} \]
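The distance formula translates directly into code. A minimal sketch, where the function name euclidean_distance is our own choice for illustration:

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```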

Definition 2. Clusters. A cluster is a collection of grouped points. For k-means, every point is part of a cluster. As the algorithm progresses and the centroids shift, points might change which cluster they’re grouped in, though the point itself does not move.

Definition 3. Centroids. A centroid is the center of a cluster, calculated as the average location of all the cluster’s points in each dimension. So if we have three \(n\)-dimensional points \(a\), \(b\), and \(c\), we define the average as

\[ \small \mathrm{average} = \left[ \tfrac{a_1 + b_1 + c_1}{3}, \tfrac{a_2 + b_2 + c_2}{3}, \ldots, \tfrac{a_n + b_n + c_n}{3} \right] \]
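The same average works for any number of points. A minimal sketch, with centroid as a hypothetical function name and made-up points:

```python
def centroid(points):
    """Average a list of equal-length points dimension by dimension."""
    if not points:
        raise ValueError("cannot average an empty cluster")
    n = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(n)]


print(centroid([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```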

Definition 4. Convergence. The algorithm converges when the locations of all centroids change by less than some threshold, e.g. \(1 \times 10^{-5}\), between two consecutive iterations.
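One way to express this check in code is to compare each centroid with its previous location. The sketch below assumes that "change" is measured as the Euclidean distance a centroid moved. The function name has_converged and the default tolerance are illustrative choices.

```python
import math


def has_converged(old_centroids, new_centroids, tol=1e-5):
    """Return True if every centroid moved less than tol between iterations."""
    return all(
        math.dist(old, new) < tol
        for old, new in zip(old_centroids, new_centroids)
    )


print(has_converged([[0.0, 0.0]], [[0.0, 1e-6]]))  # True
print(has_converged([[0.0, 0.0]], [[0.0, 0.1]]))   # False
```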

KNN Algorithm

The KNN algorithm^[4] searches for the \(k\) elements closest to a query point. Closeness is measured with the Euclidean distance, which is the length of the line segment between two points. In this sense, the closer the distance is to 0, the more similar the songs are.
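A brute-force version of this search fits in a few lines. The sketch below is an illustration only: knn_query is a hypothetical name and is not the knnQuery function used later in this article, and the points are made up. It returns the indices of the k nearest points and, for contrast, the k farthest points.

```python
import math


def knn_query(points, query_index, k):
    """Return (k nearest indices, k farthest indices) for one query point."""
    if k <= 0:
        return [], []
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    # Rank every other point by its Euclidean distance to the query point
    ranked = sorted(others, key=lambda i: math.dist(query, points[i]))
    nearest = ranked[:k]
    farthest = ranked[-k:][::-1]  # farthest point first
    return nearest, farthest


points = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [1.0, 1.0], [0.9, 1.0]]
print(knn_query(points, 0, 2))  # ([1, 2], [3, 4])
```

Sorting all points costs more than necessary for large datasets, but it keeps the idea visible and is adequate for a single playlist.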

K-means clustering, which we apply in Part 4, works in four steps:

1. Initialize some number \(k\) of cluster centers, also called centroids.
2. For each data point in the dataset, assign it to the closest centroid.
3. Update the location of each centroid to be the average of all the points assigned to its cluster.
4. Repeat steps 2 and 3 until convergence.

Note that the actual data points do not change. Only the locations of the centroids change with each iteration. As the centroids move, the set of data points closest to each centroid changes.
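The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Scikit-Learn implementation used in Part 4: the function name k_means, the random initialization from the data points, the tol, max_iter, and seed parameters, and the toy points are all choices made for this example.

```python
import math
import random


def k_means(points, k, tol=1e-5, max_iter=100, seed=0):
    """Cluster points into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: initialize the centroids with k distinct data points
    centroids = [list(p) for p in rng.sample(points, k)]

    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)

        # Step 3: move each centroid to the average of its assigned points
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place

        # Step 4: stop once no centroid moved more than tol
        shift = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break

    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, labels


toy_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_centroids, toy_labels = k_means(toy_points, k=2)
print(toy_labels)
```

The data points are never modified. Only the centroid list is replaced on each pass, which matches the note above.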

KNN Query Example

Our function allows us to create personalized query points and modify the columns to explore other options. For example, the following code selects a specific set of song attributes and then searches for the songs whose values for these attributes are closest to those of the query point. Let’s search for \(k=3\) similar songs to the track at the index stored in query_point.

# Select song and column attributes
query_point = 4
columns = ['acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence']
# Set parameters and run query
func, param = knnQuery, 3
response = querySimilars(df, columns, query_point, func, param)

## ---- Query Point ----

## AG Club - Memphis

## ---- k = 3 similar songs ----

## Roddy Ricch - Stop Breathing
## The Game - Eazy
## Lil Yachty - Yacht Club (feat. Juice WRLD)

## ---- k = 3 nonsimilar songs ----

## Post Malone - Internet
## ODIE - In My Head
## Frank Ocean - In My Room

KNN Query Example (Extended)

The code below implements the same idea as above, but queries each track in a given playlist instead of a single defined query point.

Code

similar_count = {} # Similar songs count
nonsimilar_count = {} # Non-similar songs count
for track_index in df.index:
    response = querySimilars(df, columns, track_index, func, param)
    for similar_index in response[0]: # Get similar songs
        track = getMusicName(df.loc[similar_index])
        if track in similar_count:
            similar_count[track] += 1
        else:
            similar_count[track] = 1
    for nonsimilar_index in response[1]: # Get non-similar songs
        track = getMusicName(df.loc[nonsimilar_index])
        if track in nonsimilar_count:
            nonsimilar_count[track] += 1
        else:
            nonsimilar_count[track] = 1
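The same tally can be written more compactly with collections.Counter, which also sorts the counts for display. In this sketch, tally_tracks is a hypothetical helper and the track names are made up. We assume each response has already been converted into a pair of name lists (similar, non-similar).

```python
from collections import Counter


def tally_tracks(responses):
    """Count how often each track appears as similar and as non-similar."""
    similar, nonsimilar = Counter(), Counter()
    for similar_names, nonsimilar_names in responses:
        similar.update(similar_names)
        nonsimilar.update(nonsimilar_names)
    return similar, nonsimilar


# Made-up responses for two query tracks
responses = [
    (["Track A", "Track B"], ["Track X"]),
    (["Track A"], ["Track X", "Track Y"]),
]
similar, nonsimilar = tally_tracks(responses)

# most_common() lists tracks from the highest count to the lowest
for name, count in similar.most_common():
    print(f"{name} : {count}")
```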

Non-Similar Songs Count

## ---- NON-SIMILAR SONGS COUNT ----

## Frank Ocean - In My Room : 83
## ODIE - In My Head : 46
## Post Malone - Internet : 42
## Blxst - Hurt : 25
## Lil Uzi Vert - The Way Life Goes (feat. Nicki Minaj & Oh Wonder) - Remix : 19
## Kanye West - Waves : 12
## Tyla Yaweh - Understand Me : 11

Similar Songs Count

## ---- SIMILAR SONGS COUNT ----

## YoungBoy Never Broke Again - Home Ain't Home (feat. Rod Wave) : 9
## Kodak Black - MoshPit (feat. Juice WRLD) : 8
## Lil Xan - Lies (feat. Lil Skies) : 8
## Tyla Yaweh - High Right Now (feat. Wiz Khalifa) - Remix : 7
## iann dior - I might : 7
## mike. - commas : 7
## Juice WRLD - Life's A Mess (feat. Halsey) : 6
## Fresco Trey - Key To My Heart : 6
## Azizi Gibson - Rain : 6
## Mac Miller - Weekend (feat. Miguel) : 5
## Polo G - RAPSTAR : 5
## Rae Sremmurd - Denial : 5
## Juice WRLD - Stay High : 5
## Lil Uzi Vert - The Way Life Goes (feat. Nicki Minaj & Oh Wonder) - Remix : 5
## Sheff G - Weight On Me : 5
## whiterosemoxie - west side boys : 5
## Post Malone - Waiting For Never : 5
## Post Malone - Big Lie : 5
## Juice WRLD - In My Head : 5
## Justin Stone - Goldmine : 5
## Juice WRLD - Rich And Blind : 5
## Baby Keem - 16 : 5
## KILJ - No Remedy : 5
## ItsWill - No Love Song : 4
## Lil Yachty - Yacht Club (feat. Juice WRLD) : 4
## Young Thug - Webbie (feat. Duke) : 4
## The Game - Eazy : 4
## Young Thug - Love You More (with Nate Ruess, Gunna & Jeff Bhasker) : 4
## Healy - Nikes On : 4
## Polo G - Distraction : 4
## 6ix9ine - GOTTI : 4
## Wiz Khalifa - The Plan (feat. Juicy J) : 4
## Rich The Kid - Ring Ring (feat. Vory) : 4
## Arizona Zervas - HOLY TRINITY (feat. Rich The Kid) : 4
## Preme - DnF (feat. Drake & Future) : 4
## Post Malone - Rich & Sad : 4
## Post Malone - Go Flex : 4
## Various Artists - Hide (feat. Seezyn) : 4
## Jason Derulo - Whatcha Say : 4
## Rod Wave - Yungen (feat. Jack Harlow) : 4
## Kevin Abstract - Empty : 4
## Lil Uzi Vert - The Way Life Goes (feat. Oh Wonder) : 3
## Juice WRLD - Doom : 3
## YG - Sober (feat. Roddy Ricch & Post Malone) : 3
## SAINt JHN - The Best Part of Life : 3
## Internet Money - Blastoff (feat. Juice Wrld & Trippie Redd) : 3
## AUGUST 08 - Cutlass (ft. ScHoolboy Q) : 3
## Kanye West - Waves : 3
## Lil Tecca - Out Of Love (feat. Internet Money) : 3
## Nelly - Just A Dream : 3
## The Kid LAROI - FEEL SOMETHING (feat. Marshmello) : 3
## RIZ LA VIE - Go Again : 3
## 6LACK - Switch : 3
## Logic - Everyday : 3
## Juice WRLD - Already Dead : 3
## SAINt JHN - Wedding Day : 3
## Lil Mosey - Noticed : 3
## Mark Battles - Lemme Talk : 3
## Juice WRLD - You Wouldn't Understand : 3
## PawPaw Rod - HIT EM WHERE IT HURTS : 3
## Juice WRLD - ON GOD (feat. Young Thug) : 3
## Lil Uzi Vert - Erase Your Social : 3
## Roddy Ricch - Stop Breathing : 2
## Dell Mac - So Sad : 2
## A$AP Rocky - Sandman : 2
## Future - WAIT FOR U (feat. Drake & Tems) : 2
## Drake - Crew Love : 2
## The Kid LAROI - Thousand Miles : 2
## Migos - Antisocial (feat. Juice WRLD) : 2
## Verzache - Think About It : 2
## Rexx Life Raj - Moonwalk : 2
## Big Sean - Wolves (feat. Post Malone) : 2
## ODIE - In My Head : 2
## mike. - life got crazy : 2
## Tom The Mail Man - Taking Over : 2
## Juice WRLD - Wishing Well : 2
## PARTYNEXTDOOR - Break from Toronto : 2
## Toosii - be cautious : 2
## Metro Boomin - Creepin' (with The Weeknd & 21 Savage) : 2
## Xuitcasecity - Crash : 2
## Post Malone - Paranoid : 2

Part 4. K Means Clustering

Next, we implement the K-Means clustering algorithm using the Scikit-Learn library to break down a playlist into several smaller playlists. The unsupervised learning algorithm divides similar data points into k groups by computing the distance to the centroid.

The first step is to choose an appropriate number of clusters \(k\). We use the Elbow Method to determine the optimal \(k\), as shown below^[5].

X = df[['acousticness', 'danceability', 'liveness', 'energy','valence', 'instrumentalness', 'speechiness']]
features = X.values

from sklearn.cluster import KMeans
ssd = [] # Sum of squared distances
for k in range(1,12):
    model = KMeans(n_clusters = k, init="k-means++")
    model = model.fit(features)
    ssd.append(model.inertia_)

Thus, we tune the clustering algorithm by running K-Means for a range of k values, obtaining the above figure. It looks like a value of 3 is optimal for this case. Next, we call the K-Means function and set the k value to 3 clusters.
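Reading the elbow off a plot is a judgment call. One possible heuristic, sketched here as an assumption rather than the method used above, is to pick the k at which the decrease in the sum of squared distances slows down the most (the largest second difference). The function name and the SSD values below are made up for illustration and are not the results from this playlist.

```python
def elbow_by_second_difference(ks, ssd):
    """Return the k where the SSD curve bends most sharply.

    The bend at position i is the drop before it minus the drop after it.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ssd) - 1):
        bend = (ssd[i - 1] - ssd[i]) - (ssd[i] - ssd[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


example_ks = [1, 2, 3, 4, 5, 6]
example_ssd = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # made-up values
print(elbow_by_second_difference(example_ks, example_ssd))  # 3
```

A heuristic like this is best used alongside the plot, since a smooth curve may have no clear bend at all.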

Considering that there are seven different audio features for the clustering task, we use principal component analysis (PCA) to reduce the dimensionality of the data into a more easily visualized set of variables.

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca_result = pca.fit_transform(features)

In the above code, we define a PCA instance to find two principal components determined from the features of the data. From there, we visualize the resulting clusters and explore the variation. The figure below shows our 3 clusters represented in 2-dimensional space.

Results

All the clusters have speechiness values below \(0.33\), indicating that the songs most likely represent music and other non-speech-like tracks.

Cluster 1 has the highest energy and valence, indicating that these tracks are faster-paced, louder, and more positive (e.g., happy, cheerful, euphoric) than the other clusters.

Cluster 2 has the highest acousticness, with a mean value of \(0.539\) over all the cluster’s songs. Cluster 2 is also higher in danceability, indicating tracks whose tempo, rhythm, and beat make them more suitable for dancing.

Cluster 3 has the lowest valence, with a mean of \(0.2303\), indicating tracks that sound more negative (e.g., sad, frustrated, angry).
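Per-cluster means like the ones quoted above can be computed from the cluster labels and the feature rows. The following is a minimal sketch with made-up numbers. The function name cluster_means is hypothetical, and the values are not the ones from our playlist.

```python
def cluster_means(rows, labels, feature_names):
    """Return {label: {feature: mean}} for rows grouped by cluster label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        counts[label] = counts.get(label, 0) + 1
        totals = sums.setdefault(label, [0.0] * len(feature_names))
        for i, value in enumerate(row):
            totals[i] += value
    return {
        label: {name: sums[label][i] / counts[label]
                for i, name in enumerate(feature_names)}
        for label in sums
    }


# Made-up feature rows (valence, energy) and cluster labels
example_rows = [[0.25, 0.5], [0.75, 1.0], [0.5, 0.125]]
example_labels = [0, 0, 1]
means = cluster_means(example_rows, example_labels, ["valence", "energy"])
print(means)
```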

Conclusion

Through the use of machine learning and collaborative filtering techniques, we can analyze a playlist to gain insights into listening preferences and habits, and recommend music that aligns with similar tastes. For example, if a playlist contains several songs from a particular artist, the algorithm might suggest similar artists based on the listening habits of other users with similar music preferences.

Next Steps

To move forward with this project, we plan to incorporate Natural Language Processing (NLP) to track current music trends. NLP can search for mentions of a specific artist or song, then analyze those mentions to identify trends in popular culture. It can also identify popular keywords and phrases related to a particular song and provide recommendations based on similar terms. Ultimately, NLP can organize songs into groups and assign descriptive adjectives to each track, further classifying individual songs.

References

[1] Web API Reference | Spotify for Developers, https://developer.spotify.com/documentation/web-api/reference/.

[2] Welcome to Spotipy!, https://spotipy.readthedocs.io/en/2.22.0/.

[3] E. Webb, Visualizing Rap Communities with Python & Spotify’s API, https://unboxed-analytics.com/data-technology/visualizing-rap-communities-wtih-python-spotifys-api/.

[4] Leonardo Mauro, Spotify Songs - Similarity Search, https://www.kaggle.com/code/leomauro/spotify-songs-similarity-search/notebook.

[5] Chingis Oinar, Separate Your Saved Songs on Spotify into Playlists of Similar Songs, https://towardsdatascience.com/cluster-your-liked-songs-on-spotify-into-playlists-of-similar-songs-66a244ba297e.