Objective. The goal of this project is to implement the k-means clustering algorithm, a technique often used in machine learning, in Python and use it for data analysis. We write functions that use lists, sets, dictionaries, sorting, and graph data structures for computational problem-solving and analysis.


Part 1. Spotify API Data

Spotify is a popular audio streaming platform with an extensive music database. The Spotify API allows developers to access the platform’s data, providing global insights into music listening habits[1]. Using the API requires an initial setup: we must register as a Spotify developer, create an app, modify the dashboard redirect URI, and store the client ID and secret. After completing these steps, we have access to the Spotify API and all its features.

Get Playlist Data from API

First, we create a Client Credentials Flow manager, used for server-to-server authentication, by passing the necessary parameters to the SpotifyClientCredentials class[2]. We provide a client ID and client secret to the constructor of this authorization flow, which does not require user interaction.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set client id and client secret
client_id = 'xxx'
client_secret = 'xxx'

# Spotify authentication
cc_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=cc_manager)

Now we can get the full details of the tracks of a playlist based on a playlist ID, URI, or URL. Choose a specific playlist to analyze by copying the URL from the Spotify Player interface. Using that link, the following code uses the playlist_tracks method to retrieve a list of IDs and corresponding artists for each track from the playlist.

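# Extract the playlist ID from the link (last path segment, without the query string)
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

track_ids, artist_ids = [], []
# Iterate over list of playlist tracks
for item in sp.playlist_tracks(playlist_URI)["items"]:
    track_ids.append(item["track"]["id"])  # Extract song id
    artist_ids.append(item["track"]["artists"][0]["uri"])  # Extract URI of the first listed artist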
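One caveat is worth sketching: the Spotify Web API returns playlist items one page at a time (at most 100 per request), so the loop above only sees the first page of a long playlist. Below is a minimal sketch of collecting every page, assuming `sp` is the authenticated spotipy client from earlier and using spotipy's `next` method to follow the paging links. The helper names `playlist_id_from_link` and `collect_playlist_ids` are hypothetical.

```python
def playlist_id_from_link(playlist_link):
    """Return the playlist ID: the last path segment, minus any query string."""
    return playlist_link.split("/")[-1].split("?")[0]


def collect_playlist_ids(sp, playlist_id):
    """Collect track IDs and primary-artist URIs across all result pages."""
    track_ids, artist_uris = [], []
    page = sp.playlist_tracks(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track or not track.get("id"):
                continue  # skip removed or local tracks that have no Spotify ID
            track_ids.append(track["id"])
            artist_uris.append(track["artists"][0]["uri"])
        # sp.next() fetches the following page; stop when there is no next link
        page = sp.next(page) if page.get("next") else None
    return track_ids, artist_uris
```

With a real client, this would be called as `collect_playlist_ids(sp, playlist_id_from_link(playlist_link))`.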

Then, we write a function that takes the playlist data from the API and gets the metadata and audio characteristics of each track. Specifically, the function reads the query results for a playlist and returns the track name, track ID, artist, album, duration, popularity, artist popularity, artist genre, and audio characteristics for each track.
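The function itself is not reproduced here. As a rough sketch of what such a function might look like, the code below assumes spotipy's `track`, `artist`, and `audio_features` methods and uses a hypothetical name, `track_features`. The field selection follows the lists in the next section; the actual implementation may differ.

```python
AUDIO_KEYS = ["acousticness", "danceability", "energy", "instrumentalness",
              "liveness", "loudness", "speechiness", "tempo", "valence"]


def track_features(sp, track_id, artist_id):
    """Return one flat dict of metadata and audio features for a track."""
    meta = sp.track(track_id)                   # track-level metadata
    artist = sp.artist(artist_id)               # artist popularity and genres
    audio = sp.audio_features([track_id])[0]    # one entry per requested track

    record = {
        "name": meta["name"],
        "track_id": track_id,
        "album": meta["album"]["name"],
        "artist": meta["artists"][0]["name"],
        "artist_id": artist_id,
        "release_date": meta["album"]["release_date"],
        "length": meta["duration_ms"],
        "popularity": meta["popularity"],
        "artist_pop": artist["popularity"],
        "artist_genres": artist["genres"],
    }
    # audio can be None when Spotify has no audio features for the track
    record.update({key: (audio or {}).get(key) for key in AUDIO_KEYS})
    return record
```

Returning a dict per track keeps the later dataframe step simple, since each key maps directly to a column.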

Spotify Audio Features


Spotify’s audio features are precalculated measures of both low-level and high-level perceptual music qualities that help classify a track. A quick explanation of each feature is shown below. More information on how to interpret these audio features is available in Spotify’s API documentation.

Track Metadata

  • name: Title of the music track.
  • album: Name of the album on which the track appears.
  • artist: Name of the artist who performed the track.
  • release_date: Date the album was first released.
  • length: Track length in milliseconds.
  • popularity: Popularity of a track (based on total number of streams and frequency of most recent plays).
  • artist_pop: Popularity of an artist (based on artist's overall track popularity scores).
  • artist_genres: List of genres associated with an artist.


Audio Features
  • acousticness: A confidence measure of whether the track is acoustic.
  • danceability: Suitability for dancing based on tempo, rhythm, beat, and regularity.
  • energy: A perceptual measure of intensity and activity.
  • instrumentalness: Predicts whether a track contains no vocals.
  • liveness: Probability that the track was performed live.
  • loudness: Overall loudness of a track in decibels (dB).
  • speechiness: Detects the presence of spoken words in a track.
  • tempo: Estimated pace of a track in beats per minute (BPM).
  • valence: A measure describing the musical positiveness.

Playlist Data Preview

The following code loops through each track ID in the playlist and extracts the song information by calling the function we created. From there, we can create a dataframe by passing in the returned data using the pandas package.

# Loop over track ids
tracks = [playlist_features(track_ids[i], artist_ids[i], playlist_ids[i])
    for i in range(len(track_ids))]
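The dataframe step mentioned above might look like the following sketch, assuming each element of `tracks` is a dict keyed by column name. The two records here are made-up placeholders, not real playlist data.

```python
import pandas as pd

# Hypothetical records standing in for the output of the feature function
tracks = [
    {"name": "Track A", "artist": "Artist X", "artist_id": "x1", "energy": 0.81, "valence": 0.62},
    {"name": "Track B", "artist": "Artist Y", "artist_id": "y1", "energy": 0.35, "valence": 0.20},
]

# A list of dicts becomes one row per track and one column per key
df = pd.DataFrame(tracks)
print(df.head())
```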

Part 2. Similar Artists

First, we want to find the most frequently occurring artist in a given playlist. We use the value_counts function to get a sequence containing counts of unique values sorted in descending order.

# Count distinct values in column
tallyArtists = df.value_counts(["artist", "artist_id"]).reset_index(name='counts')
topArtist = tallyArtists['artist_id'][1]
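To see what the tally does, here is a standard-library sketch of the same count on a hypothetical list of (artist, artist_id) pairs. `Counter.most_common` orders the pairs from most to least frequent, much like `value_counts`.

```python
from collections import Counter

# Hypothetical (artist, artist_id) pairs, one per playlist track
rows = [("Artist X", "x1"), ("Artist Y", "y1"), ("Artist X", "x1"),
        ("Artist Z", "z1"), ("Artist X", "x1"), ("Artist Y", "y1")]

tally = Counter(rows)
# most_common(1) returns the single most frequent pair with its count
(top_name, top_id), top_count = tally.most_common(1)[0]
print(top_name, top_id, top_count)  # Artist X x1 3
```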

We can retrieve data on an artist and on related artists by passing the artist ID to the artist and artist_related_artists methods of the spotipy client. The returned list of similar artists is sorted by similarity score based on listener data[3].

a = sp.artist(topArtist)
ra = sp.artist_related_artists(topArtist)

Below is a sample of the result when we query Spotify for the artists most similar to the playlist’s top artist, building a list that holds all of the source and target artist IDs. We retrieve similar data for the nodes of the connection graph, building a list that holds information for each specified artist.

Let’s see how things look when we pull in the full dataset, with each of the top artist’s most similar artists and each of their most similar artists. The following visualization is based on the Spotify Similar Artists API article and created with Flourish Studio.
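The two-level expansion described above amounts to a breadth-first walk over the related-artists relation. Below is a minimal sketch, assuming a hypothetical `get_related` callable that returns a list of related artist IDs for a given ID (for example, a thin wrapper around `sp.artist_related_artists`). The toy dictionary stands in for the API.

```python
from collections import deque


def build_edges(seed_id, get_related, max_depth=2):
    """Return (source, target) pairs reached within max_depth hops of seed_id."""
    edges = []
    seen = {seed_id}
    queue = deque([(seed_id, 0)])
    while queue:
        source, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand artists at the outermost level
        for target in get_related(source):
            edges.append((source, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges


# Toy stand-in for the API: artist ID -> related artist IDs
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
edges = build_edges("a", lambda artist_id: toy.get(artist_id, []))
print(edges)
```

The `seen` set ensures each artist is expanded only once, which also keeps the number of API requests down when artists are related to each other.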


Part 4. K-Means Clustering

Next, we implement K-Means clustering using the Scikit-Learn library to break a playlist down into several smaller playlists. This unsupervised learning algorithm partitions the data into k groups by assigning each point to its nearest centroid and then recomputing each centroid as the mean of its assigned points, repeating until the assignments stabilize.
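To make the algorithm concrete, here is a from-scratch sketch of the standard iteration in NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustration only, not the Scikit-Learn implementation used in this project. The function name `simple_kmeans` and the random initialization are assumptions.

```python
import numpy as np


def simple_kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (n_samples x n_features) into k groups."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = simple_kmeans(data, k=2)
print(labels)
```

Which group receives label 0 depends on the random starting centroids, so only the grouping itself is meaningful.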

The first step is to choose an appropriate number of clusters (k). We use the Elbow Method to determine the optimal k, as shown below[5].

from sklearn.cluster import KMeans

# Audio features used for clustering
X = df[['acousticness', 'danceability', 'liveness', 'energy', 'valence', 'instrumentalness', 'speechiness']]
features = X.values

ssd = []  # Sum of squared distances for each k
for k in range(1, 12):
    model = KMeans(n_clusters=k, init="k-means++")
    model.fit(features)
    ssd.append(model.inertia_)
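For reference, the quantity that `inertia_` reports, and that `ssd` collects for each k, is the within-cluster sum of squared distances. In the notation below (introduced here for illustration), \(C_j\) is the set of points assigned to cluster \(j\) and \(\mu_j\) is that cluster's centroid:

```latex
\[
\mathrm{SSD}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\]
```

This value generally shrinks as k grows, so the Elbow Method looks for the k beyond which the decrease levels off.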

We thus tune the clustering algorithm by running K-Means for a range of k values and plotting the sum of squared distances against k. The curve suggests that a value of 3 is optimal in this case. Next, we call the K-Means function with k set to 3 clusters.
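The final fit is not shown in the text. Here is a minimal sketch of that step, using synthetic data in place of the playlist features. The array `features` below is randomly generated for illustration, and `n_init` and `random_state` are choices made here, not taken from the original.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 7 audio-feature columns (values in [0, 1))
rng = np.random.default_rng(0)
features = rng.random((60, 7))

# Fit K-Means with k = 3 and get one cluster label per track
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(features)

print(labels[:10])                   # cluster index (0, 1, or 2) for the first tracks
print(model.cluster_centers_.shape)  # (3, 7): one centroid per cluster
```

With the real data, the labels could then be stored as a new column, for example `df['cluster'] = labels` (the column name is hypothetical).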

Considering that there are seven different audio features for the clustering task, we use principal component analysis (PCA) to reduce the dimensionality of the data into a more easily visualized set of variables.

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca_result = pca.fit_transform(features)

In the above code, we define a PCA instance to find two principal components determined from the features of the data. From there, we visualize the resulting clusters and explore the variation. The figure below shows our 3 clusters represented in 2-dimensional space.
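When reading a 2-D projection like this, it helps to check how much of the original variance the two components retain. Below is a sketch using the `explained_variance_ratio_` attribute of Scikit-Learn's PCA, again on synthetic stand-in data rather than the playlist features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 7 audio-feature columns
rng = np.random.default_rng(0)
features = rng.random((60, 7))

pca = PCA(n_components=2)
pca_result = pca.fit_transform(features)  # shape (60, 2): one 2-D point per track

# Fraction of total variance captured by each component, largest first
ratios = pca.explained_variance_ratio_
print(pca_result.shape, ratios, ratios.sum())
```

A low total means the 2-D picture hides much of the structure that the clustering actually used, so apparent overlap between clusters in the plot should be read with care.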

Results

All the clusters have speechiness values below \(0.33\), indicating that the songs most likely represent music and other non-speech-like tracks.

Cluster 1 has the highest energy and valence, indicating that these tracks are faster-paced, louder, and more positive (e.g., happy, cheerful, euphoric) than the other clusters.

Cluster 2 has the highest acousticness, with a mean value of \(0.539\) across the cluster’s songs. Cluster 2 is also higher in danceability, indicating tracks whose tempo, rhythm, and beat make them better suited to dancing.

Cluster 3 has the lowest valence, with a mean of \(0.2303\), indicating tracks that sound more negative (e.g., sad, frustrated, angry).
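Per-cluster averages like the ones quoted above can be computed with a pandas groupby. Below is a minimal sketch on made-up numbers, assuming the cluster labels were stored in a hypothetical `cluster` column.

```python
import pandas as pd

# Made-up values for illustration; not the playlist's actual data
df = pd.DataFrame({
    "cluster":      [0, 0, 1, 1, 2, 2],
    "energy":       [0.80, 0.90, 0.40, 0.50, 0.30, 0.20],
    "valence":      [0.70, 0.60, 0.50, 0.40, 0.20, 0.30],
    "acousticness": [0.10, 0.20, 0.60, 0.50, 0.30, 0.40],
})

# Mean of every audio feature within each cluster
cluster_means = df.groupby("cluster").mean()
print(cluster_means)

# Cluster with the highest mean for each feature
print(cluster_means.idxmax())
```

Note that Scikit-Learn numbers clusters from 0, so label 0 here would correspond to "Cluster 1" in the prose.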


Conclusion

Through the use of machine learning and collaborative filtering techniques, we can analyze a playlist to gain insights into listening preferences and habits, and recommend music that aligns with similar tastes. For example, if a playlist contains several songs from a particular artist, the algorithm might suggest similar artists based on the listening habits of other users with similar music preferences.

Next Steps

To move forward with this project, we plan to incorporate Natural Language Processing (NLP) to track current music trends. NLP can search for mentions of a specific artist or song, then analyze those mentions to identify trends in popular culture. It can also identify popular keywords and phrases related to a particular song and provide recommendations based on similar terms. Ultimately, NLP can organize songs into groups and assign descriptive adjectives to each track, further classifying individual songs.


References

[1] Web API Reference | Spotify for Developers, https://developer.spotify.com/documentation/web-api/reference/.
[2]
[3] E. Webb, Visualizing Rap Communities with Python & Spotify’s API, https://unboxed-analytics.com/data-technology/visualizing-rap-communities-wtih-python-spotifys-api/.
[4] Leonardo Mauro, Spotify Songs - Similarity Search, https://www.kaggle.com/code/leomauro/spotify-songs-similarity-search/notebook.
[5] Chingis Oinar, Separate Your Saved Songs on Spotify into Playlists of Similar Songs, https://towardsdatascience.com/cluster-your-liked-songs-on-spotify-into-playlists-of-similar-songs-66a244ba297e.