NLP Song Lyrics

Author

Hannah Luebbering

Published

April 16, 2024

Overview

Objective. This project extracts and analyzes Spotify data from a top playlist using the Spotify and Genius Lyrics Web APIs. Natural Language Processing techniques are used to process lyrics and perform sentiment analysis, and K-means clustering and principal component analysis (PCA) are employed to categorize songs and explore relationships between musical features.

Extracting Spotify Data

Getting started, we want to extract data for a set of tracks within one of Spotify’s top-featured playlists. Leveraging the Spotify Web API, we can seamlessly obtain detailed data for a song, such as the artist, the album it belongs to, its release date, popularity, and audio features like danceability, energy, and tempo.

Accessing the Spotify Web API

Python libraries like spotipy provide a user-friendly way to interact with the Spotify API, with a range of functions that streamline tasks like API authentication. To authenticate, we provide a client ID and secret. Once authenticated, we can query the API and retrieve data.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

my_auth = SpotifyClientCredentials(client_id = "xxx", client_secret = "xxx")
sp = spotipy.Spotify(auth_manager=my_auth)  # Spotify authentication

Extracting Track Data From Playlist

Next, we utilize Spotify’s API to extract further details about each song within the playlist. We obtain metadata such as the track name, the artist it’s sung by, the album it belongs to, the release date, and track features such as danceability, tempo, and popularity.

def get_playlist_tracks(playlist_URI):
    results = sp.playlist_tracks(playlist_URI)
    tracks = results["items"]
    while results["next"]:
        results = sp.next(results)
        tracks.extend(results["items"])
    return tracks

Choose a specific playlist to analyze by copying its URL from the Spotify player interface. Passing that link to the function above retrieves every track in the playlist, along with its ID and corresponding artists. Specifically, we analyze Spotify’s Today’s Top Hits playlist.
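For analysis, the returned playlist items can be flattened into a tidy table. The sketch below uses a hypothetical helper of ours, with field names following the Spotify Web API’s playlist-track object; audio features would come from a separate `sp.audio_features` call.

```python
import pandas as pd

def track_to_record(item):
    """Flatten one playlist item into a row of basic track metadata."""
    track = item["track"]
    return {
        "name": track["name"],
        "artist": track["artists"][0]["name"],
        "album": track["album"]["name"],
        "release_date": track["album"]["release_date"],
        "popularity": track["popularity"],
    }

# df = pd.DataFrame(track_to_record(t) for t in get_playlist_tracks(playlist_URI))
```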


Natural Language Processing

Using the data gathered from the Spotify API, we now want to extract and process lyrics for each song. This is accomplished through scraping textual lyrical data from the Genius Lyrics website. Following extraction, the lyrics are cleaned before undergoing sentiment analysis.

Scraping Song Lyrics

The lyricsgenius library provides a convenient wrapper for scraping the Genius Lyrics website. After initializing the genius client with an access token, we can retrieve the lyrics of any given song, such as “Too Many Nights” by Metro Boomin.

import lyricsgenius
genius = lyricsgenius.Genius(access_token) # Initialize Genius API
song = genius.search_song("Too Many Nights", "Metro Boomin")
Searching for "Too Many Nights" by Metro Boomin...
Done.

Pre-Processing Text Data

Using the genius library, we define a function to fetch the lyrics of a song given the name and artist. Once retrieved, the next step is to pre-process the lyrics. This involves a cleaning process to eliminate patterns that may hinder the overall readability. The script contains the following steps:

  1. Fetching Track Lyrics
  2. Expanding Contractions
  3. Converting Text to Lowercase
  4. Spell Checking + Censoring
  5. Removing Punctuations
  6. Tokenizing and encoding to ASCII
import contractions
from better_profanity import profanity
from nltk.tokenize import word_tokenize

def clean_song_lyrics(song_name, artist_name):
    # Fetch song lyrics and clean
    lyrics = get_song_lyrics(song_name, artist_name)
    lyrics = profanity.censor(contractions.fix(lyrics).lower(), censor_char="")
    lyrics = remove_punctuation(lyrics)

    # Tokenizing and encoding to ASCII
    return [word.encode("ascii", "ignore").decode() for word in word_tokenize(lyrics)]
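The two helpers referenced above are not shown in the original script; a minimal sketch might look like the following, assuming the `genius` client initialized earlier and treating the bracketed section headers Genius embeds in its lyrics (e.g. [Chorus]) as noise:

```python
import re
import string

def get_song_lyrics(song_name, artist_name):
    """Fetch raw lyrics via the Genius client initialized earlier."""
    song = genius.search_song(song_name, artist_name)
    return song.lyrics if song else ""

def remove_punctuation(text):
    """Drop [Verse]/[Chorus]-style headers, then strip punctuation."""
    text = re.sub(r"\[.*?\]", " ", text)
    return text.translate(str.maketrans("", "", string.punctuation))
```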

Further Text Cleaning

We employ the Natural Language Toolkit (NLTK) to filter out stopwords and perform lemmatization. Removing common words like “the” condenses the text, allowing for a more thorough analysis of the lyrics’ core message. Lemmatization helps standardize text by transforming different verb variations into their most basic form.

name artist lyrics stopwords_removed lemmatized
0 Please Please Please Sabrina Carpenter ['i', 'know', 'i', 'have', 'good', 'judgment', 'i',... ['know', 'good', 'judgment', 'know', 'good', 'taste... ['know', 'good', 'judgment', 'know', 'good', 'taste...
1 Si Antes Te Hubiera Conocido KAROL G ['what', 'what', 'we', 'are', 'in', 'a', 'relay', '... ['relay', 'summer', 'started', 'fire', 'would', 'me... ['relay', 'summer', 'start', 'fire', 'would', 'meet...
2 BIRDS OF A FEATHER Billie Eilish ['i', 'want', 'you', 'to', 'stay', 'til', 'i', 'am'... ['want', 'stay', 'til', 'grave', 'til', 'rot', 'awa... ['want', 'stay', 'til', 'grave', 'til', 'rot', 'awa...

Term Frequency Analysis

Let’s examine the most frequent words. Plotting the frequency distribution helps to determine the occurrence of the most common terms in our lyrical corpus.
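A frequency count over the lemmatized tokens is enough to drive such a plot. A small sketch (the `top_terms` helper is ours, assuming the tokens live in the `lemmatized` column shown above):

```python
from collections import Counter

def top_terms(token_lists, n=10):
    """Flatten per-song token lists and count the n most common terms."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return counts.most_common(n)

# top_terms(df["lemmatized"])  # feed the counts into a bar chart
```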


Sentiment Analysis

The next process involves implementing pipelines to predict emotions and sentiment in textual content using transformer models designed for text classification and sentiment analysis. Three distinct pipelines are created, each equipped with different models.

from transformers import pipeline

classifiers = [ # Initialize sentiment classifiers
    pipeline(model='bhadresh-savani/distilbert-base-uncased-emotion'),
    pipeline(model='cardiffnlp/twitter-roberta-base-sentiment'),
    pipeline('sentiment-analysis')  # default SST-2 model; source of the NEGATIVE/POSITIVE scores
]

One of the classifiers is the distilbert-base-uncased-emotion model, which detects emotions in texts like sadness, joy, love, anger, fear, and surprise. Another classifier is the roBERTa-base model “trained on 58 million tweets and fine-tuned for sentiment analysis using the TweetEval benchmark” (EMNLP 2020).

We then implement the get_lyric_sentiment function, which uses three classifiers to calculate sentiment scores from pre-processed lyrics.

# Function to perform sentiment analysis
def get_lyric_sentiment(lyrics, classifiers):
    text = " ".join(lyrics)
    scores = {}
    for classifier in classifiers:
        try:
            # top_k=None returns a score for every label, not just the top one
            predictions = classifier(text, truncation=True, top_k=None)
            for prediction in predictions:
                scores[prediction["label"]] = prediction["score"]
        except Exception as e:
            print(f"Error during sentiment analysis: {e}")
    return scores

Below is a graphical representation of the results obtained from the roBERTa-base model. According to the TweetEval reference paper and official GitHub repository, the labels 0, 1, and 2 correspond to Negative, Neutral, and Positive, respectively.
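For readability, the opaque labels can be renamed before plotting. A small sketch (the `rename_labels` helper is ours):

```python
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

def rename_labels(scores):
    """Map the roBERTa pipeline's LABEL_* keys to readable names."""
    return {LABEL_NAMES.get(k, k): v for k, v in scores.items()}
```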


Putting it All Together

To summarize, the code efficiently collects data and performs text analysis on every song in a playlist. Specifically, it systematically processes a list of tracks and corresponding artists while simultaneously conducting a thorough cleaning procedure on the lyrics. Additionally, the program computes a sentiment score for each song based on the lyrics, indicating whether the lyrics are positive, negative, or neutral.

name album artist release_date length popularity artist_pop artist_genres acousticness danceability ... joy love anger fear surprise LABEL_0 LABEL_1 LABEL_2 NEGATIVE POSITIVE
0 Please Please Please Please Please Please Sabrina Carpenter 2024-06-06 186365 98 91 ['pop'] 0.274 0.669 ... 0.955393 0.035888 0.002778 0.000637 0.000841 0.251058 0.542962 0.205980 0.857851 0.142149
1 Si Antes Te Hubiera Conocido Si Antes Te Hubiera Conocido KAROL G 2024-06-21 195824 91 89 ['reggaeton', 'reggaeton colombiano', 'trap latino'... 0.446 0.924 ... 0.003352 0.008062 0.982214 0.002481 0.000541 0.043676 0.469266 0.487059 0.957238 0.042762
2 BIRDS OF A FEATHER HIT ME HARD AND SOFT Billie Eilish 2024-05-17 210373 98 94 ['art pop', 'pop'] 0.200 0.747 ... 0.180090 0.008181 0.068175 0.434426 0.034063 0.122799 0.504202 0.372999 0.959893 0.040107

3 rows × 33 columns

In summary, the above code aims to collect and refine song lyrics by eliminating stopwords and conducting lemmatization. Subsequently, it employs pre-trained models for sentiment analysis to determine the prevailing emotion conveyed in the lyrics. Finally, the program compiles all this information into a dataframe for further analysis.


Correlations Matrix

After completing the initial data analysis, we proceed with generating the Pearson correlations matrix using the Pandas command df.corr(). Subsequently, we visualize the matrix using the seaborn heatmap, providing a detailed understanding of the relationships between the various variables in our dataset.

track_sentiment_df = df_final[['name', 'artist',
           'acousticness', 'danceability', 'energy', 'instrumentalness', 
           'loudness', 'speechiness', 'tempo', 'valence', 
           'sadness', 'joy', 'love', 'anger', 'fear', 'surprise',
           'LABEL_0', 'LABEL_1', 'LABEL_2', 'NEGATIVE', 'POSITIVE']]

# Compute the Pearson correlations matrix (numeric columns only)
corr = track_sentiment_df.corr(method='pearson', numeric_only=True)
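The heatmap itself takes only a few lines. A sketch, assuming seaborn and matplotlib are installed (the `plot_corr_heatmap` helper name is ours):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_corr_heatmap(corr):
    """Render a correlation matrix as a centered, diverging heatmap."""
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm",
                square=True, ax=ax)
    ax.set_title("Pearson correlations: audio features vs. sentiment")
    return ax

# plot_corr_heatmap(corr)
```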

The scatterplot below showcases the relationship between energy and fear. The x-axis represents the energy value and the y-axis the fear sentiment; the size of each data point corresponds to the neutral sentiment score, its color represents the valence value, and each bubble is annotated with its energy value, allowing for a straightforward interpretation of the data.
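A bubble chart in that style can be sketched with matplotlib (the `plot_energy_vs_fear` helper is ours; column names follow the merged dataframe above):

```python
import matplotlib.pyplot as plt

def plot_energy_vs_fear(df):
    """x = energy, y = fear; bubble size ~ neutral score, colour = valence."""
    fig, ax = plt.subplots(figsize=(10, 6))
    sc = ax.scatter(df["energy"], df["fear"], s=1000 * df["LABEL_1"],
                    c=df["valence"], cmap="viridis", alpha=0.7)
    for _, row in df.iterrows():  # annotate each bubble with its energy value
        ax.annotate(f"{row['energy']:.2f}", (row["energy"], row["fear"]),
                    ha="center", va="center", fontsize=7)
    fig.colorbar(sc, label="valence")
    ax.set_xlabel("energy")
    ax.set_ylabel("fear")
    return ax
```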


Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction algorithm. We apply PCA to a range of track audio features as well as emotional sentiments such as sadness, joy, love, anger, and more from our data. This lets us transform the data into fewer columns, reducing the dimensionality without losing significant information.

After performing PCA on the data, we apply it to generate a biplot depicting the relationship between the features and tracks. This biplot quickly reveals any discernible patterns and clusters within the dataset.

X_SMALL = df_final[['acousticness', 'danceability', 'energy', 'speechiness', 
                    'tempo', 'valence', 'sadness', 'joy', 'love', 'anger', 
                    'fear', 'surprise', 'name', 'LABEL_0', 'LABEL_1', 'LABEL_2']]

We use the PCA and StandardScaler modules from the sklearn library. First, we select the first 12 columns from our data subset and form a matrix, named \(X_i\). We then standardize the data. Next, we apply PCA to the standardized data, \(X_{st}\). Lastly, we save the obtained loadings and eigenvalues.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standard scaling track audio features
X_i = X_SMALL.iloc[:,0:12]
X_st =  StandardScaler().fit_transform(X_i)

# Apply PCA to scaled data
pca = PCA()
pca_out = pca.fit(X_st)

# component loadings
loadings = pca_out.components_

# get eigenvalues (variance explained by each PC)  
pca_out.explained_variance_
array([2.65011368e+00, 1.78220617e+00, 1.32078217e+00, 1.23927771e+00,
       1.13536483e+00, 1.04777115e+00, 8.75900205e-01, 7.54116010e-01,
       6.06658820e-01, 5.67131303e-01, 2.65575917e-01, 7.98632974e-15])

Next, the following code uses the PCA() function to calculate the PCA scores of the standardized data set, \(X_{st}\).

import numpy as np

features = X_i.columns.values  # feature labels
components = pca.fit_transform(X_st)  # PCA scores
loadings = pca_out.components_.T * np.sqrt(pca_out.explained_variance_)  # scaled loadings

A biplot is generated from the PCA scores and loadings, with the column names of the \(X_i\) data frame used as labels for the plot. The variance explained by the first two principal components is also displayed on the plot.
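Such a biplot can be sketched with matplotlib (the `biplot` helper name is ours; it draws the scores as points and the scaled loadings as arrows):

```python
import matplotlib.pyplot as plt

def biplot(components, loadings, feature_names, var_ratio):
    """Plot PC1/PC2 scores with feature loadings overlaid as arrows."""
    fig, ax = plt.subplots(figsize=(9, 7))
    ax.scatter(components[:, 0], components[:, 1], alpha=0.6)
    for i, name in enumerate(feature_names):
        ax.arrow(0, 0, loadings[i, 0], loadings[i, 1],
                 color="crimson", head_width=0.03)
        ax.annotate(name, (loadings[i, 0] * 1.1, loadings[i, 1] * 1.1), fontsize=8)
    ax.set_xlabel(f"PC1 ({var_ratio[0]:.1%} explained)")
    ax.set_ylabel(f"PC2 ({var_ratio[1]:.1%} explained)")
    return ax

# biplot(components, loadings, features, pca_out.explained_variance_ratio_)
```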

Below is a preview of the PCA scores for the first few tracks.

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
0 -0.652941 -0.540595 0.277601 -1.276863 0.412919 -0.207641 -0.019832 0.451893 -0.773699 -0.432625 0.148075 -1.635591e-08
1 -1.192109 3.319574 -0.687571 1.032594 0.775812 -0.537727 -1.541805 -0.796704 -0.756995 0.949406 -0.429861 8.099381e-08
2 1.080397 0.204470 1.129465 -0.305987 0.566007 0.881001 -0.641871 0.360368 0.671404 0.286174 -0.287698 -1.186518e-07
3 -1.306320 -0.990882 0.597186 -1.050503 0.141820 -0.638868 -0.483447 0.085446 0.096276 0.019933 0.731528 7.415103e-09
4 -1.335645 -1.277782 -0.085476 -1.018065 0.395467 0.719844 -0.612023 0.245817 -0.426194 0.805717 0.072715 -9.832706e-08

The variance ratios for the PCA output and the cumulative sum of the explained variance ratios are printed below. Specifically, the array displayed represents the amount of variability explained by each component.

print(pca_out.explained_variance_ratio_)
print('----')
print(pca_out.explained_variance_ratio_.cumsum())
[2.16425951e-01 1.45546837e-01 1.07863877e-01 1.01207680e-01
 9.27214608e-02 8.55679769e-02 7.15318501e-02 6.15861408e-02
 4.95438036e-02 4.63157231e-02 2.16886999e-02 6.52216928e-16]
----
[0.21642595 0.36197279 0.46983667 0.57104434 0.66376581 0.74933378
 0.82086563 0.88245177 0.93199558 0.9783113  1.         1.        ]
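One practical use of the cumulative sum is deciding how many components to keep. For instance, a 90% variance threshold can be read off programmatically (the `n_components_for` helper is ours):

```python
import numpy as np

def n_components_for(var_ratio, threshold=0.90):
    """Smallest number of PCs whose cumulative explained variance
    reaches the given threshold."""
    return int(np.argmax(np.cumsum(var_ratio) >= threshold) + 1)

# n_components_for(pca_out.explained_variance_ratio_)  # -> 9 for this playlist
```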

The loading vectors help visualize the relationship between the original variables and their respective components. These vectors represent the weights of the variables within a mathematical equation used to generate the principal components.

df_weights = pd.DataFrame(pca_out.components_.T,
                          columns=[f"PC{i+1}" for i in range(X_i.shape[1])],
                          index=X_i.columns)
df_weights
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
acousticness 0.253094 0.425560 -0.081137 -0.051999 0.120657 -0.270423 0.038467 0.565513 -0.512979 -0.059977 -0.264270 -2.786806e-08
danceability -0.428916 0.327036 -0.026959 -0.109895 0.101021 0.030147 -0.217469 0.138922 0.523550 -0.012503 -0.586779 4.801649e-10
energy -0.309072 -0.169597 -0.250304 0.309696 -0.430665 0.106693 0.308849 0.025428 -0.270576 0.482465 -0.343909 -2.016976e-08
speechiness -0.188700 0.395047 -0.062115 -0.032742 -0.118204 0.304929 0.670314 -0.052322 0.033915 -0.472329 0.141921 6.140892e-09
tempo 0.102180 0.112419 0.250471 0.222759 -0.388726 -0.698286 0.097755 -0.348045 0.104721 -0.220550 -0.188555 9.069694e-09
valence -0.437115 0.155936 0.052694 0.248609 -0.017721 -0.338282 -0.003064 0.417016 0.182694 0.208423 0.595445 -1.736405e-08
sadness 0.396350 0.061848 -0.457667 -0.064144 -0.420210 0.034624 -0.077819 0.190498 0.359314 0.010301 0.101546 5.156194e-01
joy -0.412556 -0.266452 0.186634 -0.496523 -0.074336 -0.157048 0.007957 0.032509 -0.274236 -0.148999 -0.020153 5.901792e-01
love -0.030564 -0.320082 -0.212120 0.566633 0.503138 -0.097366 0.140785 0.062050 0.057297 -0.303900 -0.150700 3.548201e-01
anger -0.128199 0.542932 -0.134174 0.206550 0.133607 0.089812 -0.313582 -0.502087 -0.289683 0.161095 0.126852 3.547164e-01
fear 0.270561 0.132370 0.620053 0.077554 0.159411 0.149327 0.334458 0.064617 0.185699 0.427809 -0.083975 3.612335e-01
surprise -0.038259 -0.024411 0.410989 0.395483 -0.381912 0.392240 -0.401021 0.254973 -0.134342 -0.353076 -0.027541 6.012030e-02

K Means Clustering

Clustering is an unsupervised machine learning technique used to categorize data points into distinct groups based on their similarities. In Spotify data, for example, clustering can categorize songs into genres or moods by analyzing characteristics like tempo, beat, and instrumentals.

Next, we apply K-means clustering to the dimensionally reduced Spotify data to explore patterns in track audio features and sentiment. K-means is an unsupervised algorithm that requires the number of clusters (\(K\)) to be chosen in advance; we use the Elbow method to determine the optimal \(K\).
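The Elbow method itself is a short loop over candidate \(K\) values: plot the within-cluster sum of squares (inertia) and look for the bend where it stops dropping sharply. A sketch (the `elbow_plot` helper is ours):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    """Fit K-means for K = 1..k_max and plot the inertia curve."""
    ks = range(1, k_max + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]
    fig, ax = plt.subplots()
    ax.plot(ks, inertias, marker="o")
    ax.set_xlabel("number of clusters K")
    ax.set_ylabel("inertia")
    return inertias
```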

As shown below, we use the KMeans algorithm from the sklearn.cluster library to categorize songs based on features like energy levels and sound qualities. Applying this to the “playlist_tracks” dataset, we create three clusters, dropping the non-numeric “name” column to focus on track features.

from sklearn.cluster import KMeans
playlist_tracks = df_final[['name', 'acousticness', 'danceability', 'energy', 'liveness',
                            'instrumentalness', 'speechiness', 'valence',
                            'sadness', 'joy', 'love', 'anger', 'fear']]  # 'surprise' omitted

kmeans = KMeans(n_clusters=3)
kmeans.fit(playlist_tracks.drop(['name'], axis=1))
KMeans(n_clusters=3)

Visualizing the Clusters

Moving forward, let’s look at differences in the audio features of each group.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(playlist_tracks.drop(['name'], axis = 1))
scaled_data = scaler.transform(playlist_tracks.drop(['name'], axis = 1))
from sklearn.decomposition import PCA
pca = PCA(n_components =2)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)

playlist_tracks['group'] = kmeans.labels_.astype(str)

# Mean of each audio/sentiment feature within each cluster
means = playlist_tracks.drop(columns='name').groupby('group').mean()
means
acousticness danceability energy liveness instrumentalness speechiness valence sadness joy love anger fear
0 0.20839 0.672 0.60875 0.144285 0.020388 0.059735 0.54265 0.151202 0.108936 0.206101 0.253816 0.259347
1 0.126155 0.74595 0.67765 0.110685 0.017031 0.052625 0.6812 0.028329 0.854914 0.046817 0.0533 0.013642
2 0.3229 0.615 0.6254 0.13608 0.000763 0.0478 0.4399 0.893231 0.046527 0.002716 0.032235 0.023872

Organizing Songs in a Playlist

K-means is an unsupervised clustering algorithm that partitions data into \(K\) clusters, grouping similar points together. Using Spotify data, we can cluster songs based on attributes like acousticness, danceability, and energy. We import Python libraries such as pandas, matplotlib, and sklearn for data manipulation, visualization, and clustering. After obtaining song attributes, we use the describe function to gain insights and prepare the data for clustering, as demonstrated below.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import cluster, decomposition

songs = df_final[['name','acousticness', 'danceability', 'energy', 'instrumentalness', 
            'liveness', 'speechiness', 'valence',  'loudness_scaled', 
            'anger', 'love', 'sadness']]
songs.describe()
acousticness danceability energy instrumentalness liveness speechiness valence loudness_scaled anger love sadness
count 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000
mean 0.198398 0.690180 0.639640 0.015120 0.129204 0.054504 0.577520 0.551367 0.129293 0.101710 0.250459
std 0.194310 0.144655 0.130139 0.059203 0.079672 0.032362 0.234215 0.221284 0.244960 0.245032 0.356076
min 0.000938 0.264000 0.386000 0.000000 0.029300 0.026400 0.190000 0.000000 0.000206 0.000193 0.001261
25% 0.040675 0.630500 0.552000 0.000000 0.082600 0.034775 0.373000 0.376041 0.002251 0.002018 0.007991
50% 0.140000 0.700500 0.635000 0.000003 0.106000 0.046450 0.591500 0.581936 0.026904 0.006858 0.041605
75% 0.270750 0.780000 0.727250 0.000082 0.139250 0.063300 0.775500 0.699484 0.124115 0.032315 0.401728
max 0.799000 0.936000 0.946000 0.336000 0.403000 0.204000 0.957000 1.000000 0.993250 0.994453 0.998848

The first step is to extract song labels from the dataset. We then select key features to input into the Affinity Propagation clustering algorithm from the scikit-learn library. We set a preference value of -200 to optimize clustering performance. After inputting the data, we train the algorithm to cluster the Spotify songs effectively.

labels = songs.values[:, 0]
X = songs.values[:, 1:12]
af = cluster.AffinityPropagation(preference=-200)
af.fit(X)
AffinityPropagation(preference=-200)
Category 0
-----
Please Please Please
Not Like Us
Beautiful Things
LUNCH
End of Beginning
Belong Together
Slow It Down
the boy is mine
360
Rockstar
One Of The Girls (with JENNIE, Lily Rose Depp)
Parking Lot
Gata Only
Santa
Magnetic

Category 1
-----
Si Antes Te Hubiera Conocido
Nasty
greedy
BAND4BAND (feat. Lil Baby)

Category 2
-----
BIRDS OF A FEATHER
we can't be friends (wait for your love)
Tough
Fortnight (feat. Post Malone)
Close To You
Stumblin' In
Scared To Start

Category 5
-----
Good Luck, Babe!
A Bar Song (Tipsy)
MILLION DOLLAR BABY
Too Sweet
I Had Some Help (Feat. Morgan Wallen)
Espresso
i like the way you kiss me
Houdini
I Don't Wanna Wait
Smeraldo Garden Marching Band (feat. Loco)
Water
Illusion

Category 3
-----
Stargazing
Lose Control
Austin
I Can Do It With a Broken Heart
GIRLS
Saturn
Stick Season
Lies Lies Lies
feelslikeimfallinginlove

Category 4
-----
HOT TO GO!
Move
28

The script effectively categorized the playlist into six distinct groups based on shared features, resulting in a diverse selection of songs within each category.