Spotify is a product that has revolutionized and dominated the music listening market. With over 125 million subscribers, Spotify is the leader in the audio streaming market, and that's not even counting all of the listeners who use the app for free. Listening to music is an activity that most Americans take part in every day, so it's important for musicians and artists to track the trends and directions that listeners follow in order to compete in an increasingly competitive market. Using the data that Spotify offers, we fit regressions on song attributes to see which ones correlate with popularity. Our analysis should help others consider the different factors that may affect the number of streams that a given artist receives.
The data that we collected has various attributes for every song, as described on Spotify's developer page: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/
duration_ms: The duration of the track in milliseconds.
key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
time_signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
id: The Spotify ID for the track.
type: The object type: “audio_features”
popularity: The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.
The first step in the data science pipeline is obtaining data and making sure that it is in a usable form for analysis, visualization, and modeling.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('/content/data.csv')
df.head()
df.info()
df.isnull().sum()
The data is already fairly clean and won't need much more tidying. However, there are some empty entries that need to be removed, and the values in the "artists" column are wrapped in brackets and apostrophes. Below we drop every row with empty entries and strip the brackets and apostrophes from the artists. This is done using the following code:
# removing nulls
df = df.dropna()
# strip quotes and brackets as literal characters (regex=False), not as regex patterns
df['artists'] = df['artists'].str.replace("'", "", regex=False)
df['artists'] = df['artists'].str.replace("[", "", regex=False)
df['artists'] = df['artists'].str.replace("]", "", regex=False)
df.head()
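The cleaning step above strips the brackets and apostrophes as plain characters. As an alternative sketch, and assuming each raw "artists" value is a Python-style list literal (which the brackets and apostrophes suggest, but which we have not verified for every row), the string could instead be parsed into a real list. The helper name `parse_artists` and the sample value are hypothetical and only illustrate the idea.

```python
import ast

def parse_artists(raw):
    """Turn a string such as "['A', 'B']" into the list ['A', 'B']."""
    parsed = ast.literal_eval(raw)
    # Guard against a bare string literal such as "'A'".
    return [parsed] if isinstance(parsed, str) else list(parsed)

# Hypothetical sample value, written the way the raw column appears to look.
raw_value = "['Tyler, The Creator', 'Kali Uchis']"
parsed_artists = parse_artists(raw_value)
print(parsed_artists)

# For comparison: stripping characters and splitting on commas, as in the
# cleaning step above, breaks a name that itself contains a comma.
cleaned = raw_value.replace("[", "").replace("]", "").replace("'", "")
split_artists = [x.strip() for x in cleaned.split(",")]
print(split_artists)
```

Parsing keeps a name that contains a comma in one piece, which matters later when popularity is summed per artist. Whether this is worth the extra step depends on how many such names the dataset actually contains.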
For our data analysis, we take a closer look at patterns and correlations in the data and visualize them for a general audience. Visual summaries are important because the raw CSV file is hard to interpret, especially at a high level. Through graphs and plots, we can better understand our data and get an overview of it.
plt.figure(figsize=(16, 8))
sns.set(style="whitegrid")
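# correlate the numeric columns only; text columns such as artists cannot be correlated
corr = df.select_dtypes("number").corr()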
sns.heatmap(corr,annot=True, cmap="YlGnBu")
Looking at the correlation table, we can note a few observations on what attributes make a song more popular.
Utilizing the correlation table, we can also note a few other observations about attributes.
From this data, we can predict that an artist with a high energy song containing electric instruments has the best chance of gaining the most popularity.
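To read the same relationships off as numbers rather than colors, one option is to pull the popularity column out of the correlation matrix and sort it. The sketch below uses a small made-up table (the values are hypothetical; only the column names follow the dataset), so its output says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the dataset.
toy = pd.DataFrame({
    "popularity": [10, 20, 30, 40, 50],
    "energy": [0.1, 0.3, 0.2, 0.5, 0.4],
    "acousticness": [0.9, 0.7, 0.5, 0.3, 0.1],
    "artists": ["a", "b", "c", "d", "e"],  # non-numeric, excluded below
})

# Pearson correlations between the numeric columns.
corr = toy.select_dtypes("number").corr()

# Correlation of every other attribute with popularity, strongest positive first.
ranked = corr["popularity"].drop("popularity").sort_values(ascending=False)
print(ranked)
```

Sorting makes it easy to see at a glance which attributes move with popularity and which move against it.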
plt.figure(figsize=(20, 10))
sns.set(style="whitegrid")
# the columns that we are interested in
target_cols = ["acousticness","danceability","energy","speechiness","liveness","valence"]
for c in target_cols:
    # group the data by year, and plot the mean score of all music in that year
    x = df.groupby("year")[c].mean()
    ax = sns.lineplot(x=x.index, y=x, label=c)
ax.set_title('Music Attributes over the years', fontsize = 20)
ax.legend(fancybox=True, framealpha=1, shadow=True, prop={'size': 15}, loc = 'upper right')
ax.set_ylabel('Score', fontsize = 20)
ax.set_xlabel('Year', fontsize = 20)
As you can see, prior to the 1960s, mainstream music had a high level of acousticness. After 1960, however, music started to become more energetic, and danceability rose as well. Naturally, as the level of energy rose, the level of acousticness fell. We suspect this is due to the rise of two specific genres: hip-hop and EDM. Both have risen to fame, and since most of these songs are hyped and energetic, they help explain both the drop in acousticness and the increase in energy. Interestingly, danceability did not increase in step with energy. This suggests that people found energetic music and acoustic music about equally easy to dance to.
plt.figure(figsize=(20, 5))
# mean loudness over year
x = df.groupby('year')["loudness"].mean()
ax = sns.lineplot(x=x.index, y=x, label="loudness")
# set axis label and titles
ax.set_title('Mean of Loudness of Songs over the Year', fontsize = 20)
ax.set_ylabel('Loudness', fontsize = 20)
ax.set_xlabel('Year', fontsize = 20)
As you can see, the mean loudness of music has been rising rapidly since shortly after the 1950s. This may also reflect how people's taste in music has changed over time. With the growth of EDM and hip-hop, people started to enjoy music that is "loud" and has "heavy bass", especially in the rap industry. It could also reflect progress in recording technology: adding more layers or audio effects often increases loudness too.
plt.figure(figsize=(40, 15))
sns.set(style="whitegrid")
# group by the song's name and rank them based on their popularity
x = df.groupby("name")["popularity"].mean().sort_values(ascending=False).head(10)
axis = sns.barplot(x=x.index, y=x)
axis.set_ylabel('Popularity', fontsize=40)
axis.set_xlabel('song title', fontsize=40)
As shown, most of the top-ranked songs are either upbeat rap music or have some hip-hop elements in them. This is consistent with our earlier observation that mainstream music has become more energetic and upbeat, and with the drop in acousticness.
from collections import Counter
artist_popularity_sum = Counter()
# since a song can have different artists, we add the popularity score to each
# artists
for l in df[["artists", "popularity"]].to_numpy():
    artist_list = [x.strip() for x in l[0].split(',')]
    for artist in artist_list:
        artist_popularity_sum[artist] += float(l[1])
top_10_artist = artist_popularity_sum.most_common(10)
xs = [a[0] for a in top_10_artist]
ys = [a[1] for a in top_10_artist]
plt.figure(figsize=(20, 10))
sns.set(style="whitegrid")
axis = sns.barplot(x=xs, y=ys)
axis.set_ylabel('Popularity')
axis.set_xlabel('artist')
We can see that famous bands like The Beatles and Queen rank at the top. We can also see that newer artists, such as Taylor Swift and Eminem, are near the top of the ranking. Note that this dataset spans many decades of releases, up to 2020, which is why these older bands appear on the chart.
data_2020 = df.loc[df['year'] == 2020]
artist_popularity_sum = Counter()
# since a song can have different artists, we add the popularity score to each
# artists
for l in data_2020[["artists", "popularity"]].to_numpy():
    artist_list = [x.strip() for x in l[0].split(',')]
    for artist in artist_list:
        artist_popularity_sum[artist] += int(l[1])
top_10_artist = artist_popularity_sum.most_common(10)
xs = [a[0] for a in top_10_artist]
ys = [a[1] for a in top_10_artist]
plt.figure(figsize=(20, 10))
sns.set(style="whitegrid")
axis = sns.barplot(x=xs, y=ys)
axis.set_ylabel('Popularity')
axis.set_xlabel('artist')
plt.figure(figsize=(30, 10))
xs = df["year"].to_numpy()
ys = df["popularity"].to_numpy()
plt.ylabel('Popularity')
plt.xlabel('year')
plt.title("popularity over year")
plt.plot(xs, ys, '.')
plt.show()
plt.figure(figsize=(20, 5))
sns.set(style="whitegrid")
# group by the song's popularity and then get the mean year
x = df.groupby("popularity")["year"].mean()
axis = sns.lineplot(x=x.index, y=x)
axis.set_ylabel('mean year')
axis.set_xlabel('popularity')
axis.set_title("mean year for song popularity")
We can see that a newly released song has a higher likelihood of being popular, which makes sense. Newer songs have a higher chance of being heard, and Spotify calculates popularity based largely on how recent the plays are. So it is reasonable that newer songs score higher.
plt.figure(figsize=(30, 10))
xs = df["energy"].to_numpy()
ys = df["popularity"].to_numpy()
plt.ylabel('Popularity')
plt.xlabel('energy')
plt.title("popularity over energy")
plt.plot(xs, ys, '.')
plt.show()
plt.figure(figsize=(20, 5))
sns.set(style="whitegrid")
# group by the song's popularity and then get the mean energy
x = df.groupby("popularity")["energy"].mean()
axis = sns.lineplot(x=x.index, y=x)
axis.set_ylabel('mean energy level')
axis.set_xlabel('popularity')
axis.set_title("mean energy for song popularity")
When looking at the mean energy for each popularity score, it seems that, on average, more popular songs are more energetic. This is hard to see in the scatter plot, but the trend is clear in the line plot above.
plt.figure(figsize=(30, 10))
xs = df["liveness"].to_numpy()
ys = df["popularity"].to_numpy()
plt.ylabel('Popularity')
plt.xlabel('liveness')
plt.plot(xs, ys, '.')
plt.show()
plt.figure(figsize=(20, 5))
sns.set(style="whitegrid")
# group by the song's popularity and then get the mean liveness
x = df.groupby("popularity")["liveness"].mean()
axis = sns.lineplot(x=x.index, y=x)
axis.set_ylabel('mean liveness level')
axis.set_xlabel('popularity')
axis.set_title("mean liveness for song popularity")
Based on these two plots, we can infer that liveness doesn't affect popularity as much as energy does. We did, however, see a peak in liveness when popularity reaches about 95. This suggests that very popular songs tend to have higher liveness, but otherwise it doesn't seem to matter much. Also, note the drop when popularity reaches 100. This is because only one song has a popularity of 100, which is Dakiti, with a liveness of 0.113. Since a single song determines that point, it makes sense to ignore it when analyzing this graph.
df.loc[df["popularity"] == 100][["name", "year", "popularity", "liveness"]]
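Because a mean over a single song can swing the line plot, it can help to look at how many songs sit behind each popularity score. The following is a minimal sketch on made-up rows (only the column names and the 0.113 liveness value come from the notebook; the other values and the `min_songs` threshold are hypothetical choices for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
toy = pd.DataFrame({
    "popularity": [95, 95, 95, 100],
    "liveness": [0.30, 0.40, 0.50, 0.113],
})

# Mean liveness per popularity score, plus how many songs back each mean.
summary = toy.groupby("popularity")["liveness"].agg(["mean", "count"])

# Keep only the scores backed by at least `min_songs` tracks (arbitrary threshold).
min_songs = 2
reliable = summary[summary["count"] >= min_songs]
print(reliable)
```

Plotting only the rows in `reliable` would leave out points such as the single song at popularity 100, rather than asking the reader to ignore them by eye.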
plt.figure(figsize=(30, 10))
xs = df["acousticness"].to_numpy()
ys = df["popularity"].to_numpy()
plt.ylabel('Popularity')
plt.xlabel('acousticness')
plt.plot(xs, ys, '.')
plt.show()
plt.figure(figsize=(20, 5))
sns.set(style="whitegrid")
# group by the song's popularity and then get the mean acousticness
x = df.groupby("popularity")["acousticness"].mean()
axis = sns.lineplot(x=x.index, y=x)
axis.set_ylabel('mean acousticness level')
axis.set_xlabel('popularity')
axis.set_title("mean acousticness for song popularity")
It seems that more popular songs tend to have lower levels of acousticness. This makes sense considering that mainstream music is mostly EDM and hip-hop, which rely mainly on electronic elements.
Based on our analysis and the heatmap, we hypothesize that music popularity is positively correlated with energy, liveness, and year. To test this, we can use machine learning algorithms such as linear regression to fit a mathematical formula to the data and check whether energy and liveness contribute to music popularity.
import sklearn
# preparing data
from sklearn.model_selection import train_test_split
# linear regression model
from sklearn.linear_model import LinearRegression, ElasticNet, Ridge, Lasso
# Metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
reg = LinearRegression()
df = df.dropna()
xs = df[["energy"]].to_numpy()
ys = df["popularity"].to_numpy()[:, np.newaxis]
reg.fit(xs, ys)
reg.score(xs, ys)
The R² score of 0.23 is not very high, so we can infer that although energy does have some correlation with popularity, the effect is not strong.
reg.coef_, reg.intercept_
Based on the coefficient, we can see that energy is indeed positively correlated with popularity. With such a low regression score, however, we can't confidently say that energy directly influences the popularity of a song.
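The value returned by `reg.score` is the coefficient of determination, R². For a regression on a single feature with an intercept, R² equals the square of the Pearson correlation between the feature and the target, so a score of 0.23 corresponds to a correlation of roughly 0.48. The sketch below checks that identity with NumPy on synthetic numbers (the data are randomly generated stand-ins, not the Spotify data, and the variable names are our own).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for energy (0..1) and popularity; not the Spotify data.
energy = rng.uniform(0.0, 1.0, size=500)
popularity = 20.0 + 30.0 * energy + rng.normal(0.0, 15.0, size=500)

# Ordinary least squares fit of popularity on energy.
slope, intercept = np.polyfit(energy, popularity, deg=1)
predicted = slope * energy + intercept

# R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((popularity - predicted) ** 2)
ss_tot = np.sum((popularity - popularity.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation, the kind of number shown in a correlation heatmap.
pearson_r = np.corrcoef(energy, popularity)[0, 1]

# The two printed values agree up to floating-point error.
print(round(r_squared, 6), round(pearson_r ** 2, 6))
```

This is why the single-feature regression scores in this section carry the same information as the corresponding heatmap cells, only squared.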
plt.figure(figsize=(30, 10))
sns.lineplot(x=[0, 1], y=reg.predict([[0], [1]]).squeeze(), color="red")
sns.scatterplot(x=df["energy"].to_numpy(), y=df["popularity"].to_numpy())
reg = LinearRegression()
df = df.dropna()
xs = df[["liveness"]].to_numpy()
ys = df["popularity"].to_numpy()[:, np.newaxis]
reg.fit(xs, ys)
reg.score(xs, ys)
This regression score is very low, meaning that liveness explains almost none of the variation in popularity. The peak in liveness among very popular songs that we saw earlier might just be a coincidence.
reg.coef_, reg.intercept_
The coefficient shows that, if anything, liveness is negatively correlated with popularity. However, we can ignore this since the regression score is too low.
plt.figure(figsize=(30, 10))
sns.lineplot(x=[0, 1], y=reg.predict([[0], [1]]).squeeze(), color="red")
sns.scatterplot(x=df["liveness"].to_numpy(), y=df["popularity"].to_numpy())
reg = LinearRegression()
df = df.dropna()
xs = df[["year"]].to_numpy()
ys = df["popularity"].to_numpy()[:, np.newaxis]
reg.fit(xs, ys)
reg.score(xs, ys)
Year seems to correlate with popularity the most, with a score of 0.74. This suggests that year is highly correlated with popularity, as we assumed. Year appears to have a large effect on popularity, while the other two attributes (energy and liveness) don't matter as much.
reg.coef_, reg.intercept_
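We can see that year is positively correlated with popularity. Note that the coefficient is small because it measures the change in popularity per single year: the years in the plot span about a century, while popularity only ranges from 0 to 100. The large magnitude of year (> 1900) mainly shows up in the intercept.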
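The size of the coefficient depends on the units that year is measured in, while the regression score does not. As a rough illustration on synthetic data (randomly generated, not the Spotify data; `fit_line` is a hypothetical helper), the sketch below fits the same relationship once against the raw year and once against decades since 1920.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins: release year and a popularity that rises over time.
year = rng.integers(1920, 2021, size=400).astype(float)
popularity = 0.6 * (year - 1920) + rng.normal(0.0, 8.0, size=400)

def fit_line(x, y):
    """Least squares line through (x, y); returns slope, intercept and R^2."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2

# Fit against the raw year, then against decades since 1920.
slope_year, intercept_year, r2_year = fit_line(year, popularity)
decades = (year - 1920) / 10.0
slope_dec, intercept_dec, r2_dec = fit_line(decades, popularity)

print(slope_year, slope_dec)  # the second slope is ten times the first
print(r2_year, r2_dec)        # the score is unchanged by the rescaling
```

So a small coefficient on year does not by itself mean a weak effect; the score and the plotted line are better guides than the raw coefficient.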
plt.figure(figsize=(30, 10))
sns.lineplot(x=[1920, 2020], y=reg.predict([[1920], [2020]]).squeeze(), color="red")
sns.scatterplot(x=df["year"].to_numpy(), y=df["popularity"].to_numpy())
Based on the regression results, it seems that year is still the main factor that determines a song's popularity. Although energy has some effect, it is not strong. Looking at the regression score between energy and popularity, we can say that there has to be more to a song than just a high energy level for it to be popular. Additionally, although very popular music has high liveness scores, it appears that liveness doesn't really have any effect on the popularity of a song. It just happens that the most popular songs nowadays have high liveness.
Based on our results, we can tell that popular music doesn't age well: as time goes by, a song's popularity score tends to decrease. We also learned that the audio attributes of a song don't greatly affect its popularity. If anything, only the energy of a song makes it more popular, and not by much. We can infer that the emotional attachment, the lyrics and background story, or maybe even the artist is what makes a song popular, not the audio attributes.