The idea behind this small module was to put into practice some of the things I have been learning over the past month, since I began studying Python for data analysis with DataQuest.
As I was looking for datasets to clean and work on from the web (Kaggle, Google, Reddit, ...), I came to the conclusion that the easiest way to begin was to work on data I personally knew: thus I had the idea of downloading my Spotify listening history.
The compressed file contained a bunch of JSON files, one of which held my listening history: for each track played it gives the track name, the corresponding artist, the play time in milliseconds, and the end time (in a year-month-day hour-min-sec format).
It seemed logical to try to complete this information with additional data.
To do so, I installed the Spotipy library, a Python client for searching data through the Spotify Web API.
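As a minimal, hypothetical sketch of what such a search could look like with Spotipy (the credentials are placeholders and the two helper functions are my own, not part of Spotipy):

```python
# Authentication would look roughly like this (placeholders, not real credentials):
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
#     client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET'))

def build_track_query(artist, track):
    """Build a Spotify search query string for one (artist, track) pair."""
    return f"artist:{artist} track:{track}"

def search_track_id(sp, artist, track):
    """Return the Spotify id of the best match, or None if nothing is found."""
    results = sp.search(q=build_track_query(artist, track), type='track', limit=1)
    items = results['tracks']['items']
    return items[0]['id'] if items else None

print(build_track_query('Some Artist', 'Some Track'))
```

The actual search was run in a separate notebook (see below); this only illustrates the shape of one lookup.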
#classic libraries for analysis
import pandas as pd
import numpy as np
import json
import datetime as dt
#plotting with matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
#plotting directly in Jupyter notebook
%matplotlib inline
#plotting with plotly (offline default mode)
import plotly
plotly.__version__
import plotly.express as px
#plotting directly in Jupyter notebook
import plotly.io as pio
pio.renderers.default='notebook'
#displaying all output for each cell
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"
history_path = 'StreamingHistory0.json'
history = pd.read_json(history_path)
history.head(10)
history.info()
history.shape
Before anything, we want to keep only the songs that were played for more than 30 seconds.
history = history[history['msPlayed']>30000]
history.reset_index(drop=True, inplace=True)
print(history.shape)
print(history.tail(10))
We also check whether there are any duplicates on the endTime, artistName and trackName columns.
history.duplicated(['endTime','artistName', 'trackName']).sum()
No duplicates, yeehaw! There are no abnormal duplicate rows (rows with equal values for endTime, artistName and trackName), which means we can work on our data.
The search through the Spotify Web API was performed in another Python notebook, 'Spotify web API search.ipynb'.
The search returned the 'tracks_df.csv' file, which we will work on in the following lines.
tracks_df = pd.read_csv('tracks_df.csv')
print(tracks_df.shape)
print(tracks_df.head(10))
Now that we have retrieved most of the information we will want to use for our analysis in tracks_df (we still need genres info as well as audio_features info), we can merge history with tracks_df.
Nonetheless, we have to take into account that there are duplicates in the history dataframe, partly because some tracks from different artists share the same name. So it would be better to merge on track_id rather than track_name.
To do so, we have to add a track_id column to the history dataframe.
play = pd.merge(left=history, right=tracks_df, left_on=['artistName', 'trackName'], right_on=['artist_name', 'track_name'], how='left')
print(play.shape)
print(play.head(10))
print(play.duplicated(['endTime', 'artistName', 'trackName']).sum())
print(14640 - 8440)
So 8440 duplicates were created during the merging process, though we do not know yet how that happened.
For now, let's remove them manually; we'll investigate the merging process a bit more later.
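A toy example (made-up names and ids, not the actual data) shows how a left merge multiplies rows when the right-hand dataframe holds several rows for the same key, which is one way such extra rows can appear:

```python
import pandas as pd

# Toy dataframes: the right-hand side holds two rows for the key ('A', 't1'),
# e.g. the same track returned twice by the API search with different ids.
left = pd.DataFrame({'artistName': ['A', 'B'], 'trackName': ['t1', 't2']})
right = pd.DataFrame({'artist_name': ['A', 'A', 'B'],
                      'track_name': ['t1', 't1', 't2'],
                      'track_id': ['id1', 'id1bis', 'id2']})
merged = pd.merge(left=left, right=right,
                  left_on=['artistName', 'trackName'],
                  right_on=['artist_name', 'track_name'], how='left')
print(len(left), len(merged))  # 2 rows become 3: ('A', 't1') matched twice
```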
Before removing anything, we want to check whether we have the same unique ('artistName', 'trackName') pairs in the history dataframe and the merged dataframe after dropping duplicates.
#We create a copy of the merged play dataframe in order to drop duplicates.
play_clean = play.copy()
play_clean.drop_duplicates(['endTime', 'artistName', 'trackName'], inplace=True)
play_clean.reset_index(drop=True, inplace=True)
print(play_clean.shape)
Since we have doubts about the merging process on the artist and track columns, let's determine which pair of columns to keep to be sure we retain the same information as the history dataframe.
#We check whether the merged dataframe without duplicate values still holds
#the same unique ('artistName','trackName') pairs as ('artistName','trackName') from history.
play_clean_gb = play_clean.groupby(['artistName','trackName']).size()
history_gb = history.groupby(['artistName','trackName']).size()
print(play_clean_gb.equals(history_gb))
#We check whether the merged dataframe without duplicate values still holds
#the same unique ('artist_name','track_name') pairs as ('artistName','trackName') from history.
play_clean_gb = play_clean.groupby(['artist_name','track_name']).size()
history_gb = history.groupby(['artistName','trackName']).size()
print(play_clean_gb.equals(history_gb))
print(play_clean.groupby(['artist_name','track_name']).size().shape)
print(history.groupby(['artistName','trackName']).size().shape)
We did not lose any information by dropping duplicates based on ('artistName','trackName'), though we can see that the merging process made us lose information in the ('artist_name','track_name') columns of play.
Therefore, we will keep the ('artistName','trackName') columns to retain all available information.
play.drop(columns=['artist_name', 'track_name'], axis=1, inplace=True)
play.drop_duplicates(['endTime', 'artistName', 'trackName'], inplace=True)
play.reset_index(drop=True, inplace=True)
print(play.shape)
print(play.head(10))
Let's do some cleaning before anything else:
play.info()
play.rename(columns={'endTime':'end_time',
                     'msPlayed':'ms_played',
                     'trackName':'track',
                     'artistName':'artist'}, inplace=True)
play.columns
Finally, we want to convert the end_time column values to datetime objects in order to manipulate them more easily, and create another column containing ordinal values for the dates, in order to plot them. We also convert ms_played to floats.
play['end_time'][:5]
play['end_time'] = pd.to_datetime(play['end_time'])
play['ordinal_time'] = [x.toordinal() for x in play['end_time']]
play.loc[:, 'ms_played'] = play.loc[:, 'ms_played'].astype('float')
print(play.info())
print(play.head())
Now we can do some exploratory analysis!
First, let's see what are the most played tracks and most played artists.
To rank tracks based on listening count, we apply the .groupby() method on track and artist, since some tracks from different artists have the same name.
Let's remember for the rest of the analysis that for us, a unique 'track' is defined by the pair ['track', 'artist'].
play.groupby(['track','artist']).size().reset_index(name='count').sort_values(by='count', ascending=False)
So it seems that the track that I have been BINGEING on is unknown to Spotify, which is rather strange.
artist_count = play.groupby('artist').size().reset_index(name='count').sort_values(by='count', ascending=False)
artist_count
At this point, these results are VERY astonishing to me: for example, I didn't even remember that I used to listen to Mick Turner, and some artists that I have been listening to a lot do not even appear in the top 5! Let's investigate a bit more, though.
At some point we will want to study listening patterns throughout the year, so we will need a column containing the dates as numeric values, which will make plotting easier.
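This is what the ordinal_time column created earlier provides: toordinal() maps each date to an integer day count (days since 0001-01-01), giving a purely numeric axis. A quick standard-library illustration:

```python
import datetime as dt

# toordinal() counts days since 0001-01-01; fromordinal() inverts the mapping,
# and consecutive dates differ by exactly one unit.
d = dt.date(2019, 4, 13)
o = d.toordinal()
print(o)                             # an integer, one unit per day
print(dt.date.fromordinal(o) == d)   # True: the mapping is reversible
```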
fig = px.bar(artist_count[:10], x='artist', y='count', color='count')
fig.update_layout(title='Number of tracks played per artist')
fig
Let's look at the total amount of play time per track and per artist.
play_time = play.groupby(['artist','track'])['ms_played'].sum().reset_index(name='ms_played')
play_time
What are the 20 most listened tracks in terms of total play time?
top_tracks = play_time.sort_values(by='ms_played', ascending=False)[:21]
#We remove the first track since it's unknown to Spotify.
top_tracks = top_tracks[1:21]
print(top_tracks[['artist', 'track']])
#display images with matplotlib
import matplotlib.image as mpimg
img = mpimg.imread('mavisstaples.jpg')
plt.imshow(img)
plt.xticks(ticks=[])
plt.yticks(ticks=[])
plt.show()
Let's check which are the top artists in terms of total play time.
top_artists = play.pivot_table(index='artist', values='ms_played', aggfunc=np.sum).sort_values(by='ms_played', ascending=False)
top_artists
# def get_hours_played(time_played):
#     hours = time_played.seconds // 3600
#     minutes = (time_played.seconds % 3600) // 60
#     seconds = time_played.seconds - hours*3600 - minutes*60
#     return [hours, minutes, seconds]
Let's analyze listening trends by week over the year.
play.head()
play.info()
#We first set the index to 'end_time' and resample at the month or week or any other level
play_week = play[['end_time', 'ms_played']].set_index('end_time').resample('W').sum()
play_week
fig = px.bar(play_week, x=play_week.index, y=play_week['ms_played']/3.6e6, color=play_week['ms_played']/3.6e6)
#we want the xaxis tick values to be all the months
month_serie = pd.to_datetime(play['end_time'].dt.strftime('%Y-%m'), format='%Y-%m')
#some modifications to the figure layout
fig.update_layout(xaxis_tickvals=month_serie,
                  xaxis_title=None,
                  yaxis_title='Hours played',
                  title='Total listening time per week',
                  coloraxis_colorbar=dict(title=None))
#print layout when necessary
# print(fig.layout)
fig.show()
#Selecting only 'end_time' and 'ms_played' from the play dataframe (as a copy, to avoid a SettingWithCopyWarning)
play_day = play[['end_time', 'ms_played']].copy()
#Getting the time from datetime index
play_day['end_time'] = pd.to_datetime(play_day['end_time'].dt.strftime('%H:%M'), format='%H:%M')
#Setting the index
play_day.set_index('end_time', inplace=True)
#Converting time played from milliseconds to minutes and changing column name
play_day.loc[:,'ms_played'] = play_day.loc[:,'ms_played']/60000
play_day.rename(columns={'ms_played':'minutes_played'}, inplace=True)
play_day
#Resampling by 30 minutes periods (mandatory datetime index for this operation)
play_day = play_day.resample('30min').sum()
#Selecting the time from play_day.index
play_day.index = play_day.index.time
play_day
fig = px.bar(play_day, x=play_day.index, y='minutes_played', color='minutes_played')
fig.update_layout(xaxis_title=None,
                  title='Listening time per day')
# print(fig.layout)
fig.show()
A clear listening pattern comes out of this graph: two listening peaks appear, at 8:30 and at 18:00.
I was employed most of this past year, so the decline in listening time between 12:00 and 14:30 corresponds directly to lunch time, during which I don't listen to much music.
The peak at 8:30 corresponds either to me arriving at work and starting to work with music, or to getting up on non-working days and immediately playing music. Either way, it shows that this is when I get going on most days!
The peak at 6pm corresponds either to me trying to find some motivation while at work, or to me being at home and putting on some music while planning my evening.
top_10_artists = list(top_artists.index[:10])
play_artist = play[play['artist'].isin(top_10_artists)].copy()  #copy to avoid a SettingWithCopyWarning
#We want to aggregate by month, so we need to modify the 'end_time' column
play_artist.loc[:,'end_time'] = pd.to_datetime(play['end_time'].dt.strftime('%Y-%m'), format='%Y-%m')
play_artist = pd.pivot_table(play_artist, index='end_time', values='ms_played', columns='artist', aggfunc=np.sum)
#Stacking the dataframe to get tidy data and bar with plotly express
play_artist = play_artist.stack().reset_index(level=1)
play_artist.columns = ['artist', 'ms_played']
#Change the play time to minutes
play_artist.loc[:,'ms_played'] = play_artist.loc[:,'ms_played']/60000
play_artist.rename(columns={'ms_played':'minutes_played'}, inplace=True)
play_artist.index.name = None
# print(play_artist['artist'].value_counts())
fig = px.bar(play_artist, x=play_artist.index, y="minutes_played", color='minutes_played', facet_row="artist")
fig.update_layout(
    height=1200,
    title='Total play time for the 10 most played artists, in minutes',
    plot_bgcolor='rgba(250,250,250,250)',
    xaxis_tickvals=month_serie)
#removing yaxes titles('ms_played')
for i in range(1, 11):  #plotly subplot rows are 1-indexed
    fig.update_yaxes(title=None, row=i)
#removing bottom xaxes title
fig.update_xaxes(title=None)
#modifying annotations (artist name)
#get access to all layout information
# print(fig.layout)
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1]))
fig.for_each_annotation(lambda a: a.update(font_size=15))
fig.for_each_annotation(lambda a: a.update(textangle=0))
fig.for_each_annotation(lambda a: a.update(x=0.01))
fig.for_each_annotation(lambda a: a.update(yanchor='bottom'))
fig.show()
I don't know why, at this point, the two axes at the bottom of the figure don't display any bars (in fact there is a very thin bar in each graph). We will have to investigate this issue a bit further.
Ideas:
- change the y-axis limits (more than 400, e.g. 500)
- see what it displays when changing the aggregation to weeks rather than months
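The second idea can be sketched on toy data before applying it to play (the column names mirror the real dataframe; the dates and values are made up):

```python
import pandas as pd

# Toy frame mirroring the real columns; resampling weekly ('W') instead of
# reformatting end_time to '%Y-%m' gives one bar per week per artist.
toy = pd.DataFrame({
    'end_time': pd.to_datetime(['2019-05-01', '2019-05-03', '2019-05-10']),
    'artist': ['A', 'A', 'A'],
    'ms_played': [60000.0, 120000.0, 180000.0]})
weekly = (toy.set_index('end_time')
             .groupby('artist')['ms_played']
             .resample('W').sum()
             .reset_index())
print(weekly)  # two weekly rows for artist 'A'
```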
#what if we want to keep a daily precision?
top_10_artists = list(top_artists.index[:10])
play_artist_day = play[play['artist'].isin(top_10_artists)].copy()  #copy to avoid a SettingWithCopyWarning
#We want to aggregate by day, so we need to modify the 'end_time' column
play_artist_day['end_time'] = pd.to_datetime(play['end_time'].dt.strftime('%Y-%m-%d'), format='%Y-%m-%d')
play_artist_day = pd.pivot_table(play_artist_day, index='end_time', values='ms_played', columns='artist', aggfunc=np.sum)
play_artist_day = play_artist_day.stack().reset_index(level=1)
play_artist_day.columns = ['artist', 'ms_played']
#let's change the play time to minutes
play_artist_day.loc[:,'ms_played'] = play_artist_day.loc[:,'ms_played']/60000
play_artist_day.rename(columns={'ms_played':'minutes_played'}, inplace=True)
play_artist_day.index.name = None
#play_artist_day
#plotting with matplotlib
fig, axes = plt.subplots(10, 1, figsize=(12, 17))
#some visual modifications of the figure
for index, ax in enumerate(axes):
    ax_artist = play_artist_day[play_artist_day['artist'] == top_10_artists[index]]
    ax.bar(x=ax_artist.index, height=ax_artist['minutes_played'], width=1)
    ax.set_xlim([dt.datetime(2019, 4, 13), dt.datetime(2020, 4, 12)])
    ax.set_ylabel(top_10_artists[index], rotation=0, labelpad=100, fontsize=20)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
plt.show()