Spotify history analysis

The idea behind this small module was to put into practice some of the things I have been learning over the past month, since I began studying Python for data analysis with DataQuest.
While looking for datasets to clean and work on from the web (Kaggle, Google, Reddit, ...), I came to the conclusion that the easiest way to begin was to work on data I personally knew: hence the idea to download my Spotify listening history.
The compressed file contained a bunch of JSON files, one of which corresponds to my listening history: for each track played, it gives the track name, the corresponding artist, the play time in milliseconds, and the end time (in a year-month-day hour:minute format).
It seemed logical to try and complete this information with additional data:

  • album name,
  • number of the track in the album,
  • track_id,
  • genre (still to be performed),
  • other Spotify features (still to be performed).

To do so, I installed the Spotipy library, a lightweight Python client for searching data through the Spotify Web API.
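As a reference, here is a minimal sketch of the kind of lookup Spotipy enables (assuming credentials are stored in the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables; the actual search code lives in the companion notebook introduced below):

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

#client-credentials flow: enough for public catalog data, no user login needed
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

#search for a track by artist and title and keep the best match
result = sp.search(q='artist:Jorge Ben Jor track:Taj Mahal', type='track', limit=1)
item = result['tracks']['items'][0]
print(item['id'], item['album']['name'], item['track_number'], item['popularity'])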

In [1]:
#classic libraries for analysis
import pandas as pd
import numpy as np

import json
import datetime as dt

#plotting with matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
#plotting directly in Jupyter notebook
%matplotlib inline

#plotting with plotly (offline default mode)
import plotly
plotly.__version__
import plotly.express as px
#plotting directly in Jupyter notebook
import plotly.io as pio
pio.renderers.default='notebook'


#displaying all output for each cell
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"
In [2]:
history_path = 'StreamingHistory0.json'
history = pd.read_json(history_path)
history.head(10)
Out[2]:
endTime artistName trackName msPlayed
0 2019-04-13 17:19 Peter Sarstedt Where Do You Go To (My Lovely) 129954
1 2019-04-13 17:23 Jorge Ben Jor A Banda do Zé Pretinho 239973
2 2019-04-13 17:26 Jorge Ben Jor Taj Mahal 184973
3 2019-04-13 17:31 Jorge Ben Jor Que Maravilha 246133
4 2019-04-13 17:33 Jorge Ben Jor Filho Maravilha 130800
5 2019-04-13 17:35 Jorge Ben Jor Caramba! ... Galileu Da Galiléia 148960
6 2019-04-13 17:38 Jorge Ben Jor Take It Easy My Brother Charles 158133
7 2019-04-13 17:42 Jorge Ben Jor Menino Jesus de Praga 224706
8 2019-04-13 17:44 Jorge Ben Jor Paz E Arroz 124373
9 2019-04-13 17:49 Jorge Ben Jor Por Causa De Você, Menina / Chove Chuva - Ao Vivo 304133
In [3]:
history.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7093 entries, 0 to 7092
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   endTime     7093 non-null   object
 1   artistName  7093 non-null   object
 2   trackName   7093 non-null   object
 3   msPlayed    7093 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 221.8+ KB
In [4]:
history.shape
Out[4]:
(7093, 4)

Before anything, we want to keep only the songs that were played for more than 30 seconds.

In [5]:
history = history[history['msPlayed']>30000]
history.reset_index(drop=True, inplace=True)
print(history.shape)
print(history.tail(10))
(6200, 4)
               endTime                artistName                  trackName  \
6190  2020-04-13 08:17                     S.M.P              Clean Mallory   
6191  2020-04-13 08:23                D. Tiffany       Get Back To You Soon   
6192  2020-04-13 08:31                D. Tiffany                 Blue Dream   
6193  2020-04-13 08:37                D. Tiffany               How RU Plush   
6194  2020-04-13 08:43                D. Tiffany        Something About You   
6195  2020-04-13 08:50                   Flørist                      DVOTE   
6196  2020-04-13 08:57                   Flørist  DVOTE - Bank Holiday Edit   
6197  2020-04-13 09:06         Slim Media Player               Dappled Love   
6198  2020-04-13 09:12         Slim Media Player        Pink Hues, Bad News   
6199  2020-04-13 16:45  Electric Sound Broadcast                 Waverunner   

      msPlayed  
6190    369829  
6191    366036  
6192    432399  
6193    374577  
6194    371542  
6195    251537  
6196    420256  
6197    484006  
6198    366779  
6199     63953  

Then we check whether there are any duplicates across the endTime, artistName and trackName columns.

In [6]:
history.duplicated(['endTime','artistName', 'trackName']).sum()
Out[6]:
0

No duplicates, yeehaw! Since there are no abnormal duplicates (rows with identical values for endTime, artistName and trackName), we can work on our data.
The search through the Spotify Web API was performed in another notebook, 'Spotify web API search.ipynb'.

The search returned the 'tracks_df.csv' file, which we will work on below.

In [7]:
tracks_df = pd.read_csv('tracks_df.csv')
print(tracks_df.shape)
print(tracks_df.head(10))
(6200, 10)
                 track_id                                    track_name  \
0  6M665tf7ZKQzUFA5yOrUE4  Where Do You Go To (My Lovely) - Re-Recorded   
1  4jSiAZQpsA1mAS3RE0UHCo                        A Banda do Zé Pretinho   
2  7pdCDKs0i05N8ag4tAC5u5                                     Taj Mahal   
3  4gw4oUsExNbPqm67aP4yut                                 Que Maravilha   
4  0rHH9cMqaUZF8Cm2n7Dk7s                               Filho Maravilha   
5  7M2GbY9YD3n7FA10lHGdoH              Caramba! ... Galileu Da Galileia   
6  399Hm1oqfzzQaZRu3CtVMb               Take It Easy My Brother Charles   
7  0DG6f7Dmp6BdK4K32bojAp                         Menino Jesus de Praga   
8  6gnsRunBKNL6es2Ltg60WQ                                   Paz E Arroz   
9  6gnsRunBKNL6es2Ltg60WQ                                   Paz E Arroz   

   track_length_ms               artist_id     artist_name  \
0           287720  3ExGDjEKejMhyciAgxPe0B  Peter Sarstedt   
1           239973  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
2           186933  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
3           246400  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
4           133720  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
5           149360  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
6           155666  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
7           224706  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
8           123546  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   
9           123546  5JYtpnUKxAzXfHEYpOeeit   Jorge Ben Jor   

                            album_name  track_popularity  track_number  \
0                       The Lost Album                59            14   
1               A Banda do Zé Pretinho                40             1   
2                        Africa Brasil                48             6   
3                       Minha História                18            10   
4  Football & Samba Groove Association                18             3   
5   Brazil Classics 1: Beleza Tropical                16            11   
6                            Jorge Ben                54             6   
7               A Banda do Zé Pretinho                26             7   
8                                  Ben                 7             3   
9                                  Ben                 7             3   

  album_release  album_total_tracks  
0    2008-02-01                  14  
1          1978                  11  
2    1976-01-01                  11  
3    1994-01-01                  14  
4    2006-01-01                  15  
5    2007-03-26                  18  
6    1969-01-20                  11  
7          1978                  11  
8    2011-01-01                  11  
9    2011-01-01                  11  

Now that we have retrieved most of the information we will want for our analysis in tracks_df (we still need the genres and the audio features), we can merge history with tracks_df.
Nonetheless, we have to take into account that there are duplicates in the history dataframe, partly because some tracks share the same name without being from the same artist. So it would be better to merge on track_id rather than track_name.
To do so, we first have to add a track_id column to the history dataframe.

In [8]:
play = pd.merge(left=history, right=tracks_df, left_on=['artistName', 'trackName'], right_on=['artist_name', 'track_name'], how='left')
print(play.shape)
print(play.head(10))
(14640, 14)
            endTime      artistName                         trackName  \
0  2019-04-13 17:19  Peter Sarstedt    Where Do You Go To (My Lovely)   
1  2019-04-13 17:23   Jorge Ben Jor            A Banda do Zé Pretinho   
2  2019-04-13 17:26   Jorge Ben Jor                         Taj Mahal   
3  2019-04-13 17:31   Jorge Ben Jor                     Que Maravilha   
4  2019-04-13 17:33   Jorge Ben Jor                   Filho Maravilha   
5  2019-04-13 17:35   Jorge Ben Jor  Caramba! ... Galileu Da Galiléia   
6  2019-04-13 17:38   Jorge Ben Jor   Take It Easy My Brother Charles   
7  2019-04-13 17:38   Jorge Ben Jor   Take It Easy My Brother Charles   
8  2019-04-13 17:42   Jorge Ben Jor             Menino Jesus de Praga   
9  2019-04-13 17:44   Jorge Ben Jor                       Paz E Arroz   

   msPlayed                track_id                       track_name  \
0    129954                     NaN                              NaN   
1    239973  4jSiAZQpsA1mAS3RE0UHCo           A Banda do Zé Pretinho   
2    184973  7pdCDKs0i05N8ag4tAC5u5                        Taj Mahal   
3    246133  4gw4oUsExNbPqm67aP4yut                    Que Maravilha   
4    130800  0rHH9cMqaUZF8Cm2n7Dk7s                  Filho Maravilha   
5    148960                     NaN                              NaN   
6    158133  399Hm1oqfzzQaZRu3CtVMb  Take It Easy My Brother Charles   
7    158133  399Hm1oqfzzQaZRu3CtVMb  Take It Easy My Brother Charles   
8    224706  0DG6f7Dmp6BdK4K32bojAp            Menino Jesus de Praga   
9    124373  6gnsRunBKNL6es2Ltg60WQ                      Paz E Arroz   

   track_length_ms               artist_id    artist_name  \
0              NaN                     NaN            NaN   
1         239973.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
2         186933.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
3         246400.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
4         133720.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
5              NaN                     NaN            NaN   
6         155666.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
7         155666.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
8         224706.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   
9         123546.0  5JYtpnUKxAzXfHEYpOeeit  Jorge Ben Jor   

                            album_name  track_popularity  track_number  \
0                                  NaN               NaN           NaN   
1               A Banda do Zé Pretinho              40.0           1.0   
2                        Africa Brasil              48.0           6.0   
3                       Minha História              18.0          10.0   
4  Football & Samba Groove Association              18.0           3.0   
5                                  NaN               NaN           NaN   
6                            Jorge Ben              54.0           6.0   
7                            Jorge Ben              54.0           6.0   
8               A Banda do Zé Pretinho              26.0           7.0   
9                                  Ben               7.0           3.0   

  album_release  album_total_tracks  
0           NaN                 NaN  
1          1978                11.0  
2    1976-01-01                11.0  
3    1994-01-01                14.0  
4    2006-01-01                15.0  
5           NaN                 NaN  
6    1969-01-20                11.0  
7    1969-01-20                11.0  
8          1978                11.0  
9    2011-01-01                11.0  
In [9]:
print(play.duplicated(['endTime', 'artistName', 'trackName']).sum())
8440
In [10]:
print(14640 - 8440)
6200

Evidently, 8440 duplicates were created during the merging process, though we do not know exactly how yet. A likely culprit: tracks_df itself contains duplicate ('artist_name', 'track_name') rows (see the two 'Paz E Arroz' rows above), and with a left merge, every duplicated key on the right side multiplies the matching rows on the left.
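A toy illustration of this mechanism (made-up data, not ours):

left = pd.DataFrame({'track': ['a', 'b']})
right = pd.DataFrame({'track': ['a', 'a'], 'track_id': [1, 2]})
#'a' is matched twice, 'b' once: 3 rows instead of 2
print(pd.merge(left, right, on='track', how='left').shape)
#passing validate='many_to_one' to such a merge would instead raise a MergeError,
#a handy way to catch this situation early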
For now, let's remove the duplicates manually; we'll investigate the merging process a bit more later.
Before removing anything, we want to check whether we have the same unique ('artistName', 'trackName') pairs in the history dataframe and the merged dataframe after dropping duplicates.

In [11]:
#We create a copy of the merged play dataframe in order to drop duplicates.
play_clean = play.copy()
play_clean.drop_duplicates(['endTime', 'artistName', 'trackName'], inplace=True)
play_clean.reset_index(drop=True, inplace=True)
print(play_clean.shape)
(6200, 14)

Since we have doubts about the merge on the artist and track columns, let's determine which pair of columns we have to keep to be sure to preserve the same information as in the history dataframe.

In [12]:
#We check whether the merged dataframe without duplicate values still holds
#the same unique ('artistName','trackName') pairs as ('artistName','trackName') from history.
play_clean_gb = play_clean.groupby(['artistName','trackName']).size()
history_gb = history.groupby(['artistName','trackName']).size()
print(play_clean_gb.equals(history_gb))
True
In [13]:
#We check whether the merged dataframe without duplicate values still holds
#the same unique ('artist_name','track_name') pairs as ('artistName','trackName') from history.
play_clean_gb = play_clean.groupby(['artist_name','track_name']).size()
history_gb = history.groupby(['artistName','trackName']).size()
print(play_clean_gb.equals(history_gb))
False
In [14]:
print(play_clean.groupby(['artist_name','track_name']).size().shape)
print(history.groupby(['artistName','trackName']).size().shape)
(3565,)
(3988,)

We did not lose any information by dropping duplicates based on ('artistName','trackName'), though we can see that the merging process made us lose information in the play ('artist_name','track_name') columns: only 3565 unique pairs remain instead of 3988, since unmatched rows hold NaN there and groupby ignores them.
Therefore, we will keep the ('artistName','trackName') columns to retain all available information.

In [15]:
play.drop(columns=['artist_name', 'track_name'], inplace=True)
In [16]:
play.drop_duplicates(['endTime', 'artistName', 'trackName'], inplace=True)
play.reset_index(drop=True, inplace=True)
print(play.shape)
print(play.head(10))
(6200, 12)
            endTime      artistName  \
0  2019-04-13 17:19  Peter Sarstedt   
1  2019-04-13 17:23   Jorge Ben Jor   
2  2019-04-13 17:26   Jorge Ben Jor   
3  2019-04-13 17:31   Jorge Ben Jor   
4  2019-04-13 17:33   Jorge Ben Jor   
5  2019-04-13 17:35   Jorge Ben Jor   
6  2019-04-13 17:38   Jorge Ben Jor   
7  2019-04-13 17:42   Jorge Ben Jor   
8  2019-04-13 17:44   Jorge Ben Jor   
9  2019-04-13 17:49   Jorge Ben Jor   

                                           trackName  msPlayed  \
0                     Where Do You Go To (My Lovely)    129954   
1                             A Banda do Zé Pretinho    239973   
2                                          Taj Mahal    184973   
3                                      Que Maravilha    246133   
4                                    Filho Maravilha    130800   
5                   Caramba! ... Galileu Da Galiléia    148960   
6                    Take It Easy My Brother Charles    158133   
7                              Menino Jesus de Praga    224706   
8                                        Paz E Arroz    124373   
9  Por Causa De Você, Menina / Chove Chuva - Ao Vivo    304133   

                 track_id  track_length_ms               artist_id  \
0                     NaN              NaN                     NaN   
1  4jSiAZQpsA1mAS3RE0UHCo         239973.0  5JYtpnUKxAzXfHEYpOeeit   
2  7pdCDKs0i05N8ag4tAC5u5         186933.0  5JYtpnUKxAzXfHEYpOeeit   
3  4gw4oUsExNbPqm67aP4yut         246400.0  5JYtpnUKxAzXfHEYpOeeit   
4  0rHH9cMqaUZF8Cm2n7Dk7s         133720.0  5JYtpnUKxAzXfHEYpOeeit   
5                     NaN              NaN                     NaN   
6  399Hm1oqfzzQaZRu3CtVMb         155666.0  5JYtpnUKxAzXfHEYpOeeit   
7  0DG6f7Dmp6BdK4K32bojAp         224706.0  5JYtpnUKxAzXfHEYpOeeit   
8  6gnsRunBKNL6es2Ltg60WQ         123546.0  5JYtpnUKxAzXfHEYpOeeit   
9                     NaN              NaN                     NaN   

                            album_name  track_popularity  track_number  \
0                                  NaN               NaN           NaN   
1               A Banda do Zé Pretinho              40.0           1.0   
2                        Africa Brasil              48.0           6.0   
3                       Minha História              18.0          10.0   
4  Football & Samba Groove Association              18.0           3.0   
5                                  NaN               NaN           NaN   
6                            Jorge Ben              54.0           6.0   
7               A Banda do Zé Pretinho              26.0           7.0   
8                                  Ben               7.0           3.0   
9                                  NaN               NaN           NaN   

  album_release  album_total_tracks  
0           NaN                 NaN  
1          1978                11.0  
2    1976-01-01                11.0  
3    1994-01-01                14.0  
4    2006-01-01                15.0  
5           NaN                 NaN  
6    1969-01-20                11.0  
7          1978                11.0  
8    2011-01-01                11.0  
9           NaN                 NaN  

Let's do some cleaning before anything else:

  • we want all column names to be in the format track_name, not trackName
In [17]:
play.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6200 entries, 0 to 6199
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   endTime             6200 non-null   object 
 1   artistName          6200 non-null   object 
 2   trackName           6200 non-null   object 
 3   msPlayed            6200 non-null   int64  
 4   track_id            5512 non-null   object 
 5   track_length_ms     5512 non-null   float64
 6   artist_id           5512 non-null   object 
 7   album_name          5512 non-null   object 
 8   track_popularity    5512 non-null   float64
 9   track_number        5512 non-null   float64
 10  album_release       5512 non-null   object 
 11  album_total_tracks  5512 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 581.4+ KB
In [18]:
play.rename(columns={'endTime':'end_time',
            'msPlayed':'ms_played',
            'trackName':'track',
            'artistName':'artist'}, inplace=True)
play.columns
Out[18]:
Index(['end_time', 'artist', 'track', 'ms_played', 'track_id',
       'track_length_ms', 'artist_id', 'album_name', 'track_popularity',
       'track_number', 'album_release', 'album_total_tracks'],
      dtype='object')

Finally, we want to convert the end_time column values to datetime objects so that they are easier to manipulate, and to create another column containing the ordinal value of each date, which will make plotting easier. We also convert ms_played to floats.

In [19]:
play['end_time'][:5]
Out[19]:
0    2019-04-13 17:19
1    2019-04-13 17:23
2    2019-04-13 17:26
3    2019-04-13 17:31
4    2019-04-13 17:33
Name: end_time, dtype: object
In [20]:
play['end_time'] = pd.to_datetime(play['end_time'])
play['ordinal_time'] = [x.toordinal() for x in play['end_time']]
play.loc[:, 'ms_played'] = play.loc[:, 'ms_played'].astype('float')
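As a side note, the list comprehension above can be replaced by a mapping over the column (a sketch giving the same result):

#same ordinal values without an explicit list comprehension
play['ordinal_time'] = play['end_time'].map(dt.datetime.toordinal)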
In [21]:
print(play.info())
print(play.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6200 entries, 0 to 6199
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   end_time            6200 non-null   datetime64[ns]
 1   artist              6200 non-null   object        
 2   track               6200 non-null   object        
 3   ms_played           6200 non-null   float64       
 4   track_id            5512 non-null   object        
 5   track_length_ms     5512 non-null   float64       
 6   artist_id           5512 non-null   object        
 7   album_name          5512 non-null   object        
 8   track_popularity    5512 non-null   float64       
 9   track_number        5512 non-null   float64       
 10  album_release       5512 non-null   object        
 11  album_total_tracks  5512 non-null   float64       
 12  ordinal_time        6200 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1), object(6)
memory usage: 629.8+ KB
None
             end_time          artist                           track  \
0 2019-04-13 17:19:00  Peter Sarstedt  Where Do You Go To (My Lovely)   
1 2019-04-13 17:23:00   Jorge Ben Jor          A Banda do Zé Pretinho   
2 2019-04-13 17:26:00   Jorge Ben Jor                       Taj Mahal   
3 2019-04-13 17:31:00   Jorge Ben Jor                   Que Maravilha   
4 2019-04-13 17:33:00   Jorge Ben Jor                 Filho Maravilha   

   ms_played                track_id  track_length_ms               artist_id  \
0   129954.0                     NaN              NaN                     NaN   
1   239973.0  4jSiAZQpsA1mAS3RE0UHCo         239973.0  5JYtpnUKxAzXfHEYpOeeit   
2   184973.0  7pdCDKs0i05N8ag4tAC5u5         186933.0  5JYtpnUKxAzXfHEYpOeeit   
3   246133.0  4gw4oUsExNbPqm67aP4yut         246400.0  5JYtpnUKxAzXfHEYpOeeit   
4   130800.0  0rHH9cMqaUZF8Cm2n7Dk7s         133720.0  5JYtpnUKxAzXfHEYpOeeit   

                            album_name  track_popularity  track_number  \
0                                  NaN               NaN           NaN   
1               A Banda do Zé Pretinho              40.0           1.0   
2                        Africa Brasil              48.0           6.0   
3                       Minha História              18.0          10.0   
4  Football & Samba Groove Association              18.0           3.0   

  album_release  album_total_tracks  ordinal_time  
0           NaN                 NaN        737162  
1          1978                11.0        737162  
2    1976-01-01                11.0        737162  
3    1994-01-01                14.0        737162  
4    2006-01-01                15.0        737162  

Now we can do some exploratory analysis!

Exploratory data analysis

First, let's see what the most played tracks and most played artists are.
To rank tracks by listening count, we apply the .groupby() method on track and artist together, since some tracks from different artists have the same name. Let's remember for the rest of the analysis that, for us, a unique 'track' is defined by the pair ['track', 'artist'].

In [22]:
play.groupby(['track','artist']).size().reset_index(name='count').sort_values(by='count', ascending=False)
Out[22]:
track artist count
3628 Unknown Track Unknown Artist 21
769 Dead of Night Orville Peck 14
3496 Through Winter's Darkest Day Yussuf Jerusalem 13
3584 Turn to Hate Orville Peck 12
3892 X Marks the Spot Ghostpoet 12
... ... ... ...
1562 Ich wollte nie erwachsen sein (Nessajas Lied) ... Peter Maffay 1
1564 If I Ain't Got You Scary Pockets 1
1565 If I Don't Go To Hell (feat. Jus Allah) Janice & Mr. Green 1
1567 If I Never Get Over Marc Mac 1
3987 死者より Shintaro Sakamoto 1

3988 rows × 3 columns

So it seems that the track I have been BINGING on is unknown to Spotify, which is rather strange...

In [23]:
artist_count = play.groupby('artist').size().reset_index(name='count').sort_values(by='count', ascending=False)
artist_count
Out[23]:
artist count
296 Ghostpoet 171
833 Yussuf Jerusalem 145
498 Mick Turner 139
264 Fila Brazillia 111
512 Mop Mop 108
... ... ...
366 Jay Prince 1
363 Janice & Mr. Green 1
361 Jamila Woods 1
360 Jamie xx 1
423 LION BABE 1

847 rows × 2 columns

At this point, these results are VERY surprising to me: for example, I did not even remember that I used to listen to Mick Turner, and some artists that I have been listening to a lot do not even appear in the top 5! Let's investigate a bit more though.

At some point we will want to study listening patterns throughout the year; this is why we created the ordinal_time column earlier, as plain numbers are easier to plot.

In [24]:
fig = px.bar(artist_count[:10], x='artist', y='count', color='count')
fig.update_layout(title='Number of tracks played per artist')
fig

Let's look at the total amount of play time per track and per artist.

In [25]:
play_time = play.groupby(['artist','track'])['ms_played'].sum().reset_index(name='ms_played')
play_time
Out[25]:
artist track ms_played
0 16BL Deep In My Soul - Original Mix 483966.0
1 1Blip2 Alone 883636.0
2 3 Hürel Sevenler Ağlarmış 449428.0
3 4hero Another Day 297573.0
4 4hero Bed Of Roses 246080.0
... ... ... ...
3983 Özdemir Erdoğan Uzun İnce Bir Yoldayım 425680.0
3984 박혜진 Park Hye Jin ABC 242000.0
3985 박혜진 Park Hye Jin CLOSE EYES 35990.0
3986 박혜진 Park Hye Jin I DON'T CARE 340660.0
3987 박혜진 Park Hye Jin IF U WANT IT 226341.0

3988 rows × 3 columns

What are the 20 most listened-to tracks in terms of total play time?

In [26]:
top_tracks = play_time.sort_values(by='ms_played', ascending=False)[:21]
#We drop the first entry since it is the unknown track, keeping 20 rows
top_tracks = top_tracks[1:21]
print(top_tracks[['artist', 'track']])

#display images with matplotlib
import matplotlib.image as mpimg
img = mpimg.imread('mavisstaples.jpg')
plt.imshow(img)
plt.xticks(ticks=[])
plt.yticks(ticks=[])
plt.show()
                  artist                                              track
2211       Mavis Staples                           Let's Do It Again - Live
2805        Orville Peck                                       Turn to Hate
2797        Orville Peck                                      Dead of Night
1881  Kuniyuki Takahashi  Flying Music Short Version - Alex from Tokyo,S...
1289           Ghostpoet                        Be Right Back, Moving House
3924    Yussuf Jerusalem                                The Path Of Paladin
3912    Yussuf Jerusalem                                    Cruel Love Song
1335           Ghostpoet                                   X Marks the Spot
2301         Mdou Moctar                                         Tarhatazed
353           Arlo Parks                                               Cola
1321           Ghostpoet                     Sorry My Love, It's You Not Me
3925    Yussuf Jerusalem                       Through Winter's Darkest Day
1905  Kuniyuki Takahashi                     The Guitar Song - Original Mix
3130       Robert Palmer                                          Fine Time
1328           Ghostpoet                                        Them Waters
3910    Yussuf Jerusalem                             A Heart Full of Sorrow
108        Ahmed Fakroun                                             Nisyan
1206      Fila Brazillia                                   The Garden Zadar
666         Channel Tres                                         Controller
2266          Mazzy Star                       Look On Down From The Bridge

Let's check which are the top artists in terms of total play time.

In [27]:
top_artists = play.pivot_table(index='artist', values='ms_played', aggfunc=np.sum).sort_values(by='ms_played', ascending=False)
top_artists
Out[27]:
ms_played
artist
Ghostpoet 43231120.0
Kuniyuki Takahashi 38059644.0
Pablo Bolivar 34838970.0
Mop Mop 29383429.0
Fila Brazillia 29247086.0
... ...
Donnie & Joe Emerson 32533.0
Groove Armada 32507.0
Karen Dalton 32044.0
Prequel 31567.0
CCCP – Fedeli Alla Linea 31107.0

847 rows × 1 columns

In [28]:
# def get_hours_played(time_played):
#     hours = time_played.seconds // 3600
#     minutes = (time_played.seconds % 3600) // 60
#     seconds = time_played.seconds - hours*3600 - minutes*60
#     return [hours, minutes, seconds]
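The commented-out helper above was meant to break a total play time down into hours, minutes and seconds; as a sketch of a simpler route, pandas can render millisecond totals as readable durations directly:

#e.g. Ghostpoet's 43231120.0 ms displays as '0 days 12:00:31.120000'
print(pd.to_timedelta(top_artists['ms_played'], unit='ms').head())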

Let's analyze listening trends week by week over the year.

In [29]:
play.head()
play.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6200 entries, 0 to 6199
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   end_time            6200 non-null   datetime64[ns]
 1   artist              6200 non-null   object        
 2   track               6200 non-null   object        
 3   ms_played           6200 non-null   float64       
 4   track_id            5512 non-null   object        
 5   track_length_ms     5512 non-null   float64       
 6   artist_id           5512 non-null   object        
 7   album_name          5512 non-null   object        
 8   track_popularity    5512 non-null   float64       
 9   track_number        5512 non-null   float64       
 10  album_release       5512 non-null   object        
 11  album_total_tracks  5512 non-null   float64       
 12  ordinal_time        6200 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1), object(6)
memory usage: 629.8+ KB
In [30]:
#We first set the index to 'end_time' and resample at the weekly level (monthly or any other frequency would work the same way)
play_week = play[['end_time', 'ms_played']].set_index('end_time').resample('W').sum()
play_week
Out[30]:
ms_played
end_time
2019-04-14 5754703.0
2019-04-21 28831180.0
2019-04-28 23649088.0
2019-05-05 8736929.0
2019-05-12 8829121.0
2019-05-19 47995.0
2019-05-26 3361907.0
2019-06-02 15402053.0
2019-06-09 12140287.0
2019-06-16 33675202.0
2019-06-23 17555519.0
2019-06-30 15658090.0
2019-07-07 51402322.0
2019-07-14 30691994.0
2019-07-21 29666993.0
2019-07-28 170846.0
2019-08-04 13570579.0
2019-08-11 9033280.0
2019-08-18 28245836.0
2019-08-25 33083909.0
2019-09-01 29224978.0
2019-09-08 33967783.0
2019-09-15 21461625.0
2019-09-22 18362654.0
2019-09-29 22747426.0
2019-10-06 16041517.0
2019-10-13 33919625.0
2019-10-20 41306998.0
2019-10-27 20554182.0
2019-11-03 28217084.0
2019-11-10 55187339.0
2019-11-17 27327094.0
2019-11-24 44750236.0
2019-12-01 41867408.0
2019-12-08 31049006.0
2019-12-15 55298460.0
2019-12-22 47520383.0
2019-12-29 16978441.0
2020-01-05 32192667.0
2020-01-12 54451097.0
2020-01-19 85501832.0
2020-01-26 64431490.0
2020-02-02 14448586.0
2020-02-09 8196195.0
2020-02-16 6985421.0
2020-02-23 24086795.0
2020-03-01 22895667.0
2020-03-08 42555684.0
2020-03-15 17843805.0
2020-03-22 49407959.0
2020-03-29 6624305.0
2020-04-05 4353912.0
2020-04-12 74419251.0
2020-04-19 4065999.0
In [31]:
fig = px.bar(play_week, x=play_week.index, y=play_week['ms_played']/3.6e6, color=play_week['ms_played']/3.6e6)

#we want the xaxis tick values to be all the months
month_serie = pd.to_datetime(play['end_time'].dt.strftime('%Y-%m'), format='%Y-%m')

#some modifications to the figure layout
fig.update_layout(xaxis_tickvals=month_serie,
                 xaxis_title=None,
                  yaxis_title='Hours played',
                 title='Total listening time per week',
                 coloraxis_colorbar=dict(title=None))
#print layout when necessary
# print(fig.layout)
fig.show()

Some remarks on the listening time evolution

  • let's remember that this is a weekly listening time evolution
  • we can see an increase in listening time between October 2019 and November 2019. This observation matches me getting more time at home or in cafés looking for a job, since I left my last job on October 31st
  • even if I listened to music while working on easy tasks, I can see that I spend more time listening to music while at home
  • I got a temporary full-time job during the first two weeks of February 2020, which clearly corresponds to a sharp decrease in listening time over this period
  • in March and April, I often switched between platforms (YouTube, SoundCloud, Bandcamp), which can explain the volatility in weekly listening time during this period.

Let's see if we can get a clear pattern over the course of a day.

In [32]:
#Selecting only 'end_time' and 'ms_played' from the play dataframe
#(as an explicit copy, to avoid the SettingWithCopyWarning)
play_day = play[['end_time', 'ms_played']].copy()
#Keeping only the time of day from the datetimes
play_day['end_time'] = pd.to_datetime(play_day['end_time'].dt.strftime('%H:%M'), format='%H:%M')
#Setting the index (resampling requires a datetime index)
play_day.set_index('end_time', inplace=True)
#Converting time played from milliseconds to minutes and changing the column name
play_day.loc[:,'ms_played'] = play_day.loc[:,'ms_played']/60000
play_day.rename(columns={'ms_played':'minutes_played'}, inplace=True)
#Resampling by 30-minute periods
play_day = play_day.resample('30min').sum()
#Keeping only the time part of the index
play_day.index = play_day.index.time
play_day
Out[32]:
minutes_played
00:00:00 99.670617
00:30:00 53.908767
01:00:00 59.511567
01:30:00 31.526800
02:00:00 27.400883
02:30:00 26.460000
03:00:00 0.000000
03:30:00 0.000000
04:00:00 0.000000
04:30:00 37.469283
05:00:00 74.771900
05:30:00 318.601817
06:00:00 430.226850
06:30:00 475.439700
07:00:00 600.992217
07:30:00 726.258267
08:00:00 831.148000
08:30:00 941.722200
09:00:00 791.219500
09:30:00 743.095700
10:00:00 762.089283
10:30:00 644.666017
11:00:00 730.335850
11:30:00 692.507700
12:00:00 546.890117
12:30:00 448.063700
13:00:00 504.027867
13:30:00 469.137317
14:00:00 354.705117
14:30:00 451.346983
15:00:00 478.065283
15:30:00 659.810117
16:00:00 893.253600
16:30:00 902.266367
17:00:00 1123.248683
17:30:00 1176.940567
18:00:00 1265.657183
18:30:00 1150.020400
19:00:00 915.185067
19:30:00 855.530183
20:00:00 715.119017
20:30:00 676.589283
21:00:00 487.834467
21:30:00 463.537033
22:00:00 354.469833
22:30:00 219.649267
23:00:00 150.753817
23:30:00 101.388100
In [33]:
fig = px.bar(play_day, x=play_day.index, y='minutes_played', color='minutes_played')
fig.update_layout(xaxis_title=None,
                 title='Listening time per day')
# print(fig.layout)
fig.show()

There clearly is a listening pattern in this graph: two peaks appear, at 8:30 and at 18:00.
I was employed most of the time this past year, so the decline in listening time between 12:00 and 14:30 is directly related to lunch time, during which I don't listen to much music. The 8:30 peak corresponds either to me arriving at work and starting the day with music, or, on non-working days, to getting up and immediately playing music. Either way, it shows that this is the time I get going on most days!
The 18:00 peak corresponds either to me trying to get some motivation while still at work, or to me being at home and putting on some music while planning my evening.

In [34]:
top_10_artists = list(top_artists.index[:10])

#Explicit copy to avoid the SettingWithCopyWarning
play_artist = play[play['artist'].isin(top_10_artists)].copy()
#We want to aggregate by month, so we truncate the 'end_time' column to the month
play_artist.loc[:,'end_time'] = pd.to_datetime(play_artist['end_time'].dt.strftime('%Y-%m'), format='%Y-%m')
play_artist = pd.pivot_table(play_artist, index='end_time', values='ms_played', columns='artist', aggfunc=np.sum)

#Stacking the dataframe to get tidy data and bar with plotly express
play_artist = play_artist.stack().reset_index(level=1)
play_artist.columns = ['artist', 'ms_played']
#Change the play time to minutes
play_artist.loc[:,'ms_played'] = play_artist.loc[:,'ms_played']/60000
play_artist.rename(columns={'ms_played':'minutes_played'}, inplace=True)
play_artist.index.name = None
# print(play_artist['artist'].value_counts())

In [35]:
fig = px.bar(play_artist, x=play_artist.index, y="minutes_played", color='minutes_played', facet_row="artist")

fig.update_layout(
    height=1200,
    title='Total play time for the 10 most played artists, in minutes',
    plot_bgcolor='rgba(250,250,250,250)',
    xaxis_tickvals=month_serie)

#removing the y-axis titles ('minutes_played') on every facet
fig.update_yaxes(title=None)

#removing bottom xaxes title
fig.update_xaxes(title=None)

#modifying annotations (artist name)
#get access to all layout information
# print(fig.layout)
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1]))
fig.for_each_annotation(lambda a: a.update(font_size=15))
fig.for_each_annotation(lambda a: a.update(textangle=0))
fig.for_each_annotation(lambda a: a.update(x=0.01))
fig.for_each_annotation(lambda a: a.update(yanchor='bottom'))

fig.show()

I don't know why, at this point, the two facets at the bottom of the figure don't display any visible bars (in fact there is a very thin bar in each one). We will have to investigate this issue a bit further.

Ideas (both sketched below):

  • change the y-axis limits (more than 400, like 500),
  • see what it displays when aggregating by week rather than by month.
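A quick sketch of both (the 500 limit is a guess to be tuned; the weekly truncation mirrors the monthly one used above):

#idea 1: force a common, taller y-axis range on every facet
fig.update_yaxes(range=[0, 500])

#idea 2: truncate end_time to the week instead of the month before pivoting
play_artist_w = play[play['artist'].isin(top_10_artists)].copy()
play_artist_w['end_time'] = play_artist_w['end_time'].dt.to_period('W').dt.start_time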

In [36]:
#what if we want to keep a daily precision?
top_10_artists = list(top_artists.index[:10])

#Explicit copy to avoid the SettingWithCopyWarning
play_artist_day = play[play['artist'].isin(top_10_artists)].copy()
#We want to aggregate by day, so we truncate the 'end_time' column to the day
play_artist_day['end_time'] = pd.to_datetime(play_artist_day['end_time'].dt.strftime('%Y-%m-%d'), format='%Y-%m-%d')
play_artist_day = pd.pivot_table(play_artist_day, index='end_time', values='ms_played', columns='artist', aggfunc=np.sum)
play_artist_day = play_artist_day.stack().reset_index(level=1)
play_artist_day.columns = ['artist', 'ms_played']
#let's change the play time to minutes
play_artist_day.loc[:,'ms_played'] = play_artist_day.loc[:,'ms_played']/60000
play_artist_day.rename(columns={'ms_played':'minutes_played'}, inplace=True)
play_artist_day.index.name = None
#play_artist_day

In [37]:
#plotting with matplotlib: one subplot per artist
fig, ax_list = plt.subplots(10, 1, figsize=(12, 17))

#one bar chart of daily play time for each of the top 10 artists
for index, ax in enumerate(ax_list):
    ax_artist = play_artist_day[play_artist_day['artist'] == top_10_artists[index]]
    ax.bar(x=ax_artist.index, height=ax_artist['minutes_played'], width=1)
    ax.set_xlim([dt.datetime(2019, 4, 13), dt.datetime(2020, 4, 12)])
    ax.set_ylabel(top_10_artists[index], rotation=0, labelpad=100, fontsize=20)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

plt.show()

Some remarks on the artists

  • we can see that I tend to listen to some artists over a very short time period, probably corresponding to when I discovered them (Mavis Staples in July 2019, Mop Mop in December 2019)
  • on the contrary, I have been listening to other artists more consistently over the last year (Mazzy Star, Pablo Bolivar, Ghostpoet, and Kuniyuki Takahashi, considering I discovered the latter in December 2019)

Remaining things to do:

  • get genres and audio features from Spotify (see the sketch below; check towardsdatascience on Medium for analysis ideas: https://towardsdatascience.com/get-your-spotify-streaming-history-with-python-d5a208bbcbd3)
  • add some images or the Spotify logo to the graphs (https://plotly.com/python/images/)
  • deal with the null values from the search (for further analysis on genres, audio features, ...)
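For the genres and audio features item, Spotipy exposes both directly. A minimal sketch, assuming the authenticated sp client from the search notebook and filtering out the null ids first:

#genres live on the artist object, audio features on the track ids
artist_ids = list(play['artist_id'].dropna().unique())
genres = {a: sp.artist(a)['genres'] for a in artist_ids[:5]}
features = sp.audio_features(list(play['track_id'].dropna().unique())[:50])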