Some analysis on the European Soccer Database

This database was compiled by Hugo Mathien and is available here on Kaggle.

As described by Hugo, the information gathered here is really impressive:

  • +25,000 matches
  • +10,000 players
  • 11 European Countries with their lead championship
  • Seasons 2008 to 2016
  • Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates
  • Team line up with squad formation (X, Y coordinates)
  • Betting odds from up to 10 providers
  • Detailed match events (goal types, possession, corner, cross, fouls, cards etc…) for +10,000 matches

While this clearly represents quality material to perform in depth analysis and eventually predict matches outcome, I will focus here on performing exploratory analysis on teams, leagues or particular players and/or categories of players.

In [1]:
#importing classic analysis libraries
import pandas as pd
import numpy as np

#sqlite3
import sqlite3

#plotting with plotly (offline default mode)
import plotly
plotly.__version__
import plotly.express as px
import plotly.graph_objects as go
#plotting directly in Jupyter notebook
import plotly.io as pio
pio.renderers.default='notebook'

#plotting with matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

Let's see what is in the database file

In [2]:
conn = sqlite3.connect('database.sqlite')
tables = pd.read_sql("""SELECT * FROM sqlite_master WHERE type='table' ;""", conn)
tables
Out[2]:
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 4 CREATE TABLE sqlite_sequence(name,seq)
1 table Player_Attributes Player_Attributes 11 CREATE TABLE "Player_Attributes" (\n\t`id`\tIN...
2 table Player Player 14 CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA...
3 table Match Match 18 CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR...
4 table League League 24 CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA...
5 table Country Country 26 CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM...
6 table Team Team 29 CREATE TABLE "Team" (\n\t`id`\tINTEGER PRIMARY...
7 table Team_Attributes Team_Attributes 2 CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE...

As we already knew, we've got quite a lot of information here, with the following tables:

  • Match
  • Player
  • Player_Attributes
  • League
  • Country
  • Team
  • Team_attributes

We quickly draw the scheme above in order to be able to better visualize tables information and indexes/relations between tables. This will make our work much easier by being able to refer to it later on.

In [3]:
from IPython.display import Image
Image(filename='Untitled 1.png', width=800, height=800)
Out[3]:

Let's inspect country, league and team tables.

In [4]:
pd.read_sql("""SELECT * FROM Country;""", conn)
Out[4]:
id name
0 1 Belgium
1 1729 England
2 4769 France
3 7809 Germany
4 10257 Italy
5 13274 Netherlands
6 15722 Poland
7 17642 Portugal
8 19694 Scotland
9 21518 Spain
10 24558 Switzerland
In [5]:
pd.read_sql("""SELECT * FROM League;""", conn)
Out[5]:
id country_id name
0 1 1 Belgium Jupiler League
1 1729 1729 England Premier League
2 4769 4769 France Ligue 1
3 7809 7809 Germany 1. Bundesliga
4 10257 10257 Italy Serie A
5 13274 13274 Netherlands Eredivisie
6 15722 15722 Poland Ekstraklasa
7 17642 17642 Portugal Liga ZON Sagres
8 19694 19694 Scotland Premier League
9 21518 21518 Spain LIGA BBVA
10 24558 24558 Switzerland Super League

We can see that there is only one league per country, and that league index and country index are equal.

In [6]:
pd.read_sql("""SELECT * FROM Team LIMIT 20;""", conn)
Out[6]:
id team_api_id team_fifa_api_id team_long_name team_short_name
0 1 9987 673.0 KRC Genk GEN
1 2 9993 675.0 Beerschot AC BAC
2 3 10000 15005.0 SV Zulte-Waregem ZUL
3 4 9994 2007.0 Sporting Lokeren LOK
4 5 9984 1750.0 KSV Cercle Brugge CEB
5 6 8635 229.0 RSC Anderlecht AND
6 7 9991 674.0 KAA Gent GEN
7 8 9998 1747.0 RAEC Mons MON
8 9 7947 NaN FCV Dender EH DEN
9 10 9985 232.0 Standard de Liège STL
10 11 8203 110724.0 KV Mechelen MEC
11 12 8342 231.0 Club Brugge KV CLB
12 13 9999 546.0 KSV Roeselare ROS
13 14 8571 100081.0 KV Kortrijk KOR
14 15 4049 NaN Tubize TUB
15 16 9996 111560.0 Royal Excel Mouscron MOU
16 17 10001 681.0 KVC Westerlo WES
17 18 9986 670.0 Sporting Charleroi CHA
18 614 9997 680.0 Sint-Truidense VV STT
19 1034 9989 239.0 Lierse SK LIE

Let's create a table of all the matches with home and away teams as well as scores.

In [7]:
match_list = pd.read_sql("""
                            SELECT
                                m.id match_id,
                                c.name country,
                                l.name league,
                                m.season,
                                m.stage,
                                m.date,
                                ht.team_long_name home_team,
                                m.home_team_goal,
                                m.away_team_goal,
                                at.team_long_name away_team
                            FROM Match m
                            INNER JOIN Country c ON m.country_id = c.id
                            INNER JOIN League l ON m.league_id = l.id
                            INNER JOIN Team ht ON ht.team_api_id = m.home_team_api_id
                            INNER JOIN Team at ON at.team_api_id = m.away_team_api_id
                            ORDER BY 2, 4 DESC;""", conn)
match_list
Out[7]:
match_id country league season stage date home_team home_team_goal away_team_goal away_team
0 1489 Belgium Belgium Jupiler League 2015/2016 1 2015-07-24 00:00:00 Sint-Truidense VV 2 1 Club Brugge KV
1 1490 Belgium Belgium Jupiler League 2015/2016 1 2015-07-25 00:00:00 KV Kortrijk 2 1 Standard de Liège
2 1491 Belgium Belgium Jupiler League 2015/2016 1 2015-07-25 00:00:00 KRC Genk 3 1 Oud-Heverlee Leuven
3 1492 Belgium Belgium Jupiler League 2015/2016 1 2015-07-25 00:00:00 KV Oostende 3 1 KV Mechelen
4 1493 Belgium Belgium Jupiler League 2015/2016 1 2015-07-25 00:00:00 SV Zulte-Waregem 3 1 Sporting Lokeren
... ... ... ... ... ... ... ... ... ... ...
25974 24733 Switzerland Switzerland Super League 2008/2009 9 2008-09-13 00:00:00 FC Basel 2 0 FC Luzern
25975 24734 Switzerland Switzerland Super League 2008/2009 9 2008-09-13 00:00:00 Grasshopper Club Zürich 0 0 FC Aarau
25976 24735 Switzerland Switzerland Super League 2008/2009 9 2008-09-13 00:00:00 Neuchâtel Xamax 3 3 FC Sion
25977 24736 Switzerland Switzerland Super League 2008/2009 9 2008-09-14 00:00:00 AC Bellinzona 1 2 BSC Young Boys
25978 24737 Switzerland Switzerland Super League 2008/2009 9 2008-09-14 00:00:00 FC Vaduz 1 7 FC Zürich

25979 rows × 10 columns

In [8]:
#example for Ligue 1 matches from the last recorded season
ligue1_1516 = pd.read_sql("""
                            SELECT
                                m.id match_id,
                                c.name country,
                                l.name league,
                                m.season,
                                m.stage,
                                m.date,
                                ht.team_long_name home_team,
                                m.home_team_goal,
                                m.away_team_goal,
                                at.team_long_name away_team
                            FROM Match m
                            INNER JOIN Country c ON m.country_id = c.id
                            INNER JOIN League l ON m.league_id = l.id
                            INNER JOIN Team ht ON ht.team_api_id = m.home_team_api_id
                            INNER JOIN Team at ON at.team_api_id = m.away_team_api_id
                            WHERE country='France' AND season='2015/2016'
                            ORDER BY 6 ASC;""", conn)
ligue1_1516
Out[8]:
match_id country league season stage date home_team home_team_goal away_team_goal away_team
0 7431 France France Ligue 1 2015/2016 1 2015-08-07 00:00:00 LOSC Lille 0 1 Paris Saint-Germain
1 7429 France France Ligue 1 2015/2016 1 2015-08-08 00:00:00 SC Bastia 2 1 Stade Rennais FC
2 7433 France France Ligue 1 2015/2016 1 2015-08-08 00:00:00 Olympique de Marseille 0 1 SM Caen
3 7434 France France Ligue 1 2015/2016 1 2015-08-08 00:00:00 Montpellier Hérault SC 0 2 Angers SCO
4 7435 France France Ligue 1 2015/2016 1 2015-08-08 00:00:00 FC Nantes 1 0 En Avant de Guingamp
... ... ... ... ... ... ... ... ... ... ...
375 7744 France France Ligue 1 2015/2016 38 2016-05-14 00:00:00 FC Lorient 1 0 GFC Ajaccio
376 7745 France France Ligue 1 2015/2016 38 2016-05-14 00:00:00 Paris Saint-Germain 4 0 FC Nantes
377 7746 France France Ligue 1 2015/2016 38 2016-05-14 00:00:00 SM Caen 1 0 Girondins de Bordeaux
378 7747 France France Ligue 1 2015/2016 38 2016-05-14 00:00:00 Stade de Reims 4 1 Olympique Lyonnais
379 7748 France France Ligue 1 2015/2016 38 2016-05-14 00:00:00 Stade Rennais FC 1 2 SC Bastia

380 rows × 10 columns

Analysis on leagues

Say our initial aim is to focus on European national leagues, and we expect to get some statistics (on player characteristics, in terms of total goals, fouls, ...) in order to compare them.

What are the leagues where there are the most goals?

Of course this is not a clear indicator of the quality of the league, since various factors influence the total number of goals in a league (presence of top scorers, average quality of strikers, average quality of defenses, type of play, ...).
Let's try to draw a graph with all leagues total goals scored for each year.

In [9]:
total_goals = pd.read_sql("""
                            SELECT
                                m.season,
                                c.name country,
                                l.name league,
                                SUM(m.home_team_goal + m.away_team_goal) total_goals
                            FROM Match m
                            INNER JOIN Country c ON c.id = m.country_id
                            INNER JOIN League l ON l.id = m.league_id
                            GROUP BY 1, 2
                            ORDER BY 1 DESC, 4 DESC;""", conn)
total_goals
Out[9]:
season country league total_goals
0 2015/2016 Spain Spain LIGA BBVA 1043
1 2015/2016 England England Premier League 1026
2 2015/2016 Italy Italy Serie A 979
3 2015/2016 France France Ligue 1 960
4 2015/2016 Netherlands Netherlands Eredivisie 912
... ... ... ... ...
83 2008/2009 Belgium Belgium Jupiler League 855
84 2008/2009 Portugal Portugal Liga ZON Sagres 552
85 2008/2009 Scotland Scotland Premier League 548
86 2008/2009 Switzerland Switzerland Super League 540
87 2008/2009 Poland Poland Ekstraklasa 524

88 rows × 4 columns

In [10]:
#plotting total goals scored per year
fig = px.line(total_goals, x='season', y='total_goals', color='league')
fig.show()

We can see that there is a lot of disparity between leagues, and it looks like we are lacking data for the 2013/2014 Belgian championship.
Let's focus on the top 6 leagues.

In [11]:
#plotting total goals scored per year for the top 6 leagues in terms of total goal scored

#creating a list of the seasons to order the chart's x_axis
seasons_list = ['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013', '2013/2014', '2014/2015', '2015/2016', '2016/2017', '2017/2018', '2018/2019']
total_goals_top = total_goals[total_goals['country'].isin(['Spain', 'England', 'Italy', 'France', 'Netherlands', 'Germany'])]

Between the 2008/2009 season and 2015/2016 season, the gap between #1 and #5 leagues in terms of total goals scored has slightly decreased.
In particular, total goals scored in the French Ligue 1 have considerably increased (+100 approx.), while it decreased for German Bundesliga, which is no longer in the top 5.
At the same time, there is also a noticeable decrease in the Spanish Liga BBVA, from 1101 to 1043.
There was also a huge increase for the Dutch championship total between the 2009/2010 season and the 2011/2012 season, which was quickly mitigated with total goals scored coming back to approx 900 for the last season.

Focus on top teams from each league

Seeing the total goals scored for each league is interesting, but we would now rather see which are the top 3 contributing teams for each league.

In [12]:
#creating a table with a single column for teams and a single column for goals scored per match: example with France
# test = pd.read_sql("""                 
#                     SELECT 
#                         m.season, 
#                         l.name league,
#                         c.name country,
#                         m.stage,
#                         t.team_api_id,
#                         t.team_long_name team,
#                         CASE
#                             WHEN t.team_api_id=m.home_team_api_id THEN m.home_team_goal
#                             WHEN t.team_api_id=m.away_team_api_id THEN m.away_team_goal
#                             END goals
#                     FROM Match m
#                     INNER JOIN League l ON m.league_id = l.id
#                     INNER JOIN Country c ON c.id = m.country_id
#                     INNER JOIN Team t ON t.team_api_id = m.home_team_api_id OR t.team_api_id = m.away_team_api_id
#                     WHERE m.season='2015/2016' AND c.name='France' AND m.stage=1;""", conn)
# test
In [13]:
#now doing the same for all countries and seasons and selecting the top 3 teams from each league
# top_teams = pd.read_sql("""                 
#                     WITH team_goals AS
#                         (SELECT 
#                             m.season, 
#                             l.name league,
#                             c.name country,
#                             m.stage,
#                             t.team_api_id,
#                             t.team_long_name team,
#                             CASE
#                                 WHEN t.team_api_id=m.home_team_api_id THEN m.home_team_goal
#                                 WHEN t.team_api_id=m.away_team_api_id THEN m.away_team_goal
#                                 END goals
#                         FROM Match m
#                         INNER JOIN League l ON m.league_id = l.id
#                         INNER JOIN Country c ON c.id = m.country_id
#                         INNER JOIN Team t ON t.team_api_id = m.home_team_api_id OR t.team_api_id = m.away_team_api_id),
                    
#                     classified_teams AS
#                         (SELECT
#                             tg.season,
#                             tg.league,
#                             tg.team,
#                             ROW_NUMBER() OVER (PARTITION BY tg.season, tg.league ORDER BY SUM(tg.goals) DESC) AS rank,
#                             SUM(tg.goals) total_team_goals,
#                             CAST(SUM(tg.goals) AS float)/CAST((SELECT SUM(tg2.goals) FROM team_goals tg2
#                                                                 WHERE tg2.season=tg.season AND tg2.league=tg.league
#                                                                 GROUP BY tg2.season, tg2.league) AS float) pct_total_league_goals,
#                             (SELECT SUM(tg2.goals) FROM team_goals tg2
#                             WHERE tg2.season=tg.season AND tg2.league=tg.league
#                             GROUP BY tg2.season, tg2.league) total_league_goals
#                         FROM team_goals tg
#                         GROUP BY 1, 3
#                         ORDER BY 1, 2, 4 ASC)
                    
#                     SELECT *
#                     FROM classified_teams
#                     WHERE rank <=3;""", conn)
# top_teams

Let's see the output for the 2015/2016 season and compare the top clubs from each league.

In [14]:
# top_teams_1516 = top_teams[top_teams['season']=='2015/2016']
# top_teams_1516
In [15]:
# #creating three subsets of the top_teams_1516 dataframe in order to plot a grouped bar chart
# rank1 = top_teams_1516[top_teams_1516['rank']==1]
# rank2 = top_teams_1516[top_teams_1516['rank']==2]
# rank3 = top_teams_1516[top_teams_1516['rank']==3]

# fig = go.Figure(data=[
#     go.Bar(x=rank1['league'], y=rank1['total_team_goals'], text=rank1['team'], textposition='auto'),
#     go.Bar(x=rank2['league'], y=rank2['total_team_goals'], text=rank2['team'], textposition='auto'),
#     go.Bar(x=rank3['league'], y=rank3['total_team_goals'], text=rank3['team'], textposition='auto')
# ])
# # Change the bar mode
# fig.update_layout(barmode='group',
#                   showlegend=False,
#                   title='Top 3 teams (by total goals scored) per league in the 2015/2016 season',
#                  yaxis_title='total goals scored')
# fig.show()

Analyzing data with pandas

We will now use Python Pandas library in order to store dataframes of particular interest and pursue further analysis.

Objectives

  • Determining hegemony patterns in particular leagues: whether one or several team remained at the top in terms of points (3 points for a win, 1 for a draw, 0 for a defeat)
  • Analyzing teams attributes, and see whether there is a correlation between team attributes and points
  • Analyzing players attributes

We only speak about correlation (and not causation) at this point, since the relation between players quality and team scores, for example, is ambiguous: better players will quite logically enhance the overall team attributes and level, leading to more points, and at the same time, famous and better ranked teams tend to attract top level players.

Compiling data in pandas dataframes

We chose to focus on 6 of the major European leagues:

  • Spain LIGA BBVA,
  • England Premier League,
  • Italy Serie A,
  • France Ligue 1,
  • Netherlands Eredivisie,
  • Germany 1.Bundesliga
In [16]:
#Creating a dataframe with match information: it will be the base for total points per team per season analysis
#The league ids we consider is the following: '1729', '4769', '7809', '10257', '13274', '21518'
df_match = pd.read_sql("""
                        SELECT
                            m.id,
                            m.country_id,
                            c.name country,
                            m.league_id,
                            l.name league,
                            m.season,
                            m.stage,
                            m.match_api_id,
                            m.home_team_api_id,
                            m.away_team_api_id,
                            m.home_team_goal,
                            m.away_team_goal
                        FROM Match m
                        INNER JOIN Country c ON c.id = m.country_id
                        INNER JOIN League l ON l.id = m.league_id
                        WHERE l.id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
df_match.head()
Out[16]:
id country_id country league_id league season stage match_api_id home_team_api_id away_team_api_id home_team_goal away_team_goal
0 1729 1729 England 1729 England Premier League 2008/2009 1 489042 10260 10261 1 1
1 1730 1729 England 1729 England Premier League 2008/2009 1 489043 9825 8659 1 0
2 1731 1729 England 1729 England Premier League 2008/2009 1 489044 8472 8650 0 1
3 1732 1729 England 1729 England Premier League 2008/2009 1 489045 8654 8528 2 1
4 1733 1729 England 1729 England Premier League 2008/2009 1 489046 10252 8456 4 2
In [17]:
#Creating the team, country and league dataframes
df_team = pd.read_sql("""SELECT * FROM Team;""", conn)

df_country = pd.read_sql("""SELECT * FROM Country
                        WHERE id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)

df_league = pd.read_sql("""SELECT * FROM League
                        WHERE id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
In [18]:
df_team
Out[18]:
id team_api_id team_fifa_api_id team_long_name team_short_name
0 1 9987 673.0 KRC Genk GEN
1 2 9993 675.0 Beerschot AC BAC
2 3 10000 15005.0 SV Zulte-Waregem ZUL
3 4 9994 2007.0 Sporting Lokeren LOK
4 5 9984 1750.0 KSV Cercle Brugge CEB
... ... ... ... ... ...
294 49479 10190 898.0 FC St. Gallen GAL
295 49837 10191 1715.0 FC Thun THU
296 50201 9777 324.0 Servette FC SER
297 50204 7730 1862.0 FC Lausanne-Sports LAU
298 51606 7896 NaN Lugano LUG

299 rows × 5 columns

In [19]:
#let's select the teams from the leagues that interest us
df_team = df_team.merge(df_match, how='left',
                        left_on='team_api_id',
                        right_on='home_team_api_id')

#Selecting only relevant columns and dropping duplicates, resetting index and renaming columns
df_team = df_team[['id_x', 'team_api_id', 'team_fifa_api_id', 'team_long_name', 'team_short_name', 'league_id']].drop_duplicates()
df_team.dropna(subset=['league_id'], inplace=True)
df_team.reset_index(drop=True)
df_team.rename(columns={'id_x':'id'}, inplace=True)

We need to build a dataframe containing the players list from each team, for each league (among the 6 major leagues we chose to study) and each season.

In [20]:
df_match_player = pd.read_sql("""
                                SELECT
                                    id, season, league_id, home_team_api_id, away_team_api_id,
                                    home_player_1, home_player_2, home_player_3, home_player_4, home_player_5,
                                    home_player_6, home_player_7, home_player_8, home_player_9, home_player_10, home_player_11,
                                    away_player_1, away_player_2, away_player_3, away_player_4, away_player_5,
                                    away_player_6, away_player_7, away_player_8, away_player_9, away_player_10, away_player_11
                                FROM Match
                                WHERE league_id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
In [21]:
#Creating the team attributes dataframe
df_teamatt = pd.read_sql("""SELECT * FROM Team_Attributes;""", conn)
df_teamatt.head()
Out[21]:
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 00:00:00 60 Balanced NaN Little 50 Mixed ... 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 00:00:00 52 Balanced 48.0 Normal 56 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 434 9930 2015-09-10 00:00:00 47 Balanced 41.0 Normal 54 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 77 8485 2010-02-22 00:00:00 70 Fast NaN Little 70 Long ... 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 77 8485 2011-02-22 00:00:00 47 Balanced NaN Little 52 Mixed ... 52 Normal Organised 47 Medium 47 Press 52 Normal Cover

5 rows × 25 columns

We then need to split between home and away teams and players (we begin with home teams and players), and remodel the dataframe in order to get only season, team and players as columns.

In [22]:
#Splitting between home and away teams and players
df_home_player = df_match_player[['season', 'home_team_api_id', 'home_player_1', 'home_player_2', 'home_player_3',
                                    'home_player_4', 'home_player_5', 'home_player_6', 'home_player_7', 'home_player_8',
                                    'home_player_9', 'home_player_10', 'home_player_11']]
df_home_player = pd.melt(df_home_player, id_vars=['season', 'home_team_api_id'],
                         value_vars=['home_player_1', 'home_player_2', 'home_player_3',
                                    'home_player_4', 'home_player_5', 'home_player_6',
                                    'home_player_7', 'home_player_8', 'home_player_9',
                                    'home_player_10', 'home_player_11'],
                         var_name='player_pos', value_name='player')
df_home_player.head()
Out[22]:
season home_team_api_id player_pos player
0 2008/2009 10260 home_player_1 30726.0
1 2008/2009 9825 home_player_1 23686.0
2 2008/2009 8472 home_player_1 32562.0
3 2008/2009 8654 home_player_1 36374.0
4 2008/2009 10252 home_player_1 30380.0
In [23]:
df_home_player.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187363 entries, 0 to 187362
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   season            187363 non-null  object 
 1   home_team_api_id  187363 non-null  int64  
 2   player_pos        187363 non-null  object 
 3   player            184463 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 5.7+ MB

Then, we drop null values and also drop duplicate entries - correponding to a player multiple appearances with his team throughout each year - and eventually rename columns and drop the player_pos column which won't be useful for our purpose.
We also rename the 'home_team_api_id' column as 'team_api_id', as we will perform the same manipulation to the dataframe containing away teams and players before concatenating them.

In [24]:
#Removing the 'player_pos' column
df_home_player.drop(columns=['player_pos'], inplace=True)

#Dropping duplicates and null values
df_home_player.drop_duplicates(subset=['season', 'home_team_api_id', 'player'], inplace=True)
df_home_player.dropna(inplace=True)

#Renaming columns
df_home_player.rename(columns={'home_team_api_id':'team_api_id'}, inplace=True)
df_home_player.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21320 entries, 0 to 187271
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       21320 non-null  object 
 1   team_api_id  21320 non-null  int64  
 2   player       21320 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 666.2+ KB

We can then do the same for away teams and players.

In [25]:
df_away_player = df_match_player[['season', 'away_team_api_id', 'away_player_1', 'away_player_2', 'away_player_3',
                                    'away_player_4', 'away_player_5', 'away_player_6',
                                    'away_player_7', 'away_player_8', 'away_player_9',
                                    'away_player_10', 'away_player_11']]
df_away_player = pd.melt(df_away_player, id_vars=['season', 'away_team_api_id'],
                         value_vars=['away_player_1', 'away_player_2', 'away_player_3',
                                    'away_player_4', 'away_player_5', 'away_player_6',
                                    'away_player_7', 'away_player_8', 'away_player_9',
                                    'away_player_10', 'away_player_11'],
                         var_name='player_pos', value_name='player')

df_away_player.drop(columns=['player_pos'], inplace=True)
df_away_player.drop_duplicates(subset=['season', 'away_team_api_id', 'player'], inplace=True)
df_away_player.dropna(inplace=True)
df_away_player.rename(columns={'away_team_api_id':'team_api_id'}, inplace=True)
df_away_player.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21615 entries, 0 to 187308
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       21615 non-null  object 
 1   team_api_id  21615 non-null  int64  
 2   player       21615 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 675.5+ KB

We have now compiled all home teams' players as well as away teams' players. We can concat the two dataframes and drop duplicates.

In [26]:
df_team_player = pd.concat([df_home_player, df_away_player], ignore_index=True).drop_duplicates(subset=['season', 'team_api_id', 'player'])
df_team_player.head()
Out[26]:
season team_api_id player
0 2008/2009 10260 30726.0
1 2008/2009 9825 23686.0
2 2008/2009 8472 32562.0
3 2008/2009 8654 36374.0
4 2008/2009 10252 30380.0

We just need to retrieve information on players from the database.

In [27]:
df_player = pd.read_sql("""SELECT * FROM Player;""", conn)
df_playeratt = pd.read_sql("""SELECT * FROM Player_Attributes;""", conn)

Cleaning the dataframes

In [28]:
df_match.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17033 entries, 0 to 17032
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                17033 non-null  int64 
 1   country_id        17033 non-null  int64 
 2   country           17033 non-null  object
 3   league_id         17033 non-null  int64 
 4   league            17033 non-null  object
 5   season            17033 non-null  object
 6   stage             17033 non-null  int64 
 7   match_api_id      17033 non-null  int64 
 8   home_team_api_id  17033 non-null  int64 
 9   away_team_api_id  17033 non-null  int64 
 10  home_team_goal    17033 non-null  int64 
 11  away_team_goal    17033 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 1.6+ MB

There are no null values.

In [29]:
df_match.duplicated().value_counts()
Out[29]:
False    17033
dtype: int64

No duplicates either.

In [30]:
print(df_team.info())
print(df_team.duplicated().value_counts())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 189 entries, 25 to 17109
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                189 non-null    int64  
 1   team_api_id       189 non-null    int64  
 2   team_fifa_api_id  188 non-null    float64
 3   team_long_name    189 non-null    object 
 4   team_short_name   189 non-null    object 
 5   league_id         189 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 10.3+ KB
None
False    189
dtype: int64

No null values except for one in the 'team_fifa_api_id' which we won't be using here since the team attributes dataframe has all entries for the 'team_api_id'. No duplicates either.

In [31]:
df_teamatt.head(10)
Out[31]:
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 00:00:00 60 Balanced NaN Little 50 Mixed ... 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 00:00:00 52 Balanced 48.0 Normal 56 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 434 9930 2015-09-10 00:00:00 47 Balanced 41.0 Normal 54 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 77 8485 2010-02-22 00:00:00 70 Fast NaN Little 70 Long ... 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 77 8485 2011-02-22 00:00:00 47 Balanced NaN Little 52 Mixed ... 52 Normal Organised 47 Medium 47 Press 52 Normal Cover
5 6 77 8485 2012-02-22 00:00:00 58 Balanced NaN Little 62 Mixed ... 55 Normal Organised 40 Medium 40 Press 60 Normal Cover
6 7 77 8485 2013-09-20 00:00:00 62 Balanced NaN Little 45 Mixed ... 55 Normal Organised 42 Medium 42 Press 60 Normal Cover
7 8 77 8485 2014-09-19 00:00:00 58 Balanced 64.0 Normal 62 Mixed ... 57 Normal Organised 41 Medium 42 Press 60 Normal Cover
8 9 77 8485 2015-09-10 00:00:00 59 Balanced 64.0 Normal 53 Mixed ... 63 Normal Free Form 49 Medium 45 Press 63 Normal Cover
9 10 614 8576 2010-02-22 00:00:00 60 Balanced NaN Little 40 Mixed ... 55 Normal Organised 30 Deep 70 Double 30 Narrow Offside Trap

10 rows × 25 columns

We can see that columns display information by 'pair': one column indicates the value according to one attribute, while the other contains a categorical variable (slow, fast balanced for example).
Let's look at the 'buildUpPlaySpeed' and 'buildUpPlaySpeedClass' columns for example.

In [32]:
df_teamatt.groupby('buildUpPlaySpeedClass')['buildUpPlaySpeed'].describe()
Out[32]:
count mean std min 25% 50% 75% max
buildUpPlaySpeedClass
Balanced 1184.0 51.925676 8.567738 34.0 46.0 52.0 59.0 66.0
Fast 172.0 69.767442 2.578246 67.0 68.0 70.0 70.0 80.0
Slow 102.0 29.509804 2.570761 20.0 30.0 30.0 30.0 33.0
In [33]:
df_teamatt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              1458 non-null   int64  
 1   team_fifa_api_id                1458 non-null   int64  
 2   team_api_id                     1458 non-null   int64  
 3   date                            1458 non-null   object 
 4   buildUpPlaySpeed                1458 non-null   int64  
 5   buildUpPlaySpeedClass           1458 non-null   object 
 6   buildUpPlayDribbling            489 non-null    float64
 7   buildUpPlayDribblingClass       1458 non-null   object 
 8   buildUpPlayPassing              1458 non-null   int64  
 9   buildUpPlayPassingClass         1458 non-null   object 
 10  buildUpPlayPositioningClass     1458 non-null   object 
 11  chanceCreationPassing           1458 non-null   int64  
 12  chanceCreationPassingClass      1458 non-null   object 
 13  chanceCreationCrossing          1458 non-null   int64  
 14  chanceCreationCrossingClass     1458 non-null   object 
 15  chanceCreationShooting          1458 non-null   int64  
 16  chanceCreationShootingClass     1458 non-null   object 
 17  chanceCreationPositioningClass  1458 non-null   object 
 18  defencePressure                 1458 non-null   int64  
 19  defencePressureClass            1458 non-null   object 
 20  defenceAggression               1458 non-null   int64  
 21  defenceAggressionClass          1458 non-null   object 
 22  defenceTeamWidth                1458 non-null   int64  
 23  defenceTeamWidthClass           1458 non-null   object 
 24  defenceDefenderLineClass        1458 non-null   object 
dtypes: float64(1), int64(11), object(13)
memory usage: 284.9+ KB

It seems like the dataframe is complete except for the 'buildUpPlayDribbling' column, which lacks at least 2/3 of data. Quite srangely, there aren't any missing data in the corresponding 'buildUpPlayDribblingClass' column.
Let's investigate on where these data are lacking the most.

In [34]:
df_teamatt['date'].unique()
Out[34]:
array(['2010-02-22 00:00:00', '2014-09-19 00:00:00',
       '2015-09-10 00:00:00', '2011-02-22 00:00:00',
       '2012-02-22 00:00:00', '2013-09-20 00:00:00'], dtype=object)
In [35]:
df_teamatt.groupby('date').count()
Out[35]:
id team_fifa_api_id team_api_id buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
date
2010-02-22 00:00:00 241 241 241 241 241 0 241 241 241 241 ... 241 241 241 241 241 241 241 241 241 241
2011-02-22 00:00:00 244 244 244 244 244 0 244 244 244 244 ... 244 244 244 244 244 244 244 244 244 244
2012-02-22 00:00:00 242 242 242 242 242 0 242 242 242 242 ... 242 242 242 242 242 242 242 242 242 242
2013-09-20 00:00:00 242 242 242 242 242 0 242 242 242 242 ... 242 242 242 242 242 242 242 242 242 242
2014-09-19 00:00:00 244 244 244 244 244 244 244 244 244 244 ... 244 244 244 244 244 244 244 244 244 244
2015-09-10 00:00:00 245 245 245 245 245 245 245 245 245 245 ... 245 245 245 245 245 245 245 245 245 245

6 rows × 24 columns

So it seems that all the data for 'buildUpPlayDribbling' is missing for 2010, 2011, 2012 and 2013 while it is complete for the last two years.

In [36]:
df_teamatt.groupby('buildUpPlayDribblingClass')['buildUpPlayDribbling'].value_counts(dropna=False)
Out[36]:
buildUpPlayDribblingClass  buildUpPlayDribbling
Little                     NaN                     969
                           32.0                     12
                           33.0                      6
                           29.0                      4
                           31.0                      4
                           28.0                      3
                           24.0                      2
                           30.0                      2
                           26.0                      1
                           27.0                      1
Lots                       69.0                      7
                           70.0                      5
                           67.0                      3
                           71.0                      3
                           68.0                      1
                           74.0                      1
                           77.0                      1
Normal                     52.0                     34
                           48.0                     32
                           55.0                     28
                           49.0                     24
                           53.0                     22
                           50.0                     20
                           51.0                     18
                           40.0                     17
                           54.0                     17
                           46.0                     16
                           47.0                     16
                           56.0                     15
                           42.0                     14
                           45.0                     14
                           36.0                     13
                           41.0                     13
                           35.0                     12
                           43.0                     10
                           44.0                     10
                           58.0                      9
                           60.0                      9
                           62.0                      9
                           34.0                      8
                           59.0                      8
                           37.0                      7
                           61.0                      7
                           38.0                      6
                           39.0                      6
                           57.0                      6
                           65.0                      5
                           66.0                      4
                           63.0                      2
                           64.0                      2
Name: buildUpPlayDribbling, dtype: int64

All missing values are in the 'Little' range from 'buildUpPlayDribblingClass'. That is making our work much easier since we can quite safely consider replacing missing values with an average value of 'buildUpPlayDribbling' in the 'Little' category.

In [37]:
little_mean_buildUpPlayDribbling = df_teamatt.groupby('buildUpPlayDribblingClass')['buildUpPlayDribbling'].mean()['Little']
df_teamatt.fillna(little_mean_buildUpPlayDribbling, inplace=True)
df_teamatt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              1458 non-null   int64  
 1   team_fifa_api_id                1458 non-null   int64  
 2   team_api_id                     1458 non-null   int64  
 3   date                            1458 non-null   object 
 4   buildUpPlaySpeed                1458 non-null   int64  
 5   buildUpPlaySpeedClass           1458 non-null   object 
 6   buildUpPlayDribbling            1458 non-null   float64
 7   buildUpPlayDribblingClass       1458 non-null   object 
 8   buildUpPlayPassing              1458 non-null   int64  
 9   buildUpPlayPassingClass         1458 non-null   object 
 10  buildUpPlayPositioningClass     1458 non-null   object 
 11  chanceCreationPassing           1458 non-null   int64  
 12  chanceCreationPassingClass      1458 non-null   object 
 13  chanceCreationCrossing          1458 non-null   int64  
 14  chanceCreationCrossingClass     1458 non-null   object 
 15  chanceCreationShooting          1458 non-null   int64  
 16  chanceCreationShootingClass     1458 non-null   object 
 17  chanceCreationPositioningClass  1458 non-null   object 
 18  defencePressure                 1458 non-null   int64  
 19  defencePressureClass            1458 non-null   object 
 20  defenceAggression               1458 non-null   int64  
 21  defenceAggressionClass          1458 non-null   object 
 22  defenceTeamWidth                1458 non-null   int64  
 23  defenceTeamWidthClass           1458 non-null   object 
 24  defenceDefenderLineClass        1458 non-null   object 
dtypes: float64(1), int64(11), object(13)
memory usage: 284.9+ KB

The remaining dataframes to look at are df_match_player, df_team_player, df_player and df_playeratt

In [38]:
df_match_player.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17033 entries, 0 to 17032
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                17033 non-null  int64  
 1   season            17033 non-null  object 
 2   league_id         17033 non-null  int64  
 3   home_team_api_id  17033 non-null  int64  
 4   away_team_api_id  17033 non-null  int64  
 5   home_player_1     16781 non-null  float64
 6   home_player_2     16778 non-null  float64
 7   home_player_3     16771 non-null  float64
 8   home_player_4     16780 non-null  float64
 9   home_player_5     16772 non-null  float64
 10  home_player_6     16784 non-null  float64
 11  home_player_7     16757 non-null  float64
 12  home_player_8     16773 non-null  float64
 13  home_player_9     16785 non-null  float64
 14  home_player_10    16726 non-null  float64
 15  home_player_11    16756 non-null  float64
 16  away_player_1     16790 non-null  float64
 17  away_player_2     16772 non-null  float64
 18  away_player_3     16769 non-null  float64
 19  away_player_4     16777 non-null  float64
 20  away_player_5     16762 non-null  float64
 21  away_player_6     16776 non-null  float64
 22  away_player_7     16767 non-null  float64
 23  away_player_8     16763 non-null  float64
 24  away_player_9     16772 non-null  float64
 25  away_player_10    16748 non-null  float64
 26  away_player_11    16747 non-null  float64
dtypes: float64(22), int64(4), object(1)
memory usage: 3.5+ MB

There are some missing values in this dataframe.

In [39]:
df_match_player.groupby('season').count()
Out[39]:
id league_id home_team_api_id away_team_api_id home_player_1 home_player_2 home_player_3 home_player_4 home_player_5 home_player_6 ... away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11
season
2008/2009 2132 2132 2132 2132 1895 1912 1893 1908 1912 1904 ... 1908 1899 1905 1905 1903 1905 1890 1903 1900 1910
2009/2010 2132 2132 2132 2132 2124 2125 2122 2121 2116 2126 ... 2125 2116 2118 2115 2124 2110 2126 2125 2122 2121
2010/2011 2132 2132 2132 2132 2130 2124 2128 2130 2124 2129 ... 2120 2125 2130 2121 2131 2127 2127 2122 2125 2124
2011/2012 2110 2110 2110 2110 2109 2104 2108 2109 2108 2110 ... 2106 2105 2109 2108 2105 2108 2105 2108 2106 2104
2012/2013 2132 2132 2132 2132 2128 2125 2128 2126 2130 2132 ... 2125 2131 2125 2131 2132 2131 2132 2131 2131 2130
2013/2014 2132 2132 2132 2132 2132 2132 2132 2130 2130 2131 ... 2132 2132 2132 2129 2130 2130 2131 2127 2128 2125
2014/2015 2131 2131 2131 2131 2131 2128 2130 2129 2130 2127 ... 2128 2131 2129 2129 2127 2129 2127 2129 2123 2112
2015/2016 2132 2132 2132 2132 2132 2128 2130 2127 2122 2125 ... 2128 2130 2129 2124 2124 2127 2125 2127 2113 2121

8 rows × 26 columns

In [40]:
sns.heatmap(df_match_player.isnull(), cbar=False)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c26ee50>
In [41]:
missing_player_teams = pd.read_sql("""
                                    SELECT
                                        l.name league_name,
                                        m.season,
                                        (SELECT t.team_long_name FROM Team t
                                        WHERE t.team_api_id = m.home_team_api_id) home_team,
                                        (SELECT t.team_long_name FROM Team t
                                        WHERE t.team_api_id = m.away_team_api_id) away_team
                                    FROM Match m INNER JOIN League l
                                    ON l.id = m.league_id
                                    WHERE home_player_1 IS NULL
                                    AND home_player_2 IS NULL
                                    AND home_player_3 IS NULL
                                    AND home_player_4 IS NULL
                                    AND home_player_5 IS NULL
                                    AND home_player_6 IS NULL
                                    AND home_player_7 IS NULL
                                    AND home_player_8 IS NULL
                                    AND home_player_9 IS NULL
                                    AND home_player_10 IS NULL
                                    AND home_player_11 IS NULL
                                    AND away_player_1 IS NULL
                                    AND away_player_2 IS NULL
                                    AND away_player_3 IS NULL
                                    AND away_player_4 IS NULL
                                    AND away_player_5 IS NULL
                                    AND away_player_6 IS NULL
                                    AND away_player_7 IS NULL
                                    AND away_player_8 IS NULL
                                    AND away_player_9 IS NULL
                                    AND away_player_10 IS NULL
                                    AND away_player_11 IS NULL
                                    AND league_id in ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
missing_player_teams
Out[41]:
league_name season home_team away_team
0 Netherlands Eredivisie 2008/2009 De Graafschap NAC Breda
1 Netherlands Eredivisie 2008/2009 N.E.C. PSV
2 Netherlands Eredivisie 2008/2009 AZ FC Twente
3 Netherlands Eredivisie 2008/2009 Heracles Almelo Vitesse
4 Netherlands Eredivisie 2008/2009 Willem II Roda JC Kerkrade
... ... ... ... ...
200 Netherlands Eredivisie 2008/2009 Vitesse ADO Den Haag
201 Netherlands Eredivisie 2008/2009 Roda JC Kerkrade Feyenoord
202 Netherlands Eredivisie 2008/2009 Sparta Rotterdam N.E.C.
203 Netherlands Eredivisie 2008/2009 FC Volendam De Graafschap
204 Netherlands Eredivisie 2008/2009 FC Utrecht Heracles Almelo

205 rows × 4 columns

In [42]:
missing_player_teams.groupby('season')['league_name'].value_counts()
Out[42]:
season     league_name           
2008/2009  Netherlands Eredivisie    205
Name: league_name, dtype: int64

We observe that we lack data for all players from the Dutch league in the 2008/2009 season. Some information is also missing for other leagues. We will let it like that for the moment.

In [43]:
df_team_player.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22545 entries, 0 to 42934
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       22545 non-null  object 
 1   team_api_id  22545 non-null  int64  
 2   player       22545 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 704.5+ KB
In [44]:
df_team_player.duplicated().value_counts()
Out[44]:
False    22545
dtype: int64

The team player dataframe is complete and without any duplicates.

In [45]:
df_player.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11060 entries, 0 to 11059
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  11060 non-null  int64  
 1   player_api_id       11060 non-null  int64  
 2   player_name         11060 non-null  object 
 3   player_fifa_api_id  11060 non-null  int64  
 4   birthday            11060 non-null  object 
 5   height              11060 non-null  float64
 6   weight              11060 non-null  int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 605.0+ KB
In [46]:
df_playeratt.head()
Out[46]:
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0

5 rows × 42 columns

In [47]:
df_playeratt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183978 entries, 0 to 183977
Data columns (total 42 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   183978 non-null  int64  
 1   player_fifa_api_id   183978 non-null  int64  
 2   player_api_id        183978 non-null  int64  
 3   date                 183978 non-null  object 
 4   overall_rating       183142 non-null  float64
 5   potential            183142 non-null  float64
 6   preferred_foot       183142 non-null  object 
 7   attacking_work_rate  180748 non-null  object 
 8   defensive_work_rate  183142 non-null  object 
 9   crossing             183142 non-null  float64
 10  finishing            183142 non-null  float64
 11  heading_accuracy     183142 non-null  float64
 12  short_passing        183142 non-null  float64
 13  volleys              181265 non-null  float64
 14  dribbling            183142 non-null  float64
 15  curve                181265 non-null  float64
 16  free_kick_accuracy   183142 non-null  float64
 17  long_passing         183142 non-null  float64
 18  ball_control         183142 non-null  float64
 19  acceleration         183142 non-null  float64
 20  sprint_speed         183142 non-null  float64
 21  agility              181265 non-null  float64
 22  reactions            183142 non-null  float64
 23  balance              181265 non-null  float64
 24  shot_power           183142 non-null  float64
 25  jumping              181265 non-null  float64
 26  stamina              183142 non-null  float64
 27  strength             183142 non-null  float64
 28  long_shots           183142 non-null  float64
 29  aggression           183142 non-null  float64
 30  interceptions        183142 non-null  float64
 31  positioning          183142 non-null  float64
 32  vision               181265 non-null  float64
 33  penalties            183142 non-null  float64
 34  marking              183142 non-null  float64
 35  standing_tackle      183142 non-null  float64
 36  sliding_tackle       181265 non-null  float64
 37  gk_diving            183142 non-null  float64
 38  gk_handling          183142 non-null  float64
 39  gk_kicking           183142 non-null  float64
 40  gk_positioning       183142 non-null  float64
 41  gk_reflexes          183142 non-null  float64
dtypes: float64(35), int64(3), object(4)
memory usage: 59.0+ MB

There are some missing values, but we will see if we really need all rows later on.

Eventually, before performing analysis on teams points in each league, we need to define two new columns in df_match that indicate how many points each team brings back. A win brings 3 points, a draw 1 point and a loss 0 point.

In [48]:
df_match.loc[df_match['home_team_goal'] < df_match['away_team_goal'], 'home_team_points'] = 0
df_match.loc[df_match['home_team_goal'] == df_match['away_team_goal'], 'home_team_points'] = 1
df_match.loc[df_match['home_team_goal'] > df_match['away_team_goal'], 'home_team_points'] = 3

df_match.loc[df_match['home_team_goal'] < df_match['away_team_goal'], 'away_team_points'] = 3
df_match.loc[df_match['home_team_goal'] == df_match['away_team_goal'], 'away_team_points'] = 1
df_match.loc[df_match['home_team_goal'] > df_match['away_team_goal'], 'away_team_points'] = 0

df_match
Out[48]:
id country_id country league_id league season stage match_api_id home_team_api_id away_team_api_id home_team_goal away_team_goal home_team_points away_team_points
0 1729 1729 England 1729 England Premier League 2008/2009 1 489042 10260 10261 1 1 1.0 1.0
1 1730 1729 England 1729 England Premier League 2008/2009 1 489043 9825 8659 1 0 3.0 0.0
2 1731 1729 England 1729 England Premier League 2008/2009 1 489044 8472 8650 0 1 0.0 3.0
3 1732 1729 England 1729 England Premier League 2008/2009 1 489045 8654 8528 2 1 3.0 0.0
4 1733 1729 England 1729 England Premier League 2008/2009 1 489046 10252 8456 4 2 3.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17028 24553 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030167 9906 10267 2 1 3.0 0.0
17029 24554 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030168 9864 9783 2 0 3.0 0.0
17030 24555 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030169 8315 9869 3 0 3.0 0.0
17031 24556 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030170 7878 8603 1 1 1.0 1.0
17032 24557 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030171 8370 8558 3 0 3.0 0.0

17033 rows × 14 columns

We want to get a match dataframe with tidy data (for now we have two columns for home and away teams, and two columns for home and away teams' goals).

In [49]:
#Melting the team id columns
df_match_team = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_api_id', 'away_team_api_id'],
                              var_name='home_away',
                              value_name='team_api_id')
#Creating a column only containing 'home_team' or 'away_team'
df_match_team.loc[:,'home_away'] = df_match_team.loc[:,'home_away'].str.replace('_api_id', '')
In [50]:
#Melting the team goals columns
df_match_goal = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_goal', 'away_team_goal'],
                              var_name='home_away',
                              value_name='team_goal')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_match_goal.loc[:, 'home_away'] = df_match_goal.loc[:, 'home_away'].str.replace('_goal', '')
In [51]:
#Melting the team points columns
df_match_points = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_points', 'away_team_points'],
                              var_name='home_away',
                              value_name='team_points')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_match_points.loc[:, 'home_away'] = df_match_goal.loc[:, 'home_away'].str.replace('_points', '')
In [52]:
#Creating another dataframe to store the goals conceded by a team and melting it
df_goals_conceded = df_match.copy()
df_goals_conceded['home_team_goal_conceded'] = df_match['away_team_goal']
df_goals_conceded['away_team_goal_conceded'] = df_match['home_team_goal']
df_goals_conceded = df_goals_conceded.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_goal_conceded', 'away_team_goal_conceded'],
                              var_name='home_away',
                              value_name='goal_conceded')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_goals_conceded.loc[:, 'home_away'] = df_goals_conceded.loc[:, 'home_away'].str.replace('_goal_conceded', '')
In [53]:
#Merging the dataframes back to get a final tidy dataframe that replaces the initial df_match dataframe
df_match = df_match_team.merge(right=df_match_goal, how='inner',
                               on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match = df_match.merge(right=df_match_points, how='inner',
                         on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match = df_match.merge(right=df_goals_conceded, how='inner',
                         on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match
Out[53]:
id country_id country league_id league season stage match_api_id home_away team_api_id team_goal team_points goal_conceded
0 1729 1729 England 1729 England Premier League 2008/2009 1 489042 home_team 10260 1 1.0 1
1 1730 1729 England 1729 England Premier League 2008/2009 1 489043 home_team 9825 1 3.0 0
2 1731 1729 England 1729 England Premier League 2008/2009 1 489044 home_team 8472 0 0.0 1
3 1732 1729 England 1729 England Premier League 2008/2009 1 489045 home_team 8654 2 3.0 1
4 1733 1729 England 1729 England Premier League 2008/2009 1 489046 home_team 10252 4 3.0 2
... ... ... ... ... ... ... ... ... ... ... ... ... ...
34061 24553 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030167 away_team 10267 1 0.0 2
34062 24554 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030168 away_team 9783 0 0.0 2
34063 24555 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030169 away_team 9869 0 0.0 3
34064 24556 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030170 away_team 8603 1 1.0 1
34065 24557 21518 Spain 21518 Spain LIGA BBVA 2015/2016 9 2030171 away_team 8558 0 0.0 3

34066 rows × 13 columns

We finally have a tidy dataframe with season, league, match, team (home and away), goals scored (and conceded) and points!

Exploratory data analysis

Questions

  • Is there really an advantage in playing at home? At least, is playing at home associated with more chances of winning?
  • Which teams are the best performers over the years?
  • How are the teams attributes correlated with teams performances?

Home advantage

In [54]:
home_advantage = df_match.groupby(by=['league', 'home_away'], as_index=False)['team_goal'].mean()
home_advantage
Out[54]:
league home_away team_goal
0 England Premier League away_team 1.159539
1 England Premier League home_team 1.550987
2 France Ligue 1 away_team 1.040132
3 France Ligue 1 home_team 1.402961
4 Germany 1. Bundesliga away_team 1.274918
5 Germany 1. Bundesliga home_team 1.626634
6 Italy Serie A away_team 1.116009
7 Italy Serie A home_team 1.500829
8 Netherlands Eredivisie away_team 1.301062
9 Netherlands Eredivisie home_team 1.779820
10 Spain LIGA BBVA away_team 1.135855
11 Spain LIGA BBVA home_team 1.631250
In [55]:
fig = px.bar(home_advantage, x='league', y='team_goal',
             color='home_away', barmode='group')
fig.update_layout(title='Average goals scored per match for home and away teams',
                  yaxis_title='Average goals scored per match',
                  xaxis_title='League',
                  legend_title='')
#add annotations with the difference between average home goals and away goals
for league in df_league['name'].tolist():
    mean_goal = ((home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='home_team')]['team_goal'].values
                 + home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='away_team')]['team_goal'].values)/2).item()

    goal_difference = (home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='home_team')]['team_goal'].values
                       - home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='away_team')]['team_goal'].values).item()

    fig.add_annotation(
        x=league,
        y=round(mean_goal,2),
        text=round(goal_difference,2),
        showarrow=False,
        font=dict(color='dimgrey', size=20))
# print(fig.layout)
fig.show()

We can see that, on average, home teams score approximately 0.4 more goals (between 0.35 and 0.5) than their away counterpart. It seems that this difference is more pronounced in the Spanish and Dutch leagues.

Best performing teams

For this analysis, we will retrieve teams name from df_team. Eventually, we want to see if any correlation exists between team attributes and team performance. We can consider team performance with different proxies: total goals scored, total points per season and goals scored per match.

In [56]:
#Grouping by country, league, season and team_api_id and summing up points
performers = df_match.groupby(['country', 'league', 'season', 'team_api_id'])['team_goal', 'team_points'].agg(['sum', 'count']).reset_index()
#Dropping last column (count duplicate, corresponding to the number of matches played)
performers = performers.iloc[:, :-1]
#Renaming columns
performers.columns=['country', 'league', 'season', 'team_api_id', 'team_goal', 'matches', 'team_points']
#Creating a column with goals per match
performers.loc[:, 'goal_per_match'] = performers['team_goal']/performers['matches']

#Merging with df_team to get teams name
performers = performers.merge(right=df_team, left_on='team_api_id', right_on='team_api_id')
performers = performers[['country', 'league', 'season', 'team_api_id', 'team_long_name', 'team_goal', 'team_points', 'goal_per_match']].rename(columns={'team_long_name':'team'})

#Creating a column with team name and season
performers.loc[:, 'team_season'] = performers['team']+ ' ' + performers['season'].astype('str')
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

Teams that scored the most goals

In [57]:
top_scoring_teams = performers.sort_values('team_goal', ascending=False)[:20]
fig = px.bar(top_scoring_teams, x='team_season', y='team_goal', color='team')
fig.update_layout(images = [
                            {"source": "images/real_madrid.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 4,
                             "y": 50,
                             "sizex": 30,
                             "sizey": 30,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/fc_barcelona.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 10.5,
                             "y": 50,
                             "sizex": 24,
                             "sizey": 24,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/ajax.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 13.5,
                             "y": 110,
                             "sizex": 15,
                             "sizey": 15,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/psv.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 14.5,
                             "y": 112,
                             "sizex": 10,
                             "sizey": 10,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/chelsea.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 15.5,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/psg.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 16.5,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/man_city.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 17.5,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/liverpool.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 18.37,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/bayern.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 19.4,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"}
                            ],
                 title = 'Top scoring teams',
                 yaxis_title = 'Total goals',
                 xaxis_title = 'Team')
fig.show()

This chart illustrates the domination imposed by the two Spanish leaders in terms of goals scored per season.

Teams with the best defense

To compare teams in terms of defense, we will use the total number of goals conceded as a proxy.

In [58]:
best_defenses = df_match.groupby(['country', 'league', 'season', 'team_api_id'])['goal_conceded'].sum().reset_index()

#Merging with df_team to get teams name
best_defenses = best_defenses.merge(right=df_team, left_on='team_api_id', right_on='team_api_id')
best_defenses = best_defenses[['country', 'league', 'season', 'team_api_id', 'team_long_name', 'goal_conceded']].rename(columns={'team_long_name':'team'})

#Creating a column with team name and season
best_defenses.loc[:, 'team_season'] = best_defenses['team']+ ' ' + best_defenses['season'].astype('str')
In [59]:
top_defenses = best_defenses.sort_values('goal_conceded', ascending=True)[:10]
fig = px.bar(top_defenses, x='team_season', y='goal_conceded', color='team')
fig.update_layout(title = 'Top 10 teams - least goals conceded',
                 yaxis_title = 'Total goals conceded',
                 xaxis_title = 'Team',
                 images = [
                            {"source": "images/bayern.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 1.5,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/atletico.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 3.37,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/psg.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 4.5,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/juventus.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 5.7,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/ajax.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 7.5,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/fc_barcelona.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 9,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"}]
                 )
fig.show()

Looking at best performing teams by season.

In [60]:
df_teamatt.head()
Out[60]:
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 00:00:00 60 Balanced 30.485714 Little 50 Mixed ... 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 00:00:00 52 Balanced 48.000000 Normal 56 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 434 9930 2015-09-10 00:00:00 47 Balanced 41.000000 Normal 54 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 77 8485 2010-02-22 00:00:00 70 Fast 30.485714 Little 70 Long ... 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 77 8485 2011-02-22 00:00:00 47 Balanced 30.485714 Little 52 Mixed ... 52 Normal Organised 47 Medium 47 Press 52 Normal Cover

5 rows × 25 columns

In [61]:
#We will change the dates a bit to coincide with the season
season_dict = {'2010-02-22 00:00:00' : '2010/2011',
               '2011-02-22 00:00:00' : '2011/2012',
               '2012-02-22 00:00:00' : '2012/2013',
               '2013-09-20 00:00:00' : '2013/2014',
               '2014-09-19 00:00:00' : '2014/2015',
               '2015-09-10 00:00:00' : '2015/2016'}
#Replacing df_teamatt 'date' column values with season_dict
df_teamatt.loc[:, 'date'] = df_teamatt.loc[:, 'date'].map(season_dict)
df_teamatt.rename(columns={'date':'season'}, inplace=True)

#Keeping only seasons between 2010/2011 and 2015/2016
performers = performers[performers['season'].isin(['2010/2011', '2011/2012', '2012/2013', '2013/2014', '2014/2015', '2015/2016'])]

#Merging performers with df_teamatt
performers = performers.merge(right=df_teamatt, on=['season', 'team_api_id'])

#Dropping 'id' and 'team_fifa_api_id' columns
performers.drop(columns=['id', 'team_fifa_api_id'], inplace=True)

#Dropping attributes columns with categorical variables (all columns with 'Class')
performers.drop(columns=[x for x in performers.columns if 'Class' in x], inplace=True)
In [62]:
performers.head()
Out[62]:
country league season team_api_id team team_goal team_points goal_per_match team_season buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth
0 England England Premier League 2010/2011 8455 Chelsea 69 71.0 1.815789 Chelsea 2010/2011 70 30.485714 60 56 70 70 30 60 35
1 England England Premier League 2011/2012 8455 Chelsea 65 64.0 1.710526 Chelsea 2011/2012 60 30.485714 50 60 60 80 35 55 50
2 England England Premier League 2012/2013 8455 Chelsea 75 75.0 1.973684 Chelsea 2012/2013 60 30.485714 45 55 45 45 60 62 46
3 England England Premier League 2013/2014 8455 Chelsea 71 82.0 1.868421 Chelsea 2013/2014 46 30.485714 41 46 61 63 42 43 46
4 England England Premier League 2014/2015 8455 Chelsea 73 87.0 1.921053 Chelsea 2014/2015 67 52.000000 38 41 61 68 39 41 46

Comparing average attributes

Let's see if we can find any link between a variable containing an average of all attributes and the 3 teams performances indicators (total goals per season, team points, goals per match).

In [63]:
#Building a new variable avg_att
performers.loc[:, 'avg_att'] = performers.iloc[:, 8:].mean(axis=1)
performers.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 0 to 682
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   country                 683 non-null    object 
 1   league                  683 non-null    object 
 2   season                  683 non-null    object 
 3   team_api_id             683 non-null    int64  
 4   team                    683 non-null    object 
 5   team_goal               683 non-null    int64  
 6   team_points             683 non-null    float64
 7   goal_per_match          683 non-null    float64
 8   team_season             683 non-null    object 
 9   buildUpPlaySpeed        683 non-null    int64  
 10  buildUpPlayDribbling    683 non-null    float64
 11  buildUpPlayPassing      683 non-null    int64  
 12  chanceCreationPassing   683 non-null    int64  
 13  chanceCreationCrossing  683 non-null    int64  
 14  chanceCreationShooting  683 non-null    int64  
 15  defencePressure         683 non-null    int64  
 16  defenceAggression       683 non-null    int64  
 17  defenceTeamWidth        683 non-null    int64  
 18  avg_att                 683 non-null    float64
dtypes: float64(4), int64(10), object(5)
memory usage: 106.7+ KB
In [64]:
fig = px.scatter(performers, x='avg_att', y='team_goal')
fig.show()
In [65]:
fig = px.scatter(performers, x='avg_att', y='team_points')
fig.show()
In [66]:
fig = px.scatter(performers, x='avg_att', y='goal_per_match')
fig.show()
In [67]:
performers.loc[:, ['team_goal', 'team_points', 'goal_per_match', 'avg_att']].corr()
Out[67]:
team_goal team_points goal_per_match avg_att
team_goal 1.000000 0.861334 0.985253 -0.008235
team_points 0.861334 1.000000 0.820584 0.035206
goal_per_match 0.985253 0.820584 1.000000 -0.027214
avg_att -0.008235 0.035206 -0.027214 1.000000

At this point, we cannot see any clear pattern between team average attributes and their performance considering the three proxies selected. We will pursue the analysis with other techniques.