Some analysis on the European Soccer Database¶

This database was compiled by Hugo Mathien and is available here on Kaggle.

As described by Hugo, the information gathered here is really impressive:

+25,000 matches
+10,000 players
11 European Countries with their lead championship
Seasons 2008 to 2016
Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates
Team line up with squad formation (X, Y coordinates)
Betting odds from up to 10 providers
Detailed match events (goal types, possession, corner, cross, fouls, cards etc…) for +10,000 matches

While this clearly represents quality material to perform in depth analysis and eventually predict matches outcome, I will focus here on performing exploratory analysis on teams, leagues or particular players and/or categories of players.

#importing classic analysis libraries
import pandas as pd
import numpy as np

#sqlite3
import sqlite3

#plotting with plotly (offline default mode)
import plotly
plotly.__version__
import plotly.express as px
import plotly.graph_objects as go
#plotting directly in Jupyter notebook
import plotly.io as pio
pio.renderers.default='notebook'

#plotting with matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

Let's see what is in the database file

conn = sqlite3.connect('database.sqlite')
tables = pd.read_sql("""SELECT * FROM sqlite_master WHERE type='table' ;""", conn)
tables

As we already knew, we've got quite a lot of information here, with the following tables:

Match
Player
Player_Attributes
League
Country
Team
Team_attributes

We quickly draw the scheme above in order to be able to better visualize tables information and indexes/relations between tables. This will make our work much easier by being able to refer to it later on.

from IPython.display import Image
Image(filename='Untitled 1.png', width=800, height=800)

Let's inspect country, league and team tables.

pd.read_sql("""SELECT * FROM Country;""", conn)

pd.read_sql("""SELECT * FROM League;""", conn)

We can see that there is only one league per country, and that league index and country index are equal.

pd.read_sql("""SELECT * FROM Team LIMIT 20;""", conn)

Let's create a table of all the matches with home and away teams as well as scores.

match_list = pd.read_sql("""
                            SELECT
                                m.id match_id,
                                c.name country,
                                l.name league,
                                m.season,
                                m.stage,
                                m.date,
                                ht.team_long_name home_team,
                                m.home_team_goal,
                                m.away_team_goal,
                                at.team_long_name away_team
                            FROM Match m
                            INNER JOIN Country c ON m.country_id = c.id
                            INNER JOIN League l ON m.league_id = l.id
                            INNER JOIN Team ht ON ht.team_api_id = m.home_team_api_id
                            INNER JOIN Team at ON at.team_api_id = m.away_team_api_id
                            ORDER BY 2, 4 DESC;""", conn)
match_list

#example for Ligue 1 matches from the last recorded season
ligue1_1516 = pd.read_sql("""
                            SELECT
                                m.id match_id,
                                c.name country,
                                l.name league,
                                m.season,
                                m.stage,
                                m.date,
                                ht.team_long_name home_team,
                                m.home_team_goal,
                                m.away_team_goal,
                                at.team_long_name away_team
                            FROM Match m
                            INNER JOIN Country c ON m.country_id = c.id
                            INNER JOIN League l ON m.league_id = l.id
                            INNER JOIN Team ht ON ht.team_api_id = m.home_team_api_id
                            INNER JOIN Team at ON at.team_api_id = m.away_team_api_id
                            WHERE country='France' AND season='2015/2016'
                            ORDER BY 6 ASC;""", conn)
ligue1_1516

Analysis on leagues¶

Say our initial aim is to focus on European national leagues, and we expect to get some statistics (on player characteristics, in terms of total goals, fouls, ...) in order to compare them.

What are the leagues where there are the most goals?¶

Of course this is not a clear indicator of the quality of the league, since various factors influence the total number of goals in a league (presence of top scorers, average quality of strikers, average quality of defenses, type of play, ...).
Let's try to draw a graph with all leagues total goals scored for each year.

total_goals = pd.read_sql("""
                            SELECT
                                m.season,
                                c.name country,
                                l.name league,
                                SUM(m.home_team_goal + m.away_team_goal) total_goals
                            FROM Match m
                            INNER JOIN Country c ON c.id = m.country_id
                            INNER JOIN League l ON l.id = m.league_id
                            GROUP BY 1, 2
                            ORDER BY 1 DESC, 4 DESC;""", conn)
total_goals

#plotting total goals scored per year
fig = px.line(total_goals, x='season', y='total_goals', color='league')
fig.show()

We can see that there is a lot of disparity between leagues, and it looks like we are lacking data for the 2013/2014 Belgian championship.
Let's focus on the top 6 leagues.

#plotting total goals scored per year for the top 6 leagues in terms of total goal scored

#creating a list of the seasons to order the chart's x_axis
seasons_list = ['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013', '2013/2014', '2014/2015', '2015/2016', '2016/2017', '2017/2018', '2018/2019']
total_goals_top = total_goals[total_goals['country'].isin(['Spain', 'England', 'Italy', 'France', 'Netherlands', 'Germany'])]

Between the 2008/2009 season and 2015/2016 season, the gap between #1 and #5 leagues in terms of total goals scored has slightly decreased.
In particular, total goals scored in the French Ligue 1 have considerably increased (+100 approx.), while it decreased for German Bundesliga, which is no longer in the top 5.
At the same time, there is also a noticeable decrease in the Spanish Liga BBVA, from 1101 to 1043.
There was also a huge increase for the Dutch championship total between the 2009/2010 season and the 2011/2012 season, which was quickly mitigated with total goals scored coming back to approx 900 for the last season.

Focus on top teams from each league¶

Seeing the total goals scored for each league is interesting, but we would now rather see which are the top 3 contributing teams for each league.

#creating a table with a single column for teams and a single column for goals scored per match: example with France
# test = pd.read_sql("""                 
#                     SELECT 
#                         m.season, 
#                         l.name league,
#                         c.name country,
#                         m.stage,
#                         t.team_api_id,
#                         t.team_long_name team,
#                         CASE
#                             WHEN t.team_api_id=m.home_team_api_id THEN m.home_team_goal
#                             WHEN t.team_api_id=m.away_team_api_id THEN m.away_team_goal
#                             END goals
#                     FROM Match m
#                     INNER JOIN League l ON m.league_id = l.id
#                     INNER JOIN Country c ON c.id = m.country_id
#                     INNER JOIN Team t ON t.team_api_id = m.home_team_api_id OR t.team_api_id = m.away_team_api_id
#                     WHERE m.season='2015/2016' AND c.name='France' AND m.stage=1;""", conn)
# test

#now doing the same for all countries and seasons and selecting the top 3 teams from each league
# top_teams = pd.read_sql("""                 
#                     WITH team_goals AS
#                         (SELECT 
#                             m.season, 
#                             l.name league,
#                             c.name country,
#                             m.stage,
#                             t.team_api_id,
#                             t.team_long_name team,
#                             CASE
#                                 WHEN t.team_api_id=m.home_team_api_id THEN m.home_team_goal
#                                 WHEN t.team_api_id=m.away_team_api_id THEN m.away_team_goal
#                                 END goals
#                         FROM Match m
#                         INNER JOIN League l ON m.league_id = l.id
#                         INNER JOIN Country c ON c.id = m.country_id
#                         INNER JOIN Team t ON t.team_api_id = m.home_team_api_id OR t.team_api_id = m.away_team_api_id),
                    
#                     classified_teams AS
#                         (SELECT
#                             tg.season,
#                             tg.league,
#                             tg.team,
#                             ROW_NUMBER() OVER (PARTITION BY tg.season, tg.league ORDER BY SUM(tg.goals) DESC) AS rank,
#                             SUM(tg.goals) total_team_goals,
#                             CAST(SUM(tg.goals) AS float)/CAST((SELECT SUM(tg2.goals) FROM team_goals tg2
#                                                                 WHERE tg2.season=tg.season AND tg2.league=tg.league
#                                                                 GROUP BY tg2.season, tg2.league) AS float) pct_total_league_goals,
#                             (SELECT SUM(tg2.goals) FROM team_goals tg2
#                             WHERE tg2.season=tg.season AND tg2.league=tg.league
#                             GROUP BY tg2.season, tg2.league) total_league_goals
#                         FROM team_goals tg
#                         GROUP BY 1, 3
#                         ORDER BY 1, 2, 4 ASC)
                    
#                     SELECT *
#                     FROM classified_teams
#                     WHERE rank <=3;""", conn)
# top_teams

Let's see the output for the 2015/2016 season and compare the top clubs from each league.

# top_teams_1516 = top_teams[top_teams['season']=='2015/2016']
# top_teams_1516

# #creating three subsets of the top_teams_1516 dataframe in order to plot a grouped bar chart
# rank1 = top_teams_1516[top_teams_1516['rank']==1]
# rank2 = top_teams_1516[top_teams_1516['rank']==2]
# rank3 = top_teams_1516[top_teams_1516['rank']==3]

# fig = go.Figure(data=[
#     go.Bar(x=rank1['league'], y=rank1['total_team_goals'], text=rank1['team'], textposition='auto'),
#     go.Bar(x=rank2['league'], y=rank2['total_team_goals'], text=rank2['team'], textposition='auto'),
#     go.Bar(x=rank3['league'], y=rank3['total_team_goals'], text=rank3['team'], textposition='auto')
# ])
# # Change the bar mode
# fig.update_layout(barmode='group',
#                   showlegend=False,
#                   title='Top 3 teams (by total goals scored) per league in the 2015/2016 season',
#                  yaxis_title='total goals scored')
# fig.show()

Analyzing data with pandas¶

We will now use Python Pandas library in order to store dataframes of particular interest and pursue further analysis.

Objectives¶

Determining hegemony patterns in particular leagues: whether one or several team remained at the top in terms of points (3 points for a win, 1 for a draw, 0 for a defeat)
Analyzing teams attributes, and see whether there is a correlation between team attributes and points
Analyzing players attributes

We only speak about correlation (and not causation) at this point, since the relation between players quality and team scores, for example, is ambiguous: better players will quite logically enhance the overall team attributes and level, leading to more points, and at the same time, famous and better ranked teams tend to attract top level players.

Compiling data in pandas dataframes¶

We chose to focus on 6 of the major European leagues:

Spain LIGA BBVA,
England Premier League,
Italy Serie A,
France Ligue 1,
Netherlands Eredivisie,
Germany 1.Bundesliga

#Creating a dataframe with match information: it will be the base for total points per team per season analysis
#The league ids we consider is the following: '1729', '4769', '7809', '10257', '13274', '21518'
df_match = pd.read_sql("""
                        SELECT
                            m.id,
                            m.country_id,
                            c.name country,
                            m.league_id,
                            l.name league,
                            m.season,
                            m.stage,
                            m.match_api_id,
                            m.home_team_api_id,
                            m.away_team_api_id,
                            m.home_team_goal,
                            m.away_team_goal
                        FROM Match m
                        INNER JOIN Country c ON c.id = m.country_id
                        INNER JOIN League l ON l.id = m.league_id
                        WHERE l.id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
df_match.head()

#Creating the team, country and league dataframes
df_team = pd.read_sql("""SELECT * FROM Team;""", conn)

df_country = pd.read_sql("""SELECT * FROM Country
                        WHERE id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)

df_league = pd.read_sql("""SELECT * FROM League
                        WHERE id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)

df_team

#let's select the teams from the leagues that interest us
df_team = df_team.merge(df_match, how='left',
                        left_on='team_api_id',
                        right_on='home_team_api_id')

#Selecting only relevant columns and dropping duplicates, resetting index and renaming columns
df_team = df_team[['id_x', 'team_api_id', 'team_fifa_api_id', 'team_long_name', 'team_short_name', 'league_id']].drop_duplicates()
df_team.dropna(subset=['league_id'], inplace=True)
df_team.reset_index(drop=True)
df_team.rename(columns={'id_x':'id'}, inplace=True)

We need to build a dataframe containing the players list from each team, for each league (among the 6 major leagues we chose to study) and each season.

df_match_player = pd.read_sql("""
                                SELECT
                                    id, season, league_id, home_team_api_id, away_team_api_id,
                                    home_player_1, home_player_2, home_player_3, home_player_4, home_player_5,
                                    home_player_6, home_player_7, home_player_8, home_player_9, home_player_10, home_player_11,
                                    away_player_1, away_player_2, away_player_3, away_player_4, away_player_5,
                                    away_player_6, away_player_7, away_player_8, away_player_9, away_player_10, away_player_11
                                FROM Match
                                WHERE league_id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)

#Creating the team attributes dataframe
df_teamatt = pd.read_sql("""SELECT * FROM Team_Attributes;""", conn)
df_teamatt.head()

We then need to split between home and away teams and players (we begin with home teams and players), and remodel the dataframe in order to get only season, team and players as columns.

#Splitting between home and away teams and players
df_home_player = df_match_player[['season', 'home_team_api_id', 'home_player_1', 'home_player_2', 'home_player_3',
                                    'home_player_4', 'home_player_5', 'home_player_6', 'home_player_7', 'home_player_8',
                                    'home_player_9', 'home_player_10', 'home_player_11']]
df_home_player = pd.melt(df_home_player, id_vars=['season', 'home_team_api_id'],
                         value_vars=['home_player_1', 'home_player_2', 'home_player_3',
                                    'home_player_4', 'home_player_5', 'home_player_6',
                                    'home_player_7', 'home_player_8', 'home_player_9',
                                    'home_player_10', 'home_player_11'],
                         var_name='player_pos', value_name='player')
df_home_player.head()

df_home_player.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187363 entries, 0 to 187362
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   season            187363 non-null  object 
 1   home_team_api_id  187363 non-null  int64  
 2   player_pos        187363 non-null  object 
 3   player            184463 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 5.7+ MB

Then, we drop null values and also drop duplicate entries - correponding to a player multiple appearances with his team throughout each year - and eventually rename columns and drop the player_pos column which won't be useful for our purpose.
We also rename the 'home_team_api_id' column as 'team_api_id', as we will perform the same manipulation to the dataframe containing away teams and players before concatenating them.

#Removing the 'player_pos' column
df_home_player.drop(columns=['player_pos'], inplace=True)

#Dropping duplicates and null values
df_home_player.drop_duplicates(subset=['season', 'home_team_api_id', 'player'], inplace=True)
df_home_player.dropna(inplace=True)

#Renaming columns
df_home_player.rename(columns={'home_team_api_id':'team_api_id'}, inplace=True)
df_home_player.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21320 entries, 0 to 187271
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       21320 non-null  object 
 1   team_api_id  21320 non-null  int64  
 2   player       21320 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 666.2+ KB

We can then do the same for away teams and players.

df_away_player = df_match_player[['season', 'away_team_api_id', 'away_player_1', 'away_player_2', 'away_player_3',
                                    'away_player_4', 'away_player_5', 'away_player_6',
                                    'away_player_7', 'away_player_8', 'away_player_9',
                                    'away_player_10', 'away_player_11']]
df_away_player = pd.melt(df_away_player, id_vars=['season', 'away_team_api_id'],
                         value_vars=['away_player_1', 'away_player_2', 'away_player_3',
                                    'away_player_4', 'away_player_5', 'away_player_6',
                                    'away_player_7', 'away_player_8', 'away_player_9',
                                    'away_player_10', 'away_player_11'],
                         var_name='player_pos', value_name='player')

df_away_player.drop(columns=['player_pos'], inplace=True)
df_away_player.drop_duplicates(subset=['season', 'away_team_api_id', 'player'], inplace=True)
df_away_player.dropna(inplace=True)
df_away_player.rename(columns={'away_team_api_id':'team_api_id'}, inplace=True)
df_away_player.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21615 entries, 0 to 187308
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       21615 non-null  object 
 1   team_api_id  21615 non-null  int64  
 2   player       21615 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 675.5+ KB

We have now compiled all home teams' players as well as away teams' players. We can concat the two dataframes and drop duplicates.

df_team_player = pd.concat([df_home_player, df_away_player], ignore_index=True).drop_duplicates(subset=['season', 'team_api_id', 'player'])
df_team_player.head()

We just need to retrieve information on players from the database.

df_player = pd.read_sql("""SELECT * FROM Player;""", conn)
df_playeratt = pd.read_sql("""SELECT * FROM Player_Attributes;""", conn)

Cleaning the dataframes¶

df_match.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17033 entries, 0 to 17032
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                17033 non-null  int64 
 1   country_id        17033 non-null  int64 
 2   country           17033 non-null  object
 3   league_id         17033 non-null  int64 
 4   league            17033 non-null  object
 5   season            17033 non-null  object
 6   stage             17033 non-null  int64 
 7   match_api_id      17033 non-null  int64 
 8   home_team_api_id  17033 non-null  int64 
 9   away_team_api_id  17033 non-null  int64 
 10  home_team_goal    17033 non-null  int64 
 11  away_team_goal    17033 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 1.6+ MB

There are no null values.

df_match.duplicated().value_counts()

False    17033
dtype: int64

No duplicates either.

print(df_team.info())
print(df_team.duplicated().value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189 entries, 25 to 17109
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                189 non-null    int64  
 1   team_api_id       189 non-null    int64  
 2   team_fifa_api_id  188 non-null    float64
 3   team_long_name    189 non-null    object 
 4   team_short_name   189 non-null    object 
 5   league_id         189 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 10.3+ KB
None
False    189
dtype: int64

No null values except for one in the 'team_fifa_api_id' which we won't be using here since the team attributes dataframe has all entries for the 'team_api_id'. No duplicates either.

df_teamatt.head(10)

We can see that columns display information by 'pair': one column indicates the value according to one attribute, while the other contains a categorical variable (slow, fast balanced for example).
Let's look at the 'buildUpPlaySpeed' and 'buildUpPlaySpeedClass' columns for example.

df_teamatt.groupby('buildUpPlaySpeedClass')['buildUpPlaySpeed'].describe()

df_teamatt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              1458 non-null   int64  
 1   team_fifa_api_id                1458 non-null   int64  
 2   team_api_id                     1458 non-null   int64  
 3   date                            1458 non-null   object 
 4   buildUpPlaySpeed                1458 non-null   int64  
 5   buildUpPlaySpeedClass           1458 non-null   object 
 6   buildUpPlayDribbling            489 non-null    float64
 7   buildUpPlayDribblingClass       1458 non-null   object 
 8   buildUpPlayPassing              1458 non-null   int64  
 9   buildUpPlayPassingClass         1458 non-null   object 
 10  buildUpPlayPositioningClass     1458 non-null   object 
 11  chanceCreationPassing           1458 non-null   int64  
 12  chanceCreationPassingClass      1458 non-null   object 
 13  chanceCreationCrossing          1458 non-null   int64  
 14  chanceCreationCrossingClass     1458 non-null   object 
 15  chanceCreationShooting          1458 non-null   int64  
 16  chanceCreationShootingClass     1458 non-null   object 
 17  chanceCreationPositioningClass  1458 non-null   object 
 18  defencePressure                 1458 non-null   int64  
 19  defencePressureClass            1458 non-null   object 
 20  defenceAggression               1458 non-null   int64  
 21  defenceAggressionClass          1458 non-null   object 
 22  defenceTeamWidth                1458 non-null   int64  
 23  defenceTeamWidthClass           1458 non-null   object 
 24  defenceDefenderLineClass        1458 non-null   object 
dtypes: float64(1), int64(11), object(13)
memory usage: 284.9+ KB

It seems like the dataframe is complete except for the 'buildUpPlayDribbling' column, which lacks at least 2/3 of data. Quite srangely, there aren't any missing data in the corresponding 'buildUpPlayDribblingClass' column.
Let's investigate on where these data are lacking the most.

df_teamatt['date'].unique()

array(['2010-02-22 00:00:00', '2014-09-19 00:00:00',
       '2015-09-10 00:00:00', '2011-02-22 00:00:00',
       '2012-02-22 00:00:00', '2013-09-20 00:00:00'], dtype=object)

df_teamatt.groupby('date').count()

So it seems that all the data for 'buildUpPlayDribbling' is missing for 2010, 2011, 2012 and 2013 while it is complete for the last two years.

df_teamatt.groupby('buildUpPlayDribblingClass')['buildUpPlayDribbling'].value_counts(dropna=False)

buildUpPlayDribblingClass  buildUpPlayDribbling
Little                     NaN                     969
                           32.0                     12
                           33.0                      6
                           29.0                      4
                           31.0                      4
                           28.0                      3
                           24.0                      2
                           30.0                      2
                           26.0                      1
                           27.0                      1
Lots                       69.0                      7
                           70.0                      5
                           67.0                      3
                           71.0                      3
                           68.0                      1
                           74.0                      1
                           77.0                      1
Normal                     52.0                     34
                           48.0                     32
                           55.0                     28
                           49.0                     24
                           53.0                     22
                           50.0                     20
                           51.0                     18
                           40.0                     17
                           54.0                     17
                           46.0                     16
                           47.0                     16
                           56.0                     15
                           42.0                     14
                           45.0                     14
                           36.0                     13
                           41.0                     13
                           35.0                     12
                           43.0                     10
                           44.0                     10
                           58.0                      9
                           60.0                      9
                           62.0                      9
                           34.0                      8
                           59.0                      8
                           37.0                      7
                           61.0                      7
                           38.0                      6
                           39.0                      6
                           57.0                      6
                           65.0                      5
                           66.0                      4
                           63.0                      2
                           64.0                      2
Name: buildUpPlayDribbling, dtype: int64

All missing values are in the 'Little' range from 'buildUpPlayDribblingClass'. That is making our work much easier since we can quite safely consider replacing missing values with an average value of 'buildUpPlayDribbling' in the 'Little' category.

little_mean_buildUpPlayDribbling = df_teamatt.groupby('buildUpPlayDribblingClass')['buildUpPlayDribbling'].mean()['Little']
df_teamatt.fillna(little_mean_buildUpPlayDribbling, inplace=True)
df_teamatt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              1458 non-null   int64  
 1   team_fifa_api_id                1458 non-null   int64  
 2   team_api_id                     1458 non-null   int64  
 3   date                            1458 non-null   object 
 4   buildUpPlaySpeed                1458 non-null   int64  
 5   buildUpPlaySpeedClass           1458 non-null   object 
 6   buildUpPlayDribbling            1458 non-null   float64
 7   buildUpPlayDribblingClass       1458 non-null   object 
 8   buildUpPlayPassing              1458 non-null   int64  
 9   buildUpPlayPassingClass         1458 non-null   object 
 10  buildUpPlayPositioningClass     1458 non-null   object 
 11  chanceCreationPassing           1458 non-null   int64  
 12  chanceCreationPassingClass      1458 non-null   object 
 13  chanceCreationCrossing          1458 non-null   int64  
 14  chanceCreationCrossingClass     1458 non-null   object 
 15  chanceCreationShooting          1458 non-null   int64  
 16  chanceCreationShootingClass     1458 non-null   object 
 17  chanceCreationPositioningClass  1458 non-null   object 
 18  defencePressure                 1458 non-null   int64  
 19  defencePressureClass            1458 non-null   object 
 20  defenceAggression               1458 non-null   int64  
 21  defenceAggressionClass          1458 non-null   object 
 22  defenceTeamWidth                1458 non-null   int64  
 23  defenceTeamWidthClass           1458 non-null   object 
 24  defenceDefenderLineClass        1458 non-null   object 
dtypes: float64(1), int64(11), object(13)
memory usage: 284.9+ KB

The remaining dataframes to look at are df_match_player, df_team_player, df_player and df_playeratt

df_match_player.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17033 entries, 0 to 17032
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                17033 non-null  int64  
 1   season            17033 non-null  object 
 2   league_id         17033 non-null  int64  
 3   home_team_api_id  17033 non-null  int64  
 4   away_team_api_id  17033 non-null  int64  
 5   home_player_1     16781 non-null  float64
 6   home_player_2     16778 non-null  float64
 7   home_player_3     16771 non-null  float64
 8   home_player_4     16780 non-null  float64
 9   home_player_5     16772 non-null  float64
 10  home_player_6     16784 non-null  float64
 11  home_player_7     16757 non-null  float64
 12  home_player_8     16773 non-null  float64
 13  home_player_9     16785 non-null  float64
 14  home_player_10    16726 non-null  float64
 15  home_player_11    16756 non-null  float64
 16  away_player_1     16790 non-null  float64
 17  away_player_2     16772 non-null  float64
 18  away_player_3     16769 non-null  float64
 19  away_player_4     16777 non-null  float64
 20  away_player_5     16762 non-null  float64
 21  away_player_6     16776 non-null  float64
 22  away_player_7     16767 non-null  float64
 23  away_player_8     16763 non-null  float64
 24  away_player_9     16772 non-null  float64
 25  away_player_10    16748 non-null  float64
 26  away_player_11    16747 non-null  float64
dtypes: float64(22), int64(4), object(1)
memory usage: 3.5+ MB

There are some missing values in this dataframe.

df_match_player.groupby('season').count()

sns.heatmap(df_match_player.isnull(), cbar=False)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1c26ee50>

missing_player_teams = pd.read_sql("""
                                    SELECT
                                        l.name league_name,
                                        m.season,
                                        (SELECT t.team_long_name FROM Team t
                                        WHERE t.team_api_id = m.home_team_api_id) home_team,
                                        (SELECT t.team_long_name FROM Team t
                                        WHERE t.team_api_id = m.away_team_api_id) away_team
                                    FROM Match m INNER JOIN League l
                                    ON l.id = m.league_id
                                    WHERE home_player_1 IS NULL
                                    AND home_player_2 IS NULL
                                    AND home_player_3 IS NULL
                                    AND home_player_4 IS NULL
                                    AND home_player_5 IS NULL
                                    AND home_player_6 IS NULL
                                    AND home_player_7 IS NULL
                                    AND home_player_8 IS NULL
                                    AND home_player_9 IS NULL
                                    AND home_player_10 IS NULL
                                    AND home_player_11 IS NULL
                                    AND away_player_1 IS NULL
                                    AND away_player_2 IS NULL
                                    AND away_player_3 IS NULL
                                    AND away_player_4 IS NULL
                                    AND away_player_5 IS NULL
                                    AND away_player_6 IS NULL
                                    AND away_player_7 IS NULL
                                    AND away_player_8 IS NULL
                                    AND away_player_9 IS NULL
                                    AND away_player_10 IS NULL
                                    AND away_player_11 IS NULL
                                    AND league_id in ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
missing_player_teams

missing_player_teams.groupby('season')['league_name'].value_counts()

season     league_name           
2008/2009  Netherlands Eredivisie    205
Name: league_name, dtype: int64

We observe that we lack data for all players from the Dutch league in the 2008/2009 season. Some information is also missing for other leagues. We will let it like that for the moment.

df_team_player.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22545 entries, 0 to 42934
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       22545 non-null  object 
 1   team_api_id  22545 non-null  int64  
 2   player       22545 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 704.5+ KB

df_team_player.duplicated().value_counts()

False    22545
dtype: int64

The team player dataframe is complete and without any duplicates.

df_player.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11060 entries, 0 to 11059
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  11060 non-null  int64  
 1   player_api_id       11060 non-null  int64  
 2   player_name         11060 non-null  object 
 3   player_fifa_api_id  11060 non-null  int64  
 4   birthday            11060 non-null  object 
 5   height              11060 non-null  float64
 6   weight              11060 non-null  int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 605.0+ KB

df_playeratt.head()

df_playeratt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183978 entries, 0 to 183977
Data columns (total 42 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   183978 non-null  int64  
 1   player_fifa_api_id   183978 non-null  int64  
 2   player_api_id        183978 non-null  int64  
 3   date                 183978 non-null  object 
 4   overall_rating       183142 non-null  float64
 5   potential            183142 non-null  float64
 6   preferred_foot       183142 non-null  object 
 7   attacking_work_rate  180748 non-null  object 
 8   defensive_work_rate  183142 non-null  object 
 9   crossing             183142 non-null  float64
 10  finishing            183142 non-null  float64
 11  heading_accuracy     183142 non-null  float64
 12  short_passing        183142 non-null  float64
 13  volleys              181265 non-null  float64
 14  dribbling            183142 non-null  float64
 15  curve                181265 non-null  float64
 16  free_kick_accuracy   183142 non-null  float64
 17  long_passing         183142 non-null  float64
 18  ball_control         183142 non-null  float64
 19  acceleration         183142 non-null  float64
 20  sprint_speed         183142 non-null  float64
 21  agility              181265 non-null  float64
 22  reactions            183142 non-null  float64
 23  balance              181265 non-null  float64
 24  shot_power           183142 non-null  float64
 25  jumping              181265 non-null  float64
 26  stamina              183142 non-null  float64
 27  strength             183142 non-null  float64
 28  long_shots           183142 non-null  float64
 29  aggression           183142 non-null  float64
 30  interceptions        183142 non-null  float64
 31  positioning          183142 non-null  float64
 32  vision               181265 non-null  float64
 33  penalties            183142 non-null  float64
 34  marking              183142 non-null  float64
 35  standing_tackle      183142 non-null  float64
 36  sliding_tackle       181265 non-null  float64
 37  gk_diving            183142 non-null  float64
 38  gk_handling          183142 non-null  float64
 39  gk_kicking           183142 non-null  float64
 40  gk_positioning       183142 non-null  float64
 41  gk_reflexes          183142 non-null  float64
dtypes: float64(35), int64(3), object(4)
memory usage: 59.0+ MB

There are some missing values, but we will see if we really need all rows later on.

Eventually, before performing analysis on teams points in each league, we need to define two new columns in df_match that indicate how many points each team brings back. A win brings 3 points, a draw 1 point and a loss 0 point.

df_match.loc[df_match['home_team_goal'] < df_match['away_team_goal'], 'home_team_points'] = 0
df_match.loc[df_match['home_team_goal'] == df_match['away_team_goal'], 'home_team_points'] = 1
df_match.loc[df_match['home_team_goal'] > df_match['away_team_goal'], 'home_team_points'] = 3

df_match.loc[df_match['home_team_goal'] < df_match['away_team_goal'], 'away_team_points'] = 3
df_match.loc[df_match['home_team_goal'] == df_match['away_team_goal'], 'away_team_points'] = 1
df_match.loc[df_match['home_team_goal'] > df_match['away_team_goal'], 'away_team_points'] = 0

df_match

We want to get a match dataframe with tidy data (for now we have two columns for home and away teams, and two columns for home and away teams' goals).

#Melting the team id columns
df_match_team = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_api_id', 'away_team_api_id'],
                              var_name='home_away',
                              value_name='team_api_id')
#Creating a column only containing 'home_team' or 'away_team'
df_match_team.loc[:,'home_away'] = df_match_team.loc[:,'home_away'].str.replace('_api_id', '')

#Melting the team goals columns
df_match_goal = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_goal', 'away_team_goal'],
                              var_name='home_away',
                              value_name='team_goal')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_match_goal.loc[:, 'home_away'] = df_match_goal.loc[:, 'home_away'].str.replace('_goal', '')

#Melting the team points columns
df_match_points = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_points', 'away_team_points'],
                              var_name='home_away',
                              value_name='team_points')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_match_points.loc[:, 'home_away'] = df_match_goal.loc[:, 'home_away'].str.replace('_points', '')

#Creating another dataframe to store the goals conceded by a team and melting it
df_goals_conceded = df_match.copy()
df_goals_conceded['home_team_goal_conceded'] = df_match['away_team_goal']
df_goals_conceded['away_team_goal_conceded'] = df_match['home_team_goal']
df_goals_conceded = df_goals_conceded.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
                              value_vars=['home_team_goal_conceded', 'away_team_goal_conceded'],
                              var_name='home_away',
                              value_name='goal_conceded')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_goals_conceded.loc[:, 'home_away'] = df_goals_conceded.loc[:, 'home_away'].str.replace('_goal_conceded', '')

#Merging the dataframes back to get a final tidy dataframe that replaces the initial df_match dataframe
df_match = df_match_team.merge(right=df_match_goal, how='inner',
                               on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match = df_match.merge(right=df_match_points, how='inner',
                         on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match = df_match.merge(right=df_goals_conceded, how='inner',
                         on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match

We finally have a tidy dataframe with season, league, match, team (home and away), goals scored (and conceded) and points!

Exploratory data analysis¶

Questions

Is there really an advantage in playing at home? At least, is playing at home associated with more chances of winning?
Which teams are the best performers over the years?
How are the teams attributes correlated with teams performances?

Home advantage¶

home_advantage = df_match.groupby(by=['league', 'home_away'], as_index=False)['team_goal'].mean()
home_advantage

fig = px.bar(home_advantage, x='league', y='team_goal',
             color='home_away', barmode='group')
fig.update_layout(title='Average goals scored per match for home and away teams',
                  yaxis_title='Average goals scored per match',
                  xaxis_title='League',
                  legend_title='')
#add annotations with the difference between average home goals and away goals
for league in df_league['name'].tolist():
    mean_goal = ((home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='home_team')]['team_goal'].values
                 + home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='away_team')]['team_goal'].values)/2).item()

    goal_difference = (home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='home_team')]['team_goal'].values
                       - home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='away_team')]['team_goal'].values).item()

    fig.add_annotation(
        x=league,
        y=round(mean_goal,2),
        text=round(goal_difference,2),
        showarrow=False,
        font=dict(color='dimgrey', size=20))
# print(fig.layout)
fig.show()

We can see that, on average, home teams score approximately 0.4 more goals (between 0.35 and 0.5) than their away counterpart. It seems that this difference is more pronounced in the Spanish and Dutch leagues.

Best performing teams¶

For this analysis, we will retrieve teams name from df_team. Eventually, we want to see if any correlation exists between team attributes and team performance. We can consider team performance with different proxies: total goals scored, total points per season and goals scored per match.

#Grouping by country, league, season and team_api_id and summing up points
performers = df_match.groupby(['country', 'league', 'season', 'team_api_id'])['team_goal', 'team_points'].agg(['sum', 'count']).reset_index()
#Dropping last column (count duplicate, corresponding to the number of matches played)
performers = performers.iloc[:, :-1]
#Renaming columns
performers.columns=['country', 'league', 'season', 'team_api_id', 'team_goal', 'matches', 'team_points']
#Creating a column with goals per match
performers.loc[:, 'goal_per_match'] = performers['team_goal']/performers['matches']

#Merging with df_team to get teams name
performers = performers.merge(right=df_team, left_on='team_api_id', right_on='team_api_id')
performers = performers[['country', 'league', 'season', 'team_api_id', 'team_long_name', 'team_goal', 'team_points', 'goal_per_match']].rename(columns={'team_long_name':'team'})

#Creating a column with team name and season
performers.loc[:, 'team_season'] = performers['team']+ ' ' + performers['season'].astype('str')

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

Teams that scored the most goals¶

top_scoring_teams = performers.sort_values('team_goal', ascending=False)[:20]
fig = px.bar(top_scoring_teams, x='team_season', y='team_goal', color='team')
fig.update_layout(images = [
                            {"source": "images/real_madrid.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 4,
                             "y": 50,
                             "sizex": 30,
                             "sizey": 30,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/fc_barcelona.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 10.5,
                             "y": 50,
                             "sizex": 24,
                             "sizey": 24,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/ajax.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 13.5,
                             "y": 110,
                             "sizex": 15,
                             "sizey": 15,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/psv.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 14.5,
                             "y": 112,
                             "sizex": 10,
                             "sizey": 10,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/chelsea.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 15.5,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/psg.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 16.5,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/man_city.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 17.5,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/liverpool.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 18.37,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                            {"source": "images/bayern.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 19.4,
                             "y": 110,
                             "sizex": 14,
                             "sizey": 14,
                             "xanchor": "right",
                             "yanchor": "bottom"}
                            ],
                 title = 'Top scoring teams',
                 yaxis_title = 'Total goals',
                 xaxis_title = 'Team')
fig.show()

This chart illustrates the domination imposed by the two Spanish leaders in terms of goals scored per season.

Teams with the best defense¶

To compare teams in terms of defense, we will use the total number of goals conceded as a proxy.

best_defenses = df_match.groupby(['country', 'league', 'season', 'team_api_id'])['goal_conceded'].sum().reset_index()

#Merging with df_team to get teams name
best_defenses = best_defenses.merge(right=df_team, left_on='team_api_id', right_on='team_api_id')
best_defenses = best_defenses[['country', 'league', 'season', 'team_api_id', 'team_long_name', 'goal_conceded']].rename(columns={'team_long_name':'team'})

#Creating a column with team name and season
best_defenses.loc[:, 'team_season'] = best_defenses['team']+ ' ' + best_defenses['season'].astype('str')

top_defenses = best_defenses.sort_values('goal_conceded', ascending=True)[:10]
fig = px.bar(top_defenses, x='team_season', y='goal_conceded', color='team')
fig.update_layout(title = 'Top 10 teams - least goals conceded',
                 yaxis_title = 'Total goals conceded',
                 xaxis_title = 'Team',
                 images = [
                            {"source": "images/bayern.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 1.5,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/atletico.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 3.37,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/psg.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 4.5,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/juventus.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 5.7,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/ajax.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 7.5,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"},
                             {"source": "images/fc_barcelona.png",
                             "xref": "x",
                             "yref": "y",
                             "x": 9,
                             "y": 8,
                             "sizex": 5,
                             "sizey": 5,
                             "xanchor": "right",
                             "yanchor": "bottom"}]
                 )
fig.show()

Looking at best performing teams by season.

df_teamatt.head()

#We will change the dates a bit to coincide with the season
season_dict = {'2010-02-22 00:00:00' : '2010/2011',
               '2011-02-22 00:00:00' : '2011/2012',
               '2012-02-22 00:00:00' : '2012/2013',
               '2013-09-20 00:00:00' : '2013/2014',
               '2014-09-19 00:00:00' : '2014/2015',
               '2015-09-10 00:00:00' : '2015/2016'}
#Replacing df_teamatt 'date' column values with season_dict
df_teamatt.loc[:, 'date'] = df_teamatt.loc[:, 'date'].map(season_dict)
df_teamatt.rename(columns={'date':'season'}, inplace=True)

#Keeping only seasons between 2010/2011 and 2015/2016
performers = performers[performers['season'].isin(['2010/2011', '2011/2012', '2012/2013', '2013/2014', '2014/2015', '2015/2016'])]

#Merging performers with df_teamatt
performers = performers.merge(right=df_teamatt, on=['season', 'team_api_id'])

#Dropping 'id' and 'team_fifa_api_id' columns
performers.drop(columns=['id', 'team_fifa_api_id'], inplace=True)

#Dropping attributes columns with categorical variables (all columns with 'Class')
performers.drop(columns=[x for x in performers.columns if 'Class' in x], inplace=True)

performers.head()

Comparing average attributes¶

Let's see if we can find any link between a variable containing an average of all attributes and the 3 teams performances indicators (total goals per season, team points, goals per match).

#Building a new variable avg_att
performers.loc[:, 'avg_att'] = performers.iloc[:, 8:].mean(axis=1)
performers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 0 to 682
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   country                 683 non-null    object 
 1   league                  683 non-null    object 
 2   season                  683 non-null    object 
 3   team_api_id             683 non-null    int64  
 4   team                    683 non-null    object 
 5   team_goal               683 non-null    int64  
 6   team_points             683 non-null    float64
 7   goal_per_match          683 non-null    float64
 8   team_season             683 non-null    object 
 9   buildUpPlaySpeed        683 non-null    int64  
 10  buildUpPlayDribbling    683 non-null    float64
 11  buildUpPlayPassing      683 non-null    int64  
 12  chanceCreationPassing   683 non-null    int64  
 13  chanceCreationCrossing  683 non-null    int64  
 14  chanceCreationShooting  683 non-null    int64  
 15  defencePressure         683 non-null    int64  
 16  defenceAggression       683 non-null    int64  
 17  defenceTeamWidth        683 non-null    int64  
 18  avg_att                 683 non-null    float64
dtypes: float64(4), int64(10), object(5)
memory usage: 106.7+ KB

fig = px.scatter(performers, x='avg_att', y='team_goal')
fig.show()

fig = px.scatter(performers, x='avg_att', y='team_points')
fig.show()

fig = px.scatter(performers, x='avg_att', y='goal_per_match')
fig.show()

performers.loc[:, ['team_goal', 'team_points', 'goal_per_match', 'avg_att']].corr()

At this point, we cannot see any clear pattern between team average attributes and their performance considering the three proxies selected. We will pursue the analysis with other techniques.

	id	team_api_id	team_fifa_api_id	team_long_name	team_short_name
0	1	9987	673.0	KRC Genk	GEN
1	2	9993	675.0	Beerschot AC	BAC
2	3	10000	15005.0	SV Zulte-Waregem	ZUL
3	4	9994	2007.0	Sporting Lokeren	LOK
4	5	9984	1750.0	KSV Cercle Brugge	CEB
5	6	8635	229.0	RSC Anderlecht	AND
6	7	9991	674.0	KAA Gent	GEN
7	8	9998	1747.0	RAEC Mons	MON
8	9	7947	NaN	FCV Dender EH	DEN
9	10	9985	232.0	Standard de Liège	STL
10	11	8203	110724.0	KV Mechelen	MEC
11	12	8342	231.0	Club Brugge KV	CLB
12	13	9999	546.0	KSV Roeselare	ROS
13	14	8571	100081.0	KV Kortrijk	KOR
14	15	4049	NaN	Tubize	TUB
15	16	9996	111560.0	Royal Excel Mouscron	MOU
16	17	10001	681.0	KVC Westerlo	WES
17	18	9986	670.0	Sporting Charleroi	CHA
18	614	9997	680.0	Sint-Truidense VV	STT
19	1034	9989	239.0	Lierse SK	LIE

	match_id	country	league	season	stage	date	home_team	home_team_goal	away_team_goal	away_team
0	1489	Belgium	Belgium Jupiler League	2015/2016	1	2015-07-24 00:00:00	Sint-Truidense VV	2	1	Club Brugge KV
1	1490	Belgium	Belgium Jupiler League	2015/2016	1	2015-07-25 00:00:00	KV Kortrijk	2	1	Standard de Liège
2	1491	Belgium	Belgium Jupiler League	2015/2016	1	2015-07-25 00:00:00	KRC Genk	3	1	Oud-Heverlee Leuven
3	1492	Belgium	Belgium Jupiler League	2015/2016	1	2015-07-25 00:00:00	KV Oostende	3	1	KV Mechelen
4	1493	Belgium	Belgium Jupiler League	2015/2016	1	2015-07-25 00:00:00	SV Zulte-Waregem	3	1	Sporting Lokeren
...	...	...	...	...	...	...	...	...	...	...
25974	24733	Switzerland	Switzerland Super League	2008/2009	9	2008-09-13 00:00:00	FC Basel	2	0	FC Luzern
25975	24734	Switzerland	Switzerland Super League	2008/2009	9	2008-09-13 00:00:00	Grasshopper Club Zürich	0	0	FC Aarau
25976	24735	Switzerland	Switzerland Super League	2008/2009	9	2008-09-13 00:00:00	Neuchâtel Xamax	3	3	FC Sion
25977	24736	Switzerland	Switzerland Super League	2008/2009	9	2008-09-14 00:00:00	AC Bellinzona	1	2	BSC Young Boys
25978	24737	Switzerland	Switzerland Super League	2008/2009	9	2008-09-14 00:00:00	FC Vaduz	1	7	FC Zürich

	id	team_api_id	team_fifa_api_id	team_long_name	team_short_name
0	1	9987	673.0	KRC Genk	GEN
1	2	9993	675.0	Beerschot AC	BAC
2	3	10000	15005.0	SV Zulte-Waregem	ZUL
3	4	9994	2007.0	Sporting Lokeren	LOK
4	5	9984	1750.0	KSV Cercle Brugge	CEB
...	...	...	...	...	...
294	49479	10190	898.0	FC St. Gallen	GAL
295	49837	10191	1715.0	FC Thun	THU
296	50201	9777	324.0	Servette FC	SER
297	50204	7730	1862.0	FC Lausanne-Sports	LAU
298	51606	7896	NaN	Lugano	LUG

	season	team_api_id	player
0	2008/2009	10260	30726.0
1	2008/2009	9825	23686.0
2	2008/2009	8472	32562.0
3	2008/2009	8654	36374.0
4	2008/2009	10252	30380.0

	count	mean	std	min	25%	50%	75%	max
buildUpPlaySpeedClass
Balanced	1184.0	51.925676	8.567738	34.0	46.0	52.0	59.0	66.0
Fast	172.0	69.767442	2.578246	67.0	68.0	70.0	70.0	80.0
Slow	102.0	29.509804	2.570761	20.0	30.0	30.0	30.0	33.0

	type	name	tbl_name	rootpage	sql
0	table	sqlite_sequence	sqlite_sequence	4	CREATE TABLE sqlite_sequence(name,seq)
1	table	Player_Attributes	Player_Attributes	11	CREATE TABLE "Player_Attributes" (\n\t`id`\tIN...
2	table	Player	Player	14	CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA...
3	table	Match	Match	18	CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR...
4	table	League	League	24	CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA...
5	table	Country	Country	26	CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM...
6	table	Team	Team	29	CREATE TABLE "Team" (\n\t`id`\tINTEGER PRIMARY...
7	table	Team_Attributes	Team_Attributes	2	CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE...

	id	name
0	1	Belgium
1	1729	England
2	4769	France
3	7809	Germany
4	10257	Italy
5	13274	Netherlands
6	15722	Poland
7	17642	Portugal
8	19694	Scotland
9	21518	Spain
10	24558	Switzerland

	match_id	country	league	season	stage	date	home_team	home_team_goal	away_team_goal	away_team
0	7431	France	France Ligue 1	2015/2016	1	2015-08-07 00:00:00	LOSC Lille	0	1	Paris Saint-Germain
1	7429	France	France Ligue 1	2015/2016	1	2015-08-08 00:00:00	SC Bastia	2	1	Stade Rennais FC
2	7433	France	France Ligue 1	2015/2016	1	2015-08-08 00:00:00	Olympique de Marseille	0	1	SM Caen
3	7434	France	France Ligue 1	2015/2016	1	2015-08-08 00:00:00	Montpellier Hérault SC	0	2	Angers SCO
4	7435	France	France Ligue 1	2015/2016	1	2015-08-08 00:00:00	FC Nantes	1	0	En Avant de Guingamp
...	...	...	...	...	...	...	...	...	...	...
375	7744	France	France Ligue 1	2015/2016	38	2016-05-14 00:00:00	FC Lorient	1	0	GFC Ajaccio
376	7745	France	France Ligue 1	2015/2016	38	2016-05-14 00:00:00	Paris Saint-Germain	4	0	FC Nantes
377	7746	France	France Ligue 1	2015/2016	38	2016-05-14 00:00:00	SM Caen	1	0	Girondins de Bordeaux
378	7747	France	France Ligue 1	2015/2016	38	2016-05-14 00:00:00	Stade de Reims	4	1	Olympique Lyonnais
379	7748	France	France Ligue 1	2015/2016	38	2016-05-14 00:00:00	Stade Rennais FC	1	2	SC Bastia

	season	country	league	total_goals
0	2015/2016	Spain	Spain LIGA BBVA	1043
1	2015/2016	England	England Premier League	1026
2	2015/2016	Italy	Italy Serie A	979
3	2015/2016	France	France Ligue 1	960
4	2015/2016	Netherlands	Netherlands Eredivisie	912
...	...	...	...	...
83	2008/2009	Belgium	Belgium Jupiler League	855
84	2008/2009	Portugal	Portugal Liga ZON Sagres	552
85	2008/2009	Scotland	Scotland Premier League	548
86	2008/2009	Switzerland	Switzerland Super League	540
87	2008/2009	Poland	Poland Ekstraklasa	524

	id	team_fifa_api_id	team_api_id	date	buildUpPlaySpeed	buildUpPlaySpeedClass	buildUpPlayDribbling	buildUpPlayDribblingClass	buildUpPlayPassing	buildUpPlayPassingClass	...	chanceCreationShooting	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass
0	1	434	9930	2010-02-22 00:00:00	60	Balanced	NaN	Little	50	Mixed	...	55	Normal	Organised	50	Medium	55	Press	45	Normal	Cover
1	2	434	9930	2014-09-19 00:00:00	52	Balanced	48.0	Normal	56	Mixed	...	64	Normal	Organised	47	Medium	44	Press	54	Normal	Cover
2	3	434	9930	2015-09-10 00:00:00	47	Balanced	41.0	Normal	54	Mixed	...	64	Normal	Organised	47	Medium	44	Press	54	Normal	Cover
3	4	77	8485	2010-02-22 00:00:00	70	Fast	NaN	Little	70	Long	...	70	Lots	Organised	60	Medium	70	Double	70	Wide	Cover
4	5	77	8485	2011-02-22 00:00:00	47	Balanced	NaN	Little	52	Mixed	...	52	Normal	Organised	47	Medium	47	Press	52	Normal	Cover

	season	home_team_api_id	player_pos	player
0	2008/2009	10260	home_player_1	30726.0
1	2008/2009	9825	home_player_1	23686.0
2	2008/2009	8472	home_player_1	32562.0
3	2008/2009	8654	home_player_1	36374.0
4	2008/2009	10252	home_player_1	30380.0

	id	league_id	home_team_api_id	away_team_api_id	home_player_1	home_player_2	home_player_3	home_player_4	home_player_5	home_player_6	...	away_player_2	away_player_3	away_player_4	away_player_5	away_player_6	away_player_7	away_player_8	away_player_9	away_player_10	away_player_11
season
2008/2009	2132	2132	2132	2132	1895	1912	1893	1908	1912	1904	...	1908	1899	1905	1905	1903	1905	1890	1903	1900	1910
2009/2010	2132	2132	2132	2132	2124	2125	2122	2121	2116	2126	...	2125	2116	2118	2115	2124	2110	2126	2125	2122	2121
2010/2011	2132	2132	2132	2132	2130	2124	2128	2130	2124	2129	...	2120	2125	2130	2121	2131	2127	2127	2122	2125	2124
2011/2012	2110	2110	2110	2110	2109	2104	2108	2109	2108	2110	...	2106	2105	2109	2108	2105	2108	2105	2108	2106	2104
2012/2013	2132	2132	2132	2132	2128	2125	2128	2126	2130	2132	...	2125	2131	2125	2131	2132	2131	2132	2131	2131	2130
2013/2014	2132	2132	2132	2132	2132	2132	2132	2130	2130	2131	...	2132	2132	2132	2129	2130	2130	2131	2127	2128	2125
2014/2015	2131	2131	2131	2131	2131	2128	2130	2129	2130	2127	...	2128	2131	2129	2129	2127	2129	2127	2129	2123	2112
2015/2016	2132	2132	2132	2132	2132	2128	2130	2127	2122	2125	...	2128	2130	2129	2124	2124	2127	2125	2127	2113	2121

	league_name	season	home_team	away_team
0	Netherlands Eredivisie	2008/2009	De Graafschap	NAC Breda
1	Netherlands Eredivisie	2008/2009	N.E.C.	PSV
2	Netherlands Eredivisie	2008/2009	AZ	FC Twente
3	Netherlands Eredivisie	2008/2009	Heracles Almelo	Vitesse
4	Netherlands Eredivisie	2008/2009	Willem II	Roda JC Kerkrade
...	...	...	...	...
200	Netherlands Eredivisie	2008/2009	Vitesse	ADO Den Haag
201	Netherlands Eredivisie	2008/2009	Roda JC Kerkrade	Feyenoord
202	Netherlands Eredivisie	2008/2009	Sparta Rotterdam	N.E.C.
203	Netherlands Eredivisie	2008/2009	FC Volendam	De Graafschap
204	Netherlands Eredivisie	2008/2009	FC Utrecht	Heracles Almelo

	id	player_fifa_api_id	player_api_id	date	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	...	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	218353	505942	2016-02-18 00:00:00	67.0	71.0	right	medium	medium	49.0	...	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
1	2	218353	505942	2015-11-19 00:00:00	67.0	71.0	right	medium	medium	49.0	...	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
2	3	218353	505942	2015-09-21 00:00:00	62.0	66.0	right	medium	medium	49.0	...	54.0	48.0	65.0	66.0	69.0	6.0	11.0	10.0	8.0	8.0
3	4	218353	505942	2015-03-20 00:00:00	61.0	65.0	right	medium	medium	48.0	...	53.0	47.0	62.0	63.0	66.0	5.0	10.0	9.0	7.0	7.0
4	5	218353	505942	2007-02-22 00:00:00	61.0	65.0	right	medium	medium	48.0	...	53.0	47.0	62.0	63.0	66.0	5.0	10.0	9.0	7.0	7.0

	league	home_away	team_goal
0	England Premier League	away_team	1.159539
1	England Premier League	home_team	1.550987
2	France Ligue 1	away_team	1.040132
3	France Ligue 1	home_team	1.402961
4	Germany 1. Bundesliga	away_team	1.274918
5	Germany 1. Bundesliga	home_team	1.626634
6	Italy Serie A	away_team	1.116009
7	Italy Serie A	home_team	1.500829
8	Netherlands Eredivisie	away_team	1.301062
9	Netherlands Eredivisie	home_team	1.779820
10	Spain LIGA BBVA	away_team	1.135855
11	Spain LIGA BBVA	home_team	1.631250

	id	team_fifa_api_id	team_api_id	buildUpPlaySpeed	buildUpPlaySpeedClass	buildUpPlayDribbling	buildUpPlayDribblingClass	buildUpPlayPassing	buildUpPlayPassingClass	buildUpPlayPositioningClass	...	chanceCreationShooting	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass
date
2010-02-22 00:00:00	241	241	241	241	241	0	241	241	241	241	...	241	241	241	241	241	241	241	241	241	241
2011-02-22 00:00:00	244	244	244	244	244	0	244	244	244	244	...	244	244	244	244	244	244	244	244	244	244
2012-02-22 00:00:00	242	242	242	242	242	0	242	242	242	242	...	242	242	242	242	242	242	242	242	242	242
2013-09-20 00:00:00	242	242	242	242	242	0	242	242	242	242	...	242	242	242	242	242	242	242	242	242	242
2014-09-19 00:00:00	244	244	244	244	244	244	244	244	244	244	...	244	244	244	244	244	244	244	244	244	244
2015-09-10 00:00:00	245	245	245	245	245	245	245	245	245	245	...	245	245	245	245	245	245	245	245	245	245

	team_goal	team_points	goal_per_match	avg_att
team_goal	1.000000	0.861334	0.985253	-0.008235
team_points	0.861334	1.000000	0.820584	0.035206
goal_per_match	0.985253	0.820584	1.000000	-0.027214
avg_att	-0.008235	0.035206	-0.027214	1.000000