This database was compiled by Hugo Mathien and is available here on Kaggle.
As described by Hugo, the information gathered here is really impressive:
While this clearly represents quality material to perform in depth analysis and eventually predict matches outcome, I will focus here on performing exploratory analysis on teams, leagues or particular players and/or categories of players.
#importing classic analysis libraries
import pandas as pd
import numpy as np
#sqlite3
import sqlite3
#plotting with plotly (offline default mode)
import plotly
plotly.__version__
import plotly.express as px
import plotly.graph_objects as go
#plotting directly in Jupyter notebook
import plotly.io as pio
pio.renderers.default='notebook'
#plotting with matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
Let's see what is in the database file
conn = sqlite3.connect('database.sqlite')
tables = pd.read_sql("""SELECT * FROM sqlite_master WHERE type='table' ;""", conn)
tables
As we already knew, we've got quite a lot of information here, with the following tables:
We quickly draw the scheme above in order to be able to better visualize tables information and indexes/relations between tables. This will make our work much easier by being able to refer to it later on.
from IPython.display import Image
Image(filename='Untitled 1.png', width=800, height=800)
Let's inspect country, league and team tables.
pd.read_sql("""SELECT * FROM Country;""", conn)
pd.read_sql("""SELECT * FROM League;""", conn)
We can see that there is only one league per country, and that league index and country index are equal.
pd.read_sql("""SELECT * FROM Team LIMIT 20;""", conn)
Let's create a table of all the matches with home and away teams as well as scores.
match_list = pd.read_sql("""
SELECT
m.id match_id,
c.name country,
l.name league,
m.season,
m.stage,
m.date,
ht.team_long_name home_team,
m.home_team_goal,
m.away_team_goal,
at.team_long_name away_team
FROM Match m
INNER JOIN Country c ON m.country_id = c.id
INNER JOIN League l ON m.league_id = l.id
INNER JOIN Team ht ON ht.team_api_id = m.home_team_api_id
INNER JOIN Team at ON at.team_api_id = m.away_team_api_id
ORDER BY 2, 4 DESC;""", conn)
match_list
#example for Ligue 1 matches from the last recorded season
ligue1_1516 = pd.read_sql("""
SELECT
m.id match_id,
c.name country,
l.name league,
m.season,
m.stage,
m.date,
ht.team_long_name home_team,
m.home_team_goal,
m.away_team_goal,
at.team_long_name away_team
FROM Match m
INNER JOIN Country c ON m.country_id = c.id
INNER JOIN League l ON m.league_id = l.id
INNER JOIN Team ht ON ht.team_api_id = m.home_team_api_id
INNER JOIN Team at ON at.team_api_id = m.away_team_api_id
WHERE country='France' AND season='2015/2016'
ORDER BY 6 ASC;""", conn)
ligue1_1516
Say our initial aim is to focus on European national leagues, and we expect to get some statistics (on player characteristics, in terms of total goals, fouls, ...) in order to compare them.
Of course this is not a clear indicator of the quality of the league, since various factors influence the total number of goals in a league (presence of top scorers, average quality of strikers, average quality of defenses, type of play, ...).
Let's try to draw a graph with all leagues total goals scored for each year.
total_goals = pd.read_sql("""
SELECT
m.season,
c.name country,
l.name league,
SUM(m.home_team_goal + m.away_team_goal) total_goals
FROM Match m
INNER JOIN Country c ON c.id = m.country_id
INNER JOIN League l ON l.id = m.league_id
GROUP BY 1, 2
ORDER BY 1 DESC, 4 DESC;""", conn)
total_goals
#plotting total goals scored per year
fig = px.line(total_goals, x='season', y='total_goals', color='league')
fig.show()
We can see that there is a lot of disparity between leagues, and it looks like we are lacking data for the 2013/2014 Belgian championship.
Let's focus on the top 6 leagues.
#plotting total goals scored per year for the top 6 leagues in terms of total goal scored
#creating a list of the seasons to order the chart's x_axis
seasons_list = ['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013', '2013/2014', '2014/2015', '2015/2016', '2016/2017', '2017/2018', '2018/2019']
total_goals_top = total_goals[total_goals['country'].isin(['Spain', 'England', 'Italy', 'France', 'Netherlands', 'Germany'])]
Between the 2008/2009 season and 2015/2016 season, the gap between #1 and #5 leagues in terms of total goals scored has slightly decreased.
In particular, total goals scored in the French Ligue 1 have considerably increased (+100 approx.), while it decreased for German Bundesliga, which is no longer in the top 5.
At the same time, there is also a noticeable decrease in the Spanish Liga BBVA, from 1101 to 1043.
There was also a huge increase for the Dutch championship total between the 2009/2010 season and the 2011/2012 season, which was quickly mitigated with total goals scored coming back to approx 900 for the last season.
Seeing the total goals scored for each league is interesting, but we would now rather see which are the top 3 contributing teams for each league.
#creating a table with a single column for teams and a single column for goals scored per match: example with France
# test = pd.read_sql("""
# SELECT
# m.season,
# l.name league,
# c.name country,
# m.stage,
# t.team_api_id,
# t.team_long_name team,
# CASE
# WHEN t.team_api_id=m.home_team_api_id THEN m.home_team_goal
# WHEN t.team_api_id=m.away_team_api_id THEN m.away_team_goal
# END goals
# FROM Match m
# INNER JOIN League l ON m.league_id = l.id
# INNER JOIN Country c ON c.id = m.country_id
# INNER JOIN Team t ON t.team_api_id = m.home_team_api_id OR t.team_api_id = m.away_team_api_id
# WHERE m.season='2015/2016' AND c.name='France' AND m.stage=1;""", conn)
# test
#now doing the same for all countries and seasons and selecting the top 3 teams from each league
# top_teams = pd.read_sql("""
# WITH team_goals AS
# (SELECT
# m.season,
# l.name league,
# c.name country,
# m.stage,
# t.team_api_id,
# t.team_long_name team,
# CASE
# WHEN t.team_api_id=m.home_team_api_id THEN m.home_team_goal
# WHEN t.team_api_id=m.away_team_api_id THEN m.away_team_goal
# END goals
# FROM Match m
# INNER JOIN League l ON m.league_id = l.id
# INNER JOIN Country c ON c.id = m.country_id
# INNER JOIN Team t ON t.team_api_id = m.home_team_api_id OR t.team_api_id = m.away_team_api_id),
# classified_teams AS
# (SELECT
# tg.season,
# tg.league,
# tg.team,
# ROW_NUMBER() OVER (PARTITION BY tg.season, tg.league ORDER BY SUM(tg.goals) DESC) AS rank,
# SUM(tg.goals) total_team_goals,
# CAST(SUM(tg.goals) AS float)/CAST((SELECT SUM(tg2.goals) FROM team_goals tg2
# WHERE tg2.season=tg.season AND tg2.league=tg.league
# GROUP BY tg2.season, tg2.league) AS float) pct_total_league_goals,
# (SELECT SUM(tg2.goals) FROM team_goals tg2
# WHERE tg2.season=tg.season AND tg2.league=tg.league
# GROUP BY tg2.season, tg2.league) total_league_goals
# FROM team_goals tg
# GROUP BY 1, 3
# ORDER BY 1, 2, 4 ASC)
# SELECT *
# FROM classified_teams
# WHERE rank <=3;""", conn)
# top_teams
Let's see the output for the 2015/2016 season and compare the top clubs from each league.
# top_teams_1516 = top_teams[top_teams['season']=='2015/2016']
# top_teams_1516
# #creating three subsets of the top_teams_1516 dataframe in order to plot a grouped bar chart
# rank1 = top_teams_1516[top_teams_1516['rank']==1]
# rank2 = top_teams_1516[top_teams_1516['rank']==2]
# rank3 = top_teams_1516[top_teams_1516['rank']==3]
# fig = go.Figure(data=[
# go.Bar(x=rank1['league'], y=rank1['total_team_goals'], text=rank1['team'], textposition='auto'),
# go.Bar(x=rank2['league'], y=rank2['total_team_goals'], text=rank2['team'], textposition='auto'),
# go.Bar(x=rank3['league'], y=rank3['total_team_goals'], text=rank3['team'], textposition='auto')
# ])
# # Change the bar mode
# fig.update_layout(barmode='group',
# showlegend=False,
# title='Top 3 teams (by total goals scored) per league in the 2015/2016 season',
# yaxis_title='total goals scored')
# fig.show()
We will now use Python Pandas library in order to store dataframes of particular interest and pursue further analysis.
We only speak about correlation (and not causation) at this point, since the relation between players quality and team scores, for example, is ambiguous: better players will quite logically enhance the overall team attributes and level, leading to more points, and at the same time, famous and better ranked teams tend to attract top level players.
We chose to focus on 6 of the major European leagues:
#Creating a dataframe with match information: it will be the base for total points per team per season analysis
#The league ids we consider is the following: '1729', '4769', '7809', '10257', '13274', '21518'
df_match = pd.read_sql("""
SELECT
m.id,
m.country_id,
c.name country,
m.league_id,
l.name league,
m.season,
m.stage,
m.match_api_id,
m.home_team_api_id,
m.away_team_api_id,
m.home_team_goal,
m.away_team_goal
FROM Match m
INNER JOIN Country c ON c.id = m.country_id
INNER JOIN League l ON l.id = m.league_id
WHERE l.id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
df_match.head()
#Creating the team, country and league dataframes
df_team = pd.read_sql("""SELECT * FROM Team;""", conn)
df_country = pd.read_sql("""SELECT * FROM Country
WHERE id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
df_league = pd.read_sql("""SELECT * FROM League
WHERE id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
df_team
#let's select the teams from the leagues that interest us
df_team = df_team.merge(df_match, how='left',
left_on='team_api_id',
right_on='home_team_api_id')
#Selecting only relevant columns and dropping duplicates, resetting index and renaming columns
df_team = df_team[['id_x', 'team_api_id', 'team_fifa_api_id', 'team_long_name', 'team_short_name', 'league_id']].drop_duplicates()
df_team.dropna(subset=['league_id'], inplace=True)
df_team.reset_index(drop=True)
df_team.rename(columns={'id_x':'id'}, inplace=True)
We need to build a dataframe containing the players list from each team, for each league (among the 6 major leagues we chose to study) and each season.
df_match_player = pd.read_sql("""
SELECT
id, season, league_id, home_team_api_id, away_team_api_id,
home_player_1, home_player_2, home_player_3, home_player_4, home_player_5,
home_player_6, home_player_7, home_player_8, home_player_9, home_player_10, home_player_11,
away_player_1, away_player_2, away_player_3, away_player_4, away_player_5,
away_player_6, away_player_7, away_player_8, away_player_9, away_player_10, away_player_11
FROM Match
WHERE league_id IN ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
#Creating the team attributes dataframe
df_teamatt = pd.read_sql("""SELECT * FROM Team_Attributes;""", conn)
df_teamatt.head()
We then need to split between home and away teams and players (we begin with home teams and players), and remodel the dataframe in order to get only season, team and players as columns.
#Splitting between home and away teams and players
df_home_player = df_match_player[['season', 'home_team_api_id', 'home_player_1', 'home_player_2', 'home_player_3',
'home_player_4', 'home_player_5', 'home_player_6', 'home_player_7', 'home_player_8',
'home_player_9', 'home_player_10', 'home_player_11']]
df_home_player = pd.melt(df_home_player, id_vars=['season', 'home_team_api_id'],
value_vars=['home_player_1', 'home_player_2', 'home_player_3',
'home_player_4', 'home_player_5', 'home_player_6',
'home_player_7', 'home_player_8', 'home_player_9',
'home_player_10', 'home_player_11'],
var_name='player_pos', value_name='player')
df_home_player.head()
df_home_player.info()
Then, we drop null values and also drop duplicate entries - correponding to a player multiple appearances with his team throughout each year - and eventually rename columns and drop the player_pos column which won't be useful for our purpose.
We also rename the 'home_team_api_id' column as 'team_api_id', as we will perform the same manipulation to the dataframe containing away teams and players before concatenating them.
#Removing the 'player_pos' column
df_home_player.drop(columns=['player_pos'], inplace=True)
#Dropping duplicates and null values
df_home_player.drop_duplicates(subset=['season', 'home_team_api_id', 'player'], inplace=True)
df_home_player.dropna(inplace=True)
#Renaming columns
df_home_player.rename(columns={'home_team_api_id':'team_api_id'}, inplace=True)
df_home_player.info()
We can then do the same for away teams and players.
df_away_player = df_match_player[['season', 'away_team_api_id', 'away_player_1', 'away_player_2', 'away_player_3',
'away_player_4', 'away_player_5', 'away_player_6',
'away_player_7', 'away_player_8', 'away_player_9',
'away_player_10', 'away_player_11']]
df_away_player = pd.melt(df_away_player, id_vars=['season', 'away_team_api_id'],
value_vars=['away_player_1', 'away_player_2', 'away_player_3',
'away_player_4', 'away_player_5', 'away_player_6',
'away_player_7', 'away_player_8', 'away_player_9',
'away_player_10', 'away_player_11'],
var_name='player_pos', value_name='player')
df_away_player.drop(columns=['player_pos'], inplace=True)
df_away_player.drop_duplicates(subset=['season', 'away_team_api_id', 'player'], inplace=True)
df_away_player.dropna(inplace=True)
df_away_player.rename(columns={'away_team_api_id':'team_api_id'}, inplace=True)
df_away_player.info()
We have now compiled all home teams' players as well as away teams' players. We can concat the two dataframes and drop duplicates.
df_team_player = pd.concat([df_home_player, df_away_player], ignore_index=True).drop_duplicates(subset=['season', 'team_api_id', 'player'])
df_team_player.head()
We just need to retrieve information on players from the database.
df_player = pd.read_sql("""SELECT * FROM Player;""", conn)
df_playeratt = pd.read_sql("""SELECT * FROM Player_Attributes;""", conn)
df_match.info()
There are no null values.
df_match.duplicated().value_counts()
No duplicates either.
print(df_team.info())
print(df_team.duplicated().value_counts())
No null values except for one in the 'team_fifa_api_id' which we won't be using here since the team attributes dataframe has all entries for the 'team_api_id'. No duplicates either.
df_teamatt.head(10)
We can see that columns display information by 'pair': one column indicates the value according to one attribute, while the other contains a categorical variable (slow, fast balanced for example).
Let's look at the 'buildUpPlaySpeed' and 'buildUpPlaySpeedClass' columns for example.
df_teamatt.groupby('buildUpPlaySpeedClass')['buildUpPlaySpeed'].describe()
df_teamatt.info()
It seems like the dataframe is complete except for the 'buildUpPlayDribbling' column, which lacks at least 2/3 of data. Quite srangely, there aren't any missing data in the corresponding 'buildUpPlayDribblingClass' column.
Let's investigate on where these data are lacking the most.
df_teamatt['date'].unique()
df_teamatt.groupby('date').count()
So it seems that all the data for 'buildUpPlayDribbling' is missing for 2010, 2011, 2012 and 2013 while it is complete for the last two years.
df_teamatt.groupby('buildUpPlayDribblingClass')['buildUpPlayDribbling'].value_counts(dropna=False)
All missing values are in the 'Little' range from 'buildUpPlayDribblingClass'. That is making our work much easier since we can quite safely consider replacing missing values with an average value of 'buildUpPlayDribbling' in the 'Little' category.
little_mean_buildUpPlayDribbling = df_teamatt.groupby('buildUpPlayDribblingClass')['buildUpPlayDribbling'].mean()['Little']
df_teamatt.fillna(little_mean_buildUpPlayDribbling, inplace=True)
df_teamatt.info()
The remaining dataframes to look at are df_match_player, df_team_player, df_player and df_playeratt
df_match_player.info()
There are some missing values in this dataframe.
df_match_player.groupby('season').count()
sns.heatmap(df_match_player.isnull(), cbar=False)
missing_player_teams = pd.read_sql("""
SELECT
l.name league_name,
m.season,
(SELECT t.team_long_name FROM Team t
WHERE t.team_api_id = m.home_team_api_id) home_team,
(SELECT t.team_long_name FROM Team t
WHERE t.team_api_id = m.away_team_api_id) away_team
FROM Match m INNER JOIN League l
ON l.id = m.league_id
WHERE home_player_1 IS NULL
AND home_player_2 IS NULL
AND home_player_3 IS NULL
AND home_player_4 IS NULL
AND home_player_5 IS NULL
AND home_player_6 IS NULL
AND home_player_7 IS NULL
AND home_player_8 IS NULL
AND home_player_9 IS NULL
AND home_player_10 IS NULL
AND home_player_11 IS NULL
AND away_player_1 IS NULL
AND away_player_2 IS NULL
AND away_player_3 IS NULL
AND away_player_4 IS NULL
AND away_player_5 IS NULL
AND away_player_6 IS NULL
AND away_player_7 IS NULL
AND away_player_8 IS NULL
AND away_player_9 IS NULL
AND away_player_10 IS NULL
AND away_player_11 IS NULL
AND league_id in ('1729', '4769', '7809', '10257', '13274', '21518');""", conn)
missing_player_teams
missing_player_teams.groupby('season')['league_name'].value_counts()
We observe that we lack data for all players from the Dutch league in the 2008/2009 season. Some information is also missing for other leagues. We will let it like that for the moment.
df_team_player.info()
df_team_player.duplicated().value_counts()
The team player dataframe is complete and without any duplicates.
df_player.info()
df_playeratt.head()
df_playeratt.info()
There are some missing values, but we will see if we really need all rows later on.
Eventually, before performing analysis on teams points in each league, we need to define two new columns in df_match that indicate how many points each team brings back. A win brings 3 points, a draw 1 point and a loss 0 point.
df_match.loc[df_match['home_team_goal'] < df_match['away_team_goal'], 'home_team_points'] = 0
df_match.loc[df_match['home_team_goal'] == df_match['away_team_goal'], 'home_team_points'] = 1
df_match.loc[df_match['home_team_goal'] > df_match['away_team_goal'], 'home_team_points'] = 3
df_match.loc[df_match['home_team_goal'] < df_match['away_team_goal'], 'away_team_points'] = 3
df_match.loc[df_match['home_team_goal'] == df_match['away_team_goal'], 'away_team_points'] = 1
df_match.loc[df_match['home_team_goal'] > df_match['away_team_goal'], 'away_team_points'] = 0
df_match
We want to get a match dataframe with tidy data (for now we have two columns for home and away teams, and two columns for home and away teams' goals).
#Melting the team id columns
df_match_team = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
value_vars=['home_team_api_id', 'away_team_api_id'],
var_name='home_away',
value_name='team_api_id')
#Creating a column only containing 'home_team' or 'away_team'
df_match_team.loc[:,'home_away'] = df_match_team.loc[:,'home_away'].str.replace('_api_id', '')
#Melting the team goals columns
df_match_goal = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
value_vars=['home_team_goal', 'away_team_goal'],
var_name='home_away',
value_name='team_goal')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_match_goal.loc[:, 'home_away'] = df_match_goal.loc[:, 'home_away'].str.replace('_goal', '')
#Melting the team points columns
df_match_points = df_match.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
value_vars=['home_team_points', 'away_team_points'],
var_name='home_away',
value_name='team_points')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_match_points.loc[:, 'home_away'] = df_match_goal.loc[:, 'home_away'].str.replace('_points', '')
#Creating another dataframe to store the goals conceded by a team and melting it
df_goals_conceded = df_match.copy()
df_goals_conceded['home_team_goal_conceded'] = df_match['away_team_goal']
df_goals_conceded['away_team_goal_conceded'] = df_match['home_team_goal']
df_goals_conceded = df_goals_conceded.melt(id_vars=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id'],
value_vars=['home_team_goal_conceded', 'away_team_goal_conceded'],
var_name='home_away',
value_name='goal_conceded')
#Creating a column only containing 'home_team' or 'away_team', which gives us the possibility to merge dataframes
df_goals_conceded.loc[:, 'home_away'] = df_goals_conceded.loc[:, 'home_away'].str.replace('_goal_conceded', '')
#Merging the dataframes back to get a final tidy dataframe that replaces the initial df_match dataframe
df_match = df_match_team.merge(right=df_match_goal, how='inner',
on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match = df_match.merge(right=df_match_points, how='inner',
on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match = df_match.merge(right=df_goals_conceded, how='inner',
on=['id', 'country_id', 'country', 'league_id', 'league', 'season', 'stage', 'match_api_id', 'home_away'])
df_match
We finally have a tidy dataframe with season, league, match, team (home and away), goals scored (and conceded) and points!
Questions
home_advantage = df_match.groupby(by=['league', 'home_away'], as_index=False)['team_goal'].mean()
home_advantage
fig = px.bar(home_advantage, x='league', y='team_goal',
color='home_away', barmode='group')
fig.update_layout(title='Average goals scored per match for home and away teams',
yaxis_title='Average goals scored per match',
xaxis_title='League',
legend_title='')
#add annotations with the difference between average home goals and away goals
for league in df_league['name'].tolist():
mean_goal = ((home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='home_team')]['team_goal'].values
+ home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='away_team')]['team_goal'].values)/2).item()
goal_difference = (home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='home_team')]['team_goal'].values
- home_advantage[(home_advantage['league']==league) & (home_advantage['home_away']=='away_team')]['team_goal'].values).item()
fig.add_annotation(
x=league,
y=round(mean_goal,2),
text=round(goal_difference,2),
showarrow=False,
font=dict(color='dimgrey', size=20))
# print(fig.layout)
fig.show()
We can see that, on average, home teams score approximately 0.4 more goals (between 0.35 and 0.5) than their away counterpart. It seems that this difference is more pronounced in the Spanish and Dutch leagues.
For this analysis, we will retrieve teams name from df_team. Eventually, we want to see if any correlation exists between team attributes and team performance. We can consider team performance with different proxies: total goals scored, total points per season and goals scored per match.
#Grouping by country, league, season and team_api_id and summing up points
performers = df_match.groupby(['country', 'league', 'season', 'team_api_id'])['team_goal', 'team_points'].agg(['sum', 'count']).reset_index()
#Dropping last column (count duplicate, corresponding to the number of matches played)
performers = performers.iloc[:, :-1]
#Renaming columns
performers.columns=['country', 'league', 'season', 'team_api_id', 'team_goal', 'matches', 'team_points']
#Creating a column with goals per match
performers.loc[:, 'goal_per_match'] = performers['team_goal']/performers['matches']
#Merging with df_team to get teams name
performers = performers.merge(right=df_team, left_on='team_api_id', right_on='team_api_id')
performers = performers[['country', 'league', 'season', 'team_api_id', 'team_long_name', 'team_goal', 'team_points', 'goal_per_match']].rename(columns={'team_long_name':'team'})
#Creating a column with team name and season
performers.loc[:, 'team_season'] = performers['team']+ ' ' + performers['season'].astype('str')
top_scoring_teams = performers.sort_values('team_goal', ascending=False)[:20]
fig = px.bar(top_scoring_teams, x='team_season', y='team_goal', color='team')
fig.update_layout(images = [
{"source": "images/real_madrid.png",
"xref": "x",
"yref": "y",
"x": 4,
"y": 50,
"sizex": 30,
"sizey": 30,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/fc_barcelona.png",
"xref": "x",
"yref": "y",
"x": 10.5,
"y": 50,
"sizex": 24,
"sizey": 24,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/ajax.png",
"xref": "x",
"yref": "y",
"x": 13.5,
"y": 110,
"sizex": 15,
"sizey": 15,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/psv.png",
"xref": "x",
"yref": "y",
"x": 14.5,
"y": 112,
"sizex": 10,
"sizey": 10,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/chelsea.png",
"xref": "x",
"yref": "y",
"x": 15.5,
"y": 110,
"sizex": 14,
"sizey": 14,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/psg.png",
"xref": "x",
"yref": "y",
"x": 16.5,
"y": 110,
"sizex": 14,
"sizey": 14,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/man_city.png",
"xref": "x",
"yref": "y",
"x": 17.5,
"y": 110,
"sizex": 14,
"sizey": 14,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/liverpool.png",
"xref": "x",
"yref": "y",
"x": 18.37,
"y": 110,
"sizex": 14,
"sizey": 14,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/bayern.png",
"xref": "x",
"yref": "y",
"x": 19.4,
"y": 110,
"sizex": 14,
"sizey": 14,
"xanchor": "right",
"yanchor": "bottom"}
],
title = 'Top scoring teams',
yaxis_title = 'Total goals',
xaxis_title = 'Team')
fig.show()
This chart illustrates the domination imposed by the two Spanish leaders in terms of goals scored per season.
To compare teams in terms of defense, we will use the total number of goals conceded as a proxy.
best_defenses = df_match.groupby(['country', 'league', 'season', 'team_api_id'])['goal_conceded'].sum().reset_index()
#Merging with df_team to get teams name
best_defenses = best_defenses.merge(right=df_team, left_on='team_api_id', right_on='team_api_id')
best_defenses = best_defenses[['country', 'league', 'season', 'team_api_id', 'team_long_name', 'goal_conceded']].rename(columns={'team_long_name':'team'})
#Creating a column with team name and season
best_defenses.loc[:, 'team_season'] = best_defenses['team']+ ' ' + best_defenses['season'].astype('str')
top_defenses = best_defenses.sort_values('goal_conceded', ascending=True)[:10]
fig = px.bar(top_defenses, x='team_season', y='goal_conceded', color='team')
fig.update_layout(title = 'Top 10 teams - least goals conceded',
yaxis_title = 'Total goals conceded',
xaxis_title = 'Team',
images = [
{"source": "images/bayern.png",
"xref": "x",
"yref": "y",
"x": 1.5,
"y": 8,
"sizex": 5,
"sizey": 5,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/atletico.png",
"xref": "x",
"yref": "y",
"x": 3.37,
"y": 8,
"sizex": 5,
"sizey": 5,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/psg.png",
"xref": "x",
"yref": "y",
"x": 4.5,
"y": 8,
"sizex": 5,
"sizey": 5,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/juventus.png",
"xref": "x",
"yref": "y",
"x": 5.7,
"y": 8,
"sizex": 5,
"sizey": 5,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/ajax.png",
"xref": "x",
"yref": "y",
"x": 7.5,
"y": 8,
"sizex": 5,
"sizey": 5,
"xanchor": "right",
"yanchor": "bottom"},
{"source": "images/fc_barcelona.png",
"xref": "x",
"yref": "y",
"x": 9,
"y": 8,
"sizex": 5,
"sizey": 5,
"xanchor": "right",
"yanchor": "bottom"}]
)
fig.show()
Looking at best performing teams by season.
df_teamatt.head()
#We will change the dates a bit to coincide with the season
season_dict = {'2010-02-22 00:00:00' : '2010/2011',
'2011-02-22 00:00:00' : '2011/2012',
'2012-02-22 00:00:00' : '2012/2013',
'2013-09-20 00:00:00' : '2013/2014',
'2014-09-19 00:00:00' : '2014/2015',
'2015-09-10 00:00:00' : '2015/2016'}
#Replacing df_teamatt 'date' column values with season_dict
df_teamatt.loc[:, 'date'] = df_teamatt.loc[:, 'date'].map(season_dict)
df_teamatt.rename(columns={'date':'season'}, inplace=True)
#Keeping only seasons between 2010/2011 and 2015/2016
performers = performers[performers['season'].isin(['2010/2011', '2011/2012', '2012/2013', '2013/2014', '2014/2015', '2015/2016'])]
#Merging performers with df_teamatt
performers = performers.merge(right=df_teamatt, on=['season', 'team_api_id'])
#Dropping 'id' and 'team_fifa_api_id' columns
performers.drop(columns=['id', 'team_fifa_api_id'], inplace=True)
#Dropping attributes columns with categorical variables (all columns with 'Class')
performers.drop(columns=[x for x in performers.columns if 'Class' in x], inplace=True)
performers.head()
Let's see if we can find any link between a variable containing an average of all attributes and the 3 teams performances indicators (total goals per season, team points, goals per match).
#Building a new variable avg_att
performers.loc[:, 'avg_att'] = performers.iloc[:, 8:].mean(axis=1)
performers.info()
fig = px.scatter(performers, x='avg_att', y='team_goal')
fig.show()
fig = px.scatter(performers, x='avg_att', y='team_points')
fig.show()
fig = px.scatter(performers, x='avg_att', y='goal_per_match')
fig.show()
performers.loc[:, ['team_goal', 'team_points', 'goal_per_match', 'avg_att']].corr()
At this point, we cannot see any clear pattern between team average attributes and their performance considering the three proxies selected. We will pursue the analysis with other techniques.