Behind this short Python module lies another attempt at practising data manipulation skills: reading JSON files, organizing and cleaning a dataframe with pandas, and performing some analysis visualized with matplotlib/seaborn and Folium. This work could be the starting point of a larger project analyzing the Yelp database more in depth, where we would keep studying businesses and their ratings and integrate other Yelp data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import folium
# Reading the Yelp dataset JSON files
checkin_json_path = 'yelp_academic_dataset_checkin.json'
checkin = pd.read_json(checkin_json_path, lines=True)
checkin.info()
business_json_path = 'yelp_academic_dataset_business.json'
business = pd.read_json(business_json_path, lines=True)
business.info()
For the moment, let's focus on the businesses and set the other datasets aside.
business.head(20)
business.isnull().sum()
First thoughts on this dataset: let's see how the businesses are distributed across cities.
city_total = business['city'].value_counts()
print(city_total)
business['city'].str.contains('vegas', case=False).sum()
We can see that there are 33,419 rows mentioning 'vegas' (case insensitive), though 'only' 31,631 businesses have exactly 'Las Vegas' as their city.
Let's see what the remaining rows containing 'vegas' look like.
vegas = business[(business['city'].str.contains('vegas', case=False)) & (business['city']!='Las Vegas')]
vegas['city'].value_counts()
We want to replace all these entries with 'Las Vegas', since they all refer to the same city.
business.loc[(business['city'].str.contains('vegas', case=False)) & (business['city']!='Las Vegas'), 'city'] = 'Las Vegas'
business['city'].value_counts()
If we encountered this issue with Las Vegas, it could be the case for other cities as well. We will try to clean the 'city' column in the following code.
There are replacements to be done for all of the top 10 cities (except Las Vegas, which we already did). Let's write some code to check for this issue in the 10 most represented cities.
city_total = business['city'].value_counts()
print(city_total[:10])
for c in list(city_total[:10].index):
    print(c)
    print('\n')
    c_mask = (business['city'].str.contains(c, case=False)) & (business['city'] != c)
    print(business[c_mask]['city'].value_counts())
    print(city_total[c], ': initial value count of the city')
    print(business['city'].str.contains(c, case=False).sum(), ': value count considering all mentions of the city name')
    print('\n')
We will deal with this issue later before performing analysis on particular cities.
# checks that the code works well
# for c in list(city_total[:10].index):
#     print(c)
#     business.loc[(business['city'].str.contains(c, case=False)) & (business['city'] != c), 'city'] = c
#     print(business['city'].str.contains(c, case=False).sum() - business[business['city'] == c].shape[0])
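As a sketch of what that later cleanup could look like, here is a hypothetical helper (the function name and its caveat are ours) that normalizes every variant of a canonical city name in one pass:
def normalize_cities(df, canonical_names):
    # For each canonical name, rewrite any city value containing it
    # (case-insensitively) to the canonical spelling.
    # Caveat: assumes no canonical name is a substring of another city's real name.
    for name in canonical_names:
        mask = df['city'].str.contains(name, case=False, na=False, regex=False)
        df.loc[mask, 'city'] = name
    return df

# business = normalize_cities(business, city_total[:10].index)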
Let's say we want to simplify the dataset by keeping only the cities with 500+ businesses.
city_dict = dict(business['city'].value_counts()>500)
business = business[business['city'].map(city_dict)]
business['city'].value_counts()
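As an aside, the same filter can be written without the intermediate dictionary; a minimal equivalent sketch:
#equivalent filter: keep only the rows whose city appears more than 500 times
counts = business['city'].value_counts()
business = business[business['city'].isin(counts[counts > 500].index)]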
#pivot table to get total reviews per city
cities_reviews = business.pivot_table(values='review_count', index='city', aggfunc='sum').sort_values('review_count', ascending=False)
cities_10 = cities_reviews[:10]
#pivot table to get mean review score for each city
mean_reviews = business.pivot_table(index='city', values='stars', aggfunc='mean', margins=True).sort_values('stars', ascending=False)
#subsetting for the top 10 cities in terms of total number of reviews
top_10_cities = list(cities_reviews.index[:10])
top_10_cities.append('All')
cities_10_mean = mean_reviews.loc[top_10_cities]
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(1,2,1)
ax1.bar(cities_10.index, height=cities_10['review_count'])
ax1.set_xticklabels(labels= cities_10.index, rotation=90)
for key, spine in ax1.spines.items():
    spine.set_visible(False)
ax1.tick_params(left=False, bottom=False)
ax1.set_title('Total number of reviews for each city')
ax2 = fig.add_subplot(1,2,2)
ax2.bar(cities_10_mean.index, cities_10_mean['stars'])
ax2.set_ylim(3,4)
ax2.set_xticklabels(labels= cities_10_mean.index, rotation=90)
for key, spine in ax2.spines.items():
    spine.set_visible(False)
ax2.tick_params(left=False, bottom=False)
ax2.set_title('Mean review score for each city')
plt.show()
Another way of displaying the same information, with overlapping (twin) axes.
#pivot table to get total reviews per city
cities_reviews = business.pivot_table(values='review_count', index='city', aggfunc='sum').sort_values('review_count', ascending=False)
cities_10 = cities_reviews[:10]
#pivot table to get mean review score for each city
mean_reviews = business.pivot_table(index='city', values='stars', aggfunc='mean', margins=True).sort_values('stars', ascending=False)
#subsetting for the top 10 cities in terms of total number of reviews
#we just omitted the 'All' row, contrary to the graph before, since we want to superpose both graphs
top_10_cities = list(cities_reviews.index[:10])
cities_10_mean = mean_reviews.loc[top_10_cities]
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot()
ax1.bar(cities_10.index, height=cities_10['review_count'], label='Total reviews per city', color='blue')
ax1.set_xticklabels(labels= cities_10.index, rotation=90)
for key, spine in ax1.spines.items():
    spine.set_visible(False)
ax1.tick_params(axis='y', labelcolor='blue', left=False, bottom=False, right=False, top=False)
ax1.set_ylabel('Total number of reviews',c='blue')
ax2 = ax1.twinx()
ax2.bar(cities_10_mean.index, cities_10_mean['stars'], width=0.1, label='Mean review score for each city', color='red')
ax2.set_ylim(3,4)
ax2.set_ylabel('Mean review score',c='red')
for key, spine in ax2.spines.items():
    spine.set_visible(False)
ax2.tick_params(axis='y', labelcolor='red', left=False, bottom=False, right=False, top=False)
fig.legend(loc='upper center')
plt.show()
So we can see that there is no clear relationship between the number of reviews and the average rating of businesses. In other words, a high concentration of businesses in a particular city is not a proxy for the quality of those businesses (as rated by customers).
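One way to back this visual impression with a number (a quick sketch of ours, not part of the original analysis) is to correlate each city's total review volume with its mean star rating:
#correlation between per-city review volume and per-city mean rating;
# a value close to 0 supports the 'no trend' reading of the charts
city_stats = business.groupby('city').agg(total_reviews=('review_count', 'sum'),
                                          mean_stars=('stars', 'mean'))
print(city_stats['total_reviews'].corr(city_stats['mean_stars']))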
As we saw earlier, the 'categories' column contains strings of category names separated by a comma and a space. Let's convert these values to lists and, say, take a closer look at restaurants in particular. To do so, we first select the rows mentioning 'Restaurants', and consider that this is enough to capture all the restaurants. By treating NaN as False (na=False), we state that rows with NaN in the 'categories' column are not described and we won't consider them here.
restaurants = business.copy()
restaurants = restaurants[restaurants['categories'].str.contains('Restaurants', na=False)]
restaurants.head(10)
Let's drop the rows where 'is_open' is missing (we will filter out the closed businesses themselves later).
restaurants = restaurants[restaurants['is_open'].notnull()]
Now let's see how we can study these restaurants.
restaurants['categories'].value_counts()
There is a lot of information in the 'categories' column, and we can see that some places are labeled 'Mexican', others 'Pizza' or 'Chinese'. There are obviously too many distinct categories to try and analyze them all, so we will narrow the scope.
Let's imagine we are a group of friends planning a trip across North America. We describe ourselves as big food enthusiasts, and we want to plan where we will be eating in the first three cities we plan on visiting: Las Vegas, Phoenix and Toronto. We want to discover what the restaurant scene looks like in these three cities.
Among our group, we agree on trying different types of food and settle on the following list:
#let's prepare the categories column and work on another dataframe named trip
trip = restaurants.copy()
trip.reset_index(drop=True, inplace=True)
#let's create a mapping dictionary according to our list
categories_dict = {'Burgers': 'Fast Food',
                   'Fast Food': 'Fast Food',
                   'Sandwiches': 'Fast Food',
                   'Steakhouses': 'Steakhouses',
                   'American': 'Steakhouses',
                   'Indian': 'Indian',
                   'Chinese': 'Chinese',
                   'Thai': 'Thai',
                   'French': 'French',
                   'Japanese': 'Japanese',
                   'Sushi': 'Japanese',
                   'Mexican': 'Mexican',
                   'Italian': 'Italian',
                   'Pizza': 'Italian',
                   'Mediterranean': 'Mediterranean',
                   'Lebanese': 'Mediterranean',
                   'Middle Eastern': 'Mediterranean',
                   'Turkish': 'Mediterranean',
                   'Greek': 'Mediterranean',
                   'Cafes': 'Cafes',
                   'Bakeries': 'Cafes'}
#We use the .explode() method to split the categories column, creating one duplicate row of each business per category.
trip = trip.assign(categories = trip['categories'].str.split(', ')).explode('categories')
trip.head(10)
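For readers unfamiliar with .explode(), here is a tiny self-contained illustration (the toy dataframe is our own) of what the line above does:
#toy example: one business with two categories becomes two rows
demo = pd.DataFrame({'business_id': ['b1'], 'categories': ['Pizza, Burgers']})
demo = demo.assign(categories=demo['categories'].str.split(', ')).explode('categories')
print(demo)  # two rows: ('b1', 'Pizza') and ('b1', 'Burgers')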
#We recode the categories column of the trip dataframe with our mapping dictionary,
# according to our group's food list
trip['categories'] = trip['categories'].map(categories_dict)
#We drop the rows with categories that did not match our list
trip = trip[trip['categories'].notnull()]
#Finally, we drop the duplicate rows created by .explode(). This step can be tricky,
# since one business may match several categories in our list (e.g. a place offering both pizzas and burgers);
# we won't be too picky for this analysis and simply keep the first occurrence of each duplicate.
trip.drop_duplicates('business_id', inplace=True, ignore_index=True)
trip.head(20)
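If we wanted to be pickier, one hypothetical refinement (not applied here) would be to rank categories so that, when a business carries several of our labels, the preferred one survives deduplication. A sketch, where trip_exploded stands for the dataframe as it was before the drop_duplicates call and the priority order is an assumption of ours:
#assumed preference order; any label not listed gets the lowest priority
priority = {'French': 0, 'Japanese': 1, 'Mediterranean': 2}
rank = trip_exploded['categories'].map(priority).fillna(len(priority))
trip_picky = (trip_exploded.assign(rank=rank)
              .sort_values('rank')
              .drop_duplicates('business_id', ignore_index=True)
              .drop('rank', axis=1))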
#Let's remove the rows corresponding to closed businesses
trip = trip[trip['is_open']==1]
trip = trip.drop('is_open', axis=1)
trip.reset_index(drop=True, inplace=True)
trip.sort_values(['stars', 'review_count'], ascending=False).head()
trip['city'].value_counts()
Say we focus on the first city we'll visit: Toronto. Since we are staying three nights, we agreed on eating Mediterranean, Chinese and French food.
toronto = trip[trip['city']=='Toronto']
toronto['categories'].value_counts()
Let's try to see where the top 10 places for each food type in Toronto are!
#we order the restaurants by number of reviews first, in order to select already 'famous' places,
# then by 'stars'
#mediterranean restaurants
toronto_med_10 = toronto[toronto['categories']=='Mediterranean'].sort_values(['review_count','stars'], ascending=False).head(10)
med_latlon = toronto_med_10[['latitude','longitude']].values.tolist()
# print(med_latlon)
#Chinese restaurants
toronto_chin_10 = toronto[toronto['categories']=='Chinese'].sort_values(['review_count','stars'], ascending=False).head(10)
chin_latlon = toronto_chin_10[['latitude','longitude']].values.tolist()
#French restaurants
toronto_fr_10 = toronto[toronto['categories']=='French'].sort_values(['review_count','stars'], ascending=False).head(10)
fr_latlon = toronto_fr_10[['latitude','longitude']].values.tolist()
toronto_map = folium.Map(location=[43.651070, -79.347015], tiles='Stamen Toner', zoom_start=12)
for point in range(len(med_latlon)):
    folium.Marker(med_latlon[point], popup=toronto_med_10.iloc[point, 2], icon=folium.Icon(color='darkblue')).add_to(toronto_map)
    folium.Marker(chin_latlon[point], popup=toronto_chin_10.iloc[point, 2], icon=folium.Icon(color='red')).add_to(toronto_map)
    folium.Marker(fr_latlon[point], popup=toronto_fr_10.iloc[point, 2], icon=folium.Icon(color='green')).add_to(toronto_map)
toronto_map
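If this notebook were run as a plain script, the map would not render inline; a Folium map can instead be saved to a standalone HTML file (the filename is our own choice):
#save the interactive map so it can be opened in any browser
toronto_map.save('toronto_top_restaurants.html')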
We can see the high concentration of Chinese restaurants on Spadina Avenue in Toronto. Mediterranean and French restaurants seem to be more dispersed around the city.
Let's have a look at the review distribution for each food category.
toronto_selection = toronto[toronto['categories'].isin(['Mediterranean', 'French', 'Chinese'])]
g = sns.FacetGrid(toronto_selection,col='categories',height=5)
sns.despine(bottom=True, left=True)
g.map(sns.kdeplot, 'stars', shade=True)
fig, (ax1,ax2,ax3) = plt.subplots(1,3, figsize=(12,5), sharey=True)
ax1.boxplot(toronto[toronto['categories']=='French']['stars'])
ax1.set_xlabel('French restaurants')
ax2.boxplot(toronto[toronto['categories']=='Mediterranean']['stars'])
ax2.set_xlabel('Mediterranean restaurants')
ax3.boxplot(toronto[toronto['categories']=='Chinese']['stars'])
ax3.set_xlabel('Chinese restaurants')
plt.show()
We can see that, among these three categories, French restaurants seem to benefit from better reviews (a majority of scores between 3.5 and 4.5), whereas Mediterranean and Chinese restaurant reviews are a bit more dispersed.
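To back this boxplot reading with numbers, a quick sketch of ours: print the quartiles of the star ratings per category, which should match the boxes above.
#quartiles of 'stars' per category
print(toronto_selection.groupby('categories')['stars'].describe()[['25%', '50%', '75%']])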
To read the review JSON file, we first tried the same approach as for checkin and business, but the command took too long since the file is about 6 million lines long.
We found a solution from George Hou, who writes for Towards Data Science.
We read the file in chunks of 1 million lines and use the documentation provided by Yelp to set the data type of each column.
# size = 1000000
# review_json_path = 'yelp_academic_dataset_review.json'
# review = pd.read_json(review_json_path, lines=True,
#                       dtype={'review_id': str, 'user_id': str, 'business_id': str,
#                              'stars': int, 'date': str, 'text': str,
#                              'useful': int, 'funny': int, 'cool': int},
#                       chunksize=size)
# tip_json_path = 'yelp_academic_dataset_tip.json'
# tip = pd.read_json(tip_json_path, lines=True)
# tip.info()
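The commented code above only builds the chunked reader; here is a minimal sketch of ours (also left commented, since it requires the full file) of how the chunks could be consumed, keeping only the reviews for businesses present in trip:
# chunks = []
# for chunk in review:
#     # keep only the reviews attached to the businesses we kept in 'trip'
#     chunks.append(chunk[chunk['business_id'].isin(trip['business_id'])])
# reviews_trip = pd.concat(chunks, ignore_index=True)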