Yelp database exploratory analysis

Behind this short python module lies another attempt at practising on some data manipulation skills, from playing with json files, organizing and 'cleaning' a dataframe with Pandas, but also performing some analysis and visualize it through matplotlib/seaborn or thanks to Folium. At this point, such work could be the start of a larger project aiming at analyzing Yelp database more in-depth, and we could keep on studying businesses and their ratings, as well as integrate other data from Yelp.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import folium
In [2]:
# Reading all the yelp dataset json files
checkin_json_path = 'yelp_academic_dataset_checkin.json'
checkin = pd.read_json(checkin_json_path, lines=True)
In [3]:
checkin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175187 entries, 0 to 175186
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  175187 non-null  object
 1   date         175187 non-null  object
dtypes: object(2)
memory usage: 2.7+ MB
In [4]:
business_json_path = 'yelp_academic_dataset_business.json'
business = pd.read_json(business_json_path, lines=True)
In [5]:
business.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209393 entries, 0 to 209392
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   209393 non-null  object 
 1   name          209393 non-null  object 
 2   address       209393 non-null  object 
 3   city          209393 non-null  object 
 4   state         209393 non-null  object 
 5   postal_code   209393 non-null  object 
 6   latitude      209393 non-null  float64
 7   longitude     209393 non-null  float64
 8   stars         209393 non-null  float64
 9   review_count  209393 non-null  int64  
 10  is_open       209393 non-null  int64  
 11  attributes    180348 non-null  object 
 12  categories    208869 non-null  object 
 13  hours         164550 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 22.4+ MB

For the moment, let's focus on the businesses and set the user dataset aside.

In [6]:
business.head(20)
Out[6]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 f9NumwFMBDn751xgFiRbNA The Range At Lake Norman 10913 Bailey Rd Cornelius NC 28031 35.462724 -80.852612 3.5 36 1 {'BusinessAcceptsCreditCards': 'True', 'BikePa... Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh... {'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'...
1 Yzvjg0SayhoZgCljUJRF9Q Carlos Santo, NMD 8880 E Via Linda, Ste 107 Scottsdale AZ 85258 33.569404 -111.890264 5.0 4 1 {'GoodForKids': 'True', 'ByAppointmentOnly': '... Health & Medical, Fitness & Instruction, Yoga,... None
2 XNoUzKckATkOD1hP6vghZg Felinus 3554 Rue Notre-Dame O Montreal QC H4C 1P4 45.479984 -73.580070 5.0 5 1 None Pets, Pet Services, Pet Groomers None
3 6OAZjbxqM5ol29BuHsil3w Nevada House of Hose 1015 Sharp Cir North Las Vegas NV 89030 36.219728 -115.127725 2.5 3 0 {'BusinessAcceptsCreditCards': 'True', 'ByAppo... Hardware Stores, Home Services, Building Suppl... {'Monday': '7:0-16:0', 'Tuesday': '7:0-16:0', ...
4 51M2Kk903DFYI6gnB5I6SQ USE MY GUY SERVICES LLC 4827 E Downing Cir Mesa AZ 85205 33.428065 -111.726648 4.5 26 1 {'BusinessAcceptsCreditCards': 'True', 'ByAppo... Home Services, Plumbing, Electricians, Handyma... {'Monday': '0:0-0:0', 'Tuesday': '9:0-16:0', '...
5 cKyLV5oWZJ2NudWgqs8VZw Oasis Auto Center - Gilbert 1720 W Elliot Rd, Ste 105 Gilbert AZ 85233 33.350399 -111.827142 4.5 38 1 {'BusinessAcceptsCreditCards': 'True'} Auto Repair, Automotive, Oil Change Stations, ... {'Monday': '7:0-18:0', 'Tuesday': '7:0-18:0', ...
6 oiAlXZPIFm2nBCt0DHLu_Q Green World Cleaners 6870 S Rainbow Blvd, Ste 117 Las Vegas NV 89118 36.063977 -115.241463 3.5 81 1 {'BusinessParking': '{'garage': False, 'street... Dry Cleaning & Laundry, Local Services, Laundr... {'Monday': '7:0-19:0', 'Tuesday': '7:0-19:0', ...
7 ScYkbYNkDgCneBrD9vqhCQ Junction Tire & Auto Service 6910 E Southern Ave Mesa AZ 85209 33.393885 -111.682226 5.0 18 1 {'BusinessAcceptsCreditCards': 'True', 'ByAppo... Auto Repair, Oil Change Stations, Automotive, ... {'Monday': '7:30-17:0', 'Tuesday': '7:30-17:0'...
8 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Ethnic Food, Food Trucks, Specialty Food, Impo... {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
9 EosRKXIGeSWFYWwpkbhNnA Xtreme Couture 700 Kipling Avenue Etobicoke Toronto ON M8Z 5G3 43.624539 -79.529108 3.0 16 1 {'GoodForKids': 'True', 'ByAppointmentOnly': '... Martial Arts, Gyms, Fitness & Instruction, Act... {'Monday': '5:30-23:0', 'Tuesday': '5:30-23:0'...
10 MbZMmwo-eL0Jnm_Yb9KJrA Chinook Landscaping and Design Calgary AB T2J 2P2 50.945960 -114.037207 5.0 3 1 None Contractors, Landscaping, Home Services {'Monday': '7:0-17:0', 'Tuesday': '7:0-17:0', ...
11 7Dv4_HAxsxvadEsT5fxQBg Dependable Brakes & Exhaust 1110 Saw Mill Run Blvd Pittsburgh PA 15226 40.406667 -80.004450 5.0 5 1 {'BusinessAcceptsCreditCards': 'True'} Automotive, Auto Repair {'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...
12 M_guz7Dj7hX0evS672wIwA Chocolate Shoppe Ice Cream 2831 Parmenter St Middleton WI 53562 43.105310 -89.510142 3.5 6 1 {'BikeParking': 'True', 'Caters': 'True', 'Whe... Desserts, Food, Ice Cream & Frozen Yogurt {'Monday': '0:0-0:0', 'Tuesday': '11:0-21:0', ...
13 JjJs3o60uQCfctDjs45cmA Convertabath 116 N Roosevelt Ave, Bldg B, Ste 124 Chandler AZ 85226 33.303869 -111.951660 2.5 10 1 {'BusinessAcceptsCreditCards': 'True', 'ByAppo... Contractors, Home Services, Local Services {'Monday': '9:0-16:0', 'Tuesday': '9:0-16:0', ...
14 kOICO53wbOiOJcKuCgOQ3A Tan Las Vegas 5465 Simmons St North Las Vegas NV 89031 36.263362 -115.179839 3.5 5 0 {'BusinessAcceptsCreditCards': 'True', 'Restau... Beauty & Spas, Tanning {'Monday': '8:0-21:0', 'Tuesday': '8:0-21:0', ...
15 rqcOZePlVvJP9EtzldIz0w Desert Storage 2402 Atchley Dr Henderson NV 89052 35.952046 -115.093483 5.0 5 0 {'BusinessAcceptsCreditCards': 'True'} Local Services, Self Storage {'Monday': '8:0-18:0', 'Tuesday': '8:0-18:0', ...
16 uZuh51lXu7tsrC8RAwkg1A Manolo Blahnik 3131 Las Vegas Blvd Las Vegas NV 89109 36.125732 -115.167608 4.5 32 0 {'BusinessParking': '{'garage': True, 'street'... Shopping, Shoe Stores, Fashion None
17 nIEhsGbw0vJuYl05bzzj6Q Bloom & Blueprint 2115 E Cedar St, Unit 3 Tempe AZ 85281 33.411514 -111.895378 4.5 7 1 {'BusinessAcceptsCreditCards': 'True', 'Restau... Event Planning & Services, Wedding Planning, F... None
18 edQoeeBFUTYGwnUSE0tGPg Car Guyz 2915 Cherry Rd Rock Hill SC 29730 34.981121 -80.979023 4.5 10 1 {'BusinessAcceptsCreditCards': 'True'} Auto Customization, Tires, Motorcycle Repair, ... {'Monday': '8:0-18:0', 'Tuesday': '8:0-18:0', ...
19 Vwo64kNYDjKi98gUUv4-iQ Gateway Fit Body Boot Camp 5229 S Power Rd Mesa AZ 85212 33.320723 -111.685869 4.5 16 1 {'AcceptsInsurance': 'False', 'GoodForKids': '... Weight Loss Centers, Fitness & Instruction, Bo... {'Monday': '5:30-20:0', 'Tuesday': '5:30-20:0'...
In [7]:
business.isnull().sum()
Out[7]:
business_id         0
name                0
address             0
city                0
state               0
postal_code         0
latitude            0
longitude           0
stars               0
review_count        0
is_open             0
attributes      29045
categories        524
hours           44843
dtype: int64

First thoughts on this dataset:

  • 209393 entries,
  • the business_id column is the link between all Yelp datasets so we definitely want to keep it as it is
  • there are no duplicates
  • Yelp provides quite a lot of geographical information: city, state, postal code, latlon info
  • we also have the 'is_open' column, which we could first use to discard the rows where is_open equals to 0, meaning the place is no longer open
  • without the review dataset, we already have some information on the business reviews (stars and review_count columns)
  • the dataset is remarkably complete, since only attributes, categories and hours have null entries
  • the 'attributes' columns is filled with json dictionnaries, indicating some information on the business, whereas the 'categories' column provides information on the type of business, as a str of seveal words separated by commas
  • the 'hours' column is also filled with json dictionnaries, and we will want to keep this information as it will be valuable for our purpose.

First exploratory analysis

Analysis on cities

In [8]:
city_total = business['city'].value_counts()
print(city_total)
Las Vegas             31631
Toronto               20366
Phoenix               20171
Charlotte             10422
Scottsdale             9342
                      ...  
Elizabeth Township        1
Newburgh Heights          1
Oakridges                 1
Baie-d'Urfe               1
Saint Hubert              1
Name: city, Length: 1251, dtype: int64
In [9]:
business['city'].str.contains('vegas', case=False).sum()
Out[9]:
33419

We can see that there are 33419 rows with the mention 'vegas' (case insensitive), though there are 'only' 31631 businesses with 'Las Vegas' as city.
Let's see what are the rest of the rows that also contain 'vegas'.

In [10]:
vegas = business[(business['city'].str.contains('vegas', case=False)) & (business['city']!='Las Vegas')]
vegas['city'].value_counts()
Out[10]:
North Las Vegas            1677
N Las Vegas                  28
Las vegas                    17
LAS VEGAS                    13
N. Las Vegas                 10
las vegas                     8
Las  Vegas                    5
las Vegas                     4
La Vegas                      2
Las Vegas, NV                 2
Las Vegas,                    2
South Las Vegas               2
Las Vegas                     2
Las Vegas Nevada              1
Henderson and Las vegas       1
N.Las Vegas                   1
N W Las Vegas                 1
North Las Vegas               1
110 Las Vegas                 1
Vegas                         1
LAS VEGAS AP                  1
N E Las Vegas                 1
Las Vegass                    1
Las Vegas, Nevada             1
Lake Las Vegas                1
C Las Vegas                   1
West Las Vegas                1
Las Vegas Nv                  1
Las Vegas East                1
Name: city, dtype: int64

We seek to replace all these entries with Las Vegas, since they all refer to the same city.

In [11]:
business.loc[(business['city'].str.contains('vegas', case=False)) & (business['city']!='Las Vegas'), 'city'] = 'Las Vegas'
business['city'].value_counts()
Out[11]:
Las Vegas                          33419
Toronto                            20366
Phoenix                            20171
Charlotte                          10422
Scottsdale                          9342
                                   ...  
Sainte-Rose                            1
Garfield Hts.                          1
Fort Mills                             1
Sainte-Marguerite-du-lac-Masson        1
Saint Hubert                           1
Name: city, Length: 1222, dtype: int64

If we encountered this issue with Las Vegas, it could be the case with other cities as well. We will try to clean the 'city' column in the following code.

There are replacements to be done on all the top 10 cities (except for Las Vegas since we already dit it). Let's write some code to check for this issue for the 10 most represented cities.

In [12]:
city_total = business['city'].value_counts()
print(city_total[:10])
for c in list(city_total[:10].index):
    print(c)
    print('\n')
    c_mask = (business['city'].str.contains(c, case=False)) & (business['city']!=c)
    print(business[c_mask]['city'].value_counts())
    print(city_total[c], ': initial value count of the city')
    print(business['city'].str.contains(c, case=False).sum(), ': value count considering all mentions of the city name')
    print('\n')
Las Vegas     33419
Toronto       20366
Phoenix       20171
Charlotte     10422
Scottsdale     9342
Calgary        8377
Pittsburgh     7630
Montréal       6979
Mesa           6577
Henderson      5272
Name: city, dtype: int64
Las Vegas


Series([], Name: city, dtype: int64)
33419 : initial value count of the city
33419 : value count considering all mentions of the city name


Toronto


Downtown Toronto           4
toronto                    3
Toronto Division           1
Toronto-West               1
West Toronto               1
Toronto-North York         1
TORONTO                    1
North Toronto              1
TORONTO - FRONT ST (OT)    1
Toronto-Etobicoke          1
TORONTO - DANFORTH (OT)    1
Name: city, dtype: int64
20366 : initial value count of the city
20382 : value count considering all mentions of the city name


Phoenix


phoenix               14
PHOENIX                9
Phoenix                2
Phoenix, AZ            2
Phoenix Metro Area     1
PHOENIX AP             1
Metro Phoenix          1
North Phoenix          1
Phoenix Valley         1
Phoenix AZ             1
Phoenix,               1
Name: city, dtype: int64
20171 : initial value count of the city
20205 : value count considering all mentions of the city name


Charlotte


charlotte          4
CHARLOTTE          2
North Charlotte    2
Charlotte          1
South Charlotte    1
Name: city, dtype: int64
10422 : initial value count of the city
10432 : value count considering all mentions of the city name


Scottsdale


scottsdale              5
SCOTTSDALE              2
Scottsdale, AZ          1
North Scottsdale        1
Old Town Scottsdale     1
Westworld Scottsdale    1
Old Scottsdale          1
Name: city, dtype: int64
9342 : initial value count of the city
9354 : value count considering all mentions of the city name


Calgary


Northwest Calgary    3
Southeast Calgary    3
Northeast Calgary    1
calgary              1
SW Calgary           1
CALGARY              1
East Calgary         1
Name: city, dtype: int64
8377 : initial value count of the city
8388 : value count considering all mentions of the city name


Pittsburgh


East Pittsburgh    8
pittsburgh         3
Pittsburgh         2
PITTSBURGH AP      1
PITTSBURGH         1
PIttsburgh         1
St.Pittsburgh      1
Name: city, dtype: int64
7630 : initial value count of the city
7647 : value count considering all mentions of the city name


Montréal


Montréal-Ouest                    5
Montréal-Nord                     4
Communauté-Urbaine-de-Montréal    1
Montréal (Québec)                 1
Montréal-West                     1
West Montréal                     1
Name: city, dtype: int64
6979 : initial value count of the city
6992 : value count considering all mentions of the city name


Mesa


MESA            9
mesa            5
Mesa Az         1
Mesa Arizona    1
Mesa, AZ        1
East Mesa       1
Name: city, dtype: int64
6577 : initial value count of the city
6595 : value count considering all mentions of the city name


Henderson


henderson                   3
Central Henderson           1
Hendersonville              1
W Henderson                 1
Henderson Nevada            1
Henderson (Green Valley)    1
HENDERSON                   1
Name: city, dtype: int64
5272 : initial value count of the city
5281 : value count considering all mentions of the city name


We will deal with this issue later before performing analysis on particular cities.

In [13]:
#checks that the code works well
# for c in list(city_total[:10].index):
#     print(c)
#     business.loc[(business['city'].str.contains(c, case=False)) & (business['city']!=c), 'city'] = c
#     print(business['city'].str.contains(c, case=False).sum() - business[business['city']==c].shape[0])

Let's say we want to simplify the dataset by only keeping the cities with 500+ businesses

In [14]:
city_dict = dict(business['city'].value_counts()>500)
business = business[business['city'].map(city_dict)]
In [17]:
business['city'].value_counts()
Out[17]:
Las Vegas        33419
Toronto          20366
Phoenix          20171
Charlotte        10422
Scottsdale        9342
Calgary           8377
Pittsburgh        7630
Montréal          6979
Mesa              6577
Henderson         5272
Tempe             4797
Chandler          4580
Cleveland         3857
Glendale          3824
Gilbert           3745
Madison           3685
Mississauga       3518
Peoria            2080
Markham           1963
Champaign         1327
North York        1319
Scarborough       1244
Surprise          1218
Richmond Hill     1153
Brampton          1146
Concord           1125
Vaughan           1045
Goodyear           951
Etobicoke          841
Matthews           812
Avondale           745
Fort Mill          696
Huntersville       684
Gastonia           564
Lakewood           562
Mentor             559
Akron              512
Name: city, dtype: int64

Cities with the most reviews & average review scores

In [18]:
#pivot table to get total reviews per city
cities_reviews = business.pivot_table(values='review_count', index='city', aggfunc=np.sum).sort_values('review_count', ascending=False)
cities_10 = cities_reviews[:10]

#pivot table to get mean review score for each city
mean_reviews = business.pivot_table(index='city', values='stars', aggfunc=np.mean, margins=True).sort_values('stars', ascending=False)

#subsetting for the top 10 cities in terms of total number of reviews
top_10_cities = list(cities_reviews.index[:10]).copy()
top_10_cities.append('All')
cities_10_mean = mean_reviews.loc[top_10_cities]

fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(1,2,1)
ax1.bar(cities_10.index, height=cities_10['review_count'])
ax1.set_xticklabels(labels= cities_10.index, rotation=90)
for key,spine in ax1.spines.items():
    spine.set_visible(False)
ax1.tick_params(left=False, bottom=False)
ax1.set_title('Total number of reviews for each city')

ax2 = fig.add_subplot(1,2,2)
ax2.bar(cities_10_mean.index, cities_10_mean['stars'])
ax2.set_ylim(3,4)
ax2.set_xticklabels(labels= cities_10_mean.index, rotation=90)
for key,spine in ax2.spines.items():
    spine.set_visible(False)
ax2.tick_params(left=False, bottom=False)
ax2.set_title('Mean review score for each city')

plt.show()

Another way of displaying the same information, with overlapping axes

In [19]:
#pivot table to get total reviews per city
cities_reviews = business.pivot_table(values='review_count', index='city', aggfunc=np.sum).sort_values('review_count', ascending=False)
cities_10 = cities_reviews[:10]

#pivot table to get mean review score for each city
mean_reviews = business.pivot_table(index='city', values='stars', aggfunc=np.mean, margins=True).sort_values('stars', ascending=False)

#subsetting for the top 10 cities in terms of total number of reviews
#we just omitted the 'All' row, contrary to the graph before, since we want to superpose both graphs
top_10_cities = list(cities_reviews.index[:10]).copy()
cities_10_mean = mean_reviews.loc[top_10_cities]

fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot()
ax1.bar(cities_10.index, height=cities_10['review_count'], label='Total reviews per city', color='blue')
ax1.set_xticklabels(labels= cities_10.index, rotation=90)
for key,spine in ax1.spines.items():
    spine.set_visible(False)
ax1.tick_params(axis='y', labelcolor='blue', left=False, bottom=False, right=False, top=False)
ax1.set_ylabel('Total number of reviews',c='blue')

ax2 = ax1.twinx()
ax2.bar(cities_10_mean.index, cities_10_mean['stars'], width=0.1, label='Mean review score for each city', color='red')
ax2.set_ylim(3,4)
ax2.set_ylabel('Mean review score',c='red')
for key,spine in ax2.spines.items():
    spine.set_visible(False)
ax2.tick_params(axis='y', labelcolor='red', left=False, bottom=False, right=False, top=False)
fig.legend(loc='upper center')
plt.show()

So we can see that there is no trend between the number of reviews and the average rating of the businesses. In other words, a big concentration of businesses in a particular city is not a proxy for measuring the businesses' quality (as rated by customers).

About businesses

Most encountered business types

As we saw earlier, the 'categories' column is filled with str values representing categories separated by a comma and a blank space. Let's convert these values to a list and, say, we want to have a look at restaurants in particular. To do so, we will first select the rows containing the mention 'Restaurants', and consider that it is enough to select all the retaurants. By considering NaN as False, we state that rows with NaN as value in the 'categories' column aren't described and we won't consider them here.

In [20]:
restaurants = business.copy()
restaurants = restaurants[restaurants['categories'].str.contains('Restaurants', na=False)]
restaurants.head(10)
Out[20]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
8 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Ethnic Food, Food Trucks, Specialty Food, Impo... {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
20 CsLQLiRoafpJPJSkNX2h5Q Middle East Deli 4508 E Independence Blvd Charlotte NC 28205 35.194894 -80.767442 3.0 5 0 {'RestaurantsGoodForGroups': 'True', 'OutdoorS... Food, Restaurants, Grocery, Middle Eastern None
30 9sRGfSVEfLhN_km60YruTA Apadana Restaurant 13071 Yonge Street Richmond Hill ON L4E 1A5 43.947011 -79.454862 3.0 3 1 {'Ambience': '{'touristy': False, 'hipster': F... Persian/Iranian, Turkish, Middle Eastern, Rest... {'Tuesday': '12:0-21:0', 'Wednesday': '12:0-21...
33 vjTVxnsQEZ34XjYNS-XUpA Wetzel's Pretzels 4550 East Cactus Rd, #KSFC-4 Phoenix AZ 85032 33.602822 -111.983533 4.0 10 1 {'GoodForKids': 'True', 'RestaurantsTakeOut': ... Food, Pretzels, Bakeries, Fast Food, Restaurants {'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0'...
36 fnZrZlqW1Z8iWgTVDfv_MA Carl's Jr 9595 W Tropicana Ave Las Vegas NV 89147 36.099738 -115.301568 2.5 15 0 {'Alcohol': 'u'none'', 'WiFi': 'u'no'', 'GoodF... Mexican, Restaurants, Fast Food None
38 rVBPQdeayMYht4Uv_FOLHg Gourmet Burger Company 843 Kipling Avenue Toronto ON M8Z 5G9 43.633291 -79.531768 3.0 13 0 {'RestaurantsPriceRange2': '2', 'RestaurantsAt... Restaurants, Burgers, Food None
41 98hyK2QEUeI8v2y0AghfZA Pho Lee's Vietnamese Restaurant 1541 E 38th St, Ste 101 Cleveland OH 44114 41.512155 -81.663332 4.5 23 1 {'BusinessParking': 'None', 'Ambience': '{'tou... Restaurants, Vietnamese, Soup {'Monday': '11:0-20:0', 'Wednesday': '11:0-20:...
42 fhNf_sg-XzZ3e7HEVGuOZg Meat Chix And Wieners 6530 S Decatur Blvd Las Vegas NV 89118 36.071196 -115.207463 3.0 10 0 {'RestaurantsPriceRange2': '2', 'BusinessParki... Burgers, Restaurants, American (Traditional), ... {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...
43 LoRef3ChgZKbxUio-sHgQg Amir 5252 Rue Jean Talon O Montréal QC H4P 2A7 45.494870 -73.651904 3.0 18 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Restaurants, Lebanese, Middle Eastern None
44 Ga2Bt7xfqoggTypWD5VpoQ Amando's Bros 2602 W Southern Ave Tempe AZ 85282 33.393199 -111.976270 4.0 9 0 {'Caters': 'False', 'RestaurantsGoodForGroups'... Mexican, Restaurants {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...

Let's drop the businesses that are closed.

In [21]:
restaurants = restaurants[restaurants['is_open'].notnull()]

Now let's see how we can study these restaurants.

In [22]:
restaurants['categories'].value_counts()
Out[22]:
Mexican, Restaurants                                                                                                                                          778
Restaurants, Mexican                                                                                                                                          757
Pizza, Restaurants                                                                                                                                            712
Restaurants, Pizza                                                                                                                                            707
Restaurants, Chinese                                                                                                                                          687
                                                                                                                                                             ... 
Lounges, Food, Food Court, Restaurants, Nightlife, Middle Eastern, Bars, Hookah Bars                                                                            1
Nightlife, Lounges, Restaurants, Tapas Bars, Bars, American (New)                                                                                               1
Coffee & Tea, Taiwanese, Bubble Tea, Food, Restaurants                                                                                                          1
Beer, Wine & Spirits, American (Traditional), Food, Restaurants, Burgers                                                                                        1
Hawaiian, Restaurants, Asian Fusion, Caterers, Cajun/Creole, Comfort Food, Breakfast & Brunch, Food, Event Planning & Services, Barbeque, Sandwiches, Poke      1
Name: categories, Length: 28692, dtype: int64

There are a lot of information in the 'categories' column and we can see that some places are labeled as 'Mexican', others as 'Pizza' or 'Chinese'. There are obviously too many different categories to try and analyze them all, so we can either:

  • group them, for example between different types of food (Mexican, Pizza, Burgers, ...)
  • separate them between places that operate only as restaurants and mixed businesses that are both labeled as restaurants as well as Shop or Grocery..
  • have a look at some particular cities and play with geographic data

Let's imagine we are a group of friends planning a trip across North America. We describe ourselves as big food enthusiasts and we want to plan where we will be eating for the first three cities we plan on visiting which are Las Vegas, Phoenix and Toronto. We want to discover what the restaurants scene look like in these three cities.
Among our group, we agree on trying different types of food and set the following list:

  • burgers and/or fast-foods and/or sandwiches
  • steakhouses and/or american
  • indian
  • chinese
  • thai
  • french
  • japanese and/or sushis
  • mexican
  • italian and/or pizzas
  • mediterranean and/or middle eastern, lebanese, turkish and/or greek (anything with olive oil)
  • cafes and/or bakeries (for the moments we'll want to relax, have a non-alcoholic drink or something sugary)
In [23]:
#let's prepare the categories columns and work on another dataframe named trip
trip = restaurants.copy()
trip.reset_index(inplace=True)
trip.drop('index', axis=1, inplace=True)
#let's create a mapping dictionnary according to our list
categories_dict = {'Burgers':'Fast Food',
                  'Fast Food':'Fast Food',
                  'Sandwiches':'Fast Food',
                  'Steakhouses':'Steakhouses',
                  'American':'Steakhouses',
                  'Indian':'Indian',
                  'Chinese':'Chinese',
                  'Thai':'Thai',
                  'French':'French',
                  'Japanese':'Japanese',
                  'Sushi':'Japanese',
                  'Mexican':'Mexican',
                  'Italian':'Italian',
                  'Pizza':'Italian',
                  'Mediterranean':'Mediterranean',
                  'Lebanese':'Mediterranean',
                  'Middle Eastern':'Mediterranean',
                  'Turkish':'Mediterranean',
                  'Greek':'Mediterranean',
                  'Cafes':'Cafes',
                  'Bakeries':'Cafes'}
In [24]:
#We use the .explode() method in order to split the categories column, creating duplicate rows for each business.
trip = trip.assign(categories = trip['categories'].str.split(', ')).explode('categories')
trip.head(10)
Out[24]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Ethnic Food {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Food Trucks {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Specialty Food {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Imported Food {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Argentine {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Food {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Restaurants {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
0 pQeaRpvuhoEqudo3uymHIQ The Empanadas House 404 E Green St Champaign IL 61820 40.110446 -88.233073 4.5 5 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Empanadas {'Monday': '11:30-14:30', 'Tuesday': '11:30-14...
1 CsLQLiRoafpJPJSkNX2h5Q Middle East Deli 4508 E Independence Blvd Charlotte NC 28205 35.194894 -80.767442 3.0 5 0 {'RestaurantsGoodForGroups': 'True', 'OutdoorS... Food None
1 CsLQLiRoafpJPJSkNX2h5Q Middle East Deli 4508 E Independence Blvd Charlotte NC 28205 35.194894 -80.767442 3.0 5 0 {'RestaurantsGoodForGroups': 'True', 'OutdoorS... Restaurants None
In [25]:
#We categorize the categories column from trip dataframe thanks to our mapping dictionnary,
# according to our group's food list 
trip['categories'] = trip['categories'].map(categories_dict)
#We drop the rows with categories that did not match our list
trip = trip[trip['categories'].notnull()]
#We eventually drop the duplicate rows created by the .explode() method. This action may be tricky to perform
# since one business could be associated with several categories in our list (if a business offers pizzas and burgers)
# we won't be too picky for this analysis and keep the first occurrence for each duplicate.
trip.drop_duplicates('business_id', inplace=True, ignore_index=True)
trip.head(20)
Out[25]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 CsLQLiRoafpJPJSkNX2h5Q Middle East Deli 4508 E Independence Blvd Charlotte NC 28205 35.194894 -80.767442 3.0 5 0 {'RestaurantsGoodForGroups': 'True', 'OutdoorS... Mediterranean None
1 9sRGfSVEfLhN_km60YruTA Apadana Restaurant 13071 Yonge Street Richmond Hill ON L4E 1A5 43.947011 -79.454862 3.0 3 1 {'Ambience': '{'touristy': False, 'hipster': F... Mediterranean {'Tuesday': '12:0-21:0', 'Wednesday': '12:0-21...
2 vjTVxnsQEZ34XjYNS-XUpA Wetzel's Pretzels 4550 East Cactus Rd, #KSFC-4 Phoenix AZ 85032 33.602822 -111.983533 4.0 10 1 {'GoodForKids': 'True', 'RestaurantsTakeOut': ... Cafes {'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0'...
3 fnZrZlqW1Z8iWgTVDfv_MA Carl's Jr 9595 W Tropicana Ave Las Vegas NV 89147 36.099738 -115.301568 2.5 15 0 {'Alcohol': 'u'none'', 'WiFi': 'u'no'', 'GoodF... Mexican None
4 rVBPQdeayMYht4Uv_FOLHg Gourmet Burger Company 843 Kipling Avenue Toronto ON M8Z 5G9 43.633291 -79.531768 3.0 13 0 {'RestaurantsPriceRange2': '2', 'RestaurantsAt... Fast Food None
5 fhNf_sg-XzZ3e7HEVGuOZg Meat Chix And Wieners 6530 S Decatur Blvd Las Vegas NV 89118 36.071196 -115.207463 3.0 10 0 {'RestaurantsPriceRange2': '2', 'BusinessParki... Fast Food {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...
6 LoRef3ChgZKbxUio-sHgQg Amir 5252 Rue Jean Talon O Montréal QC H4P 2A7 45.494870 -73.651904 3.0 18 1 {'RestaurantsAttire': 'u'casual'', 'Restaurant... Mediterranean None
7 Ga2Bt7xfqoggTypWD5VpoQ Amando's Bros 2602 W Southern Ave Tempe AZ 85282 33.393199 -111.976270 4.0 9 0 {'Caters': 'False', 'RestaurantsGoodForGroups'... Mexican {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...
8 xFc50drSPxXkcLvX5ygqrg Boomer's Sweet Home Chicago 5932 W Bell Rd, Ste D-109 Glendale AZ 85308 33.640391 -112.187028 3.5 12 0 {'BusinessAcceptsCreditCards': 'True', 'Busine... Fast Food None
9 tLpkSwdtqqoXwU0JAGnApw Wendy's 4602 Northfield Road Cleveland OH 44128 41.434614 -81.527026 3.5 7 1 {'RestaurantsDelivery': 'False', 'BusinessAcce... Fast Food {'Monday': '10:0-3:0', 'Tuesday': '10:0-3:0', ...
10 Sd75ucXKoZUM2BEfBHFUOg China Gourmet 3460 E Southern Ave, Ste 109 Mesa AZ 85204 33.394863 -111.756036 3.0 13 0 {'RestaurantsPriceRange2': '1', 'RestaurantsAt... Chinese None
11 lK-wuiq8b1TuU7bfbQZgsg Hingetown Cleveland OH 44113 41.489343 -81.711029 3.0 4 1 {'Alcohol': 'u'none'', 'GoodForKids': 'True', ... Cafes {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...
12 LAoSegVNU4wx4GTA8reB6A Tzikii Food Truck 7510 S Priest Dr Tempe AZ 85283 33.350104 -111.963660 3.0 4 1 {'RestaurantsTakeOut': 'False', 'WheelchairAcc... Mediterranean None
13 -qjn24n8HYF6It9GQrQntw Five Guys Burgers and Fries 2130 E Arbors Dr Charlotte NC 28262 35.338070 -80.757397 4.0 15 0 {'RestaurantsReservations': 'False', 'Restaura... Fast Food {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...
14 RrapAhd8ZxCj-iue7fu9FA Ganga Restaurant 515 4th Avenue SW Calgary AB T2P 0J8 51.049407 -114.072656 1.5 3 1 {'RestaurantsReservations': 'True', 'WiFi': 'u... Indian {'Monday': '17:0-22:0', 'Tuesday': '17:0-22:0'...
15 OWkS1FXNJbozn-qPg3LWxg Mama Napoli Pizza Las Vegas NV 89109 36.128561 -115.171130 4.5 11 1 {'RestaurantsDelivery': 'False', 'BusinessAcce... Italian {'Friday': '18:0-0:0'}
16 j9bWpCRwpDVfwVT_V85qeA Papaya Thai 2706 E University Dr Mesa AZ 85213 33.423705 -111.772890 2.5 130 1 {'Caters': 'False', 'OutdoorSeating': 'False',... Thai {'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...
17 6GHwgKNlvfIMUpFaxgBjUA L'Artisan Piadineria 3421 Rue Saint-Denis Montréal QC H2L 3L2 45.517369 -73.567940 4.0 9 0 {'RestaurantsTakeOut': 'True', 'Ambience': '{'... Fast Food {'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', ...
18 Ir_QIzs-4o9ElOtiGuxJrw Torteria Lupita 4601 Rue Notre-Dame O Montréal QC H4C 1S5 45.473632 -73.588135 4.0 19 0 {'RestaurantsPriceRange2': '2', 'BusinessParki... Mexican {'Tuesday': '12:0-20:0', 'Wednesday': '12:0-20...
19 JcsZvx-4yovFgCXOfd6KMg Sam's Deli Mart 3300 N Central Ave Lbby Phoenix AZ 85012 33.648302 -112.119379 3.0 4 0 {'RestaurantsTakeOut': 'True', 'RestaurantsGoo... Fast Food None
In [26]:
#Let's remove the rows corresponding to closed businesses
trip = trip[trip['is_open']==1]
trip = trip.drop('is_open', axis=1)
trip.reset_index(inplace=True)
In [27]:
trip.sort_values(['stars', 'review_count'], ascending=False).head()
Out[27]:
index business_id name address city state postal_code latitude longitude stars review_count attributes categories hours
1539 2232 IhNASEZ3XnBHmuuVnWdIwA Brew Tea Bar 7380 S Rainbow Blvd, Ste 101 Las Vegas NV 89139 36.054227 -115.242392 5.0 1827 {'HasTV': 'False', 'OutdoorSeating': 'True', '... Cafes {'Monday': '0:0-0:0', 'Tuesday': '11:0-22:0', ...
19421 28202 9MVKjEMN5T59uzG1xoD2BQ Cocina Madrigal 4044 S 16th St Phoenix AZ 85040 33.408767 -112.047381 5.0 1107 {'GoodForKids': 'True', 'WiFi': 'u'free'', 'Re... Mexican {'Monday': '0:0-0:0', 'Tuesday': '11:0-21:0', ...
20221 29358 16d3BlncEyCTzb0GxXrBXQ Green Corner Restaurant 1038 W Southern Ave, Ste 1 Mesa AZ 85210 33.394446 -111.854111 5.0 858 {'RestaurantsGoodForGroups': 'True', 'Restaura... Mediterranean {'Monday': '10:30-22:0', 'Tuesday': '10:30-22:...
2049 2972 SSCH4Z2gw-hh2KZy7aH4qw Worth Takeaway 218 W Main St Mesa AZ 85201 33.415461 -111.837432 5.0 842 {'DogsAllowed': 'True', 'BusinessParking': '{'... Fast Food {'Monday': '0:0-0:0', 'Tuesday': '8:0-16:0', '...
5628 8158 8fFTJBh0RB2EKG53ibiBKw Zenaida's Cafe 3430 E Tropicana Ave, Ste 32 Las Vegas NV 89121 36.101741 -115.100359 5.0 717 {'DogsAllowed': 'False', 'GoodForMeal': '{'des... Cafes {'Wednesday': '8:0-14:0', 'Thursday': '8:0-14:...
In [28]:
trip['city'].value_counts()
Out[28]:
Toronto          3853
Las Vegas        3470
Phoenix          2158
Montréal         2008
Calgary          1607
Charlotte        1430
Pittsburgh       1214
Mississauga       922
Cleveland         715
Scottsdale        684
Mesa              663
Madison           519
Tempe             511
Markham           479
Henderson         453
Chandler          448
Glendale          431
Brampton          392
Gilbert           331
Scarborough       317
Vaughan           298
Richmond Hill     277
North York        255
Concord           219
Peoria            214
Champaign         204
Etobicoke         203
Surprise          142
Gastonia          132
Avondale          128
Mentor            120
Goodyear          120
Akron             109
Matthews          107
Huntersville      100
Fort Mill          99
Lakewood           86
Name: city, dtype: int64

Say we focus on the first city we'll visit: Toronto. Since we are staying three nigths, we agreed on eating Mediterranean, Chinese and French food.

In [29]:
toronto = trip[trip['city']=='Toronto']
toronto['categories'].value_counts()
Out[29]:
Fast Food        907
Italian          580
Cafes            545
Chinese          401
Mediterranean    380
Japanese         349
Indian           223
Thai             173
Mexican          160
French            81
Steakhouses       54
Name: categories, dtype: int64

Let's try to see where are the 10 top places for each food type in Toronto!

In [30]:
#we order the restaurants considering the number of reviews first in order to select already 'famous' places,
# then order by 'stars'
#mediterranean restaurants
toronto_med_10 = toronto[toronto['categories']=='Mediterranean'].sort_values(['review_count','stars'], ascending=False).head(10)
med_latlon = toronto_med_10[['latitude','longitude']].values.tolist()
# print(med_latlon)

#Chinese restaurants
toronto_chin_10 = toronto[toronto['categories']=='Chinese'].sort_values(['review_count','stars'], ascending=False).head(10)
chin_latlon = toronto_chin_10[['latitude','longitude']].values.tolist()

#French restaurants
toronto_fr_10 = toronto[toronto['categories']=='French'].sort_values(['review_count','stars'], ascending=False).head(10)
fr_latlon = toronto_fr_10[['latitude','longitude']].values.tolist()

map = folium.Map(location=[43.651070, -79.347015], tiles='Stamen Toner', zoom_start=12)
for point in range(0, len(med_latlon)):
    folium.Marker(med_latlon[point], popup=toronto_med_10.iloc[point, 2], icon=folium.Icon(color='darkblue')).add_to(map)
    folium.Marker(chin_latlon[point], popup=toronto_chin_10.iloc[point, 2], icon=folium.Icon(color='red')).add_to(map)
    folium.Marker(fr_latlon[point], popup=toronto_fr_10.iloc[point, 2], icon=folium.Icon(color='green')).add_to(map)
map
Out[30]:

We can see the high concentration of Chinese restaurants on Spadina Avenue in Toronto. Mediterranean and French restaurants seem to be more dispersed around the city.

Let's have a look at the review distribution for each food category

In [31]:
toronto_selection = toronto[(toronto['categories']=='Mediterranean')
                            | (toronto['categories']=='French')
                            | (toronto['categories']=='Chinese')]
g = sns.FacetGrid(toronto_selection,col='categories',height=5)
sns.despine(bottom=True, left=True)
g.map(sns.kdeplot, 'stars', shade=True)
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x1a2b11c210>
In [32]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3, figsize=(12,5), sharey=True)
ax1.boxplot(toronto[toronto['categories']=='French']['stars'])
ax1.set_xlabel('French restaurants')
ax2.boxplot(toronto[toronto['categories']=='Mediterranean']['stars'])
ax2.set_xlabel('Mediterranean restaurants')
ax3.boxplot(toronto[toronto['categories']=='Chinese']['stars'])
ax3.set_xlabel('Chinese restaurants')
plt.show()

We can see that among these three categories, it seems that French restaurants benefit from better reviews (a majority of scores between 3.5 and 4.5), whereas Mediterranean and Chinese restaurants reviews are a bit more dispersed.

To read the review json file, we first tried to do the same as checkin and business but the command took too much time since the file is 6M lines long..
We found a solution from George Hou who's writing for TowardsDataScience.
We read it by chunks of 1M lines and use the documentation provided by Yelp to set the data type for each column.

In [33]:
# size=1000000
# review_json_path = 'yelp_academic_dataset_review.json'
# review = pd.read_json(review_json_path, lines=True,
#                       dtype={'review_id':str, 'user_id':str, 'business_id':str, 'stars':int, 'date':str, 'text':str, 'useful':int, 'funny':int, 'cool':int},
#                       chunksize=size)
In [34]:
# tip_json_path = 'yelp_academic_dataset_tip.json'
# tip = pd.read_json(tip_json_path, lines=True)
In [35]:
# tip.info()