Recruit Restaurant Visitor Forecasting

Lokesh Lokiee
14 min read · Jun 26, 2021

Predicting the number of visitors to a given restaurant on any given date

Business Problem:

Running a thriving local restaurant isn’t always as charming as first impressions appear. There are often all sorts of unexpected troubles popping up that could hurt business.

In this study we are going to address a major problem restaurant owners face: forecasting the number of visitors on any given day. This helps business owners plan their groceries, staffing, and more.

This forecast isn’t easy to make because many unpredictable factors affect restaurant attendance, like weather and local competition. It’s even harder for newer restaurants with little historical data.

Fortunately, Recruit Holdings has unique access to key datasets that could make automated future customer predictions possible. Specifically, Recruit Holdings owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point of sales service), and Restaurant Board (reservation log management software).

Why Machine Learning/Deep Learning?

There are a lot of factors that could potentially impact the number of visitors to a restaurant, so we need a machine learning model that predicts visitors while considering all the possible scenarios (rain, holidays, festivals, etc.) that could affect attendance.

In this problem we have data from AIR and HPG and must predict a count of visitors, which makes this a regression problem.

Source of Data:

We have data from two sites, which are described below:

  • Hot Pepper Gourmet (hpg): similar to Yelp, here users can search restaurants and also make a reservation online
  • AirREGI / Restaurant Board (air): similar to Square, a reservation control and cash register system

Please check out the figure below to understand the files from AIR and HPG and how they are linked.

Files from AIR and HPG, which are linked through store_id_relation.csv

AirREGI / Restaurant Board (air):

air_store_info.csv: This file contains information about select air restaurants. Column names and contents are self-explanatory.

air_reserve.csv: This file contains reservations made in the air system.

air_visit_data.csv: This file contains historical visit data for air restaurants.

Hot Pepper Gourmet (hpg):

hpg_store_info.csv: This file contains information about select hpg restaurants. Column names and contents are self-explanatory.

hpg_reserve.csv: This file contains reservations made in the hpg system.

Along with the AIR and HPG files, we have:

store_id_relation.csv: This file allows you to join select restaurants that have both the air and hpg system.

date_info.csv: This file gives basic information about the calendar dates in the dataset.

Metric:

We are going to use RMSLE (Root Mean Squared Logarithmic Error) as the evaluation metric. Here is the formula:
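```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}
```

where n is the total number of observations, p_i is the predicted number of visitors, and a_i is the actual number of visitors.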

Now, there is a reason why we are using RMSLE instead of RMSE (Root Mean Squared Error): RMSLE penalizes under-predictions more heavily than over-predictions. This helps restaurants avoid running out of food in the middle of the day.

The graphs below illustrate RMSLE vs RMSE:

Error (RMSE/RMSLE) plotted against the difference between the predicted and actual value

The graphs above illustrate how RMSE and RMSLE behave for over-predictions and under-predictions.

It is pretty clear that RMSE penalizes over-predictions and under-predictions by the same measure, while RMSLE gives a larger error when the model under-predicts and a smaller error when it over-predicts.
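As a quick numeric illustration of this asymmetry (the numbers below are made up for illustration, not taken from the dataset):

```python
import numpy as np

def rmse(actual, pred):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2))

def rmsle(actual, pred):
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))

actual = np.array([20.0])
# Under-predicting by 10 and over-predicting by 10 give the same RMSE (10.0),
# but RMSLE is roughly 0.65 for the under-prediction vs roughly 0.39 for the over-prediction.
print(rmse(actual, np.array([10.0])), rmsle(actual, np.array([10.0])))
print(rmse(actual, np.array([30.0])), rmsle(actual, np.array([30.0])))
```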

Now let's understand the data provided using some plots.

Exploratory Data Analysis:

Number of Visitors:

Let's look at air_visit_data.csv, which contains the visit date and number of visitors.

Boxplot of the number of visitors
Statistics on visitors

The mean number of visitors is about 20.

The minimum number of visitors to a restaurant is 1.

The box plot clearly shows some very high visitor counts, with a maximum value of 877.

These high values likely occur during the holiday season and do not reflect the trend for the whole year. They are treated as outliers and handled accordingly.

Total visitors vs. day
Total number of unique stores on a given day

With the help of the two plots above, we can conclude that the sudden spike in the number of visitors after July is due to the spike in the number of stores from July onward.

Average number of visitors based on weekday:

Plot gives average visitors on given weekday

The plot shows that restaurants tend to have the most visitors on Saturday.

Friday and Sunday also have a significant number of visitors.

Monday and Tuesday have the fewest visitors.

Wednesday and Thursday have almost the same visitor trends.

Average number of visitors monthly:

Average visitors on each month

Clearly, restaurants tend to have more visitors in December, as December is the holiday season.

After December, March is the month with the highest number of visitors.

Reservation Data:

In the plot we can see that both AIR and HPG have a spike in reservations in December.

The number of reservations in AIR is higher than in HPG.

Number of Reservations at given hour:

AIR Data
HPG Data

From the two plots above we can conclude that the evening is the busiest time for business.

Also, the AIR data contains a larger number of reservations.

There are almost no visitors between roughly 12:00 AM and 7:00 AM, likely because restaurants stay closed during the night.

Difference between reserve time and visit time:

x-axis: Difference between visit time and reservation time. y-axis: number of reservations

The plot shows that most reservations are made on the spot or 2 to 4 hours before the visit.

After that, we can observe a significant number of reservations made a day in advance, and the pattern continues.

Number of Restaurants w.r.t Genre:

Top 15 Genres

The plot shows that Izakaya is the top genre in the list, followed by Cafe/Sweets, Dining bar, etc.

Asian, Karaoke/Party, and International Cuisine have the fewest restaurants.

Surprisingly, Japanese food is only the 6th most popular genre in Japan.

Top 15 Areas with more number of restaurants:

Fukuoka-ken is the prefecture in Japan that contains the largest number of restaurants.

As Tokyo is a big city, it has many sub-prefectures, so the 2nd, 3rd, and 4th highest restaurant counts also belong to Tokyo.

After Tokyo, Osaka and Hiroshima take the next positions, as they are also big cities.

Latitude and Longitude:

As we found, Tokyo has the highest number of restaurants, followed by Osaka and Fukuoka.

Zoom into Tokyo

We can see that restaurants are spread across different areas.

Fukuoka is the place with the most restaurants concentrated in the same area.

Holiday Effect on Visitors:

Average number of visitors on Holiday and Non-Holiday

As observed from the plot, and as expected, there are more visitors on holidays than on working days.

The difference is still not very large, because weekends are counted as non-holidays in the dataset.

While processing the data, we must take into account holidays that fall on weekends; such days should be treated as weekends rather than holidays, so that the weekend effect is captured correctly.

Day After Holiday:

Average visitors After Holiday

We can clearly see that if the next day is a holiday, restaurants tend to have more visitors. This might be because people relax in the evening before a holiday.

Overall Observations:

  1. air_visit_data.csv is the main file, as it contains the number of visitors per day, which is what we are going to predict.
  2. This is a time-series dataset; the training data ranges from 2016-01-01 to 2017-04-22.
  3. The submission period ranges from 2017-04-23 to 2017-05-31.
  4. The training dataset contains 252,108 data points across 821 unique restaurants, with an average visitor count of roughly 20.
  5. On 75% of the days, the number of visitors to a restaurant is around 30 or fewer.
  6. In some cases restaurants have far more visitors (over 800); these occur during the holiday season and do not represent the rest of the year.
  7. The number of visitors increased drastically after July 2016, due to the increase in the number of AIR restaurants after July 2016.
  8. Weekends (Saturday and Sunday) tend to have more visitors; Friday also has a good number of visitors.
  9. Restaurants clearly have more visitors in December, as it is the holiday season.
  10. After December, March has the most visitors.
  11. Within a day, restaurants are expected to have the most visitors in the evening, from 6 PM to 8 PM.
  12. From 1 AM to 10 AM there are almost no visitors.
  13. The AIR dataset has more reservations than HPG.
  14. In December, restaurants are expected to have more reservations.
  15. Most visitors make spot reservations or reserve 2 to 5 hours in advance; a significant number also reserve one or two days before.
  16. Izakaya is the most popular genre in Japan, with almost 23.8% of restaurants belonging to the Izakaya genre.
  17. The second most popular genre in Japan is Cafe/Sweets, with almost 21.8% of the restaurant market share.
  18. International cuisine, Asian, and Karaoke/Party are the least common genres, each with only about 0.2% market share.
  19. Fukuoka-ken is the prefecture in Japan with the largest number of restaurants.
  20. Tokyo holds the 2nd, 3rd, and 4th places in number of restaurants, as it spans multiple sub-prefectures.
  21. After Tokyo, Osaka and Hiroshima have the most restaurants.
  22. As expected, there are more visitors on holidays than on working days.
  23. Along with that, the day before a holiday also has more visitors.

Existing Approaches:

6th Place Kaggle Solution (Team: Yunfeng and Ankit):

  1. Apart from the dataset given in the competition, Weather Data for Recruit Restaurant Competition is also used.
  2. From the calendar information, a feature called hour_gap is used, which gives the gap between reserving a restaurant and visiting it, in hours; it is further subdivided into 5 categories based on gap length (<12 hours, 12 to 37 hours, 37 to 59 hours, 59 to 86 hours, and greater than 85 hours).
  3. The average, median, max, and min visitors per restaurant are taken into consideration separately for working days and non-working days.
  4. The area-wise total count of restaurants is also calculated.
  5. From the weather information, temperature and precipitation are used, with temperature subdivided into low, average, and high.
  6. The weekday-wise mean visitor count for all 7 days is calculated across all restaurants.
  7. The month-wise mean visitor count for all 12 months is also calculated for each restaurant.
  8. XGBoost (an optimized implementation of GBDT) is trained as the final model with a total of 5000 epochs (boosting rounds). During prediction, the log of the visitors is predicted.

8th Place Kaggle Solution (Team: Max Halford):

  1. Apart from the dataset given in the competition, Weather Data for Recruit Restaurant Competition is also used.
  2. Instead of taking the visit data as is, it is resampled by day so that days with no data points (no visitors) get a visit count of 0, while keeping track of whether a data point was added due to resampling.
  3. From the calendar information, apart from the 'day of the week' and 'is weekend', two additional features indicating whether the previous or the next day is a holiday are also used.
  4. As for the store information, the preprocessed version from the weather dataset is used instead of the "official" Kaggle version, because the preprocessed version contains the weather station data needed to join the weather features.
  5. From the weather information, only precipitation and temperature features are used; missing values are replaced with the global daily average.
  6. Treating visits as normally distributed, 2.4 is taken as a high quantile of the normal distribution, and any value more than 2.4 standard deviations away from the mean is considered an outlier. A new feature called visitors_capped is created in which outlier values are replaced with the maximum of the non-outlier values (a sketch of this capping is shown after this list).
  7. A day-of-the-month feature is also used, which is quite interesting because it can be seen as a proxy for when people get paid during the month, supposing they are paid monthly.
  8. Exponentially weighted means (EWM) are a way to capture the trend of a time series; EWMs of various numeric features are used, with alpha values optimized using differential evolution.
  9. The mean, median, standard deviation, count, minimum, and maximum are also computed by grouping the data by visit date.
  10. All the categorical variables are label encoded.
  11. A few columns in the dataset are dropped because they are not useful. During prediction, the log of the visitors is predicted, and the exponential function is applied to the prediction to return to the original magnitude (this works because exp(log(x)) = x). The log transform also helps a decision tree pack values into a leaf, because the values are "closer" to each other. In total, 98 final features are used to train the final model.
  12. LightGBM (tree-based gradient boosting) is used as the final model: 6 models are trained on random samples of the dataset, where sampling is done without replacement (known as pasting). The average of the 6 models' predictions is taken as the final prediction. Hyperparameters are tuned by hand, without grid search or random search.
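To make the visitors_capped idea from point 6 concrete, here is a minimal sketch, assuming a per-store grouping on the AIR visit log (the 8th place write-up does not show the exact code, so the grouping and column names here are illustrative):

```python
import pandas as pd

def cap_outliers(series: pd.Series, z: float = 2.4) -> pd.Series:
    """Replace values more than z standard deviations above the mean
    with the maximum of the remaining (non-outlier) values."""
    is_outlier = series > series.mean() + z * series.std()
    if is_outlier.any() and not is_outlier.all():
        series = series.where(~is_outlier, series[~is_outlier].max())
    return series

visits = pd.read_csv("air_visit_data.csv")
visits["visitors_capped"] = (
    visits.groupby("air_store_id")["visitors"].transform(cap_outliers)
)
```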

Improvements:

IQR (Inter-Quartile Range) to Identify Outliers:

Some restaurants have more than 800 visitors on certain days, which are clearly outliers.

So, we use the IQR (Inter-Quartile Range) to replace extreme outliers with the maximum IQR value (the upper whisker of the box plot).

Outlier Detection using IQR
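The post shows this step as a code screenshot; here is a minimal sketch of the idea, assuming the standard upper whisker Q3 + 1.5 * IQR as the "maximum IQR value" and capping globally (per-store capping would follow the same pattern):

```python
import pandas as pd

def cap_with_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values above the upper IQR whisker (Q3 + k * IQR)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return series.clip(upper=upper)

visits = pd.read_csv("air_visit_data.csv")
visits["visitors"] = cap_with_iqr(visits["visitors"])
```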

Clustering based on Latitude and Longitude:

Clustering data based on latitude and longitude using MiniBatchKMeans.

First we need to find the optimal k using the elbow method.

The resulting graph is shown below:

Error vs. k to find the optimal k value

With the help of the elbow method, we found the optimal k to be 30.

Now we cluster the data based on latitude and longitude with k = 30.

The features we are going to extract are the distance from a restaurant to the center of the cluster it belongs to, and the number of restaurants in that cluster.

Features Based on Latitude and Longitude
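The clustering and feature-extraction code appears as a screenshot in the original post; here is a minimal sketch of the same idea, assuming the column names from air_store_info.csv and plain Euclidean distance in latitude/longitude degrees (not kilometres):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

stores = pd.read_csv("air_store_info.csv")
coords = stores[["latitude", "longitude"]].values

# Elbow method: inertia (within-cluster error) for a range of k values
inertias = {k: MiniBatchKMeans(n_clusters=k, random_state=42).fit(coords).inertia_
            for k in range(2, 51, 2)}

# Final clustering with the k chosen from the elbow plot (k = 30 in the post)
km = MiniBatchKMeans(n_clusters=30, random_state=42).fit(coords)
stores["cluster"] = km.labels_

# Feature 1: distance from each restaurant to its cluster center
centers = km.cluster_centers_[stores["cluster"].to_numpy()]
stores["dist_to_center"] = np.linalg.norm(coords - centers, axis=1)

# Feature 2: number of restaurants in the same cluster
stores["cluster_size"] = stores.groupby("cluster")["cluster"].transform("count")
```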

Feature Selection using Recursive Feature Elimination:

Using RFECV, we will select a subset of features from the full feature set, as unimportant features can hurt the predictions of machine learning models.

Finding the optimal number of features to select using Recursive Feature Elimination with Cross-Validation:

Features with rank 1 are plotted in the graph below:

Rank vs. Features

For modeling, we selected only the features with rank 1.
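A minimal RFECV sketch, assuming X_train is a DataFrame of the engineered features and y_train is the log-transformed visitor count; the estimator and cross-validation scheme here are illustrative rather than the exact ones used in the post:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=42),
    step=1,                        # drop one feature per elimination round
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
selector.fit(X_train, y_train)

# Keep only the features the selector ranks 1
selected_cols = X_train.columns[selector.support_]
X_train_selected = X_train[selected_cols]
```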

First Cut Approach:

  1. As our problem is based on time-series forecasting, we have to come up with time-based features.
  2. First of all, we need to replace outliers using the IQR method, capping them at the maximum IQR value.
  3. We also apply a logarithm to the visitor counts, because the logarithm smooths the target so that daily visitor counts are closer to each other. (Later we apply exp to the predictions to recover the actual visitor counts.)
  4. Using the area name, we extract the number of restaurants in a given area.
  5. Using the genre, we extract the number of restaurants of the same genre in the area.
  6. The reservation data contains the reservation time and visit time, from which we calculate the hour gap between reservation and visit and then bucket it by gap duration.
  7. A holiday flag feature is extracted: if the day is a Saturday/Sunday or a holiday in Japan, the value is 1, else 0. (A sketch of this flag, the log transform, and the hour-gap bucketing appears after this list.)
  8. Along with that, we also extract features representing the day before a holiday and the day after a holiday.
  9. Using the weather data, we extract the average temperature, low temperature, high temperature, and precipitation for a given day.
  10. The average visitors for each month are calculated for each store.
  11. We calculate visitor statistics (mean/median/min/max/count) per store for working/non-working days and for each day of the week, which captures visitor behaviour on holidays and on any given weekday.
  12. Using the area data, we extract the prefecture and sub-prefecture.
  13. Using the reservation data, we extract the number of reservations made for each day.
  14. Using latitude and longitude, we cluster the data and extract 2 features: the distance to the cluster center and the number of restaurants in the cluster.
  15. The final features are selected using the Recursive Feature Elimination technique.
  16. We apply one-hot encoding to the area, genre, and day-of-the-week features.
  17. For machine learning models, we apply Linear Regression, Random Forest Regressor, and XGBoost.
  18. For deep learning, we try an LSTM.
  19. Apart from that, we try autoencoders to reduce the dimensionality and apply machine learning techniques on top of the encoded data.
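To make points 3, 6, and 7 concrete, here is a minimal sketch using the column names from the Kaggle files (visit_date, visitors, calendar_date, day_of_week, holiday_flg, reserve_datetime, visit_datetime); the hour-gap bucket boundaries below are illustrative, not the exact ones used for the final features:

```python
import numpy as np
import pandas as pd

visits = pd.read_csv("air_visit_data.csv", parse_dates=["visit_date"])
dates = pd.read_csv("date_info.csv", parse_dates=["calendar_date"])
reserve = pd.read_csv("air_reserve.csv",
                      parse_dates=["visit_datetime", "reserve_datetime"])

# Point 3: log-transform the target (exp/expm1 is applied to predictions later)
visits["visitors_log"] = np.log1p(visits["visitors"])

# Point 7: holiday flag = 1 if the day is a Saturday/Sunday or a Japanese holiday
dates["is_holiday"] = ((dates["holiday_flg"] == 1) |
                       dates["day_of_week"].isin(["Saturday", "Sunday"])).astype(int)
visits = visits.merge(dates, left_on="visit_date",
                      right_on="calendar_date", how="left")

# Point 6: hour gap between reservation and visit, bucketed by duration
gap_hours = (reserve["visit_datetime"]
             - reserve["reserve_datetime"]).dt.total_seconds() / 3600
reserve["hour_gap_bucket"] = pd.cut(gap_hours, bins=[0, 12, 24, 72, np.inf],
                                    labels=["<12h", "12-24h", "1-3d", ">3d"])
```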

Machine Learning Models:

Linear Regression:

  1. We have to standardize the data for linear regression using StandardScaler().
  2. Using GridSearchCV, we find the best alpha (a sketch is shown below).
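The tuning of an alpha parameter suggests a regularized linear model; here is a minimal sketch assuming Ridge regression (the post shows this step only as a screenshot), with X_train/X_val as the selected features and y_train as the log-transformed visitor counts:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),   # standardize the features
    ("model", Ridge()),
])

param_grid = {"model__alpha": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, y_train)        # y_train = log1p(visitors)

print(search.best_params_)
val_pred = search.predict(X_val)    # still in log space; apply np.expm1 for visitors
```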

Validation Data RMSLE we got: 0.5168

Kaggle Submission:

Random Forest Regressor:

  1. Used RandomizedSearchCV for hyperparameter tuning (a sketch is shown below).
  2. The best estimator was at n_estimators=100 and max_depth=100.
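A minimal sketch of the random-search tuning, assuming a parameter grid that includes the reported best values (the exact search space is not shown in the post):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [10, 50, 100, None],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)   # y_train is log1p(visitors)
print(search.best_params_)     # the post reports n_estimators=100, max_depth=100
```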

Validation Data RMSLE we got: 0.52889

Kaggle Submission:

XGBoost Regressor:

  1. Using RandomizedSearchCV, we found the best parameters.
  2. Trained the XGBoost model with the best parameters (a sketch is shown below).
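A minimal XGBoost sketch along the same lines; the parameter ranges below are illustrative assumptions, since the post shows this step only as a screenshot:

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

param_dist = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}
search = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror", n_jobs=-1, random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)            # y_train is log1p(visitors)
best_xgb = search.best_estimator_
val_pred = best_xgb.predict(X_val)      # apply np.expm1 to recover visitor counts
```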

Validation Data RMSLE we got: 0.47820

Kaggle Submission:

Best Score of all models

Deep Learning Models:

Autoencoders for Feature Extraction:

  1. We extracted features using autoencoders and applied machine learning models on top of them.
  2. We saved the encoder model as encoder.h5, which we can reuse to reduce the dimensions (a sketch of such an encoder is shown after this list).
  3. We applied Linear Regression, Random Forest, and XGBoost on the encoded data.
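The encoder code appears as a screenshot in the original post; here is a minimal Keras sketch of the same idea, where the layer sizes and the 32-dimensional bottleneck are illustrative assumptions and X_train/X_val are the standardized feature matrices:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train.shape[1]
encoding_dim = 32                  # illustrative bottleneck size

inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(encoding_dim, activation="relu")(encoded)
decoded = layers.Dense(64, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="linear")(decoded)

autoencoder = keras.Model(inputs, decoded)   # trained to reconstruct the input
encoder = keras.Model(inputs, encoded)       # reused to reduce the dimensions

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=20, batch_size=256,
                validation_data=(X_val, X_val))

encoder.save("encoder.h5")
X_train_encoded = encoder.predict(X_train)   # features for the ML models
```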

Linear Regression Validation Data RMSLE: 0.51986

Random Forest Validation Data RMSLE: 0.50909

XGBoost Regressor Validation Data RMSLE: 0.51132

None of the models performed as well as they did when we applied the ML models on the full feature set.

LSTM(Long Short Term Memory):

  1. We build the deep learning model using 2 LSTM layers (a sketch is shown after this list).
  2. We used the ReLU activation function in all the layers.
  3. The optimizer is Adam with an initial learning rate of 0.01, which is reduced over time based on the validation RMSLE.
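The LSTM code appears as a screenshot in the original post; here is a minimal Keras sketch under the stated settings (2 LSTM layers, ReLU activations, Adam with an initial learning rate of 0.01 that decays on a plateau). The layer sizes and input reshaping are illustrative assumptions, and the validation loss here is the MSE of the log-transformed target, standing in for the validation RMSLE:

```python
from tensorflow import keras
from tensorflow.keras import layers

# X_train is assumed to be reshaped to (samples, timesteps, features)
timesteps, n_features = X_train.shape[1], X_train.shape[2]

model = keras.Sequential([
    layers.LSTM(64, activation="relu", return_sequences=True,
                input_shape=(timesteps, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1),                      # predicts log1p(visitors)
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")

# Reduce the learning rate when the validation loss stops improving
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                              factor=0.5, patience=2)

model.fit(X_train, y_train, epochs=30, batch_size=256,
          validation_data=(X_val, y_val), callbacks=[reduce_lr])
```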

Validation Data RMSLE we got: 1.07292

Kaggle Submission:

Comparison of Models:

Future Work:

  1. As this is time-series data, it would be better to also consider rolling statistics of the visitor counts.
  2. We can also compute exponentially weighted averages of the visitor counts as features for prediction.
  3. These two features are worth exploring (a sketch is shown below).
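A minimal sketch of both ideas with pandas, assuming per-store grouping and a one-day shift so the features only use past information; the window size and alpha are illustrative:

```python
import pandas as pd

visits = pd.read_csv("air_visit_data.csv", parse_dates=["visit_date"])
visits = visits.sort_values(["air_store_id", "visit_date"])

grouped = visits.groupby("air_store_id")["visitors"]

# Rolling statistics over the previous 7 days (shifted to avoid target leakage)
visits["visitors_roll7_mean"] = grouped.transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).mean())

# Exponentially weighted mean of past visitor counts
visits["visitors_ewm"] = grouped.transform(
    lambda s: s.shift(1).ewm(alpha=0.1).mean())
```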

Feel free to connect with me on LinkedIn or GitHub.
