The battle of neighborhoods in the city of Toronto

Feb 14, 2021

by Tomas Stankevičius, 2021

Introduction / Business Problem

A company wants to start a restaurant business and is trying to decide which area of Toronto would be the right spot. The area should have the greatest potential for developing a restaurant business, and the location should help the company reach as wide a customer audience as possible so that the business is profitable. As the company has no knowledge of Toronto’s districts and neighborhoods, it has decided to hire a business consultancy that provides various services to companies wanting to establish a new business in the Toronto area. The consultancy, using Data Science methods, is going to cluster the Toronto city area and recommend the best place to start a restaurant business.

As there are already a lot of restaurants in Toronto, we will also take every district’s population and population density into consideration. We are especially interested in districts located in Toronto Downtown, as tourists can generate additional revenue.

Explanation about Data

The consultancy is going to use Foursquare location data services to build a database of venues in every neighborhood of Toronto. This database will then be enriched with additional data provided by the Toronto Open Data portal.

First, we gather geospatial information about every neighborhood in Toronto from the Toronto Open Data portal: https://open.toronto.ca/dataset/neighbourhoods/

Next, we collect population and density information about every neighborhood in Toronto from the page https://open.toronto.ca/dataset/neighbourhood-profiles/

Lastly, the Foursquare API explore endpoint provides a list of recommended venues near a given location. We are going to use this endpoint to build our venues database. The Folium package provides convenient map-building functionality, so we will use it to visualize the results.
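As a rough sketch of how a request to the explore endpoint could be assembled and its response flattened into rows, assuming the legacy Foursquare v2 `venues/explore` URL and its usual nesting of `response`, `groups`, `items` and `venue`. The credentials, radius and limit are placeholders, the helper names are hypothetical, and the sample payload is hand-made for illustration rather than real API output.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20210101", radius=500, limit=100):
    """Build the query URL for venues around one neighborhood centre."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",
        "radius": radius,                # metres (assumed value)
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_venues(payload, neighborhood):
    """Flatten a response into (neighborhood, venue, lat, lng, category) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories", [])
            category = categories[0]["name"] if categories else None
            rows.append((neighborhood, venue["name"],
                         venue["location"]["lat"], venue["location"]["lng"],
                         category))
    return rows


# Hand-made payload that mimics the assumed response layout.
sample = {"response": {"groups": [{"items": [{"venue": {
    "name": "Sample Cafe",
    "location": {"lat": 43.65, "lng": -79.38},
    "categories": [{"name": "Coffee Shop"}],
}}]}]}}

url = build_explore_url(43.6532, -79.3832, "CLIENT_ID", "CLIENT_SECRET")
rows = parse_venues(sample, "Example Neighborhood")
```

In a real run the URL would be fetched once per neighborhood and the rows from all neighborhoods collected into a single pandas data frame.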

The first 5 rows of the processed and combined data from the Toronto Open Data portal, in a pandas data frame, look like this:

Using the Folium package, we can plot those neighborhoods on a map of Toronto to get a better understanding of their locations:

A fragment of an example response from the Foursquare API explore endpoint looks like this:

Methodology

As mentioned, the goal of this project is to find an area of Toronto with a high population and sufficient density for a newly established restaurant to attract enough customers from the very beginning.

First, we collected and processed the required data: geospatial data for Toronto neighborhoods, as well as the population and density of those neighborhoods. Using the Foursquare API, we built a categorized database of venues located in every Toronto district.

The next step is to analyze the gathered venue data, then process and scale it, in order to build a Machine Learning model based on the unsupervised K-Means algorithm. K-Means lets us cluster districts according to their features: districts within a cluster are similar to each other, while districts in different clusters differ.

We will build several graphs and maps to better understand the K-Means results and to help us draw final conclusions.

Let’s sort the venue categories in our data frame by frequency. As we can see, the top 10 most popular venue categories are: Coffee Shop, Cafe, Park, Restaurant, Pizza Place, Sandwich Place, Italian Restaurant, Bakery, Bar, Grocery Store.
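A ranking like this can be produced with a frequency count. The snippet below is a minimal sketch on made-up rows; the frame name `toronto_venues` and the column names are assumptions, not necessarily those used in the notebook.

```python
import pandas as pd

# Toy stand-in for the venues data frame (values are illustrative only).
toronto_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop",
                       "Cafe", "Coffee Shop", "Cafe"],
})

# Count how often each category occurs and keep the 10 most frequent.
top_categories = toronto_venues["Venue Category"].value_counts().head(10)
print(top_categories)
```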

So all of these categories except Park are related to food. That’s quite a promising result for our further analysis.

Next, we perform one-hot encoding on the Toronto venues data frame and calculate the average frequency of every category in every neighborhood. The first 5 rows of the resulting data frame are shown below.
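To make the step concrete, here is a small sketch of one-hot encoding followed by a per-neighborhood mean, again on made-up rows with assumed frame and column names. After the mean, each cell is the share of a neighborhood’s venues that fall into that category, so every row sums to 1.

```python
import pandas as pd

# Toy stand-in for the venues data frame (values are illustrative only).
toronto_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop", "Cafe", "Coffee Shop"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(toronto_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toronto_venues["Neighborhood"])

# Averaging the indicator columns gives the category frequency per neighborhood.
toronto_grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(toronto_grouped)
```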

Next, we create a pandas data frame of the 10 most popular venues in each neighborhood. Here are the first 5 rows.
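One possible way to build such a table is to sort each neighborhood’s category frequencies and keep the leading names. The sketch below uses made-up frequencies and a top 2 instead of a top 10 to stay short; the frame and column names are assumptions.

```python
import pandas as pd

# Toy category frequencies per neighborhood (values are illustrative only).
toronto_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Cafe": [0.0, 0.5],
    "Coffee Shop": [0.6, 0.3],
    "Park": [0.4, 0.2],
})

top_n = 2  # the article uses 10
columns = [f"Most Common Venue {i + 1}" for i in range(top_n)]

freqs = toronto_grouped.set_index("Neighborhood")
# For every neighborhood, order categories by frequency and keep the first top_n names.
ranked = pd.DataFrame(
    [freqs.loc[name].sort_values(ascending=False).index[:top_n].tolist()
     for name in freqs.index],
    index=freqs.index,
    columns=columns,
).reset_index()
print(ranked)
```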

After that, we need to prepare the data for the K-Means clustering algorithm. Population and population density must be scaled for the model to be accurate, so we use the StandardScaler class from the sklearn library.
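Scaling matters here because K-Means relies on distances, and a raw population in the tens of thousands would otherwise dominate category frequencies that lie between 0 and 1. StandardScaler replaces each value x with (x - mean) / standard deviation of its column. A minimal sketch on made-up numbers, with assumed column names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy population figures (values are illustrative only).
features = pd.DataFrame({
    "Population": [12000, 25000, 31000, 18000],
    "Population density": [3500.0, 9000.0, 4200.0, 15000.0],
})

# Fit on the two columns and transform them to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(features[["Population", "Population density"]])

print(scaled.mean(axis=0))  # approximately 0 for both columns
print(scaled.std(axis=0))   # approximately 1 for both columns
```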

At first we pick an arbitrary number of clusters, as we don’t yet know the best value. Let’s set the number of clusters to 3 and fit a model on the prepared data. After that, let’s find the best value of k for our model.

As we see from the graph, the distortion (inertia) falls quite rapidly until the number of clusters reaches 5. So let’s choose k = 5, as it is the best fit for our model.
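This way of choosing k is often called the elbow method: fit K-Means for a range of k values, record the inertia (the sum of squared distances from each point to its cluster centre), and look for the point where the curve flattens. The sketch below runs on synthetic points with three obvious groups, not on the Toronto data, so its elbow sits at 3 rather than 5.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (not the Toronto features).
rng = np.random.default_rng(0)
centres = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in centres])

# Fit one model per candidate k and store its inertia.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k (for example with matplotlib) gives the kind of graph described above.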

Clustering Analysis

Now let’s prepare a final data frame for visualization and further analysis. We merge two pandas data frames: the Toronto population data and the Toronto venues data, which includes the cluster labels.
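A merge of this kind could look roughly as follows. The frames, column names and values are made up for illustration, and the choice of an inner join (which drops neighborhoods that have no venue rows) is an assumption rather than something the article states.

```python
import pandas as pd

# Toy population data (values are illustrative only).
population = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Population": [12000, 25000, 31000],
    "Population density": [3500.0, 9000.0, 4200.0],
})

# Toy venues summary with cluster labels; neighborhood C has no venues.
venues = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Cluster Labels": [0, 2],
    "Most Common Venue 1": ["Coffee Shop", "Italian Restaurant"],
})

# Join the two frames on the shared neighborhood name.
toronto_merged = population.merge(venues, on="Neighborhood", how="inner")
print(toronto_merged)
```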

Let’s visualize the results using the Folium package. The clusters are marked on the map, with each color representing the cluster a neighborhood belongs to.

We check how many districts there are in each cluster. The results are below.

As we can see, the largest cluster is number 0, with almost 80 neighborhoods.

Let’s check the average population and density in every cluster.
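Both the cluster sizes and the per-cluster averages can be obtained from a single group-by. A minimal sketch on made-up values, with assumed column names:

```python
import pandas as pd

# Toy merged frame (values are illustrative only).
toronto_merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Population": [10000, 20000, 30000, 15000],
    "Population density": [2000.0, 4000.0, 8000.0, 12000.0],
    "Cluster Labels": [0, 0, 2, 4],
})

# Number of districts plus mean population and density for each cluster.
summary = toronto_merged.groupby("Cluster Labels").agg(
    districts=("Neighborhood", "count"),
    mean_population=("Population", "mean"),
    mean_density=("Population density", "mean"),
)
print(summary)
```

A cluster with a single district, like cluster 4 in this toy data, simply reports that district’s own values as its averages.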

We see an interesting distribution of population and density. Although Cluster 3 has the highest population, its density is one of the lowest. Cluster 4 has the highest population density, but ranks only 4th by number of inhabitants. This is because Cluster 4 contains only one district.

Let’s examine every cluster separately.

Cluster 0 is the largest cluster. It contains districts with medium-sized, low-density populations. Because of this and the districts’ proximity to Toronto Downtown, this cluster is not the best candidate. Also, if we look at the top venues, restaurants and similar food service places are not among the most popular.

Neighborhoods in Cluster 1 have high populations but low density. From their locations on the Toronto map, we see that they are mainly residential areas with houses and cottages. Although restaurant-related venues are still popular, the low density and the districts’ distance from Toronto Downtown make this cluster less promising.

Cluster 2 contains 9 districts. As we see from the table, every neighborhood has a high population and sufficiently high density. The top 3 most common venues consist mainly of restaurants or recreational places such as parks and beaches. Cluster 2 looks very promising, as it offers a good balance of population, density and variety of venues.

Cluster 3 has a sufficiently high population but low density. From the cluster map we see that the majority of Cluster 3 districts are located quite far from Toronto Downtown.

We can call Cluster 4 an outlier cluster, as it contains only one neighborhood. This cluster has a sufficiently high population and high density, and restaurants are among its most popular types of venues. It’s quite a promising candidate for our analysis.

Now let’s see how every cluster looks on a population map. The Folium package provides convenient functionality for building this kind of map, which helps us better understand the problem and draw correct conclusions.

Let’s also see how every cluster looks on a Toronto population density map.

Results and Discussion

Our clustering analysis shows that, taking into consideration population density, total neighborhood population, distance to Toronto Downtown and the distribution of the top 10 venue categories in every district, the majority of Cluster 2 districts meet these criteria. Cluster 4, which contains only one district, can also be included among the potential areas for a new restaurant.

We see that some districts from Cluster 0 also meet our criteria. Assessing them would require a narrower, more focused analysis, which is out of the scope of this project.

Conclusion

The goal of this project was to identify Toronto districts which are close to Downtown and have a high population as well as high density. Using the Foursquare API, we took the most popular venue types in Toronto districts into consideration to narrow our results.

Our Machine Learning clustering model, built on the combined data, produced 5 clusters, of which Clusters 2 and 4 look the most promising.

For a final, well-founded decision, further analysis of those clusters with more variables is required. Additional factors the company should take into consideration include the type and number of tourist attractions and their annual visitor numbers in each neighborhood, and the attractiveness of each neighborhood based on e.g. residents’ income level, noise levels, social and economic factors, the crime situation, etc.

The full report and Python code in a Jupyter Notebook can be found in my GitHub repo: https://github.com/tomwrx/Coursera_Capstone/