In this notebook, we'll further explore data on daily ridership for train stations in the Tokyo metro region in 2008 to answer our challenge question-- how can this data (hypothetically) be used to help the Zunda Saryo business in opening a new location in Tokyo?

Note: The data contains information about the mode of transport that riders used to get to the station and when leaving the station. The dataset was downloaded here, and station geometry GIS lines have been kindly converted to a single latitude/longitude point.

Let's first load any useful packages and our main dataset (cleaned-up during our initial data exploration):

library(tidyverse)
library(readxl)
riders_ldf <- read.csv('riders_major_long_data_frame.csv')
riders_ldf %>% head()

More data exploration

First things first. We might be able to focus on walkers or taxi-goers as our target market (e.g. perhaps they would be the ones most likely frequenting Zunda Saryo since they'd have free hands to carry a drink or merchandise and would physically pass by the storefront).

Is this a reasonable strategy? How many riders do these make up?

walk_taxi <- riders_ldf %>% filter(transport %in% c("walk","taxi") & direction=="disembarking" & riders>0) %>% group_by(station.code,station.name,longitude,latitude) %>% summarise(total_riders=sum(riders)) %>% arrange(total_riders)
## `summarise()` regrouping output by 'station.code', 'station.name', 'longitude' (override with `.groups` argument)
walk_taxi

Trying to map the data

Since there is longitude and latitude information, it'd be a shame not to try and see what this looks like on a map. This might help with making an informed decision on the new location. I've never done this before in R, but let's see if we can do that.

Trying ggmap

(The logistics are a little more involved but also simpler than I thought-- we need to sign up for a Google cloud platform trial to get the Google API key to overlay our map on top of Google Maps... I followed the instructions here.)

library(ggmap) # registered the API with command ggmap::register_google()
map <- ggmap(get_googlemap(center = c(lon = -122.335167, lat = 47.608013),
                    zoom = 11, scale = 2,
                    maptype ='terrain',
                    color = 'color'))

Oops! False start-- Google's API doesn't work, probably because my free trial expired and it wants me to enter billing information. Let's try openstreetmaps instead.

map <- get_map(get_openstreetmap(bbox=c(35.7979,139.6065,35.5321,139.9050))) 

Nope, it's just not our day. One more time.

map <- get_stamenmap(bbox=c(35.53,139.60,35.79,139.90),zoom=10,maptype="toner-lite")

Apparently it's not just me with these problems. The CRAN version of ggmap is outdated, so we'll install again via github: if(!requireNamespace("devtools")) install.packages("devtools") devtools::install_github("dkahle/ggmap").

OOPS EVERYTHING CRASHED!

...Hm. I guess this is what I get for having a version of R on my laptop that is not compatible with my operating system. (Note to future self: This is the point at which I left this alone for a few weeks, while packing, moving, and took care of study/work/life obligations.)

Trying to download from OSM

Picking this back up, how about trying with a GIS software, instead of trying to wrangle with R software incompatibilities on my laptop? This is more of an old-school academic method, but maybe it will work.

We can download the data from Open Street Map manually. To do this, we first check lat/lon of our data, so that we can figure out how much of the open street map to download.

summary(riders_ldf$longitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   139.1   139.6   139.7   139.7   139.8   140.4
summary(riders_ldf$latitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.21   35.61   35.68   35.68   35.75   36.24

Now try downloading the files from the internet.

...Oops, again! The area from Open Street Maps containing all our data is too big to download!

Success! Using Leaflet

Pivoting our strategy ONE MORE TIME. Let's try leaflet, an interactive map Javascript package that just so happens to have a library in R.

library(leaflet)

With Leaflet, you can view a map and set it to centre at a specific location. Leaflet has map tiles that default to the Open Street Map layers.

m <- leaflet() %>% setView(lng = 139.7, lat = 35.68, zoom = 12)
m %>% addTiles()

We can also customize the map tiles, like so. This greyscale, simple version will be easier on the eyes when we add in our data:

m <- leaflet(options = leafletOptions(minZoom = 9)) %>% setView(lng = 139.7, lat = 35.68, zoom = 10) %>% addProviderTiles(providers$CartoDB.Positron) 
m

Then, we add our ridership data! Here, the radius of the circles corresponds to the number of total riders exiting the train station by foot or by taxi, and overlay it with existing locations:

m %>% addCircles(data=walk_taxi, lng = ~longitude, lat = ~latitude, weight = 1,
    radius = ~total_riders/100, popup = ~station.name
  ) %>% 
  addMarkers(
    lng=139.768800, lat=35.685928, # from a google map search
    label='Tokyo location',
    labelOptions = labelOptions(noHide = F)) %>%
  addMarkers(
    lng=139.700468, lat=35.659226, 
    label='Shibuya location',
    labelOptions = labelOptions(noHide = F)) %>%
  addMarkers(
    lng=139.769484, lat=35.552608, 
    label='Haneda airport location',
    labelOptions = labelOptions(noHide = F)) 

Now this is getting somewhere!

Testing calculations

Calculating distance between points

Now, we need to see if we can, given a lat/lon of a point, calculate how close it is to the closest station, and how many stations (and later, number of people exiting/entering those stations) fall within a certain radius of the point.

I've tried and failed to install sf, geosphere, and all other R packages that people tend to use to deal with spatial data, so let's just figure out a different solution.

I borrowed the following function from Stack Overflow. This finds the distance between two points, given a lat/lon for each, in kilometres.

earth.dist<-function(lat1,long1,lat2,long2){
           rad <- pi/180
           a1 <- lat1 * rad
           a2 <- long1 * rad
           b1 <- lat2 * rad
           b2 <- long2 * rad
           dlat <- b1-a1
           dlon<- b2-a2
           a <- (sin(dlat/2))^2 +cos(a1)*cos(b1)*(sin(dlon/2))^2
           c <- 2*atan2(sqrt(a),sqrt(1-a))
           R <- 6378.145
           dist <- R *c
           return(dist)
           }

Now, we see if we can find the closest station to our point. First, for each station, calculate the distance from our point. (We generate this point arbitrarily by just using the mean of the lat's and long's.)

point_lat <- mean(riders_ldf$latitude) # arbitrary pick-- just because we know this point is within our data points
point_lon <- mean(riders_ldf$longitude) # arbitrary pick of longitude

# Use the two columns in riders_ldf to create a new column with the distance from that station to our point above
riders_ldf <- riders_ldf %>% mutate(dist_to_point = unlist(map2(latitude,longitude,earth.dist,point_lat,point_lon)))

summary(riders_ldf$dist_to_point)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4779  8.6435 16.9290 20.1211 29.0396 76.6015

Now we can filter by distance. Here we take the station closest to the point.

riders_ldf %>% filter(dist_to_point==min(dist_to_point))

Finding the closest stations and number of people

We can also filter by a certain radius. Let's try finding all the stations within 1 kilometre of the point we chose.

riders_ldf %>% filter(dist_to_point<1) %>% select(station.code,station.name,longitude,latitude,dist_to_point) %>% unique()

Perfect. Now how about number of pedestrians that exit within 1 kilometre of the point?

(walk_within_1km <- riders_ldf %>% filter(dist_to_point<1,transport=="walk",direction=="disembarking"))
sum(walk_within_1km$riders)
## [1] 72414

We can use these calculations to expand to any point on the map, to find the nearest stations and the potential number of people travelling through the area.

Actually, why don't we try that? Generate a bunch of points within the vicinity and maybe generate a heat map with the result?

Mapping number of people disembarking: "walk metric"

# Generate a range of latitudes, longitudes, then all the combinations of the two
lat_range <- seq(from=min(riders_ldf$latitude),to=max(riders_ldf$latitude),length.out =30)
lon_range <- seq(from=min(riders_ldf$longitude),to=max(riders_ldf$longitude),length.out = 30)

points <- as.matrix(expand.grid(lat_range,lon_range))

Now repeat what we did above with all of these points. Let's filter our dataset first to reduce the number of data points we're iterating over, then iterate over our points.

disembark_walk_riders <- riders_ldf %>% filter(transport=="walk",direction=="disembarking")
walk_metric <- rep(NA, nrow(points))
for (i in 1:nrow(points)){
  point_lat <- points[i,1]
  point_lon <- points[i,2]
  pedestrians <- disembark_walk_riders %>% mutate(dist_to_point = unlist(map2(latitude,longitude,earth.dist,point_lat,point_lon))) %>% filter(dist_to_point<1)
  n <- sum(pedestrians$riders)
  walk_metric[i] <- n
}

This was probably not the most efficient way to do this, but it does the job to calculate the number of pedestrians disembarking from a station 1km from the centre of each point.

walk_metric_plot <- data.frame(cbind(points,walk_metric))
names(walk_metric_plot) <- c("lat","lon","walk_metric")
walk_metric_plot <- walk_metric_plot %>% mutate(opacity = walk_metric/max(walk_metric))

sp_lon <- (lon_range[2] - lon_range[1])/2 # find interval for map
sp_lat <- (lat_range[2] - lat_range[1])/2
m %>% addRectangles(data = walk_metric_plot,
  ~lon-sp_lon,~lat-sp_lat, ~lon+sp_lon,~lat+sp_lat, 
  stroke=FALSE, color = "red",  fillOpacity = ~opacity
)

And voilà. Here's a heat map of what could be named a "walk metric" -- the number of passengers disembarking and walking from train stations within a 1km radius.

There a few problems with the current setup, though. Things to fix with this: we need to make the grid cells smaller -- aka calculate for more points. OR at least, make the distances between points/ squares correspond to the distances for which the walk_metric is calculated. (e.g. for this example of staions within 1km, we would calculate a walk_metric for each kilometre squared)