In this notebook, we'll be exploring data on daily ridership for train stations in the Tokyo metro region in 2008. The data contains information about the mode of transport that riders used to get to the station and when leaving the station.

The dataset was downloaded here, and station geometry GIS lines have been kindly converted to a single latitude/longitude point.

Let's first load any useful packages and our main dataset (csv):

library(tidyverse)
library(readxl)
riders_df <- read.csv('S05-c-10_SYUTO-g_CommutersPerStation_TokyoUrbanArea.csv')
riders_df

We'll also load an excel file with the data codes, corresponding descriptions in Japanese then English. Since we'll be using English here (and am not sure about how R will handle non-alphanumeric chacters), I've gone ahead and separated the the English description into two columns for ease of use: type (e.g. boarding or disembarking) and details (e.g. mode of transport).

legend <- readxl::read_excel('data_reference.xls', na="NA")
legend

We use those descriptions to make the columns of the original dataset interpretable by a human reading it (without having to refer to separate reference list).

legend <- legend %>% mutate_all(~ gsub(" ", "_",.)) %>% mutate(colname=paste(type,details,sep="."))
names(riders_df) <- names(riders_df) %>% map(function(x) ifelse(x %in% legend$code, legend$colname[which(legend$code==x)],x))
riders_df

Before we begin, let's quickly see what data we have here.

## [1] "Years surveyed: 2008"
## [1] "Number of stations: 1557"
## [1] "Max number of passengers boarding train: 473011 at station 新宿"
## [1] "Min number of passengers boarding train: 0 at station 東総元"
## [1] "Max number of passengers disembarking train: 467165 at station 新宿"
## [1] "Min number of passengers disembarking train: 0 at station 東総元"

Sidenote: ...How many train stations on this list didn't actually have any passengers recorded on that survey?

riders_df %>% filter(boarding.total==0 & disembarking.total==0) %>% select(survey.year,station.code,station.name,boarding.total,disembarking.total)

There appears to be 30 stations out of the 1557 stations listed in Tokyo that don't seem to have any ridership information recorded in the 2008 survey.

Now that we have a basic idea of how our data set looks like, let's answer some questions about the data.


Question #1

Which station has the highest percentage of riders arriving by aircraft, and what is that percentage?

(To answer this, we calculate the percentage of passengers boarding a train who arrived by aircraft, and sort the dataframe by that percentage in descending order:)

riders_df %>% mutate(percentage_aircraft=boarding.aircraft/boarding.total*100) %>% arrange(desc(percentage_aircraft)) %>% select(survey.year,station.name,percentage_aircraft,boarding.aircraft,boarding.total)

Our answer, then, is at the top of that table: 羽田空港第, or Haneda airport stations, top the list. (It is interesting to note that only 10 stations have any passengers arriving by aircraft.)

Answer: 羽田空港第1ビル i.e. Haneda airport Terminal 1 station has the highest percentage of riders arriving by aircraft, with 44.38% of total arrivals to the station having travelled by aircraft.


Question #2

Of stations with more than 10,000 riders leaving the station, what is the most popular mode of transport, and how many total riders used it?

To answer this question, it might be nice to transform the dataframe we have into a "long format" so that it's easier to get the statistics on the types of transportation. In a long format, each row will correspond to a station, direction of travel, and mode of transportation. We'll do this for the stations in question (>10,000 total riders leaving the station):

# Store column names in "var", temporarily
(riders_ldf <- riders_df %>% filter(disembarking.total>10000) %>% gather(var,riders,-c(survey.year,station.code,station.name,longitude,latitude)))

That previous step almost gets us there, but it'd be better if the direction of travel and mode of transportation were separated. (We've previously separated them by periods, which translates to "\." in regular expression.) Now instead of "var", we have "direction" and "transport".

(riders_ldf <- riders_ldf %>% separate(var, c("direction","transport"),sep="\\."))

With this reformatted data, it is easy to get a total of the number of passengers by mode of transportation:

riders_ldf %>% group_by(transport) %>% summarise(total_riders=sum(riders)) %>% arrange(desc(total_riders))
## `summarise()` ungrouping output (override with `.groups` argument)

Now, we can just read our answer from the table we generated.

Answer: The most popular mode of transportation, by far, is walking, with a whopping 33,912,753 passengers using that mode of transporation to get to or from large train stations (with >10,000 riders leaving the station). That's nearly 34 million passengers!


Saving our work for further exploration

So that we can use these ridership dataframes we created in another notebook, let's save it in a file that we can quickly load up later. From this point on, we'll just look at stations that have over 10,000 riders leaving the station daily.

write.csv(riders_ldf,"riders_major_long_data_frame.csv",row.names=FALSE) #Note, this data was already filtered and includes only major stations (with over 10,000 riders leaving the station)

Sidenote.. As a sanity check, let's calculate how many stations this includes (of the 1557 in the original data):

length(unique(riders_ldf$station.code))
## [1] 671

This is still a reasonable number of stations to work with! Large stations are what matter in selecting a new location for our business partner, so this should work.