Here is this homework’s R markdown (rmd) file.
In our previous class demonstrations and homeworks, we practiced exploring Citi Bike ride data to gain insights into the bike share’s rebalancing efforts. In the process, we gained experience transforming data and mapping data to visual encodings. First, as a class we practiced using a workflow with Citi Bike data to create a new variable, an indicator of whether bikes may have been rebalanced. Next, in homework two, we practiced mapping Citi Bike ride data onto the three attributes of color: hue, saturation, and luminance. In the process we were able to explore how usage, rebalancing efforts, or both may have changed between 2013 and 2019, and again before and after the pandemic began. This exploration also helped us consider some of the limitations of that particular visualization: it did not consider the effects of rebalancing or of bike and docking station availability. In this assignment, we will try to account for those and other limitations in our visualizations, and in the process gain practice with new data graphics and with explaining our insights to others.
Load libraries to access functions we’ll use in this analysis. Of note, if you have not installed these packages, do so outside of this rmd file.
library(tidyverse) # the usual
library(geojsonio) # for map data
library(broom) # for map data
library(patchwork) # for organizing multiple graphs
library(ggthemes) # collection of graph themes
theme_set(theme_tufte(base_family = 'sans'))
We’ll use the same dataset as in our previous homework. Let’s load our data and rename variables (as before),
rider_trips <- read_csv('data/201909-Citibike-tripdata.csv')

rider_trips <-
  rider_trips %>%
  rename_all(function(x) gsub(' ', '_', x)) %>%
  rename(start_time = starttime,
         end_time   = stoptime) %>%
  mutate(tripduration = as.difftime(tripduration, units = 'hours'))
Previously, we considered that, in general, Citi Bike’s available data include measures of several attributes for each bike ride. When a bikeshare customer begins their ride, Citi Bike measures these attributes:
bikeid
start_station_id
start_station_name
start_station_longitude
start_station_latitude
start_time
For the same record (row in the data), when a bikeshare customer ends their ride, Citi Bike measures additional attributes:
end_station_id
end_station_name
end_station_longitude
end_station_latitude
end_time
We’ll also use the variable usertype, and the calculated variable tripduration. Of note, while Citi Bike also records other attributes about the ride (e.g., birthyear, gender), we’ll ignore these here.
Thus, for customer rides, any given ride begins at the same station where the previous ride ended. Described with math, for rides \(n \in 1, 2, ... N\) of each bike \(b \in 1, 2, ... B\), we can express bike location between rides as
\[ \textrm{end_station_name}_{b, n} = \textrm{start_station_name}_{b, n+1} \mid \textrm{normal usage} \]
This does not always hold, however, when Citi Bike intervenes between rides by removing a bike from a docking station for whatever reason (e.g., rebalancing or repair); Citi Bike may redock the bike anywhere, or not at all. By combining information for rides \(n\) and \(n+1\), we can create intervention observations by filtering to keep only the transitions where
\[ \textrm{end_station_name}_{b, n} \ne \textrm{start_station_name}_{b, n+1} \mid \textrm{intervention} \]
Create observations for Citi Bike’s interventions. To create these, you’ll need to perform several data transformations.
Here’s my suggestion. First, arrange the data by bikeid and start_time, so that each bike’s rides will be ordered in time. Then, group all observations for each bikeid together. Within these groupings (by bikeid), you’ll create a new observation describing the time between rides. Thus, for each of the original variables with names that begin with start_ or end_: 1) the start_ variables in your new intervention observations should be equal to the original end_ variables of the previous observed ride, and 2) the new end_ variables should be equal to the original start_ variables of the current observed ride.
For most of those new intervention observations, the start_ and end_ variables will be the same because the bike just stays docked until the next ride. Filter those out, and you’ll be left with the cases where Citi Bike, for whatever reason, moved the bike between rides.
Of note, using only this data you won’t know where the previous ride was for each bike’s first ride, so that’s missing data. For this exercise, assume no rebalancing occurred before the first ride, and filter those observations out. Include the variable usertype and set its value for all these intervention observations to “Citi Bike”. Finally, calculate the time difference between the start and end of the transition in units of hours, and save it as tripduration. Hint: you might try coding something like difftime(end_time, start_time, units = 'hours').
Your new data frame should include these variables:
bikeid
start_station_id
start_station_name
start_station_longitude
start_station_latitude
start_time
end_station_id
end_station_name
end_station_longitude
end_station_latitude
end_time
usertype
tripduration
Name your new data frame interventions.
# ENTER CODE TO TRANSFORM DATA INTO interventions
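If you’re unsure where to start, here is a minimal sketch of one possible approach (not the only valid one), assuming the rider_trips object created above; the intermediate prev_* column names are illustrative choices, not requirements.

# one possible sketch -- your approach may differ
interventions <-
  rider_trips %>%
  # order each bike's rides in time
  arrange(bikeid, start_time) %>%
  group_by(bikeid) %>%
  # carry forward where and when the previous ride ended
  mutate(
    prev_station_id        = lag(end_station_id),
    prev_station_name      = lag(end_station_name),
    prev_station_longitude = lag(end_station_longitude),
    prev_station_latitude  = lag(end_station_latitude),
    prev_time              = lag(end_time)
  ) %>%
  ungroup() %>%
  # drop each bike's first observed ride (no prior ride to compare against)
  # and keep only transitions where the bike changed stations between rides
  filter(!is.na(prev_station_name),
         prev_station_name != start_station_name) %>%
  transmute(
    bikeid,
    # the intervention ends where and when the current ride begins ...
    end_station_id        = start_station_id,
    end_station_name      = start_station_name,
    end_station_longitude = start_station_longitude,
    end_station_latitude  = start_station_latitude,
    end_time              = start_time,
    # ... and begins where and when the previous ride ended
    start_station_id        = prev_station_id,
    start_station_name      = prev_station_name,
    start_station_longitude = prev_station_longitude,
    start_station_latitude  = prev_station_latitude,
    start_time              = prev_time,
    usertype     = 'Citi Bike',
    tripduration = difftime(end_time, start_time, units = 'hours')
  ) %>%
  select(bikeid, starts_with('start_'), starts_with('end_'),
         usertype, tripduration)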
How many observations are in your new data frame interventions?
WRITE YOUR ANSWER HERE.
Applying the grammar of graphics with ggplot2, 1) create a histogram of your calculated tripduration in your new data frame interventions, 2) layer a red, vertical line onto the histogram that marks 24 hours, and 3) add explanatory information, including x and y labels, your main takeaway as a title, and a caption describing the source of your data.
# ENTER CODE TO GRAPH YOUR tripduration INTERVENTION DATA
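As a rough sketch of the structure (assuming the interventions data frame above; your binwidth, labels, title, and takeaway will differ):

# a minimal sketch -- adjust binwidth, labels, and title to your analysis
ggplot(interventions) +
  geom_histogram(aes(x = as.numeric(tripduration)), binwidth = 12) +
  geom_vline(xintercept = 24, color = 'red') +
  labs(
    x = 'Hours between docking and next ride',
    y = 'Number of interventions',
    title = 'YOUR MAIN TAKEAWAY HERE',
    caption = 'Source: Citi Bike trip data, September 2019'
  )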
Does our above method (creating observations when the end_station_id of a ride does not match the start_station_id of the consecutive ride) tend to accurately measure how often Citi Bike has intervened, or might our method tend to overcount or undercount? Explain to a Citi Bike analytics executive.
EXPLAIN YOUR ANSWER HERE.
Apple Maps estimates that, on average, a bike ride from the top of Manhattan (Inwood Hill Park) to the bottom of Manhattan (Battery Park) would take about 1.5 hours. And some bike angels can ride pretty fast! Obviously Citi Bike may intervene to rebalance docking stations, and in our earlier class discussions we covered four ways they try to rebalance. Consider reasons other than rebalancing why Citi Bike may intervene. Does your histogram suggest anything about Citi Bike’s methods and purposes for their interventions? Explain to a Citi Bike analytics executive.
EXPLAIN YOUR ANSWER HERE.
To practice layering encodings onto maps, let’s try to uncover high-level patterns in the location of Citi Bike interventions.
We might think of these interventions geographically (that is, as locations in space). First, to visualize these interventions as locations in space, we’ll overlay visual encodings onto a map of Manhattan. We can create the base map from geographic data available at Beta NYC, which we convert from the available data structure, called a spatial polygon data frame, into a regular data frame with which we are familiar. Here’s the code:
# identify the filename and location relative to the project directory
map_file <- 'data/betanyc_hoods.geojson'

# save and load the geojson as a spatial polygon data frame
if( !file.exists(map_file) ) {
  url <- str_c(
    'https://ssp3nc3r.github.io/',
    '20213APAN5800K007/data/betanyc_hoods.geojson'
  )
  # below functions in geojsonio package
  spdf <- geojson_read(url, what = 'sp')
  geojson_write(spdf, file = map_file)
} else {
  spdf <- geojson_read(map_file, what = 'sp')
}

# convert the spatial polygon data frame to tibble (data.frame) for
# boroughs and neighborhoods using the tidy function (broom package)
nyc_neighborhoods <- tidy(spdf, region = 'neighborhood')
nyc_boroughs <- tidy(spdf, region = 'borough')
Inspect both the spatial polygon data frame, spdf, and the two new regular data frames, nyc_neighborhoods and nyc_boroughs, to get a sense of how they are structured.
From these data frames, we draw a base map of Manhattan that also shows its neighborhood boundaries. Review the help file for geom_polygon, the function we’ll use to map this spatial data onto visual encodings. Again, here’s some code to create our base map:
p_hoods <-
  # initialize graph
  ggplot() +
  # remove most non-data ink
  theme_void() +
  # add color for water (behind land polygons)
  theme(
    panel.background = element_rect(fill = 'lightblue')
  ) +
  # define coordinate system and zoom in on Manhattan
  coord_map(
    projection = 'mercator',
    xlim = c(-74.03, -73.91),
    ylim = c( 40.695, 40.85)
  ) +
  # map boundary data to visual elements (polygons)
  geom_polygon(
    data = nyc_neighborhoods,
    mapping = aes(
      x = long,
      y = lat,
      group = group
    ),
    fill = 'white',
    color = 'gray',
    lwd = 0.1
  )

# display the graph
p_hoods
There are many approaches to encode intervention data onto visual variables layered onto the map. Choose one or more visual encodings to layer intervention data onto the map. These may be visually encoded from direct observations in interventions, from transformations or summaries of those observations, or from both.
# ENTER CODE TO LAYER YOUR INTERVENTION DATA ONTO MAP
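For example, one of many possible encodings is to draw each intervention as a segment from the station where the bike was removed to the station where it was re-docked, layered onto the p_hoods base map above; the color and alpha values here are arbitrary illustrative choices.

# one of many possible encodings -- a sketch, not a prescribed answer
p_hoods +
  geom_segment(
    data = interventions,
    mapping = aes(
      x    = start_station_longitude,
      y    = start_station_latitude,
      xend = end_station_longitude,
      yend = end_station_latitude
    ),
    color = 'red',
    alpha = 0.05
  )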
Explain to a Citi Bike analytics executive, what were your choices of visual encodings and how they help the executive explore patterns in Citi Bike interventions?
EXPLAIN WHAT VISUAL ENCODINGS YOU CHOSE AND WHY.
Combine your new observations from interventions with the original observed rides in rider_trips into a new data frame called allmoves.
# ENTER CODE TO COMBINE OR BIND ROWS FROM rider_trips AND interventions
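A minimal sketch: because the two data frames share the same column names, bind_rows() from dplyr stacks them directly (columns present in only one data frame are filled with NA).

# stack rides and interventions into one data frame of all bike movements
allmoves <- bind_rows(rider_trips, interventions)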
Next, let’s look more closely at the patterns of bikes available at a station across time. Again, we don’t directly have the number of bikes or the number of empty parking spots available at each station at any given time, but we can estimate that information from the above data. With your data frames rider_trips and interventions (or collectively, allmoves), within each station_id you can count observed rides (and interventions): each end_station_id counts as +1, and each start_station_id counts as -1.
Then, you can order them in time and use a cumulative sum function like cumsum(). Because our data arbitrarily begins at the start of a month, however, we should not start our cumulative counts at 0 (there were already bikes at the stations). We can account for this by subtracting from the cumulative count the minimum at each station over the month: e.g., \(\sum b_i - \textrm{min}(\sum b_i)\), where \(b_i \in \{-1, +1\}\).
In the step of transforming data, calculate this across time per station for 1) your combined trips and interventions and 2) separately for just interventions.
# ENTER CODE TO TRANSFORM DATA (ACCUMULATED SUMS OF BIKES ENTERING & LEAVING)
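Here is a sketch of one way to do this, assuming the allmoves and interventions data frames created above; the helper function name count_bikes and the output column names (time, b, n_bikes) are illustrative, not prescribed.

# each arrival (end of a move) adds a bike; each departure (start) removes one
count_bikes <- function(moves) {
  bind_rows(
    moves %>%
      transmute(station_id   = end_station_id,
                station_name = end_station_name,
                time         = end_time,
                b            = +1),
    moves %>%
      transmute(station_id   = start_station_id,
                station_name = start_station_name,
                time         = start_time,
                b            = -1)
  ) %>%
  arrange(station_id, time) %>%
  group_by(station_id, station_name) %>%
  # cumulative bikes, shifted so the monthly minimum at each station is zero
  mutate(n_bikes = cumsum(b) - min(cumsum(b))) %>%
  ungroup()
}

bikes_all           <- count_bikes(allmoves)
bikes_interventions <- count_bikes(interventions)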
In the step of visually encoding the transformed data, graph both cumulative sums across time at one particular station: “W 31 St & 7 Ave”, which is near Penn Station. Categorically encode the cumulative sum of combined trips and interventions in black, and encode the cumulative sum of just interventions in red.
# ENTER CODE TO GRAPH BOTH ACCUMULATED SUMS FOR THE SINGLE STATION
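A minimal sketch, assuming the bikes_all and bikes_interventions data frames from the previous sketch; geom_step is one reasonable choice for cumulative counts, though geom_line would also work.

# overlay the two cumulative sums for the single station
ggplot(mapping = aes(x = time, y = n_bikes)) +
  geom_step(data = filter(bikes_all, station_name == 'W 31 St & 7 Ave'),
            color = 'black') +
  geom_step(data = filter(bikes_interventions, station_name == 'W 31 St & 7 Ave'),
            color = 'red') +
  labs(x = 'Time', y = 'Cumulative bikes (relative to monthly minimum)')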
Did your graph reveal patterns in bike and docking availability, Citi Bike interventions, or relationships between them at the station “W 31 St & 7 Ave”, which is located near Penn Station? Explain to a Citi Bike analytics executive.
WRITE YOUR EXPLANATION HERE.
Annotate your above graph with a title, subtitle, and other markings to explain your interpretation and insights for a mixed audience of Citi Bike executives.
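For example, if you saved your station graph above as an object (here, hypothetically, p_station), you might add annotations like the following; the title, subtitle, coordinates, date, and label text are placeholders for your own interpretation.

# illustrative annotation sketch; p_station is an assumed ggplot object
p_station +
  labs(
    title    = 'YOUR MAIN INSIGHT AS A HEADLINE',
    subtitle = 'A sentence of supporting context for a mixed audience'
  ) +
  annotate('text',
           x = as.POSIXct('2019-09-15'), y = 30,
           label = 'An event worth calling out',
           hjust = 0, size = 3)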
In submitting this individual assignment, you are representing that your answers are your own. Properly cite all resources used in completing this assignment.
Knit your answers in this R markdown file into an html file. Submit both files into Courseworks (the lastname-firstname-hw4.rmd and the knitted lastname-firstname-hw4.html). We should be able to reproduce your html file just by opening your rmd file and knitting.
If you see mistakes or want to suggest changes, please create an issue on the source repository.