Homework 4: graphics practice

Scott Spencer https://ssp3nc3r.github.io (Columbia University)https://sps.columbia.edu/faculty/scott-spencer
2021, November 30

In our previous class demonstrations and homeworks, we practiced exploring CitiBike ride data to gain insights into the bike share’s rebalancing efforts. In the process, we gained experience transforming data and mapping data to visual encodings.

First, as a class we practiced using a workflow with CitiBike data to create a new variable, an indicator whether bikes may have been rebalanced. Next, in homework two, we practiced mapping CitiBike ride data onto the three attributes of color: hue, saturation, and luminance. In the process we were able to explore how useage, rebalancing efforts, or both may have changed between 2013 and 2019, and again before and after the pandemic began. This exploration also helped us consider some of the limitations of the particular visualization: it did not consider the effects of rebalancing or bike and docking station availability.

In this assignment, we will try to account for those and other limitations in the visualizations, and in the process gain practice with new data graphics and explaining our insights to others.

Preliminary setup

Load libraries to access functions we’ll use in this analysis. Of note, if you have not installed these packages, do so outside of this rmd file.

library(tidyverse) # the usual
library(geojsonio) # for map data
library(broom)     # for map data
library(patchwork) # for organizing multiple graphs
library(ggthemes)  # collection of graph themes
theme_set(theme_tufte(base_family = 'sans'))

We’ll use the same dataset as in our previous homework. Let’s load our data and rename variables (as before),

rider_trips <- read_csv('data/201909-citibike-tripdata.csv')
rider_trips <- 
  rider_trips %>% 
  rename_all(function(x) gsub(' ', '_', x)) %>%
  rename(start_time = starttime,
         end_time = stoptime) %>%
  mutate(tripduration = as.difftime(tripduration, units = 'hours') )

Previously, we considered that, in general, CitiBike’s available data include measures of several attributes for each bike ride. When a bikeshare customer begins their ride, CitiBike measures these attributes,

bikeid
start_station_id
start_station_name
start_station_longitude
start_station_latitude
start_time

For the same record (row in the data), when a bikeshare customer ends their ride, CitiBike measures additional attributes:

end_station_id
end_station_name
end_station_longitude
end_station_latitude
end_time

We’ll also use the variable usertype, and the calculated variable tripduration. Of note, while CitiBike also records other attributes about the ride (e.g., birthyear, gender), we’ll ignore these here.

Thus, for customer rides, any given ride begins at the same station that the previous ride ended. Described with math, for rides \(n \in 1, 2, ... N\) of each bike \(b \in 1, 2, ... B\), we can express bike location between rides as

\[ \textrm{end_station_name}_{b, n} = \textrm{start_station_name}_{b, n+1} \mid \textrm{normal usage} \]

This does not always hold, however, when CitiBike intervenes between rides by removing a bike from a docking station for whatever reason (e.g., rebalancing or repair); CitiBike may redock the bike anywhere or not at all. By combining information for ride \(n\) and \(n+1\), we can create intervention observations and by filtering to only keep transitions where

\[ \textrm{end_station_name}_{b, n} \stackrel{?}{\ne} \textrm{start_station_name}_{b, n+1} \mid \textrm{intervention} \]

Question 1 — measuring CitiBike interventions (data transformations)

Create observations for CitiBike’s interventions. To create these, you’ll need to perform several data transformations.

Here’s my suggestion. First, arrange the data by bikeid and start_time, so that each bike’s rides will be ordered in time. Then, group all observations for each bikeid together. Within these groupings (by bikeid), you’ll create a new observation describing the time between rides. Thus, for each of the original variables with names that begin with start_ or end_, the start_ variables in your new intervention observations should be equal to the original end_ variables, and 2) the new end_ variables are equal to original start_ variables of the next observed ride.

Now for most of those new intervention observations, the start_ and end_ variables will be the same because the bike just stays docked until the next ride. Filter those out, and you’ll be left with rides where CitiBike, for whatever reason, moved the bike between rides.

Of note, you won’t know where the previous ride was from for the first ride (using only this data), so that’s missing data. For this exercise, assume no rebalancing occured before the first ride. Filter those out. Include the variable usertype and set its measurement for all these intervention observations to “Citibike”. Finally, calculate the time difference between start and end of the transition in units of hours, and save as tripduration. Hint: you might try coding something like difftime(end_time, start_time, units = 'hours').

Your new data frame should include these variables:

bikeid
start_station_id
start_station_name
start_station_longitude
start_station_latitude
start_time
end_station_id
end_station_name
end_station_longitude
end_station_latitude
end_time
usertype
tripduration

Name your new dataframe as the object interventions.

# ENTER CODE TO TRANSFORM DATA INTO interventions

How many observations are in your new data frame interventions?

WRITE YOUR ANSWER HERE.

Question 2 — visualizing time between rides (visually encoding data)

Applying the grammar of graphics with ggplot2, 1) create a histogram of your calculated tripduration in your new data frame interventions, 2) layer a red, vertical line onto the histogram that marks 24 hours, and 3) add explanatory information including x and y labels, your main takeaway as a title, and a caption describing the source of your data.

# ENTER CODE TO GRAPH YOUR tripduration INTERVENTION DATA

Question 3 — critical thinking

Does our above method (creating observations when the end_station_id of a ride does not match the start_station_id of the consecutive ride) tend to accurately measure how often CitiBike has intervened, or might our method tend to overcount or undercount? Explain.

EXPLAIN YOUR ANSWER HERE.

Question 4 — critical thinking

Apple Maps estimates that on average a bike ride from the top of Manhattan (Inwood Hill Park) to the bottom of Manhattan (Battery Park) would take about 1.5 hours. And some bike angels can ride pretty fast! Obviously CitiBike may intervene to rebalance docking stations, and in our earlier discussions we discussed four ways they try to rebalance. Consider other reasons why CitiBike may intervene. Does your histogram suggest anything about CitiBike’s methods and purposes for their interventions? Explain.

EXPLAIN YOUR ANSWER HERE.

Question 5 — visualize location of interventions (visually encoding data)

To practice layering encodings onto maps, let’s try to uncover high-level patterns in the location of CitiBike interventions.

We might think of these interventions geographically (that is, locations in space). First, to visualize these interventions as locations in space, we’ll overlay visual encodings onto a map of Manhattan. We can create the base map from geographic data available at Beta NYC, which we convert from the available data structure called a spatial polygon data frame, into a regular data frame of which we are familiar. Here’s the code:

# identify the filename and location relative to the project directory
map_file <- 'data/betanyc_hoods.geojson'

# save and load the geojson as a spatial polygon data frame
if( !file.exists(map_file) ) {
  url <- str_c(
    'https://ssp3nc3r.github.io/',
    '20213APAN5800K007/data/betanyc_hoods.geojson'
    )
  
  # below functions in geojsonio package
  spdf <- geojson_read(url, what = 'sp') 
  geojson_write(spdf, file = map_file)
  
} else {
  spdf <- geojson_read(map_file, what = 'sp')
}

# convert the spatial polygon data frame to tibble (data.frame) for
# boroughs and neighborhoods using the tidy function (broom package)
nyc_neighborhoods <- tidy(spdf, region = 'neighborhood')
nyc_boroughs <- tidy(spdf, region = 'borough')

Inspect both the spatial polygon data frame, spdf, and the two new regular data frames, nyc_neighborhoods and nyc_boroughs to get a sense of how they are structured.

From these data frames, we draw a base map of Manhattan that also shows its neighborhood boundaries. Review the help file for geom_polygon, the function we’ll use to map this spatial data onto visual encodings. Again, here’s some code to create our base map:

p_hoods <- 
  
  # initialize graph
  ggplot() + 
  
  # remove most non-data ink
  theme_void() +
  
  # add color for water (behind land polygons)
  theme(
    panel.background = element_rect(fill = 'lightblue')
  ) +
  
  # define coordinate system and zoom in on Manhattan
  coord_map(
    projection = 'mercator',
    xlim = c(-74.03, -73.91),
    ylim = c(40.695, 40.85)
  ) +
  
  # map boundary data to visual elements (polygons)
  geom_polygon(
    data = nyc_neighborhoods,
    mapping = aes(
      x = long,
      y = lat,
      group = group
    ),
    fill = 'white',
    color = 'gray',
    lwd = 0.1
  ) 

# display the graph
p_hoods

There are many approaches to encode intervention data onto visual variables layered onto the map. Choose one or more visual encodings to layer intervention data onto the map. These may be visually encoded from direct observations in interventions, or from transformations or summaries of those observations, or from both.

# ENTER CODE TO LAYER YOUR INTERVENTION DATA ONTO MAP

Explain your choice of visual encodings and how they help you explore patterns in CitiBike interventions.

EXPLAIN YOUR CHOICES OF VISUAL ENCODINGS HERE.

Question 6 — combine ride data with CitiBike interventions (data transformation)

Combine your new observations from interventions with the original observed rides in rider_trips into a new data frame called allmoves.

# ENTER CODE TO COMBINE OR BIND ROWS FROM rider_trips AND interventions

Question 7 — estimating number of bikes at stations (data transformation)

Next, let’s look more closely at the patterns of bikes available at a station across time. Again, we don’t directly have the number of bikes or number of empty parking spots available at each station at any given time, but we can estimate that information from the above data. With your data frames rider_trips and interventions (or collectively, allmoves), within each station_id you can count observed rides (and interventions): each end_station_id counts as +1, and each start_station_id counts as -1.

Then, you can order them in time and use a cumulative sum function like cumsum(). Because our data arbitrarily begins at the beginning of a month, however, we should not be starting our cumulative counts at 0 (because there were already bikes at the stations). We can account for this by subtracting from the cumulative bikes the minimum at each station over the month: e.g., \(\sum b_i - \textrm{min}(\sum b_i)\), where \(b_i \in [-1, +1]\).

In the step of transforming data, calculate this across time per station for 1) your combined trips and interventions and 2) separately for just interventions.

# ENTER CODE TO TRANSFORM DATA (ACCUMULATED SUMS OF BIKES ENTERING & LEAVING)

In the step of visually encoding the transformed data, using a single graph encode the two cumulative sums across time at one particular station: “W 31 St & 7 Ave”, which is near Penn Station. Categorically encode the cumulative sum of combined trips and interventions as the hue black, and encode the cumulative sum of just interventions as the hue red.

# ENTER CODE TO GRAPH BOTH ACCUMULATED SUMS FOR THE SINGLE STATION

Question 8 — critical thinking

Did your graph reveal patterns in bike and docking availability, CitiBike interventions, or relationships between them at the station “W 31 St & 7 Ave”, which is located near Penn Station? Explain.

WRITE YOUR EXPLANATION HERE.

Annotate your above graph with a title, subtitle, and other markings to explain your interpretation and insights for CitiBike’s executives.

Submission — reproducibility

Knit your rmd file into an html file, name the files lastname-firstname-hw4.rmd and lastname-firstname-hw4.html, and submit both on courseworks.