Homework 4: graphics practice

Here is this homework’s R markdown (rmd) file.

In our previous class demonstrations and homeworks, we practiced exploring Citi Bike ride data to gain insights into the bike share’s rebalancing efforts. In the process, we gained experience transforming data and mapping data to visual encodings. First, as a class we practiced using a workflow with Citi Bike data to create a new variable, an indicator of whether bikes may have been rebalanced. Next, in homework two, we practiced mapping Citi Bike ride data onto the three attributes of color: hue, saturation, and luminance. In the process we were able to explore how usage, rebalancing efforts, or both may have changed between 2013 and 2019, and again before and after the pandemic began. This exploration also helped us consider some of the limitations of that particular visualization: it did not consider the effects of rebalancing or of bike and docking station availability. In this assignment, we will try to account for those and other limitations in the visualizations, and in the process gain practice with new data graphics and with explaining our insights to others.

Preliminary setup

Load libraries to access functions we’ll use in this analysis. Of note, if you have not installed these packages, do so outside of this rmd file.

library(tidyverse) # the usual
library(sf)        # for map data
library(patchwork) # for organizing multiple graphs
library(ggthemes)  # collection of graph themes
theme_set(theme_tufte(base_family = 'sans'))

We’ll use the same dataset as in our previous homework. Let’s load our data and rename variables (as before):

rider_trips <- read_csv('data/201909-citibike-tripdata.csv')
rider_trips <- 
  rider_trips %>% 
  rename_all(function(x) gsub(' ', '_', x)) %>%
  rename(start_time = starttime,
         end_time = stoptime) %>%
  mutate(tripduration = as.difftime(tripduration / 3600, units = 'hours') )

Previously, we considered that, in general, Citi Bike’s available data include measures of several attributes for each bike ride. When a bikeshare customer begins their ride, Citi Bike measures these attributes:

bikeid
start_station_id
start_station_name
start_station_longitude
start_station_latitude
start_time

For the same record (row in the data), when a bikeshare customer ends their ride, Citi Bike measures additional attributes:

end_station_id
end_station_name
end_station_longitude
end_station_latitude
end_time

We’ll also use the variable usertype, and the calculated variable tripduration. Of note, while Citi Bike also records other attributes about the ride (e.g., birth_year, gender), we’ll ignore these here.

Thus, for customer rides, any given ride begins at the same station where that bike’s previous ride ended. Described with math, for rides \(n \in \{1, 2, \dots, N\}\) of each bike \(b \in \{1, 2, \dots, B\}\), we can express bike location between rides as

\[ \textrm{end_station_name}_{b, n} = \textrm{start_station_name}_{b, n+1} \mid \textrm{normal usage} \]

This does not always hold, however, when Citi Bike intervenes between rides by removing a bike from a docking station for whatever reason (e.g., rebalancing or repair); Citi Bike may redock the bike anywhere or not at all. By combining information for rides \(n\) and \(n+1\), we can create intervention observations, filtering to keep only the transitions where

\[ \textrm{end_station_name}_{b, n} \ne \textrm{start_station_name}_{b, n+1} \mid \textrm{intervention} \]
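To make the comparison between ride \(n\) and ride \(n+1\) concrete, here is a minimal sketch on made-up data. The data frame toy_rides, its ride counter, the station names, and the helper columns previous_end and moved_between_rides are all hypothetical; the sketch only illustrates how a grouped lag() lines up each ride with the previous ride of the same bike, and it is not the full transformation asked for in Question 1.

```r
library(dplyr)

# Made-up rides for two hypothetical bikes (station names are invented)
toy_rides <- data.frame(
  bikeid             = c(1, 1, 1, 2, 2),
  ride               = c(1, 2, 3, 1, 2),
  start_station_name = c("A", "B", "D", "C", "A"),
  end_station_name   = c("B", "C", "E", "A", "B"),
  stringsAsFactors   = FALSE
)

toy_transitions <- toy_rides %>%
  arrange(bikeid, ride) %>%
  group_by(bikeid) %>%
  # station where the same bike's previous ride ended (NA for its first ride)
  mutate(previous_end = lag(end_station_name)) %>%
  ungroup() %>%
  # TRUE when the bike starts somewhere other than where it was last docked
  mutate(moved_between_rides = previous_end != start_station_name)

toy_transitions
```

In this toy data only the third ride of bike 1 starts at a station other than where the bike was last docked; each bike’s first ride has no previous ride, so the comparison is NA.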

Question 1 — measuring Citi Bike interventions (data transformations)

Question 1(a)

Create observations for Citi Bike’s “interventions”. To create these, you’ll need to perform several data transformations.

Here’s my suggestion for one approach. First, arrange the data by bikeid and start_time, so that each bike’s rides will be ordered in time. Then, group all observations for each bikeid together. Within these groupings (by bikeid), you’ll create a new observation describing the time between rides. Thus, for the original variables with names that begin with start_ or end_, 1) the start_ variables in your new intervention observations should equal the original end_ variables of the previous observed ride, and 2) the new end_ variables should equal the original start_ variables of the current observed ride.

Now for most of those new intervention observations, the start_ and end_ station variables will be the same because the bike just stays docked until the next ride. Filter those out, and you’ll be left with observations where Citi Bike, for whatever reason, moved the bike between rides.

Of note, for each bike’s first ride you won’t know where the previous ride ended (using only this data), so that’s missing data. For this exercise, assume no rebalancing occurred before the first ride, and filter those observations out. Include the variable usertype and set its measurement for all these intervention observations to “Citi Bike”. Finally, calculate the time difference between start and end of the transition in units of hours, and save as tripduration. Hint: you might try coding something like difftime(end_time, start_time, units = 'hours').
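As a small illustration of the difftime() hint, here is a sketch using two made-up timestamps. The values of start_time and end_time below are invented for the example and do not come from the Citi Bike data.

```r
# Hypothetical timestamps: a bike is docked at 08:00 and next seen at 14:30
start_time <- as.POSIXct("2019-09-01 08:00:00", tz = "UTC")
end_time   <- as.POSIXct("2019-09-01 14:30:00", tz = "UTC")

# difftime() returns a difftime object in the requested units
gap <- difftime(end_time, start_time, units = "hours")
gap

# drop the units when a bare number is needed
as.numeric(gap)
```

Keeping the result as a difftime matches how tripduration is stored in rider_trips above, which may make the two data frames easier to combine later.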

Your new data frame should include these variables:

bikeid
start_station_id
start_station_name
start_station_longitude
start_station_latitude
start_time
end_station_id
end_station_name
end_station_longitude
end_station_latitude
end_time
usertype
tripduration

Name your new dataframe as the object interventions.

# ENTER CODE TO TRANSFORM DATA INTO interventions
interventions <- rider_trips %>%
  select( -birth_year, -gender ) %>%
  arrange(
    # ENTER CODE
  ) %>%
  group_by(
    # ENTER CODE
  ) %>%
  mutate(
    across(
      .cols = matches('end_'),
      .fns = lag
    )
  ) %>%
  rename_with(
    .cols = contains('time') | contains('_station_'),
    ~ if_else(
        str_detect(., 'start'),
        str_replace(., 'start', 'end'),
        str_replace(., 'end', 'start')
      )
  ) %>%
  filter(
    # ENTER CODE
  ) %>%
  ungroup() %>%
  mutate(
    usertype = 'Citi Bike',
    tripduration = # ENTER CODE
  )

Question 1(b)

We’re curious about a docking station near Madison Square Garden: the station name is ‘W 31 St & 7 Ave’ and its station id is 379. How many trips originated from this station, and what percent of stations had more rides leaving?

# ENTER CODE FOR CALCULATIONS

EXPLAIN ANSWER HERE.

Question 1(c)

For the same station, how many bikes did Citi Bike remove due to interventions, and from what percent of stations did Citi Bike remove more bikes?

# ENTER CODE FOR CALCULATIONS

EXPLAIN ANSWER HERE.

Question 1(d)

Applying the grammar of graphics with ggplot2, create a histogram showing the distribution of the number of interventions removing bikes across stations. Use a bin width of 100. Appropriately label the x-axis and y-axis, and include a title with your interpretation (a message, not just information). Apply best practices for communicating the visual information.

# ENTER CODE FOR VISUALIZATION

Question 2 — visualizing time between rides (visually encoding data)

Applying the grammar of graphics with ggplot2, 1) create a histogram of your calculated tripduration in your new data frame interventions, 2) layer a red, vertical line onto the histogram that marks 24 hours, and 3) add explanatory information including x and y labels, your main takeaway or interpretation (message, not just information) as a title, and a caption describing the source of your data.
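The layering described above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the data frame toy, its hours column, the bin width of 2, and the placeholder label text. The point is only to show a reference line layered onto a histogram with geom_vline(), along with the labs() arguments for labels, title, and caption.

```r
library(ggplot2)

# Simulated durations in hours (NOT Citi Bike data), only to show the layers
set.seed(1)
toy <- data.frame(hours = rexp(500, rate = 1 / 12))

p <- ggplot(toy, aes(x = hours)) +
  # histogram layer; the bin width here is an arbitrary choice for toy data
  geom_histogram(binwidth = 2, boundary = 0) +
  # red vertical reference line at 24 hours
  geom_vline(xintercept = 24, color = "red") +
  # explanatory text: axis labels, a message as the title, and a data source
  labs(
    x = "Hours between rides (simulated)",
    y = "Number of observations",
    title = "Placeholder: state your takeaway here",
    caption = "Source: simulated data for illustration"
  )

p
```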

# ENTER CODE TO GRAPH YOUR tripduration INTERVENTION DATA

Question 3 — communication, critical thinking

Does our above method (creating observations when the end_station_id of a ride does not match the start_station_id of the consecutive ride) tend to accurately measure how often Citi Bike has intervened, or might our method tend to overcount, undercount, or both? Support your argument in an explanation to a Citi Bike analytics executive.

EXPLAIN YOUR ANSWER HERE.

Question 4 — communication, critical thinking

Apple Maps estimates that on average a bike ride from the top of Manhattan (Inwood Hill Park) to the bottom of Manhattan (Battery Park) would take about 1.5 hours. And some bike angels can ride pretty fast! Obviously Citi Bike may intervene to rebalance docking stations, and in earlier classes we discussed four ways they try to rebalance. Consider reasons other than rebalancing for which Citi Bike may intervene. What, if anything, does your histogram suggest about Citi Bike’s methods and purposes for the observations you’ve calculated as interventions? Explain to a Citi Bike analytics executive.

EXPLAIN YOUR ANSWER HERE.

Question 5 — visualize location of interventions (visually encoding data)

To practice layering encodings onto maps, let’s try to uncover high-level patterns in the location of Citi Bike interventions.

We might think of these interventions geographically (that is, as locations in space). To visualize them as locations in space, we’ll overlay visual encodings onto a map of Manhattan. We can create the base map from geographic data available at Beta NYC, which we read into a simple features (sf) data frame that behaves much like the regular data frames with which we are familiar. Here’s the code:

# location of the GeoJSON file with borough and neighborhood boundaries,
# which we read into an sf data frame

url <- str_c(
    'https://ssp3nc3r.github.io/',
    '20213APAN5800K007/data/betanyc_hoods.geojson'
    )

nyc_neighborhoods <- read_sf(url)

Inspect the simple features data frame, nyc_neighborhoods. Notice that the variable geometry contains a storage type called POLYGON, which holds the information that describes the geographic locations we’re interested in.

From this data frame, we draw a base map of Manhattan that also shows its neighborhood boundaries. Review the help file for geom_sf, the function we’ll use to map this spatial data onto visual encodings. Again, here’s some code to create our base map:

p_hoods <- 
  
  # initialize graph
  ggplot() + 
  
  # remove most non-data ink
  theme_void() +
  
  # add color for water (behind land polygons)
  theme(
    panel.background = element_rect(fill = 'lightblue')
  ) +
  
  # map boundary data to visual elements (polygons)
  geom_sf(
    data = nyc_neighborhoods,
    mapping = aes(geometry = geometry),
    fill = 'white',
    color = 'gray',
    lwd = 0.1
  ) +
  
  # define coordinate system and zoom in on Manhattan
  coord_sf(
    crs = sf::st_crs(4326), # World Geodetic System 1984 (WGS84)
    xlim = c(-74.03, -73.91),
    ylim = c( 40.695, 40.85)
  )


# display the graph
p_hoods

To this map, you can overlay other geometric mappings such as geom_point or geom_segment, just as with other grammar of graphics plots.
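As a rough sketch of that layering, the example below draws a made-up rectangle in place of a neighborhood polygon and overlays three invented points. The objects toy_hood, toy_stations, and p_toy, along with all coordinates and counts, are hypothetical; with the real base map, a layer like the geom_point() call could be added to p_hoods in the same way.

```r
library(ggplot2)
library(sf)

# A made-up rectangle standing in for a neighborhood polygon (WGS84 lon/lat)
corners <- rbind(
  c(-74.00, 40.70), c(-73.95, 40.70), c(-73.95, 40.75),
  c(-74.00, 40.75), c(-74.00, 40.70)
)
toy_hood <- st_sf(geometry = st_sfc(st_polygon(list(corners)), crs = 4326))

# Made-up station coordinates and counts
toy_stations <- data.frame(
  lon = c(-73.99, -73.97, -73.96),
  lat = c(40.71, 40.73, 40.74),
  n   = c(5, 20, 60)
)

p_toy <- ggplot() +
  # polygon layer drawn from the sf geometry column
  geom_sf(data = toy_hood, fill = "white", color = "gray") +
  # ordinary point layer: longitude on x, latitude on y, count encoded as size
  geom_point(
    data = toy_stations,
    mapping = aes(x = lon, y = lat, size = n),
    color = "red", alpha = 0.5
  ) +
  coord_sf(crs = st_crs(4326))

p_toy
```

Size is only one possible encoding; color, transparency, or line segments between start and end stations are alternatives worth weighing.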

There are many approaches to encode intervention data onto visual variables layered onto the map. Choose one or more visual encodings to layer intervention data onto the map. These may be visually encoded from direct observations in interventions, from transformations or summaries of those observations, or from both.

# ENTER CODE TO LAYER YOUR INTERVENTION DATA ONTO MAP

Explain to a Citi Bike analytics executive which visual encodings you chose and how they help the executive explore patterns in Citi Bike interventions.

EXPLAIN WHAT VISUAL ENCODINGS YOU CHOSE AND WHY.

Question 6 — combine ride data with Citi Bike interventions (data transformation)

Combine your new observations from interventions with the original observed rides in rider_trips into a new data frame called allmoves.
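One way to stack two data frames by row is dplyr’s bind_rows(). The sketch below uses two tiny made-up data frames (rides and moves, both hypothetical) to show that columns present in only one of them, such as gender here, are filled with NA in the combined result.

```r
library(dplyr)

# Made-up stand-ins for rider trips and interventions
rides <- data.frame(
  bikeid   = c(1, 2),
  usertype = c("Subscriber", "Customer"),
  gender   = c(1, 2),
  stringsAsFactors = FALSE
)
moves <- data.frame(
  bikeid   = 1,
  usertype = "Citi Bike",
  stringsAsFactors = FALSE
)

# rows are stacked; columns are matched by name, and missing ones become NA
combined <- bind_rows(rides, moves)
combined
```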

# ENTER CODE TO COMBINE OR BIND ROWS FROM rider_trips AND interventions

Question 7 — estimating number of bikes at stations (data transformation and visual encodings)

Next, let’s look more closely at the patterns of bikes available at a station across time. Again, we don’t directly have the number of bikes or number of empty parking spots available at each station at any given time, but we can estimate that information from the above data. With your data frames rider_trips and interventions (or collectively, allmoves), within each station_id you can count observed rides (and interventions): each end_station_id counts as +1, and each start_station_id counts as -1.

Then, you can order them in time and use a cumulative sum function like cumsum(). Because our data arbitrarily begins at the beginning of a month, however, we should not be starting our cumulative counts at 0 (because there were already bikes at the stations). We can account for this by subtracting from the cumulative bikes the minimum at each station over the month: e.g., \(\sum b_i - \textrm{min}(\sum b_i)\), where \(b_i \in \{-1, +1\}\).
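The arithmetic can be checked on a few made-up events. In this sketch, toy_events, its station ids, the integer time column, and the column names change, running, and estimate are all hypothetical; it only demonstrates a grouped cumsum() shifted by its per-station minimum.

```r
library(dplyr)

# Made-up events at two hypothetical stations:
# +1 = a bike arrives, -1 = a bike leaves
toy_events <- data.frame(
  station_id = c("X", "X", "X", "X", "Y", "Y", "Y"),
  time       = c(1, 2, 3, 4, 1, 2, 3),
  change     = c(-1, -1, 1, 1, 1, -1, 1),
  stringsAsFactors = FALSE
)

toy_counts <- toy_events %>%
  arrange(station_id, time) %>%
  group_by(station_id) %>%
  mutate(
    running  = cumsum(change),          # cumulative net arrivals
    estimate = running - min(running)   # shift so the lowest point is zero
  ) %>%
  ungroup()

toy_counts
```

For station X the running sum is -1, -2, -1, 0, so subtracting the minimum of -2 gives 1, 0, 1, 2. The shifted value is best read as a lower bound: the station held at least that many bikes at each moment.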

In the step of transforming data, calculate this across time per station for 1) your combined trips and interventions and 2) separately for just interventions.

# ENTER CODE TO TRANSFORM DATA (ACCUMULATED SUMS OF BIKES ENTERING & LEAVING)

In the step of visually encoding the transformed data, graph the two cumulative sums across time at one particular station: “W 31 St & 7 Ave”, which is near Penn Station. Categorically encode the cumulative sum of combined trips and interventions in black, and encode the cumulative sum of just interventions in red.

# ENTER CODE TO GRAPH BOTH ACCUMULATED SUMS FOR THE SINGLE STATION

Question 8 — communication, critical thinking

For the questions below, explain your answers to a Citi Bike analytics executive:

Did your graph reveal any patterns in bike and docking availability, Citi Bike interventions, or relationships between them at the station “W 31 St & 7 Ave”, which is located near Penn Station?

WRITE YOUR EXPLANATION HERE.

How, if at all, does your graph above address any limitations of the graph titled, “CITI BIKE HOURLY ACTIVITY AND BALANCE” by Columbia University’s Center for Spatial Research (from homework two) in the context of exploring rebalancing?

WRITE YOUR EXPLANATION HERE.

What type of data transformations or visual encodings would you recommend that Citi Bike explore next to continue learning how to improve their rebalancing efforts and why?

WRITE YOUR EXPLANATION HERE.

Annotate your graph above with a title, subtitle, and other markings to explain your interpretation and insights for a mixed audience of Citi Bike’s executives.

# ENTER CODE TO INCLUDE ANNOTATIONS AND EXPLANATIONS ON YOUR GRAPH

References cited

Properly cite all resources used in completing this assignment:

Submission — reproducibility

In submitting this individual assignment, you are representing that your answers are your own.

Knit your answers in this R markdown file into an html file. Submit both files to courseworks (lastname-firstname-hw4.rmd and the knitted lastname-firstname-hw4.html). We should be able to reproduce your html file just by opening your rmd file and knitting.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.