Homework 1: workflow and graphics practice

Here is this homework’s R markdown (rmd) file.

In our discussion of the Citi Bike case study, we started considering the effect of the pandemic on ridership and rebalancing, and how we might find some insight by looking at data related to other transportation systems in the city. In this homework, we will continue exploratory data analysis for this case study as prerequisites for communicating with particular audiences for particular purposes.

Preliminary setup

If you have not already, install the tidyverse and distill R packages.

Create a directory on your computer for your homework. Place this file in that directory. In RStudio, create a project in that same directory (in the RStudio menu: File, New Project…). Now, when you import the data, you will only need to specify the subdirectory as part of the file path, not the full path on your computer. This preparatory step helps make your work reproducible.

For this assignment, import data on New York City ridership that I previously saved on our course website. Create a subdirectory called data in your project directory and place the csv file you downloaded into it. Name the file MTA_recent_ridership_data.csv.
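As a minimal sketch of why the project setup matters: inside an RStudio project the working directory is the project root, so a relative path is enough to find the file. The directory and file names are the ones given above; the existence check is only an illustration.

```r
# Build a path relative to the project root (no absolute path needed).
data_path <- file.path("data", "MTA_recent_ridership_data.csv")

# TRUE once the csv file has been saved in the data subdirectory.
file.exists(data_path)
```

Because the path never mentions a user name or drive letter, the same code should work on any computer that has the project directory.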

Inside the code chunk below, load the tidyverse package (which attaches dplyr, ggplot2, and several related packages):

# enter code to load the tidyverse libraries

Question 1: importing and summarising

Import the data into a data frame named d and show a summary. Hint: in your console, after you load the tidyverse, you can type ? before read_csv or glimpse (for example, ?read_csv) to learn more about these functions.

Use those two functions to import and summarize the variables in your data:

# enter code to import and summarize your data frame variables here.
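To see how the two functions behave before applying them to the real file, here is a small sketch on made-up data. The column names and values are hypothetical, and passing literal text through I() assumes readr 2.0 or later.

```r
library(readr)   # read_csv()
library(dplyr)   # glimpse()

# Hypothetical two-row CSV supplied as literal text; I() tells read_csv
# to treat the string as data rather than as a file path.
toy <- read_csv(
  I("date,riders,change\n3/1/2020,100,95%\n3/2/2020,120,97%"),
  show_col_types = FALSE
)

# One line per column: its name, its type, and the first few values.
glimpse(toy)
```

Note which columns come back as character rather than as numbers or dates; the real data will raise the same kind of question.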

Question 2: tidying

The column or variable names will be difficult to work with as they are currently written. First, we will rename variables so the data frame will be easier to work with in code:

new_names <- 
  str_c(
    rep(c('subway', 'bus', 'lirr', 
          'mta', 'access_ride', 'bridge_tunnel'), 
        each = 2), 
    rep(c("total", "change"), 
        times = 6), 
    sep = '_'
    )

colnames(d) <- c('date', new_names)
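To see what the renaming code above builds, the same construction can be run on two transportation types instead of six. This is only an illustration; toy_names is a hypothetical name.

```r
library(stringr)

# Same pattern as the code above, shortened to two transportation types.
toy_names <- str_c(
  rep(c("subway", "bus"), each = 2),      # subway, subway, bus, bus
  rep(c("total", "change"), times = 2),   # total, change, total, change
  sep = "_"
)

toy_names
# "subway_total" "subway_change" "bus_total" "bus_change"
```

The names come out in the same order as the columns in the file, which is why they can be assigned positionally with colnames.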

Also, notice that some of the variables have the wrong type. The variable date, for example, is a vector of type character (chr); let’s change it to a proper date type. All the variables holding percentages are also of type character. Finally, the newly renamed variable mta_total is of type character.

Below, explain why the variable mta_total is of type character:

Write your answer here.

Question 3: more tidying

Next, we’ll clean the variables that hold percentages as type character. We’ll do this by removing the % and converting the variables to numeric, all in one pipeline:

d <- d %>% 
  mutate( date = lubridate::as_date(date, format = '%m/%d/%Y') ) %>%  # namespaced in case lubridate is not attached
  mutate( mta_total = as.numeric(mta_total) ) %>%
  mutate( across( where(is.character), ~str_replace_all(.x, pattern = '%', replacement = '')) ) %>%
  mutate( across( where(is.character), ~as.numeric(.x)) )
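The effect of the last two steps can be checked on a toy column. This is a sketch with made-up values, and it combines the two across() calls into one for brevity.

```r
library(dplyr)
library(stringr)

# Hypothetical column holding percentages as text.
toy <- tibble(bus_change = c("43%", "38.5%"))

toy_clean <- toy %>%
  mutate(across(
    where(is.character),
    ~ as.numeric(str_remove_all(.x, "%"))   # drop the sign, then convert
  ))

toy_clean$bus_change
# 43.0 38.5
```

Any text that still does not look like a number after the % is removed becomes NA, and R prints a warning.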

In R, missing data is represented as NA. Let’s try to visualize whether we have missing data, say, as a so-called heatmap of the data frame. Finish the code I provided below, and answer the question that follows:

d %>%
mutate( observation = row_number() ) %>%
pivot_longer(
  cols = -c(date, observation),
  names_to = 'variable', 
  values_to = 'value') %>%
mutate(
  is_missing = # Finish this code to check for missing values
) %>%
ggplot() +
geom_raster(
  mapping = aes(
    x = # Finish this code
    y = # Finish this code
    fill = # Finish this code
  )
) +
scale_fill_manual(
  values = c('black', 'darkorange'), 
  breaks = c(FALSE, TRUE)
)

Does your data frame d have any missing data? If so, where?

Write your answer here.

Question 4: transforming

This dataset was used to create several graphics in the New York Times [1], in the article we reviewed in class: Penney, Veronica. How Coronavirus Has Changed New York City Transit, in One Chart. New York Times, March 8, 2021. https://www.nytimes.com/interactive/2021/03/08/climate/nyc-transit-covid.html.

The first graphic maps a three-day rolling average of the change in ridership since the lockdown in New York on March 22, 2020, for several transportation types (bridge and tunnel traffic, buses, subways, LIRR, and Metro-North). Let’s see how much the three-day rolling average affects the decoding of this graphic compared with our non-averaged values.
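A three-day rolling average can be sketched in base R as follows. Whether the NYT graphic uses a trailing or a centered window is not stated here, so the trailing window is an assumption, and the daily values are made up.

```r
# Hypothetical daily values for one transportation type.
daily <- c(3, 6, 9, 12, 6)

# Trailing three-day mean: each value averages that day and the two before it.
# stats::filter is called with its namespace because dplyr masks filter().
rolling <- as.numeric(stats::filter(daily, rep(1 / 3, 3), sides = 1))

rolling
# NA NA 6 9 9
```

The first two results are NA because a full three-day window is not yet available. Notice also that the dip on the last day is smoothed away, which is the kind of difference to look for when comparing the two graphics.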

The best way to encode the raw change for each transportation type requires that we transform our data frame from wide to long format.

More specifically, the data frame currently includes each transportation type as a different variable. Instead, we want to have one variable we will call transportation_type and each observation will include the type and the remaining information.

Thus, our goal is to make our data frame look something like this:

date        transportation_type  change
2021-09-16  subway               -57.6
2021-09-16  bus                  -57.1
2021-09-16  lirr                 -56
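The general shape of a wide-to-long transformation can be sketched on a toy data frame. The data and the names wide and long are hypothetical, and the sketch does not include the subtraction step the assignment asks for.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per transportation type.
wide <- tibble(
  date   = as.Date(c("2021-09-15", "2021-09-16")),
  subway = c(45, 42.4),
  bus    = c(44, 42.9)
)

long <- wide %>%
  pivot_longer(
    cols      = -date,                   # pivot every column except date
    names_to  = "transportation_type",   # old column names become values
    values_to = "change"                 # old cell values are collected here
  )

long
```

Two dates and two transportation types produce four rows, one per date and type combination.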

To do that, we will use the function pivot_longer and then subtract 100 from the new variable, change. Review the help file for this function. We need to specify which columns to pivot and what names to give the new columns. Complete the code I’ve provided below:

d <- d %>%
  select( contains(c('date', 'change')) ) %>%
  rename_with(~ str_remove_all(.x, '_change')) %>%
  # enter the remaining needed code here

Question 5: visualizing

Now that we have our data frame d in long format, we can create our visual. For this visual, we want to graph only the transportation types shown in the NYT article: bridge_tunnel, bus, lirr, mta, and subway. The easiest way to create the graphic will be to filter out the other transportation types from the data frame, and then graph with the ggplot function and geom_line. I’ve written some code to get you started that you’ll need to complete:

d %>%
  filter(
    
    # enter the remaining code here
    
  ) %>%

  ggplot() +
  
  scale_color_manual(
    breaks = c('bridge_tunnel', 'bus', 'subway', 'lirr', 'mta'),
    values = c('#367C9D', '#61A0CA', '#91BBF9', '#993865', '#773452')
  ) +
  
  labs(
    x = 'Date',
    y = 'Percent decline from 2019 ridership'
  ) +
  
  # enter the remaining code here
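The filter-then-plot pattern can be sketched on made-up data. The column names day, group, and value and the object names toy and p are all hypothetical; they are not the variables in d.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format data with three groups.
toy <- tibble(
  day   = rep(1:3, times = 3),
  group = rep(c("a", "b", "c"), each = 3),
  value = c(1, 2, 3, 3, 2, 1, 2, 2, 2)
)

p <- toy %>%
  filter(group %in% c("a", "b")) %>%   # keep only the groups to plot
  ggplot() +
  geom_line(aes(x = day, y = value, color = group))   # one line per group

p
```

Mapping the grouping variable to color is what lets a manual color scale assign one color per group.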

Question 6: communication — basic insights

Consider a mixed audience of Citi Bike executives working on rebalancing. When might they prefer to review our version of the graphic, which encodes the actual daily changes for each transportation type, and when might they prefer to review the NYT version, which uses a three-day rolling average? Explain your reasoning.

Write your answer here.

Question 7: communication — questions needing more exploration

In the NYT graphic, the publisher did not include changes in ridership for the Citi Bike bike share. If we added the changes in Citi Bike rides alongside the transportation types we just graphed, how do you think the changes in Citi Bike rides would compare with those of the other transportation types, and why? Explain your answer to this hypothetical to your audience, a Citi Bike analytics executive.

Write your answer here.

Question 8: Preparing a reproducible communication

In submitting this individual assignment, you are representing that your answers are your own. Properly cite all resources used in completing this assignment.

Knit your answers in this R Markdown file into an html file. Submit both files to CourseWorks (the lastname-firstname-hw1.rmd and the knitted lastname-firstname-hw1.html). We should be able to reproduce your html file just by opening your rmd file and knitting. That only works if you properly set up your RStudio project.


  1. Get a free subscription from Columbia’s library: https://clio.columbia.edu/catalog/15561089
