Here is this homework’s R markdown (rmd) file.
In our discussion of the Citi Bike case study, we started considering the effect of the pandemic on ridership and rebalancing, and how we might find some insight by looking at data related to other transportation systems in the city. In this homework, we will continue exploratory data analysis for this case study as prerequisites for communicating with particular audiences for particular purposes.
If you have not already, install the tidyverse and distill R packages.
Create a directory on your computer for your homework and place this file in it. In RStudio, create a project in that same directory. Then, when you import the data, you only need to specify the subdirectory as part of the file path. This preparatory step helps make your work reproducible.
For this assignment, import data on New York City ridership from https://new.mta.info/coronavirus/ridership. Open the website and click “Download all the data” to get the csv file you’ll use for this homework. Create a subdirectory called data and place the downloaded csv file in it. Name the file MTA_recent_ridership_data.csv.
Load the tidyverse package (which includes dplyr and ggplot2 functions) inside the code chunk below:
# enter code to load the libraries here
Import the data into a data frame named d and show a summary. (Hint: in your console, after you load the tidyverse library, you can type ? before read_csv or glimpse to learn more about functions for this purpose.)
Use these two functions to import and summarise your data variables:
# enter code to import and summarise your data frame variables here.
The column or variable names will be difficult to work with as they are currently written. First, we will rename variables so the data frame will be easier to work with in code:
new_names <-
  paste(
    rep(c('subway', 'bus', 'lirr',
          'mta', 'access_ride', 'bridge_tunnel'),
        each = 2),
    rep(c("total", "change"),
        times = 6),
    sep = '_'
  )
colnames(d) <- c('date', new_names)
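To see what the renaming code above produces, it can help to build the pieces separately. The sketch below recreates new_names without touching the data; the intermediate names types and measures are introduced here only for illustration.

```r
# Each transportation type is repeated twice in a row (each = 2) ...
types <- rep(c('subway', 'bus', 'lirr',
               'mta', 'access_ride', 'bridge_tunnel'),
             each = 2)

# ... while the pair total/change is cycled six times (times = 6).
measures <- rep(c('total', 'change'), times = 6)

# paste() combines the two vectors element by element.
new_names <- paste(types, measures, sep = '_')

head(new_names, 4)
# "subway_total" "subway_change" "bus_total" "bus_change"

length(new_names)
# 12
```

Together with 'date', that makes 13 names, so the assignment to colnames(d) assumes the imported data frame has 13 columns in this order.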
Also, notice that some of the variables are of the wrong type. The variable date, for example, is a vector of type character. Let’s change this to a proper date type. All the variables holding a percentage are also of type character. Finally, the now renamed variable mta_total is of type character.
Below, explain why the variable mta_total is of type character:
Write your answer here.
Next, we’ll clean the variables that hold percentages as type character. We’ll do this by removing the % and recasting the variables, all in one pipeline:
d <- d %>%
  mutate( date = as.Date(date, format = '%m/%d/%Y') ) %>%
  mutate( mta_total = as.numeric(mta_total) ) %>%
  mutate_if( is.character, str_replace_all, pattern = '%', replacement = '' ) %>%
  mutate_if( is.character, as.numeric )
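The two conversions in the pipeline above can be tried on small standalone vectors first. This is a minimal sketch with made-up values; the names raw_date, raw_change, parsed_date, and parsed_change are hypothetical and do not come from the data file.

```r
library(stringr)

# Made-up text values, for illustration only.
raw_date   <- c('09/16/2021', '09/17/2021')
raw_change <- c('42.4%', '105%')

# '%m/%d/%Y' reads the text as month/day/four-digit year.
parsed_date <- as.Date(raw_date, format = '%m/%d/%Y')

# Remove the percent sign, then convert the remaining text to numbers.
parsed_change <- as.numeric(
  str_replace_all(raw_change, pattern = '%', replacement = '')
)

class(parsed_date)
class(parsed_change)
```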
In R, missing data is represented as NA. Does your data frame d have any missing data? If so, where?
Write your answer here.
This dataset was used to create several graphics in the New York Times article we reviewed in class: Penney, Veronica. How Coronavirus Has Changed New York City Transit, in One Chart. New York Times, March 8, 2021. https://www.nytimes.com/interactive/2021/03/08/climate/nyc-transit-covid.html.
The first graphic plots a three-day rolling average of the change in ridership since the lockdown in New York on March 22, 2020, for several of the transportation types (bridge and tunnel traffic, buses, subways, LIRR, and Metro-North). Let’s see how much the three-day rolling average affects the decoding of this graphic compared with our non-averaged values.
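As a rough illustration of what a three-day rolling average does, the sketch below smooths a short made-up series. It assumes a trailing window (each day averaged with the two days before it); the article does not say here whether its window is trailing or centered, and the names daily and rolling are hypothetical.

```r
# Made-up daily percent changes, for illustration only.
daily <- c(-60, -30, -90, -60, -30)

# Trailing three-day mean: each value averages that day and the two before it.
# stats::filter is called with its namespace because dplyr also exports filter().
rolling <- as.numeric(stats::filter(daily, rep(1 / 3, 3), sides = 1))

# The first two entries are NA because no full three-day window exists yet.
rolling
```

In this toy series the raw values swing between -30 and -90, while every complete three-day window averages to about -60, which is the kind of day-to-day variation a rolling average hides.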
The best way to encode the raw change for each transportation type requires that we transform our data frame from wide to long format.
More specifically, the data frame currently includes each transportation type as a different variable. Instead, we want one variable, which we will call transportation_type, so that each observation includes the type and the remaining information.
Thus, our goal is to make our data frame look something like this:
date | transportation_type | change |
---|---|---|
2021-09-16 | subway | -57.6 |
2021-09-16 | bus | -57.1 |
2021-09-16 | lirr | -56 |
… | … | … |
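As a rough illustration of this kind of wide-to-long reshaping, the sketch below applies tidyr's pivot_longer to a small made-up data frame unrelated to the ridership data. The data and the column names day, city, and temperature are hypothetical.

```r
library(tidyr)
library(tibble)

# A made-up wide data frame: one column per city (hypothetical example data).
wide <- tibble(
  day    = c('Mon', 'Tue'),
  boston = c(10, 12),
  austin = c(25, 27)
)

# Stack every column except `day` into name/value pairs.
long <- pivot_longer(
  wide,
  cols      = -day,
  names_to  = 'city',
  values_to = 'temperature'
)

long
```

Each row of the wide table becomes one row per city in the long table, so two rows and two city columns yield four observations.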
To do that, we will use the function pivot_longer and then subtract 100 from the new variable called change. Review the help file for this function. Now, we need to specify which columns to pivot and what names to give them. Complete the code below:
d <- d %>%
  select( contains(c('date', 'change')) ) %>%
  rename_with(~ gsub('_change', '', .x) ) %>%
  # enter the remaining needed code here
Now that we have our data frame d in long format, we can create our visual. For this visual, we want to graph only the transportation types shown in the NYT article: bridge_tunnel, bus, lirr, mta, and subway. The easiest way to create the graphic is to filter the other transportation types out of the data frame and plot with the ggplot function and geom_line. I’ve written some code to get you started that you’ll need to complete:
d %>%
  filter(
    # enter the remaining code here
  ) %>%
  ggplot() +
  scale_color_manual(
    breaks = c('bridge_tunnel', 'bus', 'subway', 'lirr', 'mta'),
    values = c('#367C9D', '#61A0CA', '#91BBF9', '#993865', '#773452')
  ) +
  labs(
    x = 'Date',
    y = 'Percent decline from 2019 ridership'
  ) +
  # enter the remaining code here
Which version of the data encodings would a mixed audience of Citi Bike executives find easier to read (decode): our version, which encodes the actual daily changes for each transportation type, or the NYT version, which uses a three-day rolling average? Why?
Write your answer here.
The NYT article did not include changes in ridership for the Citi Bike bike share. If we graphed the changes in Citi Bike rides alongside the transportation types we just graphed, how do you think the changes in Citi Bike rides would compare with these other transportation types, and why? Explain to a Citi Bike analytics executive.
Write your answer here.
In submitting this individual assignment, you are representing that your answers are your own. Properly cite all resources used in completing this assignment.
Knit your answers in this R markdown file into an html file. Submit both files to courseworks (the lastname-firstname-hw1.rmd and the knitted lastname-firstname-hw1.html). We should be able to reproduce your html file just by opening your rmd file and knitting.