Homework 2: graphics practice

Here is this homework’s R markdown (rmd) file.

For this homework assignment, we’ll continue exploring data related to our Citi Bike case study as a way to practice the concepts we’ve been discussing in class. In our third class discussion, we briefly considered an exploratory visualization of activity and docking station (im)balance, conducted in 2013 by Columbia University’s Center for Spatial Research. As practice in understanding encodings, let’s review and reconstruct one of the Center’s graphics, titled: “CITI BIKE HOURLY ACTIVITY AND BALANCE”.

Preliminary

Place the homework 2 R markdown file in your RStudio project directory for this course, just as you did for your first homework.

You can download and zoom in on a high resolution pdf of the graphic here: https://c4sr.columbia.edu/sites/default/files/Activity_Matrix_Composite.pdf.

Question 1(a) and 1(b) — communication, data types and visual encodings

In the Center’s graphic we’re reviewing, what variables and data types have been encoded? Explain your answer to a Citi Bike analytics executive.

Write your answer here.

Explain to the same audience: to what visual channels were those variables mapped?

Write your answer here.

Question 2 — communication, coordinate systems

Explain to the same Citi Bike analytics executive what type of coordinate system was used for this Activity and Balance graphic.

Write your answer here.

Question 3 — communication, comparing encoded data

From our discussions, we listed several ways we can compare visually-encoded data, from more effective to less effective.

From the Center’s Activity and Balance graphic, what type(s) of visual comparisons do the encodings enable? Explain to a mixed audience of Citi Bike analytics executives.

Write your answer here.

Question 4 — workflow, tidying and transforming data

Next, we will reconstruct the main components of this graphic together. I’ll set up most of the code, and you will fill in the gaps (each marked with a code comment) as your answers.

To get started, we will first load our main library,

library(tidyverse)

and gather data from the New York City Bike Share data repository: https://ride.citibikenyc.com/system-data. The code chunk below downloads the zip file into the data subdirectory you created in homework 1, unless the file has already been saved there. We then read the csv data into an R data frame object we call df:

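# Save the archive under its real extension so read_csv() recognizes it as a zip file
savefile <- "data/201909-citibike-tripdata.csv.zip"

# Download only if the file has not already been saved
if (!file.exists(savefile)) {
  url <- "https://s3.amazonaws.com/tripdata/201909-citibike-tripdata.csv.zip"
  download.file(url = url, destfile = savefile, mode = "wb")
}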

df <- read_csv(savefile)

Next, we will tidy our data frame by renaming variables.

df <- df %>% rename_with(~ gsub(' ', '_', .) )

Explore the data frame for missing data. You’ll notice that some start and end station names are missing. We cannot reconstruct Columbia University Center for Spatial Research’s graphic without these values, so we will filter those NA values out of our data frame, keeping in mind that our result is now conditional on the data we still have. We also want to consider only observations with an end_station_name that is also used as a start_station_name.

df <- 
  df %>% 
  filter(
    # keep rides where both the start and end station names are present
    if_all(contains('station_name'), ~ !is.na(.)),
    end_station_name %in% start_station_name
  )
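One way to explore missingness is to count the NA values in each column. The following is a minimal sketch on a made-up data frame (toy and its values are assumptions for illustration only); the same call should work on df:

```r
# Toy data frame (made-up values) standing in for the imported trip data
toy <- data.frame(
  start_station_name = c("A", NA, "B", "C"),
  end_station_name   = c("B", "A", NA, NA),
  tripduration       = c(300, 420, 180, 600)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Columns with a nonzero count are the ones to look at before deciding how to filter.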

We need to change the structure of our data so that we can map data values onto the visual encodings used in the Center’s graphic.

More specifically, we need to know the number of rides both starting and ending at each station name at each hour of the day, averaged over the number of days in our data set. To get there, we will pivot two variables, start_station_name and end_station_name, into long format, and create variables for day of month (day) and hour of day (hour), like so:

df <- 
  df %>%
  pivot_longer(
    cols = c(start_station_name, end_station_name), 
    names_to = "start_end",
    values_to = "station_name"
  ) %>%
  mutate(
    day  = format( if_else(start_end == "start_station_name", starttime, stoptime), "%d" ),
    hour = format( if_else(start_end == "start_station_name", starttime, stoptime), "%H" )
  ) %>%
  mutate(
    station_name = fct_reorder(station_name, desc(station_name))
  )

The pivot creates separate observations, from the perspective of a docking station rather than a ride, for both types of events: a bike being parked and a bike leaving.
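To see what the pivot does, it may help to try it on two made-up rides first. This sketch assumes a toy tibble (toy_rides, with invented times and station names "A" and "B") that mirrors the column names used above; each ride should become two station events:

```r
library(dplyr)
library(tidyr)

# Two made-up rides: A to B in the morning, B to A in the evening
toy_rides <- tibble(
  starttime = as.POSIXct(c("2019-09-01 08:05:00", "2019-09-01 17:40:00"), tz = "UTC"),
  stoptime  = as.POSIXct(c("2019-09-01 08:20:00", "2019-09-01 18:02:00"), tz = "UTC"),
  start_station_name = c("A", "B"),
  end_station_name   = c("B", "A")
)

toy_events <- toy_rides %>%
  # one row per (ride, station event) instead of one row per ride
  pivot_longer(
    cols = c(start_station_name, end_station_name),
    names_to = "start_end",
    values_to = "station_name"
  ) %>%
  # a departure takes its hour from starttime, an arrival from stoptime
  mutate(
    hour = format(if_else(start_end == "start_station_name", starttime, stoptime), "%H")
  )

print(toy_events)
```

The two rides turn into four rows, and each row carries the hour at which that station saw the bike leave or arrive.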

Are you starting to see that tidying and transforming data are frequently useful prerequisites to making graphics that provide real insight? Hint: the correct answer is “Yes, and this is awesome!”

Write your answer here.

Question 5 — workflow, transforming data

With the pivoted data frame, we can now group our data by station name and hour, and calculate the averages we’ll need to map onto visual variables.

Create new variables activity and balance, where activity holds the average number of rides or observations at each station name each hour, and where balance holds the average difference between rides beginning at the station and rides ending at the station.

df <- 
  df %>%
  group_by(station_name, hour, .drop = FALSE) %>%
  
  summarise(
    activity = # complete this code
    balance  = # complete this code
  ) %>%
  
  ungroup()

Inspect this data frame, and compare with the original imported data frame to understand how each step of the above code changed its structure. Start to consider how we will map these data variables onto the visual variables used in the Center’s Activity and Balance graphic.
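If the grouping mechanics are unfamiliar, they can be tried on a toy data frame first. The sketch below uses made-up data and hypothetical column names (station, hour, kind), and only shows how group_by and summarise collapse rows into one row per group; it does not compute the averages asked for above:

```r
library(dplyr)

# Made-up station events: each row is one bike leaving ("start") or arriving ("end")
toy_events <- tibble(
  station = c("A", "A", "A", "B", "B"),
  hour    = c("08", "08", "17", "08", "08"),
  kind    = c("start", "end", "start", "end", "end")
)

toy_summary <- toy_events %>%
  group_by(station, hour) %>%
  summarise(
    events = n(),                  # number of rows in each group
    starts = sum(kind == "start"), # rows in the group that are departures
    .groups = "drop"               # return an ungrouped data frame
  )

print(toy_summary)
```

Each combination of station and hour that appears in the data ends up as a single row holding the summaries for that group.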

In our third discussion, we considered how to scale data values to map their ranges to the appropriate ranges for each channel of color: hue, chroma (saturation), and luminance. We’ll do that next.

Question 6 — workflow, scaling data

Complete the code below to properly scale your data variables to the ranges of your visual variables. To get you started, I’ve written the following code:

library(scales)

df <-
  df %>%
  mutate(
    hue = if_else(balance < 0, 50, 200),
    saturation =
      rescale(
        abs(balance),
        from = # complete this code
        to   = # complete this code
      ),
    luminance =
      rescale(
        activity,
        from = # complete this code
        to   = # complete this code
      )
  )
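As a reminder of how rescale maps one range onto another, here is a minimal sketch on made-up numbers. The input range of 0 to 10 and the output range of 40 to 100 are assumptions chosen only for illustration, not the ranges to use in your answer:

```r
library(scales)

# Made-up data values that span 0 to 10
x <- c(0, 5, 10)

# Map the data range [0, 10] linearly onto the channel range [40, 100]
scaled <- rescale(x, from = c(0, 10), to = c(40, 100))
print(scaled)
```

The endpoints of from land on the endpoints of to, and values in between are placed proportionally.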

Question 7 — workflow, mapping data to visual channels

Finally, we are ready to map our data onto the visual variables. The Center’s Activity and Balance graphic resembles a so-called heatmap.

Use the grammar of graphics to create tiles of information, using the function geom_tile. To do that, first review the help file for that function, paying particular attention to the aesthetics you’ll need to specify.
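For orientation, a minimal sketch of geom_tile on a made-up two-by-two grid follows. The data frame toy_grid, its column names, and the hex colors are all assumptions for illustration; scale_fill_identity tells ggplot2 to use the hex strings as they are rather than mapping them through a color scale:

```r
library(ggplot2)

# Made-up grid: one row per tile, with a ready-made hex color for each
toy_grid <- data.frame(
  col_id   = c("a", "b", "a", "b"),
  row_id   = c("x", "x", "y", "y"),
  fill_hex = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A")
)

p_toy <- ggplot(toy_grid) +
  geom_tile(
    mapping = aes(x = col_id, y = row_id, fill = fill_hex),
    width = 0.95,   # slightly under 1 leaves a thin gap between tiles
    height = 0.95
  ) +
  scale_fill_identity()

p_toy
```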

Further, to map the individual channels of color, you can use the function hcl from base R’s grDevices package, which is loaded by default in every R session. It works very similarly to, though a bit less optimally than, the example I showed you from my R package, hsluv_hex. You may also use mine, but that will require you to install it.
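A minimal sketch of calling hcl on made-up channel values is below. The hue, chroma, and luminance numbers are assumptions for illustration; the function is vectorized, so it returns one hex color string per input value:

```r
# Two made-up hues with the same chroma and luminance
toy_colors <- hcl(h = c(50, 200), c = 60, l = 70)
print(toy_colors)
```

Because the result is a character vector of hex colors, it can be stored in a column of a data frame like any other variable.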

I’ve started the code for you below. Add code where prompted.

p <- 
  df %>%
  ggplot() +
  scale_fill_identity() +
  geom_tile(
    mapping = aes(
      # complete this code
    ),
    width = 0.95,
    height = 0.95
  ) +
  theme_dark() +
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#333333"),
    axis.text.x = element_text(color = "#888888", size = 16 / .pt),
    axis.text.y = element_text(color = "#888888", size =  7 / .pt)
  ) +
  labs(x = "", y = "")

# The next line of code saves the graphic as a pdf in your working
# directory so that you can open it separately and zoom in while reviewing it.
ggsave("activity_balance2019.pdf", plot = p, width = 8, height = 40)

p

Question 8 — communication, decoding and interpretation: critical thinking

We’ve finished roughly reconstructing the Center’s Activity and Balance graphic, updated with later data from September 2019, six years after the original graphic but still before the pandemic. We find that the patterns originally described by the Center still show up. Review their description of the Activity and Balance graphic.

Notice that the Center’s description of its graphic and data does not, however, discuss whether empty and full docking stations, and rebalancing efforts by Citi Bike, have any effect on the patterns they describe.

How might 1) empty and full docking stations and 2) Citi Bike’s rebalancing of bikes affect the visual patterns in our graphic? Explain to a Citi Bike analytics executive.

Write your answer here.

Knit and submit

In submitting this individual assignment, you are representing that your answers are your own. Properly cite all resources used in completing this assignment.

Knit your answers in this R markdown file into an html file. Submit both files to CourseWorks (the lastname-firstname-hw2.rmd and the knitted lastname-firstname-hw2.html). We should be able to reproduce your html file just by opening your rmd file and knitting.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.