Homework 2: graphics practice

Here is this homework’s R markdown (rmd) file.

For this homework assignment, we’ll continue exploring data related to our Citi Bike case study as a way to practice the concepts we’ve been discussing in class. In our third class discussion, we briefly considered an exploratory visualization of activity and docking station (im)balance, conducted in 2013 by Columbia University’s Center for Spatial Research. As practice in understanding encodings, let’s review and reconstruct one of the Center’s graphics, titled: “CITI BIKE HOURLY ACTIVITY AND BALANCE”.

Preliminary

Place the R markdown file for homework two in your RStudio project directory for this course, just as you did for your first homework.

You can download and zoom in on a high resolution pdf of the Spatial Information Design Lab’s graphic here: http://c4sr.spatialinformationdesignlab.org/sites/default/files/Activity_Matrix_Composite.pdf.

Question 1 — communication, identifying data types and visual encodings

In the Center for Spatial Research’s graphic we’re reviewing, what data variables and their types have been encoded? For each variable, to what visual channel(s) and attribute(s) has it been mapped?

Write your answer here.

Question 2 — communication, assessing effectiveness of visual encodings

From our discussions, we listed several ways we can compare visually encoded data, from more effective to less effective. See, e.g., lecture 3, slide 7.

From the Center for Spatial Research’s Activity and Balance graphic, what type(s) of comparison(s) do the visual encodings enable its audience to make? Explain your answer to a mixed audience of Citi Bike analytics executives.

Write your answer here.

Question 3 — communication, interpreting visual encodings

Consider the uses of this graphic. What might this graphic help Citi Bike executives understand in terms of decision making for their rebalancing efforts? Explain your answer to a mixed audience of Citi Bike analytics executives.

Write your answer here.

Question 4 — workflow, tidying and transforming data

Next, we will reconstruct the main components of this graphic together. I’ll set up most of the code, and you will fill in the gaps as your answers (as before, I prompt you with a code comment).

To get started, we will first load our main library,

library(tidyverse)

and gather data from the New York City Bike Share data repository: https://ride.citibikenyc.com/system-data. The first time the code chunk below is run, it will download the zip file and save it in the data subdirectory you created in homework 1, unless the file has already been saved. Then we read the csv file into an R data frame object we call df:

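# Keep the .zip extension so that read_csv() recognizes the compressed file
savefile <- "data/201909-citibike-tripdata.csv.zip"

if (!file.exists(savefile)) {
  url <- "https://s3.amazonaws.com/tripdata/201909-citibike-tripdata.csv.zip"
  # mode = "wb" writes the zip as a binary file, which matters on Windows
  download.file(url = url, destfile = savefile, mode = "wb")
}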

df <- read_csv(savefile)

Next, we will tidy our data frame by renaming variables, replacing the spaces in their names with underscores.

df <- df %>% rename_with(~ gsub(' ', '_', .) )

Explore the data frame for missing data. You’ll notice that some start and end station names are missing. We cannot reconstruct Columbia University Center for Spatial Research’s graphic without these values, so we will filter those NA values out of our data frame, keeping in mind that our result is now conditional on the data we still have. We also want to consider only observations with an end_station_name that is also used as a start_station_name.

df <- 
  df %>% 
  filter(
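    # if_all() keeps a ride only when both station names are present;
    # if_any() would keep rides that are missing one of the two names
    if_all(contains('station_name'), ~ !is.na(.)),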
    end_station_name %in% start_station_name
  )
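As a minimal sketch of what this filter does, the toy data frame below (its station names are made up for illustration, not taken from the Citi Bike data) first counts the missing station names in each column, then keeps only rides that have both names present and whose end station also appears as a start station. It assumes a dplyr version that provides if_all.

```r
library(dplyr)

# Hypothetical toy rides; the station names are invented for illustration.
toy <- tibble(
  start_station_name = c("A", NA, "B", "C"),
  end_station_name   = c("B", "A", NA, "D")
)

# Count the missing values in each station-name column.
missing_counts <- toy %>%
  summarise(across(contains("station_name"), ~ sum(is.na(.x))))
missing_counts

# Keep rides where both names are present and the end station
# is also used somewhere as a start station.
kept <- toy %>%
  filter(
    if_all(contains("station_name"), ~ !is.na(.x)),
    end_station_name %in% start_station_name
  )
kept
```

Only the first toy ride survives: two rides are missing a station name, and the last ride ends at a station that is never used as a start station.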

We need to change the structure of our data so that we can map data values onto the visual encodings used in the Center for Spatial Research’s graphic.

More specifically, we need to count the number of rides both starting and ending at each station name at each hour of the day, averaged over the number of days in our data set. To do that, we’ll need to pivot some of the data and create new variables: we will pivot two variables, start_station_name and end_station_name, into long format, and create variables for day of month (day) and hour of day (hour), like so:

df <- 
  df %>%
  pivot_longer(
    cols = c(start_station_name, end_station_name), 
    names_to = "start_end",
    values_to = "station_name"
  ) %>%
  mutate(
    day  = format( if_else(start_end == "start_station_name", starttime, stoptime), "%d" ),
    hour = format( if_else(start_end == "start_station_name", starttime, stoptime), "%H" )
  ) %>%
  mutate(
    station_name = fct_reorder(station_name, desc(station_name))
  )

The pivot results in creating separate observations, from the perspective of a docking station (instead of the perspective of a ride), for both types of events: a bike parking and a bike leaving.
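To see that change of perspective on a small scale, here is a sketch with two made-up rides. The ride_id column and the timestamps are hypothetical and only serve to make the rows easy to follow; the pivot itself mirrors the code above.

```r
library(dplyr)
library(tidyr)

# Two hypothetical rides between made-up stations "A" and "B".
rides <- tibble(
  ride_id   = 1:2,
  starttime = as.POSIXct(c("2019-09-01 08:15:00", "2019-09-01 17:40:00"), tz = "UTC"),
  stoptime  = as.POSIXct(c("2019-09-01 09:05:00", "2019-09-01 17:55:00"), tz = "UTC"),
  start_station_name = c("A", "B"),
  end_station_name   = c("B", "A")
)

# Each ride becomes two station events: one start and one end.
events <- rides %>%
  pivot_longer(
    cols = c(start_station_name, end_station_name),
    names_to = "start_end",
    values_to = "station_name"
  ) %>%
  mutate(
    # Use the start time for start events and the stop time for end events.
    hour = format(if_else(start_end == "start_station_name", starttime, stoptime), "%H")
  )

events %>% select(ride_id, start_end, station_name, hour)
```

The two rides turn into four rows, and the first ride contributes an event at hour 08 for station A and an event at hour 09 for station B.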

With the pivoted data frame, we can now group our data by station name and hour, and calculate averages we’ll need to map onto visual variables.

Create new variables activity and balance, where activity holds the average number of rides or observations at each station name each hour and where balance holds the average difference between rides beginning at the station and rides ending at the station. While the Center for Spatial Research’s graphic only considered weekdays, let’s consider all days of the week.

df <- 
  df %>%
  group_by(station_name, hour, .drop = FALSE) %>%
  
  summarise(
    activity = # complete this code
    balance  = # complete this code
  ) %>%
  
  ungroup()
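The .drop = FALSE argument in group_by matters because station_name is stored as a factor. The sketch below uses a made-up events table to show the effect on plain counts only; it deliberately leaves the averaging over days and the start/end difference to the exercise. The level "C" is a hypothetical station with no events.

```r
library(dplyr)

# Hypothetical station events; level "C" never occurs in the data.
events <- tibble(
  station_name = factor(c("A", "A", "B"), levels = c("A", "B", "C")),
  hour         = c("08", "08", "09")
)

# With .drop = FALSE, factor levels without observations are kept as groups.
counts <- events %>%
  group_by(station_name, hour, .drop = FALSE) %>%
  summarise(n_events = n(), .groups = "drop")

counts
```

Station C still appears in the result with a count of zero, instead of silently disappearing from the summary.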

Inspect this data frame, and compare with the original imported data frame to understand how each step of the above code changed its structure. Start to consider how we will map these data variables onto the visual variables used in the Center for Spatial Research’s Activity and Balance graphic.

In our third discussion, we considered how to scale data values to map their data ranges to the appropriate visual ranges for each channel of color: hue, chroma (saturation), and luminance. We’ll do that next.

Question 5 — workflow, scaling data

Complete the code below to properly scale your data variables to the ranges of your visual variables to roughly reconstruct the Lab’s graphical mappings. Hint: you need to consider two ranges for each variable, the range of the data and the range of the visual channel. To get you started, I’ve written the following code:

library(scales)

df <-
  df %>%
  mutate(
    hue = if_else(balance < 0, 50, 200),
    saturation =
      rescale(
        abs(balance),
        from = # complete this code
        to   = # complete this code
      ),
    luminance =
      rescale(
        activity,
        from = # complete this code
        to   = # complete this code
      )
  )
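As a reminder of how the two ranges interact, here is a small sketch of rescale on arbitrary numbers. The ranges below are chosen only for illustration and are not the answer to this question.

```r
library(scales)

x <- c(0, 5, 10, 15)

# Map the data range [0, 10] linearly onto the visual range [0, 100].
scaled <- rescale(x, from = c(0, 10), to = c(0, 100))
scaled
# The value 15 lies outside `from`, so it lands outside `to` (at 150).

# squish() clamps out-of-range values to the nearest limit of the range.
clamped <- squish(scaled, range = c(0, 100))
clamped
```

Data values outside the from range are extrapolated rather than clipped, so the choice of from determines which observations can exceed the visual range.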

Question 6 — workflow, mapping data to visual channels

In the final step of reconstructing the Lab’s data-to-visual mappings, we are ready to map our data onto the visual variables. The Center’s Activity and Balance graphic resembles a so-called heatmap.

Use the grammar of graphics to create tiles of information, using the function geom_tile. To do that, first review the help file for that function, paying particular attention to the aesthetics you’ll need to specify.

Further, to map the individual channels of color, you can use the function hcl, which is part of base R’s grDevices package and is therefore available without loading anything extra. It works very similarly to, though a bit less optimally than, the hsluv_hex example I showed you from my R package. You may also use mine, but that will require you to install it from GitHub.
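For orientation, the sketch below shows what hcl returns for a few arbitrary hue, chroma, and luminance values. The numbers are illustrative only and are not tied to the scaling you chose above.

```r
# hcl() comes from grDevices, which R attaches by default.
# Arguments: h = hue (degrees), c = chroma, l = luminance.
cols <- hcl(
  h = c(50, 200, 200),
  c = c(80, 80, 20),
  l = c(70, 70, 30)
)

# The result is a character vector of hex color codes, one per input row.
cols
```

Because hcl is vectorized and returns hex strings, its output can be used anywhere R expects literal colors.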

I’ve started the code for you below. Add code where prompted.

p <- 
  df %>%
  ggplot() +
  scale_fill_identity() +
  geom_tile(
    mapping = aes(
      # complete this code
    ),
    width = 0.95,
    height = 0.95
  ) +
  theme_dark() +
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#333333"),
    axis.text.x = element_text(color = "#888888", size = 16 / .pt),
    axis.text.y = element_text(color = "#888888", size =  7 / .pt)
  ) +
  labs(x = "", y = "")

# The next line of code will save the graphic as a pdf onto your working
# directory so that you can separately open and zoom in while reviewing it.
ggsave("activity_balance2019.pdf", plot = p, width = 8, height = 40)

p

Question 7 — communication and interpretation

Explain your choices of scaling for data and visual ranges in terms of how an audience should interpret the data from the visual encodings.

How would widening or narrowing the data ranges and, separately, widening or narrowing the visual ranges have affected an audience’s interpretation?

Write your answer here.

Question 8 — communication, decoding and interpretation: critical thinking

We’ve finished roughly reconstructing the Center’s Activity and Balance graphic, updated with later data from September 2019, six years after the original graphic but still before the pandemic. You should find that some of the patterns originally described by the Center still show up. Review the Center’s description of the Activity and Balance graphic: https://c4sr.columbia.edu/projects/citibike-rebalancing-study.

Notice that the Center’s description of its graphic and data does not, however, discuss whether empty and full docking stations, and rebalancing efforts by Citi Bike, have any effect on the patterns shown in the graphic.

How would 1) empty and full docking stations and 2) Citi Bike rebalancing bikes affect audiences’ interpretation of the visual patterns revealed by the Center for Spatial Research’s graphic? Explain these issues of interpretation to a Citi Bike analytics executive.

Write your answer here.

Knit and submit

In submitting this individual assignment, you are representing that your answers are your own. Properly cite all resources used in completing this assignment.

Knit your answers in this R markdown file into an html file. Submit both files to Courseworks (the lastname-firstname-hw2.rmd and the knitted lastname-firstname-hw2.html). We should be able to reproduce your html file just by opening your rmd file and knitting.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.