Here is this homework’s R markdown (rmd) file.
For this homework assignment, we’ll continue exploring data related to our Citi Bike case study as a way to practice the concepts we’ve been discussing in class. In our third class discussion, we briefly considered an exploratory visualization of activity and docking station (im)balance, conducted in 2013 by Columbia University’s Center for Spatial Research. As practice in understanding encodings, let’s review and reconstruct one of the Center’s graphics, titled: “CITI BIKE HOURLY ACTIVITY AND BALANCE”.
Place the homework two R markdown file into your RStudio project directory for this course, just as for your first homework.
You can download and zoom in on a high resolution pdf of the graphic here:
In the Center’s graphic we’re reviewing, what variables and data types have been encoded? Explain your answer to a Citi Bike analytics executive.
Write your answer here.
Explain to the same audience, to what visual channels were those variables mapped?
Write your answer here.
Explain to the same Citi Bike analytics executive, what type of coordinate system was used for this Activity and Balance graphic?
Write your answer here.
From our discussions, we listed several ways we can compare visually-encoded data, from more effective to less effective.
From the Center’s Activity and Balance graphic, what type(s) of visual comparisons do the encodings enable? Explain to a mixed audience of Citi Bike analytics executives.
Write your answer here.
Next, we will reconstruct the main components of this graphic together. I’ll set up most of the code, and you will fill in the gaps as your answers (I prompt you with a code comment).
To get started, we will first load our main library,
and gather data from the New York City Bike Share data repository:
The first time the code chunk below is run, it will download and save the zip file into the subdirectory called data that you created in homework 1, if the file hasn’t already been saved. Then, we read the csv file into an R data frame object we call df.
savefile <- "data/201909-citibike-tripdata.csv"
# Download only if the file is not already saved locally.
if (!file.exists(savefile)) {
  url <- ""
  download.file(url = url, destfile = savefile)
}
df <- read_csv(savefile)
Next, we will tidy our data frame by renaming variables.
df <- df %>% rename_with(~ gsub(' ', '_', .) )
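As a small illustration of what this renaming does, here is a sketch on a made-up tibble. The object names toy and toy_renamed and the station names are hypothetical, not taken from the Citi Bike file.

```r
library(dplyr)

# Hypothetical toy data whose column names contain spaces.
toy <- tibble(`start station name` = "A St", `end station name` = "B Ave")

# rename_with() applies the function to every column name;
# gsub() swaps each space for an underscore.
toy_renamed <- toy %>% rename_with(~ gsub(" ", "_", .))

names(toy_renamed)
```

Names without spaces can be typed in later code without backticks.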
Explore the data frame for missing data. You’ll notice that some start and end station names are missing. We cannot reconstruct Columbia University Center for Spatial Research’s graphic without these values, so we will filter those NA values out of our data frame, keeping in mind that our result is now conditional on the data we still have. We also want to consider only observations with an end_station_name that is also used as a start_station_name.
df <-
  df %>%
  filter(
    # if_all() keeps a ride only when every station name column is present.
    if_all(contains('station_name'), ~ !is.na(.)),
    end_station_name %in% start_station_name
  )
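To make the two filtering conditions concrete, here is a minimal sketch on made-up rides. The data and the names toy and toy_kept are hypothetical. It assumes if_all() is the intended helper, so that a ride is kept only when both station names are present.

```r
library(dplyr)

# Hypothetical toy rides; NA marks a missing station name.
toy <- tibble(
  start_station_name = c("A St", "B Ave", NA,     "A St", "B Ave"),
  end_station_name   = c("B Ave", "A St", "A St", "C Rd", NA)
)

# Count missing values in each column before filtering.
colSums(is.na(toy))

# Keep rides with both names present and whose end station
# also appears somewhere as a start station.
toy_kept <- toy %>%
  filter(
    if_all(contains("station_name"), ~ !is.na(.)),
    end_station_name %in% start_station_name
  )

toy_kept
```

Only the first two rides survive: the third and fifth have a missing name, and the fourth ends at a station that never appears as a start station.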
We need to change the structure of our data so that we can map data values onto the visual encodings used in the Center’s graphic.
More specifically, we need to know the number of rides both starting and ending at each station name at each hour of the day, averaged over the number of days in our data set. We’ll need to pivot some of the data and create new variables. Specifically, we will pivot two variables, start_station_name and end_station_name, into long format, and create variables for day of month (day) and hour of day (hour), like so:
df <-
  df %>%
  pivot_longer(
    cols = c(start_station_name, end_station_name),
    names_to = "start_end",
    values_to = "station_name"
  ) %>%
  # A start event takes its time from starttime, an end event from stoptime.
  mutate(
    day = format( if_else(start_end == "start_station_name", starttime, stoptime), "%d" ),
    hour = format( if_else(start_end == "start_station_name", starttime, stoptime), "%H" )
  ) %>%
  mutate(
    station_name = fct_reorder(station_name, desc(station_name))
  )
The pivot results in creating separate observations, from the perspective of a docking station (instead of the perspective of a ride), for both types of events: a bike parking and a bike leaving.
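To see this change of perspective on a small scale, here is a sketch with a single made-up ride. The times, station names, and the object names ride and events are hypothetical.

```r
library(dplyr)
library(tidyr)

# One hypothetical ride from "A St" to "B Ave".
ride <- tibble(
  starttime = as.POSIXct("2019-09-01 08:55:00", tz = "UTC"),
  stoptime  = as.POSIXct("2019-09-01 09:10:00", tz = "UTC"),
  start_station_name = "A St",
  end_station_name   = "B Ave"
)

events <- ride %>%
  pivot_longer(
    cols = c(start_station_name, end_station_name),
    names_to = "start_end",
    values_to = "station_name"
  ) %>%
  mutate(
    # Departure time for the start event, arrival time for the end event.
    hour = format(if_else(start_end == "start_station_name", starttime, stoptime), "%H")
  )

events %>% select(start_end, station_name, hour)
```

The one ride becomes two station events: a departure from A St in hour 08 and an arrival at B Ave in hour 09.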
Are you starting to see that tidying and transforming data are frequently useful prerequisites to making graphics that provide real insight? Hint: the correct answer is “Yes, and this is awesome!”
Write your answer here.
With the pivoted data frame, we can now group our data by station name and hour, and calculate the averages we’ll need to map onto visual variables.
Create new variables activity and balance, where activity holds the average number of rides or observations at each station name each hour, and balance holds the average difference between rides beginning at the station and rides ending at the station.
df <-
  df %>%
  group_by(station_name, hour, .drop = FALSE) %>%
  summarise(
    activity = # complete this code
    balance = # complete this code
  ) %>%
  ungroup()
Inspect this data frame, and compare with the original imported data frame to understand how each step of the above code changed its structure. Start to consider how we will map these data variables onto the visual variables used in the Center’s Activity and Balance graphic.
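One detail worth checking while you inspect the result is the .drop = FALSE argument to group_by(), which keeps groups for factor levels that have no observations. The sketch below shows that behavior on a made-up factor. The data and the names toy and counts are hypothetical, and it groups by a single factor only.

```r
library(dplyr)

# Hypothetical factor with a level ("C Rd") that never occurs in the data.
toy <- tibble(
  station_name = factor(
    c("A St", "A St", "B Ave"),
    levels = c("A St", "B Ave", "C Rd")
  )
)

counts <- toy %>%
  group_by(station_name, .drop = FALSE) %>%
  summarise(n = n()) %>%
  ungroup()

counts
```

The unused level still gets a row, with a count of zero, rather than disappearing from the summary.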
In our third discussion, we considered how to scale data values to map their ranges to the appropriate ranges for each channel of color: hue, chroma (saturation), and luminance. We’ll do that next.
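As a reminder of how a linear rescaling behaves, here is a minimal sketch. It assumes the rescale() function from the scales package, and the values and ranges are made up for illustration. They are not the ranges you need for this homework.

```r
library(scales)

# Hypothetical data values, chosen only for illustration.
x <- c(0, 5, 10)

# Map the data range [0, 10] linearly onto a channel range [0, 100].
a <- rescale(x, from = c(0, 10), to = c(0, 100))

# Declaring a wider data range compresses the same values.
b <- rescale(x, from = c(0, 20), to = c(0, 100))

a
b
```

The from argument describes the range of the data, and the to argument describes the range of the visual channel the data should occupy.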
Complete the code below to properly scale your data variables to the ranges of your visual variables. To get you started, I’ve written the following code:
df <-
  df %>%
  mutate(
    hue = if_else(balance < 0, 50, 200),
    saturation =
      scales::rescale(
        abs(balance),
        from = # complete this code
        to = # complete this code
      ),
    luminance =
      scales::rescale(
        activity,
        from = # complete this code
        to = # complete this code
      )
  )
Finally, we are ready to map our data onto the visual variables. The Center’s Activity and Balance graphic resembles a so-called heatmap.
Use the grammar of graphics to create tiles of information, using the function geom_tile. To do that, first review the help file for that function, paying particular attention to the aesthetics you’ll need to specify.
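If the tile geometry is new to you, the following sketch draws a tiny grid on made-up data. The data frame toy, its column names, and the hex colors are all hypothetical. It shows only the mechanics of geom_tile() together with scale_fill_identity(), not the mapping this homework asks for.

```r
library(ggplot2)

# Hypothetical 2 x 2 grid with a ready-made hex color for each cell.
toy <- data.frame(
  col_id = c("a", "b", "a", "b"),
  row_id = c("x", "x", "y", "y"),
  color  = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A")
)

p_toy <- ggplot(toy) +
  # Use the hex strings in `color` as-is instead of mapping them to a palette.
  scale_fill_identity() +
  geom_tile(
    mapping = aes(x = col_id, y = row_id, fill = color),
    # Tiles slightly smaller than 1 leave a thin gap between cells.
    width = 0.95,
    height = 0.95
  )

p_toy
```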
Further, to map the individual channels of color, you can use the function hcl that’s already loaded from grDevices, which works very similarly to (though a bit less optimally than) the example I showed you from my R package. You may also use mine, but that will require you to install it.
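As a quick check of what hcl returns, here is a small sketch. The hues 50 and 200 match the ones used earlier for balance, while the chroma and luminance values are made up for illustration.

```r
# hcl() comes from grDevices, which R attaches by default.
# Two hues, the same chroma, and two hypothetical luminance values.
colors <- hcl(h = c(50, 200), c = 60, l = c(40, 80))

colors
```

The function is vectorized and returns one hex color string per input, which is the kind of value scale_fill_identity() uses as-is.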
I’ve started the code for you below. Add code where prompted.
p <-
  df %>%
  ggplot() +
  scale_fill_identity() +
  geom_tile(
    mapping = aes(
      # complete this code
    ),
    width = 0.95,
    height = 0.95
  ) +
  theme_dark() +
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#333333"),
    axis.text.x = element_text(color = "#888888", size = 16 / .pt),
    axis.text.y = element_text(color = "#888888", size = 7 / .pt)
  ) +
  labs(x = "", y = "")
# The next line of code will save the graphic as a pdf onto your working
# directory so that you can separately open and zoom in while reviewing it.
ggsave("activity_balance2019.pdf", plot = p, width = 8, height = 40)
We’ve finished roughly reconstructing the Center’s Activity and Balance graphic, updated with later data from September 2019, six years after the original graphic but still before the pandemic. We find that the patterns originally described by the Center still show up. Review their description of the Activity and Balance graphic.
Notice that the Center’s description of its graphic and data does not, however, discuss whether empty and full docking stations, and rebalancing efforts by Citi Bike, have any effect on the patterns they describe.
How might 1) empty and full docking stations and 2) Citi Bike rebalancing bikes affect the visual patterns in our graphic? Explain to a Citi Bike analytics executive.
Write your answer here.
In submitting this individual assignment, you are representing that your answers are your own. Properly cite all resources used in completing this assignment.
Knit your answers in this R markdown file into an html file. Submit both files (the rmd and the knitted html) into courseworks. We should be able to reproduce your html file just by opening your rmd file and knitting.
If you see mistakes or want to suggest changes, please create an issue on the source repository.