Welcome to Wednesday Night Live from New York!

Before we get started, are there any announcements?

Hello everyone. How’s it going? Last week we introduced the course. We discussed a general workflow from importing data into some kind of software, to transforming it and visualizing it.

This week I want us to focus our discussion on the beginning of a deep dive into visualizing structured data.

So we’ll consider why we visualize information, how to organize the visual, how to map data to visual encodings.

We’ll talk about a theoretically grounded approach to making graphics, one that enables you to make any graphic, not just pick from named graphics.

We’ll talk about why this is important. And we’ll end by practicing in two ways. We’ll deconstruct an information graphic. And we’ll continue exploratory work on our class example, the Citi Bike rebalancing study.

But first, let’s visit our course timeline.

First up in terms of deliverables is your homework 1. Together with last week’s discussion, this week should give you plenty of knowledge for doing well on your homework 1, due next week.

And remember, while I’ve given you most of the code in homework 1, the purpose of the homework is not just for you to fill in the blanks: it is also to show you how we can transform data and encode graphics, so spend time understanding what I’ve given you in that homework. That way you’ll understand when you are asked to do something similar on your own.

Sound good? I noticed some of you have already turned it in!

As another reminder of our course as a whole, let’s revisit the Venn diagram we discussed last week, broadly categorizing skills needed to drive change. Those involved the overlaps between narratives, data analyses, and visualization.

And last week we began a case study, Citi Bike, to think about these components together. We began that by ideating what the problem is, rebalancing bikes, and what data may inform how Citi Bike can accomplish that. We located various data sources, imported one of the data sets, and transformed it to create new variables. And we made one visual.

We coded this in both R and Python to show that we can basically use almost identical code syntax for either one. We’ll continue tonight, digging into visualization specifically.

Our aim with visuals, as John Tukey reminds us, is that the “greatest value of a picture” (by picture, he means a data visual here) “is when it forces us to notice what we never expected to see.”

To start to understand his point, let’s first consider a couple more perspectives.

First, let’s consider some of what’s been said about the strength of visualization. I’ve pulled a quote from a textbook published recently on data visualization where the authors contrast written or textual communication with data encodings. Let’s read what they said together.

[READ]

A single bar, or a single instance of any encoding, does not visually convey added information. Let’s try testing this point.

On the left, I’m showing you one datum: how much U.S. consumers spend on Housing in dollars.

Does this bar length help us understand anything more than if we had just said that on average, U.S. consumers spend over $12,000 on housing?

[DISCUSS]

Now let’s compare this with the next graphic.

I’m still showing you the datum on housing we just considered, but now I’m also visually including other categories of expenditures alongside housing. Do the additional data help us understand housing expenditure over what we just saw? How?

[DISCUSS]

Let’s consider another classic example.

So these example data are from a well-known paper by Anscombe. He provides us with 4 data sets, labeled 1 to 4, and each data set includes an x and y variable, and 11 observations or measures of those variables. Now if you read before class, you will already know the answers to the questions we’re about to discuss.

From this table, try to see how the four data sets compare with one another. Are the four data sets the same or different? If different, then how? Not so easy, right?

What if we calculated some statistics of the data? Do you think that would help us decide? Show of hands? OK, let’s do that.

On the top right, I’ve included the mean and standard deviation of each x and y within each data set. What do you see so far?

[DISCUSS]

Do the statistics help us understand how these data sets compare with, and differ from, one another?

You’re learning, or will be learning, regression in your frameworks class. What if we used linear regression to compare them? Might that help?

Now on the lower right, I’ve included linear regressions on each data set, regressing y on x, and included the intercept and slope, standard errors on those, and a measure of their so-called significance. Again, what do you see? Same? Different?

Are we done? Let’s consider a different approach. Should we graph these datasets? Let’s do that.

I’ve given you 4 graphs of x and y, one for each data set. Now, what do you see? Same? Different?
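
If you want to poke at this yourself after class, R actually ships Anscombe’s quartet as the built-in data set anscombe, so reproducing tonight’s comparison takes only a few lines. A minimal sketch, not our demo code:

```r
library(ggplot2)

# Anscombe's quartet is built into R as `anscombe` (columns x1..x4, y1..y4)
with(anscombe, c(mean(x1), sd(x1), mean(y1), sd(y1)))  # summary statistics
coef(summary(lm(y1 ~ x1, data = anscombe)))            # intercept, slope, SEs

# the picture is what finally separates the four data sets
ggplot(anscombe, aes(x1, y1)) + geom_point()
# repeat with x2/y2, x3/y3, x4/y4: near-identical numbers, very different plots
```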

Let’s return to what the statistician John Tukey once said: “The greatest value of a picture is when it forces us to notice what we never expected to see.”

OK, so far, we’ve just been discussing data graphics generally. We briefly saw one last week, and just discussed another. But we sort of just assumed we knew how these graphs worked, maybe how they were made.

Let’s shift our discussion to how we make them.

These visuals have to be organized, first of all, in specific ways so that they can accurately represent comparisons between data. We place visual data markings onto a coordinate system using scales.

The coordinate system we’ve used in our few examples so far was two-dimensional, where the two axes intersect at a 90 degree angle and the values along each axis are equidistant. In other words, each axis uses a linear scale. This coordinate system, called Cartesian coordinates, is the most common.

Now I’ve added five reference markings onto this grid; two of them, a point and a line, are in blue. And we can note the point’s location within the coordinates by where it is on each axis, the x and the y. That’s where (2, 3) comes from. Does anyone have questions so far about this type of coordinate system and scale?

Now there are many other coordinate systems. Let’s consider these same two markings on another commonly used coordinate system, the polar coordinate system.

I’ve kept our example Cartesian coordinate system on the left so we can compare it with our new polar coordinate system on the right, showing the same datum. The system on the right, again, is called a polar coordinate system.

And, again, I’ve carefully marked a few elements on both that are identical to use as reference markers. This includes, on the left graphic, a vertical blue line on the left most side of our graph.

Now — importantly — the range on the x-axis is negative 4 to positive 4. Notice that the bottom-left of this graph is dead center of the polar graph.

If we held the bottom of that blue vertical line in the same spot, and moved the top of it in a circle, we get the polar coordinate system on the right. Does that make sense?

You can see the code for these two graphs in the corresponding code I provided you. I recommend you review it too.
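
The heart of that code is just a coordinate swap. Here’s a minimal sketch of the idea, not the exact demo code; the positions and limits are my guesses:

```r
library(ggplot2)

d <- data.frame(x = 2, y = 3)                      # our blue reference point
p <- ggplot(d, aes(x, y)) +
  geom_point(colour = "blue", size = 3) +
  geom_vline(xintercept = -4, colour = "blue") +   # the vertical blue line
  lims(x = c(-4, 4), y = c(0, 4))

p                  # Cartesian coordinates, the default
p + coord_polar()  # identical layers, swapped onto polar coordinates
```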

What are some things we frequently encounter in real life that are cyclical? Time? And we have circular clocks to represent time, right? That’s just one example where we may want to think about using something like this. Now we aren’t yet going into debating the effectiveness of these systems. First, I want to show you one more type of example, and then discuss how we can encode data.

We’ve all seen various maps of our world — I think that is earth for most of us. :)

Now some people still claim the earth is flat, but I think it’s a sphere. We’ve seen various types of flat maps that attempt to show our earth. Here are various coordinate systems, map projections, each created to address the distortion that flattening imparts on the sphere.

Now before we jump into data encodings, let’s go back to scales for a moment.

Again, here’s our example coordinate system on the left.

And we talked about how it uses linear scales, meaning the axis has equal numerical spacing. If we focus on the gridlines, we can see the numbers are equally spaced, right?

We can do one of two things to adjust how we visualize markings on this graph. We can either transform the data or transform the axis scale.

Sometimes data, for example, are non-linear. If we took the log or square root of that data, it may be easier for us to make visual comparisons. Or instead of transforming the data, we can transform the scale of the axis.

On the right, I’m showing you a few examples of these transformations. In the top row, the data are linear and the scale is linear. Notice that the gridlines are spaced equally apart.

Next, on the second row, I log-transform the data but keep the scale linear. Here, too, the gridlines have equal spacing, right? But the values have changed.

In the third row, I reverse this, keeping the data linear but log-transforming the scale. Now the gridlines are not equally spaced, right? But the values correspond to the raw numbers.

In the last two rows, I repeat these transformations, but with square root instead of log. Notice that, again, transforming either the data or the scale with a square root has a similar visual effect as before with the log transformation. Make sense?
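
In ggplot terms, the two routes look like this: transform the variable, or transform the scale. A minimal sketch with made-up data:

```r
library(ggplot2)
d <- data.frame(x = c(1, 10, 100, 1000), y = 1:4)

# transform the data: the values change, the gridlines stay evenly spaced
ggplot(d, aes(log10(x), y)) + geom_point()

# transform the scale: the values stay raw, the gridlines become uneven
ggplot(d, aes(x, y)) + geom_point() + scale_x_log10()

# the square-root versions behave analogously
ggplot(d, aes(sqrt(x), y)) + geom_point()
ggplot(d, aes(x, y)) + geom_point() + scale_x_sqrt()
```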

One note of warning. I cannot at the moment recall any occasion where you want to scale both data and axes. I’m not sure there is any transformation like this we could easily reason about.

Here’s a published comparison of two graphics of the same data, which I think most of us will recognize as relating to the beginning of the pandemic. It compares using linear scales on the left to using a log scale on the y-axis on the right.

As the New York Times explains, the log scale helps us think about the rate of change while the linear scale helps us compare the underlying values. I recommend you actually go read this article as it’s helpful in seeing how they explain the differences for a general, mixed audience.

So that’s a very basic refresher on coordinate systems, and on transforming the scales of either axes or data, but not both.

This brings us to using coordinate systems for data.

To lead into this, let’s just be more specific when thinking about a point in space, a line, a plane, and volume. What, precisely, do we mean by these concepts? Let’s review them.

What’s a point in space? You cannot see or feel a point because it is a place without area. The point has a position that can be defined by coordinates (numbers on one, two, three, or more axes).

Questions about why we cannot see it?

We can give it arbitrary size so that we can identify approximately where on our graph it is. Or, as we’ll see later, we can also use size as another visual attribute to encode other data variables.

What’s a line? A line can be understood as an infinite number of points that are adjacent to one another. A line can be infinite or have two endpoints. The shortest distance between two points is a straight line. Now, as with points, we still can’t really see a line. We need multiple dimensions before we can see something. Let’s talk about that now.

The first thing we can see is a surface or area. A surface is defined by two lines that do not coincide or by a minimum of three points that are not located on a line. If the two lines have one coinciding point, the surface will be a plane.

Now we’ve learned about these concepts — points, lines, areas — as children, but I wanted the concepts to be in our minds when discussing data graphics because they are fundamental to data graphics. We can build from a plane by extending it in yet another direction.

This brings us to volume. A volume is an empty space defined by surfaces, lines, and points. So everything we see on a data graphic is either a plane or volume of some kind, whether or not it represents data, or more precisely, variation in data. I know at this point, these concepts are somewhat abstract or, rather, not applied. Hang with me. We’ll get to their application.

Along with points, lines, surfaces, and volumes, we can use the channels of color — hue, chroma, and luminance — in various ways. Hue is what people normally associate with color. Chroma is how saturated the color is, from vibrant to none, or just gray. Finally, luminance is how dark or light what we see is.

This brings us to data encodings.

Now the first person who systematically studied how we can use various attributes of a shape for data was a legendary map maker. His name is Jacques Bertin. He identified 7 basic attributes of points, of lines, and of areas that we can vary to show corresponding variation in data.

Let’s look at attributes now.

I’ve pulled this information from Bertin’s seminal book, Semiology of Graphics. Let’s walk through the attributes he identified together.

Columns represent points, lines, and areas. Under each, rows represent the 7 basic attributes, which are basically ordered by their effectiveness, too.

Let’s review these systematically.

So under point, we have the x and y positions on the coordinate plane. Then size of the point. Then what he calls value of the point. By value, Bertin means luminance, which is one of the attributes of color. Below value, he lists texture.

Then what he calls color, but is hue, another attribute of color. Sixth is orientation. And finally shape.

In each element of this grid, Bertin provides very abstract examples for each of these attributes for each of points, lines, and areas.

He also, on the right, provides us with a list of properties of data for which we can use these effectively.

So what I’d like to do now, as an exercise, is review a couple of published graphics and try together to identify all the marks (points, lines, areas) and the attributes of those marks used in each graphic. Let’s do the first one together now.

I’m keeping Bertin’s list of marks and attributes on the left, just making it smaller so we can see the published graphics on the right. This first graphic is from a very entry-level book by Knaflic. Her book is titled Storytelling with Data.

Try to identify her encodings. Take a moment to think about it, then we’ll take a poll.

OK. Let’s take the poll. [ACTIVATE! CLICK DOWN FOR THE POLL]

[DISCUSS]

Excellent. Let’s see the results…

[SHOW RESULTS AND DISCUSS]

[DEACTIVATE! CLICK UP TO GO BACK TO MAIN SLIDE]

Let’s do another. The second graphic was published in a newspaper, the Los Angeles Times. The graphic is to help its audience understand crime rates in California. Let’s take a look.

Looking closely now, let’s try to identify all the markings and encodings it uses.

OK. Let’s take the poll. [ACTIVATE! CLICK DOWN FOR THE POLL]

[DISCUSS]

Excellent. Let’s see the results…

[SHOW RESULTS AND DISCUSS]

[DEACTIVATE! CLICK UP TO GO BACK TO MAIN SLIDE]

Excellent, I hope this is helping us get started thinking about the mappings between data and visual channels.

In explaining and creating data graphics, we can, and should, also use an established grammar of graphics.

What do we mean by the grammar of graphics? What does grammar mean to you?

Let’s start with general definitions. Here is the definition of “grammar” from the Oxford English Dictionary. Let’s read together.

[READ]

If we borrowed ideas from the Oxford English Dictionary, we might guess that in the context of graphics, a grammar would describe the form of relationships between various components of the graphic.

Let’s review the grammar of graphics from the seminal reference in the field.

Let’s consider what Wilkinson, author of this seminal and influential text, says, reading together.

[READ]

So we should think about data mappings as a grammar to keep from limiting our ideas.

But that doesn’t mean that chart names as specific instances of graphics are not helpful. If we train ourselves to think about encodings, we can review examples that have names, chart names, to see starting points for our own encodings.

Questions about the difference between using a grammar and choosing from a list of charts? Questions about why only selecting from charts is too limiting?

Wilkinson, in his text, the grammar of graphics, establishes six components to graphics. Data, Transformations, Scales, Coordinates, Elements, and Guides.

For our Citi Bike case study, we’ve been gathering data sets containing observations of measured or recorded data variables, right?

And we’ve just discussed the idea of transformations.

We discussed a linear scale in Cartesian coordinates, for example. And we discussed how we can scale either the data or the coordinates by applying, say, a log function to those. Make sense?

And we also discussed elements. Bertin’s points, lines, areas, each of which has multiple attributes to which we can link or map data.

All of these components work together and when discussed as a grammar, allow us to create a virtually unlimited number of views of our data.

Now this discussion of a grammar has been without an implementation. Let’s see an example of how it’s implemented in a tool specifically designed as a grammar of graphics: R’s ggplot2, which as I’ve explained is available in Python too in the package plotnine, right? Can you guess what the gg in ggplot2 stands for?

OK, I’m going to lead you through this ggplot pseudo-code. I’ve written it with the components of Wilkinson’s grammar of graphics to the left, in orange, of the pseudo-code.

So I want to focus on the elements of the grammar here, as implemented in software.

In this pseudo-code on the right, I’ve sketched out generic versions of the various ggplot functions as they relate to that grammar of graphics.

Let’s walk through this together.

[DISCUSS EACH LINE]

Here I’m adding some annotation, to explain where Bertin’s visual variables fit inside this ggplot code.

We map or encode our data variables to visual variables and elements by specifying the particular element as a geom[etry], and the visual variables for that geometry by mapping aesthetics; that’s what aes stands for.

Is this helping to compare Wilkinson’s labels for components with this pseudo-code? Questions?
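
To make that concrete, here’s a generic skeleton of how those components usually line up in ggplot code. Names like my_data, var1, and var2 are placeholders, not anything from our data:

```r
library(ggplot2)

ggplot(data = my_data) +                           # Data
  geom_point(mapping = aes(x = var1, y = var2)) +  # Element, with aesthetic mappings
  scale_x_log10() +                                # Scales (transformations can live here)
  coord_cartesian() +                              # Coordinates
  labs(title = "A title", x = "var1", y = "var2")  # Guides
```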

Now, while I’ve only listed one line of geom_ code here, we layered multiple geometries on top of one another in our examples last week, right?

So we can layer as many things as we need to create an optimal communication. Let’s take a look, a simple look at how this works with a toy example.

Here are two versions of the toy example. They are almost identical. Both include two layers, each drawing a single point. One point is blue, the second point is orange. I’m also coloring the syntax to match what it draws on the graphic. The only difference between the two versions is that I switch the layers: in the first graphic, I code the orange layer, then the blue layer; in the second graphic, I code the blue layer first. I’ve also drawn both points close enough that they partly overlap.
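
A minimal sketch of the two versions might look like this; the sizes and positions are my guesses:

```r
library(ggplot2)

# version 1: orange coded first, blue second
ggplot() +
  geom_point(aes(x = 1,   y = 1), colour = "orange", size = 40) +  # drawn first, sits below
  geom_point(aes(x = 1.3, y = 1), colour = "blue",   size = 40)    # drawn second, sits on top

# version 2: the layers swapped
ggplot() +
  geom_point(aes(x = 1.3, y = 1), colour = "blue",   size = 40) +
  geom_point(aes(x = 1,   y = 1), colour = "orange", size = 40)
```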

And the result?

The layer drawn first appears on bottom, right?

So the ordering, near to far, of shapes follows the order in our code.

We layer graphical elements on one another from far, towards our eye. Does this make sense? Questions so far?

Now in these examples, we cannot see through the circles, right?

But we have another attribute of graphics available. We can make shapes completely opaque, completely transparent, or anywhere in between. Let’s look at an example.
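
As a quick sketch of that attribute, setting alpha somewhere between 0 (fully transparent) and 1 (fully opaque) lets the lower layer show through:

```r
library(ggplot2)

ggplot() +
  geom_point(aes(x = 1,   y = 1), colour = "orange", size = 40, alpha = 0.5) +
  geom_point(aes(x = 1.3, y = 1), colour = "blue",   size = 40, alpha = 0.5)
# where the circles overlap, the colors now blend instead of hiding each other
```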

Again, I’ve also given you all this code in the code demonstration files so you can run them yourself and play with it to learn.

To really drive home the point of these layers, next let’s review an award-winning information graphic, related to music, that I’ve re-created. Let’s take a look at the original now.

Here, I’m showing an interactive data graphic created by Nadieh Bremer, a well-known information designer who has won many awards. Her graphic, shown here on the left, explored the top 2000 songs in the Netherlands and their trends over time. You can click on her graphic to link to the original on the website. She made this graphic in the software tool d3.js.

Let’s break out into groups of 5 and discuss how Nadieh Bremer may have conceptually constructed this graphic … don’t worry about the interactivity right now or what code she used, but be thinking about layering and the grammar of graphics. Questions?

Also, don’t look ahead in the slides as I want you all to think together, OK?

[SCROLL DOWN FOR ONE SOLUTION]

On the left, I’m showing you a static version of Nadieh’s graphic.

For teaching purposes, I wanted to re-create this graphic using the grammar of graphics in R and ggplot using her data, to demonstrate how layers work in graphics.

So I created it layer by layer, starting with the red circle, then layering the black circle, which is the record vinyl, then layering the reflection on the vinyl, then layering the grooves in the vinyl, then adding the label. All that is really what we’ll soon learn to be “non-data ink” that helps provide a visual metaphor as context to position the data.

Then I layered the data variable — year of release — as white, transparent dots, layered each song’s position in the top 2000 as size of circles, and finally, layered annotations of the very top 10 songs.

If we reversed the layers in the code, all you would see is a black and red circle because it would hide everything else. Does that make sense?
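
Purely to illustrate that ordering, here’s a conceptual sketch of the non-data layers. This is not her code, nor my full re-creation, just the record metaphor built bottom-up:

```r
library(ggplot2)

ggplot() +
  geom_point(aes(x = 0, y = 0), colour = "red",    size = 120) +  # red backdrop, bottom layer
  geom_point(aes(x = 0, y = 0), colour = "black",  size = 110) +  # the vinyl record
  geom_point(aes(x = 0, y = 0), colour = "grey30", size = 35)  +  # the label area
  coord_equal() +
  theme_void()
# the data layers (release years, rankings, annotations) stack on after these
```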

OK, now that we have discussed and demonstrated a layered grammar of graphics, let’s first see these ideas in terms of Bertin’s visual variables and, second, use these ideas to explore our Case Study, Citi Bike.

Again, to remind us, here are Bertin’s visual channels. I’ll keep this side by side with some elements of the grammar of graphics implemented in ggplot.

Next, to the right, I’m just showing the portion of the grammar of graphics code that directly relates to Bertin’s first channel for a point. Notice these two parameters are inside the aes[thetic] function. We set each of these to a variable in the data, in other words by filling in these blanks.

Make sense?

Now all the remaining parameters are outside the aes[thetic] function. When they are outside, we are not mapping them to data. We are just specifying whatever we want them to be. But we could! Any one of these that we want to map to data, we just move that parameter inside the aes[thetic] function and set it equal to a data variable. So, in total, for a point type of mark, we can map all of these to data (and more, which I’ll eventually get to).

Here, I’ve just annotated the same thing I’ve just explained so you’ll have it as a visual reference. Let’s also review these with a toy example.

To make this visual, which is also a diagram, I’ve just created a data frame with one observation for x_ and y_, both set to zero. Then, I map it to the visual variables and set the remaining parameters.
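
A minimal sketch of that construction; the parameter values are my guesses at what the slide shows:

```r
library(ggplot2)

d <- data.frame(x_ = 0, y_ = 0)    # one observation, as described

ggplot(d) +
  geom_point(aes(x = x_, y = y_),  # position is mapped to data, inside aes()
             shape  = 21,          # a point shape that has both a stroke and a fill
             size   = 30,
             stroke = 10,
             colour = "steelblue", # stroke colour, set outside aes(), so not mapped
             fill   = "orange",    # fill colour, also set, not mapped
             alpha  = 0.5)         # half transparent; watch the stroke/fill overlap
```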

Of note, I’ve set the alpha parameter (how easily we can see through the markings) to one half, and notice that half the width of the stroke lands inside the fill area, and half outside it. It’s an odd or unfortunate behavior: ideally we’d want the fill color to be whatever is fully inside the stroke outline, but you see they overlap, and the overlap creates a color that I did not specify!

Oh no! We’ll come back to that issue next week when we specifically discuss color.

Questions about mapping visual variables to a point marking?

Now, just as with a point mark, let’s review visual variables for a line mark as implemented and described by the grammar of graphics.

Now, you should start to feel that what we are reviewing is becoming a little repetitive. That’s a good thing! It means you are catching on to the systematic grammar of graphics that lets us draw literally anything we want. Almost anything.

So, as with a point mark, the line marking behaves the same way. Only this time, we need at least two observations of both x and y to start drawing lines, right? The geom_line function, then, starts drawing them connected together.

If you want to draw multiple lines, move the group parameter inside the aes[thetic] function and give each line a unique id in the data.

And, again, you can also map any of the remaining parameters to data by moving it inside the aes[thetic] function.

By the way, there are also helper functions for more specific circumstances. For example, when you only want to draw single line segments between any two points in x, y space, you can use geom_segment.

As with a point, let’s see a toy example, as a diagram of a line.

Here’s our toy example of a line marking, actually two lines, for each of which I’ve used exactly three x, y observation pairs.
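
A sketch of that toy example, using the group parameter inside aes() to give each line its own id:

```r
library(ggplot2)

lines_d <- data.frame(
  x  = c(0, 1, 2,  0, 1, 2),
  y  = c(0, 2, 1,  1, 0, 2),
  id = rep(c("a", "b"), each = 3)  # one id per line, three observations each
)

ggplot(lines_d, aes(x, y, group = id, colour = id)) +
  geom_line()
```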

Now, I’ve used exactly three points for a reason, to show you how a line marking relates to an area marking. If we closed these lines by connecting the end point to the start point, and perhaps shade it, we’d have an area, right?

Let’s do that now.

So another word for area here, a geometric word, would be polygon, which, by definition, is a plane figure with at least three straight sides, and usually (many) more. The grammar of graphics function, then, is called geom_polygon. And here it is.

Can you guess how it maps data to the visual variable of an area?

I won’t talk through this one, then, but let’s see the toy example using the same data we just used for the toy line example.
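
Reusing the lines_d data frame from the line sketch above, closing and shading the shapes is just a change of geometry; the fill colors here are my own picks:

```r
# same data, now closed into shaded polygons
ggplot(lines_d, aes(x, y, group = id, fill = id)) +
  geom_polygon(alpha = 0.5) +
  scale_fill_manual(values = c(a = "orange", b = "steelblue"))  # manual fill scale
```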

Any surprises on how it works? Pretty straightforward, right?

Now, you’ll notice that I’ve included functions for manually scaling the fill color here. I didn’t have to do that and, if I did not, then ggplot would pick values for me. But I usually want to be precise and specify my own.

With just those three types of functions, one for points, one for lines, and one for areas or polygons, you could really draw almost anything. Now there are many, many more sort of helper functions that let you draw various types of shapes using different parameters and such.

For example, there is a geom_rect to draw rectangles, a geom_circle to draw circles, and others that match categorical names of charts, like geom_bar (for bar charts), geom_histogram (for histograms), and geom_density (for density plots). But you could also just draw these with what I’ve shown you by thinking carefully about structuring your data and making whatever variables you need.

We will look at a few of these next, by deconstructing another published, award-winning information graphic that I re-created for teaching purposes, to help us continue becoming more comfortable with the idea of encodings and the grammar of graphics.

The information graphic we’ll deconstruct was published in La Lettura, the cultural supplement of Italy’s most circulated newspaper. This information graphic was created by Giorgia Lupi, whom I’m showing you a photo of here. She created the graphic with her team. The graphic is titled “Nobels, no degrees”.

Let’s pull this graphic out of the paper, and rotate it on our screens to help us inspect it. Or, actually, I’ve coded a recreation of it that we can play with.

Before we look at my re-creation, I wanted to show you a summary of the data variables for your reference. The most important of these is the type of Nobel prize, which could be chemistry, economics, physics, literature, medicine, or peace, right? Those are in the data variable “Category”. The graphic also uses year of prize awarded, sex of the recipient, age of the recipient, and their hometown and university.

Here I’ve re-created Lupi’s graphic, rotated to be horizontal to make it easier for us to review together.

Alongside her graphic, I’m keeping Bertin’s visual channels, well, visual, so that we can reference them while discussing.

I’ve also written a blog post, which I’ve cited here, that takes you through the steps to create each component.

Now this graphic may seem complex at first blush, but it is really just a bunch of common graphics that even businesses use. And we’ll try to name them as we work through this, taking her graphic apart, graph by graph.

So how might we start identifying encodings? Let’s simplify down what she did into various components. We’ll try to separate these components so we can focus on encodings and grammar.

Sound good?

So here, to start, I’m having us focus on just one component of Lupi’s graphic. And on the left, I’m showing you the basic grammar of graphics in R/ggplot code that we can use to create it. Notice that we layer a separate function for each geometry, and within each we map encoded data variables to visual channels.

[EXPLORE THE LAYERS]

Now, this scatterplot and line chart combo used small multiples, colored by category, right?

Let’s change our focus to the multiples. On the left, you can see that all the grammar and code is the same except we also map Category to the multiples or facets, too. Pretty cool how directly these functions relate to what we see, right?
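
As a hedged sketch of this step, the only change is the facet layer. The data frame nobel and its column names here are my guesses, not the actual demo code:

```r
library(ggplot2)

ggplot(nobel, aes(x = year, y = age, colour = category)) +
  geom_point() +
  geom_smooth(se = FALSE) +   # a trend line layered over the points
  facet_wrap(~ category)      # one small multiple per prize category
```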

Let’s consider the next section of Lupi’s graphic.

Here are small multiples of what are called bar charts.

As with multiples of the graphics we just reviewed, Lupi again maps the data variable “Category” of prize to multiples or facets and to the fill color.

Within each facet, Lupi has mapped the categorical data variable “Education” to the x-axis, then mapped the percentage of Nobel prizes awarded across those categories to the y-axis.

Now, we could have just used line, again, but the graphics library has geom_bar as a helper function.

What it does, again, is create line segments whose positions all begin on a common baseline, with each ending depending on the value and the statistic used. Here, we have already calculated the end values, so the statistic is just the identity of the value we provided.

Are there questions about how geom_bar is really just a specific type of the generic line or line segment, so we could have used geom_line or geom_segment, too?

And we see two main functions. We’ve mapped the bar positions using geom_bar to Education and percent, and mapped their fill color to prize category.

Then, just like we did a minute ago, we also map multiples to prize category. Questions so far?
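
Putting those pieces together, a minimal sketch might be as follows, assuming a hypothetical, pre-aggregated data frame nobel_edu with education, percent, and category columns:

```r
library(ggplot2)

ggplot(nobel_edu, aes(x = education, y = percent, fill = category)) +
  geom_bar(stat = "identity") +  # values are precomputed, so the statistic is identity
  facet_wrap(~ category)
```

By the way, geom_col is shorthand for geom_bar(stat = "identity"). OK, let’s do another.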

You should start to see the pattern comparing the code to the visuals. These are basic stacked bar charts, with a number beside each. So we map birth city and the count of prizes to position, the x and y, and we map the fill color to category of prize.

Then we layer in the labels beside each. As these are tied to data, we use geom_text, and here we see similar mapping. Again, we use the same variables to position the text. Then we map the label, what it says, to the count of prizes.

Last, we map each thirty-year period to the multiples or facets.
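
A rough sketch of those layers, again with guessed names (a hypothetical nobel_city data frame holding a birth_city, a count n, a category, and a thirty-year period column). In a real stacked chart the label placement would need a little more care, but the mappings are the point here:

```r
library(ggplot2)

ggplot(nobel_city, aes(x = n, y = birth_city, fill = category)) +
  geom_bar(stat = "identity") +              # stacked segments per city
  geom_text(aes(label = n), hjust = -0.2) +  # the count labeled beside each bar
  facet_wrap(~ period)                       # one facet per thirty-year period
```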

I hope this is helping you feel more comfortable in thinking about data encodings mapped to Bertin’s variables and attributes, and layering them using Wilkinson’s grammar of graphics.

And I hope this is helping you see the simplicity and repetitiveness in reading the matching code so that you can make your own.

Here’s the last component in Lupi’s graphic. Now this flow is really called a Sankey chart or diagram, but the particular geometry function I make it with does something more general, so it’s called parallel sets. And you can see how these are parallel sets: the first set is categories of prizes, and the flows start there and go to another, parallel set, categories of universities. Right? This particular encoding is slightly more complicated, so you see it uses one function for the flows and another function to draw the categorical landing points. But the idea is still the same. In the functions, we literally map data variables to visual variables and attributes.
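
One implementation of parallel sets I know of lives in the ggforce extension to ggplot2; here’s a hedged sketch, where the data frame nobel_flows, with columns prize, university, and a count n, is hypothetical:

```r
library(ggforce)  # extends ggplot2 with parallel sets geometries

d <- gather_set_data(nobel_flows, 1:2)  # reshape the two categorical columns

ggplot(d, aes(x, id = id, split = y, value = n)) +
  geom_parallel_sets(aes(fill = prize), alpha = 0.4) +  # the flows
  geom_parallel_sets_axes(axis.width = 0.1)             # the categorical landing points
```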

By the way, if you want a fancier version of this type of plot, you can download another R package I wrote that creates fancier Sankey diagrams. You can find it on my website.

While Wilkinson has explained that we should think about graphics as a grammar, it can still be helpful to get ideas from existing, named, charts.

Just don’t fall into the trap of limiting ourselves to picking those named charts. Instead, start with Bertin’s visual encodings, and create exactly what you need to communicate using the grammar of graphics.

Does that make sense?

Now let’s practice in another way. Let’s return to our Citi Bike case study. Let’s explore Citi Bike data while demonstrating various encodings and highlighting which of Bertin’s visual channels we are using to create it.

During this discussion, to save time, I won’t walk through the code, but I urge you to review it; I’ve written it for you in the code demonstrations along with more explanation. These should be part of your reading when I give them to you. And when you don’t understand something in them, I urge you to post your question on our discussion forums.

We can all learn from your questions. For those who have a similar question, we all collectively answer it. For those who already know the answer, you can learn what confuses people and practice how to explain or communicate it. So whether or not you understand the underlying thing, you can learn something either way. Make sense?

So on the right, each time I’ll show you the graph I quickly made using R, and I’ve also given you the R code (and a Tableau workbook) that has a few of these exploratory graphics with the data. So our data includes longitude and latitude for each docking station.

OK, so on the right, we have an x, y two dimensional Cartesian coordinate system, the x represents longitude in the data, and the y represents latitude in the data. What we see here are all the docking stations according to these data. That’s what Bertin means by encoding data with a point on the x and y dimensions.
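
In code, this first encoding is about as small as ggplot gets. A sketch, assuming a stations data frame with longitude and latitude columns:

```r
library(ggplot2)

ggplot(stations, aes(x = longitude, y = latitude)) +
  geom_point()  # one point per docking station
```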

Questions so far?

Let’s see another example, this time positioning with the line.

OK, the data also include as an observation each trip that someone takes on a bike, and they record several variables, including the start time. So here, I’m using the same kind of graph: x now represents the start hour, which I calculated in code from the start time and date, and y represents the count of observations within each hour. I connect these with a line. Using a line shows us the change in direction of rides across hours. And this is one example of what Bertin means by positioning the line.
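
A hedged sketch of that transformation and encoding, assuming a trips data frame where each row is one ride with a start_hour column:

```r
library(dplyr)
library(ggplot2)

trips %>%
  count(start_hour) %>%                # rides per hour of the day
  ggplot(aes(x = start_hour, y = n)) +
  geom_line()
```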

Same type of encoding, position of a line. But this time, I’m graphing birth year of the rider on the x-axis and, again, the count of rides on the y-axis. Again, these are lines; specifically, a bunch of vertical lines that we can compare with each other. Odd that the line at 1969 is so much higher than the others. I wonder why?


By the way, in the code demonstrations I’ve provided you that corresponds with this slide, I showed you three ways to create this same graphic using the same data, just reorganized and transformed. One using a geom_segment, one using a geom_rect, and one using a geom_histogram. I could have also used geom_line. I recommend you review the code demonstrations as I’ve also provided more explanation in that file. And think about the different ways we can express the same thing using this grammar of graphics. Sound good?

OK, let’s do another encoding. Area.

So here, we encode geographic boundaries with area position on the x and y, we encode land and water with color, and we encode docking stations with position on the x and y. See how we’re combining all these attributes? Now the docking station points can be understood in comparison with geographic areas.

Now, actually, in this example, I did not truly map data to land or water. Instead, I just colored the background gray and specified the fill color of the land areas as white.

Let’s do another, this time actually mapping data to area fill color.

Here we map various colors to distinguish each of the boroughs. Make sense? Questions? Let’s do another, this time going back to points and adding another attribute.

Here are the same docking stations we graphed earlier. This time we encode the number of rides as the size of each point or, really, circle.
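
As a sketch, this is one added aesthetic; n_rides is a hypothetical per-station count I’m assuming was already computed:

```r
library(ggplot2)

ggplot(stations, aes(x = longitude, y = latitude, size = n_rides)) +
  geom_point(alpha = 0.5)  # size now carries a data variable, not a fixed value
```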

By the way, from an exploration point of view, what might we be starting to learn about Citi Bike using this encoding while exploring the data?

How about another example? Instead of mapping number of rides onto size of the point, let’s map the same data variable onto luminance.

Here we do just that. We map number of rides onto luminance or what Bertin calls value.
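
In ggplot, a sketch of mapping a variable to luminance is a colour mapping with a single-hue, light-to-dark gradient:

```r
library(ggplot2)

ggplot(stations, aes(x = longitude, y = latitude, colour = n_rides)) +
  geom_point() +
  scale_colour_gradient(low = "grey85", high = "grey10")  # luminance, not hue, varies
```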

I hope this is helping you feel more comfortable about thinking about Bertin’s encodings and how to use them and create them using the grammar of graphics.


Let’s keep exploring and practicing mapping data to visual variables.

OK, this time, I decided I want to know whether the number of bikes coming to the docks tends to exceed the number of docking spots. So I calculated this by taking the number arriving each hour minus the number leaving each hour, and testing whether that difference was greater than the total docking spots.

I coded that calculation as color luminance using a single hue — red.
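
A hedged sketch of both the calculation and the single-hue encoding. All the data frame and column names (trips, stations, docks, and friends) are my guesses at the structure:

```r
library(dplyr)
library(ggplot2)

arrivals   <- trips %>% count(station = end_station_id,   hour = end_hour,   name = "n_in")
departures <- trips %>% count(station = start_station_id, hour = start_hour, name = "n_out")

imbalance <- full_join(arrivals, departures, by = c("station", "hour")) %>%
  mutate(net = coalesce(n_in, 0L) - coalesce(n_out, 0L)) %>%  # net hourly inflow
  left_join(stations, by = "station") %>%                     # adds docks, longitude, latitude
  mutate(exceeds = net > docks)                               # more bikes than spots?

ggplot(imbalance, aes(x = longitude, y = latitude, colour = net)) +
  geom_point() +
  scale_colour_gradient(low = "mistyrose", high = "darkred")  # one hue, varying luminance
```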

What are we learning about the bike share from an exploratory data analysis?

The dark red stations have larger imbalances and seem to be concentrated on the Lower East Side or just above there, plus a few places in Brooklyn. So Citi Bike probably has to address rebalancing at those stations more often.

Let’s try something else, mapping data to lines plus another visual channel.

Here, we code the same start hour and number of rides, but we use vertical lines this time for each hour and we encode gender, which is male, female, and not specified, by color.
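
A sketch of that change: the only addition to the hourly counts is a colour mapping, again assuming hypothetical trips columns:

```r
library(dplyr)
library(ggplot2)

trips %>%
  count(start_hour, gender) %>%
  ggplot(aes(x = start_hour, y = n, colour = gender)) +
  geom_segment(aes(xend = start_hour, yend = 0))  # a vertical line per hour, per gender
```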

The green gender uses bikes the most. The blue gender uses them less, but seems to follow a similar pattern of use through the day. Finally, the pink gender seems very different.

Interesting. What I want you to keep focusing on, though, in this exercise is ways we can map and combine the various maps of data variables to visual elements and variables.

Let’s make a similar mapping, but change the data.

To explore that odd difference in distribution, we change the color encoding to user type, which includes customer and subscriber.

Ah, now this makes sense. Subscribers are probably local residents and we see two peaks, one for going to work and one for leaving work.

And the red color, showing customers: those are probably tourists, who use the bikes more in the middle of the day while residents are at work. These data are from 2019, by the way. I bet if we gathered data from this year, the distribution would look different.

Let’s try another, mapping data variables to three things: area, color, and points.

OK, here I’ve calculated the direction between the start and end stations for each ride, but instead of connecting the entire distance with a line, I just make a short line starting at the starting station and oriented toward the end station. I’ve also made the markings partially transparent so we get a sense of how often bikes start somewhere and point a certain direction.

I’m not sure what we really learn here. I do notice that many bike destinations seem to be pointed along Broadway.

Next, instead of just using a line segment oriented toward the ending station, I connect the two with a line.

Again, this is a really busy graphic, and we’re just exploring the data. It’s not really effective to communicate with others. None of the exploratory graphics we’ve considered tonight for Citi Bike serve as communication tools. We’ll get to the problems for communication in future lectures.

What I want you to notice is we overlay lines and their direction over the area and color encodings for geographic area.

Let’s try something else even more complex.

Here, I’m encoding the geographic areas and water land as usual, and marking stations with points and position, but I’m using the line segments differently.

Here, I’m only showing a line segment if the docking station is either empty or full.

And I’m encoding the color of the line to tell us which is the problem. Purple for empty and orange for full.

And here’s where we can transform data and get fancy. Here, I’m orienting the line segments we see by the time of day on a 24-hour clock. To do that, I have to use some basic trigonometry, a couple of simple calculations to create the end point of each line segment.
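
A sketch of that trigonometry, with a hypothetical problems data frame (station coordinates, the hour of the issue, and a status of “empty” or “full”):

```r
library(dplyr)
library(ggplot2)

len <- 0.002  # segment length in degrees, picked for visibility

segs <- problems %>%
  mutate(theta = 2 * pi * hour / 24,            # hour of day as a clock angle
         xend  = longitude + len * sin(theta),  # 0h points straight up,
         yend  = latitude  + len * cos(theta))  # angles run clockwise, like a clock

ggplot(segs) +
  geom_segment(aes(x = longitude, y = latitude, xend = xend, yend = yend,
                   colour = status)) +
  scale_colour_manual(values = c(empty = "purple", full = "orange"))
```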

On this slide, using the calculations I’ve called out in the upper right, the visual is very small, but if you viewed it larger, you could start exploring when stations tend to have problems and what kind of problem each has.

Make sense? OK, since you just turned in your draft proposal and I had given you an example draft proposal with a couple of graphics, let’s see how those were encoded.

OK, that’s plenty for us to think about for tonight. I hope our discussion and code demonstrations will give you plenty for your own practice in your homework 1, due next week. Stay active on our discussions throughout the week. Ask questions. Share things you find helpful. Answer others. Learn together.

As always, I’ve hand picked references that are best suited to going further for the topics we’ve discussed today.

I’ll stay for questions. Otherwise have a great rest of your night!