Welcome to Wednesday Night Live from New York!

Before we get started, are there any announcements?

Hello everyone. How’s it going? Last week we introduced the course. We discussed a general workflow from importing data into some kind of software, to transforming it and visualizing it.

This week I want us to focus our discussion on the beginning of a deep dive into visualizing structured data.

So we’ll consider why we visualize information, how to organize the visual, how to map data to visual encodings.

We’ll talk about a theoretically grounded approach to making graphics, one that enables you to make any graphic, not just pick from named graphics.

We’ll talk about why this is important. And we’ll end by practicing in two ways. We’ll deconstruct an information graphic. And we’ll continue exploratory work on our class example, the Citi Bike rebalancing study.

But first, let’s visit our course timeline.

First up in terms of deliverables is your homework 1. Together with last week’s discussion, this week should give you plenty of knowledge for doing well on your homework 1, due next week.

And remember, while I’ve given you most of the code in homework 1, the purpose of the homework is not just for you to fill in the blanks: it is also to show you how we can transform data and encode graphics, so spend time understanding what I’ve given you in that homework. That way you’ll understand when you are asked to do something similar on your own.

Sound good? I noticed some of you have already turned it in!

As another reminder of our course as a whole, let’s revisit the Venn diagram we discussed last week, broadly categorizing skills needed to drive change. Those involved the overlaps between narratives, data analyses, and visualization.

And last week we began a case study, Citi Bike, to think about these components together. We began that by ideating what the problem is, rebalancing bikes, and what data may inform how Citi Bike can accomplish that. We located various data sources, imported one of the data sets, and transformed it to create new variables. And we made one visual.

We coded this in both R and Python to show that we can basically use almost identical code syntax for either one. We’ll continue tonight, digging into visualization specifically.

Our aim with visuals, as John Tukey reminds us, is that the “greatest value of a picture” (by picture, he means a data visual here) “is when it forces us to notice what we never expected to see.”

To start to understand his point, let’s first consider a couple more perspectives.

First, let’s consider some of what’s been said about the strength of visualization. I’ve pulled a quote from a textbook published recently on data visualization where the authors contrast written or textual communication with data encodings. Let’s read what they said together.

[READ]

A single bar, or a single instance of any encoding, does not visually convey added information. Let’s try testing this point.

On the left, I’m showing you one datum: how much U.S. consumers spend on Housing in dollars.

Does this bar length help us understand anything more than if we had just said that on average, U.S. consumers spend over $12,000 on housing?

[DISCUSS]

Now let’s compare this with the next graphic.

I’m still showing you the datum on housing we just considered, but now I’m also visually including other categories of expenditures alongside housing. Do the additional data help us understand housing expenditure over what we just saw? How?

[DISCUSS]

Let’s consider another classic example.

So these example data are from a well-known paper by Anscombe. He provides us with 4 data sets, labeled 1 to 4, and each data set includes an x and y variable, and 11 observations or measures of those variables. Now if you read before class, you will already know the answers to the questions we’re about to discuss.

From this table, try to see how the four data sets compare with one another. Are the four data sets the same or different? If different, then how? Not so easy, right?

What if we calculated some statistics of the data? Do you think that would help us decide? Show of hands? OK, let’s do that.

On the top right, I’ve included the mean and standard deviation of each x and y within each data set. What do you see so far?

[DISCUSS]

Do the statistics help us understand how these data sets compare with, and differ from, one another?

You’re learning, or will be learning, regression in your frameworks class. What if we used linear regression to compare them? Might that help?

Now on the lower right, I’ve included linear regressions on each data set, regressing y on x, and included the intercept and slope, standard errors on those, and a measure of their so-called significance. Again, what do you see? Same? Different?

Are we done? Let’s consider a different approach. Should we graph these datasets? Let’s do that.

I’ve given you 4 graphs of x and y, one for each data set. Now, what do you see? Same? Different?
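
If you want to poke at this yourself after class, R actually ships Anscombe’s quartet as the built-in data set anscombe, so reproducing tonight’s comparison takes only a few lines. A minimal sketch, not our demo code:

```r
library(ggplot2)

# Anscombe's quartet is built into R as `anscombe` (columns x1..x4, y1..y4)
with(anscombe, c(mean(x1), sd(x1), mean(y1), sd(y1)))  # summary statistics
coef(summary(lm(y1 ~ x1, data = anscombe)))            # intercept, slope, SEs

# the picture is what finally separates the four data sets
ggplot(anscombe, aes(x1, y1)) + geom_point()
# repeat with x2/y2, x3/y3, x4/y4: near-identical numbers, very different plots
```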

Let’s return to what the statistician John Tukey once said: “The greatest value of a picture is when it forces us to notice what we never expected to see.”

OK, so far, we’ve just been discussing data graphics generally. We briefly saw one last week, and just discussed another. But we sort of just assumed we knew how these graphs worked, maybe how they were made.

Let’s shift our discussion to how we make them.

These visuals have to be organized, first of all, in specific ways so that they can accurately represent comparisons between data. We place visual data markings onto a coordinate system using scales.

The coordinate system we’ve used in our few examples so far was two-dimensional, where the two axes intersect at a 90 degree angle and the values along each axis are equidistant. In other words, each axis uses a linear scale. This coordinate system, called Cartesian coordinates, is the most common.

Now I’ve added five reference markings onto this grid; two of them, a point and a line, are in blue. And we can note the point’s location within the coordinates by where it is on each axis, the x and the y. That’s where (2, 3) comes from. Does anyone have questions so far about this type of coordinate system and scale?

Now there are many other coordinate systems. Let’s consider these same two markings on another commonly used coordinate system, the polar coordinate system.

I’ve kept our example Cartesian coordinate system on the left so we can compare it with our new polar coordinate system on the right, showing the same datum. The system on the right, again, is called a polar coordinate system.

And, again, I’ve carefully marked a few elements on both that are identical to use as reference markers. This includes, on the left graphic, a vertical blue line on the left most side of our graph.

Now — importantly — the range on the x-axis is negative 4 to positive 4. Notice that the bottom-left of this graph is dead center of the polar graph.

If we held the bottom of that blue vertical line in the same spot, and moved the top of it in a circle, we get the polar coordinate system on the right. Does that make sense?

You can see the code for these two graphs in the corresponding code I provided you. I recommend you review it too.
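
The heart of that code is just a coordinate swap. Here’s a minimal sketch of the idea, not the exact demo code; the positions and limits are my guesses:

```r
library(ggplot2)

d <- data.frame(x = 2, y = 3)                      # our blue reference point
p <- ggplot(d, aes(x, y)) +
  geom_point(colour = "blue", size = 3) +
  geom_vline(xintercept = -4, colour = "blue") +   # the vertical blue line
  lims(x = c(-4, 4), y = c(0, 4))

p                  # Cartesian coordinates, the default
p + coord_polar()  # identical layers, swapped onto polar coordinates
```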

What are some things we frequently encounter in real life that are cyclical? Time? And we have circular clocks to represent time, right? That’s just one example where we may want to think about using something like this. Now we aren’t yet going into debating the effectiveness of these systems. First, I want to show you one more type of example, and then discuss how we can encode data.

We’ve all seen various maps of our world — I think that is earth for most of us. :)

Now some people still claim the earth is flat, but I think it’s a sphere. We’ve seen various types of flat maps that attempt to show our earth. Here are various coordinate systems, map projections, each created to address the distortion that flattening imparts on the sphere.

Now before we jump into data encodings, let’s go back to scales for a moment.

Again, here’s our example coordinate system on the left.

And we talked about how it uses linear scales, meaning the axis has equal numerical spacing. If we focus on the gridlines, we can see the numbers are equally spaced, right?

We can do one of two things to adjust how we visualize markings on this graph. We can either transform the data or transform the axis scale.

Sometimes data, for example, are non-linear. If we took the log or square root of that data, it may be easier for us to make visual comparisons. Or instead of transforming the data, we can transform the scale of the axis.

On the right, I’m showing you a few examples of these transformations. In the top row, the data are linear and the scale is linear. Notice that the gridlines are spaced equally apart.

Next, on the second row, I log-transform the data but keep the scale linear. Here, too, the gridlines have equal spacing, right? But the values have changed.

In the third row, I reverse this, keeping the data linear but log-transforming the scale. Now the gridlines are not equally spaced, right? But the values correspond to the raw numbers.

In the last two rows, I repeat these transformations, but with square root instead of log. Notice that, again, transforming either the data or the scale with a square root has a similar visual effect as before with the log transformation. Make sense?
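
In ggplot terms, the two routes look like this: transform the variable, or transform the scale. A minimal sketch with made-up data:

```r
library(ggplot2)
d <- data.frame(x = c(1, 10, 100, 1000), y = 1:4)

# transform the data: the values change, the gridlines stay evenly spaced
ggplot(d, aes(log10(x), y)) + geom_point()

# transform the scale: the values stay raw, the gridlines become uneven
ggplot(d, aes(x, y)) + geom_point() + scale_x_log10()

# the square-root versions behave analogously
ggplot(d, aes(sqrt(x), y)) + geom_point()
ggplot(d, aes(x, y)) + geom_point() + scale_x_sqrt()
```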

One note of warning. I cannot at the moment recall any occasion where you want to scale both data and axes. I’m not sure there is any transformation like this we could easily reason about.

Here’s a published comparison of two graphics of the same data, which I think most of us will recognize as relating to the beginning of the pandemic. It compares using linear scales on the left to using a log scale on the y-axis on the right.

As the New York Times explains, the log scale helps us think about the rate of change while the linear scale helps us compare the underlying values. I recommend you actually go read this article as it’s helpful in seeing how they explain the differences for a general, mixed audience.

So that’s a very basic refresher on coordinate systems, and on transforming the scales of either axes or data, but not both.

This brings us to using coordinate systems for data.

To lead into this, let’s just be more specific when thinking about a point in space, a line, a plane, and volume. What, precisely, do we mean by these concepts? Let’s review them.

What’s a point in space? You cannot see or feel a point because it is a place without area. The point has a position that can be defined by coordinates (numbers on one, two, three, or more axes).

Questions about why we cannot see it?

We can give it arbitrary size so that we can identify approximately where on our graph it is. Or, as we’ll see later, we can also use size as another visual attribute to encode other data variables.

What’s a line? A line can be understood as an infinite number of points that are adjacent to one another. A line can be infinite or have two endpoints. The shortest distance between two points is a straight line. Now, as with points, we still can’t really see a line. We need multiple dimensions before we can see something. Let’s talk about that now.

The first thing we can see is a surface or area. A surface is defined by two lines that do not coincide or by a minimum of three points that are not located on a line. If the two lines have one coinciding point, the surface will be a plane.

Now we’ve learned about these concepts — points, lines, areas — as children, but I wanted the concepts to be in our minds when discussing data graphics because they are fundamental to data graphics. We can build from a plane by extending it in yet another direction.

This brings us to volume. A volume is an empty space defined by surfaces, lines, and points. So everything we see on a data graphic is either a plane or volume of some kind, whether or not it represents data, or more precisely, variation in data. I know at this point, these concepts are somewhat abstract or, rather, not applied. Hang with me. We’ll get to their application.

Along with points, lines, surfaces, and volumes, we can use the channels of color — hue, chroma, and luminance — in various ways. Hue is what people normally associate with color. Chroma is how saturated the color is, from vibrant to none, or just gray. Finally, luminance is how dark or light what we see is.

This brings us to data encodings.

Now the first person who systematically studied how we can use various attributes of a shape for data was a legendary map maker. His name is Jacques Bertin. He identified 7 basic attributes of points, of lines, and of areas that we can vary to show corresponding variation in data.

Let’s look at attributes now.

I’ve pulled this information from Bertin’s seminal book, Semiology of Graphics. Let’s walk through the attributes he identified together.

Columns represent points, lines, and areas. Under each, rows represent the 7 basic attributes, which are basically ordered by their effectiveness, too.

Let’s review these systematically.

So under point, we have the x and y positions on the coordinate plane. Then size of the point. Then what he calls value of the point. By value, Bertin means luminance, which is one of the attributes of color. Below value, he lists texture.

Then what he calls color, but is hue, another attribute of color. Sixth is orientation. And finally shape.

In each element of this grid, Bertin provides very abstract examples for each of these attributes for each of points, lines, and areas.

He also, on the right, provides us with a list of properties of data for which we can use these effectively.

So what I’d like to do now, as an exercise, is review a couple of published graphics and try together to identify all the marks (points, lines, areas) and the attributes of those marks used in each graphic. Let’s do the first one together now.

I’m keeping Bertin’s list of marks and attributes on the left, just making it smaller so we can see the published graphics on the right. This first graphic is from a very entry-level book by Knaflic. Her book is titled Storytelling with Data.

Try to identify her encodings. Take a moment to think about it, then we’ll take a poll.

OK. Let’s take the poll. [ACTIVATE! CLICK DOWN FOR THE POLL]

[DISCUSS]

Excellent. Let’s see the results…

[SHOW RESULTS AND DISCUSS]

[DEACTIVATE! CLICK UP TO GO BACK TO MAIN SLIDE]

Let’s do another. The second graphic was published in a newspaper, the Los Angeles Times. The graphic is to help its audience understand crime rates in California. Let’s take a look.

Looking closely now, let’s try to identify all the markings and encodings it uses.

OK. Let’s take the poll. [ACTIVATE! CLICK DOWN FOR THE POLL]

[DISCUSS]

Excellent. Let’s see the results…

[SHOW RESULTS AND DISCUSS]

[DEACTIVATE! CLICK UP TO GO BACK TO MAIN SLIDE]

Excellent, I hope this is helping us get started thinking about the mappings between data and visual channels.

In explaining and creating data graphics, we can, and should, also use an established grammar of graphics.

What do we mean by the grammar of graphics? What does grammar mean to you?

Let’s start with general definitions. Here is the definition of “grammar” from the Oxford English Dictionary. Let’s read together.

[READ]

If we borrowed ideas from the Oxford English Dictionary, we might guess that in the context of graphics, a grammar would describe the form of relationships between various components of the graphic.

Let’s review the grammar of graphics from the seminal reference in the field.

Let’s consider what Wilkinson, author of this seminal and influential text, says, reading together.

[READ]

So we should think about data mappings as a grammar to keep from limiting our ideas.

But that doesn’t mean that chart names as specific instances of graphics are not helpful. If we train ourselves to think about encodings, we can review examples that have names, chart names, to see starting points for our own encodings.

Questions about the difference between using a grammar and choosing from a list of charts? Questions about why only selecting from charts is too limiting?

Wilkinson, in his text, the grammar of graphics, establishes six components to graphics. Data, Transformations, Scales, Coordinates, Elements, and Guides.

For our Citi Bike case study, we’ve been gathering data sets containing observations of measured or recorded data variables, right?

And we’ve just discussed the idea of transformations.

We discussed a linear scale in Cartesian coordinates, for example. And we discussed how we can scale either the data or the coordinates by applying, say, a log function to those. Make sense?

And we also discussed elements. Bertin’s points, lines, areas, each of which has multiple attributes to which we can link or map data.

All of these components work together and when discussed as a grammar, allow us to create a virtually unlimited number of views of our data.

Now this discussion of a grammar has been without an implementation. Let’s see an example of how it’s implemented in a tool specifically designed as a grammar of graphics: R’s ggplot2, which as I’ve explained is available in Python too in the package plotnine, right? Can you guess what the gg in ggplot2 stands for?

OK, I’m going to lead you through this ggplot pseudo-code. I’ve written it with the components of Wilkinson’s grammar of graphics to the left, in orange, of the pseudo-code.

So I want to focus on the elements of the grammar here, as implemented in software.

In this pseudo-code on the right, I’ve sketched out generic versions of the various ggplot functions as they relate to that grammar of graphics.

Let’s walk through this together.

[DISCUSS EACH LINE]

Here I’m adding some annotation, to explain where Bertin’s visual variables fit inside this ggplot code.

We map or encode our data variables to visual variables and elements by specifying the particular element as a geom[etry], and the visual variables for that geometry by mapping aesthetics; that’s what aes stands for.

Is this helping to compare Wilkinson’s labels for components with this pseudo-code? Questions?
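
To make that concrete, here’s a generic skeleton of how those components usually line up in ggplot code. Names like my_data, var1, and var2 are placeholders, not anything from our data:

```r
library(ggplot2)

ggplot(data = my_data) +                           # Data
  geom_point(mapping = aes(x = var1, y = var2)) +  # Element, with aesthetic mappings
  scale_x_log10() +                                # Scales (transformations can live here)
  coord_cartesian() +                              # Coordinates
  labs(title = "A title", x = "var1", y = "var2")  # Guides
```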

Now, while I’ve only listed one line of geom_ code here, we layered multiple geometries on top of one another in our examples last week, right?

So we can layer as many things as we need to create an optimal communication. Let’s take a look, a simple look at how this works with a toy example.

Here are two versions of the toy example. They are almost identical. Both include two layers, each drawing a single point. One point is blue, the second point is orange. I’m also coloring the syntax to match what it draws on the graphic. The only difference between the two versions is that I switch the layers: in the first graphic, I code the orange layer, then the blue layer; in the second graphic, I code the blue layer first. I’ve also drawn both points close enough that they partly overlap.
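
A minimal sketch of the two versions might look like this; the sizes and positions are my guesses:

```r
library(ggplot2)

# version 1: orange coded first, blue second
ggplot() +
  geom_point(aes(x = 1,   y = 1), colour = "orange", size = 40) +  # drawn first, sits below
  geom_point(aes(x = 1.3, y = 1), colour = "blue",   size = 40)    # drawn second, sits on top

# version 2: the layers swapped
ggplot() +
  geom_point(aes(x = 1.3, y = 1), colour = "blue",   size = 40) +
  geom_point(aes(x = 1,   y = 1), colour = "orange", size = 40)
```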

And the result?

The layer drawn first appears on bottom, right?

So the ordering, near to far, of shapes follows the order in our code.

We layer graphical elements on one another from far, towards our eye. Does this make sense? Questions so far?

Now in these examples, we cannot see through the circles, right?

But we have another attribute of graphics available. We can make shapes completely opaque, completely transparent, or anywhere in between. Let’s look at an example.
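
As a quick sketch of that attribute, setting alpha somewhere between 0 (fully transparent) and 1 (fully opaque) lets the lower layer show through:

```r
library(ggplot2)

ggplot() +
  geom_point(aes(x = 1,   y = 1), colour = "orange", size = 40, alpha = 0.5) +
  geom_point(aes(x = 1.3, y = 1), colour = "blue",   size = 40, alpha = 0.5)
# where the circles overlap, the colors now blend instead of hiding each other
```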

Again, I’ve also given you all this code in the code demonstration files so you can run them yourself and play with it to learn.

To really drive home the point of these layers, next let’s review an award-winning information graphic, related to music, that I’ve re-created. Let’s take a look at the original now.

Here, I’m showing an interactive data graphic created by Nadieh Bremer, a well-known information designer who has won many awards. Her graphic, shown here on the left, explored the top 2000 songs in the Netherlands and their trends over time. You can click on her graphic to link to the original on the website. She made this graphic in the software tool d3.js.

Let’s break out into groups of 5 and discuss how Nadieh Bremer may have conceptually constructed this graphic … don’t worry about the interactivity right now or what code she used, but be thinking about layering and the grammar of graphics. Questions?

Also, don’t look ahead in the slides as I want you all to think together, OK?

[SCROLL DOWN FOR ONE SOLUTION]

On the left, I’m showing you a static version of Nadieh’s graphic.

For teaching purposes, I wanted to re-create this graphic using the grammar of graphics in R and ggplot using her data, to demonstrate how layers work in graphics.

So I created it layer by layer, starting with the red circle, then layering the black circle, which is the record vinyl, then layering the reflection on the vinyl, then layering the grooves in the vinyl, then adding the label. All that is really what we’ll soon learn to be “non-data ink” that helps provide a visual metaphor as context to position the data.

Then I layered the data variable — year of release — as white, transparent dots, layered each song’s position in the top 2000 as size of circles, and finally, layered annotations of the very top 10 songs.

If we reversed the layers in the code, all you would see is a black and red circle because it would hide everything else. Does that make sense?
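
Purely to illustrate that ordering, here’s a conceptual sketch of the non-data layers. This is not her code, nor my full re-creation, just the record metaphor built bottom-up:

```r
library(ggplot2)

ggplot() +
  geom_point(aes(x = 0, y = 0), colour = "red",    size = 120) +  # red backdrop, bottom layer
  geom_point(aes(x = 0, y = 0), colour = "black",  size = 110) +  # the vinyl record
  geom_point(aes(x = 0, y = 0), colour = "grey30", size = 35)  +  # the label area
  coord_equal() +
  theme_void()
# the data layers (release years, rankings, annotations) stack on after these
```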

OK, now that we have discussed and demonstrated a layered grammar of graphics, let’s first see these ideas in terms of Bertin’s visual variables and, second, use these ideas to explore our Case Study, Citi Bike.

Again, to remind us, here are Bertin’s visual channels. I’ll keep this side by side with some elements of the grammar of graphics implemented in ggplot.

Next, to the right, I’m just showing the portion of the grammar of graphics code that directly relates to Bertin’s first channel for a point. Notice these two parameters are inside the aes[thetic] function. We set each of these to a variable in the data, in other words by filling in these blanks.

Make sense?

Now all the remaining parameters are outside the aes[thetic] function. When they are outside, we are not mapping them to data. We are just specifying whatever we want them to be. But we could! Any one of these that we want to map to data, we just move that parameter inside the aes[thetic] function and set it equal to a data variable. So, in total, for a point type of mark, we can map all of these to data (and more, which I’ll eventually get to).

Here, I’ve just annotated the same thing I’ve just explained so you’ll have it as a visual reference. Let’s also review these with a toy example.

To make this visual, which is also a diagram, I’ve just created a data frame with one observation for x_ and y_, both set to zero. Then, I map it to the visual variables and set the remaining parameters.
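
A minimal sketch of that construction; the parameter values are my guesses at what the slide shows:

```r
library(ggplot2)

d <- data.frame(x_ = 0, y_ = 0)    # one observation, as described

ggplot(d) +
  geom_point(aes(x = x_, y = y_),  # position is mapped to data, inside aes()
             shape  = 21,          # a point shape that has both a stroke and a fill
             size   = 30,
             stroke = 10,
             colour = "steelblue", # stroke colour, set outside aes(), so not mapped
             fill   = "orange",    # fill colour, also set, not mapped
             alpha  = 0.5)         # half transparent; watch the stroke/fill overlap
```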

Of note, I’ve set the alpha parameter (how easily we can see through the markings) to one half, and notice that half the width of the stroke lands inside the fill area, and half outside it. It’s an odd or unfortunate behavior: ideally we’d want the fill color to be whatever is fully inside the stroke outline, but you see they overlap, and the overlap creates a color that I did not specify!

Oh no! We’ll come back to that issue next week when we specifically discuss color.

Questions about mapping visual variables to a point marking?

Now, just as with a point mark, let’s review visual variables for a line mark as implemented and described by the grammar of graphics.

Now, you should start to feel that what we are reviewing is becoming a little repetitive. That’s a good thing! It means you are catching on to the systematic grammar of graphics that lets us draw literally anything we want. Almost anything.

So, as with a point mark, the line marking behaves the same way. Only this time, we need at least two observations of both x and y to start drawing lines, right? The geom_line function, then, starts drawing them connected together.

If you want to draw multiple lines, move the group parameter inside the aes[thetic] function and give each line a unique id in the data.

And, again, you can also map any of the remaining parameters to data by moving it inside the aes[thetic] function.

By the way, there are also helper functions for more specific circumstances. For example, when you only want to draw single line segments between any two points in x, y space, you can use geom_segment.

As with a point, let’s see a toy example, as a diagram of a line.

Here’s our toy example of a line marking, actually two lines, for each of which I’ve used exactly three x, y observation pairs.
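
A sketch of that toy example, using the group parameter inside aes() to give each line its own id:

```r
library(ggplot2)

lines_d <- data.frame(
  x  = c(0, 1, 2,  0, 1, 2),
  y  = c(0, 2, 1,  1, 0, 2),
  id = rep(c("a", "b"), each = 3)  # one id per line, three observations each
)

ggplot(lines_d, aes(x, y, group = id, colour = id)) +
  geom_line()
```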

Now, I’ve used exactly three points for a reason, to show you how a line marking relates to an area marking. If we closed these lines by connecting the end point to the start point, and perhaps shade it, we’d have an area, right?

Let’s do that now.

So another word for area here, a geometric word, would be polygon, which, by definition, is a plane figure with at least three straight sides, and usually (many) more. The grammar of graphics function, then, is called geom_polygon. And here it is.

Can you guess how it maps data to the visual variable of an area?

I won’t talk through this one, then, but let’s see the toy example using the same data we just used for the toy line example.
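
Reusing the lines_d data frame from the line sketch above, closing and shading the shapes is just a change of geometry; the fill colors here are my own picks:

```r
# same data, now closed into shaded polygons
ggplot(lines_d, aes(x, y, group = id, fill = id)) +
  geom_polygon(alpha = 0.5) +
  scale_fill_manual(values = c(a = "orange", b = "steelblue"))  # manual fill scale
```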

Any surprises on how it works? Pretty straightforward, right?

Now, you’ll notice that I’ve included functions for manually scaling the fill color here. I didn’t have to do that and, if I did not, then ggplot would pick values for me. But I usually want to be precise and specify my own.

With just those three types of functions, one for points, one for lines, and one for areas or polygons, you could really draw almost anything. Now there are many, many more sort of helper functions that let you draw various types of shapes using different parameters and such.

For example, there is a geom_rect to draw rectangles, a geom_circle to draw circles, and others that match categorical names of charts, like geom_bar (for bar charts), geom_histogram (for histograms), and geom_density (for density plots). But you could also just draw these with what I’ve shown you by thinking carefully about structuring your data and making whatever variables you need.

We will look at a few of these next, by deconstructing another published, award-winning information graphic that I re-created for teaching purposes, to help us continue becoming more comfortable with the idea of encodings and the grammar of graphics.

The information graphic we’ll deconstruct was published in La Lettura, the cultural supplement of Italy’s most circulated newspaper. This information graphic was created by Giorgia Lupi, whom I’m showing you a photo of here. She created the graphic with her team. The graphic is titled “Nobels, no degrees”.

Let’s pull this graphic out of the paper, and rotate it on our screens to help us inspect it. Or, actually, I’ve coded a recreation of it that we can play with.

Before we look at my re-creation, I wanted to show you a summary of the data variables for your reference. The most important of these is the type of Nobel prize, which could be chemistry, economics, physics, literature, medicine, or peace, right? Those are in the data variable “Category”. The graphic also uses year of prize awarded, sex of the recipient, age of the recipient, and their hometown and university.

Here I’ve re-created Lupi’s graphic, rotated to be horizontal to make it easier for us to review together.

Alongside her graphic, I’m keeping Bertin’s visual channels, well, visual, so that we can reference them while discussing.

I’ve also written a blog post, which I’ve cited here, that takes you through the steps to create each component.

Now this graphic may seem complex at first blush, but it is really just a bunch of common graphics that even businesses use. And we’ll try to name them as we work through this, taking her graphic apart, graph by graph.

So how might we start identifying encodings? Let’s simplify down what she did into various components. We’ll try to separate these components so we can focus on encodings and grammar.

Sound good?

So here, to start, I’m having us focus on just one component of Lupi’s graphic. And on the left, I’m showing you the basic grammar of graphics in R/ggplot code that we can use to create it. Notice that we layer a separate function for each geometry, and within each we map encoded data variables to visual channels.

[EXPLORE THE LAYERS]

Now, this scatterplot and line chart combo used small multiples, colored by category, right?

Let’s change our focus to the multiples. On the left, you can see that all the grammar and code is the same except we also map Category to the multiples or facets, too. Pretty cool how directly these functions relate to what we see, right?
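
As a hedged sketch of this step, the only change is the facet layer. The data frame nobel and its column names here are my guesses, not the actual demo code:

```r
library(ggplot2)

ggplot(nobel, aes(x = year, y = age, colour = category)) +
  geom_point() +
  geom_smooth(se = FALSE) +   # a trend line layered over the points
  facet_wrap(~ category)      # one small multiple per prize category
```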

Let’s consider the next section of Lupi’s graphic.

Here are small multiples of what are called bar charts.

As with multiples of the graphics we just reviewed, Lupi again maps the data variable “Category” of prize to multiples or facets and to the fill color.

Within each facet, Lupi has mapped the categorical data variable “Education” to the x-axis, then mapped the percentage of Nobel prizes awarded across those categories to the y-axis.

Now, we could have just used line, again, but the graphics library has geom_bar as a helper function.

What it does, again, is create line segments whose positions all begin on a common baseline, with each ending depending on the value and the statistic used. Here, we have already calculated the end values, so the statistic is just the identity of the value we provided.

Are there questions about how geom_bar is really just a specific type of the generic line or line segment, so we could have used geom_line or geom_segment, too?

And we see two main functions. We’ve mapped the bar positions using geom_bar to Education and percent, and mapped their fill color to prize category.

Then, just like we did a minute ago, we also map multiples to prize category. Questions so far?
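
Putting those pieces together, a minimal sketch might be as follows, assuming a hypothetical, pre-aggregated data frame nobel_edu with education, percent, and category columns:

```r
library(ggplot2)

ggplot(nobel_edu, aes(x = education, y = percent, fill = category)) +
  geom_bar(stat = "identity") +  # values are precomputed, so the statistic is identity
  facet_wrap(~ category)
```

By the way, geom_col is shorthand for geom_bar(stat = "identity"). OK, let’s do another.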

You should start to see the pattern comparing the code to the visuals. These are basic stacked bar charts, with a number beside each. So we map birth city and the count of prizes to position, the x and y, and we map the fill color to category of prize.

Then we layer in the labels beside each. As these are tied to data, we use geom_text, and here we see similar mapping. Again, we use the same variables to position the text. Then we map the label, what it says, to the count of prizes.

Last, we map each thirty-year period to the multiples or facets.
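
A rough sketch of those layers, again with guessed names (a hypothetical nobel_city data frame holding a birth_city, a count n, a category, and a thirty-year period column). In a real stacked chart the label placement would need a little more care, but the mappings are the point here:

```r
library(ggplot2)

ggplot(nobel_city, aes(x = n, y = birth_city, fill = category)) +
  geom_bar(stat = "identity") +              # stacked segments per city
  geom_text(aes(label = n), hjust = -0.2) +  # the count labeled beside each bar
  facet_wrap(~ period)                       # one facet per thirty-year period
```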

I hope this is helping you feel more comfortable in thinking about data encodings mapped to Bertin’s variables and attributes, and layering them using Wilkinson’s grammar of graphics.

And I hope this is helping you see the simplicity and repetitiveness in reading the matching code so that you can make your own.

Here’s the last component in Lupi’s graphic. Now this flow is really called a Sankey chart or diagram, but the particular geometry function I make it with does something more general, so it’s called parallel sets. And you can see how these are parallel sets: the first set is categories of prizes, and the flows start there and go to another, parallel set, categories of universities. Right? This particular encoding is slightly more complicated, so you see it uses one function for the flows and another function to draw the categorical landing points. But the idea is still the same. In the functions, we literally map data variables to visual variables and attributes.
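
One implementation of parallel sets I know of lives in the ggforce extension to ggplot2; here’s a hedged sketch, where the data frame nobel_flows, with columns prize, university, and a count n, is hypothetical:

```r
library(ggforce)  # extends ggplot2 with parallel sets geometries

d <- gather_set_data(nobel_flows, 1:2)  # reshape the two categorical columns

ggplot(d, aes(x, id = id, split = y, value = n)) +
  geom_parallel_sets(aes(fill = prize), alpha = 0.4) +  # the flows
  geom_parallel_sets_axes(axis.width = 0.1)             # the categorical landing points
```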

By the way, if you want a fancier version of this type of plot, you can download another R package I wrote that creates fancier Sankey diagrams. You can find it on my website.

While Wilkinson has explained that we should think about graphics as a grammar, it can still be helpful to get ideas from existing, named, charts.

Just don’t fall into the trap of limiting ourselves to picking those named charts. Instead, start with Bertin’s visual encodings, and create exactly what you need to communicate using the grammar of graphics.

Does that make sense?

Now let’s practice in another way. Let’s return to our Citi Bike case study. Let’s explore Citi Bike data while demonstrating various encodings and highlighting which of Bertin’s visual channels we are using to create it.

During this discussion, to save time, I won’t walk through the code, but I urge you to review it; I’ve written it for you in the code demonstrations along with more explanation. These should be part of your reading when I give them to you. And when you don’t understand something in them, I urge you to post your question on our discussion forums.

We can all learn from your questions. For those who have a similar question, we all collectively answer it. For those who already know the answer, you can learn what confuses people and practice how to explain or communicate it. So whether or not you understand the underlying thing, you can learn something either way. Make sense?

So on the right, each time I’ll show you the graph I quickly made using R, and I’ve also given you the R code (and a Tableau workbook) that has a few of these exploratory graphics with the data. So our data includes longitude and latitude for each docking station.

OK, so on the right, we have an x, y two dimensional Cartesian coordinate system, the x represents longitude in the data, and the y represents latitude in the data. What we see here are all the docking stations according to these data. That’s what Bertin means by encoding data with a point on the x and y dimensions.
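
In code, this first encoding is about as small as ggplot gets. A sketch, assuming a stations data frame with longitude and latitude columns:

```r
library(ggplot2)

ggplot(stations, aes(x = longitude, y = latitude)) +
  geom_point()  # one point per docking station
```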

Questions so far?

Let’s see another example, this time positioning with the line.

OK, the data also include as an observation each trip that someone takes on a bike, and they record several variables, including the start time. So here, I’m using the same kind of graph: x now represents the start hour, which I calculated in code from the start time and date, and y represents the count of observations within each hour. I connect these with a line. Using a line shows us the change in direction of rides across hours. And this is one example of what Bertin means by positioning the line.
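
A hedged sketch of that transformation and encoding, assuming a trips data frame where each row is one ride with a start_hour column:

```r
library(dplyr)
library(ggplot2)

trips %>%
  count(start_hour) %>%                # rides per hour of the day
  ggplot(aes(x = start_hour, y = n)) +
  geom_line()
```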

Same type of encoding, position of a line. But this time, I’m graphing birth year of the rider on the x-axis and, again, the count of rides on the y-axis. Again, these are lines; specifically, a bunch of vertical lines that we can compare with each other. Odd that the line at 1969 is so much higher than the others. I wonder why?


By the way, in the code demonstrations I’ve provided you that corresponds with this slide, I showed you three ways to create this same graphic using the same data, just reorganized and transformed. One using a geom_segment, one using a geom_rect, and one using a geom_histogram. I could have also used geom_line. I recommend you review the code demonstrations as I’ve also provided more explanation in that file. And think about the different ways we can express the same thing using this grammar of graphics. Sound good?

OK, let’s do another encoding. Area.

So here, we encode geographic boundaries with area position on the x and y, we encode land and water with color, and we encode docking stations with position on the x and y. See how we’re combining all these attributes? Now the docking station points can be understood in comparison with geographic areas.

Now, actually, in this example, I did not truly map data to land or water. Instead, I just colored the background gray and specified the fill color of the land areas as white.

Let’s do another, this time actually mapping data to area fill color.

Here we map various colors to distinguish each of the boroughs. Make sense? Questions? Let’s do another, this time going back to points and adding another attribute.

Here are the same docking stations we graphed earlier. This time we encode the number of rides as the size of each point or, really, circle.
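
As a sketch, this is one added aesthetic; n_rides is a hypothetical per-station count I’m assuming was already computed:

```r
library(ggplot2)

ggplot(stations, aes(x = longitude, y = latitude, size = n_rides)) +
  geom_point(alpha = 0.5)  # size now carries a data variable, not a fixed value
```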

By the way, from an exploration point of view, what might we be starting to learn about Citi Bike using this encoding while exploring the data?

How about another example? Instead of mapping number of rides onto size of the point, let’s map the same data variable onto luminance.

Here we do just that. We map number of rides onto luminance or what Bertin calls value.
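
In ggplot, a sketch of mapping a variable to luminance is a colour mapping with a single-hue, light-to-dark gradient:

```r
library(ggplot2)

ggplot(stations, aes(x = longitude, y = latitude, colour = n_rides)) +
  geom_point() +
  scale_colour_gradient(low = "grey85", high = "grey10")  # luminance, not hue, varies
```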

I hope this is helping you feel more comfortable about thinking about Bertin’s encodings and how to use them and create them using the grammar of graphics.


Let’s keep exploring and practicing mapping data to visual variables.

OK, this time, I decided I want to know whether the number of bikes coming to the docks tends to exceed the number of docking spots. So I calculated this by taking the number arriving each hour minus the number leaving each hour, and testing whether that difference was greater than the total docking spots.

I coded that calculation as color luminance using a single hue — red.
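
A hedged sketch of both the calculation and the single-hue encoding. All the data frame and column names (trips, stations, docks, and friends) are my guesses at the structure:

```r
library(dplyr)
library(ggplot2)

arrivals   <- trips %>% count(station = end_station_id,   hour = end_hour,   name = "n_in")
departures <- trips %>% count(station = start_station_id, hour = start_hour, name = "n_out")

imbalance <- full_join(arrivals, departures, by = c("station", "hour")) %>%
  mutate(net = coalesce(n_in, 0L) - coalesce(n_out, 0L)) %>%  # net hourly inflow
  left_join(stations, by = "station") %>%                     # adds docks, longitude, latitude
  mutate(exceeds = net > docks)                               # more bikes than spots?

ggplot(imbalance, aes(x = longitude, y = latitude, colour = net)) +
  geom_point() +
  scale_colour_gradient(low = "mistyrose", high = "darkred")  # one hue, varying luminance
```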

What are we learning about the bike share from an exploratory data analysis?

The dark red stations have larger imbalances and seem to be concentrated on the Lower East Side or just above there, plus a few places in Brooklyn. So Citi Bike probably has to address rebalancing at those stations more often.

Let’s try something else, mapping data to lines plus another visual channel.

Here, we code the same start hour and number of rides, but we use vertical lines this time for each hour and we encode gender, which is male, female, and not specified, by color.
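
A sketch of that change: the only addition to the hourly counts is a colour mapping, again assuming hypothetical trips columns:

```r
library(dplyr)
library(ggplot2)

trips %>%
  count(start_hour, gender) %>%
  ggplot(aes(x = start_hour, y = n, colour = gender)) +
  geom_segment(aes(xend = start_hour, yend = 0))  # a vertical line per hour, per gender
```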

The green gender uses bikes the most. The blue gender uses them less, but seems to follow a similar pattern of use through the day. Finally, the pink gender seems very different.

Interesting. What I want you to keep focusing on, though, in this exercise is ways we can map and combine the various maps of data variables to visual elements and variables.

Let’s make a similar mapping, but change the data.

To explore that odd difference in distribution, we change the color encoding to user type, which includes customer and subscriber.

Ah, now this makes sense. Subscribers are probably local residents and we see two peaks, one for going to work and one for leaving work.

And the red color, showing customers: those are probably tourists, who use the bikes more in the middle of the day while residents are at work. These data are from 2019, by the way. I bet if we gathered data from this year, the distribution would look different.

Let’s try another, mapping data variables to three things: area, color, and points.

OK, here I’ve calculated the direction between the start and end stations for each ride, but instead of connecting the entire distance with a line, I just make a short line starting at the starting station and oriented toward the end station. I’ve also made the markings partially transparent so we get a sense of how often bikes start somewhere and point a certain direction.

I’m not sure what we really learn here. I do notice that many bike destinations seem to be pointed along Broadway.

Next, instead of just using a line segment oriented toward the ending station, I connect the two with a line.

Again, this is a really busy graphic, and we’re just exploring the data. It’s not really effective to communicate with others. None of the exploratory graphics we’ve considered tonight for Citi Bike serve as communication tools. We’ll get to the problems for communication in future lectures.

What I want you to notice is we overlay lines and their direction over the area and color encodings for geographic area.

Let’s try something else even more complex.

Here, I’m encoding the geographic areas and water land as usual, and marking stations with points and position, but I’m using the line segments differently.

Here, I’m only showing a line segment if the docking station is either empty or full.

And I’m encoding the color of the line to tell us which is the problem. Purple for empty and orange for full.

And here’s where we can transform data and get fancy. Here, I’m orienting the line segments we see by the time of day on a 24-hour clock. To do that, I have to use some basic trigonometry, a couple of simple calculations to create the end point of each line segment.
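
A sketch of that trigonometry, with a hypothetical problems data frame (station coordinates, the hour of the issue, and a status of “empty” or “full”):

```r
library(dplyr)
library(ggplot2)

len <- 0.002  # segment length in degrees, picked for visibility

segs <- problems %>%
  mutate(theta = 2 * pi * hour / 24,            # hour of day as a clock angle
         xend  = longitude + len * sin(theta),  # 0h points straight up,
         yend  = latitude  + len * cos(theta))  # angles run clockwise, like a clock

ggplot(segs) +
  geom_segment(aes(x = longitude, y = latitude, xend = xend, yend = yend,
                   colour = status)) +
  scale_colour_manual(values = c(empty = "purple", full = "orange"))
```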

On this slide, using the calculations I’ve called out in the upper right, the visual is very small, but if you viewed it larger, you could start exploring when stations tend to have problems and what kind of problem each has.

Make sense? OK, since you just turned in your draft proposal and I had given you an example draft proposal with a couple of graphics, let’s see how those were encoded.

OK, that’s plenty for us to think about for tonight. I hope our discussion and code demonstrations will give you plenty for your own practice in your homework 1, due next week. Stay active on our discussions throughout the week. Ask questions. Share things you find helpful. Answer others. Learn together.

As always, I’ve hand picked references that are best suited to going further for the topics we’ve discussed today.

I’ll stay for questions. Otherwise have a great rest of your night!