Welcome to Wednesday night live from New York!

Tonight, we will begin to shift our focus from considering how to visually encode data for exploratory purposes, to ways we can help explain the visual encodings to others. That will involve the idea of experiments with what Edward Tufte calls data-ink. That will involve annotating the graphic. That will involve the idea of a hierarchy of information.

We’ll consider the idea that we should assume all people are intelligent and just need guidance specific to our data encoding, our graphic. And, finally, we’ll consider the need to iterate our ideas and re-design for our audience.

Sound good?

But first, let’s remind ourselves how tonight’s material fits into the overall course. We’re still focusing on visual encodings of data. Next week, we will shift again, to introduce narrative. Then, later, we’ll talk about how to combine these components.

Here’s our deliverables timeline.

And last week I made individual homework two available for you to start on. That’s due next week. Between your practice for homework one, the demonstration code I’ve been providing, and what we cover in class, I think you’ll have plenty to think through to help you with this practice assignment.

Ok, like every week, I provided references in the syllabus to pre-read so we have a bit of common vocabulary in discussing the concepts together.

To kick off our discussion on how we can apply some fundamentals of communication to data visuals, let’s start by taking a poll.

[SCROLL DOWN FOR POLL]

[BACK]

Next, let’s remind ourselves about Doumont’s three laws of communication.

[ACTIVATE POLL]

Alright, I’ve activated the poll. I’d like your thoughts on this question: “Generally speaking, what is our goal in professional communication?”

I’ll give you all a minute to choose how you would answer, and then we can see how varied your responses are!

[DISCUSS]

You’ll find me reiterating this goal, and the ideas from your reading in Doumont through most of this semester. It’s that important!

[SCROLL BACK UP]

Just to refresh our memories, Doumont’s three laws of communication are to, first, adapt to our audience, second, maximize the signal-to-noise ratio, and third, use effective redundancy. These laws apply to all modes of communication, including the use of data graphics.

We’ll let this sit a few minutes to first review what we’ve already covered.

Again, we’ve been discussing how to map data to visual variables, right? Those that Bertin originally formalized. We’ve been placing these mappings from data to visual channels — which Tufte calls the data-ink encodings — within the graphic, in the plot area. Make sense?

And you’ve seen this pseudo-code before, from the grammar of graphics library that helps us map data to those markings and visual variables.

I’m circling the portion of the code we’ve been practicing so far for exploring data, not for communicating with others, right?

But there are other aspects of graphics, not typically mapped to data, that we can use when communicating our mapped data to help others understand.

On the right, we see many components of a data graphic that may or may not be visible when we look at the final, communicated graphic.

I’m not going to discuss each little component here right now, but you’ve already seen this, and I’m leaving it here so that you can review it again and reference it later in the context of our discussion tonight.

While the blue labels are specific to ggplot2 parameter names, these graphic components are similarly named in most other software implementations.

Now all these components are controlled by some of the ggplot functions.

It’s these functions we have not yet really been paying much attention to.

Instead, we’ve been focusing on those that map the data to the visual channel, right?

Here I’m showing the same pseudo-code we’ve seen before, but now I’m circling the functions where you code how this non-data ink looks. The things on the right.

So these all provide help and context for our audiences to read our data encodings.

In fact, when we use styled themes, like “theme_minimal” or “theme_void” or “theme_tufte” or “theme_economist,” all these themes do is alter the default settings in the function “theme”.

So if you want to know how a named theme creates a certain look, you can look at its settings. Let me show you now.

[SHOW THEM]
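
For your notes, here’s a minimal sketch of one way to look inside a named theme. It assumes the ggplot2 package is installed, and the variable name minimal_settings is mine, just for illustration.

```r
library(ggplot2)

# A named theme is a function that returns a complete set of theme() settings.
minimal_settings <- theme_minimal()

# List every theme element this theme sets.
names(minimal_settings)

# Inspect individual elements, for example how the theme treats the
# panel border and the axis ticks.
minimal_settings$panel.border
minimal_settings$axis.ticks

# Printing the function without parentheses shows its source, including the
# theme() call it uses to override the defaults of its parent theme.
theme_minimal
```

The same approach should work for any theme function you have loaded, since each one returns the same kind of settings list.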

All this leads to the question of how much data-ink and non-data-ink helps, versus hinders, explaining the graphic to others.

In your readings, Tufte describes how non-data ink can interfere with our audience’s understanding of the data encodings, and he creates the concept of maximizing data-ink as a method for experimenting with graphics to find what we think is optimal for our audience.

Notice that this concept sounds a lot like what we also considered from Doumont, maximizing the signal-to-noise ratio.

I’ve made some examples, adapted from Tufte’s text, that we can discuss.

I’m showing two graphics here. Both display the same data encodings. Which do you find easier to read, and why?

[DISCUSS]

Yes, in this case there are many dark gridlines that make it harder to see the data encodings. Now we can experiment with both removing non-data ink and also markings that relate to data ink. Let’s take a couple of really common graphics, a bar chart and a box plot.

Now, again, I’ve essentially re-created what you will recognize from your readings. Here, I’ve encoded some basic data, where the categories are encoded along the x axis, the values of each category positioned on the y axis, and you’ll notice a typical box and tick marks surrounding these bars. Do we need them, the tick marks and outer box? Do they help?

Let’s start experimenting.

Here, I’ve removed the top and right parts of the surrounding box. Now that we’ve removed them, did they help us understand the data? If not, consider leaving them out.

Let’s try something else.

Now, we’ve removed all of the box, but left the tick marks on the left. So did the things we removed make any difference in comparing one length of the bar to another from what is now an implicit common baseline?

And making comparisons is the point of graphics, right?

Perhaps some labeling and some form of reference grid lines will help.

I’ve added values to the tick marks on the left, but instead of adding background grid lines, I’ve tried erasing a small part of the bars along the horizontal, aligned with the tick marks.

As Tufte was, I’m trying to get you thinking about experimenting to see what works for you and your audiences.

Also, notice that although I do not explicitly draw an x-axis line anymore, because we align all the bars along it, our mind sort of sees that axis anyway. Remember those design principles?
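
If you’d like to try these bar chart experiments yourself, here’s a rough sketch of one way to do it in ggplot2. The data are made up for illustration, the object names are mine, and drawing white lines over the bars is just one possible way to imitate the erasing, not necessarily how the slide was made.

```r
library(ggplot2)

# Hypothetical data, made up for illustration only.
bars <- data.frame(
  category = c("A", "B", "C", "D", "E"),
  value    = c(12, 19, 7, 15, 10)
)

# Starting point: bars surrounded by a box, with tick marks and grid lines.
p_boxed <- ggplot(bars, aes(x = category, y = value)) +
  geom_col(fill = "gray40", width = 0.6) +
  theme_bw()

# Experiment: no box and no grid, labeled ticks kept on the left, and thin
# white lines drawn over the bars at the tick values to act as a reference grid.
p_erased <- ggplot(bars, aes(x = category, y = value)) +
  geom_col(fill = "gray40", width = 0.6) +
  geom_hline(yintercept = c(5, 10, 15), color = "white") +
  scale_y_continuous(breaks = c(5, 10, 15)) +
  theme_classic() +
  theme(
    axis.line    = element_blank(),  # no explicit axis lines
    axis.ticks.x = element_blank(),  # the bars already mark the categories
    axis.title   = element_blank()
  )

p_boxed
p_erased
```

Compare the two side by side and ask the same question as before: did the things we removed help anyone read the bars?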

Let’s try another experiment, this time with a classic looking box plot.

John Tukey, whom I’ve also briefly introduced you to, invented the box plot as a way to encode several characteristics of a distribution of observations of a variable.

If you’re unfamiliar, become familiar. But the rectangle typically shows the interquartile range, the horizontal line encodes the median, and the top and bottom vertical lines, called whiskers, help to see the range of the tails of the distribution or variation of a variable.

So this is a graphical summary that partially describes the data distribution of some variable.
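
To make those pieces concrete, here’s a small sketch in base R. The observations are made up. Note that boxplot.stats() follows the common convention of ending each whisker at the most extreme observation within 1.5 times the box length, and it reports “hinges,” which are close to but not always identical to the quartiles. That convention is R’s default, not something specific to our slides.

```r
# Hypothetical observations of one variable, made up for illustration.
values <- c(2, 4, 4, 5, 6, 7, 8, 9, 12, 30)

# boxplot.stats() returns the five numbers a Tukey-style box plot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
summary_stats <- boxplot.stats(values)

summary_stats$stats  # 2.0 4.0 6.5 9.0 12.0
summary_stats$out    # 30, drawn as a separate point beyond the upper whisker
```
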

If we place them side by side to compare distributions, the unfilled rectangles and space between them can be a little hard to visually separate sometimes.

Tufte showed us some experiments to show that we can use fewer markings to represent the same thing. Let’s consider these experiments now.

And if you compare his experiments to Tukey’s original, you can see each approach uses minimal markings to represent the same information. He does this in varying ways.

Now this brings me to a point that people studying this concept of data-ink maximization often miss.

Tufte says these are experiments and we should be reasonable in how much we remove! Reasonable. What’s that mean?

Let’s see how each of these markings looks with the 12 distributions we just reviewed with Tukey’s box plots.

Here’s Tufte’s first experiment. Consider whether it emphasizes different things than Tukey’s original.

Our eye is drawn to what we see, the markings. On the other hand, our eye tends to ignore the negative space. What do markings and negative space represent here?

The dot encodes the median.

The vertical lines encode the tails of the distribution,

And the negative or white space between the lines and dots encodes the interquartile range. Instead, what does this view emphasize? [THE TAILS]

If the interquartile range is most important for our point to the graphic, perhaps another view works better?

Here, Tufte makes the interquartile range a little thicker, darker than the lines encoding the tails.

Still encoding the same data summaries.

Does this change the emphasis of where in the distribution we are focused?

Let’s tweak it a bit more.

Here, I’ve slightly lightened the interquartile range, but also used a darker line for the median. If we’re interested in showing the median and interquartile ranges for comparison, does this seem to be an improvement over Tukey’s original?
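
Here’s a rough sketch of how you might build reduced box plots like these in ggplot2. It uses the built-in iris data as a stand-in, not the 12 distributions on the slide, and the object names are mine. The second version is only in the spirit of the thicker interquartile range, and its linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Stand-in data: the built-in iris data set, not the slide's distributions.
groups <- split(iris$Sepal.Length, iris$Species)

# One row of box plot summaries per group.
summaries <- do.call(rbind, lapply(names(groups), function(g) {
  s <- boxplot.stats(groups[[g]])$stats
  data.frame(group = g, low = s[1], q1 = s[2], med = s[3], q3 = s[4], high = s[5])
}))

# Version 1: lines for the tails, a dot for the median, and empty space
# standing in for the interquartile range.
p_tails <- ggplot(summaries, aes(x = group)) +
  geom_linerange(aes(ymin = low, ymax = q1)) +
  geom_linerange(aes(ymin = q3, ymax = high)) +
  geom_point(aes(y = med)) +
  theme_classic() +
  theme(axis.title = element_blank())

# Version 2: a thin line for the full range and a thicker line for the
# interquartile range, which shifts the emphasis to the middle.
p_iqr <- ggplot(summaries, aes(x = group)) +
  geom_linerange(aes(ymin = low, ymax = high)) +
  geom_linerange(aes(ymin = q1, ymax = q3), linewidth = 2) +
  theme_classic() +
  theme(axis.title = element_blank())

p_tails
p_iqr
```
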

So how can we know which to use?

Experience and judgment help. But importantly, so does asking other people, especially your audience, or others you think have experience similar to that of your audience.

A couple of researchers conducted an experiment, randomizing different versions of the box plot to show to participants of their study, to see which versions best enabled participants to answer questions about the distribution.

The particular study results suggested that participants had more difficulty with the version C here that Tufte has made by erasing almost everything.

Does that mean we shouldn’t use C or that we should use whatever version was reported as working best?

No. Do your own experiments for your audience to convey your particular context. The point is to experiment.

On this point, let’s together read what Tufte says:

[READ]

We find this same advice from Mike Bostock, who invented d3.js, which is ubiquitous software for creating graphics online, especially interactive graphics. He was also well-known for his data visuals at the New York Times.

Mike Bostock explains that data graphic design is a search problem.

To find the right combination of markings, start by prototyping or experimenting, trying many things quickly.

Invite criticism from others to help you improve.

Only after you’ve learned what works do you start refining it for communication.

I want you to see two examples of Mike actually doing this.

Do you all know what a flip book is?

The New York Times has a tool that essentially takes a snapshot in time of a visual that their editors are working on, so we can flip through them to see how a graphic has evolved.

Here’s the first finished graphic. And you can google the original on this to review it on the Times website. Now let’s see how Mike describes how his team made this graphic. How it evolved into what we see.

The video should start at 18:55.

How many iterations does it seem they saved before finalizing it? Hundreds? First, you saw large changes, then you started to see more refinements near the end, yes?

Let’s look at one more from Bostock. This is another published graphic. It shows how schools changed sports conferences over time. Let’s see how this graphic evolved over time.

Video should start at 35:39.

Again, it looked like they saved a hundred iterations; first making large changes, then polishing it towards the end.

So I hope seeing the number of iterations it actually takes to experiment and learn what works helps you all to see the importance of experimenting, of trying many approaches to see what works better.

I also hope our discussion has helped you to see the connections between our earlier discussions from Doumont on the fundamentals of communication, how they connect or apply to data graphics too.

So let’s think about the differences between exploring and explaining.

For exploring, the audience is ourselves. For explaining, the audience is someone else, which may, by the way, even include ourselves six months or a year from now, right?

We’re all now quite familiar with what Doumont said of our goals in communicating with others. It is to

“get our audiences to pay attention to, understand, and be able to act upon a maximum of messages, given constraints.”

And we will be practicing doing that for one audience, the analytics executive, in our individual homework number three.

That audience is assumed to understand the details and history of the organization they work at, and assumed to understand data science concepts, perhaps, in some cases, even better than we do. What will be new to them is our proposed project and reason for proposing it, right?

What about an audience that is external to the organization?

It’s typically broader and a more general audience. This means we assume they likely know less about our organization than the analytics executive, right?

Because they don’t have daily exposure to it. Perhaps they haven’t even yet been introduced. Or maybe they know more. That depends on your external audience and purpose of selecting them.

This external, broader or more general audience may also not, on average, have the depth of training in data science, even if some individuals do.

How can we communicate with such mixed audiences?

As Doumont suggests, we should try for every part of the communication to provide something interesting to those more knowledgeable in our target group while helping those less familiar, the generalists in our target group, understand too.

Doumont, if you recall, discusses these mixed audiences, and gave us a short example on how we can approach that communication.

[POLL AVAILABLE BELOW RE “MESSAGE”]

[ACTIVATE POLL]

[SCROLL BACK UP]

I’ve pulled this example from your reading earlier in the semester, from Doumont’s chapter Fundamentals. His first version states,

“We worked with IR.”

If some in the audience do not know what IR is, you confuse them and they lose interest. You alienate them in your communication.

Let’s read the second version.

“We worked with IR. IR stands for Information Resources and is a new department.”

What do you think of this? Does it help those in your mixed audience that did not know what IR stood for? How? What is that second sentence? A definition? You’re telling them a definition, right?

But what about those in the audience that already know what IR stands for?

How do they feel when they hear you telling them a definition they already know? Maybe they feel you do not understand them because if you did, you would know that they know? You’re not speaking with them from their point of view.

These first two sentences demonstrate the challenges of a mixed audience.

Doumont suggests a solution. We should weave explainers into our sentences (and we will discuss this in the next few weeks) that help the generalists in our audience, but give the specialists something new or interesting to consider. Does this third sentence do that?

This idea is directly applicable to graphics, too, and we’ll get to that in a bit.

But first, I’m arguing we should consider the purpose of the graphics communication, what we want to show, and optimize the encodings to best show the point using the principles we have discussed the last couple of weeks.

I’m talking about using visual variables and attributes of those variables.

And we should prioritize which information gets which encodings based on the empirical studies we’ve seen of accuracy in decoding them, which to some extent we all share as humans.

And then, my argument continues, we should generally adapt our annotations of the optimized encodings to our audience rather than use some other encoding.

Why might that be?

By the way, what do I mean by annotations? By explanations?

Let’s start with titles.

Aren’t titles an overall annotation to a graphic? While it is more common to see a title as a generic description of the data, that’s a waste of space. Instead, we should use titles to explain what we’re trying to show with the graphic. If the point is to show some pattern in the data, then use the title to explain the pattern as shown in the graphic.

I’ve contributed this graphic as part of our class example of the Dodgers, which we will begin to consider next week.

This audience is the Dodgers marketing executive, though of course, it could be used in a particular context for other audiences.

If we look only at the graphic on the left, does it explain what it is or what the marketing executive should take away from it?

What about on the right? Now you haven’t seen the additional context this came from yet. But take a moment to review it.

Does the title explain what the data shows as opposed to only what it is? Does it have a point? Explain.

It’s the same reason we will use informative titles in our future assignments, like the memo and proposal. The same reason we should use informative headers instead of generic headers in our proposal. Does the difference make sense? Questions about that?
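
Here’s a tiny sketch of the difference in ggplot2, where the title is set with labs(). The data are made up for illustration (they are not the Dodgers data), and the object names are mine.

```r
library(ggplot2)

# Hypothetical monthly sign-ups, made up for illustration only.
signups <- data.frame(
  month = 1:6,
  count = c(120, 125, 131, 190, 240, 310)
)

base_plot <- ggplot(signups, aes(x = month, y = count)) +
  geom_line() +
  theme_minimal()

# A generic title only says what the data is.
p_generic <- base_plot +
  labs(title = "Sign-ups by month")

# A message title says what the data shows.
p_message <- base_plot +
  labs(
    title = "Sign-ups more than doubled after month 3",
    x = "Month",
    y = "Sign-ups"
  )

p_generic
p_message
```
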

Where else can we explain? Where else can we annotate?

How about directly on the graphic itself! Let’s see another example.

It is typical for a graphic to use some kind of legend that explains the data encodings separate from the graphic.

But we make it easier for our audience, we reduce their cognitive load, we focus their attention, by directly labeling the data whenever we can!

So compare the left version, with the right version. Which is easier for you to follow?
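
If you want to try direct labeling yourself, here’s one possible sketch in ggplot2. The data and names are made up for illustration. The idea is simply to place each series name next to its last point and turn the legend off.

```r
library(ggplot2)

# Hypothetical series, made up for illustration only.
trend <- data.frame(
  year   = rep(2019:2022, times = 2),
  series = rep(c("Product A", "Product B"), each = 4),
  value  = c(10, 14, 19, 25, 12, 11, 9, 8)
)

# Legend version: the audience has to look back and forth to decode color.
p_legend <- ggplot(trend, aes(x = year, y = value, color = series)) +
  geom_line() +
  theme_minimal()

# Direct-label version: keep only the last observation of each series
# and use it to position the label at the end of the line.
last_points <- trend[trend$year == max(trend$year), ]

p_direct <- ggplot(trend, aes(x = year, y = value, color = series)) +
  geom_line() +
  geom_text(data = last_points, aes(label = series), hjust = 0, nudge_x = 0.05) +
  scale_x_continuous(
    breaks = 2019:2022,
    expand = expansion(mult = c(0.05, 0.25))  # extra room on the right for labels
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p_legend
p_direct
```
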

Along with direct labeling instead of using legends, we can annotate or explain directly in the graphic in other ways, too. Let’s take a look at another example.

I’ve pulled this example from the source’s Twitter feed as part of a Storytelling with Data Challenge. On the left is the data encoding without explanations. We have no idea what these encodings show, right?

On the right, I’ve shown the title now, and it at least does explain a little about the patterns in the data, not just what the data is: the rise and fall, right? But more than that, we see annotations directly on the graphic. What do these mini paragraphs with lines pointing to the encodings on the graphic do?

Would you agree that it places the data into context, helps show or implies cause and effect?

Now, the graphic does one more thing we’ve discussed before, right? It uses color to tie concepts together. How does this graphic use color?
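
For your own practice, here’s a minimal sketch of a callout like those mini paragraphs, using annotate() in ggplot2. The data, the note text, and the positions are all made up for illustration. Notice the note, its pointer line, and the point it refers to share one color to tie them together.

```r
library(ggplot2)

# Hypothetical yearly values that rise and then fall, for illustration only.
series <- data.frame(
  year  = 2010:2019,
  value = c(3, 5, 8, 12, 17, 21, 18, 13, 9, 6)
)

# The observation we want to explain.
peak <- series[which.max(series$value), ]

p_annotated <- ggplot(series, aes(x = year, y = value)) +
  geom_line(color = "gray40") +
  geom_point(data = peak, color = "firebrick", size = 2) +
  # A short pointer line from the note to the data point.
  annotate(
    "segment",
    x = peak$year + 1.5, xend = peak$year + 0.1,
    y = peak$value + 1.5, yend = peak$value + 0.2,
    color = "firebrick"
  ) +
  # The mini paragraph itself (placeholder text).
  annotate(
    "text",
    x = peak$year + 1.6, y = peak$value + 1.5,
    label = "Peak year: a short note here\nexplains what changed",
    hjust = 0, color = "firebrick"
  ) +
  labs(title = "The value rose for five years, then fell") +
  theme_minimal()

p_annotated
```
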

Now these ideas are not just theory in communication. Let’s hear from well-known practitioners in the field.

All the practitioner experts agree. Amanda Cox is the Data Editor at the New York Times. Let’s read what she explains together.

“The annotation layer is the most important thing we do.” “Most important.”

Another author and alumnus of Columbia University recently published a book that I recommend on starting data visualization, and in it he writes,

“Although the primary focus on creating a visualization is the graphic elements—bars, points, or lines—the text we include in and around our graphs is just as important.”

Let’s hear from one more voice. Shirley Wu is a very impressive visualization designer. Her work has appeared in high profile publications. In her recent book, which is also a wonderful reference, she writes,

“Annotations are of vital importance. Often overlooked, annotations are one of the best ways to make a chart understandable to an audience. Underutilized in many data visualizations, annotations are the ideal way to highlight exactly those things that you, as the creator, want the audience to pay attention to.”

So as we build graphics to explain, I’m going to encourage you to place equal importance on your annotations as on your data encodings! All the concepts in writing we’ve discussed remain valid for these annotations!

[SCROLL DOWN TO USE A CAPTION CONTEST!]

Let’s team up and each team decides on an audience and writes a message for the title of this graphic. Then we’ll vote on it.

So, we’ll gather into 10 groups. Write your audience and title, limited to 20 words, in a Google doc. Then we’ll vote! Sound good?

[SCROLL UP WHEN DONE]

Along with annotating to focus our audience’s attention, we should consider removing anything that does not help our audience understand, anything in the graphic that takes focus away from the message itself.

And even for context, we create focus on the main data and explanations by using something like gray to push complementary data into the background for context. And we’ll see examples of this.
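
As a quick sketch of that idea in ggplot2, one approach is to draw the context series in gray first, then draw the series that carries the point on top in color. The data and names here are made up for illustration.

```r
library(ggplot2)

# Hypothetical values for four regions, made up for illustration only.
regions <- data.frame(
  year   = rep(2018:2022, times = 4),
  region = rep(c("North", "South", "East", "West"), each = 5),
  value  = c(20, 21, 22, 22, 23,
             18, 18, 19, 20, 20,
             25, 24, 24, 23, 23,
             15, 18, 22, 27, 33)
)

focus   <- regions[regions$region == "West", ]  # the series that is the point
context <- regions[regions$region != "West", ]  # everything else, for context

p_focus <- ggplot(context, aes(x = year, y = value, group = region)) +
  geom_line(color = "gray75") +                    # context pushed to the background
  geom_line(data = focus, color = "steelblue") +   # main series in color
  geom_text(
    data = focus[focus$year == max(focus$year), ],
    aes(label = region),
    hjust = 0, nudge_x = 0.1, color = "steelblue"
  ) +
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.15))) +
  labs(title = "West grew fastest while other regions stayed flat") +
  theme_minimal()

p_focus
```
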

But first, I want to get back to my argument that we should optimize encodings and adapt to our audience through explanations.

To do that, we began considering a few design concepts important to visualizing data.

To refresh your memory, here those are. Let’s consider another. Layering.

We’ve mentioned that layering graphics is part of the grammar of graphics. And layering explicitly includes placing explanations on graphics, and using things like font size, boldness, and negative space to create hierarchies of information.

So in terms of layering information together as a data graphic,

    1. we consider how to scale our x and y coordinates, then
    1. we decide what attributes of the observation to place on those coordinates.
    1. Then we consider all the other visual channels, individually or in combination, we can use to encode other data attributes.
    1. We label anything important for our audience to understand, and we include mini-paragraphs — explainers — directly on the graphic.

Finally, we add a message for the title, we try graying out or lightening things we need for context but aren’t the direct point, and we use color or connection, for example, to link the graphic with text.
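
The steps above can be sketched as layers in ggplot2, one step per layer. This is only an illustration using the built-in mtcars data, not the code behind the slides, and the label and explainer text are my own placeholders.

```r
library(ggplot2)

# The one observation we choose to label directly.
heaviest <- mtcars[which.max(mtcars$wt), ]
heaviest$name <- rownames(heaviest)

p_layered <-
  # Steps 1 and 2: choose the coordinates and what goes on x and y.
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Step 3: use another visual channel (color) for another attribute.
  geom_point(aes(color = factor(cyl))) +
  # Step 4: label what matters, and add a short explainer on the graphic.
  geom_text(data = heaviest, aes(label = name), hjust = 1, vjust = -1) +
  annotate(
    "text", x = 3.6, y = 30, hjust = 0,
    label = "Each point is one car model"
  ) +
  # Finally: a message for the title, plus readable axis and legend labels.
  labs(
    title = "Heavier cars travel fewer miles per gallon",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

p_layered
```
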

By the way, I’m giving you the code that I used to make each of these so that you can study them and become more familiar with how we can implement layering. More generally, you should definitely study all the code I give you as I write it very precisely so that you can learn from it.

Questions so far?

All this is to steer us away from the common but terrible advice to dumb down your graphical encodings. Don’t!!

You should assume your audiences are capable of understanding optimized graphical encodings if you explain them!

If you are doubtful of mixed audiences or general audiences being able to understand more complex graphics, consider publications circulated worldwide and directed specifically to those audiences.

Columbia University is in New York, so let’s stay local and consider how the New York Times explains things to its worldwide, general audiences.

Let’s start with this example. It’s from the New York Times Opinion Section, for a general audience. This is not a basic bar graph, right? Now it’s out of context, but take a moment to think about what kind of data it seems to be encoding.

One variable is the category of disease, and within each category, it uses position along the x axis to show racial differences in mortality from the disease. It uses position along the y axis to show the overall rate, or how large a problem it is. It shades area as the interaction of these two variables.

Now on the graphic itself, the New York Times included a title, labels, and a few explainers, right?

Do you see any explanations on this graphic of how to read it? And that’s not all the context, either, right? This graphic is also within an article, and that article also helps with context, helps explain the content to their external, general audience. In the upcoming weeks, we will be discussing how to combine graphics and narrative.

By the way, newspapers do separate their papers into sections. So it may be that, to some extent, they have different external audiences for different sections. This graphic, again, is in the opinion section.

Let’s consider, then, a graphic from another section.

This graphic is from their section in politics. Again, this is more complex than a bar chart, or even a single line chart. Let’s try to understand the data encodings here. What do you see?

Now this is, again, written for an external, general audience. And there are a few labels familiar to general audiences, at least in the United States: presidents’ names beside the encoded year, and abbreviations of states along the colored lines.

Do the colors represent anything? Whether the state vote was higher for one political party than another? Do you see any other explainers? What about the word “Tie”? That gives their audience a reference point. And again, this graphic is part of an article that provides much more context.

Here’s another graphic, this time also in the opinion section. And this entire graphic is an explainer of how to read hurricane maps! It warranted an entire article after the former US “president,” on a national broadcast, misled people about how to decode these representations of uncertainty in hurricane forecasts. This is indeed a complex graphic in what it encodes. And, by the way, I plan to later discuss how to encode uncertainty.

Again, this type of encoding, with explanation and context, is intended to communicate with a general, external audience.

Let’s see another.

I’ve pulled this graphic example from the business section, from an article on the growth of e-commerce. What types of encodings do we see?

Not so simple as a bar, single line, or pie chart, right?

Again, this is for a general, external audience. We don’t need to dumb down our graphics or show less data; we need to explain the encodings after we optimize them for the purpose we want to show.

And here, what do the authors do to explain the graphic? Do they gray out some data encodings as background context? Do they label the data directly? Do they explain by directly annotating the graphic? How do they use color? Is it used for a particular purpose?

Again, this graphic gains even more context within the document that it lives in.

Notice, also, that it fits within an overall narrative. A narrative that gives perhaps a little surprise. E-commerce is growing, BUT it’s still a small component. Why? The resolution: it’s less labor intensive. This is starting to look like a narrative arc, right?

How about this graphic? I pulled it from an article in the Economy section.

Again, it’s complex, more complex than just a bar chart, or a single line chart, or a pie chart. Right? What encodings do you see?

Again, this is explained fully within an article.

Let’s look at one more.

This graphic, I’m showing from the Science section. I’m showing it actually within the article itself, or at least part of the article.

Now this encoding may look at first a little like a line chart. But it’s different. What is it encoding, and how?

As lines go, this one is pretty unfamiliar to some general audiences. What forms of explanation and annotation do you see?

You can read the article from the link I’ve shared to get more context. I’ve also included a second link because this type of chart was used to study audience engagement with unusual graphics. Would they run away or ignore it?

The researchers learned that general audiences were intrigued by the unusualness and complexity, and would engage when the authors provided clear explanations.

I hope this provides evidence of my point tonight. Which is what? [optimize encodings, then explain for your audience].

Awesome!

Alright, let’s practice redesigning graphics for our audience!

So I’ve pulled this recent publication from the US Government’s website. The Bureau of Economic Analysis. Their primary job is to explain the economy to the public. So from the document title, we gather, this was a news release last year discussing industry-specific contributions to the US gross domestic product in the first quarter of 2020.

Now take a moment to review the document. I’m pasting a link in chat, if that helps.

https://www.bea.gov/sites/default/files/2020-07/gdpind120-fax.pdf

Now, my question is, what’s the point of this graphic in this one-page news release, this important communication on the US economy?

Do the encodings optimally explain the point? And, can we do better?

So here, I’m throwing down a challenge. I’d like to place you into groups. I’ll give you the starting data that I pulled from this release to get you going. And I’d like each of your groups to try to redesign it. Now how should you use your time in the groups? Here’s my suggestion:

First, spend a few minutes just discussing what you think works well or does not work well in this graphic for whatever purpose you see it serving.

Second, and here’s where you can decide one of two ways. One way is for each of you within the group to individually try to encode the graphic in a different way than what you see. You can use paper and pencil, R with ggplot2, or both.

Each of you spend, say, 5 minutes getting started individually. If you’re not yet comfortable with either tool, there is no time like now to practice. Alternatively, you could have one member share the screen and you can all collectively build one together.

Third, start working together as a group. Show each other where you are in your redesign and, if you are stuck, see whether anyone in your group can give guidance on how to solve your conceptual or tool issue. The point is to cross-share ideas and help each other. Be patient. Be helpful.

Fourth, discuss which approaches from your group you collectively want to contribute to an overall class document. Each group must contribute at least one to the class. If you can’t decide on just one, you may submit two. Along with your redesign, write one or two sentences that say what you tried and why you think it did or did not help.

Does that make sense? These can also be mistakes where you tried something and think it didn’t work. Just label it as a mistake, and in class discussion, you can talk about why you tried it and why you think it didn’t work.

Once the groups have contributed to the google shared doc, we’ll come back as a class.

Remember, this is just practice, and it’s ok to feel like you messed up, or whatever. Sometimes the graphic mistakes can also be fun to look at, think through, and learn from.

[GROUP WORK]

Excellent! I love your attempts and examples. And I hope this is helping you to start thinking critically about how encodings and explanations can either help or distract. How they impact our audience’s understanding.

Let me show you a few ideas I had, too. Let’s go through three of them together.

[SCROLL DOWN TO SEE 3 REDESIGNS]

Here’s my first possible redesign. What differences do you see? Do they help? Do they hurt? Are they needed? Who would like to get us started?

Let’s see a second version.

Here’s a second possible redesign. Again, how do the encodings explain the point differently? Better, worse? Why? Explain.

Ok, let’s discuss one more.

Here’s my third redesign. Again, how does it compare with the original? Does the redesign more intuitively convey a point? Explain.

I hope this exercise has been helpful in getting you to experiment together and think about how much our choices for encodings and explanations can help or make more difficult what we’re trying to say to our audience!

By the way, I’ve also given you the code to these three redesigns in the code demonstration file for you to see how we can make these.

[LET’S SEE WHAT EACH OF YOU PREFERRED. POLL BELOW]

[BACK UP]

Ok, that’s plenty for us to think about for tonight. I hope our discussion and code demonstrations will give you plenty for your own practice and upcoming homework.

As always, I’ve hand-picked the references best suited to going further on the topics we’ve discussed today.

I’ll stay for questions. Otherwise have a great rest of your night!