Welcome to Wednesday night live from New York!

Tonight, we will consider a few of the basics of one of the most challenging ideas to communicate to others: uncertainty and variation.

But first, let’s remind ourselves how tonight’s material fits into the overall course. We’ve been trying to bring these ideas together.

Here’s our deliverables timeline.

Up next are your individual homework 4, due after spring break, and your group project proposals the week after that.

Before we directly discuss uncertainty, I want to touch on something we’ve discussed before: context and the data-generating process.

Professional baseball has been played since the 1800s. The game is played on a field surrounded by a fence. If the batter hits the ball over the fence, he scores a run for his team. That’s called a home run.

It’s very common for people to compare how good players are by counting things like home runs, or by computing rates like home runs per batting attempt.

Make sense so far?

Each team plays about half of their games on their own field, and the other half of their games on their opponents’ fields.

But each field is a different size. Let’s see this on an interactive graphic I created in your reader.

And some fields are outside and other fields have a roof and air controlled temperature. Some fields are near sea level; others at a higher altitude.

Given these differences, what does it mean to compare the number of homers by players A and B, who are on different teams?

To make accurate estimates, we need to account for all the context we can.

In this sense, to borrow a phrase from another author, Yanni: all data are local. In his text he describes several data analysis projects he has worked on, from visualizing data representing an arboretum, to a museum, to news reporting, to Zillow property information, to name a few.

All his projects, in one way or another, may be considered “big data.”

From those projects Yanni synthesizes his idea, with which I agree: when people focus on the collection of so-called “big data” for an analysis, it is common to forget, or remove, that data from its context, and in doing so the analysis misses what is actually represented in those data.

Whenever we consider analyzing data, whether small or large numbers of observations, we should always account for its context. Try to understand and communicate (to relevant audiences) what generated each observation. Be specific. How was it collected? By whom? Learning answers to these questions, and gaining understanding of the data’s context — how it was generated and collected — will help us perform an informed analysis.

Indeed, another author, whom I’ve already introduced to you, explains this well. Let’s read what she says together.

Data represent real life. It is a snapshot of the world in the same way that a picture catches a small moment in time. Numbers are always placeholders for something else, a way to capture a point of view — but sometimes this can get lost.

So as you move forward in future projects, always be thinking about how the data was generated, specifically, and use that knowledge to help you with your analysis and communication.

And that brings me to another aspect of data analysis: accounting for variation and uncertainty.

I’ve simulated some example data for us to discuss and relate it to variation. Let’s look at that now.

Ok, here is a map of the United States, and within each state, I’ve drawn the boundaries of smaller political sections called counties. We can see there are thousands of counties across the United States, right?

Now, I want you to imagine you have collected data on the rates of cancer in every county you see here.

Now let’s visualize the highest county-level rates, encoded as a pink hue filling each county’s location.

In this map of age-adjusted cancer rates, the counties shaded pink are those in the highest decile of the cancer-rate distribution.
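(If you’re curious how a flag like that can be computed, here is a minimal sketch, assuming a hypothetical data frame `county_rates` with columns `fips` and `rate`; this is not the code behind this graphic.)

```r
# Minimal sketch: flag counties in the top decile of a hypothetical
# data frame `county_rates` with columns `fips` and `rate`.
library(dplyr)

county_deciles <- county_rates |>
  mutate(
    decile     = ntile(rate, 10),   # 1 = lowest tenth, 10 = highest tenth
    top_decile = decile == 10       # TRUE for the counties shaded pink
  )

# Joined to county polygons, the fill could then be mapped to `top_decile`,
# e.g. with ggplot2::geom_sf() and a two-value fill scale.
```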

Do you notice any patterns?

We note that these ailing counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome might be directly due to the poverty of the rural lifestyle: no access to good medical care, a high-fat diet, and too much alcohol and tobacco.

If you worked for the US government and had to make decisions on where to allocate resources to fight cancer, would this information help?

Now let’s encode the lowest decile of rates the same way, using a different hue: blue.

This map of age-adjusted cancer rates looks very much like the last graphic, but it differs in one important detail: the counties shaded blue are those in the lowest decile of the cancer-rate distribution.

Do you see any patterns? We note that these healthy counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome is directly due to the clean living of the rural life-style—no air pollution, no water pollution, access to fresh food without additives, etc.

Did the pattern change between low and high rates?

Let’s bring these encodings together to make it easier to compare.

It seems that in many cases the lowest and highest rates of cancer are found in neighboring counties!

What might be going on?

Ok, so I asked you to imagine these data. I designed this case study based on a chapter in Howard Wainer’s book, titled Picturing the Uncertain World.

I say “based on” because, to make these graphics, I simulated the data so that we know exactly what the answer is. Let’s review that code.

As I started off by mentioning, I’ve simulated all the data. I simulated these data from a single population, having a single underlying true rate of one percent of the population.

You can see that in the bolded code on the left. And on the right, I’m showing you the code I used to graph those data.
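Since you can’t see my slide in these notes, here is a minimal sketch of the kind of simulation I mean; the specific distributions and names are my stand-ins, not the exact code from the slide.

```r
# Minimal sketch of the simulation idea: every county shares one true rate
# (1%), and observed rates vary only because county populations differ.
set.seed(2021)

true_rate  <- 0.01                                                 # single underlying rate
n_counties <- 3000
county_pop <- round(rlnorm(n_counties, meanlog = 9, sdlog = 1.3))  # skewed population sizes

cases         <- rbinom(n_counties, size = county_pop, prob = true_rate)
observed_rate <- cases / county_pop

# Small counties land in the extreme deciles far more often than large ones.
summary(observed_rate)
```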

Since the data was generated from a single population, how can we explain the patterns we just discussed?

Here’s a map encoding population sizes of the counties.

How am I encoding it?

Do you see any patterns?

How do those patterns compare with those with high and low rates?

Let’s review the equation for variation in sample means.

Now, the very influential data scientist I just mentioned, Howard Wainer, claims that the equation for the variation in the sample mean is the most dangerous equation.

Let’s review it on the left.

[REVIEW]
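For reference, here is the equation written out in the usual notation, where sigma is the standard deviation of individual measurements and n is the sample size:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

In words: the standard deviation of a sample mean shrinks with the square root of the number of observations that go into it.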

Howard explains in the book I referenced why he thinks it is so dangerous. It’s dangerous for three reasons:

  • First, it has caused people confusion for a very long time.

  • Second, that confusion has misled people in all areas of society.

  • Third, the consequences of making wrong decisions due to that confusion have cost millions of lives and vast resources.

He gives examples of this in his book. And, again, I designed our simulated data to mimic a real example he discusses there.

Let’s see the effects of this equation play out in our graphics. So back to our example.

The apparent paradox we just saw in the high and low simulated rates of cancer is explained by variation due to sample size: de Moivre’s equation in action.

Looking at the graphic on the right, the variation in the sample mean is inversely proportional to the square root of the sample size (here, county population), so small counties have much larger variation in sample means than large counties.
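To make that concrete, here is a quick numeric check, a minimal sketch using the same one-percent true rate we simulated; the county sizes are just illustrative.

```r
# Quick numeric check of de Moivre's equation in our setting:
# the standard error of an observed county rate shrinks with population.
true_rate <- 0.01
sigma     <- sqrt(true_rate * (1 - true_rate))   # sd of a single yes/no draw

county_pop <- c(1e3, 1e4, 1e5, 1e6)
se_rate    <- sigma / sqrt(county_pop)

data.frame(county_pop, se_rate)
# A county of 1,000 people has an SE of about 0.003 (a third of the true
# rate itself), while a county of 1,000,000 has an SE of about 0.0001.
```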

Our credibility and decisions informed by communication are both improved when we accurately convey information on variation and uncertainty.

Make sense?

Now perhaps you’ve heard people discuss reasons not to communicate uncertainty. Let’s review a few concerns by an expert studying communication of uncertainty.

Baruch Fischhoff, an expert in communicating uncertainty, discusses three main issues people raise as objections to communicating uncertainty.

  • First, some people are afraid that others will misinterpret quantities of uncertainty.

  • Second, some people believe others cannot use probabilities.

  • Third, some believe that communicating things like credible intervals might be used unfairly.

None of these concerns have merit. Let’s see what Baruch has to say about them.

Read together:

[READ]

So instead of hiding uncertainty, provide clear communication about that uncertainty for your audience. Doing so, you’ll have more informed audiences, and gain credibility in the process.

Speaking of clear communications, let’s discuss, for a few minutes, the impact that word choices have on communicating these concepts and more.

To be sure, our choice of words in describing quantities matters, and how those words are interpreted varies with our audience. Let’s take a look at the results of a couple of empirical studies in which researchers asked respondents to assign numeric values or probabilities to various phrases.

Let’s do this in a poll, now.

[ACTIVATE POLL]

First, we will assign numeric values to these phrases. The bonus is you get to have new information to take with you and consider. Now, as you click on values, take note that the x axis is on the log scale so that you have a wide number line to work with.

Any questions about the scale?

Cool, you can click or touch, depending on your device, a value along the number line for each phrase.

Take a few moments and we’ll see what you all think.

[DEACTIVATE POLL]

Let’s see how our results compare with the prior studies.

Now, these data you see here are a combination of two prior empirical studies. In the first study, respondents from the US CIA were asked a series of questions in the context of the reports they normally use for decision making. Those questions asked the respondents to assign numerical or probability values to words and phrases.

The second study did the same, but asked an online audience. Both sets of results came out similar, so I combined them into one graph.

So in the first study, the respondents were asked to assign numerical values to the phrases associated with some quantity, listed on the left along the y-axis. Each circle, then, represents how one respondent numerically interpreted that phrase.

Keep in mind that the x-axis is on the log base 10 scale so that we can represent a very wide range of perceived quantities and still get separation between them.
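In case you want to recreate a plot like this, here is a minimal sketch, assuming a hypothetical data frame `responses` with one row per answer and columns `phrase` and `value`.

```r
# Minimal sketch of the phrase-vs-assigned-value plot on a log10 x-axis.
library(ggplot2)

ggplot(responses, aes(x = value, y = phrase)) +
  geom_point(alpha = 0.4, position = position_jitter(height = 0.15)) +
  scale_x_log10() +   # wide range of perceived quantities, still separable
  labs(x = "Assigned numeric value (log10 scale)", y = NULL)
```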

What types of things jump out to you?

Ok, let’s do another poll, this time thinking about probabilities.

[ACTIVATE POLL]

Like last time, click or touch a value for each phrase. Then, we’ll look at what you all think.

[DEACTIVATE POLL]

And now, let’s see how these compare with the earlier studies.

In the second earlier study, respondents were asked to assign probabilities to the phrases you see listed on the left, the y-axis. The x-axis, this time, represents 0-100%.

Again, what kinds of things do you notice?

So we learn that these phrases do not have one particular probabilistic or quantitative meaning.

Does that mean, if we intend to be precise, that we shouldn’t use such phrases?

What if we have uncertainty in what the value should be? If we know it falls within a range, say, would it make sense to choose a phrase whose general interpretation varies across that range of interest?

I’d like to shift to a related topic: overstatements and how to avoid them.

As with overstating information, we want to be careful about implying causation.

[DISCUSS]

Along with considering uncertainty in our words, we should consider it in our analyses and communications of the same.

We might try to categorize types of uncertainty. Let’s see how I describe different types.

So, I’ve listed a few areas where uncertainty arises. Let’s consider those now.

First, we have uncertainty in our data and in our models of those data. We should try to be specific in explaining the unknowns and limitations of both.

Second, when running a model, we may make mistakes, or the software may produce wrong calculations because of things like overflow or underflow. I’ll show a tiny numeric example in a moment.

Third, the results of our model, which provide estimates of parameters, include uncertainty in those parameters, even though people tend to ignore it and just use point estimates, which is exactly what we should not do.

Finally, fourth: there is uncertainty in what decisions we should make once we’ve learned about our estimates.
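Before we move on, here is the tiny numeric example I promised for the second point, showing how floating-point underflow can silently produce a wrong calculation.

```r
# Multiplying many small probabilities underflows to zero,
# while summing their logs does not.
p <- rep(1e-4, 100)   # 100 tiny probabilities

prod(p)               # 0: the true value (1e-400) is below the smallest
                      # representable double (~1e-308), so it underflows
sum(log(p))           # -921.034: the log-likelihood is still computed fine
```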

Let’s consider, next, how we might visually represent these and other uncertainties.

Because I think pencil sketches can be fun, I’ve borrowed these twelve sketches of encoding types, twelve representations of a single set of data, from Joey’s website, which I’ve cited here. All of these are valid ways to show actual data and their variation.

They may also be valid ways to show estimates, though with one caveat: since an estimate is not usually a single observed point, consider representations of estimates that do not identify individual observations.

Ok, with that in mind, if an encoding can be used either way, how can we tell in what way it was intended?

Treating one as the other can be misleading. And our audience only understands what we mean through context and explanation.

So here’s what I want us to think about: what do we mean by an estimate, and how does it differ from data?

We’ve talked about data. An estimate is what we think a measurement might be, given the information we know. We typically get an estimate from a model, as a parameter value. Uncertainty differs from variation in the same way: data measurements vary, and we can have uncertainty in what those measurements are. Two different things. Make sense?

So we need to distinguish which thing we are referring to in our communications. We’ve already seen this in previous examples.

Let’s consider three examples.

For the first two examples, I’ve pulled from news organizations to see how they distinguish the two more directly, for a general audience.

The first is from a published article at the New York Times discussing the vaccine rollout. And I encourage you to go to the actual articles to review them. And this is taken out of context from the whole article, but what do you see?

Again, does the graphic refer to data? To estimates? To both? How does the article describe and distinguish these concepts?

Again, I’m just highlighting where they use language to make this difference clear. Pay special attention to how the graphed line changes to a dotted line, and how the point shapes at June 22, Aug 25, and Oct 29 become open circles. These are common ways we can try to distinguish data and estimates in the visual channels themselves.

Let’s consider another one. It’s also related to the pandemic, and was published on FiveThirtyEight.

Again, I encourage you to look at the original, not just this excerpt I’ve pulled. Actually, let’s do that together now because it includes animation.

So how do the authors indicate what are data, what are estimates?

Right, we see both annotations and the line itself changes from solid to dashed.

I’ve highlighted them for your reference.

Here’s a page from the example Dodger’s proposal you reviewed earlier in the semester. Do the visuals represent data or some kind of estimates? How do we know? Where in the communication do we learn this?

Here, I’ve highlighted the text that explains these are estimates. Notice the highlight includes information directly on the graphics as well as paragraphs discussing these graphics.

There are a couple more recent papers that help us think about new, additional ways of encoding uncertainty.

The first of these is from several authors who empirically studied how general audiences interpret distribution information. And what they found, and I encourage you to read their paper, is that people were able to make more accurate decisions when the distribution was discretized. By discretized, I mean it was chopped up into pieces that can be counted. The authors named this type of encoding a quantile dot plot because of the way it is made. But basically, it transforms the top distribution into a specified number of points in the shape of the distribution so that someone can essentially count their risk. In this example, on the bottom, there are 50 dots, and one can think about, say, three out of 50 times something occurs.
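If you want to experiment with this yourself, here is a minimal sketch of the quantile dot plot idea, not the authors’ code; the normal distribution and the wait-time framing are just illustrative stand-ins.

```r
# Minimal sketch of a quantile dot plot: take 50 evenly spaced quantiles of
# a distribution (here a normal, standing in for, say, predicted wait times)
# and draw them as countable dots.
library(ggplot2)

q <- qnorm(ppoints(50), mean = 12, sd = 3)   # 50 representative quantiles

ggplot(data.frame(q), aes(x = q)) +
  geom_dotplot(binwidth = 0.8, dotsize = 0.9) +
  labs(x = "Predicted wait (minutes)", y = NULL) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
```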

And we’ve discussed earlier how people tend to think more easily about things they can count. This idea uses that insight.

Along with this paper, I want to introduce you to a multi-channel approach to encoding uncertainty.

Helper functions have been developed as extensions to the grammar of graphics for visualizing distributions in various ways.

So here, I’m abstractly showing you a range of hues from purple to yellow in equal steps. We discussed something like this in our earlier discussion of color. And we also discussed how we can use different aspects of color (hue, saturation, or luminance), or combine them, to encode data, right?

So the authors started experimenting with encoding hue for a main measured variable, and encoding uncertainty in that variable with saturation. The effect is that as the measure becomes more uncertain, our ability to distinguish between measures decreases — as it should because we become less certain about the difference.
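Here is one minimal way to sketch that two-channel idea in plain ggplot2, assuming a hypothetical data frame `df` with columns `x`, `y`, `value`, and `uncertainty` (on a 0 to 1 scale). One substitution to note: ggplot2 has no saturation aesthetic, so I’m using alpha, which washes colors toward the background much as desaturation does.

```r
# Two channels at once: hue/luminance carries the value,
# alpha (standing in for saturation) carries the certainty.
library(ggplot2)
library(dplyr)

df2 <- df |>
  mutate(certainty = 1 - uncertainty)    # assumes uncertainty is in [0, 1]

ggplot(df2, aes(x = x, y = y, fill = value, alpha = certainty)) +
  geom_tile() +
  scale_fill_viridis_c() +               # value mapped across color
  scale_alpha(range = c(0.3, 1)) +       # more uncertain -> more washed out
  guides(alpha = "none")
```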

And if you recall, I created an R package and gave you an example of how you can encode multiple channels at once.

Does that make sense? Questions so far?

So the authors took the idea one step further. Notice that this map legend is discrete instead of continuous. There are, in this example, 16 possible encodings. And we can still distinguish the four bottom squares, just a little less easily than the top four.

So the authors actually suppressed the options, that is, the number of distinct possible values, as uncertainty increased. The most certain values receive not just higher saturation but more steps across the range of values, and as uncertainty increases, steps between the extremes are removed.
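To make that construction concrete, here is a hand-rolled sketch of a tiny value-suppressing palette; this is not the authors’ implementation, and `colorspace::desaturate()` is just one convenient way to dial saturation down.

```r
# Hand-rolled sketch of a value-suppressing palette: four hues at the most
# certain level, two duller blends at medium certainty, and a single neutral
# color when uncertainty is highest.
library(colorspace)   # for desaturate(); assumes this package is installed

base_hues <- c("#b2182b", "#ef8a62", "#67a9cf", "#2166ac")   # four value steps

vsup <- list(
  certain   = base_hues,                                                # 4 steps
  medium    = desaturate(colorRampPalette(base_hues)(5)[c(2, 4)],       # 2 steps,
                         amount = 0.4),                                 # less saturated
  uncertain = "grey70"                                                  # 1 step: suppressed
)
vsup
```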

Does this make sense? Questions?

Let’s see how this plays out in an actual example.

Another author I’ve introduced you to, Claus Wilke, created an R package that does this for you in ggplot. I’ve used his package here. In the left graphic, I used a regular scale of values mapped across hue from red to blue.

And on the right, I’ve used the value suppressing palette, mapping uncertainty in the data to saturation, too.

What are the differences in how you try to read these mappings?

So I wanted to introduce you to these additional, very useful approaches to showing uncertainty.

Ok, we have spring break next week. So I’d like to give you time to get into your project groups for planning and moving forward on your projects.

That wraps up tonight’s discussion on context, uncertainty, and variation. I hope our discussion will give you plenty for your own practice moving forward.

Here are the major resources I recommend for your reference related to our discussions tonight. That’s all for tonight. I’ll stay on for any questions. Otherwise enjoy the rest of your day or night!