Welcome to Tuesday night live from New York!

Tonight, we will consider a few of the basics of one of the most challenging ideas to communicate to others: uncertainty and variation.

But first, let’s remind ourselves how tonight’s material fits into the overall course. We’ve been trying to bring these ideas together.

Here’s our deliverables timeline.

Up next are your individual homework 4, due next week, and your group project proposals the week after that.

Before we directly discuss uncertainty, I want to touch on something we’ve discussed before: context and the data generating process.

Professional baseball has been played since the 1800s. The game is played on a field surrounded by a fence. If the batter hits the ball over the fence, he scores a run for his team. That’s called a home run.

It’s very common for people to compare how good players are by counting things like home runs, or computing the rate of home runs per batting attempt.

Make sense so far?

Each team plays about half of their games on their own field, and the other half of their games on their opponents’ fields.

But each field is a different size. Let’s see this on an interactive graphic I created in your reader.

And some fields are outside and other fields have a roof and air controlled temperature. Some fields are near sea level; others at a higher altitude.

Given these differences, what does it mean to compare the number of homers by players A and B, who are on different teams?

To make accurate estimates, we need to account for all the context we can.

In this sense, to borrow a phrase from another author, Yanni: all data are local. In his text, he describes several data analysis projects he has worked on, from visualizing data representing an arboretum, to those representing a museum, to news reporting, to Zillow property information, and others.

All his projects, in one way or another, may be considered “big data.”

From those projects Yanni synthesizes his idea, with which I agree, that when people focus on the collection of so-called “big data” for an analysis,

it is common to forget, or remove, that data’s context, and in doing so the analysis misses what is actually represented in those data.

Whenever we consider analyzing data, whether a small or large number of observations, we should always account for their context. Try to understand and communicate (to relevant audiences) what generated each observation. Be specific. How was it collected? By whom? Learning the answers to these questions, and gaining an understanding of the data’s context (how it was generated and collected) will help us perform an informed analysis.

Indeed, another author, whom I’ve already introduced you to, explains this well. Let’s read what she says together.

Data represent real life. It is a snapshot of the world in the same way that a picture catches a small moment in time. Numbers are always placeholders for something else, a way to capture a point of view — but sometimes this can get lost.

So as you move forward in future projects, always be thinking about how the data was generated, specifically, and use that knowledge to help you with your analysis and communication.

And that brings me to another aspect of data analysis: accounting for variation and uncertainty.

I’ve simulated some example data for us to discuss and relate to variation. Let’s look at that now.

Ok, here is a map of the United States, and within each state, I’ve drawn the boundaries of smaller political subdivisions called counties. We can see there are thousands of counties across the United States, right?

Now, I want you to imagine you have collected data on the rates of cancer in every county you see here.

Now let’s visualize the highest county-level rates encoded as a pink hue filling each county location.

In this map of age-adjusted cancer rates, the counties shaded pink are those in the highest decile of the cancer distribution.

Do you notice any patterns?

We note that these ailing counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome might be directly due to the poverty of the rural lifestyle: no access to good medical care, a high-fat diet, and too much alcohol and tobacco.

If you worked for the US government and had to make decisions on where to allocate resources to fight cancer, would this information help?

Now let’s encode the lowest decile of rates the same way, using a different hue: blue.

This map of age-adjusted cancer rates looks very much like the last graphic, but it differs in one important detail: the counties shaded blue are those in the lowest decile of the cancer distribution.

Do you see any patterns? We note that these healthy counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome is directly due to the clean living of the rural life-style—no air pollution, no water pollution, access to fresh food without additives, etc.

Did the pattern change between low and high rates?

Let’s bring these encodings together to make it easier to compare.

It seems that in many cases the lowest and highest rates of cancer are found in neighboring counties!

What might be going on?

Ok, so I asked you to imagine these data. I designed this case study based on a chapter in Howard Wainer’s book titled Picturing the Uncertain World.

I say “based on” because, to make these graphics, I simulated data so that we know exactly what the answer is. Let’s review that code.

As I started off by mentioning, I’ve simulated all the data. I simulated these data from a single population, with a single underlying true rate of one percent of the population.

You can see that in the bolded code on the left. And on the right side, I’m showing you the code I used to graph those data.
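If you’d like to recreate something like this at home, here is a minimal sketch in R. It is not the exact code from the slides, and the county sizes here are made up, but the key ingredient is the same: every county draws from one true rate of one percent.

```r
# A minimal sketch (not the exact lecture code): simulate county-level
# cancer counts from a single true rate of 1%, so any pattern in the
# extremes is due to sampling variation alone.
set.seed(42)
n_counties <- 3000
population <- round(rlnorm(n_counties, meanlog = 9, sdlog = 1.3))  # made-up county sizes
true_rate  <- 0.01
cases      <- rbinom(n_counties, size = population, prob = true_rate)
observed   <- cases / population

# Flag the highest and lowest deciles, as in the maps
hi <- observed >= quantile(observed, 0.9)
lo <- observed <= quantile(observed, 0.1)
summary(population[hi]); summary(population[lo])  # both extremes skew small
```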

Since the data was generated from a single population, how can we explain the patterns we just discussed?

Here’s a map encoding population sizes of the counties.

How am I encoding it?

Do you see any patterns?

How do those patterns compare with those with high and low rates?

The apparent paradox we just saw in the high and low simulated rates of cancer is explained by variation due to sample size.

Looking at the graphic on the right, the variation in the mean is inversely proportional to the square root of the sample size (here, that’s county population), and so small counties have much larger variation in sample means than large counties.
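Written out, that relationship is De Moivre’s equation for the standard error of the mean:

$$
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
$$

where σ is the standard deviation of the measurements in the population and n is the sample size, here a county’s population.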

Our credibility and decisions informed by communication are both improved when we accurately convey information on variation and uncertainty.

Make sense?

But that was in an observational study. That’s why we want to use controlled experiments, right?

I’ve created another example, this time with controlled experiments. Let’s look at that now.

An A/B test is simply business jargon for a controlled experiment. By controlled experiment, I mean we take two things, measure some result, and estimate the difference in measurements that is due to the two things and not to something else.

More concretely, let’s say we have subscribers or customers that we can email. And we have two versions of an email, and we want to know which version our customers are more likely to respond to. I’ll represent these two versions as these colored squares: blue and orange. And we’ll call the blue one the control. And the orange one, the treatment.

Next, we need to identify who we are interested in ultimately giving our email to. That’s our population. For simplicity, let’s say there are exactly 100 in our population. I’m illustrating those here as these 100 empty squares.

Now, how many Hollywood movies and shows have you seen or heard about that involve a character looping back in time to change something, hoping for a better outcome in the present? The movies “Groundhog Day” and “The Butterfly Effect” come to mind.

Or, relatedly, stories about parallel universes? Shows and movies like “The Man in the High Castle”, “Multiverse of Madness”, “Tenet” (directed by Christopher Nolan), and “Everything Everywhere All at Once” come to mind.

If we could do that with our blue and orange versions of, say, an email, let’s see how it would play out in our population.

Ok, in our pretend population, we show what all 100 of their responses would have been if they had only been given the blue control information: 29 out of 100 responded.

And on the right, we’re showing the same 100-person population and what all of their responses would have been if they had only been given the orange treatment information: 38 out of 100 responded.

Subtracting the proportions, we get a true population difference of 0.09, a 9 percentage point increase in response rate with the orange treatment.

But unlike stories in the movies, I haven’t figured out how we could reliably check the control and treatment of the same person at the same time!

In the real world, we cannot simultaneously provide and not provide a stimulus (like a drug or a placebo, or two versions of an email) to the same person, so we can never know how the person would have responded to only one or only the other. There is always missing information!

The best we can do is to split a subset of our population into two groups, giving one group the blue version, the other group the orange version, and for each group, measure the average responses. Then, use that information to infer the difference in response for some mythical “average” person.

That’s what I’m illustrating here. I’ve randomly assigned each of our 100 persons to one of two groups, receiving either the blue or the orange version. Our random assignment put 42 in the blue control group and 58 in the orange treatment group.

Next, we’ll measure how they would respond to each version.

I’ve simulated a first experiment using computer code. And here are the responses for each group. Of the 42 in the blue control group, 9 responded, and of the 58 selected to receive the orange treatment, 23 responded.

Our experimental difference in proportions is about 0.18, an 18 percentage point difference.

Oh no! Our experiment did not match our simulated population difference: the experimental difference in proportion is more than double the population difference in proportion!
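For those curious, here is a hedged sketch in R of how such an experiment can be simulated. It isn’t the exact code behind these slides, but it follows the same logic: fix everyone’s two potential responses, then randomize who receives which version.

```r
# Sketch (assumed setup; not the exact lecture code).
set.seed(1)
n  <- 100
y_control   <- sample(rep(c(1, 0), c(29, 71)))  # responses if given control
y_treatment <- sample(rep(c(1, 0), c(38, 62)))  # responses if given treatment
mean(y_treatment) - mean(y_control)             # true difference: 0.09

in_treatment <- sample(c(TRUE, FALSE), n, replace = TRUE)  # random split
mean(y_treatment[in_treatment]) - mean(y_control[!in_treatment])  # observed difference
```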

Perhaps we think we just got unlucky in our experiment. Let’s try a few more.

Here are 10 more possible experimental results. In this table, I’ve colored blue the control responses and frequencies, and orange the treatment responses and frequencies. The column on the right shows the differences in proportions.

Notice that none of those experiments exactly match our population difference either!

And I should note that even with this small population of 100, if we split them equally into control and treatment, there are about 10 to the power of 29 possible experiments that our randomization could create!
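You can verify that count yourself: the number of equal splits is the binomial coefficient 100 choose 50.

```r
# Ways to split 100 people into two equal groups of 50:
choose(100, 50)   # about 1.01e+29
```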

Indeed, let’s simulate another 1,000 experiments to get a better understanding of how much these experiments tend to differ. That tendency to differ is the sampling variation.
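Continuing the earlier sketch, replicating the experiment takes only a few more lines of R:

```r
# Repeat the randomized experiment 1,000 times, collecting the
# difference in proportions from each run.
diffs <- replicate(1000, {
  in_treatment <- sample(c(TRUE, FALSE), n, replace = TRUE)
  mean(y_treatment[in_treatment]) - mean(y_control[!in_treatment])
})
range(diffs)              # how widely the experiments spread
mean(diffs)               # centers near the true 0.09
hist(diffs, breaks = 30)  # the histogram that piles up in the animation
```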

I’ve created an animated movie to show this to you. Let’s take a look.

On the upper left, I’m showing you the first experimental split between blue control and orange treatment.

To the right of that, for the control I’m showing you good responses in blue and non-response or bad response in gray. Those in the population not selected are lighter, faded out.

Under these, I graph each proportional result, and as we watch each experiment, you’ll see the responses pile up into a histogram.

Then, to the right of that, I’m doing the same for those selected into treatment. The good responses are filled in orange, the non- or bad responses are gray, and those not selected for this experiment are faded out. As with the control group, I’m graphing the proportional response below this treatment group. And as you watch the experiments progress, you’ll see the responses pile up.

Last, in the lower right, I’m graphing the treatment proportion minus the control proportion for each experiment.

Make sense?

Ok, let’s watch my video animation of the 1,000 experiments.

Now let’s take a closer look at the histogram in the lower right, which represents the possible differences in proportions.

First, notice the range of results, these differences in proportions, from the 1,000 experiments. The results range from almost minus 10 percentage points to almost positive 30.

Moreover, the average difference of these experiments is around 0.09, about the same as our true population difference, right?

As an aside, the larger our sample sizes, the closer the experimental value will tend to be to the true population value. And that makes sense, right? Consider an extreme case: if the sample were the same size as the population, i.e., it was the population, the two results must be equal. That phenomenon is called the law of large numbers.
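Here is a quick sketch of that phenomenon, using a hypothetical 38 percent response rate:

```r
# Law of large numbers: as the sample grows, the sample mean of a
# 38% response rate settles toward 0.38.
set.seed(2)
sapply(c(10, 100, 1000, 100000), function(m) mean(rbinom(m, 1, 0.38)))
```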

It seems that variation is in everything, everywhere, all at once.

Now that we’ve seen variation in both an observational study and a controlled experiment, let’s review the equation for variation in sample means and consider why one influential data scientist has named it the most dangerous equation.

Now, the very influential data scientist I just mentioned, Howard Wainer, claims that the equation for the variation in sample means is the most dangerous equation.

Let’s review it on the left.

[REVIEW]

Howard explains in the book I referenced why he thinks it is so dangerous. It’s dangerous for three reasons:

  • First, it has caused people confusion for a very long time.

  • Second, that confusion has misled people in all areas of society.

  • Third, the consequences of making wrong decisions due to that confusion have cost millions of lives and vast resources.

He gives examples of this in his book. And, again, I designed our simulated data to mimic a real example he discusses there.

Now that we’ve explored variation, let’s consider it in our communications.

Now perhaps you’ve heard people discuss reasons not to communicate uncertainty. Let’s review a few concerns by an expert studying communication of uncertainty.

Baruch Fischhoff, an expert in communicating uncertainty, discusses three main issues people raise as objections to communicating uncertainty.

  • First, some people are afraid that others will misinterpret quantities of uncertainty.

  • Second, some people believe others cannot use probabilities.

  • Third, some believe that communicating things like credible intervals might be used unfairly.

None of these concerns have merit. Let’s see what Baruch has to say about them.

Read together:

[READ]

So instead of hiding uncertainty, provide clear communication about that uncertainty for your audience. Doing so, you’ll have more informed audiences, and gain credibility in the process.

So let’s revisit our simulated controlled experiment, and figure out what to do about that uncertainty in our communications.

One communication approach is to communicate both the magnitude of some difference and the uncertainty around that estimate. I’m showing you one way to make this calculation for our A/B test. I took lucky experiment number 8.

Then, I directly modeled what I knew about the process.

[DISCUSS]
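As one possibility (this is a sketch, not necessarily the exact model on the slide, and the counts below are placeholders rather than experiment 8’s actual values), we could model each group’s response rate with a Beta posterior and simulate the difference:

```r
# Hedged sketch: uncertainty in a difference of proportions via
# Beta posteriors with flat priors. Counts are hypothetical.
ctrl_n <- 50; ctrl_resp <- 14   # hypothetical control counts
trt_n  <- 50; trt_resp  <- 20   # hypothetical treatment counts

draws <- rbeta(1e4, trt_resp + 1,  trt_n - trt_resp + 1) -
         rbeta(1e4, ctrl_resp + 1, ctrl_n - ctrl_resp + 1)

mean(draws)                      # average difference
quantile(draws, c(0.025, 0.975)) # a 95% credible interval (the "range")
```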

This summary gives us the average difference, and a range. So let’s try to use that in a communication.

Here’s an imaginary communication to a client using that information. Let’s read together.

Speaking of clear communication, let’s explore, for a few minutes, the characteristics of the language we choose and the impact our word choices have on communicating these concepts and more.

To be sure, our choice of words in describing quantities matters, and varies depending on our audience. Let’s take a look at the results of a couple of empirical studies. In these studies, the researchers asked respondents to assign numeric values or probabilities to various phrases.

Let’s do this in a poll, now.

[ACTIVATE POLL]

Now, these data you see here are a combination of two prior empirical studies. In the first study, analysts at the US CIA were asked a series of questions in the form of, or in the context of, reports they normally use for decision making. Those questions asked the respondents to assign numerical or probability values to words and phrases.

The second study did the same, but asked an online audience. Both results came out similar, so I combined them into one graph.

So in the first study, the respondents were asked to assign numerical values to the phrases associated with some quantity, listed on the left, the y-axis. Each circle, then, represents how a respondent numerically interpreted that phrase.

Keep in mind that the x-axis is on the log base 10 scale so that we can represent a very wide range of perceived quantities and still get separation between them.

What types of things jump out to you?

Ok, let’s do another poll, this time thinking about probabilities.

In the second of the earlier studies, respondents were asked to assign probabilities to the phrases you see listed on the left, the y-axis. The x-axis, this time, represents 0 to 100%.

Again, what kinds of things do you notice?

So we learn that these phrases do not have one particular probabilistic or quantitative meaning.

Does that mean, if we intend to be precise, that we shouldn’t use such phrases?

What if we have uncertainty in what the value should be? If we know it’s within a range, say, would it make sense to choose a phrase whose general understanding varies across the range of interest?

I’d like to shift to a related topic: overstatements and how to avoid them.

Let’s take a look at a couple of original sentences and their revisions.

[DISCUSS]

As with overstating information, we want to be careful about implying causation.

[DISCUSS]

Along with considering uncertainty in our words, we should consider it in our analyses and in how we communicate them.

We might try to categorize types of uncertainty. Let’s see how I describe different types.

So, I’ve listed a few areas where uncertainty arises. Let’s consider those now.

First, we have uncertainty in our data, and in our models of that data. We should try to be specific in explaining the unknowns and limitations of both.

Second, when running a model, we may make mistakes. Or the software may make the wrong calculations because of things like overflow or underflow.

Third, the results of our model, which provide estimates of parameters, include uncertainty in those parameters, even though people tend to ignore it and just use point estimates, which we should not do.

Fourth, and finally: there is uncertainty in what decisions we should make from what we learn about our estimates.

Let’s consider, next, how we might visually represent these and other uncertainties.

Because I think pencil sketches can be fun, I’m showing twelve sketches of encoding types that may be used to show variation of some kind. These are twelve representations of a single set of data. All of these are valid ways to show actual data and their variation.

They may also be valid ways to show estimates, though I’ll make one caveat to that: since an estimate is not usually represented as a single point, consider representations that do not identify single observations.

Ok, with that in mind, if an encoding can be used either way, how can we tell in what way it was intended?

Treating one as the other can be misleading. And our audience only understands what we mean through context and explanation.

So here’s what I want us to think about: what do we mean by an estimate, and how does it differ from data?

We’ve talked about data. An estimate is what we think a measurement might be, given information we know. We typically get an estimate from a model parameter. Uncertainty differs from variation in the same way: data measurements vary, and we can have uncertainty about what those measurements are. Two different things. Make sense?

So we need to distinguish which thing we are referring to in our communications. We’ve already seen this in previous examples.

Let’s consider three examples.

For the first two examples, I’ve pulled from news organizations to see how they distinguish the two more directly, for a general audience.

The first is from a published article in the New York Times discussing the vaccine rollout. I encourage you to go to the actual articles to review them; this excerpt is taken out of the context of the whole article.

But what do you see?

Again, does the graphic refer to data? To estimates? To both? How does the article describe and distinguish these concepts?

Again, I’m just highlighting where they use language to make this difference clear. Pay special attention to how the graphic’s line changes to a dotted line, and how the point shapes at June 22, Aug 25, and Oct 29 become open circles. These are common ways to distinguish data and estimates in the visual channels themselves.
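If you want to apply that convention in your own work, here’s a minimal ggplot2 sketch (with made-up data): a solid line for observed data, a dashed line for estimates, and an open circle marking where observation ends.

```r
library(ggplot2)

set.seed(3)
obs <- data.frame(day = 1:30, y = cumsum(runif(30, 0.5, 1.5)))        # observed
est <- data.frame(day = 30:60,
                  y = obs$y[30] + cumsum(c(0, runif(30, 0.4, 1.2))))  # estimated

ggplot() +
  geom_line(data = obs, aes(day, y)) +                      # solid: data
  geom_line(data = est, aes(day, y), linetype = "dashed") + # dashed: estimates
  geom_point(data = obs[30, ], aes(day, y),
             shape = 21, fill = "white") +                  # open circle at the break
  labs(x = "Day", y = "Cumulative count")
```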

Let’s consider another one. It’s also related to the pandemic, and was published on FiveThirtyEight.

Again, I encourage you to look at the original, not just this excerpt I’ve pulled. Actually, let’s do that together now because it includes animation.

So how do the authors indicate what are data, what are estimates?

Right, we see both annotations and the line itself changes from solid to dashed.

I’ve highlighted them for your reference.

Here’s a page from the example Dodgers proposal you reviewed earlier in the semester.

Do the visuals represent data or some kind of estimates? How do we know? Where in the communication do we learn this?

Here, I’ve highlighted the text that explains these are estimates. Notice the highlight includes information directly on the graphics as well as paragraphs discussing these graphics.

There are a couple more recent papers that help us think about new, additional ways of encoding uncertainty.

The first of these is from several authors who empirically studied how general audiences interpret distribution information. What they found (and I encourage you to read their paper) is that people were able to make more accurate decisions when the distribution was discretized. By discretized, I mean chopped up into pieces that can be counted. The authors named this type of encoding a quantile dot plot because of the way it is made. Basically, it transforms the top distribution into a specified number of points in the shape of the distribution so that someone can essentially count their risk. In this example, on the bottom, there are 50 dots, and one can think about, say, three out of 50 times something occurs.

And we’ve discussed earlier how people tend to think more readily about things they can count. So this idea uses that insight.
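Here’s a minimal sketch of a quantile dot plot using the ggdist package, one grammar-of-graphics extension that supports them (assuming you have it installed); the waiting-time distribution is hypothetical:

```r
library(ggplot2)
library(ggdist)

# 50 quantiles of a hypothetical waiting-time distribution:
# each dot is a countable 1-in-50 chance.
df <- data.frame(minutes = qnorm(ppoints(50), mean = 10, sd = 3))

ggplot(df, aes(x = minutes)) +
  stat_dots() +
  labs(x = "Minutes until arrival", y = NULL)
```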

Along with this paper, I want to introduce you to a multi-channel approach to encoding uncertainty.

Helper functions have been developed as extensions to the grammar of graphics for visualizing distributions in various ways.

So here, I’m abstractly showing you a range of hues from purple to yellow in equal steps. We discussed something like this in our previous discussion of color. And we also discussed how we can use different aspects of channel — hue, saturation, or luminance — or combine them, to encode data, right?

So the authors started experimenting with encoding a main measured variable with hue, and encoding uncertainty in that variable with saturation. The effect is that as the measure becomes more uncertain, our ability to distinguish between measures decreases, as it should, because we become less certain about the difference.
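To make the two-channel idea concrete, here’s a hedged sketch in plain ggplot2 that maps value to hue and uncertainty to transparency; real value-suppressing palettes are more refined than this, but the intuition is the same:

```r
library(ggplot2)

grid <- expand.grid(value       = seq(0, 1, length.out = 8),
                    uncertainty = seq(0, 1, length.out = 4))

ggplot(grid, aes(value, uncertainty,
                 fill = value, alpha = 1 - uncertainty)) +
  geom_tile() +            # higher uncertainty fades the hue
  scale_fill_viridis_c() +
  scale_alpha_identity()
```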

And if you recall, I created an R package and gave you an example of how you can encode multiple channels at once.

Does that make sense? Questions so far?

So the authors took the idea one step further. Notice that this map legend is discrete instead of continuous. There are, in this example, 16 possible encodings. And we can still distinguish the four bottom squares, just a little less easily than the top four.

So the authors actually suppressed the options, the different possible values, as uncertainty increased. The more certain values received not just higher saturation but also more steps in the range of values; as uncertainty increases, they remove steps between the extremes.

Does this make sense? Questions?

Let’s see how this plays out in an actual example.

Another author I’ve introduced you to, Claus Wilke, created an R package that does this for you in ggplot. I’ve used his package here. In the left graphic, I used a regular mapping of values across hue from red to blue.

And on the right, I’ve used the value-suppressing palette, mapping uncertainty in the data to saturation as well.

What are the differences in how you try to read these mappings?

So I wanted to introduce you to these additional, very useful approaches to showing uncertainty.

Ok, let’s switch topics for a moment. I’d like to give you some context that I hope you’ll find helpful for starting your individual assignment, which is designed to give you more advanced practice using graphics to both explore and explain.

By the way, why are these individual? And what are the implications?

In your first two individual assignments, we’ve focused on our class case study with Citi Bike. Well, Twitter has been blowing up over the years on the issue of rebalancing. Let’s take a look.

Instead of reviewing information at the overall system level, I want us to take a look at a particular station. See if we can gain insight into rebalancing, and then consider whether we can generalize what we learn in next steps.

So I hopped onto social media and searched Twitter. Let’s see what I found…

Whoa! All kinds of complaints! This is overwhelming! Should we take a trip down to the station? Maybe the trip would help us gain some context into rebalancing issues and possible solutions.

Pack your bags!

Well, wait a minute. That might be fun, but there probably are not enough bikes for us all to hop on to get there.

So let’s virtually go.

To do that, I’ve hired a very, very low budget production company — that’s me — to give you a not so Hollywood tour. So let’s go there, and go back in time.

Let’s roll this video short.

[PLAY VIDEO]

I hope this video short will help you think about rebalancing as you individually work through homework four.

Ok, we have spring break next week. So I’d like to give you time to get into your project groups for planning and moving forward on your projects.

That wraps up tonight’s discussion on context, uncertainty, and variation. I hope our discussion will give you plenty for your own practice moving forward.

Here are the major resources I recommend for your reference related to our discussions tonight. That’s all for tonight. I’ll stay on for any questions. Otherwise enjoy the rest of your day or night!