Welcome to Tuesday night live from New York!

Tonight, we will consider a few of the basics of one of the most challenging ideas to communicate to others: uncertainty and variation.

But first, let’s remind ourselves how tonight’s material fits into the overall course. We’ve been trying to bring these ideas together.

Here’s our deliverables timeline.

Up next are your individual homework 4, due next week, and your group project proposals the week after that.

Before we directly discuss uncertainty, I want to touch on something we’ve discussed before: context and the data generating process.

Professional baseball has been played since the 1800s. The game is played on a field surrounded by a fence. If the batter hits the ball over the fence, he scores a run for his team. That’s called a home run.

It’s very common for people to compare how good players are by counting things like home runs, or computing the rate of home runs per batting attempt.

Make sense so far?

Each team plays about half of their games on their own field, and the other half of their games on their opponents’ fields.

But each field is a different size. Let’s see this on an interactive graphic I created in your reader.

And some fields are outside and other fields have a roof and air controlled temperature. Some fields are near sea level; others at a higher altitude.

Given these differences, what does it mean to compare the number of homers by players A and B, who are on different teams?

To make accurate estimates, we need to account for all the context we can.

In this sense, to borrow a phrase from another author, Yanni: all data are local. In his text, he describes several data analysis projects he has worked on, from visualizing data representing an arboretum, to those representing a museum, to news reporting, to Zillow property information, and others.

All his projects, in one way or another, may be considered “big data.”

From those projects Yanni synthesizes his idea, with which I agree, that when people focus on the collection of so-called “big data” for an analysis,

it is common to forget, or remove, that data’s context, and in doing so the analysis misses what is actually represented in those data.

Whenever we consider analyzing data, whether a small or large number of observations, we should always account for their context. Try to understand and communicate (to relevant audiences) what generated each observation. Be specific. How was it collected? By whom? Learning the answers to these questions, and gaining an understanding of the data’s context (how it was generated and collected) will help us perform an informed analysis.

Indeed, another author, whom I’ve already introduced you to, explains this well. Let’s read what she says together.

Data represent real life. It is a snapshot of the world in the same way that a picture catches a small moment in time. Numbers are always placeholders for something else, a way to capture a point of view — but sometimes this can get lost.

So as you move forward in future projects, always be thinking about how the data was generated, specifically, and use that knowledge to help you with your analysis and communication.

And that brings me to another aspect of data analysis: accounting for variation and uncertainty.

I’ve simulated some example data for us to discuss and relate to variation. Let’s look at that now.

Ok, here is a map of the United States, and within each state, I’ve drawn the boundaries of smaller political subdivisions called counties. We can see there are thousands of counties across the United States, right?

Now, I want you to imagine you have collected data on the rates of cancer in every county you see here.

Now let’s visualize the highest county-level rates encoded as a pink hue filling each county location.

In this map of age-adjusted cancer rates, the counties shaded pink are those in the highest decile of the cancer distribution.

Do you notice any patterns?

We note that these ailing counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome might be directly due to the poverty of the rural lifestyle: no access to good medical care, a high-fat diet, and too much alcohol and tobacco.

If you worked for the US government and had to make decisions on where to allocate resources to fight cancer, would this information help?

Now let’s encode the lowest decile of rates the same way, using a different hue: blue.

This map of age-adjusted cancer rates looks very much like the last graphic, but it differs in one important detail: the counties shaded blue are those in the lowest decile of the cancer distribution.

Do you see any patterns? We note that these healthy counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome is directly due to the clean living of the rural life-style—no air pollution, no water pollution, access to fresh food without additives, etc.

Did the pattern change between low and high rates?

Let’s bring these encodings together to make it easier to compare.

It seems that in many cases the lowest and highest rates of cancer are found in neighboring counties!

What might be going on?

Ok, so I asked you to imagine these data. I designed this case study based on a chapter in Howard Wainer’s book titled Picturing the Uncertain World.

I say “based on” because, to make these graphics, I simulated data so that we know exactly what the answer is. Let’s review that code.

As I started off by mentioning, I’ve simulated all the data. I simulated these data from a single population, with a single underlying true rate of one percent of the population.

You can see that in the bolded code on the left. And on the right side, I’m showing you the code I used to graph those data.
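If you’d like to recreate something like this at home, here is a minimal sketch in R. It is not the exact code from the slides, and the county sizes here are made up, but the key ingredient is the same: every county draws from one true rate of one percent.

```r
# A minimal sketch (not the exact lecture code): simulate county-level
# cancer counts from a single true rate of 1%, so any pattern in the
# extremes is due to sampling variation alone.
set.seed(42)
n_counties <- 3000
population <- round(rlnorm(n_counties, meanlog = 9, sdlog = 1.3))  # made-up county sizes
true_rate  <- 0.01
cases      <- rbinom(n_counties, size = population, prob = true_rate)
observed   <- cases / population

# Flag the highest and lowest deciles, as in the maps
hi <- observed >= quantile(observed, 0.9)
lo <- observed <= quantile(observed, 0.1)
summary(population[hi]); summary(population[lo])  # both extremes skew small
```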

Since the data was generated from a single population, how can we explain the patterns we just discussed?

Here’s a map encoding population sizes of the counties.

How am I encoding it?

Do you see any patterns?

How do those patterns compare with those with high and low rates?

The apparent paradox we just saw in the high and low simulated rates of cancer is explained by variation due to sample size.

Looking at the graphic on the right, the variation in the mean is inversely proportional to the square root of the sample size (here, that’s county population), and so small counties have much larger variation in sample means than large counties.
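Written out, that relationship is De Moivre’s equation for the standard error of the mean:

$$
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
$$

where σ is the standard deviation of the measurements in the population and n is the sample size, here a county’s population.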

Our credibility and decisions informed by communication are both improved when we accurately convey information on variation and uncertainty.

Make sense?

But that was in an observational study. That’s why we want to use controlled experiments, right?

I’ve created another example, this time with controlled experiments. Let’s look at that now.

An A/B test is simply business jargon for a controlled experiment. By controlled experiment, I mean we take two things, measure some result, and estimate the difference in measurements that is due to the two things and not to something else.

More concretely, let’s say we have subscribers or customers that we can email. And we have two versions of an email, and we want to know which version our customers are more likely to respond to. I’ll represent these two versions as these colored squares: blue and orange. And we’ll call the blue one the control. And the orange one, the treatment.

Next, we need to identify who we are interested in ultimately giving our email to. That’s our population. For simplicity, let’s say there are exactly 100 in our population. I’m illustrating those here as these 100 empty squares.

Now, how many Hollywood movies and shows have you seen or heard about that involve a character looping back in time to change something, hoping for a better outcome in the present? The movies “Groundhog Day” and “The Butterfly Effect” come to mind.

Or, relatedly, stories about parallel universes? Shows and movies like “The Man in the High Castle”, “Multiverse of Madness”, “Tenet” (directed by Christopher Nolan), and “Everything Everywhere All at Once” come to mind.

If we could do that with our blue and orange versions of, say, an email, let’s see how it would play out in our population.

Ok, in our pretend population, we show what all 100 of their responses would have been if they had only been given the blue control information: 29 out of 100 responded.

And on the right, we’re showing the same 100-person population and what all of their responses would have been if they had only been given the orange treatment information: 38 out of 100 responded.

Subtracting the proportions, we get a true population difference of 0.09, a 9 percentage point increase in response rate with the orange treatment.

But unlike stories in the movies, I haven’t figured out how we could reliably check the control and treatment of the same person at the same time!

In the real world, we cannot simultaneously provide and not provide a stimulus (like a drug or a placebo, or two versions of an email) to the same person, so we can never know how the person would have responded to only one or only the other. There is always missing information!

The best we can do is to split a subset of our population into two groups, giving one group the blue version, the other group the orange version, and for each group, measure the average responses. Then, use that information to infer the difference in response for some mythical “average” person.

That’s what I’m illustrating here. I’ve randomly assigned each of our 100 persons to one of two groups, receiving either the blue or the orange version. Our random assignment put 42 in the blue control group and 58 in the orange treatment group.

Next, we’ll measure how they would respond to each version.

I’ve simulated a first experiment using computer code. And here are the responses for each group. Of the 42 in the blue control group, 9 responded, and of the 58 selected to receive the orange treatment, 23 responded.

Our experimental difference in proportions is about 0.18, an 18 percentage point difference.

Oh no! Our experiment did not match our simulated population difference: the experimental difference in proportion is more than double the population difference in proportion!
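For those curious, here is a hedged sketch in R of how such an experiment can be simulated. It isn’t the exact code behind these slides, but it follows the same logic: fix everyone’s two potential responses, then randomize who receives which version.

```r
# Sketch (assumed setup; not the exact lecture code).
set.seed(1)
n  <- 100
y_control   <- sample(rep(c(1, 0), c(29, 71)))  # responses if given control
y_treatment <- sample(rep(c(1, 0), c(38, 62)))  # responses if given treatment
mean(y_treatment) - mean(y_control)             # true difference: 0.09

in_treatment <- sample(c(TRUE, FALSE), n, replace = TRUE)  # random split
mean(y_treatment[in_treatment]) - mean(y_control[!in_treatment])  # observed difference
```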

Perhaps we think we just got unlucky in our experiment. Let’s try a few more.

Here are 10 more possible experimental results. In this table, I’ve colored blue the control responses and frequencies, and orange the treatment responses and frequencies. The column on the right shows the differences in proportions.

Notice that none of those experiments exactly match our population difference either!

And I should note that even with this small population of 100, if we split them equally into control and treatment, there are about 10 to the power of 29 possible experiments that our randomization could create!
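You can verify that count yourself: the number of equal splits is the binomial coefficient 100 choose 50.

```r
# Ways to split 100 people into two equal groups of 50:
choose(100, 50)   # about 1.01e+29
```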

Indeed, let’s simulate another 1,000 experiments to get a better understanding of how much these experiments tend to differ. That tendency to differ is the sampling variation.
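Continuing the earlier sketch, replicating the experiment takes only a few more lines of R:

```r
# Repeat the randomized experiment 1,000 times, collecting the
# difference in proportions from each run.
diffs <- replicate(1000, {
  in_treatment <- sample(c(TRUE, FALSE), n, replace = TRUE)
  mean(y_treatment[in_treatment]) - mean(y_control[!in_treatment])
})
range(diffs)              # how widely the experiments spread
mean(diffs)               # centers near the true 0.09
hist(diffs, breaks = 30)  # the histogram that piles up in the animation
```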

I’ve created an animated movie to show this to you. Let’s take a look.

On the upper left, I’m showing you the first experimental split between blue control and orange treatment.

To the right of that, for the control I’m showing you good responses in blue and non-response or bad response in gray. Those in the population not selected are lighter, faded out.

Under these, I graph each proportional result, and as we watch each experiment, you’ll see the responses pile up into a histogram.

Then, to the right of that, I’m doing the same for those selected into treatment. The good responses are filled in orange, the non- or bad responses are gray, and those not selected for this experiment are faded out. As with the control group, I’m graphing the proportional response below this treatment group. And as you watch the experiments progress, you’ll see the responses pile up.

Last, in the lower right, I’m graphing the treatment proportion minus the control proportion for each experiment.

Make sense?

Ok, let’s watch my video animation of the 1,000 experiments.

Now let’s take a closer look at the histogram in the lower right, which represents the possible differences in proportions.

First, notice the range of results, these differences in proportions, from the 1,000 experiments. The results range from almost minus 10 percentage points to almost positive 30.

Moreover, the average difference of these experiments is around 0.09, about the same as our true population difference, right?

As an aside, the larger our sample sizes, the closer the experimental value will tend to be to the true population value. And that makes sense, right? Consider an extreme case: if the sample were the same size as the population, i.e., it was the population, the two results must be equal. That phenomenon is called the law of large numbers.
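Here is a quick sketch of that phenomenon, using a hypothetical 38 percent response rate:

```r
# Law of large numbers: as the sample grows, the sample mean of a
# 38% response rate settles toward 0.38.
set.seed(2)
sapply(c(10, 100, 1000, 100000), function(m) mean(rbinom(m, 1, 0.38)))
```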

It seems that variation is in everything, everywhere, all at once.

Now that we’ve seen variation in both an observational study and a controlled experiment, let’s review the equation for variation in sample means and consider why one influential data scientist has named it the most dangerous equation.

Now, the very influential data scientist I just mentioned, Howard Wainer, claims that the equation for the variation in sample means is the most dangerous equation.

Let’s review it on the left.

[REVIEW]

Howard explains in the book I referenced why he thinks it is so dangerous. It’s dangerous for three reasons:

  • First, it has caused people confusion for a very long time.

  • Second, that confusion has misled people in all areas of society.

  • Third, the consequences of making wrong decisions due to that confusion have cost millions of lives and vast resources.

He gives examples of this in his book. And, again, I designed our simulated data to mimic a real example he discusses there.

Now that we’ve explored variation, let’s consider it in our communications.

Now perhaps you’ve heard people discuss reasons not to communicate uncertainty. Let’s review a few concerns by an expert studying communication of uncertainty.

Baruch Fischhoff, an expert in communicating uncertainty, discusses three main issues people raise as objections to communicating uncertainty.

  • First, some people are afraid that others will misinterpret quantities of uncertainty.

  • Second, some people believe others cannot use probabilities.

  • Third, some believe that communicating things like credible intervals might be used unfairly.

None of these concerns have merit. Let’s see what Baruch has to say about them.

Read together:

[READ]

So instead of hiding uncertainty, provide clear communication about that uncertainty for your audience. Doing so, you’ll have more informed audiences, and gain credibility in the process.

So let’s revisit our simulated controlled experiment, and figure out what to do about that uncertainty in our communications.

One communication approach is to communicate both the magnitude of some difference and the uncertainty around that estimate. I’m showing you one way to make this calculation for our A/B test. I took lucky experiment number 8.

Then, I directly modeled what I knew about the process.

[DISCUSS]
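As one possibility (this is a sketch, not necessarily the exact model on the slide, and the counts below are placeholders rather than experiment 8’s actual values), we could model each group’s response rate with a Beta posterior and simulate the difference:

```r
# Hedged sketch: uncertainty in a difference of proportions via
# Beta posteriors with flat priors. Counts are hypothetical.
ctrl_n <- 50; ctrl_resp <- 14   # hypothetical control counts
trt_n  <- 50; trt_resp  <- 20   # hypothetical treatment counts

draws <- rbeta(1e4, trt_resp + 1,  trt_n - trt_resp + 1) -
         rbeta(1e4, ctrl_resp + 1, ctrl_n - ctrl_resp + 1)

mean(draws)                      # average difference
quantile(draws, c(0.025, 0.975)) # a 95% credible interval (the "range")
```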

This summary gives us the average difference, and a range. So let’s try to use that in a communication.

Here’s an imaginary communication to a client using that information. Let’s read together.

Speaking of clear communication, let’s explore, for a few minutes, the characteristics of the language we choose and the impact our word choices have on communicating these concepts and more.

To be sure, our choice of words in describing quantities matters, and varies depending on our audience. Let’s take a look at the results of a couple of empirical studies. In these studies, the researchers asked respondents to assign numeric values or probabilities to various phrases.

Let’s do this in a poll, now.

[ACTIVATE POLL]

Now, these data you see here are a combination of two prior empirical studies. In the first study, analysts at the US CIA were asked a series of questions in the form of, or in the context of, reports they normally use for decision making. Those questions asked the respondents to assign numerical or probability values to words and phrases.

The second study did the same, but asked an online audience. Both results came out similar, so I combined them into one graph.

So in the first study, the respondents were asked to assign numerical values to the phrases associated with some quantity, listed on the left, the y-axis. Each circle, then, represents how a respondent numerically interpreted that phrase.

Keep in mind that the x-axis is on the log base 10 scale so that we can represent a very wide range of perceived quantities and still get separation between them.

What types of things jump out to you?

Ok, let’s do another poll, this time thinking about probabilities.

In the second of the earlier studies, respondents were asked to assign probabilities to the phrases you see listed on the left, the y-axis. The x-axis, this time, represents 0 to 100%.

Again, what kinds of things do you notice?

So we learn that these phrases do not have one particular probabilistic or quantitative meaning.

Does that mean, if we intend to be precise, that we shouldn’t use such phrases?

What if we have uncertainty in what the value should be? If we know it’s within a range, say, would it make sense to choose a phrase whose general understanding varies across the range of interest?

I’d like to shift to a related topic: overstatements and how to avoid them.

Let’s take a look at a couple of original sentences and their revisions.

[DISCUSS]

As with overstating information, we want to be careful about implying causation.

[DISCUSS]

Along with considering uncertainty in our words, we should consider it in our analyses and in how we communicate them.

We might try to categorize types of uncertainty. Let’s see how I describe different types.

So, I’ve listed a few areas where uncertainty arises. Let’s consider those now.

First, we have uncertainty in our data, and in our models of that data. We should try to be specific in explaining the unknowns and limitations of both.

Second, when running a model, we may make mistakes. Or the software may make the wrong calculations because of things like overflow or underflow.

Third, the results of our model, which provide estimates of parameters, include uncertainty in those parameters, even though people tend to ignore it and just use point estimates, which we should not do.

Fourth, and finally: there is uncertainty in what decisions we should make from what we learn about our estimates.

Let’s consider, next, how we might visually represent these and other uncertainties.

Because I think pencil sketches can be fun, I’m showing twelve sketches of encoding types that may be used to show variation of some kind. These are twelve representations of a single set of data. All of these are valid ways to show actual data and their variation.

They may also be valid ways to show estimates, though I’ll make one caveat to that: since an estimate is not usually represented as a single point, consider representations that do not identify single observations.

Ok, with that in mind, if an encoding can be used either way, how can we tell in what way it was intended?

Treating one as the other can be misleading. And our audience only understands what we mean through context and explanation.

So here’s what I want us to think about: what do we mean by an estimate, and how does it differ from data?

We’ve talked about data. An estimate is what we think a measurement might be, given information we know. We typically get an estimate from a model parameter. Uncertainty differs from variation in the same way: data measurements vary, and we can have uncertainty about what those measurements are. Two different things. Make sense?

So we need to distinguish which thing we are referring to in our communications. We’ve already seen this in previous examples.

Let’s consider three examples.

For the first two examples, I’ve pulled from news organizations to see how they distinguish the two more directly, for a general audience.

The first is from a published article in the New York Times discussing the vaccine rollout. I encourage you to go to the actual articles to review them; this excerpt is taken out of the context of the whole article.

But what do you see?

Again, does the graphic refer to data? To estimates? To both? How does the article describe and distinguish these concepts?

Again, I’m just highlighting where they use language to make this difference clear. Pay special attention to how the graphic’s line changes to a dotted line, and how the point shapes at June 22, Aug 25, and Oct 29 become open circles. These are common ways to distinguish data and estimates in the visual channels themselves.
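If you want to apply that convention in your own work, here’s a minimal ggplot2 sketch (with made-up data): a solid line for observed data, a dashed line for estimates, and an open circle marking where observation ends.

```r
library(ggplot2)

set.seed(3)
obs <- data.frame(day = 1:30, y = cumsum(runif(30, 0.5, 1.5)))        # observed
est <- data.frame(day = 30:60,
                  y = obs$y[30] + cumsum(c(0, runif(30, 0.4, 1.2))))  # estimated

ggplot() +
  geom_line(data = obs, aes(day, y)) +                      # solid: data
  geom_line(data = est, aes(day, y), linetype = "dashed") + # dashed: estimates
  geom_point(data = obs[30, ], aes(day, y),
             shape = 21, fill = "white") +                  # open circle at the break
  labs(x = "Day", y = "Cumulative count")
```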

Let’s consider another one. It’s also related to the pandemic, and was published on FiveThirtyEight.

Again, I encourage you to look at the original, not just this excerpt I’ve pulled. Actually, let’s do that together now because it includes animation.

So how do the authors indicate what are data, what are estimates?

Right, we see both annotations and the line itself changes from solid to dashed.

I’ve highlighted them for your reference.

Here’s a page from the example Dodgers proposal you reviewed earlier in the semester.

Do the visuals represent data or some kind of estimates? How do we know? Where in the communication do we learn this?

Here, I’ve highlighted the text that explains these are estimates. Notice the highlight includes information directly on the graphics as well as paragraphs discussing these graphics.

There are a couple more recent papers that help us think about new, additional ways of encoding uncertainty.

The first of these is from several authors who empirically studied how general audiences interpret distribution information. What they found (and I encourage you to read their paper) is that people were able to make more accurate decisions when the distribution was discretized. By discretized, I mean chopped up into pieces that can be counted. The authors named this type of encoding a quantile dot plot because of the way it is made. Basically, it transforms the top distribution into a specified number of points in the shape of the distribution so that someone can essentially count their risk. In this example, on the bottom, there are 50 dots, and one can think about, say, three out of 50 times something occurs.

And we’ve discussed earlier how people tend to think more readily about things they can count. So this idea uses that insight.
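Here’s a minimal sketch of a quantile dot plot using the ggdist package, one grammar-of-graphics extension that supports them (assuming you have it installed); the waiting-time distribution is hypothetical:

```r
library(ggplot2)
library(ggdist)

# 50 quantiles of a hypothetical waiting-time distribution:
# each dot is a countable 1-in-50 chance.
df <- data.frame(minutes = qnorm(ppoints(50), mean = 10, sd = 3))

ggplot(df, aes(x = minutes)) +
  stat_dots() +
  labs(x = "Minutes until arrival", y = NULL)
```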

Along with this paper, I want to introduce you to a multi-channel approach to encoding uncertainty.

Helper functions have been developed as extensions to the grammar of graphics for visualizing distributions in various ways.

So here, I’m abstractly showing you a range of hues from purple to yellow in equal steps. We discussed something like this in our previous discussion of color. And we also discussed how we can use different aspects of channel — hue, saturation, or luminance — or combine them, to encode data, right?

So the authors started experimenting with encoding a main measured variable with hue, and encoding uncertainty in that variable with saturation. The effect is that as the measure becomes more uncertain, our ability to distinguish between measures decreases, as it should, because we become less certain about the difference.
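To make the two-channel idea concrete, here’s a hedged sketch in plain ggplot2 that maps value to hue and uncertainty to transparency; real value-suppressing palettes are more refined than this, but the intuition is the same:

```r
library(ggplot2)

grid <- expand.grid(value       = seq(0, 1, length.out = 8),
                    uncertainty = seq(0, 1, length.out = 4))

ggplot(grid, aes(value, uncertainty,
                 fill = value, alpha = 1 - uncertainty)) +
  geom_tile() +            # higher uncertainty fades the hue
  scale_fill_viridis_c() +
  scale_alpha_identity()
```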

And if you recall, I created an R package and gave you an example of how you can encode multiple channels at once.

Does that make sense? Questions so far?

So the authors took the idea one step further. Notice that this map legend is discrete instead of continuous. There are, in this example, 16 possible encodings. And we can still distinguish the four bottom squares, just a little less easily than the top four.

So the authors actually suppressed the options, the different possible values, as uncertainty increased. The more certain values received not just higher saturation but also more steps in the range of values; as uncertainty increases, they remove steps between the extremes.

Does this make sense? Questions?

Let’s see how this plays out in an actual example.

Another author I’ve introduced you to, Claus Wilke, created an R package that does this for you in ggplot. I’ve used his package here. In the left graphic, I used a regular mapping of values across hue from red to blue.

And on the right, I’ve used the value-suppressing palette, mapping uncertainty in the data to saturation as well.

What are the differences in how you try to read these mappings?

So I wanted to introduce you to these additional, very useful approaches to showing uncertainty.

Ok, let’s switch topics for a moment. I’d like to give you some context that I hope you’ll find helpful for starting your individual assignment, which is designed to give you more advanced practice using graphics to both explore and explain.

By the way, why are these individual? And what are the implications?

In your first two individual assignments, we’ve focused on our class case study with Citi Bike. Well, Twitter has been blowing up over the years on the issue of rebalancing. Let’s take a look.

Instead of reviewing information at the overall system level, I want us to take a look at a particular station. See if we can gain insight into rebalancing, and then consider whether we can generalize what we learn in next steps.

So I hopped onto social media and searched Twitter. Let’s see what I found…

Whoa! All kinds of complaints! This is overwhelming! Should we take a trip down to the station? Maybe the trip would help us gain some context into rebalancing issues and possible solutions.

Pack your bags!

Well, wait a minute. That might be fun, but there probably are not enough bikes for us all to hop on to get there.

So let’s virtually go.

To do that, I’ve hired a very, very low budget production company — that’s me — to give you a not so Hollywood tour. So let’s go there, and go back in time.

Let’s roll this video short.

[PLAY VIDEO]

I hope this video short will help you think about rebalancing as you individually work through homework four.

Ok, we have spring break next week. So I’d like to give you time to get into your project groups for planning and moving forward on your projects.

That wraps up tonight’s discussion on context, uncertainty, and variation. I hope our discussion will give you plenty for your own practice moving forward.

Here are the major resources I recommend for your reference related to our discussions tonight. That’s all for tonight. I’ll stay on for any questions. Otherwise enjoy the rest of your day or night!