Welcome to Wednesday night live from New York!

Tonight, we will consider a few of the basics of one of the most challenging ideas to communicate to others: uncertainty and variation.

But first, let’s remind ourselves how tonight’s material fits into the overall course. We’ve been trying to bring these ideas together.

Here’s our deliverables timeline.

Up next are your individual homework 4, due after spring break, and your group project proposals the week after that.

Before we directly discuss uncertainty, I want to touch on something we’ve discussed before: context and the data-generating process.

Professional baseball has been played since the 1800s. The game is played on a field surrounded by a fence. If the batter hits the ball over the fence, he scores a run for his team. That’s called a home run.

It’s very common for people to compare how good players are by counting things like home runs, or by computing rates like home runs per batting attempt.

Make sense so far?

Each team plays about half of their games on their own field, and the other half of their games on their opponents’ fields.

But each field is a different size. Let’s see this on an interactive graphic I created in your reader.

And some fields are outside and other fields have a roof and air controlled temperature. Some fields are near sea level; others at a higher altitude.

Given these differences, what does it mean to compare the number of homers by players A and B, who are on different teams?

To make accurate estimates, we need to account for all the context we can.

In this sense, to borrow a phrase from another author, Yanni: all data are local. In his text he describes several data analysis projects he has worked on, from visualizing data representing an arboretum, to a museum, to news reporting, to Zillow property information, to name a few.

All his projects, in one way or another, may be considered “big data.”

From those projects Yanni synthesizes his idea, with which I agree: when people focus on the collection of so-called “big data” for an analysis, it is common to forget, or remove, that data from its context, and in doing so the analysis misses what is actually represented in those data.

Whenever we consider analyzing data, whether small or large numbers of observations, we should always account for its context. Try to understand and communicate (to relevant audiences) what generated each observation. Be specific. How was it collected? By whom? Learning answers to these questions, and gaining understanding of the data’s context — how it was generated and collected — will help us perform an informed analysis.

Indeed, another author, whom I’ve already introduced to you, explains this well. Let’s read what she says together.

Data represent real life. It is a snapshot of the world in the same way that a picture catches a small moment in time. Numbers are always placeholders for something else, a way to capture a point of view — but sometimes this can get lost.

So as you move forward in future projects, always be thinking about how the data was generated, specifically, and use that knowledge to help you with your analysis and communication.

And that brings me to another aspect of data analysis: accounting for variation and uncertainty.

I’ve simulated some example data for us to discuss and relate it to variation. Let’s look at that now.

Ok, here is a map of the United States, and within each state, I’ve drawn the boundaries of smaller political sections called counties. We can see there are thousands of counties across the United States, right?

Now, I want you to imagine you have collected data on the rates of cancer in every county you see here.

Now let’s visualize the highest county-level rates, encoded as a pink hue filling each county’s location.

In this map of age-adjusted cancer rates, the counties shaded pink are those in the highest decile of the cancer-rate distribution.
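(If you’re curious how a flag like that can be computed, here is a minimal sketch, assuming a hypothetical data frame `county_rates` with columns `fips` and `rate`; this is not the code behind this graphic.)

```r
# Minimal sketch: flag counties in the top decile of a hypothetical
# data frame `county_rates` with columns `fips` and `rate`.
library(dplyr)

county_deciles <- county_rates |>
  mutate(
    decile     = ntile(rate, 10),   # 1 = lowest tenth, 10 = highest tenth
    top_decile = decile == 10       # TRUE for the counties shaded pink
  )

# Joined to county polygons, the fill could then be mapped to `top_decile`,
# e.g. with ggplot2::geom_sf() and a two-value fill scale.
```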

Do you notice any patterns?

We note that these ailing counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome might be directly due to the poverty of the rural lifestyle: no access to good medical care, a high-fat diet, and too much alcohol and tobacco.

If you worked for the US government and had to make decisions on where to allocate resources to fight cancer, would this information help?

Now let’s encode the lowest decile of rates the same way, using a different hue: blue.

This map of age-adjusted cancer rates looks very much like the last graphic, but it differs in one important detail: the counties shaded blue are those in the lowest decile of the cancer-rate distribution.

Do you see any patterns? We note that these healthy counties tend to be very rural, midwestern, southern, and western counties.

Why might that be?

It is both easy and tempting to infer that this outcome is directly due to the clean living of the rural life-style—no air pollution, no water pollution, access to fresh food without additives, etc.

Did the pattern change between low and high rates?

Let’s bring these encodings together to make it easier to compare.

It seems that in many cases the lowest and highest rates of cancer are found in neighboring counties!

What might be going on?

Ok, so I asked you to imagine these data. I designed this case study based on a chapter in Howard Wainer’s book, titled Picturing the Uncertain World.

I say “based on” because, to make these graphics, I simulated the data so that we know exactly what the answer is. Let’s review that code.

As I started off by mentioning, I’ve simulated all the data. I simulated these data from a single population, having a single underlying true rate of one percent of the population.

You can see that in the bolded code on the left. And on the right, I’m showing you the code I used to graph those data.
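Since you can’t see my slide in these notes, here is a minimal sketch of the kind of simulation I mean; the specific distributions and names are my stand-ins, not the exact code from the slide.

```r
# Minimal sketch of the simulation idea: every county shares one true rate
# (1%), and observed rates vary only because county populations differ.
set.seed(2021)

true_rate  <- 0.01                                                 # single underlying rate
n_counties <- 3000
county_pop <- round(rlnorm(n_counties, meanlog = 9, sdlog = 1.3))  # skewed population sizes

cases         <- rbinom(n_counties, size = county_pop, prob = true_rate)
observed_rate <- cases / county_pop

# Small counties land in the extreme deciles far more often than large ones.
summary(observed_rate)
```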

Since the data was generated from a single population, how can we explain the patterns we just discussed?

Here’s a map encoding population sizes of the counties.

How am I encoding it?

Do you see any patterns?

How do those patterns compare with those with high and low rates?

Let’s review the equation for variation in sample means.

Now, the very influential data scientist I just mentioned, Howard Wainer, claims that the equation for the variation in the sample mean is the most dangerous equation.

Let’s review it on the left.

[REVIEW]
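For reference, here is the equation written out in the usual notation, where sigma is the standard deviation of individual measurements and n is the sample size:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

In words: the standard deviation of a sample mean shrinks with the square root of the number of observations that go into it.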

Howard explains in the book I referenced why he thinks it is so dangerous. It’s dangerous for three reasons:

  • First, it has caused people confusion for a very long time.

  • Second, that confusion has misled people in all areas of society.

  • Third, the consequences of making wrong decisions due to that confusion have cost millions of lives and vast resources.

He gives examples of this in his book. And, again, I designed our simulated data to mimic a real example he discusses there.

Let’s see the effects of this equation play out in our graphics. So back to our example.

The apparent paradox we just saw in the high and low simulated rates of cancer is explained by variation due to sample size: de Moivre’s equation in action.

Looking at the graphic on the right, the variation in the sample mean is inversely proportional to the square root of the sample size (here, county population), so small counties have much larger variation in sample means than large counties.
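To make that concrete, here is a quick numeric check, a minimal sketch using the same one-percent true rate we simulated; the county sizes are just illustrative.

```r
# Quick numeric check of de Moivre's equation in our setting:
# the standard error of an observed county rate shrinks with population.
true_rate <- 0.01
sigma     <- sqrt(true_rate * (1 - true_rate))   # sd of a single yes/no draw

county_pop <- c(1e3, 1e4, 1e5, 1e6)
se_rate    <- sigma / sqrt(county_pop)

data.frame(county_pop, se_rate)
# A county of 1,000 people has an SE of about 0.003 (a third of the true
# rate itself), while a county of 1,000,000 has an SE of about 0.0001.
```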

Our credibility and decisions informed by communication are both improved when we accurately convey information on variation and uncertainty.

Make sense?

Now perhaps you’ve heard people discuss reasons not to communicate uncertainty. Let’s review a few concerns by an expert studying communication of uncertainty.

Baruch Fischhoff, an expert in communicating uncertainty, discusses three main issues people raise as objections to communicating uncertainty.

  • First, some people are afraid that others will misinterpret quantities of uncertainty.

  • Second, some people believe others cannot use probabilities.

  • Third, some believe that communicating things like credible intervals might be used unfairly.

None of these concerns have merit. Let’s see what Baruch has to say about them.

Read together:

[READ]

So instead of hiding uncertainty, provide clear communication about that uncertainty for your audience. Doing so, you’ll have more informed audiences, and gain credibility in the process.

Speaking of clear communications, let’s discuss, for a few minutes, the impact that word choices have on communicating these concepts and more.

To be sure, our choice of words in describing quantities matters, and how those words are interpreted varies with our audience. Let’s take a look at the results of a couple of empirical studies in which researchers asked respondents to assign numeric values or probabilities to various phrases.

Let’s do this in a poll, now.

[ACTIVATE POLL]

First, we will assign numeric values to these phrases. The bonus is you get to have new information to take with you and consider. Now, as you click on values, take note that the x axis is on the log scale so that you have a wide number line to work with.

Any questions about the scale?

Cool, you can click or touch, depending on your device, a value along the number line for each phrase.

Take a few moments and we’ll see what you all think.

[DEACTIVATE POLL]

Let’s see how our results compare with the prior studies.

Now, these data you see here are a combination of two prior empirical studies. In the first study, respondents from the US CIA were asked a series of questions in the context of the reports they normally use for decision making. Those questions asked the respondents to assign numerical or probability values to words and phrases.

The second study did the same, but asked an online audience. Both sets of results came out similar, so I combined them into one graph.

So in the first study, the respondents were asked to assign numerical values to the phrases associated with some quantity, listed on the left along the y-axis. Each circle, then, represents how one respondent numerically interpreted that phrase.

Keep in mind that the x-axis is on the log base 10 scale so that we can represent a very wide range of perceived quantities and still get separation between them.
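In case you want to recreate a plot like this, here is a minimal sketch, assuming a hypothetical data frame `responses` with one row per answer and columns `phrase` and `value`.

```r
# Minimal sketch of the phrase-vs-assigned-value plot on a log10 x-axis.
library(ggplot2)

ggplot(responses, aes(x = value, y = phrase)) +
  geom_point(alpha = 0.4, position = position_jitter(height = 0.15)) +
  scale_x_log10() +   # wide range of perceived quantities, still separable
  labs(x = "Assigned numeric value (log10 scale)", y = NULL)
```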

What types of things jump out to you?

Ok, let’s do another poll, this time thinking about probabilities.

[ACTIVATE POLL]

Like last time, click or touch a value for each phrase. Then, we’ll look at what you all think.

[DEACTIVATE POLL]

And now, let’s see how these compare with the earlier studies.

In the second earlier study, respondents were asked to assign probabilities to the phrases you see listed on the left, the y-axis. The x-axis, this time, represents 0-100%.

Again, what kinds of things do you notice?

So we learn that these phrases do not have one particular probabilistic or quantitative meaning.

Does that mean, if we intend to be precise, that we shouldn’t use such phrases?

What if we have uncertainty in what the value should be? If we know it falls within a range, say, would it make sense to choose a phrase whose general interpretation varies across that range of interest?

I’d like to shift to a related topic: overstatements and how to avoid them.

As with overstating information, we want to be careful about implying causation.

[DISCUSS]

Along with considering uncertainty in our words, we should consider it in our analyses and communications of the same.

We might try to categorize types of uncertainty. Let’s see how I describe different types.

So, I’ve listed a few areas where uncertainty arises. Let’s consider those now.

First, we have uncertainty in our data and in our models of those data. We should try to be specific in explaining the unknowns and limitations of both.

Second, when running a model, we may make mistakes, or the software may produce wrong calculations because of things like overflow or underflow. I’ll show a tiny numeric example in a moment.

Third, the results of our model, which provide estimates of parameters, include uncertainty in those parameters, even though people tend to ignore it and just use point estimates, which is exactly what we should not do.

Finally, fourth: there is uncertainty in what decisions we should make once we’ve learned about our estimates.
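Before we move on, here is the tiny numeric example I promised for the second point, showing how floating-point underflow can silently produce a wrong calculation.

```r
# Multiplying many small probabilities underflows to zero,
# while summing their logs does not.
p <- rep(1e-4, 100)   # 100 tiny probabilities

prod(p)               # 0: the true value (1e-400) is below the smallest
                      # representable double (~1e-308), so it underflows
sum(log(p))           # -921.034: the log-likelihood is still computed fine
```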

Let’s consider, next, how we might visually represent these and other uncertainties.

Because I think pencil sketches can be fun, I’ve borrowed these twelve sketches of encoding types, twelve representations of a single set of data, from Joey’s website, which I’ve cited here. All of these are valid ways to show actual data and their variation.

They may also be valid ways to show estimates, though with one caveat: since an estimate is not usually a single observed point, consider representations of estimates that do not identify individual observations.

Ok, with that in mind, if an encoding can be used either way, how can we tell in what way it was intended?

Treating one as the other can be misleading. And our audience only understands what we mean through context and explanation.

So here’s what I want us to think about: what do we mean by an estimate, and how does it differ from data?

We’ve talked about data. An estimate is what we think a measurement might be, given the information we know. We typically get an estimate from a model, as a parameter value. Uncertainty differs from variation in the same way: data measurements vary, and we can have uncertainty in what those measurements are. Two different things. Make sense?

So we need to distinguish which thing we are referring to in our communications. We’ve already seen this in previous examples.

Let’s consider three examples.

For the first two examples, I’ve pulled from news organizations to see how they distinguish the two more directly, for a general audience.

The first is from a published article at the New York Times discussing the vaccine rollout. And I encourage you to go to the actual articles to review them. And this is taken out of context from the whole article, but what do you see?

Again, does the graphic refer to data? To estimates? To both? How does the article describe and distinguish these concepts?

Again, I’m just highlighting where they use language to make this difference clear. Pay special attention to how the graphed line changes to a dotted line, and how the point shapes at June 22, Aug 25, and Oct 29 become open circles. These are common ways we can try to distinguish data and estimates in the visual channels themselves.

Let’s consider another one. It’s also related to the pandemic, and was published on FiveThirtyEight.

Again, I encourage you to look at the original, not just this excerpt I’ve pulled. Actually, let’s do that together now because it includes animation.

So how do the authors indicate what are data, what are estimates?

Right, we see both annotations and the line itself changes from solid to dashed.

I’ve highlighted them for your reference.

Here’s a page from the example Dodger’s proposal you reviewed earlier in the semester. Do the visuals represent data or some kind of estimates? How do we know? Where in the communication do we learn this?

Here, I’ve highlighted the text that explains these are estimates. Notice the highlight includes information directly on the graphics as well as paragraphs discussing these graphics.

There are a couple more recent papers that help us think about new, additional ways of encoding uncertainty.

The first of these is from several authors who empirically studied how general audiences interpret distribution information. And what they found, and I encourage you to read their paper, is that people were able to make more accurate decisions when the distribution was discretized. By discretized, I mean it was chopped up into pieces that can be counted. The authors named this type of encoding a quantile dot plot because of the way it is made. But basically, it transforms the top distribution into a specified number of points in the shape of the distribution so that someone can essentially count their risk. In this example, on the bottom, there are 50 dots, and one can think about, say, three out of 50 times something occurs.
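If you want to experiment with this yourself, here is a minimal sketch of the quantile dot plot idea, not the authors’ code; the normal distribution and the wait-time framing are just illustrative stand-ins.

```r
# Minimal sketch of a quantile dot plot: take 50 evenly spaced quantiles of
# a distribution (here a normal, standing in for, say, predicted wait times)
# and draw them as countable dots.
library(ggplot2)

q <- qnorm(ppoints(50), mean = 12, sd = 3)   # 50 representative quantiles

ggplot(data.frame(q), aes(x = q)) +
  geom_dotplot(binwidth = 0.8, dotsize = 0.9) +
  labs(x = "Predicted wait (minutes)", y = NULL) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
```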

And we’ve discussed earlier how people tend to think more easily about things they can count. This idea uses that insight.

Along with this paper, I want to introduce you to a multi-channel approach to encoding uncertainty.

Helper functions have been developed as extensions to the grammar of graphics for visualizing distributions in various ways.

So here, I’m abstractly showing you a range of hues from purple to yellow in equal steps. We discussed something like this in our earlier discussion of color. And we also discussed how we can use different aspects of color (hue, saturation, or luminance), or combine them, to encode data, right?

So the authors started experimenting with encoding hue for a main measured variable, and encoding uncertainty in that variable with saturation. The effect is that as the measure becomes more uncertain, our ability to distinguish between measures decreases — as it should because we become less certain about the difference.
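Here is one minimal way to sketch that two-channel idea in plain ggplot2, assuming a hypothetical data frame `df` with columns `x`, `y`, `value`, and `uncertainty` (on a 0 to 1 scale). One substitution to note: ggplot2 has no saturation aesthetic, so I’m using alpha, which washes colors toward the background much as desaturation does.

```r
# Two channels at once: hue/luminance carries the value,
# alpha (standing in for saturation) carries the certainty.
library(ggplot2)
library(dplyr)

df2 <- df |>
  mutate(certainty = 1 - uncertainty)    # assumes uncertainty is in [0, 1]

ggplot(df2, aes(x = x, y = y, fill = value, alpha = certainty)) +
  geom_tile() +
  scale_fill_viridis_c() +               # value mapped across color
  scale_alpha(range = c(0.3, 1)) +       # more uncertain -> more washed out
  guides(alpha = "none")
```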

And if you recall, I created an R package and gave you an example of how you can encode multiple channels at once.

Does that make sense? Questions so far?

So the authors took the idea one step further. Notice that this map legend is discrete instead of continuous. There are, in this example, 16 possible encodings. And we can still distinguish the four bottom squares, just a little less easily than the top four.

So the authors actually suppressed the options, that is, the number of distinct possible values, as uncertainty increased. The most certain values receive not just higher saturation but more steps across the range of values, and as uncertainty increases, steps between the extremes are removed.
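To make that construction concrete, here is a hand-rolled sketch of a tiny value-suppressing palette; this is not the authors’ implementation, and `colorspace::desaturate()` is just one convenient way to dial saturation down.

```r
# Hand-rolled sketch of a value-suppressing palette: four hues at the most
# certain level, two duller blends at medium certainty, and a single neutral
# color when uncertainty is highest.
library(colorspace)   # for desaturate(); assumes this package is installed

base_hues <- c("#b2182b", "#ef8a62", "#67a9cf", "#2166ac")   # four value steps

vsup <- list(
  certain   = base_hues,                                                # 4 steps
  medium    = desaturate(colorRampPalette(base_hues)(5)[c(2, 4)],       # 2 steps,
                         amount = 0.4),                                 # less saturated
  uncertain = "grey70"                                                  # 1 step: suppressed
)
vsup
```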

Does this make sense? Questions?

Let’s see how this plays out in an actual example.

Another author I’ve introduced you to, Claus Wilke, created an R package that does this for you in ggplot. I’ve used his package here. In the left graphic, I used a regular scale of values mapped across hue from red to blue.

And on the right, I’ve used the value suppressing palette, mapping uncertainty in the data to saturation, too.

What are the differences in how you try to read these mappings?

So I wanted to introduce you to these additional, very useful approaches to showing uncertainty.

Ok, we have spring break next week. So I’d like to give you time to get into your project groups for planning and moving forward on your projects.

That wraps up tonight’s discussion on context, uncertainty, and variation. I hope our discussion will give you plenty for your own practice moving forward.

Here are the major resources I recommend for your reference related to our discussions tonight. That’s all for tonight. I’ll stay on for any questions. Otherwise enjoy the rest of your day or night!