Data in Wonderland

Explores communication with data in various forms through seminal and cutting-edge ideas in writing, data analyses, and visualization in the era of language models.

Author
Affiliations

Dr. Scott Spencer

Published

February 3, 2026

Preface

A public domain work with 100 million copies in 170 languages, Alice’s Adventures in Wonderland has seen steadily climbing search interest over the past two decades, its relevance renewed through endless remixing across media. This endurance speaks to something we need in data communication as much as in storytelling: knowing where we want to end up before we begin. When Alice asked, “Where should I go?” the Cheshire Cat replied, “That depends on where you want to end up.” Precisely. Whether wandering through Wonderland or wandering through datasets, we cannot chart a path without knowing our destination. Want to know what to communicate about data? First ask: What goals do you have? For whom? What actions should follow?

With audience and purpose defined, narrative and story can enhance others’ understanding of data, offering meaning and insights communicated in numbers and words, or in graphical encodings. Carefully combined, narratives, data analyses, and visuals can help enable change.

Three voices speak to different dimensions of this challenge: Sherlock Holmes demands the raw material of data, John Tukey insists on the revelatory power of visuals, and Daniel Kahneman reminds us that narrative is what ultimately drives decision-making. Together, they map the territory we must navigate. But who, exactly, is “we”? This book is written for a particular kind of traveler through this landscape.

Audience and assumed background

My primary travelers are my students at Columbia University. The readers who will get the most from this text, whom I have in mind as my more general audience, are curious, active learners:

An active learner asks questions, considers alternatives, questions assumptions, and even questions the trustworthiness of the author or speaker. An active learner tries to generalize specific examples, and devise specific examples for generalities.

An active learner doesn’t passively sponge up information — that doesn’t work! — but uses the readings and lecturer’s argument as a springboard for critical thought and deep understanding.

This text isn’t meant to be an end, but a beginning: it gives you hand-selected seminal and cutting-edge references for the concepts presented. Go down these rabbit holes with Alice, following citations and studying the cited material. Becoming an expert in storytelling with data also requires practice. Indeed,

Learners need to practice, to imitate well, to be highly motivated, and to have the ability to see likenesses between dissimilar things in [domains ranging from creative writing to mathematics] (Gaut 2014).

You may find some concepts difficult or vague on a first read. For that, I’ll offer encouragement from Abelson (1995):

I have tried to make the presentation accessible and clear, but some readers may find a few sections cryptic …. Use your judgment on what to skim. If you don’t follow the occasional formulas, read the words. If you don’t understand the words, follow the music and come back to the words later.

Ready?! Down the hole we go!

Structure and content

Empirical studies suggest that communication is generally more effective when its author controls all aspects of the communication, from content to typography and form. In some cases, though, we may enhance the communication by allowing our audience to choose among potential contexts for the information. Here, we explore many ideas within this framework, using a data analytics project as our content.

We begin with an introduction to communication scopes and the data analytics workflow, establishing the foundational concepts of context, specificity, and audience consideration. Building on this foundation, we then explore how the same principles of clear communication apply when working with generative artificial intelligence. Just as we must be specific with human audiences who lack our context, we must be equally specific with AI systems. This specification thinking—providing clear context, constraints, and success criteria—becomes the through-line of our approach.

Throughout this text, you will encounter a consistent pattern: each significant visualization or analysis is introduced first as a specification (an LLM Prompt) that describes what we want to achieve, followed by an implementation (the code that realizes that specification). This prompt-to-code format reflects the evolved workflow in modern analytics: we begin by clearly articulating our communicative intent, then use AI tools to generate implementations, and finally verify and refine the results. The code examples provided are exact implementations that demonstrate how precise specifications translate into working code.
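As a small illustration of this pattern (the prompt and function here are invented for this preface, not drawn from a chapter), a specification and its implementation might look like:

```python
# Hypothetical specification (LLM Prompt):
#   "Given a list of monthly revenues, compute each month's share of the
#    total, as a percentage rounded to one decimal place."

# Implementation realizing that specification, which we then verify:
def revenue_shares(monthly):
    """Return each value's percentage share of the total, rounded to 0.1."""
    total = sum(monthly)
    return [round(100 * m / total, 1) for m in monthly]

shares = revenue_shares([120, 80, 200])  # total 400 -> [30.0, 20.0, 50.0]
```

The specification states intent and success criteria; the code is the checkable artifact that realizes it.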

After establishing these foundations, we explore our content and audiences through narrative communication, considering not just our words but also the typographic forms of those words. We then begin to integrate other, visual forms of data representation into the narrative. As our discussions of graphical data encodings become more complex, we give them focus in the form of information graphics and dashboards, and finally enable our audience to become, to some degree, co-authors through interactive design of the communication.

Software information and conventions

This text embraces a specification-first approach to data analysis and visualization. Rather than beginning with code, we begin with clear, detailed descriptions of what we want to achieve—what we call “LLM Prompts” throughout the text. These specifications demonstrate how to communicate intent precisely, whether to generative AI systems or to human collaborators. The code examples that follow each prompt are exact implementations showing how precise specifications translate into working code. This workflow reflects modern analytics practice, where clear communication of intent precedes implementation, and where critical evaluation remains essential regardless of who—or what—generates the code.

Working with language models. To use generative AI effectively, we should understand its basic structure. Large language models (LLMs) are neural networks trained on vast text corpora to predict likely sequences of words. When we “prompt” a model, we provide context that guides its prediction; when we use “agents,” we chain prompts and tool calls to accomplish multi-step tasks. Models process input through a “context window” (their working memory) and can retrieve information via retrieval-augmented generation (RAG), accessing external documents during inference.

Models vary in their origins: some are proprietary (like GPT-4, Claude, or Gemini), while others are “open weights” (like Llama, Mistral, or Qwen), meaning their parameters are publicly available for download and local hosting. Open-weight models can be quantized, that is, compressed to smaller numerical precision, allowing them to run on consumer hardware. Software like llama.cpp, ollama, or vllm enables local hosting. Developer frameworks like langchain or autogen facilitate building agentic applications, while user interfaces like opencode or open-webui provide accessible ways to interact with models directly. Hardware choices matter: Apple’s unified memory architecture (M3 Ultra and beyond) combined with mlx, Apple’s machine learning framework, enables running large models locally without the memory constraints typical of discrete GPUs. Emerging technologies like RDMA over Thunderbolt 5 and Exo Labs clustering further extend these capabilities for distributed local inference.

These capabilities evolve rapidly, but the fundamental pattern remains: clear specifications in, useful implementations out. Where possible, we prefer frontier open-source tools over proprietary alternatives: they are freely available, inspectable, and avoid the dependency and gatekeeping that come with closed platforms. This aligns with our emphasis on transparency, reproducibility, and maintaining control over our analytical workflows.

Implementation tools are chosen based on how well they embody the underlying theory rather than any particular language. For data visualization, we employ tools grounded in the grammar of graphics—the theoretical framework that separates data transformation, aesthetic mapping, and geometric representation. This includes ggplot2 (R), plotnine (Python), Altair (Python/JavaScript), and D3.js (JavaScript). Each implements the same fundamental principles: mapping data variables to visual properties, layering geometric objects, and coordinating multiple views.

Data manipulation tools are selected for their expressive clarity—the extent to which code reads like the operations it performs. The tidyverse in R established this grammar-of-data-manipulation approach, using verbs like “filter,” “select,” “mutate,” and “summarize” that describe data transformations directly. In Python, siuba ports this same grammar on top of pandas, while polars brings similar semantics with performance optimizations. This grammar offers a structured vocabulary that maps directly to natural language—the same verbs appear in our specifications and LLM prompts, creating a seamless path from intent to implementation.
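The same verb sequence can be sketched directly in pandas method chaining (the toy flights data are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
flights = pd.DataFrame({"carrier": ["AA", "AA", "UA", "UA"],
                        "delay_min": [5, 15, 0, 30]})

# filter -> mutate -> group -> summarize, as one readable chain
result = (
    flights
    .loc[lambda d: d["delay_min"] > 0]               # filter: delayed flights
    .assign(delay_hr=lambda d: d["delay_min"] / 60)  # mutate: derived column
    .groupby("carrier", as_index=False)
    .agg(mean_delay_hr=("delay_hr", "mean"))         # summarize: per carrier
)
```

Read aloud, the chain matches the sentence we would write in a specification: keep delayed flights, convert delay to hours, and report the mean per carrier.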

For statistical analysis, we take an opinionated Bayesian-first approach. Rather than treating Bayesian methods as advanced extensions of classical statistics, we build from the foundation that probability represents degrees of belief and uncertainty quantification is essential to decision-making. We have language models write Stan, our primary modeling language, which we access through cmdstanr (R) or cmdstanpy (Python). This choice reflects the principle that computational tools should match the conceptual framework: Stan’s modeling language directly expresses the generative processes we imagine producing our data, making the connection between theory and implementation transparent.
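Stan itself waits for later chapters, but the core idea of updating degrees of belief with data can be sketched in a few lines. This is a conjugate Beta-Binomial update, a textbook special case rather than the book's Stan workflow:

```python
# With a Beta(a, b) prior on a success probability, observing k successes
# in n trials yields a Beta(a + k, b + n - k) posterior (conjugacy).

def beta_binomial_update(a, b, k, n):
    """Return the posterior Beta parameters after k successes in n trials."""
    return a + k, b + (n - k)

# Uniform prior Beta(1, 1), then observe 7 successes in 10 trials
a_post, b_post = beta_binomial_update(1, 1, 7, 10)
posterior_mean = a_post / (a_post + b_post)  # 8 / 12 = 2/3
```

The posterior is a full distribution over the unknown probability, not a point estimate, which is precisely the uncertainty quantification the Bayesian-first approach insists on carrying into decisions.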

About the author

I build bespoke probabilistic models that transform complex data into strategic insights for elite organizations—Major League Baseball and the English Premier League—then communicate those insights through visualizations and decision support tools I design. My consulting informs decisions involving multi-year contracts in the tens of millions, cross-league transfer evaluations, and roster construction strategies where improved assessment directly impacts competitive outcomes.

My approach centers on modeling actual data-generating processes—the physics, geometry, and human decisions that produce observable outcomes—rather than fitting statistical patterns to summary statistics. I jointly model related processes using custom Stan implementations: discrete Weibull AFT models for career survival, competing risks frameworks for match outcomes, Lee inverse Mills ratio corrections for selection bias, Hidden Markov Models for injury state transitions, and physics-informed models for human perception and decision-making. Recent work includes jointly modeling abilities for 100,000+ players across 85+ leagues spanning a decade, implementing hierarchical structures enabling cross-league comparison while fitting millions of observations with parallel cores.

While my portfolio centers on professional sports, the methodological challenges transfer to contexts I have supported: customer behavior analytics, health of global fisheries, and climate risk mitigation including modeling sea level rise impact on coastal property valuations.

I teach in the applied analytics graduate program at Columbia University, my alma mater, where I developed the curriculum for this book and manage numerous student teams through complex projects—experience translating directly to directing organizational analytics teams. I also designed and teach a comprehensive two-course Bayesian curriculum through Posit’s partner Athlyticz—over 200 professional instructional videos I produce, film, and edit myself. These courses (Bayesian I, Bayesian II) build foundations from probability through hierarchical models, Gaussian processes, and Hilbert-space approximations, extending to computational optimization including GPU acceleration and memory management. I co-teach multi-day Stan workshops and mentor users in Bayesian workflow through the Stan forums.

My background combines a JD with honors in research and communication, an MS in sports analytics (4.0 GPA), and a BS in chemical engineering focused on numerical methods. This dual training—rigorous Bayesian methodology with persuasive communication developed through a decade at major law firms—shapes everything in this text. Recognition includes winner of the 2017 SABR Analytics Competition (graduate division), primary analyst for The Real Madrid Way (BenBella Books 2016), and longlisting in the Kantar Information is Beautiful Awards.

I author this book and work in Stan, R (tidyverse, parallel computation), C++, Git, Quarto, SQL, D3.js, and Processing. I leverage locally-hosted LLMs and AI agentic CLI tools for privacy-preserving workflow optimization, and produce educational media with professional video production tools. Learn more about my work at ssp3nc3r.github.io/about or connect on LinkedIn.