Effective storytelling with data requires practiced skills and an understanding of theory across disciplines and domains. For a starting text on the topics in this course, consult Data in Wonderland (an evolving text). That text provides citations to numerous seminal and modern resources for the relevant sections. Below is my short list of starting points for each topic.
The “fundamentals” chapter from Doumont (2009) should be required study, as its lessons in communication apply across modes. My only complaint with this text is its lack of citations to the original sources of its ideas. But the ideas themselves are solid. The author’s remaining chapters place these fundamentals into the context of specific communications and are worth the price of purchase and the time spent studying, even if his advice on data visualization is only a beginning.
A primary goal of data science is to analyze data to learn something new. Thus, simply reporting what’s been done before, data analyses or otherwise, does not meet that goal. While J. Harris (2017) is written for other contexts, its lessons on the proper use of prior work (hint: you should not cite or describe prior work just as background) are essential in any data analysis project.
Enabling decisions and changing minds require persuasion which, in turn, requires understanding how our minds work. Sharot (2017) has helpful lessons, in that regard. Building on those, an experienced writer for the Op-Ed page of the New York Times offers her perspective on ways to persuade in Hall (2019).
Once we’ve thought through Doumont’s advice on our goals of communication (to get our audience to pay attention to, understand, and be able to act upon our messages), we can use storytelling principles for the “pay attention to” component. Storr (2020) investigates both the science of and historical approaches to storytelling, and generalizes how we create effective narratives.
But for our audience to understand us, we must review the details of communication structure, from the whole down to the paragraph, sentence, and word choices. All matter. Booth et al. (2016) offer practical advice on methods to logically lay out our narratives, including composing sentences and paragraphs that introduce old or common material before leading the audience to new material.
More specific to communicating in the context of data science, Nolan and Stoudt (2021) proves to be a helpful guide.
Any visual communication begins by understanding how data can be visually encoded into marks or channels. Bertin (2010) formalized these possibilities, and explains these marks and channels.
Tufte (2001) teaches us how to isolate helpful marks for encoding data and information from those that detract from our messages.
Perhaps the best, most theoretically-grounded, approach to creating and communicating about data graphics has a grammar: the grammar of graphics. That grammar was developed by Wilkinson (2005), and influences today’s most flexible and capable tools for visually encoding data.
One of those tools is implemented as a package in the R language: `ggplot2`. Wickham, Navarro, and Lin (2021) introduces the connection to, and implementation of, Wilkinson’s ideas, and provides capable instruction in how to use this tool, but does not explain which encodings are most effective for our communications. That effectiveness has been studied thoroughly. Early pioneers, including Tukey (1977), elevated the importance of data visualization; Tukey’s work, in many ways, legitimized graphical data analysis. Later work by Cleveland and McGill (1984), Cleveland and McGill (1987), and Heer and Bostock (2010) empirically tested common data encodings to learn which lead to more accurate decoding and decisions. Research is ongoing.
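As a minimal sketch of how the grammar composes a graphic, assuming the `ggplot2` package is installed and using R’s built-in `mtcars` data (my choice for illustration, not an example drawn from the texts above), we can declare the data, map variables to channels, and add a layer of point marks:

```r
library(ggplot2)

# Data: the built-in mtcars data set (an illustrative choice).
# Aesthetic mappings: weight -> x position, fuel economy -> y position,
# number of cylinders -> color hue.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +  # marks: one point per car
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  )

# Printing the object renders the graphic.
print(p)
```

Each component (data, mappings, marks, labels) is stated separately, so swapping `geom_point()` for another layer changes the marks without touching the mappings.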
Schwabish (2021) summarizes a few of the important theoretical ideas in visually communicating data, and offers a taxonomy of plots for various data types and intended comparisons. While we should not think of encodings as a shelf of choices, which would severely limit our communications, reviewing taxonomies can be a starting point for ideas that have already been tried. Another encyclopedic taxonomy is R. L. Harris (1999).
The above material focuses on static data visualization. In some cases, adding interactivity can improve our communications. Hohman et al. (2020) surveys today’s best practices and ideas in interactivity, and the theoretical underpinnings of those practices are taught in Tominski and Schumann (2020). Tools that work well with the ggplot2 implementation of the grammar of graphics include `htmlwidgets` like `ggiraph` and `plotly`. For `ggiraph`, start with Gohel and Skintzos (2021); for `plotly`, you’ll do well to consult Sievert (2020) as a guide.
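One way to add interactivity to an existing static graphic can be sketched as follows, assuming both the `ggplot2` and `plotly` packages are installed. The data set and variable choices are only illustrative:

```r
library(ggplot2)
library(plotly)

# A static scatterplot built with the grammar of graphics.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# ggplotly() converts the static ggplot object into an interactive
# htmlwidget with hover tooltips, zooming, and panning.
w <- ggplotly(p)

# In an interactive session, printing `w` displays the widget in the viewer.
```

Whether interactivity actually helps depends on the audience and the message, which is the question the references above address.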
To help compare various software for authoring data graphics, here’s a short table:
System¹ | Expressivity | Reproducible Workflow | Description | Examples |
---|---|---|---|---|
Imperative Programming | Ultimate | Yes | A coding language or library used to describe how to create a data visual. Requires that the creator be comfortable describing how to create the visuals with text. | |
Declarative Programming | Very High | Yes | A coding language or library used to describe what the data visual should look like. Uses a grammar of graphics that systematizes the description of visuals with text. | |
Visual Builder | High | Depends, Partly² | A graphical user interface allowing fine control in specifying marks, glyphs, coordinate systems, and layouts. Can create compound glyphs with multiple marks. May use direct manipulation of the visual objects, something like a shelf construction, or a hybrid of both. | |
Shelf Construction | Medium | No | A graphical user interface that maps data fields to encoding channels, such as by dragging an icon for a variable onto a shelf containing visual marks. Does not provide control over the underlying chart layout and does not allow authors to easily produce compound glyphs composed of multiple marks. | |
Template Selector | Very Low | No | Author is limited to selecting from a list of available charts. | |

¹ For a comparison between these systems, see Satyanarayan et al. (2019).

² Some implementations, like Lyra, offer the ability to export the graphic as code that can be placed inside a reproducible report and hooked to new data.
Coding, first and foremost, is the precise application of logical instructions for software tools. Best practices in coding span programming languages. Thomas and Hunt (2020) introduces important ideas regardless of your choice of language. While computers run the code, humans must understand those instructions too: readable code is a vital part of communicating with others and of interpreting what the computers give us in return. Boswell and Foucher (2011) guide how we write and organize code.
More specific to an important language in data analysis, we turn to R.
Unlike some other languages¹, R was originally developed for data analysis, and you can get some understanding of the base language from Grolemund (2014). While it’s important to know how base R works, it’s a “living language” and has evolved to include numerous, powerful packages (libraries of code with functions) written for data transformation, graphics, and modelling, as well as having interfaces to access other languages like C, C++, SQL, Python, and more. A starting point within this ecosystem for data science is Wickham et al. (2019), which is taught in (wickham2023?).
The first half of Mailund (2017a) also explains, from another perspective, the basics of how to use the R language and various packages for data science; the second half begins to introduce the language from a programming perspective.²
Understanding data structures is important for advanced use of software coding tools, and Mailund (2017b) addresses and implements classic data structures within the peculiarities of the R language.
R is both an object-oriented and a functional programming language. Mailund (2017d) and Mailund (2017c) explain a few more advanced ideas in object-oriented and functional programming, respectively, using the R language. Finally, Mailund (2018) is an advanced text that will help you understand how languages like `ggplot2` or `dplyr` are built on top of, and designed to work within, the R programming language.
Wickham (2019)³ explains the inner workings of the R language, and helps us anticipate ways to efficiently code and transform data structures by knowing, for example, when data structures are referenced in memory (fast) versus when copies (slower) result from our coding decisions. He shows us how to profile our code to learn whether it’s efficient, and provides ways to improve on that efficiency.
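The difference between referencing and copying can be sketched in base R. This is a minimal illustration under my own assumptions (the function names `grow` and `prealloc` are hypothetical), not an example taken from the text:

```r
x <- c(1, 2, 3)

# Assignment binds a second name to the same vector; nothing is copied yet.
y <- x

# Modifying y triggers a copy, so x keeps its original values.
y[2] <- 10

x  # 1 2 3
y  # 1 10 3

# Growing a vector inside a loop creates a new, longer vector at each step...
grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# ...while preallocating lets each iteration write into existing memory.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Both return the same result; timing them shows the cost of the copies.
identical(grow(1000), prealloc(1000))
system.time(grow(1e4))
system.time(prealloc(1e4))
```

Timings vary by machine, so measuring with `system.time()` or a profiler is more reliable than guessing which version is faster.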
We need data to analyze, of course. Perhaps the most used storage of organized data involves tables in databases, which we access through some variant of Structured Query Language (SQL). R’s ecosystem enables us to seamlessly interface with SQL databases, too, using syntax from the `tidyverse`: see Wickham, Girlich, and Ruiz (2021). For web scraping, other R packages help, like `rvest`: see Wickham (2021).
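To illustrate how tidyverse syntax can be translated into SQL, here is a minimal sketch assuming the `dplyr` and `dbplyr` packages are installed. It uses a simulated SQLite connection rather than a real database, and the table and column names (`delays`, `carrier`, `dep_delay`) are hypothetical:

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated connection: no real database is
# needed, because we only ask dbplyr to translate the pipeline into SQL.
delays <- lazy_frame(
  carrier = "AA",
  dep_delay = 5,
  con = simulate_sqlite(),
  .name = "delays"
)

# Ordinary dplyr verbs describe the query.
query <- delays %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

# Show the SQL that would be sent to the database.
show_query(query)
```

With a real connection, the same pipeline would run inside the database, and only the summarized result would be brought into R.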
While R is an interpreted language, many of its functions are compiled from languages like C++ and Fortran. When our own functions need the speed that compiled languages provide, we can interface with those languages too. Eddelbuettel and Balamuta (2018), and an older book-length treatment in Eddelbuettel (2013), introduce an R interface to C++ code. To code in C++, then, we can get some understanding of the language at a high level from Stroustrup (2018), and an in-depth tutorial from Gottschling (2021).
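As a small sketch of calling compiled code from R, assuming the `Rcpp` package and a working C++ compiler are available, we can define a C++ function inline and call it like any R function. The name `sum_cpp` is hypothetical:

```r
library(Rcpp)

# Compile a small C++ function and expose it to the R session.
cppFunction("
  double sum_cpp(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) {
      total += x[i];
    }
    return total;
  }
")

# Call the compiled function with an ordinary R vector.
sum_cpp(c(1, 2, 3))  # 6
```

For a task this simple, base R’s `sum()` is already compiled; the approach pays off when a loop cannot be vectorized in R.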
Müller-Brockmann (1996) is a seminal reference for organizing information within a communication, and remains influential in all aspects of visual communication today.
When we design communications, typography plays an important role in helping our audiences understand. Butterick (2018) provides empirically-tested advice.
Richards (2017) and Ko (2020) offer helpful frameworks in the process of creating, criticizing, and evaluating designs.
Design guidelines are all based on human psychology, and human perception is at the heart of visual communication. Ware (2020) treats the subject scientifically, introducing many relevant empirical and theoretical studies on how humans perceive and process visual communications. Johnson (2020) also tries to bring these together in the context of interaction design and human-computer interaction.
Our communication about data analyses implies we’ve performed data analyses. Data analysis begins, fundamentally, with an understanding of probability. For treatments ranging from introductory to advanced, consult, in order, Kunin et al. (2019), Blitzstein and Hwang (2019), a classic text republished in de Finetti (2017), and Durrett (2019).
With some grounding (however deep you go), both McElreath (2020)⁴ and Gelman, Hill, and Vehtari (2020) introduce the basics of data analysis with modelling. As we progress in our analytical abilities, Gelman et al. (2020) guide us through many best practices in a workflow for those analyses.
These references have influenced my experience and thinking when communicating data analysis for enabling change. These are not, of course, the only references I’ve found helpful (my library includes thousands of other texts) nor the only references others may recommend. My brief descriptions and citations are also far from complete and are evolving. I’ll add more when time allows. But I hope I’ve been able to highlight some of the best references available to save you significant time beginning this learning journey.
Stay curious.
¹ Python, for example, is a popular general-purpose programming language, and has evolved to include packages that enable data science. I use it for specific purposes.
² For software development in `R` (beyond the scope of this course), Peng, Kross, and Brooke (2020) provides helpful guidance for beginners.
³ Solutions to his exercises have been published in (grosser2021?).
⁴ McElreath also has a YouTube channel with corresponding lectures.