Resources

Effective storytelling with data requires practiced skills and an understanding of theory across disciplines and domains. For a starting text on the topics in this course, consult Data in Wonderland (an evolving text). That text provides citations to numerous seminal and modern resources for the relevant sections. Below is my short list of starting points for each topic.

1 Communicating and Writing

The “fundamentals” chapter from Doumont (2009) should be required study, as its lessons in communication apply across modes. My only complaint with this text is that it does not cite the original sources of its ideas. But the ideas themselves are solid. The author’s remaining chapters place these fundamentals into the context of specific communications and are worth the purchase price and the time spent studying, even if his advice on data visualization is only a beginning.

A primary goal of data science is to analyze data to learn something new. Thus, simply reporting what’s been done before, data analyses or otherwise, does not meet that goal. While J. Harris (2017) is written for other contexts, its lessons on proper use of prior work (hint: you should not cite or describe prior work just as background) are essential in any data analysis project.

Enabling decisions and changing minds require persuasion, which in turn requires understanding how our minds work. Sharot (2017) has helpful lessons in that regard. Building on those, an experienced writer for the Op-Ed page of the New York Times offers her perspective on ways to persuade in Hall (2019).

Once we’ve thought through Doumont’s advice on our goals of communication (to get our audience to pay attention to, understand, and be able to act upon our messages), we can use storytelling principles for the “pay attention to” component. Storr (2020) investigates both the science and historical approaches to storytelling, and generalizes how we create effective narratives.

But for our audience to understand us, we must review the details of communication structure, from the whole down to the paragraph, sentence, and word choices. All matter. Booth et al. (2016) offer practical advice on methods to logically lay out our narratives, including composing sentences and paragraphs that introduce old or common material before leading the audience to new material.

More specific to communicating in the context of data science, Nolan and Stoudt (2021) proves to be a helpful guide.

2 Visualizing

Any visual communication begins by understanding how data can be visually encoded into marks or channels. Bertin (2010) formalized these possibilities, and explains these marks and channels.

Tufte (2001) teaches us how to isolate helpful marks for encoding data and information from those that detract from our messages.

Perhaps the best, most theoretically grounded approach to creating and communicating about data graphics is built on a grammar: the grammar of graphics. That grammar was developed by Wilkinson (2005), and influences today’s most flexible and capable tools for visually encoding data.

One of those tools is implemented as a package in the R language: ggplot2. Wickham, Navarro, and Pedersen (2021) introduces the connection to, and implementation of, Wilkinson’s ideas, and provides capable instruction in how to use this tool, but does not explain what encodings are most effective for our communications. The effectiveness of encodings has been studied thoroughly. Early pioneers, including Tukey (1977), elevated the importance of data visualization; Tukey’s work, in many ways, legitimized graphical data analysis. Other work, by Cleveland and McGill (1984), Cleveland and McGill (1987), and Heer and Bostock (2010), empirically tested common data encodings to learn which lead to more accurate decoding and decisions. Research is ongoing.

Schwabish (2021) summarizes a few of the important theoretical ideas in visually communicating data, and offers a taxonomy of plots for various data types and intended comparisons. While we should not think of encodings as a shelf of choices, which would severely limit our communications, reviewing taxonomies can be a starting point for ideas that have already been tried. Another encyclopedic taxonomy is R. L. Harris (1999).

As you begin working with, and saving, graphics made with ggplot2, you may find working with fonts to be one of the more confusing aspects of the graphics. That is because some aspects of the data graphics are viewed and saved with respect to pixel density: pixels per inch. Titles and similar text are sized in points (pts), regardless of the pixel density. When using geom_text to map data to text, however, size is specified in mm. The relationship is 1 pt ≈ 0.35 mm. To get a handle on these, refer to Nicault (2020) and Pedersen (2020).
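The point-to-millimeter relationship described above can be sketched as a small conversion. This is only an illustration in Python, not ggplot2 code: the function names are hypothetical, and it assumes the typographic convention of 72.27 points per inch, which reproduces the roughly 0.35 mm per point quoted above.

```python
# Minimal sketch of the pt <-> mm relationship for text sizes.
# Assumption: 72.27 points per inch (a common typographic convention);
# the helper names below are made up for illustration.
MM_PER_INCH = 25.4
PT_PER_INCH = 72.27


def pt_to_mm(points: float) -> float:
    """Convert a size in points to millimeters."""
    return points * MM_PER_INCH / PT_PER_INCH


def mm_to_pt(mm: float) -> float:
    """Convert a size in millimeters to points."""
    return mm * PT_PER_INCH / MM_PER_INCH


if __name__ == "__main__":
    # A 12 pt title corresponds to about 4.22 mm of text size.
    print(round(pt_to_mm(12), 2))
    # One point is about 0.35 mm, matching the relationship in the text.
    print(round(pt_to_mm(1), 2))
```

In practice this means that, to make data-mapped text match a 12 pt title, you would pass a size of about 4.2 to the text layer rather than 12.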

The above material focuses on static data visualization. In some cases, adding interactivity can improve our communications. Hohman et al. (2020) surveys today’s best practices and ideas in interactivity, and the theoretical underpinnings of those practices are taught in Tominski and Schumann (2020). Tools that work well with the ggplot2 implementation of the grammar of graphics include htmlwidgets like ggiraph and plotly. For ggiraph, start with Gohel and Skintzos (2021); for plotly, you’ll do well to consult Sievert (2020) as a guide.

To help compare various software for authoring data graphics, here’s a short table:

Table 1: Software for creating data visualizations

| System¹ | Expressivity | Reproducible Workflow | Description | Examples |
|---|---|---|---|---|
| Imperative Programming | Ultimate | Yes | A coding language or library used to describe how to create a data visual. Requires the creator be comfortable describing how to create the visuals with text. | Processing, D3.js, Vega |
| Declarative Programming | Very High | Yes | A coding language or library used to describe what the data visual should look like. Uses a grammar of graphics that systematizes the description of visuals with text. | ggplot2, plotnine, Vega-Lite, Altair |
| Visual Builder | High | Depends, Partly² | A graphical user interface allowing fine control in specifying marks, glyphs, coordinate systems, and layouts. Can create compound glyphs with multiple marks. May use direct manipulation of the visual objects, something like a shelf construction, or a hybrid of both. | Lyra, Data Illustrator, Charticulator |
| Shelf Construction | Medium | No | A graphical user interface mapping data fields to encoding channels, such as by dragging an icon for a variable onto a shelf containing visual marks. Does not provide control over the underlying chart layout and does not allow authors to easily produce compound glyphs composed of multiple marks. | Tableau |
| Template Selector | Very low | No | Author is limited to selecting from a list of available charts. | Microsoft Excel, Google Charts |

¹ For a comparison between these systems, see Satyanarayan et al. (2019).

² Some implementations, like Lyra, offer the ability to export the graphic as code that can be placed inside a reproducible report and hooked to new data.

3 Coding

Coding, first and foremost, is the precise application of logical instructions for software tools. Best practices in coding span programming languages. Thomas and Hunt (2020) introduces important ideas regardless of your choice of language. While computers run the code, humans must understand those instructions, which are a vital part of communicating with others and of interpreting what the computers give us in return. Boswell and Foucher (2011) guide how we write and organize code.

More specific to an important language in data analysis, we turn to R.

Unlike some other languages¹, R was originally developed for data analysis, and you can get some understanding of the base language from Grolemund (2014). While it’s important to know how base R works, it’s a “living language” and has evolved to include numerous, powerful packages (libraries of code with functions) written for data transformation, graphics, and modelling, as well as having interfaces to access other languages like C, C++, SQL, Python, and more. A starting point within this ecosystem for data science is Wickham et al. (2019), which is taught in Wickham, Grolemund, and Çetinkaya-Rundel (2023).

The first half of Mailund (2017a) also explains, from another perspective, the basics of how to use the R language and various packages for data science. The second half begins to introduce the language from a programming perspective.²

Understanding data structures is important for advanced use of software coding tools, and Mailund (2017b) addresses and implements classic data structures within the peculiarities of the R language.

R is both an object-oriented and a functional programming language. Mailund (2017d) and Mailund (2017c) explain a few more advanced ideas in object-oriented and functional programming, respectively, using the R language. Finally, Mailund (2018) is an advanced text that will help you understand how domain-specific languages like ggplot2 or dplyr are built on top of, and designed to work within, the R programming language.

Wickham (2019)³ explains the inner workings of the R language, and helps us anticipate ways to efficiently code and transform data structures by knowing, for example, when data structures are referenced in memory (fast) versus when copies (slower) result from our coding decisions. He shows us how to profile our code to learn whether it’s efficient, and provides ways to improve on that efficiency.

We need data to analyze, of course. Perhaps the most used storage of organized data involves tables in databases, accessed through some variant of Structured Query Language (SQL). R’s ecosystem enables us to seamlessly interface with SQL databases, too, using syntax from the tidyverse: see Wickham, Girlich, and Ruiz (2021). For web scraping, other R packages such as rvest help: see Wickham (2021).
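For readers who have not yet seen a table queried with SQL, here is a minimal sketch of the idea. It uses Python’s built-in sqlite3 module with an in-memory database rather than the R packages cited above, and the table name, column names, and helper function are all made up for illustration.

```python
import sqlite3


def mean_by_group(rows):
    """Load (group, value) rows into an in-memory table and aggregate with SQL.

    Hypothetical helper: the table and column names exist only for this sketch.
    """
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
        # Parameterized inserts keep data separate from the SQL text.
        con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        query = """
            SELECT grp, AVG(value) AS mean_value
            FROM measurements
            GROUP BY grp
            ORDER BY grp
        """
        return con.execute(query).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    example_rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
    print(mean_by_group(example_rows))  # [('a', 2.0), ('b', 10.0)]
```

The grouping and summarizing happen inside the database, and only the small result comes back to the analysis session. The interfaces cited above follow the same pattern from within R.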

While R is an interpreted language, many of its functions are compiled from languages like C++ and Fortran. When our own functions need the benefit of speed that compiled languages provide, we can interface with those languages too. Eddelbuettel and Balamuta (2018), and an older book-length treatment, Eddelbuettel (2013), introduce an R interface to C++ code. To code in C++, then, we get some understanding of the language at a high level in Stroustrup (2018), and an in-depth tutorial from Gottschling (2021).

4 Designing

Müller-Brockmann (1996) is a seminal reference for organizing information within a communication, and remains influential in all aspects of visual communication today.

When we design communications, typography plays an important role in helping our audiences understand. Butterick (2018) provides empirically-tested advice.

Richards (2017) and Ko (2020) offer helpful frameworks in the process of creating, criticizing, and evaluating designs.

Design guidelines are all based on human psychology, and human perception is at the heart of visual communication. Ware (2020) treats the subject scientifically, introducing many relevant empirical and theoretical studies on how humans perceive and process visual communications. Johnson (2020) also tries to bring these together in the context of interaction design and human-computer interaction.

5 Analyzing

Our communication about data analyses implies we’ve performed data analyses. Data analysis begins, fundamentally, with an understanding of probability. From introductory to advanced treatment, consult Kunin et al. (2019), Blitzstein and Hwang (2019), a classic text republished in de Finetti (2017), and Durrett (2019), respectively.

With some grounding (however deep you go), both McElreath (2020)⁴ and Gelman, Hill, and Vehtari (2020) introduce the basics of data analysis with modelling. As we progress in our analytical abilities, Gelman et al. (2020) guide us through many best practices in a workflow for those analyses.

6 Conclusion

These references have influenced my experience and thinking when communicating data analysis for enabling change. These are not, of course, the only references I’ve found helpful (my library includes thousands of other texts) nor the only references others may recommend. My brief descriptions and citations are also far from complete and are evolving. I’ll add more when time allows. But I hope I’ve been able to highlight some of the best references available to save you significant time beginning this learning journey.

Stay curious.

References

Bertin, Jacques. 2010. Semiology of Graphics: Diagrams Networks Maps. Redlands: ESRI Press. https://clio.columbia.edu/catalog/13599355.
Blitzstein, Joseph K., and Jessica Hwang. 2019. Introduction to Probability. Second edition. Boca Raton: Taylor & Francis. https://clio.columbia.edu/catalog/13062981.
Booth, Wayne C, Gregory G Colomb, Joseph M Williams, Joseph Bizup, and William T Fitzgerald. 2016. “Revising Style: Telling Your Story Clearly.” In The Craft of Research, Fourth. University of Chicago Press. https://clio.columbia.edu/catalog/14295943.
Boswell, Dustin, and Trevor Foucher. 2011. The Art of Readable Code. O’Reilly. https://www.oreilly.com/library/view/the-art-of/9781449318482/.
Butterick, Matthew. 2018. “Butterick’s Practical Typography.” 2018. https://practicaltypography.com/.
Cleveland, William S, and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79 (387): 531–54. https://www.jstor.org/stable/2288400.
———. 1987. “Graphical Perception: The Visual Decoding of Quantitative Information on Graphical Displays of Data.” Journal of the Royal Statistical Society. Series A 150 (3): 192–229. https://www.jstor.org/stable/2981473.
Doumont, Jean-Luc. 2009. Trees, Maps, and Theorems. Effective Communication for Rational Minds. Principiæ. https://clio.columbia.edu/catalog/11663244.
Durrett, Richard. 2019. Probability: Theory and Examples. Fifth edition. Cambridge Series in Statistical and Probabilistic Mathematics 49. Cambridge; New York, NY: Cambridge University Press.
Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. Springer. https://clio.columbia.edu/catalog/10497464.
Eddelbuettel, Dirk, and James Joseph Balamuta. 2018. “Extending R with C++: A Brief Introduction to Rcpp.” The American Statistician 72 (1): 28–36. https://doi.org/10.1080/00031305.2017.1375990.
Finetti, Bruno de. 2017. Theory of Probability: A Critical Introductory Treatment. Wiley. https://clio.columbia.edu/catalog/12462655.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press. https://avehtari.github.io/ROS-Examples/.
Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” http://arxiv.org/abs/2011.01808.
Gohel, David, and Panagiotis Skintzos. 2021. ggiraph: Make ‘ggplot2’ Graphics Interactive. Manual. https://davidgohel.github.io/ggiraph.
Gottschling, Peter. 2021. Discovering Modern C++: An Intensive Course for Scientists, Engineers, and Programmers. Second edition. C++ in-Depth Series. Boston: Addison-Wesley.
Grolemund, Garrett. 2014. Hands-on Programming with R. First edition. Sebastopol, CA: O’Reilly. https://rstudio-education.github.io/hopr/.
Grosser, Malte, Henning Bumann, and Hadley Wickham. 2021. Advanced R Solutions. 1st ed. Boca Raton: Chapman and Hall/CRC. https://doi.org/10.1201/9781003175414.
Hall, Trish. 2019. Writing to Persuade: How to Bring People over to Your Side. First edition. New York: Liveright Publishing Corporation, a division of W.W. Norton & Company. https://wwnorton.com/books/9781631493058.
Harris, Joseph. 2017. Rewriting: How to Do Things with Texts. Second edition. Logan: Utah State University Press. https://clio.columbia.edu/catalog/12833786.
Harris, Robert L. 1999. Information Graphics: A Comprehensive Illustrated Reference. New York: Oxford University Press. https://clio.columbia.edu/catalog/SCSB-10446123.
Heer, Jeffrey, and Michael Bostock. 2010. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 203–12. https://doi.org/10.1145/1753326.1753357.
Hohman, Fred, Matthew Conlen, Jeffrey Heer, and Duen Chau. 2020. “Communicating with Interactive Articles.” Distill 5 (9). https://doi.org/10.23915/distill.00028.
Johnson, Jeff. 2020. Designing with the Mind in Mind: Simple Guide to Understanding User Interface Design Guidelines. Third. Elsevier. https://doi.org/10.1016/B978-0-12-818202-4.01001-1.
Ko, Amy J. 2020. “Design Methods: What Design Is and How to Do It.” Book. September 2020. https://faculty.washington.edu/ajko/books/design-methods/.
Kunin, Daniel, Jingru Guo, Tyler Dae Devlin, and Daniel Xiang. 2019. Seeing Theory: A Visual Introduction to Probability and Statistics. Brown University. https://seeing-theory.brown.edu.
Mailund, Thomas. 2017a. Beginning Data Science in R. Apress. https://clio.columbia.edu/catalog/12765323.
———. 2017b. Functional Data Structures in R. Apress. https://clio.columbia.edu/catalog/14900997.
———. 2017c. Functional Programming in R. Apress. https://clio.columbia.edu/catalog/12765937.
———. 2017d. Advanced Object-Oriented Programming in R. Apress. https://clio.columbia.edu/catalog/12920129.
———. 2018. Domain-Specific Languages in R. Advanced Statistical Programming. Apress. https://clio.columbia.edu/catalog/13513771.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press. https://clio.columbia.edu/catalog/15199205.
Müller-Brockmann, Josef. 1996. Grid Systems in Graphic Design. A Visual Communication Manual for Graphic Designers, Typographers, and Three Dimensional Designers. Arthur Niggli Ltd. https://clio.columbia.edu/catalog/10489438.
Nicault, Christophe. 2020. “Understanding Text Size and Resolution in ggplot2.” https://www.christophenicault.com/post/understand_size_dimension_ggplot2/.
Nolan, Deborah, and Sara Stoudt. 2021. Communicating with Data: The Art of Writing for Data Science. 1st ed. Oxford University Press. https://doi.org/10.1093/oso/9780198862741.001.0001.
Pedersen, Thomas Lin. 2020. “Taking Control of Plot Scaling.” https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/.
Peng, Roger D, Sean Kross, and Brooke Anderson. 2020. Mastering Software Development in R. https://bookdown.org/rdpeng/RProgDA/.
Richards, Sarah. 2017. Content Design. Content Design London. https://contentdesign.london/store/the-content-design-book.
Satyanarayan, Arvind, Bongshin Lee, Donghao Ren, Jeffrey Heer, John Stasko, John Thompson, Matthew Brehmer, and Zhicheng Liu. 2019. “Critical Reflections on Visualization Authoring Systems.” IEEE Transactions on Visualization and Computer Graphics 26 (1): 461–71. https://doi.org/10.1109/TVCG.2019.2934281.
Schwabish, Jonathan A. 2021. Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks. New York: Columbia University Press. https://clio.columbia.edu/catalog/15473733.
Sharot, Tali. 2017. The Influential Mind. What the Brain Reveals about Our Power to Change Others. Henry Holt and Company. https://us.macmillan.com/books/9781250159618/theinfluentialmind.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Boca Raton, FL: CRC Press, Taylor and Francis Group. https://plotly-r.com.
Storr, Will. 2020. The Science of Storytelling. New York, NY: Abrams Books. https://clio.columbia.edu/catalog/14924581.
Stroustrup, Bjarne. 2018. A Tour of C++. 2nd edition. Boston, MA: Addison-Wesley. https://www.stroustrup.com/tour2.html.
Thomas, David, and Andrew Hunt. 2020. The Pragmatic Programmer. 20th Anniversary. Your Journey to Mastery. Addison-Wesley. https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/.
Tominski, Christian, and Heidrun Schumann. 2020. “Chapter 4. Interacting with Visualizations.” In Interactive Visual Data Analysis, 1st ed. Boca Raton: CRC Press. https://clio.columbia.edu/catalog/14804802.
Tufte, Edward R. 2001. The Visual Display of Quantitative Information. Second. Graphics Press. https://clio.columbia.edu/catalog/195232.
Tukey, John W. 1977. Exploratory Data Analysis. Behavioral Science: Quantitative Methods. Addison-Wesley. https://clio.columbia.edu/catalog/136422.
Ware, Colin. 2020. Information Visualization: Perception for Design. Fourth. Philadelphia: Elsevier, Inc. https://clio.columbia.edu/catalog/9544096.
Wickham, Hadley. 2019. Advanced R. Second. CRC Press. https://adv-r.hadley.nz.
———. 2021. rvest: Easily Harvest (Scrape) Web Pages. Manual. https://rvest.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2021. dbplyr: A ‘dplyr’ Back End for Databases. Manual. https://dbplyr.tidyverse.org.
Wickham, Hadley, Garrett Grolemund, and Mine Çetinkaya-Rundel. 2023. R for Data Science. Second. O’Reilly. https://r4ds.hadley.nz.
Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2021. ggplot2: Elegant Graphics for Data Analysis. Third. Springer. https://ggplot2-book.org/.
Wilkinson, Leland. 2005. The Grammar of Graphics. Second. Springer. https://clio.columbia.edu/catalog/7899682.

  1. Python, for example, is a popular general-purpose programming language, and has evolved to include packages that enable data science. I use it for specific purposes.

  2. For software development in R (beyond the scope of this course), Peng, Kross, and Anderson (2020) provides helpful guidance for beginners.

  3. Solutions to his exercises have been published in Grosser, Bumann, and Wickham (2021).

  4. McElreath also has a YouTube channel with corresponding lectures.
