10 Exploratory data analysis principles
Once data are collected, bridging the gap between raw data and actions typically involves an iterative process of exploratory and confirmatory analysis (Pu and Kay 2018; see Figure 1.1) that employs visualization, transformation, modeling, and testing. The techniques available for each of these activities are practically unlimited, and each deserves a course of study in itself. Wongsuphasawat, Liu, and Heer (2019), as their title suggests, review common goals, processes, and challenges of exploratory data analysis.
10.1 Goals of exploratory analysis
10.2 Visualization for exploration
Today’s tools for exploratory data analysis frequently begin by encoding data as graphics. We owe this first to John Tukey, who pioneered the field in Tukey (1977), and then to subsequent applied work on graphically exploring the statistical properties of data. Cleveland (1985), the first of Cleveland’s texts on exploratory graphics, considers basic principles of graph construction (e.g., terminology, clarifying graphs, banking, scales), various graphical methods (logarithms, residuals, distributions, dot plots, grids, loess, time series, scatterplot matrices, coplots, brushing, color, statistical variation), and perception (superposed curves, color encoding, texture, reference grids, banking to 45 degrees, correlation, graphing along a common scale). His second book, Cleveland (1993), builds upon the lessons of the first and explores univariate data (quantile plots, Q-Q plots, box plots, fits and residuals, log and power transformations, etc.), bivariate data (smooth curves, residuals, fitting, transforming factors, slicing, bivariate distributions, time series, etc.), trivariate data (coplots, contour plots, 3-D wireframes, etc.), and hypervariate data (using scatterplot matrices, linking, and brushing). Chambers (1983), in particular, explores and compares distributions, explicitly considers two-dimensional data, plots in three or more dimensions, assesses distributional assumptions, and develops and assesses regression models. These three are all helpfully abstracted from any particular programming language. Unwin (2016) applies exploratory data analysis using R, examining continuous variables (univariate and multivariate), categorical data, relationships among variables, data quality (including missingness and outliers), and comparisons, through both single graphics and ensembles of graphics.
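To make one of the univariate tools named above concrete, the following is a minimal sketch of the numbers behind a box plot: a five-number summary and a flag for points beyond the fences. It uses only the Python standard library. The function names, the sample values, and the choice of hinge-based quartiles with the conventional 1.5 multiplier are assumptions made for illustration; they are not taken from any of the texts cited above, and other quartile conventions would give slightly different fences.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max) of a non-empty sample.

    Hinges are computed as the medians of the lower and upper halves,
    with the overall median included in both halves when n is odd.
    """
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # size of each half, sharing the middle value if n is odd
    lower, upper = xs[:half], xs[n - half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k hinge-spreads beyond either hinge."""
    _, lower_hinge, _, upper_hinge, _ = five_number_summary(values)
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - k * spread
    high_fence = upper_hinge + k * spread
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical measurements, invented for this example.
sample = [2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 9.8]

summary = five_number_summary(sample)
flagged = tukey_outliers(sample)
print(summary)
print(flagged)
```

A flagged point is a prompt for a closer look at data quality rather than a verdict that the value is wrong.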
10.3 Transformation and modeling
Along with visualization, we can use regression to explore data, as is well explained in the introductory textbook by Gelman, Hill, and Vehtari (2020), which also covers the use of graphics to explore models and estimates. These tools, and more, contribute to an overall workflow of analysis, for which Gelman et al. (2020) suggest best practices.
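As a small illustration of using regression as an exploratory tool, the sketch below fits a straight line by ordinary least squares and then inspects the residuals. The data are invented for the example (they follow a quadratic curve), and the function names are hypothetical; this is one simple way to do it in plain Python, not the approach of any cited text.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values for the line above."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data that grow quadratically, invented for this example.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

intercept, slope = fit_line(x, y)
res = residuals(x, y, intercept, slope)
print(intercept, slope)
print(res)
```

Here the residuals are positive at both ends and negative in the middle. That systematic curve, rather than random scatter, suggests that a straight line misses structure in the data and that a transformation or an added term is worth exploring.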