1 Communication scopes

1.1 Scopes

One place the general scope of analytics projects arises is within science (in which data science exists) and proposal writing. Let's first consider the basic informational categories, in a typical but generic structure and order, that generally form a research proposal; see Friedland, Folt, and Mercer (2018), Oster and Cordo (2015), Schimel (2012), Oruc (2011), and National Science Foundation (1998). Keep in mind that the particular information and its ordering depend on our intended audience:

I. Title
II. Abstract
III. Project description
   A. Results from prior work
   B. Problem statement and significance
   C. Introduction and background
      1. Relevant literature review
      2. Preliminary data
      3. Conceptual, empirical, or theoretical model
      4. Justification of approach or novel methods
   D. Research plan
      1. Overview of research design
      2. Objectives or specific aims, hypotheses, and methods
      3. Analysis and expected results
      4. Timetable
   E. Broader impacts
IV. References cited
V. Budget and budget justification
While these sections are generically labelled and ordered, each should be specific to an actual proposal. Let’s consider a few sections. The title, for example, should accurately reflect the content and scope of the overall proposal. The abstract frames the goals and scope of the study, briefly describes the methods, and presents the hypotheses and expected results or outputs. It also sets up proper expectations, so be careful to avoid misleading readers into thinking that the proposal addresses anything other than the actual research topic. Try for no more than two short paragraphs.
Within the project description, the problem statement and significance typically begins with the overall scope and then funnels the reader through the hypotheses to the goals or specific aims of the research.
The literature review sets the stage for the proposal by discussing the most widely accepted or influential papers on the research topic. The key here is to provide context and to show where the proposed work would extend what has been done, how it fills a gap, or how it resolves uncertainty. We will discuss this in detail later.
Preliminary data can help establish credibility, likely success, or novelty of the proposal. But we should avoid overstating the implications of the data or suggesting we’ve already solved the problem.
In the research plan, the goal is to keep the audience focused on the overall significance, objectives, specific aims, and hypotheses while providing important methodological, technological, and analytical details. It contains the details of the implementation, analysis, and inferences of the study. Our job is typically to convince our audience that the project can be accomplished.
Objectives refer to broad, scientifically far-reaching aspects of a study, while hypotheses refer to a more specific set of testable conjectures. Specific aims focus on a particular question or hypothesis and the methods needed and outputs expected to fulfill the aims. Of note, these objectives will typically have already been (briefly) introduced earlier, for example, in the abstract. Later sections add relevant detail.
If early data are available, show how you will analyze them to reach your objectives or test your hypotheses, and discuss the scope of results you might eventually expect. If such data are unavailable, consider culling data from the literature to show how you expect the results to turn out and to show how you will analyze your data when they are available. Complete a table or diagram, or run statistical tests using the preliminary or “synthesized” data. This can be a good way to show how you would interpret the results of such data.
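For instance, here is a minimal sketch in R of that last suggestion; the group means, spread, and sample sizes are placeholders standing in for values culled from the literature, not results from any actual study:

```r
# Hypothetical: synthesize plausible values from the literature to rehearse the analysis.
# The means, spread, and sample sizes below are placeholders, not real preliminary data.
set.seed(1)
control   <- rnorm(30, mean = 50, sd = 10)
treatment <- rnorm(30, mean = 56, sd = 10)

t.test(treatment, control)        # the comparison we would run once real data arrive
boxplot(list(control = control, treatment = treatment))  # and how we might display it
```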
From studying these generic proposal categories, we get a rough sense of what we, and our audiences, may find helpful in understanding an analytics project, results, and implications: the content we communicate. Let’s now focus on analytics projects in more detail.
1.1.1 Data measures
Data analytics projects, of course, require data. What, then, are data? Let's consider what Kelleher and Tierney (2018) have to say in their aptly titled chapter, What are data, and what is a data set? Consider their definitions:
datum : an abstraction of a real-world entity (person, object, or event). The terms variable, feature, and attribute are often used interchangeably to denote an individual abstraction.
Data are the plural of datum. And:
data set : consists of the data relating to a collection of entities, with each entity described in terms of a set of attributes. In its most basic form, a data set is organized in an \(n \times m\) data matrix called the analytics record, where \(n\) is the number of entities (rows) and \(m\) is the number of attributes (columns).
Data may be of different types, including nominal, ordinal, and numeric. These have sub-types as well. Nominal types are names for categories, classes, or states of things. Ordinal types are similar to nominal types, except that it is possible to rank or order categories of an ordinal type. Numeric types are measurable quantities we can represent using integer or real values. Numeric types can be measured on an interval scale or a ratio scale. The data attribute type is important as it affects our choice of analyses and visualizations.
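To make these types concrete, here is a minimal sketch in R; the tiny data frame and its values are hypothetical, but it shows how nominal, ordinal, and numeric attributes can be represented and why the representation matters:

```r
# A tiny, hypothetical analytics record: one row per entity, one column per attribute.
riders <- data.frame(
  membership    = factor(c("casual", "member", "member")),   # nominal
  satisfaction  = factor(c("low", "high", "medium"),
                         levels = c("low", "medium", "high"),
                         ordered = TRUE),                    # ordinal
  trips         = c(3L, 42L, 17L),                           # numeric, ratio scale
  temperature_c = c(21.5, 18.0, 25.3)                        # numeric, interval scale
)

summary(riders)            # summaries respect each attribute's type
max(riders$satisfaction)   # ordering is defined only for the ordinal attribute
```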
Data can also be structured (like a table) or unstructured (more like the words in this document). And data may be in a raw form such as an original count or measurement, or it may be derived, such as an average of multiple measurements, or a functional transformation. Normally, the real value of a data analytics project is in using statistics or modelling “to derive one or more attributes that provide insight into a problem” (Kelleher and Tierney 2018).
Finally, existing data originally for one purpose may be used in an observational study, or we may conduct controlled experiments to generate data.
But it is important to understand, on another level, what data represent. Lupi (2016) offers an interesting and helpful take:
Data represents real life. It is a snapshot of the world in the same way that a picture catches a small moment in time. Numbers are always placeholders for something else, a way to capture a point of view — but sometimes this can get lost.
1.1.2 Data requires context
Data measurements never reveal all aspects relevant to their generation or to their impact upon our analysis (Loukissas 2019). Loukissas (2019) provides several interesting examples in which the local conditions that generated the data matter greatly to whether we can fully understand the recorded, or measured, data. His examples include plant data in an arboretum, artifact data in a museum, collection data at a library, information in the news as data, and real estate data. Using these examples, he convincingly argues that we need to shift our thinking from data sets to data settings.
Let's consider another example, from baseball. In the game, a batter who hits the pitched ball over the outfield fence between the foul poles scores for their team: they hit a home run. But a batter's home run count in a season does not tell us the whole story of their ability to hit home runs. Let's consider some of the context in which a home run occurs. A batter hits a home run off a ball thrown by a specific pitcher, in a specific stadium, in specific weather conditions. All of these circumstances (and more) contribute to the existence of a home run event, but that context isn't typically considered. Sometimes partly, rarely completely.
Perhaps obviously, all pitchers have different abilities to pitch a ball in a way that affects a batter’s ability to hit the ball. Let’s leave that aside for the moment, and consider more concrete context.
In Major League Baseball there are 30 teams, each with its own stadium. But each stadium's playing field is sized differently from the others, and each stadium's outfield fence varies in height and differs from the fences of other stadiums. To explore this context in Figure 1.1, hover your cursor over a particular field or fence to link them together and compare with others.
You can further explore this context in an award-winning animated visualization (Vickars 2019, winner of the Kantar Information is Beautiful Awards 2019). Further, the trajectory of a hit baseball depends heavily on characteristics of the air, including density, wind speed, and direction (Adair 2017). The ball will not travel as far in cold, dense air, and density depends on temperature, altitude, and humidity. Some stadiums have a roof and conditioned air, somewhat protected from the weather, but most are exposed. Thus, we would learn more about the qualities of a particular batter's home run if we understood it in the context of these data.
Other aspects of this game are equally context-dependent. Consider each recorded ball or strike, a call made by the umpire when the batter does not swing at the ball. The umpire's call is intended to describe the location of the ball as it crosses home plate. But errors exist in that measurement: it depends on human perception, for one. We have had independent measurements from a tracking system since 2008, but that measurement too has error we can't ignore. First, there are 30 separate systems, one for each stadium. Second, those systems require periodic calibration, and calibration requires, again, human intervention. Moreover, the original systems installed in these stadiums in 2007 are no longer used; different systems have been installed in their place. Thus, to fully understand the historical location of each pitched baseball and the recorded outcome, we must research and investigate these systems.
So when we really want to understand an event and compare among events (comparison is crucial for meaning), context matters. We’ve seen this in the baseball example, and in Loukissas’s several fascinating case study examples with many types of data. When we communicate about data, we should consider context, data settings.
1.1.3 Project scope
More on scope. At a high level, scoping involves an iterative progression through the identification and understanding of decisions, goals and actions, methods of analysis, and data.
The framework of identifying goals and actions, and following with information and techniques gives us a structure not unlike having the outline of a story, beginning with why we are working on a problem and ending with how we expect to solve it. Just as stories sometimes evolve when retold, our ideas and structure of the problem may shift as we progress on the project. But like the well-posed story, once we have a well-scoped project, we should be able to discuss or write about its arc — purpose, problem, analysis and solution — in relevant detail specific to our audience.
Specificity in framing and answering basic questions is important: What problem is to be solved? Is it important? Does it have impact? Do data play a role in solving the problem? Are the right data available? Is the organization ready to tackle the problem and take action on the insights? These are the initial questions of a data analytics project. Project success inevitably depends on the specificity of our answers. Be specific.
1.1.4 Goals, actions, and problems
Identifying a specific problem is the first step in any project. And a well-defined problem illuminates its importance and impact. The problem should be solvable with identified resources. If the problem seems unsolvable, try focusing on one or more aspects of the problem. Think in terms of goals, actions, data, and analysis. Our objective is to take the outcome we want to achieve and turn it into a measurable and optimizable goal.
Consider what actions can be taken to achieve the identified goal. Such actions usually need to be specific. A well-specified project ideally has a set of actions that the organization is taking, or can take, that can now be better informed through data science. While improving on existing actions is a good general starting point in defining a project, the scope does not need to be so limited; new actions may be defined too. Conversely, if the stated problem and anticipated analyses do not inform an action, they are usually not helpful in achieving organizational goals. To optimize our goal, we need to define the expected utility of each possible action.
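As a minimal sketch of that last idea, suppose (purely hypothetically) we must choose among three candidate actions, each with assumed outcome probabilities and utilities; the expected utility of an action is the sum of each outcome's utility weighted by its probability:

```r
# Hypothetical: three candidate actions, each with assumed outcome probabilities (p)
# and assumed utilities (u) for those outcomes. All numbers are illustrative only.
actions <- list(
  do_nothing    = list(p = c(good = 0.60, bad = 0.40), u = c(good = 0,   bad = -50)),
  modest_change = list(p = c(good = 0.80, bad = 0.20), u = c(good = -10, bad = -50)),
  large_change  = list(p = c(good = 0.95, bad = 0.05), u = c(good = -25, bad = -50))
)

# Expected utility of an action: sum over outcomes of probability times utility.
expected_utility <- sapply(actions, function(a) sum(a$p * a$u))
expected_utility
names(which.max(expected_utility))  # the action we would favor under these assumptions
```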
1.1.5 Researching the known
The general point of data analyses is to add to the conversation of what is understood. An answer, then, requires research: what’s understood? In considering how to begin, we get help from Harris (2017) in another context, making interesting use of texts in writing essays. We need to “situate our [data analyses] about … an issue in relation to what others have written about it.” That’s the real point of the above “literature review” that funding agencies expect, and it’s generally the place to start our work.
…
Searching for what is known involves both reviewing the “literature” on whatever issue we’re interested in, and any related data.
1.1.6 Identifying accessible data
Do data play a role in solving the problem? Before a project can move forward, data must be both accessible and relevant to the problem. Consider what variables each data source contributes. While some data are publicly available, other data are privately owned and permission becomes a prerequisite. And to the extent data are unavailable, we may need to set up experiments to generate them ourselves. To be sure, obtaining the right data is usually a top challenge: sometimes the variable is unmeasured or simply not recorded.
In cataloging the data, be specific. Identify where data are stored and in what form. Are data recorded on paper or electronically, such as in a database or on a website? Are the data structured, such as a CSV file, or unstructured, like comments on a Twitter feed? Provenance is important (Moreau et al. 2008): how were the data recorded? By a human or by an instrument?
What quality are the data (Fan 2015)? Is there measurement error? Are observations missing? How frequently are the data collected? Are they available historically, or only in real time? Do the data have documentation describing what they represent? These are but a few questions whose answers may impact your project or approach. By extension, they affect what and how you communicate.
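A hedged sketch, in R, of the kind of first-pass audit these questions suggest; the file name and the start_date column are assumptions for illustration only:

```r
# Hypothetical first pass at data quality for a CSV of bike trips;
# the file name and the start_date column are assumed for illustration.
trips <- read.csv("trips.csv", stringsAsFactors = FALSE)

str(trips)                                      # what type is each attribute?
colSums(is.na(trips))                           # how many values are missing, per column?
sapply(trips, function(x) length(unique(x)))    # how many distinct values per column?
range(as.Date(trips$start_date), na.rm = TRUE)  # what period do the data cover?
```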
1.1.7 Analyses and tools
Once data are collected, the workflow needed to bridge the gap between raw data and actions typically involves an iterative process of both exploratory and confirmatory analysis (Pu and Kay 2018; see Figure 1.2), which employs visualization, transformation, modeling, and testing. The techniques potentially available for each of these activities may well be infinite, and each deserves a course of study in itself. Wongsuphasawat, Liu, and Heer (2019), as their title suggests, review common goals, processes, and challenges of exploratory data analysis.
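One pass of that iteration might look like the following minimal sketch in R, with the built-in mtcars data standing in for project data; the transformation and model are only illustrative:

```r
# Explore: visualize a relationship we suspect matters.
plot(mpg ~ wt, data = mtcars)

# Transform: a log transformation may make the relationship closer to linear.
mtcars$log_mpg <- log(mtcars$mpg)

# Model: fit a simple candidate model.
fit <- lm(log_mpg ~ wt, data = mtcars)

# Test and check: inspect residuals, then return to exploration as needed.
plot(fitted(fit), resid(fit))
summary(fit)
```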
Today's tools for exploratory data analysis frequently begin by encoding data as graphics, thanks first to John Tukey, who pioneered the field in Tukey (1977), and to subsequent applied work in graphically exploring statistical properties of data. Cleveland (1985), the first of Cleveland's texts on exploratory graphics, considers basic principles of graph construction (e.g., terminology, clarifying graphs, banking, scales), various graphical methods (logarithms, residuals, distributions, dot plots, grids, loess, time series, scatterplot matrices, coplots, brushing, color, statistical variation), and perception (superposed curves, color encoding, texture, reference grids, banking to 45 degrees, correlation, graphing along a common scale). His second book, Cleveland (1993), builds upon the first and explores univariate data (quantile plots, Q-Q plots, box plots, fits and residuals, log and power transformations, etcetera), bivariate data (smooth curves, residuals, fitting, transforming factors, slicing, bivariate distributions, time series, etcetera), trivariate data (coplots, contour plots, 3-D wireframes, etcetera), and hypervariate data (using scatterplot matrices, linking, and brushing). Chambers (1983) in particular explores and compares distributions, explicitly considers two-dimensional data, plots in three or more dimensions, assesses distributional assumptions, and develops and assesses regression models. These three are all helpfully abstracted from any particular programming language. Unwin (2016) applies exploratory data analysis using R, examining (univariate and multivariate) continuous variables, categorical data, relationships among variables, data quality (including missingness and outliers), and making comparisons, through both single graphics and ensembles of graphics.
While these texts thoroughly discuss approaches to exploratory data analysis that help data scientists understand their own work, they do not focus on communicating analyses and results to other audiences. In this text, and through references to other texts, we will cover communication with other audiences.
To effectively use graphics tools for exploratory analysis requires the same understanding, if not the same approach, we need for graphically communicating with others, which we explore, beginning in ?sec-visual.
Along with visualization, we can use regression to explore data, as is well explained in the introductory textbook by Gelman, Hill, and Vehtari (2020), which also includes the use of graphics to explore models and estimates. These tools, and more, contribute to an overall workflow of analysis: Gelman et al. (2020) suggest best practices.
Again, the particular information and its ordering in any communication of these analyses and results depend entirely on our audience. After we begin exploring an example data analysis project, and consider workflow, we will consider audience.
1.2 Citi Bike case study
Let’s develop the concept of project scope in the context of an example, one to help the bike share sponsored by Citi Bike.
You may have heard about, or even rented, a Citi Bike in New York City. Researching the history, we learn that in 2013, the New York City Department of Transportation sought to start a bike share to reduce emissions, road wear, congestion, and improve public health. After selecting an operator and sponsor, the Citi Bike bike share was established with a bike fleet distributed over a network of docking stations throughout the city. The bike share allows customers to unlock a bike at one station and return it at any other empty dock.
Might this be a problem for which we can find available data and conduct analyses to inform the City's actions and further its goals?
Exercise 1.1 Explore how the availability of bikes and docking spots depends on users' patterns and behaviors, on events and locations at particular times, on other forms of transportation, and on environmental context. What events may be correlated with, or cause, empty or full bike docking stations? What potential user behaviors or preferences may lead to these events? From what analogous things could we draw comparisons to provide context? How may these events and behaviors have been measured and recorded? What data are available? Where are they available? In what form? In what contexts are the data generated? In what ways may we find incomplete or missing data, or other errors in the stored measurements? Might these data be sufficient to find insights through analysis, useful for decisions and goals?
Answers to questions like these provide necessary material for communication. Before digging into an analysis, let's discuss two other aspects of workflow: reproducibility and code clarity.
1.3 Workflow
Truth is tough. It will not break, like a bubble, at a touch; nay, you may kick it about all day, like a football, and it will be round and full at evening (Holmes 1894).
To be most useful, reproducible work must be credibly truthful, which means that our critics can test our language, our information, and our methodologies from start to finish. That others have been unable to do so led to the reproducibility crisis noted by Baker (2016):
More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.
By reproducibility, this meta-analysis considers whether replicating a study resulted in the same statistically significant finding (some have argued that reproducibility as a measure should compare, say, a p-value across trials, not whether the p-value crossed a given threshold in each trial). Regardless, we should reproducibly build our data analyses like Holmes's football, for our critics (later selves included) to kick about. What does this require? Ideally, our final product should include all components of our analysis: thoughts on our goals, identification of, and code for, the collection of data, visualization, modeling, reporting, and explanations of insights. In short, the critic, with the touch of her finger, should be able to reproduce our results from our work. Perhaps that sounds daunting. But with some planning and use of modern tools, reproducibility is usually practical. Guidance on assessing reproducibility and a template for a reproducible workflow are described by Kitzes and co-authors (Kitzes, Turek, and Deniz 2018), along with a collection of more than 30 case studies. The authors identify three general practices that lead to reproducible work, to which I'll add a fourth (a minimal sketch of such a workflow follows the list):
Clearly separate, label, and document all data, files, and operations that occur on data and files.
Document all operations fully, automating them as much as possible, and avoiding manual intervention in the workflow when feasible.
Design a workflow as a sequence of small steps that are glued together, with intermediate outputs from one step feeding into the next step as inputs.
The workflow should track your history of changes.
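What those practices can look like is sketched minimally below for an R project; the directory layout, file names, and scripts are hypothetical, and tools such as make, targets, or git can formalize each part:

```r
# run_all.R: one small, documented step per script, each reading the prior step's output.
# The project layout, file names, and scripts here are hypothetical.
source("01-import.R")   # reads data/raw/trips.csv, writes data/derived/trips.rds
source("02-clean.R")    # reads data/derived/trips.rds, writes data/derived/trips-clean.rds
source("03-model.R")    # fits models, writes output/fit.rds
source("04-report.R")   # builds figures and the written report from output/fit.rds

# Track the history of changes by keeping the whole project under version control,
# for example: git add -A && git commit -m "describe the change"
```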
Several authors describe modern tools and approaches for creating a workflow that leads to reproducible research supporting credible communication, see (Gandrud 2020) and (Healy 2018).
The workflow should include the communication. And the communication includes the code. What? Writing code to clean, transform, and analyze data may not generally be thought of as communicating. But yes! Code is language. And sometimes showing code is the most efficient way to express an idea. As such, we should strive for the most readable code possible. For our future selves. And for others. For code style advice, consult Boswell and Foucher (2011) and an update to a classic, Thomas and Hunt (2020).
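As a small illustration of that point, compare a terse line of R with a version written to be read; the data frame, column names, and threshold are hypothetical:

```r
# Harder to read: what is being filtered, and why 3?
x <- d[d$a > 3 & !is.na(d$a), ]

# Easier to read: names and a comment carry the intent.
min_trip_minutes <- 3   # assumed threshold: drop very short (likely accidental) trips
trips_kept <- trips[!is.na(trips$duration_minutes) &
                    trips$duration_minutes > min_trip_minutes, ]
```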