1  Communication scopes

Every data project is a journey from question to answer. But like Alice at the crossroads, we cannot chart a path without knowing where we want to end up. The scope of our analysis—what we include and exclude, what questions we ask and don’t ask—shapes everything that follows.

Consider the fundamental questions that frame any analytics project: What problem are we solving? Is it important? Who needs to know? What actions could they take? Do the right data exist? Can we access them? Is the organization ready to act on whatever we find? These questions seem obvious, yet projects falter when they go unasked or when answers remain vague. Specificity matters.

The framework of identifying goals and actions, then gathering information and techniques, gives us a structure not unlike a story: beginning with why we’re working on a problem and ending with how we expect to solve it. Just as stories evolve when retold, our understanding of a problem shifts as we work. But like any well-told tale, a well-scoped project has an arc we can articulate—purpose, problem, analysis, and solution—in detail appropriate to our audience.

This chapter explores how to define that scope. We begin with the components of project planning, move through an extended example, and conclude with the practical foundations of reproducible workflow. Along the way, we keep one principle central: the best analysis is useless if it doesn’t lead to action, and the clearest communication fails if we haven’t scoped the problem correctly.

But before we discuss methods or tools, we must understand our raw material. In the following chapters on Data Analysis, we explore the fundamental nature of data—what data are, how they are measured, why context matters, and how to assess data quality and prepare data for analysis. We also examine the principles and practices of exploratory data analysis. Understanding these foundations is essential because the quality and character of our data fundamentally shape what we can communicate.

Once we have established that foundation, we can consider how the components of an analytics project fit together. The workflow diagram in Figure 1.1 illustrates the iterative progression from goals and actions through data collection, analysis, and ultimately to decisions.

Figure 1.1: Analytic components of a general statistical workflow, adapted from Pu and Kay (2018).

With this framework in mind, let’s examine each component in more detail.

1.0.1 Goals, actions, and problems

Identifying a specific problem is the first step in any project. And a well-defined problem illuminates its importance and impact. The problem should be solvable with identified resources. If it seems unsolvable, try focusing on one or more of its aspects. Think in terms of goals, actions, data, and analysis. Our objective is to take the outcome we want to achieve and turn it into a measurable and optimizable goal.

Consider what actions can be taken to achieve the identified goal. Such actions usually need to be specific. A well-specified project ideally has a set of actions that the organization is taking — or can take — that can now be better informed through data science. While improving on existing actions is a good general starting point in defining a project, the scope does not need to be so limited. New actions may be defined too. Conversely, if the stated problem and anticipated analyses do not inform an action, they are usually not helpful in achieving organizational goals. To optimize our goal, we need to define the expected utility of each possible action.

1.0.2 Researching the known

The general point of data analyses is to add to the conversation about what is understood. Contributing, then, requires research: what is already understood? In considering how to begin, we get help from Harris (2017), who, in another context, describes how writers make interesting use of existing texts in their essays. We need to “situate our [data analyses] about … an issue in relation to what others have written about it.” That’s the real point of the “literature review” that funding agencies expect, and it’s generally the place to start our work.

This orientation phase matters because you cannot contribute meaningfully to a conversation you have not yet heard. Without this foundation, you risk reinventing wheels that others have already optimized—or worse, presenting findings established decades ago as if they were novel insights. The goal is not simply to avoid embarrassment; it is to position your work where it can genuinely advance understanding rather than echo existing knowledge.

Start your search strategically. Academic databases like Google Scholar or JSTOR reveal what researchers have formalized, while government reports, industry white papers, and journalism capture practical wisdom that may never appear in peer-reviewed journals. Pay attention to contradictions—when studies disagree, you have found territory worth exploring. Notice what questions remain unasked: a literature full of analyses about commuting patterns may have overlooked recreational riders, or studies focused on large cities may leave gaps about how transportation systems function in smaller markets. These gaps are opportunities. Map not just what has been done, but what has been left undone.

As you review, connect existing findings directly to your data questions. Every prior finding becomes a potential lever for your analysis—a hypothesis to test, a pattern to extend, or an assumption to challenge. The literature review transforms from bureaucratic obligation into intellectual roadmap, showing you where your data can push the conversation forward.

We will see this principle in action shortly when we develop a case study analyzing bike share data—first understanding what others have discovered about urban mobility, then identifying where our analysis can add something new. Searching for what is known involves both reviewing the “literature” on whatever issue we’re interested in and identifying any related data.

1.0.3 Identifying accessible data

Once you understand what others have discovered, you face a practical question: can you obtain data that will let you extend or challenge that knowledge? Data availability often determines whether a project moves forward or stalls.

Begin by assessing accessibility. Some data live in open repositories—government portals, academic archives, or public APIs. Other data sit behind corporate firewalls or require institutional agreements. Permission may be as simple as clicking “I agree” or as complex as negotiating a multi-month contract. Sometimes the data you need simply do not exist in measurable form, forcing you to design experiments or surveys to generate them yourself. Be honest about these barriers early; discovering at month three that your key variable is unattainable wastes everyone’s time.

Relevance matters as much as availability. A dataset with millions of rows may be useless if it lacks the variables that address your specific question. Catalog what each potential source offers: what was measured, over what time period, at what geographic granularity? Then match these capabilities against your analytical needs. A national health survey might give you demographic patterns but miss the local environmental factors driving your specific hypothesis.

Finally, interrogate provenance (Moreau et al. 2008; Loukissas 2019). How were these data created? By automated sensors or human entry? From administrative records or research instruments? Structured in tidy tables or buried in unstructured text? Understanding data lineage helps you assess quality, anticipate biases, and interpret results correctly. As Loukissas (2019) demonstrates through cases ranging from arboretum plant records to museum artifact databases to real estate listings, the circumstances of data collection—who measured what, where, when, and why—are inseparable from the data themselves. A temperature reading means something different when recorded by an automated weather station versus a handheld thermometer by a volunteer. The best analysis cannot rescue data that were poorly collected in the first place, nor can it fully interpret data without understanding the local contexts that produced them.

1.1 CitiBike case study: From problem to project scope

Let’s apply these scoping concepts to a concrete example: Citi Bike, the bike share program in New York City. We will return to this example throughout the book, developing increasingly sophisticated analyses and communications. For now, we focus on defining the problem and identifying what we need to know.

Exercise 1.1 (Exploring the problem space) Before reading further, consider this scenario: Citi Bike struggles with “rebalancing”—bikes accumulate at popular destinations, leaving origin stations empty. Riders get frustrated when they cannot find bikes or cannot park them.

What factors might explain these patterns? Consider:

- Temporal patterns: Time of day, day of week, season, weather
- Spatial patterns: Proximity to transit, workplaces, recreational areas, topography
- Events: Street closures, holidays, construction, special occasions
- Alternative systems: How do car shares, ride-hails, or transit affect bike share demand?

What data sources—public or private—might help explain these patterns? Where might those data live? What limitations might they have?

If you could access Citi Bike’s internal operational data (truck locations, real-time station levels, maintenance schedules), how would that change your analytical possibilities? What questions would remain unanswerable even with that data?

1.1.1 Understanding the problem context

In 2013, the New York City Department of Transportation sought to start a bike share to reduce emissions, road wear, and congestion, and to improve public health. The system they created—Citi Bike—allows customers to unlock a bike at one station and return it at any other. Simple in concept, but operationally complex: bikes accumulate at popular destinations (think: subway stations at 9am), leaving origin stations empty and destination stations full. This “rebalancing” problem affects customer satisfaction, operational costs, and ultimately whether the program achieves its public health and environmental goals.

The challenge is well-documented. Newspapers have reported frustrated riders finding empty stations when they need to commute, and Citi Bike’s own spokeswoman identified rebalancing as “one of the biggest challenges of any bike share system, especially in New York where residents don’t all work a traditional 9-5 schedule” (Friedman 2017). The problem is real, affects thousands of daily users, and matters to the organization’s mission.

1.1.2 Researching what is known

Before proposing any analysis, we must understand the existing conversation. What do we already know about bike share rebalancing?

Academic research has examined rebalancing strategies in other cities—Barcelona’s Bicing, Paris’s Vélib’, London’s Santander Cycles. Studies identify predictive factors: weather, day of week, proximity to transit, special events. Some propose optimization algorithms for redistribution truck routes. Others examine pricing incentives to shift demand.

Early NYC research is particularly relevant. In 2013, Columbia University’s Spatial Information Design Lab conducted one of the first systematic studies of Citi Bike’s rebalancing challenges (Saldarriaga 2013). Analyzing the system’s initial months of operation, they visualized station-level activity patterns, identified chronic shortage locations, and calculated the operational costs of different rebalancing strategies. Their work established important baselines—but baselines can become dated. The pandemic fundamentally altered urban mobility patterns: remote work reshaped commute flows, outdoor recreation surged, and safety concerns changed route choices. Well-established findings from pre-pandemic studies may no longer describe current behavior, reminding us that even settled analyses can be disrupted when underlying patterns shift.

Industry knowledge from bike share operators reveals practical constraints: trucks can move only so many bikes per hour, certain streets are unsuitable for large vehicles, union rules affect when staff can work. Citi Bike’s own data team at Motivate (the operator) has likely analyzed this, though their findings may not be public.

Urban planning literature discusses how bike shares integrate with broader transportation ecosystems. Subway delays cascade into bike demand spikes. Weekend recreation patterns differ fundamentally from weekday commute patterns.

The gap we might fill: most existing studies focus on a single factor (weather, or transit, or events) in isolation. Few examine how these factors interact. Even fewer have studied New York’s specific context—its 24-hour subway, its dispersed employment centers, its extreme weather variability—in the post-pandemic era. This gap suggests where our analysis might add value.

1.1.3 Identifying accessible data

What data exist that could inform this problem?

Citi Bike’s own data: The system publishes trip records—start station, end station, start time, end time, bike ID, user type (subscriber vs. casual). These are publicly available as CSV files, updated monthly. We also know station locations and capacities from the system map.

Weather data: NOAA provides historical weather for NYC—temperature, precipitation, wind. We can join this to trip data by timestamp.
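To make the join concrete, here is a minimal sketch in Python with pandas. The file names and column names are illustrative assumptions, not the published schema; the point is the shape of the operation: round each trip start time to the hour, then merge against hourly weather observations.

```python
import pandas as pd

# File and column names below are assumptions for illustration, not the official schema.
trips = pd.read_csv(
    "201907-citibike-tripdata.csv",
    parse_dates=["starttime", "stoptime"],
)
weather = pd.read_csv(
    "noaa_nyc_hourly.csv",
    parse_dates=["observation_time"],
)

# Align each trip with the weather observed in the hour it began.
trips["start_hour"] = trips["starttime"].dt.floor("h")
trips_weather = trips.merge(
    weather,
    left_on="start_hour",
    right_on="observation_time",
    how="left",
)
```

The same pattern extends to the transit and event sources below, keyed by date or hour rather than by station.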

Transit data: MTA publishes subway ridership counts, though with delays. We can obtain historical service alerts and planned maintenance schedules.

Event data: NYC’s open data portal lists street closures, permits for large events, school calendars.

Geographic data: USGS provides elevation data for NYC. While rarely the first source analysts consider, topography matters—bikes flow downhill more easily than up, affecting both trip patterns and rebalancing difficulty. Students who identify this source demonstrate the creative thinking required for thorough data cataloging.

Social media: Twitter and other platforms contain unstructured text where frustrated riders vent about empty stations or full docks. Scraping and analyzing this sentiment could provide early warning signals of problems and context about user experiences not captured in trip records. This requires careful attention to platform terms of service and user privacy.

Limitations: We cannot access Citi Bike’s real-time operational data—truck locations, current station fill levels, maintenance schedules. We lack individual user data (privacy protected). Trip data only record rides taken, not frustrated attempts (people who walked to an empty station and gave up).

The data are sufficient for exploratory analysis and hypothesis generation, though not for operational real-time optimization. This is acceptable for our scope: we aim to understand patterns and inform strategy, not to build a live routing system.

A broader lesson: Notice that most of our relevant data do not come from Citi Bike itself. Weather, transit, elevation, events—these are generated by entirely separate organizations. This is typical. Organizations experiencing problems rarely collect all the data needed to understand them. The skill lies in identifying what adjacent data might explain the patterns you observe. Students often get stuck when an organization declines to share data or when the data they possess seem inadequate. The solution is rarely to abandon the project; it is to think more broadly about what external data sources might illuminate the problem. The logic connecting these disparate sources to your question—weather affects ridership, transit competition shapes demand, topography influences redistribution difficulty—is what transforms scattered data into coherent insight.

1.1.4 Defining the project scope

With this foundation, we can articulate a scoped project:

Goal: Identify patterns in bike availability that Citi Bike could use to improve rebalancing operations.

Audience: Citi Bike’s head of data analytics, who understands the operational constraints and can translate findings into action.

Specific questions:

- When and where do empty/full station events cluster?
- How do weather, transit disruptions, and special events interact to affect demand?
- Can we predict problematic patterns hours in advance, allowing proactive rather than reactive rebalancing?

Actions informed: Where to pre-position bikes, when to deploy trucks, how to adjust pricing incentives.

Data: Public trip records, weather, transit, events (2018-2019, before pandemic disruption).

Analysis approach: Exploratory visualization, time-series patterns, predictive modeling (a first sketch appears below).

Deliverable: Interactive report enabling exploration by time, location, and weather condition.
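The sketch referenced above is minimal and, like the earlier example, assumes illustrative column names: it computes net hourly flow per station (departures minus arrivals), a rough proxy for where and when rebalancing pressure builds.

```python
import pandas as pd

# Trip records as in the earlier sketch; column names are illustrative assumptions.
trips = pd.read_csv(
    "201907-citibike-tripdata.csv",
    parse_dates=["starttime", "stoptime"],
)

# Hourly departures per origin station.
departures = (
    trips.groupby([trips["starttime"].dt.floor("h"), trips["start station id"]])
    .size()
    .rename("departures")
)
departures.index.names = ["hour", "station_id"]

# Hourly arrivals per destination station.
arrivals = (
    trips.groupby([trips["stoptime"].dt.floor("h"), trips["end station id"]])
    .size()
    .rename("arrivals")
)
arrivals.index.names = ["hour", "station_id"]

# Net flow: positive values mean more bikes leave than arrive in that hour,
# a rough signal of stations likely to empty without intervention.
net_flow = (
    pd.concat([departures, arrivals], axis=1)
    .fillna(0)
    .assign(net=lambda d: d["departures"] - d["arrivals"])
)
```

Visualizing net flow by hour and location is the kind of exploratory step the scope calls for; a predictive model would then try to anticipate these imbalances hours in advance.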

This scope is achievable with available data, addresses a real problem, and produces actionable insights. It is also extensible: findings here could inform follow-up studies with operational data, or experiments with pricing incentives, or comparisons with other cities.

But before we begin collecting and analyzing data, we must establish how we will work: reproducibly, transparently, and with attention to the code that embodies our methods.

1.2 Workflow

Truth is tough. It will not break, like a bubble, at a touch; nay, you may kick it about all day, like a football, and it will be round and full at evening (Holmes 1894).

To be most useful, reproducible work must be credibly truthful, which means that our critics can test our language, our information, and our methodologies from start to finish. That others have not done so led to the reproducibility crisis noted by Baker (2016):

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.

By reproducibility, this meta-analysis considers whether replicating a study resulted in the same statistically significant finding (some have argued that reproducibility as a measure should compare, say, a p-value across trials, not whether the p-value crossed a given threshold in each trial). Regardless, we should reproducibly build our data analyses like Holmes’s football, for our critics (later selves included) to kick about. What does this require? Ideally, our final product should include all components of our analysis, from thoughts on our goals, to identification of — and code for — collection of data, visualization, modeling, reporting, and explanation of insights. In short, the critic, with the touch of her finger, should be able to reproduce our results from our work. Perhaps that sounds daunting. But with some planning and use of modern tools, reproducibility is usually practical. Guidance on assessing reproducibility and a template for reproducible workflow are described by Kitzes and co-authors (Kitzes, Turek, and Deniz 2018), along with a collection of more than 30 case studies. The authors identify three general practices that lead to reproducible work, to which I’ll add a fourth:

  1. Clearly separate, label, and document all data, files, and operations that occur on data and files.

  2. Document all operations fully, automating them as much as possible, and avoiding manual intervention in the workflow when feasible.

  3. Design a workflow as a sequence of small steps that are glued together, with intermediate outputs from one step feeding into the next step as inputs.

  4. Track your history of changes.

Several authors describe modern tools and approaches for creating a workflow that leads to reproducible research supporting credible communication; see Gandrud (2020) and Healy (2018).

Consider how these practices apply to our Citi Bike example. If we analyze trip patterns and produce a visualization showing weather effects on ridership, we must preserve the raw trip data (practice 1), document the code that filtered and aggregated it (practice 2), structure the work so we can update it when new monthly data arrive (practice 3), and version control everything so we know which analysis produced which graphic (practice 4). Six months later, when the head of data analytics asks how we reached a particular conclusion, we can answer precisely.
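As a minimal sketch of what the four practices might look like in code (the directory layout, file names, and column names are all hypothetical, not an actual Citi Bike or Motivate workflow), a single driver script can chain small steps, each writing an intermediate file that the next step reads; keeping the script and its outputs under version control covers the fourth practice.

```python
"""run_pipeline.py: a hypothetical sketch of the four practices, not an actual Citi Bike workflow."""
from pathlib import Path

import pandas as pd

RAW = Path("data/raw")          # original downloads, never edited in place (practice 1)
INTERIM = Path("data/interim")  # intermediate outputs that glue steps together (practice 3)
REPORTS = Path("reports")


def clean_trips() -> Path:
    """Step 1: combine raw monthly trip files and save a cleaned copy."""
    frames = [
        pd.read_csv(f, parse_dates=["starttime", "stoptime"])
        for f in sorted(RAW.glob("*-citibike-tripdata.csv"))
    ]
    out = INTERIM / "trips_clean.csv"
    pd.concat(frames).to_csv(out, index=False)
    return out


def add_weather(trips_path: Path) -> Path:
    """Step 2: join cleaned trips to hourly weather, as in the earlier sketch."""
    trips = pd.read_csv(trips_path, parse_dates=["starttime", "stoptime"])
    weather = pd.read_csv(RAW / "noaa_nyc_hourly.csv", parse_dates=["observation_time"])
    trips["start_hour"] = trips["starttime"].dt.floor("h")
    merged = trips.merge(weather, left_on="start_hour", right_on="observation_time", how="left")
    out = INTERIM / "trips_weather.csv"
    merged.to_csv(out, index=False)
    return out


def summarize(trips_weather_path: Path) -> Path:
    """Step 3: aggregate hourly ridership; the report is rebuilt from this file alone."""
    tw = pd.read_csv(trips_weather_path, parse_dates=["start_hour"])
    out = REPORTS / "hourly_ridership.csv"
    tw.groupby("start_hour").size().rename("trips").to_csv(out)
    return out


if __name__ == "__main__":
    # Practice 2: one command reruns everything, with no manual steps in between.
    for directory in (INTERIM, REPORTS):
        directory.mkdir(parents=True, exist_ok=True)
    summarize(add_weather(clean_trips()))
```

Committing each change to this script, together with a record of package versions, is what lets us say precisely which code produced which graphic.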

The workflow should include the communication. And the communication includes the code. What? Writing code to clean, transform, and analyze data may not generally be thought of as communicating. But yes! Code is language. And sometimes showing code is the most efficient way to express an idea. As such, we should strive for the most readable code possible. For our future selves. And for others. For code style advice, consult Boswell and Foucher (2011) and an update to a classic, Thomas and Hunt (2020).
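To make the point about readable code concrete, here is a small, hypothetical contrast; the column name and thresholds are invented for illustration. Both filters select the same rows, but only the second explains why.

```python
import pandas as pd

trips = pd.read_csv("201907-citibike-tripdata.csv")  # as in the earlier sketches

# Opaque: the reader must reverse-engineer the magic numbers.
keep = trips[(trips["tripduration"] >= 60) & (trips["tripduration"] <= 10800)]

# Readable: names and comments carry the reasoning for our future selves and others.
MIN_SECONDS = 60           # drop likely false starts at the dock
MAX_SECONDS = 3 * 60 * 60  # drop trips where the bike was probably never re-docked
plausible_trips = trips[trips["tripduration"].between(MIN_SECONDS, MAX_SECONDS)]
```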

1.2.1 Looking ahead

In this chapter we have surveyed the terrain of a data project: scoping problems, researching what is known, identifying data sources, and establishing reproducible workflows. We have peeked ahead at data types, measurement, and context—concepts we will develop systematically in Part 2, “Data Analysis.”

But before we descend into those technical foundations, we must address a prior question: who are we communicating with, and why?

In this specification-first era, our first audience is often an AI system. We must learn to communicate intent precisely—clearly enough that an AI (or a future collaborator, or our forgetful future selves) can translate specifications into working code. This requires understanding how to structure prompts, provide context, and verify outputs. Code without clear specification is merely automation without direction.

Yet specifications alone do not drive decisions. The best analysis fails if it cannot reach its human audience. The clearest visualization misfires if it does not match decision-maker needs. Data without narrative is merely evidence without argument.

The next chapters therefore focus on the communicative dimensions of data work: how to specify intent to AI systems, how to analyze human audiences, how to craft persuasive narratives, and how to write memos that drive action. Only after we understand how to reach both artificial and human intelligences will we return to the technical work of building the data and visualizations that support our case.