3  Purpose, audience, and craft

As we prepare to scope, and work through, a data analytics project, we must communicate variously if it is to have value. The reproducible workflow shown in the last chapter at least provides value as a communication to its immediate audience — its authors, as a reference for what they accomplished — and to those with similar needs1 to understand and critique the logical progression and scope of the project. That form of communication, however, will be less valuable than other communication forms for different audiences and purposes. Given the information compiled from our project — the content — we now consider communicating various aspects for other purposes and audiences.

The importance of adjusting communication is not unique to data analytics. Let’s consider, say, how communication form differs when for a news story versus an op-ed. Long-time editor of the op-ed at the New York Times explains,

The approach to argument that I learned in classes at Berkeley was much more similar to an op-ed than the inverted pyramid of daily journalism or the slow, anecdotal flow of feature stories that had dominated my professional life (Hall 2019).

The qualities of an op-ed piece must be, she writes: “surprising, concrete, and persuasive.” These qualities are similar to that we need in business communication, which generally drive decisions. All business writing begins with a) identifying the purpose for communicating and b) understanding your audiences’ scopes of knowledge and responsibilities in the problem context. Neither is trivial; both require research. To motivate this discussion, let’s consider and deconstruct two example memos — one for Citi Bike, the other for the Dodgers — written for data science projects. It will be helpful for this exercise to place both memos side-by-side for comparison as we work through them below.

3.1 Communication structure

Let’s begin discussing the communication structure from several perspectives: purpose, narrative structure, sentence structure, and effective redundancy through hierarchy. Then, we consider audiences, story, and the importance of revision.

3.1.1 Purpose and audience

In the first example, we return to Citi Bike. After project ideation, and scoping, we want to ask Citi Bike’s head of data analytics to let us write a more detailed proposal to conduct the data analytics project. We accomplish this in 250 words. The the fully composed, draft memo follow,

Example 3.1 (250-word Citi Bike memo)
Michael Frumin
Director of Product and Data Science
for Transit, Bikes, and Scooters at Lyft

To inform the public on rebalancing, let’s re-explore docking
availability and bike usage, this time with subway and weather.

Let’s re-explore station and ride data in the context of subway and weather information to gain insight for “rebalancing,” what our Dani Simmons explains is ”one of the biggest challenges of any bike share system, especially in … New York where residents don’t all work a traditional 9-5 schedule, and though there is a Central Business District, it’s a huge one and people work in a variety of other neighborhoods as well” (Friedman 2017).

Recalling the previous, public study by Columbia University Center for Spatial Research (Saldarriaga 2013), it identified trends in bike usage using heatmaps. As those visualizations did not combine dimensions of space and time, which the public would find helpful to see trends in bike and station availability by neighborhood throughout a day, we can begin our analysis there.

We’ll use published data from NYC OpenData and The Open Bus Project, including date, time, station ID, and ride instances for all our docking stations and bikes since we began service. To begin, we can visually explore the intersection of trends in both time and location with this data to understand problematic neighborhoods and, even, individual stations, using current data.

Then, we will build upon the initial work, exploring causal factors such as the availability of alternative transportation (e.g., subway stations near docking stations) and weather. Both of which, we have available data that can be joined using timestamps.

The project aligns with our goals and shows the public that we are, in Simmons’s words, “innovative in how we meet this challenge.” Let’s draft a detailed proposal.

Sincerely,
Scott Spencer


Friedman, Matthew. “Citi Bike Racks Continue to Go Empty Just When Upper West Siders Need Them.” News. West Side Rag (blog), August 19, 2017. https://www.westsiderag.com/2017/08/19/citi-bike-racks-continue-to-go-empty-just-when-upper-west-siders-need-them.

Saldarriaga, Juan Francisco. “CitiBike Rebalancing Study.” Spatial Information Design Lab, Columbia University, 2013. https://c4sr.columbia.edu/projects/citibike-rebalancing-study.

It begins, in the title of this memo, with our purpose of writing, to conduct data analysis on specifically identified data to inform the issue of rebalancing, one of Citi Bike’s goals:

To inform rebalancing, let’s explore docking and bike availability in the context of subway and weather information.

This is what Doumont (2009) calls a message. We should craft communications with messages, not merely information. Doumont explains that a message differs from raw information in that it presents “intelligent added value,” that is, something to understand about the information. A message interprets the information for a specific audience and for a specific purpose. It conveys the so what, whereas information merely conveys the what. What makes our title a message? Before answering this, let’s compare one of Doumont’s examples of information to that of a message. This sentence is mere information:

A concentration of 175 \(\mu g\) per \(m^3\) has been observed in urban areas.

A message, in contrast to information, would be the so what:

The concentration in urban areas (175 \(\mu g / m^3\)) is unacceptably high.

In our title, we request an action, approval for the exploratory analysis on specified data, for a particular purpose, to inform rebalancing. This purpose also implies the so what: unavailable bikes or docking slots, unbalanced stations, are bad. We’re asking to help remedy the issue.

This beginning, if effective, is only because we wrote it for a particular audience. Our audience is head of data analytics at Citi Bike, and presumably knows the problem of rebalancing; it is well-known in, and beyond, the organization. Thus, our sentence implicitly refers back to information our audience already knows2. Relying on his or her knowledge means we do not need to first explain what rebalancing is or why it is a problem.

Let’s introduce a second example before digging further into the structure of the Citi Bike memo. Having multiple examples to analyze has the added benefit of allowing us to induce some general, but effective, writing principles.

Considering a second example, professional teams in the sport of baseball, including the Los Angeles Dodgers, make strategic decisions within the boundaries of the sport’s rules for the purpose of winning games. One of those rules involves stealing bases, as in Figure 3.1.

Figure 3.1: In a close call, the baseball umpire spread his arms, signaling that the Dodgers baserunner successfully ran to the base faster than the catcher could throw the ball to the base to get him out — the runner stole the base. Knowing when to try to steal is strategic and depends on other events that are at least partly measured as data.

This concept is part of our next example, written to the Los Angeles Dodgers’s Director of Quantitative Analytics, Scott Powers. Powers, is Director of Quantitative Analytics at the Los Angeles Dodgers. He earned his doctor of philosophy in statistics from Stanford, has authored publications in machine learning, knows R programming, and as an employee of the Dodgers, knows their history. Powers manages a team of data scientists; their responsibilities include assessing player and team performance. The draft memo follows,

Example 3.2 (250-word Dodgers memo)
Scott Powers
Director of quantitative analysis
at Los Angeles Dodgers

Our game decisions should optimize expectations.
Let’s test the concept by modeling decisions to steal.

Our Sandy Koufax pitched a perfect game, the most likely event sequence, only once: those, we do not expect or plan. Since our decisions based on other most likely events don’t align with expected outcomes, we leave wins unclaimed. To claim them, let’s base decisions on expectations flowing from decision theory and probability models. A joint model of all events works best, but we can start small with, say, decisions to steal second base.

After defining our objective (e.g. optimize expected runs) we will, from Statcast data, weight everything that could happen by its probability and accumulate these probability distributions. Joint distributions of all events, an eventual goal, will allow us to ask counterfactuals — “what if we do this” or “what if our opponent does that” — and simulate games to learn how decisions change win probability. It enables optimal strategy.

Rational and optimal, this approach is more efficient for gaining wins. For perspective, each added win from the free-agent market costs 10 million, give or take, and the league salary cap prevents unlimited spend on talent. There is no cap, however, on investing in rational decision processes.

Computational issues are being addressed in Stan, a tool that enables inferences through advanced simulations. This open-source software is free but teaching its applications will require time. To shorten our learning curve, we can start with Stan interfaces that use familiar syntax (like lme4) but return joint probability distributions: R packages rethinking, brms, or rstanarm. Perfect games aside, we can test the concept with decisions to steal.

Sincerely,
Scott Spencer

The beginning of this memo reads:

Our game decisions should optimize expectations. Let’s test the concept by modeling decisions to steal.

In the first three words, the subject of this sentence, signals to our audience something they have experience with: our game decisions. It’s familiar. Then, we provide a call to action: our game decisions should optimize expectations. Let’s test the concept by modeling decisions to steal. As with the Citi Bike memo (Example 3.1, the Dodgers memo (Example 3.2) begins with a message and stated purpose. And as with Citi Bike, the information in the title message of the Dodgers memo is shared knowledge with our audience. In fact, we rely on the educational background of our audience, who we know has earned a doctor of philosophy in statistics, when including the concept to “optimize expectations”3 without first explaining what that is because we know, or from the audience’s background, can assume the audience understands the concept.

So in both cases, we have begun with language and topics already familiar to the audience, which follows the more general writing advice from Doumont (2009) , who instructs us to

put ourselves in the shoes of the audience, anticipating their situation, their needs, their expectations. Structure the story along their line of reasoning, recognizing the constraints they might bring: their familiarity with the topic, their mastery of the language, the time they can free for us.

What else do we know about chief analytics officers in general? Their jobs require them to be efficient with their time. Thus, by starting with our purpose, letting them know what we want them to do, we are considerate of their “constraints” and “time they can free for us.”

Beginning with a purpose and call-to-action also allow the executive to understand the memo’s relevance to them, in terms of their decision-making, immediately; they have a reason to continue reading.

The persuasive power of beginning with the main message for your audience, or issue relevant to your audience, is nearly as timeless as it is true. Cicero, the Roman philosopher with a treatise on rhetoric, explained that we must not begin with details because “it forms no part of the question, and men are at first desirous to learn the very point that is to come under their judgment” (Cicero and Watson 1986).

Next, let’s review the structure of these memos to see whether we’ve “structur[ed] the story along their line of reasoning.”

3.1.2 Common ground

Let’s compare the first sentences of the body of both examples. The Citi Bike memo begins,

Let’s re-explore station and ride data in the context of subway and weather information to gain insight for “rebalancing,” broadening the factors we’ve told the public that “one of the biggest challenges of any bike share system, especially in … New York where residents don’t all work a traditional 9-5 schedule, and though there is a Central Business District, it’s a huge one and people work in a variety of other neighborhoods as well” (Friedman 2017).

This sentence starts with the title request, and then ties the purpose — rebalancing — to corporate goals. It does so by quoting the company’s spokesperson, which serves as both evidence of the so what. Offering, and accepting, Simmons’s quote serves a second purpose in writing: it helps to establish common ground with our audience.

If we want to affect the behaviors and beliefs of the person in front of us, we need to first understand what goes on inside their head and establish common ground. Why? When you provide someone with new data, they quickly accept evidence that confirms their preconceived notions (what are known as prior beliefs) and assess counter evidence with a critical eye (Sharot 2017). Four factors come into play when people form a new belief: our old belief (this is technically known as the “prior”), our confidence in that old belief, the new evidence, and our confidence in that evidence. Focusing on what you and your audience have in common, rather than what you disagree about, enables change. Let’s check for common ground in the Dodgers memo. The first sentence of the body begins,

Our Sandy Koufax pitched a perfect game, the most likely event sequence, only once: those, we do not expect or plan.

Sandy Koufax is one of the most successful Dodgers players in the history of the franchise. He is one of less than 20 pitchers in the history of baseball to pitch a perfect game, something extraordinary. Our audience, as an employee of the Dodgers, will be familiar with this history. It is also something very positive — and shared — between author and audience. It is intended to establish common ground, in two ways. Along with that positive, shared history, it sets up an example of a statistical mode, one that we know the audience would agree is unhelpful for planning game strategy because it is too rare, even if it is a statistical mode. It helps to create common ground or agreement that it may not be best to use statistical modes for making decisions.

In both memos, we are also trying to use an interesting fact that may be unexpected or surprising in this context (Sandy Koufax, Dani Simmons) to grab our audience’s attention. In journalism, this is one way to create the lead. Zinsser (2001)4 explains that the most important sentence in any communication is the first one. If it doesn’t induce the reader to proceed to the second sentence, your communication is dead. And if the second sentence doesn’t induce the audience to continue to the third sentence, it’s equally dead. Readers want to know — very soon — what’s in it for them.

Your lead must capture the audience immediately cajoling them with freshness, or novelty, or paradox, or humor, or surprise, or with an unusual idea, or an interesting fact, or a question. Next, it must provide hard details that tell the audience why the piece was written and why they ought to read it.

3.1.3 Details

At this point in both memos, we have begun our memo with information familiar to our audience, relevant to their job in decision-making, and established our purpose. We have also started with information they would agree with. We’ve created common ground. The stage is set. What’s next? Here’s the next two sentences in the body of the Citi Bike memo:

Recalling the previous, public study by Columbia University Center for Spatial Research (Saldarriaga 2013), it identified trends in bike usage using heatmaps. As those visualizations did not combine dimensions of space and time, which the public would find helpful to see trends in bike and station availability by neighborhood throughout a day, we can begin our analysis there.

The first sentence introduces previous work — background — on rebalancing studies and its limitations, and we proposed to start where the prior work stopped. This accomplishes two objectives. First, it helps our audience understand beginning details of our proposed project. Second, it helps the audience see that our proposed work is not redundant to what we already know. Thus, we began the details of our proposed solution. What is described in the Dodgers memo at a similar point? This:

To claim them, let’s base decisions on expectations flowing from decision theory and probability models. A joint model of all events works best, but we can start small with, say, decisions to steal second base.

As with Citi Bike, the next two sentences start introducing details of the proposed project.

After introducing the nature of the proposed project in both memos, we identify data that makes the proposed project feasible. In the Citi Bike memo we identify specific categories of data and the publicly available source of those data:

We’ll use published data from NYC OpenData and The Open Bus Project, including date, time, station ID, and ride instances for all our docking stations and bikes since we began service.

Similarly, in the Dodgers memo,

After defining our objective (e.g. optimize expected runs) we will, from Statcast data, weight everything that could happen by its probability and accumulate these probability distributions.

It may seem we are less descriptive of the data than in the Citi Bike memo, but the label “Statcast” signals to our particular audience a group of specific, publicly available variables collected by the Statcast system, see (Willman, n.d.) and (Willman 2020).

After identifying data, we explain how we plan its analysis.

Having identified data, both memos then describe more details of our proposed methodology. In Citi Bike, we discuss two stages. We plan to graphically explore specific variables in search of specific trends first.

To begin, we can visually explore the intersection of trends in both time and location with this data to understand problematic neighborhoods and, even, individual stations, using current data.

Then, we specifically identify additional data we plan to join and explore as causal factors for problem areas:

Then, we will build upon the initial work, exploring causal factors such as the availability of alternative transportation (e.g., subway stations near docking stations) and weather. Both of which, we have available data that can be joined using timestamps.

Similarly, in the Dodgers memo, go into the planned methodology. We plan to model expectations from the data:

…from Statcast data, weight everything that could happen by its probability and accumulate these probability distributions.

3.1.4 Benefits

Having described our data and methodology in both memos, we now describe some benefits. In the Citi Bike memo,

The project aligns with our goals and shows the public that we are, in Simmons’s words, “innovative in how we meet this challenge.”

And in the Dodgers memo, perhaps because we believe the benefits are comparatively less obvious, or less proven, we further develop them:

Joint distributions of all events, an eventual goal, will allow us to ask counterfactuals — “what if we do this” or “what if our opponent does that” — and simulate games to learn how decisions change win probability. It enables optimal strategy.

Rational and optimal, this approach is more efficient for gaining wins. For perspective, each added win from the free-agent market costs 10 million, give or take, and the league salary cap prevents unlimited spend on talent. There is no cap, however, on investing in rational decision processes.

3.1.5 Limitations

In the Citi Bike memo, we didn’t identify limitations. Should we?

In the Dodgers memo, we do, while also explaining how we plan to overcome those limitations:

Computational issues are being addressed in Stan, a tool that enables inferences through advanced simulations. This open-source software is free but teaching its applications will require time. To shorten our learning curve, we can start with Stan interfaces that use familiar syntax (like lme4) but return joint probability distributions: R packages rethinking, brms, or rstanarm.

3.1.6 Conclusion

Finally, we wrap up in both memos. in the Citi Bike memo, after echoing the quote from Simmons, we state,

Let’s draft a detailed proposal.

Again, the Dodgers memo is similar. There, we circle back to our introduction to Sandy Koufax and his perfect game, then conclude,

Perfect games aside, we can test the concept with decisions to steal.

Again, this idea of echoing something from where we began is journalism’s complement to the lead.

Zinsser explains that, ideally, the ending should encapsulate the idea of the piece and conclude with a sentence that jolts us with its fitness or unexpectedness. Consider bringing the story full circle — to strike at the end an echo of a note that was sounded at the beginning. It gratifies a sense of symmetry.

Executives’ lines of reasoning commonly, but do not always, follow the general document structure described above. If we don’t have information otherwise, this is a good start.

3.2 Narrative structure

The above ideas — tools — are helpful in structuring and writing persuasive memos, and longer communications for that matter. And as writing lengthens, the next couple of related tools can be especially helpful in refining the narrative structure in a way that holds our audience’s interest by creating tension. German dramatist Gustav Freytag in the late 19th century illustrated a narrative arc5 used in Shakespearean dramas:

Figure 3.2: The five components of narratives create tension that helps hold audiences attention.

The primary elements of an applied analytics project may be thought of as a well-articulated business problem, a data science solution, and a measurable outcome to produce value for the organization. The analytics project may thus be conceptualized as a narrative arc, with a beginning (problem), middle (analytics), and end (overcoming of the problem), along with characters (analysts, colleagues, clients) who play important roles.

Duarte (2010) used the narrative arc to conceptualize an interesting alternative way to think about structure that creates tension: alternating what is with what may be:

Figure 3.3: Describing what may be after what is creates a contrast, which we find interesting.

We can repeat this approach, see figure Figure 3.4, switching between what is and what may be to maintain a sense of tension or interest throughout a narrative arc. Once you become aware, you may be surprised how much you find writing in this form.

Figure 3.4: Duarte illustrates the repeated switching between what is and what may be, which helps to hold audience interest throughout a narrative.

This juxtaposition of two states for creating tension is another form of comparison.

Exercise 3.1 Revisit the two memos, examples Example 3.1 and Example 3.2. Identify sentences or paired sentences that shift focus from what is to what could be, creating a contrast.

Let’s look closer, now, to sentence structure.

3.3 Sentence structure

When we describe old before new, using sentence structure, it generally improves understanding. The concept has also been described as an information unit. “The information unit is what its name implies: a unit of information. Information, in this technical grammatical sense, is the tension between what is already known or predictable and what is new or unpredictable” (Halliday and Matthiessen 2004). As a general principle, “readers follow a story most easily if they can begin each sentence with a character or idea that is familiar to them, either because it was already mentioned or because it comes from the context” (J. M. Williams, Bizup, and Fitzgerald 2016).

Consider an alternative flow of information. Put new information before old information. Reversing the information flow will likely confuse your audience. This point was clearly demonstrated in a classic movie, Memento,

Figure 3.5: Poster for Memento, a movie designed to purposefully confuse the audience by narrating the story (partly) in reverse

where Director Christopher Nolan tells the story of a man with anterograde amnesia (the inability to form new memories) searching for someone who attacked him and killed his wife, using an intricate system of Polaroid photographs and tattoos to track information he cannot remember. The story is presented as two different sequences of scenes interspersed during the film: a series in black-and-white that is shown chronologically, and a series of color sequences shown in reverse order (simulating for the audience the mental state of the protagonist). The two sequences meet at the end of the film, producing one complete and cohesive narrative. Yet, the reversed order is (effectively) designed to hold the audience in confusion so that they may get a sense of the confusion experienced by someone with this illness. Indeed, that we demonstrate this with film implies that ordering in visual representation matters too, and it does. As such, we revisit this in the context of images.

For reasons similar, explain complex information last. This is particularly important in three contexts: introducing a new technical term, presenting a long or complex unit of information, introducing a concept before developing its details. And just as the old—new paradigm helps to convey messages, so too does expressing crucial actions in verbs. Make your central characters the subjects of those verbs; keep those subjects short, concrete, and specific.

Exercise 3.2 Revisit the Dodgers memo again, Example 3.2. This time, for each sentence and words within the sentence, try to identify whether the word or phrase is new or old. When determining this, consider both the words, phrases, and sentences preceding the one under analysis. Of note, you may also consider the audience’s background knowledge as a form of information.

3.4 Layering and hierarchy

Most communications benefit by providing multiple levels in which the narrative may be read. Even emails and memos — concise communications — enable two layers, the title and the main body. Thus, the title should inform the audience of the relevance of the communication: what is it the author wants them to do or know. It should also, or at least, invite the audience to learn more through the details of the main body. As the communication lengthens, more layers may be used. The title’s purpose remains the same, as does the main body. But we may add middle layers, headers and subheaders (Doumont 2009). These should not be generic. Instead, the author should be able to read just these and understand the gist of the communication. This concept is well established where we intend persuasive communication. A well-known instructor of legal writing, Guberman (2014) explains how to draft this middle layer:

Strong headings are like a good headline for a newspaper article: they give [the audience] the gist of what [they] need to know, draw [them] into text [they] might otherwise skip, and even allow … a splash of creativity and flair.

The old test is still the best. Could [the audience] skim your headings and subheadings and know why [they should act]?

A good way to provide these “signposts” is to make your headings complete thoughts that, if true, would push you toward the finish line.

In accord, Scalia and Garner (2008) writes:

Since clarity is the all-important objective, it helps to let the reader know in advance what topic you’re to discuss. Headings are most effective if they’re full sentences announcing not just the topic but your position on that topic.

In short, headings should be what Doumont (2009) calls messages. Headings provide “effective redundancy.” The redundancy gained from headers may be two fold. They, first, introduce your message before the detailed paragraphs and, second, may be collected up front, as a table of contents. Even short communication benefit from headers, and communications of at least several pages will likely benefit from such a table of contents along with headers.

3.5 Story

At this point, we’ve identified a problem or opportunity upon which our entity may decide to act. We’ve found data and considered how we might uncover insights to inform decisions. We’ve scoped an analytics project. In beginning to write, we’ve considered document structure, sentence structure, and narrative. We’ve also begun to consider our audience.

Can we — should we — use story to help us communicate? Consider its use in the context of an academic debate on the question: (Krzywinski and Cairo 2013), (Katz 2013), and (Katz 2013); and for more perspective: (Gelman and Basbøll 2014).

Exercise 3.3 Explain whether, and why or why not, you believe any of the arguments in the debate about using story to explain scientific results are correct.

Let’s now focus on how we may directly employ story. A little research on story, though, reveals differences in use of the term. The Oxford English Dictionary defines story generally as a narrative:

An oral or written narrative account of events that occurred or are believed to have occurred in the past…

Distinguished novelist E.M. Forster famously described story as “a series of events arranged in their time sequence.” Of note, he also compares and distinguishes story from plot: “plot is also a narrative of events, the emphasis falling on causality: ‘The king died and then the queen died’ is a story. But ‘the king died and then the queen died of grief’ is a plot. The time sequence is preserved, but the sense of causality overshadows it” (Forster 1927). But not just any narrative works for moving our audiences to act. Harari (2014) explains,

the difficulty lies not in telling the story, but in convincing everyone else to believe it. . .

Let’s consider other points of view.

To understand the narrative arc of successful stories, John Yorke studied numerous stories, and from those induced general principles Yorke (2015).

A journalist and author, too, has studied narrative structure but, with a different approach — he focuses on the cognitive science and psychology of how our mind works and relates those characteristics to story (Storr 2020). Story, writes Storr, typically begins with “unexpected change”, the “opening of an an information gap”, or both. Humans naturally want to understand the change, or close that gap; it becomes their goal. Language of messages and information that close the gap, then, form what we may think of as narrative’s plot.

Indeed, Storr (2020) suggests that the varying so-called designs for plot structure6 are all really different approaches to describing change:

But I suspect that none of these plot designs is actually the ‘right’ one. Beyond the basic three acts of Western storytelling, the only plot fundamental is that there must be regular change, much of which should preferably be driven by the protagonist, who changes along with it. It’s change that obsesses brains. The challenge that storytellers face is creating a plot that has enough unexpected change to hold attention over the course of an entire novel or film. This isn’t easy. For me, these different plot designs represent different methods of solving that complex problem. Each one is a unique recipe for a plot that moves relentlessly forwards, builds in intrigue and tension and never stops changing.

Other authors who consider data stories directly agree with this assessment. Matei and Hunter (2021) describes a good story in the context of cause and effect:

an excellent story connects an unexpected cause to a known effect or a known cause with an unexpected effect. This relationship is the necessary condition for the best stories.

Here’s why, they explain:

When stories violate our expectations, they intrigue. Then, unsure why the expectations have been subverted, the reader searches for an answer. Stories develop, each new question triggering our mind to find resolution. We follow this trail of questions until a new answer is revealed. To put it differently, the best stories are interesting because they keep the reader on their toes to the very end!

A quantitative analysis of over 100,000 narratives suggests these ideas, too (Robinson 2017).

We evolved for recognizing change, and for cause and effect. Thus, a narrative or story is driven forward by linking together change after change as instances of cause and effect. Indeed, to create initial interest, we only need to foreshadow change.

Let’s consider some examples that use information gaps and narratives in the context of data science. The short narratives in Wainer (2016), each about a data science concept that people frequently misunderstand, are exemplary. He begins each of these by setting up a contrast or information gap. In chapter 1, for example, the author teaches the “Rule of 72” as a heuristic to think about compounding quantities by posing a question7 to setup an information gap:

Great news! You have won a lottery and you can choose between one of two prizes. You can opt for either:

  1. $10,000 every day for a month, or

  2. One penny on the first day of the month, two on the second, four on the third, and continued doubling every day thereafter for the entire month.

Which option would you prefer?

Similarly, in chapter 2, Wainer (2016) teaches us implications of the law of large numbers by exposing an information gap. Again, he uses a question to setup an information gap:

“Virtuosos becoming a dime a dozen,” exclaimed Anthony Tommasini, chief music critic of the New York Times in his column in the arts section of that newspaper on Sunday, August 14, 2011.

But why?

Once he has setup his narratives, he bridges the gap. Let’s keep in mind that his purpose in these stories are for audience awareness. To teach. We can adapt these narrative8 concepts, though, in communications for other purposes.

Exercise 3.4 By this discussion, are the Citi Bike and Dodgers memos a story? If not, what they may lack? If so, explain what structure or form makes them a story. Do the story elements — or would they if used — add persuasive effect?

3.6 Revise for the audience

“We write a first draft for ourselves; the drafts thereafter increasingly for the reader” (J. Williams and Colomb 1990). Revision lets us switch from writing to understand to writing to explain. Switching audience is critical, and not doing so is a common mistake. Schimel (2012) explains one manifestation of the error:

Using an opening that explains a widely held schema is a flaw common with inexperienced writers. Developing scholars are still learning the material and assimilating it into their schemas. It isn’t yet ingrained knowledge, and the process of laying out the information and arguments, step by step, is part of what ingrains it to form the schema. Many developing scholars, therefore, have a hard time jumping over this material by assuming that their readers take it for granted. Rather, they are collecting their own thoughts and putting them down. There is nothing wrong with explaining things for yourself in a first draft. Many authors aren’t sure where they are going when they start, and it is not until the second or third paragraph that they get into the meat of the story. If you do this, though, when you revise, figure out where the real story starts and delete everything before that.

Revision gives us opportunity to focus on our audience once we understand what we have learned. This benefit alone is worth revision.

But it does more, especially when we allow time to pass between revisions: “If you start your project early, you’ll have time to let your revised draft cool. What seems good one day often looks different the next” (Booth et al. 2016). As you revise, read aloud. While normal conversations do not typically follow grammatically correct language, well-written communications should smoothly flow when read aloud. Try reading this sentence aloud, following the punctuation:

When we read prose, we hear it…it’s variable sound. It’s sound with — pauses. With emphasis. With, well, you know, a certain rhythm (Goodman 2008).

And when revising, consider each word and phrase, and test whether removing that word or phrase changes the context or meaning of the sentence for your audience. If not, remove it. In a similar manner, when choosing between two words with equally precise meaning, it is generally best to use the word with fewer syllables or that flows more naturally when read aloud.

Consider this 124-word blog post for what became a data-for-good project:

Improving Traffic Safety Through Video Analysis

Nearly 2,000 people die annually as a result of being involved in traffic-related accidents in Jakarta, Indonesia. The city government has invested resources in thousands of traffic cameras to help identify potential short-term (e.g. vendor carts in a hazardous location) and long-term (e.g. poorly engineered intersections) safety risks. However, manually analysing the available footage is an overwhelming task for the city’s Transportation Agency. In support of the Jakarta Smart City initiative, our team hopes to build a video-processing pipeline to extract structured information from raw traffic footage. This information can be integrated with collision, weather, and other data in order to build models which can help public officials quickly identify and assess traffic risks with the goal of reducing traffic-related fatalities and severe injuries.

The authors identified their audience explicitly in their award-winning write-up (Caldeira et al. 2018),

We want this project to provide a template for others who hope to successfully deploy machine learning and data driven systems in the developing world. . . . These lessons should be invaluable to the many researchers and data scientists who wish to partner with NGOs, governments, and other entities that are working to use machine learning in the developing world.

Exercise 3.5 Explain the similarities and differences in structure, categories of information, and level of detail between the above blog post and our two, example memos.

Exercise 3.6 Explain the similarities and differences you would expect between their audience and the background and experience you would expect from the chief analytics executive for the City of Jakarta.

Exercise 3.7 For this exercise, your relationship to the chief analytics executive at Jakarta would be that of employee or consultant (your choice). Revise the above blog post into a 250-word memo written to whom you imagine as the chief analytics executive for the City of Jakarta, with the purpose of him or her approving of you moving forward with the project, beginning with a formal proposal. You have the exclusive benefit of their post-project write-up, but don’t use any described actual results in your revisions.


  1. Examples of those with similar needs may categorically include, say, the intended audience in Caldeira et al. (2018).↩︎

  2. We will discuss this structure — old before new — in detail later.↩︎

  3. An expectation is specifically defined in probability theory. To optimize is also a specific mathematical concept.↩︎

  4. William Zinsser: A long-time teacher of writing at Columbia and Yale, the late professor and journalist is well-known for putting pen to paper, or finger to key, as the case may be.↩︎

  5. This form of narrative has dominated since Aristotle’s Poetics, but narrative is broader. See (Altman 2008).↩︎

  6. For example, (Snyder 2013) or (Booker 2004).↩︎

  7. Later, we will formalize discussion of questions, among other things, as tools of persuasion.↩︎

  8. For a detailed understanding of narrative, consult seminal and recent work, including (Altman 2008); (Bal 2017); (Ricoeur 1984); (Ricoeur 1985); and (Ricoeur 1988).↩︎