Data in Wonderland

Explores communication with data in various forms through seminal and cutting-edge ideas in writing, data analyses, and visualization in the era of language models.

Author
Affiliations

Dr. Scott Spencer

Published

February 3, 2026

Preface

A public domain work with 100 million copies in 170 languages, Alice’s Adventures in Wonderland has seen steadily climbing search interest over the past two decades, its relevance renewed through endless remixing across media. This endurance speaks to something we need in data communication as much as in storytelling: knowing where we want to end up before we begin. When Alice asked, “Where should I go?” the Cheshire Cat replied, “That depends on where you want to end up.” Precisely. Whether wandering through Wonderland or wandering through datasets, we cannot chart a path without knowing our destination. Want to know what to communicate about data? First ask: What goals do you have? For whom? What actions should follow?

With audience and purpose defined, narrative and story can enhance others’ understanding of data, offering meaning and insights communicated in numbers and words, or in graphical encodings. Carefully combined, narratives, data analyses, and visuals can help enable change.

Three voices speak to different dimensions of this challenge: Sherlock Holmes demands the raw material of data, John Tukey insists on the revelatory power of visuals, and Daniel Kahneman reminds us that narrative is what ultimately drives decision-making. Together, they map the territory we must navigate. But who, exactly, is “we”? This book is written for a particular kind of traveler through this landscape.

Audience and assumed background

My primary travelers are my students at Columbia University. The readers who will get the most from this text, whom I have in mind as my more general audience, are curious, active learners:

An active learner asks questions, considers alternatives, questions assumptions, and even questions the trustworthiness of the author or speaker. An active learner tries to generalize specific examples, and devise specific examples for generalities.

An active learner doesn’t passively sponge up information — that doesn’t work! — but uses the readings and lecturer’s argument as a springboard for critical thought and deep understanding.

This text isn’t meant to be an end, but a beginning: it gives you hand-selected seminal and cutting-edge references for the concepts presented. Go down these rabbit holes with Alice, following citations and studying the cited material. Becoming an expert in storytelling with data also requires practice. Indeed,

Learners need to practice, to imitate well, to be highly motivated, and to have the ability to see likenesses between dissimilar things in [domains ranging from creative writing to mathematics] (Gaut 2014).

You may find some concepts difficult or vague on a first read. For that, I’ll offer encouragement from Abelson (1995):

I have tried to make the presentation accessible and clear, but some readers may find a few sections cryptic …. Use your judgment on what to skim. If you don’t follow the occasional formulas, read the words. If you don’t understand the words, follow the music and come back to the words later.

Ready?! Down the hole we go!

Structure and content

Empirical studies suggest that communication is generally more effective when its author controls all aspects of the communication, from content to typography and form. In some cases, though, we may enhance the communication by allowing our audience to choose among potential contexts for the information. Here, we explore many ideas within this framework, using a data analytics project as our content.

We begin with an introduction to communication scopes and the data analytics workflow, establishing the foundational concepts of context, specificity, and audience consideration. Building on this foundation, we then explore how the same principles of clear communication apply when working with generative artificial intelligence. Just as we must be specific with human audiences who lack our context, we must be equally specific with AI systems. This specification thinking—providing clear context, constraints, and success criteria—becomes the through-line of our approach.

Throughout this text, you will encounter a consistent pattern: each significant visualization or analysis is introduced first as a specification (an LLM Prompt) that describes what we want to achieve, followed by an implementation (the code that realizes that specification). This prompt-to-code format reflects the evolved workflow in modern analytics: we begin by clearly articulating our communicative intent, then use AI tools to generate implementations, and finally verify and refine the results. The code examples provided are exact implementations that demonstrate how precise specifications translate into working code.
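As a small illustration of this pattern (the prompt and function here are invented for this preface, not drawn from a chapter), a specification and its implementation might look like:

```python
# Hypothetical specification (LLM Prompt):
#   "Given a list of monthly revenues, compute each month's share of the
#    total, as a percentage rounded to one decimal place."

# Implementation realizing that specification, which we then verify:
def revenue_shares(monthly):
    """Return each value's percentage share of the total, rounded to 0.1."""
    total = sum(monthly)
    return [round(100 * m / total, 1) for m in monthly]

shares = revenue_shares([120, 80, 200])  # total 400 -> [30.0, 20.0, 50.0]
```

The specification states intent and success criteria; the code is the checkable artifact that realizes it.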

After establishing these foundations, we explore our content and audiences through narrative communication, considering not just our words but also the typographic forms of those words. We then begin to integrate other, visual forms of data representation into the narrative. As our discussions of graphical data encodings become more complex, we give them focus in the form of information graphics and dashboards, and finally enable our audience to become, to some degree, co-authors through interactive design of the communication.

Software information and conventions

This text embraces a specification-first approach to data analysis and visualization. Rather than beginning with code, we begin with clear, detailed descriptions of what we want to achieve—what we call “LLM Prompts” throughout the text. These specifications demonstrate how to communicate intent precisely, whether to generative AI systems or to human collaborators. The code examples that follow each prompt are exact implementations showing how precise specifications translate into working code. This workflow reflects modern analytics practice, where clear communication of intent precedes implementation, and where critical evaluation remains essential regardless of who—or what—generates the code.

Working with language models. To use generative AI effectively, we should understand its basic structure. Large language models (LLMs) are neural networks trained on vast text corpora to predict likely sequences of words. When we “prompt” a model, we provide context that guides its prediction; when we use “agents,” we chain prompts and tool calls to accomplish multi-step tasks. Models process input through a “context window” (their working memory) and can retrieve information via retrieval-augmented generation (RAG), accessing external documents during inference.

Models vary in their origins: some are proprietary (like GPT-4, Claude, or Gemini), while others are “open weights” (like Llama, Mistral, or Qwen), meaning their parameters are publicly available for download and local hosting. Open-weight models can be quantized, that is, compressed to smaller numerical precision, allowing them to run on consumer hardware. Software like llama.cpp, ollama, or vllm enables local hosting. Developer frameworks like langchain or autogen facilitate building agentic applications, while user interfaces like opencode or open-webui provide accessible ways to interact with models directly. Hardware choices matter: Apple’s unified memory architecture (M3 Ultra and beyond) combined with mlx, Apple’s machine learning framework, enables running large models locally without the memory constraints typical of discrete GPUs. Emerging technologies like RDMA over Thunderbolt 5 and Exo Labs clustering further extend these capabilities for distributed local inference.

These capabilities evolve rapidly, but the fundamental pattern remains: clear specifications in, useful implementations out. Where possible, we prefer frontier open-source tools over proprietary alternatives: they are freely available, inspectable, and avoid the dependency and gatekeeping that come with closed platforms. This aligns with our emphasis on transparency, reproducibility, and maintaining control over our analytical workflows.

Implementation tools are chosen based on how well they embody the underlying theory rather than any particular language. For data visualization, we employ tools grounded in the grammar of graphics—the theoretical framework that separates data transformation, aesthetic mapping, and geometric representation. This includes ggplot2 (R), plotnine (Python), Altair (Python/JavaScript), and D3.js (JavaScript). Each implements the same fundamental principles: mapping data variables to visual properties, layering geometric objects, and coordinating multiple views.

Data manipulation tools are selected for their expressive clarity—the extent to which code reads like the operations it performs. The tidyverse in R established this grammar-of-data-manipulation approach, using verbs like “filter,” “select,” “mutate,” and “summarize” that describe data transformations directly. In Python, siuba ports this same grammar on top of pandas, while polars brings similar semantics with performance optimizations. This grammar offers a structured vocabulary that maps directly to natural language—the same verbs appear in our specifications and LLM prompts, creating a seamless path from intent to implementation.
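The same verb sequence can be sketched directly in pandas method chaining (the toy flights data are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
flights = pd.DataFrame({"carrier": ["AA", "AA", "UA", "UA"],
                        "delay_min": [5, 15, 0, 30]})

# filter -> mutate -> group -> summarize, as one readable chain
result = (
    flights
    .loc[lambda d: d["delay_min"] > 0]               # filter: delayed flights
    .assign(delay_hr=lambda d: d["delay_min"] / 60)  # mutate: derived column
    .groupby("carrier", as_index=False)
    .agg(mean_delay_hr=("delay_hr", "mean"))         # summarize: per carrier
)
```

Read aloud, the chain matches the sentence we would write in a specification: keep delayed flights, convert delay to hours, and report the mean per carrier.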

For statistical analysis, we take an opinionated Bayesian-first approach. Rather than treating Bayesian methods as advanced extensions of classical statistics, we build from the foundation that probability represents degrees of belief and uncertainty quantification is essential to decision-making. We have language models write Stan, our primary modeling language, which we access through cmdstanr (R) or cmdstanpy (Python). This choice reflects the principle that computational tools should match the conceptual framework: Stan’s modeling language directly expresses the generative processes we imagine producing our data, making the connection between theory and implementation transparent.
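Stan itself waits for later chapters, but the core idea of updating degrees of belief with data can be sketched in a few lines. This is a conjugate Beta-Binomial update, a textbook special case rather than the book's Stan workflow:

```python
# With a Beta(a, b) prior on a success probability, observing k successes
# in n trials yields a Beta(a + k, b + n - k) posterior (conjugacy).

def beta_binomial_update(a, b, k, n):
    """Return the posterior Beta parameters after k successes in n trials."""
    return a + k, b + (n - k)

# Uniform prior Beta(1, 1), then observe 7 successes in 10 trials
a_post, b_post = beta_binomial_update(1, 1, 7, 10)
posterior_mean = a_post / (a_post + b_post)  # 8 / 12 = 2/3
```

The posterior is a full distribution over the unknown probability, not a point estimate, which is precisely the uncertainty quantification the Bayesian-first approach insists on carrying into decisions.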

About the author

I build bespoke probabilistic models that transform complex data into strategic insights for elite organizations—Major League Baseball and the English Premier League—then communicate those insights through visualizations and decision support tools I design. My consulting informs decisions involving multi-year contracts in the tens of millions, cross-league transfer evaluations, and roster construction strategies where improved assessment directly impacts competitive outcomes.

My approach centers on modeling actual data-generating processes—the physics, geometry, and human decisions that produce observable outcomes—rather than fitting statistical patterns to summary statistics. I jointly model related processes using custom Stan implementations: discrete Weibull AFT models for career survival, competing risks frameworks for match outcomes, Lee inverse Mills ratio corrections for selection bias, Hidden Markov Models for injury state transitions, and physics-informed models for human perception and decision-making. Recent work includes jointly modeling abilities for 100,000+ players across 85+ leagues spanning a decade, implementing hierarchical structures enabling cross-league comparison while fitting millions of observations with parallel cores.

While my portfolio centers on professional sports, the methodological challenges transfer to contexts I have supported: customer behavior analytics, health of global fisheries, and climate risk mitigation including modeling sea level rise impact on coastal property valuations.

I teach in the applied analytics graduate program at Columbia University, my alma mater, where I developed the curriculum for this book and manage numerous student teams through complex projects—experience translating directly to directing organizational analytics teams. I also designed and teach a comprehensive two-course Bayesian curriculum through Posit’s partner Athlyticz—over 200 professional instructional videos I produce, film, and edit myself. These courses (Bayesian I, Bayesian II) build foundations from probability through hierarchical models, Gaussian processes, and Hilbert-space approximations, extending to computational optimization including GPU acceleration and memory management. I co-teach multi-day Stan workshops and mentor users in Bayesian workflow through the Stan forums.

My background combines a JD with honors in research and communication, an MS in sports analytics (4.0 GPA), and a BS in chemical engineering focused on numerical methods. This dual training—rigorous Bayesian methodology with persuasive communication developed through a decade at major law firms—shapes everything in this text. Recognition includes winner of the 2017 SABR Analytics Competition (graduate division), primary analyst for The Real Madrid Way (BenBella Books 2016), and longlisting in the Kantar Information is Beautiful Awards.

I author this book and work in Stan, R (tidyverse, parallel computation), C++, Git, Quarto, SQL, D3.js, and Processing. I leverage locally-hosted LLMs and AI agentic CLI tools for privacy-preserving workflow optimization, and produce educational media with professional video production tools. Learn more about my work at ssp3nc3r.github.io/about or connect on LinkedIn.