Data and sources for estimating run expectancies

To determine run expectancy for differing “game states” in baseball, we need observations of each at bat event over a given time frame. Run expectancy is typically reported over a regular season, so we will focus here on the 2017 MLB season. Along with game states, we need to know the batting team’s score before each at bat in each half inning and the score at the half inning’s end.

At bat information is publicly available from multiple sources. The at bat data from Major League Baseball Advanced Media (MLBAM), which can be conveniently “scraped” using the R package PitchRx provides one source. Another source, a little more tedius to obtain, but easier to work with once gathered is available as event files from retrosheet.org, and the website provides instructions for gathering that data. We begin here with data in hand, loaded in the R object named rs, which we subset below to keep only needed variables.

rs <- 
  subset(rs, 
         select = c(GAME_ID, EVENT_ID, INN_CT, BAT_HOME_ID, OUTS_CT, 
                    EVENT_OUTS_CT, BASE1_RUN_ID, BASE2_RUN_ID, BASE3_RUN_ID, 
                    AWAY_SCORE_CT, HOME_SCORE_CT, EVENT_CD, BAT_EVENT_FL))

These variables are structured as follows,

# This is simply a convenience function to display the 
# structure of an R object as a data frame.
str2df <- function(x, n=5) {
  data.frame(Variable = names(x),
             Classe = sapply(x, typeof),
             Values = sapply(x, function(x) paste0(head(x, n = n),  collapse = ", ")),
             row.names = NULL)
}

# load libraries
require(dplyr)
require(knitr)
require(stargazer)

str2df(rs, 4)

Variable	Classe	Values
GAME_ID	character	ANA201704070, ANA201704070, ANA201704070, ANA201704070
EVENT_ID	integer	1, 2, 3, 4
INN_CT	integer	1, 1, 1, 1
BAT_HOME_ID	integer	0, 0, 0, 0
OUTS_CT	integer	0, 0, 1, 2
EVENT_OUTS_CT	integer	0, 1, 1, 0
BASE1_RUN_ID	character	, seguj002, seguj002, seguj002
BASE2_RUN_ID	character	, , ,
BASE3_RUN_ID	character	, , ,
AWAY_SCORE_CT	integer	0, 0, 0, 0
HOME_SCORE_CT	integer	0, 0, 0, 0
EVENT_CD	integer	14, 3, 2, 4
BAT_EVENT_FL	logical	TRUE, TRUE, TRUE, FALSE

where GAME_ID is a unique game Id, EVENT_ID is a unique observation (usually at bat) within each game, BAT_EVENT_FL indicates a batting event, INN_CT is the inning number, a BAT_HOME_ID of 0 indicates top of the inning and 1, bottom of the inning, OUTS_CT is the number of outs preceding the observed event (at bat), EVENT_OUTS_CT is the number of outs after the observed at bat, BASE{1,2,3}_RUN_ID identifies any baserunners preceeding the event, {AWAY, HOME}_SCORE_CT are the away and home team scores preceeding the event.

Estimating run expectancy

Preserving and understanding uncertainty of our estimates

Run expectancies are usually reported as “point estimates,” which have been calculated as simple averages or as a “maximum likelihood estimate” of the data. Thus, we may see reported, say, an expectancy of 0.5 runs in the beginning of each half inning with no one is one base and no outs. But each half inning is different. There is, obviously, no uncertainty in simply counting what has happened. That’s our data. But we’d like to understand how much the outcomes vary and are expected to vary. Thus, here, we depart from the typical approach to estimating run expectancies of game states and use the bayesian modeling software Stan, which models and preserves this uncertainty in our estimates and predictions. The bonus is that this approach makes understanding the uncertainty straight forward.

In half innings, more runs occur less frequently

Simply from observing baseball games, we glean that, for a given half inning, the batting team scores incrementally, and each subsequent score is generally less likely to occur. The most common half-inning is one without the batting team scoring. A single score is less common, and scoring, say, ten or more is very rare. From basic probability, we can represent counting the frequency of scoring zero or more in a half inning using the poisson distribution or, improving on this intuition, we observe that, as mentioned, many observations will have a zero score (in stats jargon, the data is overdispersed). Thus, instead of poisson, we model the data using a “negative binomial”, which is similar to poisson but considers these many zero observations, too.

Our intuition generally agrees with the data,

require(ggplot2)
require(ggthemes)
rs %>%
  group_by(GAME_ID, INN_CT, BAT_HOME_ID) %>% 
  arrange(EVENT_ID) %>% 
  mutate(RUNS = ifelse(BAT_HOME_ID==1,
                          last(HOME_SCORE_CT) - first(HOME_SCORE_CT),
                          last(AWAY_SCORE_CT) - first(AWAY_SCORE_CT))) %>%
  summarise(RUNS = max(RUNS)) %>%
  arrange(RUNS) %>%
  ggplot(mapping = aes(x=RUNS)) + stat_count() + 
  scale_x_continuous(breaks = c(0:13)) +
  labs(x = "Runs scored in half-inning", y = "Frequency") + 
  theme_tufte(base_family = "sans")

Building a bayesian run expectancy model

The model below uses the R package rstanarm’s interface to Stan’s fully-bayesian inferencing capabilities. We model RUNS_AFTER_EVENT as a function of (~) all combinations of RUNNERS and OUTS_CT, using neg_binomial_2 as we just discussed, and give our model the cleaned data rs. Absent including -1 in the formula, the model would use the first game state as a reference instead of separately reporting it with the other game states. It is not always appropriate to exclude this intercept, and we can google to learn more. The remaining options, QR = TRUE and chains = 1, iter = 500, cores = 4, seed = TRUE, have been described in depth here mc-stan.org/rstanarm/reference/stan_glm.html.

require(rstanarm)
fit <- stan_glm(RUNS_AFTER_EVENT ~ -1 + RUNNERS : OUTS_CT, 
                family = neg_binomial_2, 
                data = rs, 
                QR = TRUE,
                chains = 1, iter = 500, cores = 4, seed = TRUE)

## 
## SAMPLING FOR MODEL 'count' NOW (CHAIN 1).
## 
## Gradient evaluation took 0.073799 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 737.99 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 500 [  0%]  (Warmup)
## Iteration:  50 / 500 [ 10%]  (Warmup)
## Iteration: 100 / 500 [ 20%]  (Warmup)
## Iteration: 150 / 500 [ 30%]  (Warmup)
## Iteration: 200 / 500 [ 40%]  (Warmup)
## Iteration: 250 / 500 [ 50%]  (Warmup)
## Iteration: 251 / 500 [ 50%]  (Sampling)
## Iteration: 300 / 500 [ 60%]  (Sampling)
## Iteration: 350 / 500 [ 70%]  (Sampling)
## Iteration: 400 / 500 [ 80%]  (Sampling)
## Iteration: 450 / 500 [ 90%]  (Sampling)
## Iteration: 500 / 500 [100%]  (Sampling)
## 
##  Elapsed Time: 324.265 seconds (Warm-up)
##                290.604 seconds (Sampling)
##                614.869 seconds (Total)

Model summary

We summarise the model,

summary(fit)

## 
## Model Info:
## 
##  function:  stan_glm
##  family:    neg_binomial_2 [log]
##  formula:   RUNS_AFTER_EVENT ~ -1 + RUNNERS:OUTS_CT
##  algorithm: sampling
##  priors:    see help('prior_summary')
##  sample:    250 (posterior sample size)
##  num obs:   184463
## 
## Estimates:
##                         mean      sd        2.5%      25%       50%    
## RUNNERS0-0-0:OUTS_CT0      -0.7       0.0      -0.7      -0.7      -0.7
## RUNNERS1-0-0:OUTS_CT0      -0.1       0.0      -0.1      -0.1      -0.1
## RUNNERS0-2-0:OUTS_CT0       0.2       0.0       0.1       0.1       0.2
## RUNNERS0-0-3:OUTS_CT0       0.3       0.1       0.2       0.3       0.3
## RUNNERS1-2-0:OUTS_CT0       0.4       0.0       0.3       0.4       0.4
## RUNNERS1-0-3:OUTS_CT0       0.6       0.0       0.5       0.6       0.6
## RUNNERS0-2-3:OUTS_CT0       0.7       0.1       0.6       0.7       0.7
## RUNNERS1-2-3:OUTS_CT0       0.8       0.1       0.7       0.7       0.8
## RUNNERS0-0-0:OUTS_CT1      -1.3       0.0      -1.3      -1.3      -1.3
## RUNNERS1-0-0:OUTS_CT1      -0.7       0.0      -0.7      -0.7      -0.7
## RUNNERS0-2-0:OUTS_CT1      -0.4       0.0      -0.4      -0.4      -0.4
## RUNNERS0-0-3:OUTS_CT1       0.0       0.0      -0.1      -0.1       0.0
## RUNNERS1-2-0:OUTS_CT1      -0.1       0.0      -0.1      -0.1      -0.1
## RUNNERS1-0-3:OUTS_CT1       0.2       0.0       0.1       0.1       0.2
## RUNNERS0-2-3:OUTS_CT1       0.4       0.0       0.3       0.4       0.4
## RUNNERS1-2-3:OUTS_CT1       0.5       0.0       0.4       0.4       0.5
## RUNNERS0-0-0:OUTS_CT2      -2.2       0.0      -2.3      -2.3      -2.2
## RUNNERS1-0-0:OUTS_CT2      -1.5       0.0      -1.6      -1.5      -1.5
## RUNNERS0-2-0:OUTS_CT2      -1.1       0.0      -1.2      -1.2      -1.1
## RUNNERS0-0-3:OUTS_CT2      -1.0       0.0      -1.1      -1.1      -1.0
## RUNNERS1-2-0:OUTS_CT2      -0.8       0.0      -0.9      -0.9      -0.8
## RUNNERS1-0-3:OUTS_CT2      -0.8       0.0      -0.8      -0.8      -0.8
## RUNNERS0-2-3:OUTS_CT2      -0.6       0.1      -0.7      -0.7      -0.6
## RUNNERS1-2-3:OUTS_CT2      -0.3       0.0      -0.4      -0.3      -0.3
## reciprocal_dispersion       0.5       0.0       0.5       0.5       0.5
## mean_PPD                    0.5       0.0       0.5       0.5       0.5
## log-posterior         -158999.2       3.9 -159007.9 -159001.6 -158998.8
##                         75%       97.5%  
## RUNNERS0-0-0:OUTS_CT0      -0.7      -0.7
## RUNNERS1-0-0:OUTS_CT0      -0.1      -0.1
## RUNNERS0-2-0:OUTS_CT0       0.2       0.2
## RUNNERS0-0-3:OUTS_CT0       0.4       0.5
## RUNNERS1-2-0:OUTS_CT0       0.4       0.4
## RUNNERS1-0-3:OUTS_CT0       0.6       0.7
## RUNNERS0-2-3:OUTS_CT0       0.8       0.8
## RUNNERS1-2-3:OUTS_CT0       0.8       0.9
## RUNNERS0-0-0:OUTS_CT1      -1.3      -1.3
## RUNNERS1-0-0:OUTS_CT1      -0.7      -0.6
## RUNNERS0-2-0:OUTS_CT1      -0.3      -0.3
## RUNNERS0-0-3:OUTS_CT1       0.0       0.1
## RUNNERS1-2-0:OUTS_CT1      -0.1       0.0
## RUNNERS1-0-3:OUTS_CT1       0.2       0.3
## RUNNERS0-2-3:OUTS_CT1       0.4       0.5
## RUNNERS1-2-3:OUTS_CT1       0.5       0.5
## RUNNERS0-0-0:OUTS_CT2      -2.2      -2.2
## RUNNERS1-0-0:OUTS_CT2      -1.5      -1.5
## RUNNERS0-2-0:OUTS_CT2      -1.1      -1.1
## RUNNERS0-0-3:OUTS_CT2      -1.0      -1.0
## RUNNERS1-2-0:OUTS_CT2      -0.8      -0.8
## RUNNERS1-0-3:OUTS_CT2      -0.7      -0.7
## RUNNERS0-2-3:OUTS_CT2      -0.6      -0.5
## RUNNERS1-2-3:OUTS_CT2      -0.2      -0.2
## reciprocal_dispersion       0.5       0.5
## mean_PPD                    0.5       0.5
## log-posterior         -158996.6 -158993.0
## 
## Diagnostics:
##                       mcse Rhat n_eff
## RUNNERS0-0-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-0-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-2-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-0-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-2-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-0-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-2-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-2-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-0-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-0-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-2-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-0-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-2-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-0-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-2-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-2-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-0-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-0-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS0-2-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS0-0-3:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-2-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-0-3:OUTS_CT2 0.0  1.0  250  
## RUNNERS0-2-3:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-2-3:OUTS_CT2 0.0  1.0  250  
## reciprocal_dispersion 0.0  1.1   32  
## mean_PPD              0.0  1.0  250  
## log-posterior         0.4  1.0   80  
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Reviewing the model summary, n_eff and Rhats both look good. The various quantiles of coefficients are reported as the log of run expectancies because the model used neg_binomial_2 family, which defaults to a link function transforming these median estimates of our game states to log units. We can more easily interpret them by applying an exp(), as shown below along with organizing their median values into a matrix.

The median run expectancy matrix

The median run expectancy for each of the 24 game states are as follows,

# Extract and name coefficients from model
m <- exp(coef(fit))
m <- matrix(m, nrow = 8, ncol = 3, byrow = F)
rownames(m) <- levels(rs$RUNNERS)
colnames(m) <- levels(rs$OUTS_CT)

# Reorder rows according to bases
runners_on_base <- 
  c("0-0-0", "1-0-0", "0-2-0", "0-0-3", "1-2-0", "1-0-3", "0-2-3", "1-2-3")
m <- m[match(runners_on_base, rownames(m)),]

# Show Matrix as Table
m

##               0         1         2
## 0-0-0 0.5110524 0.2682680 0.1068107
## 1-0-0 0.8927785 0.5117634 0.2193748
## 0-2-0 1.1664494 0.6953203 0.3179510
## 0-0-3 1.3992135 0.9726433 0.3545467
## 1-2-0 1.4842034 0.9216748 0.4308111
## 1-0-3 1.7917773 1.1931853 0.4676016
## 0-2-3 2.0476125 1.4653425 0.5341225
## 1-2-3 2.2020040 1.5968817 0.7635635

More interestingly, we can also compare the distributions of expected runs for each game state,

# Extract posterior draws of the predictors, transform and reshape for plotting
pp <- as.data.frame(fit)
pp <- reshape2::melt(pp, 
                     variable.name = "Game.State", 
                     value.name="Expected.Runs")
pp <- transform(pp, Game.State = as.character(Game.State))
pp <- subset(pp, subset = Game.State != "reciprocal_dispersion")

# Separate Base States from Outs
pp <- transform(pp, Outs = substr(Game.State, start = 21, stop = 21))
pp <- transform(pp, Runners = substr(Game.State, start = 8, stop = 12))

# Reorganize order of Base States and Outs for a cleaner plot
pp <- transform(pp, Outs = factor(Outs, 
                                  levels=c("0", "1", "2"), 
                                  labels=c("0 Outs", "1 Out", "2 Outs")))
pp <- transform(pp, 
      Runners = factor(Runners, 
      levels=runners_on_base,
      labels=runners_on_base))

# Drop Unneeded Variables
pp$Game.State <- NULL

# Transform posterior estimates
pp <- transform(pp, Expected.Runs = exp(Expected.Runs))

# create the plot
ggplot(pp) +
  geom_density(aes(x = Expected.Runs), fill = "#C4D8E2") +
  facet_grid(Runners ~ Outs, scales = "free_y", switch = "y") +
  theme_gray() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.x = element_text(size = 9, face = "bold"),
        strip.text.y = element_text(size = 9, face = "bold", angle = 180),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.spacing.x = unit(1, "lines")) +
  labs(x = "Expected Runs", y = "Baserunners")

The distributions above reflect the model’s uncertainty in the predictors, but importantly, when predicting or forecasting, we need to consider both this uncertainty and the natural variation in events. The mean posterior estimates of observed at bats in the 2017 season are shown below.

# combine preditions with original data
yrep <- posterior_predict(fit)
rs <- cbind(rs, yrep = colMeans(yrep))

# reorder levels for RUNNERS and OUTS_CT
# Reorganize order of Base States and Outs for a cleaner plot
rs <- transform(rs, Outs = factor(as.character(OUTS_CT),
                                  levels=c("0", "1", "2"),
                                  labels=c("0 Outs", "1 Out", "2 Outs")))
rs <- transform(rs,
      Runners = factor(as.character(RUNNERS),
      levels=runners_on_base,
      labels=runners_on_base))

# plot posterior estimates from original data
ggplot(rs) +
  geom_density(aes(x = yrep), fill = "#C4D8E2") +
  facet_grid(Runners ~ Outs, scales = "free_y", switch = "y") +
  theme_gray() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.x = element_text(size = 9, face = "bold"),
        strip.text.y = element_text(size = 9, face = "bold", angle = 180),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.spacing.x = unit(1, "lines")) +
  labs(x = "Expected Runs", y = "Baserunners")

Team wins—a chain of player events