Run Expectancy Distributions of MLB Game States

Dec 3, 2017 00:00 · 2592 words · 13 minutes read MLB Baseball Run Expectancy Bayes Stan rstanarm

Team wins—a chain of player events

The performance of a baseball player includes numerous on-field (and off-field) activities, roughly divided into skills of offensive (e.g., batting, baserunning) and defensive (e.g., pitching, fielding). The value of each skill depends upon its contribution to team wins, and wins are the difference in team score (runs). Breaking this down further, runs depend on advancing each half inning of baseball from one “state” to another. By state, we mean the base locations of any runners and the number of outs.

We can determine the range of runs expected to result from each state and then link the outcomes of each player’s possible contributions to changes in run expectancy. In this article, we estimate the distribution of run expectancy for each game state.

Data and sources for estimating run expectancies

To determine run expectancy for differing “game states” in baseball, we need observations of each at bat event over a given time frame. Run expectancy is typically reported over a regular season, so we will focus here on the 2017 MLB season. Along with game states, we need to know the batting team’s score before each at bat in each half inning and the score at the half inning’s end.

At bat information is publicly available from multiple sources. The at bat data from Major League Baseball Advanced Media (MLBAM), which can be conveniently “scraped” using the R package PitchRx provides one source. Another source, a little more tedius to obtain, but easier to work with once gathered is available as event files from retrosheet.org, and the website provides instructions for gathering that data. We begin here with data in hand, loaded in the R object named rs, which we subset below to keep only needed variables.

rs <- 
  subset(rs, 
         select = c(GAME_ID, EVENT_ID, INN_CT, BAT_HOME_ID, OUTS_CT, 
                    EVENT_OUTS_CT, BASE1_RUN_ID, BASE2_RUN_ID, BASE3_RUN_ID, 
                    AWAY_SCORE_CT, HOME_SCORE_CT, EVENT_CD, BAT_EVENT_FL))

These variables are structured as follows,

# This is simply a convenience function to display the 
# structure of an R object as a data frame.
str2df <- function(x, n=5) {
  data.frame(Variable = names(x),
             Classe = sapply(x, typeof),
             Values = sapply(x, function(x) paste0(head(x, n = n),  collapse = ", ")),
             row.names = NULL)
}

# load libraries
require(dplyr)
require(knitr)
require(stargazer)

str2df(rs, 4) 
Variable Classe Values
GAME_ID character ANA201704070, ANA201704070, ANA201704070, ANA201704070
EVENT_ID integer 1, 2, 3, 4
INN_CT integer 1, 1, 1, 1
BAT_HOME_ID integer 0, 0, 0, 0
OUTS_CT integer 0, 0, 1, 2
EVENT_OUTS_CT integer 0, 1, 1, 0
BASE1_RUN_ID character , seguj002, seguj002, seguj002
BASE2_RUN_ID character , , ,
BASE3_RUN_ID character , , ,
AWAY_SCORE_CT integer 0, 0, 0, 0
HOME_SCORE_CT integer 0, 0, 0, 0
EVENT_CD integer 14, 3, 2, 4
BAT_EVENT_FL logical TRUE, TRUE, TRUE, FALSE

where GAME_ID is a unique game Id, EVENT_ID is a unique observation (usually at bat) within each game, BAT_EVENT_FL indicates a batting event, INN_CT is the inning number, a BAT_HOME_ID of 0 indicates top of the inning and 1, bottom of the inning, OUTS_CT is the number of outs preceding the observed event (at bat), EVENT_OUTS_CT is the number of outs after the observed at bat, BASE{1,2,3}_RUN_ID identifies any baserunners preceeding the event, {AWAY, HOME}_SCORE_CT are the away and home team scores preceeding the event.

Calculate runs scored after every game state of a half inning

Before modeling run expectancy with this data, we “transform” some variables and create new variables. We group these data by half inning, sort observations therein by event, remove incomplete half-innings (most typically occurring in bottom of the ninth or greater inning when the home team breaks the tied score), and calculate the difference between the batting team’s score preceeding the event and their score at half-inning’s end. After using OUTS_CT as a number to remove incomplete half innings, we transform it into a category or factor for modeling. In code,

rs <- 
  group_by(rs, GAME_ID, INN_CT, BAT_HOME_ID) %>%
  arrange(GAME_ID, INN_CT, BAT_HOME_ID, EVENT_ID) %>%
  filter( (last(OUTS_CT) + last(EVENT_OUTS_CT))==3 ) %>%
  mutate(RUNS_AFTER_EVENT = 
           ifelse(BAT_HOME_ID == 0, 
                  last(AWAY_SCORE_CT) - AWAY_SCORE_CT, 
                  last(HOME_SCORE_CT) - HOME_SCORE_CT))

Variables for the base runners (shown in the first two transform() below) are combined into a single variable with 8 “levels”: “0-0-0” represents bases empty, “1-0-0” represents a runner on first, “0-2-0” a runner on second, and so on. Observations in the data primarily relate to the batting event, and we filter out others. Then, we drop variables we no longer need (using subset()),

# create variable of base runner states
rs <- transform(rs, 
                ON_1B = as.integer(!BASE1_RUN_ID == ""),
                ON_2B = as.integer(!BASE2_RUN_ID == "")*2,
                ON_3B = as.integer(!BASE3_RUN_ID == "")*3)
rs <- transform(rs, RUNNERS = paste0(ON_1B, "-", ON_2B, "-", ON_3B))
runners_on_base <- 
  c("0-0-0", "1-0-0", "0-2-0", "0-0-3", "1-2-0", "1-0-3", "0-2-3", "1-2-3")
rs <- transform(rs, 
                 RUNNERS = factor(RUNNERS, 
                                 levels=runners_on_base,
                                 labels=runners_on_base))
# tranform outs into factor
rs <- transform(rs, OUTS_CT = factor(OUTS_CT, 
                                  levels=c("0", "1", "2")))

# drop unneeded variables
rs <- subset(rs, 
             subset = BAT_EVENT_FL == TRUE, 
             select = c(-ON_1B, -ON_2B, -ON_3B, -BAT_EVENT_FL,
                        -BASE1_RUN_ID, -BASE2_RUN_ID, -BASE3_RUN_ID))

Estimating run expectancy

Preserving and understanding uncertainty of our estimates

Run expectancies are usually reported as “point estimates,” which have been calculated as simple averages or as a “maximum likelihood estimate” of the data. Thus, we may see reported, say, an expectancy of 0.5 runs in the beginning of each half inning with no one is one base and no outs. But each half inning is different. There is, obviously, no uncertainty in simply counting what has happened. That’s our data. But we’d like to understand how much the outcomes vary and are expected to vary. Thus, here, we depart from the typical approach to estimating run expectancies of game states and use the bayesian modeling software Stan, which models and preserves this uncertainty in our estimates and predictions. The bonus is that this approach makes understanding the uncertainty straight forward.

In half innings, more runs occur less frequently

Simply from observing baseball games, we glean that, for a given half inning, the batting team scores incrementally, and each subsequent score is generally less likely to occur. The most common half-inning is one without the batting team scoring. A single score is less common, and scoring, say, ten or more is very rare. From basic probability, we can represent counting the frequency of scoring zero or more in a half inning using the poisson distribution or, improving on this intuition, we observe that, as mentioned, many observations will have a zero score (in stats jargon, the data is overdispersed). Thus, instead of poisson, we model the data using a “negative binomial”, which is similar to poisson but considers these many zero observations, too.

Our intuition generally agrees with the data,

require(ggplot2)
require(ggthemes)
rs %>%
  group_by(GAME_ID, INN_CT, BAT_HOME_ID) %>% 
  arrange(EVENT_ID) %>% 
  mutate(RUNS = ifelse(BAT_HOME_ID==1,
                          last(HOME_SCORE_CT) - first(HOME_SCORE_CT),
                          last(AWAY_SCORE_CT) - first(AWAY_SCORE_CT))) %>%
  summarise(RUNS = max(RUNS)) %>%
  arrange(RUNS) %>%
  ggplot(mapping = aes(x=RUNS)) + stat_count() + 
  scale_x_continuous(breaks = c(0:13)) +
  labs(x = "Runs scored in half-inning", y = "Frequency") + 
  theme_tufte(base_family = "sans")

Building a bayesian run expectancy model

The model below uses the R package rstanarm’s interface to Stan’s fully-bayesian inferencing capabilities. We model RUNS_AFTER_EVENT as a function of (~) all combinations of RUNNERS and OUTS_CT, using neg_binomial_2 as we just discussed, and give our model the cleaned data rs. Absent including -1 in the formula, the model would use the first game state as a reference instead of separately reporting it with the other game states. It is not always appropriate to exclude this intercept, and we can google to learn more. The remaining options, QR = TRUE and chains = 1, iter = 500, cores = 4, seed = TRUE, have been described in depth here mc-stan.org/rstanarm/reference/stan_glm.html.

require(rstanarm)
fit <- stan_glm(RUNS_AFTER_EVENT ~ -1 + RUNNERS : OUTS_CT, 
                family = neg_binomial_2, 
                data = rs, 
                QR = TRUE,
                chains = 1, iter = 500, cores = 4, seed = TRUE)
## 
## SAMPLING FOR MODEL 'count' NOW (CHAIN 1).
## 
## Gradient evaluation took 0.073799 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 737.99 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 500 [  0%]  (Warmup)
## Iteration:  50 / 500 [ 10%]  (Warmup)
## Iteration: 100 / 500 [ 20%]  (Warmup)
## Iteration: 150 / 500 [ 30%]  (Warmup)
## Iteration: 200 / 500 [ 40%]  (Warmup)
## Iteration: 250 / 500 [ 50%]  (Warmup)
## Iteration: 251 / 500 [ 50%]  (Sampling)
## Iteration: 300 / 500 [ 60%]  (Sampling)
## Iteration: 350 / 500 [ 70%]  (Sampling)
## Iteration: 400 / 500 [ 80%]  (Sampling)
## Iteration: 450 / 500 [ 90%]  (Sampling)
## Iteration: 500 / 500 [100%]  (Sampling)
## 
##  Elapsed Time: 324.265 seconds (Warm-up)
##                290.604 seconds (Sampling)
##                614.869 seconds (Total)

Model summary

We summarise the model,

summary(fit)
## 
## Model Info:
## 
##  function:  stan_glm
##  family:    neg_binomial_2 [log]
##  formula:   RUNS_AFTER_EVENT ~ -1 + RUNNERS:OUTS_CT
##  algorithm: sampling
##  priors:    see help('prior_summary')
##  sample:    250 (posterior sample size)
##  num obs:   184463
## 
## Estimates:
##                         mean      sd        2.5%      25%       50%    
## RUNNERS0-0-0:OUTS_CT0      -0.7       0.0      -0.7      -0.7      -0.7
## RUNNERS1-0-0:OUTS_CT0      -0.1       0.0      -0.1      -0.1      -0.1
## RUNNERS0-2-0:OUTS_CT0       0.2       0.0       0.1       0.1       0.2
## RUNNERS0-0-3:OUTS_CT0       0.3       0.1       0.2       0.3       0.3
## RUNNERS1-2-0:OUTS_CT0       0.4       0.0       0.3       0.4       0.4
## RUNNERS1-0-3:OUTS_CT0       0.6       0.0       0.5       0.6       0.6
## RUNNERS0-2-3:OUTS_CT0       0.7       0.1       0.6       0.7       0.7
## RUNNERS1-2-3:OUTS_CT0       0.8       0.1       0.7       0.7       0.8
## RUNNERS0-0-0:OUTS_CT1      -1.3       0.0      -1.3      -1.3      -1.3
## RUNNERS1-0-0:OUTS_CT1      -0.7       0.0      -0.7      -0.7      -0.7
## RUNNERS0-2-0:OUTS_CT1      -0.4       0.0      -0.4      -0.4      -0.4
## RUNNERS0-0-3:OUTS_CT1       0.0       0.0      -0.1      -0.1       0.0
## RUNNERS1-2-0:OUTS_CT1      -0.1       0.0      -0.1      -0.1      -0.1
## RUNNERS1-0-3:OUTS_CT1       0.2       0.0       0.1       0.1       0.2
## RUNNERS0-2-3:OUTS_CT1       0.4       0.0       0.3       0.4       0.4
## RUNNERS1-2-3:OUTS_CT1       0.5       0.0       0.4       0.4       0.5
## RUNNERS0-0-0:OUTS_CT2      -2.2       0.0      -2.3      -2.3      -2.2
## RUNNERS1-0-0:OUTS_CT2      -1.5       0.0      -1.6      -1.5      -1.5
## RUNNERS0-2-0:OUTS_CT2      -1.1       0.0      -1.2      -1.2      -1.1
## RUNNERS0-0-3:OUTS_CT2      -1.0       0.0      -1.1      -1.1      -1.0
## RUNNERS1-2-0:OUTS_CT2      -0.8       0.0      -0.9      -0.9      -0.8
## RUNNERS1-0-3:OUTS_CT2      -0.8       0.0      -0.8      -0.8      -0.8
## RUNNERS0-2-3:OUTS_CT2      -0.6       0.1      -0.7      -0.7      -0.6
## RUNNERS1-2-3:OUTS_CT2      -0.3       0.0      -0.4      -0.3      -0.3
## reciprocal_dispersion       0.5       0.0       0.5       0.5       0.5
## mean_PPD                    0.5       0.0       0.5       0.5       0.5
## log-posterior         -158999.2       3.9 -159007.9 -159001.6 -158998.8
##                         75%       97.5%  
## RUNNERS0-0-0:OUTS_CT0      -0.7      -0.7
## RUNNERS1-0-0:OUTS_CT0      -0.1      -0.1
## RUNNERS0-2-0:OUTS_CT0       0.2       0.2
## RUNNERS0-0-3:OUTS_CT0       0.4       0.5
## RUNNERS1-2-0:OUTS_CT0       0.4       0.4
## RUNNERS1-0-3:OUTS_CT0       0.6       0.7
## RUNNERS0-2-3:OUTS_CT0       0.8       0.8
## RUNNERS1-2-3:OUTS_CT0       0.8       0.9
## RUNNERS0-0-0:OUTS_CT1      -1.3      -1.3
## RUNNERS1-0-0:OUTS_CT1      -0.7      -0.6
## RUNNERS0-2-0:OUTS_CT1      -0.3      -0.3
## RUNNERS0-0-3:OUTS_CT1       0.0       0.1
## RUNNERS1-2-0:OUTS_CT1      -0.1       0.0
## RUNNERS1-0-3:OUTS_CT1       0.2       0.3
## RUNNERS0-2-3:OUTS_CT1       0.4       0.5
## RUNNERS1-2-3:OUTS_CT1       0.5       0.5
## RUNNERS0-0-0:OUTS_CT2      -2.2      -2.2
## RUNNERS1-0-0:OUTS_CT2      -1.5      -1.5
## RUNNERS0-2-0:OUTS_CT2      -1.1      -1.1
## RUNNERS0-0-3:OUTS_CT2      -1.0      -1.0
## RUNNERS1-2-0:OUTS_CT2      -0.8      -0.8
## RUNNERS1-0-3:OUTS_CT2      -0.7      -0.7
## RUNNERS0-2-3:OUTS_CT2      -0.6      -0.5
## RUNNERS1-2-3:OUTS_CT2      -0.2      -0.2
## reciprocal_dispersion       0.5       0.5
## mean_PPD                    0.5       0.5
## log-posterior         -158996.6 -158993.0
## 
## Diagnostics:
##                       mcse Rhat n_eff
## RUNNERS0-0-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-0-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-2-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-0-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-2-0:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-0-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-2-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS1-2-3:OUTS_CT0 0.0  1.0  250  
## RUNNERS0-0-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-0-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-2-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-0-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-2-0:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-0-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-2-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS1-2-3:OUTS_CT1 0.0  1.0  250  
## RUNNERS0-0-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-0-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS0-2-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS0-0-3:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-2-0:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-0-3:OUTS_CT2 0.0  1.0  250  
## RUNNERS0-2-3:OUTS_CT2 0.0  1.0  250  
## RUNNERS1-2-3:OUTS_CT2 0.0  1.0  250  
## reciprocal_dispersion 0.0  1.1   32  
## mean_PPD              0.0  1.0  250  
## log-posterior         0.4  1.0   80  
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Reviewing the model summary, n_eff and Rhats both look good. The various quantiles of coefficients are reported as the log of run expectancies because the model used neg_binomial_2 family, which defaults to a link function transforming these median estimates of our game states to log units. We can more easily interpret them by applying an exp(), as shown below along with organizing their median values into a matrix.

The median run expectancy matrix

The median run expectancy for each of the 24 game states are as follows,

# Extract and name coefficients from model
m <- exp(coef(fit))
m <- matrix(m, nrow = 8, ncol = 3, byrow = F)
rownames(m) <- levels(rs$RUNNERS)
colnames(m) <- levels(rs$OUTS_CT)

# Reorder rows according to bases
runners_on_base <- 
  c("0-0-0", "1-0-0", "0-2-0", "0-0-3", "1-2-0", "1-0-3", "0-2-3", "1-2-3")
m <- m[match(runners_on_base, rownames(m)),]

# Show Matrix as Table
m
##               0         1         2
## 0-0-0 0.5110524 0.2682680 0.1068107
## 1-0-0 0.8927785 0.5117634 0.2193748
## 0-2-0 1.1664494 0.6953203 0.3179510
## 0-0-3 1.3992135 0.9726433 0.3545467
## 1-2-0 1.4842034 0.9216748 0.4308111
## 1-0-3 1.7917773 1.1931853 0.4676016
## 0-2-3 2.0476125 1.4653425 0.5341225
## 1-2-3 2.2020040 1.5968817 0.7635635

More interestingly, we can also compare the distributions of expected runs for each game state,

# Extract posterior draws of the predictors, transform and reshape for plotting
pp <- as.data.frame(fit)
pp <- reshape2::melt(pp, 
                     variable.name = "Game.State", 
                     value.name="Expected.Runs")
pp <- transform(pp, Game.State = as.character(Game.State))
pp <- subset(pp, subset = Game.State != "reciprocal_dispersion")

# Separate Base States from Outs
pp <- transform(pp, Outs = substr(Game.State, start = 21, stop = 21))
pp <- transform(pp, Runners = substr(Game.State, start = 8, stop = 12))

# Reorganize order of Base States and Outs for a cleaner plot
pp <- transform(pp, Outs = factor(Outs, 
                                  levels=c("0", "1", "2"), 
                                  labels=c("0 Outs", "1 Out", "2 Outs")))
pp <- transform(pp, 
      Runners = factor(Runners, 
      levels=runners_on_base,
      labels=runners_on_base))

# Drop Unneeded Variables
pp$Game.State <- NULL

# Transform posterior estimates
pp <- transform(pp, Expected.Runs = exp(Expected.Runs))

# create the plot
ggplot(pp) +
  geom_density(aes(x = Expected.Runs), fill = "#C4D8E2") +
  facet_grid(Runners ~ Outs, scales = "free_y", switch = "y") +
  theme_gray() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.x = element_text(size = 9, face = "bold"),
        strip.text.y = element_text(size = 9, face = "bold", angle = 180),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.spacing.x = unit(1, "lines")) +
  labs(x = "Expected Runs", y = "Baserunners")

The distributions above reflect the model’s uncertainty in the predictors, but importantly, when predicting or forecasting, we need to consider both this uncertainty and the natural variation in events. The mean posterior estimates of observed at bats in the 2017 season are shown below.

# combine preditions with original data
yrep <- posterior_predict(fit)
rs <- cbind(rs, yrep = colMeans(yrep))

# reorder levels for RUNNERS and OUTS_CT
# Reorganize order of Base States and Outs for a cleaner plot
rs <- transform(rs, Outs = factor(as.character(OUTS_CT),
                                  levels=c("0", "1", "2"),
                                  labels=c("0 Outs", "1 Out", "2 Outs")))
rs <- transform(rs,
      Runners = factor(as.character(RUNNERS),
      levels=runners_on_base,
      labels=runners_on_base))

# plot posterior estimates from original data
ggplot(rs) +
  geom_density(aes(x = yrep), fill = "#C4D8E2") +
  facet_grid(Runners ~ Outs, scales = "free_y", switch = "y") +
  theme_gray() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.x = element_text(size = 9, face = "bold"),
        strip.text.y = element_text(size = 9, face = "bold", angle = 180),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.spacing.x = unit(1, "lines")) +
  labs(x = "Expected Runs", y = "Baserunners")