“Can A.I. Be Taught to Explain Itself?” That’s the question recently posed in The New York Times Magazine. A search uncovered helpful references — An Introduction to Statistical Learning with Applications in R and The Elements of Statistical Learning, for example — but none with A.I. as author.

The available literature is better understood with some background in the languages and tools behind statistical learning.

R Markdown—a tool for reproducible reporting

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.

— Baker, M. (2016). Is There a Reproducibility Crisis? Nature, 533(26), 452-454.

Learning follows reproducibility. R Markdown provides a tool to aid reproducibility of statistical analyses in R. An R Markdown file is just a plain text file containing specific syntax that a program like R Studio can use to knit code and results within our common language descriptions of the work seamlessly together into any of several formats, including pdf and html. From the R console, the command to render the markdown file is rmarkdown::render(<input>, <output_format>). Markdown syntax is easy to master. Beyond using the R Markdown file from which this paper was rendered as a template, R Studio provides ample information for the syntax: R Markdown from R Studio. For a tutorial, see also, Shalizi Using R Markdown for Class Reports (Carnegie Mellon University 22 August 2016). The basics are all most need; power-users can reference Xie’s Dynamic Documents with R and knitr (CRC Press 2015).

Mathematical notation can be included in a markdown file using latex — either within a paragraph like this fraction \(\frac{2}{3}\) or in its own space, like this representation of binomial probability,

\[f(y|N,p) = \frac{N!}{y!(N-y)!}\cdot p^y \cdot (1-p)^{N-y} = {{N}\choose{y}} \cdot p^y \cdot (1-p)^{N-y}\]

the expression to which we return when discussing probability. Typesetting latex code for math can be found online at, for example, the The LaTeX Project, but web-enabled math editors (e.g., Visual Math Editor) provide the latex code for us when we simply select the symbols, and allow us to copy the code, and paste into a markdown file between dollar signs.

From mathematical notation to R

As is evident in statistical learning literature, the subject is grounded upon and explained with the following languages,

  • Linear Algebra (e.g., vector, matrix, and vector-matrix multiplication, matrix inversion)
  • Calculus (e.g., derivatives, chain rule, integration)
  • Probability theory and Statistics (e.g., binomial and normal distributions, regression)
  • Optimization (e.g., necessary conditions for optimality, iterative methods)
  • Programming (e.g., translate algorithm descriptions into code)

Let’s review some basics, translating examples of the above into the R programming language.

Linear Algebra

We perform mathematical operations on vectors like \(\boldsymbol{a} =(v_1, v_2, v_3) \in (\mathbb{R},\mathbb{R},\mathbb{R}) \equiv \mathbb{R}^3\), and matrices like \(\boldsymbol{A} \in \mathbb{R} ^{m\times n}\), a rectangular array of real numbers with \(m\) rows and \(n\) columns, for example,

\[ \boldsymbol{A} =\left[\begin{matrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{matrix} \right] \in \left[\begin{matrix} \mathbb{R} & \mathbb{R} & \mathbb{R} \\ \mathbb{R} & \mathbb{R} & \mathbb{R} \\ \mathbb{R} & \mathbb{R} & \mathbb{R} \end{matrix} \right] \equiv \mathbb{R}^{3\times3} \]

where \(m\) and \(n\) are both 3. Let’s code vectors \(\boldsymbol{a}\) and \(\boldsymbol{b}\) in R using the standard R function c(), which means combine, and matrix \(\boldsymbol{A}\) using matrix(), assigning them some values like so,

# Vectors a and b
a <- c(2, 3, -1)
b <- c(1/2, 0, pi)

# matrix A
A <- matrix(data = c(4,4,-2,2,6,2,2,8,4),
            nrow = 3, 
            ncol = 3, 
            byrow = F)

In math notation, we assigned to \(\boldsymbol{a}\), scalars of the form \((x_1, x_2, x_3)\) such as: \(\mathbf{a}=(2, 3, -1)\) and \(\mathbf{b}=(\frac{1}{2}, 0, \pi)\). We can apply various algebraic operations to vectors including addition, subtraction, scaling, norm (length), dot product, and cross product. For each operation, where \(\boldsymbol{a} = (a_1, a_2, a_3)\) and \(\boldsymbol{b} = (b_1, b_2, b_3)\), we code mathematical operations,

\[ \begin{aligned} \boldsymbol{a} + \boldsymbol{b} &= (a_1 + b_1, a_2 + b_2, a_3 + b_3)\\ \boldsymbol{a} - \boldsymbol{b} &= (a_1 - b_1, a_2 - b_2, a_3 - b_3)\\ c\boldsymbol{a} &= (ca_1, ca_2, ca_3)\\ \left\|\boldsymbol{a}\right\| &= \sqrt{a_1^2+a_2^2+a_3^2} \\ \boldsymbol{a}\cdot \boldsymbol{b} &= a_1b_1+a_2b_2+a_3b_3=\left\|\boldsymbol{a}\right\|\left\|\boldsymbol{b}\right\| cos(\theta)\\ \boldsymbol{a}\times \boldsymbol{b} &= (a_2b_3-a_3b_2, a_3b_1-a_1b_3,a_1b_2-a_2b_1) \end{aligned} \]

in R using the syntax,

a + b             # adding vectors
a - b             # subtracting vectors
3 * a             # scaling vector by three, a constant
sqrt(sum(a ^ 2))  # norm or length of vector
a %*% b           # dot product of vectors
crossprod(a,b)    # cross product of vectors

In this example, \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are not orthogonal (i.e., perpendicular to one another) as their dot product isn’t zero. We are reminded that the dot product also provides that \(\boldsymbol{a}\cdot \hat{e_x} = a_x = a \cos\alpha\), \(\boldsymbol{a}\cdot \hat{e_y} = a_y = a \cos\beta\), \(\boldsymbol{a}\cdot \hat{e_z} = a_z = a \cos\gamma\). Let’s calculate the unit (directional aspect) vector \((\hat{e_x}, \hat{e_y}, \hat{e_z})\) of \(\boldsymbol{a}\) in R by dividing it with \(\left\|\boldsymbol{a}\right\|\),

a / sqrt(sum(a ^ 2))

For matrices, we code matrix addition and subtraction,

\[ \begin{aligned} \boldsymbol{C}=\boldsymbol{A}+\boldsymbol{B}&\Leftrightarrow c_{ij}=a_{ij}+b_{ij}\\ \boldsymbol{C}=\boldsymbol{A}-\boldsymbol{B}&\Leftrightarrow c_{ij}=a_{ij}-b_{ij}\\ \end{aligned} \]

in R syntax,

# matrix B
B <- matrix(c(2,1,6,1,3,4,6,4,-2), 3, 3)

C <- A + B # matrix addition
C <- A - B # matrix subtraction

And we code the product of two matrices \(\boldsymbol{A}^{m \times n}\) and \(\boldsymbol{B}^{n \times p}\) (note, \(\boldsymbol{A}\boldsymbol{B} \neq \boldsymbol{B}\boldsymbol{A}\)),

\[ \begin{aligned} \boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}&\Leftrightarrow c_{ij}=\sum\limits_{k=1}^{n}{a_{ik}b_{kj}}\\ \left[\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{matrix} \right] \left[\begin{matrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{matrix} \right] &= \left[\begin{matrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22} \\ a_{31}b_{11} + a_{32}b_{21} & a_{31}b_{12}+a_{32}b_{22} \end{matrix} \right] \end{aligned} \]

in R syntax,

A %*% B

Note that column vector \(\left[\begin{matrix} b_{11} \\ b_{21} \end{matrix} \right]\) is just a \(2\times1\) matrix, so we can take the product of matrix \(\boldsymbol{A}\) and a column vector, just as above. In R,

A %*% B[,1]

Matrix inversion \(\boldsymbol{A}^{-1}\), is performed in R — saving us from manually inverting it with, say, the Gauss–Jordan elimination procedure — by simply coding,


while matrix transpose \(\boldsymbol{A}^\mathrm{T}\), which flips and rotates the matrix,

\[ \begin{aligned} \left[\begin{matrix} a_{1} & a_{2} & a_{3}\\ b_{1} & b_{2} & b_{3}\end{matrix} \right]^\mathrm{T} =\left[\begin{matrix} a_{1} & b_{1} \\ a_{2} & b_{2} \\ a_{3} & b_{3} \end{matrix} \right] \end{aligned} \]

is coded as,


By multiplying a matrix by its inverse,

\[\boldsymbol{A}^{-1}\boldsymbol{A} = \mathbb{I} \equiv \left[\begin{matrix} 1 & & 0 \\ &\ddots & \\ 0 & & 1 \end{matrix} \right]\] we obtain an identity matrix \(\mathbb{I}\) from \(\boldsymbol{A}^{-1}\boldsymbol{A}\) in R by,

solve(A) %*% A

And we can create an identity matrix \(\mathbb{I}\) directly,

diag(c(1, 1, 1))

Not all matrices can be inverted, of course, but we can check by finding the determinant of a matrix, \(\left|\boldsymbol{A}\right|\) as,


Since \(\left|\boldsymbol{A}\right| \neq 0\) in our example, it’s invertible.

The rank of a matrix can be calculated using QR decomposition in R,


Matrix operations in R provide a convenient way to solve systems of equations. Equations,

\[ \begin{aligned} a_{11} x_1 + a_{12}x_2 &= b_1\\ a_{21} x_1 + a_{22}x_2 &= b_2\\ \end{aligned} \]

when converted to an augmented matrix of the form \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{b}\),

\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \]

can be solved in R by taking the matrix inverse of both sides. For example, the matrix solution to the following system of linear equations,

\[ \begin{aligned} 1x_1 + 2x_2 &= 5\\ 3x_1 + 9x_2 &= 21 \end{aligned} \]

is \(\boldsymbol{x}=\boldsymbol{A}^{-1}\boldsymbol{b}\), or as solved in R,

A <- matrix(c(1,3,2,9), 2, 2)
b <- c(5, 21)
x <- solve(A) %*% b

A quick review of linear algebra is presented in R by Højsgaard and Carstensen, Introductory linear algebra with R, Version 5.2 (bendixcarstensen.com April 2016). Kolter’s short review of the math is excellent: the video Linear Algebra Review and notes Linear Algebra Review and Reference (Carnegie Mellon University 2008). The subject is introduced in chapter two of Goodfellow’s Deep Learning (MIT Press 2016), and with clarity in Lang’s Introduction to Linear Algebra, Second Edition (Springer 1986), Axler’s Linear Algebra Done Right, Third Edition (Springer 2015), and Treil’s Linear Algebra Done Wrong (Brown University 2017), which is free (and not done wrong), as is Pinkham’s Linear Algebra (Draft July 10, 2015).


Calculus provides a language for expressing ideas like rates of change using the expression’s derivative \(\frac{df(x)}{dx}=\lim _{\epsilon \rightarrow 0}\frac{f(x+\epsilon) -f(x)}{\epsilon }\) or, say, the probability of something as the proportion of an area under a curve, found by integration \(\int_{x=1}^{n}{x}\;dx\). The Deriv package in R provides tools to differentiate expressions analytically. Let’s take the derivative of \((x^2+3)^7\) in R,

d1 <- Deriv(expression((x ^ 2 + 3) ^ 7), 'x')
## expression(14 * (x * (3 + x^2)^6))

Notice that Deriv automatically applied the chain rule, \(F'(x)=f'(g(x))g'(x)\). We can evaluate the derivative at a specific point too,

eval(d1, list(x=1))
## [1] 57344

Symbolic integration in R requires a symbolic algebra package, like Ryacas, the R interface to Yet Another Computer Algebra System. Let’s integrate the expression \(\int{2x^2}\;dx = \frac{2}{3}x^3 + c\) in R,

yacas('Integrate(x) 2 * x ^ 2')
## expression(2 * x^3/3)

If without an analytical solution, as happens frequently, we can numerically differentiate and integrate in R, too.

The basics of Calculus have been compactly described (200 pages) in Kleppner’s Quick Calculus (Wiley 1985), which I recommend for anyone needing an, ahem, quick recall of the subject. Along with teaching the basics, it provides a very short (10 pages) summary (pages 208-220), and a short list of worked-out integrals and derivatives (pages 254-256). The basics are also taught in Thompson’s Calculus Made Easy Second Edition (Macmillan 1914). For additional insight, many have learned the workings of Calculus from the umbiquitous textbook by Stewart, Calculus, Early Transcendentals Eighth Edition (Cengage Learning 2016).

In the language of linear algebra and calculus, we can touch upon uncertainty of events — i.e., probability — and statistically describe them.


We can think of probability as a mathematical language for communicating about uncertain events to aid deduction. Our rules of probability include \(0 \leq Pr(X) \leq 1\); \(Pr(\Omega)=1\) where \(\Omega\) is the entire sample space of possible events; and the odds \(A = \frac{Pr(X)}{1-Pr(X)}\). We can work with probabilities using the sum rule, written in set notation,

\[P(A\cup B)=P(A)+P(B)-P(B \cap A)\] and the product rule,

\[P(A \cap B)=P(A)P(B|A)=P(B)P(A|B)\] where \(P(A|B)\) means the conditional probability that element \(A\) occurs given that element \(B\) occurs. We say \(A\) and \(B\) are independent if and only if \(P(A \cap B) = P(A)P(B)\), which means \(P(A|B)=P(A)\) and \(P(B|A)=P(B)\). From these rules, we get Bayes’ rule \(P(A|B) = \frac{P(A)P(B|A)}{P(B)}\). Using a tree diagram, we can show conditional probabilities on edges and total probabilities on nodes. In R using the package igraph,


g <- graph.tree(n = 2 ^ 3 - 1, children = 2)
node_labels1 <- c("", "P(A)", "P(A')", "P(AB)", "P(AB')", "P(A'B)", "P(A'B')")
edge_labels1 <- c("P(A)", "P(A')", "P(B|A)", "P(B'|A)", "P(B|A')", "P(B'|A')")

par(mar = c(0.1, 0.1, 0.1, 0.1), mfrow = c(1, 2))

plot(g,                                 # plot tree with probability notation
     layout = layout_as_tree,           # draw graph as tree
     vertex.size = 24,                  # node size
     vertex.color = '#C4D8E2',          # node color
     vertex.label = node_labels1,       # node labels
     vertex.label.cex = .6,             # node label size
     vertex.label.family = 'Helvetica', # node label family
     vertex.label.color = '#000000',    # node label size
     edge.color = '#EEEEEE',            # edge color
     edge.width = 6,                    # edge width
     edge.label = edge_labels1,         # edge labels
     edge.label.cex = .6,               # edge label size
     edge.label.family = 'Helvetica',   # edge label family
     edge.label.color = '#000000',      # edge label color
     asp = .6                           # aspect ratio of plot

node_labels2 <- c("", "0.6", "0.4", "0.36", "0.24", "0.24", "0.16")
edge_labels2 <- c("0.6", "0.4", "0.6", "0.4", "0.6", "0.4")

plot(g,                                 # plot tree with example values
     layout = layout_as_tree, 
     vertex.size = 24, 
     vertex.color = '#C4D8E2', 
     vertex.label = node_labels2,
     vertex.label.cex = .6, 
     vertex.label.family = 'Helvetica',
     vertex.label.color = '#000000', 
     edge.color = '#EEEEEE', 
     edge.width = 6,
     edge.label = edge_labels2, 
     edge.label.cex = .6, 
     edge.label.family = 'Helvetica',
     edge.label.color = '#000000', 
     asp = .6 

Other concepts include discrete and continuous random variables (e.g., observing pairs of coin tosses, we can use random variable \(\boldsymbol{X}\) to assign outcomes to numbers, thus \(X(HH) = 2,X(HT) = X(TH) = 1,X(TT) = 0\)). The language of probability also includes probability mass functions (PMF), cumulative distribution functions (CDF), and various distributions.


The bernoulli distribution represents the frequency of successes in a given number of trials. It is just a special case of a binomial distribution (the expression of which we opened this paper) where the number of observations \(N\) is \(1\) and, thus, the possible outcomes are \(y \in [0,1]\). Let’s simulate coin tosses in R by randomly sampling from possible events in the distribution space,

set.seed(TRUE)  # for reproducibility

flips <- sample(x = c("HH", "TT", "HT", "TH"),    # sample space
                size = 1000,                      # number of trials
                prob = c(0.25, 0.25, 0.25, 0.25), # probability of each event
                replace = TRUE)                   # replace the coins each time

flips <- factor(x = flips, labels = c("HH", "TT", "HT", "TH"))
##  HH  TT  HT  TH 
## 248 276 232 244

Using our presumed probability of obtaining two heads on any given flip as .25, let’s calculate the probability of two heads occuring exactly 248 times out of our 1000 draws, first expressly coding the binomial function, and then using R’s built-in function.

p1 <- choose(1000, sum(flips == 'HH')) * .25 ^ 248 * (1 - .25) ^ (1000 - sum(flips == 'HH'))
## [1] 0.02889194
p2 <- dbinom(sum(flips=='HH'), 1000, .25)
## [1] 0.02889194

We can test to see that both methods are equal:

all.equal(p1, p2)
## [1] TRUE

R offers several functions for common distributions: a probability mass function (PMF) for discrete variables or probability density function (PDF) for continuous variables, a continuous distribution function (CDF), and a function to draw random values from the distribution. R’s naming convention for these functions begin with a letter describing the function type — p for a CDF, d for a PMF/PDF, and r for a random generation function — followed by letters identifying the distribution. Thus, the above dbinom() provides the PMF for the binomial. Let’s apply these and review them in a graph. For a binomial distribution with a 50 percent probability of success in each of 40 observations of that many draws, for example, we can graph its PMF, CDF, and a histogram of these random draws,

require(ggplot2)                               # load plotting library

x <- seq(1, 40, 1)                             # create a sequence of values

df <- data.frame(value = x,                    # organize values, cdf, pdf
                 cdf = pbinom(x, length(x), .5),
                 pmf = dbinom(x, length(x), .5),
                 draws = rbinom(length(x), length(x), .5))

ggplot(data = df) +                            # create plot object

  # plot density of draws
  geom_histogram(aes(x = draws,                
                     y = ..density..), 
                 position = 'identity',          
                 binwidth = 1,
                 color = 'white',
                 fill = 'gray70') +
  # plot the CDF
  geom_line(aes(x = value,                     
                y = cdf),
            color = '#91A5AF') + 
  # plot the PMF
  geom_line(aes(x = value,                     
                y = pmf)) +
  # remove default grid lines
  theme(panel.grid.major = element_blank(),    
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  # add labels to plot elements
  annotate('text', x = 25, y = 0.8, label = 'CDF', color = '#91A5AF') +
  annotate('text', x = 30, y = 0.1, label = 'PMF', color = 'black') +
  annotate('text', x = 21, y = 0.25, label = 'DRAWS', color = 'gray70') +
  # add labels to axes
  labs(x = "Successes of 40\ndraws in 40 trials",                            
       y = "Probability")

(Recall that a PDF, unlike a PMF, does not represent probability; it represents the slope of the CDF.) Other common (and interrelated) distributions encountered include the multinomial dmultinom(), normal or gaussian dnorm(), cauchy dcauchy(), student-t dt(), poisson dpois(), negative binomial dnbinom(), exponential dexp(), gamma dgamma(), beta dbeta(). The LKJ correlation matrix distribution is dlkjcorr().

Some probability is reviewed in chapter three of Deep Learing. More briefly, we get a quick reference on probability from Chen, et al. Probability Cheatsheet, (2015), which closely follows Blitzstein’s Introduction to Probability (CRC Press 2014). A clear and concise statement on probability is found in, believe it or not, Feynman’s Chapter 6 Probability in Lectures on Physics, Volume I Millennium Edition (Basic Books 2010). Another helpful review is found in Murphy’s Machine Learning, A Probablistic Perspective, Chapter 2 Probability (MIT Press 2012).


Borrowing from Gelman and Hill’s Data Analysis Using Regression and Multilevel/Heirarchical Models (Cambridge University Press 2007), linear “regression can be used to represent relationships between variables,” and “is a method that summarizes how the average values of a numerical outcome variable vary over subpopulations defined by linear functions of predictors.” With mathematical notation, we can represent Gelman’s description as \(y_i = \boldsymbol{X}_i \boldsymbol{\beta}+\epsilon = \beta_1\boldsymbol{X}_{i1}+\dots+\beta_k\boldsymbol{X}_{ik}+\epsilon\), for \(i=1, \dots, n\), where errors \(\epsilon_i\) are normally distributed with mean \(0\) and standard deviation \(\sigma\).

Least squares

Many methods are available to find \(\beta\) coefficients that provide the “best fit” to the data, depending on how we define the quality of fit. Commonly, we measure how we approximate \(\boldsymbol{Y}_i\) with \(\boldsymbol{X}_i \boldsymbol{\beta}\) in terms of the squared difference \(\boldsymbol{Y}_i-(\boldsymbol{X}_i \boldsymbol{\beta})^2\). More specifically, we measure the quality of our approximation globally using the loss function,

\[ L=\sum\limits_{i=1}^{n}{(\underbrace{Y_i}_{actual} -\underbrace{(\beta_1\boldsymbol{X}_{i1}+\dots+\beta_k\boldsymbol{X}_{ik})}_{estimate} )^2} \longrightarrow \text{minimize over } \boldsymbol{\beta} \] minimizing (a type of optimizing) the measure over all values of \(\boldsymbol{\beta}\). We call the minimized, linear expression the least squares line. Note that we can think of \(\boldsymbol{X}\) as either a random variable (bayesian view) or fixed (frequentist view). First, we’ll consider \(\boldsymbol{X}\) fixed, leaving \(\epsilon\) as random. In the simplest case, \(\epsilon\) is assumed to have normal distribution \(N(0, \sigma^2)\). By minimizing our estimate, we find the maximum likelihood estimate of \(\epsilon\), thus

\[ \hat{\sigma}^2=\frac{1}{n}\sum{(\boldsymbol{Y}_i-(\boldsymbol{X}_i \boldsymbol{\beta}))^2} \] One measure of fit is \(r^2 = \frac{SSR}{SST}\) where SSR is the regression sum of squares and quantifies how far the estimated sloped regression line \(\hat{y}_i\) is from the horizontal sample mean \(\bar{y}\), and SST is the total sum of squares and quantifies how much \(y_i\) vary around their mean \(\bar{y}\). Mathematically,

\[r^2=\frac{\text{SSR}}{\text{SST}}=\frac{\sum\limits_{i=1}^{n}{(\hat{y}_i - \bar{y})^2} }{\sum\limits_{i=1}^{n}{(y_i - \bar{y})^2} }\]

With a bit more math, we get the minimizing coefficients,

\[ \boldsymbol{\beta}=\frac{\sum\limits_{i=1}^{n}{(x_i- \bar{x})(y_i- \bar{y})} }{\sum\limits_{i=1}^{n}{(x_i- \bar{x})^2} } \]

Let’s represent \(\text{SSR}\), \(\text{SST}\), and \(\sigma\) visually in R using toy data,

# prepare data and linear model
# -----------------------------------------------------------------------------
# save toy baseball data into object
df <- subset(plyr::baseball, year == 2007 & ab > 100)

# use basic linear model
mod <- lm(hr ~ ab, data = df)

# add fitted values to data
df <- transform(df, Fitted = fitted(mod))

# Create plot for SSR
# -----------------------------------------------------------------------------
SSR <- ggplot(df, aes(x = ab, y = hr)) + 

  # plot yhat - ybar
  geom_segment(aes(x = ab, 
                   y = mean(hr),
                   xend = ab, 
                   yend = Fitted),
               size = .2,
               color = '#888888') +

  # plot observations as points
  geom_point(color = 'gray85') +

  # plot linear regression line
              formula = y ~ x,
              se = FALSE,
              color = 'black',
              size = 0.5) + 
  # include origin for proper reference point in plot and to show scale
  scale_x_continuous(limits = c(80, 550)) + 
  scale_y_continuous(limits = c(0, 40)) +
  # plot mean of samples ybar
  geom_hline(aes(yintercept = mean(hr)),
             size = .5,
             color = 'black') +
  # remove all default decoration                
  theme_void() +
  # add annotations for title, yhat, ybar
  annotate("text", x = 100, y = 30,
           label = 'italic(hat(y))[i]-italic(bar(y))',
           parse = T, hjust = 0) +
  annotate("text", x = 500, y= 22, 
           label = 'bolditalic(hat(y))', 
           parse = T, hjust = 0) +
  annotate("text", x = 100, y= 14, 
           label = 'italic(bar(y))', 
           parse = T, hjust = 0)

# Create plot for SST
# -----------------------------------------------------------------------------
SST <- ggplot(df, aes(x = ab, y = hr)) + 

  # plot y - ybar
  geom_segment(aes(x = ab, 
                   y = hr,
                   xend = ab, 
                   yend = mean(hr)),
               size = .2,
               color = '#888888') +

  # plot observations as points
  geom_point(color = 'gray30') +

  # plot linear regression line
              formula = y ~ x,
              se = FALSE,
              color = 'gray85',
              size = 0.5) + 
  # include origin for proper reference point in plot and to show scale
  scale_x_continuous(limits = c(80, 550)) + 
  scale_y_continuous(limits = c(0, 40)) +
  # plot mean of samples ybar
  geom_hline(aes(yintercept = mean(hr)),
             size = .5,
             color = 'black') +
  # remove all default decoration                
  theme_void() +
  # add annotations for title, yhat, ybar
  annotate("text", x = 100, y = 30,
           label = 'italic(y)[i]-italic(bar(y))',
           parse = T, hjust = 0) +

  annotate("text", x = 500, y= 22, 
           label = 'bolditalic(hat(y))', 
           parse = T, hjust = 0) +
  annotate("text", x = 100, y= 14, 
           label = 'italic(bar(y))', 
           parse = T, hjust = 0)

# create plot for epsilon
# -----------------------------------------------------------------------------
epsilon <- ggplot(df, aes(x = ab, y = hr)) + 

  # draw y - yhat
  geom_segment(aes(x = ab, 
                   y = hr,
                   xend = ab, 
                   yend = Fitted),
               size = .2,
               color = '#888888') +

  # plot observations as points
  geom_point(color = 'gray30') +

  # plot linear regression line
              formula = y ~ x,
              se = FALSE,
              color = 'black',
              size = 0.5) + 
  # include origin for proper reference point in plot and to show scale
  scale_x_continuous(limits = c(80, 550)) + 
  scale_y_continuous(limits = c(0, 40)) +
  # plot mean of samples ybar
  geom_hline(aes(yintercept = mean(hr)),
             size = .3,
             color = 'gray85') +
  # remove all default decoration                
  theme_void() +
  # add annotations for title, yhat, ybar
  annotate("text", x = 100, y = 30,
           label = 'italic(y)[i]-italic(hat(y))[i]',
           parse = T, hjust = 0) +

  annotate("text", x = 500, y= 22, 
           label = 'bolditalic(hat(y))', 
           parse = T, hjust = 0) +
  annotate("text", x = 100, y= 14, 
           label = 'italic(bar(y))', 
           parse = T, hjust = 0)

# plot SSR, SST, and epsilon together
# -----------------------------------------------------------------------------

grid.arrange(SSR, SST, epsilon, ncol = 3)


Our regression line above, as mentioned, resulted from an optimization method, least squares. Mathematically, optimization takes the form,

\[ \begin{aligned} \text{minimize } f_0(x)& \\ \text{subject to } f_i(x)& \leq b_i \text{, where } i = 1,\dots,m \end{aligned} \]

Gradient descent

Let’s estimate our linear regression using gradient descent for the gradient \(\nabla f(\theta)\). A gradient as the analogue of the first derivative for functions of vectors. The gradient of a function \(f\) is the matrix of partial derivatives, taken generally as one or more partial derivatives,

\[ \nabla_\theta f(\theta) \in \mathbb{R}^{m \times n} = \left[\begin{matrix} \frac{\partial f(\theta)}{\partial \theta_{11}} & \dots & \frac{\partial f(\theta)}{\partial \theta_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\theta)}{\partial \theta_{m1}} & \dots & \frac{\partial f(\theta)}{\partial \theta_{mn}} \end{matrix}\right] \] and as specific to linear regression,

\[ \nabla f(\theta) = \frac{1}{N}(\boldsymbol{y}^T - \theta \boldsymbol{X}^T)\boldsymbol{X} \] Our iterative procedure is simply,

\[ \text{Repeat until convergence:}\\ \theta := \theta - \alpha \sum{(\boldsymbol{y}^T - \theta \boldsymbol{X}^T)\boldsymbol{X}} \] Let’s code an example in R,

# generate matrix with intercept and random variable
X <- cbind(1, matrix(runif(1000,-5, 5)))
y <- X[,2] + rnorm(1000) + 3

# alpha and threshold manually chosen for toy data
gradient_descent <- function(X, y, alpha = 0.01, threshold = .0001) {
  # initialize variables
  gradient <- Inf
  i <- 1
  # initialize coefficients
  theta <- matrix(rep(0, dim(X)[2]), nrow = 2)

  # intermediate values
  cost_history <- double()
  theta_history <- list()

  # Repeat until convergence
  while(abs(max(gradient)) > threshold) {
    # update theta with gradient each iteration
    gradient <- ( t(X) %*% (X %*% theta - y) / length(y) )
    theta <- theta - alpha * gradient
    # store theta and cost (sse) each iteration
    cost_history[i] <- sum((X %*% theta - y) ^ 2) / (2 * length(y))
    theta_history[[i]] <- theta
    i <- i + 1
  cost_history <- as.array(cost_history)
  theta_history <- matrix(unlist(theta_history), ncol = 2, byrow = T)
  return(list(cost = cost_history,
              theta = theta_history))

gd <- gradient_descent(X, y)

Plotting each iteration of \(\theta\) from our algorithm, along with the data,

ggplot() + 
  # plot simulated data
  geom_point(aes(x = X[,2], y = y), color = 'gray90') +

  # plot intermediate gradients
  geom_abline(aes(slope = gd$theta[,2], 
                  intercept = gd$theta[,1]),
              color = 'gray', 
              alpha = .3) +

  # plot starting gradient
  geom_abline(aes(slope = gd$theta[1, 2], 
                  intercept = gd$theta[1, 1]),
              color = 'gray50') +

  # plot final gradient
  geom_abline(aes(slope = gd$theta[nrow(gd$theta),2], 
                  intercept = gd$theta[nrow(gd$theta),1]),
              color = '#000000') +
  # clean plot
  theme(panel.grid.major = element_blank(),    
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  # annotations
  annotate("text", x = 2.9, y = -.5,
           label = 'Initial Gradient',
           parse = F, hjust = 0,
           color = 'gray50') +

  annotate("text", x = 3, y = 7,
           label = 'Final Gradient',
           parse = F, hjust = 0,
           angle = 25,
           color = '#000000') +
  # add labels to axes
  labs(x = 'Theta',                            
       y = 'y') 

shows the algorithm converging from an initial to final gradient. This simplified version of a gradient descent is fragile and requires manually choosing an \(\alpha\) and threshold for convergence, but demonstrates the idea. The R package numDeriv provides functions to calculate gradients grad(). Of note, our example \(\theta\) was of a single dimension. When the problem includes multiple dimensions of partial second derivatives, they may be collected together in an \(n \times n\) matrix, called a Hessian — the analogue of the second derivative — in R, hessian().

Simplex method

In the above example, we optimized over a continuous variable. Let’s look at another example where the optimal answer must be constrained to integers. For these constraints, we typically use linear programming.

In linear programming, optimization follows a common solution structure. We specify what to minimize or maximize, identify the objective function and constraints (if any, note that the simple least squares problem is a special case of optimization with no constraints). In matrix form,

\[ \text{minimize x}\underbrace{\left[\begin{matrix} c_{1} \\ \vdots \\ c_{n} \end{matrix} \right] ^\mathrm{T} \left[\begin{matrix} x_{1} \\ \vdots \\ x_{n} \end{matrix} \right] }_{\text{Objective}} \text{s.t.}\underbrace{\left[\begin{matrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \dots & a_{mn} \end{matrix} \right] \left[\begin{matrix} x_{1} \\ \vdots \\ x_{n} \end{matrix} \right] \geq\left[\begin{matrix} b_{1} \\ \vdots \\ b_{n} \end{matrix} \right]}_{\text{First constraint}},\underbrace{\left[\begin{matrix} x_{1} \\ \vdots \\ x_{n} \end{matrix} \right] \geq 0}_{\text{Second constraint}} \]

Or as restated,

\[ \begin{aligned} \text{min } z &= \boldsymbol{c}\boldsymbol{x} \\ \text{subject to } \boldsymbol{A}\boldsymbol{x} &= \boldsymbol{b} \\ \boldsymbol{x} &\geq 0 \end{aligned} \] The number of functional constraints may be less than the number of variables (i.e., \(n < m\)); thus, we separate \(\boldsymbol{A}=[\boldsymbol{B}, \boldsymbol{N}]\) where \(\boldsymbol{B}\) is a feasible basis. We re-write our problem as,

\[ \begin{aligned} \text{min } \boldsymbol{z} &= \boldsymbol{c}_B \boldsymbol{x}_B + \boldsymbol{c}_N \boldsymbol{x}_N \\ \text{subject to } \boldsymbol{B} \boldsymbol{x}_B + \boldsymbol{N} \boldsymbol{x}_N &= \boldsymbol{b} \\ \boldsymbol{x}_B, \boldsymbol{x}_N &\geq 0 \end{aligned} \]

Solving for \(\boldsymbol{x}_B\), \(\mathbb{I} \boldsymbol{x}_B = \boldsymbol{B}^{-1} \boldsymbol{b} - \boldsymbol{B}^{-1} \boldsymbol{N} \boldsymbol{x}_N\). If \(\boldsymbol{x}_B \geq 0\) (i.e., a feasible solution), we set \(\boldsymbol{x}_N = 0\) (because any other number would increase \(\boldsymbol{z}\)), and substitute both \(\boldsymbol{x}_B\) and \(\boldsymbol{x}_N\) into our objective function, solving for \(\boldsymbol{z}\).

We iterate through each possible combination of bases such that the basis that minimizes \(\boldsymbol{z}\) given the constraints is our optimum. Consider the following example,

\[ \begin{aligned} \text{minimize } z &= 3x_1+x_2+9x_3+x_4 \\ \\ \text{subject to } x_1 + 2x_3 + x_4 &= 4\\ x_2 + x_3 - x_4 &= 2\\ x_i &\geq 0 \end{aligned} \]

In R,

# create a function for our "brute force" simplex method
simplex <- function(A, b, c_x) {
  # number of basis (m) and non-basis (n) variables
  m <- dim(A)[1]
  n <- dim(A)[2] - m
  # set up combinations of basis variables
  combs <- t( combn(1:(m + n), n) )
  # loop through possible combinations to find lowest cost
  z <- array()
  solution <- array()
  for( i in 1:choose(m + n, m) ) {
    # get basis variables and related objective constraints
    B <- A[ 1:m, combs[i,] ]
    # solve x_B
    x_B <- solve(B) %*% b
    # If all x_B >= 0 (feasible solution), find cost
    if( all(x_B >= 0) ) {
      c_B <- c_x[ combs[i,] ]
      # get non-basis variables and related objective constraints
      N <- A[ 1:m, setdiff(1:(m + n), combs[i,]) ]
      c_N <- c_x[ setdiff(1:(m + n), combs[i,]) ]
      # set non-basis variables to zero
      x_N <- rep(0, n)
      # solve for z in objective function
      z[i] <- c_B %*% solve(B) %*% b + 
              ( c_N - c_B %*% solve(B) %*% N ) %*% x_N
      # keep minimum cost solution
      if ( z[i] == min(z) ) solution <- x_B 
    else z[i] <- Inf
  # return NULL if no solution found
  if( min(z) == Inf ) { print("No soluion"); return(NULL) }
  # handle multiple solutions
  if( sum( z == min(z) ) > 1 ) { 
    print("Multiple solutions"); 
  # return unique solution
  x <- rep(0, m + n)
  x[ combs[which.min(z),] ] <- solution
  s <- list(x = x, 
            z = min(z))

# constraint matrix: Ax = b
A <- rbind(c(1, 0, 2, 1),
           c(0, 1, 1, -1))

# right hand side
b <- c(4, 2)

# objective function
c_x <- c(3, 1, 9, 1)

# find an optimum
simplex(A, b, c_x)
## $x
## [1] 0 6 0 4
## $z
## [1] 10

Our above version of a simplex method is brute force. We perform mutiple matrix operations for each possible combination of bases choose(m + n, m) but this gets large quick. A bases with 15 of 30 variables, for example, require the above function to iterate 155,117,520 times! Various packages are available with much more stable and efficient linear programming functions, including lp() in the package lpSolve,

x <- lpSolve::lp(direction = "min", 
                 objective.in = c_x,
                 const.mat = A,
                 const.dir = c("=", "="),
                 const.rhs = b)
## [1] 0 6 0 4

An introductory-level review of optimization, beginning with discrete variables and the simplex method is found in Pedregal, Introduction to Optimization (Springer 2004). Optimization over continuous variables is well-presented with modest mathematics in Nocedal & Wright’s Numerical Optimization (Springer 2006). Rigorous treatments are taught in Pinkham, Analysis, Convexity, and Optimization (Draft September 4, 2014); Boyd & Vandenberghe, Convex Optimization (Cambridge University Press 2014); Bertsimas & Tsitsiklis Introduction to Linear Optimization (Athena Scientific 1997); and Sra, et al. Optimization for Machine Learning (MIT Press 2014).


Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution.

Deep Learning, Chapter 4: Numerical Computation.

Algorithms — the logical steps or procedures we implement to (re)produce a result — are, as we’ve touched upon, described by mathematics and logic. Logic such as ifthenelse …, dountil …, and …, or …, etc., are given precise syntax in R, which has a rich set of functions for specific logical procedures and can be expanded, as we’ve seen, by including libraries or packages of ever more functions. We can also code our own. We find a clear guide to R’s functional programming language in Mailund’s Functional Programming in R (Apress 2017).

General programming best practices (no less applicable for R) are timelessly described in Hunt, The Pragmatic Programmer: From Journeyman to Master (Addison-Wesley 2000). Basic implementation of algorithms in R are explained well in Braun’s A First Course in Statistical Programming with R (Cambridge University Press 2009) and with more detail in Cichosz’s Data Mining Algorithms (Wiley 2015). For algorithms generally, rigorous treatments are taught in Sedgewick’s Algorithms Fourth Edition (Addison-Wesley 2016) and Cormen’s Introduction to Algorithms, Third Edition (MIT Press 2009).

Data and model visualization

In a paper about the languages behind statistical learning, why discuss data visualization?

[T]he data display is the model, as it exposes the sources of variation and describes each source’s contribution to the total.

— Donahue, R. Fundamental Statistical Concepts in Presenting Data (July 2011).

Visual representation of data offers insight into our review of data, inspection of models, and conclusions drawn. R’s base graphics enable coding of data visualizations (R Base Graphics Cheatsheet by Joyce Robbins), which are thoroughly reviewed by Murrell in R Graphics Second Edition (CRC Press 2011). The base functions, however, lack consistency across functions. Just as mathematics provide a formal language for describing and modeling data, a grammar of graphics have been developed to formalize data visualization. See, Wilkinson’s The Grammar of Graphics Second Edition (Springer 2005). Such a grammar has been applied in the R package ggplot2 — as used above. The package’s reference manual is available at ggplot2.tidyverse.org. An example of the seven general layers of the grammar of graphics applied using ggplot2, and as used in the above plots is,

ggplot(data = <DATA>, aes(<MAPPINGS)) + # Data layer, global aesthetics mapping
  <GEOM_FUNCTION>(                      # Geometrics
     mapping = aes(<MAPPINGS>),         # Specific aesthetics mapping
     stat = <STAT>,                     # Stastistical layer
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +               # Coordinates, axes scaling
  <FACET_FUNCTION> +                    # Facetting (subplots)
  <THEME>                               # Non-data ink, annotations

From this code framework, we replace <...> in the template with the type of data or function implied. This framework is flexible, and the options vast. We are not limited, for example, to any single <GEOM_FUNCTION>, which map the data to specific visual representation; each geom adds a visual layer. Reviewing the code for our plots above reveals that we can, for example, chain consecutive plotting functions together with the + operator, each time adding a plot layer.

While the above references explain the language of coding visualizations of data, other references explain why particular visual representations may be effective for data exploration. Unwin answers the why, blending theory and application with R in Graphical Data Analysis with R (CRC Press 2015), and Murray teaches interactive graphics using d3 (which R implements as htmlwidgets) in Interactive Data Visualization for the Web, 2nd Edition (O’Reilly 2017). Canonical references include Tukey’s Exploratory Data Analysis (Addison-Wesley 1977), Cleveland’s The Elements of Graphing Data (Wadsworth 1985) and Visualizing Data (Hobart Press 1993).

Related work considers the effect of certain visualizations in aiding or obscuring the intended messages and explorations. These include Tufte’s The Visual Display of Quantitative Information (Graphics Press 2001). Knaflic provides a modern retelling of Tufte’s ideas in Storytelling with Data (Wiley 2015). Both borrow from Bertin’s Semiology of Graphics (University of Wisconsin Press 1983). Data visualization is communication, and that context is usefully set forth by Jean-luc Doumont in Trees, maps, and theorems: Effective communication for rational minds (Principiae 2009). Updated science of visual perception is in Ware’s Information Visualization: perception for design, Third Edition (Morgan Kaufmann 2013).

Organizing Data

Statistical models and algorithms require the data be particularly structured or organized. Some may require a wide format, for example, while other models require data be in long format. Or they may require the data be sorted or grouped, labeled as missing, or otherwise transformed, such as working with strings and data formats. Along with R’s data structures for vectors and matrices, it provides data frames (data.frame()) and lists (list()) (both of which we used above), which are invaluable for data organization. R’s base functions and add-on packages — notably plyr, dplyr, tidyr, reshape, stringr — have functions to organize and transform.

We can quickly reference the needed data munging functions in RStudio’s “cheat sheets” or learn more about them from Grolemund and Wickham in R for Data Science (O’Reilly 2017). The two authors also recommend a systematic workflow, to which I’ll add,

\[ \textrm{Gather} \rightarrow \textrm{Import} \rightarrow \textrm{Tidy} \rightarrow \textrm{Repeat:}\left[ \textrm{Transform} \rightleftarrows \textrm{Visualize} \rightleftarrows \textrm{Model} \right] \rightarrow \textrm{Communicate} \]


The more we understand these languages and tools, the more success we will have in studying and implementing new tools of prediction and inference, as taught in the references mentioned in opening this paper and others like MacKay’s Information Theory, Inference, and Learning Algorithms (Cambridge University Press 2003) and Kuhn’s Applied Predictive Modeling (Springer 2013). At least until A.I. explains itself.