Machine learning has emerged as one of the most powerful and widely applied areas of modern mathematics and computer science. This course provides a rigorous mathematical foundation for understanding how machines learn from data, bridging the gap between practical algorithms and their theoretical underpinnings. Rather than treating machine learning as a collection of disconnected techniques, we develop a unified framework centered on the principle of empirical risk minimization—the idea that we can learn from finite samples by minimizing the error observed on training data while controlling overfitting.
The course unfolds through two complementary lenses: the algorithmic and the theoretical. We begin with concrete, widely-used methods—decision trees and random forests—that illustrate core machine learning principles in accessible form. We then pivot to statistical learning theory, which provides the mathematical language for understanding generalization, sample complexity, and the fundamental limits of learning from data. This theoretical foundation then supports our analysis of more sophisticated optimization techniques, examining how gradient-based methods enable efficient learning on large datasets. Finally, we return to modern algorithms—boosting methods and neural networks—which we can now understand not merely as engineering solutions but as instances of deeper mathematical principles.
Throughout the course, a central tension animates our inquiry: the gap between what we can compute and what we can theoretically guarantee. Empirical risk minimization is conceptually simple, yet its practical implementation raises deep questions about optimization, regularization, and the relationship between training error and test error. By studying both the algorithms practitioners use and the theory that explains their success, you will develop the mathematical maturity to reason about new methods, understand their limitations, and recognize when theoretical results do and do not apply to real-world scenarios.
# Introduction
This chapter lays the mathematical foundations of supervised machine learning. The central question is: given observed data, how do we construct a rule that predicts outcomes well on future, unseen observations? The course answers this by placing prediction in a probabilistic framework, identifying the theoretically optimal predictor, and then studying how to approximate it from finite data. Three ideas run through everything that follows — risk minimisation, the bias-variance tradeoff, and the tension between approximation and estimation.
## The Statistical Prediction Framework
The starting point is a pair of random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ with joint distribution $P_0$. Here $X \in \mathcal{X} = \mathbb{R}^p$ is the **input** (also called predictors or variables), and $Y \in \mathcal{Y}$ is the **output** or response. The setup is general enough to capture two canonical tasks:
- **Classification**: $\mathcal{Y} = \{-1, 1\}$. For example, $X$ might encode disease risk factors (BMI, age, genetic indicators) drawn from a population, and $Y \in \{-1, 1\}$ might encode disease status.
- **Regression**: $\mathcal{Y} = \mathbb{R}$. For example, $X$ might encode the number of bedrooms and other attributes of a house, and $Y$ might be its sale price.
To predict $Y$ from $X$ we use a measurable function $h : \mathcal{X} \to \mathcal{Y}$, which the machine learning literature calls a **hypothesis**. In the classification setting, $h$ is also called a classifier.
[definition: Loss Function]
Without specifying what constitutes a prediction error, the optimisation problem has no content. A loss function is the mathematical translation of the qualitative notion "how wrong" a prediction is.
A **loss function** is a measurable map $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ that quantifies the cost of predicting $\hat{y}$ when the true response is $y$.
Two choices dominate the course:
- **0-1 loss** (classification):
\begin{align*}
\ell(\hat{y}, y) = \mathbb{1}_{\{\hat{y} \neq y\}},
\end{align*}
which is 1 when the prediction is wrong and 0 when it is correct.
- **Squared error loss** (regression):
\begin{align*}
\ell(\hat{y}, y) = (\hat{y} - y)^2.
\end{align*}
Unless otherwise stated, squared error is the default in regression.
[/definition]
Given a loss function, the quality of a hypothesis $h$ is measured by how large its loss is on average over the distribution $P_0$.
[definition: Risk]
The **risk** of a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ is
\begin{align*}
R(h) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x), y)\, dP_0(x, y) = \mathbb{E}[\ell(h(X), Y)].
\end{align*}
The goal of statistical learning is to find $h$ that makes $R(h)$ as small as possible.
Note: this definition of risk differs from the frequentist risk familiar from estimation theory.
[/definition]
The risk is well-defined for any deterministic $h$. When $h$ is data-dependent (i.e. a random function trained on observations), $R(h)$ itself becomes a random variable, an important subtlety that recurs throughout the course.
## Bayes Classifiers and Regression Functions
The theoretical ideal is the hypothesis that minimises the risk over all measurable functions, with no restriction. This optimal predictor has a concrete description in both the classification and regression settings.
### The Regression Setting
[quotetheorem:1937]
This follows from the orthogonality property of conditional expectation (see Property (iv) in the conditional expectation review below): for any $g : \mathcal{X} \to \mathbb{R}$, the squared-error decomposition gives $\mathbb{E}[(Y - g(X))^2] = \mathbb{E}[(Y - h_0(X))^2] + \mathbb{E}[(h_0(X) - g(X))^2]$, so $h_0$ achieves the minimum.
Squared error is the natural loss when the task is to predict the mean, but the choice of loss changes the target. Under **absolute error loss** $\ell(\hat{y}, y) = |\hat{y} - y|$, the population minimiser is the conditional median of $Y$ given $X = x$, not the conditional mean. This matters in practice: when the conditional distribution of $Y$ given $X$ is heavy-tailed or skewed, the median is a more robust target than the mean and the corresponding predictor is less sensitive to outliers in $Y$.
### The Classification Setting
In the classification setting with 0-1 loss, the key quantity encoding the conditional distribution of $Y$ given $X$ is the posterior probability of the positive class.
[definition: Regression Function in Classification]
The **regression function** in the binary classification setting is the map
\begin{align*}
\eta : \mathcal{X} \to [0, 1], \qquad x \mapsto \mathbb{P}(Y = 1 \mid X = x).
\end{align*}
This function completely characterises the conditional distribution of $Y$ given $X = x$, since $\mathbb{P}(Y = -1 \mid X = x) = 1 - \eta(x)$.
[/definition]
The Bayes classifier uses $\eta$ to decide which class to predict: assign label $1$ if the class-1 posterior probability exceeds $1/2$.
[definition: Bayes Classifier and Bayes Risk]
A **Bayes classifier** $h_0 : \mathcal{X} \to \{-1, 1\}$ is any hypothesis that minimises the misclassification risk $R(h) = \mathbb{E}[\mathbb{1}_{\{h(X) \neq Y\}}]$ over all measurable classifiers. Its risk is called the **Bayes risk**, denoted $R^* = R(h_0)$.
[/definition]
[quotetheorem:1941]
[citeproof:1941]
The Bayes classifier is the gold standard, but it is not computable in practice: both $\eta$ and the regression function $\mathbb{E}[Y \mid X = x]$ require knowledge of the joint distribution $P_0$, which is unknown. This gap between the ideal and the achievable is the defining tension of statistical learning theory.
Two further points about the Bayes classifier deserve attention. First, the formula $h_0(x) = \mathbb{1}_{\{\eta(x) > 1/2\}}$ (relabelled for $\{0,1\}$ outputs) is specific to 0-1 loss. Under other classification losses, the optimal decision threshold shifts. Second, when $\eta(x) = 1/2$ on a set of positive probability, the Bayes risk is strictly positive even with perfect knowledge of $\eta$: there exist problems where no classifier can do better than random guessing on part of the input space. A third practical implication: any algorithm that estimates $\eta$ and then thresholds at $1/2$ is called a **plug-in classifier**. Whether plug-in classifiers can achieve the Bayes risk — and at what rate — is the question that motivates nonparametric estimation in later chapters.
[example: Bayes Classifier for Gaussian Classes with Equal Covariance]
Suppose $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \{-1, 1\}$, the prior probabilities are $\pi_1 = \mathbb{P}(Y = 1)$ and $\pi_{-1} = 1 - \pi_1$, and the class-conditional densities are Gaussian:
\begin{align*}
X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma), \qquad k \in \{-1, 1\},
\end{align*}
where $\Sigma \in \mathbb{R}^{p \times p}$ is a common (positive definite) covariance matrix. By Bayes' rule, the regression function is
\begin{align*}
\eta(x) = \mathbb{P}(Y = 1 \mid X = x) = \frac{\pi_1\, f(x \mid Y = 1)}{\pi_1\, f(x \mid Y = 1) + \pi_{-1}\, f(x \mid Y = -1)}.
\end{align*}
Taking the log-ratio and using the Gaussian density formula,
\begin{align*}
\log \frac{\eta(x)}{1 - \eta(x)} &= \log \frac{\pi_1}{\pi_{-1}} + \log \frac{f(x \mid Y = 1)}{f(x \mid Y = -1)} \\
&= \log \frac{\pi_1}{\pi_{-1}} - \frac{1}{2}(x - \mu_1)^\top \Sigma^{-1}(x - \mu_1) + \frac{1}{2}(x - \mu_{-1})^\top \Sigma^{-1}(x - \mu_{-1}).
\end{align*}
Expanding the quadratic forms and cancelling the $x^\top \Sigma^{-1} x$ terms (which appear identically in both) yields
\begin{align*}
\log \frac{\eta(x)}{1 - \eta(x)} = \log \frac{\pi_1}{\pi_{-1}} + (\mu_1 - \mu_{-1})^\top \Sigma^{-1} x - \frac{1}{2}(\mu_1 + \mu_{-1})^\top \Sigma^{-1}(\mu_1 - \mu_{-1}).
\end{align*}
This is an affine function of $x$, so the Bayes decision boundary $\{\eta(x) = 1/2\}$ is a hyperplane. The Bayes classifier assigns $h_0(x) = 1$ when this affine expression is positive and $h_0(x) = -1$ otherwise. Concretely, let $w = \Sigma^{-1}(\mu_1 - \mu_{-1})$ and $b = \log(\pi_1/\pi_{-1}) - \frac{1}{2}(\mu_1 + \mu_{-1})^\top w$. Then $h_0(x) = \operatorname{sgn}(w^\top x + b)$, which is **Fisher's linear discriminant**. When the covariance matrices differ ($\Sigma_1 \neq \Sigma_{-1}$), the $x^\top \Sigma^{-1} x$ terms no longer cancel and the Bayes boundary becomes a quadric surface — this is quadratic discriminant analysis.
[/example]
<!-- illustration-needed: the Bayes decision boundary in two dimensions — show the region {eta(x) > 1/2} and {eta(x) < 1/2} separated by the boundary curve {eta(x) = 1/2}, with scattered points from both classes on either side -->
## Conditional Expectation Review
The proofs above relied on two conditional expectation identities — the cross-term vanishing trick (which killed the mixed product in the squared-error decomposition) and the tower property (which collapsed the double expectation in the Bayes classifier proof) — that were invoked without detailed justification. This section collects the four properties used repeatedly throughout the course and provides the argument for the most important one. For random variables $Z \in \mathbb{R}$ and $W = (W_1, \ldots, W_d)^\top \in \mathbb{R}^d$ with a joint density $f_{Z,W}$ with respect to a measure $\mu$, the conditional density of $Z$ given $W$ is
\begin{align*}
f_{Z \mid W}(z \mid w) = \begin{cases} f_{Z,W}(z,w) / f_W(w) & \text{if } f_W(w) \neq 0, \\ 0 & \text{otherwise,} \end{cases}
\end{align*}
where $f_W$ is the marginal density of $W$. When $Z$ or $W$ are discrete, one works with probability mass functions instead.
Assuming $\mathbb{E}|Z| < \infty$, the **conditional expectation function** $\mathbb{E}[Z \mid W = w]$ is defined by
\begin{align*}
g(w) := \mathbb{E}[Z \mid W = w] = \int z\, f_{Z \mid W}(z \mid w)\, \mu(dz).
\end{align*}
The notation $\mathbb{E}[Z \mid W]$ denotes the random variable $g(W)$, which is a function of $W$ (not of $Z$). This is not the fully general measure-theoretic definition, but it suffices for all arguments in the course.
The four properties of conditional expectation used repeatedly are the following. Each property is the "right tool" for a specific pattern of argument: independence (i) simplifies expectations when a variable carries no information; the tower property (ii) collapses nested expectations and is the engine of the Bayes classifier proof; taking out what is known (iii) lets a $\sigma$-measurable factor slip outside the conditional expectation, which is how cross terms vanish in squared-error expansions; and the best-predictor property (iv) is the direct analytical foundation for both the regression function theorem and the bias-variance decomposition.
[quotetheorem:1122]
[citeproof:1122]
Property (iv) is the core analytical fact underlying both the regression function theorem above and the bias-variance decomposition later in this chapter.
The integrability hypothesis $\mathbb{E}|Z| < \infty$ (or $\mathbb{E}[Z^2] < \infty$ for Properties (iii) and (iv)) is not merely technical bookkeeping. When $\mathbb{E}|Z| = \infty$, the conditional expectation $\mathbb{E}[Z \mid W]$ may fail to be integrable, and the tower property $\mathbb{E}[\mathbb{E}[Z \mid W]] = \mathbb{E}[Z]$ breaks down because both sides are infinite or undefined. For a concrete failure, take $Z$ to be a Cauchy random variable and $W = Z$: then $\mathbb{E}[Z \mid W] = Z$, but $\mathbb{E}[Z]$ does not exist, so the tower property cannot even be stated. Similarly, Property (iv) requires $\mathbb{E}[Z^2] < \infty$; without this, the squared-error risk $\mathbb{E}[(Z - g(W))^2]$ is infinite for every $g$ and the minimisation problem is vacuous. The theorem also does not say anything about the smoothness or computability of $\mathbb{E}[Z \mid W]$ as a function of $W$: the conditional expectation is defined only as an $L^1$ equivalence class, and for a specific joint distribution it may be discontinuous, non-monotone, or impossible to write in closed form.
[remark: Conditional Jensen's Inequality]
Probabilistic inequalities often have conditional versions. In particular, if $f : \mathbb{R} \to \mathbb{R}$ is convex and $\mathbb{E}|f(Z)| < \infty$, then
\begin{align*}
\mathbb{E}[f(Z) \mid W] \geq f(\mathbb{E}[Z \mid W]).
\end{align*}
This conditional Jensen's inequality is used in later chapters to control risks under surrogate losses.
[/remark]
## Hypothesis Classes and the Learning Problem
Since $P_0$ is unknown, we cannot compute the Bayes classifier or regression function. A **classical statistics** approach would model $P_0$ parametrically, estimate parameters by maximum likelihood, and thereby estimate the regression function. The machine learning approach taken in this course is different: instead of modelling $P_0$, we restrict the search for $h$ to a pre-specified class of hypotheses.
[definition: Hypothesis Class]
The Bayes classifier and regression function minimise risk over the class of all measurable functions. This is computationally and statistically impossible to work with directly: an unrestricted search would overfit to any finite training sample (the empirical minimiser would simply memorise the labels). To make learning tractable, we restrict attention to a pre-specified family.
A **hypothesis class** $\mathcal{H}$ is a collection of measurable functions $h : \mathcal{X} \to \mathcal{Y}$ from which a learning algorithm selects a hypothesis. The choice of $\mathcal{H}$ encodes prior assumptions about the structure of the optimal predictor.
[/definition]
The following families of hypothesis classes appear throughout the course.
In regression, typical choices include:
- **Linear hypotheses**: $\mathcal{H} = \{h : h(x) = \mu + x^\top \beta,\; \mu \in \mathbb{R},\; \beta \in \mathbb{R}^p\}$.
- **Basis function expansions**: $\mathcal{H} = \{h : h(x) = \mu + \sum_{j=1}^d \phi_j(x)\beta_j,\; \mu \in \mathbb{R},\; \beta \in \mathbb{R}^d\}$ for fixed basis functions $\phi_1, \ldots, \phi_d : \mathcal{X} \to \mathbb{R}$.
- **Adaptive basis classes**: $\mathcal{H} = \{h : h(x) = \sum_{j=1}^d w_j \phi_j(x),\; w \in \mathbb{R}^d,\; \phi_j \in \mathcal{B}\}$ for a class $\mathcal{B}$ of functions $\phi : \mathcal{X} \to \mathbb{R}$, where the basis functions $\phi_j$ are also chosen from data.
In the classification setting one replaces the linear outputs with their sign: for instance, $\mathcal{H} = \{h : h(x) = \operatorname{sgn}(\mu + x^\top \beta),\; \mu \in \mathbb{R},\; \beta \in \mathbb{R}^p\}$. Throughout the course $\operatorname{sgn}(0) = -1$ by convention.
The training data consists of i.i.d. copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the pair $(X, Y)$. Our task is to use these $n$ observations to select $\hat{h} \in \mathcal{H}$ such that $R(\hat{h})$ or $\mathbb{E}[R(\hat{h})]$ is small. Note that once $\hat{h}$ depends on the training data, $R(\hat{h})$ is a random variable:
\begin{align*}
R(\hat{h}) = \mathbb{E}[\ell(\hat{h}(X), Y) \mid X_1, Y_1, \ldots, X_n, Y_n].
\end{align*}
## Empirical Risk Minimisation
Since $R(h)$ involves an expectation under the unknown $P_0$, we cannot minimise it directly. The natural substitute is to replace $P_0$ by the empirical distribution of the training data.
[definition: Empirical Risk]
Since the true distribution $P_0$ is unknown, we cannot evaluate or minimise $R(h)$ directly. The natural surrogate is to replace the expectation under $P_0$ with an average over the observed training pairs.
The **empirical risk** (or training error) of a hypothesis $h$ on data $(X_1, Y_1), \ldots, (X_n, Y_n)$ is
\begin{align*}
\hat{R}(h) := \frac{1}{n} \sum_{i=1}^n \ell(h(X_i), Y_i).
\end{align*}
The **empirical risk minimiser (ERM)** over a class $\mathcal{H}$ is any $\hat{h} \in \arg\min_{h \in \mathcal{H}} \hat{R}(h)$.
[/definition]
Empirical risk minimisation is the central algorithmic paradigm of the course. The key question it raises — how well does minimising the training error guarantee small generalisation error — drives the theoretical development in subsequent chapters.
[example: Ordinary Least Squares as ERM]
Consider the regression setting with $\mathcal{Y} = \mathbb{R}$, squared error loss, and the linear hypothesis class $\mathcal{H} = \{h : h(x) = \mu + x^\top \beta,\; \mu \in \mathbb{R},\; \beta \in \mathbb{R}^p\}$. Empirical risk minimisation becomes
\begin{align*}
(\hat{\mu}, \hat{\beta}) \in \arg\min_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (Y_i - \mu - X_i^\top \beta)^2,
\end{align*}
which is exactly ordinary least squares. The ERM hypothesis is $\hat{h}(x) = \hat{\mu} + \hat{\beta}^\top x$.
More generally, consider the basis-function class
\begin{align*}
\mathcal{H} = \left\{h : h(x) = \sum_{j=1}^d \phi_j(x) \beta_j,\; \beta \in \mathbb{R}^d \right\}
\end{align*}
for fixed functions $\phi_j : \mathbb{R}^p \to \mathbb{R}$. For example, when $p = 1$ one can take $\phi_j(x) = x^{j-1}$ to obtain polynomial regression. Form the design matrix $\Phi \in \mathbb{R}^{n \times d}$ with $(i,j)$-entry $\Phi_{ij} = \phi_j(X_i)$, assumed to have full column rank, and write $\phi(x) = (\phi_1(x), \ldots, \phi_d(x))^\top$ and $Y_{1:n} = (Y_1, \ldots, Y_n)^\top$. Then the ERM is $\hat{h}(x) = \hat{\beta}^\top \phi(x)$ where
\begin{align*}
\hat{\beta} = (\Phi^\top \Phi)^{-1} \Phi^\top Y_{1:n}.
\end{align*}
This is the standard formula for ordinary least squares in a basis expansion.
[/example]
The regression example shows that ERM with squared error loss and a linear hypothesis class reduces to a well-understood linear algebra problem. The classification setting is fundamentally different: replacing the squared error with 0-1 loss and the real-valued output with a discrete label turns the ERM problem from a smooth optimisation into a combinatorial one.
[example: Classification ERM — Empirical Risk Minimisation Over Linear Classifiers]
Consider binary classification with $\mathcal{Y} = \{-1, 1\}$, 0-1 loss, and the linear classifier class
\begin{align*}
\mathcal{H} = \{h : h(x) = \operatorname{sgn}(\mu + x^\top \beta),\; \mu \in \mathbb{R},\; \beta \in \mathbb{R}^p\}.
\end{align*}
The ERM minimises the training misclassification rate:
\begin{align*}
(\hat{\mu}, \hat{\beta}) \in \arg\min_{(\mu, \beta)} \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{\operatorname{sgn}(\mu + X_i^\top \beta) \neq Y_i\}}.
\end{align*}
Unlike the regression case, this is a combinatorial optimisation problem: for each candidate $(\mu, \beta)$, the objective counts discrete misclassifications. In particular, it is not differentiable in $(\mu, \beta)$, so gradient-based methods cannot be applied directly. The ERM over linear classifiers under 0-1 loss is NP-hard in general. This motivates the use of **surrogate losses** — differentiable upper bounds on the 0-1 loss such as the hinge loss or logistic loss — which are the actual objectives used by support vector machines and logistic regression. The gap between the surrogate-loss ERM and the Bayes risk under 0-1 loss is a central question in later chapters.
[/example]
## The Bias-Variance Decomposition
ERM over a large class $\mathcal{H}$ — one that contains many hypotheses and can fit the data closely — tends to produce a small empirical risk. But such a fit may not generalise well to new data, because the minimiser $\hat{h}$ reflects idiosyncrasies of the particular training sample. Conversely, a small class produces stable estimators but may not contain any hypothesis close to the truth. This tension is formalised by the bias-variance decomposition.
Let $D = (X_i, Y_i)_{i=1}^n$ be i.i.d. training data, and let $\hat{h}_D$ be the ERM trained on $D$. Define the **mean hypothesis**
\begin{align*}
\bar{h}(x) := \mathbb{E}[\hat{h}_D(x)],
\end{align*}
where the expectation is over the random training set $D$. The mean hypothesis is the average prediction at $x$ over all possible training sets of size $n$.
Applying Property (iv) of conditional expectation twice — first conditioning on $(X, D)$ and then conditioning on $X$ alone — yields the pointwise squared-error decomposition: for fixed $x$,
\begin{align*}
\mathbb{E}[(Y - \hat{h}_D(X))^2 \mid X = x] &= \operatorname{Var}(Y \mid X = x) + \mathbb{E}[(\hat{h}_D(x) - \bar{h}(x))^2 \mid X = x] + (\mathbb{E}[Y \mid X = x] - \bar{h}(x))^2.
\end{align*}
The key step uses the fact that $D$ is independent of $(X, Y)$ given $X$, so $\mathbb{E}[\hat{h}_D(X) \mid X = x] = \mathbb{E}[\hat{h}_D(x)] = \bar{h}(x)$. Taking expectations over $X$ and $D$ gives the bias-variance decomposition of the expected generalisation error.
[quotetheorem:1845]
The three terms have distinct interpretations:
- **Squared bias**: How far the average predictor $\bar{h}$ is from the true regression function. A small class $\mathcal{H}$ that does not contain a good approximation to $\mathbb{E}[Y \mid X = x]$ leads to large bias.
- **Variance**: How much the predictor varies around its mean as the training data changes. Overfitting a rich class to the noise in the training labels inflates variance.
- **Irreducible variance**: The intrinsic noise $\operatorname{Var}(Y \mid X)$ in the problem, which cannot be reduced by any choice of hypothesis.
The first two terms represent the bias-variance tradeoff: enlarging $\mathcal{H}$ typically reduces bias (the true regression function is more likely to be approximable within $\mathcal{H}$) but increases variance (the ERM is more sensitive to the particular training sample). This decomposition is specific to squared error loss — under 0-1 loss there is no clean additive decomposition of this form, and the notion of "variance" for a classifier requires separate treatment.
<!-- illustration-needed: the bias-variance tradeoff curve — show expected risk (y-axis) vs. model complexity (x-axis, e.g. number of basis functions d), decomposed into squared bias (decreasing), variance (increasing), and irreducible variance (flat baseline), with the total risk forming a U-shaped curve and the optimal complexity marked at the minimum -->
[example: Bias-Variance Decomposition for Polynomial Regression]
Consider a one-dimensional regression problem ($p = 1$) with the data-generating process
\begin{align*}
Y = \sin(2\pi X) + \varepsilon, \qquad X \sim \operatorname{Uniform}([0, 1]), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2), \quad \sigma^2 = 0.1,
\end{align*}
where $X$ and $\varepsilon$ are independent. The true regression function is $h_0(x) = \sin(2\pi x)$, and the irreducible variance is $\operatorname{Var}(Y \mid X) = \sigma^2 = 0.1$ uniformly. We fit polynomial regression of degree $d - 1$ (equivalently, $d$ basis functions $\phi_j(x) = x^{j-1}$) using ordinary least squares on $n$ training points.
**Degree $d = 1$ (constant fit).** The hypothesis class is $\mathcal{H} = \{h : h(x) = \beta_0\}$. The mean hypothesis is $\bar{h}(x) = \mathbb{E}[\hat{\beta}_0] = \mathbb{E}[\bar{Y}] = \int_0^1 \sin(2\pi x)\, dx = 0$. The squared bias is $\mathbb{E}[(\sin(2\pi X) - 0)^2] = \int_0^1 \sin^2(2\pi x)\, dx = 1/2$, which dominates the error. The variance term is $\operatorname{Var}(\bar{Y}) = \sigma^2/n$, which is small. The expected risk is approximately $0.5 + 0.1/n + 0.1$: mostly bias.
**Degree $d = n$ (interpolating polynomial).** The polynomial passes through all $n$ training points, so the empirical risk is zero. But the variance from the $\sigma^2 d/n$ formula equals $\sigma^2 = 0.1$: the estimator reproduces the full noise. Between the training points, the fitted polynomial oscillates wildly (Runge's phenomenon), so the actual integrated variance is much larger than $\sigma^2 d / n$ evaluated at the training points alone. The bias is near zero (a degree-$(n-1)$ polynomial can approximate $\sin(2\pi x)$ well on $[0,1]$), but the total risk is large because variance dominates.
**Intermediate $d$ (e.g. $d = 5$).** Five basis functions $1, x, x^2, x^3, x^4$ give a reasonable polynomial approximation to $\sin(2\pi x)$ on $[0,1]$ (the Taylor expansion of $\sin$ through order 4 already captures the main shape). The squared bias is moderate and the variance is $\sigma^2 \cdot 5/n$, which is small for $n \gg 5$. The total expected risk is close to the irreducible variance $0.1$, confirming that $d \approx 5$ sits near the optimal tradeoff point for this problem.
[/example]
### The Variance of Basis-Function ERM
To make the tradeoff concrete, consider the basis-function ERM $\hat{h}_D(x) = \hat{\beta}^\top \phi(x)$ from the example above, and introduce the conditional mean hypothesis $\tilde{h}_{X_{1:n}}(x) := \mathbb{E}[\hat{h}_D(x) \mid X_{1:n}]$ (the mean over labels only, holding the covariates fixed). Under the homoscedasticity assumption $\operatorname{Var}(Y \mid X = x) = \sigma^2$ (constant in $x$), one can compute the conditional variance at a fixed point $x$:
\begin{align*}
\mathbb{E}[(\hat{h}_D(x) - \tilde{h}_{X_{1:n}}(x))^2 \mid X_{1:n}] = \sigma^2 \phi(x)^\top (\Phi^\top \Phi)^{-1} \phi(x).
\end{align*}
The derivation uses the fact that the residual vector $Y_{1:n} - \mathbb{E}[Y_{1:n} \mid X_{1:n}]$ has conditional covariance matrix $\sigma^2 I$ given $X_{1:n}$ (this follows from the independence of the training pairs and the tower property). Averaging this variance over the training points $x = X_1, \ldots, X_n$ and applying the trace trick (using that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for compatible matrices):
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \sigma^2 \phi(X_i)^\top (\Phi^\top \Phi)^{-1} \phi(X_i) = \frac{\sigma^2}{n} \operatorname{tr}\!\left(\sum_{i=1}^n \phi(X_i)\phi(X_i)^\top (\Phi^\top \Phi)^{-1}\right) = \frac{\sigma^2}{n} \operatorname{tr}(\Phi^\top \Phi \cdot (\Phi^\top \Phi)^{-1}) = \frac{\sigma^2 d}{n}.
\end{align*}
This calculation reveals a clean quantitative form of the tradeoff: **the variance grows linearly in $d$** (the number of basis functions), while the squared bias should decrease as $d$ increases (more basis functions give a richer approximation to the truth). The ratio $d/n$ is a fundamental complexity parameter that recurs throughout the course.
## Cross-Validation
The preceding analysis shows that choosing the number $d$ of basis functions is critical. Too few yields high bias; too many yields high variance. More generally, given $m$ competing machine learning methods $\hat{h}^{(1)}, \ldots, \hat{h}^{(m)}$ — for instance, linear regressions with $1, 2, \ldots, m$ basis functions — we want to select the one with the best generalisation error.
Ideally, for each method $j$, we would minimise the **conditional risk** given the training data:
\begin{align*}
R(\hat{h}^{(j)}_D) = \mathbb{E}[\ell(\hat{h}^{(j)}_D(X), Y) \mid D],
\end{align*}
where $(X, Y)$ is a fresh observation independent of $D$. A less ambitious goal is to minimise the **expected risk** $\mathbb{E}[R(\hat{h}^{(j)}_D)]$, where we further average over the training data $D$.
Since neither quantity can be computed directly from the observed data alone, cross-validation estimates the expected risk.
[definition: $v$-Fold Cross-Validation]
We cannot use $\hat{R}(h)$ to compare models because a richer hypothesis class will always produce a smaller empirical risk — the ERM in a larger class fits the training data more closely, even if it generalises worse. We need an estimate of the expected risk on held-out data. Cross-validation constructs this estimate by repeatedly hiding part of the training set.
Partition the $n$ training observations into $v$ groups (folds) of roughly equal size. For each fold $k \in \{1, \ldots, v\}$, let $A_k \subseteq \{1, \ldots, n\}$ be the index set of observations in fold $k$, and let $D_{-k}$ denote the training data with fold $k$ removed. For each method $j$, train on $D_{-k}$ to obtain $\hat{h}^{(j)}_{-k} := \hat{h}^{(j)}_{D_{-k}}$. The **cross-validation score** is
\begin{align*}
\mathrm{CV}(j) := \frac{1}{n} \sum_{k=1}^v \sum_{i \in A_k} \ell(\hat{h}^{(j)}_{-k}(X_i), Y_i).
\end{align*}
The selected method is $\hat{j} \in \arg\min_j \mathrm{CV}(j)$, and the final hypothesis is $\hat{h}^{(\hat{j})}_D$.
[/definition]
The rationale is that for each $i \in A_k$, the pair $(X_i, Y_i)$ was not used to train $\hat{h}^{(j)}_{-k}$, so the held-out loss $\ell(\hat{h}^{(j)}_{-k}(X_i), Y_i)$ mimics the expected loss on a new observation. Precisely, for $i \in A_k$:
\begin{align*}
\mathbb{E}[\ell(\hat{h}^{(j)}_{-k}(X_i), Y_i)] = \mathbb{E}[\mathbb{E}[\ell(\hat{h}^{(j)}_{-k}(X_i), Y_i) \mid D_{-k}]].
\end{align*}
This is the expected risk in (1.7), but with a training set of size $n - |A_k|$ instead of $n$. If all folds are equal in size, $\mathrm{CV}(j)$ is an average of $n$ identically distributed terms, each with expected value equal to the expected prediction error at reduced sample size.
[remark: Bias and Choice of $v$]
Cross-validation gives a **biased** estimate of the expected prediction error at sample size $n$, because each fold trains on only $n(1 - 1/v)$ observations rather than $n$. The bias decreases as $v$ increases. At the extreme $v = n$ — **leave-one-out cross-validation** (LOOCV) — the bias is smallest, since each training set has size $n - 1$. However, LOOCV requires fitting $n$ models, which can be computationally expensive. In practice, $v = 5$ or $v = 10$ is standard: these choices substantially reduce computation while keeping the bias at an acceptable level. The cross-validation scores $\mathrm{CV}(j)$ are not independent across folds (they share training data), so the variance of the cross-validation estimator is not amenable to simple analysis.
[/remark]
The bias-variance tradeoff and cross-validation set the agenda for the rest of the course. The central unresolved question is: how should we choose $\mathcal{H}$ — its size, its structure, the basis functions — to balance approximation against estimation? Later chapters address this through structural risk minimisation, VC theory, regularisation, kernel methods, and tree-based ensembles.
Cross-validation guides us toward models with strong generalization, but understanding *why* some models generalize better requires examining their structure. Decision trees and random forests exemplify this principle—they adaptively partition the predictor space without requiring us to specify that structure in advance.
# 2. Popular Machine Learning Methods I
The preceding chapter established cross-validation as a principled way to select among competing methods. But all the methods considered so far — linear regression and its basis-function extensions — require the user to specify the collection of basis functions in advance. This chapter introduces two methods that construct their basis functions directly from the data: decision trees, which partition the predictor space into rectangular regions, and random forests, which average over many such trees to reduce variance. Both methods are nonparametric in spirit: they impose no fixed parametric form on the regression function, instead letting the data shape the approximation class adaptively.
## Decision Trees
The question motivating this chapter is: if we have decided to use basis functions of a particular form, how should we choose which basis functions to use, rather than specifying them in advance? **Decision trees** (also called regression trees in the regression context) answer this question by selecting indicator functions on data-driven rectangular regions.
Before we write down the definition, it is worth seeing concretely why pre-specified basis functions fail when the true structure is axis-aligned.
[example: Failure of Polynomial Regression on a Step Function]
Suppose the true regression function is
\begin{align*}
f^*(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 \leq 0.5, \\ 3 & \text{if } x_1 > 0.5, \end{cases}
\end{align*}
and we have $n = 8$ training observations with predictor values spread across $[0,1]^2$:
| $x_1$ | $x_2$ | $y$ |
|-------|-------|-----|
| 0.1 | 0.4 | 1.0 |
| 0.2 | 0.7 | 1.0 |
| 0.3 | 0.2 | 1.0 |
| 0.4 | 0.9 | 1.0 |
| 0.6 | 0.3 | 3.0 |
| 0.7 | 0.6 | 3.0 |
| 0.8 | 0.1 | 3.0 |
| 0.9 | 0.8 | 3.0 |
The true function depends only on $x_1$ and jumps discontinuously at $x_1 = 0.5$. Suppose we fit degree-2 polynomial regression using basis functions $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$. The least-squares fit $\hat{f}(x_1, x_2) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1^2 + \hat{\beta}_4 x_1 x_2 + \hat{\beta}_5 x_2^2$ must produce a smooth surface that cannot reproduce the jump. Evaluating near the boundary: at $(0.49, 0.5)$ the polynomial predicts approximately $1.1$, but at $(0.51, 0.5)$ it predicts approximately $1.4$ — far short of the true value of $3$. The residual sum of squares is at least $4 \times (3 - \hat{f})^2 \gtrsim 4 \times 1.56^2 \approx 9.7$ on the right-hand observations, regardless of how the polynomial coefficients are chosen, because no degree-2 polynomial can transition sharply from near $1$ to near $3$ over a tiny interval while being globally smooth. Increasing the polynomial degree helps near the boundary but introduces oscillations elsewhere (Runge's phenomenon), so the approximation error persists in a different form. A single indicator $\mathbb{1}_{\{x_1 > 0.5\}}$ perfectly recovers $f^*$ from this data with zero training error — it just cannot be expressed as any fixed-degree polynomial basis function.
[/example]
A regression tree approximates the regression function $x \mapsto \mathbb{E}[Y \mid X = x]$ using a piecewise constant function on rectangular regions.
[definition: Regression Tree]
A **regression tree** with $J$ leaves is a function $T : \mathbb{R}^p \to \mathbb{R}$ of the form
\begin{align*}
T(x) = \sum_{j=1}^J \gamma_j \mathbb{1}_{R_j}(x),
\end{align*}
where $R_1, \ldots, R_J$ are rectangular regions forming a partition of $\mathbb{R}^p$ — that is, $\bigcup_j R_j = \mathbb{R}^p$ and $R_j \cap R_k = \varnothing$ for $j \neq k$ — and $\gamma_1, \ldots, \gamma_J \in \mathbb{R}$ are coefficients.
[/definition]
The regions $R_j$ and coefficients $\gamma_j$ are fitted from training data $(X_i, Y_i)_{i=1}^n$ using a recursive binary partitioning algorithm, which we now describe.
### The Recursive Binary Partitioning Algorithm
How do we find the regions $R_1, \ldots, R_J$ and coefficients $\gamma_1, \ldots, \gamma_J$? A brute-force search over all possible partitions of $\mathbb{R}^p$ into $J$ rectangles is combinatorially infeasible: even with $n$ training points, the number of candidate rectangular partitions grows super-exponentially with $J$ and $p$. The standard resolution is a greedy algorithm that grows the tree one split at a time, at each step splitting one existing region into two sub-regions along an axis-aligned cut that most reduces the residual sum of squares (RSS).
**Algorithm (Regression Tree Fitting).**
1. Input: maximum number of regions $J$. Initialise $\hat{\mathcal{R}} = \{\mathbb{R}^p\}$.
2. Split one region in $\hat{\mathcal{R}}$ by minimising RSS. For each region $R \in \hat{\mathcal{R}}$ with index set $I := \{i : X_i \in R\}$ satisfying $|I| > 1$, and for each coordinate $j = 1, \ldots, p$, let $S_j$ be the set of midpoints between adjacent values in $\{X_{ij}\}_{i \in I}$. Find the predictor index $\hat{j}_R$ and split point $\hat{s}_R$ to minimise over $j \in \{1, \ldots, p\}$ and $s \in S_j$:
\begin{align*}
\min_{\gamma_L \in \mathbb{R}} \sum_{i \in I:\, X_{ij} \leq s} (Y_i - \gamma_L)^2 + \min_{\gamma_R \in \mathbb{R}} \sum_{i \in I:\, X_{ij} > s} (Y_i - \gamma_R)^2 - \min_{c \in \mathbb{R}} \sum_{i \in I} (Y_i - c)^2.
\end{align*}
The first two terms are the RSS on the two candidate sub-regions, and the final term is the RSS on the current region without splitting. The difference is the reduction in RSS achieved by the split.
3. Let $\hat{R}$ be the region yielding the largest RSS reduction (equivalently, the smallest value of the objective above). Form the two sub-regions
\begin{align*}
\hat{R}_L := \{x \in \hat{R} : x_{\hat{j}_{\hat{R}}} \leq \hat{s}_{\hat{R}}\}, \qquad \hat{R}_R := \hat{R} \setminus \hat{R}_L,
\end{align*}
and update $\hat{\mathcal{R}} \leftarrow (\hat{\mathcal{R}} \setminus \{\hat{R}\}) \cup \{\hat{R}_L, \hat{R}_R\}$.
4. Repeat Step 2 until $|\hat{\mathcal{R}}| = J$.
5. For the final partition $\hat{\mathcal{R}} = \{\hat{R}_1, \ldots, \hat{R}_J\}$, let $\hat{I}_j = \{i : X_i \in \hat{R}_j\}$ and set
\begin{align*}
\hat{\gamma}_j := \frac{1}{|\hat{I}_j|} \sum_{i \in \hat{I}_j} Y_i.
\end{align*}
Output $\hat{T} : \mathbb{R}^p \to \mathbb{R}$ given by $\hat{T}(x) = \sum_{j=1}^J \hat{\gamma}_j \mathbb{1}_{\hat{R}_j}(x)$.
<!-- illustration-needed: the regression tree partitioning diagram — show a two-dimensional predictor space [0,1]^2 with axis-aligned rectangular regions created by successive binary splits, with numbers indicating the order of splits, together with the corresponding binary tree diagram where internal nodes show the split condition (e.g. x_2 < 0.48) and leaf nodes show the average response and percentage of data -->
The fitted tree $\hat{T}$ is the ERM over the class of functions with the fixed regions $\hat{R}_1, \ldots, \hat{R}_J$ — it minimises the training RSS among all choices of coefficients $\gamma \in \mathbb{R}^J$. However, the regions themselves were selected greedily (each split optimises a single step without considering future splits), so $\hat{T}$ will not in general coincide with the globally RSS-minimising piecewise constant function with $J$ regions.
The tree structure itself is the key interpretable output: the fitted function can be visualised as a binary tree where each internal node encodes a split condition of the form $x_j \leq s$ and each leaf encodes the average response in the corresponding region.
### Limitations of the Rectangular Partition
The axis-aligned, piecewise-constant structure that makes decision trees interpretable and computationally tractable is also a fundamental limitation. Regions $R_j$ are always rectangles whose sides are parallel to coordinate axes. Consequently, trees struggle with two classes of functions:
- **Oblique decision boundaries.** If the true Bayes boundary is $x_1 + x_2 = 1$ (a diagonal), a decision tree must approximate it with a staircase of horizontal and vertical cuts. The approximation error does not vanish at a fixed $J$ — the bias from the staircase structure persists.
- **Smooth gradients.** The piecewise-constant approximation of a smooth function $f^*$ over a region of diameter $\delta$ incurs bias of order $\delta$ times the Lipschitz constant of $f^*$. Reducing this bias requires exponentially many leaves in the dimension $p$ (a manifestation of the curse of dimensionality), whereas methods based on smooth basis functions (polynomials, splines, kernels) can achieve much lower bias at the same model complexity.
Despite these limitations, when the true regression function genuinely has an axis-aligned, piecewise-constant structure — as in many classification problems with categorical predictors — decision trees are the natural and optimal estimator class.
[example: Concrete Tree-Building on a Small Dataset]
Consider $n = 6$ training observations with $p = 1$ predictor:
| $i$ | $X_i$ | $Y_i$ |
|-----|--------|--------|
| 1 | 1.0 | 2.1 |
| 2 | 2.0 | 2.3 |
| 3 | 3.0 | 5.8 |
| 4 | 4.0 | 6.2 |
| 5 | 5.0 | 2.9 |
| 6 | 6.0 | 3.1 |
We grow a tree with $J = 3$ leaves. The grand mean is $\bar{Y} = (2.1 + 2.3 + 5.8 + 6.2 + 2.9 + 3.1)/6 = 22.4/6 \approx 3.733$, and the total RSS without any split is $(2.1 - 3.733)^2 + (2.3 - 3.733)^2 + (5.8 - 3.733)^2 + (6.2 - 3.733)^2 + (2.9 - 3.733)^2 + (3.1 - 3.733)^2 \approx 2.67 + 2.05 + 4.27 + 6.09 + 0.69 + 0.40 = 16.17$.
**Step 1 (first split).** We evaluate the RSS reduction for each candidate split point $s \in \{1.5, 2.5, 3.5, 4.5, 5.5\}$:
- $s = 1.5$: left group $\{2.1\}$, right group $\{2.3, 5.8, 6.2, 2.9, 3.1\}$. Left RSS $= 0$; right mean $= 20.3/5 = 4.06$, right RSS $= (2.3-4.06)^2 + (5.8-4.06)^2 + (6.2-4.06)^2 + (2.9-4.06)^2 + (3.1-4.06)^2 \approx 3.10 + 3.03 + 4.58 + 1.35 + 0.92 = 12.97$. Total RSS after split: $12.97$. Reduction: $16.17 - 12.97 = 3.20$.
- $s = 2.5$: left group $\{2.1, 2.3\}$, mean $2.2$, RSS $= 0.02$; right group $\{5.8, 6.2, 2.9, 3.1\}$, mean $4.5$, RSS $= (5.8-4.5)^2 + (6.2-4.5)^2 + (2.9-4.5)^2 + (3.1-4.5)^2 = 1.69 + 2.89 + 2.56 + 1.96 = 9.10$. Total RSS: $9.12$. Reduction: $16.17 - 9.12 = 7.05$.
- $s = 3.5$: left group $\{2.1, 2.3, 5.8\}$, mean $\approx 3.40$, RSS $\approx 8.66$; right group $\{6.2, 2.9, 3.1\}$, mean $\approx 4.07$, RSS $\approx 6.85$. Total RSS: $15.51$. Reduction: $16.17 - 15.51 = 0.66$.
- $s = 4.5$: left group $\{2.1, 2.3, 5.8, 6.2\}$, mean $4.1$, RSS $= 14.54$; right group $\{2.9, 3.1\}$, mean $3.0$, RSS $= 0.02$. Total RSS: $14.56$. Reduction: $16.17 - 14.56 = 1.61$.
- $s = 5.5$: left group $\{2.1, 2.3, 5.8, 6.2, 2.9\}$, mean $\approx 3.86$, RSS $\approx 15.69$; right group $\{3.1\}$, RSS $= 0$. Total RSS: $15.69$. Reduction: $16.17 - 15.69 = 0.48$.
The largest reduction is at $s = 2.5$: we split at $X \leq 2.5$. The two regions are $\hat{R}_L = (-\infty, 2.5]$ with $\hat{\gamma}_L = 2.2$ and $\hat{R}_R = (2.5, \infty)$ with observations $\{3, 4, 5, 6\}$ having mean $4.5$ and RSS $9.10$.
**Step 2 (second split).** We now search for the best split within $\hat{R}_R = (2.5, \infty)$ over candidate points $\{3.5, 4.5, 5.5\}$:
- $s = 3.5$: sub-left $\{5.8\}$, RSS $= 0$; sub-right $\{6.2, 2.9, 3.1\}$, mean $4.07$, RSS $\approx 6.85$. Total: $6.85$.
- $s = 4.5$: sub-left $\{5.8, 6.2\}$, mean $6.0$, RSS $= 0.08$; sub-right $\{2.9, 3.1\}$, mean $3.0$, RSS $= 0.02$. Total: $0.10$.
- $s = 5.5$: sub-left $\{5.8, 6.2, 2.9\}$, mean $\approx 4.97$, RSS $\approx 6.49$; sub-right $\{3.1\}$, RSS $= 0$. Total: $6.49$.
The split at $s = 4.5$ achieves by far the greatest RSS reduction within $\hat{R}_R$. The final three leaves are
\begin{align*}
\hat{R}_1 = (-\infty, 2.5], \quad \hat{R}_2 = (2.5, 4.5], \quad \hat{R}_3 = (4.5, \infty),
\end{align*}
with fitted values $\hat{\gamma}_1 = 2.2$, $\hat{\gamma}_2 = 6.0$, $\hat{\gamma}_3 = 3.0$. The fitted tree correctly identifies that observations 3 and 4 have high responses while those with small or large $X$ are moderate. The total training RSS is $0.02 + 0.08 + 0.02 = 0.12$, compared to the unsplit baseline of $16.17$.
[/example]
### Efficient Computation of Splits
At first glance the split-finding step appears expensive: it loops over $p$ coordinates, and for each coordinate $j$ it considers $O(n)$ candidate split points, each requiring an RSS computation. With a na{\"i}ve implementation this is $O(n^2 p)$ per split. This quadratic scaling becomes a bottleneck for moderate $n$, but the inner loop over split points admits a linear-time reduction via precomputed cumulative sums, which we now derive.
Consider, for notational clarity, the first split with $I = \{1, \ldots, n\}$ and $p = 1$, and assume $X_1 < X_2 < \cdots < X_n$. The minimisation over split points reduces to finding $m \in \{1, \ldots, n-1\}$ to minimise $Q_m + P_m$, where
\begin{align*}
Q_m := \min_{\gamma_L \in \mathbb{R}} \sum_{i \leq m} (Y_i - \gamma_L)^2, \qquad P_m := \min_{\gamma_R \in \mathbb{R}} \sum_{i > m} (Y_i - \gamma_R)^2.
\end{align*}
The optima are achieved by the group means, giving
\begin{align*}
Q_m = \sum_{i \leq m} Y_i^2 - \frac{1}{m}\left(\sum_{i \leq m} Y_i\right)^2, \qquad P_m = \sum_{i > m} Y_i^2 - \frac{1}{n - m}\left(\sum_{i > m} Y_i\right)^2.
\end{align*}
Consequently,
\begin{align*}
P_m + Q_m = \sum_{i=1}^n Y_i^2 - \frac{1}{m}\left(\sum_{i \leq m} Y_i\right)^2 - \frac{1}{n-m}\left(\sum_{i > m} Y_i\right)^2.
\end{align*}
Since the first term does not depend on $m$, minimising $P_m + Q_m$ is equivalent to maximising
\begin{align*}
\frac{1}{m}\left(\sum_{i \leq m} Y_i\right)^2 + \frac{1}{n-m}\left(\sum_{i > m} Y_i\right)^2.
\end{align*}
Define $A_m := \sum_{i \leq m} Y_i$ and $B_m := \sum_{i > m} Y_i$. These satisfy the recurrences $A_{m+1} = A_m + Y_{m+1}$ and $B_{m+1} = B_m - Y_{m+1}$, so all values $A_1, \ldots, A_{n-1}$ and $B_1, \ldots, B_{n-1}$ can be computed in $O(n)$ operations by a single forward pass. Evaluating the objective at all $m = 1, \ldots, n-1$ and finding the minimiser then costs $O(n)$ in total.
A further efficiency: when extending this to later splits, only the two newly created sub-regions need their split criterion recomputed at each step. For all other regions, the stored values $\hat{j}_R$ and $\hat{s}_R$ from previous steps remain valid. Thus the cost per split is proportional to the size of only the split region, not the full dataset.
### Choosing the Number of Regions
Too many regions and the tree memorises noise in the training labels; too few and it misses genuine structure in the regression function. The number of regions $J$ controls this tension directly. A large $J$ allows the tree to fit the training data closely, but individual regions may contain only a handful of observations, making the fitted coefficients $\hat{\gamma}_j$ highly sensitive to the particular training sample. A small $J$ restricts the tree to a coarse partition and may incur large bias if the true regression function varies significantly within regions.
Cross-validation provides one approach to selecting $J$: train trees of various sizes and choose $J$ by minimising the cross-validation score. A preferred alternative in practice is to **prune**: grow a very large tree and then collapse regions together according to a cost-complexity criterion that penalises tree size. Pruning has the advantage that the sequence of candidate trees forms a nested family, which can be traversed efficiently.
## Random Forests
Decision trees as described above have two notable weaknesses:
1. **Smooth functions**: Piecewise constant estimators approximate smooth regression functions poorly, particularly in regions where the true $\mathbb{E}[Y \mid X = x]$ varies gradually with $x$.
2. **Instability**: The greedy, data-driven splitting process is sensitive to the training data. Small perturbations to the training set can lead to very different splits, resulting in a high-variance estimator.
The **random forest** algorithm is designed to address the second weakness (and incidentally helps with the first by averaging over trees with different regions). The idea is simple: instead of a single tree, construct many trees and average their predictions. Averaging reduces variance, much as averaging independent random variables reduces their spread.
To understand the motivation, we apply the Bias-Variance Decomposition theorem from Chapter 1 to the decision tree estimator.
[quotetheorem:1946]
This decomposition is structurally identical to the Bias-Variance Decomposition theorem proved in Chapter 1 under squared error loss, but the decision tree context gives the terms a concrete meaning. When $J$ is large, individual regions may contain only a handful of observations, making the fitted coefficients $\hat{\gamma}_j$ highly variable. The variance term $\mathbb{E}[\operatorname{Var}(\hat{T}_D(X) \mid X)]$ is then large. Yet the squared bias and hence $R(\bar{T})$ may be small — a large $J$ allows $\bar{T}$ to approximate the regression function accurately. Random forests attempt to estimate $\bar{T}$ and thereby achieve low variance while retaining low bias.
Two limitations of this decomposition deserve attention. First, the decomposition requires the finite second moment assumption $\mathbb{E}[Y^2] < \infty$. If $\mathbb{E}[Y^2] = \infty$ (for instance, if $Y$ has a Cauchy-like conditional distribution), the variance and bias terms are undefined and the decomposition breaks down entirely. Second, the decomposition is specific to squared error loss; it does not generalise to other loss functions without modification. Moreover, it tells us only the *existence* of a bias-variance tradeoff as $J$ varies — it does not give the *rate* at which bias decreases or variance increases with $J$. Those rates depend on the smoothness of the true regression function and the dimension $p$, and require a separate analysis.
If we had $B$ independent datasets $D_1, \ldots, D_B$, we could form the unbiased estimate $\frac{1}{B} \sum_{b=1}^B \hat{T}_{D_b}$ of $\bar{T}$. Random forests simulate this by sampling the data with replacement.
### The Algorithm
We need to produce approximately independent trees from a single dataset. Bootstrap resampling (drawing $n$ observations with replacement) gives us different training sets, but trees grown on bootstrap samples tend to be highly correlated: if one predictor is much stronger than the others, nearly every bootstrap tree will split on it first, producing nearly identical tree structures. The random forest algorithm breaks this correlation by restricting the set of predictors available at each split.
[definition: Random Forest]
Given training data $D = (X_i, Y_i)_{i=1}^n$, a tuning parameter $m_{\mathrm{try}} \in \{1, \ldots, p\}$, and a number of trees $B$, the **random forest** estimator $\hat{f}_{\mathrm{rf}} : \mathbb{R}^p \to \mathbb{R}$ is constructed as follows:
1. For each $b = 1, \ldots, B$: draw a bootstrap sample $D^*_b$ by sampling $n$ observations from $D$ with replacement. Grow a decision tree $\hat{T}^{(b)}$ on $D^*_b$, with the following modification: at each split, instead of searching over all $p$ predictors, randomly sample $m_{\mathrm{try}}$ predictors (without replacement) and choose the best split from among only these $m_{\mathrm{try}}$ variables.
2. Output the ensemble average
\begin{align*}
\hat{f}_{\mathrm{rf}}(x) := \frac{1}{B} \sum_{b=1}^B \hat{T}^{(b)}(x).
\end{align*}
[/definition]
The key departure from simply averaging trees trained on bootstrap samples is the predictor subsampling in Step 1. This subsampling serves to reduce the correlation between the individual trees $\hat{T}^{(1)}, \ldots, \hat{T}^{(B)}$, which in turn reduces the variance of the ensemble. To see why correlation matters, we derive the ensemble variance precisely.
[quotetheorem:1949]
[citeproof:1949]
The formula separates neatly into two qualitatively different components. The first term $\frac{1-\rho}{B}\sigma^2_T$ can be made arbitrarily small by taking $B$ large. However, the second term $\rho\, \sigma^2_T$ does not decrease with $B$ — it represents the irreducible variance arising from the inter-tree correlation. No matter how many trees we grow, this floor remains.
[remark: Effect of $m_{\mathrm{try}}$ on Bias and Variance]
When all $p$ predictors are available at each split ($m_{\mathrm{try}} = p$), the trees are grown as full regression trees on bootstrap samples. These trees tend to be correlated: if there is one predictor that is far more informative than the others, most trees will split on it early, resulting in high inter-tree correlation $\rho$. Reducing $m_{\mathrm{try}}$ forces some trees to use less informative predictors at early splits, breaking the synchrony and reducing $\rho$. However, restricting the predictor pool also impairs the quality of individual splits, which tends to increase the squared bias of the ensemble. The optimal $m_{\mathrm{try}}$ balances these competing effects and is typically chosen by cross-validation.
[/remark]
The remark describes qualitative tendencies; the following example puts actual numbers to them.
[example: Variance Reduction with Concrete Numbers]
Suppose $p = 2$, and trees trained on bootstrap samples have marginal variance $\sigma^2_T = 4.0$ at a fixed test point $x$. In the high-correlation regime ($m_{\mathrm{try}} = 2$, so all trees see all predictors), the dominant predictor $X^{(1)}$ is chosen at the root by nearly every tree, producing inter-tree correlation $\rho = 0.9$. With $B = 100$ trees, the Ensemble Variance Formula gives:
\begin{align*}
\operatorname{Var}(\hat{f}_{\mathrm{rf}}(x)) = \frac{1 - 0.9}{100} \cdot 4.0 + 0.9 \cdot 4.0 = 0.004 + 3.6 = 3.604.
\end{align*}
This is barely better than a single tree's variance of $4.0$; the irreducible floor $\rho\sigma^2_T = 3.6$ dominates, and adding more trees beyond $B = 100$ would reduce the ensemble variance by at most $0.004$.
In the low-correlation regime ($m_{\mathrm{try}} = 1$, so each split sees only one randomly chosen predictor), trees are forced to use $X^{(2)}$ at least half the time at the root, reducing the inter-tree correlation to $\rho = 0.3$. Now with $B = 100$:
\begin{align*}
\operatorname{Var}(\hat{f}_{\mathrm{rf}}(x)) = \frac{1 - 0.3}{100} \cdot 4.0 + 0.3 \cdot 4.0 = 0.028 + 1.2 = 1.228.
\end{align*}
The ensemble variance has dropped to $1.228$ — nearly a threefold reduction relative to the high-correlation forest, and more than threefold relative to a single tree. The irreducible floor is now $\rho\sigma^2_T = 1.2$ rather than $3.6$. The cost is that individual trees fitting with $m_{\mathrm{try}} = 1$ have slightly higher bias because some splits use the weaker predictor; suppose the squared bias increases from $0.5$ to $0.9$. The total expected squared error is then $3.604 + 0.5 = 4.104$ versus $1.228 + 0.9 = 2.128$ — a clear advantage for the lower $m_{\mathrm{try}}$ setting despite the higher per-tree bias.
[/example]
One practical consequence of the random forest construction is that it sacrifices the interpretability of a single tree. A single tree can be displayed and explained: the sequence of splits traces a clear logical path from input to prediction. An ensemble of hundreds of trees cannot be similarly displayed. This tradeoff between predictive accuracy and interpretability is a recurrent theme in machine learning. Random forests consistently achieve strong predictive performance across a wide range of applications and are considered one of the most practically useful algorithms in the field.
Random forests succeed by decorrelating trees to reduce variance, yet this empirical success invites a deeper question: what formal guarantees govern generalization across all possible learning algorithms? Statistical learning theory provides these guarantees through concepts like VC dimension and Rademacher complexity.
# 3. Statistical Learning Theory
Chapter 1 introduced the empirical risk minimiser (ERM) and the bias–variance tradeoff for squared error loss. Chapter 2 developed decision trees as a concrete hypothesis class. We now turn to the theoretical foundations that underpin both: statistical learning theory, the branch of machine learning devoted to understanding *why* and *when* ERM works, and how the complexity of the hypothesis class interacts with the amount of training data to control the generalization error.
The central object of study is the **excess risk** $R(\hat{h}) - R(h^*)$, which measures how much worse our data-driven ERM $\hat{h}$ performs compared with the best possible hypothesis $h^* \in \mathcal{H}$ in hindsight. To see why this quantity is natural, recall the OLS example from Chapter 1 (the basis-function ERM with squared error loss): with $d$ basis functions it yields
\begin{align*}
\mathbb{E}R(\hat{h}_D) - \mathbb{E}R(\tilde{h}_{X_{1:n}}) \approx \frac{\sigma^2 d}{n},
\end{align*}
where $\sigma^2 = \operatorname{Var}(Y \mid X = x)$ is the conditional variance (assumed constant). The right-hand side grows linearly in the number of basis functions $d$ and decays like $1/n$ in the sample size — a clean quantitative picture of the bias–variance tension. Our goal in this chapter is to develop analogous, rigorous bounds in the general classification setting, where the hypothesis class $\mathcal{H}$ need not be parametric.
[definition: Excess Risk]
Let $\mathcal{H}$ be a hypothesis class, $\hat{h}$ the ERM trained on data $D = ((X_1, Y_1), \ldots, (X_n, Y_n))$, and
\begin{align*}
h^* := \operatorname*{arg\,min}_{h \in \mathcal{H}} R(h)
\end{align*}
the risk minimiser over $\mathcal{H}$ (or an approximate minimiser within $\varepsilon > 0$ of the infimum, when no exact minimiser exists). The **excess risk** is the random variable
\begin{align*}
R(\hat{h}) - R(h^*).
\end{align*}
[/definition]
Two questions drive the theory: how does the *complexity* of $\mathcal{H}$ influence the excess risk, and how does the sample size $n$ affect it? Both questions are simultaneously addressed by the following fundamental decomposition.
## The Approximation–Estimation Decomposition
The starting point is an algebraic identity that splits the excess risk into two interpretable parts. Write the empirical risk of $h$ as $\hat{R}(h) := \frac{1}{n}\sum_{i=1}^n \ell(h(X_i), Y_i)$. Then
\begin{align*}
R(\hat{h}) - R(h^*) &= \bigl[R(\hat{h}) - \hat{R}(\hat{h})\bigr] + \underbrace{\bigl[\hat{R}(\hat{h}) - \hat{R}(h^*)\bigr]}_{\leq\, 0} + \bigl[\hat{R}(h^*) - R(h^*)\bigr] \\
&\leq \sup_{h \in \mathcal{H}}\bigl\{R(h) - \hat{R}(h)\bigr\} + \bigl[\hat{R}(h^*) - R(h^*)\bigr].
\end{align*}
The middle bracket is at most zero because $\hat{h}$ is the ERM — by definition $\hat{R}(\hat{h}) \leq \hat{R}(h^*)$ for every $h^* \in \mathcal{H}$. The two remaining terms have distinct characters.
[explanation: What the Two Terms Mean]
The term $\hat{R}(h^*) - R(h^*)$ is the **estimation error** for a single, fixed hypothesis $h^*$: it is the gap between the empirical risk and the true risk of the best-in-class hypothesis. By the law of large numbers it tends to zero as $n \to \infty$, and concentration inequalities will give sharp finite-sample bounds on its tail probability.
The term $\sup_{h \in \mathcal{H}}\{R(h) - \hat{R}(h)\}$ is the **uniform deviation** — the worst-case discrepancy between true and empirical risk over the *entire* class $\mathcal{H}$. This is more subtle: even if each individual $R(h) - \hat{R}(h)$ concentrates around zero, the supremum over a large or rich class can remain bounded away from zero. Controlling this supremum is the core challenge of statistical learning theory, and its difficulty grows with the complexity of $\mathcal{H}$.
The decomposition makes the tradeoff precise: a richer $\mathcal{H}$ reduces the approximation error (we have more hypotheses to choose from), but it inflates the uniform deviation, increasing the estimation error. Optimal performance requires balancing these two effects.
[/explanation]
## Why the CLT Is Not Enough
To see concretely why uniform control is hard, consider bounding the tail probability of the excess risk. We would like to upper bound
\begin{align*}
\mathbb{P}\!\left(\sup_{h \in \mathcal{H}}\bigl\{R(h) - \hat{R}(h)\bigr\} > t\right)
\end{align*}
for a given $t \geq 0$. As a warm-up, suppose $|\mathcal{H}|$ is finite. A union bound gives
\begin{align*}
\mathbb{P}\!\left(\max_{h \in \mathcal{H}}\bigl\{R(h) - \hat{R}(h)\bigr\} > t\right) \leq \sum_{h \in \mathcal{H}} \mathbb{P}\!\left(R(h) - \hat{R}(h) > t\right).
\end{align*}
For each fixed $h \in \mathcal{H}$, the difference $R(h) - \hat{R}(h)$ is an average of $n$ i.i.d. mean-zero random variables:
\begin{align*}
R(h) - \hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \bigl[\mathbb{E}\{\ell(h(X_i), Y_i)\} - \ell(h(X_i), Y_i)\bigr].
\end{align*}
The central limit theorem tells us that $\sqrt{n}\{R(h) - \hat{R}(h)\}$ converges in distribution to $N(0, \operatorname{Var}(\ell(h(X_1), Y_1)))$. But applying the CLT here faces three serious obstacles.
[remark: Three Obstacles to Using the CLT]
First, the union bound requires a *uniform* limiting result over all $h \in H$ simultaneously. In practice we may wish to let $|H|$ grow with $n$ to reduce bias, so there is no reason to expect uniformity. Second, to make the union bound sum small, we need $t$ to be fairly large — we need the CLT to provide a good approximation far into the right tail of the distribution of $\sqrt{n}\{R(h) - \hat{R}(h)\}$, which it generally does not (the CLT is an asymptotic result for fixed $t$). Third, when $H$ is infinite, the union bound is unavailable entirely and we need tools that do not enumerate hypotheses.
[/remark]
These obstacles motivate a different toolbox: **concentration inequalities**. These are nonasymptotic, finite-sample bounds on the tail probabilities of sums or averages of independent random variables. They do not require $n \to \infty$ and can be stated uniformly over function classes. The prototype is Hoeffding's inequality, which will occupy the next section, followed by Rademacher complexity and VC dimension as the main tools for controlling the uniform deviation when $\mathcal{H}$ is infinite.
[motivation]
### Why This Matters Beyond Classification
Although we develop the theory primarily in the classification setting, the excess risk decomposition and the uniform deviation bound are central to all of statistical learning. In regression, the same framework explains why OLS with many basis functions can overfit: the empirical risk minimiser fits the training data well ($\hat{R}(\hat{h})$ is small), but the uniform deviation is large, so the true risk $R(\hat{h})$ can be much larger than $\hat{R}(\hat{h})$. In the parametric world the size of $H$ is controlled by the number of parameters; in the nonparametric world (decision trees, neural networks) the notion of complexity is more subtle, and that is precisely what Rademacher complexity and VC dimension are designed to capture.
### The Chapter Roadmap
We begin with sub-Gaussianity and Hoeffding's inequality, which handle the single-hypothesis case and finite hypothesis classes. We then develop Rademacher complexity, which gives uniform bounds over infinite classes by measuring how well $\mathcal{H}$ can fit random noise. Finally, VC dimension provides a combinatorial characterisation of complexity that is independent of the probability distribution, yielding distribution-free generalization bounds.
[/motivation]
## Sub-Gaussianity and Hoeffding's Inequality
The central challenge of statistical learning theory is to bound the gap between empirical risk and true risk. The previous section showed that this gap, once decomposed, reduces to controlling the fluctuations of bounded random variables around their means. The right tool for this is a family of tail bounds that go by the name of concentration inequalities. We begin with the simplest such bound — Markov's inequality — and build up to Hoeffding's inequality via the notion of sub-Gaussianity.
### From Markov to the Chernoff Bound
The starting point is elementary. Let $W$ be a non-negative random variable. For any $t > 0$, the pointwise inequality $t \cdot \mathbb{1}_{\{W \geq t\}} \leq W$ holds, and taking expectations on both sides yields **Markov's inequality**:
\begin{align*}
\mathbb{P}(W \geq t) \leq \frac{\mathbb{E}[W]}{t}.
\end{align*}
Markov's inequality is coarse — it uses only the mean of $W$. We can sharpen it dramatically by choosing a strictly increasing function $\varphi : \mathbb{R} \to (0, \infty)$ and applying Markov to $\varphi(W)$:
\begin{align*}
\mathbb{P}(W \geq t) = \mathbb{P}(\varphi(W) \geq \varphi(t)) \leq \frac{\mathbb{E}[\varphi(W)]}{\varphi(t)}.
\end{align*}
The choice $\varphi(t) = e^{\alpha t}$ for $\alpha > 0$ is particularly powerful. Optimising over $\alpha$ gives the **Chernoff bound**:
\begin{align*}
\mathbb{P}(W \geq t) \leq \inf_{\alpha > 0} e^{-\alpha t} \mathbb{E}[e^{\alpha W}].
\end{align*}
The quantity $\mathbb{E}[e^{\alpha W}]$ is the **moment generating function** (mgf) of $W$, evaluated at $\alpha$. The Chernoff bound reduces the tail problem to a moment generating function problem: if we can control the mgf of $W$, we can control the tails.
[example: Gaussian Chernoff Bound]
Let $W \sim N(0, \sigma^2)$. The mgf of a Gaussian is known explicitly:
\begin{align*}
\mathbb{E}[e^{\alpha W}] = e^{\alpha^2 \sigma^2 / 2}.
\end{align*}
Applying the Chernoff bound and optimising over $\alpha > 0$:
\begin{align*}
\mathbb{P}(W \geq t) \leq \inf_{\alpha > 0} e^{\alpha^2 \sigma^2/2 - \alpha t}.
\end{align*}
The exponent $f(\alpha) = \alpha^2 \sigma^2 / 2 - \alpha t$ is minimised at $\alpha^* = t/\sigma^2$, giving $f(\alpha^*) = -t^2/(2\sigma^2)$. Therefore, for $t \geq 0$:
\begin{align*}
\mathbb{P}(W \geq t) \leq e^{-t^2/(2\sigma^2)}.
\end{align*}
This is a Gaussian tail bound: the probability decays super-exponentially in $t$.
[/example]
The computation above used nothing specific about the Gaussian distribution except the form of its mgf. Any random variable whose mgf is bounded above by $e^{\alpha^2 \sigma^2/2}$ will satisfy the same tail bound. This is the key observation that motivates sub-Gaussianity.
### Sub-Gaussian Random Variables
[definition: Sub-Gaussian Random Variable]
A random variable $W$ is **sub-Gaussian with parameter $\sigma > 0$** if
\begin{align*}
\mathbb{E}[e^{\alpha(W - \mathbb{E}[W])}] \leq e^{\alpha^2 \sigma^2 / 2} \quad \text{for all } \alpha \in \mathbb{R}.
\end{align*}
[/definition]
The definition centres $W$ at its mean: the mgf condition is imposed on the centred variable $W - \mathbb{E}[W]$. This is natural since concentration is always measured relative to the mean.
The connection to tail bounds is immediate.
[quotetheorem:1953]
[citeproof:1953]
The bound $e^{-t^2/(2\sigma^2)}$ matches the Gaussian tail exactly. If $\sigma$ is misspecified — say the true parameter is $\sigma_0 < \sigma$ — the bound still holds but is loose by a factor of $e^{-t^2/(2\sigma^2) + t^2/(2\sigma_0^2)} < 1$, meaning we are overestimating the tail. The one-sided form above handles only the right tail $W - \mathbb{E}[W] \geq t$; the left tail requires applying the same bound to $-W$, which is also sub-Gaussian with parameter $\sigma$ by the symmetry property. What the bound does not capture — and cannot, without further distributional information — is a tight lower bound on the tail: sub-Gaussianity is an upper bound condition, and the true distribution could have a much lighter left tail than the Gaussian benchmark.
### Basic Properties
Sub-Gaussianity has several closure properties that make the class easy to work with.
If $W$ is sub-Gaussian with parameter $\sigma$, then:
- **Monotonicity in $\sigma$**: $W$ is also sub-Gaussian with parameter $\sigma'$ for any $\sigma' \geq \sigma$. (The mgf bound $e^{\alpha^2 \sigma^2/2} \leq e^{\alpha^2 (\sigma')^2/2}$ is immediate.)
- **Symmetry**: $-W$ is sub-Gaussian with parameter $\sigma$. This follows since the sub-Gaussianity condition for $-W$ is $\mathbb{E}[e^{\alpha(-(W - \mathbb{E}[W]))}] \leq e^{\alpha^2 \sigma^2/2}$, which is the same as the condition for $W$ evaluated at $-\alpha$.
- **Translation invariance**: $W - c$ is sub-Gaussian with parameter $\sigma$ for any deterministic constant $c \in \mathbb{R}$.
The symmetry property is useful: applying the sub-Gaussian tail bound to both $W - \mathbb{E}[W]$ and $-(W - \mathbb{E}[W])$ and taking a union bound gives a two-sided tail bound:
\begin{align*}
\mathbb{P}(|W - \mathbb{E}[W]| \geq t) \leq 2e^{-t^2/(2\sigma^2)}.
\end{align*}
Gaussian random variables are sub-Gaussian (with parameter equal to their standard deviation), but the sub-Gaussian class is much broader than the Gaussian family.
[example: Rademacher Random Variable]
A **Rademacher random variable** $\varepsilon$ takes values $\{-1, 1\}$ with equal probability $1/2$ each. It has mean zero. We claim $\varepsilon$ is sub-Gaussian with parameter $\sigma = 1$.
We compute the mgf directly:
\begin{align*}
\mathbb{E}[e^{\alpha \varepsilon}] = \frac{1}{2}(e^{-\alpha} + e^{\alpha}) = \sum_{k=0}^{\infty} \frac{\alpha^{2k}}{(2k)!}.
\end{align*}
The second equality uses the Taylor series for $e^{\alpha}$ and $e^{-\alpha}$: the odd-degree terms cancel. Now we compare with the Taylor series of $e^{\alpha^2/2} = \sum_{k=0}^{\infty} \frac{\alpha^{2k}}{2^k k!}$. The key fact is that $(2k)! \geq 2^k k!$ for all $k \geq 0$, which follows by comparing the products term by term: $(2k)(2k-1)\cdots(k+1) \geq 2^k$ since each of the $k$ factors exceeds $2$. Therefore:
\begin{align*}
\frac{\alpha^{2k}}{(2k)!} \leq \frac{\alpha^{2k}}{2^k k!},
\end{align*}
and summing over $k$ gives $\mathbb{E}[e^{\alpha \varepsilon}] \leq e^{\alpha^2/2}$. Since $\mathbb{E}[\varepsilon] = 0$, this is precisely the sub-Gaussian condition with $\sigma = 1$.
[/example]
### Hoeffding's Lemma
The most important source of sub-Gaussian random variables in learning theory is bounded random variables. Recall that the indicator $\mathbb{1}_{\{h(X_i) \neq Y_i\}} - \mathbb{P}(h(X_i) \neq Y_i)$ is centred and bounded. The following lemma handles exactly this situation.
[quotetheorem:1956]
The proof presented in the source establishes the weaker bound $\sigma = b - a$ (which is still sufficient for the main application) via a symmetrisation argument. We present this proof in full, as the technique recurs throughout the course.
[proof]
We prove the bound with parameter $\sigma = b - a$. Let $W'$ be an independent copy of $W$ (same distribution, independent of $W$). Since $\mathbb{E}[W'] = \mathbb{E}[W]$, we can write:
\begin{align*}
\mathbb{E}[e^{\alpha(W - \mathbb{E}[W])}] = \mathbb{E}[e^{\alpha(W - \mathbb{E}[W'])}] = \mathbb{E}[e^{\mathbb{E}[\alpha(W - W') \mid W]}].
\end{align*}
The last equality uses $\mathbb{E}[W' \mid W] = \mathbb{E}[W']$ (by independence) and the fact that $W$ is known given $W$. Applying Jensen's inequality to the convex function $x \mapsto e^x$ conditionally on $W$ (and then the tower property):
\begin{align*}
\mathbb{E}[e^{\alpha(W - \mathbb{E}[W])}] \leq \mathbb{E}[e^{\alpha(W - W')}].
\end{align*}
Now introduce a Rademacher random variable $\varepsilon$ independent of $(W, W')$. The key observation is that $W - W' \stackrel{d}{=} \varepsilon(W - W')$: the signed difference $W - W'$ is symmetric around zero (since $W$ and $W'$ have the same distribution), so multiplying by an independent $\pm 1$ does not change its distribution. Therefore:
\begin{align*}
\mathbb{E}[e^{\alpha(W - \mathbb{E}[W])}] \leq \mathbb{E}[e^{\alpha \varepsilon(W - W')}] = \mathbb{E}\bigl[\mathbb{E}[e^{\alpha \varepsilon(W - W')} \mid W, W']\bigr].
\end{align*}
Conditioning on $(W, W')$, the quantity $\alpha(W - W')$ is a fixed constant $c$, and $\varepsilon$ is Rademacher. By the Rademacher example above, $\mathbb{E}[e^{c\varepsilon}] \leq e^{c^2/2}$. Applying this conditionally:
\begin{align*}
\mathbb{E}[e^{\alpha(W - \mathbb{E}[W])}] \leq \mathbb{E}[e^{\alpha^2(W - W')^2/2}].
\end{align*}
Since $W \in [a, b]$ and $W' \in [a, b]$ almost surely, we have $|W - W'| \leq b - a$, so $(W - W')^2 \leq (b-a)^2$. Therefore:
\begin{align*}
\mathbb{E}[e^{\alpha(W - \mathbb{E}[W])}] \leq e^{\alpha^2(b-a)^2/2}.
\end{align*}
This is the sub-Gaussian condition with parameter $\sigma = b - a$.
[/proof]
The parameter $\sigma = b - a$ is not always tight: the lemma gives a sub-Gaussian bound that depends only on the range of $W$, not on its variance or shape. When $W$ is concentrated near one endpoint of $[a, b]$ — for instance, a Bernoulli$(p)$ variable with $p$ close to $0$ or $1$ — the true sub-Gaussian parameter is roughly $\sqrt{p(1-p)}$, far smaller than $(b-a)/2 = 1/2$. In such cases Bernstein's inequality, which explicitly incorporates the variance, gives much sharper bounds. For truly unbounded random variables Hoeffding's lemma does not apply at all: an unbounded variable can have an mgf that grows faster than $e^{\alpha^2 \sigma^2/2}$ for any fixed $\sigma$, and no sub-Gaussian bound exists.
[remark: Symmetrisation]
The introduction of an independent copy $W'$ and a Rademacher random variable $\varepsilon$ is an instance of a **symmetrisation argument**. By replacing $W - \mathbb{E}[W]$ with the symmetric quantity $W - W'$, we gain the freedom to introduce a random sign without changing the distribution. This technique is a cornerstone of empirical process theory and will reappear later in the course.
[/remark]
### Sums of Independent Sub-Gaussians
A linear combination of jointly Gaussian random variables is again Gaussian. An analogous closure holds for sub-Gaussians.
[quotetheorem:1960]
[citeproof:1960]
The proof hinges critically on **independence**: the mgf factors as a product of individual mgfs only when the $W_i$ are independent, allowing the sub-Gaussian conditions to combine multiplicatively. If the $W_i$ are dependent — correlated residuals in a regression, or the increments of a dependent process — this factorisation fails and the sum need not be sub-Gaussian even if each summand is. For example, if $W_1 = W_2 = \cdots = W_n = W$ (maximally dependent), then $\sum_i W_i = nW$, and the sub-Gaussian parameter would be $n\sigma$, not $\sqrt{n}\sigma$. Dependent sub-Gaussian random variables require more sophisticated tools such as Azuma's inequality or the bounded differences method, which we encounter in later chapters.
### Hoeffding's Inequality
Combining Hoeffding's lemma with the stability theorem gives the fundamental concentration inequality for bounded random variables.
[quotetheorem:1962]
[citeproof:1962]
Hoeffding's inequality is tight for bounded variables: up to constants, the decay rate $e^{-2n^2 t^2 / \sum (b_i - a_i)^2}$ cannot be improved in the worst case over the class of all distributions on $[a_i, b_i]$. However, it ignores the variance of the $W_i$: the bound depends only on the range $b_i - a_i$, not on $\operatorname{Var}(W_i)$. Bernstein's inequality sharpens this by replacing the range term with the variance: for variables with $|W_i - \mathbb{E}[W_i]| \leq c$ almost surely, Bernstein gives the tail bound $\exp(-nt^2 / (2\sigma^2 + 2ct/3))$ where $\sigma^2 = \frac{1}{n}\sum_i \operatorname{Var}(W_i)$. When the $W_i$ have small variance — as is often the case near the optimal hypothesis — the Bernstein bound is substantially sharper and leads to faster rates of learning.
[example: Application to Binary Classification]
In the classification setting, the relevant quantities are $W_i = \mathbb{1}_{\{h(X_i) \neq Y_i\}}$, which satisfies $0 \leq W_i \leq 1$ almost surely, so $a_i = 0$ and $b_i = 1$ for all $i$. The empirical risk is $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n W_i$ and the true risk is $R(h) = \mathbb{E}[W_i]$. Hoeffding's inequality gives:
\begin{align*}
\mathbb{P}\!\left(\hat{R}(h) - R(h) \geq t\right) \leq e^{-2nt^2},
\end{align*}
so the empirical risk concentrates around the true risk at rate $O(1/\sqrt{n})$. This bound holds for a fixed classifier $h$; controlling the supremum over a hypothesis class $\mathcal{H}$ requires a union bound argument, developed in the next section.
[/example]
### Expected Maximum of Sub-Gaussians
The sub-Gaussian mgf condition yields not only tail bounds but also control of the expected maximum of a collection of sub-Gaussian random variables — even when they are dependent.
[quotetheorem:1963]
[citeproof:1963]
The bound we have just established depends on $d$ only through its logarithm — a surprisingly mild dependence that deserves emphasis.
[remark: Logarithmic Growth in $d$]
The bound $\sigma\sqrt{2\log d}$ grows only logarithmically in the number of variables $d$. This is a signature feature of sub-Gaussian tails: even among $d$ random variables, the expected maximum does not grow much beyond the standard deviation. This will be important in the next section, where the expected supremum of an empirical process over a finite hypothesis class is bounded using this result.
[/remark]
## Finite Hypothesis Classes
We have now assembled the probabilistic toolkit — sub-Gaussian random variables and Hoeffding's inequality — needed to give the first rigorous generalization bound of the course. The setting is deliberately simple: we assume the hypothesis class $\mathcal{H}$ is **finite**. This assumption is strong, but it already forces us to confront the central tension of statistical learning: the ERM $\hat{h}$ is chosen using the same data over which we evaluate it, so its empirical risk is a poor proxy for its true risk. The key insight of this section is that the logarithm of $|\mathcal{H}|$ — not $|\mathcal{H}|$ itself — is what controls how hard the class is to learn.
### Decomposing the Excess Risk
From the approximation–estimation decomposition established earlier in this chapter, for any minimizer $h^* \in \mathcal{H}$ of the true risk $R$ over $\mathcal{H}$ and any ERM $\hat{h} \in \arg\min_{h \in \mathcal{H}} \hat{R}(h)$, we have
\begin{align*}
R(\hat{h}) - R(h^*) &= \bigl[R(\hat{h}) - \hat{R}(\hat{h})\bigr] + \underbrace{\bigl[\hat{R}(\hat{h}) - \hat{R}(h^*)\bigr]}_{\leq\, 0} + \bigl[\hat{R}(h^*) - R(h^*)\bigr] \\
&\leq \sup_{h \in \mathcal{H}} \bigl\{R(h) - \hat{R}(h)\bigr\} + \bigl[\hat{R}(h^*) - R(h^*)\bigr].
\end{align*}
The second bracket is the fluctuation of the empirical risk at the fixed hypothesis $h^*$: since $h^*$ does not depend on the data, Hoeffding's inequality applies to it directly. The first bracket is the supremum over all of $\mathcal{H}$: the ERM chooses $\hat{h}$ to minimize empirical risk, so it may have selected a hypothesis whose empirical risk is anomalously low, and bounding this requires care. When $\mathcal{H}$ is finite, the union bound transforms the supremum over $\mathcal{H}$ into a sum of tail probabilities, each of which Hoeffding controls.
### The Generalization Bound
[quotetheorem:1964]
[citeproof:1964]
The bound reveals the precise cost of using a richer hypothesis class. Each additional hypothesis in $\mathcal{H}$ contributes one more event to the union, inflating the tail bound by a factor of $|\mathcal{H}|$. Taking a logarithm converts this multiplicative inflation into an additive cost: we pay $\log|\mathcal{H}|$ extra samples' worth of risk to use a class of size $|\mathcal{H}|$ rather than a single hypothesis. When $|\mathcal{H}|$ grows polynomially in $n$, this cost is negligible; only when $|\mathcal{H}|$ is exponentially large in $n$ does the class become unlearnable in this regime.
[remark: Role of the Union Bound]
The union bound argument used here is sharp only when the $|\mathcal{H}|$ events $\{R(h) - \hat{R}(h) \geq t/2\}$ are nearly independent. In practice, hypotheses in $\mathcal{H}$ may agree on most training points, making these events highly correlated. A tighter analysis — which we will develop using Rademacher complexity — exploits this correlation structure and replaces $\log|\mathcal{H}|$ by a quantity that reflects the actual complexity of the function class rather than its raw size. The finite-class bound is thus a useful baseline, not the final word.
[/remark]
### The Histogram Classifier
The bound is not merely abstract. The following example shows it applies to a concrete method and allows us to extract a meaningful rate for the generalization error.
[example: Histogram Classifier on the Unit Square]
Let $\mathcal{X} = [0,1)^2$ and suppose $\mathcal{Y} = \{-1, 1\}$. Fix an integer $m \geq 1$ and partition $[0,1)^2$ into $m^2$ disjoint squares
\begin{align*}
R_{r,s} = \left[\frac{r}{m}, \frac{r+1}{m}\right) \times \left[\frac{s}{m}, \frac{s+1}{m}\right), \quad r, s = 0, 1, \ldots, m-1.
\end{align*}
Label each cell by the majority vote of the training labels falling in it. Formally, define
\begin{align*}
\bar{Y}_j := \operatorname{sgn}\!\Bigl(\sum_{i:\, X_i \in R_j} Y_i\Bigr)
\end{align*}
for each cell $R_j$ (using a tiebreaking convention, e.g., $\operatorname{sgn}(0) = -1$), and set the histogram classifier
\begin{align*}
\hat{h}_{\mathrm{hist}}(x) := \sum_{j=1}^{m^2} \bar{Y}_j \,\mathbb{1}_{R_j}(x).
\end{align*}
This classifier is equivalent to ERM over the hypothesis class $\mathcal{H}$ consisting of all functions that assign a constant label in $\{-1, 1\}$ to each cell. There are $m^2$ cells and two choices per cell, so $|\mathcal{H}| = 2^{m^2}$, and thus $\log|\mathcal{H}| = m^2 \log 2$.
Applying the finite-class bound, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
\begin{align*}
R(\hat{h}_{\mathrm{hist}}) - R(h^*) \leq \sqrt{\frac{2m^2 \log 2 + 2\log(1/\delta)}{n}}.
\end{align*}
For fixed $\delta$, the leading term is $\sqrt{2m^2 \log 2 / n}$. The bound improves as $n$ grows and worsens as $m$ grows: finer partitions mean a richer class and more bits of information needed to specify a hypothesis. To balance bias (a coarse partition cannot separate nearby points with different labels) and variance (a fine partition requires more data), one would choose $m$ as a function of $n$ — for instance $m \sim n^{1/4}$ in two dimensions yields a bound of order $n^{-1/4}$. This is the bias-variance tradeoff made concrete in the learning-theoretic language.
[/example]
### What the Bound Says About Sample Complexity
It is instructive to restate the result in terms of sample complexity: how many samples $n$ suffice for $\hat{h}$ to be $\varepsilon$-optimal with probability $1 - \delta$?
Setting the bound equal to $\varepsilon$ and solving for $n$:
\begin{align*}
n \geq \frac{2(\log|\mathcal{H}| + \log(1/\delta))}{\varepsilon^2}.
\end{align*}
This is linear in $\log|\mathcal{H}|$ and quadratic in $1/\varepsilon$. The quadratic dependence on $1/\varepsilon$ is characteristic of bounds derived from Hoeffding-type inequalities and cannot be improved without further structural assumptions on $\mathcal{H}$ or $\mathbb{P}_0$. The logarithmic dependence on $|\mathcal{H}|$ is the remarkable feature: an ERM over a class of exponential size $2^d$ requires only $O(d/\varepsilon^2)$ samples, which is linear in the number of bits needed to encode a hypothesis. Binary classification over $\{0,1\}^d$ inputs using all threshold functions is learnable in $O(d/\varepsilon^2)$ samples, not $O(2^d/\varepsilon^2)$.
[explanation: Why the Logarithm Appears]
The union bound replaces the supremum over $\mathcal{H}$ by a sum, which scales as $|\mathcal{H}|$ times the individual tail probability. Hoeffding's inequality gives a tail probability of $e^{-nt^2/2}$ for each fixed hypothesis. Setting $|\mathcal{H}| \cdot e^{-nt^2/2} = \delta$ yields $e^{-nt^2/2} = \delta/|\mathcal{H}|$, i.e., $nt^2/2 = \log|\mathcal{H}| + \log(1/\delta)$. The exponential tail of the Hoeffding bound absorbs the polynomial or even exponential size of $\mathcal{H}$, but the cost appears as an additive $\log|\mathcal{H}|$ inside the square root.
This mechanism breaks down completely when $\mathcal{H}$ is infinite: $\log|\mathcal{H}| = +\infty$ and the bound becomes vacuous — it tells us nothing. The natural question — the one that motivates the next section on Rademacher complexity — is what quantity plays the role of $\log|\mathcal{H}|$ for infinite classes. The answer is not a property of the class in isolation, but of how the class interacts with the data distribution: the Rademacher complexity measures the degree to which functions in $\mathcal{H}$ can fit random noise.
[/explanation]
## Rademacher Complexity
The bound derived for finite hypothesis classes — of order $\log|\mathcal{H}|/n$ — relies on a union bound that simply counts hypotheses. When $\mathcal{H}$ is infinite, this approach collapses entirely: there is no finite $|\mathcal{H}|$ to take a logarithm of. We need a more refined measure of complexity that captures not how many hypotheses exist, but how richly a class can fit data.
Rademacher complexity provides exactly this. Rather than counting, it measures how well the function class $F$ can correlate with purely random noise. A class that can align with any random labeling is expressive and prone to overfitting; a class that cannot is constrained and generalizes well. The connection between this geometric notion and generalization bounds is the content of this section.
### The Setup and Function Class
Recall from the preceding analysis that bounding the expected excess risk $\mathbb{E}R(\hat{h}) - R(h^*)$ reduces to bounding $\mathbb{E}G$ where
\begin{align*}
G := \sup_{h \in \mathcal{H}} \{R(h) - \hat{R}(h)\}.
\end{align*}
Write $Z_i = (X_i, Y_i)$ for $i = 1, \ldots, n$ and define the class of negated loss functions
\begin{align*}
F := \{(x, y) \mapsto -\ell(h(x), y) : h \in \mathcal{H}\}.
\end{align*}
Then
\begin{align*}
G = \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \{f(Z_i) - \mathbb{E}f(Z_i)\}.
\end{align*}
What makes $G$ hard to control directly is its dependence on the unknown distribution $P_0$. The Rademacher approach succeeds precisely because it produces complexity measures that are free of $P_0$.
### Empirical and Population Rademacher Complexity
[definition: Rademacher Random Variable]
A **Rademacher random variable** $\varepsilon$ takes values in $\{-1, +1\}$, each with probability $1/2$.
[/definition]
The Rademacher distribution is the simplest model of pure noise: symmetric, bounded, mean zero. The key idea is to measure how much a function class can correlate with an i.i.d. sequence $\varepsilon_1, \ldots, \varepsilon_n$ of such variables.
[definition: Empirical Rademacher Complexity]
Let $F$ be a class of real-valued functions $f : \mathcal{Z} \to \mathbb{R}$ and let $z_1, \ldots, z_n \in \mathcal{Z}$. Define the **class of behaviors** of $F$ on $z_{1:n}$:
\begin{align*}
F(z_{1:n}) := \{(f(z_1), \ldots, f(z_n)) : f \in F\} \subseteq \mathbb{R}^n.
\end{align*}
The **empirical Rademacher complexity** of $F$ at $z_{1:n}$ is
\begin{align*}
\hat{\mathcal{R}}(F(z_{1:n})) := \mathbb{E}\left[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(z_i)\right],
\end{align*}
where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Rademacher random variables. The expectation is over $\varepsilon_{1:n}$ only; the points $z_{1:n}$ are fixed. Note that $\hat{\mathcal{R}}(F(z_{1:n}))$ depends on $F$ only through $F(z_{1:n})$, which justifies the notation.
[/definition]
When the data points $Z_1, \ldots, Z_n$ are themselves random, we view $\hat{\mathcal{R}}(F(Z_{1:n}))$ as a random variable by conditioning on the sample:
\begin{align*}
\hat{\mathcal{R}}(F(Z_{1:n})) := \mathbb{E}\left[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(Z_i) \,\Bigg|\, Z_{1:n}\right].
\end{align*}
Here the expectation is over the Rademacher variables $\varepsilon_{1:n}$, conditionally on the observed sample $Z_{1:n}$.
[definition: Population Rademacher Complexity]
The **Rademacher complexity** of $F$ is the expectation of the empirical Rademacher complexity over the data:
\begin{align*}
\mathcal{R}_n(F) := \mathbb{E}\hat{\mathcal{R}}(F(Z_{1:n})).
\end{align*}
[/definition]
[explanation: Intuition via Random Labeling]
To build intuition for what Rademacher complexity measures, consider a classification problem with inputs $Z_1, \ldots, Z_n$ and completely random labels $\varepsilon_1, \ldots, \varepsilon_n \in \{-1, +1\}$. The empirical Rademacher complexity captures how well functions in $F$ can align their predictions $f(Z_i)$ with these random labels. Specifically, $\sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(Z_i)$ is large when $F$ contains a function whose values at the data points closely match the random signs.
A rich class $F$ — one that can shatter configurations of points — will have high Rademacher complexity because it can always find a function that correlates with any random labeling. A constrained class will be unable to achieve this and will have low complexity. This is precisely why Rademacher complexity serves as the right measure of expressivity for generalization: classes that are expressive enough to memorize random noise are classes that fail to generalize.
[/explanation]
### The Fundamental Generalization Bound
[quotetheorem:1965]
The proof uses a technique called **symmetrization**: replace the unknown population mean $\mathbb{E}f(Z_i)$ with the sample mean of an independent ghost sample, then introduce Rademacher variables to decouple the two samples.
[proof]
Introduce an independent copy $(Z_1', \ldots, Z_n')$ of $(Z_1, \ldots, Z_n)$, independent of everything else. Since $Z_i'$ is an independent copy of $Z_i$, we have $\mathbb{E}f(Z_i) = \mathbb{E}[f(Z_i') \mid Z_{1:n}]$, so
\begin{align*}
\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \{f(Z_i) - \mathbb{E}f(Z_i)\}
= \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[f(Z_i) - f(Z_i') \mid Z_{1:n}].
\end{align*}
Apply the key inequality $\sup_{f} \mathbb{E}[V_f \mid Z_{1:n}] \leq \mathbb{E}[\sup_f V_f \mid Z_{1:n}]$ (moving the supremum inside the conditional expectation only makes the bound larger) to obtain
\begin{align*}
\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[f(Z_i) - f(Z_i') \mid Z_{1:n}]
\leq \mathbb{E}\left[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \{f(Z_i) - f(Z_i')\} \,\Bigg|\, Z_{1:n}\right].
\end{align*}
Now introduce i.i.d. Rademacher variables $\varepsilon_1, \ldots, \varepsilon_n$ independent of $Z_{1:n}$ and $Z_{1:n}'$. Because each $Z_i - Z_i'$ is symmetric around zero, we have the distributional equality
\begin{align*}
\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \{f(Z_i) - f(Z_i')\}
\stackrel{d}{=} \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \{f(Z_i) - f(Z_i')\}.
\end{align*}
Using the triangle inequality for the supremum:
\begin{align*}
\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \{f(Z_i) - f(Z_i')\}
\leq \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(Z_i) + \sup_{g \in F} \frac{1}{n} \sum_{i=1}^{n} (-\varepsilon_i) g(Z_i').
\end{align*}
Since $\varepsilon_{1:n} \stackrel{d}{=} -\varepsilon_{1:n}$, both suprema have the same expected value. Taking expectations over $\varepsilon_{1:n}$ and $Z_{1:n}'$ and then over $Z_{1:n}$ yields the bound $2\mathcal{R}_n(F)$.
[/proof]
[remark: Why Rademacher Complexity Sidesteps $P_0$]
The key advantage of the bound is visible in the proof: the unknown distribution $P_0$ appears only in $\mathcal{R}_n(F) = \mathbb{E}\hat{\mathcal{R}}(F(Z_{1:n}))$, and $\hat{\mathcal{R}}(F(z_{1:n}))$ depends on $P_0$ only through the realized data points $z_{1:n}$. If we can find upper bounds on $\hat{\mathcal{R}}(F(z_{1:n}))$ that are uniform over all $z_{1:n} \in \mathcal{Z}^n$, we immediately obtain a bound on $\mathcal{R}_n(F)$ independent of $P_0$. This is the program for applying the theorem in practice.
[/remark]
### Generalization Bound for ERM
Combining the symmetrization lemma with the decomposition $\mathbb{E}R(\hat{h}) - R(h^*) \leq \mathbb{E}G$ gives the main corollary:
[quotetheorem:1967]
[citeproof:1967]
It is worth noting that nothing in this argument relied on the specific form of the loss function — only on its boundedness.
[remark: No Constraint on Loss Type]
The bound $\mathbb{E}R(\hat{h}) - R(h^*) \leq 2\mathcal{R}_n(F)$ holds for any bounded loss function $\ell$, not just misclassification loss. In particular, it applies to surrogate losses used in practice such as hinge loss or logistic loss.
[/remark]
### What Rademacher Complexity Achieves
The bound $\mathbb{E}R(\hat{h}) - R(h^*) \leq 2\mathcal{R}_n(F)$ reduces the problem of generalization to computing or bounding $\mathcal{R}_n(F)$. Two questions then arise naturally. First, how does $\mathcal{R}_n(F)$ depend on $n$? For well-behaved classes one expects $\mathcal{R}_n(F) \to 0$ as $n \to \infty$, which would establish consistency. Second, what rate of convergence can we expect?
For finite hypothesis classes, one can verify that the Rademacher approach recovers the $\sqrt{\log|\mathcal{H}|/n}$ bound from the union bound argument. The power of Rademacher complexity lies in extending this to infinite classes by replacing a count with a geometric measure. The next step is to evaluate $\mathcal{R}_n(F)$ for specific function classes — in particular, for linear function classes — and this analysis will reveal the connection to covering numbers and ultimately to VC dimension.
[example: Rademacher Complexity Recovers the Finite-Class Bound]
Let $\mathcal{H}$ be a finite hypothesis class with $|\mathcal{H}| < \infty$, and suppose each loss $\ell(h(x), y) \in [0, 1]$. By Hoeffding's inequality applied to the bounded variables $\varepsilon_i f(z_i) \in [-1, 1]$, for any fixed $f$ the sum $\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(z_i)$ concentrates around zero. A union bound over the at most $|\mathcal{H}|$ elements of $F(z_{1:n})$ gives
\begin{align*}
\hat{\mathcal{R}}(F(z_{1:n})) \leq \sqrt{\frac{2\log|\mathcal{H}|}{n}},
\end{align*}
and therefore $\mathcal{R}_n(F) \leq \sqrt{2\log|\mathcal{H}|/n}$. Substituting into the risk bound yields $\mathbb{E}R(\hat{h}) - R(h^*) \leq 2\sqrt{2\log|\mathcal{H}|/n}$, recovering the correct order. The Rademacher framework thus subsumes the finite-class analysis as a special case, and will be developed further through the VC dimension to handle infinite classes.
[/example]
## VC Dimension
The Rademacher complexity bounds developed in the previous section give sharp generalization guarantees, but computing the Rademacher complexity of a concrete hypothesis class $\mathcal{H}$ directly can be difficult. VC dimension offers a combinatorial route to the same bounds that is often far easier to work with in practice. The core idea is to measure the "effective size" of $\mathcal{H}$ not by counting its elements, but by counting how many distinct labeling patterns it can produce on finite sets of points.
### From Rademacher to Combinatorics
Recall the classification setting: $\mathcal{H}$ is a class of hypotheses $h : \mathcal{X} \to \{-1, 1\}$ and the loss class is
\begin{align*}
\mathcal{F} := \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}\}.
\end{align*}
Given $n$ data points $z_{1:n} = (z_1, \ldots, z_n)$ with $z_i = (x_i, y_i)$, let $\mathcal{F}(z_{1:n})$ denote the set of distinct vectors $(f(z_1), \ldots, f(z_n))$ as $f$ ranges over $\mathcal{F}$. Since $\ell(h(x_i), y_i)$ is determined by $h(x_i)$ alone (via the misclassification loss), there is a bijection
\begin{align*}
(\ell(h(x_i), y_i))_{i=1}^n \longleftrightarrow (h(x_i))_{i=1}^n,
\end{align*}
and therefore $|\mathcal{F}(z_{1:n})| = |\mathcal{H}(x_{1:n})|$, where $\mathcal{H}(x_{1:n})$ is the set of distinct label vectors $(h(x_1), \ldots, h(x_n))$ as $h$ ranges over $\mathcal{H}$.
[quotetheorem:1968]
[citeproof:1968]
Since each $h(x_i) \in \{-1, 1\}$, we always have $|\mathcal{H}(x_{1:n})| \leq 2^n$. The bound above is only useful when $|\mathcal{H}(x_{1:n})|$ grows substantially slower than $2^n$, for instance polynomially in $n$. This is precisely what the VC dimension captures.
### Shattering and the VC Dimension
[definition: Shattering and VC Dimension]
Let $\mathcal{H}$ be a class of functions $h : \mathcal{X} \to \{a, b\}$ with $a \neq b$ and $|\mathcal{H}| \geq 2$.
- $\mathcal{H}$ **shatters** $x_{1:n} \in \mathcal{X}^n$ if $|\mathcal{H}(x_{1:n})| = 2^n$, that is, every possible labeling of $x_1, \ldots, x_n$ is realized by some $h \in \mathcal{H}$.
- The **shattering coefficient** (or growth function) is
\begin{align*}
s(\mathcal{H}, n) := \max_{x_{1:n} \in \mathcal{X}^n} |\mathcal{H}(x_{1:n})|.
\end{align*}
- The **VC dimension** $\operatorname{VC}(\mathcal{H})$ is the largest integer $n$ such that some $x_{1:n} \in \mathcal{X}^n$ is shattered by $\mathcal{H}$, or $\infty$ if no such largest $n$ exists. Equivalently,
\begin{align*}
\operatorname{VC}(\mathcal{H}) = \sup\{n \in \mathbb{N} : s(\mathcal{H}, n) = 2^n\}.
\end{align*}
[/definition]
To verify $\operatorname{VC}(\mathcal{H}) = d$ one must: (i) exhibit a set of $d$ points that is shattered, and (ii) show that no set of $d+1$ points can be shattered. Note that sets of non-distinct points can never be shattered, so we may always restrict attention to distinct points.
[example: VC Dimension of Interval Indicators on the Real Line]
Let $\mathcal{X} = \mathbb{R}$ and consider
\begin{align*}
\mathcal{H} = \{h_{a,b} : h_{a,b}(x) = \mathbb{1}_{[a,b)}(x),\ a, b \in \mathbb{R}\}.
\end{align*}
**Claim: $\operatorname{VC}(\mathcal{H}) = 2$.**
*Shattering a pair.* Take any two distinct points $x_1 < x_2$. We must realize all four labelings $\{(0,0), (1,0), (0,1), (1,1)\}$:
- $(0,0)$: use $h_{b,a}$ with $b > a$ chosen so both points fall outside $[a,b)$, e.g. $a = x_2 + 1$.
- $(1,1)$: use $h_{x_1 - 1,\ x_2 + 1}$.
- $(1,0)$: use $h_{x_1 - 1,\ x_1 + \varepsilon}$ for $\varepsilon \in (0, x_2 - x_1)$.
- $(0,1)$: use $h_{x_1 + \varepsilon,\ x_2 + 1}$ for $\varepsilon \in (0, x_2 - x_1)$.
So $\mathcal{H}$ shatters every two-point set.
*No three-point set is shattered.* Take three distinct points $x_1 < x_2 < x_3$. The labeling $h(x_1) = h(x_3) = 1$, $h(x_2) = 0$ is impossible: for $x_1$ and $x_3$ both to lie in $[a,b)$, the entire interval $[x_1, x_3]$ must lie in $[a,b)$, which forces $x_2 \in [a,b)$ as well.
*Bounding the growth function.* The $n$ distinct points $x_1 < \cdots < x_n$ divide $\mathbb{R}$ into $n+1$ intervals. The behavior $(h_{a,b}(x_i))_{i=1}^n$ depends only on which of these $n+1$ intervals contains $a$ and which contains $b$, giving $s(\mathcal{H}, n) \leq (n+1)^2$.
[/example]
The polynomial bound $(n+1)^2$ on the growth function in this example is not a coincidence of the particular class chosen — it reflects a general pattern tied to the VC dimension.
[remark: Structure of the Growth Function]
Notice that in the example above, $s(\mathcal{H}, n) \leq (n+1)^{\operatorname{VC}(\mathcal{H})} = (n+1)^2$. This is not a coincidence: the Sauer–Shelah lemma asserts that this polynomial bound holds for any class with finite VC dimension.
[/remark]
### The Sauer–Shelah Lemma
The key structural fact about the growth function is that finite VC dimension forces polynomial growth, a far cry from the naive $2^n$ upper bound.
[quotetheorem:1969]
What makes this remarkable is the following: knowing only that $s(\mathcal{H}, n) < 2^n$ for all $n > d$ (which follows directly from the definition of VC dimension), one might expect the growth function could still be exponential, like $1.8^n$. The Sauer–Shelah lemma rules this out: once $n > d$, the growth is polynomial.
[proof]
The proof establishes the following stronger claim by induction on $|\mathcal{H}(x_{1:n})|$: for any $x_{1:n}$ and any $Q \subseteq \{1, \ldots, n\}$ nonempty, writing $x_Q$ for the subvector indexed by $Q$, there are at least $|\mathcal{H}(x_{1:n})| - 1$ nonempty subsets $Q \subseteq \{1, \ldots, n\}$ such that $\mathcal{H}$ shatters $x_Q$.
To deduce the lemma from the claim: take $x_{1:n}$ maximizing $|\mathcal{H}(x_{1:n})| = s(\mathcal{H}, n)$. Since $\operatorname{VC}(\mathcal{H}) = d$, no $x_Q$ with $|Q| > d$ is shattered. Counting the shattered sets and using the claim gives $s(\mathcal{H}, n) - 1 \leq \sum_{i=1}^{\min(d,n)} \binom{n}{i}$. For $n \geq d$, the right side satisfies $\sum_{i=1}^{d}\binom{n}{i} \leq n^d + \binom{d}{1}n^{d-1} + \cdots + \binom{d}{d} - 1 = (n+1)^d - 1$, so $s(\mathcal{H}, n) \leq (n+1)^d$.
For the inductive claim: the base case $|\mathcal{H}(x_{1:n})| = 1$ is vacuous. For the inductive step, take $|\mathcal{H}(x_{1:n})| = k+1$ and choose $j$ such that both $\mathcal{H}^+ = \{h \in \mathcal{H} : h(x_j) = 1\}$ and $\mathcal{H}^- = \{h \in \mathcal{H} : h(x_j) = -1\}$ are nonempty. Let $X^{\pm}$ be the families of subvectors shattered by $\mathcal{H}^{\pm}$ respectively; by induction $|X^-| + |X^+| \geq k - 1$. Subvectors in $X^- \cup X^+$ are all shattered by $\mathcal{H}$ and none contains $x_j$ (since $\mathcal{H}^{\pm}$ assign constant sign at $x_j$). For $x_Q \in X^- \cap X^+$, both $x_Q$ and $x_{Q \cup \{j\}}$ are shattered by $\mathcal{H}$, and $x_j$ itself is shattered. Counting gives at least $1 + |X^- \cup X^+| + |X^- \cap X^+| = 1 + |X^-| + |X^+| \geq k$ shattered sets, completing the induction.
[/proof]
The bound $(n+1)^d$ is essentially sharp: there exist hypothesis classes with $\operatorname{VC}(\mathcal{H}) = d$ for which $s(\mathcal{H}, n) = \binom{n}{0} + \cdots + \binom{n}{d} \sim n^d / d!$ for large $n$, so the polynomial growth is genuinely the correct order. What the lemma reveals is a sharp **phase transition**: for $n \leq d$, the growth function equals $2^n$ (the class can shatter all configurations), but at $n = d + 1$ the growth drops from exponential to polynomial. This transition is abrupt — a single additional data point beyond the VC dimension forces the number of achievable labelings to fall from $2^n$ to at most $(n+1)^d$, which for $n \gg d$ represents an exponential reduction. It is this collapse that makes uniform convergence possible: polynomial growth forces the effective number of distinct hypotheses to be manageable, and Rademacher complexity can then be controlled.
### From VC Dimension to Rademacher Bounds
Combining the Sauer–Shelah lemma with the empirical Rademacher bound yields a clean bound on the Rademacher complexity in terms of VC dimension alone.
[quotetheorem:1971]
[citeproof:1971]
The resulting rate merits comparison with the finite-class bound to appreciate the improvement that VC dimension provides.
[remark: Rate Comparison]
The Rademacher bound via VC dimension gives a generalization error of order $\sqrt{d\log(n)/n}$, which is nearly optimal (the optimal rate is $\sqrt{d/n}$ up to logarithmic factors). For large function classes with small VC dimension, this is a dramatic improvement over the naive $\sqrt{\log|\mathcal{H}|/n}$ bound.
[/remark]
### Computing VC Dimensions: Linear Classifiers
The VC dimension of hypothesis classes derived from vector spaces of functions can be bounded purely in terms of the algebraic dimension of the space.
[quotetheorem:1972]
[citeproof:1972]
The bound $\operatorname{VC}(\mathcal{H}) \leq \dim(\mathcal{F})$ is often tight: for affine functions on $\mathbb{R}^p$, the bound gives $\operatorname{VC} \leq p+1$ and the VC dimension is exactly $p+1$, as the examples below confirm. However, the bound can be loose when the function space $\mathcal{F}$ has special structure that prevents shattering large sets. For instance, if the functions in $\mathcal{F}$ are all monotone on a linearly ordered domain, the VC dimension is at most $1$ regardless of $\dim(\mathcal{F})$: a monotone classifier $h = \operatorname{sgn}(f)$ partitions $\mathcal{X}$ into at most two contiguous regions, which shatters at most a single point. In such cases the algebraic dimension dramatically overestimates the combinatorial complexity. The bound is a useful starting point, but computing the VC dimension precisely requires additional geometric or structural arguments.
[example: VC Dimension of Halfspaces in $\mathbb{R}^p$]
Let $\mathcal{X} = \mathbb{R}^p$ and take $\mathcal{F} = \{x \mapsto x^\top \beta + \mu : \beta \in \mathbb{R}^p, \mu \in \mathbb{R}\}$, the class of affine functions. Then $\dim(\mathcal{F}) = p + 1$, so the class of halfspace classifiers $\mathcal{H} = \{x \mapsto \operatorname{sgn}(x^\top \beta + \mu)\}$ satisfies $\operatorname{VC}(\mathcal{H}) \leq p + 1$.
In fact $\operatorname{VC}(\mathcal{H}) = p + 1$ exactly: the $p+1$ points $\{0, e_1, \ldots, e_p\}$ (where $e_j$ are the standard basis vectors) can be shattered by choosing appropriate $\beta$ and $\mu$ to achieve any labeling pattern.
[/example]
Replacing linear decision boundaries with polynomial ones increases the expressivity of the classifier and, consequently, its VC dimension.
[example: VC Dimension of Polynomial Classifiers in $\mathbb{R}^2$]
Let $\mathcal{X} = \mathbb{R}^2$ and let $\mathcal{F}$ be the space of polynomials of degree at most $d$ in two variables. The monomial basis has $\binom{d+2}{2} = \frac{(d+1)(d+2)}{2}$ elements, so $\dim(\mathcal{F}) = \frac{(d+1)(d+2)}{2}$. The class $\mathcal{H} = \{\operatorname{sgn} \circ f : f \in \mathcal{F}\}$ therefore satisfies
\begin{align*}
\operatorname{VC}(\mathcal{H}) \leq \frac{(d+1)(d+2)}{2}.
\end{align*}
For $d = 1$ (linear classifiers in $\mathbb{R}^2$) this gives $\operatorname{VC}(\mathcal{H}) \leq 3$, consistent with the known fact that three non-collinear points can be shattered by a line.
[/example]
The examples so far use classifiers defined by thresholding a single real-valued function. A different geometric construction — classifying points by whether they lie in a coordinate-aligned box — leads to a class whose VC dimension is determined by the ambient dimension in a different way.
[example: VC Dimension of Orthant Indicators in $\mathbb{R}^p$]
Let $\mathcal{X} = \mathbb{R}^p$ and consider the class of negative orthant indicators
\begin{align*}
\mathcal{H} = \left\{\mathbb{1}_A : A = \prod_{j=1}^p (-\infty, a_j],\ a_1, \ldots, a_p \in \mathbb{R}\right\}.
\end{align*}
**Claim: $\operatorname{VC}(\mathcal{H}) = p$.**
*The standard basis is shattered.* For any $I \subseteq \{1, \ldots, p\}$, set $a_j = 1$ if $j \in I$ and $a_j = 0$ otherwise. Then $e_j \in \prod_k (-\infty, a_k]$ if and only if $j \in I$, so the resulting $h$ labels $e_j$ as $1$ precisely for $j \in I$.
*No $p+1$ points can be shattered.* Given $x_1, \ldots, x_{p+1} \in \mathbb{R}^p$, let $\pi_j$ denote the $j$th coordinate function. By the pigeonhole principle, some $x_{k^*}$ is not the unique maximizer of any $\pi_j$ over $\{x_1, \ldots, x_{p+1}\}$: for each $j$, there exists $k_j \neq k^*$ with $(x_{k_j})_j \geq (x_{k^*})_j$. For any $h \in \mathcal{H}$, if $h(x_{k^*}) = 0$ (i.e., $x_{k^*} \notin A$), then for some $j$, $(x_{k^*})_j > a_j$, which forces $(x_{k_j})_j \geq (x_{k^*})_j > a_j$, giving $h(x_{k_j}) = 0$ as well. Therefore the labeling $h(x_{k^*}) = 0$, $h(x_k) = 1$ for all $k \neq k^*$ cannot be realized.
[/example]
### The Fundamental Theorem of Statistical Learning
The VC dimension does not merely provide upper bounds on Rademacher complexity — it completely characterizes the learnability of a hypothesis class. This is the content of the fundamental theorem of statistical learning theory.
[quotetheorem:1973]
The proof of this theorem uses the connection $\operatorname{VC}(\mathcal{H}) < \infty \Rightarrow s(\mathcal{H}, n) \leq (n+1)^d$ (Sauer–Shelah), the empirical Rademacher bound, and the symmetrization argument from the previous section. The converse direction ($\operatorname{VC}(\mathcal{H}) = \infty$ implies no uniform convergence) follows from a combinatorial argument showing that infinite VC dimension allows adversarial construction of hard distributions. The full proof uses techniques from empirical process theory; see Shalev-Shwartz and Ben-David, *Understanding Machine Learning*, Chapter 6.
[explanation: Why VC Dimension is the Right Complexity Measure]
One might wonder why VC dimension captures learnability so precisely. The intuition is as follows.
If $\operatorname{VC}(\mathcal{H}) = d$, then for any $n$ points, $\mathcal{H}$ can distinguish at most $(n+1)^d$ labelings — polynomial in $n$. This means that even though $\mathcal{H}$ may be an infinite class (infinitely many halfspaces, for instance), its effective complexity on any finite sample is controlled. The empirical distribution over $n$ points can only "see" the polynomial number of distinct behaviors, and the law of large numbers forces empirical risk to converge uniformly to true risk as $n \to \infty$.
Conversely, if $\operatorname{VC}(\mathcal{H}) = \infty$, then for any $n$ there exist $n$ points that $\mathcal{H}$ shatters. This means $\mathcal{H}$ can perfectly memorize any labeling of any training set of any size — it can always achieve zero training error, regardless of the true labels. No algorithm can distinguish between the signal and pure noise, making generalization impossible.
This dichotomy — polynomial growth versus exponential growth of the shattering function — is what the Sauer–Shelah lemma formalizes, and it is exactly the threshold that separates learnable from unlearnable in the binary classification setting.
[/explanation]
<!-- illustration-needed: Growth function s(H,n) vs n — show two curves: one that grows as 2^n (VC infinite) and one that transitions from 2^n for n≤d to polynomial growth (n+1)^d for n>d, illustrating the phase transition at the VC dimension -->
With the VC dimension framework in place, we have a complete combinatorial theory of generalization in binary classification. The next chapter turns from statistical questions — how many samples suffice? — to computational questions: how do we actually find the empirical risk minimizer efficiently? Chapter 4 will show that convexity is the structural key, and that replacing the intractable 0-1 loss with a convex surrogate (hinge, logistic, or exponential) connects the generalization bounds derived here to practical algorithms via the Zhang–Bartlett theorem.
The fundamental theorem of statistical learning tells us *whether* a hypothesis class is learnable, but not *how* to learn it efficiently. To move from theory to practice, we must understand the computational landscape—convexity, surrogate losses, and iterative optimization methods that scale to large datasets.
# 4. Computation for Empirical Risk Minimisation
Chapter 3 established that empirical risk minimisation over a hypothesis class $\mathcal{H}$ of bounded complexity produces hypotheses whose true risk is close to the best attainable in $\mathcal{H}$. Those results are reassuring in theory, but they say nothing about whether we can actually compute $\hat{h}$ in practice. For a general hypothesis class, finding the empirical risk minimiser can be computationally intractable — the empirical risk landscape may be riddled with local minima, making any gradient-based search unreliable. This chapter turns to the question of computation: when and how can we find $\hat{h}$ efficiently?
The key structural property that makes optimisation tractable is **convexity**. When the empirical risk can be written as a convex function minimised over a convex set, the local-to-global phenomenon ensures that any local minimum is a global minimum, and powerful algorithmic guarantees become available. The bulk of this chapter is therefore devoted to the theory of convex functions and the algorithms — gradient descent and its variants — that exploit this structure. We close with a discussion of how convexity can be enforced by replacing the intractable 0-1 loss with a convex surrogate.
## Convex Sets and Convex Functions
The chapter begins at the level of definitions. A set $C \subseteq \mathbb{R}^d$ is **convex** if it contains every line segment between its points:
[definition: Convex Set]
A set $C \subseteq \mathbb{R}^d$ is **convex** if for all $x, y \in C$ and all $t \in (0, 1)$,
\begin{align*}
(1-t)x + ty \in C.
\end{align*}
[/definition]
Convex sets are the natural domains for convex functions. Geometrically, a set is convex when you can stand anywhere inside it and see any other point without leaving the set — there are no concavities.
[definition: Convex Function]
Let $C \subseteq \mathbb{R}^d$ be a convex set. A function $f: C \to \mathbb{R}$ is **convex** if for all $x, y \in C$ and all $t \in (0, 1)$,
\begin{align*}
f\bigl((1-t)x + ty\bigr) \leq (1-t)f(x) + tf(y).
\end{align*}
The function $f$ is **strictly convex** if the inequality is strict whenever $x \neq y$. A function $f$ is **concave** if $-f$ is convex.
[/definition]
The defining inequality says that the function lies below or on the chord connecting any two of its points. If $-f$ is convex, then $f$ curves upward; if $f$ is concave, it curves downward. Strict convexity rules out the degenerate case where $f$ is affine on a line segment.
[example: Norms Are Convex]
Every norm $\|\cdot\|: \mathbb{R}^d \to [0, \infty)$ is convex. For any $x, y \in \mathbb{R}^d$ and $t \in (0,1)$, the triangle inequality gives
\begin{align*}
\|(1-t)x + ty\| \leq \|(1-t)x\| + \|ty\| = (1-t)\|x\| + t\|y\|.
\end{align*}
This verifies the definition directly. In particular, the Euclidean norm $\|\cdot\|_2$, the $\ell^1$ norm $\|\cdot\|_1$, and the $\ell^\infty$ norm are all convex. Norms are not strictly convex in general: the $\ell^1$ and $\ell^\infty$ norms are not strictly convex, since, for instance under $\|\cdot\|_1$, the entire segment between $(1,0)$ and $(0,1)$ lies on the unit sphere.
[/example]
<!-- illustration-needed: the convexity condition — a convex function f with two points x and y marked, showing the chord from (x, f(x)) to (y, f(y)) lies above the graph of f -->
## The Local-to-Global Phenomenon
One of the most important consequences of convexity for optimisation is that local minima are automatically global minima. This is what makes convex optimisation tractable in a way that non-convex optimisation is not.
[quotetheorem:1975]
[citeproof:1975]
The argument is short but decisive: local information forces a global conclusion. By contrast, a non-convex function can have many local minima with objective values far above the global minimum, and gradient-based algorithms may become trapped at any of them. This is why convexity is such a valuable structural property to identify.
## Three Ways to Verify Convexity
In practice, one verifies convexity not by checking the definition for every pair of points, but via one of three equivalent characterisations. The following proposition collects the most useful ones together.
[quotetheorem:1976]
[citeproof:1976]
These results give a practical toolkit. Part (i) tells us that convex combinations of convex functions are convex — this is why ridge-regularised empirical risk (a sum of a convex loss and a convex penalty) is convex. Part (iii) tells us that pointwise suprema of convex functions are convex — this will be important when studying support vector machines, where the loss is the maximum of affine functions. Part (v) is particularly powerful: for a differentiable convex function, setting the gradient to zero is not just a necessary condition for a global minimum, but also sufficient. Part (vii) reduces convexity to a linear-algebraic check on the Hessian, which is often the most efficient approach in practice.
[example: Quadratic Functions]
Consider $f: \mathbb{R}^d \to \mathbb{R}$ given by $f(x) = x^\top A x + b^\top x + c$ where $A \in \mathbb{R}^{d \times d}$ is symmetric. The Hessian is $H(x) = 2A$ (constant in $x$). By part (vii):
- $f$ is convex if and only if $A$ is positive semi-definite;
- $f$ is strictly convex if $A$ is positive definite.
When $A$ is positive definite, $f$ has a unique global minimiser. Setting $\nabla f(x) = 2Ax + b = 0$ and applying part (v), the unique minimiser is $x^* = -\frac{1}{2}A^{-1}b$.
The squared loss in ordinary least squares is an instance: minimising $\frac{1}{n}\sum_{i=1}^n (Y_i - x_i^\top \beta)^2$ over $\beta \in \mathbb{R}^d$ corresponds to minimising a quadratic $f(\beta) = \frac{1}{n}\|Y - \Phi\beta\|_2^2$, whose Hessian is $\frac{2}{n}\Phi^\top \Phi$. When $\Phi$ has full column rank, the Hessian is positive definite, confirming that the least squares objective is strictly convex and its minimiser is unique.
[/example]
[example: The Sublevel Set Property]
Let $f: \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x_1, x_2) = x_1^2 + x_2^2$ (the squared Euclidean norm). This is strictly convex with Hessian $H = 2I$, which is positive definite. The sublevel set $\{x : f(x) \leq r^2\}$ is the closed ball $\overline{B}(0, r)$, which is a convex set — consistent with part (iv). More generally, any norm ball $\{x : \|x\| \leq r\}$ is convex, because norms are convex functions.
[/example]
## Jensen's Inequality
The first-order condition in part (v) of the preceding theorem can be seen as a two-point version of a much more general inequality. Jensen's inequality extends the convexity condition from pairs of points to arbitrary probability distributions.
[quotetheorem:1977]
The proof of Jensen's inequality uses the first-order characterisation: since $f$ is convex and differentiable (or admits a subgradient at $\mathbb{E}[Z]$), we have $f(z) \geq f(\mathbb{E}[Z]) + c^\top(z - \mathbb{E}[Z])$ for some $c$, and taking expectations of both sides gives the result. The course does not prove this in full generality; the argument is a direct extension of the first-order condition in the Properties of Convex Functions theorem (part (v)).
Jensen's inequality appears throughout the course in two guises. The **unconditional** form $f(\mathbb{E}[Z]) \leq \mathbb{E}[f(Z)]$ is used in the analysis of optimisation algorithms (for instance, to pass from a bound on the average iterate to a bound on the function value at the average). The **conditional** form, which states that
\begin{align*}
f\bigl(\mathbb{E}[Z \mid W]\bigr) \leq \mathbb{E}\bigl[f(Z) \mid W\bigr],
\end{align*}
was already invoked in Chapter 1 when discussing conditional expectations and will recur in the analysis of stochastic gradient descent.
[remark: Jensen and the Direction of Optimisation]
Jensen's inequality implies that for a convex function $f$, the function value at the mean is never larger than the mean of the function values. This is a quantitative version of the intuition that convex functions "curve upward". In the context of optimisation, this is why passing from the empirical average iterate $\bar{\beta} = \frac{1}{k}\sum_{s=1}^k \beta_s$ to a bound on $f(\bar{\beta})$ via Jensen's inequality is a standard manoeuvre in convergence proofs for gradient descent and stochastic gradient descent — we will see this repeatedly in the sections that follow.
[/remark]
The properties developed in this section — the local-to-global theorem, the first-order and second-order characterisations, and Jensen's inequality — form the analytical backbone of the rest of the chapter. The immediate next step is to connect convexity to the original learning problem: the 0-1 loss that defines misclassification error is not convex, so computing its ERM is generally NP-hard. The resolution is to replace it with a convex surrogate loss that upper-bounds the 0-1 loss and admits efficient minimisation.
## Convex Surrogates
The previous section established that convexity is the key structural property enabling tractable optimisation. A natural question now arises: can we exploit convexity for the fundamental task of classification? The answer requires confronting a hardness result — that directly minimising misclassification loss is computationally intractable — and then building a principled substitute.
### The Intractability of 0-1 Loss ERM
In the classification setting with $\mathcal{Y} = \{-1, 1\}$, the most natural loss is the misclassification or 0-1 loss:
\begin{align*}
\ell(h(x), y) = \mathbb{1}_{\{h(x) \neq y\}}.
\end{align*}
For a linear hypothesis class parameterised by $\beta \in \mathbb{R}^p$, where $h_\beta(x) = \operatorname{sgn}(x^\top \beta)$, the empirical risk is
\begin{align*}
\hat{R}(h_\beta) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{\operatorname{sgn}(X_i^\top \beta) \neq Y_i\}} \approx \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{(-\infty, 0]}(Y_i X_i^\top \beta),
\end{align*}
where the approximation ignores the measure-zero event $X_i^\top \beta = 0$. This objective is not convex in $\beta$ — the indicator function $\mathbb{1}_{(-\infty,0]}$ is discontinuous and highly non-convex — and ERM over half-spaces under 0-1 loss is NP-hard in general.
The indicator function $\mathbb{1}_{(-\infty,0]}(u)$ penalises positive margins $u = yh(x) > 0$ (correct classifications) not at all, and penalises negative margins $u \leq 0$ (incorrect classifications or ties) by exactly $1$. If this indicator could be replaced by a convex function $\varphi : \mathbb{R} \to [0, \infty)$ that is also a reasonable approximation to it, then by the Properties of Convex Functions theorem (parts (i) and (ii)) from the previous section, the resulting empirical objective $\frac{1}{n}\sum_{i=1}^n \varphi(Y_i X_i^\top \beta)$ would be a convex function of $\beta$, making the optimisation tractable.
This motivates a change of framework: rather than restricting hypotheses to take values in $\{-1, 1\}$, we allow $h : \mathcal{X} \to \mathbb{R}$ and recover a classifier via $x \mapsto \operatorname{sgn}(h(x))$. The question of which convex $\varphi$ to use, and how good the resulting classifier is, is what this section addresses.
### Surrogate Loss Functions
We work in the following setting. Let $\mathcal{H}$ be a class of functions $h : \mathcal{X} \to \mathbb{R}$. Rather than classifying directly, each $h \in \mathcal{H}$ determines a classifier $x \mapsto \operatorname{sgn}(h(x))$.
We consider loss functions of the form $\ell(h(x), y) = \varphi(yh(x))$, where the **margin** $u = yh(x)$ captures the agreement between prediction $h(x)$ and label $y$: a positive margin means $h(x)$ and $y$ have the same sign (correct prediction), while a negative margin means they disagree.
[definition: Surrogate Loss and Phi-Risk]
Let $\varphi : \mathbb{R} \to [0, \infty)$ be a convex function. The associated **surrogate loss** is
\begin{align*}
\ell(h(x), y) = \varphi(yh(x)).
\end{align*}
The **$\varphi$-risk** of a hypothesis $h : \mathcal{X} \to \mathbb{R}$ is
\begin{align*}
R_\varphi(h) := \mathbb{E}[\varphi(Yh(X))].
\end{align*}
The **empirical $\varphi$-risk** is
\begin{align*}
\hat{R}_\varphi(h) := \frac{1}{n}\sum_{i=1}^n \varphi(Y_i h(X_i)).
\end{align*}
[/definition]
Note that formally $Y$ is treated as a real-valued random variable, even though the labels $(Y_i)_{i=1}^n$ take values in $\{-1, 1\}$.
The three most commonly used choices of $\varphi$ in practice are:
- **Hinge loss**: $\varphi(u) = \max(1 - u, 0)$. This is zero for margins $u \geq 1$ and grows linearly for $u < 1$. It penalises not just misclassifications ($u < 0$) but also correct predictions with insufficient margin ($0 \leq u < 1$).
- **Logistic loss**: $\varphi(u) = \log_2(1 + e^{-u}) = \frac{\log(1 + e^{-u})}{\log 2}$. This is positive for all $u \in \mathbb{R}$ (it never reaches zero), so logistic regression always has an incentive to push the margin higher.
- **Exponential loss**: $\varphi(u) = e^{-u}$. This is smooth and strictly convex everywhere, and grows much faster for negative margins. It is the loss underlying the AdaBoost algorithm.
<!-- illustration-needed: plot all four loss functions (0-1, hinge, logistic, exponential) as functions of the margin u from -1.5 to 1.5, showing how they upper-bound the 0-1 loss and differ in their rate of growth for negative u -->
All three are convex, differentiable (hinge loss is differentiable everywhere except at $u = 1$), and are pointwise upper bounds on the 0-1 indicator $\mathbb{1}_{(-\infty,0]}(u)$. Because they upper bound the 0-1 loss, minimising $\hat{R}_\varphi(h)$ also tends to reduce the misclassification rate.
[example: Hinge Loss is Convex]
Define $\varphi(u) = \max(1-u, 0)$. This is the pointwise maximum of two convex functions: the constant function $0$ and the affine function $u \mapsto 1-u$. By the Properties of Convex Functions theorem (part (iii)), the pointwise supremum of convex functions is convex, so $\varphi$ is convex.
The objective $\frac{1}{n}\sum_{i=1}^n \varphi(Y_i X_i^\top \beta)$ is a sum of functions of the form $u \mapsto \varphi(Y_i X_i^\top \beta)$, each of which is the composition of $\varphi$ with an affine map $\beta \mapsto Y_i X_i^\top \beta$. By the Properties of Convex Functions theorem (part (ii)), such a composition is convex, and by part (i), their sum is convex. The empirical hinge risk is therefore a convex function of $\beta$, which is precisely the computational advantage we sought.
[/example]
### Classification Calibration
Using a surrogate loss is only useful if minimising $R_\varphi$ is somehow related to minimising the misclassification risk. The ideal scenario is that the $h_{\varphi,0}$ minimising $R_\varphi$ (if it exists) determines a classifier $x \mapsto \operatorname{sgn}(h_{\varphi,0}(x))$ that agrees with the Bayes classifier $x \mapsto \operatorname{sgn}(\eta(x) - 1/2)$, where $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$.
To study this, we decompose the $\varphi$-risk conditionally on $X$. Recall that $Y \in \{-1, 1\}$ with $\mathbb{P}(Y = 1 \mid X = x) = \eta(x)$. The conditional $\varphi$-risk at a fixed $x$, if $h(x) = \alpha$, is
\begin{align*}
\mathbb{E}[\varphi(Yh(X)) \mid X = x] &= \eta(x)\varphi(\alpha) + (1-\eta(x))\varphi(-\alpha).
\end{align*}
Treating $\eta \in [0,1]$ and $\alpha \in \mathbb{R}$ as generic variables, define
\begin{align*}
C_\eta(\alpha) := \eta\varphi(\alpha) + (1-\eta)\varphi(-\alpha).
\end{align*}
The function $C_\eta(\alpha)$ is the expected loss at a single point when the conditional class probability is $\eta$ and we predict $\alpha$. Minimising the global $\varphi$-risk reduces to minimising $C_\eta(\alpha)$ over $\alpha \in \mathbb{R}$ pointwise in $x$ (up to measurability).
The key question is: does minimising $C_\eta(\alpha)$ over $\alpha$ lead to a prediction with the same sign as $\eta - 1/2$? If $\eta > 1/2$ (the Bayes-optimal prediction is $+1$), we want the optimal $\alpha$ to be positive.
[definition: Classification Calibration]
A function $\varphi : \mathbb{R} \to [0, \infty)$ is **classification calibrated** if for every $\eta \in [0,1]$ with $\eta \neq 1/2$,
\begin{align*}
\inf_{\alpha \in \mathbb{R}} C_\eta(\alpha) < \inf_{\alpha \,:\, \alpha(2\eta - 1) \leq 0} C_\eta(\alpha).
\end{align*}
[/definition]
The condition requires that the infimum of $C_\eta$ over all $\alpha$ is strictly less than the infimum over those $\alpha$ that disagree in sign with the Bayes-optimal prediction. In other words: the optimal $\alpha$ (the one minimising the surrogate risk) must have the same sign as $2\eta - 1$, ruling out a pathological situation where the surrogate loss minimum is achieved at the wrong sign.
The following theorem gives a clean sufficient condition for classification calibration.
[quotetheorem:1978]
[citeproof:1978]
[remark: Calibration of Standard Surrogates]
Each of the three standard surrogate losses is differentiable at $0$ with $\varphi'(0) < 0$:
- **Hinge**: $\varphi(u) = \max(1-u,0)$, so $\varphi'(0) = -1 < 0$.
- **Logistic**: $\varphi(u) = \log(1+e^{-u})/\log 2$, so $\varphi'(0) = -\frac{1}{2\log 2} < 0$.
- **Exponential**: $\varphi(u) = e^{-u}$, so $\varphi'(0) = -1 < 0$.
By the theorem, all three are classification calibrated. This provides theoretical justification for using these surrogates in practice.
[/remark]
### The Zhang–Bartlett Theorem
Classification calibration is a qualitative guarantee: it ensures that the minimiser of $R_\varphi$ leads to a Bayes-optimal classifier. A deeper quantitative result — the **Zhang–Bartlett theorem** — relates the excess $\varphi$-risk directly to the excess misclassification risk.
To state it, recall that the Bayes misclassification risk is $R^* = \inf_h R(h)$, where $R(h)$ is the true misclassification risk. Write $R_\varphi^* = \inf_h R_\varphi(h)$ for the optimal $\varphi$-risk.
[quotetheorem:1980]
The theorem is not proved in the course; its proof requires tools from convex analysis and the theory of calibration functions that we do not develop here. A detailed proof can be found in Zhang (2004) and Bartlett–Jordan–McAuliffe (2006).
The Zhang–Bartlett theorem says that if the excess surrogate risk $R_\varphi(h) - R_\varphi^*$ is small, then the excess misclassification risk $R(h) - R^*$ is also small (since $\psi$ is continuous with $\psi(0) = 0$). This completes the justification of the surrogate approach: we minimise an efficiently-computable convex surrogate, and the resulting classifier is provably close to Bayes-optimal in terms of misclassification risk whenever the excess surrogate risk is small.
[explanation: Why Calibration Matters for Learning Bounds]
Combining the Zhang–Bartlett theorem with the generalisation bounds from the previous chapter gives a complete picture. Suppose we use ERM with a classification-calibrated surrogate $\varphi$. From the learning theory results, we have a bound on $R_\varphi(\hat{h}) - R_\varphi^*$ in terms of the sample size $n$ and the complexity of the hypothesis class (Rademacher complexity). Zhang–Bartlett then converts this into a bound on $R(\hat{h}) - R^*$.
The three-step chain is:
1. The convexity of $\varphi$ makes $\hat{R}_\varphi(h)$ tractable to minimise computationally.
2. Generalisation theory bounds $R_\varphi(\hat{h}) - R_\varphi^*$ in terms of $n$ and the complexity of $\mathcal{H}$.
3. Zhang–Bartlett converts this into a bound on the excess misclassification risk.
Each step addresses a different fundamental difficulty: computational tractability, statistical efficiency, and loss conversion. The entire framework of surrogate-based learning rests on this three-step logic.
[/explanation]
## Rademacher Complexity Revisited
We introduced Rademacher complexity in Chapter 3 as the key tool for obtaining data-dependent generalization bounds that go beyond finite hypothesis classes. The expected risk bound from that chapter — $\mathbb{E}R(\hat{h}) - R(h^*) \leq 2\mathcal{R}_n(\mathcal{F})$ — involves the Rademacher complexity of the composed class
\begin{align*}
\mathcal{F} = \{(x, y) \mapsto \phi(y h(x)) : h \in \mathcal{H}\},
\end{align*}
where $\phi$ is a convex surrogate loss. But so far we have only computed $R_n(\mathcal{H})$ in terms of shattering coefficients and VC dimension — tools that rely on the misclassification loss taking values in $\{0, 1\}$. For a general Lipschitz surrogate $\phi$, we need a different strategy.
The plan is to relate $R_n(\mathcal{F})$ to $R_n(\mathcal{H})$, the Rademacher complexity of the original hypothesis class, which is typically simpler to control. The tool that makes this reduction possible is the **contraction lemma**, which says: composing a Lipschitz function with each element of a function class can only reduce the Rademacher complexity, up to the Lipschitz constant.
### The Contraction Lemma
Before stating the result formally, it is worth understanding why such a contraction should hold. The Rademacher complexity measures the extent to which a function class can correlate with random $\pm 1$ noise. If $\phi$ is Lipschitz with constant $L$, then the outputs $\phi(y_i h(x_i))$ are more tightly clustered than the outputs $y_i h(x_i)$ — the function $\phi$ contracts the range. Correlating contracted outputs with noise should therefore be at most as easy as correlating the original outputs, with the Lipschitz constant $L$ tracking the rescaling.
[definition: Empirical Rademacher Complexity]
Let $\mathcal{G}$ be a class of functions $g : \mathcal{Z} \to \mathbb{R}$ and let $z_{1:n} = (z_1, \ldots, z_n) \in \mathcal{Z}^n$. The **empirical Rademacher complexity** of $\mathcal{G}$ with respect to $z_{1:n}$ is
\begin{align*}
\hat{R}(\mathcal{G}(z_{1:n})) := \mathbb{E}\!\left[\sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i g(z_i)\right],
\end{align*}
where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Rademacher random variables, each taking values $\pm 1$ with equal probability. The **Rademacher complexity** is $R_n(\mathcal{G}) := \mathbb{E}[\hat{R}(\mathcal{G}(Z_{1:n}))]$, where the outer expectation is over i.i.d. $Z_i \sim P_0$.
[/definition]
[quotetheorem:1981]
[citeproof:1981]
The lemma is remarkably clean: the Rademacher complexity of the composed class $\mathcal{F}$ is no larger than $L$ times the complexity of the original class $\mathcal{H}$. This reduces the problem of bounding generalization for surrogate-loss ERM to the problem of bounding $R_n(\mathcal{H})$, which involves only the hypothesis class and not the loss.
[corollary: Excess $\phi$-Risk Bound via Contraction]
Under the hypotheses of the Contraction Lemma, let $\hat{h}$ be the ERM with $\phi$-loss over $\mathcal{H}$, and let $h^* \in \arg\min_{h \in \mathcal{H}} R_\phi(h)$. Then
\begin{align*}
\mathbb{E}R_\phi(\hat{h}) - R_\phi(h^*) \leq 2L\,R_n(\mathcal{H}).
\end{align*}
[/corollary]
This corollary is the practical payoff of the contraction lemma. For the bound to be useful, $R_n(\mathcal{H})$ must be finite and decay to zero as $n \to \infty$. This is not guaranteed for all hypothesis classes. Consider the unconstrained linear class $\mathcal{H} = \{x \mapsto x^\top \beta : \beta \in \mathbb{R}^p\}$: for any fixed data points $x_1, \ldots, x_n$, one can choose $\beta$ to make $\sum_i \varepsilon_i x_i^\top \beta$ arbitrarily large, so $\hat{R}(\mathcal{H}(x_{1:n})) = +\infty$. To obtain finite Rademacher complexity for linear classes, we must constrain the norm of $\beta$. This leads naturally to two families of constrained hypothesis classes: those using the $\ell_2$ norm and those using the $\ell_1$ norm.
## ℓ₂-Constrained Linear Classes
The $\ell_2$-constrained linear class is defined by
\begin{align*}
\mathcal{H} = \{x \mapsto x^\top \beta : \beta \in \mathbb{R}^p,\; \|\beta\|_2 \leq \lambda\}
\end{align*}
for some $\lambda > 0$. This is the natural class for the support vector machine and ridge regression. The $\ell_2$ constraint keeps all coefficients small simultaneously: every predictor contributes, but no single one can dominate.
[quotetheorem:1982]
[citeproof:1982]
The bound $R_n(\mathcal{H}) = O(\lambda(\mathbb{E}\|X\|_2^2)^{1/2}/\sqrt{n})$ is the first genuinely useful rate in our analysis. The dependence on $n$ is $1/\sqrt{n}$: generalization improves as more data is collected, at the optimal parametric rate. The dependence on the dimension $p$ is hidden inside $\mathbb{E}\|X\|_2^2 = \sum_{j=1}^p \mathbb{E}[X_j^2]$, which grows linearly in $p$ if the coordinates of $X$ all have comparable variance. Thus the bound scales as $O(\lambda\sqrt{p}/\sqrt{n})$ in this case, reflecting a mild dimensional dependence.
[example: Support Vector Classifier via Contraction]
Take $\phi$ to be the hinge loss $\phi(u) = (1-u)_+$ and $\mathcal{H}$ the $\ell_2$-constrained linear class with parameter $\lambda$. The ERM with hinge loss over $\mathcal{H}$ is the **support vector classifier** (also called the hard-margin or soft-margin SVM, depending on the constraint level). The hinge loss satisfies $|\phi(u) - \phi(u')| = |(1-u)_+ - (1-u')_+| \leq |u - u'|$, so it is Lipschitz with constant $L = 1$. Combining the Contraction Lemma (Corollary above) with the $\ell_2$ Rademacher complexity bound, we obtain
\begin{align*}
\mathbb{E}R_\phi(\hat{h}) - R_\phi(h^*) \leq \frac{2\lambda\,(\mathbb{E}\|X\|_2^2)^{1/2}}{\sqrt{n}}.
\end{align*}
This bound is distribution-free (it holds for any distribution on $X$) and decreases at the rate $n^{-1/2}$. The constraint parameter $\lambda$ plays the role of a regularization strength: larger $\lambda$ allows a richer hypothesis class (smaller approximation error) but incurs a larger Rademacher complexity term (larger estimation error). In practice, $\lambda$ is chosen by cross-validation to balance these two forces.
[/example]
## ℓ₁-Constrained Linear Classes
The $\ell_1$-constraint enforces **sparsity**: by penalizing the sum of absolute coefficient values, it tends to drive many coefficients to zero and concentrate signal in a small number of predictors. This makes the $\ell_1$-constrained class well-adapted to high-dimensional settings where $p$ may be comparable to or much larger than $n$, but the true signal is supported on only a few coordinates.
[definition: ℓ₁-Constrained Linear Class]
For $\lambda > 0$, define
\begin{align*}
\mathcal{H} = \{x \mapsto x^\top \beta : \beta \in \mathbb{R}^p,\; \|\beta\|_1 \leq \lambda\},
\end{align*}
where $\|\beta\|_1 := \sum_{j=1}^p |\beta_j|$ is the $\ell_1$ norm. ERM with a Lipschitz surrogate over this class is known as **Lasso** (in the regression formulation) or its classification analog.
[/definition]
The Rademacher complexity computation for the $\ell_1$ class is different in character from the $\ell_2$ case. We use the dual relationship between the $\ell_1$ and $\ell_\infty$ norms: $|\beta^\top v| \leq \|\beta\|_1 \max_j |v_j|$, which is a special case of Hölder's inequality. The supremum over $\|\beta\|_1 \leq \lambda$ of $\beta^\top v$ equals $\lambda \|v\|_\infty = \lambda \max_j |v_j|$, reducing the Rademacher complexity to a maximum over $p$ sub-Gaussian random variables.
[quotetheorem:1984]
[citeproof:1984]
[remark: The $\sqrt{\log p}$ Factor]
The appearance of $\sqrt{\log p}$ rather than $\sqrt{p}$ is the hallmark of the $\ell_1$ constraint. It arises from taking the maximum over $p$ sub-Gaussian variables: the expected maximum of $p$ independent standard Gaussians is $O(\sqrt{\log p})$, not $O(\sqrt{p})$. This logarithmic dependence is what makes $\ell_1$-based methods effective in high dimensions: when $p \gg n$, a bound of order $\sqrt{\log p / n}$ can still be meaningful, whereas $\sqrt{p/n}$ would be vacuous.
[/remark]
### Comparing ℓ₁ and ℓ₂ Constraints
The two constraints give qualitatively different generalization behavior depending on the structure of the underlying signal. We make this precise through an example.
[example: Dense versus Sparse Signals]
Take $\phi$ to be the logistic loss $\phi(u) = \log(1 + e^{-u})/\log 2$, which is Lipschitz with constant $L = 1/\log 2$ (since $|\phi'(u)| = e^{-u}/((1 + e^{-u})\log 2) \leq 1/\log 2$). Let $\mathcal{X} = [-1, 1]^p$. Write $\mathcal{H}_1$ and $\mathcal{H}_2$ for the $\ell_1$- and $\ell_2$-constrained hypothesis classes with parameters $\lambda_1$ and $\lambda_2$ respectively. By Corollary 15 and the Rademacher complexity bounds,
\begin{align*}
\mathbb{E}R_\phi(\hat{h}) - R_\phi(h^*) \leq \begin{cases} \dfrac{2\lambda_1}{\log 2}\sqrt{\dfrac{2\log(2p)}{n}} & \text{for } \mathcal{H}_1, \\ \dfrac{2\lambda_2\sqrt{p}}{\sqrt{n}} & \text{for } \mathcal{H}_2. \end{cases}
\end{align*}
Now compare these bounds for two canonical signal structures.
**Case 1: Dense signal.** Suppose $\beta^0 = (1/\sqrt{p}, \ldots, 1/\sqrt{p})^\top$, with all $p$ coordinates equal and each of magnitude $1/\sqrt{p}$. To include $\beta^0$ in $\mathcal{H}_1$ we need $\|\beta^0\|_1 = p \cdot (1/\sqrt{p}) = \sqrt{p} \leq \lambda_1$, so the minimum required $\lambda_1 = \sqrt{p}$. To include $\beta^0$ in $\mathcal{H}_2$ we need $\|\beta^0\|_2 = 1 \leq \lambda_2$, so $\lambda_2 = 1$ suffices. Substituting these minimal constraint parameters into the bounds:
\begin{align*}
\ell_1: \quad O\!\left(\sqrt{\frac{p\log p}{n}}\right), \qquad \ell_2: \quad O\!\left(\sqrt{\frac{p}{n}}\right).
\end{align*}
Both bounds scale as $\sqrt{p/n}$, up to a $\sqrt{\log p}$ factor. For dense signals, the $\ell_2$ constraint gives a sharper bound.
**Case 2: Sparse signal.** Suppose $\beta^0 = (1/\sqrt{s}, \ldots, 1/\sqrt{s}, 0, \ldots, 0)^\top$, with only $s \ll p$ nonzero coordinates, each of magnitude $1/\sqrt{s}$. Now $\|\beta^0\|_1 = s/\sqrt{s} = \sqrt{s}$, so $\lambda_1 = \sqrt{s}$. The $\ell_2$ norm $\|\beta^0\|_2 = 1$, so $\lambda_2 = 1$ still. The bounds become:
\begin{align*}
\ell_1: \quad O\!\left(\sqrt{\frac{s\log p}{n}}\right), \qquad \ell_2: \quad O\!\left(\sqrt{\frac{p}{n}}\right).
\end{align*}
The $\ell_1$ bound now depends on $s$ and $\log p$ rather than on $p$. When $s \log p \ll p$ — the high-dimensional sparse regime — the $\ell_1$ constraint can yield dramatically better bounds.
[/example]
The example crystallizes the fundamental trade-off between the two constraint families. The $\ell_2$ constraint is oblivious to sparsity: its Rademacher complexity always depends on the ambient dimension $p$ through $\mathbb{E}\|X\|_2^2$. The $\ell_1$ constraint, by contrast, controls complexity through $\max_j \mathbb{E}[X_j^2]$ and $\log p$, exploiting the fact that only a few coordinates matter.
[explanation: Why Norm Constraints Control Generalization]
It is worth pausing to reflect on the broader principle at work here. Rademacher complexity measures the ability of a hypothesis class to fit random noise. For linear hypotheses $h_\beta(x) = x^\top \beta$, fitting noise means choosing $\beta$ so that $\sum_i \varepsilon_i x_i^\top \beta$ is large. Without any constraint on $\beta$, this is trivial — one can always find a direction that aligns with the noise. Norm constraints prevent this by limiting the "reach" of $\beta$.
The $\ell_2$ constraint limits $\|\beta\|_2$, and by Cauchy–Schwarz, the best alignment with any fixed vector $v$ is at most $\lambda\|v\|_2$. The Rademacher variable $\sum_i \varepsilon_i x_i$ has typical size $(\sum_i \|x_i\|_2^2)^{1/2}$, giving $R_n(\mathcal{H}) \sim \lambda(\mathbb{E}\|X\|_2^2)^{1/2}/\sqrt{n}$.
The $\ell_1$ constraint limits $\|\beta\|_1$, and the dual bound $\|\beta\|_1 \cdot \max_j |v_j| \geq \beta^\top v$ shows that the relevant quantity is the maximum coordinate of $\sum_i \varepsilon_i x_i$, not its $\ell_2$ norm. This maximum is controlled by the maximum-of-$p$-sub-Gaussians bound, which grows only logarithmically in $p$.
In both cases, the Rademacher complexity decays as $1/\sqrt{n}$, reflecting the fundamental statistical rate. What differs is the dimensional factor: polynomial in $p$ for $\ell_2$, logarithmic in $p$ for $\ell_1$. This logarithmic factor is the mathematical mechanism by which sparsity-inducing regularization enables high-dimensional learning.
[/explanation]
These results reveal norm constraints not merely as computational or regularization devices, but as tools with a precise statistical interpretation: they control the Rademacher complexity of the hypothesis class and thereby guarantee that the ERM generalizes. The $\ell_1$ constraint is particularly powerful in the high-dimensional regime where the number of predictors $p$ far exceeds the sample size $n$, provided the signal is concentrated in few coordinates. In the next section, we turn from these generalization-theoretic considerations to the computational question of how to actually solve the resulting optimization problems efficiently, using projections onto convex constraint sets and subgradient methods.
## Projections onto Convex Sets and Subgradients
The generalization bounds obtained in the previous section tell us that ERM over $\ell_1$- and $\ell_2$-constrained linear classes produces hypotheses that generalize well. But knowing that the best hypothesis in the class is good is only half the story: we must also be able to *find* it efficiently. Both constraint classes require minimizing a convex function subject to a convex set constraint, and for non-smooth objectives (such as the hinge loss), the gradient may not exist everywhere. This section develops the two geometric tools that underlie the optimization algorithms of the next section: the projection onto a convex set, and the subgradient.
### Projections onto Convex Sets
When we minimize over a constrained set — for instance, finding the ERM over the $\ell_2$ ball $\{\beta : \|\beta\|_2 \leq \lambda\}$ — the natural iterative approach is to take a gradient step and then pull the iterate back into the feasible region. The mechanism for this "pull back" is orthogonal projection.
[definition: Projection onto a Convex Set]
Let $C \subseteq \mathbb{R}^d$ be a closed convex set and let $x \in \mathbb{R}^d$. The **projection** of $x$ onto $C$ is defined as
\begin{align*}
\pi_C(x) := \operatorname{argmin}_{z \in C} \|x - z\|_2.
\end{align*}
That is, $\pi_C(x)$ is the point in $C$ closest to $x$ in Euclidean distance.
[/definition]
Existence of the minimizer is not obvious: the infimum over an unbounded set need not be attained. The key observation is that we can truncate the search to a compact set. Let $\mu = \inf_{z \in C} \|x - z\|_2$ and set $B = \{w : \|w - x\|_2 \leq \mu + 1\}$. Then $C \cap B$ is closed and bounded, so the continuous function $z \mapsto \|x - z\|_2$ attains its infimum on $C \cap B$; and any minimizer on $C \cap B$ is also a minimizer on $C$, since all points of $C$ outside $B$ are farther from $x$.
Uniqueness follows from strict convexity. The function $z \mapsto \|x - z\|_2^2$ is strictly convex in $z$, so it can have at most one minimizer over any convex set. Thus $\pi_C(x)$ is well-defined for every $x \in \mathbb{R}^d$.
The projection satisfies two fundamental properties.
[quotetheorem:1985]
[citeproof:1985]
<!-- illustration-needed: projection onto a convex set — show a closed convex set C in the plane, a point x outside C, the projection pi_C(x) on the boundary, and the variational condition (x - pi_C(x)) perpendicular to the supporting hyperplane at pi_C(x), with the angle condition (x - pi_C(x))^T(z - pi_C(x)) <= 0 illustrated for a generic z in C -->
The variational characterization (i) is the projection analog of the first-order optimality condition: it says that the vector from $\pi_C(x)$ to $x$ makes an obtuse angle with every vector from $\pi_C(x)$ into $C$. Geometrically, no displacement within $C$ from $\pi_C(x)$ can bring us closer to $x$. The non-expansiveness property (ii) says that projection is a contraction: it cannot increase the distance between two points. This is the property that will be used in the convergence analysis of projected gradient descent.
[example: Projection onto the ℓ₂ Ball]
Let $C = \{\beta \in \mathbb{R}^d : \|\beta\|_2 \leq \lambda\}$ be the closed Euclidean ball of radius $\lambda$ centered at the origin. For $x \in \mathbb{R}^d$, we have
\begin{align*}
\pi_C(x) = \begin{cases} x & \text{if } \|x\|_2 \leq \lambda, \\ \lambda\, x / \|x\|_2 & \text{if } \|x\|_2 > \lambda. \end{cases}
\end{align*}
To verify: if $\|x\|_2 \leq \lambda$ then $x \in C$ and $\pi_C(x) = x$. If $\|x\|_2 > \lambda$, we check the variational condition (i): for any $z \in C$,
\begin{align*}
\left(x - \frac{\lambda x}{\|x\|_2}\right)^\top(z - \frac{\lambda x}{\|x\|_2}) = \|x\|_2\left(1 - \frac{\lambda}{\|x\|_2}\right)\left(\frac{x^\top z}{\|x\|_2} - \lambda\right).
\end{align*}
Since $\|x\|_2 > \lambda$, the first factor is positive. By Cauchy–Schwarz, $x^\top z \leq \|x\|_2 \|z\|_2 \leq \lambda\|x\|_2$, so $x^\top z / \|x\|_2 \leq \lambda$, making the second factor non-positive. The entire expression is therefore $\leq 0$, confirming that $\lambda x/\|x\|_2$ is the projection. In words: if $x$ is outside the ball, project radially onto the boundary sphere by scaling to unit norm and multiplying by $\lambda$.
[/example]
### Subgradients and the Subgradient Inequality
Many functions that arise naturally in machine learning are convex but not differentiable everywhere. The hinge loss $\phi(u) = (1-u)_+$ has a kink at $u = 1$; the $\ell_1$ norm $\|\beta\|_1 = \sum_j |\beta_j|$ is not differentiable when any coordinate $\beta_j = 0$. To run an analog of gradient descent for such functions, we need a substitute for the gradient that retains its essential optimality-certificate property.
The key property of the gradient for a differentiable convex function $f$ is the **gradient inequality**:
\begin{align*}
f(z) \geq f(x) + \nabla f(x)^\top(z - x) \quad \text{for all } z \in \mathbb{R}^d.
\end{align*}
This says the tangent hyperplane at $(x, f(x))$ lies globally below the graph of $f$. For a general convex function, we seek any vector $g$ that satisfies this inequality — whether or not the gradient exists.
[definition: Subgradient and Subdifferential]
Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex. A vector $g \in \mathbb{R}^d$ is a **subgradient** of $f$ at $x$ if
\begin{align*}
f(z) \geq f(x) + g^\top(z - x) \quad \text{for all } z \in \mathbb{R}^d.
\end{align*}
The set of all subgradients of $f$ at $x$ is called the **subdifferential** of $f$ at $x$, denoted $\partial f(x)$.
[/definition]
The inequality $f(z) \geq f(x) + g^\top(z - x)$ is called the **subgradient inequality**. Geometrically, $g$ defines a supporting hyperplane to the epigraph $\{(z, y) : y \geq f(z)\}$ at the point $(x, f(x))$.
[quotetheorem:1987]
[citeproof:1987]
This existence result is deeper than it may appear: it guarantees that the subdifferential is never empty, so we can always find a direction to descend. For differentiable functions, the subdifferential collapses to the gradient.
[quotetheorem:1988]
[citeproof:1988]
The subgradient is not unique when $f$ is non-differentiable: the subdifferential $\partial f(x)$ is a convex set, and choosing any element from it gives a valid descent direction. The following calculus rules allow us to compute subdifferentials of composite and structured functions.
[quotetheorem:1989]
These rules make the subdifferential computable in practice. Property (ii) says that the subdifferential of a sum is the sum of subdifferentials; property (iii) is the chain rule for subgradients.
[example: Subdifferential of the Hinge Empirical Risk]
Consider the empirical hinge loss
\begin{align*}
f(\beta) = \frac{1}{n}\sum_{i=1}^n \max(1 - y_i x_i^\top \beta,\, 0).
\end{align*}
Let $\phi(u) = \max(1 - u, 0)$. The subdifferential of $\phi$ is
\begin{align*}
\partial \phi(u) = \begin{cases} \{0\} & \text{if } u > 1, \\ [-1, 0] & \text{if } u = 1, \\ \{-1\} & \text{if } u < 1. \end{cases}
\end{align*}
To verify the case $u < 1$: the function $\phi$ is linear with slope $-1$ near $u$, so the only subgradient is $-1$. At $u = 1$, both $0$ and $-1$ are valid subgradients (any $t \in [-1,0]$ works), reflecting the kink. At $u > 1$, $\phi$ is zero and flat, so $0$ is the only subgradient.
By subgradient calculus rule (iii), writing $h_i(\beta) = \max(1 - y_i x_i^\top \beta, 0)$, we have
\begin{align*}
\partial h_i(\beta) = \begin{cases} \{0\} & \text{if } y_i x_i^\top \beta > 1, \\ \{-y_i x_i t : t \in [0,1]\} & \text{if } y_i x_i^\top \beta = 1, \\ \{-y_i x_i\} & \text{if } y_i x_i^\top \beta < 1. \end{cases}
\end{align*}
By rule (ii) applied $n$ times, $\partial f(\beta)$ consists of vectors of the form
\begin{align*}
-\frac{1}{n}\sum_{i=1}^n y_i x_i t_i,
\end{align*}
where each $t_i \in [0,1]$ is determined by the value of $y_i x_i^\top \beta$: $t_i = 0$ if $y_i x_i^\top \beta > 1$, $t_i = 1$ if $y_i x_i^\top \beta < 1$, and $t_i \in [0,1]$ if $y_i x_i^\top \beta = 1$.
[/example]
The subgradient computed in this example has a concrete interpretation. Training points with $y_i x_i^\top \beta \geq 1$ — those correctly classified with sufficient margin — contribute nothing to the subgradient. Only the misclassified or borderline points (those with $y_i x_i^\top \beta < 1$) contribute a nonzero term $-y_i x_i/n$. This is the basis for the sparse update structure of support vector machine training.
[remark: Subgradients as Optimality Certificates]
The subgradient inequality immediately gives an optimality condition: $\hat{\beta}$ minimizes $f$ over $\mathbb{R}^d$ if and only if $0 \in \partial f(\hat{\beta})$. For constrained minimization over a closed convex set $C$, the condition becomes $0 \in \partial f(\hat{\beta}) + N_C(\hat{\beta})$, where $N_C(\hat{\beta}) = \{v : v^\top(z - \hat{\beta}) \leq 0 \text{ for all } z \in C\}$ is the normal cone of $C$ at $\hat{\beta}$. This generalizes the familiar condition that the gradient vanishes at an interior minimum.
[/remark]
With both projections and subgradients in hand, we have all the ingredients needed to formulate projected gradient descent: at each step, compute a subgradient of the objective, take a step in the negative subgradient direction, and project back onto the constraint set. The convergence analysis of this algorithm in the next section will make essential use of both the non-expansiveness of the projection (part (ii) of the Projection Theorem) and the subgradient inequality.
## Gradient Descent
The tools assembled in the preceding sections — projections onto convex constraint sets, subgradients for non-smooth functions — were developed with a concrete purpose in mind: they are precisely what is needed to build an efficient algorithm for minimising convex functions over convex sets. This section presents that algorithm, projected gradient descent, and establishes a rigorous convergence guarantee.
### The Steepest Descent Idea
The motivation for gradient descent begins with a first-order Taylor expansion. Suppose $f: \mathbb{R}^p \to \mathbb{R}$ is differentiable at a point $\beta$ with gradient $g = \nabla f(\beta)$. For any direction $\delta$ with $\|\delta\|_2 = 1$ and small step size $\eta > 0$,
\begin{align*}
f(\beta + \eta \delta) \approx f(\beta) + \eta g^\top \delta.
\end{align*}
To minimise the right-hand side over all unit directions $\delta$, we should choose $\delta = -g/\|g\|_2$, i.e., move in the direction of the negative gradient. This is the direction of steepest descent of the linear approximation to $f$ at $\beta$.
Gradient descent turns this local observation into an iterative procedure: starting from some $\beta_1$, at each step take a small step in the direction of the negative (sub)gradient. When the minimiser is constrained to lie in a convex set $C$, we add a projection step after each gradient update to ensure the iterates remain feasible.
### The Algorithm
[definition: Projected Gradient Descent]
Let $f: \mathbb{R}^p \to \mathbb{R}$ be a convex function and $C \subseteq \mathbb{R}^p$ a closed convex set. Given an initial point $\beta_1 \in C$, a number of iterations $k \in \mathbb{N}$, and a sequence of positive step sizes $(\eta_s)_{s=1}^{k-1}$, **projected gradient descent** produces iterates $\beta_1, \beta_2, \ldots, \beta_k$ by the following rule. For each $s = 1, \ldots, k-1$:
1. Choose a subgradient $g_s \in \partial f(\beta_s)$.
2. Compute the gradient step: $z_{s+1} = \beta_s - \eta_s g_s$.
3. Project back onto the constraint set: $\beta_{s+1} = \pi_C(z_{s+1})$.
The algorithm returns the averaged iterate
\begin{align*}
\bar{\beta} = \frac{1}{k} \sum_{s=1}^k \beta_s.
\end{align*}
[/definition]
A few features of this definition deserve comment. First, the use of a subgradient $g_s \in \partial f(\beta_s)$ rather than the gradient makes the algorithm applicable to non-smooth convex functions such as the hinge loss — wherever $f$ is differentiable, the subdifferential is a singleton and $g_s = \nabla f(\beta_s)$.
Second, the projection step $\beta_{s+1} = \pi_C(z_{s+1})$ is essential for constrained problems. Without it, the iterates would drift outside $C$ and the algorithm would fail to respect the constraint. The key property of the projection — that it is non-expansive, i.e., $\|\pi_C(x) - \pi_C(y)\|_2 \leq \|x - y\|_2$ — will play a crucial role in the convergence proof.
Third, the algorithm returns the average $\bar{\beta}$ of all iterates rather than the final iterate $\beta_k$. This averaging is standard in subgradient methods: individual iterates may oscillate, but the average converges steadily.
<!-- illustration-needed: the projected gradient descent trajectory — show a convex constraint set C (e.g. a ball), iterates beta_s moving via gradient steps outside C and being projected back onto C, with the path spiralling toward the minimiser -->
### Convergence of Gradient Descent
The main theorem of this section establishes that gradient descent achieves a convergence rate of $O(1/\sqrt{k})$ in the objective value after $k$ iterations.
[quotetheorem:1990]
[citeproof:1990]
The bound $2LR/\sqrt{k}$ captures the interaction between three quantities: $R$, the radius of the constraint set; $L$, the Lipschitz constant of $f$ (the maximum subgradient norm); and $k$, the number of iterations. To achieve accuracy $\varepsilon$, one needs $k = O(L^2R^2/\varepsilon^2)$ iterations — quadratic in the inverse accuracy. This is the canonical $O(1/\sqrt{k})$ convergence rate for subgradient methods on Lipschitz convex functions, and it is known to be unimprovable in general without additional assumptions on $f$.
[remark: The Role of Averaging]
The averaging step is not cosmetic — it is essential for the proof. Individual iterates $\beta_s$ do not necessarily converge to the minimiser; the step in the direction of the subgradient may overshoot when the function is non-smooth. The average $\bar{\beta}$ accumulates the progress made at every step and suppresses oscillations. Jensen's inequality then converts the bound on the average function value into a bound on the function value at the average.
[/remark]
### Application: ERM with Hinge Loss
The convergence theorem becomes concrete in the setting we have been studying throughout this chapter.
[example: Gradient Descent for the Support Vector Classifier]
Consider empirical risk minimisation with hinge loss $\varphi(u) = \max(1-u, 0)$, where the input space is $\mathcal{X} = \{x \in \mathbb{R}^p : \|x\|_2 \leq C\}$ and the hypothesis class is the $\ell^2$-constrained class
\begin{align*}
\mathcal{H} = \{x \mapsto x^\top \beta : \|\beta\|_2 \leq \lambda\}.
\end{align*}
The objective function is $f(\beta) = \frac{1}{n}\sum_{i=1}^n \max(1 - y_i x_i^\top \beta, 0)$. From the subgradient calculus developed in the previous section (Proposition 19 and Example 10), a subgradient of $f$ at $\beta$ takes the form
\begin{align*}
g = -\frac{1}{n}\sum_{i=1}^n y_i x_i t_i, \quad t_i \in [0,1].
\end{align*}
By the triangle inequality,
\begin{align*}
\|g\|_2 \leq \frac{1}{n}\sum_{i=1}^n |t_i| \cdot \|y_i x_i\|_2 \leq \frac{1}{n}\sum_{i=1}^n \|x_i\|_2 \leq C,
\end{align*}
so the Lipschitz constant is $L = C$. The constraint set $C = \{\beta : \|\beta\|_2 \leq \lambda\}$ has radius $R = \lambda$. Theorem 20 therefore gives: with step size $\eta = 2\lambda/(C\sqrt{k})$, the averaged iterate $\bar{\beta}$ satisfies
\begin{align*}
f(\bar{\beta}) - f(\hat{\beta}) \leq \frac{2C\lambda}{\sqrt{k}}.
\end{align*}
To achieve accuracy $\varepsilon$, one runs $k = 4C^2\lambda^2/\varepsilon^2$ iterations, each requiring the computation of $n$ subgradient contributions. The total computational cost is $O(np/\varepsilon^2)$ (accounting for the inner product $x_i^\top \beta$ in $\mathbb{R}^p$ at each step).
[/example]
This cost scaling — linear in $n$, the dataset size, and quadratic in $1/\varepsilon$ — is the baseline that motivates the next section. When $n$ is large, sweeping through the entire dataset at every iteration is expensive. Stochastic gradient descent replaces the full-batch subgradient with an unbiased estimate computed from a single data point, achieving the same $O(1/\sqrt{k})$ convergence rate while reducing the per-iteration cost by a factor of $n$.
## Stochastic Gradient Descent
Gradient descent, as developed in the previous section, requires computing a full subgradient of the objective $f$ at each iteration. In the ERM setting this gradient is a sum of $n$ terms — one for each data point — so each update sweeps the entire dataset. When $n$ is large this becomes prohibitively expensive, and a more frugal approach is needed.
Stochastic gradient descent (SGD) replaces the full gradient with a cheap random estimate computed from a single data point. The key observation is that the ERM objective can be written as an expectation, and unbiased gradient estimates are enough to drive convergence.
### The Stochastic Objective
[definition: Stochastic Objective]
Let $C \subseteq \mathbb{R}^p$ be a closed convex set and $\mathcal{U}$ a measurable space. A function $f : \mathbb{R}^p \to \mathbb{R}$ is a **stochastic objective** if it can be written as
\begin{align*}
f(\beta) = \mathbb{E}\tilde{f}(\beta; U),
\end{align*}
where $\tilde{f} : \mathbb{R}^p \times \mathcal{U} \to \mathbb{R}$ satisfies:
- $\beta \mapsto \tilde{f}(\beta; u)$ is convex for every $u \in \mathcal{U}$,
- $U$ is a random variable taking values in $\mathcal{U}$.
[/definition]
The ERM objective fits this template immediately. Fix training data $(x_1, y_1), \ldots, (x_n, y_n)$ and let $U$ be uniformly distributed on $\{1, \ldots, n\}$. Writing $\tilde{f}(\beta; i) = \ell(h_\beta(x_i), y_i)$, the empirical risk becomes
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \ell(h_\beta(x_i), y_i) = \mathbb{E}\,\ell(h_\beta(x_U), y_U) = \mathbb{E}\tilde{f}(\beta; U).
\end{align*}
The data $(x_1, y_1), \ldots, (x_n, y_n)$ are treated as fixed throughout; only $U$ is random.
### The SGD Algorithm
At each iteration $s$, instead of computing a subgradient of the full objective $f(\beta_s) = \mathbb{E}\tilde{f}(\beta_s; U)$, SGD draws a fresh sample $U_s$ and computes a subgradient $\tilde{g}_s$ of $\beta \mapsto \tilde{f}(\beta; U_s)$ at $\beta_s$. In the ERM context, this costs one data-point evaluation rather than $n$.
**Algorithm (Stochastic Gradient Descent).** Given $\beta_1 \in C$, a number of iterations $k \in \mathbb{N}$, step sizes $(\eta_s)_{s=1}^{k-1} > 0$, and i.i.d. copies $U_1, \ldots, U_{k-1}$ of $U$:
For $s = 1, \ldots, k-1$:
1. Compute $\tilde{g}_s \in \partial \tilde{f}(\beta_s; U_s)$ (a subgradient of $\beta \mapsto \tilde{f}(\beta; U_s)$ at $\beta_s$).
2. Set $z_{s+1} = \beta_s - \eta_s \tilde{g}_s$.
3. Project: $\beta_{s+1} = \pi_C(z_{s+1})$.
Return the average iterate $\bar{\beta} = \frac{1}{k}\sum_{s=1}^k \beta_s$.
The structure mirrors projected gradient descent exactly, with the single change that the full subgradient $g_s \in \partial f(\beta_s)$ is replaced by the cheap stochastic estimate $\tilde{g}_s$. The output is again the Cesaro average rather than the last iterate, for the same reason as in batch gradient descent: averaging damps the oscillations introduced by the noise in the stochastic gradients.
[remark: Stochastic Subgradient is Conditionally Unbiased]
The stochastic subgradient $\tilde{g}_s$ is a conditionally unbiased estimate of a true subgradient of $f$ at $\beta_s$. To see why, set $g_s = \mathbb{E}(\tilde{g}_s \mid \beta_s)$. Since $\tilde{f}(\beta; U_s) \geq \tilde{f}(\beta_s; U_s) + \tilde{g}_s^\top(\beta - \beta_s)$ for all $\beta$ (as $\tilde{g}_s$ is a subgradient of the convex function $\beta \mapsto \tilde{f}(\beta; U_s)$), and since $U_s$ is independent of $\beta_s$, taking conditional expectations on both sides yields $f(\beta) \geq f(\beta_s) + g_s^\top(\beta - \beta_s)$ for all $\beta$. Hence $g_s \in \partial f(\beta_s)$.
[/remark]
This conditional unbiasedness is the fundamental property that allows the proof of convergence to mirror the batch case almost word for word.
### Convergence of SGD
[quotetheorem:1992]
[citeproof:1992]
The rate $\mathbb{E}f(\bar{\beta}) - f(\hat{\beta}) = O(1/\sqrt{k})$ is identical to the convergence rate of batch gradient descent from Theorem 20. The key distinction is what each iteration costs: batch GD spends $O(n)$ gradient evaluations per step, while SGD spends $O(1)$. To reach $\varepsilon$-accuracy, batch GD requires $k \asymp L^2R^2/\varepsilon^2$ iterations each of cost $O(n)$, giving total work $O(nL^2R^2/\varepsilon^2)$; SGD requires the same number of iterations but at cost $O(1)$ each, giving total work $O(L^2R^2/\varepsilon^2)$ — a factor of $n$ improvement.
[remark: The Bounded Second Moment Condition]
The hypothesis on $\tilde{g}_s$ in Theorem 21 is a bound on the expected squared norm of the stochastic gradient, $\sup_\beta \mathbb{E}\|\tilde{g}\|_2^2 \leq L^2$, rather than the pointwise bound $\|g\|_2 \leq L$ used in Theorem 20. This is a strictly weaker assumption: the stochastic gradient need not be bounded almost surely, only in $L^2$. The proof accommodates this by working with conditional expectations throughout rather than deterministic inequalities.
[/remark]
### SGD for ERM with Hinge Loss
[example: SGD for SVM Objective]
Consider ERM with hinge loss, $\mathcal{X} = \{x \in \mathbb{R}^p : \|x\|_2 \leq C\}$, and hypothesis class $\mathcal{H} = \{x \mapsto x^\top\beta : \|\beta\|_2 \leq \lambda\}$, so $C$ is the Euclidean ball of radius $\lambda$. In this case $R = \lambda$.
The stochastic objective is $\tilde{f}(\beta; i) = \max(0, 1 - y_i x_i^\top \beta)$. For a given sample $U_s = i$, a subgradient of $\beta \mapsto \tilde{f}(\beta; i)$ at $\beta_s$ is
\begin{align*}
\tilde{g}_s = \begin{cases} -y_i x_i & \text{if } y_i x_i^\top \beta_s < 1, \\ 0 & \text{if } y_i x_i^\top \beta_s \geq 1. \end{cases}
\end{align*}
Since $\|x_i\|_2 \leq C$, we have $\|\tilde{g}_s\|_2 \leq C$ almost surely, so the second-moment condition holds with $L = C$. Theorem 21 gives
\begin{align*}
\mathbb{E}f(\bar{\beta}) - f(\hat{\beta}) \leq \frac{2C\lambda}{\sqrt{k}}.
\end{align*}
The SGD update at step $s$ costs a single inner-product evaluation $y_{U_s} x_{U_s}^\top \beta_s$, compared to $n$ such evaluations for the batch gradient. The total work to achieve $\varepsilon$-accuracy scales as $O(C^2\lambda^2/\varepsilon^2)$ regardless of the dataset size $n$.
[/example]
### Connection to Online Learning
The SGD framework has a natural interpretation in terms of online learning. In the online setting, data points arrive sequentially rather than all at once, and one must make predictions and update the model in real time. At round $s$, the learner selects $\beta_s \in C$, observes data point $(x_{U_s}, y_{U_s})$, incurs loss $\tilde{f}(\beta_s; U_s)$, and then updates. The SGD update $\beta_{s+1} = \pi_C(\beta_s - \eta_s \tilde{g}_s)$ is precisely this online update rule.
The convergence bound of Theorem 21 can be re-read as a regret bound: the cumulative excess loss of SGD over $k$ rounds relative to the best fixed $\hat{\beta}$ satisfies $\mathbb{E}\left[\frac{1}{k}\sum_{s=1}^k \tilde{f}(\beta_s; U_s)\right] - f(\hat{\beta}) \leq \frac{2LR}{\sqrt{k}}$, which converges to zero. In the online learning literature this property is called no-regret or Hannan consistency.
This connection is more than cosmetic: it shows that the sample complexity of online learning and the iteration complexity of stochastic optimisation are governed by the same quantity $LR/\sqrt{k}$, and that algorithms designed for one setting transfer directly to the other.
[remark: SGD in Neural Network Training]
Although Theorem 21 applies to convex objectives, SGD is also the workhorse optimizer for training deep neural networks, where the objective is highly non-convex. In that setting the theoretical guarantees do not apply directly, but empirically SGD — often with momentum and adaptive step sizes — finds parameter configurations that generalise well. The computational advantage is the same: each update requires computing a gradient at a single data point (or a small mini-batch), using the backpropagation algorithm to exploit the compositional structure of the network. Chapter 5 will examine this in more detail.
[/remark]
Gradient descent and SGD form the computational foundation for empirical risk minimization, yet many of the most successful modern methods—boosting, gradient boosting, and neural networks—combine this foundation with problem-specific structures and non-convex landscapes that challenge classical theory.
# 5. Popular Machine Learning Methods II
Chapter 4 showed that when the empirical risk can be written as a convex function over a convex parameter set, efficient algorithms with provable convergence guarantees are available. This final chapter turns to three concrete methods that build on the preceding theory: AdaBoost, gradient boosting, and feedforward neural networks. All three can be understood as strategies for constructing rich hypothesis classes and fitting them to data — but each takes a different approach to the fundamental tension between expressiveness and computational tractability.
## Combining Weak Learners: Boosting
Empirical risk minimisation over a fixed hypothesis class $\mathcal{H}$ searches for the single best hypothesis. A complementary idea is to search for a good **weighted combination** of hypotheses drawn from a simpler base class. This is the starting point for boosting methods.
### The Hypothesis Class for Boosting
A natural question is whether a collection of individually weak classifiers — ones that do only slightly better than random guessing — can be combined to form a strong classifier. The answer is yes, and the hypothesis class that makes this precise is a set of weighted combinations of base classifiers.
Let $B$ be a base set of classifiers $h: \mathcal{X} \to \{-1, 1\}$ satisfying the symmetry condition $h \in B \Rightarrow -h \in B$. From $B$ we build the larger class
\begin{align*}
\mathcal{H} = \left\{ \sum_{m=1}^{M} \beta_m h_m : \beta_m \geq 0,\, h_m \in B \text{ for } m = 1, \ldots, M \right\}.
\end{align*}
The tuning parameter $M$ controls the size of the ensemble. Each element of $\mathcal{H}$ is a non-negative linear combination of base classifiers, so $\mathcal{H}$ is genuinely richer than $B$. The final classification is produced by applying the sign function to the ensemble output. Performing ERM over $\mathcal{H}$ directly, however, is computationally challenging, since one must simultaneously optimise over the weights $\beta_m$ and the choice of base classifiers $h_m$.
### The AdaBoost Algorithm
AdaBoost replaces the joint optimisation over $\mathcal{H}$ with a greedy procedure that adds one base classifier at a time. The key insight that makes this tractable is that, at each step, the optimisation over $h \in B$ amounts to weighted ERM over the base class — a problem assumed to be fast.
[definition: AdaBoost]
Let $B$ be a base classifier class with $h \in B \Rightarrow -h \in B$, and let $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathbb{R}^p \times \{-1, 1\}$ be training data. Given a tuning parameter $M \in \mathbb{N}$, **AdaBoost** initialises $\hat{f}_0(x) = 0$ and, for $m = 1, \ldots, M$, performs:
1. Set observation weights $w_i^{(m)} = n^{-1} \exp(-Y_i \hat{f}_{m-1}(X_i))$.
2. Compute $\hat{h}_m = \operatorname{arg\,min}_{h \in B} \operatorname{err}_m(h)$, where
\begin{align*}
\operatorname{err}_m(h) := \frac{\sum_{i=1}^n w_i^{(m)} \mathbb{1}_{\{h(X_i) \neq Y_i\}}}{\sum_{i=1}^n w_i^{(m)}}.
\end{align*}
3. Compute
\begin{align*}
\hat{\beta}_m = \frac{1}{2} \log \frac{1 - \operatorname{err}_m(\hat{h}_m)}{\operatorname{err}_m(\hat{h}_m)}.
\end{align*}
4. Update $\hat{f}_m = \hat{f}_{m-1} + \hat{\beta}_m \hat{h}_m$.
The final classifier is $\operatorname{sgn} \circ \hat{f}_M$.
[/definition]
The greedy motivation comes from treating AdaBoost as approximate ERM over $\mathcal{H}$ with **exponential loss** $\ell(f(x), y) = e^{-yf(x)}$. To see how the update formulas arise, write
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \exp\bigl[-Y_i\{\hat{f}_{m-1}(X_i) + \beta h(X_i)\}\bigr]
&= e^\beta \sum_{i=1}^n w_i^{(m)} \mathbb{1}_{\{h(X_i) \neq Y_i\}} + e^{-\beta} \sum_{i=1}^n w_i^{(m)} \mathbb{1}_{\{h(X_i) = Y_i\}} \\
&= (e^\beta - e^{-\beta}) \sum_{i=1}^n w_i^{(m)} \mathbb{1}_{\{h(X_i) \neq Y_i\}} + e^{-\beta} \sum_{i=1}^n w_i^{(m)}.
\end{align*}
Since $e^\beta - e^{-\beta} > 0$ for $\beta > 0$, minimising over $h$ reduces to finding the classifier in $B$ with the smallest weighted error $\operatorname{err}_m(h)$ — this is $\hat{h}_m$. Given $\hat{h}_m$, minimising over $\beta > 0$ by differentiating and setting to zero yields the condition
\begin{align*}
(e^{\hat{\beta}_m} + e^{-\hat{\beta}_m}) \operatorname{err}_m(\hat{h}_m) = e^{-\hat{\beta}_m}.
\end{align*}
Setting $x = e^{\hat{\beta}_m}$ and $a = \operatorname{err}_m(\hat{h}_m)$, this becomes $(x^2 + 1)a = 1$, so $x = \sqrt{1/a - 1}$, which gives the closed-form update
\begin{align*}
\hat{\beta}_m = \frac{1}{2} \log \frac{1 - \operatorname{err}_m(\hat{h}_m)}{\operatorname{err}_m(\hat{h}_m)}.
\end{align*}
Notice that $\hat{\beta}_m$ is large when $\hat{h}_m$ has a small weighted error and small (even negative) when $\hat{h}_m$ performs near chance. The algorithm thus assigns larger influence to better base classifiers.
The observation weights $w_i^{(m)}$ play a central role. After each step, observations that were misclassified receive higher weights, since $\exp(-Y_i \hat{f}_{m-1}(X_i))$ is large when $Y_i$ and $\hat{f}_{m-1}(X_i)$ have opposite signs. Subsequent base classifiers are therefore forced to focus on the hard examples. This adaptive reweighting is the mechanism that allows AdaBoost to progressively correct its errors.
[example: Decision Stumps as Base Classifiers]
Let $\mathcal{X} = \mathbb{R}^p$ and consider the class of **decision stumps**
\begin{align*}
B = \{ h_{a,j,1}(x) = \operatorname{sgn}(x_j - a),\; h_{a,j,2}(x) = \operatorname{sgn}(a - x_j) : a \in \mathbb{R},\, j = 1, \ldots, p \}.
\end{align*}
Each decision stump thresholds a single coordinate $x_j$ at a value $a$. The symmetry condition $h \in B \Rightarrow -h \in B$ is satisfied by including both directions.
For a fixed $j$, minimising the weighted error over all thresholds $a$ only requires sweeping through the ordered values $X_{1j}, \ldots, X_{nj}$. The optimal threshold must be one of the midpoints between adjacent values, so there are at most $n - 1$ candidates per coordinate. Evaluating all of them takes $O(n)$ operations per coordinate, giving a total of $O(np)$ operations for weighted ERM over $B$. This efficiency is what makes AdaBoost with decision stumps practical even for large $M$.
[/example]
[remark: When the Base Classifier Perfectly Separates]
The derivation above assumed $\operatorname{err}_m(h) > 0$ for all $h \in B$, i.e. no base classifier achieves zero weighted error. If some $h$ achieves $\operatorname{err}_m(h) = 0$, then $\hat{\beta}_m = +\infty$ and the algorithm degenerates. In practice this rarely occurs for balanced data, but it is worth noting that the algorithm implicitly assumes the base class is not powerful enough to perfectly separate any weighted dataset.
[/remark]
The adaptive reweighting scheme has a concrete payoff: as long as each weak learner beats random guessing on the weighted dataset, the ensemble's training error decreases geometrically with the number of boosting rounds.
[quotetheorem:1993]
The bound follows from the exponential loss interpretation of AdaBoost. The training error is bounded above by the average exponential loss $n^{-1} \sum_i e^{-Y_i \hat{f}_M(X_i)}$, since $\mathbb{1}_{\{Y_i \hat{f}_M(X_i) \leq 0\}} \leq e^{-Y_i \hat{f}_M(X_i)}$. The greedy construction ensures that the exponential loss decreases by a factor of at most $\sqrt{1 - 4\gamma^2} \leq e^{-2\gamma^2}$ at each step, giving the stated geometric bound. The strength of this result is that it requires no assumption on the data distribution beyond the weak learning condition — it is a purely combinatorial guarantee about the training set.
## Gradient Boosting
AdaBoost is tailored to exponential loss in a classification setting. **Gradient boosting** generalises the boosting idea to arbitrary differentiable loss functions and to regression problems. The key insight is that boosting can be reinterpreted as gradient descent carried out in the space of functions rather than in parameter space.
### Functional Gradient Descent
To see where gradient boosting comes from, consider the problem of minimising $R(h) = \mathbb{E}\ell(h(X), Y)$ over measurable functions $h: \mathcal{X} \to \mathbb{R}$. A formal application of gradient descent in function space would proceed as follows:
1. Start with an initial guess $f_0: \mathcal{X} \to \mathbb{R}$.
2. For $m = 1, \ldots, M$, compute the negative functional gradient
\begin{align*}
g_m(x) = \frac{\partial}{\partial \theta} \mathbb{E}(\ell(\theta, Y) \mid X = x)\bigg|_{\theta = f_{m-1}(x)} = \mathbb{E}\!\left[\frac{\partial \ell(\theta, Y)}{\partial \theta}\bigg|_{\theta = f_{m-1}(x)} \,\middle|\, X = x\right].
\end{align*}
3. Update $f_m = f_{m-1} - \eta g_m$ for a step size $\eta > 0$.
This is an idealised procedure — it requires knowledge of the conditional distribution of $Y$ given $X$, which is unavailable in practice. Working with finite data $(X_1, Y_1), \ldots, (X_n, Y_n)$ requires approximating $g_m$.
The key observation is that the conditional expectation $g_m$ minimises
\begin{align*}
\mathbb{E}\!\left[\frac{\partial \ell(\theta, Y)}{\partial \theta}\bigg|_{\theta = f_{m-1}(X)} - h(X)\right]^2
\end{align*}
over all measurable functions $h: \mathcal{X} \to \mathbb{R}$, by the best least-squares predictor property of conditional expectation. This suggests approximating $g_m$ by fitting a base regressor to the pseudo-residuals $W_i = \frac{\partial}{\partial \theta}\ell(\theta, Y_i)|_{\theta = \hat{f}_{m-1}(X_i)}$ computed at the current fit.
[definition: Gradient Boosting]
Let $\hat{h}$ be a base regression method: given data $D$, it outputs a hypothesis $\hat{h}_D: \mathcal{X} \to \mathbb{R}$. Let $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a differentiable loss function and let $\eta > 0$ and $M \in \mathbb{N}$.
**Gradient boosting** proceeds as follows:
1. Compute $\hat{\mu} = \operatorname{arg\,min}_{\mu \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^n \ell(\mu, Y_i)$ and set $\hat{f}_0(x) = \hat{\mu}$.
2. For $m = 1, \ldots, M$:
- Compute pseudo-residuals $W_i = \frac{\partial}{\partial \theta}\ell(\theta, Y_i)|_{\theta = \hat{f}_{m-1}(X_i)}$ for $i = 1, \ldots, n$.
- Apply the base regressor $\hat{h}$ to the dataset $(X_{1:n}, W_{1:n})$ to obtain $\hat{g}_m: \mathcal{X} \to \mathbb{R}$.
- Update $\hat{f}_m = \hat{f}_{m-1} - \eta \hat{g}_m$.
3. Return $\hat{f}_M$ (or $\operatorname{sgn} \circ \hat{f}_M$ in the classification setting).
[/definition]
The name "gradient boosting" reflects that each step fits a model to the gradient of the loss with respect to the current predictions, and the fitted model is used to take a gradient step. The abstraction is powerful precisely because the base regressor $\hat{h}$ is treated as a black box: any regression method — decision trees, linear models, kernel smoothers — can be plugged in, and the algorithm will approximate gradient descent in function space using that method's representational capacity. To see how the pseudo-residuals specialise in a familiar case, consider the least squares loss.
[example: Gradient Boosting with Least Squares Loss]
When $\ell(\theta, y) = (\theta - y)^2$, the pseudo-residuals are
\begin{align*}
W_i = \frac{\partial}{\partial \theta}(Y_i - \theta)^2\bigg|_{\theta = \hat{f}_{m-1}(X_i)} = -2(Y_i - \hat{f}_{m-1}(X_i)).
\end{align*}
The pseudo-residuals are proportional to the negative residuals $Y_i - \hat{f}_{m-1}(X_i)$. At each boosting step, we regress the current residuals against $X_{1:n}$ using the base regressor $\hat{h}$, and add $\eta$ times the fitted model to the current ensemble.
If $\hat{h}$ performs ERM over the class $\{x \mapsto \mu + x_j \beta : \mu \in \mathbb{R},\, \beta \in \mathbb{R},\, j = 1, \ldots, p\}$, then at each step the algorithm finds the single predictor $x_j$ most correlated (in absolute value) with the current residuals, fits a simple linear regression on that predictor, and adds a small multiple of the fitted line to $\hat{f}_{m-1}$. Over many iterations this builds up a sum of simple regression functions, each targeting the component of the residual most unexplained by the previous steps.
[/example]
The parameter $M$ in gradient boosting plays the role of a regulariser: small $M$ produces a smooth, low-variance fit, while large $M$ allows the ensemble to adapt to fine structure in the data at the cost of increased variance. Choosing $M$ via cross-validation, as studied in Chapter 1, is standard practice. The step size $\eta$ controls how aggressively each new component is incorporated; smaller $\eta$ paired with larger $M$ is a common strategy for avoiding overfitting.
## Feedforward Neural Networks
Boosting methods are powerful, but they carry a structural limitation: the hypothesis class $\mathcal{H}$ is built by linearly combining base classifiers chosen a priori from a fixed set $B$. If $B$ consists of decision stumps or shallow trees, the ensemble can never learn representations that are inherently hierarchical or compositional — patterns that depend on detecting intermediate features, which themselves depend on lower-level features of the input. A decision stump asks one yes/no question about a single coordinate; no linear combination of such questions can efficiently represent, say, the detection of a curved edge in an image, which requires combining responses across many coordinates in a structured way. Neural networks address this limitation by allowing the network itself to learn intermediate representations from the data.
Neural networks represent a fundamentally different approach to constructing rich hypothesis classes. Rather than building up a function as a sum of simple pieces, a neural network applies a cascade of alternating linear maps and pointwise nonlinearities. The resulting architecture can represent complex input-output relationships with relatively few parameters, and its compositional structure makes gradient computation efficient.
### Network Architecture
The definition below formalises the architectural choices that all feedforward networks share: how many layers there are (depth), how wide each layer is, how the layers are connected (fully connected affine maps), and what nonlinearity is applied between layers. These parameters collectively determine the expressive capacity of the network and the number of trainable weights.
[definition: Feedforward Neural Network]
Let $d \geq 1$ (the **depth**) and let $m_1 = p$, $m_2, \ldots, m_d$, $m_{d+1} = 1$ be positive integers (the **widths** of each layer). For $k = 1, \ldots, d$, define the affine map $A^{(k)}: \mathbb{R}^{m_k} \to \mathbb{R}^{m_{k+1}}$ by
\begin{align*}
A^{(k)}(v) = \beta^{(k)} v + \mu^{(k)},
\end{align*}
where $\beta^{(k)} \in \mathbb{R}^{m_{k+1} \times m_k}$ and $\mu^{(k)} \in \mathbb{R}^{m_{k+1}}$ are the **weight matrix** and **bias vector** at layer $k$.
Let $\psi: \mathbb{R} \to \mathbb{R}$ be a nonlinear **activation function**, applied elementwise: for $v = (v_1, \ldots, v_m)^\top \in \mathbb{R}^m$, define $g(v) = (\psi(v_1), \ldots, \psi(v_m))^\top$.
A **feedforward neural network** with depth $d$ is the hypothesis $h: \mathbb{R}^p \to \mathbb{R}$ defined by
\begin{align*}
h(x) = A^{(d)} \circ g \circ A^{(d-1)} \circ g \circ \cdots \circ g \circ A^{(2)} \circ g \circ A^{(1)}(x).
\end{align*}
The parameters $(\beta^{(k)}, \mu^{(k)})_{k=1}^d$ are collectively called the **weights** and **biases** of the network.
[/definition]
The cascade structure is best understood by tracing the computation layer by layer. Set $h^{(0)} := x$. For $k = 1, \ldots, d-1$, define
\begin{align*}
x^{(k)} &= A^{(k)}(h^{(k-1)}), \\
h^{(k)} &= g(x^{(k)}).
\end{align*}
The vectors $h^{(1)}, \ldots, h^{(d-1)}$ are called the **hidden layers**: they are intermediate representations of the input that are not directly observed. Finally, $x^{(d)} = A^{(d)}(h^{(d-1)}) = h(x)$ is the **output layer**.
<!-- illustration-needed: a feedforward network with depth d=3 — show input layer x in R^p, two hidden layers h^(1) and h^(2), and the scalar output h(x), with arrows indicating the alternating affine and activation operations -->
The activation function $\psi$ is what gives neural networks their expressive power. Without it, the composition of affine maps would itself be affine, and $h$ would reduce to a linear function of the input — far too restricted to represent most real-world input-output relationships. Two standard choices are:
[definition: ReLU and Sigmoid Activations]
The **rectified linear unit (ReLU)** is the function $\psi: \mathbb{R} \to \mathbb{R}$ defined by
\begin{align*}
\psi(u) = \max(u, 0).
\end{align*}
The **sigmoid function** is $\psi: \mathbb{R} \to \mathbb{R}$ defined by
\begin{align*}
\psi(u) = \frac{1}{1 + e^{-u}}.
\end{align*}
[/definition]
ReLU is piecewise linear and is not differentiable at $u = 0$; in practice, this is handled by treating the subgradient at $0$ as $0$ or by slightly perturbing the data so that evaluations do not land exactly at the non-differentiability. The sigmoid is smooth but saturates for large $|u|$, which can slow gradient-based training. ReLU has become the default choice in modern deep learning due to its computational simplicity and effectiveness.
### Fitting Neural Networks: Stochastic Gradient Descent
The parameters $(\beta^{(k)}, \mu^{(k)})_{k=1}^d$ are fitted to training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}$ by minimising the empirical risk with a surrogate loss $\varphi: \mathbb{R} \to \mathbb{R}$:
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \varphi(y_i h(x_i)).
\end{align*}
The resulting optimisation problem is highly non-convex in the network parameters, since $h$ is a composition of nonlinear functions. Despite this, stochastic gradient descent (studied in Chapter 4) has been shown empirically to find parameters with excellent generalisation performance. The theoretical explanation for this phenomenon remains an active research area.
The reason SGD is at all feasible for neural networks is that the gradient of the empirical risk with respect to the parameters can be computed efficiently using the **backpropagation algorithm**, which exploits the chain rule applied to the compositional structure of $h$.
### Backpropagation
The computational problem that backpropagation solves is the following: given a network with $W = \sum_{k=1}^d m_k m_{k+1}$ parameters, computing each partial derivative $\partial z / \partial \theta_j$ independently via finite differences would require $O(W)$ forward passes, giving total cost $O(W^2)$. For large networks, this is entirely impractical. Backpropagation achieves $O(W)$ total cost by recognising that the chain rule, applied to the compositional structure of $h$, allows gradients to be shared across parameters in the same layer. The key insight is that the gradient of $z$ with respect to each parameter in layer $k$ factorises as a product of the layer-$k$ gradient (the gradient of $z$ with respect to the pre-activation at layer $k$) and a quantity local to that parameter. Computing the layer-$k$ gradient costs $O(m_k m_{k+1})$ — linear in the number of weights in that layer — and once computed, all parameter gradients at layer $k$ follow immediately.
For an observation $(x, y)$, let $z = \varphi(y h(x)) = \varphi(y x^{(d)})$ be the contribution to the empirical risk. Backpropagation computes $\partial z / \partial \beta^{(k)}$ and $\partial z / \partial \mu^{(k)}$ for each layer $k$ via a backward pass through the network. The forward pass first computes all intermediate quantities $h^{(l)}$ and $x^{(l)}$ using the current parameter values.
The backward pass then proceeds from the output layer inward. Starting at the output:
\begin{align*}
\frac{\partial z}{\partial x^{(d)}} &= y \varphi'(y x^{(d)}), \\
\frac{\partial z}{\partial \mu^{(d)}} &= \frac{\partial z}{\partial x^{(d)}}, \qquad
\frac{\partial z}{\partial \beta^{(d)}_{1k}} = \frac{\partial z}{\partial x^{(d)}} h^{(d-1)}_k.
\end{align*}
The two equations at the output layer encode two separate applications of the chain rule. The first — the gradient with respect to $x^{(d)}$ — follows directly from differentiating $z = \varphi(y x^{(d)})$. The second — the bias and weight gradients — use the fact that $x^{(d)} = \beta^{(d)} h^{(d-1)} + \mu^{(d)}$, so $\partial x^{(d)} / \partial \mu^{(d)} = 1$ and $\partial x^{(d)} / \partial \beta^{(d)}_{1k} = h^{(d-1)}_k$.
The gradient is then propagated backward through the activation:
\begin{align*}
\frac{\partial z}{\partial h^{(d-1)}_j} &= \frac{\partial z}{\partial x^{(d)}} \beta^{(d)}_{1j}, \\
\frac{\partial z}{\partial x^{(d-1)}_j} &= \frac{\partial z}{\partial h^{(d-1)}_j} \psi'(x^{(d-1)}_j).
\end{align*}
The first equation here uses the chain rule through the affine map: since $x^{(d)} = \sum_j \beta^{(d)}_{1j} h^{(d-1)}_j + \mu^{(d)}$, the effect of $h^{(d-1)}_j$ on $z$ flows entirely through $x^{(d)}$, weighted by $\beta^{(d)}_{1j}$. The second equation uses the chain rule through the elementwise activation: since $h^{(d-1)}_j = \psi(x^{(d-1)}_j)$, the effect of the pre-activation $x^{(d-1)}_j$ on $z$ is scaled by $\psi'(x^{(d-1)}_j)$, which is already available from the forward pass.
And at layer $d-1$, the parameter gradients are:
\begin{align*}
\frac{\partial z}{\partial \mu^{(d-1)}_j} &= \frac{\partial z}{\partial x^{(d-1)}_j}, \qquad
\frac{\partial z}{\partial \beta^{(d-1)}_{jk}} = \frac{\partial z}{\partial x^{(d-1)}_j} h^{(d-2)}_k.
\end{align*}
To propagate further, the gradient with respect to the previous hidden layer is computed by summing over all paths through the weight matrix:
\begin{align*}
\frac{\partial z}{\partial h^{(d-2)}_j} = \sum_{k=1}^{m_d} \frac{\partial z}{\partial x^{(d-1)}_k} \beta^{(d-1)}_{kj}.
\end{align*}
This pattern repeats until all parameter gradients have been computed. The two types of equations above — those for the parameter gradients given layer-$k$ gradients, and those for propagating gradients from layer $k$ to layer $k-1$ — are the structural heart of backpropagation. The parameter gradients are used directly in the SGD update; the propagated gradients serve only to enable computation at deeper layers.
[remark: Cost of Backpropagation]
The forward and backward passes each require $O(W)$ operations, where $W = \sum_{k=1}^d m_k m_{k+1}$ is the total number of parameters. For a network of fixed depth and width, $W$ is proportional to $n$ only if the width grows with $n$; for a fixed architecture, the cost per gradient step is constant in $n$. This makes stochastic gradient descent, which uses a single observation's gradient at each step, particularly attractive — the cost per step is $O(W)$ regardless of dataset size.
[/remark]
With the cost of gradient computation settled, we can now examine what feedforward networks look like in their simplest non-trivial form. The following example instantiates the general architecture for a network with a single hidden layer, making the parameter count and the role of the activation function concrete.
[example: A Single Hidden Layer Network]
Consider a network with depth $d = 2$, input dimension $p$, a single hidden layer of width $m$ (so $m_1 = p$, $m_2 = m$, $m_3 = 1$), and ReLU activation. The hypothesis is
\begin{align*}
h(x) = \beta^{(2)}(\psi(\beta^{(1)} x + \mu^{(1)})) + \mu^{(2)},
\end{align*}
where $\beta^{(1)} \in \mathbb{R}^{m \times p}$, $\mu^{(1)} \in \mathbb{R}^m$, $\beta^{(2)} \in \mathbb{R}^{1 \times m}$, $\mu^{(2)} \in \mathbb{R}$, and $\psi$ is applied componentwise.
Each of the $m$ neurons in the hidden layer computes a separate linear function of $x$ and applies a ReLU; the output layer then takes a weighted sum of these $m$ features. The total number of parameters is $mp + m + m + 1 = m(p+2) + 1$.
[/example]
A striking property of even this simplest architecture is that it is already a universal approximator: a single hidden layer is sufficient in principle to approximate any continuous function, provided the width $m$ is allowed to grow. This fact is remarkable because the class of two-layer networks with ReLU activation consists of piecewise linear functions — yet as the number of pieces increases, they can approximate smooth targets uniformly on compact sets. The following classical result makes this precise for any non-polynomial activation.
[quotetheorem:1994]
The proof uses the Stone–Weierstrass theorem: the set of functions representable by two-layer networks with a non-polynomial activation is dense in $C(K)$ under the $\sup$ norm, because such networks can approximate indicator functions of half-spaces and hence (by superposition) arbitrary continuous functions. A classical reference is Cybenko (1989) for sigmoid activations and Hornik (1991) for the general case. Two important caveats: the theorem is existential — it asserts that a large enough $m$ works but gives no constructive bound — and it says nothing about how to find the weights, nor does it guarantee that gradient-based training will converge to the approximating parameters.
[remark: Non-Convexity and Empirical Success]
The empirical risk objective for neural networks is non-convex in the parameters, so none of the convergence guarantees from Chapter 4 apply directly. Stochastic gradient descent may converge to a local minimum, a saddle point, or fail to converge altogether. The empirical observation is that for large, sufficiently over-parameterised networks, SGD routinely finds parameters with small training and test error. Understanding when and why this happens is one of the central open problems in the theory of deep learning.
[/remark]
The three methods in this chapter — AdaBoost, gradient boosting, and neural networks — each represent a strategy for transcending the limitations of simple hypothesis classes. AdaBoost and gradient boosting build ensembles sequentially, each step correcting the errors of the previous one; neural networks build a single compositional architecture whose parameters are tuned jointly. All three methods have proven enormously effective in practice, and all three connect back to the theoretical framework developed in Chapters 1–4: the bias-variance decomposition (Chapter 1) controls the tradeoff between approximation error and variance; the VC dimension and Rademacher complexity bounds (Chapter 3) govern how many samples are needed for the ensemble or network to generalise; and the convex surrogate and SGD convergence guarantees (Chapter 4) underpin the computational feasibility of the fitting algorithms.
## References
The primary reference for this course is the lecture notes by Rajen D. Shah, Statistical Laboratory, University of Cambridge.
Contents
- Introduction
- The Statistical Prediction Framework
- Bayes Classifiers and Regression Functions
- The Regression Setting
- The Classification Setting
- Conditional Expectation Review
- Hypothesis Classes and the Learning Problem
- Empirical Risk Minimisation
- The Bias-Variance Decomposition
- The Variance of Basis-Function ERM
- Cross-Validation
- 2. Popular Machine Learning Methods I
- Decision Trees
- The Recursive Binary Partitioning Algorithm
- Limitations of the Rectangular Partition
- Efficient Computation of Splits
- Choosing the Number of Regions
- Random Forests
- The Algorithm
- 3. Statistical Learning Theory
- The Approximation–Estimation Decomposition
- Why the CLT Is Not Enough
- Why This Matters Beyond Classification
- The Chapter Roadmap
- Sub-Gaussianity and Hoeffding's Inequality
- From Markov to the Chernoff Bound
- Sub-Gaussian Random Variables
- Basic Properties
- Hoeffding's Lemma
- Sums of Independent Sub-Gaussians
- Hoeffding's Inequality
- Expected Maximum of Sub-Gaussians
- Finite Hypothesis Classes
- Decomposing the Excess Risk
- The Generalization Bound
- The Histogram Classifier
- What the Bound Says About Sample Complexity
- Rademacher Complexity
- The Setup and Function Class
- Empirical and Population Rademacher Complexity
- The Fundamental Generalization Bound
- Generalization Bound for ERM
- What Rademacher Complexity Achieves
- VC Dimension
- From Rademacher to Combinatorics
- Shattering and the VC Dimension
- The Sauer–Shelah Lemma
- From VC Dimension to Rademacher Bounds
- Computing VC Dimensions: Linear Classifiers
- The Fundamental Theorem of Statistical Learning
- 4. Computation for Empirical Risk Minimisation
- Convex Sets and Convex Functions
- The Local-to-Global Phenomenon
- Three Ways to Verify Convexity
- Jensen's Inequality
- Convex Surrogates
- The Intractability of 0-1 Loss ERM
- Surrogate Loss Functions
- Classification Calibration
- The Zhang–Bartlett Theorem
- Rademacher Complexity Revisited
- The Contraction Lemma
- ℓ₂-Constrained Linear Classes
- ℓ₁-Constrained Linear Classes
- Comparing ℓ₁ and ℓ₂ Constraints
- Projections onto Convex Sets and Subgradients
- Projections onto Convex Sets
- Subgradients and the Subgradient Inequality
- Gradient Descent
- The Steepest Descent Idea
- The Algorithm
- Convergence of Gradient Descent
- Application: ERM with Hinge Loss
- Stochastic Gradient Descent
- The Stochastic Objective
- The SGD Algorithm
- Convergence of SGD
- SGD for ERM with Hinge Loss
- Connection to Online Learning
- 5. Popular Machine Learning Methods II
- Combining Weak Learners: Boosting
- The Hypothesis Class for Boosting
- The AdaBoost Algorithm
- Gradient Boosting
- Functional Gradient Descent
- Feedforward Neural Networks
- Network Architecture
- Fitting Neural Networks: Stochastic Gradient Descent
- Backpropagation
- References
Cambridge II Mathematics of Machine Learning
Content
Problems
History
Created by admin on 4/24/2026 | Last updated on 4/24/2026
Prerequisites
No prerequisites required for this page.
Rate this page
★
★
★
★
★
Poor
Excellent