Empirical process theory studies how random samples behave when viewed through families of functions rather than through single statistics. It provides the language for turning laws of large numbers and central limit theorems into uniform statements over classes of sets or functions, with applications across probability, statistics, and learning theory. The course begins with empirical measures and indexed processes, then develops the basic almost-sure and weak convergence results that describe when empirical averages track their population counterparts uniformly.
The main themes are concentration and approximation, complexity and regularity, and the interplay between probabilistic limits and combinatorial structure. Chapters on Glivenko-Cantelli theory, symmetrisation, and Rademacher averages build the foundational tools for controlling deviations of empirical processes. These ideas are then sharpened through VC classes and combinatorial dimension, which quantify the size of function classes, and through Donsker theory, Brownian bridges, entropy, bracketing, maximal inequalities, and chaining, which provide increasingly refined criteria for functional central limit theorems. Later chapters show how these methods apply to statistical functionals, Z-estimation, bootstrap and multiplier processes, and finally to core problems in learning theory.
The chapters are arranged to move from intuition to technique to application. Early material establishes the objects and basic convergence notions, middle chapters develop the main probabilistic machinery for uniform control and weak convergence, and later chapters broaden the theory through permanence properties, examples, and inferential methods. By the end, the course connects abstract limit theorems to concrete statistical procedures and modern generalization questions.
# Introduction
Empirical process theory asks how random samples act on whole families of sets or functions at once. In elementary statistics, a sample average estimates a single expectation; in this course, the object is the entire random map $f \mapsto P_n f$, where $f$ ranges over a class $\mathcal F$. The main theme is that probabilistic limit theorems become uniform statements once the indexing class has controlled complexity. This opening chapter fixes the viewpoint of the course, explains the two uniform limit problems at the heart of the theory, and records the main examples that will reappear throughout the notes.
## Random Samples as Random Measures
What is the right object if the statistician wants to estimate many expectations from the same sample? A sample $X_1,\dots,X_n$ drawn from a probability measure $P$ does not only produce numbers such as
\begin{align*}
\frac{1}{n}\sum_{i=1}^n X_i;
\end{align*}
it produces a random probability measure assigning mass $1/n$ to each observation. This turns a class of functions into a random field of averages.
[definition: Empirical Measure]
Let $(S,\mathcal S)$ be a measurable space, let $X_1,\dots,X_n:\Omega\to S$ be $S$-valued random variables on a [probability space](/page/Probability%20Space) $(\Omega,\mathcal A,\mathbb Q)$, and let $\delta_x$ denote the Dirac probability measure at $x \in S$. The empirical measure is the map from $\Omega$ into the set of probability measures on $(S,\mathcal S)$ defined by
\begin{align*}
P_n(\omega) := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i(\omega)}.
\end{align*}
For a [measurable function](/page/Measurable%20Function) $f:S\to\mathbb R$, write
\begin{align*}
P_n f(\omega) := \int_S f\,dP_n(\omega) = \frac{1}{n}\sum_{i=1}^{n} f(X_i(\omega)).
\end{align*}
[/definition]
The notation $P_n f$ is designed to make sample averages look like integration. If $X_1,\dots,X_n$ are i.i.d. with distribution $P$, then $P f := \int_S f\,dP$ is the population counterpart of $P_n f$. The course studies how close $P_n f$ is to $P f$ uniformly over a class $\mathcal F$.
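The integration notation can be mirrored in a few lines of code. The following is a minimal sketch, not part of the formal development: the helper name `empirical_average` and the four data points are hypothetical choices made only for illustration. It evaluates $P_n f$ for the identity function and for the indicator of $(-\infty,1]$.

```python
from typing import Callable, Sequence


def empirical_average(f: Callable[[float], float], sample: Sequence[float]) -> float:
    """Return P_n f = (1/n) * sum_i f(X_i) for a realised sample."""
    n = len(sample)
    if n == 0:
        raise ValueError("the sample must contain at least one observation")
    return sum(f(x) for x in sample) / n


# A small realised sample (hypothetical data, chosen only for illustration).
sample = [0.2, 0.5, 0.9, 1.4]

# Identity test function: P_n f is the ordinary sample mean.
sample_mean = empirical_average(lambda x: x, sample)

# Indicator of C = (-inf, 1]: P_n f is the observed fraction of points in C.
fraction_in_C = empirical_average(lambda x: 1.0 if x <= 1.0 else 0.0, sample)
```

The same function handles both calls because the empirical measure is fixed by the sample; only the test function changes.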
[example: One Function Gives the Sample Mean]
Let $S=\mathbb R$ and let $f:S\to\mathbb R$ be the identity function $f(x)=x$. For each outcome $\omega$, the definition of the empirical measure gives
\begin{align*}
P_n f(\omega)=\frac{1}{n}\sum_{i=1}^{n} f(X_i(\omega)).
\end{align*}
Since $f(X_i(\omega))=X_i(\omega)$ for every $i$, this becomes
\begin{align*}
P_n f(\omega)=\frac{1}{n}\sum_{i=1}^{n} X_i(\omega).
\end{align*}
Thus $P_n f$ is exactly the ordinary sample mean. If $X_1,\dots,X_n$ are i.i.d. with common law $P$ and $X_1$ is integrable, then the population counterpart is
\begin{align*}
P f=\int_{\mathbb R} f(x)\,dP(x).
\end{align*}
Because $f(x)=x$ and $P$ is the law of $X_1$, the expectation-as-integration identity gives
\begin{align*}
P f=\int_{\mathbb R} x\,dP(x)=\mathbb E[X_1].
\end{align*}
So the centred empirical error for this singleton class is
\begin{align*}
P_n f-Pf=\frac{1}{n}\sum_{i=1}^{n}X_i-\mathbb E[X_1].
\end{align*}
The ordinary sample mean is therefore the empirical measure evaluated at one [test function](/page/Test%20Function); empirical process theory starts from this one-coordinate situation and asks what changes when $\{f\}$ is replaced by a large class $\mathcal F$.
[/example]
A second basic example replaces numerical averages by probabilities of events. This is often the more geometric viewpoint: the class of test functions is a class of indicators, and the empirical measure records observed frequencies.
[example: Indicators of Sets]
Let $\mathcal C\subset\mathcal S$ be a class of measurable sets, and for each $C\in\mathcal C$ define $f_C=\mathbb{1}_C$. For a fixed outcome $\omega$ and a fixed set $C$, the definition of the empirical measure gives
\begin{align*}
P_n f_C(\omega)=\frac{1}{n}\sum_{i=1}^{n} f_C(X_i(\omega)).
\end{align*}
Since $f_C=\mathbb{1}_C$, this becomes
\begin{align*}
P_n f_C(\omega)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i(\omega)).
\end{align*}
For each $i$, the term $\mathbb{1}_C(X_i(\omega))$ equals $1$ when $X_i(\omega)\in C$ and equals $0$ when $X_i(\omega)\notin C$. Hence the sum counts the observations among $X_1(\omega),\dots,X_n(\omega)$ that lie in $C$, and division by $n$ gives their observed fraction.
The same empirical measure assigns
\begin{align*}
P_n(C)(\omega)=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i(\omega)}(C).
\end{align*}
For each $i$, the Dirac mass satisfies $\delta_{X_i(\omega)}(C)=1$ if $X_i(\omega)\in C$ and $\delta_{X_i(\omega)}(C)=0$ otherwise, so
\begin{align*}
\delta_{X_i(\omega)}(C)=\mathbb{1}_C(X_i(\omega)).
\end{align*}
Therefore
\begin{align*}
P_n f_C(\omega)=P_n(C)(\omega).
\end{align*}
If the common law of the sample is $P$, then
\begin{align*}
P f_C=\int_S \mathbb{1}_C\,dP=P(C).
\end{align*}
Thus
\begin{align*}
P_n f_C-P f_C=P_n(C)-P(C).
\end{align*}
The empirical-process error for the indicator $\mathbb{1}_C$ is exactly the sampling error for the event $C$; uniform control over $C\in\mathcal C$ asks for simultaneous control of all these event-frequency errors, the form that appears in distribution-free statistics, VC theory, and learning bounds.
[/example]
These examples suggest that the size and geometry of the indexing class determine the answer. A finite class behaves like finitely many ordinary averages. Infinite classes require new tools because the supremum over the class may amplify rare fluctuations.
## Two Uniform Limit Problems
Which classical probability theorems should survive when a single function is replaced by a function class? The law of large numbers becomes a [uniform law of large numbers](/theorems/1855), and the [central limit theorem](/theorems/521) becomes weak convergence of a random element in a function space. These two upgrades are the backbone of the course.
[definition: Empirical Process]
Let $X_1,X_2,\dots$ be i.i.d. $S$-valued random variables with distribution $P$, and let $\mathcal F$ be a class of [measurable functions](/page/Measurable%20Functions) $f:S\to\mathbb R$ such that $P|f|<\infty$ for each $f\in\mathcal F$. The empirical process indexed by $\mathcal F$ is the random map
\begin{align*}
G_n : \mathcal F \to \mathbb R, \qquad G_n f := \sqrt n(P_n f-Pf).
\end{align*}
[/definition]
The scaling by $\sqrt n$ is the same scaling as in the [central limit theorem](/theorems/1848). Without it, the deviations $P_n f-Pf$ converge to zero almost surely for each fixed $f$, by the strong law of large numbers; with it, the process can have a non-degenerate Gaussian limit. Before looking for Gaussian limits, we first ask whether the unscaled deviations vanish uniformly.
[definition: Uniform Law of Large Numbers]
Let $\mathcal F$ be a class of measurable functions $f:S\to\mathbb R$ with $P|f|<\infty$ for each $f\in\mathcal F$. The class $\mathcal F$ satisfies the uniform law of large numbers under $P$ if
\begin{align*}
\sup_{f\in\mathcal F}|P_n f-Pf| \xrightarrow{\mathbb P} 0.
\end{align*}
[/definition]
This condition says that the empirical measure approximates $P$ uniformly on the test class. When the map $f\mapsto P_n f-Pf$ is bounded and measurable, these deviations are viewed as random elements of $\ell^\infty(\mathcal F)$ equipped with the supremum norm; otherwise the course uses outer-probability conventions introduced in the next chapter. Later chapters call such classes $P$-Glivenko-Cantelli classes and develop entropy and approximation criteria for verifying the condition. Once the first-order error is known to vanish, the next question is whether the rescaled error has a stable distributional limit as a whole indexed object.
[definition: P-Donsker Class]
Let $\mathcal F$ be a class of measurable functions $f:S\to\mathbb R$ with $P f^2<\infty$ for each $f\in\mathcal F$. Equip $\ell^\infty(\mathcal F)$ with the supremum norm
\begin{align*}
\|z\|_{\mathcal F}:=\sup_{f\in\mathcal F}|z(f)|.
\end{align*}
Under the course's standing measurability convention for empirical processes in $\ell^\infty(\mathcal F)$, the class $\mathcal F$ is $P$-Donsker if $(G_n)_{n\ge 1}$ converges in distribution as $\ell^\infty(\mathcal F)$-valued random elements to a tight centred Gaussian process $G_P$ satisfying
\begin{align*}
\operatorname{Cov}(G_P f,G_P g)=P(fg)-Pf\,Pg
\end{align*}
for all $f,g\in\mathcal F$.
[/definition]
This definition fixes the target space rather than treating the limit as a collection of unrelated coordinates. The remaining technical work lies in ensuring that the processes are legitimate $\ell^\infty(\mathcal F)$-valued random elements, or in replacing ordinary expectations and probabilities by their outer versions when separability is not yet available. Chapter 1 introduces these conventions before any major limit theorem is stated.
[quotetheorem:9816]
[citeproof:9816]
The finite uniform law shows the baseline: no complexity theory is needed when the class has finitely many elements, because finitely many almost sure limits can be combined on one probability-one event. The integrability assumption is not cosmetic; for instance, if $f(X_1)$ has a standard Cauchy distribution, then $P|f|=\infty$, the population mean is undefined, and the empirical average has the same Cauchy law at every sample size rather than settling down to a deterministic limit. The theorem also says nothing uniform over an infinite class, even a countable one, because intersecting countably many probability-one events only gives pointwise convergence on the listed functions and does not control the supremum over the whole class.
[example: Why Pointwise Laws Do Not Give Uniform Laws]
Let $S=[0,1]$, let $\mathcal L^1$ denote [Lebesgue measure](/page/Lebesgue%20Measure) on $\mathbb R$, let $P=\mathcal L^1\big|_{[0,1]}$, and let $\mathcal C$ be the class of all measurable sets $A\subset[0,1]$ with $P(A)=0$. Put $\mathcal F=\{\mathbb{1}_A:A\in\mathcal C\}$. For a fixed $A\in\mathcal C$ and a fixed index $i$, the assumption that $X_i$ has distribution $P$ gives
\begin{align*}
\mathbb Q(X_i\in A)=P(A)=0.
\end{align*}
By [countable subadditivity](/theorems/1108),
\begin{align*}
\mathbb Q(\exists i\ge 1\text{ such that }X_i\in A)\le \sum_{i=1}^{\infty}\mathbb Q(X_i\in A)=0.
\end{align*}
Thus, on a probability-one event depending on this fixed set $A$, every indicator $\mathbb{1}_A(X_i)$ is zero. Hence for every $n$ on that event,
\begin{align*}
P_n(A)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_A(X_i)=\frac{1}{n}\sum_{i=1}^{n}0=0=P(A).
\end{align*}
So each fixed null set satisfies the pointwise law.
The supremum over the whole class is different because the set may be chosen after the sample is observed. For a realised sample and a fixed $n$, define
\begin{align*}
A_n=\{X_1,\dots,X_n\}.
\end{align*}
This is a finite subset of $[0,1]$. Since every singleton has Lebesgue measure zero and a finite union of null sets is null,
\begin{align*}
P(A_n)=0,
\end{align*}
so $A_n\in\mathcal C$. Also $X_i\in A_n$ for each $1\le i\le n$, and therefore
\begin{align*}
P_n(A_n)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{A_n}(X_i)=\frac{1}{n}\sum_{i=1}^{n}1=1.
\end{align*}
For every $A\in\mathcal C$, $P(A)=0$ and $0\le P_n(A)\le 1$, so
\begin{align*}
|P_n(A)-P(A)|=P_n(A)\le 1.
\end{align*}
The particular choice $A=A_n$ attains this upper bound, hence
\begin{align*}
\sup_{A\in\mathcal C}|P_n(A)-P(A)|=1
\end{align*}
for every $n$. Pointwise convergence controls each predetermined null set, but it does not control a class rich enough to contain the finite null set selected from the observed data.
[/example]
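The contrast between a predetermined null set and a data-dependent one can be seen numerically. The sketch below is illustrative only: the seed, the sample size, and the three fixed points are arbitrary assumptions, and the helper `empirical_measure_of_set` is a hypothetical name. It compares $P_n(A)$ for a fixed finite set $A$ with $P_n(A_n)$ for $A_n=\{X_1,\dots,X_n\}$ under uniform sampling on $[0,1]$.

```python
import random


def empirical_measure_of_set(points: set, sample: list) -> float:
    """Return P_n(A) for a finite set A: the fraction of observations lying in A."""
    return sum(1 for x in sample if x in points) / len(sample)


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
n = 1000
sample = [rng.random() for _ in range(n)]  # i.i.d. Uniform[0, 1] draws

# A finite null set fixed before seeing the data.
fixed_null_set = {0.25, 0.5, 0.75}
# The finite null set A_n = {X_1, ..., X_n}, chosen after seeing the data.
data_dependent_null_set = set(sample)

# Both sets have P-measure zero, so each value below equals |P_n(A) - P(A)|.
deviation_fixed = empirical_measure_of_set(fixed_null_set, sample)
deviation_adaptive = empirical_measure_of_set(data_dependent_null_set, sample)
```

The adaptive deviation equals $1$ at every sample size, which is the supremum computed in the example, while the fixed set is missed by the sample with probability one.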
The preceding example separates two issues. Pointwise convergence is a coordinatewise statement, while a uniform law requires control over the largest coordinate after the data have been observed. The same distinction appears at the central-limit scale: finite-dimensional Gaussian limits are necessary, but they do not by themselves control suprema, tightness, or sample-path regularity.
The next baseline theorem records the finite-class central limit theorem in the notation of empirical processes. It identifies the covariance structure that every later Gaussian process limit must have, and it also marks the exact point where infinite classes need additional compactness or equicontinuity arguments.
[quotetheorem:6302]
[citeproof:6302]
This result is the finite-dimensional shadow of Donsker theory, and each hypothesis is carrying a different burden. The condition $P f_j^2<\infty$ makes the covariance matrix finite and places the centred sums in the ordinary Gaussian central-limit regime; if $f(X_1)$ has a Cauchy law, then $P f$ and $P f^2$ are not available in the required sense, so neither the displayed covariance nor the usual $\sqrt n$-Gaussian conclusion applies. The finite size of $\mathcal F$ supplies a separate compactness input: it lets the process be treated as a vector in $\mathbb R^m$, where tightness follows from finite-dimensional probability theory. For an infinite class, even compatible Gaussian limits for every finite subcollection leave open the additional problem of convergence in $\ell^\infty(\mathcal F)$, which is why later chapters introduce entropy and stochastic equicontinuity.
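The covariance formula $P(fg)-Pf\,Pg$ from the definition of a $P$-Donsker class can be checked by simulation for a two-element class. The following is a rough Monte Carlo sketch under stated assumptions: $P$ is taken to be uniform on $[0,1]$, the class is $\{\mathbb{1}_{(-\infty,s]},\mathbb{1}_{(-\infty,t]}\}$, and the function names, seed, sample size, and replication count are all illustrative choices. For this class the formula reduces to $\min(s,t)-st$.

```python
import math
import random


def empirical_process_at_thresholds(sample, thresholds):
    """Return [G_n f_t for t in thresholds], f_t = 1_{(-inf, t]}, P = Uniform[0, 1]."""
    n = len(sample)
    values = []
    for t in thresholds:
        p_n = sum(1 for x in sample if x <= t) / n   # P_n f_t
        values.append(math.sqrt(n) * (p_n - t))     # P f_t = t for Uniform[0, 1]
    return values


def monte_carlo_covariance(s, t, n=100, replications=4000, seed=0):
    """Estimate Cov(G_n f_s, G_n f_t) by repeated independent sampling."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(replications):
        sample = [rng.random() for _ in range(n)]
        pairs.append(empirical_process_at_thresholds(sample, [s, t]))
    mean_s = sum(a for a, _ in pairs) / replications
    mean_t = sum(b for _, b in pairs) / replications
    return sum((a - mean_s) * (b - mean_t) for a, b in pairs) / replications


s, t = 0.3, 0.7
estimated = monte_carlo_covariance(s, t)
theoretical = min(s, t) - s * t   # P(f_s f_t) - P f_s * P f_t
```

The estimate only approximates the theoretical value up to Monte Carlo error; the simulation illustrates the covariance structure and says nothing about tightness over an infinite class.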
[example: Finite-Dimensional Marginals Are Not Enough]
[claim]There are bounded processes whose every fixed finite-dimensional marginal converges to a centred Gaussian limit, while the processes do not converge in $\ell^\infty(\mathcal F)$ with the supremum norm.[/claim]
[proof]Take $\mathcal F=\mathbb N$ and define $Z_n\in\ell^\infty(\mathbb N)$ by
\begin{align*}
Z_n(k)=\mathbb{1}_{\{n\}}(k)
\end{align*}
for $k\in\mathbb N$. Fix finitely many coordinates $k_1,\dots,k_m$. If
\begin{align*}
n>\max\{k_1,\dots,k_m\},
\end{align*}
then $n\ne k_j$ for every $1\le j\le m$, so
\begin{align*}
Z_n(k_j)=\mathbb{1}_{\{n\}}(k_j)=0
\end{align*}
for every $j$. Hence, for all sufficiently large $n$,
\begin{align*}
(Z_n(k_1),\dots,Z_n(k_m))=(0,\dots,0).
\end{align*}
Thus every fixed finite-dimensional marginal converges to the degenerate centred Gaussian vector at the origin.
The supremum norm sees a different feature. Since $Z_n(n)=1$ and $Z_n(k)\in\{0,1\}$ for every $k$,
\begin{align*}
\|Z_n\|_{\mathcal F}=\sup_{k\in\mathbb N}|Z_n(k)|=1.
\end{align*}
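Moreover, if $n\ne n'$, then at coordinate $n$ we have $Z_n(n)=1$ and $Z_{n'}(n)=0$, while at coordinate $n'$ we have $Z_n(n')=0$ and $Z_{n'}(n')=1$. Therefore
\begin{align*}
\|Z_n-Z_{n'}\|_{\mathcal F}=\sup_{k\in\mathbb N}|Z_n(k)-Z_{n'}(k)|=1.
\end{align*}
So the set $\{Z_n:n\ge 1\}$ is not [totally bounded](/page/Totally%20Bounded) in $\ell^\infty(\mathbb N)$, because the open balls of radius $1/3$ around its points are pairwise disjoint. A compact subset of a [metric space](/page/Metric%20Space) is totally bounded, so no compact set can contain all the points $Z_n$. Consequently the deterministic laws $\delta_{Z_n}$ are not tight as probability measures on $\ell^\infty(\mathbb N)$. Non-convergence now follows: the coordinate projections are continuous, so any limit in distribution would have all finite-dimensional marginals equal to zero and would therefore be the zero element almost surely, whereas the continuous mapping theorem applied to the norm would then force the constant sequence $\|Z_n\|_{\mathcal F}=1$ to converge to $0$.[/proof]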
Finite-dimensional marginals only inspect finitely many coordinates before the moving spike reaches them; the supremum norm detects the spike wherever it moves. This is the gap later filled by stochastic equicontinuity, entropy bounds, or bracketing assumptions.
[/example]
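The moving spike is simple enough to tabulate directly. The sketch below uses hypothetical helper names and an arbitrary choice of coordinates; it evaluates a fixed finite marginal along the sequence and the supremum distances between spikes. Since $Z_n$ vanishes away from coordinate $n$, scanning finitely many coordinates is enough to compute each supremum.

```python
def Z(n: int, k: int) -> int:
    """Coordinate k of the moving spike Z_n = 1_{{n}}."""
    return 1 if k == n else 0


coordinates = [1, 5, 12]  # an arbitrary fixed finite set of coordinates
indices = (5, 12, 13, 100)

# The finite-dimensional marginal is zero as soon as n exceeds max(coordinates).
marginals = {n: [Z(n, k) for k in coordinates] for n in indices}

# Z_n is supported on {n}, so scanning k = 1..n captures the supremum norm.
sup_norms = {n: max(abs(Z(n, k)) for k in range(1, n + 1)) for n in indices}


def sup_distance(n: int, m: int) -> int:
    """Supremum distance between Z_n and Z_m, scanned up to max(n, m)."""
    return max(abs(Z(n, k) - Z(m, k)) for k in range(1, max(n, m) + 1))
```

For instance, `marginals[13]` and `marginals[100]` are both `[0, 0, 0]`, while every entry of `sup_norms` is `1` and `sup_distance(3, 8)` is `1`.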
This example is a warning about the topology in which convergence is being claimed. Finite-dimensional marginals determine the covariance structure of any possible limit, but they do not specify which functions should be close to each other or how oscillations between nearby coordinates are controlled. Later chapters therefore develop metrics and chaining methods that extract the missing tightness from the geometry of $\mathcal F$.
## Complexity of Function Classes
Why do some infinite classes behave like finite classes while others do not? The answer is not cardinality alone. A class may be uncountable but well approximated by finitely many brackets or nets, while a countable class may still contain too many separated directions for [uniform convergence](/page/Uniform%20Convergence).
[definition: Envelope Function]
Let $\mathcal F$ be a class of measurable functions $f:S\to\mathbb R$. A measurable function $F:S\to[0,\infty]$ is an envelope for $\mathcal F$ if
\begin{align*}
|f(x)|\le F(x)
\end{align*}
for every $f\in\mathcal F$ and every $x\in S$.
[/definition]
An envelope controls the size of the functions but not the number of distinguishable functions. For uniform laws and central limit theorems, the course will combine envelope assumptions with complexity measures such as covering numbers, bracketing numbers, VC dimension, and entropy integrals.
[example: Threshold Indicators on the Real Line]
Let $S=\mathbb R$ and let
\begin{align*}
\mathcal F=\{\mathbb{1}_{(-\infty,t]}:t\in\mathbb R\}.
\end{align*}
For a fixed $t\in\mathbb R$, put $f_t=\mathbb{1}_{(-\infty,t]}$. By the definition of the empirical process,
\begin{align*}
G_n f_t=\sqrt n(P_n f_t-Pf_t).
\end{align*}
The empirical term is
\begin{align*}
P_n f_t=\frac{1}{n}\sum_{i=1}^{n} f_t(X_i).
\end{align*}
Since $f_t(x)=\mathbb{1}_{(-\infty,t]}(x)$, this becomes
\begin{align*}
P_n f_t=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{(-\infty,t]}(X_i).
\end{align*}
For each $i$, $\mathbb{1}_{(-\infty,t]}(X_i)=1$ exactly when $X_i\le t$ and is $0$ otherwise, so
\begin{align*}
P_n f_t=P_n((-\infty,t]).
\end{align*}
Writing
\begin{align*}
F_n(t):=P_n((-\infty,t])
\end{align*}
therefore gives
\begin{align*}
P_n f_t=F_n(t).
\end{align*}
The population term is
\begin{align*}
P f_t=\int_{\mathbb R}\mathbb{1}_{(-\infty,t]}(x)\,dP(x).
\end{align*}
By the defining property of indicator functions under integration,
\begin{align*}
P f_t=P((-\infty,t]).
\end{align*}
Writing
\begin{align*}
F(t):=P((-\infty,t])
\end{align*}
therefore gives
\begin{align*}
P f_t=F(t).
\end{align*}
Substituting these two identities into the definition of $G_n f_t$ yields
\begin{align*}
G_n\mathbb{1}_{(-\infty,t]}=\sqrt n(F_n(t)-F(t)).
\end{align*}
Thus the empirical process indexed by threshold indicators is exactly the centred and scaled empirical distribution function. The class is uncountable, but the sets $(-\infty,t]$ are nested in $t$, so the sample cannot realise arbitrary membership patterns across the class; this order structure is why thresholds form the basic real-line example behind the *[Glivenko-Cantelli theorem](/theorems/2004)* and *Donsker's theorem*.
[/example]
The threshold example explains why geometry matters. The sets $(-\infty,t]$ are nested, so the sample cannot realise an arbitrary pattern of memberships across the class. To quantify this obstruction for a general class of sets, we first describe what it means for a class to realise every possible labelling of a finite set.
[definition: Shattering]
Let $\mathcal C$ be a class of subsets of a set $S$. A finite set $A\subset S$ is shattered by $\mathcal C$ if for every subset $B\subset A$ there exists $C\in\mathcal C$ such that
\begin{align*}
C\cap A=B.
\end{align*}
[/definition]
Shattering measures how many dichotomies the class can realise on finite samples. If large finite sets can be shattered, uniform control is difficult because the class can fit many data-dependent patterns. This motivates the VC dimension, the numerical summary that records the size of the largest shattered finite set.
[definition: VC Dimension]
Let $\mathcal C$ be a class of subsets of a set $S$. The VC dimension of $\mathcal C$ is the element of $\mathbb N\cup\{0,\infty\}$ defined by
\begin{align*}
\operatorname{VC}(\mathcal C):=\sup\{|A|:A\subset S\text{ is finite and shattered by }\mathcal C\}.
\end{align*}
[/definition]
The course uses VC dimension as one route from combinatorics to probability. Other routes are analytic rather than combinatorial: they approximate functions in $L^p(P)$ metrics and measure how many balls or brackets are required.
[example: Thresholds Have VC Dimension One]
Let $\mathcal C=\{(-\infty,t]:t\in\mathbb R\}$. First take a one-point set $A=\{x\}\subset\mathbb R$. To show that $A$ is shattered, we must realise both subsets of $A$, namely $\varnothing$ and $\{x\}$. If $t<x$, then $x\notin(-\infty,t]$, so
\begin{align*}
(-\infty,t]\cap\{x\}=\varnothing.
\end{align*}
If $t=x$, then $x\in(-\infty,x]$, so
\begin{align*}
(-\infty,x]\cap\{x\}=\{x\}.
\end{align*}
Thus every one-point set is shattered, and therefore $\operatorname{VC}(\mathcal C)\ge 1$.
Now take any two-point set $A=\{x_1,x_2\}$ with $x_1<x_2$. If a threshold set $(-\infty,t]$ contains $x_2$, then $x_2\le t$. Since $x_1<x_2$, this implies $x_1\le t$, so $x_1\in(-\infty,t]$ as well. Hence no threshold can satisfy
\begin{align*}
(-\infty,t]\cap\{x_1,x_2\}=\{x_2\}.
\end{align*}
The subset $\{x_2\}\subset A$ cannot be realised, so $A$ is not shattered. Since every two-point subset of $\mathbb R$ has this form after relabelling, no set of size $2$ is shattered; then no larger finite set is shattered either, because shattering a larger set would imply shattering each of its two-point subsets by restriction. Therefore
\begin{align*}
\operatorname{VC}(\mathcal C)=1.
\end{align*}
Thresholds on the real line are low-complexity because their nested order prevents them from containing a larger point while excluding a smaller one.
[/example]
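The shattering argument for thresholds can be verified by brute force on a finite ground set. This is a small illustrative sketch with hypothetical function names; it only explores subsets of the finite ground set supplied, assumes the points are distinct, and so confirms the definition on examples rather than proving the general statement.

```python
from itertools import combinations


def realised_patterns(points):
    """Return all traces of threshold sets (-inf, t] on `points`, as frozensets."""
    ordered = sorted(points)
    # A threshold below every point gives the empty trace; every other trace
    # is already realised by a threshold placed exactly at one of the points.
    patterns = {frozenset()}
    for t in ordered:
        patterns.add(frozenset(x for x in ordered if x <= t))
    return patterns


def is_shattered(points):
    """Check whether thresholds realise every subset of the finite set `points`."""
    return len(realised_patterns(points)) == 2 ** len(points)


def largest_shattered_size(ground_set):
    """Brute-force the largest size of a shattered subset of a finite ground set."""
    best = 0
    for size in range(1, len(ground_set) + 1):
        if any(is_shattered(subset) for subset in combinations(ground_set, size)):
            best = size
    return best
```

A set of $m$ distinct points carries only $m+1$ threshold traces, which equals $2^m$ only for $m\le 1$; calling `largest_shattered_size([0.1, 0.4, 0.7, 2.5])` therefore returns `1`.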
## Symmetrisation, Concentration, and Chaining
How are deterministic complexity measures converted into probabilistic bounds? The course repeatedly uses three mechanisms: symmetrisation replaces empirical fluctuations by Rademacher averages, concentration controls deviations around typical size, and chaining adds fluctuations across many scales instead of using a single crude union bound.
[definition: Rademacher Variables]
A sequence $(\varepsilon_i)_{i\ge 1}$ consists of Rademacher variables on a probability space $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ if each $\varepsilon_i:\Omega_\varepsilon\to\{-1,1\}$ is measurable, the variables are independent, and
\begin{align*}
\mathbb P_\varepsilon(\varepsilon_i=1)=\mathbb P_\varepsilon(\varepsilon_i=-1)=\frac{1}{2}
\end{align*}
for every $i\ge 1$.
[/definition]
Rademacher variables create a random sign version of the empirical process. Conditional on the data, the randomness is symmetric, which is often easier to analyse than the original centred process. The following inequality is the standard reduction that turns expected empirical suprema into expected signed suprema.
[quotetheorem:9817]
[citeproof:9817]
Symmetrisation is usually the first probabilistic reduction in the course. Its hypotheses hide two important restrictions. First, the displayed expectations must be well-defined; for arbitrary uncountable classes, the supremum may fail to be measurable, and later chapters handle this by outer expectation, measurable majorants, or separability assumptions. Second, finiteness matters: if the envelope has infinite first moment, the expected supremum on either side can be infinite, so the inequality gives no useful bound even though it remains formally suggestive. The result is also only an expectation comparison. It does not by itself give high-probability bounds, almost sure convergence, or tightness in $\ell^\infty(\mathcal F)$; those require later concentration inequalities, entropy estimates, and chaining arguments.
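A simulation can make the expectation comparison concrete. The sketch below assumes that the symmetrisation inequality takes the common form $\mathbb E\sup_{f}|P_n f-Pf|\le 2\,\mathbb E\sup_{f}\bigl|\tfrac1n\sum_{i}\varepsilon_i f(X_i)\bigr|$, and it specialises to threshold indicators with $P$ uniform on $[0,1]$. The function names, seed, sample size, and replication count are illustrative assumptions, and the output is a Monte Carlo estimate of both sides rather than a proof.

```python
import random


def sup_centred_deviation(sample):
    """sup_t |F_n(t) - t| for Uniform[0, 1] data, evaluated at the order statistics."""
    n = len(sample)
    ordered = sorted(sample)
    # F_n jumps at each order statistic, so the supremum is attained at a jump
    # or approached from the left just before one.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(ordered))


def sup_rademacher_average(sample, signs):
    """sup_t |(1/n) sum_i eps_i 1{X_i <= t}|: the largest absolute partial sum
    of the signs, taken in the order of the sorted sample, divided by n."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    partial, best = 0, 0
    for i in order:
        partial += signs[i]
        best = max(best, abs(partial))
    return best / n


def estimate_both_sides(n=50, replications=2000, seed=0):
    """Monte Carlo estimates of the empirical side and twice the Rademacher side."""
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(replications):
        sample = [rng.random() for _ in range(n)]
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        lhs += sup_centred_deviation(sample)
        rhs += 2 * sup_rademacher_average(sample, signs)
    return lhs / replications, rhs / replications


lhs, rhs = estimate_both_sides()
```

For thresholds the supremum is a maximum over finitely many order statistics, so no measurability issue arises here; that convenience is special to this class.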
[example: Non-Measurable Supremum]
Let $S=[0,1]$ with its Lebesgue $\sigma$-algebra, let $P$ be Lebesgue measure on $S$, and choose a set $V\subset[0,1]$ that is not Lebesgue measurable. Define
\begin{align*}
\mathcal F=\{\mathbb{1}_{\{v\}}:v\in V\}.
\end{align*}
For each fixed $v\in V$, the singleton $\{v\}$ is closed in $[0,1]$, hence Borel, hence Lebesgue measurable. Therefore $\mathbb{1}_{\{v\}}$ is a measurable function on $S$, so every member of $\mathcal F$ is measurable.
Let $X$ be the identity [random variable](/page/Random%20Variable) on $([0,1],\mathcal L,P)$, where $\mathcal L$ denotes the Lebesgue $\sigma$-algebra. Thus $X(\omega)=\omega$ for every $\omega\in[0,1]$. For a fixed $\omega$, the indexed values are
\begin{align*}
\mathbb{1}_{\{v\}}(X(\omega))=\mathbb{1}_{\{v\}}(\omega).
\end{align*}
If $\omega\in V$, then taking $v=\omega$ gives
\begin{align*}
\mathbb{1}_{\{\omega\}}(\omega)=1,
\end{align*}
and all indicator values are at most $1$, so
\begin{align*}
\sup_{f\in\mathcal F} f(X(\omega))=1.
\end{align*}
If $\omega\notin V$, then $\omega\ne v$ for every $v\in V$, so
\begin{align*}
\mathbb{1}_{\{v\}}(\omega)=0
\end{align*}
for every $v\in V$, and hence
\begin{align*}
\sup_{f\in\mathcal F} f(X(\omega))=0.
\end{align*}
Combining the two cases gives the pointwise identity
\begin{align*}
\sup_{f\in\mathcal F} f(X)=\mathbb{1}_V.
\end{align*}
If this supremum were a measurable random variable, then the set
\begin{align*}
\{\omega\in[0,1]:\mathbb{1}_V(\omega)>1/2\}
\end{align*}
would be Lebesgue measurable. But this set is exactly $V$, contradicting the choice of $V$ as non-Lebesgue-measurable. Thus an uncountable supremum of measurable coordinate functions need not be a measurable random variable.
[/example]
The measurability example isolates a logical obstruction: the supremum may fail to be a random variable at all. A different obstruction remains even for perfectly measurable classes, namely that the quantities being compared can be infinite and therefore useless for later bounds.
[example: A Finiteness Failure]
Let $\mathcal F=\{f\}$ contain a single measurable function such that $\mathbb E[|f(X_1)|]=\infty$, and take $n=1$. The Rademacher side of the symmetrisation bound is
\begin{align*}
2\mathbb E\left[\sup_{g\in\mathcal F}|\varepsilon_1 g(X_1)|\right].
\end{align*}
Since $\mathcal F$ has only the function $f$, the supremum over $\mathcal F$ is just the value at $f$:
\begin{align*}
\sup_{g\in\mathcal F}|\varepsilon_1 g(X_1)|=|\varepsilon_1 f(X_1)|.
\end{align*}
A Rademacher variable takes only the values $-1$ and $1$, so $|\varepsilon_1|=1$. Therefore
\begin{align*}
|\varepsilon_1 f(X_1)|=|\varepsilon_1|\,|f(X_1)|=|f(X_1)|.
\end{align*}
Taking expectations gives
\begin{align*}
2\mathbb E\left[\sup_{g\in\mathcal F}|\varepsilon_1 g(X_1)|\right]=2\mathbb E[|f(X_1)|]=\infty.
\end{align*}
Thus even for a singleton class, the symmetrised complexity can be infinite when the function is not integrable. The inequality then provides no finite bound on the empirical fluctuation, which is why envelope integrability assumptions are imposed before symmetrisation is used in Glivenko-Cantelli or Donsker arguments.
[/example]
After the reduction is legitimate, the problem becomes the study of random signed sums indexed by $\mathcal F$. A direct union bound can handle a finite net, but it wastes information when many functions are close in the natural $L^2(P)$ or covariance metric. The next idea is to organise the class by successive approximations, so that the bound pays separately for coarse location and fine increments.
[explanation: From Union Bounds to Chaining]
A finite class $\mathcal F$ can often be handled by bounding each $f\in\mathcal F$ and taking a union bound. For a large class, a single finite approximation may be too wasteful because it treats nearby functions as unrelated. Chaining instead chooses approximations at many resolutions and writes each function as a telescoping sum of increments between successive approximations. The number of points at each scale is measured by covering numbers, while the size of each increment is controlled by a metric such as $d_P(f,g)=(P(f-g)^2-(Pf-Pg)^2)^{1/2}$. This multiscale viewpoint is what allows entropy integrals to appear in maximal inequalities and Donsker criteria.
[/explanation]
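The telescoping step can be written out explicitly. As a sketch under assumed notation not used elsewhere in this chapter, let $\pi_j f$ denote a nearest point to $f$ in a finite $2^{-j}$-net of $\mathcal F$ for $d_P$, with $N_j$ points in the net. For any final level $J\ge 1$,
\begin{align*}
f-\pi_0 f=\sum_{j=1}^{J}(\pi_j f-\pi_{j-1} f)+(f-\pi_J f),
\end{align*}
and linearity of $G_n$ gives
\begin{align*}
\sup_{f\in\mathcal F}|G_n f-G_n\pi_0 f|\le \sum_{j=1}^{J}\sup_{f\in\mathcal F}|G_n(\pi_j f-\pi_{j-1} f)|+\sup_{f\in\mathcal F}|G_n(f-\pi_J f)|.
\end{align*}
By the triangle inequality, each link satisfies
\begin{align*}
d_P(\pi_j f,\pi_{j-1} f)\le d_P(\pi_j f,f)+d_P(f,\pi_{j-1} f)\le 2^{-j}+2^{-(j-1)}=3\cdot 2^{-j},
\end{align*}
and at level $j$ there are at most $N_jN_{j-1}$ distinct links. Each supremum in the sum is therefore a maximum over finitely many small increments, which is the situation a union bound handles well.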
The chaining philosophy also explains why empirical process theory is useful beyond probability for its own sake. Statistical estimators are often defined by optimising random criterion functions over large parameter spaces; learning algorithms minimise empirical risk over hypothesis classes. In both cases, uniform control of empirical processes converts pointwise convergence into statements about procedures selected from the data. The same estimates also connect to Banach-space geometry through suprema of random linear functionals, to [approximation theory](/page/Approximation%20Theory) through metric entropy, and to [convex geometry](/page/Convex%20Geometry) through Gaussian width and related complexity parameters.
## Applications and Roadmap
Where will the technical theory be used? The course is organised around a progression from empirical measures to uniform convergence, then from Gaussian limits to statistical applications. Each later topic returns to the same basic object $f\mapsto P_n f-Pf$ with a richer class $\mathcal F$ and a sharper complexity estimate.
[example: Empirical Risk Minimisation]
Let $\mathcal H$ be a class of prediction rules, and for each $h\in\mathcal H$ let $\ell_h:S\to\mathbb R$ be its loss. Write the population risk and empirical risk as
\begin{align*}
R(h):=P\ell_h
\end{align*}
and
\begin{align*}
R_n(h):=P_n\ell_h.
\end{align*}
Set
\begin{align*}
\Delta_n:=\sup_{h\in\mathcal H}|R_n(h)-R(h)|.
\end{align*}
Suppose $\hat h_n\in\mathcal H$ is an empirical minimiser up to optimisation error $\eta_n\ge 0$, meaning
\begin{align*}
R_n(\hat h_n)\le \inf_{h\in\mathcal H}R_n(h)+\eta_n.
\end{align*}
Since $\hat h_n\in\mathcal H$, the definition of $\Delta_n$ gives
\begin{align*}
R(\hat h_n)\le R_n(\hat h_n)+\Delta_n.
\end{align*}
Using the optimisation inequality,
\begin{align*}
R_n(\hat h_n)+\Delta_n\le \inf_{g\in\mathcal H}R_n(g)+\eta_n+\Delta_n.
\end{align*}
Since $R_n(g)\le R(g)+\Delta_n$ for every $g\in\mathcal H$, taking infima over $g$ gives
\begin{align*}
\inf_{g\in\mathcal H}R_n(g)\le \inf_{g\in\mathcal H}R(g)+\Delta_n.
\end{align*}
Combining the last three displays yields
\begin{align*}
R(\hat h_n)\le \inf_{g\in\mathcal H}R(g)+2\Delta_n+\eta_n.
\end{align*}
Equivalently,
\begin{align*}
P\ell_{\hat h_n}-\inf_{h\in\mathcal H}P\ell_h\le 2\sup_{h\in\mathcal H}|P_n\ell_h-P\ell_h|+\eta_n.
\end{align*}
Thus if the uniform empirical error converges to zero in probability and the optimisation error also converges to zero in probability, then the excess population risk of $\hat h_n$ converges to zero in probability. This is the empirical-process form of the basic generalisation bound: uniform control of $P_n-P$ transfers empirical near-optimality into population near-optimality.
[/example]
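The bound $R(\hat h_n)-\inf_{g}R(g)\le 2\Delta_n+\eta_n$ is deterministic once the data are fixed, so it can be checked on a toy problem. Everything in the sketch below is an illustrative assumption: $X$ is uniform on $[0,1]$, the label is $\mathbb{1}\{X>0.5\}$ flipped independently with probability $0.1$, the hypotheses are threshold rules on a grid of $21$ points with $0$-$1$ loss, and the minimisation is exact, so $\eta_n=0$.

```python
import random


def population_risk(t, noise=0.1):
    """R(h_t) for h_t(x) = 1{x > t}, X ~ Uniform[0, 1], and labels
    Y = 1{X > 0.5} flipped independently with probability `noise`."""
    return noise + (1 - 2 * noise) * abs(t - 0.5)


def empirical_risk(t, data):
    """R_n(h_t): fraction of misclassified observations under 0-1 loss."""
    return sum(1 for x, y in data if (1 if x > t else 0) != y) / len(data)


rng = random.Random(0)  # fixed seed, an arbitrary choice
n, noise = 500, 0.1
data = []
for _ in range(n):
    x = rng.random()
    clean = 1 if x > 0.5 else 0
    y = 1 - clean if rng.random() < noise else clean
    data.append((x, y))

thresholds = [k / 20 for k in range(21)]   # finite hypothesis class
emp = {t: empirical_risk(t, data) for t in thresholds}
pop = {t: population_risk(t, noise) for t in thresholds}

delta_n = max(abs(emp[t] - pop[t]) for t in thresholds)   # uniform deviation
t_hat = min(thresholds, key=emp.get)                       # exact ERM, eta_n = 0
excess_risk = pop[t_hat] - min(pop.values())
```

Whatever sample is drawn, `excess_risk` cannot exceed `2 * delta_n`; the probabilistic work lies entirely in showing that `delta_n` is small for rich classes.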
The same template appears in asymptotic statistics, where estimators solve estimating equations or optimise likelihoods. If a criterion admits a local expansion around its population minimiser, uniform control of the empirical process justifies replacing the random criterion by its deterministic limit plus a Gaussian fluctuation. Functional central limit theorems then identify limiting distributions of plug-in estimators, quantile estimators, goodness-of-fit statistics, and bootstrap analogues.
[example: Kolmogorov-Smirnov Statistics]
Let
\begin{align*}
\mathcal F=\{\mathbb{1}_{(-\infty,t]}:t\in\mathbb R\}.
\end{align*}
For $f_t=\mathbb{1}_{(-\infty,t]}$, the definition of the empirical process gives
\begin{align*}
G_n f_t=\sqrt n(P_n f_t-Pf_t).
\end{align*}
The empirical term is
\begin{align*}
P_n f_t=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{(-\infty,t]}(X_i).
\end{align*}
Since $\mathbb{1}_{(-\infty,t]}(X_i)=1$ exactly when $X_i\le t$ and is $0$ otherwise, this is the empirical distribution function at $t$:
\begin{align*}
P_n f_t=F_n(t).
\end{align*}
Similarly, integration of an indicator gives the probability of its set, so
\begin{align*}
P f_t=\int_{\mathbb R}\mathbb{1}_{(-\infty,t]}(x)\,dP(x)=P((-\infty,t])=F(t).
\end{align*}
Substituting these two identities into $G_n f_t$ yields
\begin{align*}
G_n f_t=\sqrt n(F_n(t)-F(t)).
\end{align*}
The supremum norm on $\ell^\infty(\mathcal F)$ is
\begin{align*}
\|G_n\|_{\mathcal F}=\sup_{f\in\mathcal F}|G_n f|.
\end{align*}
Because every $f\in\mathcal F$ has the form $f_t$ for some $t\in\mathbb R$, this becomes
\begin{align*}
\|G_n\|_{\mathcal F}=\sup_{t\in\mathbb R}|G_n f_t|.
\end{align*}
Using the displayed formula for $G_n f_t$,
\begin{align*}
\|G_n\|_{\mathcal F}=\sup_{t\in\mathbb R}\left|\sqrt n(F_n(t)-F(t))\right|.
\end{align*}
Since $\sqrt n>0$, the positive scalar factors out of the supremum:
\begin{align*}
\|G_n\|_{\mathcal F}=\sqrt n\sup_{t\in\mathbb R}|F_n(t)-F(t)|.
\end{align*}
Thus the Kolmogorov-Smirnov statistic is exactly the supremum norm of the empirical process indexed by threshold indicators. Under *Donsker's theorem* for this class, the process converges weakly to a Brownian bridge after the probability integral transform, that is, to $t\mapsto B(F(t))$ for a standard Brownian bridge $B$ on $[0,1]$. The limiting statistic is therefore $\sup_{t\in\mathbb R}|B(F(t))|$, which equals $\sup_{u\in[0,1]}|B(u)|$ when $F$ is continuous; this is the source of the distribution-free limiting law.
[/example]
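The identity $\|G_n\|_{\mathcal F}=\sqrt n\sup_t|F_n(t)-F(t)|$ can be evaluated exactly from the order statistics, because $F_n$ only jumps at sample points. The sketch below is illustrative: the function name `ks_statistic` and the choice of an exponential sample are assumptions made for the demonstration. It also checks the finite-sample counterpart of the distribution-free property: for a continuous $F$, the statistic computed from $X_i$ with $F$ equals the statistic computed from $F(X_i)$ with the uniform distribution function.

```python
import math
import random

def ks_statistic(sample, cdf):
    """Return sqrt(n) * sup_t |F_n(t) - F(t)| for a continuous cdf F."""
    xs = sorted(sample)
    n = len(xs)
    sup_dev = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n jumps from (i-1)/n to i/n at the i-th order statistic,
        # so the supremum is attained at or just before a sample point.
        sup_dev = max(sup_dev, i / n - f, f - (i - 1) / n)
    return math.sqrt(n) * sup_dev

def exponential_cdf(x):
    """Distribution function of the standard exponential law."""
    return 1.0 - math.exp(-x)

rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(500)]

direct = ks_statistic(sample, exponential_cdf)
# Probability integral transform: F(X_i) is uniform on [0, 1].
transformed = ks_statistic([exponential_cdf(x) for x in sample], lambda u: u)
print(direct, transformed)
```

The two printed values agree because a nondecreasing $F$ preserves the ordering of the sample, so both calls compare the same numbers $F(X_{(i)})$ with the same steps $i/n$.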
The geometric vocabulary used later is another way to measure the same difficulty. Covering and bracketing numbers quantify how well $\mathcal F$ can be approximated in $L^p(P)$ metrics, while Gaussian width and Rademacher complexity quantify the expected size of random linear functionals over the class. These quantities connect learning bounds, Banach-space geometry, approximation theory, and convex geometry because each asks how large a random fluctuation can become when tested against all admissible directions in $\mathcal F$.
[remark: What This Course Treats as Background]
The notes assume measure-theoretic probability, including expectation as integration, convergence in distribution, [conditional expectation](/page/Conditional%20Expectation) at a basic level, and standard laws of large numbers and central limit theorems. They also use metric-space weak convergence, compactness and [total boundedness](/page/Total%20Boundedness), elementary functional analysis, and standard concentration inequalities. The course recalls specialised conventions such as outer probability and measurable majorants when they first become necessary.
[/remark]
The chapter that follows begins the formal development. It introduces $P_n$, $G_n$, finite-dimensional convergence, covariance semimetrics, and the function space $\ell^\infty(\mathcal F)$. The rest of the course then builds the two main answers: Glivenko-Cantelli theory for uniform laws of large numbers, and Donsker theory for functional central limit theorems.
# 1. Empirical Measures and Indexed Processes
Empirical process theory begins with a simple statistical object: the empirical distribution of an i.i.d. sample. The point of the first chapter is to recast that object as a random linear functional acting on many test functions at once. This viewpoint turns pointwise laws such as the central limit theorem into questions about random elements of function spaces, where measurability and indexing choices become part of the mathematics.
The chapter starts from empirical averages, then enlarges the index set from one function to a class of sets or functions. It ends with the finite-dimensional theory: every finite subcollection of square-integrable functions satisfies an ordinary [multivariate central limit theorem](/theorems/1854), and its covariance structure is governed by the $L^2(P)$ variance of function differences.
## Empirical Measures as Random Functionals
The first question is how to describe the random sample without committing to a particular statistic. If we only record pointwise statistics, each new mean, quantile, or distribution-function value has to be treated as a separate object, and there is no unified way to compare many statistics simultaneously. Instead, we package all empirical averages into a single random probability measure.
[definition: Empirical Measure]
Let $(S, \mathcal A)$ be a measurable space, let $(\Omega,\mathcal G,\mathbb P)$ be the underlying probability space, let $X_1, X_2, \dots, X_n:\Omega\to S$ be $S$-valued random variables, and let $\delta_x$ denote the point mass at $x \in S$. The empirical measure based on $X_1, \dots, X_n$ is the map
\begin{align*}
P_n:\Omega\to \operatorname{Prob}(S,\mathcal A), \qquad P_n(\omega):=\frac{1}{n}\sum_{i=1}^n \delta_{X_i(\omega)}.
\end{align*}
[/definition]
For each sample outcome $\omega\in\Omega$, the associated empirical integral is the real-valued map on measurable functions finite at $X_1(\omega),\dots,X_n(\omega)$ given by
\begin{align*}
f\mapsto P_n(\omega)f:=\int f\,dP_n(\omega)=\frac{1}{n}\sum_{i=1}^n f(X_i(\omega)).
\end{align*}
This formula is pathwise: for a fixed $\omega$, $P_n(\omega)$ is an ordinary probability measure on $(S,\mathcal A)$. To treat $P_n$ as a probability-measure-valued random element, one must also choose a measurable structure on $\operatorname{Prob}(S,\mathcal A)$, such as the evaluation $\sigma$-algebra generated by maps $\mu\mapsto \mu(A)$ for $A\in\mathcal A$. In this chapter we avoid that structure by evaluating $P_n$ only on sets and functions, so that each statistic of interest becomes the evaluation of $P_n$ at a test function. If the common distribution of the observations is $P$, then $P f = \mathbb E[f(X_1)]$ is the population version of the same average.
[example: Sample Mean as Empirical Integral]
Let $S=\mathbb R$, let $X_1,\dots,X_n$ be real-valued, and take the coordinate function $f(x)=x$. For a fixed sample outcome $\omega$, the empirical integral is
\begin{align*}
P_n(\omega)f=\int x\,dP_n(\omega)(x).
\end{align*}
Using the definition $P_n(\omega)=n^{-1}\sum_{i=1}^n\delta_{X_i(\omega)}$ and the identity $\int f\,d\delta_y=f(y)$ for a point mass,
\begin{align*}
P_n(\omega)f=\int x\,d\left(\frac{1}{n}\sum_{i=1}^n\delta_{X_i(\omega)}\right)(x).
\end{align*}
By linearity of the integral with respect to finite sums of measures,
\begin{align*}
P_n(\omega)f=\frac{1}{n}\sum_{i=1}^n\int x\,d\delta_{X_i(\omega)}(x).
\end{align*}
Since $\int x\,d\delta_{X_i(\omega)}(x)=X_i(\omega)$, this becomes
\begin{align*}
P_n(\omega)f=\frac{1}{n}\sum_{i=1}^n X_i(\omega).
\end{align*}
Thus $P_n f=n^{-1}\sum_{i=1}^n X_i$ is the usual sample mean. If $X_1$ is integrable and $P$ is its distribution, then
\begin{align*}
P f=\int x\,dP(x)=\mathbb E[X_1].
\end{align*}
Therefore the classical centred sample-mean error is exactly $P_n f-Pf$, so studying the empirical measure at the single coordinate function $f(x)=x$ recovers the usual one-dimensional statistic.
[/example]
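The pathwise formula $P_n(\omega)f=n^{-1}\sum_i f(X_i(\omega))$ translates directly into code: the realised sample is fixed and the test function is the argument. The sketch below is only an illustration; the helper name `empirical_integral` and the four sample values are made up for the demonstration.

```python
def empirical_integral(sample, f):
    """Evaluate P_n f = (1/n) * sum_i f(X_i) for one realised sample."""
    return sum(f(x) for x in sample) / len(sample)

sample = [1.0, 2.0, 3.0, 6.0]   # a hypothetical realised sample

# The same empirical measure, evaluated at three different test functions.
sample_mean = empirical_integral(sample, lambda x: x)              # f(x) = x
second_moment = empirical_integral(sample, lambda x: x * x)        # f(x) = x^2
edf_at_2 = empirical_integral(sample, lambda x: float(x <= 2.0))   # indicator of (-inf, 2]

print(sample_mean, second_moment, edf_at_2)
```

Passing the function as an argument mirrors the viewpoint of the chapter: one object $P_n$ produces the sample mean, an empirical moment, or a value of the empirical distribution function, depending only on which $f$ it is evaluated at.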
The example shows that centring $P_n f$ at $P f$ recovers the classical error of a sample average. To compare this error with a central limit theorem and to repeat the comparison over a whole class of possible statistics, we introduce a centred and scaled random map rather than a single centred number.
[definition: Empirical Process Indexed by Functions]
Let $X_1,X_2,\dots$ be i.i.d. with distribution $P$ on $(S,\mathcal A)$, and let $\mathcal F$ be a class of measurable functions $f:S\to\mathbb R$ with $P|f|<\infty$. The empirical process indexed by $\mathcal F$ is the sample-dependent map
\begin{align*}
G_n:\Omega\to\mathbb R^{\mathcal F}, \qquad G_n(\omega):\mathcal F\to\mathbb R, \qquad G_n(\omega)(f):=\sqrt n(P_n(\omega) f-Pf).
\end{align*}
[/definition]
Equivalently, $G_n$ may be regarded as the map $G_n:\Omega\to\mathbb R^{\mathcal F}$ whose $f$-coordinate is
\begin{align*}
G_n(f) := \sqrt n(P_n f-Pf), \qquad f\in\mathcal F.
\end{align*}
The class $\mathcal F$ determines which statistical questions are being asked. For one function, $G_n$ is a single real random variable; for many functions, it is a random map on $\mathcal F$. This is the first functional-analytic shift in the course: the object of study is no longer a number but a random element of a coordinate space, and later of $\ell^\infty(\mathcal F)$.
The function-indexed definition has therefore given the right scaling and centring, and it has made empirical averages look like coordinates of one random map. The next source of indices is more concrete. Distribution functions, classification rules, and event-frequency estimates are usually written in terms of sets: threshold events $\{X\le t\}$, membership in a measurable region, or a whole class of decision regions. For such questions the statistic of interest is the full collection of deviations $P_n(C)-P(C)$ as $C$ varies, so a set-indexed definition is needed to keep the notation aligned with the statistical object being estimated. Naming this version makes empirical distribution functions and empirical event probabilities part of the same process notation, while keeping the option to return to functions through the indicator embedding $C\mapsto \mathbf{1}_C$.
[definition: Empirical Process Indexed by Sets]
Let $\mathcal C\subseteq\mathcal A$ be a class of measurable sets. The empirical process indexed by $\mathcal C$ is the sample-dependent map
\begin{align*}
G_n:\Omega\to\mathbb R^{\mathcal C}, \qquad G_n(\omega):\mathcal C\to\mathbb R, \qquad G_n(\omega)(C):=\sqrt n(P_n(\omega)(C)-P(C)).
\end{align*}
[/definition]
Equivalently, $G_n$ may be regarded as the map $G_n:\Omega\to\mathbb R^{\mathcal C}$ whose $C$-coordinate is
\begin{align*}
G_n(C):=\sqrt n(P_n(C)-P(C)), \qquad C\in\mathcal C.
\end{align*}
Set-indexing is a special case of function-indexing by taking indicator functions. This translation is used constantly because it lets combinatorial properties of sets be treated as properties of a function class.
[example: Bernoulli Indicators on Intervals]
Let $S=\mathbb R$, let $X_1,\dots,X_n$ be i.i.d. with distribution $P$, and write $C_t=(-\infty,t]$ for $t\in\mathbb R$. Since $\mathbf{1}_{C_t}(X_i)$ takes only the values $0$ and $1$, and since
\begin{align*}
\mathbb P(\mathbf{1}_{C_t}(X_i)=1)=\mathbb P(X_i\in C_t)=P(C_t)=F(t),
\end{align*}
the random variable $\mathbf{1}_{C_t}(X_i)$ is Bernoulli with success probability $F(t)$.
For this half-line, the empirical measure gives
\begin{align*}
P_n(C_t)=\frac{1}{n}\sum_{i=1}^n \delta_{X_i}(C_t).
\end{align*}
Because $\delta_{X_i}(C_t)=1$ when $X_i\in C_t$ and $\delta_{X_i}(C_t)=0$ otherwise, we have
\begin{align*}
\delta_{X_i}(C_t)=\mathbf{1}_{C_t}(X_i)=\mathbf{1}_{(-\infty,t]}(X_i).
\end{align*}
Thus
\begin{align*}
P_n(C_t)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{(-\infty,t]}(X_i).
\end{align*}
The right-hand side is exactly the empirical distribution function at $t$, so
\begin{align*}
F_n(t)=P_n((-\infty,t]).
\end{align*}
Substituting this identity and $P((-\infty,t])=F(t)$ into the set-indexed empirical process definition gives
\begin{align*}
G_n((-\infty,t])=\sqrt n(P_n((-\infty,t])-P((-\infty,t])).
\end{align*}
Therefore
\begin{align*}
G_n((-\infty,t])=\sqrt n(F_n(t)-F(t)).
\end{align*}
So the centred and scaled empirical distribution function is precisely the empirical process indexed by the class of half-lines $\{(-\infty,t]:t\in\mathbb R\}$.
[/example]
The two descriptions, by sets and by functions, lead to the same algebra but different geometry. The later theory asks whether the whole random map $f\mapsto G_n(f)$ has a limit, rather than whether each fixed coordinate has one.
## Bounded Index Spaces and Measurability
The next question is where the empirical process lives. If $G_n$ is to converge as a random object, we need a space whose points are bounded real functions on the index class, but this choice introduces measurability issues when the class is large.
[definition: Bounded Functions on an Index Class]
For a non-empty set $T$, the space $\ell^\infty(T)$ is the set of all bounded functions $z:T\to\mathbb R$, equipped with the supremum norm
\begin{align*}
\|z\|_T:=\sup_{t\in T}|z(t)|.
\end{align*}
[/definition]
When $T=\mathcal F$ is a function class, elements of $\ell^\infty(\mathcal F)$ are bounded real maps on $\mathcal F$. When $T=\mathcal C$ is a class of sets, elements of $\ell^\infty(\mathcal C)$ are bounded real maps on $\mathcal C$. Thus an empirical process indexed by $\mathcal F$ is viewed as an $\ell^\infty(\mathcal F)$-valued random element when $\sup_{f\in\mathcal F}|G_n(f)|<\infty$. This boundedness may hold by construction for finite classes and must be proved or imposed for more complicated classes.
[example: Finite Classes Give Bounded Sample Paths]
Let $\mathcal F=\{f_1,\dots,f_k\}$, and assume $P|f_j|<\infty$ for every $j$. Fix a sample outcome $\omega$ for which each empirical average $P_n(\omega)f_j$ is finite. For this outcome the empirical process assigns the real number
\begin{align*}
G_n(\omega)(f_j)=\sqrt n\bigl(P_n(\omega)f_j-Pf_j\bigr)
\end{align*}
to each $f_j\in\mathcal F$.
Since $\mathcal F$ has exactly the $k$ elements $f_1,\dots,f_k$, the supremum norm of this sample path is
\begin{align*}
\|G_n(\omega)\|_{\mathcal F}=\sup_{f\in\mathcal F}|G_n(\omega)(f)|.
\end{align*}
Replacing the supremum over $\mathcal F$ by the maximum over its listed elements gives
\begin{align*}
\|G_n(\omega)\|_{\mathcal F}=\max_{1\le j\le k}|G_n(\omega)(f_j)|.
\end{align*}
Each $G_n(\omega)(f_j)$ is a finite real number by the choice of $\omega$, and the maximum of finitely many finite [real numbers](/page/Real%20Numbers) is finite. Hence
\begin{align*}
\|G_n(\omega)\|_{\mathcal F}<\infty.
\end{align*}
Therefore the sample path $f\mapsto G_n(\omega)(f)$ belongs to $\ell^\infty(\mathcal F)$. The point is purely finite-dimensional: boundedness follows from taking a maximum over finitely many coordinates, not from any uniform control over an infinite class.
[/example]
The supremum norm is the right norm for uniform approximation, but it also exposes a technical obstruction. If $\mathcal F$ is uncountable, the map $\omega\mapsto \sup_{f\in\mathcal F}|G_n(\omega)(f)|$ need not be measurable without additional separability assumptions.
[definition: Outer Probability]
Let $(\Omega,\mathcal G,\mathbb P)$ be a probability space. The outer probability associated with $\mathbb P$ is the set function
\begin{align*}
\mathbb P^*:\mathcal P(\Omega)\to[0,1], \qquad \mathbb P^*(A):=\inf\{\mathbb P(B): A\subseteq B,\ B\in\mathcal G\}.
\end{align*}
[/definition]
Outer probability lets us state convergence bounds even when the event inside the probability has not yet been proved measurable. Supremum bounds may also involve real-valued maps that are not known to be measurable, so outer probabilities of events alone are not enough; we need a measurable replacement for such a supremum when taking expectations.
[definition: Measurable Majorant]
Let $(\Omega,\mathcal G,\mathbb P)$ be a probability space, and let $Y:\Omega\to\overline{\mathbb R}$ be a map. A measurable majorant of $Y$ is a measurable random variable $Y^*:\Omega\to\overline{\mathbb R}$ such that $Y\le Y^*$ pointwise and, for every measurable random variable $Z:\Omega\to\overline{\mathbb R}$ satisfying $Y\le Z$ a.s., one has $Y^*\le Z$ a.s.
[/definition]
The measurable majorant is the smallest measurable random variable above $Y$, up to a.s. equality. The minimality comparison is made a.s. because random variables are identified modulo null sets in expectation estimates. Since empirical process estimates are often expressed as expected suprema, the next definition uses this majorant to assign an expectation to a non-measurable map.
[definition: Outer Expectation]
Let $\mathcal E^*$ be the class of maps $Y:\Omega\to\overline{\mathbb R}$ for which a measurable majorant $Y^*$ has a well-defined expectation. The outer expectation is the functional
\begin{align*}
\mathbb E^*:\mathcal E^*\to\overline{\mathbb R}, \qquad \mathbb E^*[Y]:=\mathbb E[Y^*].
\end{align*}
[/definition]
These notions are bookkeeping devices rather than new probability laws. Later chapters use them to formulate uniform laws and weak convergence before separability has been established.
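On a finite probability space these definitions can be computed by hand, which makes the bookkeeping concrete. The sketch below assumes a toy space that is not part of the notes: six equally likely outcomes, with the $\sigma$-algebra $\mathcal G$ generated by a partition into three blocks, so that only unions of blocks are measurable. In that setting, where every block has positive probability, the smallest measurable cover of a set is the union of the blocks it meets, and the measurable majorant of a map is its block-wise maximum. All names are hypothetical.

```python
from fractions import Fraction

# Toy space: only unions of the blocks below are measurable events.
OMEGA = range(6)
BLOCKS = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
PROB = {w: Fraction(1, 6) for w in OMEGA}

def outer_probability(event):
    """P*(A): probability of the union of all blocks that meet A."""
    event = set(event)
    cover = set().union(*(b for b in BLOCKS if b & event))
    return sum(PROB[w] for w in cover)

def measurable_majorant(y):
    """Block-wise maximum of y: the least measurable map dominating y."""
    majorant = {}
    for block in BLOCKS:
        top = max(y[w] for w in block)
        for w in block:
            majorant[w] = top
    return majorant

def outer_expectation(y):
    """E*[Y] = E[Y*], computed from the measurable majorant."""
    y_star = measurable_majorant(y)
    return sum(PROB[w] * y_star[w] for w in OMEGA)

A = {0, 2}                                   # not a union of blocks
indicator_A = {w: int(w in A) for w in OMEGA}
print(outer_probability(A), outer_expectation(indicator_A))
```

The set $A=\{0,2\}$ is not measurable here, yet both quantities are defined and agree, which illustrates the identity $\mathbb E^*[\mathbf{1}_A]=\mathbb P^*(A)$ in this toy case.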
[remark: Separability as a Measurability Device]
A class $\mathcal F$ is often reduced to a countable dense subclass under a semimetric such as $d_P$. If the supremum of the process over $\mathcal F$ agrees with the supremum over that countable subclass, measurability follows from countable operations. Empirical process theory therefore separates the geometric question of approximation from the measure-theoretic question of whether suprema are random variables.
[/remark]
## Finite-Dimensional Distributions
Before studying convergence in $\ell^\infty(\mathcal F)$, we ask what happens after evaluating the empirical process at finitely many indices. This is the part of the theory governed by the ordinary multivariate central limit theorem.
[quotetheorem:6302]
[citeproof:6302]
This theorem says that finite-dimensional empirical processes are Gaussian in the limit, but each hypothesis has a distinct role. The finite second-moment assumption supplies the Gaussian scaling; if one of the functions has infinite variance, for example a heavy-tailed $f(X_1)$ in the domain of attraction of a stable law rather than a normal law, the usual $\sqrt n$ scaling need not produce a Gaussian limit. Independence prevents persistent cross-sample dependence: if $X_i=X_1$ for every $i$, then $\sqrt n(P_n f-Pf)=\sqrt n(f(X_1)-P f)$ is not tight unless $f(X_1)=P f$ a.s. Identical distribution fixes a single covariance matrix $\Sigma$; for a triangular array with changing laws, the covariance matrix of each row may vary with the row, so the displayed formula need not describe any limit.
The finite-index assumption is equally structural. In a finite class, convergence takes place in $\mathbb R^k$, where tightness of the limiting Gaussian vector is automatic once the vector central limit theorem applies. For an infinite class, every finite subcollection may converge to the correct Gaussian vector while the sequence still fails to be tight in $\ell^\infty(\mathcal F)$; a sequence can have correct coordinates but increasingly rough sample paths over the index set. Thus finite-dimensional convergence identifies the candidate coordinates of a possible limit, and later chapters must add entropy, measurability, and asymptotic equicontinuity to obtain convergence of the full empirical process.
[example: Empirical Distribution Function Finite Coordinates]
Fix $t_1,\dots,t_k\in\mathbb R$ and set $C_j=(-\infty,t_j]$ and $f_j=\mathbf{1}_{C_j}$. Since $f_j^2=f_j$, we have $P f_j^2=P(C_j)=F(t_j)<\infty$, so the finite-dimensional central limit theorem for empirical processes applies to $f_1,\dots,f_k$.
For each $j$,
\begin{align*}
P_n f_j=\frac{1}{n}\sum_{m=1}^n f_j(X_m)=\frac{1}{n}\sum_{m=1}^n \mathbf{1}_{(-\infty,t_j]}(X_m)=F_n(t_j).
\end{align*}
Also,
\begin{align*}
P f_j=\mathbb E[\mathbf{1}_{(-\infty,t_j]}(X_1)]=\mathbb P(X_1\le t_j)=F(t_j).
\end{align*}
Therefore
\begin{align*}
(G_n(f_1),\dots,G_n(f_k))=\sqrt n(F_n(t_1)-F(t_1),\dots,F_n(t_k)-F(t_k)).
\end{align*}
It remains to identify the covariance entries. For $x\in\mathbb R$,
\begin{align*}
f_i(x)f_j(x)=\mathbf{1}_{(-\infty,t_i]}(x)\mathbf{1}_{(-\infty,t_j]}(x)=\mathbf{1}_{(-\infty,t_i]\cap(-\infty,t_j]}(x).
\end{align*}
Since
\begin{align*}
(-\infty,t_i]\cap(-\infty,t_j]=(-\infty,\min\{t_i,t_j\}],
\end{align*}
we get
\begin{align*}
f_i f_j=\mathbf{1}_{(-\infty,\min\{t_i,t_j\}]}.
\end{align*}
Hence
\begin{align*}
P(f_i f_j)=P((-\infty,\min\{t_i,t_j\}])=F(\min\{t_i,t_j\}).
\end{align*}
Combining this with $P f_i=F(t_i)$ and $P f_j=F(t_j)$ gives
\begin{align*}
\Sigma_{ij}=P(f_i f_j)-P f_i\,P f_j=F(\min\{t_i,t_j\})-F(t_i)F(t_j).
\end{align*}
Thus the displayed empirical distribution vector converges to a centred Gaussian vector with this covariance matrix. The term $F(\min\{t_i,t_j\})$ appears because two half-lines overlap exactly in the smaller half-line.
[/example]
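The covariance formula $\Sigma_{ij}=F(\min\{t_i,t_j\})-F(t_i)F(t_j)$ can be compared with a Monte Carlo estimate. The sketch below assumes uniform data on $[0,1]$, so that $F(t)=t$ for thresholds in $[0,1]$; the function names, sample size, and number of replications are illustrative choices. For i.i.d. data the covariance of $(G_n(f_i),G_n(f_j))$ equals $\Sigma_{ij}$ for every $n$, so the only error in the comparison comes from the finite number of replications.

```python
import math
import random

def edf_coordinates(n, thresholds, rng):
    """One draw of (G_n(f_1), ..., G_n(f_k)) for U[0,1] data, where F(t) = t."""
    xs = [rng.random() for _ in range(n)]
    return [math.sqrt(n) * (sum(x <= t for x in xs) / n - t) for t in thresholds]

def monte_carlo_covariance(n, s, t, reps, seed=0):
    """Average of G_n(f_s) * G_n(f_t) over independent replications."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        g_s, g_t = edf_coordinates(n, [s, t], rng)
        total += g_s * g_t          # both coordinates have mean zero
    return total / reps

def limiting_covariance(s, t):
    """Sigma entry F(min(s, t)) - F(s) F(t) for the uniform law."""
    return min(s, t) - s * t

estimate = monte_carlo_covariance(n=50, s=0.3, t=0.7, reps=4000)
print(estimate, limiting_covariance(0.3, 0.7))
```

The product is averaged without subtracting sample means because each coordinate of $G_n$ is centred by construction.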
The empirical distribution example makes the finite-dimensional theorem concrete, but it also hints at a verification problem: for each new finite class, proving a vector central limit theorem directly can be more work than proving scalar central limit theorems for linear combinations. Checking only the marginal coordinates is not enough, because two random vectors can have the same one-dimensional marginal limits along the coordinate axes while having different dependence structure. We therefore need a device that converts convergence of all one-dimensional linear projections into convergence of the full vector, and this is the role of Cramer-Wold.
[quotetheorem:9818]
[citeproof:9818]
For empirical processes, Cramer-Wold explains why finite-dimensional convergence is often no harder than the scalar central limit theorem. A linear combination of finitely many coordinates is again the empirical process evaluated at one linear combination of functions. The quantifier over every $a\in\mathbb R^k$ is necessary: coordinate projections test only the marginal limits and do not determine correlations or other dependence features. For a concrete counterexample, let $Z\sim\mathcal N(0,1)$ and let $Z'$ be an independent copy. The vectors $(Z,Z)$ and $(Z,Z')$ have the same two coordinate distributions, but the projection $(x_1,x_2)\mapsto x_1+x_2$ has laws $\mathcal N(0,4)$ and $\mathcal N(0,2)$ respectively. The Euclidean finite-dimensional structure is also essential, since the proof uses all linear functionals $a\cdot x$ and the multivariate characteristic function on $\mathbb R^k$; it is not a substitute for tightness or sample-path regularity over an infinite index class.
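The counterexample with $(Z,Z)$ and $(Z,Z')$ is easy to simulate. The sketch below is illustrative only: the function name, the number of replications, and the seed are arbitrary choices. It estimates the variance of the projection $x_1+x_2$ for both vectors, whose coordinates have identical standard normal marginals.

```python
import random

def projection_variances(reps=20000, seed=0):
    """Estimate Var(Z + Z) and Var(Z + Z') for independent standard normals."""
    rng = random.Random(seed)
    sum_sq_same = 0.0
    sum_sq_indep = 0.0
    for _ in range(reps):
        z = rng.gauss(0.0, 1.0)
        z_prime = rng.gauss(0.0, 1.0)
        sum_sq_same += (z + z) ** 2           # projection of (Z, Z)
        sum_sq_indep += (z + z_prime) ** 2    # projection of (Z, Z')
    # Both projections have mean zero, so the mean square estimates the variance.
    return sum_sq_same / reps, sum_sq_indep / reps

var_same, var_indep = projection_variances()
print(var_same, var_indep)
```

The two estimates should land near $4$ and $2$ respectively, so a single non-coordinate direction already separates two laws that the coordinate projections cannot distinguish.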
## Covariance Semimetrics
The final question in this chapter is how the covariance structure should measure closeness between two indices. Pointwise covariance values such as $\operatorname{Cov}(G_n(f),G_n(g))$ describe a single pair, but they do not by themselves give a uniform scale for a large class: two functions can each have controlled variance while their difference still produces a large increment. Uniform results need a distance-like quantity that measures increments directly. If $f$ and $g$ give nearly the same random value under $P$, then the empirical process should not distinguish them strongly.
[definition: Covariance Semimetric]
Let $\mathcal F$ be a class of measurable functions with $P f^2<\infty$ for every $f\in\mathcal F$. The covariance semimetric associated with $P$ is
\begin{align*}
d_P:\mathcal F\times\mathcal F\to[0,\infty), \qquad d_P(f,g)^2 := P(f-g)^2-(P f-P g)^2.
\end{align*}
[/definition]
The quantity $d_P(f,g)^2$ is the variance of $(f-g)(X_1)$. It can vanish for distinct functions when $f-g$ is $P$-a.s. constant, so it is a semimetric rather than a metric. The next result connects this geometric quantity directly to empirical process increments, which is why $d_P$ becomes the natural scale for coverings and continuity.
[quotetheorem:9819]
[citeproof:9819]
The semimetric therefore has a probabilistic meaning: it is the standard deviation of the empirical process increment between two functions. The second-moment hypotheses are essential for this interpretation, since without $P f^2<\infty$ and $P g^2<\infty$ the covariance or the increment variance may be infinite or undefined. For example, if $f(X_1)$ has a heavy tail with infinite second moment and $g=0$, then $d_P(f,g)$ is not a finite scale on which Gaussian increments can be measured. Even when $d_P$ is finite, covariance control alone does not prove tightness or weak convergence in $\ell^\infty(\mathcal F)$; it only gives the natural local scale, while later entropy and chaining arguments measure the size of $\mathcal F$ by counting how many $d_P$-balls are needed to cover it.
[example: Indicator Sets and Symmetric Difference]
Let $f=\mathbf{1}_A$ and $g=\mathbf{1}_B$ for measurable sets $A,B\in\mathcal A$. We compute the covariance semimetric by evaluating the two terms in its definition.
For $x\in S$, there are four cases. If $x\in A\cap B$, then
\begin{align*}
(\mathbf{1}_A(x)-\mathbf{1}_B(x))^2=(1-1)^2=0.
\end{align*}
If $x\in A\setminus B$, then
\begin{align*}
(\mathbf{1}_A(x)-\mathbf{1}_B(x))^2=(1-0)^2=1.
\end{align*}
If $x\in B\setminus A$, then
\begin{align*}
(\mathbf{1}_A(x)-\mathbf{1}_B(x))^2=(0-1)^2=1.
\end{align*}
If $x\notin A\cup B$, then
\begin{align*}
(\mathbf{1}_A(x)-\mathbf{1}_B(x))^2=(0-0)^2=0.
\end{align*}
Since $A\triangle B=(A\setminus B)\cup(B\setminus A)$, this pointwise calculation gives
\begin{align*}
(\mathbf{1}_A-\mathbf{1}_B)^2=\mathbf{1}_{A\triangle B}.
\end{align*}
Therefore
\begin{align*}
P(f-g)^2=P\mathbf{1}_{A\triangle B}=P(A\triangle B).
\end{align*}
The mean term is
\begin{align*}
P f-P g=P\mathbf{1}_A-P\mathbf{1}_B=P(A)-P(B).
\end{align*}
Substituting these two identities into
\begin{align*}
d_P(f,g)^2=P(f-g)^2-(P f-P g)^2
\end{align*}
gives
\begin{align*}
d_P(\mathbf{1}_A,\mathbf{1}_B)^2=P(A\triangle B)-(P(A)-P(B))^2.
\end{align*}
Thus indicator sets are close in the empirical-process semimetric when their symmetric difference has small $P$-mass, with the centring term subtracting the squared difference of their probabilities.
[/example]
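The formula $d_P(\mathbf{1}_A,\mathbf{1}_B)^2=P(A\triangle B)-(P(A)-P(B))^2$ can be evaluated exactly for intervals. The sketch below assumes $P$ is the uniform law on $[0,1]$ and that $A$ and $B$ are closed intervals inside $[0,1]$, so that probabilities are interval lengths; the function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def interval_length(lo, hi):
    """Uniform probability of [lo, hi]; zero if the interval is empty."""
    return max(hi - lo, 0)

def d_p_squared(a, b):
    """d_P(1_A, 1_B)^2 for intervals A = [a[0], a[1]], B = [b[0], b[1]] in [0, 1]."""
    p_a = interval_length(*a)
    p_b = interval_length(*b)
    p_ab = interval_length(max(a[0], b[0]), min(a[1], b[1]))   # P(A and B)
    sym_diff = p_a + p_b - 2 * p_ab                            # P(A triangle B)
    return sym_diff - (p_a - p_b) ** 2

A = (Fraction(0), Fraction(1, 2))
B = (Fraction(3, 10), Fraction(9, 10))
overlapping = d_p_squared(A, B)

# The whole space against a null set: the indicators differ by the constant 1
# almost surely, so their distance is zero although they are distinct.
degenerate = d_p_squared((Fraction(0), Fraction(1)), (Fraction(1, 2), Fraction(1, 2)))
print(overlapping, degenerate)
```

The second computation illustrates why $d_P$ is only a semimetric: the two indicators are different functions, but their difference is $P$-a.s. constant and therefore has zero variance.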
For indicator classes, the semimetric is controlled by symmetric differences. At the opposite end, a singleton class has no geometry to control; this comparison clarifies exactly where empirical process theory goes beyond the ordinary central limit theorem.
[example: One-Function Empirical Process]
Let $\mathcal F=\{f\}$ with $P f^2<\infty$. Since $P|f|\le (P f^2)^{1/2}<\infty$, the quantities $P f$ and $P_n f$ are well-defined whenever the sample values are finite. For this singleton class, the empirical process has only one coordinate:
\begin{align*}
G_n(f)=\sqrt n(P_n f-Pf).
\end{align*}
Using $P_n f=n^{-1}\sum_{i=1}^n f(X_i)$, we get
\begin{align*}
G_n(f)=\sqrt n\left(\frac{1}{n}\sum_{i=1}^n f(X_i)-P f\right).
\end{align*}
Since $P f$ is a constant,
\begin{align*}
\frac{1}{n}\sum_{i=1}^n f(X_i)-P f=\frac{1}{n}\sum_{i=1}^n \bigl(f(X_i)-P f\bigr).
\end{align*}
Therefore
\begin{align*}
G_n(f)=\frac{1}{\sqrt n}\sum_{i=1}^n \bigl(f(X_i)-P f\bigr).
\end{align*}
The covariance semimetric has only the pair $(f,f)$ to evaluate. Its squared value is
\begin{align*}
d_P(f,f)^2=P(f-f)^2-(P f-P f)^2.
\end{align*}
Since $f-f=0$ and $P f-P f=0$,
\begin{align*}
d_P(f,f)^2=P0^2-0^2=0.
\end{align*}
Thus there is no non-zero distance inside a singleton index class.
Let $Y_i=f(X_i)-P f$. Then $Y_1,Y_2,\dots$ are i.i.d., $\mathbb E[Y_i]=0$, and
\begin{align*}
\mathbb E[Y_i^2]=P(f-P f)^2=P f^2-(P f)^2=\operatorname{Var}(f(X_1)).
\end{align*}
By the *scalar central limit theorem*,
\begin{align*}
\frac{1}{\sqrt n}\sum_{i=1}^n Y_i\xrightarrow{d}\mathcal N(0,\operatorname{Var}(f(X_1))).
\end{align*}
Since this left-hand side is exactly $G_n(f)$, we have
\begin{align*}
G_n(f)\xrightarrow{d}\mathcal N(0,\operatorname{Var}(f(X_1))).
\end{align*}
This is precisely the classical sample-mean central limit theorem applied to the single random variable $f(X_1)$; empirical process issues beyond ordinary asymptotic normality begin when the index class has many functions to compare.
[/example]
The chapter's main message is that empirical process theory studies the random map $f\mapsto \sqrt n(P_n f-Pf)$, not only its individual coordinates. Finite-dimensional distributions are Gaussian by classical central limit theorems, while the covariance semimetric $d_P$ records the geometry that later controls uniform convergence and tightness.
# 2. Glivenko-Cantelli Theory
This chapter asks when the pointwise law of large numbers can be strengthened to a uniform statement over a whole class of measurable functions or sets. The empirical measure $P_n$ approximates $P$ on each fixed integrable function, but statistics and learning procedures often optimise over many functions at once. Glivenko-Cantelli theory identifies the classes for which the random error
\begin{align*}
\sup_{f \in \mathcal F}|P_n f - P f|
\end{align*}
tends to zero. The chapter begins with finite, equicontinuous, and bracketed classes, then treats distribution functions on the real line, and ends with VC classes as the fundamental combinatorial source of uniform laws.
## Uniform Laws of Large Numbers
The first problem is to formulate a law of large numbers that is strong enough to control data-dependent choices. If an estimator selects a function after seeing the sample, pointwise convergence for each fixed function gives no direct control of the selected value. The correct object is the largest empirical error over the indexing class.
Let $X_1, X_2, \dots$ be i.i.d. random variables with law $P$ on a measurable space $(S, \mathcal A)$. For a measurable function $f:S \to \mathbb R$ with $P|f|<\infty$, write
\begin{align*}
P f = \int_S f\,dP, \qquad P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i).
\end{align*}
For a class $\mathcal F$ of measurable functions, the uniform law asks whether
\begin{align*}
\sup_{f \in \mathcal F}|P_n f - P f| \to 0
\end{align*}
in a suitable probabilistic sense.
The main definition records this property at a fixed underlying law $P$. Since empirical process suprema need not be measurable for arbitrary classes, the clean formulation uses outer probability when needed.
[definition: P-Glivenko-Cantelli Class]
Let $(S,\mathcal A,P)$ be a probability space. A class $\mathcal F$ of measurable functions $f:S \to \mathbb R$ with $P|f|<\infty$ for every $f \in \mathcal F$ is called $P$-Glivenko-Cantelli if
\begin{align*}
\sup_{f \in \mathcal F}|P_n f - P f| \xrightarrow{\mathbb P} 0.
\end{align*}
[/definition]
If the displayed supremum is not known to be measurable, the convergence statement is interpreted in outer probability. This convention keeps the definition usable for arbitrary index classes without inserting measurability assumptions into every theorem.
This definition depends on both the class and the distribution. Later parts of the course develop distribution-free criteria, but at this stage the dependence on $P$ lets us isolate the core mechanism: uniform convergence follows when the class can be approximated by finitely many functions in the right metric and the empirical process respects that approximation.
[example: One Function Class]
Let $\mathcal F=\{f\}$ with $P|f|<\infty$. Since the only element of $\mathcal F$ is $f$, the set of empirical errors indexed by $\mathcal F$ is the singleton
\begin{align*}
\{|P_n g-Pg|:g\in\mathcal F\}=\{|P_n f-Pf|\}.
\end{align*}
Therefore its supremum is the unique element of that singleton:
\begin{align*}
\sup_{g\in\mathcal F}|P_n g-Pg|=|P_n f-Pf|.
\end{align*}
By the *[Weak Law of Large Numbers](/theorems/1127)* applied to the integrable random variable $f(X_1)$,
\begin{align*}
P_n f=\frac{1}{n}\sum_{i=1}^n f(X_i)\xrightarrow{\mathbb P}Pf.
\end{align*}
Equivalently, for every $\varepsilon>0$,
\begin{align*}
\mathbb P\left(\sup_{g\in\mathcal F}|P_n g-Pg|>\varepsilon\right)=\mathbb P(|P_n f-Pf|>\varepsilon)\to0.
\end{align*}
Thus $\mathcal F$ is $P$-Glivenko-Cantelli. The example shows that integrability is enough for one fixed function; the genuinely new issue is controlling many possible functions at once.
[/example]
The singleton example reduces the theory to the ordinary law of large numbers. The next natural step is to ask how much of this argument survives when there are several possible functions and no data-dependent selection can escape that fixed list. This gives a finite uniform law, which is the basic component reused in every net argument later in the chapter.
[quotetheorem:9816]
[citeproof:9816]
The finiteness hypothesis is doing real work: the union bound has a fixed number of terms, so the ordinary weak law can be applied separately to each function and the results combined. The result gives no rate unless the individual weak laws are quantified, and it says nothing about a growing or data-dependent list of functions. For instance, under a non-atomic law on $\mathbb R$, if the class is allowed to contain the realised sample set $\{X_1,\dots,X_n\}$ after the data are observed, then the empirical mass of that set is $1$ while its population mass is $0$. The value of the finite uniform law is structural: later arguments try to replace an infinite class by a finite list whose size is fixed after an approximation radius has been chosen, then add a separate estimate for the approximation error.
## Finite Nets and Stochastic Equicontinuity
The central question is how an infinite class can behave like a finite class for empirical averages. A metric on functions must measure both population closeness and empirical closeness, because the sample introduces a random seminorm. The course first treats a bounded setting where deterministic $L^1(P)$ approximation controls the population part and a stochastic equicontinuity condition controls the empirical fluctuation inside small balls.
[definition: $L^1(P)$ Covering Number]
Let $\mathcal F$ be a class of measurable real-valued functions with $P|f|<\infty$ for every $f \in \mathcal F$. For $\varepsilon>0$, the $L^1(P)$ covering number is the quantity
\begin{align*}
N(\varepsilon,\mathcal F,L^1(P)) \in \mathbb N\cup\{\infty\}
\end{align*}
defined as the smallest $N \in \mathbb N$ for which there exist measurable functions $g_1,\dots,g_N$ such that for every $f \in \mathcal F$ there is $j$ with
\begin{align*}
P|f-g_j|<\varepsilon.
\end{align*}
If no such finite $N$ exists, set $N(\varepsilon,\mathcal F,L^1(P))=\infty$.
[/definition]
The functions $g_j$ need not belong to $\mathcal F$. Allowing external centres makes approximation more flexible and is harmless for the finite-class argument, because the weak law applies to each centre once it is integrable.
[definition: Total Boundedness In $L^1(P)$]
A class $\mathcal F$ is totally bounded in $L^1(P)$ if
\begin{align*}
N(\varepsilon,\mathcal F,L^1(P))<\infty
\end{align*}
for every $\varepsilon>0$.
[/definition]
Total boundedness says that at every fixed accuracy, the population geometry of the class is finite. The obstruction is that an $L^1(P)$-small difference can still have a large realised empirical fluctuation on exceptional samples. To pass from a finite net to a uniform law, one needs a condition saying that empirical errors cannot jump inside sufficiently small $L^1(P)$ neighbourhoods.
[definition: Stochastic Equicontinuity In Probability]
Let $\mathcal F$ be a class of measurable functions. The empirical process is stochastically equicontinuous in probability with respect to $L^1(P)$ on $\mathcal F$ if for every $\eta>0$,
\begin{align*}
\lim_{\delta \downarrow 0}\limsup_{n\to\infty}\mathbb P^*\left(\sup_{f,g\in\mathcal F: P|f-g|<\delta}|(P_n-P)(f-g)|>\eta\right)=0.
\end{align*}
[/definition]
This condition says that empirical fluctuations cannot separate functions that are close in population $L^1(P)$ distance. It is often proved by symmetrisation and entropy estimates, but here it functions as the bridge between finite nets and uniform convergence. With total boundedness providing the finite skeleton and stochastic equicontinuity controlling the error inside each small neighbourhood, the finite-class theorem can now be lifted to an infinite class.
[quotetheorem:9820]
[citeproof:9820]
This theorem separates two requirements that are sometimes conflated. Total boundedness is only a population statement: it says that $\mathcal F$ can be approximated in $L^1(P)$, but it does not say that the empirical measure treats nearby functions similarly. Under a non-atomic law on $\mathbb R$, the indicators of all finite subsets have $L^1(P)$ distance zero from the zero function, so the class is totally bounded; nevertheless the realised sample set has empirical mass $1$ and population mass $0$. Stochastic equicontinuity is the empirical ingredient that rules out this adaptation to the realised sample. The next criterion replaces this abstract equicontinuity condition by a stronger bracketing assumption, where each local approximation comes with an integrable envelope that controls both $P$ and $P_n$.
To make that replacement precise, approximation must be by brackets rather than ordinary balls. A bracket pins each function between two endpoints and controls the expected width of the interval between them.
[definition: $L^1(P)$ Bracketing Number]
Let $\mathcal F$ be a class of measurable real-valued functions with $P|f|<\infty$ for every $f\in\mathcal F$. For $\varepsilon>0$, the $L^1(P)$ bracketing number is the quantity
\begin{align*}
N_{[]}(\varepsilon,\mathcal F,L^1(P))\in \mathbb N\cup\{\infty\}
\end{align*}
defined as the smallest $N\in\mathbb N$ for which there exist integrable measurable functions $\ell_1,u_1,\dots,\ell_N,u_N$ with $P(u_j-\ell_j)<\varepsilon$ such that for every $f\in\mathcal F$ there is $j$ satisfying
\begin{align*}
\ell_j\le f\le u_j
\end{align*}
pointwise.
If no such finite $N$ exists, set $N_{[]}(\varepsilon,\mathcal F,L^1(P))=\infty$.
[/definition]
Brackets are stronger than covers because they provide a nonnegative envelope $u_j-\ell_j$ for the approximation error. This envelope is the mechanism that turns population approximation into empirical control: if $f$ lies between $\ell_j$ and $u_j$, then $|(P_n-P)(f-\ell_j)|\le P_n(u_j-\ell_j)+P(u_j-\ell_j)$, so the empirical error inside the bracket is bounded by the empirical and population averages of the width $u_j-\ell_j$. The next theorem packages this mechanism into a usable Glivenko-Cantelli criterion, replacing the abstract stochastic equicontinuity hypothesis by the checkable condition of finite bracketing numbers.
[quotetheorem:9821]
[citeproof:9821]
The bracketing hypothesis is stronger than ordinary total boundedness, and this strength is necessary for the proof. Ordinary $L^1(P)$ total boundedness alone can identify functions that differ only on $P$-null sets, while the empirical measure may put mass exactly on those null sets after the sample is observed. Under a non-atomic law, for example, indicators of finite subsets all have $L^1(P)$ distance zero from one another, but the realised sample set has empirical mass one and population mass zero. Bracketing prevents this pathology because the small-width envelope must control every function in the bracket uniformly rather than only up to $P$-a.e. equivalence.
The boundedness assumption is natural for indicator classes, and bracketing is often built from geometric order. The next example shows how one-dimensional order turns infinitely many sets into a finite quantile grid, giving the first non-finite class covered by the bracketing theorem.
[example: Half-Lines In The Real Line]
Let $\mathcal F=\{\mathbb 1_{(-\infty,t]}:t\in\mathbb R\}$ be the class of half-line indicators, and fix $\delta>0$. Let $A_\delta=\{a\in\mathbb R:P(\{a\})>\delta/2\}$. This set is finite, because otherwise the disjoint atoms in $A_\delta$ would have total mass larger than $1$. Put every point of $A_\delta$ into the grid, together with the extended endpoints $-\infty$ and $\infty$. On each interval component of $\mathbb R\setminus A_\delta$, choose cut points successively so that the $P$-mass accumulated since the previous cut is at most $\delta$: place the next cut where the accumulated mass first reaches $\delta/2$; the overshoot is at most $\delta/2$, since there are no atoms larger than $\delta/2$ in that component. The large atoms are themselves grid points, so they never lie strictly between two neighbouring grid points. Hence every open interval $(s,u)$ between neighbouring grid points satisfies
\begin{align*}
P((s,u))\le \delta.
\end{align*}
The construction stops after finitely many cuts, because every completed non-terminal interval has $P$-mass at least $\delta/2$ and the total mass is $1$.
Now let $t\in\mathbb R$. Choose neighbouring grid points $s\le t<u$, with the conventions $(-\infty,-\infty]=\emptyset$ and $(-\infty,\infty)=\mathbb R$. Then
\begin{align*}
(-\infty,s]\subseteq(-\infty,t]\subseteq(-\infty,u).
\end{align*}
Therefore the corresponding indicators satisfy
\begin{align*}
\mathbb 1_{(-\infty,s]}\le \mathbb 1_{(-\infty,t]}\le \mathbb 1_{(-\infty,u)}.
\end{align*}
The upper endpoint uses the open half-line so that a large atom at $u$ does not enter the width; bracket endpoints need not belong to $\mathcal F$. The width of this bracket is
\begin{align*}
P\left(\mathbb 1_{(-\infty,u)}-\mathbb 1_{(-\infty,s]}\right)=P((s,u))\le\delta.
\end{align*}
Thus finitely many brackets cover $\mathcal F$ at $L^1(P)$ width at most $\delta$, and since $\delta>0$ is arbitrary, every bracketing number of $\mathcal F$ is finite. Since each indicator is bounded by $1$, the *Bounded $L^1(P)$ Bracketing Classes* theorem implies that $\mathcal F$ is $P$-Glivenko-Cantelli. This example shows how the order structure of the real line converts an infinite threshold class into finitely many probability brackets.
[/example]
Half-lines are controlled by a single threshold parameter. In several dimensions, coordinatewise thresholds give a richer class, and the question becomes whether a finite grid in each coordinate still controls the probability of the symmetric difference.
[example: Rectangles In Euclidean Space]
Let $\mathcal C=\{(-\infty,t_1]\times\cdots\times(-\infty,t_d]:t\in\mathbb R^d\}$ be the class of lower rectangles in $\mathbb R^d$, and fix $\eta>0$. For each coordinate projection $\pi_k(x)=x_k$, let $P_k=P\circ\pi_k^{-1}$. Choose a finite extended grid
\begin{align*}
-\infty=a_{k,0}<a_{k,1}<\cdots<a_{k,m_k}=\infty
\end{align*}
such that every open interval between adjacent grid points has small marginal mass:
\begin{align*}
P_k((a_{k,r-1},a_{k,r}))\le \eta
\end{align*}
for all $k=1,\dots,d$ and all $r=1,\dots,m_k$. This is obtained by the same atom-isolation and quantile-partition construction used for half-lines, applied separately to the marginal law $P_k$: atoms of $P_k$ with mass larger than $\eta/2$ are grid points, so they are excluded from the open intervals.
For $t=(t_1,\dots,t_d)\in\mathbb R^d$, choose adjacent grid points $s_k\le t_k<u_k$ in the $k$th coordinate, and set
\begin{align*}
R_t=(-\infty,t_1]\times\cdots\times(-\infty,t_d],\qquad R_s=(-\infty,s_1]\times\cdots\times(-\infty,s_d],\qquad R_u^\circ=(-\infty,u_1)\times\cdots\times(-\infty,u_d),
\end{align*}
with the conventions $(-\infty,-\infty]=\emptyset$ and $(-\infty,\infty)=\mathbb R$. Coordinatewise inclusion gives
\begin{align*}
R_s\subseteq R_t\subseteq R_u^\circ.
\end{align*}
Therefore
\begin{align*}
\mathbb 1_{R_s}\le \mathbb 1_{R_t}\le \mathbb 1_{R_u^\circ}.
\end{align*}
The bracket width is
\begin{align*}
P(\mathbb 1_{R_u^\circ}-\mathbb 1_{R_s})=P(R_u^\circ\setminus R_s).
\end{align*}
If $x\in R_u^\circ\setminus R_s$, then $x_k<u_k$ for every coordinate $k$ and $x\notin R_s$, so for at least one coordinate $k$ we have $s_k<x_k<u_k$. Hence
\begin{align*}
R_u^\circ\setminus R_s\subseteq \bigcup_{k=1}^d\{x\in\mathbb R^d:s_k<x_k<u_k\}.
\end{align*}
By the union bound and the definition of the marginal laws,
\begin{align*}
P(R_u^\circ\setminus R_s)\le \sum_{k=1}^d P(s_k<X_k<u_k)\le \sum_{k=1}^d \eta=d\eta.
\end{align*}
There are only finitely many choices of the grid points $(s,u)$, so these brackets form a finite $L^1(P)$ bracketing cover of the indicator class $\{\mathbb 1_R:R\in\mathcal C\}$ with width at most $d\eta$, and $\eta>0$ is arbitrary. Since the indicators are bounded by $1$, the *Bounded $L^1(P)$ Bracketing Classes* theorem implies that the lower-rectangle indicator class is $P$-Glivenko-Cantelli. The point is that a $d$-dimensional rectangle error is controlled by $d$ one-dimensional slab errors.
[/example]
These examples anticipate the multivariate distribution-function theorem. The next section treats the one-dimensional case in its classical sharp form.
## Distribution Functions and the Kolmogorov-Smirnov Statistic
The motivating problem is to compare the empirical distribution function with the true distribution function uniformly over all thresholds. For real-valued data, this is the original Glivenko-Cantelli theorem and the basis for the Kolmogorov-Smirnov statistic. It is also the clearest illustration of how a continuum of sets can be controlled by finitely many quantile points.
[definition: Empirical Distribution Function]
Let $X_1,\dots,X_n$ be real-valued random variables. The empirical distribution function is the random function $F_n:\mathbb R\to[0,1]$ defined by
\begin{align*}
F_n(t)=P_n((-\infty,t])=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i\le t\}}.
\end{align*}
[/definition]
The definition identifies the empirical process indexed by half-lines with the familiar step function $F_n$. Having named this object, the central question is whether the entire graph of $F_n$ approaches the graph of $F$, not just whether $F_n(t)$ approaches $F(t)$ at one threshold. The classical Glivenko-Cantelli theorem answers this stronger question and gives an almost sure version of the uniform law.
[quotetheorem:2004]
[citeproof:2004]
The order structure of half-lines, rather than any special density or continuity of the law, is the feature that makes this theorem one-dimensional. The i.i.d. assumption is also essential: it supplies stable empirical counts at fixed grid points and prevents persistent dependence from keeping those counts biased along subsequences. For a concrete failure, take $X_i=Y$ for every $i$, where $Y$ has distribution function $F$; then $F_n(t)=\mathbb{1}_{\{Y\le t\}}$ for all $n$, so the empirical distribution function does not approach the non-degenerate $F$ uniformly. Monotonicity is the one-dimensional feature that lets finitely many grid errors control every threshold between them; for arbitrary set classes there may be no neighbouring grid points that trap the error. The theorem is qualitative, so it does not quantify the probability of a large deviation at a fixed sample size. Once the uniform distance is known to vanish, it becomes a natural statistic for testing whether data came from a proposed distribution, and the next definition names this statistic so that later asymptotic and concentration results can refer to it directly.
[definition: Kolmogorov-Smirnov Statistic]
For real-valued i.i.d. observations with distribution function $F$ and empirical distribution function $F_n$, the Kolmogorov-Smirnov statistic is the number $D_n\in[0,1]$ assigned to the pair $(F_n,F)$ by
\begin{align*}
D_n=\sup_{t\in\mathbb R}|F_n(t)-F(t)|.
\end{align*}
[/definition]
The Glivenko-Cantelli theorem says $D_n\to0$ almost surely. For statistical use, qualitative convergence is only the first layer; we also want a probability bound showing how quickly large deviations of $D_n$ become rare. The [Dvoretzky-Kiefer-Wolfowitz inequality](/theorems/6300) supplies the benchmark rate for this one-dimensional problem.
[quotetheorem:6300]
[citeproof:6300]
This bound is quoted here as a benchmark concentration estimate. Its role is to show the scale of the uniform error: the Kolmogorov-Smirnov distance is typically of order $n^{-1/2}$ up to constants, so the uniform error over all thresholds decays at the same square-root rate in the sample size that is familiar from ordinary averages.
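The $n^{-1/2}$ scale can be made concrete by inverting the bound for a uniform confidence band. The sketch below assumes that the quoted inequality has the familiar form $\mathbb P(D_n>\varepsilon)\le 2e^{-2n\varepsilon^2}$; the function names `dkw_tail` and `dkw_halfwidth` are illustrative and do not come from the notes.

```python
import math

def dkw_tail(n, eps):
    """Assumed DKW bound on P(D_n > eps): 2 * exp(-2 * n * eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps * eps)

def dkw_halfwidth(n, alpha):
    """Smallest eps for which the assumed bound equals alpha (0 < alpha < 2)."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

# Half-width of a 95% uniform band around F_n for growing sample sizes.
for n in (100, 400, 1600):
    print(n, round(dkw_halfwidth(n, 0.05), 4))
```

Under this assumed form, quadrupling the sample size halves the band half-width, which is the square-root behaviour described above.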
[example: Uniform Distribution And Quantile Grids]
Let $X_i\sim\operatorname{Unif}(0,1)$, so $F(t)=t$ for $0\le t\le 1$, while $F(t)=0$ for $t<0$ and $F(t)=1$ for $t\ge 1$. Fix $m\in\mathbb N$ and put grid points $r_k=k/m$ for $k=0,\dots,m$. The interval bracket around a point $t\in[r_k,r_{k+1}]$ is
\begin{align*}
(-\infty,r_k]\subseteq(-\infty,t]\subseteq(-\infty,r_{k+1}].
\end{align*}
Its $F$-mass is
\begin{align*}
P(r_k<X\le r_{k+1})=F(r_{k+1})-F(r_k)=\frac{k+1}{m}-\frac{k}{m}=\frac{1}{m}.
\end{align*}
Let
\begin{align*}
\Delta_m=\max_{0\le j\le m}|F_n(r_j)-F(r_j)|.
\end{align*}
For $t\in[r_k,r_{k+1}]$, monotonicity of $F_n$ gives $F_n(r_k)\le F_n(t)\le F_n(r_{k+1})$. Therefore
\begin{align*}
F_n(t)-F(t)\le F_n(r_{k+1})-t.
\end{align*}
Since $F(r_{k+1})=r_{k+1}$ and $r_{k+1}-t\le 1/m$,
\begin{align*}
F_n(r_{k+1})-t=\bigl(F_n(r_{k+1})-F(r_{k+1})\bigr)+(r_{k+1}-t)\le \Delta_m+\frac{1}{m}.
\end{align*}
Similarly,
\begin{align*}
F(t)-F_n(t)\le t-F_n(r_k).
\end{align*}
Since $F(r_k)=r_k$ and $t-r_k\le 1/m$,
\begin{align*}
t-F_n(r_k)=(t-r_k)+\bigl(F(r_k)-F_n(r_k)\bigr)\le \frac{1}{m}+\Delta_m.
\end{align*}
Thus every $t\in[0,1]$ satisfies
\begin{align*}
|F_n(t)-F(t)|\le \Delta_m+\frac{1}{m}.
\end{align*}
Outside $[0,1]$, the distribution function $F$ equals $0$ on $(-\infty,0)$ and $1$ on $[1,\infty)$, and almost surely so does $F_n$, because every observation lies in $[0,1]$ almost surely; hence the same bound holds on all of $\mathbb R$. For fixed $m$, the ordinary strong law applied at the finitely many grid points gives $\Delta_m\to0$ almost surely. Taking $m$ large makes the deterministic bracket error $1/m$ small, so the grid controls the whole empirical distribution function. This is the quantile-grid mechanism behind the Glivenko-Cantelli theorem in the uniform case.
[/example]
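The grid bound in the example above can be checked numerically. The following is a minimal sketch, assuming simulated $\operatorname{Unif}(0,1)$ data from Python's standard `random` module; the helper names (`ecdf`, `grid_error`, `sup_error`) are hypothetical. It computes the exact supremum $D_n$ from the order statistics and compares it with $\Delta_m+1/m$.

```python
import random

def ecdf(sample, t):
    """Empirical distribution function F_n(t): fraction of points <= t."""
    return sum(x <= t for x in sample) / len(sample)

def grid_error(sample, m):
    """Delta_m = max_j |F_n(j/m) - j/m| over the grid j = 0, ..., m."""
    return max(abs(ecdf(sample, j / m) - j / m) for j in range(m + 1))

def sup_error(sample):
    """Exact sup_t |F_n(t) - t| for data in [0, 1], using the jumps of F_n."""
    xs = sorted(sample)
    n = len(xs)
    # At the i-th order statistic (0-based), F_n jumps from i/n to (i+1)/n.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
sample = [rng.random() for _ in range(500)]
m = 20
d_n = sup_error(sample)
bound = grid_error(sample, m) + 1 / m
print(d_n, bound)
```

The inequality `d_n <= bound` is deterministic, so it holds for every sample and every grid size; only the size of $\Delta_m$ depends on the randomness.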
The distribution-function case motivates broader classes whose complexity is combinatorial rather than ordered. VC theory supplies a finite-dimensional measure of the number of different labellings a class of sets can induce on a sample.
## Vapnik-Chervonenkis Classes
The problem now is to recognise large set classes for which uniform convergence holds without building distribution-dependent grids by hand. The right invariant is how many subsets of a finite sample can be carved out by the class. If the class cannot realise all labellings of large finite sets, its effective empirical complexity is polynomial rather than exponential.
[definition: Shattering]
Let $\mathcal C$ be a class of subsets of a set $S$. The class $\mathcal C$ shatters points $x_1,\dots,x_m\in S$ if for every subset $I\subset\{1,\dots,m\}$ there exists $C\in\mathcal C$ such that
\begin{align*}
\{i:x_i\in C\}=I.
\end{align*}
[/definition]
Shattering means that the class can realise every binary labelling on the chosen points. To turn this local labelling ability into a numerical complexity parameter, we ask for the largest sample size on which complete labelling is possible.
[definition: VC Dimension]
Let $\mathcal C$ be a class of subsets of $S$. The VC dimension $V(\mathcal C)$ is the supremum of all $m\in\mathbb N$ for which some $m$ points in $S$ are shattered by $\mathcal C$. If sets of arbitrarily large finite size are shattered, write $V(\mathcal C)=\infty$.
[/definition]
Finite VC dimension is a structural restriction on the class, not on the probability law. The difficulty is that uniform convergence must hold simultaneously over all sets in the class and for every underlying distribution, so probability-dependent nets are not the right tool.
The key question is whether this combinatorial restriction is strong enough by itself to force a uniform law of large numbers. What saves the argument is that, on any realised sample, a VC class can produce only polynomially many distinct label patterns, making a finite-class bound effective uniformly in $P$.
[quotetheorem:9822]
[citeproof:9822]
The theorem explains why half-lines and rectangles obey uniform laws in every distribution. Finite VC dimension is the necessary complexity input for this route: it keeps the number of different sample labellings polynomial, which is small enough for symmetrisation and finite-class maximal inequalities to win. The result is still a law of large numbers rather than a full limit theorem; it gives convergence in probability but does not identify the limiting distribution of the empirical process or handle unbounded function classes without further assumptions. Infinite VC dimension can destroy the conclusion, as the later Borel-set example shows by choosing a set that follows the realised sample. For geometric classes such as half-lines and rectangles, finite VC dimension is often easier to check than probability-dependent total boundedness or bracketing.
[example: VC Dimension Of Half-Lines]
Let $\mathcal C=\{(-\infty,t]:t\in\mathbb R\}$. Fix one point $x\in\mathbb R$. To realise the empty labelling on $\{x\}$, choose any $t<x$; then $x\notin(-\infty,t]$. To realise the full labelling, choose $t=x$; then $x\in(-\infty,x]$. Hence every subset of the one-point set is realised, so one point is shattered.
Now take two distinct points and write them in increasing order as $x_1<x_2$. The labelling $\{2\}$ would require a threshold $t$ such that $x_2\in(-\infty,t]$ and $x_1\notin(-\infty,t]$. The first condition gives $x_2\le t$. Since $x_1<x_2$, we have $x_1<x_2\le t$, so $x_1\in(-\infty,t]$, contradicting the second condition. Thus no two-point set can be shattered.
Therefore the largest shattered size is $1$, so $V(\mathcal C)=1$. By the *[Vapnik-Chervonenkis Uniform Law of Large Numbers](/theorems/9822)*, this finite VC dimension implies
\begin{align*}
\sup_{t\in\mathbb R}|P_n((-\infty,t])-P((-\infty,t])|\xrightarrow{\mathbb P}0.
\end{align*}
For real-valued observations, this is exactly the Glivenko-Cantelli property for empirical distribution functions.
[/example]
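The shattering argument for half-lines can be confirmed by brute force on small point sets. This is an illustrative sketch with hypothetical helper names: it uses the fact that the index set $\{i:x_i\le t\}$ only changes when $t$ crosses a sample value, so one threshold below all points together with the points themselves exhaust every realisable labelling.

```python
def halfline_labellings(points):
    """All index sets {i : x_i <= t} realised by thresholds t on the points."""
    # One threshold below every point, then one at each point, is enough.
    thresholds = [min(points) - 1.0] + sorted(points)
    return {
        frozenset(i for i, x in enumerate(points) if x <= t)
        for t in thresholds
    }

def is_shattered(points):
    """True when half-lines realise all 2^m labellings of the m points."""
    return len(halfline_labellings(points)) == 2 ** len(points)

print(is_shattered([0.3]))
print(is_shattered([0.3, 0.7]))
```

For two points the labelling that selects only the larger point is the one that is missing, matching the argument in the example.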
Half-lines have only one threshold, while axis-aligned rectangles have one threshold per coordinate. The next example shows how the number of possible sample cuts grows polynomially with $n$, which is the combinatorial signature behind finite VC behaviour.
[example: Axis-Aligned Rectangles]
Let $\mathcal C=\{(-\infty,t_1]\times\cdots\times(-\infty,t_d]:t\in\mathbb R^d\}$ be the class of lower rectangles in $\mathbb R^d$. Fix sample points $x_1,\dots,x_n\in\mathbb R^d$, and write $x_i=(x_{i1},\dots,x_{id})$. For one coordinate $k$, the set of indices selected by a threshold $t_k$ is
\begin{align*}
I_k(t_k)=\{i:x_{ik}\le t_k\}.
\end{align*}
As $t_k$ moves from $-\infty$ to $\infty$, this set can change only when $t_k$ passes one of the $n$ coordinate values $x_{1k},\dots,x_{nk}$. Hence there are at most $n+1$ possible coordinate selections $I_k(t_k)$.
For a rectangle $R_t=(-\infty,t_1]\times\cdots\times(-\infty,t_d]$, membership of $x_i$ is the conjunction of the $d$ coordinate comparisons:
\begin{align*}
x_i\in R_t
\end{align*}
if and only if
\begin{align*}
x_{i1}\le t_1,\quad x_{i2}\le t_2,\quad \dots,\quad x_{id}\le t_d.
\end{align*}
Equivalently,
\begin{align*}
\{i:x_i\in R_t\}=I_1(t_1)\cap I_2(t_2)\cap\cdots\cap I_d(t_d).
\end{align*}
There are at most $n+1$ choices for each $I_k(t_k)$, so the number of possible intersections is bounded by
\begin{align*}
(n+1)(n+1)\cdots(n+1)=(n+1)^d.
\end{align*}
If $n$ points were shattered by $\mathcal C$, then every subset of $\{1,\dots,n\}$ would occur as $\{i:x_i\in R_t\}$ for some rectangle $R_t$. That would require
\begin{align*}
2^n\le (n+1)^d.
\end{align*}
But $2^n/(n+1)^d\to\infty$ as $n\to\infty$, so this inequality fails for all sufficiently large $n$. Therefore only finitely many points can be shattered, and the class of lower rectangles has finite VC dimension. By the *Vapnik-Chervonenkis Uniform Law of Large Numbers*, it follows that for every probability law $P$ on $\mathbb R^d$,
\begin{align*}
\sup_{R\in\mathcal C}|P_n(R)-P(R)|\xrightarrow{\mathbb P}0.
\end{align*}
Thus the polynomial growth bound $(n+1)^d$ is exactly the combinatorial control that makes lower rectangles a uniform law class.
[/example]
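The counting bound $(n+1)^d$ can also be checked by enumeration. The sketch below is an illustration under stated assumptions: the points are simulated with Python's `random` module, the function name `rectangle_labellings` is hypothetical, and thresholds are restricted to one value below the sample plus the observed coordinate values, which suffices because the selection in each coordinate only changes at those values.

```python
import random
from itertools import product

def rectangle_labellings(points):
    """Distinct index sets {i : x_i in R_t} over lower rectangles R_t."""
    d = len(points[0])
    # Candidate thresholds per coordinate: below all values, then each value.
    axes = [
        [min(p[k] for p in points) - 1.0] + sorted({p[k] for p in points})
        for k in range(d)
    ]
    return {
        frozenset(
            i for i, p in enumerate(points)
            if all(p[k] <= t[k] for k in range(d))
        )
        for t in product(*axes)
    }

rng = random.Random(1)
n, d = 6, 2
points = [(rng.random(), rng.random()) for _ in range(n)]
count = len(rectangle_labellings(points))
print(count, (n + 1) ** d, 2 ** n)
```

With $n=6$ and $d=2$ the bound $(n+1)^d=49$ is already below $2^n=64$, so no six points in the plane can be shattered by lower rectangles. Two points such as $(0,1)$ and $(1,0)$ are shattered, which the same function confirms.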
The finite VC condition is not a cosmetic assumption: without complexity control, uniform convergence can fail in the strongest possible way.
[example: Failure For All Borel Sets]
Let $P$ be a non-atomic probability law on $\mathbb R$, and let $\mathcal C=\mathcal B(\mathbb R)$. For a realised sample $X_1,\dots,X_n$, define the finite set
\begin{align*}
A_n=\{X_1,\dots,X_n\}.
\end{align*}
Each singleton $\{X_i\}$ is closed, hence Borel, so $A_n$ is a finite union of Borel sets and therefore $A_n\in\mathcal B(\mathbb R)$.
For every $i=1,\dots,n$, we have $X_i\in A_n$, so $\mathbb 1_{A_n}(X_i)=1$. Hence
\begin{align*}
P_n(A_n)=\frac{1}{n}\sum_{i=1}^n \mathbb 1_{A_n}(X_i)=\frac{1}{n}\sum_{i=1}^n 1=\frac{n}{n}=1.
\end{align*}
Since $P$ is non-atomic, $P(\{x\})=0$ for every $x\in\mathbb R$. Thus
\begin{align*}
P(A_n)=P(\{X_1,\dots,X_n\})\le \sum_{i=1}^n P(\{X_i\})=\sum_{i=1}^n 0=0.
\end{align*}
Because probabilities are nonnegative, this gives $P(A_n)=0$.
Therefore, for every realised sample,
\begin{align*}
\sup_{A\in\mathcal B(\mathbb R)}|P_n(A)-P(A)|\ge |P_n(A_n)-P(A_n)|=|1-0|=1.
\end{align*}
On the other hand, $0\le P_n(A)\le1$ and $0\le P(A)\le1$ for every Borel set $A$, so $|P_n(A)-P(A)|\le1$. Hence
\begin{align*}
\sup_{A\in\mathcal B(\mathbb R)}|P_n(A)-P(A)|=1.
\end{align*}
The uniform empirical error never tends to $0$, so the class of all Borel sets is not $P$-Glivenko-Cantelli.
[/example]
This failure illustrates the guiding principle of the chapter. Uniform laws are not consequences of the pointwise law alone; they require a finiteness principle, supplied by metric total boundedness together with stochastic equicontinuity, by ordered quantile approximations, or by VC combinatorics.
# Symmetrisation and Rademacher Averages
Symmetrisation is the main device that turns empirical processes into objects controlled by random signs. The previous chapter treated uniform convergence by approximation and finite reductions; this chapter gives the probabilistic inequalities that make those reductions quantitative. The guiding question is how to bound a random quantity such as $\sup_{f \in \mathcal F}|P_n f - P f|$ without knowing the unknown distribution $P$ explicitly; ghost samples, Rademacher variables, and concentration inequalities provide the standard answer. The prerequisites are the empirical measure notation from the previous chapter, [Jensen's inequality](/theorems/9), basic independence and conditioning, and elementary concentration inequalities for independent variables.
## Ghost Samples and Symmetrisation Inequalities
The first problem is that $P f$ is deterministic but unknown, while $P_n f$ is observable but random. A ghost sample replaces $P f$ by an independent empirical average with the same expectation, allowing the difference $P_n f - P f$ to be compared with a difference of two independent empirical averages. Once two independent samples appear, random signs can exchange the two coordinates and reduce the problem to a Rademacher sum.
[definition: Ghost Sample]
Let $X_1, \dots, X_n$ be i.i.d. random variables with distribution $P$ on a measurable space $(S, \mathcal S)$. A ghost sample is a sequence $X_1', \dots, X_n'$ of i.i.d. random variables with distribution $P$ that is independent of $X_1, \dots, X_n$.
[/definition]
The empirical measure associated with the ghost sample is written $P_n' f = n^{-1}\sum_{i=1}^n f(X_i')$. It has the same distribution as $P_n f$, but its independence from $P_n$ makes Jensen-type comparisons possible. To turn the ghost comparison into a usable bound, we need a mechanism that records which member of each pair $(X_i,X_i')$ is placed on which side of the difference. The next definition introduces the random signs that encode these swaps.
[definition: Rademacher Variables]
A sequence $\varepsilon_1, \dots, \varepsilon_n$ consists of Rademacher variables if the variables are independent and satisfy
\begin{align*}
\mathbb P(\varepsilon_i = 1) = \mathbb P(\varepsilon_i = -1) = \frac{1}{2}
\end{align*}
for each $i \in \{1, \dots, n\}$.
[/definition]
Rademacher variables are used independently of all data variables unless another dependence structure is explicitly declared. Their role is not to model the data; they randomise signs after the data have been fixed. We need the following theorem because it converts the original unknown-centred supremum into this sign-randomised quantity, which can then be studied conditionally on the observed sample.
[quotetheorem:9823]
[citeproof:9823]
The inequality shows why Rademacher sums are central: they are conditionally mean-zero sums with the function class visible only through the values $f(X_i)$. It does not say that $P_n f-Pf$ is itself conditionally mean-zero after the sample is observed, and it does not remove measurability requirements on the supremum. The factor $2$ is the price for comparing the unknown centring $P f$ with a second random empirical average; without the ghost sample, there is no independent object to swap against $P_n$.
[example: Indicator Classes]
Let $\mathcal C$ be a class of measurable subsets of $S$, and put $\mathcal F=\{\mathbb{1}_C:C\in\mathcal C\}$. Since $P_n\mathbb{1}_C=P_n(C)$ and $P\mathbb{1}_C=P(C)$, the *Symmetrisation Inequality* gives
\begin{align*}
\mathbb E\left[\sup_{C\in\mathcal C}|P_n(C)-P(C)|\right]\le 2\,\mathbb E\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\mathbb{1}_C(X_i)\right|\right].
\end{align*}
On a fixed sample with distinct points, suppose first that $\mathcal C$ contains every subset of $\{X_1,\dots,X_n\}$. After the signs are realised, choose $C_+=\{X_i:\varepsilon_i=1\}$. Then
\begin{align*}
\frac{1}{n}\sum_{i=1}^n\varepsilon_i\mathbb{1}_{C_+}(X_i)=\frac{1}{n}\sum_{\{i:\varepsilon_i=1\}}1=\frac{N_+}{n},
\end{align*}
where $N_+=|\{i:\varepsilon_i=1\}|$. Hence
\begin{align*}
\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\mathbb{1}_C(X_i)\right|\ge \frac{N_+}{n}.
\end{align*}
Averaging only over the signs,
\begin{align*}
\mathbb E_\varepsilon\left[\frac{N_+}{n}\right]=\frac{1}{n}\sum_{i=1}^n\mathbb P(\varepsilon_i=1)=\frac{1}{n}\cdot n\cdot\frac{1}{2}=\frac{1}{2}.
\end{align*}
Thus a class that can realise arbitrary labels on the sample has an expected signed supremum of at least $1/2$: it is bounded below by a constant and cannot decay at the rate $n^{-1/2}$.
For intervals on $\mathbb R$, the same signed sum has a different form. If $X_{(1)}\le\cdots\le X_{(n)}$ is the ordered sample and $\varepsilon_{(j)}$ is the sign attached to $X_{(j)}$, then an interval selects a contiguous block of ordered sample points, so its contribution is of the form
\begin{align*}
\frac{1}{n}\sum_{j=k}^{\ell}\varepsilon_{(j)}
\end{align*}
for some $1\le k\le \ell\le n$, with the empty interval giving $0$. The class of intervals therefore cannot choose an arbitrary subset of positive signs while excluding all negative signs; it can only choose contiguous blocks in the sample order. Symmetrisation has converted the uniform deviation problem into the combinatorial question of which sign patterns the set class can realise on the observed sample.
[/example]
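The contrast between the two classes in the example can be seen in a small simulation over the signs alone. This is a sketch under stated assumptions: the sample points are taken to be distinct and already sorted, so a sign vector is all that matters, and the function names are hypothetical. For the class of all subsets, the absolute signed supremum is $\max(N_+,N_-)/n$; for intervals it is the largest absolute sum over a contiguous block, computed here from prefix sums.

```python
import random

def sup_all_subsets(signs):
    """sup over all subsets C of |sum_{i in C} eps_i| / n = max(N_+, N_-) / n."""
    n = len(signs)
    n_plus = sum(s == 1 for s in signs)
    return max(n_plus, n - n_plus) / n

def sup_intervals(signs):
    """sup over contiguous blocks of |block sum| / n, via prefix sums."""
    prefix = [0]
    for s in signs:
        prefix.append(prefix[-1] + s)
    # A block sum is a difference of two prefix sums, so the largest
    # absolute block sum is the range of the prefix sums.
    return (max(prefix) - min(prefix)) / len(signs)

rng = random.Random(0)
n, reps = 400, 200
draws = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(reps)]
mean_subsets = sum(sup_all_subsets(s) for s in draws) / reps
mean_intervals = sum(sup_intervals(s) for s in draws) / reps
print(mean_subsets, mean_intervals)
```

The first average can never fall below $1/2$, while the second is the normalised range of a random walk and shrinks as $n$ grows.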
The indicator example uses absolute deviations because a uniform law of large numbers asks for two-sided control. In learning and estimation, the empirical procedure often requires only the largest optimistic gap $P f-P_n f$. The same ghost-sample method gives a one-sided form tailored to such excess-risk arguments.
[quotetheorem:9824]
[citeproof:9824]
This one-sided version is the form most often used when an estimator is chosen by minimising empirical risk. It measures the largest optimistic gap between population and empirical averages over a class, not the largest absolute error. The distinction matters: a class may have a small optimistic gap while still containing functions with large negative gaps, so this theorem alone is not a two-sided uniform law of large numbers. Its value is that empirical minimisation only needs to rule out functions whose empirical risk looks too good compared with their population risk.
## Rademacher Complexity and Contraction
After symmetrisation, the next question is how large the signed supremum can be. Conditional on the sample, this is a deterministic geometric quantity associated with the finite set of vectors $(f(X_1),\dots,f(X_n))$. Rademacher complexity packages this quantity in a way that can be compared across classes.
[definition: Empirical Rademacher Complexity]
Let $\mathcal F$ be a class of real-valued functions on $S$. The empirical Rademacher complexity at sample size $n$ is the functional
\begin{align*}
\widehat{\mathfrak R}_n(\mathcal F;\cdot):S^n\to[0,\infty]
\end{align*}
defined, for $x_1, \dots, x_n \in S$, by
\begin{align*}
\widehat{\mathfrak R}_n(\mathcal F;x_1,\dots,x_n)
= \mathbb E_\varepsilon\left[\sup_{f \in \mathcal F}\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\right].
\end{align*}
[/definition]
The notation $\mathbb E_\varepsilon$ means that the sample points are held fixed while only the signs are averaged. Absolute values are sometimes included in the definition; in these notes the displayed non-absolute version is the default, and symmetric classes make the two conventions coincide.
[definition: Rademacher Complexity]
Let $P$ be a probability measure on $(S,\mathcal S)$, and let $X_1,\dots,X_n$ be i.i.d. with distribution $P$. The Rademacher complexity at sample size $n$ is the functional
\begin{align*}
\mathfrak R_n:\{\text{classes of measurable real-valued functions on }S\}\to[0,\infty]
\end{align*}
defined by
\begin{align*}
\mathfrak R_n(\mathcal F)
= \mathbb E\left[\widehat{\mathfrak R}_n(\mathcal F;X_1,\dots,X_n)\right].
\end{align*}
[/definition]
The expected complexity averages the empirical complexity over the random design. The empirical version is data-dependent and will later lead to observable bounds.
[example: Linear Predictors in a Euclidean Ball]
Let $S=\mathbb R^d$, fix sample points $x_1,\dots,x_n$ with $|x_i|\le R$, and consider
\begin{align*}
\mathcal F=\{x\mapsto w\cdot x: |w|\le B\}.
\end{align*}
For fixed signs, set $Z=\sum_{i=1}^n\varepsilon_i x_i$. Then
\begin{align*}
\sup_{f\in\mathcal F}\frac{1}{n}\sum_{i=1}^n\varepsilon_i f(x_i)=\frac{1}{n}\sup_{|w|\le B} w\cdot Z.
\end{align*}
For every $w$ with $|w|\le B$, Cauchy-Schwarz gives $w\cdot Z\le |w||Z|\le B|Z|$. If $Z\neq 0$, the choice $w=BZ/|Z|$ has $|w|=B$ and gives $w\cdot Z=B|Z|$; if $Z=0$, both sides are $0$. Hence
\begin{align*}
\sup_{|w|\le B}w\cdot Z=B|Z|.
\end{align*}
Therefore
\begin{align*}
\widehat{\mathfrak R}_n(\mathcal F;x_1,\dots,x_n)=\frac{B}{n}\mathbb E_\varepsilon\left|\sum_{i=1}^n\varepsilon_i x_i\right|.
\end{align*}
It remains to bound the expected length of the signed vector sum. By Cauchy-Schwarz applied to the random variables $1$ and $|Z|$,
\begin{align*}
\mathbb E_\varepsilon |Z|\le \left(\mathbb E_\varepsilon |Z|^2\right)^{1/2}.
\end{align*}
Expanding the square in the Euclidean [inner product](/page/Inner%20Product) gives
\begin{align*}
|Z|^2=\sum_{i=1}^n\sum_{j=1}^n\varepsilon_i\varepsilon_j\,x_i\cdot x_j.
\end{align*}
Taking expectation over the signs,
\begin{align*}
\mathbb E_\varepsilon |Z|^2=\sum_{i=1}^n\sum_{j=1}^n\mathbb E_\varepsilon[\varepsilon_i\varepsilon_j]\,x_i\cdot x_j.
\end{align*}
If $i=j$, then $\mathbb E_\varepsilon[\varepsilon_i^2]=1$. If $i\neq j$, independence and $\mathbb E_\varepsilon[\varepsilon_i]=0$ give $\mathbb E_\varepsilon[\varepsilon_i\varepsilon_j]=\mathbb E_\varepsilon[\varepsilon_i]\mathbb E_\varepsilon[\varepsilon_j]=0$. Thus
\begin{align*}
\mathbb E_\varepsilon |Z|^2=\sum_{i=1}^n |x_i|^2.
\end{align*}
Since $|x_i|\le R$ for every $i$,
\begin{align*}
\mathbb E_\varepsilon |Z|^2\le nR^2.
\end{align*}
Combining the preceding inequalities,
\begin{align*}
\widehat{\mathfrak R}_n(\mathcal F;x_1,\dots,x_n)\le \frac{B}{n}\sqrt{nR^2}=\frac{BR}{\sqrt n}.
\end{align*}
The complexity is controlled by the radius $B$ of the weight ball and the sample radius $R$, with no explicit dependence on the ambient dimension $d$.
[/example]
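The bound in this example can be checked numerically on a small sample. The sketch below evaluates $(B/n)\,\mathbb E_\varepsilon|\sum_i\varepsilon_i x_i|$ exactly by enumerating signs and compares it with $BR/\sqrt n$; the sample points and the value of `B` are hypothetical.

```python
from itertools import product
from math import sqrt

def linear_ball_complexity(points, B):
    """Exact (B/n) * E_eps |sum_i eps_i x_i|, enumerating all sign vectors."""
    n, d = len(points), len(points[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        z = [sum(e * x[k] for e, x in zip(signs, points)) for k in range(d)]
        total += sqrt(sum(c * c for c in z))  # Euclidean length of Z
    return B * total / (n * 2 ** n)

# Hypothetical sample in R^3 and weight radius.
points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 1.0),
          (-1.0, 0.5, 0.0), (0.0, 0.0, -2.0)]
B = 1.5
R = max(sqrt(sum(c * c for c in x)) for x in points)  # sample radius
exact = linear_ball_complexity(points, B)
bound = B * R / sqrt(len(points))
print(exact, bound)
```

The exact value is smaller than the bound because both the Cauchy-Schwarz step and the replacement of each $|x_i|$ by $R$ lose something on a generic sample.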
This example illustrates a recurring message: dimension may enter only through the geometry of the constraint and the data, not through a raw count of parameters. The next obstruction appears when the functions used in the statistical criterion are not the predictors themselves but losses applied to those predictors. Without a contraction principle, a bound for $g$ would not automatically give a bound for $\ell(g,y)$; a nonlinear loss could amplify small signed fluctuations into much larger ones. The theorem below identifies the precise condition, Lipschitz continuity with controlled offsets, under which this amplification is limited.
[quotetheorem:9825]
[citeproof:9825]
Contraction is the bridge from prediction functions to losses, but every hypothesis has a specific role. The argument is fixed-sample and coordinatewise: the points $x_i$ are frozen, and $\phi_i$ may depend on the coordinate $i$ but not on the Rademacher signs. This matters because the proof compares two deterministic vector sets in $\mathbb R^n$; if the transformation were allowed to depend on $\varepsilon_i$, it could align with the signs and create a positive average even from a one-function class, whose Rademacher average is zero. For example, the maps $\phi_i(t)=\varepsilon_i|t|$ are $1$-Lipschitz with $\phi_i(0)=0$, yet they turn the constant function $1$ into the signed average $n^{-1}\sum_i\varepsilon_i^2=1$.
The Lipschitz assumption rules out nonlinear amplification. For instance, take $\phi(t)=t^2$ and, for $s>0$ and a fixed sample of size $n$, let $\mathcal F_s=\{0,f_s\}$ with $f_s(x_i)=s$ for every $i$. The original average is
\begin{align*}
\frac{s}{n}\mathbb E_\varepsilon\left[\max\left\{0,\sum_{i=1}^n\varepsilon_i\right\}\right],
\end{align*}
whereas the transformed average is the same quantity multiplied by $s$. No constant independent of $s$ can compare the two, so a non-Lipschitz map such as $t\mapsto t^2$ cannot satisfy a contraction bound on unbounded ranges. The condition $\phi_i(0)=0$ is also structural rather than cosmetic, since adding coordinate-dependent constants contributes the extra random term $n^{-1}\sum_i\varepsilon_i\phi_i(0)$, unrelated to $\mathcal F$. Thus the theorem controls oscillation of the class, not arbitrary offsets. The measurability convention is not a technical decoration either: for uncountable classes the supremum over $f$ may fail to be measurable, so finite approximation, separability, or outer expectation is needed before the displayed expectations have a literal probabilistic meaning.
[example: Lipschitz Loss Classes]
Let $\mathcal G$ be a class of prediction functions $g:S\to\mathbb R$, let $Y$ be a response space, and suppose $\ell:\mathbb R\times Y\to\mathbb R$ is $L$-Lipschitz in its first argument. For fixed observations $(x_i,y_i)$, define
\begin{align*}
\mathcal L\circ\mathcal G = \{(x,y)\mapsto \ell(g(x),y):g\in\mathcal G\}.
\end{align*}
For each coordinate, set
\begin{align*}
\phi_i(t)=\ell(t,y_i)-\ell(0,y_i).
\end{align*}
Then
\begin{align*}
\phi_i(0)=\ell(0,y_i)-\ell(0,y_i)=0.
\end{align*}
Also, for all $s,t\in\mathbb R$,
\begin{align*}
|\phi_i(s)-\phi_i(t)|=|\ell(s,y_i)-\ell(0,y_i)-\ell(t,y_i)+\ell(0,y_i)|.
\end{align*}
The two offset terms cancel, so
\begin{align*}
|\phi_i(s)-\phi_i(t)|=|\ell(s,y_i)-\ell(t,y_i)|.
\end{align*}
Since $\ell$ is $L$-Lipschitz in its first argument,
\begin{align*}
|\phi_i(s)-\phi_i(t)|\le L|s-t|.
\end{align*}
Thus each $\phi_i$ satisfies the hypotheses of the *[Rademacher Contraction Inequality](/theorems/9825)*.
For the centred loss class, the signed average is
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \varepsilon_i\phi_i(g(x_i))
=
\frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(\ell(g(x_i),y_i)-\ell(0,y_i)\bigr).
\end{align*}
Therefore the contraction inequality gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{g\in\mathcal G}\frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(\ell(g(x_i),y_i)-\ell(0,y_i)\bigr)\right]
\le
L\,\mathbb E_\varepsilon\left[\sup_{g\in\mathcal G}\frac{1}{n}\sum_{i=1}^n \varepsilon_i g(x_i)\right].
\end{align*}
By the definition of empirical Rademacher complexity, the right-hand side is
\begin{align*}
L\widehat{\mathfrak R}_n(\mathcal G;x_1,\dots,x_n).
\end{align*}
The centring term is not cosmetic. For the uncentred loss class,
\begin{align*}
\frac{1}{n}\sum_{i=1}^n\varepsilon_i\ell(g(x_i),y_i)
=
\frac{1}{n}\sum_{i=1}^n\varepsilon_i\bigl(\ell(g(x_i),y_i)-\ell(0,y_i)\bigr)
+
\frac{1}{n}\sum_{i=1}^n\varepsilon_i\ell(0,y_i).
\end{align*}
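The second term depends only on the fixed responses $(y_i)$ and the signs, not on $g$, so it factors out of the supremum. Under the non-absolute convention used in these notes it has zero $\mathbb E_\varepsilon$-mean and leaves the expected supremum unchanged; for fixed signs, or under the absolute-value convention, it is a genuine extra term. Hence contraction controls the oscillation coming from the prediction class only after the coordinate offsets $\ell(0,y_i)$ have been removed. Hinge, logistic, and clipped absolute losses therefore inherit the same sample-dependent complexity scale as the underlying prediction class in their centred form.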
[/example]
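The contraction bound for a centred loss class can be checked by exact enumeration on a small finite prediction class. The sketch below uses the hinge loss $\ell(t,y)=\max\{0,1-yt\}$ with labels $y\in\{-1,1\}$, which is $1$-Lipschitz in $t$; the labels and prediction vectors are hypothetical.

```python
from itertools import product

def expected_sup(vectors):
    """E_eps[max over vectors of (1/n) sum_i eps_i v_i], by full enumeration."""
    n = len(vectors[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(e * v for e, v in zip(signs, vec)) for vec in vectors)
    return total / (n * 2 ** n)

def hinge(t, y):
    """Hinge loss; 1-Lipschitz in t when y is +1 or -1."""
    return max(0.0, 1.0 - y * t)

# Hypothetical labels y_i and prediction vectors (g(x_1), ..., g(x_n)).
labels = (1, -1, 1, 1, -1)
predictions = [
    (0.5, -0.2, 1.3, -0.7, 0.0),
    (-1.0, 0.4, 0.2, 2.0, -0.5),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (1.5, 1.5, -0.5, 0.3, -2.0),
]
# Centred losses phi_i(g(x_i)) = l(g(x_i), y_i) - l(0, y_i).
centred_losses = [
    tuple(hinge(t, y) - hinge(0.0, y) for t, y in zip(g, labels))
    for g in predictions
]
lhs = expected_sup(centred_losses)     # complexity of the centred loss class
rhs = 1.0 * expected_sup(predictions)  # L times complexity of the predictors, L = 1
print(lhs, rhs)
```

Replacing `hinge` by any other loss that is Lipschitz in its first argument, and the factor `1.0` by its Lipschitz constant, gives the same kind of comparison.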
The contraction example handles infinite classes by comparing them to their underlying prediction geometry, but another common situation is a finite menu of candidate functions. A naive bound on a maximum over $M$ functions would often be linear in $M$, which is too large for model selection. The relevant failure mode is that each fixed signed sum is sub-Gaussian, but taking a maximum over many such sums introduces a complexity cost. The following estimate shows that this cost is logarithmic in the number of candidates, provided the candidate vectors have uniformly bounded Euclidean length.
[quotetheorem:9826]
[citeproof:9826]
The logarithmic dependence on $|\mathcal F|$ explains why finite model classes can be handled even when the number of candidates is large. The lemma does not use independence among the candidate functions; all dependence is absorbed into the finite set of vectors $T$. The Euclidean radius assumption is essential in a concrete way: if $T=\{0,(R,0,\dots,0)\}$, then
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^n\varepsilon_i t_i\right]
= \mathbb E_\varepsilon[\max\{0,R\varepsilon_1\}]
= \frac{R}{2}.
\end{align*}
No bound depending only on $|T|=2$ can control this as $R\to\infty$. Finiteness is a separate hypothesis, not a presentational convenience. If $T$ is the Euclidean unit ball in $\mathbb R^n$, then $\sup_{t\in T}\sum_i\varepsilon_i t_i=|\varepsilon|=\sqrt n$ even though the radius is $1$; a radius bound without a finite-cardinality or entropy term cannot capture the size of an infinite class. Thus the theorem separates two quantities that must both be controlled, the size of the menu and the magnitude of each candidate vector.
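The finite-class estimate and the first counterexample can be reproduced numerically. The sketch below assumes the quoted lemma takes the standard vector form $\mathbb E_\varepsilon\sup_{t\in T}\sum_i\varepsilon_i t_i\le r\sqrt{2\log|T|}$, with $r$ the largest Euclidean norm in $T$; the sets `T` and `two_point` are hypothetical.

```python
from itertools import product
from math import log, sqrt

def expected_max(T):
    """E_eps[max over t in T of sum_i eps_i t_i], by full enumeration."""
    n = len(T[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(e * c for e, c in zip(signs, t)) for t in T)
    return total / 2 ** n

def massart_bound(T):
    """r * sqrt(2 log |T|), assuming the standard form of the lemma."""
    radius = max(sqrt(sum(c * c for c in t)) for t in T)
    return radius * sqrt(2.0 * log(len(T)))

# Hypothetical finite set of vectors in R^4.
T = [(1.0, 0.0, -1.0, 2.0), (0.5, 0.5, 0.5, 0.5),
     (-2.0, 1.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
# The two-point set {0, (R, 0, 0)} from the text, with R = 7.
two_point = [(0.0, 0.0, 0.0), (7.0, 0.0, 0.0)]

print(expected_max(T), massart_bound(T))
print(expected_max(two_point))  # equals R / 2
```

Increasing the first coordinate of the nonzero vector in `two_point` increases the expected maximum proportionally while $|T|$ stays equal to $2$, which is the role of the radius in the bound.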
## Concentration and High-Probability Uniform Bounds
Expected bounds describe the average size of the supremum, but statistical guarantees usually need probability statements. Markov's inequality would convert an expectation into a probability bound, but it would give weak tails and would not use the fact that the sample consists of independent coordinates. The remaining problem is to convert an expected empirical-process bound into a statement holding with high probability by measuring how much the supremum can change when one observation is altered. Bounded differences provide exactly this concentration step.
[quotetheorem:6074]
[citeproof:6074]
McDiarmid's inequality supplies concentration around the mean, while symmetrisation supplies a bound for that mean. The bounded range hypothesis is doing real work: without a uniform interval $[a,b]$, changing one observation can alter $P_n f$ by an uncontrolled amount for some $f$, so this bounded-difference argument breaks down. The theorem also gives an upper-tail statement around $\mathbb E[Z]$; it does not by itself estimate that expectation. Combining concentration with symmetrisation removes the remaining expected supremum from the statement and gives the standard route from a Rademacher complexity estimate to a uniform deviation guarantee at confidence level $1-\delta$.
[quotetheorem:9827]
[citeproof:9827]
The bound separates complexity from confidence. The term $2\mathfrak R_n(\mathcal F)$ measures the size of the class under the data-generating distribution, while the final term is the price for a probability level $1-\delta$. The bounded range hypothesis cannot be replaced by measurability alone. For the singleton class $\mathcal F=\{f\}$ with $f(x)=x$ and a heavy-tailed distribution with finite mean, $P f-P_n f$ is the usual sample-mean error; polynomial tails then make a sub-Gaussian term of order $\sqrt{\log(1/\delta)/n}$ false at small $\delta$. This shows that the theorem is a bounded-difference result, not a general concentration theorem for unbounded losses.
The expected complexity is also not interchangeable with the empirical complexity without paying another random-fluctuation cost. A sample can land in a region where a rich class shatters the observed points even if this happens rarely under $P$, so $\widehat{\mathfrak R}_n(\mathcal F;X_1,\dots,X_n)$ can be much larger than its expectation on that event. Conversely, an unusually simple sample can make the empirical complexity too small to certify the distributional supremum by itself. The theorem is therefore distribution-dependent through $\mathfrak R_n(\mathcal F)$, and it gives one-sided control; a two-sided absolute bound follows by applying the statement to both directions and adjusting the confidence level.
[example: Finite Model Selection]
Let $\mathcal F$ be a non-empty finite class of functions $f:S\to[0,1]$ with $|\mathcal F|=M$. For any fixed sample $x_1,\dots,x_n$, each vector $(f(x_1),\dots,f(x_n))$ has Euclidean norm bounded by $\sqrt n$, because
\begin{align*}
\sum_{i=1}^n f(x_i)^2 \le \sum_{i=1}^n 1^2 = n.
\end{align*}
Thus the function-class form of *[Massart Finite-Class Lemma](/theorems/9826)* applies with $b=1$ and gives
\begin{align*}
\widehat{\mathfrak R}_n(\mathcal F;x_1,\dots,x_n)
\le \sqrt{\frac{2\log M}{n}}.
\end{align*}
Taking expectation over $X_1,\dots,X_n$ preserves this deterministic bound, so
\begin{align*}
\mathfrak R_n(\mathcal F)
= \mathbb E\left[\widehat{\mathfrak R}_n(\mathcal F;X_1,\dots,X_n)\right]
\le \mathbb E\left[\sqrt{\frac{2\log M}{n}}\right]
= \sqrt{\frac{2\log M}{n}}.
\end{align*}
Since every $f\in\mathcal F$ takes values in $[0,1]$, the range length in the *Rademacher Uniform Deviation Bound* is $b-a=1-0=1$. Therefore, for every $\delta\in(0,1)$, with probability at least $1-\delta$,
\begin{align*}
\sup_{f\in\mathcal F}(P f-P_n f)
\le 2\sqrt{\frac{2\log M}{n}}+\sqrt{\frac{\log(1/\delta)}{2n}}.
\end{align*}
Both terms are a constant multiple of the square root of a logarithm divided by $n$, so the finite-class one-sided deviation is controlled at scale
\begin{align*}
\sqrt{\frac{\log M}{n}}+\sqrt{\frac{\log(1/\delta)}{n}}.
\end{align*}
The same logarithmic dependence on $M$ also follows from a direct union bound. For a fixed $f\in\mathcal F$, [Hoeffding's inequality](/theorems/1962) for variables in $[0,1]$ gives
\begin{align*}
\mathbb P(P f-P_n f\ge t)\le \exp(-2nt^2).
\end{align*}
Taking the union over the $M$ functions,
\begin{align*}
\mathbb P\left(\sup_{f\in\mathcal F}(P f-P_n f)\ge t\right)
\le \sum_{f\in\mathcal F}\mathbb P(P f-P_n f\ge t)
\le M\exp(-2nt^2).
\end{align*}
Choose
\begin{align*}
t=\sqrt{\frac{\log(M/\delta)}{2n}}.
\end{align*}
Then
\begin{align*}
M\exp(-2nt^2)
= M\exp\left(-2n\cdot\frac{\log(M/\delta)}{2n}\right)
= M\exp(-\log(M/\delta))
= M\cdot\frac{\delta}{M}
= \delta.
\end{align*}
Hence the union-bound method gives, with probability at least $1-\delta$,
\begin{align*}
\sup_{f\in\mathcal F}(P f-P_n f)
\le \sqrt{\frac{\log(M/\delta)}{2n}}.
\end{align*}
For finite classes, both routes expose the same logarithmic cost in the number of candidates; the Rademacher route is more reusable because the same symmetrisation and complexity steps extend beyond finite menus.
[/example]
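The two bounds derived in this example are explicit, so they can be compared directly. The following sketch evaluates both for a few hypothetical values of $M$, with $n$ and $\delta$ fixed; the function names are illustrative.

```python
from math import exp, log, sqrt

def rademacher_route(M, n, delta):
    """2 * sqrt(2 log M / n) + sqrt(log(1/delta) / (2n)) for [0,1]-valued classes."""
    return 2.0 * sqrt(2.0 * log(M) / n) + sqrt(log(1.0 / delta) / (2.0 * n))

def union_route(M, n, delta):
    """sqrt(log(M/delta) / (2n)) from Hoeffding plus a union bound."""
    return sqrt(log(M / delta) / (2.0 * n))

# Hypothetical menu size, sample size and confidence parameter.
M, n, delta = 1000, 10_000, 0.05
t = union_route(M, n, delta)
tail = M * exp(-2.0 * n * t * t)  # reproduces delta up to rounding

for size in (10, 1000, 10 ** 6):
    print(size, rademacher_route(size, n, delta), union_route(size, n, delta))
```

For $[0,1]$-valued finite classes the union-bound value is never larger than the Rademacher value, since $\sqrt{u+v}\le\sqrt u+\sqrt v$ and the Rademacher route carries larger constants; the two agree in their dependence on $\log M$, $\log(1/\delta)$ and $n$.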
The expected complexity in the preceding theorem is not observable. A refined version replaces it by empirical Rademacher complexity, at the cost of another concentration term.
[quotetheorem:9828]
[citeproof:9828]
This result is the data-dependent form used in learning bounds: the observed sample estimates the relevant complexity of the class. The bounded-range hypothesis is again necessary for the concentration step. For a singleton class containing an unbounded heavy-tailed function, the empirical Rademacher complexity vanishes under the non-absolute convention, because $\mathbb E_\varepsilon[\varepsilon_i]=0$, while a single extreme observation can still move $P_n f$ by an amount far larger than $(b-a)/n$. Without a uniform range, neither the empirical-process supremum nor the empirical complexity has the coordinate stability used in the proof.
The additional concentration term is the cost of replacing the distributional quantity $\mathfrak R_n(\mathcal F)$ by the random observable $\widehat{\mathfrak R}_n$. The theorem does not claim that the empirical complexity is always smaller than a worst-case bound; it can be large on samples where the class realises many sign patterns. It also does not provide a lower bound or an exact estimate of the uniform deviation: a small empirical Rademacher complexity controls the one-sided deviation only after the displayed slack term is included, and only under the stated boundedness and measurability conventions. In many applications it is nevertheless sharper than a finite-class or entropy estimate, because it uses the realised geometry of the vectors $(f(X_1),\dots,f(X_n))$.
[remark: Measurability Conventions]
For uncountable classes, suprema such as $\sup_{f\in\mathcal F}|P_nf-Pf|$ need not be measurable without additional hypotheses. In this chapter the inequalities are stated under the assumption that the relevant suprema are measurable. Later chapters handle this issue systematically using separability, outer expectation, and measurable majorants.
[/remark]
# VC Classes and Combinatorial Dimension
This chapter builds on the empirical-measure notation, finite-dimensional process viewpoint, symmetrisation ideas, and VC terminology introduced earlier. It gives the combinatorial route to uniform control for classes of sets and binary classifiers; the entropy consequences of this counting theory are developed later in the course. The guiding question is: when can a class behave like all subsets on a finite sample, and when does its geometry force a much smaller number of possible labelings? VC dimension answers this question and converts finite-dimensional combinatorics into uniform probability bounds.
## Shattering and Growth Functions
Suppose a class of sets $\mathcal C$ is used to classify points of a measurable space $S$. On a finite sample $\{x_1,\dots,x_n\}$, the class does not remember the whole sets $C\in\mathcal C$; it remembers only the binary vectors $(\mathbb{1}_C(x_1),\dots,\mathbb{1}_C(x_n))$. The first problem is to count how many such binary patterns can occur.
[definition: Trace of a Set Class]
Let $\mathcal C$ be a class of subsets of a set $S$, and let $A\subset S$ be finite. The trace of $\mathcal C$ on $A$ is
\begin{align*}
\mathcal C|_A := \{C\cap A : C\in \mathcal C\}.
\end{align*}
[/definition]
The trace is the finite collection of subsets of $A$ that the class can realise. The next problem is to remove the dependence on the particular sample and measure the worst possible number of labelings among all samples of the same size.
[definition: Growth Function]
Let $\mathcal C$ be a class of subsets of $S$. The growth function of $\mathcal C$ is the map
\begin{align*}
\Pi_{\mathcal C}:\mathbb N\to \mathbb N\cup\{\infty\}
\end{align*}
defined by
\begin{align*}
\Pi_{\mathcal C}(n) := \sup\{|\mathcal C|_A| : A\subset S,\ |A|=n\}, \qquad n\in\mathbb N.
\end{align*}
[/definition]
Since $\mathcal C|_A\subset 2^A$, the bound $\Pi_{\mathcal C}(n) \le 2^n$ always holds. Equality $|\mathcal C|_A|=2^n$ at a particular $n$-point sample $A$ means that the class can implement every binary labeling on that sample, so we need a name for samples where no labeling restriction remains.
[definition: Shattering]
Let $\mathcal C$ be a class of subsets of $S$, and let $A\subset S$ be finite. The class $\mathcal C$ shatters $A$ if
\begin{align*}
\mathcal C|_A = 2^A.
\end{align*}
[/definition]
Shattering identifies finite samples on which the class has no combinatorial restriction. The next invariant records the largest size of such a sample, with the value $\infty$ allowed when no finite obstruction exists.
[definition: VC Dimension]
Let $S$ be a set. The VC dimension functional on set classes over $S$ is the map
\begin{align*}
V:2^{2^S}\to \mathbb N\cup\{0,\infty\}.
\end{align*}
For a class $\mathcal C$ of subsets of $S$, its value is
\begin{align*}
V(\mathcal C) := \sup\{|A| : A\subset S \text{ is finite and shattered by } \mathcal C\}.
\end{align*}
[/definition]
A finite VC dimension says that some sample size is too large to be labeled arbitrarily. Before proving the general counting theorem, it is useful to see how this obstruction appears in ordered and geometric examples.
[example: Intervals on the Real Line]
Let $\mathcal C=\{(a,b):a<b\}$ be the class of open intervals in $\mathbb R$. We first show that every two-point set is shattered. Fix $x_1<x_2$ and put $m=(x_1+x_2)/2$. The four subsets of $\{x_1,x_2\}$ are realised as follows:
\begin{align*}
(x_1-2,x_1-1)\cap\{x_1,x_2\}=\varnothing.
\end{align*}
\begin{align*}
(x_1-1,m)\cap\{x_1,x_2\}=\{x_1\}.
\end{align*}
\begin{align*}
(m,x_2+1)\cap\{x_1,x_2\}=\{x_2\}.
\end{align*}
\begin{align*}
(x_1-1,x_2+1)\cap\{x_1,x_2\}=\{x_1,x_2\}.
\end{align*}
Hence every two-point set is shattered. Restricting these four intervals to a single point shows that every one-point set is shattered as well.
Now take any three ordered points $x_1<x_2<x_3$. If an interval $(a,b)$ realised the labeling $\{x_1,x_3\}$ on this set, then $x_1\in(a,b)$ and $x_3\in(a,b)$, so $a<x_1$ and $x_3<b$. Since $x_1<x_2<x_3$, these inequalities give $a<x_2<b$, hence $x_2\in(a,b)$, contradicting the required trace $\{x_1,x_3\}$. Thus no three-point set is shattered. Therefore $V(\mathcal C)=2$, and the obstruction is exactly that every interval is order-convex: once it contains two ordered points, it contains every point between them.
[/example]
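The shattering argument for intervals can be verified by brute force on a finite sample. The sketch below lists every trace of an open interval on a set of distinct points, using the fact that such a trace is either empty or a run of consecutive points; the helper names and sample values are hypothetical.

```python
from itertools import combinations

def interval_traces(points):
    """All traces of open intervals (a, b) on distinct points, as frozensets."""
    pts = sorted(points)
    # Candidate endpoints: one below, one above, and a midpoint in every gap.
    cuts = ([pts[0] - 1.0]
            + [(p + q) / 2.0 for p, q in zip(pts, pts[1:])]
            + [pts[-1] + 1.0])
    traces = {frozenset()}  # an interval lying to the left of all points
    for a, b in combinations(cuts, 2):
        traces.add(frozenset(p for p in pts if a < p < b))
    return traces

def is_shattered(points):
    """True when every subset of the sample is the trace of some interval."""
    return len(interval_traces(points)) == 2 ** len(points)

two = [0.3, 1.7]
three = [0.0, 1.0, 2.5]
print(is_shattered(two), is_shattered(three))
print(frozenset({0.0, 2.5}) in interval_traces(three))  # the missing labeling
```

On three points the enumeration produces seven traces, and the absent one is the subset consisting of the two outer points, in agreement with the order-convexity argument.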
This example shows how VC dimension detects geometric constraints. For intervals, the class is infinite, but its action on $n$ ordered data points is governed by two endpoints rather than by $2^n$ arbitrary choices.
[example: Halfspaces in Euclidean Space]
Let $\mathcal H_d$ be the class of affine halfspaces in $\mathbb R^d$, and choose affinely independent points $x_0,\dots,x_d$. For any subset $I\subset\{0,\dots,d\}$, assign signs $s_i=1$ for $i\in I$ and $s_i=-1$ for $i\notin I$. Since $x_0,\dots,x_d$ are affinely independent, every point in their affine span has unique affine coordinates $\lambda_0,\dots,\lambda_d$ with $\lambda_i(x_j)=\mathbb{1}_{\{i=j\}}$ and $\sum_{i=0}^d\lambda_i=1$. Define the affine function
\begin{align*}
L(x)=\sum_{i=0}^d s_i\lambda_i(x).
\end{align*}
For each vertex $x_j$,
\begin{align*}
L(x_j)=\sum_{i=0}^d s_i\lambda_i(x_j)=\sum_{i=0}^d s_i\mathbb{1}_{\{i=j\}}=s_j.
\end{align*}
Thus the halfspace $\{x:L(x)>0\}$ contains exactly the vertices with indices in $I$. Since $I$ was arbitrary, the set $\{x_0,\dots,x_d\}$ is shattered, so $V(\mathcal H_d)\ge d+1$.
Now let $A=\{y_0,\dots,y_{d+1}\}$ be any set of $d+2$ points in $\mathbb R^d$. By *[Radon's theorem](/theorems/4086)*, there are disjoint nonempty index sets $I,J$ with $I\cup J\subset\{0,\dots,d+1\}$ such that
\begin{align*}
\operatorname{conv}\{y_i:i\in I\}\cap \operatorname{conv}\{y_j:j\in J\}\neq\varnothing.
\end{align*}
Choose a point $z$ in this intersection. Then for some coefficients $\alpha_i\ge 0$ and $\beta_j\ge 0$ with $\sum_{i\in I}\alpha_i=1$ and $\sum_{j\in J}\beta_j=1$,
\begin{align*}
z=\sum_{i\in I}\alpha_i y_i=\sum_{j\in J}\beta_j y_j.
\end{align*}
If $A$ were shattered, some affine halfspace $\{x:L(x)>0\}$ would contain every $y_i$ with $i\in I$ and contain no $y_j$ with $j\in J$. Hence $L(y_i)>0$ for $i\in I$ and $L(y_j)\le 0$ for $j\in J$. By affine linearity,
\begin{align*}
L(z)=L\left(\sum_{i\in I}\alpha_i y_i\right)=\sum_{i\in I}\alpha_i L(y_i)>0.
\end{align*}
The strict inequality holds because $I$ is nonempty, the coefficients are nonnegative and sum to $1$, and each $L(y_i)>0$. The same point $z$ also satisfies
\begin{align*}
L(z)=L\left(\sum_{j\in J}\beta_j y_j\right)=\sum_{j\in J}\beta_j L(y_j)\le 0.
\end{align*}
This contradiction shows that no set of $d+2$ points is shattered. Therefore $V(\mathcal H_d)=d+1$: affine halfspaces can realize all labelings on the vertices of a simplex, but Radon's convexity obstruction prevents arbitrary labelings on one more point.
[/example]
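The lower bound construction can be made concrete for one particular simplex. The sketch below assumes the standard simplex with vertices $0,e_1,\dots,e_d$, for which the affine coordinates are $\lambda_0(x)=1-\sum_k x_k$ and $\lambda_k(x)=x_k$; this choice of vertices and the helper names are hypothetical, and $d=3$ is used for illustration.

```python
from itertools import product

def barycentric(x):
    """Affine coordinates of x relative to the vertices 0, e_1, ..., e_d."""
    return (1.0 - sum(x),) + tuple(x)

def separating_function(signs):
    """The affine function L(x) = sum_i s_i * lambda_i(x) from the example."""
    def L(x):
        return sum(s * lam for s, lam in zip(signs, barycentric(x)))
    return L

d = 3
vertices = [tuple(0.0 for _ in range(d))] + [
    tuple(1.0 if k == j else 0.0 for k in range(d)) for j in range(d)
]

realised = set()
values_match = True
for signs in product((-1, 1), repeat=d + 1):
    L = separating_function(signs)
    # L takes the prescribed sign s_j at the vertex x_j.
    values_match = values_match and all(L(v) == s for v, s in zip(vertices, signs))
    realised.add(tuple(L(v) > 0 for v in vertices))

print(len(realised), 2 ** (d + 1))
```

Every one of the $2^{d+1}$ labelings of the vertices is produced by an open halfspace $\{L>0\}$, which is the shattering claim for this simplex.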
The halfspace example is the prototype for statistical learning: the number of parameters is reflected in the VC dimension, but the argument is geometric rather than merely parametric. To use VC dimension in probability bounds, we need a theorem turning the absence of large shattered sets into a uniform growth estimate.
[quotetheorem:1969]
[citeproof:1969]
The lemma is the combinatorial engine of VC theory. It says that finite VC dimension changes the worst-case number of labelings from exponential in $n$ to polynomial in $n$, which is the scale needed for entropy integrals and uniform laws of large numbers. The finite VC dimension hypothesis is essential: if $\mathcal C=2^S$ on an infinite set $S$, then every finite sample is shattered and $\Pi_{\mathcal C}(n)=2^n$ for all $n$. The theorem does not say that a VC class has few sets globally, only that its restrictions to finite samples have few distinct traces. This distinction is what makes the result useful for probability, where empirical processes only see a class through its values on the observed sample.
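To see the change of scale numerically, the following sketch tabulates the polynomial bound against $2^n$. It assumes the quoted lemma is stated in the standard form $\Pi_{\mathcal C}(n)\le\sum_{i=0}^{V}\binom{n}{i}$ with $V=V(\mathcal C)$; the function name `sauer_bound` is hypothetical.

```python
from math import comb

def sauer_bound(n, V):
    """sum_{i=0}^{V} C(n, i), assuming the standard Sauer-Shelah form."""
    return sum(comb(n, i) for i in range(V + 1))

def interval_trace_count(n):
    """Number of traces of open intervals on n distinct real points."""
    return 1 + n * (n + 1) // 2  # empty trace plus all runs of consecutive points

for n in (5, 10, 20, 40):
    print(n, sauer_bound(n, 2), interval_trace_count(n), 2 ** n)
```

For open intervals on the line, whose VC dimension is $2$, the count of traces equals the bound exactly, so the polynomial estimate cannot be improved in general.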
[remark: Infinite VC Dimension]
If $V(\mathcal C)=\infty$, then for every $n$ there is some $n$-point set shattered by $\mathcal C$, so $\Pi_{\mathcal C}(n)=2^n$. In this case the Sauer-Shelah conclusion gives no reduction in complexity.
[/remark]
The dichotomy is sharp: either samples of every finite size can be shattered, so that $\Pi_{\mathcal C}(n)=2^n$ for all $n$, or the growth function is bounded by a polynomial in $n$. This is why VC dimension acts as a discrete analogue of dimension for classifiers.
## VC Subgraph Classes
Empirical processes are usually indexed by real-valued functions, not only indicators. To connect VC combinatorics with entropy bounds for function classes, we encode inequalities of the form $f(x)>t$ as subsets of a larger space.
[definition: Subgraph Class]
Let $S$ be a set. The subgraph operator is the map
\begin{align*}
\operatorname{subgraph}:2^{\mathbb R^S}\to 2^{2^{S\times\mathbb R}}
\end{align*}
defined on each class $\mathcal F\subset \mathbb R^S$ by
\begin{align*}
\operatorname{subgraph}(\mathcal F) := \big\{\{(x,t)\in S\times\mathbb R : t < f(x)\}: f\in\mathcal F\big\}.
\end{align*}
[/definition]
The notation means that each $f\in\mathcal F$ contributes one subset of $S\times\mathbb R$. The next question is when this lifted set class has finite combinatorial dimension; that condition is the real-valued analogue of being a VC class.
[definition: VC Subgraph Class]
Let $S$ be a set. A class $\mathcal F\subset\mathbb R^S$ is a VC subgraph class if
\begin{align*}
V(\operatorname{subgraph}(\mathcal F))\in\mathbb N\cup\{0\}.
\end{align*}
[/definition]
For indicator functions, this definition reduces to the earlier set-class notion up to a harmless change of dimension. The subgraph formulation is more flexible because it treats thresholds, real-valued losses, and regression functions in the same language.
[example: Threshold Classifiers]
Let $\mathcal F=\{f_a:a\in\mathbb R\}$, where $f_a(x)=\mathbb{1}_{(-\infty,a]}(x)$ on $\mathbb R$. For each $a$, its subgraph is
\begin{align*}
G_a=\{(x,t)\in\mathbb R^2:t<f_a(x)\}.
\end{align*}
Since $f_a(x)=1$ when $x\le a$ and $f_a(x)=0$ when $x>a$, membership in $G_a$ is exactly
\begin{align*}
(x,t)\in G_a \quad \Longleftrightarrow \quad (x\le a \text{ and } t<1)\text{ or }(x>a \text{ and } t<0).
\end{align*}
Equivalently, points with $t<0$ are always in $G_a$, points with $t\ge 1$ are never in $G_a$, and points with $0\le t<1$ are in $G_a$ exactly when $x\le a$.
Now fix a finite sample $A=\{(x_i,t_i):1\le i\le n\}\subset\mathbb R^2$. Put
\begin{align*}
A_-=\{(x_i,t_i)\in A:t_i<0\}.
\end{align*}
\begin{align*}
A_0=\{(x_i,t_i)\in A:0\le t_i<1\}.
\end{align*}
\begin{align*}
A_+=\{(x_i,t_i)\in A:t_i\ge 1\}.
\end{align*}
For every $a\in\mathbb R$,
\begin{align*}
G_a\cap A=A_-\cup\{(x_i,t_i)\in A_0:x_i\le a\}.
\end{align*}
Thus the only part of the trace that can change with $a$ is an initial segment of the $x$-coordinates appearing in $A_0$. If $m=|A_0|$, there are at most $m+1$ such initial segments, so
\begin{align*}
|\operatorname{subgraph}(\mathcal F)|_A|\le m+1\le n+1.
\end{align*}
In particular, no two-point set can be shattered, since shattering two points would require $2^2=4$ traces but the bound gives at most $3$.
A one-point set such as $\{(0,1/2)\}$ is shattered: if $a<0$, then $(0,1/2)\notin G_a$, while if $a\ge 0$, then $(0,1/2)\in G_a$. Hence $V(\operatorname{subgraph}(\mathcal F))=1$, so $\mathcal F$ is a VC subgraph class. The subgraph condition has not introduced new combinatorial freedom; it is still governed by one threshold on the real line.
[/example]
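The trace count in this example can be confirmed by enumeration. The sketch below computes the distinct traces $G_a\cap A$ over all thresholds $a$, using one threshold below every sample $x$-value and one at each distinct $x$-value, which suffices because the trace changes only when $a$ crosses a sample point; the sample is hypothetical.

```python
def subgraph_traces(sample):
    """Distinct traces of the subgraphs G_a on `sample`, as sets of indices."""
    xs = sorted({x for x, _ in sample})
    thresholds = [xs[0] - 1.0] + xs  # one below all x-values, then each x-value
    traces = set()
    for a in thresholds:
        traces.add(frozenset(
            i for i, (x, t) in enumerate(sample)
            if t < (1.0 if x <= a else 0.0)  # membership test t < f_a(x)
        ))
    return traces

# Hypothetical sample of points (x_i, t_i) in the plane.
sample = [(0.0, 0.5), (1.0, -0.3), (2.0, 0.2), (0.5, 1.4), (3.0, 0.9)]
m = sum(1 for _, t in sample if 0.0 <= t < 1.0)  # size of A_0
traces = subgraph_traces(sample)
print(len(traces), m + 1)

one_point = subgraph_traces([(0.0, 0.5)])
two_points = subgraph_traces([(0.0, 0.5), (1.0, 0.5)])
print(len(one_point), len(two_points))
```

The one-point sample yields both possible traces, while the two-point sample yields three of the four, consistent with $V(\operatorname{subgraph}(\mathcal F))=1$.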
Thresholds illustrate the passage from binary classifiers to functions. The next problem is permanence: once a few primitive classes are known to be VC subgraph, we need operations that preserve the property when building losses and decision rules.
[quotetheorem:9829]
[citeproof:9829]
These closure properties are a practical calculus for examples. They justify treating many parametric decision rules and loss classes as VC subgraph once their basic inequalities are semialgebraic or built from finitely many thresholds. The hypotheses matter: arbitrary transformations of the range, or arbitrary pointwise operations over infinitely many functions, need not preserve finite VC subgraph dimension because they can encode increasingly complicated threshold patterns. For instance, starting with the one-function class $\mathcal F=\{f\}$ where $f(x)=x$ on $S=\mathbb R$, unrestricted measurable post-processing would include $\mathbb{1}_A\circ f$ for every subset $A\subset\mathbb R$; on any finite sample, a suitable choice of $A$ realises every labeling, so the resulting indicator class has infinite VC dimension. The theorem is therefore a finite-operation principle, not a licence to close VC subgraph classes under all measurable post-processing. This limitation guides the examples below, where each decision region is reduced to finitely many polynomial or affine inequalities.
[example: Euclidean Balls]
Let $\mathcal B_d$ be the class of closed Euclidean balls $\overline B(a,r)=\{x\in\mathbb R^d:|x-a|\le r\}$, with $a\in\mathbb R^d$ and $r\ge 0$. For $x=(x_1,\dots,x_d)$, membership in $\overline B(a,r)$ is equivalent to the quadratic inequality
\begin{align*}
|x-a|^2-r^2\le 0.
\end{align*}
Writing $a=(a_1,\dots,a_d)$, the left side expands as
\begin{align*}
|x-a|^2-r^2=\sum_{k=1}^d (x_k-a_k)^2-r^2.
\end{align*}
Expanding each square gives
\begin{align*}
\sum_{k=1}^d (x_k-a_k)^2-r^2=\sum_{k=1}^d x_k^2-2\sum_{k=1}^d a_kx_k+\sum_{k=1}^d a_k^2-r^2.
\end{align*}
Since $\sum_{k=1}^d x_k^2=|x|^2$ and $\sum_{k=1}^d a_k^2=|a|^2$, this becomes
\begin{align*}
|x-a|^2-r^2=|x|^2-2a\cdot x+|a|^2-r^2.
\end{align*}
Now define the feature map $\Phi:\mathbb R^d\to\mathbb R^{d+1}$ by
\begin{align*}
\Phi(x)=(x_1,\dots,x_d,|x|^2).
\end{align*}
For each ball $\overline B(a,r)$, define an affine function on $\mathbb R^{d+1}$ by
\begin{align*}
L_{a,r}(u_1,\dots,u_d,s)=s-2\sum_{k=1}^d a_ku_k+|a|^2-r^2.
\end{align*}
Substituting $\Phi(x)$ into this affine function gives
\begin{align*}
L_{a,r}(\Phi(x))=|x|^2-2a\cdot x+|a|^2-r^2.
\end{align*}
Therefore
\begin{align*}
x\in\overline B(a,r)\Longleftrightarrow L_{a,r}(\Phi(x))\le 0.
\end{align*}
Thus every labeling of points in $\mathbb R^d$ produced by a Euclidean ball is obtained by first applying the finite-dimensional lift $\Phi$ and then applying an affine halfspace in $\mathbb R^{d+1}$. Since affine halfspaces in $\mathbb R^{d+1}$ have VC dimension $d+2$ (the Radon argument above applies equally to closed halfspaces $\{L\le 0\}$, whose complements are the open halfspaces $\{L>0\}$), no set of $d+3$ lifted points can be shattered by all affine halfspaces, and hence no set of $d+3$ original points can be shattered by balls. Consequently $V(\mathcal B_d)\le d+2$. The useful point is the finite-dimensional lift: a quadratic ball inequality becomes one affine threshold inequality in the variables $(x_1,\dots,x_d,|x|^2)$.
[/example]
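The lifting identity can be checked numerically. The sketch below implements $\Phi$ and $L_{a,r}$ as defined in the example and compares the sign of $L_{a,r}(\Phi(x))$ with direct ball membership; the centre, radius and test points are hypothetical and chosen away from the boundary to avoid rounding ambiguity.

```python
def lift(x):
    """Feature map Phi(x) = (x_1, ..., x_d, |x|^2)."""
    return tuple(x) + (sum(c * c for c in x),)

def affine_form(a, r):
    """Return L_{a,r}(u_1, ..., u_d, s) = s - 2 a.u + |a|^2 - r^2."""
    offset = sum(c * c for c in a) - r * r
    def L(u):
        *head, s = u
        return s - 2.0 * sum(ak * uk for ak, uk in zip(a, head)) + offset
    return L

def in_ball(x, a, r):
    """Direct membership test |x - a|^2 <= r^2."""
    return sum((xk - ak) ** 2 for xk, ak in zip(x, a)) <= r * r

# Hypothetical ball in R^3 and test points.
centre, radius = (1.0, -2.0, 0.5), 2.0
test_points = [(1.0, -2.0, 0.5), (3.5, -2.0, 0.5), (0.0, 0.0, 0.0), (1.5, -1.0, 1.0)]
L = affine_form(centre, radius)
agreement = [(L(lift(x)) <= 0.0) == in_ball(x, centre, radius) for x in test_points]
gaps = [abs(L(lift(x)) - (sum((xk - ak) ** 2 for xk, ak in zip(x, centre)) - radius ** 2))
        for x in test_points]
print(agreement, max(gaps))
```

The affine form depends on the ball only through $a$ and $|a|^2-r^2$, which is why one extra coordinate is enough.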
The lifting trick is a recurring way to recognise VC classes. Polynomial decision regions follow the same pattern, replacing the quadratic lift by all monomials up to a fixed degree.
[example: Polynomial Decision Regions]
Fix $d,m\in\mathbb N$, and let $\mathcal P_{d,m}$ consist of all sets $\{x\in\mathbb R^d:p(x)\ge 0\}$ with $p$ a real polynomial of degree at most $m$. For a multi-index $\alpha=(\alpha_1,\dots,\alpha_d)\in\mathbb N^d$, write
\begin{align*}
|\alpha|=\alpha_1+\cdots+\alpha_d,\qquad x^\alpha=x_1^{\alpha_1}\cdots x_d^{\alpha_d}.
\end{align*}
Every polynomial of degree at most $m$ has the form
\begin{align*}
p(x)=\sum_{|\alpha|\le m} c_\alpha x^\alpha.
\end{align*}
The number of monomials with $|\alpha|\le m$ is the number of $d$-tuples of nonnegative integers whose sum is at most $m$. Introducing one extra coordinate $\alpha_{d+1}=m-|\alpha|$, this is the number of nonnegative $(d+1)$-tuples summing to $m$, hence
\begin{align*}
N=\#\{\alpha\in\mathbb N^d:|\alpha|\le m\}={d+m\choose m}.
\end{align*}
Define the feature map $\Phi:\mathbb R^d\to\mathbb R^N$ by listing all monomials of degree at most $m$:
\begin{align*}
\Phi(x)=(x^\alpha)_{|\alpha|\le m}.
\end{align*}
If $c=(c_\alpha)_{|\alpha|\le m}\in\mathbb R^N$, then
\begin{align*}
c\cdot \Phi(x)=\sum_{|\alpha|\le m} c_\alpha x^\alpha=p(x).
\end{align*}
Therefore
\begin{align*}
\{x:p(x)\ge 0\}=\{x:c\cdot \Phi(x)\ge 0\}=\Phi^{-1}(\{u\in\mathbb R^N:c\cdot u\ge 0\}).
\end{align*}
Thus every polynomial decision region is the inverse image, under a fixed finite-dimensional feature map, of a linear halfspace in $\mathbb R^N$. Since linear halfspaces are a subclass of affine halfspaces in $\mathbb R^N$, and affine halfspaces in $\mathbb R^N$ have VC dimension $N+1$ by the halfspace argument above, no set of $N+2$ lifted points can be shattered by all affine halfspaces. Hence no set of $N+2$ original points can be shattered by $\mathcal P_{d,m}$, so
\begin{align*}
V(\mathcal P_{d,m})\le N+1={d+m\choose m}+1.
\end{align*}
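This bound can be sharpened by one. The sets $\{u\in\mathbb R^N:c\cdot u\ge 0\}$ are homogeneous halfspaces, and homogeneous halfspaces in $\mathbb R^N$ have VC dimension at most $N$, because the constant monomial in $\Phi$ already plays the role of the affine offset. For this reason the result is often stated as $V(\mathcal P_{d,m})\le N={d+m\choose m}$. The essential point is that a polynomial inequality of bounded degree becomes one halfspace inequality after a finite monomial lift.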
[/example]
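The monomial count and the identity $c\cdot\Phi(x)=p(x)$ can be checked on a small case. The sketch below enumerates multi-indices by brute force and evaluates one hypothetical polynomial, $p(x_1,x_2)=1-x_1^2-x_2^2$, both directly and through the feature map; the helper names are illustrative assumptions.

```python
from itertools import product
from math import comb

def multi_indices(d, m):
    """All alpha in N^d with |alpha| <= m, in a fixed order."""
    return [alpha for alpha in product(range(m + 1), repeat=d) if sum(alpha) <= m]

def feature_map(x, m):
    """Phi(x) = (x^alpha) over all |alpha| <= m, in the order of multi_indices."""
    feats = []
    for alpha in multi_indices(len(x), m):
        value = 1
        for xi, ai in zip(x, alpha):
            value *= xi ** ai
        feats.append(value)
    return feats

# Hypothetical example: p(x1, x2) = 1 - x1^2 - x2^2, degree m = 2 in d = 2 variables.
d, m = 2, 2
coeffs = {(0, 0): 1, (2, 0): -1, (0, 2): -1}
c = [coeffs.get(alpha, 0) for alpha in multi_indices(d, m)]

def p(x):
    """Direct evaluation of the example polynomial."""
    return 1 - x[0] ** 2 - x[1] ** 2

def linear_score(x):
    """Evaluation through the lift: c . Phi(x)."""
    return sum(ci * ui for ci, ui in zip(c, feature_map(x, m)))

print(len(multi_indices(d, m)), comb(d + m, m))   # both counts should agree
print(p((1, 2)), linear_score((1, 2)))            # both evaluations should agree
```

The enumeration is exponential in $d$ and is meant only for verifying the count on small cases, not for computing with large feature maps.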
The conclusion is that many nonlinear classifiers have finite VC dimension because they are linear after a finite feature map. This finite-dimensional feature representation will feed directly into entropy bounds.
## Entropy Bounds for VC Classes
The next problem is to translate combinatorial dimension into metric entropy. Entropy involves distances such as $L^2(Q)$, where $Q$ is a probability measure, while VC dimension is defined by exact labelings on finite sets. The bridge is to approximate sets by their values on a random finite sample and then use Sauer-Shelah to count the possible traces.
[definition: Covering Number]
Let $(T,d)$ be a metric space and let $A\subset T$. The covering number of $A$ in $(T,d)$ is the map
\begin{align*}
N(\cdot,A,d):(0,\infty)\to \mathbb N\cup\{\infty\}
\end{align*}
whose value $N(\varepsilon,A,d)$ is the least number of closed $d$-balls of radius $\varepsilon$ needed to cover $A$.
[/definition]
Covering numbers quantify the size of a class at resolution $\varepsilon$. For indicators, the relevant metric is $d_Q(C,D)=Q(C\triangle D)^{1/2}$, which is the $L^2(Q)$ distance between $\mathbb{1}_C$ and $\mathbb{1}_D$. The problem is that $Q$ may be any probability measure, including one concentrated on finitely many difficult points, so the entropy bound must come from the combinatorial trace structure rather than from regularity of the ambient space.
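To make the metric $d_Q$ concrete, the following sketch computes it for a finitely supported $Q$ and builds a greedy $\varepsilon$-net for the threshold sets $\{1,\dots,a\}$, $a=0,\dots,k$. The choice of $Q$ uniform on $\{1,\dots,k\}$ and the names `d_Q` and `greedy_net` are assumptions made for illustration. The size of a greedy net is only an upper bound on the covering number, not necessarily its exact value.

```python
from math import sqrt

def d_Q(C, D, Q):
    """L2(Q) distance between indicators: Q(C symmetric-difference D)^(1/2)."""
    return sqrt(sum(Q[x] for x in C ^ D))

def greedy_net(sets, Q, eps):
    """Greedy eps-net: every set lies within eps of some chosen centre,
    so len(result) is an upper bound on N(eps, sets, d_Q)."""
    centers = []
    for C in sets:
        if all(d_Q(C, D, Q) > eps for D in centers):
            centers.append(C)
    return centers

k = 16
Q = {x: 1 / k for x in range(1, k + 1)}                        # uniform on {1, ..., k}
thresholds = [frozenset(range(1, a + 1)) for a in range(k + 1)]  # traces of (-inf, a]
net = greedy_net(thresholds, Q, eps=0.5)

print(len(thresholds), len(net))
```

Here $d_Q$ between two thresholds $a$ and $b$ equals $\sqrt{|a-b|/k}$, so the net needs far fewer centres than there are sets, and the count depends on $\varepsilon$ rather than on $k$ once $k$ is large.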
[quotetheorem:9830]
[citeproof:9830]
This result is the metric form of the VC principle. Finite VC dimension gives entropy of logarithmic order $v\log(1/\varepsilon)$, which is precisely the order needed in maximal inequalities for empirical processes. The measurability assumptions are not cosmetic: if a set $C\subset S$ is not $Q$-measurable, then $Q(C\triangle D)$ may not be defined, and if the supremum over a class is nonmeasurable, a random finite sample cannot be inserted into an ordinary probability bound without replacing probabilities by outer probabilities. Thus the theorem is stated for measurable classes, or in more advanced treatments with explicit separability conventions. The constants are not meant to be sharp; the value of the theorem is its distribution-free polynomial dependence on $\varepsilon^{-1}$. In later maximal inequalities, this bound supplies the complexity input while concentration estimates supply the probabilistic input.
[remark: Uniformity in the Underlying Distribution]
The entropy bound is uniform over all probability measures $Q$. This uniformity is essential in empirical process applications because symmetrisation often introduces the empirical measure $P_n$, which is itself random.
[/remark]
The set-class entropy bound extends to many bounded VC subgraph classes by applying the same discretisation to subgraphs and level sets. This is the route by which real-valued VC subgraph classes enter empirical process estimates.
## Uniform Deviation Bounds for Binary Classifiers
The final question in this chapter is probabilistic: how does finite VC dimension control the gap between empirical and true probabilities uniformly over a classifier class? Let $X_1,\dots,X_n$ be i.i.d. with distribution $P$, and write
\begin{align*}
P_n(C) := \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i).
\end{align*}
The quantity $\sup_{C\in\mathcal C}|P_n(C)-P(C)|$ measures the worst classification-frequency error over $\mathcal C$.
[quotetheorem:9831]
[citeproof:9831]
The inequality gives a nonasymptotic uniform law of large numbers. For fixed VC dimension $v$, the right-hand side tends to zero as $n\to\infty$ whenever $t\ge c\sqrt{(v\log n)/n}$ for a sufficiently large absolute constant $c$. Countability, or an equivalent measurability convention using outer probability, is needed so that the supremum over $\mathcal C$ is a legitimate random variable. The finite-growth assumption is also essential: for a class that shatters every finite sample, the number of labelings in the union bound grows exponentially in the sample size, which overwhelms the factor $\exp(-nt^2/32)$ for every $t\le 1$, so the displayed estimate no longer forces convergence. The theorem controls uniform deviations for a fixed class, but it does not identify the sharp constants or replace problem-specific inequalities such as Dvoretzky-Kiefer-Wolfowitz for thresholds.
[example: Uniform Deviation for Thresholds]
Let $\mathcal C=\{(-\infty,a]:a\in\mathbb R\}$. If $x\in\mathbb R$, then choosing $a<x$ gives $(-\infty,a]\cap\{x\}=\varnothing$, while choosing $a=x$ gives $(-\infty,a]\cap\{x\}=\{x\}$, so one-point sets are shattered. For two ordered points $x_1<x_2$, every threshold trace is one of
\begin{align*}
\varnothing,\qquad \{x_1\},\qquad \{x_1,x_2\}.
\end{align*}
The subset $\{x_2\}$ cannot occur, because $x_2\le a$ and $x_1<x_2$ imply $x_1\le a$. Hence $V(\mathcal C)=1$.
For an ordered sample $x_{(1)}\le \cdots \le x_{(n)}$, the trace of $(-\infty,a]$ is determined by the number
\begin{align*}
k(a)=\#\{i:x_i\le a\}.
\end{align*}
As $a$ moves from below all sample points to above all sample points, the possible traces are the initial segments with $k=0,1,\dots,n$, so $\Pi_{\mathcal C}(n)=n+1$, with the maximum attained when the sample points are distinct. The *VC Inequality* evaluates the growth function at $2n$, and here
\begin{align*}
\Pi_{\mathcal C}(2n)=2n+1.
\end{align*}
Therefore, for every $t>0$,
\begin{align*}
\mathbb P\left(\sup_{a\in\mathbb R}|P_n((-\infty,a])-P((-\infty,a])|>t\right)\le 8(2n+1)\exp\left(-\frac{nt^2}{32}\right).
\end{align*}
Since $P_n((-\infty,a])$ is the empirical distribution function at $a$ and $P((-\infty,a])$ is the true distribution function at $a$, this is a distribution-free uniform bound for the empirical distribution function. It has weaker constants than the *Dvoretzky-Kiefer-Wolfowitz inequality* and carries an extra polynomial factor in $n$, but the same VC argument applies to arbitrary classes once their growth function is controlled.
[/example]
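The count $\Pi_{\mathcal C}(n)=n+1$ and the size of the resulting bound can be checked numerically. The sketch below enumerates threshold traces on a sample and evaluates the right-hand side from the example. For comparison it also evaluates the two-sided Dvoretzky-Kiefer-Wolfowitz bound in the form $2\exp(-2nt^2)$, which is a standard statement quoted here as outside knowledge rather than taken from these notes. All function names are illustrative.

```python
from math import exp

def threshold_traces(sample):
    """Distinct traces of the sets (-inf, a] on a finite sample."""
    pts = sorted(set(sample))
    # One cut below all points, then one cut at each distinct sample point.
    cuts = [pts[0] - 1] + pts
    return {frozenset(x for x in pts if x <= a) for a in cuts}

def vc_bound_thresholds(n, t):
    """Right-hand side 8 (2n + 1) exp(-n t^2 / 32) from the example."""
    return 8 * (2 * n + 1) * exp(-n * t * t / 32)

def dkw_bound(n, t):
    """Two-sided DKW bound 2 exp(-2 n t^2), quoted only for comparison."""
    return 2 * exp(-2 * n * t * t)

print(len(threshold_traces([0.3, 1.2, 2.5, 4.0])))   # n + 1 traces for n distinct points
for n in (10**3, 10**4, 10**5):
    print(n, vc_bound_thresholds(n, 0.1), dkw_bound(n, 0.1))
```

At $t=0.1$ the VC bound exceeds $1$, and is therefore vacuous, for moderate $n$, and becomes informative only for much larger samples, which illustrates the loss in constants described in the example.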
Thresholds show the cost of generality: VC methods sacrifice sharp constants for structural breadth. In classification, this breadth is more important than exact constants because the same estimate applies to many classes of decision regions.
[example: Binary Linear Classification]
Let $\mathcal H_d$ be the class of affine halfspaces in $\mathbb R^d$, and let $X_1,\dots,X_n$ be i.i.d. with law $P$. From the halfspace computation above, $V(\mathcal H_d)=d+1$. Applying the *VC Inequality* with $v=d+1$ gives, for every $t>0$ and every $n\ge d+1$,
\begin{align*}
\mathbb P\left(\sup_{H\in\mathcal H_d}|P_n(H)-P(H)|>t\right)\le 8\left(\frac{2en}{d+1}\right)^{d+1}\exp\left(-\frac{nt^2}{32}\right).
\end{align*}
The polynomial factor can be written as
\begin{align*}
\left(\frac{2en}{d+1}\right)^{d+1}=\left(\frac{2e}{d+1}\right)^{d+1}n^{d+1}.
\end{align*}
Thus the logarithm of the right-hand side is
\begin{align*}
\log 8+(d+1)\log\left(\frac{2e}{d+1}\right)+(d+1)\log n-\frac{nt^2}{32}.
\end{align*}
For fixed $d$ and fixed $t>0$, the first two terms are constant in $n$, the term $(d+1)\log n$ grows logarithmically, and the term $nt^2/32$ grows linearly. Hence the displayed logarithm tends to $-\infty$, so the probability tends to $0$. Therefore empirical halfspace frequencies converge uniformly to their population probabilities for every fixed dimension $d$.
[/example]
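The competition between the $(d+1)\log n$ term and the linear term $nt^2/32$ can be made concrete. The sketch below evaluates the displayed logarithm and searches, by doubling, for a sample size at which the bound drops below a target level $\delta$. The function names and the doubling search are illustrative choices; the search returns a sufficient sample size on the doubling grid, not the smallest possible one.

```python
from math import e, log

def log_vc_bound_halfspaces(n, d, t):
    """log of 8 (2 e n / (d+1))^(d+1) exp(-n t^2 / 32), intended for n >= d + 1."""
    v = d + 1
    return log(8) + v * log(2 * e / v) + v * log(n) - n * t * t / 32

def first_n_below(d, t, delta):
    """First n in the sequence (d+1), 2(d+1), 4(d+1), ... with bound <= delta."""
    n = d + 1
    while log_vc_bound_halfspaces(n, d, t) > log(delta):
        n *= 2
    return n

for n in (10**2, 10**3, 10**4, 10**5):
    print(n, log_vc_bound_halfspaces(n, d=2, t=0.5))
print(first_n_below(d=2, t=0.5, delta=0.05))
```

Because the logarithm of the bound is concave in $n$ and starts above $\log\delta$, it crosses the level $\log\delta$ only once, so the bound stays below $\delta$ for all larger $n$ once the search stops.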
The chapter's main message is that finite combinatorial dimension is enough to control empirical processes indexed by binary classifiers. Sauer-Shelah supplies polynomial growth, the Dudley-Pollard bound turns it into entropy, and the VC inequality turns both into distribution-free deviation estimates. Later chapters will replace exact combinatorial counting by chaining and bracketing, but the same philosophy remains: complexity bounds determine the size of the empirical process.
# 5. Donsker Classes and Brownian Bridges
The previous chapters established uniform laws of large numbers and concentration bounds for empirical processes. We now move from first-order convergence to second-order fluctuations: after centering and multiplying by $\sqrt n$, the empirical measure may converge to a Gaussian process. The main question is no longer whether $P_n f$ is close to $P f$ uniformly over a class $\mathcal F$, but whether the whole random function $f \mapsto \sqrt n(P_n-P)f$ has a weak limit in $\ell^\infty(\mathcal F)$.
The limiting process is a Brownian bridge indexed by the same class of sets or functions. This chapter explains what it means for a class to be Donsker, how the Brownian bridge is constructed through its covariance, and why asymptotic equicontinuity is the condition that upgrades finite-dimensional central limit theorems to functional weak convergence.
## Weak Convergence of Empirical Processes in $\ell^\infty(F)$
The finite-dimensional central limit theorem from Chapter 1 gives convergence of $(G_n f_1,\dots,G_n f_k)$ for any fixed finite list of functions. The problem is to decide when these finite-dimensional limits fit together into convergence of the entire random element $G_n$ as a [bounded function](/page/Bounded%20Function) on $F$.
[definition: Empirical Process Indexed by a Function Class]
Let $(X_i)_{i\ge 1}$ be i.i.d. random variables with law $P$ on a measurable space $(S,\mathcal S)$. Let $F$ be a class of measurable real-valued functions on $S$ such that $P f$ is defined for every $f\in F$. Define the empirical mean by
\begin{align*}
P_n f:=\frac{1}{n}\sum_{i=1}^n f(X_i).
\end{align*}
The empirical process indexed by $F$ is the map
\begin{align*}
G_n:F&\to \mathbb R, & G_n f&:=\sqrt n(P_n-P)f=\frac{1}{\sqrt n}\sum_{i=1}^n(f(X_i)-P f).
\end{align*}
[/definition]
This process is viewed as a random element of $\ell^\infty(F)$ when its sample paths are bounded, that is, when $\sup_{f\in F}|G_n f|<\infty$ almost surely, with outer probability used when measurability is not automatic. Since the course needs a name for the classes where this random element has a Gaussian weak limit, we introduce the Donsker property.
[definition: P-Donsker Class]
Let $F$ be a class of measurable real-valued functions with $P f^2<\infty$ for every $f\in F$. The class $F$ is $P$-Donsker if the empirical process $G_n$ converges weakly in $\ell^\infty(F)$ to a tight centered Gaussian process $G_P$ indexed by $F$ with covariance
\begin{align*}
\operatorname{Cov}(G_P f,G_P g)=P(fg)-P f\,P g.
\end{align*}
[/definition]
The notation records dependence on the underlying distribution. A class can be Donsker for one law $P$ and fail to be Donsker for another law, especially when envelopes or metric entropy depend strongly on $P$.
[example: Finite Classes Are Donsker]
Let $F=\{f_1,\dots,f_k\}$ and assume $P f_j^2<\infty$ for each $j$. Since $P|f_j|\le (P f_j^2)^{1/2}$, every $P f_j$ is finite. For each observation define the centered vector
\begin{align*}
Y_i:=\bigl(f_1(X_i)-P f_1,\dots,f_k(X_i)-P f_k\bigr)\in\mathbb R^k.
\end{align*}
Then $Y_1,Y_2,\dots$ are i.i.d., $E Y_i=0$, and the empirical process on $F$ is exactly
\begin{align*}
(G_n f_1,\dots,G_n f_k)=\frac{1}{\sqrt n}\sum_{i=1}^n Y_i.
\end{align*}
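Let $X$ denote a generic observation with law $P$. The vectors $Y_1,Y_2,\dots$ share a common covariance matrix $\Sigma$, whose $(i,j)$ entry, for function indices $1\le i,j\le k$, is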
\begin{align*}
\Sigma_{ij}=E\bigl[(f_i(X)-P f_i)(f_j(X)-P f_j)\bigr].
\end{align*}
Expanding the product gives
\begin{align*}
(f_i(X)-P f_i)(f_j(X)-P f_j)=f_i(X)f_j(X)-f_i(X)P f_j-f_j(X)P f_i+P f_i\,P f_j.
\end{align*}
Taking expectations term by term,
\begin{align*}
\Sigma_{ij}=P(f_i f_j)-P f_i\,P f_j-P f_j\,P f_i+P f_i\,P f_j=P(f_i f_j)-P f_i\,P f_j.
\end{align*}
By the *Multivariate Central Limit Theorem*,
\begin{align*}
(G_n f_1,\dots,G_n f_k)\xrightarrow{d}N_k(0,\Sigma).
\end{align*}
Identifying $\ell^\infty(F)$ with $\mathbb R^k$ by $h\mapsto (h(f_1),\dots,h(f_k))$, this is weak convergence of $G_n$ to a centered Gaussian process with covariance $P(f_i f_j)-P f_iP f_j$. Thus every finite square-integrable class is $P$-Donsker, and the real difficulty in the Donsker property begins only when $F$ is infinite.
[/example]
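The covariance identity $\Sigma_{ij}=P(f_if_j)-Pf_i\,Pf_j$ can be verified exactly on a small discrete law. The sketch below uses a hypothetical three-point distribution and the two functions $x\mapsto x$ and $x\mapsto x^2$, with rational arithmetic so that both sides agree exactly; none of these choices come from the notes.

```python
from fractions import Fraction

# Hypothetical discrete law P on {0, 1, 2} and a two-function class, for illustration.
P = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
F = [lambda x: x, lambda x: x * x]

def mean(f):
    """P f for the discrete law P."""
    return sum(p * f(x) for x, p in P.items())

def sigma_centered(i, j):
    """E[(f_i(X) - P f_i)(f_j(X) - P f_j)], computed from the definition."""
    mi, mj = mean(F[i]), mean(F[j])
    return sum(p * (F[i](x) - mi) * (F[j](x) - mj) for x, p in P.items())

def sigma_formula(i, j):
    """P(f_i f_j) - P f_i P f_j, the expanded form."""
    return mean(lambda x: F[i](x) * F[j](x)) - mean(F[i]) * mean(F[j])

Sigma = [[sigma_formula(i, j) for j in range(2)] for i in range(2)]
print(Sigma)
print(all(sigma_centered(i, j) == sigma_formula(i, j)
          for i in range(2) for j in range(2)))
```

The matrix `Sigma` is the covariance of the limiting normal vector in the example for this particular law and class.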
For infinite classes, the example leaves open the question of which functions should be considered close as empirical-process indices. The covariance of the finite-dimensional limits provides the right semimetric, so we isolate it before stating the first functional central limit theorem.
[definition: Covariance Semimetric]
Let $F\subset L^2(P)$. The covariance semimetric on $F$ is the map $d_P:F\times F\to [0,\infty)$ defined by
\begin{align*}
d_P(f,g)^2:=P(f-g)^2-(P f-P g)^2.
\end{align*}
[/definition]
The quantity $d_P(f,g)$ is the standard deviation of $(f-g)(X)$ for $X\sim P$, since the right-hand side of the definition is $\operatorname{Var}((f-g)(X))$. If $d_P(f,g)=0$, then $f-g$ is $P$-a.s. constant, so the empirical process cannot distinguish $f$ from $g$. This motivates the following theorem for the threshold class on the real line, which is the first non-finite Donsker class of the course.
[quotetheorem:6303]
[citeproof:6303]
The theorem is the prototype for all Donsker results in this course. The index class is the collection of lower half-lines $\{(-\infty,t]:t\in\mathbb R\}$, and the limit is a Gaussian process whose covariance is inherited from the law $P$. The monotone structure of half-lines is essential: it gives tight control of increments through interval counts. If an index class is replaced by an unrestricted collection of measurable sets, finite-dimensional Gaussian limits may still exist while the supremum over the class oscillates too violently for convergence in $\ell^\infty(F)$; asymptotic equicontinuity is precisely the condition that rules out this failure.
The assumptions also mark the boundary of the statement. Independence cannot be dropped without changing the limit: if $X_i=X_1$ for all $i$, then $F_n=F_1$ and $\sqrt n(F_n-F_P)$ typically diverges rather than converging to a Brownian bridge. Identical distribution is also needed for the conclusion: for a triangular array whose laws change with $i$, the covariance of the centered indicators need not be $F_P(s\wedge t)-F_P(s)F_P(t)$ for any fixed distribution function $F_P$. The theorem does not say that every set-indexed empirical process is Donsker, nor that the sample paths $t\mapsto \sqrt n(F_n(t)-F_P(t))$ converge pointwise in an almost sure sense. It says that the whole centered empirical distribution function has a weak limit in the sup norm, a stronger statement than fixed-$t$ central limit theorems and a narrower statement than a universal Donsker theorem for arbitrary classes.
[example: Empirical CDF Limit]
For $X_i\sim \operatorname{Unif}(0,1)$ and $0\le t\le 1$, set
\begin{align*}F_n(t)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{[0,t]}(X_i).\end{align*}
Since $P(X_i\in[0,t])=t$, we have
\begin{align*}F_n(t)-t=\frac{1}{n}\sum_{i=1}^n(\mathbb{1}_{[0,t]}(X_i)-t),\end{align*}
and therefore
\begin{align*}\sqrt n(F_n(t)-t)=\frac{1}{\sqrt n}\sum_{i=1}^n(\mathbb{1}_{[0,t]}(X_i)-t).\end{align*}
The *Classical Donsker Theorem for Distribution Functions* applied to the uniform law gives
\begin{align*}\bigl(\sqrt n(F_n(t)-t)\bigr)_{0\le t\le 1}\xrightarrow{d}(B(t))_{0\le t\le 1}\end{align*}
in $\ell^\infty([0,1])$.
At a fixed threshold $t$, the variable $\mathbb{1}_{[0,t]}(X_i)$ is Bernoulli with mean
\begin{align*}E\mathbb{1}_{[0,t]}(X_i)=P(X_i\in[0,t])=t,\end{align*}
and, since $\mathbb{1}_{[0,t]}(X_i)^2=\mathbb{1}_{[0,t]}(X_i)$,
\begin{align*}\operatorname{Var}(\mathbb{1}_{[0,t]}(X_i))=E\mathbb{1}_{[0,t]}(X_i)^2-\bigl(E\mathbb{1}_{[0,t]}(X_i)\bigr)^2=t-t^2=t(1-t).\end{align*}
For two thresholds $s,t\in[0,1]$, the product of indicators is the indicator of the intersection:
\begin{align*}\mathbb{1}_{[0,s]}(X_i)\mathbb{1}_{[0,t]}(X_i)=\mathbb{1}_{[0,s]\cap[0,t]}(X_i)=\mathbb{1}_{[0,s\wedge t]}(X_i).\end{align*}
Thus
\begin{align*}E[\mathbb{1}_{[0,s]}(X_i)\mathbb{1}_{[0,t]}(X_i)]=s\wedge t,\end{align*}
and expanding the centered covariance gives
\begin{align*}\operatorname{Cov}(\mathbb{1}_{[0,s]}(X_i),\mathbb{1}_{[0,t]}(X_i))=(s\wedge t)-st.\end{align*}
The functional limit keeps this covariance across all thresholds at once, so the Brownian bridge records not only the fixed-$t$ Bernoulli fluctuations but also the nested dependence of the intervals $[0,s]$ and $[0,t]$.
[/example]
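A small Monte Carlo experiment illustrates the covariance $(s\wedge t)-st$ at a pair of thresholds. The sketch below is a simulation under assumed settings ($n=100$, $2000$ replications, $s=0.3$, $t=0.7$, fixed seed) and uses only the standard library; its output is an estimate subject to sampling error, not a proof.

```python
import random
from math import sqrt

def simulate_pair(n, s, t, rng):
    """One draw of (sqrt(n)(F_n(s) - s), sqrt(n)(F_n(t) - t)) for Unif(0,1) data."""
    xs = [rng.random() for _ in range(n)]
    Fs = sum(x <= s for x in xs) / n
    Ft = sum(x <= t for x in xs) / n
    return sqrt(n) * (Fs - s), sqrt(n) * (Ft - t)

def empirical_cov(n=100, s=0.3, t=0.7, reps=2000, seed=0):
    """Sample covariance of the two scaled deviations over independent replications."""
    rng = random.Random(seed)
    draws = [simulate_pair(n, s, t, rng) for _ in range(reps)]
    ms = sum(a for a, _ in draws) / reps
    mt = sum(b for _, b in draws) / reps
    return sum((a - ms) * (b - mt) for a, b in draws) / reps

target = min(0.3, 0.7) - 0.3 * 0.7   # (s ^ t) - s t
print(empirical_cov(), target)
```

The covariance of the scaled deviations equals $(s\wedge t)-st$ for every $n$, so the simulation checks the covariance structure rather than the Gaussian limit itself.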
## Brownian Bridges Indexed by Sets and Functions
Once the covariance of the empirical process has been identified, the next problem is to describe the Gaussian process that could appear as its limit. For a general function class $F$, the Brownian bridge is not a single process indexed by time in $[0,1]$ but a centered Gaussian process indexed by $F$ itself, whose law depends on each $f$ only through its class modulo $P$-a.s. constants.
[definition: P-Brownian Bridge Indexed by Functions]
Let $F\subset L^2(P)$. A $P$-Brownian bridge indexed by $F$ is a random map $G_P:F\to \mathbb R$, written $f\mapsto G_P f$, such that $(G_P f)_{f\in F}$ is a centered Gaussian process satisfying
\begin{align*}
\operatorname{Cov}(G_P f,G_P g)=P(fg)-P f\,P g
\end{align*}
for all $f,g\in F$.
[/definition]
The word bridge reflects the centering by $P f$. For indicators of initial intervals under the uniform distribution, the process is the usual Brownian bridge, pinned to be $0$ at time $0$ and time $1$.
There is a useful Hilbert-space way to read this definition. Let
\begin{align*}
L^2_0(P):=\{h\in L^2(P):P h=0\}.
\end{align*}
If $W$ is an isonormal Gaussian process on $L^2_0(P)$, meaning a centered Gaussian process with $\operatorname{Cov}(W(h),W(k))=P(hk)$, then a $P$-Brownian bridge is obtained by setting
\begin{align*}
G_P f=W(f-P f).
\end{align*}
This representation has the required covariance, since $\operatorname{Cov}(W(f-P f),W(g-P g))=P\bigl((f-P f)(g-P g)\bigr)=P(fg)-P f\,P g$. Constants disappear under the map $f\mapsto f-P f$, so the bridge really lives on $L^2(P)$ modulo constant functions. Under this identification, $d_P(f,g)=\| (f-P f)-(g-P g)\|_{L^2(P)}$, so the covariance semimetric is the Hilbert norm of the centered difference. The next case to isolate is the one most common in empirical distribution theory: the index $f$ is an indicator $\mathbb{1}_C$. Writing the same bridge directly in terms of sets keeps the notation aligned with events, distribution functions, and VC classes, while preserving the same covariance structure.
[definition: P-Brownian Bridge Indexed by Sets]
Let $\mathcal C\subset \mathcal S$ be a class of measurable sets. A $P$-Brownian bridge indexed by $\mathcal C$ is a random map $G_P:\mathcal C\to \mathbb R$, written $C\mapsto G_P(C)$, such that $(G_P(C))_{C\in\mathcal C}$ is a centered Gaussian process with covariance
\begin{align*}
\operatorname{Cov}(G_P(C),G_P(D))=P(C\cap D)-P(C)P(D).
\end{align*}
[/definition]
This is the set-indexed version of the function-indexed definition with $f=\mathbb{1}_C$. Many statistical processes are easier to read in set notation, while entropy bounds often treat indicators as a function class.
[example: Brownian Bridge on Lower Half-Lines]
Let $\mathcal C=\{[0,t]:0\le t\le 1\}$ and let $P$ be Lebesgue measure on $[0,1]$. For $s,t\in[0,1]$, the set-indexed bridge covariance is
\begin{align*}
\operatorname{Cov}(G_P([0,s]),G_P([0,t]))=P([0,s]\cap[0,t])-P([0,s])P([0,t]).
\end{align*}
Since $[0,s]\cap[0,t]=[0,s\wedge t]$, Lebesgue measure gives
\begin{align*}
P([0,s]\cap[0,t])=P([0,s\wedge t])=s\wedge t.
\end{align*}
Also,
\begin{align*}
P([0,s])P([0,t])=st.
\end{align*}
Substituting these two terms into the covariance formula,
\begin{align*}
\operatorname{Cov}(G_P([0,s]),G_P([0,t]))=(s\wedge t)-st.
\end{align*}
Thus the process $t\mapsto G_P([0,t])$ is centered Gaussian with covariance $(s,t)\mapsto s\wedge t-st$, which is exactly the covariance of the standard Brownian bridge $B(t)$. Therefore $t\mapsto G_P([0,t])$ has the same finite-dimensional distributions as $B$, so the set-indexed bridge recovers the Brownian bridge appearing in the empirical distribution function theorem.
[/example]
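One way to see the pinning at both endpoints is through the standard representation $B(t)=W(t)-tW(1)$ of the Brownian bridge in terms of a Brownian motion $W$ with covariance $s\wedge t$. This representation is not used in the example above and is brought in here as an assumption for illustration. The sketch expands the covariance by bilinearity and compares it with $s\wedge t-st$ on a grid, using exact rational arithmetic.

```python
from fractions import Fraction

grid = [Fraction(k, 4) for k in range(5)]   # t = 0, 1/4, 1/2, 3/4, 1

def bm_cov(s, t):
    """Covariance min(s, t) of standard Brownian motion W."""
    return min(s, t)

def bridge_cov_from_bm(s, t):
    """Cov(W(s) - s W(1), W(t) - t W(1)), expanded by bilinearity."""
    one = Fraction(1)
    return (bm_cov(s, t) - t * bm_cov(s, one)
            - s * bm_cov(one, t) + s * t * bm_cov(one, one))

def bridge_cov(s, t):
    """Set-indexed bridge covariance P([0,s] intersect [0,t]) - P([0,s]) P([0,t])."""
    return min(s, t) - s * t

print(all(bridge_cov_from_bm(s, t) == bridge_cov(s, t) for s in grid for t in grid))
print([bridge_cov(t, t) for t in grid])   # variances t(1 - t), zero at both endpoints
```

The variances vanish at $t=0$ and $t=1$, which is the covariance-level expression of the bridge being pinned at both ends.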
A covariance formula determines finite-dimensional distributions, but weak convergence in $\ell^\infty(F)$ also requires the limit to live in that space. The first path property needed for this is boundedness, because the sup norm topology cannot see an unbounded sample path as an element of $\ell^\infty(F)$.
[definition: Sample Bounded Gaussian Process]
Let $(T,d)$ be a semimetric space and let $X:T\to \mathbb R$ be a stochastic process, written $t\mapsto X_t$. The process is sample bounded on $T$ if its sample paths belong to $\ell^\infty(T)$ with probability $1$, meaning
\begin{align*}
\sup_{t\in T}|X_t|<\infty
\end{align*}
with probability $1$.
[/definition]
For a Brownian bridge indexed by $F$, sample boundedness allows the limiting process to be interpreted as an element of $\ell^\infty(F)$. Boundedness alone does not control small-scale oscillations, so the next condition asks the sample paths to respect the covariance geometry.
[definition: Uniformly Continuous Sample Paths]
Let $(T,d)$ be a semimetric space and let $X:T\to \mathbb R$ be a stochastic process, written $t\mapsto X_t$. The process has uniformly continuous sample paths with respect to $d$ if, with probability $1$, for every $\varepsilon>0$ there exists $\delta>0$ such that
\begin{align*}
d(s,t)<\delta \implies |X_s-X_t|<\varepsilon
\end{align*}
for all $s,t\in T$.
[/definition]
For empirical process limits, the semimetric is usually $d_P$. [Uniform continuity](/page/Uniform%20Continuity) says that the limit process respects the same local geometry as the centered observations. To turn these path properties into weak convergence, the course uses the following tightness principle from probability in metric spaces.
[quotetheorem:9832]
This background result belongs to weak convergence theory in metric spaces and empirical process measurability. Its role is structural: finite-dimensional convergence identifies the only possible limit, while tightness prevents mass from escaping through increasingly oscillatory directions of $F$. Without tightness, the coordinate projections can converge while the random functions do not converge in sup norm; for instance, take $F=\mathbb N$ and let $Z_n=e_n\in \ell^\infty(\mathbb N)$ be the deterministic sequence with value $1$ at coordinate $n$ and $0$ elsewhere. Every fixed finite coordinate vector converges to $0$, but $\|Z_n\|_\infty=1$ for all $n$, so no convergence to $0$ occurs in sup norm.
Each hypothesis has a separate job. The compact set in the tightness definition cannot be replaced by a merely [bounded set](/page/Bounded%20Set): in infinite-dimensional $\ell^\infty(F)$, the unit ball is generally not compact, so boundedness of $\|Z_n\|_\infty$ gives no subsequential weak limit by itself. The finite-dimensional convergence assumption cannot be omitted either, since a tight constant sequence $Z_n\equiv Z$ converges only to the law of $Z$, not to an arbitrary candidate limit. The condition that the target $Z$ be a tight Borel random element is also substantive; a formal Gaussian family with unbounded sample paths does not define a probability law on $\ell^\infty(F)$. The criterion does not prove tightness, does not identify compact sets in a usable way, and does not remove measurability issues for nonseparable classes. Its purpose is to split the problem: later arguments must prove tightness or asymptotic equicontinuity, after which finite-dimensional central limit theorems identify the Gaussian limit.
## Asymptotic Equicontinuity and Functional Convergence
The remaining task is to make tightness checkable. In empirical process theory, tightness is usually proved by showing that $G_n f$ and $G_n g$ are close uniformly over pairs with small $d_P(f,g)$; this property is called asymptotic equicontinuity.
[definition: Asymptotic Equicontinuity]
Let $(F,d)$ be a semimetric space and let $Z_n:F\to \mathbb R$ be stochastic processes, viewed as random elements of $\ell^\infty(F)$ when their sample paths are bounded. The sequence $(Z_n)$ is asymptotically equicontinuous with respect to $d$ if for every $\varepsilon>0$,
\begin{align*}
\lim_{\delta\downarrow 0}\limsup_{n\to\infty}\mathbb P^*\left(\sup_{d(f,g)<\delta}|Z_n f-Z_n g|>\varepsilon\right)=0.
\end{align*}
[/definition]
The outer probability $\mathbb P^*$ is included because the displayed supremum may not be measurable for arbitrary classes. In separable settings it can be replaced by ordinary probability. The main criterion now says that finite-dimensional convergence plus this local control is enough for functional convergence.
[quotetheorem:9833]
[citeproof:9833]
This theorem is the main conceptual bridge of the chapter. Donsker proofs usually have two parts: compute the finite-dimensional Gaussian limit, then prove asymptotic equicontinuity through entropy, bracketing, VC theory, or chaining. The proof intuition is finite approximation: replace $F$ by a finite $d_P$-net, use the multivariate central limit theorem on that net, and then show that the replacement error is uniformly small for both the empirical process and the Gaussian limit.
Each hypothesis rules out a distinct failure. Total boundedness cannot be omitted, because finite nets are the mechanism by which finite-dimensional convergence is upgraded to a sup norm statement. A concrete failure is obtained by taking $F=\mathbb N$ with the [discrete metric](/page/Discrete%20Metric) $d(i,j)=\mathbb{1}_{\{i\ne j\}}$ and defining the deterministic process $Z_n=e_n\in \ell^\infty(\mathbb N)$, where $e_n(n)=1$ and $e_n(j)=0$ for $j\ne n$. For every fixed finite set of coordinates, $Z_n$ eventually vanishes on that set, so the finite-dimensional distributions converge to $0$; however $\|Z_n\|_\infty=1$ for all $n$, so there is no convergence to $0$ in $\ell^\infty(\mathbb N)$. Asymptotic equicontinuity is a separate hypothesis: on $F=\{0\}\cup\{1/k:k\ge 1\}$ with the usual metric, which is totally bounded, the deterministic process $Z_n=\mathbb{1}_{\{1/n\}}$, with an order-one spike at the moving index $1/n$, has finite-dimensional limits equal to $0$ while its local oscillation near $0$ does not vanish. The bounded uniformly continuous limit-path assumption is also necessary for convergence in $\ell^\infty(F)$, since a Gaussian family with unbounded paths or discontinuities in the $d_P$-geometry is not the limit of uniformly approximable bounded paths under this criterion. The theorem does not supply entropy bounds, does not verify measurability for arbitrary nonseparable $F$, and does not say that finite-dimensional convergence alone is enough. It points forward to the practical work of proving the equicontinuity condition by bracketing, VC estimates, or chaining.
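The spike counterexample on $\{0\}\cup\{1/k:k\ge 1\}$ can be checked on a finite truncation of the index set. In the sketch below the process is taken to be deterministic, with value $1$ at the index $1/n$ and $0$ elsewhere; the truncation level `K` and the function names are illustrative assumptions.

```python
from fractions import Fraction

def index_set(K):
    """Finite truncation {0} U {1/k : 1 <= k <= K} of the index set."""
    return [Fraction(0)] + [Fraction(1, k) for k in range(1, K + 1)]

def spike(n):
    """Deterministic process Z_n with value 1 at the index 1/n and 0 elsewhere."""
    return lambda f: 1 if f == Fraction(1, n) else 0

def oscillation(n, delta, K):
    """sup of |Z_n f - Z_n g| over pairs with |f - g| < delta, on the truncation."""
    Z, T = spike(n), index_set(K)
    return max(abs(Z(f) - Z(g)) for f in T for g in T if abs(f - g) < delta)

delta = Fraction(1, 10)
# On a fixed finite set of indices the process vanishes once n is large.
print([spike(40)(f) for f in index_set(3)])
# Near 0 the oscillation stays of order one once 1/n < delta.
print(oscillation(40, delta, K=50))
```

For any fixed $\delta$, the oscillation equals $1$ for every $n$ with $1/n<\delta$, so the double limit in the definition of asymptotic equicontinuity is $1$ rather than $0$ when $\varepsilon<1$.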
[example: Two-Sample Empirical Process]
Let $F$ be a $P$-Donsker class, and let $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$ be independent i.i.d. samples from the same law $P$, with empirical measures $P_m$ and $Q_n$. For $f\in F$, define
\begin{align*}Z_{m,n}f=\sqrt{\frac{mn}{m+n}}(P_m f-Q_n f).\end{align*}
Write the two empirical processes as
\begin{align*}G_m^X f=\sqrt m(P_m-P)f.\end{align*}
\begin{align*}G_n^Y f=\sqrt n(Q_n-P)f.\end{align*}
Since $P_m f-Q_n f=(P_m-P)f-(Q_n-P)f$, substitution gives
\begin{align*}Z_{m,n}f=\sqrt{\frac{mn}{m+n}}\left(\frac{G_m^X f}{\sqrt m}-\frac{G_n^Y f}{\sqrt n}\right).\end{align*}
Multiplying the two coefficients separately,
\begin{align*}\sqrt{\frac{mn}{m+n}}\frac{1}{\sqrt m}=\sqrt{\frac{n}{m+n}}.\end{align*}
\begin{align*}\sqrt{\frac{mn}{m+n}}\frac{1}{\sqrt n}=\sqrt{\frac{m}{m+n}}.\end{align*}
Hence
\begin{align*}Z_{m,n}f=\sqrt{\frac{n}{m+n}}\,G_m^X f-\sqrt{\frac{m}{m+n}}\,G_n^Y f.\end{align*}
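Assume $m,n\to\infty$ with $m/(m+n)\to\lambda\in(0,1)$. Then $n/(m+n)=1-m/(m+n)\to 1-\lambda$. Since $F$ is $P$-Donsker and the two samples are independent, the pair $(G_m^X,G_n^Y)$ converges jointly in $\ell^\infty(F)\times\ell^\infty(F)$ to independent copies $G_P^X$ and $G_P^Y$ of the $P$-Brownian bridge. The coefficients converge to $\sqrt{1-\lambda}$ and $\sqrt{\lambda}$, so Slutsky's lemma together with the continuous mapping principle applied to the [linear map](/page/Linear%20Map) $(h,k)\mapsto \sqrt{1-\lambda}\,h-\sqrt{\lambda}\,k$ gives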
\begin{align*}Z_{m,n}\xrightarrow{d}Z:=\sqrt{1-\lambda}\,G_P^X-\sqrt{\lambda}\,G_P^Y.\end{align*}
The limit $Z$ is centered Gaussian because it is a linear combination of independent centered Gaussian processes.
For $f,g\in F$, independence of $G_P^X$ and $G_P^Y$ gives zero cross-covariances:
\begin{align*}\operatorname{Cov}(G_P^X f,G_P^Y g)=0.\end{align*}
Therefore
\begin{align*}\operatorname{Cov}(Zf,Zg)=(1-\lambda)\operatorname{Cov}(G_P^X f,G_P^X g)+\lambda\operatorname{Cov}(G_P^Y f,G_P^Y g).\end{align*}
Both bridge copies have covariance $P(fg)-P f\,P g$, so
\begin{align*}\operatorname{Cov}(Zf,Zg)=(1-\lambda)\{P(fg)-P f\,P g\}+\lambda\{P(fg)-P f\,P g\}.\end{align*}
Factoring the common covariance term,
\begin{align*}\operatorname{Cov}(Zf,Zg)=\bigl((1-\lambda)+\lambda\bigr)\{P(fg)-P f\,P g\}=P(fg)-P f\,P g.\end{align*}
Thus $Z$ has the same finite-dimensional distributions as a $P$-Brownian bridge indexed by $F$, so the two-sample process converges to the same bridge as in the one-sample Donsker theorem. The factor $\sqrt{mn/(m+n)}$ is exactly the normalization that makes the two independent covariance contributions add to one copy of the bridge covariance.
[/example]
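The coefficient algebra in the example is easy to confirm numerically. The sketch below computes the two weights for hypothetical sample sizes $m=300$ and $n=700$ and checks that their squares sum to one, which is the normalization behind the single-bridge covariance; the function name is illustrative.

```python
from math import isclose, sqrt

def two_sample_weights(m, n):
    """Coefficients (a, b) with Z_{m,n} f = a * G_m^X f - b * G_n^Y f."""
    scale = sqrt(m * n / (m + n))
    return scale / sqrt(m), scale / sqrt(n)

m, n = 300, 700
a, b = two_sample_weights(m, n)

print(a, sqrt(n / (m + n)))   # a should match sqrt(n / (m + n))
print(b, sqrt(m / (m + n)))   # b should match sqrt(m / (m + n))
print(a * a + b * b)          # the squared weights should sum to 1 up to rounding
```

With $\lambda=m/(m+n)=0.3$, the squared weights are $1-\lambda$ and $\lambda$, matching the factors that multiply the two bridge covariances in the limit.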
The two-sample example shows why functional convergence is useful in statistics: once a process converges in $\ell^\infty(F)$, many statistics can be treated as maps of that process. This motivates the following mapping theorem, which converts Donsker limits into distributional limits for real-valued statistics.
[quotetheorem:9834]
[citeproof:9834]
This principle turns Donsker theorems into limit theorems for goodness-of-fit statistics and many estimators. It is the bridge from process convergence to the concrete statistics computed from data: supremum statistics use the map $h\mapsto \|h\|_\infty$, integrated-square statistics use $h\mapsto \int h^2$, and plug-in estimators are often handled by applying continuous or differentiable maps to $G_n$. The continuity hypothesis is essential because weak convergence in $\ell^\infty(F)$ only controls statistics that do not jump under small sup norm perturbations at the limiting path. For example, a functional that records whether the maximizer of a process is unique can be discontinuous at paths with tied maxima, so the mapping theorem gives no conclusion there without an additional argument showing that the limiting bridge avoids the discontinuity set.
The other hypotheses are not technical decoration. If $\Phi$ is not measurable, then $\Phi(G_n)$ may not be an ordinary real-valued random variable, so convergence in distribution is not even a well-defined target without an outer-probability convention. A concrete way this can occur is to take a nonmeasurable set $A\subset \mathbb R$ and define $\Phi(x)=\mathbb{1}_A(x(0))$ on $\ell^\infty(\{0\})$; even for a measurable real coordinate $G_n(0)$, the composition may fail to be measurable. If $\Phi$ is continuous only at some paths but the limit lands in its discontinuity set with positive probability, the conclusion can fail: take real variables $X_n=n^{-1}Z$ with $Z\sim \mathcal N(0,1)$, so $X_n\xrightarrow{d}0$, and take $\Phi(x)=\mathbb{1}_{\{x>0\}}$; then $\Phi(X_n)$ has Bernoulli distribution with parameter $1/2$, while $\Phi(0)=0$. The theorem also does not say that every statistic of an empirical process is continuous, nor that discontinuous statistics have no limit. It says that once a statistic is a measurable functional and the Brownian bridge avoids its discontinuities, the empirical-process weak convergence transfers automatically. The next examples give the classical distribution-free limits for the empirical distribution function.
[example: Kolmogorov-Smirnov Limit]
For i.i.d. $\operatorname{Unif}(0,1)$ observations, define
\begin{align*}
D_n=\sup_{0\le t\le 1}|F_n(t)-t|.
\end{align*}
Multiplying the statistic by $\sqrt n$ and using that $\sqrt n\ge 0$ gives
\begin{align*}
\sqrt n D_n=\sqrt n\sup_{0\le t\le 1}|F_n(t)-t|=\sup_{0\le t\le 1}\sqrt n|F_n(t)-t|=\sup_{0\le t\le 1}|\sqrt n(F_n(t)-t)|.
\end{align*}
Thus $\sqrt nD_n=\Phi(H_n)$, where
\begin{align*}
H_n(t):=\sqrt n(F_n(t)-t)
\end{align*}
and $\Phi:\ell^\infty([0,1])\to\mathbb R$ is the supremum functional
\begin{align*}
\Phi(h):=\sup_{0\le t\le 1}|h(t)|.
\end{align*}
The functional $\Phi$ is continuous under the sup norm. Indeed, for $h,k\in\ell^\infty([0,1])$, the [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
\bigl||h(t)|-|k(t)|\bigr|\le |h(t)-k(t)|
\end{align*}
for every $t\in[0,1]$, and taking suprema gives
\begin{align*}
|\Phi(h)-\Phi(k)|\le \sup_{0\le t\le 1}|h(t)-k(t)|=\|h-k\|_\infty.
\end{align*}
By the *Classical Donsker Theorem for Distribution Functions* for the uniform law,
\begin{align*}
H_n\xrightarrow{d}B
\end{align*}
in $\ell^\infty([0,1])$. Applying the *[Continuous Mapping Principle for Empirical Process Statistics](/theorems/9834)* to $\Phi$ yields
\begin{align*}
\sqrt nD_n=\Phi(H_n)\xrightarrow{d}\Phi(B)=\sup_{0\le t\le 1}|B(t)|.
\end{align*}
This example shows why the empirical CDF theorem is a functional limit theorem: the Kolmogorov-Smirnov statistic uses the whole path $t\mapsto \sqrt n(F_n(t)-t)$, not just its value at any fixed threshold.
[/example]
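The Kolmogorov-Smirnov limit can be checked numerically. The following is a minimal sketch rather than a definitive implementation: the names `ks_statistic`, `kolmogorov_cdf`, and `simulate_scaled_ks` are hypothetical, the sample size and number of replications are arbitrary, and the distribution function of $\sup_t|B(t)|$ is taken from the classical Kolmogorov series, which is not derived in these notes.

```python
import math
import random


def ks_statistic(sample):
    """Return D_n = sup_t |F_n(t) - t| for a sample from [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps only at the order statistics, so the supremum is attained
    # at, or just before, one of them.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def kolmogorov_cdf(x, terms=100):
    """Classical series for P(sup_t |B(t)| <= x); assumed, not derived here."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s


def simulate_scaled_ks(n, reps, seed=0):
    """Monte Carlo draws of sqrt(n) * D_n under Unif(0, 1)."""
    rng = random.Random(seed)
    return [math.sqrt(n) * ks_statistic([rng.random() for _ in range(n)])
            for _ in range(reps)]


draws = simulate_scaled_ks(n=200, reps=500)
empirical = sum(d <= 1.0 for d in draws) / len(draws)
print(f"P(sqrt(n) D_n <= 1.0): simulated {empirical:.3f}, "
      f"limit {kolmogorov_cdf(1.0):.3f}")
```

The simulated frequency should be close to the limiting value, up to Monte Carlo error and a small finite-sample bias.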
The supremum functional measures the largest pointwise deviation. A second standard statistic averages the squared deviation, so it is sensitive to accumulated error rather than only the largest error.
[example: Cramer-von Mises Limit]
For i.i.d. $\operatorname{Unif}(0,1)$ observations, define
\begin{align*}W_n=n\int_0^1(F_n(t)-t)^2\,dt.\end{align*}
Let
\begin{align*}H_n(t):=\sqrt n(F_n(t)-t).\end{align*}
Then, for each $t\in[0,1]$,
\begin{align*}H_n(t)^2=\bigl(\sqrt n(F_n(t)-t)\bigr)^2=n(F_n(t)-t)^2.\end{align*}
Substituting this identity into the integral gives
\begin{align*}W_n=\int_0^1 H_n(t)^2\,dt.\end{align*}
Define $\Phi:\ell^\infty([0,1])\to\mathbb R$ by
\begin{align*}\Phi(h):=\int_0^1 h(t)^2\,dt.\end{align*}
On bounded Borel-measurable functions, $\Phi$ is continuous under the sup norm. Indeed, if $h,k$ are bounded and Borel-measurable, then
\begin{align*}h(t)^2-k(t)^2=(h(t)-k(t))(h(t)+k(t)).\end{align*}
Taking absolute values and using $|h(t)-k(t)|\le \|h-k\|_\infty$ gives
\begin{align*}|h(t)^2-k(t)^2|\le \|h-k\|_\infty\bigl(|h(t)|+|k(t)|\bigr).\end{align*}
Since $|h(t)|\le \|h\|_\infty$ and $|k(t)|\le \|k\|_\infty$,
\begin{align*}|h(t)^2-k(t)^2|\le \|h-k\|_\infty(\|h\|_\infty+\|k\|_\infty).\end{align*}
Integrating over $[0,1]$ yields
\begin{align*}|\Phi(h)-\Phi(k)|\le \|h-k\|_\infty(\|h\|_\infty+\|k\|_\infty).\end{align*}
Thus if $k\to h$ in sup norm with $h$ fixed, then $\|k\|_\infty\le \|h\|_\infty+\|h-k\|_\infty$ stays bounded, so the right-hand side tends to $0$.
By the *Classical Donsker Theorem for Distribution Functions* for the uniform law,
\begin{align*}H_n\xrightarrow{d}B\end{align*}
in $\ell^\infty([0,1])$. The Brownian bridge $B$ has bounded Borel-measurable sample paths, so the continuity just proved applies at $B$ with probability $1$. Therefore the *Continuous Mapping Principle for Empirical Process Statistics* gives
\begin{align*}W_n=\Phi(H_n)\xrightarrow{d}\Phi(B)=\int_0^1 B(t)^2\,dt.\end{align*}
This limit captures the accumulated squared Brownian bridge fluctuation over $[0,1]$, rather than only the largest pointwise fluctuation.
[/example]
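The statistic $W_n$ can be computed exactly from the order statistics, which makes the limit easy to explore by simulation. The sketch below is illustrative only: the names `cvm_statistic` and `cvm_by_quadrature` are hypothetical, the order-statistic formula is the standard computational form of the Cramer-von Mises statistic (assumed here, and checked against a direct quadrature of the defining integral), and the sample sizes are arbitrary.

```python
import random


def cvm_statistic(sample):
    """W_n = n * int_0^1 (F_n(t) - t)^2 dt via the order-statistic formula."""
    xs = sorted(sample)
    n = len(xs)
    return 1.0 / (12 * n) + sum(
        (x - (2 * i + 1) / (2 * n)) ** 2 for i, x in enumerate(xs))


def cvm_by_quadrature(sample, grid=20000):
    """Midpoint-rule check of the same integral, straight from the definition."""
    xs = sorted(sample)
    n = len(xs)
    total, j = 0.0, 0
    for m in range(grid):
        t = (m + 0.5) / grid
        while j < n and xs[j] <= t:
            j += 1                      # j = number of observations <= t
        total += (j / n - t) ** 2
    return n * total / grid


rng = random.Random(1)
one_sample = [rng.random() for _ in range(50)]
print(cvm_statistic(one_sample), cvm_by_quadrature(one_sample))

draws = [cvm_statistic([rng.random() for _ in range(100)]) for _ in range(2000)]
mean_w = sum(draws) / len(draws)
print(f"mean of W_n: {mean_w:.4f} (compare with 1/6)")
```

The comparison value $1/6$ comes from $\mathbb E\int_0^1 B(t)^2\,dt=\int_0^1 t(1-t)\,dt$, using the Brownian bridge variance $t(1-t)$.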
The chapter can be summarized as a three-step pattern. First, finite-dimensional central limit theorems determine the Gaussian covariance. Second, the Brownian bridge must be sample bounded and uniformly continuous in the covariance semimetric. Third, asymptotic equicontinuity converts finite-dimensional convergence into weak convergence in $\ell^\infty(\mathcal F)$, allowing continuous statistical functionals to inherit Brownian bridge limits.
These ideas also connect the course to broader functional analysis and probability. Tightness in $\ell^\infty(\mathcal F)$ is a compactness question in an infinite-dimensional [Banach space](/page/Banach%20Space), while asymptotic equicontinuity is the probabilistic analogue of the Arzelà-Ascoli principle: compactness comes from boundedness plus uniform control of oscillations. The Brownian bridge itself is a Gaussian random element whose covariance operator records the geometry of $L^2(P)$ after constants are quotiented out. Thus Donsker theory sits between central limit theory, compactness in function spaces, and the continuity of statistical functionals.
# 6. Entropy, Bracketing, and Uniform Central Limit Theorems
Entropy methods answer the question left open by symmetrisation, finite-net reductions, and VC counting: how much geometric size can a class of functions have before uniform probabilistic limits fail? This chapter assumes the earlier material on empirical measures $P_n$, empirical processes $G_n f=\sqrt n(P_n-P)f$, symmetrisation by Rademacher variables, finite-dimensional central limit theorems, and the definitions of $P$-Glivenko-Cantelli and $P$-Donsker classes. Earlier chapters reduced Glivenko-Cantelli and Donsker questions to controlling suprema of empirical processes over small metric pieces. This chapter turns those reductions into usable criteria by measuring the size of a class through covering numbers, bracketing numbers, and entropy integrals.
The guiding theme is that finite-dimensional central limit behaviour is not enough. A class may be pointwise well behaved while containing so many nearly distinct functions that the supremum of the empirical process has no tight limit. Entropy conditions prevent this by requiring increasingly fine approximations to remain economical as the scale tends to zero. The main background tools are $L^p(P)$ approximation, envelopes, finite bracketing arguments, and the idea that random empirical metrics must be controlled uniformly over the discrete laws generated by samples.
## Covering Numbers and Entropy Integrals
The first problem is to quantify the phrase “not too many functions” in a way compatible with the metric used by the empirical process. For a fixed probability measure $P$, the natural size of $f-g$ is often the $L^2(P)$ distance, because variances of empirical averages are controlled by $P(f-g)^2$.
[definition: Covering Number]
Let $(T,d)$ be a metric space. The covering number is the map
\begin{align*}
N(\cdot,T,d):(0,\infty)\to\mathbb N\cup\{\infty\}
\end{align*}
defined as follows. For $\varepsilon>0$, $N(\varepsilon,T,d)$ is the least integer $N$ such that there exist $t_1,\dots,t_N\in T$ with
\begin{align*}
T \subset \bigcup_{j=1}^N B(t_j,\varepsilon).
\end{align*}
If no finite such $N$ exists, set $N(\varepsilon,T,d)=\infty$.
[/definition]
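For a finite point set, an upper bound on the covering number can be produced by a greedy pass. The sketch below is an illustration under stated assumptions: the helper name `greedy_cover` is hypothetical, balls are taken to be open, and the greedy count is only an upper bound on $N(\varepsilon,T,d)$, not necessarily the minimum.

```python
def greedy_cover(points, dist, eps):
    """Return centres whose open eps-balls cover `points`.

    The number of centres is an upper bound for N(eps, points, dist);
    it need not equal the minimum.
    """
    centres = []
    for p in points:
        # Keep p as a new centre only if no existing ball already contains it.
        if all(dist(p, c) >= eps for c in centres):
            centres.append(p)
    return centres


points = list(range(11))                      # T = {0, 1, ..., 10}
centres = greedy_cover(points, lambda a, b: abs(a - b), eps=2.5)
print(centres)                                # [0, 3, 6, 9]
```

Here three centres (for instance $2$, $7$, and $10$) would already suffice, so the greedy answer of four shows that the procedure bounds the covering number from above without computing it.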
A covering number records approximation in an abstract metric space. For empirical processes we need the same idea with a distribution-dependent metric, because the random fluctuations of $P_n f-Pf$ depend on the law of $f(X)$ under $P$.
[definition: Metric Entropy in L Two]
Let $\mathcal F$ be a class of measurable real-valued functions on $(\Omega,\mathcal F_0,P)$. For $1\le p<\infty$, define
\begin{align*}
d_{P,p}(f,g) &:= \|f-g\|_{L^p(P)} = \left(\int |f-g|^p\,dP\right)^{1/p}.
\end{align*}
This is an extended pseudometric $d_{P,p}:\mathcal F\times\mathcal F\to[0,\infty]$; it becomes a metric after identifying functions equal $P$-a.s. and restricting to finite distances. The $L^p(P)$ metric entropy of $\mathcal F$ is the function $H(\cdot,\mathcal F,L^p(P)):(0,\infty)\to[0,\infty]$ defined by
\begin{align*}
H(\varepsilon,\mathcal F,L^p(P)) &:= \log N(\varepsilon,\mathcal F,d_{P,p}).
\end{align*}
[/definition]
Metric entropy converts approximation into analysis, but a single scale does not control a chaining argument. Since chains use approximations at many resolutions, the next object sums the square-root entropy contributions across all small scales.
[definition: Entropy Integral]
Let $\mathcal F$ be a class of measurable real-valued functions and let $F$ be an envelope for $\mathcal F$. The $L^2(P)$ entropy integral up to scale $\delta>0$ is
\begin{align*}
J(\delta,\mathcal F,L^2(P)) &:= \int_0^\delta \sqrt{1+H(\varepsilon \|F\|_{L^2(P)},\mathcal F,L^2(P))}\,d\varepsilon.
\end{align*}
[/definition]
The scaling by $\|F\|_{L^2(P)}$ makes the condition invariant under multiplying every function by a constant. To see what the integral permits, the basic calculation is the polynomial entropy case that appears for VC-type classes.
[example: Polynomial Entropy Integral]
Suppose $\mathcal F$ has envelope $F\in L^2(P)$ and constants $A,v>0$ such that, for every $0<\varepsilon<1$,
\begin{align*}
N(\varepsilon\|F\|_{L^2(P)},\mathcal F,L^2(P)) \le \left(\frac{A}{\varepsilon}\right)^v.
\end{align*}
Since a covering number is at least $1$ for a nonempty class, this bound forces $(A/\varepsilon)^v\ge 1$ for every $0<\varepsilon<1$, hence $A\ge 1$. By the definition of metric entropy,
\begin{align*}
H(\varepsilon\|F\|_{L^2(P)},\mathcal F,L^2(P)) = \log N(\varepsilon\|F\|_{L^2(P)},\mathcal F,L^2(P)).
\end{align*}
Taking logarithms in the assumed covering bound gives
\begin{align*}
H(\varepsilon\|F\|_{L^2(P)},\mathcal F,L^2(P)) \le \log\left(\left(\frac{A}{\varepsilon}\right)^v\right)=v\log(A/\varepsilon).
\end{align*}
Therefore
\begin{align*}
J(1,\mathcal F,L^2(P)) \le \int_0^1 \sqrt{1+v\log(A/\varepsilon)}\,d\varepsilon.
\end{align*}
It remains to check that the last integral is finite. Put $u=\log(A/\varepsilon)$. Then $\varepsilon=Ae^{-u}$ and $d\varepsilon=-Ae^{-u}\,du$. As $\varepsilon\downarrow0$, $u\to\infty$, and when $\varepsilon=1$, $u=\log A$. Hence
\begin{align*}
\int_0^1 \sqrt{1+v\log(A/\varepsilon)}\,d\varepsilon=A\int_{\log A}^{\infty}\sqrt{1+vu}\,e^{-u}\,du.
\end{align*}
The integral on the right is finite because $\sqrt{1+vu}\,e^{-u}\le C e^{-u/2}$ for all sufficiently large $u$, for a constant $C$ depending only on $v$. Thus $J(1,\mathcal F,L^2(P))<\infty$, so polynomial covering growth is integrable in the entropy scale used for chaining.
[/example]
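The change of variables in this example can be sanity-checked numerically. The following sketch uses a plain midpoint rule; the helper `midpoint`, the constants `A = 2` and `v = 3`, and the truncation point of the infinite integral are all illustrative assumptions.

```python
import math


def midpoint(f, a, b, steps=200_000):
    """Composite midpoint rule for the integral of f over (a, b)."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


A, v = 2.0, 3.0          # illustrative constants with A >= 1

# Left side: the entropy-integral bound in the original variable epsilon.
lhs = midpoint(lambda e: math.sqrt(1.0 + v * math.log(A / e)), 0.0, 1.0)

# Right side: after u = log(A / epsilon).  The integrand decays like e^{-u},
# so the tail beyond u = log A + 40 is negligible and is truncated.
u0 = math.log(A)
rhs = A * midpoint(lambda u: math.sqrt(1.0 + v * u) * math.exp(-u), u0, u0 + 40.0)

print(lhs, rhs)
```

The two numbers should agree to several decimal places, and both are finite, in line with the conclusion $J(1,\mathcal F,L^2(P))<\infty$.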
Polynomial entropy shows that many natural classes are small enough at each scale. The remaining question is which entropy condition is strong enough for the empirical process proof, where the metric itself becomes random after symmetrisation.
[quotetheorem:6306]
[citeproof:6306]
The criterion is deliberately uniform in $Q$, not only stated for the true law $P$. This is needed because after symmetrisation the relevant distance is the empirical $L^2(P_n)$ distance, and $P_n$ is a random finitely discrete measure rather than the fixed law $P$. A condition only in $L^2(P)$ can be misleading: for a continuous distribution, the class of indicators of finite subsets of $\mathbb R$ has zero $L^2(P)$ size, but it can select the observed sample and make $\sup_f |P_n f-Pf|$ equal to $1$. That class is not even Glivenko-Cantelli, let alone Donsker; it illustrates that entropy must control the discrete geometries seen by the sample, not merely the geometry under the population law. Pollard's condition is also a sufficient condition, not a characterization: some Donsker classes are proved by other methods and need not satisfy this particular entropy integral. The theorem therefore points forward to VC-type uniform entropy bounds, which are designed precisely to hold over all empirical laws.
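The finite-set pathology is easy to reproduce. In the sketch below, which is illustrative only, the law is taken to be $\operatorname{Unif}(0,1)$, the function `f` is the indicator of the observed sample, and $Pf$ is replaced by a Monte Carlo stand-in based on fresh draws; all names and sizes are assumptions made for the example.

```python
import random

rng = random.Random(0)
n = 50
sample = [rng.random() for _ in range(n)]       # continuous law: Unif(0, 1)
support = set(sample)


def f(x):
    """Indicator of the finite set formed by the observed sample."""
    return 1.0 if x in support else 0.0


P_n_f = sum(f(x) for x in sample) / n           # empirical average of f

# Monte Carlo stand-in for P f: fresh draws essentially never hit the set.
fresh = [rng.random() for _ in range(100_000)]
P_f_estimate = sum(f(x) for x in fresh) / len(fresh)

print(P_n_f, P_f_estimate)
```

The empirical average equals $1$ by construction, while the population mean of the same function is $0$, so the deviation over this class does not shrink however large $n$ is.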
## Brackets and Bracketing Entropy
Covering numbers approximate a class by centres. The next problem is to exploit order structure when a function can be trapped between a lower and an upper function, which is often easier than producing metric centres.
[definition: Bracket]
Let $l$ and $u$ be measurable real-valued functions on $(\Omega,\mathcal F_0)$. The bracket $[l,u]$ is the set
\begin{align*}
[l,u] := \{f : l(x)\le f(x)\le u(x) \text{ for all } x\in\Omega\}.
\end{align*}
For $1\le p<\infty$, its $L^p(P)$ size is $\|u-l\|_{L^p(P)}$.
[/definition]
A bracket gives one order interval, but an entropy theorem needs a finite collection of such intervals at each accuracy. This motivates counting the minimum number of brackets needed to trap every function in the class.
[definition: Bracketing Number]
Let $\mathcal F$ be a class of measurable real-valued functions. For $1\le p<\infty$, the bracketing number is the map
\begin{align*}
N_{[]}(\cdot,\mathcal F,L^p(P)):(0,\infty)\to\mathbb N\cup\{\infty\}
\end{align*}
defined as follows. For $\varepsilon>0$, $N_{[]}(\varepsilon,\mathcal F,L^p(P))$ is the least integer $N$ such that there exist brackets $[l_j,u_j]$, $1\le j\le N$, with
\begin{align*}
\|u_j-l_j\|_{L^p(P)} \le \varepsilon, \qquad \mathcal F \subset \bigcup_{j=1}^N [l_j,u_j].
\end{align*}
If no finite such family exists, set $N_{[]}(\varepsilon,\mathcal F,L^p(P))=\infty$.
[/definition]
The corresponding entropy is $H_{[]} (\varepsilon,\mathcal F,L^p(P))=\log N_{[]} (\varepsilon,\mathcal F,L^p(P))$. Since central limit bounds accumulate errors across many bracket scales, we next package these numbers into an integral.
[definition: Bracketing Entropy Integral]
For $1\le p<\infty$, the $L^p(P)$ bracketing entropy integral is the map $J_{[]}(\cdot,\mathcal F,L^p(P)):(0,\infty)\to[0,\infty]$ defined by
\begin{align*}
J_{[]} (\delta,\mathcal F,L^p(P)) &:= \int_0^\delta \sqrt{1+H_{[]} (\varepsilon,\mathcal F,L^p(P))}\,d\varepsilon.
\end{align*}
[/definition]
The choice between $L^1(P)$ and $L^2(P)$ brackets depends on the limit theorem. Uniform laws of large numbers need expectation errors to be small, so the first entropy result for brackets is a Glivenko-Cantelli theorem in the $L^1(P)$ scale.
[quotetheorem:9835]
[citeproof:9835]
This theorem is the bracketing version of finite approximation. Its force is that no total boundedness in a symmetric metric is required; order intervals are enough. The hypothesis is still much stronger than pointwise laws of large numbers: every individual $f$ may satisfy $P_n f\to Pf$ while the class contains too many functions for the convergence to be uniform. For instance, under a continuous law on $\mathbb R$, indicators of arbitrary finite sets each have mean zero but can fit the sample exactly, so the supremum of $|P_n f-Pf|$ over the class does not converge to zero. Finite $L^1(P)$ bracketing rules out this pathology by forcing the class to be approximable by finitely many endpoint functions whose ordinary laws of large numbers can be used simultaneously. The interval example below shows how an ordered class avoids the pathology through quantile brackets.
[example: Indicators of Intervals]
Let $\mathcal C=\{(-\infty,t]:t\in\mathbb R\}$ and $\mathcal F=\{\mathbb 1_{(-\infty,t]}:t\in\mathbb R\}$. Fix $0<\varepsilon<1$ and put $\eta=\varepsilon^2$. We construct brackets by cutting the distribution function $F_P(t)=P((-\infty,t])$ into probability increments of size at most $\eta$, while treating large atoms separately.
Start with $a_0=-\infty$ and $C_0=\varnothing$. Given a half-line $C_k=(-\infty,a_k]$, choose the next cut so that all half-lines strictly between $C_k$ and the next cut add $P$-mass at most $\eta$ unless the next increase is caused by one atom. If an atom at $x$ has to be crossed and
\begin{align*}
P(\{x\})>\eta,
\end{align*}
insert the degenerate bracket $[\mathbb 1_{(-\infty,x]},\mathbb 1_{(-\infty,x]}]$ for that half-line. Such large atoms are few, because if $x_1,\dots,x_r$ are distinct atoms with $P(\{x_i\})>\eta$, then
\begin{align*}
r\eta < \sum_{i=1}^r P(\{x_i\}) \le 1,
\end{align*}
so $r<1/\eta$.
For the remaining pieces, take consecutive half-lines $A\subset B$ in the constructed chain with
\begin{align*}
P(B\setminus A)\le \eta.
\end{align*}
The corresponding bracket is
\begin{align*}
[\mathbb 1_A,\mathbb 1_B]=\{f:\mathbb 1_A\le f\le \mathbb 1_B\}.
\end{align*}
If $\mathbb 1_{(-\infty,t]}$ lies between $A$ and $B$, then
\begin{align*}
\mathbb 1_A \le \mathbb 1_{(-\infty,t]} \le \mathbb 1_B,
\end{align*}
so this bracket contains it. Its $L^2(P)$ width is
\begin{align*}
\|\mathbb 1_B-\mathbb 1_A\|_{L^2(P)}^2 = \int (\mathbb 1_B-\mathbb 1_A)^2\,dP.
\end{align*}
Since $A\subset B$, the difference $\mathbb 1_B-\mathbb 1_A$ equals $\mathbb 1_{B\setminus A}$, hence
\begin{align*}
\int (\mathbb 1_B-\mathbb 1_A)^2\,dP = \int \mathbb 1_{B\setminus A}\,dP = P(B\setminus A)\le \eta=\varepsilon^2.
\end{align*}
Therefore
\begin{align*}
\|\mathbb 1_B-\mathbb 1_A\|_{L^2(P)}\le \varepsilon.
\end{align*}
The non-atomic probability increments contribute at most $\lceil 1/\eta\rceil$ brackets, and the large-atom degenerate brackets contribute fewer than $1/\eta$ more. Thus, for a universal constant $C$,
\begin{align*}
N_{[]}(\varepsilon,\mathcal F,L^2(P))\le C\varepsilon^{-2}.
\end{align*}
This bound is uniform in $P$, so half-line indicators have polynomial $L^2(P)$ bracketing entropy uniformly over all probability laws on $\mathbb R$.
[/example]
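For a finitely supported law the bracket construction can be carried out mechanically. The sketch below is a simplified greedy variant, not the exact construction of the example: the function name `halfline_brackets` is hypothetical, half-lines are encoded by the number of support points they contain (so an upper end may be an open half-line), and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def halfline_brackets(masses, eps):
    """Greedy L2(P) brackets for half-line indicators under a discrete law.

    `masses[i]` is the mass of the i-th support point in increasing order.
    Index k stands for the indicator of a half-line containing exactly the
    first k support points.  Each returned pair (a, b) is the bracket
    [1_a, 1_b], whose squared L2(P) width is F(b) - F(a).
    """
    eta = eps * eps
    cdf = [0]
    for p in masses:
        cdf.append(cdf[-1] + p)
    m = len(masses)
    brackets, a = [], 0
    while a <= m:
        b = a
        # Extend the upper end while the bracket keeps mass at most eta.
        while b + 1 <= m and cdf[b + 1] - cdf[a] <= eta:
            b += 1
        brackets.append((a, b))     # a == b is a degenerate bracket
        a = b + 1
    return brackets, cdf


masses = [Fraction(1, 10)] * 10             # uniform law on ten points
eps = Fraction(1, 2)
brackets, cdf = halfline_brackets(masses, eps)
print(brackets)                             # [(0, 2), (3, 5), (6, 8), (9, 10)]
```

Consecutive bracket starts are separated by more than $\eta=\varepsilon^2$ in probability mass, so this variant uses fewer than $1+\varepsilon^{-2}$ brackets, in line with the $C\varepsilon^{-2}$ bound.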
This example recovers the entropy input behind the classical empirical distribution function theorem. It also shows why bracketing adapts well to ordered indicator classes.
## Envelopes and Integrability Conditions
Entropy controls the combinatorial or metric size of a class, but it says little about the magnitude of the functions. The next problem is to separate geometric complexity from tail behaviour. Envelope functions provide the bridge.
[definition: Envelope Function]
Let $\mathcal F$ be a class of measurable real-valued functions on $\Omega$. A measurable function $F:\Omega\to[0,\infty]$ is an envelope for $\mathcal F$ if
\begin{align*}
|f(x)|\le F(x)
\end{align*}
for every $f\in\mathcal F$ and every $x\in\Omega$.
[/definition]
The envelope is allowed to be larger than necessary, but sharper envelopes give better entropy and moment conditions. For Donsker theorems, $F\in L^2(P)$ is the basic integrability scale; for Glivenko-Cantelli results, $F\in L^1(P)$ is often enough.
[example: Monotone Functions on the Unit Interval]
Let $\mathcal F$ be the class of nondecreasing functions $f:[0,1]\to[0,1]$, and fix a probability measure $P$ on $[0,1]$. Given $0<\varepsilon<1$, choose numbers $\delta,\gamma>0$ with $\delta+\gamma\le\varepsilon$. First split $[0,1]$ into finitely many ordered cells $C_1,\dots,C_m$ such that every atom of $P$ with mass larger than $\delta$ is a singleton cell, and every remaining cell has $P(C_j)\le\delta$. This is possible because there are fewer than $1/\delta$ atoms of mass larger than $\delta$, and the rest of the probability mass can be cut by quantiles into pieces of mass at most $\delta$.
Choose an integer $K$ with $1/K\le\gamma/2$, and let the range grid be $\{0,1/K,2/K,\dots,1\}$. For a fixed $f\in\mathcal F$, put
\begin{align*}
s_j:=\inf_{x\in C_j} f(x), \qquad t_j:=\sup_{x\in C_j} f(x).
\end{align*}
Since $f$ is nondecreasing and the cells are ordered, $s_1\le t_1\le s_2\le t_2\le\cdots\le s_m\le t_m$. Define grid values
\begin{align*}
a_j:=\frac{\lfloor K s_j\rfloor}{K}, \qquad b_j:=\frac{\lceil K t_j\rceil}{K}.
\end{align*}
Then $a_j\le f(x)\le b_j$ for every $x\in C_j$, because $a_j\le s_j\le f(x)\le t_j\le b_j$. Let
\begin{align*}
l_f(x):=a_j \text{ and } u_f(x):=b_j \quad \text{for } x\in C_j.
\end{align*}
Thus $f\in[l_f,u_f]$.
The $L^1(P)$ width of this bracket is
\begin{align*}
\|u_f-l_f\|_{L^1(P)}=\sum_{j=1}^m (b_j-a_j)P(C_j).
\end{align*}
For each $j$,
\begin{align*}
b_j-a_j\le (t_j-s_j)+\frac{2}{K}.
\end{align*}
Hence
\begin{align*}
\|u_f-l_f\|_{L^1(P)}\le \sum_{j=1}^m (t_j-s_j)P(C_j)+\frac{2}{K}\sum_{j=1}^m P(C_j).
\end{align*}
The second sum equals $1$, so its contribution is at most $2/K\le\gamma$. For singleton cells, $t_j=s_j$, and for every other cell $P(C_j)\le\delta$. Therefore
\begin{align*}
\sum_{j=1}^m (t_j-s_j)P(C_j)\le \delta\sum_{j=1}^m (t_j-s_j).
\end{align*}
Because the oscillations occur in increasing order and $0\le f\le1$,
\begin{align*}
\sum_{j=1}^m (t_j-s_j)\le 1.
\end{align*}
Combining the estimates gives
\begin{align*}
\|u_f-l_f\|_{L^1(P)}\le \delta+\gamma\le\varepsilon.
\end{align*}
Only finitely many brackets arise from this construction: each endpoint pair is determined by finitely many choices of grid values on the fixed finite cell partition. Thus $N_{[]}(\varepsilon,\mathcal F,L^1(P))<\infty$ for every $\varepsilon>0$. By the [bracketing Glivenko-Cantelli criterion](/theorems/9835), the monotone class $\mathcal F$ is $P$-Glivenko-Cantelli.
[/example]
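The width estimate can be checked on a concrete monotone function. The sketch below makes several simplifying assumptions that are not part of the example: $P$ is the uniform law on equally spaced points, the infimum and supremum over a cell are taken over its support points only, and the name `monotone_bracket_width` is hypothetical.

```python
import math


def monotone_bracket_width(f, n_points, delta, K):
    """L1(P) width of the step bracket [l_f, u_f] for nondecreasing f.

    P is taken to be the uniform law on n_points equally spaced points,
    cells are consecutive blocks of mass at most delta, and 1/K is the
    range-grid spacing.  The inf and sup of f are taken over the support
    points of each cell.
    """
    xs = [(i + 0.5) / n_points for i in range(n_points)]
    per_cell = max(1, math.floor(delta * n_points))   # points per cell
    width = 0.0
    for start in range(0, n_points, per_cell):
        cell = xs[start:start + per_cell]
        s, t = f(cell[0]), f(cell[-1])        # monotone: min and max at the ends
        a = math.floor(K * s) / K             # lower grid value a_j <= s_j
        b = math.ceil(K * t) / K              # upper grid value b_j >= t_j
        width += (b - a) * len(cell) / n_points
    return width


eps = 0.1
delta = gamma = eps / 2
K = math.ceil(2 / gamma)                      # ensures 2 / K <= gamma
width = monotone_bracket_width(lambda x: x * x, 1000, delta, K)
print(width, "<=", eps)
```

The computed width stays below $\delta+\gamma=\varepsilon$, as the chain of inequalities in the example predicts.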
The monotone example is a prototype for using structural constraints to produce brackets. Instead of estimating every possible oscillation, the construction uses the fact that a monotone function can cross each level at most once.
## Bracketing Central Limit Theorem
For Donsker results, finite $L^1(P)$ bracketing is not enough. Empirical averages fluctuate at order $n^{-1/2}$, and the empirical process $G_n$ rescales these fluctuations to order one, so the bracket widths must be controlled in $L^2(P)$ and summable across scales through the square-root entropy integral.
[quotetheorem:9836]
[proofunderconstruction:9836]
The theorem is especially useful for classes with order or smoothness constraints. The $L^2(P)$ scale is essential because Donsker convergence concerns fluctuations of variance order, while $L^1(P)$ brackets control only mean errors and are enough for uniform laws of large numbers but not for Gaussian tightness. The entropy integral is also a genuine summability condition: finite bracketing at each fixed scale does not prevent the number of brackets from exploding so quickly as $\varepsilon\downarrow0$ that chaining increments cannot be summed. A concrete limitation is provided by the coordinate projections on a nonatomic product space. Let $\Omega=\{-1,1\}^{\mathbb N}$ with product fair-coin measure and let $f_k(\omega)=\omega_k$. The class $\mathcal F=\{f_k:k\ge1\}$ has envelope $1$, and every pair of distinct functions has $L^2(P)$ distance $\sqrt{2}$. For every $\varepsilon<1$ it therefore requires infinitely many $L^2(P)$ brackets, so the bracketing integral is infinite. The empirical process values $G_n f_k=n^{-1/2}\sum_{i=1}^n \omega_{i,k}$ behave like infinitely many independent standardized sums; over infinitely many coordinates their supremum is not tight in $\ell^\infty(\mathcal F)$. This shows that the theorem's entropy hypothesis excludes a real obstruction, not only a proof artifact. In many nonparametric problems, bracketing entropy is easier to estimate than uniform entropy over all finitely supported measures, and the examples below show how smoothness or bounded variation supplies the missing fine-scale control.
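The lack of tightness for the coordinate class can be seen in a small simulation. The sketch below is illustrative, with a hypothetical function name and arbitrary sizes: it tracks the supremum of $|G_n f_k|$ over the first $K$ coordinates for a fixed $n$, using independent fair signs for each coordinate.

```python
import math
import random


def running_sup(n, max_coords, seed=0):
    """Return sup_{k <= K} |G_n f_k| for K = 1, ..., max_coords.

    G_n f_k = n^{-1/2} * (sum of n independent fair signs), independently
    across coordinates k.
    """
    rng = random.Random(seed)
    sups, current = [], 0.0
    for _ in range(max_coords):
        total = sum(rng.choice((-1, 1)) for _ in range(n))
        current = max(current, abs(total) / math.sqrt(n))
        sups.append(current)
    return sups


sups = running_sup(n=100, max_coords=2000)
for K in (1, 10, 100, 1000, 2000):
    print(K, round(sups[K - 1], 3))
```

The running supremum keeps increasing as more coordinates are included and is capped only by $\sqrt n$, which is the value attained when some coordinate has all signs equal.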
[example: Holder Balls]
Let $\mathcal F$ be a uniformly bounded ball in $C^{\alpha}([0,1]^d)$: assume $|f|\le M$ for every $f\in\mathcal F$, and assume the Holder radius is bounded by $R$. The piecewise-polynomial approximation estimate for Holder balls gives constants $C>0$ and $\varepsilon_0>0$, depending only on $d,\alpha,R,M$, such that for every probability measure $P$ on $[0,1]^d$ and every $0<\varepsilon\le\varepsilon_0$,
\begin{align*}
H_{[]}(\varepsilon,\mathcal F,L^2(P))\le C\varepsilon^{-d/\alpha}.
\end{align*}
To see what this rate gives in the bracketing integral, take square roots:
\begin{align*}
\sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\le \sqrt C\,\varepsilon^{-d/(2\alpha)}.
\end{align*}
Hence, for $0<\delta\le\varepsilon_0$,
\begin{align*}
\int_0^\delta \sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\,d\varepsilon \le \sqrt C\int_0^\delta \varepsilon^{-d/(2\alpha)}\,d\varepsilon.
\end{align*}
If $\alpha>d/2$, then $d/(2\alpha)<1$, so the last integral is
\begin{align*}
\int_0^\delta \varepsilon^{-d/(2\alpha)}\,d\varepsilon=\frac{\delta^{1-d/(2\alpha)}}{1-d/(2\alpha)}<\infty.
\end{align*}
The part of the entropy integral over $[\delta,\infty)$ is finite as well. Bracketing numbers are nonincreasing in $\varepsilon$, so on $[\delta,2M]$ the entropy is bounded by its finite value at $\delta$. For $\varepsilon\ge 2M$, the single bracket $[-M,M]$ contains every $f\in\mathcal F$ and has $L^2(P)$ width at most $2M$, so $H_{[]}(\varepsilon,\mathcal F,L^2(P))=0$ for $\varepsilon\ge2M$. Therefore
\begin{align*}
\int_0^\infty \sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\,d\varepsilon<\infty.
\end{align*}
By the *Bracketing Central Limit Theorem*, every uniformly bounded Holder ball with $\alpha>d/2$ is $P$-Donsker. The threshold says that smoothness must dominate dimension strongly enough for the fine-scale bracketing errors to be summable.
[/example]
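The role of the threshold $\alpha>d/2$ can be made concrete by evaluating the power integral in closed form. The helper below is a small sketch with a hypothetical name; `exponent` stands for $d/\alpha$, and the constant $C$ from the entropy bound is omitted.

```python
import math


def sqrt_entropy_integral(delta, exponent):
    """Integral over (0, delta) of eps**(-exponent / 2).

    `exponent` plays the role of d / alpha in H(eps) <= C * eps**(-d / alpha).
    The integral is finite exactly when exponent < 2.
    """
    r = exponent / 2.0
    if r >= 1.0:
        return math.inf
    return delta ** (1.0 - r) / (1.0 - r)


for d, alpha in [(1, 1.0), (2, 1.5), (2, 1.0), (3, 1.0)]:
    print(d, alpha, sqrt_entropy_integral(0.25, d / alpha))
```

The cases with $\alpha\le d/2$ return infinity, matching the divergence of $\int_0^\delta \varepsilon^{-d/(2\alpha)}\,d\varepsilon$ when $d/(2\alpha)\ge1$.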
This smoothness threshold reflects the same competition seen throughout empirical process theory: the parameter space is infinite-dimensional, but enough regularity makes the class effectively small at fine resolution. It also marks a limitation of entropy criteria: boundedness alone does not make a function class Donsker, because functions may still oscillate independently on arbitrarily fine cells. The next example replaces smoothness by a different structural restriction. Bounded variation permits jumps, so it is less regular than a Holder ball, but it limits the total amount of oscillation and therefore still gives enough order structure for bracketing arguments.
[example: Bounded Variation Classes]
Let $\mathcal F$ be a class of functions on $[0,1]$ that is uniformly bounded by $M$, and suppose every $f\in\mathcal F$ has total variation at most $V$. By the Jordan decomposition for [functions of bounded variation](/page/Functions%20of%20Bounded%20Variation), each $f$ can be written as
\begin{align*}
f=f(0)+g-h
\end{align*}
where $g$ and $h$ are nondecreasing, $g(0)=h(0)=0$, and $g(1)+h(1)\le V$. Thus bracketing $\mathcal F$ reduces to bracketing two bounded monotone components and one constant term.
The key conversion from $L^1(P)$ control to $L^2(P)$ control is the following elementary estimate. If a bracket $[l,u]$ satisfies $0\le u-l\le C$ pointwise and $P(u-l)\le\eta$, then
\begin{align*}
\|u-l\|_{L^2(P)}^2=\int (u-l)^2\,dP.
\end{align*}
Since $0\le u-l\le C$, we have $(u-l)^2\le C(u-l)$ pointwise, hence
\begin{align*}
\int (u-l)^2\,dP\le C\int (u-l)\,dP=C\,P(u-l)\le C\eta.
\end{align*}
Choosing $\eta=\varepsilon^2/C$ gives
\begin{align*}
\|u-l\|_{L^2(P)}^2\le C\frac{\varepsilon^2}{C}=\varepsilon^2,
\end{align*}
and therefore $\|u-l\|_{L^2(P)}\le\varepsilon$.
For one-dimensional bounded-variation classes, the standard monotone-component bracketing construction gives constants $K,\varepsilon_0>0$, depending only on $M$ and $V$, such that for $0<\varepsilon\le\varepsilon_0$,
\begin{align*}
H_{[]}(\varepsilon,\mathcal F,L^2(P))\le K\varepsilon^{-1}.
\end{align*}
Taking square roots gives
\begin{align*}
\sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\le \sqrt K\,\varepsilon^{-1/2}.
\end{align*}
Therefore, for $0<\delta\le\varepsilon_0$,
\begin{align*}
\int_0^\delta \sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\,d\varepsilon
\le \sqrt K\int_0^\delta \varepsilon^{-1/2}\,d\varepsilon.
\end{align*}
The last integral is
\begin{align*}
\int_0^\delta \varepsilon^{-1/2}\,d\varepsilon=2\delta^{1/2},
\end{align*}
so
\begin{align*}
\int_0^\delta \sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\,d\varepsilon
\le 2\sqrt K\,\delta^{1/2}<\infty.
\end{align*}
For large $\varepsilon$, the single bracket $[-M,M]$ contains every $f\in\mathcal F$ and has $L^2(P)$ width at most $2M$, so the remaining part of the bracketing integral is finite. Hence
\begin{align*}
\int_0^\infty \sqrt{H_{[]}(\varepsilon,\mathcal F,L^2(P))}\,d\varepsilon<\infty.
\end{align*}
Bounded variation therefore supplies enough one-dimensional order structure for bracketing to prove both uniform laws of large numbers and Donsker convergence.
[/example]
Bounded variation sits between monotonicity and smoothness. It permits jumps and nonsmooth behaviour, but the total amount of oscillation is limited enough for bracketing to remain efficient.
## Uniform Entropy and VC-Type Classes
Bracketing is powerful but can be too demanding for classes described by combinatorial dimension, such as VC classes of sets. The next question is how to prove Donsker theorems from ordinary covering numbers in a way that is uniform over the discrete measures arising in symmetrisation.
[definition: Uniform Entropy Integral]
Let $\mathfrak A$ be the collection of pairs $(\mathcal F,F)$ where $\mathcal F$ is a class of measurable real-valued functions on a measurable space $(\Omega,\mathcal F_0)$ and $F$ is an envelope for $\mathcal F$. The uniform $L^2$ entropy integral is the map $J_{\mathrm{unif}}:\mathfrak A\to[0,\infty]$ defined by
\begin{align*}
J_{\mathrm{unif}}(\mathcal F,F) &:= \int_0^1 \sup_Q \sqrt{1+\log N(\varepsilon\|F\|_{L^2(Q)},\mathcal F,L^2(Q))}\,d\varepsilon,
\end{align*}
where the supremum ranges over finitely discrete probability measures $Q$ with $0<\|F\|_{L^2(Q)}<\infty$.
[/definition]
The supremum over $Q$ makes the condition distribution-free at the level needed by the empirical process proof. This prepares a Donsker theorem for classes whose covering numbers can be controlled uniformly over all empirical laws.
[quotetheorem:6306]
[citeproof:6306]
This theorem is the common route from VC theory to Donsker theory. Once a class has polynomial uniform covering numbers and a square-integrable envelope, the entropy integral is finite by the earlier polynomial calculation. The limitation is that entropy under the single law $P$ does not suffice: a class can look small in $L^2(P)$ while becoming large under empirical measures concentrated on the observed points. The finite-set indicator class under a continuous distribution is the extreme warning example, since it has population $L^2(P)$ distance zero from $0$ but fails uniform convergence by selecting the sample. Uniform entropy excludes this behaviour by testing every finitely discrete law, which is exactly the family of laws produced inside the symmetrised proof.
[example: Indicator Classes with Polynomial Entropy]
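Let $\mathcal C$ be a class of measurable sets, and suppose there are constants $A,v>0$ such that $N(\varepsilon,\{\mathbb 1_C:C\in\mathcal C\},L^2(Q))\le (A/\varepsilon)^v$ for every finitely discrete probability measure $Q$ and every $0<\varepsilon<1$. If $\mathcal C$ is empty, the class is trivially $P$-Donsker, so assume $\mathcal C$ is nonempty. Put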
\begin{align*}
\mathcal G:=\{\mathbb 1_C:C\in\mathcal C\}.
\end{align*}
The constant function $G=1$ is an envelope for $\mathcal G$, because $|\mathbb 1_C|\le 1$ pointwise. For every probability measure $Q$,
\begin{align*}
\|G\|_{L^2(Q)}=\left(\int 1^2\,dQ\right)^{1/2}=1.
\end{align*}
Since $\mathcal G$ is nonempty, every covering number of $\mathcal G$ is at least $1$. Applying the assumed bound with any $0<\varepsilon<1$ gives
\begin{align*}
1\le N(\varepsilon,\mathcal G,L^2(Q))\le \left(\frac A\varepsilon\right)^v.
\end{align*}
Letting $\varepsilon\uparrow1$ shows $A\ge1$.
For every finitely discrete probability measure $Q$ and every $0<\varepsilon<1$, the identity $\|G\|_{L^2(Q)}=1$ gives
\begin{align*}
\log N(\varepsilon\|G\|_{L^2(Q)},\mathcal G,L^2(Q))=\log N(\varepsilon,\mathcal G,L^2(Q)).
\end{align*}
By the assumed polynomial covering bound,
\begin{align*}
\log N(\varepsilon,\mathcal G,L^2(Q))\le \log\left(\left(\frac A\varepsilon\right)^v\right)=v\log(A/\varepsilon).
\end{align*}
Therefore
\begin{align*}
\sup_Q\sqrt{1+\log N(\varepsilon\|G\|_{L^2(Q)},\mathcal G,L^2(Q))}\le \sqrt{1+v\log(A/\varepsilon)}.
\end{align*}
It remains only to check that the right-hand side is integrable on $(0,1)$. With $u=\log(A/\varepsilon)$, we have $\varepsilon=Ae^{-u}$ and $d\varepsilon=-Ae^{-u}\,du$. As $\varepsilon\downarrow0$, $u\to\infty$, and at $\varepsilon=1$, $u=\log A$. Hence
\begin{align*}
\int_0^1 \sqrt{1+v\log(A/\varepsilon)}\,d\varepsilon=A\int_{\log A}^{\infty}\sqrt{1+vu}\,e^{-u}\,du.
\end{align*}
For large $u$, the ratio $\sqrt{1+vu}/e^{u/2}$ tends to $0$, so there is a constant $C$ such that $\sqrt{1+vu}\le C e^{u/2}$ on the tail. Thus the tail integrand is bounded by $C e^{-u/2}$, whose integral is finite, while the remaining part over any finite interval is finite by continuity. Hence
\begin{align*}
J_{\mathrm{unif}}(\mathcal G,G)<\infty.
\end{align*}
By the *Uniform Entropy Donsker Theorem*, $\mathcal G=\{\mathbb 1_C:C\in\mathcal C\}$ is $P$-Donsker for every probability measure $P$.
[/example]
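For half-lines, the polynomial covering bound under an empirical law can be observed directly. The sketch below is illustrative: the name `halfline_cover_size` is hypothetical, $Q$ is taken to be the uniform law on $n$ distinct points, the greedy count is only an upper bound on the covering number, and the comparison constants $A=2$, $v=2$ are chosen for this example rather than taken from the text.

```python
def halfline_cover_size(n, eps):
    """Greedy upper bound on N(eps, G, L2(Q)) for half-line indicators.

    Q is the uniform law on n distinct points.  Index k in {0, ..., n}
    stands for the half-line containing exactly k of the points, and the
    squared L2(Q) distance between indices j and k is |j - k| / n.
    """
    centres = []
    for k in range(n + 1):
        # k is covered if some centre c has |k - c| / n < eps**2.
        if all(abs(k - c) >= n * eps * eps for c in centres):
            centres.append(k)
    return len(centres)


for eps in (0.5, 0.25, 0.125):
    size = halfline_cover_size(n=1000, eps=eps)
    print(eps, size, (2 / eps) ** 2)
```

The greedy counts grow roughly like $\varepsilon^{-2}$ and stay below $(2/\varepsilon)^2$ at the scales shown.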
The polynomial condition holds for every VC class of sets, with the exponent $v$ determined by the VC dimension. Half-lines, rectangles in fixed dimension, Euclidean balls, and subgraphs from finite-dimensional linear families are standard examples.
## Preservation Under Lipschitz Transformations
Entropy criteria become more useful when they are stable under operations used in statistics. The most common operation is composing a function class with a Lipschitz map, for instance passing from regression functions to losses.
[quotetheorem:9837]
[citeproof:9837]
For losses that are Lipschitz only on bounded intervals, the same conclusion applies to uniformly bounded classes after restricting the Lipschitz constant to the relevant range. This is the form used later for empirical risk problems: once a regression, classification, or likelihood class has an entropy bound, a Lipschitz loss transfers that bound to the loss class whose empirical average is minimized. The theorem therefore turns Donsker or Glivenko-Cantelli control of predictors into the corresponding uniform limit theory for risks.
Some form of Lipschitz control is necessary: a merely continuous transformation can magnify small $L^p(Q)$ distances into much larger ones near points where its slope is unbounded, so covering numbers need not be preserved at comparable scales. For example, with $\phi(t)=\sqrt{|t|}$ and a one-point probability space, the constant functions $0$ and $a$ are separated by $a$ in $L^p(Q)$ before transformation but by $\sqrt a$ after transformation. No fixed linear scale comparison can hold as $a\downarrow0$. The condition $\phi(0)=0$ in the envelope statement is also substantive; without it, the transformed class has the additional constant size $|\phi(0)|$, so an envelope must be enlarged to $|\phi(0)|+LF$ rather than just $LF$. This is why loss transformations are usually checked either globally, or on a bounded range where a concrete local Lipschitz constant is available before applying uniform central limit theorems to empirical risk processes.
The theorem is only a scale-comparison result for ordinary covering numbers. It does not assert a reverse inequality, preserve bracketing numbers by itself, or repair an entropy condition that already fails for $\mathcal F$. It also requires the transformation to act pointwise through a common Lipschitz constant. If the Lipschitz constant is allowed to depend on the sample law or on the particular function being transformed, the covering centres for $\mathcal F$ need not map to a controlled cover of the transformed class. For a concrete failure of the reverse direction, take $\phi(t)=0$ for all $t$. Then $\phi\circ\mathcal F$ has covering number $1$ at every scale for every class $\mathcal F$, including classes with infinite covering numbers. Thus small entropy after transformation says little about the original class; the theorem is used only to push known entropy bounds forward through stable operations.
[example: Absolute Loss over a Bounded Regression Class]
Let $(X,\mathcal A)$ be a measurable covariate space, let $\mathcal F$ be a class of measurable functions $f:X\to[-M,M]$, and assume $|Y|\le M$ a.s. For $f\in\mathcal F$, write
\begin{align*}
\ell_f(x,y):=|y-f(x)|.
\end{align*}
The associated loss class is
\begin{align*}
\mathcal L=\{\ell_f:f\in\mathcal F\}.
\end{align*}
First, $\mathcal L$ has envelope $2M$. Indeed, for every $x\in X$ and every $y$ with $|y|\le M$,
\begin{align*}
|\ell_f(x,y)|=|y-f(x)|.
\end{align*}
By the triangle inequality,
\begin{align*}
|y-f(x)|\le |y|+|f(x)|.
\end{align*}
Since $|y|\le M$ and $|f(x)|\le M$,
\begin{align*}
|\ell_f(x,y)|\le M+M=2M.
\end{align*}
Now let $Q$ be a probability measure on $X\times\mathbb R$, and let $Q_X$ be its marginal on $X$. Suppose $f_1,\dots,f_N$ cover $\mathcal F$ in $L^2(Q_X)$ at radius $\varepsilon$, so for every $f\in\mathcal F$ there is a $j$ such that
\begin{align*}
\|f-f_j\|_{L^2(Q_X)}\le \varepsilon.
\end{align*}
For this $j$, compare the two loss functions pointwise. Using $||a|-|b||\le |a-b|$ with $a=y-f(x)$ and $b=y-f_j(x)$,
\begin{align*}
|\ell_f(x,y)-\ell_{f_j}(x,y)|\le |(y-f(x))-(y-f_j(x))|.
\end{align*}
The $y$ terms cancel, so
\begin{align*}
|(y-f(x))-(y-f_j(x))|=|f_j(x)-f(x)|.
\end{align*}
Therefore
\begin{align*}
|\ell_f(x,y)-\ell_{f_j}(x,y)|^2\le |f(x)-f_j(x)|^2.
\end{align*}
Integrating with respect to $Q$ gives
\begin{align*}
\|\ell_f-\ell_{f_j}\|_{L^2(Q)}^2\le \int |f(x)-f_j(x)|^2\,dQ(x,y).
\end{align*}
By the definition of the marginal $Q_X$,
\begin{align*}
\int |f(x)-f_j(x)|^2\,dQ(x,y)=\int |f(x)-f_j(x)|^2\,dQ_X(x).
\end{align*}
Thus
\begin{align*}
\|\ell_f-\ell_{f_j}\|_{L^2(Q)}^2\le \|f-f_j\|_{L^2(Q_X)}^2\le \varepsilon^2,
\end{align*}
and hence
\begin{align*}
\|\ell_f-\ell_{f_j}\|_{L^2(Q)}\le \varepsilon.
\end{align*}
So
\begin{align*}
N(\varepsilon,\mathcal L,L^2(Q))\le N(\varepsilon,\mathcal F,L^2(Q_X)).
\end{align*}
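Consequently, if $\mathcal F$ satisfies a uniform entropy Donsker condition with the constant envelope $M$, then $\mathcal L$ satisfies the same type of condition with envelope $2M$: a cover at the larger radius $2M\varepsilon$ needs no more centres than one at radius $M\varepsilon$, so $N(2M\varepsilon,\mathcal L,L^2(Q))\le N(2M\varepsilon,\mathcal F,L^2(Q_X))\le N(M\varepsilon,\mathcal F,L^2(Q_X))$. The absolute-loss transformation therefore preserves the required uniform covering control: at every scale, the covering numbers of the loss class are no larger than those of the original regression class under the corresponding marginal metric.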
[/example]
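The pointwise contraction used in the example can be checked numerically on a finitely discrete measure. The sketch below draws random values of $y$, $f$, and $f_j$ at the atoms of a uniform discrete $Q$ and compares $\|\ell_f-\ell_{f_j}\|_{L^2(Q)}$ with $\|f-f_j\|_{L^2(Q_X)}$. The number of atoms, the seed, and the function names are assumptions made for illustration, and the atoms are assumed to have distinct $x$-coordinates so that $Q$ and $Q_X$ share the same weights.

```python
import math
import random

def l2_norm(values, weights):
    """L2 norm of a function given by its values on the atoms of a discrete measure."""
    return math.sqrt(sum(w * v * v for v, w in zip(values, weights)))

def absolute_loss_contraction(num_atoms=50, M=1.0, seed=0):
    """Return (||l_f - l_fj||, ||f - fj||) on a random uniform discrete measure."""
    rng = random.Random(seed)
    weights = [1.0 / num_atoms] * num_atoms
    # Only the values at the atoms matter: y_i, f(x_i), and f_j(x_i), all in [-M, M].
    y = [rng.uniform(-M, M) for _ in range(num_atoms)]
    f = [rng.uniform(-M, M) for _ in range(num_atoms)]
    fj = [rng.uniform(-M, M) for _ in range(num_atoms)]
    loss_gap = [abs(yi - a) - abs(yi - b) for yi, a, b in zip(y, f, fj)]
    func_gap = [a - b for a, b in zip(f, fj)]
    return l2_norm(loss_gap, weights), l2_norm(func_gap, weights)

if __name__ == "__main__":
    loss_dist, func_dist = absolute_loss_contraction()
    print(loss_dist, func_dist)
```

In every draw the first number should not exceed the second, which is the inequality that transfers an $\varepsilon$-cover of $\mathcal F$ to an $\varepsilon$-cover of $\mathcal L$.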
The chapter’s main message is that entropy translates geometry into probability. $L^1(P)$ bracketing is a practical route to Glivenko-Cantelli theorems, $L^2(P)$ bracketing gives central limit theorems through bracketing entropy integrals, and uniform covering entropy gives Donsker theorems for VC-type and Lipschitz-transformed classes. These criteria are the technical engine behind many later applications to asymptotic statistics, [bootstrap consistency](/theorems/1995), and learning theory.
# 7. Maximal Inequalities
Maximal inequalities turn pointwise concentration into bounds for suprema of empirical processes. Earlier chapters introduced symmetrisation, entropy, VC control, and finite-net approximations as ways to describe the size of a function class; this chapter explains how those ingredients become high-probability uniform bounds. The multiscale chaining viewpoint is developed explicitly in the next chapter. The main questions are how large
\begin{align*}
\sup_{f\in\mathcal F}|(P_n-P)f|
\end{align*}
can be, how boundedness and variance improve the answer, and how the bound can be localized near functions with small risk or small variance.
The same pattern appears outside empirical-process theory whenever many noisy quantities must be controlled at once: random matrix operator norms, stochastic optimization gradients, and uniform deviations in [nonparametric statistics](/page/Nonparametric%20Statistics) all ask for a pointwise tail bound plus a way to pay for the size of the index set. The chapter focuses on empirical averages, but the proof architecture is a reusable template for turning local concentration into uniform control.
## From Pointwise Concentration to Suprema
A single empirical average is often controlled by Hoeffding or [Bernstein inequalities](/theorems/3188). Empirical process theory needs a simultaneous version over many functions, so the first step is to record the pointwise inequalities in a form that can later be combined with entropy and symmetrisation.
[definition: Empirical Average]
Let $X_1,\dots,X_n$ be i.i.d. random variables with distribution $P$ on a measurable space $(S,\mathcal A)$. The empirical average is the random functional
\begin{align*}
P_n: \{f:S\to\mathbb R \text{ measurable}: |f(X_i)|<\infty \text{ for }1\le i\le n\}\to\mathbb R,
\end{align*}
defined by
\begin{align*}
P_nf := \frac{1}{n}\sum_{i=1}^{n} f(X_i).
\end{align*}
The population expectation is the functional
\begin{align*}
P:L^1(P)\to\mathbb R,
\end{align*}
defined by
\begin{align*}
Pf := \int_S f\,dP.
\end{align*}
[/definition]
The difference $(P_n-P)f$ is a centered average. This motivates the first pointwise bound: before taking a supremum over $\mathcal F$, we need a sharp tail estimate for a single bounded function.
[quotetheorem:9838]
[citeproof:9838]
Hoeffding uses only the range, so it treats a rare-event indicator and a balanced Bernoulli variable in the same way. The bounded range hypothesis is essential: without it, a centered variable can have finite mean but polynomial tails, and no Gaussian exponential bound of this form can hold uniformly. The theorem also does not use the actual variance, so it cannot distinguish between a function that almost never differs from its mean and one that fluctuates on every observation. This limitation motivates Bernstein's inequality, which keeps a boundedness correction but also records the variance scale used later in local empirical-process bounds.
[quotetheorem:9839]
[citeproof:9839]
The difference between Hoeffding and Bernstein is visible when the variance is much smaller than the square of the envelope. Centering is needed because the theorem controls fluctuations around zero; for an uncentered function the deterministic mean would shift $P_nf$ and make the displayed event measure both bias and noise. Boundedness is also doing real work: finite variance alone permits heavy-tailed variables for which Bernstein tails fail at large deviations. The theorem still does not solve the empirical-process problem, because it controls one function at a time; the next example makes the variance scale concrete before the chapter passes to suprema.
[example: Variance Sensitive Bernoulli Average]
Let $A\in\mathcal A$ and write $p=P(A)$. Define $f=\mathbf 1_A-p$. Then
\begin{align*}
Pf=P\mathbf 1_A-p=p-p=0.
\end{align*}
Also, for $x\in A$ one has $f(x)=1-p\in[0,1]$, while for $x\notin A$ one has $f(x)=-p\in[-1,0]$, so $|f|\le 1$. Its variance term is
\begin{align*}
P f^2=P(\mathbf 1_A-p)^2.
\end{align*}
Since $\mathbf 1_A^2=\mathbf 1_A$,
\begin{align*}
(\mathbf 1_A-p)^2=\mathbf 1_A-2p\mathbf 1_A+p^2.
\end{align*}
Taking expectations gives
\begin{align*}
P f^2=P(A)-2pP(A)+p^2=p-2p^2+p^2=p(1-p).
\end{align*}
Moreover,
\begin{align*}
P_nf=\frac{1}{n}\sum_{i=1}^{n}(\mathbf 1_A(X_i)-p)=P_n(A)-p=P_n(A)-P(A).
\end{align*}
Applying *[Bernstein Inequality For Empirical Averages](/theorems/9839)* with $M=1$ and $\sigma^2=p(1-p)$ therefore gives, for every $t>0$,
\begin{align*}
\mathbb P\left(P_n(A)-P(A)\ge t\right)\le \exp\left(-\frac{nt^2}{2P(A)(1-P(A))+2t/3}\right).
\end{align*}
Thus when $P(A)$ is small, the quadratic part of the denominator is of order $P(A)$ rather than order $1$, and only larger deviations are governed by the linear boundedness correction $2t/3$.
[/example]
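To make the variance sensitivity concrete, the following sketch evaluates the Bernstein bound from the example next to the Hoeffding bound $e^{-2nt^2}$ for a $[0,1]$-valued summand. The function names and the sample values of $n$, $t$, and $p$ are hypothetical choices for illustration only.

```python
import math

def hoeffding_bound(n, t):
    """Hoeffding bound exp(-2 n t^2) for an average of [0, 1]-valued summands."""
    return math.exp(-2.0 * n * t * t)

def bernstein_bound(n, t, p):
    """Bernstein bound from the example, with M = 1 and sigma^2 = p (1 - p)."""
    return math.exp(-n * t * t / (2.0 * p * (1.0 - p) + 2.0 * t / 3.0))

if __name__ == "__main__":
    n, t = 1000, 0.02  # illustrative sample size and deviation
    for p in (0.01, 0.1, 0.5):
        print(p, hoeffding_bound(n, t), bernstein_bound(n, t, p))
```

For a rare event such as $p=0.01$ the Bernstein bound is much smaller than the Hoeffding bound at these values, while for $p=1/2$ the two are comparable and Hoeffding is slightly smaller because of the linear correction $2t/3$.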
## Orlicz Norms and Tail Integration
Pointwise inequalities are usually stated as tail bounds, while maximal inequalities often estimate expectations of suprema. We therefore need a compact way to pass between tails, moments, and expected maxima.
[definition: Orlicz Psi Two Norm]
On a probability space $(\Omega,\mathcal F,\mathbb P)$, the Orlicz $\psi_2$ norm is the functional
\begin{align*}
\|\cdot\|_{\psi_2}:L^0(\Omega)\to[0,\infty],
\end{align*}
defined for a real-valued random variable $Z$ by
\begin{align*}
\|Z\|_{\psi_2}:=\inf\left\{c>0:\mathbb E\left[\exp\left(Z^2/c^2\right)-1\right]\le 1\right\}.
\end{align*}
A random variable $Z\in L^0(\Omega)$ is called sub-Gaussian if $\|Z\|_{\psi_2}<\infty$.
[/definition]
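For a finitely supported random variable the infimum in the definition can be located by bisection, since $c\mapsto\mathbb E[\exp(Z^2/c^2)-1]$ is decreasing in $c$. The sketch below is one possible implementation under that assumption; the function name, the tolerance, and the starting bracket are illustrative choices. For a Rademacher variable, $Z^2=1$ and the defining condition reads $e^{1/c^2}-1\le1$, so the norm is $1/\sqrt{\log 2}$.

```python
import math

def psi2_norm(values, probs, tol=1e-10):
    """Orlicz psi_2 norm of a finitely supported random variable, by bisection."""
    def orlicz_moment(c):
        return sum(p * (math.exp((z / c) ** 2) - 1.0) for z, p in zip(values, probs))

    max_abs = max(abs(z) for z in values)
    if max_abs == 0.0:
        return 0.0
    # At c = max|z| / sqrt(log 2) every term is at most 1, so the moment is at most 1.
    lo, hi = 0.0, max_abs / math.sqrt(math.log(2.0))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if orlicz_moment(mid) <= 1.0:
            hi = mid  # mid is admissible, so the infimum is at most mid
        else:
            lo = mid  # mid is not admissible, so the infimum exceeds mid
    return hi

if __name__ == "__main__":
    print(psi2_norm([-1.0, 1.0], [0.5, 0.5]), 1.0 / math.sqrt(math.log(2.0)))
```

The bisection assumes moderate atom probabilities; for extremely small probabilities the exponential could overflow before the bracket closes.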
The $\psi_2$ norm packages Gaussian-type tails into a single scale parameter. This motivates an increment condition, because chaining controls a supremum by repeatedly comparing nearby indices rather than by estimating each index in isolation.
[definition: Sub-Gaussian Increments]
Let $(T,d)$ be a semimetric space and let $(Z_t)_{t\in T}$ be a real-valued stochastic process on a probability space $(\Omega,\mathcal F,\mathbb P)$, meaning that each $Z_t:\Omega\to\mathbb R$ is measurable. The process has sub-Gaussian increments with respect to $d$ if there exists $K>0$ such that
\begin{align*}
\|Z_s-Z_t\|_{\psi_2}\le K d(s,t)
\end{align*}
for all $s,t\in T$.
[/definition]
Sub-Gaussian increments explain why metric entropy appears in maximal inequalities for genuinely sub-Gaussian processes. To use the same chaining geometry for empirical processes, we need to know what replaces the sub-Gaussian increment estimate for the empirical process $G_n$. Bounded empirical processes have a related but weaker increment structure: the small-deviation part is governed by the $L^2(P)$ distance, while the large-deviation part remembers the envelope through a Bernstein correction. The next theorem is the bridge from the pointwise Bernstein inequality to the metric increments that chaining and entropy arguments can exploit.
[quotetheorem:9840]
[citeproof:9840]
This theorem deliberately has a Bernstein denominator rather than a pure $d(f,g)^2$ sub-Gaussian denominator. The assumptions $f,g\in[0,1]$ imply both a variance bound, through $P(f-g)^2=d(f,g)^2$, and a pointwise increment bound, because $f-g\in[-1,1]$ and centering gives absolute size at most $2$. A general envelope bound $|f|,|g|\le M$ would give the same structure with constants depending on $M$, while control of only one of $f$ and $g$ would not bound the centered difference. Rare Bernoulli increments show why the envelope term cannot be removed: if $h=\mathbf{1}_A$ with $P(A)=p\ll1$, then $\|h\|_{L^2(P)}=p^{1/2}$ but the centered variable has a jump of size close to $1$, so its $\psi_2$ norm is not bounded by a universal constant times $p^{1/2}$. Thus boundedness alone does not produce sub-Gaussian increments at the $L^2(P)$ scale; the envelope term is needed for large deviations. The increment bound still gives the tail information needed for chained suprema, while the main estimates in this chapter are often stated in expectation. This motivates the tail-integration identity, which is the conversion step from high-probability bounds to expected maximal bounds.
[quotetheorem:9841]
[citeproof:9841]
This identity will be used repeatedly after a union bound or a chaining argument has produced a tail estimate. The non-negativity assumption is not cosmetic: if $Z$ takes both signs, then $\int_0^\infty \mathbb P(Z>u)\,du$ computes $\mathbb E[Z^+]$, not $\mathbb E[Z]$. For example, a random variable taking values $1$ and $-1$ with equal probability has mean $0$, while the integral of its positive tail is $1/2$. Thus for a general real-valued random variable one applies the identity to positive and negative parts, or to $|Z|$, and integrability is exactly the finiteness of the resulting tail integral. The shifted Gaussian tail is the right hypothesis after Hoeffding, Bernstein in its quadratic regime, or chaining with sub-Gaussian increments, because integrating $e^{-u^2/b^2}$ contributes a scale of order $b$ rather than a larger polynomial factor. Under heavier tails this conversion changes character: if only $\mathbb P(Z>a+u)\lesssim u^{-p}$ is available, then the expectation contribution is finite only for $p>1$ and is controlled by a tail moment scale rather than by a Gaussian standard-deviation scale. The shifted tail hypothesis only controls the part of the expectation above $a$; it says nothing about concentration below $a$, which is why the conclusion is an expectation bound rather than a two-sided concentration theorem. For finite classes it gives a transparent maximal inequality and motivates replacing finite cardinality by entropy in the infinite-class case.
[example: Finite Class Maximal Bound]
Let $\mathcal F=\{f_1,\dots,f_N\}$ with $f_j:S\to[0,1]$, and set
\begin{align*}
Z:=\max_{1\le j\le N}|(P_n-P)f_j|.
\end{align*}
For each fixed $j$, *[Hoeffding Inequality For Empirical Averages](/theorems/9838)* with $a=0$ and $b=1$ gives
\begin{align*}
\mathbb P((P_n-P)f_j>t)\le e^{-2nt^2}.
\end{align*}
The same theorem applied to the lower tail gives
\begin{align*}
\mathbb P((P-P_n)f_j>t)\le e^{-2nt^2}.
\end{align*}
Since $|(P_n-P)f_j|>t$ is the union of the events $(P_n-P)f_j>t$ and $(P-P_n)f_j>t$, the union bound gives
\begin{align*}
\mathbb P(|(P_n-P)f_j|>t)\le 2e^{-2nt^2}.
\end{align*}
Applying the union bound once more over $j=1,\dots,N$ yields
\begin{align*}
\mathbb P(Z>t)\le \sum_{j=1}^{N}\mathbb P(|(P_n-P)f_j|>t)\le 2Ne^{-2nt^2}.
\end{align*}
Let
\begin{align*}
a:=\sqrt{\frac{\log(2N)}{2n}}.
\end{align*}
Then $2Ne^{-2na^2}=1$. By *[Tail Integration Bound](/theorems/9841)* applied to the nonnegative random variable $Z$,
\begin{align*}
\mathbb E Z=\int_0^\infty \mathbb P(Z>u)\,du.
\end{align*}
Splitting the integral at $a$ gives
\begin{align*}
\mathbb E Z\le a+\int_a^\infty 2Ne^{-2nu^2}\,du.
\end{align*}
With $u=a+v$, the remaining integral is
\begin{align*}
\int_a^\infty 2Ne^{-2nu^2}\,du=\int_0^\infty 2Ne^{-2n(a+v)^2}\,dv.
\end{align*}
Since $(a+v)^2=a^2+2av+v^2\ge a^2+v^2$ for $v\ge0$,
\begin{align*}
2Ne^{-2n(a+v)^2}\le 2Ne^{-2na^2}e^{-2nv^2}=e^{-2nv^2}.
\end{align*}
Therefore
\begin{align*}
\int_a^\infty 2Ne^{-2nu^2}\,du\le \int_0^\infty e^{-2nv^2}\,dv=\frac{\sqrt\pi}{2\sqrt{2n}}.
\end{align*}
Thus
\begin{align*}
\mathbb E\left[\max_{1\le j\le N}|(P_n-P)f_j|\right]\le \sqrt{\frac{\log(2N)}{2n}}+\frac{\sqrt\pi}{2\sqrt{2n}}.
\end{align*}
In particular, this is bounded by a universal constant times $\sqrt{(1+\log N)/n}$, and for $N\ge2$ by a universal constant times $\sqrt{\log N/n}$. The finite-class price is therefore logarithmic in the number of functions, rather than linear in $N$.
[/example]
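The finite-class bound can be compared with a small simulation. In the sketch below, the sample points are assumed to be vectors in $\{0,1\}^N$ with independent fair coordinates and $f_j(x)=x_j$, so $Pf_j=1/2$; this toy model, the seed, the number of repetitions, and the function names are illustrative assumptions rather than part of the example.

```python
import math
import random

def finite_class_bound(n, N):
    """Right-hand side sqrt(log(2N)/(2n)) + sqrt(pi)/(2 sqrt(2n)) from the example."""
    return math.sqrt(math.log(2 * N) / (2 * n)) + math.sqrt(math.pi) / (2 * math.sqrt(2 * n))

def simulated_expected_max(n, N, reps=200, seed=0):
    """Monte Carlo estimate of E max_j |(P_n - P) f_j| in the toy coordinate model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sums = [0] * N
        for _ in range(n):
            for j in range(N):
                sums[j] += rng.getrandbits(1)  # f_j(X_i) is a fair coin flip
        total += max(abs(s / n - 0.5) for s in sums)
    return total / reps

if __name__ == "__main__":
    n, N = 100, 20
    print(simulated_expected_max(n, N), finite_class_bound(n, N))
```

Squaring $N$ only doubles $\log N$, so the bound grows slowly with the class size, in line with the logarithmic price noted at the end of the example.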
## Maximal Inequalities for Bounded Empirical Processes
The finite-class bound suggests that the logarithm of the number of functions should be replaced by an entropy integral. A direct union bound fails for an infinite class because there may be uncountably many functions, and even a countable union can give an infinite sum of identical tail bounds. The main problem is therefore to control a supremum over an infinite class without losing measurability or paying for every point separately.
[definition: Uniform Entropy Integral]
Let $\mathcal F$ be a class of measurable functions with envelope $F:S\to[0,\infty)$. The uniform entropy integral is the functional
\begin{align*}
J:\{(\delta,\mathcal F,F):\delta\in[0,\infty),\ F \text{ is an envelope for }\mathcal F\}\to[0,\infty],
\end{align*}
defined as follows. For $Q$ a finitely discrete probability measure, let
\begin{align*}
N(\cdot,\mathcal F,L^2(Q)):(0,\infty)\to \mathbb N\cup\{\infty\}
\end{align*}
be the covering-number function of $\mathcal F$ in $L^2(Q)$, so that $N(\varepsilon\|F\|_{L^2(Q)},\mathcal F,L^2(Q))$ is the least number of $L^2(Q)$ balls of radius $\varepsilon\|F\|_{L^2(Q)}$ needed to cover $\mathcal F$. Then
\begin{align*}
J(\delta,\mathcal F,F):=\int_0^\delta \sup_Q \sqrt{1+\log N(\varepsilon\|F\|_{L^2(Q)},\mathcal F,L^2(Q))}\,d\varepsilon,
\end{align*}
where the supremum ranges over finitely discrete probability measures $Q$ with $\|F\|_{L^2(Q)}>0$.
[/definition]
Uniform entropy is designed to be distribution-free. This motivates the bounded maximal inequality, where a single entropy integral controls the expected supremum after symmetrisation and conditional chaining.
[quotetheorem:9842]
[citeproof:9842]
The theorem says that a smaller $L^2(P)$ radius improves the expected supremum. Pointwise measurability is included so that the supremum can be handled as a genuine random variable, usually by reducing it to a countable dense subclass. The bounded envelope is also essential for the empirical-radius correction: without a uniform bound, the second symmetrisation step cannot prevent a few large observations from dominating $P_nf^2$. The variance radius $\sigma$ is the quantity that makes localization useful, while the second term records the price of estimating that radius from the sample rather than knowing it deterministically. When the class is not localized, take $\sigma=M$; after dividing by $\sqrt n$, this recovers a global bound of order $M J(1,\mathcal F,F)/\sqrt n$ for $\mathbb E\sup_{f\in\mathcal F}|(P_n-P)f|$.
[example: Uniform Risk Bound For Bounded Loss]
Let $\mathcal G$ be a class of predictors, let each loss satisfy $\ell_g:S\to[0,1]$, and set $\mathcal F=\{\ell_g:g\in\mathcal G\}$. The envelope of $\mathcal F$ is the constant function $1$, so $M=1$. Also, since $0\le \ell_g\le 1$, we have $\ell_g^2\le \ell_g\le 1$ pointwise, hence
\begin{align*}
P\ell_g^2\le P1=1.
\end{align*}
Thus the variance radius in *[Maximal Inequality For Bounded Empirical Processes](/theorems/9842)* may be taken to be $\sigma=1$.
For $f=\ell_g$, the empirical process is
\begin{align*}
G_n f=\sqrt n(P_n-P)f.
\end{align*}
Therefore
\begin{align*}
\sup_{g\in\mathcal G}|(P_n-P)\ell_g|=\frac{1}{\sqrt n}\sup_{f\in\mathcal F}|G_n f|.
\end{align*}
Taking expectations gives
\begin{align*}
\mathbb E\left[\sup_{g\in\mathcal G}|(P_n-P)\ell_g|\right]=\frac{1}{\sqrt n}\mathbb E\left[\sup_{f\in\mathcal F}|G_n f|\right].
\end{align*}
Applying the maximal inequality with $M=1$, $\sigma=1$, and envelope $F=1$ gives
\begin{align*}
\mathbb E\left[\sup_{f\in\mathcal F}|G_n f|\right]\le C\left\{J(1,\mathcal F,1)+\frac{J(1,\mathcal F,1)^2}{\sqrt n}\right\}.
\end{align*}
Substituting this into the previous display yields
\begin{align*}
\mathbb E\left[\sup_{g\in\mathcal G}|(P_n-P)\ell_g|\right]\le C\left\{\frac{J(1,\mathcal F,1)}{\sqrt n}+\frac{J(1,\mathcal F,1)^2}{n}\right\}.
\end{align*}
Equivalently,
\begin{align*}
\mathbb E\left[\sup_{g\in\mathcal G}|(P_n-P)\ell_g|\right]\lesssim \frac{J(1,\mathcal F,1)}{\sqrt n}+\frac{J(1,\mathcal F,1)^2}{n}.
\end{align*}
Thus a finite entropy integral makes empirical risks uniformly close to their true risks in expectation, with the leading scale $J(1,\mathcal F,1)/\sqrt n$ and a smaller empirical-radius correction $J(1,\mathcal F,1)^2/n$.
[/example]
For VC-type classes, the uniform entropy is logarithmic in the inverse radius, so the entropy integral can be bounded explicitly. The next example carries out this computation and prepares the later use of VC and bracketing estimates inside the same maximal-inequality template.
[example: VC Type Entropy In A Bounded Class]
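Let $\mathcal F$ be a pointwise measurable class with envelope $F\le 1$, and assume the VC-type entropy hypothesis: there are constants $A\ge e$ and $v\ge 1$ such that $\sup_Q N(\varepsilon\|F\|_{L^2(Q)},\mathcal F,L^2(Q))\le (A/\varepsilon)^v$ for all $0<\varepsilon\le 1$, where the supremum ranges over finitely discrete probability measures $Q$. Assume $0<\delta\le 1$. By the definition of the uniform entropy integral,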
\begin{align*}
J(\delta,\mathcal F,F)=\int_0^\delta \sup_Q \sqrt{1+\log N(\varepsilon\|F\|_{L^2(Q)},\mathcal F,L^2(Q))}\,d\varepsilon.
\end{align*}
For each $0<\varepsilon\le\delta$, the covering-number bound gives
\begin{align*}
\sup_Q \log N(\varepsilon\|F\|_{L^2(Q)},\mathcal F,L^2(Q))\le v\log(A/\varepsilon).
\end{align*}
Since $A\ge e$, $v\ge 1$, and $\varepsilon\le 1$, we have $\log(A/\varepsilon)\ge 1$, hence
\begin{align*}
1+v\log(A/\varepsilon)\le 2v\log(A/\varepsilon).
\end{align*}
Therefore
\begin{align*}
J(\delta,\mathcal F,F)\le \sqrt{2v}\int_0^\delta \sqrt{\log(A/\varepsilon)}\,d\varepsilon.
\end{align*}
To bound the remaining integral, set $\varepsilon=\delta s$. Then
\begin{align*}
\int_0^\delta \sqrt{\log(A/\varepsilon)}\,d\varepsilon=\delta\int_0^1 \sqrt{\log(A/\delta)+\log(1/s)}\,ds.
\end{align*}
Using $\sqrt{x+y}\le \sqrt{x}+\sqrt{y}$ for $x,y\ge0$,
\begin{align*}
\int_0^\delta \sqrt{\log(A/\varepsilon)}\,d\varepsilon\le \delta\sqrt{\log(A/\delta)}+\delta\int_0^1\sqrt{\log(1/s)}\,ds.
\end{align*}
For the last integral, put $t=\log(1/s)$, so $s=e^{-t}$ and $ds=-e^{-t}\,dt$. Thus
\begin{align*}
\int_0^1\sqrt{\log(1/s)}\,ds=\int_0^\infty t^{1/2}e^{-t}\,dt=\Gamma(3/2)=\frac{\sqrt\pi}{2}.
\end{align*}
Because $\log(A/\delta)\ge1$, this gives
\begin{align*}
\int_0^\delta \sqrt{\log(A/\varepsilon)}\,d\varepsilon\le \left(1+\frac{\sqrt\pi}{2}\right)\delta\sqrt{\log(A/\delta)}.
\end{align*}
Consequently, for a universal constant $C_1$,
\begin{align*}
J(\delta,\mathcal F,F)\le C_1\delta\sqrt{v\log(A/\delta)}.
\end{align*}
Now suppose $P f^2\le\sigma^2$ for every $f\in\mathcal F$, with $0<\sigma\le1$. Since $F\le1$, the maximal inequality for bounded empirical processes applies with $M=1$ and gives
\begin{align*}
\mathbb E\left[\sup_{f\in\mathcal F}|G_n f|\right]\le C\left\{J(\sigma,\mathcal F,F)+\frac{J(\sigma,\mathcal F,F)^2}{\sigma^2\sqrt n}\right\}.
\end{align*}
Using $G_n f=\sqrt n(P_n-P)f$,
\begin{align*}
\mathbb E\left[\sup_{f\in\mathcal F}|(P_n-P)f|\right]=\frac{1}{\sqrt n}\mathbb E\left[\sup_{f\in\mathcal F}|G_n f|\right].
\end{align*}
Substituting the entropy estimate with $\delta=\sigma$ yields
\begin{align*}
\frac{J(\sigma,\mathcal F,F)}{\sqrt n}\le C_1\sigma\sqrt{\frac{v\log(A/\sigma)}{n}}.
\end{align*}
For the correction term,
\begin{align*}
\frac{J(\sigma,\mathcal F,F)^2}{\sigma^2 n}\le C_1^2\frac{\sigma^2 v\log(A/\sigma)}{\sigma^2 n}=C_1^2\frac{v\log(A/\sigma)}{n}.
\end{align*}
Thus
\begin{align*}
\mathbb E\left[\sup_{f\in\mathcal F}|(P_n-P)f|\right]\le C_2\left\{\sigma\sqrt{\frac{v\log(A/\sigma)}{n}}+\frac{v\log(A/\sigma)}{n}\right\}.
\end{align*}
The leading term is the localized standard-error scale $\sigma\sqrt{v\log(A/\sigma)/n}$; the second term is the empirical-radius correction coming from the bounded maximal inequality.
[/example]
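The key analytic step in the example is the bound $\int_0^\delta\sqrt{\log(A/\varepsilon)}\,d\varepsilon\le(1+\sqrt\pi/2)\,\delta\sqrt{\log(A/\delta)}$. The sketch below compares a midpoint-rule approximation of the left-hand side with the right-hand side; the choice $A=e$, the grid size, and the function names are assumptions made only for this check.

```python
import math

def log_entropy_integral(delta, A, steps=100_000):
    """Midpoint-rule approximation of int_0^delta sqrt(log(A/eps)) d eps."""
    width = delta / steps
    return width * sum(
        math.sqrt(math.log(A / ((i + 0.5) * width))) for i in range(steps)
    )

def closed_form_bound(delta, A):
    """Upper bound (1 + sqrt(pi)/2) * delta * sqrt(log(A/delta)) from the example."""
    return (1.0 + math.sqrt(math.pi) / 2.0) * delta * math.sqrt(math.log(A / delta))

if __name__ == "__main__":
    A = math.e  # smallest constant allowed by the hypothesis A >= e
    for delta in (1.0, 0.5, 0.1, 0.01):
        print(delta, log_entropy_integral(delta, A), closed_form_bound(delta, A))
```

As $\delta$ decreases, both columns shrink roughly like $\delta\sqrt{\log(1/\delta)}$, which is the localized scale that appears in the final display of the example.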
## Local Maximal Inequalities and Peeling
Global bounds treat every function in the class as if it had the same variance or risk. Local theory asks for a sharper statement on shells where the functions are small in $L^2(P)$, excess risk, or another problem-specific radius.
[definition: Localized Function Class]
Let $\mathcal F$ be a class of measurable functions and let $r>0$. The $L^2(P)$-localized class at radius $r$ is
\begin{align*}
\mathcal F(r):=\{f\in\mathcal F: P f^2\le r^2\}.
\end{align*}
[/definition]
The previous maximal inequality immediately applies to $\mathcal F(r)$. This motivates recording the localized form explicitly, because statistical applications use it with $r$ determined by variance or excess risk.
[quotetheorem:9843]
[citeproof:9843]
A localized estimate controls one radius at a time. Without boundedness, the same radius may still contain functions with rare but enormous values, so the displayed inequality would no longer follow from the bounded maximal inequality. Without pointwise measurability, the supremum over $\mathcal F(r)$ may fail to be measurable; a nonmeasurable index class can produce an outer supremum that is not an ordinary random variable, so the displayed expectation is not defined in the usual sense. The restriction $r\in(0,M]$ is also structural. At $r=0$ the correction term contains $r^{-2}$ and the class must be handled separately as functions with $P f^2=0$; for $r>M$, the localization condition is no stronger than the global envelope bound and the theorem should be read with radius $M$. Without localization, the theorem falls back to the global radius $M$ and loses the faster scale available near small-variance functions; for instance, indicators of rare events are then treated like indicators with probability near $1/2$. The result is also an expectation bound, not by itself a high-probability oracle inequality. This motivates dyadic peeling, which turns radius-by-radius bounds into a single statement over the full function class.
[definition: Dyadic Peeling]
Let $\mathcal F$ be a class equipped with a finite non-negative size functional $R:\mathcal F\to[0,\infty)$. A dyadic peeling of $\mathcal F$ above radius $r_0>0$ is the decomposition
\begin{align*}
\mathcal F = \{f\in\mathcal F:R(f)\le r_0\}\cup \bigcup_{k=0}^{\infty}\{f\in\mathcal F:2^k r_0< R(f)\le 2^{k+1}r_0\}.
\end{align*}
[/definition]
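The decomposition assigns every function to exactly one piece, determined by its size $R(f)$. A minimal sketch of that assignment, with the hypothetical convention that the base set $\{R\le r_0\}$ is labelled $-1$ and the shell $\{2^k r_0<R\le 2^{k+1}r_0\}$ is labelled $k$:

```python
import math

def shell_index(size, r0):
    """Return -1 if size <= r0, else the k with 2^k r0 < size <= 2^(k+1) r0."""
    if size <= r0:
        return -1
    k = math.ceil(math.log2(size / r0)) - 1
    # Guard against floating-point rounding near the shell boundaries.
    while size > 2 ** (k + 1) * r0:
        k += 1
    while k > 0 and size <= 2 ** k * r0:
        k -= 1
    return k

if __name__ == "__main__":
    for size in (0.5, 1.0, 1.5, 2.0, 3.0, 8.0):
        print(size, shell_index(size, 1.0))
```

In a peeling argument the label $k$ selects the threshold used on that shell, and the shell failure probabilities are then summed over $k\ge0$.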
A peeling argument is useful when the desired bound contains the radius itself. This motivates the following principle: prove the right concentration estimate on each shell, then sum the shell failure probabilities.
[quotetheorem:9844]
[citeproof:9844]
The monotonicity of $A(r)$ is what allows a shell to be controlled by the larger endpoint radius rather than by a separate bound for every function in the shell. The Bernstein form of the shell tail is also important: it gives summable probabilities when the thresholds include both a square-root variance term and a linear boundedness term. Infinite peeling can fail if the shell penalties are not summable, or if the class has no finite outer radius and the chosen thresholds grow too slowly. The most common statistical use is to control excess risks by their own deterministic size. A variance condition converts risk radius into $L^2(P)$ radius, and the local maximal inequality supplies the stochastic part.
[example: Localized Excess Risk Estimate]
Let $\mathcal G$ be a class of predictors with losses $\ell_g:S\to[0,1]$, let $g_0\in\mathcal G$ be a risk minimizer, and for each $g\in\mathcal G$ define the excess loss
\begin{align*}
f_g=\ell_g-\ell_{g_0}.
\end{align*}
Since $0\le \ell_g\le 1$ and $0\le \ell_{g_0}\le 1$, we have $-1\le f_g\le 1$, so the excess-loss class has envelope $1$. Suppose the Bernstein condition
\begin{align*}
P f_g^2\le C_0 P f_g
\end{align*}
holds for every $g\in\mathcal G$. On the excess-risk shell $P f_g\le r$, this gives
\begin{align*}
P f_g^2\le C_0 P f_g\le C_0 r.
\end{align*}
Therefore
\begin{align*}
\|f_g\|_{L^2(P)}=(P f_g^2)^{1/2}\le (C_0r)^{1/2}.
\end{align*}
Set
\begin{align*}
\mathcal F_r:=\{f_g:g\in\mathcal G,\ P f_g\le r\}.
\end{align*}
The previous display shows that $\mathcal F_r\subseteq \mathcal F((C_0r)^{1/2})$, where localization is in $L^2(P)$. Hence, when $0<C_0r\le 1$, applying *[Local Rademacher Bound](/theorems/9843)* with $M=1$ and radius $(C_0r)^{1/2}$ gives
\begin{align*}
\mathbb E\left[\sup_{P f_g\le r}|(P_n-P)f_g|\right]\le C\left\{\frac{1}{\sqrt n}J\left((C_0r)^{1/2},\mathcal F,1\right)+\frac{1}{nC_0r}J\left((C_0r)^{1/2},\mathcal F,1\right)^2\right\}.
\end{align*}
Thus the stochastic fluctuation on the shell is controlled by the entropy integral at the smaller radius $(C_0r)^{1/2}$ rather than at the global radius $1$. Small excess risk therefore forces a small $L^2(P)$ radius, which is the basic mechanism behind localized oracle inequalities.
[/example]
The localized excess-risk example shows how the technical inequalities become rate statements for learning problems. This motivates the closing perspective: maximal inequalities are reusable infrastructure rather than isolated endpoints.
[remark: Maximal Inequalities As Course Infrastructure]
The chapter's inequalities are not final statistical conclusions by themselves. They are tools that will be combined with entropy estimates, Donsker and Glivenko-Cantelli criteria, and argmax or argmin arguments in later chapters. They also connect empirical process theory to concentration of measure, Banach-space geometry through entropy and chaining, and nonparametric statistics through localized risk bounds. The important structural message is that global complexity controls uniform laws, while local complexity controls rates near the target.
[/remark]
# 8. Chaining Methods
This chapter explains how chaining turns many local approximations into a global bound for the supremum of a stochastic process. In earlier chapters, entropy entered through finite approximations and union bounds; here the approximation is performed at many resolutions and then added across scales. The guiding question is how the geometry of an index set, measured by covering numbers, controls oscillations of Gaussian, Rademacher, and empirical processes.
## Successive Nets and Dudley's Entropy Integral
The basic problem is to bound $\mathbb E[\sup_{t \in T} X_t]$ when the index set $T$ is too large for a single union bound. A finite net at one scale controls the coarse location of $t$, but it loses the information that nearby indices have strongly correlated process values. Chaining repairs this by approximating every $t$ through a sequence of increasingly fine representatives and writing $X_t$ as a telescoping sum of increments.
Let $(T,d)$ be a metric space. For $\varepsilon>0$, write $N(T,d,\varepsilon)$ for the smallest cardinality of an $\varepsilon$-net of $T$ in the metric $d$. To organize a telescoping argument, we first need a formal name for the family of nets used across scales.
[definition: Admissible Successive Nets]
A sequence $(T_k)_{k\ge 0}$ of finite subsets of $T$ is a system of admissible successive nets at scales $(\varepsilon_k)_{k\ge 0}$ if $T_k$ is an $\varepsilon_k$-net of $T$ for every $k\ge 0$, where $\varepsilon_k \downarrow 0$.
[/definition]
The definition records the approximation structure used by the chain. To see what the projections do in a familiar case, it is helpful to look at dyadic approximations of an interval before adding probability.
[example: Dyadic Projection Chain]
Let $T=[0,1]$ with $d(s,t)=|s-t|$, and let $T_k=\{j2^{-k}:0\le j\le 2^k\}$. For each $t\in[0,1]$, choose $\pi_k(t)\in T_k$ so that $|\pi_k(t)-t|=\min_{u\in T_k}|u-t|$, breaking ties arbitrarily. Since consecutive points of $T_k$ are spaced by $2^{-k}$, every $t$ lies within distance $2^{-(k+1)}$ of some point of $T_k$, so
\begin{align*}
|\pi_k(t)-t|\le 2^{-(k+1)}.
\end{align*}
For $k\ge1$, the triangle inequality gives
\begin{align*}
|\pi_k(t)-\pi_{k-1}(t)|\le |\pi_k(t)-t|+|t-\pi_{k-1}(t)|.
\end{align*}
Using the two projection bounds at levels $k$ and $k-1$,
\begin{align*}
|\pi_k(t)-\pi_{k-1}(t)|\le 2^{-(k+1)}+2^{-k}=\frac{3}{2}2^{-k}.
\end{align*}
Also $|\pi_k(t)-t|\le 2^{-(k+1)}\to0$, so $\pi_k(t)\to t$. Thus the dyadic chain approximates $t$ more accurately at each level, while its level-$k$ correction has size controlled by the mesh scale $2^{-k}$.
[/example]
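The two bounds in the dyadic example can be checked numerically. The sketch below assumes ties are broken by Python's built-in `round`; the function names and the test points are hypothetical. It reports, for each $t$, the largest ratio of the projection error to $2^{-(k+1)}$ and of the link length to $\frac32 2^{-k}$ along the chain.

```python
def dyadic_projection(t, k):
    """Nearest point of T_k = {j 2^{-k} : 0 <= j <= 2^k} to t in [0, 1]."""
    mesh = 2.0 ** (-k)
    j = round(t / mesh)          # nearest grid index (ties broken by round)
    j = min(max(j, 0), 2 ** k)   # stay inside the grid
    return j * mesh


def check_chain(t, depth):
    """Largest ratios of the two errors to their claimed bounds along the chain."""
    worst_proj, worst_link = 0.0, 0.0
    prev = dyadic_projection(t, 0)
    for k in range(depth + 1):
        cur = dyadic_projection(t, k)
        # Projection bound: |pi_k(t) - t| <= 2^{-(k+1)}.
        worst_proj = max(worst_proj, abs(cur - t) / 2.0 ** (-(k + 1)))
        if k >= 1:
            # Link bound: |pi_k(t) - pi_{k-1}(t)| <= (3/2) 2^{-k}.
            worst_link = max(worst_link, abs(cur - prev) / (1.5 * 2.0 ** (-k)))
        prev = cur
    return worst_proj, worst_link


points = [0.0, 0.1, 1.0 / 3.0, 0.5, 0.7071, 1.0]   # hypothetical test points
results = {t: check_chain(t, 20) for t in points}
```

Both ratios stay at or below $1$ whenever the bounds hold, so a value above $1$ would signal a mistake in the projection rule rather than in the geometry.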
This example contains the geometric mechanism: the value $X_t$ is compared with $X_{\pi_0(t)}$ plus a sum of small increments. To turn this geometry into a probabilistic bound, we need an increment condition saying that small metric distance produces strong tail decay.
[definition: Sub-Gaussian Process]
A real-valued stochastic process $(X_t)_{t\in T}$ is sub-Gaussian with respect to a metric $d$ on $T$ if there exists a constant $K>0$ such that for all $s,t\in T$ and all $u\ge0$,
\begin{align*}
\mathbb P(|X_t-X_s|\ge u) \le 2\exp\left(-\frac{u^2}{K^2d(s,t)^2}\right).
\end{align*}
[/definition]
The constant $K$ measures the scale of the increments. For Gaussian processes, the increment tail is determined by the variance of the difference, so the next task is to identify the metric that records those variances.
[definition: Canonical Gaussian Semimetric]
For a centred Gaussian process $(G_t)_{t\in T}$, the canonical semimetric is the map
\begin{align*}
d_G:T\times T&\to[0,\infty), & d_G(s,t)&=\left(\mathbb E[(G_s-G_t)^2]\right)^{1/2}.
\end{align*}
[/definition]
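For a centred Gaussian process the increment $G_t-G_s$ is $N(0,d_G(s,t)^2)$, so its two-sided tail equals $\operatorname{erfc}\bigl(u/(\sqrt2\,d_G(s,t))\bigr)$. Combined with the standard estimate $\operatorname{erfc}(x)\le e^{-x^2}$ for $x\ge0$, this suggests that the sub-Gaussian condition holds with respect to $d_G$ with $K=\sqrt2$. The following sketch checks this on an arbitrary grid of values; the grid and the helper names are assumptions made for illustration.

```python
import math


def gaussian_two_sided_tail(u, sigma):
    """P(|Z| >= u) for Z ~ N(0, sigma^2), via the complementary error function."""
    return math.erfc(u / (sigma * math.sqrt(2.0)))


def sub_gaussian_bound(u, dist, K):
    """Right-hand side 2 exp(-u^2 / (K^2 d^2)) from the definition."""
    return 2.0 * math.exp(-(u ** 2) / (K ** 2 * dist ** 2))


K = math.sqrt(2.0)  # candidate constant for the canonical semimetric

# Collect every grid point at which the exact tail exceeds the bound.
violations = [
    (u, sigma)
    for sigma in (0.1, 1.0, 5.0)
    for u in [0.05 * j for j in range(201)]
    if gaussian_two_sided_tail(u, sigma) > sub_gaussian_bound(u, sigma, K)
]
```

The constant cannot be lowered to $K=1$: at $u=3\sigma$ the exact Gaussian tail already exceeds $2e^{-9}$.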
With this metric, $G_s$ and $G_t$ are close exactly when their difference has small variance. We can now state the first chaining theorem, which turns the covering numbers of $(T,d_G)$ into an upper bound for the expected supremum.
[quotetheorem:9845]
[citeproof:9845]
Dudley's bound is a template for reading geometry probabilistically, and each hypothesis removes a concrete obstruction. Total boundedness is needed before the first net can be chosen at every positive scale: for instance, if $T=\mathbb N$ with $d(m,n)=1$ for $m\ne n$, then $N(T,d,1/2)=\infty$ and the chaining construction has no finite coarse approximation. Separability is not a size assumption but a measurability assumption; without it, a supremum over an uncountable Gaussian family may fail to be a measurable random variable. A standard obstruction is obtained by taking $T$ to be an uncountable set with the discrete semimetric and independent nondegenerate Gaussian coordinates; the event $\{\sup_{t\in T}G_t\le a\}$ is governed by an uncountable intersection and need not be measurable in the product completion. Separability replaces the index set by a countable dense skeleton, so the same supremum becomes the supremum of a sequence of measurable random variables. The theorem is also only an upper bound: it does not assert that divergence of the entropy integral forces unbounded sample paths, nor does it identify the exact supremum size in every geometry. The scale $\varepsilon$ contributes the local radius $d\varepsilon$ and the combinatorial price $\sqrt{\log N(T,d_G,\varepsilon)}$. The following example shows how a large finite index set can still have bounded supremum when its metric geometry is hierarchical.
[example: Supremum on a Finite Metric Tree]
Let $T$ be the set of leaves of a rooted binary tree of depth $m$. Put $d(t,t)=0$, and for distinct leaves set $d(s,t)=2^{-\ell(s,t)}$, where $\ell(s,t)$ is the level of their last common ancestor. Suppose the centred Gaussian process $(G_t)_{t\in T}$ has canonical metric $d_G$ comparable to $d$, so there are constants $a,b>0$ such that $a d(s,t)\le d_G(s,t)\le b d(s,t)$ for all $s,t\in T$.
At level $k$, the tree has $2^k$ vertices, and each such vertex determines the set of leaves below it. If two leaves lie below the same level-$k$ vertex, then their last common ancestor has level at least $k$, hence their $d$-distance is at most $2^{-k}$. Therefore the $2^k$ descendant sets give a cover at radius $2^{-k}$, so
\begin{align*}
N(T,d,2^{-k})\le 2^k.
\end{align*}
Conversely, choose one leaf below each level-$k$ vertex. Two leaves chosen from different level-$k$ vertices have last common ancestor at level at most $k-1$, so their distance is at least $2^{-(k-1)}=2\cdot2^{-k}$. A ball of radius less than $2^{-k}$ has diameter less than $2\cdot2^{-k}$ and therefore contains at most one of these $2^k$ representatives, giving
\begin{align*}
N(T,d,c2^{-k})\ge 2^k
\end{align*}
for a universal constant $c>0$. Hence, up to universal constants,
\begin{align*}
\log N(T,d,\varepsilon)\asymp k
\end{align*}
when $2^{-(k+1)}<\varepsilon\le 2^{-k}$ and $1\le k\le m$.
Using the *Dudley Chaining Bound* and the comparability of $d_G$ with $d$, the entropy integral is controlled by dyadic annuli:
\begin{align*}
\int_0^1\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \asymp \sum_{k=1}^{m}\int_{2^{-(k+1)}}^{2^{-k}}\sqrt{k}\,d\varepsilon.
\end{align*}
For each $k$,
\begin{align*}
\int_{2^{-(k+1)}}^{2^{-k}}\sqrt{k}\,d\varepsilon=\sqrt{k}\left(2^{-k}-2^{-(k+1)}\right)=2^{-(k+1)}\sqrt{k}.
\end{align*}
Thus
\begin{align*}
\int_0^1\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \asymp \sum_{k=1}^{m}2^{-k}\sqrt{k}.
\end{align*}
The infinite series $\sum_{k\ge1}2^{-k}\sqrt{k}$ converges: since $k\le 2^k$, we have $\sqrt{k}\le 2^{k/2}$ for every $k\ge1$, so the series is dominated by the geometric series $\sum_{k\ge1}2^{-k/2}$. Therefore the Dudley upper bound is bounded uniformly in $m$: adding more leaves does not by itself force a large Gaussian supremum when the tree metric makes the new branches close at fine scales.
[/example]
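The uniform boundedness in $m$ can be seen numerically from the partial sums $\sum_{k=1}^m2^{-k}\sqrt k$. The sketch below only evaluates this sum and the geometric majorant $\sum_{k\ge1}2^{-k/2}=1/(\sqrt2-1)$; it ignores the implicit constants in $\asymp$, and the chosen depths are arbitrary.

```python
import math


def dudley_tree_sum(m):
    """Partial sum sum_{k=1}^m 2^{-k} sqrt(k) from the dyadic-annulus estimate."""
    return sum(2.0 ** (-k) * math.sqrt(k) for k in range(1, m + 1))


# Geometric majorant sum_{k>=1} 2^{-k/2} = 1 / (sqrt(2) - 1).
majorant = 1.0 / (math.sqrt(2.0) - 1.0)

depths = [1, 2, 5, 10, 20, 50]          # hypothetical tree depths
partial_sums = [dudley_tree_sum(m) for m in depths]
```

The partial sums increase with the depth but stabilise quickly, while the number of leaves $2^m$ grows without bound.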
The theorem also explains why entropy assumptions often appear in empirical process theory. A class may be infinite, but if it can be covered efficiently at all resolutions in the natural $L^2(P)$ or covariance metric, then its process can still have a controlled supremum.
## Boundedness Criteria and the Generic Chaining Viewpoint
Dudley's integral is powerful, but it is not the final geometric answer. The next question asks whether covering numbers alone, summed in this particular way, always capture the correct size of a Gaussian supremum. Generic chaining refines the construction by assigning different approximation budgets to different parts of the space rather than forcing uniform nets at each scale.
To formulate that refinement, the uniform condition "$T_k$ covers all of $T$ at radius $\varepsilon_k$" is replaced by a cardinality budget. This lets each point pay for its own approximation error at level $k$.
[definition: Admissible Sequence]
An admissible sequence for a metric space $(T,d)$ is a sequence $(A_k)_{k\ge0}$ of subsets of $T$ such that $|A_0|=1$ and
\begin{align*}
|A_k|\le 2^{2^k}\quad\text{for all }k\ge1.
\end{align*}
[/definition]
The cardinality budget grows very fast with $k$, so fine scales are allowed many representatives. A single covering radius at each scale is too crude for spaces with uneven geometry: some points may need fine approximation earlier than others. The functional below measures the worst accumulated approximation cost along an admissible sequence, with each scale weighted according to the Gaussian size expected from its cardinality budget.
[definition: Gamma Two Functional]
On the class $\mathrm{Met}$ of metric spaces, the $\gamma_2$ functional is the map
\begin{align*}
\gamma_2:\mathrm{Met}&\to[0,\infty], & (T,d)&\mapsto \gamma_2(T,d),
\end{align*}
defined by
\begin{align*}
\gamma_2(T,d)=\inf_{(A_k)}\sup_{t\in T}\sum_{k=0}^{\infty}2^{k/2}d(t,A_k),
\end{align*}
where the infimum is over all admissible sequences and $d(t,A_k)=\inf_{a\in A_k}d(t,a)$.
[/definition]
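Since $\gamma_2$ is an infimum, any single admissible sequence gives an upper bound. The following sketch evaluates the chaining cost of one such sequence on a finite subset of $[0,1]$ with $d(s,t)=|s-t|$. The evenly spaced choice of $A_k$ and the names `admissible_sequence` and `chaining_cost` are assumptions for illustration; nothing here claims that this sequence is optimal.

```python
def admissible_sequence(points):
    """Build one admissible sequence (A_k) on a sorted finite subset of the line.

    A_0 is a single point; A_k holds at most 2^(2^k) roughly evenly
    spaced points, and equals the whole set once the budget allows it.
    """
    n = len(points)
    seq = [[points[n // 2]]]          # |A_0| = 1
    k = 1
    while len(seq[-1]) < n:
        budget = 2 ** (2 ** k)
        if budget >= n:
            seq.append(list(points))
        else:
            # At most `budget` evenly spaced indices between 0 and n - 1.
            idx = sorted({round(j * (n - 1) / (budget - 1)) for j in range(budget)})
            seq.append([points[i] for i in idx])
        k += 1
    return seq


def chaining_cost(points, seq):
    """sup_t sum_k 2^{k/2} d(t, A_k) for the given sequence, with d(s, t) = |s - t|."""
    def dist_to_set(t, A):
        return min(abs(t - a) for a in A)
    return max(
        sum(2.0 ** (k / 2.0) * dist_to_set(t, A) for k, A in enumerate(seq))
        for t in points
    )


T = [i / 999 for i in range(1000)]     # hypothetical index set in [0, 1]
seq = admissible_sequence(T)
cost = chaining_cost(T, seq)           # an upper bound for gamma_2(T, |.|)
```

Once $A_k$ equals the whole finite set, all later terms $d(t,A_k)$ vanish, which is why the infinite sum in the definition reduces to finitely many terms here.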
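The weight $2^{k/2}$ matches Gaussian maxima over $2^{2^k}$ points: the expected maximum of $N$ centred Gaussian variables with unit variance is at most of order $\sqrt{\log N}$, and $\sqrt{\log 2^{2^k}}=2^{k/2}\sqrt{\log 2}$. Before using $\gamma_2$ as a refinement, we need to connect it back to the entropy integral already obtained from Dudley's proof.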
[quotetheorem:9846]
[citeproof:9846]
The theorem shows that Dudley's integral is a uniform-net upper bound for the more flexible chaining functional, and its assumptions are exactly the ones needed for that comparison. Total boundedness guarantees finite covering numbers at the radii used to build the admissible sets; on a discrete infinite space with mutual distances bounded below, the entropy integral and the net construction both break at the first small scale. The conclusion is one-way: a small entropy integral forces a small $\gamma_2$, but a large entropy integral does not rule out a smaller $\gamma_2$, because admissible sequences may allocate many representatives only where the space actually needs them. The limitation of the entropy integral is also visible from the proof: the same net is forced to serve all points at a fixed scale, even if different regions of $T$ have very different local complexity. The natural next question is whether the refined quantity $\gamma_2$ gives the correct order of a Gaussian supremum, not only an upper estimate.
[quotetheorem:9847]
This is Talagrand's [majorizing measure theorem](/theorems/9847), also called the generic chaining theorem, and it is used here as an external structural input. Total boundedness keeps $\gamma_2(T,d_G)$ tied to genuine finite approximations; separability again ensures that the oscillation supremum is the measurable supremum over a countable dense skeleton. The theorem does not give a computable formula for the optimal admissible sequence, and it does not replace entropy estimates when only rough covering information is available. Its upper bound continues the chaining principle already seen in Dudley's argument, while the lower bound is the deep direction: it says that no Gaussian process can have expected oscillation substantially smaller than the best admissible-chain cost.
For applications in this course, the entropy side gives a practical boundedness consequence that can be checked directly from covering numbers. The next result packages this consequence as a sample-path boundedness criterion rather than merely an expectation estimate, so it also has to address the completion of the semimetric space and the identification of points at zero canonical distance.
[quotetheorem:9848]
This criterion converts a deterministic entropy calculation into almost sure boundedness of the process, but each clause is doing work. The quotient by zero distance is necessary because two labels with $d_G(s,t)=0$ have $G_s-G_t=0$ in $L^2$ and hence should represent the same stochastic coordinate after modification. The completion is needed because chaining constructs limits of approximating values; without completing the semimetric space, the limiting index may not belong to the original label set. Separability prevents the boundedness statement from depending on a nonmeasurable supremum. The conclusion is also not a converse: failure of the entropy integral does not by itself prove almost sure unboundedness. Upper bounds alone do not say whether large entropy at one scale must create a large supremum, so the matching single-scale obstruction is supplied by [Sudakov minoration](/theorems/9849).
[quotetheorem:9849]
Sudakov's inequality gives the opposite pressure: a large separated set at any single scale forces the Gaussian supremum to be large. Separability is retained for the same measurability reason as above, and the use of separated sets is essential because covering numbers alone may count balls whose centres do not produce well-spaced Gaussian variables. The theorem gives only a lower bound at a chosen scale; it does not say that a large separated set is the only way a supremum becomes large, and it does not provide an upper estimate. This lower bound explains why the exponent in Dudley's integral is a genuine boundary for entropy methods.
[example: Critical Entropy Exponent]
Suppose there are constants $0<c\le C<\infty$ and $\varepsilon_0>0$ such that, for $0<\varepsilon\le\varepsilon_0$,
\begin{align*}
c\varepsilon^{-2}\le \log N(T,d,\varepsilon)\le C\varepsilon^{-2}.
\end{align*}
Taking square roots gives
\begin{align*}
\sqrt{c}\,\varepsilon^{-1}\le \sqrt{\log N(T,d,\varepsilon)}\le \sqrt{C}\,\varepsilon^{-1}.
\end{align*}
Hence the lower part of Dudley's entropy integral satisfies, for $0<\delta<\varepsilon_0$,
\begin{align*}
\int_\delta^{\varepsilon_0}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon\ge \sqrt{c}\int_\delta^{\varepsilon_0}\varepsilon^{-1}\,d\varepsilon.
\end{align*}
The elementary antiderivative of $\varepsilon^{-1}$ is $\log \varepsilon$, so
\begin{align*}
\int_\delta^{\varepsilon_0}\varepsilon^{-1}\,d\varepsilon=\log \varepsilon_0-\log \delta=\log(\varepsilon_0/\delta).
\end{align*}
As $\delta\downarrow0$, $\log(\varepsilon_0/\delta)\to\infty$, and therefore the entropy integral diverges at the origin.
This divergence does not prove that every Gaussian process with canonical metric $d$ is unbounded; it only says that the Dudley entropy method cannot give a finite upper bound from this entropy estimate alone. If a centred separable Gaussian process has canonical metric $d$, then *Sudakov Minoration* at scale $\varepsilon$ gives
\begin{align*}
\mathbb E\sup_{t\in T}G_t\ge c_0\varepsilon\sqrt{\log M(T,d,\varepsilon)}.
\end{align*}
Here $M(T,d,\varepsilon)$ is the largest cardinality of an $\varepsilon$-separated subset of $T$. Since a maximal $\varepsilon$-separated set is an $\varepsilon$-net, $N(T,d,\varepsilon)\le M(T,d,\varepsilon)$, and hence
\begin{align*}
\log M(T,d,\varepsilon)\ge \log N(T,d,\varepsilon)\ge c\varepsilon^{-2}.
\end{align*}
Substituting this into Sudakov's bound gives
\begin{align*}
\mathbb E\sup_{t\in T}G_t\ge c_0\varepsilon\sqrt{c\varepsilon^{-2}}=c_0\sqrt c.
\end{align*}
Thus every sufficiently small scale already contributes a non-vanishing lower bound; at the exponent $\varepsilon^{-2}$, boundedness can no longer be recovered from entropy growth alone without additional structure.
[/example]
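The role of the exponent $2$ can be made concrete with a model computation. The sketch below assumes the idealised entropy $\log N(\varepsilon)=\varepsilon^{-\alpha}$ exactly, so that $\sqrt{\log N(\varepsilon)}=\varepsilon^{-\alpha/2}$, and evaluates the truncated integral over $[\delta,1]$ in closed form. The function name and the choice of $\delta$ values are hypothetical.

```python
import math


def truncated_entropy_integral(alpha, delta, upper=1.0):
    """Integral of eps^(-alpha/2) over [delta, upper], i.e. of sqrt(log N)
    under the model assumption log N(eps) = eps^(-alpha)."""
    p = alpha / 2.0
    if abs(p - 1.0) < 1e-12:
        # Critical case alpha = 2: logarithmic divergence as delta -> 0.
        return math.log(upper / delta)
    return (upper ** (1.0 - p) - delta ** (1.0 - p)) / (1.0 - p)


deltas = [10.0 ** (-j) for j in range(1, 9)]
subcritical = [truncated_entropy_integral(1.0, dl) for dl in deltas]  # alpha = 1
critical = [truncated_entropy_integral(2.0, dl) for dl in deltas]     # alpha = 2
```

For $\alpha=1$ the values stay below $2$, the value of $\int_0^1\varepsilon^{-1/2}\,d\varepsilon$, whereas for $\alpha=2$ they grow like $\log(1/\delta)$ without bound.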
This critical example is a warning about the limits of entropy-only thinking. Chaining methods are scale-sensitive, and the exact arrangement of points can matter as much as the number of points needed to cover them.
## Chaining for Empirical and Rademacher Processes
The Gaussian theory is the cleanest because increments are completely controlled by their variances. Empirical processes have a related geometry, but their increments also depend on envelopes, measurability, and symmetrisation. The central question is how much of Gaussian chaining survives when $G_t$ is replaced by $\sqrt n(P_n-P)f$ or by a Rademacher average.
Let $X_1,\dots,X_n$ be i.i.d. with distribution $P$, and let $\mathcal F$ be a class of measurable real-valued functions on the sample space. The empirical process indexed by $\mathcal F$ is $G_n(f)=\sqrt n(P_n-P)f$. Since the observed sample determines which functions are distinguishable, this motivates the following random metric for chaining.
[definition: Empirical Semimetric]
For a sample $X_1,\dots,X_n$, the empirical $L^2$ semimetric on $\mathcal F$ is the map
\begin{align*}
d_n:\mathcal F\times\mathcal F&\to[0,\infty), & d_n(f,g)&=\left(P_n(f-g)^2\right)^{1/2}.
\end{align*}
[/definition]
This metric is random, so the chaining geometry itself depends on the observed sample. The obstacle in empirical-process bounds is that the centred variables $(P_n-P)f$ still involve the unknown mean $P f$, so the sample alone does not give a conditionally centred process with independent summands.
The next object isolates the part of the empirical fluctuation that can be analyzed after the data are fixed. By replacing the unknown centering with independent random signs, one obtains a conditionally centred process whose increments are measured exactly by $d_n$. This is the Rademacher process, and it is the form to which conditional chaining estimates apply.
[definition: Rademacher Process]
Let $\varepsilon_1,\dots,\varepsilon_n$ be i.i.d. Rademacher random variables independent of $X_1,\dots,X_n$. The Rademacher process indexed by $\mathcal F$ is the random map
\begin{align*}
R_n:\mathcal F&\to\mathbb R, & R_n(f)&=\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_i f(X_i).
\end{align*}
[/definition]
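For a small fixed sample, the conditional law of a Rademacher increment can be enumerated exactly over all $2^n$ sign vectors. The following sketch, with a hypothetical sample and two hypothetical functions $f,g$, compares the conditional variance of $R_n(f)-R_n(g)$ with the squared empirical distance $P_n(f-g)^2$.

```python
import itertools
import math


def empirical_distance(f, g, sample):
    """d_n(f, g) = (P_n (f - g)^2)^(1/2)."""
    n = len(sample)
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in sample) / n)


def rademacher_increment_moments(f, g, sample):
    """Exact conditional mean and variance of R_n(f) - R_n(g),
    averaging over all 2^n sign vectors (feasible only for small n)."""
    n = len(sample)
    diffs = [f(x) - g(x) for x in sample]
    values = [
        sum(s * df for s, df in zip(signs, diffs)) / math.sqrt(n)
        for signs in itertools.product((-1, 1), repeat=n)
    ]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def g(x):
    """Hypothetical second function in the class."""
    return 0.5 * x


f = math.sin                                          # hypothetical first function
sample = [0.1, 0.4, 0.45, 0.7, 0.9, 1.3, 2.0, 2.2]    # hypothetical fixed data
mean, var = rademacher_increment_moments(f, g, sample)
```

The enumeration confirms that, once the data are fixed, the sign process is centred and its increments have variance given by the empirical $L^2$ geometry.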
Conditional on $X_1,\dots,X_n$, the increment $R_n(f)-R_n(g)=n^{-1/2}\sum_{i=1}^n\varepsilon_i(f-g)(X_i)$ is a Rademacher sum with fixed coefficients, so by Hoeffding's inequality it is sub-Gaussian with variance proxy $d_n(f,g)^2$. The remaining problem is to control the supremum over a possibly infinite function class after the sample has fixed a random semimetric. A Dudley-type estimate is useful precisely because it converts this conditional increment control into a bound involving the random covering numbers of $(\mathcal F,d_n)$.
[quotetheorem:9850]
[citeproof:9850]
The conditional theorem is often paired with symmetrisation to return to empirical processes. Its bounded-on-sample hypothesis prevents a different failure mode from the Gaussian case: without finite conditional sub-Gaussian scales, the empirical metric alone does not control the tails of the increments. The countable reduction is also substantive; it makes the conditional supremum and the random covering argument measurable after the sample has been fixed. The theorem does not control the unconditional empirical process by itself, and it does not remove the need to estimate the random entropy integral. The next result gives the comparison that makes Rademacher complexity the bridge between empirical process theory and learning-theoretic generalisation bounds.
[quotetheorem:9851]
[citeproof:9851]
The comparison is upper-sided: it says that Rademacher averages control empirical fluctuations in expectation. The finiteness assumption is not cosmetic, since a nonmeasurable or nonintegrable supremum cannot be compared by the ghost-sample argument. The theorem also does not give a pathwise inequality for a fixed dataset, nor does it give a reverse comparison without extra structure. The factor $2$ is the price of replacing the unknown mean $P$ by a ghost empirical mean and then splitting the signed difference into two sample averages. To relate this to Gaussian chaining, we need a process with the same empirical covariance geometry but Gaussian rather than signed increments.
[quotetheorem:9852]
This comparison is precise enough for chaining applications: Gaussian averages dominate Rademacher averages up to a numerical constant, and both are governed by the same empirical $L^2$ geometry. Pointwise measurability and finiteness of the conditional expectations keep the two suprema inside ordinary conditional expectation; without them the displayed inequality may not be a well-defined statement. The result is not a two-sided theorem in this generality, and it does not compare individual sample paths of the Rademacher and Gaussian processes. For a concrete pathwise obstruction, take $n=1$ and a class $\mathcal F=\{0,f\}$ with $f(X_1)=1$. If the Rademacher sign is $1$ and the Gaussian variable is negative, then $\sup_{\mathcal F}R_1=1$ while $\sup_{\mathcal F}Z_f=0$; if the sign is $-1$ and the Gaussian variable is positive, the inequality is reversed. Thus the comparison is an expectation comparison after averaging over the auxiliary randomness, not an ordering of realised suprema. Lower expectation comparisons require additional symmetry, convexity, or regularity assumptions on the class. The following compactness example shows how deterministic regularity of a function class feeds into these random-metric bounds.
[example: Lipschitz Functions on a Compact Metric Space]
Let $(S,\rho)$ be compact, and let $\mathcal F$ be the class of functions $f:S\to[-1,1]$ with Lipschitz constant at most $1$. For any probability measure $P$ on $S$ and any $f,g\in\mathcal F$,
\begin{align*}
\|f-g\|_{L^2(P)}^2=\int_S |f(x)-g(x)|^2\,dP(x)\le \int_S \|f-g\|_\infty^2\,dP(x)=\|f-g\|_\infty^2P(S)=\|f-g\|_\infty^2.
\end{align*}
Taking square roots gives
\begin{align*}
\|f-g\|_{L^2(P)}\le \|f-g\|_\infty.
\end{align*}
The same calculation with $P_n$ in place of $P$ gives $d_n(f,g)\le \|f-g\|_\infty$.
We now show explicitly why compactness and the Lipschitz bound give finite sup-norm covers. Fix $\varepsilon>0$, set $\eta=\delta=\varepsilon/4$, and choose a finite $\eta$-net $\{s_1,\dots,s_r\}$ of $S$. For each $f\in\mathcal F$, choose numbers $q_i(f)\in\delta\mathbb Z\cap[-1,1]$ such that $|f(s_i)-q_i(f)|\le\delta$. There are only finitely many possible vectors $(q_1(f),\dots,q_r(f))$. For each vector that occurs, choose one representative function $h\in\mathcal F$ with that vector.
If $f$ and $h$ have the same vector and $x\in S$, choose $s_i$ with $\rho(x,s_i)\le\eta$. Since both functions are $1$-Lipschitz,
\begin{align*}
|f(x)-h(x)|\le |f(x)-f(s_i)|+|f(s_i)-h(s_i)|+|h(s_i)-h(x)|.
\end{align*}
The first and third terms are at most $\eta$, while the common rounded value gives
\begin{align*}
|f(s_i)-h(s_i)|\le |f(s_i)-q_i(f)|+|q_i(h)-h(s_i)|\le 2\delta.
\end{align*}
Therefore
\begin{align*}
|f(x)-h(x)|\le 2\eta+2\delta=\varepsilon.
\end{align*}
Taking the supremum over $x$ shows $\|f-h\|_\infty\le\varepsilon$, so $\mathcal F$ has a finite $\varepsilon$-net in sup norm, hence also in $L^2(P)$ and in the empirical semimetric $d_n$. Thus the deterministic compactness of the Lipschitz class supplies the finite covering numbers needed by chaining, and the empirical or Rademacher supremum can be bounded through these covers once the corresponding entropy integral is finite.
[/example]
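The covering construction can be run on a concrete case. The sketch below takes $S=[0,1]$ and the hypothetical subfamily $f_c(x)=|x-c|-\tfrac12$, $c\in[0,1]$, of the Lipschitz class, rounds to the nearest grid value (which satisfies the required bound $|f(s_i)-q_i(f)|\le\delta$), and checks on a fine grid that every function is within $\varepsilon$ of the representative sharing its rounded vector. All names and numerical choices are assumptions for illustration.

```python
import math


def rounded_vector(f, net, delta):
    """Integer codes q_i / delta with |f(s_i) - q_i| <= delta at the net points."""
    return tuple(round(f(s) / delta) for s in net)


def sup_distance(f, h, grid):
    """Sup-norm distance, approximated on a finite grid."""
    return max(abs(f(x) - h(x)) for x in grid)


def make_f(c):
    """1-Lipschitz function with values in [-1/2, 1/2] on [0, 1]."""
    return lambda x: abs(x - c) - 0.5


eps = 0.2
eta = delta = eps / 4.0

# eta-net of S = [0, 1]: midpoints of intervals of length 2 * eta.
m = math.ceil(1.0 / (2.0 * eta))
net = [min(1.0, (2 * i + 1) * eta) for i in range(m)]

family = [make_f(j / 400) for j in range(401)]

# Group functions by rounded vector and keep one representative for each.
representatives = {}
for f in family:
    representatives.setdefault(rounded_vector(f, net, delta), f)

grid = [i / 1000 for i in range(1001)]
worst = max(
    sup_distance(f, representatives[rounded_vector(f, net, delta)], grid)
    for f in family
)
```

The dictionary `representatives` plays the role of the finite $\varepsilon$-net in sup norm; its size is far smaller than the family because many values of $c$ produce the same rounded vector.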
This example is representative of how function-class geometry enters empirical process bounds. The index set is no longer a finite-dimensional parameter space, but regularity of the functions produces enough compactness for chaining to apply. The final example records the opposite phenomenon, where entropy sits exactly at the boundary of the integral method.
[example: Failure of the Entropy Integral at the Boundary]
Suppose that, near $0$, the $L^2(P)$ covering entropy has the boundary growth rate
\begin{align*}
\log N(\mathcal F,L^2(P),\varepsilon)\asymp \varepsilon^{-2}.
\end{align*}
This means that there are constants $0<c\le C<\infty$ and $\varepsilon_0>0$ such that, whenever $0<\varepsilon\le \varepsilon_0$,
\begin{align*}
c\varepsilon^{-2}\le \log N(\mathcal F,L^2(P),\varepsilon)\le C\varepsilon^{-2}.
\end{align*}
Taking square roots in the lower bound gives
\begin{align*}
\sqrt{\log N(\mathcal F,L^2(P),\varepsilon)}\ge \sqrt{c\varepsilon^{-2}}=\sqrt c\,\varepsilon^{-1}.
\end{align*}
Therefore, for $0<\delta<\varepsilon_0$,
\begin{align*}
\int_\delta^{\varepsilon_0}\sqrt{\log N(\mathcal F,L^2(P),\varepsilon)}\,d\varepsilon\ge \sqrt c\int_\delta^{\varepsilon_0}\varepsilon^{-1}\,d\varepsilon.
\end{align*}
Since an antiderivative of $\varepsilon^{-1}$ is $\log \varepsilon$,
\begin{align*}
\int_\delta^{\varepsilon_0}\varepsilon^{-1}\,d\varepsilon=\log \varepsilon_0-\log \delta=\log(\varepsilon_0/\delta).
\end{align*}
As $\delta\downarrow0$, $\log(\varepsilon_0/\delta)\to\infty$, so the entropy integral diverges at the origin.
Thus the standard Dudley entropy integral gives no finite uniform upper bound from this entropy estimate alone. This failure is a limitation of the integral method, not a proof that every empirical process at this boundary fails to be tight: truncation, variance decay, bracketing, or localization may replace the global entropy by a smaller effective entropy on the part of the class actually seen by the process.
[/example]
The lesson is that chaining is not a single theorem but a method. For Gaussian processes, the canonical metric and $\gamma_2$ describe the answer with striking precision. For empirical processes, the same multiscale decomposition remains central, but it must be combined with symmetrisation, random metrics, envelopes, and localization to reflect the additional structure of sampled data.
# 9. Permanence Properties and Examples of Donsker Classes
The preceding chapters developed sufficient conditions for a class $\mathcal F$ to be $P$-Donsker, usually by controlling covering or bracketing entropy. This chapter assumes the empirical-process convergence, entropy, bracketing, Brownian-bridge, stochastic-equicontinuity, and continuous mapping tools from those chapters. It also introduces permanence results that prepare for the Banach-space [delta method](/theorems/1861) used below and in the next chapter. It asks what happens after a Donsker theorem has already been proved: which operations preserve the conclusion, which familiar statistical classes fit the theory, and which large classes exceed the possible entropy scale. The main point is that Donsker theory is stable under many analytic transformations, but not under arbitrary enlargement of the index set.
## Permanence Under Maps and Algebraic Operations
A Donsker theorem is a weak convergence statement in $\ell^\infty(\mathcal F)$. The first question is therefore structural: if the empirical process indexed by $\mathcal F$ converges, when does an induced process indexed by a transformed class converge as well?
[definition: Image Class]
Let $\mathcal F$ be a class of measurable functions $f:S\to \mathbb R$, let $T$ be a set, and let $\psi:T\to \mathcal F$ be a map. The image-indexed class is
\begin{align*}
\mathcal F_\psi := \{\psi(t):t\in T\}.
\end{align*}
[/definition]
The definition changes the labels of the same functions rather than changing their values. This is useful because many statistical processes are written with a natural parameter $t$, while the empirical process only sees the function $\psi(t)$. This motivates the following theorem, which turns reparametrisation into a direct application of the [continuous mapping theorem](/theorems/1847).
[quotetheorem:9853]
[citeproof:9853]
This theorem explains why changing coordinates or replacing a parameter by an equivalent parametrisation does not create new empirical-process difficulty. The continuity hypothesis is a real hypothesis about the ambient norms: if a statistic is obtained by first taking a Donsker empirical process and then applying an evaluation map that is discontinuous in the chosen topology, weak convergence in the original space need not determine the transformed limit. For instance, convergence in $L^2(P)$ alone would not justify evaluating a limiting path at a single point; two representatives can be close in $L^2(P)$ while having unrelated point values. The theorem also does not say that arbitrary data-dependent reindexing is harmless, because $\psi$ is fixed and deterministic. This distinction prepares the finite-sum theorem, where the transformation is again deterministic but now changes function values rather than only their labels.
The next operation changes the functions themselves. A first obstruction is that an uncontrolled sum can import a non-Donsker component: adding a fixed Donsker class to an arbitrary large class cannot improve the latter. Thus any useful finite-sum theorem must use joint control of both summands, not just pointwise algebra.
[definition: Sum Class]
Let $\mathcal F$ and $\mathcal G$ be classes of measurable real-valued functions on the same measurable space $S$. Their pointwise sum class is
\begin{align*}
\mathcal F+\mathcal G:=\{f+g:f\in\mathcal F,\ g\in\mathcal G\}.
\end{align*}
[/definition]
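The variance of a limiting Gaussian coordinate indexed by $f+g$ depends on the cross-covariance $P(fg)-Pf\,Pg$ and not only on the two marginal variances. A small sketch, assuming $P$ is uniform on a hypothetical six-point space and using two hypothetical functions, makes the decomposition explicit.

```python
def P(h, space):
    """Expectation under the uniform distribution on a finite space."""
    return sum(h(x) for x in space) / len(space)


def bridge_cov(f, g, space):
    """Covariance P(fg) - Pf Pg of the limiting Gaussian process."""
    return P(lambda x: f(x) * g(x), space) - P(f, space) * P(g, space)


def f(x):
    """Hypothetical element of the first class."""
    return float(x)


def g(x):
    """Hypothetical element of the second class."""
    return float(x % 2)


def s(x):
    """The corresponding element f + g of the sum class."""
    return f(x) + g(x)


space = range(6)               # hypothetical six-point sample space

var_sum = bridge_cov(s, s, space)
from_parts = (
    bridge_cov(f, f, space)
    + bridge_cov(g, g, space)
    + 2.0 * bridge_cov(f, g, space)
)
```

Here the cross term is nonzero, so the two marginal variances alone would give the wrong variance for the sum.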
Sums appear whenever an estimating function is decomposed into a leading term and a nuisance correction. The obstruction is not the algebraic identity $G_n(f+g)=G_n f+G_n g$, but the need for a single tight limiting process that remembers the covariance between the two indexed classes. Joint Donsker control over the union supplies that covariance information and makes the addition map continuous on the combined limit.
[quotetheorem:9854]
[citeproof:9854]
The same proof pattern handles scalar multiplication and any fixed finite linear combination. The union assumption is stronger than asking separately that $\mathcal F$ and $\mathcal G$ be Donsker; separate convergence does not by itself specify the cross-covariances needed for the Gaussian limit of $G_n f+G_n g$. The theorem also does not permit a number of summands increasing with $n$: repeated addition over many choices can inflate entropy even when each component class is tame. Convex hulls are the natural test case because they consist of mixtures of arbitrary finite length, so the next definition isolates the class whose size has to be controlled by a separate entropy theorem.
[definition: Convex Hull Of A Function Class]
Let $\mathcal F$ be a class of measurable real-valued functions on $S$. Its convex hull is the class
\begin{align*}
\operatorname{conv}(\mathcal F):=\left\{\sum_{i=1}^m a_i f_i:m\in\mathbb N,\ f_i\in\mathcal F,\ a_i\ge 0,\ \sum_{i=1}^m a_i=1\right\}.
\end{align*}
[/definition]
Convex hulls arise in randomized predictors, mixtures, and aggregation procedures. Finite-sum permanence does not by itself control them, because the number of mixture components is unbounded and the class can acquire many new directions. To keep convexification within the Donsker regime, one needs an entropy condition strong enough to control all finite mixtures at once rather than one fixed linear combination at a time.
[quotetheorem:9855]
[citeproof:9855]
The polynomial uniform entropy hypothesis is a checkable way of saying that the original class has finite-dimensional combinatorial complexity. Mere Donsker-ness of $\mathcal F$ would not be enough for this argument: the classical Hilbert-ball obstruction gives a concrete warning. If $(e_j)_{j\ge1}$ is an orthonormal sequence in $L^2(P)$ and $\mathcal F=\{0\}\cup\{a_j e_j:j\ge1\}$ with $a_j\downarrow0$ fast enough, then $\mathcal F$ can be totally bounded and compatible with a Gaussian limit, while its convex hull contains many directions whose entropy behaves like that of an infinite-dimensional ellipsoid. For slow enough decay this convex hull has nonintegrable entropy and is not Donsker. The theorem also does not say that every closure of the convex hull is Donsker in every topology; it uses the measurable separable $L^2(P)$ version controlled by the stated bracketing theorem. In many statistical examples, such as VC indicator classes, this hypothesis is exactly the available entropy input, so the next example records the common classifier case before we turn to nonlinear transformations.
[example: Randomized Classifiers From A VC Class]
Let $\mathcal C$ be a VC class of measurable subsets of $S$, and set $\mathcal F=\{\mathbf 1_C:C\in\mathcal C\}$. An element of $\operatorname{conv}(\mathcal F)$ has the form
\begin{align*}
h(x)=\sum_{i=1}^m a_i\mathbf 1_{C_i}(x)
\end{align*}
where $m\in\mathbb N$, $C_i\in\mathcal C$, $a_i\ge0$, and $\sum_{i=1}^m a_i=1$. Since each $\mathbf 1_{C_i}(x)$ is either $0$ or $1$, we have
\begin{align*}
0\le a_i\mathbf 1_{C_i}(x)\le a_i
\end{align*}
for every $i$, and summing over $i$ gives
\begin{align*}
0\le h(x)\le \sum_{i=1}^m a_i=1.
\end{align*}
Thus every element of the convex hull is a finite randomized classifier taking values in $[0,1]$.
If $\mathcal C$ has VC dimension $v$, the VC entropy bound gives constants $A,v'>0$ such that, for every finitely supported probability measure $Q$ and every $0<\varepsilon<1$,
\begin{align*}
N(\mathcal F,L^2(Q),\varepsilon)\le A\varepsilon^{-v'}.
\end{align*}
The class is bounded by the envelope $1$, so *Convex Hull Permanence Under Entropy Control* applies and shows that $\operatorname{sconv}(\mathcal F)$ is $P$-Donsker for every probability measure $P$ after passing to the usual measurable separable version. Since $\operatorname{conv}(\mathcal F)\subset\operatorname{sconv}(\mathcal F)$, the randomized classifier class $\operatorname{conv}(\mathcal F)$ is also $P$-Donsker.
[/example]
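As a concrete instance, half-lines $(-\infty,t]$ form a VC class on $\mathbb R$, and a finite mixture of their indicators is a randomized classifier. The sketch below, with hypothetical thresholds and weights, builds one such mixture and evaluates it on a grid to confirm the range bound $0\le h\le1$ derived in the example.

```python
def indicator(threshold):
    """Indicator of the half-line C = (-inf, threshold]."""
    return lambda x: 1.0 if x <= threshold else 0.0


def mixture(weights, functions):
    """Convex combination h = sum_i a_i f_i; weights must be >= 0 and sum to 1."""
    assert all(a >= 0 for a in weights) and abs(sum(weights) - 1.0) < 1e-12
    return lambda x: sum(a * fn(x) for a, fn in zip(weights, functions))


thresholds = [-1.0, 0.0, 0.5, 2.0]          # hypothetical sets C_i
weights = [0.1, 0.2, 0.3, 0.4]              # hypothetical mixing weights a_i
h = mixture(weights, [indicator(t) for t in thresholds])

xs = [-3.0 + 0.01 * j for j in range(601)]  # evaluation grid on [-3, 3]
values = [h(x) for x in xs]
```

The value $h(x)$ can be read as the probability that a classifier drawn at random according to the weights $a_i$ assigns label $1$ to $x$.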
The classifier example shows that mixtures preserve useful statistical structure when entropy stays controlled. The next operation is composition: loss functions, link functions, and truncations all transform a real-valued class by applying a fixed function to its output. The obstruction is that a rough transform can create oscillation at small scales; for instance, composing with a discontinuous threshold may turn a smooth real-valued class into a large indicator class. A Lipschitz transform is the safe case because it contracts the relevant $L^2(P)$ geometry up to a constant.
[definition: Lipschitz Transform Of A Class]
Let $\mathcal F$ be a class of measurable functions $f:S\to I$, where $I\subset\mathbb R$. Let $\phi:I\to\mathbb R$ be Lipschitz. The transformed class is
\begin{align*}
\phi\circ\mathcal F:=\{\phi\circ f:f\in\mathcal F\}.
\end{align*}
[/definition]
The Lipschitz condition says that distances in function space are not magnified by more than a fixed constant. This turns $L^2(P)$ covers of $\mathcal F$ into $L^2(P)$ covers of the transformed class, which motivates the following permanence theorem.
[quotetheorem:9857]
[citeproof:9857]
This theorem is one of the main tools for passing from linear statistics to nonlinear losses, and its hypotheses identify the two possible failure points. Without Lipschitz continuity the transform may introduce jumps and destroy equicontinuity: applying $u\mapsto\mathbf{1}_{\{u>0\}}$ to a smooth threshold class can produce an indicator class whose boundary behaviour is governed by a different entropy calculation. Without the envelope condition, even a Lipschitz map such as $u\mapsto u+c$ leaves square-integrability unresolved when the original functions have no $L^2(P)$ envelope. The theorem also does not give differentiability of the transformed statistical functional; it only preserves the Donsker property of the transformed index class. Truncation is usually compatible with empirical-process limits precisely because it is Lipschitz and improves envelopes, and the next example shows the mechanism for a standard loss.
[example: Logistic Loss With Bounded Linear Predictors]
Let $\Theta\subset\mathbb R^d$ be a parameter set with $B:=\sup_{\theta\in\Theta}|\theta|<\infty$, and assume $|x|\le M$ on the sample space. For the linear predictors $f_\theta(x)=\theta\cdot x$, the Cauchy--Schwarz inequality gives
\begin{align*}
|f_\theta(x)-f_\eta(x)|=|(\theta-\eta)\cdot x|\le |\theta-\eta|\,|x|\le M|\theta-\eta|.
\end{align*}
Also,
\begin{align*}
|f_\theta(x)|=|\theta\cdot x|\le |\theta|\,|x|\le BM.
\end{align*}
Thus the linear class $\mathcal F=\{f_\theta:\theta\in\Theta\}$ has the bounded envelope $BM$, and an $\varepsilon$-cover of $\Theta$ in the Euclidean norm yields an $M\varepsilon$-cover of $\mathcal F$ in $L^2(P)$. The finite-dimensional Lipschitz criterion therefore shows that $\mathcal F$ is $P$-Donsker.
Now let $\phi(u)=\log(1+e^{-u})$. Its derivative is
\begin{align*}
\phi'(u)=\frac{1}{1+e^{-u}}(-e^{-u})=-\frac{e^{-u}}{1+e^{-u}}=-\frac{1}{1+e^u}.
\end{align*}
Since $e^u>0$, we have
\begin{align*}
|\phi'(u)|=\frac{1}{1+e^u}\le 1.
\end{align*}
By the [mean value theorem](/theorems/186), for any $u,v\in\mathbb R$,
\begin{align*}
|\phi(u)-\phi(v)|\le |u-v|.
\end{align*}
Therefore $\phi$ is $1$-Lipschitz. The transformed functions are
\begin{align*}
(\phi\circ f_\theta)(x)=\phi(\theta\cdot x)=\log(1+e^{-\theta\cdot x}).
\end{align*}
Moreover, since $|\theta\cdot x|\le BM$,
\begin{align*}
0\le \log(1+e^{-\theta\cdot x})\le \log(1+e^{BM}),
\end{align*}
so the transformed class has a bounded $L^2(P)$ envelope. Hence *[Donsker Permanence Under Lipschitz Transforms](/theorems/9857)* applies, and the logistic loss class
\begin{align*}
\{x\mapsto \log(1+e^{-\theta\cdot x}):\theta\in\Theta\}
\end{align*}
is $P$-Donsker. This example shows that the nonlinear logistic loss adds no empirical-process complexity beyond the bounded finite-dimensional linear predictor class.
[/example]
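The two analytic facts used in the logistic example, the $1$-Lipschitz property of $\phi$ and the envelope $\log(1+e^{BM})$, can be probed on random draws. This is only a sanity-check sketch: the values $B=2$, $M=3$, $d=4$ and the helper names are assumptions made for illustration.

```python
import math
import random

def logistic_loss(u):
    """phi(u) = log(1 + exp(-u)), evaluated without overflow for large |u|."""
    if u >= 0:
        return math.log1p(math.exp(-u))
    return -u + math.log1p(math.exp(u))

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def point_in_ball(rng, dim, radius):
    """Draw some point with Euclidean norm below `radius` (not uniformly distributed)."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    factor = radius * rng.random() / math.sqrt(dot(v, v))
    return [factor * c for c in v]

B, M, d = 2.0, 3.0, 4          # assumed bounds |theta| <= B and |x| <= M
rng = random.Random(1)
max_ratio, max_loss, min_loss = 0.0, 0.0, float("inf")
for _ in range(2000):
    theta, eta = point_in_ball(rng, d, B), point_in_ball(rng, d, B)
    x = point_in_ball(rng, d, M)
    u, v = dot(theta, x), dot(eta, x)
    if abs(u - v) > 1e-6:       # skip near-ties to avoid rounding noise in the ratio
        ratio = abs(logistic_loss(u) - logistic_loss(v)) / abs(u - v)
        max_ratio = max(max_ratio, ratio)
    max_loss = max(max_loss, logistic_loss(u))
    min_loss = min(min_loss, logistic_loss(u))

envelope = math.log1p(math.exp(B * M))   # log(1 + e^{BM})
print(max_ratio, max_loss, envelope)
```

The largest observed difference quotient stays below $1$ and the largest observed loss stays below the envelope, as the mean value theorem argument predicts.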
The logistic example depends on a Lipschitz composition, but monotone classes need not arise from a single Lipschitz transform. The difficulty is that a monotone class is infinite-dimensional, so finite-dimensional parametrisation gives no entropy bound. Its saving structure is order: bounded monotone functions cannot oscillate freely, and that ordered geometry yields brackets even without smoothness or a finite parameter set.
[quotetheorem:9858]
[citeproof:9858]
The theorem is a useful warning about structure. A class may be infinite-dimensional and still Donsker if its geometry prevents oscillation; here monotonicity replaces finite-dimensional parametrisation by limiting how many upcrossings a function can have. The bounded range matters because it supplies the envelope: the class of all nondecreasing maps $[0,1]\to[0,\infty)$ contains constant functions of arbitrary height, so it has no common $L^2(P)$ envelope. The one-dimensional interval also matters because the bracketing construction orders the domain; coordinatewise monotone functions on $[0,1]^d$ for larger $d$ have substantially different entropy behaviour and cannot be treated by the same one-dimensional partition argument. The theorem does not classify all shape-constrained classes, but it provides the prototype for using order to replace smooth parametrisation. Distribution functions give a direct illustration of this principle.
[example: Distribution Functions As A Monotone Class]
Let $\mathcal D$ denote the class of distribution functions on $[0,1]$. If $F\in\mathcal D$, then $0\le F(x)\le 1$ for every $x\in[0,1]$, and for $x\le y$ the defining monotonicity of a distribution function gives $F(x)\le F(y)$. Hence
\begin{align*}
F:[0,1]\to[0,1]\text{ is nondecreasing},
\end{align*}
so $F\in\mathcal M$. Therefore $\mathcal D\subset\mathcal M$.
By *[Bounded Monotone Functions On An Interval Are Donsker](/theorems/9858)*, $\mathcal M$ is $P$-Donsker for every probability measure $P$ on $[0,1]$. The empirical process indexed by $\mathcal D$ is the restriction of the empirical process indexed by $\mathcal M$:
\begin{align*}
G_n|_{\mathcal D}(F)=G_n(F)=\sqrt n(P_n-P)F,\qquad F\in\mathcal D.
\end{align*}
Restriction from $\ell^\infty(\mathcal M)$ to $\ell^\infty(\mathcal D)$ is continuous because
\begin{align*}
\sup_{F\in\mathcal D}|z(F)-w(F)|\le \sup_{f\in\mathcal M}|z(f)-w(f)|.
\end{align*}
Thus the continuous mapping theorem gives weak convergence of $G_n|_{\mathcal D}$ to the restricted Brownian bridge $G_P|_{\mathcal D}$. In this sense, distribution functions form a Donsker subclass of the bounded monotone class, so their empirical process has a tight Gaussian limit indexed by monotone test functions.
[/example]
## Smooth Parametric Classes
Many empirical processes in statistics are indexed by a finite-dimensional parameter rather than by an arbitrary function class. [Finite dimensionality](/theorems/1534) alone is not enough: a parametrisation can oscillate wildly in $x$ as $\theta$ changes, or a single anchor function can fail to be square-integrable. The central question is therefore how differentiability and envelope control of the statistical model convert the infinite-looking process into a finite-dimensional Gaussian approximation.
[definition: Differentiability In Quadratic Mean]
Let $\{P_\theta:\theta\in\Theta\subset\mathbb R^d\}$ be dominated by a measure $\mu$, with densities $p_\theta$. The model is differentiable in quadratic mean at $\theta_0$ if there exists a measurable score function $\dot\ell_{\theta_0}:S\to\mathbb R^d$ such that, as $h\to 0$ in $\mathbb R^d$ with $\theta_0+h\in\Theta$,
\begin{align*}
\int\left(\sqrt{p_{\theta_0+h}}-\sqrt{p_{\theta_0}}-\frac{1}{2}h\cdot\dot\ell_{\theta_0}\sqrt{p_{\theta_0}}\right)^2d\mu=o(|h|^2).
\end{align*}
[/definition]
This definition measures differentiability in the [Hilbert space](/page/Hilbert%20Space) $L^2(\mu)$ after taking square roots of densities. It is the right smoothness condition for likelihood expansions and local asymptotic normality. To use it in empirical-process arguments, the following theorem supplies a general Donsker criterion for finite-dimensional smooth classes.
[quotetheorem:9859]
[citeproof:9859]
This result is the workhorse behind smooth finite-dimensional examples. The anchor condition produces an $L^2(P)$ envelope, while the Lipschitz condition turns Euclidean compactness of $\Theta$ into polynomial covering numbers for the function class. If the anchor condition is dropped, the constant-in-parameter class $f_\theta(x)=x$ under a distribution with $\mathbb E[X^2]=\infty$ already has no square-integrable envelope. If the $L^2(P)$ Lipschitz envelope is dropped, a parametrisation may be differentiable in $\theta$ at each fixed $x$ while the derivative spikes on small sets as $\theta$ varies, so Euclidean compactness no longer controls $L^2(P)$ covering numbers. The theorem does not claim that every finite-dimensional parametrisation is Donsker; it says that finite dimensionality becomes useful only after it is tied to the sample-space geometry by an integrable Lipschitz bound. With these safeguards in place, the complexity is controlled by the number of parameters, not by the ambient sample space, as the likelihood-score example shows.
[example: Smooth Parametric Likelihood Scores]
Let $\Theta\subset\mathbb R^d$ be compact, let $\theta_0\in\Theta$ denote the true parameter, fix an anchor point $\theta_*\in\Theta$, and write
\begin{align*}
\dot\ell_\theta(x)=\left(\partial_{\theta_1}\log p_\theta(x),\ldots,\partial_{\theta_d}\log p_\theta(x)\right).
\end{align*}
Assume $\dot\ell_{\theta_*}\in L^2(P_{\theta_0})$ and that there is $M\in L^2(P_{\theta_0})$ such that, for every $j=1,\ldots,d$,
\begin{align*}
|\partial_{\theta_j}\log p_\theta(x)-\partial_{\theta_j}\log p_\eta(x)|\le M(x)|\theta-\eta|
\end{align*}
for all $\theta,\eta\in\Theta$ and $P_{\theta_0}$-a.e. $x$. Then, component by component,
\begin{align*}
|\dot\ell_\theta(x)-\dot\ell_\eta(x)|^2
=
\sum_{j=1}^d|\partial_{\theta_j}\log p_\theta(x)-\partial_{\theta_j}\log p_\eta(x)|^2
\le dM(x)^2|\theta-\eta|^2.
\end{align*}
Taking square roots gives
\begin{align*}
|\dot\ell_\theta(x)-\dot\ell_\eta(x)|\le \sqrt d\,M(x)|\theta-\eta|.
\end{align*}
For the score class
\begin{align*}
\mathcal S=\{x\mapsto a\cdot \dot\ell_\theta(x):\theta\in\Theta,\ |a|\le 1\},
\end{align*}
the envelope is square-integrable. Indeed, by Cauchy--Schwarz and the preceding bound with $\eta=\theta_*$,
\begin{align*}
|a\cdot\dot\ell_\theta(x)|
\le |a|\,|\dot\ell_\theta(x)|
\le |\dot\ell_{\theta_*}(x)|+\sqrt d\,M(x)\operatorname{diam}(\Theta).
\end{align*}
The right-hand side lies in $L^2(P_{\theta_0})$.
Now compare two indices $(a,\theta)$ and $(b,\eta)$ with $|a|\le1$ and $|b|\le1$. We have
\begin{align*}
|a\cdot\dot\ell_\theta(x)-b\cdot\dot\ell_\eta(x)|
\le |(a-b)\cdot\dot\ell_\theta(x)|+|b\cdot(\dot\ell_\theta(x)-\dot\ell_\eta(x))|.
\end{align*}
Using Cauchy--Schwarz in both terms gives
\begin{align*}
|a\cdot\dot\ell_\theta(x)-b\cdot\dot\ell_\eta(x)|
\le |a-b|\,|\dot\ell_\theta(x)|+|b|\,|\dot\ell_\theta(x)-\dot\ell_\eta(x)|.
\end{align*}
Since $|b|\le1$ and $|\dot\ell_\theta(x)|\le |\dot\ell_{\theta_*}(x)|+\sqrt d\,M(x)\operatorname{diam}(\Theta)$,
\begin{align*}
|a\cdot\dot\ell_\theta(x)-b\cdot\dot\ell_\eta(x)|
\le U(x)\bigl(|a-b|+|\theta-\eta|\bigr),
\end{align*}
where
\begin{align*}
U(x):=|\dot\ell_{\theta_*}(x)|+\sqrt d\,M(x)\operatorname{diam}(\Theta)+\sqrt d\,M(x).
\end{align*}
Thus $U\in L^2(P_{\theta_0})$ is a Lipschitz envelope for the finite-dimensional parameter set $\{(a,\theta):|a|\le1,\theta\in\Theta\}\subset\mathbb R^{2d}$. By *Donsker Behaviour Of Smooth Parametric Scores*, $\mathcal S$ is $P_{\theta_0}$-Donsker.
The limiting Brownian bridge $G$ has covariance
\begin{align*}
\operatorname{Cov}\bigl(G(a\cdot\dot\ell_\theta),G(b\cdot\dot\ell_\eta)\bigr)
=
P_{\theta_0}\bigl[(a\cdot\dot\ell_\theta)(b\cdot\dot\ell_\eta)\bigr]
-
P_{\theta_0}(a\cdot\dot\ell_\theta)\,P_{\theta_0}(b\cdot\dot\ell_\eta).
\end{align*}
At $\theta=\eta=\theta_0$, the score has mean zero under the usual regularity identity $P_{\theta_0}\dot\ell_{\theta_0}=0$, so this covariance becomes
\begin{align*}
a^\top P_{\theta_0}\bigl[\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top\bigr]b
=
a^\top I(\theta_0)b,
\end{align*}
the Fisher information pairing. Thus the empirical score process has a tight Gaussian limit, and at the true parameter its covariance is exactly the Fisher information [bilinear form](/page/Bilinear%20Form).
[/example]
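The covariance formula can be illustrated in the simplest model where everything is explicit. The sketch below assumes the Gaussian location model $N(\theta,1)$, whose score is $\dot\ell_\theta(x)=x-\theta$ (so the Lipschitz envelope is $M\equiv1$ and $I(\theta_0)=1$), and evaluates the expectations by a midpoint rule; the model choice, the quadrature, and all function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def score(theta, x):
    """Score of the N(theta, 1) location model: d/dtheta log p_theta(x) = x - theta."""
    return x - theta

def expect(g, mean, lo=-12.0, hi=12.0, n=24000):
    """Midpoint-rule approximation of E[g(X)] for X ~ N(mean, 1)."""
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * step
        total += g(x) * normal_pdf(x, mean) * step
    return total

def bridge_cov(a, theta, b, eta, theta0):
    """Cov(G(a*score_theta), G(b*score_eta)) = P[fg] - P[f] P[g] under P_theta0."""
    f = lambda x: a * score(theta, x)
    g = lambda x: b * score(eta, x)
    return expect(lambda x: f(x) * g(x), theta0) - expect(f, theta0) * expect(g, theta0)

theta0 = 0.3
fisher = expect(lambda x: score(theta0, x) ** 2, theta0)      # I(theta0)
cov_at_truth = bridge_cov(0.5, theta0, -0.8, theta0, theta0)  # a * I(theta0) * b
cov_elsewhere = bridge_cov(0.5, 1.0, -0.8, -0.4, theta0)
print(fisher, cov_at_truth, cov_elsewhere)
```

In this location model the centring term removes the dependence on $\theta$ and $\eta$, so both covariances equal $a\,I(\theta_0)\,b=-0.4$; in a general model only the value at $\theta=\eta=\theta_0$ reduces to the Fisher information pairing.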
The score example gives convergence for an indexed process, while statistical applications often require applying a functional to that process. Ordinary differentiability along a single fixed direction is not stable enough for weak limits, because empirical perturbations approach their limiting directions only through nearby paths in a function space. The differentiability notion below builds this stability into the definition and restricts attention to the tangent directions where the limiting random element lives.
[definition: Hadamard Differentiability]
Let $D$ and $E$ be normed spaces equipped with their norm metrics, let $\Phi:D_\Phi\subset D\to E$, let $x\in D_\Phi$, and let $D_0\subset D$ be a linear subspace. The map $\Phi$ is Hadamard differentiable at $x$ tangentially to $D_0$ if there is a continuous linear map $\Phi'_x:D_0\to E$ such that for every $h_t\to h$ in $D$ with $h\in D_0$ and every $t\downarrow0$ satisfying $x+t h_t\in D_\Phi$,
\begin{align*}
\frac{\Phi(x+t h_t)-\Phi(x)}{t}\to \Phi'_x(h)
\end{align*}
in $E$.
[/definition]
Hadamard differentiability is designed for limits in function spaces because it allows perturbing directions $h_t$ that converge to the limiting direction. In the applications below, the limiting random element takes values in a separable tangential subspace $D_0\subset D$, so the derivative only has to be defined on the directions that can appear in the limit. Ordinary pointwise Gateaux differentiability would be too weak here: empirical-process perturbations are random and only converge in distribution, so the derivative must be stable along nearby deterministic paths. This is exactly the situation produced by empirical processes, and it motivates the following delta method.
[quotetheorem:9860]
[citeproof:9860]
For empirical processes this theorem is most often applied with $X_n=P_n$ or with an estimated distribution function. The tangential condition matters because many statistical functionals are not differentiable in every ambient direction; quantiles, for example, require perturbations compatible with distribution functions and a positive density at the target point. Ordinary continuity is insufficient: the map $F\mapsto F^{-1}(p)$ can be continuous at a strictly increasing distribution function while failing to have a linear first-order expansion if the density vanishes at the $p$-quantile. Directional differentiability is insufficient as well: the kinked map $x\mapsto |x|$ has the one-sided directional derivative $h\mapsto|h|$ at $0$, which is not linear in $h$, so it supplies no linear first-order expansion. The theorem therefore does not assert that every continuous functional preserves a Donsker limit. It says that once the correct Hadamard derivative is available, quantiles, smooth risks, and minimum-distance criteria inherit weak limits from the underlying empirical process.
[example: Smooth Risk Functionals]
Let $\mathcal F=\{f_\theta:\theta\in\Theta\}$ satisfy the finite-dimensional Lipschitz condition above, and define the risk curve
\begin{align*}
R(Q)(\theta)=Qf_\theta,\qquad \theta\in\Theta,
\end{align*}
for any signed measure $Q$ for which $\sup_{\theta\in\Theta}|Qf_\theta|<\infty$. Equip this signed-[measure space](/page/Measure%20Space) with the seminorm
\begin{align*}
\|Q\|_{\mathcal F}:=\sup_{\theta\in\Theta}|Qf_\theta|.
\end{align*}
For perturbations $H_t\to H$ in this seminorm and $t\downarrow0$, linearity of integration gives, for every $\theta\in\Theta$,
\begin{align*}
\frac{R(P+tH_t)(\theta)-R(P)(\theta)}{t}=H_t f_\theta.
\end{align*}
Hence
\begin{align*}
\left\|\frac{R(P+tH_t)-R(P)}{t}-R(H)\right\|_{\ell^\infty(\Theta)}=\sup_{\theta\in\Theta}|H_t f_\theta-Hf_\theta|=\|H_t-H\|_{\mathcal F}\to0.
\end{align*}
Thus $R$ is Hadamard differentiable at $P$, with derivative $R'_P(H)(\theta)=Hf_\theta$.
Applying this identity to $H_n=\sqrt n(P_n-P)$ gives, for every $\theta\in\Theta$,
\begin{align*}
\sqrt n\{R(P_n)(\theta)-R(P)(\theta)\}=\sqrt n(P_n-P)f_\theta.
\end{align*}
Therefore
\begin{align*}
\sqrt n(R(P_n)-R(P))=\{\sqrt n(P_n-P)f_\theta:\theta\in\Theta\}
\end{align*}
as an element of $\ell^\infty(\Theta)$. Since the finite-dimensional Lipschitz criterion makes $\mathcal F$ $P$-Donsker, this process converges in $\ell^\infty(\Theta)$ to the Brownian bridge indexed by $\{f_\theta:\theta\in\Theta\}$. The risk functional therefore adds no extra first-order approximation error: its derivative simply reads off the empirical process over the original loss class.
[/example]
The smooth-risk example is finite-dimensional through the parameter set, not through the formula for the function values. Fixed-architecture neural networks give another instance of this principle when the activation and weights are controlled.
[example: Neural Networks With Fixed Architecture]
Fix a network architecture with $L$ layers and finitely many parameters. Write $z_\theta^0(x)=x$, and for layer $\ell=1,\ldots,L$ write
\begin{align*}
z_\theta^\ell(x)=\sigma(W_\theta^\ell z_\theta^{\ell-1}(x)+b_\theta^\ell),
\end{align*}
where $\sigma$ is applied coordinatewise and is $K_\sigma$-Lipschitz. Assume $|x|\le B$ and that the bounded parameter set $\Theta$ gives a common bound $|W_\theta^\ell|_{\mathrm{op}}\le R$ and $|b_\theta^\ell|\le R$ for all $\theta$ and $\ell$. Since $\sigma$ is Lipschitz,
\begin{align*}
|\sigma(u)|\le |\sigma(0)|+K_\sigma |u|.
\end{align*}
Thus, if $|z_\theta^{\ell-1}(x)|\le A_{\ell-1}$, then
\begin{align*}
|z_\theta^\ell(x)|\le |\sigma(0)|\sqrt{p_\ell}+K_\sigma |W_\theta^\ell z_\theta^{\ell-1}(x)+b_\theta^\ell|\le |\sigma(0)|\sqrt{p_\ell}+K_\sigma(RA_{\ell-1}+R),
\end{align*}
where $p_\ell$ is the width of layer $\ell$. Starting from $A_0=B$, this recursion gives finite constants $A_1,\ldots,A_L$ such that $|z_\theta^\ell(x)|\le A_\ell$ for every $\theta\in\Theta$ and every admissible input $x$.
Now compare two parameter values $\theta,\eta\in\Theta$. Suppose inductively that
\begin{align*}
|z_\theta^{\ell-1}(x)-z_\eta^{\ell-1}(x)|\le D_{\ell-1}|\theta-\eta|.
\end{align*}
Using the Lipschitz property of $\sigma$ and adding and subtracting $W_\eta^\ell z_\theta^{\ell-1}(x)$ gives
\begin{align*}
|z_\theta^\ell(x)-z_\eta^\ell(x)|\le K_\sigma |(W_\theta^\ell-W_\eta^\ell)z_\theta^{\ell-1}(x)+W_\eta^\ell(z_\theta^{\ell-1}(x)-z_\eta^{\ell-1}(x))+(b_\theta^\ell-b_\eta^\ell)|.
\end{align*}
The triangle inequality and the bounds above give
\begin{align*}
|z_\theta^\ell(x)-z_\eta^\ell(x)|\le K_\sigma(A_{\ell-1}+RD_{\ell-1}+1)|\theta-\eta|.
\end{align*}
With $D_0=0$, define
\begin{align*}
D_\ell:=K_\sigma(A_{\ell-1}+RD_{\ell-1}+1).
\end{align*}
Then $|z_\theta^\ell(x)-z_\eta^\ell(x)|\le D_\ell|\theta-\eta|$ for every layer.
If the final output is $f_\theta(x)=a_\theta\cdot z_\theta^L(x)+c_\theta$ with $|a_\theta|\le R$ and $|c_\theta|\le R$, then
\begin{align*}
|f_\theta(x)-f_\eta(x)|\le |(a_\theta-a_\eta)\cdot z_\theta^L(x)|+|a_\eta\cdot(z_\theta^L(x)-z_\eta^L(x))|+|c_\theta-c_\eta|.
\end{align*}
By Cauchy--Schwarz and the preceding layer bound,
\begin{align*}
|f_\theta(x)-f_\eta(x)|\le (A_L+RD_L+1)|\theta-\eta|.
\end{align*}
Also $|f_{\theta_*}(x)|\le RA_L+R$ for any fixed $\theta_*\in\Theta$, so the anchor function is bounded and hence lies in $L^2(P)$. Therefore the finite-dimensional Lipschitz criterion *Donsker Behaviour Of Smooth Parametric Scores* applies to the class $\{f_\theta:\theta\in\Theta\}$. The fixed architecture contributes only finitely many bounded parameters, so the neural-network class has polynomial entropy and is $P$-Donsker.
[/example]
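The recursions for $A_\ell$ and $D_\ell$ can be run on a concrete architecture and compared with observed difference quotients. The sketch below assumes a $\tanh$ activation (so $K_\sigma=1$ and $\sigma(0)=0$), widths $2\to3\to2$, and bounds $B=1$, $R=1.5$; weights are drawn with Frobenius norm at most $R$, which bounds the operator norm. All names and numerical choices are illustrative.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def scaled(values, radius, rng):
    """Rescale a list of numbers to Euclidean norm radius * U, with U uniform on [0, 1)."""
    factor = radius * rng.random() / norm(values)
    return [factor * c for c in values]

def draw_params(rng, sizes, R):
    """Draw (layers, a, c) with Frobenius norm of each W, |b|, |a| and |c| at most R."""
    layers = []
    for p_in, p_out in zip(sizes[:-1], sizes[1:]):
        flat_w = scaled([rng.uniform(-1, 1) for _ in range(p_in * p_out)], R, rng)
        W = [flat_w[r * p_in:(r + 1) * p_in] for r in range(p_out)]
        b = scaled([rng.uniform(-1, 1) for _ in range(p_out)], R, rng)
        layers.append((W, b))
    a = scaled([rng.uniform(-1, 1) for _ in range(sizes[-1])], R, rng)
    return layers, a, rng.uniform(-R, R)

def flatten(params):
    layers, a, c = params
    flat = []
    for W, b in layers:
        flat.extend(w for row in W for w in row)
        flat.extend(b)
    return flat + a + [c]

def network(params, x):
    """f_theta(x) = a . z^L(x) + c with coordinatewise tanh activation."""
    layers, a, c = params
    z = x
    for W, b in layers:
        z = [math.tanh(sum(w * s for w, s in zip(row, z)) + bias) for row, bias in zip(W, b)]
    return sum(s * t for s, t in zip(a, z)) + c

def lipschitz_constant(widths, B, R, K=1.0, sigma0=0.0):
    """Run the A_l and D_l recursions from the example and return A_L + R * D_L + 1."""
    A, D = B, 0.0
    for p in widths:
        # Both right-hand sides use the previous layer's A and D.
        A, D = sigma0 * math.sqrt(p) + K * (R * A + R), K * (A + R * D + 1.0)
    return A + R * D + 1.0

sizes = [2, 3, 2]
B, R = 1.0, 1.5
rng = random.Random(2)
lip = lipschitz_constant(sizes[1:], B, R)
worst = 0.0
for _ in range(500):
    theta, eta = draw_params(rng, sizes, R), draw_params(rng, sizes, R)
    x = scaled([rng.uniform(-1, 1) for _ in range(sizes[0])], B, rng)
    gap = abs(network(theta, x) - network(eta, x))
    dist = norm([s - t for s, t in zip(flatten(theta), flatten(eta))])
    worst = max(worst, gap / dist)
print(lip, worst)
```

With these choices the recursion gives $A_1=3$, $D_1=2$, $A_2=6$, $D_2=7$, hence the Lipschitz constant $A_2+RD_2+1=17.5$. The observed quotients are much smaller, which is expected because the recursion is a worst-case bound.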
## Non-Donsker Examples And Entropy Sharpness
The permanence results above might suggest that Donsker classes are common, but the theory has sharp limits. The final question in this chapter is how large a class can be before tight Gaussian limits cease to exist.
[definition: Universal Indicator Class]
Let $(S,\mathcal A)$ be a measurable space. The universal indicator class is
\begin{align*}
\mathcal I_{\mathcal A}:=\{\mathbf{1}_A:A\in\mathcal A\}.
\end{align*}
[/definition]
This class indexes the empirical measure over every measurable set. The obstruction is a failure of total boundedness in the covariance geometry: on a non-atomic space one can carve out infinitely many sets whose indicators remain separated in $L^2(P)$. Finite-dimensional central limit theorems still hold on each fixed finite subcollection, but there is no tight Gaussian process indexed by the whole universal class.
[quotetheorem:9861]
[citeproof:9861]
The obstruction is not merely a proof artifact. The central limit theorem for every fixed finite subcollection still holds, but the finite-dimensional limits cannot be assembled into a tight element of $\ell^\infty(\mathcal I_{\mathcal A})$. Non-atomicity is essential here: on a finite sample space, the class of all indicators is finite and hence Donsker. The theorem also does not forbid rich but structured indicator classes, such as VC classes of half-lines or rectangles under suitable entropy bounds. The next example gives a concrete separated sequence on the unit interval, showing exactly where the total-boundedness obstruction enters.
[example: All Measurable Sets On The Unit Interval]
Let $S=[0,1]$, let $P=\mathcal L^1$, and let $\mathcal A=\mathcal B([0,1])$. For $j\ge1$, define $A_j$ to be the set of points whose $j$th binary digit is $1$, with dyadic rationals assigned one binary expansion by a fixed convention. The binary digit map $x\mapsto \varepsilon_j(x)$ is Borel measurable under this convention, so
\begin{align*}
A_j=\varepsilon_j^{-1}(\{1\})\in\mathcal B([0,1]).
\end{align*}
Under Lebesgue measure, each binary digit is Bernoulli with parameter $1/2$, so
\begin{align*}
P(A_j)=P(\varepsilon_j=1)=\frac{1}{2}.
\end{align*}
If $i\ne j$, the $i$th and $j$th binary digits are independent Bernoulli coordinates. Hence
\begin{align*}
P(A_i\cap A_j^c)=P(\varepsilon_i=1,\varepsilon_j=0)=P(\varepsilon_i=1)P(\varepsilon_j=0)=\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}.
\end{align*}
Similarly,
\begin{align*}
P(A_i^c\cap A_j)=P(\varepsilon_i=0,\varepsilon_j=1)=P(\varepsilon_i=0)P(\varepsilon_j=1)=\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}.
\end{align*}
Since $A_i\triangle A_j=(A_i\cap A_j^c)\cup(A_i^c\cap A_j)$ and the two sets in this union are disjoint,
\begin{align*}
P(A_i\triangle A_j)=P(A_i\cap A_j^c)+P(A_i^c\cap A_j)=\frac{1}{4}+\frac{1}{4}=\frac{1}{2}.
\end{align*}
For indicator functions,
\begin{align*}
|\mathbf 1_{A_i}(x)-\mathbf 1_{A_j}(x)|^2=\mathbf 1_{A_i\triangle A_j}(x).
\end{align*}
Therefore
\begin{align*}
\|\mathbf 1_{A_i}-\mathbf 1_{A_j}\|_{L^2(P)}^2=P(A_i\triangle A_j)=\frac{1}{2}.
\end{align*}
Taking square roots gives
\begin{align*}
\|\mathbf 1_{A_i}-\mathbf 1_{A_j}\|_{L^2(P)}=\frac{1}{\sqrt2}.
\end{align*}
Thus $\{\mathbf 1_{A_j}:j\ge1\}$ is an infinite subset of the universal Borel indicator class whose distinct elements remain separated by the fixed distance $1/\sqrt2$. The class is therefore not totally bounded in $L^2(P)$, so the canonical-metric compactness required of a Donsker indicator class fails.
[/example]
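The separation computation can be verified exactly for the first few binary digits. The sketch below uses the convention that each point of the half-open dyadic interval $[k2^{-J},(k+1)2^{-J})$ takes its terminating expansion, so the first $J$ digits are constant on each such interval; the depth $J=8$ and the function names are illustrative assumptions.

```python
from fractions import Fraction

def digit(k, j, depth):
    """j-th binary digit (j = 1 is the leading digit) shared by all points of
    the dyadic interval [k / 2**depth, (k + 1) / 2**depth)."""
    return (k >> (depth - j)) & 1

def sym_diff_measure(i, j, depth):
    """Lebesgue measure of A_i triangle A_j, found by counting level-`depth` dyadic intervals."""
    count = sum(1 for k in range(2 ** depth) if digit(k, i, depth) != digit(k, j, depth))
    return Fraction(count, 2 ** depth)

depth = 8
measures = {
    (i, j): sym_diff_measure(i, j, depth)
    for i in range(1, depth + 1)
    for j in range(i + 1, depth + 1)
}
# Squared L^2(P) distances between distinct indicators 1_{A_i} and 1_{A_j}.
squared_distances = set(measures.values())
print(squared_distances)
```

Every pair $i\ne j$ among the first eight digits gives $P(A_i\triangle A_j)=1/2$ exactly, matching the independence calculation in the example.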
Entropy conditions from earlier chapters are sufficient conditions, but examples like the universal indicator class show why some entropy restriction is unavoidable. The sharpness question asks how close the sufficient conditions are to necessary ones, and the following theorem records the necessary compactness left by any Donsker limit.
[quotetheorem:9862]
[citeproof:9862]
This theorem does not say that every Donsker class satisfies the exact entropy integrals used as sufficient hypotheses. It gives a necessary compactness condition, not a complete entropy characterisation: a class can be totally bounded but still fail a convenient bracketing criterion, and additional measurability or asymptotic equicontinuity work may remain. The boundedness assumption also matters, since heavy-tailed envelopes can break empirical-process tightness even when the canonical index metric is small. Increasing finite classes show the distinction between pointwise and uniform central limit theory.
[example: Sharpness For Increasing Finite Classes]
Let $\mathcal F_m=\{f_1,\ldots,f_m\}$, where $|f_j|\le 1$ for each $j$. For a fixed $m$, the empirical process indexed by $\mathcal F_m$ is the random vector
\begin{align*}
\bigl(\sqrt n(P_n-P)f_1,\ldots,\sqrt n(P_n-P)f_m\bigr)\in\mathbb R^m.
\end{align*}
Each coordinate has finite variance because
\begin{align*}
P f_j^2\le P1=1.
\end{align*}
Hence the usual finite-dimensional central limit theorem gives convergence of this vector to a centered Gaussian vector with covariance entries
\begin{align*}
\operatorname{Cov}(G(f_i),G(f_j))=P(f_if_j)-Pf_i\,Pf_j.
\end{align*}
Since $\ell^\infty(\mathcal F_m)$ is just $\mathbb R^m$ with the sup norm after identifying $z$ with $(z(f_1),\ldots,z(f_m))$, this proves that each fixed finite class $\mathcal F_m$ is $P$-Donsker.
Now suppose the classes are nested inside a single limiting class
\begin{align*}
\mathcal F_\infty:=\{f_j:j\ge1\},
\end{align*}
and assume there is a number $\delta>0$ such that
\begin{align*}
\|f_i-f_j\|_{L^2(P)}\ge\delta
\end{align*}
whenever $i\ne j$. If $\mathcal F_\infty$ were totally bounded in $L^2(P)$, then for $\varepsilon=\delta/3$ there would be finitely many $L^2(P)$ balls $B(g_1,\varepsilon),\ldots,B(g_N,\varepsilon)$ covering $\mathcal F_\infty$. Because infinitely many $f_j$ are covered by only finitely many balls, two distinct functions $f_i$ and $f_j$ must lie in the same ball. The triangle inequality would then give
\begin{align*}
\|f_i-f_j\|_{L^2(P)}\le \|f_i-g_k\|_{L^2(P)}+\|g_k-f_j\|_{L^2(P)}<\frac{\delta}{3}+\frac{\delta}{3}=\frac{2\delta}{3},
\end{align*}
which contradicts $\|f_i-f_j\|_{L^2(P)}\ge\delta$. Thus the limiting union is not totally bounded.
The point is that every fixed finite projection has an ordinary Gaussian limit, but the unrestricted infinite class keeps adding directions separated by a fixed $L^2(P)$ distance. Those separated directions prevent the compactness needed for a uniform empirical-process limit.
[/example]
The chapter therefore leaves two complementary lessons. Donsker classes are stable under continuous reindexing, finite algebraic operations, Lipschitz transforms, controlled convexification, and smooth finite-dimensional parametrisation. They are not stable under unrestricted enlargement, and the failure is detected by the same entropy and canonical-metric geometry that powered the positive theorems.
# 10. Statistical Functionals and Z-Estimation
This chapter turns empirical process limit theory into asymptotic theory for estimators. The previous chapters gave tools for proving convergence of $\sqrt n(P_n-P)$ in spaces of bounded functions; here those process limits are pushed through maps, optimisation problems, and estimating equations. The guiding question is how a statistical procedure depending on the empirical measure inherits a first-order expansion from the empirical process.
The three main mechanisms are plug-in estimation, maximisation, and root-finding. Plug-in estimators are controlled by differentiability of statistical functionals, M-estimators are controlled by the local geometry of random objective functions, and Z-estimators are controlled by empirical process expansions of score maps.
## Plug-In Estimators and Hadamard Differentiability
Suppose a parameter of interest is not itself an expectation, but a functional $\phi(P)$ of the distribution. If $P_n$ is close to $P$, the plug-in estimator $\phi(P_n)$ should be close to $\phi(P)$; for asymptotic normality we need a linear first-order approximation to $\phi$ along the random direction $P_n-P$.
[definition: Statistical Functional]
A statistical functional is a map $\phi:\mathcal P\to E$ from a class $\mathcal P$ of probability measures into a normed space $E$.
[/definition]
The functional viewpoint separates the sampling step from the analytic step: empirical process theory studies $P_n-P$, while the map $\phi$ records how the target parameter is read from a distribution. This distinction leads to the estimator used throughout this section, obtained by substituting the empirical measure into the same functional.
[definition: Plug-In Estimator]
Let $X_1,\dots,X_n$ be i.i.d. with distribution $P\in\mathcal P$, and let $P_n$ be the empirical measure. For a statistical functional $\phi:\mathcal P\to E$, the plug-in estimator of $\phi(P)$ is $\phi(P_n)$, whenever $P_n\in\mathcal P$ or $\phi(P_n)$ is otherwise well-defined.
[/definition]
The plug-in definition gives a general estimator, but asymptotic normality requires more than continuity of $\phi$. Since empirical measures approach $P$ through scaled signed-measure directions, the differentiability notion must be stable along such directions rather than on the whole ambient space.
[definition: Hadamard Differentiability]
Let $D$ and $E$ be normed spaces, let $\mathbb D\subset D$, and let $\phi:\mathbb D\to E$. The map $\phi$ is Hadamard differentiable at $\theta\in\mathbb D$ tangentially to a subspace $D_0\subset D$ if there exists a continuous linear map $\phi'_\theta:D_0\to E$ such that, for every $h\in D_0$, every $t_n\downarrow 0$, and every $h_n\to h$ in $D$ with $\theta+t_nh_n\in\mathbb D$,
\begin{align*}
\frac{\phi(\theta+t_nh_n)-\phi(\theta)}{t_n}\to \phi'_\theta(h).
\end{align*}
[/definition]
The phrase "tangentially to $D_0$" matters because empirical processes may converge in a smaller set of regular directions than the ambient normed space. The next theorem is the main plug-in device: it says that this derivative is exactly the map through which the empirical process limit passes.
[quotetheorem:6354]
[citeproof:6354]
This theorem converts Donsker convergence into estimator limits, but each hypothesis is doing real work. Hadamard differentiability is stronger than ordinary directional differentiability because the empirical direction is only known approximately: if $h_n\to h$ but the difference quotient is unstable along such perturbations, a formal derivative along the exact line $\theta+th$ does not control $\phi(\theta_n)$. Tangential support is equally important; if the weak limit lives outside the tangent set on which $\phi'_\theta$ is continuous, the expression $\phi'_\theta(Z)$ is not a legitimate limiting random element. The theorem also does not prove that $r_n(\theta_n-\theta)$ converges, nor does it compute the derivative; those are separate empirical-process and analytic tasks. The empirical distribution function gives the canonical example, where the parameter is a quantile rather than a mean.
[example: Sample Quantile]
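Assume that $F$ has a density $f$ that is positive and continuous in a neighbourhood of $q_p=F^{-1}(p)$, and let $\hat q_p=F_n^{-1}(p)$. To compute the derivative of
\begin{align*}
\phi(F)=F^{-1}(p):=\inf\{x\in\mathbb R:F(x)\ge p\},
\end{align*}
perturb $F$ to $F_t=F+th_t$, where $h_t\to h$ uniformly and $h$ is continuous at $q_p$. Write $q_t=\phi(F_t)$. Since $f>0$ in a neighbourhood of $q_p$, $F$ is locally strictly increasing there, so $q_t\to q_p$. Assume for simplicity that $F_t(q_t)=p$ exactly; for perturbations with jumps this identity is replaced by two-sided inequalities that lead to the same limit. Using $F_t(q_t)=p$ and $F(q_p)=p$, we have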
\begin{align*}
0=F_t(q_t)-F(q_p)=F(q_t)-F(q_p)+t h_t(q_t).
\end{align*}
If $q_t\ne q_p$, divide by $t$ and factor the first term:
\begin{align*}
0=\frac{q_t-q_p}{t}\frac{F(q_t)-F(q_p)}{q_t-q_p}+h_t(q_t).
\end{align*}
As $q_t\to q_p$, differentiability of $F$ at $q_p$ gives
\begin{align*}
\frac{F(q_t)-F(q_p)}{q_t-q_p}\to f(q_p),
\end{align*}
and uniform convergence plus continuity gives $h_t(q_t)\to h(q_p)$. Therefore
\begin{align*}
\frac{q_t-q_p}{t}\to -\frac{h(q_p)}{f(q_p)}.
\end{align*}
Thus the Hadamard derivative of the quantile functional at $F$ in direction $h$ is
\begin{align*}
\phi'_F(h)=-\frac{h(q_p)}{f(q_p)}.
\end{align*}
Under the empirical distribution function central limit theorem,
\begin{align*}
\sqrt n(F_n-F)\xrightarrow{d}G_F
\end{align*}
in $\ell^\infty(\mathbb R)$. Applying the *Functional Delta Method* with the derivative just computed gives
\begin{align*}
\sqrt n(\hat q_p-q_p)\xrightarrow{d}-\frac{G_F(q_p)}{f(q_p)}.
\end{align*}
For the empirical distribution function limit, $G_F=\mathbb B\circ F$ for a standard Brownian bridge $\mathbb B$ on $[0,1]$, so
\begin{align*}
\operatorname{Var}\{G_F(q_p)\}=F(q_p)\{1-F(q_p)\}=p(1-p).
\end{align*}
Multiplying by the squared scale factor $1/f(q_p)^2$, the limiting variance is
\begin{align*}
\frac{p(1-p)}{f(q_p)^2}.
\end{align*}
[/example]
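Both the derivative formula and the limiting variance can be checked numerically for a concrete $F$. The sketch below assumes $F$ is the standard normal distribution function, takes the smooth direction $h(x)=\cos x$ with $h_t=h$, and solves $F(q)+th(q)=p$ by bisection; these choices and the helper names are illustrative, and the direction need not be a difference of distribution functions for the calculus to apply.

```python
import math
from statistics import NormalDist

std = NormalDist()  # F = standard normal cdf, f = standard normal density

def perturbed_quantile(p, t, h, lo, hi):
    """Solve F(q) + t * h(q) = p by bisection, assuming the left side is
    increasing on [lo, hi] and changes sign there."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if std.cdf(mid) + t * h(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def asymptotic_variance(p):
    """Limiting variance p(1 - p) / f(q_p)^2 of sqrt(n) * (sample quantile - q_p)."""
    q = std.inv_cdf(p)
    return p * (1 - p) / std.pdf(q) ** 2

p, t = 0.75, 1e-5
centre = std.inv_cdf(p)
q_p = perturbed_quantile(p, 0.0, math.cos, centre - 1.0, centre + 1.0)
q_t = perturbed_quantile(p, t, math.cos, centre - 1.0, centre + 1.0)

difference_quotient = (q_t - q_p) / t
hadamard_derivative = -math.cos(q_p) / std.pdf(q_p)   # -h(q_p) / f(q_p)
variance_median = asymptotic_variance(0.5)
print(difference_quotient, hadamard_derivative, variance_median)
```

The difference quotient agrees with $-h(q_p)/f(q_p)$ up to an error of order $t$, and for the standard normal median the limiting variance is $\tfrac14\cdot 2\pi=\pi/2$.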
Quantiles show that nonlinear functionals may have simple linear derivatives. In survival analysis the same principle applies to more elaborate maps, although the analytic derivative is carried by product-limit structure.
[example: Kaplan-Meier Type Functional]
Let $X$ be the failure time, $C$ the censoring time, $T=X\wedge C$, and $\Delta=\mathbf 1\{X\le C\}$. Write
\begin{align*}
H_1(t)=\mathbb P(T\le t,\Delta=1)
\end{align*}
and
\begin{align*}
H(t)=\mathbb P(T\le t).
\end{align*}
On an interval $[0,\tau]$ with $\inf_{u\le \tau}\{1-H(u-)\}>0$, the cumulative hazard functional is
\begin{align*}
\Lambda(H_1,H)(t)=\int_{[0,t]}\frac{dH_1(u)}{1-H(u-)}.
\end{align*}
The corresponding survival functional is $S(t)=\prod_{u\le t}\{1-d\Lambda(u)\}$, and the empirical plug-in version obtained by replacing $(H_1,H)$ by $(H_{1n},H_n)$ is the Kaplan-Meier product-limit estimator.
To see the first-order map, perturb $(H_1,H)$ to $(H_1+ta,H+tb)$ and put $R(u)=1-H(u-)$. For the hazard part,
\begin{align*}
\Lambda(H_1+ta,H+tb)(s)=\int_{[0,s]}\frac{dH_1(u)+t\,da(u)}{R(u)-t b(u-)}.
\end{align*}
Subtracting $\Lambda(H_1,H)(s)$ and dividing by $t$ gives
\begin{align*}
\frac{\Lambda(H_1+ta,H+tb)(s)-\Lambda(H_1,H)(s)}{t}=\int_{[0,s]}\frac{da(u)}{R(u)-t b(u-)}+\int_{[0,s]}\frac{b(u-)}{R(u)\{R(u)-t b(u-)\}}\,dH_1(u).
\end{align*}
Because $R$ is bounded away from $0$ on $[0,\tau]$, the denominators stay bounded away from $0$ for small $t$, so the derivative of the hazard map is
\begin{align*}
\dot\Lambda_{H_1,H}(a,b)(s)=\int_{[0,s]}\frac{da(u)}{1-H(u-)}+\int_{[0,s]}\frac{b(u-)}{\{1-H(u-)\}^2}\,dH_1(u).
\end{align*}
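At continuity points of the limiting survival curve, the product-integral map is differentiable in the hazard direction. In the continuous-hazard case $S(t)=e^{-\Lambda(t)}$; perturbing $\Lambda$ to $\Lambda+\eta\ell$, where $\eta\to0$ is the perturbation size (distinct from the time argument $t$) and $\ell$ will be the hazard derivative $\dot\Lambda_{H_1,H}(a,b)$, gives the ordinary calculation
\begin{align*}
\frac{e^{-\{\Lambda(t)+\eta\ell(t)\}}-e^{-\Lambda(t)}}{\eta}=e^{-\Lambda(t)}\frac{e^{-\eta\ell(t)}-1}{\eta}\to -S(t)\ell(t).
\end{align*}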
Thus the derivative of the Kaplan-Meier survival functional in direction $(a,b)$ is
\begin{align*}
\dot S_{H_1,H}(a,b)(t)=-S(t)\dot\Lambda_{H_1,H}(a,b)(t)
\end{align*}
in the continuous case, with the analogous product-integral derivative at non-jump endpoints. Therefore, once the joint empirical process
\begin{align*}
\sqrt n\{(H_{1n},H_n)-(H_1,H)\}
\end{align*}
converges weakly, the *Functional Delta Method* sends that Gaussian limit through the displayed derivative and gives weak convergence of $\sqrt n(\hat S_n-S)$ on $[0,\tau]$.
[/example]
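The plug-in structure of the example can be made concrete on a small data set. The sketch below is illustrative only (the function name `hazard_and_survival` is ours): it evaluates $d\Lambda_n(u)=dH_{1n}(u)/\{1-H_n(u-)\}$ as the number of observed failures at $u$ divided by the number still at risk, and multiplies the factors $1-d\Lambda_n(u)$ to obtain the product-limit estimator.

```python
def hazard_and_survival(times, deltas):
    """Plug-in cumulative hazard and product-limit survival at the distinct observed times.

    times  : observed T_i = min(X_i, C_i)
    deltas : 1 if the failure was observed (X_i <= C_i), 0 if censored
    Returns a list of tuples (u, Lambda_n(u), S_n(u)).
    """
    out = []
    cum_hazard, survival = 0.0, 1.0
    for u in sorted(set(times)):
        # n * dH_{1n}(u): number of observed failures exactly at time u
        events = sum(1 for t, d in zip(times, deltas) if t == u and d == 1)
        # n * {1 - H_n(u-)}: number of subjects with T_i >= u (at least one, since u is observed)
        at_risk = sum(1 for t in times if t >= u)
        increment = events / at_risk            # dLambda_n(u)
        cum_hazard += increment
        survival *= 1.0 - increment             # product-limit factor 1 - dLambda_n(u)
        out.append((u, cum_hazard, survival))
    return out


if __name__ == "__main__":
    # Four subjects; the second is censored at time 2.
    for row in hazard_and_survival([1, 2, 3, 4], [1, 0, 1, 1]):
        print(row)
```

With times $(1,2,3,4)$ and the second observation censored, the hazard increments are $1/4$, $0$, $1/2$, $1$, so the cumulative hazard is $(0.25, 0.25, 0.75, 1.75)$ and the survival estimate is $(0.75, 0.75, 0.375, 0)$.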
This example also signals a practical theme: once a functional derivative has been computed, the empirical process result supplies the asymptotic distribution with little additional probabilistic work.
## M-Estimators and Stochastic Equicontinuity
Many estimators are defined by optimising a criterion rather than applying an explicit formula. The problem is to show that if a random objective $M_n$ approximates a deterministic objective $M$, then maximisers of $M_n$ approximate maximisers of $M$, and that their local fluctuations are governed by an empirical process.
[definition: M-Estimator]
Let $\Theta$ be a parameter space and let $M_n:\Theta\to\mathbb R$ be a random criterion. An M-estimator is a measurable estimator $\hat\theta_n\in\Theta$ satisfying
\begin{align*}
M_n(\hat\theta_n)\ge \sup_{\theta\in\Theta}M_n(\theta)-o_{\mathbb P}(1).
\end{align*}
[/definition]
The deterministic target is usually the unique maximiser of $M(\theta)=\mathbb E[m_\theta(X)]$, but the definition alone does not say that approximate maximisers of $M_n$ converge to it. The obstruction is that random objectives may have narrow spikes, nearly flat remote regions, or escaping maximisers even when each fixed value $M_n(\theta)$ is close to $M(\theta)$. Consistency requires a uniform transfer of the deterministic separation around the true maximiser to the random criterion.
[quotetheorem:9863]
[citeproof:9863]
The argmax theorem gives consistency of the random optimiser, and its assumptions rule out several common failures. Uniqueness is needed because if $M$ has two separated maximisers, small random perturbations can make $\hat\theta_n$ jump between them rather than converge to a single point. Separation is stronger than merely having a unique maximiser: without a positive gap away from each neighbourhood of $\theta_0$, nearly optimal points may drift far from $\theta_0$ while losing only a negligible amount of criterion value. Compact containment prevents escape to infinity, and local uniform convergence is what lets deterministic gaps in $M$ survive in $M_n$; pointwise convergence alone can miss narrow random spikes that create false maximisers. The theorem is only a consistency result, so it gives no rate and no limiting distribution. For rates and limits, the empirical process must also be stable when its index is the random function $m_{\hat\theta_n}$ rather than the fixed function $m_{\theta_0}$, which motivates the following local continuity condition.
[definition: Stochastic Equicontinuity]
Let $(\mathcal F,d)$ be a semimetric class of measurable real-valued functions on a measurable space $(\mathcal X,\mathcal A)$. For each $n$, let $Z_n:\mathcal F\to\mathbb R$ be a random map, meaning that $Z_n(f)$ is a real-valued random variable for each $f\in\mathcal F$, and assume the displayed local suprema are measurable or are interpreted using outer probability. The sequence $(Z_n)$ is stochastically equicontinuous at $f_0\in\mathcal F$ with respect to $d$ if for every $\varepsilon>0$,
\begin{align*}
\lim_{\delta\downarrow0}\limsup_{n\to\infty}\mathbb P\left(\sup_{d(f,f_0)<\delta}|Z_n(f)-Z_n(f_0)|>\varepsilon\right)=0.
\end{align*}
[/definition]
This condition prevents empirical noise from changing discontinuously as the parameter moves over a shrinking neighbourhood. The following lemma is the form used in M-estimation: consistency in parameter space becomes permission to replace the random index by its limit inside the empirical process.
[quotetheorem:9864]
[citeproof:9864]
The lemma is the workhorse behind replacing a random index by a deterministic one in first-order expansions, and its two assumptions address different dangers. The convergence $d_P(m_{\hat\theta_n},m_{\theta_0})\to0$ says that consistency in parameter space is visible to the empirical process; without it, the random index may remain far away in covariance distance even when $\hat\theta_n$ is numerically close to $\theta_0$. Stochastic equicontinuity rules out rough indexed classes in which arbitrarily nearby functions can have empirical fluctuations of order one; discontinuous threshold classes or non-Donsker enlargements can produce exactly this failure. The lemma does not prove consistency of $\hat\theta_n$, does not prove a central limit theorem, and does not identify the limiting distribution of the criterion. It only justifies the local substitution $m_{\hat\theta_n}\rightsquigarrow m_{\theta_0}$ inside a process that is already known to be locally stable. The least-squares criterion shows how it interacts with familiar regression geometry.
[example: Least Squares with Random Design]
Let $\Sigma=\mathbb E[ZZ^\top]$ and suppose $Y=Z^\top\beta_0+\varepsilon$ with $\mathbb E[\varepsilon\mid Z]=0$. For $d=\beta-\beta_0$,
\begin{align*}
M(\beta)=-\mathbb E\{Y-Z^\top\beta\}^2=-\mathbb E\{\varepsilon-Z^\top d\}^2.
\end{align*}
Expanding the square gives
\begin{align*}
M(\beta)=-\mathbb E[\varepsilon^2]+2\mathbb E[\varepsilon Z^\top d]-\mathbb E[(Z^\top d)^2].
\end{align*}
Since $\mathbb E[\varepsilon Z]=\mathbb E\{Z\mathbb E[\varepsilon\mid Z]\}=0$ and $\mathbb E[(Z^\top d)^2]=d^\top\Sigma d$,
\begin{align*}
M(\beta)-M(\beta_0)=-d^\top\Sigma d.
\end{align*}
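Assume that $\Sigma$ is nonsingular. Since $d^\top\Sigma d=\mathbb E[(Z^\top d)^2]\ge0$, the matrix $\Sigma$ is positive semidefinite, and a nonsingular positive semidefinite matrix is positive definite, so $d^\top\Sigma d>0$ whenever $d\ne0$. Hence $M$ is uniquely maximised at $\beta_0$.
Now let $\hat\beta_n$ maximise the empirical criterion $M_n(\beta)=-P_n(Y-Z^\top\beta)^2$, equivalently minimise $P_n(Y-Z^\top\beta)^2$. The gradient of the squared-error criterion is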
\begin{align*}
\nabla_\beta P_n(Y-Z^\top\beta)^2=P_n\{-2Z(Y-Z^\top\beta)\}.
\end{align*}
Thus the normal equations are
\begin{align*}
P_nZ(Y-Z^\top\hat\beta_n)=0.
\end{align*}
Substituting $Y=Z^\top\beta_0+\varepsilon$ gives
\begin{align*}
0=P_nZ\{Z^\top\beta_0+\varepsilon-Z^\top\hat\beta_n\}.
\end{align*}
Rearranging,
\begin{align*}
P_n(ZZ^\top)(\hat\beta_n-\beta_0)=P_n(Z\varepsilon).
\end{align*}
On events where $P_n(ZZ^\top)$ is invertible,
\begin{align*}
\sqrt n(\hat\beta_n-\beta_0)=\{P_n(ZZ^\top)\}^{-1}\sqrt n\,P_n(Z\varepsilon).
\end{align*}
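The law of large numbers gives $P_n(ZZ^\top)\to\Sigma$ in probability, so by continuity of matrix inversion at the nonsingular matrix $\Sigma$ the inverse converges in probability to $\Sigma^{-1}$. Also $\mathbb E[Z\varepsilon]=0$, and, provided $\mathbb E[\varepsilon^2\|Z\|^2]<\infty$, the multivariate central limit theorem gives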
\begin{align*}
\sqrt n\,P_n(Z\varepsilon)\xrightarrow{d}\mathcal N(0,\Omega),
\end{align*}
where
\begin{align*}
\Omega=\mathbb E[\varepsilon^2ZZ^\top].
\end{align*}
Therefore Slutsky's theorem yields
\begin{align*}
\sqrt n(\hat\beta_n-\beta_0)\xrightarrow{d}\mathcal N(0,\Sigma^{-1}\Omega\Sigma^{-1}).
\end{align*}
In this quadratic model the first-order expansion is exact once the normal equations are written down; stochastic equicontinuity is the condition that lets the same local empirical-process replacement survive in less algebraically explicit M-estimation problems.
[/example]
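The exact expansion and the sandwich form $\Sigma^{-1}\Omega\Sigma^{-1}$ can be illustrated numerically. The sketch below makes simplifying assumptions that are not in the example: the regressor is scalar, $Z\sim\mathcal N(0,1)$, and the error is $\varepsilon=Z\eta$ with $\eta\sim\mathcal N(0,1)$ independent of $Z$, so that $\mathbb E[\varepsilon\mid Z]=0$, $\Sigma=1$, and $\Omega=\mathbb E[Z^4]\,\mathbb E[\eta^2]=3$. All function names are illustrative.

```python
import random


def simulate(n, beta0=2.0, seed=0):
    """Draw (Z_i, eps_i, Y_i) with scalar Z and heteroskedastic errors eps = Z * eta."""
    rng = random.Random(seed)
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eps = [z * rng.gauss(0.0, 1.0) for z in zs]   # E[eps | Z] = 0, Var(eps | Z) = Z^2
    ys = [beta0 * z + e for z, e in zip(zs, eps)]
    return zs, eps, ys


def least_squares(zs, ys):
    """Solve the scalar normal equation P_n Z (Y - Z beta) = 0."""
    return sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)


def sandwich_variance(zs, ys, beta_hat):
    """Plug-in Sigma^{-1} Omega Sigma^{-1}, with residuals standing in for eps."""
    n = len(zs)
    sigma_hat = sum(z * z for z in zs) / n
    omega_hat = sum((z * (y - z * beta_hat)) ** 2 for z, y in zip(zs, ys)) / n
    return omega_hat / sigma_hat ** 2


if __name__ == "__main__":
    zs, eps, ys = simulate(20000)
    beta_hat = least_squares(zs, ys)
    # In this design the sandwich limit is 3, while the homoskedastic formula would give 1.
    print(beta_hat, sandwich_variance(zs, ys, beta_hat))
```

The design is chosen so that the sandwich variance differs visibly from the homoskedastic value $\mathbb E[\varepsilon^2]\Sigma^{-1}=1$; with homoskedastic errors the two would coincide.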
Optimisation arguments often hide an estimating equation in the first-order condition. Making that equation explicit leads to Z-estimation.
## Z-Estimators and Empirical Process Expansions
A Z-estimator is defined as an approximate zero of a random map. The basic question is how to turn an equation $\Psi_n(\hat\theta_n)\approx0$ into a linear expansion for $\hat\theta_n-\theta_0$ when $\Psi_n$ itself is an empirical average.
[definition: Z-Estimator]
Let $\Theta\subset\mathbb R^p$, let $\psi_\theta:\mathcal X\to\mathbb R^p$ be a measurable estimating function for each $\theta\in\Theta$, let $\Psi:\Theta\to\mathbb R^p$ be the population map
\begin{align*}
\Psi(\theta)=\mathbb E[\psi_\theta(X)],
\end{align*}
and let $\Psi_n:\Theta\to\mathbb R^p$ be a random map, typically the empirical average $\Psi_n(\theta)=P_n\psi_\theta$. Suppose $\theta_0\in\Theta$ solves the population equation $\Psi(\theta_0)=0$. A Z-estimator for $\theta_0$ is a measurable estimator $\hat\theta_n\in\Theta$ such that
\begin{align*}
\Psi_n(\hat\theta_n)=o_{\mathbb P}(n^{-1/2}).
\end{align*}
[/definition]
The scale $n^{-1/2}$ is chosen because the empirical fluctuation of $P_n\psi_{\theta_0}$ is typically of that order. The obstruction is that an approximate zero of $\Psi_n$ need not automatically inherit the central limit theorem for $\Psi_n(\theta_0)$: the estimator moves with the data, and this movement changes both the deterministic mean $\Psi(\theta)$ and the random score class. A normal limit follows only when the population equation is locally linear and the stochastic change from $\theta_0$ to $\hat\theta_n$ is negligible at the same scale.
[quotetheorem:9865]
[citeproof:9865]
The theorem separates the analytic derivative $A$ from the probabilistic limit $Z$, and each hypothesis protects one part of the linearisation. Consistency is needed because differentiability of $\Psi$ at $\theta_0$ only controls a local neighbourhood; an approximate zero far away may solve a different branch of the estimating equation. Nonsingularity of $A$ is the identifiability condition at first order: if $A$ has a nontrivial kernel, the equation may not determine movement in that direction at the $n^{-1/2}$ scale, and slower or nonnormal limits can occur. The empirical-process equicontinuity condition removes the random change in the score class between $\hat\theta_n$ and $\theta_0$; without it, the stochastic remainder can be the same size as the linear term. The theorem does not prove existence of a root, consistency of the selected root, or validity under nonsmooth estimating equations, so those must be established separately in applications. In empirical process language, the equicontinuity condition says that the class $\{\psi_\theta:\theta\text{ near }\theta_0\}$ is sufficiently regular near the true parameter.
[example: Logistic Regression Score Equations]
Let $(Y_i,Z_i)$ be i.i.d., with $Y_i\in\{0,1\}$ and $Z_i\in\mathbb R^p$. In the logistic model
\begin{align*}
\mathbb P(Y=1\mid Z)=\pi_{\beta_0}(Z),\qquad \pi_\beta(z)=\{1+e^{-z^\top\beta}\}^{-1},
\end{align*}
the empirical estimating equation is
\begin{align*}
\Psi_n(\beta)=P_n\left[Z\{Y-\pi_\beta(Z)\}\right]=0.
\end{align*}
The population map is
\begin{align*}
\Psi(\beta)=\mathbb E\left[Z\{Y-\pi_\beta(Z)\}\right].
\end{align*}
At $\beta_0$, the conditional mean assumption gives
\begin{align*}
\Psi(\beta_0)=\mathbb E\left[Z\{Y-\pi_{\beta_0}(Z)\}\right]=\mathbb E\left[Z\,\mathbb E\{Y-\pi_{\beta_0}(Z)\mid Z\}\right]=0.
\end{align*}
For the derivative, fix coordinates $j,k$. Since
\begin{align*}
\frac{\partial}{\partial\beta_k}\pi_\beta(z)=\frac{e^{-z^\top\beta}z_k}{\{1+e^{-z^\top\beta}\}^2}=\pi_\beta(z)\{1-\pi_\beta(z)\}z_k,
\end{align*}
the $(j,k)$ entry of $J\Psi_{\beta_0}$ is
\begin{align*}
\frac{\partial}{\partial\beta_k}\mathbb E\left[Z_j\{Y-\pi_\beta(Z)\}\right]\bigg|_{\beta=\beta_0}=-\mathbb E\left[Z_jZ_k\pi_{\beta_0}(Z)\{1-\pi_{\beta_0}(Z)\}\right].
\end{align*}
Thus
\begin{align*}
J\Psi_{\beta_0}=-\mathbb E\left[ZZ^\top\pi_{\beta_0}(Z)\{1-\pi_{\beta_0}(Z)\}\right].
\end{align*}
Put
\begin{align*}
A=-J\Psi_{\beta_0}=\mathbb E\left[ZZ^\top\pi_{\beta_0}(Z)\{1-\pi_{\beta_0}(Z)\}\right].
\end{align*}
At the true parameter,
\begin{align*}
\sqrt n\,\Psi_n(\beta_0)=\sqrt n\,P_n\left[Z\{Y-\pi_{\beta_0}(Z)\}\right].
\end{align*}
The summand has mean zero, as shown above, and covariance
\begin{align*}
B=\operatorname{Var}\left(Z\{Y-\pi_{\beta_0}(Z)\}\right).
\end{align*}
Hence the multivariate central limit theorem gives
\begin{align*}
\sqrt n\,\Psi_n(\beta_0)\xrightarrow{d}\mathcal N(0,B).
\end{align*}
If $A$ is nonsingular, $\hat\beta_n\xrightarrow{\mathbb P}\beta_0$, and the local Donsker and envelope assumptions give the required stochastic equicontinuity of the score class, then *[Asymptotic Normality of Z-Estimators](/theorems/9865)* applies with $J\Psi_{\beta_0}=-A$. Therefore
\begin{align*}
\sqrt n(\hat\beta_n-\beta_0)\xrightarrow{d}-(-A)^{-1}\mathcal N(0,B)=A^{-1}\mathcal N(0,B).
\end{align*}
Multiplying a mean-zero normal vector with covariance $B$ by $A^{-1}$ gives covariance $A^{-1}BA^{-\top}$, so
\begin{align*}
\sqrt n(\hat\beta_n-\beta_0)\xrightarrow{d}\mathcal N(0,A^{-1}BA^{-\top}).
\end{align*}
The example shows the Z-estimation pattern explicitly: the empirical score supplies the Gaussian term, while the derivative of the population score supplies the linear inverse $A^{-1}$.
[/example]
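The score equation can be solved numerically by Newton's method, which uses exactly the two ingredients of the example: the empirical score $\Psi_n$ and the matrix $A$. The following is a minimal sketch assuming a scalar covariate $Z\sim\mathcal N(0,1)$ and no intercept; these choices, the starting value, and all function names are ours. It also forms the plug-in variance $A_n^{-1}B_nA_n^{-1}$.

```python
import math
import random


def pi(beta, z):
    """Logistic response probability pi_beta(z) for a scalar covariate z."""
    return 1.0 / (1.0 + math.exp(-z * beta))


def score(beta, zs, ys):
    """Psi_n(beta) = P_n[Z {Y - pi_beta(Z)}]."""
    return sum(z * (y - pi(beta, z)) for z, y in zip(zs, ys)) / len(zs)


def information(beta, zs):
    """A_n(beta) = P_n[Z^2 pi_beta(Z) {1 - pi_beta(Z)}], minus the derivative of Psi_n."""
    return sum(z * z * pi(beta, z) * (1.0 - pi(beta, z)) for z in zs) / len(zs)


def solve_score_equation(zs, ys, beta=0.0, tol=1e-12, max_iter=50):
    """Newton iteration beta <- beta + A_n(beta)^{-1} Psi_n(beta)."""
    for _ in range(max_iter):
        step = score(beta, zs, ys) / information(beta, zs)
        beta += step
        if abs(step) < tol:
            break
    return beta


def sandwich_variance(beta, zs, ys):
    """Plug-in A_n^{-1} B_n A_n^{-1}; B_n is a second moment since the score has mean zero."""
    a_n = information(beta, zs)
    b_n = sum((z * (y - pi(beta, z))) ** 2 for z, y in zip(zs, ys)) / len(zs)
    return b_n / a_n ** 2


def simulate(n, beta0, seed=0):
    """Draw (Z_i, Y_i) from the logistic model with true parameter beta0."""
    rng = random.Random(seed)
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < pi(beta0, z) else 0 for z in zs]
    return zs, ys


if __name__ == "__main__":
    zs, ys = simulate(5000, beta0=1.0)
    beta_hat = solve_score_equation(zs, ys)
    print(beta_hat, sandwich_variance(beta_hat, zs, ys), 1.0 / information(beta_hat, zs))
```

When the model is correctly specified, $\operatorname{Var}(Y\mid Z)=\pi_{\beta_0}(Z)\{1-\pi_{\beta_0}(Z)\}$ implies $B=A$, so the sandwich $A^{-1}BA^{-\top}$ reduces to $A^{-1}$; the two printed variance estimates should therefore be close.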
This example also explains why score equations are a natural home for empirical process methods: the random part is an empirical average indexed by nearby parameters, and the deterministic part is the derivative of the population score.
[remark: Influence Function Viewpoint]
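When the limit $Z$ can be written as a Gaussian limit of $\sqrt nP_n\varphi_{\theta_0}$ with $P\varphi_{\theta_0}=0$, and $A$ denotes the derivative of $\Psi$ at $\theta_0$ as in the theorem (the logistic example used the opposite sign convention $A=-J\Psi_{\beta_0}$), the expansion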
\begin{align*}
\sqrt n(\hat\theta_n-\theta_0)=-A^{-1}\sqrt nP_n\varphi_{\theta_0}+o_{\mathbb P}(1)
\end{align*}
identifies $-A^{-1}\varphi_{\theta_0}$ as the influence function. This representation is the bridge to semiparametric efficiency and bootstrap validity.
[/remark]
The chapter has reduced three estimator constructions to the same pattern. Find a deterministic first-order derivative, prove the empirical fluctuation is tight and locally stable, and then apply a continuous mapping, argmax, or linearisation theorem to obtain the limiting distribution.
# 11. Bootstrap and Multiplier Empirical Processes
Bootstrap methods ask whether the data can estimate not only an unknown distributional target, but also the sampling law of the empirical process itself. In the preceding chapters, Donsker theorems and equicontinuity criteria described limits of $G_n=\sqrt n(P_n-P)$ in $\ell^\infty(\mathcal F)$. This chapter studies two resampling versions of that process: the nonparametric bootstrap process based on resampled observations, and the multiplier process based on external random weights. The main question is when these conditional processes converge to the same $P$-Brownian bridge as the original empirical process, so that suprema and confidence bands can be calibrated from the data.
## The Nonparametric Bootstrap Empirical Process
The first resampling method replaces the unknown distribution $P$ by the empirical measure $P_n$ and then draws a new sample from $P_n$. The problem is conditional: after observing the data, the bootstrap sample is random, and we want its conditional law to approximate the unconditional law of the original empirical process.
[definition: Bootstrap Empirical Measure]
Let $X_1,\dots,X_n$ be i.i.d. random variables with distribution $P$ on a measurable space $(S,\mathcal S)$. Conditional on $X_1,\dots,X_n$, let $X_1^*,\dots,X_n^*$ be i.i.d. with distribution $P_n=n^{-1}\sum_{i=1}^n\delta_{X_i}$. The bootstrap empirical measure is the random probability measure $P_n^*$ on $(S,\mathcal S)$, that is, the random map $P_n^*:\mathcal S\to[0,1]$ defined by
\begin{align*}
P_n^*(A)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_A(X_i^*),\qquad A\in\mathcal S.
\end{align*}
Its associated integration functional maps measurable real-valued functions $f:S\to\mathbb R$ with $P_n^*|f|<\infty$ to $\mathbb R$ by
\begin{align*}
P_n^*f=\frac{1}{n}\sum_{i=1}^n f(X_i^*).
\end{align*}
[/definition]
The bootstrap empirical measure is the random probability measure obtained by resampling from the observed data. To compare it with the original empirical process, we need to subtract its own conditional mean, since conditionally on the data the law of each $X_i^*$ is $P_n$ rather than $P$. This produces the process whose conditional fluctuations should mimic the fluctuations of $P_n$ around $P$.
[definition: Bootstrap Empirical Process]
For a class $\mathcal F$ of measurable functions $f:S\to\mathbb R$, the nonparametric bootstrap empirical process is the conditional random map $G_n^*:\mathcal F\to\mathbb R$ defined by
\begin{align*}
G_n^*f=\sqrt n(P_n^*-P_n)f,\qquad f\in\mathcal F.
\end{align*}
It is viewed conditionally on $X_1,\dots,X_n$ as a random element of $\ell^\infty(\mathcal F)$ whenever $\sup_{f\in\mathcal F}|G_n^*f|<\infty$ and the required measurability holds.
[/definition]
A useful way to read $P_n^*$ is through multinomial counts. If $N_i^*$ is the number of times $X_i$ appears in the bootstrap sample, then $(N_1^*,\dots,N_n^*)\sim\operatorname{Multinomial}(n;1/n,\dots,1/n)$ conditionally on the data, and
\begin{align*}
G_n^*f=\frac{1}{\sqrt n}\sum_{i=1}^n (N_i^*-1)f(X_i).
\end{align*}
This representation shows that the bootstrap process is a randomly weighted empirical process with dependent exchangeable weights.
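The multinomial representation is an exact algebraic identity, which the following sketch checks on one resample. The helper names are illustrative; `values` stands for the observed numbers $f(X_1),\dots,f(X_n)$ and `idx` for the resampled indices.

```python
import math
import random
from collections import Counter


def bootstrap_process_direct(values, idx):
    """sqrt(n) (P_n^* - P_n) f computed from the resampled values f(X_i^*)."""
    n = len(values)
    star_mean = sum(values[i] for i in idx) / n
    return math.sqrt(n) * (star_mean - sum(values) / n)


def bootstrap_process_counts(values, idx):
    """The same quantity via multinomial counts: n^{-1/2} sum_i (N_i^* - 1) f(X_i)."""
    n = len(values)
    counts = Counter(idx)                     # N_i^* = number of times index i was drawn
    return sum((counts[i] - 1) * values[i] for i in range(n)) / math.sqrt(n)


if __name__ == "__main__":
    rng = random.Random(0)
    values = [rng.random() for _ in range(25)]
    idx = [rng.randrange(25) for _ in range(25)]   # one bootstrap resample of indices
    print(bootstrap_process_direct(values, idx), bootstrap_process_counts(values, idx))
```

The two numbers agree up to floating-point rounding, and the counts sum to $n$, which is the source of the dependence between the weights $N_i^*-1$.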
[example: Bootstrap Mean]
Let $Y_i=f(X_i)$ and $\bar Y_n=P_nf=n^{-1}\sum_{i=1}^nY_i$. Conditional on $X_1,\dots,X_n$, the variables $Y_1^*,\dots,Y_n^*$ are i.i.d. draws from the empirical distribution putting mass $1/n$ at each $Y_i$, so
\begin{align*}
G_n^*f=\sqrt n(P_n^*-P_n)f
=\frac{1}{\sqrt n}\sum_{j=1}^n(Y_j^*-\bar Y_n).
\end{align*}
The conditional mean is zero because
\begin{align*}
\mathbb E^*[Y_j^*-\bar Y_n]=\mathbb E^*[Y_j^*]-\bar Y_n=P_nf-P_nf=0.
\end{align*}
For the conditional variance, independence of the bootstrap draws gives
\begin{align*}
\operatorname{Var}^*(G_n^*f)
=\operatorname{Var}^*\left(\frac{1}{\sqrt n}\sum_{j=1}^n(Y_j^*-\bar Y_n)\right)
=\frac{1}{n}\sum_{j=1}^n\operatorname{Var}^*(Y_j^*-\bar Y_n).
\end{align*}
Since all summands have the same conditional law,
\begin{align*}
\operatorname{Var}^*(G_n^*f)=\operatorname{Var}^*(Y_1^*)=\mathbb E^*[(Y_1^*-\bar Y_n)^2].
\end{align*}
Expanding the empirical expectation,
\begin{align*}
\mathbb E^*[(Y_1^*-\bar Y_n)^2]=\frac{1}{n}\sum_{i=1}^n(Y_i-\bar Y_n)^2=P_n(f-P_nf)^2.
\end{align*}
Also,
\begin{align*}
P_n(f-P_nf)^2=P_nf^2-2(P_nf)^2+(P_nf)^2=P_nf^2-(P_nf)^2.
\end{align*}
Assuming $Pf^2<\infty$, the [weak law of large numbers](/theorems/1851) gives $P_nf\to Pf$ and $P_nf^2\to Pf^2$ in probability, hence
\begin{align*}
P_n(f-P_nf)^2\xrightarrow{\mathbb P}Pf^2-(Pf)^2=P(f-Pf)^2.
\end{align*}
Thus the conditional [bootstrap mean](/theorems/1991) is centred and has conditional variance converging to the same variance as $G_nf=\sqrt n(P_n-P)f$; the one-dimensional conditional central limit theorem therefore gives the same Gaussian limit, namely $\mathcal N(0,P(f-Pf)^2)$.
[/example]
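The conditional mean and variance computed in the example can be compared with Monte Carlo resampling. This is a sketch under the assumption that the observed values $Y_i=f(X_i)$ are already available as a list; the function names are illustrative.

```python
import math
import random


def empirical_variance(ys):
    """P_n (f - P_n f)^2, with divisor n rather than n - 1."""
    n = len(ys)
    ybar = sum(ys) / n
    return sum((y - ybar) ** 2 for y in ys) / n


def bootstrap_draws(ys, reps, seed=0):
    """Monte Carlo draws of G_n^* f = sqrt(n) (resample mean - sample mean), data held fixed."""
    rng = random.Random(seed)
    n = len(ys)
    ybar = sum(ys) / n
    draws = []
    for _ in range(reps):
        star = rng.choices(ys, k=n)          # i.i.d. draws from the empirical distribution
        draws.append(math.sqrt(n) * (sum(star) / n - ybar))
    return draws


if __name__ == "__main__":
    rng = random.Random(2)
    ys = [rng.expovariate(1.0) for _ in range(50)]
    draws = bootstrap_draws(ys, reps=4000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
    print(mean, var, empirical_variance(ys))
```

The simulated mean should be near zero and the simulated variance near $P_n(f-P_nf)^2$, up to Monte Carlo error from the finite number of resamples.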
The one-function calculation is not enough for confidence bands, because bands depend on the whole path of the process. We therefore need a notion of convergence for random conditional laws in a function space. The next definition turns conditional approximation into a bounded-Lipschitz distance between the bootstrap law and the target weak limit.
[definition: Conditional Weak Convergence in Probability]
Let $Z_n^*$ be random elements defined conditionally on data $X_1,\dots,X_n$, and let $Z$ be a tight Borel random element of a metric space $(E,d)$. We write $Z_n^*\rightsquigarrow Z$ conditionally in probability if
\begin{align*}
\sup_{h\in\operatorname{BL}_1(E)}\left|\mathbb E^*[h(Z_n^*)]-\mathbb E[h(Z)]\right|\xrightarrow{\mathbb P}0,
\end{align*}
where $\mathbb E^*$ denotes conditional expectation given the data and $\operatorname{BL}_1(E)$ is the set of functions $h:E\to\mathbb R$ with $\|h\|_\infty\le 1$ and Lipschitz constant at most $1$.
[/definition]
The bounded-Lipschitz metric is used because it metrises weak convergence for tight laws on separable metric spaces. With this definition in place, the main bootstrap question becomes sharp: under what assumptions does the conditional law of $G_n^*$ approach the same Brownian bridge law as $G_n$? The answer is that the Donsker condition is stable under nonparametric resampling.
[quotetheorem:6353]
[citeproof:6353]
The theorem says that, for Donsker classes, the empirical distribution contains enough local information to reproduce the Gaussian limit of the empirical process. The Donsker hypothesis is essential: for a class as large as all measurable indicator functions on $[0,1]$, the empirical process is not tight in $\ell^\infty(\mathcal F)$, and no conditional bootstrap law can converge to a tight Brownian-bridge limit on that index set. The square-integrable envelope rules out a different failure, where a few extreme observations dominate the resampled sums and prevent Gaussian conditional fluctuations. The result also does not give finite-sample accuracy or bootstrap validity for arbitrary discontinuous functionals of the path; those require separate continuous mapping or anti-concentration arguments. The next section isolates the tightness mechanism behind the theorem, namely conditional control of oscillations over small $d_P$-balls.
## Conditional Asymptotic Equicontinuity
Finite-dimensional bootstrap convergence is rarely the hard part. The real obstacle is uniformity over $\mathcal F$: small changes in the index under $d_P$ must produce small changes in the bootstrap process, uniformly and conditionally on the data.
[definition: Conditional Asymptotic Equicontinuity]
Let $(\mathcal F,d)$ be a semimetric space. A sequence of conditional processes $Z_n^*:\mathcal F\to\mathbb R$, viewed as conditional random elements of $\ell^\infty(\mathcal F)$, is conditionally asymptotically equicontinuous in probability if, for every $\varepsilon>0$,
\begin{align*}
\lim_{\delta\downarrow 0}\limsup_{n\to\infty}\mathbb P^*\left(\sup_{d(f,g)<\delta}|Z_n^*f-Z_n^*g|>\varepsilon\right)=0
\end{align*}
in outer probability.
[/definition]
This definition is the conditional version of the oscillation control used in ordinary empirical process tightness. It lets us reduce a process indexed by many functions to a finite net, provided the process does not move much inside small metric balls. The next criterion packages this reduction into a reusable route from finite-dimensional convergence to weak convergence in $\ell^\infty(\mathcal F)$.
[quotetheorem:9866]
[citeproof:9866]
The criterion isolates the two roles of entropy. Total boundedness is needed because finite-dimensional convergence only controls finitely many coordinates; if $\mathcal F$ contains infinitely many functions separated by a fixed $d_P$ distance, a finite net cannot approximate the path uniformly. Conditional equicontinuity is also a genuine condition: a process may converge correctly at every fixed $f$ while having rare spikes that move around the index set and keep $\sup_{f\in\mathcal F}|Z_n^*f|$ away from the limiting supremum. The uniform continuity of $Z$ prevents a final mismatch in which the approximating net converges but the limiting path itself oscillates too much between nearby points. Thus entropy supplies finite nets, while maximal inequalities control the random oscillation inside each net cell.
[example: Bootstrap Empirical Distribution Function]
Let $f_t=\mathbb{1}_{(-\infty,t]}$ and $\mathcal F=\{f_t:t\in\mathbb R\}$. For every $t\in\mathbb R$,
\begin{align*}
P_nf_t=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i\le t\}}=F_n(t)
\end{align*}
and, for the bootstrap sample,
\begin{align*}
P_n^*f_t=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i^*\le t\}}=F_n^*(t).
\end{align*}
Therefore the bootstrap empirical process indexed by half-lines satisfies
\begin{align*}
G_n^*f_t=\sqrt n(P_n^*-P_n)f_t=\sqrt n(F_n^*(t)-F_n(t)).
\end{align*}
The class of half-line indicators is a VC class, hence it is $P$-Donsker by the *VC Donsker theorem*. The *Bootstrap Donsker Theorem* therefore gives, conditionally in probability,
\begin{align*}
\{G_n^*f_t:t\in\mathbb R\}\rightsquigarrow \{G_Pf_t:t\in\mathbb R\}
\end{align*}
in $\ell^\infty(\mathcal F)$. The covariance of the limiting process is, for $s,t\in\mathbb R$,
\begin{align*}
\operatorname{Cov}(G_Pf_s,G_Pf_t)=P(f_sf_t)-Pf_sPf_t.
\end{align*}
Since $f_s(x)f_t(x)=\mathbb{1}_{\{x\le \min(s,t)\}}$, this becomes
\begin{align*}
P(f_sf_t)-Pf_sPf_t=F(\min(s,t))-F(s)F(t).
\end{align*}
Because $F$ is nondecreasing, $F(\min(s,t))=\min(F(s),F(t))$, so this covariance equals
\begin{align*}
\min(F(s),F(t))-F(s)F(t).
\end{align*}
This is exactly the covariance of $B(F(t))$, where $B$ is the standard Brownian bridge on $[0,1]$; thus the bootstrap empirical distribution function converges conditionally to the pulled-back Brownian bridge $B_F(t)=B(F(t))$.
[/example]
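For data without ties the bootstrap Kolmogorov-Smirnov statistic depends only on the multinomial counts, which makes it cheap to simulate. The sketch below assumes no ties, so that $F_n(X_{(k)})=k/n$ and $F_n^*(X_{(k)})=(N_{(1)}^*+\dots+N_{(k)}^*)/n$; the function names and the Monte Carlo sizes are illustrative choices.

```python
import math
import random


def bootstrap_ks_statistic(n, rng):
    """One draw of sup_t |sqrt(n) (F_n^*(t) - F_n(t))| for a sample of size n without ties.

    Both distribution functions jump only at the data points, so the supremum is
    attained at an order statistic, where the difference is (N_(1)+...+N_(k) - k)/n.
    """
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1         # multinomial counts, indexed by rank
    sup, running = 0.0, 0
    for k, c in enumerate(counts, start=1):
        running += c
        sup = max(sup, abs(running - k))
    return sup / math.sqrt(n)


def bootstrap_ks_quantile(n, level=0.95, reps=2000, seed=0):
    """Monte Carlo approximation of the conditional quantile of the bootstrap statistic."""
    rng = random.Random(seed)
    stats = sorted(bootstrap_ks_statistic(n, rng) for _ in range(reps))
    return stats[math.ceil(level * reps) - 1]


if __name__ == "__main__":
    print(bootstrap_ks_quantile(200))
```

For moderate $n$ the printed value should be in the vicinity of the $0.95$ quantile of the Kolmogorov distribution, approximately $1.358$, with some discreteness because the statistic takes values in multiples of $n^{-1/2}$.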
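This example is the prototype for bootstrap Kolmogorov-Smirnov bands. The statistic is the supremum $\sup_{t\in\mathbb R}|G_n^*f_t|=\sqrt n\sup_{t\in\mathbb R}|F_n^*(t)-F_n(t)|$ of the absolute bootstrap empirical distribution process, and the theorem, combined with continuity of the supremum map on $\ell^\infty(\mathcal F)$, supplies its conditional limiting law.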
## Multiplier Bootstrap Processes
The nonparametric bootstrap uses multinomial counts, whose dependence can be inconvenient in proofs and computation. Multiplier methods replace those counts by independent weights, keeping the observed sample fixed and injecting randomness through an external sequence.
[definition: Multiplier Sequence]
A multiplier sequence is a sequence $\xi_1,\xi_2,\dots$ independent of $X_1,X_2,\dots$ such that the variables $\xi_i$ are i.i.d., $\mathbb E[\xi_1]=0$, and $\mathbb E[\xi_1^2]=1$.
[/definition]
Two choices dominate applications. Gaussian multipliers make the conditional process Gaussian, while Rademacher multipliers connect directly to symmetrisation inequalities.
[example: Gaussian and Rademacher Multipliers]
Let
\begin{align*}
W_n^\xi f=\frac{1}{\sqrt n}\sum_{i=1}^n \xi_i f(X_i)
\end{align*}
with the data $X_1,\dots,X_n$ held fixed. If $\xi_i\sim\mathcal N(0,1)$ are independent, then for each fixed $f$ the conditional mean is
\begin{align*}
\mathbb E^\xi[W_n^\xi f\mid X_1,\dots,X_n]=\frac{1}{\sqrt n}\sum_{i=1}^n f(X_i)\mathbb E[\xi_i]=0.
\end{align*}
For $f,g\in\mathcal F$, independence of the multipliers gives $\mathbb E[\xi_i\xi_j]=0$ when $i\ne j$ and $\mathbb E[\xi_i^2]=1$, so
\begin{align*}
\operatorname{Cov}^\xi(W_n^\xi f,W_n^\xi g\mid X_1,\dots,X_n)=\frac{1}{n}\sum_{i=1}^n f(X_i)g(X_i)=P_n(fg).
\end{align*}
Also, every finite vector $(W_n^\xi f_1,\dots,W_n^\xi f_k)$ is a linear transformation of the Gaussian vector $(\xi_1,\dots,\xi_n)$, hence is conditionally Gaussian with this empirical covariance matrix.
If instead $\mathbb P(\xi_i=1)=\mathbb P(\xi_i=-1)=1/2$, then
\begin{align*}
\mathbb E[\xi_i]=1\cdot\frac12+(-1)\cdot\frac12=0
\end{align*}
and
\begin{align*}
\mathbb E[\xi_i^2]=1^2\cdot\frac12+(-1)^2\cdot\frac12=1.
\end{align*}
Thus
\begin{align*}
W_n^\xi f=\frac{1}{\sqrt n}\sum_{i=1}^n \xi_i f(X_i)
\end{align*}
is the empirical sum with each observed contribution independently assigned a random sign, which is the usual symmetrised empirical process conditional on the sample. Gaussian and Rademacher multipliers therefore have the same conditional first and second moments, and both match the covariance structure $P_n(fg)$ of the uncentred multiplier approximation.
[/example]
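For Rademacher multipliers and a small sample the conditional moments can be computed exactly by enumerating all $2^n$ sign vectors. The sketch below does this for hypothetical values of $f(X_i)$ and $g(X_i)$; the function names are illustrative.

```python
import itertools
import math


def multiplier_sum(signs, values):
    """W_n^xi f = n^{-1/2} sum_i xi_i f(X_i) for one fixed sign vector."""
    return sum(s * v for s, v in zip(signs, values)) / math.sqrt(len(values))


def rademacher_conditional_moments(f_vals, g_vals):
    """Exact conditional mean of W f and covariance of (W f, W g) over all sign vectors."""
    n = len(f_vals)
    all_signs = list(itertools.product((-1, 1), repeat=n))   # each has probability 2^{-n}
    mean_f = sum(multiplier_sum(s, f_vals) for s in all_signs) / len(all_signs)
    # The mean is zero, so the covariance is the average of the products.
    cov_fg = sum(
        multiplier_sum(s, f_vals) * multiplier_sum(s, g_vals) for s in all_signs
    ) / len(all_signs)
    return mean_f, cov_fg


if __name__ == "__main__":
    f_vals = [1.0, 2.0, -1.0, 0.5]            # hypothetical f(X_1), ..., f(X_4)
    g_vals = [0.0, 1.0, 3.0, -2.0]            # hypothetical g(X_1), ..., g(X_4)
    p_n_fg = sum(a * b for a, b in zip(f_vals, g_vals)) / len(f_vals)
    print(rademacher_conditional_moments(f_vals, g_vals), p_n_fg)
```

Here $P_n(fg)=(0+2-3-1)/4=-0.5$, and the enumerated covariance matches it, while the enumerated mean is zero.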
The multiplier sequence supplies the external randomness, but we still need to decide what process it drives. For general function classes, raw sums $n^{-1/2}\sum_i\xi_i f(X_i)$ have covariance $P_n(fg)$ rather than the Brownian bridge covariance. Centering each index by $P_nf$ produces the empirical covariance of the centred functions and therefore the right target.
[definition: Multiplier Empirical Process]
Let $\mathcal F$ be a class of measurable functions $f:S\to\mathbb R$, and let $(\xi_i)_{i=1}^n$ be multipliers independent of the data. The centred multiplier empirical process is the conditional random map $G_n^\xi:\mathcal F\to\mathbb R$ defined by
\begin{align*}
G_n^\xi f=\frac{1}{\sqrt n}\sum_{i=1}^n \xi_i(f(X_i)-P_nf),\qquad f\in\mathcal F.
\end{align*}
When $\sup_{f\in\mathcal F}|G_n^\xi f|<\infty$ and the required measurability holds, it is viewed conditionally on the data as a random element of $\ell^\infty(\mathcal F)$.
[/definition]
The subtraction of $P_nf$ makes the conditional covariance of $G_n^\xi f$ and $G_n^\xi g$ equal to the empirical covariance $P_n(fg)-P_nf\,P_ng$ of $f(X_i)$ and $g(X_i)$, which estimates the Brownian bridge covariance $P(fg)-Pf\,Pg$. The remaining question is whether independent multiplier weights preserve the same functional convergence as multinomial bootstrap weights. Under the same Donsker hypothesis, the answer is affirmative.
[quotetheorem:9867]
[citeproof:9867]
Multiplier processes are often simpler to simulate than resampling observations. The Donsker and envelope assumptions play the same roles as in the nonparametric bootstrap: without tightness of the original empirical process, multiplier weighting cannot manufacture a tight Gaussian limit, and without $PF^2<\infty$ a small number of large observations can dominate the weighted sum. The multiplier tail condition is also not cosmetic; very heavy-tailed weights may have mean zero and variance one but still violate the maximal inequality needed for uniform control over $\mathcal F$. The theorem does not cover uncentred multiplier sums when the target is a Brownian bridge, nor does it justify high-dimensional maxima whose index class grows with $n$ without additional Gaussian comparison or anti-concentration input. Its practical advantage is that the independent weights connect directly to symmetrisation and to the Gaussian and Rademacher comparison tools used in modern simultaneous inference.
[example: Multiplier Confidence Bands for Distribution Functions]
Let $f_t=\mathbb{1}_{(-\infty,t]}$. Since $P_nf_t=F_n(t)$, the centred multiplier empirical process from the definition gives
\begin{align*}
G_n^\xi f_t=\frac{1}{\sqrt n}\sum_{i=1}^n\xi_i(f_t(X_i)-P_nf_t)=\frac{1}{\sqrt n}\sum_{i=1}^n\xi_i(\mathbb{1}_{\{X_i\le t\}}-F_n(t)).
\end{align*}
Thus, writing $G_n^\xi(t)=G_n^\xi f_t$, the statistic used for the band is
\begin{align*}
\|G_n^\xi\|_\infty=\sup_{t\in\mathbb R}\left|\frac{1}{\sqrt n}\sum_{i=1}^n\xi_i(\mathbb{1}_{\{X_i\le t\}}-F_n(t))\right|.
\end{align*}
Let $c_{n,1-\alpha}^\xi$ be the conditional $(1-\alpha)$-quantile of $\|G_n^\xi\|_\infty$ given $X_1,\dots,X_n$. The class of half-line indicators is a VC class, so by the *VC Donsker theorem* and the *Multiplier Central Limit Theorem*,
\begin{align*}
G_n^\xi\rightsquigarrow \{B(F(t)):t\in\mathbb R\}
\end{align*}
conditionally in probability in $\ell^\infty(\mathbb R)$, where $B$ is a standard Brownian bridge. The ordinary empirical distribution process satisfies
\begin{align*}
G_n(t)=\sqrt n(F_n(t)-F(t))=\sqrt n(P_n-P)f_t,
\end{align*}
and the same Donsker limit gives
\begin{align*}
G_n\rightsquigarrow \{B(F(t)):t\in\mathbb R\}.
\end{align*}
Therefore the conditional law of $\sup_t|G_n^\xi(t)|$ estimates the law of $\sup_t|G_n(t)|$ in the bounded-Lipschitz sense.
The simultaneous band
\begin{align*}
F_n(t)-\frac{c_{n,1-\alpha}^\xi}{\sqrt n}\le F(t)\le F_n(t)+\frac{c_{n,1-\alpha}^\xi}{\sqrt n}\qquad\text{for every }t\in\mathbb R
\end{align*}
contains the whole distribution function exactly when
\begin{align*}
\sup_{t\in\mathbb R}|F_n(t)-F(t)|\le \frac{c_{n,1-\alpha}^\xi}{\sqrt n}.
\end{align*}
Multiplying both sides by $\sqrt n$, this event is
\begin{align*}
\sup_{t\in\mathbb R}|\sqrt n(F_n(t)-F(t))|\le c_{n,1-\alpha}^\xi.
\end{align*}
If $F$ is continuous, $\{B(F(t)):t\in\mathbb R\}$ has the same supremum distribution as $\{B(u):0\le u\le1\}$, and the [Kolmogorov distribution](/theorems/6305) has a continuous distribution function on $(0,\infty)$, so it has no atom at any quantile of level $1-\alpha\in(0,1)$. The *Bootstrap Validity for Suprema* theorem therefore yields asymptotic coverage $1-\alpha$ for the multiplier band.
[/example]
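The multiplier band can be sketched numerically. The following is a minimal illustration, not a reference implementation: it assumes standard Gaussian multipliers, approximates the conditional quantile by Monte Carlo, and uses the hypothetical function names `multiplier_sup` and `multiplier_critical_value`. It relies on the fact that $t\mapsto G_n^\xi(t)$ is a step function that changes only at observed values, so the supremum is attained at a sample point.

```python
import math
import random


def multiplier_sup(sample, xi):
    """sup_t |n^{-1/2} sum_i xi_i (1{X_i <= t} - F_n(t))| for fixed multipliers xi."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    total = sum(xi)
    partial, best = 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        partial += xi[i]
        # Evaluate only at the last copy of a tied value, where F_n(t) = rank / n.
        last_of_tie = rank == n or sample[order[rank]] != sample[i]
        if last_of_tie:
            best = max(best, abs(partial - (rank / n) * total))
    return best / math.sqrt(n)


def multiplier_critical_value(sample, alpha=0.05, draws=500, seed=0):
    """Monte Carlo approximation of the conditional (1 - alpha)-quantile."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        multiplier_sup(sample, [rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(draws)
    )
    return stats[min(draws - 1, math.ceil((1 - alpha) * draws) - 1)]


def band_half_width(sample, alpha=0.05, draws=500, seed=0):
    """Half-width c / sqrt(n) of the band F_n(t) -/+ c / sqrt(n)."""
    c = multiplier_critical_value(sample, alpha, draws, seed)
    return c / math.sqrt(len(sample))
```

The data are held fixed while only the multipliers are redrawn, which mirrors the conditional statement in the example.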
The same principle applies beyond distribution functions. The statistic must be expressible, at first order, as a supremum of an empirical process or a continuous transform of one.
## Suprema, Continuous Maps, and Confidence Bands
Statistical applications usually require a scalar critical value rather than convergence of the whole process. The issue is whether conditional weak convergence is preserved after taking suprema and whether the resulting quantiles give valid coverage.
[quotetheorem:9868]
[citeproof:9868]
The continuity assumption rules out a jump of the limiting distribution function at the target quantile. This is necessary: if $\|Z\|_{\mathcal F}$ has an atom at $c_{1-\alpha}$, small errors in the conditional distribution can move the selected quantile across the atom and the coverage need not converge to exactly $1-\alpha$. The assumptions $Z_n\rightsquigarrow Z$ and $Z_n^*\rightsquigarrow Z$ conditionally are also both needed, since a bootstrap process can consistently estimate the wrong limiting law if the statistic has not been correctly centred or linearised. The theorem validates continuous functionals such as suprema, but it does not by itself handle nonsmooth plug-in maps, estimated nuisance parameters, or growing index classes without an additional approximation step. In most empirical process bands the quantile-continuity condition follows from anti-concentration for Gaussian suprema or from direct properties of the Brownian bridge.
[example: Bootstrap Kolmogorov-Smirnov Bands]
Let $f_t=\mathbb{1}_{(-\infty,t]}$ and let $\mathcal F=\{f_t:t\in\mathbb R\}$. Then
\begin{align*}
P_nf_t=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i\le t\}}=F_n(t)
\end{align*}
and, for a bootstrap sample,
\begin{align*}
P_n^*f_t=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i^*\le t\}}=F_n^*(t).
\end{align*}
Hence the bootstrap Kolmogorov-Smirnov statistic can be written as
\begin{align*}
\sqrt n\sup_{t\in\mathbb R}|F_n^*(t)-F_n(t)|=\sup_{t\in\mathbb R}|\sqrt n(P_n^*-P_n)f_t|=\|G_n^*\|_{\mathcal F}.
\end{align*}
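Let $c_{n,1-\alpha}^*$ denote the conditional $(1-\alpha)$-quantile of $\|G_n^*\|_{\mathcal F}$ given $X_1,\dots,X_n$. The band $F_n\pm c_{n,1-\alpha}^*/\sqrt n$ contains the whole distribution function $F$ exactly when, for every $t\in\mathbb R$,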
\begin{align*}
-\frac{c_{n,1-\alpha}^*}{\sqrt n}\le F(t)-F_n(t)\le \frac{c_{n,1-\alpha}^*}{\sqrt n}.
\end{align*}
Equivalently,
\begin{align*}
\sup_{t\in\mathbb R}|F_n(t)-F(t)|\le \frac{c_{n,1-\alpha}^*}{\sqrt n}.
\end{align*}
Multiplying by $\sqrt n$ gives the event
\begin{align*}
\sup_{t\in\mathbb R}|\sqrt n(F_n(t)-F(t))|\le c_{n,1-\alpha}^*.
\end{align*}
Since
\begin{align*}
\sqrt n(F_n(t)-F(t))=\sqrt n(P_n-P)f_t=G_nf_t,
\end{align*}
the coverage probability is
\begin{align*}
\mathbb P\left(\|G_n\|_{\mathcal F}\le c_{n,1-\alpha}^*\right).
\end{align*}
The class of half-line indicators is a VC class, so by the *VC Donsker theorem* the original empirical process satisfies $G_n\rightsquigarrow G_P$ in $\ell^\infty(\mathcal F)$, and by the *Bootstrap Donsker Theorem* the bootstrap process satisfies $G_n^*\rightsquigarrow G_P$ conditionally in probability. For this class,
\begin{align*}
\operatorname{Cov}(G_Pf_s,G_Pf_t)=P(f_sf_t)-Pf_sPf_t=F(\min(s,t))-F(s)F(t).
\end{align*}
Because $F$ is nondecreasing,
\begin{align*}
F(\min(s,t))=\min(F(s),F(t)),
\end{align*}
so the covariance is
\begin{align*}
\min(F(s),F(t))-F(s)F(t),
\end{align*}
which is the covariance of the pulled-back Brownian bridge $B(F(t))$. Assume in addition that $F$ is continuous. Then the closure of $\{F(t):t\in\mathbb R\}$ is $[0,1]$, and the continuity of Brownian bridge paths gives
\begin{align*}
\sup_{t\in\mathbb R}|B(F(t))|=\sup_{0\le u\le1}|B(u)|.
\end{align*}
The limiting supremum has the Kolmogorov distribution, whose distribution function is continuous, so the *Bootstrap Validity for Suprema* theorem yields
\begin{align*}
\mathbb P\left(\|G_n\|_{\mathcal F}\le c_{n,1-\alpha}^*\right)\to 1-\alpha.
\end{align*}
Thus the bootstrap Kolmogorov-Smirnov band has asymptotic simultaneous coverage $1-\alpha$.
[/example]
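As a rough numerical companion to this example, the sketch below approximates the bootstrap critical value by Monte Carlo resampling and checks whether a given continuous distribution function lies inside the band. The function names and the default number of resamples are illustrative assumptions, not part of the notes.

```python
import bisect
import math
import random


def bootstrap_ks_statistic(original_sorted, rng):
    """sqrt(n) * sup_t |F_n^*(t) - F_n(t)| for one resample with replacement."""
    n = len(original_sorted)
    resample = sorted(rng.choice(original_sorted) for _ in range(n))
    # Both step functions jump only at observed values, so compare there.
    gap = max(
        abs(bisect.bisect_right(resample, t) - bisect.bisect_right(original_sorted, t))
        for t in original_sorted
    )
    return gap / math.sqrt(n)


def bootstrap_critical_value(sample, alpha=0.05, draws=500, seed=0):
    """Monte Carlo approximation of the conditional (1 - alpha)-quantile."""
    rng = random.Random(seed)
    original_sorted = sorted(sample)
    stats = sorted(bootstrap_ks_statistic(original_sorted, rng) for _ in range(draws))
    return stats[min(draws - 1, math.ceil((1 - alpha) * draws) - 1)]


def ks_band_covers(sample, cdf, critical_value):
    """Check sqrt(n) * sup_t |F_n(t) - F(t)| <= critical_value for a continuous cdf."""
    n = len(sample)
    xs = sorted(sample)
    # For continuous F the supremum is approached just before or at an order statistic.
    distance = max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs)
    )
    return math.sqrt(n) * distance <= critical_value
```

Repeating `ks_band_covers` over many independent samples gives an empirical coverage frequency that can be compared with the nominal level $1-\alpha$.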
Confidence bands for function-valued estimators usually need an additional linearisation step. Empirical process theory enters after the estimator is approximated by a process indexed by the target parameter.
[example: Simultaneous Inference for Regression Curves]
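Let $\hat m$ be an estimator of a regression curve $m:T\to\mathbb R$, and suppose the class of influence functions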
\begin{align*}
\Psi=\{\psi_t:t\in T\}
\end{align*}
is $P$-Donsker and the estimator satisfies the uniform linearisation
\begin{align*}
\sqrt n(\hat m(t)-m(t))=G_n\psi_t+r_n(t),\qquad \sup_{t\in T}|r_n(t)|\xrightarrow{\mathbb P}0.
\end{align*}
We compare the target statistic with the empirical-process statistic. For each $t$,
\begin{align*}
\left|\sqrt n(\hat m(t)-m(t))\right|=\left|G_n\psi_t+r_n(t)\right|.
\end{align*}
Using $\big||a+b|-|a|\big|\le |b|$ with $a=G_n\psi_t$ and $b=r_n(t)$ gives
\begin{align*}
\left|\left|G_n\psi_t+r_n(t)\right|-\left|G_n\psi_t\right|\right|\le |r_n(t)|.
\end{align*}
Taking suprema over $t\in T$,
\begin{align*}
\left|\sup_{t\in T}\left|\sqrt n(\hat m(t)-m(t))\right|-\sup_{t\in T}|G_n\psi_t|\right|\le \sup_{t\in T}|r_n(t)|.
\end{align*}
Since the right-hand side converges to $0$ in probability, the two suprema have the same limiting law.
The multiplier process indexed by the same influence functions is
\begin{align*}
G_n^\xi\psi_t=\frac{1}{\sqrt n}\sum_{i=1}^n\xi_i(\psi_t(X_i)-P_n\psi_t).
\end{align*}
By the *Multiplier Central Limit Theorem*, conditionally in probability,
\begin{align*}
\{G_n^\xi\psi_t:t\in T\}\rightsquigarrow \{G_P\psi_t:t\in T\}.
\end{align*}
The original empirical process also satisfies
\begin{align*}
\{G_n\psi_t:t\in T\}\rightsquigarrow \{G_P\psi_t:t\in T\},
\end{align*}
because $\Psi$ is $P$-Donsker. Therefore the conditional law of
\begin{align*}
\sup_{t\in T}|G_n^\xi\psi_t|
\end{align*}
estimates the law of
\begin{align*}
\sup_{t\in T}|G_n\psi_t|.
\end{align*}
Let $c_{n,1-\alpha}^\xi$ be the conditional $(1-\alpha)$-quantile of $\sup_{t\in T}|G_n^\xi\psi_t|$. The band
\begin{align*}
\hat m(t)-\frac{c_{n,1-\alpha}^\xi}{\sqrt n}\le m(t)\le \hat m(t)+\frac{c_{n,1-\alpha}^\xi}{\sqrt n}\qquad t\in T
\end{align*}
contains the whole curve exactly when
\begin{align*}
\sup_{t\in T}|\hat m(t)-m(t)|\le \frac{c_{n,1-\alpha}^\xi}{\sqrt n}.
\end{align*}
Multiplying both sides by $\sqrt n$, this event is
\begin{align*}
\sup_{t\in T}\left|\sqrt n(\hat m(t)-m(t))\right|\le c_{n,1-\alpha}^\xi.
\end{align*}
The uniform linearisation replaces the left-hand side by $\sup_{t\in T}|G_n\psi_t|$ up to $o_{\mathbb P}(1)$, and the multiplier quantile consistently estimates the corresponding limiting quantile. If the distribution of $\sup_{t\in T}|G_P\psi_t|$ has no atom at its $(1-\alpha)$-quantile, then the simultaneous band has asymptotic coverage $1-\alpha$.
[/example]
The chapter's message is that bootstrap validity is a functional central limit theorem, not only a statement about a single statistic. Once conditional convergence holds in $\ell^\infty(\mathcal F)$, continuous maps such as suprema, coordinate projections, and many plug-in band constructions inherit valid conditional approximations.
# Learning Theory Applications
The guiding question of this chapter is not only whether empirical risk minimization is consistent, but also how the structure of the loss and the hypothesis class determines the rate. The chapter assumes the earlier empirical-process material on symmetrisation, Rademacher averages, bounded-differences concentration, and entropy or growth-function bounds. Uniform convergence gives robust slow-rate bounds. Margin losses and contraction turn prediction classes into loss classes. Localization and Bernstein-type curvature conditions explain why some learning problems admit rates faster than $n^{-1/2}$.
## Uniform Convergence Bounds for Empirical Risk Minimization
The basic learning problem starts with an unknown distribution $P$ on an observation space $\mathcal Z$ and a class $\mathcal F$ of candidate predictors or decision rules. Each $f \in \mathcal F$ is evaluated through a loss function, so the statistical object controlled by empirical process theory is the induced class of loss functions. The first question is: if we choose a rule that performs well on the sample, how much true performance can we lose?
[definition: Risk And Empirical Risk]
Let $Z_1,\dots,Z_n$ be i.i.d. random variables with distribution $P$ on a measurable space $(\mathcal Z,\mathcal A)$. Let $(\mathcal X,\mathcal B)$ and $(\mathcal Y,\mathcal C)$ be measurable spaces, let $\mathcal F$ be a class of measurable predictors $f:\mathcal X\to\mathcal Y$, and let $\ell_f:\mathcal Z\to\mathbb R$ be the measurable, $P$-integrable loss associated with $f\in\mathcal F$. The risk and empirical risk are the maps $R,R_n:\mathcal F\to\mathbb R$ defined by
\begin{align*}
R(f)&:=P\ell_f=\mathbb E[\ell_f(Z)],\\
R_n(f)&:=P_n\ell_f=\frac{1}{n}\sum_{i=1}^n \ell_f(Z_i).
\end{align*}
An empirical risk minimizer is any measurable $\hat f_n\in\mathcal F$ satisfying
\begin{align*}
R_n(\hat f_n)=\inf_{f\in\mathcal F}R_n(f).
\end{align*}
[/definition]
Risk and empirical risk name the optimization criterion, but they do not yet say what counts as success. Since the class $\mathcal F$ may not contain the globally optimal measurable rule, the right benchmark for the first theorem is the best rule inside $\mathcal F$. This leads to the class-relative notion of excess risk.
[definition: Excess Risk]
Assume $f^*\in\mathcal F$ satisfies $R(f^*)=\inf_{f\in\mathcal F}R(f)$. The excess risk relative to $\mathcal F$ is the map $\mathcal E:\mathcal F\to\mathbb R$ defined by
\begin{align*}
\mathcal E(f):=R(f)-R(f^*).
\end{align*}
[/definition]
The excess-risk definition isolates the estimation error created by using data. To bound it, compare population and empirical risks at both $\hat f_n$ and $f^*$. The next inequality is the algebraic step that turns learning theory into a uniform empirical-process problem.
[quotetheorem:9856]
[citeproof:9856]
The result converts learning into uniform convergence, and each hypothesis has a specific role. The existence of $f^*$ keeps the benchmark inside the class; if the infimum is not attained, the same argument must be written with an approximate oracle, since otherwise the displayed excess risk is not defined. The ERM assumption is also essential: a rule with much larger empirical risk could be selected even when the uniform deviation is small, so the middle empirical term in the proof would no longer have the right sign. What the theorem does not say is that the global supremum is sharp; it gives a robust slow-rate route and prepares the VC and Rademacher bounds, while later localization will replace the whole class by a risk-dependent neighbourhood of $f^*$.
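The inequality can be checked on a toy problem where the population risk is available in closed form. The sketch below is an illustration under assumptions of my own choosing: $X$ is uniform on $[0,1]$, the clean label is $+1$ when $X>0.3$, labels are flipped independently with probability $0.2$, and the class consists of eleven threshold rules $x\mapsto\operatorname{sgn}(x-\theta)$. In that model the zero-one risk of the threshold $\theta$ is $0.2+0.6\,|\theta-0.3|$.

```python
import random


def true_risk(theta, boundary=0.3, noise=0.2):
    """Exact zero-one risk of x -> sign(x - theta) in the toy model."""
    return noise + (1 - 2 * noise) * abs(theta - boundary)


def draw_sample(n, rng, boundary=0.3, noise=0.2):
    sample = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > boundary else -1
        if rng.random() < noise:
            y = -y  # label noise
        sample.append((x, y))
    return sample


def empirical_risk(theta, sample):
    errors = sum(1 for x, y in sample if (1 if x > theta else -1) != y)
    return errors / len(sample)


def erm_check(n=200, seed=0):
    """Return (excess risk of the ERM, uniform deviation over the class)."""
    rng = random.Random(seed)
    thetas = [k / 10 for k in range(11)]
    sample = draw_sample(n, rng)
    emp = {t: empirical_risk(t, sample) for t in thetas}
    pop = {t: true_risk(t) for t in thetas}
    theta_hat = min(thetas, key=emp.get)   # empirical risk minimizer
    theta_star = min(thetas, key=pop.get)  # oracle inside the class
    excess = pop[theta_hat] - pop[theta_star]
    deviation = max(abs(emp[t] - pop[t]) for t in thetas)
    return excess, deviation
```

For every sample the returned excess risk is at most twice the returned deviation, which is the content of the bound; the gap between the two is typically large, illustrating that the global supremum is not sharp.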
[example: Binary Classification With Halfspaces]
Let $Z=(X,Y)$, where $X\in\mathbb R^d$ and $Y\in\{-1,1\}$, and let
\begin{align*}
f_{a,b}(x)=\operatorname{sgn}(a\cdot x+b).
\end{align*}
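For zero-one loss, with the convention $\operatorname{sgn}(0)=0$ so that a point on the decision boundary always counts as an error, the error indicator is
\begin{align*}
\ell_{a,b}(x,y)=\mathbb{1}_{\{f_{a,b}(x)\ne y\}}=\mathbb{1}_{\{y(a\cdot x+b)\le 0\}}.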
\end{align*}
Writing $\theta=(a,b)\in\mathbb R^{d+1}$ and $\tilde x=(x,1)\in\mathbb R^{d+1}$ gives
\begin{align*}
y(a\cdot x+b)=\theta\cdot(y\tilde x).
\end{align*}
Thus each loss is the indicator of the halfspace
\begin{align*}
\{v\in\mathbb R^{d+1}:\theta\cdot v\le 0\}
\end{align*}
applied to the transformed point $v=y(x,1)$. The loss class is therefore contained in the class of halfspaces in $\mathbb R^{d+1}$, whose VC dimension is at most $d+2$.
By the *VC Generalization Theorem*, with probability at least $1-e^{-t}$,
\begin{align*}
\sup_{a,b}|(P_n-P)\ell_{a,b}|\le K\sqrt{\frac{m\log(en/m)+t}{n}},
\end{align*}
where $m=\min\{d+2,n\}$. Applying the *Basic ERM Excess Risk Bound* to an empirical risk minimizer $\hat f_n$ gives
\begin{align*}
R(\hat f_n)-R(f^*)\le 2K\sqrt{\frac{m\log(en/m)+t}{n}}.
\end{align*}
When $d+2\le n$, this becomes
\begin{align*}
R(\hat f_n)-R(f^*)\le 2K\sqrt{\frac{(d+2)\log(en/(d+2))+t}{n}}.
\end{align*}
Up to the logarithmic factor and the confidence term, which contributes at most $\sqrt{t/n}$, the geometric dimension count gives the learning scale $\sqrt{d/n}$.
[/example]
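To see how the bound scales, one can evaluate it numerically. The sketch below is only a calculator for the displayed expression: the universal constant $K$ is not specified in these notes, so the default `K=1.0` is a placeholder assumption and the absolute values should not be read as sharp guarantees.

```python
import math


def vc_deviation_bound(d, n, t, K=1.0):
    """K * sqrt((m log(e n / m) + t) / n) with m = min(d + 2, n)."""
    m = min(d + 2, n)
    return K * math.sqrt((m * math.log(math.e * n / m) + t) / n)


def erm_excess_bound(d, n, t, K=1.0):
    """Excess-risk bound obtained by doubling the uniform deviation bound."""
    return 2 * vc_deviation_bound(d, n, t, K)


if __name__ == "__main__":
    t = math.log(20)  # confidence level 1 - e^{-t} = 0.95
    for n in (10**3, 10**4, 10**5):
        print(n, round(erm_excess_bound(d=10, n=n, t=t), 4))
```

When $n\le d+2$ the expression is at least $K$, so with $K\ge 1$ the bound is vacuous for a loss taking values in $[0,1]$.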
The halfspace example points to a general combinatorial question: how many different binary labelings can a class realize on a finite sample? This question is necessary because infinite classes can behave very differently: the class of all subsets of $\mathcal X$ shatters every finite sample, and ERM over that class can fit arbitrary noise without giving distribution-free generalization. For bounded losses, finite VC dimension rules out that failure by forcing polynomial rather than exponential growth of the sample labelings. As in the VC chapters, $V(\mathcal C)$ denotes the largest size of a finite subset of $\mathcal X$ shattered by $\mathcal C$, with value $\infty$ if arbitrarily large finite sets can be shattered.
The point of the definition is that the empirical process indexed by a VC class behaves like that of a finite class whose effective size is polynomial in $n$. Sauer's lemma supplies the polynomial growth; concentration turns it into a probability bound.
[quotetheorem:9869]
[citeproof:9869]
This theorem is often the first complete explanation of why empirical classification works in infinite classes. Infinite cardinality is harmless when the sample can see only polynomially many labeling patterns, but infinite VC dimension removes that protection and permits sample-wise memorisation. For a concrete failure, take $\mathcal X=[0,1]$ with atomless $P_X$, let labels be independent fair signs, and let $\mathcal C$ contain every finite subset of $\mathcal X$. With probability one the sample points are distinct, so an ERM can choose the finite set that matches every noisy training label, while its true classification risk remains $1/2$. Measurability is needed so that the empirical and population probabilities in the supremum are legitimate random variables; in more advanced treatments this is handled by separability or outer probability conventions. Boundedness is also built into the classification setting through indicators, and the result does not by itself cover unbounded real-valued losses. The next section replaces combinatorial counting by Rademacher complexity, which applies more naturally to margin-based real-valued prediction classes.
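The memorisation failure described above is easy to simulate. In the following sketch, which uses hypothetical helper names, inputs are uniform on $[0,1]$, labels are independent fair signs, and the selected rule predicts $+1$ exactly on the positively labelled training points, so it is the indicator-type classifier of a finite set.

```python
import random


def memorising_erm(train):
    """Classifier that predicts +1 exactly on the positively labelled training inputs."""
    positives = {x for x, y in train if y == 1}
    return lambda x: 1 if x in positives else -1


def error_rate(classifier, data):
    return sum(1 for x, y in data if classifier(x) != y) / len(data)


def memorisation_demo(n_train=200, n_test=20000, seed=0):
    """Return (training error, test error) of the memorising rule."""
    rng = random.Random(seed)

    def draw():
        # Input uniform on [0, 1); label is an independent fair sign.
        return rng.random(), rng.choice((-1, 1))

    train = [draw() for _ in range(n_train)]
    test = [draw() for _ in range(n_test)]
    rule = memorising_erm(train)
    return error_rate(rule, train), error_rate(rule, test)
```

The training error is zero whenever the training inputs are distinct, while the test error stays close to $1/2$ because fresh inputs almost surely miss the memorised set.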
## Margin Losses, Surrogate Losses, and Contraction of Rademacher Complexity
Zero-one loss is statistically natural but computationally difficult. Learning algorithms often replace it by a convex surrogate loss that upper-bounds the classification error and responds to the margin $yf(x)$. The question becomes: how does the complexity of the real-valued scoring class transfer to the complexity of the composed loss class?
[definition: Margin Loss]
Let $\mathcal X$ be an input space, let $Y\in\{-1,1\}$, and let $\mathcal G$ be a class of measurable scoring functions $g:\mathcal X\to\mathbb R$. Given a measurable function $\phi:\mathbb R\to[0,\infty)$, the margin loss associated with $g\in\mathcal G$ is the map $\ell_g:\mathcal X\times\{-1,1\}\to[0,\infty)$ defined by
\begin{align*}
\ell_g(x,y)=\phi(yg(x)).
\end{align*}
[/definition]
The scalar $yg(x)$ is positive when the prediction has the right sign and large when the classifier is confident. Convex surrogates penalize small or negative margins while keeping optimization feasible.
[example: Hinge Loss Support Vector Machines]
For binary classification, take scores $g_w(x)=w\cdot x$ with $|w|\le B$ and assume $|x|\le L$ almost surely. The hinge loss is $\phi(u)=(1-u)_+$, so the empirical hinge-risk minimizer over this class minimizes
\begin{align*}
\frac{1}{n}\sum_{i=1}^n (1-Y_i w\cdot X_i)_+
\end{align*}
over $|w|\le B$, which is the basic support vector machine without an offset term.
The map $\phi$ is $1$-Lipschitz because for all $u,v\in\mathbb R$,
\begin{align*}
|(1-u)_+-(1-v)_+|\le |(1-u)-(1-v)|=|u-v|.
\end{align*}
Since $\phi(0)=1$, define the centered loss transform $\psi(u)=\phi(u)-1$. Then $\psi(0)=0$, $\psi$ is still $1$-Lipschitz, and the constant shift does not change the expected Rademacher average because
\begin{align*}
\mathbb E_\sigma\left[\frac{1}{n}\sum_{i=1}^n\sigma_i\right]=0.
\end{align*}
By the *Contraction Inequality For Rademacher Complexity* applied to $u_i(w)=Y_i w\cdot X_i$,
\begin{align*}
\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i\psi(Y_iw\cdot X_i)\right]\le 2\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_iY_iw\cdot X_i\right].
\end{align*}
For the linear score class, set $S_\sigma=\sum_{i=1}^n\sigma_iY_iX_i$. Then
\begin{align*}
\sup_{|w|\le B}\sum_{i=1}^n\sigma_iY_iw\cdot X_i=\sup_{|w|\le B}w\cdot S_\sigma=B|S_\sigma|
\end{align*}
by Cauchy-Schwarz, with equality when $w$ is parallel to $S_\sigma$ if $S_\sigma\ne0$. Hence
\begin{align*}
\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_iY_iw\cdot X_i\right]=\frac{B}{n}\mathbb E_\sigma|S_\sigma|.
\end{align*}
[Jensen's inequality](/theorems/1977) gives
\begin{align*}
\mathbb E_\sigma|S_\sigma|\le \left(\mathbb E_\sigma|S_\sigma|^2\right)^{1/2}.
\end{align*}
Expanding the square,
\begin{align*}
\mathbb E_\sigma|S_\sigma|^2=\sum_{i=1}^n|X_i|^2+\sum_{i\ne j}Y_iY_jX_i\cdot X_j\,\mathbb E_\sigma[\sigma_i\sigma_j].
\end{align*}
For $i\ne j$, independence and mean zero give $\mathbb E_\sigma[\sigma_i\sigma_j]=0$, so
\begin{align*}
\mathbb E_\sigma|S_\sigma|^2=\sum_{i=1}^n|X_i|^2\le nL^2.
\end{align*}
Therefore the score-class Rademacher complexity is at most
\begin{align*}
\frac{B}{n}\sqrt{nL^2}=\frac{BL}{\sqrt n}.
\end{align*}
After contraction, the centered hinge-loss class has empirical Rademacher complexity at most $2BL/\sqrt n$, up to the universal constants in the comparison theorem. Thus the statistical complexity term for norm-bounded linear hinge classifiers scales as $BL/\sqrt n$, showing that the SVM generalization rate is controlled by the product of the weight radius and input radius, not directly by the ambient dimension.
[/example]
The preceding example uses a general principle: Lipschitz transformations do not increase the ability of a class to fit random signs by more than a constant multiple of the Lipschitz constant. A non-Lipschitz transformation can destroy this control; for instance, exponentiating a uniformly bounded but moderately rich score class can amplify small score differences into much larger loss differences. To state the stable Lipschitz principle and the resulting high-probability learning bound, we first quantify random-sign fitting on the observed sample.
[definition: Empirical Rademacher Complexity]
Let $\mathcal H$ be a class of real-valued measurable functions on $\mathcal Z$, and let $Z_1,\dots,Z_n$ be fixed points in $\mathcal Z$. The empirical Rademacher complexity is the functional that assigns to the pair $(\mathcal H,Z_1^n)$ the real number
\begin{align*}
\mathfrak R_n(\mathcal H;Z_1^n):=\mathbb E_\sigma\left[\sup_{h\in\mathcal H}\frac{1}{n}\sum_{i=1}^n\sigma_i h(Z_i)\right],
\end{align*}
where $\sigma_1,\dots,\sigma_n$ are independent Rademacher random variables independent of the sample.
[/definition]
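For a finite class and a small sample the definition can be evaluated exactly by enumerating all $2^n$ sign vectors. The sketch below does this; the representation of the class as a list of rows $(h(Z_1),\dots,h(Z_n))$ is an assumption made for illustration.

```python
from itertools import product


def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class.

    `values` is a list of rows, one per function h, holding
    (h(Z_1), ..., h(Z_n)). All 2^n sign vectors are enumerated,
    so this is intended only for small n.
    """
    n = len(values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(signs, row)) for row in values)
    return total / (n * 2 ** n)


if __name__ == "__main__":
    single = [[1, 1, 1]]
    all_patterns = [list(p) for p in product((-1, 1), repeat=3)]
    print(empirical_rademacher(single))        # a single function cannot fit random signs
    print(empirical_rademacher(all_patterns))  # every sign pattern is matched exactly
```

A single function has complexity zero, while a class realising every sign pattern on the sample has complexity one, the largest possible value for functions bounded by one.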
Small Rademacher complexity means the class cannot track arbitrary label fluctuations, which is the probabilistic source of generalization. Absolute deviations require both signs of the class, so it is useful to write
\begin{align*}
\mathcal H^\pm:=\mathcal H\cup\{-h:h\in\mathcal H\}.
\end{align*}
The next theorem turns this random-sign quantity into a uniform deviation estimate for bounded losses, and then the basic ERM bound turns that deviation estimate into an excess-risk statement.
[quotetheorem:9870]
[citeproof:9870]
The boundedness assumption is doing real work: it gives the bounded-differences concentration step, and unbounded losses require separate tail assumptions such as sub-Gaussian or sub-exponential control. A concrete failure mode is squared loss with heavy-tailed responses: if $Y$ has finite mean but very large or infinite variance, then a single extreme observation can dominate $P_n(Y-f(X))^2$, so the sample risk may concentrate poorly even for a one-parameter constant class. The theorem also gives a uniform deviation bound, not an optimization guarantee by itself; the ERM inequality is the step that turns it into excess risk. Its strength is that it applies to real-valued classes without asking for a VC dimension, but the price is that one must estimate the Rademacher complexity of the loss class. For margin methods, this is useful after composing the score class with a loss, and the next result is the standard tool that prevents the complexity from being recomputed from scratch.
[quotetheorem:9825]
[citeproof:9825]
The hypothesis $\psi_i(0)=0$ is a centering convention rather than a modeling restriction. Constants disappear inside Rademacher sums when the same constant is added for every function in the class. Lipschitzness is the substantive assumption: without it, composition can magnify small score oscillations and make the loss class much more complex than the score class. A concrete failure is $\psi(u)=e^{u/\varepsilon}$ on scores taking values in $[0,\varepsilon\log n]$: the original scores have range only $\varepsilon\log n$, but the composed values range over $[1,n]$, so a few sample-dependent oscillations can dominate the Rademacher average. The theorem is also a comparison result, not a generalization theorem by itself; it must be combined with the preceding Rademacher bound. This is why convex margin losses such as hinge or logistic loss are analytically convenient: their Lipschitz constants translate optimization-friendly surrogates back into statistical complexity bounds.
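The comparison can be verified exactly on a small instance. The sketch below is an illustration with assumptions of my own: it takes six planar points, replaces the ball of weights by a finite grid consisting of the origin and equally spaced directions on the circle of the given radius, uses the centred hinge transform $\psi(u)=(1-u)_+-1$ (which is $1$-Lipschitz with $\psi(0)=0$), and computes both Rademacher averages by enumerating all sign vectors.

```python
import math
from itertools import product


def rademacher_of_rows(rows):
    """Exact E_sigma sup_row (1/n) sum_i sigma_i row[i] by enumeration."""
    n = len(rows[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(signs, row)) for row in rows)
    return total / (n * 2 ** n)


def contraction_check(points, labels, radius=1.0, angles=24):
    """Return (complexity of centred hinge losses, complexity of margins)."""
    weights = [(0.0, 0.0)] + [
        (radius * math.cos(2 * math.pi * k / angles),
         radius * math.sin(2 * math.pi * k / angles))
        for k in range(angles)
    ]
    # Margins y_i * (w . x_i) for each weight vector on the grid.
    margins = [
        [y * (w[0] * x[0] + w[1] * x[1]) for x, y in zip(points, labels)]
        for w in weights
    ]
    # Centred hinge transform psi(u) = (1 - u)_+ - 1.
    centred_hinge = [[max(0.0, 1.0 - u) - 1.0 for u in row] for row in margins]
    return rademacher_of_rows(centred_hinge), rademacher_of_rows(margins)
```

Since the contraction inequality holds for any index set, the first returned value is at most twice the second on the grid as well.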
[example: Lipschitz Losses Composed With Norm-Bounded Linear Predictors]
Let $\mathcal G=\{x\mapsto w\cdot x: |w|\le B\}$, and fix sample points $x_1,\dots,x_n$ with $|x_i|\le L$ for every $i$. We first compute the empirical Rademacher complexity of the linear score class. Set
\begin{align*}
S_\sigma:=\sum_{i=1}^n\sigma_i x_i.
\end{align*}
For each $w$,
\begin{align*}
\sum_{i=1}^n\sigma_i w\cdot x_i=w\cdot\sum_{i=1}^n\sigma_i x_i=w\cdot S_\sigma.
\end{align*}
By Cauchy-Schwarz, $w\cdot S_\sigma\le |w|\,|S_\sigma|\le B|S_\sigma|$. If $S_\sigma\ne0$, equality is attained by $w=B S_\sigma/|S_\sigma|$; if $S_\sigma=0$, both sides are $0$. Hence
\begin{align*}
\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i w\cdot x_i=\frac{B}{n}|S_\sigma|.
\end{align*}
Taking expectation over the Rademacher signs gives
\begin{align*}
\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i w\cdot x_i\right]=\frac{B}{n}\mathbb E_\sigma|S_\sigma|.
\end{align*}
By Jensen's inequality applied to the concave map $u\mapsto \sqrt u$,
\begin{align*}
\mathbb E_\sigma|S_\sigma|\le \left(\mathbb E_\sigma|S_\sigma|^2\right)^{1/2}.
\end{align*}
Now expand the squared norm:
\begin{align*}
|S_\sigma|^2=\left(\sum_{i=1}^n\sigma_i x_i\right)\cdot\left(\sum_{j=1}^n\sigma_j x_j\right)=\sum_{i=1}^n\sum_{j=1}^n\sigma_i\sigma_j x_i\cdot x_j.
\end{align*}
Taking expectation and separating diagonal and off-diagonal terms,
\begin{align*}
\mathbb E_\sigma|S_\sigma|^2=\sum_{i=1}^n |x_i|^2+\sum_{i\ne j}x_i\cdot x_j\,\mathbb E_\sigma[\sigma_i\sigma_j].
\end{align*}
For $i\ne j$, independence and $\mathbb E_\sigma\sigma_i=0$ imply
\begin{align*}
\mathbb E_\sigma[\sigma_i\sigma_j]=\mathbb E_\sigma[\sigma_i]\mathbb E_\sigma[\sigma_j]=0.
\end{align*}
Therefore
\begin{align*}
\mathbb E_\sigma|S_\sigma|^2=\sum_{i=1}^n |x_i|^2\le nL^2.
\end{align*}
Combining the previous displays,
\begin{align*}
\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i w\cdot x_i\right]\le \frac{B}{n}\sqrt{nL^2}=\frac{BL}{\sqrt n}.
\end{align*}
Now let $\phi$ be $L_\phi$-Lipschitz and define the centered transform $\psi(u)=\phi(u)-\phi(0)$. Then $\psi(0)=0$ and $\psi$ is still $L_\phi$-Lipschitz. The constant shift does not change the expected Rademacher average because
\begin{align*}
\frac{1}{n}\sum_{i=1}^n\sigma_i\phi(w\cdot x_i)=\frac{1}{n}\sum_{i=1}^n\sigma_i\psi(w\cdot x_i)+\frac{\phi(0)}{n}\sum_{i=1}^n\sigma_i
\end{align*}
and
\begin{align*}
\mathbb E_\sigma\left[\frac{\phi(0)}{n}\sum_{i=1}^n\sigma_i\right]=0.
\end{align*}
By the *Contraction Inequality For Rademacher Complexity* applied to $g_w(x)=w\cdot x$,
\begin{align*}
\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i\psi(w\cdot x_i)\right]\le 2L_\phi\,\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i w\cdot x_i\right].
\end{align*}
Using the score-class bound just proved,
\begin{align*}
\mathbb E_\sigma\left[\sup_{|w|\le B}\frac{1}{n}\sum_{i=1}^n\sigma_i\psi(w\cdot x_i)\right]\le \frac{2L_\phi BL}{\sqrt n}.
\end{align*}
Thus a Lipschitz loss composed with norm-bounded linear predictors has empirical Rademacher complexity controlled by $L_\phi BL/\sqrt n$ up to the universal contraction constant, so the generalization scale is governed by the Lipschitz constant, the weight radius, and the input radius rather than directly by the ambient dimension.
[/example]
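The identity $\mathfrak R_n=\frac{B}{n}\mathbb E_\sigma|S_\sigma|$ and the bound $BL/\sqrt n$ can be compared numerically. The sketch below estimates the expectation by Monte Carlo over random signs; the function names and the number of draws are illustrative assumptions, and $L$ is taken to be the largest Euclidean norm among the sample points.

```python
import math
import random


def linear_class_rademacher(points, B, draws=2000, seed=0):
    """Monte Carlo estimate of (B / n) * E|S_sigma|, S_sigma = sum_i sigma_i x_i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    total = 0.0
    for _ in range(draws):
        s = [0.0] * dim
        for x in points:
            sign = rng.choice((-1, 1))
            for j in range(dim):
                s[j] += sign * x[j]
        total += math.sqrt(sum(c * c for c in s))  # Euclidean norm of S_sigma
    return B * total / (draws * n)


def radius_bound(points, B):
    """The bound B * L / sqrt(n), with L the largest norm in the sample."""
    L = max(math.sqrt(sum(c * c for c in x)) for x in points)
    return B * L / math.sqrt(len(points))
```

Neither function depends on the ambient dimension except through the norms of the points, which matches the conclusion of the example.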
## Fast Rates From Localization, Bernstein Conditions, and Complexity Fixed Points
Uniform convergence treats all functions in a class equally, including functions with high excess risk that an ERM will not select once the sample is informative. Faster rates come from looking near the oracle $f^*$ and exploiting variance-risk relationships. The central question is: when does the empirical process shrink as the excess risk shrinks?
[definition: Excess Loss Class]
Let $f^*\in\mathcal F$ minimize $R(f)=P\ell_f$ over $\mathcal F$. The excess loss class is
\begin{align*}
\mathcal L:=\{\ell_f-\ell_{f^*}:f\in\mathcal F\}.
\end{align*}
For $r>0$, its localized slice is
\begin{align*}
\mathcal L(r):=\{g\in\mathcal L:Pg\le r\}.
\end{align*}
[/definition]
Localization replaces a global supremum by suprema over the smaller sets $\mathcal L(r)$. To make this useful, risk must control variance; otherwise a function with tiny mean excess loss could still fluctuate substantially.
[definition: Bernstein Condition]
The excess loss class $\mathcal L$ satisfies a Bernstein condition with constant $B>0$ and exponent $\beta\in(0,1]$ if every $g\in\mathcal L$ satisfies
\begin{align*}
Pg^2\le B(Pg)^\beta.
\end{align*}
[/definition]
When $\beta=1$, variance is proportional to excess risk, which is the strongest common form and is typical in well-specified bounded regression or classification problems with margin assumptions. Smaller $\beta$ gives intermediate behavior.
[example: Bounded Regression With Squared Loss]
Let $Z=(X,Y)$, let $P_X$ be the marginal law of $X$, assume $Y\in[-M,M]$, and let $\mathcal F$ be a convex class of functions $f:\mathcal X\to[-M,M]$. Fix a risk minimizer $f^*\in\mathcal F$ for squared loss $\ell_f(x,y)=(y-f(x))^2$, and write $h=f-f^*$.
For $0<t\le 1$, convexity gives $f^*+th\in\mathcal F$, so minimality of $f^*$ implies
\begin{align*}
0\le R(f^*+th)-R(f^*)=P\left[(Y-f^*(X)-th(X))^2-(Y-f^*(X))^2\right].
\end{align*}
Expanding the square inside the expectation gives
\begin{align*}
0\le -2t\,P\left[(Y-f^*(X))h(X)\right]+t^2P h(X)^2.
\end{align*}
Dividing by $t>0$ and letting $t\downarrow0$ yields the projection inequality
\begin{align*}
P\left[(Y-f^*(X))(f(X)-f^*(X))\right]\le0.
\end{align*}
Now the excess risk is
\begin{align*}
R(f)-R(f^*)=P\left[(Y-f(X))^2-(Y-f^*(X))^2\right].
\end{align*}
Since $f=f^*+h$, the integrand expands as
\begin{align*}
(Y-f^*-h)^2-(Y-f^*)^2=h^2-2(Y-f^*)h.
\end{align*}
Therefore
\begin{align*}
R(f)-R(f^*)=P h^2-2P\left[(Y-f^*)h\right]\ge P h^2=\|f-f^*\|_{L^2(P_X)}^2.
\end{align*}
For the Bernstein bound, set $g=\ell_f-\ell_{f^*}$. Pointwise,
\begin{align*}
g(x,y)=(y-f(x))^2-(y-f^*(x))^2=(f^*(x)-f(x))(2y-f(x)-f^*(x)).
\end{align*}
Because $y,f(x),f^*(x)\in[-M,M]$,
\begin{align*}
|2y-f(x)-f^*(x)|\le 2|y|+|f(x)|+|f^*(x)|\le4M.
\end{align*}
Hence
\begin{align*}
g(x,y)^2\le 16M^2(f(x)-f^*(x))^2.
\end{align*}
Taking expectations and using the previous lower bound on excess risk,
\begin{align*}
Pg^2\le16M^2\|f-f^*\|_{L^2(P_X)}^2\le16M^2\{R(f)-R(f^*)\}=16M^2Pg.
\end{align*}
Thus the excess loss class satisfies the Bernstein condition with exponent $\beta=1$ and constant $16M^2$.
[/example]
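Both inequalities in this example can be checked exactly in a small discrete model. The sketch below assumes a response $Y$ with finitely many values in $[-M,M]$ and takes the convex class of constant predictors $c\in[\mathrm{lo},\mathrm{hi}]\subset[-M,M]$, whose squared-loss minimizer is the mean of $Y$ clipped to the interval. These modelling choices and the function name are assumptions made for illustration.

```python
def bernstein_check(ys, probs, lo, hi, M, grid=41):
    """Return rows (c, Pg, Pg^2, (c - c_star)^2, 16 M^2 Pg) over a grid of constants c."""
    mean_y = sum(p * y for y, p in zip(ys, probs))
    c_star = min(max(mean_y, lo), hi)  # risk minimizer over constants in [lo, hi]
    rows = []
    for k in range(grid):
        c = lo + (hi - lo) * k / (grid - 1)
        # Excess loss g = (y - c)^2 - (y - c_star)^2 at each support point.
        excess = [(y - c) ** 2 - (y - c_star) ** 2 for y in ys]
        Pg = sum(p * g for g, p in zip(excess, probs))
        Pg2 = sum(p * g * g for g, p in zip(excess, probs))
        rows.append((c, Pg, Pg2, (c - c_star) ** 2, 16 * M * M * Pg))
    return rows
```

With `lo` and `hi` chosen so that the mean of $Y$ lies outside the interval, the oracle sits on the boundary and the excess risk strictly exceeds the squared $L^2$ distance, which is the case where the projection inequality does real work.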
The regression example explains why localization can be sharper than a global bound: low excess risk also means low variance. The next problem is to identify the risk level at which localized random fluctuation becomes small compared with the radius of the risk shell. The following definition gives that threshold.
[definition: Complexity Fixed Point]
Let $\psi:(0,\infty)\to(0,\infty)$ be an upper bound for the expected localized Rademacher complexity of the excess loss class, in the sense that
\begin{align*}
\mathbb E\,\mathfrak R_n(\mathcal L(r);Z_1^n)\le \psi(r).
\end{align*}
A positive number $r_n$ is a complexity fixed point if
\begin{align*}
\psi(r_n)\le c r_n
\end{align*}
for a sufficiently small universal constant $c>0$.
[/definition]
The exact constants vary between formulations because some versions use sub-root functions and others use isomorphic inequalities. The conceptual content is stable: below the fixed point, stochastic error can dominate; above it, the empirical criterion has enough accuracy to identify near-optimal functions. This is the same kind of balance condition that appears in nonparametric estimation, inverse problems, and penalized approximation theory: a deterministic approximation scale is matched against a stochastic fluctuation scale.
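A fixed point can be located numerically once a bound $\psi$ is available. The sketch below uses bisection on a logarithmic scale and assumes that $\psi$ is sub-root, meaning $\psi(r)/\sqrt r$ is nonincreasing, so that $\psi(r)/r$ is strictly decreasing and crosses the level $c$ once. The example bound $\psi(r)=\sqrt{dr/n}$ and the constant $c=1/2$ are hypothetical choices for illustration; for that bound the crossing is $r=d/(c^2n)$.

```python
import math


def complexity_fixed_point(psi, c, lo=1e-12, hi=1e6, iterations=200):
    """Smallest r in [lo, hi] with psi(r) <= c * r, for sub-root psi.

    Assumes psi(hi) <= c * hi, so that the crossing lies in the interval.
    """
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since r spans many scales
        if psi(mid) <= c * mid:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    d, n, c = 10, 1000, 0.5
    r_n = complexity_fixed_point(lambda r: math.sqrt(d * r / n), c)
    print(r_n, d / (c * c * n))  # numerical and closed-form fixed points
```

In this parametric-type example the fixed point scales as $d/n$, which is the fast rate, in contrast with the global scale $\sqrt{d/n}$.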
[quotetheorem:9871]
These notes use this theorem as the standard localized empirical-process input. Its proof belongs conceptually to the theory developed through peeling over risk shells, concentration on each shell, and the sub-root fixed-point argument that makes the shell bounds self-consistent.
The boundedness hypothesis supplies concentration on each localized shell; for unbounded losses, the same statement needs tail assumptions and additional truncation or moment arguments. For example, squared loss with a heavy-tailed response can have small mean excess loss near the oracle while rare observations create large empirical deviations, so shell-wise concentration fails unless extra moment or tail conditions are imposed. The Bernstein exponent $1$ is what turns small excess risk into small variance; without it, the localized empirical process may not shrink quickly enough to beat the global $n^{-1/2}$ scale. A boundary case is classification near a decision boundary with no margin condition: many classifiers can have similar risks while disagreeing on a set of non-negligible probability, so variance is not proportional to excess risk. The sub-root and fixed-point assumptions encode the self-consistency condition: above $r_n$ the stochastic error is smaller than the risk radius, while below $r_n$ the data cannot reliably distinguish functions. Thus the theorem explains fast rates when curvature, variance control, and local complexity align, and it points naturally toward penalized model selection where different models have different fixed points.
[example: Model Selection By Structural Risk Minimization]
Suppose $\mathcal F_1\subset\mathcal F_2\subset\cdots$ is a nested sequence of models, and write $R(f)=P\ell_f$ and $R_n(f)=P_n\ell_f$. For each $m$, define the model-wise uniform deviation
\begin{align*}
\Delta_m:=\sup_{f\in\mathcal F_m}|(P_n-P)\ell_f|.
\end{align*}
Assume the penalties are nonnegative and, on the event under consideration, satisfy
\begin{align*}
\operatorname{pen}(m)\ge 2\Delta_m
\end{align*}
for every $m\ge 1$. Structural risk minimization chooses $\hat m$ and $\hat f\in\mathcal F_{\hat m}$ so that, for some optimization tolerance $\eta\ge0$,
\begin{align*}
R_n(\hat f)+\operatorname{pen}(\hat m)\le \inf_{m\ge1}\inf_{f\in\mathcal F_m}\{R_n(f)+\operatorname{pen}(m)\}+\eta.
\end{align*}
Fix any model index $m$ and any $f\in\mathcal F_m$. Since $\hat f\in\mathcal F_{\hat m}$ and $\operatorname{pen}(\hat m)\ge 2\Delta_{\hat m}$,
\begin{align*}
R(\hat f)-R_n(\hat f)\le \Delta_{\hat m}\le \frac{1}{2}\operatorname{pen}(\hat m).
\end{align*}
Therefore
\begin{align*}
R(\hat f)\le R_n(\hat f)+\frac{1}{2}\operatorname{pen}(\hat m).
\end{align*}
Because $\operatorname{pen}(\hat m)\ge0$,
\begin{align*}
R_n(\hat f)+\frac{1}{2}\operatorname{pen}(\hat m)\le R_n(\hat f)+\operatorname{pen}(\hat m).
\end{align*}
Using the defining inequality for $\hat f$ and $\hat m$ gives
\begin{align*}
R(\hat f)\le R_n(f)+\operatorname{pen}(m)+\eta.
\end{align*}
Finally, since $f\in\mathcal F_m$ and $\operatorname{pen}(m)\ge2\Delta_m$,
\begin{align*}
R_n(f)-R(f)\le \Delta_m\le \frac{1}{2}\operatorname{pen}(m).
\end{align*}
Substituting this into the previous display yields
\begin{align*}
R(\hat f)\le R(f)+\frac{3}{2}\operatorname{pen}(m)+\eta.
\end{align*}
Since $m$ and $f\in\mathcal F_m$ were arbitrary,
\begin{align*}
R(\hat f)\le \inf_{m\ge1}\left\{\inf_{f\in\mathcal F_m}R(f)+\frac{3}{2}\operatorname{pen}(m)\right\}+\eta.
\end{align*}
Thus empirical process bounds enter structural risk minimization through the choice of penalty: any bound guaranteeing $\operatorname{pen}(m)\ge 2\Delta_m$ simultaneously for all $m$ on a high-probability event yields the oracle inequality on that event. The selected predictor then competes with the best tradeoff between approximation error and model complexity across the whole sequence.
[/example]
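The oracle inequality in the example can be checked on simulated data. The sketch below is a toy setup under stated assumptions, none of which come from the notes: threshold classifiers on dyadic grids as the nested models, a noisy threshold distribution with known population risk, a sequence truncated at `max_m`, and a penalty built from Hoeffding's inequality with a union bound so that $\operatorname{pen}(m)\ge 2\Delta_m$ holds for all $m$ with probability at least $1-\delta$. Minimization is exact over the truncated family, so $\eta=0$.

```python
import math
import random


def true_risk(t, theta0, noise):
    # Population 0-1 risk of x -> 1{x > t} when X ~ Uniform[0,1] and
    # Y = 1{X > theta0} with the label flipped with probability `noise`.
    return noise + (1.0 - 2.0 * noise) * abs(t - theta0)


def empirical_risk(t, sample):
    return sum((x > t) != y for x, y in sample) / len(sample)


def model(m):
    # Nested finite models: thresholds on the dyadic grid of mesh 2^-m.
    return [k / 2 ** m for k in range(2 ** m + 1)]


def penalty(m, n, delta):
    # Hoeffding plus a union bound over the model and over m, with
    # delta_m = delta / (m (m + 1)) summing to delta.
    size = 2 ** m + 1
    delta_m = delta / (m * (m + 1))
    return 2.0 * math.sqrt(math.log(2.0 * size / delta_m) / (2.0 * n))


def srm(sample, max_m, delta):
    # Exact minimizer of R_n(f) + pen(m) over the truncated sequence.
    n = len(sample)
    best = None
    for m in range(1, max_m + 1):
        for t in model(m):
            crit = empirical_risk(t, sample) + penalty(m, n, delta)
            if best is None or crit < best[0]:
                best = (crit, m, t)
    return best[1], best[2]


random.seed(0)
theta0, noise, n, delta, max_m = 0.3, 0.1, 500, 0.05, 6
sample = []
for _ in range(n):
    x = random.random()
    y = (x > theta0) != (random.random() < noise)  # flip label w.p. noise
    sample.append((x, y))

m_hat, t_hat = srm(sample, max_m, delta)
models = range(1, max_m + 1)

# Model-wise uniform deviations Delta_m, computable here because the
# population risk is known in closed form.
deviation = {
    m: max(abs(empirical_risk(t, sample) - true_risk(t, theta0, noise))
           for t in model(m))
    for m in models
}
event = all(penalty(m, n, delta) >= 2.0 * deviation[m] for m in models)

# Right-hand side of the oracle inequality with eta = 0.
oracle = min(
    min(true_risk(t, theta0, noise) for t in model(m))
    + 1.5 * penalty(m, n, delta)
    for m in models
)
risk_hat = true_risk(t_hat, theta0, noise)
print(m_hat, t_hat, risk_hat, oracle, event)
```

Whenever `event` is true, the derivation guarantees `risk_hat <= oracle`; the simulation only makes that implication visible and does not replace the proof. The Hoeffding penalty is deliberately crude, and sharper penalties would come from the Rademacher or localized bounds developed earlier in the chapter.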
The chapter closes the course by showing that learning theory is not a separate set of estimates but a direct application of empirical process tools. Uniform laws justify ERM, VC and Rademacher bounds quantify slow rates, contraction handles surrogate losses, and localization identifies when the geometry of the problem supports fast rates.
## Beyond and Connections
This course connects most directly to Androma's notes on probability in Banach spaces, [Concentration Inequalities I: Classical Methods](/page/Concentration%20Inequalities%20I%3A%20Classical%20Methods), [Concentration Inequalities II: Entropy and Transport](/page/Concentration%20Inequalities%20II%3A%20Entropy%20and%20Transport), [Weak Convergence](/page/Weak%20Convergence), and [Cambridge II Mathematics of Machine Learning](/page/Cambridge%20II%20Mathematics%20of%20Machine%20Learning). The repeated theme is that high-dimensional randomness is controlled not by pointwise convergence alone, but by the geometry of an indexed class of functions: covering numbers, entropy integrals, symmetrization, Gaussian and Rademacher comparison, and localization all measure that geometry in different ways.
Several later directions build on the same toolkit. Donsker theory studies when empirical processes converge as random elements of function spaces, using the notions of weak convergence and tightness developed in [Weak Convergence](/page/Weak%20Convergence). Concentration and isoperimetry explain why suprema of empirical or Gaussian processes are stable around their means; see [Concentration Inequalities I: Classical Methods](/page/Concentration%20Inequalities%20I%3A%20Classical%20Methods) and [Concentration Inequalities II: Entropy and Transport](/page/Concentration%20Inequalities%20II%3A%20Entropy%20and%20Transport). Learning theory uses the same uniform laws to justify empirical risk minimization, while localized complexity and margin assumptions explain fast rates; this connects directly to the VC, Rademacher, and ERM material in [Cambridge II Mathematics of Machine Learning](/page/Cambridge%20II%20Mathematics%20of%20Machine%20Learning). These links make empirical process theory a bridge between measure-theoretic probability, functional analysis, and modern statistical inference.
## References
- [Weak Convergence](/page/Weak%20Convergence), for the probabilistic meaning of convergence in function spaces and the role of tightness.
- [Concentration Inequalities I: Classical Methods](/page/Concentration%20Inequalities%20I%3A%20Classical%20Methods) and [Concentration Inequalities II: Entropy and Transport](/page/Concentration%20Inequalities%20II%3A%20Entropy%20and%20Transport), for bounded-difference, sub-Gaussian, transport, and isoperimetric tools used to control suprema.
- Gaussian processes and comparison principles, for the Gaussian side of entropy and chaining arguments.
- [Cambridge II Mathematics of Machine Learning](/page/Cambridge%20II%20Mathematics%20of%20Machine%20Learning), for the VC, Rademacher-complexity, and empirical-risk-minimization applications of uniform laws.
Empirical Process Theory
Also known as: empirical process theory, empirical processes, empirical process, uniform empirical processes, weak convergence of empirical processes
Created by admin on 6/22/2026 | Last updated on 6/22/2026