Nonparametric statistics studies inference and estimation methods that do not assume a finite-dimensional parametric form for the underlying distribution or regression function. Rather than specifying that the data come from a normal distribution with two unknown parameters, or from a polynomial regression of degree $k$, nonparametric methods let the data speak more freely about their structure. This course develops both the theoretical foundations and practical techniques for this flexible approach, emphasizing the probability theory that governs empirical processes and the concentration inequalities that guarantee reliable estimation over large model classes.
The material progresses through three interconnected themes. First, we build machinery for understanding convergence of empirical quantities: the empirical distribution function, empirical processes, and their weak limits, culminating in applications to distribution-free goodness-of-fit tests. Second, we study concrete nonparametric estimators—kernel density estimation, kernel regression, and rank-based methods—and develop [uniform convergence](/page/Uniform%20Convergence) theory that controls estimation error across the entire domain. Third, we address inference and optimality: constructing confidence sets without parametric assumptions, establishing minimax lower bounds that characterize fundamental limits, and selecting tuning parameters like bandwidth to balance bias and variance.
The chapters form a logical architecture: Chapters 1–3 establish empirical process theory and [weak convergence](/page/Weak%20Convergence) as the foundation; Chapters 4–7 apply this theory to density and regression estimation; Chapters 8–10 introduce rank and U-statistic methods that leverage order structure; and Chapters 11–12 synthesize uncertainty quantification and theoretical optimality. Throughout, the course balances rigorous probability theory with practical statistical reasoning, preparing you to design and analyze nonparametric procedures in modern applications where parametric assumptions are unjustified or overly restrictive.
# Introduction
This introductory chapter sets the scope and language of the course. Nonparametric statistics studies inference when the unknown object is not reduced in advance to a finite-dimensional parameter such as a mean vector or a covariance matrix. The main examples will be unknown distribution functions, densities, regression functions, quantile functions, and smoothness-constrained families of such objects.
The course sits between measure-theoretic probability and statistical methodology. Probability supplies empirical measures, weak convergence, concentration inequalities, and Hilbert-space or Fourier tools; statistics turns these tools into estimators, tests, confidence statements, and risk bounds. Chapters 2 and 3 develop the empirical-measure and empirical-process language, Chapters 5 through 7 use it for kernel smoothing, and Chapters 11 and 12 return to confidence sets and minimax limits. Kernel density estimation will serve as the central smoothing example, while empirical processes will provide the common language for uniform approximation.
## Why Nonparametric Statistics Needs New Tools
What changes when the unknown object is a function or a distribution rather than a vector in $\mathbb R^d$? In a finite-dimensional parametric model, asymptotic theory often studies a sequence of estimators after rescaling by $\sqrt n$, with the target dimension fixed. In a nonparametric model, the unknown may have infinitely many degrees of freedom, and the statistical difficulty depends on regularity, dimension, and the loss used to judge error.
Before defining estimators, we need a precise container for the possible laws of the data. Without this container, statements such as "valid for all continuous distributions" or "uniformly over a Hölder ball" have no mathematical domain: the same formula for a statistic can have one risk under a Gaussian law and a very different risk under a heavy-tailed law. The statistical experiment records both the measurable observations and the class of laws over which guarantees are being made.
[definition: Statistical Experiment]
A statistical experiment is a triple $(\mathcal X, \mathcal A, \mathcal P)$, where $(\mathcal X, \mathcal A)$ is a measurable sample space and $\mathcal P$ is a set of probability measures on $(\mathcal X, \mathcal A)$.
[/definition]
This definition fixes the objects on which inference is built: a sample space and a model class. To see what is special about this course, we next isolate the finite-dimensional case that nonparametric methods generalise and often contrast with.
[definition: Parametric Model]
A parametric model is a statistical experiment $(\mathcal X, \mathcal A, \mathcal P)$ for which there exist $d \in \mathbb N$, a parameter set $\Theta \subseteq \mathbb R^d$, and a map
\begin{align*}
\Theta &\to \operatorname{Prob}(\mathcal X,\mathcal A), &
\theta &\mapsto P_\theta,
\end{align*}
such that $\mathcal P = \{P_\theta : \theta \in \Theta\}$.
[/definition]
Parametric models remain important throughout the course, both as contrasts and as local approximations. Their limitation is that a fixed finite list of coordinates can force structure that is not part of the statistical problem. When the unknown object is a whole distribution, density, regression curve, or other function, the model must allow variation that cannot be captured by one predetermined finite-dimensional parameter set.
[definition: Nonparametric Model]
A nonparametric model is a statistical experiment $(\mathcal X, \mathcal A, \mathcal P)$ in which $\mathcal P$ is not specified as the image of a fixed finite-dimensional parameter set.
[/definition]
In this course, the model class is usually a family of probability measures described through an infinite-dimensional object such as a distribution function, density, regression function, or smoothness-constrained function. The definition is intentionally about how the statistical experiment is specified: the same broad family can sometimes admit finite-dimensional submodels, but the nonparametric analysis does not begin by choosing one finite list of coordinates. The following example is the first model in the course where the target is a whole function rather than a finite list of parameters.
[example: Distribution Function Model]
Let $X_1,\dots,X_n$ be i.i.d. real-valued random variables with distribution function $F$, so $F(x)=\mathbb P(X_1\le x)$ for each $x\in\mathbb R$. The model lets $F$ range over all distribution functions on $\mathbb R$, rather than assuming a fixed finite-dimensional formula $F=F_\theta$ with $\theta\in\mathbb R^d$.
For a fixed threshold $x\in\mathbb R$, define the empirical distribution function by
\begin{align*}
F_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i\le x\}}.
\end{align*}
The summand $\mathbb{1}_{\{X_i\le x\}}$ equals $1$ on the event $\{X_i\le x\}$ and equals $0$ on its complement. Hence
\begin{align*}
\mathbb E[\mathbb{1}_{\{X_i\le x\}}]=1\cdot\mathbb P(X_i\le x)+0\cdot\mathbb P(X_i>x).
\end{align*}
Since every $X_i$ has distribution function $F$, this becomes
\begin{align*}
\mathbb E[\mathbb{1}_{\{X_i\le x\}}]=F(x).
\end{align*}
By linearity of expectation,
\begin{align*}
\mathbb E[F_n(x)]=\frac{1}{n}\sum_{i=1}^n \mathbb E[\mathbb{1}_{\{X_i\le x\}}].
\end{align*}
Substituting the previous identity gives
\begin{align*}
\mathbb E[F_n(x)]=\frac{1}{n}\sum_{i=1}^n F(x)=F(x).
\end{align*}
Thus $F_n(x)$ estimates the value $F(x)$ by averaging threshold indicators, and the collection of these values over all $x$ estimates the whole distribution function without imposing a finite-dimensional parametrization.
[/example]
This example introduces a recurring pattern: estimate a whole function by averaging simple features of the sample. The first check is whether the proposed estimator has the correct target and whether its pointwise fluctuation decays with $n$, which motivates the following calculation.
[quotetheorem:6293]
[citeproof:6293]
The i.i.d. hypothesis is doing real work here: if the $X_i$ have distribution functions $F_i$ that are not all equal to $F$, then $\mathbb E[F_n(x)]$ becomes $n^{-1}\sum_{i=1}^n F_i(x)$ rather than $F(x)$. The theorem also gives no simultaneous control over all $x$; many later statistics require bounds for $\sup_x |F_n(x)-F(x)|$, not only for one fixed threshold. This pointwise calculation is therefore a local prototype, and the course next develops empirical-process tools that turn pointwise averages into uniform approximations.
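The pointwise calculation can be checked numerically. The following is a minimal sketch, assuming Uniform$(0,1)$ data so that $F(x)=x$ on $[0,1]$; the function name `ecdf`, the seed, the sample size, and the threshold $x_0=0.3$ are illustrative choices rather than part of the course text. Since $nF_n(x)$ is a sum of $n$ i.i.d. Bernoulli$(F(x))$ indicators, the variance of $F_n(x)$ is $F(x)(1-F(x))/n$, and the sketch compares both moments with Monte Carlo averages.

```python
import random

def ecdf(sample, x):
    """Empirical distribution function F_n(x): fraction of sample points <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

# Assumed setting for illustration: Uniform(0, 1) data, so F(x) = x on [0, 1].
rng = random.Random(0)
n, reps, x0 = 50, 2000, 0.3

# Recompute F_n(x0) on many independent samples of size n.
estimates = [ecdf([rng.random() for _ in range(n)], x0) for _ in range(reps)]

mean_estimate = sum(estimates) / reps
var_estimate = sum((e - mean_estimate) ** 2 for e in estimates) / reps

print(f"target F(x0)            = {x0}")
print(f"Monte Carlo mean        = {mean_estimate:.4f}")
print(f"theory var F(1-F)/n     = {x0 * (1 - x0) / n:.5f}")
print(f"Monte Carlo variance    = {var_estimate:.5f}")
```

The sketch only probes one threshold at a time; it says nothing about $\sup_x|F_n(x)-F(x)|$.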
## Statistical Targets, Estimators, And Loss
How do we say what a nonparametric procedure is trying to estimate? The target is often a functional of the underlying distribution, and it may be scalar-valued, vector-valued, or function-valued. The choice of loss is part of the problem because a procedure good for pointwise error may be poor for integrated error or sup-norm error.
[definition: Statistical Functional]
Let $(\mathcal X, \mathcal A, \mathcal P)$ be a statistical experiment and let $(\mathcal T, \mathcal C)$ be a measurable space. A statistical functional is a map $T:\mathcal P \to \mathcal T$.
[/definition]
The functional viewpoint keeps the notation uniform: estimating a mean, a density, a distribution function, or a regression curve all become instances of estimating $T(P)$. To connect this target to data, we need the corresponding notion of a measurable rule computed from the sample.
[definition: Estimator]
Let $X_1,\dots,X_n$ be observations taking values in $(\mathcal X,\mathcal A)$, let $(\mathcal T,\mathcal C)$ be a measurable space, and let $T:\mathcal P\to\mathcal T$ be a statistical functional. An estimator of $T(P)$ is a measurable map
\begin{align*}
\hat T_n:(\mathcal X^n,\mathcal A^{\otimes n})\to(\mathcal T,\mathcal C).
\end{align*}
[/definition]
An estimator by itself is only a candidate rule; it does not say which errors matter or how randomness should be averaged. Pointwise error, integrated error, and classification error can rank the same estimator differently, so statistical comparison needs an explicit loss and the corresponding expected loss under the sampling law.
[definition: Loss And Risk]
Let $T:\mathcal P\to\mathcal T$ be a statistical functional. A loss function is a map $L:\mathcal T\times\mathcal T\to[0,\infty]$. The risk of an estimator $\hat T_n$ at $P\in\mathcal P$ is
\begin{align*}
R(P,\hat T_n) = \mathbb E_P[L(\hat T_n,T(P))].
\end{align*}
[/definition]
Risk is the bridge between probability bounds and statistical conclusions. The next example shows that even for the same target, changing the loss can change the statistical problem.
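For a finite sample space the risk can be computed exactly by enumerating outcomes. The sketch below is a toy illustration under assumptions not made in the text: the data are i.i.d. Bernoulli$(p)$, the target is $T(P)=p$, and the estimator is the sample mean. The helper names `exact_risk`, `sample_mean`, `squared_loss`, and `absolute_loss` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def exact_risk(p, n, estimator, loss):
    """Exact risk E_P[L(estimator(X_1..X_n), T(P))] for i.i.d. Bernoulli(p) data
    with target T(P) = p, obtained by summing over all 2^n outcomes."""
    risk = Fraction(0)
    for outcome in product((0, 1), repeat=n):
        prob = Fraction(1)
        for xi in outcome:
            prob *= p if xi == 1 else 1 - p
        risk += prob * loss(estimator(outcome), p)
    return risk

def sample_mean(xs):
    return Fraction(sum(xs), len(xs))

def squared_loss(t_hat, t):
    return (t_hat - t) ** 2

def absolute_loss(t_hat, t):
    return abs(t_hat - t)

p, n = Fraction(1, 4), 4
print("squared-loss risk :", exact_risk(p, n, sample_mean, squared_loss))
print("p(1-p)/n          :", p * (1 - p) / n)
print("absolute-loss risk:", exact_risk(p, n, sample_mean, absolute_loss))
```

Under squared loss the enumeration reproduces the familiar value $p(1-p)/n$. Swapping in the absolute loss changes the number attached to the same estimator, which is the sense in which the loss is part of the problem.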
[example: Density Estimation Losses]
Let $X_1,\dots,X_n$ be i.i.d. with density $f$ on $[0,1]$ with respect to one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) $\mathcal L^1$, and let $\hat f_n$ be a density estimator. Write the pointwise error as
\begin{align*}
e_n(x)=\hat f_n(x)-f(x),\qquad x\in[0,1].
\end{align*}
The integrated squared loss is
\begin{align*}
L_2(\hat f_n,f)=\int_0^1 |\hat f_n(x)-f(x)|^2\,d\mathcal L^1(x).
\end{align*}
Using the definition of $e_n$, this is
\begin{align*}
L_2(\hat f_n,f)=\int_0^1 |e_n(x)|^2\,d\mathcal L^1(x).
\end{align*}
For instance, if $|e_n(x)|=a$ on a measurable interval $I\subseteq[0,1]$ and $e_n(x)=0$ on $[0,1]\setminus I$, then additivity of the [Lebesgue integral](/page/Lebesgue%20Integral) gives
\begin{align*}
L_2(\hat f_n,f)=\int_I a^2\,d\mathcal L^1(x)+\int_{[0,1]\setminus I}0\,d\mathcal L^1(x).
\end{align*}
Since the integral of a constant over $I$ is that constant times $\mathcal L^1(I)$, we get
\begin{align*}
L_2(\hat f_n,f)=a^2\mathcal L^1(I).
\end{align*}
The sup-norm loss is
\begin{align*}
L_\infty(\hat f_n,f)=\sup_{x\in[0,1]}|\hat f_n(x)-f(x)|.
\end{align*}
Again substituting $e_n(x)=\hat f_n(x)-f(x)$ gives
\begin{align*}
L_\infty(\hat f_n,f)=\sup_{x\in[0,1]}|e_n(x)|.
\end{align*}
For the same error pattern, if $I$ is nonempty and $a\ge 0$, then $|e_n(x)|$ takes the value $a$ on $I$ and the value $0$ off $I$, so
\begin{align*}
L_\infty(\hat f_n,f)=\sup\{a,0\}=a.
\end{align*}
Thus integrated squared loss records both the magnitude and the Lebesgue-measure extent of the error, while sup-norm loss records the largest local deviation even if it occurs on a very small region.
[/example]
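The contrast between the two losses can be made concrete with a small numerical sketch. It assumes the piecewise-constant error pattern of the example and approximates the integral and the supremum on a midpoint grid; the function names and the two error patterns (`wide`, `narrow`) are illustrative choices, not objects from the text.

```python
def integrated_squared_loss(err, grid_size=100_000):
    """Midpoint-rule approximation of the integral of err(x)^2 over [0, 1]."""
    h = 1.0 / grid_size
    return sum(err((k + 0.5) * h) ** 2 for k in range(grid_size)) * h

def sup_norm_loss(err, grid_size=100_000):
    """Grid approximation of the supremum over [0, 1] of |err(x)|."""
    h = 1.0 / grid_size
    return max(abs(err((k + 0.5) * h)) for k in range(grid_size))

def interval_error(a, left, right):
    """Error pattern equal to a on [left, right) and 0 elsewhere."""
    return lambda x: a if left <= x < right else 0.0

wide = interval_error(a=0.5, left=0.2, right=0.7)      # moderate error, long interval
narrow = interval_error(a=2.0, left=0.40, right=0.41)  # large error, short interval

for name, err in [("wide", wide), ("narrow", narrow)]:
    print(name,
          "L2 loss ~", round(integrated_squared_loss(err), 4),
          "sup loss ~", round(sup_norm_loss(err), 4))
```

By the formula $a^2\mathcal L^1(I)$, the wide pattern has the larger integrated squared loss, while the narrow pattern has the larger sup-norm loss, so the two losses rank these two error patterns in opposite orders.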
These two losses lead to different proof techniques and sometimes different rates. This is why the course states assumptions and conclusions with the loss written explicitly.
## The Central Role Of Empirical Measures
What common object lies behind empirical distribution functions, plug-in estimators, goodness-of-fit tests, and rank procedures? The answer is the empirical measure. If we work only with separate formulas such as $F_n(x)$ for each threshold $x$, then the shared averaging structure is hidden, and it becomes hard to compare distribution-function estimators, sample moments, and goodness-of-fit statistics. The empirical measure converts a sample into a random probability measure and allows many statistics to be written as integrals.
[definition: Empirical Measure]
Let $X_1,\dots,X_n$ be observations taking values in a measurable space $(\mathcal X,\mathcal A)$, and let $\operatorname{Prob}(\mathcal X,\mathcal A)$ denote the set of probability measures on $(\mathcal X,\mathcal A)$. The empirical measure is the map
\begin{align*}
P_n:\mathcal X^n\to \operatorname{Prob}(\mathcal X,\mathcal A)
\end{align*}
defined by
\begin{align*}
P_n(x_1,\dots,x_n) = \frac{1}{n}\sum_{i=1}^n \delta_{x_i},
\end{align*}
where $\delta_x$ denotes the Dirac measure at $x$.
[/definition]
The notation $P_n g$ means integration of a measurable function $g:\mathcal X\to\mathbb R$ with respect to $P_n$, and Chapter 2 will use the same notation for plug-in integration throughout. Thus
\begin{align*}
P_n g = \int g\,dP_n = \frac{1}{n}\sum_{i=1}^n g(X_i).
\end{align*}
This identity turns many estimators into averages indexed by a class of functions. Distribution functions use threshold indicators $g_t(x)=\mathbb{1}_{\{x\le t\}}$, moment estimates use functions such as $g(x)=x$ when they are integrable, and goodness-of-fit statistics often take suprema over many test functions. To analyse these statistics, it is not enough to know the behaviour of a single average $P_n g$: the proof must control all centred errors $P_n g-Pg$ for $g$ in the chosen class at the same time.
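As a minimal sketch of the identity $P_n g = n^{-1}\sum_i g(X_i)$, the following code treats the empirical measure as an integration rule and recovers several statistics by changing only the test function $g$. The data values and the helper names `empirical_integral` and `threshold_indicator` are illustrative assumptions.

```python
def empirical_integral(g, sample):
    """P_n g: integrate g against the empirical measure (average g over the sample)."""
    return sum(g(x) for x in sample) / len(sample)

def threshold_indicator(t):
    """The function g_t(x) = 1 if x <= t, else 0."""
    return lambda x: 1.0 if x <= t else 0.0

sample = [0.2, 1.5, -0.7, 3.1, 0.9]   # illustrative data, not from the text

ecdf_at_1 = empirical_integral(threshold_indicator(1.0), sample)  # F_n(1.0)
sample_mean = empirical_integral(lambda x: x, sample)             # first moment
second_moment = empirical_integral(lambda x: x * x, sample)       # second moment

print(ecdf_at_1, sample_mean, second_moment)
```

Each statistic here uses one fixed $g$; the uniform problem arises only when $g$ is allowed to range over a class.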
The distinction between one function and a class of functions is the first place where empirical processes become necessary. For a fixed integrable $g$, the sample average may behave well, while taking a supremum over many choices can fail because the index is chosen after seeing the data; for instance, allowing all measurable indicator functions lets the class adapt to the observed sample. The next definition keeps the whole indexed family in view, so later assumptions can control this adaptivity through restrictions on the size and complexity of the class. It is also the first genuinely functional-analytic object of the course: the random object is no longer a single real variable, but a map indexed by $\mathcal G$.
[definition: Empirical Process]
Let $X_1,\dots,X_n$ be i.i.d. with distribution $P$ on $(\mathcal X,\mathcal A)$, and let $\mathcal G$ be a class of [measurable functions](/page/Measurable%20Functions) $g:\mathcal X\to\mathbb R$ such that $P|g|<\infty$ for every $g\in\mathcal G$. The empirical process indexed by $\mathcal G$ is the random map $\mathbb G_n:\mathcal G\to\mathbb R$ defined by
\begin{align*}
\mathbb G_n(g)=\sqrt n\,(P_n g-Pg), \qquad g\in\mathcal G,
\end{align*}
where $Pg=\int g\,dP$.
[/definition]
Equivalently, after the class $\mathcal G$ has been fixed, $\mathbb G_n$ may be viewed as a random element of the product space $\mathbb R^{\mathcal G}$. The empirical process packages the uniform fluctuation problem into a single random map. This viewpoint is needed for results in which entropy controls quantities such as $\sup_{g\in\mathcal G}|P_n g-Pg|$: the size of $\mathcal G$ determines whether pointwise convergence can be upgraded to uniform convergence. To understand what extra difficulty comes from the index class $\mathcal G$, we first record the single-function law of large numbers that every later uniform result strengthens.
[quotetheorem:6294]
[citeproof:6294]
The integrability assumption cannot be dropped as a harmless technicality: for a Cauchy-distributed observation and $g(x)=x$, the function $g$ is not $P$-integrable, so $Pg$ is undefined and the displayed convergence statement is not even a well-defined assertion about a real number. The i.i.d. assumption is likewise essential to the conclusion, not background decoration. If the variables are independent but have changing means, then $P_n g$ may converge to an average of the changing expectations rather than to $Pg$; if dependence is strong, repeated observations can prevent averaging from reducing fluctuation. The theorem also does not handle data-dependent selection from a class, because it fixes $g$ before the sample is observed. Uniform laws of large numbers ask for convergence of $\sup_{g\in\mathcal G}|P_n g-Pg|$, and later chapters develop entropy and concentration conditions under which this stronger quantity tends to zero.
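For the class of threshold indicators $g_t(x)=\mathbb{1}_{\{x\le t\}}$, the uniform quantity $\sup_{g}|P_n g-Pg|$ equals $\sup_x|F_n(x)-F(x)|$. When $F$ is continuous, this supremum is attained at a jump of $F_n$, so it can be computed exactly from the sorted sample. The sketch below assumes Uniform$(0,1)$ data; the names `sup_deviation` and `uniform_cdf`, the seed, and the sample sizes are illustrative.

```python
import random

def sup_deviation(sample, cdf):
    """Exact sup_x |F_n(x) - F(x)| for a continuous distribution function F.

    At the i-th order statistic F_n jumps from (i-1)/n to i/n, so it is enough
    to compare F with both values at every sorted sample point.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

def uniform_cdf(x):
    """Distribution function of Uniform(0, 1)."""
    return min(max(x, 0.0), 1.0)

rng = random.Random(1)
for n in (100, 1_000, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, round(sup_deviation(sample, uniform_cdf), 4))
```

For this particular class the printed suprema should shrink as $n$ grows. The sketch does not show why, and it does not extend to richer classes without the complexity conditions developed later.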
## Smoothing, Bias, And Variance
Why does density estimation require smoothing when distribution-function estimation does not? A density is not directly observed; the sample consists of point masses, so the empirical measure is too rough to be a density with respect to Lebesgue measure. If one tried to use the empirical measure itself as a density, it would be singular with respect to one-dimensional Lebesgue measure $\mathcal L^1$, assigning mass to sample points rather than spreading mass across intervals. Kernel estimators replace each point mass by a small bump and then choose the bump width to balance approximation error and sampling variation.
[definition: Kernel]
A kernel on $\mathbb R$ is an integrable function $K:\mathbb R\to\mathbb R$ satisfying
\begin{align*}
\int_{\mathbb R} K(u)\,d\mathcal L^1(u)=1.
\end{align*}
[/definition]
Additional moment, smoothness, or sign conditions will be imposed when particular theorems require them. The scaling parameter is the bandwidth, and this motivates the estimator obtained by applying the rescaled kernel at every observation.
[definition: Kernel Density Estimator]
Let $X_1,\dots,X_n$ be real-valued observations, let $K$ be a kernel on $\mathbb R$, and let $h>0$. The kernel density estimator with bandwidth $h$ is the map
\begin{align*}
\hat f_h:\mathbb R\to\mathbb R, \qquad
x \mapsto \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right).
\end{align*}
[/definition]
The estimator averages nearby observations after rescaling distance by $h$. Small $h$ preserves local features but has high variance; large $h$ reduces variance but blurs features of the density.
[example: Gaussian Kernel Estimator]
Take $K(u)=(2\pi)^{-1/2}e^{-u^2/2}$ and observations $X_1,\dots,X_n\in\mathbb R$. Substituting this kernel into the definition of the kernel density estimator gives
\begin{align*}
\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right).
\end{align*}
Using the displayed formula for $K$ with $u=(x-X_i)/h$, each term becomes
\begin{align*}
K\left(\frac{x-X_i}{h}\right)=(2\pi)^{-1/2}\exp\left(-\frac{1}{2}\left(\frac{x-X_i}{h}\right)^2\right).
\end{align*}
Since $\left((x-X_i)/h\right)^2=(x-X_i)^2/h^2$, we obtain
\begin{align*}
\hat f_h(x)=\frac{1}{n}\sum_{i=1}^n \frac{1}{h\sqrt{2\pi}}\exp\left(-\frac{(x-X_i)^2}{2h^2}\right).
\end{align*}
For each fixed observation $X_i$, the $i$th summand is the Gaussian density in the variable $x$ with mean $X_i$ and variance $h^2$, because it has normalizing factor $(h\sqrt{2\pi})^{-1}$ and exponent $-(x-X_i)^2/(2h^2)$. Thus $\hat f_h$ is the average of $n$ Gaussian bumps, one centred at each observation and each having standard deviation $h$.
If $h$ is replaced by a larger bandwidth $h'>h$, then for any fixed $x\ne X_i$,
\begin{align*}
0<\frac{(x-X_i)^2}{2(h')^2}<\frac{(x-X_i)^2}{2h^2}.
\end{align*}
Multiplying by $-1$ reverses the inequalities, so
\begin{align*}
-\frac{(x-X_i)^2}{2h^2}<-\frac{(x-X_i)^2}{2(h')^2}<0.
\end{align*}
Hence the exponential factor decays more slowly away from $X_i$ when the bandwidth is larger, while the smaller prefactor $1/(h'\sqrt{2\pi})<1/(h\sqrt{2\pi})$ lowers the peak height. Increasing $h$ therefore spreads each observation's contribution over a wider region, which can merge several local peaks into a smoother curve.
[/example]
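The effect of the bandwidth on the shape of $\hat f_h$ can be seen in a short sketch. It implements the displayed formula with the Gaussian kernel and counts local maxima on a grid for two bandwidths. The four data points, the grid, and the helper names `gaussian_kernel`, `kde`, and `count_local_maxima` are illustrative assumptions, and the grid count is only a rough diagnostic.

```python
import math

def gaussian_kernel(u):
    """Standard normal density K(u) = (2 pi)^(-1/2) exp(-u^2 / 2)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Kernel density estimate (1 / (n h)) * sum_i K((x - X_i) / h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

sample = [-1.0, -0.8, 1.0, 1.2]   # illustrative data forming two clusters

def count_local_maxima(h, lo=-4.0, hi=4.0, steps=2000):
    """Count strict interior local maxima of x -> kde(x, sample, h) on a grid."""
    ys = [kde(lo + (hi - lo) * k / steps, sample, h) for k in range(steps + 1)]
    return sum(1 for k in range(1, steps) if ys[k - 1] < ys[k] > ys[k + 1])

for h in (0.3, 2.0):
    print(f"h = {h}: modes on grid = {count_local_maxima(h)}, "
          f"height at -0.9 = {kde(-0.9, sample, h):.3f}")
```

With these values the small bandwidth should keep one mode per cluster, while the large bandwidth merges the clusters into a single lower and wider bump.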
Squared loss is valuable because it turns prediction error into an [orthogonal decomposition](/theorems/436). If a random quantity is estimated by a fixed number, the unavoidable fluctuation around its mean should be separated from the error caused by choosing the wrong centre.
The same question appears when the estimate is allowed to depend on observed information. We need a general identity that says which part of squared loss is residual randomness and which part is approximation error from the chosen predictor.
[quotetheorem:4425]
[citeproof:4425]
In the first identity, the best constant predictor under squared loss is the mean $\mu$, and every other constant $a$ pays two costs: the intrinsic fluctuation $\operatorname{Var}(Y)$ and the squared displacement $(\mu-a)^2$. The conditional identity says the same thing with information included. Once $X$ is observed, the conditional mean $\mathbb E[Y\mid X]$ is the optimal squared-loss predictor, and any chosen rule $g(X)$ adds an approximation term on top of the residual conditional variance.
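The constant-predictor case can be verified with exact arithmetic. This is a sketch under the assumption that the first identity reads $\mathbb E[(Y-a)^2]=\operatorname{Var}(Y)+(\mu-a)^2$, as the description above indicates. The three-point law for $Y$ and the helper names are hypothetical choices made only for illustration.

```python
from fractions import Fraction

# Hypothetical discrete law for Y: values and probabilities chosen for illustration.
values = [Fraction(0), Fraction(1), Fraction(4)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

def expect(h):
    """Expectation of h(Y) under the discrete law above."""
    return sum(p * h(y) for y, p in zip(values, probs))

mu = expect(lambda y: y)
variance = expect(lambda y: (y - mu) ** 2)

def mean_squared_error(a):
    """E[(Y - a)^2] for a constant predictor a."""
    return expect(lambda y: (y - a) ** 2)

for a in (Fraction(0), mu, Fraction(3)):
    lhs = mean_squared_error(a)
    rhs = variance + (mu - a) ** 2
    print(f"a = {a}: E(Y-a)^2 = {lhs}, Var + (mu-a)^2 = {rhs}, equal = {lhs == rhs}")
```

Because the arithmetic is exact, the two sides agree as rational numbers, and the mean squared error is smallest at $a=\mu$, where only the variance term remains.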
This theorem is not yet a kernel-density risk formula. Its role here is conceptual: it explains why later KDE calculations separately estimate stochastic fluctuation and smoothing bias. When we return to kernels, additional assumptions on the sample, target density, kernel, bandwidth, and loss space will be needed before this abstract squared-loss decomposition can be turned into pointwise or integrated risk bounds.
## Distribution-Free Methods And Ranks
Can useful inference avoid estimating the unknown distribution altogether? Many classical nonparametric tests are distribution-free under a null hypothesis: their null distribution does not depend on the underlying continuous distribution. Raw observations usually do not have this property; for example, the sample mean has a different null distribution under a standard normal law and under a centred Laplace law. This phenomenon is strongest for ranks and order statistics, which remove parts of the data that carry distribution-specific scale and shape information.
[definition: Order Statistics]
Let $n\in\mathbb N$. The order-statistic map is the function
\begin{align*}
S_n:\mathbb R^n\to\mathbb R^n,\qquad S_n(x_1,\dots,x_n)=(x_{(1)},\dots,x_{(n)}),
\end{align*}
where
\begin{align*}
x_{(1)}\le x_{(2)}\le \dots \le x_{(n)}
\end{align*}
and $\{x_{(1)},\dots,x_{(n)}\}=\{x_1,\dots,x_n\}$ as multisets. For real-valued observations $X_1,\dots,X_n$, the order statistics are the random variables $X_{(1)},\dots,X_{(n)}$ obtained by applying $S_n$ to $(X_1,\dots,X_n)$.
[/definition]
Order statistics retain the sorted values but not the original labels. Rank methods go further by keeping only relative order: this matters because numerical spacings can be dominated by the unknown marginal distribution, so a test based on raw gaps may change under a monotone transformation even when the ordering information is unchanged. The next definition names the statistic that forgets numerical spacing while preserving ordering information.
[definition: Ranks]
Let
\begin{align*}
\mathcal X_n^{\neq}=\{(x_1,\dots,x_n)\in\mathbb R^n:x_i\neq x_j\ \text{for }i\neq j\}.
\end{align*}
For each $i\in\{1,\dots,n\}$, the rank map is
\begin{align*}
R_i:\mathcal X_n^{\neq}\to\{1,\dots,n\},\qquad
R_i(x_1,\dots,x_n)=\sum_{j=1}^n \mathbb{1}_{\{x_j\le x_i\}}.
\end{align*}
For real-valued observations $X_1,\dots,X_n$ with no ties, the rank of $X_i$ is
\begin{align*}
R_i=\sum_{j=1}^n \mathbb{1}_{\{X_j\le X_i\}}.
\end{align*}
[/definition]
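The two definitions translate directly into code. The sketch below computes order statistics and ranks for a sample without ties and checks that the ranks are unchanged by a strictly increasing transformation, while the order statistics are not. The data and the choice of $x\mapsto e^x$ as the transformation are illustrative assumptions.

```python
import math

def order_statistics(xs):
    """Apply the order-statistic map S_n: return the values sorted increasingly."""
    return sorted(xs)

def ranks(xs):
    """R_i = number of j with x_j <= x_i; only defined here for samples without ties."""
    if len(set(xs)) != len(xs):
        raise ValueError("ranks are defined here only for samples without ties")
    return [sum(1 for xj in xs if xj <= xi) for xi in xs]

sample = [2.5, -1.0, 0.3, 7.2]                # illustrative data
transformed = [math.exp(x) for x in sample]   # strictly increasing transformation

print("ranks of sample     :", ranks(sample))
print("ranks of transformed:", ranks(transformed))
print("order statistics    :", order_statistics(sample))
print("after transformation:", [round(v, 3) for v in order_statistics(transformed)])
```

The ranks and order statistics together recover the sample, since $x_{(R_i)}=x_i$ when there are no ties.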
When the common distribution is continuous, ties have probability zero, but distribution-freeness also requires more than merely assigning ranks. The key question is whether every strict ordering of the sample is equally likely under an arbitrary continuous i.i.d. law; if so, the unknown distribution disappears after passing to the rank vector.
[quotetheorem:6295]
[citeproof:6295]
Continuity is essential because ties change what rank information represents: if the $X_i$ are Bernoulli random variables, then ties occur with positive probability, and the quantities $R_i=\sum_j\mathbb{1}_{\{X_j\le X_i\}}$ are defined but need not form a permutation of $(1,\dots,n)$. A separate tie-breaking or mid-rank convention would then be part of the statistic. The i.i.d. assumption is also essential, since observations with different continuous distributions are not exchangeable and the strict orderings need not have equal probabilities. This theorem is the prototype for rank tests such as the Wilcoxon and Mann-Whitney procedures, while later chapters combine the same symmetry idea with asymptotic approximations when exact finite-sample distributions become cumbersome.
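The symmetry behind distribution-freeness can be probed by simulation. The sketch below assumes the statement discussed above, namely that for continuous i.i.d. data every strict ordering of the sample is equally likely, and estimates the frequencies of the $3!=6$ rank vectors for $n=3$ under two different continuous laws. The choice of laws, the seed, the number of replications, and the helper names are illustrative.

```python
import random
from collections import Counter

def rank_vector(xs):
    """Rank vector (R_1, ..., R_n) with R_i = number of j such that x_j <= x_i."""
    return tuple(sum(1 for xj in xs if xj <= xi) for xi in xs)

def rank_frequencies(draw, n, reps, seed):
    """Monte Carlo frequencies of each rank vector for i.i.d. samples of size n."""
    rng = random.Random(seed)
    counts = Counter(rank_vector([draw(rng) for _ in range(n)]) for _ in range(reps))
    return {perm: c / reps for perm, c in sorted(counts.items())}

# Two continuous laws with very different shapes (illustrative choices).
laws = {
    "exponential": lambda rng: rng.expovariate(1.0),
    "normal": lambda rng: rng.gauss(0.0, 1.0),
}

for name, draw in laws.items():
    freqs = rank_frequencies(draw, n=3, reps=60_000, seed=2)
    print(name, {perm: round(f, 3) for perm, f in freqs.items()})
```

Under both laws the six frequencies should be close to $1/6$ up to Monte Carlo error. A simulation of this kind illustrates the claim but does not replace the exchangeability argument.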
## How The Course Is Organised
What should the reader expect after this introduction? The course begins with statistical models without finite parameters, where we formalise targets, losses, plug-in estimators, and minimax risk. It then turns to empirical distribution functions, quantiles, concentration inequalities, and the uniform laws that make empirical procedures reliable.
The middle part develops kernel density estimation in detail: bias expansions, variance bounds, integrated risk, bandwidth choice, boundary effects, and adaptive ideas. U-statistics and rank methods then provide a second family of nonparametric tools, where symmetry and degeneracy replace smoothing as the main organising ideas.
The final part moves from density estimation to regression. Local polynomial estimators, smoothness classes, and confidence bands show how the same bias-variance and empirical-process principles appear in more structured settings. Along the way the methods connect to approximation theory through smoothness classes, to optimisation through risk minimisation, and to functional analysis through norms and compactness ideas. The unifying question throughout is how much can be inferred from the data when the model is flexible enough to contain many possible distributions or functions.
Chapter 1 addresses this unifying question head-on by developing the foundational language for nonparametric models. It formalizes what we mean by distributions, densities, and functions that are not determined by a fixed finite number of parameters, and establishes the framework within which the rest of the course operates.
# 1. Statistical Models Without Finite Parameters
Nonparametric statistics begins from a simple tension: many statistical questions ask for an entire function, distribution, or set, while classical parametric theory usually assumes that the unknown object is described by finitely many [real numbers](/page/Real%20Numbers). This chapter sets up the language needed to treat such problems as honest statistical models rather than as informal curve-fitting tasks. The prerequisites are basic probability, expectation, i.i.d. sampling, elementary real analysis, and the idea of a measurable function. We introduce infinite-dimensional model classes, specify what it means for a target to be identifiable, compare estimators through loss and risk, and record the first lower-bound method that explains why smoothness assumptions control achievable rates.
## Infinite-Dimensional Statistical Models
What changes when the unknown parameter is not a vector in $\mathbb R^d$? The sample still has a probability law, but the collection of possible laws is indexed by objects such as distribution functions, densities, or regression functions. The purpose of the model is to state which laws are allowed and which feature of the law we want to estimate.
[definition: Statistical Model]
A statistical model for data $X$ taking values in a measurable space $(\mathcal X, \mathcal A)$ is a family $\mathcal P$ of probability measures on $(\mathcal X, \mathcal A)$.
[/definition]
The model is nonparametric when $\mathcal P$ is too large to be described by a fixed finite-dimensional Euclidean parameter without losing the structure of the problem. In these notes, the unknown distribution itself is often the parameter, and a target functional $\theta(P)$ extracts the object of interest from $P \in \mathcal P$.
[definition: Nonparametric Model]
A nonparametric statistical model is a statistical model $\mathcal P$ whose elements are indexed by an infinite-dimensional class $\Theta$, together with a map $\theta \mapsto P_\theta$ from $\Theta$ to $\operatorname{Prob}(\mathcal X,\mathcal A)$, the set of probability measures on $(\mathcal X,\mathcal A)$.
[/definition]
This definition leaves room for several common forms. A distribution model indexes the law by its distribution function; a density model indexes absolutely continuous laws by their density; a regression model indexes conditional means by a function. The common feature is that the target is a whole curve or function, not a short coordinate vector.
[example: Distribution Function Model]
Let $X_1,\dots,X_n$ be i.i.d. real-valued random variables with common law $P$ on $(\mathbb R,\mathcal B(\mathbb R))$. For each $x\in\mathbb R$, the set $(-\infty,x]$ belongs to $\mathcal B(\mathbb R)$, so $P((-\infty,x])$ is a well-defined probability. The distribution function of $P$ is
\begin{align*}
F_P(x)=P((-\infty,x]),\qquad x\in\mathbb R.
\end{align*}
Since $X_1$ has law $P$, the event $\{X_1\le x\}$ is the same as the inverse image $X_1^{-1}((-\infty,x])$, and therefore
\begin{align*}
\mathbb P(X_1\le x)
=\mathbb P\bigl(X_1\in(-\infty,x]\bigr)
=P((-\infty,x])
=F_P(x).
\end{align*}
Thus the nonparametric model may be written as
\begin{align*}
\mathcal P=\operatorname{Prob}(\mathbb R,\mathcal B(\mathbb R)),
\end{align*}
with target map
\begin{align*}
P\longmapsto F_P.
\end{align*}
For a fixed threshold $x$, the target value is the single number
\begin{align*}
F_P(x)=P((-\infty,x]).
\end{align*}
For the whole distribution function, the target is the full collection
\begin{align*}
\bigl(F_P(x):x\in\mathbb R\bigr)
=
\bigl(P((-\infty,x]):x\in\mathbb R\bigr).
\end{align*}
This collection is indexed by all real thresholds, not by finitely many coordinates, so the target is a function rather than a finite-dimensional vector.
The empirical distribution function, studied in detail in the next chapter, follows the plug-in principle. If
\begin{align*}
P_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i},
\end{align*}
then evaluating the same target map at $P_n$ gives
\begin{align*}
F_{P_n}(x)
=P_n((-\infty,x])
=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}((-\infty,x]).
\end{align*}
For each $i$, the point mass $\delta_{X_i}$ assigns mass $1$ to $(-\infty,x]$ when $X_i\in(-\infty,x]$, and assigns mass $0$ to $(-\infty,x]$ when $X_i\notin(-\infty,x]$. Equivalently,
\begin{align*}
\delta_{X_i}((-\infty,x])=\mathbb 1_{\{X_i\le x\}}.
\end{align*}
Substituting this identity into the preceding display gives
\begin{align*}
F_{P_n}(x)=\frac{1}{n}\sum_{i=1}^n \mathbb 1_{\{X_i\le x\}}.
\end{align*}
The plug-in estimator therefore estimates each probability $P((-\infty,x])$ by the observed fraction of sample points falling at or below $x$, while the full curve records that fraction simultaneously for every threshold $x$.
[/example]
The distribution function model is broad because it allows discrete, continuous, and mixed laws. If the question of interest concerns local shape, such as the height of a density at a point, the model must restrict attention to absolutely continuous distributions and name the density class explicitly. We write $\mathcal L^1$ for one-dimensional Lebesgue measure on $\mathbb R$, restricted to $[0,1]$ when the domain is $[0,1]$.
[definition: Density Model]
Let $\mathcal F$ be a class of nonnegative measurable functions $f:[0,1]\to[0,\infty)$ satisfying $\int_0^1 f(x)\,d\mathcal L^1(x)=1$. The density model generated by $\mathcal F$ is the family of laws $P_f$ on $[0,1]$ defined by
\begin{align*}
P_f(A)=\int_A f(x)\,d\mathcal L^1(x), \qquad A\in\mathcal B([0,1]).
\end{align*}
[/definition]
Here the parameter is the density $f$, not a finite vector. The observation law determines $f$ only up to a.e. equality, so any loss for density estimation must respect that convention.
[example: Uniform Density Benchmark]
In the density model on $[0,1]$, the function $f_0(x)=1$ is a valid density: for every $x\in[0,1]$,
\begin{align*}
f_0(x)=1\ge 0,
\end{align*}
and
\begin{align*}
\int_0^1 f_0(x)\,d\mathcal L^1(x)=\int_0^1 1\,d\mathcal L^1(x)=\mathcal L^1([0,1])=1.
\end{align*}
If an estimator $\hat f_n$ is measured by integrated squared error, then substituting $f_0(x)=1$ into the risk gives
\begin{align*}
R(f_0,\hat f_n)=\mathbb E_{f_0}\left[\int_0^1(\hat f_n(x)-f_0(x))^2\,d\mathcal L^1(x)\right]=\mathbb E_{f_0}\left[\int_0^1(\hat f_n(x)-1)^2\,d\mathcal L^1(x)\right].
\end{align*}
This benchmark separates variance from interior curvature bias. Suppose an interior kernel smoother has expectation
\begin{align*}
\mathbb E_{f_0}[\hat f_n(x)]=\int_0^1 K_h(x-u)f_0(u)\,d\mathcal L^1(u),
\end{align*}
where $K_h(v)=h^{-1}K(v/h)$, the point $x$ is far enough from $0$ and $1$ that the kernel window is contained in $[0,1]$, and the normalized kernel satisfies
\begin{align*}
\int_{\mathbb R}K(w)\,d\mathcal L^1(w)=1.
\end{align*}
Since $f_0(u)=1$ for every $u\in[0,1]$,
\begin{align*}
\mathbb E_{f_0}[\hat f_n(x)]=\int_0^1 K_h(x-u)\,d\mathcal L^1(u),
\end{align*}
and therefore
\begin{align*}
\mathbb E_{f_0}[\hat f_n(x)]-f_0(x)=\int_0^1 K_h(x-u)\,d\mathcal L^1(u)-1.
\end{align*}
Because the full kernel window lies inside $[0,1]$, the integrand $K_h(x-u)$ contributes no mass outside $[0,1]$, so
\begin{align*}
\int_0^1 K_h(x-u)\,d\mathcal L^1(u)=\int_{\mathbb R}K_h(x-u)\,d\mathcal L^1(u).
\end{align*}
Using $K_h(v)=h^{-1}K(v/h)$, the full-line integral is
\begin{align*}
\int_{\mathbb R}K_h(x-u)\,d\mathcal L^1(u)=\int_{\mathbb R}h^{-1}K\left(\frac{x-u}{h}\right)\,d\mathcal L^1(u).
\end{align*}
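With the substitution $w=(x-u)/h$, so that $u=x-hw$, the differential $du=-h\,dw$ carries a minus sign that is cancelled by reversing the limits of integration; in measure notation $d\mathcal L^1(u)=h\,d\mathcal L^1(w)$, and this becomes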
\begin{align*}
\int_{\mathbb R}h^{-1}K\left(\frac{x-u}{h}\right)\,d\mathcal L^1(u)=\int_{\mathbb R}h^{-1}K(w)h\,d\mathcal L^1(w).
\end{align*}
Canceling $h^{-1}h=1$ gives
\begin{align*}
\int_{\mathbb R}h^{-1}K(w)h\,d\mathcal L^1(w)=\int_{\mathbb R}K(w)\,d\mathcal L^1(w)=1.
\end{align*}
Substituting this value into the bias expression gives
\begin{align*}
\mathbb E_{f_0}[\hat f_n(x)]-f_0(x)=1-1=0.
\end{align*}
Near $0$ or $1$, part of the kernel window can lie outside $[0,1]$, so the identity
\begin{align*}
\int_0^1 K_h(x-u)\,d\mathcal L^1(u)=\int_{\mathbb R}K_h(x-u)\,d\mathcal L^1(u)
\end{align*}
can fail. Thus the constant density removes interior shape bias, while still exposing boundary bias unless the estimator renormalizes or reflects the kernel.
[/example]
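The interior and boundary cases can be compared numerically. The sketch below assumes the Epanechnikov kernel $K(w)=\tfrac34(1-w^2)$ on $[-1,1]$, which is one possible normalized kernel and is not specified in the example above. It evaluates $\int_0^1 K_h(x-u)\,d\mathcal L^1(u)$ in closed form through the kernel's distribution function; the function names are hypothetical.

```python
def epanechnikov_cdf(w):
    """CDF of the Epanechnikov kernel K(w) = 0.75 * (1 - w**2) on [-1, 1]."""
    w = max(-1.0, min(1.0, w))  # the kernel has no mass outside [-1, 1]
    return 0.5 + 0.75 * (w - w**3 / 3.0)


def expected_estimate_uniform(x, h):
    """E[f_hat(x)] under f_0 = 1: integral of K_h(x - u) over u in [0, 1]."""
    # With w = (x - u) / h, u in [0, 1] maps to w in [(x - 1) / h, x / h].
    return epanechnikov_cdf(x / h) - epanechnikov_cdf((x - 1.0) / h)


h = 0.1
interior_bias = expected_estimate_uniform(0.5, h) - 1.0
boundary_bias = expected_estimate_uniform(0.0, h) - 1.0
print(interior_bias, boundary_bias)
```

At $x=0.5$ the whole window $[0.4,0.6]$ lies inside $[0,1]$ and the bias vanishes. At $x=0$ half of the window lies outside $[0,1]$, so for this symmetric kernel the expected estimate is $1/2$ and the bias is $-1/2$.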
A density model describes how mass is distributed without covariates. Many applications instead observe responses at design points, so the unknown object is the conditional mean rather than the marginal law of one observation. This motivates the regression formulation, where the function class controls how the mean response may vary across the design space.
[definition: Nonparametric Regression Model]
Fix design points $t_1,\dots,t_n\in[0,1]$, a function class $\mathcal G\subset\{g:[0,1]\to\mathbb R\}$, and a class $\mathcal E$ of product noise laws $\nu=\nu_1\otimes\cdots\otimes\nu_n$ on $\mathbb R^n$ such that, under $\nu$, the coordinate variables $\varepsilon_1,\dots,\varepsilon_n$ are independent and satisfy $\mathbb E_\nu[\varepsilon_i]=0$ for $i=1,\dots,n$. For each $(g,\nu)\in\mathcal G\times\mathcal E$, let $P_{g,\nu}$ be the joint law of $(Y_1,\dots,Y_n)$ defined by $Y_i=g(t_i)+\varepsilon_i$. The nonparametric regression model is the family $\{P_{g,\nu}:(g,\nu)\in\mathcal G\times\mathcal E\}$.
[/definition]
Under the zero-mean noise condition, the values $g(t_1),\dots,g(t_n)$ are identifiable as the mean vector of the observations; values of $g$ away from the design points are not determined without further structure on $\mathcal G$. Shape constraints, such as monotonicity, are often used when smooth derivatives are not scientifically natural.
[example: Monotone Regression Curve]
Suppose $t_i=i/n$ and
\begin{align*}
Y_i=g(t_i)+\varepsilon_i,\qquad i=1,\dots,n,
\end{align*}
where $g:[0,1]\to\mathbb R$ is nondecreasing and the errors are i.i.d. with $\mathbb E[\varepsilon_i]=0$ and $\operatorname{Var}(\varepsilon_i)=\sigma^2$. Since $g(t_i)$ is nonrandom once the design point $t_i$ is fixed, linearity of expectation gives
\begin{align*}
\mathbb E[Y_i]=\mathbb E[g(t_i)+\varepsilon_i]=\mathbb E[g(t_i)]+\mathbb E[\varepsilon_i]=g(t_i)+0=g(t_i).
\end{align*}
Thus the observable mean vector is $(g(t_1),\dots,g(t_n))$. Because
\begin{align*}
t_1=\frac{1}{n}\le \frac{2}{n}=t_2\le\cdots\le \frac{n}{n}=t_n
\end{align*}
and $g$ is nondecreasing, the mean vector satisfies
\begin{align*}
g(t_1)\le g(t_2)\le\cdots\le g(t_n).
\end{align*}
The monotone least-squares estimator chooses
\begin{align*}
(\hat m_1,\dots,\hat m_n)\in\arg\min_{m_1\le\cdots\le m_n}\sum_{i=1}^n(Y_i-m_i)^2.
\end{align*}
To see why fitted flat pieces are block averages, suppose a block
\begin{align*}
B=\{a,a+1,\dots,b\}
\end{align*}
is assigned one common fitted value $c$, and write
\begin{align*}
\bar Y_B=\frac{1}{|B|}\sum_{i\in B}Y_i.
\end{align*}
For each $i\in B$,
\begin{align*}
Y_i-c=(Y_i-\bar Y_B)+(\bar Y_B-c).
\end{align*}
Squaring this identity gives
\begin{align*}
(Y_i-c)^2=(Y_i-\bar Y_B)^2+2(Y_i-\bar Y_B)(\bar Y_B-c)+(\bar Y_B-c)^2.
\end{align*}
Summing over $i\in B$ and using that $\bar Y_B-c$ does not depend on $i$ gives
\begin{align*}
\sum_{i\in B}(Y_i-c)^2=\sum_{i\in B}(Y_i-\bar Y_B)^2+2(\bar Y_B-c)\sum_{i\in B}(Y_i-\bar Y_B)+\sum_{i\in B}(\bar Y_B-c)^2.
\end{align*}
The centered sum is zero because
\begin{align*}
\sum_{i\in B}(Y_i-\bar Y_B)=\sum_{i\in B}Y_i-\sum_{i\in B}\bar Y_B.
\end{align*}
Since $\bar Y_B$ is constant over the block,
\begin{align*}
\sum_{i\in B}\bar Y_B=|B|\bar Y_B=|B|\frac{1}{|B|}\sum_{i\in B}Y_i=\sum_{i\in B}Y_i.
\end{align*}
Therefore
\begin{align*}
\sum_{i\in B}(Y_i-\bar Y_B)=\sum_{i\in B}Y_i-\sum_{i\in B}Y_i=0.
\end{align*}
Also, since $(\bar Y_B-c)^2$ is the same for every index in $B$,
\begin{align*}
\sum_{i\in B}(\bar Y_B-c)^2=|B|(\bar Y_B-c)^2.
\end{align*}
Substituting these two identities yields
\begin{align*}
\sum_{i\in B}(Y_i-c)^2=\sum_{i\in B}(Y_i-\bar Y_B)^2+|B|(\bar Y_B-c)^2.
\end{align*}
The first term is independent of $c$, while $|B|(\bar Y_B-c)^2\ge0$ and equals $0$ exactly when $c=\bar Y_B$. Hence the squared error over a flat fitted block is minimized by assigning that block the sample average of its observations.
The fitted vector can be displayed as the step function
\begin{align*}
\hat g_n(t)=\hat m_i\qquad\text{for }t\in(t_{i-1},t_i],
\end{align*}
with the conventions $t_0=0$ and $\hat g_n(0)=\hat m_1$. The monotonicity constraint pools neighbouring observations precisely when separate noisy fitted values would violate the required order, so qualitative shape information supplies structure without assuming differentiability.
[/example]
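The block-average structure can be seen in a short computation. The sketch below uses the pool-adjacent-violators algorithm, a standard way to compute the monotone least-squares fit that the example above does not name; the function name and the data are illustrative assumptions.

```python
def isotonic_least_squares(y):
    """Pool-adjacent-violators: minimize sum (y_i - m_i)^2 over m_1 <= ... <= m_n."""
    blocks = []  # each block is [sum of observations, number of observations]
    for value in y:
        blocks.append([value, 1])
        # Merge while the previous block mean exceeds the last block mean.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        # Every index in a block receives the block average.
        fitted.extend([total / count] * count)
    return fitted


y = [1.0, 3.0, 2.0, 4.0, 3.5, 5.0]  # hypothetical responses
fit = isotonic_least_squares(y)
print(fit)
```

In this data set the order violations $3>2$ and $4>3.5$ are each resolved by pooling two neighbours, and the pooled fitted values $2.5$ and $3.75$ are the corresponding block averages.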
## Identifiability, Loss, and Risk
Before asking whether an estimator is good, we need to know whether the target is determined by the observation law. Nonparametric models often contain redundant parameterisations: two densities that differ on a null set, or two regression functions that agree at all design points, can generate the same data distribution.
[definition: Identifiable Target]
Let $\mathcal P=\{P_\theta:\theta\in\Theta\}$ be a statistical model and let $\psi:\Theta\to\Psi$ be a target map. The target $\psi$ is identifiable in $\mathcal P$ if
\begin{align*}
P_{\theta_1}=P_{\theta_2}\implies \psi(\theta_1)=\psi(\theta_2)
\end{align*}
for all $\theta_1,\theta_2\in\Theta$.
[/definition]
Identifiability turns the target into a functional of the probability law. Without it, no estimator can separate parameters that induce the same distribution of the data.
[example: Nonidentifiability From Sparse Design]
In fixed-design regression, suppose the design points are $t_i=i/n$ and the observations have the form
\begin{align*}
Y_i=g(t_i)+\varepsilon_i,\qquad i=1,\dots,n,
\end{align*}
where the noise vector $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$ has a fixed joint law $\nu$ on $\mathbb R^n$. We show that the point target $g(x_0)$ is not identifiable when $x_0$ is not one of the design points. Let $g_1$ and $g_2$ be two regression functions satisfying
\begin{align*}
g_1(t_i)=g_2(t_i)\qquad\text{for every }i=1,\dots,n,
\end{align*}
but assume that
\begin{align*}
g_1(x_0)\ne g_2(x_0)
\end{align*}
for some point $x_0\notin\{t_1,\dots,t_n\}$.
Under $g_j$, the observed vector is
\begin{align*}
Y^{(j)}=(g_j(t_1)+\varepsilon_1,\dots,g_j(t_n)+\varepsilon_n),\qquad j=1,2.
\end{align*}
For each coordinate $i\in\{1,\dots,n\}$, the equality $g_1(t_i)=g_2(t_i)$ gives
\begin{align*}
Y_i^{(1)}=g_1(t_i)+\varepsilon_i=g_2(t_i)+\varepsilon_i=Y_i^{(2)}.
\end{align*}
Since this holds for every coordinate, the two vectors agree in each coordinate from the first through the $n$th. Therefore
\begin{align*}
Y^{(1)}=Y^{(2)}
\end{align*}
as random vectors built from the same noise vector $\varepsilon$. Equal random vectors have the same joint law, so
\begin{align*}
P_{g_1,\nu}=P_{g_2,\nu}.
\end{align*}
However, the target values at $x_0$ are different:
\begin{align*}
\psi(g_1,\nu)=g_1(x_0)\ne g_2(x_0)=\psi(g_2,\nu).
\end{align*}
The same observation law therefore corresponds to two different values of the target. Hence the point value $g(x_0)$ is not identifiable from the fixed-design experiment unless the model adds assumptions, such as smoothness or shape constraints, that connect unobserved points to the observed design values.
[/example]
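A small simulation makes the argument concrete. In this sketch the functions `g1` and `g2`, the choice $n=5$, the Gaussian noise, and the point $x_0=0.5$ are all illustrative assumptions; `g2` differs from `g1` by a polynomial that vanishes at every design point.

```python
import random

n = 5
design = [i / n for i in range(1, n + 1)]  # t_i = i / n


def g1(t):
    return t


def g2(t):
    # g1 plus a polynomial that vanishes at every design point.
    bump = 1.0
    for t_i in design:
        bump *= t - t_i
    return t + bump


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in design]  # one shared noise vector
y1 = [g1(t) + e for t, e in zip(design, noise)]
y2 = [g2(t) + e for t, e in zip(design, noise)]

x0 = 0.5  # not a design point when n = 5
same_data = y1 == y2
gap_at_x0 = g2(x0) - g1(x0)
print(same_data, gap_at_x0)
```

The two data vectors coincide exactly, while the target values at $x_0$ differ, so no function of the data can recover $g(x_0)$ without additional assumptions.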
The sparse-design example shows that an estimand must be both meaningful under the model and determined by the law of the data. Once this condition is in place, the next decision is how errors are measured. This motivates the definition of a loss function, which turns estimation error into a mathematical quantity.
[definition: Loss Function]
Let $\Psi$ be the target space. A loss function is a map $L:\Psi\times\Psi\to[0,\infty]$, where $L(a,b)$ is the loss incurred by reporting $a$ when the true target is $b$.
[/definition]
Common nonparametric losses compare functions pointwise, in an integral norm, or through a supremum norm. These choices lead to different risks and often different optimal estimators.
[example: Losses For Density Estimation]
For a density estimator $\hat f_n$ of a density $f$ on $[0,1]$, the loss depends on which feature of the estimation error is being measured. At a fixed point $x_0\in[0,1]$, the pointwise error is the single number
\begin{align*}
\hat f_n(x_0)-f(x_0).
\end{align*}
Squaring this number gives the pointwise squared loss
\begin{align*}
L_{x_0}(\hat f_n,f)=\bigl(\hat f_n(x_0)-f(x_0)\bigr)^2.
\end{align*}
If $\hat f_n(x_0)=f(x_0)$, then substituting this equality into the definition gives
\begin{align*}
L_{x_0}(\hat f_n,f)=\bigl(f(x_0)-f(x_0)\bigr)^2=0^2=0.
\end{align*}
Thus pointwise squared loss can be zero even when $\hat f_n(x)\ne f(x)$ for points $x\ne x_0$, because the definition only uses the value at $x_0$.
Integrated squared loss instead squares the pointwise error at every location and integrates those squared errors with respect to Lebesgue measure:
\begin{align*}
L_2(\hat f_n,f)=\int_0^1\bigl(\hat f_n(x)-f(x)\bigr)^2\,d\mathcal L^1(x).
\end{align*}
For every $x\in[0,1]$, the square of a real number is nonnegative, so
\begin{align*}
\bigl(\hat f_n(x)-f(x)\bigr)^2\ge 0.
\end{align*}
The integral of a nonnegative measurable function is nonnegative, hence
\begin{align*}
L_2(\hat f_n,f)\ge 0.
\end{align*}
If $A\subset[0,1]$ is a measurable set, then the part of the integrated squared loss coming from $A$ is
\begin{align*}
\int_A\bigl(\hat f_n(x)-f(x)\bigr)^2\,d\mathcal L^1(x).
\end{align*}
Errors on a small-measure set therefore contribute through their integral over that set, so integrated squared loss measures global average squared accuracy rather than the largest pointwise discrepancy.
Uniform loss records the largest absolute pointwise error:
\begin{align*}
L_\infty(\hat f_n,f)=\sup_{x\in[0,1]}\left|\hat f_n(x)-f(x)\right|.
\end{align*}
For a fixed $x\in[0,1]$, the quantity
\begin{align*}
\left|\hat f_n(x)-f(x)\right|
\end{align*}
is one of the values over which the supremum is taken. By the defining upper-bound property of the supremum,
\begin{align*}
\left|\hat f_n(x)-f(x)\right|\le \sup_{u\in[0,1]}\left|\hat f_n(u)-f(u)\right|.
\end{align*}
Substituting the definition of $L_\infty$ gives
\begin{align*}
\left|\hat f_n(x)-f(x)\right|\le L_\infty(\hat f_n,f)
\end{align*}
for every $x\in[0,1]$. Thus pointwise squared loss asks for local accuracy at $x_0$, integrated squared loss asks for average curve accuracy over $[0,1]$, and uniform loss asks for worst-case curve accuracy over the whole interval.
[/example]
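The three losses can disagree sharply on the same pair of functions. The sketch below takes the uniform density as the truth and the hypothetical estimate $\hat f(x)=\tfrac12+x$, which is itself a density on $[0,1]$; both choices and all function names are assumptions made for illustration. The integral is approximated by a midpoint rule and the supremum by a maximum over a grid.

```python
def f_true(x):
    return 1.0  # uniform density on [0, 1]


def f_hat(x):
    return 0.5 + x  # hypothetical estimate; also a density on [0, 1]


def pointwise_sq_loss(x0):
    return (f_hat(x0) - f_true(x0)) ** 2


def integrated_sq_loss(num_cells=1000):
    # Midpoint rule for the integral of the squared error over [0, 1].
    mids = [(k + 0.5) / num_cells for k in range(num_cells)]
    return sum((f_hat(x) - f_true(x)) ** 2 for x in mids) / num_cells


def uniform_loss(num_cells=1000):
    # Maximum over a grid; in general only a lower bound for the supremum.
    grid = [k / num_cells for k in range(num_cells + 1)]
    return max(abs(f_hat(x) - f_true(x)) for x in grid)


loss_point = pointwise_sq_loss(0.5)
loss_l2 = integrated_sq_loss()
loss_sup = uniform_loss()
print(loss_point, loss_l2, loss_sup)
```

Here the pointwise loss at $x_0=0.5$ is exactly zero, the integrated squared loss is close to $\int_0^1(x-\tfrac12)^2\,dx=1/12$, and the uniform loss is $1/2$, attained at the endpoints.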
The example shows that loss records the size and type of error but not its probability distribution. Since the estimator is random, we must average the chosen loss under each possible data-generating distribution. This motivates the risk function of a procedure.
[definition: Risk]
Let $\hat\psi_n:\mathcal X^n\to\Psi$ be a measurable estimator of an identifiable target $\psi(P)$ in a model $\mathcal P$. The risk of $\hat\psi_n$ at $P\in\mathcal P$ under loss $L$ is
\begin{align*}
R(P,\hat\psi_n)=\mathbb E_P[L(\hat\psi_n,\psi(P))].
\end{align*}
[/definition]
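Risk can be approximated by simulation whenever the law $P$ can be sampled. As a hedged illustration, the sketch below takes $P$ to be the uniform law on $[0,1]$, the target $F(x)=x$ at a fixed threshold, the empirical fraction as estimator, and squared loss; these choices, the seed, and the function names are assumptions. In this case $nF_n(x)$ is binomial, so the exact risk is $F(x)(1-F(x))/n$ and serves as a check.

```python
import random


def ecdf_at(sample, x):
    return sum(1 for value in sample if value <= x) / len(sample)


def monte_carlo_risk(n, x, reps, seed=0):
    """Approximate E_P[(F_n(x) - F(x))^2] when P is Uniform(0, 1), so F(x) = x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        total += (ecdf_at(sample, x) - x) ** 2
    return total / reps


n, x = 50, 0.3
risk_mc = monte_carlo_risk(n, x, reps=10000)
risk_exact = x * (1.0 - x) / n  # variance of a binomial proportion
print(risk_mc, risk_exact)
```

The simulated value fluctuates around the exact risk $0.0042$; increasing `reps` reduces the Monte Carlo error but does not change the risk itself.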
Risk gives a numerical comparison after a target and loss have been fixed. A common construction now suggests itself: estimate the whole law first, then evaluate the target functional on that estimated law.
[definition: Plug-In Estimator]
Let $\psi:\mathcal P\to\Psi$ be a target functional. If $\hat P_n:\mathcal X^n\to\mathcal P$ is a measurable estimator of $P$, the corresponding plug-in estimator is $\psi(\hat P_n)$, whenever this expression is defined.
[/definition]
The empirical measure from Chapter 0 will supply the main example in Chapter 2: replace $P$ by
\begin{align*}
P_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i},
\end{align*}
then compute the same functional. For density estimation, direct plug-in with the empirical measure fails because $P_n$ is discrete, so smoothing is needed before applying a density functional. Before we compare such procedures, the next theorem isolates a general convexity principle: when the action space and loss are convex, averaging a randomized report can only improve risk.
[quotetheorem:6296]
[citeproof:6296]
This theorem is a first structural simplification: under squared loss or integrated squared loss on a linear function space, it is enough to study non-randomized estimators even when randomized estimators are allowed. The convexity hypotheses are essential. If the report space is restricted to a non-convex set, such as the label set $\{-1,1\}$, a randomized rule that reports each label with probability $1/2$ has conditional mean $0$, which is not a permitted label. A non-convex loss can also punish averaging: for the loss $\ell(a)=\mathbb{1}_{\{a=0\}}$ on reports in $[-1,1]$, randomizing equally between $-1$ and $1$ has loss $0$ almost surely, while replacing the report by its mean $0$ gives loss $1$. The result also does not identify the best estimator or prove consistency; it only says that a particular source of randomization is unnecessary under convex risk comparisons. Later projection and least-squares estimators use the same idea in a stronger form: convexity turns averaging and projection into risk-reducing operations.
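The contrast between convex and non-convex losses can be checked on the two-point randomization described above. This is a minimal sketch: the target value $0.3$ used for squared loss is an arbitrary assumption, and the function names are hypothetical.

```python
reports = [(-1.0, 0.5), (1.0, 0.5)]  # (reported value, probability)
mean_report = sum(a * p for a, p in reports)


def expected_loss(loss):
    """Expected loss of the randomized report."""
    return sum(p * loss(a) for a, p in reports)


def squared_loss(a, target=0.3):  # target value is an arbitrary illustration
    return (a - target) ** 2


def indicator_loss(a):
    return 1.0 if a == 0.0 else 0.0  # non-convex loss from the text


convex_randomized = expected_loss(squared_loss)
convex_averaged = squared_loss(mean_report)
nonconvex_randomized = expected_loss(indicator_loss)
nonconvex_averaged = indicator_loss(mean_report)
print(convex_randomized, convex_averaged)
print(nonconvex_randomized, nonconvex_averaged)
```

Under squared loss the averaged report has smaller loss than the randomized one, as Jensen's inequality predicts. Under the indicator loss the ordering is reversed, because the mean report $0$ is the only point that is penalized.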
## Minimax Risk and Two-Point Lower Bounds
Risk at a single distribution does not capture the difficulty of a model class. Nonparametric statistics usually asks for procedures that perform well uniformly over a class, and the minimax risk measures the best achievable worst-case performance.
[definition: Minimax Risk]
Let $\mathcal P$ be a model, let $\psi(P)$ be an identifiable target, and let $L$ be a loss. The minimax risk over $\mathcal P$ is
\begin{align*}
\mathcal R_n(\mathcal P,L,\psi)=\inf_{\hat\psi_n}\sup_{P\in\mathcal P}\mathbb E_P[L(\hat\psi_n,\psi(P))],
\end{align*}
where the infimum is over all estimators based on the sample.
[/definition]
The minimax viewpoint turns estimation into a game: nature chooses the hardest distribution in the class, while the statistician chooses the estimator. To prove that a proposed rate is unavoidable, we need a lower-bound method that applies to every estimator. The first such method embeds a two-hypothesis testing problem inside the estimation problem.
[quotetheorem:6297]
[citeproof:6297]
The bound is useful when $P_0$ and $P_1$ are hard to distinguish but have separated targets. Both hypotheses matter. Without target separation, for instance if $\psi(P_0)=\psi(P_1)$, the testing problem may be difficult but it gives no lower bound on estimation error. The metric assumption is also doing real work: the triangle inequality turns the separation $d(\psi(P_0),\psi(P_1))\ge 2s$ into the conclusion that an estimate cannot lie within distance $s$ of both targets at once. A concrete failure occurs if the loss is not tied to the same metric: take $\Psi=\mathbb R$, $P_0\ne P_1$, $\psi(P_0)=0$, $\psi(P_1)=2s$, and loss $L(a,b)=\mathbb{1}_{\{a\ne b\}}$ for reports restricted to $\{0,2s\}$. The metric separation is present, but the displayed lower bound for expected metric loss is not a statement about this classification loss. The total variation testing bound is the bridge from geometry to probability, because it gives the sharp lower bound on the sum of the two simple-hypothesis testing errors. Without the product structure, the total variation term would have to be computed for the actual joint laws of the observations rather than for $P_0^{\otimes n}$ and $P_1^{\otimes n}$. The method also has a built-in limitation: it only uses two alternatives, so it may miss difficulties caused by many small perturbations spread across the parameter space. Later packing and testing methods refine this idea by embedding larger finite hypothesis classes.
[example: Two Densities With Separated Point Values]
Fix an interior point $x_0\in(0,1)$, and choose $h>0$ such that $[x_0-h,x_0+h]\subset[0,1]$. Let $\varphi$ be a smooth function supported on $[-1,1]$ satisfying
\begin{align*}
\int_{-1}^1 \varphi(u)\,d\mathcal L^1(u)=0,\qquad \varphi(0)=1,\qquad \|\varphi\|_\infty<\infty.
\end{align*}
For a constant $c>0$, define
\begin{align*}
f_0(x)=1,\qquad f_1(x)=1+c h^\alpha \varphi\left(\frac{x-x_0}{h}\right),\qquad x\in[0,1].
\end{align*}
If $c h^\alpha\|\varphi\|_\infty\le 1$, then for every $x\in[0,1]$,
\begin{align*}
\varphi\left(\frac{x-x_0}{h}\right)\ge -\left|\varphi\left(\frac{x-x_0}{h}\right)\right|\ge -\|\varphi\|_\infty.
\end{align*}
Multiplying by $c h^\alpha\ge0$ and adding $1$ gives
\begin{align*}
f_1(x)=1+c h^\alpha \varphi\left(\frac{x-x_0}{h}\right)\ge 1-c h^\alpha\|\varphi\|_\infty\ge 0.
\end{align*}
The support condition gives $\varphi((x-x_0)/h)=0$ unless $(x-x_0)/h\in[-1,1]$, equivalently unless $x\in[x_0-h,x_0+h]$. Since $[x_0-h,x_0+h]\subset[0,1]$,
\begin{align*}
\int_0^1 \varphi\left(\frac{x-x_0}{h}\right)\,d\mathcal L^1(x)=\int_{x_0-h}^{x_0+h}\varphi\left(\frac{x-x_0}{h}\right)\,d\mathcal L^1(x).
\end{align*}
With the substitution $u=(x-x_0)/h$, the endpoints $x=x_0-h$ and $x=x_0+h$ become $u=-1$ and $u=1$, and $d\mathcal L^1(x)=h\,d\mathcal L^1(u)$. Therefore
\begin{align*}
\int_{x_0-h}^{x_0+h}\varphi\left(\frac{x-x_0}{h}\right)\,d\mathcal L^1(x)=h\int_{-1}^1\varphi(u)\,d\mathcal L^1(u)=h\cdot0=0.
\end{align*}
Using linearity of the integral,
\begin{align*}
\int_0^1 f_1(x)\,d\mathcal L^1(x)=\int_0^1 1\,d\mathcal L^1(x)+c h^\alpha\int_0^1 \varphi\left(\frac{x-x_0}{h}\right)\,d\mathcal L^1(x).
\end{align*}
Substituting the preceding integral value gives
\begin{align*}
\int_0^1 f_1(x)\,d\mathcal L^1(x)=\mathcal L^1([0,1])+c h^\alpha\cdot0=1.
\end{align*}
Thus $f_1$ is nonnegative and integrates to $1$, so it is a density on $[0,1]$. At the target point $x_0$,
\begin{align*}
f_1(x_0)-f_0(x_0)=\left(1+c h^\alpha\varphi\left(\frac{x_0-x_0}{h}\right)\right)-1=c h^\alpha\varphi(0)=c h^\alpha.
\end{align*}
The size of the perturbation relative to $f_0$ is measured by the chi-squared divergence
\begin{align*}
\chi^2(P_{f_1},P_{f_0})=\int_0^1 \left(\frac{f_1(x)}{f_0(x)}-1\right)^2 f_0(x)\,d\mathcal L^1(x).
\end{align*}
Since $f_0(x)=1$ for every $x\in[0,1]$,
\begin{align*}
\frac{f_1(x)}{f_0(x)}-1=f_1(x)-1=c h^\alpha \varphi\left(\frac{x-x_0}{h}\right).
\end{align*}
Substituting this identity into the chi-squared integral gives
\begin{align*}
\chi^2(P_{f_1},P_{f_0})=\int_0^1 \left(c h^\alpha \varphi\left(\frac{x-x_0}{h}\right)\right)^2\,d\mathcal L^1(x).
\end{align*}
Since $c$ and $h^\alpha$ do not depend on $x$,
\begin{align*}
\chi^2(P_{f_1},P_{f_0})=c^2h^{2\alpha}\int_0^1 \varphi\left(\frac{x-x_0}{h}\right)^2\,d\mathcal L^1(x).
\end{align*}
Again the support condition reduces the integral to $[x_0-h,x_0+h]$, and the same substitution $u=(x-x_0)/h$ gives
\begin{align*}
\int_0^1 \varphi\left(\frac{x-x_0}{h}\right)^2\,d\mathcal L^1(x)=h\int_{-1}^1\varphi(u)^2\,d\mathcal L^1(u).
\end{align*}
Writing
\begin{align*}
C_\varphi=\int_{-1}^1\varphi(u)^2\,d\mathcal L^1(u),
\end{align*}
we obtain
\begin{align*}
\chi^2(P_{f_1},P_{f_0})=C_\varphi c^2h^{2\alpha+1}.
\end{align*}
For $n$ i.i.d. observations, the likelihood ratio of $P_{f_1}^{\otimes n}$ with respect to $P_{f_0}^{\otimes n}$ is
\begin{align*}
\prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}.
\end{align*}
By the definition of chi-squared divergence,
\begin{align*}
1+\chi^2(P_{f_1}^{\otimes n},P_{f_0}^{\otimes n})=\mathbb E_{f_0}^{\otimes n}\left[\left(\prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}\right)^2\right].
\end{align*}
The square of the product is the product of the squares, so
\begin{align*}
\left(\prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}\right)^2=\prod_{i=1}^n \left(\frac{f_1(X_i)}{f_0(X_i)}\right)^2.
\end{align*}
Independence under $P_{f_0}^{\otimes n}$ factors the expectation:
\begin{align*}
\mathbb E_{f_0}^{\otimes n}\left[\prod_{i=1}^n \left(\frac{f_1(X_i)}{f_0(X_i)}\right)^2\right]=\prod_{i=1}^n \mathbb E_{f_0}\left[\left(\frac{f_1(X_i)}{f_0(X_i)}\right)^2\right].
\end{align*}
Each factor has the same value, and the one-observation chi-squared identity gives
\begin{align*}
\mathbb E_{f_0}\left[\left(\frac{f_1(X_i)}{f_0(X_i)}\right)^2\right]=1+\chi^2(P_{f_1},P_{f_0}).
\end{align*}
Therefore
\begin{align*}
1+\chi^2(P_{f_1}^{\otimes n},P_{f_0}^{\otimes n})=\left(1+\chi^2(P_{f_1},P_{f_0})\right)^n.
\end{align*}
Equivalently,
\begin{align*}
\chi^2(P_{f_1}^{\otimes n},P_{f_0}^{\otimes n})=\left(1+\chi^2(P_{f_1},P_{f_0})\right)^n-1.
\end{align*}
Substituting the one-observation value yields
\begin{align*}
\chi^2(P_{f_1}^{\otimes n},P_{f_0}^{\otimes n})=\left(1+C_\varphi c^2h^{2\alpha+1}\right)^n-1.
\end{align*}
Let $t=C_\varphi c^2h^{2\alpha+1}$. Since $t\ge0$ and $1+t\le e^t$,
\begin{align*}
\left(1+C_\varphi c^2h^{2\alpha+1}\right)^n\le \left(e^{C_\varphi c^2h^{2\alpha+1}}\right)^n=\exp\left(nC_\varphi c^2h^{2\alpha+1}\right).
\end{align*}
Hence
\begin{align*}
\chi^2(P_{f_1}^{\otimes n},P_{f_0}^{\otimes n})\le \exp\left(nC_\varphi c^2h^{2\alpha+1}\right)-1.
\end{align*}
So the two $n$-sample experiments remain close in this chi-squared sense when $n h^{2\alpha+1}$ is bounded. Taking
\begin{align*}
n h^{2\alpha+1}\asymp 1
\end{align*}
is the same scale relation as
\begin{align*}
h^{2\alpha+1}\asymp n^{-1}.
\end{align*}
Raising both sides to the power $1/(2\alpha+1)$ gives
\begin{align*}
h\asymp n^{-1/(2\alpha+1)}.
\end{align*}
At this bandwidth scale, the target separation satisfies
\begin{align*}
|f_1(x_0)-f_0(x_0)|=c h^\alpha\asymp c\left(n^{-1/(2\alpha+1)}\right)^\alpha=c n^{-\alpha/(2\alpha+1)}.
\end{align*}
Since $c$ is fixed, this is
\begin{align*}
|f_1(x_0)-f_0(x_0)|\asymp n^{-\alpha/(2\alpha+1)}.
\end{align*}
Thus a perturbation of width $h$ and height $h^\alpha$ changes the point value at $x_0$ by order $h^\alpha$, while the $n$-sample laws stay close when $h$ is chosen on the scale $n^{-1/(2\alpha+1)}$. This is the two-point lower-bound heuristic behind the standard pointwise density-estimation rate.
[/example]
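The chi-squared identities in this example can be checked numerically. The sketch below uses the hypothetical bump $\varphi(u)=(1-u^2)^2(1-7u^2)$ on $[-1,1]$, which has integral zero and $\varphi(0)=1$ but is only once continuously differentiable at $\pm1$, so it stands in for the smooth $\varphi$ of the example for computational purposes only. The values of $x_0$, $h$, $\alpha$, $c$, and $n$ are likewise assumptions, and integrals are approximated by a midpoint rule.

```python
import math


def phi(u):
    """Hypothetical bump on [-1, 1] with integral 0 and phi(0) = 1."""
    if abs(u) > 1.0:
        return 0.0
    return (1.0 - u * u) ** 2 * (1.0 - 7.0 * u * u)


def midpoint_integral(func, a, b, cells):
    width = (b - a) / cells
    return width * sum(func(a + (k + 0.5) * width) for k in range(cells))


x0, h, alpha, c, n = 0.5, 0.1, 1.0, 0.5, 1000


def f1(x):
    return 1.0 + c * h**alpha * phi((x - x0) / h)


mass = midpoint_integral(f1, 0.0, 1.0, 20000)
# f_0 = 1, so the chi-squared integrand is (f_1 - 1)^2.
chi2_one = midpoint_integral(lambda x: (f1(x) - 1.0) ** 2, 0.0, 1.0, 20000)
C_phi = midpoint_integral(lambda u: phi(u) ** 2, -1.0, 1.0, 4000)
chi2_formula = C_phi * c**2 * h ** (2 * alpha + 1)
chi2_n = (1.0 + chi2_one) ** n - 1.0  # product formula
chi2_n_bound = math.exp(n * chi2_formula) - 1.0  # exponential upper bound
print(mass, chi2_one, chi2_formula, chi2_n, chi2_n_bound)
```

The one-observation divergence agrees with $C_\varphi c^2h^{2\alpha+1}$ up to quadrature error, and the $n$-sample divergence stays below the exponential bound. With these values $nh^{2\alpha+1}=1$, so the two experiments remain close in the chi-squared sense.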
A lower bound alone is not a full theory, but it identifies the quantities that matter. We now introduce the assumptions that make those quantities calculable.
## Smoothness and Shape Classes
Why can an infinite-dimensional model be estimable at all? The answer is that the class must forbid arbitrarily rough behaviour. Smoothness and shape restrictions connect nearby points of a function, reducing the effective complexity of the model.
[definition: Holder Class]
Let $0<\alpha\le 1$ and $L>0$. The Holder class, written here as $\mathcal H(\alpha,L)$ and later as $\mathcal H^s(L;\mathcal X)$ when the domain and smoothness order are both displayed, is the set of functions $f:[0,1]\to\mathbb R$ such that
\begin{align*}
|f(x)-f(y)|\le L|x-y|^\alpha
\end{align*}
for all $x,y\in[0,1]$.
[/definition]
Holder smoothness controls local oscillation by a power law. Many estimators use local polynomial approximations, so first-order oscillation control is not enough when the model is assumed to have several derivatives. This motivates a higher-order Holder class, where the same condition is placed on the highest relevant derivative.
[definition: Higher Order Holder Class]
Let $\beta>0$, let $L>0$, and write $m=\lfloor\beta\rfloor$. If $\beta\notin\mathbb N$, the class $\mathcal H(\beta,L)$ on $[0,1]$ consists of functions $f:[0,1]\to\mathbb R$ with $m$ continuous derivatives such that $f^{(m)}\in\mathcal H(\beta-m,L)$. If $\beta=m\in\mathbb N$, these notes use the convention that $\mathcal H(m,L)$ consists of functions $f:[0,1]\to\mathbb R$ with $m-1$ continuous derivatives such that $f^{(m-1)}$ is Lipschitz with constant $L$.
[/definition]
These classes support Taylor expansion arguments for bias. Kernel and local polynomial estimators will later exploit exactly this local polynomial structure.
The notation $\mathcal H(\beta,L)$ in this chapter records local smoothness through the Holder or Lipschitz constant of the highest derivative specified above. When a compact Holder ball is needed for minimax statements, we will state the additional radius explicitly, for example by also requiring bounds on $\|f\|_\infty$ and the lower derivatives. This distinction prevents the smoothness constant from being mistaken for a full norm bound.
[example: Bias Scale Under Holder Smoothness]
Suppose $f\in\mathcal H(\alpha,L)$ with $0<\alpha\le1$, and let the averaging window around an interior point $x_0$ be
\begin{align*}
W_h=\{x\in[0,1]:|x-x_0|\le h\}.
\end{align*}
For every $x\in W_h$, the Holder condition gives
\begin{align*}
|f(x)-f(x_0)|\le L|x-x_0|^\alpha.
\end{align*}
Since $x\in W_h$ means $|x-x_0|\le h$, and $r\mapsto r^\alpha$ is nondecreasing on $[0,\infty)$ for $0<\alpha\le1$, we have
\begin{align*}
|x-x_0|^\alpha\le h^\alpha.
\end{align*}
Therefore
\begin{align*}
|f(x)-f(x_0)|\le Lh^\alpha.
\end{align*}
If $x_1,\dots,x_m\in W_h$, the deterministic error of the unweighted local average is
\begin{align*}
\frac{1}{m}\sum_{j=1}^m f(x_j)-f(x_0)=\frac{1}{m}\sum_{j=1}^m f(x_j)-\frac{1}{m}\sum_{j=1}^m f(x_0).
\end{align*}
Because $f(x_0)$ does not depend on $j$,
\begin{align*}
\frac{1}{m}\sum_{j=1}^m f(x_0)=\frac{m}{m}f(x_0)=f(x_0).
\end{align*}
Hence
\begin{align*}
\frac{1}{m}\sum_{j=1}^m f(x_j)-f(x_0)=\frac{1}{m}\sum_{j=1}^m\bigl(f(x_j)-f(x_0)\bigr).
\end{align*}
Taking absolute values and using the triangle inequality gives
\begin{align*}
\left|\frac{1}{m}\sum_{j=1}^m f(x_j)-f(x_0)\right|\le \frac{1}{m}\sum_{j=1}^m |f(x_j)-f(x_0)|.
\end{align*}
Since each $x_j$ lies in $W_h$, the bound $|f(x_j)-f(x_0)|\le Lh^\alpha$ applies to every summand, so
\begin{align*}
\frac{1}{m}\sum_{j=1}^m |f(x_j)-f(x_0)|\le \frac{1}{m}\sum_{j=1}^m Lh^\alpha.
\end{align*}
The summand $Lh^\alpha$ is constant in $j$, and therefore
\begin{align*}
\frac{1}{m}\sum_{j=1}^m Lh^\alpha=\frac{mLh^\alpha}{m}=Lh^\alpha.
\end{align*}
Thus local averaging over a window of radius $h$ has deterministic bias bounded by $Lh^\alpha$.
Now suppose the same window contains $m$ independent noise variables $\varepsilon_1,\dots,\varepsilon_m$ with variances of constant order. Concretely, assume that there are constants $0<a\le b<\infty$ such that
\begin{align*}
a\le \operatorname{Var}(\varepsilon_j)\le b
\end{align*}
for $j=1,\dots,m$. For the averaged noise,
\begin{align*}
\operatorname{Var}\left(\frac{1}{m}\sum_{j=1}^m\varepsilon_j\right)=\frac{1}{m^2}\operatorname{Var}\left(\sum_{j=1}^m\varepsilon_j\right).
\end{align*}
Expanding the variance of the sum gives
\begin{align*}
\operatorname{Var}\left(\sum_{j=1}^m\varepsilon_j\right)=\sum_{j=1}^m\operatorname{Var}(\varepsilon_j)+2\sum_{1\le i<j\le m}\operatorname{Cov}(\varepsilon_i,\varepsilon_j).
\end{align*}
Independence gives $\operatorname{Cov}(\varepsilon_i,\varepsilon_j)=0$ for $i\ne j$, so
\begin{align*}
\operatorname{Var}\left(\frac{1}{m}\sum_{j=1}^m\varepsilon_j\right)=\frac{1}{m^2}\sum_{j=1}^m\operatorname{Var}(\varepsilon_j).
\end{align*}
The variance bounds imply
\begin{align*}
ma\le \sum_{j=1}^m\operatorname{Var}(\varepsilon_j)\le mb.
\end{align*}
Dividing by $m^2$ gives
\begin{align*}
\frac{a}{m}\le \operatorname{Var}\left(\frac{1}{m}\sum_{j=1}^m\varepsilon_j\right)\le \frac{b}{m}.
\end{align*}
Thus the noise variance is of order $m^{-1}$, and the corresponding standard deviation is of order $m^{-1/2}$.
If the effective number of observations in the window satisfies $m\asymp nh$, then there are constants $0<c_1\le c_2<\infty$ such that
\begin{align*}
c_1nh\le m\le c_2nh.
\end{align*}
Taking reciprocal square roots reverses the inequalities:
\begin{align*}
\frac{1}{\sqrt{c_2nh}}\le \frac{1}{\sqrt m}\le \frac{1}{\sqrt{c_1nh}}.
\end{align*}
Hence
\begin{align*}
m^{-1/2}\asymp (nh)^{-1/2}.
\end{align*}
Balancing the deterministic bias scale and stochastic standard deviation means setting
\begin{align*}
h^\alpha\asymp(nh)^{-1/2}.
\end{align*}
Squaring both sides gives
\begin{align*}
h^{2\alpha}\asymp \frac{1}{nh}.
\end{align*}
Multiplying by $nh$ gives
\begin{align*}
nh^{2\alpha+1}\asymp 1.
\end{align*}
Equivalently,
\begin{align*}
h^{2\alpha+1}\asymp n^{-1}.
\end{align*}
Taking the power $1/(2\alpha+1)$ yields
\begin{align*}
h\asymp n^{-1/(2\alpha+1)}.
\end{align*}
Holder smoothness therefore turns the window width $h$ into a bias scale $h^\alpha$, while the effective sample size $nh$ turns averaging independent noise into a standard deviation scale $(nh)^{-1/2}$.
[/example]
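The balance $h^\alpha\asymp(nh)^{-1/2}$ can be tabulated for a few sample sizes. In this sketch all constants are set to one, so the relation $nh^{2\alpha+1}=1$ is solved exactly; the value $\alpha=0.5$ and the function name are illustrative assumptions.

```python
def balanced_scales(n, alpha):
    """Return (h, bias scale, sd scale) when n * h**(2 * alpha + 1) = 1."""
    h = n ** (-1.0 / (2.0 * alpha + 1.0))
    return h, h**alpha, (n * h) ** -0.5


alpha = 0.5  # hypothetical smoothness index in (0, 1]
rows = []
for n in (100, 10_000, 1_000_000):
    h, bias_scale, sd_scale = balanced_scales(n, alpha)
    rate = n ** (-alpha / (2.0 * alpha + 1.0))
    rows.append((n, h, bias_scale, sd_scale, rate))
    print(n, h, bias_scale, sd_scale, rate)
```

At the balanced bandwidth the bias scale and the standard deviation scale coincide, and both equal $n^{-\alpha/(2\alpha+1)}$. For $\alpha=0.5$ this is $n^{-1/4}$, so multiplying the sample size by $100$ reduces the error scale only by a factor of about $3.16$.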
Holder classes are pointwise and are well matched to uniform or pointwise loss. For global losses, it is often more natural to measure derivatives in an integral norm, because the risk itself averages errors over the domain. This motivates Sobolev smoothness classes.
[definition: Sobolev Smoothness Class]
Let $k\in\mathbb N$ and $1\le p\le\infty$. The Sobolev class $W^{k,p}([0,1])$ is the set of functions $f\in L^p([0,1])$ whose weak derivatives $D^j f$ exist and belong to $L^p([0,1])$ for $1\le j\le k$.
[/definition]
Sobolev assumptions control average roughness rather than pointwise roughness. Some statistical signals, however, are not smooth in a derivative sense because they contain jumps, while still having limited cumulative oscillation. This motivates bounded variation as a class for functions with finite total movement.
[definition: Bounded Variation Class]
A function $f:[0,1]\to\mathbb R$ has bounded variation if
\begin{align*}
V(f)=\sup\left\{\sum_{i=1}^m |f(x_i)-f(x_{i-1})|:0\le x_0<x_1<\cdots<x_m\le1,\ m\in\mathbb N\right\}<\infty.
\end{align*}
The bounded variation class with radius $M$ is $\{f:V(f)\le M\}$.
[/definition]
Bounded variation permits jumps while ruling out unlimited oscillation. Other models impose qualitative geometric information instead of derivative bounds, especially when the scientific assumption is order, convexity, or diminishing returns. This motivates the two shape constraints used most often in introductory nonparametric regression.
[definition: Monotone and Convex Classes]
The monotone class on $[0,1]$ is the set of nondecreasing functions $f:[0,1]\to\mathbb R$. The convex class on $[0,1]$ is the set of functions $f:[0,1]\to\mathbb R$ satisfying
\begin{align*}
f(tx+(1-t)y)\le t f(x)+(1-t)f(y)
\end{align*}
for all $x,y\in[0,1]$ and $t\in[0,1]$.
[/definition]
Shape constraints are qualitative rather than metric smoothness assumptions. They can produce estimators by constrained optimization, such as isotonic regression for monotone functions and least-squares projection onto convex sequences for convex regression. Returning to metric smoothness, the next calculation connects smoothness assumptions to bandwidth choice and statistical rates.
[illustration:local-averaging-bandwidth]
[explanation: Bias-Variance Bandwidth Balance]
Consider estimation of a function value $f(x_0)$ from $n$ observations using a local average over bandwidth $h$, in an interior region where the window contains about $nh$ observations. This is an organizing calculation rather than a theorem: it assumes a local averaging estimator, an asymptotic regime $n\to\infty$ with $h=h_n\to0$ and $nh_n\to\infty$, a smoothness condition giving bias comparable to $h^\alpha$, and a noise or sampling condition giving variance comparable to $(nh)^{-1}$.
Under those assumptions, the mean squared error is summarized by
\begin{align*}
h^{2\alpha}+\frac{1}{nh},
\end{align*}
where constants and lower-order terms depend on the estimator and model. Balancing the squared-bias and variance terms gives
\begin{align*}
h^{2\alpha}\asymp\frac{1}{nh},
\end{align*}
so the bandwidth scale is $h\asymp n^{-1/(2\alpha+1)}$, and the corresponding mean squared error scale is $n^{-2\alpha/(2\alpha+1)}$.
[/explanation]
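The balance can be checked numerically. The following is a minimal sketch, assuming all constants in the bias and variance terms equal $1$; the function names `mse_proxy`, `balanced_bandwidth`, and `grid_minimizer` are illustrative and not part of the course notation. It compares the bandwidth that equates the two terms with the bandwidth that minimises their sum on a grid.

```python
import numpy as np

def mse_proxy(h, n, alpha):
    """Squared-bias plus variance proxy h^(2*alpha) + 1/(n*h), constants set to 1."""
    return h ** (2 * alpha) + 1.0 / (n * h)

def balanced_bandwidth(n, alpha):
    """Bandwidth scale n^(-1/(2*alpha+1)) obtained by equating the two terms."""
    return n ** (-1.0 / (2 * alpha + 1))

def grid_minimizer(n, alpha, num=20001):
    """Minimise the proxy over a logarithmic grid of bandwidths in [1e-6, 1]."""
    grid = np.logspace(-6, 0, num)
    return grid[np.argmin(mse_proxy(grid, n, alpha))]

alpha = 1.0  # assumed smoothness exponent, for illustration only
ratios = []
for n in (10**3, 10**4, 10**5):
    h_bal = balanced_bandwidth(n, alpha)
    h_min = grid_minimizer(n, alpha)
    ratios.append(h_min / h_bal)
    print(n, h_bal, h_min, mse_proxy(h_bal, n, alpha))
```

Under these assumptions the ratio of the grid minimiser to the balanced bandwidth should stay roughly constant in $n$, near $(2\alpha)^{-1/(2\alpha+1)}$, which is one way to see that balancing and exact minimisation agree up to constants.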
This calculation is an organizing principle, not a universal theorem about every estimator. The bias assumption can fail at a boundary if the window extends outside $[0,1]$ without correction, even when the function is Holder smooth. The variance assumption can fail when the effective number of observations in the window is not of order $nh$, for instance under highly non-uniform design or strongly dependent errors. The balance also addresses pointwise mean squared error for local averaging; it does not by itself prove a minimax theorem, choose constants, or handle adaptive bandwidth selection. Later upper-bound arguments construct estimators that attain such balances under precise assumptions, while lower-bound arguments use Le Cam's method or its refinements to show that no estimator can improve the balance uniformly over the class.
Before tackling minimax optimality and bandwidth selection, we need to understand the basic nonparametric estimator at the foundation of the theory: the empirical distribution function. The next chapter develops the first consistency results and empirical approximation properties that will later be refined through weak convergence and used to construct optimal estimators.
# 2. Empirical Distribution Functions
The first chapter framed nonparametric statistics as inference over large classes of distributions, densities, and regression functions. We now study the empirical distribution function, the basic object that replaces an unknown distribution by the observed sample. Building on the empirical-measure notation $P_n g$ from Chapter 0 and the plug-in viewpoint of Chapter 1, this chapter asks how much information is contained in the empirical measure, how accurately it estimates the true distribution function, and how its generalized inverse leads to nonparametric estimators of quantiles such as medians.
## Empirical Measures and Plug-In Integration
Suppose $X_1,\dots,X_n$ are i.i.d. observations with common distribution $P$ on a measurable space $(E,\mathcal E)$. The central nonparametric question is how to estimate expectations $Pg=\mathbb E[g(X_1)]$ without imposing a finite-dimensional model for $P$. Directly using $P$ is impossible because $P$ is exactly the unknown object; replacing it by a parametric approximation would add assumptions that nonparametric statistics is trying to avoid. The empirical measure answers this obstruction by assigning mass $1/n$ to each observed data point.
[definition: Empirical Measure]
Let $X_1,\dots,X_n$ be random variables taking values in a measurable space $(E,\mathcal E)$. The empirical measure is the random map $P_n:\mathcal E\to[0,1]$ defined by
\begin{align*}
P_n(A)=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_A(X_i), \qquad A\in \mathcal E.
\end{align*}
[/definition]
The definition says that the data induce a discrete probability law, even when the true distribution is continuous. Since statistical functionals are often expectations, the next notation records how empirical measures integrate functions.
[definition: Empirical Integral]
Let $X_1,\dots,X_n$ take values in a measurable space $(E,\mathcal E)$. The empirical integration functional is the map
\begin{align*}
I_n:\{g:E\to\mathbb R \text{ measurable}: \sum_{i=1}^n |g(X_i)|<\infty\}\to\mathbb R
\end{align*}
defined by
\begin{align*}
I_n(g)=P_ng=\int_E g\,dP_n=\frac{1}{n}\sum_{i=1}^n g(X_i).
\end{align*}
[/definition]
Thus plug-in estimation is not a separate trick: it is integration against $P_n$ rather than against $P$. To see why this single operation covers several familiar estimators, it helps to compute it for ordinary functions and for indicator functions.
[example: Plug-In Estimation of a Mean and a Tail Probability]
Let $E=\mathbb R$ and let $X_1,\dots,X_n$ be i.i.d. with distribution $P$. For $g(x)=x$, the definition of empirical integration gives
\begin{align*}
P_ng=\frac{1}{n}\sum_{i=1}^n g(X_i).
\end{align*}
Since $g(X_i)=X_i$ for each $i$, this becomes
\begin{align*}
P_ng=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
Thus the plug-in estimate of
\begin{align*}
Pg=\mathbb E[g(X_1)]=\mathbb E[X_1]
\end{align*}
is exactly the sample mean, provided $X_1$ is integrable.
For a fixed threshold $t$, take $g(x)=\mathbb{1}_{(t,\infty)}(x)$. The empirical integral is
\begin{align*}
P_ng=\frac{1}{n}\sum_{i=1}^n g(X_i).
\end{align*}
Substituting the indicator function gives
\begin{align*}
P_ng=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{(t,\infty)}(X_i).
\end{align*}
For each observation, $\mathbb{1}_{(t,\infty)}(X_i)=\mathbb{1}_{\{X_i>t\}}$, so
\begin{align*}
P_ng=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i>t\}}.
\end{align*}
By the definition of the empirical measure applied to the set $(t,\infty)$,
\begin{align*}
P_n((t,\infty))=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i>t\}}.
\end{align*}
Each summand equals $1$ exactly when $X_i>t$ and equals $0$ otherwise, so the sum counts observations above $t$ and division by $n$ gives the observed tail fraction. The same empirical integral therefore produces both sample averages and empirical probabilities; only the [test function](/page/Test%20Function) has changed.
[/example]
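The two computations in this example can be sketched in a few lines. This is an illustrative sketch, assuming a small hand-picked sample and threshold; the helper name `empirical_integral` is hypothetical and simply evaluates $P_ng$ for a vectorised test function.

```python
import numpy as np

def empirical_integral(g, sample):
    """Return P_n g = (1/n) * sum_i g(X_i) for a vectorised function g."""
    sample = np.asarray(sample, dtype=float)
    return float(np.mean(g(sample)))

# Hypothetical data and threshold, chosen only for illustration.
sample = np.array([0.2, 1.5, -0.7, 3.1, 0.9])
t = 1.0

# g(x) = x gives the sample mean.
sample_mean = empirical_integral(lambda x: x, sample)

# g(x) = 1{x > t} gives the observed tail fraction P_n((t, infinity)).
tail_fraction = empirical_integral(lambda x: (x > t).astype(float), sample)

print(sample_mean, tail_fraction)
```

Only the test function changes between the two calls, which mirrors the point of the example.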
The example shows that indicators of sets extract empirical probabilities from $P_n$. For real-valued data, the most important such sets are half-lines, because their probabilities determine the whole distribution; this motivates turning $P_n$ into a cumulative function.
[definition: Empirical Distribution Function]
Let $X_1,\dots,X_n$ be real-valued random variables. The empirical distribution function is the random function $F_n:\mathbb R\to[0,1]$ defined by
\begin{align*}
x\mapsto F_n(x)=P_n((-\infty,x])=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i\le x\}}.
\end{align*}
[/definition]
The empirical distribution function is a step function that jumps by $1/n$ at each observed value, with larger jumps at ties. This discreteness is not a cosmetic detail: even when $F$ is continuous, $F_n$ is discontinuous, so uniform closeness of $F_n$ to $F$ cannot be deduced from pointwise smoothness arguments. Since this picture is the easiest way to remember the object, the next example spells out its exact shape for a simulated uniform sample and identifies why the largest vertical gap is the relevant error.
[example: Simulating the Empirical CDF]
Take $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}\operatorname{Unif}(0,1)$ and write the ordered observations as $X_{(1)}\le \dots\le X_{(n)}$. Since the uniform distribution is continuous, the event $X_{(1)}<\dots<X_{(n)}$ has probability one, and on this event the empirical CDF can be computed by counting how many observations are at most $x$:
\begin{align*}
F_n(x)
&=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i\le x\}}.
\end{align*}
If $x<X_{(1)}$, then $X_i>x$ for every $i$, so each indicator is $0$ and
\begin{align*}
F_n(x)
&=\frac{1}{n}\sum_{i=1}^n 0
=0.
\end{align*}
If $X_{(k)}\le x<X_{(k+1)}$ for some $1\le k\le n-1$, then exactly the observations $X_{(1)},\dots,X_{(k)}$ are at most $x$, so
\begin{align*}
F_n(x)
&=\frac{1}{n}\left(\sum_{j=1}^k 1+\sum_{j=k+1}^n 0\right)
=\frac{k}{n}.
\end{align*}
If $x\ge X_{(n)}$, then every observation is at most $x$, and therefore
\begin{align*}
F_n(x)
&=\frac{1}{n}\sum_{i=1}^n 1
=1.
\end{align*}
For the population distribution, $F(x)=x$ on $0\le x\le 1$, with $F(x)=0$ for $x<0$ and $F(x)=1$ for $x\ge 1$. Thus a simulation plots a random step function against the deterministic diagonal on $[0,1]$, and the largest vertical discrepancy is
\begin{align*}
\sup_{x\in\mathbb R}|F_n(x)-F(x)|.
\end{align*}
This is the statistic controlled by the uniform convergence and finite-sample bounds below.
[/example]
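A simulation along these lines might look as follows. This is a sketch under the assumptions of the example (uniform data, no ties); the names `ecdf` and `sup_distance_uniform` are illustrative. Because $F_n$ is a step function and $F(x)=x$ is increasing on $[0,1]$, the supremum is attained at an order statistic, either just at the jump or just before it.

```python
import numpy as np

def ecdf(sample):
    """Return a function x -> F_n(x) = (1/n) * #{i : X_i <= x}."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size

    def F_n(x):
        # side="right" counts the observations that are <= x.
        return np.searchsorted(ordered, x, side="right") / n

    return F_n

def sup_distance_uniform(sample):
    """sup_x |F_n(x) - x| for data in (0, 1), evaluated at the jumps of F_n."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    i = np.arange(1, n + 1)
    above = np.max(i / n - ordered)        # F_n(X_(i)) - F(X_(i))
    below = np.max(ordered - (i - 1) / n)  # F(X_(i)) - F_n just before X_(i)
    return float(max(above, below))

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
u = rng.uniform(size=100)
F_n = ecdf(u)
print(F_n(0.5), sup_distance_uniform(u))
```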
## Pointwise Consistency and Binomial Fluctuations
For a fixed threshold $x$, estimating $F(x)$ is the same as estimating the success probability of the event $\{X\le x\}$. This reduction turns the empirical distribution function into a binomial average and gives the first consistency and asymptotic normality results.
[quotetheorem:6298]
[citeproof:6298]
This representation gives exact finite-sample information at a single point, and the i.i.d. hypothesis is essential for the binomial law rather than merely a convenient assumption. If the observations were dependent, the indicators could be dependent even though each had success probability $F(x)$; if they were not identically distributed, the success probabilities could differ from observation to observation. The theorem also has a built-in limitation: it says nothing about choosing $x$ after inspecting the data or about controlling all thresholds at once. The next question is whether these exact pointwise fluctuations vanish as the sample size grows, which is the consistency requirement for estimating $F(x)$.
[quotetheorem:6299]
[citeproof:6299]
Pointwise consistency handles any fixed threshold chosen before seeing the data, and the fixed-threshold qualification is a real restriction. For example, even if every rational threshold has small error eventually, the supremum over all real thresholds still needs a monotonicity argument to transfer control between grid points. Dependence can also break the strong-law conclusion for the indicator sequence, so the i.i.d. sampling assumption is carrying the probabilistic averaging. Many nonparametric procedures ask for simultaneous control over all thresholds, so the next result upgrades the scalar law of large numbers to a uniform statement.
[quotetheorem:2004]
[citeproof:2004]
The theorem shows that the whole empirical distribution function converges to the true distribution function in sup norm, and it explains why distribution functions are more tractable than arbitrary test-function classes. The monotonicity of half-lines is essential to this elementary proof; a general collection of measurable sets can be too large for uniform convergence to hold. The result is also qualitative: it gives eventual almost sure convergence but no sample size at which the error is likely to be below a prescribed tolerance. For statistical inference we therefore want a finite-sample probability bound on the size of the supremum error. A pointwise binomial confidence interval is not enough, because taking many thresholds would create a multiple-comparisons problem and miss the random location of the largest deviation. This obstruction leads to the distribution-free inequality below.
[quotetheorem:6300]
[citeproof:6300]
The inequality turns the qualitative [Glivenko-Cantelli theorem](/theorems/2004) into a finite-sample bound with no unknown constants and no smoothness assumptions on $F$. Its distribution-free nature is special to one-dimensional distribution functions; comparable bounds for richer classes require complexity conditions such as VC dimension or entropy control. The strict inequality in the probability event is immaterial for applications, but the exponential dependence on $n\varepsilon^2$ is the main rate information. Solving the bound for $\varepsilon$ produces a distribution-free confidence band for the entire CDF, which is the main applied use in this chapter.
[example: Uniform Confidence Band for a CDF]
Let $X_1,\dots,X_n$ be i.i.d. with distribution function $F$, and fix $\alpha\in(0,1)$. We choose the band width by setting the right side of the *[Dvoretzky-Kiefer-Wolfowitz inequality](/theorems/6300)* equal to $\alpha$:
\begin{align*}
2e^{-2n\varepsilon_{n,\alpha}^2}=\alpha.
\end{align*}
Dividing by $2$ gives
\begin{align*}
e^{-2n\varepsilon_{n,\alpha}^2}=\frac{\alpha}{2}.
\end{align*}
Taking logarithms gives
\begin{align*}
-2n\varepsilon_{n,\alpha}^2=\log\frac{\alpha}{2}.
\end{align*}
Multiplying by $-1$ rewrites this as
\begin{align*}
2n\varepsilon_{n,\alpha}^2=\log\frac{2}{\alpha}.
\end{align*}
Since $\varepsilon_{n,\alpha}>0$, solving for the positive square root gives
\begin{align*}
\varepsilon_{n,\alpha}=\sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}.
\end{align*}
With this choice,
\begin{align*}
2e^{-2n\varepsilon_{n,\alpha}^2}=2e^{-\log(2/\alpha)}.
\end{align*}
Since $e^{-\log(2/\alpha)}=\alpha/2$, the last expression equals $\alpha$. Therefore the *Dvoretzky-Kiefer-Wolfowitz inequality* gives
\begin{align*}
\mathbb P\left(\sup_{x\in\mathbb R}|F_n(x)-F(x)|>\varepsilon_{n,\alpha}\right)\le \alpha.
\end{align*}
Taking complements yields
\begin{align*}
\mathbb P\left(\sup_{x\in\mathbb R}|F_n(x)-F(x)|\le \varepsilon_{n,\alpha}\right)\ge 1-\alpha.
\end{align*}
On the event inside this probability, every $x\in\mathbb R$ satisfies
\begin{align*}
|F_n(x)-F(x)|\le \varepsilon_{n,\alpha}.
\end{align*}
This inequality is equivalent to
\begin{align*}
F_n(x)-\varepsilon_{n,\alpha}\le F(x)\le F_n(x)+\varepsilon_{n,\alpha}.
\end{align*}
Because every distribution function satisfies $0\le F(x)\le 1$, the same event also gives the truncated bounds
\begin{align*}
\max\{F_n(x)-\varepsilon_{n,\alpha},0\}\le F(x)\le \min\{F_n(x)+\varepsilon_{n,\alpha},1\}.
\end{align*}
Thus the functions $L_n(x)=\max\{F_n(x)-\varepsilon_{n,\alpha},0\}$ and $U_n(x)=\min\{F_n(x)+\varepsilon_{n,\alpha},1\}$ form a simultaneous $(1-\alpha)$ confidence band for $F$. The statement is simultaneous because the same event controls all $x\in\mathbb R$, and it is distribution-free because the bound contains no unknown feature of $F$.
[/example]
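The band is straightforward to compute. The sketch below assumes continuous data without ties and uses illustrative names (`dkw_epsilon`, `dkw_band`, `uniform_coverage`); the Monte Carlo check uses uniform data, for which $F(x)=x$, purely as a sanity check on the coverage guarantee.

```python
import numpy as np

def dkw_epsilon(n, alpha):
    """Half-width sqrt(log(2/alpha) / (2n)) solving 2*exp(-2*n*eps^2) = alpha."""
    return float(np.sqrt(np.log(2.0 / alpha) / (2.0 * n)))

def dkw_band(sample, alpha):
    """Order statistics with the truncated band L_n, U_n evaluated at them."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    eps = dkw_epsilon(n, alpha)
    F_n = np.arange(1, n + 1) / n  # F_n at X_(1), ..., X_(n), assuming no ties
    lower = np.maximum(F_n - eps, 0.0)
    upper = np.minimum(F_n + eps, 1.0)
    return ordered, lower, upper

def uniform_coverage(n, alpha, reps, seed=0):
    """Monte Carlo fraction of Unif(0,1) samples whose CDF lies inside the band."""
    rng = np.random.default_rng(seed)
    eps = dkw_epsilon(n, alpha)
    i = np.arange(1, n + 1)
    hits = 0
    for _ in range(reps):
        u = np.sort(rng.uniform(size=n))
        sup_dist = max(np.max(i / n - u), np.max(u - (i - 1) / n))
        hits += int(sup_dist <= eps)
    return hits / reps

coverage = uniform_coverage(n=50, alpha=0.05, reps=2000)
print(dkw_epsilon(50, 0.05), coverage)
```

Since the inequality is an upper bound on the failure probability, the simulated coverage is expected to be at least about $1-\alpha$ up to Monte Carlo error, and may be somewhat larger.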
## Quantiles and the Probability Integral Transform
Distribution functions estimate probabilities of half-lines, but many statistical summaries are inverse objects: medians, quartiles, and percentile cutoffs. The main issue is that a CDF may be flat or have jumps, so its inverse must be defined through inequalities rather than through an ordinary inverse function.
[definition: Quantile Function]
Let $F$ be a distribution function on $\mathbb R$. The quantile function is the map $F^{-1}:(0,1)\to\mathbb R\cup\{-\infty,\infty\}$ defined by
\begin{align*}
p\mapsto F^{-1}(p)=\inf\{x\in\mathbb R:F(x)\ge p\}.
\end{align*}
[/definition]
This convention gives a well-defined population target for every distribution function, including discrete and mixed laws. The extended codomain records a possible endpoint pathology, although for any proper distribution function, meaning $F(x)\to0$ as $x\to-\infty$ and $F(x)\to1$ as $x\to\infty$, the quantile $F^{-1}(p)$ is finite for every $p\in(0,1)$. The main obstruction is that flat pieces and jumps destroy ordinary invertibility: many locations may correspond to the same probability level, and an atom may skip over the desired level. The next problem is how to estimate that target from the empirical CDF while keeping the same generalized inverse convention.
[definition: Sample Quantile]
Let $X_1,\dots,X_n$ be real-valued observations with empirical distribution function $F_n$. The empirical quantile function is the map $F_n^{-1}:(0,1)\to\mathbb R$ defined by
\begin{align*}
p\mapsto F_n^{-1}(p)=\inf\{x\in\mathbb R:F_n(x)\ge p\}.
\end{align*}
[/definition]
The sample quantile is an order statistic. If $X_{(1)}\le\dots\le X_{(n)}$ are the ordered observations, then $F_n^{-1}(p)=X_{(\lceil np\rceil)}$ under this convention; the median case shows how local density affects the estimator.
[example: Median Estimation Under Positive Density]
Suppose $F$ has a density $f_X$ in a neighbourhood of its median $m=F^{-1}(1/2)$, with $F(m)=1/2$ and $f_X(m)>0$. Under the generalized inverse convention,
\begin{align*}
\hat m_n=F_n^{-1}(1/2)=\inf\{x:F_n(x)\ge 1/2\}.
\end{align*}
If $X_{(1)}\le \cdots \le X_{(n)}$, then $F_n(x)=k/n$ on the interval $X_{(k)}\le x<X_{(k+1)}$. The first index $k$ satisfying $k/n\ge 1/2$ is $k=\lceil n/2\rceil$, so
\begin{align*}
\hat m_n=X_{(\lceil n/2\rceil)}.
\end{align*}
The role of the positive density condition is visible from the local expansion at $m$. Differentiability gives
\begin{align*}
F(m+h)=F(m)+f_X(m)h+o(h).
\end{align*}
Since $F(m)=1/2$, this becomes
\begin{align*}
F(m+h)=\frac{1}{2}+f_X(m)h+o(h).
\end{align*}
Thus, to first order, moving the location by $h$ changes the probability level by $f_X(m)h$, so a probability error $\delta$ corresponds to a location error $\delta/f_X(m)$.
At the median, $F_n(m)$ is the empirical average of the indicators $\mathbb{1}_{\{X_i\le m\}}$. Each indicator has success probability $F(m)=1/2$, hence variance $(1/2)(1-1/2)=1/4$. Independence gives
\begin{align*}
\operatorname{Var}(F_n(m))=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i\le m\}}\right).
\end{align*}
Using independence to add variances,
\begin{align*}
\operatorname{Var}(F_n(m))=\frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(\mathbb{1}_{\{X_i\le m\}}).
\end{align*}
Substituting $\operatorname{Var}(\mathbb{1}_{\{X_i\le m\}})=1/4$ gives
\begin{align*}
\operatorname{Var}(F_n(m))=\frac{1}{n^2}\cdot n\cdot \frac{1}{4}=\frac{1}{4n}.
\end{align*}
The vertical fluctuation scale of $F_n(m)$ is therefore $1/(2\sqrt n)$. Multiplying by the inverse local slope $1/f_X(m)$ gives the median fluctuation scale
\begin{align*}
\frac{1}{2\sqrt n\,f_X(m)}.
\end{align*}
Equivalently, the limiting variance for $\sqrt n(\hat m_n-m)$ is
\begin{align*}
\frac{1/4}{f_X(m)^2}=\frac{1}{4f_X(m)^2},
\end{align*}
as in *[Asymptotic Normality of Sample Quantiles](/theorems/6301)*. The density at the median is the conversion factor from probability error to location error: smaller $f_X(m)$ makes the same empirical CDF fluctuation produce a wider median fluctuation.
[/example]
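The order-statistic formula and the variance heuristic can both be checked by simulation. The sketch below assumes standard normal data, for which $m=0$ and $f_X(m)=1/\sqrt{2\pi}$, so the limiting variance $1/(4f_X(m)^2)$ equals $\pi/2$; the function names and the small floating-point tolerance inside `sample_quantile` are illustrative choices, not part of the definition.

```python
import math
import numpy as np

def sample_quantile(sample, p):
    """Generalised inverse F_n^{-1}(p) = X_(ceil(n p)) for 0 < p < 1."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    # The 1e-12 guard is an assumption of this sketch: it protects against
    # n * p landing just above an integer through floating-point rounding.
    k = max(1, math.ceil(n * p - 1e-12))  # 1-based index of the order statistic
    return float(ordered[k - 1])

def scaled_median_variance(n, reps, seed=0):
    """Monte Carlo estimate of n * Var(sample median) for N(0, 1) data."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal(size=(reps, n))
    k = math.ceil(n / 2)
    medians = np.sort(samples, axis=1)[:, k - 1]
    return n * float(np.var(medians))

simulated = scaled_median_variance(n=401, reps=4000)
print(simulated, math.pi / 2)
```

The simulated value should be close to $\pi/2$, with the remaining gap reflecting finite $n$ and Monte Carlo error.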
Quantiles also have a constructive interpretation. Suppose a simulation procedure can generate only uniform random numbers, but the statistical model requires samples from a distribution with cumulative distribution function $F$. The quantile function is the natural candidate for converting the available uniform randomness into observations on the original scale.
The obstruction is that this conversion is valid only when the inverse really turns order comparisons for $F^{-1}(U)$ into order comparisons for $U$. A strictly increasing continuous CDF has exactly the monotonicity and invertibility needed for that argument.
[quotetheorem:1139]
[citeproof:1139]
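As an illustration of the inverse-transform idea, the sketch below uses the standard logistic distribution, chosen here as an assumption because its CDF $F(x)=1/(1+e^{-x})$ is continuous and strictly increasing on $\mathbb R$ with the explicit inverse $F^{-1}(u)=\log\{u/(1-u)\}$. The function names are illustrative.

```python
import numpy as np

def logistic_cdf(x):
    """CDF of the standard logistic distribution, strictly increasing on R."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_quantile(u):
    """Ordinary inverse of logistic_cdf on (0, 1)."""
    return np.log(u / (1.0 - u))

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)   # the only randomness available
x = logistic_quantile(u)        # transformed draws on the original scale

x0 = 0.8                        # arbitrary threshold for the check
empirical = float(np.mean(x <= x0))
target = float(logistic_cdf(x0))
print(empirical, target)
```

The fraction of transformed draws at or below `x0` should be close to `logistic_cdf(x0)`, which is the event identity $\{F^{-1}(U)\le x\}=\{U\le F(x)\}$ seen empirically.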
The strict monotonicity assumption makes the inverse $F^{-1}$ an ordinary function, so the event $\{F^{-1}(U)\le x\}$ can be rewritten as $\{U\le F(x)\}$. This is the inverse-transform direction, not the forward probability-integral-transform direction. For sample quantiles, the useful lesson is that quantiles convert vertical probability levels into horizontal locations. To quantify the error of sample quantiles, we next combine the pointwise [central limit theorem](/theorems/521) for $F_n$ with a local inverse approximation.
[quotetheorem:6301]
[citeproof:6301]
The theorem is the basic asymptotic justification for nonparametric confidence intervals for medians and other quantiles. The hypotheses exclude the main failure modes: if $F$ jumps at $q_p$, the sample quantile may have a non-normal limiting behaviour, while if $f_X(q_p)=0$, the CDF is too flat near $q_p$, and hence its inverse too steep, for the ordinary $n^{-1/2}$ scale. The condition $F(q_p)=p$ rules out ambiguity at a plateau where several locations could be legitimate generalized inverses for nearby probability levels. It also marks a shift in the course: from estimating a distribution function itself to estimating functionals of that distribution, a theme developed further for empirical-process functionals in Chapter 3 and for confidence intervals in Chapter 11.
[example: Asymptotic Confidence Interval for a Median]
Let $m=F^{-1}(1/2)$ and assume $F$ has density $f_X$ in a neighbourhood of $m$, with $F(m)=1/2$ and $f_X(m)>0$. For the sample median
\begin{align*}
\hat m_n=F_n^{-1}(1/2),
\end{align*}
*Asymptotic Normality of Sample Quantiles* with $p=1/2$ gives
\begin{align*}
\sqrt n(\hat m_n-m)
&\xrightarrow{d}\mathcal N\left(0,\frac{(1/2)(1-1/2)}{f_X(m)^2}\right).
\end{align*}
Since
\begin{align*}
(1/2)(1-1/2)
&=(1/2)(1/2)
=\frac{1}{4},
\end{align*}
the limiting variance is
\begin{align*}
\frac{(1/2)(1-1/2)}{f_X(m)^2}
&=\frac{1/4}{f_X(m)^2}
=\frac{1}{4f_X(m)^2}.
\end{align*}
Therefore the limiting standard deviation of $\sqrt n(\hat m_n-m)$ is
\begin{align*}
\sqrt{\frac{1}{4f_X(m)^2}}
&=\frac{1}{2f_X(m)},
\end{align*}
where $f_X(m)>0$ makes the positive square root unambiguous. Dividing by $\sqrt n$, the asymptotic standard error of $\hat m_n$ is
\begin{align*}
\frac{1}{2\sqrt n\,f_X(m)}.
\end{align*}
Let $z_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of the standard normal distribution. Replacing the unknown $f_X(m)$ by $\hat f_n(\hat m_n)$, where $\hat f_n$ is a consistent density estimate evaluated at the sample median, gives the plug-in standard error
\begin{align*}
\widehat{\operatorname{se}}(\hat m_n)
&=\frac{1}{2\sqrt n\,\hat f_n(\hat m_n)}.
\end{align*}
Thus the normal approximation gives the interval
\begin{align*}
\hat m_n
-z_{1-\alpha/2}\frac{1}{2\sqrt n\,\hat f_n(\hat m_n)}
\le m \le
\hat m_n
+z_{1-\alpha/2}\frac{1}{2\sqrt n\,\hat f_n(\hat m_n)}.
\end{align*}
Equivalently, the approximate $(1-\alpha)$ confidence interval is
\begin{align*}
\left[
\hat m_n-z_{1-\alpha/2}\frac{1}{2\sqrt n\,\hat f_n(\hat m_n)},
\,
\hat m_n+z_{1-\alpha/2}\frac{1}{2\sqrt n\,\hat f_n(\hat m_n)}
\right].
\end{align*}
The denominator shows the statistical meaning of the density condition: when $f_X(m)$ is small, the CDF is locally flat near the median, so the same vertical empirical-CDF fluctuation corresponds to a larger horizontal error in the estimated median.
[/example]
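A possible implementation of this interval is sketched below. The example does not specify the density estimate $\hat f_n$, so the sketch assumes a Gaussian kernel estimate with a normal-reference rule-of-thumb bandwidth; both choices, and the function names, are assumptions made only for illustration.

```python
import math
import numpy as np
from statistics import NormalDist

def gaussian_kde_at(sample, x, bandwidth):
    """Gaussian kernel density estimate at a single point x."""
    sample = np.asarray(sample, dtype=float)
    z = (x - sample) / bandwidth
    return float(np.mean(np.exp(-0.5 * z ** 2)) / (bandwidth * math.sqrt(2 * math.pi)))

def median_confidence_interval(sample, alpha=0.05, bandwidth=None):
    """Approximate (1 - alpha) interval: m_hat +/- z / (2 sqrt(n) f_hat(m_hat))."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    m_hat = float(np.sort(sample)[math.ceil(n / 2) - 1])  # X_(ceil(n/2))
    if bandwidth is None:
        # Assumed rule of thumb; the text leaves the bandwidth unspecified.
        bandwidth = 1.06 * float(np.std(sample, ddof=1)) * n ** (-1 / 5)
    f_hat = gaussian_kde_at(sample, m_hat, bandwidth)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z / (2 * math.sqrt(n) * f_hat)
    return m_hat - half_width, m_hat + half_width

rng = np.random.default_rng(0)
data = rng.standard_normal(2000)
lo, hi = median_confidence_interval(data, alpha=0.05)
print(lo, hi)
```

The interval is only as reliable as the density estimate at the median, so in practice the bandwidth choice deserves more care than this sketch gives it.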
Chapter 2 showed that the empirical distribution function uniformly approximates the true distribution function. Chapter 3 sharpens this result by studying the stochastic fluctuations around that approximation: not just the fact of convergence, but the size and shape of the random deviations at the natural central limit rate. This requires a new perspective, in which the empirical distribution function is viewed as a stochastic process rather than just an estimator.
# 3. Empirical Processes and Weak Convergence
Chapter 2 showed that empirical distribution functions give a uniform approximation to a fixed distribution function. The next step is to study the fluctuations around that approximation at the natural central limit scale. This chapter introduces the empirical process, first as a random signed measure and then as a stochastic process indexed by sets or functions, and explains why Brownian bridges appear as the universal Gaussian limit in distribution-free nonparametric procedures.
## From Empirical Measures to Empirical Processes
The law of large numbers says that $P_n f$ is close to $P f$ for many fixed test functions $f$. For inference we need the size and shape of the random error, especially when a statistic takes a supremum over many possible sets or functions.
[definition: Empirical Measure]
Let $X_1,\dots,X_n$ be i.i.d. random variables with distribution $P$ on a measurable space $(\mathcal X,\mathcal A)$. The empirical measure is the random probability measure $P_n$ defined by
\begin{align*}
P_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_A(X_i), \qquad A \in \mathcal A.
\end{align*}
For a measurable function $f:\mathcal X\to\mathbb R$ with $P|f|<\infty$, write
\begin{align*}
P_n f = \int f\,dP_n = \frac{1}{n}\sum_{i=1}^n f(X_i), \qquad P f = \int f\,dP.
\end{align*}
[/definition]
The empirical measure packages sample averages into one object, but the object itself still lives on the law of large numbers scale. To obtain a non-degenerate approximation for sampling error, we need to centre at $P$ and magnify by the central limit scale $\sqrt n$; this motivates the empirical process.
[definition: Centered Empirical Process]
Let $\mathcal F$ be a class of measurable functions $f:\mathcal X\to\mathbb R$ such that $P f^2<\infty$ for every $f\in\mathcal F$. The centered empirical process indexed by $\mathcal F$ is the random map $\alpha_n:\mathcal F\to\mathbb R$ defined by
\begin{align*}
\alpha_n(f) = \sqrt n (P_n-P)f
= \frac{1}{\sqrt n}\sum_{i=1}^n \{f(X_i)-P f\}, \qquad f\in\mathcal F.
\end{align*}
For a class of sets $\mathcal C\subseteq\mathcal A$, the associated empirical process is the random map $\alpha_n:\mathcal C\to\mathbb R$ defined by $\alpha_n(C)=\sqrt n(P_n(C)-P(C))$ for $C\in\mathcal C$.
[/definition]
For a fixed $f$, convergence of $\alpha_n(f)$ is the ordinary [central limit theorem](/theorems/1848). The empirical-process problem is that many nonparametric statistics depend on $\sup_{f\in\mathcal F}|\alpha_n(f)|$, so we first isolate the single-coordinate case before asking for convergence of the whole random function $f\mapsto\alpha_n(f)$.
[example: Single Function Central Limit]
Let $f:\mathcal X\to\mathbb R$ satisfy $P f^2<\infty$, and set $\sigma_f^2=P(f-Pf)^2$. Define
\begin{align*}
Y_i=f(X_i)-P f.
\end{align*}
Because $X_1,\dots,X_n$ are i.i.d., the variables $Y_1,\dots,Y_n$ are i.i.d. Their mean is $\mathbb E Y_i=\mathbb E f(X_i)-P f=P f-P f=0$, and their variance is
\begin{align*}
\operatorname{Var}(Y_i)=\mathbb E(Y_i-\mathbb E Y_i)^2=\mathbb E\{f(X_i)-P f\}^2=P(f-Pf)^2=\sigma_f^2.
\end{align*}
Since $P f^2<\infty$, this variance is finite, so the *Central Limit Theorem* gives
\begin{align*}
\frac{1}{\sqrt n}\sum_{i=1}^n Y_i \xrightarrow{d}\mathcal N(0,\sigma_f^2).
\end{align*}
Substituting $Y_i=f(X_i)-P f$ into the sum gives
\begin{align*}
\frac{1}{\sqrt n}\sum_{i=1}^n Y_i=\frac{1}{\sqrt n}\sum_{i=1}^n \{f(X_i)-P f\}=\sqrt n\left\{\frac{1}{n}\sum_{i=1}^n f(X_i)-P f\right\}=\sqrt n(P_n-P)f=\alpha_n(f).
\end{align*}
Therefore $\alpha_n(f)\xrightarrow{d}\mathcal N(0,\sigma_f^2)$. When the index class is the singleton $\{f\}$, the empirical process is exactly the classical one-dimensional central-limit scaling, so no tightness or uniformity issue appears.
[/example]
A statistic usually depends on more than one coordinate, so the next question is how several empirical-process values fluctuate jointly. The answer fixes the covariance structure of any Gaussian limit, and that covariance will later identify the Brownian bridge.
[quotetheorem:6302]
[citeproof:6302]
This theorem identifies all finite-dimensional marginals of a possible limit, so any later Gaussian process must have exactly this covariance structure. The finite second-moment hypothesis is essential even for a single coordinate: without it the ordinary central limit theorem can fail and stable-law behaviour may replace Gaussian behaviour. The limitation is that finite-dimensional convergence says nothing about how $\alpha_n(f)$ behaves as $f$ varies through an infinite class. For example, suppose $P$ is non-atomic, singletons are measurable, and $\mathcal C$ is the class of all measurable subsets of $\mathcal X$. Then the realised sample set $C=\{X_1,\dots,X_n\}$ belongs to $\mathcal C$ and has $P_n(C)=1$ and $P(C)=0$, so $\sup_{C\in\mathcal C}|\alpha_n(C)|=\sqrt n$ and no tight Gaussian limit in $\ell^\infty(\mathcal C)$ is possible. To see the concrete process that motivates the later tightness theory, we specialise to the half-lines that generate the empirical distribution function.
[example: Half-Line Indexing]
Let $\mathcal C=\{(-\infty,x]:x\in\mathbb R\}$, let $F(x)=P((-\infty,x])$, and define $F_n(x)=P_n((-\infty,x])$. For $g_x=\mathbf 1_{(-\infty,x]}$, the half-line coordinate is
\begin{align*}
\alpha_n((-\infty,x])=\sqrt n\{P_n((-\infty,x])-P((-\infty,x])\}=\sqrt n\{F_n(x)-F(x)\}.
\end{align*}
Fix $x_1,\dots,x_k\in\mathbb R$. By the preceding finite-dimensional empirical-process limit applied to $g_{x_1},\dots,g_{x_k}$, the limiting vector is centred Gaussian with $(i,j)$ covariance entry $P(g_{x_i}g_{x_j})-P g_{x_i}\,P g_{x_j}$. For each $y\in\mathbb R$,
\begin{align*}
g_{x_i}(y)g_{x_j}(y)=\mathbf 1_{\{y\le x_i\}}\mathbf 1_{\{y\le x_j\}}=\mathbf 1_{\{y\le x_i \text{ and } y\le x_j\}}=\mathbf 1_{\{y\le x_i\wedge x_j\}}=g_{x_i\wedge x_j}(y).
\end{align*}
Therefore
\begin{align*}
P(g_{x_i}g_{x_j})=P g_{x_i\wedge x_j}=P((-\infty,x_i\wedge x_j])=F(x_i\wedge x_j).
\end{align*}
Also,
\begin{align*}
P g_{x_i}\,P g_{x_j}=P((-\infty,x_i])P((-\infty,x_j])=F(x_i)F(x_j).
\end{align*}
Hence, writing $G$ for the limiting centred Gaussian process evaluated at the points $x_1,\dots,x_k$,
\begin{align*}
\operatorname{Cov}(G(x_i),G(x_j))=F(x_i\wedge x_j)-F(x_i)F(x_j).
\end{align*}
Thus half-line indexing turns empirical-distribution fluctuations into a Gaussian process whose covariance becomes the Brownian bridge covariance after the probability-scale transformation $t=F(x)$.
[/example]
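The covariance formula can be checked by simulation. The sketch below assumes uniform data, so that $F(x)=x$, and two arbitrary thresholds; the function name `empirical_process_values` is illustrative. For i.i.d. data the covariance of $\sqrt n\{F_n(x)-F(x)\}$ at two thresholds equals $F(x_i\wedge x_j)-F(x_i)F(x_j)$ for every $n$, so the comparison is subject only to Monte Carlo error.

```python
import numpy as np

def empirical_process_values(samples, xs, F):
    """Rows of sqrt(n) * (F_n(x) - F(x)) at thresholds xs, one row per sample."""
    samples = np.asarray(samples, dtype=float)
    xs = np.asarray(xs, dtype=float)
    n = samples.shape[1]
    # Broadcasting compares every observation with every threshold.
    F_n = (samples[:, :, None] <= xs).mean(axis=1)  # shape (reps, len(xs))
    return np.sqrt(n) * (F_n - F(xs))

rng = np.random.default_rng(0)
reps, n = 5000, 200
xs = np.array([0.3, 0.7])  # arbitrary thresholds for illustration
samples = rng.uniform(size=(reps, n))
values = empirical_process_values(samples, xs, F=lambda x: x)

simulated_cov = np.cov(values, rowvar=False)
theoretical_cov = np.minimum.outer(xs, xs) - np.outer(xs, xs)
print(simulated_cov)
print(theoretical_cov)
```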
## The Uniform Empirical Process and the Brownian Bridge
For distribution-free testing it is natural to remove the unknown distribution $F$. The probability integral transform sends a continuous distribution to the uniform distribution on $[0,1]$, and the empirical process then has a limit that no longer depends on $P$.
[definition: Uniform Empirical Process]
Let $U_1,\dots,U_n$ be i.i.d. with distribution $\operatorname{Unif}(0,1)$. The uniform empirical distribution function is the random map $G_n:[0,1]\to\mathbb R$ defined by
\begin{align*}
G_n(t)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{U_i\le t\}}, \qquad 0\le t\le 1.
\end{align*}
The uniform empirical process is the random map $\alpha_n:[0,1]\to\mathbb R$ defined by
\begin{align*}
\alpha_n(t)=\sqrt n\{G_n(t)-t\}, \qquad 0\le t\le 1.
\end{align*}
[/definition]
The uniform empirical process has Gaussian finite-dimensional limits because every finite collection of coordinates satisfies a central limit theorem. Its limit must also be pinned to zero at both endpoints, since $G_n(0)=0$ almost surely and $G_n(1)=1$, so that $\alpha_n(0)=\alpha_n(1)=0$. These two requirements motivate the following definition of the Brownian bridge, which is the process needed for Kolmogorov-Smirnov limit theory.
[definition: Brownian Bridge]
A Brownian bridge on $[0,1]$ is a centred Gaussian random map $B:[0,1]\to\mathbb R$ with covariance
\begin{align*}
\operatorname{Cov}(B(s),B(t))=s\wedge t-st, \qquad 0\le s,t\le 1.
\end{align*}
[/definition]
The bridge may be realised as $B(t)=W(t)-tW(1)$ for standard [Brownian motion](/page/Brownian%20Motion) $W$. This representation gives a fast way to verify the covariance and to remember why $B(1)=0$.
[example: Brownian Bridge Covariance]
Let $B(t)=W(t)-tW(1)$, where $W$ is standard Brownian motion. Since $\mathbb E W(u)=0$ for every $u\in[0,1]$, we have
\begin{align*}
\mathbb E B(t)=\mathbb E\{W(t)-tW(1)\}=\mathbb E W(t)-t\,\mathbb E W(1)=0-t\cdot 0=0.
\end{align*}
Thus $B$ is centred.
For $0\le s,t\le 1$, standard Brownian motion satisfies $\mathbb E[W(a)W(b)]=a\wedge b$. Since $B$ is centred,
\begin{align*}
\operatorname{Cov}(B(s),B(t))=\mathbb E[B(s)B(t)].
\end{align*}
Expanding the product gives
\begin{align*}
B(s)B(t)=(W(s)-sW(1))(W(t)-tW(1)).
\end{align*}
Multiplying out,
\begin{align*}
B(s)B(t)=W(s)W(t)-tW(s)W(1)-sW(1)W(t)+stW(1)^2.
\end{align*}
Taking expectations term by term,
\begin{align*}
\mathbb E[B(s)B(t)]=\mathbb E[W(s)W(t)]-t\,\mathbb E[W(s)W(1)]-s\,\mathbb E[W(1)W(t)]+st\,\mathbb E[W(1)^2].
\end{align*}
Using $\mathbb E[W(a)W(b)]=a\wedge b$,
\begin{align*}
\mathbb E[B(s)B(t)]=(s\wedge t)-t(s\wedge 1)-s(1\wedge t)+st(1\wedge 1).
\end{align*}
Because $0\le s,t\le 1$, this becomes
\begin{align*}
\mathbb E[B(s)B(t)]=(s\wedge t)-ts-st+st=s\wedge t-st.
\end{align*}
Therefore
\begin{align*}
\operatorname{Cov}(B(s),B(t))=s\wedge t-st.
\end{align*}
Also $B(1)=W(1)-W(1)=0$, so subtracting the random line $tW(1)$ produces a centred Gaussian process with the Brownian bridge covariance and with its endpoint tied down at time $1$.
[/example]
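The covariance identity can also be checked numerically. The following is a minimal Monte Carlo sketch, not part of the derivation: it builds $(B(s),B(t))$ from independent Brownian increments through $B(u)=W(u)-uW(1)$ and compares the sample average of $B(s)B(t)$ with $s\wedge t-st$. The function names, the seed, and the choice $s=0.3$, $t=0.7$ are illustrative assumptions.

```python
import math
import random

def simulate_bridge_pair(s, t, rng):
    """Draw (B(s), B(t)) for 0 < s < t < 1 using B(u) = W(u) - u * W(1)."""
    # Brownian motion has independent Gaussian increments over [0,s], [s,t], [t,1].
    w_s = rng.gauss(0.0, math.sqrt(s))
    w_t = w_s + rng.gauss(0.0, math.sqrt(t - s))
    w_1 = w_t + rng.gauss(0.0, math.sqrt(1.0 - t))
    return w_s - s * w_1, w_t - t * w_1

def empirical_covariance(s, t, reps=100_000, seed=0):
    """Average of B(s) * B(t); B is centred, so this estimates the covariance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        b_s, b_t = simulate_bridge_pair(s, t, rng)
        total += b_s * b_t
    return total / reps

s, t = 0.3, 0.7
estimate = empirical_covariance(s, t)
target = min(s, t) - s * t  # Brownian bridge covariance, here 0.3 - 0.21 = 0.09
print(estimate, target)
```

The two printed numbers should agree up to Monte Carlo error of order $10^{-3}$ at this number of replications.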
The covariance calculation matches the half-line empirical-process covariance on the uniform scale. The remaining question is whether convergence holds uniformly over all $t\in[0,1]$, and Donsker's theorem answers this in the sup-norm function space.
[quotetheorem:6303]
[citeproof:6303]
The theorem turns uniform empirical fluctuations into a Gaussian object in a space strong enough to remember suprema over all thresholds. The uniform law and interval indexing are doing real work here: if $U_1,\dots,U_n$ are uniform and the index class is enlarged from intervals to all Borel subsets of $[0,1]$, the random finite set $\{U_1,\dots,U_n\}$ has empirical mass $1$ and Lebesgue mass $0$, so the corresponding supremum of $|\alpha_n(C)|$ is at least $\sqrt n$. Thus finite-dimensional Brownian-bridge covariance does not by itself control a large index class. The endpoint constraints are also inherited in the limit, since every sample path of $\alpha_n$ is zero at $0$ and $1$. Nonparametric test statistics are usually not the whole path itself but a functional of the path, so we need a principle that transfers weak convergence through operations such as taking a supremum.
[quotetheorem:6304]
[citeproof:6304]
This theorem connects the process limit to the Kolmogorov-Smirnov statistic studied in Chapter 4 because the supremum functional is continuous for the norm in which Donsker convergence was proved. The sup-norm hypothesis cannot be casually weakened: convergence of only finitely many coordinates would not determine the maximum over a continuum of thresholds. For a concrete boundary case, let $T=[0,1]$ and let $z_n(t)=\mathbf{1}_{\{t=t_n\}}$, where $(t_n)$ is a sequence of distinct points. Then $z_n(t)\to0$ for every fixed $t$, but $\sup_{t\in T}|z_n(t)|=1$ for every $n$. The absolute value is harmless because it is already built into a Lipschitz functional on $\ell^\infty(T)$. To turn the limiting statistic into critical values, we need the distribution of the supremum of the absolute Brownian bridge.
[quotetheorem:6305]
[citeproof:6305]
The reflection-principle argument for Brownian motion works together with the representation of the bridge as Brownian motion conditioned to return to zero at time $1$. The hypotheses behind its use matter. First, the bridge limit comes from the continuous-null, one-sample empirical distribution function after the probability integral transform; atoms or data-dependent fitted parameters change the limiting law. Second, the displayed expression is for the two-sided supremum of the absolute bridge, while one-sided Kolmogorov-Smirnov statistics have related but different boundary-crossing probabilities. Third, the theorem is an asymptotic distributional input, not a finite-sample equality for the Kolmogorov-Smirnov statistic $D_n=\sup_{x\in\mathbb R}|F_n(x)-F(x)|$ and not a statement about the location at which the maximum is attained. In this course the formula is quoted for Kolmogorov-Smirnov critical values and for asymptotic confidence bands around distribution functions.
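To see how the limiting law produces critical values, here is a short sketch that assumes the quoted distribution is the classical Kolmogorov alternating series $\mathbb P(\sup_{0\le t\le 1}|B(t)|>x)=2\sum_{k\ge 1}(-1)^{k-1}e^{-2k^2x^2}$. The function names, the truncation at 100 terms, and the bisection bracket are illustrative choices, not part of the theorem.

```python
import math

def kolmogorov_sf(x, terms=100):
    """Tail P(sup|B| > x) via the truncated series 2 * sum (-1)^(k-1) exp(-2 k^2 x^2)."""
    if x <= 0:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
    # Clip to [0, 1] to absorb rounding error in the alternating sum.
    return min(1.0, max(0.0, 2.0 * total))

def kolmogorov_critical_value(alpha, lo=0.2, hi=3.0, tol=1e-10):
    """Solve kolmogorov_sf(c) = alpha by bisection; the tail is decreasing in c."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_sf(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c05 = kolmogorov_critical_value(0.05)
c01 = kolmogorov_critical_value(0.01)
print(round(c05, 3), round(c01, 3))  # about 1.358 and 1.628
```

These are asymptotic critical values for $\sqrt n D_n$ under a fully specified continuous null; they are not exact finite-sample quantiles.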
[example: Kolmogorov-Smirnov Scaling]
Let $X_1,\dots,X_n$ be i.i.d. with continuous distribution function $F$, and set $U_i=F(X_i)$. By the *probability integral transform*, $U_1,\dots,U_n$ are i.i.d. $\operatorname{Unif}(0,1)$. Let
\begin{align*}
G_n(t)=\frac{1}{n}\sum_{i=1}^n \mathbf 1_{\{U_i\le t\}}, \qquad 0\le t\le 1.
\end{align*}
Write $X_{(1)}\le\cdots\le X_{(n)}$ for the order statistics and $U_{(1)}\le\cdots\le U_{(n)}$ for the order statistics of $U_1,\dots,U_n$. Since $F$ is nondecreasing, the ordered transformed observations satisfy $U_{(i)}=F(X_{(i)})$. Ties among the $U_i$ have probability zero because each $U_i$ has the continuous uniform distribution, so the ordering is almost surely strict.
For the empirical distribution function, $F_n(x)$ is constant between consecutive order statistics and changes only at sample points. Hence the largest absolute vertical discrepancy occurs at a sample point, using either the value immediately before the jump or the value at the jump:
\begin{align*}
\sup_{x\in\mathbb R}|F_n(x)-F(x)|
=
\max_{1\le i\le n}
\left\{
\left|\frac{i-1}{n}-F(X_{(i)})\right|,
\left|\frac{i}{n}-F(X_{(i)})\right|
\right\}.
\end{align*}
Substituting $U_{(i)}=F(X_{(i)})$ gives
\begin{align*}
\sup_{x\in\mathbb R}|F_n(x)-F(x)|
=
\max_{1\le i\le n}
\left\{
\left|\frac{i-1}{n}-U_{(i)}\right|,
\left|\frac{i}{n}-U_{(i)}\right|
\right\}.
\end{align*}
The same jump-point description applied to $G_n$ gives
\begin{align*}
\sup_{0\le t\le 1}|G_n(t)-t|
=
\max_{1\le i\le n}
\left\{
\left|\frac{i-1}{n}-U_{(i)}\right|,
\left|\frac{i}{n}-U_{(i)}\right|
\right\}.
\end{align*}
Therefore
\begin{align*}
\sup_{x\in\mathbb R}|F_n(x)-F(x)|
=
\sup_{0\le t\le 1}|G_n(t)-t|.
\end{align*}
Multiplying both sides by $\sqrt n$,
\begin{align*}
\sqrt n\sup_{x\in\mathbb R}|F_n(x)-F(x)|
=
\sup_{0\le t\le 1}\left|\sqrt n\{G_n(t)-t\}\right|
=
\sup_{0\le t\le 1}|\alpha_n(t)|.
\end{align*}
By *Donsker Theorem for Intervals*, $\alpha_n\xrightarrow{d}B$ in $\ell^\infty([0,1])$, and by *Continuous Mapping for Suprema*,
\begin{align*}
\sup_{0\le t\le 1}|\alpha_n(t)|
\xrightarrow{d}
\sup_{0\le t\le 1}|B(t)|.
\end{align*}
Thus
\begin{align*}
\sqrt n\sup_{x\in\mathbb R}|F_n(x)-F(x)|
\xrightarrow{d}
\sup_{0\le t\le 1}|B(t)|.
\end{align*}
This is the asymptotic null distribution of the one-sample Kolmogorov-Smirnov statistic under a continuous null distribution.
[/example]
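The identity between the two suprema can be illustrated on simulated data. The sketch below assumes an $\operatorname{Exp}(1)$ sample with $F(x)=1-e^{-x}$; it evaluates the jump-point maximum once by sorting the data and then applying $F$, and once by applying $F$ first and then sorting. The helper names and the seed are hypothetical.

```python
import math
import random

def max_jump_gap(probabilities):
    """max over i of max(|i/n - p_i|, |(i-1)/n - p_i|) for p_1, ..., p_n in the given order."""
    n = len(probabilities)
    return max(
        max(abs(i / n - p), abs((i - 1) / n - p))
        for i, p in enumerate(probabilities, start=1)
    )

def exp_cdf(x):
    """Distribution function of Exp(1) on x >= 0."""
    return 1.0 - math.exp(-x)

rng = random.Random(1)
n = 500
x = [rng.expovariate(1.0) for _ in range(n)]

# X scale: sort the data, then evaluate F at the order statistics X_(i).
d_x_scale = max_jump_gap([exp_cdf(v) for v in sorted(x)])
# Uniform scale: transform first, U_i = F(X_i), then sort to obtain U_(i).
d_u_scale = max_jump_gap(sorted(exp_cdf(v) for v in x))

print(math.sqrt(n) * d_x_scale, math.sqrt(n) * d_u_scale)
```

Because $F$ is monotone, the two computations agree up to floating-point rounding, which is the finite-sample content of the reduction to $\alpha_n$.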
## Tightness, Entropy, and Large Index Classes
Finite-dimensional convergence describes any finite list of coordinates of the empirical process. Weak convergence as a random element of $\ell^\infty(\mathcal F)$ also needs a uniform control statement preventing the process from oscillating too much over nearby functions.
[definition: Empirical Semimetric]
Let $\mathcal F$ be a class of measurable functions with $P f^2<\infty$ for all $f\in\mathcal F$. The $L^2(P)$ semimetric on $\mathcal F$ is
\begin{align*}
d_P:\mathcal F\times\mathcal F\to[0,\infty), \qquad d_P(f,g)=\left(P(f-g)^2\right)^{1/2}.
\end{align*}
[/definition]
If two functions are close in this semimetric, the increment of the empirical process between them has small variance, since the variance of $f(X)-g(X)$ for $X\sim P$ is at most $P(f-g)^2=d_P(f,g)^2$. To turn that local variance control into a global supremum bound, we need a way to count how many local approximations are needed at each accuracy level.
[definition: Covering Number]
Let $(T,d)$ be a semimetric space. The covering-number function is the map $N(\cdot,T,d):(0,\infty)\to\mathbb N\cup\{\infty\}$ for which $N(\varepsilon,T,d)$ is the smallest integer $m$ for which there exist $t_1,\dots,t_m\in T$ such that
\begin{align*}
T\subseteq \bigcup_{j=1}^m \{t\in T:d(t,t_j)<\varepsilon\}.
\end{align*}
If no such finite $m$ exists, $N(\varepsilon,T,d)=\infty$.
[/definition]
The logarithm of the covering number is called entropy. Since chaining adds approximation errors across many resolutions, the relevant condition is not a single covering number but an integral of square-root entropy over scales; this motivates the entropy integral.
[definition: Entropy Integral]
Let $\mathcal F$ be a class of measurable functions with envelope $F_e$ satisfying $|f|\le F_e$ for all $f\in\mathcal F$ and $P F_e^2<\infty$. The uniform entropy integral is the map $J(\cdot,\mathcal F):(0,\infty)\to[0,\infty]$ defined by
\begin{align*}
J(\delta,\mathcal F)=\int_0^\delta \sup_Q \sqrt{\log N(\varepsilon\|F_e\|_{L^2(Q)},\mathcal F,L^2(Q))}\,d\varepsilon,
\end{align*}
where the supremum is over finitely supported probability measures $Q$ with $0<\|F_e\|_{L^2(Q)}<\infty$.
[/definition]
Entropy conditions are a way of verifying stochastic equicontinuity without proving it again for each statistic. The obstruction is that pointwise central limit theorems control each fixed function separately, while a functional central limit theorem must control all functions in the class at once. A finite entropy integral limits the number of distinguishable functions across resolutions, and a square-integrable envelope keeps the summands uniformly moderate.
[quotetheorem:6306]
[citeproof:6306]
This theorem explains why empirical process theory cares about the size of a function class rather than only about moments of individual functions. The square-integrable envelope controls the size of individual summands, while the entropy integral controls how many genuinely different summands appear at each resolution. If the envelope condition fails, even the single-function class $\{f\}$ can break central limit behaviour: for instance, take $f(X)=X$ when $X$ has a Pareto distribution with tail index in $(1,2)$, so the mean exists but the variance is infinite. If the entropy condition fails, a concrete obstruction is the class of all Borel subsets of $[0,1]$ under the uniform law, for which the empirical process can put mass on the observed sample set and make the supremum grow like $\sqrt n$. The criterion is sufficient rather than necessary, so failing it does not automatically prove that a class is not Donsker. The half-line class has very small entropy, and computing its covering numbers gives a concrete model for the general criterion.
[example: Intervals Have Low Complexity]
Write $f_t=\mathbf 1_{[0,t]}$ for $0\le t\le 1$, and let $\mathcal C=\{f_t:0\le t\le 1\}$ denote the resulting interval class. Under the uniform law on $[0,1]$, for $s,t\in[0,1]$,
\begin{align*}
d_P(f_s,f_t)^2=P(f_s-f_t)^2.
\end{align*}
By the definition of $P$ as Lebesgue measure on $[0,1]$,
\begin{align*}
P(f_s-f_t)^2=\int_0^1\left(\mathbf 1_{[0,s]}(x)-\mathbf 1_{[0,t]}(x)\right)^2\,dx.
\end{align*}
The integrand is $1$ exactly on the symmetric difference $[0,s]\triangle[0,t]$ and is $0$ elsewhere, so
\begin{align*}
\int_0^1\left(\mathbf 1_{[0,s]}(x)-\mathbf 1_{[0,t]}(x)\right)^2\,dx=\lambda([0,s]\triangle[0,t]).
\end{align*}
If $s\le t$, then $[0,s]\triangle[0,t]=(s,t]$, whose Lebesgue measure is $t-s$; if $t\le s$, the same argument gives $s-t$. Hence
\begin{align*}
d_P(f_s,f_t)^2=|s-t|.
\end{align*}
Taking square roots gives
\begin{align*}
d_P(f_s,f_t)=|s-t|^{1/2}.
\end{align*}
Fix $0<\varepsilon\le 1$, and set $\eta=\varepsilon^2/2$. Let $m=\lceil 1/\eta\rceil$ and use the grid points $t_j=j/m$ for $j=0,\dots,m$. Every $t\in[0,1]$ lies within distance at most $1/m\le \eta$ of some grid point, so
\begin{align*}
d_P(f_t,f_{t_j})=|t-t_j|^{1/2}\le \eta^{1/2}<\varepsilon.
\end{align*}
Therefore
\begin{align*}
N(\varepsilon,\mathcal C,d_P)\le m+1.
\end{align*}
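Since $0<\varepsilon\le 1$ gives $1\le 1/\varepsilon^2$, we have $m=\lceil 2/\varepsilon^2\rceil\le 2/\varepsilon^2+1\le 3/\varepsilon^2$, and therefore
\begin{align*}
N(\varepsilon,\mathcal C,d_P)\le m+1\le \frac{3}{\varepsilon^2}+1\le \frac{4}{\varepsilon^2}.
\end{align*}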
Conversely, a $d_P$-ball of radius $\varepsilon$ centred at $f_a$ can contain only those $f_t$ with $|t-a|<\varepsilon^2$. Thus one such ball covers an interval of $t$-values of length at most $2\varepsilon^2$, so covering all of $[0,1]$ requires at least $(2\varepsilon^2)^{-1}$ balls. Hence, for $0<\varepsilon\le 1$,
\begin{align*}
\frac{1}{2\varepsilon^2}\le N(\varepsilon,\mathcal C,d_P)\le \frac{4}{\varepsilon^2}.
\end{align*}
In particular,
\begin{align*}
\sqrt{\log N(\varepsilon,\mathcal C,d_P)}\le \sqrt{\log 4+2\log(1/\varepsilon)}.
\end{align*}
The integral $\int_0^1\sqrt{\log(1/\varepsilon)}\,d\varepsilon$ is finite, since the substitution $u=\log(1/\varepsilon)$ gives $d\varepsilon=-e^{-u}\,du$ and
\begin{align*}
\int_0^1\sqrt{\log(1/\varepsilon)}\,d\varepsilon=\int_0^\infty u^{1/2}e^{-u}\,du<\infty.
\end{align*}
Thus the interval class has only polynomial covering growth, and the entropy integral computed under the uniform law $P$ is finite near zero, which is the low-complexity behaviour behind the Brownian-bridge limit for the uniform empirical process.
[/example]
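The grid construction in the example can be checked directly. The following sketch builds the centres $t_j=j/m$ with $m=\lceil 2/\varepsilon^2\rceil$, verifies on a fine mesh that every $f_t$ is within $d_P$-distance $\varepsilon$ of a centre, and compares the number of centres with the bounds $1/(2\varepsilon^2)$ and $4/\varepsilon^2$. The function names and the mesh size are illustrative assumptions, and the number of centres is only an upper bound on the covering number, not its exact value.

```python
import math

def interval_cover_centres(eps):
    """Grid t_j = j/m with m = ceil(2 / eps^2), as in the example (0 < eps <= 1)."""
    m = math.ceil(2.0 / eps ** 2)
    return [j / m for j in range(m + 1)]

def d_P(s, t):
    """L2(P) distance between the indicators of [0,s] and [0,t] under Unif(0,1)."""
    return math.sqrt(abs(s - t))

def is_cover(centres, eps, mesh=2001):
    """Check on a mesh of t-values that each f_t lies within eps of its nearest centre."""
    m = len(centres) - 1
    for k in range(mesh):
        t = k / (mesh - 1)
        j = min(m, round(t * m))  # index of a nearest grid point
        if d_P(t, centres[j]) >= eps:
            return False
    return True

for eps in (1.0, 0.5, 0.1):
    centres = interval_cover_centres(eps)
    lower, upper = 1.0 / (2.0 * eps ** 2), 4.0 / eps ** 2
    print(eps, len(centres), lower, upper, is_cover(centres, eps))
```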
Large nonparametric classes require this complexity viewpoint. Smoothness assumptions in density estimation and regression can be read as restrictions that reduce entropy enough for uniform convergence or Gaussian approximation.
## Quantile and QQ Plot Fluctuations
The empirical distribution function is a vertical object, while a quantile is obtained by inverting that object horizontally. This inversion is nonlinear: an error of size $\varepsilon$ in $F_n(x)$ can produce a much larger error in the inverse when the graph of $F$ is flat. Thus distribution-function errors are magnified where the density is small, and QQ plots show that magnification directly because they compare empirical quantiles with model quantiles.
[definition: Quantile Function]
Let $F$ be a distribution function on $\mathbb R$. Its left-continuous quantile function is
\begin{align*}
F^{-1}:(0,1)\to\mathbb R, \qquad F^{-1}(t)=\inf\{x\in\mathbb R:F(x)\ge t\}.
\end{align*}
[/definition]
The inverse map is nonlinear, so fluctuations of quantiles depend on the density at the target quantile. Where the density is small, a small vertical error in the distribution function becomes a large horizontal error in the quantile; the next theorem is the functional [delta method](/theorems/1861) for this inverse map.
[quotetheorem:6307]
[citeproof:6307]
This theorem explains why QQ plots are most stable in regions where the fitted distribution has high density. The lower bound on $f_X$ is not a cosmetic assumption: for an exponential distribution near $t=1$, or for any model whose density tends to zero in the tail, the factor $1/f_X(F^{-1}(t))$ becomes large and uniform quantile fluctuations over intervals approaching the endpoint are no longer controlled by the same bounded limit. The differentiability assumption is what permits the inverse map to be linearised; at an atom or a flat stretch of $F$, the inverse is not locally smooth and the Brownian-bridge delta-method argument breaks down. The theorem also gives a direct way to attach a scale to the random vertical deviations from the reference line.
[example: QQ Plot Fluctuations]
Suppose the model distribution $F$ is continuous with positive density $f_X$, and plot the ordered observations $X_{(i)}$ against the model quantiles $F^{-1}(i/(n+1))$. Set
\begin{align*}
t_i=\frac{i}{n+1}, \qquad q_i=F^{-1}(t_i).
\end{align*}
For the empirical quantile function, $F_n^{-1}(t_i)=X_{(i)}$, because
\begin{align*}
i-1<\frac{ni}{n+1}<i,
\end{align*}
so the smallest order statistic index $j$ with $j/n\ge t_i$ is $j=i$.
By *Quantile Process Limit*, for quantile levels away from $0$ and $1$,
\begin{align*}
\sqrt n\{F_n^{-1}(t_i)-F^{-1}(t_i)\}
\approx -\frac{B(t_i)}{f_X(F^{-1}(t_i))}.
\end{align*}
Substituting $F_n^{-1}(t_i)=X_{(i)}$ and $F^{-1}(t_i)=q_i$ gives
\begin{align*}
\sqrt n\{X_{(i)}-q_i\}
\approx -\frac{B(t_i)}{f_X(q_i)}.
\end{align*}
Dividing by $\sqrt n$,
\begin{align*}
X_{(i)}-q_i
\approx -\frac{1}{\sqrt n}\frac{B(t_i)}{f_X(q_i)}.
\end{align*}
Taking absolute values removes the sign:
\begin{align*}
|X_{(i)}-q_i|
\approx \frac{1}{\sqrt n}\frac{|B(t_i)|}{f_X(F^{-1}(t_i))}.
\end{align*}
Thus the same Brownian bridge governs all quantile levels, while the factor $1/f_X(F^{-1}(t_i))$ rescales the visible vertical departures from the QQ reference line. In the tails of a normal QQ plot, $f_X(F^{-1}(t_i))$ is small, so the same bridge fluctuation produces larger plotted deviations.
[/example]
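Since $\operatorname{Var}(B(t))=t(1-t)$, the approximation above suggests a pointwise standard deviation of roughly $\sqrt{t_i(1-t_i)/n}\,/f_X(q_i)$ for $X_{(i)}-q_i$. The sketch below evaluates this scale for a standard normal model using Python's `statistics.NormalDist`. The function name and the sample size $n=199$ are hypothetical, and the tail value should be read with caution because the approximation is stated for quantile levels away from $0$ and $1$.

```python
import math
from statistics import NormalDist

def qq_pointwise_sd(i, n, dist=NormalDist()):
    """Approximate sd of X_(i) - q_i: sqrt(t(1-t)/n) / f(q), with t = i/(n+1)."""
    t = i / (n + 1)
    q = dist.inv_cdf(t)  # model quantile q_i = F^{-1}(t_i)
    return math.sqrt(t * (1.0 - t) / n) / dist.pdf(q)

n = 199
centre = qq_pointwise_sd(100, n)  # t = 0.5, the median
tail = qq_pointwise_sd(2, n)      # t = 0.01, lower tail
print(centre, tail)
```

The tail scale is several times the central scale, matching the wider scatter usually seen at the ends of a normal QQ plot.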
The chapter's main message is that empirical-process convergence is a uniform central limit theorem. Finite-dimensional convergence determines the Gaussian candidate, tightness controls the passage from finitely many coordinates to a whole function class, and entropy gives a usable route to tightness for the classes encountered in nonparametric statistics.
The weak convergence and empirical-process theory developed in Chapter 3 now become operational: Chapter 4 uses the Brownian-bridge limit and the probability integral transform to construct goodness-of-fit tests that are valid without knowing the underlying distribution, showing how empirical-process theory turns consistency and asymptotic theory into practical testing procedures.
# 4. Distribution-Free Goodness-of-Fit Tests
This chapter turns the empirical distribution function from an estimator into a testing device. The prerequisite material is the empirical distribution function, uniform convergence, weak convergence, and the basic language of distribution functions. In Chapters 2 and 3, uniform convergence and Brownian-bridge weak convergence justified treating $F_n$ as a close approximation to the true distribution function $F$; here the question is whether the observed discrepancy from a proposed model is too large to be attributed to sampling variability. The main goal is to understand why certain goodness-of-fit tests have null laws that do not depend on the shape of a fully specified continuous null distribution.
The chapter begins with the Kolmogorov-Smirnov statistic and the probability integral transform, then compares integral discrepancy statistics such as Cramer-von Mises and Anderson-Darling. The final section studies power, separating fixed alternatives from local alternatives that approach the null at the $n^{-1/2}$ scale. These tests connect statistical decision rules to empirical-process theory: the empirical process supplies the limiting random path, while the chosen statistic selects the feature of that path that the test emphasizes. The same ideas also underlie simultaneous confidence bands, quantile-quantile diagnostics, simulation checks for generative models, calibration tests for probabilistic forecasts, stress testing of financial risk models, and validation of environmental or engineering simulations.
## Testing a Fully Specified Continuous Distribution
Suppose we observe i.i.d. real-valued random variables $X_1,\dots,X_n$ and want to test whether their common distribution is a specified continuous distribution function $F_0$. The problem is not to estimate a parameter, but to decide whether the whole distributional shape described by $F_0$ is compatible with the empirical distribution function.
[definition: Empirical Distribution Function]
Let $X_1,\dots,X_n$ be real-valued random variables. The empirical distribution function is the random function $F_n:\mathbb R\to[0,1]$ defined by
\begin{align*}
F_n(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{\{X_i\le x\}}.
\end{align*}
[/definition]
The empirical distribution function gives a data-based curve on the same scale as $F_0$, so goodness-of-fit can be phrased as a distance between two distribution functions. To obtain a test with simultaneous control over the whole real line, we need a statistic that records the largest vertical departure rather than a departure at a preselected point.
[definition: Kolmogorov-Smirnov Statistic]
For a specified distribution function $F_0$, the one-sample Kolmogorov-Smirnov statistic is the map $D_n:\mathbb R^n\to[0,1]$ defined as follows. Given observations $x_1,\dots,x_n\in\mathbb R$ with empirical distribution function $F_n$, set
\begin{align*}
D_n(x_1,\dots,x_n)=\sup_{x\in\mathbb R}|F_n(x)-F_0(x)|.
\end{align*}
The scaled Kolmogorov-Smirnov statistic is the map $\sqrt n D_n:\mathbb R^n\to[0,\sqrt n]$.
[/definition]
This supremum distance is sensitive to the worst discrepancy anywhere on the line. It is especially useful when the alternative changes the location, spread, or central shape of the distribution rather than only a small tail region.
[example: Kolmogorov-Smirnov Statistic From Ordered Data]
Let $X_{(1)}\le\dots\le X_{(n)}$ be the order statistics, and assume first that the observed values are distinct. Adopt the conventions $X_{(0)}=-\infty$ and $X_{(n+1)}=+\infty$. For $0\le i\le n$ and $X_{(i)}\le x<X_{(i+1)}$, exactly $i$ sample points are at or below $x$, so the empirical distribution function is
\begin{align*}
F_n(x)=\frac{i}{n}.
\end{align*}
On this interval the discrepancy is $|i/n-F_0(x)|$. Since $F_0$ is nondecreasing, the maximum over the interval is attained at one of the endpoint limits: if $F_0(x)\le i/n$, the expression decreases as $F_0(x)$ increases, and if $F_0(x)\ge i/n$, it increases as $F_0(x)$ increases.
At the jump point $X_{(i)}$, the right value of the empirical distribution function is
\begin{align*}
F_n(X_{(i)})=\frac{i}{n}.
\end{align*}
The left limit just before that observation is
\begin{align*}
F_n(X_{(i)}-)=\frac{i-1}{n}.
\end{align*}
Because $F_0$ is continuous, its left limit at $X_{(i)}$ equals its value there:
\begin{align*}
F_0(X_{(i)}-)=F_0(X_{(i)}).
\end{align*}
Thus the two endpoint discrepancies associated with $X_{(i)}$ are
\begin{align*}
\left|\frac{i}{n}-F_0(X_{(i)})\right|
\end{align*}
and
\begin{align*}
\left|\frac{i-1}{n}-F_0(X_{(i)})\right|.
\end{align*}
Taking the largest such discrepancy over all jumps gives
\begin{align*}
D_n=\max_{1\le i\le n}\max\left\{\left|\frac{i}{n}-F_0(X_{(i)})\right|,\left|\frac{i-1}{n}-F_0(X_{(i)})\right|\right\}.
\end{align*}
If ties occur, the same endpoint argument is applied at the distinct observed values with jump sizes equal to their multiplicities; under a continuous null distribution, ties occur with probability zero. The formula turns the supremum over $\mathbb R$ into a finite maximum and shows that the transformed ordered values $F_0(X_{(i)})$ are the relevant sample summaries.
[/example]
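The ordered-data formula translates into a few lines of code. This is a minimal sketch assuming a continuous null distribution function and distinct observations; the function name `ks_statistic` and the toy data are illustrative. For the sample $\{0.1,0.4,0.7\}$ tested against $\operatorname{Unif}(0,1)$, the largest gap by hand is $|3/3-0.7|=0.3$, attained at $X_{(3)}$.

```python
def ks_statistic(sample, cdf):
    """One-sample KS statistic D_n via the ordered-data formula.

    Assumes `cdf` is a continuous distribution function and the sample
    values are distinct, as in the example.
    """
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        p = cdf(x)
        # Gap at the jump (i/n) and just before the jump ((i-1)/n).
        d = max(d, abs(i / n - p), abs((i - 1) / n - p))
    return d

def uniform_cdf(x):
    """Distribution function of Unif(0,1)."""
    return min(1.0, max(0.0, x))

d3 = ks_statistic([0.7, 0.1, 0.4], uniform_cdf)
print(d3)  # approximately 0.3
```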
The ordered-data formula shows that all null calculations depend on the sample only through the transformed values $F_0(X_i)$. When the null is true and $F_0$ is continuous, the probability integral transform makes these transformed observations i.i.d. uniform variables. Continuity is essential: atoms create jumps and ties, so the transformed values no longer have the continuous uniform law. For a goodness-of-fit test, we need the stronger conclusion that the whole KS statistic has the same null law after this transformation.
[quotetheorem:6308]
[citeproof:6308]
Continuity is again the hypothesis that prevents atoms from creating ties and non-uniform transformed observations. If $F_0$ were discrete, the null distribution of $D_n$ would depend on the jump sizes of $F_0$, so the same universal table could not be used. The result also assumes the null is fully specified before the data are observed. A concrete counterexample is the normality problem in which $F_0$ is replaced by $\Phi((x-\hat\mu)/\hat\sigma)$ using the same sample. Even when the data are normal, the fitted curve is pulled toward the empirical distribution function, so
\begin{align*}
\sup_x|F_n(x)-\Phi((x-\hat\mu)/\hat\sigma)|
\end{align*}
does not have the same null distribution as the KS statistic for a prespecified normal distribution. The theorem therefore does not say that every plug-in goodness-of-fit statistic is distribution-free; it says that the exact finite-sample reduction works for a continuous null distribution fixed in advance. Finite-sample distribution-freeness gives universal null calibration, but exact critical values are still combinatorial. To obtain simple large-sample critical values and connect with empirical-process theory, we need the limiting distribution of the uniform empirical process.
[definition: Brownian Bridge]
A standard Brownian bridge is a random element $B:\Omega\to C([0,1];\mathbb R)$ such that the coordinate process $(B(t))_{0\le t\le1}$ is mean-zero Gaussian with covariance function
\begin{align*}
\operatorname{Cov}(B(s),B(t))=\min\{s,t\}-st.
\end{align*}
[/definition]
The Brownian bridge records the Gaussian fluctuation of the uniform empirical distribution function $H_n$, that is, the empirical distribution function of the transformed values $F_0(X_i)$, which is constrained by $H_n(0)=0$ and $H_n(1)-1=0$. A pointwise central limit theorem would only describe $H_n(t)-t$ at finitely many fixed values of $t$, and it would not justify taking a supremum over all $t\in[0,1]$. The obstruction is that the KS statistic is a path functional, so the limit must control the empirical process as a whole. The next theorem turns this process limit into the asymptotic null law used for KS critical values.
[quotetheorem:6309]
[citeproof:6309]
The continuity and fully specified null assumptions enter through the finite-sample reduction to the uniform empirical process. If $F_0$ has jumps or if parameters are fitted from the data, the limiting law is generally not the displayed Brownian-bridge supremum. The theorem gives an asymptotic calibration, not an exact finite-sample critical value, so small-sample implementations may still use exact tables or simulation. An immediate application is an empirical confidence band for $F$. If $c_{\alpha}$ is chosen so that $\mathbb P(\sup_{0\le t\le1}|B(t)|>c_{\alpha})=\alpha$, then an approximate simultaneous $1-\alpha$ band is
\begin{align*}
F_n(x)-\frac{c_{\alpha}}{\sqrt n}\le F(x)\le F_n(x)+\frac{c_{\alpha}}{\sqrt n},\qquad x\in\mathbb R,
\end{align*}
with endpoints truncated to $[0,1]$.
[example: Interpreting EDF Confidence Bands]
For $n=100$, the approximate simultaneous $95\%$ KS band based on the asymptotic critical value $c_{0.05}\approx1.36$ is
\begin{align*}
F_n(x)-\frac{c_{0.05}}{\sqrt n}\le F(x)\le F_n(x)+\frac{c_{0.05}}{\sqrt n},\qquad x\in\mathbb R.
\end{align*}
Since
\begin{align*}
\sqrt n=\sqrt{100}=10
\end{align*}
and
\begin{align*}
\frac{c_{0.05}}{\sqrt n}\approx \frac{1.36}{10}=0.136,
\end{align*}
the band has constant half-width approximately $0.136$ at every $x$. A fully specified proposed distribution function $F_0$ is rejected by the corresponding KS rule when there is some $x$ for which
\begin{align*}
|F_n(x)-F_0(x)|>0.136,
\end{align*}
equivalently when $F_0$ leaves the band at one or more points. The control is simultaneous over all $x\in\mathbb R$: one crossing anywhere along the empirical curve counts as rejection, instead of treating many separate pointwise intervals as independent checks.
[/example]
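A small sketch of the band computation follows, using the asymptotic value $c_{0.05}\approx 1.36$ from the example. The helper names `edf` and `ks_band` and the evenly spaced data set of size $n=100$ are hypothetical; the band endpoints are truncated to $[0,1]$.

```python
import math

def edf(sample, x):
    """Empirical distribution function F_n(x)."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_band(sample, x, c_alpha=1.36):
    """Approximate simultaneous band for F(x), truncated to [0, 1]."""
    half = c_alpha / math.sqrt(len(sample))
    centre = edf(sample, x)
    return max(0.0, centre - half), min(1.0, centre + half)

sample = [k / 101 for k in range(1, 101)]  # hypothetical data set with n = 100
half_width = 1.36 / math.sqrt(len(sample))
lower, upper = ks_band(sample, 0.5)
print(half_width, lower, upper)  # about 0.136, 0.364, 0.636
```

Near the ends of the data the truncation is active: at $x=0$ the lower endpoint is $0$ rather than $-0.136$.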
The confidence-band interpretation depends on a fully specified null distribution before seeing the data. When parameters are estimated from the same sample, the transform $F_{\hat\theta}(X_i)$ no longer produces independent uniform variables under the fitted model, and the critical values change.
[example: Testing Normality With Estimated Parameters]
Suppose first that $\mu$ and $\sigma>0$ were fixed before seeing the data and that $X_i\sim\mathcal N(\mu,\sigma^2)$. For $0<u<1$,
\begin{align*}
\mathbb P\left(\Phi\left(\frac{X_i-\mu}{\sigma}\right)\le u\right)=\mathbb P\left(\frac{X_i-\mu}{\sigma}\le \Phi^{-1}(u)\right)=\Phi(\Phi^{-1}(u))=u.
\end{align*}
Thus the transformed observations $\Phi((X_i-\mu)/\sigma)$ are i.i.d. uniform variables, so the ordinary KS null calibration applies in the prespecified case.
Now estimate the parameters from the same data by
\begin{align*}
\hat\mu=\frac{1}{n}\sum_{i=1}^{n}X_i
\end{align*}
and
\begin{align*}
\hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-\hat\mu)^2.
\end{align*}
Set
\begin{align*}
Z_i=\frac{X_i-\hat\mu}{\hat\sigma}.
\end{align*}
The fitted residuals obey
\begin{align*}
\sum_{i=1}^{n}Z_i=\frac{1}{\hat\sigma}\left(\sum_{i=1}^{n}X_i-n\hat\mu\right)=\frac{1}{\hat\sigma}\left(\sum_{i=1}^{n}X_i-\sum_{i=1}^{n}X_i\right)=0.
\end{align*}
They also obey
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}Z_i^2=\frac{1}{n}\sum_{i=1}^{n}\frac{(X_i-\hat\mu)^2}{\hat\sigma^2}=\frac{1}{\hat\sigma^2}\cdot\frac{1}{n}\sum_{i=1}^{n}(X_i-\hat\mu)^2=1.
\end{align*}
Therefore the variables $\Phi(Z_i)$ cannot behave like independent uniform observations: if $U_1,\dots,U_n$ were independent uniforms, then $\Phi^{-1}(U_1),\dots,\Phi^{-1}(U_n)$ would have a continuous joint density, so the exact constraints $\sum_i \Phi^{-1}(U_i)=0$ and $n^{-1}\sum_i\Phi^{-1}(U_i)^2=1$ would occur with probability zero.
Consequently
\begin{align*}
\sup_x |F_n(x)-\Phi((x-\hat\mu)/\hat\sigma)|
\end{align*}
is not governed by the standard KS null distribution. This is the Lilliefors setting: testing a prespecified normal distribution and testing membership in a normal family after fitting its parameters require different critical values, so a parametric bootstrap or Lilliefors-type calibration is needed for a valid normality test.
[/example]
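The effect of fitting parameters can be seen in a small simulation. The sketch below draws standard normal samples and applies the asymptotic 5% critical value $1.36$ to $\sqrt n D_n$ twice: once against the prespecified $\mathcal N(0,1)$ null and once against the fitted normal $\mathcal N(\hat\mu,\hat\sigma^2)$. The function names, the sample size $n=50$, the number of replications, and the seed are illustrative assumptions; this is a demonstration of miscalibration, not a Lilliefors test.

```python
import math
import random
from statistics import NormalDist

def scaled_ks(sample, cdf):
    """sqrt(n) * D_n computed from the ordered-data formula."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        p = cdf(x)
        d = max(d, abs(i / n - p), abs((i - 1) / n - p))
    return math.sqrt(n) * d

def rejection_rates(n=50, reps=1000, crit=1.36, seed=0):
    """Rejection frequencies under normal data for fixed and fitted nulls."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    fixed = fitted = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Prespecified null N(0, 1): the standard KS calibration applies.
        if scaled_ks(x, std_normal.cdf) > crit:
            fixed += 1
        # Fitted null: mean and (1/n) variance estimated from the same sample.
        mu = sum(x) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
        if scaled_ks(x, NormalDist(mu, sigma).cdf) > crit:
            fitted += 1
    return fixed / reps, fitted / reps

fixed_rate, fitted_rate = rejection_rates()
print(fixed_rate, fitted_rate)
```

With the prespecified null the rejection frequency is close to the nominal level, while with fitted parameters it is far smaller, so the standard critical value makes the fitted test very conservative and costs power.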
## Integral Discrepancy Statistics
The supremum statistic asks for the worst vertical gap. A different question is whether the empirical distribution function accumulates a sustained discrepancy across a region, even if no single point has a large deviation.
[definition: Cramer-von Mises Statistic]
For a continuous null distribution function $F_0$, the Cramer-von Mises statistic is the map $W_n^2:\mathbb R^n\to[0,\infty)$ defined as follows. Given observations $x_1,\dots,x_n\in\mathbb R$ with empirical distribution function $F_n$, set
\begin{align*}
W_n^2(x_1,\dots,x_n)=n\int_{\mathbb R}(F_n(x)-F_0(x))^2\,dF_0(x).
\end{align*}
[/definition]
The integral is taken with respect to the null distribution, so after the probability integral transform the statistic becomes an $L^2$ norm over $[0,1]$. This makes it responsive to broad moderate departures from the null.
[example: Computing Cramer-von Mises From Transformed Order Statistics]
Let $U_{(i)}=F_0(X_{(i)})$ be the transformed order statistics under a continuous null, and write $u_i=U_{(i)}$. By the *Probability Integral Transform*, we may compute on the uniform scale. Set $u_0=0$ and $u_{n+1}=1$. On $[u_i,u_{i+1})$, the empirical distribution function of the transformed sample is $i/n$, so
\begin{align*}
W_n^2=n\sum_{i=0}^{n}\int_{u_i}^{u_{i+1}}\left(\frac{i}{n}-t\right)^2\,dt.
\end{align*}
Expanding the square inside the integral gives
\begin{align*}
\left(\frac{i}{n}-t\right)^2=t^2-\frac{2i}{n}t+\frac{i^2}{n^2}.
\end{align*}
Therefore
\begin{align*}
W_n^2=n\sum_{i=0}^{n}\left(\frac{u_{i+1}^3-u_i^3}{3}-\frac{i}{n}(u_{i+1}^2-u_i^2)+\frac{i^2}{n^2}(u_{i+1}-u_i)\right).
\end{align*}
The cubic part telescopes:
\begin{align*}
n\sum_{i=0}^{n}\frac{u_{i+1}^3-u_i^3}{3}=\frac{n}{3}(u_{n+1}^3-u_0^3)=\frac{n}{3}.
\end{align*}
For the quadratic part, the prefactor $n$ cancels the factor $1/n$. Splitting the difference and reindexing the $u_{i+1}^2$ terms by $j=i+1$ and the $u_i^2$ terms by $j=i$ gives
\begin{align*}
-\sum_{i=0}^{n}i(u_{i+1}^2-u_i^2)=-\sum_{j=1}^{n+1}(j-1)u_j^2+\sum_{j=0}^{n}ju_j^2.
\end{align*}
Since $u_0=0$ and $u_{n+1}=1$, this becomes
\begin{align*}
-\sum_{i=0}^{n}i(u_{i+1}^2-u_i^2)=\sum_{j=1}^{n}u_j^2-n.
\end{align*}
For the linear part, the prefactor $n$ leaves a factor $1/n$, and the same reindexing gives
\begin{align*}
\frac{1}{n}\sum_{i=0}^{n}i^2(u_{i+1}-u_i)=\frac{1}{n}\sum_{j=1}^{n+1}(j-1)^2u_j-\frac{1}{n}\sum_{j=0}^{n}j^2u_j.
\end{align*}
Again using $u_0=0$ and $u_{n+1}=1$, we get
\begin{align*}
\frac{1}{n}\sum_{i=0}^{n}i^2(u_{i+1}-u_i)=-\frac{1}{n}\sum_{j=1}^{n}(2j-1)u_j+n.
\end{align*}
Combining the three parts yields
\begin{align*}
W_n^2=\sum_{j=1}^{n}u_j^2-\frac{1}{n}\sum_{j=1}^{n}(2j-1)u_j+\frac{n}{3}.
\end{align*}
To complete the square, use
\begin{align*}
\sum_{j=1}^{n}(2j-1)^2=4\sum_{j=1}^{n}j^2-4\sum_{j=1}^{n}j+\sum_{j=1}^{n}1.
\end{align*}
Substituting $\sum_{j=1}^{n}j=n(n+1)/2$ and $\sum_{j=1}^{n}j^2=n(n+1)(2n+1)/6$ gives
\begin{align*}
\sum_{j=1}^{n}(2j-1)^2=\frac{n(4n^2-1)}{3}.
\end{align*}
Hence
\begin{align*}
\sum_{j=1}^{n}\frac{(2j-1)^2}{4n^2}+\frac{1}{12n}=\frac{4n^2-1}{12n}+\frac{1}{12n}=\frac{n}{3}.
\end{align*}
Substituting this expression for $n/3$ into the previous formula gives
\begin{align*}
W_n^2=\frac{1}{12n}+\sum_{j=1}^{n}\left(u_j^2-\frac{2j-1}{n}u_j+\frac{(2j-1)^2}{4n^2}\right).
\end{align*}
The expression inside the sum is a square:
\begin{align*}
u_j^2-\frac{2j-1}{n}u_j+\frac{(2j-1)^2}{4n^2}=\left(u_j-\frac{2j-1}{2n}\right)^2.
\end{align*}
Therefore
\begin{align*}
W_n^2=\frac{1}{12n}+\sum_{j=1}^{n}\left(U_{(j)}-\frac{2j-1}{2n}\right)^2.
\end{align*}
Thus the statistic compares the transformed order statistics with the evenly spaced midpoint grid $(2j-1)/(2n)$, and it becomes large when the transformed sample is systematically shifted, too concentrated, or too dispersed relative to uniformity.
[/example]
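The identity derived in the example can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `cvm_order_stat` and `cvm_piecewise` are illustrative and not part of the course notation. The first function uses the order-statistic formula, the second integrates $(i/n-t)^2$ exactly on each interval $[u_i,u_{i+1})$.

```python
import numpy as np

def cvm_order_stat(u):
    """Cramer-von Mises statistic from transformed values u_i = F0(x_i),
    using 1/(12n) + sum_j (u_(j) - (2j-1)/(2n))^2."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    j = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((u - (2 * j - 1) / (2 * n)) ** 2)

def cvm_piecewise(u):
    """Same statistic from the definition: n times the sum of the exact
    integrals of (i/n - t)^2 over [u_i, u_{i+1}), with u_0 = 0, u_{n+1} = 1."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    knots = np.concatenate(([0.0], u, [1.0]))
    c = np.arange(n + 1) / n            # empirical cdf value on each interval
    lo, hi = knots[:-1], knots[1:]
    # An antiderivative of (c - t)^2 is -(c - t)^3 / 3.
    pieces = ((c - lo) ** 3 - (c - hi) ** 3) / 3.0
    return n * np.sum(pieces)

rng = np.random.default_rng(0)          # fixed seed, purely for reproducibility
u = rng.uniform(size=25)
print(cvm_order_stat(u), cvm_piecewise(u))
```

The two printed values should agree up to floating-point rounding. A sample lying exactly on the midpoint grid gives the minimum value $1/(12n)$.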
The order-statistic formula shows that the statistic measures a global squared deviation on the uniform scale. A naive calibration based on independent squared errors at the grid points would be wrong, because the empirical distribution function values are strongly dependent and constrained by the same sample. The Brownian bridge encodes both the covariance structure and the endpoint constraint. To calibrate the test asymptotically, we need the corresponding Brownian-bridge functional rather than the supremum functional used for KS.
[quotetheorem:6310]
[citeproof:6310]
The continuous-null hypothesis is what permits the transformation to the uniform probability scale. If the null has atoms, the integral weights and the empirical jumps depend on the actual jump pattern, so the displayed universal limit need not apply. For a concrete failure, take $F_0$ to put mass $1/2$ at $0$ and mass $1/2$ at $1$. Then $F_n-F_0$ is determined by the binomial count of observations equal to $0$, and $W_n^2$ has a limit built from a single normal fluctuation with weights at the two atoms, not from $\int_0^1 B(t)^2\,dt$.
Full specification is a separate requirement. If the proposed null is the exponential family $F_\lambda(x)=1-e^{-\lambda x}$ for $x\ge0$ and $\lambda$ is estimated by $\hat\lambda=1/\bar X$, then the statistic formed with $F_{\hat\lambda}$ is no longer the same functional of an unconditioned uniform empirical process. Under the exponential null, the fitted rate removes the scale direction from the empirical fluctuations, so the limit is a projected bridge depending on the estimation method rather than the displayed $\int_0^1B(t)^2\,dt$. The theorem therefore does not license using the same Cramer-von Mises critical values after fitting parameters from the data.
The theorem is a calibration result: it identifies the large-sample null distribution used to choose critical values under a continuous fully specified null. It is not an optimality theorem, so it does not say that Cramer-von Mises has the best power against every alternative. It is also not a finite-sample exactness statement; exact null calculations still depend on finite-$n$ empirical distribution functions, while the theorem describes their limit. The Cramer-von Mises statistic weights all parts of the null probability scale equally, which can understate discrepancies in the tails, where few observations fall and the empirical distribution function has small variance. This motivates the Anderson-Darling statistic, which adds sensitivity near $F_0(x)=0$ or $F_0(x)=1$.
[definition: Anderson-Darling Statistic]
For a continuous null distribution function $F_0$, the Anderson-Darling statistic is the map $A_n^2:\mathbb R^n\to[0,\infty]$ defined as follows. Given observations $x_1,\dots,x_n\in\mathbb R$ with empirical distribution function $F_n$, set
\begin{align*}
A_n^2(x_1,\dots,x_n)=n\int_{\mathbb R}\frac{(F_n(x)-F_0(x))^2}{F_0(x)(1-F_0(x))}\,dF_0(x),
\end{align*}
where the integral is interpreted over points with $0<F_0(x)<1$.
[/definition]
The denominator is the pointwise variance scale of the empirical distribution function under the null. As a result, a discrepancy of a given absolute size receives more weight in the tails than near the median.
[example: Comparing Exponential And Weibull Tails]
Let the exponential null be
\begin{align*}
F_0(x)=1-e^{-\lambda x},\qquad x\ge 0,
\end{align*}
and suppose the true lifetime distribution is Weibull with shape $k>1$ and scale $\eta$:
\begin{align*}
F(x)=1-\exp\left(-(x/\eta)^k\right),\qquad x\ge 0.
\end{align*}
On the null probability scale, $t=F_0(x)$ gives
\begin{align*}
1-t=e^{-\lambda x}.
\end{align*}
Taking logarithms gives
\begin{align*}
-\log(1-t)=\lambda x.
\end{align*}
Hence
\begin{align*}
x=\frac{-\log(1-t)}{\lambda}.
\end{align*}
At this same point, the true Weibull upper tail is
\begin{align*}
1-F(x)=\exp\left(-(x/\eta)^k\right).
\end{align*}
Substituting $x=-\log(1-t)/\lambda$ gives
\begin{align*}
1-F(x)=\exp\left(-\left(\frac{-\log(1-t)}{\lambda\eta}\right)^k\right).
\end{align*}
The exponential null upper tail is
\begin{align*}
1-F_0(x)=1-t=\exp(-[-\log(1-t)]).
\end{align*}
Therefore
\begin{align*}
\frac{1-F(x)}{1-F_0(x)}=\frac{\exp\left(-\left(\frac{-\log(1-t)}{\lambda\eta}\right)^k\right)}{\exp(-[-\log(1-t)])}.
\end{align*}
Combining the exponents gives
\begin{align*}
\frac{1-F(x)}{1-F_0(x)}=\exp\left(-\frac{[-\log(1-t)]^k}{(\lambda\eta)^k}+[-\log(1-t)]\right).
\end{align*}
As $t\uparrow1$, the quantity $y=-\log(1-t)$ satisfies $y\to\infty$. Since $k>1$, the term $-y^k/(\lambda\eta)^k$ dominates the linear term $y$, so
\begin{align*}
-\frac{y^k}{(\lambda\eta)^k}+y\to-\infty.
\end{align*}
Thus the Weibull upper tail is eventually much lighter than the exponential upper tail on the transformed scale.
For Anderson-Darling, the transformed statistic has the form
\begin{align*}
A_n^2=n\int_0^1\frac{(H_n(t)-t)^2}{t(1-t)}\,dt.
\end{align*}
Near the upper tail, $t$ is close to $1$, so $t(1-t)$ is small and the weight $1/[t(1-t)]$ is large. For instance, if $|H_n(t)-t|\ge\delta$ on $[1-\varepsilon,1-\varepsilon/2]$, then that interval contributes at least
\begin{align*}
n\delta^2\int_{1-\varepsilon}^{1-\varepsilon/2}\frac{1}{t(1-t)}\,dt.
\end{align*}
Since
\begin{align*}
\frac{1}{t(1-t)}=\frac{1}{t}+\frac{1}{1-t},
\end{align*}
an antiderivative is
\begin{align*}
\log t-\log(1-t).
\end{align*}
Therefore the contribution is at least
\begin{align*}
n\delta^2\left(\log(1-\varepsilon/2)-\log(\varepsilon/2)-\log(1-\varepsilon)+\log\varepsilon\right).
\end{align*}
Combining the logarithms gives
\begin{align*}
n\delta^2\log\left(\frac{(1-\varepsilon/2)\varepsilon}{(1-\varepsilon)(\varepsilon/2)}\right)=n\delta^2\log\left(\frac{2-\varepsilon}{1-\varepsilon}\right).
\end{align*}
This is why Anderson-Darling reacts strongly to exponential-versus-Weibull tail mismatch: discrepancies at transformed values near $1$ are deliberately amplified rather than averaged on the same scale as central discrepancies.
[/example]
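The transformed Anderson-Darling integral can be evaluated exactly for a given sample, because $H_n$ is constant between order statistics and $(c-t)^2/[t(1-t)]=c^2/t+(1-c)^2/(1-t)-1$. The sketch below, assuming NumPy, does this in `ad_piecewise`. The function `ad_order_stat` uses the widely quoted computing formula $-n-\frac1n\sum_j(2j-1)[\log u_{(j)}+\log(1-u_{(n+1-j)})]$, which is not derived in the text above and is treated here as an assumption to be checked against the integral. Both function names are illustrative.

```python
import numpy as np

def ad_piecewise(u):
    """Anderson-Darling statistic on the uniform scale, by exact integration
    of (c - t)^2 / (t (1 - t)) on each interval where H_n equals c = i/n."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    knots = np.concatenate(([0.0], u, [1.0]))
    total = 0.0
    for i in range(n + 1):
        c, lo, hi = i / n, knots[i], knots[i + 1]
        part = -(hi - lo)                                   # from the "-1" term
        if c > 0:                                           # skip 0 * log(0) on the first interval
            part += c ** 2 * (np.log(hi) - np.log(lo))
        if c < 1:                                           # skip 0 * log(0) on the last interval
            part += (1 - c) ** 2 * (np.log(1 - lo) - np.log(1 - hi))
        total += part
    return n * total

def ad_order_stat(u):
    """Assumed standard computing formula in terms of the order statistics."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    j = np.arange(1, n + 1)
    return -n - np.sum((2 * j - 1) * (np.log(u) + np.log(1 - u[::-1]))) / n

rng = np.random.default_rng(0)          # fixed seed, purely for reproducibility
u = rng.uniform(size=30)                # transformed sample, values strictly in (0, 1)
print(ad_piecewise(u), ad_order_stat(u))
```

Both functions assume every transformed value lies strictly inside $(0,1)$; a value equal to $0$ or $1$ makes the statistic infinite.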
These integral statistics are still distribution-free for continuous fully specified nulls, because the only input after transformation is a uniform empirical process. The choice among KS, Cramer-von Mises, and Anderson-Darling is therefore not about the null calibration, but about which alternatives the statistic is designed to detect efficiently.
[remark: Supremum Versus Integral Tests]
The KS statistic is geometrically simple and gives direct uniform confidence bands. Cramer-von Mises averages squared discrepancies and is often more sensitive to distributed deviations. Anderson-Darling uses variance weighting and is especially attentive to tail departures.
[/remark]
## Power Against Fixed Alternatives
After constructing a level-$\alpha$ test, the next question is whether it will detect a false distribution when the sample size grows. Fixed alternatives are distributions $F\ne F_0$ that do not change with $n$, so the deterministic separation between $F$ and $F_0$ should eventually dominate random empirical fluctuations.
[definition: Consistency Of A Test]
For testing $H_0:F=F_0$ against alternatives $F\in\mathcal A$, a sequence of tests is a sequence of measurable maps $\varphi_n:\mathbb R^n\to\{0,1\}$, where $\varphi_n=1$ denotes rejection. The sequence is consistent against $F\in\mathcal A$ if
\begin{align*}
\mathbb E_F[\varphi_n]\to1.
\end{align*}
It is consistent against $\mathcal A$ if it is consistent against every $F\in\mathcal A$.
[/definition]
Consistency gives the target property, but it remains to verify that the KS rejection rule has it. The key issue is whether a positive deterministic distance $\sup_x|F(x)-F_0(x)|$ can be hidden by empirical noise as $n$ grows.
[quotetheorem:6311]
[citeproof:6311]
The assumption $\Delta>0$ is the formal way of saying that the true and null distribution functions are separated somewhere on the line. If $F=F_0$, then $\Delta=0$ and the same argument gives no divergence, as expected under the null. The bounded-critical-value condition matches ordinary fixed-level KS calibration; if the critical values grew like $\sqrt n$, the conclusion could fail. This proof explains why consistency is a large-sample guarantee rather than a promise of high power at small sample sizes. If $\Delta$ is small or concentrated where the sample has little information, many observations may be needed before deterministic separation dominates noise.
[example: Small Fixed Departure]
Let $F_0(x)=\Phi(x)$ and let $F(x)=\Phi(x/1.05)$, where $\Phi$ is the standard normal distribution function. For $x>0$, we have $x/1.05<x$, so monotonicity of $\Phi$ gives
\begin{align*}
|F(x)-F_0(x)|=\Phi(x)-\Phi(x/1.05).
\end{align*}
Using the symmetry identity $\Phi(-y)=1-\Phi(y)$,
\begin{align*}
|F(-x)-F_0(-x)|=|1-\Phi(x/1.05)-1+\Phi(x)|.
\end{align*}
Thus
\begin{align*}
|F(-x)-F_0(-x)|=\Phi(x)-\Phi(x/1.05),
\end{align*}
so it is enough to maximize
\begin{align*}
g(x)=\Phi(x)-\Phi(x/1.05),\qquad x\ge0.
\end{align*}
Writing $\phi(y)=(2\pi)^{-1/2}e^{-y^2/2}$ for the standard normal density, differentiation gives
\begin{align*}
g'(x)=\phi(x)-\frac{1}{1.05}\phi(x/1.05).
\end{align*}
At an interior critical point,
\begin{align*}
\phi(x)=\frac{1}{1.05}\phi(x/1.05).
\end{align*}
Substituting the formula for $\phi$ and canceling the common factor $(2\pi)^{-1/2}$ gives
\begin{align*}
e^{-x^2/2}=\frac{1}{1.05}e^{-x^2/(2\cdot1.05^2)}.
\end{align*}
Taking logarithms gives
\begin{align*}
-\frac{x^2}{2}=-\log(1.05)-\frac{x^2}{2\cdot1.05^2}.
\end{align*}
Moving the $x^2$ terms to one side gives
\begin{align*}
\log(1.05)=\frac{x^2}{2}\left(1-\frac{1}{1.05^2}\right).
\end{align*}
Therefore
\begin{align*}
x^2=\frac{2\log(1.05)}{1-1/1.05^2}.
\end{align*}
Numerically this gives
\begin{align*}
x\approx1.024.
\end{align*}
Hence the maximal separation is approximately
\begin{align*}
\Delta=\sup_x|F(x)-F_0(x)|\approx \Phi(1.024)-\Phi(1.024/1.05)\approx0.0118.
\end{align*}
This separation is positive, so the scaled deterministic gap $\sqrt n\,\Delta$ tends to infinity. For instance,
\begin{align*}
\sqrt{100}\,\Delta\approx10\cdot0.0118=0.118,
\end{align*}
while
\begin{align*}
\sqrt{10000}\,\Delta\approx100\cdot0.0118=1.18.
\end{align*}
The KS test is therefore consistent against this fixed alternative, but moderate samples can still have limited power because the largest vertical difference between the two distribution functions is only about $1.2\%$.
[/example]
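The numbers in this example can be reproduced with the standard library alone. The sketch below evaluates the closed-form critical point, the resulting separation $\Delta$, and the scaled gap $\sqrt n\,\Delta$ for a few sample sizes; the helper name `Phi` and the grid used for the brute-force check are illustrative choices.

```python
import math

def Phi(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

SCALE = 1.05  # the alternative is F(x) = Phi(x / SCALE)

# Critical point from the first-order condition: x^2 = 2 log(s) / (1 - 1/s^2).
x_star = math.sqrt(2.0 * math.log(SCALE) / (1.0 - 1.0 / SCALE ** 2))
delta = Phi(x_star) - Phi(x_star / SCALE)

# Brute-force check of the maximum on a grid over [0, 5].
grid_max = max(Phi(k / 1000) - Phi(k / 1000 / SCALE) for k in range(5001))

print(x_star, delta, grid_max)
for n in (100, 10_000, 1_000_000):
    print(n, math.sqrt(n) * delta)      # deterministic gap on the sqrt(n) scale
```

The scaled gap only reaches the order of typical fixed-level KS critical values once $n$ is in the tens of thousands, which is the quantitative content of the remark about limited power at moderate sample sizes.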
The example shows that fixed-alternative consistency depends on eventual deterministic separation, not on the test being powerful for every finite sample size. For integral statistics, the analogous argument requires positive squared separation on a set that receives nonzero weight under the null measure.
[quotetheorem:6312]
[citeproof:6312]
The positive integral condition is needed because the statistic only sees discrepancies on sets that carry $F_0$-measure. If $F$ and $F_0$ differ only on a set with zero $F_0$-measure, this weighted integral argument cannot detect the difference, even though a supremum statistic might. The theorem does not compare finite-sample power or claim that Cramer-von Mises dominates KS; it only gives eventual rejection under the stated weighted separation. The condition in this theorem highlights a distinction between supremum and weighted integral tests. The KS statistic reacts to any point of maximal distribution-function separation, while an integral statistic reacts to separation on a set with positive weight under the measure used in the integral.
## Local Alternatives And Sensitivity
Consistency describes what happens when the alternative stays fixed, but it does not compare tests at the scale where rejection probabilities are neither near the size nor near one. To study sensitivity, alternatives are often placed at distance $n^{-1/2}$ from the null, matching the stochastic size of the empirical process.
[definition: Local Distribution-Function Alternative]
Let $F_0:\mathbb R\to[0,1]$ be a continuous distribution function. A sequence of alternatives $(F_{n,\mathrm{alt}})$ is local at rate $n^{-1/2}$ with drift $h$ if each $F_{n,\mathrm{alt}}:\mathbb R\to[0,1]$ is a distribution function and
\begin{align*}
F_{n,\mathrm{alt}}(x)=F_0(x)+\frac{h(F_0(x))}{\sqrt n}+r_n(x),
\end{align*}
where $h:[0,1]\to\mathbb R$ is bounded with $h(0)=h(1)=0$, $\sup_x|r_n(x)|=o(n^{-1/2})$, and the resulting functions $F_{n,\mathrm{alt}}$ are nondecreasing, right-continuous, satisfy $\lim_{x\to-\infty}F_{n,\mathrm{alt}}(x)=0$, and satisfy $\lim_{x\to\infty}F_{n,\mathrm{alt}}(x)=1$.
[/definition]
The definition puts deterministic departures on the same scale as empirical noise, so the limiting power is nondegenerate. To compare KS, Cramer-von Mises, and Anderson-Darling at this scale, we need to see how the empirical process changes under such local alternatives.
[quotetheorem:6313]
[citeproof:6313]
The empirical-process convergence assumption is the substantive regularity input; without it, the centered fluctuations under the alternatives need not be Brownian-bridge-like. The continuity of $F_0$ is also part of the mechanism, not a cosmetic assumption. For a concrete counterexample, let $F_0$ put mass $1/2$ at $0$ and mass $1/2$ at $1$, and let the local alternatives perturb only the mass at $0$ by $n^{-1/2}\eta$. The centered empirical process is then governed by a one-dimensional binomial fluctuation at the atom, so the limiting object is a Gaussian variable attached to a jump rather than a continuous Brownian bridge indexed by $t\in[0,1]$. Thus the theorem does not describe local alternatives for arbitrary discontinuous nulls.
Even when all alternatives are continuous, a moving spike whose width shrinks too quickly can break the uniform drift condition: the discrepancy stays large on a shrinking interval even though it vanishes at every fixed point. The uniform drift assumption is therefore necessary, because pointwise convergence of the alternatives would not justify applying supremum or integral functionals to the whole path. The theorem does not prove contiguity for a particular parametric or semiparametric model; it states the precise process-level conditions under which the usual shifted-bridge calculation is valid. For the purposes of this course, the important message is operational: different statistics summarize the same shifted empirical process in different ways.
[example: Local Central Shift Versus Local Tail Shift]
Under a local alternative with drift $h$, the limiting empirical-process path is shifted from $B(t)$ to $B(t)+h(t)$. Thus the deterministic part seen by Cramer-von Mises is measured through
\begin{align*}
\int_0^1 h(t)^2\,dt,
\end{align*}
while the deterministic part seen by Anderson-Darling is measured through
\begin{align*}
\int_0^1\frac{h(t)^2}{t(1-t)}\,dt.
\end{align*}
For a central departure, suppose for illustration that $h(t)=\delta$ on
\begin{align*}
I_c=\left[\frac12-\varepsilon,\frac12+\varepsilon\right]
\end{align*}
and is zero outside this interval. Then the Cramer-von Mises squared drift over this interval is
\begin{align*}
\int_{I_c}h(t)^2\,dt=\int_{1/2-\varepsilon}^{1/2+\varepsilon}\delta^2\,dt
=\delta^2\left(\frac12+\varepsilon-\frac12+\varepsilon\right)
=2\varepsilon\delta^2.
\end{align*}
The Anderson-Darling weighted contribution is
\begin{align*}
\int_{I_c}\frac{h(t)^2}{t(1-t)}\,dt
=\delta^2\int_{1/2-\varepsilon}^{1/2+\varepsilon}\frac{1}{t(1-t)}\,dt.
\end{align*}
Since
\begin{align*}
\frac{1}{t(1-t)}=\frac{1}{t}+\frac{1}{1-t},
\end{align*}
we get
\begin{align*}
\int\frac{1}{t(1-t)}\,dt=\log t-\log(1-t),
\end{align*}
and hence
\begin{align*}
\int_{I_c}\frac{h(t)^2}{t(1-t)}\,dt
=\delta^2\left[ \log t-\log(1-t)\right]_{1/2-\varepsilon}^{1/2+\varepsilon}
=2\delta^2\log\left(\frac{1/2+\varepsilon}{1/2-\varepsilon}\right).
\end{align*}
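This is a moderate reweighting when the departure is near $t=1/2$: for small $\varepsilon$ the logarithm is approximately $4\varepsilon$, so the Anderson-Darling contribution is about $8\varepsilon\delta^2$, four times the Cramer-von Mises contribution, matching the weight $1/[t(1-t)]=4$ at $t=1/2$.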
For a tail departure, suppose instead that $h(t)=\delta$ on
\begin{align*}
I_u=[1-2\varepsilon,1-\varepsilon]
\end{align*}
and is zero outside this interval. The Cramer-von Mises squared drift is
\begin{align*}
\int_{I_u}h(t)^2\,dt
=\int_{1-2\varepsilon}^{1-\varepsilon}\delta^2\,dt
=\varepsilon\delta^2.
\end{align*}
The Anderson-Darling weighted contribution is
\begin{align*}
\int_{I_u}\frac{h(t)^2}{t(1-t)}\,dt
=\delta^2\left[\log t-\log(1-t)\right]_{1-2\varepsilon}^{1-\varepsilon}.
\end{align*}
Evaluating the endpoints gives
\begin{align*}
\int_{I_u}\frac{h(t)^2}{t(1-t)}\,dt
=\delta^2\left(\log(1-\varepsilon)-\log\varepsilon-\log(1-2\varepsilon)+\log(2\varepsilon)\right),
\end{align*}
so
\begin{align*}
\int_{I_u}\frac{h(t)^2}{t(1-t)}\,dt
=\delta^2\log\left(\frac{2(1-\varepsilon)}{1-2\varepsilon}\right).
\end{align*}
As $\varepsilon\downarrow0$, the Cramer-von Mises contribution $\varepsilon\delta^2$ tends to $0$, while the Anderson-Darling contribution tends to $\delta^2\log 2$. Thus a discrepancy squeezed into an upper-tail probability interval can be nearly invisible to an unweighted integral statistic but still receive non-negligible Anderson-Darling weight. This is why tail-sensitive tests are useful for lifetime and reliability data, where the practical question often concerns rare long survival times or early failures.
[/example]
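The closed forms in this example are easy to tabulate. The sketch below, using only the standard library, evaluates the four drift contributions and checks the tail formula against a midpoint-rule quadrature; all function names are illustrative.

```python
import math

def cvm_central(eps, delta):
    return 2 * eps * delta ** 2

def ad_central(eps, delta):
    return 2 * delta ** 2 * math.log((0.5 + eps) / (0.5 - eps))

def cvm_tail(eps, delta):
    return eps * delta ** 2

def ad_tail(eps, delta):
    return delta ** 2 * math.log(2 * (1 - eps) / (1 - 2 * eps))

def ad_numeric(a, b, delta, m=100_000):
    """Midpoint-rule value of the integral of delta^2 / (t (1 - t)) over [a, b]."""
    dt = (b - a) / m
    total = 0.0
    for k in range(m):
        t = a + (k + 0.5) * dt
        total += 1.0 / (t * (1.0 - t))
    return delta ** 2 * total * dt

for eps in (0.1, 0.01, 0.001):
    ratio_central = ad_central(eps, 1.0) / cvm_central(eps, 1.0)
    print(eps, cvm_tail(eps, 1.0), ad_tail(eps, 1.0), ratio_central)
```

As $\varepsilon$ shrinks, the printed Cramer-von Mises tail contribution goes to zero, the Anderson-Darling tail contribution approaches $\log 2$, and the central ratio approaches $4$.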
Local power also clarifies why there is no universally best goodness-of-fit test. A statistic optimized for a broad central departure can be inefficient for tail alternatives, while a tail-weighted statistic may pay extra variance for alternatives concentrated near the middle.
[remark: Composite Nulls And Bootstrap Calibration]
For composite null families such as normal, exponential, or Weibull models with estimated parameters, local alternatives interact with the estimation step. The fitted parameters remove certain directions of discrepancy, so the limiting process is a projected Brownian bridge rather than the original bridge. In applications, parametric bootstrap calibration often gives the most direct way to account for this projection.
[/remark]
The chapter's central conclusion is that distribution-free goodness-of-fit testing is empirical-process theory made operational. The probability integral transform supplies null universality, Brownian bridges supply asymptotic calibration, and the choice of supremum or integral functional determines which departures are emphasized.
While Chapter 4 used the empirical distribution function as a tool for testing, the next chapter uses it as a starting point for estimating a new quantity: the probability density. Kernel density estimation replaces the step-function empirical measure with a smooth kernel bump at each observation, introducing the smoothing philosophy that will dominate the remaining chapters.
# 5. Kernel Density Estimation
Kernel density estimation is the first sustained smoothing example in the course: it takes the empirical measure from Chapters 0 and 2 and replaces each point mass by a kernel bump. The empirical distribution function estimates probabilities of half-lines without choosing a smoothing scale; a density estimator must instead recover local mass by averaging observations near the point of interest. The main questions are how the kernel and bandwidth control bias and variance, how these local calculations aggregate into integrated risk, and why the optimal bandwidth depends on the smoothness of the unknown density.
## Kernels, Bandwidths, and Local Averaging
The empirical measure places point mass $1/n$ at each observation, so it cannot itself be a density with respect to Lebesgue measure. To estimate a density, we replace each point mass by a small bump and add the bumps together. The bandwidth sets the spatial scale of the bump; the kernel determines its shape and moment cancellations.
[definition: Kernel]
Let $K:\mathbb R \to \mathbb R$ be an integrable function. The function $K$ is a kernel if
\begin{align*}
\int_{\mathbb R} K(u)\,d\mathcal L^1(u) = 1.
\end{align*}
The support of $K$ is the [closed set](/page/Closed%20Set) $\operatorname{supp}K = \overline{\{u \in \mathbb R : K(u) \neq 0\}}$.
[/definition]
Many kernels used in statistics are nonnegative and symmetric, but nonnegativity is not part of the minimal analytic definition. Symmetry removes first-order bias terms, and compact support makes local computations depend only on nearby observations.
[example: Three Standard Kernels]
The box, Epanechnikov, and Gaussian kernels are
\begin{align*}
K_{\mathrm{box}}(u)=\frac{1}{2}\mathbb{1}_{[-1,1]}(u),\quad K_{\mathrm{Epa}}(u)=\frac{3}{4}(1-u^2)\mathbb{1}_{[-1,1]}(u),\quad K_{\mathrm{Gau}}(u)=(2\pi)^{-1/2}e^{-u^2/2}.
\end{align*}
For the box kernel,
\begin{align*}
\int_{\mathbb R}K_{\mathrm{box}}(u)\,d\mathcal L^1(u)=\frac{1}{2}\int_{-1}^{1}1\,du=\frac{1}{2}(1-(-1))=1.
\end{align*}
For the Epanechnikov kernel,
\begin{align*}
\int_{\mathbb R}K_{\mathrm{Epa}}(u)\,d\mathcal L^1(u)=\frac{3}{4}\int_{-1}^{1}(1-u^2)\,du.
\end{align*}
The two elementary integrals are
\begin{align*}
\int_{-1}^{1}1\,du=2
\end{align*}
and
\begin{align*}
\int_{-1}^{1}u^2\,du=\left[\frac{u^3}{3}\right]_{-1}^{1}=\frac{1}{3}-\left(-\frac{1}{3}\right)=\frac{2}{3}.
\end{align*}
Therefore
\begin{align*}
\int_{\mathbb R}K_{\mathrm{Epa}}(u)\,d\mathcal L^1(u)=\frac{3}{4}\left(2-\frac{2}{3}\right)=\frac{3}{4}\cdot\frac{4}{3}=1.
\end{align*}
For the Gaussian kernel, the [Gaussian integral](/theorems/1140) is obtained by squaring the integral and using polar coordinates:
\begin{align*}
\left(\int_{\mathbb R}e^{-u^2/2}\,du\right)^2=\int_{\mathbb R^2}e^{-(x^2+y^2)/2}\,dx\,dy=\int_{0}^{2\pi}\int_{0}^{\infty}e^{-r^2/2}r\,dr\,d\theta=2\pi.
\end{align*}
The radial integral used here is
\begin{align*}
\int_{0}^{\infty}e^{-r^2/2}r\,dr=\left[-e^{-r^2/2}\right]_{0}^{\infty}=1.
\end{align*}
Thus $\int_{\mathbb R}e^{-u^2/2}\,du=\sqrt{2\pi}$, and
\begin{align*}
\int_{\mathbb R}K_{\mathrm{Gau}}(u)\,d\mathcal L^1(u)=(2\pi)^{-1/2}\sqrt{2\pi}=1.
\end{align*}
The first two kernels vanish outside $[-1,1]$ and are nonzero on $(-1,1)$, so $\operatorname{supp}K_{\mathrm{box}}=\operatorname{supp}K_{\mathrm{Epa}}=[-1,1]$. The Gaussian kernel satisfies $K_{\mathrm{Gau}}(u)>0$ for every $u\in\mathbb R$, so it assigns positive weight at every distance.
For the Epanechnikov kernel, the squared-integral constant $R(K)=\int_{\mathbb R}K(u)^2\,d\mathcal L^1(u)$ is
\begin{align*}
R(K_{\mathrm{Epa}})=\int_{-1}^{1}\left(\frac{3}{4}(1-u^2)\right)^2\,du=\frac{9}{16}\int_{-1}^{1}(1-2u^2+u^4)\,du.
\end{align*}
Using $\int_{-1}^{1}u^2\,du=2/3$ and $\int_{-1}^{1}u^4\,du=2/5$ gives
\begin{align*}
R(K_{\mathrm{Epa}})=\frac{9}{16}\left(2-2\cdot\frac{2}{3}+\frac{2}{5}\right)=\frac{9}{16}\cdot\frac{16}{15}=\frac{3}{5}.
\end{align*}
Its second moment $\mu_2(K)=\int_{\mathbb R}u^2K(u)\,d\mathcal L^1(u)$ is
\begin{align*}
\mu_2(K_{\mathrm{Epa}})=\frac{3}{4}\int_{-1}^{1}u^2(1-u^2)\,du=\frac{3}{4}\left(\frac{2}{3}-\frac{2}{5}\right)=\frac{1}{5}.
\end{align*}
A box kernel on $[-a,a]$ with total mass $1$ is $K_a(u)=(2a)^{-1}\mathbb{1}_{[-a,a]}(u)$. Its second moment is
\begin{align*}
\mu_2(K_a)=\frac{1}{2a}\int_{-a}^{a}u^2\,du=\frac{1}{2a}\cdot\frac{2a^3}{3}=\frac{a^2}{3}.
\end{align*}
Matching the Epanechnikov second moment gives $a^2/3=1/5$, hence $a=\sqrt{3/5}$. For this matched box kernel,
\begin{align*}
R(K_a)=\int_{-a}^{a}\left(\frac{1}{2a}\right)^2\,du=\frac{1}{4a^2}\cdot 2a=\frac{1}{2a}=\frac{\sqrt{5/3}}{2}.
\end{align*}
Finally,
\begin{align*}
\frac{\sqrt{5/3}}{2}>\frac{3}{5}
\end{align*}
because both sides are positive and squaring gives
\begin{align*}
\frac{5}{12}>\frac{9}{25},
\end{align*}
which is equivalent to $125>108$. Thus, after matching the same second moment, the Epanechnikov kernel has smaller $R(K)=\int K^2\,d\mathcal L^1$ than the corresponding box kernel; this smaller squared-mass constant is the feature that makes it appear in optimal MISE constant calculations, while the bandwidth still controls the main smoothing scale.
[/example]
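The constants computed by hand in this example can be confirmed by quadrature. The following is a minimal sketch, assuming NumPy; the midpoint rule, the integration ranges, and the helper names are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def box(u):
    return 0.5 * (np.abs(u) <= 1)

def epa(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

def gau(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def midpoint_integral(g, lo, hi, m=200_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    du = (hi - lo) / m
    u = lo + (np.arange(m) + 0.5) * du
    return float(np.sum(g(u)) * du)

def constants(K, lo, hi):
    """Return (total mass, second moment mu_2, squared integral R) of K."""
    mass = midpoint_integral(K, lo, hi)
    mu2 = midpoint_integral(lambda u: u ** 2 * K(u), lo, hi)
    R = midpoint_integral(lambda u: K(u) ** 2, lo, hi)
    return mass, mu2, R

a = np.sqrt(3 / 5)                      # box half-width matching mu_2 = 1/5

def matched_box(u):
    return (np.abs(u) <= a) / (2 * a)

print("box        ", constants(box, -1, 1))
print("Epanechnikov", constants(epa, -1, 1))
print("Gaussian   ", constants(gau, -10, 10))
print("matched box", constants(matched_box, -a, a))
```

The Epanechnikov row should be close to $(1, 1/5, 3/5)$, and the matched box row should show the same second moment with a larger value of $R$.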
The examples show possible bump shapes, but an estimator also needs a rule for putting a bump at each data point and choosing its width. This leads to the scaled kernel average, whose normalisation turns local probability mass into density height.
[definition: Kernel Density Estimator]
Let $X_1,\dots,X_n$ be real-valued random variables and let $K$ be a kernel. For a bandwidth $h>0$, the kernel density estimator is the random function
\begin{align*}
\hat f_{n,h}:\mathbb R\to\mathbb R, \qquad
x\mapsto \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right).
\end{align*}
For each sample outcome $\omega$, the realised estimator is the function $\hat f_{n,h}(\omega):\mathbb R\to\mathbb R$ given by the same formula with $X_i=X_i(\omega)$.
[/definition]
The factor $h^{-1}$ is forced by the change of variables $u=(x-y)/h$. With this normalisation, $\int \hat f_{n,h}\,d\mathcal L^1=1$ whenever $K$ integrates to one, so nonnegative kernels produce genuine densities.
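The definition translates directly into a few lines of array code. The following is a minimal sketch, assuming NumPy and a Gaussian kernel by default; the function name `kde`, the simulated sample, and the evaluation grid are illustrative. It also checks numerically that the estimate integrates to one.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Evaluate (1 / (n h)) * sum_i K((x - X_i) / h) at each point of x."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h     # shape (len(x), n)
    return kernel(u).sum(axis=1) / (data.size * h)

rng = np.random.default_rng(1)               # fixed seed, purely for reproducibility
sample = rng.normal(size=200)                # simulated data for illustration
grid = np.linspace(-8.0, 8.0, 4001)
fhat = kde(grid, sample, h=0.4)

dx = grid[1] - grid[0]
total_mass = float(np.sum(fhat) * dx)        # Riemann sum for the integral of fhat
print(total_mass)
```

The pairwise-difference array costs memory proportional to the number of grid points times $n$, which is fine for illustration but would need chunking or binning for large samples.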
[example: Mixture Density Estimation]
For the mixture density
\begin{align*}
f(x)=0.7\frac{1}{\sqrt{2\pi}}e^{-x^2/2}+0.3\frac{1}{\sqrt{2\pi(1/4)}}e^{-(x-4)^2/(2\cdot 1/4)},
\end{align*}
the first component has standard deviation $1$, while the second has standard deviation $1/2$. For the Gaussian kernel $K(u)=(2\pi)^{-1/2}e^{-u^2/2}$, the scaled bump centred at $y$ is
\begin{align*}
\frac{1}{h}K\left(\frac{x-y}{h}\right)=\frac{1}{\sqrt{2\pi}h}e^{-(x-y)^2/(2h^2)}.
\end{align*}
If $Y\sim\mathcal N(m,\sigma^2)$, then the contribution of that component to the expected KDE is
\begin{align*}
\int_{\mathbb R}\frac{1}{\sqrt{2\pi}h}e^{-(x-y)^2/(2h^2)}\frac{1}{\sqrt{2\pi}\sigma}e^{-(y-m)^2/(2\sigma^2)}\,dy.
\end{align*}
Completing the square gives
\begin{align*}
\frac{(x-y)^2}{h^2}+\frac{(y-m)^2}{\sigma^2}=\frac{h^2+\sigma^2}{h^2\sigma^2}\left(y-\frac{\sigma^2x+h^2m}{h^2+\sigma^2}\right)^2+\frac{(x-m)^2}{h^2+\sigma^2}.
\end{align*}
Therefore the component contribution equals
\begin{align*}
\frac{1}{2\pi h\sigma}e^{-(x-m)^2/(2(h^2+\sigma^2))}\int_{\mathbb R}\exp\left\{-\frac{h^2+\sigma^2}{2h^2\sigma^2}\left(y-\frac{\sigma^2x+h^2m}{h^2+\sigma^2}\right)^2\right\}\,dy.
\end{align*}
Using $\int_{\mathbb R}e^{-a(y-b)^2/2}\,dy=\sqrt{2\pi/a}$ for $a>0$, with $a=(h^2+\sigma^2)/(h^2\sigma^2)$, this becomes
\begin{align*}
\frac{1}{2\pi h\sigma}e^{-(x-m)^2/(2(h^2+\sigma^2))}\sqrt{\frac{2\pi h^2\sigma^2}{h^2+\sigma^2}}=\frac{1}{\sqrt{2\pi(h^2+\sigma^2)}}e^{-(x-m)^2/(2(h^2+\sigma^2))}.
\end{align*}
Thus
\begin{align*}
\mathbb E[\hat f_{n,h}(x)]=0.7\frac{1}{\sqrt{2\pi(1+h^2)}}e^{-x^2/(2(1+h^2))}+0.3\frac{1}{\sqrt{2\pi(1/4+h^2)}}e^{-(x-4)^2/(2(1/4+h^2))}.
\end{align*}
The bandwidth adds $h^2$ to each component variance. At the centre $x=4$ of the narrow component, its peak height changes from $0.3/\sqrt{2\pi(1/4)}$ to $0.3/\sqrt{2\pi(1/4+h^2)}$, so the multiplicative height factor is
\begin{align*}
\frac{0.3/\sqrt{2\pi(1/4+h^2)}}{0.3/\sqrt{2\pi(1/4)}}=\frac{1/2}{\sqrt{1/4+h^2}}.
\end{align*}
For $h=1/2$, this factor is
\begin{align*}
\frac{1/2}{\sqrt{1/4+1/4}}=\frac{1}{\sqrt 2},
\end{align*}
and for $h=1$, it is
\begin{align*}
\frac{1/2}{\sqrt{1/4+1}}=\frac{1}{\sqrt5}.
\end{align*}
Large bandwidths therefore flatten the narrow mode near $4$ and can merge the two components, while very small bandwidths keep deterministic smoothing small but leave many sample-level bumps. This example shows that bandwidth selection must respect the different spatial scales present in the target density.
[/example]
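The closed form for $\mathbb E[\hat f_{n,h}(x)]$ can be compared with a direct numerical convolution. The sketch below assumes NumPy; the integration range, grid size, and function names are illustrative.

```python
import numpy as np

def normal_pdf(x, m, var):
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture(x):
    """Target density 0.7 N(0, 1) + 0.3 N(4, 1/4)."""
    return 0.7 * normal_pdf(x, 0.0, 1.0) + 0.3 * normal_pdf(x, 4.0, 0.25)

def expected_kde_closed_form(x, h):
    """Each component variance is inflated by h^2."""
    return (0.7 * normal_pdf(x, 0.0, 1.0 + h ** 2)
            + 0.3 * normal_pdf(x, 4.0, 0.25 + h ** 2))

def expected_kde_numeric(x, h, lo=-12.0, hi=16.0, m=200_001):
    """Riemann sum for the integral of (1/h) K((x - y)/h) f(y) dy, Gaussian K."""
    y = np.linspace(lo, hi, m)
    dy = y[1] - y[0]
    integrand = normal_pdf(x, y, h ** 2) * mixture(y)
    return float(np.sum(integrand) * dy)

def narrow_peak_factor(h):
    """Multiplicative change in the height of the narrow component at x = 4."""
    return 0.5 / np.sqrt(0.25 + h ** 2)

for h in (0.1, 0.5, 1.0):
    print(h, expected_kde_closed_form(4.0, h), expected_kde_numeric(4.0, h),
          narrow_peak_factor(h))
```

For $h=1/2$ and $h=1$ the last column should reproduce the factors $1/\sqrt2$ and $1/\sqrt5$ from the example.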
The mixture example shows that smoothing can blur features, so the next task is to identify which algebraic properties of $K$ reduce deterministic smoothing error. Moment conditions provide the cancellation mechanism used in the Taylor expansion of the bias.
[definition: Kernel Order]
Let $s \in \mathbb N$. A kernel $K:\mathbb R\to\mathbb R$ has order $s$ if
\begin{align*}
\int_{\mathbb R} K(u)\,d\mathcal L^1(u)=1,
\end{align*}
\begin{align*}
\int_{\mathbb R} u^j K(u)\,d\mathcal L^1(u)=0 \quad \text{for } 1\le j\le s-1,
\end{align*}
\begin{align*}
\int_{\mathbb R} |u|^s |K(u)|\,d\mathcal L^1(u)<\infty.
\end{align*}
[/definition]
For symmetric kernels, all odd moments vanish when they exist. A nonnegative kernel cannot have order greater than $2$, because $\int_{\mathbb R}u^2K(u)\,d\mathcal L^1(u)>0$ whenever $K\ge0$ integrates to one, so higher-order kernels must take negative values.
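As an illustration of the order conditions, the sketch below checks numerically that the Gaussian-based function $K_4(u)=\tfrac12(3-u^2)\phi(u)$ integrates to one, has vanishing first, second, and third moments, and has a nonzero fourth moment, so it satisfies the definition with $s=4$. This kernel is an assumed example that does not appear in the text, and the helper names and quadrature settings are illustrative; NumPy is assumed.

```python
import numpy as np

def gau(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def k4(u):
    """Illustrative fourth-order kernel (3 - u^2) / 2 * phi(u); negative for |u| > sqrt(3)."""
    return 0.5 * (3 - u ** 2) * gau(u)

def moment(K, j, lo=-12.0, hi=12.0, m=400_000):
    """Midpoint-rule approximation of the integral of u^j K(u)."""
    du = (hi - lo) / m
    u = lo + (np.arange(m) + 0.5) * du
    return float(np.sum(u ** j * K(u)) * du)

moments = [moment(k4, j) for j in range(5)]
print(moments)          # moments of order 0, 1, 2, 3, 4
print(k4(2.0))          # a point where the kernel is negative
```

The negative lobes are the price of cancelling the second moment, and they mean the resulting density estimate can dip below zero.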
[remark: Boundary Behavior]
If the target density is supported on $[0,1]$, then a symmetric kernel centred near $0$ or $1$ puts mass outside the support. Near a boundary, the usual interior cancellation argument integrates over only part of the kernel, so the bias can be of order $1$ or $h$ rather than the interior order. Common repairs include boundary kernels, reflection methods, and local polynomial density estimators.
[/remark]
## Pointwise Bias and Variance
The next problem is to quantify what the smoothing scale does at a fixed point $x$. The expectation of the estimator is a convolution-smoothed version of $f$, so bias is controlled by smoothness of $f$ and moments of $K$. The variance is controlled by the effective number of observations falling in an interval of length $h$, which is approximately $nh$.
[quotetheorem:6314]
[citeproof:6314]
The bias formula isolates deterministic smoothing error and shows exactly where the assumptions enter. The moment conditions are not decorative: without them, lower-order Taylor terms survive and dominate the advertised $h^s$ rate. Smoothness of $f$ near $x$ is also local and essential, since a cusp or jump at $x$ cannot be controlled by the same Taylor expansion. The theorem is pointwise and interior in spirit; it does not by itself handle boundary truncation or uniform error over many $x$. After the deterministic part is understood, the next question is how much random fluctuation remains around this smoothed expectation.
Concrete failures show why these hypotheses cannot be silently dropped. If $s=2$ but $K$ is not symmetric, then $\int uK(u)\,d\mathcal L^1(u)\neq0$ can occur and the bias contains an $h f'(x)$ term before the $h^2$ term. If $f(t)=|t|$ and $x=0$, the second derivative needed for the usual second-order expansion does not exist, and a symmetric kernel produces a bias of order $h\int |u|K(u)\,d\mathcal L^1(u)$ rather than order $h^2$. Compact support is the localisation condition used in this statement: for an unbounded-support kernel, values of $f$ far from $x$ enter through $f(x-hu)$, so local differentiability near $x$ alone does not control the integral without an additional tail domination assumption. Bias alone cannot decide whether the estimator is accurate, because making $h$ smaller reduces smoothing error while making each kernel bump taller; the next calculation measures the random fluctuation created by that taller local average.
[quotetheorem:6315]
[citeproof:6315]
The variance calculation explains the probabilistic side of locality. Independence is what turns the variance of the sum into $1/n$ times a single-summand variance; dependent observations can have covariance terms of the same order as the leading term. The assumption $K\in L^2(\mathbb R)$ is also structural, since $R(K)$ is the finite constant measuring how concentrated the bump is after squaring. Continuity of $f$ at $x$ lets the local average of $f(x-hu)$ collapse to $f(x)$; near a discontinuity, the limiting constant may instead depend on one-sided behaviour and kernel mass on each side. Combining the bias and variance calculations gives the local tradeoff: for a second-order symmetric kernel and twice differentiable density, the squared bias is of order $h^4$, while the variance is of order $(nh)^{-1}$.
The assumptions can fail in different ways. If $K(u)=c|u|^{-1/2}\mathbb{1}_{(0,1)}(u)$ with $c$ chosen so that $\int K\,d\mathcal L^1=1$, then $K\notin L^2(\mathbb R)$ and the constant $R(K)$ is infinite, so the displayed variance formula has no finite leading coefficient. If $f$ has a jump at $x$, for instance $f(t)=\mathbb{1}_{[0,1]}(t)$ at $x=0$ after ignoring endpoint normalisation conventions, then a symmetric compactly supported kernel sees positive density only on the right-hand side of $x$, and the limiting constant is $R(K)/2$ rather than $f(0)R(K)=R(K)$. If $X_i=X_1$ for every $i$, the sum has no averaging gain from independence and the covariance terms prevent the $1/n$ variance reduction.
[example: Oversmoothing and Undersmoothing at a Point]
Let $K$ be symmetric of order $2$, and suppose $f$ is twice differentiable near $x$. By *Pointwise Bias Expansion* with $s=2$,
\begin{align*}
\mathbb E[\hat f_{n,h}(x)]-f(x)
=\frac{h^2}{2}f''(x)\mu_2(K)+o(h^2),
\end{align*}
because $(-1)^2=1$ and $2!=2$. Thus, when $f''(x)>0$ and $\mu_2(K)>0$, the leading bias term is positive:
\begin{align*}
\frac{h^2}{2}f''(x)\mu_2(K)>0.
\end{align*}
At a valley, the expected KDE is therefore lifted above the true density value; at a peak, where $f''(x)<0$, the same formula gives a negative leading bias and the expected KDE is pulled downward.
The random fluctuation is governed by *Pointwise Variance Calculation*:
\begin{align*}
\operatorname{Var}(\hat f_{n,h}(x))
=\frac{f(x)}{nh}R(K)+o\left(\frac{1}{nh}\right).
\end{align*}
The leading squared bias is
\begin{align*}
\left(\frac{h^2}{2}f''(x)\mu_2(K)\right)^2
=\frac{h^4}{4}\{f''(x)\}^2\mu_2(K)^2,
\end{align*}
so deterministic distortion grows like $h^4$ in squared error. Decreasing $h$ reduces this deterministic term, but the leading variance term contains $1/(nh)$, so smaller bandwidths make the estimator more variable. In a plot, oversmoothing corresponds to the large-bias side of this tradeoff, with flattened peaks or filled valleys, while undersmoothing corresponds to the large-variance side, with narrow sample-level fluctuations.
[/example]
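As a numerical sanity check on the example, the following sketch compares the exact bias integral $\int K(u)\{f(x-hu)-f(x)\}\,du$ with the leading term $\tfrac{h^2}{2}f''(x)\mu_2(K)$. The standard normal density, the Epanechnikov kernel, the midpoint quadrature helper, and all function names are illustrative assumptions, not objects fixed by the theorems above.

```python
import math

def epanechnikov(u):
    """Symmetric second-order kernel supported on [-1, 1] (illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def normal_pdf_dd(t):
    """Second derivative of the standard normal density."""
    return (t * t - 1.0) * normal_pdf(t)

def integrate(g, a, b, m=4000):
    """Composite midpoint rule on [a, b] with m cells."""
    w = (b - a) / m
    return w * sum(g(a + (i + 0.5) * w) for i in range(m))

# Kernel constants mu_2(K) and R(K), computed numerically.
MU2 = integrate(lambda u: u * u * epanechnikov(u), -1.0, 1.0)
RK = integrate(lambda u: epanechnikov(u) ** 2, -1.0, 1.0)

def exact_bias(x, h):
    """E[f_hat(x)] - f(x), written as an integral against the kernel."""
    return integrate(
        lambda u: epanechnikov(u) * (normal_pdf(x - h * u) - normal_pdf(x)),
        -1.0, 1.0,
    )

def leading_bias(x, h):
    return 0.5 * h * h * normal_pdf_dd(x) * MU2

def leading_variance(x, h, n):
    return normal_pdf(x) * RK / (n * h)

# x = 0 is the peak (f'' < 0); x = 2 lies in a convex region (f'' > 0).
for x in (0.0, 2.0):
    for h in (0.8, 0.4, 0.2, 0.1):
        print(x, h, exact_bias(x, h), leading_bias(x, h), leading_variance(x, h, 1000))
```

Under these assumptions the exact and leading bias should agree more closely as $h$ shrinks, while the leading variance grows like $1/h$, which is the tradeoff the example describes.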
## Consistency and Asymptotic Normality
The preceding formulas suggest two bandwidth requirements: $h\to0$ to remove smoothing bias, and $nh\to\infty$ to average enough observations locally. These conditions are the basic asymptotic regime for pointwise kernel density estimation.
[quotetheorem:6316]
[citeproof:6316]
The two bandwidth conditions have different jobs. The condition $h_n\to0$ removes the deterministic smoothing bias; if $h_n$ stays bounded away from zero, the estimator converges to a smoothed version of $f$ rather than to $f(x)$. The condition $nh_n\to\infty$ ensures that the effective local sample size diverges; if $nh_n$ stays bounded, the variance need not vanish. This theorem gives pointwise convergence in probability only, so it does not provide uniform convergence over $x$, convergence of modes, or a confidence interval. For inference at a fixed point, the KDE must be treated as an average of triangular-array summands, and a central limit theorem applies after centring by its expectation.
The kernel and continuity assumptions also have concrete roles. If the kernel does not integrate to $1$, the expectation converges to a multiple of $f(x)$ rather than to $f(x)$. If $K\notin L^2(\mathbb R)$, the variance control used above may be infinite, as for $K(u)=c|u|^{-1/2}\mathbb{1}_{(0,1)}(u)$. If $f$ is discontinuous at $x$, a symmetric kernel estimates a local average of the one-sided limits; for a density with a jump at $0$, the limit can be the midpoint of the left and right limits rather than either point value. Consistency gives convergence but not its distributional scale, so the next theorem refines the centred stochastic term into the normal approximation used for fixed-point inference.
[quotetheorem:6317]
[citeproof:6317]
The hypotheses are tuned to a fixed-point normal approximation. Boundedness of $K$ is a convenient way to verify the Lindeberg condition; kernels with large spikes may require separate tail control before the same conclusion can be trusted. The assumption $f(x)>0$ prevents the limiting variance from degenerating, since at points where the density vanishes the usual $\sqrt{nh_n}$ scaling can collapse. The condition $nh_n\to\infty$ is again the effective local sample-size condition, and the theorem says nothing about simultaneous coverage over a range of $x$. The centring issue is important in applications: bandwidths that are optimal for mean squared error often leave non-negligible bias at the central-limit scale, so confidence intervals usually require undersmoothing, bias correction, or a higher-order estimator.
Specific counterexamples clarify the list. If $f(x)=0$ at an isolated boundary point of the support, the limiting variance in the displayed scaling is $0$ and the normal approximation degenerates. If $nh_n$ stays bounded, the number of observations in the effective window does not diverge and a Poisson-type local count limit can replace the Gaussian limit. If $K$ has unbounded spikes, the maximum summand in the triangular array need not vanish, so Lindeberg can fail even when the variance is finite. If the bias condition is not imposed, for example with a second-order kernel and the MISE-optimal scale $h_n\asymp n^{-1/5}$, the centred-by-$f(x)$ statistic has a nonzero asymptotic mean rather than the displayed centred normal law.
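The last counterexample can be made concrete with a short calculation. The sketch below multiplies the leading bias $\tfrac{h^2}{2}f''(x)\mu_2(K)$ by the central-limit scale $\sqrt{nh}$ for two bandwidth exponents. The function name and the placeholder values of `curvature` (standing in for $f''(x)$) and `mu2` are hypothetical and chosen only for illustration.

```python
import math

def clt_scaled_bias(n, exponent, curvature=1.0, mu2=0.2):
    """sqrt(n h) times the leading bias (h^2 / 2) f''(x) mu_2(K), with h = n^(-exponent).

    curvature and mu2 are placeholder values, not estimates from data.
    """
    h = n ** (-exponent)
    return math.sqrt(n * h) * 0.5 * h * h * curvature * mu2

for n in (10 ** 3, 10 ** 5, 10 ** 7):
    mse_scale = clt_scaled_bias(n, 1.0 / 5.0)     # h ~ n^(-1/5): scaled bias stays constant
    undersmoothed = clt_scaled_bias(n, 1.0 / 3.0) # h ~ n^(-1/3): scaled bias decays like n^(-1/3)
    print(n, mse_scale, undersmoothed)
```

With $h_n\asymp n^{-1/5}$ the product $\sqrt{nh_n}\,h_n^2$ equals a constant, so the scaled bias does not vanish; with the smaller bandwidth $n^{-1/3}$ it tends to zero, which is the undersmoothing remedy mentioned above.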
## Integrated Risk and Bandwidth Rates
Pointwise error is not the only way to measure a density estimate. A common global loss is integrated squared error, and its expectation is the mean integrated squared error. This criterion turns the local bias and variance calculations into a single bandwidth choice.
[definition: Mean Integrated Squared Error]
Let $f\in L^2(\mathbb R)$ be a density. The integrated squared error functional associated to $f$ is the map
\begin{align*}
\operatorname{ISE}_f:L^2(\mathbb R)\to[0,\infty], \qquad
\operatorname{ISE}_f(g)=\int_{\mathbb R}(g(x)-f(x))^2\,d\mathcal L^1(x).
\end{align*}
For a random density estimator $\hat f_{n,h}:\Omega\to L^2(\mathbb R)$, the mean integrated squared error functional associated to $f$ is
\begin{align*}
\operatorname{MISE}_f:\{\text{$L^2(\mathbb R)$-valued random density estimators}\}\to[0,\infty], \qquad
\operatorname{MISE}_f(\hat f_{n,h})=\mathbb E[\operatorname{ISE}_f(\hat f_{n,h})].
\end{align*}
[/definition]
The definition gives the global loss, but bandwidth choice requires a deterministic leading approximation rather than the exact random error. The pointwise bias and variance formulas only become useful after we know that their remainders remain controlled when integrated over the whole line. The main issue is therefore whether smoothness of the density and square-integrability of the kernel are enough to separate MISE into integrated squared bias and integrated variance.
[quotetheorem:6318]
[citeproof:6318]
The expansion is useful because it turns a random global loss into two deterministic leading terms, but its assumptions carry real content. The $L^2$ condition on $f^{(s)}$ is what makes the squared bias integrable, and translation continuity in $L^2$ prevents pointwise Taylor errors from accumulating into a leading-order integrated error. The condition $K\in L^2(\mathbb R)$ is needed for a finite integrated variance constant. If the density has a kink, such as $f(x)=ce^{-|x|}$, the second derivative contains singular behaviour at $0$ and the integrated squared bias need not follow the $h^4\|f''\|_{L^2}^2$ law. If the support has an uncorrected boundary, such as the uniform density on $[0,1]$ with a symmetric kernel, boundary bias contributes order $h$ to the integrated squared bias, not the interior order. If $f^{(s)}\notin L^2(\mathbb R)$ because the tail derivatives decay too slowly, the displayed integrated bias constant is infinite or undefined. If $K\notin L^2(\mathbb R)$, the integrated variance constant $R(K)$ is infinite, so the variance term is not $R(K)/(nh)$. Under the stated regularity, the bandwidth problem reduces to minimising a deterministic leading approximation, and the next result solves that optimisation.
[quotetheorem:6319]
[citeproof:6319]
The assumptions $A>0$ and $B>0$ exclude degenerate cases in the leading approximation. A concrete $A=0$ case is a density whose relevant derivative vanishes a.e. on the region governed by the model, or more simply a kernel with $\mu_s(K)=0$, whose true order exceeds $s$; then the displayed leading bias term vanishes and the optimal balance must be computed from the first nonzero higher-order bias term. If $B=0$, the kernel would have no squared mass and could not be a nonzero kernel in $L^2$. Failure of the leading approximation also changes the answer: for the uniform density on $[0,1]$ with an uncorrected symmetric kernel, boundary bias gives a different leading bias order, while for a kinked density the nominal $h^{2s}$ integrated squared-bias term may be replaced by a term with a smaller power of $h$, which dominates as $h\to0$. The formula also optimises only the asymptotic leading expression, not the exact finite-sample MISE, so constants, boundary effects, and pilot estimates of unknown quantities still matter in practice. The bandwidth rate $n^{-1/(2s+1)}$, and the resulting MISE rate $n^{-2s/(2s+1)}$, express the cost of estimating an infinite-dimensional object: higher smoothness permits a larger effective neighbourhood without excessive bias, but the cost of local averaging remains visible through the $+1$ in the denominator.
[example: Second-Order Bandwidth Scale]
For a symmetric second-order kernel, set $s=2$ in *Optimal Bandwidth Rate for s Smooth Densities*. The leading MISE approximation has the form
\begin{align*}
Ah^{2s}+\frac{B}{nh}=Ah^4+\frac{B}{nh},
\end{align*}
where
\begin{align*}
A=\frac{\mu_2(K)^2}{(2!)^2}\|f''\|_{L^2}^2=\frac{\mu_2(K)^2}{4}\|f''\|_{L^2}^2
\end{align*}
and
\begin{align*}
B=R(K).
\end{align*}
The optimizing bandwidth formula gives
\begin{align*}
h_{\mathrm{MISE}}=\left(\frac{B}{2sA}\right)^{1/(2s+1)}n^{-1/(2s+1)}.
\end{align*}
Substituting $s=2$ gives $2s=4$ and $2s+1=5$, hence
\begin{align*}
h_{\mathrm{MISE}}=\left(\frac{B}{4A}\right)^{1/5}n^{-1/5}.
\end{align*}
Thus the leading bandwidth scale is $h\asymp n^{-1/5}$, with an unknown constant depending on $R(K)$, $\mu_2(K)$, and $\|f''\|_{L^2}$.
At this bandwidth, the squared-bias contribution is
\begin{align*}
Ah_{\mathrm{MISE}}^4=A\left[\left(\frac{B}{4A}\right)^{1/5}n^{-1/5}\right]^4.
\end{align*}
Raising each factor to the fourth power gives
\begin{align*}
Ah_{\mathrm{MISE}}^4=A\left(\frac{B}{4A}\right)^{4/5}n^{-4/5}.
\end{align*}
The variance contribution is
\begin{align*}
\frac{B}{nh_{\mathrm{MISE}}}=\frac{B}{n\left(\frac{B}{4A}\right)^{1/5}n^{-1/5}}.
\end{align*}
Combining the powers of $n$ in the denominator gives $n\cdot n^{-1/5}=n^{4/5}$, so
\begin{align*}
\frac{B}{nh_{\mathrm{MISE}}}=B\left(\frac{B}{4A}\right)^{-1/5}n^{-4/5}.
\end{align*}
Both leading terms are therefore constant multiples of $n^{-4/5}$, so the leading MISE is of order $n^{-4/5}$.
If the sample size is doubled from $n$ to $2n$, the bandwidth factor changes by
\begin{align*}
\frac{(2n)^{-1/5}}{n^{-1/5}}=2^{-1/5}.
\end{align*}
In practice the unknown curvature factor $\|f''\|_{L^2}$ must be estimated or replaced by a reference rule, such as the value obtained under a Gaussian reference density. The exponent $1/5$ is small, so even doubling the sample size changes the second-order optimal bandwidth only by the modest multiplicative factor $2^{-1/5}$.
[/example]
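To illustrate the reference-rule remark, the sketch below evaluates the bandwidth formula from the example with a Gaussian kernel and a Gaussian reference density $N(\mu,\sigma^2)$. It assumes the standard closed forms $\mu_2(K)=1$, $R(K)=1/(2\sqrt\pi)$, and $\|f''\|_{L^2}^2=3/(8\sqrt\pi\,\sigma^5)$ for that pair; the function names are hypothetical.

```python
import math

def h_mise(n, A, B, s=2):
    """Minimiser of the leading expression A h^(2s) + B / (n h) over h > 0."""
    return (B / (2 * s * A)) ** (1.0 / (2 * s + 1)) * n ** (-1.0 / (2 * s + 1))

def gaussian_reference_constants(sigma):
    """Constants A and B for a Gaussian kernel and an N(mu, sigma^2) reference density."""
    mu2 = 1.0                                              # second moment of the Gaussian kernel
    roughness = 1.0 / (2.0 * math.sqrt(math.pi))           # R(K)
    curvature = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)  # ||f''||_2^2
    A = mu2 ** 2 / 4.0 * curvature
    B = roughness
    return A, B

A, B = gaussian_reference_constants(sigma=1.0)
for n in (100, 200, 400):
    h = h_mise(n, A, B)
    # Report the bandwidth and the two leading MISE contributions.
    print(n, h, A * h ** 4, B / (n * h))
```

Under these assumptions the constant $(B/(4A))^{1/5}$ reduces to $(4/3)^{1/5}\sigma\approx1.06\,\sigma$, and at the optimum the squared-bias term equals one quarter of the variance term, since the first-order condition is $4Ah^5=B/n$.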
## Kernel Shape, Diagnostics, and Practical Interpretation
The last question is how much the kernel shape matters compared with the bandwidth. The asymptotic formulas show that the kernel enters through low-dimensional summaries such as $\mu_s(K)$ and $R(K)$, while the bandwidth controls the dominant scale of smoothing.
[remark: Kernel Choice Versus Bandwidth Choice]
For standard symmetric second-order kernels, changing $K$ usually has a smaller practical effect than changing $h$. Compactly supported kernels reduce computation and make locality explicit, while Gaussian kernels give smoother-looking estimates and avoid discontinuities at the edge of the kernel support. The MISE constants distinguish kernels, but the bandwidth rate is unchanged within the same order class.
[/remark]
Bandwidth diagnostics are best read through the [bias-variance decomposition](/theorems/1424) rather than through visual smoothness alone. Features that persist over a range of bandwidths are more credible than features that appear only for a narrow undersmoothed choice.
[example: Reading a Bandwidth Sweep]
Suppose the sweep uses bandwidths $h_1>h_2>\cdots>h_m$ with a symmetric second-order kernel. At a fixed interior point $x$ where $f$ is twice differentiable, the leading deterministic term has the form
\begin{align*}
\mathbb E[\hat f_{n,h}(x)]-f(x)
=\frac{h^2}{2}f''(x)\mu_2(K)+o(h^2).
\end{align*}
Thus, near the mode of a unimodal density where $f''(x)<0$ and $\mu_2(K)>0$, the leading bias is negative:
\begin{align*}
\frac{h^2}{2}f''(x)\mu_2(K)<0.
\end{align*}
Large $h$ therefore pulls the expected estimate downward at the peak. In a skewed density with a long right tail, the same wide averaging window also spreads central mass toward nearby tail locations, so the displayed curve has a flattened peak and a tail that is too heavily blended with the centre.
The stochastic scale moves in the opposite direction:
\begin{align*}
\operatorname{Var}(\hat f_{n,h}(x))
=\frac{f(x)}{nh}R(K)+o\left(\frac{1}{nh}\right).
\end{align*}
If the bandwidth is halved, the leading squared bias changes by
\begin{align*}
\frac{\left((h/2)^2 f''(x)\mu_2(K)/2\right)^2}{\left(h^2 f''(x)\mu_2(K)/2\right)^2}
=\frac{(h/2)^4}{h^4}
=\frac{1}{16},
\end{align*}
while the leading variance changes by
\begin{align*}
\frac{f(x)/(n(h/2))}{f(x)/(nh)}
=\frac{2f(x)/(nh)}{f(x)/(nh)}
=2.
\end{align*}
So moving down the bandwidth grid reduces deterministic smoothing error but increases random fluctuation.
In a bandwidth sweep, an intermediate range is credible when the peak location, skewness, and tail decay remain nearly unchanged for several adjacent values of $h$. At very small $h$, the variance factor $1/(nh)$ is large, so narrow bumps caused by individual observations appear and disappear as $h$ changes. Those unstable local peaks are evidence of undersmoothing, while the overly flat large-$h$ curves are evidence of oversmoothing.
[/example]
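A bandwidth sweep of this kind can be sketched in a few lines. The example below counts the local maxima of a Gaussian-kernel KDE on a grid for a decreasing sequence of bandwidths. The simulated standard normal sample, the seed, the grid, the bandwidth values, and the function names are all illustrative assumptions.

```python
import math
import random

def gaussian_kde(grid, data, h):
    """Evaluate a one-dimensional Gaussian-kernel KDE at each grid point."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        for x in grid
    ]

def count_local_maxima(values):
    """Number of interior grid points that rise from the left and do not rise to the right."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    )

def sweep(data, bandwidths, grid):
    """Return (h, number of grid-level modes) for each bandwidth."""
    return [(h, count_local_maxima(gaussian_kde(grid, data, h))) for h in bandwidths]

rng = random.Random(0)                                  # fixed seed for reproducibility
data = [rng.gauss(0.0, 1.0) for _ in range(200)]        # simulated unimodal sample
grid = [-4.0 + 8.0 * i / 400 for i in range(401)]
results = sweep(data, [1.0, 0.5, 0.25, 0.1, 0.03], grid)
for h, modes in results:
    print(h, modes)
```

Since the sampled density is unimodal, a mode count that climbs sharply at the smallest bandwidths is the numerical counterpart of the unstable local peaks described in the example. The count depends on the grid resolution, so it is a diagnostic rather than an exact number of modes.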
This chapter establishes the template used repeatedly in nonparametric statistics: define a local averaging estimator, expand its bias using smoothness, compute its variance from the effective local sample size, and choose the tuning parameter by balancing the two. The same pattern will reappear in local polynomial regression, nonparametric confidence intervals, and minimax lower-bound comparisons.
Chapter 5 developed the pointwise bias-variance theory for kernel density estimators and the basic consistency logic through careful choice of bandwidth. Chapter 6 extends these pointwise results to uniform statements over the domain, showing how dimension, smoothness classes, and bandwidth simultaneously determine the accuracy of nonparametric smoothing methods.
# 6. Uniform Theory for Kernel Estimators
Chapter 5 introduced kernel smoothing, pointwise bias-variance calculations, and the basic consistency logic for density estimation. This chapter turns from those pointwise calculations to statements that hold uniformly over ranges of evaluation points, with emphasis on how smoothness, bandwidth, sample size, and dimension determine the accuracy of kernel estimators. Uniform theory asks how large the worst error can be, how boundary effects alter that error, and how dimension changes the bandwidth and sample-size requirements.
## Supremum Norm Consistency
The first question is how to upgrade pointwise consistency of a kernel density estimator into control of the largest error over a whole region. This is not only a technical strengthening: it is the form of consistency needed for confidence bands, mode estimation, level-set estimation, and visual density reconstruction. The price of taking a supremum is a logarithmic factor, coming from the number of effectively distinct locations at the smoothing scale.
[definition: Kernel Density Estimator in Dimension D]
Let $X_1,\dots,X_n$ be i.i.d. random vectors in $\mathbb R^d$ with density $f$. Let $K:\mathbb R^d\to\mathbb R$ be integrable with $\int_{\mathbb R^d}K(u)\,d\mathcal L^d(u)=1$, and let $h>0$. The kernel density estimator with bandwidth $h$ is
\begin{align*}
\hat f_h:\mathbb R^d\to\mathbb R,\qquad
x\mapsto \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right).
\end{align*}
[/definition]
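A direct transcription of this formula is short. The sketch below uses a product Gaussian kernel as one admissible choice of $K$; the kernel choice and the function names are assumptions made for illustration.

```python
import math

def gaussian_product_kernel(u):
    """Standard Gaussian density on R^d evaluated at the vector u; it integrates to 1."""
    d = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (d / 2.0)

def kde(x, data, h, kernel=gaussian_product_kernel):
    """f_hat_h(x) = (n h^d)^(-1) * sum_i K((x - X_i) / h) for points in R^d."""
    d = len(x)
    n = len(data)
    total = sum(
        kernel([(xj - pj) / h for xj, pj in zip(x, point)])
        for point in data
    )
    return total / (n * h ** d)

# Two observations in the plane, evaluated at the origin.
sample = [[0.0, 0.0], [1.0, -1.0]]
print(kde([0.0, 0.0], sample, h=0.5))
```

The factor $h^{-d}$ is what keeps the estimate a density after rescaling the kernel argument: halving $h$ in dimension $d$ multiplies the height of each bump by $2^d$.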
This definition is the same averaging construction from the pointwise theory, but now the argument ranges over a set rather than being fixed. For later shorthand, write $K_h(u)=h^{-d}K(u/h)$, so that $\mathbb E[\hat f_h(x)]=(K_h*f)(x)$ whenever the convolution is well-defined. The deterministic part of the uniform error is the bias of this smoothed density, while the random part is an empirical-process fluctuation indexed by translated kernels. To state the size of that error, we need a norm that records the worst discrepancy on the region being estimated.
[definition: Supremum Norm Loss]
Let $A\subset\mathbb R^d$, and let $\mathcal B(A)$ denote the [vector space](/page/Vector%20Space) of bounded functions $g:A\to\mathbb R$. The supremum norm over $A$ is the map
\begin{align*}
\|\cdot\|_{\infty,A}:\mathcal B(A)\to[0,\infty),\qquad
g\mapsto \sup_{x\in A}|g(x)|.
\end{align*}
[/definition]
The notation records the domain of the supremum because boundary and tail regions often require different arguments. On a compact interior set, kernel windows stay inside the support for small bandwidth, which separates the main smoothing analysis from boundary correction. This motivates naming the regions where the uncorrected estimator has its cleanest behaviour.
[definition: Compact Interior Set]
Let $S\subset\mathbb R^d$ be a set. A set $A\subset S$ is a compact interior set of $S$ if $A$ is compact and there exists $\delta>0$ such that
\begin{align*}
\operatorname{dist}(x,S^c)\ge \delta \qquad \text{for all }x\in A.
\end{align*}
[/definition]
Compact interior sets are the natural place where the uncorrected KDE behaves as though the density lived on all of $\mathbb R^d$. Pointwise consistency is not enough for confidence bands, mode estimation, or comparing the whole estimated curve on a region: we need the largest error over the set to vanish. The obstruction is simultaneous control over many overlapping kernel windows, which forces conditions on the kernel class and on the effective sample size $nh^d$.
[quotetheorem:6320]
[citeproof:6320]
This discussion explains why the effective sample size is $nh^d$, the expected number of observations in a bandwidth-sized neighbourhood. The boundedness, entropy, and bandwidth-log conditions are doing real work: if $K$ has a sharp spike with no usable entropy control, concentration at a finite grid need not extend to the continuum; if $h_n=\exp(-n)$, then $\log(1/h_n)$ is much larger than $\log n$ and the displayed empirical-process bound is not controlled by the effective-sample-size condition. If $nh^d$ stays bounded, many windows contain too little data for uniform convergence. The theorem does not give a sharp finite-sample confidence band or a boundary result, since it treats compact sets in the ambient space and only proves convergence in probability. The extra $\log n$ pays for simultaneous control over many locations, and the balance between bias and this maximal random term already appears in simple one-dimensional estimators.
[example: Uniform Error for a Triangular Kernel]
Let $d=1$, let $K(u)=(1-|u|)\mathbb 1_{[-1,1]}(u)$, and let $f$ be a Lipschitz density on $[0,1]$ with Lipschitz constant $L$. First,
\begin{align*}
\int_{\mathbb R}K(u)\,d\mathcal L^1(u)=\int_{-1}^1(1-|u|)\,du=2\int_0^1(1-u)\,du=1.
\end{align*}
Thus $K$ has unit integral. Fix $A=[\varepsilon,1-\varepsilon]$ with $\varepsilon>0$. For $h_n=n^{-1/3}$,
\begin{align*}
\frac{nh_n}{\log n}=\frac{n\cdot n^{-1/3}}{\log n}=\frac{n^{2/3}}{\log n}\to\infty.
\end{align*}
Also,
\begin{align*}
\log(1/h_n)=\log(n^{1/3})=\frac13\log n=O(\log n).
\end{align*}
So the bandwidth assumptions in *[Uniform Consistency of Kernel Density Estimators](/theorems/6320)* are satisfied on this compact interior interval.
For the bias, take $n$ large enough that $h_n<\varepsilon$. Then $x-h_nv\in[0,1]$ whenever $x\in A$ and $v\in[-1,1]$. With the change of variables $v=(x-u)/h_n$,
\begin{align*}
\mathbb E[\hat f_{h_n}(x)]=\int_0^1 h_n^{-1}K\left(\frac{x-u}{h_n}\right)f(u)\,du=\int_{-1}^1K(v)f(x-h_nv)\,dv.
\end{align*}
Since $\int K=1$ and $f$ is $L$-Lipschitz,
\begin{align*}
\left|\mathbb E[\hat f_{h_n}(x)]-f(x)\right|=\left|\int_{-1}^1K(v)\{f(x-h_nv)-f(x)\}\,dv\right|.
\end{align*}
The Lipschitz bound gives
\begin{align*}
\left|\mathbb E[\hat f_{h_n}(x)]-f(x)\right|\le Lh_n\int_{-1}^1(1-|v|)|v|\,dv.
\end{align*}
By symmetry,
\begin{align*}
\int_{-1}^1(1-|v|)|v|\,dv=2\int_0^1v(1-v)\,dv=2\left(\frac12-\frac13\right)=\frac13.
\end{align*}
Therefore
\begin{align*}
\|\mathbb E[\hat f_{h_n}]-f\|_{\infty,A}\le \frac{L}{3}h_n=O(n^{-1/3}).
\end{align*}
The centred stochastic term has the maximal-deviation order from *Maximal Deviation Rate for Kernel Density Estimators*:
\begin{align*}
\|\hat f_{h_n}-\mathbb E[\hat f_{h_n}]\|_{\infty,A}=O_{\mathbb P}\left(\sqrt{\frac{\log n}{nh_n}}\right).
\end{align*}
Substituting $h_n=n^{-1/3}$ gives
\begin{align*}
\sqrt{\frac{\log n}{nh_n}}=\sqrt{\frac{\log n}{n\cdot n^{-1/3}}}=n^{-1/3}\sqrt{\log n}.
\end{align*}
Combining the deterministic and stochastic pieces,
\begin{align*}
\|\hat f_{h_n}-f\|_{\infty,A}=O(n^{-1/3})+O_{\mathbb P}(n^{-1/3}\sqrt{\log n}).
\end{align*}
Thus both contributions vanish, but at the bandwidth $h_n=n^{-1/3}$ the supremum stochastic fluctuation is larger than the Lipschitz bias by a factor of order $\sqrt{\log n}$.
[/example]
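The bound $\tfrac{L}{3}h_n$ cannot be improved in general. The sketch below checks this numerically for a hypothetical tent-shaped density $f(t)=2-4|t-\tfrac12|$ on $[0,1]$, which has Lipschitz constant $L=4$: at the kink $x=\tfrac12$ the bias equals $-\tfrac{4}{3}h$ exactly, while at a point where $f$ is linear across the window the bias vanishes by symmetry. The test density, the quadrature helper, and the function names are assumptions made for illustration.

```python
import math

def triangular_kernel(u):
    return max(0.0, 1.0 - abs(u))

def tent_density(t):
    """Hypothetical test density on [0, 1] with Lipschitz constant 4."""
    return 2.0 - 4.0 * abs(t - 0.5) if 0.0 <= t <= 1.0 else 0.0

def integrate(g, a, b, m=4000):
    """Composite midpoint rule on [a, b] with m cells."""
    w = (b - a) / m
    return w * sum(g(a + (i + 0.5) * w) for i in range(m))

def bias(x, h):
    """E[f_hat_h(x)] - f(x) for interior x, as an integral over the kernel support."""
    return integrate(
        lambda v: triangular_kernel(v) * (tent_density(x - h * v) - tent_density(x)),
        -1.0, 1.0,
    )

def stochastic_rate(n):
    """sqrt(log n / (n h_n)) with h_n = n^(-1/3)."""
    h = n ** (-1.0 / 3.0)
    return math.sqrt(math.log(n) / (n * h))

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    h = n ** (-1.0 / 3.0)
    # Bias at the kink, the Lipschitz bound (L / 3) h, and the stochastic rate.
    print(n, bias(0.5, h), 4.0 * h / 3.0, stochastic_rate(n))
```

The stochastic column exceeds the bias bound only through the factor $\sqrt{\log n}$ and the constants, matching the comparison at the end of the example.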
The previous theorem gives convergence but does not identify the typical size of the maximal fluctuation. For bandwidth choice and confidence bands, consistency is too qualitative; we need a rate that separates the random fluctuation from the deterministic smoothing bias.
[quotetheorem:6321]
[citeproof:6321]
This rate is not merely a proof artefact: the estimator is averaging about $nh^d$ observations locally, and the maximum over many local averages has a logarithmic inflation. The VC-subgraph and compact-support assumptions are used to control the whole continuum of translated kernels through empirical-process entropy; without such entropy control, pointwise concentration need not imply a supremum bound. The bounded-density assumption prevents a local variance scale from exploding, while $\log(1/h)=O(\log n)$ rules out bandwidths so small that the covering count outgrows the displayed logarithm. The theorem describes the centred fluctuation only, not the smoothing bias or endpoint behaviour, so it must be combined with a bias calculation before choosing $h$.
## Stochastic Equicontinuity
The next question is why values of the KDE at nearby points should be close in a random, uniform sense. Pointwise concentration alone does not prevent a random function from oscillating between grid points. Stochastic equicontinuity supplies the missing bridge between finite-dimensional control and process-level control.
[definition: Stochastic Equicontinuity]
Let $(Z_n(t))_{t\in T}$ be stochastic processes indexed by a [metric space](/page/Metric%20Space) $(T,\rho)$. The sequence $(Z_n)$ is stochastically equicontinuous if, for every $\varepsilon>0$,
\begin{align*}
\lim_{\delta\downarrow0}\limsup_{n\to\infty}\mathbb P\left(\sup_{\rho(s,t)<\delta}|Z_n(s)-Z_n(t)|>\varepsilon\right)=0.
\end{align*}
[/definition]
For KDEs, the relevant process is often the centred and scaled estimator $Z_n(x)=\sqrt{nh^d}(\hat f_h(x)-\mathbb E[\hat f_h(x)])$. The metric must reflect the smoothing scale, so ordinary distance $|x-y|$ is too crude unless it is compared to $h$. The next theorem makes that bandwidth-scale continuity precise and explains why the finite grid argument controls the whole continuum.
[quotetheorem:6322]
[citeproof:6322]
Stochastic equicontinuity is the reason that kernel estimators behave like smooth random functions rather than unrelated estimates at uncountably many points. The normalization by $\sqrt{nh^d}$ is essential: without it, the statement is either too weak to connect with maximal deviation theory or false at the intended stochastic scale. The scale $h\delta$ is also essential, because points separated by a fixed Euclidean distance eventually use almost disjoint kernel windows. Each regularity assumption rules out a different obstruction. If $K$ is not square-integrable, a narrow singularity can make the local variance infinite even when $f$ is bounded. If compact support is dropped without a replacement tail condition, far-away observations can contribute many small but untracked increments. If the increment class has no VC-type entropy bound, a kernel with rapidly oscillating translates can have small pointwise increments while the supremum over locations remains uncontrolled. This theorem does not identify the limiting distribution of $Z_n$; it supplies the tightness-type ingredient that lets finite-grid probability bounds become uniform bounds.
[remark: Why Compact Support Is Often Assumed]
Compact support of $K$ is not essential for all uniform results, but it simplifies the entropy and tail arguments. Gaussian kernels can be handled by truncation or by exponential tail bounds, but the proof has to control both spatial covering and the far tails of $K((x-X_i)/h)$.
[/remark]
## Boundary Bias and Correction
Uniform consistency over compact interiors avoids a major problem: near the boundary of the support, a symmetric kernel puts mass outside the region where the density lives. The question is how to modify the estimator so that it remains accurate at points such as $0$ for densities supported on $[0,\infty)$ or $[0,1]$. Boundary correction changes the kernel shape or the local fitting criterion near the edge.
[example: Failure Near Zero for Positive Data]
Let $X_1,\dots,X_n$ have density $f(u)=e^{-u}\mathbb 1_{[0,\infty)}(u)$, and let $K$ be a symmetric kernel supported on $[-1,1]$ with $\int_{\mathbb R}K(v)\,d\mathcal L^1(v)=1$. At the boundary point $x=0$, the naive KDE is
\begin{align*}
\hat f_h(0)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{-X_i}{h}\right).
\end{align*}
Taking expectation and using the density of $X_i$ gives
\begin{align*}
\mathbb E[\hat f_h(0)]=\frac{1}{h}\int_0^\infty K\left(\frac{-u}{h}\right)e^{-u}\,du.
\end{align*}
Since $K$ is supported on $[-1,1]$, the factor $K(-u/h)$ is zero for $u>h$, so
\begin{align*}
\mathbb E[\hat f_h(0)]=\frac{1}{h}\int_0^h K\left(\frac{-u}{h}\right)e^{-u}\,du.
\end{align*}
Set $v=-u/h$, so $u=-hv$ and $du=-h\,dv$. When $u=0$, $v=0$; when $u=h$, $v=-1$. Therefore
\begin{align*}
\mathbb E[\hat f_h(0)]=\frac{1}{h}\int_0^{-1}K(v)e^{hv}(-h)\,dv.
\end{align*}
Reversing the orientation gives
\begin{align*}
\mathbb E[\hat f_h(0)]=\int_{-1}^0K(v)e^{hv}\,dv.
\end{align*}
For $v\in[-1,0]$, $e^{hv}\to1$ as $h\downarrow0$, and the integrand is bounded in absolute value by $\|K\|_\infty\mathbb 1_{[-1,0]}(v)$. Hence
\begin{align*}
\lim_{h\downarrow0}\mathbb E[\hat f_h(0)]=\int_{-1}^0K(v)\,dv.
\end{align*}
Symmetry gives $\int_{-1}^0K(v)\,dv=\int_0^1K(v)\,dv$. Since $K$ has unit integral and is supported on $[-1,1]$,
\begin{align*}
1=\int_{-1}^1K(v)\,dv=\int_{-1}^0K(v)\,dv+\int_0^1K(v)\,dv=2\int_{-1}^0K(v)\,dv.
\end{align*}
Thus
\begin{align*}
\lim_{h\downarrow0}\mathbb E[\hat f_h(0)]=\frac12.
\end{align*}
But $f(0)=e^0=1$, so the limiting boundary bias is $1/2-1=-1/2$. Shrinking the bandwidth does not remove the error, because at $x=0$ half of the symmetric kernel window lies outside the support $[0,\infty)$.
[/example]
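The limit can be checked numerically by evaluating $\int_{-1}^0K(v)e^{hv}\,dv$ for a decreasing sequence of bandwidths. The sketch below uses the Epanechnikov kernel as one symmetric kernel supported on $[-1,1]$; that choice, the quadrature helper, and the function names are illustrative assumptions.

```python
import math

def epanechnikov(u):
    """Symmetric kernel supported on [-1, 1] with unit integral (illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def integrate(g, a, b, m=4000):
    """Composite midpoint rule on [a, b] with m cells."""
    w = (b - a) / m
    return w * sum(g(a + (i + 0.5) * w) for i in range(m))

def naive_boundary_mean(h):
    """E[f_hat_h(0)] = int_{-1}^{0} K(v) exp(h v) dv for the density exp(-u) on [0, inf)."""
    return integrate(lambda v: epanechnikov(v) * math.exp(h * v), -1.0, 0.0)

for h in (0.5, 0.1, 0.01, 0.001):
    mean = naive_boundary_mean(h)
    # Second column: expectation; third column: bias relative to f(0) = 1.
    print(h, mean, mean - 1.0)
```

The expectation approaches $1/2$ from below, so the bias column settles near $-1/2$ instead of shrinking with $h$.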
The previous example identifies the missing mass mechanism: near zero, the estimator averages over a window that has been cut in half by the support constraint. This motivates defining an estimator that repairs the missing mass by mirroring observations across the boundary. The reflected estimator restores the missing half-window without changing the interior formula far from zero.
[definition: Reflection Kernel Density Estimator]
Let $X_1,\dots,X_n$ be observations supported on $[0,\infty)$ and let $K:\mathbb R\to\mathbb R$ be a kernel. The reflection KDE is
\begin{align*}
\hat f_{h,\mathrm{ref}}:[0,\infty)\to\mathbb R,\qquad
x\mapsto \frac{1}{nh}\sum_{i=1}^n\left\{K\left(\frac{x-X_i}{h}\right)+K\left(\frac{x+X_i}{h}\right)\right\}.
\end{align*}
[/definition]
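The definition translates directly into code. The sketch below implements the naive and reflected estimators with an Epanechnikov kernel and compares the mass each one places on $[0,\infty)$. The kernel, the simulated exponential sample, the seed, and the function names are illustrative assumptions.

```python
import random

def epanechnikov(u):
    """Symmetric kernel supported on [-1, 1] (illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def naive_kde(x, data, h):
    return sum(epanechnikov((x - xi) / h) for xi in data) / (len(data) * h)

def reflection_kde(x, data, h):
    """Each observation X_i contributes its own bump and the bump of its mirror image -X_i."""
    return sum(
        epanechnikov((x - xi) / h) + epanechnikov((x + xi) / h) for xi in data
    ) / (len(data) * h)

def mass_on_half_line(estimator, data, h, upper, m=2000):
    """Midpoint-rule approximation of the integral of the estimate over [0, upper]."""
    w = upper / m
    return w * sum(estimator((i + 0.5) * w, data, h) for i in range(m))

rng = random.Random(1)                                   # fixed seed for reproducibility
data = [rng.expovariate(1.0) for _ in range(50)]         # simulated positive data
h = 0.3
upper = max(data) + h                                    # every bump vanishes beyond this point

naive_mass = mass_on_half_line(naive_kde, data, h, upper)
reflected_mass = mass_on_half_line(reflection_kde, data, h, upper)
print(naive_mass, reflected_mass)
```

For a symmetric kernel the reflected term returns exactly the mass that the naive bump places on the negative half-line, so the reflected estimate integrates to one over $[0,\infty)$ up to quadrature error, while the naive estimate falls short whenever observations lie within $h$ of zero.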
The added reflected term supplies the kernel mass that would otherwise have fallen outside the support. Reflection works best when the target density is smooth at the boundary in a way compatible with even extension, so the next result calculates the remaining boundary bias and exposes the condition under which the first-order term disappears.
[quotetheorem:6323]
[citeproof:6323]
Reflection is simple, but it hard-codes the geometry of a flat boundary. The symmetry and support assumptions in the theorem prevent leakage past zero and make the one-sided Taylor expansion exact up to the displayed order. A nonsymmetric kernel can break the correction even for the constant density shape near zero: if $\int_0^1K(u)\,d\mathcal L^1(u)\ne\int_{-1}^0K(u)\,d\mathcal L^1(u)$, the two reflected terms do not reproduce the same local averaging as an even kernel, so the endpoint normalisation is distorted. A nonzero boundary derivative also has a concrete effect: for $f(x)=e^{-x}\mathbb 1_{[0,\infty)}(x)$, $f'(0+)=-1$, so the displayed formula leaves the first-order term $-2h\mu_{1,+}(K)+O(h^2)$ rather than an $O(h^2)$ bias. The flat-boundary hypothesis is also restrictive; reflecting data from a density supported on a curved region in $\mathbb R^2$ across a coordinate axis puts artificial mass in the wrong location. The theorem only describes the endpoint $x=0$ for a half-line and does not settle corners, curved boundaries, or finite-sample variance. On bounded intervals or positive half-lines, another approach is to use kernels whose support automatically matches the sample space.
[definition: Beta Kernel Density Estimator]
Let $X_1,\dots,X_n$ be observations supported on $[0,1]$. For $x\in[0,1]$ and bandwidth $h>0$, define the beta kernel $K_{x,h}$ to be the beta density on $[0,1]$ with parameters depending on $x$ and $h$, commonly chosen as
\begin{align*}
\alpha_{x,h}=\frac{x}{h}+1,\qquad \beta_{x,h}=\frac{1-x}{h}+1.
\end{align*}
Explicitly, the beta kernel is
\begin{align*}
K_{x,h}:[0,1]\to[0,\infty),\qquad
t\mapsto \frac{t^{\alpha_{x,h}-1}(1-t)^{\beta_{x,h}-1}}{B(\alpha_{x,h},\beta_{x,h})},
\end{align*}
and the beta-kernel estimator is
\begin{align*}
\hat f_{h,\beta}:[0,1]\to\mathbb R,\qquad
x\mapsto \frac{1}{n}\sum_{i=1}^n K_{x,h}(X_i).
\end{align*}
[/definition]
Here the kernel depends on the evaluation point $x$, so this estimator is not a convolution. The gain is that no probability mass is assigned outside $[0,1]$, including at the endpoints, and this endpoint adaptation can be seen directly from the beta parameters.
[example: Beta Kernel at an Endpoint]
At $x=0$, the beta-kernel parameters are
\begin{align*}
\alpha_{0,h}=\frac{0}{h}+1=1
\end{align*}
and
\begin{align*}
\beta_{0,h}=\frac{1-0}{h}+1=\frac1h+1.
\end{align*}
For $b>0$,
\begin{align*}
B(1,b)=\int_0^1(1-t)^{b-1}\,dt=\left[\frac{-(1-t)^b}{b}\right]_{t=0}^{t=1}=\frac1b.
\end{align*}
Taking $b=1/h+1$, the endpoint beta density is
\begin{align*}
K_{0,h}(t)=\frac{t^{0}(1-t)^{1/h}}{B(1,1/h+1)}=\left(\frac1h+1\right)(1-t)^{1/h},\qquad 0\le t\le1.
\end{align*}
Its total mass is
\begin{align*}
\int_0^1K_{0,h}(t)\,dt=\left(\frac1h+1\right)\int_0^1(1-t)^{1/h}\,dt=\left(\frac1h+1\right)\frac{1}{1/h+1}=1.
\end{align*}
Thus the kernel is supported on $[0,1]$ and puts no mass outside the sample space.
Since $\hat f_{h,\beta}(0)=n^{-1}\sum_{i=1}^nK_{0,h}(X_i)$ and the $X_i$ have density $f$ on $[0,1]$,
\begin{align*}
\mathbb E[\hat f_{h,\beta}(0)]=\mathbb E[K_{0,h}(X_1)]=\int_0^1\left(\frac1h+1\right)(1-t)^{1/h}f(t)\,dt.
\end{align*}
Using the unit-mass calculation,
\begin{align*}
\mathbb E[\hat f_{h,\beta}(0)]-f(0)=\int_0^1K_{0,h}(t)\{f(t)-f(0)\}\,dt.
\end{align*}
For any fixed $\eta\in(0,1)$, the beta tail away from zero is
\begin{align*}
\int_\eta^1K_{0,h}(t)\,dt=\left(\frac1h+1\right)\int_\eta^1(1-t)^{1/h}\,dt=(1-\eta)^{1/h+1}\to0
\end{align*}
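as $h\downarrow0$. Suppose that $f$ is bounded on $[0,1]$ and continuous at $0$. On $[0,\eta]$, continuity of $f$ at $0$ makes $|f(t)-f(0)|$ uniformly small once $\eta$ is small; on $[\eta,1]$, the integrand is bounded by $2\sup_{[0,1]}|f|\,K_{0,h}(t)$ and the displayed tail mass tends to zero. Therefore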
\begin{align*}
\mathbb E[\hat f_{h,\beta}(0)]\to f(0).
\end{align*}
Unlike a symmetric kernel centred at $0$, the endpoint beta kernel concentrates its full unit mass inside $[0,1]$ near the boundary point, so its endpoint expectation targets $f(0)$ rather than losing half the mass outside the interval.
[/example]
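The endpoint calculation can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `beta_kernel`, `beta_kde`, and `midpoint_integral`, and the values $h=0.1$, $\eta=0.3$, are illustrative assumptions rather than part of the course material.

```python
import math

def beta_kernel(t, x, h):
    """Beta density at t with parameters alpha = x/h + 1 and beta = (1 - x)/h + 1."""
    a = x / h + 1.0
    b = (1.0 - x) / h + 1.0
    if t < 0.0 or t > 1.0:
        return 0.0                      # no mass outside [0, 1]
    if t == 0.0:
        return b if a == 1.0 else 0.0   # t^(a-1) equals 1 when a = 1, else 0
    if t == 1.0:
        return a if b == 1.0 else 0.0   # symmetric argument at the right endpoint
    # log of 1 / B(a, b), computed through log-gamma for numerical stability
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(t) + (b - 1.0) * math.log1p(-t))

def beta_kde(x, data, h):
    """Beta-kernel estimate at x: average of K_{x,h}(X_i) over the sample."""
    return sum(beta_kernel(xi, x, h) for xi in data) / len(data)

def midpoint_integral(g, lo, hi, m=20000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    step = (hi - lo) / m
    return step * sum(g(lo + (k + 0.5) * step) for k in range(m))

h, eta = 0.1, 0.3
closed_form = (1.0 / h + 1.0) * (1.0 - 0.25) ** (1.0 / h)   # (1/h + 1)(1 - t)^{1/h} at t = 0.25
total_mass = midpoint_integral(lambda t: beta_kernel(t, 0.0, h), 0.0, 1.0)
tail_mass = midpoint_integral(lambda t: beta_kernel(t, 0.0, h), eta, 1.0)

print(beta_kernel(0.25, 0.0, h), closed_form)
print(total_mass, tail_mass, (1.0 - eta) ** (1.0 / h + 1.0))
```

The first printed pair compares the generic beta density at $x=0$ with the closed form $(1/h+1)(1-t)^{1/h}$; the second line compares the numerical total and tail masses with $1$ and $(1-\eta)^{1/h+1}$.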
The beta-kernel example solves support leakage by changing the kernel family with the evaluation point. This motivates defining a different correction that keeps a fixed kernel but changes the local estimating equation. Local linear correction uses polynomial reproduction to cancel the bias caused by truncated windows.
[definition: Local Linear Density Correction]
Let $X_1,\dots,X_n$ be observations on an interval $S\subset\mathbb R$, let $K:\mathbb R\to\mathbb R$ be a kernel, and let $h>0$. For $x\in S$, set
\begin{align*}
T_{x,h}=\{u\in\mathbb R:x-hu\in S\}.
\end{align*}
For $j=0,1,2$, set the moment maps $s_j:S\times(0,\infty)\to\mathbb R$, wherever the following integrals are finite, by
\begin{align*}
s_j(x,h)=\int_{T_{x,h}}u^jK(u)\,d\mathcal L^1(u).
\end{align*}
Finally, set $\Delta:S\times(0,\infty)\to\mathbb R$ by
\begin{align*}
\Delta(x,h)=s_0(x,h)s_2(x,h)-s_1(x,h)^2.
\end{align*}
When $\Delta(x,h)>0$, define the local linear equivalent kernel
\begin{align*}
L_{x,h}:T_{x,h}\to\mathbb R,\qquad
u\mapsto \frac{s_2(x,h)-u s_1(x,h)}{\Delta(x,h)}K(u).
\end{align*}
The local linear density estimator is the map
\begin{align*}
\hat f_{h,\mathrm{ll}}:S\to\mathbb R,\qquad
x\mapsto \frac{1}{nh}\sum_{i=1}^n L_{x,h}\left(\frac{x-X_i}{h}\right).
\end{align*}
[/definition]
The formal construction can be written through equivalent weighted moment formulas. Its important feature for this course is that local linear fitting automatically adjusts for asymmetric kernel windows near the boundary. The theorem below states the resulting bias cancellation, which is the density-estimation analogue of boundary correction in local linear regression.
[quotetheorem:6324]
[citeproof:6324]
Boundary correction therefore restores the same bias order at the edge that ordinary second-order kernels have in the interior. The nondegenerate moment condition $\Delta(x,h)>0$ is necessary because the correction solves a two-moment reproduction problem; if $\Delta(x,h)=0$, the truncated window carries no second-moment variation and the intercept and slope are not separately identified. Twice differentiability is also used in the Taylor remainder; for a density with a kink at the boundary, first-order cancellation of a smooth expansion no longer implies an $h^2$ bias bound. The theorem is a bias statement only, so variance constants and the stability of the equivalent kernel still have to be checked before using the method in finite samples.
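The two-moment reproduction property behind the local linear equivalent kernel can be verified directly. The sketch below assumes $S=[0,1]$ and the Epanechnikov kernel $K(u)=\tfrac34(1-u^2)\mathbb 1\{|u|\le1\}$; both choices, and all function names, are illustrative assumptions. It computes $s_0,s_1,s_2$ on the truncated window at the boundary point $x=0$ and checks that $\int L_{x,h}=1$ and $\int uL_{x,h}(u)\,d\mathcal L^1(u)=0$.

```python
def epanechnikov(u):
    """Assumed kernel: 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def window(x, h, lo=0.0, hi=1.0):
    """T_{x,h} = {u : x - h u in [lo, hi]}, clipped to the kernel support [-1, 1]."""
    return max((x - hi) / h, -1.0), min((x - lo) / h, 1.0)

def window_integral(g, x, h, m=4000):
    """Midpoint-rule integral of g over the truncated window at x."""
    a, b = window(x, h)
    step = (b - a) / m
    return step * sum(g(a + (k + 0.5) * step) for k in range(m))

def moments(x, h):
    """Truncated kernel moments (s_0, s_1, s_2)."""
    return tuple(window_integral(lambda u, j=j: (u ** j) * epanechnikov(u), x, h)
                 for j in range(3))

def equivalent_kernel(u, s):
    """L_{x,h}(u) = (s_2 - u s_1) K(u) / Delta, given s = (s_0, s_1, s_2)."""
    s0, s1, s2 = s
    delta = s0 * s2 - s1 * s1
    return (s2 - u * s1) / delta * epanechnikov(u)

def ll_density(x, data, h):
    """Local linear density estimate at x for data in [0, 1]."""
    s = moments(x, h)
    return sum(equivalent_kernel((x - xi) / h, s) for xi in data) / (len(data) * h)

x0, h = 0.0, 0.2
s_boundary = moments(x0, h)
mass = window_integral(lambda u: equivalent_kernel(u, s_boundary), x0, h)
first_moment = window_integral(lambda u: u * equivalent_kernel(u, s_boundary), x0, h)
print(s_boundary, mass, first_moment)
```

At $x=0$ the window is $[-1,0]$, so $s_1\ne0$ and the uncorrected kernel would carry a first-order bias term; the equivalent kernel removes it by construction.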
## Multivariate Kernel Density Estimation
The multivariate problem asks what changes when each observation is a vector rather than a scalar. The formula looks almost unchanged, but the bandwidth now controls volume, and volume shrinks as $h^d$. This is the main entry point for the curse of dimensionality in kernel density estimation.
[definition: Product Kernel]
Let $K_1:\mathbb R\to\mathbb R$ be a univariate kernel. The product kernel on $\mathbb R^d$ generated by $K_1$ is
\begin{align*}
K:\mathbb R^d\to\mathbb R,\qquad
u=(u_1,\dots,u_d)\mapsto \prod_{j=1}^d K_1(u_j).
\end{align*}
[/definition]
Product kernels are convenient because they reduce multivariate smoothing to coordinatewise smoothing. They are not rotationally invariant in general, but they make the bias and variance calculations transparent. In two dimensions, the estimator is often read visually through contours, so the uniform theory has a direct graphical interpretation.
[example: Bivariate Density Contours]
Let $K_1(u)=(2\pi)^{-1/2}e^{-u^2/2}$ and use the product Gaussian kernel
\begin{align*}
K(u_1,u_2)=K_1(u_1)K_1(u_2)=\frac{1}{2\pi}\exp\left(-\frac{u_1^2+u_2^2}{2}\right).
\end{align*}
For $x=(x_1,x_2)$ and $X_i=(X_{i1},X_{i2})$, the two-dimensional scalar-bandwidth KDE is
\begin{align*}
\hat f_h(x_1,x_2)=\frac{1}{nh^2}\sum_{i=1}^n K\left(\frac{x_1-X_{i1}}{h},\frac{x_2-X_{i2}}{h}\right).
\end{align*}
Substituting the product Gaussian formula gives
\begin{align*}
\hat f_h(x_1,x_2)=\frac{1}{nh^2}\sum_{i=1}^n \frac{1}{2\pi}\exp\left(-\frac{(x_1-X_{i1})^2/h^2+(x_2-X_{i2})^2/h^2}{2}\right).
\end{align*}
Since $(a/h)^2=a^2/h^2$, this is
\begin{align*}
\hat f_h(x_1,x_2)=\frac{1}{2\pi n h^2}\sum_{i=1}^n \exp\left(-\frac{(x_1-X_{i1})^2+(x_2-X_{i2})^2}{2h^2}\right).
\end{align*}
Thus a contour at level $c$ is the set
\begin{align*}
\{(x_1,x_2)\in\mathbb R^2:\hat f_h(x_1,x_2)=c\}.
\end{align*}
The bandwidth controls how much neighbouring observations contribute to the same contour. At the data point $x=X_j$, the $j$th summand in the final formula equals
\begin{align*}
\frac{1}{2\pi n h^2}\exp\left(-\frac{(X_{j1}-X_{j1})^2+(X_{j2}-X_{j2})^2}{2h^2}\right).
\end{align*}
The numerator in the exponent is $0+0=0$, so this contribution is
\begin{align*}
\frac{1}{2\pi n h^2}\exp(0)=\frac{1}{2\pi n h^2}.
\end{align*}
If a point $x$ is at Euclidean distance at least $r>0$ from every observation, then
\begin{align*}
(x_1-X_{i1})^2+(x_2-X_{i2})^2\ge r^2
\end{align*}
for every $i$. Since $a\mapsto e^{-a}$ is decreasing,
\begin{align*}
\exp\left(-\frac{(x_1-X_{i1})^2+(x_2-X_{i2})^2}{2h^2}\right)\le \exp\left(-\frac{r^2}{2h^2}\right).
\end{align*}
Therefore
\begin{align*}
\hat f_h(x)\le \frac{1}{2\pi n h^2}\sum_{i=1}^n \exp\left(-\frac{r^2}{2h^2}\right)=\frac{1}{2\pi h^2}\exp\left(-\frac{r^2}{2h^2}\right).
\end{align*}
Writing $t=1/h^2$, the right-hand side is $(2\pi)^{-1}t e^{-r^2t/2}$, which tends to $0$ as $t\to\infty$ because exponential decay dominates polynomial growth. Small bandwidths therefore create high peaks around individual observations and low density between separated observations, which makes contours fragment. With larger bandwidths, the exponential factors change more slowly across nearby observations, so distinct peaks are averaged together and separate modes can merge.
Uniform error matters because contours are level-set objects. If
\begin{align*}
\|\hat f_h-f\|_{\infty,A}\le \eta,
\end{align*}
then every $x\in A$ with $f(x)\ge c+\eta$ satisfies
\begin{align*}
\hat f_h(x)\ge f(x)-|\hat f_h(x)-f(x)|\ge c+\eta-\eta=c.
\end{align*}
Also, every $x\in A$ with $\hat f_h(x)\ge c$ satisfies
\begin{align*}
f(x)\ge \hat f_h(x)-|\hat f_h(x)-f(x)|\ge c-\eta.
\end{align*}
Hence
\begin{align*}
\{x\in A:f(x)\ge c+\eta\}\subseteq \{x\in A:\hat f_h(x)\ge c\}\subseteq \{x\in A:f(x)\ge c-\eta\}.
\end{align*}
A pointwise error bound controls isolated locations, but this level-set inclusion needs one error bound holding over the whole plotting region.
[/example]
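The peak and separation bounds from the contour example can be checked on a toy sample. This is a minimal sketch under stated assumptions: the three data points, the evaluation point, and the function names are hypothetical and chosen only for illustration.

```python
import math

def kde2d(x, data, h):
    """Scalar-bandwidth product-Gaussian KDE at x = (x1, x2)."""
    x1, x2 = x
    total = sum(math.exp(-((x1 - a) ** 2 + (x2 - b) ** 2) / (2.0 * h * h))
                for a, b in data)
    return total / (2.0 * math.pi * len(data) * h * h)

def separation_bound(r, h):
    """Upper bound (2 pi h^2)^{-1} exp(-r^2 / (2 h^2)) at distance >= r from all data."""
    return math.exp(-r * r / (2.0 * h * h)) / (2.0 * math.pi * h * h)

def peak_lower_bound(n, h):
    """Self-contribution 1 / (2 pi n h^2) at an observation."""
    return 1.0 / (2.0 * math.pi * n * h * h)

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical sample
far_point = (3.0, 3.0)
r = min(math.dist(far_point, p) for p in data)  # distance to the nearest observation

for h in (0.1, 0.5, 1.0, 2.0):
    print(h, kde2d(far_point, data, h), separation_bound(r, h),
          kde2d(data[0], data, h), peak_lower_bound(len(data), h))
```

For small $h$ the estimate at the far point is negligible while the value at an observation is at least $1/(2\pi nh^2)$, which is the fragmentation effect described in the example.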
The contour example uses the same smoothing scale in every direction, which can be inappropriate when the cloud is elongated or correlated. This motivates defining a matrix-bandwidth estimator whose window can rotate and stretch. The determinant in the formula records the smoothing volume, which is the multivariate replacement for the scalar factor $h^d$.
[definition: Matrix Bandwidth Kernel Density Estimator]
Let $X_1,\dots,X_n$ be observations in $\mathbb R^d$, let $K:\mathbb R^d\to\mathbb R$ be a kernel, and let $H\in\mathbb R^{d\times d}$ be symmetric positive definite. The matrix-bandwidth KDE is
\begin{align*}
\hat f_H:\mathbb R^d\to\mathbb R,\qquad
x\mapsto \frac{1}{n(\det H)^{1/2}}\sum_{i=1}^n K\left(H^{-1/2}(x-X_i)\right).
\end{align*}
[/definition]
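A minimal sketch of this definition for $d=2$ follows. It assumes the standard Gaussian kernel, for which $K(H^{-1/2}z)$ depends on $z$ only through the quadratic form $z^\top H^{-1}z$, so no matrix square root is needed; the function names and the sample are illustrative assumptions.

```python
import math

def kde_matrix(x, data, H):
    """Matrix-bandwidth Gaussian KDE in R^2; H is a symmetric positive definite 2x2 matrix."""
    (a, b), (c, d) = H
    det = a * d - b * c
    # Entries of H^{-1} for a 2x2 matrix
    i00, i01, i10, i11 = d / det, -b / det, -c / det, a / det
    total = 0.0
    for p, q in data:
        z1, z2 = x[0] - p, x[1] - q
        quad = i00 * z1 * z1 + (i01 + i10) * z1 * z2 + i11 * z2 * z2  # z^T H^{-1} z
        total += math.exp(-0.5 * quad)
    return total / (2.0 * math.pi * len(data) * math.sqrt(det))

def kde_scalar(x, data, h):
    """Scalar-bandwidth Gaussian KDE in R^2 with volume factor h^2."""
    total = sum(math.exp(-((x[0] - p) ** 2 + (x[1] - q) ** 2) / (2.0 * h * h))
                for p, q in data)
    return total / (2.0 * math.pi * len(data) * h * h)

data = [(0.0, 0.0), (1.0, 0.8), (2.0, 1.7)]   # hypothetical elongated cloud
x, h = (1.0, 1.0), 0.5

isotropic = kde_matrix(x, data, ((h * h, 0.0), (0.0, h * h)))
tilted = kde_matrix(x, data, ((1.0, 0.6), (0.6, 0.5)))  # stretched along the cloud
print(isotropic, kde_scalar(x, data, h), tilted)
```

The first two printed values agree because $\det(h^2I_2)^{1/2}=h^2$ and $z^\top(h^2I_2)^{-1}z=|z|^2/h^2$.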
When $H=h^2I_d$, this reduces to the scalar bandwidth estimator with volume factor $h^d$. General $H$ is useful when the density has different scales in different directions. The next theorem returns to scalar bandwidths so that the role of dimension in the basic bias and variance orders can be isolated before studying integrated risk.
[quotetheorem:6325]
[citeproof:6325]
The preceding theorem identifies the two forces behind all bandwidth rules: variance reduction rewards large bandwidth, while bias reduction rewards small bandwidth. Each hypothesis has a visible failure mode. If $A$ touches the boundary of the support, the one-dimensional uniform density on $[0,1]$ with a symmetric kernel gives endpoint mass loss rather than the interior Taylor expansion. If $K$ is only first order and $\int u_jK(u)\,d\mathcal L^d(u)\ne0$ for some coordinate, the Taylor term $-h\,\partial_j f(x)\int u_jK(u)\,d\mathcal L^d(u)$ remains, so a claimed second-order bias bound fails for any density that is locally linear in $x_j$ with non-zero slope. If $K\notin L^2$, the variance calculation may be infinite even when $f$ is bounded, because the term $\int K^2$ is the leading variance constant. If $f$ has a kink, such as $f(x)\propto 1-|x_1|$ near the origin after normalising to form a density, a second derivative expansion cannot justify an $h^2$ remainder at the kink. The theorem is pointwise and does not control the maximum over $A$ or the integrated risk by itself. This motivates passing from pointwise error to integrated risk, because MISE is the criterion used to derive the chapter's dimension-dependent benchmark rate.
[quotetheorem:6326]
[citeproof:6326]
This theorem is the standard expression of the curse of dimensionality for density estimation. The variance assumption depends on integrating over a fixed $d$-dimensional region with local volume $h^d$; if the data concentrate near a smooth curve in $\mathbb R^2$, isotropic two-dimensional smoothing counts observations in area-$h^2$ windows even though the statistical variation may be governed by the one-dimensional concentration along the curve. The $L^2$ condition on the kernel is also necessary for this variance calculation: a kernel with an integrable singularity but $K\notin L^2$ can define a formal average while making the integrated variance infinite. The smoothness assumption is decisive as well. For a density with a cusp, for instance a normalised version of $f(x)=1-|x|$ near zero in one dimension, a second-order squared-bias rate is not justified at the cusp, so balancing $h^4$ against $(nh)^{-1}$ would understate the approximation error. The theorem gives an upper-rate calculation rather than a full optimality theorem: it does not prove a minimax lower bound, does not say whether a method adapts when $s$ is unknown, and does not select a bandwidth from the data. Concrete bandwidth exponents make the dimensional deterioration visible even before constants and data-driven tuning are considered.
[example: Bandwidth Scaling in Dimensions One Two and Five]
Take second-order smoothness $s=2$. By *Multivariate MISE Rate*, the MISE-optimal scalar bandwidth has the scale
\begin{align*}
h\asymp n^{-1/(2s+d)}.
\end{align*}
Substituting $s=2$ gives
\begin{align*}
2s+d=2\cdot 2+d=4+d,
\end{align*}
so
\begin{align*}
h\asymp n^{-1/(4+d)}.
\end{align*}
For $d=1$,
\begin{align*}
4+d=4+1=5,
\end{align*}
and hence
\begin{align*}
h\asymp n^{-1/5}.
\end{align*}
For $d=2$,
\begin{align*}
4+d=4+2=6,
\end{align*}
so
\begin{align*}
h\asymp n^{-1/6}.
\end{align*}
For $d=5$,
\begin{align*}
4+d=4+5=9,
\end{align*}
and therefore
\begin{align*}
h\asymp n^{-1/9}.
\end{align*}
The same theorem gives the MISE rate
\begin{align*}
\operatorname{MISE}(h)=O\left(n^{-2s/(2s+d)}\right).
\end{align*}
Again substituting $s=2$ yields
\begin{align*}
\frac{2s}{2s+d}=\frac{2\cdot 2}{2\cdot 2+d}=\frac{4}{4+d}.
\end{align*}
Thus $d=1$ gives
\begin{align*}
\frac{4}{4+d}=\frac{4}{4+1}=\frac45,
\end{align*}
so the rate is $n^{-4/5}$. For $d=2$,
\begin{align*}
\frac{4}{4+d}=\frac{4}{4+2}=\frac46=\frac23,
\end{align*}
so the rate is $n^{-2/3}$. For $d=5$,
\begin{align*}
\frac{4}{4+d}=\frac{4}{4+5}=\frac49,
\end{align*}
so the rate is $n^{-4/9}$. As $d$ increases from $1$ to $2$ to $5$, the bandwidth exponent decreases from $1/5$ to $1/6$ to $1/9$, and the MISE exponent decreases from $4/5$ to $2/3$ to $4/9$; higher dimension therefore forces slower bandwidth shrinkage and slower MISE convergence.
[/example]
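The exponent arithmetic can be tabulated with exact fractions. The sketch below is illustrative only: the function names are assumptions, and `matching_sample_size` ignores all constants, so it compares rates rather than actual risks.

```python
from fractions import Fraction

def bandwidth_exponent(s, d):
    """Exponent a in h ~ n^{-a}, namely 1 / (2s + d)."""
    return Fraction(1, 2 * s + d)

def mise_exponent(s, d):
    """Exponent b in MISE = O(n^{-b}), namely 2s / (2s + d)."""
    return Fraction(2 * s, 2 * s + d)

def matching_sample_size(n_ref, s, d_ref, d):
    """Sample size n in dimension d whose rate n^{-b(d)} equals n_ref^{-b(d_ref)}."""
    ratio = mise_exponent(s, d_ref) / mise_exponent(s, d)
    return n_ref ** float(ratio)

for d in (1, 2, 5):
    print(d, bandwidth_exponent(2, d), mise_exponent(2, d),
          round(matching_sample_size(1000, 2, 1, d)))
```

With $s=2$, matching the rate attained by $n=1000$ observations in dimension one requires $n^{(4+d)/5}$ observations in dimension $d$, which is one concrete way to read the curse of dimensionality.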
## Uniform Rates and Practical Consequences
The final question is how to combine smoothness bias, stochastic fluctuation, boundary correction, and dimension into a usable rule of thumb. Uniform theory says that a density estimate should be judged by the largest relevant error over the region where it will be interpreted. In practice, this means that interior plots, boundary estimates, and multivariate contours may require different bandwidth or correction choices.
[quotetheorem:6327]
[citeproof:6327]
The logarithm is the main difference between pointwise and uniform bandwidth heuristics. The compact-interior hypothesis excludes boundary mass loss; using this theorem at $0$ for a density on $[0,\infty)$ would import an interior bias calculation into a setting where the first-order boundary term may dominate. The entropy hypothesis is the stochastic analogue: without it, pointwise concentration does not imply a supremum bound. The result gives the scale suggested by the upper bound, not an oracle choice for finite samples, and it records where the uncorrected estimator is being justified.
[remark: Interior Theory and Boundary Theory Should Not Be Mixed]
A theorem stated on a compact interior set does not justify using an uncorrected KDE at the boundary. Boundary-corrected estimators are designed to recover the same bias order near the support edge, but they require their own moment calculations because the equivalent kernel changes with location.
[/remark]
This warning matters when the statistical question is about a boundary feature rather than about the central part of the support. The next example separates the region where an interior theorem applies from the region the analyst may actually care about.
[example: Choosing What Region to Estimate]
Suppose positive data have support $[0,\infty)$ and the inferential target includes the behaviour at the boundary point $0$. If an analyst chooses an interior interval $[\varepsilon,M]$ with $0<\varepsilon<M$, then
\begin{align*}
0<\varepsilon \le x \le M \qquad \text{for every }x\in[\varepsilon,M],
\end{align*}
so
\begin{align*}
0\notin[\varepsilon,M].
\end{align*}
Moreover, since $[0,\infty)^c=(-\infty,0)$, the distance from any $x\in[\varepsilon,M]$ to the complement of the support is
\begin{align*}
\operatorname{dist}(x,[0,\infty)^c)=x\ge \varepsilon,
\end{align*}
and hence $[\varepsilon,M]$ is a compact interior set of $[0,\infty)$.
An interior uniform statement on $[\varepsilon,M]$ therefore controls only
\begin{align*}
\|\hat f_h-f\|_{\infty,[\varepsilon,M]}
=\sup_{\varepsilon\le x\le M}|\hat f_h(x)-f(x)|.
\end{align*}
This supremum does not contain the boundary error $|\hat f_h(0)-f(0)|$, because $0\notin[\varepsilon,M]$. By contrast, a boundary target such as $[0,M]$ has
\begin{align*}
\|\hat f_h-f\|_{\infty,[0,M]}
=\sup_{0\le x\le M}|\hat f_h(x)-f(x)|
\ge |\hat f_h(0)-f(0)|.
\end{align*}
Thus an uncorrected KDE justified only on $[\varepsilon,M]$ does not answer a question about estimation near zero. Reflection or local linear correction is the relevant analysis when the target region is $[0,M]$, while the interior-only analysis is appropriate for features that are genuinely separated from the boundary.
[/example]
Uniform KDE theory therefore has two messages. First, on compact interior regions, the estimator is uniformly consistent and its stochastic fluctuation is of order
\begin{align*}
\sqrt{\frac{\log n}{nh^d}}.
\end{align*}
Second, boundary and dimension are not minor implementation details: boundaries change the bias mechanism, and dimension changes the effective sample size inside each smoothing window.
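The effect of dimension on the stochastic term can be made concrete. The following sketch evaluates the effective sample size $nh^d$ and the fluctuation scale $\sqrt{\log n/(nh^d)}$ with all constants set to one; the function names and the values $n=10^4$, $h=0.1$ are assumptions chosen for illustration.

```python
import math

def effective_sample_size(n, h, d):
    """Order of the number of observations in a window of volume h^d."""
    return n * h ** d

def uniform_fluctuation(n, h, d):
    """Order of the uniform stochastic error, sqrt(log n / (n h^d)), constants ignored."""
    return math.sqrt(math.log(n) / effective_sample_size(n, h, d))

n, h = 10_000, 0.1
for d in (1, 2, 5):
    print(d, effective_sample_size(n, h, d), uniform_fluctuation(n, h, d))
```

At a fixed bandwidth, each additional dimension multiplies the effective sample size by $h$, so the fluctuation scale grows by a factor $h^{-1/2}$ per dimension.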
Chapter 6 established that kernel estimators can be uniformly consistent with explicit rates depending on bandwidth choice and smoothness assumptions. Chapter 7 addresses the practical problem of how to choose the bandwidth from data. It develops both data-driven methods (such as cross-validation) and theoretical principles that balance approximation error against estimation variance.
# 7. Bandwidth Selection and Adaptation
This chapter develops the practical and theoretical tools used to choose the smoothing scale in kernel density estimation. It builds on the pointwise and integrated-risk expansions from Chapter 5 and the uniform-rate considerations from Chapter 6; throughout, $d\mathcal L^1(x)$ denotes integration with respect to Lebesgue measure on $\mathbb R$. Bandwidth is the main tuning parameter in kernel methods: it decides how much local averaging is performed before the data are turned into a curve. Earlier chapters treated $h$ as fixed or as a deterministic sequence satisfying asymptotic conditions. This chapter asks how $h$ is chosen from data, how far simple formulas can be trusted, and how adaptive procedures change the amount of smoothing across the sample space.
The guiding tension is that the bandwidth controls both bias and variance. Large $h$ hides fine structure but stabilises the estimator, while small $h$ follows local features but increases sampling noise. Bandwidth selection is therefore a problem of estimating risk, estimating unknown smoothness constants, or comparing many candidate estimators in a way that does not use the unknown density.
## Rule-of-Thumb Selectors and Normal Reference Bandwidth
The first question is what bandwidth would be chosen if the unknown density were replaced by a simple parametric reference model. Rule-of-thumb selectors answer this by inserting a normal density into the asymptotic mean integrated squared error formula. They are fast, interpretable, and useful as starting values, but their assumptions are strong enough that they can oversmooth multimodal or skewed data.
We work with i.i.d. real-valued random variables $X_1,\dots,X_n$ with density $f$, and with a kernel density estimator
\begin{align*}
\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right),
\end{align*}
where $K$ is a symmetric kernel with finite second moment. The target bandwidth is usually defined through integrated squared error or its expectation.
[definition: Mean Integrated Squared Error]
Let $\mathcal H\subset(0,\infty)$ be a bandwidth set, let $f:\mathbb R\to[0,\infty)$ be a density, and let $(\hat f_h)_{h\in\mathcal H}$ be density estimators with $\hat f_h-f\in L^2(\mathbb R)$ almost surely. The mean integrated squared error is the functional $\operatorname{MISE}:\mathcal H\to[0,\infty]$ defined by
\begin{align*}
\operatorname{MISE}(h) = \mathbb E\left[\int_{\mathbb R} (\hat f_h(x)-f(x))^2\,d\mathcal L^1(x)\right].
\end{align*}
[/definition]
The expectation averages over the sample, while the integral measures global squared error over the line. Since $f$ is unknown, exact MISE cannot be minimised directly; this motivates an asymptotic expansion whose terms separate the variance cost of small $h$ from the bias cost of large $h$.
[quotetheorem:6328]
[citeproof:6328]
Each hypothesis is tied to one part of the expansion. Symmetry and the finite second moment make the first non-zero bias term quadratic in $h$; without symmetry, a first-order bias term can dominate and the displayed $h^4$ integrated squared bias is no longer the leading contribution. The condition $R(K)<\infty$ is also substantive: for instance, a normalised kernel with an integrable singularity, $K(u)=c|u|^{-3/4}\mathbb{1}_{(0,1]}(|u|)$, belongs to $L^1(\mathbb R)$ but not to $L^2(\mathbb R)$, because $\int_0^1u^{-3/4}\,d\mathcal L^1(u)<\infty$ while $\int_0^1u^{-3/2}\,d\mathcal L^1(u)=\infty$, so the leading variance constant $R(K)$ is infinite. The finite-moment assumptions cannot be replaced by mere integrability of $K$; a normalised symmetric tail $K(u)=c(1+|u|)^{-4}$ has finite second moment but infinite fourth moment, while heavier symmetric tails such as $c(1+|u|)^{-5/2}$ make $\mu_2(K)$ infinite and destroy the displayed quadratic-bias coefficient. The square-integrability and quantified remainder assumption on $f''$ justify integrating the Taylor expansion; densities with cusps or non-square-integrable curvature can have a different asymptotic risk. The condition $nh\to\infty$ prevents the variance term from staying large, while $h\to0$ prevents persistent smoothing bias. The theorem does not claim that AMISE is accurate for a fixed small sample or for all bandwidths on a numerical grid; it only supplies the leading large-sample risk that the next minimisation uses as a benchmark.
The expansion converts bandwidth selection into a calculus problem once $R(f'')$ is known. The next step is to minimise the two-term approximation, because that gives the benchmark bandwidth against which every data-driven selector is compared.
[quotetheorem:6329]
[citeproof:6329]
The hypotheses ensure that the AMISE curve has exactly the two competing positive terms used in the calculation. If $R(f'')=0$, as would happen for a linear density on an interval before accounting for boundary effects, the formula degenerates because there is no positive curvature penalty in this approximation. If $R(f'')$ is infinite, the displayed minimiser is not meaningful and the AMISE expansion is not the right risk summary. The theorem also minimises the asymptotic approximation, not the exact MISE, so it does not guarantee finite-sample optimality. Its value is to identify the unknown quantity that prevents direct implementation: the curvature functional $R(f'')$. We therefore need a convention for replacing that curvature by a normal reference value, and the following definition records the resulting practical bandwidth.
[definition: Normal Reference Bandwidth]
Fix $n\ge1$ and a positive scale functional $s:\mathbb R^n\to(0,\infty)$. The Gaussian-kernel normal reference bandwidth is the selector $h_{\operatorname{NR}}:\mathbb R^n\to(0,\infty)$ defined by
\begin{align*}
h_{\operatorname{NR}}(X_1,\dots,X_n)=1.06\,s(X_1,\dots,X_n)\, n^{-1/5}.
\end{align*}
[/definition]
Common choices for $s$ are the sample standard deviation or a robust alternative based on the interquartile range. The constant $1.06$ is not universal: it belongs to the Gaussian kernel with a Gaussian reference density. Other kernels change the value of $R(K)$ and $\mu_2(K)$, while non-Gaussian reference densities change $R(f'')$.
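The origin of the constant can be sketched numerically. The code below assumes the standard AMISE minimiser $h=\{R(K)/(\mu_2(K)^2R(f'')n)\}^{1/5}$ together with the Gaussian values $R(K)=1/(2\sqrt\pi)$, $\mu_2(K)=1$, and the unit-variance normal curvature $R(\phi'')=3/(8\sqrt\pi)$; the function names and the small data set are hypothetical.

```python
import math
import statistics

def normal_reference_constant():
    """(R(K) / (mu_2(K)^2 R(phi''))) ^ (1/5) for a Gaussian kernel and N(0, 1) reference."""
    r_kernel = 1.0 / (2.0 * math.sqrt(math.pi))      # R(K) for the Gaussian kernel
    mu2 = 1.0                                        # second moment of the Gaussian kernel
    r_reference = 3.0 / (8.0 * math.sqrt(math.pi))   # R(phi'') with sigma = 1
    return (r_kernel / (mu2 ** 2 * r_reference)) ** 0.2

def normal_reference_bandwidth(data):
    """h_NR = 1.06 * s * n^{-1/5}, with s taken to be the sample standard deviation."""
    n = len(data)
    return 1.06 * statistics.stdev(data) * n ** (-0.2)

data = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical observations
print(normal_reference_constant(), normal_reference_bandwidth(data))
```

The computed constant equals $(4/3)^{1/5}$, which rounds to the familiar $1.06$; a different kernel or reference density would change `r_kernel`, `mu2`, or `r_reference` and hence the constant.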
[example: Gaussian Mixture Normal Reference Bandwidth]
Let $f$ be the mixture density
\begin{align*}
f(x)=0.7\phi_{0,1}(x)+0.3\phi_{4,0.4}(x),
\end{align*}
where $\phi_{\mu,\sigma}$ is the $\mathcal N(\mu,\sigma^2)$ density. If $Z$ is the component label, then the mixture mean is
\begin{align*}
m=0.7\cdot0+0.3\cdot4=1.2.
\end{align*}
Using $\operatorname{Var}(X)=\mathbb E[\operatorname{Var}(X\mid Z)]+\operatorname{Var}(\mathbb E[X\mid Z])$, its variance is
\begin{align*}
\sigma_f^2=0.7\{1^2+(0-1.2)^2\}+0.3\{0.4^2+(4-1.2)^2\}.
\end{align*}
The two contributions are
\begin{align*}
0.7\{1+1.44\}=0.7(2.44)=1.708
\end{align*}
and
\begin{align*}
0.3\{0.16+7.84\}=0.3(8)=2.4.
\end{align*}
Therefore
\begin{align*}
\sigma_f^2=1.708+2.4=4.108.
\end{align*}
Thus the normal reference rule using the population scale would choose
\begin{align*}
h_{\operatorname{NR}}=1.06\sqrt{4.108}\,n^{-1/5}\approx 1.06(2.0268)n^{-1/5}\approx 2.148\,n^{-1/5}.
\end{align*}
The issue is that this scale treats the two separated components as one broad normal curve. For
\begin{align*}
\phi_{\mu,\sigma}(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\end{align*}
differentiating with respect to $x$ gives
\begin{align*}
\phi_{\mu,\sigma}'(x)=-\frac{x-\mu}{\sigma^2}\phi_{\mu,\sigma}(x).
\end{align*}
Differentiating once more gives
\begin{align*}
\phi_{\mu,\sigma}''(x)=-\frac{1}{\sigma^2}\phi_{\mu,\sigma}(x)-\frac{x-\mu}{\sigma^2}\phi_{\mu,\sigma}'(x).
\end{align*}
Substituting the formula for $\phi_{\mu,\sigma}'$ yields
\begin{align*}
\phi_{\mu,\sigma}''(x)=\left(\frac{(x-\mu)^2}{\sigma^4}-\frac{1}{\sigma^2}\right)\phi_{\mu,\sigma}(x).
\end{align*}
Put $y=(x-\mu)/\sigma$, so that $d\mathcal L^1(x)=\sigma\,d\mathcal L^1(y)$ and $\phi_{\mu,\sigma}(x)=\sigma^{-1}\phi_{0,1}(y)$. Then
\begin{align*}
R(\phi_{\mu,\sigma}'')=\frac{1}{\sigma^5}\int_{\mathbb R}(y^2-1)^2\frac{1}{2\pi}e^{-y^2}\,d\mathcal L^1(y).
\end{align*}
Using
\begin{align*}
\int_{\mathbb R}e^{-y^2}\,d\mathcal L^1(y)=\sqrt{\pi},\quad \int_{\mathbb R}y^2e^{-y^2}\,d\mathcal L^1(y)=\frac{\sqrt{\pi}}{2},\quad \int_{\mathbb R}y^4e^{-y^2}\,d\mathcal L^1(y)=\frac{3\sqrt{\pi}}{4},
\end{align*}
we obtain
\begin{align*}
\int_{\mathbb R}(y^2-1)^2\frac{1}{2\pi}e^{-y^2}\,d\mathcal L^1(y)=\frac{1}{2\pi}\left(\frac{3\sqrt{\pi}}{4}-2\cdot\frac{\sqrt{\pi}}{2}+\sqrt{\pi}\right)=\frac{3}{8\sqrt{\pi}}.
\end{align*}
Therefore
\begin{align*}
R(\phi_{\mu,\sigma}'')=\frac{3}{8\sqrt{\pi}\,\sigma^5}.
\end{align*}
For the cross term, let $d=0-4=-4$ and $S=1^2+0.4^2=1.16$. The Gaussian product integral is
\begin{align*}
I(d)=\int_{\mathbb R}\phi_{0,1}(x)\phi_{4,0.4}(x)\,d\mathcal L^1(x)=\frac{1}{\sqrt{2\pi S}}\exp\left(-\frac{d^2}{2S}\right).
\end{align*}
Since differentiating twice in each mean differentiates this product integral four times in $d$,
\begin{align*}
\int_{\mathbb R}\phi_{0,1}''(x)\phi_{4,0.4}''(x)\,d\mathcal L^1(x)=I^{(4)}(d).
\end{align*}
Writing $I(d)=C\exp(-d^2/(2S))$ with $C=(2\pi S)^{-1/2}$, successive differentiation gives
\begin{align*}
I'(d)=-\frac{d}{S}I(d).
\end{align*}
Then
\begin{align*}
I''(d)=\left(\frac{d^2}{S^2}-\frac{1}{S}\right)I(d).
\end{align*}
Differentiating again,
\begin{align*}
I'''(d)=\left(-\frac{d^3}{S^3}+\frac{3d}{S^2}\right)I(d).
\end{align*}
One more differentiation gives
\begin{align*}
I^{(4)}(d)=\left(\frac{d^4}{S^4}-\frac{6d^2}{S^3}+\frac{3}{S^2}\right)I(d).
\end{align*}
Substituting $d=-4$ and $S=1.16$ yields
\begin{align*}
\int_{\mathbb R}\phi_{0,1}''(x)\phi_{4,0.4}''(x)\,d\mathcal L^1(x)=\left(\frac{256}{1.16^4}-\frac{96}{1.16^3}+\frac{3}{1.16^2}\right)\frac{1}{\sqrt{2\pi\cdot1.16}}\exp\left(-\frac{16}{2\cdot1.16}\right)\approx 0.0308.
\end{align*}
Since
\begin{align*}
f''(x)=0.7\phi_{0,1}''(x)+0.3\phi_{4,0.4}''(x),
\end{align*}
expanding the square gives
\begin{align*}
R(f'')=0.7^2R(\phi_{0,1}'')+0.3^2R(\phi_{4,0.4}'')+2(0.7)(0.3)\int_{\mathbb R}\phi_{0,1}''(x)\phi_{4,0.4}''(x)\,d\mathcal L^1(x).
\end{align*}
Thus
\begin{align*}
R(f'')=0.49\frac{3}{8\sqrt{\pi}}+0.09\frac{3}{8\sqrt{\pi}\,0.4^5}+0.42(0.0308)\approx 0.1037+1.8595+0.0129\approx 1.9761.
\end{align*}
By contrast, a single normal density with the same mean and variance has curvature
\begin{align*}
R(\phi_{1.2,\sqrt{4.108}}'')=\frac{3}{8\sqrt{\pi}(\sqrt{4.108})^5}=\frac{3}{8\sqrt{\pi}(4.108)^{5/2}}\approx 0.0062.
\end{align*}
The AMISE-optimal bandwidth has curvature factor $R(f'')^{-1/5}$, so replacing the mixture curvature by the same-variance normal curvature changes that factor by
\begin{align*}
\frac{(0.0062)^{-1/5}}{(1.9761)^{-1/5}}=\left(\frac{1.9761}{0.0062}\right)^{1/5}\approx 3.17.
\end{align*}
The normal reference bandwidth is therefore adapted to a broad unimodal approximation, so it tends to oversmooth the sharp component near $4$ and can flatten or merge that feature in the estimated density.
[/example]
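The numerical values in this example can be reproduced from the closed-form expressions it derives. The sketch below is a check under the example's own formulas; the function names are illustrative assumptions.

```python
import math

def curvature_normal(sigma):
    """R(phi'') = 3 / (8 sqrt(pi) sigma^5) for a N(mu, sigma^2) density."""
    return 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)

def cross_curvature(mu1, sigma1, mu2, sigma2):
    """Integral of phi_1'' phi_2'', obtained as the fourth derivative I^{(4)}(d)."""
    d = mu1 - mu2
    S = sigma1 ** 2 + sigma2 ** 2
    product_integral = math.exp(-d * d / (2.0 * S)) / math.sqrt(2.0 * math.pi * S)
    return (d ** 4 / S ** 4 - 6.0 * d ** 2 / S ** 3 + 3.0 / S ** 2) * product_integral

w1, mu1, sigma1 = 0.7, 0.0, 1.0
w2, mu2, sigma2 = 0.3, 4.0, 0.4

mean = w1 * mu1 + w2 * mu2
variance = (w1 * (sigma1 ** 2 + (mu1 - mean) ** 2)
            + w2 * (sigma2 ** 2 + (mu2 - mean) ** 2))

r_mixture = (w1 ** 2 * curvature_normal(sigma1)
             + w2 ** 2 * curvature_normal(sigma2)
             + 2.0 * w1 * w2 * cross_curvature(mu1, sigma1, mu2, sigma2))
r_reference = curvature_normal(math.sqrt(variance))
bandwidth_ratio = (r_mixture / r_reference) ** 0.2

print(mean, variance, r_mixture, r_reference, bandwidth_ratio)
```

The final ratio is the factor by which the same-variance normal curvature inflates the AMISE bandwidth relative to the true mixture curvature.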
This example explains the role of rule-of-thumb selectors in practice. They are useful baselines and initial values for numerical optimisation, but their dependence on a reference shape motivates risk-estimation and plug-in methods.
## Cross-Validation and Plug-In Selection
The next question is whether the data can estimate the loss of a bandwidth without assuming a reference density. Least-squares cross-validation starts from integrated squared error and removes the unknown constant that does not depend on $h$. Biased cross-validation and plug-in methods instead estimate the unknown curvature terms appearing in the AMISE expansion.
For a realised sample, the integrated squared error is
\begin{align*}
\operatorname{ISE}(h)=\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)-2\int_{\mathbb R}\hat f_h(x)f(x)\,d\mathcal L^1(x)+\int_{\mathbb R}f(x)^2\,d\mathcal L^1(x).
\end{align*}
The final term does not depend on $h$. The challenge is estimating the middle term without using the same observation twice in a way that creates optimistic bias.
[definition: Leave-One-Out Kernel Density Estimator]
Let $n\ge2$, let $K:\mathbb R\to\mathbb R$ be measurable, and let $\mathcal M(\mathbb R)$ denote the space of real-valued Borel-measurable functions on $\mathbb R$. For each $1\le i\le n$, the leave-one-out kernel density estimator is the map from $\mathbb R^n\times(0,\infty)$ to $\mathcal M(\mathbb R)$ that sends $((X_1,\dots,X_n),h)$ to $\hat f_{h,-i}$,
where $\hat f_{h,-i}:\mathbb R\to\mathbb R$ is defined by
\begin{align*}
\hat f_{h,-i}(x)=\frac{1}{(n-1)h}\sum_{j\ne i}K\left(\frac{x-X_j}{h}\right), \qquad x\in\mathbb R.
\end{align*}
[/definition]
Removing $X_i$ from the estimate evaluated at $X_i$ is the key independence device. This motivates a cross-validation criterion in which the empirical cross term is computed from leave-one-out fitted values instead of from $\hat f_h(X_i)$ itself.
[definition: Least-Squares Cross-Validation Criterion]
Let $\mathcal H_n\subset(0,\infty)$ be a specified bandwidth set such that the following integrals are finite. The least-squares cross-validation criterion is the random functional $\operatorname{LSCV}:\mathcal H_n\to\mathbb R$ defined by
\begin{align*}
\operatorname{LSCV}(h)=\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)-\frac{2}{n}\sum_{i=1}^n \hat f_{h,-i}(X_i).
\end{align*}
[/definition]
A least-squares cross-validation bandwidth is any element of $\operatorname{argmin}_{h\in\mathcal H_n}\operatorname{LSCV}(h)$ when this set is nonempty. The criterion is random and can have several local minima, so the bandwidth set and numerical grid matter. Its appeal comes from the following unbiasedness property, which shows that the criterion targets MISE up to an additive constant.
[quotetheorem:6330]
[citeproof:6330]
Each substantive hypothesis prevents a specific failure. The leave-one-out condition is essential: if $\hat f_h(X_i)$ is used instead of $\hat f_{h,-i}(X_i)$ and $K(0)>0$, then every summand contains the self-contribution $K(0)/(nh)$, so the empirical cross term is inflated by $2K(0)/(nh)$ and the criterion is driven toward very small bandwidths. The condition that the displayed integrals are finite is also substantive. For example, if $K(u)=c|u|^{-3/4}\mathbb{1}_{(0,1]}(|u|)$ with the constant chosen so that $\int K=1$, then $K\in L^1(\mathbb R)$ but $K\notin L^2(\mathbb R)$, and $\int \hat f_h(x)^2\,d\mathcal L^1(x)=\infty$ whenever at least one observation is present. If the density itself is not square-integrable, such as $f(x)=\frac{1}{2\sqrt{x}}\mathbb{1}_{(0,1)}(x)$, then the additive constant $\int f^2$ is infinite and the MISE decomposition used in the theorem is no longer a finite identity. The fixed-$h$ assumption matters as well: unbiasedness at each prescribed bandwidth does not imply that the random minimiser over a dense grid is unbiased or stable. The constant is irrelevant for minimising over $h$, so LSCV aims at the same target as MISE, but its weakness is variance: an unbiased estimate of a whole risk curve can still be too noisy to locate the best smoothing level in small samples.
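The criterion and the self-contribution effect can be sketched for the Gaussian kernel, where $\int\hat f_h^2$ has a closed form because the convolution of two $\mathcal N(0,h^2)$ densities is the $\mathcal N(0,2h^2)$ density. The data, the bandwidth grid, and all function names below are hypothetical choices for illustration.

```python
import math

def gauss(u, sd=1.0):
    """Density of N(0, sd^2) at u."""
    return math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def kde(x, data, h):
    """Full-sample Gaussian KDE at x."""
    return sum(gauss((x - xj) / h) for xj in data) / (len(data) * h)

def kde_loo(i, data, h):
    """Leave-one-out estimate evaluated at the omitted observation X_i."""
    n = len(data)
    return sum(gauss((data[i] - data[j]) / h) for j in range(n) if j != i) / ((n - 1) * h)

def integral_fhat_squared(data, h):
    """Closed form of the integral of fhat_h^2 for the Gaussian kernel."""
    n = len(data)
    sd = h * math.sqrt(2.0)
    return sum(gauss(a - b, sd) for a in data for b in data) / (n * n)

def lscv(data, h):
    """Least-squares cross-validation criterion with the leave-one-out cross term."""
    n = len(data)
    return integral_fhat_squared(data, h) - 2.0 / n * sum(kde_loo(i, data, h) for i in range(n))

def lscv_naive(data, h):
    """Same criterion with fhat_h(X_i) in place of the leave-one-out values (biased)."""
    n = len(data)
    return integral_fhat_squared(data, h) - 2.0 / n * sum(kde(x, data, h) for x in data)

data = [-1.9, -1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, 1.6, 2.2]   # hypothetical sample
grid = [0.1 * k for k in range(1, 21)]
h_lscv = min(grid, key=lambda h: lscv(data, h))
h_naive = min(grid, key=lambda h: lscv_naive(data, h))
print(h_lscv, h_naive)
```

Because $\hat f_h(X_i)=K(0)/(nh)+\tfrac{n-1}{n}\hat f_{h,-i}(X_i)$, the naive version subtracts an extra term of order $2K(0)/(nh)$, which rewards small bandwidths on any sample.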
[example: Small-Sample Cross-Validation Instability]
Suppose $n=40$ observations are sampled from the smooth mixture density
\begin{align*}
f(x)=0.5\phi_{-1,1}(x)+0.5\phi_{1,1}(x).
\end{align*}
For a realised sample, the leave-one-out estimator is
\begin{align*}
\hat f_{h,-i}(x)=\frac{1}{39h}\sum_{j\ne i}K\left(\frac{x-X_j}{h}\right).
\end{align*}
Evaluating at $x=X_i$ gives
\begin{align*}
\hat f_{h,-i}(X_i)=\frac{1}{39h}\sum_{j\ne i}K\left(\frac{X_i-X_j}{h}\right).
\end{align*}
Multiplying by $2/40$ and summing over $i$ gives the leave-one-out part of least-squares cross-validation:
\begin{align*}
\frac{2}{40}\sum_{i=1}^{40}\hat f_{h,-i}(X_i)=\frac{2}{40\cdot39h}\sum_{i=1}^{40}\sum_{j\ne i}K\left(\frac{X_i-X_j}{h}\right).
\end{align*}
The squared-norm term starts from
\begin{align*}
\hat f_h(x)=\frac{1}{40h}\sum_{i=1}^{40}K\left(\frac{x-X_i}{h}\right).
\end{align*}
Squaring the finite sum gives
\begin{align*}
\hat f_h(x)^2=\frac{1}{40^2h^2}\sum_{i=1}^{40}\sum_{j=1}^{40}K\left(\frac{x-X_i}{h}\right)K\left(\frac{x-X_j}{h}\right).
\end{align*}
Therefore
\begin{align*}
\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)=\frac{1}{40^2h^2}\sum_{i=1}^{40}\sum_{j=1}^{40}\int_{\mathbb R}K\left(\frac{x-X_i}{h}\right)K\left(\frac{x-X_j}{h}\right)\,d\mathcal L^1(x).
\end{align*}
For a fixed pair $(i,j)$, put $u=(x-X_i)/h$. Then $x=X_i+hu$, $d\mathcal L^1(x)=h\,d\mathcal L^1(u)$, and
\begin{align*}
\frac{x-X_j}{h}=u+\frac{X_i-X_j}{h}.
\end{align*}
Substituting into the previous display gives
\begin{align*}
\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)=\frac{1}{40^2h}\sum_{i=1}^{40}\sum_{j=1}^{40}\int_{\mathbb R}K(u)K\left(u+\frac{X_i-X_j}{h}\right)\,d\mathcal L^1(u).
\end{align*}
When $i=j$, the spacing is $X_i-X_i=0$, so the diagonal contribution is
\begin{align*}
\frac{1}{40^2h}\sum_{i=1}^{40}\int_{\mathbb R}K(u)^2\,d\mathcal L^1(u)=\frac{40R(K)}{40^2h}=\frac{R(K)}{40h}.
\end{align*}
The off-diagonal part is therefore
\begin{align*}
\frac{1}{40^2h}\sum_{i=1}^{40}\sum_{j\ne i}\int_{\mathbb R}K(u)K\left(u+\frac{X_i-X_j}{h}\right)\,d\mathcal L^1(u).
\end{align*}
Every off-diagonal summand depends on the realised spacing $X_i-X_j$ through the ratio $(X_i-X_j)/h$.
Combining the squared-norm term with the leave-one-out term, the realised criterion can be written as
\begin{align*}
\operatorname{LSCV}(h)=\frac{R(K)}{40h}+\frac{1}{40^2h}\sum_{i=1}^{40}\sum_{j\ne i}\int_{\mathbb R}K(u)K\left(u+\frac{X_i-X_j}{h}\right)\,d\mathcal L^1(u)-\frac{2}{40\cdot39h}\sum_{i=1}^{40}\sum_{j\ne i}K\left(\frac{X_i-X_j}{h}\right).
\end{align*}
Thus the common factor $h^{-1}$ is multiplied by sums whose values depend on the realised spacings. If two observations are close, then $(X_i-X_j)/h$ remains near $0$ over a wider range of bandwidths than it does for a well-separated pair; if an observation is isolated, its off-diagonal contributions can be negligible for small $h$ and then change rapidly as $h$ increases.
If the minimum over a fine grid occurs at a very small $h$, then
\begin{align*}
\hat f_h(x)=\frac{1}{40h}\sum_{i=1}^{40}K\left(\frac{x-X_i}{h}\right)
\end{align*}
places a narrow kernel bump around each observation, with height scale $1/(40h)$ and horizontal scale $h$. Random spacing features in the sample can then appear as artificial bumps in the estimated curve, even though the mixture density $0.5\phi_{-1,1}+0.5\phi_{1,1}$ is smooth. A new sample changes the spacings $X_i-X_j$, hence changes both off-diagonal sums in $\operatorname{LSCV}(h)$, so the minimising bandwidth can move substantially from one sample to the next.
[/example]
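The sample-to-sample movement of the minimiser can be observed numerically. The following is a minimal sketch, assuming a standard Gaussian kernel (the example leaves $K$ general), for which $\int K(u)K(u+d)\,d\mathcal L^1(u)$ is the $N(0,2)$ density at $d$ and $R(K)=1/(2\sqrt{\pi})$. The bandwidth grid, the seeds, and the helper names `lscv`, `sample_mixture`, and `lscv_bandwidth` are illustrative choices, not part of the notes.

```python
import math
import random

def gauss(u):
    """Standard Gaussian kernel K(u) (an assumed choice of kernel)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def gauss_conv(d):
    """Closed form of the integral of K(u) K(u + d) du: the N(0, 2) density at d."""
    return math.exp(-0.25 * d * d) / (2.0 * math.sqrt(math.pi))

def lscv(xs, h):
    """Realised LSCV(h): squared-norm term minus the leave-one-out cross term."""
    n = len(xs)
    sq_norm = 0.0  # all pairs (i, j); the diagonal contributes n * R(K)
    loo = 0.0      # pairs with j != i only
    for i in range(n):
        for j in range(n):
            d = (xs[i] - xs[j]) / h
            sq_norm += gauss_conv(d)
            if j != i:
                loo += gauss(d)
    return sq_norm / (n * n * h) - 2.0 * loo / (n * (n - 1) * h)

def sample_mixture(n, rng):
    """Draw n points from 0.5 N(-1, 1) + 0.5 N(1, 1)."""
    return [rng.gauss(-1.0 if rng.random() < 0.5 else 1.0, 1.0) for _ in range(n)]

def lscv_bandwidth(xs, grid):
    """Grid minimiser of the realised criterion."""
    return min(grid, key=lambda h: lscv(xs, h))

grid = [0.05 * k for k in range(1, 41)]  # hypothetical grid 0.05, 0.10, ..., 2.00
selected = [lscv_bandwidth(sample_mixture(40, random.Random(seed)), grid)
            for seed in range(5)]
print(selected)  # one selected bandwidth per simulated sample
```

Each entry of `selected` comes from a fresh sample of size $40$ from the same mixture, so the spread of the printed values is a direct view of the variability of the realised minimiser.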
The small-sample example shows how direct risk estimation can produce a noisy objective curve. A plug-in method takes a different route: it accepts the AMISE approximation as the smoother target and estimates only the unknown curvature term in that target. To make this precise, we need a criterion whose random part is the estimated value of $R(f'')$, while the bandwidth dependence remains the deterministic AMISE form.
[definition: Plug-In AMISE Criterion]
Let $\mathcal H_n\subset(0,\infty)$ be a bandwidth set and let $\widehat{R(f'')}:\mathbb R^n\to[0,\infty)$ be a data-based estimator of $R(f'')$. The plug-in AMISE criterion is the random functional $\operatorname{PIAMISE}:\mathcal H_n\to\mathbb R$ defined by
\begin{align*}
\operatorname{PIAMISE}(h)=\frac{R(K)}{nh}+\frac{h^4\mu_2(K)^2}{4}\widehat{R(f'')}.
\end{align*}
[/definition]
A plug-in AMISE selector is any map from the sample to an element of $\operatorname{argmin}_{h\in\mathcal H_n}\operatorname{PIAMISE}(h)$ when this set is nonempty. This criterion has the same shape as AMISE with the unknown curvature replaced by an estimate. In the narrower classical terminology, biased cross-validation estimates the integrated squared bias by kernel-derivative functionals rather than by merely inserting an arbitrary estimator of $R(f'')$ into AMISE. The present notes use the displayed criterion as the direct plug-in route, and the following theorem records the corresponding closed-form bandwidth.
[quotetheorem:6331]
[citeproof:6331]
Each assumption has a concrete role in the ratio conclusion. Positivity of the target $R(f'')$ excludes flat AMISE curvature: if $R(f'')=0$, the reference formula has no finite interior minimiser because the leading bias penalty has disappeared. Positivity of $\widehat{R(f'')}$ keeps the implemented bandwidth in $(0,\infty)$; if the estimator is zero, $\hat h_{\operatorname{PI}}=\infty$, while a negative overcorrected estimate makes the real fifth-root bandwidth undefined as a positive smoothing parameter. Consistency is needed, not just boundedness or eventual positivity. For a specific failure, suppose the true value is $R(f'')=r>0$ but the pilot procedure always uses such a large pilot bandwidth that its curvature estimate converges in probability to $r/32$. Then
\begin{align*}
\frac{\hat h_{\operatorname{PI}}}{h_{\operatorname{AMISE}}}
\xrightarrow{\mathbb P}
\left(\frac{r}{r/32}\right)^{1/5}
=2,
\end{align*}
so the selector remains twice the AMISE bandwidth asymptotically. If the pilot estimate converges instead to $4r$, the limiting ratio is $4^{-1/5}$ and the final estimator undersmooths. The theorem says only that the plug-in formula matches the AMISE minimiser asymptotically under consistent curvature estimation, not that it estimates the exact finite-sample MISE minimiser. Plug-in selectors can be more stable than LSCV, but they move the difficulty to estimating a derivative functional. Pilot bandwidths, higher-order kernels, and recursive plug-in schemes are all ways of handling that second smoothing problem.
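The closed-form minimiser and the factor-of-two failure can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: a Gaussian kernel, so that $R(K)=1/(2\sqrt{\pi})$ and $\mu_2(K)=1$, and hypothetical values `n = 500` and `r = 0.3` standing in for the sample size and the true curvature $R(f'')$.

```python
import math

R_K = 1.0 / (2.0 * math.sqrt(math.pi))  # R(K) for the Gaussian kernel (assumed)
MU2_K = 1.0                              # second moment of the Gaussian kernel

def piamise(h, n, r_hat):
    """Plug-in AMISE criterion at bandwidth h with curvature estimate r_hat."""
    return R_K / (n * h) + 0.25 * h ** 4 * MU2_K ** 2 * r_hat

def plug_in_bandwidth(n, r_hat):
    """Closed-form minimiser of piamise over h > 0; requires r_hat > 0."""
    if r_hat <= 0.0:
        raise ValueError("curvature estimate must be positive")
    return (R_K / (MU2_K ** 2 * r_hat * n)) ** 0.2

n, r = 500, 0.3                              # hypothetical sample size and R(f'')
h_amise = plug_in_bandwidth(n, r)            # bandwidth under the true curvature
h_biased = plug_in_bandwidth(n, r / 32.0)    # curvature underestimated by a factor 32
ratio = h_biased / h_amise                   # 32 ** (1/5) = 2 up to rounding

# Brute-force check of the closed form on a fine grid around h_amise.
grid = [h_amise * (0.5 + 0.001 * k) for k in range(1001)]
h_grid = min(grid, key=lambda h: piamise(h, n, r))
print(h_amise, h_grid, ratio)
```

The grid minimiser agrees with the closed form up to the grid spacing, and the ratio does not depend on $n$ or $r$, which is why a persistent multiplicative error in the curvature estimate never washes out.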
[example: Pilot Bandwidth for a Curvature Functional]
Take a twice differentiable density $f$, a twice differentiable pilot kernel $K$, and a pilot bandwidth $g>0$. If
\begin{align*}
\hat f_g(x)=\frac{1}{ng}\sum_{i=1}^n K\left(\frac{x-X_i}{g}\right),
\end{align*}
then, for one summand,
\begin{align*}
\frac{d}{dx}\left\{\frac{1}{g}K\left(\frac{x-X_i}{g}\right)\right\}
=\frac{1}{g}K'\left(\frac{x-X_i}{g}\right)\frac{1}{g}
=\frac{1}{g^2}K'\left(\frac{x-X_i}{g}\right).
\end{align*}
Differentiating the same summand once more gives
\begin{align*}
\frac{d^2}{dx^2}\left\{\frac{1}{g}K\left(\frac{x-X_i}{g}\right)\right\}
=\frac{1}{g^2}K''\left(\frac{x-X_i}{g}\right)\frac{1}{g}
=\frac{1}{g^3}K''\left(\frac{x-X_i}{g}\right).
\end{align*}
Therefore
\begin{align*}
\hat f_g''(x)=\frac{1}{ng^3}\sum_{i=1}^n K''\left(\frac{x-X_i}{g}\right).
\end{align*}
The uncorrected plug-in curvature estimate is
\begin{align*}
\widehat{R(f'')}=\int_{\mathbb R}\left\{\frac{1}{ng^3}\sum_{i=1}^n K''\left(\frac{x-X_i}{g}\right)\right\}^2\,d\mathcal L^1(x).
\end{align*}
Expanding the square of the finite sum gives
\begin{align*}
\left\{\sum_{i=1}^n K''\left(\frac{x-X_i}{g}\right)\right\}^2=\sum_{i=1}^n\sum_{j=1}^n K''\left(\frac{x-X_i}{g}\right)K''\left(\frac{x-X_j}{g}\right).
\end{align*}
Substituting this expansion into the integral gives
\begin{align*}
\widehat{R(f'')}=\frac{1}{n^2g^6}\sum_{i=1}^n\sum_{j=1}^n\int_{\mathbb R}K''\left(\frac{x-X_i}{g}\right)K''\left(\frac{x-X_j}{g}\right)\,d\mathcal L^1(x).
\end{align*}
For a fixed pair $(i,j)$, put $u=(x-X_i)/g$. Then $x=X_i+gu$, $d\mathcal L^1(x)=g\,d\mathcal L^1(u)$, and
\begin{align*}
\frac{x-X_j}{g}=\frac{X_i+gu-X_j}{g}=u+\frac{X_i-X_j}{g}.
\end{align*}
Thus
\begin{align*}
\int_{\mathbb R}K''\left(\frac{x-X_i}{g}\right)K''\left(\frac{x-X_j}{g}\right)\,d\mathcal L^1(x)=g\int_{\mathbb R}K''(u)K''\left(u+\frac{X_i-X_j}{g}\right)\,d\mathcal L^1(u).
\end{align*}
Hence
\begin{align*}
\widehat{R(f'')}=\frac{1}{n^2g^5}\sum_{i=1}^n\sum_{j=1}^n\int_{\mathbb R}K''(u)K''\left(u+\frac{X_i-X_j}{g}\right)\,d\mathcal L^1(u).
\end{align*}
When $i=j$, the spacing is $X_i-X_i=0$, so the diagonal contribution is
\begin{align*}
\frac{1}{n^2g^5}\sum_{i=1}^n\int_{\mathbb R}K''(u)^2\,d\mathcal L^1(u)=\frac{1}{n^2g^5}\sum_{i=1}^n R(K'')=\frac{R(K'')}{ng^5}.
\end{align*}
The off-diagonal contribution is
\begin{align*}
\frac{1}{n^2g^5}\sum_{i=1}^n\sum_{j\ne i}\int_{\mathbb R}K''(u)K''\left(u+\frac{X_i-X_j}{g}\right)\,d\mathcal L^1(u).
\end{align*}
Thus a very small pilot bandwidth can inflate the estimate through the diagonal factor $g^{-5}$ and through off-diagonal terms whose arguments depend on the ratios $(X_i-X_j)/g$.
A very large $g$ has the opposite effect through bias in the pilot derivative. Since the observations have density $f$,
\begin{align*}
\mathbb E[\hat f_g''(x)]=\frac{1}{ng^3}\sum_{i=1}^n\mathbb E\left[K''\left(\frac{x-X_i}{g}\right)\right]=\frac{1}{g^3}\int_{\mathbb R}K''\left(\frac{x-y}{g}\right)f(y)\,d\mathcal L^1(y).
\end{align*}
Put $u=(x-y)/g$, so that $y=x-gu$ and $d\mathcal L^1(y)=g\,d\mathcal L^1(u)$. Then
\begin{align*}
\mathbb E[\hat f_g''(x)]=\frac{1}{g^2}\int_{\mathbb R}K''(u)f(x-gu)\,d\mathcal L^1(u).
\end{align*}
Assume the boundary terms vanish in the two integrations by parts. The first [integration by parts](/theorems/210) gives
\begin{align*}
\int_{\mathbb R}K''(u)f(x-gu)\,d\mathcal L^1(u)=\left[K'(u)f(x-gu)\right]_{-\infty}^{\infty}-\int_{\mathbb R}K'(u)\frac{d}{du}f(x-gu)\,d\mathcal L^1(u).
\end{align*}
Because $\frac{d}{du}f(x-gu)=-g f'(x-gu)$ and the boundary term is $0$, this becomes
\begin{align*}
\int_{\mathbb R}K''(u)f(x-gu)\,d\mathcal L^1(u)=g\int_{\mathbb R}K'(u)f'(x-gu)\,d\mathcal L^1(u).
\end{align*}
The second [integration by parts](/theorems/2098) gives
\begin{align*}
\int_{\mathbb R}K'(u)f'(x-gu)\,d\mathcal L^1(u)=\left[K(u)f'(x-gu)\right]_{-\infty}^{\infty}-\int_{\mathbb R}K(u)\frac{d}{du}f'(x-gu)\,d\mathcal L^1(u).
\end{align*}
Because $\frac{d}{du}f'(x-gu)=-g f''(x-gu)$ and the boundary term is $0$, this becomes
\begin{align*}
\int_{\mathbb R}K'(u)f'(x-gu)\,d\mathcal L^1(u)=g\int_{\mathbb R}K(u)f''(x-gu)\,d\mathcal L^1(u).
\end{align*}
Combining the two integrations by parts,
\begin{align*}
\int_{\mathbb R}K''(u)f(x-gu)\,d\mathcal L^1(u)=g^2\int_{\mathbb R}K(u)f''(x-gu)\,d\mathcal L^1(u).
\end{align*}
Therefore
\begin{align*}
\mathbb E[\hat f_g''(x)]=\frac{1}{g^2}\cdot g^2\int_{\mathbb R}K(u)f''(x-gu)\,d\mathcal L^1(u)=\int_{\mathbb R}K(u)f''(x-gu)\,d\mathcal L^1(u).
\end{align*}
Thus the expected pilot derivative is a kernel average of curvature values over the scale $g$, so a large pilot bandwidth can suppress sharp local curvature before the curvature functional is inserted into
\begin{align*}
\hat h_{\operatorname{PI}}=\left(\frac{R(K)}{\mu_2(K)^2\widehat{R(f'')}n}\right)^{1/5}.
\end{align*}
The final plug-in bandwidth is therefore controlled by a previous smoothing choice: $g$ stabilises the estimated curvature when it is not too small, but it biases the curvature downward when it averages over features that are sharper than the pilot scale.
[/example]
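The two opposing effects of the pilot bandwidth can be seen on simulated data. The sketch below assumes a Gaussian pilot kernel, for which $\int K''(u)K''(u+d)\,d\mathcal L^1(u)$ equals the fourth derivative of the $N(0,2)$ density at $d$ and $R(K'')=3/(8\sqrt{\pi})$. It also assumes standard normal observations, whose true curvature is $R(f'')=3/(8\sqrt{\pi})$. The sample size, seed, and pilot values are hypothetical.

```python
import math
import random

def conv_k2(d):
    """Integral of K''(u) K''(u + d) du for Gaussian K.

    This is the fourth derivative of the N(0, 2) density evaluated at d.
    """
    base = math.exp(-0.25 * d * d) / (2.0 * math.sqrt(math.pi))
    return base * (d ** 4 / 16.0 - 0.75 * d * d + 0.75)

def curvature_estimate(xs, g):
    """Uncorrected estimate of R(f''): double sum over all pairs, diagonal included."""
    n = len(xs)
    total = 0.0
    for xi in xs:
        for xj in xs:
            total += conv_k2((xi - xj) / g)
    return total / (n * n * g ** 5)

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]   # hypothetical N(0, 1) sample
true_r = 3.0 / (8.0 * math.sqrt(math.pi))        # R(f'') for the N(0, 1) density

# Very small, moderate, and very large pilot bandwidths.
estimates = {g: curvature_estimate(xs, g) for g in (0.05, 0.5, 3.0)}
for g, value in estimates.items():
    print(g, value, true_r)
```

Because $\hat h_{\operatorname{PI}}$ depends on the curvature estimate through a negative fifth power, an inflated estimate at small $g$ shrinks the final bandwidth, while a deflated estimate at large $g$ enlarges it.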
The three selector families now have distinct personalities: rule-of-thumb methods estimate a reference risk, LSCV estimates finite-sample risk up to a constant, and plug-in methods estimate the leading asymptotic risk. The last part of the chapter changes the problem by allowing bandwidths to depend on location or on the data point being smoothed.
## Adaptive Bandwidths and Local Choice
A single global bandwidth cannot be equally suitable in a dense mode and in a sparse tail. The final question is how to smooth more in regions with little data while preserving detail where observations are concentrated. Adaptive bandwidths answer this by replacing one number $h$ with a local bandwidth function or a data-dependent model-selection rule.
[illustration:adaptive-bandwidth-comparison]
[definition: Balloon Kernel Density Estimator]
Let $K:\mathbb R\to[0,\infty)$ be integrable, let $\mathcal B$ be a class of bandwidth functions $h:\mathbb R\to(0,\infty)$, and let $\mathcal M_+(\mathbb R)$ denote the space of nonnegative Borel-measurable functions on $\mathbb R$. The balloon estimator is the map from $\mathbb R^n\times\mathcal B$ to $\mathcal M_+(\mathbb R)$ that sends $((X_1,\dots,X_n),h)$ to $\hat f_{\operatorname{bal}}$,
where $\hat f_{\operatorname{bal}}:\mathbb R\to[0,\infty)$ is defined by
\begin{align*}
\hat f_{\operatorname{bal}}(x)=\frac{1}{n h(x)}\sum_{i=1}^nK\left(\frac{x-X_i}{h(x)}\right).
\end{align*}
[/definition]
The bandwidth is attached to the evaluation point $x$. This makes the estimator responsive to location, but it may fail to integrate to $1$ because the normalising factor changes with $x$ inside the integral over the whole line. To preserve the density normalisation, the next adaptive construction attaches the bandwidth to each observation instead.
[definition: Sample-Point Kernel Density Estimator]
Let $K:\mathbb R\to[0,\infty)$ be integrable with $\int_{\mathbb R}K(u)\,d\mathcal L^1(u)=1$. The sample-point estimator is the map from $\mathbb R^n\times(0,\infty)^n$ to nonnegative functions in $L^1(\mathbb R)$ sending $((X_1,\dots,X_n),(h_1,\dots,h_n))$ to $\hat f_{\operatorname{sp}}:\mathbb R\to[0,\infty)$ defined by
\begin{align*}
\hat f_{\operatorname{sp}}(x)=\frac{1}{n}\sum_{i=1}^n\frac{1}{h_i}K\left(\frac{x-X_i}{h_i}\right).
\end{align*}
[/definition]
Here each observation contributes a kernel with its own width. If $K$ integrates to $1$, then $\hat f_{\operatorname{sp}}$ integrates to $1$, so the estimator remains a density. A common design takes $h_i$ larger in sparse regions and smaller near modes, often through a pilot density estimate.
[example: Adaptive Smoothing Near Modes and Tails]
Let $\tilde f$ be a positive pilot density estimate, and choose sample-point bandwidths by
\begin{align*}
h_i=h_0\left(\frac{G}{\tilde f(X_i)}\right)^\alpha,
\qquad
G=\exp\left(\frac{1}{n}\sum_{j=1}^n\log \tilde f(X_j)\right),
\qquad
0<\alpha\le 1.
\end{align*}
If $X_i$ lies near a sharp mode and $X_j$ lies in a tail, write $\tilde f(X_i)=a$ and $\tilde f(X_j)=b$, where $a>b>0$. Then
\begin{align*}
\frac{h_i}{h_j}=\frac{h_0(G/a)^\alpha}{h_0(G/b)^\alpha}.
\end{align*}
Cancelling the common positive factor $h_0$ gives
\begin{align*}
\frac{h_i}{h_j}=\frac{(G/a)^\alpha}{(G/b)^\alpha}.
\end{align*}
Since $G>0$, $a>0$, and $b>0$, the quotient of powers is the power of the quotient:
\begin{align*}
\frac{h_i}{h_j}=\left(\frac{G/a}{G/b}\right)^\alpha.
\end{align*}
Inside the parentheses,
\begin{align*}
\frac{G/a}{G/b}=\frac{G}{a}\cdot\frac{b}{G}=\frac{b}{a}.
\end{align*}
Therefore
\begin{align*}
\frac{h_i}{h_j}=\left(\frac{b}{a}\right)^\alpha.
\end{align*}
Because $a>b>0$, we have $0<b/a<1$. Since $\alpha>0$, raising a number in $(0,1)$ to the power $\alpha$ keeps it in $(0,1)$, so
\begin{align*}
0<\frac{h_i}{h_j}<1.
\end{align*}
Thus $h_i<h_j$: the observation near the mode receives the narrower kernel, while the tail observation receives the wider kernel.
The corresponding sample-point estimator is
\begin{align*}
\hat f_{\operatorname{sp}}(x)=\frac{1}{n}\sum_{i=1}^n\frac{1}{h_i}K\left(\frac{x-X_i}{h_i}\right).
\end{align*}
Assume $\int_{\mathbb R}K(u)\,d\mathcal L^1(u)=1$. For each fixed $i$, use the substitution
\begin{align*}
u=\frac{x-X_i}{h_i},\qquad x=X_i+h_i u,\qquad d\mathcal L^1(x)=h_i\,d\mathcal L^1(u).
\end{align*}
Since $h_i>0$, this change of variables maps $\mathbb R$ onto $\mathbb R$. Hence
\begin{align*}
\int_{\mathbb R}\frac{1}{h_i}K\left(\frac{x-X_i}{h_i}\right)\,d\mathcal L^1(x)=\int_{\mathbb R}\frac{1}{h_i}K(u)h_i\,d\mathcal L^1(u).
\end{align*}
Cancelling the positive factor $h_i$ gives
\begin{align*}
\int_{\mathbb R}\frac{1}{h_i}K\left(\frac{x-X_i}{h_i}\right)\,d\mathcal L^1(x)=\int_{\mathbb R}K(u)\,d\mathcal L^1(u).
\end{align*}
By the normalisation of $K$,
\begin{align*}
\int_{\mathbb R}\frac{1}{h_i}K\left(\frac{x-X_i}{h_i}\right)\,d\mathcal L^1(x)=1.
\end{align*}
Using linearity of the integral over the finite sum,
\begin{align*}
\int_{\mathbb R}\hat f_{\operatorname{sp}}(x)\,d\mathcal L^1(x)=\frac{1}{n}\sum_{i=1}^n\int_{\mathbb R}\frac{1}{h_i}K\left(\frac{x-X_i}{h_i}\right)\,d\mathcal L^1(x).
\end{align*}
Substituting the value of each summand,
\begin{align*}
\int_{\mathbb R}\hat f_{\operatorname{sp}}(x)\,d\mathcal L^1(x)=\frac{1}{n}\sum_{i=1}^n 1=1.
\end{align*}
The adaptive rule keeps narrow kernels where the pilot density is high, preserving the sharp modal peak, and uses wider kernels where the pilot density is low, reducing isolated bumps in the long tail.
[/example]
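Both normalisation claims, that the sample-point estimator integrates to $1$ and that the balloon estimator need not, can be checked by numerical integration. The sketch below assumes a Gaussian kernel, a fixed-bandwidth pilot estimate, a hypothetical data set with a cluster near $0$ and two tail points, and a hypothetical step-function balloon bandwidth. None of these choices come from the notes.

```python
import math

def gauss(u):
    """Standard Gaussian kernel (assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def fixed_kde(x, xs, h):
    """Fixed-bandwidth estimator, used here as the positive pilot density."""
    return sum(gauss((x - xi) / h) for xi in xs) / (len(xs) * h)

def sample_point_bandwidths(xs, pilot, h0, alpha):
    """h_i = h0 * (G / pilot(X_i)) ** alpha with G the geometric mean of pilot(X_j)."""
    dens = [pilot(xi) for xi in xs]
    log_g = sum(math.log(d) for d in dens) / len(dens)
    return [h0 * math.exp(alpha * (log_g - math.log(d))) for d in dens]

def sample_point_kde(x, xs, hs):
    """Each observation carries its own bandwidth h_i."""
    return sum(gauss((x - xi) / hi) / hi for xi, hi in zip(xs, hs)) / len(xs)

def balloon_kde(x, xs, h_of_x):
    """The bandwidth depends on the evaluation point x."""
    h = h_of_x(x)
    return sum(gauss((x - xi) / h) for xi in xs) / (len(xs) * h)

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    w = (hi - lo) / steps
    return w * sum(f(lo + (k + 0.5) * w) for k in range(steps))

xs = [-0.2, -0.1, 0.0, 0.05, 0.1, 0.2, 2.5, 4.0]   # hypothetical: cluster plus tail
pilot = lambda x: fixed_kde(x, xs, 0.5)
hs = sample_point_bandwidths(xs, pilot, h0=0.3, alpha=0.5)

sp_mass = integrate(lambda x: sample_point_kde(x, xs, hs), -30.0, 30.0)
step_h = lambda x: 0.3 if abs(x) <= 1.0 else 1.0    # hypothetical balloon bandwidth
bal_mass = integrate(lambda x: balloon_kde(x, xs, step_h), -30.0, 30.0)
print(hs[2], hs[-1], sp_mass, bal_mass)
```

The printed bandwidths show the narrower kernel at the cluster point and the wider kernel at the tail point, and the two integrals show that only the sample-point construction preserves total mass for this choice of bandwidths.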
Adaptive estimators still require choices: the pilot bandwidth, the transformation from pilot density to local bandwidth, and safeguards against very small or very large values. Lepski's method gives a different viewpoint by selecting from a family of estimators through pairwise comparisons.
[definition: Lepski Selector]
Let $\mathcal H_n$ be a finite ordered set of bandwidths and let $x\in\mathbb R$ be fixed. The Lepski selector is the map
\begin{align*}
L_x:\mathbb R^{\mathcal H_n}\times [0,\infty)^{\mathcal H_n\times\mathcal H_n}\to\mathcal H_n
\end{align*}
defined as follows. For estimator values $z=(z_h)_{h\in\mathcal H_n}$ and thresholds $\lambda=(\lambda(h,h'))_{(h,h')\in\mathcal H_n\times\mathcal H_n}$, $L_x(z,\lambda)$ is the largest bandwidth $h\in\mathcal H_n$ such that
\begin{align*}
|z_h-z_{h'}|\le \lambda(h,h')
\end{align*}
for every $h'\in\mathcal H_n$ with $h'\le h$.
[/definition]
The rule starts from the principle that oversmoothing reveals itself by disagreement with finer estimators, while undersmoothing is controlled by the thresholds. To justify choosing the largest acceptable bandwidth, we need an oracle inequality showing that the selected estimator performs like the best bias-variance compromise on the grid.
[quotetheorem:6332]
[citeproof:6332]
Each structural assumption rules out a distinct counterexample. The simultaneous concentration event prevents the selector from chasing noise: on a grid with many very small bandwidths, a single fine-scale estimator can have an unusually large fluctuation at $x$ and make every smoother estimator fail comparison, forcing selection near the bottom of the grid. Threshold calibration is equally necessary. If every threshold is set to $0$, then even ordinary stochastic variation between neighbouring bandwidths causes rejection; if every threshold is chosen extremely large, the coarsest estimator passes even when it has visible smoothing bias. The oracle slack assumption prevents a more subtle failure in the theorem statement: if $B b_{h^*}(x)$ is larger than all thresholds at the oracle scale, then the oracle bandwidth fails comparisons against finer estimators despite being the best bias-variance compromise, so the proof has no way to compare the selected bandwidth with the oracle.
The oracle-scale pairwise bias condition and the monotonicity of the stochastic proxy are not cosmetic. For a pairwise-bias failure, take an ordered grid $h_1<h_2<h_3$ at a point $x$ where smoothing across two nearby bumps makes $\mathbb E[\hat f_{h_3}(x)]$ close to $f(x)$ by cancellation, while $\mathbb E[\hat f_{h_2}(x)]$ lies far below both; then the oracle bandwidth can fail comparison with a finer estimator even though its pointwise bias relative to $f$ is small. For a monotonicity failure, suppose the variance proxy is not nonincreasing because the estimator at a larger labelled bandwidth uses a higher-order or boundary-corrected kernel with larger pointwise variance; then the comparison between the selected estimator and the oracle estimator no longer reduces to the oracle stochastic scale. This theorem is a template rather than a single distribution-free statement: the constants and thresholds depend on concentration inequalities for the estimator class, and the conclusion holds on the high-probability event rather than deterministically. Its importance is conceptual because the selected bandwidth adapts to unknown local smoothness without estimating a smoothness index directly.
[example: Pointwise Lepski Adaptation]
Take the ordered grid $\mathcal H=\{0.05,0.10,0.20\}$ and use thresholds $\lambda(h,h')=0.08$ whenever $h'<h$. At a smooth point $x_s$, suppose the fitted values are
\begin{align*}
\hat f_{0.05}(x_s)=0.412,\quad \hat f_{0.10}(x_s)=0.398,\quad \hat f_{0.20}(x_s)=0.386.
\end{align*}
For the largest bandwidth $0.20$, the finer bandwidths are $0.10$ and $0.05$. The comparison with $0.10$ is
\begin{align*}
|\hat f_{0.20}(x_s)-\hat f_{0.10}(x_s)|=|0.386-0.398|=|-0.012|=0.012\le 0.08.
\end{align*}
The comparison with $0.05$ is
\begin{align*}
|\hat f_{0.20}(x_s)-\hat f_{0.05}(x_s)|=|0.386-0.412|=|-0.026|=0.026\le 0.08.
\end{align*}
Thus $0.20$ passes every required comparison against finer bandwidths. Since $0.20$ is the largest element of $\mathcal H$, Lepski's rule selects
\begin{align*}
\hat h(x_s)=0.20.
\end{align*}
The leading pointwise variance scale for a kernel density estimator is proportional to $(nh)^{-1}$. Comparing the selected scale $h=0.20$ with $h=0.10$ gives
\begin{align*}
\frac{(n\cdot0.20)^{-1}}{(n\cdot0.10)^{-1}}=\frac{1/(0.20n)}{1/(0.10n)}.
\end{align*}
Dividing by $1/(0.10n)$ is the same as multiplying by $0.10n$, so
\begin{align*}
\frac{1/(0.20n)}{1/(0.10n)}=\frac{1}{0.20n}(0.10n)=\frac{0.10n}{0.20n}=\frac{0.10}{0.20}=\frac{1}{2}.
\end{align*}
At the smooth point, the coarser estimate remains compatible with the finer estimates, so the selected bandwidth uses half the leading variance proxy associated with $h=0.10$.
Near a less smooth shoulder point $x_c$, suppose instead that
\begin{align*}
\hat f_{0.05}(x_c)=0.730,\quad \hat f_{0.10}(x_c)=0.690,\quad \hat f_{0.20}(x_c)=0.560.
\end{align*}
The largest bandwidth $0.20$ fails comparison with the finer bandwidth $0.10$ because
\begin{align*}
|\hat f_{0.20}(x_c)-\hat f_{0.10}(x_c)|=|0.560-0.690|=|-0.130|=0.130>0.08.
\end{align*}
Therefore $0.20$ is not acceptable. The middle bandwidth $0.10$ only has to be compared with the finer bandwidth $0.05$, and
\begin{align*}
|\hat f_{0.10}(x_c)-\hat f_{0.05}(x_c)|=|0.690-0.730|=|-0.040|=0.040\le 0.08.
\end{align*}
Thus $0.10$ is acceptable, while the only larger bandwidth in the grid is not acceptable, so Lepski's rule selects
\begin{align*}
\hat h(x_c)=0.10.
\end{align*}
The selected bandwidth decreases near the shoulder because the coarser estimate no longer agrees with finer-scale estimates; the adaptation is determined by pairwise estimator comparisons rather than by an explicit pilot estimate of curvature.
[/example]
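The selection rule in this example is short enough to write out directly. The sketch below is one possible reading of the definition for a finite grid, with the hypothetical function name `lepski_select`. It compares each candidate only with strictly finer bandwidths, since the comparison of a bandwidth with itself is always satisfied.

```python
def lepski_select(values, threshold):
    """Return the largest bandwidth that agrees with every finer one.

    values: dict mapping bandwidth -> estimate at the fixed point x.
    threshold: function (h, h_fine) -> nonnegative tolerance.
    """
    grid = sorted(values)
    for h in reversed(grid):                      # try the coarsest bandwidth first
        finer = [hp for hp in grid if hp < h]
        if all(abs(values[h] - values[hp]) <= threshold(h, hp) for hp in finer):
            return h
    raise ValueError("empty bandwidth grid")      # only reached when values is empty

threshold = lambda h, h_fine: 0.08                # constant thresholds from the example
smooth = {0.05: 0.412, 0.10: 0.398, 0.20: 0.386}
shoulder = {0.05: 0.730, 0.10: 0.690, 0.20: 0.560}

print(lepski_select(smooth, threshold))    # 0.2
print(lepski_select(shoulder, threshold))  # 0.1
```

The smallest bandwidth has no finer competitor, so it is always acceptable and the loop returns a value for every nonempty grid.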
The chapter's selectors can now be compared by what they estimate. Normal reference methods estimate a bandwidth under a parametric proxy, cross-validation estimates risk from leave-one-out prediction, plug-in methods estimate the unknown constants in asymptotic risk, and adaptive methods estimate or infer where the bias-variance compromise should change. The same structure appears outside density estimation: regularisation parameters in inverse problems, penalty levels in model selection, and mesh scales in numerical approximation all balance approximation error against instability or variance. Later topics in nonparametric regression reuse these ideas, with local polynomial estimators replacing kernel density estimators and with bandwidth choice again determining the effective scale of inference.
The first seven chapters focused on estimating entire functions—distribution functions, densities, regression curves. This chapter shifts to a different type of inference: estimating real-valued summaries of distributions and densities, such as functionals of the empirical measure. Understanding how to estimate these functionals efficiently sets up both the deeper resampling theory and the connection to hypothesis testing.
# 8. Nonparametric Functionals and U-Statistics
This chapter moves from estimating whole functions to estimating real-valued functionals of unknown distributions and densities. It assumes the earlier material on i.i.d. sampling, expectation, convergence in probability and distribution, laws of large numbers, central limit theorems, and the basic kernel smoothing ideas used for density estimation. The main point is that many natural targets, such as quadratic functionals and rank correlations, are not simple sample averages but can still be estimated by symmetrised averages over tuples of observations. U-statistics give a common language for these estimators, and Hoeffding's decomposition explains why their limiting behaviour often reduces to an ordinary central limit theorem plus a smaller degenerate remainder.
## Integral Functionals Beyond Plug-In Estimation
How should we estimate a nonlinear functional of an unknown density when the empirical distribution is too rough to be inserted directly? Linear functionals such as
\begin{align*}
\int_{\mathbb R} g(x)f(x)\,d\mathcal L^1(x)
\end{align*}
are handled by sample averages, but quadratic functionals such as
\begin{align*}
\int_{\mathbb R} f(x)^2\,d\mathcal L^1(x)
\end{align*}
involve products of unknown densities. The first lesson is that smoothing and sample splitting can convert these products into estimable averages with a controllable bias.
[definition: Integral Quadratic Functional]
Let
\begin{align*}
\mathcal P_2(\mathbb R):=\{f\in L^2(\mathbb R): f\ge 0\ \mathcal L^1\text{-a.e. and }\int_{\mathbb R} f\,d\mathcal L^1=1\}.
\end{align*}
The integral quadratic functional is the map $Q:\mathcal P_2(\mathbb R)\to \mathbb R$ defined by
\begin{align*}
Q(f) := \int_{\mathbb R} f(x)^2\,d\mathcal L^1(x).
\end{align*}
[/definition]
This functional measures concentration of the distribution: sharply peaked densities have larger $Q(f)$ than diffuse densities. It also appears inside $L^2$ distances, since for densities $f$ and $g$,
\begin{align*}
\|f-g\|_{L^2}^2 = Q(f) - 2\int_{\mathbb R} f(x)g(x)\,d\mathcal L^1(x) + Q(g).
\end{align*}
[example: Gaussian Quadratic Functional]
Let $X$ have Gaussian density
\begin{align*}
f(x)=(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/(2\sigma^2)}
\end{align*}
with $\sigma>0$. Squaring the density gives
\begin{align*}
f(x)^2=\left((2\pi\sigma^2)^{-1/2}\right)^2\left(e^{-(x-\mu)^2/(2\sigma^2)}\right)^2.
\end{align*}
Since $\left((2\pi\sigma^2)^{-1/2}\right)^2=(2\pi\sigma^2)^{-1}$ and $\left(e^{-(x-\mu)^2/(2\sigma^2)}\right)^2=e^{-(x-\mu)^2/\sigma^2}$, this becomes
\begin{align*}
f(x)^2=\frac{1}{2\pi\sigma^2}e^{-(x-\mu)^2/\sigma^2}.
\end{align*}
Therefore
\begin{align*}
Q(f)=\int_{\mathbb R}\frac{1}{2\pi\sigma^2}e^{-(x-\mu)^2/\sigma^2}\,d\mathcal L^1(x).
\end{align*}
Use the change of variables $u=(x-\mu)/\sigma$, so $x=\mu+\sigma u$ and $d\mathcal L^1(x)=\sigma\,d\mathcal L^1(u)$. Then
\begin{align*}
Q(f)=\frac{1}{2\pi\sigma^2}\int_{\mathbb R}e^{-u^2}\sigma\,d\mathcal L^1(u).
\end{align*}
Pulling the constant $\sigma$ out of the integral gives
\begin{align*}
Q(f)=\frac{1}{2\pi\sigma}\int_{\mathbb R}e^{-u^2}\,d\mathcal L^1(u).
\end{align*}
Using the Gaussian integral identity $\int_{\mathbb R}e^{-u^2}\,d\mathcal L^1(u)=\sqrt{\pi}$, we get
\begin{align*}
Q(f)=\frac{\sqrt{\pi}}{2\pi\sigma}=\frac{1}{2\sqrt{\pi}\sigma}.
\end{align*}
The final expression contains $\sigma$ but not $\mu$, so $Q(f)$ is invariant under translation. Also,
\begin{align*}
\frac{d}{d\sigma}\left(\frac{1}{2\sqrt{\pi}\sigma}\right)=-\frac{1}{2\sqrt{\pi}\sigma^2}<0,
\end{align*}
so $Q(f)$ decreases as the scale parameter $\sigma$ increases. This gives a scale check for bandwidth calculations: small $\sigma$ makes the target larger and the density sharper, so kernel smoothing bias becomes visible at smaller bandwidths.
[/example]
The Gaussian calculation makes the target concrete, but it also shows why direct plug-in from the empirical distribution is unavailable: empirical point masses do not form an $L^2$ density. A kernel density estimator can be squared and integrated, yet its diagonal terms compare each observation with itself and create a leading artificial contribution. The next definition removes those diagonal comparisons and keeps only genuinely independent pairs, which is the form needed for unbiased pairwise estimation of a smoothed quadratic functional.
[definition: Kernel U-Statistic for Quadratic Functional]
Let $X_1,\dots,X_n$ be i.i.d. real-valued random variables with density $f$. Let $K:\mathbb R\to\mathbb R$ be an integrable kernel and let $h>0$. Define $K_h(t)=h^{-1}K(t/h)$. The kernel U-statistic estimator of $Q(f)$ is
\begin{align*}
\widehat Q_{n,h}:\mathbb R^n\to\mathbb R,\qquad
\widehat Q_{n,h}(x_1,\dots,x_n):=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}K_h(x_i-x_j).
\end{align*}
[/definition]
The diagonal-free form replaces a product of densities by pairwise closeness of independent observations. Its expectation is a smoothed version of $Q(f)$:
\begin{align*}
\mathbb E[\widehat Q_{n,h}] = \int_{\mathbb R}\int_{\mathbb R} K_h(x-y)f(x)f(y)\,d\mathcal L^1(x)d\mathcal L^1(y).
\end{align*}
This identity isolates the first analytic question about the estimator: before studying random fluctuation, we must know how far the smoothed target is from $Q(f)$. The following result gives the basic bias scale under the same moment and smoothness assumptions used in kernel smoothing.
[quotetheorem:6333]
[citeproof:6333]
The hypotheses are doing distinct jobs. The normalisation makes $K_h$ an approximate identity, so the smoothed autocorrelation target approaches the original quadratic functional as $h\to 0$. For this quadratic target, the first-order translation term pairs $f$ with its [weak derivative](/page/Weak%20Derivative) and vanishes under the Sobolev regularity assumptions; the point is not merely that the kernel has a zero first moment. The second absolute moment of $K$ controls the averaged Taylor remainder, and $H^2$ smoothness gives the stated second-order translation bound. If the second moment condition is dropped, for instance for a heavy-tailed kernel with $K(u)\asymp |u|^{-3}$ as $|u|\to\infty$, the averaged second-order remainder need not be finite and the $h^2$ bound is no longer justified by this argument. Under weaker smoothness assumptions the same approximate-identity argument may still give consistency, but the available bias rate is governed by the translation regularity of $f$ rather than by a formal Taylor expansion. The theorem is only a bias statement: it does not choose an optimal bandwidth and says nothing about stochastic variance. This estimator is already a U-statistic, but the same structure appears in examples that have no density-estimation flavour, so before proving general limit theorems it helps to record the wider class.
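In the Gaussian case both the target and the smoothed target are available in closed form, which gives a concrete check of the bias scale. The sketch below assumes $f$ is the $N(0,\sigma^2)$ density and $K$ is the standard Gaussian kernel. Then $X_1-X_2\sim N(0,2\sigma^2)$, so $\mathbb E[\widehat Q_{n,h}]$ is the $N(0,2\sigma^2+h^2)$ density at $0$. The sample size, seed, and bandwidths are hypothetical.

```python
import math
import random

def gauss_kernel_h(t, h):
    """K_h(t) = K(t / h) / h for the standard Gaussian kernel (assumed choice)."""
    return math.exp(-0.5 * (t / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def q_hat(xs, h):
    """Diagonal-free kernel U-statistic for Q(f)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += gauss_kernel_h(xs[i] - xs[j], h)
    # The kernel is symmetric, so ordered pairs count each unordered pair twice.
    return 2.0 * total / (n * (n - 1))

def q_gaussian(sigma):
    """Q(f) = 1 / (2 sqrt(pi) sigma) for the N(mu, sigma^2) density."""
    return 1.0 / (2.0 * math.sqrt(math.pi) * sigma)

def smoothed_target(sigma, h):
    """E[K_h(X - Y)] for Gaussian data and kernel: N(0, 2 sigma^2 + h^2) density at 0."""
    return 1.0 / math.sqrt(2.0 * math.pi * (2.0 * sigma ** 2 + h ** 2))

rng = random.Random(0)
sigma, h = 1.0, 0.3
xs = [rng.gauss(0.0, sigma) for _ in range(400)]
estimate = q_hat(xs, h)
bias = smoothed_target(sigma, h) - q_gaussian(sigma)
print(estimate, smoothed_target(sigma, h), q_gaussian(sigma), bias)
```

Halving $h$ in `smoothed_target` reduces the bias by a factor close to $4$, in line with the $h^2$ scale of the theorem, while the gap between `estimate` and `smoothed_target` is the stochastic part that the theorem does not address.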
## U-Statistics as Symmetric Averages
What do sample variance, rank correlation, pairwise distances, and quadratic density functionals have in common? Each averages the same symmetric rule over many subsets of the data. U-statistics isolate this structure and separate the choice of statistical target from the probability theory of repeated tuple averages.
[definition: U-Statistic]
Let $X_1,\dots,X_n$ be i.i.d. random variables taking values in a measurable space $(E,\mathcal E)$. Let $m\in\mathbb N$ and let $h:E^m\to\mathbb R$ be a symmetric measurable function with $\mathbb E[|h(X_1,\dots,X_m)|]<\infty$. The U-statistic with kernel $h$ is
\begin{align*}
U_n:E^n\to\mathbb R,\qquad
U_n(x_1,\dots,x_n) := \binom{n}{m}^{-1}\sum_{1\le i_1<\cdots<i_m\le n} h(x_{i_1},\dots,x_{i_m}).
\end{align*}
The parameter estimated by $U_n$ is
\begin{align*}
\theta := \mathbb E[h(X_1,\dots,X_m)].
\end{align*}
[/definition]
The word kernel here means the symmetric function $h$, not necessarily a smoothing kernel. Since every $m$-tuple of distinct observations $(X_{i_1},\dots,X_{i_m})$ has the same distribution as $(X_1,\dots,X_m)$, each summand has expectation $\theta$, and the statistic is unbiased for $\theta$ whenever $h$ is integrable.
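The definition translates directly into an average over index subsets. The following is a minimal sketch with a hypothetical helper `u_statistic` and made-up data. It enumerates all $\binom{n}{m}$ subsets, so it is only meant for small $n$ and $m$.

```python
import itertools
import math
import statistics

def u_statistic(data, kernel, m):
    """Average of a symmetric kernel over all m-element subsets of the data."""
    subsets = itertools.combinations(data, m)
    return sum(kernel(*s) for s in subsets) / math.comb(len(data), m)

data = [2.0, 3.5, 1.0, 4.0, 2.5]   # hypothetical observations

# m = 1 with h(x) = x recovers the sample mean.
u_mean = u_statistic(data, lambda x: x, 1)

# m = 2 with h(x, y) = (x - y)^2 / 2 recovers the unbiased sample variance.
u_var = u_statistic(data, lambda x, y: 0.5 * (x - y) ** 2, 2)

# m = 2 with h(x, y) = |x - y| gives Gini's mean difference.
u_gini = u_statistic(data, lambda x, y: abs(x - y), 2)

print(u_mean, u_var, statistics.variance(data), u_gini)
```

Only the kernel changes between the three statistics, which is the separation between statistical target and averaging mechanism that the definition is meant to express.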
[example: Sample Variance as a U-Statistic]
For real-valued $X$ with $\mathbb E[X^2]<\infty$, take $m=2$ and
\begin{align*}
h(x,y)=\frac{1}{2}(x-y)^2.
\end{align*}
If $X_1$ and $X_2$ are independent copies of $X$, then the parameter is
\begin{align*}
\theta=\mathbb E[h(X_1,X_2)]=\frac{1}{2}\mathbb E[(X_1-X_2)^2].
\end{align*}
Expanding the square gives
\begin{align*}
\theta=\frac{1}{2}\mathbb E[X_1^2-2X_1X_2+X_2^2].
\end{align*}
By linearity of expectation,
\begin{align*}
\theta=\frac{1}{2}\left(\mathbb E[X_1^2]-2\mathbb E[X_1X_2]+\mathbb E[X_2^2]\right).
\end{align*}
Independence gives $\mathbb E[X_1X_2]=\mathbb E[X_1]\mathbb E[X_2]$, and identical distribution gives $\mathbb E[X_1]=\mathbb E[X_2]=\mathbb E[X]$ and $\mathbb E[X_1^2]=\mathbb E[X_2^2]=\mathbb E[X^2]$. Hence
\begin{align*}
\theta=\frac{1}{2}\left(\mathbb E[X^2]-2(\mathbb E[X])^2+\mathbb E[X^2]\right).
\end{align*}
Combining the two identical second-moment terms,
\begin{align*}
\theta=\mathbb E[X^2]-(\mathbb E[X])^2=\operatorname{Var}(X).
\end{align*}
For observations $X_1,\dots,X_n$, the associated U-statistic is
\begin{align*}
U_n=\binom n2^{-1}\sum_{1\le i<j\le n}\frac{1}{2}(X_i-X_j)^2.
\end{align*}
Since $\binom n2=n(n-1)/2$, this is
\begin{align*}
U_n=\frac{1}{n(n-1)}\sum_{1\le i<j\le n}(X_i-X_j)^2.
\end{align*}
Now expand the pairwise sum:
\begin{align*}
\sum_{1\le i<j\le n}(X_i-X_j)^2=\sum_{1\le i<j\le n}(X_i^2-2X_iX_j+X_j^2).
\end{align*}
Each $X_i^2$ appears in exactly $n-1$ pairs, so
\begin{align*}
\sum_{1\le i<j\le n}(X_i-X_j)^2=(n-1)\sum_{i=1}^n X_i^2-2\sum_{1\le i<j\le n}X_iX_j.
\end{align*}
Also,
\begin{align*}
\left(\sum_{i=1}^n X_i\right)^2=\sum_{i=1}^n X_i^2+2\sum_{1\le i<j\le n}X_iX_j.
\end{align*}
Therefore
\begin{align*}
2\sum_{1\le i<j\le n}X_iX_j=\left(\sum_{i=1}^n X_i\right)^2-\sum_{i=1}^n X_i^2.
\end{align*}
Substituting this identity into the pairwise sum gives
\begin{align*}
\sum_{1\le i<j\le n}(X_i-X_j)^2=(n-1)\sum_{i=1}^n X_i^2-\left(\sum_{i=1}^n X_i\right)^2+\sum_{i=1}^n X_i^2.
\end{align*}
Combining the two sums of squares,
\begin{align*}
\sum_{1\le i<j\le n}(X_i-X_j)^2=n\sum_{i=1}^n X_i^2-\left(\sum_{i=1}^n X_i\right)^2.
\end{align*}
Since $\bar X_n=n^{-1}\sum_{i=1}^n X_i$,
\begin{align*}
\sum_{i=1}^n (X_i-\bar X_n)^2=\sum_{i=1}^n \left(X_i^2-2X_i\bar X_n+\bar X_n^2\right).
\end{align*}
Using $\sum_{i=1}^n X_i=n\bar X_n$, this becomes
\begin{align*}
\sum_{i=1}^n (X_i-\bar X_n)^2=\sum_{i=1}^n X_i^2-2n\bar X_n^2+n\bar X_n^2.
\end{align*}
Thus
\begin{align*}
\sum_{i=1}^n (X_i-\bar X_n)^2=\sum_{i=1}^n X_i^2-n\bar X_n^2.
\end{align*}
Replacing $\bar X_n$ by $n^{-1}\sum_{i=1}^n X_i$ gives
\begin{align*}
\sum_{i=1}^n (X_i-\bar X_n)^2=\sum_{i=1}^n X_i^2-\frac{1}{n}\left(\sum_{i=1}^n X_i\right)^2.
\end{align*}
Multiplying by $n$,
\begin{align*}
n\sum_{i=1}^n (X_i-\bar X_n)^2=n\sum_{i=1}^n X_i^2-\left(\sum_{i=1}^n X_i\right)^2.
\end{align*}
Comparing this with the pairwise-sum identity yields
\begin{align*}
\sum_{1\le i<j\le n}(X_i-X_j)^2=n\sum_{i=1}^n (X_i-\bar X_n)^2.
\end{align*}
Therefore
\begin{align*}
U_n=\frac{1}{n(n-1)}\,n\sum_{i=1}^n (X_i-\bar X_n)^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X_n)^2.
\end{align*}
Thus the order-two U-statistic generated by $h(x,y)=(x-y)^2/2$ is exactly the usual unbiased sample variance, and its target is the population variance.
[/example]
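The definition and the identity just derived can be checked numerically. The following is a minimal sketch, assuming the data fit in memory and the kernel is supplied as a Python callable; the names `u_statistic` and `variance_kernel` are illustrative and do not come from the text. It enumerates all size-$m$ subsets directly, so its cost grows like $\binom nm$.

```python
from itertools import combinations
from math import comb
from statistics import variance


def u_statistic(data, kernel, m):
    """Average a symmetric kernel over all size-m subsets of the data."""
    n = len(data)
    if n < m:
        raise ValueError("need at least m observations")
    total = sum(kernel(*subset) for subset in combinations(data, m))
    return total / comb(n, m)


def variance_kernel(x, y):
    # Order-two kernel h(x, y) = (x - y)^2 / 2, whose target is Var(X).
    return 0.5 * (x - y) ** 2


sample = [2.0, 4.0, 4.0, 5.0, 7.0]
u_var = u_statistic(sample, variance_kernel, 2)

# statistics.variance uses the 1 / (n - 1) normalisation, so the two should agree.
s2 = variance(sample)

# With m = 1 and the identity kernel the U-statistic reduces to the sample mean.
u_mean = u_statistic(sample, lambda x: x, 1)
```

The brute-force enumeration is only meant to mirror the definition; for specific kernels such as the variance kernel, the closed form derived in the example is the practical route.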
Pairwise absolute differences give a scale functional that is less tied to squared loss. This example also shows why U-statistics are natural in nonparametric estimation: the target is defined directly through the distribution, not through a finite-dimensional parameter.
[example: Gini Mean Difference]
For a real-valued distribution with $\mathbb E[|X|]<\infty$, let $X_1,\dots,X_n$ be i.i.d. copies of $X$ and define the symmetric order-two kernel
\begin{align*}
h(x,y)=|x-y|.
\end{align*}
The kernel is integrable because, for independent copies $X_1$ and $X_2$,
\begin{align*}
|h(X_1,X_2)|=|X_1-X_2|\le |X_1|+|X_2|
\end{align*}
by the triangle inequality. Taking expectations gives
\begin{align*}
\mathbb E[|h(X_1,X_2)|]\le \mathbb E[|X_1|]+\mathbb E[|X_2|].
\end{align*}
Since $X_1$ and $X_2$ have the same distribution as $X$,
\begin{align*}
\mathbb E[|h(X_1,X_2)|]\le 2\mathbb E[|X|]<\infty.
\end{align*}
Thus the target parameter is well-defined and equals
\begin{align*}
\theta=\mathbb E[h(X_1,X_2)]=\mathbb E[|X_1-X_2|],
\end{align*}
which is the Gini mean difference.
The associated U-statistic is
\begin{align*}
U_n=\binom n2^{-1}\sum_{1\le i<j\le n}|X_i-X_j|.
\end{align*}
Since $\binom n2=n(n-1)/2$, this is equivalently
\begin{align*}
U_n=\frac{2}{n(n-1)}\sum_{1\le i<j\le n}|X_i-X_j|.
\end{align*}
For every pair $i<j$, the pair $(X_i,X_j)$ has the same distribution as $(X_1,X_2)$, so
\begin{align*}
\mathbb E[|X_i-X_j|]=\mathbb E[|X_1-X_2|]=\theta.
\end{align*}
Using linearity of expectation,
\begin{align*}
\mathbb E[U_n]=\binom n2^{-1}\sum_{1\le i<j\le n}\mathbb E[|X_i-X_j|].
\end{align*}
Substituting $\mathbb E[|X_i-X_j|]=\theta$ for each of the $\binom n2$ pairs gives
\begin{align*}
\mathbb E[U_n]=\binom n2^{-1}\sum_{1\le i<j\le n}\theta.
\end{align*}
The sum contains exactly $\binom n2$ identical terms, hence
\begin{align*}
\mathbb E[U_n]=\binom n2^{-1}\binom n2\,\theta=\theta.
\end{align*}
Thus $U_n$ is an unbiased estimator of the Gini mean difference. It depends only on the pairwise spacings $|X_i-X_j|$, so reordering the observations leaves the statistic unchanged, while distributions with larger typical separation between independent draws have larger target value.
[/example]
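As a hedged illustration, the Gini U-statistic can be computed either directly from the pairwise definition or from the order statistics. The second route uses the elementary identity $\sum_{i<j}|x_i-x_j|=\sum_{k=1}^n(2k-n-1)x_{(k)}$, which is not stated in the text above and is introduced here only as a computational shortcut; the function names are illustrative.

```python
from itertools import combinations


def gini_mean_difference_pairs(xs):
    """Direct O(n^2) average of |x_i - x_j| over unordered pairs."""
    n = len(xs)
    total = sum(abs(a - b) for a, b in combinations(xs, 2))
    return 2.0 * total / (n * (n - 1))


def gini_mean_difference_sorted(xs):
    """Same statistic via order statistics, O(n log n) because of the sort."""
    n = len(xs)
    ordered = sorted(xs)
    # The k-th smallest value (k = 1..n) is the larger element in k - 1 pairs
    # and the smaller element in n - k pairs, giving weight 2k - n - 1.
    total = sum((2 * k - n - 1) * x for k, x in enumerate(ordered, start=1))
    return 2.0 * total / (n * (n - 1))


sample = [3.0, 1.0, 4.0, 1.0, 5.0]
g_pairs = gini_mean_difference_pairs(sample)
g_sorted = gini_mean_difference_sorted(sample)
```

Both functions are invariant under reordering of the input, matching the remark that the statistic depends only on the pairwise spacings.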
Rank methods produce U-statistics whose kernels involve comparisons between pairs. They are central in distribution-free nonparametric testing because strictly increasing transformations of the observations preserve ranks.
[example: Kendall Tau]
Let $(X_i,Y_i)$ be i.i.d. observations in $\mathbb R^2$ from a continuous joint distribution, and define the symmetric order-two kernel
\begin{align*}
h((x_1,y_1),(x_2,y_2))=\operatorname{sgn}((x_1-x_2)(y_1-y_2)).
\end{align*}
For two independent observations, let $C$ be the concordance event
\begin{align*}
C=\{(X_1-X_2)(Y_1-Y_2)>0\},
\end{align*}
let $D$ be the discordance event
\begin{align*}
D=\{(X_1-X_2)(Y_1-Y_2)<0\},
\end{align*}
and let $T$ be the tie event
\begin{align*}
T=\{(X_1-X_2)(Y_1-Y_2)=0\}.
\end{align*}
The continuity assumption rules out ties, so $\mathbb P(T)=0$. On $C$ the product $(X_1-X_2)(Y_1-Y_2)$ is positive and the sign is $1$; on $D$ the product is negative and the sign is $-1$; on $T$ the sign is $0$. The target parameter of this kernel is Kendall's tau,
\begin{align*}
\tau=\mathbb E[\operatorname{sgn}((X_1-X_2)(Y_1-Y_2))].
\end{align*}
Using the three disjoint events $C,D,T$, this expectation is
\begin{align*}
\tau=1\cdot \mathbb P(C)+(-1)\cdot \mathbb P(D)+0\cdot \mathbb P(T).
\end{align*}
Since $\mathbb P(T)=0$, we get
\begin{align*}
\tau=\mathbb P(C)-\mathbb P(D).
\end{align*}
Thus Kendall's tau is the probability of concordance minus the probability of discordance.
The associated U-statistic is
\begin{align*}
U_n=\binom n2^{-1}\sum_{1\le i<j\le n}\operatorname{sgn}((X_i-X_j)(Y_i-Y_j)).
\end{align*}
For each pair $i<j$, the pair $((X_i,Y_i),(X_j,Y_j))$ has the same distribution as $((X_1,Y_1),(X_2,Y_2))$, hence
\begin{align*}
\mathbb E[\operatorname{sgn}((X_i-X_j)(Y_i-Y_j))]=\tau.
\end{align*}
By linearity of expectation,
\begin{align*}
\mathbb E[U_n]=\binom n2^{-1}\sum_{1\le i<j\le n}\mathbb E[\operatorname{sgn}((X_i-X_j)(Y_i-Y_j))].
\end{align*}
Substituting the value $\tau$ for each pair gives
\begin{align*}
\mathbb E[U_n]=\binom n2^{-1}\sum_{1\le i<j\le n}\tau.
\end{align*}
There are exactly $\binom n2$ pairs, so
\begin{align*}
\mathbb E[U_n]=\binom n2^{-1}\binom n2\,\tau.
\end{align*}
Hence
\begin{align*}
\mathbb E[U_n]=\tau.
\end{align*}
The statistic is therefore an unbiased U-statistic estimator of Kendall's tau, obtained by averaging the signs of all pairwise rank agreements; continuity ensures that no tie correction is needed.
[/example]
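The Kendall tau U-statistic above can be sketched in a few lines. This is a direct $O(n^2)$ transcription of the displayed formula, assuming the observations are given as a list of `(x, y)` tuples without ties; the helper names are illustrative.

```python
from itertools import combinations


def sign(v):
    """Return 1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_u(pairs):
    """Average sgn((x_i - x_j)(y_i - y_j)) over unordered pairs of observations."""
    n = len(pairs)
    total = sum(
        sign((x1 - x2) * (y1 - y2))
        for (x1, y1), (x2, y2) in combinations(pairs, 2)
    )
    return total / (n * (n - 1) / 2)


increasing = [(1, 10), (2, 20), (3, 30)]
decreasing = [(1, 30), (2, 20), (3, 10)]
mixed = [(1, 2), (2, 1), (3, 3)]

tau_inc = kendall_tau_u(increasing)
tau_dec = kendall_tau_u(decreasing)
tau_mix = kendall_tau_u(mixed)
```

In the `mixed` sample two of the three pairs are concordant and one is discordant, so the statistic is $(2-1)/3$.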
The examples show unbiasedness, but they do not yet explain variance or limiting distributions. The decisive step is to project the kernel onto sums of functions of one observation, two observations, and so on.
## Hoeffding Decomposition and Degeneracy
Why should an average over dependent tuples often have the same first-order limit theory as an average of i.i.d. variables? Although the summands in a U-statistic overlap, the leading fluctuation is usually the sum of conditional expectations given a single observation. Hoeffding's decomposition makes this projection exact.
[definition: First Projection]
Let $X_1,\dots,X_m$ be i.i.d. random variables with common law $\mu_X$ on $(E,\mathcal E)$. Let $h:E^m\to\mathbb R$ be an integrable symmetric kernel and let $\theta=\mathbb E[h(X_1,\dots,X_m)]$. Assume the [conditional expectation](/page/Conditional%20Expectation) below is finite for $\mu_X$-a.e. $x\in E$. The first projection of $h$ is the measurable map $h_1:E\to\mathbb R$ defined, up to changes on a $\mu_X$-null set, by
\begin{align*}
h_1:x\mapsto \mathbb E[h(x,X_2,\dots,X_m)]-\theta.
\end{align*}
[/definition]
The first projection is centred: $\mathbb E[h_1(X_1)]=0$. It is the part of the kernel that is visible after conditioning on a single observation, so its variance tells us whether the U-statistic has an ordinary sample-mean component. When this conditional fluctuation disappears, the first-order Gaussian theory has no leading term; the next definition names that obstruction.
[definition: Degenerate U-Statistic Kernel]
Let $X_1,\dots,X_m$ be i.i.d. random variables with common law $\mu_X$ on $(E,\mathcal E)$. An integrable symmetric kernel $h:E^m\to\mathbb R$ with parameter $\theta=\mathbb E[h(X_1,\dots,X_m)]$ is degenerate of order one if
\begin{align*}
\mathbb E[h(x,X_2,\dots,X_m)] = \theta
\end{align*}
for $\mu_X$-a.e. $x\in E$.
[/definition]
Degeneracy means that conditioning on one observation gives no information about the kernel's fluctuation. In this case the first projection vanishes, so the usual $\sqrt n$ normalisation produces only a zero-variance limit, and later terms in the decomposition need separate analysis.
[quotetheorem:6334]
[citeproof:6334]
Square integrability is what makes the projection levels orthogonal in $L^2$ and lets their variances be compared cleanly. Without it, the decomposition may still be interpretable in an $L^1$ sense, but the variance calculations used for central limit theory can break down. A boundary case is $h(x,y)=xy$ with i.i.d. $X_i$ satisfying $\mathbb E[|X_1|]<\infty$ but $\mathbb E[X_1^2]=\infty$: the first projection can be defined after centring when the mean exists, but the $L^2$ orthogonality and variance comparison are unavailable. Symmetry is mainly a convention rather than a restriction, since a nonsymmetric kernel can be symmetrised without changing the associated average over unordered tuples. The theorem itself is structural: it does not assert convergence, normality, or a rate until moment and nondegeneracy assumptions are added. The first term in this decomposition is especially simple:
\begin{align*}
\binom m1 U_{n,1}=\frac{m}{n}\sum_{i=1}^n h_1(X_i).
\end{align*}
Thus a nondegenerate U-statistic has the same leading term as an ordinary sample mean. The remaining canonical terms are smaller under the moment assumptions used in the standard limit theorem.
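As a worked instance, consider the sample variance kernel $h(x,y)=(x-y)^2/2$ with $\mathbb E[X^2]<\infty$. The symbols $\mu=\mathbb E[X]$, $\sigma^2=\operatorname{Var}(X)$ and $h_2$ (the remainder after removing the first projections) are notation introduced here only for this illustration. Since $\mathbb E[(x-X_2)^2]=(x-\mu)^2+\sigma^2$ and $\theta=\sigma^2$,
\begin{align*}
h_1(x)=\frac{1}{2}\left\{(x-\mu)^2+\sigma^2\right\}-\sigma^2=\frac{1}{2}\left\{(x-\mu)^2-\sigma^2\right\}.
\end{align*}
Writing $(x-y)^2=(x-\mu)^2-2(x-\mu)(y-\mu)+(y-\mu)^2$, the remainder is
\begin{align*}
h_2(x,y):=h(x,y)-\theta-h_1(x)-h_1(y)=-(x-\mu)(y-\mu),
\end{align*}
which satisfies $\mathbb E[h_2(x,X_2)]=0$ for every $x$. Averaging over pairs, and using that each index lies in exactly $n-1$ pairs, gives
\begin{align*}
U_n-\sigma^2=\frac{1}{n}\sum_{i=1}^n\left\{(X_i-\mu)^2-\sigma^2\right\}-\binom n2^{-1}\sum_{1\le i<j\le n}(X_i-\mu)(X_j-\mu).
\end{align*}
The first sum is the sample-mean term $\frac{2}{n}\sum_i h_1(X_i)$, and the second is a degenerate pairwise term.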
[quotetheorem:6335]
[citeproof:6335]
The integrability assumption is essential because the theorem is an average law for the kernel values themselves. If $h(X_1,\dots,X_m)$ has no finite first moment, the target $\theta$ is not a finite real number and ordinary averaging can be dominated by rare extreme tuples. For example, with $m=1$ and $h(x)=x$, a Pareto distribution satisfying $\mathbb P(X>x)=x^{-\alpha}$ for $x\ge 1$ and $0<\alpha<1$ has infinite mean, and the sample average is driven by the largest observations rather than converging to a finite expectation. The result also gives no rate of convergence and no distributional approximation; it only says that the estimator is eventually close to its target almost surely. For distributional approximation, consistency is only the first layer, and the next question is whether the centred error is asymptotically Gaussian.
## Asymptotic Normality for Nondegenerate U-Statistics
When does a U-statistic have a classical $\sqrt n$ limit? Hoeffding's decomposition gives a precise answer: the first projection must have positive variance. Then the dependent higher-order terms are asymptotically negligible at the $\sqrt n$ scale.
[definition: Nondegenerate U-Statistic Kernel]
Let $h:E^m\to\mathbb R$ be a square-integrable symmetric kernel and let $h_1$ be its first projection. The kernel is nondegenerate if
\begin{align*}
\zeta_1:=\operatorname{Var}(h_1(X_1))>0.
\end{align*}
[/definition]
Nondegeneracy is a condition on the target distribution as well as on the kernel. For example, the sample variance kernel has first projection $h_1(x)=\{(x-\mathbb E[X])^2-\operatorname{Var}(X)\}/2$, so it is nondegenerate unless $(X-\mathbb E[X])^2$ is almost surely constant, which happens only when the distribution is a point mass or puts mass $1/2$ on each of two points. The reason this condition is singled out is that it is exactly what makes the first projection carry a nonzero Gaussian fluctuation. The next theorem turns that observation into the main distributional approximation for U-statistics: once $\zeta_1>0$, the full statistic has the same first-order limit as its projection term.
[quotetheorem:6336]
[citeproof:6336]
The square-integrability assumption is used twice: it gives a finite variance for the first projection and controls the higher-order canonical remainders in $L^2$. Nondegeneracy is also necessary for this particular normalisation; if $h_1=0$ a.e., the displayed variance is zero and the $\sqrt n$ limit cannot be the stated nonconstant Gaussian law. A concrete example is a genuinely degenerate order-two kernel, where the second projection rather than the first projection controls the limit. Thus the theorem reduces inference for many nonparametric functionals to computing $h_1$ and estimating its variance, but it does not cover degenerate kernels or bandwidth-dependent kernels without further arguments. The asymptotic variance is not usually the variance of the kernel $h(X_1,\dots,X_m)$; it is the variance of the conditional expectation fluctuation.
[example: First Projection for the Gini Mean Difference]
For the Gini mean difference kernel $h(x,y)=|x-y|$, the target parameter is
\begin{align*}
\theta=\mathbb E[h(X_1,X_2)]=\mathbb E[|X_1-X_2|].
\end{align*}
By the definition of the first projection for an order-two kernel,
\begin{align*}
h_1(x)=\mathbb E[h(x,X_2)]-\theta.
\end{align*}
Substituting $h(x,y)=|x-y|$ and the value of $\theta$ gives
\begin{align*}
h_1(x)=\mathbb E[|x-X_2|]-\mathbb E[|X_1-X_2|].
\end{align*}
This projection is centred. Indeed,
\begin{align*}
\mathbb E[h_1(X_1)]=\mathbb E[\mathbb E[|X_1-X_2|\mid X_1]]-\mathbb E[|X_1-X_2|].
\end{align*}
By the tower property,
\begin{align*}
\mathbb E[\mathbb E[|X_1-X_2|\mid X_1]]=\mathbb E[|X_1-X_2|].
\end{align*}
Therefore
\begin{align*}
\mathbb E[h_1(X_1)]=\mathbb E[|X_1-X_2|]-\mathbb E[|X_1-X_2|]=0.
\end{align*}
If $\mathbb E[X^2]<\infty$, then the kernel is square-integrable because
\begin{align*}
|X_1-X_2|^2\le (|X_1|+|X_2|)^2.
\end{align*}
Expanding the square gives
\begin{align*}
(|X_1|+|X_2|)^2=X_1^2+2|X_1||X_2|+X_2^2.
\end{align*}
Since $2|X_1||X_2|\le X_1^2+X_2^2$,
\begin{align*}
(|X_1|+|X_2|)^2\le 2X_1^2+2X_2^2.
\end{align*}
Hence
\begin{align*}
\mathbb E[|X_1-X_2|^2]\le 2\mathbb E[X_1^2]+2\mathbb E[X_2^2]=4\mathbb E[X^2]<\infty.
\end{align*}
Thus, if the distribution is nondegenerate in the U-statistic sense,
\begin{align*}
\operatorname{Var}(h_1(X_1))>0,
\end{align*}
the *[Central Limit Theorem for Nondegenerate U-Statistics](/theorems/6336)* applies with $m=2$. It gives
\begin{align*}
\sqrt n\,(U_n-\theta)\xrightarrow{d}\mathcal N(0,4\operatorname{Var}(h_1(X_1))).
\end{align*}
The same formula gives the empirical plug-in projection values. Replacing the distribution of $X_2$ in $\mathbb E[|x-X_2|]$ by the empirical distribution of the observations other than $X_i$ gives
\begin{align*}
\frac{1}{n-1}\sum_{j\ne i}|X_i-X_j|.
\end{align*}
Replacing $\theta$ by the Gini U-statistic
\begin{align*}
U_n=\binom n2^{-1}\sum_{1\le j<k\le n}|X_j-X_k|
\end{align*}
therefore gives
\begin{align*}
\widehat h_{1,i}=\frac{1}{n-1}\sum_{j\ne i}|X_i-X_j|-U_n.
\end{align*}
So inference for the nondegenerate Gini U-statistic is governed by the variance of the one-observation projection values, not by the variance of the pairwise distances themselves.
[/example]
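The plug-in projection values $\widehat h_{1,i}$ for the Gini kernel can be sketched as follows. This is an illustrative $O(n^2)$ implementation with hypothetical function names. A useful sanity check is that the values sum to zero, because $\sum_i (n-1)^{-1}\sum_{j\ne i}|X_i-X_j|=nU_n$.

```python
from itertools import combinations


def gini_u(xs):
    """Gini mean difference U-statistic."""
    n = len(xs)
    return 2.0 * sum(abs(a - b) for a, b in combinations(xs, 2)) / (n * (n - 1))


def gini_projection_values(xs):
    """Leave-one averages of |x_i - x_j| over j != i, centred at the Gini U-statistic."""
    n = len(xs)
    u_n = gini_u(xs)
    return [
        sum(abs(xi - xj) for j, xj in enumerate(xs) if j != i) / (n - 1) - u_n
        for i, xi in enumerate(xs)
    ]


sample = [3.0, 1.0, 4.0, 1.0, 5.0]
h1_hat = gini_projection_values(sample)
```

Observations near the centre of the sample receive negative values and extreme observations receive positive values, which is the empirical counterpart of $x\mapsto\mathbb E[|x-X_2|]-\theta$.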
The quadratic density estimator has one extra complication because its kernel changes with $n$ through the bandwidth $h$. Still, the fixed-kernel theory gives the right conceptual decomposition: a linear projection term, a degenerate pairwise term, and a smoothing bias.
[example: Projection Heuristic for Estimating Integral Squared Density]
Assume $K$ is symmetric, so $a_h(x,y)=K_h(x-y)$ is a symmetric order-two kernel. For fixed bandwidth $h>0$, define
\begin{align*}
\theta_h=\mathbb E[a_h(X_1,X_2)]=\mathbb E[K_h(X_1-X_2)].
\end{align*}
Since $X_1$ and $X_2$ are independent with common density $f$, the joint density is $(x,y)\mapsto f(x)f(y)$, and therefore
\begin{align*}
\theta_h=\int_{\mathbb R}\int_{\mathbb R}K_h(x-y)f(x)f(y)\,d\mathcal L^1(y)d\mathcal L^1(x).
\end{align*}
For each fixed $x$, the convolution convention used here gives
\begin{align*}
(K_h*f)(x)=\int_{\mathbb R}K_h(x-y)f(y)\,d\mathcal L^1(y).
\end{align*}
Substituting this inner integral into the expression for $\theta_h$ yields
\begin{align*}
\theta_h=\int_{\mathbb R}f(x)(K_h*f)(x)\,d\mathcal L^1(x).
\end{align*}
By the definition of the first projection for an order-two kernel,
\begin{align*}
a_{h,1}(x)=\mathbb E[a_h(x,X_2)]-\theta_h.
\end{align*}
Substituting $a_h(x,y)=K_h(x-y)$ gives
\begin{align*}
\mathbb E[a_h(x,X_2)]=\mathbb E[K_h(x-X_2)].
\end{align*}
Using the density of $X_2$,
\begin{align*}
\mathbb E[K_h(x-X_2)]=\int_{\mathbb R}K_h(x-y)f(y)\,d\mathcal L^1(y).
\end{align*}
By the same convolution convention,
\begin{align*}
\mathbb E[K_h(x-X_2)]=(K_h*f)(x).
\end{align*}
Hence
\begin{align*}
a_{h,1}(x)=(K_h*f)(x)-\theta_h.
\end{align*}
For symmetric $K$, the diagonal-free quadratic estimator is the order-two U-statistic generated by $a_h$:
\begin{align*}
\widehat Q_{n,h}=\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}K_h(X_i-X_j).
\end{align*}
Because the terms with $(i,j)$ and $(j,i)$ are equal, this is also
\begin{align*}
\widehat Q_{n,h}=\binom n2^{-1}\sum_{1\le i<j\le n}a_h(X_i,X_j).
\end{align*}
Thus its fixed-bandwidth first projection term is
\begin{align*}
\frac{2}{n}\sum_{i=1}^n a_{h,1}(X_i)=\frac{2}{n}\sum_{i=1}^n\{(K_h*f)(X_i)-\theta_h\}.
\end{align*}
The deterministic smoothing bias is the gap between the fixed-bandwidth target and the original quadratic functional:
\begin{align*}
\theta_h-Q(f)=\int_{\mathbb R}f(x)(K_h*f)(x)\,d\mathcal L^1(x)-\int_{\mathbb R}f(x)^2\,d\mathcal L^1(x).
\end{align*}
So the estimator separates into a sample-mean fluctuation driven by $(K_h*f)(X_i)-\theta_h$, a degenerate pairwise remainder, and the smoothing bias $\theta_h-Q(f)$; in bandwidth-dependent asymptotics, their relative sizes determine whether the estimator behaves like a regular sample mean or like a degenerate quadratic statistic.
[/example]
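A minimal sketch of the diagonal-free estimator $\widehat Q_{n,h}$ follows. It assumes a standard Gaussian smoothing kernel and the scaling $K_h(u)=K(u/h)/h$; both are assumptions made for illustration, since the text above does not fix a particular $K$. The bandwidth is called `bandwidth` in the code to avoid a clash with the U-statistic kernel notation.

```python
import math
from itertools import combinations


def gaussian_kernel(u):
    """Standard normal density, used here as an assumed choice of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def scaled_kernel(u, bandwidth):
    # Assumed scaling K_h(u) = K(u / h) / h.
    return gaussian_kernel(u / bandwidth) / bandwidth


def q_hat_ordered(xs, bandwidth):
    """Average of K_h(X_i - X_j) over ordered pairs with i != j."""
    n = len(xs)
    total = sum(
        scaled_kernel(xi - xj, bandwidth)
        for i, xi in enumerate(xs)
        for j, xj in enumerate(xs)
        if i != j
    )
    return total / (n * (n - 1))


def q_hat_unordered(xs, bandwidth):
    """Same estimator written as an order-two U-statistic over pairs i < j."""
    n = len(xs)
    total = sum(scaled_kernel(a - b, bandwidth) for a, b in combinations(xs, 2))
    return total / (n * (n - 1) / 2)


sample = [0.1, 0.5, 0.9, 1.3]
q_ordered = q_hat_ordered(sample, 0.4)
q_unordered = q_hat_unordered(sample, 0.4)
```

The two forms agree because $K$ is symmetric, which is the step used above to pass from the sum over $i\ne j$ to the sum over $i<j$.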
## Variance Estimation and Studentisation
How can the limiting theorem be turned into a usable confidence interval when $\zeta_1$ is unknown? The answer is to estimate the first projection empirically. This is another place where the U-statistic viewpoint is practical rather than only structural.
[definition: Empirical Projection Values]
For an order-$m$ U-statistic with kernel $h:E^m\to\mathbb R$ and sample size $n\ge m$, let $\mathcal A_i$ be the collection of all subsets $A\subset \{1,\dots,n\}\setminus\{i\}$ with $|A|=m-1$. The empirical projection map is $\widehat H_1:E^n\to\mathbb R^n$, where its $i$th coordinate is
\begin{align*}
\widehat H_{1,i}(x_1,\dots,x_n):=\binom{n-1}{m-1}^{-1}\sum_{A\in\mathcal A_i} h(x_i,(x_j)_{j\in A})-U_n(x_1,\dots,x_n).
\end{align*}
The leave-one empirical projection values are $\widehat h_{1,i}:=\widehat H_{1,i}(X_1,\dots,X_n)$ for $i=1,\dots,n$.
[/definition]
These values approximate $h_1(X_i)$ by averaging the kernel over tuples containing $X_i$. Their empirical variance provides the natural plug-in estimate of $\zeta_1$, but consistency is not automatic because the leave-one averages are dependent and each is itself computed from the full sample. The next theorem records the standard studentisation result: after controlling this dependence, the estimated projection variance can replace the unknown projection variance in the central limit theorem.
[quotetheorem:6337]
[citeproof:6337]
The fourth-moment assumption is a convenient sufficient condition ensuring that the empirical projection averages and their squared deviations obey the required laws of large numbers. If the kernel has only a second moment, the central limit theorem may still hold while the plug-in variance estimate can be unstable because a few large leave-one projection values dominate the empirical second moment. For instance, for the sample variance kernel $h(x,y)=(x-y)^2/2$ the first projection is $h_1(x)=\{(x-\mathbb E[X])^2-\operatorname{Var}(X)\}/2$, so square integrability of the kernel already requires $\mathbb E[X^4]<\infty$, and the empirical second moment of the projection values averages fourth powers of the centred data; a distribution with finite fourth moment but infinite eighth moment can make these empirical squared projection values highly unstable. In applications, the displayed studentised statistic leads to the usual large-sample interval
\begin{align*}
U_n \pm z_{1-\alpha/2}\frac{m\sqrt{\widehat\zeta_1}}{\sqrt n},
\end{align*}
provided the nondegenerate approximation is credible. The same formula also gives diagnostics: a very small $\widehat\zeta_1$, extreme leave-one projection values, or visible instability under deletion of a few observations signals that the first-order normal approximation may be unreliable. The condition $\zeta_1>0$ is therefore essential: when the first projection vanishes, $\widehat\zeta_1$ estimates a zero first-order variance rather than the smaller-order variance governing the statistic. The main warning is that this studentisation is tied to nondegeneracy and to adequate moment control. In degenerate cases the variance is of smaller order, and the limiting distribution is often a weighted sum of centred chi-squared variables rather than a normal law.
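The large-sample interval displayed above can be sketched for a general order-$m$ kernel as follows. The normalisation of $\widehat\zeta_1$ is an assumption here: the code uses the mean of the squared empirical projection values, and the exact form used in the quoted theorem may differ by a factor such as $n/(n-1)$. All function names are illustrative, and `data` is assumed to be a Python list.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import NormalDist


def u_statistic(data, kernel, m):
    return sum(kernel(*s) for s in combinations(data, m)) / comb(len(data), m)


def empirical_projections(data, kernel, m):
    """Leave-one averages of the kernel over tuples containing x_i, centred at U_n."""
    n = len(data)
    u_n = u_statistic(data, kernel, m)
    values = []
    for i, xi in enumerate(data):
        others = data[:i] + data[i + 1:]
        avg = sum(kernel(xi, *rest) for rest in combinations(others, m - 1))
        values.append(avg / comb(n - 1, m - 1) - u_n)
    return u_n, values


def studentised_interval(data, kernel, m, alpha=0.05):
    n = len(data)
    u_n, values = empirical_projections(data, kernel, m)
    # Assumption: zeta_1 is estimated by the mean of squared projection values.
    zeta_hat = sum(v * v for v in values) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * m * sqrt(zeta_hat) / sqrt(n)
    return u_n - half_width, u_n + half_width


sample = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0]
lo, hi = studentised_interval(sample, lambda x, y: abs(x - y), 2)
```

With $m=1$ and the identity kernel the construction reduces to the familiar normal interval for a mean, with the variance estimated using the $1/n$ normalisation. The interval inherits the caveats above: it is only meaningful when the nondegenerate approximation is credible.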
[remark: What Changes Under Degeneracy]
When $h_1=0$ a.e., the term $m n^{-1}\sum_i h_1(X_i)$ is absent from Hoeffding's decomposition. The next canonical component determines the rate and limit law. For order-two degenerate kernels, $n(U_n-\theta)$ often converges to an infinite weighted sum of variables of the form $Z_j^2-1$, where the weights come from the spectral decomposition of the kernel operator.
[/remark]
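A simple worked case, included here as an illustration, is the product kernel $h(x,y)=xy$ when $\mathbb E[X]=0$ and $\sigma^2=\mathbb E[X^2]\in(0,\infty)$. Then $\theta=\mathbb E[X_1]\mathbb E[X_2]=0$ and $h_1(x)=x\,\mathbb E[X_2]-\theta=0$, so the kernel is degenerate. Using the identity $2\sum_{i<j}X_iX_j=(\sum_i X_i)^2-\sum_i X_i^2$,
\begin{align*}
U_n=\frac{1}{n(n-1)}\left\{\left(\sum_{i=1}^n X_i\right)^2-\sum_{i=1}^n X_i^2\right\},
\end{align*}
and therefore
\begin{align*}
n\,U_n=\frac{n}{n-1}\left\{\left(\frac{1}{\sqrt n}\sum_{i=1}^n X_i\right)^2-\frac{1}{n}\sum_{i=1}^n X_i^2\right\}.
\end{align*}
By the central limit theorem the first term converges in distribution to $\sigma^2Z^2$ with $Z\sim\mathcal N(0,1)$, and by the law of large numbers the second converges almost surely to $\sigma^2$. Slutsky's lemma then gives
\begin{align*}
n\,U_n\xrightarrow{d}\sigma^2(Z^2-1),
\end{align*}
a single-term instance of the weighted chi-squared limit, with rate $n$ rather than $\sqrt n$.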
This chapter's main message is that U-statistics are the bridge between nonparametric functionals and classical asymptotic theory. The same projection idea also connects to adjacent areas: in semiparametric inference it is the finite-sample counterpart of influence-function linearisation, in empirical process theory the degenerate terms are analysed through higher-order stochastic processes, and in bootstrap methods the studentised form determines whether resampling can mimic the correct first-order law. For nondegenerate functionals, the first projection supplies both the limiting variance and the route to studentisation. For degenerate or bandwidth-dependent functionals, the same decomposition identifies exactly which term requires new analysis.
Chapters 2 through 8 developed large-sample theory: when do estimators converge to limits, and what are the asymptotic distributions? This chapter takes a different approach through rank and permutation methods, which ask what inference is possible from the ordering of observations alone, without parametric assumptions or asymptotics. The answers are surprisingly complete: finite-sample validity and power can coexist.
# 9. Rank and Permutation Methods
Rank and permutation methods ask how much inference can be done after discarding the numerical scale of the observations. The chapter assumes the basic probability language of random variables, conditional distributions, expectation, variance, convergence in distribution, and the central limit theorem; it also uses the earlier $U$-statistic viewpoint that pairwise averages can be analysed through projections. Chapters 3 and 8 used empirical-process and $U$-statistic ideas to control estimators and smoothers; here the focus shifts to tests whose null distribution is determined by symmetry rather than by an unknown density. The central mechanism is exchangeability: if the data-generating law is invariant under a group of transformations, then averaging over that group gives exact finite-sample calibration.
The chapter begins with ranks, signs, and permutation distributions, then develops the Wilcoxon signed-rank and Mann-Whitney procedures as canonical linear rank tests. The final section explains how these tests behave under local alternatives and why rank methods can lose little efficiency under Gaussian models while gaining substantial robustness under heavier-tailed models.
## Exact Inference from Symmetry
Suppose we want a test of a null hypothesis without estimating nuisance features such as the common density, scale, or shape. The guiding question is: when does the null hypothesis itself imply enough symmetry to compute the distribution of a statistic exactly?
[definition: Exchangeable Random Variables]
Let $X_1,\dots,X_n$ be random variables taking values in a measurable space $(E,\mathcal E)$. They are exchangeable if, for every permutation $\pi$ of $\{1,\dots,n\}$,
\begin{align*}
(X_1,\dots,X_n) \overset{d}{=} (X_{\pi(1)},\dots,X_{\pi(n)}).
\end{align*}
[/definition]
Exchangeability is weaker than independence with identical distribution, but it gives the same invariance under relabelling. This motivates the following example, which records the main source of exchangeability used in rank and permutation tests.
[example: Independent Identically Distributed Samples Are Exchangeable]
Let $X_1,\dots,X_n$ be independent random variables with common distribution $\mu$ on $(E,\mathcal E)$, and fix a permutation $\pi$ of $\{1,\dots,n\}$. For measurable sets $A_1,\dots,A_n\in\mathcal E$, the random variables $X_{\pi(1)},\dots,X_{\pi(n)}$ are still independent, so
\begin{align*}
\mathbb P(X_{\pi(1)}\in A_1,\dots,X_{\pi(n)}\in A_n)=\prod_{k=1}^n \mathbb P(X_{\pi(k)}\in A_k).
\end{align*}
Since each $X_{\pi(k)}$ has distribution $\mu$,
\begin{align*}
\prod_{k=1}^n \mathbb P(X_{\pi(k)}\in A_k)=\prod_{k=1}^n \mu(A_k).
\end{align*}
Applying independence to the unpermuted vector gives
\begin{align*}
\mathbb P(X_1\in A_1,\dots,X_n\in A_n)=\prod_{k=1}^n \mathbb P(X_k\in A_k).
\end{align*}
Since each $X_k$ also has distribution $\mu$,
\begin{align*}
\prod_{k=1}^n \mathbb P(X_k\in A_k)=\prod_{k=1}^n \mu(A_k).
\end{align*}
Combining the two displayed calculations,
\begin{align*}
\mathbb P(X_{\pi(1)}\in A_1,\dots,X_{\pi(n)}\in A_n)=\mathbb P(X_1\in A_1,\dots,X_n\in A_n).
\end{align*}
This equality holds for every measurable rectangle $A_1\times\cdots\times A_n$, and measurable rectangles generate the product $\sigma$-algebra on $E^n$. Hence the vectors have the same joint law:
\begin{align*}
(X_{\pi(1)},\dots,X_{\pi(n)})\overset{d}{=}(X_1,\dots,X_n).
\end{align*}
Thus every relabelling of an i.i.d. sample has the same distribution as the original sample, which is exactly the symmetry that permutation methods use.
[/example]
The example shows that relabelling can preserve the law, but a test needs a reference distribution after the data have been observed. Once the observed sample is fixed, the only data sets allowed by the symmetry argument are the points in its group orbit. The conditional null reference is therefore not a fresh sampling distribution from the model; it is the empirical distribution obtained by applying every symmetry transformation to the observed data.
[definition: Permutation Distribution]
Let $G$ be a finite group, let $\mathcal X$ be a measurable sample space, and let
\begin{align*}
G\times \mathcal X &\to \mathcal X, & (g,x)&\mapsto gx
\end{align*}
be a [group action](/page/Group%20Action) such that each map $x\mapsto gx$ is measurable from $\mathcal X$ to $\mathcal X$. Let $T:\mathcal X \to \mathbb R$ be a test statistic. For observed data $x \in \mathcal X$, the permutation distribution of $T$ over $G$ is the discrete probability measure assigning mass $1/|G|$ to each value $T(gx)$, $g \in G$.
[/definition]
This distribution is conditional on the orbit $\{gx:g\in G\}$. The calibration problem is whether a tail probability computed from that finite orbit is conservative for the original null law before conditioning. That requires the null distribution to be invariant under the whole group action, so that the observed point is, under the null, indistinguishable from its transformed copies within the orbit.
[quotetheorem:6338]
[citeproof:6338]
The invariance hypothesis is the whole source of the conclusion: if the group action does not preserve the null law, the orbit average is just an artificial resampling device and need not control type I error. The finiteness of $G$ is also being used, since the p-value is an ordinary finite average over transformed datasets; continuous transformation groups require Haar-measure versions with additional measurability conditions. The theorem does not say that every permutation test is powerful, only that its null calibration is valid when the symmetry is genuine. A common failure occurs in two-sample problems with unequal variances: labels are not exchangeable even if the means agree. This motivates the following example, where repeated data values force a careful convention at the tail boundary.
[example: Permutation P-Value with Ties]
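Consider a two-sample comparison with two observations per group, and let $T$ be the first-group mean minus the second-group mean. Take the observed first group to be $\{1,3\}$ and the observed second group to be $\{1,2\}$. Then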
\begin{align*}
T(x)=\frac{1+3}{2}-\frac{1+2}{2}=2-\frac{3}{2}=\frac{1}{2}.
\end{align*}
There are $\binom{4}{2}=6$ assignments of two pooled observations to the first group. Write the two equal observations as $1_a$ and $1_b$ only to distinguish label assignments. If the first group is $\{1_a,1_b\}$, then
\begin{align*}
T=\frac{1+1}{2}-\frac{2+3}{2}=1-\frac{5}{2}=-\frac{3}{2}.
\end{align*}
If the first group is $\{1_a,2\}$, then
\begin{align*}
T=\frac{1+2}{2}-\frac{1+3}{2}=\frac{3}{2}-2=-\frac{1}{2}.
\end{align*}
If the first group is $\{1_b,2\}$, the same calculation gives
\begin{align*}
T=\frac{1+2}{2}-\frac{1+3}{2}=\frac{3}{2}-2=-\frac{1}{2}.
\end{align*}
If the first group is $\{1_a,3\}$, then
\begin{align*}
T=\frac{1+3}{2}-\frac{1+2}{2}=2-\frac{3}{2}=\frac{1}{2}.
\end{align*}
If the first group is $\{1_b,3\}$, the same calculation gives
\begin{align*}
T=\frac{1+3}{2}-\frac{1+2}{2}=2-\frac{3}{2}=\frac{1}{2}.
\end{align*}
If the first group is $\{2,3\}$, then
\begin{align*}
T=\frac{2+3}{2}-\frac{1+1}{2}=\frac{5}{2}-1=\frac{3}{2}.
\end{align*}
Thus the permutation distribution assigns mass $1/6$ to $-3/2$, mass $2/6$ to $-1/2$, mass $2/6$ to $1/2$, and mass $1/6$ to $3/2$. For the upper-tail test at the observed value $T(x)=1/2$, the conservative permutation p-value is
\begin{align*}
p_G(x)=\frac{1}{6}\#\{g:T(gx)\ge 1/2\}.
\end{align*}
The transformed datasets with $T(gx)\ge 1/2$ consist of the two assignments with $T(gx)=1/2$ and the one assignment with $T(gx)=3/2$, so
\begin{align*}
p_G(x)=\frac{2+1}{6}=\frac{1}{2}.
\end{align*}
The boundary value $1/2$ is counted twice because replacing $1_a$ by $1_b$ changes the label assignment but leaves the numerical statistic unchanged.
A randomised upper-tail p-value separates the strict tail from the tied boundary. If $V$ is independent and uniform on $[0,1]$, then in this example
\begin{align*}
p_{\mathrm{rand}}(x,V)=\frac{\#\{g:T(gx)>1/2\}+V\#\{g:T(gx)=1/2\}}{6}=\frac{1+2V}{6}.
\end{align*}
The conservative value includes the whole tied boundary, while the randomised value takes a uniformly chosen fraction of that boundary, which is the tie-breaking convention needed for exact conditional calibration on the finite permutation orbit.
[/example]
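The enumeration in this example can be reproduced with exact rational arithmetic. The sketch below assumes integer-valued observations (so that `Fraction` can be built from sums and counts) and uses the mean-difference statistic from the example; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def mean_difference(first, second):
    """First-group mean minus second-group mean, as an exact fraction."""
    return Fraction(sum(first), len(first)) - Fraction(sum(second), len(second))


def permutation_tail_counts(first, second):
    """Enumerate all relabellings; count strict-tail and tied assignments."""
    pooled = list(first) + list(second)
    k = len(first)
    observed = mean_difference(first, second)
    greater = equal = total = 0
    # Index positions rather than values, so tied observations stay distinct.
    for idx in combinations(range(len(pooled)), k):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        t = mean_difference(g1, g2)
        greater += t > observed
        equal += t == observed
        total += 1
    return greater, equal, total


greater, equal, total = permutation_tail_counts([1, 3], [1, 2])
p_conservative = Fraction(greater + equal, total)


def p_randomised(v):
    """Randomised p-value for a given value v in [0, 1] of the auxiliary uniform."""
    return (greater + Fraction(v) * equal) / total
```

Enumerating index sets rather than value sets is what makes the tied observations $1_a$ and $1_b$ count as separate label assignments, as in the hand calculation.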
The permutation example uses relabelling symmetry, but paired experiments require a different invariance. In a paired design, the subjects are not relabelled; instead, the evidence is carried by the directions of the within-subject differences.
To turn those directions into an exact reference distribution, we need a condition that licenses changing any subset of signs while leaving the joint law unchanged. Marginal symmetry of each difference is too weak, since dependent signs could still make some sign patterns more likely than others.
[definition: Sign Symmetry]
A real-valued [random variable](/page/Random%20Variable) $Y$ has sign symmetry about $0$ if $Y \overset{d}{=} -Y$. A vector $(Y_1,\dots,Y_n)$ has independent sign symmetry if
\begin{align*}
(Y_1,\dots,Y_n) \overset{d}{=} (\varepsilon_1Y_1,\dots,\varepsilon_nY_n)
\end{align*}
for every fixed sign vector $(\varepsilon_1,\dots,\varepsilon_n) \in \{-1,1\}^n$.
[/definition]
Sign symmetry is the paired-sample analogue of exchangeability. This motivates the following example, which shows how a paired null can give an exact distribution even when the paired differences are far from Gaussian.
[example: Paired Nonnormal Measurements]
Let $A_i$ and $B_i$ be two measurements on subject $i$, and set $Y_i=A_i-B_i$. Under a paired null with independent sign symmetry about $0$, for every fixed sign vector $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)\in\{-1,1\}^n$,
\begin{align*}
(Y_1,\dots,Y_n)\overset{d}{=}(\varepsilon_1Y_1,\dots,\varepsilon_nY_n).
\end{align*}
Writing $M_i=|Y_i|$, each possible sign-randomised vector has the form
\begin{align*}
(\varepsilon_1M_1,\dots,\varepsilon_nM_n),
\end{align*}
and there are
\begin{align*}
|\{-1,1\}^n|=2\cdot 2\cdots 2=2^n
\end{align*}
such sign assignments.
For example, if a sign statistic is
\begin{align*}
S(y)=\sum_{i=1}^n s_i\,\mathbb 1_{\{y_i>0\}},
\end{align*}
with fixed weights $s_i$ determined by the observed magnitudes, then its sign-randomisation distribution conditional on $M_1,\dots,M_n$ assigns mass $2^{-n}$ to each value
\begin{align*}
S(\varepsilon_1M_1,\dots,\varepsilon_nM_n)
=\sum_{i=1}^n s_i\,\mathbb 1_{\{\varepsilon_iM_i>0\}}
=\sum_{i=1}^n s_i\,\mathbb 1_{\{\varepsilon_i=1\}},
\end{align*}
assuming $M_i>0$ for all $i$. Thus the reference distribution is obtained by keeping the observed magnitudes fixed and enumerating all $2^n$ sign patterns, so the paired test is exact under sign symmetry without any Gaussian error assumption.
[/example]
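The sign-randomisation reference distribution described in this example can be enumerated directly for small $n$. The sketch below assumes nonzero magnitudes and user-supplied fixed weights $s_i$; the helper names `sign_randomisation_distribution` and `upper_tail_p_value` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sign_randomisation_distribution(weights):
    """Exact law of S = sum_i s_i * 1{eps_i = 1} over all 2^n sign vectors."""
    n = len(weights)
    counts = Counter()
    for eps in product((-1, 1), repeat=n):
        counts[sum(w for w, e in zip(weights, eps) if e == 1)] += 1
    return {value: Fraction(c, 2 ** n) for value, c in sorted(counts.items())}

def upper_tail_p_value(weights, signs):
    """P(S >= observed S) under the sign-randomisation law, ties included."""
    observed = sum(w for w, e in zip(weights, signs) if e == 1)
    dist = sign_randomisation_distribution(weights)
    return sum(p for value, p in dist.items() if value >= observed)

# With all weights equal to 1 the statistic is the sign-test count of positives.
sign_test_law = sign_randomisation_distribution([1, 1, 1])
p_value = upper_tail_p_value([1, 1, 1, 1, 1], [1, 1, 1, 1, -1])
print(sign_test_law)
print(p_value)
```

Enumeration costs $2^n$ evaluations, so this approach is only practical for small samples; for larger $n$ one would sample sign vectors at random or use a normal approximation.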
The sign-randomisation construction gives exact calibration, but signs alone cannot distinguish a small positive difference from the largest positive difference in the sample. Using raw magnitudes would add that information in a scale-dependent way, so the next object must keep only the ordering of observed values.
We need the notion of rank to formalise this scale-free ordering information before building rank tests. A rank records where an observation sits in the sample order while discarding the numerical spacing between observations, which is exactly the feature required when the null law should not depend on the unknown measurement distribution.
[definition: Rank]
Let $x_1,\dots,x_n \in \mathbb R$ be distinct. The rank of $x_i$ among $x_1,\dots,x_n$ is
\begin{align*}
R_i = \sum_{j=1}^n \mathbb{1}_{\{x_j \le x_i\}}.
\end{align*}
[/definition]
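As a small illustration of the definition, the following sketch computes ranks by the counting formula and checks that a strictly increasing transformation of the data leaves them unchanged. The function name `ranks` and the sample values are assumptions made for this example only.

```python
import math

def ranks(xs):
    """R_i = #{j : x_j <= x_i}; the definition assumes distinct values."""
    if len(set(xs)) != len(xs):
        raise ValueError("this rank definition assumes distinct values")
    return [sum(xj <= xi for xj in xs) for xi in xs]

x = [2.5, -1.0, 10.0, 0.3]
r = ranks(x)

# exp is strictly increasing, so the ordering and hence the ranks are preserved.
r_transformed = ranks([math.exp(v) for v in x])

print(r, r_transformed)
```

The quadratic-time counting loop mirrors the definition; a sort-based computation would be the usual choice for large samples.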
Ranks remove information about the marginal distribution while preserving order information. The remaining question is whether this loss of numerical information actually removes the unknown distribution from the null law. For continuous i.i.d. observations, the decisive symmetry is that every ordering of the sample should be equally likely, making the rank vector universal rather than distribution-specific.
[quotetheorem:6339]
[citeproof:6339]
The continuity assumption is needed to remove ties; with atoms, rank vectors depend on the tie-breaking convention and on the jump sizes of $F$. Independence and identical distribution are also doing separate work: identical distribution gives relabelling symmetry, while independence rules out dependence patterns that could bias orderings. The theorem does not claim that ranks contain all information about $F$; it says that after passing to ranks, the null law is universal. For example, if half the observations are drawn from one continuous distribution and half from another, the rank vector is no longer uniform. This prepares the first major rank procedure, where signs determine direction and ranks of magnitudes determine weights.
## Wilcoxon Signed-Rank Testing
In paired data, the sign test uses only whether $Y_i$ is positive or negative. The signed-rank test asks whether we can gain power by also using the relative magnitudes $|Y_i|$ while retaining exact calibration under sign symmetry.
[definition: Wilcoxon Signed-Rank Statistic]
Let
\begin{align*}
\mathcal W_n=\{y=(y_1,\dots,y_n)\in\mathbb R^n:y_i\ne 0\text{ for all }i,\ |y_i|\ne |y_j|\text{ for }i\ne j\}.
\end{align*}
For $y\in\mathcal W_n$, let $A_i(y)$ be the rank of $|y_i|$ among $|y_1|,\dots,|y_n|$. The Wilcoxon signed-rank statistic is the map $W_{n,+}:\mathcal W_n\to\mathbb R$ defined by
\begin{align*}
W_{n,+}(y)=\sum_{i=1}^n A_i(y)\mathbb{1}_{\{y_i>0\}}.
\end{align*}
[/definition]
The statistic gives larger weight to signs attached to larger absolute differences, but that weighting is valid only if the weights can be treated as fixed while the directions are randomized. Otherwise the rank weights could carry distributional information that changes the null law.
Exact calibration therefore requires a theorem that converts sign symmetry into a conditional law for the signed-rank statistic. The needed statement isolates whether, after conditioning on the observed absolute values, the signs behave like independent fair coin flips under the paired null.
[quotetheorem:6340]
[citeproof:6340]
Independent sign symmetry is stronger than marginal symmetry: it requires all sign changes of the vector to have the same joint law, so dependence among signs would destroy the product Bernoulli representation. The continuity assumptions remove zero differences and tied absolute values, because either phenomenon changes the available sign patterns or the rank weights. The theorem does not say that the signed-rank statistic is valid for every null hypothesis of zero mean; a skew distribution with mean $0$ generally fails sign symmetry. The exact law gives a finite-sample test, but tables and asymptotic calculations require its centre, spread, and limiting shape. This motivates the following normal approximation.
[quotetheorem:6341]
[citeproof:6341]
The Bernoulli-sum representation is essential here: it reduces the signed-rank statistic to a weighted sum of independent centred variables. The variance grows like $n^3$, while the largest single summand has variance of order only $n^2$, so no single observation controls the standardised statistic. The theorem relies on three separate assumptions, and each can fail in its own way. Tied absolute values replace the weights $1,\dots,n$ by an observed multiset of midranks or by randomly broken ranks, so the variance and support of the null law change. Dependence among paired differences can make the sign indicators dependent even when each marginal distribution is symmetric. Lack of joint sign symmetry, such as skew paired differences with mean $0$, destroys the fair-sign representation itself. The normal approximation is useful for large $n$ under the ideal continuous sign-symmetric model, but the exact sign-randomisation distribution remains preferable for small samples and for software implementation. This motivates the following remark on practical conventions for zeros and ties.
[remark: Ties and Zero Differences in Signed-Rank Tests]
If some $Y_i=0$, common conventions either discard those pairs or split their signs symmetrically; the chosen convention changes the finite-sample null distribution. If absolute values are tied, midranks are often assigned, and the exact null distribution is then computed from the observed multiset of midrank weights. The validity argument remains a sign-randomisation argument, but the reference distribution is no longer the simple law of $\sum_{i=1}^n iB_i$.
[/remark]
The test is often used when the paired differences have heavy tails or outliers. This motivates the following example, where order and sign information remain stable despite extreme numerical values.
[example: Signed-Rank Test for Heavy-Tailed Paired Differences]
Suppose the paired differences satisfy $Y_i=\theta+\varepsilon_i$, where the errors have independent sign symmetry, so $(\varepsilon_1,\dots,\varepsilon_n)$ has the same law as $(\delta_1\varepsilon_1,\dots,\delta_n\varepsilon_n)$ for every fixed sign vector $\delta\in\{-1,1\}^n$. Under the null hypothesis $\theta=0$, $Y_i=\varepsilon_i$, and therefore
\begin{align*}
(Y_1,\dots,Y_n)\overset{d}{=}(\delta_1Y_1,\dots,\delta_nY_n).
\end{align*}
This is the sign-randomisation symmetry used to calibrate the signed-rank test.
The signed-rank statistic is
\begin{align*}
W_{n,+}(Y_1,\dots,Y_n)=\sum_{i=1}^n A_i(Y)\mathbb 1_{\{Y_i>0\}},
\end{align*}
where $A_i(Y)$ is the rank of $|Y_i|$ among $|Y_1|,\dots,|Y_n|$. Conditional on the observed magnitudes $M_i=|Y_i|$, a sign-randomised vector has the form $(\delta_1M_1,\dots,\delta_nM_n)$. If $M_i>0$ for all $i$, then
\begin{align*}
W_{n,+}(\delta_1M_1,\dots,\delta_nM_n)=\sum_{i=1}^n A_i(M)\mathbb 1_{\{\delta_iM_i>0\}}.
\end{align*}
Since $M_i>0$, the event $\{\delta_iM_i>0\}$ is the same as $\{\delta_i=1\}$, so
\begin{align*}
W_{n,+}(\delta_1M_1,\dots,\delta_nM_n)=\sum_{i=1}^n A_i(M)\mathbb 1_{\{\delta_i=1\}}.
\end{align*}
Thus the exact null reference distribution assigns probability $2^{-n}$ to each of the $2^n$ sign vectors $\delta\in\{-1,1\}^n$.
For a one-sided test of $\theta=0$ against $\theta>0$, large values of $W_{n,+}$ mean that positive signs are attached to large absolute paired differences. If the errors have Cauchy-like tails, one extreme observation can strongly affect the paired mean: replacing one observation by a value $z$ contributes $z/n$ to
\begin{align*}
\bar Y=\frac{1}{n}\sum_{i=1}^n Y_i.
\end{align*}
By contrast, the signed-rank statistic uses that observation only through its sign and the rank of its magnitude. The test therefore keeps exact null calibration from sign symmetry while reducing dependence on the numerical size of extreme paired differences.
[/example]
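The calibration and robustness claims in this example can be checked on a small sample. The sketch below is illustrative only: the data vector is made up, the function names are hypothetical, and the closed-form moments $n(n+1)/4$ and $n(n+1)(2n+1)/24$ are the standard null mean and variance of the signed-rank statistic in the no-ties, no-zeros case.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def signed_rank_statistic(y):
    """W_{n,+}(y): sum of the ranks of |y_i| over the positive y_i."""
    mags = [abs(v) for v in y]
    if 0 in mags or len(set(mags)) != len(mags):
        raise ValueError("requires nonzero differences with distinct absolute values")
    ranks = [sum(mj <= mi for mj in mags) for mi in mags]
    return sum(r for r, v in zip(ranks, y) if v > 0)

def exact_null_distribution(n):
    """Law of sum_i i * B_i with independent fair Bernoulli B_i, by enumeration."""
    counts = Counter()
    for bits in product((0, 1), repeat=n):
        counts[sum(i * b for i, b in zip(range(1, n + 1), bits))] += 1
    return {w: Fraction(c, 2 ** n) for w, c in sorted(counts.items())}

y = [0.4, -1.1, 2.3, 3.0, -0.2, 5.7]           # hypothetical paired differences
n = len(y)
w_obs = signed_rank_statistic(y)
null = exact_null_distribution(n)
p_upper = sum(p for w, p in null.items() if w >= w_obs)

mean = sum(w * p for w, p in null.items())
var = sum((w - mean) ** 2 * p for w, p in null.items())

# Inflating the largest positive difference changes the mean but not the statistic.
y_outlier = y[:-1] + [5.7e6]
w_outlier = signed_rank_statistic(y_outlier)

print(w_obs, p_upper, mean, var, w_outlier)
```

The last two lines illustrate the contrast with the paired mean: the outlier moves $\bar Y$ by roughly $10^6$ while the signed-rank statistic and its p-value are unchanged.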
## Mann-Whitney and Two-Sample Rank Testing
For two independent samples, the analogue of sign symmetry is label exchangeability under a common distribution. The problem is to test whether two populations differ in location while avoiding assumptions about the common density or variance.
[definition: Mann-Whitney Statistic]
Let
\begin{align*}
\mathcal M_{m,n}=\{(x,y)\in\mathbb R^m\times\mathbb R^n: x_i\ne x_{i'}\text{ for }i\ne i',\ y_j\ne y_{j'}\text{ for }j\ne j',\ x_i\ne y_j\text{ for all }i,j\}.
\end{align*}
The Mann-Whitney statistic is the map $U_{m,n}:\mathcal M_{m,n}\to\mathbb R$ defined by
\begin{align*}
U_{m,n}(x,y)=\sum_{i=1}^m\sum_{j=1}^n \mathbb{1}_{\{x_i<y_j\}}.
\end{align*}
[/definition]
The statistic counts favourable cross-pairs. This motivates the following example, which connects the statistic to a two-sample shift alternative.
[example: Two-Sample Location Shift]
Let $X_i\sim F$ and $Y_j\sim F(\cdot-\theta)$ independently, where positive $\theta$ shifts the $Y$ distribution to the right. Equivalently, we may write $Y_j=X_j'+\theta$, where $X_j'\sim F$ and $X_j'$ is independent of $X_i$. The Mann-Whitney statistic averages the cross-pair indicators:
\begin{align*}
\frac{U_{m,n}}{mn}=\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb 1_{\{X_i<Y_j\}}.
\end{align*}
For each pair,
\begin{align*}
\mathbb E[\mathbb 1_{\{X_i<Y_j\}}]=\mathbb P(X_i<Y_j).
\end{align*}
Since every pair $(X_i,Y_j)$ has the same joint distribution as $(X,X'+\theta)$ for independent $X\sim F$ and $X'\sim F$,
\begin{align*}
\mathbb P(X_i<Y_j)=\mathbb P(X<X'+\theta).
\end{align*}
By linearity of expectation,
\begin{align*}
\mathbb E\left[\frac{U_{m,n}}{mn}\right]=\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb P(X_i<Y_j).
\end{align*}
Substituting the common pair probability gives
\begin{align*}
\mathbb E\left[\frac{U_{m,n}}{mn}\right]=\mathbb P(X<X'+\theta).
\end{align*}
Under $\theta=0$, the events $\{X<X'\}$, $\{X'<X\}$, and $\{X=X'\}$ partition the sample space, so
\begin{align*}
\mathbb P(X<X')+\mathbb P(X'<X)+\mathbb P(X=X')=1.
\end{align*}
If $F$ is continuous, then $\mathbb P(X=X')=0$. Also, $(X,X')$ and $(X',X)$ have the same joint distribution, so
\begin{align*}
\mathbb P(X<X')=\mathbb P(X'<X).
\end{align*}
Therefore,
\begin{align*}
2\mathbb P(X<X')=1.
\end{align*}
Hence, under the no-shift null,
\begin{align*}
\mathbb P(X<Y)=\mathbb P(X<X')=\frac12.
\end{align*}
For $\theta>0$,
\begin{align*}
\mathbb P(X<X'+\theta)=\mathbb P(X-X'<\theta).
\end{align*}
The event $\{X-X'<\theta\}$ is the disjoint union of $\{X-X'<0\}$ and $\{0\le X-X'<\theta\}$, so
\begin{align*}
\mathbb P(X-X'<\theta)=\mathbb P(X-X'<0)+\mathbb P(0\le X-X'<\theta).
\end{align*}
The first term is $1/2$ by the preceding symmetry calculation. Thus
\begin{align*}
\mathbb P(X<Y)=\frac12+\mathbb P(0\le X-X'<\theta).
\end{align*}
Whenever the distribution of $X-X'$ assigns positive probability to $[0,\theta)$, this probability is larger than $1/2$. In that case $U_{m,n}/(mn)$ is centred above $1/2$, so the Mann-Whitney test is detecting a stochastic tendency for observations from the shifted $Y$ sample to exceed observations from the $X$ sample.
[/example]
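The pair-count statistic and its behaviour under a shift can be illustrated with a short sketch. The sample values below are invented for illustration and the function names are hypothetical. The first check is deterministic: shifting the $Y$ sample to the right can only turn more indicators $\mathbb 1_{\{x_i<y_j\}}$ on. The second check averages $U_{m,n}$ over all assignments of $m$ labels to a fixed pooled sample of distinct values, which gives $mn/2$, matching the value $1/2$ for $U_{m,n}/(mn)$ under no shift.

```python
from fractions import Fraction
from itertools import combinations

def mann_whitney_u(x, y):
    """U = number of cross-pairs (i, j) with x_i < y_j."""
    return sum(xi < yj for xi in x for yj in y)

x = [0.1, 0.7, 1.5]
y0 = [0.4, 0.9, 1.2, 2.0]

# U as a function of the shift added to every Y observation.
shifts = [0.0, 0.25, 0.5, 1.0]
u_by_shift = [mann_whitney_u(x, [v + t for v in y0]) for t in shifts]

def label_average_u(pooled, m):
    """Average of U over all choices of which m pooled values carry the X label."""
    total, count = 0, 0
    for idx in combinations(range(len(pooled)), m):
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += mann_whitney_u(xs, ys)
        count += 1
    return Fraction(total, count)

avg_u = label_average_u(x + y0, len(x))
print(u_by_shift, avg_u)
```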
The pair-count definition is intuitive, while the permutation null distribution is easiest to derive from pooled ranks. The rank-sum form is also what software often tabulates, so we need an exact algebraic bridge between the two statistics. This motivates the following representation theorem.
[quotetheorem:6342]
[citeproof:6342]
The no-ties hypothesis makes every cross-pair fall into exactly one of the two cases $X_i<Y_j$ or $Y_j<X_i$. If ties are present, the identity must be modified by midranks or by an explicit tie-breaking convention. The theorem is algebraic rather than probabilistic: it does not assert validity of a test until a null distribution for the pooled ranks has been specified. Its importance is that any exact or asymptotic statement for the rank sum transfers immediately to $U_{m,n}$. This motivates the following theorem, because under the common-distribution null the unknown distribution fixes the ordered values but not which rank positions receive the $X$ label.
[quotetheorem:6343]
[citeproof:6343]
The common continuous distribution assumption has two roles: it makes the ordered values distribution-free after ranking, and it makes the sample labels exchangeable among ordered positions. If $X$ and $Y$ have the same median but different spreads, the subset law need not hold because labels carry information about rank location. The theorem does not say that the rank-sum test is a pure median test; it says that under equality in distribution the calibration is exact. The exact subset distribution is finite and computable, but its support grows quickly with sample size. This motivates the following large-sample approximation with the correct finite-population variance.
[quotetheorem:6344]
[citeproof:6344]
The balance condition $m/(m+n)\to\lambda\in(0,1)$ prevents a degenerate limit in which one sample supplies too few labels for a stable rank-sum fluctuation. If $m$ stays fixed while $n\to\infty$, the statistic is driven by finitely many random rank locations rather than by a growing finite-population sum, so the Gaussian sampling-without-replacement limit is not the right reference. Continuity again removes ties: for discrete data the statistic may place positive mass on $X_i=Y_j$, and the variance must include the chosen convention for tied pairs or midranks. The theorem is a null approximation; under fixed alternatives the statistic is centred elsewhere and the same standardisation is no longer appropriate. The Mann-Whitney statistic is also a two-sample $U$-statistic with kernel $h(x,y)=\mathbb{1}_{\{x<y\}}$. This motivates the following interpretation away from the null.
[remark: What Mann-Whitney Estimates]
For independent $X\sim F$ and $Y\sim G$ with no ties, $U_{m,n}/(mn)$ estimates $\mathbb P(X<Y)$. This equals $1/2$ when $F=G$ and both distributions are continuous, but the converse need not hold. Thus the Mann-Whitney test is most naturally interpreted as a test of stochastic ordering or pairwise dominance, with the pure location-shift interpretation requiring additional structure.
[/remark]
This distinction matters in applications: unequal shapes can change $\mathbb P(X<Y)$ even when medians agree. This motivates the following example, which warns against reporting Mann-Whitney as only a median test without a location-shift model.
[example: Same Median but Different Spread]
Let $X\sim\operatorname{Unif}(-1,1)$, and let $Y$ be independent of $X$ with
\begin{align*}
\mathbb P(Y=0)=\frac12,\qquad \mathbb P(Y=2)=\frac14,\qquad \mathbb P\left(Y=-\frac12\right)=\frac14.
\end{align*}
The distribution of $X$ has median $0$ because
\begin{align*}
\mathbb P(X\le 0)=\frac{0-(-1)}{1-(-1)}=\frac12.
\end{align*}
Also,
\begin{align*}
\mathbb P(X\ge 0)=\frac{1-0}{1-(-1)}=\frac12.
\end{align*}
The distribution of $Y$ also has median $0$, since
\begin{align*}
\mathbb P(Y\le 0)=\mathbb P\left(Y=-\frac12\right)+\mathbb P(Y=0)=\frac14+\frac12=\frac34\ge \frac12.
\end{align*}
Similarly,
\begin{align*}
\mathbb P(Y\ge 0)=\mathbb P(Y=0)+\mathbb P(Y=2)=\frac12+\frac14=\frac34\ge \frac12.
\end{align*}
Now compute the pairwise ordering probability by conditioning on the value of $Y$. By the [law of total probability](/theorems/1113),
\begin{align*}
\mathbb P(X<Y)=\mathbb P(X<0)\mathbb P(Y=0)+\mathbb P(X<2)\mathbb P(Y=2)+\mathbb P\left(X<-\frac12\right)\mathbb P\left(Y=-\frac12\right).
\end{align*}
Because $X$ is uniform on $(-1,1)$,
\begin{align*}
\mathbb P(X<0)=\frac12.
\end{align*}
Also,
\begin{align*}
\mathbb P(X<2)=1.
\end{align*}
Finally,
\begin{align*}
\mathbb P\left(X<-\frac12\right)=\frac{-1/2-(-1)}{1-(-1)}=\frac{1/2}{2}=\frac14.
\end{align*}
Substituting these values gives
\begin{align*}
\mathbb P(X<Y)=\frac12\cdot\frac12+1\cdot\frac14+\frac14\cdot\frac14.
\end{align*}
Thus
\begin{align*}
\mathbb P(X<Y)=\frac14+\frac14+\frac{1}{16}=\frac{9}{16}.
\end{align*}
The two distributions therefore have the same median, but the Mann-Whitney target $\mathbb P(X<Y)$ is not $1/2$. The test responds to pairwise ordering, so interpreting it as only a median comparison requires an additional location-shift model.
[/example]
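The arithmetic in this example is easy to reproduce with exact fractions. The sketch below encodes the law of $Y$ as a dictionary and the uniform distribution function as a small helper; both names (`y_law`, `uniform_cdf`) are illustrative choices rather than part of any library.

```python
from fractions import Fraction

def uniform_cdf(t, a=Fraction(-1), b=Fraction(1)):
    """P(X < t) for X ~ Unif(a, b)."""
    if t <= a:
        return Fraction(0)
    if t >= b:
        return Fraction(1)
    return (Fraction(t) - a) / (b - a)

# Law of Y: value -> probability.
y_law = {
    Fraction(0): Fraction(1, 2),
    Fraction(2): Fraction(1, 4),
    Fraction(-1, 2): Fraction(1, 4),
}

# Median check for Y: both tail probabilities at 0 must be at least 1/2.
p_y_le_0 = sum(p for v, p in y_law.items() if v <= 0)
p_y_ge_0 = sum(p for v, p in y_law.items() if v >= 0)

# Law of total probability, conditioning on the value of Y.
p_x_less_y = sum(uniform_cdf(v) * p for v, p in y_law.items())

print(p_y_le_0, p_y_ge_0, p_x_less_y)
```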
## Linear Rank Statistics and Large-Sample Theory
The signed-rank and rank-sum procedures are special cases of a broader construction. The question now is how to understand a whole class of rank tests through common score functions and a common asymptotic normality theorem.
[definition: Two-Sample Linear Rank Statistic]
Let $N=m+n$, and let
\begin{align*}
\mathcal L_{m,n}=\{z=(z_1,\dots,z_N)\in\mathbb R^N:z_i\ne z_j\text{ for }i\ne j\}.
\end{align*}
The first $m$ coordinates represent sample $X$ and the last $n$ coordinates represent sample $Y$. Given scores $a_N:\{1,\dots,N\}\to\mathbb R$, and writing $R_i(z)$ for the pooled rank of $z_i$, the two-sample linear rank statistic is the map $L_N^{(a)}:\mathcal L_{m,n}\to\mathbb R$ defined by
\begin{align*}
L_N^{(a)}(z)=\sum_{i=1}^m a_N(R_i(z)).
\end{align*}
[/definition]
Different score choices emphasise different parts of the distribution. This motivates the following example, which compares several common scoring rules while keeping the same permutation framework.
[example: Score Choices]
For the two-sample linear rank statistic
\begin{align*}
L_N^{(a)}(z)=\sum_{i=1}^m a_N(R_i(z)),
\end{align*}
a score rule assigns a numerical weight to each pooled rank before the $X$-labelled ranks are summed.
Wilcoxon scores are
\begin{align*}
a_N(r)=\frac{r}{N+1}.
\end{align*}
With these scores,
\begin{align*}
L_N^{(a)}(z)
=\sum_{i=1}^m \frac{R_i(z)}{N+1}
=\frac{1}{N+1}\sum_{i=1}^m R_i(z).
\end{align*}
Thus Wilcoxon scores are just a rescaled rank sum: replacing $R_i(z)$ by $R_i(z)+1$ increases the score by
\begin{align*}
\frac{R_i(z)+1}{N+1}-\frac{R_i(z)}{N+1}
=\frac{1}{N+1},
\end{align*}
so adjacent ranks receive equal increments.
Median scores are
\begin{align*}
a_N(r)=\mathbb{1}_{\{r>(N+1)/2\}}.
\end{align*}
Then
\begin{align*}
L_N^{(a)}(z)
=\sum_{i=1}^m \mathbb{1}_{\{R_i(z)>(N+1)/2\}},
\end{align*}
so the statistic counts how many $X$ observations fall above the middle pooled rank. Ranks at or below $(N+1)/2$ contribute $0$, while ranks above $(N+1)/2$ contribute $1$.
Normal scores are
\begin{align*}
a_N(r)=\Phi^{-1}\left(\frac{r}{N+1}\right),
\end{align*}
where $\Phi$ is the standard normal distribution function. With these scores,
\begin{align*}
L_N^{(a)}(z)
=\sum_{i=1}^m \Phi^{-1}\left(\frac{R_i(z)}{N+1}\right).
\end{align*}
Since $r/(N+1)$ increases with $r$ and $\Phi^{-1}$ is increasing, larger pooled ranks receive larger normal-score weights.
All three choices use the same statistic form, $\sum_{i=1}^m a_N(R_i(z))$, and the same permutation mechanism of assigning $m$ labels to $N$ pooled ranks. What changes is the score function: Wilcoxon scores use all ranks linearly, median scores keep only above-median information, and normal scores weight ranks according to standard-normal quantiles.
[/example]
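The three score rules can be compared in code under the same permutation mechanism. The sketch below is a minimal illustration with hypothetical function names and made-up data. It also checks the enumerated permutation mean and variance against the standard finite-population formulas $m\bar a_N$ and $\frac{mn}{N(N-1)}\sum_r (a_N(r)-\bar a_N)^2$ for a sum of $m$ scores drawn without replacement from $N$.

```python
from itertools import combinations
from statistics import NormalDist

def wilcoxon_score(r, N):
    return r / (N + 1)

def median_score(r, N):
    return 1.0 if r > (N + 1) / 2 else 0.0

def normal_score(r, N):
    return NormalDist().inv_cdf(r / (N + 1))

def linear_rank_statistic(z, m, score):
    """Sum of scores of the pooled ranks of the first m coordinates (the X sample)."""
    N = len(z)
    ranks = [sum(zj <= zi for zj in z) for zi in z]
    return sum(score(ranks[i], N) for i in range(m))

def permutation_moments(N, m, score):
    """Mean and variance of the statistic over all C(N, m) label assignments."""
    a = [score(r, N) for r in range(1, N + 1)]
    vals = [sum(a[i] for i in idx) for idx in combinations(range(N), m)]
    mean = sum(vals) / len(vals)
    return mean, sum((v - mean) ** 2 for v in vals) / len(vals)

def formula_moments(N, m, score):
    """Finite-population mean and variance for sampling m of N scores without replacement."""
    a = [score(r, N) for r in range(1, N + 1)]
    abar = sum(a) / N
    return m * abar, m * (N - m) / (N * (N - 1)) * sum((ai - abar) ** 2 for ai in a)

z = [0.3, 2.1, -0.5, 1.7, 0.9]   # first m = 2 entries play the role of the X sample
stat_w = linear_rank_statistic(z, 2, wilcoxon_score)
stat_med = linear_rank_statistic(z, 2, median_score)
stat_norm = linear_rank_statistic(z, 2, normal_score)

checks = {
    f.__name__: (permutation_moments(7, 3, f), formula_moments(7, 3, f))
    for f in (wilcoxon_score, median_score, normal_score)
}
print(stat_w, stat_med, stat_norm)
```

Only the score function changes between the three statistics; the enumeration over label assignments is shared, which is the point made at the end of the example.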
The null distribution again comes from assigning $m$ labels to $N$ fixed scores without replacement. Exact enumeration becomes impractical for large samples, so the asymptotic problem is to know when this without-replacement sum behaves like a normal random variable. The possible obstruction is score concentration: if one rank carries too much of the variance, the random label assigned to that rank can dominate the statistic.
[quotetheorem:6345]
[citeproof:6345]
The maximum-score condition is needed to rule out a statistic dominated by one or a few rank positions. For instance, if $a_N(N)=N$ and all other scores are $0$, the statistic equals $N$ when the largest observation receives the $X$ label and $0$ otherwise, so it is a rescaled Bernoulli variable and a normal limit is not the right approximation. The sampling fraction condition prevents the variance from collapsing because one sample size is negligible relative to the other. The theorem does not choose the scores; it only says when a chosen score sequence has a Gaussian permutation limit. This separates calibration from modelling: the null distribution is determined by ranks, while local power depends on how the score function aligns with the way the alternative perturbs the distribution.
## Local Alternatives and Pitman Efficiency
Exact tests control type I error, but comparing tests requires looking at power. The asymptotic question is: under alternatives approaching the null at rate $N^{-1/2}$, how large is the mean shift of a standardised statistic?
[definition: Pitman Local Alternative]
Let $(P_\theta:\theta\in\Theta)$ be a one-dimensional statistical model with $0\in\Theta$. A Pitman local alternative is a sequence of laws $P_{\theta_N}$ with
\begin{align*}
\theta_N=\frac{h}{\sqrt{N}}
\end{align*}
for a fixed $h\in\mathbb R$ as $N\to\infty$.
[/definition]
Local alternatives are close enough that limiting power is nondegenerate, so two reasonable tests may both have nontrivial limiting rejection probabilities. Comparing their raw limiting powers at the same sample size can still be misleading, because one test may need fewer observations to achieve the same local power curve.
We need an efficiency definition that compares tests after their local asymptotic powers have been matched. The required summary is the limiting sample-size ratio, because that ratio records how many observations one procedure needs to reproduce the other's local power curve.
[definition: Pitman Asymptotic Relative Efficiency]
Let $(P_{\theta,N}:\theta\in\Theta)$ be a sequence of statistical experiments with $0\in\Theta$, and fix local alternatives $\theta_N=h/\sqrt N$ with $h\ne 0$. Let $(\phi_N)_{N\ge 1}$ and $(\psi_N)_{N\ge 1}$ be two sequences of level-$\alpha$ tests for this local asymptotic testing problem. Let $N_\psi(N)$ be a sequence of positive integers satisfying
\begin{align*}
\mathbb E_{P_{h/\sqrt N,N}}[\phi_N]-\mathbb E_{P_{h/\sqrt{N_\psi(N)},N_\psi(N)}}[\psi_{N_\psi(N)}]\to 0
\end{align*}
as $N\to\infty$, with any needed interpolation between integer sample sizes made by linear interpolation of the limiting power function. The Pitman asymptotic relative efficiency is the scalar, when it exists,
\begin{align*}
e(\phi,\psi)=\lim_{N\to\infty}\frac{N_\psi(N)}{N},
\end{align*}
provided the limit is independent of the particular integer sequence $N_\psi(N)$ satisfying the displayed matching condition.
[/definition]
An efficiency larger than $1$ means the first test needs fewer observations asymptotically for the same local power. This motivates the following classical comparison between Wilcoxon and the paired $t$-test in a smooth symmetric location model.
[quotetheorem:6346]
[citeproof:6346]
The assumptions are stronger than those needed for finite-sample signed-rank validity because efficiency compares local power, not just null calibration. Finite variance is needed for the paired $t$ benchmark, while the $L^2$ and differentiability assumptions control the first-order shift of the rank functional. The theorem does not say that Wilcoxon is always within $3/\pi$ of the $t$-test; the number $3/\pi$ is specific to Gaussian errors. The value $3/\pi\approx 0.955$ is the standard headline: under normality the Wilcoxon procedure pays only a small asymptotic price relative to the test designed for Gaussian means. This motivates the following example, where the same formula favours the rank test.
[example: Logistic Errors]
Let $f(x)=e^{-x}/(1+e^{-x})^2$ be the standard logistic density. It is symmetric about $0$ because
\begin{align*}
f(-x)=\frac{e^x}{(1+e^x)^2}.
\end{align*}
Multiplying numerator and denominator by $e^{-2x}$ gives
\begin{align*}
\frac{e^x}{(1+e^x)^2}=\frac{e^{-x}}{(e^{-x}+1)^2}=f(x).
\end{align*}
The standard logistic variance is $\sigma^2=\pi^2/3$. To compute the $L^2$ term in *Pitman Efficiency of Wilcoxon Against the T-Test*, square the density:
\begin{align*}
f(x)^2=\frac{e^{-2x}}{(1+e^{-x})^4}.
\end{align*}
Set $u=e^{-x}$. Then $dx=-du/u$, and as $x$ runs from $-\infty$ to $\infty$, $u$ runs from $\infty$ to $0$. Therefore
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\int_{\infty}^{0}\frac{u^2}{(1+u)^4}\left(-\frac{du}{u}\right).
\end{align*}
Canceling one factor of $u$ and reversing the limits gives
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\int_0^\infty \frac{u}{(1+u)^4}\,du.
\end{align*}
Now set $t=1+u$, so $du=dt$, $u=t-1$, and the limits become $t=1$ and $t=\infty$. Hence
\begin{align*}
\int_0^\infty \frac{u}{(1+u)^4}\,du=\int_1^\infty \frac{t-1}{t^4}\,dt.
\end{align*}
Since $(t-1)/t^4=t^{-3}-t^{-4}$,
\begin{align*}
\int_1^\infty \frac{t-1}{t^4}\,dt=\int_1^\infty \left(t^{-3}-t^{-4}\right)\,dt.
\end{align*}
An antiderivative is $-1/(2t^2)+1/(3t^3)$, so
\begin{align*}
\int_1^\infty \left(t^{-3}-t^{-4}\right)\,dt=0-\left(-\frac12+\frac13\right)=\frac16.
\end{align*}
Thus
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\frac16.
\end{align*}
Substituting $\sigma^2=\pi^2/3$ and $\int f^2=1/6$ into the Pitman efficiency formula gives
\begin{align*}
e_W(f)=12\cdot \frac{\pi^2}{3}\cdot \left(\frac16\right)^2.
\end{align*}
Since $(1/6)^2=1/36$,
\begin{align*}
e_W(f)=12\cdot \frac{\pi^2}{3}\cdot \frac{1}{36}=\frac{\pi^2}{9}.
\end{align*}
Because $\pi>3$, we have $\pi^2/9>1$. Therefore, for logistic errors, the signed-rank test has higher Pitman efficiency than the paired $t$-test.
[/example]
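The closed-form values in this example can be checked by numerical integration. The sketch below uses a hand-written composite Simpson rule on a truncated interval, which is an assumption of this illustration rather than part of the derivation; the function names are hypothetical. The efficiency is computed as $12\sigma^2\left(\int f^2\right)^2$, the same expression evaluated above.

```python
import math

def logistic_density(x):
    """Standard logistic density, written via |x| (valid by symmetry) to avoid overflow."""
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2

def simpson(g, lo, hi, steps):
    """Composite Simpson rule with an even number of steps."""
    h = (hi - lo) / steps
    total = g(lo) + g(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3

# The tails beyond |x| = 40 are of order exp(-40) and are ignored here.
l2_norm_sq = simpson(lambda x: logistic_density(x) ** 2, -40.0, 40.0, 8000)
variance = simpson(lambda x: x * x * logistic_density(x), -40.0, 40.0, 8000)
efficiency = 12.0 * variance * l2_norm_sq ** 2

print(l2_norm_sq, variance, efficiency, math.pi ** 2 / 9)
```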
Pitman efficiency is a local asymptotic summary, not a universal power ranking. This motivates the following remark about the robustness mechanism behind rank procedures.
[remark: Efficiency and Robustness]
Two-sample rank tests are invariant under strictly increasing transformations of the measurement scale, and the signed-rank statistic is invariant under odd strictly increasing transformations of the paired differences, while $t$-tests have neither property. This invariance explains part of their robustness: extreme numerical values affect ranks only through their order. The tradeoff is that rank tests may discard information when the parametric model is correctly specified and tails are light.
[/remark]
For two independent samples, the rank-sum statistic is driven by pairwise orderings rather than signed paired differences. Under a small location shift, the key step is to identify how quickly the ordering probability $\mathbb P(X<Y)$ moves away from $1/2$.
The local-asymptotic comparison therefore requires a differentiable expansion of that ordering probability at the null. With such an expansion, the rank-sum statistic can be centred under contiguous shifts and compared on the same $\sqrt N$ scale as the signed-rank procedures.
[quotetheorem:6347]
[citeproof:6347]
The square-integrability of $f$ is exactly what makes the derivative of the pairwise ordering probability finite in this formula. Contiguity is needed because the theorem compares alternatives close enough to the null that the null fluctuation theory still applies after a mean shift. The result does not describe fixed alternatives, where $\mathbb P(X<Y)$ may be far from $1/2$ and the centring must change. This theorem ties together the exact and asymptotic parts of the chapter. This motivates the final example, which recovers the same $3/\pi$ Gaussian efficiency comparison in the two-sample setting.
[example: Gaussian Two-Sample Shift]
Let $F$ be $\mathcal N(0,\sigma^2)$, with density
\begin{align*}
f(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right),
\end{align*}
and let $Y_j=X_j'+h/\sqrt N$ in distribution, where $X_j'\sim F$ is independent of the $X$ sample.
First compute the $L^2$ term in *Local Mean Shift of Wilcoxon Rank-Sum*. Squaring the Gaussian density gives
\begin{align*}
f(x)^2=\frac{1}{2\pi\sigma^2}\exp\left(-\frac{x^2}{\sigma^2}\right).
\end{align*}
Integrating over $x$ gives
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\frac{1}{2\pi\sigma^2}\int_{-\infty}^{\infty}\exp\left(-\frac{x^2}{\sigma^2}\right)\,dx.
\end{align*}
Substituting $x=\sigma u$, so that $dx=\sigma\,du$, gives
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\frac{1}{2\pi\sigma^2}\int_{-\infty}^{\infty}e^{-u^2}\sigma\,du.
\end{align*}
Therefore
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\frac{1}{2\pi\sigma}\int_{-\infty}^{\infty}e^{-u^2}\,du.
\end{align*}
Using the Gaussian integral $\int_{-\infty}^{\infty}e^{-u^2}\,du=\sqrt{\pi}$,
\begin{align*}
\int_{-\infty}^{\infty} f(x)^2\,dx=\frac{\sqrt{\pi}}{2\pi\sigma}=\frac{1}{2\sqrt{\pi}\sigma}.
\end{align*}
By *Local Mean Shift of Wilcoxon Rank-Sum*, the Wilcoxon rank-sum noncentrality under the local shift is
\begin{align*}
h\sqrt{12\lambda(1-\lambda)}\int_{-\infty}^{\infty}f(x)^2\,dx.
\end{align*}
Substituting the computed integral gives
\begin{align*}
h\sqrt{12\lambda(1-\lambda)}\int_{-\infty}^{\infty}f(x)^2\,dx=\frac{h\sqrt{12\lambda(1-\lambda)}}{2\sqrt{\pi}\sigma}.
\end{align*}
For the two-sample $t$-statistic based on $\bar Y-\bar X$, the mean shift is $h/\sqrt N$. Under the Gaussian model and independence of the two samples,
\begin{align*}
\operatorname{Var}(\bar Y-\bar X)=\operatorname{Var}(\bar Y)+\operatorname{Var}(\bar X).
\end{align*}
Since each $Y_j$ and each $X_i$ has variance $\sigma^2$,
\begin{align*}
\operatorname{Var}(\bar Y-\bar X)=\frac{\sigma^2}{n}+\frac{\sigma^2}{m}.
\end{align*}
Thus
\begin{align*}
\operatorname{Var}(\bar Y-\bar X)=\sigma^2\left(\frac{1}{m}+\frac{1}{n}\right)=\sigma^2\frac{m+n}{mn}=\sigma^2\frac{N}{mn}.
\end{align*}
The corresponding noncentrality is therefore
\begin{align*}
\frac{h/\sqrt N}{\sigma\sqrt{N/(mn)}}=\frac{h\sqrt{mn}}{\sigma N}.
\end{align*}
Because $m/N\to\lambda$ and $n/N\to 1-\lambda$,
\begin{align*}
\frac{h\sqrt{mn}}{\sigma N}=\frac{h}{\sigma}\sqrt{\frac{m}{N}\frac{n}{N}}\to \frac{h\sqrt{\lambda(1-\lambda)}}{\sigma}.
\end{align*}
The squared ratio of the Wilcoxon noncentrality to the Gaussian two-sample $t$ noncentrality is
\begin{align*}
\left(\frac{h\sqrt{12\lambda(1-\lambda)}}{2\sqrt{\pi}\sigma}\cdot\frac{\sigma}{h\sqrt{\lambda(1-\lambda)}}\right)^2.
\end{align*}
Canceling $h$, $\sigma$, and $\sqrt{\lambda(1-\lambda)}$ leaves
\begin{align*}
\left(\frac{\sqrt{12}}{2\sqrt{\pi}}\right)^2=\frac{12}{4\pi}=\frac{3}{\pi}.
\end{align*}
Thus, in the Gaussian two-sample shift model, the Wilcoxon rank-sum test has Pitman efficiency $3/\pi$ relative to the two-sample $t$-test, matching the classical Wilcoxon-versus-$t$ comparison.
[/example]
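The value $3/\pi$ can be checked numerically. The following is a minimal sketch, not part of the derivation: it approximates $\int f^2$ by a midpoint rule and forms the squared ratio of the two limiting noncentralities. The function names, the integration grid, and the chosen values of $\sigma$ and $\lambda$ are illustrative assumptions.

```python
import math

def gaussian_density(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integral_f_squared(sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the integral of f^2 over [-12 sigma, 12 sigma]."""
    lower = -half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = sum(gaussian_density(lower + (k + 0.5) * dx, sigma) ** 2 for k in range(steps))
    return total * dx

def pitman_efficiency(sigma, lam, h=1.0):
    """Squared ratio of the Wilcoxon and t noncentralities in the Gaussian shift model."""
    wilcoxon = h * math.sqrt(12.0 * lam * (1.0 - lam)) * integral_f_squared(sigma)
    t_test = h * math.sqrt(lam * (1.0 - lam)) / sigma
    return (wilcoxon / t_test) ** 2

# sigma and lam are arbitrary illustrative values; the ratio should not depend on them.
efficiency = pitman_efficiency(sigma=2.0, lam=0.3)
print(efficiency, 3.0 / math.pi)
```

Changing `sigma`, `lam`, or `h` should leave the printed ratio unchanged up to integration error, which mirrors the cancellation in the last step of the example.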
Rank and permutation methods therefore combine two strengths that are often separated. Finite-sample validity comes from symmetry of the null law and does not require estimating the unknown distribution. Large-sample power can still be analysed through local alternatives, score functions, and normal limits for linear rank statistics.
Chapters 2 through 9 developed methods for distributions, densities, and functionals, all based on the marginal or unconditional behaviour of the data. This chapter applies the same nonparametric philosophy to conditional inference: estimating the regression function (the conditional mean) by smoothing. Local polynomial estimators extend kernel methods to the regression setting while preserving the interpretability and uniformity properties from earlier chapters.
# 10. Nonparametric Regression and Local Polynomials
Chapters 2 through 9 developed nonparametric methods for distribution functions, densities, distribution-free tests, and real-valued functionals. This chapter uses the same smoothing philosophy in regression, where the object of interest is the conditional mean $m(x)=\mathbb E[Y\mid X=x]$ rather than the marginal law of a single observation. The main question is how to borrow information from observations with design points near $x$ without imposing a finite-dimensional parametric form on $m$. Kernel smoothing gives the local averaging principle, while local polynomials refine it so that bias, boundary behaviour, derivative estimation, and pointwise inference can be treated in one framework.
## Conditional Mean Estimation by Local Averaging
Suppose we observe independent pairs $(X_1,Y_1),\dots,(X_n,Y_n)$ with $X_i\in \mathbb R$ and $Y_i\in\mathbb R$. If many observations had exactly $X_i=x$, the conditional mean could be estimated by averaging their responses. In continuous designs this almost never happens, so the statistical problem is to decide which nearby observations should count and with what weights.
[definition: Nonparametric Regression Model]
Let $(X,Y)$ be a random vector with $X\in\mathbb R$ and $Y\in\mathbb R$. The nonparametric regression model is
\begin{align*}
Y=m(X)+\varepsilon,
\end{align*}
where $m:\mathbb R\to\mathbb R$ is the regression function, $\mathbb E[\varepsilon\mid X]=0$, and the conditional variance function is the map $\sigma^2:S_X\to[0,\infty)$ on the design support $S_X\subseteq\mathbb R$ defined by $\sigma^2(x)=\mathbb E[\varepsilon^2\mid X=x]$.
[/definition]
The model isolates the feature of interest: $m(x)$ is a conditional mean, not a conditional density or a full conditional distribution. To estimate it from nearby observations, we next need a formal device for assigning distance-dependent weights and for controlling the size of the local neighbourhood.
[definition: Kernel And Bandwidth For Regression]
A regression kernel is a measurable function $K:\mathbb R\to[0,\infty)$ satisfying
\begin{align*}
\int_{\mathbb R} K(u)\,d\mathcal L^1(u)=1.
\end{align*}
For $h>0$, the rescaled kernel is the map $K_h:\mathbb R\to\mathbb R$ defined by
\begin{align*}
K_h:t\mapsto \frac{1}{h}K\left(\frac{t}{h}\right).
\end{align*}
The number $h$ is called the bandwidth.
[/definition]
The nonnegativity assumption is part of the regression setup in this chapter: the local polynomial criterion below is a weighted least-squares criterion, so the displayed weights must not turn the sum of squares into an indefinite quadratic form. Equivalent kernels introduced later may take negative values because they describe the resulting linear smoother, not the original least-squares weights.
The bandwidth controls the local neighbourhood: small $h$ uses fewer effectively weighted observations, while large $h$ averages across a wider part of the regression curve. With these weights available, the next task is to turn local averaging into an estimator that also adjusts for nonuniform sampling of the design variable.
[definition: Nadaraya Watson Estimator]
Given observations $(X_i,Y_i)_{i=1}^n$, a kernel $K$, and a bandwidth $h>0$, the Nadaraya-Watson estimator is the random map $\hat m_{NW}:D_h\to\mathbb R$, where
\begin{align*}
D_h=\left\{x\in\mathbb R:\sum_{i=1}^n K_h(x-X_i)\ne0\right\},
\end{align*}
defined by
\begin{align*}
\hat m_{NW}:x\mapsto \frac{\sum_{i=1}^n K_h(x-X_i)Y_i}{\sum_{i=1}^n K_h(x-X_i)}.
\end{align*}
[/definition]
This is a ratio estimator. After division by $n$, the numerator is a smoothed estimate of the product $m(x)f_X(x)$ and the denominator is a kernel estimate of the design density $f_X(x)$; the factor $1/n$ cancels in the ratio. Before analysing the ratio, it is useful to keep in mind the applied meaning of the weights.
[example: Dose Response Curve]
Suppose $X$ is a drug dose and $Y$ is a measured biological response, with more observations collected near clinically common doses. For a target dose $x$ with nonzero total local weight, the Nadaraya-Watson estimate is
\begin{align*}
\hat m_{NW}(x)=\frac{\sum_{i=1}^n K_h(x-X_i)Y_i}{\sum_{i=1}^n K_h(x-X_i)}.
\end{align*}
Define the normalized weight of observation $i$ at dose $x$ by
\begin{align*}
w_i(x)=\frac{K_h(x-X_i)}{\sum_{\ell=1}^n K_h(x-X_\ell)}.
\end{align*}
Then the estimator can be written as
\begin{align*}
\hat m_{NW}(x)=\sum_{i=1}^n w_i(x)Y_i.
\end{align*}
The weights sum to one because
\begin{align*}
\sum_{i=1}^n w_i(x)=\frac{\sum_{i=1}^n K_h(x-X_i)}{\sum_{\ell=1}^n K_h(x-X_\ell)}=1.
\end{align*}
Since $K_h(x-X_i)=h^{-1}K((x-X_i)/h)$, observations with $X_i$ close to $x$ receive the kernel weight assigned to a small scaled distance $(x-X_i)/h$, while observations farther from $x$ receive smaller or zero weight according to the chosen kernel. The estimate is therefore a locally weighted average of observed responses, not a fitted line or a prescribed sigmoid curve; decreasing $h$ concentrates the weights near $x$, while increasing $h$ spreads the weights across a wider dose range.
[/example]
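The weighted-average form of the estimator translates directly into code. The sketch below assumes an Epanechnikov kernel and a small made-up set of doses and responses; the function names are illustrative and not taken from any library.

```python
def epanechnikov(u):
    """Epanechnikov kernel: nonnegative, symmetric, supported on [-1, 1], integrates to one."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def nw_weights(x, xs, h, kernel=epanechnikov):
    """Normalised weights w_i(x) = K_h(x - X_i) / sum_l K_h(x - X_l)."""
    raw = [kernel((x - xi) / h) / h for xi in xs]
    total = sum(raw)
    if total == 0.0:
        # x is outside D_h: the estimator is undefined there.
        raise ValueError("no observation receives positive weight at x")
    return [r / total for r in raw]

def nadaraya_watson(x, xs, ys, h, kernel=epanechnikov):
    """Locally weighted average of the responses."""
    return sum(w * y for w, y in zip(nw_weights(x, xs, h, kernel), ys))

# Hypothetical dose-response data, for illustration only.
doses = [0.5, 1.0, 1.5, 2.0, 2.5, 4.0]
responses = [1.0, 2.0, 2.0, 3.0, 5.0, 9.0]
weights = nw_weights(2.0, doses, h=1.0)
estimate = nadaraya_watson(2.0, doses, responses, h=1.0)
print(weights, estimate)
```

With $h=1$ only the doses $1.5$, $2.0$, and $2.5$ receive positive weight at $x=2$, so the estimate is an average of three responses. Raising `h` brings more doses into the window, and a bandwidth so small that no dose is within $h$ of $x$ triggers the error that corresponds to $x\notin D_h$.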
The example shows why local averaging is attractive, but it does not say how much smoothing error and random error have been introduced. The same tension appears in numerical differentiation and signal processing: a local average suppresses noise, but it also blurs features of the underlying function. The next result gives the pointwise expansion needed to choose bandwidths and to compare local constant regression with later polynomial methods.
[quotetheorem:6348]
[citeproof:6348]
The expansion explains why the usual mean squared error balance is $h^4+(nh)^{-1}$ at an interior point, leading to the familiar bandwidth order $n^{-1/5}$ for twice differentiable regression functions. The assumptions each remove a specific obstruction: symmetry kills the first kernel moment, differentiability permits the Taylor expansion, $f_X(x)>0$ prevents division by a vanishing local sample density, and $nh\to\infty$ ensures that the local window contains enough effective observations. The extra condition $nh^3\to\infty$ is only needed for the displayed expectation expansion of the ratio with an $o(h^2)$ remainder; without it, denominator randomness can contribute at the same order as the smoothing bias. If $f_X(x)=0$, for example, the denominator no longer behaves like $f_X(x)$ and the variance formula is not meaningful. If the kernel is not symmetric and has nonzero first moment, even a constant design density can produce a first-order bias term proportional to $h\,m'(x)$. The expansion therefore warns that both the kernel moments and the design density affect local constant smoothing, which is the obstruction that local linear fitting is designed to remove.
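The bandwidth order $n^{-1/5}$ follows from minimising the leading-order proxy $A h^4 + B/(nh)$: setting the derivative $4Ah^3 - B/(nh^2)$ to zero gives $h^5 = B/(4An)$. The sketch below illustrates only this arithmetic; the constants `bias_const` ($A$) and `var_const` ($B$) are hypothetical placeholders for the squared-bias and variance constants in the expansion.

```python
def mse_proxy(h, n, bias_const, var_const):
    """Leading-order proxy: squared bias bias_const * h^4 plus variance var_const / (n h)."""
    return bias_const * h ** 4 + var_const / (n * h)

def optimal_bandwidth(n, bias_const, var_const):
    """Minimiser of the proxy, obtained from 4 * bias_const * h^3 = var_const / (n h^2)."""
    return (var_const / (4.0 * bias_const * n)) ** 0.2

# Placeholder constants; only the dependence on n matters here.
h_small = optimal_bandwidth(n=100, bias_const=2.0, var_const=0.5)
h_large = optimal_bandwidth(n=3200, bias_const=2.0, var_const=0.5)
print(h_large / h_small)  # ratio of bandwidths when n grows by a factor of 32
```

Because $32^{-1/5}=1/2$, multiplying the sample size by $32$ halves the proxy-optimal bandwidth, whatever the constants are.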
[example: Bias From A Nonuniform Design]
Let $m(t)=t$ on a neighbourhood of an interior point $x$, with $f_X(x)>0$ and $f_X'(x)\ne0$. Using the displayed Nadaraya-Watson bias expansion, we have
\begin{align*}
\mathbb E[\hat m_{NW}(x)]-m(x)=\frac{h^2\mu_2(K)}{2}\left(m''(x)+2m'(x)\frac{f_X'(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
For $m(t)=t$, the values at $x$ are $m(x)=x$, $m'(x)=1$, and $m''(x)=0$. Substituting these three values gives
\begin{align*}
\mathbb E[\hat m_{NW}(x)]-x=\frac{h^2\mu_2(K)}{2}\left(0+2\cdot 1\cdot\frac{f_X'(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
Since $\frac{1}{2}\cdot 2\cdot 1=1$, this becomes
\begin{align*}
\mathbb E[\hat m_{NW}(x)]-x=h^2\mu_2(K)\frac{f_X'(x)}{f_X(x)}+o(h^2).
\end{align*}
Thus even when the regression function is exactly linear, the local constant estimator has a second-order bias term whenever the design density changes at $x$. The sign is determined by $f_X'(x)$: the estimator averages more observations from the denser side of the smoothing window.
[/example]
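The design-density term can be seen in a deterministic calculation. The sketch below computes the population analogue of the Nadaraya-Watson estimator (the ratio of the smoothed $m f_X$ to the smoothed $f_X$, not the expectation of the ratio itself) for $m(t)=t$ under a hypothetical design density $f_X(t)=(1+t)/1.5$ on $[0,1]$, and compares its bias with the leading term $h^2\mu_2(K)f_X'(x)/f_X(x)$. The kernel, density, and grid are assumptions made for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def midpoint_integral(g, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    dx = (b - a) / steps
    return sum(g(a + (k + 0.5) * dx) for k in range(steps)) * dx

def population_nw_target(x, h, m, density):
    """Ratio of the smoothed m * f_X to the smoothed f_X at x."""
    numerator = midpoint_integral(
        lambda t: epanechnikov((x - t) / h) * m(t) * density(t), x - h, x + h)
    denominator = midpoint_integral(
        lambda t: epanechnikov((x - t) / h) * density(t), x - h, x + h)
    return numerator / denominator

def density(t):
    return (1.0 + t) / 1.5  # hypothetical design density on [0, 1]

def m(t):
    return t  # exactly linear regression function

x, h, mu2 = 0.5, 0.1, 0.2  # mu2 is the second moment of the Epanechnikov kernel
population_bias = population_nw_target(x, h, m, density) - m(x)
leading_term = h ** 2 * mu2 * (1.0 / 1.5) / density(x)  # h^2 mu2 f'(x) / f(x)
print(population_bias, leading_term)
```

For this linear density the two numbers agree up to integration error, since no higher-order terms are present. Replacing `density` by a constant function makes the population bias vanish, consistent with the role of $f_X'(x)$ in the example.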
## Local Linear And Local Polynomial Fits
Local averaging estimates a constant value near $x$. The next question is whether fitting a local line, or a higher-order polynomial, can remove low-order bias without committing to a global polynomial regression model.
[definition: Local Polynomial Estimator]
Fix an integer $p\ge0$, a kernel $K$, and a bandwidth $h>0$. For each $x\in\mathbb R$ for which the weighted least-squares problem has a unique minimiser, choose coefficients $(\hat\beta_0(x),\dots,\hat\beta_p(x))\in\mathbb R^{p+1}$ minimising
\begin{align*}
\sum_{i=1}^n K_h(x-X_i)\left(Y_i-\sum_{j=0}^p\beta_j(X_i-x)^j\right)^2
\end{align*}
over $(\beta_0,\dots,\beta_p)\in\mathbb R^{p+1}$. The local polynomial estimator of order $p$ is the random map $\hat m_p:D_{p,h}\to\mathbb R$, where $D_{p,h}\subseteq\mathbb R$ is the set of such points $x$, defined by
\begin{align*}
\hat m_p:x\mapsto \hat\beta_0(x).
\end{align*}
[/definition]
For $p=0$ this is the Nadaraya-Watson estimator. For $p=1$ it fits a weighted local line, and for larger $p$ it fits a weighted Taylor polynomial. Since the same fitted polynomial contains local slope and curvature information, we next record how derivative estimates are read from its coefficients.
[definition: Derivative Estimator From A Local Polynomial]
Let $0\le \nu\le p$. The derivative estimator obtained from the local polynomial fit of order $p$ is the random map $\hat m_p^{(\nu)}:D_{p,h}\to\mathbb R$ defined by
\begin{align*}
\hat m_p^{(\nu)}:x\mapsto \nu!\,\hat\beta_\nu(x).
\end{align*}
[/definition]
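The two definitions above amount to a weighted least-squares solve followed by a factorial rescaling. The sketch below is one minimal way to carry this out with the normal equations and Gaussian elimination; the Epanechnikov kernel, the helper names, and the noiseless test data are illustrative assumptions, and a production implementation would normally use a numerically more stable solver.

```python
import math

def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting; raises if the system is singular."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations: x is outside D_{p,h}")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def local_polynomial(x, xs, ys, h, p, kernel=epanechnikov):
    """Coefficients (beta_0, ..., beta_p) minimising the kernel-weighted sum of squares."""
    w = [kernel((x - xi) / h) / h for xi in xs]
    d = [xi - x for xi in xs]
    gram = [[sum(wi * di ** (j + k) for wi, di in zip(w, d)) for k in range(p + 1)]
            for j in range(p + 1)]
    rhs = [sum(wi * di ** j * yi for wi, di, yi in zip(w, d, ys)) for j in range(p + 1)]
    return solve_linear_system(gram, rhs)

def derivative_estimate(beta, nu):
    """Derivative estimator nu! * beta_nu."""
    return math.factorial(nu) * beta[nu]

# Noiseless quadratic m(t) = 1 + 2t - 3t^2 on an equispaced design (illustration only).
xs = [k / 10 for k in range(11)]
ys = [1.0 + 2.0 * t - 3.0 * t * t for t in xs]
beta = local_polynomial(0.5, xs, ys, h=0.35, p=2)
print(beta, derivative_estimate(beta, 1), derivative_estimate(beta, 2))
```

On these noiseless quadratic data the local quadratic coefficients should match $m(0.5)$, $m'(0.5)$, and $m''(0.5)/2$ up to rounding. Calling `local_polynomial` with a bandwidth that leaves fewer than $p+1$ distinct positively weighted design points raises the singularity error, which corresponds to $x\notin D_{p,h}$.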
This derivative estimator is a main reason to introduce local polynomials beyond order one. A local quadratic, for instance, can estimate curvature, while a local cubic can estimate a derivative with improved bias relative to a lower-order fit. The following example makes the coefficient interpretation concrete before we rewrite the estimator as a kernel average.
[example: Estimating A Derivative]
Suppose $m$ is three times differentiable near the target point $x$, and write $s=t-x$. The Taylor expansion around $x$ is
\begin{align*}
m(x+s)
&=m(x)+m'(x)s+\frac{m''(x)}{2}s^2+O(s^3).
\end{align*}
A local linear fit uses the model $\beta_0+\beta_1s$. If the curve were exactly
\begin{align*}
m(x+s)=m(x)+m'(x)s,
\end{align*}
then the matching coefficients would be
\begin{align*}
\beta_0=m(x),\qquad \beta_1=m'(x),
\end{align*}
so the derivative estimate from the local linear fit is $\hat m_1^{(1)}(x)=1!\hat\beta_1(x)=\hat\beta_1(x)$.
The term not represented by the local line is
\begin{align*}
\frac{m''(x)}{2}s^2+O(s^3).
\end{align*}
Since observations receiving nonnegligible kernel weight have $s=X_i-x$ of order $h$, this omitted curvature has slope-scale size
\begin{align*}
\frac{\frac{m''(x)}{2}s^2+O(s^3)}{s}
&=\frac{m''(x)}{2}s+O(s^2)
=O(h).
\end{align*}
If a local quadratic fit is used, the fitted model is $\beta_0+\beta_1s+\beta_2s^2$, and the Taylor polynomial through order $2$ is matched by
\begin{align*}
\beta_0=m(x),\qquad \beta_1=m'(x),\qquad \beta_2=\frac{m''(x)}{2}.
\end{align*}
The first omitted term is then $O(s^3)$, whose slope-scale contribution is
\begin{align*}
\frac{O(s^3)}{s}=O(s^2)=O(h^2).
\end{align*}
Thus the coefficient $\hat\beta_1(x)$ estimates the marginal effect $m'(x)$, and fitting the quadratic term removes the curvature contribution that would otherwise enter the local linear slope at order $h$.
[/example]
The derivative example uses polynomial coefficients, but coefficient formulas hide how each observation contributes to the fitted value. Bias and variance calculations are easier when the estimator is written as a weighted average of the responses.
The obstruction is that local polynomial weights are not the original kernel weights: they are altered by the moment equations that force local polynomial reproduction. The equivalent kernel names this modified weighting function so that local polynomial estimators can be analysed like kernel averages.
[definition: Equivalent Kernel]
For a fixed design distribution and an interior point $x$, an equivalent kernel for a local polynomial estimator is a measurable map $K_{p,\mathrm{eq}}:\mathbb R\to\mathbb R$ such that the leading weighted-average representation of $\hat m_p(x)$ has the form
\begin{align*}
\hat m_p(x)\approx \frac{1}{nhf_X(x)}\sum_{i=1}^n K_{p,\mathrm{eq}}\left(\frac{X_i-x}{h}\right)Y_i.
\end{align*}
[/definition]
Equivalent kernels are a computational and conceptual device: they let local polynomial estimators be analysed like kernel averages, but with weights chosen to annihilate polynomial bias terms. This motivates the algebraic fact behind the construction: local polynomial fits exactly reproduce polynomial regression functions up to the fitted degree.
[quotetheorem:479]
[citeproof:479]
Polynomial reproduction is the mechanism behind bias reduction, but the uniqueness hypothesis is essential. A concrete failure occurs for a local quadratic fit at $x=0$ when the only positively weighted design points are $X_1=-1$ and $X_2=1$, with noiseless responses $Y_1=Y_2=1$. The polynomials $q_1(t)=1$ and $q_2(t)=t^2$ both fit the two weighted observations with zero residual, but their intercepts at $0$ are $1$ and $0$. Thus the fitted value at the target point is not identified by the weighted data. More generally, if there are too few distinct positively weighted design points, or if they are arranged so that the weighted design matrix is singular, many coefficient vectors can give the same fitted values and the intercept need not be well determined. The theorem is also noiseless and algebraic: it does not say that noisy observations from a polynomial regression function are recovered exactly, only that the deterministic polynomial part is not a source of smoothing bias. In the bias calculation, this algebraic fact means that the constant, linear, and higher Taylor terms through degree $p$ are annihilated by the local polynomial moment equations. If the regression function is well approximated by its Taylor polynomial of order $p$ around $x$, the fit captures that part, and the remaining question is which higher-order Taylor term has a nonzero equivalent-kernel moment.
[quotetheorem:6349]
[citeproof:6349]
The theorem should be read together with the variance scale. Its assumptions are not cosmetic: nonsingularity of $M_p$ is what makes the local coefficients identifiable, symmetry is what gives the parity cancellation, and the smoothness of $m$ and $f_X$ is what justifies replacing the regression curve and design density by Taylor expansions. If the kernel is one-sided at an interior point, the odd moment need not vanish and an even-order fit may not gain the extra order; if the design density vanishes at $x$, the local normal equations no longer have the stated population limit. The theorem also does not claim that larger $p$ is automatically better, since high-order equivalent kernels can have larger variance constants and unstable negative weights. The boundary case is where the benefit of polynomial reproduction is most visible, because the local constant estimator loses symmetry completely.
[example: Local Constant And Local Linear Fits At The Edge]
Let the design interval be $[0,1]$, and estimate $m(0)$ with a symmetric kernel supported on $[-1,1]$. To isolate the boundary effect, take noiseless responses from the linear regression function $m(t)=\alpha+\beta t$ and assume the design density is locally constant near $0$. Since $K_h(0-t)=h^{-1}K(-t/h)$ and the design support only contains $t\ge0$, the population local constant target at $0$ is
\begin{align*}
m_{0,h,\mathrm{pop}}(0)=\frac{\int_0^h K_h(0-t)(\alpha+\beta t)\,dt}{\int_0^h K_h(0-t)\,dt}.
\end{align*}
Use the substitution $u=t/h$, so $t=hu$ and $dt=h\,du$. By symmetry, $K(-u)=K(u)$, and therefore
\begin{align*}
\int_0^h K_h(0-t)(\alpha+\beta t)\,dt=\int_0^1 h^{-1}K(-u)(\alpha+\beta hu)h\,du.
\end{align*}
Thus
\begin{align*}
\int_0^h K_h(0-t)(\alpha+\beta t)\,dt=\alpha\int_0^1K(u)\,du+\beta h\int_0^1uK(u)\,du.
\end{align*}
Similarly,
\begin{align*}
\int_0^h K_h(0-t)\,dt=\int_0^1 h^{-1}K(-u)h\,du=\int_0^1K(u)\,du.
\end{align*}
Writing $\mu_j^+=\int_0^1u^jK(u)\,du$, we obtain
\begin{align*}
m_{0,h,\mathrm{pop}}(0)=\frac{\alpha\mu_0^++\beta h\mu_1^+}{\mu_0^+}=\alpha+\beta h\frac{\mu_1^+}{\mu_0^+}.
\end{align*}
Since $m(0)=\alpha$, the local constant boundary bias is
\begin{align*}
m_{0,h,\mathrm{pop}}(0)-m(0)=\beta h\frac{\mu_1^+}{\mu_0^+}.
\end{align*}
This is first order in $h$ whenever $\beta\ne0$ and $\mu_1^+\ne0$.
For the local linear population fit, minimise
\begin{align*}
\int_0^h K_h(0-t)\left((\alpha+\beta t)-(\gamma_0+\gamma_1t)\right)^2\,dt
\end{align*}
over $(\gamma_0,\gamma_1)$. Choosing $\gamma_0=\alpha$ and $\gamma_1=\beta$ makes
\begin{align*}
(\alpha+\beta t)-(\gamma_0+\gamma_1t)=(\alpha+\beta t)-(\alpha+\beta t)=0
\end{align*}
for every $t\in[0,h]$. Since the objective is nonnegative, the value $0$ attained there is its minimum, so the fitted intercept is $\gamma_0=\alpha=m(0)$ whenever the one-sided local linear normal equations have a unique solution. The boundary problem is therefore not that the kernel itself changes, but that a one-sided local constant average cannot reproduce even a linear regression function, while the local linear fit can.
[/example]
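The boundary calculation can be reproduced numerically in the scaled variable $u=t/h$. The sketch below assumes the Epanechnikov kernel, for which $\mu_1^+/\mu_0^+=3/8$, and hypothetical values of $\alpha$, $\beta$, and $h$; it computes the population local constant target and solves the $2\times2$ population normal equations for the local linear intercept.

```python
def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def one_sided_integral(g, steps=20000):
    """Midpoint rule for the integral of K(u) g(u) over [0, 1]."""
    du = 1.0 / steps
    return sum(epanechnikov((k + 0.5) * du) * g((k + 0.5) * du) for k in range(steps)) * du

def boundary_fits(alpha, beta, h):
    """Population local constant and local linear intercepts at 0 for m(t) = alpha + beta t."""
    def m_scaled(u):
        return alpha + beta * h * u
    mu0 = one_sided_integral(lambda u: 1.0)
    mu1 = one_sided_integral(lambda u: u)
    mu2 = one_sided_integral(lambda u: u * u)
    r0 = one_sided_integral(m_scaled)
    r1 = one_sided_integral(lambda u: u * m_scaled(u))
    local_constant = r0 / mu0
    # Normal equations in (gamma_0, h * gamma_1): [[mu0, mu1], [mu1, mu2]] c = [r0, r1];
    # Cramer's rule gives the intercept.
    local_linear = (r0 * mu2 - r1 * mu1) / (mu0 * mu2 - mu1 ** 2)
    return local_constant, local_linear

# Hypothetical line m(t) = 1 + 2t and bandwidth h = 0.1.
local_constant, local_linear = boundary_fits(alpha=1.0, beta=2.0, h=0.1)
print(local_constant - 1.0, local_linear - 1.0)
```

The first printed number is the first-order bias $\beta h\mu_1^+/\mu_0^+$, and the second is zero up to rounding. Halving `h` should halve the local constant bias while leaving the local linear intercept at $\alpha$.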
## Boundary Adaptation And Bias Reduction
Kernel estimators behave differently near the edge of the design support because the smoothing window is cut off. The question is whether we must introduce special boundary kernels, or whether the local polynomial criterion adapts automatically.
[illustration:boundary-local-linear-correction]
[definition: Boundary Point For Kernel Smoothing]
Let the design support be an interval $[a,b]$. A target point $x\in[a,b]$ is a boundary point at bandwidth scale $h$ if either $x-a=O(h)$ or $b-x=O(h)$ as $h\to0$.
[/definition]
At such points, a symmetric kernel no longer sees a symmetric neighbourhood in the data. Local constant smoothing estimates an average over a one-sided window, which raises the next question: how much of the resulting first-order bias is removed by fitting a local line?
[quotetheorem:6350]
[citeproof:6350]
This is one of the main practical advantages of local linear regression. The support and positivity assumptions ensure that there are observations on a one-sided neighbourhood of $a$ and that the boundary moment matrix is nonsingular; if the design density degenerates at $a$, the local line may be poorly identified. The differentiability assumptions separate the linear term from the quadratic remainder, and if $m'(a)=0$ the local constant target may also have smaller than first-order bias, so the contrast in the theorem is no longer sharp. The result separates deterministic smoothing bias from sampling variability: it identifies the population targets first, then records the empirical stochastic scale under bandwidth assumptions. It uses the same least-squares problem everywhere, but the fitted slope corrects the one-sided averaging that causes boundary bias.
[remark: Boundary Kernels Versus Local Polynomials]
Boundary kernels modify the kernel shape near $a$ or $b$ to restore moment conditions. Local linear regression achieves the same first-order correction by changing the regression criterion rather than manually changing the kernel. For this reason local linear estimators are often preferred in applied nonparametric regression.
[/remark]
Bias reduction can also be pursued by raising the polynomial order. The price is that high-order equivalent kernels may have larger oscillations and negative weights, so the bandwidth and polynomial degree must be chosen together.
[example: Curvature At A Boundary]
Suppose $m(t)=a_0+a_1t+a_2t^2$ on the one-sided neighbourhood $0\le t\le h$, and take the design density to be locally constant so that only the one-sided kernel moments enter. Write
\begin{align*}
\mu_j^+=\int_0^1 u^jK(u)\,du.
\end{align*}
The population local linear fit at $0$ minimises
\begin{align*}
\int_0^h K_h(0-t)\left(a_0+a_1t+a_2t^2-\gamma_0-\gamma_1t\right)^2\,dt
\end{align*}
over $(\gamma_0,\gamma_1)$. Put $t=hu$, so $dt=h\,du$ and, by symmetry of the kernel, $K_h(0-t)=h^{-1}K(-u)=h^{-1}K(u)$. The criterion becomes
\begin{align*}
\int_0^1 K(u)\left(a_0+a_1hu+a_2h^2u^2-\gamma_0-\gamma_1hu\right)^2\,du.
\end{align*}
Define
\begin{align*}
\delta_0=\gamma_0-a_0,\qquad \delta_1=h(\gamma_1-a_1).
\end{align*}
Then the residual is
\begin{align*}
a_0+a_1hu+a_2h^2u^2-\gamma_0-\gamma_1hu=a_2h^2u^2-\delta_0-\delta_1u.
\end{align*}
The normal equation from varying the intercept is
\begin{align*}
\int_0^1 K(u)\left(a_2h^2u^2-\delta_0-\delta_1u\right)\,du=0.
\end{align*}
The normal equation from varying the slope is
\begin{align*}
\int_0^1 uK(u)\left(a_2h^2u^2-\delta_0-\delta_1u\right)\,du=0.
\end{align*}
Using the definitions of the moments, these become
\begin{align*}
\mu_0^+\delta_0+\mu_1^+\delta_1=a_2h^2\mu_2^+
\end{align*}
and
\begin{align*}
\mu_1^+\delta_0+\mu_2^+\delta_1=a_2h^2\mu_3^+.
\end{align*}
If $D=\mu_0^+\mu_2^+-(\mu_1^+)^2\ne0$, solving this $2\times2$ system gives
\begin{align*}
\delta_0=a_2h^2\frac{(\mu_2^+)^2-\mu_1^+\mu_3^+}{\mu_0^+\mu_2^+-(\mu_1^+)^2}.
\end{align*}
Since the fitted value at $0$ is $\gamma_0=a_0+\delta_0$ and $m(0)=a_0$, the local linear boundary bias is
\begin{align*}
\gamma_0-m(0)=a_2h^2\frac{(\mu_2^+)^2-\mu_1^+\mu_3^+}{\mu_0^+\mu_2^+-(\mu_1^+)^2}.
\end{align*}
Thus the constant and linear parts are removed exactly, while the quadratic curvature contributes at order $h^2$ through the one-sided moments.
For a local quadratic fit, the polynomial $a_0+a_1t+a_2t^2$ is among the fitted functions. In the noiseless population problem, choosing
\begin{align*}
\gamma_0=a_0,\qquad \gamma_1=a_1,\qquad \gamma_2=a_2
\end{align*}
makes the residual
\begin{align*}
(a_0+a_1t+a_2t^2)-(\gamma_0+\gamma_1t+\gamma_2t^2)=0
\end{align*}
for every $t\in[0,h]$, so the fitted intercept is $\gamma_0=a_0=m(0)$ whenever the one-sided quadratic normal equations are nonsingular; this is the polynomial reproduction property from *Polynomial Reproduction*. If $m$ is three times differentiable and
\begin{align*}
m(t)=a_0+a_1t+a_2t^2+O(t^3)
\end{align*}
near $0$, the quadratic part is reproduced exactly. On the kernel window $t=hu$, the remainder is $O(h^3u^3)$, so the remaining deterministic boundary bias starts at cubic order rather than quadratic order.
[/example]
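The closed form for $\delta_0$ can be verified in exact arithmetic. The sketch below assumes the Epanechnikov kernel, whose one-sided moments are $\mu_j^+=\tfrac34\bigl(\tfrac{1}{j+1}-\tfrac{1}{j+3}\bigr)$, and hypothetical values of $a_2$ and $h$. The expression for $\delta_1$ is not stated in the example; it is obtained here from the same $2\times2$ system by Cramer's rule.

```python
from fractions import Fraction

def mu_plus(j):
    """Exact one-sided moment of K(u) = (3/4)(1 - u^2) over [0, 1]."""
    return Fraction(3, 4) * (Fraction(1, j + 1) - Fraction(1, j + 3))

def local_linear_boundary_solution(a2, h):
    """Solve the two normal equations for (delta_0, delta_1) by Cramer's rule."""
    mu0, mu1, mu2, mu3 = (mu_plus(j) for j in range(4))
    det = mu0 * mu2 - mu1 ** 2
    delta0 = a2 * h ** 2 * (mu2 ** 2 - mu1 * mu3) / det
    delta1 = a2 * h ** 2 * (mu0 * mu3 - mu1 * mu2) / det
    return delta0, delta1

# Hypothetical curvature a2 = 3 and bandwidth h = 1/10.
a2, h = Fraction(3), Fraction(1, 10)
delta0, delta1 = local_linear_boundary_solution(a2, h)
mu0, mu1, mu2, mu3 = (mu_plus(j) for j in range(4))
print(delta0, delta0 / (a2 * h ** 2))
```

For this kernel the moment ratio equals $-11/95$, so the local linear boundary bias is $-\tfrac{11}{95}a_2h^2$. The negative sign is a property of the Epanechnikov one-sided moments, not of the general formula.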
## Heteroskedastic Errors And Asymptotic Normality
The previous expansions describe mean and variance separately, but inference requires a distributional approximation. The final question is how kernel-weighted regression estimates fluctuate when the conditional variance depends on the design point.
[definition: Heteroskedastic Regression Errors]
In the nonparametric regression model $Y=m(X)+\varepsilon$, the conditional variance function is the map $\sigma^2:S_X\to[0,\infty)$ on the design support $S_X\subseteq\mathbb R$ defined by
\begin{align*}
\sigma^2:x\mapsto \mathbb E[\varepsilon^2\mid X=x].
\end{align*}
The errors are heteroskedastic if this map is not constant.
[/definition]
Heteroskedasticity changes the leading variance through the local value $\sigma^2(x)$. It does not change the deterministic smoothing bias, which is governed by $m$, $f_X$, the kernel, and the polynomial order. We now combine these two pieces into the pointwise distributional limit used for inference.
[quotetheorem:6351]
[citeproof:6351]
This result is the basis for pointwise confidence intervals after estimating the variance constant and the bias, or after undersmoothing so that the bias is negligible on the $1/\sqrt{nh}$ scale. The centering by $b_p(x)h^r$ matters: if it is omitted and $\sqrt{nh}\,h^r$ does not tend to $0$, the limiting normal distribution is shifted and nominal confidence intervals are systematically miscentered. Undersmoothing chooses $h$ small enough that this shift disappears, while explicit bias correction estimates $b_p(x)h^r$ and keeps a larger bandwidth. The hypotheses describe the two ways a normal approximation can fail: if $nh$ does not diverge, the local average is based on too few effective observations, while if the moment matrix is singular, the fitted polynomial coefficients are not stably identified. Compact support and boundedness control the largest kernel weights; with heavy-tailed errors lacking a local $(2+\delta)$ moment, a few responses can dominate the weighted sum and the Lindeberg condition may fail. The theorem is pointwise rather than uniform, so it does not by itself justify simultaneous confidence bands over many $x$ values. The same theorem also explains why derivative estimation is harder: derivatives use coefficients multiplied by powers of $h^{-1}$, increasing the stochastic scale.
[quotetheorem:6352]
[citeproof:6352]
Derivative estimation therefore requires a larger effective sample size than function estimation. The condition $0\le\nu\le p$ is essential because a polynomial fit cannot estimate a derivative of order higher than its degree from its coefficients. The same nonsingularity and local sample-size assumptions used for function estimation are also needed here; if the local design matrix is nearly singular, the derivative coefficient is especially unstable because the rescaling multiplies noise by $h^{-\nu}$. The theorem gives only the stochastic scale, not the bias, so a derivative estimator may still be inaccurate if $h$ is chosen without accounting for the higher-order Taylor remainder. Bandwidth choices that work well for estimating $m(x)$ may be too small or too large for estimating $m'(x)$, depending on the desired bias-variance balance.
[example: Heteroskedastic Dose Response]
In a dose-response study, high doses may produce more variable outcomes than low doses. A local linear estimate at dose $x$ still targets
\begin{align*}
m(x)=\mathbb E[Y\mid X=x],
\end{align*}
but its pointwise standard error has the form
\begin{align*}
\operatorname{se}(x)
\approx
\left(\frac{\sigma^2(x)}{nh\,f_X(x)}V_1(K)\right)^{1/2}.
\end{align*}
Thus the local variance enters multiplicatively:
\begin{align*}
\left(\frac{\sigma^2(x)}{nh\,f_X(x)}V_1(K)\right)^{1/2}
=
\sigma(x)\left(\frac{V_1(K)}{nh\,f_X(x)}\right)^{1/2}.
\end{align*}
To estimate $\sigma^2(x)$, form residuals
\begin{align*}
\hat\varepsilon_i=Y_i-\hat m_1(X_i),
\end{align*}
and smooth their squares:
\begin{align*}
\hat\sigma^2(x)
=
\frac{\sum_{i=1}^n K_b(x-X_i)\hat\varepsilon_i^2}{\sum_{i=1}^n K_b(x-X_i)}.
\end{align*}
If
\begin{align*}
\omega_i(x)=\frac{K_b(x-X_i)}{\sum_{\ell=1}^n K_b(x-X_\ell)},
\end{align*}
then
\begin{align*}
\hat\sigma^2(x)
=
\sum_{i=1}^n \omega_i(x)\hat\varepsilon_i^2
\end{align*}
and
\begin{align*}
\sum_{i=1}^n\omega_i(x)
=
\frac{\sum_{i=1}^nK_b(x-X_i)}{\sum_{\ell=1}^nK_b(x-X_\ell)}
=1.
\end{align*}
A plug-in pointwise interval therefore has half-width proportional to
\begin{align*}
\left(\frac{\hat\sigma^2(x)}{nh\,\hat f_X(x)}V_1(K)\right)^{1/2}.
\end{align*}
Consequently, if two dose values have the same $n$, $h$, $\hat f_X$, and kernel constant but satisfy $\hat\sigma^2(x_{\mathrm{high}})>\hat\sigma^2(x_{\mathrm{low}})$, then
\begin{align*}
\frac{\operatorname{halfwidth}(x_{\mathrm{high}})}
{\operatorname{halfwidth}(x_{\mathrm{low}})}
=
\left(\frac{\hat\sigma^2(x_{\mathrm{high}})}
{\hat\sigma^2(x_{\mathrm{low}})}\right)^{1/2}
>1.
\end{align*}
The fitted mean curve still estimates the conditional mean, but uncertainty bands widen exactly where the locally smoothed squared residuals are larger.
[/example]
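The plug-in steps of the example can be sketched as follows. The residuals are taken as given (hypothetical numbers standing in for $Y_i-\hat m_1(X_i)$), the kernel is assumed to be Epanechnikov, and the constant $V_1(K)$ is passed in by the caller as `v1` because its value comes from the quoted theorem; the number $0.6$ used below is only a placeholder, as are `f_hat` and the normal quantile $1.96$.

```python
import math

def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def smoothed_squared_residuals(x, xs, residuals, b):
    """Kernel-weighted average of squared residuals with variance bandwidth b."""
    weights = [epanechnikov((x - xi) / b) / b for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no residual receives positive weight at x")
    return sum(w * e * e for w, e in zip(weights, residuals)) / total

def plug_in_halfwidth(sigma2_hat, n, h, f_hat, v1, z=1.96):
    """z * sqrt(sigma2_hat * v1 / (n h f_hat)); v1 stands for V_1(K), supplied by the caller."""
    return z * math.sqrt(sigma2_hat * v1 / (n * h * f_hat))

# Hypothetical doses and residuals: larger residuals at the higher doses.
doses = [1.0, 2.0, 3.0, 4.0]
residuals = [0.1, -0.1, 0.4, -0.4]
sigma2_low = smoothed_squared_residuals(1.5, doses, residuals, b=1.0)
sigma2_high = smoothed_squared_residuals(3.5, doses, residuals, b=1.0)
common = dict(n=4, h=1.0, f_hat=0.25, v1=0.6)  # placeholder values shared by both doses
ratio = plug_in_halfwidth(sigma2_high, **common) / plug_in_halfwidth(sigma2_low, **common)
print(sigma2_low, sigma2_high, ratio)
```

Because everything except $\hat\sigma^2$ is held fixed, the ratio of half-widths is the square root of the ratio of smoothed squared residuals, here $\sqrt{0.16/0.01}=4$. This ratio does not depend on the placeholder values of `v1`, `f_hat`, or `z`.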
The chapter's main lesson is that local polynomial regression turns kernel smoothing into a local approximation problem. Nadaraya-Watson regression is the local constant case and gives the basic pointwise bias-variance calculation. Local linear and higher-order fits preserve the locality of kernel methods while improving boundary behaviour, reducing bias through polynomial reproduction, and supporting derivative estimation and asymptotic inference.
Chapters 5 through 10 developed nonparametric point estimators with explicit rates and asymptotic distributions. Confidence sets require a more subtle construction: they must account for both the asymptotic normality of the estimator and the bias from smoothing, and they must be valid even when the smoothing bandwidth itself is data-dependent. This chapter constructs honest confidence sets that achieve coverage simultaneously with the statistical estimation rate.
# 11. Confidence Sets and Nonparametric Uncertainty
Confidence statements in nonparametric statistics have to confront two sources of uncertainty at once: sampling fluctuation and approximation error from smoothing or plug-in estimation. Chapters 2 through 4 studied consistency and weak convergence for empirical distribution functions, Chapters 5 through 7 studied bandwidth choice for kernel density estimators, and Chapter 10 introduced local polynomial regression mainly as an estimation tool. This chapter turns those results into uncertainty quantification, separating pointwise statements from uniform statements and explaining why bootstrap methods need bias control before they become reliable confidence procedures.
## Pointwise Intervals and Uniform Bands
The first question is what event a reported confidence statement is meant to cover. A pointwise interval protects the value of a function at a fixed location, while a uniform band protects the whole function over a set of locations. These are different inferential tasks because maximising over a continuum changes the limiting distribution and usually requires empirical process input.
[definition: Pointwise Confidence Interval]
Let $\mathcal P$ be a class of probability measures containing the true distribution $P_0$, and let $\theta:\mathcal P\to\mathbb R$ be a real-valued functional. Put $\theta_0=\theta(P_0)$. A random interval $C_n=[L_n,U_n]$ is an asymptotic pointwise confidence interval of level $1-\tau$ for $\theta_0$ if
\begin{align*}
\mathbb P_{P_0}(\theta_0\in C_n) \to 1-\tau.
\end{align*}
[/definition]
This definition is enough for a fixed target, such as $F(x_0)$ or $m(x_0)$, but it does not describe simultaneous coverage of a whole curve. Since nonparametric objects are often functions, the next definition records the stronger event in which every point in an index set is covered at the same time.
[definition: Uniform Confidence Band]
Let $\mathcal P$ be a class of probability measures containing $P_0$, let $T$ be an index set, and let $\Theta:\mathcal P\to\ell^\infty(T)$ be a functional whose value $\Theta(P)$ is a bounded real-valued function on $T$. Put $\Theta_0=\Theta(P_0)$. A random pair of functions $L_n,U_n:T\to\mathbb R$ is an asymptotic uniform confidence band of level $1-\tau$ for $\Theta_0$ over $T$ if
\begin{align*}
\mathbb P_{P_0}\bigl(L_n(t)\le \Theta_0(t)\le U_n(t)\text{ for all }t\in T\bigr) \to 1-\tau.
\end{align*}
[/definition]
Uniform coverage is a stronger event than pointwise coverage. The distinction becomes concrete when pointwise normal intervals are drawn across a grid, since a collection of individually calibrated intervals need not have the desired simultaneous coverage.
[example: Pointwise Normal Intervals Do Not Give Uniform Bands]
Suppose $\hat m_n(x)$ estimates $m(x)$ and, for each fixed $x\in[0,1]$,
\begin{align*}
Z_n(x)=\frac{\hat m_n(x)-m(x)}{\hat s_n(x)} \xrightarrow{d} \mathcal N(0,1),
\end{align*}
with the bias already negligible on the $\hat s_n(x)$ scale. Let $z_{1-\tau/2}$ be the standard normal $(1-\tau/2)$-quantile. For $Z\sim\mathcal N(0,1)$,
\begin{align*}
\mathbb P(|Z|\le z_{1-\tau/2})=\Phi(z_{1-\tau/2})-\Phi(-z_{1-\tau/2}).
\end{align*}
Since $\Phi(z_{1-\tau/2})=1-\tau/2$ and, by symmetry, $\Phi(-z_{1-\tau/2})=\tau/2$, we get
\begin{align*}
\mathbb P(|Z|\le z_{1-\tau/2})=(1-\tau/2)-(\tau/2)=1-\tau.
\end{align*}
Thus the interval $\hat m_n(x)\pm z_{1-\tau/2}\hat s_n(x)$ has limiting coverage $1-\tau$ at each fixed $x$.
Now take grid points $x_1,\dots,x_k$. Simultaneous coverage of the same pointwise intervals is the event
\begin{align*}
\bigcap_{j=1}^k\left\{m(x_j)\in \left[\hat m_n(x_j)-z_{1-\tau/2}\hat s_n(x_j),\hat m_n(x_j)+z_{1-\tau/2}\hat s_n(x_j)\right]\right\}.
\end{align*}
For each $j$, the event inside the intersection is equivalent to
\begin{align*}
-z_{1-\tau/2}\hat s_n(x_j)\le m(x_j)-\hat m_n(x_j)\le z_{1-\tau/2}\hat s_n(x_j).
\end{align*}
Dividing by the positive standard error $\hat s_n(x_j)$ and multiplying by $-1$ gives
\begin{align*}
|Z_n(x_j)|\le z_{1-\tau/2}.
\end{align*}
Therefore simultaneous coverage over the grid is the event
\begin{align*}
\max_{1\le j\le k}|Z_n(x_j)|\le z_{1-\tau/2}.
\end{align*}
For example, if $(Z_n(x_1),\dots,Z_n(x_k))$ converges to $k$ independent standard normal variables $(Z_1,\dots,Z_k)$, then the limiting simultaneous coverage is
\begin{align*}
\mathbb P\left(\max_{1\le j\le k}|Z_j|\le z_{1-\tau/2}\right)=\mathbb P(|Z_1|\le z_{1-\tau/2},\dots,|Z_k|\le z_{1-\tau/2}).
\end{align*}
Independence gives
\begin{align*}
\mathbb P(|Z_1|\le z_{1-\tau/2},\dots,|Z_k|\le z_{1-\tau/2})=\prod_{j=1}^k\mathbb P(|Z_j|\le z_{1-\tau/2}).
\end{align*}
Since each factor equals $1-\tau$, the limiting simultaneous coverage is
\begin{align*}
\prod_{j=1}^k\mathbb P(|Z_j|\le z_{1-\tau/2})=(1-\tau)^k.
\end{align*}
For $k\ge2$ and $0<\tau<1$, $(1-\tau)^k<1-\tau$, so pointwise calibration does not give the desired simultaneous coverage even on a finite grid. A uniform band therefore replaces $z_{1-\tau/2}$ by a critical value calibrated to a maximum such as $\sup_{x\in[0,1]}|Z_n(x)|$, after the same bias control has been imposed.
[/example]
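The product formula in the example can be checked numerically. The following is a minimal sketch, assuming the independent standard normal limit used above; the function names `simultaneous_coverage` and `simulated_coverage` are illustrative and not part of the course material.

```python
import numpy as np
from statistics import NormalDist


def simultaneous_coverage(tau, k):
    """Limiting simultaneous coverage (1 - tau)^k under independence."""
    return (1.0 - tau) ** k


def simulated_coverage(tau, k, n_rep=200_000, seed=0):
    """Monte Carlo estimate of P(max_j |Z_j| <= z_{1-tau/2}) for independent N(0,1) draws."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(1.0 - tau / 2.0)  # pointwise normal critical value
    draws = rng.standard_normal((n_rep, k))
    return float(np.mean(np.max(np.abs(draws), axis=1) <= z))


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, simultaneous_coverage(0.05, k), simulated_coverage(0.05, k))
```

With $\tau=0.05$ and $k=20$ independent grid points, the formula gives simultaneous coverage of roughly $0.36$, far below the pointwise level $0.95$.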
This example shows why pointwise critical values do not solve simultaneous coverage: the probability of covering every point depends on the joint fluctuation of the whole process. The calibration target is therefore a supremum statistic, not a single normal coordinate.
We need a distribution-free benchmark for supremum calibration before returning to smoothed estimators and bootstrap approximations. The empirical distribution function supplies that benchmark, because its supremum law has an asymptotic form that does not depend on the underlying continuous distribution.
[quotetheorem:2006]
[citeproof:2006]
This theorem yields a genuine band, calibrated to the supremum of the process, rather than many pointwise intervals stitched together. Continuity of $F$ matters for the usual distribution-free Kolmogorov calibration: the probability integral transform then gives uniform random variables, so the supremum statistic has the Brownian-bridge limit indexed by $u\in[0,1]$. For general $F$, the empirical-process limit still exists, but atoms introduce a non-uniform time change and ties in the empirical distribution function, so the same tabulated Kolmogorov critical value is no longer the right universal calibration. The theorem also does not give a finite-sample exact band for arbitrary statistics: it is an asymptotic statement for the EDF supremum norm. Its Brownian-bridge critical value controls stochastic fluctuation of $F_n$ around $F$, but it does not account for smoothing bias, estimated nuisance parameters, or simultaneous inference for a regression or density estimator without an additional approximation argument.
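As a hedged illustration of the Kolmogorov calibration, the sketch below computes the quantile of the Brownian-bridge supremum from the classical alternating series $1-2\sum_{k\ge1}(-1)^{k-1}e^{-2k^2x^2}$ and forms a band of the shape $F_n(x)\pm c_{1-\tau}/\sqrt n$. It assumes a continuous $F$ (no ties) and that this is the band form of the quoted theorem; the function names and the clipping to $[0,1]$ are choices made here for illustration.

```python
import math


def kolmogorov_cdf(x, terms=100):
    """CDF of sup|Brownian bridge| via the classical alternating series."""
    if x <= 0.0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x) for k in range(1, terms + 1))
    return 1.0 - 2.0 * s


def kolmogorov_quantile(p, lo=0.3, hi=4.0, tol=1e-10):
    """Invert the CDF by bisection; intended for p well inside (0, 1)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ks_band(sample, tau=0.05):
    """Return (x, lower, upper) at each order statistic, assuming no ties."""
    xs = sorted(sample)
    n = len(xs)
    half_width = kolmogorov_quantile(1.0 - tau) / math.sqrt(n)
    band = []
    for i, x in enumerate(xs, start=1):
        f_n = i / n  # EDF value at the i-th order statistic
        # Clipping to [0, 1] cannot reduce coverage because F takes values in [0, 1].
        band.append((x, max(0.0, f_n - half_width), min(1.0, f_n + half_width)))
    return band


if __name__ == "__main__":
    print(kolmogorov_quantile(0.95))
```

The asymptotic 95% critical value produced this way is close to the familiar tabulated value of about $1.358$.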
[example: Confidence Band for a Regression Curve]
Let $(X_i,Y_i)$ be i.i.d. with $X_i\in[0,1]$, and let $\hat m_h(x)$ be a local linear estimator of $m(x)=\mathbb E[Y_i\mid X_i=x]$ on an interior interval $T=[a,b]\subset(0,1)$. If $\hat s_h(x)>0$ estimates the pointwise standard deviation of $\hat m_h(x)$ and $\operatorname{bias}_h(x)$ denotes the leading smoothing bias, then the standardized error at $x$ is
\begin{align*}
Z_h(x)=\frac{\hat m_h(x)-m(x)-\operatorname{bias}_h(x)}{\hat s_h(x)}.
\end{align*}
A pointwise interval at a fixed $x$ uses a normal critical value and covers when
\begin{align*}
m(x)&\in \left[\hat m_h(x)-\operatorname{bias}_h(x)-c\hat s_h(x),\hat m_h(x)-\operatorname{bias}_h(x)+c\hat s_h(x)\right].
\end{align*}
This event is equivalent to
\begin{align*}
-c\hat s_h(x)
&\le m(x)-\hat m_h(x)+\operatorname{bias}_h(x)
\le c\hat s_h(x),
\end{align*}
and, since $\hat s_h(x)>0$, to
\begin{align*}
-c
&\le \frac{m(x)-\hat m_h(x)+\operatorname{bias}_h(x)}{\hat s_h(x)}
\le c.
\end{align*}
Multiplying the middle term by $-1$ gives
\begin{align*}
\left|\frac{\hat m_h(x)-m(x)-\operatorname{bias}_h(x)}{\hat s_h(x)}\right|\le c.
\end{align*}
For simultaneous coverage over all $x\in T$, the same algebra must hold at every point, so the coverage event is
\begin{align*}
\left\{\left|\frac{\hat m_h(x)-m(x)-\operatorname{bias}_h(x)}{\hat s_h(x)}\right|\le c\text{ for all }x\in T\right\}.
\end{align*}
This is exactly the event
\begin{align*}
\sup_{x\in T}\left|\frac{\hat m_h(x)-m(x)-\operatorname{bias}_h(x)}{\hat s_h(x)}\right|\le c.
\end{align*}
Thus a uniform band must choose $c$ from the distribution of this supremum, not from a single standard normal coordinate. In practice the supremum is evaluated on a dense grid and calibrated by a Gaussian or bootstrap approximation, while $\operatorname{bias}_h(x)$ is made negligible by undersmoothing or removed by explicit bias correction.
[/example]
## Bootstrap Calibration for Empirical Processes and Smooth Functionals
The next problem is how to obtain critical values when the limiting distribution depends on unknown features of $P_0$ or is hard to tabulate. The bootstrap replaces the unknown sampling law by the conditional law generated from the empirical distribution. For empirical distribution functions it reproduces the Brownian bridge limit, and for smooth real-valued functionals it gives a route to percentile and studentized intervals.
[definition: Nonparametric Bootstrap Sample]
Given observations $X_1,\dots,X_n$, the nonparametric bootstrap draws $X_1^*,\dots,X_n^*$ independently from the empirical distribution $P_n$. The bootstrap empirical measure and empirical distribution function are
\begin{align*}
P_n^* &= \frac{1}{n}\sum_{i=1}^n \delta_{X_i^*}, & F_n^*(x)&=P_n^*((-\infty,x]).
\end{align*}
Bootstrap probability and expectation conditional on the data are denoted by $\mathbb P^*$ and $\mathbb E^*$.
[/definition]
Conditioning on the data is essential: the bootstrap is not a new experiment from $P_0$, but a data-dependent approximation to the distribution of the original statistic. The first consistency question is whether this conditional approximation reproduces the same empirical-process limit that produced the Kolmogorov-Smirnov band.
[quotetheorem:6353]
[citeproof:6353]
The theorem justifies using bootstrap quantiles for EDF bands and distribution-free tests, especially when finite-sample tabulation is unavailable or when related statistics do not have a compact closed form. The conditioning is not cosmetic: the bootstrap law is random and must approximate the sampling law after the observed data have been fixed. The half-line Donsker property supplies tightness and a Gaussian limit for the entire EDF process, so the supremum statistic can be handled as a functional of a process rather than as a collection of unrelated pointwise errors.
The hypotheses also point to concrete failure modes. If a statistic is centred at $F_n^*$ rather than $F_n$, the bootstrap fluctuation is identically zero, so it cannot approximate the non-degenerate law of $\sqrt n(F_n-F)$. If the empirical process is replaced by a rougher class, for example indicators of all measurable subsets of $[0,1]$, the supremum discrepancy $\sup_A|P_n(A)-P_0(A)|$ equals $1$ for every sample when $P_0$ is nonatomic, because the finite set of observations has empirical mass $1$ and $P_0$-mass $0$; the discrepancy therefore does not shrink at the $n^{-1/2}$ Brownian-bridge scale, and the Donsker structure of half-lines is what prevents this collapse. Centering at a mismatched smooth estimate creates another failure: taking a kernel-smoothed distribution estimate $\tilde F_h$ with non-negligible smoothing error makes $\sqrt n(F_n^*-\tilde F_h)$ contain a deterministic shift of order $\sqrt n\,\|F_n-\tilde F_h\|_\infty$, so the conditional law targets the wrong centre. The result is still limited to statistics that are continuous enough functions of the empirical process, and it does not by itself validate bootstrap calibration after non-negligible smoothing bias has been introduced. The next example turns the conditional convergence statement into a simulation procedure for the Kolmogorov-Smirnov critical value.
[example: Bootstrap Kolmogorov Smirnov Statistic]
Given data $X_1,\dots,X_n$, draw bootstrap samples $X_1^{*(b)},\dots,X_n^{*(b)}$ independently from $P_n$ for $b=1,\dots,B$. For the $b$th sample, form
\begin{align*}
F_n^{*(b)}(x)=\frac{1}{n}\sum_{i=1}^n \mathbf 1\{X_i^{*(b)}\le x\},
\end{align*}
and compute
\begin{align*}
T_n^{*(b)}
&=\sup_x \sqrt n\left|F_n^{*(b)}(x)-F_n(x)\right|.
\end{align*}
Let $\hat c_{1-\tau}^*$ be the empirical $(1-\tau)$-quantile of $T_n^{*(1)},\dots,T_n^{*(B)}$, meaning that approximately a fraction $1-\tau$ of the simulated values satisfy
\begin{align*}
T_n^{*(b)}\le \hat c_{1-\tau}^*.
\end{align*}
The bootstrap band is
\begin{align*}
F_n(x)-\frac{\hat c_{1-\tau}^*}{\sqrt n}\le F(x)\le F_n(x)+\frac{\hat c_{1-\tau}^*}{\sqrt n},
\qquad x\in\mathbb R.
\end{align*}
Its simultaneous coverage event is
\begin{align*}
\left\{F_n(x)-\frac{\hat c_{1-\tau}^*}{\sqrt n}\le F(x)\le F_n(x)+\frac{\hat c_{1-\tau}^*}{\sqrt n}\text{ for all }x\right\}.
\end{align*}
For a fixed $x$, the displayed inequalities are equivalent to
\begin{align*}
-\frac{\hat c_{1-\tau}^*}{\sqrt n}
\le F(x)-F_n(x)
\le \frac{\hat c_{1-\tau}^*}{\sqrt n}.
\end{align*}
Multiplying by $\sqrt n$ and taking absolute values gives
\begin{align*}
\sqrt n|F_n(x)-F(x)|\le \hat c_{1-\tau}^*.
\end{align*}
Requiring this for every $x$ is therefore equivalent to
\begin{align*}
\sup_x \sqrt n|F_n(x)-F(x)|\le \hat c_{1-\tau}^*.
\end{align*}
Thus $\hat c_{1-\tau}^*$ is calibrated for exactly the statistic that determines the band coverage. By *[Bootstrap Consistency](/theorems/1995) for the Empirical Distribution Function*, the conditional law of
\begin{align*}
\sup_x \sqrt n|F_n^*(x)-F_n(x)|
\end{align*}
converges to the same Brownian-bridge supremum limit as the law of
\begin{align*}
\sup_x \sqrt n|F_n(x)-F(x)|.
\end{align*}
The bootstrap quantile therefore replaces the Brownian-bridge quantile with a data-dependent estimate of the same large-sample critical value.
[/example]
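The simulation procedure in this example can be sketched as follows. This is a minimal illustration, assuming NumPy is available; the names `edf` and `bootstrap_ks_critical_value` are hypothetical. The supremum is evaluated at the observed data points, which is exact here because both $F_n^{*(b)}$ and $F_n$ are right-continuous step functions that jump only at observed values.

```python
import numpy as np


def edf(sample, grid):
    """Evaluate the empirical distribution function of `sample` at the points in `grid`."""
    s = np.sort(sample)
    return np.searchsorted(s, grid, side="right") / len(s)


def bootstrap_ks_critical_value(x, tau=0.05, B=2000, seed=0):
    """Empirical (1 - tau)-quantile of sqrt(n) * sup_x |F_n^*(x) - F_n(x)| over B resamples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    grid = np.sort(x)  # both step functions jump only at observed values
    f_n = edf(x, grid)
    stats = np.empty(B)
    for b in range(B):
        x_star = rng.choice(x, size=n, replace=True)  # draw from P_n
        stats[b] = np.sqrt(n) * np.max(np.abs(edf(x_star, grid) - f_n))
    return float(np.quantile(stats, 1.0 - tau))


def bootstrap_band(x, tau=0.05, B=2000, seed=0):
    """Band F_n(x) +/- c*/sqrt(n), returned at the sorted data points."""
    x = np.asarray(x, dtype=float)
    grid = np.sort(x)
    half_width = bootstrap_ks_critical_value(x, tau, B, seed) / np.sqrt(len(x))
    f_n = edf(x, grid)
    return grid, f_n - half_width, f_n + half_width
```

For moderate $n$ the returned critical value is typically close to, and often slightly below, the asymptotic Kolmogorov value, reflecting the discreteness of the bootstrap statistic.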
For scalar parameters the bootstrap can be used in more than one way. The first construction asks for quantiles of the bootstrap estimator itself, giving an interval that is simple to compute and invariant under monotone reparametrisation.
[definition: Percentile Bootstrap Interval]
Let $\mathcal P$ be a class of probability measures containing $P_0$, and let $\theta:\mathcal P\to\mathbb R$ be a real-valued functional. Let $\hat\theta_n=\theta(P_n)$ estimate $\theta_0=\theta(P_0)$, and let $\hat\theta_n^* = \theta(P_n^*)$. If $q^*_{\alpha}$ denotes the conditional $\alpha$-quantile of $\hat\theta_n^*$, the percentile bootstrap interval of nominal level $1-\tau$ is
\begin{align*}
[q^*_{\tau/2},q^*_{1-\tau/2}].
\end{align*}
[/definition]
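A percentile interval can be computed directly from simulated estimator values. The sketch below assumes the functional is supplied as a function `statistic` acting on a resampled array; the name `percentile_interval` and the Monte Carlo approximation of the conditional quantiles by `B` resamples are illustrative choices.

```python
import numpy as np


def percentile_interval(x, statistic, tau=0.05, B=2000, seed=0):
    """Percentile bootstrap interval [q*_{tau/2}, q*_{1-tau/2}] approximated with B resamples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    # theta(P_n^*) for each bootstrap sample drawn with replacement from the data
    boot = np.array([statistic(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    lo, hi = np.quantile(boot, [tau / 2.0, 1.0 - tau / 2.0])
    return float(lo), float(hi)


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=300)
    print(percentile_interval(data, np.mean))
```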
Percentile intervals are attractive because they are read directly from the simulated estimator values. When the standard error depends strongly on the unknown distribution, however, the next construction bootstraps a standardised statistic instead of the raw estimator.
[definition: Studentized Bootstrap Interval]
Let $\mathcal P$ be a class of probability measures containing $P_0$, let $\theta:\mathcal P\to\mathbb R$ be a real-valued functional, and put $\hat\theta_n=\theta(P_n)$ and $\hat\theta_n^*=\theta(P_n^*)$. Let $\hat s_n$ be a standard error estimator for $\hat\theta_n$, and let $\hat s_n^*$ be the corresponding bootstrap standard error estimator computed from the bootstrap sample. Define
\begin{align*}
T_n^* = \frac{\hat\theta_n^*-\hat\theta_n}{\hat s_n^*}.
\end{align*}
If $a^*_{\alpha}$ is the conditional $\alpha$-quantile of $T_n^*$, the studentized bootstrap interval of nominal level $1-\tau$ is
\begin{align*}
[\hat\theta_n-a^*_{1-\tau/2}\hat s_n,\hat\theta_n-a^*_{\tau/2}\hat s_n].
\end{align*}
[/definition]
Studentization asks the bootstrap to approximate a more nearly pivotal statistic. To know when either scalar bootstrap interval is justified, we need a theorem transferring empirical-process bootstrap consistency through a smooth functional.
[quotetheorem:6354]
[citeproof:6354]
This theorem covers plug-in estimators such as smooth integrals of a density or distribution function. Hadamard differentiability is the hypothesis that turns a complicated plug-in estimator into a first-order linear approximation; without it, the bootstrap may reproduce the wrong local law. A standard warning example is the maximum functional at a distribution with a flat or tied maximiser, where the map is not smooth in the needed direction and the limiting distribution can depend on second-order features.
The empirical-process assumptions are separate from smoothness of $\theta$. Donsker convergence is needed because the derivative $\dot\theta_{P_0}$ is applied to the whole process indexed by $\mathcal F$, not only to finitely many coordinates. If $\mathcal F$ is the class of all measurable subsets of $[0,1]$, the empirical process is not tight in $\ell^\infty(\mathcal F)$; a smooth-looking functional depending on the supremum over that class cannot be bootstrapped by this theorem because there is no Brownian-bridge limit in the required space. The measurability or outer-probability condition prevents another pathology: for nonseparable classes, quantities such as $\sup_{f\in\mathcal F}|\sqrt n(P_n-P_0)f|$ need not be measurable random variables, so ordinary conditional probabilities and quantiles may not even be defined without an outer-probability convention. Thus the theorem requires three distinct ingredients: a Donsker domain for the stochastic process, a measurable interpretation of the statistic, and differentiability of the functional at the target law. Discontinuous functionals, boundary parameters, and nonsmooth shape constraints often need modified resampling or different centering. The next example shows how the general statement reduces to the familiar bootstrap for a sample average while keeping the model nonparametric.
[example: Smooth Functional of a Distribution Function]
Let $X_1,\dots,X_n$ be i.i.d. from $P_0$, let $\psi:\mathbb R\to\mathbb R$ be bounded and measurable, and define
\begin{align*}
\theta(P)=\int \psi\,dP.
\end{align*}
For the empirical measure $P_n=n^{-1}\sum_{i=1}^n\delta_{X_i}$, the plug-in estimator is
\begin{align*}
\hat\theta_n=P_n\psi=\frac{1}{n}\sum_{i=1}^n\psi(X_i).
\end{align*}
The target value is
\begin{align*}
\theta(P_0)=P_0\psi=\mathbb E_{P_0}[\psi(X_1)].
\end{align*}
Therefore
\begin{align*}
\hat\theta_n-\theta(P_0)=\frac{1}{n}\sum_{i=1}^n\psi(X_i)-\mathbb E_{P_0}[\psi(X_1)].
\end{align*}
Equivalently,
\begin{align*}
\hat\theta_n-\theta(P_0)=\frac{1}{n}\sum_{i=1}^n\left(\psi(X_i)-\mathbb E_{P_0}[\psi(X_1)]\right),
\end{align*}
so this smooth plug-in estimator is exactly the sample mean of the transformed observations $\psi(X_i)$.
For a bootstrap sample $X_1^*,\dots,X_n^*$ drawn from $P_n$,
\begin{align*}
\hat\theta_n^*=P_n^*\psi=\frac{1}{n}\sum_{i=1}^n\psi(X_i^*).
\end{align*}
Conditionally on the data, the draw $X_1^*$ equals $X_j$ with probability $1/n$, so
\begin{align*}
\mathbb E^*[\psi(X_1^*)]=\sum_{j=1}^n\psi(X_j)\mathbb P^*(X_1^*=X_j).
\end{align*}
Substituting $\mathbb P^*(X_1^*=X_j)=1/n$ gives
\begin{align*}
\mathbb E^*[\psi(X_1^*)]=\frac{1}{n}\sum_{j=1}^n\psi(X_j)=P_n\psi=\hat\theta_n.
\end{align*}
Hence
\begin{align*}
\hat\theta_n^*-\hat\theta_n=\frac{1}{n}\sum_{i=1}^n\psi(X_i^*)-\hat\theta_n.
\end{align*}
Since $\hat\theta_n$ is fixed under the conditional bootstrap law,
\begin{align*}
\hat\theta_n^*-\hat\theta_n=\frac{1}{n}\sum_{i=1}^n\left(\psi(X_i^*)-\hat\theta_n\right).
\end{align*}
Let
\begin{align*}
\hat\sigma_n^2=\frac{1}{n}\sum_{i=1}^n\left(\psi(X_i)-\hat\theta_n\right)^2.
\end{align*}
For each bootstrap sample, let
\begin{align*}
(\hat\sigma_n^*)^2=\frac{1}{n}\sum_{i=1}^n\left(\psi(X_i^*)-\hat\theta_n^*\right)^2.
\end{align*}
The ordinary studentized statistic is
\begin{align*}
T_n=\frac{\hat\theta_n-\theta(P_0)}{\hat\sigma_n/\sqrt n}.
\end{align*}
The bootstrap studentized statistic is
\begin{align*}
T_n^*=\frac{\hat\theta_n^*-\hat\theta_n}{\hat\sigma_n^*/\sqrt n}.
\end{align*}
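Assume $\operatorname{Var}_{P_0}(\psi(X_1))>0$, so that the studentized statistics have a non-degenerate limit. The functional $P\mapsto P\psi$ is linear in $P$, so its first-order expansion is exact, because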
\begin{align*}
\theta(P)-\theta(P_0)=P\psi-P_0\psi=(P-P_0)\psi.
\end{align*}
Thus the smooth-functional bootstrap applies by *Bootstrap Delta Method for Smooth Functionals*: the conditional law of $T_n^*$ consistently estimates the limiting law of $T_n$, provided the sample variance consistently estimates the nonzero variance of $\psi(X_1)$. The computation has the same shape as the usual one-dimensional sample mean, but the model is still nonparametric because no finite-dimensional family for $P_0$ has been assumed.
[/example]
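The studentized construction for $\theta(P)=\int\psi\,dP$ can be sketched in a few lines. This is an illustration under stated assumptions: `psi` is a vectorised bounded function, the bootstrap variance is almost never zero (true for continuous data and moderate $n$), and the name `studentized_interval` is hypothetical. Resampling the transformed values $\psi(X_i)$ is equivalent to resampling the $X_i$ and then applying $\psi$.

```python
import numpy as np


def studentized_interval(x, psi, tau=0.05, B=2000, seed=0):
    """Studentized bootstrap interval for theta(P) = P psi."""
    rng = np.random.default_rng(seed)
    y = psi(np.asarray(x, dtype=float))
    n = len(y)
    theta_hat = y.mean()
    se_hat = y.std() / np.sqrt(n)  # np.std uses divisor n, matching sigma_hat_n
    t_star = np.empty(B)
    for b in range(B):
        y_star = rng.choice(y, size=n, replace=True)
        # Assumes y_star is not constant, so the bootstrap standard error is positive.
        t_star[b] = (y_star.mean() - theta_hat) / (y_star.std() / np.sqrt(n))
    a_lo, a_hi = np.quantile(t_star, [tau / 2.0, 1.0 - tau / 2.0])
    # The upper quantile sets the lower endpoint, as in the studentized definition.
    return float(theta_hat - a_hi * se_hat), float(theta_hat - a_lo * se_hat)


if __name__ == "__main__":
    data = np.random.default_rng(2).normal(size=200)
    print(studentized_interval(data, np.tanh))
```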
## Kernel Estimator Intervals: Bias, Undersmoothing, and Correction
The final problem is specific to smoothing. Kernel density and regression estimators have stochastic error and smoothing bias on comparable scales under mean-square optimal bandwidths. A confidence interval that estimates only the variance but ignores visible bias can have the wrong centre, so coverage can fail even if the normal approximation to the stochastic part is good.
[definition: Kernel Density Estimator]
Let $X_1,\dots,X_n$ be i.i.d. real-valued observations with density $f$, let $K:\mathbb R\to\mathbb R$ be a kernel with $\int K(u)\,du=1$, and let $h=h_n>0$. The kernel density estimator is the random function $\hat f_h:\mathbb R\to\mathbb R$ defined, for each $x\in\mathbb R$, by
\begin{align*}
\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right).
\end{align*}
[/definition]
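A direct implementation of this definition is short. The sketch below assumes a Gaussian kernel as the default choice of $K$; the function names are illustrative.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, a symmetric second-order kernel integrating to one."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def kde(x_eval, data, h, kernel=gaussian_kernel):
    """Kernel density estimator (1 / (n h)) * sum_i K((x - X_i) / h) at each evaluation point."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h  # shape (number of eval points, n)
    return kernel(u).sum(axis=1) / (len(data) * h)


if __name__ == "__main__":
    sample = np.random.default_rng(0).normal(size=2000)
    print(kde([0.0, 1.0], sample, h=0.3))
```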
The estimator targets a smoothed version of $f$ before it targets $f$ itself. To decide whether a confidence interval is centred at the right object, we need the leading bias and variance expansions at a fixed interior point.
[quotetheorem:6355]
[citeproof:6355]
The theorem shows the tension: if $h\asymp n^{-1/5}$, then the bias $h^2$ and standard deviation $(nh)^{-1/2}$ are both of order $n^{-2/5}$. Symmetry of $K$ is what removes the first-order Taylor term; with a nonsymmetric kernel the leading bias can be order $h$ instead of order $h^2$. The condition $R(K)<\infty$ makes the variance expansion finite, while the interior-point assumption keeps the kernel window from being truncated by the boundary of the support. Near a boundary, or with a nonsymmetric kernel, the displayed formulas must be replaced by boundary-corrected or one-sided expansions before they can support a confidence interval. The next example records the resulting coverage failure for a variance-only interval.
[example: Failure of a Naive Kernel Density Interval]
Take a twice differentiable density $f$ with $f(x_0)>0$ and $f''(x_0)\ne0$, and use a symmetric second-order kernel with bandwidth $h\asymp n^{-1/5}$. The variance-only interval is
\begin{align*}
I_n=\left[\hat f_h(x_0)-z_{1-\tau/2}\sqrt{\frac{\hat f_h(x_0)R(K)}{nh}},\hat f_h(x_0)+z_{1-\tau/2}\sqrt{\frac{\hat f_h(x_0)R(K)}{nh}}\right].
\end{align*}
By *Pointwise Bias and Variance of the Kernel Density Estimator*,
\begin{align*}
\mathbb E[\hat f_h(x_0)]-f(x_0)=\frac{h^2\mu_2(K)}{2}f''(x_0)+o(h^2).
\end{align*}
Also,
\begin{align*}
\operatorname{Var}(\hat f_h(x_0))=\frac{f(x_0)R(K)}{nh}+o\left(\frac{1}{nh}\right).
\end{align*}
Thus the natural standard-error scale is
\begin{align*}
s_n=\sqrt{\frac{f(x_0)R(K)}{nh}}.
\end{align*}
The leading bias divided by this scale is
\begin{align*}
\frac{h^2\mu_2(K)f''(x_0)/2}{s_n}=\frac{h^2\mu_2(K)f''(x_0)}{2}\sqrt{\frac{nh}{f(x_0)R(K)}}.
\end{align*}
Equivalently,
\begin{align*}
\frac{h^2\mu_2(K)f''(x_0)/2}{s_n}=\frac{\mu_2(K)f''(x_0)}{2\sqrt{f(x_0)R(K)}}\sqrt{nh^5}.
\end{align*}
For $h\asymp n^{-1/5}$, the factor $nh^5$ is bounded away from both zero and infinity, so it converges to a positive finite limit along suitable subsequences and the bias is not negligible relative to $s_n$.
The coverage event is
\begin{align*}
\{f(x_0)\in I_n\}=\left\{\left|\hat f_h(x_0)-f(x_0)\right|\le z_{1-\tau/2}\sqrt{\frac{\hat f_h(x_0)R(K)}{nh}}\right\}.
\end{align*}
Decompose the estimation error as
\begin{align*}
\hat f_h(x_0)-f(x_0)=\left(\hat f_h(x_0)-\mathbb E[\hat f_h(x_0)]\right)+\left(\mathbb E[\hat f_h(x_0)]-f(x_0)\right).
\end{align*}
If, along a subsequence, $nh^5\to\lambda\in(0,\infty)$, then
\begin{align*}
\frac{\mathbb E[\hat f_h(x_0)]-f(x_0)}{s_n}\to\delta.
\end{align*}
Here
\begin{align*}
\delta=\frac{\mu_2(K)f''(x_0)}{2\sqrt{f(x_0)R(K)}}\sqrt{\lambda}.
\end{align*}
Since $f''(x_0)\ne0$ and $f(x_0)R(K)>0$, this constant is nonzero.
The centred normal approximation gives
\begin{align*}
\frac{\hat f_h(x_0)-\mathbb E[\hat f_h(x_0)]}{s_n}\xrightarrow{d}\mathcal N(0,1).
\end{align*}
The plug-in standard error has the same first-order scale:
\begin{align*}
\sqrt{\frac{\hat f_h(x_0)R(K)}{nh}}\bigg/s_n\to1.
\end{align*}
Therefore the limiting coverage along the subsequence is
\begin{align*}
\mathbb P\left(|Z+\delta|\le z_{1-\tau/2}\right)
\end{align*}
for $Z\sim\mathcal N(0,1)$. Expanding this probability,
\begin{align*}
\mathbb P\left(|Z+\delta|\le z_{1-\tau/2}\right)=\mathbb P\left(-z_{1-\tau/2}\le Z+\delta\le z_{1-\tau/2}\right).
\end{align*}
Subtracting $\delta$ from each part of the inequality gives
\begin{align*}
\mathbb P\left(-z_{1-\tau/2}\le Z+\delta\le z_{1-\tau/2}\right)=\mathbb P\left(-z_{1-\tau/2}-\delta\le Z\le z_{1-\tau/2}-\delta\right).
\end{align*}
Hence
\begin{align*}
\mathbb P\left(|Z+\delta|\le z_{1-\tau/2}\right)=\Phi(z_{1-\tau/2}-\delta)-\Phi(-z_{1-\tau/2}-\delta).
\end{align*}
When $\delta\ne0$, this shifted normal probability is strictly smaller than the nominal value $1-\tau$, because among intervals of fixed length the standard normal law gives the largest mass to the one centred at zero. If $f''(x_0)>0$, the leading bias moves $\hat f_h(x_0)$ above $f(x_0)$; if $f''(x_0)<0$, it moves the centre below $f(x_0)$. Thus the naive interval can miss systematically on the side determined by the sign of the curvature.
[/example]
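The limiting coverage $\Phi(z_{1-\tau/2}-\delta)-\Phi(-z_{1-\tau/2}-\delta)$ is easy to tabulate. The following sketch uses only the Python standard library; the function name `shifted_coverage` is an illustrative choice.

```python
from statistics import NormalDist


def shifted_coverage(delta, tau=0.05):
    """Limiting coverage P(|Z + delta| <= z_{1-tau/2}) of the variance-only interval."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - tau / 2.0)
    return nd.cdf(z - delta) - nd.cdf(-z - delta)


if __name__ == "__main__":
    for delta in (0.0, 0.5, 1.0, 2.0):
        print(delta, shifted_coverage(delta))
```

For a nominal 95% interval, a standardized bias of $\delta=1$ already lowers the limiting coverage to roughly $0.83$, and the loss is the same for $\delta=-1$ by symmetry.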
The failure identifies a precise obstruction: after standardisation, a nonvanishing bias shifts the centre of the limiting normal distribution and changes the coverage probability. A valid interval therefore needs either an explicit bias estimate or a bandwidth choice that makes the bias negligible on the standard-error scale.
We need a bandwidth condition that makes the bias negligible without estimating the curvature term directly. Undersmoothing takes this route by imposing assumptions that force the required scale separation between smoothing bias and stochastic standard error.
[definition: Undersmoothing Condition]
For a pointwise kernel density confidence interval at an interior point $x$, undersmoothing means choosing $h=h_n$ so that
\begin{align*}
h\to0,\qquad nh\to\infty,\qquad \sqrt{nh}\,h^2\to0.
\end{align*}
[/definition]
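To see the condition at work, the sketch below evaluates $\sqrt{nh}\,h^2$ for polynomial bandwidths $h=c\,n^{-a}$. The helper name and the two exponents compared are illustrative: $a=1/5$ is the mean-square optimal order, and $a=1/3$ is one undersmoothed choice that still satisfies $nh\to\infty$.

```python
import math


def bias_to_se_scale(n, exponent, c=1.0):
    """Return sqrt(n h) * h^2 for the bandwidth h = c * n^(-exponent)."""
    h = c * n ** (-exponent)
    return math.sqrt(n * h) * h ** 2


if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 5, 10 ** 7):
        # The first ratio stays constant; the second decays like n^(-1/3).
        print(n, bias_to_se_scale(n, 1 / 5), bias_to_se_scale(n, 1 / 3))
```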
The bandwidth condition is not only a smoothing convention; it is what prevents the leading curvature bias from appearing as a fixed shift in the limiting pivot. Once that deterministic centring error is smaller than the standard error, the remaining stochastic term can be calibrated by a normal critical value.
Normal calibration requires a theorem that connects undersmoothing to an actual confidence interval with the right centring and variance scale. The statement must also keep the assumptions visible, since positivity of the density and interior-point bias control are what make the limiting pivot usable.
[quotetheorem:6356]
[citeproof:6356]
Undersmoothing is simple, but it deliberately sacrifices point-estimation efficiency. The condition $f(x)>0$ prevents the asymptotic variance from degenerating, and $nh\to\infty$ ensures that enough observations fall into the local window for a normal approximation. The condition $\sqrt{nh}\,h^2\to0$ is the scale separation that makes the leading smoothing bias negligible relative to the standard error. The interval is pointwise, so it does not give simultaneous coverage over a continuum of $x$ values, and it covers the unsmoothed density value only at interior points where the bias expansion and variance estimate are valid. The second approach estimates and subtracts the leading curvature term, so we next define the bias-corrected centre used by corrected intervals.
[definition: Bias Corrected Kernel Density Estimator]
Let $\hat f_h:\mathbb R\to\mathbb R$ be a second-order kernel density estimator using bandwidth $h>0$. Let $L:\mathbb R\to\mathbb R$ be a second-derivative kernel and $b>0$. The derivative estimator is the random function $\widehat{f''}_{b}:\mathbb R\to\mathbb R$ defined, for each $x\in\mathbb R$, by
\begin{align*}
\widehat{f''}_{b}(x)=\frac{1}{nb^3}\sum_{i=1}^n L\left(\frac{x-X_i}{b}\right).
\end{align*}
The leading-bias-corrected kernel density estimator is the random function $\hat f_h^{\mathrm{bc}}:\mathbb R\to\mathbb R$ defined, for each $x\in\mathbb R$, by
\begin{align*}
\hat f_h^{\mathrm{bc}}(x)=\hat f_h(x)-\frac{h^2\mu_2(K)}{2}\widehat{f''}_{b}(x).
\end{align*}
[/definition]
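A minimal sketch of the corrected estimator follows, under the assumption that $K$ is the standard normal density (so $\mu_2(K)=1$) and that $L=K''$, which for the Gaussian kernel is $L(u)=(u^2-1)\varphi(u)$. These kernel choices and the function names are illustrative; the definition allows other second-derivative kernels.

```python
import numpy as np


def phi(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def phi_second(u):
    """Second derivative of the standard normal density, used here as the kernel L."""
    return (u ** 2 - 1.0) * phi(u)


def kde(x, data, h):
    u = (np.atleast_1d(x)[:, None] - data[None, :]) / h
    return phi(u).sum(axis=1) / (len(data) * h)


def second_derivative_estimate(x, data, b):
    """(1 / (n b^3)) * sum_i L((x - X_i) / b)."""
    u = (np.atleast_1d(x)[:, None] - data[None, :]) / b
    return phi_second(u).sum(axis=1) / (len(data) * b ** 3)


def bias_corrected_kde(x, data, h, b, mu2=1.0):
    """f_hat_h(x) - (h^2 mu_2(K) / 2) * estimated f''(x); mu2 = 1 for the Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    return kde(x, data, h) - 0.5 * h ** 2 * mu2 * second_derivative_estimate(x, data, b)


if __name__ == "__main__":
    sample = np.random.default_rng(0).normal(size=20000)
    print(kde(0.0, sample, 0.4), bias_corrected_kde(0.0, sample, 0.4, 0.6))
```

At the mode of a standard normal sample the curvature is negative, so the correction pushes the estimate upward, toward $f(0)\approx0.399$.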
Bias correction changes the centre and may change the standard error. A corrected confidence interval is justified only when the remaining bias and the variance of the correction are both accounted for in the limiting normal approximation.
[quotetheorem:6357]
[citeproof:6357]
Bias correction works only because the leading curvature bias has a stable form and because the derivative estimator is accurate enough after multiplication by $h^2$. The smoothness assumption on $f$ controls the Taylor remainder, while the moment assumptions on $K$ and $L$ determine which lower-order terms vanish. If $b$ is too small, derivative estimation noise can dominate the corrected standard error; if $b$ is too large, the derivative estimate still carries bias that leaks back into the interval. The theorem is also pointwise and interior: it does not handle boundary points, densities with low smoothness, or uniform bands without additional maximal inequalities and a different critical value.
Bootstrap intervals for kernel estimators have to follow the same centring logic. Resampling can approximate the stochastic part, but it does not automatically remove smoothing bias relative to the unsmoothed target $f(x)$.
[example: Bootstrap Interval for a Kernel Estimator]
For a bootstrap sample $X_1^*,\dots,X_n^*$ drawn from $P_n$, compute the kernel estimator with the same bandwidth:
\begin{align*}
\hat f_h^*(x)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i^*}{h}\right).
\end{align*}
Conditionally on the data, each $X_i^*$ equals $X_j$ with probability $1/n$. Hence
\begin{align*}
\mathbb E^*\left[K\left(\frac{x-X_i^*}{h}\right)\right]=\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right)\frac{1}{n}.
\end{align*}
Substituting this into the bootstrap expectation gives
\begin{align*}
\mathbb E^*[\hat f_h^*(x)]=\frac{1}{nh}\sum_{i=1}^n \frac{1}{n}\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right).
\end{align*}
The inner sum does not depend on $i$, so
\begin{align*}
\mathbb E^*[\hat f_h^*(x)]=\frac{1}{h}\frac{1}{n}\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right).
\end{align*}
Therefore
\begin{align*}
\mathbb E^*[\hat f_h^*(x)]=\hat f_h(x).
\end{align*}
Thus the bootstrap fluctuation $\hat f_h^*(x)-\hat f_h(x)$ is conditionally centred at the observed smoothed estimator, not at the unsmoothed target $f(x)$.
The original estimation error has the decomposition
\begin{align*}
\hat f_h(x)-f(x)=\left(\hat f_h(x)-\mathbb E[\hat f_h(x)]\right)+\left(\mathbb E[\hat f_h(x)]-f(x)\right).
\end{align*}
By *Pointwise Bias and Variance of the Kernel Density Estimator*,
\begin{align*}
\mathbb E[\hat f_h(x)]-f(x)=\frac{h^2\mu_2(K)}{2}f''(x)+o(h^2).
\end{align*}
The corresponding standard-error scale is
\begin{align*}
s_n=\sqrt{\frac{f(x)R(K)}{nh}}.
\end{align*}
Dividing the leading bias by this scale gives
\begin{align*}
\frac{h^2\mu_2(K)f''(x)/2}{s_n}=\frac{h^2\mu_2(K)f''(x)}{2}\sqrt{\frac{nh}{f(x)R(K)}}.
\end{align*}
Equivalently,
\begin{align*}
\frac{h^2\mu_2(K)f''(x)/2}{s_n}=\frac{\mu_2(K)f''(x)}{2\sqrt{f(x)R(K)}}\sqrt{nh^5}.
\end{align*}
Under a mean-square optimal bandwidth $h\asymp n^{-1/5}$, the quantity $nh^5$ is of constant order, so the bias is generally not negligible on the standard-error scale when $f''(x)\ne0$.
Consequently, a percentile or symmetric bootstrap interval centred at $\hat f_h(x)$ estimates stochastic fluctuation around the smoothed centre but does not remove the deterministic shift from $\mathbb E[\hat f_h(x)]$ to $f(x)$. One valid route is undersmoothing: if $\sqrt{nh}\,h^2\to0$, then
\begin{align*}
\frac{h^2\mu_2(K)f''(x)/2}{s_n}\to0.
\end{align*}
A second route is to apply the same leading-bias correction in the original and bootstrap samples, then bootstrap the corrected fluctuation
\begin{align*}
\hat f_h^{\mathrm{bc},*}(x)-\hat f_h^{\mathrm{bc}}(x).
\end{align*}
In that case the standard error must include the extra variability from estimating the curvature term. A bootstrap interval for a kernel estimator targets the unsmoothed density value $f(x)$ only when its centring scheme either removes the leading smoothing bias or makes that bias asymptotically negligible.
[/example]
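The identity $\mathbb E^*[\hat f_h^*(x)]=\hat f_h(x)$ can be checked by simulation. The sketch below assumes a Gaussian kernel and approximates the conditional expectation by an average over `B` resamples; the function names are illustrative.

```python
import numpy as np


def kde_at(x, data, h):
    """Gaussian-kernel density estimate at a single point x."""
    u = (x - np.asarray(data, dtype=float)) / h
    return float(np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)) / h)


def bootstrap_kde_mean(x, data, h, B=2000, seed=0):
    """Monte Carlo approximation of E*[f_hat_h^*(x)] over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    values = [kde_at(x, rng.choice(data, size=n, replace=True), h) for _ in range(B)]
    return float(np.mean(values))


if __name__ == "__main__":
    sample = np.random.default_rng(1).normal(size=500)
    f_hat = kde_at(0.0, sample, 0.5)
    # The bootstrap mean tracks f_hat, not the true density value f(0).
    print(f_hat, bootstrap_kde_mean(0.0, sample, 0.5))
```

The bootstrap average agrees with $\hat f_h(x)$ up to simulation error, whatever the distance between $\hat f_h(x)$ and $f(x)$, which is the centring issue described in the example.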
## Putting the Procedures Together
A confidence procedure in this course should be read as a statement about a target, a topology, and a centring scheme. The target may be a scalar functional, a distribution function, a density value, or an entire curve; the topology may be pointwise or supremum norm; and the centring scheme decides whether smoothing bias has been removed.
[remark: Checklist for Nonparametric Confidence Sets]
For scalar smooth functionals, use the bootstrap delta method, percentile intervals, or studentized intervals after verifying that the functional is sufficiently smooth. For empirical distribution functions, use Brownian-bridge or bootstrap calibration for supremum statistics. For kernel density and regression estimators, decide before calibration whether the interval is undersmoothed, bias corrected, or targeting the smoothed function rather than the original function. For uniform bands, calibrate the supremum of the standardised process rather than reusing pointwise quantiles.
[/remark]
The main lesson is that nonparametric confidence sets are not obtained by attaching a normal quantile to every estimator. The stochastic approximation, the bias approximation, and the coverage event must match the inferential question.
Chapters 5 through 11 studied specific estimators and their properties: consistency rates, limiting distributions, confidence procedures. This final chapter takes a global perspective: what is the best possible rate any estimator can achieve, and how does that rate depend on smoothness class, dimension, and the functional being estimated? Minimax lower bounds show when our estimators are optimal and identify which assumptions drive the rate.
# 12. Minimax Rates and Lower Bounds
This chapter turns the bias-variance rate heuristics from the kernel and local-polynomial chapters, especially Chapters 5 through 7 and Chapter 10, into formal minimax statements. The guiding question is no longer whether a particular kernel, bandwidth, or local polynomial estimator is consistent, but how well any estimator can perform over a smoothness class. We first match upper bounds from bias-variance calculations with lower bounds from testing arguments, then use the same logic to explain why adaptation and uncertainty quantification cannot both be unrestricted.
## Bias-Variance Rates over Holder Balls
How does smoothness control the best possible estimation error? The main example is a density or regression function on a bounded domain in $\mathbb R^d$, where an estimator must balance local averaging against approximation error. A larger bandwidth decreases variance but increases bias; a smaller bandwidth does the reverse. The minimax rate is the value of this tradeoff after optimising over the smoothing scale.
We use Holder balls as the standard smoothness model because they give pointwise Taylor control and are stable under the bump constructions used in lower bounds. Without a uniform smoothness condition, a function class can contain arbitrarily narrow spikes; then no fixed sample size can estimate the function well in regions where no observation falls.
[definition: Holder Ball]
Let $s>0$, $L>0$, and let $\mathcal X\subset \mathbb R^d$. Write $k=\lceil s\rceil-1$ and $\gamma=s-k$, so $0<\gamma\le 1$. The Holder ball $\mathcal H^s(L;\mathcal X)$ consists of functions $f:\mathcal X\to\mathbb R$ such that all partial derivatives $D^\alpha f$ with $|\alpha|\le k$ exist and are bounded by $L$, and for every multi-index $\alpha$ with $|\alpha|=k$,
\begin{align*}
|D^\alpha f(x)-D^\alpha f(y)|\le L|x-y|^\gamma
\end{align*}
for all $x,y\in \mathcal X$.
[/definition]
For integer $s$, this convention controls derivatives of order $s-1$ with a Lipschitz modulus. Some texts instead define integer Holder smoothness by bounded continuous derivatives through order $s$; the rate calculations in this chapter are unchanged at the exponent level, but the convention must be fixed before stating membership in $\mathcal H^s(L;\mathcal X)$.
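In the simplest case $0<s\le 1$ and $d=1$, the definition gives $k=0$ and $\gamma=s$, and the Holder condition can be probed numerically. The sketch below is only a necessary check on a finite grid: it returns a lower bound on the smallest admissible $L$ and ignores the separate sup-norm bound. The test function $f(x)=|x-1/2|^{1/2}$, the grid, and the helper name `holder_ratio` are illustrative assumptions, not part of the course material.

```python
import math

def holder_ratio(f, gamma, grid):
    """Largest observed |f(x)-f(y)| / |x-y|**gamma over distinct grid pairs.

    This is a lower bound on the smallest Holder constant L for exponent gamma;
    a finite grid can never certify membership, only rule it out.
    """
    worst = 0.0
    for i, x in enumerate(grid):
        for y in grid[i + 1:]:
            worst = max(worst, abs(f(x) - f(y)) / abs(x - y) ** gamma)
    return worst

# Hypothetical test function: Holder with exponent 1/2 and constant 1 on [0, 1].
f = lambda x: math.sqrt(abs(x - 0.5))
grid = [i / 200 for i in range(201)]

ratio_half = holder_ratio(f, 0.5, grid)  # stays near 1 as the grid is refined
ratio_09 = holder_ratio(f, 0.9, grid)    # grows as the grid is refined
print(ratio_half, ratio_09)
```

The ratio for $\gamma=1/2$ stays bounded, while the ratio for $\gamma=0.9$ is driven by pairs near the cusp at $x=1/2$ and increases without bound on finer grids, which is the numerical signature of a function that lies in the rougher class but not the smoother one.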
For density estimation, the Holder condition is imposed together with nonnegativity and integration to one. For regression, it is imposed on the regression function $r(x)=\mathbb E[Y\mid X=x]$, usually with regularity assumptions on the design density and noise distribution. Throughout this chapter, $\mathcal L^d$ denotes Lebesgue measure on $\mathbb R^d$. To compare all estimators on the same scale, we measure the worst integrated squared error over the function class.
[definition: Minimax Risk]
Let $\mathcal X\subset\mathbb R^d$, let $(\Omega_n,\mathcal A_n)$ be the sample space for $n$ observations, and let $\mathbb P_f^{(n)}$ denote the law of the observations when the true function is $f$. Let $\mathfrak C(\mathcal X)$ denote a collection of function classes on $\mathcal X$. The minimax $L^2$ risk is the functional $R_n:\mathfrak C(\mathcal X)\to[0,\infty]$ defined by
\begin{align*}
R_n(\mathcal F)=\inf_{\hat f_n:\Omega_n\to L^2(\mathcal X)}\sup_{f\in\mathcal F}\mathbb E_f^{(n)}\left[\int_{\mathcal X}|\hat f_n(x)-f(x)|^2\,d\mathcal L^d(x)\right],
\end{align*}
where the infimum ranges over all measurable maps from $(\Omega_n,\mathcal A_n)$ into $L^2(\mathcal X)$.
[/definition]
This definition makes the estimator compete against the hardest function in the class. The integral loss is global, so the rate reflects both how many local regions must be estimated and how much information each region receives. The next step is to compute an attainable benchmark, using the same kernel estimators whose bias and variance were analysed earlier in the course.
[quotetheorem:6358]
[citeproof:6358]
The hypotheses prevent several distinct pathologies. Boundary correction is needed because an uncorrected convolution kernel near $0$ or $1$ averages outside the support; for a density that is nonzero at the boundary this creates a first-order boundary bias that is not controlled by the interior Taylor expansion. The uniform upper bound controls the variance term, since regions with very large density can make $\mathbb E_f[\hat f_h(x)^2]$ larger than the standard $(nh^d)^{-1}$ scale. The Holder condition controls the bias; without it, a density may oscillate below the bandwidth scale and the Taylor cancellation of the kernel gives no useful approximation bound.
The theorem is only an upper bound for a specified smoothing construction. It does not assert that kernels are optimal, that the displayed rate is unavoidable, or that constants are harmless in finite samples. It also does not remove modelling assumptions: in random-design regression the same rate requires the design density to be bounded above and below on $[0,1]^d$ and the errors to have bounded variance, because otherwise local sample sizes or noise levels can dominate the smoothing tradeoff. The next example isolates the rate calculation before the later lower bound proves that no estimator improves the exponent uniformly.
[example: One-Dimensional and High-Dimensional Smoothing]
Using the rate formula from *Kernel Upper Bound over Holder Balls*,
\begin{align*}
n^{-2s/(2s+d)},
\end{align*}
and the corresponding bandwidth scale
\begin{align*}
h\asymp n^{-1/(2s+d)},
\end{align*}
take smoothness $s=2$ and dimension $d=1$. First compute the denominator:
\begin{align*}
2s+d=2\cdot 2+1=4+1=5.
\end{align*}
Therefore the risk exponent is
\begin{align*}
\frac{2s}{2s+d}=\frac{2\cdot 2}{2\cdot 2+1}=\frac{4}{5},
\end{align*}
so the integrated squared-error rate is
\begin{align*}
n^{-2s/(2s+d)}=n^{-4/5}.
\end{align*}
The bandwidth exponent is
\begin{align*}
\frac{1}{2s+d}=\frac{1}{2\cdot 2+1}=\frac{1}{5},
\end{align*}
and hence
\begin{align*}
h\asymp n^{-1/5}.
\end{align*}
Now keep the same smoothness $s=2$ but increase the dimension to $d=10$. The denominator becomes
\begin{align*}
2s+d=2\cdot 2+10=4+10=14.
\end{align*}
Thus
\begin{align*}
\frac{2s}{2s+d}=\frac{2\cdot 2}{2\cdot 2+10}=\frac{4}{14}=\frac{2}{7},
\end{align*}
where the last equality divides numerator and denominator by $2$. The integrated squared-error rate is therefore
\begin{align*}
n^{-2s/(2s+d)}=n^{-4/14}=n^{-2/7}.
\end{align*}
The bandwidth scale is
\begin{align*}
h\asymp n^{-1/(2s+d)}=n^{-1/(2\cdot 2+10)}=n^{-1/14}.
\end{align*}
Increasing the dimension from $1$ to $10$ changes the risk exponent from $4/5$ to $2/7$, so the integrated error decreases more slowly with $n$. The variance term in the bias-variance bound is $(nh^d)^{-1}$: when $d=1$ it is $(nh)^{-1}$, while when $d=10$ it is $(nh^{10})^{-1}$. For any fixed $0<h<1$, we have $h^{10}<h$, so $nh^{10}<nh$ and the effective local sample size is smaller in ten dimensions.
[/example]
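The arithmetic in this example can be reproduced with exact fractions. The following is a minimal sketch; the helper names `holder_rate_exponents` and `sample_size_for_error` are hypothetical, and the sample-size calculation sets every constant in the rate to $1$, so it indicates orders of magnitude only.

```python
from fractions import Fraction

def holder_rate_exponents(s, d):
    """Return (risk exponent 2s/(2s+d), bandwidth exponent 1/(2s+d)) exactly."""
    s = Fraction(s)
    denom = 2 * s + d
    return 2 * s / denom, Fraction(1) / denom

def sample_size_for_error(target, s, d):
    """Solve n**(-2s/(2s+d)) = target for n, with all constants set to 1."""
    risk_exp, _ = holder_rate_exponents(s, d)
    return target ** (-1.0 / float(risk_exp))

risk_1d, bw_1d = holder_rate_exponents(2, 1)
risk_10d, bw_10d = holder_rate_exponents(2, 10)
print(risk_1d, bw_1d)    # exponents for s = 2, d = 1
print(risk_10d, bw_10d)  # exponents for s = 2, d = 10

# Order-of-magnitude sample sizes for integrated squared error 0.01.
n_1d = sample_size_for_error(0.01, 2, 1)
n_10d = sample_size_for_error(0.01, 2, 10)
print(n_1d, n_10d)
```

Under this constant-free convention, reaching integrated squared error $0.01$ takes a few hundred observations when $d=1$ and on the order of $10^7$ when $d=10$.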
Examples like this explain why nonparametric methods often need either low dimension, additional structure, or large samples. The next question is whether the upper bound is an artefact of kernel estimators, or whether every estimator must pay the same price.
[quotetheorem:6360]
[citeproof:6360]
The theorem is the formal version of the bias-variance calculation: the smoothing bandwidth that optimises a kernel estimator also matches the resolution below which the data cannot reliably distinguish alternatives. The fixed lower bound is a convenient way to state the lecture's bump construction, since it gives room to add positive and negative perturbations while remaining inside the density model. It should not be read as the only possible formulation of the Holder minimax rate; more general density classes can be handled by localising the construction or by using a baseline that stays nonnegative on the perturbation supports. Uniform boundedness prevents the class from containing densities with extreme concentration, where the variance calculation and the testing distances have a different scale.
The compact cube is not essential, but some control of geometry is: on irregular domains or unbounded supports, boundary behaviour and tail mass can change the rate unless extra assumptions are imposed. Holder smoothness is also a substantive restriction rather than decoration; if the class allowed sharper local features, the bump amplitude could be larger at a fixed resolution, so the minimax risk would be larger and the rate slower. Conversely, stronger structure such as sparsity, additivity, monotonicity, or parametric shape constraints can improve the rate, so the theorem should be read as a benchmark for full $d$-dimensional Holder classes rather than as a universal law for every nonparametric model.
## Assouad and Fano Lower Bound Strategies
How can a statement about every possible estimator be proved without analysing every estimator? The answer is to embed a finite testing problem inside the statistical model. If many parameter values are well separated in loss but produce probability laws that are hard to distinguish from $n$ observations, then estimation over the full class must be difficult.
Assouad's method is suited to product-like collections indexed by binary strings. It converts global estimation error into the expected Hamming error of recovering many hidden bits.
[quotetheorem:5910]
[citeproof:5910]
The assumptions in [Assouad's lemma](/theorems/5906) encode exactly what the reduction needs. The loss-decoding condition says that a good estimator must reveal many of the hidden bits; without such a decoder, a large Hamming error might have little connection with the statistical loss being studied. For example, if the loss depends only on the total mass of a density and all hypercube vertices have the same mass, then recovering the individual signs is irrelevant to the loss, so Hamming difficulty alone proves nothing about estimation risk. Similarly, if two different bit strings determine functions that are close in the target metric, a decoder that guesses the wrong string may still produce an accurate estimate.
The adjacent total variation condition is local because Assouad compares vertices that differ in one coordinate. If adjacent laws were nearly singular, each bit could be recovered accurately and the hypercube would not force a lower bound. This can happen when bump amplitudes are too large, supports receive many observations, or noise is so small that the presence or absence of a local perturbation leaves a visible signature in the likelihood. In those regimes the testing problem is no longer hard, even though the parameter set still has many vertices.
The lemma is therefore strongest for models that decompose into many comparable local decisions. It is less natural for lower bounds based on a single global alternative, or for packings where no coordinate-wise adjacency structure is available. The next construction fits the lemma by encoding each bit as a small bump whose amplitude is limited by smoothness.
[example: Bump-Function Lower Bound for Density Estimation]
Let $g\in C_c^\infty((0,1)^d)$ satisfy $\int g\,d\mathcal L^d=0$. Choose points $x_1,\dots,x_M$ so that the supports of
\begin{align*}
g_k(x)=h^s g\left(\frac{x-x_k}{h}\right)
\end{align*}
are disjoint and lie in cubes of side length proportional to $h$, with $M\asymp h^{-d}$. Define
\begin{align*}
f_\theta=f_0+\sum_{k=1}^M \theta_k g_k,\quad \theta\in\{0,1\}^M,
\end{align*}
where $f_0$ is a baseline density bounded below by $b>0$.
Each perturbation preserves total mass. Put $u=(x-x_k)/h$, so $x=x_k+hu$ and $d\mathcal L^d(x)=h^d\,d\mathcal L^d(u)$. Then
\begin{align*}
\int g_k(x)\,d\mathcal L^d(x)=\int h^s g\left(\frac{x-x_k}{h}\right)\,d\mathcal L^d(x).
\end{align*}
After the change of variables this becomes
\begin{align*}
\int g_k(x)\,d\mathcal L^d(x)=h^s\int g(u)h^d\,d\mathcal L^d(u).
\end{align*}
Thus
\begin{align*}
\int g_k(x)\,d\mathcal L^d(x)=h^{s+d}\int g(u)\,d\mathcal L^d(u)=0.
\end{align*}
Therefore
\begin{align*}
\int f_\theta\,d\mathcal L^d=\int f_0\,d\mathcal L^d+\sum_{k=1}^M\theta_k\int g_k\,d\mathcal L^d=1.
\end{align*}
Because the supports are disjoint, at each $x$ at most one summand $g_k(x)$ is nonzero, so
\begin{align*}
\left|\sum_{k=1}^M\theta_k g_k(x)\right|\le h^s\|g\|_\infty.
\end{align*}
If $h^s\|g\|_\infty\le b/2$, then
\begin{align*}
f_\theta(x)\ge f_0(x)-h^s\|g\|_\infty\ge b/2,
\end{align*}
so every $f_\theta$ is nonnegative.
If $\theta$ and $\theta'$ differ only in coordinate $j$, then all other coordinates cancel:
\begin{align*}
f_\theta-f_{\theta'}=\sum_{k=1}^M(\theta_k-\theta_k')g_k=(\theta_j-\theta_j')g_j.
\end{align*}
Since $|\theta_j-\theta_j'|=1$,
\begin{align*}
\|f_\theta-f_{\theta'}\|_{L^2}^2=\int h^{2s}g\left(\frac{x-x_j}{h}\right)^2\,d\mathcal L^d(x).
\end{align*}
Using $u=(x-x_j)/h$ gives
\begin{align*}
\|f_\theta-f_{\theta'}\|_{L^2}^2=h^{2s}\int g(u)^2h^d\,d\mathcal L^d(u).
\end{align*}
Hence
\begin{align*}
\|f_\theta-f_{\theta'}\|_{L^2}^2=h^{2s+d}\int g(u)^2\,d\mathcal L^d(u)=h^{2s+d}\|g\|_{L^2}^2.
\end{align*}
Thus changing one hidden bit changes the squared $L^2$ distance by order $h^{2s+d}$.
For adjacent vertices, write $\delta=f_\theta-f_{\theta'}$. Recall that $\int\delta\,d\mathcal L^d=0$ and $f_{\theta'}\ge b/2$; for $h$ small enough we also have $|\delta|/f_{\theta'}\le 1/2$. Because $f_\theta=f_{\theta'}(1+\delta/f_{\theta'})$, the Kullback-Leibler divergence $\int f_\theta\log(f_\theta/f_{\theta'})\,d\mathcal L^d$ can be written as
\begin{align*}
K(P_{f_\theta},P_{f_{\theta'}})=\int f_{\theta'}\left(1+\frac{\delta}{f_{\theta'}}\right)\log\left(1+\frac{\delta}{f_{\theta'}}\right)\,d\mathcal L^d.
\end{align*}
For $|t|\le 1/2$ there is a constant $C$ such that $(1+t)\log(1+t)\le t+Ct^2$. Applying this elementary bound with $t=\delta/f_{\theta'}$ gives
\begin{align*}
K(P_{f_\theta},P_{f_{\theta'}})\le \int \delta\,d\mathcal L^d+C\int \frac{\delta^2}{f_{\theta'}}\,d\mathcal L^d.
\end{align*}
The first term is zero, and $f_{\theta'}\ge b/2$, so
\begin{align*}
K(P_{f_\theta},P_{f_{\theta'}})\le \frac{2C}{b}\|\delta\|_{L^2}^2.
\end{align*}
Using the adjacent-vertex separation computed above,
\begin{align*}
K(P_{f_\theta},P_{f_{\theta'}})\le \frac{2C}{b}h^{2s+d}\|g\|_{L^2}^2.
\end{align*}
For $n$ independent observations, additivity of Kullback-Leibler divergence for product measures gives
\begin{align*}
K(P_{f_\theta}^{(n)},P_{f_{\theta'}}^{(n)})=nK(P_{f_\theta},P_{f_{\theta'}})\le \frac{2C}{b}nh^{2s+d}\|g\|_{L^2}^2.
\end{align*}
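Thus adjacent alternatives remain hard to distinguish as long as $nh^{2s+d}$ stays bounded. The largest such $h$, which gives the strongest lower bound, satisfies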
\begin{align*}
nh^{2s+d}\asymp 1.
\end{align*}
Solving this relation gives
\begin{align*}
h^{2s+d}\asymp n^{-1}.
\end{align*}
Taking the $(2s+d)$th root yields
\begin{align*}
h\asymp n^{-1/(2s+d)}.
\end{align*}
The hypercube has $M\asymp h^{-d}$ independently placed bumps. By *Assouad's lemma*, the lower bound accumulates one contribution of order $h^{2s+d}$ for each unresolved coordinate, so the total squared $L^2$ scale is
\begin{align*}
Mh^{2s+d}\asymp h^{-d}h^{2s+d}=h^{2s}.
\end{align*}
Substituting $h\asymp n^{-1/(2s+d)}$ gives
\begin{align*}
h^{2s}\asymp \left(n^{-1/(2s+d)}\right)^{2s}=n^{-2s/(2s+d)}.
\end{align*}
The construction shows where the minimax exponent comes from: each bump is small enough to be difficult to detect, but there are $M\asymp h^{-d}$ possible bump locations, and their accumulated $L^2$ cost is exactly the rate $n^{-2s/(2s+d)}$.
[/example]
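The three computations in the bump example (zero mass, the $h^{2s+d}$ scaling of the squared $L^2$ distance, and the Kullback-Leibler bound) can be checked numerically in dimension $d=1$. The sketch below makes several assumptions that are not in the example itself: the baseline is the uniform density $f_0\equiv 1$ on $[0,1]$ (so $b=1$), the bump $g$ is one particular smooth function that is odd about $1/2$, the values $s=2$, $h=0.1$, $x_k=0.3$ are arbitrary, and integrals use a plain midpoint rule.

```python
import math

def g(u):
    """Hypothetical smooth bump on (0, 1), odd about 1/2, hence zero integral."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return 100.0 * (u - 0.5) * math.exp(-1.0 / (u * (1.0 - u)))

def midpoint_integral(func, a, b, m=20000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    step = (b - a) / m
    return step * sum(func(a + (i + 0.5) * step) for i in range(m))

s, h, x_k = 2.0, 0.1, 0.3  # illustrative smoothness, resolution, bump location

def g_k(x):
    """Rescaled bump h^s g((x - x_k)/h), supported on [x_k, x_k + h]."""
    return h ** s * g((x - x_k) / h)

# Mass preservation: the perturbation integrates to zero.
mass = midpoint_integral(g_k, x_k, x_k + h)

# Squared L2 distance between adjacent vertices versus h^(2s+1) * ||g||^2.
sq_norm_g = midpoint_integral(lambda u: g(u) ** 2, 0.0, 1.0)
sq_norm_gk = midpoint_integral(lambda x: g_k(x) ** 2, x_k, x_k + h)
predicted = h ** (2 * s + 1) * sq_norm_g

# KL divergence between f_0 + g_k and f_0 with f_0 = 1 (uniform baseline).
kl = midpoint_integral(
    lambda x: (1.0 + g_k(x)) * math.log1p(g_k(x)), x_k, x_k + h
)

# Accumulated cost over M = 1/h disjoint bumps, to compare with h^(2s) ||g||^2.
M = round(1.0 / h)
total = M * sq_norm_gk
print(mass, sq_norm_gk, predicted, kl, total, h ** (2 * s) * sq_norm_g)
```

With this baseline the bound $(1+t)\log(1+t)\le t+t^2$ holds for all $t>-1$, so $C=1$ is admissible and the displayed inequality reads $K\le 2\|\delta\|_{L^2}^2$; the computed divergence sits comfortably below that.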
The bump example has a hypercube structure, so Assouad records the cumulative cost of many local decisions. Some lower bounds are easier to build from a large packing without a coordinate-by-coordinate interpretation. In that setting we need a tool that measures how much information is available to identify one member of the packing.
[quotetheorem:5900]
[citeproof:5900]
The separation assumption is what converts testing failure into estimation error, and it depends on the metric chosen for the packing. A set that is well separated in $L^2$ may be poorly separated in sup norm, Hellinger distance, or an empirical design norm, so the same finite family can prove a lower bound for one loss while saying little about another. If two packing points were closer than $2\delta$ in the target metric, a nearest-neighbour mistake would not force loss of order $\delta^2$. The size $N$ matters because Fano compares information with the number of possible messages; a two-point construction uses a different testing inequality, while a very large packing can produce a stronger lower bound only if the average information remains small. The Kullback-Leibler condition is the formal indistinguishability requirement. If the average divergence were much larger than $\log N$, the data could contain enough information to identify the packing element, and this version of Fano would give no useful obstruction.
The displayed theorem is one common average-Kullback-Leibler version of Fano, not the only form used in minimax theory. Other variants replace the average divergence by a maximum divergence to a centre point, use mutual information directly, or change the numerical constants in the testing-error bound. These variants are often interchangeable at the rate level, but they can differ in convenience: a centre-point version is natural for star-shaped perturbation sets, while the average-pairwise version above fits packings where pairwise divergences are easier to sum.
A concrete separation failure occurs if the proposed packing contains many translated bumps whose supports overlap heavily. The statistical laws may be hard to distinguish, but if two functions differ only by shifting a broad bump by much less than its width, their $L^2$ distance can be below $2\delta$. A nearest-neighbour mistake between those two alternatives then need not imply loss of order $\delta^2$, so the Fano reduction cannot certify the claimed risk scale even when the information condition is favourable.
A concrete information failure occurs in fixed-design Gaussian regression when the packing functions have amplitudes that are too large. If observations are $Y_m=f(x_m)+\varepsilon_m$ with $\varepsilon_m\sim\mathcal N(0,\sigma^2)$, then the Kullback-Leibler divergence between two alternatives is proportional to $n\|f_i-f_j\|_{n}^{2}/\sigma^2$, where $\|\cdot\|_n$ is the empirical design norm. Increasing the bump amplitude may improve $L^2$ separation, but it also makes $n\|f_i-f_j\|_{n}^{2}$ large. Once this information term is comparable to or larger than $\log N$, the regression data can identify the packing element, so the same construction no longer proves a lower bound.
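The trade-off between amplitude and information can be made concrete with a small calculation. The sketch below assumes one standard form of Fano's inequality, error probability at least $1-(\bar K+\log 2)/\log N$ with $\bar K$ the average pairwise Kullback-Leibler divergence; its constants may differ from the version quoted above. The packing is a deliberately crude family of box functions on an equispaced design (not smooth bumps), and all function names and parameter values are hypothetical.

```python
import math

def gaussian_regression_kl(f_i, f_j, sigma):
    """KL divergence between N(f_i, sigma^2 I) and N(f_j, sigma^2 I).

    Equals n * ||f_i - f_j||_n^2 / (2 sigma^2) in empirical-norm notation.
    """
    return sum((a - b) ** 2 for a, b in zip(f_i, f_j)) / (2.0 * sigma ** 2)

def box_packing(n, num_cells, amplitude):
    """Design-point values of f_j = amplitude * 1{x in cell j}, j = 0..num_cells-1."""
    return [
        [amplitude if (m * num_cells) // n == j else 0.0 for m in range(n)]
        for j in range(num_cells)
    ]

def average_pairwise_kl(functions, sigma):
    """Average of KL(P_i, P_j) over all ordered pairs, including i = j."""
    total = sum(
        gaussian_regression_kl(fi, fj, sigma) for fi in functions for fj in functions
    )
    return total / len(functions) ** 2

def fano_error_lower_bound(avg_kl, num_hypotheses):
    """Assumed Fano form: 1 - (avg_kl + log 2) / log N, truncated at zero."""
    return max(0.0, 1.0 - (avg_kl + math.log(2.0)) / math.log(num_hypotheses))

n, N, sigma = 320, 32, 1.0
bound_small = fano_error_lower_bound(
    average_pairwise_kl(box_packing(n, N, 0.2), sigma), N
)
bound_large = fano_error_lower_bound(
    average_pairwise_kl(box_packing(n, N, 1.0), sigma), N
)
print(bound_small, bound_large)
```

With amplitude $0.2$ the average divergence is well below $\log 32$ and the testing error stays bounded away from zero; with amplitude $1$ the divergence exceeds $\log 32$ and the bound collapses to zero, matching the failure described above.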
Assouad and Fano often lead to the same rate, but they emphasise different aspects of the model. Assouad tracks many local binary decisions; Fano tracks the difficulty of selecting one element from a large separated packing. Neither method automatically constructs the packing or verifies that its elements remain inside the statistical model; that geometric work is usually the main part of a minimax lower bound.
[remark: Choosing a Lower Bound Method]
Use Assouad when the natural construction consists of many independent local perturbations whose signs or presence indicators can vary separately. Use Fano when the cleanest construction is a packing of functions with controlled pairwise information. Both methods depend on the same principle: statistical indistinguishability plus metric separation forces estimation error.
[/remark]
## Adaptation Limits and Honest Confidence Bands
Can an estimator attain the best rate simultaneously over several smoothness classes without knowing the smoothness? Point estimation sometimes permits adaptation, for example through Lepski's method or data-driven bandwidth selection. Confidence statements are more rigid because they must cover the true function uniformly while having diameter comparable to the estimation rate.
[definition: Adaptive Estimator over Nested Holder Classes]
Let $0<s_1<s_2$ and let $\mathcal F_{s_i}=\mathcal H^{s_i}(L_i;[0,1]^d)$, with any additional density or regression constraints understood. An estimator is a measurable map $\hat f_n:\Omega_n\to L^2([0,1]^d)$ from the sample space of $n$ observations into the $L^2$ function space. It is rate-adaptive over $(\mathcal F_{s_1},\mathcal F_{s_2})$ in $L^2$ risk if
\begin{align*}
\sup_{f\in\mathcal F_{s_i}}\mathbb E_f[\|\hat f_n-f\|_{L^2}^2]\lesssim n^{-2s_i/(2s_i+d)}
\end{align*}
for $i=1,2$, with constants independent of $n$.
[/definition]
The definition expresses an oracle goal: the estimator behaves as though it knew whether the truth was rougher or smoother. For point estimation this is often attainable up to logarithmic factors, because the cost of selecting a bandwidth is small compared with the main bias-variance tradeoff.
[example: Adaptive Estimation over Nested Holder Classes]
Suppose $0<s_1<s_2$ in a one-dimensional regression problem with bounded noise variance. For a local polynomial estimator with bandwidth $h$, the bias-variance bound has the form
\begin{align*}
\operatorname{risk}_s(h)\lesssim h^{2s}+\frac{1}{nh},
\end{align*}
where $h^{2s}$ is the squared bias scale and $(nh)^{-1}$ is the variance scale. To find the oracle bandwidth for a known smoothness $s$, balance the two scales:
\begin{align*}
h^{2s}=\frac{1}{nh}.
\end{align*}
Multiplying both sides by $nh$ gives
\begin{align*}
nh\cdot h^{2s}=nh\cdot \frac{1}{nh}.
\end{align*}
The left side is $nh^{2s+1}$ and the right side is $1$, so
\begin{align*}
nh^{2s+1}=1.
\end{align*}
Dividing by $n$ gives
\begin{align*}
h^{2s+1}=n^{-1}.
\end{align*}
Since $h>0$, taking the positive $(2s+1)$st root gives
\begin{align*}
h=n^{-1/(2s+1)}.
\end{align*}
At this bandwidth, the squared bias term is
\begin{align*}
h^{2s}=\left(n^{-1/(2s+1)}\right)^{2s}=n^{-2s/(2s+1)}.
\end{align*}
The variance term has the same order. Substituting $h=n^{-1/(2s+1)}$ gives
\begin{align*}
\frac{1}{nh}=\frac{1}{n\,n^{-1/(2s+1)}}.
\end{align*}
Using $1/(ab)=a^{-1}b^{-1}$ and $\left(n^{-1/(2s+1)}\right)^{-1}=n^{1/(2s+1)}$,
\begin{align*}
\frac{1}{n\,n^{-1/(2s+1)}}=n^{-1}n^{1/(2s+1)}.
\end{align*}
Combining powers of $n$ gives
\begin{align*}
n^{-1}n^{1/(2s+1)}=n^{-1+1/(2s+1)}.
\end{align*}
Finally,
\begin{align*}
-1+\frac{1}{2s+1}=-\frac{2s+1}{2s+1}+\frac{1}{2s+1}=-\frac{2s}{2s+1},
\end{align*}
so
\begin{align*}
\frac{1}{nh}=n^{-2s/(2s+1)}.
\end{align*}
Thus an estimator that knows the smoothness $s$ would use bandwidth $h\asymp n^{-1/(2s+1)}$ and attain the integrated squared-error scale $n^{-2s/(2s+1)}$.
A Lepski-type selector replaces this unknown oracle bandwidth by a data-driven choice from a grid. It compares local polynomial fits across bandwidths and keeps the largest bandwidth whose estimate remains statistically compatible with estimates at smaller bandwidths. If the true regression function lies in $\mathcal H^{s_2}$, the squared bias at bandwidth $h$ is of order $h^{2s_2}$, so the selector can tolerate bandwidths near the smoother oracle scale
\begin{align*}
h_{s_2}=n^{-1/(2s_2+1)}.
\end{align*}
Substituting this bandwidth into the oracle calculation gives
\begin{align*}
h_{s_2}^{2s_2}=\left(n^{-1/(2s_2+1)}\right)^{2s_2}=n^{-2s_2/(2s_2+1)}.
\end{align*}
The corresponding variance term is
\begin{align*}
\frac{1}{nh_{s_2}}=n^{-1}n^{1/(2s_2+1)}=n^{-2s_2/(2s_2+1)}.
\end{align*}
Hence the smoother-class scale is $n^{-2s_2/(2s_2+1)}$, up to logarithmic factors from the bandwidth selection step.
If the true function lies only in $\mathcal H^{s_1}$, the squared bias is instead of order $h^{2s_1}$. Since $s_1<s_2$, we have $2s_1<2s_2$. For $0<h<1$, raising $h$ to a larger positive exponent makes it smaller, so
\begin{align*}
h^{2s_1}>h^{2s_2}.
\end{align*}
The rougher class has larger bias at the same bandwidth, so the selector must stop at a smaller bandwidth, near
\begin{align*}
h_{s_1}=n^{-1/(2s_1+1)}.
\end{align*}
At that scale,
\begin{align*}
h_{s_1}^{2s_1}=\left(n^{-1/(2s_1+1)}\right)^{2s_1}=n^{-2s_1/(2s_1+1)}.
\end{align*}
The variance term matches it:
\begin{align*}
\frac{1}{nh_{s_1}}=n^{-1}n^{1/(2s_1+1)}=n^{-2s_1/(2s_1+1)}.
\end{align*}
The example shows that adaptation chooses among bandwidths according to the observed bias-variance tradeoff; it does not change the smoothing rate formula itself.
[/example]
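The oracle calculation can be cross-checked by minimising the bound $h^{2s}+(nh)^{-1}$ over a grid of bandwidths and reading off how the minimiser scales with $n$. This is a sketch with all constants set to $1$; it does not implement a Lepski selector, and the grid, the two sample sizes, and the function names are illustrative choices.

```python
import math

def risk_bound(h, s, n):
    """Bias-variance bound h^(2s) + 1/(n h) with all constants set to 1."""
    return h ** (2 * s) + 1.0 / (n * h)

def grid_minimiser(s, n, grid_size=4000):
    """Minimise the bound over the log-spaced grid h = n^(-t), 0 < t < 1."""
    candidates = (n ** (-i / grid_size) for i in range(1, grid_size))
    best_h = min(candidates, key=lambda h: risk_bound(h, s, n))
    return best_h, risk_bound(best_h, s, n)

def log_log_slopes(s, n_small=10**3, n_large=10**6):
    """Slopes of log(h*) and log(risk*) against log(n) between two sample sizes."""
    h1, r1 = grid_minimiser(s, n_small)
    h2, r2 = grid_minimiser(s, n_large)
    scale = math.log(n_large) - math.log(n_small)
    return (math.log(h2) - math.log(h1)) / scale, (math.log(r2) - math.log(r1)) / scale

bandwidth_slope, risk_slope = log_log_slopes(s=2)
print(bandwidth_slope, risk_slope)  # compare with -1/(2s+1) and -2s/(2s+1)
```

The exact minimiser of the bound is $(2sn)^{-1/(2s+1)}$ rather than $n^{-1/(2s+1)}$; the two differ by a constant factor only, which is why the log-log slopes agree with the exponents derived by balancing the two terms.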
Point estimation only asks for small average loss, so the estimator may take risks that are invisible on easier functions. Confidence bands require a different standard: the reported interval-valued function must cover the truth uniformly over the whole class. We therefore need a definition that separates coverage from width.
[definition: Honest Confidence Band]
Let $\mathcal F$ be a class of real-valued functions on $\mathcal X$, and let $(\Omega_n,\mathcal A_n)$ be the sample space for $n$ observations. A random band is a pair $C_n=(\ell_n,u_n)$ of measurable maps $\ell_n,u_n:\Omega_n\to \ell^\infty(\mathcal X)$, with $\ell_n(\omega)(x)\le u_n(\omega)(x)$ for all $\omega\in\Omega_n$ and $x\in\mathcal X$. It is honest over $\mathcal F$ at level $1-\alpha$ if
\begin{align*}
\inf_{f\in\mathcal F}\mathbb P_f^{(n)}\left(\ell_n(x)\le f(x)\le u_n(x)\text{ for all }x\in\mathcal X\right)\ge 1-\alpha.
\end{align*}
[/definition]
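A familiar band that satisfies this definition, although for distribution functions rather than Holder functions, is the empirical distribution function plus or minus the Dvoretzky-Kiefer-Wolfowitz half-width $\sqrt{\log(2/\alpha)/(2n)}$, which is honest over all distribution functions. The sketch below estimates its coverage by simulation. Sampling from the uniform distribution is an assumption made for convenience; by the probability integral transform the supremum deviation has the same law for every continuous distribution. The function names, seed, and simulation sizes are hypothetical.

```python
import math
import random

def dkw_half_width(n, alpha):
    """Half-width eps with 2 * exp(-2 n eps^2) = alpha (DKW inequality, Massart constant)."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def sup_deviation_uniform(sample):
    """sup_x |F_n(x) - x| for a sample assumed drawn from Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def simulated_coverage(n, alpha, reps, seed=0):
    """Fraction of simulated samples whose band F_n +/- eps covers the true F everywhere."""
    rng = random.Random(seed)
    eps = dkw_half_width(n, alpha)
    hits = sum(
        sup_deviation_uniform([rng.random() for _ in range(n)]) <= eps
        for _ in range(reps)
    )
    return hits / reps

coverage = simulated_coverage(n=100, alpha=0.1, reps=2000)
band_diameter = 2.0 * dkw_half_width(100, 0.1)
print(coverage, band_diameter)
```

The diameter of this band is deterministic and of order $n^{-1/2}$ whatever the true distribution, so it illustrates coverage without any attempt at adaptation.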
Honesty is a uniform guarantee, not an average guarantee, and its force comes from requiring coverage at the least favourable functions in the class. By itself, however, honesty allows a band to be so wide that coverage carries little statistical information. To ask whether uncertainty quantification can adapt to unknown smoothness, we must add a second requirement: over each smoothness class, the band should have diameter on the natural minimax confidence scale. This turns the point-estimation adaptation question into a simultaneous coverage-and-width question, which motivates the next definition.
[definition: Adaptive Honest Band]
Let $(\Omega_n,\mathcal A_n)$ be the sample space for $n$ observations, and let $C_n=(\ell_n,u_n)$ be a random band with $\ell_n,u_n:\Omega_n\to \ell^\infty(\mathcal X)$. Define the sup-norm diameter functional $\operatorname{diam}_\infty(C_n):\Omega_n\to[0,\infty]$ by
\begin{align*}
\operatorname{diam}_\infty(C_n)(\omega)=\sup_{x\in\mathcal X}(u_n(\omega)(x)-\ell_n(\omega)(x)).
\end{align*}
For nested classes $\mathcal F_{s_2}\subset\mathcal F_{s_1}$, an honest band over $\mathcal F_{s_1}$ is adaptive over the pair if $\operatorname{diam}_\infty(C_n)$ over $\mathcal F_{s_i}$ is of the same order as the target confidence-band diameter over $\mathcal F_{s_i}$ for $i=1,2$, possibly up to logarithmic factors specified in the theorem.
[/definition]
The obstruction is that a function in the rougher class can lie extremely close, in distribution, to a smoother function while still being far enough in sup norm to force a wider band. If the band is honest over the rough class, it must cover such alternatives; if it is narrow over the smooth class, it cannot distinguish them reliably.
[quotetheorem:6359]
[citeproof:6359]
The full rough-class honesty assumption is the source of the obstruction. If coverage were required only pointwise in $f$, a band could be narrow at a smooth function and fail near carefully chosen rough alternatives without violating that weaker requirement. Uniform honesty forbids this escape because the same procedure must cover every rough function with the advertised probability.
Nested Holder structure is also essential to the comparison. The smoother class lies inside the rougher one, so the band is asked to be honest on the larger class while shrinking on a smaller submodel. If the two classes were separated by a fixed distance, or if the rough class were trimmed to remove functions statistically close to smoother functions, the testing contradiction could disappear. A typical trimmed alternative is a rough function of the form $f_0+a\psi((x-x_0)/h)$, where $f_0$ is smooth, $\psi$ is a compactly supported zero-integral bump, and $a$ is chosen so that the bump is visible in sup norm but still hard to test at sample size $n$; excluding functions with such isolated unresolved bumps removes the alternatives that drive the impossibility proof. This is the reason self-similarity and polished-tail restrictions can restore adaptive bands. The theorem therefore does not say that adaptive uncertainty quantification is useless. It says that unrestricted honesty over the full rough Holder ball is incompatible with minimax-scale width on smoother subclasses.
[remark: Estimation and Coverage Adaptation]
Adaptive point estimation and adaptive confidence bands answer different questions. A point estimator may perform well on average at each smoothness level, while a band must protect against the hardest alternatives uniformly. The lower-bound mechanism is therefore a testing obstruction, not a failure of bandwidth selection.
[/remark]
The chapter's main lesson is that the minimax rate
\begin{align*}
n^{-2s/(2s+d)}
\end{align*}
is both achievable and unavoidable for global $L^2$ estimation over Holder classes. Assouad and Fano provide the finite testing reductions behind the unavoidable part. The same reductions explain why uncertainty quantification has stricter limits than point estimation when the smoothness is unknown. The ideas also connect beyond nonparametric statistics: the bias-variance balance mirrors resolution limits in numerical inverse problems, and the Fano viewpoint is an information-theoretic statement about how many distinguishable messages the data can carry.
## Beyond and Connections
Nonparametric statistics sits between probability, statistical modelling, computation, and functional analysis. The empirical distribution function chapters use the probability integral transform, weak convergence, and Brownian bridge limits, so they lead naturally toward advanced probability courses on empirical processes, tightness, and Gaussian approximations. The testing chapters show how distribution-free procedures replace parametric likelihood calculations by invariance and rank information; this is the same structural idea behind permutation tests, randomisation inference, and robust comparisons of samples.
The smoothing chapters connect directly to regression, inverse problems, and high-dimensional estimation. Kernel density estimation and local polynomial regression both begin with the same bias-variance calculation, but their difficult questions are different: density estimation emphasises bandwidth, boundary effects, and integrated risk, while regression adds conditional means, heteroskedasticity, and design assumptions. Bootstrap confidence bands and honest uncertainty quantification then make the probabilistic approximation problem visible, because coverage has to hold uniformly rather than only at a fixed function.
The minimax chapters are the bridge to modern statistical theory. Assouad, Le Cam, and Fano arguments turn estimation lower bounds into testing lower bounds, and this viewpoint reappears in sparse recovery, random matrices, learning theory, and information theory. Adaptation results explain why data-driven procedures can often choose a rate without knowing the smoothness, while the impossibility of unrestricted adaptive honest bands shows where statistical inference needs extra structure such as self-similarity, shape restrictions, or separation from hard alternatives.
## References
Androma, [Cambridge II Principles of Statistics](/page/Cambridge%20II%20Principles%20of%20Statistics).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics).
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Nonparametric Statistics
Created by admin on 6/11/2026 | Last updated on 6/11/2026