This course provides a comprehensive introduction to the mathematical foundations of statistics, progressing from classical parametric inference through Bayesian methods and into modern nonparametric and computational techniques. The course emphasizes the theoretical underpinnings of statistical estimation and inference, building rigorous mathematical understanding rather than focusing on applications or cookbook methods. By working through maximum likelihood estimation, information theory, and asymptotic analysis, students develop the conceptual tools needed to understand why standard statistical procedures work and when they can be trusted.

The first half of the course establishes the frequentist framework through the lens of maximum likelihood estimation. Beginning with the basic definition and properties of MLEs, the course develops the key theoretical results that justify their use: the Fisher information and Cramér-Rao bound provide measures of efficiency, while consistency and asymptotic normality guarantee that MLEs behave predictably in large samples. This progression from finite-sample properties to asymptotic theory culminates in practical inferential procedures like confidence intervals and hypothesis tests. The chapters on asymptotic inference demonstrate how these theoretical results translate into actionable statistical methods.

The second half broadens the statistical perspective by introducing Bayesian inference, decision theory, and modern computational and nonparametric methods. Students learn how prior distributions and loss functions shape Bayesian estimation, and how frequentist and Bayesian approaches relate through risk analysis and admissibility. The final chapters on resampling, Monte Carlo methods, and nonparametric statistics acknowledge the limitations of parametric assumptions and demonstrate how computational power enables statistical inference without strong structural assumptions.
Throughout, the course emphasizes that sound statistical practice requires both theoretical understanding and awareness of practical limitations.

# Introduction

## Statistical Models and the Likelihood Principle

Statistics is concerned with the inverse problem of probability: given access to draws from an unknown probability distribution, make rigorous statements about that distribution. Where probability theory starts from a known distribution and asks what we expect to see, statistics starts from what we have seen and asks what distribution produced it. This chapter introduces the formal framework (statistical models, parameter spaces, and the three central goals of the discipline) before developing the likelihood principle, the most fundamental strategy for extracting information from data.

## The Statistical Framework

The first difficulty in statistics is that an arbitrary probability distribution is an infinite-dimensional object: specifying it completely would require knowing $F(t)$ for every $t \in \mathbb{R}$. No finite dataset can pin down an object that complex. The way out is to restrict attention to families of distributions parametrized by a finite-dimensional vector $\theta$: instead of searching over all possible distributions, we search over a tractable parameter space $\Theta$. This section sets up that framework precisely.

We work with a real-valued random variable $X$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The distribution of $X$ is described by its cumulative distribution function
\begin{align*}
F(t) = \mathbb{P}(\{\omega \in \Omega : X(\omega) \leq t\}), \quad t \in \mathbb{R}.
\end{align*}
When $X$ is discrete, $F$ is related to the probability mass function $f$ by
\begin{align*}
F(t) = \sum_{x \leq t} f(x),
\end{align*}
and when $X$ is continuous, $F$ is related to the probability density function $f$ by
\begin{align*}
F(t) = \int_{-\infty}^{t} f(s)\, ds.
\end{align*}

Most statistical problems concern not a single observation but a sample: $n$ independent copies $X_1, \ldots, X_n$ of $X$, where $n$ is called the sample size. Each $X_i$ has the same unknown distribution, which we wish to learn. As motivated above, we restrict attention to families of distributions indexed by a finite-dimensional parameter $\theta$.

[definition: Statistical Model]
A **statistical model** for a sample from $X$ is a family
\begin{align*}
\{f(\,\cdot\,, \theta) : \theta \in \Theta\} \quad \text{or} \quad \{P_\theta : \theta \in \Theta\}
\end{align*}
of probability mass functions or probability density functions $f(\,\cdot\,, \theta)$, or of probability distributions $P_\theta$ for the law of $X$. The index set $\Theta$ is called the **parameter space**.
[/definition]

The parameter space $\Theta$ encodes what values the unknown parameter $\theta$ is allowed to take. Both the shape of the model family and the choice of $\Theta$ are modelling decisions that the statistician makes before seeing data.

[example: Standard Statistical Models]
The following models illustrate how the choice of $\Theta$ encodes modelling assumptions.

(i) $\mathcal{N}(\theta, 1)$ with $\theta \in \Theta = \mathbb{R}$. The variance is treated as known; only the mean is unknown. This is the simplest Gaussian model and will serve as our running example throughout the course.

(ii) $\mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2) \in \Theta = \mathbb{R} \times (0, \infty)$. Both mean and variance are unknown, so we have a two-dimensional parameter.

(iii) $\mathrm{Exp}(\theta)$ with $\theta \in \Theta = (0, \infty)$. An exponential model with unknown rate. The endpoint $0$ is excluded from the parameter space because a zero rate makes no probabilistic sense.

(iv) $\mathcal{N}(\theta, 1)$ with $\theta \in \Theta = [-1, 1]$.
The same Gaussian family as (i), but with the mean constrained to a compact set, perhaps because domain knowledge rules out means outside $[-1, 1]$.
[/example]

Models (i) and (iv) use the same distributional family but impose different parameter spaces. This distinction is not cosmetic. The set $\Theta$ determines which values an estimator may take: an MLE for model (iv) must respect the constraint $\hat{\theta} \in [-1, 1]$, which can force a different answer than the unconstrained optimum. And if the true mean is $\theta_0 = 2$, then model (iv) is misspecified: no element of the family reproduces the true distribution, and the theory we develop simply does not apply. This is the chief danger of parametric modelling: a wrong assumption about $\Theta$ is not a small error in a coefficient; it is a categorical mistake about which distributions are even being considered.

A less obvious danger is the opposite extreme: allowing $\Theta$ to be too large, or abandoning a parametric restriction altogether. If we do not restrict the distribution at all (that is, if we allow $F$ to be any cumulative distribution function), then we are in the nonparametric setting. Nonparametric estimation is possible (the empirical distribution function is the canonical example), but it comes at a serious statistical cost: with no structure to exploit, estimators of quantities such as densities converge more slowly than in parametric models, and many questions that are easy in the parametric case become hard or impossible to answer with finite data. The parametric assumption is not a crutch; it is what makes efficient inference tractable.

## Correct Specification

A statistical model is only useful if the true distribution of $X$ actually belongs to the model family. This motivates the following concept.

[definition: Correctly Specified Model]
For a variable $X$ with distribution $P$, the model $\{P_\theta : \theta \in \Theta\}$ is **correctly specified** if there exists $\theta \in \Theta$ such that $P_\theta = P$.
In this case we write $\theta_0$ for the true value of the parameter, and we say the observations $X_1, \ldots, X_n$ are i.i.d.\ from the model $\{P_\theta : \theta \in \Theta\}$. [/definition] If the model is not correctly specified, no element of the model family reproduces the true distribution. All the theory we develop assumes correct specification unless stated otherwise. Returning to the examples above: if $X \sim \mathcal{N}(2, 1)$, then model (i) is correctly specified (with $\theta_0 = 2$), but model (iv) is not, since $\theta_0 = 2 \notin [-1, 1]$. ## The Three Goals of Statistics Given a sample $X_1, \ldots, X_n$ from an unknown distribution $P_{\theta_0}$, what exactly do we want to know about $\theta_0$? There are fundamentally three different kinds of answers a statistician might need. The first is a point: a single best guess at $\theta_0$. The second is a decision: a binary verdict on whether $\theta_0$ belongs to a specified region. The third is a set: a region guaranteed to contain $\theta_0$ with prescribed probability. These correspond to estimation, hypothesis testing, and confidence sets — the three central problems of the course. They are not interchangeable. Estimation asks for precision; hypothesis testing asks for a decision with controlled error rates; confidence sets ask for calibrated uncertainty. A good estimator does not automatically give you a good test, and a good test does not automatically give you a confidence set. Yet they are deeply interconnected: in many settings, a confidence set can be constructed by inverting a family of hypothesis tests, and both rely on the quality of the underlying estimator. Understanding when they agree and when they diverge is one of the subtler themes of the course. [definition: Estimator] An **estimator** of $\theta$ is any measurable function \begin{align*} \hat{\theta} = \hat{\theta}(X_1, \ldots, X_n) : (\mathcal{X})^n \to \Theta \end{align*} of the observations. 
We say $\hat{\theta}$ is **consistent** if $\hat{\theta} \xrightarrow{\mathbb{P}} \theta_0$ as $n \to \infty$ under $P_{\theta_0}$, for all $\theta_0 \in \Theta$. [/definition] An estimator produces a single number — a best guess. But in many situations a point estimate is not enough: a scientist testing whether a drug has any effect needs a yes-or-no verdict, not a number; a regulator setting safety limits needs a range of plausible values, not a point. These lead to two further problems that are distinct from estimation and from each other. [definition: Hypothesis Test] A **hypothesis test** at level $\alpha \in (0,1)$ of the null $H_0 : \theta \in \Theta_0$ against the alternative $H_1 : \theta \in \Theta_1 = \Theta \setminus \Theta_0$ is a measurable function \begin{align*} \psi_n = \psi(X_1, \ldots, X_n) \in \{0, 1\} \end{align*} satisfying $\sup_{\theta \in \Theta_0} P_\theta(\psi_n = 1) \leq \alpha$. We reject $H_0$ when $\psi_n = 1$ and fail to reject when $\psi_n = 0$. [/definition] A test delivers a binary decision, but it says nothing about how far $\theta_0$ might be from $\Theta_0$, or about what values of $\theta$ are compatible with the data. For that, we need a different object entirely: not a point and not a verdict, but a set of parameter values that the data cannot rule out. [definition: Confidence Set] A **confidence set** at level $1 - \alpha$ is a random set $C_n = C(X_1, \ldots, X_n) \subseteq \Theta$ satisfying \begin{align*} P_\theta(\theta \in C_n) \geq 1 - \alpha, \quad \text{for all } \theta \in \Theta. \end{align*} The quantity $1 - \alpha$ is the **coverage probability**. Unlike an estimator, which produces a single point, a confidence set quantifies the uncertainty in estimation by providing a region that contains the true parameter with guaranteed probability. 
[/definition]

Hypothesis tests and confidence sets are dual to each other: a test at level $\alpha$ and a confidence set at level $1 - \alpha$ are two sides of the same coin. Specifically, suppose that for each $\theta \in \Theta$ we have a level-$\alpha$ test $\psi_n^\theta$ of the point null $H_0 : \theta_0 = \theta$. Then $C_n = \{\theta \in \Theta : \psi_n^\theta = 0\}$, the set of parameter values not rejected by this family of tests, is a confidence set at level $1 - \alpha$, because $P_\theta(\theta \in C_n) = P_\theta(\psi_n^\theta = 0) \geq 1 - \alpha$ for every $\theta \in \Theta$. This inversion principle will become one of our main tools for constructing confidence sets once we have a good family of tests. In practice estimation usually comes first: good tests and confidence sets are typically built from a good estimator. The rest of the course develops tools to address all three goals, but estimation and its asymptotic theory come first.

## The Likelihood Principle

We now have a model $\{f(\,\cdot\,, \theta) : \theta \in \Theta\}$ and a sample $X_1, \ldots, X_n$, and we know what we want: an estimator, a test, or a confidence set. But the definitions above say nothing about *how* to use the data to learn about $\theta$. Looking at the raw observations $x_1, \ldots, x_n$ does not directly suggest a value of $\theta$; we need a systematic way to extract the information that the sample carries about the parameter. The likelihood principle provides exactly this: it says that all the information the data contain about $\theta$ is captured by the joint density of the sample, read as a function of $\theta$ rather than of the data.

### The Likelihood Function for a Poisson Sample

To ground the abstract principle concretely, consider a Poisson model $\{\mathrm{Poi}(\theta) : \theta \geq 0\}$. Let $X_1, \ldots, X_n$ be i.i.d.\ with $X_i \sim \mathrm{Poi}(\theta)$, and suppose we observe numerical values $X_i = x_i$ for $1 \leq i \leq n$.
The Poisson probability mass function is $f(x, \theta) = e^{-\theta} \theta^x / x!$, so the joint distribution of the sample is \begin{align*} f(x_1, \ldots, x_n; \theta) &= P_\theta(X_1 = x_1, \ldots, X_n = x_n) \\ &= \prod_{i=1}^{n} P_\theta(X_i = x_i) \quad (\text{i.i.d.}) \\ &= \prod_{i=1}^{n} \frac{e^{-\theta} \theta^{x_i}}{x_i!} \\ &= e^{-n\theta} \prod_{i=1}^{n} \frac{\theta^{x_i}}{x_i!}. \end{align*} This is the probability of observing the particular sample $(x_1, \ldots, x_n)$, but we now regard it as a function of the unknown $\theta \geq 0$ rather than of the data. This reinterpretation is the essence of the likelihood principle. [definition: Likelihood Function] Let $\{f(\cdot, \theta) : \theta \in \Theta\}$ be a statistical model and suppose we observe $n$ realisations $x_1, \ldots, x_n$ of i.i.d.\ copies $X_1, \ldots, X_n$ of $X$. The **likelihood function** is \begin{align*} L_n(\theta) = \prod_{i=1}^{n} f(x_i, \theta), \end{align*} regarded as a function $L_n : \Theta \to \mathbb{R}$. When $X$ is discrete, $f(x_i, \theta) = P_\theta(X = x_i)$; when $X$ is continuous, $f(x_i, \theta)$ is the density evaluated at $x_i$. It is helpful to think of $L_n(\cdot)$ as a random function from $\Theta$ to $\mathbb{R}$, with the randomness coming from the $X_i$. [/definition] ### Maximum Likelihood Estimation Given the likelihood function, the most natural estimation strategy is immediate: choose the parameter that makes the observed data as probable as possible. But this raises a question the definition alone does not answer — is such a maximiser guaranteed to exist? Is it unique? The answer to both is no in general, and understanding when the MLE fails is as important as knowing when it succeeds. 
The MLE fails to exist when the likelihood is unbounded. The standard example is a two-component Gaussian mixture with unknown means and variances: centring one component at $x_1$ and sending its variance to $0$ drives $L_n \to \infty$, because the other component keeps the remaining factors bounded away from zero. In the plain normal model $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown, the same degeneracy occurs only when all observations coincide (for instance when $n = 1$); with at least two distinct observations the likelihood tends to $0$ as $\sigma^2 \to 0$ and the MLE exists. The MLE fails to be unique when the likelihood has multiple global maxima, which can happen in multimodal models or when the parameter is not identifiable (two distinct parameters produce the same distribution). These are not exotic pathologies: they appear regularly in practice, and part of the craft of parametric modelling is choosing families where the MLE exists, is unique, and is computationally tractable.

[definition: Maximum Likelihood Estimator]
A **maximum likelihood estimator** (MLE) is any $\hat{\theta} = \hat{\theta}_{\mathrm{MLE}}(X_1, \ldots, X_n) \in \Theta$ satisfying
\begin{align*}
L_n(\hat{\theta}) = \max_{\theta \in \Theta} L_n(\theta).
\end{align*}
Since $\log$ is strictly increasing, maximising $L_n$ is equivalent to maximising the **log-likelihood**
\begin{align*}
\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(x_i, \theta),
\end{align*}
or the **normalised log-likelihood** $\bar{\ell}_n(\theta) = \frac{1}{n} \ell_n(\theta)$.
[/definition]

Working with the log-likelihood is standard in practice because it converts the product $\prod_{i=1}^n f(x_i, \theta)$ into a sum, which is far more tractable analytically. The MLE is a function of the data $X_1, \ldots, X_n$ only, and can be generalised to non-i.i.d. settings whenever a joint density $f(x_1, \ldots, x_n; \theta)$ can be specified.

### Computing the MLE for the Poisson Model

Returning to the Poisson example, the log-likelihood is
\begin{align*}
\ell_n(\theta) = -n\theta + \log(\theta) \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \log(x_i!).
\end{align*}
The last term is a constant in $\theta$ and plays no role in the maximisation.
Setting the first-order condition $\ell_n'(\theta) = 0$ gives \begin{align*} -n + \frac{1}{\theta} \sum_{i=1}^{n} x_i = 0, \end{align*} with solution $\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{X}_n$, the sample mean. To confirm this is a maximum rather than a minimum, note that \begin{align*} \ell_n''(\theta) = -\frac{1}{\theta^2} \sum_{i=1}^{n} x_i < 0 \end{align*} for all $\theta > 0$ (provided not all $x_i$ are zero), confirming concavity of $\ell_n$. The edge case where all $x_i = 0$ can be handled directly: $\ell_n(\theta) = -n\theta$, which is decreasing on $[0, \infty)$, so the maximum is at $\hat{\theta} = 0$, consistent with the formula $\bar{X}_n = 0$. [example: Poisson MLE] For $X_1, \ldots, X_n$ i.i.d.\ with $X_i \sim \mathrm{Poi}(\theta)$, $\theta \geq 0$, the MLE is \begin{align*} \hat{\theta}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_n. \end{align*} The MLE is the sample mean. This is a reassuring coincidence: since $\mathbb{E}_\theta[X] = \theta$, the sample mean is a method-of-moments estimator as well. In the Poisson model, the two methods happen to agree. [/example] The Poisson example is the lucky case: the MLE turns out to be the sample mean, a familiar and interpretable quantity. This is not always so. Consider instead the uniform model $\{\mathrm{Unif}(0, \theta) : \theta > 0\}$. The joint density of $X_1, \ldots, X_n$ i.i.d.\ $\sim \mathrm{Unif}(0, \theta)$ is \begin{align*} f(x_1, \ldots, x_n; \theta) = \frac{1}{\theta^n} \mathbb{1}_{\{\max_i x_i \leq \theta\}}, \end{align*} so the likelihood is $L_n(\theta) = \theta^{-n}$ for $\theta \geq X_{(n)} := \max_i X_i$ and zero otherwise. This is a decreasing function of $\theta$ on $[X_{(n)}, \infty)$, so it is maximised at $\hat{\theta}_{\mathrm{MLE}} = X_{(n)}$, the sample maximum — not the sample mean. The log-likelihood has no interior critical point; the MLE is found by inspecting where the domain starts, not by differentiating. 
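As a numerical sanity check on the two closed forms above, the following Python sketch evaluates each log-likelihood on a grid and reports the grid maximiser next to the sample mean (Poisson) and the sample maximum (uniform). The data values, the grid, and the function names are illustrative assumptions and not part of the course material; a grid search is used only because it makes no smoothness assumption, so it also works for the uniform model where differentiation fails.

```python
import math

def poisson_loglik(theta, xs):
    """Poisson log-likelihood l_n(theta), for theta > 0."""
    n = len(xs)
    return (-n * theta + math.log(theta) * sum(xs)
            - sum(math.lgamma(x + 1) for x in xs))  # lgamma(x + 1) = log(x!)

def uniform_loglik(theta, xs):
    """Unif(0, theta) log-likelihood; -inf whenever theta < max(xs)."""
    if theta < max(xs):
        return -math.inf
    return -len(xs) * math.log(theta)

def grid_argmax(loglik, xs, grid):
    """Return the grid point with the largest log-likelihood."""
    return max(grid, key=lambda t: loglik(t, xs))

counts = [2, 0, 3, 1, 4, 2]                # hypothetical Poisson counts
draws = [0.8, 2.9, 1.7, 2.2, 0.4]          # hypothetical Unif(0, theta) draws
grid = [k / 100 for k in range(1, 1001)]   # 0.01, 0.02, ..., 10.00

poisson_hat = grid_argmax(poisson_loglik, counts, grid)
uniform_hat = grid_argmax(uniform_loglik, draws, grid)

print("Poisson: grid maximiser", poisson_hat, "sample mean", sum(counts) / len(counts))
print("Uniform: grid maximiser", uniform_hat, "sample maximum", max(draws))
```

In the Poisson case the grid maximiser sits at an interior point where the log-likelihood is flat; in the uniform case it sits at the left edge of the region where the likelihood is non-zero.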
This contrast is structurally important: the Poisson score equation has a smooth interior solution because the support of the Poisson distribution does not depend on $\theta$; for the uniform, the support boundary $\theta$ itself is the parameter, which forces a boundary optimum and renders the score equation useless.

The Poisson computation illustrates the standard strategy for regular models: write down the log-likelihood, differentiate, solve the score equation $\ell_n'(\hat{\theta}) = 0$, and verify concavity. The uniform example warns us that this strategy has limits. Subsequent chapters will formalise both cases into the theory of Fisher information and the Cramér-Rao bound, which answers how good the MLE can possibly be, and in which models those bounds apply.

Having established the foundations of statistical inference, we now turn to the fundamental principle of likelihood: finding parameter estimates that maximise the probability of observed data. The Maximum Likelihood Estimator will serve as our primary tool for estimation throughout this course.

# 2. Maximum Likelihood Estimator

## The Likelihood Function and Its Maximisers

In the first chapter we saw, through the Poisson example, why it is natural to choose an estimator by finding the parameter value that makes the observed data most probable. This chapter formalises that idea. We define the likelihood function and its logarithmic variants, state precisely what a maximum likelihood estimator is, examine the score function as the practical tool for finding it, and then study a deeper theoretical property: the expected log-likelihood is maximised at the true parameter value. This last result, proved via Jensen's inequality, gives MLE its statistical justification and connects it to the theory of Kullback–Leibler divergence.

## The Likelihood, Log-Likelihood, and Their Normalisations

Before attaching names to objects, it helps to recall what we are trying to quantify.
Given a statistical model $\{f(\,\cdot\,, \theta) : \theta \in \Theta\}$ and an observed sample $x_1, \ldots, x_n$, the joint probability (or probability density) of seeing exactly that sample is the product $\prod_{i=1}^{n} f(x_i, \theta)$, viewed as a function of the parameter $\theta$. This is the likelihood: not the probability of an event in the usual sense, but a function on the parameter space that reflects how well each value of $\theta$ accounts for the data. [definition: Likelihood Function] Let $\{f(\,\cdot\,, \theta) : \theta \in \Theta\}$ be a statistical model with p.d.f. or p.m.f. $f(x, \theta)$ for the distribution $P$ of a random variable $X$. Suppose we observe $n$ realisations $x_1, \ldots, x_n$ of i.i.d. copies $X_1, \ldots, X_n$ of $X$. The **likelihood function** $L_n : \Theta \to [0, \infty)$ is defined by \begin{align*} L_n(\theta) = \prod_{i=1}^{n} f(x_i, \theta). \end{align*} The **log-likelihood function** $\ell_n : \Theta \to \mathbb{R}$ is defined by \begin{align*} \ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(x_i, \theta). \end{align*} The **normalised log-likelihood function** $\bar{\ell}_n : \Theta \to \mathbb{R}$ is defined by \begin{align*} \bar{\ell}_n(\theta) = \frac{1}{n}\,\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f(x_i, \theta). \end{align*} [/definition] The three functions $L_n$, $\ell_n$, and $\bar{\ell}_n$ all attain their maximum at the same point, since $\log$ is strictly increasing and dividing by $n > 0$ is a positive rescaling. In practice one almost always works with $\ell_n$ or $\bar{\ell}_n$: products are numerically unstable and analytically harder to differentiate than sums. The normalised version $\bar{\ell}_n$ is especially useful when we want to compare behaviour as $n$ varies or take limits as $n \to \infty$, because it has a natural interpretation as a sample average of i.i.d. contributions $\log f(X_i, \theta)$. 
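The remark about numerical instability can be made concrete. The Python sketch below uses the $\mathcal{N}(\theta, 1)$ model with a simulated sample of 2000 points (the sample size, the seed, and the function names are arbitrary choices made for illustration) and computes the likelihood both as a raw product and as a sum of logs. Every factor in the product is below $1/\sqrt{2\pi} \approx 0.4$, so the product falls below the smallest positive floating-point number and is reported as zero, while the log-likelihood and its normalised version remain ordinary finite numbers.

```python
import math
import random

def normal_pdf(x, theta):
    """Density of N(theta, 1) evaluated at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def likelihood(theta, xs):
    """Raw product L_n(theta); underflows to 0.0 for large n."""
    out = 1.0
    for x in xs:
        out *= normal_pdf(x, theta)
    return out

def log_likelihood(theta, xs):
    """l_n(theta) as a sum of log-densities, never forming the product."""
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs)

random.seed(0)  # arbitrary seed, for reproducibility only
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]

L = likelihood(0.0, xs)
ell = log_likelihood(0.0, xs)
ell_bar = ell / len(xs)  # normalised log-likelihood

print("L_n(0)      =", L)
print("l_n(0)      =", ell)
print("l_n(0) / n  =", ell_bar)
```

The normalised value stays on the same scale whatever the sample size, which is one practical reason it is the convenient object for comparing samples of different sizes.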
## The Maximum Likelihood Estimator Given data $X_1, \ldots, X_n$ from a model $\{f(\,\cdot\,, \theta) : \theta \in \Theta\}$, the estimation problem is: which $\theta$ should we report? One might reach for simple statistics first. For counting data, the sample mean is a reasonable guess for the Poisson rate; for symmetric data, the sample mean or median might serve for a location parameter. But what do we do when the parameter is not a mean — a scale, a shape, a mixing proportion? And what principle guides us when the model has several parameters simultaneously? The sample mean has no answer to these questions. One might instead consider minimising some distance between the data and the model, such as the sum of squared deviations — but this presupposes a Euclidean geometry that has no intrinsic connection to the probability model. The likelihood approach offers a unified answer: given the data, choose the parameter value that makes the observed sample most probable. If the data really did come from the model, then the true $\theta$ ought to make the sample reasonably likely; a wildly wrong $\theta$ would assign very low probability to what we actually observed. Maximising the joint probability over $\Theta$ — the likelihood function — exploits the full structure of the probability model and is defined for any parametric family, regardless of whether the parameter has a moment interpretation. [definition: Maximum Likelihood Estimator] A **maximum likelihood estimator** (MLE) for the model $\{f(\,\cdot\,, \theta) : \theta \in \Theta\}$ is any measurable function $\hat{\theta} = \hat{\theta}_{\mathrm{MLE}}(X_1, \ldots, X_n) \in \Theta$ satisfying \begin{align*} L_n(\hat{\theta}) = \max_{\theta \in \Theta} L_n(\theta). \end{align*} Equivalently, $\hat{\theta}$ maximises $\ell_n$ or $\bar{\ell}_n$ over $\Theta$. [/definition] Several remarks are worth making immediately. The i.i.d. assumption is used only to factorise the joint distribution into a product. 
Whenever a joint p.m.f. or p.d.f. for $(X_1, \ldots, X_n)$ can be written down, even without independence or identical distributions, the MLE is defined in exactly the same way: maximise the joint likelihood over $\Theta$. The Gaussian linear model below illustrates this.

## Examples of Maximum Likelihood Estimators

The following examples illustrate three features of MLE that will recur throughout the course: how the log-likelihood is computed from a specific model, how the first-order condition $\nabla_\theta \ell_n(\theta) = 0$ (the score equation, formalised later in this chapter) is solved to find the maximiser, and what happens when the parameter is multidimensional. The first two examples involve i.i.d. observations from standard families; the third drops the identical-distribution assumption to show that the method extends further than the definition strictly requires.

[example: Poisson MLE]
Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \operatorname{Poi}(\theta)$ with $\theta \geq 0$. The p.m.f. is $f(k, \theta) = e^{-\theta} \theta^k / k!$, so the log-likelihood is
\begin{align*}
\ell_n(\theta) = \sum_{i=1}^{n} \left(-\theta + x_i \log \theta - \log(x_i!)\right) = -n\theta + \left(\sum_{i=1}^{n} x_i\right) \log \theta - \sum_{i=1}^{n} \log(x_i!).
\end{align*}
Setting the derivative equal to zero,
\begin{align*}
\ell_n'(\theta) = -n + \frac{1}{\theta} \sum_{i=1}^{n} x_i = 0 \implies \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{X}_n.
\end{align*}
Since $\ell_n''(\theta) = -(\sum_{i=1}^n x_i)/\theta^2 \leq 0$ for all $\theta > 0$, the log-likelihood is concave and the critical point is a global maximum. When all $x_i = 0$, the log-likelihood reduces to $-n\theta$, which is maximised at $\hat{\theta} = 0$, consistent with the formula. Thus $\hat{\theta}_{\mathrm{MLE}} = \bar{X}_n$, the sample mean.
[/example]

The Poisson model is one-dimensional and the log-likelihood is concave (strictly so unless every $x_i$ is zero), making the analysis clean.
The Gaussian model adds a second parameter and requires a sequential argument: optimise over $\mu$ first, then substitute and optimise over $\sigma^2$. This order is not arbitrary: for every fixed $\sigma^2$ the maximiser over $\mu$ is the same, so the mean can be estimated without reference to the variance.

[example: Gaussian MLE]
Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with parameter $\theta = (\mu, \sigma^2)^\top \in \mathbb{R} \times (0, \infty)$. The log-likelihood is
\begin{align*}
\ell_n(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.
\end{align*}
Differentiating with respect to $\mu$ and setting the result to zero gives
\begin{align*}
\frac{\partial \ell_n}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n}(x_i - \mu) = 0 \implies \hat{\mu} = \bar{X}_n.
\end{align*}
Substituting $\hat{\mu} = \bar{X}_n$ and differentiating with respect to $\sigma^2$,
\begin{align*}
\frac{\partial \ell_n}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^{n}(x_i - \bar{X}_n)^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n}(X_i - \bar{X}_n)^2.
\end{align*}
This requires that the $x_i$ are not all equal, so that $\hat{\sigma}^2 > 0$. The Hessian of $\ell_n$ at $(\hat{\mu}, \hat{\sigma}^2)$ is negative definite, confirming a local maximum. Note that $\hat{\sigma}^2$ is biased: $\mathbb{E}_\theta[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$. The MLE does not automatically produce unbiased estimators, but other desirable properties (consistency, asymptotic efficiency) will be established later in the course.
[/example]

The Poisson and Gaussian examples share a common pattern: the score equation has a unique solution, and that solution is the global maximum. In the Poisson case this follows from concavity of the log-likelihood. The Gaussian log-likelihood is not jointly concave in $(\mu, \sigma^2)$, but it is concave in $\mu$ for each fixed $\sigma^2$, and after substituting $\hat{\mu}$ it increases in $\sigma^2$ up to $\hat{\sigma}^2$ and decreases afterwards, so the conclusion is the same. These are the friendly cases. The linear model example below shows that MLE applies beyond the i.i.d.
setting, while other models (not treated here) can produce non-concave log-likelihoods with multiple local maxima, where finding the MLE becomes a computational challenge. [example: Gaussian Linear Model MLE] Consider the Gaussian linear model $Y = X\theta + \varepsilon$, where $X \in \mathbb{R}^{n \times p}$ is a known design matrix, $\theta \in \mathbb{R}^p$ is unknown, and $\varepsilon \sim \mathcal{N}(0, I_n)$. The observations $Y_i = X_i^\top \theta + \varepsilon_i$ are not identically distributed (they have different means), but they are independent, so their joint density can be written as \begin{align*} f(y_1, \ldots, y_n, \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(y_i - X_i^\top \theta)^2}{2}\right). \end{align*} The log-likelihood is \begin{align*} \ell_n(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2} \sum_{i=1}^{n}(y_i - X_i^\top \theta)^2 = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\|Y - X\theta\|^2. \end{align*} Maximising $\ell_n(\theta)$ over $\theta \in \mathbb{R}^p$ is thus equivalent to minimising $\|Y - X\theta\|^2$: the MLE is exactly the ordinary least-squares estimator. This example shows that MLE subsumes least squares as a special case under Gaussian noise, and illustrates that the i.i.d. requirement in the definition can be relaxed. [/example] ## The Score Function In the examples above, the MLE was found by setting the gradient of the log-likelihood to zero. This gradient is important enough to have its own name. [definition: Score Function] For a parameter space $\Theta \subseteq \mathbb{R}^p$ and a log-likelihood $\ell_n$ that is differentiable in $\theta$, the **score function** $S_n : \Theta \to \mathbb{R}^p$ is defined by \begin{align*} S_n(\theta) = \nabla_\theta \ell_n(\theta) = \left(\frac{\partial}{\partial \theta_1}\ell_n(\theta), \ldots, \frac{\partial}{\partial \theta_p}\ell_n(\theta)\right)^\top. \end{align*} In the i.i.d. 
case, using $\ell_n(\theta) = \sum_{i=1}^n \log f(x_i, \theta)$, this becomes \begin{align*} S_n(\theta) = \sum_{i=1}^{n} \nabla_\theta \log f(x_i, \theta). \end{align*} [/definition] A crucial point about the score function deserves emphasis. Both $\ell_n$ and $S_n$ are functions of the parameter $\theta$, while the data values $x_1, \ldots, x_n$ enter as fixed constants (after observation). Derivatives and gradients are always taken with respect to $\theta$, not with respect to the $x_i$. The randomness in $S_n(\theta)$ comes entirely from the randomness in $X_1, \ldots, X_n$; for any fixed $\theta$, $S_n(\theta)$ is a random vector. The score equation $S_n(\hat{\theta}) = 0$ is a necessary condition for $\hat{\theta}$ to maximise $\ell_n$ when $\hat{\theta}$ lies in the interior of $\Theta$. In many standard parametric families the log-likelihood is concave, making this necessary condition also sufficient, so the MLE is the unique solution of $S_n(\theta) = 0$. The course will frequently work in this setting. ## The Population Log-Likelihood and Its Maximum The normalised log-likelihood $\bar{\ell}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i, \theta)$ is a sample average of i.i.d. random variables. By the law of large numbers, it converges to its expectation as $n \to \infty$. This motivates studying the **population log-likelihood**, the deterministic limit of $\bar{\ell}_n$. [definition: Population Log-Likelihood] For a variable $X$ with distribution $P_{\theta_0}$ on $\mathcal{X} \subseteq \mathbb{R}^d$, and a model $\{f(\,\cdot\,, \theta) : \theta \in \Theta\}$ with $\mathbb{E}[|\log f(X, \theta)|] < \infty$, the **population log-likelihood** is the function $\ell : \Theta \to \mathbb{R}$ defined by \begin{align*} \ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)], \end{align*} where the expectation is taken under the true distribution $P_{\theta_0}$. 
Explicitly, \begin{align*} \ell(\theta) = \int_{\mathcal{X}} \log f(x, \theta)\, f(x, \theta_0)\, dx \end{align*} in the continuous case, and $\ell(\theta) = \sum_{x \in \mathcal{X}} \log f(x, \theta)\, f(x, \theta_0)$ in the discrete case. [/definition] The key theorem is that $\ell$ is maximised at the true parameter value $\theta_0$. This is not an assumption about the MLE — it is a general analytical fact that holds for any well-specified model. To see why this ought to be true, consider what the population log-likelihood measures: $\ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)]$ is the average log-probability the model $\theta$ assigns to data drawn from the true model $\theta_0$. If $\theta$ is close to $\theta_0$, it should assign high probability to those observations; if $\theta$ is far away, it will tend to assign low probability. The theorem makes this precise by showing the gap $\ell(\theta_0) - \ell(\theta)$ is always non-negative, with Jensen's inequality providing the quantitative bound. [quotetheorem:1838] [citeproof:1838] This theorem carries a fundamental message: if we somehow had access to the population log-likelihood $\ell$, maximising it would recover the true parameter $\theta_0$ exactly. Since $\ell$ is not observed, we substitute the empirical approximation $\bar{\ell}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(x_i, \theta)$, which is the sample mean of i.i.d. terms each with population mean $\ell(\theta)$. The inequality $\ell(\theta) \leq \ell(\theta_0)$ becomes strict when the model parametrisation is strictly identifiable, meaning $f(\,\cdot\,, \theta) = f(\,\cdot\,, \theta_0) \iff \theta = \theta_0$. Under strict identifiability, the equality case of Jensen's inequality cannot hold for $\theta \neq \theta_0$, so $\theta_0$ is the unique maximiser of $\ell$. Maximising $\bar{\ell}_n$ then approximately recovers the unique $\theta_0$. 
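As a small numerical illustration of the theorem, the sketch below evaluates the population log-likelihood for a Bernoulli model. This model is not one of the examples in the text; it is chosen only because the expectation reduces to a two-term sum. The true value `theta0 = 0.3`, the grid, and the function names are assumptions made for illustration.

```python
import math

def population_loglik(theta, theta0):
    """l(theta) = E_{theta0}[log f(X, theta)] for a Bernoulli(theta) model,
    i.e. the sum over x in {0, 1} of log f(x, theta) * f(x, theta0)."""
    return theta0 * math.log(theta) + (1.0 - theta0) * math.log(1.0 - theta)

def argmax_on_grid(theta0, grid_size=999):
    """Maximise l over the grid {1/1000, ..., 999/1000} (an illustrative choice)."""
    grid = [k / (grid_size + 1) for k in range(1, grid_size + 1)]
    return max(grid, key=lambda t: population_loglik(t, theta0))

if __name__ == "__main__":
    theta0 = 0.3
    # The maximiser over the grid is the grid point at the true parameter.
    print("argmax of l:", argmax_on_grid(theta0))
    # The gap l(theta0) - l(theta) is positive away from theta0.
    for theta in (0.1, 0.29, 0.5, 0.9):
        gap = population_loglik(theta0, theta0) - population_loglik(theta, theta0)
        print(theta, gap)
```

The Bernoulli model is strictly identifiable, so the gap is strictly positive at every grid point other than $\theta_0$, in line with the uniqueness statement above.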
## Connection to Kullback–Leibler Divergence What does it mean for one distribution to be "close" to another? The Euclidean distance between density functions is one answer, but it has no direct connection to probabilistic inference. A more natural question is: how much information is lost when we use $P_\theta$ as an approximation to the true distribution $P_{\theta_0}$? This is the question Kullback–Leibler divergence is designed to answer, and it turns out that the gap $\ell(\theta_0) - \ell(\theta)$ in the population log-likelihood theorem is precisely this quantity. [definition: Kullback–Leibler Divergence] For two distributions $P_{\theta_0}$ and $P_\theta$ on $\mathcal{X}$ with densities $f(\,\cdot\,, \theta_0)$ and $f(\,\cdot\,, \theta)$, the **Kullback–Leibler divergence** from $P_{\theta_0}$ to $P_\theta$ is \begin{align*} \operatorname{KL}(P_{\theta_0}, P_\theta) = \int_{\mathcal{X}} f(x, \theta_0) \log \frac{f(x, \theta_0)}{f(x, \theta)}\, dx. \end{align*} [/definition] By definition of the population log-likelihood, the KL divergence satisfies \begin{align*} \operatorname{KL}(P_{\theta_0}, P_\theta) = \ell(\theta_0) - \ell(\theta). \end{align*} The theorem above asserts that $\operatorname{KL}(P_{\theta_0}, P_\theta) \geq 0$ for all $\theta$, which is exactly the non-negativity of KL divergence, a standard fact in information theory. The reformulation \begin{align*} \ell(\theta) = \ell(\theta_0) - \operatorname{KL}(P_{\theta_0}, P_\theta) \end{align*} makes explicit that maximising the (population) log-likelihood over $\theta$ is equivalent to minimising the KL divergence from the true distribution $P_{\theta_0}$ to the model distribution $P_\theta$. Maximum likelihood estimation therefore has a clean information-geometric interpretation: it selects the model distribution closest to the truth in the sense of KL divergence. 
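The identity $\operatorname{KL}(P_{\theta_0}, P_\theta) = \ell(\theta_0) - \ell(\theta)$ can be checked numerically. The sketch below does this for a Poisson model, where a direct calculation gives $\operatorname{KL}(P_{\theta_0}, P_\theta) = \theta_0 \log(\theta_0/\theta) + \theta - \theta_0$. The truncation point `x_max`, the parameter values, and the function names are illustrative assumptions, not part of the course material.

```python
import math

def poisson_log_pmf(x, theta):
    """log f(x, theta) for the Poisson model."""
    return -theta + x * math.log(theta) - math.lgamma(x + 1)

def population_loglik(theta, theta0, x_max=200):
    """Truncated sum for l(theta) = E_{theta0}[log f(X, theta)].
    x_max is an illustrative cutoff, adequate only for moderate theta0."""
    return sum(
        poisson_log_pmf(x, theta) * math.exp(poisson_log_pmf(x, theta0))
        for x in range(x_max + 1)
    )

def kl_poisson(theta0, theta):
    """Closed form of KL(P_theta0, P_theta) for two Poisson distributions."""
    return theta0 * math.log(theta0 / theta) + theta - theta0

if __name__ == "__main__":
    theta0, theta = 3.0, 5.0
    gap = population_loglik(theta0, theta0) - population_loglik(theta, theta0)
    print("l(theta0) - l(theta):", gap)
    print("closed-form KL      :", kl_poisson(theta0, theta))
    # Swapping the arguments gives a different value.
    print("KL with roles swapped:", kl_poisson(theta, theta0))
```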
The KL divergence is not symmetric: $\operatorname{KL}(P_{\theta_0}, P_\theta) \neq \operatorname{KL}(P_\theta, P_{\theta_0})$ in general. It does satisfy $\operatorname{KL}(P_{\theta_0}, P_\theta) \geq 0$, with equality if and only if the densities $f(\,\cdot\,, \theta_0)$ and $f(\,\cdot\,, \theta)$ agree almost everywhere, so it behaves like a "distance" in that it vanishes only when the two distributions coincide, but it is not a metric. Its asymmetry will be relevant later when we compare different divergence-based procedures.

## The Expectation of the Score

The connection between the population log-likelihood and the score function leads to a fundamental identity. Since $\ell(\theta)$ is maximised at $\theta_0$, its gradient must vanish there whenever $\theta_0$ lies in the interior of $\Theta$. Once differentiation and expectation are exchanged, this gradient is precisely the expectation of the score.

[quotetheorem:1839]

[citeproof:1839]

In particular, setting $\theta = \theta_0$ gives $\mathbb{E}_{\theta_0}[\nabla_\theta \log f(X, \theta_0)] = 0$: the individual score contributions $\nabla_\theta \log f(X_i, \theta_0)$ are centred random vectors under the true distribution. This is consistent with the score equation $S_n(\hat{\theta}) = 0$: the MLE solves an equation whose expectation is zero at the truth, which is an essential ingredient in proving consistency and asymptotic normality.

The identity $\mathbb{E}_\theta[S_n(\theta)/n] = 0$ is the population-level counterpart of the sample score equation $S_n(\hat{\theta}) = 0$. It says that the score, evaluated at the true $\theta$, fluctuates around zero. The MLE $\hat{\theta}$ is found by forcing the sample score to be zero, and this identity tells us that we are forcing it to a value that is on average correct. This is the statistical mechanism underpinning consistency of the MLE. Moreover, the score identity is the starting point for defining Fisher information in the next chapter.
Because the score at $\theta$ is a centred random vector, its covariance matrix is the expected outer product of the score with itself; that matrix is precisely the Fisher information matrix $I(\theta)$, which governs the precision achievable by any unbiased estimator via the Cramér–Rao lower bound.

The MLE's appeal lies not just in its intuitive foundation, but in deeper mathematical properties. To understand how well MLEs perform, we must examine Fisher information, which quantifies the precision of estimation through the curvature of the likelihood surface.

# 3. Fisher Information

## Fisher Information

Chapter 3 takes up a fundamental question left open by the theory of maximum likelihood: how much information does a single observation carry about the unknown parameter? The answer, Fisher information, quantifies the precision with which $\theta$ can be estimated from data. This chapter establishes the two equivalent representations of Fisher information, proves the key tensorization property for i.i.d. samples, works out the information for the Gaussian and Poisson models explicitly, and derives the reparametrisation formula. These results prepare the ground for the Cramér–Rao lower bound in Chapter 4.

## The Score Has Zero Mean

Recall from Chapter 2 that the MLE $\hat{\theta}$ is found, in regular models, by solving the score equation $S_n(\hat{\theta}) = \nabla_\theta \ell_n(\hat{\theta}) = 0$. The expected log-likelihood $\ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)]$ is maximised at $\theta_0$, and so the same equation should hold in expectation under $\mathbb{P}_{\theta_0}$. The following theorem confirms this for a single observation.

[quotetheorem:1839]

[citeproof:1839]

Applying the theorem at $\theta = \theta_0$ gives $\mathbb{E}_{\theta_0}[\nabla_\theta \log f(X, \theta_0)] = 0$. In other words, the score evaluated at the true parameter is a centred random quantity.
Its variance is therefore a natural measure of how spread out the score is around zero — which is precisely Fisher information. ## The Fisher Information Matrix Since the score $\nabla_\theta \log f(X, \theta)$ is a centred random vector under $\mathbb{P}_\theta$, its covariance matrix captures how much it fluctuates. A score that fluctuates widely indicates a model whose log-likelihood is sharply responsive to changes in $\theta$ — and therefore one in which the data carry substantial information about the parameter. A score that barely moves indicates a flat, uninformative log-likelihood. The covariance of the score is the right way to quantify this sensitivity. [definition: Fisher Information Matrix] Let $\{f(\cdot, \theta) : \theta \in \Theta\}$ be a regular parametric model with $\Theta \subseteq \mathbb{R}^p$. The **Fisher information matrix** at $\theta \in \operatorname{int}(\Theta)$ is the $p \times p$ matrix \begin{align*} I(\theta) = \mathbb{E}_\theta\!\left[\nabla_\theta \log f(X, \theta)\, \nabla_\theta \log f(X, \theta)^\top\right], \end{align*} with entries \begin{align*} I_{ij}(\theta) = \mathbb{E}_\theta\!\left[\frac{\partial}{\partial \theta_i} \log f(X, \theta)\, \frac{\partial}{\partial \theta_j} \log f(X, \theta)\right]. \end{align*} [/definition] Since $\mathbb{E}_\theta[\nabla_\theta \log f(X,\theta)] = 0$, the matrix $I(\theta)$ is precisely the covariance matrix of the score vector under $\mathbb{P}_\theta$: \begin{align*} I(\theta) = \operatorname{Cov}_\theta(\nabla_\theta \log f(X, \theta)). \end{align*} In particular, $I(\theta)$ is symmetric and positive semi-definite. In the one-dimensional case $p = 1$, this collapses to \begin{align*} I(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{d}{d\theta} \log f(X, \theta)\right)^2\right] = \operatorname{Var}_\theta\!\left(\frac{d}{d\theta} \log f(X, \theta)\right). 
\end{align*} The quantity $I(\theta_0)$ controls the variance of $S_n(\theta_0) = \sum_{i=1}^n \nabla_\theta \log f(X_i, \theta_0)$ around zero — its mean. This is the key heuristic: a large Fisher information means the score fluctuates widely, which in turn means the score equation $S_n(\hat\theta) = 0$ pins down $\hat\theta$ close to $\theta_0$. A small Fisher information means the score is nearly flat, and $\hat\theta$ is imprecise. ## The Negative Expected Hessian Representation There is a second, equivalent way to compute the Fisher information: via the curvature of the log-likelihood. Why should curvature have anything to do with information? Consider the log-likelihood as a function of $\theta$. If the surface is sharply peaked at $\theta_0$, then moving away from $\theta_0$ causes a rapid drop in log-likelihood, and the MLE is tightly concentrated around the truth. If the surface is flat, many parameter values give nearly the same likelihood, and the MLE is poorly determined. The average curvature — formalised as the negative expected Hessian — captures precisely this distinction. [quotetheorem:1840] [citeproof:1840] The two representations of Fisher information, \begin{align*} I(\theta) = \operatorname{Cov}_\theta(\nabla_\theta \log f(X,\theta)) = -\mathbb{E}_\theta[\nabla^2_\theta \log f(X,\theta)], \end{align*} reflect two distinct perspectives on the same quantity. The **first representation** (variance of the score) emphasises the probabilistic role: $I(\theta)$ measures how much information about $\theta$ the random observation $X$ carries. A larger variance means the score function is more sensitive to changes in $\theta$, hence more informative. The **second representation** (negative expected Hessian) has a geometric flavour: it measures the average **curvature** of the log-likelihood surface near $\theta$. 
A sharply curved $\ell$ near $\theta_0$ means that $\ell$ drops off steeply away from the maximum — so $\hat\theta$, which maximises $\bar\ell_n \approx \ell$, is forced to stay close to $\theta_0$. This connection between curvature and estimation precision is made precise by the Cramér–Rao bound in Chapter 4. In dimension $p=1$, both reduce to the scalar identity \begin{align*} I(\theta) = \operatorname{Var}_\theta\!\left(\frac{d}{d\theta}\log f(X,\theta)\right) = -\mathbb{E}_\theta\!\left[\frac{d^2}{d\theta^2}\log f(X,\theta)\right]. \end{align*} ## Fisher Information for Multiple Observations In practice we observe a sample $X_1, \ldots, X_n$ rather than a single observation. The Fisher information of the full sample is defined analogously. [definition: Fisher Information for a Sample] For a random vector $X = (X_1, \ldots, X_n) \in \mathcal{X}^n$ with joint density $f(x_1, \ldots, x_n, \theta)$, the **sample Fisher information matrix** is \begin{align*} I_n(\theta) = \mathbb{E}_\theta\!\left[\nabla_\theta \log f(X_1, \ldots, X_n, \theta)\, \nabla_\theta \log f(X_1, \ldots, X_n, \theta)^\top\right]. \end{align*} [/definition] The following proposition shows that for i.i.d. data, the Fisher information scales linearly with the sample size. This reflects the intuition that each observation contributes independently to knowledge about $\theta$. [quotetheorem:1841] [citeproof:1841] The identity $I_n(\theta) = nI(\theta)$ underpins the Cramér–Rao bound. It says that collecting $n$ i.i.d. observations multiplies the available information by $n$, which translates (as we will see in Chapter 4) into a lower bound on the variance of any unbiased estimator that decreases like $1/n$. ## Fisher Information for the Gaussian and Poisson Models To make the abstract definition concrete, we compute $I(\theta)$ explicitly for the two most important models in the course. 
The calculations illustrate both representations of Fisher information and reveal a qualitative difference between the two models. ### The Gaussian Model [example: Fisher Information for the Gaussian Location Model] Consider $X \sim \mathcal{N}(\theta, \sigma^2)$ with $\theta \in \mathbb{R}$ and $\sigma^2 > 0$ known. The density is \begin{align*} f(x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \theta)^2}{2\sigma^2}\right). \end{align*} The log-density is \begin{align*} \log f(x, \theta) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\theta)^2}{2\sigma^2}. \end{align*} Differentiating with respect to $\theta$: \begin{align*} \frac{d}{d\theta}\log f(x, \theta) = \frac{x - \theta}{\sigma^2}. \end{align*} Since $X \sim \mathcal{N}(\theta, \sigma^2)$ and this score is centred, \begin{align*} I(\theta) = \operatorname{Var}_\theta\!\left(\frac{X - \theta}{\sigma^2}\right) = \frac{1}{\sigma^4}\operatorname{Var}_\theta(X - \theta) = \frac{1}{\sigma^4} \cdot \sigma^2 = \frac{1}{\sigma^2}. \end{align*} We verify this using the Hessian representation. Differentiating the score once more: \begin{align*} \frac{d^2}{d\theta^2}\log f(x, \theta) = -\frac{1}{\sigma^2}. \end{align*} This is constant (it does not depend on $x$), so \begin{align*} I(\theta) = -\mathbb{E}_\theta\!\left[-\frac{1}{\sigma^2}\right] = \frac{1}{\sigma^2}, \end{align*} confirming the result. The Fisher information $I(\theta) = 1/\sigma^2$ is constant in $\theta$: every parameter value is equally hard to estimate, and the difficulty is determined entirely by the noise level $\sigma^2$. For the full i.i.d. sample of size $n$, the sample Fisher information is $I_n(\theta) = n/\sigma^2$. [/example] ### The Poisson Model [example: Fisher Information for the Poisson Model] Consider $X \sim \operatorname{Poi}(\theta)$ with $\theta > 0$. The probability mass function is \begin{align*} f(x, \theta) = e^{-\theta} \frac{\theta^x}{x!}, \qquad x \in \{0, 1, 2, \ldots\}. 
\end{align*} The log-likelihood for a single observation is \begin{align*} \log f(x, \theta) = -\theta + x\log\theta - \log(x!). \end{align*} Differentiating with respect to $\theta$: \begin{align*} \frac{d}{d\theta}\log f(x, \theta) = -1 + \frac{x}{\theta}. \end{align*} Again the score is centred: $\mathbb{E}_\theta[X/\theta - 1] = \theta/\theta - 1 = 0$. The Fisher information is \begin{align*} I(\theta) = \operatorname{Var}_\theta\!\left(\frac{X}{\theta} - 1\right) = \frac{1}{\theta^2}\operatorname{Var}_\theta(X) = \frac{1}{\theta^2}\cdot \theta = \frac{1}{\theta}. \end{align*} We used that $\operatorname{Var}(X) = \theta$ for the Poisson distribution. Using the Hessian representation as a check, differentiate the score: \begin{align*} \frac{d^2}{d\theta^2}\log f(x, \theta) = -\frac{x}{\theta^2}. \end{align*} Taking the negative expectation: $-\mathbb{E}_\theta[-X/\theta^2] = \mathbb{E}_\theta[X]/\theta^2 = \theta/\theta^2 = 1/\theta$. This confirms $I(\theta) = 1/\theta$. Unlike the Gaussian case, the Poisson information $I(\theta) = 1/\theta$ decreases in $\theta$: larger values of $\theta$ are harder to estimate. This makes intuitive sense — when $\theta$ is large, observations of $X$ have a large absolute spread ($\sqrt{\operatorname{Var}(X)} = \sqrt{\theta}$), so individual observations are less informative per unit of $\theta$. [/example] ## The Reparametrisation Formula Often it is natural to work with a transformation $\psi = g(\theta)$ rather than with $\theta$ itself — for instance, using $\psi = \log\theta$ for a positive parameter, or $\psi = \theta^2$ for a variance. But does reparametrising change the amount of information available? It must, in general — the precision with which we can estimate $\psi$ depends on how $\psi$ relates to $\theta$. The question is how the information transforms, and the answer turns out to be a clean application of the chain rule. 
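Before the formula is stated, a quick numerical sketch can confirm that the information really does depend on the parametrisation. It uses the Poisson model from the previous example and the transformation $\psi = \log\theta$ mentioned above, and computes the variance of the score in each parametrisation by a truncated sum. The cutoff `x_max`, the value `theta = 4.0`, and the function names are assumptions made for illustration.

```python
import math

def poisson_pmf(x, theta):
    """Poisson(theta) probability of x, computed on the log scale for stability."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def score_variance(score, theta, x_max=200):
    """Variance of score(X) for X ~ Poi(theta), via a sum truncated at x_max
    (an illustrative cutoff that is adequate for moderate theta)."""
    probs = [poisson_pmf(x, theta) for x in range(x_max + 1)]
    mean = sum(p * score(x) for x, p in enumerate(probs))
    second_moment = sum(p * score(x) ** 2 for x, p in enumerate(probs))
    return second_moment - mean ** 2

def info_in_theta(theta):
    # Score in the original parametrisation: d/dtheta log f = x/theta - 1.
    return score_variance(lambda x: x / theta - 1.0, theta)

def info_in_log_theta(theta):
    # With psi = log(theta): log f = -e^psi + x*psi - log(x!),
    # so the score in psi is x - e^psi = x - theta.
    return score_variance(lambda x: x - theta, theta)

if __name__ == "__main__":
    theta = 4.0
    print(info_in_theta(theta))      # close to 1/theta = 0.25
    print(info_in_log_theta(theta))  # close to theta = 4.0
```

The two values differ by the factor $\theta^2 = (d\theta/d\psi)^2$, since $\theta = e^\psi$ gives $d\theta/d\psi = \theta$.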
[quotetheorem:1842]

[citeproof:1842]

The formula $I_\psi(\psi) = I_\theta(\theta(\psi))\,(d\theta/d\psi)^2$ has a natural interpretation. If $|d\theta/d\psi|$ is large, then small changes in $\psi$ produce large changes in $\theta$, so the model is highly sensitive to $\psi$ and the data carry more information about it. Conversely, if $|d\theta/d\psi|$ is small, the model is insensitive to changes in $\psi$ and the information is small.

A cleaner way to see this: the Cramér–Rao bound (Chapter 4) will say $\operatorname{Var}(\hat\psi) \geq 1/(n I_\psi(\psi))$ for any unbiased estimator $\hat\psi$ of $\psi$. For any unbiased estimator $\hat\theta$ of $\theta$, the delta method gives $\operatorname{Var}(g(\hat\theta)) \approx (g'(\theta))^2 \operatorname{Var}(\hat\theta) \geq (g'(\theta))^2/(nI_\theta(\theta))$. Equating the two lower bounds and using $g'(\theta) = (d\theta/d\psi)^{-1}$ gives exactly the reparametrisation formula, confirming that the bound transforms consistently.

The multivariate extension is: if $\psi = g(\theta)$ with $g: \mathbb{R}^p \to \mathbb{R}^p$ smooth and invertible, and $J = d\theta/d\psi \in \mathbb{R}^{p \times p}$ denotes the Jacobian matrix of the inverse map $\psi \mapsto \theta(\psi)$, then
\begin{align*}
I_\psi(\psi) = J^\top\, I_\theta(\theta(\psi))\, J.
\end{align*}

[example: Reparametrising the Exponential Model]
Let $X \sim \operatorname{Exp}(\theta)$ with rate parameter $\theta > 0$, so $f(x, \theta) = \theta e^{-\theta x}$ for $x > 0$. The log-density is $\log f(x,\theta) = \log\theta - \theta x$, and
\begin{align*}
\frac{d}{d\theta}\log f(x, \theta) = \frac{1}{\theta} - x, \qquad \frac{d^2}{d\theta^2}\log f(x, \theta) = -\frac{1}{\theta^2}.
\end{align*}
By the negative expected Hessian representation, the Fisher information in the rate parametrisation is $I_\theta(\theta) = 1/\theta^2$. Now consider reparametrising by the mean: $\mu = 1/\theta$, so $\theta(\mu) = 1/\mu$ and $d\theta/d\mu = -1/\mu^2$.
The reparametrisation formula gives \begin{align*} I_\mu(\mu) = I_\theta(1/\mu)\cdot\left(\frac{d\theta}{d\mu}\right)^2 = \frac{1}{(1/\mu)^2}\cdot \frac{1}{\mu^4} = \mu^2 \cdot \frac{1}{\mu^4} = \frac{1}{\mu^2}. \end{align*} This can be verified directly: in the mean parametrisation, $f(x,\mu) = (1/\mu)e^{-x/\mu}$, and one computes $I_\mu(\mu) = 1/\mu^2$ by the same method as above. [/example] Fisher Information bounds the variance of any unbiased estimator through the Cramér-Rao inequality. This bound reveals fundamental limits on what we can learn from data, and the convergence rates it predicts lead naturally to asymptotic inference. # 4. Cramér-Rao Bound and Convergence ## The Cramér-Rao Lower Bound Building on the Fisher information developed in Chapter 3, we now arrive at one of the central results of classical estimation theory: the Cramér-Rao lower bound. This inequality gives a universal floor for the variance of any unbiased estimator, and thereby formalises the intuition that higher Fisher information means the parameter is easier to estimate. The chapter then turns to convergence concepts — almost sure, in probability, and in distribution — which provide the language needed to discuss the asymptotic behaviour of the MLE in later chapters. ## The Scalar Cramér-Rao Inequality We begin with a question: how good can an unbiased estimator possibly be? Unbiasedness guarantees that on average we hit the target, but it says nothing about the spread of our estimates around $\theta$. The Cramér-Rao bound answers this question by providing a lower bound on the variance in terms of the Fisher information $I(\theta)$. [quotetheorem:1843] [citeproof:1843] The regularity assumption is precisely the condition that integration and differentiation with respect to $\theta$ can be exchanged — that is, $\frac{d}{d\theta} \int_{\mathcal{X}} g(x) f(x, \theta)\, dx = \int_{\mathcal{X}} g(x) \frac{d}{d\theta} f(x, \theta)\, dx$ for appropriate functions $g$. 
The precise conditions that guarantee this (dominated convergence, smoothness of $f$ in $\theta$) belong to the territory of Probability and Measure and are not examinable here. For all the models we encounter, this exchange is valid. The bound has a natural interpretation: $n I(\theta)$ is the total Fisher information in the sample, and the variance of any unbiased estimator is at least its reciprocal. More information means we can potentially estimate more precisely; the bound quantifies the wall we cannot break through. [definition: Efficient Estimator] An unbiased estimator $\tilde{\theta}$ is called **efficient** (or **Cramér-Rao efficient**) at $\theta$ if equality holds in the Cramér-Rao bound: \begin{align*} \operatorname{Var}_\theta(\tilde{\theta}) = \frac{1}{n I(\theta)}. \end{align*} [/definition] An efficient estimator achieves the smallest possible variance among all unbiased estimators, at that particular value of $\theta$. The bound is not always tight — there exist models where no unbiased estimator achieves it — so the existence of an efficient estimator is a special property of the model. A natural question is: which estimators are efficient, and how do we find them? Tracing back through the proof of the Cramér-Rao bound, equality in Cauchy-Schwarz holds if and only if $\tilde{\theta}(X) - \theta$ is proportional to the score $\frac{d}{d\theta} \log f(X, \theta)$, i.e., the estimator is an affine function of the score. This severely restricts which models admit an efficient estimator — but exponential family models satisfy precisely this condition, and in such models the MLE is typically efficient. The examples below illustrate this. [example: MLE for the Gaussian Model] Let $X_1, \ldots, X_n$ be i.i.d. $\mathcal{N}(\theta, 1)$ with $\theta \in \mathbb{R}$ unknown. The sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ is unbiased: $\mathbb{E}_\theta[\bar{X}_n] = \theta$. Its variance is $\operatorname{Var}_\theta(\bar{X}_n) = 1/n$. 
We computed in Chapter 3 that for $\mathcal{N}(\theta, 1)$, the Fisher information is $I(\theta) = 1$, so the Cramér-Rao bound gives $\operatorname{Var}_\theta(\tilde{\theta}) \geq 1/(n \cdot 1) = 1/n$. Since $\bar{X}_n$ achieves this bound exactly, the sample mean is efficient for estimating the mean of a Gaussian with known variance. [/example] [example: MLE for the Poisson Model] Let $X_1, \ldots, X_n$ be i.i.d. $\operatorname{Poi}(\theta)$ with $\theta > 0$. The sample mean $\bar{X}_n$ is again unbiased and $\operatorname{Var}_\theta(\bar{X}_n) = \theta/n$. For the Poisson model, the log-likelihood for one observation is $\log f(x, \theta) = -\theta + x \log \theta - \log(x!)$, so $\frac{d}{d\theta} \log f(x, \theta) = -1 + x/\theta$. The Fisher information is \begin{align*} I(\theta) = \mathbb{E}_\theta\!\left[\left(-1 + \frac{X}{\theta}\right)^2\right] = \frac{1}{\theta^2} \operatorname{Var}_\theta(X) = \frac{1}{\theta^2} \cdot \theta = \frac{1}{\theta}, \end{align*} using $\operatorname{Var}_\theta(X) = \theta$ for the Poisson distribution. The Cramér-Rao bound therefore gives $\operatorname{Var}_\theta(\tilde{\theta}) \geq \theta/n$, and $\bar{X}_n$ attains this bound. The sample mean is efficient for estimating the rate of a Poisson model. [/example] These two examples reveal a pattern: the MLE often achieves the Cramér-Rao bound, at least asymptotically. The general version of this statement — that the MLE is asymptotically efficient — will be proved in later chapters once we have the convergence machinery. ## The Multivariate Cramér-Rao Bound The scalar Cramér-Rao bound gives a single number — a lower bound on the variance of any unbiased estimator of $\theta \in \mathbb{R}$. For a vector parameter $\theta \in \mathbb{R}^p$, this is no longer enough: there are $p$ components to estimate, and their estimators may be correlated. A single number cannot capture the full picture of estimation difficulty across all directions in parameter space. 
What we need is a matrix, specifically the inverse Fisher information matrix, that bounds not the variance of $\tilde{\theta}$ as a whole, but the variance of any unbiased estimator of a real-valued functional of $\theta$.

Let $\theta \in \Theta \subseteq \mathbb{R}^p$ and let $\Phi : \Theta \to \mathbb{R}$ be a differentiable functional. An unbiased estimator $\tilde{\Phi}$ of $\Phi(\theta)$ is any statistic satisfying $\mathbb{E}_\theta[\tilde{\Phi}] = \Phi(\theta)$ for all $\theta \in \Theta$.

[quotetheorem:1844]

The proof follows the same Cauchy-Schwarz strategy as the scalar case, applied to the pair $(\tilde{\Phi}, \alpha^\top Z_n)$ for a suitably chosen direction $\alpha \in \mathbb{R}^p$, and then optimising over $\alpha$. The inverse Fisher information matrix $I(\theta)^{-1}$ plays the role that $1/I(\theta)$ played before.

A concrete special case arises by taking $\Phi(\theta) = \alpha^\top \theta = \sum_{i=1}^p \alpha_i \theta_i$ for a fixed direction $\alpha \in \mathbb{R}^p$. Then $\nabla_\theta \Phi(\theta) = \alpha$, and the bound becomes
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{1}{n} \alpha^\top I(\theta)^{-1} \alpha.
\end{align*}

[example: Two-Dimensional Gaussian]
Let $(X_1, X_2)^\top \sim \mathcal{N}(\theta, \Sigma)$ with $\theta = (\theta_1, \theta_2)^\top$ and $\Sigma$ a known positive definite matrix, with $n = 1$ observation.

**Case 1: $\theta_2$ known.** The model is one-dimensional in $\theta_1$. One checks that $I_1(\theta_1) = (\Sigma^{-1})_{11}$, the $(1,1)$ entry of $\Sigma^{-1}$, so the Cramér-Rao bound gives $\operatorname{Var}_\theta(\tilde{\theta}_1) \geq 1/(\Sigma^{-1})_{11} = \Sigma_{11} - \Sigma_{12}^2/\Sigma_{22}$.

**Case 2: $\theta_2$ unknown.** Apply the multivariate bound with $\Phi(\theta) = \theta_1$, so $\nabla_\theta \Phi = (1, 0)^\top$. Here the full Fisher information matrix is $I(\theta) = \Sigma^{-1}$, so $\operatorname{Var}_\theta(\tilde{\Phi}) \geq (I(\theta)^{-1})_{11} = \Sigma_{11}$.
When $\Sigma$ is diagonal, $X_1$ and $X_2$ are independent: the additional unknown parameter $\theta_2$ does not affect the estimation of $\theta_1$, and both cases give the same bound. When $\Sigma$ is not diagonal, knowing $\theta_2$ can allow more precise estimation of $\theta_1$ through their correlation structure. [/example] ## Bias, Variance, and Mean Squared Error The Cramér-Rao bound is a sharp and satisfying result, but it has a built-in limitation: it only applies to unbiased estimators. What if we allow some bias? If we accept that our estimator is slightly off on average, we might be able to reduce its variance substantially, and the overall error — measured by how far we typically land from $\theta$ — could decrease. This tradeoff between bias and variance is one of the central tensions in statistical estimation, and to reason about it we need a single criterion that accounts for both. That criterion is the mean squared error. [definition: Bias and Mean Squared Error] Let $\tilde{\theta}$ be an estimator of $\theta$. The **bias** of $\tilde{\theta}$ at $\theta$ is \begin{align*} \operatorname{bias}_\theta(\tilde{\theta}) = \mathbb{E}_\theta[\tilde{\theta}] - \theta. \end{align*} The **mean squared error** (MSE) of $\tilde{\theta}$ at $\theta$ is \begin{align*} \operatorname{MSE}_\theta(\tilde{\theta}) = \mathbb{E}_\theta[(\tilde{\theta} - \theta)^2]. \end{align*} [/definition] The fundamental decomposition of MSE into bias and variance components is essential for comparing estimators: [quotetheorem:1845] [citeproof:1845] The decomposition shows that reducing bias can increase variance and vice versa. An unbiased estimator has MSE equal to variance, so the Cramér-Rao bound gives a lower bound for its MSE. However, a biased estimator with smaller variance may achieve a lower MSE in some range of $\theta$ — the so-called bias-variance tradeoff. This tradeoff is pervasive in statistics and machine learning. 
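The tradeoff can be made concrete with a small simulation. The sketch below considers i.i.d. $\mathcal{N}(\theta, 1)$ data and the shrunken estimator $c\,\bar{X}_n$, which has bias $(c-1)\theta$ and variance $c^2/n$. This estimator, the constants `c = 0.8` and `n = 10`, and the function names are illustrative assumptions and do not come from the course text.

```python
import random
import statistics

def simulate_shrunk_mean(theta, n, c, reps=20000, seed=0):
    """Monte Carlo bias, variance and MSE of the estimator c * (sample mean)
    for i.i.d. N(theta, 1) data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample_mean = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        estimates.append(c * sample_mean)
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - theta
    variance = statistics.fmean([(e - mean_estimate) ** 2 for e in estimates])
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return bias, variance, mse

def exact_mse(theta, n, c):
    """Closed form: variance c^2 / n plus squared bias ((c - 1) * theta)^2."""
    return c ** 2 / n + ((c - 1.0) * theta) ** 2

if __name__ == "__main__":
    n = 10
    for theta in (0.5, 2.0):
        for c in (1.0, 0.8):
            bias, var, mse = simulate_shrunk_mean(theta, n, c)
            # The simulated MSE equals bias^2 + variance up to rounding error.
            print(theta, c, round(bias ** 2 + var, 4), round(mse, 4),
                  round(exact_mse(theta, n, c), 4))
```

With these choices the closed form gives an MSE of $0.074$ for the shrunken estimator at $\theta = 0.5$, below the value $1/n = 0.1$ of the unbiased sample mean, but $0.224$ at $\theta = 2$, where the squared bias dominates.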
The bias-variance decomposition motivates studying not just bias and variance separately, but the MSE as a combined criterion, especially in situations where perfect unbiasedness is difficult to achieve.

## Stochastic Convergence Concepts

With the Cramér-Rao bound established, we now develop the convergence machinery needed to study how estimators behave as the sample size $n \to \infty$. Recall that an estimator $\tilde{\theta}_n = \tilde{\theta}(X_1, \ldots, X_n)$ is a sequence of random variables indexed by $n$. Asking whether $\tilde{\theta}_n \to \theta$ requires specifying what kind of convergence we mean, since convergence of random variables admits several distinct formulations of differing strength.

Not all estimators are unbiased for finite $n$, but one might hope that $\mathbb{E}_\theta[\tilde{\theta}_n] \to \theta$ as $n \to \infty$. A different and more directly useful property, **consistency**, asks that the estimator itself, rather than its expectation, converges to $\theta$ in a probabilistic sense. There are two primary notions of this convergence.

[definition: Almost Sure Convergence and Convergence in Probability]
Let $(X_n)_{n \geq 1}$ and $X$ be random vectors in $\mathbb{R}^k$, all defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

(i) We say $X_n$ **converges almost surely** to $X$, written $X_n \xrightarrow{a.s.} X$, if
\begin{align*}
\mathbb{P}\!\left(\{\omega \in \Omega : \|X_n(\omega) - X(\omega)\| \to 0 \text{ as } n \to \infty\}\right) = 1.
\end{align*}

(ii) We say $X_n$ **converges in probability** to $X$, written $X_n \xrightarrow{\mathbb{P}} X$, if for all $\varepsilon > 0$,
\begin{align*}
\mathbb{P}(\|X_n - X\| > \varepsilon) \to 0 \text{ as } n \to \infty.
\end{align*}
[/definition]

Almost sure convergence is the stronger of the two notions: it says that for almost every realisation of the underlying randomness, the sequence of numbers $\|X_n(\omega) - X(\omega)\|$ converges to zero. Convergence in probability is weaker: it only requires that large deviations become increasingly unlikely, without asserting that any individual trajectory converges.

For vector-valued random variables, both notions reduce to componentwise convergence. For almost sure convergence this is immediate from the definition. For convergence in probability it follows from the equivalence of norms in $\mathbb{R}^k$.

Convergence in distribution is a yet weaker notion, appropriate when $X_n$ and $X$ may not even be defined on the same probability space.

[definition: Convergence in Distribution]
Let $(X_n)_{n \geq 1}$ and $X$ be random vectors in $\mathbb{R}^k$. We say $X_n$ **converges in distribution** to $X$, written $X_n \xrightarrow{d} X$, if
\begin{align*}
\mathbb{P}(X_n \leq t) \to \mathbb{P}(X \leq t) \quad \text{as } n \to \infty,
\end{align*}
for all $t = (t_1, \ldots, t_k) \in \mathbb{R}^k$ at which the map $t \mapsto \mathbb{P}(X \leq t)$ is continuous. Here $\{X \leq t\}$ denotes the event $\{X^{(1)} \leq t_1, \ldots, X^{(k)} \leq t_k\}$.
[/definition]

The continuity restriction at $t$ is necessary: for a discrete random variable $X$, the distribution function has jump discontinuities, and we cannot expect pointwise convergence at jump points from a sequence of continuous distributions. For instance, the constants $X_n = 1/n$ ought to converge to $X = 0$, yet $\mathbb{P}(X_n \leq 0) = 0$ for every $n$ while $\mathbb{P}(X \leq 0) = 1$.

The three notions are ordered by strength. The following implication chain holds in general, but neither implication can be reversed in general.

[quotetheorem:1846]

The proof that almost sure convergence implies convergence in probability follows by applying the dominated convergence theorem to the indicators $\mathbf{1}_{\{\|X_n - X\| > \varepsilon\}}$, which are bounded by $1$ and converge to zero outside the set of $\omega$ where convergence fails, a set of probability zero.
The second implication, that convergence in probability implies convergence in distribution, rests on a sandwich argument. In the scalar case, for any $\delta > 0$,
\begin{align*}
\mathbb{P}(X \leq t - \delta) - \mathbb{P}(|X_n - X| > \delta) \leq \mathbb{P}(X_n \leq t) \leq \mathbb{P}(X \leq t + \delta) + \mathbb{P}(|X_n - X| > \delta),
\end{align*}
and letting $n \to \infty$ and then $\delta \to 0$ gives convergence of the CDFs at every continuity point $t$ of the limiting CDF. These results are established rigorously in Probability and Measure.

A counterexample showing that the second implication cannot be reversed: let $X \sim \mathcal{N}(0, 1)$ and set $X_n = -X$ for all $n$. Then $X_n \xrightarrow{d} X$ since $-X$ and $X$ have the same distribution. But $\|X_n - X\| = 2|X| > \varepsilon$ with a positive probability that does not depend on $n$, so $X_n$ does not converge to $X$ in probability. Convergence in distribution is purely a property of the laws of $X_n$, not of the random variables themselves.

## Stability Under Continuous Maps

If an estimator $\tilde{\theta}_n$ converges to $\theta$, does a function of it also converge? For instance, if we know $\tilde{\theta}_n \xrightarrow{\mathbb{P}} \theta$, does $\tilde{\theta}_n^2 \xrightarrow{\mathbb{P}} \theta^2$? The answer depends on the function. For continuous functions, the answer is yes, and this turns out to be a powerful and frequently used fact in asymptotic statistics.

[quotetheorem:1847]

The proof uses continuity of $g$ together with the definitions of each convergence mode. For almost sure convergence: on the full-probability event where $X_n(\omega) \to X(\omega)$, continuity of $g$ gives $g(X_n(\omega)) \to g(X(\omega))$. For convergence in probability and distribution, appropriate approximations with compact sets are used.

The continuous mapping theorem is indispensable in asymptotic statistics. For example, if $\tilde{\theta}_n \xrightarrow{\mathbb{P}} \theta$ and we are interested in estimating $\theta^2$, the theorem immediately gives $\tilde{\theta}_n^2 \xrightarrow{\mathbb{P}} \theta^2$ without any additional work.

The continuity assumption is essential and cannot be dropped.
If $g$ has a discontinuity at the limiting value, convergence in distribution can fail even when $X_n \xrightarrow{d} X$. A concrete example: let $X_n \xrightarrow{d} X$ where $X$ is uniform on $[0, 1]$, and define $g(x) = \mathbf{1}_{\{x \geq 1/2\}}$. The function $g$ is discontinuous at $x = 1/2$. If the distributions of $X_n$ concentrate mass differently near the discontinuity point — for instance, if $X_n$ has an atom at $1/2$ while $X$ does not — then the distributions of $g(X_n)$ need not converge to that of $g(X)$. The precise condition under which the conclusion still holds is that the set of discontinuities of $g$ has probability zero under the limiting law $X$: if $\mathbb{P}(X \in \operatorname{disc}(g)) = 0$, convergence in distribution is preserved. When applying transformations in asymptotic arguments, this is the condition to verify. ## The Central Limit Theorem Convergence in distribution has a canonical and fundamental example: the central limit theorem (CLT). Rather than asking whether a sequence of random variables converges to a fixed constant (as in consistency), the CLT asks about the fluctuations around the mean once we rescale appropriately. [quotetheorem:1848] The proof proceeds via characteristic functions. One shows that $\phi_{\sqrt{n}(\bar{X}_n - \mu)}(u) = (\phi_{X_1 - \mu}(u/\sqrt{n}))^n$. Expanding $\phi_{X_1 - \mu}$ around zero using the moment conditions and taking $n \to \infty$ gives $e^{-\sigma^2 u^2/2}$, which is the characteristic function of $\mathcal{N}(0, \sigma^2)$. Convergence of characteristic functions then implies convergence in distribution by Lévy's continuity theorem. The CLT is a statement about the centred and rescaled sample mean $\sqrt{n}(\bar{X}_n - \mu)$, not about $\bar{X}_n$ itself. The Law of Large Numbers tells us $\bar{X}_n \xrightarrow{\mathbb{P}} \mu$, so $\bar{X}_n - \mu \to 0$ in probability. 
The CLT zooms in on this vanishing difference by multiplying by $\sqrt{n}$: the product $\sqrt{n}(\bar{X}_n - \mu)$ neither goes to zero nor diverges, but converges in distribution to a Gaussian. This suggests a practical approximation: for large $n$, \begin{align*} \bar{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \end{align*} in the sense that the distribution of $\bar{X}_n$ is approximately Gaussian with the stated mean and variance. This approximation underlies confidence intervals and hypothesis tests throughout statistics. [example: CLT for Bernoulli Sums] Let $X_1, \ldots, X_n$ be i.i.d. $\operatorname{Bernoulli}(p)$ with $p \in (0,1)$. Then $\mu = p$ and $\sigma^2 = p(1-p)$. The CLT gives \begin{align*} \sqrt{n}\,(\bar{X}_n - p) \xrightarrow{d} \mathcal{N}(0, p(1-p)). \end{align*} Equivalently, $\sqrt{n}(\bar{X}_n - p) / \sqrt{p(1-p)} \xrightarrow{d} \mathcal{N}(0, 1)$. Here $\bar{X}_n = \frac{1}{n}\sum X_i$ is the sample proportion of successes. For large $n$, the number of successes $\sum X_i = n\bar{X}_n$ is approximately $\mathcal{N}(np, np(1-p))$. To see that the Fisher information is consistent with this, recall from Chapter 3 that for a $\operatorname{Bernoulli}(p)$ model, $I(p) = 1/(p(1-p))$. The asymptotic variance of $\bar{X}_n$ is $\sigma^2/n = p(1-p)/n = 1/(nI(p))$, exactly matching the Cramér-Rao lower bound. The sample proportion is therefore efficient. [/example] The connection between the CLT and the Cramér-Rao bound is not coincidental. For the MLE in regular models, a general asymptotic efficiency result — which we will develop later — shows that $\sqrt{n}(\hat{\theta}_{\mathrm{MLE}} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$, precisely achieving the Cramér-Rao lower bound in the limit. The convergence concepts introduced in this chapter are the foundation for stating and proving that result. 
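The Bernoulli example can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the values $p = 0.3$, $n = 400$, the number of replications, and the helper name `simulate_scaled_means` are arbitrary choices made for illustration. It draws many samples, computes $\sqrt{n}(\bar{X}_n - p)$ for each, and compares the empirical variance with $p(1-p) = 1/I(p)$.

```python
import math
import random
import statistics


def simulate_scaled_means(p, n, reps, seed=0):
    """Return `reps` draws of sqrt(n) * (sample proportion - p),
    each computed from n i.i.d. Bernoulli(p) observations."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        draws.append(math.sqrt(n) * (successes / n - p))
    return draws


# Illustrative settings (assumptions, not taken from the text).
p, n, reps = 0.3, 400, 2000
draws = simulate_scaled_means(p, n, reps)

emp_var = statistics.pvariance(draws)   # empirical variance of sqrt(n)(Xbar - p)
target_var = p * (1 - p)                # CLT limit variance, equal to 1 / I(p)

# Share of draws within 1.96 limiting standard deviations of zero;
# the Gaussian approximation suggests a value near 0.95.
coverage = sum(abs(d) <= 1.96 * math.sqrt(target_var) for d in draws) / reps

print(f"empirical variance: {emp_var:.4f}  (limit: {target_var:.4f})")
print(f"share within 1.96 sd: {coverage:.3f}")
```

Because $\bar{X}_n$ takes values on a grid of width $1/n$, the share within $1.96$ standard deviations will typically sit slightly off $0.95$ at moderate $n$; the CLT only guarantees agreement in the limit.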
The Central Limit Theorem transforms these individual data points into powerful distributional results, allowing us to construct confidence intervals and hypothesis tests that rely on asymptotic normality rather than exact distributions. # 5. Central Limit Theorem and Inference ## The Law of Large Numbers and Convergence Tools The preceding chapters developed the maximum likelihood estimator and established the Cramér–Rao lower bound, which says that no unbiased estimator can have variance below $I(\theta_0)^{-1}$. This chapter asks: does the MLE actually attain this bound, at least asymptotically? The answer requires understanding the stochastic fluctuations of the sample mean — which is precisely what the central limit theorem describes. Along the way, we collect several convergence results that are indispensable tools for asymptotic analysis. ### Boundedness in Probability How do we ensure that a sequence of random variables does not escape to infinity, even if it fails to converge? Without a notion of probabilistic boundedness, we cannot guarantee that distributional limits stay well-defined or that estimators remain in a compact region. Before stating the main convergence theorems, it is useful to formalise this constraint. [definition: Bounded in Probability] A sequence of random vectors $(X_n)_{n \geq 0}$ in $\mathbb{R}^k$ is **bounded in probability**, written $X_n = O_\mathbb{P}(1)$, if for every $\varepsilon > 0$ there exists $M(\varepsilon) < \infty$ such that for all $n \geq 0$, \begin{align*} \mathbb{P}\bigl(\|X_n\| > M(\varepsilon)\bigr) < \varepsilon. \end{align*} More generally, $X_n = O_\mathbb{P}(r_n)$ for a positive sequence $r_n$ if $X_n / r_n = O_\mathbb{P}(1)$. [/definition] The notation $O_\mathbb{P}(1)$ is the probabilistic counterpart of ordinary boundedness: rather than requiring $\|X_n\| \leq M$ for all $n$, we require only that the probability of $\|X_n\|$ exceeding any fixed threshold can be made uniformly small. 
This is a useful sanity check — any sequence that converges in distribution to a proper random variable cannot escape to infinity. [quotetheorem:1849] This is immediate from the definition of convergence in distribution: the tail probabilities $\mathbb{P}(\|X_n\| > M)$ are eventually close to $\mathbb{P}(\|X\| > M)$, which can be made small by choosing $M$ large since $X$ has a proper distribution. The proposition will be applied as a corollary of the CLT: any sequence normalised by $\sqrt{n}$ from an i.i.d. square-integrable sample is $O_\mathbb{P}(1)$, so the unnormalised average deviates from its mean by $O_\mathbb{P}(1/\sqrt{n})$. ### Slutsky's Lemma Suppose we have established that $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, but $\sigma^2$ is unknown and must be estimated by $\hat{\sigma}_n^2$. Can we simply substitute $\hat{\sigma}_n$ for $\sigma$ without disturbing the limiting distribution? The answer is yes — but only because $\hat{\sigma}_n$ converges to a constant, not to a random variable. Slutsky's lemma makes this substitution principle precise. [quotetheorem:1850] Part (a) is not an independent claim but an observation: convergence in distribution to a constant is equivalent to convergence in probability to that constant, since the limiting distribution is a point mass. Parts (b) and (c) follow from the joint convergence $(X_n, Y_n) \xrightarrow{d} (X, c)$ and the continuous mapping theorem applied to addition and multiplication. Part (d) is particularly powerful in statistics: it allows the substitution of a consistent estimator for an unknown matrix (such as an empirical covariance or an estimated Fisher information) into an expression involving a distributional limit, without disturbing the limiting distribution. It is essential that $Y_n$ converges to a constant $c$, not merely to a random variable. 
If $Y_n \xrightarrow{d} Y$ where $Y$ is genuinely random, then the pair $(X_n, Y_n)$ need not converge jointly to $(X, Y)$: the marginal distributional limits do not determine the joint limit without an assumption on the joint structure. Slutsky's lemma exploits the fact that convergence to a constant $c$ forces $Y_n \xrightarrow{\mathbb{P}} c$, which is enough to control the joint behaviour. The analogous statement for two sequences both converging in distribution to non-degenerate limits is false in general.

### The Law of Large Numbers

The law of large numbers is the fundamental justification for using the sample mean as an estimator of the population mean. It comes in two forms: a weak version, which follows from elementary moment inequalities, and a strong version, which requires deeper measure-theoretic tools.

[quotetheorem:1851]

[citeproof:1851]

The finite-variance hypothesis is exactly what the Chebyshev argument needs, but it is not necessary for the conclusion: the weak law continues to hold whenever the mean is finite, as the strong law below implies. What cannot be dropped is the existence of the mean. If $X$ is Cauchy-distributed (so that it has no finite mean at all, let alone finite variance), then $\bar{X}_n$ does not converge in probability to any finite limit; in fact the Cauchy distribution is stable under averaging, and $\bar{X}_n$ has the same Cauchy distribution as $X_1$ regardless of $n$. The weak law can therefore fail dramatically without first-moment control.

The strong law of large numbers strengthens the conclusion to almost sure convergence and simultaneously weakens the assumption: finite mean $\mathbb{E}[\|X\|] < \infty$ is sufficient and also necessary for the almost sure convergence of the averages to a finite limit. The price for this sharper conclusion is a more sophisticated proof.

[quotetheorem:1852]

The proof relies on the Kolmogorov zero-one law and truncation arguments developed in measure-theoretic probability; the result is admitted from that theory.
The key distinction from the weak law is that almost sure convergence is a path-by-path statement: for almost every realisation of the infinite sequence $(X_i)_{i \geq 1}$, the averages converge to $\mathbb{E}[X]$. ## The Central Limit Theorem The law of large numbers tells us that $\bar{X}_n$ converges to $\mathbb{E}[X]$, but it is silent about the rate and the shape of the fluctuations. The central limit theorem answers both questions at once: the deviations $\bar{X}_n - \mathbb{E}[X]$ are of order $1/\sqrt{n}$, and after rescaling by $\sqrt{n}$ they converge in distribution to a Gaussian. ### The Univariate Central Limit Theorem [quotetheorem:1848] The standard proof proceeds via characteristic functions. One shows that the characteristic function of $\sqrt{n}(\bar{X}_n - \mathbb{E}[X])$ converges pointwise to $e^{-\sigma^2 u^2/2}$, the characteristic function of $\mathcal{N}(0,\sigma^2)$. The key step is a second-order Taylor expansion of $\log \mathbb{E}[e^{iu X/\sqrt{n}}]$, using the finite-variance assumption to control the remainder. Pointwise convergence of characteristic functions implies convergence in distribution by Lévy's continuity theorem. As an immediate corollary, in the $O_\mathbb{P}$ language: under the conditions of the CLT, $\bar{X}_n - \mathbb{E}[X] = O_\mathbb{P}(1/\sqrt{n})$. The fluctuations of the sample mean around its limit are of order $1/\sqrt{n}$. To halve the width of any confidence interval derived from the CLT, one must quadruple the sample size. The finite-variance assumption $\sigma^2 < \infty$ is indispensable: it controls the second-order Taylor expansion of the characteristic function that drives the proof. Without it, the rescaled averages need not converge to a Gaussian at all — under certain heavier tails one obtains a stable distribution instead, with a different normalisation rate. 
The CLT also says nothing about the speed of convergence to normality; for this one needs Berry–Esseen-type bounds, which show that the approximation error in the distribution function is $O(1/\sqrt{n})$ when a third moment is available. ### Multivariate Normal Distributions What does it mean for a random vector in $\mathbb{R}^k$ to be Gaussian when $k > 1$? In one dimension the normal is characterised by its density, but in higher dimensions the natural density formula only applies when the covariance matrix is invertible, leaving out all degenerate cases — such as a vector supported on a hyperplane. We need a coordinate-free characterisation that works for singular covariance matrices and is preserved under linear maps. To state the multivariate CLT, we first recall this characterisation. [definition: Multivariate Normal Distribution] A random variable $X \in \mathbb{R}^k$ has a **normal distribution** with mean $\mu \in \mathbb{R}^k$ and $k \times k$ covariance matrix $\Sigma$, written $X \sim \mathcal{N}(\mu, \Sigma)$, if either of the following equivalent conditions holds: (i) When $\Sigma$ is positive definite, its probability density function is \begin{align*} f(x) = \frac{1}{(2\pi)^{k/2}\,|\det(\Sigma)|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right). \end{align*} (ii) For every $\alpha \in \mathbb{R}^k$, the scalar random variable $\alpha^\top X \sim \mathcal{N}(\alpha^\top \mu,\; \alpha^\top \Sigma\, \alpha)$. Condition (ii) extends the definition to singular covariance matrices $\Sigma$ and is the more fundamental characterisation. [/definition] The characterisation via linear forms says that every one-dimensional projection of $X$ is Gaussian. This makes many properties of the multivariate normal transparent: in particular, it implies that the distribution of $X$ is entirely determined by its mean and covariance. 
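To make characterisation (ii) concrete in a degenerate case, here is a small sketch (not from the original notes) in which $X = \mu + AZ$ with a scalar $Z \sim \mathcal{N}(0,1)$, so that $\Sigma = AA^\top$ is singular and no density exists. The vectors `mu`, `A` and `alpha` are hypothetical values chosen for illustration. The code compares the empirical mean and variance of $\alpha^\top X$ with $\alpha^\top \mu$ and $\alpha^\top \Sigma\, \alpha$.

```python
import random
import statistics

# Hypothetical example: X = mu + A * Z with scalar Z ~ N(0, 1),
# so X lives on a line in R^2 and Sigma = A A^T has rank one.
mu = [1.0, -2.0]
A = [1.0, 2.0]
Sigma = [[A[i] * A[j] for j in range(2)] for i in range(2)]
det_Sigma = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]

# Direction of projection (arbitrary choice).
alpha = [3.0, -1.0]
proj_mean = sum(a * m for a, m in zip(alpha, mu))            # alpha^T mu
proj_var = sum(alpha[i] * Sigma[i][j] * alpha[j]
               for i in range(2) for j in range(2))          # alpha^T Sigma alpha

# Monte Carlo check of the projected mean and variance.
rng = random.Random(1)
samples = []
for _ in range(20000):
    z = rng.gauss(0.0, 1.0)
    x = [mu[i] + A[i] * z for i in range(2)]
    samples.append(sum(a * xi for a, xi in zip(alpha, x)))

emp_mean = statistics.fmean(samples)
emp_var = statistics.pvariance(samples)

print(f"det(Sigma) = {det_Sigma}")
print(f"alpha^T X: mean {emp_mean:.3f} vs {proj_mean:.3f}, "
      f"variance {emp_var:.3f} vs {proj_var:.3f}")
```

The determinant is zero, so the density formula in (i) is unavailable, yet every projection $\alpha^\top X$ is still a well-defined univariate Gaussian (possibly with zero variance for some $\alpha$).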
[quotetheorem:1853] Part (a) follows directly from characterisation (ii) of the multivariate normal: $\alpha^\top(AX + b) = (\alpha^\top A)X + \alpha^\top b$ is Gaussian for every $\alpha$, with the stated mean and variance. Part (b) is a direct application of Slutsky's lemma (part (d)). Part (c) reflects a fundamental fact: for Gaussian vectors, uncorrelatedness and independence are equivalent. Note that part (a) requires $A$ to be a fixed deterministic matrix. If $A$ itself were random and dependent on $X$, the result would fail entirely. Part (b) is the version relevant to statistics, where $A$ is an unknown matrix replaced by a data-driven estimator $A_n$: the Slutsky mechanism applies because $A_n$ converges to a constant. Part (c) is exceptional to the Gaussian world: in general, uncorrelated random variables need not be independent; it is the quadratic structure of the Gaussian — encoded in its characteristic function — that makes zero covariance sufficient for independence. ### The Multivariate Central Limit Theorem [quotetheorem:1854] [citeproof:1854] In practice, $\Sigma = \operatorname{Cov}(X)$ is unknown and must be estimated. The sample covariance matrix $\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)(X_i - \bar{X}_n)^\top$ satisfies $\hat{\Sigma}_n \xrightarrow{a.s.} \Sigma$ by the strong law of large numbers. By Slutsky's lemma (part (d)), one may substitute $\hat{\Sigma}_n$ for $\Sigma$ in distributional statements without affecting the limit. ## Asymptotic Efficiency and Confidence Regions ### Asymptotic Efficiency The Cramér–Rao lower bound is a finite-sample result, but it only constrains unbiased estimators. For a biased estimator, or in large samples where bias is negligible, how should we compare the quality of different estimation procedures? The right question is not whether an estimator achieves the Cramér–Rao bound for every $n$, but whether its scaled variance converges to the information-theoretic minimum as $n \to \infty$. 
The Cramér–Rao lower bound established in Chapter 4 concerns the variance of estimators based on $n$ i.i.d. observations sampled from $P_{\theta_0}$. In the Fisher information framework, the natural benchmark for $n$ observations is $I_n(\theta_0)^{-1} = I(\theta_0)^{-1}/n$, since $I_n(\theta_0) = n I(\theta_0)$ by tensorisation. A reasonable optimality criterion is therefore that $n \operatorname{Var}_{\theta_0}(\hat{\theta}_n)$ approaches $I(\theta_0)^{-1}$ as $n \to \infty$. [definition: Asymptotic Efficiency] An estimator sequence $(\hat{\theta}_n)$ is **asymptotically efficient** at $\theta_0$ if \begin{align*} n \operatorname{Var}_{\theta_0}(\hat{\theta}_n) \to I(\theta_0)^{-1} \quad \text{as } n \to \infty, \end{align*} when sampling from $P_{\theta_0}$. [/definition] The Cramér–Rao bound suggests that $I(\theta_0)^{-1}$ is the smallest asymptotic variance achievable. In subsequent lectures, it will be shown that under suitable regularity conditions on the parametric model, the MLE satisfies \begin{align*} \hat{\theta}_{\mathrm{MLE}} \approx \mathcal{N}\!\left(\theta_0,\; \frac{I(\theta_0)^{-1}}{n}\right), \end{align*} which implies asymptotic efficiency. This distributional approximation is also the starting point for constructing confidence regions. ### Confidence Intervals from the CLT Given that $\bar{X}_n$ converges to $\mu_0$ but we do not know $\mu_0$, how do we construct an interval that traps $\mu_0$ with prescribed probability? The CLT turns the abstract convergence statement $\bar{X}_n \to \mathbb{E}[X]$ into a concrete recipe for uncertainty quantification. To construct a confidence interval at level $1 - \alpha$, recall that the standard normal quantile $z_\alpha$ is defined by $\mathbb{P}(|Z| \leq z_\alpha) = 1 - \alpha$ for $Z \sim \mathcal{N}(0,1)$. [example: Confidence Interval for the Population Mean] Let $X_1, X_2, \ldots$ be i.i.d. copies of a real-valued random variable $X \sim P$ with mean $\mu_0$ and known variance $\sigma^2$. 
For $\alpha \in (0,1)$, define the random interval \begin{align*} C_n = \left\{\mu \in \mathbb{R} : |\mu - \bar{X}_n| \leq \frac{\sigma z_\alpha}{\sqrt{n}}\right\}. \end{align*} We show that $\mathbb{P}(\mu_0 \in C_n) \to 1 - \alpha$, so $C_n$ is an asymptotic level $1-\alpha$ confidence set for $\mu_0$. Compute the coverage probability directly: \begin{align*} \mathbb{P}(\mu_0 \in C_n) &= \mathbb{P}\!\left(|\bar{X}_n - \mu_0| \leq \frac{\sigma z_\alpha}{\sqrt{n}}\right) = \mathbb{P}\!\left(\sqrt{n}\,\frac{|\bar{X}_n - \mu_0|}{\sigma} \leq z_\alpha\right). \end{align*} Set $\tilde{X}_i = (X_i - \mu_0)/\sigma$, which are i.i.d. with mean $0$ and variance $1$. By the CLT, \begin{align*} \sqrt{n}\,\bar{\tilde{X}}_n = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n \tilde{X}_i\right) \xrightarrow{d} \mathcal{N}(0,1). \end{align*} The function $x \mapsto |x|$ is continuous, so by the continuous mapping theorem $\sqrt{n}|\bar{\tilde{X}}_n| \xrightarrow{d} |Z|$ where $Z \sim \mathcal{N}(0,1)$. Since $z_\alpha$ is a continuity point of the distribution function of $|Z|$, \begin{align*} \mathbb{P}(\mu_0 \in C_n) \to \mathbb{P}(|Z| \leq z_\alpha) = 1 - \alpha. \end{align*} For a $95\%$ confidence interval, take $\alpha = 0.05$, giving $z_{0.05} \approx 1.96$ and $C_n = \bar{X}_n \pm 1.96\,\sigma/\sqrt{n}$. [/example] When $\sigma^2$ is unknown, it can be replaced by the sample variance $\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$. By the strong law of large numbers, $\hat{\sigma}_n^2 \xrightarrow{a.s.} \sigma^2$. Slutsky's lemma (part (c)) then gives \begin{align*} \sqrt{n}\,\frac{\bar{X}_n - \mu_0}{\hat{\sigma}_n} \xrightarrow{d} \mathcal{N}(0,1), \end{align*} so the confidence interval $\bar{X}_n \pm z_\alpha \hat{\sigma}_n / \sqrt{n}$ retains its asymptotic level $1 - \alpha$. This chapter illustrates the two-step strategy that underpins much of asymptotic statistics. First, the law of large numbers establishes that an estimator converges to the true parameter. 
For the sample mean, $\bar{X}_n \to \mu_0$. Second, the CLT describes the fluctuations of the estimator around the truth, enabling inference. For the sample mean, $\sqrt{n}(\bar{X}_n - \mu_0)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$. The following lectures will execute this same programme for the maximum likelihood estimator in general parametric models. The role of $\sigma^2$ will be played by $I(\theta_0)^{-1}$, and the law of large numbers applied to the score function will replace the simple calculation for the mean. Slutsky's lemma will be the workhorse for substituting estimated quantities in place of unknown theoretical quantities. As our sample size grows, the MLE's behavior becomes increasingly predictable — a phenomenon we formalize as consistency, where the estimator converges to the true parameter value in probability. # 6. Consistency of the MLE ## From Pointwise to Uniform: Why the Law of Large Numbers Is Not Enough The MLE is defined as the maximizer of the normalized log-likelihood $\bar{\ell}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i, \theta)$. In Chapter 2 we established that the population criterion \begin{align*} \ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)] \end{align*} is uniquely maximized at the true parameter $\theta_0$, under identifiability. This gives a clear roadmap: if $\bar{\ell}_n(\theta)$ approximates $\ell(\theta)$ well, then the maximizer $\hat{\theta}_n$ of $\bar{\ell}_n$ should be close to the maximizer $\theta_0$ of $\ell$. The law of large numbers guarantees, for each fixed $\theta$, that $\bar{\ell}_n(\theta) \xrightarrow{\mathbb{P}} \ell(\theta)$. But there is a fundamental gap between pointwise convergence of a function and convergence of its maximizer. Consider a sequence of functions $g_n$ on $[0,1]$ where each $g_n$ is zero except for a narrow spike of height 1 that travels across the interval: $g_n(x) = \mathbb{1}_{[1/n, 2/n]}(x)$. 
Then $g_n(x) \to 0$ for every fixed $x$, yet $\max_x g_n(x) = 1$ for every $n \geq 2$, attained on the whole moving interval $[1/n, 2/n]$, while the limit function $0$ is maximized at every point of $[0,1]$. The pointwise limit tells us nothing about the height of the maximum of $g_n$ or where it is achieved. This failure mode is not exotic: it can arise with log-likelihoods.

To pass from pointwise convergence to convergence of the maximizer, we need the convergence to be *uniform* over $\Theta$: the approximation error $\sup_{\theta \in \Theta} |\bar{\ell}_n(\theta) - \ell(\theta)|$ must tend to zero in probability. This stronger condition is the central analytical ingredient of this chapter.

## Consistency of Estimators

Before establishing consistency of the MLE, we state what consistency means in general.

[definition: Consistent Estimator]
Let $X_1, \ldots, X_n$ be i.i.d. from the parametric model $\{P_\theta : \theta \in \Theta\}$. An estimator $\tilde{\theta}_n = \tilde{\theta}_n(X_1, \ldots, X_n)$ is called **consistent** if
\begin{align*}
\tilde{\theta}_n \xrightarrow{\mathbb{P}} \theta_0 \quad \text{as } n \to \infty
\end{align*}
whenever $X_1, \ldots, X_n$ are drawn from $P_{\theta_0}$.
[/definition]

Consistency is a minimum requirement for a sensible estimator: as the amount of data grows without bound, the estimator should home in on the true value. It says nothing about the *rate* of convergence; that is the domain of efficiency theory and the Cramér–Rao bound studied in Chapter 4. An estimator can be consistent yet converge slowly, or it can perform well at small sample sizes while failing to be consistent.

A simple verification: for $X_i \sim \mathcal{N}(\mu, 1)$ with $\mu \in \mathbb{R}$ unknown, the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies $\mathbb{E}_\mu[\bar{X}_n] = \mu$ and $\operatorname{Var}_\mu(\bar{X}_n) = 1/n$.
Chebyshev's inequality gives, for any $\varepsilon > 0$, \begin{align*} \mathbb{P}_\mu(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{1}{n\varepsilon^2} \to 0, \end{align*} so $\bar{X}_n$ is consistent for $\mu$. Since $\bar{X}_n$ is also the MLE for this model, this is a direct verification in a special case. The goal of the rest of the chapter is to establish consistency for a broad class of models. ## Regularity Conditions for MLE Consistency The consistency result requires several conditions on the model, collected as a single assumption. Each plays a distinct role. [definition: Regularity Assumptions for Consistency] Let $\{f(\cdot, \theta) : \theta \in \Theta\}$ be a statistical model of p.d.f./p.m.f. on $\mathcal{X} \subseteq \mathbb{R}^d$. The model satisfies **Assumption 2.1** (regularity for consistency) if: 1. $f(x, \theta) > 0$ for all $x \in \mathcal{X}$, $\theta \in \Theta$; 2. $\int_{\mathcal{X}} f(x, \theta)\, dx = 1$ for all $\theta \in \Theta$; 3. The function $\theta \mapsto f(x, \theta)$ is continuous for all $x \in \mathcal{X}$; 4. $\Theta \subseteq \mathbb{R}^p$ is compact; 5. For any $\theta, \theta' \in \Theta$: $f(\cdot, \theta) = f(\cdot, \theta') \implies \theta = \theta'$ (identifiability); 6. $\mathbb{E}_\theta \sup_{\theta' \in \Theta} |\log f(X, \theta')| < \infty$. [/definition] These conditions are not merely technical decoration; each is load-bearing in a specific way. **Identifiability (5)** ensures that $\theta_0$ is the *unique* maximizer of $\ell(\theta)$. Without it, there could be multiple parameters with identical log-likelihoods, and the MLE might converge to a value other than $\theta_0$. As established in Chapter 2, identifiability upgrades Jensen's inequality from $\ell(\theta) \leq \ell(\theta_0)$ to the strict inequality $\ell(\theta) < \ell(\theta_0)$ for all $\theta \neq \theta_0$. **Compactness of $\Theta$ (4)** serves two purposes. 
First, it guarantees existence of the MLE via the extreme value theorem: a continuous function on a compact set attains its maximum. Second, compactness is the key hypothesis that upgrades pointwise convergence to uniform convergence, via the uniform law of large numbers. **Continuity (3)** carries over, via condition (6) and the dominated convergence theorem, to continuity of $\theta \mapsto \ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)]$. The upshot is that $\ell$ is continuous on the compact set $\Theta$, so it attains its maximum and the set $\Theta_\varepsilon = \{\theta : \|\theta - \theta_0\| \geq \varepsilon\}$ has a strictly smaller maximum value than $\ell(\theta_0)$. **Uniform integrability (6)** justifies exchanging the supremum and the expectation in the uniform law of large numbers. Without a dominating integrable function for $\{\log f(X, \theta) : \theta \in \Theta\}$, neither the continuity of $\ell$ nor the uniform law of large numbers would follow. ## The Uniform Law of Large Numbers What distinguishes the ordinary law of large numbers from the tool we need here? The ordinary LLN says: for each fixed $\theta$, $\bar{\ell}_n(\theta) \to \ell(\theta)$ in probability. But "for each fixed $\theta$" is not strong enough — we need the approximation to hold simultaneously for all $\theta$ at once. On a non-compact space, the pointwise convergence can be arbitrarily slow as $\theta$ varies, and the supremum of the errors need not vanish. Compactness prevents this by allowing a finite covering argument. [quotetheorem:1855] The proof combines the ordinary law of large numbers with an equicontinuity argument on the compact set $\Theta$; the result is admitted from measure-theoretic probability. The key mechanism is: cover $\Theta$ by finitely many balls of radius $\delta$; on each ball, use continuity of $\theta \mapsto \log f(x, \theta)$ to bound the variation; and use the ordinary LLN at the center of each ball to control the average. 
The finite cover is possible precisely because $\Theta$ is compact. On a non-compact parameter space, the uniform LLN can fail even when the pointwise LLN holds at every $\theta$: the family $\theta \mapsto \log f(x, \theta)$ may have oscillations of unbounded amplitude as $\theta \to \infty$, preventing any uniform control. Compactness cuts off this escape to infinity. ## Consistency of the MLE We now have all the pieces: the population criterion $\ell$ has a unique maximum at $\theta_0$ (by identifiability), the sample criterion $\bar{\ell}_n$ converges uniformly to $\ell$ (by the ULLN), and $\Theta$ is compact (so maxima exist). The remaining question is whether these three facts combine to force the maximizer of $\bar{\ell}_n$ close to the maximizer of $\ell$. The following theorem confirms that they do. [quotetheorem:1856] The two conclusions are logically separate: existence must be established before consistency makes sense, and the proof handles them in order. [proof] **Existence.** The map $\theta \mapsto \bar{\ell}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i, \theta)$ is continuous on the compact set $\Theta$ (by condition 3 and finiteness from condition 6). By the extreme value theorem, it attains its maximum, so an MLE exists. **Consistency.** Fix $\varepsilon > 0$ and define $\Theta_\varepsilon = \{\theta \in \Theta : \|\theta - \theta_0\| \geq \varepsilon\}$. This is compact (intersection of the compact set $\Theta$ with the closed set $\{\|\theta - \theta_0\| \geq \varepsilon\}$). Since $\ell$ is continuous and $\theta_0 \notin \Theta_\varepsilon$ is the unique maximizer of $\ell$ on $\Theta$, the maximum of $\ell$ on $\Theta_\varepsilon$ satisfies \begin{align*} c(\varepsilon) := \sup_{\theta \in \Theta_\varepsilon} \ell(\theta) < \ell(\theta_0). \end{align*} Choose $\delta(\varepsilon) > 0$ with $c(\varepsilon) + \delta(\varepsilon) < \ell(\theta_0) - \delta(\varepsilon)$. 
Define the event \begin{align*} A_n(\varepsilon) = \left\{ \sup_{\theta \in \Theta} |\bar{\ell}_n(\theta) - \ell(\theta)| < \delta(\varepsilon) \right\}. \end{align*} By the uniform law of large numbers, $\mathbb{P}(A_n(\varepsilon)) \to 1$. On $A_n(\varepsilon)$, we bound $\sup_{\Theta_\varepsilon} \bar{\ell}_n$ using the triangle inequality: \begin{align*} \sup_{\theta \in \Theta_\varepsilon} \bar{\ell}_n(\theta) \leq \sup_{\theta \in \Theta_\varepsilon} \ell(\theta) + \sup_{\theta \in \Theta} |\bar{\ell}_n(\theta) - \ell(\theta)| < c(\varepsilon) + \delta(\varepsilon). \end{align*} On $A_n(\varepsilon)$ we also have $\bar{\ell}_n(\theta_0) > \ell(\theta_0) - \delta(\varepsilon)$. By choice of $\delta(\varepsilon)$, \begin{align*} \sup_{\theta \in \Theta_\varepsilon} \bar{\ell}_n(\theta) < c(\varepsilon) + \delta(\varepsilon) < \ell(\theta_0) - \delta(\varepsilon) < \bar{\ell}_n(\theta_0). \end{align*} If $\hat{\theta}_n \in \Theta_\varepsilon$, then $\bar{\ell}_n(\hat{\theta}_n) \leq \sup_{\Theta_\varepsilon} \bar{\ell}_n < \bar{\ell}_n(\theta_0)$, contradicting the definition of the MLE. Therefore on $A_n(\varepsilon)$, $\hat{\theta}_n \notin \Theta_\varepsilon$, i.e., $A_n(\varepsilon) \subseteq \{\|\hat{\theta}_n - \theta_0\| < \varepsilon\}$. Since $\mathbb{P}(A_n(\varepsilon)) \to 1$, we conclude $\mathbb{P}(\|\hat{\theta}_n - \theta_0\| < \varepsilon) \to 1$. [/proof] The geometric idea behind the proof is worth dwelling on. The population criterion $\ell$ is strictly higher at $\theta_0$ than anywhere outside the $\varepsilon$-ball around $\theta_0$ — this is the gap $\ell(\theta_0) - c(\varepsilon) > 0$. Once $\bar{\ell}_n$ is uniformly within $\delta(\varepsilon)$ of $\ell$, any maximizer of $\bar{\ell}_n$ must lie in the region where $\ell$ is near its maximum, hence within the $\varepsilon$-ball. The gap is absorbed by the approximation error, and the maximizer is forced inward. Identifiability is necessary, not merely convenient. 
Without it, $c(\varepsilon)$ could equal $\ell(\theta_0)$, and no positive $\delta(\varepsilon)$ satisfying $c(\varepsilon) + \delta(\varepsilon) < \ell(\theta_0) - \delta(\varepsilon)$ could be found, so the entire argument would collapse. Similarly, without compactness, $\Theta_\varepsilon$ might not be compact, the supremum $c(\varepsilon)$ might not be attained, and the strict inequality $c(\varepsilon) < \ell(\theta_0)$ might fail even under identifiability.

The theorem states that *any* MLE is consistent: when $\bar{\ell}_n$ has several maximizers, every one of them converges to $\theta_0$. In many smooth models the maximizer is unique with probability tending to 1, but the consistency statement does not require uniqueness.

[example: Consistency of the Poisson MLE]
Let $X_i \sim \text{Poi}(\theta_0)$ with $\theta_0 \in \Theta = [\epsilon_0, M]$ for fixed $0 < \epsilon_0 < M < \infty$. The parameter space is a compact interval. The density $f(x, \theta) = e^{-\theta}\theta^x/x!$ is continuous in $\theta > 0$ and positive on $\mathcal{X} = \{0, 1, 2, \ldots\}$. The model is identifiable: if $f(\cdot, \theta) = f(\cdot, \theta')$ as functions on $\mathcal{X}$, then in particular $f(0, \theta) = e^{-\theta} = e^{-\theta'} = f(0, \theta')$, and since $\theta \mapsto e^{-\theta}$ is strictly decreasing this forces $\theta = \theta'$. The log-density $\log f(x, \theta) = -\theta + x\log\theta - \log(x!)$ is bounded in absolute value, uniformly over $\theta \in \Theta$, by $M + x \max(|\log \epsilon_0|, |\log M|) + \log(x!)$. Since $\log(x!) \leq x^2$ and the Poisson distribution has finite moments of all orders, this bound is integrable under $\theta_0$, so condition (6) holds. Assumption 2.1 is satisfied. Because $\bar{\ell}_n$ is concave in $\theta$ with unconstrained maximizer $\bar{X}_n$, the MLE over $\Theta$ is $\bar{X}_n$ clipped to $[\epsilon_0, M]$; it equals $\bar{X}_n$ whenever $\bar{X}_n \in \Theta$, and it is consistent by the theorem.
This can also be seen directly: the weak law of large numbers gives $\bar{X}_n \xrightarrow{\mathbb{P}} \mathbb{E}_{\theta_0}[X_1] = \theta_0$. The direct argument works here because the MLE happens to equal the sample mean. For models where the MLE has no such closed form, the general theorem is indispensable.
[/example]

## When the Hypotheses Fail

Three main structural assumptions can fail, and each has a natural failure mode.

**Failure of compactness.** Consider $X_i \sim \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$, a non-compact parameter space. Setting $\mu = X_1$ and letting $\sigma^2 \to 0$, the likelihood contribution of the observation $X_1$ is $f(X_1, X_1, \sigma^2) = (2\pi\sigma^2)^{-1/2} \to \infty$. With a single observation ($n = 1$) the likelihood is therefore unbounded on the parameter space and no MLE exists. For $n \geq 2$ distinct observations the remaining factors decay fast enough that the maximum is attained, but this is a special feature of the model rather than a consequence of the general theorem, and the same spike makes the likelihood unbounded for every $n$ in Gaussian mixtures with component-specific variances. Restricting $\sigma^2$ to a compact interval bounded away from zero removes the problem. In practice, physical constraints or a priori bounds on $\sigma^2$ restore compactness.

**Failure of identifiability.** In a mixture of two Gaussians $\pi_1 \mathcal{N}(\mu_1, \sigma^2) + \pi_2 \mathcal{N}(\mu_2, \sigma^2)$ with $\pi_2 = 1 - \pi_1$, the parameterization by $(\pi_1, \mu_1, \mu_2)$ is not identified: the mixture with $(\pi_1, \mu_1, \mu_2)$ and the mixture with $(\pi_2, \mu_2, \mu_1)$ (weights and means swapped) define the same distribution. The population criterion $\ell$ has multiple maximizers related by this label-switching symmetry, and the MLE cannot distinguish between them. Consistency fails in the sense that $\hat{\theta}_n$ converges only to the equivalence class, not to a unique $\theta_0$.

**Failure of the domination condition (6).** If $\log f(X, \theta)$ has heavy tails, the supremum $\sup_\theta |\log f(X, \theta)|$ may have infinite expectation, and the dominated convergence argument that gives continuity of $\ell$ and the uniform law of large numbers breaks down.
The sample criterion $\bar{\ell}_n$ may converge pointwise to $\ell$ but the convergence may fail to be uniform, invalidating the proof. ## Beyond Compactness: Differentiability as a Substitute The compactness assumption is the most restrictive condition in Assumption 2.1 — many standard models have open or unbounded parameter spaces. Can it be replaced? When $\Theta$ is open in $\mathbb{R}^p$ and $\bar{\ell}_n$ is twice continuously differentiable, the MLE satisfies the score equation $S_n(\hat{\theta}_n) = \nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = 0$. The score identity of Chapter 3 gives $\mathbb{E}_{\theta_0}[S_n(\theta_0)] = 0$, and the law of large numbers gives $S_n(\theta_0) \xrightarrow{\mathbb{P}} 0$. Under conditions on the Hessian (negative definite near $\theta_0$, bounded by an integrable function), an implicit function theorem argument shows that the root $\hat{\theta}_n$ of the score equation converges to $\theta_0$. This approach connects naturally to the asymptotic normality result in later chapters, where the Hessian at $\theta_0$ is $-I(\theta_0)$, the negative Fisher information. The compactness and differentiability approaches are complementary: compactness handles non-smooth models (count data, discrete parameter spaces) and yields a clean geometric proof, while differentiability applies to open parameter spaces and gives a rate connected to Fisher information. ## The M-Estimation Framework The consistency argument of this chapter applies to a much broader class of estimators than the MLE alone. How general is the "maximize a sample average, then show the maximizer converges" strategy? The answer is: very general. Any estimator defined as the maximizer of a sample criterion function falls under this umbrella. [definition: M-Estimator] Given i.i.d. 
observations $X_1, \ldots, X_n$ and a measurable function $m: \mathcal{X} \times \Theta \to \mathbb{R}$, an **M-estimator** is any element
\begin{align*} \hat{\theta}_n \in \arg\max_{\theta \in \Theta}\, \frac{1}{n}\sum_{i=1}^n m(X_i, \theta). \end{align*}
The **population criterion** is $M(\theta) = \mathbb{E}_{\theta_0}[m(X, \theta)]$.
[/definition]

The MLE is the special case $m(x, \theta) = \log f(x, \theta)$. Other important instances include least squares ($m((y, x), \beta) = -(y - x^\top \beta)^2$, maximized at the OLS estimator), the sample median ($m(x, \theta) = -|x - \theta|$), and quantile regression (replacing $|\cdot|$ by an asymmetric loss function). The consistency proof of this chapter (establish uniform convergence of the sample criterion to the population criterion, then use the strict maximum of the population criterion to trap the argmax) applies to all M-estimators under analogous regularity conditions. The key requirement is that $M$ has a unique maximizer $\theta_0$: this replaces the identifiability condition of the MLE setting.

[example: Least Squares as an M-Estimator]
Consider $Y_i = \theta_0 x_i + \varepsilon_i$ with fixed design points $x_i \in \mathbb{R}$, $\varepsilon_i \sim \mathcal{N}(0, 1)$ i.i.d., and $\theta_0 \in \Theta = [-B, B]$. The sample criterion is $M_n(\theta) = -\frac{1}{n}\sum_{i=1}^n (Y_i - \theta x_i)^2$. Because the design is fixed, the summands are independent but not identically distributed, so the expectation of a single term depends on $i$ and the population criterion must be defined through the average. Writing $s_n^2 = \frac{1}{n}\sum_{i=1}^n x_i^2$ and taking expectation over the noise:
\begin{align*} \mathbb{E}\left[M_n(\theta)\right] = -\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[(Y_i - \theta x_i)^2\right] = -\left((\theta_0 - \theta)^2 s_n^2 + 1\right). \end{align*}
If the design satisfies $s_n^2 \to c > 0$, this converges to $M(\theta) = -((\theta_0 - \theta)^2 c + 1)$, which is uniquely maximized at $\theta = \theta_0$. Uniform convergence on the compact $\Theta = [-B, B]$ can be checked directly, since $M_n(\theta) - \mathbb{E}[M_n(\theta)] = -2(\theta_0 - \theta)\frac{1}{n}\sum_{i=1}^n x_i \varepsilon_i - \frac{1}{n}\sum_{i=1}^n (\varepsilon_i^2 - 1)$, where $|\theta_0 - \theta| \leq 2B$ and both averages tend to zero in probability. The least squares estimator $\hat{\theta}_n$ is therefore consistent. The MLE for this Gaussian model coincides with least squares, consistent with the general theory.
[/example] The M-estimation viewpoint reveals what the MLE proof is really about: not a special property of log-likelihoods, but a general phenomenon about maximizers of random functions converging to maximizers of their limits. The unique feature of maximum likelihood is that the population criterion $\ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)]$ is automatically maximized at the true parameter by the KL divergence inequality of Chapter 2. For a general M-estimator, this identifying property — that the population criterion is maximized at the true parameter — must be verified separately from the consistency argument. Beyond consistency, the MLE exhibits a stronger property: its distribution itself converges to a normal distribution, centered at the true parameter, with variance determined by the Fisher Information. # 7. Asymptotic Normality of the MLE ## The Uniform Law of Large Numbers Consistency of the MLE, established in the previous chapter, rests on a convergence result stronger than a pointwise law of large numbers. The normalized log-likelihood $\bar{\ell}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i, \theta)$ must converge to its expectation $\ell(\theta) = \mathbb{E}_{\theta_0}[\log f(X, \theta)]$ not merely at each fixed $\theta$, but simultaneously across the entire parameter space. This simultaneity is precisely what the uniform law of large numbers guarantees. To see why pointwise convergence is insufficient, consider what goes wrong without it. Even if $\bar{\ell}_n(\theta) \to \ell(\theta)$ almost surely for every fixed $\theta$, the maximizer $\hat{\theta}_n$ of $\bar{\ell}_n$ might wander arbitrarily far from the maximizer $\theta_0$ of $\ell$: the pointwise events $\{|\bar{\ell}_n(\theta) - \ell(\theta)| > \varepsilon\}$ might collectively have positive probability even when each one has probability zero. The passage from "for each $\theta$" to "for all $\theta$ simultaneously" requires a uniform handle on the fluctuations. 
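The quantity that the uniform law controls can be examined numerically. The following is a minimal sketch, not part of the course material: it assumes a Poisson model with $\theta_0 = 2$ and the compact parameter set $\Theta = [0.5, 5]$, approximates the supremum over $\Theta$ by a maximum over a finite grid, and computes $\mathbb{E}_{\theta_0}[\log(X!)]$ by a truncated sum. The function name `sup_deviation` and all numerical settings are illustrative choices.

```python
import numpy as np


def sup_deviation(n, theta0=2.0, lo=0.5, hi=5.0, grid_size=200, seed=0):
    """Grid approximation of sup over [lo, hi] of |lbar_n(theta) - l(theta)|
    for i.i.d. Poisson(theta0) data (illustrative settings, not from the course)."""
    rng = np.random.default_rng(seed)
    x = rng.poisson(theta0, size=n)
    thetas = np.linspace(lo, hi, grid_size)

    # Table of log(k!) for k = 0..kmax, large enough for the sample and the sum below.
    kmax = max(int(x.max()), 200)
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, kmax + 1)))))

    # Sample criterion: average of log f(X_i, theta) = -theta + X_i log(theta) - log(X_i!).
    lbar = -thetas + x.mean() * np.log(thetas) - log_fact[x].mean()

    # Population criterion: E[X] = theta0, and E[log(X!)] via a truncated sum over the pmf.
    k = np.arange(kmax + 1)
    pmf = np.exp(-theta0 + k * np.log(theta0) - log_fact)
    ell = -thetas + theta0 * np.log(thetas) - np.sum(pmf * log_fact)

    return float(np.max(np.abs(lbar - ell)))
```

Averaged over a few seeds, the value returned for large `n` should be much smaller than for small `n`, in line with uniform convergence on a compact set. A single draw carries Monte Carlo noise, so no specific numbers are claimed here.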
### Why Compactness Replaces Finiteness

In the finite-index setting the argument is elementary. Suppose $h_1, \ldots, h_M : \mathcal{X} \to \mathbb{R}$ satisfy $\mathbb{E}|h_j(X)| < \infty$ for each $j$. The strong law of large numbers gives, for each $j$, an almost sure event $A_j$ on which $\frac{1}{n}\sum_{i=1}^n h_j(X_i) \to \mathbb{E}[h_j(X)]$. Setting $A = \bigcap_{j=1}^M A_j$ we have $\mathbb{P}(A^c) \le \sum_{j=1}^M \mathbb{P}(A_j^c) = 0$, and on $A$ the maximum over $j$ satisfies
\begin{align*} \max_{1 \le j \le M} \left|\frac{1}{n}\sum_{i=1}^n h_j(X_i) - \mathbb{E}[h_j(X)]\right| \to 0. \end{align*}
The exceptional set $A^c$ is a finite union of null sets, so uniform convergence holds almost surely over the finite class.

For a continuous parameter $\theta \in \Theta$, the finitely many indices are replaced by uncountably many, and the union bound breaks down. The role of finiteness is played by compactness: a compact set can be covered, up to any precision $\delta > 0$, by a finite net $\Theta_0$. Uniform convergence over $\Theta_0$ then propagates to all of $\Theta$ via continuity in $\theta$.

[quotetheorem:1855]

The proof transfers uniform convergence from a finite $\delta$-net $\Theta_0 \subset \Theta$ to the full compact set. For each $\theta \in \Theta$ there exists a net point $\theta' \in \Theta_0$ with $|\theta - \theta'| \le \delta$. Continuity of $\theta \mapsto q(x, \theta)$ on the compact set $\Theta$, together with the integrable envelope $\sup_\theta |q(X, \theta)|$, makes $\mathbb{E}\left[\sup_{|\theta - \theta'| \le \delta} |q(X, \theta) - q(X, \theta')|\right]$ small for small $\delta$ by dominated convergence. Letting $n \to \infty$ for fixed $\delta$, and then $\delta \to 0$, completes the argument. The integrability assumption $\mathbb{E}[\sup_\theta |q(X,\theta)|] < \infty$ is the price of uniformity: it rules out functions that grow unboundedly in some direction of $\theta$.

The assumption $\mathbb{E}[\sup_{\theta \in \Theta}|q(X,\theta)|] < \infty$ is genuinely stronger than $\sup_\theta \mathbb{E}[|q(X,\theta)|] < \infty$. The supremum itself must be integrable; it is not enough for the expectations to be bounded uniformly in $\theta$.
Without it one can construct sequences where uniform convergence fails despite pointwise convergence for every $\theta$. ## Regularity Conditions for Asymptotic Normality Consistency tells us that $\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta_0$, but it says nothing about the *rate* at which the MLE approaches the truth. To obtain a distributional limit — a precise description of the random fluctuations — we need to look more carefully at the geometry of the log-likelihood near its maximum. This requires conditions on the model beyond those that ensure consistency. The difficulty is that distributional limits involve derivatives and second-order expansions, which demand smoothness in $\theta$. A model that is merely measurable in $\theta$ cannot support a Taylor expansion of the score function, and without such an expansion there is no route to a Gaussian limit. The following assumption collects the conditions needed. [definition: Regularity Conditions for Asymptotic Normality] Let $\{f(\cdot, \theta) : \theta \in \Theta\}$ be a statistical model of p.d.f./p.m.f. on $\mathcal{X} \subseteq \mathbb{R}^d$. We say the model satisfies the regularity conditions for asymptotic normality if, in addition to the conditions ensuring consistency, the following hold. 1. The true value $\theta_0$ lies in the interior of $\Theta$. 2. There exists an open set $U \subseteq \Theta$ containing $\theta_0$ such that $\theta \mapsto f(x, \theta)$ is twice continuously differentiable in $\theta \in U$ for every $x \in \mathcal{X}$. 3. The $p \times p$ Fisher information matrix $I(\theta_0)$ is non-singular, and $\mathbb{E}_{\theta_0}[\|\nabla_\theta \log f(X, \theta_0)\|] < \infty$. 4. There exists a compact ball $K \subset U$ centred at $\theta_0$ with non-empty interior such that \begin{align*} \mathbb{E}_{\theta_0}\left[\sup_{\theta \in K} \|\nabla^2_\theta \log f(X, \theta)\|\right] < \infty. \end{align*} [/definition] Each condition has a specific role. 
Condition 1 ensures that the score equation $\nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = 0$ holds as a genuine first-order condition, not merely a boundary inequality. Without it, the MLE could sit on the boundary of $\Theta$ and the gradient need not vanish there. Condition 2 provides the second-order differentiability needed to apply the mean value theorem to the score function. Condition 3 ensures the covariance structure of the score is non-degenerate: a singular Fisher information matrix would mean the score varies only within a proper subspace, and the inverse covariance matrix in the limiting distribution would not exist. Condition 4 controls remainder terms in the Taylor expansion and validates the interchange of differentiation and integration.

These conditions are referred to as "usual regularity assumptions" in the course. All standard exponential family models (Gaussian, Poisson, Binomial, Exponential) satisfy them. The assumptions fail in non-regular models such as the uniform distribution $\mathrm{Uniform}(0, \theta)$, where the support depends on $\theta$, the density is not differentiable in $\theta$ at the boundary of the support, and the MLE converges at rate $n$ rather than $\sqrt{n}$.

## Asymptotic Normality of the MLE

Consistency identifies the limit of the MLE; the asymptotic normality theorem quantifies the precision. The central insight is that in a regular model, the MLE $\hat{\theta}_n$ behaves like a sample average of score contributions, and such averages satisfy the central limit theorem. The Fisher information matrix enters as the natural covariance: it measures how sharply the log-likelihood curves near $\theta_0$, and hence how tightly $\hat{\theta}_n$ clusters around $\theta_0$.

[quotetheorem:1857]

The theorem says that the MLE converges at rate $\sqrt{n}$ and that its limiting covariance matches the Cramér–Rao lower bound from Chapter 4: in a regular model no unbiased estimator has covariance smaller than $I(\theta_0)^{-1}/n$, and the MLE attains this benchmark in the limit.
The limiting covariance $I(\theta_0)^{-1}$ is the inverse Fisher information, so the MLE is *asymptotically efficient*.

The non-singularity of $I(\theta_0)$ is indispensable. If $I(\theta_0)$ were singular, its inverse would not exist and the stated limit would be undefined. Along directions in the null space of $I(\theta_0)$ the population log-likelihood is flat to second order, and the MLE typically converges at a rate slower than $\sqrt{n}$ in those directions, with a limit that need not be Gaussian. The interior condition on $\theta_0$ is equally essential: if the true parameter sat on the boundary of $\Theta$, the MLE would land on the boundary with non-vanishing probability and the score need not vanish at $\hat{\theta}_n$, breaking the entire proof strategy.

[proof]
The proof proceeds in four stages: a Taylor expansion of the score to relate $\sqrt{n}(\hat{\theta}_n - \theta_0)$ to the rescaled score at $\theta_0$, convergence of the Hessian term, the central limit theorem applied to the score at $\theta_0$, and Slutsky's lemma to combine the pieces.

**Stage 1: Taylor expansion of the score.** Since $\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta_0$ and $\theta_0$ lies in the interior of $\Theta$, with probability tending to one the MLE $\hat{\theta}_n$ lies in the interior of $\Theta$ as well. On this event the first-order condition $\nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = 0$ holds. Applying the mean value theorem to each coordinate of $\theta \mapsto \nabla_\theta \bar{\ell}_n(\theta)$ between $\theta_0$ and $\hat{\theta}_n$ yields
\begin{align*} 0 = \nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = \nabla_\theta \bar{\ell}_n(\theta_0) + \bar{A}_n (\hat{\theta}_n - \theta_0), \end{align*}
where $\bar{A}_n$ is a $p \times p$ matrix whose $(k,j)$-th entry is $\frac{\partial^2}{\partial\theta_k\partial\theta_j}\bar{\ell}_n(\bar{\theta}^{(j)})$ for an intermediate point $\bar{\theta}^{(j)}$ on the segment $[\theta_0, \hat{\theta}_n]$.

**Stage 2: Convergence of $\bar{A}_n$.** The matrix $\bar{A}_n$ converges in probability to $-I(\theta_0)$.
Each entry of $\bar{A}_n$ is decomposed as \begin{align*} (\bar{A}_n)_{kj} &= \left[\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta_k\partial\theta_j}\log f(X_i, \bar{\theta}^{(j)}) - \mathbb{E}_{\theta_0}\!\left[\frac{\partial^2}{\partial\theta_k\partial\theta_j}\log f(X, \bar{\theta}^{(j)})\right]\right]\\ &\quad + \left[\mathbb{E}_{\theta_0}\!\left[\frac{\partial^2}{\partial\theta_k\partial\theta_j}\log f(X, \bar{\theta}^{(j)})\right] - \mathbb{E}_{\theta_0}\!\left[\frac{\partial^2}{\partial\theta_k\partial\theta_j}\log f(X, \theta_0)\right]\right] + (-I(\theta_0))_{kj}. \end{align*} Setting $q(x, \theta) = \frac{\partial^2}{\partial\theta_k\partial\theta_j}\log f(x, \theta)$, the first bracket is bounded by $\sup_{\theta \in K}|\frac{1}{n}\sum_i q(X_i,\theta) - \mathbb{E}[q(X,\theta)]|$, which converges to zero almost surely by the uniform law of large numbers. The second bracket converges to zero in probability because $\bar{\theta}^{(j)} \xrightarrow{\mathbb{P}} \theta_0$ by consistency of $\hat{\theta}_n$ and continuity of $\theta \mapsto \mathbb{E}[q(X,\theta)]$. **Stage 3: CLT for the score.** From the expansion, rearranging gives \begin{align*} \sqrt{n}(\hat{\theta}_n - \theta_0) = (-\bar{A}_n)^{-1} \cdot \sqrt{n}\,\nabla_\theta\bar{\ell}_n(\theta_0). \end{align*} Now $\sqrt{n}\,\nabla_\theta\bar{\ell}_n(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nabla_\theta\log f(X_i, \theta_0)$, which is a sum of i.i.d. mean-zero vectors with covariance matrix $I(\theta_0)$ by definition. The multivariate CLT gives $\sqrt{n}\,\nabla_\theta\bar{\ell}_n(\theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0))$. **Stage 4: Slutsky's lemma.** Since $(-\bar{A}_n)^{-1} \xrightarrow{\mathbb{P}} I(\theta_0)^{-1}$, Slutsky's lemma yields \begin{align*} \sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} I(\theta_0)^{-1} \cdot \mathcal{N}(0, I(\theta_0)) = \mathcal{N}(0,\, I(\theta_0)^{-1} I(\theta_0) I(\theta_0)^{-1}) = \mathcal{N}(0,\, I(\theta_0)^{-1}). 
\end{align*} [/proof] The proof strategy is worth dwelling on. The key algebraic move is the first-order condition $\nabla_\theta\bar{\ell}_n(\hat{\theta}_n) = 0$, which holds precisely because $\hat{\theta}_n$ maximizes a smooth function in the interior of the parameter space. This lets us connect the MLE — which is defined by a global maximization — to the score at the truth $\theta_0$, which has a tractable distribution. The Hessian matrix $\bar{A}_n$ appears as the "conversion factor" between these two quantities, and its convergence to $-I(\theta_0)$ is where the Fisher information enters the limiting variance. ## The Scalar Case and the Delta Method ### Asymptotic Normality for Scalar Parameters When $p = 1$, the MLE $\hat{\theta}_n$ is a real-valued estimator and the asymptotic normality theorem reduces to the following. [quotetheorem:1858] An equivalent way to write this, more useful in practice, is \begin{align*} \hat{\theta}_n \approx \mathcal{N}\!\left(\theta_0,\, \frac{1}{n I(\theta_0)}\right), \end{align*} meaning that for large $n$ the MLE is approximately normally distributed with variance $1/(nI(\theta_0))$. This matches the Cramér–Rao bound exactly: the variance of the MLE achieves the lower bound $1/(nI(\theta_0))$ in the limit. ### The Univariate Delta Method Suppose we are not interested in $\theta_0$ itself but in some smooth function $g(\theta_0)$ — for instance, the standard deviation $\sigma = \sqrt{\sigma^2}$ of a Gaussian model when the MLE estimates $\sigma^2$. If $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, what is the asymptotic distribution of $\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0))$? The answer comes from linearising $g$ around $\theta_0$. [quotetheorem:1859] The condition $g'(\theta_0) \neq 0$ is necessary: if $g'(\theta_0) = 0$, the first-order linearisation vanishes, and the asymptotic distribution is no longer Gaussian at the $\sqrt{n}$ scale. 
In that case one must expand to higher order, and the limiting distribution depends on $g''(\theta_0)$. [citeproof:1859] [example: Variance Estimation in a Gaussian Model] Let $X_1, \ldots, X_n$ be i.i.d. $\mathcal{N}(\mu_0, \sigma_0^2)$ with both parameters unknown. The MLE of $\sigma^2$ is $\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$. By asymptotic normality of the MLE for the parameter $\theta_0 = \sigma_0^2$, one has $\sqrt{n}(\hat{\sigma}_n^2 - \sigma_0^2) \xrightarrow{d} \mathcal{N}(0, I(\sigma_0^2)^{-1})$. To find the asymptotic distribution of $\hat{\sigma}_n = \sqrt{\hat{\sigma}_n^2}$, apply the delta method with $g(u) = \sqrt{u}$, so $g'(u) = \frac{1}{2\sqrt{u}}$ and $g'(\sigma_0^2) = \frac{1}{2\sigma_0}$. The delta method gives \begin{align*} \sqrt{n}(\hat{\sigma}_n - \sigma_0) \xrightarrow{d} \mathcal{N}\!\left(0,\, \frac{1}{4\sigma_0^2} \cdot I(\sigma_0^2)^{-1}\right). \end{align*} The Fisher information for the variance parameter in the Gaussian model (with $\mu_0$ known) is $I(\sigma^2) = \frac{1}{2\sigma^4}$, so $I(\sigma_0^2)^{-1} = 2\sigma_0^4$, and the limiting variance becomes $\frac{1}{4\sigma_0^2} \cdot 2\sigma_0^4 = \frac{\sigma_0^2}{2}$. [/example] ### The Multivariate Delta Method When $\hat{\theta}_n$ is a vector and $g : \mathbb{R}^p \to \mathbb{R}^q$ is a smooth map, the same linearisation principle extends using the Jacobian of $g$. [quotetheorem:1860] The proof follows the same linearisation strategy as the univariate case: write $g(\hat{\theta}_n) \approx g(\theta_0) + Jg_{\theta_0}(\hat{\theta}_n - \theta_0)$ and apply Slutsky's lemma to conclude. The covariance $Jg_{\theta_0} \Sigma Jg_{\theta_0}^\top$ is the covariance of the linear map $Jg_{\theta_0}$ applied to a $\mathcal{N}(0, \Sigma)$ random vector. 
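The covariance formula $Jg_{\theta_0} \Sigma Jg_{\theta_0}^\top$ is a plain matrix product and is easy to evaluate numerically. The sketch below is illustrative only: it assumes the Gaussian model with parameter $(\mu, \sigma^2)$, takes $\Sigma = \operatorname{diag}(\sigma_0^2, 2\sigma_0^4)$ (the inverse Fisher information of that model), and uses the hypothetical map $g(\mu, \sigma^2) = (\mu/\sigma, \sigma)$. The helper name `delta_method_cov` and the values $\mu_0 = 1$, $\sigma_0^2 = 4$ are assumptions made for the example.

```python
import numpy as np


def delta_method_cov(jacobian, sigma):
    """Limiting covariance J @ Sigma @ J.T of sqrt(n) * (g(theta_hat) - g(theta_0))."""
    J = np.atleast_2d(np.asarray(jacobian, dtype=float))
    S = np.asarray(sigma, dtype=float)
    return J @ S @ J.T


# Assumed true parameter values (illustrative).
mu0, var0 = 1.0, 4.0
sd0 = np.sqrt(var0)

# Sigma: inverse Fisher information for (mu, sigma^2) in the Gaussian model.
sigma = np.diag([var0, 2.0 * var0**2])

# Jacobian of g(mu, v) = (mu / sqrt(v), sqrt(v)) evaluated at (mu0, var0).
J = np.array([
    [1.0 / sd0, -mu0 / (2.0 * sd0**3)],  # gradient of mu / sqrt(v)
    [0.0,        1.0 / (2.0 * sd0)],     # gradient of sqrt(v)
])

cov = delta_method_cov(J, sigma)
```

The second diagonal entry of `cov` equals $\sigma_0^2/2$, which agrees with the univariate calculation for $\hat{\sigma}_n$ in the Gaussian variance example.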
When the delta method is applied to the MLE, the covariance matrix $\Sigma = I(\theta_0)^{-1}$ and the limiting distribution of $\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0))$ is $\mathcal{N}(0, Jg_{\theta_0} I(\theta_0)^{-1} Jg_{\theta_0}^\top)$. ## Consequences and Applications ### Confidence Intervals from Asymptotic Normality The asymptotic normality theorem has an immediate practical payoff: it justifies constructing approximate confidence intervals for $\theta_0$ using the MLE and the Fisher information. In the scalar case, since $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, we have approximately \begin{align*} \mathbb{P}_{\theta_0}\!\left(\theta_0 \in \left[\hat{\theta}_n - \frac{z_{1-\alpha/2}}{\sqrt{n I(\hat{\theta}_n)}},\, \hat{\theta}_n + \frac{z_{1-\alpha/2}}{\sqrt{n I(\hat{\theta}_n)}}\right]\right) \approx 1 - \alpha, \end{align*} where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal and $I(\hat{\theta}_n)$ is the plug-in estimate of Fisher information. The replacement of $I(\theta_0)$ by $I(\hat{\theta}_n)$ is valid by continuity of $\theta \mapsto I(\theta)$ and consistency of $\hat{\theta}_n$: $I(\hat{\theta}_n) \xrightarrow{\mathbb{P}} I(\theta_0)$, so another application of Slutsky's lemma shows the coverage probability remains approximately $1-\alpha$. This construction fails in non-regular models. For the uniform distribution $\mathrm{Uniform}(0, \theta)$, the MLE $\hat{\theta}_n = \max_i X_i$ converges at rate $n$ (not $\sqrt{n}$) and its rescaled distribution converges to an exponential, not a Gaussian. The confidence interval formula above would be entirely wrong in this case. ### Exponential Family Example [example: Asymptotic Normality for the Poisson MLE] Let $X_1, \ldots, X_n$ be i.i.d. $\mathrm{Poi}(\theta_0)$ with $\theta_0 > 0$. The MLE is $\hat{\theta}_n = \bar{X}_n$. 
For the Poisson model, $\log f(x, \theta) = x\log\theta - \theta - \log(x!)$, so
\begin{align*} \frac{d}{d\theta}\log f(X, \theta) = \frac{X}{\theta} - 1, \qquad \frac{d^2}{d\theta^2}\log f(X, \theta) = -\frac{X}{\theta^2}. \end{align*}
The Fisher information is $I(\theta) = -\mathbb{E}_\theta\!\left[-X/\theta^2\right] = \mathbb{E}_\theta[X]/\theta^2 = \theta/\theta^2 = 1/\theta$. Since $I(\theta_0)^{-1} = \theta_0$, the asymptotic normality theorem gives
\begin{align*} \sqrt{n}(\bar{X}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, \theta_0). \end{align*}
This agrees with the direct application of the central limit theorem to $\bar{X}_n$: since $\operatorname{Var}_{\theta_0}(X_1) = \theta_0$, we have $\sqrt{n}(\bar{X}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, \theta_0)$ by the standard CLT. The MLE framework recovers this directly from the Fisher information, without computing the variance of $X_1$ separately.

To apply the delta method, suppose we want the asymptotic distribution of $\hat{\lambda}_n = e^{-\hat{\theta}_n}$, the MLE of the probability $\mathbb{P}(X = 0) = e^{-\theta_0}$. With $g(\theta) = e^{-\theta}$ we get $g'(\theta) = -e^{-\theta}$, so $g'(\theta_0) = -e^{-\theta_0}$. The delta method yields
\begin{align*} \sqrt{n}(e^{-\bar{X}_n} - e^{-\theta_0}) \xrightarrow{d} \mathcal{N}(0,\, e^{-2\theta_0} \cdot \theta_0). \end{align*}
[/example]

The asymptotic normality theorem shows that the MLE is asymptotically efficient: its limiting variance achieves the Cramér–Rao lower bound $I(\theta_0)^{-1}$. For finite $n$, the MLE may be biased and may not achieve the Cramér–Rao bound, but the gap closes as $n \to \infty$. This is one of the key justifications for using the MLE in practice.

The next chapter examines how fast convergence occurs and what happens when the classical assumptions are violated.

# 8. Discussion of Asymptotic Normality

## Asymptotic Efficiency and What It Requires

The previous chapter established one of the central results of the course: under regularity conditions, the maximum likelihood estimator $\hat{\theta}_n$ satisfies
\begin{align*} \sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1}). \end{align*}
This chapter unpacks what that result really says, when it fails, and how to use it in practice. Three questions drive the discussion. First, what does it mean for an estimator to be *asymptotically efficient*, and how does this connect to the Cramér-Rao bound? Second, what goes wrong when the regularity conditions break down? The uniform distribution provides a canonical failure mode. Third, when we want to estimate not $\theta_0$ itself but some function $\Phi(\theta_0)$, how do we propagate the asymptotic normality of the MLE to the plug-in estimator $\Phi(\hat{\theta}_n)$?

## Asymptotic Efficiency

The Cramér-Rao bound from Chapter 4 tells us that any unbiased estimator of $\theta_0$ has variance at least $I(\theta_0)^{-1}/n$. The bound concerns finite-sample, unbiased estimators. To state an asymptotic analogue, one that applies to the MLE, which need not be exactly unbiased in finite samples, we need a definition of optimality that makes sense as $n \to \infty$.

[definition: Asymptotically Efficient Estimator]
In a parametric model $\{f(\cdot, \theta) : \theta \in \Theta\}$, a consistent estimator $\tilde{\theta}_n$ is called **asymptotically efficient** if
\begin{align*} n \operatorname{Var}_{\theta_0}(\tilde{\theta}_n) \to I(\theta_0)^{-1} \end{align*}
for all $\theta_0 \in \operatorname{int}(\Theta)$ when $p = 1$, and
\begin{align*} n \operatorname{Cov}_{\theta_0}(\tilde{\theta}_n) \to I(\theta_0)^{-1} \end{align*}
in $\mathbb{R}^{p \times p}$ in the multiparameter case $\Theta \subseteq \mathbb{R}^p$.
[/definition] The definition requires the rescaled variance to converge to $I(\theta_0)^{-1}$ — the Cramér-Rao benchmark — rather than merely staying bounded below by it. The MLE, under the regularity conditions of the previous chapter, achieves this benchmark exactly: its asymptotic normality with variance $I(\theta_0)^{-1}/n$ means it saturates the lower bound in the limit. The Cramér-Rao bound is a finite-sample statement about unbiased estimators. Asymptotic efficiency is an asymptotic statement that makes no unbiasedness requirement. The two notions are complementary rather than nested: a biased estimator can be asymptotically efficient even though the Cramér-Rao bound does not formally apply to it. ### When the Regularity Assumptions Cannot Be Relaxed Away The asymptotic normality result rests on conditions about the smoothness of $\theta \mapsto f(x, \theta)$: the ability to differentiate under the integral, uniform laws of large numbers, and second-order Taylor expansion of the log-likelihood. A natural question is whether these conditions can be weakened. The answer is nuanced. On the one hand, with more technical work the regularity conditions *can* be reduced. The Laplace distribution $f(x, \theta) = \frac{1}{2} e^{-|x - \theta|}$, for instance, has a non-differentiable density in $x$ but the score function with respect to $\theta$ is well-behaved almost everywhere, and the asymptotic normality result extends to it. On the other hand, *some* regularity is genuinely necessary. The uniform distribution on $[0, \theta]$ shows what can go wrong when the support of the distribution itself depends on the parameter. ## The Uniform Distribution: A Non-Regular Model The failure mode of the $\operatorname{Uniform}(0, \theta)$ model is worth examining carefully because it illustrates precisely which hypothesis does the work in the regular case. The density is \begin{align*} f(x, \theta) = \frac{1}{\theta} \mathbb{1}_{[0,\theta]}(x). 
\end{align*} The likelihood function based on observations $X_1, \ldots, X_n$ is \begin{align*} L_n(\theta) = \prod_{i=1}^n f(X_i, \theta) = \frac{1}{\theta^n} \mathbb{1}\{\theta \geq X_{(n)}\}, \end{align*} where $X_{(n)} = \max_{1 \leq i \leq n} X_i$ is the sample maximum. This function is $\theta^{-n}$ for $\theta \geq X_{(n)}$ and zero for $\theta < X_{(n)}$. It is decreasing on $[X_{(n)}, \infty)$, so the MLE is $\hat{\theta}_n = X_{(n)}$. [example: MLE for the Uniform Model] With $X_1, \ldots, X_n \sim \operatorname{Uniform}(0, \theta_0)$, the MLE $\hat{\theta}_n = X_{(n)}$ is consistent: since $X_{(n)} \leq \theta_0$ always, and $\mathbb{P}(X_{(n)} \leq t) = (t/\theta_0)^n \to 0$ for any $t < \theta_0$, we have $X_{(n)} \xrightarrow{\mathbb{P}} \theta_0$. The distribution of $\hat{\theta}_n$ is, however, far from normal. By direct computation, \begin{align*} \mathbb{P}(X_{(n)} \leq t) = \left(\frac{t}{\theta_0}\right)^n, \quad 0 \leq t \leq \theta_0, \end{align*} so the density of $X_{(n)}$ is $f_{X_{(n)}}(t) = n t^{n-1} / \theta_0^n$. In particular, \begin{align*} \mathbb{E}[X_{(n)}] = \frac{n}{n+1} \theta_0, \qquad \operatorname{Var}(X_{(n)}) = \frac{n \theta_0^2}{(n+1)^2(n+2)}. \end{align*} The bias is $\mathbb{E}[X_{(n)}] - \theta_0 = -\theta_0/(n+1)$, and the variance decays as $O(1/n^2)$ rather than $O(1/n)$. The rescaled error $n(\theta_0 - X_{(n)})$ converges in distribution to an exponential random variable $\operatorname{Exp}(\theta_0^{-1})$, not to a normal. The rate of convergence is $n$, not $\sqrt{n}$. Asymptotic normality at rate $\sqrt{n}$ fails entirely. [/example] The reason the regular theory breaks down is that differentiating $L_n(\theta)$ with respect to $\theta$ near $\theta_0$ is ill-posed: the likelihood function has a jump discontinuity at $\theta = X_{(n)}$, and the score function $\nabla_\theta \log L_n(\theta)$ does not exist at the MLE. 
The second-order Taylor expansion of the log-likelihood, the engine of the asymptotic normality proof, cannot be applied. ### Boundary Effects A separate failure of asymptotic normality occurs when the true parameter is at the boundary of the parameter space. Consider the model $\mathcal{N}(\theta, 1)$ with $\theta \in [0, \infty)$. The constrained MLE is $\hat{\theta}_n = \max(\bar{X}_n, 0)$. At an interior point $\theta_0 > 0$, it coincides with the sample mean $\bar{X}_n$ with probability tending to one, and $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, 1)$ as expected. But when $\theta_0 = 0$, the sample mean $\bar{X}_n$ is negative with probability $1/2$, and any value $\theta < 0$ is outside the parameter space. In that case $\sqrt{n} \hat{\theta}_n$ is distributed as $\max(Z, 0)$ where $Z \sim N(0, 1)$: a mixture that puts mass $1/2$ at zero and otherwise follows the half-normal law of $|Z|$. The limit is not a normal distribution. ### The Hodges Estimator and the Limits of Efficiency A more subtle cautionary example involves the Hodges estimator. Given an asymptotically normal estimator $\hat{\theta}_n$ in a model on $\mathbb{R}$, define \begin{align*} \tilde{\theta}_n = \begin{cases} \hat{\theta}_n & \text{if } |\hat{\theta}_n| > n^{-1/4}, \\ 0 & \text{otherwise.} \end{cases} \end{align*} At any fixed $\theta_0 \neq 0$, the event $|\hat{\theta}_n| \leq n^{-1/4}$ has probability tending to zero, so $\tilde{\theta}_n$ equals $\hat{\theta}_n$ with probability tending to one and has the same asymptotic variance $I(\theta_0)^{-1}$. But at $\theta_0 = 0$, the estimator $\hat{\theta}_n$ is of order $n^{-1/2}$, which is much smaller than the threshold $n^{-1/4}$. With probability tending to one, $\tilde{\theta}_n$ is therefore set to zero, so $\sqrt{n}\,\tilde{\theta}_n \xrightarrow{\mathbb{P}} 0$ and the asymptotic variance is *zero*, better than $I(0)^{-1}$. This seems to contradict the claim that $I(\theta_0)^{-1}$ is the best achievable variance. The resolution is that the pointwise convergence criterion used in the definition of asymptotic efficiency is too weak: it allows estimators that are superefficient at isolated points by doing worse at nearby parameter values in ways the pointwise limit misses.
A rigorous theory of optimality — using minimax criteria over neighborhoods of $\theta_0$ — closes this loophole. The course will return to this point when discussing minimax estimation. ## The Delta Method The asymptotic normality of the MLE becomes most useful in applications when combined with the Delta method, which propagates distributional limits through smooth transformations. The need for such a result arises constantly. Suppose $X_1, \ldots, X_n \sim \operatorname{Exp}(\theta_0)$ and we want to estimate the mean $\mu = 1/\theta_0$ of the distribution. The MLE of $\theta_0$ is $\hat{\theta}_n = n / \sum_{i=1}^n X_i$, and asymptotic normality gives $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$. The plug-in MLE of the mean is $\hat{\mu}_n = 1/\hat{\theta}_n$. What is its limiting distribution? The Delta method answers this systematically. [quotetheorem:1861] The hypothesis $\nabla_\theta \Phi(\theta_0) \neq 0$ is essential: if the gradient vanishes at $\theta_0$, the leading-order term in the Taylor expansion is zero and the $\sqrt{n}$ rate is no longer the right scale. In that case the distributional limit requires a higher-order expansion and the limit is no longer normal in general. The condition that $\Phi$ is continuously differentiable at $\theta_0$ is also necessary: if $\Phi$ has a corner or a discontinuity at $\theta_0$, the linearization argument fails. [citeproof:1861] ### Application to the Plug-In MLE When $\hat{\theta}_n$ is the MLE under asymptotic normality, the limiting distribution is $Z \sim N(0, I(\theta_0)^{-1})$, and the Delta method gives \begin{align*} \sqrt{n}\bigl(\Phi(\hat{\theta}_n) - \Phi(\theta_0)\bigr) \xrightarrow{d} N\!\left(0,\ \nabla_\theta \Phi(\theta_0)^\top I(\theta_0)^{-1} \nabla_\theta \Phi(\theta_0)\right). 
\end{align*} In dimension $p = 1$ this simplifies to \begin{align*} \sqrt{n}\bigl(\Phi(\hat{\theta}_n) - \Phi(\theta_0)\bigr) \xrightarrow{d} N\!\left(0,\ \Phi'(\theta_0)^2 I(\theta_0)^{-1}\right). \end{align*} This confirms that the plug-in MLE $\Phi(\hat{\theta}_n)$ is **asymptotically efficient** for estimating $\Phi(\theta_0)$: its limiting covariance matches the Cramér-Rao bound for the reparametrised model. ## Profile Likelihood and Multiparameter Plug-In Estimation When the parameter space factors as $\Theta = \Theta_1 \times \Theta_2$ and we care only about the first coordinate $\phi = \theta_1$, how do we reduce the problem to a function of $\theta_1$ alone? The profile likelihood provides a canonical answer by optimising out the nuisance parameter. [definition: Profile Likelihood] For $\Theta = \Theta_1 \times \Theta_2$ and $\theta = (\theta_1, \theta_2)^\top$, the **profile likelihood** for $\phi(\theta) = \theta_1$ is defined by \begin{align*} L^{(p)}(\theta_1) = \sup_{\theta_2 \in \Theta_2} L\!\left((\theta_1, \theta_2)^\top\right). \end{align*} [/definition] The profile likelihood removes the nuisance parameter $\theta_2$ by optimising it out for each fixed $\theta_1$. Maximising $L^{(p)}(\theta_1)$ over $\theta_1$ is equivalent to maximising the full likelihood: the maximiser is the first coordinate of $\hat{\theta}_n^{\text{MLE}}$. More generally, for any $\Phi : \Theta \to \mathbb{R}^k$ and the reparametrised family $\{f(\cdot, \phi) : \phi = \Phi(\theta),\, \theta \in \Theta\}$, the MLE in the new parametrisation is $\Phi(\hat{\theta}_n^{\text{MLE}})$. This is the equivariance property of the MLE, and it is what makes the plug-in principle natural. [definition: Plug-In MLE] For a statistical model $\{f(\cdot, \theta) : \theta \in \Theta\}$ and a functional $\Phi : \Theta \to \mathbb{R}^k$, the **plug-in MLE** of $\Phi(\theta_0)$ is the estimator $\Phi(\hat{\theta}_n^{\text{MLE}})$. 
[/definition] The plug-in principle is compelling in practice: rather than solving a new estimation problem for each quantity of interest, one simply applies the transformation to the MLE. The Delta method then guarantees that the asymptotic normality and efficiency of $\hat{\theta}_n$ are inherited by $\Phi(\hat{\theta}_n)$, at least when $\Phi$ is smooth. [example: Exponential Mean Estimation] Let $X_1, \ldots, X_n \sim \operatorname{Exp}(\theta_0)$ with $\theta_0 > 0$. The mean of the distribution is $\mu = \Phi(\theta_0) = 1/\theta_0$. The Fisher information is $I(\theta_0) = 1/\theta_0^2$ (since the log-density is $\log \theta - \theta x$ and its second derivative with respect to $\theta$ is $-\theta^{-2}$, so $I(\theta_0) = \theta_0^{-2}$). The MLE is $\hat{\theta}_n = 1/\bar{X}_n$, and the plug-in MLE of $\mu$ is $\hat{\mu}_n = \Phi(\hat{\theta}_n) = \bar{X}_n$. Applying the Delta method with $\Phi'(\theta_0) = -1/\theta_0^2$: \begin{align*} \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N\!\left(0,\, \frac{1}{\theta_0^4} \cdot \theta_0^2\right) = N\!\left(0,\, \frac{1}{\theta_0^2}\right) = N(0, \mu^2). \end{align*} This can be verified directly: $\bar{X}_n$ has variance $1/(n\theta_0^2) = \mu^2/n$, consistent with the limit. The plug-in MLE achieves asymptotic variance $\mu^2$, which matches the Cramér-Rao bound $(\Phi'(\theta_0))^2 I(\theta_0)^{-1} = \theta_0^{-4} \cdot \theta_0^2 = \theta_0^{-2} = \mu^2$. [/example] ## Observed vs Expected Fisher Information In practice, using the asymptotic normality result for inference requires knowledge of $I(\theta_0)$. Since $\theta_0$ is unknown, one must substitute an estimate. Two natural choices arise, and understanding which to use is an important practical question. The **expected Fisher information** $I(\theta_0)$ is defined as $-\mathbb{E}_{\theta_0}[\nabla^2_\theta \log f(X, \theta_0)]$. 
It is estimated by plugging in $\hat{\theta}_n$: \begin{align*} I(\hat{\theta}_n) = -\mathbb{E}_{\hat{\theta}_n}[\nabla^2_\theta \log f(X, \hat{\theta}_n)]. \end{align*} The **observed Fisher information** is the realised negative Hessian of the log-likelihood, evaluated at the MLE: \begin{align*} \hat{I}_n = -\frac{1}{n} \nabla^2_\theta \ell_n(\hat{\theta}_n) = -\frac{1}{n} \sum_{i=1}^n \nabla^2_\theta \log f(X_i, \hat{\theta}_n). \end{align*} By the law of large numbers and the second-order Fisher information identity, $\hat{I}_n \xrightarrow{\mathbb{P}} I(\theta_0)$. Both $I(\hat{\theta}_n)$ and $\hat{I}_n$ are therefore consistent estimators of $I(\theta_0)$, and by Slutsky's lemma, substituting either into the asymptotic normality statement preserves the limiting distribution. The observed Fisher information $\hat{I}_n$ is often preferred in practice because it uses only the data actually observed, without requiring computation of an expectation that may be analytically intractable. When the model is exponential family, both coincide at the MLE. In non-exponential-family models, $\hat{I}_n$ is typically more stable numerically. The consistency of $\hat{I}_n$ for $I(\theta_0)$ means that confidence intervals and hypothesis tests based on the asymptotic normality of the MLE can be implemented purely from the data, without knowing $\theta_0$. A $(1-\alpha)$-level approximate confidence interval for $\theta_0$ in the scalar case is \begin{align*} \left[\hat{\theta}_n - z_{\alpha/2} \frac{1}{\sqrt{n \hat{I}_n}},\ \hat{\theta}_n + z_{\alpha/2} \frac{1}{\sqrt{n \hat{I}_n}}\right], \end{align*} where $z_{\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal. This interval has coverage converging to $1 - \alpha$ as $n \to \infty$, by Slutsky's lemma applied to the pivot $\sqrt{n \hat{I}_n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0,1)$. 
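
The interval above can be sketched numerically. The following is a minimal illustration, assuming the $\operatorname{Exp}(\theta)$ model in its rate parametrisation, where $\log f(x, \theta) = \log\theta - \theta x$ and hence $\hat{I}_n = 1/\hat{\theta}_n^2$. The function names `wald_interval_exponential` and `empirical_coverage` are hypothetical helpers introduced here for illustration, not part of the course material.

```python
import math
import random
from statistics import NormalDist


def wald_interval_exponential(xs, alpha=0.05):
    """Approximate (1 - alpha) Wald interval for the rate of an Exp(theta) sample."""
    n = len(xs)
    theta_hat = n / sum(xs)  # MLE of the rate
    # Observed information: -(1/n) * sum of d^2/dtheta^2 [log(theta) - theta * x],
    # evaluated at theta_hat; for this model it equals 1 / theta_hat^2.
    obs_info = 1.0 / theta_hat ** 2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2)-quantile of N(0, 1)
    half_width = z / math.sqrt(n * obs_info)
    return theta_hat - half_width, theta_hat + half_width


def empirical_coverage(theta0, n, reps, alpha=0.05, seed=0):
    """Fraction of simulated intervals that contain the true rate theta0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.expovariate(theta0) for _ in range(n)]
        lo, hi = wald_interval_exponential(xs, alpha)
        hits += lo <= theta0 <= hi
    return hits / reps
```

Calling `empirical_coverage(2.0, 200, 2000)` estimates how often the interval contains the true rate. For moderately large $n$ the result should be close to the nominal $0.95$, although the asymptotic statement itself gives no guarantee at any fixed sample size.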
With firm confidence in the MLE's asymptotic properties, we can now construct hypothesis tests and confidence intervals that have optimal coverage and power in large samples. # 9. Asymptotic Inference with the MLE ## Asymptotic Tests from the MLE The asymptotic normality of the MLE, established in the preceding chapter, is more than a theoretical curiosity about limiting distributions. It is a practical engine for inference: it lets us build confidence regions for $\theta_0$ and test hypotheses about it, all without knowing the true parameter. The central question driving this chapter is how to turn the asymptotic statement $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$ into usable statistical procedures when $I(\theta_0)$ itself is unknown. The answer involves three complementary test statistics — the Wald statistic, the likelihood ratio statistic, and the score statistic — each exploiting a different aspect of the likelihood geometry. ## The Wald Confidence Interval Having established that the MLE is asymptotically normal, the immediate desire is to build confidence intervals. But a difficulty arises before we can apply the normal approximation: the variance of the limiting distribution involves $I(\theta_0)$, the Fisher information at the true parameter, which is unknown. To see where the problem comes from, recall from the asymptotic normality theorem that for the $j$-th coordinate $\hat{\theta}_{n,j}$ of the MLE, with $e_j$ denoting the $j$-th canonical basis vector of $\mathbb{R}^p$, \begin{align*} \sqrt{n}(\hat{\theta}_{n,j} - \theta_{0,j}) \xrightarrow{d} N(0, (I^{-1}(\theta_0))_{jj}), \end{align*} where $(I^{-1}(\theta_0))_{jj}$ is the $j$-th diagonal entry of the inverse Fisher information matrix. 
If this quantity were known, the interval construction reduces to a direct application of the CLT and the continuous mapping theorem: set $z_\alpha$ such that $\mathbb{P}(|Z| \leq z_\alpha) = 1 - \alpha$ for $Z \sim N(0,1)$, and define \begin{align*} C_n = \left\{ \nu \in \mathbb{R} : |\nu - \hat{\theta}_{n,j}| \leq (I^{-1}(\theta_0))_{jj}^{1/2} \cdot \frac{z_\alpha}{\sqrt{n}} \right\}. \end{align*} One then verifies that $\mathbb{P}_{\theta_0}(\theta_{0,j} \in C_n) \to 1 - \alpha$ by the limiting distribution and the continuous mapping theorem. The issue is that to use this interval in practice, one needs $(I^{-1}(\theta_0))_{jj}$, which depends on the unknown $\theta_0$. The resolution is to estimate $I(\theta_0)$ from data. [definition: Observed Fisher Information] Let $X_1, \ldots, X_n$ be i.i.d. with density $f(\cdot, \theta)$. The **observed Fisher information matrix** $i_n(\theta)$ is the $p \times p$ matrix defined by \begin{align*} i_n(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta \log f(X_i, \theta) \cdot \nabla_\theta \log f(X_i, \theta)^\top. \end{align*} The estimator $\hat{i}_n = i_n(\hat{\theta}_\text{MLE})$ is called the **plug-in observed Fisher information**. [/definition] This is simply the sample average of the outer products of the score at the observed data points, evaluated at the MLE. Contrast this with the theoretical Fisher information $I(\theta) = \mathbb{E}_\theta[\nabla_\theta \log f(X, \theta) \cdot \nabla_\theta \log f(X, \theta)^\top]$, which requires knowledge of $\theta$. The observed version replaces the expectation by an empirical average and plugs in $\hat{\theta}_\text{MLE}$ for the unknown $\theta_0$. [quotetheorem:1862] [citeproof:1862] Two remarks on this result are worth making. 
First, the uniform LLN — $\sup_{\theta \in \Theta} \|i_n(\theta) - I(\theta)\| \xrightarrow{\mathbb{P}_{\theta_0}} 0$ — is genuinely necessary, not merely convenient: a pointwise LLN at a single $\theta$ would be insufficient because $\hat{\theta}_n$ is itself random and converging to $\theta_0$, so the argument must control the approximation error uniformly over a neighbourhood. Without the uniform bound, the decomposition $i_n(\hat{\theta}_n) - I(\hat{\theta}_n)$ could fail to vanish. Second, the theorem says nothing about the rate at which $\hat{i}_n$ approaches $I(\theta_0)$: the consistency statement is qualitative, and quantitative control of the approximation error requires additional moment assumptions. An alternative consistent estimator of $I(\theta_0)$ is $\hat{j}_n = j_n(\hat{\theta}_\text{MLE})$, where $j_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \nabla^2_\theta \log f(X_i, \theta)$. This uses the negative sample average of the Hessian of the log-likelihood. It is consistent by the same argument, exploiting the identity $I(\theta) = -\mathbb{E}_\theta[\nabla^2_\theta \log f(X, \theta)]$ from the second-order Fisher information representation. In practice, $\hat{i}_n$ is often preferred because it requires only first-order derivatives, while $\hat{j}_n$ requires computing second-order derivatives of the log-likelihood, which can be expensive. ## The Wald Statistic and Confidence Ellipsoids The scalar confidence interval generalizes to the multiparameter setting, but a naive coordinate-by-coordinate approach misses the correlation structure between components of $\hat{\theta}_n$. What shape should a multivariate confidence region take? The correct approach uses a quadratic form that respects the covariance geometry. [definition: Wald Statistic] For $\theta \in \Theta \subseteq \mathbb{R}^p$, the **Wald statistic** is defined as \begin{align*} W_n(\theta) = n(\hat{\theta}_\text{MLE} - \theta)^\top \hat{i}_n (\hat{\theta}_\text{MLE} - \theta). 
\end{align*} [/definition] Since $\hat{i}_n$ is a positive semidefinite matrix, $W_n(\theta)$ is a nonnegative quadratic form in the deviation $\hat{\theta}_\text{MLE} - \theta$. When $\hat{i}_n$ is positive definite, its level sets $\{\theta : W_n(\theta) \leq c\}$ are ellipsoids centered at $\hat{\theta}_\text{MLE}$, with axes and radii determined by $\hat{i}_n$. [quotetheorem:1863] [citeproof:1863] The condition that $\hat{i}_n$ converges to a positive definite limit $I(\theta_0)$ is what makes the ellipsoids nondegenerate and the test well-defined. If the Fisher information were singular at $\theta_0$, which happens in overparameterized models where some directions in parameter space carry no information, the confidence ellipsoid would degenerate and the chi-squared approximation would fail. The statistic $W_n(\theta)$ also yields a test for the hypothesis $H_0 : \theta = \theta_0$ against $H_1 : \theta \in \Theta \setminus \{\theta_0\}$. Since $\mathbb{P}(W_n(\theta_0) > \xi_\alpha) \to \alpha$ under $H_0$, the decision rule $\psi_n = \mathbb{1}\{W_n(\theta_0) > \xi_\alpha\}$ has asymptotic type-I error $\alpha$. [example: Gaussian Mean with Unknown Variance] Suppose $X_i \sim N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown, and we want a confidence ellipsoid for $\theta_0 = (\mu_0, \sigma^2_0)^\top$. The MLE is $\hat{\mu} = \bar{X}_n$ and $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X}_n)^2$. The Fisher information matrix is diagonal with entries $I_{11}(\theta) = 1/\sigma^2$ and $I_{22}(\theta) = 1/(2\sigma^4)$. The plug-in observed Fisher information $\hat{i}_n$ is consistent for $I(\theta_0)$. The Wald confidence ellipsoid $\{\theta : W_n(\theta) \leq \xi_\alpha\}$ is an ellipse in the $(\mu, \sigma^2)$-plane that is approximately axis-aligned for large $n$, with semi-axes roughly proportional to $\hat{\sigma}$ in the $\mu$-direction and to $\sqrt{2}\,\hat{\sigma}^2$ in the $\sigma^2$-direction. It is therefore elongated in the $\sigma^2$-direction whenever $\hat{\sigma} > 1/\sqrt{2}$, reflecting the smaller information about the variance in that regime. For $p = 2$ and $\alpha = 0.05$, $\xi_{0.05}$ is the $0.95$-quantile of $\chi^2_2$, approximately $5.99$.
[/example] ## The Likelihood Ratio Test and Wilks' Theorem The Wald test is not the only way to test $H_0 : \theta = \theta_0$. An alternative approach compares the maximized likelihood under the full model to the likelihood at the null value $\theta_0$. Why consider an alternative at all? The difficulty is that the Wald test is not invariant under reparametrisation: testing $H_0: \theta = \theta_0$ and $H_0: g(\theta) = g(\theta_0)$ for a smooth bijection $g$ can give different Wald statistics. The likelihood ratio test, by contrast, is based solely on the log-likelihood values and is invariant by construction. [definition: Likelihood Ratio Statistic] For a hypothesis testing problem with parameter space $\Theta \subseteq \mathbb{R}^p$ and null parameter set $\Theta_0 \subseteq \Theta$, the **likelihood ratio statistic** is \begin{align*} \Lambda_n(\Theta, \Theta_0) = 2 \log \frac{\sup_{\theta \in \Theta} \prod_{i=1}^n f(X_i, \theta)}{\sup_{\theta \in \Theta_0} \prod_{i=1}^n f(X_i, \theta)} = 2\bigl(\ell_n(\hat{\theta}_\text{MLE}) - \ell_n(\hat{\theta}_\text{MLE,0})\bigr), \end{align*} where $\hat{\theta}_\text{MLE,0}$ is the maximum likelihood estimator restricted to $\Theta_0$, and $\ell_n(\theta) = \sum_{i=1}^n \log f(X_i, \theta)$ is the log-likelihood. [/definition] The factor of 2 is conventional: it is chosen precisely so that the limiting distribution is a standard $\chi^2$ without extra constants. The statistic $\Lambda_n$ is always nonnegative, and large values indicate strong evidence against $H_0$. [quotetheorem:1864] [citeproof:1864] The regularity assumptions cannot be dropped here. The key step is the Taylor expansion of the log-likelihood, which requires twice-differentiability of $\theta \mapsto \log f(x, \theta)$ and uniform control of the Hessian. For the uniform distribution on $[0, \theta]$, the likelihood is not continuous in $\theta$, the MLE is at the boundary, and the chi-squared approximation fails entirely. 
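
The failure in the uniform model can be checked by simulation. For $\operatorname{Uniform}(0, \theta)$ the MLE is $X_{(n)}$ and $\ell_n(\theta) = -n\log\theta$ for $\theta \geq X_{(n)}$, so under $H_0 : \theta = \theta_0$ the statistic is $\Lambda_n = 2n\log(\theta_0 / X_{(n)})$. Since $(X_{(n)}/\theta_0)^n$ is uniform on $[0,1]$, $\Lambda_n$ is twice a standard exponential variable, that is $\chi^2_2$, rather than the $\chi^2_1$ that a naive application of Wilks' theorem with $p = 1$ would suggest. The sketch below is a rough check of this under stated assumptions; the function names are hypothetical.

```python
import math
import random


def lr_statistic_uniform(xs, theta0):
    """Likelihood ratio statistic for H0: theta = theta0 in the Uniform(0, theta) model.

    Assumes max(xs) <= theta0, so that the likelihood at theta0 is positive.
    """
    n = len(xs)
    return 2.0 * n * math.log(theta0 / max(xs))


def simulate_lr_uniform(theta0, n, reps, seed=0):
    """Draw `reps` samples of size n under H0 and return the statistics."""
    rng = random.Random(seed)
    return [
        lr_statistic_uniform([rng.uniform(0.0, theta0) for _ in range(n)], theta0)
        for _ in range(reps)
    ]


def rejection_rate(stats, threshold=3.841):
    """Share of statistics above the approximate 0.95-quantile of chi-squared(1)."""
    return sum(s > threshold for s in stats) / len(stats)
```

With the $\chi^2_1$ critical value $3.84$, the simulated type-I error should be near $e^{-1.92} \approx 0.147$ instead of the nominal $0.05$, and the sample mean of the statistics should be near $2$ rather than $1$.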
When $\Theta_0$ has dimension $p_0 < p$ (a composite null, such as $\Theta_0 = \{\theta \in \Theta : \theta_1 = c\}$ for some constraint), the same proof applies with minor modifications, and the limiting distribution under $H_0$ becomes $\chi^2_{p - p_0}$. The difference in degrees of freedom $p - p_0$ reflects the number of constraints imposed by $H_0$. [example: Testing a Poisson Rate] Suppose $X_i \sim \text{Poi}(\theta)$ and we test $H_0 : \theta = \theta_0$ against $H_1 : \theta \neq \theta_0$. The MLE is $\hat{\theta}_n = \bar{X}_n$. The log-likelihood is $\ell_n(\theta) = -n\theta + n\bar{X}_n \log\theta$ (up to terms not depending on $\theta$). The likelihood ratio statistic is \begin{align*} \Lambda_n = 2\ell_n(\hat{\theta}_n) - 2\ell_n(\theta_0) = 2n\bigl(\theta_0 - \hat{\theta}_n + \hat{\theta}_n \log(\hat{\theta}_n/\theta_0)\bigr). \end{align*} By Wilks' theorem, $\Lambda_n \xrightarrow{d} \chi^2_1$ under $H_0$. At the $5\%$ level with $p = 1$, we reject when $\Lambda_n > 3.84$ (the $0.95$-quantile of $\chi^2_1$). To see why $\Lambda_n$ is sensible, note that the function $h(x) = x_0 - x + x\log(x/x_0)$ satisfies $h(x_0) = 0$, $h'(x_0) = 0$ and $h''(x_0) = 1/x_0$, so a second-order Taylor expansion gives $h(x) \approx (x - x_0)^2/(2x_0)$ for $x$ close to $x_0$. Hence, for $\hat{\theta}_n$ close to $\theta_0$, $\Lambda_n \approx n(\hat{\theta}_n - \theta_0)^2/\theta_0$, which is the Wald statistic with $I(\theta_0) = 1/\theta_0$ in place of $\hat{i}_n$. The LRT and Wald test are asymptotically equivalent. [/example] ## The Score Test Both the Wald test and the LRT require computing the MLE $\hat{\theta}_n$, which may be numerically expensive. Is there a test that avoids optimizing the likelihood altogether? The score test offers exactly this: it requires only evaluating the score function at the null value $\theta_0$, not optimizing over $\Theta$. The idea stems from a key observation: under $H_0 : \theta = \theta_0$, the score $S_n(\theta_0) = \nabla_\theta \ell_n(\theta_0)$ should be close to zero.
The MLE satisfies $S_n(\hat{\theta}_n) = 0$ by the first-order condition, but $S_n(\theta_0)$ will typically be nonzero even when $\theta_0$ is true — it is a random vector with mean zero and covariance $n I(\theta_0)$. The score test checks whether $S_n(\theta_0)$ is "small enough" to be consistent with $\theta_0$ being the true parameter. [definition: Score Statistic] The **score statistic** (also called the **Rao statistic**) for testing $H_0 : \theta = \theta_0$ is \begin{align*} T_n(\theta_0) = \frac{1}{n} S_n(\theta_0)^\top I(\theta_0)^{-1} S_n(\theta_0) = \frac{1}{n} \nabla_\theta \ell_n(\theta_0)^\top I(\theta_0)^{-1} \nabla_\theta \ell_n(\theta_0), \end{align*} where $I(\theta_0)$ may be replaced by $i_n(\theta_0)$ (the observed Fisher information at the null value). [/definition] The score statistic is a quadratic form in $S_n(\theta_0)/\sqrt{n}$, normalised by the covariance matrix $I(\theta_0)$. Under $H_0$, the score $S_n(\theta_0)/\sqrt{n}$ is a sum of i.i.d. mean-zero random vectors, so the CLT will drive it toward a normal with covariance $I(\theta_0)$; normalising by $I(\theta_0)^{-1}$ then produces a chi-squared. The key regularity input is that $I(\theta_0)$ is finite and positive definite, which makes the normalisation well-defined and ensures the CLT scaling is correct. [quotetheorem:1865] [citeproof:1865] The score test enjoys a particularly important practical advantage: computing $T_n(\theta_0)$ requires only evaluating the score $\nabla_\theta \ell_n$ at the single point $\theta_0$, and then inverting (or estimating) $I(\theta_0)$. No optimization over $\Theta$ is needed. What the score test sacrifices is that it only "looks" at the model from the perspective of $\theta_0$. Under the alternative, the score at $\theta_0$ will be large, but the test has no direct information about where in $\Theta$ the true parameter lies. The Wald test and LRT, by contrast, both involve $\hat{\theta}_n$ explicitly. 
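
To make the computational point concrete, here is a minimal sketch for a model in which the MLE has no closed form. The choice of the Cauchy location model, with density $f(x, \theta) = 1/\bigl(\pi(1 + (x-\theta)^2)\bigr)$, is an assumption made purely for illustration and does not come from the text above. Its score is $2(x-\theta)/(1 + (x-\theta)^2)$ and its Fisher information is $I(\theta) = 1/2$ for every $\theta$. All function and constant names are hypothetical.

```python
import math
import random

CAUCHY_FISHER_INFO = 0.5  # Fisher information of the unit-scale Cauchy location model
CHI2_1_Q95 = 3.841        # approximate 0.95-quantile of chi-squared with 1 d.o.f.


def score_statistic_cauchy(xs, theta0):
    """Score statistic T_n(theta0) = S_n(theta0)^2 / (n * I(theta0)); no MLE needed."""
    n = len(xs)
    # Derivative in theta of log f(x, theta) = -log(pi) - log(1 + (x - theta)^2)
    s_n = sum(2.0 * (x - theta0) / (1.0 + (x - theta0) ** 2) for x in xs)
    return s_n ** 2 / (n * CAUCHY_FISHER_INFO)


def rejection_rate_under_null(theta0, n, reps, seed=0):
    """Simulated type-I error of the level-0.05 score test when theta0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Inverse-CDF sampling of a Cauchy variable centred at theta0
        xs = [theta0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        rejections += score_statistic_cauchy(xs, theta0) > CHI2_1_Q95
    return rejections / reps
```

The statistic is a single pass over the data at $\theta_0$, whereas a Wald or likelihood ratio test in this model would first require a numerical search for $\hat{\theta}_n$. Under $H_0$ the simulated rejection rate should be near $0.05$ for moderate $n$.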
The regularity hypotheses cannot be relaxed without cost. The CLT step requires that the score contributions $\nabla_\theta \log f(X_i, \theta_0)$ be i.i.d. with mean zero and finite covariance $I(\theta_0)$. In non-regular models this can fail: for the uniform distribution on $[0, \theta]$, where the support depends on $\theta$, the score equals the constant $-1/\theta_0$ on the support, so it does not have mean zero, and the chi-squared approximation breaks down entirely. Even when the regularity conditions are satisfied, the convergence $T_n(\theta_0) \xrightarrow{d} \chi^2_p$ is a qualitative statement: it does not specify how large $n$ must be for the $\chi^2_p$ quantiles to be accurate, nor does it bound the finite-sample type-I error. The chi-squared approximation can be poor when $n$ is small relative to $p$, or when $I(\theta_0)$ is nearly singular. ## The Relationship Between the Three Tests The Wald test, the likelihood ratio test, and the score test appear to be three very different procedures, yet they share the same asymptotic chi-squared distribution under $H_0$ and the same asymptotic power under local alternatives. What is the connection? The connection can be seen most cleanly by examining the log-likelihood function near $\theta_0$. A second-order Taylor expansion of $\ell_n(\theta_0)$ around the MLE $\hat{\theta}_n$ gives \begin{align*} \ell_n(\theta_0) \approx \ell_n(\hat{\theta}_n) + \nabla_\theta \ell_n(\hat{\theta}_n)^\top (\theta_0 - \hat{\theta}_n) + \frac{1}{2}(\theta_0 - \hat{\theta}_n)^\top \nabla^2_\theta \ell_n(\hat{\theta}_n)(\theta_0 - \hat{\theta}_n). \end{align*} Since $\nabla_\theta \ell_n(\hat{\theta}_n) = 0$ and $\nabla^2_\theta \ell_n(\hat{\theta}_n) \approx -n I(\theta_0)$, this gives \begin{align*} \Lambda_n = 2(\ell_n(\hat{\theta}_n) - \ell_n(\theta_0)) \approx n(\hat{\theta}_n - \theta_0)^\top I(\theta_0)(\hat{\theta}_n - \theta_0), \end{align*} which is the Wald statistic with $I(\theta_0)$ in place of $\hat{i}_n$.
Similarly, from the first-order Taylor expansion of the score: \begin{align*} 0 = S_n(\hat{\theta}_n) \approx S_n(\theta_0) + \nabla^2_\theta \ell_n(\theta_0)(\hat{\theta}_n - \theta_0) \approx S_n(\theta_0) - nI(\theta_0)(\hat{\theta}_n - \theta_0), \end{align*} which gives $\hat{\theta}_n - \theta_0 \approx \frac{1}{n}I(\theta_0)^{-1} S_n(\theta_0)$. Substituting into the Wald statistic yields $W_n(\theta_0) \approx T_n(\theta_0)$. [quotetheorem:1866] The three tests are therefore exchangeable for large $n$, and the choice between them is primarily computational and philosophical. The **Wald test** is natural when one has a precise estimate $\hat{\theta}_n$ in hand and wants to assess how far it is from the null. The **likelihood ratio test** is natural from the likelihood principle: it directly measures the improvement in fit achieved by the MLE over the null, and it is invariant under reparametrisation (since both $\ell_n(\hat{\theta}_n)$ and $\ell_n(\theta_0)$ depend only on the likelihood values). The **score test** is natural when the null model is "easy" and the full MLE is "hard," since it only evaluates the likelihood at $\theta_0$. Despite their asymptotic equivalence, the three tests can differ meaningfully in finite samples. The Wald test is sensitive to reparametrisation: testing $H_0: \theta = \theta_0$ and $H_0: g(\theta) = g(\theta_0)$ for a smooth bijection $g$ can give different Wald statistics. The likelihood ratio test is invariant to such reparametrisations. The score test becomes important when testing whether a constrained model is adequate, and in generalized linear models where score equations can be evaluated without full optimization. In the frequentist testing literature, the three tests are known as the "Holy Trinity" of asymptotic tests, attributed to Wald, Wilks, and Rao respectively. 
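
The agreement and disagreement of the three statistics can be examined numerically. The sketch below uses the Poisson model, for which $\hat{\theta}_n = \bar{X}_n$, the per-observation score is $x/\theta - 1$, and $I(\theta) = 1/\theta$. The Wald statistic uses the outer-product estimate $\hat{i}_n$ as defined earlier in this chapter. The sampler `poisson_draw` and the function `three_statistics_poisson` are hypothetical helpers introduced for illustration, and the code assumes $\bar{X}_n > 0$.

```python
import math
import random


def poisson_draw(rng, lam):
    """One Poisson(lam) draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def three_statistics_poisson(xs, theta0):
    """Return (Wald, likelihood ratio, score) statistics for H0: theta = theta0."""
    n = len(xs)
    theta_hat = sum(xs) / n  # MLE; assumed strictly positive
    # Outer-product information estimate: average squared score at the MLE
    i_hat = sum((x / theta_hat - 1.0) ** 2 for x in xs) / n
    wald = n * (theta_hat - theta0) ** 2 * i_hat
    lrt = 2.0 * n * (theta0 - theta_hat + theta_hat * math.log(theta_hat / theta0))
    score = n * (theta_hat - theta0) ** 2 / theta0  # uses I(theta0) = 1 / theta0
    return wald, lrt, score
```

On the hypothetical counts `[3, 5, 4, 6, 2, 4, 5, 3]` with $\theta_0 = 3$, the three values are about $0.75$, $2.41$ and $2.67$: at $n = 8$ the statistics disagree noticeably, here mainly because the sample variance is well below the sample mean, which shrinks $\hat{i}_n$. On a large sample simulated under $H_0$, for instance with `poisson_draw`, the three values should be much closer to one another.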
For most purposes in this course, the likelihood ratio test is the preferred procedure: it is parametrisation-invariant, has a clean interpretation in terms of evidence, and the chi-squared approximation given by Wilks' theorem is often quite accurate even at moderate sample sizes. The classical approach to inference has dominated our discussion. We now pivot to an alternative paradigm: Bayesian statistics, which incorporates prior beliefs about parameters directly into the estimation process. # 10. Introduction to Bayesian Statistics The previous chapters developed the frequentist paradigm in depth: the MLE is consistent, asymptotically normal, and achieves the Cramér-Rao bound in the limit. But all of this rests on a single, fixed value $\theta_0$ — the true parameter — which the data were generated from, and which we estimate without ever assigning it a probability. This chapter introduces a fundamentally different philosophy. In the Bayesian framework, the parameter $\theta$ is itself treated as a random variable, and statistical inference proceeds by updating a prior belief about $\theta$ to a posterior belief in light of the data. ## Why Treat the Parameter as Random? Every frequentist estimator must face an uncomfortable question: when a confidence interval says "95% of intervals of this type contain $\theta_0$," what does that mean for the specific interval you computed from your data? Either $\theta_0$ is in it or it is not. The interval itself is random, but once you have it, no probability remains. For a practicing scientist, this is philosophically unsatisfying: the data are fixed, the interval is in front of you, and you want to say something about where $\theta$ is, not about a hypothetical repetition of the experiment. The Bayesian framework offers a direct answer to this discomfort. 
Instead of treating $\theta$ as a fixed unknown, one posits a probability distribution $\pi$ over $\Theta$ that encodes prior beliefs about $\theta$ before any data are observed. After observing data $X$, Bayes' theorem produces the posterior distribution $\Pi(\cdot \mid X)$, which is a genuine probability distribution over $\Theta$ given $X$. A 95% credible set $C_n$ satisfying $\Pi(C_n \mid X) = 0.95$ really does mean: conditional on the data I observed, 95% of my probability mass for $\theta$ falls in $C_n$. But the framework is not only a philosophical alternative. It is also a method. Even a frequentist who believes in the fixed-$\theta_0$ world can use the Bayesian machinery as a recipe for constructing estimators, confidence regions, and tests — and then evaluate these procedures by their frequentist properties. The Bernstein-von Mises theorem, which closes this chapter, is precisely such a bridge result: it shows that Bayesian credible sets are also valid frequentist confidence sets in large samples. Before building the general theory, it is instructive to examine a case where the Bayesian updating becomes a purely combinatorial calculation. [example: Finite Parameter Space and Bayes Rule] Suppose the parameter space is finite: $\Theta = \{\theta_1, \ldots, \theta_k\}$. The $k$ hypotheses $H_i : \theta = \theta_i$ are mutually exclusive and exhaustive. A statistician assigns prior probabilities $\pi_i = \mathbb{P}(H_i)$, with $\sum_{i=1}^k \pi_i = 1$. Under hypothesis $H_i$, the observation $X$ has distribution given by $\mathbb{P}(X = x \mid H_i) = f_i(x)$. Upon observing $X = x$, Bayes' theorem gives the posterior probability of $H_i$ as \begin{align*} \mathbb{P}(H_i \mid X = x) = \frac{\pi_i f_i(x)}{\sum_{j=1}^k \pi_j f_j(x)}. \end{align*} The denominator is just a normalizing constant; the posterior probability is proportional to the product of the prior and the likelihood. 
To decide between $H_i$ and $H_j$, one can compare the posterior ratio: \begin{align*} \frac{\mathbb{P}(H_i \mid X = x)}{\mathbb{P}(H_j \mid X = x)} = \frac{f_i(x)}{f_j(x)} \cdot \frac{\pi_i}{\pi_j}. \end{align*} This is the likelihood ratio $f_i(x)/f_j(x)$ multiplied by the prior odds $\pi_i/\pi_j$. If the prior is uniform ($\pi_i = 1/k$ for all $i$), the ratio reduces to the pure likelihood ratio, and Bayesian model selection coincides with maximum likelihood. When the prior is non-uniform, it acts as a modifier: a hypothesis with a very small prior weight $\pi_j$ requires considerably stronger evidence from the data to overcome that disadvantage. [/example] This simple example already reveals the Bayesian recipe: combine prior information with the likelihood. The rest of the section builds this into a precise framework for continuous parameter spaces. ## The Prior, the Likelihood, and the Posterior To set up the Bayesian model rigorously, one must specify a joint distribution on the product space $\mathcal{X} \times \Theta$. [definition: Bayesian Statistical Model] Let $\{f(\cdot, \theta) : \theta \in \Theta\}$ be a statistical model with sample space $\mathcal{X}$. A Bayesian statistical model is specified by a joint probability measure $Q$ on $\mathcal{X} \times \Theta$ with density (or mass function) \begin{align*} Q(x, \theta) = f(x, \theta)\, \pi(\theta), \end{align*} where $\pi : \Theta \to [0, \infty)$ is the **prior distribution** of $\theta$, satisfying $\int_\Theta \pi(\theta)\, d\theta = 1$ (or summing to 1 in the discrete case). Given an observation $X = x$, the **posterior distribution** of $\theta$ is the conditional distribution \begin{align*} \Pi(\theta \mid X = x) = \frac{f(x, \theta)\, \pi(\theta)}{\int_\Theta f(x, \theta')\, \pi(\theta')\, d\theta'}, \end{align*} provided the denominator $m(x) := \int_\Theta f(x, \theta')\, \pi(\theta')\, d\theta'$ is finite and positive. 
The function $m(x)$ is called the **marginal likelihood** or **evidence**. [/definition] [remark: The Denominator Is Just Normalization] The denominator $m(x) = \int_\Theta f(x, \theta')\, \pi(\theta')\, d\theta'$ does not depend on $\theta$, so when computing the posterior up to proportionality, it can be ignored. In practice, we write \begin{align*} \Pi(\theta \mid X) \propto f(X, \theta)\, \pi(\theta), \end{align*} and recognize the shape of the posterior from this proportionality relation alone. [/remark] What happens when we observe $n$ i.i.d. observations $X_1, \ldots, X_n$? The joint likelihood of the sample is $\prod_{i=1}^n f(X_i, \theta)$, and the posterior simply replaces the single observation likelihood by this product: \begin{align*} \Pi(\theta \mid X_1, \ldots, X_n) = \frac{\prod_{i=1}^n f(X_i, \theta)\, \pi(\theta)}{\int_\Theta \prod_{i=1}^n f(X_i, \theta')\, \pi(\theta')\, d\theta'}. \end{align*} This is nothing but Bayes' theorem applied to the joint distribution of the sample. The posterior is proportional to the product of the prior and the full sample likelihood, and the denominator is again a normalizing constant that can be computed last (or numerically). ## Conjugate Priors A central practical question is: for which choices of prior $\pi$ does the posterior have a tractable closed form? Computing the integral $m(x)$ in closed form is generically hard. This motivates conjugate priors. [definition: Conjugate Prior] In a statistical model $\{f(\cdot, \theta) : \theta \in \Theta\}$, a prior $\pi$ is called a **conjugate prior** if the posterior $\Pi(\cdot \mid X)$ belongs to the same parametric family of distributions as $\pi$. [/definition] Conjugate priors do not always exist, and when they do they need not be the most realistic choice. Their importance is computational and pedagogical: they make the Bayesian update explicit and allow the posterior to be read off by pattern-matching the proportionality. 
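When no closed form is available, the relation $\Pi(\theta \mid X) \propto f(X, \theta)\, \pi(\theta)$ can still be used directly in one dimension by normalizing numerically. The following is a minimal sketch, assuming a scalar parameter and an equally spaced grid; the function name `grid_posterior` and the data (7 successes in 10 Bernoulli trials, flat prior on $(0,1)$) are hypothetical choices made for illustration only.

```python
import math

def grid_posterior(log_likelihood, log_prior, grid):
    """Normalize likelihood * prior on an equally spaced grid (Riemann sum)."""
    step = grid[1] - grid[0]
    log_unnorm = [log_likelihood(t) + log_prior(t) for t in grid]
    shift = max(log_unnorm)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(v - shift) for v in log_unnorm]
    z = sum(unnorm) * step  # numerical stand-in for m(x), up to the factor exp(shift)
    return [u / z for u in unnorm]

# Hypothetical data: 7 successes in 10 Bernoulli trials, flat prior on (0, 1).
successes, trials = 7, 10
grid = [(k + 0.5) / 2000 for k in range(2000)]  # midpoints avoid theta = 0 and 1
density = grid_posterior(
    lambda t: successes * math.log(t) + (trials - successes) * math.log(1 - t),
    lambda t: 0.0,  # log of the flat prior
    grid,
)
step = grid[1] - grid[0]
total_mass = sum(density) * step
posterior_mean = sum(t * d for t, d in zip(grid, density)) * step
```

The normalizing constant is computed last, exactly as in the proportionality argument: only the shape of $f \cdot \pi$ in $\theta$ is needed until the final division. This approach does not scale to high-dimensional $\Theta$, which is one reason closed-form updates are attractive.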
The two canonical examples are the Gaussian-Gaussian model and the Beta-Binomial model. ### The Gaussian Model with Gaussian Prior The frequentist analysis of the Gaussian model in earlier chapters showed that the MLE for the mean is $\bar{X}_n$, with asymptotic variance $1/n$. But what if we have genuine prior knowledge that $\theta$ is close to zero — for example, because the parameter represents a small calibration offset? The Gaussian prior encodes this. [example: Gaussian Likelihood with Gaussian Prior] Let $X_1, \ldots, X_n$ be i.i.d.\ from $\mathcal{N}(\theta, 1)$ with unknown mean $\theta$. Place a Gaussian prior $\theta \sim \mathcal{N}(0, 1)$. The posterior numerator, as a function of $\theta$, is \begin{align*} \pi(\theta) \prod_{i=1}^n f(X_i, \theta) &\propto \exp\!\left(-\frac{\theta^2}{2}\right) \prod_{i=1}^n \exp\!\left(-\frac{(X_i - \theta)^2}{2}\right). \end{align*} Expanding the exponents and collecting all terms in $\theta$: \begin{align*} -\frac{\theta^2}{2} - \frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2 &= -\frac{\theta^2}{2} - \frac{n}{2}\theta^2 + \theta \sum_{i=1}^n X_i - \frac{1}{2}\sum_{i=1}^n X_i^2 \\ &= -\frac{n+1}{2}\theta^2 + n\bar{X}_n \theta + \text{const}, \end{align*} where "const" collects all terms not depending on $\theta$. Completing the square in $\theta$: \begin{align*} -\frac{n+1}{2}\left(\theta - \frac{n\bar{X}_n}{n+1}\right)^2 + \text{const}. \end{align*} This is the kernel of a Gaussian density. The posterior is therefore \begin{align*} \theta \mid X_1, \ldots, X_n \sim \mathcal{N}\!\left(\frac{n\bar{X}_n}{n+1},\, \frac{1}{n+1}\right). \end{align*} The posterior mean is $\bar{\theta}_n = n\bar{X}_n/(n+1)$, a shrinkage of the MLE $\hat{\theta}_n = \bar{X}_n$ toward zero (the prior mean). The posterior variance is $1/(n+1)$, which is smaller than the prior variance of 1, reflecting the information gained from the data. As $n \to \infty$, $\bar{\theta}_n \to \bar{X}_n$: the data dominate the prior. 
[/example] [remark: Prior Mean and Sample Mean] The posterior mean $n\bar{X}_n/(n+1)$ is a weighted average of the prior mean (zero) and the MLE ($\bar{X}_n$), with weights $1/(n+1)$ and $n/(n+1)$, proportional to the prior precision (here equal to 1) and the number of observations $n$ respectively. When $n = 0$, the posterior mean equals the prior mean, as expected. This interpolation between prior and data is the hallmark of conjugate Gaussian analysis. [/remark] ### The Beta-Binomial Model A coin-flipping experiment with unknown bias $\theta \in (0,1)$ provides the simplest discrete example. [example: Beta Prior with Binomial Likelihood] Let $X \mid \theta \sim \mathrm{Binomial}(n, \theta)$, so $f(x, \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}$ for $x \in \{0, 1, \ldots, n\}$. Place a $\mathrm{Beta}(\alpha, \beta)$ prior on $\theta$, with density \begin{align*} \pi(\theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1}, \quad \theta \in (0,1). \end{align*} The posterior numerator is \begin{align*} \pi(\theta) f(X, \theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta-1} \cdot \theta^X (1-\theta)^{n-X} = \theta^{\alpha + X - 1}(1-\theta)^{\beta + n - X - 1}. \end{align*} This is the kernel of a $\mathrm{Beta}(\alpha + X,\, \beta + n - X)$ density. The posterior is therefore \begin{align*} \theta \mid X \sim \mathrm{Beta}(\alpha + X,\, \beta + n - X). \end{align*} The parameters $\alpha$ and $\beta$ in the prior can be interpreted as "pseudo-observations": relative to the uniform prior $\mathrm{Beta}(1,1)$, they correspond to $\alpha - 1$ prior successes and $\beta - 1$ prior failures. The prior mean is $\alpha/(\alpha+\beta)$, and the posterior mean is $(\alpha + X)/(\alpha + \beta + n)$, again a weighted blend of the prior mean and the observed frequency $X/n$. As $n \to \infty$ with $X/n \to \theta_0$, the posterior concentrates around $\theta_0$. [/example] The two examples above illustrate a general pattern. Within exponential families, one can always find a conjugate prior by matching the natural statistic.
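The Beta-Binomial update is simple enough to write out as code. The sketch below is illustrative only: the function names and the numbers (a $\mathrm{Beta}(2,2)$ prior, 7 successes in 10 trials) are hypothetical, and it checks that the posterior mean equals the blend of prior mean and observed frequency described in the example.

```python
def beta_binomial_update(alpha, beta, x, n):
    """Return the posterior Beta parameters after x successes in n trials."""
    return alpha + x, beta + n - x

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical numbers: Beta(2, 2) prior, 7 successes in 10 trials.
a0, b0, x, n = 2.0, 2.0, 7, 10
a1, b1 = beta_binomial_update(a0, b0, x, n)
post_mean = beta_mean(a1, b1)

# The same posterior mean, written as a convex combination of the
# observed frequency x / n and the prior mean, with weight n / (a0 + b0 + n).
w = n / (a0 + b0 + n)
blend = w * (x / n) + (1 - w) * beta_mean(a0, b0)
```

Here `post_mean` and `blend` agree up to floating-point error, so the weight on the data is $n/(\alpha + \beta + n)$ and tends to 1 as $n$ grows.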
For Gaussian, Beta-Binomial, and Gamma-Poisson pairs, the updating rule replaces prior hyperparameters by adding sufficient statistics from the data — the posterior hyperparameters simply count data alongside pseudo-data. ## Improper Priors and the Jeffreys Prior Not every prior needs to be a proper probability distribution. In many problems, a natural choice of "uninformative" prior has infinite mass over $\Theta$ — yet the posterior can still be well-defined. [definition: Improper Prior] A nonnegative function $\pi : \Theta \to [0, \infty)$ is an **improper prior** if $\int_\Theta \pi(\theta)\, d\theta = +\infty$. An improper prior is admissible for the Bayesian model if the posterior \begin{align*} \Pi(\theta \mid X) \propto f(X, \theta)\, \pi(\theta) \end{align*} still has finite integral over $\Theta$ for (almost) all observed values of $X$, so that the posterior can be normalized to a proper probability distribution. [/definition] The simplest improper prior is the flat prior $\pi(\theta) = 1$, which assigns equal weight to every value of $\theta$. With this prior, the posterior is simply proportional to the likelihood, and the posterior mode coincides with the MLE. However, the flat prior is not invariant under reparametrization: if we reparametrize $\theta \mapsto \phi = g(\theta)$ and use a flat prior on $\phi$, we obtain a different posterior on $\theta$. This is a conceptual problem if we want the prior to represent genuine ignorance. The Jeffreys prior is designed to solve this reparametrization problem. [definition: Jeffreys Prior] For a regular statistical model $\{f(\cdot, \theta) : \theta \in \Theta\}$ with Fisher information $I(\theta)$, the **Jeffreys prior** is \begin{align*} \pi_J(\theta) \propto \sqrt{\det I(\theta)}. \end{align*} [/definition] The key property of the Jeffreys prior is its invariance: if $\phi = g(\theta)$ is a reparametrization, the Jeffreys prior for $\phi$ is the image of the Jeffreys prior for $\theta$ under $g$. 
This follows from the transformation rule for Fisher information and the change-of-variables formula for integrals. The Jeffreys prior thus provides a principled, coordinate-free notion of an uninformative prior. [example: Jeffreys Prior for the Gaussian Model] Consider the model $X_i \sim \mathcal{N}(\mu, \tau)$, where $\tau$ denotes the variance, with $\theta = (\mu, \tau) \in \mathbb{R} \times (0,\infty)$. The Fisher information matrix is \begin{align*} I(\mu, \tau) = \begin{pmatrix} 1/\tau & 0 \\ 0 & 1/(2\tau^2) \end{pmatrix}. \end{align*} The determinant is $\det I(\mu, \tau) = 1/(2\tau^3)$, so the Jeffreys prior is \begin{align*} \pi_J(\mu, \tau) \propto \tau^{-3/2}. \end{align*} This is constant in $\mu$ (flat in the mean, which is location-invariant) and proportional to $\tau^{-3/2}$ in $\tau$. With this prior, the posterior distribution of $\mu$ conditional on $\tau$ is $\mathcal{N}(\bar{X}_n, \tau/n)$, matching the frequentist sampling distribution of $\bar{X}_n$ when $\tau$ is known. [/example] ## The Posterior Mean as a Bayes Estimator Having computed the posterior, how does one extract a point estimator? The most natural choice is the posterior mean, but this can be motivated more precisely through decision theory. [definition: Posterior Mean Estimator] For a Bayesian model with posterior $\Pi(\cdot \mid X_1, \ldots, X_n)$, the **posterior mean estimator** is \begin{align*} \bar{\theta}_n = \bar{\theta}(X_1, \ldots, X_n) := \mathbb{E}_\Pi[\theta \mid X_1, \ldots, X_n] = \int_\Theta \theta\, \Pi(\theta \mid X_1, \ldots, X_n)\, d\theta. \end{align*} [/definition] The posterior mean minimizes the posterior expected squared error: for any estimator $\hat{\theta}(X_1, \ldots, X_n)$, the quantity $\mathbb{E}_\Pi[(\theta - \hat{\theta})^2 \mid X_1, \ldots, X_n]$ is minimized by $\hat{\theta} = \bar{\theta}_n$. This follows from the standard argument that the mean minimizes squared distance in $L^2$.
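For completeness, the $L^2$ argument can be written out in one line, as a sketch for scalar $\theta$ and assuming the posterior has a finite second moment. For any candidate value $a$ depending only on the data, adding and subtracting $\bar{\theta}_n$ inside the square gives \begin{align*} \mathbb{E}_\Pi[(\theta - a)^2 \mid X_1, \ldots, X_n] = \mathbb{E}_\Pi[(\theta - \bar{\theta}_n)^2 \mid X_1, \ldots, X_n] + (\bar{\theta}_n - a)^2, \end{align*} because the cross term $2(\bar{\theta}_n - a)\, \mathbb{E}_\Pi[\theta - \bar{\theta}_n \mid X_1, \ldots, X_n]$ vanishes by the definition of $\bar{\theta}_n$. The first term on the right is the posterior variance and does not depend on $a$, so the left-hand side is minimized exactly at $a = \bar{\theta}_n$.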
More generally, for a squared-error loss function $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|^2$, the Bayes risk (expected loss averaged over both $\theta \sim \pi$ and $X \mid \theta \sim f(\cdot, \theta)$) is minimized by the posterior mean. For the Gaussian example above, the posterior mean is $\bar{\theta}_n = n\bar{X}_n/(n+1)$. For the Beta-Binomial example, the posterior mean is $(\alpha + X)/(\alpha + \beta + n)$. In both cases, the estimator shrinks toward the prior mean relative to the MLE. ## The Role of the Prior: Finite Samples vs Asymptotics One of the central questions in Bayesian statistics is: how much does the prior matter? The answer depends heavily on the sample size. In small samples, the prior can dominate. Consider the Beta-Binomial example with $n = 5$ observations and a prior $\mathrm{Beta}(10, 10)$ (strongly concentrated near $1/2$). If we observe 4 heads out of 5, the MLE is $4/5 = 0.8$, but the posterior mean is $(10 + 4)/(10 + 10 + 5) = 14/25 = 0.56$ — still much closer to the prior mean of $1/2$ than to the data. As the sample size grows, the likelihood $\prod_{i=1}^n f(X_i, \theta)$ concentrates sharply near the true parameter $\theta_0$, overwhelmingly outweighing the prior. The posterior therefore concentrates near $\theta_0$ regardless of the prior, as long as $\pi(\theta_0) > 0$. More precisely, the posterior mean converges to $\theta_0$ in probability under the frequentist distribution $\mathbb{P}_{\theta_0}$. To see why in the Gaussian case: under $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta_0, 1)$, the posterior mean satisfies \begin{align*} \bar{\theta}_n = \frac{n}{n+1}\bar{X}_n \xrightarrow{\mathbb{P}_{\theta_0}} \theta_0, \end{align*} by the law of large numbers ($\bar{X}_n \to \theta_0$) and Slutsky's lemma ($n/(n+1) \to 1$). The prior mean of zero is completely forgotten in the limit. The asymptotic deviation of the posterior mean from the MLE also vanishes. 
Writing \begin{align*} \sqrt{n}(\bar{\theta}_n - \theta_0) = \sqrt{n}(\bar{\theta}_n - \hat{\theta}_n) + \sqrt{n}(\hat{\theta}_n - \theta_0), \end{align*} the second term converges in distribution to $\mathcal{N}(0, I(\theta_0)^{-1})$ by the asymptotic normality of the MLE. For the first term, \begin{align*} \sqrt{n}(\bar{\theta}_n - \hat{\theta}_n) = \sqrt{n}\left(\frac{n}{n+1} - 1\right)\bar{X}_n = -\frac{\sqrt{n}}{n+1}\bar{X}_n \xrightarrow{\mathbb{P}_{\theta_0}} 0, \end{align*} since $\bar{X}_n \to \theta_0$ and $\sqrt{n}/(n+1) \to 0$. Slutsky's lemma then gives \begin{align*} \sqrt{n}(\bar{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1}). \end{align*} The posterior mean is asymptotically equivalent to the MLE. This result is not specific to the Gaussian case — it is a manifestation of the Bernstein-von Mises phenomenon. ## The Bernstein-von Mises Theorem The Bernstein-von Mises theorem is one of the deepest results in mathematical statistics. It provides a precise asymptotic characterization of the entire posterior distribution — not just its mean — and forms the theoretical bridge between Bayesian and frequentist inference. The result says that the posterior distribution, as a random probability measure on $\Theta$ (random because it depends on the random observations $X_1, \ldots, X_n$), converges in total variation to a Gaussian distribution centered at the MLE. The Gaussian limit has covariance $I(\theta_0)^{-1}/n$, which is precisely the asymptotic covariance of the MLE. In other words, in large samples, the posterior looks like a confidence distribution centered at the MLE. [quotetheorem:1867] The hypothesis that $\pi(\theta_0) > 0$ is necessary: if the prior assigns zero probability to a neighborhood of $\theta_0$, the posterior never learns $\theta_0$ regardless of how much data arrives, and the result fails. 
The regularity assumptions on the model (differentiability of the log-likelihood, integrability conditions) ensure that the Taylor expansion of $\ell_n$ around $\hat{\theta}_n$ is valid and that the likelihood is concentrated in a $O(1/\sqrt{n})$ neighborhood of $\hat{\theta}_n$. Without these conditions — for instance, for models with non-smooth likelihoods like the uniform distribution on $[0, \theta]$ — the posterior converges at a different rate and to a non-Gaussian limit. [proof] Both $\Pi_n$ and $\phi_n$ are probability distributions integrating to 1, so $\int_\Theta (\Pi_n(\theta) - \phi_n(\theta))\, d\theta = 0$. This means the positive and negative parts of $\Pi_n - \phi_n$ have equal integral, so \begin{align*} \|\Pi_n - \phi_n\|_{\mathrm{TV}} = 2\int_\Theta (\Pi_n(\theta) - \phi_n(\theta))_+\, d\theta = 2\int_\Theta \left(1 - \frac{\Pi_n(\theta)}{\phi_n(\theta)}\right)_+ \phi_n(\theta)\, d\theta. \end{align*} The key step is to show that the density ratio $\Pi_n(\theta)/\phi_n(\theta)$ converges almost surely to 1 for all $\theta$. Once this is established, the dominated convergence theorem (applied to the bound $(1 - x)_+ \le 1$) yields that the total variation distance converges to 0. To see why the ratio converges to 1, write $V = \sqrt{n}(\theta - \hat{\theta}_n)$ and study the rescaled posterior density $\Pi_{n,V}$ for $V$. Taking logarithms and expanding the log-likelihood $\ell_n$ around $\hat{\theta}_n$ by Taylor's theorem: \begin{align*} \log \Pi_{n,V}(v) &\approx \log \pi(\theta_0) + \ell_n(\hat{\theta}_n) + \frac{1}{2}\ell_n''(\hat{\theta}_n)\frac{v^2}{n} - \log Z_n' \\ &\approx -\frac{1}{2}I(\theta_0)v^2 - \log \tilde{Z}_n, \end{align*} using that $\ell_n'(\hat{\theta}_n) = 0$ (first-order condition of the MLE), that $\frac{1}{n}\ell_n''(\hat{\theta}_n) \to -I(\theta_0)$ (by the law of large numbers applied to the Hessian), and that $\pi(\hat{\theta}_n) \to \pi(\theta_0)$ by continuity. 
The log-density of $\phi_{n,V} = \mathcal{N}(0, I(\theta_0)^{-1})$ is $-\frac{1}{2}I(\theta_0)v^2 - \log C(\theta_0)$. The normalizing constants $\tilde{Z}_n$ and $C(\theta_0)$ agree in the limit (both must make the density integrate to 1), and the rescaled posterior and $\phi_{n,V}$ have the same limit. Full justification of the dominated convergence step requires control of the tails of the likelihood, which involves the regularity assumptions. [/proof] This theorem has a far-reaching consequence. It says that, from a frequentist perspective, Bayesian credible sets of level $1-\alpha$ are also asymptotically valid confidence sets of the same level. A credible set $C_n$ satisfying $\Pi_n(C_n) = 1-\alpha$ satisfies $\phi_n(C_n) \to 1-\alpha$ (since the total variation distance between $\Pi_n$ and $\phi_n$ goes to zero), and because $\phi_n$ is centered at the MLE with the correct asymptotic variance, $\mathbb{P}_{\theta_0}(\theta_0 \in C_n) \to 1-\alpha$ as well. The Bayesian and frequentist procedures agree in the limit. [remark: What the Theorem Does Not Say] The Bernstein-von Mises theorem is an asymptotic result. In finite samples, the posterior can differ substantially from the Gaussian approximation, especially in high dimensions or when the prior is strongly informative. In infinite-dimensional (nonparametric) settings, analogous results require much more stringent conditions on the prior and model, and the centering is typically not the MLE but a different efficient estimator. The one-dimensional case treated here is the cleanest version. [/remark] The Bernstein-von Mises theorem also illuminates why the choice of prior is asymptotically irrelevant, provided $\pi(\theta_0) > 0$. Two statisticians with different priors $\pi_1$ and $\pi_2$, both positive at $\theta_0$, will compute posteriors $\Pi_n^{(1)}$ and $\Pi_n^{(2)}$ that both converge in total variation to the same Gaussian $\phi_n$. Their posteriors therefore agree in the limit.
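The merging of two posteriors can be observed numerically in the Beta-Binomial model. The sketch below is an illustration under stated assumptions, not a proof: the two priors $\mathrm{Beta}(1,1)$ and $\mathrm{Beta}(10,10)$, the data (exactly 30% successes), the grid size, and the function names are all hypothetical choices. The distance computed is $\int |p_1 - p_2|$, which matches the normalization of the total variation norm used in the proof above.

```python
import math

def beta_log_pdf(theta, a, b):
    """Log-density of Beta(a, b) at theta in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_norm

def l1_distance_between_posteriors(n, prior_1, prior_2, grid_size=20000):
    """Integral of |p1 - p2| over (0, 1), by a midpoint Riemann sum."""
    x = 3 * n // 10  # hypothetical data: exactly 30% successes
    (a1, b1), (a2, b2) = prior_1, prior_2
    total = 0.0
    for k in range(grid_size):
        theta = (k + 0.5) / grid_size
        p1 = math.exp(beta_log_pdf(theta, a1 + x, b1 + n - x))
        p2 = math.exp(beta_log_pdf(theta, a2 + x, b2 + n - x))
        total += abs(p1 - p2)
    return total / grid_size

# Distance between the two statisticians' posteriors for growing sample sizes.
distances = {
    n: l1_distance_between_posteriors(n, (1, 1), (10, 10))
    for n in (100, 1000, 10000)
}
```

In this configuration the distance decreases as $n$ grows, in line with both posteriors approaching the same Gaussian limit.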
The prior only matters in finite samples, and its influence diminishes at rate $O(1/n)$ relative to the likelihood. This completes the introduction to Bayesian inference. The next chapter examines conjugate priors and the posterior distribution more carefully, working out the interplay between prior, likelihood, and posterior in specific models and discussing how the Bayesian posterior can be used for estimation, uncertainty quantification, and testing. In Bayesian inference, the posterior distribution combines prior knowledge with observed data through Bayes' theorem. The tension between prior and likelihood determines the final inference, and we explore how different priors lead to different conclusions. # 11. Between Prior and Posterior The previous chapter introduced Bayesian inference and the mechanics of computing the posterior distribution. Having established that machinery, a natural set of harder questions arises: how should one actually *use* the posterior to make decisions, and does the resulting inference bear any honest relationship to the frequentist methods developed in the first part of the course? This chapter works through those questions carefully — studying the posterior mean as a concrete estimator, the phenomenon of posterior concentration as $n \to \infty$, the Bernstein–von Mises theorem which formalizes the asymptotic agreement between Bayesian and frequentist inference, and the delicate question of whether Bayesian credible sets have valid frequentist coverage. ## The Posterior as a Probability Measure on the Parameter Space Before extracting point estimates or sets from the posterior, it is worth pausing on what kind of object the posterior actually is. The posterior $\Pi(\cdot \mid X_1, \ldots, X_n)$ is a *random* probability measure on the parameter space $\Theta$: it is a function of the data, and hence it varies from sample to sample. 
For a fixed realization of the data, it assigns a probability to every measurable subset of $\Theta$, expressing the updated degree of belief about where the true parameter lies. The denominator of the posterior, \begin{align*} Z_n := \int_\Theta \prod_{i=1}^n f(X_i, \theta') \, \pi(\theta') \, d\theta', \end{align*} is the *marginal likelihood* or *evidence*. It normalizes the posterior but does not depend on $\theta$, so it plays no role in comparing different values of $\theta$. In practice, this constant is often inaccessible in closed form, but since it cancels in any ratio involving the posterior, it can be set aside for the purpose of computing point estimates, credible sets, and Bayes factors. The remark that the posterior is proportional to $\prod_{i=1}^n f(X_i, \theta) \cdot \pi(\theta)$ is not merely a computational convenience — it is a conceptually clean statement that the data updates the prior through the likelihood, and the normalizing constant is just bookkeeping. ## Conjugate Priors Computation of the posterior requires evaluating the integral $Z_n$, which is analytically intractable in general. In certain favourable cases, the product of likelihood and prior belongs to a known parametric family and the posterior can be computed in closed form without any integration. [definition: Conjugate Prior] In a statistical model $\{f(\cdot, \theta) : \theta \in \Theta\}$, a prior $\pi$ is called a **conjugate prior** if for every sample size $n$ and every observation $(X_1, \ldots, X_n)$, the posterior $\Pi(\cdot \mid X_1, \ldots, X_n)$ belongs to the same parametric family as $\pi$. [/definition] Conjugacy is a property of the pair (likelihood family, prior family), not of the prior alone. The mechanism is always the same: the likelihood and prior have matching functional forms in $\theta$, so their product remains in the same family with updated parameters. [example: Gaussian-Gaussian Conjugacy] Let $X_i \mid \theta \sim N(\theta, 1)$ i.i.d. 
with prior $\theta \sim N(0, 1)$. The numerator of the posterior, viewed as a function of $\theta$, is \begin{align*} e^{-\theta^2/2} \prod_{i=1}^n \exp\!\left(-\frac{(X_i - \theta)^2}{2}\right) &\propto \exp\!\left(n\theta \bar{X}_n - \frac{(n+1)\theta^2}{2}\right), \end{align*} where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Completing the square in $\theta$: \begin{align*} n\theta \bar{X}_n - \frac{(n+1)\theta^2}{2} = -\frac{(n+1)}{2}\left(\theta - \frac{n\bar{X}_n}{n+1}\right)^2 + \text{const}, \end{align*} so the posterior is Gaussian: \begin{align*} \theta \mid X_1, \ldots, X_n \sim N\!\left(\frac{n\bar{X}_n}{n+1},\, \frac{1}{n+1}\right). \end{align*} The posterior mean is $\frac{n}{n+1}\bar{X}_n$ and the posterior variance is $\frac{1}{n+1}$. Both the prior and posterior are Gaussian, confirming conjugacy. The general case $N(\theta, \sigma^2)$ with prior $N(\mu_0, \nu^2)$ is analogous and produces a Gaussian posterior whose parameters interpolate between prior and data. [/example] [example: Further Conjugate Families] Several other conjugate pairs appear throughout the course: - **Beta–Binomial.** For $X_i \mid \theta \sim \text{Bernoulli}(\theta)$ i.i.d. with prior $\theta \sim \text{Beta}(\alpha, \beta)$, the posterior is $\text{Beta}(\alpha + \sum X_i,\, \beta + n - \sum X_i)$. The Beta distribution is parameterized by two shape parameters that count successes and failures, and the likelihood simply increments these counts. - **Gamma–Poisson.** For $X_i \mid \theta \sim \text{Poi}(\theta)$ i.i.d. with prior $\theta \sim \text{Gamma}(\alpha, \beta)$ (shape $\alpha$, rate $\beta$), the posterior is $\text{Gamma}(\alpha + \sum X_i,\, \beta + n)$. Here the likelihood contributes the total count to the shape parameter and the number of observations to the rate parameter. [/example] The computational appeal of conjugate priors is significant, but conjugacy should not be confused with statistical validity.
Choosing a prior because it yields a tractable posterior is a modelling convenience, not a statement about what prior beliefs are appropriate. When the true parameter is far from the mass of the conjugate prior, the posterior may require many observations to overcome the prior's misspecification. ## Improper Priors and the Jeffreys Prior A difficulty with the Bayesian framework is choosing a prior that is "uninformative" — expressing minimal prior knowledge about $\theta$. One might hope to take a uniform distribution on $\Theta$, but this faces two problems. First, if $\Theta$ is unbounded, a uniform distribution does not integrate to a finite value and is not a valid probability distribution. Second, uniformity is not preserved under reparametrization: if $\pi(\theta) = 1$ on $\mathbb{R}$ and $\phi = \theta^2$, then the induced prior on $\phi$ is not uniform. This shows that "uninformative" is not an intrinsic concept — it depends on the chosen parametrization. [definition: Improper Prior] A function $\pi : \Theta \to [0, \infty)$ that is not integrable over $\Theta$ (i.e. $\int_\Theta \pi(\theta)\, d\theta = \infty$) is called an **improper prior**. It can still generate a valid posterior provided $\int_\Theta \prod_{i=1}^n f(X_i, \theta)\, \pi(\theta)\, d\theta < \infty$, in which case the posterior is defined by normalizing $\theta \mapsto \prod_{i=1}^n f(X_i, \theta)\, \pi(\theta)$. [/definition] Improper priors are a mathematical device, not probability distributions in the usual sense. One must verify in each case that the posterior is proper; the posterior being improper would render all posterior inference meaningless. [definition: Jeffreys Prior] The **Jeffreys prior** is the prior $\pi(\theta) \propto \sqrt{\det I(\theta)}$, where $I(\theta)$ is the Fisher information matrix at $\theta$. [/definition] The Jeffreys prior is designed to be invariant under smooth reparametrization. 
If $\phi = g(\theta)$ for a smooth bijection $g$, then the Jeffreys prior for $\phi$ is $\tilde\pi(\phi) \propto \sqrt{\det I_\phi(\phi)}$, which is exactly the pushforward of $\pi(\theta)$ under $g$. This invariance follows from the transformation rule for the Fisher information matrix under reparametrization, using the chain rule for the score function. [example: Jeffreys Prior for the Gaussian Location-Scale Model] In the model $N(\mu, \tau)$ with $\theta = (\mu, \tau)^\top \in \mathbb{R} \times (0, \infty)$, where $\tau$ is the variance, the Fisher information matrix is diagonal: \begin{align*} I(\mu, \tau) = \begin{pmatrix} 1/\tau & 0 \\ 0 & 1/(2\tau^2) \end{pmatrix}. \end{align*} Its determinant is $1/(2\tau^3)$, so $\sqrt{\det I(\mu, \tau)} = 1/(\sqrt{2}\,\tau^{3/2}) \propto \tau^{-3/2}$. The Jeffreys prior is $\pi(\mu, \tau) \propto \tau^{-3/2}$, which is flat in $\mu$ and places more weight on small $\tau$ than the improper flat prior $\pi \equiv 1$. Under this prior, the posterior distribution of $\mu$ conditional on $\tau$ is $N(\bar{X}_n,\, \tau/n)$, coinciding with the frequentist distribution of the MLE when $\tau$ is known. [/example] ## Statistical Inference via the Posterior A clinician estimating a drug's efficacy, or an engineer calibrating a sensor, cannot act on an entire probability distribution; they need a single number, or at most an interval. The posterior $\Pi(\cdot \mid X_1, \ldots, X_n)$ is a rich object, but translating it into a decision requires choosing how to summarize it. Different summaries serve different purposes: a point estimate minimizes some loss, a set communicates uncertainty, and a ratio of posterior probabilities compares two hypotheses. The question is how to extract these summaries in a principled way, and what guarantees they carry. [definition: Posterior Mean, Credible Set, and Bayes Factor] Let $\Pi(\cdot \mid X_1, \ldots, X_n)$ denote the posterior distribution.
Three inference procedures are: **Posterior mean estimator.** The posterior mean is \begin{align*} \bar\theta_n := \mathbb{E}_\Pi[\theta \mid X_1, \ldots, X_n] = \int_\Theta \theta \, d\Pi(\theta \mid X_1, \ldots, X_n). \end{align*} **Credible set.** A subset $C_n \subseteq \Theta$ is a **level $1-\alpha$ credible set** if \begin{align*} \Pi(C_n \mid X_1, \ldots, X_n) = 1 - \alpha. \end{align*} **Hypothesis test via Bayes factor.** For a partition $\Theta = \Theta_0 \cup \Theta_1$, the **Bayes factor** in favor of $\Theta_0$ over $\Theta_1$ is \begin{align*} B_n := \frac{\Pi(\Theta_0 \mid X_1, \ldots, X_n)}{\Pi(\Theta_1 \mid X_1, \ldots, X_n)} = \frac{\int_{\Theta_0} \prod_{i=1}^n f(X_i, \theta)\, \pi(\theta)\, d\theta}{\int_{\Theta_1} \prod_{i=1}^n f(X_i, \theta)\, \pi(\theta)\, d\theta}. \end{align*} [/definition] Each of these three procedures has a clean Bayesian interpretation, but none comes with automatic guarantees of frequentist validity. The posterior mean minimizes posterior expected squared loss; a credible set is a set to which the Bayesian assigns probability $1-\alpha$; the Bayes factor measures the relative posterior odds. Whether these procedures also have useful frequentist properties — unbiasedness, frequentist coverage, correct size — requires separate analysis and is the subject of the rest of this chapter. ## The Posterior Mean as a Weighted Average How does the posterior mean relate to the MLE? And does the prior's influence ever vanish, or does it persist no matter how many observations are collected? These questions have a clean answer in the Gaussian model that turns out to be paradigmatic for the entire exponential family: the posterior mean is a convex combination of the prior mean and the frequentist estimator, with weights determined by the relative precisions of the prior and the likelihood. 
To see this in the Gaussian example, recall that after observing $X_1, \ldots, X_n$ from $N(\theta, 1)$ with prior $\theta \sim N(\mu_0, \nu^2)$, the posterior is \begin{align*} \theta \mid X_1, \ldots, X_n \sim N\!\left(\frac{\nu^{-2}\mu_0 + n\bar{X}_n}{\nu^{-2} + n},\, \frac{1}{\nu^{-2} + n}\right). \end{align*} The posterior mean is \begin{align*} \bar\theta_n = \frac{\nu^{-2}}{\nu^{-2} + n}\,\mu_0 + \frac{n}{\nu^{-2} + n}\,\bar{X}_n. \end{align*} This is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar{X}_n$, with weights proportional to the prior precision $\nu^{-2}$ and the data precision $n$ (recall that $n$ i.i.d. observations from $N(\theta, 1)$ carry Fisher information $n$). As $n \to \infty$, the weight on the prior mean vanishes at rate $1/n$, and the posterior mean converges to $\bar{X}_n$, the MLE. [remark: The Role of Prior Strength] The parameter $\nu^{-2}$ acts as the "effective sample size" of the prior — a diffuse prior with large $\nu^2$ corresponds to weak prior information, which is quickly overwhelmed by data. A strongly informative prior with small $\nu^2$ exerts persistent influence unless the sample size is very large relative to $\nu^{-2}$. This makes explicit the Bayesian insight that prior information and observed data play symmetric roles in determining the posterior, weighted by their respective precisions. [/remark] The weighted-average structure is not limited to the Gaussian model. In exponential families with conjugate priors, the posterior mean always takes this form: it interpolates between the prior mean and the MLE, with the mixing weight on the MLE approaching 1 as $n \to \infty$. ## Posterior Concentration The weighted-average formula already suggests that the posterior concentrates around the true value $\theta_0$ as $n \to \infty$: the posterior mean converges to $\theta_0$, and the posterior variance shrinks to zero. But does the entire posterior distribution concentrate, not just its mean? 
The answer is yes, under mild conditions. This phenomenon is called **posterior concentration** or **posterior contraction**. Informally, if $\theta_0$ is the true value generating the data, then for any fixed $\varepsilon > 0$, the posterior probability assigned to the $\varepsilon$-ball around $\theta_0$ converges to 1 in $P_{\theta_0}$-probability. Before stating this precisely, it is worth understanding what can go wrong. Posterior concentration can fail if the prior assigns zero mass to a neighborhood of $\theta_0$; in this case, no amount of data can bring the posterior mass to $\theta_0$. It can also fail if the model is not identifiable near $\theta_0$, or if the prior is so heavily concentrated elsewhere that the likelihood cannot overcome it for any finite $n$. These failure modes clarify what the theorem needs to assume. [quotetheorem:1868] The rate $1/\sqrt{n}$ matches the rate of convergence of the MLE under the same conditions. This is the first hint of the deep connection between Bayesian and frequentist inference in regular parametric models: both methods produce statements about $\theta_0$ at the same precision scale. The proof strategy is illuminating. The key observation is that, on the event $\{|\hat\theta_n - \theta_0| = O(1/\sqrt{n})\}$ (which holds with high probability by the $\sqrt{n}$-consistency of the MLE), the log-likelihood ratio \begin{align*} \log\frac{\prod_{i=1}^n f(X_i, \theta)}{\prod_{i=1}^n f(X_i, \theta_0)} \end{align*} can be approximated by a quadratic form in $(\theta - \hat\theta_n)$ using a second-order Taylor expansion of the log-likelihood. For $\theta$ far from $\theta_0$ (say $|\theta - \theta_0| > M_n/\sqrt{n}$), this quadratic form is large and negative, making the likelihood ratio exponentially small. Meanwhile, the prior $\pi(\theta)$ is bounded away from zero near $\theta_0$ and does not grow fast enough to compensate.
The denominator $Z_n$ is bounded below by integrating only over a $O(1/\sqrt{n})$-neighborhood of $\theta_0$, which already captures a definite fraction of the posterior mass. Hypothesis 1 — positive definiteness of $I(\theta_0)$ — ensures the quadratic form is genuinely negative away from $\theta_0$. Without it, the log-likelihood could be flat in some direction, and the posterior would not concentrate in that direction at all (think of a model where $\theta$ enters only through $\theta^2$: two values $\pm\theta_0$ always have the same likelihood and the posterior cannot distinguish them). Hypothesis 2 — positivity of the prior at $\theta_0$ — ensures the denominator $Z_n$ does not vanish; if $\pi(\theta_0) = 0$, the prior actively avoids the truth and concentration fails. ## The Bernstein–von Mises Theorem Posterior concentration tells us that the posterior mass accumulates around $\theta_0$ at rate $1/\sqrt{n}$, but it does not specify the shape of the posterior within the $1/\sqrt{n}$-scale neighborhood. The Bernstein–von Mises theorem gives a precise answer: the rescaled posterior converges in total variation to a Gaussian distribution whose mean and covariance are determined by the MLE and the inverse Fisher information — exactly the limiting distribution of the MLE. [quotetheorem:1869] The proof proceeds by a careful local expansion of the log-likelihood around the MLE. On a $O(1/\sqrt{n})$ scale around $\hat\theta_n$, writing $\theta = \hat\theta_n + h/\sqrt{n}$, the log-likelihood ratio becomes \begin{align*} \sum_{i=1}^n \log\frac{f(X_i, \theta)}{f(X_i, \hat\theta_n)} \approx -\frac{h^\top I(\theta_0) h}{2}, \end{align*} by a second-order Taylor expansion using that the score at $\hat\theta_n$ vanishes and the empirical Hessian concentrates around $-I(\theta_0)$ by the law of large numbers. The prior $\pi(\theta)$ is approximately constant at this scale since it varies on a much coarser scale. 
The resulting posterior, after a change of variables $h = \sqrt{n}(\theta - \hat\theta_n)$, is proportional to $\exp(-h^\top I(\theta_0) h / 2)$, which is the kernel of $N(0, I(\theta_0)^{-1})$. The total variation convergence requires a uniform argument over measurable sets and uses the fact that the tails of the posterior are negligible.

The theorem's hypotheses are essential. Without positive definiteness of $I(\theta_0)$, the limiting covariance $I(\theta_0)^{-1}$ does not exist and there is no Gaussian limit to converge to. Without the regularity conditions on the log-likelihood, the Taylor expansion may be invalid, and the posterior shape can be entirely non-Gaussian: consider an exponential family at the boundary of its natural parameter space, or a model with a discontinuous density. In non-regular models (such as estimation of the endpoint of a uniform distribution), the Bernstein–von Mises theorem fails completely: the posterior contracts at rate $1/n$ rather than $1/\sqrt{n}$, and its limiting shape is not Gaussian.

The Bernstein–von Mises theorem is, at its heart, a statement about asymptotic *equivalence* of Bayesian and frequentist procedures in regular parametric models. It implies that as $n \to \infty$, Bayesian credible sets and frequentist confidence sets constructed from the Gaussian approximation become indistinguishable. This is sometimes read as an asymptotic "washing out of the prior": the prior shifts the posterior mean by only $O(1/n)$, which is $O(1/\sqrt{n})$ on the rescaled $h$-scale and therefore vanishes in the limit.

## Credible Sets and Frequentist Coverage

A level $1-\alpha$ credible set $C_n$ satisfies $\Pi(C_n \mid X_1, \ldots, X_n) = 1 - \alpha$ by construction. But from a frequentist perspective, the question is different: what is $P_{\theta_0}(\theta_0 \in C_n)$? The credible set is a random set (it depends on the data), and the true parameter $\theta_0$ is a fixed unknown constant. Does the frequentist coverage probability match the nominal level $1-\alpha$?
In general, the answer is no: Bayesian credible sets do not have guaranteed frequentist coverage. To understand why, consider a prior that is highly concentrated at some $\theta_1 \neq \theta_0$. Even after observing a large sample, the posterior may remain partially influenced by this prior, and the $1-\alpha$ credible interval may systematically exclude $\theta_0$ with non-negligible probability. This failure is not a pathology — it is the honest Bayesian answer to a Bayesian question, which is a different question from the frequentist one. However, the Bernstein–von Mises theorem implies that, in the large-sample regime and under its regularity conditions, Bayesian credible sets *do* have asymptotically correct frequentist coverage. The argument is direct: the posterior is asymptotically $N(\hat\theta_n, I(\theta_0)^{-1}/n)$, so the highest posterior density (HPD) region at level $1-\alpha$ is approximately the ellipsoid \begin{align*} C_n = \left\{\theta : n(\theta - \hat\theta_n)^\top I(\theta_0)(\theta - \hat\theta_n) \leq \chi^2_{p,1-\alpha}\right\}, \end{align*} which coincides with the Wald-type confidence ellipsoid centered at $\hat\theta_n$. Since the MLE is asymptotically $N(\theta_0, I(\theta_0)^{-1}/n)$, the frequentist coverage of this ellipsoid converges to $1-\alpha$ as well. [remark: What "Asymptotic" Means Here] The guarantee is asymptotic: for every fixed $n$, the credible set coverage can differ from $1-\alpha$ by a prior-dependent amount. The Bernstein–von Mises theorem says this discrepancy vanishes as $n \to \infty$, but gives no finite-sample bound. In practice, the rate at which the discrepancy vanishes depends on how far the prior mean is from $\theta_0$ and on the curvature of $I(\theta)$. For a badly misspecified prior, the convergence can be slow. 
[/remark]

## What the Prior Contributes Asymptotically

The Bernstein–von Mises theorem raises a natural question: if the prior washes out asymptotically and the posterior is eventually determined by the likelihood, what is the practical value of the prior for large samples? The answer has several layers.

First, the prior governs finite-sample behavior and the transient phase before concentration. In moderate samples, a well-chosen prior can substantially improve estimation by regularizing the MLE; this is especially important in high-dimensional or sparse settings where the MLE has poor finite-sample properties. Second, the choice of prior is never entirely irrelevant even asymptotically: it enters the higher-order corrections to the Gaussian approximation, and for nonparametric or high-dimensional problems where $p$ grows with $n$, the Bernstein–von Mises theorem does not apply in its classical form and the prior can have a lasting effect. Third, the prior provides a principled mechanism for incorporating genuine domain knowledge, which has intrinsic value independent of the asymptotic behavior.

To make the connection between the posterior mean and the MLE more concrete, return to the Gaussian example with prior $N(0, 1)$, that is, $\mu_0 = 0$ and $\nu^2 = 1$. The posterior mean is $\bar\theta_n = \frac{n}{n+1}\bar{X}_n$ and the MLE is $\hat\theta_n = \bar{X}_n$, so the ratio of the two estimators tends to 1 as $n \to \infty$. The difference $\bar\theta_n - \hat\theta_n = -\frac{1}{n+1}\bar{X}_n$ is of order $1/n$, which is smaller than the estimation error $\hat\theta_n - \theta_0$ of order $1/\sqrt{n}$. This illustrates the general principle: the prior introduces a bias in the posterior mean of order $1/n$, which is dominated by the statistical error of order $1/\sqrt{n}$ and hence asymptotically negligible.
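
The comparison of the two orders of magnitude can be checked numerically. The following is a minimal sketch, assuming the $N(\theta, 1)$ model with an $N(0, 1)$ prior; the function name `posterior_mean_vs_mle`, the seed, and the value `theta0 = 2.0` are illustrative choices, not part of the course material.

```python
import math
import random

def posterior_mean_vs_mle(n, theta0, rng):
    """Draw n observations from N(theta0, 1) and return (MLE, posterior mean)
    under an N(0, 1) prior, using the closed-form conjugate posterior mean."""
    xbar = sum(rng.gauss(theta0, 1.0) for _ in range(n)) / n
    mle = xbar                          # MLE of a Gaussian mean is the sample mean
    post_mean = n / (n + 1) * xbar      # posterior mean shrinks xbar toward the prior mean 0
    return mle, post_mean

rng = random.Random(0)                  # fixed seed so the run is reproducible
theta0 = 2.0                            # hypothetical true parameter
for n in (10, 100, 1000, 10000):
    mle, post_mean = posterior_mean_vs_mle(n, theta0, rng)
    prior_shift = abs(post_mean - mle)  # equals |xbar| / (n + 1), of order 1/n
    mle_error = abs(mle - theta0)       # typically of order 1/sqrt(n)
    print(f"n={n:6d}  prior shift={prior_shift:.5f}  "
          f"MLE error={mle_error:.5f}  1/sqrt(n)={1 / math.sqrt(n):.5f}")
```

In a typical run the prior shift column should fall roughly tenfold for each tenfold increase in $n$, while the MLE error fluctuates around the $1/\sqrt{n}$ scale.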
[remark: Posterior Mean Converges to $\theta_0$] In both Example 3.2 (Gaussian model, prior $N(0,1)$) and Example 3.4 (Gaussian location-scale, Jeffreys prior), the posterior mean converges to the true parameter value $\theta_0$ with probability 1 under $P_{\theta_0}$, at the standard parametric rate $1/\sqrt{n}$. This is not coincidental: it follows from posterior concentration together with control of the posterior tails, so that the vanishing posterior mass far from $\theta_0$ cannot move the mean. [/remark]

## A Synthesis: Bayesian and Frequentist Inference in Regular Models

Do Bayesian and frequentist inference ultimately agree? They start from different premises: one treats $\theta$ as random and conditions on the data; the other treats $\theta$ as fixed and averages over hypothetical repeated samples. There is no a priori reason they should converge to the same answers. Yet in regular parametric models, this chapter has built a case that they do, at least in large samples.

Here is the coherent picture that has emerged. In a regular parametric model with $n$ i.i.d. observations:

- The posterior distribution concentrates around $\theta_0$ at rate $n^{-1/2}$.
- The rescaled posterior is asymptotically Gaussian with mean equal to the MLE and covariance equal to the inverse Fisher information, the same limiting distribution that the MLE itself has under $P_{\theta_0}$.
- Bayesian credible sets and frequentist confidence sets are asymptotically equivalent.
- The posterior mean is a consistent, asymptotically efficient estimator.

This does not mean that Bayesian and frequentist inference are the same philosophy; the difference in premises described above remains.
The Bernstein–von Mises theorem says that in large samples and in regular models, both approaches produce the same numbers, even though they interpret those numbers differently. The agreement breaks down outside regular parametric models — in nonparametric settings, misspecified models, models with weak identifiability, or models with non-smooth likelihoods. In those settings, the Bayesian and frequentist frameworks can produce genuinely different answers, and understanding when each approach gives reliable inference becomes a substantive statistical problem rather than a question of asymptotic bookkeeping. While Bayesian methods have intuitive appeal, their performance must be evaluated rigorously using classical frequentist criteria. We ask whether Bayesian procedures control error rates and achieve good coverage properties under repeated sampling. # 12. Frequentist Analysis of Bayesian Methods The Bayesian and frequentist traditions in statistics are often presented as competing philosophies, but in large samples they tell the same story — or nearly so. This chapter asks the precise question: when we draw a credible set from the posterior, is it also a valid confidence set? Answering this requires understanding how the posterior distribution behaves when data come from a fixed true parameter $\theta_0$, not from the prior. We develop the Bernstein-von Mises theorem as the central tool, trace through its proof strategy, and examine both what it guarantees and where it can fail. ## The Frequentist Lens on Bayesian Inference A Bayesian analysis treats $\theta$ as a random variable and outputs a posterior distribution $\Pi_n = \Pi(\cdot \mid X_1, \ldots, X_n)$. From a Bayesian standpoint, this posterior is the complete answer: uncertainty is measured by probabilities under $\Pi_n$, and any inference procedure is derived from it. 
But a frequentist asks a different question entirely: if the data are genuinely generated by a fixed true parameter $\theta_0$ — that is, if $X_i \overset{\text{i.i.d.}}{\sim} f(x, \theta_0)$ — do Bayesian procedures have good frequentist properties? The worry is real. Suppose your prior $\pi$ assigns zero mass to neighborhoods of the true $\theta_0$. Then no amount of data can rescue the posterior — it concentrates away from $\theta_0$ regardless of sample size, and any credible set based on it fails as a confidence set. More subtly, even when $\pi(\theta_0) > 0$, a badly misspecified prior might distort the posterior severely enough that its shape at finite $n$ is misleading. The question is whether these pathologies disappear asymptotically under minimal conditions on the prior. This motivates a careful asymptotic study of $\Pi_n$ as a random probability measure under $P_{\theta_0}$. ## The Posterior as a Random Probability Distribution The setup throughout this chapter is the parametric model $\{f(\cdot, \theta) : \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}$, equipped with a prior density $\pi$ on $\Theta$. After observing $X_1, \ldots, X_n$, the posterior density is \begin{align*} \Pi_n(\theta) = \frac{\pi(\theta) \prod_{i=1}^n f(X_i, \theta)}{Z_n}, \end{align*} where $Z_n = \int_\Theta \pi(\theta) \prod_{i=1}^n f(X_i, \theta)\, d\theta$ is the normalizing constant (the marginal likelihood). When the $X_i$ are random, $\Pi_n$ is itself a random element of the space of probability distributions on $\Theta$. This dual randomness — the data are random, and hence the posterior is random — is the essential feature of the frequentist analysis. We are not asking about the posterior for a single fixed dataset; we are asking about the typical behavior of the posterior under the data-generating process $P_{\theta_0}$. 
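
The posterior formula above can be evaluated directly on a grid when no closed form is available. The sketch below is an illustration under stated assumptions: the helper `grid_posterior`, the five data values, and the grid limits are hypothetical, and the integral defining $Z_n$ is replaced by a Riemann sum. The Gaussian model with an $N(0,1)$ prior is used only because its conjugate posterior mean $n\bar{X}_n/(n+1)$ gives something to compare against.

```python
import math

def grid_posterior(data, log_prior, log_lik, grid):
    """Posterior density values on an equally spaced grid.
    The normalising constant Z_n is approximated by a Riemann sum."""
    step = grid[1] - grid[0]
    log_post = [log_prior(t) + sum(log_lik(x, t) for x in data) for t in grid]
    m = max(log_post)                              # subtract the max for numerical stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm) * step                         # stand-in for Z_n, up to the factor e^m
    return [u / z for u in unnorm]

# Hypothetical data; N(theta, 1) likelihood and N(0, 1) prior (constants dropped).
data = [1.2, 0.7, 1.9, 1.4, 0.3]
log_prior = lambda t: -0.5 * t * t
log_lik = lambda x, t: -0.5 * (x - t) ** 2

step = 0.001
grid = [-4.0 + step * i for i in range(8001)]      # covers [-4, 4]
post = grid_posterior(data, log_prior, log_lik, grid)

post_mean = sum(t * p for t, p in zip(grid, post)) * step
n, xbar = len(data), sum(data) / len(data)
print(post_mean, n * xbar / (n + 1))               # grid mean vs conjugate closed form
```

Rerunning with a different dataset gives a different `post`, which is the sense in which $\Pi_n$ is a random probability distribution.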
[definition: Total Variation Distance] For two probability distributions $\mu$ and $\nu$ on $\Theta \subseteq \mathbb{R}$ with densities $\mu(\theta)$ and $\nu(\theta)$, the **total variation distance** between them is \begin{align*} \|\mu - \nu\|_{\mathrm{TV}} = \int_\Theta |\mu(\theta) - \nu(\theta)|\, d\theta. \end{align*} Equivalently, $\|\mu - \nu\|_{\mathrm{TV}} = 2\sup_{A \subseteq \Theta} |\mu(A) - \nu(A)|$, where the supremum is over all measurable sets $A$. [/definition] Total variation is the natural metric here because it controls the worst-case discrepancy between two distributions over any event. If $\|\Pi_n - \phi_n\|_{\mathrm{TV}} \to 0$, then for any set $A$, including any credible set, $|\Pi_n(A) - \phi_n(A)| \to 0$. This is exactly what we need to transfer coverage guarantees between the two distributions. ## Asymptotic Normality of the Posterior Before stating the main theorem, it is useful to see why normality should emerge at all. Recall from the study of the MLE that the score function satisfies a law of large numbers and a central limit theorem under regularity assumptions. The log-posterior is \begin{align*} \log \Pi_n(\theta) = \log \pi(\theta) + \ell_n(\theta) - \log Z_n, \end{align*} where $\ell_n(\theta) = \sum_{i=1}^n \log f(X_i, \theta)$ is the log-likelihood. Near the MLE $\hat{\theta}_n$, a second-order Taylor expansion of $\ell_n$ gives \begin{align*} \ell_n(\theta) \approx \ell_n(\hat{\theta}_n) + \ell_n'(\hat{\theta}_n)(\theta - \hat{\theta}_n) + \frac{1}{2}\ell_n''(\hat{\theta}_n)(\theta - \hat{\theta}_n)^2. \end{align*} Since $\ell_n'(\hat{\theta}_n) = 0$ by definition of the MLE, and since $-\ell_n''(\hat{\theta}_n)/n \to I(\theta_0)$ by a law of large numbers applied to the observed information, the log-posterior looks approximately like a quadratic in $(\theta - \hat{\theta}_n)$, with curvature $n I(\theta_0)$. The corresponding density is approximately Gaussian with mean $\hat{\theta}_n$ and variance $1/(n I(\theta_0))$. 
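
The quality of this Gaussian approximation can be measured with the total variation distance defined above. The following sketch uses a model in which the posterior is not Gaussian: Bernoulli observations with a uniform prior, so that the posterior is $\mathrm{Beta}(s+1, n-s+1)$ for $s$ successes. This model, the plug-in variance $\hat\theta_n(1-\hat\theta_n)/n$ in place of $1/(n I(\theta_0))$, and all function names are assumptions made for illustration; the integral is approximated by a Riemann sum.

```python
import math

def beta_posterior_pdf(theta, s, n):
    """Posterior density for s successes in n Bernoulli trials under a
    uniform prior, i.e. the Beta(s + 1, n - s + 1) density."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    log_norm = math.lgamma(n + 2) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
    return math.exp(log_norm + s * math.log(theta) + (n - s) * math.log(1.0 - theta))

def normal_pdf(theta, mean, var):
    """Density of N(mean, var) at theta."""
    return math.exp(-0.5 * (theta - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def tv_distance(s, n, grid_size=20001):
    """Riemann-sum approximation of the integral of |posterior - Gaussian|,
    matching the document's convention (values range from 0 to 2)."""
    mle = s / n
    var = mle * (1.0 - mle) / n        # plug-in for 1 / (n I(theta_0)) in the Bernoulli model
    lo, hi = -1.0, 2.0                 # wide enough to include Gaussian mass outside (0, 1)
    step = (hi - lo) / (grid_size - 1)
    total = 0.0
    for i in range(grid_size):
        t = lo + i * step
        total += abs(beta_posterior_pdf(t, s, n) - normal_pdf(t, mle, var))
    return total * step

for n in (10, 100, 1000, 10000):
    s = round(0.3 * n)                 # hypothetical data with MLE equal to 0.3
    print(n, round(tv_distance(s, n), 4))
```

The printed distances should shrink as $n$ grows, which is the behaviour the heuristic predicts.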
The prior $\pi$ contributes $\log \pi(\theta_0) + O(\theta - \theta_0)$, which is asymptotically negligible because $\pi$ varies on scale 1 while the likelihood concentrates on scale $1/\sqrt{n}$. This heuristic is made precise by the Bernstein-von Mises theorem. ## The Bernstein-von Mises Theorem The following theorem is the central result of this chapter. It asserts that the posterior distribution is, in total variation, asymptotically indistinguishable from a normal distribution centered at the MLE. [quotetheorem:1867] The hypotheses are worth examining closely. The condition $\pi(\theta_0) > 0$ is the minimal requirement that the prior does not rule out the truth — if $\pi(\theta_0) = 0$, the posterior cannot concentrate at $\theta_0$ regardless of the data, and the theorem fails. The continuity of $\pi$ at $\theta_0$ is needed so that the prior acts approximately as a constant on the $O(1/\sqrt{n})$ scale where the posterior lives; a prior that oscillates rapidly near $\theta_0$ could in principle interfere with the posterior shape even in the limit. The regularity conditions on the model (the same ones guaranteeing asymptotic normality of the MLE) ensure that the Taylor expansion of the log-likelihood around $\hat{\theta}_n$ is valid with controlled error terms. Without identifiability, the log-likelihood need not have a unique maximum, and the posterior may split mass between multiple modes rather than concentrating near a single Gaussian. Notice also that the approximating distribution $\phi_n = \mathcal{N}(\hat{\theta}_n, I(\theta_0)^{-1}/n)$ is random: its center $\hat{\theta}_n$ depends on the data. This is entirely appropriate — the posterior mean is itself random, and the theorem says the whole posterior tracks this random Gaussian. [remark: Prior Does Not Affect the Limit] The approximating distribution $\phi_n$ depends only on the MLE $\hat{\theta}_n$ and the Fisher information $I(\theta_0)$, not on the prior $\pi$. 
By the triangle inequality, this means that for any two priors $\pi_1$ and $\pi_2$ both satisfying the hypotheses, the corresponding posteriors $\Pi_n^{(1)}$ and $\Pi_n^{(2)}$ satisfy $\|\Pi_n^{(1)} - \Pi_n^{(2)}\|_{\mathrm{TV}} \to 0$ almost surely. The prior matters at finite $n$, but all well-behaved priors lead to the same asymptotic posterior. [/remark]

## Proof Strategy

The full proof of the Bernstein-von Mises theorem is technically involved, but the key ideas can be laid out in three stages.

[proof] **Step 1: Reduction via symmetry of total variation.** Since $\Pi_n$ and $\phi_n$ are both probability densities, they integrate to 1, so $\int_\Theta (\Pi_n(\theta) - \phi_n(\theta))\, d\theta = 0$. The positive and negative parts of $(\Pi_n - \phi_n)$ therefore have equal integral. Writing $(x)_+ = \max\{x, 0\}$ and working with the negative part, \begin{align*} \|\Pi_n - \phi_n\|_{\mathrm{TV}} = 2\int_\Theta (\phi_n(\theta) - \Pi_n(\theta))_+\, d\theta = 2\int_\Theta \left(1 - \frac{\Pi_n(\theta)}{\phi_n(\theta)}\right)_+ \phi_n(\theta)\, d\theta. \end{align*} The function $x \mapsto (1-x)_+$ is bounded by 1 on $x \geq 0$, which is the reason for choosing the negative part.

**Step 2: Ratio convergence.** Suppose the ratio $\Pi_n/\phi_n$ converges to 1 almost surely at each point, understood in the rescaled coordinate of Step 3, where the reference Gaussian no longer depends on $n$. Then the integrand $(1 - \Pi_n/\phi_n)_+$ converges to 0 almost surely at each point. Since it is bounded by 1, the dominated convergence theorem yields \begin{align*} \int_\Theta \left(1 - \frac{\Pi_n(\theta)}{\phi_n(\theta)}\right)_+ \phi_n(\theta)\, d\theta \to 0 \quad \text{a.s.} \end{align*}

**Step 3: Establishing the ratio convergence.** The ratio $\Pi_n(\theta)/\phi_n(\theta)$ is analyzed by centering at the MLE.
Writing $V = \sqrt{n}(\theta - \hat{\theta}_n)$, so that $\theta = \hat{\theta}_n + V/\sqrt{n}$, the change-of-variables density $\Pi_{n,V}(v) = n^{-1/2} \Pi_n(\hat{\theta}_n + v/\sqrt{n})$ satisfies \begin{align*} \log \Pi_{n,V}(v) = \log \pi\!\left(\hat{\theta}_n + \tfrac{v}{\sqrt{n}}\right) + \ell_n\!\left(\hat{\theta}_n + \tfrac{v}{\sqrt{n}}\right) - \log Z_n', \end{align*} where $Z_n'$ is a normalization constant. A second-order Taylor expansion of $\ell_n$ around $\hat{\theta}_n$, combined with the fact that $\ell_n'(\hat{\theta}_n) = 0$ and $\ell_n''(\hat{\theta}_n)/n \to -I(\theta_0)$ a.s. (by the LLN applied to the observed information), gives \begin{align*} \ell_n\!\left(\hat{\theta}_n + \tfrac{v}{\sqrt{n}}\right) \approx \ell_n(\hat{\theta}_n) - \tfrac{1}{2} I(\theta_0) v^2. \end{align*} Meanwhile, $\log \pi(\hat{\theta}_n + v/\sqrt{n}) \to \log \pi(\theta_0)$ because $\hat{\theta}_n \to \theta_0$ a.s. and $v/\sqrt{n} \to 0$. The approximating distribution $\phi_{n,V}$ under the same change of variables is $\mathcal{N}(0, I(\theta_0)^{-1})$, with $\log \phi_{n,V}(v) = -\frac{1}{2} I(\theta_0) v^2 - \log C(\theta_0)$. The ratio $\Pi_{n,V}(v)/\phi_{n,V}(v) \to 1$ for each fixed $v$, completing the sketch. [/proof] ## The Gaussian Approximation and Laplace's Method The proof strategy above is an instance of a general technique called **Laplace's method** for approximating integrals. Suppose $f: \Theta \to \mathbb{R}$ has a unique maximum at $\theta^*$ with $f''(\theta^*) < 0$. For large $n$, the function $e^{n f(\theta)}$ is exponentially concentrated near $\theta^*$. A second-order expansion gives \begin{align*} f\!\left(\theta^* + \tfrac{x}{\sqrt{n}}\right) \approx f(\theta^*) - \tfrac{1}{2n} x^2 |f''(\theta^*)|, \end{align*} so \begin{align*} e^{n f(\theta^* + x/\sqrt{n})} \approx e^{n f(\theta^*)} \cdot e^{-\frac{1}{2}|f''(\theta^*)| x^2}. 
\end{align*}

[example: Laplace Approximation of the Posterior] Consider $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$. The log-posterior is \begin{align*} \log \Pi_n(\theta) = -\frac{\theta^2}{2} - \frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2 - \text{const} = -\frac{n+1}{2}\theta^2 + n\bar{X}_n \theta - \text{const}. \end{align*} Completing the square, the posterior is $\mathcal{N}\!\left(\frac{n\bar{X}_n}{n+1}, \frac{1}{n+1}\right)$. The Bernstein-von Mises approximation $\phi_n = \mathcal{N}(\hat{\theta}_n, 1/n) = \mathcal{N}(\bar{X}_n, 1/n)$ has mean $\bar{X}_n$ instead of $\frac{n}{n+1}\bar{X}_n$ and variance $1/n$ instead of $\frac{1}{n+1}$. The mean difference $\frac{\bar{X}_n}{n+1}$ is of order $1/n$, which is negligible compared with the posterior standard deviation of order $1/\sqrt{n}$, and the variance difference $\frac{1}{n(n+1)}$ is negligible compared with the variance itself, since the ratio of the two variances is $\frac{n}{n+1} \to 1$. Together these imply that the total variation distance between the two Gaussians goes to zero as $n \to \infty$. This confirms the theorem in the Gaussian-Gaussian conjugate case by direct computation. [/example]

The computation in this example is exact because the conjugate Gaussian model produces a Gaussian posterior. In general, the posterior is not Gaussian at finite $n$, and the theorem asserts only the asymptotic equivalence.

## Consequences for Credible Sets

The most important practical consequence of the Bernstein-von Mises theorem is that Bayesian credible sets are asymptotically valid frequentist confidence sets.

[definition: Credible Set] Let $\alpha \in (0,1)$. A **$(1-\alpha)$-credible set** (or credible region) for $\theta$ based on the posterior $\Pi_n$ is any measurable set $C_n \subseteq \Theta$ such that \begin{align*} \Pi_n(C_n) = \Pi_n(\theta \in C_n \mid X_1, \ldots, X_n) = 1 - \alpha. \end{align*} A natural choice is the **highest posterior density (HPD) credible set**, which takes $C_n = \{\theta : \Pi_n(\theta) \geq c_n\}$ for the threshold $c_n$ making the posterior mass exactly $1-\alpha$.
[/definition] In the parametric setting above, a common form of credible set is the symmetric interval \begin{align*} C_n = \left\{\nu : |\nu - \hat{\theta}_n| \leq \frac{R_n}{\sqrt{n}}\right\} \quad \text{or} \quad C_n = \left\{\nu : |\nu - \bar{\theta}_n| \leq \frac{R_n}{\sqrt{n}}\right\}, \end{align*} where $\bar{\theta}_n = \mathbb{E}_{\Pi_n}[\theta \mid X_1, \ldots, X_n]$ is the posterior mean and $R_n$ is a random threshold chosen so that $\Pi_n(C_n) = 1 - \alpha$. [quotetheorem:1870] [citeproof:1870] The condition that $\phi_n(C_n) \to 1-\alpha$ is automatically satisfied for the natural credible sets (symmetric HPD intervals around the posterior mean or MLE) because the BvM theorem makes $\Pi_n$ and $\phi_n$ equivalent in total variation. ## The Posterior Mean as a Frequentist Estimator A separate but related consequence concerns the posterior mean $\bar{\theta}_n = \mathbb{E}_{\Pi_n}[\theta \mid X_1, \ldots, X_n]$ as a point estimator. Recall from the Gaussian example that \begin{align*} \bar{\theta}_n = \frac{n}{n+1}\hat{\theta}_n. \end{align*} This is not equal to the MLE, but it is very close for large $n$. The decomposition \begin{align*} \sqrt{n}(\bar{\theta}_n - \theta_0) = \sqrt{n}(\bar{\theta}_n - \hat{\theta}_n) + \sqrt{n}(\hat{\theta}_n - \theta_0) \end{align*} separates the estimation error into a prior-induced bias term and the MLE error. The second term converges in distribution to $\mathcal{N}(0, I(\theta_0)^{-1})$. For the Gaussian example, \begin{align*} \sqrt{n}(\bar{\theta}_n - \hat{\theta}_n) = \sqrt{n}\!\left(\frac{n}{n+1} - 1\right)\!\hat{\theta}_n = -\frac{\sqrt{n}}{n+1}\hat{\theta}_n \xrightarrow{P_{\theta_0}} 0, \end{align*} because $\hat{\theta}_n \to \theta_0$ and $\sqrt{n}/(n+1) \to 0$. By Slutsky's lemma, the sum converges in distribution to $\mathcal{N}(0, I(\theta_0)^{-1})$ — the same limiting distribution as the MLE. 
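
The Slutsky argument for the Gaussian example can be checked by simulation. This is a minimal sketch, assuming the $\mathcal{N}(\theta, 1)$ model with prior $\mathcal{N}(0,1)$; the function name, seed, and `theta0 = 1.0` are illustrative. As a shortcut, $\bar{X}_n$ is drawn directly from $\mathcal{N}(\theta_0, 1/n)$, which is its exact distribution for Gaussian data.

```python
import math
import random
import statistics

def rescaled_posterior_mean_errors(n, theta0, reps, seed=0):
    """Simulate sqrt(n) * (posterior mean - theta0) over `reps` replications
    of the N(theta, 1) model with an N(0, 1) prior."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        xbar = rng.gauss(theta0, 1.0 / math.sqrt(n))   # exact law of the sample mean
        post_mean = n / (n + 1) * xbar                 # conjugate posterior mean
        errors.append(math.sqrt(n) * (post_mean - theta0))
    return errors

for n in (4, 25, 400):
    errs = rescaled_posterior_mean_errors(n, theta0=1.0, reps=20000)
    # Mean should approach 0 and variance should approach 1 / I(theta0) = 1.
    print(n, round(statistics.fmean(errs), 3), round(statistics.pvariance(errs), 3))
```

For small $n$ the simulated mean is visibly negative, reflecting the prior-induced bias term $-\sqrt{n}\,\theta_0/(n+1)$; for large $n$ that term should fade and the variance should settle near 1.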
More generally, whenever the BvM theorem holds, the posterior mean and MLE are asymptotically equivalent at first order, and the posterior mean can serve as the center of a frequentist confidence region.

[example: Using the Posterior Mean in a Confidence Interval] In the $\mathcal{N}(\theta, 1)$ model with prior $\theta \sim \mathcal{N}(0,1)$, the Fisher information is $I(\theta_0) = 1$, and the $95\%$ confidence interval based on the posterior mean is \begin{align*} C_n = \left\{\nu : |\nu - \bar{\theta}_n| \leq \frac{1.96}{\sqrt{n}}\right\}. \end{align*} The coverage satisfies $P_{\theta_0}(\theta_0 \in C_n) \to 0.95$ because $\sqrt{n}(\bar{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0,1)$. Here the Fisher information is known, so no estimation of $I(\theta_0)$ is needed. In models where $I(\theta_0)$ depends on $\theta_0$, one replaces it by $I(\hat{\theta}_n)$ or $-\ell_n''(\hat{\theta}_n)/n$, which are consistent estimators of $I(\theta_0)$. [/example]

## When the Theorem Fails: Misspecification and Infinite Dimensions

The Bernstein-von Mises theorem is a finite-dimensional result, and it has genuine limitations. Understanding where it fails is as important as understanding what it says.

**Model misspecification.** If the data truly come from a distribution $P_0$ that does not belong to the parametric family $\{P_\theta : \theta \in \Theta\}$, there is no true parameter in the model for the MLE or the posterior to concentrate on. Instead, the MLE converges to the pseudo-true parameter $\theta^* = \arg\min_{\theta \in \Theta} \mathrm{KL}(P_0, P_\theta)$, and under additional assumptions the posterior also concentrates near $\theta^*$. However, the posterior variance and the sampling variance of the MLE no longer agree: the sampling variance of the MLE takes a "sandwich" form reflecting the mismatch between the model and the true distribution, while the posterior variance is governed by the curvature of the expected log-likelihood at $\theta^*$ alone. Credible sets built on the assumption of a well-specified model can badly undercover or overcover in the misspecified regime.
A symmetric interval of half-width $1.96/\sqrt{n}$ around $\hat{\theta}_n$ may then have frequentist coverage far from the nominal $95\%$.

**Infinite-dimensional parameters.** In nonparametric settings, such as estimating an unknown density $f$ or a regression function $g$, the parameter space is infinite-dimensional. The BvM theorem fails in great generality in infinite dimensions, even under very natural priors. The posterior can be consistent (concentrating near the truth in a suitable metric), but it does not converge in total variation to a Gaussian distribution centered at an efficient estimator, because no efficient estimator achieves the parametric rate in infinite dimensions. Freedman (1999) gave a striking negative result: for i.i.d. observations from a density, there exist priors such that $95\%$ credible intervals have asymptotic frequentist coverage converging to 0. The failure mode is that the posterior contracts at the correct rate but the credible sets are not calibrated to the actual sampling distribution.

**Degenerate or improper priors.** Even in parametric models, the theorem can fail if the prior is improper or is concentrated on a set that excludes $\theta_0$. For example, a prior that is a Dirac mass at a point $\theta_1 \neq \theta_0$ gives a posterior that is identically $\delta_{\theta_1}$ regardless of the data, which is the worst possible failure.

These failure modes underscore that the hypotheses in the BvM theorem are not mere technical conveniences: each one prevents a specific breakdown of the asymptotic Gaussian behavior.

## Posterior Consistency

The BvM theorem is a strong convergence result, but even asking whether the posterior concentrates near $\theta_0$ at all, without any rate or shape, is already a nontrivial question called **posterior consistency**.
[definition: Posterior Consistency] The posterior $\Pi_n$ is **consistent** at $\theta_0$ if, for every $\varepsilon > 0$, \begin{align*} \Pi_n\!\left(\theta : |\theta - \theta_0| > \varepsilon\right) = \Pi\!\left(|\theta - \theta_0| > \varepsilon \mid X_1, \ldots, X_n\right) \xrightarrow{P_{\theta_0}} 0. \end{align*} [/definition] Posterior consistency is a weaker statement than the BvM theorem: it only says the posterior mass escaping neighborhoods of $\theta_0$ goes to zero in probability, not that the shape of the posterior approaches a Gaussian. In parametric models under the BvM hypotheses, consistency is implied by the theorem itself. But the concept is especially important in nonparametric settings where the BvM theorem is unavailable. A sufficient condition for consistency, valid in fairly general settings, is **Schwartz's theorem**, which requires that the prior assigns positive mass to every Kullback-Leibler neighborhood of $f(\cdot, \theta_0)$. The intuition is that the likelihood ratio $\prod_{i=1}^n f(X_i, \theta)/f(X_i, \theta_0)$ decays exponentially fast for $\theta$ bounded away from $\theta_0$ (by the law of large numbers applied to $\log f(X, \theta) - \log f(X, \theta_0)$, which has negative mean $-\mathrm{KL}(P_{\theta_0}, P_\theta)$). As long as the prior gives positive mass to neighborhoods of $\theta_0$, the posterior mass near $\theta_0$ grows exponentially relative to the mass far from $\theta_0$, driving the ratio to zero. The rate at which the posterior concentrates is the subject of **posterior contraction theory**, which extends the study of estimation rates to the Bayesian setting. A posterior is said to contract at rate $\varepsilon_n$ if $\Pi_n(\theta : |\theta - \theta_0| > M\varepsilon_n) \to 0$ in probability for large enough $M$. In parametric models, the BvM theorem implies contraction at rate $1/\sqrt{n}$, the parametric rate. 
In nonparametric models, the contraction rate depends on the smoothness of $\theta_0$ and the concentration properties of the prior, mirroring the minimax estimation rates. ## Summary The Bernstein-von Mises theorem establishes a deep asymptotic equivalence between Bayesian and frequentist inference in parametric models. Under mild conditions on the model and prior, the posterior concentrates near the MLE at the correct $1/\sqrt{n}$ rate, its shape approaches a Gaussian with the correct Fisher information variance, and credible sets become valid frequentist confidence sets. The prior washes out: all well-behaved priors lead to the same asymptotic posterior, determined entirely by the likelihood and Fisher information. This equivalence breaks down — sometimes dramatically — in misspecified models, in infinite-dimensional parameter spaces, and when the prior violates even the minimal assumption $\pi(\theta_0) > 0$. The study of posteriors beyond the BvM regime — their consistency, contraction rates, and geometry in nonparametric settings — is an active and technically demanding area of mathematical statistics, building on the foundations laid in this course. A natural bridge between frequentist and Bayesian inference is the credible set — a Bayesian construction that often mimics the behavior of confidence sets. Yet important differences remain in interpretation and optimality properties. # 13. Credible Sets as Confidence Sets The preceding chapters established the Bernstein–von Mises theorem: under regularity conditions, the posterior distribution $\Pi_n(\cdot \mid X_1, \ldots, X_n)$ concentrates around the MLE $\hat{\theta}_n$ and, after rescaling by $\sqrt{n}$, converges in total variation to $N(0, I(\theta_0)^{-1})$. This chapter asks a sharper question: can we extract confidence sets from the posterior directly? 
Specifically, if we take the posterior's central $(1-\alpha)$ mass and read it off as a set for $\theta$, does that set carry genuine frequentist guarantees? The answer, given BvM, is yes, and making this precise is the main task of this chapter. We then pivot to decision theory, the framework that unifies estimation, testing, and inference under a common language of loss functions and risk.

## From Posterior Quantiles to Frequentist Sets

The starting point is a natural family of credible sets. Given the posterior $\Pi_n = \Pi(\cdot \mid X_1, \ldots, X_n)$, define the symmetric ball around the MLE: \begin{align*} C_n = \left\{ \nu : |\nu - \hat{\theta}_n| \leq \frac{R_n}{\sqrt{n}} \right\}, \end{align*} where $R_n > 0$ is chosen so that $\Pi_n(C_n) = 1 - \alpha$. Here $R_n$ is a data-dependent radius, determined implicitly by demanding that the posterior assigns exactly mass $1 - \alpha$ to $C_n$.

The set $C_n$ is a credible set by construction. The question is whether it is also a confidence set in the frequentist sense: does $P_{\theta_0}(\theta_0 \in C_n) \to 1 - \alpha$ as $n \to \infty$? To answer this, we need to understand the asymptotic behaviour of $R_n$ itself. Note that $R_n$ depends on the data both through $\hat{\theta}_n$ and through the shape of the posterior, so its limit is not immediate.

[definition: Phi Zero Function] For $t \geq 0$, define the function $\Phi_0 : [0, \infty) \to [0, 1)$ by \begin{align*} \Phi_0(t) = P(|Z_0| \leq t) = \int_{-t}^{t} \varphi_0(x)\, dx, \end{align*} where $Z_0 \sim N(0, I(\theta_0)^{-1})$ and $\varphi_0$ is the corresponding Gaussian density. The function $\Phi_0$ is strictly increasing, continuous, and bijective from $[0, \infty)$ to $[0, 1)$. Its functional inverse $\Phi_0^{-1} : [0, 1) \to [0, \infty)$ is also continuous. [/definition]

The function $\Phi_0$ plays the role that the standard normal CDF plays in ordinary confidence intervals, but calibrated to the Fisher information at the true parameter $\theta_0$.
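
For a one-dimensional parameter, $\Phi_0$ and its inverse have simple closed forms in terms of the error function and the standard normal quantile: $\Phi_0(t) = \operatorname{erf}\!\big(t\sqrt{I(\theta_0)/2}\big)$ and $\Phi_0^{-1}(p) = z_{(1+p)/2}/\sqrt{I(\theta_0)}$. The sketch below evaluates both; the names `phi0`, `phi0_inv`, and the argument `fisher_info` are hypothetical, and the Fisher information is assumed to be a known positive number.

```python
import math
from statistics import NormalDist

def phi0(t, fisher_info):
    """P(|Z_0| <= t) for Z_0 ~ N(0, 1 / fisher_info), with t >= 0."""
    return math.erf(t * math.sqrt(fisher_info / 2.0))

def phi0_inv(p, fisher_info):
    """Inverse of phi0 on [0, 1): the radius t with P(|Z_0| <= t) = p."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)   # standard normal quantile
    return z / math.sqrt(fisher_info)

# With unit Fisher information the 95% radius is the usual two-sided normal quantile.
print(phi0_inv(0.95, fisher_info=1.0))
# Larger Fisher information gives a smaller limiting radius.
print(phi0_inv(0.95, fisher_info=4.0))
```

The value `phi0_inv(1 - alpha, fisher_info)` is the deterministic limit that the random radius $R_n$ is compared against in what follows.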
When $\theta$ is one-dimensional, $I(\theta_0)^{-1}$ is just the asymptotic variance of $\hat{\theta}_n$, so $\Phi_0^{-1}(1-\alpha)$ is the quantile of $|Z_0|$ at level $1-\alpha$ — the analogue of $z_{\alpha/2}$ in classical normal-theory intervals. The definition requires that $I(\theta_0) > 0$, which is precisely the non-degeneracy condition on the Fisher information. If the model were non-identifiable at $\theta_0$, the Fisher information could vanish and $Z_0$ would be degenerate; the argument breaks down entirely in that case. ## Convergence of the Posterior Radius The key to the whole argument is that $R_n$, which is defined implicitly through the posterior, converges almost surely to the deterministic value $\Phi_0^{-1}(1-\alpha)$. [quotetheorem:1876] [citeproof:1876] The key step in this proof is identifying the posterior mass of $C_n$ with the Gaussian mass up to a total variation error — exactly the content of BvM. Without BvM, we would have no handle on the shape of the posterior, and $R_n$ could behave arbitrarily. The almost sure (rather than merely in-probability) convergence here is a consequence of the strong form of BvM used in the course. ## The Main Coverage Theorem With Lemma 3.1 in hand, we can now prove the central result. [quotetheorem:1884] [citeproof:1884] [remark: Posterior Mean Instead of MLE] The same result holds with the posterior mean $\bar{\theta}_n$ in place of $\hat{\theta}_n$: if we define $C_n$ centred at $\bar{\theta}_n$ with radius $R_n/\sqrt{n}$ chosen so that $\Pi_n(C_n) = 1 - \alpha$, the frequentist coverage still converges to $1 - \alpha$. This follows because the posterior mean and the MLE differ by $o(n^{-1/2})$ under BvM, so Slutsky applies with the same force. [/remark] The result is striking from a foundational standpoint. 
A Bayesian who builds a credible set entirely from the posterior — using whatever prior satisfies BvM's conditions — obtains a set that is simultaneously a frequentist confidence set in the large-sample limit. The two schools of inference converge, at least for this class of problems. There are important limitations, however. The result is purely asymptotic: for finite $n$, the frequentist coverage of $C_n$ may differ from $1 - \alpha$, and whether it is above or below depends on the prior and the model. The result also requires the BvM conditions, which exclude heavy-tailed likelihoods, non-regular models (boundaries, non-smooth densities), and infinite-dimensional parameters. When any of these conditions fail, a credible set can have frequentist coverage that does not converge to $1 - \alpha$ at all.  ## Decision Theory: A Unifying Framework Having studied confidence sets, it is natural to ask how all these statistical procedures — estimation, testing, confidence sets — relate to one another. Decision theory provides a single framework that treats them as instances of one problem. A statistical model $\{f(\cdot, \theta) : \theta \in \Theta\}$ yields observations $X \in \mathcal{X}$. A **decision rule** is any measurable function \begin{align*} \delta : \mathcal{X} \to \mathcal{A}, \end{align*} where $\mathcal{A}$ is the **action space** — the space of possible outputs or "decisions." The framework is complete once we specify what $\mathcal{A}$ is. [example: Three Statistical Problems as Decision Problems] Each of the main statistical problems corresponds to a particular choice of action space: - **Hypothesis testing.** $\mathcal{A} = \{0, 1\}$. The decision $\delta(X)$ is a test: output $0$ to accept $H_0$, output $1$ to reject. - **Point estimation.** $\mathcal{A} = \Theta$. The decision $\delta(X) = \hat{\theta}(X)$ is an estimator — a point in the parameter space. - **Confidence sets / inference.** $\mathcal{A} = \{\text{subsets of } \Theta\}$. 
The decision $\delta(X) = C(X)$ is a confidence set. In all three cases, the input is the same data $X$ and the output is a specific kind of object; the decision-theoretic language simply makes this explicit. [/example] ### Loss Functions and Risk To assess the quality of a decision rule, we need to measure how "wrong" a given action is when the true parameter is $\theta$. [definition: Loss Function] A **loss function** is a measurable map $L : \mathcal{A} \times \Theta \to [0, \infty)$. For an action $a \in \mathcal{A}$ and a true parameter $\theta \in \Theta$, the value $L(a, \theta)$ quantifies the cost of taking action $a$ when $\theta$ is the truth. [/definition] The loss function is always assumed non-negative, reflecting that there is no benefit to being wrong. Different problems naturally call for different losses. [example: Standard Loss Functions] **Hypothesis testing.** When $\theta \in \{0, 1\}$ indexes the hypotheses, the natural loss is the misclassification indicator: \begin{align*} L(a, \theta) = \mathbf{1}_{\{a \neq \theta\}}. \end{align*} This assigns loss $1$ to any incorrect decision and $0$ to correct ones; it does not distinguish between type I and type II errors. **Estimation.** Two standard choices are absolute error and squared error: \begin{align*} L(a, \theta) = |a - \theta|, \qquad \text{or} \qquad L(a, \theta) = |a - \theta|^2. \end{align*} Squared error is more tractable analytically (differentiable, related to variance) while absolute error is more robust to outliers. [/example] Since $X$ is random, the loss $L(\delta(X), \theta)$ is itself a random variable. The natural summary is its expectation under $P_\theta$. [definition: Risk Function] For a loss function $L$ and a decision rule $\delta$, the **risk function** is \begin{align*} R(\delta, \theta) = E_\theta[L(\delta(X), \theta)] = \int_{\mathcal{X}} L(\delta(x), \theta)\, f(x, \theta)\, dx, \end{align*} where the integral is over $x \in \mathcal{X}$ with respect to $P_\theta$. 
[/definition] The risk $R(\delta, \theta)$ depends on both the rule $\delta$ and the unknown true value $\theta$. This dual dependence is the central difficulty of decision theory. [example: Risk in Concrete Problems] **Hypothesis testing.** Under the misclassification loss $L(a, \theta) = \mathbf{1}_{\{a \neq \theta\}}$, the risk is the probability of error: \begin{align*} R(\delta, \theta) = E_\theta[\mathbf{1}_{\{\delta(X) \neq \theta\}}] = P_\theta(\delta(X) \neq \theta). \end{align*} At $\theta = 0$ this is the type I error probability; at $\theta = 1$ it is the type II error probability. **Estimation under squared error.** For an estimator $\delta = \hat{\theta}$, the risk is the mean squared error: \begin{align*} R(\delta, \theta) = E_\theta[(\delta(X) - \theta)^2] = E_\theta[(\hat{\theta}(X) - \theta)^2]. \end{align*} **Binomial proportion estimation.** Take $X \sim \operatorname{Bin}(n, \theta)$ with $\theta \in [0,1]$ and the estimator $\hat{\theta}(X) = X/n$ (the sample proportion). Since $X/n$ is unbiased, its risk equals its variance: \begin{align*} R(\hat{\theta}, \theta) = E_\theta\!\left[\!\left(\frac{X}{n} - \theta\right)^{\!2}\right] = \frac{\operatorname{Var}_\theta(X)}{n^2} = \frac{n\theta(1-\theta)}{n^2} = \frac{\theta(1-\theta)}{n}. \end{align*} Now consider a competing estimator that ignores the data entirely: $\hat{\eta}(X) = 1/2$. Its risk is: \begin{align*} R(\hat{\eta}, \theta) = E_\theta\!\left[\!\left(\frac{1}{2} - \theta\right)^{\!2}\right] = \left(\theta - \frac{1}{2}\right)^{\!2}. \end{align*} Comparing the two: $R(\hat{\theta}, \theta) = \theta(1-\theta)/n \leq 1/(4n)$, which tends to zero uniformly in $\theta$ as $n \to \infty$. On the other hand, $R(\hat{\eta}, \theta) = (\theta - 1/2)^2$ does not depend on $n$; it is zero only at $\theta = 1/2$ and approaches $1/4$ near $\theta = 0$ or $\theta = 1$. Using $\theta(1-\theta) = 1/4 - (\theta - 1/2)^2$, the inequality $R(\hat{\theta}, \theta) < R(\hat{\eta}, \theta)$ holds exactly when $|\theta - 1/2| > 1/(2\sqrt{n+1})$. So $\hat{\theta}$ beats $\hat{\eta}$ outside a neighbourhood of $1/2$ whose width shrinks as $n$ grows, while $\hat{\eta}$ beats $\hat{\theta}$ inside that neighbourhood for every $n$.
[/example] ### No Uniform Comparison in General The binomial example reveals a fundamental obstruction. The risk functions $R(\hat{\theta}, \cdot)$ and $R(\hat{\eta}, \cdot)$ are not uniformly ordered over $\theta \in [0,1]$: neither estimator dominates the other for all $\theta$ simultaneously. For every $n$, the constant estimator $\hat{\eta}$ achieves lower risk in a neighbourhood of $\theta = 1/2$ (one that shrinks as $n$ grows), because its guess happens to be correct there, whereas the data-based estimator always pays a variance cost of order $1/n$. This is not an accident specific to this example; it is a generic phenomenon in decision theory. Outside trivial cases, no single rule minimizes $R(\delta, \theta)$ for all $\theta$ simultaneously: under squared error, the constant rule $\delta \equiv \theta_0$ has zero risk at $\theta_0$, so a uniformly best rule would need zero risk at every $\theta$. This motivates the search for weaker notions of optimality (admissibility, minimax optimality, and Bayes optimality), which will occupy the next part of the course. [remark: Decision Theory as a Common Language] The decision-theoretic framework should be seen as a translation device rather than a new theory. Many concepts from the previous chapters can be re-expressed in its language: the MLE becomes the minimizer of empirical risk under negative log-likelihood loss; a Neyman–Pearson test becomes a rule minimizing a weighted combination of type I and type II error probabilities; a confidence set is a decision rule whose risk (under appropriate coverage loss) is controlled uniformly. The framework clarifies which comparisons make sense and which optimality claims are meaningful. [/remark] Bayesian methods naturally accommodate the concept of risk: the expected loss under the posterior distribution. This perspective allows us to compare estimators and procedures in a unified framework. # 14. Bayesian Risk ## Motivation: Why Average the Risk?
Throughout the course, risk has been evaluated pointwise: for each fixed parameter value $\theta \in \Theta$, the risk $R(\delta, \theta) = \mathbb{E}_\theta[L(\delta(X), \theta)]$ measures the expected loss of a decision rule $\delta$ when the truth is exactly $\theta$. This is a powerful framework, but it leads to a genuine difficulty. No single estimator minimises $R(\delta, \theta)$ for every $\theta$ simultaneously — the estimator that works best near one parameter value may perform poorly elsewhere. Comparing two estimators therefore often reduces to comparing two curves over $\Theta$, with no canonical way to declare a winner. The Bayesian approach resolves this by placing a probability distribution $\pi$ over $\Theta$, called a **prior**, which encodes a belief about which parameter values are plausible. With a prior in hand, one can average the risk over $\Theta$ and obtain a single number — the Bayes risk — that aggregates performance across the whole parameter space. This chapter develops the Bayes risk, identifies the decision rules that minimise it, and examines the structural consequences of being both unbiased and Bayes. ## The Bayes Risk and Bayes Decision Rules What does it mean to be an optimal estimator when the parameter is itself random? The question looks circular, but the averaging structure makes it precise. [definition: Bayes Risk] Let $\pi$ be a prior distribution on $\Theta$. The **$\pi$-Bayes risk** of a decision rule $\delta$ for a loss function $L$ is \begin{align*} R_\pi(\delta) = \mathbb{E}_\pi[R(\delta, \theta)] = \int_\Theta R(\delta, \theta)\, \pi(\theta)\, d\theta = \int_\Theta \int_{\mathcal{X}} L(\delta(x), \theta)\, \pi(\theta)\, f(x, \theta)\, dx\, d\theta. \end{align*} A **$\pi$-Bayes decision rule** $\delta_\pi$ is any decision rule minimising $R_\pi(\delta)$ over all $\delta$. [/definition] The double integral deserves unpacking. 
The inner integral over $\mathcal{X}$ is the frequentist risk $R(\delta, \theta)$: it averages loss over the randomness of the data $X$ when $\theta$ is fixed. The outer integral over $\Theta$ then averages this risk over the prior $\pi$. The result is a weighted average of pointwise risks, where the weights reflect prior beliefs about $\theta$. [example: Binomial Model with Uniform Prior] Consider the binomial model $X \sim \operatorname{Bin}(n, \theta)$ with prior $\pi = \operatorname{Uniform}[0,1]$ on $\theta$. Under quadratic loss, the pointwise risk of the estimator $\delta(X) = X/n$ is $R(X/n, \theta) = \theta(1-\theta)/n$, since $X/n$ is unbiased with variance $\theta(1-\theta)/n$. The Bayes risk is \begin{align*} R_\pi(X/n) = \mathbb{E}_\pi\!\left[\frac{\theta(1-\theta)}{n}\right] = \frac{1}{n} \int_0^1 \theta(1-\theta)\, d\theta = \frac{1}{n} \cdot \frac{1}{6} = \frac{1}{6n}. \end{align*} The integral $\int_0^1 \theta(1-\theta)\, d\theta = \int_0^1 (\theta - \theta^2)\, d\theta = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}$ follows by direct computation. So the Bayes risk of the MLE $X/n$ under the uniform prior is $1/(6n)$, which decreases to zero at the expected rate. Whether this is the best possible rate — whether some other estimator achieves a strictly smaller Bayes risk — is exactly the question the theory of Bayes decision rules addresses. [/example] ## Posterior Risk and Its Relationship to Bayes Risk The prior $\pi$ summarises beliefs about $\theta$ before observing data. After observing $X = x$, those beliefs are updated through Bayes' theorem to the **posterior** $\Pi(\cdot | x)$. The posterior concentrates mass on parameter values that are consistent with the observation, and it gives rise to a natural notion of loss. 
[definition: Posterior Risk] For a Bayesian model with prior $\pi$, the **posterior risk** of a decision rule $\delta$ at observation $x \in \mathcal{X}$ is \begin{align*} R_\Pi(\delta) = \mathbb{E}[L(\delta(x), \theta) \mid x], \end{align*} where the expectation is over $\theta$ drawn from the posterior $\Pi(\cdot | x)$. [/definition] [remark: Direction of Expectation] The expectation in the posterior risk $R_\Pi(\delta) = \mathbb{E}[L(\delta(x), \theta) \mid x]$ is taken over $\theta$, not over $X$. This is the opposite of the frequentist risk $R(\delta, \theta) = \mathbb{E}_\theta[L(\delta(X), \theta)]$, where $\theta$ is fixed and the expectation runs over $X$. Conflating these two directions leads to confusion about what is random. For the binomial model with quadratic loss, expanding the squared error gives \begin{align*} R_\Pi(\delta) = \mathbb{E}_\Pi[(\delta(x) - \theta)^2 \mid x] &= \mathbb{E}_\Pi[\delta(x)^2 - 2\delta(x)\theta + \theta^2 \mid x] \\ &= \delta(x)^2 - 2\delta(x)\,\mathbb{E}_\Pi[\theta \mid x] + \mathbb{E}_\Pi[\theta^2 \mid x]. \end{align*} The value $\delta(x)$ is not random (the observation $x$ is fixed), so it factors out of the conditional expectation. [/remark] The crucial observation is that minimising the posterior risk pointwise — for each fixed $x$ — is enough to minimise the Bayes risk globally. This is the content of the following proposition, which is the workhorse of Bayesian decision theory. [quotetheorem:1892] [citeproof:1892] The proposition's proof reveals why the Bayesian approach is computationally attractive: instead of a global optimisation problem over all decision rules and all parameter values simultaneously, it decomposes into a family of local optimisation problems, one for each observed value $x$. 
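The decomposition into one optimisation problem per observed $x$ can be checked numerically. The following is a minimal sketch, not part of the course material: it assumes the $\operatorname{Bin}(n, \theta)$ model with a $\operatorname{Uniform}[0,1]$ prior discretised on a midpoint grid, squared-error loss, and a finite grid of candidate actions. The names `joint`, `pointwise_minimiser` and `bayes_risk` are illustrative choices.

```python
from math import comb

import numpy as np

n = 10                                   # number of binomial trials (illustrative)
m = 1000
theta = (np.arange(m) + 0.5) / m         # midpoint grid on (0, 1)
prior = np.full(m, 1.0 / m)              # discretised Uniform[0, 1] prior
actions = np.linspace(0.0, 1.0, 2001)    # candidate values for delta(x)


def joint(x):
    """Joint weights pi(theta) * f(x, theta) on the theta grid for a fixed x."""
    return prior * comb(n, x) * theta**x * (1.0 - theta) ** (n - x)


def pointwise_minimiser(x):
    """Action minimising the posterior risk E[(a - theta)^2 | x] over the grid."""
    w = joint(x)
    w = w / w.sum()                      # posterior weights given X = x
    risks = ((actions[:, None] - theta[None, :]) ** 2) @ w
    return float(actions[np.argmin(risks)])


def bayes_risk(rule):
    """Double sum over x and theta of loss * prior * likelihood."""
    return float(
        sum(np.sum(joint(x) * (rule(x) - theta) ** 2) for x in range(n + 1))
    )


# Solve the n + 1 local problems, then evaluate the resulting rule globally.
delta = [pointwise_minimiser(x) for x in range(n + 1)]
risk_pointwise = bayes_risk(lambda x: delta[x])
risk_mle = bayes_risk(lambda x: x / n)   # should be close to 1 / (6 n)

print(risk_pointwise, risk_mle)
```

Under these assumptions the rule assembled from the pointwise minimisers should have a Bayes risk no larger than that of $X/n$, up to discretisation error.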
[example: Posterior Mean Under Quadratic Loss] Under squared-error loss, the posterior risk is \begin{align*} R_\Pi(\delta) = \mathbb{E}_\Pi[(\delta(x) - \theta)^2 \mid x] = \delta(x)^2 - 2\delta(x)\,\mathbb{E}_\Pi[\theta \mid x] + \mathbb{E}_\Pi[\theta^2 \mid x]. \end{align*} This is a quadratic function of the scalar $\delta(x)$. Setting the derivative with respect to $\delta(x)$ to zero: \begin{align*} \frac{d}{d\delta(x)} R_\Pi(\delta) = 2\delta(x) - 2\mathbb{E}_\Pi[\theta \mid x] = 0, \end{align*} which gives $\delta(x) = \mathbb{E}_\Pi[\theta \mid x]$. Since the coefficient of $\delta(x)^2$ is positive, this is a minimum. Thus the **posterior mean** $\delta(X) = \mathbb{E}_\Pi[\theta \mid X]$ is the unique Bayes decision rule under squared-error loss. Different loss functions yield different Bayes rules: absolute loss $L(\delta, \theta) = |\delta - \theta|$ leads to the posterior median, and $0/1$ loss leads to the posterior mode. [/example] ## Unbiasedness and Bayes Rules Cannot Coexist A frequentist estimator is unbiased if its expectation under $\mathbb{P}_\theta$ equals $\theta$ for every $\theta \in \Theta$. Bayesian estimators aim at minimising expected posterior loss. These two demands are in fundamental tension: imposing unbiasedness forces a rigid global constraint on $\delta$, while the Bayes optimality condition $\delta(X) = \mathbb{E}_\Pi[\theta \mid X]$ is flexible and adapts to the prior. To see why they are nearly incompatible, consider what it would mean for the posterior mean $\mathbb{E}_\Pi[\theta \mid X]$ to be unbiased: the estimator would have to equal $\theta$ exactly, not just on average. [quotetheorem:1901] The hypotheses matter here. The theorem requires both that $\delta$ is unbiased for every $\theta \in \Theta$ (a strong frequentist condition) and that $\delta$ is Bayes for a prior $\pi$ that has full support over $\Theta$. 
Relaxing either condition creates room for an estimator to be nearly unbiased and nearly Bayes, but exactly satisfying both simultaneously forces degeneracy. The result would fail if the prior $\pi$ were concentrated on a single point $\theta_0$, since then the Bayes rule is the constant estimator $\delta(X) = \theta_0$, which is unbiased only if $\Theta = \{\theta_0\}$. [proof] The proof applies the tower property of conditional expectation in two directions, then combines the resulting equalities. Recall that for any function $Z(X, \theta)$ under the joint law $Q$, the tower property gives both \begin{align*} \mathbb{E}_Q[Z(X, \theta)] = \mathbb{E}_Q[\mathbb{E}_\Pi[Z(X, \theta) \mid X]] \quad \text{and} \quad \mathbb{E}_Q[Z(X, \theta)] = \mathbb{E}_Q[\mathbb{E}_\theta[Z(X, \theta)]], \end{align*} where the first conditions on $X$ and averages over $\theta$, and the second conditions on $\theta$ and averages over $X$. Set $Z(X, \theta) = \theta\, \delta(X)$. From the first form of the tower property, since $\delta$ is Bayes for quadratic loss it satisfies $\delta(X) = \mathbb{E}_\Pi[\theta \mid X]$, so \begin{align*} \mathbb{E}_Q[\theta\, \delta(X)] = \mathbb{E}_Q[\mathbb{E}_\Pi[\theta\, \delta(X) \mid X]] = \mathbb{E}_Q[\delta(X)\, \mathbb{E}_\Pi[\theta \mid X]] = \mathbb{E}_Q[\delta(X)^2]. \end{align*} From the second form, using unbiasedness $\mathbb{E}_\theta[\delta(X)] = \theta$: \begin{align*} \mathbb{E}_Q[\theta\, \delta(X)] = \mathbb{E}_Q[\mathbb{E}_\theta[\theta\, \delta(X)]] = \mathbb{E}_Q[\theta\, \mathbb{E}_\theta[\delta(X)]] = \mathbb{E}_Q[\theta^2]. \end{align*} Therefore $\mathbb{E}_Q[\delta(X)^2] = \mathbb{E}_Q[\theta^2]$, and expanding: \begin{align*} \mathbb{E}_Q[(\delta(X) - \theta)^2] = \mathbb{E}_Q[\delta(X)^2] - 2\mathbb{E}_Q[\theta\, \delta(X)] + \mathbb{E}_Q[\theta^2] = \mathbb{E}_Q[\theta^2] - 2\mathbb{E}_Q[\theta^2] + \mathbb{E}_Q[\theta^2] = 0. 
\end{align*} Since the integrand $(\delta(X) - \theta)^2$ is non-negative and integrates to zero under $Q$, it must equal zero $Q$-almost everywhere, meaning $\delta(X) = \theta$ with $Q$-probability one. [/proof] [remark: Unbiased and Bayes Estimators Are Typically Disjoint] The theorem has concrete consequences for familiar estimators. In the model $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, 1)$, the sample mean $\bar{X}_n$ is unbiased for $\theta$. The theorem implies $\bar{X}_n$ cannot be a Bayes estimator for any proper prior $\pi$ on $\mathbb{R}$: if it were, $\bar{X}_n$ would have to equal $\theta$ with probability one, which is impossible since $\bar{X}_n$ has strictly positive variance $1/n$ under $\mathbb{P}_\theta$. Similarly, in the $\operatorname{Bin}(n, \theta)$ model, the MLE $X/n$ is unbiased. It can only be a Bayes rule in degenerate situations, for instance if the prior $\pi$ places all its mass on the endpoints $\{0, 1\}$, where $X/n = \theta$ with probability one; such priors reduce the estimation problem to triviality. This structural incompatibility is not a defect of either framework. Unbiasedness is a frequentist demand for global calibration across all $\theta$; Bayesian optimality is a demand for posterior-weighted efficiency. The two criteria pull in different directions. [/remark] ## Least Favorable Priors The Bayes risk $R_\pi(\delta_\pi)$ depends on the choice of prior $\pi$: a prior that concentrates on an easy region of $\Theta$ will yield a small Bayes risk, while a prior on a hard region will yield a large one. This raises a minimax question: which prior is hardest in the sense that no Bayes rule can achieve a small risk against it?
[definition: Least Favorable Prior] A prior $\lambda$ on $\Theta$ is called **least favorable** if for every prior $\lambda'$ on $\Theta$, \begin{align*} R_\lambda(\delta_\lambda) \geq R_{\lambda'}(\delta_{\lambda'}), \end{align*} where $\delta_\lambda$ and $\delta_{\lambda'}$ denote the respective Bayes decision rules. [/definition] The definition asks that the Bayes risk under $\lambda$ — even after optimising the decision rule to be the $\lambda$-Bayes rule — is at least as large as the Bayes risk under any other prior (after optimising for that prior). In other words, $\lambda$ is the prior that maximises the minimum achievable Bayes risk. Least favorable priors connect the Bayesian and minimax frameworks. Under regularity conditions, the minimax risk $\inf_\delta \sup_\theta R(\delta, \theta)$ equals the Bayes risk $R_\lambda(\delta_\lambda)$ under the least favorable prior $\lambda$, and the minimax estimator coincides with the $\lambda$-Bayes rule. This duality is a central theme in statistical decision theory: the worst-case frequentist risk and the worst-case Bayesian risk are two sides of the same coin, achieved by the same estimator. [remark: Connecting Minimax and Bayesian Theory] To see why least favorable priors arise naturally in the minimax context, consider that for any prior $\pi$ and any decision rule $\delta$, \begin{align*} \sup_{\theta \in \Theta} R(\delta, \theta) \geq \int_\Theta R(\delta, \theta)\, \pi(\theta)\, d\theta = R_\pi(\delta) \geq R_\pi(\delta_\pi). \end{align*} The first inequality holds because the supremum over $\theta$ bounds any average. The second holds because $\delta_\pi$ minimises the Bayes risk. Taking the infimum over $\delta$ on the left gives \begin{align*} \inf_\delta \sup_\theta R(\delta, \theta) \geq R_\pi(\delta_\pi) \end{align*} for every prior $\pi$. The least favorable prior $\lambda$ is the one that makes this lower bound as large as possible, pushing the minimax risk from below. 
When equality holds — when the lower bound is tight — the Bayes rule under $\lambda$ is simultaneously minimax. [/remark] Not all Bayesian procedures are optimal; some estimators dominate others for all possible prior choices. The concepts of minimax risk and admissibility provide tools to identify which methods are fundamentally unimprovable. # 15. Minimax Risk and Admissibility The previous chapter developed the Bayesian framework, where risk is averaged over a prior distribution on $\theta$. But averaging hides an uncomfortable question: what if the prior is wrong, or what if we refuse to commit to a prior at all? This chapter takes an entirely different stance. Instead of averaging, we ask for the **worst-case** guarantee — the best a decision rule can do when the parameter is chosen adversarially. This leads to the minimax criterion, which turns out to be deeply connected to the Bayesian framework in a non-obvious way: the hardest Bayesian problem and the minimax problem are two faces of the same coin. We also address a more elementary question: when is one decision rule simply better than another in every possible scenario? This is the notion of admissibility, and it gives a clean, prior-free way of discarding estimators that no rational statistician should use. ## Maximal Risk and the Minimax Criterion Why would we want worst-case risk rather than average risk? Consider the scenario where we must provide an estimator to be used in a safety-critical context. An engineer designing a bridge load estimator cannot afford to be right on average — they need to control the worst case. The Bayes risk framework cannot offer this: a prior that assigns tiny weight to some catastrophic region of $\Theta$ will produce an estimator that behaves poorly precisely where it matters most. The minimax criterion forces us to design for the adversary. [definition: Maximal Risk] Let $\delta$ be a decision rule and $\Theta$ a parameter space. 
The **maximal risk** of $\delta$ is \begin{align*} R_m(\delta, \Theta) = \sup_{\theta \in \Theta} R(\delta, \theta). \end{align*} [/definition] The maximal risk records the worst the decision rule can do. Two rules with the same average risk may differ drastically in their maximal risk, and the minimax criterion ranks them by this worst-case performance. [definition: Minimax Risk] The **minimax risk** over a parameter space $\Theta$ is \begin{align*} \inf_{\delta} \sup_{\theta \in \Theta} R(\delta, \theta) = \inf_{\delta} R_m(\delta, \Theta). \end{align*} A decision rule $\delta^*$ attaining this infimum — that is, with $R_m(\delta^*, \Theta) = \inf_\delta R_m(\delta, \Theta)$ — is called **minimax**. [/definition] The minimax rule is the one that minimises the worst-case exposure. Notice the order of quantifiers: $\inf_\delta \sup_\theta$ means we search for the rule that does best against the worst parameter value. Reversing the order — $\sup_\theta \inf_\delta$ — would mean the statistician knows $\theta$ before choosing $\delta$, which is a different (and easier) problem. The gap between these two quantities, which is nonnegative in general, is related to game-theoretic considerations, but the lectures focus on the $\inf_\delta \sup_\theta$ formulation. Before the main results, here is a fundamental comparison between Bayes risk and maximal risk. [quotetheorem:1907] [citeproof:1907] This short result is the key that unlocks the connection between Bayes and minimax theory. It says the Bayes risk is a **lower bound on the minimax risk**: any Bayesian estimator's worst-case risk cannot fall below its Bayes risk. The insight is that a prior $\pi$ acts as a certificate — a lower bound on how badly things can go. The harder the prior, the better the certificate. 
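The inequality between Bayes risk and maximal risk can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the course's own example: it uses the $\operatorname{Bin}(n, \theta)$ model with squared-error loss, a $\operatorname{Uniform}[0,1]$ prior, two rules chosen for illustration, and a trapezoidal approximation of the prior average on a grid.

```python
from math import comb

import numpy as np


def risk(rule, n, theta):
    """Exact squared-error risk R(rule, theta) in the Bin(n, theta) model."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x) * (rule(x) - theta) ** 2
        for x in range(n + 1)
    )


n = 20                                        # illustrative sample size
grid = np.linspace(0.0, 1.0, 2001)            # theta values, endpoints included
h = grid[1] - grid[0]

rules = {
    "sample proportion": lambda x: x / n,
    "uniform-prior posterior mean": lambda x: (x + 1) / (n + 2),
}

summary = {}
for name, rule in rules.items():
    r = np.array([risk(rule, n, t) for t in grid])
    maximal = float(r.max())                  # approximates sup over theta
    # Trapezoidal rule for the average of the risk under Uniform[0, 1].
    bayes = float(h * (r.sum() - 0.5 * (r[0] + r[-1])))
    summary[name] = (bayes, maximal)
    print(f"{name}: Bayes risk {bayes:.6f} <= maximal risk {maximal:.6f}")
```

For both rules the averaged risk sits below the worst-case risk, as the inequality requires; how large the gap is depends on how far the risk function is from constant.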
## Least Favorable Priors and Finding Minimax Rules The Bayes–minimax inequality raises a natural question: can we find a prior $\pi$ whose Bayes risk actually equals the worst-case risk of the corresponding Bayes rule? If so, the Bayes rule would be minimax, and we could use Bayesian machinery to solve a non-Bayesian problem. This is the program behind the theory of least favorable priors. [definition: Least Favorable Prior] A prior $\lambda$ on $\Theta$ is called **least favorable** if for every prior $\lambda'$, \begin{align*} R_\lambda(\delta_\lambda) \geq R_{\lambda'}(\delta_{\lambda'}), \end{align*} where $\delta_\lambda$ denotes the $\lambda$-Bayes rule. In other words, $\lambda$ produces the highest Bayes risk among all priors: it is the adversarially worst prior a statistician could face. [/definition] The least favorable prior is the one that the statistician most fears: it assigns probability where estimation is hardest. The following proposition is the central result connecting all three concepts (least favorable priors, Bayes rules, and minimax rules). [quotetheorem:1916] [citeproof:1916] The hypothesis of this theorem is the crux: the Bayes risk of $\delta_\lambda$ must coincide with its maximal risk $\sup_\theta R(\delta_\lambda, \theta)$. This holds in particular when $R(\delta_\lambda, \theta)$ is constant as a function of $\theta$, since the Bayes risk is then the common value of the risk at every point, which equals the supremum. Constant risk is a sufficient condition for the hypothesis, not a necessary one. This motivates the following cleaner corollary. [quotetheorem:1924] [citeproof:1924] The strategy that emerges from these results is a practical method for identifying minimax estimators: 1. Parametrise a family of conjugate priors by some hyperparameter. 2. Find the hyperparameter making the Bayes rule's risk constant in $\theta$. 3. Conclude the Bayes rule is minimax and the prior is least favorable. The constant-risk condition is the decisive criterion.
Notice that constant risk alone does not guarantee admissibility (see below), but it does guarantee minimax optimality when combined with being a Bayes rule. ## Examples of Minimax Rules ### The Binomial Model To see the constant-risk strategy in action, consider $X \sim \operatorname{Bin}(n, \theta)$ with $\theta \in [0, 1]$, using quadratic loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$. The MLE is $\hat\theta_{\text{MLE}} = X/n$, with risk \begin{align*} R(\hat\theta_{\text{MLE}}, \theta) = \mathbb{E}_\theta\left[\left(\frac{X}{n} - \theta\right)^2\right] = \frac{\theta(1-\theta)}{n}. \end{align*} This risk is not constant: it equals zero at $\theta = 0$ and $\theta = 1$, and attains its maximum $1/(4n)$ at $\theta = 1/2$. Non-constant risk does not by itself rule out minimaxity, but the MLE is in fact not minimax: its maximal risk $1/(4n)$ exceeds the constant risk of the Bayes rule constructed next. For a $\operatorname{Beta}(a, b)$ prior $\pi_{a,b}$ on $\theta \in [0,1]$, the posterior given $X = k$ is $\operatorname{Beta}(k + a, n - k + b)$, and the posterior mean (which is the unique Bayes rule under quadratic loss) is \begin{align*} \delta_{a,b}(X) = \frac{X + a}{n + a + b}. \end{align*} The risk of this estimator can be computed explicitly: \begin{align*} R(\delta_{a,b}, \theta) = \mathbb{E}_\theta\left[\left(\frac{X+a}{n+a+b} - \theta\right)^2\right]. \end{align*} Writing $\frac{X+a}{n+a+b} - \theta = \frac{X - n\theta}{n+a+b} + \frac{a - (a+b)\theta}{n+a+b}$ and expanding (the cross term vanishes because $\mathbb{E}_\theta[X - n\theta] = 0$), this equals \begin{align*} R(\delta_{a,b}, \theta) = \frac{n\theta(1-\theta)}{(n+a+b)^2} + \frac{(a - (a+b)\theta)^2}{(n+a+b)^2}. \end{align*} For this to be constant in $\theta$, we need the $\theta$-dependent terms to cancel. Requiring the coefficients of $\theta$ and $\theta^2$ in the numerator to vanish forces $a = b = \sqrt{n}/2$, which makes the expression independent of $\theta$.
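The cancellation can be written out as a short worked expansion of the risk formula above; the symbol $N(\theta)$ is introduced here only as shorthand for the numerator. \begin{align*} N(\theta) &= n\theta(1-\theta) + (a - (a+b)\theta)^2 \\ &= a^2 + \bigl(n - 2a(a+b)\bigr)\,\theta + \bigl((a+b)^2 - n\bigr)\,\theta^2. \end{align*} The risk $N(\theta)/(n+a+b)^2$ is constant on $[0,1]$ exactly when both coefficients vanish. The quadratic coefficient gives $a + b = \sqrt{n}$, and the linear coefficient then gives $2a\sqrt{n} = n$, so $a = \sqrt{n}/2$ and hence $b = \sqrt{n}/2$. Substituting back leaves $N(\theta) = a^2 = n/4$.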
The resulting estimator \begin{align*} \delta^*(X) = \frac{X + \sqrt{n}/2}{n + \sqrt{n}} = \frac{X/n + 1/(2\sqrt{n})}{1 + 1/\sqrt{n}} \end{align*} has constant risk $\frac{1}{4(1 + 1/\sqrt{n})^2} \cdot \frac{1}{n} = \frac{1}{4(1 + \sqrt{n})^2}$, which is strictly below $1/(4n)$ and approximately $1/(4n)$ for large $n$. By the corollary, $\delta^*$ is the unique minimax rule, and it differs from the MLE by shrinking observations toward $1/2$. ### The Gaussian Location Model In the model $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, 1)$ with $\theta \in \mathbb{R}$ and quadratic loss, the sample mean $\bar X_n$ has constant risk \begin{align*} R(\bar X_n, \theta) = \mathbb{E}_\theta\left[(\bar X_n - \theta)^2\right] = \frac{1}{n} \end{align*} for every $\theta \in \mathbb{R}$. The theorem requires a Bayes rule, and $\bar X_n$ is not Bayes for any proper prior; instead it arises as the limit of the Bayes rules under normal priors $\mathcal{N}(0, \tau^2)$ as $\tau^2 \to \infty$, corresponding to a diffuse improper prior. The argument that this limiting rule is minimax can be made rigorous, but the constant-risk property alone gives strong evidence: any rule with strictly smaller maximal risk would need risk below $1/n$ at every $\theta$ simultaneously, which is very restrictive. The full proof that $\bar X_n$ is minimax in this model is presented later in the course. ## Admissibility The minimax criterion is global: it compares worst-case risks. Admissibility is a more local and more elementary demand: it asks whether a rule can be improved at some $\theta$ without being made worse at any other. ### The Failure of Inadmissible Estimators To see why admissibility matters, consider an estimator that is catastrophically bad at some specific parameter value while performing adequately elsewhere. No statistician would choose such an estimator if another rule matched its performance everywhere and was strictly better somewhere. The notion of admissibility captures exactly this: an inadmissible rule is one that is dominated, that is, matched or beaten everywhere and strictly beaten somewhere.
[definition: Inadmissible and Admissible Decision Rules] A decision rule $\delta$ is **inadmissible** if there exists a decision rule $\delta'$ such that \begin{align*} R(\delta', \theta) &\leq R(\delta, \theta) \quad \text{for all } \theta \in \Theta, \\ R(\delta', \theta) &< R(\delta, \theta) \quad \text{for some } \theta \in \Theta. \end{align*} In this case, $\delta'$ **dominates** $\delta$. A rule $\delta$ is **admissible** if no such $\delta'$ exists. [/definition] The asymmetry in the definition is important: dominance requires uniformly no worse and strictly better somewhere. This makes inadmissibility a strong failure — the rule is beaten across all parameter values by a single competitor. A rule that is slightly worse than some competitor at one value of $\theta$ but better elsewhere is not inadmissible. [remark: Admissibility Is Necessary But Not Sufficient] Admissibility is a minimal sanity check, not a recommendation. A constant estimator $\delta(X) = c$ for a fixed $c \in \Theta$ is often admissible under quadratic loss: no single rule can uniformly beat it, because for $\theta$ near $c$, the constant estimator's risk is tiny. Yet no one would recommend using a constant estimator in practice — it ignores all the data. Admissibility rules out strictly dominated procedures, but it does not identify the best procedures. [/remark] The remark highlights that admissibility is a floor, not a ceiling. A complete ordering of estimators requires additional criteria (minimax, Bayes, unbiasedness, etc.), and admissibility simply filters out the logically indefensible choices. ### Unique Bayes Rules Are Admissible The strongest general source of admissible rules is the following result. 
[quotetheorem:1932]

The proof of part 1 proceeds by contradiction: if the unique Bayes rule $\delta_\pi$ were dominated by some $\delta'$, then integrating the risk inequality against $\pi$ gives $R_\pi(\delta') \leq R_\pi(\delta_\pi)$, so $\delta'$ would also be a Bayes rule. Since $\delta'$ differs from $\delta_\pi$, this contradicts uniqueness. The proof of part 2 compares maximal risks: since $\delta$ has constant risk and is admissible, no competing rule can have risk strictly below that constant at every $\theta$. Any competitor therefore matches or exceeds the constant somewhere, so its maximal risk is at least that of $\delta$, and the minimax risk equals $\delta$'s constant risk value. The proofs are carried out in the examples sheet.

The two parts of the theorem together with the corollary on constant-risk Bayes rules give a unified picture:

- **Constant-risk unique Bayes rule**: admissible (by part 1), minimax (by the Bayes rule corollary), and uniquely minimax (by the uniqueness clause).
- **Admissible constant-risk rule**: minimax (by part 2), but it need not be a Bayes rule.

The hierarchy is: *unique Bayes* $\Rightarrow$ *admissible*; *admissible + constant risk* $\Rightarrow$ *minimax*. Neither implication reverses in general.

[example: MLE vs Minimax Rule in the Binomial Model] In the $\operatorname{Bin}(n, \theta)$ model with quadratic loss, the MLE $\hat\theta_{\text{MLE}} = X/n$ has risk $\theta(1-\theta)/n$, which vanishes at the boundary $\theta \in \{0, 1\}$. The minimax rule $\delta^*$ with $a = b = \sqrt{n}/2$ has constant risk $\frac{1}{4n(1 + 1/\sqrt{n})^2}$, which is slightly below $1/(4n)$ and approximately equal to it for large $n$. We have
\begin{align*}
R(\hat\theta_{\text{MLE}}, \theta) = \frac{\theta(1-\theta)}{n} \leq \frac{1}{4n}, \qquad R(\delta^*, \theta) = \frac{1}{4n(1 + 1/\sqrt{n})^2} < \frac{1}{4n}
\end{align*}
for all $\theta \in [0,1]$, with equality in the first bound only at $\theta = 1/2$. Neither rule dominates the other: the MLE is strictly better than $\delta^*$ near $\theta = 0$ and $\theta = 1$, while $\delta^*$ beats the MLE near $\theta = 1/2$. Both are admissible, but they reflect different risk priorities.
The MLE exploits the fact that estimation near the boundary is easy; the minimax rule sacrifices that advantage to protect against the hardest case $\theta = 1/2$. [/example] This example exposes the philosophical disagreement between minimax and non-minimax optimality: choosing the minimax rule means paying a premium in easy cases to insure against the hard ones. Whether this trade-off is appropriate depends on the application. In contexts where the statistician genuinely fears the adversarial worst case — multiple use scenarios, unknown prior information, safety-critical decisions — the minimax rule is the principled choice. The striking James-Stein phenomenon — where shrinkage estimators outperform the MLE in high dimensions — illustrates that admissibility properties depend crucially on the problem structure. We examine this phenomenon in the familiar Gaussian model. # 16. Admissibility in the Gaussian Model The question at the heart of this chapter is whether naive, coordinate-by-coordinate estimation is genuinely optimal in high dimensions — or whether there is something fundamentally better. The previous lecture established the general strategy: showing that an estimator has constant risk and is admissible is sufficient for minimax optimality. Chapter 16 applies this strategy in the Gaussian model, first establishing admissibility of the sample mean in one dimension via a careful bias analysis, and then confronting a startling failure of this result when the parameter lives in $\mathbb{R}^p$ for $p \geq 3$. The James–Stein estimator exhibits that independent estimation across coordinates can be beaten, a phenomenon that has no analogue in dimensions one or two. ## Admissibility of the Sample Mean in One Dimension The Gaussian model with known variance is the natural starting point because its Fisher information is constant, its risk computations are explicit, and the MLE has the simplest possible form. 
[quotetheorem:1940] Before giving the proof, it is worth pausing to understand why this is not immediate. One might hope to invoke a general theorem, for instance a result that all proper Bayes rules are admissible. But $\bar{X}_n$ is not the Bayes rule for any proper prior: the prior that corresponds to it in the limit is $N(0, \nu^2)$ as $\nu \to \infty$, which is improper. The MLE sits at the boundary of the Bayesian framework, and its admissibility must be established by direct argument. [proof] Without loss of generality take $\sigma^2 = 1$. The general case follows the same argument with $1/n$ replaced by $\sigma^2/n$. The risk of the MLE is constant: $R(\hat{\theta}_{\mathrm{MLE}}, \theta) = \operatorname{Var}_\theta(\bar{X}_n) = 1/n$ for all $\theta \in \mathbb{R}$. For any decision rule $\delta$, the bias-variance decomposition gives \begin{align*} R(\delta, \theta) = B(\theta)^2 + \operatorname{Var}_\theta(\delta(X)), \end{align*} where $B(\theta) = \mathbb{E}_\theta[\delta(X)] - \theta$ is the bias. The Cramér–Rao lower bound applies to unbiased estimators, but the proof of the bound works for any sufficiently regular estimator and yields \begin{align*} \operatorname{Var}_\theta(\delta) \geq \frac{(1 + B'(\theta))^2}{n}, \end{align*} since the Fisher information of the $N(\theta,1)$ model is $I(\theta) = 1$ and $\mathbb{E}_\theta[\delta(X)] = \theta + B(\theta)$ differentiates to give derivative $1 + B'(\theta)$. If $\delta$ dominates $\bar{X}_n$, then $R(\delta, \theta) \leq 1/n$ for all $\theta \in \mathbb{R}$, which combines with the above to give \begin{align*} B(\theta)^2 + \frac{(1 + B'(\theta))^2}{n} \leq \frac{1}{n} \qquad \text{for all } \theta. \end{align*} This is the key inequality $(\dagger)$. It immediately bounds $B(\theta)$ and forces $B'(\theta) \leq 0$, so $B$ is nonincreasing. If $B'$ were bounded away from $0$ for large $|\theta|$, then $B$ would be unbounded, contradicting its boundedness from $(\dagger)$. 
Hence there exist sequences $\theta_n \to -\infty$ and $\theta_n' \to +\infty$ along which $B'(\theta_n) \to 0$ and $B'(\theta_n') \to 0$. Substituting into $(\dagger)$ forces $B(\theta_n)^2 \to 0$ along both sequences. Since $B$ is nonincreasing and vanishes along sequences going to $\pm\infty$, it must be identically zero. With $B \equiv 0$, the Cramér–Rao bound applies in its standard form: $\operatorname{Var}_\theta(\delta) \geq 1/n$, so $R(\delta, \theta) \geq 1/n$. The only way to achieve $R(\delta, \theta) \leq 1/n$ for all $\theta$ is to have equality everywhere. Therefore $\delta$ does not strictly dominate $\bar{X}_n$, which is admissible. Constant risk then implies minimaxity by the constant-risk admissibility criterion from the previous chapter. [/proof] [remark: Relation to Bayes Rules] The MLE $\bar{X}_n$ is not the Bayes rule for any proper prior. It can be viewed, however, as the limit as $\nu \to \infty$ of the Bayes rules $\delta_{\nu^2}$ corresponding to the prior $N(0, \nu^2)$. This is consistent with a general principle: every minimax rule is the limit of Bayes rules, so the class of proper-Bayes rules is "dense" in the class of minimax rules in an appropriate sense. [/remark] The proof's key insight is that the inequality $(\dagger)$ simultaneously constrains the size of $B$ and the sign of $B'$. These two constraints together force $B$ to be zero, which then locks in the Cramér–Rao bound in its classical form. The argument is a beautiful instance of using the structure of the Gaussian model — the constancy of the Fisher information and the differentiability of the bias — to squeeze out a global conclusion from local constraints. ## The James–Stein Phenomenon The result above holds without modification in dimension $p = 2$. In both the one-dimensional and two-dimensional cases, the MLE is admissible. 
The following section demonstrates that this is false for $p \geq 3$, which is one of the most counterintuitive results in mathematical statistics. The intuitive argument for optimality of coordinate-by-coordinate estimation runs as follows: the coordinates $X_i \sim N(\theta_i, 1)$ are independent, and under squared-error loss the best estimator for $\theta_i$ given $X_i$ alone is $X_i$ itself. If combining all the coordinates could improve estimation of a single coordinate, it would require the other $X_j$ for $j \neq i$ to carry information about $\theta_i$. But they are independent of $X_i$ given $\theta$, so they should be irrelevant. This argument, while compelling, is wrong for $p \geq 3$. The resolution involves a subtle difference between pointwise and simultaneous optimality. The MLE is optimal for each coordinate individually, but this does not prevent a different estimator from achieving a strictly smaller total squared error $\mathbb{E}_\theta[\|\delta(X) - \theta\|^2]$ across all coordinates simultaneously. The James–Stein estimator achieves exactly this by shrinking the observation vector toward the origin — borrowing strength across coordinates in a way that reduces risk everywhere. [definition: James–Stein Estimator] For a vector $X \in \mathbb{R}^p$, the James–Stein estimator is defined as \begin{align*} \delta^{\mathrm{JS}}(X) = \left(1 - \frac{p-2}{\|X\|^2}\right) X. \end{align*} [/definition] The estimator shrinks the observed vector toward the origin by the factor $1 - (p-2)/\|X\|^2$. When $\|X\|$ is large, the shrinkage factor is close to $1$ and $\delta^{\mathrm{JS}}(X) \approx X$. When $\|X\|$ is close to zero, the shrinkage is dramatic. The factor $p - 2$ is precisely calibrated — for $p \leq 2$, the factor would be nonpositive and the estimator would not make sense, while for $p \geq 3$ the shrinkage is sufficient to reduce risk everywhere. 
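The definition can be made concrete with a short simulation. The sketch below implements the shrinkage formula and estimates the total squared-error risk by Monte Carlo. The names `james_stein` and `mc_risk`, the dimension $p = 5$, the choice $\theta = 0$, and the replication count are illustrative assumptions. The output is a noisy estimate, not an exact risk.

```python
import random


def james_stein(x):
    """James-Stein estimate (1 - (p - 2) / ||x||^2) * x for a list x of length p >= 3."""
    p = len(x)
    norm_sq = sum(v * v for v in x)  # zero only on an event of probability zero
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]


def mc_risk(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E_theta ||estimator(X) - theta||^2 for X ~ N(theta, I_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]
        est = estimator(x)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / n_rep


theta = [0.0] * 5  # illustrative parameter: the origin in dimension p = 5
risk_mle = mc_risk(lambda x: x, theta)  # coordinate-by-coordinate estimate X
risk_js = mc_risk(james_stein, theta)

print(f"MLE risk ~ {risk_mle:.2f}, James-Stein risk ~ {risk_js:.2f}")
```

Both estimators are evaluated on the same simulated draws because the seed is shared, which reduces the noise in the comparison.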
The risk of the MLE in model $X \sim N(\theta, I_p)$ with $\theta \in \mathbb{R}^p$ is \begin{align*} R(\hat{\theta}_{\mathrm{MLE}}, \theta) = \mathbb{E}_\theta[\|X - \theta\|^2] = \sum_{j=1}^p \mathbb{E}_\theta[(X_j - \theta_j)^2] = p, \end{align*} since each coordinate contributes variance $1$. To show that $\delta^{\mathrm{JS}}$ dominates the MLE, one must compute $R(\delta^{\mathrm{JS}}, \theta) = \mathbb{E}_\theta[\|\delta^{\mathrm{JS}}(X) - \theta\|^2]$ and verify that it is strictly less than $p$ for all $\theta$. This computation relies on a fundamental identity known as Stein's lemma. ## Stein's Lemma and Gaussian Integration by Parts The key technical tool for computing the risk of the James–Stein estimator is a distributional identity that may be thought of as integration by parts for Gaussian expectations. [quotetheorem:1951] The hypotheses here matter. The function $g$ must be bounded so that the boundary terms in the integration by parts vanish at $\pm \infty$; the condition $\mathbb{E}[|g'(X)|] < \infty$ ensures that the integral on the right is well defined. Without these conditions, the identity can fail. For instance, if $g$ grows too rapidly, the boundary term $[g(x)\phi(x)]_{-\infty}^{+\infty}$ — where $\phi$ is the Gaussian density — need not vanish even though $\phi$ decays rapidly, because $g$ may grow faster than $\phi$ decays. [citeproof:1951] The lemma has an elegant analytical interpretation. The Gaussian density $\phi$ satisfies the ODE $\phi'(x) + x\phi(x) = 0$, which can be written as $\mathcal{A}(\phi) = 0$ for the operator $\mathcal{A}(p) = p' + xp$. Integration by parts then gives, for any regular density $p$ and bounded differentiable $g$: \begin{align*} \langle g, \mathcal{A}(p) \rangle = \langle \mathcal{A}^*(g), p \rangle, \end{align*} where $\mathcal{A}^*(g)(x) = -g'(x) + xg(x)$. The fact that $\mathcal{A}(\phi) = 0$ implies $\langle \mathcal{A}^*(g), \phi \rangle = 0$ for all valid $g$, which is precisely Stein's lemma. 
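The identity $\mathbb{E}[(X - \theta) g(X)] = \mathbb{E}[g'(X)]$ for $X \sim N(\theta, 1)$ can be checked numerically for a specific bounded smooth function. The sketch below uses $g = \tanh$, whose derivative is $1 - \tanh^2$. The function name `stein_sides`, the value $\theta = 0.7$, and the sample size are illustrative assumptions, and agreement holds only up to Monte Carlo error.

```python
import math
import random


def stein_sides(theta, n_rep=200000, seed=1):
    """Monte Carlo estimates of E[(X - theta) g(X)] and E[g'(X)] for g = tanh."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_rep):
        x = rng.gauss(theta, 1.0)
        g = math.tanh(x)
        lhs += (x - theta) * g  # left side of the identity
        rhs += 1.0 - g * g      # derivative of tanh at x
    return lhs / n_rep, rhs / n_rep


lhs, rhs = stein_sides(theta=0.7)
print(f"E[(X - theta) g(X)] ~ {lhs:.4f}, E[g'(X)] ~ {rhs:.4f}")
```

Because $\tanh$ is bounded with a bounded derivative, the boundary terms in the integration by parts vanish and both expectations are finite, so the hypotheses discussed above are satisfied for this choice.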
[remark: Stein's Lemma as a Characterization of Gaussians] The converse of Stein's lemma is also true: if a distribution satisfies $\mathbb{E}[(X - \theta)g(X)] = \mathbb{E}[g'(X)]$ for all bounded smooth $g$, then $X \sim N(\theta, 1)$. This characterization has an important corollary: if the identity holds only approximately, in the sense that the two sides differ by at most $\varepsilon$, then the distribution of $X$ is close to Gaussian in an appropriate sense. This is the basis for Stein's method of normal approximation, a powerful tool for proving central limit theorems for dependent random variables where direct calculations are intractable. [/remark] The multivariate version of Stein's lemma states that if $X \sim N(\theta, I_p)$ and $g : \mathbb{R}^p \to \mathbb{R}^p$ is a differentiable map satisfying suitable integrability conditions, then \begin{align*} \mathbb{E}[(X - \theta)^\top g(X)] = \mathbb{E}[\operatorname{div}(g)(X)], \end{align*} where $\operatorname{div}(g) = \sum_{j=1}^p \partial_{x_j} g_j$ is the divergence. Applied to the function $g(x) = -(p-2)x/\|x\|^2$, which appears naturally in the James–Stein estimator, this identity allows an explicit computation of the risk reduction — demonstrating that $R(\delta^{\mathrm{JS}}, \theta) = p - (p-2)^2\, \mathbb{E}_\theta[1/\|X\|^2] < p$ for all $\theta \in \mathbb{R}^p$ when $p \geq 3$. The fact that this phenomenon requires $p \geq 3$ rather than $p \geq 2$ is not a coincidence. In dimension $p = 2$, the function $1/\|x\|^2$ is not integrable near the origin under the Gaussian measure, and the shrinkage factor $(p-2)/\|x\|^2$ vanishes. The threshold $p = 3$ is the critical dimension where the Gaussian measure assigns enough mass away from the origin for the shrinkage to provide a uniform risk improvement. 
This connects to deep facts about potential theory and the recurrence/transience of Brownian motion: Brownian motion is recurrent in dimensions one and two (it returns to any neighborhood of the origin infinitely often) and transient in dimension three and above. The same dimensional threshold governs both phenomena. Understanding the James-Stein estimator's surprising superiority illuminates why classical estimation breaks down in high dimensions and motivates new approaches to multivariate problems. # 17. Risk of the James-Stein Estimator The central question of this chapter is whether the James–Stein estimator actually delivers on its promise. In the preceding lectures we defined $\delta^{JS}$ and showed, via Stein's unbiased risk estimate, that its risk is strictly below $p$, the risk of the MLE $X$. Here we make that bound explicit and precise: we compute $R(\delta^{JS}, \theta)$ directly, identify exactly how much it saves over $X$, and then examine the limits of the improvement. Along the way we encounter a beautiful application of Stein's lemma to a multivariate functional, and we begin the transition toward classification problems that will occupy the rest of Part II. ## The Risk Formula for the James–Stein Estimator Recall the setup. We observe $X \sim N(\theta, I_p)$ for an unknown $\theta \in \mathbb{R}^p$, and we want to estimate $\theta$ under squared-error loss. The MLE is $\delta^{MLE}(X) = X$, with risk \begin{align*} R(X, \theta) = \mathbb{E}_\theta[\|X - \theta\|^2] = p, \end{align*} uniformly in $\theta$. The James–Stein estimator shrinks $X$ toward the origin: \begin{align*} \delta^{JS}(X) = \left(1 - \frac{p-2}{\|X\|^2}\right) X. \end{align*} The proposition below gives the exact risk. [quotetheorem:1961] [citeproof:1961] The formula $R(\delta^{JS}, \theta) = p - (p-2)^2 \mathbb{E}_\theta[1/\|X\|^2]$ has a transparent structure: the risk of $X$ minus a strictly positive correction. 
The correction vanishes only if $\|X\|$ is infinite, which never happens; thus the strict inequality is genuine for every $\theta$. ## Applying Stein's Lemma in the Multivariate Setting The hardest part of the proof is the evaluation of the cross term, so it is worth pausing to see what exactly the lemma achieves here. [explanation: How Stein's Lemma Enters the Risk Calculation] Stein's lemma states that for $Z \sim N(\mu, 1)$ and a function $h$ with $\mathbb{E}|h'(Z)| < \infty$, \begin{align*} \mathbb{E}[(Z - \mu)h(Z)] = \mathbb{E}[h'(Z)]. \end{align*} In the multivariate risk calculation, we encounter the term $\mathbb{E}_\theta[X^\top(X-\theta)/\|X\|^2]$, which can be written as \begin{align*} \sum_{j=1}^p \mathbb{E}_\theta\!\left[\frac{X_j(X_j - \theta_j)}{\|X\|^2}\right]. \end{align*} The difficulty is that the denominator $\|X\|^2 = X_j^2 + \sum_{i \neq j} X_i^2$ mixes all coordinates. This is handled by conditioning on $X_{(-j)}$, the vector of all coordinates except $j$. Given $X_{(-j)}$, the quantity $\sum_{i \neq j} X_i^2$ becomes a constant, so the $j$-th summand is $\mathbb{E}[(X_j - \theta_j) g_j(X_j)]$ with $g_j(x) = x/(x^2 + c_j)$ where $c_j = \sum_{i \neq j} X_i^2$. Computing the derivative: \begin{align*} g_j'(x) = \frac{x^2 + c_j - 2x^2}{(x^2 + c_j)^2} = \frac{c_j - x^2}{(x^2 + c_j)^2}. \end{align*} Stein's lemma now gives $\mathbb{E}[(X_j - \theta_j)g_j(X_j) \mid X_{(-j)}] = \mathbb{E}[g_j'(X_j) \mid X_{(-j)}]$. Taking outer expectations and summing over $j$: \begin{align*} \sum_{j=1}^p \mathbb{E}_\theta\!\left[\frac{X_j(X_j - \theta_j)}{\|X\|^2}\right] &= \sum_{j=1}^p \mathbb{E}_\theta\!\left[\frac{\sum_{i \neq j} X_i^2 - X_j^2}{\|X\|^4}\right] \\ &= \sum_{j=1}^p \mathbb{E}_\theta\!\left[\frac{1}{\|X\|^2} - \frac{2X_j^2}{\|X\|^4}\right] \\ &= p \,\mathbb{E}_\theta\!\left[\frac{1}{\|X\|^2}\right] - 2\,\mathbb{E}_\theta\!\left[\frac{\|X\|^2}{\|X\|^4}\right] \\ &= (p-2)\,\mathbb{E}_\theta\!\left[\frac{1}{\|X\|^2}\right]. 
\end{align*}
This is the crucial cancellation that makes the calculation work: summing $X_j^2$ over $j$ reproduces $\|X\|^2$, which leaves a multiple of $\mathbb{E}_\theta[1/\|X\|^2]$. [/explanation]

The requirement $p \geq 3$ appears here in a subtle way: it is needed for $\mathbb{E}_\theta[1/\|X\|^2]$ to be finite (when $\theta = 0$, integrability of $1/\|x\|^2$ against the Gaussian density near the origin requires dimension at least 3). Without this, the shrinkage factor $p-2$ would not lead to a valid estimator with finite risk.

[remark: Why Dimension Matters] For $p = 1$ or $p = 2$, the James–Stein estimator offers no improvement (for $p = 2$ the factor $p - 2$ is zero, and in both cases the risk formula breaks down because $\mathbb{E}[1/\|X\|^2] = \infty$ when $\theta = 0$). The estimator $X$ is in fact admissible for $p \leq 2$, meaning no other estimator dominates it. It is only in dimension $p \geq 3$ that the Stein phenomenon occurs: $X$ becomes inadmissible. This dimension threshold is a genuine phase transition in the estimation problem, not an artifact of the calculation. [/remark]

## Behavior as $\|\theta\| \to \infty$ and the Maximal Risk

The formula $R(\delta^{JS}, \theta) = p - (p-2)^2 \mathbb{E}_\theta[1/\|X\|^2]$ not only shows that $\delta^{JS}$ beats $X$ at every point, but also reveals how the improvement fades as $\theta$ moves far from the origin. When $\|\theta\| \to \infty$, the distribution of $X$ is centered far from the origin, so $\|X\|^2$ is of the same order as $\|\theta\|^2$ with high probability. Formally, $\mathbb{E}_\theta[1/\|X\|^2] \to 0$ as $\|\theta\| \to \infty$. The lower bound from the annulus argument, valid for any fixed constants $0 < c_1 < c_2$,
\begin{align*}
\mathbb{E}_\theta\!\left[\frac{1}{\|X\|^2}\right] = \int_{\mathbb{R}^p} \frac{1}{\|x\|^2} \varphi(x - \theta)\,dx \geq \frac{1}{c_2^2} \mathbb{P}_\theta(\|X\| \in [c_1, c_2]) > 0,
\end{align*}
shows that the correction is always positive, but it becomes negligible for large $\|\theta\|$.
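The fading of the correction can be illustrated numerically by estimating $\mathbb{E}_\theta[1/\|X\|^2]$ by Monte Carlo and substituting it into the risk formula. In the sketch below, the function name `mean_inverse_sq_norm`, the dimension $p = 5$, the radii, and the choice to place $\theta$ along the first coordinate axis are illustrative assumptions. The printed values are simulation estimates only.

```python
import random


def mean_inverse_sq_norm(theta, n_rep=20000, seed=2):
    """Monte Carlo estimate of E_theta[1 / ||X||^2] for X ~ N(theta, I_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        total += 1.0 / sum(rng.gauss(t, 1.0) ** 2 for t in theta)
    return total / n_rep


p = 5
radii = (0.0, 2.0, 5.0, 20.0)  # illustrative values of ||theta||
risks = []
for r in radii:
    theta = [r] + [0.0] * (p - 1)  # theta on the first axis with norm r
    correction = (p - 2) ** 2 * mean_inverse_sq_norm(theta)
    risks.append(p - correction)   # plug into R = p - (p-2)^2 E[1/||X||^2]

for r, risk in zip(radii, risks):
    print(f"||theta|| = {r:5.1f}   estimated risk ~ {risk:.3f}")
```

By rotational symmetry of $N(\theta, I_p)$, the risk depends on $\theta$ only through $\|\theta\|$, so restricting to one axis loses no generality.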
[remark: Identical Maximal Risk Despite Uniform Dominance] Although $R(\delta^{JS}, \theta) < R(X, \theta) = p$ for every $\theta$, both estimators have the same maximal risk: $\sup_\theta R(\delta^{JS}, \theta) = p$. The supremum is not achieved but is approached as $\|\theta\| \to \infty$. This means that in a minimax sense, shrinkage toward the origin buys nothing: the worst-case performance of $\delta^{JS}$ matches that of the MLE. The gain from James–Stein is pointwise, concentrated near the origin, and dissipates at large parameter values. [/remark] This observation has a practical interpretation. If the practitioner has strong prior knowledge that $\theta$ is near the origin, the James–Stein estimator can substantially reduce risk. If $\theta$ could be anywhere in $\mathbb{R}^p$ with equal plausibility, the minimax criterion favors neither estimator over the other. ## Inadmissibility of $\delta^{JS}$ and the Positive-Part Estimator The James–Stein estimator $\delta^{JS}$ dominates $X$, but the story does not end there. The shrinkage factor $1 - (p-2)/\|X\|^2$ can be negative when $\|X\|^2 < p - 2$, which would mean the estimator overshoots the origin and points in the direction opposite to $X$. This is geometrically absurd: if we are shrinking toward the origin, we should never go past it. [definition: Positive-Part James-Stein Estimator] The positive-part James–Stein estimator is defined by \begin{align*} \delta^{JS+}(X) = \left(1 - \frac{p-2}{\|X\|^2}\right)_{\!+} X, \end{align*} where $(a)_+ = \max\{a, 0\}$ denotes the positive part. Equivalently, \begin{align*} \delta^{JS+}(X) = \begin{cases} \left(1 - \dfrac{p-2}{\|X\|^2}\right) X & \text{if } \|X\|^2 \geq p-2, \\ 0 & \text{if } \|X\|^2 < p-2. \end{cases} \end{align*} [/definition] The positive-part estimator dominates $\delta^{JS}$: whenever the shrinkage factor would be negative, replacing it with zero can only reduce the squared error. 
Therefore $R(\delta^{JS+}, \theta) \leq R(\delta^{JS}, \theta) < p$ for all $\theta$, and the inequality $R(\delta^{JS+}, \theta) < R(\delta^{JS}, \theta)$ is strict when $\theta$ is near the origin (where $\|X\|^2 < p-2$ has positive probability). [remark: Admissibility Requires Smoothness] Even $\delta^{JS+}$ is not admissible. It is a general principle, proved via completeness of the Gaussian family, that every admissible estimator in this model must be a smooth function of the observation. The estimator $\delta^{JS+}$ has a kink at $\|X\|^2 = p - 2$, so it cannot be admissible. The question of which estimators are truly admissible connects to the theory of proper Bayes estimators, which are smooth by construction. [/remark] ## Practical Interpretation and the Role of the MLE The theoretical dominance of $\delta^{JS}$ over $X$ is striking, but several practical considerations moderate its significance. First, the James–Stein estimator shrinks toward an arbitrary fixed point — here the origin. This choice makes sense if the origin is a natural center (for example, if we expect $\theta$ to be near zero on physical grounds), but is otherwise arbitrary. Shrinking toward the wrong point can waste much of the gain. Second, and more concretely: although $\delta^{JS}$ dominates $X$ for estimation, the distribution of $\delta^{JS}(X)$ is complicated and depends on $\theta$ in a nonlinear way. The MLE $X$ has a simple, explicit distribution — $X \sim N(\theta, I_p)$ — which makes it straightforward to construct tests and confidence regions. For $\delta^{JS}$, constructing valid confidence sets is substantially harder, and the practical gain from lower point-estimation risk may not justify the added complexity. [example: Shrinkage Gain in a Simple Case] Take $p = 5$ and $\theta = 0$. Then $X \sim N(0, I_5)$ and $\|X\|^2 \sim \chi^2_5$, so $\mathbb{E}[1/\|X\|^2] = 1/(p-2) = 1/3$ (this follows from the standard result $\mathbb{E}[1/\chi^2_\nu] = 1/(\nu - 2)$ for $\nu > 2$). 
The risk of $\delta^{JS}$ at $\theta = 0$ is
\begin{align*}
R(\delta^{JS}, 0) = p - (p-2)^2 \cdot \frac{1}{p-2} = p - (p-2) = 2.
\end{align*}
The MLE has risk $p = 5$. At the origin, James–Stein reduces risk by $60\%$. As $\|\theta\|$ grows, the risk climbs back toward $5$: the data then indicate that $\theta$ is not near zero, the shrinkage factor approaches $1$, and the gain over the MLE becomes marginal, although by the dominance result it never turns into a loss. [/example]

The risk formula $R(\delta^{JS}, \theta) = p - (p-2)^2 \mathbb{E}_\theta[1/\|X\|^2]$ thus encapsulates both the promise and the limitation of shrinkage: the correction is always positive, it is largest near the origin (where it can be substantial), and it vanishes at infinity. This concludes the analysis of the James–Stein estimator's risk and its position in the landscape of admissible estimation.

With foundations in point estimation well established, we now consider classification: the problem of predicting discrete outcomes rather than continuous parameters.

# 18. Classification Problems

## The Two Views of Classification

Classification problems sit at the boundary between statistics and statistical learning theory. Given two candidate distributions $f_0$ and $f_1$ on a feature space $\mathcal{X}$, and an observation $X$, the question is: under which distribution was $X$ generated? Equivalently, the label $Y \in \{0, 1\}$ indicates which distribution applies, and the classifier $\delta: \mathcal{X} \to \{0, 1\}$ is a decision rule that attempts to recover $Y$ from $X$.

The previous chapter developed the problem from one direction: draw $Y$ from a prior $\pi$, then draw $X$ from $f_Y$. This chapter develops the other direction, viewing $X$ as having a marginal distribution and $Y$ as drawn conditionally, and shows the two are equivalent.
This equivalence is not merely formal: the posterior viewpoint reveals why the Bayes classifier takes the specific threshold form it does, and it underpins the empirical risk minimization approach used in statistical learning.

## Joint Distributions and the Posterior

Fix a prior $\pi = (\pi_0, \pi_1)$ with $\pi_1 \in (0, 1)$ and $\pi_0 = 1 - \pi_1$. The joint distribution $Q$ on $\mathcal{X} \times \{0, 1\}$ is defined by
\begin{align*}
Q(x, y) = f(x, y)\pi(y),
\end{align*}
where $f(x, 1) = f_1(x)$, $f(x, 0) = f_0(x)$, and $\pi(y) = \pi_y$. There are two equivalent ways to sample from $Q$.

**The mixture view** draws $Y$ from $\pi$ and then $X$ from $f_Y$ conditionally. The marginal distribution of $X$ is the mixture
\begin{align*}
P_X(x) = \sum_{y \in \{0,1\}} Q(x, y) = \pi_0 f_0(x) + \pi_1 f_1(x).
\end{align*}

**The posterior view** draws $X$ from $P_X$ directly, and then draws $Y$ from the posterior $\Pi(\cdot \mid X = x)$. By Bayes' theorem the posterior probabilities are
\begin{align*}
\Pi(1 \mid X = x) &= \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)}, \\
\Pi(0 \mid X = x) &= \frac{\pi_0 f_0(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)}.
\end{align*}
A standard shorthand sets $\eta(x) = \Pi(1 \mid X = x)$, so the posterior is the pair $(\eta(x), 1 - \eta(x))$. The function $\eta: \mathcal{X} \to [0,1]$ captures the local probability that the label is $1$ given the feature $x$.

[remark: Interpretation of eta] The function $\eta(x)$ is the posterior probability of class $1$ at the point $x$. When $\eta(x) > 1/2$, the observation is more likely to have come from $f_1$ than from $f_0$, given the prior $\pi$. The threshold $1/2$ on $\eta$ is the same for every prior, because the prior is already built into $\eta$. What shifts when $\pi_0 \neq \pi_1$ is the equivalent threshold on the likelihood ratio $f_1(x)/f_0(x)$, which is $\pi_0/\pi_1$, as encoded in the Bayes classifier below. [/remark]

## Classification Error as a Joint Probability

The posterior view clarifies the meaning of the classification error.
Recall that for a classification region $R \subseteq \mathcal{X}$, the classifier $\delta_R$ assigns label $1$ to points in $R$ and label $0$ to points in $R^c$.

[quotetheorem:1966] [citeproof:1966]

The posterior view gives a second expression for the same quantity:
\begin{align*}
Q(\delta(X) \neq Y) = \mathbb{E}_Q[\mathbf{1}_{\{\delta(X) \neq Y\}}] = \int_{\mathcal{X}} \Pi(\delta^c(x) \mid x)\, dP_X(x),
\end{align*}
where $\delta^c = 1 - \delta$ is the complement decision. This form makes the minimization strategy apparent, because the integrand can be minimized pointwise in $x$. Choosing $\delta(x) = 1$ contributes error $\Pi(0 \mid x)$, while choosing $\delta(x) = 0$ contributes error $\Pi(1 \mid x)$. To minimize $R_\pi(\delta)$ one should therefore set $\delta(x) = 1$ when $\Pi(1 \mid x) \geq \Pi(0 \mid x)$, that is, when $\eta(x) \geq 1/2$. Equivalently, one should accept $x$ into the critical region $R$ whenever $\pi_1 f_1(x) / (\pi_0 f_0(x))$ is at least $1$.

## The Bayes Classifier

The argument above motivates the definition of the optimal rule.

[definition: Bayes Classifier] For a prior $\pi = (\pi_0, \pi_1)$ with $\pi_1 \in (0, 1)$, the **Bayes classifier** is $\delta_\pi = \delta_R$ where
\begin{align*}
\delta_R(x) = \begin{cases} 1 & \text{if } x \in R, \\ 0 & \text{if } x \notin R, \end{cases}
\end{align*}
and the critical region is
\begin{align*}
R = \left\{ x \in \mathcal{X} : \frac{\pi_1 f_1(x)}{(1 - \pi_1) f_0(x)} \geq 1 \right\}.
\end{align*}
[/definition]

The ratio $\pi_1 f_1(x) / ((1 - \pi_1) f_0(x))$ is the posterior odds of class $1$ over class $0$ at the point $x$. Accepting $x$ into $R$ when this ratio is at least $1$ is exactly the condition $\eta(x) \geq 1/2$.

[quotetheorem:1970] [citeproof:1970]

Why does the hypothesis that the boundary has measure zero matter? If the boundary has positive probability, one can reassign those points arbitrarily between $R$ and $R^c$ without changing the risk, so the minimizer is not unique. This is not merely a theoretical pathology: it can occur for symmetric distributions or when the two densities cross along a set of positive measure.
In such cases there may be multiple Bayes classifiers with identical risk, all of them optimal. [remark: Admissibility and Minimax Connection] Because a unique Bayes rule is admissible (it cannot be uniformly dominated), the Bayes classifier is admissible when the boundary condition holds. This gives a route to finding minimax classifiers: for any $q \in (0, 1)$, let $\delta_q$ be the Bayes classifier for prior $(q, 1-q)$ and let $P(R_q^c \mid 1)$ and $P(R_q \mid 0)$ denote the two error probabilities. As $q$ varies, these probabilities trace out a curve. If one can find $q^*$ such that $P(R_{q^*}^c \mid 1) = P(R_{q^*} \mid 0)$, then $\delta_{q^*}$ has constant risk across both states of nature. A constant-risk admissible rule is minimax. [/remark] ## Linear Discriminant Analysis The Bayes classifier becomes explicit when $f_0$ and $f_1$ are Gaussian with the same covariance. [example: Gaussian Bayes Classifier and Linear Discriminant Analysis] Suppose $X \sim f_i = N(\mu_i, \Sigma)$ for $i = 0, 1$, where $\mu_0, \mu_1 \in \mathbb{R}^p$ and $\Sigma$ is a shared $p \times p$ positive definite covariance matrix. The likelihood ratio is \begin{align*} \frac{f_1(x)}{f_0(x)} = \exp\!\left(-\tfrac{1}{2}(x - \mu_1)^\top \Sigma^{-1}(x - \mu_1) + \tfrac{1}{2}(x - \mu_0)^\top \Sigma^{-1}(x - \mu_0)\right). \end{align*} Expanding the quadratic forms, the $x^\top \Sigma^{-1} x$ terms cancel because $\Sigma$ is the same for both distributions. What remains is \begin{align*} \frac{f_1(x)}{f_0(x)} = \exp\!\left(x^\top \Sigma^{-1}(\mu_1 - \mu_0) - \tfrac{1}{2}(\mu_1 + \mu_0)^\top \Sigma^{-1}(\mu_1 - \mu_0)\right). \end{align*} The Bayes classifier therefore accepts $x \in R$ when \begin{align*} x^\top \Sigma^{-1}(\mu_1 - \mu_0) \geq \tfrac{1}{2}(\mu_1 + \mu_0)^\top \Sigma^{-1}(\mu_1 - \mu_0) + \log\!\left(\frac{1 - \pi_1}{\pi_1}\right). \end{align*} The left side is linear in $x$. 
Setting $D(x) = x^\top \Sigma^{-1}(\mu_1 - \mu_0)$, both the Bayes classifier and any minimax classifier depend on $X$ only through the **discriminant function** $D(X)$. This method is called **linear discriminant analysis** (LDA). The decision boundary is the hyperplane $\{x : D(x) = c\}$ for a threshold $c$ determined by the prior. Equal priors $\pi_0 = \pi_1 = 1/2$ give $c = \tfrac{1}{2}(\mu_1 + \mu_0)^\top \Sigma^{-1}(\mu_1 - \mu_0)$, the midpoint of the two class means in the Mahalanobis metric. [/example] The critical structural fact enabling LDA is that the quadratic terms in the log-likelihood ratio cancel when the covariance matrices coincide. If $f_0 = N(\mu_0, \Sigma_0)$ and $f_1 = N(\mu_1, \Sigma_1)$ with $\Sigma_0 \neq \Sigma_1$, the $x^\top \Sigma_i^{-1} x$ terms do not cancel, and the boundary becomes a quadratic surface — this is **quadratic discriminant analysis**. LDA is thus specifically tied to the equal-covariance assumption.  ## Empirical Risk Minimization In practice, the distributions $f_0$, $f_1$, and the prior $\pi$ are unknown, so $\eta$ cannot be computed and the Bayes classifier cannot be implemented directly. Statistical learning theory addresses this by estimating $\eta$ from labeled data. Suppose we observe $n$ i.i.d. pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ drawn from $Q$. Since $Q(\delta(X) \neq Y)$ cannot be minimized directly, one minimizes the **empirical risk** — the proportion of misclassified training observations: \begin{align*} \hat{R}_n(\delta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{\delta(X_i) \neq Y_i\}}. \end{align*} In practice, the decision rule is parameterized: one considers a family $\{h_\beta : \beta \in \mathcal{B}\}$ of score functions $h_\beta: \mathcal{X} \to [0, 1]$, and defines \begin{align*} \delta_\beta(x) = \mathbf{1}_{\{h_\beta(x) \geq 1/2\}}. \end{align*} The parameter $\beta$ is chosen to minimize the empirical risk. 
Noting that, apart from ties at $h_\beta(X_i) = 1/2$, $\delta_\beta(X_i) \neq Y_i$ iff $|h_\beta(X_i) - Y_i| \geq 1/2$, the empirical risk can also be written as \begin{align*} \hat{R}_n(\delta_\beta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{|h_\beta(X_i) - Y_i| \geq 1/2\}}. \end{align*} The indicator function $\mathbf{1}_{\{|h - y| \geq 1/2\}}$ is non-convex and discontinuous in $h$, which makes direct minimization computationally difficult (it is NP-hard in general). A standard resolution is to replace the $0$-$1$ loss with a **surrogate loss** $\ell(h, y)$ that is convex and smooth in $h$: \begin{align*} \hat{R}_n^\ell(\beta) = \frac{1}{n} \sum_{i=1}^n \ell(h_\beta(X_i), Y_i). \end{align*} Two standard choices are the **squared loss** $\ell(h, y) = (h - y)^2$ and the **logistic loss** $\ell(h, y) = \log(1 + e^{-hy})$, the latter written for labels recoded as $y \in \{-1, +1\}$ and a real-valued score $h$. The squared loss treats the classification problem as regression of $Y$ on $X$; the logistic loss is the negative log-likelihood of the logistic model and gives logistic regression. Both are upper bounds on the misclassification loss in appropriate senses, and both yield tractable convex optimization problems when $h_\beta$ is linear in $\beta$. [remark: Connection to Modern Machine Learning] The empirical risk minimization framework with surrogate losses is not merely classical. Much of the recent progress in deep learning is a direct instantiation of these ideas: a neural network parameterizes the family $\{h_\beta\}$, the logistic (or cross-entropy) loss serves as $\ell$, and gradient descent minimizes $\hat{R}_n^\ell(\beta)$ over the high-dimensional parameter $\beta$. The theoretical question of when small empirical risk implies small population risk, known as generalization, is the central topic of statistical learning theory, and connects back to the Bayes risk as the lower bound that no rule can improve on. [/remark] Beyond single-variable prediction, modern data often has many features, and we turn to the structure within high-dimensional problems through Principal Component Analysis. # 19.
Multivariate Analysis and PCA How much of the structure of a high-dimensional dataset can be recovered from the sample covariance matrix? This chapter turns to multivariate analysis — the study of random vectors in $\mathbb{R}^p$ rather than single real-valued random variables — and develops the theory of covariance matrices, correlation, partial correlation, and principal component analysis (PCA). The preceding chapters have focused on estimation and testing for scalar or low-dimensional parameters; here the parameter of interest is a $p \times p$ matrix, and the natural questions concern its identifiability, estimation, and geometric meaning. The key insight of PCA is that the leading eigenvectors of the covariance matrix identify the directions of greatest variability in the data, and these directions can be recovered from the sample covariance. ## Covariance and Correlation in the Multivariate Setting Before introducing population-level definitions, it is worth recalling why covariance alone is insufficient for comparing the strength of linear relationships across different pairs of variables. If $X$ is measured in meters and $Y$ in kilograms, their covariance carries units of meter-kilograms and cannot be directly compared to, say, the covariance between height and age. Normalising by the standard deviations removes the units and produces the correlation coefficient, which always lies in $[-1, 1]$. [definition: Covariance and Correlation] For two real-valued random variables $X, Y$ with $\operatorname{Var}(X), \operatorname{Var}(Y) > 0$, the **covariance** is \begin{align*} \operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])], \end{align*} and the **correlation coefficient** is \begin{align*} \rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)} \cdot \sqrt{\operatorname{Var}(Y)}}. \end{align*} [/definition] Given i.i.d. 
observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, the **sample correlation coefficient** is defined by replacing population moments with their empirical counterparts: \begin{align*} \hat{\rho}_{X,Y} = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2} \cdot \sqrt{\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y}_n)^2}}, \end{align*} where $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$ and $\bar{Y}_n = n^{-1}\sum_{i=1}^n Y_i$ are the sample means. By the standard consistency results for sample moments covered earlier in the course — specifically, the law of large numbers applied to the empirical variance and covariance — $\hat{\rho}_{X,Y}$ is a consistent estimator of $\rho_{X,Y}$ whenever $\operatorname{Var}(X)$ and $\operatorname{Var}(Y)$ are positive and finite. The argument proceeds by combining the continuous mapping theorem with the joint consistency of $(\widehat{\operatorname{Var}}(X), \widehat{\operatorname{Var}}(Y), \widehat{\operatorname{Cov}}(X,Y))$. [remark: Correlation in the Gaussian Model] In the Gaussian model $X = (X^{(1)}, \ldots, X^{(p)})^\top \sim N(\mu, \Sigma)$ with $\Sigma$ positive definite, the sample correlation $\hat{\rho}_{X^{(i)}, X^{(j)}}$ coincides with the MLE $\hat{\rho}^{MLE}$. The $(i,j)$ entry of the correlation matrix $[\rho]_{ij} = \rho_{X^{(i)}, X^{(j)}}$ is related to $\Sigma$ by \begin{align*} \rho_{X^{(i)}, X^{(j)}} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii} \cdot \Sigma_{jj}}}. \end{align*} The matrix $[\rho]_{ij}$ has all entries in $[-1, 1]$, ones on the diagonal, and is positive semidefinite (it is a covariance matrix after standardising each coordinate). [/remark] ## The Structure of Covariance and Correlation Matrices A natural question is: which symmetric matrices can arise as covariance matrices or correlation matrices? 
The answer is clean and has practical importance — when estimating such matrices from data, the constraint set must be understood before setting up any optimisation problem. [quotetheorem:1974] The necessity of positive semidefiniteness is immediate: for any deterministic vector $v \in \mathbb{R}^p$, \begin{align*} v^\top \Sigma v = v^\top \mathbb{E}[X X^\top] v = \mathbb{E}[(v^\top X)^2] \geq 0. \end{align*} Symmetry follows from the definition. Sufficiency — constructing a random vector realising a given $\Sigma$ — is shown on the examples sheet and uses the Cholesky decomposition $\Sigma = L L^\top$: if $Z \sim N(0, I_p)$, then $X = LZ$ satisfies $\mathbb{E}[X X^\top] = L \mathbb{E}[Z Z^\top] L^\top = \Sigma$. The practical payoff is significant: the set of valid covariance matrices is a convex cone (positive semidefinite matrices), and the set of valid correlation matrices is a convex set cut from it by the constraint that diagonal entries equal one. Both constraints are expressible as semidefinite programmes, making constrained estimation tractable. ## The Distribution of the Sample Correlation Under the Null When testing whether two coordinates are uncorrelated — the natural null hypothesis $H_0: \rho_{X,Y} = 0$ — it is useful to know the exact distribution of $\hat{\rho}_{X,Y}$ under the null. The following result applies to the Gaussian model. [quotetheorem:1979] The proof uses the Gaussian-model structure to decompose the joint distribution of sample variances and covariance into a Wishart distribution; the marginal density of the correlation follows by integration. This result is not derived in the lectures; it belongs to classical multivariate normal theory. The formula makes precise what one might expect: under the null, the density is symmetric around zero and concentrates mass near zero for large $n$, since $(1 - r^2)^{(n-4)/2}$ peaks at $r = 0$ and falls steeply as $|r| \to 1$. 
This density underlies the Fisher $z$-transformation confidence intervals and the $t$-test for zero correlation. ## Partial Correlation and Conditional Covariance In multivariate data, the correlation between two variables may be entirely explained by their mutual dependence on a third variable. Partial correlation isolates the direct linear relationship between two variables after removing the linear effects of a conditioning set. [definition: Conditional Covariance and Partial Correlation] Let $X = (X^{(1)\top}, X^{(2)\top})^\top \in \mathbb{R}^p$ with $X^{(1)} \in \mathbb{R}^q$ and $X^{(2)} \in \mathbb{R}^{p-q}$, and suppose $X \sim N(\mu, \Sigma)$ with \begin{align*} \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \end{align*} where $\Sigma_{22}$ is invertible. The **conditional covariance matrix** of $X^{(1)}$ given $X^{(2)}$ is \begin{align*} \Sigma_{11|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}. \end{align*} The **partial correlation** of the $i$-th and $j$-th components of $X^{(1)}$, given $X^{(2)}$, is \begin{align*} \rho_{i,j \mid 2} = \frac{(\Sigma_{11|2})_{ij}}{\sqrt{(\Sigma_{11|2})_{ii} \cdot (\Sigma_{11|2})_{jj}}}. \end{align*} [/definition] The matrix $\Sigma_{11|2}$ is the Schur complement of $\Sigma_{22}$ in $\Sigma$. Its appearance here is not coincidental: in the multivariate normal model, the conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is \begin{align*} X^{(1)} \mid X^{(2)} = x^{(2)} \sim N\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu_2),\; \Sigma_{11|2}\right). \end{align*} The conditional covariance $\Sigma_{11|2}$ does not depend on the conditioning value $x^{(2)}$ — this is a distinctive feature of Gaussian conditioning. The partial correlation $\rho_{i,j|2}$ therefore measures the residual dependence between $X^{(1)(i)}$ and $X^{(1)(j)}$ after accounting for the linear influence of $X^{(2)}$. 
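To make the Schur-complement formula concrete, here is a minimal sketch for a hypothetical $3 \times 3$ covariance matrix with $X^{(1)} \in \mathbb{R}^2$ and $X^{(2)} \in \mathbb{R}$, so that $\Sigma_{22}$ is a scalar and its inverse is a plain division. The matrix entries and the function names are illustrative assumptions, not part of the course material. The matrix corresponds to $X^{(1)(1)} = X^{(2)} + \varepsilon_1$ and $X^{(1)(2)} = X^{(2)} + \varepsilon_2$ with independent standard normal $X^{(2)}, \varepsilon_1, \varepsilon_2$.

```python
import math

def conditional_covariance(sigma11, sigma12, sigma22):
    """Schur complement Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21.

    Assumes (for this sketch only) that the conditioning block is scalar,
    so sigma12 is a list of length q and sigma22 is a positive float.
    """
    q = len(sigma11)
    return [
        [sigma11[i][j] - sigma12[i] * sigma12[j] / sigma22 for j in range(q)]
        for i in range(q)
    ]

def correlation(cov, i, j):
    """Correlation of coordinates i and j implied by a covariance matrix."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

# Hypothetical blocks of Sigma: both coordinates of X^(1) load on X^(2).
sigma11 = [[2.0, 1.0], [1.0, 2.0]]
sigma12 = [1.0, 1.0]
sigma22 = 1.0

sigma11_given_2 = conditional_covariance(sigma11, sigma12, sigma22)
marginal_rho = correlation(sigma11, 0, 1)
partial_rho = correlation(sigma11_given_2, 0, 1)

print("conditional covariance:", sigma11_given_2)
print("marginal correlation:", marginal_rho)
print("partial correlation:", partial_rho)
```

In this constructed case the marginal correlation is $1/2$ while the partial correlation is $0$: all of the linear association between the two coordinates is carried by $X^{(2)}$.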
A failure case clarifies what partial correlation is not: if $X^{(1)(1)}$ and $X^{(1)(2)}$ are independent conditionally on $X^{(2)}$ (i.e., $\rho_{1,2|2} = 0$), it does not follow that they are marginally uncorrelated — their marginal correlation $\rho_{X^{(1)(1)}, X^{(1)(2)}}$ may be nonzero due to their joint dependence on $X^{(2)}$. The direction of confounding can run either way, and without knowing the causal structure of the problem, partial correlation and marginal correlation can point in opposite directions. Estimation is straightforward: since $\Sigma_{11|2}$ is a continuous function of $\Sigma$, the plug-in MLE $\hat{\Sigma}_{11|2} = \hat{\Sigma}_{11} - \hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}\hat{\Sigma}_{21}$ is consistent, and the resulting $\hat{\rho}_{i,j|2}$ is the MLE of $\rho_{i,j|2}$. ## Principal Component Analysis ### The Goal: Dimension Reduction While Preserving Variance A dataset with $p$ measurements per observation lives in $\mathbb{R}^p$. If $p$ is large, direct analysis is expensive and potentially misleading — many of the $p$ directions may contribute little to the total variability, while a few directions explain most of it. The goal of PCA is to find a low-dimensional linear subspace of $\mathbb{R}^p$ that captures as much variance as possible. The mathematical content of this goal is the eigendecomposition of the covariance matrix. For a random vector $X \in \mathbb{R}^p$ with $\mathbb{E}[X] = 0$ and covariance matrix $\Sigma = \mathbb{E}[X X^\top]$, the variance of the projection of $X$ onto a unit vector $v \in \mathbb{R}^p$ is \begin{align*} \operatorname{Var}(v^\top X) = v^\top \mathbb{E}[X X^\top] v = v^\top \Sigma v. \end{align*} The direction of maximum variance is therefore the unit vector $v$ maximising $v^\top \Sigma v$. By the variational characterisation of eigenvalues, this is the leading eigenvector of $\Sigma$. 
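The claim that the variance-maximising direction is the leading eigenvector can be checked numerically. The sketch below assumes a hypothetical $2 \times 2$ covariance matrix and uses plain power iteration, which is an illustrative choice rather than a method prescribed by the course. Power iteration needs a starting vector that is not orthogonal to the leading eigenvector and a strict gap $\lambda_1 > \lambda_2$; both hold for the matrix chosen here.

```python
import math

def mat_vec(m, v):
    """Matrix-vector product for a square matrix stored as nested lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def quad_form(m, v):
    """v^T m v, i.e. Var(v^T X) when m is the covariance of a centred X."""
    return sum(vi * mvi for vi, mvi in zip(v, mat_vec(m, v)))

def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def leading_eigenvector(sigma, n_iter=200):
    """Power iteration: repeatedly apply sigma and renormalise."""
    v = normalise([1.0] + [0.0] * (len(sigma) - 1))
    for _ in range(n_iter):
        v = normalise(mat_vec(sigma, v))
    return v

# Hypothetical covariance matrix, chosen only for illustration.
sigma = [[2.0, 1.0], [1.0, 2.0]]

v1 = leading_eigenvector(sigma)
top_variance = quad_form(sigma, v1)

# Variance of the projection along a grid of other unit directions.
grid = [(math.cos(t), math.sin(t)) for t in (k * math.pi / 12 for k in range(12))]
grid_variances = [quad_form(sigma, list(d)) for d in grid]

print("leading direction:", v1)
print("variance along it:", top_variance)
print("largest variance on the grid:", max(grid_variances))
```

No direction on the grid should exceed the variance along `v1`, which is the content of the variational characterisation for this particular matrix.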
### The Spectral Decomposition of the Covariance Matrix [definition: Principal Components and Eigendecomposition] Let $X \in \mathbb{R}^p$ with $\mathbb{E}[X] = 0$ and $\Sigma = \mathbb{E}[X X^\top]$. Since $\Sigma$ is symmetric and positive semidefinite, it admits an eigendecomposition \begin{align*} \Sigma = V \Lambda V^\top, \end{align*} where $V = (v_1 \mid \cdots \mid v_p)$ is orthogonal ($V V^\top = I_p$) and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$. The vectors $v_1, \ldots, v_p \in \mathbb{R}^p$ are the **principal directions** (or **principal components**) of $X$, and $\lambda_i$ is the variance of $X$ in the direction $v_i$. Equivalently, \begin{align*} \Sigma = \sum_{i=1}^p \lambda_i\, v_i v_i^\top. \end{align*} [/definition]  The representation $\Sigma = \sum_{i=1}^p \lambda_i v_i v_i^\top$ decomposes the covariance matrix into a sum of rank-one contributions, each along a principal direction. The largest term $\lambda_1 v_1 v_1^\top$ dominates if $\lambda_1 \gg \lambda_2$, which is precisely the case in which a single direction captures most of the variation. ### Uncorrelated Coefficients via the Principal Basis Define the vector of **principal component scores** $U = V^\top X \in \mathbb{R}^p$, whose $i$-th entry $U_i = v_i^\top X$ is the projection of $X$ onto the $i$-th principal direction. The covariance matrix of $U$ is \begin{align*} \mathbb{E}[U U^\top] = \mathbb{E}[V^\top X X^\top V] = V^\top \mathbb{E}[X X^\top] V = V^\top \Sigma V = V^\top V \Lambda V^\top V = \Lambda. \end{align*} The scores $U_1, \ldots, U_p$ are therefore uncorrelated (since $\Lambda$ is diagonal), with $\operatorname{Var}(U_i) = \lambda_i$. Expressing $X$ in the principal basis diagonalises the dependence structure. This uncorrelation is a general property holding for any $X$ with covariance $\Sigma$. In the Gaussian case it becomes independence. 
[quotetheorem:1983] [citeproof:1983] The theorem has a structural interpretation: in the principal basis, the Gaussian vector $X$ breaks into $p$ independent scalar effects. The $i$-th effect acts along $v_i$ with magnitude $\sqrt{\lambda_i}$. ### The Generative Model and the Role of Eigenvalues The independence result leads to a generative picture of $X$. If $Z \sim N(0, \Lambda)$ (so $Z_i \stackrel{\text{ind.}}{\sim} N(0, \lambda_i)$), then $X = VZ$ in distribution, and writing $Z_i = \sqrt{\lambda_i} G_i$ with $G = (G_1, \ldots, G_p)^\top \sim N(0, I_p)$: \begin{align*} X = \sum_{i=1}^p v_i Z_i = \sum_{i=1}^p v_i \sqrt{\lambda_i}\, G_i. \end{align*} This is a decomposition of $X$ into a sum of orthogonal, independent effects. The $i$-th term is a random displacement along the direction $v_i$, with standard deviation $\sqrt{\lambda_i}$. The principal components are the "natural axes" of the distribution: the directions in which fluctuations are independently amplified. [example: Two-Dimensional Gaussian with Correlated Components] Let $X = (X^{(1)}, X^{(2)})^\top \sim N(0, \Sigma)$ with \begin{align*} \Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}. \end{align*} The eigenvalues of $\Sigma$ are $\lambda_1 = 3$ and $\lambda_2 = 1$, with corresponding eigenvectors \begin{align*} v_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad v_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}. \end{align*} To verify: $\Sigma v_1 = (2+1, 1+2)^\top / \sqrt{2} = 3 v_1$ and $\Sigma v_2 = (2-1, 1-2)^\top/\sqrt{2} = v_2$. In the principal basis, $U = V^\top X$ satisfies $U_1 = (X^{(1)} + X^{(2)})/\sqrt{2}$ and $U_2 = (X^{(1)} - X^{(2)})/\sqrt{2}$, and $U_1 \sim N(0, 3)$, $U_2 \sim N(0, 1)$ independently. Most of the variance (fraction $3/4$) is concentrated along the diagonal direction $v_1 = (1, 1)^\top/\sqrt{2}$, which is the direction of positive covariation between $X^{(1)}$ and $X^{(2)}$. 
If we retain only the first principal component and approximate $X \approx v_1 U_1$, the reconstruction error has variance $\lambda_2 = 1$, which is $1/4$ of the total variance $\lambda_1 + \lambda_2 = 4$. Retaining the first component thus explains $75\%$ of the total variance. [/example] ### Dimension Reduction and the Rank-$k$ Approximation The practical use of PCA is to approximate $X$ by its projection onto the leading $k$ principal directions, for some $k \ll p$. Retaining the top $k$ principal components replaces the full $p$-dimensional vector with the $k$-dimensional score vector $(U_1, \ldots, U_k)$, at the cost of discarding the remaining $p - k$ components. The proportion of variance explained by the first $k$ principal components is \begin{align*} \frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i}, \end{align*} which serves as a natural diagnostic for how much information is retained. If this ratio is close to one for small $k$, a low-dimensional representation is adequate. A key hypothesis for this to be informative is that the eigenvalues are genuinely separated — specifically that $\lambda_k \gg \lambda_{k+1}$. If the eigenvalues are all equal (the isotropic case $\Sigma = \sigma^2 I_p$), then every direction carries equal variance and there is no preferred subspace to project onto. In that case PCA recovers nothing: all directions are equivalent, the eigenvectors of the sample covariance are random rotations with no relationship to any population structure, and interpreting them as "principal directions" is statistically meaningless. The assumption $\lambda_k > \lambda_{k+1}$ is therefore not just a convenience but a genuine hypothesis on which the interpretability of PCA rests. [remark: Estimation via the Sample Covariance] In practice, $\Sigma$ is unknown and must be replaced by the sample covariance matrix $\hat{\Sigma}_n = n^{-1}\sum_{i=1}^n (X_i - \bar{X}_n)(X_i - \bar{X}_n)^\top$. 
The eigenvectors of $\hat{\Sigma}_n$ are the **empirical principal components**. By the consistency of $\hat{\Sigma}_n$ for $\Sigma$ (which follows from the law of large numbers applied entry by entry), and by Davis–Kahan-type perturbation theory for symmetric matrices, the empirical eigenvectors converge to the population eigenvectors as $n \to \infty$, provided the corresponding eigenvalues are distinct. The rate of convergence depends on the eigengap $\lambda_k - \lambda_{k+1}$: a larger gap yields more stable estimation. [/remark] Dimensionality reduction through PCA illuminates structure in data, but statistical inference about the reduced space requires careful treatment. The resampling methods we explore next provide a general framework for assessing the validity of data-driven procedures. # 20. Resampling Principles and the Bootstrap ## Why Resample? The Bootstrap Philosophy Throughout this course, a recurring theme has been that larger samples make statistical tasks easier: estimators concentrate, confidence intervals shrink, and asymptotic approximations improve. But having more data is not always possible. A natural question then arises: can we extract more information from the data we already have by reusing it in a principled way? Resampling techniques answer this question affirmatively. The central insight is that the empirical distribution of the observed sample — the distribution that places mass $1/n$ at each data point — is itself a proxy for the true underlying distribution $P$. Drawing from this empirical distribution simulates what would happen if we could collect additional samples from $P$. Two concrete tools emerge from this idea: the jackknife, which reduces estimation bias by systematic leave-one-out deletion, and the bootstrap, which approximates the sampling distribution of a statistic by resampling from the empirical distribution. 
This chapter develops both tools and culminates in a rigorous statement of the bootstrap's key theoretical guarantee: that the resampled distribution of $\sqrt{n}(\bar{X}_n^b - \bar{X}_n)$ converges almost surely, in the Kolmogorov (uniform) sense, to the same limiting distribution as $\sqrt{n}(\bar{X}_n - \mu)$. ## Bias Reduction via the Jackknife Before introducing the bootstrap, it is instructive to see a simpler resampling idea that addresses a different problem: bias in point estimation. Let $X_1, \ldots, X_n \sim P$ i.i.d. and let $T_n = T(X_1, \ldots, X_n)$ be an estimator of a parameter $\theta$. The bias of $T_n$ is $B(\theta) = \mathbb{E}_\theta[T_n] - \theta$. If $B(\theta) \neq 0$, the estimator overshoots or undershoots $\theta$ on average at that sample size, even when the bias shrinks as $n$ grows. One might try to estimate the bias and subtract it, but this requires knowing $B(\theta)$, which depends on $\theta$, the unknown quantity. The jackknife circumvents this by forming bias estimates from the sample itself. [definition: Jackknife Bias Estimate] Let $T_{(-i)} = T(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ denote the estimator computed from the sample with the $i$-th observation removed. The **jackknife bias estimate** is \begin{align*} \hat{B}_n = (n-1)\left(\frac{1}{n}\sum_{i=1}^n T_{(-i)} - T_n\right), \end{align*} and the **jackknife bias-corrected estimator** of $\theta$ is \begin{align*} \tilde{T}_{\mathrm{JACK}} = T_n - \hat{B}_n. \end{align*} [/definition] The factor $(n-1)$ is not arbitrary. To understand it, consider what happens when $T_n$ has bias of order $1/n$: write $\mathbb{E}[T_n] = \theta + a/n + O(1/n^2)$ for some constant $a$. With one observation removed, $T_{(-i)}$ is the same estimator applied to $n-1$ observations, so $\mathbb{E}[T_{(-i)}] = \theta + a/(n-1) + O(1/(n-1)^2)$. The average $\bar{T}_{(-\cdot)} = \frac{1}{n}\sum_i T_{(-i)}$ then satisfies $\mathbb{E}[\bar{T}_{(-\cdot)}] \approx \theta + a/(n-1)$.
Consequently \begin{align*} (n-1)\left(\mathbb{E}[\bar{T}_{(-\cdot)}] - \mathbb{E}[T_n]\right) \approx (n-1)\left(\frac{a}{n-1} - \frac{a}{n}\right) = (n-1) \cdot \frac{a\,(n - (n-1))}{n(n-1)} = \frac{a}{n}, \end{align*} which matches the leading bias term. The jackknife therefore cancels the $O(1/n)$ bias to leave an error of smaller order. [quotetheorem:1986] The proof is left to the examples sheet. The key point is that the jackknife replaces a bias of order $1/n$ with one of order $1/n^2$, a genuine improvement. However, one must be cautious: reducing bias does not automatically reduce risk. Subtracting $\hat{B}_n$ from $T_n$ can increase the variance of the corrected estimator, and if the variance increase outweighs the bias reduction, the mean squared error $\mathbb{E}[(\tilde{T}_{\mathrm{JACK}} - \theta)^2]$ can actually increase. Whether bias correction is beneficial depends on the specific estimator and the relative magnitudes of bias and variance. ## The Empirical Distribution and Bootstrap Resampling The jackknife addresses bias through systematic deletion. The bootstrap addresses a more fundamental problem: characterising the sampling distribution of a statistic when the true distribution $P$ is unknown. To see why this matters, consider building a confidence interval for the mean $\mu = \mathbb{E}_P[X]$ based on $X_1, \ldots, X_n \sim P$ i.i.d. with $\operatorname{Var}(X) = \sigma^2$. The classical asymptotic confidence interval \begin{align*} C_n = \left\{\nu \in \mathbb{R} : |\nu - \bar{X}_n| \leq \frac{\sigma z_\alpha}{\sqrt{n}}\right\} \end{align*} requires knowing (or estimating) $\sigma^2$ and relies on the fact that $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. Both requirements can be inconvenient: estimating $\sigma^2$ introduces additional uncertainty, and the CLT-based normal approximation may be poor in small samples. The bootstrap offers an alternative that requires neither an explicit estimate of $\sigma^2$ nor an explicit normal approximation.
[definition: Bootstrap Distribution] For fixed observations $X_1, \ldots, X_n$, the **bootstrap distribution** $P_n = P_n(\cdot \mid X_1, \ldots, X_n)$ is the discrete probability distribution on $\{X_1, \ldots, X_n\}$ defined by \begin{align*} P_n(X_n^b = X_i) = \frac{1}{n}, \quad 1 \leq i \leq n. \end{align*} A **bootstrap sample** $(X_{n,1}^b, \ldots, X_{n,n}^b)$ consists of $n$ independent copies of $X_n^b$ drawn with this law — equivalently, $n$ draws with replacement from $\{X_1, \ldots, X_n\}$. [/definition] The distribution $P_n$ is entirely determined by the observed data: once $X_1, \ldots, X_n$ are fixed, $P_n$ is a known discrete distribution. This is the key computational advantage. Before exploring how to use this for inference, note a basic property of the bootstrap mean. [quotetheorem:1991] [citeproof:1991] This result has a clean interpretation: the bootstrap sample mean $\bar{X}_n^b$ is an "estimate" of $\bar{X}_n$, just as $\bar{X}_n$ itself is an estimate of $\mu$. The bootstrap replicates the structure of the original estimation problem one level up, with $P_n$ playing the role of $P$ and $\bar{X}_n$ playing the role of $\mu$. ## The Bootstrap Principle for Confidence Intervals The bootstrap's application to confidence intervals rests on a conceptual symmetry. The quantity $\bar{X}_n - \mu$ measures how far the sample mean deviates from the true mean, and a confidence interval requires quantifying this deviation. The bootstrap principle asserts that, at least approximately, the conditional distribution of $\bar{X}_n^b - \bar{X}_n$ (given the data) mimics the distribution of $\bar{X}_n - \mu$. Note that $\bar{X}_n^b - \bar{X}_n = \bar{X}_n^b - \mathbb{E}[\bar{X}_n^b \mid X_1, \ldots, X_n]$, so this is a centred quantity — the deviation of the bootstrap mean from its own conditional expectation. Analogously, $\bar{X}_n - \mu = \bar{X}_n - \mathbb{E}[\bar{X}_n]$. Both expressions have the same structural form: statistic minus its mean. 
The bootstrap claims that these two distributions are close, and this closeness is what justifies using the bootstrap distribution to calibrate confidence intervals. [definition: Bootstrap Confidence Interval] For $\alpha \in (0,1)$, let $R_n^b = R_n^b(X_1, \ldots, X_n)$ be the value satisfying \begin{align*} P_n\!\left(|\bar{X}_n^b - \bar{X}_n| \leq \frac{R_n^b}{\sqrt{n}} \,\Big|\, X_1, \ldots, X_n\right) = 1 - \alpha. \end{align*} The **bootstrap confidence set** for $\mu$ is \begin{align*} C_n^b = \left\{\nu \in \mathbb{R} : |\nu - \bar{X}_n| \leq \frac{R_n^b}{\sqrt{n}}\right\}. \end{align*} [/definition] Several features of this construction deserve attention. First, $R_n^b$ is a $(1-\alpha)$-quantile of $\sqrt{n}\,|\bar{X}_n^b - \bar{X}_n|$ under $P_n$, and since $P_n$ is a known discrete distribution (given the data), this quantile can be computed exactly, or approximated to any desired precision by Monte Carlo simulation, simply by generating many bootstrap samples and computing the empirical quantile of $\sqrt{n}|\bar{X}_n^b - \bar{X}_n|$. Crucially, no knowledge of $\sigma^2$ is required, and no normal approximation is invoked explicitly. Second, the bootstrap confidence set has exact coverage $1-\alpha$ for the bootstrap problem (up to the discreteness of $P_n$): $P_n(\bar{X}_n^b \in C_n^b) = 1-\alpha$ by construction. What remains to establish is that this implies approximate frequentist coverage for the original problem, that is, that $\mathbb{P}(\mu \in C_n^b) \to 1 - \alpha$ as $n \to \infty$. This requires showing that the bootstrap distribution faithfully approximates the true sampling distribution. Theoretically, it is not even necessary to actually draw bootstrap samples; the bootstrap can be viewed as a thought experiment that defines $R_n^b$ as a deterministic function of $X_1, \ldots, X_n$.
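The Monte Carlo approximation of $R_n^b$ described above can be sketched as follows. The simulated data, the number of bootstrap replicates `n_boot`, the seeds, and the function name are all illustrative assumptions. The quantile is taken as a simple order statistic, which is one of several reasonable conventions.

```python
import math
import random

def bootstrap_mean_interval(data, alpha=0.05, n_boot=2000, seed=0):
    """Approximate the bootstrap confidence set for the mean by Monte Carlo.

    Returns (lower, upper, r_nb), where r_nb approximates R_n^b, the
    (1 - alpha)-quantile of sqrt(n) * |bootstrap mean - sample mean|.
    """
    rng = random.Random(seed)
    n = len(data)
    sample_mean = sum(data) / n

    deviations = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)  # n draws with replacement from the data
        boot_mean = sum(resample) / n
        deviations.append(math.sqrt(n) * abs(boot_mean - sample_mean))

    # Empirical (1 - alpha)-quantile of the simulated deviations.
    deviations.sort()
    index = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    r_nb = deviations[index]

    half_width = r_nb / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width, r_nb

# Hypothetical skewed sample: 50 draws from an Exp(1) distribution.
sample_rng = random.Random(1)
data = [sample_rng.expovariate(1.0) for _ in range(50)]

lower, upper, r_nb = bootstrap_mean_interval(data)
print(f"sample mean: {sum(data) / len(data):.3f}")
print(f"approximate 95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

Note that the code never estimates $\sigma^2$ and never looks up a normal quantile; the scale of the interval comes entirely from the resampled deviations.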
## The Bootstrap Theorem The theoretical foundation for the bootstrap confidence interval is the following result, which shows that the bootstrap approximation to the sampling distribution of $\sqrt{n}(\bar{X}_n - \mu)$ is not merely heuristic: it converges almost surely in the uniform (Kolmogorov) sense. [quotetheorem:1995] The proof is deferred to Lecture 21; it requires a strengthened form of the central limit theorem, specifically a CLT that applies conditionally and holds almost surely rather than merely in probability. This theorem has a direct consequence for the coverage of $C_n^b$. By the classical CLT, $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. The bootstrap theorem implies that the conditional quantiles of $\sqrt{n}(\bar{X}_n^b - \bar{X}_n)$ under $P_n$ converge almost surely to those of $N(0, \sigma^2)$. Since $R_n^b/\sqrt{n}$ is the $(1-\alpha)$-quantile of $|\bar{X}_n^b - \bar{X}_n|$ under $P_n$, it follows that $R_n^b$ converges almost surely to $\sigma z_\alpha$, the $(1-\alpha)$-quantile of $|W|$ for $W \sim N(0, \sigma^2)$. Therefore \begin{align*} \mathbb{P}(\mu \in C_n^b) = \mathbb{P}\!\left(|\bar{X}_n - \mu| \leq \frac{R_n^b}{\sqrt{n}}\right) \to \mathbb{P}\!\left(|Z| \leq z_\alpha\right) = 1 - \alpha, \end{align*} where $Z \sim N(0, 1)$; the limit uses $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} Z$ together with $R_n^b/\sigma \to z_\alpha$ almost surely. This confirms that $C_n^b$ is an asymptotically valid $(1-\alpha)$ confidence interval. [remark: What the Bootstrap Does Not Require] The bootstrap confidence interval $C_n^b$ requires neither knowledge of $\sigma^2$ nor explicit invocation of the normal approximation. It adaptively estimates the scale of the sampling distribution from the data.
When the distribution $P$ is far from normal in finite samples, the bootstrap quantiles may provide a better approximation than the normal quantile $z_\alpha$, because the bootstrap distribution $P_n$ retains some of the non-normality of the empirical distribution — unlike the classical interval, which forces a Gaussian shape from the outset. [/remark] [remark: Scope of the Bootstrap Principle] The mean estimation setting is used here for concreteness, but the bootstrap principle is far more general. For any statistic $T_n = T(X_1, \ldots, X_n)$, one can form a bootstrap version $T_n^b = T(X_{n,1}^b, \ldots, X_{n,n}^b)$ and use the conditional distribution of $T_n^b - T_n$ to approximate the sampling distribution of $T_n - \theta$. The theoretical justification requires a case-by-case analysis, and the bootstrap can fail — for instance, for statistics that depend on extremes of the distribution (such as the sample maximum) or for estimators at the boundary of their parameter space. In those cases, the empirical distribution $P_n$ does not adequately approximate the relevant features of $P$ for the statistic in question. [/remark] Resampling — repeating estimation on bootstrap samples drawn from the original data — offers a practical approach to uncertainty quantification that requires minimal distributional assumptions. We justify this powerful method through rigorous asymptotic theory. # 21. Validity of the Bootstrap ## Why Does the Bootstrap Actually Work? The previous chapter introduced the bootstrap and claimed that the conditional c.d.f. of $\sqrt{n}(\bar{X}_n^b - \bar{X}_n)$ converges uniformly to that of a $N(0, \sigma^2)$ distribution. This chapter fills that gap. The argument proceeds in two stages: first, a general lemma connects convergence in distribution to uniform convergence of c.d.f.s (valid when the limiting distribution is continuous); second, a CLT for triangular arrays gives convergence in distribution for the bootstrap statistics. 
The combination yields the bootstrap's validity for confidence intervals on the mean. A key structural feature of the argument is that it operates on two layers of randomness simultaneously. The outer layer is the infinite sequence $X_1, X_2, \ldots$ drawn from the true distribution $P$. The inner layer is the bootstrap resampling under $P_n$, conditional on that fixed sequence. The proof shows that for *almost all* fixed sequences (in the outer layer), the bootstrap distribution converges to the right limit — which is a consequence of the law of large numbers operating on the outer sequence. ## Uniform Convergence of C.D.F.s The first ingredient is a general fact about weak convergence. Convergence in distribution is defined pointwise at continuity points of the limit, but when the limiting c.d.f. is continuous everywhere, the convergence is actually uniform. This is what makes the bootstrap confidence interval construction rigorous. [quotetheorem:1996] The continuity hypothesis on $F$ is essential. Without it, convergence in distribution does not imply anything uniform — discrete limit distributions have jump discontinuities, and pointwise convergence at continuity points can leave the jumps poorly approximated. The normal distribution $N(0, \sigma^2)$ has a smooth c.d.f., so the theorem applies to the bootstrap setting. [citeproof:1996] This lemma will be applied with $A_n = \sqrt{n}(\bar{X}_n^b - \bar{X}_n)$ under the conditional distribution $P_n(\cdot \mid X_1, \ldots, X_n)$ and $A \sim N(0, \sigma^2)$. The uniform convergence of c.d.f.s is precisely what is needed to ensure that the bootstrap quantile $q_{n,\alpha}$ converges to the true quantile $z_\alpha$ of the normal, which in turn guarantees the bootstrap confidence interval has asymptotically correct coverage. ## Triangular Arrays and the Bootstrap CLT The bootstrap resampling scheme involves drawing $X_1^{b(n)}, \ldots, X_n^{b(n)}$ i.i.d. from the empirical distribution $P_n$ for each $n$. 
As $n$ increases, the distribution $P_n$ changes, so the rows of bootstrap samples do not share a common distribution. This is precisely the setting of a triangular array. [definition: I.I.D. Triangular Array] A sequence $(Z_{n,i} ; 1 \leq i \leq n)_{n \geq 1}$ is a **triangular array of i.i.d. random variables** if for each $n \geq 1$, the random variables $Z_{n,1}, \ldots, Z_{n,n}$ are i.i.d. draws from some distribution $Q_n$. The distribution $Q_n$ may vary with $n$. [/definition] [remark: Why "Triangular"] Writing out the array: \begin{align*} &Z_{1,1} \\ &Z_{2,1},\ Z_{2,2} \\ &Z_{3,1},\ Z_{3,2},\ Z_{3,3} \\ &\vdots \end{align*} gives a triangular shape. Within each row, the variables are i.i.d., but the row distributions $Q_1, Q_2, Q_3, \ldots$ need not be the same. [/remark] The bootstrap samples $(X_i^{b(n)} ; 1 \leq i \leq n)_{n \geq 1}$ form exactly such a triangular array, with $Q_n = P_n$ being the empirical distribution of $X_1, \ldots, X_n$. The classical CLT does not apply directly because $Q_n$ changes with $n$. Instead, one uses the following extension. [quotetheorem:1997] The three conditions collectively constitute a **Lindeberg-type condition** adapted to the triangular setting. Condition (1) says that extreme values of $Z_{n,i}$ become negligible: the probability that any single $Z_{n,i}$ exceeds the scale $\sqrt{n}$ decays faster than $1/n$, so the $n$ terms in the sum have essentially no contribution from their large-deviation tails. Condition (2) ensures that truncating to $|Z_{n,1}| \leq \sqrt{n}$ captures essentially all the variance in the limit. Condition (3) controls the mean of the truncated variables. Together they ensure that the sum behaves like a sum of uniformly small, finite-variance contributions — which is the regime where the CLT holds. The proof of this result uses characteristic functions and relies on tools from Part IB Probability and Measure. 
This theorem is used as a black box in the proof of bootstrap validity; it is not examinable.

## Proof of Bootstrap Validity

We can now prove the main theorem stated in the previous chapter: the bootstrap distribution converges uniformly to the normal distribution, almost surely.

[quotetheorem:1998]

The theorem says something precise about the two layers of randomness. The inner probability $P_n(\cdot \mid X_1, \ldots, X_n)$ is the bootstrap resampling distribution for a fixed realisation of the data. The "almost surely" refers to the outer probability: the set of infinite sequences $X_1, X_2, \ldots$ for which the uniform convergence fails has probability zero under the original distribution $P$.

[citeproof:1998]

The argument is essentially deterministic once $\omega$ is fixed: for any fixed sequence satisfying the LLN and the Lindeberg conditions, the bootstrap distribution converges. The probability measure over $\omega$ is only needed to conclude that these conditions hold for almost all sequences.

As a corollary, the bootstrap confidence interval $C_n^b = [\bar{X}_n - q_{n, 1-\alpha/2}/\sqrt{n},\; \bar{X}_n - q_{n,\alpha/2}/\sqrt{n}]$, where $q_{n,\alpha}$ is the $\alpha$-quantile of $\sqrt{n}(\bar{X}_n^b - \bar{X}_n)$ under $P_n(\cdot \mid X_1, \ldots, X_n)$, satisfies $\mathbb{P}(\mu \in C_n^b) \to 1 - \alpha$ as $n \to \infty$. The endpoints come from inverting the pivot: the event $q_{n,\alpha/2} \leq \sqrt{n}(\bar{X}_n - \mu) \leq q_{n,1-\alpha/2}$ is the same as $\mu \in C_n^b$, and since $q_{n,\alpha/2}$ is typically negative, the upper endpoint lies above $\bar{X}_n$.

## Extensions to Parametric Models

The validity result above covers the simplest case: confidence intervals for the mean. The bootstrap idea extends naturally to parametric models $\{P_\theta : \theta \in \Theta\}$, where the quantity of interest is the MLE $\hat{\theta}_n$ rather than the sample mean. Two versions arise.

In the **nonparametric bootstrap**, one resamples $X_1^{b}, \ldots, X_n^{b}$ directly from the empirical distribution $P_n$, without using the parametric structure, and computes the bootstrap MLE $\hat{\theta}_n^b$ from this resampled data.
The pivot analogous to $\sqrt{n}(\bar{X}_n^b - \bar{X}_n)$ is $\sqrt{n}(\hat{\theta}_n^b - \hat{\theta}_n)$. Under regularity conditions, this pivot converges in distribution to the same limit as $\sqrt{n}(\hat{\theta}_n - \theta_0)$, enabling the construction of a bootstrap confidence set
\begin{align*}
C_n^b = \left\{\theta \in \Theta : \|\theta - \hat{\theta}_n\| \leq \frac{R_n}{\sqrt{n}}\right\},
\end{align*}
where $R_n$ is chosen so that $P_n(\|\hat{\theta}_n^b - \hat{\theta}_n\| \leq R_n/\sqrt{n} \mid X_1, \ldots, X_n) = 1 - \alpha$. Under suitable regularity, $P_{\theta_0}(\theta_0 \in C_n^b) \to 1 - \alpha$. A significant practical advantage of this approach is that it does not require estimating the Fisher information or knowing the asymptotic distribution of $\hat{\theta}_n$ in closed form: the bootstrap automatically captures the correct limiting variance.

In the **parametric bootstrap**, one instead resamples from the fitted model $P_{\hat{\theta}_n}$ rather than from the empirical distribution $P_n$. The resampling step uses the parametric structure explicitly. This is preferable when the model is well-specified and one wants to leverage its structure for greater efficiency, but it requires generating samples from $P_{\hat{\theta}_n}$, which may be computationally demanding.

[remark: Nonparametric vs Parametric Bootstrap] The names can be counterintuitive. The nonparametric bootstrap resamples from the empirical distribution and requires no model; it can be applied even in fully nonparametric settings. The parametric bootstrap resamples from the fitted parametric model. In both cases the pivot idea is the same: use the bootstrap distribution of a statistic around the original estimate as a proxy for the true distribution of the estimate around the true parameter. [/remark]

The bootstrap's validity rests on the principle that the empirical distribution approximates the true distribution.
We examine when this approximation fails and how to modify the bootstrap for dependent data or heavy-tailed distributions. # 22. Monte Carlo Methods Many of the statistical procedures studied in this course — posterior means, bootstrap quantiles, multivariate Gaussian level sets — involve expectations or integrals that admit no closed-form expression. Monte Carlo methods address a deceptively simple question: given a distribution $P$ you cannot integrate analytically, how do you approximate $\mathbb{E}_P[g(X)]$ to any desired accuracy? The answer rests on the law of large numbers combined with clever schemes for generating samples from $P$, even when direct sampling is difficult or impossible. ## Pseudo-Random Sampling and the Uniformity Foundation The entire edifice of Monte Carlo simulation rests on the availability of uniform random samples. In practice, computers generate *pseudo-random* numbers that behave, for all practical purposes, as if they were drawn from the uniform distribution. [definition: Pseudo-Random Uniform Sample] A pseudo-random uniform sample is a collection of variables $U_1^*, \ldots, U_N^*$ such that for all $u_1, \ldots, u_N \in [0,1]$, \begin{align*} \mathbb{P}(U_1^* \leq u_1, \ldots, U_N^* \leq u_N) \approx u_1 \cdots u_N, \end{align*} where the approximation holds up to machine precision. For theoretical purposes, we treat $U_1^*, \ldots, U_N^* \sim U[0,1]$ i.i.d. [/definition] [remark: Machine Precision] The distinction between exact equality and approximate equality here is purely computational. In the mathematical analysis that follows, we treat pseudo-random uniform samples as genuinely i.i.d. $U[0,1]$. The theory then tells us what properties the resulting Monte Carlo estimates have, and the machine-precision approximation ensures those theoretical guarantees hold up in numerical practice. [/remark] Starting from uniform samples, one can generate samples from any distribution on a finite set. 
The idea is to partition $[0,1]$ into equal-length intervals, one for each point in the support. [quotetheorem:1999] [citeproof:1999] This proposition is more practically significant than it might first appear. Taking $x_1, \ldots, x_n$ to be an observed data set of size $n$, the construction generates an i.i.d. bootstrap sample from the empirical distribution — exactly what the nonparametric bootstrap requires. The construction generalises to any discrete distribution: one simply takes intervals of lengths proportional to the prescribed probabilities rather than equal-length intervals. ## The Inverse Transform Method Generating samples from continuous distributions requires a different tool — the generalised inverse of the cumulative distribution function. [definition: Generalised Inverse] Let $F$ be the cumulative distribution function of a distribution on $\mathbb{R}$. The generalised inverse $F^-: (0,1) \to \mathbb{R}$ is defined by \begin{align*} F^-(u) = \inf\{x : u \leq F(x)\}. \end{align*} [/definition] When $F$ is strictly increasing and continuous, $F^-$ coincides with the ordinary functional inverse $F^{-1}$, and $F^-(u)$ is simply the $u$-th quantile of the distribution. The generalised inverse extends this notion to arbitrary distributions, including those with flat parts or jumps in $F$. [quotetheorem:2000] [citeproof:2000] The hypothesis that $F$ is the cumulative distribution function of $P$ is what allows the final equality $\mathbb{P}(U \leq F(t)) = F(t)$ — it uses the fact that $U$ is uniform on $[0,1]$ and that $F(t) \in [0,1]$. Without $F$ being a valid cumulative distribution function, the output distribution would not be $P$. As a direct corollary, given a pseudo-random sample $U_1^*, \ldots, U_N^*$, the transformed sample $(F^-(U_1^*), \ldots, F^-(U_N^*))$ is approximately i.i.d. from $P$. The law of large numbers then gives Monte Carlo integration: \begin{align*} \frac{1}{N} \sum_{i=1}^N g(X_i^*) \xrightarrow{a.s.} \mathbb{E}_P[g(X)]. 
\end{align*} The fundamental limitation of inverse transform sampling is that $F^-$ must be computable. For a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, the cumulative distribution function involves the error function, which has no closed-form inverse. One must then resort to alternative methods. ## Importance Sampling When simulating directly from $P$ is difficult, importance sampling draws from an easier distribution $h$ and corrects by reweighting. Let $P$ have density $f$ on a space $\mathcal{X}$, and suppose direct simulation from $f$ is computationally demanding or impractical. Let $h$ be a density on $\mathcal{X}$ whose support contains that of $f$, and from which simulation is straightforward. The key identity is: \begin{align*} \mathbb{E}_h\left[\frac{g(X) f(X)}{h(X)}\right] = \int_{\mathcal{X}} \frac{g(x) f(x)}{h(x)} h(x)\, dx = \int_{\mathcal{X}} g(x) f(x)\, dx = \mathbb{E}_f[g(X)]. \end{align*} The ratio $f(x)/h(x)$ is called the *importance weight*. It corrects for the discrepancy between the sampling distribution $h$ and the target distribution $f$. Since $\mathbb{E}_f[g(X)]$ equals the expectation of the weighted function under $h$, the law of large numbers applied to samples from $h$ gives: [quotetheorem:2001] The hypothesis $\operatorname{supp}(f) \subset \operatorname{supp}(h)$ is indispensable: if $h(x) = 0$ at some point where $f(x) > 0$, the importance weight $f(x)/h(x)$ is undefined, and the estimator does not converge to $\mathbb{E}_f[g(X)]$ — regions where $f$ places mass but $h$ does not are invisible to the estimator. The efficiency of importance sampling depends strongly on how well $h$ approximates $f$; a good proposal $h$ has its mass in roughly the same regions as $f g$, keeping the variance of the importance weights small. 
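The reweighting identity above can be checked numerically. The following is a minimal sketch under assumptions that are not in the text: the target $f$ is taken to be $\mathcal{N}(1, 1)$, the proposal $h$ is $\mathcal{N}(0, 2^2)$ (wider than the target, so the weights stay bounded), and $g(x) = x^2$, for which $\mathbb{E}_f[g(X)] = 2$. The helper names `normal_pdf` and `importance_sampling` are illustrative only.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def importance_sampling(g, f, h, sample_h, n, rng):
    """Average g(X) * f(X) / h(X) over n draws X ~ h."""
    total = 0.0
    for _ in range(n):
        x = sample_h(rng)
        total += g(x) * f(x) / h(x)  # importance weight f(x) / h(x)
    return total / n


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = importance_sampling(
    g=lambda x: x * x,
    f=lambda x: normal_pdf(x, 1.0, 1.0),      # assumed target density
    h=lambda x: normal_pdf(x, 0.0, 2.0),      # assumed proposal density
    sample_h=lambda r: r.gauss(0.0, 2.0),     # draw from the proposal
    n=100_000,
    rng=rng,
)
print("importance sampling estimate of E_f[X^2]:", estimate)
```

The proposal is deliberately chosen with heavier tails than the target; swapping the roles (a narrow proposal for a wide target) keeps the estimator consistent but can make the weights, and hence the variance, very large.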
[example: Gaussian Proposal for an Unnormalised Target] Suppose $f$ is proportional to $e^{-x^2/2}(1 + x^2)^{-1}$ on $\mathbb{R}$ (a target whose normalising constant is awkward to compute analytically) and one wishes to estimate $\mathbb{E}_f[g(X)]$ for some function $g$. Taking $h$ to be the standard Gaussian density $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$, the importance weight is $f(x)/\phi(x) \propto (2\pi)^{1/2} (1 + x^2)^{-1}$, which is bounded. Drawing $X_1^*, \ldots, X_N^*$ from $\mathcal{N}(0,1)$ and computing the weighted average of the $g(X_i^*)$, with the weights $(1 + (X_i^*)^2)^{-1}$ normalised to sum to one so that the unknown constant cancels, produces a consistent estimator. The support condition is satisfied because $\phi(x) > 0$ for all $x \in \mathbb{R}$. [/example]

## The Accept–Reject Algorithm

Importance sampling reweights samples from a proposal; the accept–reject algorithm instead selects only those proposals that are consistent with the target, discarding the rest. Assume densities $f$ and $h$ satisfy $f(x) \leq M h(x)$ for all $x$, for some constant $M > 0$. The algorithm is:

1. Generate $X \sim h$ and $U \sim U(0,1)$, independently.
2. If $U \leq f(X) / (M h(X))$, accept $Y = X$.
3. Otherwise, return to step 1.

[quotetheorem:2002]

The proof shows this by computing $\mathbb{P}(X \leq t, \text{accepted})$: since $U$ is independent of $X$ and uniform, the probability of acceptance given $X = x$ is $f(x)/(Mh(x))$. Integrating over $x$ gives $\mathbb{P}(\text{accepted}) = 1/M$, and dividing the joint probability by this normalisation constant recovers $f$ as the conditional density of $X$ given acceptance.

The condition $f \leq Mh$ is not merely a technical convenience: it is the requirement that ensures every region where $f$ places mass is explored by the proposal $h$ with sufficient frequency. The acceptance probability at any point $x$ is $f(x)/(Mh(x)) \leq 1$, so the algorithm is well-defined. The average number of proposals needed per accepted sample is $M$, so a tighter bound (smaller $M$) produces a more efficient algorithm.
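The three steps of the algorithm can be sketched in a few lines. The choices below are assumptions made for illustration, not part of the text: the target is the Beta$(2,2)$ density $f(x) = 6x(1-x)$ on $[0,1]$, the proposal $h$ is $U[0,1]$, and $M = 1.5 = \sup_x f(x)/h(x)$. The function name `accept_reject` is hypothetical.

```python
import random


def accept_reject(f, h, sample_h, M, n, rng):
    """Draw n samples from density f using proposals from h, assuming f <= M * h."""
    samples = []
    proposals = 0
    while len(samples) < n:
        x = sample_h(rng)              # step 1: X ~ h
        u = rng.random()               # step 1: U ~ U(0,1), independent of X
        proposals += 1
        if u <= f(x) / (M * h(x)):     # step 2: accept with probability f/(M h)
            samples.append(x)
        # step 3: otherwise loop and propose again
    return samples, proposals


rng = random.Random(1)  # fixed seed so the run is reproducible
samples, proposals = accept_reject(
    f=lambda x: 6.0 * x * (1.0 - x),   # assumed target: Beta(2, 2) density
    h=lambda x: 1.0,                   # assumed proposal: U[0, 1] density
    sample_h=lambda r: r.random(),
    M=1.5,                             # sup of f/h, attained at x = 1/2
    n=20_000,
    rng=rng,
)
sample_mean = sum(samples) / len(samples)
proposals_per_sample = proposals / len(samples)
print("sample mean:", sample_mean)
print("proposals per accepted sample:", proposals_per_sample)
```

In this sketch the sample mean should be close to $1/2$, the mean of Beta$(2,2)$, and the number of proposals per accepted sample should be close to $M = 1.5$, in line with the efficiency statement above.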
[remark: Efficiency and the Choice of M] The expected number of iterations of steps 1–2 before acceptance is exactly $M$. When $M$ is large, many candidate samples are rejected, and simulation is slow. The optimal choice $M = \sup_x f(x)/h(x)$ is the smallest valid constant, but computing this supremum may itself be non-trivial. In practice, one seeks $h$ that mimics the shape of $f$ as closely as possible. [/remark] ## The Gibbs Sampler The previous methods require the target density to be known explicitly in a form amenable to simulation or pointwise evaluation. For high-dimensional posterior distributions arising in Bayesian statistics, even this can be intractable. The Gibbs sampler exploits the fact that *conditional* distributions are often much simpler than the joint. Consider a bivariate distribution with joint density $f_{X,Y}$, and suppose one wants to sample from it. Starting from an initial value $X_0 = x$, the Gibbs sampler generates a sequence by alternating conditional draws: For $t \geq 1$: 1. Draw $Y_t \sim f_{Y|X}(\cdot \mid X = X_{t-1})$. 2. Draw $X_t \sim f_{X|Y}(\cdot \mid Y = Y_t)$. [quotetheorem:2003] The Markov property here is that the conditional distribution of $(X_t, Y_t)$ given the entire past depends only on $(X_{t-1}, Y_{t-1})$ — each new pair is generated using only the most recent state. The invariance of $f_{X,Y}$ means that if $(X_0, Y_0) \sim f_{X,Y}$, then $(X_t, Y_t) \sim f_{X,Y}$ for all $t$; more crucially, under mild conditions, the chain converges to $f_{X,Y}$ regardless of the starting point $X_0$. The hypothesis that the chain is valid depends on the conditional distributions $f_{Y|X}$ and $f_{X|Y}$ being well-defined and simulable. In many Bayesian models, the joint posterior is complex, but the conditionals have standard forms (Gaussian, Gamma, Beta), making Gibbs sampling practical where direct sampling from the joint is not. 
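As a hedged illustration of the alternating scheme, the sketch below runs the two conditional draws for a standard bivariate Gaussian with correlation $\rho$, for which both conditionals are one-dimensional Gaussians: $Y \mid X = x \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $X \mid Y = y \sim \mathcal{N}(\rho y, 1 - \rho^2)$. The value $\rho = 0.8$, the burn-in length, and the function name `gibbs_bivariate_gaussian` are assumptions chosen for the illustration.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, n_steps, x0, rng):
    """Gibbs chain for a standard bivariate Gaussian with correlation rho."""
    cond_sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x = x0
    chain = []
    for _ in range(n_steps):
        y = rng.gauss(rho * x, cond_sd)   # step 1: Y_t ~ f_{Y|X}(. | X_{t-1})
        x = rng.gauss(rho * y, cond_sd)   # step 2: X_t ~ f_{X|Y}(. | Y_t)
        chain.append((x, y))
    return chain


rng = random.Random(2)  # fixed seed so the run is reproducible
rho = 0.8
chain = gibbs_bivariate_gaussian(rho, n_steps=50_000, x0=0.0, rng=rng)
kept = chain[1_000:]    # discard an (assumed) burn-in period

mean_x = sum(x for x, _ in kept) / len(kept)
mean_xy = sum(x * y for x, y in kept) / len(kept)
print("ergodic average of X:", mean_x)     # target value E[X] = 0
print("ergodic average of XY:", mean_xy)   # target value E[XY] = rho
```

Successive states are correlated (here $X_t$ depends on $X_{t-1}$ through a factor $\rho^2$), so the ergodic averages converge more slowly than averages of independent draws would; this is why the sketch uses a fairly long chain.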
[example: Gibbs Sampler for a Bivariate Gaussian] Let $(X, Y)$ follow a bivariate Gaussian distribution with means $\mu_X = \mu_Y = 0$, variances $\sigma_X^2 = \sigma_Y^2 = 1$, and correlation $\rho \in (-1, 1)$. The conditional distributions are: \begin{align*} X \mid Y = y &\sim \mathcal{N}(\rho y, 1 - \rho^2), \\ Y \mid X = x &\sim \mathcal{N}(\rho x, 1 - \rho^2). \end{align*} Starting from $X_0 = 0$, the Gibbs sampler alternates: draw $Y_1 \sim \mathcal{N}(0, 1-\rho^2)$, then $X_1 \sim \mathcal{N}(\rho Y_1, 1-\rho^2)$, and so on. Each conditional draw is a one-dimensional Gaussian, straightforward to simulate. The ergodic average $(1/N)\sum_{t=1}^N g(X_t, Y_t)$ converges almost surely to $\mathbb{E}[g(X,Y)]$ under the bivariate Gaussian. [/example] The Gibbs sampler generalises naturally to higher dimensions. For a parameter vector $(\theta_1, \ldots, \theta_d)$, one cycles through coordinates, updating each $\theta_j$ by drawing from $f_{\theta_j | \theta_{-j}}$, where $\theta_{-j}$ denotes all coordinates except the $j$-th. This cycling procedure preserves the joint posterior as the invariant distribution, and the ergodic theorem ensures that time averages converge to posterior expectations — the core quantity needed for Bayesian inference. [remark: Gibbs Sampling in Bayesian Posterior Computation] The Gibbs sampler is the practical engine behind much of modern Bayesian computation. When computing the posterior mean $\mathbb{E}[\theta \mid X_1, \ldots, X_n]$ for a multivariate parameter $\theta$ — a task that in principle requires integrating out all but one coordinate of the posterior — the Gibbs sampler replaces intractable integration with iterated conditional simulation. The ergodic theorem guarantees that sufficiently long chains provide arbitrarily accurate approximations. [/remark] Beyond resampling finite data, Monte Carlo methods enable us to generate samples from complex posterior distributions and simulate from arbitrary models. 
This computational bridge connects exact Bayesian inference to practical algorithms. # 23. Introduction to Nonparametric Statistics Throughout this course, statistical inference has operated within a parametric framework: the distribution generating the data belongs to some family $\{P_\theta : \theta \in \Theta\}$, and the problem reduces to estimating the finite-dimensional parameter $\theta \in \Theta \subset \mathbb{R}^d$. This chapter asks what happens when we abandon that assumption entirely. The goal of nonparametric statistics is to estimate the distribution itself — directly, without encoding it through a parameter — relying only on the i.i.d. sample at hand. The central object is the empirical distribution function, and the main results describe how well it approximates the unknown population distribution, both pointwise and uniformly. ## Estimating a Distribution Without a Parametric Model How should one estimate a distribution when no parametric family is assumed? A natural answer is to use the data to build an approximation of the cumulative distribution function (c.d.f.) directly. Recall that for a distribution $P$ on $\mathbb{R}$, the c.d.f. is defined by \begin{align*} F(t) = \mathbb{P}(X \leq t) = \mathbb{E}[\mathbf{1}_{(-\infty,t]}(X)], \quad t \in \mathbb{R}. \end{align*} Given i.i.d. observations $X_1, \ldots, X_n$ drawn from $P$, the empirical analogue replaces the expectation by an average over the sample. [definition: Empirical Distribution Function] Let $X_1, \ldots, X_n$ be i.i.d. real-valued random variables. The **empirical distribution function** $F_n : \mathbb{R} \to [0,1]$ is defined by \begin{align*} F_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{(-\infty,t]}(X_i) = \frac{\#\{i : X_i \leq t\}}{n}, \quad t \in \mathbb{R}. \end{align*} [/definition] The function $F_n$ is a step function that increases by $1/n$ at each observed value. 
It is the natural empirical analogue of $F$: for any fixed $t$, the random variable $\mathbf{1}_{(-\infty,t]}(X_i)$ is Bernoulli with success probability $F(t)$, so $F_n(t)$ is the sample mean of $n$ i.i.d. Bernoulli$(F(t))$ random variables. The law of large numbers immediately gives $F_n(t) \xrightarrow{a.s.} F(t)$ for every fixed $t$. The question is whether this convergence can be made uniform over all $t \in \mathbb{R}$ simultaneously. ## Uniform Convergence: The Glivenko–Cantelli Theorem Pointwise convergence of $F_n(t)$ to $F(t)$ is a direct consequence of the law of large numbers, but it leaves open the possibility that, for any fixed $n$, the approximation could be poor at some exceptional values of $t$. The Glivenko–Cantelli theorem closes this gap: the supremum deviation goes to zero almost surely. [quotetheorem:2004] The hypotheses here are minimal: the result holds for every distribution $P$, without any moment conditions or smoothness assumptions. This is what makes the theorem genuinely nonparametric — the uniform approximation guarantee applies regardless of the shape of $F$. The theorem can be understood as a uniform law of large numbers over the class of indicator functions $\{\mathbf{1}_{(-\infty,t]} : t \in \mathbb{R}\}$. Classical law of large numbers results handle a single function of the data; the Glivenko–Cantelli theorem handles an entire indexed family simultaneously. It is a prototype for a broad theory of uniform convergence over function classes, which underlies modern nonparametric statistics. ## The Brownian Bridge and Distributional Limits Glivenko–Cantelli tells us that $\|F_n - F\|_\infty \to 0$ almost surely, but it says nothing about the rate of convergence or the distributional behaviour of the rescaled deviations $\sqrt{n}(F_n - F)$. To describe the limit distribution, we need two stochastic processes. 
[definition: Brownian Motion] A **Brownian motion** (or Wiener process) is a continuous stochastic process $(W_t)_{t \geq 0}$ satisfying: - $W_0 = 0$, - for all $s < t$, the increment $W_t - W_s \sim N(0, t-s)$, independently of $(W_{s'})_{s' \leq s}$. [/definition] A formal proof of existence of Brownian motion requires tools from measure theory and the theory of Gaussian processes, which lie outside the scope of this course. Informally, Brownian motion is the continuous-time limit of a symmetric random walk: as the time step $\delta \to 0$ and the step size scales as $\sqrt{\delta}$, the rescaled walk converges to Brownian motion. For the study of the empirical distribution function, the relevant process is not Brownian motion itself but a closely related object defined on the interval $[0,1]$. [definition: Brownian Bridge] A **Brownian bridge** is a continuous process $(B_t)_{0 \leq t \leq 1}$ defined as a Brownian motion conditioned on the event $\{W_1 = 0\}$. It satisfies: - $B_0 = B_1 = 0$, - $B_t \sim N(0, t(1-t))$ for each $t \in [0,1]$, - $\operatorname{Cov}(B_s, B_t) = s(1-t)$ for $s \leq t$. An explicit construction is $B_t = W_t - t W_1$, where $(W_t)_{t \geq 0}$ is a standard Brownian motion. [/definition] The variance $t(1-t)$ vanishes at both endpoints $t=0$ and $t=1$, reflecting the boundary conditions $B_0 = B_1 = 0$. The process is largest in variance near the midpoint $t = 1/2$, giving the Brownian bridge a characteristic arch shape.  The connection between the Brownian bridge and the empirical distribution function is made precise by the Donsker–Kolmogorov–Doob theorem, which is the functional analogue of the central limit theorem. [quotetheorem:2005] The convergence here is in distribution as a process, meaning the law of the rescaled empirical process converges in the space of bounded functions on $\mathbb{R}$. The limit $G_F$ is the Brownian bridge composed with $F$: it is a reparametrisation of the standard Brownian bridge by the c.d.f. 
itself. To see this in the simplest case, suppose $X_1, \ldots, X_n \sim \mathrm{U}[0,1]$, so $F(t) = t$ for $t \in [0,1]$. Then $G_F(t) = B_t$, and $\sqrt{n}(F_n - F)$ converges to the standard Brownian bridge. The rescaled deviations are forced to zero near $t = 0$ and $t = 1$ because all data lie in $[0,1]$, and the fluctuations are largest near $t = 1/2$. For a general $F$, the process $G_F$ has the same structure, but the time parameter is warped by $F$.

## The Kolmogorov–Smirnov Statistic and Its Distribution

A striking consequence of the Donsker–Kolmogorov–Doob theorem is that, for continuous $F$, the supremum deviation $\sqrt{n}\|F_n - F\|_\infty$ has a limiting distribution that does not depend on $F$. This distribution-freeness is the key property that makes the Kolmogorov–Smirnov statistic useful in practice.

[quotetheorem:2006]

The limit distribution on the right-hand side, the supremum of the absolute value of a standard Brownian bridge, is a fixed distribution, entirely independent of $F$. This is because $G_F(t) = B_{F(t)}$ is merely a reparametrisation of $B$ by the monotone function $F$, and taking the supremum over all $t$ is the same as taking the supremum over all values $F(t)$ as $t$ ranges over $\mathbb{R}$. When $F$ is continuous, $F(t)$ takes every value in $(0,1)$, so this recovers the supremum of $|B|$ over $[0,1]$. The distribution of $\sup_{t \in [0,1]} |B_t|$ is known explicitly (it is the Kolmogorov distribution) and its quantiles are tabulated. This distribution-freeness is what makes the Kolmogorov–Smirnov statistic genuinely nonparametric: the same critical values apply for every continuous underlying distribution $F$.

### Hypothesis Testing with the Kolmogorov–Smirnov Statistic

Suppose we wish to test the null hypothesis $H_0: F = F_0$ against $H_1: F \neq F_0$, where $F_0$ is a specified continuous distribution. The Kolmogorov–Smirnov test uses the statistic
\begin{align*}
T_n = \sqrt{n}\,\|F_n - F_0\|_\infty.
\end{align*}
Under $H_0$, the Kolmogorov–Smirnov theorem gives $T_n \xrightarrow{d} \sup_{t \in [0,1]} |B_t|$.
A level-$\alpha$ test rejects $H_0$ when $T_n$ exceeds the $(1-\alpha)$ quantile of $\sup_{t \in [0,1]} |B_t|$, which can be read from tables. [example: Kolmogorov–Smirnov Test for Uniformity] Consider testing $H_0: F = F_0$ where $F_0$ is the $\mathrm{U}[0,1]$ c.d.f., i.e., $F_0(t) = t$ for $t \in [0,1]$. With $n$ observations, compute \begin{align*} T_n = \sqrt{n} \sup_{t \in [0,1]} \left|F_n(t) - t\right|. \end{align*} Since $F_0(t) = t$, the warped process $G_{F_0}(t) = B_{F_0(t)} = B_t$ is already a standard Brownian bridge, and under $H_0$ we have $T_n \xrightarrow{d} \|B\|_\infty$. The $95\%$ quantile of $\|B\|_\infty$ is approximately $1.358$. One therefore rejects the null hypothesis of uniformity at level $5\%$ if $T_n > 1.358$. [/example] ### Uniform Confidence Bands The Kolmogorov–Smirnov theorem also yields a nonparametric confidence band for the entire c.d.f. $F$. Let $q_\alpha$ denote the $(1-\alpha)$ quantile of $\|B\|_\infty$. Define the band \begin{align*} C_n(x) = \left[F_n(x) - \frac{q_\alpha}{\sqrt{n}},\; F_n(x) + \frac{q_\alpha}{\sqrt{n}}\right], \quad x \in \mathbb{R}. \end{align*} By the Kolmogorov–Smirnov theorem, $\mathbb{P}(F(x) \in C_n(x) \text{ for all } x \in \mathbb{R}) \to 1 - \alpha$ as $n \to \infty$. This is a **uniform** confidence band: a single band that contains the entire graph of $F$ with asymptotic probability $1 - \alpha$, simultaneously for all $x$. The width $2q_\alpha / \sqrt{n}$ decreases at the parametric $\sqrt{n}$ rate, and — crucially — the coverage probability and band width are the same for every distribution $F$, a consequence of the distribution-free limiting distribution. [remark: Distribution-Freeness] The distribution-freeness of the Kolmogorov–Smirnov limit is not a coincidence. It holds because $G_F$ is a Brownian bridge composed with $F$, and the supremum norm is invariant under strictly monotone reparametrisations of a process with continuous paths. 
This property is exploited repeatedly in nonparametric testing: by applying the probability integral transform $U_i = F(X_i)$, one reduces any hypothesis about a continuous distribution $F$ to a question about the uniform distribution, which has a universal answer. [/remark]
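The test for uniformity described above can be sketched as follows. Because $F_n$ is a step function, the supremum $\|F_n - F_0\|_\infty$ is attained at (or just before) an order statistic, so it suffices to compare $F_0(X_{(i)})$ with $i/n$ and $(i-1)/n$. The function names, the sample size, and the "skewed" alternative (squares of uniform draws, with c.d.f. $\sqrt{t}$) are assumptions made for illustration; the critical value $1.358$ is the approximate $95\%$ quantile quoted in the example.

```python
import math
import random


def ks_statistic(sample, cdf0):
    """Return T_n = sqrt(n) * sup_t |F_n(t) - F_0(t)| for a continuous F_0."""
    xs = sorted(sample)
    n = len(xs)
    sup_dev = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf0(x)
        # F_n jumps from (i-1)/n to i/n at the i-th order statistic,
        # so check the deviation on both sides of the jump.
        sup_dev = max(sup_dev, i / n - f0, f0 - (i - 1) / n)
    return math.sqrt(n) * sup_dev


def uniform_cdf(t):
    """C.d.f. of U[0,1], clamped outside the unit interval."""
    return min(max(t, 0.0), 1.0)


CRITICAL_95 = 1.358  # approximate 95% quantile of sup |B_t|, as quoted in the text

rng = random.Random(3)  # fixed seed so the run is reproducible
uniform_sample = [rng.random() for _ in range(2_000)]
skewed_sample = [u * u for u in uniform_sample]  # not uniform: c.d.f. is sqrt(t)

t_uniform = ks_statistic(uniform_sample, uniform_cdf)
t_skewed = ks_statistic(skewed_sample, uniform_cdf)
print("T_n, uniform data:", t_uniform, "reject H0:", t_uniform > CRITICAL_95)
print("T_n, skewed data: ", t_skewed, "reject H0:", t_skewed > CRITICAL_95)
```

The same quantity also gives the uniform confidence band: dividing the critical value by $\sqrt{n}$ yields the half-width $q_\alpha/\sqrt{n}$ to place around $F_n$.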

Created by admin on 4/24/2026 | Last updated on 4/24/2026

Cambridge II Principles of Statistics
