Statistics is the science of learning from data under uncertainty. This course develops the theory and practice of parametric statistical inference — the art of using observations from a probability model to draw conclusions about unknown parameters that govern that model. We begin with a simple but powerful framework: you observe data drawn from some distribution, that distribution depends on parameters you don't know, and you want to use the data to estimate those parameters, test hypotheses about them, or make predictions. This framework unifies the entire course and motivates every major topic, from likelihood-based estimation to hypothesis tests to linear regression.
The course is built around three interconnected problems. First, how do we produce point estimates and interval estimates of unknown parameters? We explore bias, mean squared error, sufficiency, and the likelihood principle — tools that guide us toward estimators with desirable properties. Second, how do we make binary decisions about parameters in the face of uncertainty? Hypothesis testing, grounded in the Neyman-Pearson framework, answers this by balancing Type I and Type II errors and extending ideas from estimation to composite hypotheses and goodness-of-fit tests. Third, how do we estimate relationships between variables? Linear models are the workhorse of applied statistics, and we develop their theory from first principles, connecting them to estimation and testing through the lens of the normal distribution and the Gauss–Markov theorem.
These three chapters are not separate topics but chapters of a single story. Chapter 0 establishes the parametric framework and the "three questions of inference" that motivate the course. Chapter 1 gives you estimators — point and interval estimates derived from likelihood and Bayesian thinking. Chapter 2 teaches you to make decisions using those estimates, introducing hypothesis tests as a formal tool for binary choices. Chapter 3 applies all of this machinery to linear models, the most important and widely used probability model in practice, where estimation, testing, and inference collapse into a single elegant theory. By the end, you will see statistics not as a collection of recipes, but as a coherent body of reasoning about how data constrains our knowledge of unknown parameters.
# Introduction
Statistics is, at its core, a discipline of principled uncertainty. We observe data — numbers generated by some process in the world — and we want to draw conclusions about that process. This course is concerned with the formal mathematical theory that makes such conclusions rigorous. The question is not merely "what does the data suggest?" but "what can we legitimately infer, with quantified confidence, from the data we have seen?"
The three chapters that follow cover the three fundamental problems of statistical inference: estimating an unknown quantity from data (Estimation), bounding how far our estimate might be from the truth (Confidence Intervals, developed within the Estimation chapter), and deciding between competing claims about the world (Hypothesis Testing). A final chapter on Linear Models shows how these tools apply to the most widely used class of statistical models. This introduction sets up the common framework that underlies all three.
## The Parametric Framework
Here is the central difficulty statistics must overcome: any finite dataset is compatible with infinitely many different probability distributions. Suppose you record $n = 20$ counts of some phenomenon. A distribution with mean 3.1 fits the data; so does one with mean 3.2; so does a distribution of a completely different shape whose mean happens to coincide with the sample mean. Without additional structure, no amount of data can distinguish among them — you would need to specify infinitely many numbers to pin down an arbitrary distribution over the integers, and a finite sample provides only $n$ numbers. The data alone are simply insufficient to uniquely identify the data-generating process.
The parametric approach resolves this by restricting attention to a *known family* of distributions indexed by a finite-dimensional parameter. The key insight is that once we commit to a family — say, all Poisson distributions — the entire probability law is determined by a single number $\mu > 0$. The infinite search space collapses to a one-dimensional one, and a moderate sample can meaningfully pin down $\mu$. This is not a claim that the world is literally Poisson; it is a modelling commitment that makes inference tractable.
[definition: Parametric Family]
A **parametric family** of distributions is a collection $\{f(\,\cdot\,;\theta) : \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^k$ is the **parameter space**, and for each $\theta \in \Theta$, $f(\,\cdot\,;\theta) : \mathcal{X} \to [0,\infty)$ is a probability density or mass function on the **sample space** $\mathcal{X}$.
Here $\mathcal{X}$ is the set of values each observation can take: $\mathcal{X} = \{0, 1, 2, \ldots\}$ for count data, $\mathcal{X} = \mathbb{R}$ for continuous measurements, $\mathcal{X} = [0,1]$ for proportions. The unknown $\theta \in \Theta$ is the **parameter** of the model. When $k = 1$ we speak of a scalar parameter; when $k > 1$, a vector parameter.
[/definition]
[example: Common Parametric Families]
Three families appear repeatedly in this course, and they illustrate two structurally different ways the choice of family matters.
**Poisson.** If $X \sim \operatorname{Pois}(\mu)$ with $\mu > 0$, then
\begin{align*}
f(x; \mu) = \frac{e^{-\mu}\mu^x}{x!}, \qquad x = 0, 1, 2, \ldots
\end{align*}
The parameter $\mu = \mathbb{E}[X]$ is the mean rate of occurrence. We know the data come from *some* Poisson distribution, but we do not know $\mu$.
**Binomial.** If $X \sim \operatorname{Bin}(n, \theta)$ with $n$ known and $\theta \in (0,1)$ unknown,
\begin{align*}
f(x; \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad x = 0, 1, \ldots, n.
\end{align*}
**Normal.** If $X \sim N(\mu, \sigma^2)$ with one or both of $\mu \in \mathbb{R}$, $\sigma^2 > 0$ unknown, the parameter is $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$. Here $\Theta$ is two-dimensional.
Now consider what the Poisson assumption buys us. Suppose we observe $n = 10$ counts: say $x = (2, 0, 3, 1, 4, 2, 1, 0, 3, 2)$. Under the Poisson model, the only free quantity is $\mu$, and these 10 numbers collectively pin it down: the maximum likelihood estimate (derived in Chapter 1) is simply $\hat{\mu} = \bar{x} = 1.8$. The parametric assumption has converted 10 numbers into a single inferential target.
Without the Poisson assumption, those same 10 observations are compatible with any distribution on $\{0,1,2,\ldots\}$ that assigns positive probability to the values $\{0,1,2,3,4\}$. A geometric distribution, a negative binomial, a distribution that places all its mass on five arbitrary points — all are consistent with the data. The sample cannot distinguish among them because each introduces its own free parameters that can be tuned to fit the observations. The parametric assumption is the commitment that converts an underdetermined problem into a tractable one.
Notice also that the choice of family has predictive consequences. If we believe the data are Poisson, then the probability of observing $X = 10$ in a future trial is $e^{-1.8}(1.8)^{10}/10! \approx 0.00003$. If instead we model the data as negative binomial with the same mean, the tail probabilities differ — possibly substantially. The family is not a neutral container; it shapes what the model predicts beyond the range of the observed data.
[/example]
The parametric assumption is what makes inference possible. Without it, any continuous density on $\mathbb{R}$ — and there are uncountably many — is compatible with any finite sample, since the sample has probability zero under every continuous distribution in isolation. By restricting to a known family parameterised by $\theta \in \Theta \subseteq \mathbb{R}^k$, we reduce the infinite-dimensional estimation problem to a finite-dimensional one, where the data can genuinely discriminate among models. The choice of family is therefore the first and most consequential modelling decision.
## Simple Random Samples
A single observation from $f(\,\cdot\,;\theta)$ is rarely enough to say anything useful about $\theta$. But repetition raises an immediate difficulty: if our observations are not independent of each other, or if they come from different distributions, the bookkeeping becomes intractable. Yesterday's measurement might influence today's; a patient recruited early in a trial might differ systematically from one recruited later. These dependencies and inhomogeneities, if not accounted for, silently corrupt every downstream inference — an estimator derived under the i.i.d. assumption applied to dependent data can be badly biased, and confidence intervals can fail to cover the true parameter at anything near their nominal level. The simplest setting that avoids all these complications is one where each observation is a fresh, independent draw from the same distribution.
[definition: Simple Random Sample]
A **simple random sample** of size $n$ from $f(\,\cdot\,;\theta)$ is a collection $X_1, X_2, \ldots, X_n$ of random variables that are independent and identically distributed (i.i.d.), each with density or mass function $f(\,\cdot\,;\theta)$. Formally, each $X_i : \Omega \to \mathcal{X}$ is a random variable on the underlying probability space $(\Omega, \mathcal{F}, \mathbb{P})$, taking values in the sample space $\mathcal{X}$.
We write the sample as the vector $X = (X_1, \ldots, X_n)$. After observing the experiment, we obtain the **observed sample** $x = (x_1, \ldots, x_n)$, the realised values.
[/definition]
[remark: Random Variables vs Observed Values]
Throughout the course we maintain the convention that capital $X$ denotes the random vector (before observation) and lower-case $x$ denotes the observed numerical values. A function $T(X)$ is a random variable; $T(x)$ is a number. This distinction matters when we discuss the sampling distribution of estimators.
[/remark]
Because the observations are i.i.d., the joint density of the sample factorises:
\begin{align*}
f(x; \theta) = \prod_{i=1}^{n} f(x_i; \theta).
\end{align*}
This product structure is what makes the likelihood function — defined properly in Chapter 1 — tractable to maximise. Independence turns a complicated multivariate density into a simple product; this is the mathematical payoff for the i.i.d. assumption.
## The Three Questions of Statistical Inference
To see why statistical inference needs a formal framework, consider a concrete situation. An epidemiologist is studying a rare disease and wants to estimate its prevalence $\theta$ in a population. She recruits $n = 200$ individuals, tests each, and observes $k = 14$ positive cases. Three questions immediately arise, and they have quite different characters.
The first is: what is her best single guess for $\theta$? The raw proportion $\hat{\theta} = 14/200 = 0.07$ is a natural answer, but is it the *best* possible answer? Could a different function of the data do better? The second question is: how confident should she be in that guess? Reporting $\hat{\theta} = 0.07$ without any measure of uncertainty could be dangerously misleading — the true prevalence might be anywhere from $0.04$ to $0.12$ for all the data can tell. She needs an interval, not just a point. The third question arises when a policy decision depends on a threshold: is the prevalence above $0.05$, which would trigger a public health intervention? This is a yes-or-no decision, and making it incorrectly in either direction has real consequences. These three questions — point estimation, interval estimation, and hypothesis testing — are the three core problems the course addresses.
**Point estimation.** Can we produce a single number $\hat{\theta}(x)$ that is our best guess of the true parameter value? This is the most basic task: summarise what the data tell us about $\theta$ in a single estimate. Chapter 1 develops the theory of how to construct good estimators and how to measure their quality.
**Interval estimation.** A point estimate carries no information about how certain we are. Rather than reporting a single value, we might report an interval $(\hat{\theta}_1(x),\, \hat{\theta}_2(x))$ with the guarantee that such intervals capture the true $\theta$ with high probability, say 95%. These are *confidence intervals*, and their construction is woven through Chapter 1 alongside point estimation.
**Hypothesis testing.** Sometimes we want to decide between two competing hypotheses: for example, whether $\theta = 0$ (a drug has no effect) or $\theta \neq 0$ (the drug does something). Chapter 2 develops the formal framework for such decisions, including how to control the probability of making an error.
More formally, using the sample $x$ we aim to:
- produce a **point estimate** $\hat{\theta}(x) \in \Theta$ for the true value of $\theta$;
- produce an **interval estimate** $(\hat{\theta}_1(x),\, \hat{\theta}_2(x))$ for $\theta$;
- **test a hypothesis** about $\theta$, such as $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$.
All three tasks share the same underlying logic: we use the observed data, together with the assumed probabilistic model, to make a controlled statement about the unknown parameter.
## What This Course Covers
The course is organised into three main chapters. The ordering is not arbitrary — each chapter addresses a difficulty left open by the previous one.
**Chapter 1: Estimation** confronts a fundamental question: there are many functions of the data we could call an estimator, so which one should we use? The chapter formalises this choice through the concepts of bias, variance, and mean squared error, giving tools to compare estimators quantitatively. A deeper question then emerges: some summaries of the data discard information about $\theta$, and some do not — how do we identify the summaries that preserve everything relevant? This leads to *sufficient statistics* and the Rao–Blackwell theorem, which shows how to systematically improve any estimator by conditioning on a sufficient statistic. The chapter culminates in *maximum likelihood estimation*, the principled answer to the question of how to construct a good estimator in the first place, and the theory of confidence intervals.
**Chapter 2: Hypothesis Testing** addresses a difficulty that point estimates cannot resolve: when the data must drive a binary decision — intervene or do not, reject or do not reject — how do we make that decision in a controlled way? A test that always rejects is perfectly powerful but useless; a test that never rejects never makes an error but is equally useless. The surprising consequence is that there is a precise mathematical sense in which *optimal* tests exist, and the Neyman–Pearson lemma characterises them for simple hypotheses. For the harder composite case — where the parameter under $H_0$ or $H_1$ is not fully specified — we develop uniformly most powerful tests and the generalised likelihood ratio method, which yields the $t$-tests and $F$-tests that arise throughout applied statistics.
**Chapter 3: Linear Models** brings estimation and testing together in the most important family of statistical models. The difficulty here is that in most scientific applications, data are not exchangeable draws from a single distribution — they are collected under different conditions, with different covariates. We observe $Y = X\beta + \varepsilon$, where $X$ is a known design matrix, $\beta$ is an unknown parameter vector, and $\varepsilon$ represents random noise. The maximum likelihood estimator coincides with the least squares estimator, and the Gauss–Markov theorem characterises its optimality. The chapter covers inference for $\beta$, geometric interpretation of least squares, simple linear regression, and one-way analysis of variance.
[motivation]
### Why Parametric Inference?
One might ask: why restrict to a parametric family at all? Real data are messy, and the true distribution generating them is never exactly Poisson or exactly normal. Two responses are in order.
First, the parametric assumption is not a claim about reality — it is a *model*. A model is a deliberate simplification that captures the essential features of a phenomenon while being tractable to analyse. The Poisson distribution models counts of rare events; the normal distribution arises universally through the Central Limit Theorem. Using these models introduces some approximation error, but it buys us a coherent framework for drawing conclusions from limited data.
Second, the alternative — making no distributional assumptions — is genuinely harder. Nonparametric inference is a rich subject, but it requires more data and yields less precise conclusions. The parametric theory developed in this course is both foundational and practically powerful, and it provides the conceptual vocabulary that nonparametric and semiparametric methods build on.
### The Role of Probability Theory
This course assumes familiarity with probability at the level of IA Probability. Random variables, expectation, variance, common distributions, and the Central Limit Theorem are prerequisites, not topics to be introduced here. What the course adds is a new perspective: rather than computing the properties of a known distribution, we use observed data to *recover* an unknown distribution (within a parametric family). This inversion — from data to model — is the defining move of statistical inference.
[/motivation]
We now turn from the abstract problem of choosing a good estimator to the practical task of using data to infer the true parameter: estimation theory provides the machinery to construct point estimates and confidence intervals, but it leaves open the question of how to make decisions when uncertainty remains.
# 1. Estimation
## Estimators
The parametric framework established in the introduction gives us a precise version of our problem: we observe $X_1, \ldots, X_n$ drawn i.i.d. from a distribution with density or mass function $f(\cdot\,;\theta)$, where $\theta \in \Theta$ is an unknown parameter. We know the family — say, Poisson, or Normal with known variance — but not the value of $\theta$. The first task is to turn the observed data into a sensible guess for $\theta$. This section makes that process precise.
### The estimator–estimate distinction
Given a realised sample $x = (x_1, \ldots, x_n)$, any function $T(x)$ that produces a number (or vector) in a plausible range for $\theta$ is a candidate guess. Before the experiment is run, however, the data are random: $X = (X_1, \ldots, X_n)$. So the same function, evaluated at the random vector $X$, becomes a random variable $T(X)$. This is the estimator; its numerical value $T(x)$ after a particular experiment is the estimate.
The distinction matters because we want to ask probabilistic questions: how close is $T(X)$ to $\theta$, on average? What is the spread of $T(X)$ around $\theta$? These are questions about the distribution of the random variable $T(X)$, not about the single number $T(x)$.
[definition: Statistic]
Let $\mathcal{X}$ denote the sample space of a single observation — for instance, $\mathcal{X} = \mathbb{R}$ for continuous data or $\mathcal{X} = \{0, 1, 2, \ldots\}$ for count data — and let $X = (X_1, \ldots, X_n)$ take values in $\mathcal{X}^n$. A **statistic** is a measurable map
\begin{align*}
T : \mathcal{X}^n \to \mathbb{R}^k, \qquad x \mapsto T(x),
\end{align*}
for some $k \ge 1$. The random variable $T(X)$ obtained by composing $T$ with the random sample $X$ is called an **estimator** of $\theta$; the numerical value $\hat{\theta} = T(x)$ produced by a particular observed sample is called an **estimate** of $\theta$.
The probability distribution of $T(X)$ is the **sampling distribution** of the statistic.
[/definition]
The sampling distribution is what connects a statistic to the true parameter. It tells us how the estimator behaves across all possible samples the experiment could have produced.
[example: Sample Mean for Normal Data]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, 1)$, where $\mu$ is unknown. A natural estimator of $\mu$ is the sample mean:
\begin{align*}
T(X) = \frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
For any particular observed sample $x = (x_1, \ldots, x_n)$, the corresponding estimate is
\begin{align*}
T(x) = \frac{1}{n}\sum_{i=1}^n x_i.
\end{align*}
What is the sampling distribution of $T$? Since $X_i \sim N(\mu, 1)$ independently, a sum of independent normals is again normal: $\sum_{i=1}^n X_i \sim N(n\mu, n)$, so
\begin{align*}
T(X) \sim N\!\left(\mu,\, \frac{1}{n}\right).
\end{align*}
Two features of this distribution are worth noting. First, the sampling distribution is centred exactly at $\mu$: on average, the sample mean recovers the true parameter. Second, the variance $1/n$ shrinks as we collect more data, so with a large sample the estimator is tightly concentrated around $\mu$. These are desirable properties. By the Central Limit Theorem, even if the $X_i$ were not normally distributed, $T(X)$ would still be approximately $N(\mu, 1/n)$ for large $n$ — but here the result is exact for every sample size $n$.
[/example]
### Bias
The sample mean above is an intuitively sensible estimator. Not all estimators are. Consider $T(X) = X_1$ — it ignores all the data except the first observation. Worse, consider $T(X) = 0.32$ — a constant that does not depend on the data at all. Both are technically statistics in the sense of the definition, and neither is absurd in every conceivable situation, but they are poor choices in general. We need a way to articulate what is wrong with them.
The fundamental requirement for an estimator is that it should be aiming at the right target: $T(X)$ should be centred at $\theta$. If $\mathbb{E}_\theta[T(X)] \neq \theta$ — if the estimator is systematically off — it will produce wrong answers even with unlimited data. This is what bias measures.
[definition: Bias]
Let $\hat{\theta} = T(X)$ be an estimator of $\theta$. The **bias** of $\hat{\theta}$ is
\begin{align*}
\operatorname{bias}(\hat{\theta}) = \mathbb{E}_\theta[\hat{\theta}] - \theta.
\end{align*}
The subscript $\theta$ in $\mathbb{E}_\theta$ indicates that the expectation is taken with respect to the distribution $f(\cdot\,;\theta)$ governing the data. The estimator is **unbiased** if $\operatorname{bias}(\hat{\theta}) = 0$ for all $\theta \in \Theta$, i.e.\ $\mathbb{E}_\theta[\hat{\theta}] = \theta$ for all $\theta$.
[/definition]
To compute $\mathbb{E}_\theta[T(X)]$ in practice, one either works directly with $T$ as a function of $X_1, \ldots, X_n$ and uses linearity of expectation, or first identifies the sampling distribution of $T$ and reads off its mean.
[example: Checking Bias]
Returning to the sample mean $T(X) = \frac{1}{n}\sum_{i=1}^n X_i$ with $X_i \sim N(\mu, 1)$:
\begin{align*}
\mathbb{E}_\mu[T(X)] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_\mu[X_i] = \frac{1}{n} \cdot n\mu = \mu.
\end{align*}
So $\operatorname{bias}(T) = \mu - \mu = 0$: the sample mean is unbiased for $\mu$.
Now consider the constant estimator $T(X) = 0.32$. Then $\mathbb{E}_\theta[T] = 0.32$ regardless of $\theta$, so $\operatorname{bias}(T) = 0.32 - \theta$. Unless the true parameter happens to equal $0.32$, this estimator is biased — and even when the bias is zero, it is by accident rather than design. No amount of additional data improves it, since the estimator ignores the data entirely.
[/example]
[remark: Unbiasedness Alone Is Not Enough]
Unbiasedness is a necessary sanity check, not a complete measure of quality. An estimator can be unbiased yet have enormous variance, spreading its sampling distribution widely around $\theta$ so that individual estimates are far from the truth. The estimator $T(X) = X_1$ is unbiased for $\mu$ — since $\mathbb{E}_\mu[X_1] = \mu$ — but it ignores $n - 1$ data points and so is far more variable than the sample mean. We need a single number that captures both the bias and the spread of $T(X)$ around $\theta$.
[/remark]
This is the role of mean squared error, which we turn to next.
## Mean Squared Error
Bias alone is not enough to judge an estimator. The previous section showed that unbiasedness — $\mathbb{E}_\theta[\hat\theta] = \theta$ — is a natural desideratum, but an unbiased estimator can still be terrible in practice. Consider a sample of 1000 independent observations $X_1, \ldots, X_{1000}$. The estimator $\tilde\mu = X_1$ is perfectly unbiased, yet it ignores the information in $X_2, \ldots, X_{1000}$ entirely. Conversely, the estimator $T'(X) = 0.01 + \frac{1}{1000}\sum_{i=1}^{1000} X_i$ is biased (its bias is $0.01$), but it concentrates much more tightly around $\theta$. In any practical sense $T'$ is preferable. This suggests we need a measure that accounts for both the systematic error (bias) and the random scatter (variance) of an estimator.
### The Decomposition
[definition: Mean Squared Error]
Let $\hat\theta$ be an estimator of $\theta$. The **mean squared error** of $\hat\theta$ is
\begin{align*}
\mathrm{mse}(\hat\theta) = \mathbb{E}_\theta\!\left[(\hat\theta - \theta)^2\right].
\end{align*}
The **root mean squared error** is $\sqrt{\mathrm{mse}(\hat\theta)}$, which has the same units as $\theta$.
[/definition]
The MSE decomposes cleanly into variance and squared bias. Writing $\hat\theta - \theta = (\hat\theta - \mathbb{E}_\theta[\hat\theta]) + (\mathbb{E}_\theta[\hat\theta] - \theta)$ and expanding the square:
[quotetheorem:1424]
[citeproof:1424]
This identity makes the trade-off precise: reducing bias increases the estimator's freedom to move around $\theta$, potentially inflating variance; shrinking variance often requires imposing structure that introduces bias. Neither component is free.
Three further points are worth emphasising. First, the decomposition does **not** tell us the minimum achievable MSE — it merely splits whatever MSE an estimator happens to have into two additive pieces. Determining how small the variance can be made requires a genuinely deeper tool, the Cramér–Rao lower bound, which the present course does not develop. Second, if we restrict attention to **unbiased** estimators, the bias term vanishes and the identity collapses to $\mathrm{mse}(\hat\theta) = \operatorname{Var}_\theta(\hat\theta)$; minimising MSE within this class reduces to minimising variance, which motivates the search for the **uniformly minimum-variance unbiased (UMVU)** estimator. Third, the Rao–Blackwell theorem (proved later in this chapter) will make the identity actionable by showing concretely how to reduce variance while preserving unbiasedness — turning the decomposition from a descriptive accounting identity into a constructive recipe.
### The Bias-Variance Trade-Off in Practice
[example: Binomial Proportion Estimators]
Suppose $X \sim \mathrm{Binomial}(n, \theta)$ where $n$ is known and $\theta \in (0,1)$ is unknown. We compare two estimators of $\theta$.
**Unbiased estimator.** The natural estimator is $T_U = X/n$, the sample proportion. It is unbiased, and its variance is
\begin{align*}
\operatorname{Var}_\theta(T_U) = \frac{\theta(1-\theta)}{n},
\end{align*}
so $\mathrm{mse}(T_U) = \theta(1-\theta)/n$.
**Biased estimator.** Define
\begin{align*}
T_B = \frac{X + 1}{n + 2} = w\,\frac{X}{n} + (1-w)\,\frac{1}{2}, \qquad w = \frac{n}{n+2}.
\end{align*}
This is a weighted average of the sample proportion and the prior guess $1/2$. Its bias is
\begin{align*}
\operatorname{bias}(T_B) = \mathbb{E}_\theta[T_B] - \theta = \frac{n\theta + 1}{n+2} - \theta = (1-w)\!\left(\tfrac{1}{2} - \theta\right),
\end{align*}
and its variance is
\begin{align*}
\operatorname{Var}_\theta(T_B) = \frac{\theta(1-\theta)}{(n+2)^2} = w^2\,\frac{\theta(1-\theta)}{n}.
\end{align*}
Hence
\begin{align*}
\mathrm{mse}(T_B) = w^2\,\frac{\theta(1-\theta)}{n} + (1-w)^2\!\left(\tfrac{1}{2} - \theta\right)^2.
\end{align*}
For $n = 10$, $w = 10/12 = 5/6$, and one finds that $\mathrm{mse}(T_B) < \mathrm{mse}(T_U)$ for all $\theta$ not too close to $0$ or $1$. The biased estimator wins across most of the parameter space precisely because it sacrifices a little bias to gain a larger reduction in variance.
[/example]
[remark: The Bias-Variance Trade-Off]
The observation that a biased estimator can have smaller MSE than an unbiased one is called the **bias-variance trade-off**. It is a fundamental principle in statistics and machine learning: forcing $\operatorname{bias} = 0$ is a constraint that prevents the estimator from using certain shrinkage strategies that reduce variance, and the constraint may cost more than it saves.
[/remark]
### When Unbiasedness Forces Nonsense
The Binomial example shows that bias and MSE can point in different directions. But there is a more extreme phenomenon: in some problems, insisting on unbiasedness produces an estimator that is not merely suboptimal but genuinely absurd.
[example: Poisson Unbiased Estimator for $e^{-2\lambda}$]
Suppose $X \sim \mathrm{Poisson}(\lambda)$ and we wish to estimate $\theta = [P(X = 0)]^2 = e^{-2\lambda}$. Any unbiased estimator $T(X)$ must satisfy $\mathbb{E}_\lambda[T(X)] = e^{-2\lambda}$ for all $\lambda > 0$, which means
\begin{align*}
e^{-\lambda}\sum_{x=0}^{\infty} T(x)\,\frac{\lambda^x}{x!} = e^{-2\lambda}.
\end{align*}
Multiplying both sides by $e^\lambda$ gives $\sum_{x=0}^\infty T(x)\,\lambda^x/x! = e^{-\lambda} = \sum_{x=0}^\infty (-1)^x \lambda^x/x!$. Comparing coefficients of $\lambda^x$ yields $T(x) = (-1)^x$.
So the unique unbiased estimator is $T(X) = (-1)^X$: it estimates $e^{-2\lambda} \in (0,1)$ to be $+1$ when $X$ is even and $-1$ when $X$ is odd. No observation of $X$ can ever return a value in the range $(0,1)$ where $\theta$ actually lives. This is an estimator that cannot possibly be correct in any individual trial.
[/example]
[remark: Unbiasedness Is Not Sacred]
The Poisson example demonstrates that unbiasedness is a property of the estimator's average behaviour, not its individual realisations. An estimator can be unbiased and yet produce values that are impossible values of $\theta$. MSE — and more generally, the question of what information in the data we should actually use — is the right lens through which to evaluate estimators.
[/remark]
### Looking Forward
The Binomial and Poisson examples raise a question the MSE decomposition cannot answer on its own: **given a particular model, how small can the MSE be?** Is the sample mean $\bar X$ the best we can do for estimating a Poisson mean, or could some other function of $X_1, \ldots, X_n$ do better?
To answer this, we need to understand which functions of the data actually carry information about $\theta$ and which merely add noise. This is the idea of **sufficiency**: a statistic $T(X)$ is sufficient for $\theta$ if it captures everything in the sample that is relevant to $\theta$. The next section develops this concept precisely, and the Rao–Blackwell theorem will then show that any estimator can be improved — in MSE — by conditioning on a sufficient statistic. This gives a systematic method for constructing estimators that are optimal within the class of unbiased estimators.
## Sufficiency
The previous section raised a natural question: can we always do better than the naive estimator? Before we can answer this systematically, we need to understand which summaries of the data actually carry the full information about $\theta$, and which discard some of it irretrievably.
Think about what an experiment produces. We observe $X = (X_1, \ldots, X_n)$, a vector of $n$ data points. Almost always, we do not care about the individual values for their own sake — we only care about what they tell us about the unknown parameter $\theta$. We immediately ask: is there a lower-dimensional summary $T(X)$ that captures everything the data says about $\theta$, so that the raw observations $X$ become redundant once we know $T$?
Some summaries discard information in an obvious way. If we observe $n$ coin flips and record only whether the first flip was heads, we have thrown away most of what the data tells us about the bias $\theta$. Other summaries are lossless: the total number of heads $T = \sum X_i$ retains every feature of the data that is relevant to estimating $\theta$, even though it compresses $n$ binary values into a single number. The concept of a sufficient statistic makes this intuition precise.
### The Factorisation Criterion
To build intuition, consider a concrete case first.
[example: Bernoulli Sufficient Statistic]
Let $X_1, \ldots, X_n$ be i.i.d. $\text{Bernoulli}(\theta)$, so $\mathbb{P}(X_i = 1) = \theta$ for some $\theta \in (0,1)$. The joint probability mass function of the sample is
\begin{align*}
f_{X}(x \mid \theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n - \sum x_i}.
\end{align*}
Notice that the right-hand side depends on $x$ only through $T(x) = \sum x_i$, the total number of ones. Now suppose we are told $T(X) = t$. The conditional distribution of $X$ given $T = t$ is
\begin{align*}
f_{X \mid T=t}(x) = \frac{\mathbb{P}_\theta(X = x)}{\mathbb{P}_\theta(T = t)} = \frac{\theta^t(1-\theta)^{n-t}}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} = \binom{n}{t}^{-1},
\end{align*}
which does not depend on $\theta$ at all. Once we know the total count $t$, the particular arrangement of ones and zeros across the $n$ positions is uniformly distributed over all $\binom{n}{t}$ arrangements — and this randomness carries no information about the coin's bias. So $T = \sum X_i$ is a lossless compression of the data with respect to inference about $\theta$.
[/example]
This motivates the central definition of the section.
[definition: Sufficient Statistic]
A statistic $T = T(X)$ is **sufficient** for $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$, for every value $t$ in the range of $T$.
[/definition]
In the discrete case, sufficiency means that once we condition on the value of $T$, the probability of any specific sample $x$ is the same under every value of $\theta$. The sufficient statistic has absorbed all the $\theta$-dependence, leaving a residual distribution that is parameter-free.
Checking the definition directly — computing the conditional distribution and verifying it is $\theta$-free — is often cumbersome. The following theorem replaces this with a factorisation condition on the joint density, which is almost always easy to read off.
[quotetheorem:1425]
The theorem (due to Fisher and later formalised by Neyman) converts the question "is $T$ sufficient?" into a question about whether all the $\theta$-dependence in the joint density can be routed through $T(x)$. The function $h(x)$ is the part of the density that carries no information about $\theta$; the function $g$ is where all the parameter-dependence lives.
[example: Bernoulli via Factorisation]
Continuing the Bernoulli example: we have $f_{X}(x \mid \theta) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}$. Taking $g(t, \theta) = \theta^t(1-\theta)^{n-t}$ and $h(x) = 1$ immediately gives the factorisation, confirming that $T(X) = \sum X_i$ is sufficient for $\theta$.
[/example]
[example: Uniform Maximum]
Let $X_1, \ldots, X_n$ be i.i.d. $\text{Uniform}[0, \theta]$. The joint density is
\begin{align*}
f_{X}(x \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} \mathbf{1}_{[0 \le x_i \le \theta]} = \frac{1}{\theta^n} \mathbf{1}_{[\max_i x_i \le \theta]}\, \mathbf{1}_{[\min_i x_i \ge 0]}.
\end{align*}
Setting $T(x) = \max_i x_i$, we read off the factorisation with $g(t, \theta) = \frac{1}{\theta^n}\mathbf{1}_{[t \le \theta]}$ and $h(x) = \mathbf{1}_{[\min_i x_i \ge 0]}$. So $T = \max_i X_i$ is sufficient for $\theta$.
[/example]
[remark: Non-Uniqueness of Sufficient Statistics]
Sufficient statistics are not unique. If $T$ is sufficient for $\theta$, then so is any one-to-one function of $T$, since the factorisation carries through immediately. At the other extreme, $X$ itself is always sufficient (take $g(x, \theta) = f_{X}(x \mid \theta)$ and $h \equiv 1$), but it provides no data reduction at all. This raises the question of what counts as the "best" or most compressed sufficient statistic.
[/remark]
### Minimal Sufficiency
Sufficiency tells us that $T$ retains all information about $\theta$. But there may be many sufficient statistics, some more compressed than others. Any sufficient statistic induces a partition of the sample space $\mathcal{X}^n$: two data vectors $x$ and $y$ are in the same cell if and only if $T(x) = T(y)$. Knowing which cell $X$ falls into is all we need for inference about $\theta$. A coarser partition (fewer cells) means greater compression. The *minimal* sufficient statistic achieves the coarsest partition that is still consistent with sufficiency — it extracts the most information while discarding as much irrelevant randomness as possible.
[definition: Minimal Sufficient Statistic]
A sufficient statistic $T(X)$ is **minimal sufficient** for $\theta$ if it is a function of every other sufficient statistic. Equivalently, $T$ is minimal sufficient if for every sufficient statistic $T'$, the implication
\begin{align*}
T'(x) = T'(y) \implies T(x) = T(y)
\end{align*}
holds for all $x, y$ in the sample space.
[/definition]
The condition says that $T$ is a function of $T'$: whenever $T'$ cannot distinguish $x$ from $y$, neither can $T$. So $T$ makes at most as many distinctions as any other sufficient statistic — it partitions the sample space as coarsely as possible while remaining sufficient.
Finding minimal sufficient statistics from the definition requires checking all other sufficient statistics, which is impractical. The following theorem gives a direct characterisation via the likelihood ratio.
[quotetheorem:1426]
The key insight is that the likelihood ratio $f_{X}(x;\theta)/f_{X}(y;\theta)$ being $\theta$-free is exactly the condition under which $x$ and $y$ provide equally strong evidence about $\theta$ for every possible $\theta$. The minimal sufficient statistic groups together precisely those data vectors that are evidentially equivalent.
[example: Normal Minimal Sufficient Statistic]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. Computing the likelihood ratio:
\begin{align*}
\frac{f_{X}(x \mid \mu, \sigma^2)}{f_{X}(y \mid \mu, \sigma^2)} &= \exp\!\left\{-\frac{1}{2\sigma^2}\!\left(\sum_i x_i^2 - \sum_i y_i^2\right) + \frac{\mu}{\sigma^2}\!\left(\sum_i x_i - \sum_i y_i\right)\right\}.
\end{align*}
This is constant in $(\mu, \sigma^2)$ if and only if $\sum_i x_i^2 = \sum_i y_i^2$ and $\sum_i x_i = \sum_i y_i$. By the Minimal Sufficiency Criterion, $T(X) = \!\left(\sum_i X_i,\, \sum_i X_i^2\right)$ is minimal sufficient for $(\mu, \sigma^2)$.
This is equivalent to taking $(\bar{X}, S^2)$, the sample mean and sample variance, since these are one-to-one functions of $(\sum X_i, \sum X_i^2)$. With two parameters to estimate, two summary statistics are needed, which is exactly what minimality gives us.
[/example]
### The Rao–Blackwell Theorem
We now have a way to identify which summaries of the data contain all the information about $\theta$. This turns out to have a striking consequence for estimation: given *any* estimator of $\theta$, we can always replace it with one that is a function of a sufficient statistic and is no worse in mean squared error. The result is the Rao–Blackwell theorem, which is one of the most useful tools in classical estimation theory.
The construction is elegant. Given an estimator $\tilde{\theta}$ and a sufficient statistic $T$, define a new estimator by taking the conditional expectation of $\tilde{\theta}$ given $T$:
\begin{align*}
\hat{\theta}(x) = \mathbb{E}[\tilde{\theta}(X) \mid T(X) = T(x)].
\end{align*}
Since $T$ is sufficient for $\theta$, the conditional distribution of $X$ given $T = t$ does not depend on $\theta$. Therefore this expectation is $\theta$-free, so $\hat{\theta}$ is a genuine statistic — it can actually be computed from the data without knowing $\theta$.
[quotetheorem:1427]
The Rao–Blackwell theorem says we never need to look beyond the sufficient statistic for a good estimator. Any estimator not already a function of $T$ wastes some information and can be improved by conditioning it on $T$. The process of replacing $\tilde{\theta}$ by $\hat{\theta}$ is called *Rao–Blackwellisation*.
Two further points are worth noting. First, if $\tilde{\theta}$ is unbiased, so is $\hat{\theta}$, so Rao–Blackwellisation preserves unbiasedness while reducing variance. Second, if $\tilde{\theta}$ is already a function of $T$, then $\hat{\theta} = \tilde{\theta}$ and nothing changes.
[example: Poisson and the Probability of Zero]
Let $X_1, \ldots, X_n$ be i.i.d. $\text{Poisson}(\lambda)$, and suppose we wish to estimate $\theta = e^{-\lambda} = \mathbb{P}(X_1 = 0)$. The joint mass function is
\begin{align*}
p_{X}(x \mid \lambda) = \frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod x_i!},
\end{align*}
and a factorisation with $T = \sum X_i$ is immediate: $g(t, \lambda) = e^{-n\lambda}\lambda^t$ and $h(x) = (\prod x_i!)^{-1}$. So $T = \sum X_i$ is sufficient for $\lambda$ (hence for $\theta$), and $T \sim \text{Poisson}(n\lambda)$.
A natural but crude unbiased estimator is $\tilde{\theta} = \mathbf{1}_{[X_1 = 0]}$: we estimate the probability of observing zero by observing whether the first trial produced zero. This is unbiased since $\mathbb{E}[\tilde{\theta}] = \mathbb{P}(X_1 = 0) = e^{-\lambda} = \theta$. But it ignores all observations after the first.
Rao–Blackwellising: using independence and the Poisson sum formula,
\begin{align*}
\hat{\theta} = \mathbb{E}[\mathbf{1}_{X_1 = 0} \mid T = t] = \mathbb{P}\!\left(X_1 = 0 \,\Big|\, \sum_{i=1}^n X_i = t\right) = \frac{\mathbb{P}(X_1 = 0)\,\mathbb{P}\!\left(\sum_{i=2}^n X_i = t\right)}{\mathbb{P}\!\left(\sum_{i=1}^n X_i = t\right)} = \left(\frac{n-1}{n}\right)^t.
\end{align*}
So the improved estimator is $\hat{\theta} = \left(1 - \tfrac{1}{n}\right)^{\sum X_i}$. For large $n$, this is approximately $e^{-\bar{X}}$, which is the natural plug-in estimator of $e^{-\lambda}$ using the sample mean to estimate $\lambda$. Rao–Blackwellisation has led us to a much more sensible estimator.
[/example]
[example: Uniform Distribution and the Sample Maximum]
Let $X_1, \ldots, X_n$ be i.i.d. $\text{Uniform}[0,\theta]$. We wish to estimate $\theta$. We have shown that $T = \max_i X_i$ is sufficient. A naive unbiased estimator is $\tilde{\theta} = 2X_1$, which uses only the first observation.
Conditioning on $T = t$ requires some care. We split on whether $X_1$ achieves the maximum:
\begin{align*}
\mathbb{E}[2X_1 \mid T = t] &= 2\mathbb{E}[X_1 \mid X_1 = \max X_i]\,\mathbb{P}(X_1 = \max X_i) \\
&\quad + 2\mathbb{E}[X_1 \mid X_1 \ne \max X_i, \max X_i = t]\,\mathbb{P}(X_1 \ne \max X_i) \\
&= 2\!\left(t \cdot \frac{1}{n} + \frac{t}{2}\cdot\frac{n-1}{n}\right) = \frac{n+1}{n}\,t.
\end{align*}
So the Rao–Blackwellised estimator is $\hat{\theta} = \frac{n+1}{n}\max_i X_i$. This is a well-known result: the sample maximum underestimates $\theta$ on average, and the correction factor $\frac{n+1}{n}$ exactly de-biases it. The resulting estimator uses all the observations (through the maximum) and has strictly smaller MSE than $2X_1$.
[/example]
The Rao–Blackwell theorem closes the first part of our investigation into good estimators. We now know that the search for a good estimator can be confined to the class of functions of the sufficient statistic. The remaining question is: how do we choose *which* function of the sufficient statistic to use? The next section introduces the likelihood function and the method of maximum likelihood, which gives a principled and broadly applicable answer.
## Likelihood
The previous sections gave us tools to judge whether a given estimator is good — unbiasedness, minimum variance, and sufficiency. But none of these tell us how to construct an estimator in the first place. Given observed data $x$, the natural question is: which value of $\theta$ is most consistent with what we saw? This question has a precise answer through the principle of maximum likelihood.
### The Likelihood Function and Log-Likelihood
Suppose $X_1, \ldots, X_n$ are random variables with joint pdf or pmf $f_{X}(x \mid \theta)$. Once we observe $X = x$, this joint density becomes a function of $\theta$ alone — it measures, loosely, how probable the observed data would be if the true parameter were $\theta$.
[definition: Likelihood Function]
Let $X_1, \ldots, X_n$ have joint pdf or pmf $f_{X}(x \mid \theta)$. For any observed $x$, the **likelihood function** of $\theta$ is
\begin{align*}
L(\theta) = L(\theta; x) = f_{X}(x \mid \theta),
\end{align*}
regarded as a function of $\theta$ with $x$ fixed.
[/definition]
[remark: Likelihood is not a probability]
$L(\theta)$ is not a probability distribution over $\theta$. It is a function of $\theta$ that takes values determined by the observed data. For continuous distributions, $L(\theta)$ has units of density, not probability. Two likelihood functions that are proportional as functions of $\theta$ carry identical information about $\theta$.
[/remark]
When $X_1, \ldots, X_n$ are i.i.d., each with marginal pdf or pmf $f_X(x \mid \theta)$, the likelihood factorises as a product:
\begin{align*}
L(\theta) &= \prod_{i=1}^n f_X(x_i \mid \theta).
\end{align*}
Products are unwieldy to maximise. Taking logarithms converts the product into a sum and preserves the location of the maximum, since $\log$ is strictly increasing. The **log-likelihood** is
\begin{align*}
\ell(\theta) &= \log L(\theta) = \sum_{i=1}^n \log f_X(x_i \mid \theta).
\end{align*}
In practice, maximising $\ell(\theta)$ is almost always easier than maximising $L(\theta)$ directly.
[definition: Maximum Likelihood Estimator]
The **maximum likelihood estimator** (MLE) of $\theta$, written $\hat\theta$, is any value of $\theta$ that maximises the likelihood function:
\begin{align*}
\hat\theta = \hat\theta(x) = \operatorname*{arg\,max}_\theta L(\theta; x) = \operatorname*{arg\,max}_\theta \ell(\theta; x).
\end{align*}
The MLE is a statistic — a function of the data $x$ — and may not exist or may not be unique.
[/definition]
Often there is no closed-form MLE and $\hat\theta$ must be found numerically. When a closed form does exist, it is usually found by differentiating $\ell(\theta)$ and solving the likelihood equations $\ell'(\theta) = 0$, then verifying that the critical point is indeed a maximum.
### Worked Examples
[example: MLE for Bernoulli]
Let $X_1, \ldots, X_n$ be i.i.d. $\mathrm{Bernoulli}(p)$, where $p \in (0,1)$ is unknown. The log-likelihood is
\begin{align*}
\ell(p) &= \left(\sum_{i=1}^n x_i\right)\log p + \left(n - \sum_{i=1}^n x_i\right)\log(1 - p).
\end{align*}
Differentiating with respect to $p$:
\begin{align*}
\frac{d\ell}{dp} &= \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p}.
\end{align*}
Setting this to zero and solving gives $\hat p = \frac{1}{n}\sum_{i=1}^n x_i = \bar x$. Since $\frac{d^2\ell}{dp^2} < 0$ at this point, it is a maximum. The MLE is the sample mean $\hat p = \bar x$, which is also unbiased: $\mathbb{E}[\hat p] = p$.
[/example]
[example: MLE for Normal with unknown mean and variance]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, and suppose both $\mu$ and $\sigma^2$ are unknown. The log-likelihood is
\begin{align*}
\ell(\mu, \sigma^2) &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.
\end{align*}
Setting the partial derivatives to zero:
\begin{align*}
\frac{\partial \ell}{\partial \mu} &= -\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \implies \hat\mu = \bar x = \frac{1}{n}\sum_{i=1}^n x_i, \\
\frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2 = 0 \implies \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 = \frac{S_{xx}}{n},
\end{align*}
where $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2$. The MLE for the mean is the sample mean (unbiased), but the MLE for the variance is $\hat\sigma^2 = S_{xx}/n$. This follows from the chi-squared result established in Chapter 3: $S_{XX}/\sigma^2 \sim \chi^2_{n-1}$, which gives $\mathbb{E}[\hat\sigma^2] = \frac{(n-1)\sigma^2}{n}$. So the MLE for $\sigma^2$ is biased — a useful reminder that maximum likelihood does not automatically produce unbiased estimators.
[/example]
[example: German Tank Problem]
During World War II, the Allies estimated German tank production by observing serial numbers on captured tanks. Suppose $\theta$ tanks are produced, numbered $1, 2, \ldots, \theta$, and we observe $n$ tank numbers $x_1, \ldots, x_n$ sampled uniformly without replacement. The likelihood function is
\begin{align*}
L(\theta) &= \frac{1}{\theta^n} \cdot \mathbf{1}_{[\max_i x_i \,\le\, \theta]}.
\end{align*}
For $\theta \ge \max_i x_i$, the likelihood equals $1/\theta^n$, which is strictly decreasing in $\theta$. For $\theta < \max_i x_i$, the likelihood is zero (impossible). The maximum is therefore attained at the smallest permissible value: $\hat\theta = \max_i x_i$.
Is $\hat\theta$ unbiased? For $0 \le t \le \theta$, the CDF of $\hat\theta$ is
\begin{align*}
F_{\hat\theta}(t) &= \mathbb{P}(\hat\theta \le t) = \left(\frac{t}{\theta}\right)^n,
\end{align*}
so the pdf is $f_{\hat\theta}(t) = \frac{n t^{n-1}}{\theta^n}$ and
\begin{align*}
\mathbb{E}[\hat\theta] &= \int_0^\theta t \cdot \frac{n t^{n-1}}{\theta^n}\,dt = \frac{n\theta}{n+1}.
\end{align*}
Thus $\hat\theta$ is biased downward, but asymptotically unbiased as $n \to \infty$. The MLE systematically underestimates $\theta$ because the maximum observed serial number cannot exceed the true maximum.
[/example]
[example: Smarties — Discrete MLE without closed form]
Smarties come in $k$ equally frequent colours, but $k$ is unknown. Suppose we sample with replacement and observe four Smarties with colours Red, Purple, Red, Yellow. The likelihood of this sequence as a function of $k$ is
\begin{align*}
L(k) &= 1 \cdot \frac{k-1}{k} \cdot \frac{1}{k} \cdot \frac{k-2}{k} = \frac{(k-1)(k-2)}{k^3}.
\end{align*}
Here $k$ is a positive integer, so we cannot differentiate. Evaluating numerically, $L(k)$ is maximised at $k = 5$, giving the MLE $\hat k = 5$. This illustrates that when the parameter space is discrete, the MLE must be found by enumeration or other discrete optimisation methods — the calculus approach does not apply.
[/example]
### Connection to Sufficient Statistics
The relationship between the MLE and sufficiency is immediate. Suppose $T$ is a sufficient statistic for $\theta$. By the factorisation criterion, the likelihood takes the form
\begin{align*}
L(\theta; x) &= g(T(x), \theta)\,h(x),
\end{align*}
where $h(x)$ does not depend on $\theta$. When maximising over $\theta$, the factor $h(x)$ is a positive constant and plays no role. Thus maximising $L(\theta; x)$ is equivalent to maximising $g(T(x), \theta)$, and the resulting MLE $\hat\theta$ depends on the data only through $T(x)$. In other words, the MLE is always a function of the sufficient statistic — it automatically respects the information-preserving structure identified by sufficiency.
### Invariance of the MLE
A practically important property of maximum likelihood is that it respects reparametrisation.
[quotetheorem:1428]
[citeproof:1428]
This is a short but powerful result. For instance, if $\hat\sigma$ is the MLE of the standard deviation $\sigma$, then $\hat\sigma^2$ is automatically the MLE of the variance $\sigma^2$, without any separate computation. Bijectivity is essential: if $h$ is not injective, the relationship breaks down. Consider, for example, $\theta \in \mathbb{R}$ and the reparametrisation $\phi = h(\theta) = \theta^2$, which is not injective because $h(\theta) = h(-\theta)$. Given an observed MLE $\hat\theta$, the value $\phi = \hat\theta^2$ does not uniquely determine $\theta$: the equation $\theta^2 = \hat\theta^2$ has two solutions $\pm\hat\theta$, and the induced likelihood of $\phi$ must be constructed as a supremum over preimages, at which point the clean identity $\hat\phi = h(\hat\theta)$ is no longer automatic.
### From Points to Intervals
The MLE $\hat\theta$ is a **point estimate** — it names a single value of $\theta$ judged most consistent with the data. But how much should we trust this single number? If two parameter values have very similar likelihoods, our estimate is inherently uncertain. If the likelihood is sharply peaked around $\hat\theta$, we have strong evidence for that value.
Quantifying this uncertainty is the subject of the next section. Rather than a single point, we will construct an **interval estimate** — a random interval $[L(X), U(X)]$ that is designed to contain the true $\theta$ with a prescribed probability. These are called confidence intervals, and their construction builds directly on the structure of the likelihood and the sampling distribution of $\hat\theta$.
## Confidence Intervals
The previous section showed how to extract a single number from the data — the maximum likelihood estimate $\hat\theta$ — to summarise what the sample tells us about an unknown parameter. But a single number conceals a great deal. If we estimate $\hat\theta = 0.3$, the truth might be $0.15$ or $0.45$; without some measure of the uncertainty in $\hat\theta$, the estimate is hard to interpret and potentially misleading. A confidence interval answers this concern by replacing the point estimate with a random interval guaranteed, by construction, to cover the true $\theta$ with a prescribed probability.
### The Definition and its Interpretation
[definition: Confidence Interval]
A $100\gamma\%$ confidence interval for a parameter $\theta$ is a random interval $(A(X), B(X))$, where $A$ and $B$ are functions of the data $X = (X_1, \ldots, X_n)$, such that
\begin{align*}
\mathbb{P}_\theta\!\left(A(X) < \theta < B(X)\right) = \gamma
\end{align*}
for every possible value of $\theta$. The quantity $\gamma$ is called the **coverage probability**.
[/definition]
The phrasing of this definition rewards careful reading. The endpoints $A(X)$ and $B(X)$ are random — they change from sample to sample — while $\theta$ is a fixed, unknown constant. The probability statement is about the random interval, not about $\theta$.
This distinction matters enormously for interpretation. Think of it in terms of repeat sampling: if the same experiment were run a large number of times and a $95\%$ confidence interval computed from each sample, then approximately $95\%$ of those intervals would contain the true $\theta$. Each individual computed interval either does or does not contain $\theta$; there is no probability to assign to that specific event after the data have been observed.
[remark: What a Confidence Interval Does Not Mean]
A common and serious misreading is to say, after computing a $95\%$ confidence interval $(a, b)$ from data $x$: "there is a $95\%$ probability that $\theta \in (a, b)$." This is not what the definition says. The interval $(a, b)$ is a fixed pair of numbers; $\theta$ is a fixed constant; the probability that a fixed constant lies in a fixed interval is either $0$ or $1$. The $95\%$ is a property of the procedure, not of the particular interval produced.
A stark illustration: suppose $X_1, X_2$ are i.i.d. $U(\theta - \tfrac{1}{2}, \theta + \tfrac{1}{2})$ and we use $(\min(X_1, X_2), \max(X_1, X_2))$ as a $50\%$ confidence interval for $\theta$ (verified below). If we happen to observe $|x_1 - x_2| \geq \tfrac{1}{2}$, say $x_1 = 0.2$ and $x_2 = 0.9$, then both observations must lie within distance $\tfrac{1}{2}$ of $\theta$, which forces $\theta \in (0.4, 0.7) \subseteq (0.2, 0.9)$. In this particular case we are certain $\theta$ lies in the interval — even though the procedure only provides $50\%$ coverage on average. The $50\%$ coverage probability governs the long-run behaviour of the procedure, not the strength of the conclusion from any single experiment.
[/remark]
### The Pivotal Quantity Method
The standard technique for constructing confidence intervals is to find a function of the data and the parameter whose distribution does not depend on $\theta$.
[definition: Pivotal Quantity]
Let $\mathcal{X}^n$ be the sample space of $X$ and let $\Theta$ be the parameter space. A **pivot** (or pivotal quantity) is a map
\begin{align*}
R : \mathcal{X}^n \times \Theta \to \mathbb{R}, \qquad (x, \theta) \mapsto R(x, \theta),
\end{align*}
such that the distribution of the random variable $R(X, \theta)$ under the model with parameter $\theta$ is the same for every $\theta \in \Theta$.
[/definition]
Given a pivot $R$, constructing a confidence interval is mechanical:
1. Find constants $c_1 < c_2$ such that $\mathbb{P}_\theta(c_1 < R(X, \theta) < c_2) = \gamma$.
2. Rearrange the inequality $c_1 < R(X, \theta) < c_2$ algebraically to isolate $\theta$ in the middle.
3. Read off the endpoints $A(X)$ and $B(X)$.
The choice of $c_1$ and $c_2$ is not unique: any pair satisfying step 1 gives a valid $100\gamma\%$ interval. In practice one takes $c_1$ and $c_2$ to be the equitail percentage points — the $\tfrac{1-\gamma}{2}$ and $\tfrac{1+\gamma}{2}$ quantiles of the pivot's distribution — because for unimodal symmetric distributions this yields the shortest possible interval.
[example: Normal Mean with Known Variance]
Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\theta, 1)$. Construct a $95\%$ confidence interval for $\theta$.
The sample mean satisfies $\bar X \sim N(\theta, 1/n)$, so the standardised quantity
\begin{align*}
R(X, \theta) = \sqrt{n}\,(\bar X - \theta) \sim N(0, 1)
\end{align*}
is a pivot. Let $z_{0.025}$ denote the upper $2.5\%$ point of the standard normal, so that $\mathbb{P}(-z_{0.025} < Z < z_{0.025}) = 0.95$ for $Z \sim N(0,1)$; numerically $z_{0.025} \approx 1.96$.
We have
\begin{align*}
0.95 = \mathbb{P}\!\left(-1.96 < \sqrt{n}\,(\bar X - \theta) < 1.96\right) = \mathbb{P}\!\left(\bar X - \frac{1.96}{\sqrt{n}} < \theta < \bar X + \frac{1.96}{\sqrt{n}}\right).
\end{align*}
The $95\%$ confidence interval for $\theta$ is therefore
\begin{align*}
\left(\bar X - \frac{1.96}{\sqrt{n}},\; \bar X + \frac{1.96}{\sqrt{n}}\right).
\end{align*}
Notice how the width of the interval shrinks like $1/\sqrt{n}$: collecting more data tightens the interval, as one would expect.
[/example]
[example: Normal Variance with Known Mean]
Suppose $X_1, \ldots, X_{50}$ are i.i.d. $N(0, \sigma^2)$. Construct a $99\%$ confidence interval for $\sigma^2$.
Since $X_i / \sigma \sim N(0,1)$, the sum of squares satisfies
\begin{align*}
\frac{1}{\sigma^2}\sum_{i=1}^{50} X_i^2 \sim \chi^2_{50},
\end{align*}
which is a pivot for $\sigma^2$. Let $\chi^2_{50}(\alpha)$ denote the upper $100\alpha\%$ point of $\chi^2_{50}$. For a $99\%$ interval we need the $0.5\%$ and $99.5\%$ points: $c_1 = \chi^2_{50}(0.995) \approx 27.99$ and $c_2 = \chi^2_{50}(0.005) \approx 79.49$.
Writing out the probability statement and rearranging:
\begin{align*}
0.99 = \mathbb{P}\!\left(c_1 < \frac{\sum X_i^2}{\sigma^2} < c_2\right) = \mathbb{P}\!\left(\frac{\sum X_i^2}{c_2} < \sigma^2 < \frac{\sum X_i^2}{c_1}\right).
\end{align*}
A $99\%$ confidence interval for $\sigma^2$ is $\left(\sum X_i^2 / c_2,\; \sum X_i^2 / c_1\right)$. Since the square-root function is monotone increasing, a $99\%$ confidence interval for $\sigma$ itself is
\begin{align*}
\left(\sqrt{\frac{\sum X_i^2}{c_2}},\; \sqrt{\frac{\sum X_i^2}{c_1}}\right).
\end{align*}
This follows from a general principle: if $(A(X), B(X))$ is a $100\gamma\%$ confidence interval for $\theta$ and $T$ is a monotone increasing function, then $(T(A(X)), T(B(X)))$ is a $100\gamma\%$ confidence interval for $T(\theta)$.
[/example]
[example: Uniform Distribution — a Structural Pivot]
Suppose $X_1, X_2$ are i.i.d. $U(\theta - \tfrac{1}{2}, \theta + \tfrac{1}{2})$. Find a $50\%$ confidence interval for $\theta$.
Each $X_i$ is equally likely to fall below $\theta$ or above $\theta$. The probability that one observation lands on each side of $\theta$ is $\tfrac{1}{2}$, and in that event $\theta$ is sandwiched between $\min(X_1, X_2)$ and $\max(X_1, X_2)$. Formally,
\begin{align*}
\mathbb{P}_\theta\!\left(\min(X_1, X_2) \leq \theta \leq \max(X_1, X_2)\right) = \frac{1}{2}.
\end{align*}
So $(\min(X_1, X_2),\, \max(X_1, X_2))$ is a $50\%$ confidence interval for $\theta$. The example in the remark above shows that after a particular pair of observations is obtained, the coverage probability no longer governs our certainty about $\theta$ — it is a property of the procedure before the data are seen.
[/example]
### Approximate Confidence Intervals via Asymptotic Normality
In many models there is no exact pivot. The previous section established that under mild regularity conditions the MLE $\hat\theta_n$ is asymptotically normal:
\begin{align*}
\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} N\!\left(0, \frac{1}{I(\theta)}\right),
\end{align*}
where $I(\theta)$ is the Fisher information. This suggests treating $\sqrt{n\,I(\hat\theta_n)}\,(\hat\theta_n - \theta)$ as an approximate standard normal pivot for large $n$.
[quotetheorem:1429]
This is an approximate result in two senses: the asymptotic normality of the MLE is invoked, and the unknown $I(\theta)$ is replaced by its estimate $I(\hat\theta_n)$. Both approximations improve as $n \to \infty$.
The phrase "standard regularity conditions" bundles together a handful of technical requirements: the log-density $\log f(x;\theta)$ is differentiable in $\theta$; the Fisher information $I(\theta)$ exists, is finite, and is strictly positive; the true parameter lies in the **interior** of $\Theta$ (so local perturbations in every direction are legitimate); and the exchange of differentiation and integration $\partial_\theta \int f(x;\theta)\,dx = \int \partial_\theta f(x;\theta)\,dx$ is valid, which is the step that underpins the score identity $\mathbb{E}_\theta[\partial_\theta \log f(X;\theta)] = 0$.
The approximation breaks — sometimes dramatically — when these conditions fail. For $X_1, \ldots, X_n \sim \mathrm{Uniform}(0,\theta)$, the support depends on $\theta$, differentiation under the integral is illegal, and the MLE $\hat\theta_n = \max_i X_i$ is not asymptotically normal (it converges to $\theta$ at rate $n$ rather than $\sqrt n$, with a limiting exponential rather than normal distribution). Other sources of failure include **small $n$** (the CLT has not yet taken hold and the sampling distribution is visibly skewed), **boundary values** of $\theta$ (such as a Bernoulli proportion near $0$ or $1$, where the quadratic approximation to the log-likelihood is poor), and **highly skewed or multimodal likelihoods**.
The substitution of $I(\hat\theta_n)$ for $I(\theta)$ in the denominator of the pivot deserves comment. Under regularity, $I$ is continuous in $\theta$ and $\hat\theta_n \xrightarrow{\mathbb{P}} \theta$, so $I(\hat\theta_n) \xrightarrow{\mathbb{P}} I(\theta)$ by the continuous mapping theorem. By Slutsky's theorem, the pivot $\sqrt{n\,I(\hat\theta_n)}(\hat\theta_n - \theta)$ then has the same limiting $N(0,1)$ distribution as the ideal pivot $\sqrt{n\,I(\theta)}(\hat\theta_n - \theta)$, which justifies the substitution asymptotically. In practice we use $I(\hat\theta_n)$ because $I(\theta)$ is simply unavailable — we do not know $\theta$.
[example: Bernoulli Proportion]
Suppose $X_1, \ldots, X_n$ are i.i.d. $\mathrm{Bernoulli}(p)$ and we want an approximate $95\%$ confidence interval for $p$. The MLE is $\hat p = \bar X = n^{-1}\sum X_i$.
By the Central Limit Theorem, $\hat p$ is approximately $N(p, p(1-p)/n)$ for large $n$, so
\begin{align*}
\frac{\sqrt{n}\,(\hat p - p)}{\sqrt{p(1-p)}}
\end{align*}
is approximately $N(0,1)$. The difficulty is that the standard deviation $\sqrt{p(1-p)/n}$ depends on the unknown $p$. We substitute $\hat p$ for $p$ in the denominator, obtaining the approximate $95\%$ interval
\begin{align*}
\hat p \;\pm\; 1.96\,\sqrt{\frac{\hat p\,(1-\hat p)}{n}}.
\end{align*}
As a concrete application, suppose an opinion poll reports $20\%$ support for a party based on $n = 1{,}000$ respondents, so $\hat p = 0.20$. Then
\begin{align*}
\operatorname{Var}(X/n) \approx \frac{\hat p\,(1-\hat p)}{n} = \frac{0.20 \times 0.80}{1000} = 0.00016,
\end{align*}
giving the interval $0.20 \pm 1.96 \times 0.013 = (0.175, 0.225)$.
If one is unwilling to substitute $\hat p$ for $p$, the bound $p(1-p) \leq \tfrac{1}{4}$ (attained at $p = \tfrac{1}{2}$) gives the conservative interval $\hat p \pm 1.96/\sqrt{4n} \approx \hat p \pm 1/\sqrt{n}$. This interval is valid for any $p$ and shows that the margin of error in a survey of $n$ respondents is at most of order $1/\sqrt{n}$, regardless of the true proportion.
[/example]
[remark: Coverage Probability is a Long-Run Property]
It is worth repeating the key point because it is a perennial source of error. A $100\gamma\%$ confidence interval is not a probability statement about where $\theta$ lies after the data are observed. It is a statement about how the interval-producing procedure behaves across many repetitions of the experiment: $100\gamma\%$ of the intervals produced by the procedure will cover $\theta$. Once a specific interval $(a, b)$ has been computed, the parameter either lies in $(a, b)$ or it does not — there is no residual randomness. This distinction is what motivates the Bayesian framework, where uncertainty about $\theta$ is modelled explicitly as a probability distribution.
[/remark]
The frequentist framework treats $\theta$ as a fixed unknown and measures uncertainty through the behaviour of estimators across hypothetical repetitions of the experiment. But what if we have genuine prior knowledge about $\theta$ before collecting any data — knowledge that should influence our conclusions? The next section introduces Bayesian estimation, which does exactly this: it places a prior distribution on $\theta$ and updates it in light of the observed data to obtain a posterior distribution that quantifies remaining uncertainty about $\theta$ directly.
## Bayesian Estimation
Throughout this chapter we have worked within the frequentist tradition: the parameter $\theta$ is a fixed but unknown constant, and our inferential statements — point estimates, confidence intervals — are interpreted in terms of what would happen if we repeated the experiment many times. A confidence interval at level $95\%$ is not a statement that there is a $95\%$ chance the true $\theta$ lies inside it; $\theta$ is fixed, so it either does or does not. The probability refers to the procedure producing the interval, not to the parameter.
The Bayesian framework rests on a different philosophical foundation. It treats $\theta$ as a random variable, endowed with a probability distribution that encodes what we believe about $\theta$ before seeing any data. After observing data, that distribution is updated via Bayes' theorem. This shift from "probability as long-run frequency" to "probability as degree of belief" unlocks a genuine practical payoff: it lets us incorporate domain knowledge systematically. When data are sparse — a new hospital running its first ten operations, a rare-disease clinical trial — a sensible prior distribution prevents the maximum likelihood estimator from collapsing to implausible values such as zero.
### Prior and Posterior Distributions
Before formulating Bayes' theorem in this setting, we need the two central objects.
[definition: Prior Distribution]
The **prior distribution** $\pi(\theta)$ is the probability distribution assigned to the parameter $\theta$ before any data are observed. It encodes the investigator's beliefs and background knowledge about the plausible range of $\theta$.
[/definition]
[definition: Posterior Distribution]
The **posterior distribution** $\pi(\theta \mid x)$ is the conditional distribution of $\theta$ given the observed data $X = x$. It combines the prior with the information in the data.
[/definition]
Bayes' theorem connects them:
\begin{align*}
\pi(\theta \mid x) = \frac{f_{X}(x \mid \theta)\,\pi(\theta)}{f_{X}(x)}.
\end{align*}
The denominator $f_{X}(x) = \int f_{X}(x \mid \theta)\,\pi(\theta)\,d\theta$ is a normalising constant that does not depend on $\theta$. In practice we almost never compute it directly; instead we work with the proportionality relation
\begin{align*}
\pi(\theta \mid x) &\propto f_{X}(x \mid \theta)\,\pi(\theta),\\
\text{posterior} &\propto \text{likelihood} \times \text{prior},
\end{align*}
and identify the posterior by recognising the kernel of a known distribution. Notice also that the data enter only through the likelihood $f_{X}(x \mid \theta)$, which depends on $x$ only through any sufficient statistic — Bayesian inference is therefore automatically based on sufficient statistics.
[example: Three Coins]
Three coins are in a pocket: one biased $3:1$ towards tails ($\theta = 0.25$), one fair ($\theta = 0.50$), and one biased $3:1$ towards heads ($\theta = 0.75$). A coin is chosen uniformly at random and flipped once, showing a head. What is the posterior probability that the third coin was chosen?
The prior is $\pi(0.25) = \pi(0.50) = \pi(0.75) = 1/3$, and the likelihood is $f_X(1 \mid \theta) = \theta$. Computing $f_X(1 \mid \theta)\,\pi(\theta)$ for each $\theta$ and normalising:
\begin{align*}
\pi(0.25 \mid X = 1) &= \frac{0.25/3}{(0.25 + 0.50 + 0.75)/3} = \frac{1}{6},\\
\pi(0.50 \mid X = 1) &= \frac{1}{3},\\
\pi(0.75 \mid X = 1) &= \frac{1}{2}.
\end{align*}
Observing a head shifts probability mass towards the head-biased coin; there is now a $50\%$ posterior probability that the third coin was selected, up from the prior $33\%$.
[/example]
### Conjugate Prior Families
In general, the posterior may not have a closed form. A **conjugate prior** is a prior family that, when combined with a particular likelihood, yields a posterior in the same family. Conjugate priors make Bayesian updating analytically tractable, and they arise naturally in exponential families.
#### Beta–Binomial Conjugacy
The canonical example is Bernoulli data with a Beta prior.
[example: Hospital Mortality Rate]
A hospital $H$ is about to begin a new operation. Nationally, the mortality risk is around $10\%$, but rates across hospitals range from roughly $3\%$ to $20\%$. Hospital $H$ records no deaths in their first ten operations. What should we believe about the true mortality risk $\theta$?
Let $X_i = 1$ if the $i$th patient dies. The likelihood is
\begin{align*}
f_{X}(x \mid \theta) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}.
\end{align*}
Suppose the prior is $\theta \sim \mathrm{Beta}(a, b)$, so $\pi(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$. Then
\begin{align*}
\pi(\theta \mid x) &\propto \theta^{\sum x_i + a - 1}(1 - \theta)^{n - \sum x_i + b - 1},
\end{align*}
which is the kernel of a $\mathrm{Beta}(\sum x_i + a,\, n - \sum x_i + b)$ distribution. A beta prior leads to a beta posterior — the Beta family is conjugate for Bernoulli samples.
To encode the national experience we choose $a = 3$, $b = 27$, which gives a Beta prior with mean $3/30 = 0.10$ and $\mathbb{P}(0.03 < \theta < 0.20) = 0.90$. With $\sum x_i = 0$ and $n = 10$, the posterior is $\mathrm{Beta}(3, 37)$, with posterior mean $3/40 = 0.075$.
This contrasts sharply with the frequentist maximum likelihood estimate $\hat{\theta}_{\mathrm{MLE}} = 0$. Zero deaths in ten operations forces the MLE to zero, which is implausible given the broader context. The Bayesian posterior mean is positive because the prior incorporates evidence from other hospitals, giving a more defensible estimate when data are sparse.
[/example]
A special case illuminates the connection to familiar estimators. If $a = b = 1$ (a uniform prior on $(0,1)$), the posterior is $\mathrm{Beta}(\sum x_i + 1,\, n - \sum x_i + 1)$ with mean
\begin{align*}
\hat{\theta}_{\mathrm{Bayes}} = \frac{\sum x_i + 1}{n + 2}.
\end{align*}
The mode of this posterior is $\sum x_i / n$, recovering the MLE — a pattern that recurs generally: the posterior mode under a flat prior coincides with the maximum likelihood estimator. The posterior mean $(\sum x_i + 1)/(n + 2)$ is Laplace's estimator, which we saw earlier in the chapter has smaller mean squared error than the MLE for non-extreme values of $\theta$.
#### Gamma–Poisson Conjugacy
[example: Poisson Data with Exponential Prior]
Let $X_1, \ldots, X_n$ be i.i.d. $\mathrm{Poisson}(\lambda)$, and suppose $\lambda$ has an exponential prior with mean $1$, so $\pi(\lambda) = e^{-\lambda}$ for $\lambda > 0$. The posterior is
\begin{align*}
\pi(\lambda \mid x) &\propto e^{-n\lambda}\,\lambda^{\sum x_i}\,e^{-\lambda} = \lambda^{\sum x_i}\,e^{-(n+1)\lambda},
\end{align*}
which is the kernel of a $\mathrm{Gamma}(\sum x_i + 1,\, n + 1)$ distribution (shape, rate parametrisation). The posterior mean is $(\sum x_i + 1)/(n + 1)$. More generally, a $\mathrm{Gamma}(\alpha, \beta)$ prior yields a $\mathrm{Gamma}(\alpha + \sum x_i,\, \beta + n)$ posterior, making the Gamma family conjugate for Poisson data.
[/example]
#### Normal–Normal Conjugacy
[example: Normal Mean with Normal Prior]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, 1)$ with known variance, and suppose $\mu \sim N(0, \tau^{-2})$. The hyperparameter $\tau^2$ is the prior precision: large $\tau$ means strong prior belief that $\mu$ is near $0$. The posterior satisfies
\begin{align*}
\pi(\mu \mid x) &\propto \exp\!\left[-\tfrac{1}{2}\sum(x_i - \mu)^2\right]\exp\!\left[-\tfrac{\tau^2\mu^2}{2}\right]\\
&\propto \exp\!\left[-\tfrac{1}{2}(n + \tau^2)\!\left(\mu - \frac{\sum x_i}{n + \tau^2}\right)^{\!2}\right],
\end{align*}
obtained by completing the square in $\mu$. The posterior distribution is therefore $N\!\left(\dfrac{\sum x_i}{n + \tau^2},\, \dfrac{1}{n + \tau^2}\right)$.
The posterior mean $\sum x_i / (n + \tau^2)$ is a shrinkage estimator: it pulls the sample mean $\bar{x} = \sum x_i / n$ towards zero (the prior mean) by a factor of $n/(n + \tau^2)$. As $n \to \infty$ the prior is overwhelmed and the posterior mean converges to $\bar{x}$. As $\tau \to 0$ (flat prior) it also converges to $\bar{x}$.
[/example]
[remark: Posterior Mean Equals Posterior Median for the Normal]
Because the normal distribution is symmetric about its mean, the posterior mean and the posterior median coincide in the Normal–Normal model. Consequently, the same point estimate $\sum x_i / (n + \tau^2)$ minimises both the expected posterior quadratic loss and the expected posterior absolute-error loss.
[/remark]
### Bayes Estimators and Loss Functions
Having obtained the posterior distribution, we still need to extract a point estimate. The appropriate choice depends on what we wish to minimise.
Let $\mathcal{A}$ denote the **action space** — the set of admissible estimator values. In most parametric problems $\mathcal{A} = \Theta$ (we estimate $\theta$ by a number in the parameter space), but more generally $\mathcal{A}$ can be any subset of $\mathbb{R}^k$ containing the candidate estimates. A **loss function** is a map
\begin{align*}
L : \Theta \times \mathcal{A} \to [0, \infty), \qquad (\theta, a) \mapsto L(\theta, a),
\end{align*}
measuring the cost of declaring the estimate to be $a$ when the true value is $\theta$. Common choices are quadratic loss $L(\theta, a) = (\theta - a)^2$ and absolute error loss $L(\theta, a) = |\theta - a|$, both with $\mathcal{A} = \Theta$.
[definition: Bayes Estimator]
Given a loss function $L : \Theta \times \mathcal{A} \to [0,\infty)$, the **Bayes estimator** $\hat{\theta}$ is the value of $a \in \mathcal{A}$ that minimises the expected posterior loss
\begin{align*}
h(a) = \int L(\theta, a)\,\pi(\theta \mid x)\,d\theta.
\end{align*}
[/definition]
For quadratic loss, differentiating $h(a) = \int (a - \theta)^2\,\pi(\theta \mid x)\,d\theta$ and setting $h'(a) = 0$ gives $a = \int \theta\,\pi(\theta \mid x)\,d\theta$, the **posterior mean**. For absolute error loss, the minimiser satisfies $\int_{-\infty}^{a}\pi(\theta \mid x)\,d\theta = \tfrac{1}{2}$, the **posterior median**. A third natural estimator is the **maximum a posteriori (MAP) estimator**, the mode of the posterior, which minimises a certain $0$-$1$ loss.
[remark: MAP and MLE]
Under a uniform prior, the MAP estimator coincides with the MLE. Under informative priors, MAP incorporates prior beliefs and regularises the estimate, which is why MAP estimation appears in machine learning regularisation frameworks.
[/remark]
### Credible Intervals
Just as frequentist inference uses confidence intervals to convey uncertainty about $\theta$, Bayesian inference uses credible intervals — but the interpretation is fundamentally different.
[definition: Credible Interval]
A **credible interval** (or **posterior interval**) at level $1 - \alpha$ is a set $C(x) \subseteq \Theta$ such that
\begin{align*}
\mathbb{P}(\theta \in C(x) \mid X = x) = 1 - \alpha.
\end{align*}
Here the probability is taken with respect to the posterior distribution $\pi(\theta \mid x)$.
[/definition]
The most common choice is the **equal-tailed credible interval**, which places probability $\alpha/2$ in each tail of the posterior. Another is the **highest posterior density (HPD) interval**, the shortest interval containing posterior probability $1 - \alpha$.
The contrast with confidence intervals is worth dwelling on. A $95\%$ confidence interval $[L(X), U(X)]$ satisfies $\mathbb{P}(L(X) \le \theta \le U(X)) = 0.95$ with the probability over repeated samples and $\theta$ fixed — it does not assert a $95\%$ chance that $\theta$ lies in the observed interval. A $95\%$ credible interval, by contrast, does assert exactly this: given the observed data, there is a $95\%$ posterior probability that $\theta$ falls in the interval. This is the statement most users of statistics intuitively want to make, and the Bayesian framework is the one that licenses it.
[remark: Prior Dependence]
The credible interval depends on the choice of prior. Different investigators with different priors will reach different posterior distributions and different credible intervals from the same data. Frequentists regard this as a weakness; Bayesians regard it as an honest acknowledgement that inference always involves assumptions.
[/remark]
### Looking Ahead
With this section, we close Chapter 1. We have developed the major tools of **estimation**: how to construct good point estimators (method of moments, maximum likelihood, Bayes estimators), how to assess them (bias, variance, mean squared error, Cramér–Rao bound), and how to quantify uncertainty (confidence intervals, credible intervals). In the frequentist setting, estimation says: "given these data, what is our best guess for $\theta$, and how uncertain are we?" Bayesian estimation says: "given these data and our prior beliefs, what does the full posterior distribution of $\theta$ look like?"
The natural next question is different in character: rather than estimating the value of $\theta$, we want to make a **binary decision** — does the data provide sufficient evidence to reject a specific claim about $\theta$? This is the domain of **hypothesis testing**, which forms Chapter 2. The machinery we have built here — likelihood functions, sufficient statistics, the sampling distribution of estimators — will be the foundation for that theory. Chapter 3 then applies both estimation and testing to the linear model $Y = X\beta + \varepsilon$, the most widely used framework in applied statistics.
Point and interval estimation answer 'what is θ?', but many statistical questions demand a yes-or-no answer: Does this new treatment work? Is this coin fair? Hypothesis testing formalizes the logic of such binary decisions, replacing estimation with a framework of null and alternative hypotheses, significance levels, and power.
# 2. Hypothesis Testing
The preceding chapter developed estimation: given data, how do we produce a best guess for an unknown parameter, and how do we quantify our uncertainty around that guess? Point estimators and confidence intervals are the tools of estimation — they summarise what the data say about where $\theta$ lies. But many practical decisions are not of the form "where is $\theta$?" They are of the form "is $\theta$ in this region, or that one?" A pharmaceutical company does not merely want to know by how much a drug lowers the risk of heart attack — it needs a binary verdict: does the drug work or not? A quality-control engineer does not merely want an estimate of the defect rate — she needs to decide whether to halt production. Hypothesis testing is the formal framework for turning data into such binary decisions.
The essential structure is as follows. We have data $X_1, \ldots, X_n$ drawn i.i.d. from some distribution with unknown parameter $\theta$ (or unknown density $f$) — the same parametric setup established in Chapter 0 and used throughout Chapter 1. We entertain two competing claims about $\theta$: the **null hypothesis** $H_0$ and the **alternative hypothesis** $H_1$. On the basis of the observed data $X = x$, we must choose one. The asymmetry between $H_0$ and $H_1$ is fundamental and intentional. The null hypothesis represents the status quo — the default assumption we hold unless the evidence forces us to abandon it. We reject $H_0$ only when the data speak loudly enough against it. This conservatism is justified by the asymmetry of consequences: in a drug trial, approving an ineffective treatment is far more harmful than failing to approve an effective one; in a criminal court, convicting the innocent is considered a graver error than acquitting the guilty.
[definition: Hypothesis]
Let $X_1, \ldots, X_n$ be i.i.d. random variables taking values in $\mathcal{X}$, with common density or mass function $f$. A **hypothesis** is a statement about $f$. In the parametric setting where $f = f(\,\cdot\mid\theta)$ for $\theta \in \Theta$, a hypothesis specifies a subset of $\Theta$.
The **null hypothesis** $H_0$ is the default claim, typically of the form $\theta \in \Theta_0$ for some $\Theta_0 \subseteq \Theta$. The **alternative hypothesis** $H_1$ is the competing claim, $\theta \in \Theta_1$, where $\Theta_0 \cap \Theta_1 = \varnothing$.
A hypothesis is **simple** if it specifies $f$ completely (equivalently, $|\Theta_0| = 1$ or $|\Theta_1| = 1$ in the parametric case). A hypothesis is **composite** if it does not determine $f$ uniquely.
[/definition]
A hypothesis alone does not yet specify a decision rule. To act on data, we need two things: a quantity computed from the data that is sensitive to departures from $H_0$, and a criterion for when that quantity is extreme enough to warrant rejection. The following definitions formalise both.
[definition: Test Statistic and Critical Region]
A **test statistic** is a measurable function $T = T(X_1, \ldots, X_n)$ of the data, chosen so that its value provides evidence for or against $H_0$.
The **critical region** (also called the **rejection region**) is a set $C \subseteq \mathcal{X}^n$ such that we reject $H_0$ if and only if $(X_1, \ldots, X_n) \in C$. Equivalently, if the test is based on a statistic $T$, the critical region is often expressed as $\{T > c\}$ or $\{T \leq c\}$ for some **critical value** $c$.
[/definition]
No decision rule is error-free. When we reject or retain $H_0$, two kinds of mistake are possible, and understanding their probabilities is the foundation of all hypothesis testing theory.
[definition: Type I and Type II Errors]
A **Type I error** (false positive) occurs when $H_0$ is true but we reject it. Its probability is
\begin{align*}
\alpha &= \mathbb{P}(\text{reject } H_0 \mid H_0 \text{ true}).
\end{align*}
A **Type II error** (false negative) occurs when $H_1$ is true but we fail to reject $H_0$. Its probability is
\begin{align*}
\beta &= \mathbb{P}(\text{fail to reject } H_0 \mid H_1 \text{ true}).
\end{align*}
The **size** (or **significance level**) of a test is the supremum of the Type I error probability over all distributions consistent with $H_0$:
\begin{align*}
\alpha &= \sup_{\theta \in \Theta_0} \mathbb{P}_\theta(\text{reject } H_0).
\end{align*}
The **power** of a test at a parameter value $\theta \in \Theta_1$ is the probability of correctly rejecting $H_0$:
\begin{align*}
\pi(\theta) &= \mathbb{P}_\theta(\text{reject } H_0) = 1 - \beta(\theta).
\end{align*}
[/definition]
[remark: The Type I/Type II Trade-off]
Reducing Type I error (making the critical region smaller) typically increases Type II error: if we demand more evidence before rejecting $H_0$, we become less sensitive to genuine departures from $H_0$. Hypothesis testing theory is largely the art of constructing tests that control size at a prescribed level $\alpha$ while maximising power against the alternative.
[/remark]
To see these concepts interact, consider a concrete example drawn from the course.
[example: Coin Fairness Test]
A coin has $\mathbb{P}(\text{Heads}) = \theta$, unknown, and is tossed independently $n = 10$ times. We wish to test
\begin{align*}
H_0 &: \theta = \tfrac{1}{2} \quad \text{versus} \quad H_1 : \theta = \tfrac{3}{4}.
\end{align*}
Let $T = \sum_{i=1}^{10} X_i$ be the total number of heads. Under $H_0$, $T \sim \operatorname{Bin}(10, \frac{1}{2})$; under $H_1$, $T \sim \operatorname{Bin}(10, \frac{3}{4})$. Intuitively, a large value of $T$ is evidence against $H_0$ and in favour of $H_1$, so we consider a critical region of the form $C = \{T \geq c\}$ for some threshold $c$.
Choosing $c = 8$ gives:
\begin{align*}
\alpha &= \mathbb{P}_{1/2}(T \geq 8) = \sum_{k=8}^{10} \binom{10}{k} \left(\tfrac{1}{2}\right)^{10} \approx 0.055, \\
\beta &= \mathbb{P}_{3/4}(T < 8) = \sum_{k=0}^{7} \binom{10}{k} \left(\tfrac{3}{4}\right)^k \left(\tfrac{1}{4}\right)^{10-k} \approx 0.474.
\end{align*}
The power at $\theta = \frac{3}{4}$ is then $\pi(\frac{3}{4}) = 1 - \beta \approx 0.526$. Raising the threshold to $c = 9$ reduces the size to approximately $0.011$ but further reduces the power. The choice of $c$ encodes our tolerance for each type of error, and different values of $c$ trace out the full trade-off.
This example illustrates that once we fix the form of the critical region, the size and power are determined — and moving one in a favourable direction moves the other unfavourably. A principled framework for choosing the best critical region at a given size is therefore essential.
[/example]
The example above is special in one important respect: both $H_0$ and $H_1$ are **simple** — each specifies $\theta$ completely. This is precisely the cleanest setting in which to develop the theory, because the likelihood under each hypothesis is fully determined and the comparison between them is unambiguous. The next section takes up this case systematically, asking: among all tests of a given size, which one maximises power? The answer, due to Neyman and Pearson, gives a complete and elegant solution whenever both hypotheses are simple.
## Framework
The general parametric setup is as follows. We observe $X_1, \ldots, X_n$ i.i.d. with density $f(x \mid \theta)$, where $\theta \in \Theta$ is unknown. A test partitions $\mathcal{X}^n$ into the critical region $C$ (where we reject $H_0$) and its complement $C^c$ (where we retain $H_0$). The **likelihood ratio**
\begin{align*}
\Lambda(x) &= \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)}
\end{align*}
measures how much more probable the observed data are under $H_1$ than under $H_0$, and will play a central role throughout the chapter. When $\Lambda(x)$ is large, the evidence favours $H_1$; when it is small, the evidence favours $H_0$.
In the simple vs.\ simple setting, the data contain all the information needed to compare the two hypotheses, and the theory will show that the optimal test is always a likelihood ratio test. The composite case — where one or both hypotheses allow $\theta$ to range over a set — requires additional ideas and will be addressed in later sections.
## Simple Hypotheses
The chapter introduction established the vocabulary of hypothesis testing: null and alternative hypotheses, critical regions, Type I and Type II errors, size, and power. With these definitions in hand, a natural and pressing question arises — given that we fix the probability of falsely rejecting $H_0$ (the size), which test extracts the maximum information from the data? Among all tests of a given size, is there one that dominates the rest in terms of power? This section answers that question decisively for the case of *simple hypotheses*, where both $H_0$ and $H_1$ fully specify the distribution.
[definition: Simple Hypothesis]
A hypothesis is called **simple** if it uniquely determines the distribution of the data. That is, $H: \theta = \theta^*$ is simple if specifying that $H$ holds leaves no free parameters — the density $f_X(x \mid \theta^*)$ is completely determined.
A hypothesis that constrains $\theta$ to a set (e.g. $H: \theta > 0$) is called **composite**.
[/definition]
Throughout this section we work with a random sample $X = (X_1, \ldots, X_n)$ drawn from a distribution with density $f(\cdot \mid \theta)$, and we test two simple hypotheses $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$.
### The Likelihood Ratio Test
To compare two simple hypotheses, it is natural to ask: which hypothesis makes the observed data $x$ more probable? The likelihood of a simple hypothesis measures exactly this.
[definition: Likelihood and Likelihood Ratio]
The **likelihood** of a simple hypothesis $H: \theta = \theta^*$ given data $x$ is
\begin{align*}
L_x(H) = f_X(x \mid \theta = \theta^*).
\end{align*}
The **likelihood ratio** of $H_0$ against $H_1$ given data $x$ is
\begin{align*}
\Lambda_x(H_0; H_1) = \frac{L_x(H_1)}{L_x(H_0)} = \frac{f_X(x \mid \theta_1)}{f_X(x \mid \theta_0)}.
\end{align*}
A **likelihood ratio test** (LR test) is one where the critical region takes the form
\begin{align*}
C = \{x : \Lambda_x(H_0; H_1) > k\}
\end{align*}
for some threshold $k \geq 0$.
[/definition]
The intuition is immediate: if the data are much more likely under $H_1$ than under $H_0$, the ratio $\Lambda_x$ is large, and we have grounds to reject $H_0$. The threshold $k$ is then chosen to calibrate the size of the test to a prescribed level $\alpha$.
To evaluate how good a likelihood ratio test is, we need to return to the error probabilities introduced in the chapter opening. In the simple-vs-simple setting, both the Type I and Type II error probabilities are well-defined numbers (not suprema), because the distribution is completely specified under each hypothesis. The next definition restates these quantities in the context of a general critical region $C$, keeping them in view for the optimality argument that follows.
[definition: Type I and Type II Errors, Size and Power]
When testing $H_0$ against $H_1$ with critical region $C$:
- A **Type I error** occurs when we reject $H_0$ even though $H_0$ is true.
- A **Type II error** occurs when we fail to reject $H_0$ even though $H_1$ is true.
When $H_0$ and $H_1$ are both simple, set
\begin{align*}
\alpha &= \mathbb{P}(X \in C \mid H_0), \qquad \beta = \mathbb{P}(X \notin C \mid H_1).
\end{align*}
The **size** of the test is $\alpha$ (the probability of a Type I error), and the **power** is $1 - \beta$ (the probability of correctly detecting $H_1$).
[/definition]
For any fixed size $\alpha$, there are many possible tests — many choices of critical region $C$ satisfying $\mathbb{P}(X \in C \mid H_0) = \alpha$. The question of optimality is non-trivial: can we say that one particular form of critical region achieves higher power than every other?
### Neyman-Pearson Lemma
The answer is yes, and it is one of the most elegant results in classical statistics. The likelihood ratio test is not merely a reasonable heuristic — it is the uniformly most powerful test of its size.
[quotetheorem:1430]
[citeproof:1430]
The conclusion is worth pausing on. The Neyman-Pearson lemma says that the likelihood ratio test is not just *a* good test — it is the *best possible* test, in the sense that no other test of the same or smaller size can achieve a higher probability of detecting $H_1$. This is a global optimality statement: it holds for every alternative $H_1$ and every size $\alpha$. The theorem does not guarantee that the LR test is unique — multiple critical regions can achieve the same power — but it guarantees that no competing critical region can beat it. When the densities do not share common support (i.e., $f_1$ is nonzero on a region where $f_0$ is zero), the likelihood ratio can be infinite, and the region $\{f_0 = 0, f_1 > 0\}$ must automatically be included in any sensible critical region; the Neyman-Pearson argument then applies on the remaining region, and the conclusion is unchanged in spirit though the proof requires minor adjustment.
[remark: On Continuity]
The assumption that $f_0$ and $f_1$ are continuous is used only to guarantee that a threshold $k$ achieving exactly size $\alpha$ exists. For a discrete distribution, the set $\{x : \Lambda_x = k\}$ can have positive probability, and an exact size-$\alpha$ LR test may require randomisation on the boundary. In practice, this is rarely needed: whenever an LR test of exactly size $\alpha$ exists for a discrete distribution, the same proof applies and the conclusion holds unchanged.
[/remark]
### The Normal Example: The $z$-Test
The Neyman-Pearson lemma is most illuminating when applied to a concrete model. The normal distribution provides the canonical illustration, because the likelihood ratio reduces to a simple function of the sample mean, and the resulting test statistic has a known exact null distribution. This makes the normal case not merely an example but a template: the same reduction from likelihood ratio to sample mean will appear repeatedly in later sections whenever the model belongs to an exponential family.
[example: Optimal Test for Normal Mean (z-Test)]
Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$, where $\sigma_0^2$ is known. We test $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$ with $\mu_1 > \mu_0$.
The likelihood ratio is
\begin{align*}
\Lambda_x(H_0; H_1) &= \frac{(2\pi\sigma_0^2)^{-n/2} \exp\!\left(-\tfrac{1}{2\sigma_0^2}\sum_{i=1}^n (x_i - \mu_1)^2\right)}{(2\pi\sigma_0^2)^{-n/2} \exp\!\left(-\tfrac{1}{2\sigma_0^2}\sum_{i=1}^n (x_i - \mu_0)^2\right)} \\
&= \exp\!\left(\frac{\mu_1 - \mu_0}{\sigma_0^2}\, n\bar{x} + \frac{n(\mu_0^2 - \mu_1^2)}{2\sigma_0^2}\right).
\end{align*}
Since $\mu_1 > \mu_0$, the exponent is a strictly increasing function of $\bar{x}$. Therefore $\Lambda_x > k$ if and only if $\bar{x} > c$ for some constant $c$ determined by $k$. By the Neyman-Pearson lemma, the optimal size-$\alpha$ test rejects $H_0$ when $\bar{x} > c$, where $c$ is chosen so that
\begin{align*}
\mathbb{P}(\bar{X} > c \mid H_0) = \alpha.
\end{align*}
Under $H_0$, we have $\bar{X} \sim N(\mu_0, \sigma_0^2/n)$, so the standardised statistic
\begin{align*}
Z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma_0} \sim N(0,1)
\end{align*}
under $H_0$. Writing $z_\alpha$ for the upper-$\alpha$ quantile of $N(0,1)$, the size-$\alpha$ test rejects $H_0$ when
\begin{align*}
z = \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0} > z_\alpha.
\end{align*}
This is called the **$z$-test**. Its critical region $\{z > z_\alpha\}$ is optimal by the Neyman-Pearson lemma: no other size-$\alpha$ test of $H_0$ against this $H_1$ has greater power.
**Worked numerical instance.** Take $\mu_0 = 5$, $\mu_1 = 6$, $\sigma_0 = 1$, $n = 4$, $\alpha = 0.05$, and observed data $x = (5.1, 5.5, 4.9, 5.3)$, giving $\bar{x} = 5.2$. From standard tables, $z_{0.05} = 1.645$. The test statistic is
\begin{align*}
z = \frac{\sqrt{4}\,(5.2 - 5)}{1} = \frac{2 \times 0.2}{1} = 0.4.
\end{align*}
Since $0.4 < 1.645$, we do not reject $H_0$ at the 5% level. The data are consistent with $\mu = 5$.
A word of caution: failing to reject $H_0$ is not the same as accepting $H_0$. We simply lack sufficient evidence against it — the data are compatible with $\mu_0 = 5$, but they are also compatible with other values of $\mu$.
[/example]
### The $p$-Value
In the normal example, the rejection rule $z > z_\alpha$ depends on the chosen size $\alpha$. Rather than fixing $\alpha$ in advance and reporting a binary reject/not-reject decision, it is more informative to report the *smallest* size at which the observed data would lead to rejection.
[definition: p-Value]
The **$p$-value** of observed data $x$ is the probability, computed under $H_0$, of obtaining a test statistic at least as extreme as the observed value. For the $z$-test above,
\begin{align*}
p^* = \mathbb{P}(Z > z \mid H_0) = 1 - \Phi(z),
\end{align*}
where $z$ is the observed standardised statistic and $\Phi$ is the standard normal CDF.
[/definition]
The $p$-value is sometimes called the *observed significance level*; both terms are used interchangeably in this course. The rejection rule $z > z_\alpha$ is equivalent to $p^* < \alpha$: the data fall in the critical region at level $\alpha$ if and only if the $p$-value is smaller than $\alpha$. In the numerical example above, $z = 0.4$ gives $p^* = 1 - \Phi(0.4) \approx 0.345$. Since $0.345 > 0.05$, we do not reject at the 5% level — consistent with the direct comparison. Because the $p$-value is a function of the data, it is itself a random variable under repeated sampling; when $H_0$ is true, $p^*$ is uniformly distributed on $(0,1)$, which is why the probability of a false rejection at level $\alpha$ is exactly $\alpha$. The following explanation addresses the common source of confusion about what the $p$-value actually measures.
[explanation: What the p-Value Measures]
A small $p$-value means that the observed data would be unusual if $H_0$ were true — the test statistic lies far into the tail of its null distribution. This is evidence against $H_0$, not proof of $H_1$. Conversely, a large $p$-value means the data are compatible with $H_0$, but says nothing about whether $H_1$ is true.
The $p$-value depends on the test statistic chosen, which is why the Neyman-Pearson lemma matters: it tells us that for simple hypotheses, the likelihood ratio statistic is the right quantity to use. Any other test statistic with the same size will have a weaker ability to detect $H_1$.
[/explanation]
### Looking Ahead: Composite Hypotheses
The Neyman-Pearson framework is complete and satisfying when both $H_0$ and $H_1$ are simple. The likelihood ratio test is optimal, it reduces to familiar statistics (like the $z$-test) in standard models, and the $p$-value summarises the evidence in a single number.
But the setting of simple hypotheses is restrictive. In practice, we rarely know the exact parameter value under $H_1$ — we may only know that $\mu > \mu_0$, or that $\theta \neq \theta_0$. Similarly, $H_0$ may specify a family of distributions rather than a single one. When $H_0$ or $H_1$ is **composite**, the Neyman-Pearson argument does not apply directly: there is no single density $f_1$ to place in the numerator of the likelihood ratio.
The next section develops the theory of composite hypotheses. The central question becomes: when $H_1: \mu > \mu_0$ (a one-sided alternative involving all values $\mu > \mu_0$), can we still find a single test that is most powerful for every value of $\mu$ in $H_1$ simultaneously? The answer — which involves the concept of a *uniformly most powerful* test — depends in a subtle way on the structure of the model, and the likelihood ratio will again play a starring role.
## Composite Hypotheses
The Neyman-Pearson framework from the previous section is powerful, but it rests on a restrictive assumption: both $H_0$ and $H_1$ must be simple, meaning each specifies a single distribution completely. In practice, we almost never know the exact value of a parameter — we test whether a mean exceeds a threshold, whether a variance lies in some range, or whether two populations differ at all. These are **composite hypotheses**, where $H_0$ or $H_1$ (or both) describe a set of parameter values rather than a single point.
When $H_0: \theta \in \Theta_0$ is composite, there is no single "probability of Type I error" — the rejection probability varies across the parameter space. We need a richer language to describe test quality.
### The Power Function and UMP Tests
[definition: Power Function]
Let $C$ be the critical region of a test. The **power function** of the test is
\begin{align*}
W(\theta) = \mathbb{P}(X \in C \mid \theta),
\end{align*}
the probability of rejecting $H_0$ when the true parameter is $\theta$.
[/definition]
The power function encodes everything about a test's behaviour across the parameter space. For $\theta \in \Theta_0$, $W(\theta)$ is the probability of a Type I error at $\theta$; for $\theta \in \Theta_1$, $1 - W(\theta)$ is the probability of a Type II error at $\theta$. A good test has $W(\theta)$ small on $\Theta_0$ and large on $\Theta_1$. The shape of $W$ across all of $\Theta$ is the definitive summary of a test's quality: two tests with the same size can have very different power functions, and choosing between them means comparing these curves.
Because $W(\theta)$ varies on $\Theta_0$, we summarise the Type I error behaviour by its worst case. This worst-case perspective is what allows a meaningful size constraint: rather than controlling the Type I error at each $\theta \in \Theta_0$ separately (which would require knowing which $\theta$ is true), we control it uniformly.
[definition: Size]
The **size** of a test with power function $W$ is
\begin{align*}
\alpha = \sup_{\theta \in \Theta_0} W(\theta).
\end{align*}
[/definition]
The size is the highest probability of falsely rejecting $H_0$ over all parameter values consistent with the null. Requiring a test to have size $\alpha$ is therefore a worst-case guarantee: no matter which distribution in $\Theta_0$ is the true one, the probability of rejection does not exceed $\alpha$.
With size as the constraint, we can ask which test maximises power on $\Theta_1$ simultaneously for every $\theta_1 \in \Theta_1$ — not just for one particular alternative. This is a strong demand: a single critical region must dominate every competitor at every point of $\Theta_1$. Such a test, if it exists, is called uniformly most powerful, and its existence is the exception rather than the rule.
[definition: Uniformly Most Powerful Test]
A test with critical region $C$ and power function $W$ is a **uniformly most powerful (UMP) size $\alpha$ test** of $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$ if:
1. $\sup_{\theta \in \Theta_0} W(\theta) = \alpha$, and
2. for any other test $C^*$ of size at most $\alpha$ with power function $W^*$, we have $W(\theta) \geq W^*(\theta)$ for all $\theta \in \Theta_1$.
[/definition]
UMP tests do not always exist. When both hypotheses are composite, no single test can dominate all others simultaneously against every alternative — competing tests may be better for some values and worse for others. However, for one-sided tests against exponential-family models, UMP tests do exist, as the following example demonstrates.
[example: UMP Test for Normal Mean — One-Sided, Known Variance]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma_0^2)$ with $\sigma_0^2$ known. We test $H_0: \mu \leq \mu_0$ against $H_1: \mu > \mu_0$.
Begin with the simpler problem of testing $H_0': \mu = \mu_0$ against $H_1': \mu = \mu_1$ for a fixed $\mu_1 > \mu_0$. By the Neyman-Pearson lemma, the most powerful size $\alpha$ test has critical region
\begin{align*}
C = \left\{ x : \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0} > z_\alpha \right\}.
\end{align*}
The key observation is that $C$ does not depend on the specific value of $\mu_1$ — only on the direction $\mu_1 > \mu_0$. This makes $C$ a candidate for the UMP test against the full composite alternative.
To verify, compute the power function for any $\mu \in \mathbb{R}$:
\begin{align*}
W(\mu) &= \mathbb{P}_\mu\!\left(\frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma_0} > z_\alpha\right) \\
&= 1 - \Phi\!\left(z_\alpha + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma_0}\right).
\end{align*}
Since $\Phi$ is increasing, $W(\mu)$ is an increasing function of $\mu$. At $\mu = \mu_0$, $W(\mu_0) = 1 - \Phi(z_\alpha) = \alpha$. Since $W$ is increasing and equals $\alpha$ at the boundary $\mu_0$,
\begin{align*}
\sup_{\mu \leq \mu_0} W(\mu) = W(\mu_0) = \alpha,
\end{align*}
so condition 1 holds.
For condition 2, let $C^*$ be any test of $H_0$ against $H_1$ with size at most $\alpha$, and let $\mu_1 > \mu_0$ be arbitrary. The test $C^*$ is in particular a test of $H_0': \mu = \mu_0$ against $H_1': \mu = \mu_1$ with size at most $\alpha$. By the Neyman-Pearson lemma, $W^*(\mu_1) \leq W(\mu_1)$. Since $\mu_1 > \mu_0$ was arbitrary, condition 2 holds for all $\theta \in \Theta_1$.
Therefore $C$ is a UMP size $\alpha$ test.
[/example]
The argument in this example has a clean structure that generalises: if the Neyman-Pearson test for a simple-vs-simple sub-problem has a critical region that is independent of the specific alternative value, then that same critical region is UMP for the full one-sided composite problem. This occurs precisely when the likelihood ratio is monotone in a sufficient statistic — a property that holds for all one-parameter exponential families.
### The Generalised Likelihood Ratio for Composite Hypotheses
For two-sided alternatives or multi-parameter problems, UMP tests typically do not exist: the best test against $\mu > \mu_0$ rejects in the upper tail, while the best test against $\mu < \mu_0$ rejects in the lower tail, and no single test dominates both simultaneously. We need a different principle for the general composite case.
The natural generalisation of the Neyman-Pearson likelihood ratio replaces each point likelihood with the supremum of the likelihood over the relevant parameter set. This supremum is achieved at the maximum likelihood estimator whenever the MLE exists, so the generalised likelihood ratio compares the unrestricted MLE against the best-fitting parameter value consistent with $H_0$. Intuitively, if $H_0$ is true, the constraint $\theta \in \Theta_0$ should not cost much in terms of fit, and the ratio is close to one; if $H_0$ is false, the MLE will be pulled away from $\Theta_0$ and the ratio will be large.
[definition: Likelihood of a Composite Hypothesis]
The **likelihood** of a composite hypothesis $H: \theta \in \Theta$ given data $x$ is
\begin{align*}
L_{x}(H) = \sup_{\theta \in \Theta} f(x \mid \theta).
\end{align*}
[/definition]
Since we are typically interested in departures from $H_0$ without specifying a particular alternative, we take $\Theta_1 = \Theta$ (the full parameter space, which contains $\Theta_0$). The **generalised likelihood ratio (GLR) statistic** is then
\begin{align*}
\Lambda_{x}(H_0; H_1) = \frac{\sup_{\theta \in \Theta} f(x \mid \theta)}{\sup_{\theta \in \Theta_0} f(x \mid \theta)} \geq 1,
\end{align*}
with large values of $\Lambda$ providing evidence against $H_0$. The numerator is the likelihood at the unrestricted MLE; the denominator is the likelihood at the best parameter value consistent with $H_0$. If $H_0$ is true, the constraint $\theta \in \Theta_0$ costs little in likelihood, so $\Lambda$ is close to 1. If $H_0$ is false, the MLE falls far outside $\Theta_0$, and $\Lambda$ is large.
[example: GLR for Normal Mean — Two-Sided, Known Variance]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma_0^2)$ with $\sigma_0^2$ known. Test $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$, so $\Theta_0 = \{\mu_0\}$ and $\Theta = \mathbb{R}$.
The unrestricted MLE is $\hat{\mu} = \bar{x}$, giving
\begin{align*}
\Lambda_{x}(H_0; H_1) &= \frac{(2\pi\sigma_0^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma_0^2}\sum(x_i - \bar{x})^2\right)}{(2\pi\sigma_0^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma_0^2}\sum(x_i - \mu_0)^2\right)}.
\end{align*}
Taking twice the log and using the identity $\sum(x_i - \mu_0)^2 = \sum(x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$:
\begin{align*}
2\log \Lambda_{x}(H_0; H_1) = \frac{n(\bar{x} - \mu_0)^2}{\sigma_0^2}.
\end{align*}
We reject $H_0$ when $\Lambda$ is large, i.e. when $|\bar{x} - \mu_0|$ is large. Under $H_0$, the statistic $Z = \sqrt{n}(\bar{X} - \mu_0)/\sigma_0 \sim N(0,1)$, so $2\log\Lambda = Z^2 \sim \chi_1^2$ exactly. The size $\alpha$ test rejects $H_0$ when
\begin{align*}
\left|\frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0}\right| > z_{\alpha/2}.
\end{align*}
This is a two-tailed test: we reject both when $\bar{x}$ is much larger than $\mu_0$ and when it is much smaller, which is the correct behaviour when the alternative is two-sided.
[/example]
The exact $\chi^2$ distribution in this example is a special feature of the normal model. In general, the null distribution of $2\log\Lambda$ must be approximated, and the following result — due to Wilks — provides the universal approximation.
### Wilks' Theorem
To state the theorem precisely, we need to quantify the "degrees of freedom" consumed by $H_0$. If the full parameter space $\Theta$ has $k$ free parameters and $H_0$ imposes $p$ independent restrictions (reducing $\Theta_0$ to $k - p$ free parameters), write $|\Theta| = k$ and $|\Theta_0| = k - p$.
[quotetheorem:1431]
The proof of Wilks' theorem is not covered in this course. It proceeds by a Taylor expansion of the log-likelihood around the MLE, showing that $2\log\Lambda$ is asymptotically equivalent to a quadratic form in a $p$-dimensional normal vector, which has a $\chi_p^2$ distribution. The number of degrees of freedom $p = |\Theta| - |\Theta_0|$ is the "dimension" of the alternative relative to the null — the number of additional free parameters the alternative model gains. The theorem does not apply when the parameter lies on the boundary of its parameter space (e.g., testing $H_0: \sigma^2 = 0$ in a variance-components model), where the Taylor expansion breaks down; in those cases the null distribution is typically a mixture of chi-squared distributions rather than a single $\chi_p^2$. Similarly, in non-identifiable models where different parameter values give the same distribution, the MLE may not be unique and Wilks' theorem fails. These are genuine constraints on the applicability of the GLR approach.
[remark: Exact vs Asymptotic]
The $\chi^2$ distribution in Wilks' theorem is an asymptotic approximation valid as $n \to \infty$. In special cases — such as the normal-mean example above, where $2\log\Lambda = n(\bar{X} - \mu_0)^2/\sigma_0^2 \sim \chi_1^2$ exactly — the asymptotic distribution holds for all $n$. But for most models, the $\chi^2$ critical values are only approximate, and their accuracy depends on $n$ and the regularity of the model.
[/remark]
The GLR framework also handles composite problems where the parameter space is multi-dimensional. When the variance is unknown, we no longer have a UMP test for the mean, and the GLR statistic leads to a new and practically important test.
[example: GLR for Normal Mean — Unknown Variance, Leading to the t-Test]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown. Test $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$. Here $\Theta = \mathbb{R} \times (0, \infty)$ has $k = 2$ free parameters, and $\Theta_0 = \{\mu_0\} \times (0, \infty)$ has $k - p = 1$ free parameter, giving $p = 1$.
The unrestricted MLEs are $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = n^{-1}\sum(x_i - \bar{x})^2$. Under $H_0$, the MLE of $\sigma^2$ is $\hat{\sigma}_0^2 = n^{-1}\sum(x_i - \mu_0)^2$. Computing $2\log\Lambda$:
\begin{align*}
2\log\Lambda_{x}(H_0; H_1) &= n\log\hat{\sigma}_0^2 - n\log\hat{\sigma}^2 = n\log\!\left(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\right).
\end{align*}
Using $\sum(x_i - \mu_0)^2 = \sum(x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$, we get $\hat{\sigma}_0^2 = \hat{\sigma}^2 + (\bar{x} - \mu_0)^2$, and so
\begin{align*}
2\log\Lambda_{x}(H_0; H_1) = n\log\!\left(1 + \frac{(\bar{x} - \mu_0)^2}{\hat{\sigma}^2}\right).
\end{align*}
This is a monotone function of $(\bar{x} - \mu_0)^2/\hat{\sigma}^2$, and equivalently of
\begin{align*}
T = \frac{\bar{x} - \mu_0}{\hat{S}/\sqrt{n}},
\end{align*}
where $\hat{S}^2 = (n-1)^{-1}\sum(x_i - \bar{x})^2$ is the sample variance. Under $H_0$, $T \sim t_{n-1}$ (Student's $t$-distribution with $n-1$ degrees of freedom). The GLR test therefore rejects $H_0$ when $|T| > t_{n-1,\alpha/2}$, which is precisely the two-sided $t$-test. Wilks' theorem predicts the asymptotic distribution $2\log\Lambda \xrightarrow{d} \chi_1^2$ under $H_0$ (one degree of freedom, consistent with $p = 1$), but the exact finite-sample distribution of $T$ gives a sharper test for all $n$.
[/example]
[remark: UMP Tests and Composite Alternatives]
The comparison between the one-sided and two-sided normal-mean tests illustrates a general principle. For $H_1: \mu > \mu_0$, a UMP test exists because the Neyman-Pearson critical region for any fixed $\mu_1 > \mu_0$ is the same upper-tail region, independent of $\mu_1$. For $H_1: \mu \neq \mu_0$, no UMP test exists — the optimal test against $\mu_1 > \mu_0$ uses the upper tail, while the optimal test against $\mu_1 < \mu_0$ uses the lower tail, and no single critical region can simultaneously dominate both. The GLR approach sidesteps this by optimising likelihood over all alternatives at once, at the cost of moving from exact to asymptotic distribution theory.
[/remark]
### Summary and Forward Look
The tools developed in this section — the power function, size, UMP tests, the GLR statistic, and Wilks' theorem — provide a complete framework for composite hypothesis testing. For one-sided problems with monotone likelihood structure, the Neyman-Pearson approach extends cleanly to yield UMP tests. For general composite problems, the GLR statistic $\Lambda = \sup_\Theta f / \sup_{\Theta_0} f$ provides a principled test criterion, and Wilks' theorem guarantees that $2\log\Lambda \xrightarrow{d} \chi_p^2$ under $H_0$ with $p$ degrees of freedom equal to the dimension reduction imposed by $H_0$.
The next section applies this GLR machinery to two canonical problems: **goodness-of-fit** (testing whether data come from a specified parametric family) and **independence** (testing whether two categorical variables are associated). In both cases, the parameter space has a natural structure — a simplex of cell probabilities — and the GLR statistic reduces to expressions involving observed and expected counts. Wilks' theorem provides the $\chi^2$ null distribution that makes these tests practical, completing the bridge from the abstract theory of this section to the widely-used chi-squared tests of applied statistics.
## Tests of Goodness-of-Fit and Independence
The tests developed in the previous section — the Neyman–Pearson lemma, UMP tests, and the generalised likelihood ratio — were all framed around a numerical parameter such as a mean or variance. A broader and very common question is: does the data come from a particular distribution at all? A six-sided die might be suspected of being unfair; pea colours from a genetics experiment might deviate from Mendel's predicted ratios; birth months of university admissions might cluster in certain seasons. In each case, we want to test whether the entire probability vector governing multiple categories matches a hypothesised model. The framework for doing this is the chi-squared goodness-of-fit test.
### The Multinomial Model
Suppose observations fall into one of $k$ mutually exclusive and exhaustive categories. Out of $n$ independent draws, let $N_i$ denote the number of observations falling in category $i$, for $i = 1, \ldots, k$.
[definition: Multinomial Distribution]
Let $n \geq 1$ and $k \geq 2$ be integers, and let $p_1, \ldots, p_k \geq 0$ with $\sum_{i=1}^k p_i = 1$. A random vector $(N_1, \ldots, N_k)$ follows the **multinomial distribution** $\mathrm{Multinomial}(n; p_1, \ldots, p_k)$ if
\begin{align*}
\mathbb{P}(N_1 = n_1, \ldots, N_k = n_k) = \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_k^{n_k}
\end{align*}
for all non-negative integers $n_1, \ldots, n_k$ with $\sum_{i=1}^k n_i = n$. Each $N_i$ has marginal distribution $\mathrm{Binomial}(n, p_i)$, and $\mathbb{E}[N_i] = n p_i$.
[/definition]
This is the natural model for count data partitioned into categories: every individual observation independently falls into category $i$ with probability $p_i$. The multinomial is the multivariate generalisation of the binomial: with $k = 2$ categories it reduces to a binomial with parameters $n$ and $p_1$. The constraint $\sum p_i = 1$ means the probability vector lives in a $(k-1)$-dimensional simplex, a geometry that will determine the degrees of freedom in all the chi-squared tests below.
### Goodness-of-Fit to a Fully Specified Distribution
The simplest testing problem is the **fully specified null**: the values $p_1, \ldots, p_k$ are completely determined under $H_0$, with no free parameters. Formally:
\begin{align*}
H_0 &: p_i = \tilde{p}_i \text{ for all } i, \quad \text{(a specific, known probability vector)}, \\
H_1 &: p_i \text{ unrestricted (subject only to } p_i \geq 0,\, \textstyle\sum p_i = 1).
\end{align*}
Applying the generalised likelihood ratio approach from the previous section, the log-likelihood under $H_1$ is $\ell = \text{const} + \sum_{i=1}^k n_i \log p_i$. Maximising subject to $\sum p_i = 1$ via a Lagrange multiplier gives the MLE $\hat{p}_i = n_i / n$. The parameter space $\Theta_1$ has dimension $k - 1$ (the constraint $\sum p_i = 1$ removes one degree of freedom). Under $H_0$ the probabilities are fixed, so $\dim(\Theta_0) = 0$.
The generalised likelihood ratio statistic is therefore
\begin{align*}
2 \log \Lambda = 2 \sum_{i=1}^k n_i \log\!\left(\frac{n_i}{n \tilde{p}_i}\right),
\end{align*}
which by the Wilks theorem is asymptotically $\chi^2_{k-1}$ under $H_0$.
[example: Birth Month and University Admissions]
The following table gives the birth months of students admitted to Oxford and Cambridge in 2012 (total $n = 5300$):
| Sep | Oct | Nov | Dec | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 470 | 515 | 470 | 457 | 473 | 381 | 466 | 457 | 437 | 396 | 384 | 394 |
Is this consistent with no birth-month effect on admissions?
Under $H_0$ the expected count in month $i$ is $e_i = n \tilde{p}_i$, where $\tilde{p}_i$ is the proportion of all births in the UK in 1993–1994 falling in month $i$. Crucially, this is not $1/12$ — there is a well-documented "Christmas effect" causing excess September births — so using national birth statistics for $\tilde{p}_i$ is the correct baseline.
Computing the statistic gives $2 \log \Lambda = 44.86$. With $k - 1 = 11$ degrees of freedom, $\mathbb{P}(\chi^2_{11} > 44.86) = 3 \times 10^{-9}$. This $p$-value is far below even the $0.1\%$ threshold, providing very strong evidence against $H_0$: admission rates do depend on birth month.
The conventional significance thresholds $\alpha = 0.05$, $0.01$, and $0.001$ are loosely described as "evidence", "strong evidence", and "very strong evidence" against $H_0$, respectively.
[/example]
The statistic $2\log\Lambda$ in this example required knowing the hypothesised probabilities $\tilde{p}_i$ exactly. In practice one also needs to compute it — a task that calls for taking $k$ logarithms. Pearson proposed an algebraically simpler approximation that yields the same asymptotic distribution while replacing the logarithms with a sum of squared relative deviations. The approximation is justified by a Taylor expansion, and in most applications gives nearly identical numerical results.
### Pearson's Chi-Squared Test
The log-likelihood ratio statistic $2 \log \Lambda = 2 \sum n_i \log(n_i / e_i)$ is exact in the sense that it comes directly from the likelihood, but it admits a useful and computationally transparent approximation. Writing $o_i = n_i$ for the observed count and $e_i$ for the expected count under $H_0$, and letting $\delta_i = o_i - e_i$, the Taylor expansion $\log(1 + x) = x - x^2/2 + O(x^3)$ gives
\begin{align*}
2 \log \Lambda
&= 2\sum (e_i + \delta_i)\log\!\left(1 + \frac{\delta_i}{e_i}\right) \\
&= 2\sum (e_i + \delta_i)\!\left(\frac{\delta_i}{e_i} - \frac{\delta_i^2}{2e_i^2} + O(\delta_i^3)\right) \\
&= 2\sum \!\left(\delta_i + \frac{\delta_i^2}{e_i} - \frac{\delta_i^2}{2e_i} + O(\delta_i^3)\right).
\end{align*}
Since $\sum e_i = \sum o_i = n$ we have $\sum \delta_i = 0$, so the leading terms cancel and we obtain
\begin{align*}
2 \log \Lambda \approx \sum_{i=1}^k \frac{(o_i - e_i)^2}{e_i}.
\end{align*}
This approximation is Pearson's chi-squared statistic, and it has the same asymptotic chi-squared distribution as the exact log-likelihood ratio. In practice the two statistics give very similar numerical values when expected counts are not too small (a commonly cited rule of thumb is $e_i \geq 5$ for all $i$).
[quotetheorem:1432]
The degrees-of-freedom adjustment in the composite case deserves a moment of explanation. The $k-1$ degrees come from the $k$ categories minus the one constraint $\sum p_i = 1$. Each free parameter estimated from the data under $H_0$ "uses up" one additional degree of freedom, leaving $k - 1 - q$ in total. The rule of thumb $e_i \geq 5$ for all cells is worth taking seriously: when some $e_i$ is small (say $e_i < 2$), the chi-squared approximation can be badly wrong, and one should instead collapse adjacent categories or use an exact test. Fisher's exact test is the standard alternative for $2 \times 2$ tables with small expected counts. The examples below illustrate cases where all expected counts are comfortably large, so the approximation is reliable.
[example: Mendel's Pea Experiment]
Mendel crossed 556 smooth yellow peas with wrinkled green peas. Denoting the four phenotype counts as $(N_1, N_2, N_3, N_4)$ for smooth yellow, smooth green, wrinkled yellow, and wrinkled green, Mendelian genetics predicts
\begin{align*}
H_0: (p_1, p_2, p_3, p_4) = \!\left(\tfrac{9}{16}, \tfrac{3}{16}, \tfrac{3}{16}, \tfrac{1}{16}\right).
\end{align*}
The observed counts were $(315, 108, 102, 31)$ and the expected counts are
\begin{align*}
(e_1, e_2, e_3, e_4) = (312.75,\; 104.25,\; 104.25,\; 34.75).
\end{align*}
Since $H_0$ is fully specified, there are $k - 1 = 3$ degrees of freedom. Computing:
\begin{align*}
X^2 = \frac{(315-312.75)^2}{312.75} + \frac{(108-104.25)^2}{104.25} + \frac{(102-104.25)^2}{104.25} + \frac{(31-34.75)^2}{34.75} = 0.604.
\end{align*}
Since $\chi^2_3(0.05) = 7.815$, this value is nowhere near significant. The $p$-value is approximately $\mathbb{P}(\chi^2_3 > 0.6) \approx 0.90$, indicating an exceptionally close fit — so close, in fact, that statisticians have retrospectively questioned whether Mendel's recorded data were genuine.
[/example]
The previous example had a fully specified null, with no free parameters to estimate. The next example illustrates the composite goodness-of-fit setting, where $H_0$ constrains the cell probabilities to a parametric family without pinning them down completely. Estimating the parameter from the data costs one degree of freedom.
[example: Hardy–Weinberg Genotype Proportions]
Suppose individuals can be of one of three genotypes, with probabilities depending on a single allele frequency $\theta \in (0, 1)$:
\begin{align*}
p_1(\theta) = \theta^2, \quad p_2(\theta) = 2\theta(1-\theta), \quad p_3(\theta) = (1-\theta)^2.
\end{align*}
We test $H_0: p_i = p_i(\theta)$ for some unknown $\theta$ against $H_1$ with unrestricted $(p_1, p_2, p_3)$.
Under $H_0$, the log-likelihood is $\sum n_i \log p_i(\theta) = 2n_1 \log \theta + n_2 \log(2\theta(1-\theta)) + 2n_3 \log(1-\theta)$. Differentiating and solving gives $\hat{\theta} = (2n_1 + n_2)/(2n)$, the sample frequency of one allele.
Here $q = 1$, so the Pearson statistic $\sum (o_i - e_i)^2 / e_i$ with $e_i = n\,p_i(\hat{\theta})$ is referred to $\chi^2_{k-1-q} = \chi^2_1$.
[/example]
### Testing Independence in Contingency Tables
Goodness-of-fit tests handle the question of whether observations follow a specified distribution. A related but structurally different question arises when individuals are classified along two separate criteria simultaneously: are those two criteria independent? The natural framework is a contingency table.
[definition: Contingency Table]
A **contingency table** is a rectangular array in which the rows correspond to the levels of one categorical variable and the columns correspond to the levels of another, with each cell recording the number of observations falling into that combination of categories.
Formally, a two-way contingency table with $r$ rows and $c$ columns records counts $n_{ij}$ for $i = 1, \ldots, r$ and $j = 1, \ldots, c$, where $n_{ij}$ is the number of observations classified in row category $i$ and column category $j$. The row totals are $n_{i+} = \sum_j n_{ij}$, the column totals are $n_{+j} = \sum_i n_{ij}$, and the grand total is $n_{++} = n$.
[/definition]
Before setting up the independence test, it helps to look at an actual contingency table to develop intuition for what "dependence" looks like in tabular data. The following survey data will serve as the running example throughout this subsection.
[example: Car Size Choices]
A survey asked 500 people who had recently changed car about the size of their old and new vehicles, yielding the following two-way table:
| | New: Large | New: Medium | New: Small | **Row total** |
|---|---|---|---|---|
| **Old: Large** | 56 | 52 | 42 | **150** |
| **Old: Medium** | 50 | 83 | 67 | **200** |
| **Old: Small** | 18 | 51 | 81 | **150** |
| **Col total** | *124* | *186* | *190* | *500* |
Each person is classified by two criteria (old car size and new car size), making this a $3 \times 3$ contingency table.
[/example]
The fundamental null hypothesis for a contingency table is **independence**: row and column classifications are unrelated. Writing $p_{ij} = \mathbb{P}(\text{row } i, \text{col } j)$, $p_{i+} = \mathbb{P}(\text{row } i)$, and $p_{+j} = \mathbb{P}(\text{col } j)$, the hypotheses are
\begin{align*}
H_0 &: p_{ij} = p_{i+}\, p_{+j} \text{ for all } i, j, \\
H_1 &: p_{ij} \text{ unrestricted (subject to } p_{++} = 1,\, p_{ij} \geq 0).
\end{align*}
The full joint vector $(N_{11}, \ldots, N_{rc}) \sim \mathrm{Multinomial}(n; p_{11}, \ldots, p_{rc})$.
Under $H_1$ the MLEs are $\hat{p}_{ij} = n_{ij}/n$, with $\dim(\Theta_1) = rc - 1$.
Under $H_0$, the marginal probabilities $p_{i+}$ and $p_{+j}$ are estimated separately: $\hat{p}_{i+} = n_{i+}/n$ and $\hat{p}_{+j} = n_{+j}/n$. The parameters $(p_{1+}, \ldots, p_{r+})$ satisfy one constraint $\sum_i p_{i+} = 1$, contributing $r - 1$ free parameters. Similarly the column marginals contribute $c - 1$. Thus $\dim(\Theta_0) = (r-1) + (c-1)$.
The expected count in cell $(i,j)$ under $H_0$ is
\begin{align*}
e_{ij} = n\, \hat{p}_{i+}\hat{p}_{+j} = \frac{n_{i+}\, n_{+j}}{n}.
\end{align*}
The Pearson chi-squared statistic is
\begin{align*}
X^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(n_{ij} - e_{ij})^2}{e_{ij}},
\end{align*}
and the degrees of freedom are
\begin{align*}
\dim(\Theta_1) - \dim(\Theta_0) = (rc - 1) - (r - 1) - (c - 1) = (r-1)(c-1).
\end{align*}
Under $H_0$, $X^2 \xrightarrow{d} \chi^2_{(r-1)(c-1)}$ as $n \to \infty$.
[remark: Why $(r-1)(c-1)$ Degrees of Freedom]
An intuitive check: a $2 \times 2$ table has only $1$ degree of freedom, meaning that once you fix the four marginal totals, knowing any single cell determines all the rest. For a $3 \times 3$ table, fixing the margins leaves $4$ cells free — exactly $(3-1)(3-1) = 4$.
[/remark]
With the degrees-of-freedom formula established, we apply the test to the car survey data introduced above. The question is whether people's choice of new car size is independent of the size of the car they previously owned, or whether there is an association — a tendency to buy cars of a similar size.
[example: Independence of Car Size Choices]
Returning to the car survey, we test $H_0$: old and new car size are independent. The expected counts under $H_0$ are $e_{ij} = n_{i+} n_{+j} / 500$:
| | New: Large | New: Medium | New: Small |
|---|---|---|---|
| **Old: Large** | 37.2 | 55.8 | 57.0 |
| **Old: Medium** | 49.6 | 74.4 | 76.0 |
| **Old: Small** | 37.2 | 55.8 | 57.0 |
Note the row and column totals of the expected table match the observed marginals exactly — this is always the case. Visually, the observed table shows clear concentration on the diagonal (people tend to buy cars of similar size), while the expected table is much more spread out.
Computing the Pearson statistic:
\begin{align*}
X^2 = \sum_{i,j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 36.20.
\end{align*}
The degrees of freedom are $(3-1)(3-1) = 4$. From chi-squared tables, $\chi^2_4(0.01) = 13.28$, so $X^2 = 36.20$ is significant at the $1\%$ level. There is strong evidence against independence: people's new car size is associated with their previous car size.
[/example]
[explanation: Connecting the Two Tests]
Both the goodness-of-fit test and the independence test are applications of the same generalised likelihood ratio framework applied to multinomial data. In both cases the statistic $2 \log \Lambda$ is approximated by the Pearson form $\sum (o - e)^2 / e$, and the degrees of freedom count the difference in parameter space dimensions between the unrestricted and constrained models. What changes is how the expected counts are computed: in a goodness-of-fit test to a fully specified null, $e_i = n\tilde{p}_i$ uses externally given probabilities; in an independence test, $e_{ij} = n_{i+} n_{+j} / n$ uses empirically estimated marginals.
The degree-of-freedom adjustment when parameters are estimated under $H_0$ is also the same mechanism in both settings: each estimated parameter reduces the degrees of freedom by one.
[/explanation]
The tests developed here apply when a single sample of individuals is classified by one or two variables. The next section examines what happens when several independent samples are drawn from potentially different populations — the **test of homogeneity** — and also uncovers the precise connection between chi-squared tests and the confidence intervals encountered earlier in the course.
## Tests of Homogeneity and Confidence Intervals
The previous section asked whether a single population fits a specified distribution, or whether two categorical variables measured on the same sample are independent. Here we face two further questions that round out the picture. First, when we draw independent samples from several distinct populations, how do we test whether those populations share the same categorical distribution — a property called **homogeneity**? Second, we step back to ask a structural question: what is the relationship between the confidence intervals built in Chapter 1 and the hypothesis tests developed in this chapter? The answer turns out to be a precise duality, not a mere analogy.
### Tests of Homogeneity
When we study the effect of a drug at multiple dosage levels, or compare voter preferences across age groups, we are not observing a single population and asking whether two variables are correlated. Instead, we fix the sample sizes in advance for each group and observe one categorical response per subject. The question is whether the response distribution is the same across all groups.
[example: Drug Trial with Three Treatment Groups]
A clinical trial allocates 150 patients equally to three groups of 50: a placebo group, a half-dose group, and a full-dose group. The response to treatment is recorded as "Improved", "No difference", or "Worse". The observed counts are:
\begin{align*}
&\text{Observed } (o_{ij}):\\
&\begin{array}{l|ccc|c}
& \text{Improved} & \text{No difference} & \text{Worse} & \text{Total} \\\hline
\text{Placebo} & 18 & 17 & 15 & 50 \\
\text{Half dose} & 20 & 10 & 20 & 50 \\
\text{Full dose} & 25 & 13 & 12 & 50 \\\hline
\text{Total} & 63 & 40 & 47 & 150
\end{array}
\end{align*}
The row totals are fixed by the experimental design. This distinguishes the homogeneity setting from the independence setting of the previous section, where row totals were themselves random. The null hypothesis is that the probability of each outcome is the same in all three groups:
\begin{align*}
H_0: p_{1j} = p_{2j} = p_{3j} = p_j \quad \text{for } j = 1, 2, 3.
\end{align*}
[/example]
The general framework is as follows. We observe an $r \times c$ contingency table $(n_{ij})$, where row $i$ records the counts from an independent sample of size $n_{i+}$ drawn from population $i$. The row vectors follow independent multinomial distributions:
\begin{align*}
(N_{i1}, \ldots, N_{ic}) \sim \operatorname{Multinomial}(n_{i+},\, p_{i1}, \ldots, p_{ic}), \quad i = 1, \ldots, r.
\end{align*}
The hypotheses are
\begin{align*}
H_0&: p_{1j} = p_{2j} = \cdots = p_{rj} = p_j \quad \text{for all } j = 1, \ldots, c, \\
H_1&: p_{ij} \text{ unrestricted}.
\end{align*}
Under $H_1$, the log-likelihood is
\begin{align*}
\log L = \text{const} + \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \log p_{ij},
\end{align*}
and maximising subject to $\sum_j p_{ij} = 1$ for each $i$ gives $\hat{p}_{ij} = n_{ij}/n_{i+}$.
Under $H_0$, the common probability vector $(p_1, \ldots, p_c)$ satisfies $\sum_j p_j = 1$, and the log-likelihood becomes
\begin{align*}
\log L = \text{const} + \sum_{j=1}^{c} n_{+j} \log p_j,
\end{align*}
where $n_{+j} = \sum_i n_{ij}$ is the $j$-th column total. Maximising gives $\hat{p}_j = n_{+j}/n_{++}$, the overall proportion in category $j$.
The log-likelihood ratio statistic is therefore
\begin{align*}
2\log \Lambda = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \log \!\left(\frac{n_{ij}}{n_{i+} n_{+j}/n_{++}}\right).
\end{align*}
A remarkable fact emerges: this formula is identical to the one derived in the previous section for the test of independence, where the row totals were random. The algebraic form of the test statistic is the same regardless of whether the margins are fixed or random; only the sampling model differs. The dimension count confirms the degrees of freedom. The alternative has $|\Theta_1| = r(c-1)$ free parameters (one probability vector of length $c-1$ per row), while the null has $|\Theta_0| = c - 1$. So the degrees of freedom is $r(c-1) - (c-1) = (r-1)(c-1)$, and by Wilks' theorem,
\begin{align*}
2\log \Lambda \xrightarrow{d} \chi^2_{(r-1)(c-1)} \quad \text{under } H_0.
\end{align*}
We reject $H_0$ at approximate size $\alpha$ when $2\log \Lambda > \chi^2_{(r-1)(c-1)}(\alpha)$.
The Pearson approximation applies here too. Setting $o_{ij} = n_{ij}$ and $e_{ij} = n_{i+} n_{+j}/n_{++}$, the same Taylor expansion as in the goodness-of-fit case gives
\begin{align*}
2\log \Lambda \approx \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}.
\end{align*}
The expected count $e_{ij}$ represents what we would observe in cell $(i,j)$ if the null hypothesis of homogeneity held exactly: it is the product of the $i$-th row total and the $j$-th column proportion under the null.
[example: Drug Trial Completed]
Returning to the clinical trial, the expected counts under $H_0$ are
\begin{align*}
e_{ij} = \frac{50 \cdot n_{+j}}{150} = \frac{n_{+j}}{3},
\end{align*}
giving the same expected table for each row:
\begin{align*}
&\text{Expected } (e_{ij}):\\
&\begin{array}{l|ccc|c}
& \text{Improved} & \text{No difference} & \text{Worse} & \text{Total} \\\hline
\text{Placebo} & 21 & 13.3 & 15.7 & 50 \\
\text{Half dose} & 21 & 13.3 & 15.7 & 50 \\
\text{Full dose} & 21 & 13.3 & 15.7 & 50 \\\hline
\text{Total} & 63 & 40 & 47 & 150
\end{array}
\end{align*}
Computing the test statistic gives $2\log \Lambda = 5.129$, referred to $\chi^2_4$. Since $\chi^2_4(0.05) = 9.488$ from tables, the observed value is far from significant at the 5\% level. The Pearson statistic gives $\sum (o_{ij} - e_{ij})^2/e_{ij} = 5.173$, leading to the same conclusion. We find no evidence of a difference between the drug at either dose and the placebo.
[/example]
[remark: Homogeneity vs Independence]
The homogeneity test and the independence test from the previous section produce the same statistic and the same degrees of freedom, but they arise from different sampling designs. In the independence test, the grand total $n_{++}$ is fixed and all margins are random. In the homogeneity test, the row totals $n_{i+}$ are fixed by design. Despite this difference, the test procedure is identical. This is a genuine mathematical coincidence, not a shortcut: the sufficient statistics and the maximum likelihood estimates under $H_0$ happen to agree across both models.
[/remark]
### The Duality Between Confidence Intervals and Hypothesis Tests
We now revisit Chapter 1's confidence intervals from a new vantage point. The construction there — produce an interval $I(X)$ such that $\mathbb{P}(\theta \in I(X)) = 1 - \alpha$ — and the hypothesis test approach here — reject $H_0: \theta = \theta_0$ when the data fall in a critical region $C(\theta_0)$ — look like separate enterprises. They are not. Each is the exact inverse of the other.
The key concept bridging the two is the acceptance region.
[definition: Acceptance Region]
Let $C(\theta_0)$ be the critical region of a test of $H_0: \theta = \theta_0$. The **acceptance region** is
\begin{align*}
A(\theta_0) = \{x : x \notin C(\theta_0)\}.
\end{align*}
When $X \in A(\theta_0)$, we do not reject $H_0$. The term "acceptance" is historical; it means non-rejection, not affirmation that $H_0$ is true.
[/definition]
The duality theorem states that running a family of hypothesis tests (one for each candidate value $\theta_0$) is equivalent to constructing a confidence set, and vice versa.
[quotetheorem:1433]
[citeproof:1433]
The practical meaning is worth dwelling on. After observing data $X$, a 95\% confidence interval is precisely the set of parameter values $\theta_0$ that a size 5\% test would fail to reject. Equivalently, if you already have a test for every null hypothesis of the form $H_0: \theta = \theta_0$, the confidence interval is exactly the set of nulls you cannot reject. The two procedures carry identical information.
[example: Normal Mean, Known Variance]
Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, 1)$. We want a 95\% confidence set for $\mu$.
In the previous sections we derived a size 0.05 test of $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$: reject $H_0$ when $|\sqrt{n}(\bar{X} - \mu_0)| > 1.96$. The acceptance region is
\begin{align*}
A(\mu_0) = \{x : |\sqrt{n}(\bar{x} - \mu_0)| \leq 1.96\}.
\end{align*}
Applying the theorem, the confidence set is
\begin{align*}
I(X) = \{\mu : |\sqrt{n}(\bar{X} - \mu)| \leq 1.96\} = \left(\bar{X} - \frac{1.96}{\sqrt{n}},\; \bar{X} + \frac{1.96}{\sqrt{n}}\right).
\end{align*}
This is precisely the confidence interval one would build directly from the sampling distribution of $\bar{X}$. The duality theorem makes the derivation automatic: the hypothesis test immediately produces the confidence interval, without any separate construction.
[/example]
[remark: Confidence Sets Need Not Be Intervals]
The duality theorem produces confidence *sets*, not necessarily intervals. When the test's acceptance region has a complicated shape in data space, the resulting confidence set $I(X)$ can be a disconnected or non-convex subset of $\Theta$. For one-dimensional problems with monotone likelihood ratios, the confidence set turns out to be an interval, but this is a special property of those models, not a consequence of the duality itself.
[/remark]
The duality is more than a curiosity. It explains, for instance, why a likelihood ratio test at level $\alpha$ and the corresponding likelihood-based confidence region encode the same inferential content. It also provides a constructive method: in situations where it is easier to think about hypothesis testing than interval estimation (or vice versa), one can work in whichever framework is more tractable and translate the result automatically.
### Looking Ahead: Multivariate Normal Theory
The chi-squared tests of this chapter are appropriate for categorical data. But the most frequently encountered test statistics in practice — Student's $t$-statistic, the $F$-statistic in analysis of variance, the regression $t$-tests — are not chi-squared. They arise from a different distributional family, one built on normal data.
The foundation for all of these is **multivariate normal theory**: the joint distribution of a vector of correlated normal random variables, its geometry, and the distributions of quadratic forms derived from it. These tools are not merely technical machinery; they explain why the $t$-distribution has heavier tails than the normal, why sums of squared residuals in regression follow chi-squared distributions, and how to make exact (not just asymptotic) inferences about normal parameters. The next section develops this theory in full, providing the rigorous basis for the $t$-tests and linear model tests that follow.
## Multivariate Normal Theory
[motivation]
### Why This Machinery Is Needed
The t-test, which is the subject of the next section, rests on a pair of facts about a normal random sample $X_1, \ldots, X_n$: first, that $\bar{X}$ and the sample variance $S^2$ are independently distributed, and second, that $S_{XX}/\sigma^2$ follows a $\chi^2_{n-1}$ distribution. Neither fact is transparent from first principles. The sample mean and the sample variance are both computed from the same data, so there is no obvious reason why they should carry independent information. The $\chi^2_{n-1}$ degree count — losing one degree of freedom — is equally mysterious from a direct calculation.
Both results become natural the moment we view the $n$-dimensional observation vector $X = (X_1, \ldots, X_n)^\top$ as a single multivariate normal random vector and apply an orthogonal change of basis. The key insight is that orthogonal transformations preserve the multivariate normal structure, and they can be chosen to disentangle $\bar{X}$ from the residuals $X_i - \bar{X}$ completely. This section builds that framework. The F-test (coming later in the chapter) uses the same orthogonal-decomposition idea at a larger scale.
[/motivation]
### Random Vectors and Covariance Matrices
Before defining the multivariate normal, we set up the algebra of random vectors. A random vector $X = (X_1, \ldots, X_n)^\top$ has mean vector
\begin{align*}
\mu = \mathbb{E}[X] = (\mathbb{E}[X_1], \ldots, \mathbb{E}[X_n])^\top
\end{align*}
and covariance matrix
\begin{align*}
\operatorname{Cov}(X) = \mathbb{E}\bigl[(X - \mu)(X - \mu)^\top\bigr],
\end{align*}
whose $(i,j)$-entry is $\operatorname{Cov}(X_i, X_j)$. For any $m \times n$ matrix $A$, the transformation $AX$ satisfies
\begin{align*}
\mathbb{E}[AX] &= A\mu, \\
\operatorname{Cov}(AX) &= A\,\operatorname{Cov}(X)\,A^\top.
\end{align*}
The second identity follows because $AX - A\mu = A(X - \mu)$, so
\begin{align*}
\operatorname{Cov}(AX) &= \mathbb{E}\bigl[A(X - \mu)(X - \mu)^\top A^\top\bigr] = A\,\operatorname{Cov}(X)\,A^\top.
\end{align*}
This transformation law is the engine behind everything that follows. The covariance matrix $\Sigma = \operatorname{Cov}(X)$ is always symmetric and positive semi-definite: $t^\top \Sigma\, t = \operatorname{Var}(t^\top X) \geq 0$ for all $t \in \mathbb{R}^n$.
### The Multivariate Normal Distribution
[definition: Multivariate Normal Distribution]
A random vector $X \in \mathbb{R}^n$ has a **multivariate normal distribution** if for every $t \in \mathbb{R}^n$, the scalar random variable $t^\top X$ has a (univariate) normal distribution. If $\mathbb{E}[X] = \mu$ and $\operatorname{Cov}(X) = \Sigma$, we write $X \sim N_n(\mu, \Sigma)$.
[/definition]
This definition via linear combinations is more general than requiring a joint density, which would exclude the degenerate case where components are linearly dependent. The remark below clarifies why this projection-based characterisation is the right one to work with in practice.
[remark: Characterisation via Linear Combinations]
The definition says: $X$ is multivariate normal if and only if every one-dimensional projection $t^\top X$ is univariate normal. This is the standard and most useful characterisation — it is what we verify in proofs and what we use when applying orthogonal transformations.
[/remark]
The moment generating function (mgf) of $X \sim N_n(\mu, \Sigma)$ takes a clean form. Since $t^\top X$ is $N(t^\top \mu,\, t^\top \Sigma\, t)$, its mgf at $s = 1$ is
\begin{align*}
M_X(t) = \mathbb{E}[e^{t^\top X}] = \exp\!\left(t^\top \mu + \tfrac{1}{2}t^\top \Sigma\, t\right).
\end{align*}
When $\Sigma$ is positive definite (so $|\Sigma| > 0$), $X$ has a proper density:
\begin{align*}
f_X(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right).
\end{align*}
Two structural properties follow directly from the mgf characterisation:
**Marginals.** If $X \sim N_n(\mu, \Sigma)$ and we partition $X = (X_1^\top, X_2^\top)^\top$ with $X_i \in \mathbb{R}^{n_i}$, then $X_i \sim N_{n_i}(\mu_i, \Sigma_{ii})$, where $\mu_i$ and $\Sigma_{ii}$ are the corresponding blocks of $\mu$ and $\Sigma$.
**Independence iff zero covariance.** With the same partition, $X_1$ and $X_2$ are independent if and only if $\Sigma_{12} = 0$. The "if" direction follows because $M_X(t) = M_{X_1}(t_1)M_{X_2}(t_2)$ for all $t$ when $\Sigma_{12} = 0$, which is the mgf criterion for independence. The "only if" direction holds for any joint distribution (independence implies zero covariance), and zero covariance implies independence here precisely because we are in the normal setting.
[remark: Zero Covariance and Independence]
For general random variables, zero covariance does not imply independence. The multivariate normal is special: within the normal family, uncorrelated components are automatically independent. This is the fact that makes the argument of the next subsection work.
[/remark]
[quotetheorem:1434]
[citeproof:1434]
The key consequence of the special case: an orthogonal transformation of an isotropic normal (i.e., independent components with equal variance) remains an isotropic normal, but with its mean vector rotated. This is exactly what we use below. The following two-dimensional example makes the rotation visible and confirms that the transformed components are independent — a fact that is obvious from the covariance structure but easy to lose sight of geometrically.
[example: Bivariate Normal with Independent Components]
Let $X = (X_1, X_2)^\top \sim N_2(\mathbf{0}, I)$, so $X_1, X_2 \overset{\text{i.i.d.}}{\sim} N(0,1)$. Consider the rotation by angle $\pi/4$:
\begin{align*}
A = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}.
\end{align*}
Then $A$ is orthogonal, so $Y = AX \sim N_2(\mathbf{0}, I)$. The components $Y_1 = (X_1 + X_2)/\sqrt{2}$ and $Y_2 = (-X_1 + X_2)/\sqrt{2}$ are again i.i.d. $N(0,1)$. Rotating an isotropic normal produces another isotropic normal of the same shape.
[/example]
### Independence of the Sample Mean and Sample Variance
We are now ready to prove the central result of this section. This theorem is what makes the t-test a ratio of independent quantities.
[quotetheorem:1435]
[citeproof:1435]
### The Sum-of-Squares Decomposition
The proof above establishes a decomposition worth stating on its own, since it recurs throughout the chapter:
\begin{align*}
\sum_{i=1}^n (X_i - \mu)^2 = n(\bar{X} - \mu)^2 + \sum_{i=1}^n (X_i - \bar{X})^2.
\end{align*}
To verify: expand $\sum (X_i - \mu)^2 = \sum \bigl[(X_i - \bar{X}) + (\bar{X} - \mu)\bigr]^2$ and note that the cross-term $2\sum (X_i - \bar{X})(\bar{X} - \mu) = 2(\bar{X} - \mu)\sum(X_i - \bar{X}) = 0$ because $\sum(X_i - \bar{X}) = 0$ always.
Dividing by $\sigma^2$:
\begin{align*}
\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 = \frac{n(\bar{X} - \mu)^2}{\sigma^2} + \frac{S_{XX}}{\sigma^2}.
\end{align*}
Each side has a distributional identity. The left-hand side is $\chi^2_n$ (a sum of $n$ independent squared standard normals). The first term on the right is $\bigl((\bar{X}-\mu)/(\sigma/\sqrt{n})\bigr)^2 \sim \chi^2_1$. The second term is $\chi^2_{n-1}$. That the two terms on the right add up to $\chi^2_n$ and are independent (so their $\chi^2$ degrees of freedom add) is consistent: $1 + (n-1) = n$. This is a prototype of Cochran's theorem.
[example: Verifying the Decomposition for $n = 3$]
Take $n = 3$ with observations $x_1 = 2$, $x_2 = 4$, $x_3 = 6$. Then $\bar{x} = 4$.
\begin{align*}
\sum_{i=1}^3 (x_i - \bar{x})^2 &= (2-4)^2 + (4-4)^2 + (6-4)^2 = 4 + 0 + 4 = 8 = S_{xx}.
\end{align*}
If the true mean were $\mu = 3$:
\begin{align*}
\sum_{i=1}^3 (x_i - \mu)^2 &= 1 + 1 + 9 = 11.
\end{align*}
The decomposition gives $n(\bar{x} - \mu)^2 + S_{xx} = 3(4-3)^2 + 8 = 3 + 8 = 11$. The identity checks out: the total squared deviation from the true mean splits exactly into the squared deviation of the sample mean from $\mu$ (scaled by $n$) plus the internal spread.
[/example]
### Setting Up the t-Distribution
The independence theorem puts us in a position to form the statistic that will dominate the next section. When $\sigma^2$ is unknown (the common situation), we estimate it by the sample variance
\begin{align*}
S^2 = \frac{S_{XX}}{n-1} = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
\end{align*}
We know that $\bar{X} \sim N(\mu, \sigma^2/n)$, so
\begin{align*}
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1).
\end{align*}
We cannot use this directly because $\sigma$ is unknown. Replacing $\sigma$ by $S$ gives the statistic
\begin{align*}
T = \frac{\bar{X} - \mu}{S/\sqrt{n}}.
\end{align*}
This is a ratio of a standard normal to the square root of an independent $\chi^2_{n-1}/(n-1)$ quantity. That ratio is precisely the definition of Student's $t$-distribution with $n-1$ degrees of freedom. The independence of $\bar{X}$ and $S^2$ — which we have now proved — is what makes this ratio tractable: numerator and denominator are independent, so we can work out the distribution of $T$ explicitly. This is the subject of the next section.
## Student's t-Distribution
In all of the tests constructed so far in this chapter, the variance $\sigma^2$ appeared explicitly in the test statistic. We assumed it was known. In practice, this assumption is almost never satisfied: we observe $X_1, \ldots, X_n$ i.i.d. $N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown, and we want to test $H_0 : \mu = \mu_0$.
The natural pivot is $Z = \sqrt{n}(\bar X - \mu_0)/\sigma$, which has the $N(0,1)$ distribution under $H_0$. But $\sigma$ is unknown, so we cannot compute $Z$. The obvious remedy is to replace $\sigma$ by the sample standard deviation $\tilde\sigma = \sqrt{S_{XX}/(n-1)}$, the square root of the unbiased variance estimator. The resulting statistic is
\begin{align*}
T = \frac{\sqrt{n}(\bar X - \mu_0)}{\tilde\sigma}.
\end{align*}
This is no longer standard normal: the denominator is random, introducing additional variability. The question is: what distribution does $T$ have under $H_0$? The answer is Student's $t$-distribution, and the previous section has already assembled all the tools needed to identify it.
### The t-Distribution
[definition: Student's t-Distribution]
Let $Z \sim N(0,1)$ and $W \sim \chi^2_d$ be independent. The distribution of
\begin{align*}
T = \frac{Z}{\sqrt{W/d}}
\end{align*}
is called Student's $t$-distribution with $d$ degrees of freedom, written $T \sim t_d$.
[/definition]
The ratio $Z/\sqrt{W/d}$ divides a standard normal by the square root of a scaled chi-squared. When $d$ is large, $W/d \approx 1$ by the law of large numbers, so $T \approx Z$ and the $t$-distribution is close to $N(0,1)$. For small $d$ the denominator is highly variable, making $T$ spread out more than the normal.
The density of $t_d$ can be derived explicitly. It is
\begin{align*}
f_T(t) = \frac{\Gamma\!\left(\tfrac{d+1}{2}\right)}{\Gamma\!\left(\tfrac{d}{2}\right)} \cdot \frac{1}{\sqrt{\pi d}} \left(1 + \frac{t^2}{d}\right)^{-(d+1)/2}.
\end{align*}
This is symmetric about zero, bell-shaped, and unimodal. The key qualitative difference from the normal is the polynomial tail decay: $f_T(t) \sim t^{-(d+1)}$ for large $|t|$, compared with the Gaussian's exponential decay. Consequently, the $t$-distribution places more probability in the tails than the standard normal.
[remark: Moments of the t-Distribution]
For $d > 1$ the mean is $\mathbb{E}[T] = 0$ (by symmetry and integrability). For $d > 2$ the variance is $\operatorname{Var}(T) = d/(d-2)$, which exceeds $1$ and decreases to $1$ as $d \to \infty$. For $d = 2$ the variance is infinite, and for $d = 1$ even the mean is undefined: the $t_1$ distribution is the Cauchy distribution, which has no moments.
[/remark]
The heavy tails of the $t$-distribution also make it more sensitive to outliers than the normal distribution. A single unusually large observation inflates $\tilde{\sigma}$ and simultaneously pulls $\bar{X}$ away from $\mu_0$, but the exact $t$-distribution accounts for this correctly only when the data are truly normal — if the underlying distribution has heavier tails than the Gaussian, the $t$-statistic will not follow $t_{n-1}$ and the test will be miscalibrated. This sensitivity to non-normality is most acute when $n$ is small.
[remark: Convergence to Normal]
As $d \to \infty$, $W/d \xrightarrow{\mathbb{P}} 1$ by the law of large numbers applied to the chi-squared, so $T \xrightarrow{d} N(0,1)$. In practice, the $t$-distribution is well-approximated by the normal for $d \geq 30$.
[/remark]
We write $t_d(\alpha)$ for the upper $100\alpha\%$ point of $t_d$, meaning $\mathbb{P}(T > t_d(\alpha)) = \alpha$. By symmetry, $t_d(\alpha) = -t_d(1-\alpha)$.
### The t-Test for an Unknown Mean
We now identify the distribution of the natural test statistic when $\sigma^2$ is unknown.
[quotetheorem:1436]
[citeproof:1436]
The independence of $\bar X$ and $S_{XX}$, established via Cochran's theorem in the previous section, is exactly what makes this argument work. Without it, we could not factor the joint distribution and the $t$ pivot would not hold. It bears emphasising that this independence is specific to normal data: for a sample from an exponential distribution, for instance, $\bar{X}$ and $S^2$ are dependent (the exponential distribution has no location-scale symmetry that forces independence of mean and variance), and replacing $\sigma$ by $\tilde{\sigma}$ in the pivot would not yield a $t$-distribution. The $t$-test is an exact test only under the normality assumption; for other distributions it is approximate, and its validity depends on the sample size and the degree of departure from normality.
### Testing and Confidence Intervals
Suppose we observe $x_1, \ldots, x_n$ and wish to test $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ at level $\alpha$.
Under $H_0$, the theorem gives $T \sim t_{n-1}$. We reject $H_0$ when $|T|$ is large. The two-sided level-$\alpha$ test rejects when
\begin{align*}
|T| = \frac{\sqrt{n}|\bar X - \mu_0|}{\tilde\sigma} > t_{n-1}\!\left(\tfrac{\alpha}{2}\right).
\end{align*}
Inverting this acceptance region gives the confidence interval for $\mu$. Since
\begin{align*}
\mathbb{P}\!\left(-t_{n-1}\!\left(\tfrac{\alpha}{2}\right) \leq \frac{\sqrt{n}(\bar X - \mu)}{\tilde\sigma} \leq t_{n-1}\!\left(\tfrac{\alpha}{2}\right)\right) = 1 - \alpha,
\end{align*}
the $100(1-\alpha)\%$ confidence interval for $\mu$ is
\begin{align*}
\bar X \pm \frac{\tilde\sigma}{\sqrt{n}}\, t_{n-1}\!\left(\tfrac{\alpha}{2}\right).
\end{align*}
This has the same form as the normal-based interval $\bar X \pm \sigma z_{\alpha/2}/\sqrt{n}$, but with two modifications: $\sigma$ is replaced by the estimate $\tilde\sigma$, and the normal quantile $z_{\alpha/2}$ is replaced by the larger $t$-quantile $t_{n-1}(\alpha/2)$. The interval is wider, correctly reflecting the additional uncertainty from not knowing $\sigma^2$.
[example: One-Sample t-Test]
A laboratory measures the concentration of a chemical compound in $n = 16$ samples, obtaining $\bar x = 12.3$ mg/L and $\tilde\sigma = 2.0$ mg/L. The standard specification is $\mu_0 = 11.5$ mg/L. Test $H_0: \mu = 11.5$ against $H_1: \mu \neq 11.5$ at level $5\%$.
The test statistic is
\begin{align*}
T = \frac{\sqrt{16}(12.3 - 11.5)}{2.0} = \frac{4 \times 0.8}{2.0} = 1.6.
\end{align*}
With $n - 1 = 15$ degrees of freedom, the critical value is $t_{15}(0.025) \approx 2.131$. Since $|T| = 1.6 < 2.131$, we do not reject $H_0$ at the $5\%$ level. The data are consistent with the standard specification.
The $95\%$ confidence interval for $\mu$ is
\begin{align*}
12.3 \pm \frac{2.0}{\sqrt{16}} \times 2.131 = 12.3 \pm 1.066,
\end{align*}
giving the interval $(11.23, 13.37)$ mg/L.
[/example]
[example: Comparison with Normal Test]
Suppose in the same problem we had somehow known $\sigma = 2.0$ mg/L. The normal test statistic would be identical ($Z = 1.6$), but the critical value would be $z_{0.025} = 1.960$. The conclusion is the same here, but the normal test uses a slightly smaller critical value — it gives a narrower rejection region, which would be anti-conservative if $\sigma$ were not truly known. The $t$-test is the honest procedure when $\sigma^2$ is estimated from data.
[/example]
[explanation: Why the Heavier Tail Matters]
When we replace $\sigma$ by $\tilde\sigma$, we introduce randomness into the denominator. On days when $\tilde\sigma$ happens to be small, the statistic $T$ is inflated even if $\bar X$ is close to $\mu_0$; on days when $\tilde\sigma$ is large, $T$ is deflated. The effect is a symmetric spreading of the distribution relative to $N(0,1)$, captured precisely by the heavier polynomial tail of $t_{n-1}$.
The $t$-distribution is not merely a correction factor — it is the exact distribution of $T$ under normality. No approximation is involved. This exactness was the reason William Gosset (writing under the pseudonym "Student" in 1908) derived the distribution: he needed reliable inference for small brewery samples where $n$ was too small to ignore the variability of $\tilde\sigma$.
[/explanation]
### One-Sided t-Tests
For a one-sided alternative $H_1: \mu > \mu_0$, the level-$\alpha$ test rejects when
\begin{align*}
T = \frac{\sqrt{n}(\bar X - \mu_0)}{\tilde\sigma} > t_{n-1}(\alpha).
\end{align*}
The corresponding one-sided $100(1-\alpha)\%$ lower confidence bound for $\mu$ is $\bar X - \tilde\sigma\, t_{n-1}(\alpha)/\sqrt{n}$. The case $H_1: \mu < \mu_0$ is symmetric, rejecting when $T < -t_{n-1}(\alpha)$.
---
This completes the core machinery of hypothesis testing. We began the chapter with the Neyman–Pearson lemma, which characterises the most powerful test for a simple hypothesis. We then developed the likelihood ratio principle as a general recipe, and examined its application to tests on Gaussian means. The $t$-test is the culmination of this programme under the realistic condition that $\sigma^2$ is unknown: independence of $\bar X$ and $S_{XX}$ (proved via the spectral decomposition of quadratic forms) combines with the chi-squared distribution of $S_{XX}/\sigma^2$ to produce an exact pivot.
The next chapter, Linear Models, generalises this framework substantially. There, the data are not a single Gaussian sample but a vector $y \in \mathbb{R}^n$ modelled as $y = X\beta + \varepsilon$ where $X$ is a known design matrix, $\beta$ is a vector of unknown parameters, and $\varepsilon \sim N(\mathbf{0}, \sigma^2 I)$. Testing a single linear constraint on $\beta$ with unknown $\sigma^2$ produces — by the same logic of normal-divided-by-chi-squared — a $t$-statistic. Testing multiple constraints simultaneously produces an $F$-statistic, a ratio of two independent chi-squared quantities. The $t$-test for the mean is therefore not a special trick but the simplest instance of a general geometric structure that runs through all of classical linear statistical inference.
The t-test and chi-squared tests we derived for single samples are powerful, but real data often arises under varying conditions—multiple groups, measurements over time, or with contextual covariates. The linear model unifies and extends single-sample inference into a general framework where regression coefficients, group comparisons, and ANOVA tests all emerge from the same least-squares machinery.
# 3. Linear Models
With the tools of Chapter 2 — hypothesis testing, t-tests, likelihood ratio statistics, and Wilks' theorem — in hand, we now turn to a different kind of question: not whether a parameter equals some fixed value, but how one quantity depends on others. This chapter develops the theory of linear models, the foundational framework for regression analysis. We begin with the general setup, derive the least squares estimator, and establish the Gauss–Markov theorem, which tells us precisely why least squares is the right approach. The rest of the chapter then specialises this machinery to simple and multiple regression, model selection, and analysis of variance.
## Linear Models
### The Model
In many practical situations we observe a response variable — the time to run two miles, say, or an insurance claim rate — and we suspect it depends linearly on one or more explanatory variables (covariates). A linear model formalises this suspicion. The key phrase is *linear in the parameters*: the covariates may enter in complicated ways (as squares, products, or indicator functions), but the unknown coefficients $\beta_j$ always appear linearly. This restriction is precisely what makes the model tractable.
Before writing down the formal definition, note the roles of the different quantities. The design matrix $X$ encodes all the covariate information — it is fixed and fully known before we run any estimation. The parameter vector $\beta$ is unknown and is what we wish to estimate. The error vector $\varepsilon$ captures everything the model does not explain: measurement noise, omitted covariates, and genuine randomness.
[definition: Linear Model]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A **linear model** consists of:
- A response vector $Y: \Omega \to \mathbb{R}^n$, a random vector of $n$ observations.
- A **design matrix** $X \in \mathbb{R}^{n \times p}$, a known non-random matrix with $n > p$ and $\operatorname{rank}(X) = p$.
- An unknown parameter vector $\beta \in \mathbb{R}^p$.
- A random error vector $\varepsilon: \Omega \to \mathbb{R}^n$ satisfying $\mathbb{E}[\varepsilon] = \mathbf{0}$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$ for some unknown $\sigma^2 > 0$.
The model equation is
\begin{align*}
Y = X\beta + \varepsilon.
\end{align*}
The entries of $X$ are $x_{ij}$, where $i = 1, \ldots, n$ indexes observations and $j = 1, \ldots, p$ indexes covariates. The $i$th observation satisfies
\begin{align*}
Y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i.
\end{align*}
The condition $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$ means the errors are uncorrelated and homoscedastic (equal variance). No normality is assumed at this stage.
[/definition]
Immediate consequences of the definition: since $X$ and $\beta$ are non-random, $\mathbb{E}[Y] = X\beta$ and $\operatorname{Cov}(Y) = \sigma^2 I_n$. The observations $Y_1, \ldots, Y_n$ are uncorrelated with common variance $\sigma^2$.
[example: Motor Insurance]
How do motor insurance claim rates depend on the age and sex of the driver and their residential region? Here $n$ is the number of policyholder groups, $p$ covariates might include age bracket, a binary sex indicator, and regional dummy variables, and each row of $X$ records the covariate values for one group. The response $Y_i$ is the observed claim rate for group $i$.
[/example]
These two examples illustrate the breadth of problems that the linear model framework unifies. The motor insurance problem has a categorical structure — the explanatory variables are group indicators — while the oxygen/running-time problem uses a single continuous covariate. In both cases, the model equation $Y = X\beta + \varepsilon$ and the formula $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ apply without modification.
[example: Oxygen Uptake and Running Time]
For each of 24 males, the maximum volume of oxygen uptake (mL/kg/min) and the time to run two miles (seconds) were recorded. Let $Y_i$ be the running time and $x_i$ the oxygen uptake for individual $i$. A plausible model is
\begin{align*}
Y_i = a + b x_i + \varepsilon_i, \quad i = 1, \ldots, 24,
\end{align*}
where $\varepsilon_i$ are uncorrelated errors with mean zero and variance $\sigma^2$. In matrix form, with $\beta = (a, b)^\top$,
\begin{align*}
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{24} \end{pmatrix} \in \mathbb{R}^{24 \times 2}.
\end{align*}
The column of ones in $X$ is responsible for the intercept $a$: this is the standard device for including a constant term in a linear model.
[/example]
### Least Squares and the Normal Equations
We need to estimate $\beta$ from the data $Y$. Since we do not assume normality, maximum likelihood is not immediately available. The natural alternative is to choose $\hat{\beta}$ to minimise the total squared discrepancy between the observed responses and the fitted values $X\beta$. This is the **principle of least squares**.
[definition: Least Squares Estimator]
In the linear model $Y = X\beta + \varepsilon$, the **least squares estimator** $\hat{\beta}$ of $\beta$ is the value that minimises
\begin{align*}
S(\beta) = \|Y - X\beta\|^2 = (Y - X\beta)^\top(Y - X\beta) = \sum_{i=1}^{n}\Bigl(Y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2.
\end{align*}
[/definition]
Geometrically, $S(\beta)$ is the squared Euclidean distance from $Y$ to the point $X\beta$ in $\mathbb{R}^n$. As $\beta$ ranges over $\mathbb{R}^p$, the vector $X\beta$ traces out the column space $\mathcal{C}(X) \subseteq \mathbb{R}^n$. Minimising $S(\beta)$ amounts to finding the point in $\mathcal{C}(X)$ closest to $Y$ — that is, the orthogonal projection of $Y$ onto $\mathcal{C}(X)$.
To find $\hat{\beta}$ analytically, differentiate $S(\beta)$ with respect to each $\beta_k$ and set the partial derivatives to zero. For the $k$th component:
\begin{align*}
\frac{\partial S}{\partial \beta_k}\bigg|_{\beta = \hat{\beta}} = -2\sum_{i=1}^{n} x_{ik}\Bigl(Y_i - \sum_{j=1}^{p} x_{ij}\hat{\beta}_j\Bigr) = 0.
\end{align*}
Collecting these $p$ equations in matrix form yields the **normal equations**.
[quotetheorem:501]
Since $X$ has rank $p$, the matrix $X^\top X \in \mathbb{R}^{p \times p}$ is positive definite: for any $t \neq \mathbf{0}$ in $\mathbb{R}^p$,
\begin{align*}
t^\top X^\top X t = \|Xt\|^2 > 0,
\end{align*}
where the strict inequality uses the fact that $X$ has full column rank (so $Xt = \mathbf{0}$ implies $t = \mathbf{0}$). Positive definiteness implies invertibility, so the normal equations have a unique solution:
\begin{align*}
\hat{\beta} = (X^\top X)^{-1} X^\top Y.
\end{align*}
This is linear in $Y$, a fact that will be central to the Gauss–Markov theorem.
The estimator $\hat{\beta}$ is unbiased:
\begin{align*}
\mathbb{E}[\hat{\beta}] = (X^\top X)^{-1} X^\top \mathbb{E}[Y] = (X^\top X)^{-1} X^\top X \beta = \beta.
\end{align*}
Its covariance matrix is
\begin{align*}
\operatorname{Cov}(\hat{\beta}) = (X^\top X)^{-1} X^\top \operatorname{Cov}(Y) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1},
\end{align*}
using $\operatorname{Cov}(Y) = \sigma^2 I_n$.
### The Geometric Interpretation
The fitted values $\hat{Y} = X\hat{\beta}$ have a clean geometric description. The hat matrix
\begin{align*}
H = X(X^\top X)^{-1} X^\top \in \mathbb{R}^{n \times n}
\end{align*}
is the orthogonal projection onto $\mathcal{C}(X)$, and $\hat{Y} = HY$. The residual vector $\hat{\varepsilon} = Y - \hat{Y} = (I_n - H)Y$ is the component of $Y$ orthogonal to $\mathcal{C}(X)$.
[remark: Orthogonality of Residuals]
The normal equations $X^\top X\hat{\beta} = X^\top Y$ can be rewritten as $X^\top(Y - X\hat{\beta}) = X^\top \hat{\varepsilon} = \mathbf{0}$. This says every column of $X$ is orthogonal to the residual vector: the residuals are orthogonal to the fitted values. The Pythagorean decomposition $\|Y\|^2 = \|\hat{Y}\|^2 + \|\hat{\varepsilon}\|^2$ follows immediately from this orthogonality, and it underpins the analysis-of-variance decomposition developed later in the chapter.
[/remark]
### The Gauss–Markov Theorem
We have shown that $\hat{\beta}$ is linear and unbiased. But could there be a better linear unbiased estimator — one with smaller variance? The Gauss–Markov theorem answers this decisively: no.
To make "smaller variance" precise for a vector estimator, we compare covariance matrices. We say a linear unbiased estimator $\tilde{\beta}$ is **better** than $\hat{\beta}$ if $\operatorname{Cov}(\tilde{\beta}) - \operatorname{Cov}(\hat{\beta})$ is positive semi-definite, meaning every linear combination $c^\top \tilde{\beta}$ has variance at least as large as $c^\top \hat{\beta}$.
[quotetheorem:1437]
[citeproof:1437]
The theorem requires no normality assumption — only uncorrelated errors with equal variance. This makes it remarkably robust. When normality does hold, $\hat{\beta}$ is also the maximum likelihood estimator, and its finite-sample optimality extends beyond the class of linear estimators (by the Cramér–Rao lower bound). But the Gauss–Markov result stands on its own, independently of any distributional assumption.
[remark: Homoscedasticity is Essential]
If the errors have unequal variances — the heteroscedastic case $\operatorname{Cov}(\varepsilon) = \sigma^2 \Sigma$ for some known positive definite $\Sigma \neq I_n$ — then the ordinary least squares estimator is no longer BLUE. The remedy is **generalised least squares**: pre-multiply the model by $\Sigma^{-1/2}$ to restore homoscedasticity, then apply ordinary least squares. The resulting estimator $\hat{\beta}_{\mathrm{GLS}} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} Y$ is BLUE in the generalised setting.
[/remark]
The general theory developed here provides the foundation for everything that follows. The simplest and most important special case — where there is a single covariate and an intercept — is **simple linear regression**, to which we now turn.
## Simple Linear Regression
The general linear model $Y = X\beta + \varepsilon$ introduced in the previous section encompasses a vast family of regression problems. The most important special case — and the one that appears most frequently in practice — is **simple linear regression**: a single continuous predictor $x$ and an intercept. Understanding this case fully, with explicit closed-form estimators, is essential before tackling more complex designs. It also illustrates how the abstract matrix machinery of least squares collapses into something transparent and computable.
### The Model and a Useful Reparametrisation
[definition: Simple Linear Regression Model]
The **simple linear regression model** is
\begin{align*}
Y_i &= \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \dots, n,
\end{align*}
where $x_1, \dots, x_n \in \mathbb{R}$ are fixed (non-random) predictor values, $\alpha, \beta \in \mathbb{R}$ are unknown parameters, and $\varepsilon_1, \dots, \varepsilon_n$ are uncorrelated errors with $\mathbb{E}[\varepsilon_i] = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2$.
[/definition]
Working directly with $\alpha$ as the intercept at $x = 0$ leads to correlation between $\hat\alpha$ and $\hat\beta$, which complicates both computation and interpretation. A reparametrisation centred at $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$ eliminates this nuisance. Write $\alpha' = \alpha + \beta \bar x$, so that
\begin{align*}
Y_i &= \alpha' + \beta(x_i - \bar x) + \varepsilon_i.
\end{align*}
Since $\sum_{i=1}^n (x_i - \bar x) = 0$ by definition of the mean, the two regressors — the constant $1$ and the centred values $(x_i - \bar x)$ — are orthogonal. This orthogonality is the key structural feature that makes the algebra tractable.
### Deriving the Least Squares Estimators
In matrix form, the centred model is $Y = X\beta' + \varepsilon$ with
\begin{align*}
X &=
\begin{pmatrix}
1 & x_1 - \bar x \\
\vdots & \vdots \\
1 & x_n - \bar x
\end{pmatrix}, \qquad
\beta' = \begin{pmatrix} \alpha' \\ \beta \end{pmatrix}.
\end{align*}
The orthogonality condition $\sum (x_i - \bar x) = 0$ makes $X^\top X$ diagonal:
\begin{align*}
X^\top X &=
\begin{pmatrix}
n & 0 \\
0 & S_{xx}
\end{pmatrix},
\end{align*}
where $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2$ is the total sum of squares of the predictors. Because $X^\top X$ is diagonal, its inverse is immediate:
\begin{align*}
(X^\top X)^{-1} &=
\begin{pmatrix}
\tfrac{1}{n} & 0 \\
0 & \tfrac{1}{S_{xx}}
\end{pmatrix}.
\end{align*}
The least squares estimator $\hat\beta' = (X^\top X)^{-1} X^\top Y$ then gives
\begin{align*}
\hat\beta' &=
\begin{pmatrix}
\bar Y \\
\dfrac{S_{xY}}{S_{xx}}
\end{pmatrix},
\end{align*}
where $S_{xY} = \sum_{i=1}^n Y_i(x_i - \bar x) = \sum_{i=1}^n (Y_i - \bar Y)(x_i - \bar x)$ (the last equality holds because $\sum \bar Y (x_i - \bar x) = 0$).
[definition: Least Squares Estimators for Simple Linear Regression]
The least squares estimators of the intercept and slope in the simple linear regression model are
\begin{align*}
\hat\alpha' &= \bar Y, \\
\hat\beta &= \frac{S_{xY}}{S_{xx}} = \frac{\displaystyle\sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y)}{\displaystyle\sum_{i=1}^n (x_i - \bar x)^2}.
\end{align*}
The original intercept estimator is recovered as $\hat\alpha = \hat\alpha' - \hat\beta\,\bar x = \bar Y - \hat\beta\,\bar x$.
[/definition]
### The Fitted Line Passes Through the Centroid
A fundamental geometric fact follows immediately from these formulas.
[quotetheorem:1438]
[citeproof:1438]
This is not a coincidence — it is a direct consequence of the normal equations, which force the residuals to sum to zero. The property fails for regression without an intercept: if the column of ones is omitted from $X$, the normal equations no longer guarantee $\sum_i R_i = 0$, and the fitted line need not pass through $(\bar{x}, \bar{Y})$. The centroid property is therefore specific to models that include a constant term.
### Connection to the Correlation Coefficient
The slope estimator $\hat\beta$ has a pleasing factorisation. Since $\sum \bar Y(x_i - \bar x) = 0$, we can write
\begin{align*}
\hat\beta = \frac{S_{xY}}{S_{xx}} = \frac{\displaystyle\sum_i (Y_i - \bar Y)(x_i - \bar x)}{\displaystyle\sum_i (x_i - \bar x)^2}.
\end{align*}
Multiplying and dividing by $\sqrt{S_{yy}}$ (where $S_{yy} = \sum(Y_i - \bar Y)^2$) gives
\begin{align*}
\hat\beta &= \frac{\displaystyle\sum_i (Y_i - \bar Y)(x_i - \bar x)}{\sqrt{S_{xx}\, S_{yy}}} \times \sqrt{\frac{S_{yy}}{S_{xx}}} = r \sqrt{\frac{S_{yy}}{S_{xx}}},
\end{align*}
where
\begin{align*}
r &= \frac{S_{xY}}{\sqrt{S_{xx}\, S_{yy}}} = \frac{\displaystyle\sum_i (x_i - \bar x)(Y_i - \bar Y)}{\sqrt{\displaystyle\sum_i (x_i - \bar x)^2 \sum_i (Y_i - \bar Y)^2}}
\end{align*}
is the **Pearson product-moment correlation coefficient**. The slope is the correlation times the ratio of empirical standard deviations of the responses to the predictors.
[remark: Gradient Invariance Under Centring]
The estimated gradient $\hat\beta = r\sqrt{S_{yy}/S_{xx}}$ is the same whether the $x$-values are centred or not: centring affects only the intercept estimator. This is because shifting a predictor by a constant changes the intercept but leaves the slope unchanged.
[/remark]
### Variances of the Estimators
Since $\hat\beta' = (X^\top X)^{-1} X^\top Y$ and $\operatorname{Cov}(Y) = \sigma^2 I$, the covariance matrix of $\hat\beta'$ is
\begin{align*}
\operatorname{Cov}(\hat\beta') &= (X^\top X)^{-1}\sigma^2 = \sigma^2
\begin{pmatrix}
\tfrac{1}{n} & 0 \\
0 & \tfrac{1}{S_{xx}}
\end{pmatrix}.
\end{align*}
Reading off the diagonal entries:
\begin{align*}
\operatorname{Var}(\hat\alpha') &= \frac{\sigma^2}{n}, \qquad \operatorname{Var}(\hat\beta) = \frac{\sigma^2}{S_{xx}}.
\end{align*}
The two estimators are **uncorrelated** — a direct benefit of the centred parametrisation, which made $X^\top X$ diagonal. The variance of $\hat\beta$ decreases as the spread of the $x$-values increases (large $S_{xx}$), confirming the intuition that a wider range of predictor values yields a more precise slope estimate.
These variance formulas hold under the model assumptions $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Cov}(\varepsilon) = \sigma^2 I$ alone — no normality is required.
### Residuals and the Decomposition of Variation
Once the parameters are estimated, the residuals $R_i = Y_i - \hat{Y}_i$ encode what the fitted model fails to explain. The central question of this subsection is how the total variation $\sum_i(Y_i - \bar{Y})^2$ splits into an explained part and a residual part — a decomposition that underpins both the coefficient of determination and the ANOVA table.
[definition: Residuals and Fitted Values]
The **fitted values** are $\hat Y_i = \hat\alpha + \hat\beta\, x_i$. The **residuals** are
\begin{align*}
R_i &= Y_i - \hat Y_i, \quad i = 1, \dots, n.
\end{align*}
In vector form, $R = Y - \hat Y = Y - X\hat\beta'$ is the residual vector, and the **residual sum of squares** is
\begin{align*}
\mathrm{RSS} &= \|R\|^2 = \sum_{i=1}^n R_i^2 = (Y - X\hat\beta')^\top (Y - X\hat\beta').
\end{align*}
[/definition]
The residuals satisfy two exact constraints: $\sum_i R_i = 0$ (residuals sum to zero) and $\sum_i x_i R_i = 0$ (residuals are uncorrelated with the predictor). Both follow from the normal equations $X^\top R = 0$.
To measure how well the model explains the data, one decomposes total variation into explained and unexplained parts. Define:
\begin{align*}
\mathrm{SS}_{\mathrm{tot}} &= \sum_{i=1}^n (Y_i - \bar Y)^2 = S_{yy}, \\
\mathrm{SS}_{\mathrm{reg}} &= \sum_{i=1}^n (\hat Y_i - \bar Y)^2, \\
\mathrm{RSS} &= \sum_{i=1}^n (Y_i - \hat Y_i)^2.
\end{align*}
[quotetheorem:1439]
[citeproof:1439]
The decomposition $\mathrm{SS}_{\mathrm{tot}} = \mathrm{SS}_{\mathrm{reg}} + \mathrm{RSS}$ partitions total variation into two orthogonal parts: one explained by the fitted line, one left in the residuals. This suggests a natural dimensionless summary of how much the regression reduces the unexplained variation.
[definition: Coefficient of Determination]
The **coefficient of determination** is
\begin{align*}
R^2 &= \frac{\mathrm{SS}_{\mathrm{reg}}}{\mathrm{SS}_{\mathrm{tot}}} = 1 - \frac{\mathrm{RSS}}{\mathrm{SS}_{\mathrm{tot}}}.
\end{align*}
It measures the proportion of total variation in the response explained by the regression. We have $0 \le R^2 \le 1$, with $R^2 = 1$ meaning a perfect fit and $R^2 = 0$ meaning the predictor explains none of the variation.
[/definition]
In simple linear regression, a direct computation gives $R^2 = r^2$, the square of the Pearson correlation coefficient. This provides a direct link between the strength of the linear association and the goodness of fit: when $r = \pm 1$ the data lie exactly on a line and $R^2 = 1$, while when $r = 0$ the predictor is uncorrelated with the response and $R^2 = 0$. The worked example below reads off $R^2$ directly from a reported correlation, illustrating how the two quantities carry identical information about linear fit.
[remark: $R^2$ and Model Quality]
A high $R^2$ does not by itself confirm that the linear model is appropriate. It is possible to have $R^2$ close to $1$ for a non-linear relationship (if the sample is small), or to have $R^2$ close to $0$ for a correctly-specified model with a large noise variance $\sigma^2$. $R^2$ should always be interpreted alongside residual plots and domain knowledge.
[/remark]
### Worked Example
[example: Oxygen Uptake and Exercise Time]
A dataset records oxygen uptake (in ml/min, variable $Y$) and exercise time (in seconds, variable $x$) for 24 subjects. The summary statistics are:
\begin{align*}
\bar y &= 826.5, \quad S_{xx} = 783.5 = 28.0^2, \quad S_{xy} = -10{,}077, \\
S_{yy} &= 444^2, \quad r = -0.81.
\end{align*}
The estimated slope is
\begin{align*}
\hat\beta &= \frac{S_{xy}}{S_{xx}} = \frac{-10{,}077}{783.5} = -12.9 \text{ ml/min per second}.
\end{align*}
The negative slope indicates that oxygen uptake decreases with exercise time in this dataset. The estimated intercept (in the centred parametrisation) is $\hat\alpha' = \bar y = 826.5$, so the fitted line in centred form is
\begin{align*}
\hat Y &= 826.5 - 12.9\,(x - \bar x).
\end{align*}
The coefficient of determination is $R^2 = r^2 = (-0.81)^2 = 0.656$, meaning roughly $65.6\%$ of the variation in oxygen uptake is explained by exercise time.
[/example]
### Towards Normal Assumptions
Everything derived so far — the estimators $\hat\alpha'$ and $\hat\beta$, their variances, and the decomposition of sums of squares — has been obtained using only the second-moment assumptions $\mathbb{E}[\varepsilon_i] = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2$. The Gauss-Markov theorem guarantees that $\hat\beta$ is BLUE under these conditions alone.
However, to perform hypothesis tests and construct confidence intervals, we need the **full sampling distribution** of $\hat\beta$, not just its mean and variance. The next section adds the normality assumption $\varepsilon \sim N(0, \sigma^2 I)$. Under this assumption, $\hat\beta$ is not merely unbiased with a known variance — it is exactly normally distributed, enabling $t$-tests for individual coefficients, $F$-tests for the overall regression, and exact confidence intervals. The sum-of-squares decomposition above will reappear as the basis for the $F$-statistic in the analysis of variance table.
## Linear Models with Normal Assumptions
The Gauss-Markov theorem, established in the previous section, is a powerful result: it tells us that the least squares estimator $\hat\beta$ is the best linear unbiased estimator of $\beta$, requiring only that the errors have zero mean and constant variance. But "best linear unbiased" is a statement about expected values and variances — it says nothing about the shape of the sampling distribution of $\hat\beta$. Without knowing that distribution, we cannot form confidence intervals, we cannot carry out hypothesis tests, and we cannot say anything about the probability that $\hat\beta$ lies within any particular range of the true $\beta$. To unlock these tools, we must make a distributional assumption.
The natural assumption in classical statistics is normality. We shall see that adding $\varepsilon \sim N(0, \sigma^2 I)$ to the linear model yields three beautiful consequences: the maximum likelihood estimator (introduced in Chapter 1) coincides exactly with the least squares estimator; $\hat\beta$ itself is normally distributed in a way that pins down both its mean and covariance; and the residual sum of squares, once scaled by $\sigma^2$, follows a chi-squared distribution — independently of $\hat\beta$. These facts, taken together, are the foundation for inference in linear models.
### Motivating Example
Before introducing the normal assumption formally, it helps to see the kind of problem it was designed to handle.
[example: Resistivity of Silicon Wafers]
Suppose we want to measure the electrical resistivity of silicon wafers. We have five instruments and five wafers, and each instrument measures each wafer once (so $n = 25$ observations in total). We want to know whether the instruments give consistent readings — that is, whether there is any systematic instrument effect.
Let $Y_{i,j}$ be the resistivity of the $j$th wafer as measured by instrument $i$, for $i, j = 1, \dots, 5$. The readings (in $\Omega \cdot \text{cm}$) are:
| Instrument | Wafer 1 | Wafer 2 | Wafer 3 | Wafer 4 | Wafer 5 |
|---|---|---|---|---|---|
| 1 | 130.5 | 112.4 | 118.9 | 125.7 | 134.0 |
| 2 | 130.4 | 138.2 | 116.7 | 132.6 | 104.2 |
| 3 | 113.0 | 120.5 | 128.9 | 103.4 | 118.1 |
| 4 | 128.0 | 117.5 | 114.9 | 114.9 | 98.8 |
| 5 | 121.2 | 110.5 | 118.5 | 100.5 | 120.9 |
A natural model assigns each instrument its own mean level:
\begin{align*}
Y_{i,j} = \mu_i + \varepsilon_{i,j},
\end{align*}
where the $\varepsilon_{i,j}$ are independent errors with $\mathbb{E}[\varepsilon_{i,j}] = 0$ and $\operatorname{Var}(\varepsilon_{i,j}) = \sigma^2$. This fits the general linear model $Y = X\beta + \varepsilon$ with the design matrix $X$ being a block of five indicator columns (each column flags one instrument), $\beta = (\mu_1, \dots, \mu_5)^\top$, and $n = 25$, $p = 5$.
The matrix $X^\top X = 5I_5$, so $(X^\top X)^{-1} = \frac{1}{5}I_5$. The least squares estimator gives
\begin{align*}
\hat\mu_i = \bar{Y}_i \qquad (i = 1, \dots, 5),
\end{align*}
the sample mean within each instrument group. The residual sum of squares is
\begin{align*}
\mathrm{RSS} = \sum_{i=1}^{5}\sum_{j=1}^{5}(Y_{i,j} - \bar{Y}_i)^2 = 2170,
\end{align*}
and this quantity has $n - p = 25 - 5 = 20$ degrees of freedom.
Note that everything computed so far uses only zero-mean, constant-variance errors — no normality yet. The question is: what further conclusions can we draw once we add a normal distribution for the errors?
[/example]
### The Normal Linear Model
What additional structure does assuming normality impose, and what does it purchase? Without a distributional assumption, the Gauss-Markov theorem gives the mean and covariance of $\hat{\beta}$ but nothing about its shape. Adding $\varepsilon \sim N_n(\mathbf{0}, \sigma^2 I)$ converts those second-moment statements into exact distributional ones — the foundation for confidence intervals and hypothesis tests with correct finite-sample coverage.
[definition: Normal Linear Model]
The **normal linear model** is
\begin{align*}
Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(\mathbf{0}, \sigma^2 I),
\end{align*}
where $X$ is a known $n \times p$ design matrix with $\operatorname{rank}(X) = p < n$, $\beta \in \mathbb{R}^p$ is an unknown parameter vector, and $\sigma^2 > 0$ is an unknown error variance. Since $X\beta$ is a fixed vector, this is equivalent to
\begin{align*}
Y \sim N_n(X\beta, \sigma^2 I).
\end{align*}
[/definition]
Every result from the Gauss-Markov setting still holds here, since the normal model is a special case of the general linear model. We now derive the additional results that normality affords.
### The MLE Under Normality
Given that errors are normally distributed, maximum likelihood provides an independent route to estimating $\beta$. The question is whether the MLE agrees with the least squares estimator or leads somewhere different — and if they agree, what that coincidence reveals about the role of the Gaussian assumption.
[quotetheorem:1440]
[citeproof:1440]
This coincidence of least squares and MLE is not an accident. Historically, Gauss introduced the normal distribution precisely so that the two criteria would agree — he designed the error distribution to retroactively justify the least squares method. The lesson is that least squares is the natural estimation procedure whenever errors are approximately normally distributed.
[remark: Bias of the MLE for $\sigma^2$]
The MLE $\hat\sigma^2 = \mathrm{RSS}/n$ is biased for $\sigma^2$. The unbiased estimator is $\tilde\sigma^2 = \mathrm{RSS}/(n-p)$, which divides by the degrees of freedom $n - p$ rather than the sample size $n$. We shall see why shortly: $\mathrm{RSS}/\sigma^2$ follows a $\chi^2_{n-p}$ distribution, whose mean is $n-p$, so $\mathbb{E}[\mathrm{RSS}] = \sigma^2(n-p)$ and dividing by $n-p$ gives an unbiased estimator. The quantity $\tilde\sigma = \sqrt{\mathrm{RSS}/(n-p)}$ is called the **residual standard error** on $n-p$ degrees of freedom.
[/remark]
### Key Lemma: Quadratic Forms in Normal Vectors
Before deriving the distributions of $\hat\beta$ and $\mathrm{RSS}$, we need a lemma about chi-squared distributions that handles the core algebraic machinery.
[quotetheorem:1441]
[citeproof:1441]
Both symmetry and idempotence are essential: if $A$ is symmetric but not idempotent, its eigenvalues are not confined to $\{0, 1\}$ and $Z^\top AZ$ is not a sum of independent $\chi^2_1$ variables; if $A$ is idempotent but not symmetric, the diagonalisation by an orthogonal matrix — which decouples the components — fails.
This lemma will be applied twice: once with $A = P$ (the hat matrix) to handle $\hat\beta$, and once with $A = I - P$ to handle $\mathrm{RSS}$.
### Distribution of $\hat\beta$
Gauss-Markov established that $\hat{\beta}$ has mean $\beta$ and covariance $\sigma^2(X^\top X)^{-1}$ under weak assumptions, but without normality those are the only moment statements available. Under the normal model, the linearity of $\hat{\beta}$ in $Y$ propagates normality through the estimator, pinning down its entire distribution rather than just its first two moments.
[quotetheorem:1442]
[citeproof:1442]
Normality is the key driver: Gauss-Markov gave the mean and covariance of $\hat{\beta}$ without any distributional assumption on the errors, but the full $N_p$ distribution follows from the Gaussian structure being inherited through the linear map $C = (X^\top X)^{-1}X^\top$.
This result is the distributional upgrade of the Gauss-Markov theorem. We already knew $\hat\beta$ was unbiased with covariance $\sigma^2(X^\top X)^{-1}$; now we know its entire distribution. In particular, each component $\hat\beta_j \sim N(\beta_j, \sigma^2 [(X^\top X)^{-1}]_{jj})$, which immediately gives $t$-based confidence intervals for individual regression coefficients once $\sigma^2$ is estimated — but that requires independence from the residuals, which we establish below.
### Residual Sum of Squares
With the distribution of $\hat{\beta}$ established, the natural next question is: what governs $\mathrm{RSS} = \|Y - X\hat{\beta}\|^2$? Since $\mathrm{RSS}$ is used to estimate $\sigma^2$ and forms the denominator of $t$- and $F$-statistics, knowing its exact distribution is essential for valid inference.
[quotetheorem:1443]
[citeproof:1443]
The degrees of freedom $n - p$ has a clean geometric meaning: we start with $n$ observations and lose $p$ degrees of freedom by estimating the $p$ parameters in $\beta$, leaving $n - p$ for the residuals. This matches our intuition that after fitting a model with $p$ parameters, only $n - p$ pieces of information remain to estimate $\sigma^2$. This chi-squared distribution is specific to normality: under non-normal errors, $\mathrm{RSS}/\sigma^2$ still has mean $n - p$ (unbiasedness of $\tilde{\sigma}^2$ holds generally) and the correct variance, but the chi-squared shape — and hence the exact critical values of subsequent $t$- and $F$-statistics — requires the Gaussian assumption.
### Independence of $\hat\beta$ and the Residuals
The distributions of $\hat\beta$ and $\mathrm{RSS}$ individually are useful, but their real power comes from the fact that they are independent — a fact that is specific to the normal model and does not hold for general distributions.
[quotetheorem:1444]
[citeproof:1444]
The independence of $\hat\beta$ and $\mathrm{RSS}$ is special to the normal model. It comes from the orthogonal geometry of projection: the vector $\hatY = PY$ lives in the column space of $X$, while the residuals $R = (I - P)Y$ live in its orthogonal complement. For normal errors, orthogonality of Gaussian components implies independence; for non-normal errors, orthogonality only implies zero correlation.
### Summary and Implications
We can now assemble the three results into a single statement.
[quotetheorem:1445]
These three results work in concert: property (1) supplies a normal pivot for each $\hat{\beta}_j$, property (2) provides a chi-squared denominator for estimating $\sigma^2$, and property (3) guarantees the pivot and the denominator are independent so their ratio is exactly $t$-distributed. The explanation below traces how they combine into the workhorse inference tools of regression.
[explanation: What These Results Enable]
These three properties are the engine behind all classical inference in linear models. From property (1), we know that each component $\hat\beta_j$ is normal with mean $\beta_j$ and variance $\sigma^2[(X^\top X)^{-1}]_{jj}$. If $\sigma^2$ were known, this would immediately give $z$-tests and confidence intervals.
Since $\sigma^2$ is unknown, we replace it by $\tilde\sigma^2 = \mathrm{RSS}/(n-p)$. Property (2) says $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-p}$, so $\tilde\sigma^2/\sigma^2 \sim \chi^2_{n-p}/(n-p)$. Property (3) says this estimator is independent of $\hat\beta$.
Combining properties (1) and (2), we can form the ratio
\begin{align*}
\frac{\hat\beta_j - \beta_j}{\tilde\sigma\,\sqrt{[(X^\top X)^{-1}]_{jj}}} \sim t_{n-p},
\end{align*}
which follows a $t$-distribution with $n-p$ degrees of freedom: the numerator is standard normal (from property 1), the denominator involves $\sqrt{\chi^2_{n-p}/(n-p)}$ (from property 2), and the two are independent (from property 3). This pivot is the basis for $t$-tests and confidence intervals on individual regression coefficients.
For testing hypotheses about groups of coefficients simultaneously — for example, whether all coefficients in a sub-model are zero — we need a ratio of two independent chi-squared variables. This ratio follows the $F$-distribution, which is the subject of the next section.
[/explanation]
## The F-Distribution
When testing whether a specified subset of regression coefficients is zero, the natural approach is to compare two sums of squares: one from the full model and one from a reduced model. The question is what distribution governs their ratio. Both sums of squares follow (scaled) chi-squared distributions, and they turn out to be independent — so the ratio itself follows a known distribution, the F-distribution, which is precisely what we need to carry out the test.
[definition: F-Distribution]
Suppose $U$ and $V$ are independent with $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$. The random variable
\begin{align*}
F = \frac{U/d_1}{V/d_2}
\end{align*}
is said to have an **F-distribution** on $d_1$ and $d_2$ degrees of freedom, written $F \sim F_{d_1, d_2}$. The integer $d_1$ is called the **numerator degrees of freedom** and $d_2$ the **denominator degrees of freedom**.
[/definition]
Since $U$ and $V$ have means $d_1$ and $d_2$ respectively, the scaled quantities $U/d_1$ and $V/d_2$ each have mean $1$, so an $F$-distributed variable is typically close to $1$ when the null hypothesis holds.
### Basic Properties
Three properties of the F-distribution are used repeatedly in what follows.
**Mean.** For $d_2 > 2$,
\begin{align*}
\mathbb{E}[F_{d_1, d_2}] = \frac{d_2}{d_2 - 2}.
\end{align*}
The mean exceeds $1$ and approaches $1$ as $d_2 \to \infty$, consistent with the heuristic above. The mean is undefined for $d_2 \le 2$.
**Reciprocal symmetry.** If $X \sim F_{d_1, d_2}$, then $1/X \sim F_{d_2, d_1}$. This follows directly from the definition by swapping the roles of $U$ and $V$. In practice it allows one to read off lower-tail critical values from upper-tail tables: since $\mathbb{P}(F_{d_1, d_2} < c) = \mathbb{P}(F_{d_2, d_1} > 1/c)$, the lower $5\%$ point of $F_{d_1, d_2}$ is the reciprocal of the upper $5\%$ point of $F_{d_2, d_1}$.
**Relation to the t-distribution.** If $Y \sim t_d$, then $Y^2 \sim F_{1, d}$. To see this, recall that $Y = Z/\sqrt{W/d}$ where $Z \sim N(0,1)$ and $W \sim \chi^2_d$ are independent. Then $Y^2 = Z^2/d \cdot d/W = (Z^2/1)/(W/d)$, which is the ratio of a $\chi^2_1$ variable divided by $1$ to an independent $\chi^2_d$ variable divided by $d$ — exactly an $F_{1,d}$ variable.
### Critical Values and Table Lookup
We write $F_{d_1, d_2}(\alpha)$ for the upper $100\alpha\%$ point of the $F_{d_1, d_2}$ distribution, so that $\mathbb{P}(X > F_{d_1, d_2}(\alpha)) = \alpha$ for $X \sim F_{d_1, d_2}$.
[example: Reading F-Tables]
Standard tables typically give upper $5\%$ and $1\%$ points for selected degree-of-freedom combinations. Suppose $X \sim F_{3, 20}$ and we want the lower $5\%$ point, i.e. the value $c$ such that $\mathbb{P}(X < c) = 0.05$.
By reciprocal symmetry, $\mathbb{P}(X < c) = \mathbb{P}(1/X > 1/c) = 0.05$, and since $1/X \sim F_{20, 3}$, we need the upper $5\%$ point of $F_{20, 3}$, which a table gives as $F_{20,3}(0.05) = 8.66$. Therefore $c = 1/8.66 \approx 0.115$.
This symmetry argument means that a one-sided table of upper critical values is sufficient to obtain any critical value needed.
[/example]
The F-distribution will appear in the next section as the null distribution of the test statistic for hypotheses of the form $H_0 \colon \beta_j = 0$ for a subset of coefficients. The key structural fact is that under $H_0$, the numerator and denominator of the test statistic are independent chi-squared variables (divided by their degrees of freedom), so their ratio follows an $F$-distribution exactly — enabling exact inference rather than asymptotic approximation.
## Inference for $\beta$
With the joint distribution of $\hat{\beta}$ and the residual sum of squares now in hand, we can move from estimation to inference. We know how $\hat{\beta}$ concentrates around $\beta$ globally, but often the practical question is narrower: is a particular predictor significant? Does a specific coefficient differ from zero? Answering these questions requires a distribution for individual components $\hat\beta_j$, and the key insight is that the same independence result that gave us the F-statistic also delivers an exact $t$-distribution for each coefficient.
### The t-Statistic for a Single Coefficient
Knowing that $\hat{\beta}_j \sim N(\beta_j, \sigma^2[(X^\top X)^{-1}]_{jj})$ would give a $z$-test for $H_0: \beta_j = 0$ if $\sigma^2$ were known. Since it must be estimated from the data, replacing $\sigma$ by $\tilde{\sigma}$ changes the pivot from a standard normal to a $t$-distribution — incurring $n - p$ rather than infinitely many degrees of freedom.
[definition: Standard Error of a Coefficient]
Given the OLS estimator $\hat{\beta}$ in the model $Y = X\beta + \varepsilon$ with $\varepsilon \sim N_n(0, \sigma^2 I)$, the **standard error** of the $j$-th component $\hat\beta_j$ is
\begin{align*}
\operatorname{se}(\hat\beta_j) = \sqrt{\tilde\sigma^2 \left(X^\top X\right)^{-1}_{jj}},
\end{align*}
where $\tilde\sigma^2 = \mathrm{RSS}/(n - p)$ is the residual mean square and $\left(X^\top X\right)^{-1}_{jj}$ denotes the $j$-th diagonal entry of $(X^\top X)^{-1}$.
[/definition]
The standard error is the practical version of the true standard deviation $\sqrt{\sigma^2 (X^\top X)^{-1}_{jj}}$: both measure the precision of $\hat\beta_j$, but only $\operatorname{se}(\hat\beta_j)$ is computable from the data since $\sigma^2$ is unknown.
[quotetheorem:1446]
[citeproof:1446]
### Confidence Intervals
The $t_{n-p}$ pivot translates directly into an exact confidence interval for each coefficient. Write $t_{n-p}(\alpha/2)$ for the upper $\alpha/2$ quantile of the $t_{n-p}$ distribution. Then a $100(1-\alpha)\%$ confidence interval for $\beta_j$ is
\begin{align*}
\hat\beta_j \;\pm\; \operatorname{se}(\hat\beta_j)\, t_{n-p}\!\left(\tfrac\alpha2\right).
\end{align*}
[remark: Exactness]
These intervals are exact under the Gaussian error assumption, not approximate. For large $n - p$ the $t_{n-p}$ quantile is close to the corresponding normal quantile, recovering the familiar $\pm 1.96$ rule at the 95% level. For small samples, using the $t$ quantile rather than the normal quantile is essential to maintain the nominal coverage.
[/remark]
### Significance Testing for Individual Predictors
The most common test in regression output is $H_0: \beta_j = 0$ against the two-sided alternative $H_1: \beta_j \neq 0$. Under $H_0$, the theorem gives
\begin{align*}
T_j = \frac{\hat\beta_j}{\operatorname{se}(\hat\beta_j)} \sim t_{n-p}.
\end{align*}
We reject $H_0$ at level $\alpha$ when $|T_j| > t_{n-p}(\alpha/2)$. The $p$-value reported in standard regression output is $\mathbb P(|T| \ge |T_j|)$ where $T \sim t_{n-p}$. A small $p$-value indicates that the data are unlikely under $H_0$, providing evidence that predictor $j$ contributes to the response after accounting for all other predictors in the model.
[remark: Relationship to the F-Test]
The F-test from the previous section tests all coefficients simultaneously, while the $t$-tests here test one coefficient at a time, conditional on the others remaining in the model. Neither subsumes the other in general: it is possible for the overall F-test to be significant while individual $t$-tests are not (due to collinearity), and vice versa.
[/remark]
With the theory of inference for individual coefficients established, the next section returns to simple linear regression — the special case $p = 2$ with a single predictor — and shows explicitly what these confidence intervals and $t$-tests look like when the design matrix has the concrete form $X = [\mathbf{1} \; x]$.
## Inference in Simple Linear Regression
Simple linear regression is the special case $p = 2$ of the general linear model, with design matrix $X = (\mathbf{1}_n, x)$ where $x = (x_1, \ldots, x_n)^\top$ is the vector of covariate values. The previous section derived the joint sampling distributions of $\hat{\beta}$ and $\hat{\sigma}^2$ in full generality. Here we specialise those results to $p = 2$, obtaining explicit formulae for the distributions of the intercept and slope estimators, and then use them to construct t-tests and confidence intervals.
### Sampling Distributions of $\hat{a}'$ and $\hat{b}$
It is convenient to work with the centred parametrisation introduced in Section 2: write
\begin{align*}
Y_i &= a' + b(x_i - \bar{x}) + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n,
\end{align*}
where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. Centring removes the correlation between the OLS estimators $\hat{a}'$ and $\hat{b}$, so that $X^\top X$ is diagonal and the algebra becomes transparent.
The OLS estimators are $\hat{a}' = \bar{Y}$ and $\hat{b} = S_{xY}/S_{xx}$, where $S_{xY} = \sum_i (x_i - \bar{x})Y_i$ and $S_{xx} = \sum_i (x_i - \bar{x})^2$. Applying the general distributional result $\hat{\beta} \sim N(\beta, \sigma^2(X^\top X)^{-1})$ to this case gives:
\begin{align*}
\hat{a}' &\sim N\!\left(a',\, \frac{\sigma^2}{n}\right), \\
\hat{b} &\sim N\!\left(b,\, \frac{\sigma^2}{S_{xx}}\right).
\end{align*}
The residual sum of squares satisfies $\mathrm{RSS} = \sum_i (Y_i - \hat{Y}_i)^2 \sim \sigma^2 \chi^2_{n-2}$, and the pair $(\hat{a}', \hat{b})$ is independent of $\hat{\sigma}^2 = \mathrm{RSS}/n$ (the MLE of $\sigma^2$) and of the unbiased estimator $\tilde{\sigma}^2 = \mathrm{RSS}/(n-2)$.
[remark: Biased vs Unbiased Estimator of $\sigma^2$]
Dividing RSS by $n$ gives the MLE $\hat{\sigma}^2$, which is biased. Dividing by $n - p = n - 2$ gives the unbiased estimator $\tilde{\sigma}^2$. For inference — t-statistics and confidence intervals — one uses $\tilde{\sigma}^2$ because it yields exact t-distributions with $n - 2$ degrees of freedom.
[/remark]
### t-Statistics and Hypothesis Tests
Since $\hat{b} \sim N(b, \sigma^2/S_{xx})$ and $\tilde{\sigma}^2$ is independent of $\hat{b}$ with $(n-2)\tilde{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$, the standardised slope estimator
\begin{align*}
T_b &= \frac{\hat{b} - b}{\mathrm{SE}(\hat{b})}, \qquad \mathrm{SE}(\hat{b}) = \sqrt{\frac{\tilde{\sigma}^2}{S_{xx}}},
\end{align*}
follows a $t_{n-2}$ distribution under the model. The analogous statistic for the intercept is
\begin{align*}
T_{a'} &= \frac{\hat{a}' - a'}{\mathrm{SE}(\hat{a}')}, \qquad \mathrm{SE}(\hat{a}') = \sqrt{\frac{\tilde{\sigma}^2}{n}},
\end{align*}
also distributed as $t_{n-2}$.
The most common hypothesis test for simple linear regression is $H_0: b = 0$ against $H_1: b \neq 0$. The null says there is no linear relationship between the response and the covariate. Under $H_0$ the test statistic reduces to
\begin{align*}
T_b \big|_{H_0} &= \frac{\hat{b}}{\mathrm{SE}(\hat{b})} \sim t_{n-2},
\end{align*}
and we reject $H_0$ at size $\alpha$ when $|T_b| > t_{n-2}(\alpha/2)$, the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution.
### Confidence Intervals for $a'$ and $b$
A $100(1-\alpha)\%$ confidence interval for the slope is
\begin{align*}
\hat{b} \pm \mathrm{SE}(\hat{b}) \cdot t_{n-2}(\alpha/2),
\end{align*}
and an analogous interval holds for the intercept $a'$ with $\mathrm{SE}(\hat{a}') = \tilde{\sigma}/\sqrt{n}$. Importantly, if the confidence interval for $b$ does not contain zero, this is equivalent to rejecting $H_0: b = 0$ at the corresponding significance level — there is an exact duality between the two-sided t-test and the confidence interval.
[example: Oxygen Consumption and Time]
We return to the oxygen/time data ($n = 24$, $p = 2$). The unbiased variance estimate is
\begin{align*}
\tilde{\sigma}^2 &= \frac{\mathrm{RSS}}{n - p} = \frac{67968}{24 - 2} = 3089 \approx 55.6^2.
\end{align*}
The standard error of the slope estimator is
\begin{align*}
\mathrm{SE}(\hat{b}) &= \sqrt{\frac{\tilde{\sigma}^2}{S_{xx}}} = \frac{55.6}{28.0} = 1.99.
\end{align*}
A 95% confidence interval for $b$ has endpoints
\begin{align*}
\hat{b} \pm \mathrm{SE}(\hat{b}) \cdot t_{22}(0.025) &= -12.9 \pm 1.99 \times 2.07 = (-17.0,\; -8.8).
\end{align*}
Since this interval does not contain zero, we reject $H_0: b = 0$ at the 5% level. We can confirm directly: the test statistic is
\begin{align*}
T_b &= \frac{\hat{b}}{\mathrm{SE}(\hat{b})} = \frac{-12.9}{1.99} = -6.48,
\end{align*}
which satisfies $|{-6.48}| = 6.48 > 2.07 = t_{22}(0.025)$, so we reject $H_0$. The data provide strong evidence of a linear relationship between oxygen consumption and time.
[/example]
Having established the distributions of $\hat{a}'$ and $\hat{b}$ and seen how to test for a linear relationship, the next natural question concerns prediction: given the fitted line, what can we say about the expected response $\mathbb{E}[Y]$ at a new covariate value $x^*$? This requires a confidence interval not for a single parameter, but for the mean response at a specified point — the topic of the next section.
## Expected Response at $x^*$
Once we have fitted the linear model $Y = X\beta + \varepsilon$ and obtained the least-squares estimator $\hat{\beta}$, the natural next step is to use the model predictively. Suppose $x^*$ is a new vector of values for the explanatory variables. Two distinct questions arise, and it is important not to conflate them.
The first question is: *what is the expected response at $x^*$?* This is a question about a parameter, namely the quantity $\mathbb{E}[Y^* \mid x^*] = x^{*\top}\beta$. We want a confidence interval for this expected value.
The second question is: *what will the next observation at $x^*$ actually be?* The next observation is $Y^* = x^{*\top}\beta + \varepsilon^*$, where $\varepsilon^* \sim N(0, \sigma^2)$ is a fresh error term, independent of all past data. We want a prediction interval for this random quantity. Because $Y^*$ inherits both the uncertainty in estimating $\beta$ and the additional randomness of $\varepsilon^*$, the prediction interval will always be wider than the confidence interval for the expected response.
### The Confidence Interval for the Expected Response
We estimate $x^{*\top}\beta$ by $x^{*\top}\hat{\beta}$. To build a confidence interval we need the distribution of this estimator.
Since $\hat{\beta} \sim N(\beta, \sigma^2(X^\top X)^{-1})$, applying the linear transformation $x^{*\top}$ gives
\begin{align*}
x^{*\top}\hat{\beta} &\sim N\!\left(x^{*\top}\beta,\; \sigma^2x^{*\top}(X^\top X)^{-1}x^*\right).
\end{align*}
It is convenient to abbreviate the scalar variance factor. Set
\begin{align*}
\tau^2 &= x^{*\top}(X^\top X)^{-1}x^*.
\end{align*}
Then $x^{*\top}(\hat{\beta} - \beta) \sim N(0, \sigma^2\tau^2)$. Replacing $\sigma$ with the residual standard deviation $\tilde{\sigma} = \sqrt{\hat{\sigma}^2}$ and using the independence of $\hat{\beta}$ from $\hat{\sigma}^2$, we obtain the pivot
\begin{align*}
\frac{x^{*\top}(\hat{\beta} - \beta)}{\tilde{\sigma}\,\tau} &\sim t_{n-p}.
\end{align*}
[definition: Confidence Interval for the Expected Response]
Let $t_{n-p}(\alpha/2)$ denote the upper $\alpha/2$ critical value of the $t_{n-p}$ distribution. A $100(1-\alpha)\%$ **confidence interval for the expected response** $x^{*\top}\beta$ has endpoints
\begin{align*}
x^{*\top}\hat{\beta} \;\pm\; \tilde{\sigma}\,\tau\, t_{n-p}\!\left(\tfrac{\alpha}{2}\right),
\end{align*}
where $\tau^2 = x^{*\top}(X^\top X)^{-1}x^*$.
[/definition]
The width of this interval is controlled by $\tau$, which measures how far $x^*$ lies from the centre of the design. Points near the centroid of the observed $X$ values have small $\tau$ and therefore tighter intervals; extrapolating far outside the design inflates $\tau$ and produces wider intervals.
[example: CI for Expected Running Time]
Continuing the running-time example from the previous section. We have $n = 24$ men and the simple linear regression $Y = a' + b(x - \bar{x}) + \varepsilon$, fitted in centered form so that $(X^\top X)^{-1}$ is diagonal.
Suppose we wish to estimate the expected time to run 2 miles for a man with oxygen take-up $x^* = 50$. The centered explanatory variable is $x^* - \bar{x} = 50 - 48.6 = 1.4$, so $x^{*\top} = (1,\; 1.4)$.
The estimated expected response is
\begin{align*}
x^{*\top}\hat{\beta} &= \hat{a}' + 1.4\,\hat{b} = 826.5 - 1.4 \times 12.9 = 808.5 \text{ seconds}.
\end{align*}
For the centered simple regression, $(X^\top X)^{-1}$ is diagonal with entries $1/n$ and $1/S_{xx}$, where $S_{xx} = \sum_i (x_i - \bar{x})^2 = 783.5$. Therefore
\begin{align*}
\tau^2 &= \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} = \frac{1}{24} + \frac{1.4^2}{783.5} = 0.0417 + 0.0025 = 0.044,
\end{align*}
so $\tau = 0.21$. With $\tilde{\sigma} = 55.6$ seconds and $t_{22}(0.025) = 2.07$, the $95\%$ confidence interval for $\mathbb{E}[Y \mid x^* = 50]$ is
\begin{align*}
808.5 \;\pm\; 55.6 \times 0.21 \times 2.07 &= 808.5 \pm 24.2 \;=\; (784.3,\; 832.7) \text{ seconds}.
\end{align*}
This interval concerns the average time we would see if we could observe infinitely many men all with oxygen take-up of exactly 50. It says nothing directly about any single future observation.
[/example]
### The Prediction Interval for a New Observation
Now we ask: what will the actual response $Y^*$ be for a new individual with explanatory variable $x^*$? This new individual generates a fresh error term $\varepsilon^* \sim N(0, \sigma^2)$, independent of $Y = (Y_1, \ldots, Y_n)^\top$.
[definition: Prediction Interval]
A $100(1-\alpha)\%$ **prediction interval** for $Y^*$ is a random interval $I(Y)$, depending only on the observed data, such that
\begin{align*}
\mathbb{P}(Y^* \in I(Y)) &= 1 - \alpha,
\end{align*}
where the probability is taken over the joint distribution of $(Y^*, Y_1, \ldots, Y_n)$.
[/definition]
The key distinction from a confidence interval is that $Y^*$ is a random variable, not a parameter. The interval must account for the variability of $Y^*$ itself, not just our uncertainty about $\beta$.
To derive the prediction interval, consider the prediction error $\hat{Y}^* - Y^*$, where $\hat{Y}^* = x^{*\top}\hat{\beta}$ is our point estimate.
[quotetheorem:1447]
[citeproof:1447]
The variance $\sigma^2(\tau^2 + 1)$ has a transparent decomposition: the term $\sigma^2\tau^2$ captures uncertainty in the estimated regression surface at $x^*$, and the term $\sigma^2$ captures the irreducible scatter of any individual response about that surface. Even if we knew $\beta$ exactly (so $\tau^2 = 0$), the prediction interval would still have half-width $\tilde{\sigma}\,t_{n-p}(\alpha/2)$ because of the individual-level noise $\varepsilon^*$. This is why prediction intervals cannot be made arbitrarily narrow by collecting more data.
[definition: Prediction Interval Formula]
A $100(1-\alpha)\%$ prediction interval for $Y^*$ has endpoints
\begin{align*}
x^{*\top}\hat{\beta} \;\pm\; \tilde{\sigma}\sqrt{\tau^2 + 1}\; t_{n-p}\!\left(\tfrac{\alpha}{2}\right).
\end{align*}
This is always strictly wider than the corresponding confidence interval for $\mathbb{E}[Y^* \mid x^*]$, since $\sqrt{\tau^2 + 1} > \tau$.
[/definition]
The prediction formula differs from the confidence formula only in replacing $\tau$ by $\sqrt{\tau^2 + 1}$: the extra $+1$ under the square root is the contribution of the individual error $\varepsilon^*$, which cannot be reduced by collecting more data. The following example shows concretely how dominant this term typically is relative to estimation uncertainty.
[example: Prediction Interval for Running Time]
Continuing from the running-time example. For a new man with oxygen take-up $x^* = 50$, we have $\hat{Y}^* = 808.5$, $\tau = 0.21$, $\tilde{\sigma} = 55.6$, and $t_{22}(0.025) = 2.07$.
The $95\%$ prediction interval is
\begin{align*}
808.5 \;\pm\; 55.6 \times \sqrt{0.21^2 + 1} \times 2.07 &= 808.5 \;\pm\; 55.6 \times 1.022 \times 2.07 \\
&= 808.5 \pm 117.6 \;=\; (690.9,\; 926.1) \text{ seconds}.
\end{align*}
Compare this to the $95\%$ confidence interval $(784.3, 832.7)$ for the expected response. The prediction interval is roughly five times as wide. The three sources of uncertainty are: not knowing $\sigma$ (estimated by $\tilde{\sigma}$), not knowing the regression coefficients exactly (contributing $\sigma^2\tau^2$), and the irreducible individual variation $\varepsilon^*$ (contributing $\sigma^2$). The dominant term here is the last one, which is why even large samples do not collapse the prediction interval.
[/example]
The running-time example had a single continuous covariate and a two-parameter model. The next example has a group-indicator structure, but the formula $\tau^2 = x^{*\top}(X^\top X)^{-1}x^*$ applies identically. It also illustrates a practically important phenomenon: pooling information across groups can substantially tighten intervals compared to using a single group in isolation.
[example: Wafer Resistivity]
In the wafer example, we model resistivity across five instruments. Suppose we wish to estimate the expected resistivity of a new wafer measured on instrument 1. The design matrix uses indicator variables, so $x^{*\top} = (1, 0, 0, 0, 0)$.
The estimated expected response is
\begin{align*}
x^{*\top}\hat{\mu} &= \hat{\mu}_1 = \bar{y}_1 = 124.3.
\end{align*}
With $n_1 = 5$ observations on instrument 1 and the balanced one-way structure, one can verify that
\begin{align*}
\tau^2 &= x^{*\top}(X^\top X)^{-1}x^* = \frac{1}{5},
\end{align*}
so $\tau = 1/\sqrt{5}$. Here $\tilde{\sigma} = 10.4$ and $t_{20}(0.025) = 2.09$ (since $p = 5$ parameters and $n = 25$ observations give $n - p = 20$ degrees of freedom).
The $95\%$ confidence interval for $\mathbb{E}[Y_1^*]$ is
\begin{align*}
124.3 \;\pm\; \frac{10.4}{\sqrt{5}} \times 2.09 &= 124.3 \pm 9.7 \;=\; (114.6,\; 134.0).
\end{align*}
An important feature of this interval is that the estimate $\tilde{\sigma} = 10.4$ is computed from all five instruments, pooling information across the whole dataset. If instead we used only the first instrument's five observations, the estimate would be $\tilde{\sigma}_1 = 8.74$, giving the one-sample interval
\begin{align*}
124.3 \;\pm\; \frac{8.74}{\sqrt{5}} \times t_4(0.025) &= 124.3 \pm 3.91 \times 2.78 \;=\; (113.5,\; 135.1).
\end{align*}
This is slightly wider, because the single-instrument estimate $\tilde{\sigma}_1$ is based on only 4 degrees of freedom compared to the pooled estimate's 20. In general, pooling yields substantially tighter intervals; this example is unusual in that instrument 1's data happens to be relatively tight compared to the others.
The $95\%$ prediction interval for a new observation $Y_1^*$ from instrument 1 is
\begin{align*}
124.3 \;\pm\; 10.4 \times \sqrt{\tfrac{1}{5} + 1} \times 2.09 &= 124.3 \pm 10.4 \times 1.095 \times 2.09 \;=\; (100.5,\; 148.1).
\end{align*}
The prediction interval is roughly twice as wide as the confidence interval, reflecting the additional variability of a single new measurement around the group mean.
[/example]
Both examples above share the same qualitative feature: the confidence interval narrows as more data are collected, whereas the prediction interval converges to a fixed non-zero width determined by $\sigma$. The remark below states this contrast precisely in asymptotic terms.
[remark: Confidence vs Prediction]
It is a common error to confuse confidence intervals for the expected response with prediction intervals for a new observation. A confidence interval narrows to zero width as $n \to \infty$ (we learn $\beta$ exactly), while a prediction interval converges to the interval $x^{*\top}\beta \pm z_{\alpha/2}\,\sigma$ of half-width determined by $\sigma$ alone, which cannot be reduced by more data. Confidence intervals are statements about parameters; prediction intervals are statements about future random variables.
[/remark]
With confidence and prediction intervals established for individual components of $\beta$ and for linear combinations $x^{*\top}\beta$, we are ready to address questions involving *multiple* linear constraints simultaneously. For instance, we might ask whether an entire subset of predictors can be dropped from the model, or whether two groups of coefficients satisfy some joint linear relation. The next section develops the $F$-test, which provides a unified framework for testing general linear hypotheses of the form $H_0: C\beta = d$ where $C$ is a matrix of constraints.
## Hypothesis Testing and ANOVA
The previous sections equipped us with individual t-tests: given a coefficient $\hat{\beta}_j$, we can ask whether it is consistent with zero by forming a ratio against its standard error. But in practice the question is rarely so narrow. We might instead ask whether a whole collection of variables collectively contributes to explaining the response — for instance, whether any of $x_2, x_3, \ldots, x_k$ matter once $x_1$ is included. A single t-test cannot answer this; we need a procedure for testing multiple constraints on $\beta$ simultaneously. The F-test, derived from comparing the residual sums of squares of a full model and a restricted model, provides exactly this.
### An Independence Lemma for Quadratic Forms
Before deriving the F-statistic, we need a geometric fact about independence of quadratic forms in normal vectors. Though it may appear unmotivated on first encounter, it is the engine behind the exact null distribution of the F-statistic.
[quotetheorem:1448]
[citeproof:1448]
[remark: Geometric Meaning]
The condition $A_1 A_2 = 0$ says the two projection matrices project onto orthogonal subspaces. The lemma reflects a fundamental feature of the multivariate normal distribution: components in orthogonal directions are independent. Each quadratic form $Z^\top A_i Z$ measures the squared length of the projection of $Z$ onto the column space of $A_i$, and orthogonal projections are independent.
[/remark]
### F-Tests for Nested Linear Models
We now set up the general testing framework. Partition the design matrix and coefficient vector as
\begin{align*}
X &= \bigl(X_0 \;\; X_1\bigr), \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix},
\end{align*}
where $X_0$ is $n \times p_0$, $X_1$ is $n \times (p - p_0)$, $\operatorname{rank}(X) = p$, and $\operatorname{rank}(X_0) = p_0$. We wish to test
\begin{align*}
H_0: \beta_1 = \mathbf{0} \quad \text{against} \quad H_1: \beta_1 \neq \mathbf{0}.
\end{align*}
The model under $H_0$ is $Y = X_0 \beta_0 + \varepsilon$, a strictly smaller (nested) model. We write $\hat{\hat{\cdot}}$ for estimators under $H_0$ and $\hat{\cdot}$ for estimators under the full model.
The MLEs under $H_0$ are $\hat{\hat{\beta}}_0 = (X_0^\top X_0)^{-1} X_0^\top Y$ and $\hat{\hat{\sigma}}^2 = \mathrm{RSS}_0 / n$, where
\begin{align*}
\mathrm{RSS}_0 &= (Y - X_0 \hat{\hat{\beta}}_0)^\top (Y - X_0 \hat{\hat{\beta}}_0) = Y^\top (I_n - P_0) Y,
\end{align*}
and $P_0 = X_0 (X_0^\top X_0)^{-1} X_0^\top$ is the orthogonal projection onto the column space of $X_0$. The full-model RSS satisfies $\mathrm{RSS} = Y^\top (I_n - P) Y$, where $P = X(X^\top X)^{-1} X^\top$.
The generalised likelihood ratio statistic simplifies to
\begin{align*}
\Lambda_{Y}(H_0, H_1) &= \left(\frac{\hat{\hat{\sigma}}^2}{\hat{\sigma}^2}\right)^{n/2} = \left(\frac{\mathrm{RSS}_0}{\mathrm{RSS}}\right)^{n/2} = \left(1 + \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}\right)^{n/2}.
\end{align*}
We reject $H_0$ when $\Lambda$ is large, equivalently when the ratio $(\mathrm{RSS}_0 - \mathrm{RSS}) / \mathrm{RSS}$ is large. The asymptotic approximation gives $2 \log \Lambda \approx \chi^2_{p - p_0}$, but the linear model allows an exact distribution.
[quotetheorem:1449]
[citeproof:1449]
The ANOVA table organises the decomposition of $\mathrm{RSS}_0$ into its two components:
| Source of variation | d.f. | Sum of squares | Mean square | F statistic |
|---|---|---|---|---|
| Fitted model ($\beta_1$) | $p - p_0$ | $\mathrm{RSS}_0 - \mathrm{RSS}$ | $\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{p - p_0}$ | $\frac{(\mathrm{RSS}_0 - \mathrm{RSS})/(p - p_0)}{\mathrm{RSS}/(n-p)}$ |
| Residual | $n - p$ | $\mathrm{RSS}$ | $\frac{\mathrm{RSS}}{n - p}$ | |
| Total | $n - p_0$ | $\mathrm{RSS}_0$ | | |
The ratio $R^2 = (\mathrm{RSS}_0 - \mathrm{RSS}) / \mathrm{RSS}_0$ is the **proportion of variance explained** by the additional parameters $\beta_1$. It lies in $[0, 1]$, with larger values indicating that fitting $\beta_1$ substantially reduces the unexplained variability.
### Geometric Interpretation
The F-test has a clean geometric reading. Recall that $\hat{Y} = PY$ and $\hat{\hat{Y}} = P_0 Y$ are the projections of $Y$ onto the column spaces $\mathcal{C}(X)$ and $\mathcal{C}(X_0)$ respectively, with $\mathcal{C}(X_0) \subseteq \mathcal{C}(X)$.
The three terms in the decomposition $\mathrm{RSS}_0 = (\mathrm{RSS}_0 - \mathrm{RSS}) + \mathrm{RSS}$ correspond to three orthogonal components of $Y$:
\begin{align*}
Y &= \hat{\hat{Y}} + (\hat{Y} - \hat{\hat{Y}}) + (Y - \hat{Y}).
\end{align*}
The residual $Y - \hat{Y}$ lives in the orthogonal complement of $\mathcal{C}(X)$; the increment $\hat{Y} - \hat{\hat{Y}} = (P - P_0)Y$ lives in the orthogonal complement of $\mathcal{C}(X_0)$ within $\mathcal{C}(X)$; and $\hat{\hat{Y}}$ lies in $\mathcal{C}(X_0)$. The F-statistic compares the squared length of $(P - P_0)Y$, normalised by its $p - p_0$ degrees of freedom, to the squared length of the full residual, normalised by its $n - p$ degrees of freedom. If $H_0$ is false, the projection $(P - P_0)Y$ will be systematically inflated by the signal $X_1 \beta_1$, making $F$ large.
### Simple Linear Regression as a Special Case
To ground the general framework, consider simple linear regression
\begin{align*}
Y_i &= a' + b(x_i - \bar{x}) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),
\end{align*}
and the test $H_0: b = 0$. Under $H_0$ the model reduces to $Y_i \sim N(a', \sigma^2)$, so $\hat{\hat{a}}' = \bar{Y}$ and
\begin{align*}
\mathrm{RSS}_0 &= \sum_i (Y_i - \bar{Y})^2 = S_{YY}, \\
\mathrm{RSS}_0 - \mathrm{RSS} &= \hat{b}^2 S_{xx},
\end{align*}
where $S_{xx} = \sum_i (x_i - \bar{x})^2$. The ANOVA table takes the form:
| Source | d.f. | SS | MS | F |
|---|---|---|---|---|
| Fitted model | $1$ | $\hat{b}^2 S_{xx}$ | $\hat{b}^2 S_{xx}$ | $\hat{b}^2 S_{xx} / \tilde{\sigma}^2$ |
| Residual | $n - 2$ | $\mathrm{RSS}$ | $\tilde{\sigma}^2$ | |
| Total | $n - 1$ | $S_{YY}$ | | |
The proportion of variance explained is $R^2 = \hat{b}^2 S_{xx} / S_{YY} = S_{XY}^2 / (S_{xx} S_{YY}) = r^2$, the square of the Pearson correlation coefficient $r = S_{XY} / \sqrt{S_{xx} S_{YY}}$.
[remark: Equivalence of t-test and F-test in Simple Regression]
We showed in an earlier section that under $H_0: b = 0$, the t-statistic $t = \hat{b} / \operatorname{SE}(\hat{b}) = \hat{b}\sqrt{S_{xx}} / \tilde{\sigma}$ follows a $t_{n-2}$ distribution. Observing that $F = t^2$ and that a $t_{n-2}^2$ variable is $F_{1, n-2}$ confirms that the F-test and the two-sided t-test are completely equivalent in this case. The F framework generalises to multiple simultaneous constraints, which the t-test cannot handle.
[/remark]
### One-Way Analysis of Variance
The most classical application of the F-test is one-way ANOVA, which tests whether the means of $I$ groups are equal given $J$ observations per group.
[definition: One-Way ANOVA Model]
Suppose $J$ measurements are taken in each of $I$ groups, with
\begin{align*}
Y_{ij} &= \mu_i + \varepsilon_{ij}, \qquad i = 1, \ldots, I, \quad j = 1, \ldots, J,
\end{align*}
where $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$ and $\mu_i \in \mathbb{R}$ are unknown group means. The null hypothesis of no group effect is $H_0: \mu_1 = \mu_2 = \cdots = \mu_I$.
[/definition]
This is a special case of the nested model framework. The full model has $p = I$ parameters (the group means $\mu_1, \ldots, \mu_I$); the null model has $p_0 = 1$ (a common mean $\mu$). The total number of observations is $n = IJ$.
Under the full model, $\hat{\mu}_i = \bar{Y}_{i\cdot}$, the group sample mean, so
\begin{align*}
\mathrm{RSS} &= \sum_{i=1}^I \sum_{j=1}^J (Y_{ij} - \bar{Y}_{i\cdot})^2 \qquad \text{on } n - I \text{ degrees of freedom.}
\end{align*}
Under $H_0$, $\hat{\mu} = \bar{Y}_{\cdot\cdot}$, the grand mean, so
\begin{align*}
\mathrm{RSS}_0 &= \sum_{i=1}^I \sum_{j=1}^J (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 \qquad \text{on } n - 1 \text{ degrees of freedom.}
\end{align*}
The key calculation is the between-groups sum of squares:
\begin{align*}
\mathrm{RSS}_0 - \mathrm{RSS} &= \sum_{i=1}^I \sum_{j=1}^J \bigl[(Y_{ij} - \bar{Y}_{\cdot\cdot})^2 - (Y_{ij} - \bar{Y}_{i\cdot})^2\bigr] = J \sum_{i=1}^I (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2,
\end{align*}
on $I - 1$ degrees of freedom. This is the sum of squared deviations of group means from the grand mean, scaled by the group size $J$. It is large precisely when the group means differ substantially from one another.
[explanation: The ANOVA Decomposition]
The identity $\mathrm{RSS}_0 = (\mathrm{RSS}_0 - \mathrm{RSS}) + \mathrm{RSS}$ is the classical decomposition
\begin{align*}
\underbrace{\sum_{i,j}(Y_{ij} - \bar{Y}_{\cdot\cdot})^2}_{\text{Total SS}} &= \underbrace{J\sum_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}_{\text{Between-groups SS}} + \underbrace{\sum_{i,j}(Y_{ij} - \bar{Y}_{i\cdot})^2}_{\text{Within-groups SS}}.
\end{align*}
The within-groups SS (also called the error SS) measures variability that cannot be attributed to group differences — it would be present even if all group means were equal. The between-groups SS measures the variability in the group means; under $H_0$ this should be of the same order as the within-groups variability, while under $H_1$ it is inflated by the true differences among the $\mu_i$.
[/explanation]
The F-statistic and ANOVA table are:
| Source | d.f. | Sum of squares | Mean square | F statistic |
|---|---|---|---|---|
| Between groups | $I - 1$ | $J\sum_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ | $\frac{J\sum_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{I - 1}$ | $\frac{J\sum_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 / (I-1)}{\tilde{\sigma}^2}$ |
| Within groups | $n - I$ | $\sum_{i,j}(Y_{ij} - \bar{Y}_{i\cdot})^2$ | $\tilde{\sigma}^2 = \frac{\mathrm{RSS}}{n - I}$ | |
| Total | $n - 1$ | $\sum_{i,j}(Y_{ij} - \bar{Y}_{\cdot\cdot})^2$ | | |
We reject $H_0$ at level $\alpha$ when $F > F_{I-1,\, n-I}(\alpha)$. Under $H_0$, both the numerator and denominator of $F$ are unbiased estimates of $\sigma^2$, so $F$ should be close to $1$ when no group effect is present. Genuine differences among the group means inflate the between-groups mean square while leaving the within-groups mean square unaffected, pushing $F$ well above $1$. The following example shows a dataset where the observed $F$ far exceeds the 5% critical value.
[example: Wafer Measurements]
Suppose $I = 3$ groups (say, three production batches) each with $J = 5$ measurements of a wafer thickness. Let the group means be $\bar{y}_{1\cdot} = 101$, $\bar{y}_{2\cdot} = 98$, $\bar{y}_{3\cdot} = 103$, with grand mean $\bar{y}_{\cdot\cdot} = 100.67$ and estimated within-group variance $\tilde{\sigma}^2 = 4.2$.
The between-groups SS is
\begin{align*}
J \sum_{i=1}^3 (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 &= 5\bigl[(101 - 100.67)^2 + (98 - 100.67)^2 + (103 - 100.67)^2\bigr] \\
&= 5[0.109 + 7.129 + 5.429] = 5 \times 12.667 = 63.33.
\end{align*}
The F-statistic is
\begin{align*}
F &= \frac{63.33 / 2}{4.2} = \frac{31.67}{4.2} \approx 7.54,
\end{align*}
on $2$ and $12$ degrees of freedom. Since $F_{2, 12}(0.05) \approx 3.89$, we have $F > F_{2,12}(0.05)$ and reject the hypothesis that the three batches have equal mean thickness at the 5% level.
[/example]
The calculation follows the standard template: compute the between-groups and within-groups sums of squares, divide each by its degrees of freedom, and form the ratio. The remark below situates one-way ANOVA within the general nested-model framework, explaining how the between-groups formula $J\sum_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ emerges from the general expression $Y^\top(P - P_0)Y$.
[remark: Connection to the General Framework]
One-way ANOVA is the F-test for $H_0: \beta_1 = \mathbf{0}$ with $p_0 = 1$ (intercept only) and $p = I$ (one indicator per group). The design matrix $X_0$ is a column of ones; $X$ contains the group indicator columns. The formula $J \sum_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ for the between-groups SS is exactly $(P - P_0)Y$ written in terms of group means.
[/remark]
### Looking Ahead
The linear model as developed in this chapter — from distributional results and t-tests through to F-tests and ANOVA — forms the theoretical core of much of applied statistics. Yet its assumptions are strong: responses are assumed normally distributed, the mean is constrained to be a linear function of the covariates, and the variance is constant across observations. Each of these restrictions can be relaxed in a principled way. **Generalised linear models** (GLMs) replace the normal assumption with any exponential family distribution and link the linear predictor to the mean through a link function, covering binary outcomes (logistic regression) and count data (Poisson regression) within a unified framework. **Mixed effects models** extend the fixed-parameter setup by allowing some coefficients to vary randomly across groups, capturing clustering and repeated measurements. **Nonparametric and semiparametric regression** — including splines, kernel regression, and generalised additive models — relax the linearity assumption while retaining interpretability. A natural next step is Part II Applied Statistics, which develops GLMs and their diagnostics in detail, or the Part III course on Modern Statistical Methods, which treats high-dimensional linear models and regularisation. The geometry and distributional theory of this chapter, particularly the projection-based derivations and the chi-squared/F calculus, remain indispensable tools throughout all of these extensions.
## References
- D. Spiegelhalter, *Statistics*, Cambridge IB Lent 2015 (lecture notes — source material).
Contents
- Introduction
- The Parametric Framework
- Simple Random Samples
- The Three Questions of Statistical Inference
- What This Course Covers
- Why Parametric Inference?
- The Role of Probability Theory
- 1. Estimation
- Estimators
- The estimator–estimate distinction
- Bias
- Mean Squared Error
- The Decomposition
- The Bias-Variance Trade-Off in Practice
- When Unbiasedness Forces Nonsense
- Looking Forward
- Sufficiency
- The Factorisation Criterion
- Minimal Sufficiency
- The Rao–Blackwell Theorem
- Likelihood
- The Likelihood Function and Log-Likelihood
- Worked Examples
- Connection to Sufficient Statistics
- Invariance of the MLE
- From Points to Intervals
- Confidence Intervals
- The Definition and its Interpretation
- The Pivotal Quantity Method
- Approximate Confidence Intervals via Asymptotic Normality
- Bayesian Estimation
- Prior and Posterior Distributions
- Conjugate Prior Families
- Beta–Binomial Conjugacy
- Gamma–Poisson Conjugacy
- Normal–Normal Conjugacy
- Bayes Estimators and Loss Functions
- Credible Intervals
- Looking Ahead
- 2. Hypothesis Testing
- Framework
- Simple Hypotheses
- The Likelihood Ratio Test
- Neyman-Pearson Lemma
- The Normal Example: The $z$-Test
- The $p$-Value
- Looking Ahead: Composite Hypotheses
- Composite Hypotheses
- The Power Function and UMP Tests
- The Generalised Likelihood Ratio for Composite Hypotheses
- Wilks' Theorem
- Summary and Forward Look
- Tests of Goodness-of-Fit and Independence
- The Multinomial Model
- Goodness-of-Fit to a Fully Specified Distribution
- Pearson's Chi-Squared Test
- Testing Independence in Contingency Tables
- Tests of Homogeneity and Confidence Intervals
- Tests of Homogeneity
- The Duality Between Confidence Intervals and Hypothesis Tests
- Looking Ahead: Multivariate Normal Theory
- Multivariate Normal Theory
- Why This Machinery Is Needed
- Random Vectors and Covariance Matrices
- The Multivariate Normal Distribution
- Independence of the Sample Mean and Sample Variance
- The Sum-of-Squares Decomposition
- Setting Up the t-Distribution
- Student's t-Distribution
- The t-Distribution
- The t-Test for an Unknown Mean
- Testing and Confidence Intervals
- One-Sided t-Tests
- 3. Linear Models
- Linear Models
- The Model
- Least Squares and the Normal Equations
- The Geometric Interpretation
- The Gauss–Markov Theorem
- Simple Linear Regression
- The Model and a Useful Reparametrisation
- Deriving the Least Squares Estimators
- The Fitted Line Passes Through the Centroid
- Connection to the Correlation Coefficient
- Variances of the Estimators
- Residuals and the Decomposition of Variation
- Worked Example
- Towards Normal Assumptions
- Linear Models with Normal Assumptions
- Motivating Example
- The Normal Linear Model
- The MLE Under Normality
- Key Lemma: Quadratic Forms in Normal Vectors
- Distribution of $\hat\beta$
- Residual Sum of Squares
- Independence of $\hat\beta$ and the Residuals
- Summary and Implications
- The F-Distribution
- Basic Properties
- Critical Values and Table Lookup
- Inference for $\beta$
- The t-Statistic for a Single Coefficient
- Confidence Intervals
- Significance Testing for Individual Predictors
- Inference in Simple Linear Regression
- Sampling Distributions of $\hat{a}'$ and $\hat{b}$
- t-Statistics and Hypothesis Tests
- Confidence Intervals for $a'$ and $b$
- Expected Response at $x^*$
- The Confidence Interval for the Expected Response
- The Prediction Interval for a New Observation
- Hypothesis Testing and ANOVA
- An Independence Lemma for Quadratic Forms
- F-Tests for Nested Linear Models
- Geometric Interpretation
- Simple Linear Regression as a Special Case
- One-Way Analysis of Variance
- Looking Ahead
- References
Cambridge IB Statistics
Content
Problems
History
Created by admin on 4/24/2026 | Last updated on 4/24/2026
Prerequisites
No prerequisites required for this page.
Rate this page
★
★
★
★
★
Poor
Excellent